Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Its developer, OpenAI, shows that using such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. The nine models are open-source and can be downloaded.

To my knowledge, Whisper is currently a very good (and probably the best) ASR engine, and it can be used as a foundation for building useful applications and for further research on robust speech processing. It was launched in September 2022 and has been received very positively.

Privacy

In many cases where AV recordings are made, privacy is an important issue. Especially when the recordings are "sensitive", the interviewer (or owner of the recordings) must handle them carefully. Whisper can easily run on a big, fast server, on your own small laptop, and on any device between these two. The recognition result will be the same. The only real difference is the processing speed: the better your computer (especially when it has a graphics card), the faster the recognition.
So, certainly for people who have a fast computer and who occasionally work with sensitive data, we always recommend installing Whisper on your own system as well, in order to avoid the risk of a data breach.

Is ASR ready?

Yes and no! At this moment (April 2024) there are still a few drawbacks to Whisper. In February 2024 the problem of diarization (knowing which speaker is speaking) was solved by WhisperX. In the same update, Whisper was combined with faster-whisper, which results in a 10x speed improvement (at least if you use a GPU).

However, Whisper still tends to hallucinate, although this happens relatively infrequently. Using the VAD (voice activity detection) filter in WhisperX helps to prevent this: it detects whether or not there is speech, and where there is none, no recognition is done. Another disadvantage of Whisper is that the transcription is sometimes too polished. For most people this is no problem at all, but in academic research you sometimes want to know whether people hesitate, what they started to say but broke off halfway through, and so on. Whisper usually turns the speech into a nice, grammatically correct sentence that reads well, which is not always what you want. For example: "I um I, I thought I'd to do that for a moment" is usually recognised by Whisper as "I thought I'd to do that for a moment".
This last effect is probably caused by the fact that Whisper uses a ChatGPT-like language model to "translate" the recognition into a smoothly flowing sentence. Again, this is excellent for transcribing most speech, but it may not be desirable for research on speech and/or dialogues in which hesitations, pauses, repetitions and other disfluencies are themselves the topic.
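To make the idea of a VAD filter concrete, here is a minimal sketch of speech gating based on signal energy. It is only an illustration: WhisperX relies on a trained VAD model, not on this simple threshold, and the function names, the frame length and the threshold value are all assumptions chosen for the example.

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame of `frame_len` samples."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def speech_regions(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return (start_sec, end_sec) spans whose frame energy reaches `threshold`.

    Hypothetical helper: an energy gate, not the VAD model used by WhisperX.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_rms(samples, frame_len)
    regions = []
    region_start = None
    for i, energy in enumerate(energies):
        t = i * frame_len / sample_rate
        if energy >= threshold and region_start is None:
            region_start = t                    # speech begins
        elif energy < threshold and region_start is not None:
            regions.append((region_start, t))   # speech ends
            region_start = None
    if region_start is not None:                # speech runs to the end
        regions.append((region_start, len(energies) * frame_len / sample_rate))
    return regions
```

Only the returned spans would be passed on to the recogniser; the silent stretches in between are skipped, which is where hallucinated text tends to appear.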

Setting up Whisper

Whisper was released as a Python script. After installing Python (version 3.9 - 3.10) you need to install PyTorch (1.10.1) and FFmpeg.
Once done, you can download and install (or update to) the latest release of Whisper with the following command:
pip install -U openai-whisper
For more information about this, see here.
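Once the package is installed, a transcription can be run from Python roughly as follows. This is a minimal sketch: whisper.load_model and model.transcribe are the calls documented for the openai-whisper package, while transcribe_file, format_segments and format_timestamp are hypothetical helper names introduced here, and "interview.mp3" is a placeholder file name.

```python
def format_timestamp(seconds):
    """Render a time in seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def format_segments(segments):
    """Turn Whisper-style segments (dicts with start, end, text) into readable lines."""
    return [
        f"[{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in segments
    ]


def transcribe_file(path, model_name="base", language=None):
    """Transcribe one audio file locally and return timestamped lines."""
    import whisper  # imported here so the helpers above work without the package

    model = whisper.load_model(model_name)   # downloads the model on first use
    result = model.transcribe(path, language=language)  # None = detect language
    return format_segments(result["segments"])


if __name__ == "__main__":
    # "interview.mp3" is a placeholder; replace it with your own recording.
    for line in transcribe_file("interview.mp3"):
        print(line)
```

Because everything runs on your own machine, the recording never leaves your system.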

Available models and languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.

Size     Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny     39 M         tiny.en              tiny                 ~1 GB           ~32x
base     74 M         base.en              base                 ~1 GB           ~16x
small    244 M        small.en             small                ~2 GB           ~6x
medium   769 M        medium.en            medium               ~5 GB           ~2x
large    1550 M       N/A                  large                ~10 GB          1x

The .en models for English-only applications tend to perform better, especially the tiny.en and base.en models. OpenAI observed that the difference becomes less significant for the small.en and medium.en models.
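As an illustration of how the table can be used, the sketch below picks the largest model that fits in a given amount of video memory. The figures are copied from the table above; the pick_model function and its behaviour (falling back to the multilingual large model when no English-only variant exists) are assumptions made for this example, not part of Whisper itself.

```python
# (size, approx. required VRAM in GB), ordered from smallest to largest model
MODELS = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]

# Sizes for which an English-only (.en) variant exists
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def pick_model(vram_gb, english_only=False):
    """Return the name of the largest model whose VRAM requirement fits `vram_gb`."""
    fitting = [size for size, required in MODELS if required <= vram_gb]
    if not fitting:
        raise ValueError(f"no model fits in {vram_gb} GB of VRAM")
    size = fitting[-1]
    if english_only and size in ENGLISH_ONLY_SIZES:
        return size + ".en"
    return size  # "large" has no .en variant; the multilingual model is used
```

For example, pick_model(4, english_only=True) returns "small.en", because the medium model needs about 5 GB.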

Performance

Whisper's performance varies widely depending on the language and the model. The figure below shows a WER (Word Error Rate) breakdown by language for the Fleurs dataset using the large-v2 model (the smaller the number, the better the performance). Additional WER scores for the other models and datasets can be found on the Whisper website. For more information, see here.
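For readers unfamiliar with WER: it is the number of word substitutions, deletions and insertions needed to turn the recognised text into the reference text, divided by the number of words in the reference. The sketch below computes it with a standard edit-distance calculation. It is a simplified illustration (words are split on whitespace, with no normalisation of case or punctuation), and the function name is an assumption; published WER scores are usually computed after text normalisation.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word deleted
                curr[j - 1] + 1,     # extra word inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    spoken = "I um I, I thought I'd to do that for a moment"
    recognised = "I thought I'd to do that for a moment"
    print(word_error_rate(spoken, recognised))  # 3 deleted words out of 12
```

Note that a transcript which drops hesitations is penalised under this measure only if the reference transcript contains them.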

License

Important: Whisper's code and model weights are released under the MIT License. See LICENSE for further details.