Whisper (speech recognition system)

Whisper's error rate varies across languages, with a higher [[word error rate]] for languages that are under-represented in the training data.<ref>{{Cite web |last=Wiggers |first=Kyle |date=2023-03-01 |title=OpenAI debuts Whisper API for speech-to-text transcription and translation |url=https://techcrunch.com/2023/03/01/openai-debuts-whisper-api-for-text-to-speech-transcription-and-translation/ |url-status=live |archive-url=https://web.archive.org/web/20230718040023/https://techcrunch.com/2023/03/01/openai-debuts-whisper-api-for-text-to-speech-transcription-and-translation/ |archive-date=2023-07-18 |access-date=2023-08-21 |website=TechCrunch |language=en-US}}</ref>
 
The model has been used as the base for a unified model for speech recognition and more general [[sound recognition]].<ref>{{Cite arXiv |arxiv=2307.03183 |first1=Gong |last1=Yuan |first2=Sameer |last2=Khurana |title=Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers |first3=Leonid |last3=Karlinsky |first4=James |last4=Glass}}</ref>
 
== Architecture ==
 
The Whisper architecture is based on an encoder-decoder transformer. Input audio is split into 30-second chunks, each converted into a [[Mel-frequency cepstrum]] representation that is passed to the encoder. The decoder is trained to predict the corresponding text caption, and special tokens direct the model to perform several tasks, such as producing phrase-level timestamps.<ref name="whisperoff">{{Cite web |date=2022-09-21 |title=Introducing Whisper |url=https://openai.com/research/whisper |url-status=live |archive-url=https://web.archive.org/web/20230820005801/https://openai.com/research/whisper |archive-date=2023-08-20 |access-date=2023-08-21 |website=openai.com |language=en-US}}</ref>
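The input pipeline described above can be sketched in Python. This is a minimal illustration, not the released library's API: the chunking helper is hypothetical, while the 16&nbsp;kHz sample rate, the fixed 30-second window, and the special-token names shown are those used by the released Whisper models.

```python
# Sketch of Whisper's fixed-length input windowing (illustrative, not the library API).
SAMPLE_RATE = 16_000                       # Whisper resamples all audio to 16 kHz
CHUNK_SECONDS = 30                         # fixed window length fed to the encoder
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def chunk_audio(samples):
    """Split raw samples into 30-second chunks, zero-padding the final chunk."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))  # pad to full window
        chunks.append(chunk)
    return chunks

# The decoder is conditioned on special tokens that select the task and language,
# e.g. (token strings as they appear in the released model's vocabulary):
prompt = ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]

# 45 seconds of silence yields two 30-second windows, the second zero-padded.
chunks = chunk_audio([0.0] * (45 * SAMPLE_RATE))
print(len(chunks), len(chunks[0]))  # 2 480000
```

In the actual model, each padded window is then converted to a Mel-scale spectrogram before entering the encoder; the sketch stops at the windowing step.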
 
== See also ==