Whisper (speech recognition system)
Original author(s): OpenAI[1]
Initial release: September 21, 2022
Repository: https://github.com/openai/whisper

Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022.[2]

It is capable of transcribing speech in English and several other languages,[3] and can also translate several non-English languages into English. OpenAI claims that the diverse training data used in its development has led to improved recognition of accents, background noise, and jargon compared to previous approaches.[4]

Whisper is a weakly supervised deep learning acoustic model, built on an encoder-decoder transformer architecture.[5]
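
Because the code and model weights are released as open-source software in the repository above, the model can be run locally through its Python package. The following is a minimal sketch of transcription and to-English translation, assuming the openai-whisper package has been installed (e.g. with pip install -U openai-whisper) and that ffmpeg is available for audio decoding; "audio.mp3" is a placeholder file name.

    import whisper

    # Load one of the pretrained checkpoints (e.g. "tiny", "base", "small", ...)
    model = whisper.load_model("base")

    # Transcribe speech in its original language
    result = model.transcribe("audio.mp3")  # placeholder path
    print(result["text"])

    # Translate non-English speech into English
    translated = model.transcribe("audio.mp3", task="translate")
    print(translated["text"])

Here transcribe() returns a dictionary whose "text" field holds the full transcript, alongside per-segment details and the detected language.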

Background

Speech recognition has a long history in research; early approaches used statistical methods such as dynamic time warping, and later hidden Markov models. Around the 2010s, deep neural network approaches became more common for speech recognition models, enabled by big data and increased computational performance.[6] Early deep learning approaches to speech recognition included convolutional neural networks, which were limited by their inability to capture sequential data; this limitation led to the development of Seq2seq approaches, including recurrent neural networks that made use of long short-term memory.[7]
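
For illustration, dynamic time warping aligns two utterances spoken at different rates by running a dynamic program over frame-to-frame distances. The following is a minimal illustrative sketch, not any specific historical system; it uses scalar features for brevity, whereas systems of that era aligned sequences of multi-dimensional acoustic feature vectors.

    import numpy as np

    def dtw_distance(x, y):
        # cost[i][j]: best alignment cost of x[:i] against y[:j]
        n, m = len(x), len(y)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(x[i - 1] - y[j - 1])  # local frame distance
                cost[i, j] = d + min(cost[i - 1, j],      # skip a frame of x
                                     cost[i, j - 1],      # skip a frame of y
                                     cost[i - 1, j - 1])  # match frames
        return cost[n, m]

    # The "same word" spoken slowly and quickly still aligns with low cost
    slow = np.array([0, 1, 2, 3, 3, 2, 1, 0], dtype=float)
    fast = np.array([0, 2, 3, 2, 0], dtype=float)
    print(dtw_distance(slow, fast))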

Transformers, introduced in 2017 by Google, displaced many prior state-of-the-art approaches and became the core neural architecture in fields such as language modeling and computer vision;[8] weakly supervised approaches to training acoustic models were recognized in the early 2020s as promising for speech recognition using deep neural networks.[9]

Training and capabilities

Whisper was trained using weakly supervised learning on 680,000 hours of multilingual and multitask data, of which about one-sixth (117,000 hours) was non-English audio. Whisper does not outperform models that specialize in the LibriSpeech dataset, but when tested across many datasets it is more robust and makes 50% fewer errors than such specialized models.[10]

Whisper's error rate varies by language: the word error rate is higher for languages that are not well represented in the training data.[11]

The model has been used as the base for a unified model for speech recognition and more general sound recognition.[12]

Architecture

The Whisper architecture is based on an encoder-decoder transformer. Input audio is split into 30-second chunks, which are converted into a Mel-frequency cepstrum and passed to an encoder. A decoder is trained to predict the corresponding text caption, with special tokens directing the model to perform tasks such as emitting phrase-level timestamps.[10]
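
This pipeline is also exposed by the lower-level API of the open-source implementation. The following sketch follows the repository's documented example, under the same openai-whisper package assumption as above; "audio.mp3" is again a placeholder.

    import whisper

    model = whisper.load_model("base")

    # Load audio and pad/trim it to exactly 30 seconds, as the encoder expects
    audio = whisper.load_audio("audio.mp3")
    audio = whisper.pad_or_trim(audio)

    # Compute the log-Mel spectrogram that is fed to the encoder
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # Detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # Decode: the decoder predicts the text tokens for the 30-second window
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)
    print(result.text)

The pad_or_trim() and log_mel_spectrogram() steps correspond directly to the 30-second chunking and Mel-frequency conversion described above.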

See also

References

  1. ^ Radford, Alec; Kim, Jong Wook; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022-12-06). "Robust Speech Recognition via Large-Scale Weak Supervision". arXiv:2212.04356 [eess.AS].
  2. ^ Golla, Ramsri Goutham (2023-03-06). "Here Are Six Practical Use Cases for the New Whisper API". Slator. Archived from the original on 2023-03-25. Retrieved 2023-08-12.
  3. ^ Dickson, Ben (2022-10-03). "How will OpenAI's Whisper model impact AI applications?". VentureBeat. Archived from the original on 2023-03-15. Retrieved 2023-08-12.
  4. ^ Wiggers, Kyle (September 21, 2022). "OpenAI open-sources Whisper, a multilingual speech recognition system". TechCrunch. Archived from the original on February 12, 2023. Retrieved February 12, 2023.
  5. ^ Radford, Alec; Kim, Jong Wook; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022-12-06). "Robust Speech Recognition via Large-Scale Weak Supervision". p. 3. arXiv:2212.04356 [eess.AS].
  6. ^ Yu, Dong; Deng, Li (2014). Automatic speech recognition: a deep learning approach. Signals and communication technology (2015 ed.). London; Heidelberg: Springer. p. 9. ISBN 978-1-4471-5778-6.
  7. ^ Latif, Siddique; Zaidi, Aun; Cuayahuitl, Heriberto; Shamshad, Fahad; Shoukat, Moazzam; Qadir, Junaid. "Transformers in Speech Processing: A Survey". arXiv:2303.11607v1.
  8. ^ Kamath, Uday; Graham, Kenneth L.; Emara, Wael (2022). Transformers for machine learning: a deep dive. Chapman & Hall/CRC machine learning & pattern recognition (First ed.). Boca Raton; London; New York: CRC Press, Taylor & Francis Group. p. xix. ISBN 978-0-367-76734-1.
  9. ^ Paaß, Gerhard; Giesselbach, Sven (2023-02-16). "Foundation Models for Speech, Images, Videos, and Control". Foundation Models for Natural Language Processing. Artificial Intelligence: Foundations, Theory, and Algorithms. p. 307. arXiv:2302.08575. doi:10.1007/978-3-031-23190-2_7. ISBN 978-3-031-23189-6. S2CID 257019816.
  10. ^ a b "Introducing Whisper". openai.com. 2022-09-21. Archived from the original on 2023-08-20. Retrieved 2023-08-21.
  11. ^ Wiggers, Kyle (2023-03-01). "OpenAI debuts Whisper API for speech-to-text transcription and translation". TechCrunch. Archived from the original on 2023-07-18. Retrieved 2023-08-21.
  12. ^ Gong, Yuan; Khurana, Sameer; Karlinsky, Leonid; Glass, James. "Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers". arXiv:2307.03183.