Omicron detection with large language models and YouTube audio data

James T Anibal; Adam J Landa; Nguyen T T Hang; Miranda J Song; Alec K Peltekian; Ashley Shin; Hannah B Huth; Lindsey A Hazen; Anna S Christou; Jocelyne Rivera; Robert A Morhard; Ulas Bagci; Ming Li; Yael Bensoussan; David A Clifton; Bradford J Wood

doi:10.1101/2022.09.13.22279673

medRxiv [Preprint]. 2024 Mar 27:2022.09.13.22279673. doi: 10.1101/2022.09.13.22279673.

Authors

James T Anibal¹ ², Adam J Landa¹, Nguyen T T Hang³, Miranda J Song¹, Alec K Peltekian⁴, Ashley Shin⁵, Hannah B Huth¹, Lindsey A Hazen¹, Anna S Christou¹, Jocelyne Rivera¹, Robert A Morhard¹, Ulas Bagci⁶, Ming Li¹, Yael Bensoussan⁷, David A Clifton², Bradford J Wood¹

Affiliations

¹ Center for Interventional Oncology, Radiology and Imaging Sciences, NIH Clinical Center, National Cancer Institute, National Institute of Biomedical Imaging and Bioengineering, National Institutes of Health, 10 Center Dr, Building 10, Room 1C341, MSC 1182, Bethesda, MD 20892-1182 USA.
² Computational Health Informatics Lab, Oxford Institute of Biomedical Engineering, University of Oxford, Old Road Campus Research Building, Headington, Oxford OX3 7DQ, United Kingdom.
³ Oxford University Clinical Research Unit, Centre for Tropical Medicine, Ho Chi Minh City, Vietnam.
⁴ Department of Computer Science, McCormick School of Engineering, Northwestern University, Mudd Hall, 2233 Tech Drive, Third Floor, Evanston, IL, 60208 USA.
⁵ National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894.
⁶ Feinberg School of Medicine, Northwestern University, 420 E Superior St, Chicago, IL 60611 USA.
⁷ Morsani College of Medicine, University of South Florida, Tampa, FL, USA.

Abstract

Publicly available audio data presents a unique opportunity for the development of digital health technologies with large language models (LLMs). In this study, YouTube was mined to collect audio data from individuals with self-declared positive COVID-19 tests, as well as from those with other upper respiratory infections (URI) and from healthy subjects discussing a diverse range of topics. The resulting dataset was transcribed with the Whisper model and used to assess the capacity of LLMs for detecting self-reported COVID-19 cases and performing variant classification. Following prompt optimization, LLMs achieved accuracies of 0.89 and 0.97, respectively, in the tasks of identifying self-reported COVID-19 cases and other respiratory illnesses. The model also obtained a mean accuracy of 0.77 at identifying the variant of self-reported COVID-19 cases using only symptoms and other health-related factors described in the YouTube videos. In contrast to past studies, which used scripted, standardized voice samples to capture biomarkers, this study focused on extracting meaningful information from public online audio data. This work introduced novel design paradigms for pandemic management tools, showing the potential of audio data in clinical and public health applications.
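
The evaluation described in the abstract (transcripts passed to an LLM for classification, then scored by accuracy) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the label set, the prompt wording, and the `keyword_stub` function standing in for an LLM call are all assumptions, and the transcripts are assumed to have been produced already (for example by Whisper).

```python
from typing import Callable, List, Sequence

# Hypothetical label set; the preprint's exact label names are not given here.
LABELS = ("covid", "other_uri", "healthy")


def build_prompt(transcript: str) -> str:
    """Wrap a transcript in an illustrative classification instruction."""
    return (
        "Classify the speaker as one of: " + ", ".join(LABELS) + ".\n"
        "Answer with the label only.\n\n"
        "Transcript:\n" + transcript.strip()
    )


def parse_label(response: str) -> str:
    """Map a free-text model reply to a known label, or 'unknown'."""
    reply = response.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return "unknown"


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def keyword_stub(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): simple keyword rules only."""
    text = prompt.split("Transcript:\n", 1)[1].lower()
    if "tested positive" in text:
        return "covid"
    if "cold" in text or "flu" in text:
        return "other_uri"
    return "healthy"


def classify(
    transcripts: Sequence[str],
    model: Callable[[str], str] = keyword_stub,
) -> List[str]:
    """Prompt the model once per transcript and parse each reply."""
    return [parse_label(model(build_prompt(t))) for t in transcripts]
```

In practice `model` would be replaced by a call to whichever LLM is being evaluated; keeping it as a plain callable makes it possible to test the prompt and scoring logic without network access.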

Publication types

Preprint
