Digital Vocal Biomarker of Smoking Status Using Ecological Audio Recordings: Results from the Colive Voice Study

Digit Biomark. 2024 Aug 28;8(1):159-170. doi: 10.1159/000540327. eCollection 2024 Jan-Dec.

Abstract

Introduction: The complex health, social, and economic consequences of tobacco smoking underscore the importance of incorporating reliable and scalable data collection on smoking status and habits into research across various disciplines. Given that smoking impacts voice production, we aimed to develop a gender- and language-specific vocal biomarker of smoking status.

Methods: Leveraging data from the Colive Voice study, we used statistical analysis methods to quantify the effects of smoking on voice characteristics. Various voice feature extraction methods combined with machine learning algorithms were then used to produce a gender- and language-specific (English and French) digital vocal biomarker to differentiate smokers from never-smokers.

Results: A total of 1,332 participants were included after propensity score matching (mean age = 43.6 years [SD 13.65], 64.41% female, 56.68% English speakers, 50% smokers and 50% never-smokers). We observed differences in the distribution of voice features: for women, the fundamental frequency (F0), the F1, F2, and F3 formant frequencies, and the harmonics-to-noise ratio were lower in smokers than in never-smokers (p < 0.05), whereas for men no significant differences were observed between the two groups. The accuracy and AUC of smoking status prediction reached 0.71 and 0.76, respectively, for the female participants, and 0.65 and 0.68, respectively, for the male participants.
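For readers less familiar with the two reported metrics, the following is a minimal sketch of how accuracy and AUC can be computed from a classifier's predicted probabilities. It is not the study's evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Share of cases where the thresholded prediction equals the true label.

    The 0.5 threshold is an illustrative assumption, not taken from the study.
    """
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, y_prob):
    """AUC as the probability that a randomly chosen smoker (label 1) receives
    a higher score than a randomly chosen never-smoker (label 0); ties count 0.5.
    """
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = smoker, 0 = never-smoker.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]

print("accuracy:", round(accuracy(y_true, y_prob), 2))
print("AUC:", round(roc_auc(y_true, y_prob), 2))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two values can differ for the same model.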

Conclusion: We have shown that voice features are impacted by smoking. We have developed a novel digital vocal biomarker that can be used in clinical and epidemiological research to assess smoking status in a rapid, scalable, and accurate manner using ecological audio recordings.

Keywords: Machine learning; Public health; Smoking; Tobacco; Vocal biomarkers.

Plain language summary

The objective of this study was to develop a tool for determining the smoking status of a person from their voice. Using data from Colive Voice, an international digital health study led by the Luxembourg Institute of Health, we investigated the impact of smoking on voice characteristics using statistical methods. We then employed artificial intelligence algorithms to identify gender- and language-specific digital vocal biomarkers, which are combinations of voice features associated, in the context of this project, with smoking status. After analyzing data from 1,332 participants, we found differences in voice features between smokers and never-smokers, particularly among women. For example, the pitch and certain frequencies were lower in female smokers than in female never-smokers. We were able to differentiate between smokers and never-smokers with 71% accuracy for women and 65% for men. This research demonstrates that smoking affects the voice and that it is possible to predict smoking status using audio recorded in real-life settings. This tool could be valuable in clinical and research settings for studying smoking habits in a rapid and scalable manner.

Grants and funding

The Luxembourg Institute of Health funds the Colive Voice study and this research work.