Background: Producing speech is a cognitively complex task, and speech can be collected through devices such as handheld recorders, tablets, and smartphones. Digital voice data can capture information with millisecond-level precision and serve as a widespread tool for collecting cognitively relevant data in diverse real-world environments. Digital voice recordings of spoken responses to neuropsychological test questions have been collected through the Framingham Heart Study (FHS) since 2005. The initial methods for analyzing voice recordings were labor- and time-intensive, a significant barrier to fully realizing the scientific objective of using speech and language as an alternative approach to cognitive assessment.
Methods: Through a collaboration between the Global Research Integration Platform (GRIP) and FHS, we leveraged existing open-source tools to create a digital voice processing toolkit that can be used by the general scientific community. GRIP is modularizing this toolkit to allow seamless integration at sites worldwide with low-to-high levels of technical experience. Table 1 lists the initial open-source tools we tested on 9,253 audio recordings collected from 5,399 FHS participants. Each tool is available as a Python GitHub repository that we leveraged.
Results: With minimal manual intervention, we generated prosodic, spectral, cepstral, and sound quality features from the ComParE-2016 feature set via openSMILE. We produced 65 low-level descriptors (LLDs) every 10 milliseconds over a 60-millisecond window, and 6,373 features by applying several statistical functionals to the LLDs. We segmented speakers via pyannote.audio and analyzed 9 acoustic and linguistic Praat features and 6,373 openSMILE features in the context of a cognitive status classification task. Via Whisper, we generated timestamped transcriptions from both FHS and additional U.S. and international cohorts. We have noted difficulties in language detection and decreased transcription performance for non-English speakers and English speakers with accents. We have used these data in numerous studies relating digital voice to AD-related outcomes.
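The framing scheme described above (65 LLDs computed every 10 milliseconds over a 60-millisecond window, then collapsed by statistical functionals into a 6,373-dimensional vector per recording) can be illustrated with a small arithmetic sketch. The helper below is hypothetical and for illustration only; it is not part of the FHS/GRIP toolkit or of openSMILE itself.

```python
# Illustrative sketch of the openSMILE-style framing arithmetic described
# above: one frame of n_llds descriptors every hop_ms, each computed over
# a win_ms analysis window. lld_matrix_shape is a hypothetical helper,
# not a function from the toolkit described in the abstract.

def lld_matrix_shape(duration_s, n_llds=65, hop_ms=10, win_ms=60):
    """Return (n_frames, n_llds) for a recording of the given duration."""
    total_ms = int(duration_s * 1000)
    if total_ms < win_ms:
        # Recording shorter than one analysis window: no full frames.
        return (0, n_llds)
    # A new frame starts every hop_ms as long as a full window still fits.
    n_frames = (total_ms - win_ms) // hop_ms + 1
    return (n_frames, n_llds)

# A 60-second spoken response yields (60000 - 60) // 10 + 1 = 5995 frames
# of 65 descriptors each; statistical functionals then summarize this
# matrix into a single fixed-length feature vector per recording.
print(lld_matrix_shape(60.0))  # (5995, 65)
```

This frame-then-summarize design is what lets recordings of arbitrary length map to a fixed-size feature vector suitable for downstream classification.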
Conclusion: Digital voice is a prime candidate for scalable collection of cognitively relevant information. The modularized toolkit under development will provide scalable, non-proprietary post-processing of digital voice data that can be seamlessly adopted by users worldwide with low-to-high levels of technical experience.
© 2024 The Alzheimer's Association. Alzheimer's & Dementia published by Wiley Periodicals LLC on behalf of Alzheimer's Association.