Early detection of depression is crucial because unrecognized or untreated depression can lead to suicide. In hospitals, self-administered questionnaires and interviews are used to diagnose depression. Although doctors spend considerable time interviewing patients to understand their conditions, depression is a heterogeneous syndrome, which makes accurate diagnosis challenging. Therefore, the biological aspects of depression must be investigated to address the limitations of traditional diagnostic methods. Because audio data can be easily collected in daily life, we propose a multimodal fusion cross-modality model that uses audio and text to detect depression. The proposed model achieved F1-scores of 0.67, 0.81, and 0.61 on the Distress Analysis Interview Corpus, the Emotional Audio and Textual Depression Corpus, and a Korean Depression dataset, respectively. The model is designed to be lightweight, reducing the number of parameters while maintaining accuracy, so that it can be deployed on pervasive devices. We evaluated the model on English, Chinese, and Korean depression datasets to assess its performance across languages. The cross-language experiments confirm that the proposed model can be applied to other languages, even when it is not trained on the same vocabulary. This finding suggests that the model learns distinctive characteristics of depression by combining nonlinguistic speech features and linguistic textual features. Therefore, this research is expected to enable depression detection in everyday life across languages and devices.
Keywords: Cross-modality; Depression dataset; Depression detection; Multimodal fusion.
Copyright © 2025 The Authors. Published by Elsevier Ltd. All rights reserved.