3M-Health: Multimodal Multi-Teacher Knowledge Distillation for Mental Health Detection

Rina Carines Cabral (University of Sydney, Australia; ORCID 0000-0003-3076-0521), Siwen Luo (University of Western Australia, Australia; ORCID 0000-0003-0480-1991), Josiah Poon (University of Sydney, Australia; ORCID 0000-0003-3371-8628), and Soyeon Caren Han (University of Melbourne, Australia; ORCID 0000-0002-1948-6819)
(2024)
Abstract.

Mental health classification is of growing importance in contemporary society, where digital platforms serve as crucial sources for monitoring individuals’ well-being. However, existing social media mental health datasets primarily consist of text-only samples, potentially limiting the efficacy of models trained on such data. Recognising that humans utilise cross-modal information to comprehend complex situations or issues, we present a novel approach to address the limitations of current methodologies. In this work, we introduce a Multimodal and Multi-Teacher Knowledge Distillation model for Mental Health Classification, leveraging insights from cross-modal human understanding. Unlike conventional approaches that often rely on simple concatenation to integrate diverse features, our model addresses the challenge of appropriately representing inputs of varying natures (e.g., texts and sounds). To mitigate the computational complexity associated with integrating all features into a single model, we employ a multimodal and multi-teacher architecture. By distributing the learning process across multiple teachers, each specialising in a particular feature extraction aspect, we enhance the overall mental health classification performance. Through experimental validation, we demonstrate the efficacy of our model in achieving improved performance. All relevant code will be made available upon publication.

Keywords: mental health classification, knowledge distillation, multimodal
Copyright: ACM licensed. Journal year: 2024. DOI: XXXXXXX.XXXXXXX. Conference: 33rd ACM International Conference on Information and Knowledge Management; October 21-25, 2024; Boise, Idaho. ISBN: 978-1-4503-XXXX-X/18/06. CCS concepts: Information systems → Decision support systems; Computing methodologies → Information extraction.

1. Introduction

Mental health is a critical aspect of individual well-being, influencing both personal lives and societal structures (Garg, 2023). Despite advancements in mental healthcare, not everyone with mental health concerns actively seeks professional help. The widespread use of social media platforms, such as Twitter and Reddit, has opened avenues for detecting mental health issues by analysing text-oriented posts. This shift towards online expression has prompted research into text-based mental health classification, focusing on identifying the presence and categories of mental health concerns within social media posts. Recent studies in mental health classification from social media content have embraced diverse components, ranging from historical posts and conversation trees to social graphs and user metadata (Cao et al., 2019, 2022; Lin et al., 2020; Sawhney et al., 2021a, b). However, the availability of these additional sources varies due to data privacy restrictions or user preferences, introducing challenges in research and system reproducibility. In light of these challenges, our research addresses the limitations of existing methodologies by focusing on the analysis of text-only social media posts, a fundamental and universally available component. While semantic pre-trained textual embeddings from text-only input may capture explicit emotional words related to mental health, they may fall short in capturing less explicit emotions, limiting their robustness. For instance, some textual posts may lack explicit emotional language yet imply an unhealthy mental state. Recognising the potential shortcomings of text-only datasets, we introduce a novel approach to mental health classification through a Multimodal and Multi-Teacher Knowledge Distillation Model. Inspired by human comprehension strategies that involve multimodal information integration, our model leverages insights from multimodal human understanding to enhance the efficacy of mental health risk detection.
Our approach introduces a new acoustic modality feature generated from the original textual posts, motivated by the proven effectiveness of vocal biomarkers in indicating psychological distress and other medical conditions (Iyer et al., 2022). This creates a new modality from text-only input for unimodal text-based mental health risk detection. Simultaneously, we incorporate emotion-enriched features as additional information. Instead of integrating all modalities into one model, we employ a multimodal and multi-teacher architecture to address the computational complexity of integrating diverse features. This approach distributes the learning process across multiple teachers, each specialising in a particular feature extraction aspect. To the best of our knowledge, there have been no attempts to create a new modality from text-only input for unimodal text-based mental health risk detection tasks. Additionally, we propose a new multimodal knowledge distillation model for the mental health risk detection domain.

2. Related Works

2.1. Mental Health Classification

Recent studies in mental health classification from social media content have incorporated diverse social media components. These components encompass various elements, including historical posts, conversation trees, social and interaction graphs, user or post metadata information, and profile pictures or posted images (Cao et al., 2019, 2022; Lin et al., 2020; Sawhney et al., 2021a, b). However, those additional sources will not always be available in the dataset due to data privacy restrictions or user preferences. This complicates research reproducibility since each study selects features based on what social media components are available to them. Hence, our research focuses on exploring mental health detection by analysing only the textual posts, which are a compulsory component of text-based social media posts related to mental health. Based on the textual aspect, existing studies have worked on frequency- or score-based emotion features (Aragón et al., 2023; Zogan et al., 2022). More recent works use contextual embeddings fine-tuned on emotion-based tasks as the emotion features for mental health classification (Lara et al., 2021; Sawhney et al., 2021a). These studies mainly focus on identifying one type of emotion and assigning it in a fixed manner to each word or to the entire textual content. In contrast, our research highlights the complexity of human emotions, wherein a single word can be associated with multiple types of emotions, by integrating emotion-enriched features generated through a multi-label, corpus-based representation learning framework. In addition, we propose a new way to include acoustic features generated from the original textual posts in this task, motivated by the proven effectiveness of vocal biomarkers in indicating psychological distress and other medical conditions (Iyer et al., 2022). These acoustic features are processed alongside the emotion-enriched features as additional information.
To the best of our knowledge, there have been no attempts to create a new modality from the text-only input to achieve unimodal text-based mental health risk detection tasks.

2.2. Multi-teacher Knowledge Distillation

To integrate the multimodal knowledge efficiently, we design our model around knowledge distillation, compressing a complex and large multimodal integration model into a smaller and simpler one while still retaining the accuracy and performance of the resultant model. Knowledge distillation (Hinton et al., 2015) involves transferring knowledge from a teacher model to a student model, and is commonly applied to compress large models by mapping intermediate layer outputs (Chen et al., 2021; Jiao et al., 2020) or minimising the KL-divergence between class distributions (Mirzadeh et al., 2020). Traditionally, knowledge distillation is used within the same modality. However, recent approaches extend it to different modalities (Yang and Xu, 2021; Ni et al., 2022; Li et al., 2020). Some studies explore collaborative learning with multiple teachers for improved compression, such as Wu et al. (2021) and Pham et al. (2023) in language and vision models. Tan et al. (2018) focus on multilingual language translation, and Vongkulbhisal et al. (2019) apply multi-teacher knowledge distillation to unify classifiers trained on distinct data sources. Inspired by this, we propose a new Multimodal Multi-Teacher Knowledge Distillation framework for mental health risk detection.

Figure 1. The overall architecture of 3M-Health: Multimodal Multi-teacher Knowledge Distillation for Mental Health Detection. Note that the teacher models are finetuned using each mental health dataset.

3. 3M-Health

In this section, we introduce the details of our Multimodal Multi-teacher knowledge distillation model for Mental Health detection, 3M-Health. Figure 1 illustrates the overall architecture design. This model consists of three distinct teacher models, each focusing on different modalities to independently learn diverse aspects of features crucial for interpreting mental health-related posts. The acquired features from multimodal teachers serve as a valuable source of knowledge for instructing the student model by utilising the average output distribution of the teacher models as soft targets. Note that we introduce three essential multimodal teacher models for mental health risk detection, including 1) a text-based teacher for understanding semantic aspects from input texts, 2) an emotion-based teacher for interpreting emotion aspects from input texts, and 3) an audio-based teacher for discerning emotions conveyed through audio sounds.

3.1. Multimodal Multi-Teacher Construction

This section articulates each teacher model’s objective and construction process. Teacher fine-tuning can be found in Section 3.2.

3.1.1. Text-based Teacher

The text-based teacher aims to teach contextual semantic comprehension of mental health-related textual posts. We leverage pre-trained language models (PLMs) since contextualised embeddings from PLMs represent different meanings based on the context (e.g., blue usually denotes a colour but means gloomy in emotional contexts). Moreover, some words may take on the opposite connotation in the medical domain (e.g., positive usually means something good, but in medical text it often refers to the presence of a specific condition, which is typically not a desirable outcome). Inspired by this, we explore several general PLMs (BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019)) and medical/mental-health-specific PLMs (MentalBERT (Ji et al., 2021) and ClinicalBERT (Wang et al., 2023)) in the experiment.

3.1.2. Emotion-based Teacher

The emotion-based teacher aims to teach emotional aspects from the input text of mental health-related posts. It is initialised with representations derived from a graph-based model (Cabral et al., 2024), which produces emotion-enriched word representations by thoroughly incorporating global and local relationships among posts and all the words within those posts. To do so, we first obtain a multi-label emotion class indicating anger, disgust, fear, sadness, surprise, negative, and other (positive sentiment/emotions are grouped into other to focus on the different negative emotions in mental health-related text) for each post using the SenticNet7 lexicon (Cambria et al., 2022; https://sentic.net/downloads/), mapping identified words to their corresponding emotion types. This emotion lexicon consists of terms $K=\{k_1,\dots,k_q\}$ associated with one or more emotion types from $EM=\{em_1,\dots,em_r\}$. For each word in a post $W=\{w_1,\dots,w_p\}$, we assign $EM_{k_j}$ to $w_i$ whenever $w_i=k_j$ for some $k_j$ in $K$. Consequently, each post is associated with a multi-label class $EM_d=\{EM_{w_1}\cup EM_{w_2}\cup\dots\cup EM_{w_p}\}$. Subsequently, we construct a graph $G=(V,E,A)$ representing all posts and their word tokens, where $V$ is the set of all post nodes and token nodes tokenised through wordpiece tokenisation with emoticon preservation (to further preserve and integrate emotions in the posts, emoticons and emojis are added to the tokeniser vocabulary). Here, $E$ encompasses token-token edges $E_{w_i,w_j}$, token-post edges $E_{w_i,d_j}$, and post-post edges $E_{d_i,d_j}$, while $A$ specifies weights between related nodes (Han et al., 2022). Post node and token node representations are initialised with the [CLS] embedding and the minimum of contextualised token embeddings from pre-trained BERT word embeddings, respectively. Edge values are determined by Pointwise Mutual Information (PMI) for $E_{w_i,w_j}$, Term Frequency-Inverse Document Frequency (TF-IDF) for $E_{w_i,d_j}$, and Jaccard similarity for $E_{d_i,d_j}$. Utilising these initialised representations and edge values, a two-layer Graph Convolutional Network (GCN) (Kipf and Welling, 2016) is trained with ReLU for the multi-label emotion classification task based on SenticNet7. The updated second-layer hidden states are extracted and used as initial weights for fine-tuning BERT on the same multi-label emotion classification task to further comprehend the associated emotions in the posts. The updated word embeddings are extracted as the multi-emotion contextual embeddings.
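The labelling and post-post edge steps above can be illustrated with a minimal sketch. The lexicon entries below are invented for illustration and are not taken from SenticNet7; the fallback to "other" for posts without any lexicon match and the use of whitespace tokens for the Jaccard similarity are assumptions, since the text does not specify them.

```python
# Toy stand-in for an emotion lexicon: term -> set of emotion types.
# These entries are hypothetical and NOT taken from SenticNet7.
TOY_LEXICON = {
    "hate": {"anger", "disgust"},
    "alone": {"sadness", "fear"},
    "tired": {"negative"},
}


def post_emotion_labels(words, lexicon):
    """Return EM_d: the union of the emotion sets of all lexicon words in a post."""
    labels = set()
    for word in words:
        labels |= lexicon.get(word, set())
    # Assumption: a post with no lexicon match falls back to the "other" class.
    return labels or {"other"}


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two token collections."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


post_1 = "i hate being alone".split()
post_2 = "so tired of being alone".split()

labels_1 = post_emotion_labels(post_1, TOY_LEXICON)
labels_2 = post_emotion_labels(post_2, TOY_LEXICON)
post_post_weight = jaccard(post_1, post_2)  # candidate weight for a post-post edge
```

Because the label of a post is a set union, a single post can carry several emotion types at once, which is the multi-label property the section relies on.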

3.1.3. Audio-based Teacher

Existing social media mental health datasets primarily consist of text-only samples, so the two teachers described above are purely text-based. To address the need for a more comprehensive understanding of complex mental health and emotional contexts, we propose the integration of multimodal information. According to research (Collins et al., 2023), individuals can more accurately interpret the emotions of others through listening rather than observing facial expressions/body language or reading written text. Drawing inspiration from this insight, we introduce an audio-based teacher to enhance knowledge distillation, enabling the interpretation of emotions in mental health posts through sound. To achieve this, we first employ Bark (https://github.com/suno-ai/bark), a pre-trained text-to-audio model, to generate corresponding audio for each post, as Bark can capture emotional sounds detected from the text (e.g. [laughs], [gasps] and “…” for hesitations). A theoretical and practical comparison of text-to-audio generation APIs and a list of Bark’s sound cues are given in Section 4.2. Note that Bark can generate only 13 seconds of audio at a time. Hence, we tokenise each post at the sentence level, generate audio for each sentence, and then aggregate these audio segments into a complete audio representation for the entire post. Particularly long sentences, or texts that do not have punctuation, are further split into chunks of at most 45 tokens.
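The chunking step described above can be sketched as follows. This is only an illustration under stated assumptions: sentences are taken to end at ".", "!" or "?" followed by whitespace, and tokens are whitespace-separated words, neither of which is specified in the text. The helper name `split_post` is hypothetical, and no text-to-audio call is shown.

```python
import re

MAX_TOKENS = 45  # maximum chunk size stated in the text


def split_post(post, max_tokens=MAX_TOKENS):
    """Split a post into sentence-level chunks of at most max_tokens tokens."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", post.strip()) if s]
    chunks = []
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenisation
        # Long or unpunctuated text is cut into consecutive token windows.
        for start in range(0, len(tokens), max_tokens):
            chunks.append(" ".join(tokens[start:start + max_tokens]))
    return chunks


# Two short sentences followed by 100 unpunctuated words.
chunks = split_post("I can't sleep... Nothing helps! " + "word " * 100)
```

Each chunk would then be passed to the text-to-audio model separately, and the resulting segments concatenated in order to form the audio for the whole post.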

3.2. Multimodal Multi-Teacher Fine-tuning

Researchers (Jiao et al., 2020; Wu et al., 2021) emphasise the significance of distilling knowledge from the hidden states of a teacher model for effective student instruction. In this section, we describe the independent fine-tuning process of each teacher model.

We fine-tune the text-based teacher, initialised from the pre-trained language model, on the mental health classification task with labels $C=\{c_1,c_2,\dots,c_{|C|}\}$. This process enables the teacher to learn and adapt to the nuances of mental health-related contexts within each dataset.

For the emotion-based teacher, following the generation of emotion-rich representations for each post and its words, these embeddings serve as inputs for fine-tuning a Multi-Layer Perceptron (MLP) for mental health classification, operating over the labels $C$.

We employ the Audio Spectrogram Transformer (AST) (Gong et al., 2021) for the audio-based teacher to classify each generated audio into mental health risk classes. AST is a transformer-based model that takes a sequence of audio spectrogram patches as inputs. An audio waveform of $t$ seconds is first converted into a $128\times 100t$ spectrogram based on a sequence of 128-dimensional log Mel filterbank features computed with a 25ms Hamming window every 10ms. The spectrogram is then split into a sequence of $N$ patches of size $16\times 16$ with an overlap of 6 in both the time and frequency dimensions. A special token [CLS] is added to the beginning of the sequence of spectrogram patches. After passing through the transformer encoder layers, the [CLS] embedding is fed into a linear layer with sigmoid activation to classify mental health risk class labels $C$.
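To make the patch arithmetic concrete, the sketch below counts how many 16x16 patches result from a spectrogram when patches overlap by 6, i.e. with a stride of 10. It assumes no padding and floor division, as in an unpadded strided convolution; the function names are hypothetical.

```python
def patches_per_axis(size, patch=16, overlap=6):
    """Number of patch positions along one axis for an unpadded strided split."""
    stride = patch - overlap  # 16 - 6 = 10
    if size < patch:
        return 0
    return (size - patch) // stride + 1


def num_patches(n_mels, n_frames, patch=16, overlap=6):
    """Total number of patches N for an n_mels x n_frames spectrogram."""
    return (patches_per_axis(n_mels, patch, overlap)
            * patches_per_axis(n_frames, patch, overlap))


# A 10 ms hop gives roughly 100 frames per second of audio.
freq_positions = patches_per_axis(128)        # positions along the 128 Mel bins
n_for_10_seconds = num_patches(128, 1000)     # about 10 s of audio
```

Under these assumptions the sequence length grows linearly with audio duration, which is why long posts produce long patch sequences.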

Every teacher model is individually constructed and fine-tuned to facilitate optimal learning. We are aware of the concerns raised by some researchers (Wu et al., 2021) highlighting the potential inconsistency in the feature space when different teachers are separately pre-trained with distinct settings and then fine-tuned independently. Based on our initial testing, co-finetuning multimodal teachers yields little improvement; in fact, it tends to result in lower performance. We speculate that integrating multimodal information may not perform optimally during co-finetuning.

3.3. Multi-Teacher Knowledge Distillation

For the student model, we use a single modality involving textual posts as input for a pre-trained BERT, which processes the sequence of tokenised words. The student model performs the mental health risk classification task over the same class labels $C$, with knowledge distilled from the text-based, emotion-based, and audio-based teacher models. To incorporate the acquired knowledge from these various multimodal sources, the student model is trained to minimise the distillation loss given by $L=L_{task}+L_{kd}$. Here, $L_{task}$ represents the cross-entropy loss between the student model’s predictions and the ground truth of mental health risk categories, while $L_{kd}$ stands for the Kullback-Leibler (KL) divergence between the predictions of the student model and those of the teacher models. Given the presence of multiple teacher models, we calculate $L_{kd}$ against the average of the predicted probability distributions from all three teacher models, which serves as the soft target.
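A minimal numerical sketch of this loss for a single sample is given below. It is an illustration, not the training code: the direction of the KL term (teacher average relative to student), the absence of a temperature or weighting factor, and all function names are assumptions that the text does not fix.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_distribution(distributions):
    """Element-wise mean of several probability distributions (the soft target)."""
    n = len(distributions)
    return [sum(column) / n for column in zip(*distributions)]


def cross_entropy(probs, true_index, eps=1e-12):
    """L_task: negative log-probability of the ground-truth class."""
    return -math.log(probs[true_index] + eps)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_probs, true_index):
    """L = L_task + L_kd for one sample."""
    student = softmax(student_logits)
    soft_target = average_distribution(teacher_probs)
    l_task = cross_entropy(student, true_index)
    # Assumption: KL(teacher average || student); the direction is not stated.
    l_kd = kl_divergence(soft_target, student)
    return l_task + l_kd


# Hypothetical outputs of the text, emotion and audio teachers for 3 classes.
teachers = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
soft_target = average_distribution(teachers)
loss = distillation_loss([2.0, 1.0, 0.1], teachers, true_index=0)
```

Averaging before computing the divergence means each teacher contributes equally to the soft target, regardless of its individual accuracy.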

4. Experimental Setup

4.1. Datasets

Table 1. Data statistics. Durations are in minute:second (mm:ss) format. ± SDCNL categorises suicide and depression-related posts.

| | TwitSuicide | DEPTWEET | IdenDep | SDCNL |
|---|---|---|---|---|
| Task | Suicide | Depression | Depression | Suicide± |
| Platform | Twitter | Twitter | Reddit | Reddit |
| Num. Classes | 3 | 4 | 2 | 2 |
| Total Samples | 660 | 5,128 | 1,841 | 1,895 |
| Evaluation | 10-fold | 60/20/20 | 10-fold | 80/20 |
| Train/Val | - | 4,102 | - | 1,516 |
| Test | - | 1,026 | - | 379 |
| Length | 13-147 | 1-926 | 11-17,641 | 13-24,590 |
| Avg. Length | 90.32 | 163.28 | 1,127.57 | 936.76 |
| Words | 3-31 | 1-101 | 1-3,477 | 2-4,411 |
| Avg. Words | 16.85 | 28.15 | 215.1 | 178.53 |
| Min Duration | 00:01.903 | 00:01.250 | 00:02.900 | 00:02.463 |
| Max Duration | 00:32.853 | 00:56.596 | 22:07.740 | 28:09.546 |
| Avg. Duration | 00:11.545 | 00:17.860 | 01:45.193 | 01:27.568 |
Figure 2. Class distribution. For (a) TwitSuicide, SI: Safe to Ignore; PC: Possibly Concerning; SC: Strongly Concerning. For (b) DEPTWEET, ND: Non-depression; MI: Mild; MO: Moderate; SE: Severe. For (c) IdenDep, NDE: Non-depression; DE: Depression. For (d) SDCNL, DEP: Depression; SUI: Suicide.
Figure 3. Audio length comparison for (a) Twitter-based and (b) Reddit-based datasets. ch: character average

We evaluate our proposed model using four publicly available datasets related to mental health on social media. Table 1 and Figure 2 provide a summary of statistics and an illustration of class distribution.

The TwitSuicide Dataset (Long et al., 2022), with data available upon request, replicates the data collection, processing, and annotation methods of O’Dea et al. (2015). A sample of 660 tweets is annotated into three risk levels. The Strongly Concerning (SC) class is assigned to posts with a convincing display of severe suicidal ideation, while Safe to Ignore (SI) shows no evidence of suicide risk. A post that does not fall into either of these categories is assigned to the Possibly Concerning (PC) class.

DEPTWEET (Kabir et al., 2023; https://github.com/mohsinulkabir14/DEPTWEET) is collected from Twitter using seed terms based on the Patient Health Questionnaire (PHQ-9). The dataset comprises 40,191 tweets; however, only 5,128 tweets could be retrieved during this study. The labels include Non-Depressed (ND), Mild (MI), Moderate (MO), and Severe (SE), with an imbalanced class distribution: around 80% of tweets are labelled as ND and less than 2% as SE.

The Identifying Depression Dataset (IdenDep) (Pirina and Çöltekin, 2018; https://github.com/Inusette/Identifying-depression) consists of 1,841 Reddit posts, with “depression indicative” (DE) posts sourced from the Depression subreddit and non-depressive (NDE) posts from the “family” and “friendship advice” subreddits. No further manual check was done on the samples, increasing the probability of false negatives.

The SDCNL Dataset (Haque et al., 2021; https://github.com/ayaanzhaque/SDCNL) involves distinguishing between suicide-related and depression-related Reddit posts. The dataset contains 1,895 nearly balanced posts labelled as Suicide (SUI) or Depression/Not Suicide (DEP) based on their subreddit. In accordance with Benton et al. (2017), all posts are de-identified before any analysis, audio generation, and model training.

Table 2. Text statistics for each class per dataset. Length and Words are given as min-max ranges with the average in parentheses.

| Dataset | Class | Total | % | Length (ave.) | Words (ave.) |
|---|---|---|---|---|---|
| TwitSuicide | Safe to Ignore | 103 | 15.61 | 13-139 (77.89) | 4-31 (15.25) |
| TwitSuicide | Possibly Concerning | 264 | 40.00 | 24-147 (88.16) | 4-31 (16.35) |
| TwitSuicide | Strongly Concerning | 293 | 44.39 | 13-147 (96.65) | 3-30 (17.86) |
| DEPTWEET | Non-Depressed | 4213 | 82.16 | 1-816 (164.47) | 1-101 (28.08) |
| DEPTWEET | Mild | 606 | 11.82 | 4-885 (144.74) | 1-87 (26.38) |
| DEPTWEET | Moderate | 232 | 4.52 | 32-926 (184.95) | 5-99 (33.25) |
| DEPTWEET | Severe | 77 | 1.50 | 23-398 (178.81) | 1-62 (30.57) |
| IdenDep | Non-Depression | 548 | 29.77 | 11-17641 (1546.34) | 1-3477 (295.75) |
| IdenDep | Depression | 1293 | 70.23 | 11-13803 (950.09) | 2-2487 (180.92) |
| SDCNL | Depression | 915 | 48.28 | 43-16015 (1000.68) | 8-3200 (192.84) |
| SDCNL | Suicide | 980 | 51.72 | 13-24590 (977.07) | 2-4411 (165.16) |
Table 3. Audio statistics for each class per dataset in minute:second (mm:ss) format.

| Dataset | Class | Min Duration | Max Duration | Ave. Duration |
|---|---|---|---|---|
| TwitSuicide | Safe to Ignore | 00:01.903 | 00:31.000 | 00:12.215 |
| TwitSuicide | Possibly Concerning | 00:02.143 | 00:32.853 | 00:11.393 |
| TwitSuicide | Strongly Concerning | 00:01.943 | 00:24.760 | 00:10.280 |
| DEPTWEET | Non-Depressed | 00:01.250 | 00:56.200 | 00:17.210 |
| DEPTWEET | Mild | 00:02.230 | 00:56.596 | 00:15.413 |
| DEPTWEET | Moderate | 00:04.460 | 00:47.770 | 00:18.958 |
| DEPTWEET | Severe | 00:02.250 | 00:40.160 | 00:17.865 |
| IdenDep | Non-Depression | 00:02.900 | 22:07.740 | 02:20.941 |
| IdenDep | Depression | 00:03.100 | 21:28.643 | 01:30.420 |
| SDCNL | Depression | 00:05.756 | 25:58.966 | 01:33.802 |
| SDCNL | Suicide | 00:02.463 | 28:09.546 | 01:21.747 |

We provide a detailed breakdown of text and audio statistics for each dataset class in Tables 2 and 3 to provide more information regarding the nature of each class, which may influence model learning and performance. Notably, DEPTWEET and IdenDep datasets have highly skewed data, with 82.16% and 70.23% on a single class, respectively. Figure 3 illustrates the differences in the generated audio in terms of duration. The Reddit-based datasets, IdenDep and SDCNL, are significantly longer than the Twitter-based datasets, possibly providing more auditory information inferred from the textual posts.

4.2. Text-to-Audio Generators

In order to generate the best possible audio to represent each textual post in our benchmark datasets, we performed a theoretical and practical comparison between five publicly accessible text-to-speech and text-to-audio generative APIs.

  1. Tacotron2 (Shen et al., 2018; https://github.com/NVIDIA/tacotron2) uses a recurrent neural network architecture to predict mel spectrogram sequences from text, followed by a modified WaveNet vocoder.

  2. SpeechT5 (Ao et al., 2022; https://github.com/microsoft/SpeechT5) unifies modalities with a shared encoder-decoder architecture that uses cross-modal vector quantisation for speech and text alignment.

  3. SpeechBrain (Ravanelli et al., 2021; https://github.com/speechbrain/speechbrain/) is a speech toolkit offering various speech-related tasks. Its text-to-speech model is based on Tacotron2 but is trained further on the LJSpeech (Ito and Johnson, 2017) and LibriTTS (Zen et al., 2019) datasets.

  4. Balacoon (https://huggingface.co/balacoon/tts) packages offer lightweight and fast text analysis and speech generation, in contrast to larger but slower TTS models. It sacrifices multi-speaker and multi-lingual features for high speed on the CPU. The detailed model architecture was not publicly available at the time of this paper’s writing.

  5. Bark (https://github.com/suno-ai/bark) has a GPT-based architecture using a quantised audio representation that does not require the use of phonemes, allowing it to generalise beyond speech and thus making it a text-to-audio model.

After comparing the five generators, we use Bark due to the expressiveness of the audio it generates. While the other models suffer from a robotic delivery of the generated speech, do not verbalise numerical figures, and read emoticons as individual punctuation marks (e.g. “>:|” as greater than, colon, pipe), Bark produces the most natural-sounding audio, recognising textual markers like “,” for pauses, “–” and “…” for hesitations, capitalisation for emphasis (e.g. goodbye vs. GOODBYE), and sentence punctuation to produce tonal shifts (e.g. huh? vs huh!). Bark also verbalises non-speech sounds such as [laughter], [laughs], [sighs], [music], [gasps], [clears throat], haha, uhm, waaah, and ooh. Bark’s ability to infer and convey emotions from only an input text is valuable to our mental health risk detection model since it can provide additional emotional cues through the generated sound.

4.3. Baselines and Metrics

We assess the performance of our model by comparing it to previously published results using post-only (in contrast to studies incorporating other components such as posted images or user network and activity) and post-level classification on the same datasets, employing identical class labels and similar evaluation setups. We use results reported in the following studies as our baselines: Bi-LSTM Char+Word (Long et al., 2022) for TwitSuicide; MLP (Tadesse et al., 2019) and EAN (Ren et al., 2021) for IdenDep; and GUSE-DENSE (Haque et al., 2021) and AugBERT+LR (Ansari et al., 2021) for SDCNL. For the DEPTWEET dataset, we use the published DistilBERT code from Kabir et al. (2023) to replicate baseline results for the retrieved dataset. In addition, we provide strong baselines from fine-tuning state-of-the-art PLMs: BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), MentalBERT, and MentalRoBERTa (Ji et al., 2021). All PLM baselines follow the training setup used by Long et al. (2022) with a batch size of 8 and a learning rate of 1e-04, trained for three epochs. Given the class imbalance, we evaluate our system based on macro F1 (F1m) and weighted F1 (F1w) scores, followed by accuracy and class F1 scores.
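The difference between the two headline metrics can be seen in a small sketch: macro F1 averages the per-class F1 scores equally, whereas weighted F1 weights each class by its number of true samples. The toy labels below are invented for illustration and are not drawn from any of the datasets.

```python
def per_class_f1(y_true, y_pred, labels):
    """F1 score of each class, computed from true/false positives and negatives."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def weighted_f1(y_true, y_pred, labels):
    """Mean of per-class F1 weighted by the number of true samples per class."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores[c] * y_true.count(c) for c in labels) / len(y_true)


# Toy imbalanced example: 8 majority-class posts and 2 minority-class posts.
labels = ["ND", "SE"]
y_true = ["ND"] * 8 + ["SE"] * 2
y_pred = ["ND"] * 8 + ["ND", "SE"]  # one minority post is missed

f1m = macro_f1(y_true, y_pred, labels)
f1w = weighted_f1(y_true, y_pred, labels)
```

On skewed data, a missed minority-class sample lowers macro F1 more than weighted F1, which is why reporting both gives a fuller picture.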

4.4. Implementation Details

We evaluate our model following established evaluation setups from previous literature using the same datasets on the same classification task setup for fair benchmark comparisons. We use 10-fold cross-validation for TwitSuicide and IdenDep, while a train/test split is used for SDCNL and DEPTWEET. Original data splits are retained when provided; otherwise, the data is randomly split (Table 1). When none is given, 10% of the training set is used for validation.
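As an illustration of the 10-fold setup, the sketch below generates train/test index splits for a dataset of 660 samples, the size of TwitSuicide. Whether the folds are stratified, and which random seed is used, are not stated in the text; the shuffling, the seed and the function name here are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # hypothetical seed, plain (unstratified) shuffle
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(660, k=10))
```

Each sample appears in exactly one test fold, so the reported score is an average over predictions covering the whole dataset.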

Hyperparameter tuning is done per dataset and model setup using Optuna (https://optuna.org/) for 50 trials, optimising weighted F1 scores. The detailed search space and best-found hyperparameters may be found in the Supplementary Material. Text-based teachers are trained using ReLU and a max length of 256. Audio-based teachers are trained for 25 epochs with a 5-epoch early stop, a 512 max length, and the ReduceLROnPlateau scheduler. All inputs for the AST model are normalised to zero mean and 0.5 standard deviation. Emotion-based teachers are trained for 100 epochs with a 10-epoch early stop and a 256 max length. The student models are tuned and trained with distilled knowledge from the fine-tuned teachers using a max length of 256. All tuning was done using a 90:10 validation split and was conducted separately from the final model construction. All models are trained using an Adam optimiser on an NVIDIA TITAN RTX machine.
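The input normalisation mentioned above can be sketched as a shift and rescale of the spectrogram values. This is a simplified illustration: it computes the statistics from the input itself, whereas a real pipeline may use statistics estimated over the training set; the function name is hypothetical.

```python
import statistics


def normalise(values, target_std=0.5):
    """Shift values to zero mean and rescale them to the target standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant input carries no variation to rescale.
        return [0.0 for _ in values]
    return [(v - mean) / std * target_std for v in values]


# Hypothetical log Mel filterbank values for illustration only.
normalised = normalise([-4.0, -2.0, 0.0, 2.0])
```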

5. Results

5.1. Overall Performance

Table 4. Overall results using all three teacher modalities (Ours (All)) and the best partial teacher combination (Ours (Text&Emo) or Ours (Text&Audio)) against baselines. Class abbreviation definitions may be found in the Figure 2 caption. We present a full teacher combination ablation study in Table 8. The Kabir et al. (2023) row reports replicated results. Bold face indicates the best score, while the second best is underlined.
Overall Performance Breakdown F1 Scores
TwitSuicide Acc F1m F1w (SC) (PC) (SI)
Long et al. (2022) 56.67 - - 40.00 50.00 66.00
BERT 57.58 53.60 57.25 40.00 57.00 64.00
RoBERTa 55.45 50.61 54.43 37.18 51.63 63.03
MentalBERT 57.73 52.57 57.39 35.23 56.65 65.84
MentalRoBERTa 55.91 51.49 55.60 41.62 53.05 62.81
Ours (Text&Emo) 65.76 61.96 65.46 49.72 62.34 73.81
Ours (All) 61.21 59.64 61.23 54.17 59.50 65.26
DEPTWEET Acc F1m F1w (SE) (MO) (MI) (ND)
Kabir et al. (2023) 79.75 38.59 78.89 17.65 21.98 25.43 89.29
BERT 81.89 40.40 80.21 36.36 00.00 34.34 90.90
RoBERTa 82.18 32.04 80.14 00.00 00.00 36.77 91.41
MentalBERT 83.54 36.04 79.72 26.09 00.00 26.67 91.41
MentalRoBERTa 78.48 36.60 78.62 22.22 00.00 34.91 89.28
Ours (Text&Emo) 84.03 46.43 83.09 41.38 12.31 39.07 92.95
Ours (All) 82.77 46.20 82.61 26.09 34.34 32.32 92.04
IdenDep Acc F1m F1w (DE) (NDE)
Tadesse et al. (2019) 91.00 - - 93.00 -
Ren et al. (2021) 91.30 - - 93.98 -
BERT 88.65 85.23 88.10 92.34 78.12
RoBERTa 87.18 82.85 86.34 91.47 74.24
MentalBERT 89.63 86.71 89.23 92.93 80.49
MentalRoBERTa 90.11 87.70 89.91 93.15 82.26
Ours (Text&Audio) 94.30 93.10 94.26 95.97 90.23
Ours (All) 93.92 92.58 93.85 95.73 89.43
SDCNL Acc F1m F1w (SUI) (DEP)
Haque et al. (2021) 72.24 - - 73.61 -
Ansari et al. (2021) - - - 76.00 -
BERT 67.02 66.57 66.64 70.45 62.69
RoBERTa 70.97 70.63 70.69 73.81 67.46
MentalBERT 69.39 69.21 69.26 71.57 66.86
MentalRoBERTa 72.30 72.10 72.14 74.45 69.74
Ours (Text&Audio) 76.52 76.50 76.51 77.12 75.88
Ours (All) 75.20 74.84 74.90 77.83 71.86
Figure 4. Audio analysis using PCA on spectrogram images of audio samples grouped by a maximum of 10s (left) and 10-25s (right). Panels: (a) TwitSuicide, (b) DEPTWEET, (c) IdenDep, (d) SDCNL. Each sample is labelled with an ID for reference to corresponding texts provided in the Supplementary Material.

We compare our model with fine-tuned PLM baselines and several published baselines that use the same mental health detection datasets and evaluation setup. Note that we select post-only mental health detection models, as mentioned in Sections 1 and 4.3. Table 4 reports both the overall performance and the per-class performance.

Overall, our model outperforms all baselines on all four benchmark datasets. Notably, our model does not have to be trained with all three teachers to achieve its best results. Of the four datasets in Table 4, the first two originate from Twitter and the latter two from Reddit. The Twitter datasets produce the best results with the combination of the text and emotion teachers, whereas the Reddit-based datasets perform best with text and audio knowledge distillation. We detail the efficacy of each teacher combination on the different datasets in Section 5.3. In addition, our model trained with all three modalities still outperforms the baseline models in most cases and is better at identifying certain classes. In particular, for the Moderate (MO) class of the DEPTWEET dataset, our model trained with all three modalities achieves a 34.34 F1 score, while our model trained with the best partial combination achieves only 12.31, and all fine-tuned PLM baselines fail to recognise the MO class. Hence, learning from teachers of different modalities helps our model perform considerably better than baseline models that learn from textual inputs only. Such improvement is more noticeable on datasets with shorter texts. Specifically, on the TwitSuicide dataset, our model's best performance is 8.36 and 8.07 percentage points higher than the best-performing baseline on macro F1 and weighted F1, respectively. For the DEPTWEET, IdenDep, and SDCNL datasets, the best performances of our model are 6.03, 5.40, and 4.40 percentage points higher on macro F1 and 2.88, 4.35, and 4.37 percentage points higher on weighted F1 than the best-performing baseline, respectively.
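As a worked check of how such margins are obtained, the sketch below recomputes the TwitSuicide gap from the macro and weighted F1 values in Table 4. The variable names are illustrative; Long et al. (2022) is omitted because no macro or weighted F1 is reported for it.

```python
# (F1m, F1w) pairs for TwitSuicide, copied from Table 4.
twitsuicide_baselines = {
    "BERT": (53.60, 57.25),
    "RoBERTa": (50.61, 54.43),
    "MentalBERT": (52.57, 57.39),
    "MentalRoBERTa": (51.49, 55.60),
}
ours_best = (61.96, 65.46)  # Ours (Text&Emo)

# The strongest baseline is chosen separately for each metric.
best_baseline_f1m = max(f1m for f1m, _ in twitsuicide_baselines.values())
best_baseline_f1w = max(f1w for _, f1w in twitsuicide_baselines.values())

# Margins in percentage points.
margin_f1m = round(ours_best[0] - best_baseline_f1m, 2)
margin_f1w = round(ours_best[1] - best_baseline_f1w, 2)
```

Here the best baseline differs by metric (BERT for macro F1, MentalBERT for weighted F1), so each margin is taken against the strongest competitor on that metric.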

5.2. Audio Representation Analysis

Table 5. Samples for the TwitSuicide audio spectrogram analysis. Each sample has been masked to avoid a reverse search of each post. SI: Safe to Ignore; SC: Seriously Concerning.
ID Class Text
0-10 seconds
214 SI _USER_ *** i can’t get that link to work
251 SI _USER_ or, *** anyone from the *** Rookies all-female racers team
355 SI *** kill myself.. watching Drag Me To Hell
457 SI i’m afraid my ups might be dead *** making ticking noises
495 SI It’s too early to be awake *** got up 3 1/2 hours ago! *** never wake up before 8.
152 SC _USER_ *** never wanted to be dead til now…
187 SC _USER_ thanks…now *** me kill myself
336 SC feeling like death *** want to die
407 SC I *** die right now no one loves me
573 SC *** hate my life sometimes i just want to die
10-25 seconds
92 SI _USER_ Its a story about how success as *** columnist, *** helped create, returned him to alcohol & suicidal thoughts
216 SI _USER_ Thx for *your* part ***! Any time you reblog *** I instantly get 3-5x the activity on it I normally do!
359 SI Gosh back to *** from *** with a rather large BANG. Really don’t want to be here, much rather be sailing *** for lunch
506 SI Last paper ***! Lucky I’ve bought *** or I will be dead.
553 SI oh noesss *** is dieing im ganna kill myself!!! *** room with no phone fml.
81 SC _USER_ ugh. *** that one was awful. suicide always *** gets to me…
160 SC _USER_ *** so stupid! *** think I still can? I want to kill myself!
416 SC I need to go on suicide watch. ***, The Fumble, ***, Jose Mesa, ***, and now this… Where’s my razor blade?
417 SC I never have any one to talk to *** i hate my self *** kill myself if no one *** say anything to me on ***
589 SC So ***. I’m in pain. Sucks. That was *** the point. suicide an option?
Table 6. Samples for the DEPTWEET audio spectrogram analysis. Each sample has been masked to avoid a reverse search of each post. ND: Non-Depressed; SE: Severe.
ID Class Text
0-10 seconds
665 ND *** miss my sc I’m so depressed without it ☹️☹️
965 ND _USER_ _USER_ _USER_ Frustrated *** fan hai _URL_
1142 ND Me checking *** I hate *** Continues to check *** and then gets depressed
2054 ND *** so exhausted *** fighting to stay up until 8pm
3107 ND Do you feel frustrated *** on the simplest things? _USER_ …
3310 ND *** teacher is so tired of *** shit
1085 SE _USER_ sh000t me it would hurt less ***
2442 SE *** so lonely. *** going to hurt someone . #depression _USER_
4389 SE We *** the shit country. *** so depressed. _URL_
4552 SE *** no reason to live. *** I’ll just end it . #depression _USER_
10-25 seconds
159 ND _USER_ Neither. *** people who *** clinically depressed are going to be so regardless of their worldview. IMHO
1849 ND ofc *** days off im taking care of my newphew … im so tired i work every weekday *** takes forever to get home *** on my days off i babysit… i don’t even get paid. im exhausted
2896 ND Football: *** revive World Cup hopes, *** frustrated by *** _URL_
3382 ND *** sad of getting old it made us restless… *** so MAD i’m getting old it makes me reckless!!!
4158 ND I’ve *** my toenails off and split the nail bed - the pain has progressed over *** days to absolutely excruciating - so bad *** struggling to even walk. This week is going amazing
230 SE *** first guest: Me. *** self-sabotage and self-destruction.
1686 SE _USER_ Man, September was so hard *** watched my gma pass away, *** so much other stuff went wrong. I been depressed asf
2807 SE _USER_ _USER_ _USER_ _USER_ _USER_ I personally can’t *** 3 or4 died *** from either trauma or anxiety and *** those who took their own lives because of what happened
2919 SE *** get the hell out. so I’ll just end it . #depression _USER_
4838 SE *** thinking about suicide more and more *** I don’t want to. I don’t want *** that trauma on my kid. But it’s hard… *** suffering from depression *** 15 years… it’s a daily battle… I’m tired
Table 7. Samples for the IdenDep audio spectrogram analysis. Each sample has been masked to avoid a reverse search of each post. NDE: Non-Depressive; DE: Depressive.
ID Class Text
0-10 seconds
1324 NDE *** Sympathy gift ideas + ***
1427 NDE *** friend vlog
1701 NDE does anyone want to hear the story about *** ’beef’ between ***
1728 NDE TRUST by ” THE HIPSTERS ” (ft. *** and ***)
1754 NDE How to have a strong family *** products or services have helped *** family stay strong together?
172 DE It’s *** easier to fall back in than to fight *** it
655 DE *** friends are throwing a LAN party *** I wasn’t invited. *** only one who didn’t get an invitation.
904 DE *** feel bitter about everything. *** bitter about being bitter.
934 DE I’m sad I feel sad. *** I feel something.
1235 DE … I just want to crawl in a whole and cry ***
10-25 seconds
1377 NDE Having friends *** opposite sex *** in a relationship _URL_
1470 NDE *** introduce my girlfriend [18F] to my family My girlfriend lives in *** I live in *** *** introduce her to my mom but I don’t know how
1589 NDE 365 New Ways To Hug Your Love *** discover and post videos or pictures of New Ways To Hug in the new subreddit ***
1616 NDE With family being a main interest in your lives, what *** would you purchase *** to help the family to grow?
1824 NDE What Would You Do? Would you move away from your family *** to somewhere far where your kids would have a better education *** provide for your family better, like buying a house; *** moving from *** to the *** or ***?
132 DE Anyone else feel like everyone hates them? *** paranoia? *** the dark cloud over my head just gives off a shitty vibe *** people think I don’t like them and vice Versa.
505 DE That feeling when you hate who you *** but can’t *** change because you are so used to being like this for *** years. *** a shitty person. The thought of change seems impossible *** at this point.
605 DE Fuck me When you’re *** a piece of shit *** look at other girls and lie to you, while lying *** next to you. I’ll never be enough. Ever. For anyone. *** want to ducking die.
1027 DE Addicted to depression *** when I feel like *** self-loathing and depressive *** becoming less, I feel shit *** don’t feel depressed anymore. *** I want it to go away *** part of me wants to stay depressive and feel suicidal.
1154 DE Is it depression *** don’t want to build memories anymore? *** I get really nostalgic. *** I don’t want to get too attached to people *** just end up hurting in the future.

To assess the feasibility of the audio modality for mental health detection, we provide an illustrative visualisation of the audio embeddings, which are generated from the input text and learned via the Audio Spectrogram Transformer (AST). We apply Principal Component Analysis (PCA) to visualise the acquired audio embeddings together with their corresponding mental health class labels. To emphasise the distinguishability of the embeddings, we select samples from both the least and most concerning labels in each dataset, as shown in Figure 4. For each dataset, we group all the generated audio by duration into 0-to-10-second and 10-to-25-second groups (most generated audio clips are shorter than 25 seconds). For each of these two groups, we generate the corresponding spectrograms and randomly select ten audio samples per group to visualise the first two principal components after performing PCA. In Figure 4, we annotate each sample with the post ID and the audio duration for detailed comparison. Tables 5 to 7 contain de-identified and masked post contents for TwitSuicide, DEPTWEET, and IdenDep, respectively. SDCNL samples may be found in the Supplementary Material.
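The projection onto the first two principal components can be sketched with a small, dependency-free PCA based on power iteration. This is a generic illustration under stated assumptions: the function names and the toy vectors are hypothetical, and a numerical library would normally be used for real spectrogram features.

```python
import math
import random


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def _orthonormalise(v, basis):
    """Remove the components of v along each basis vector, then scale to unit length."""
    for b in basis:
        scale = _dot(v, b)
        v = [x - scale * y for x, y in zip(v, b)]
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def pca_2d(rows, n_iter=200, seed=0):
    """Project rows (equal-length float lists) onto their first two principal components."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centred = [[r[j] - means[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = _orthonormalise([rng.gauss(0, 1) for _ in range(dim)], components)
        for _ in range(n_iter):
            w = [_dot(row, v) for row in cov]
            if math.sqrt(_dot(w, w)) < 1e-12:
                break  # no variance left outside the components already found
            # Keep the iterate orthogonal to earlier components (deflation).
            v = _orthonormalise(w, components)
        components.append(v)
    return [[_dot(r, c) for c in components] for r in centred]


# Hypothetical 3-dimensional feature vectors standing in for flattened spectrograms.
vectors = [[2, 0, 0], [-2, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 0.1], [0, 0, -0.1]]
points = pca_2d(vectors)
```

Each entry of `points` is a two-dimensional coordinate that can be scattered and coloured by class label; the sign of each axis is arbitrary.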

The visualisation shows a noticeable separation between mental health classes in our audio embeddings for all four datasets. In datasets derived from Twitter, shorter audio samples display more pronounced distinctions between the most and least concerning classes, whereas in datasets from Reddit this separation becomes more evident in longer audio segments. This is likely because Twitter posts are generally shorter, whereas Reddit posts tend to be longer.

5.3. Effectiveness of Multimodal Multi-Teachers

Table 8. Ablation study using different combinations of teacher modalities. Class abbreviation definitions may be found in the Figure 2 caption. Bold face indicates best score while second best are underlined. A ✓indicates the addition of the emotion (Emo) and/or the audio (Aud) teacher/s. Highlighted rows show the best setup.
Text Emo Aud Overall Performance Breakdown F1 Scores
TwitSuicide Acc F1m F1w (SC) (PC) (SI)
✓ × × 58.94 47.50 56.13 16.13 56.55 69.81
✓ ✓ × 65.76 61.96 65.46 49.72 62.34 73.81
✓ × ✓ 63.79 58.94 63.02 44.30 61.76 70.74
✓ ✓ ✓ 61.21 59.64 61.23 54.17 59.50 65.26
DEPTWEET Acc F1m F1w (SE) (MO) (MI) (ND)
✓ × × 84.52 44.33 82.40 40.00 13.11 31.28 92.90
✓ ✓ × 84.03 46.43 83.09 41.38 12.31 39.07 92.95
✓ × ✓ 83.06 36.00 81.69 00.00 15.38 36.26 92.33
✓ ✓ ✓ 82.77 46.20 82.61 26.09 34.34 32.32 92.04
IdenDep Acc F1m F1w (DE) (NDE)
✓ × × 92.32 90.67 92.26 94.60 86.74
✓ ✓ × 93.86 92.47 93.78 85.71 89.23
✓ × ✓ 94.30 93.10 94.26 95.97 90.23
✓ ✓ ✓ 93.92 92.58 93.85 95.73 89.43
SDCNL Acc F1m F1w (SUI) (DEP)
✓ × × 75.20 75.07 75.10 76.85 73.30
✓ ✓ × 72.82 72.45 72.51 75.65 69.25
✓ × ✓ 76.52 76.50 76.51 77.12 75.88
✓ ✓ ✓ 75.20 74.84 74.90 77.83 71.86
Figure 5. Distribution of multi-label emotion class labels for each dataset.

To examine the efficacy of each teacher modality and their combinations, we evaluate and explore the performance by adding an extra emotion-based teacher, an audio-based teacher, or both alongside the text-based teacher. The results are presented in Table 8.

In general, the multi-teacher structure outperforms the use of a single text-based teacher, although the effectiveness of different modalities varies across datasets. For the Twitter-based datasets, applying the emotion-based and text-based teachers together achieves the best results. In contrast, for Reddit-based datasets, the combination of audio- and text-based teachers performs better. This difference may be attributed to the longer posts in Reddit-based datasets, resulting in longer audio (Table 1) that contains more acoustic information beneficial for the audio-based teacher.

Moreover, due to the lengthier nature of posts in Reddit-based datasets, more SenticNet lexicon tokens are likely to be matched compared to Twitter-based datasets. This results in a higher number of generated emotion labels during the learning process of the multi-label emotion-based teacher. In Figure 5, we compare the count of generated multi-label emotion classes utilised to train the emotion-based teacher across all four datasets. It is evident that a greater proportion of posts in Reddit-based datasets match all seven emotion labels (shown in Section 3.1.2) compared to Twitter-based datasets. This potential increase in the number of matching emotion labels may present challenges in distinguishing between different emotions during the training of the emotion-based teacher, potentially impacting downstream mental health classification, especially for ambiguous classes such as Suicide (SUI) and Depression (DEP) in the SDCNL dataset.

We can conclude that using teachers of multiple modalities generally helps detect mental health conditions. These findings also suggest that the effectiveness of each modality varies across datasets with distinct characteristics, offering valuable insights into selecting suitable modalities for improved performance in future scenarios.
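The modality selection discussed in this section can be illustrated by picking, for each dataset, the teacher combination with the highest weighted F1 in Table 8. The dictionary layout and names below are illustrative assumptions; the mapping of table rows to combinations follows the row order of Table 8, cross-checked against the labelled rows of Table 4.

```python
# Weighted F1 (F1w) per teacher combination, copied from Table 8.
TABLE_8_F1W = {
    "TwitSuicide": {"Text": 56.13, "Text+Emo": 65.46, "Text+Aud": 63.02, "Text+Emo+Aud": 61.23},
    "DEPTWEET": {"Text": 82.40, "Text+Emo": 83.09, "Text+Aud": 81.69, "Text+Emo+Aud": 82.61},
    "IdenDep": {"Text": 92.26, "Text+Emo": 93.78, "Text+Aud": 94.26, "Text+Emo+Aud": 93.85},
    "SDCNL": {"Text": 75.10, "Text+Emo": 72.51, "Text+Aud": 76.51, "Text+Emo+Aud": 74.90},
}

# For each dataset, keep the combination with the highest weighted F1.
best_combination = {
    dataset: max(scores, key=scores.get)
    for dataset, scores in TABLE_8_F1W.items()
}
```

Weighted F1 is used as the selection criterion here because it is the metric optimised during hyperparameter tuning; a different criterion, such as macro F1, could in principle select a different combination.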

5.4. Impact of Text-based Teachers

Table 9. Ablation study using different PLMs for the text-based teacher. We report results using the best-performing teacher modality combination in Table 8 and change only the text-based teacher. Class abbreviation definitions may be found in the Figure 2 caption. Bold face indicates the best score, while the second best is underlined.
Overall Performance Breakdown F1 Scores
TwitSuicide Acc F1m F1w (SC) (PC) (SI)
BERT 65.76 61.96 65.46 49.72 62.34 73.81
RoBERTa 65.15 61.67 64.70 51.19 61.43 72.40
MentalBERT 63.79 61.08 63.67 51.65 64.08 67.52
ClinicalBERT 65.30 63.23 65.25 56.38 62.18 71.13
DEPTWEET Acc F1m F1w (SE) (MO) (MI) (ND)
BERT 83.84 39.80 83.84 0.00 22.22 44.05 92.93
RoBERTa 82.67 35.99 81.80 0.00 14.29 37.23 92.44
MentalBERT 84.03 46.43 83.09 41.38 12.31 39.07 92.95
ClinicalBERT 83.54 32.29 80.71 0.00 0.00 37.11 92.05
IdenDep Acc F1m F1w (DE) (NDE)
BERT 94.30 93.10 94.26 95.97 90.23
RoBERTa 94.13 92.89 94.09 95.86 89.93
MentalBERT 93.21 91.71 93.14 95.24 88.17
ClinicalBERT 94.24 92.95 94.17 95.97 89.94
SDCNL Acc F1m F1w (SUI) (DEP)
BERT 76.52 76.50 76.51 77.12 75.88
RoBERTa 75.20 75.03 75.07 77.07 72.99
MentalBERT 73.61 73.61 73.61 73.40 73.82
ClinicalBERT 75.46 75.46 75.45 75.07 75.84

We compare the impact of various pre-trained language models on the effectiveness of the text-based teacher. According to the results in Table 9, BERT performs the best in terms of weighted F1 across all four datasets. However, the domain-specific PLMs, such as MentalBERT or ClinicalBERT, perform better for most of the more concerning classes. For instance, ClinicalBERT outperforms BERT by 6.66 percentage points in the F1 score for the Strongly Concerning (SC) class in the TwitSuicide dataset, and MentalBERT achieves a 41.38 F1 score for the Severe (SE) class in the DEPTWEET dataset, whereas all other language models fail to recognise this class. To ensure optimal performance of the text-based teacher, we specifically employ MentalBERT for experiments on the DEPTWEET dataset, while BERT is used for the other three datasets. Nonetheless, the overall performance of the text-based teacher is not significantly impacted by the choice of pre-trained language model. Notably, a larger share of the performance gain stems from the inclusion of different modalities, as discussed in the previous sections.

5.5. Impact of Student Model Inputs

Table 10. Ablation study using different combinations of input modalities to the student model. Bold face indicates best score while second best are underlined. A ✓indicates the addition of the emotion-based (Emo) and/or the audio-based (Aud) input features. Highlighted rows show our original proposed student setup. VT: randomly initialised vanilla transformer.
Text Emo Aud Overall Performance Breakdown F1 Scores
TwitSuicide Acc F1m F1w (SC) (PC) (SI)
BERT 51.67 45.43 50.46 26.95 52.00 57.34
BERT × 50.91 44.24 48.58 29.58 41.42 61.71
BERT × 48.64 33.87 43.31 0.00 41.07 60.55
BERT × × 65.76 61.96 65.46 49.72 62.34 73.81
VT × × 46.36 32.72 41.77 0.00 41.09 57.06
DEPTWEET Acc F1m F1w (SE) (MO) (MI) (ND)
BERT 85.49 34.75 81.24 0.00 20.59 25.86 92.54
BERT × 83.84 32.15 80.81 0.00 0.00 36.36 92.55
BERT × 84.23 22.86 77.01 0.00 0.00 0.00 91.44
BERT × × 84.03 46.43 83.09 41.38 12.31 39.07 92.95
VT × × 84.23 22.86 77.01 0.00 0.00 0.00 91.44
IdenDep Acc F1m F1w (DE) (NDE)
BERT 91.58 89.99 91.60 93.98 86.00
BERT × 91.09 89.21 91.03 93.72 84.70
BERT × 91.75 89.85 91.62 94.23 85.47
BERT × × 94.30 93.10 94.26 95.97 90.23
VT × × 75.77 61.25 70.85 84.97 37.54
SDCNL Acc F1m F1w (SUI) (DEP)
BERT 67.55 67.54 67.55 67.89 67.20
BERT × 67.81 67.80 67.80 67.38 68.23
BERT × 68.34 68.32 68.31 67.57 69.07
BERT × × 76.52 76.50 76.51 77.12 75.88
VT × × 67.02 66.94 66.97 68.51 65.37

We examine different combinations of multimodal inputs for the student model in Table 10 to identify the optimal student input for knowledge distillation. In this setting, we concatenate emotion embeddings, audio embeddings, or both with the textual post embeddings from pre-trained BERT and pass the result to the transformer layer after a linear projection. We also compare a randomly initialised vanilla transformer against pre-trained BERT as the student model. The results indicate that unimodal textual post inputs outperform the concatenation of multimodal inputs for the student model. Moreover, pre-trained BERT yields better results than the randomly initialised vanilla transformer across all datasets. These outcomes underscore the effectiveness of the multi-aspect knowledge acquired from the multiple teachers, which efficiently guides the student to robust performance with only textual inputs.

5.6. Hyperparameter Testing

We further investigate the impact of different patch size values for the audio-based teacher model, the Audio Spectrogram Transformer (AST) (Gong et al., 2021), while maintaining a consistent setup for the student model. Figure 6 shows each dataset's weighted F1 score for each patch size value tested. The TwitSuicide, DEPTWEET, and IdenDep datasets show relatively stable performance across patch sizes from 2 to 64; for SDCNL, however, a performance improvement may be achieved using a patch size of 32. This improvement may be due to the higher variance in audio duration of the outliers in the SDCNL dataset compared to the other three datasets, as shown in Figure 3. Despite being shorter in average length and duration than the other Reddit-based dataset, SDCNL has audio samples with longer durations, which may have benefited from a patch size of 32. However, a sharp decline in performance could be expected when the patch size is increased to 64.
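To give a sense of what changing the patch size means for the audio-based teacher, the sketch below counts how many square patches are cut from a spectrogram for each tested size. The helper name, the 128 mel bins, the 1024 time frames, and the non-overlapping default stride are assumptions made for illustration and are not taken from our experimental configuration.

```python
def count_patches(n_mels, n_frames, patch_size, stride=None):
    """Number of square patches cut from an n_mels x n_frames spectrogram.

    Assumption of this sketch: patches do not overlap unless a smaller
    stride is passed explicitly.
    """
    stride = patch_size if stride is None else stride
    if patch_size > n_mels or patch_size > n_frames:
        return 0
    per_freq = (n_mels - patch_size) // stride + 1
    per_time = (n_frames - patch_size) // stride + 1
    return per_freq * per_time


# Hypothetical input: 128 mel bins and 1024 frames (about 10 s at a 10 ms hop).
patch_counts = {size: count_patches(128, 1024, size) for size in (2, 4, 8, 16, 32, 64)}
```

Larger patches shorten the token sequence seen by the transformer considerably, which is one plausible reason why performance could drop at the largest size for short clips.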

Figure 6. Parameter study for the audio-based teacher model. Panels: (a) TwitSuicide, (b) DEPTWEET, (c) IdenDep, (d) SDCNL. Ave: average audio duration for the dataset.

6. Conclusion

In conclusion, our study introduces 3M-Health, a multimodal multi-teacher knowledge distillation model designed for mental health detection, and evaluates it comprehensively on four social media datasets. Our experiments demonstrate that the multimodal approach outperforms unimodal counterparts, with the choice of modalities influencing performance across diverse datasets. Notably, the incorporation of audio-based information proves particularly beneficial for post-based mental health detection on Reddit-based datasets, emphasising the importance of selecting modalities according to the nature of the data. Overall, our work contributes valuable insights into the nuanced dynamics of multimodal knowledge distillation for mental health detection, offering a promising avenue for future research in this critical domain.

References

  • Ansari et al. (2021) Gunjan Ansari, Muskan Garg, and Chandni Saxena. 2021. Data Augmentation for Mental Health Classification on Social Media. In Proceedings of the 18th International Conference on Natural Language Processing (ICON). NLP Association of India (NLPAI), National Institute of Technology Silchar, Silchar, India, 152–161. https://aclanthology.org/2021.icon-main.19
  • Ao et al. (2022) Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, and Furu Wei. 2022. SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, Dublin, Ireland, 5723–5738. https://doi.org/10.18653/v1/2022.acl-long.393
  • Aragón et al. (2023) Mario Ezra Aragón, Adrian Pastor López-Monroy, Luis Carlos González-Gurrola, and Manuel Montes-y Gómez. 2023. Detecting Mental Disorders in Social Media Through Emotional Patterns - The Case of Anorexia and Depression. IEEE Transactions on Affective Computing 14, 1 (2023), 211–222. https://doi.org/10.1109/TAFFC.2021.3075638
  • Benton et al. (2017) Adrian Benton, Glen Coppersmith, and Mark Dredze. 2017. Ethical Research Protocols for Social Media Health Research. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, Dirk Hovy, Shannon Spruit, Margaret Mitchell, Emily M. Bender, Michael Strube, and Hanna Wallach (Eds.). Association for Computational Linguistics, Valencia, Spain, 94–102. https://doi.org/10.18653/v1/W17-1612
  • Cabral et al. (2024) Rina Carines Cabral, Soyeon Caren Han, Josiah Poon, and Goran Nenadic. 2024. MM-EMOG: Multi-Label Emotion Graph Representation for Mental Health Classification on Social Media. Robotics 13, 3 (2024), 53.
  • Cambria et al. (2022) Erik Cambria, Qian Liu, Sergio Decherchi, Frank Xing, and Kenneth Kwok. 2022. SenticNet 7: A commonsense-based neurosymbolic AI framework for explainable sentiment analysis. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. 3829–3839.
  • Cao et al. (2022) Lei Cao, Huijun Zhang, and Ling Feng. 2022. Building and Using Personal Knowledge Graph to Improve Suicidal Ideation Detection on Social Media. IEEE Transactions on Multimedia 24 (2022), 87–102. https://doi.org/10.1109/TMM.2020.3046867
  • Cao et al. (2019) Lei Cao, Huijun Zhang, Ling Feng, Zihan Wei, Xin Wang, Ningyun Li, and Xiaohao He. 2019. Latent Suicide Risk Detection on Microblog via Suicide-Oriented Word Embeddings and Layered Attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 1718–1728. https://doi.org/10.18653/v1/D19-1181
  • Chen et al. (2021) Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. 2021. Cross-layer distillation with semantic calibration. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 7028–7036.
  • Collins et al. (2023) Hanne K Collins, Julia A Minson, Ariella Kristal, and Alison Wood Brooks. 2023. Conveying and detecting listening during live conversation. Journal of Experimental Psychology: General (2023).
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Garg (2023) Muskan Garg. 2023. Mental health analysis in social media posts: a survey. Archives of Computational Methods in Engineering 30, 3 (2023), 1819–1842.
  • Gong et al. (2021) Yuan Gong, Yu-An Chung, and James Glass. 2021. AST: Audio Spectrogram Transformer. In Proc. Interspeech 2021. 571–575. https://doi.org/10.21437/Interspeech.2021-698
  • Han et al. (2022) Soyeon Caren Han, Zihan Yuan, Kunze Wang, Siqu Long, and Josiah Poon. 2022. Understanding Graph Convolutional Networks for Text Classification. arXiv preprint arXiv:2203.16060 (2022).
  • Haque et al. (2021) Ayaan Haque, Viraaj Reddi, and Tyler Giallanza. 2021. Deep Learning for Suicide and Depression Identification with Unsupervised Label Correction. In Artificial Neural Networks and Machine Learning – ICANN 2021, Igor Farkaš, Paolo Masulli, Sebastian Otte, and Stefan Wermter (Eds.). Springer International Publishing, Cham, 436–447.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
  • Ito and Johnson (2017) Keith Ito and Linda Johnson. 2017. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/.
  • Iyer et al. (2022) Ravi Iyer, Maja Nedeljkovic, and Denny Meyer. 2022. Using Vocal Characteristics To Classify Psychological Distress in Adult Helpline Callers: Retrospective Observational Study. JMIR Formative Research 6, 12 (Dec. 2022), e42249. https://doi.org/10.2196/42249
  • Ji et al. (2021) Shaoxiong Ji, Tianlin Zhang, Luna Ansari, Jie Fu, Prayag Tiwari, and Erik Cambria. 2021. MentalBERT: Publicly available pretrained language models for mental healthcare. arXiv preprint arXiv:2110.15621 (2021).
  • Jiao et al. (2020) Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for Natural Language Understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 4163–4174. https://doi.org/10.18653/v1/2020.findings-emnlp.372
  • Kabir et al. (2023) Mohsinul Kabir, Tasnim Ahmed, Md. Bakhtiar Hasan, Md Tahmid Rahman Laskar, Tarun Kumar Joarder, Hasan Mahmud, and Kamrul Hasan. 2023. DEPTWEET: A typology for social media texts to detect depression severities. Computers in Human Behavior 139 (2023), 107503. https://doi.org/10.1016/j.chb.2022.107503
  • Kipf and Welling (2016) Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. https://doi.org/10.48550/ARXIV.1609.02907
  • Lara et al. (2021) Juan S. Lara, Mario Ezra Aragón, Fabio A. González, and Manuel Montes-y Gómez. 2021. Deep Bag-of-Sub-Emotions for Depression Detection in Social Media. In Text, Speech, and Dialogue, Kamil Ekštein, František Pártl, and Miloslav Konopík (Eds.). Springer International Publishing, Cham, 60–72.
  • Li et al. (2020) Kang Li, Lequan Yu, Shujun Wang, and Pheng-Ann Heng. 2020. Towards cross-modality medical image segmentation with online mutual knowledge distillation. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 775–783.
  • Lin et al. (2020) Chenhao Lin, Pengwei Hu, Hui Su, Shaochun Li, Jing Mei, Jie Zhou, and Henry Leung. 2020. SenseMood: Depression Detection on Social Media. In Proceedings of the 2020 International Conference on Multimedia Retrieval (Dublin, Ireland) (ICMR ’20). Association for Computing Machinery, New York, NY, USA, 407–411. https://doi.org/10.1145/3372278.3391932
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Long et al. (2022) Siqu Long, Rina Cabral, Josiah Poon, and Soyeon Caren Han. 2022. A Quantitative and Qualitative Analysis of Suicide Ideation Detection using Deep Learning. https://doi.org/10.48550/ARXIV.2206.08673
  • Mirzadeh et al. (2020) Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. 2020. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 5191–5198.
  • Ni et al. (2022) Jianyuan Ni, Raunak Sarbajna, Yang Liu, Anne HH Ngu, and Yan Yan. 2022. Cross-modal knowledge distillation for Vision-to-Sensor action recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4448–4452.
  • O’Dea et al. (2015) Bridianne O’Dea, Stephen Wan, Philip J. Batterham, Alison L. Calear, Cecile Paris, and Helen Christensen. 2015. Detecting suicidality on Twitter. Internet Interventions 2, 2 (2015), 183–188. https://doi.org/10.1016/j.invent.2015.03.005
  • Pham et al. (2023) Cuong Pham, Tuan Hoang, and Thanh-Toan Do. 2023. Collaborative Multi-Teacher Knowledge Distillation for Learning Low Bit-width Deep Neural Networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 6435–6443.
  • Pirina and Çöltekin (2018) Inna Pirina and Çağrı Çöltekin. 2018. Identifying Depression on Reddit: The Effect of Training Data. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task. Association for Computational Linguistics, Brussels, Belgium, 9–12. https://doi.org/10.18653/v1/W18-5903
  • Ravanelli et al. (2021) Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, et al. 2021. SpeechBrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624 (2021).
  • Ren et al. (2021) Lu Ren, Hongfei Lin, Bo Xu, Shaowu Zhang, Liang Yang, and Shichang Sun. 2021. Depression Detection on Reddit With an Emotion-Based Attention Network: Algorithm Development and Validation. JMIR Med Inform 9, 7 (16 Jul 2021), e28754. https://doi.org/10.2196/28754
  • Sawhney et al. (2021a) Ramit Sawhney, Harshit Joshi, Lucie Flek, and Rajiv Ratn Shah. 2021a. PHASE: Learning Emotional Phase-aware Representations for Suicide Ideation Detection on Social Media. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, Online, 2415–2428. https://doi.org/10.18653/v1/2021.eacl-main.205
  • Sawhney et al. (2021b) Ramit Sawhney, Harshit Joshi, Saumya Gandhi, and Rajiv Ratn Shah. 2021b. Towards Ordinal Suicide Ideation Detection on Social Media. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (Virtual Event, Israel) (WSDM ’21). Association for Computing Machinery, New York, NY, USA, 22–30. https://doi.org/10.1145/3437963.3441805
  • Shen et al. (2018) Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. 2018. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4779–4783. https://doi.org/10.1109/ICASSP.2018.8461368
  • Tadesse et al. (2019) Michael M. Tadesse, Hongfei Lin, Bo Xu, and Liang Yang. 2019. Detection of Depression-Related Posts in Reddit Social Media Forum. IEEE Access 7 (2019), 44883–44893. https://doi.org/10.1109/ACCESS.2019.2909180
  • Tan et al. (2018) Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2018. Multilingual Neural Machine Translation with Knowledge Distillation. In International Conference on Learning Representations.
  • Vongkulbhisal et al. (2019) Jayakorn Vongkulbhisal, Phongtharin Vinayavekhin, and Marco Visentini-Scarzanella. 2019. Unifying heterogeneous classifiers with distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3175–3184.
  • Wang et al. (2023) Guangyu Wang, Xiaohong Liu, Zhen Ying, Guoxing Yang, Zhiwei Chen, Zhiwen Liu, Min Zhang, Hongmei Yan, Yuxing Lu, Yuanxu Gao, et al. 2023. Optimized glycemic control of type 2 diabetes with reinforcement learning: A proof-of-concept trial. Nature Medicine 29, 10 (2023), 2633–2642. https://doi.org/10.1038/s41591-023-02552-9
  • Wu et al. (2021) Chuhan Wu, Fangzhao Wu, and Yongfeng Huang. 2021. One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for Computational Linguistics, Online, 4408–4413. https://doi.org/10.18653/v1/2021.findings-acl.387
  • Yang and Xu (2021) Lehan Yang and Kele Xu. 2021. Cross modality knowledge distillation for multi-modal aerial view object classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 382–387.
  • Zen et al. (2019) H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu. 2019. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. In Proc. Interspeech. https://doi.org/10.21437/Interspeech.2019-2441
  • Zogan et al. (2022) Hamad Zogan, Imran Razzak, Xianzhi Wang, Shoaib Jameel, and Guandong Xu. 2022. Explainable depression detection with multi-aspect features using a hybrid deep learning model on social media. World Wide Web 25, 1 (2022), 281–304.