3M-Health: Multimodal Multi-Teacher Knowledge Distillation for Mental Health Detection

Rina Carines Cabral (University of Sydney, Australia; ORCID 0000-0003-3076-0521), Siwen Luo (University of Western Australia, Australia; ORCID 0000-0003-0480-1991), Josiah Poon (University of Sydney, Australia; ORCID 0000-0003-3371-8628) and Soyeon Caren Han (University of Melbourne, Australia; ORCID 0000-0002-1948-6819)

(2024)

Abstract.

Mental health classification is of paramount importance in contemporary society, where digital platforms serve as crucial sources for monitoring individuals’ well-being. However, existing social media mental health datasets primarily consist of text-only samples, potentially limiting the efficacy of models trained on such data. Recognising that humans utilise cross-modal information to comprehend complex situations or issues, we present a novel approach to address the limitations of current methodologies. In this work, we introduce a Multimodal and Multi-Teacher Knowledge Distillation model for Mental Health Classification, leveraging insights from cross-modal human understanding. Unlike conventional approaches that often rely on simple concatenation to integrate diverse features, our model addresses the challenge of appropriately representing inputs of varying natures (e.g., texts and sounds). To mitigate the computational complexity associated with integrating all features into a single model, we employ a multimodal and multi-teacher architecture. By distributing the learning process across multiple teachers, each specialising in a particular feature extraction aspect, we enhance the overall mental health classification performance. Through experimental validation, we demonstrate the efficacy of our model in achieving improved performance. All relevant code will be made available upon publication.

mental health classification, knowledge distillation, multimodal

Copyright: ACM licensed. Journal year: 2024. Conference: 33rd ACM International Conference on Information and Knowledge Management; October 21-25, 2024; Boise, Idaho. CCS concepts: Information systems → Decision support systems; Computing methodologies → Information extraction.

1. Introduction

Mental health is a critical aspect of individual well-being, influencing both personal lives and societal structures (Garg, 2023). Despite advancements in mental healthcare, not everyone with mental health concerns actively seeks professional help. The widespread use of social media platforms, such as Twitter and Reddit, has opened avenues for detecting mental health issues by analysing text-oriented posts. This shift towards online expression has prompted research into text-based mental health classification, focusing on identifying the presence and categories of mental health concerns within social media posts. Recent studies in mental health classification from social media content have embraced diverse components, ranging from historical posts and conversation trees to social graphs and user metadata (Cao et al., 2019, 2022; Lin et al., 2020; Sawhney et al., 2021a, b). However, the availability of these additional sources varies due to data privacy restrictions or user preferences, introducing challenges in research and system reproducibility. In light of these challenges, our research addresses the limitations of existing methodologies by focusing on the analysis of text-only social media posts, a fundamental and universally available component. While semantic pre-trained textual embeddings from text-only input may capture explicit emotional words related to mental health, they may fall short in capturing less explicit emotions, limiting their robustness. For instance, some textual posts may lack explicit emotional language yet imply an unhealthy mental state. Recognising the potential shortcomings of text-only datasets, we introduce a novel approach to mental health classification through a Multimodal and Multi-Teacher Knowledge Distillation Model. Inspired by human comprehension strategies that involve multimodal information integration, our model leverages insights from multimodal human understanding to enhance the efficacy of mental health risk detection. 
Our approach introduces a new acoustic modality feature generated from the original textual posts, motivated by the proven effectiveness of vocal biomarkers in indicating psychological distress and other medical conditions (Iyer et al., 2022). This creates a new modality from text-only input for unimodal text-based mental health risk detection. Simultaneously, we incorporate emotion-enriched features as additional information. Instead of integrating all modalities into one model, we employ a multimodal and multi-teacher architecture to address the computational complexity of integrating diverse features. This approach distributes the learning process across multiple teachers, each specialising in a particular feature extraction aspect. To the best of our knowledge, there have been no attempts to create a new modality from text-only input for unimodal text-based mental health risk detection tasks. Additionally, we propose a new multimodal knowledge distillation model for the mental health risk detection domain.

2. Related Works

2.1. Mental Health Classification

Recent studies in mental health classification from social media content have incorporated diverse social media components. These components encompass various elements, including historical posts, conversation trees, social and interaction graphs, user or post metadata information, and profile pictures or posted images (Cao et al., 2019, 2022; Lin et al., 2020; Sawhney et al., 2021a, b). However, those additional sources will not always be available in the dataset due to data privacy restrictions or user preferences. This complicates research reproducibility since each study selects features based on what social media components are available to them. Hence, our research focuses on exploring mental health detection by analysing only the textual content of social media posts, the one component that every text-based post related to mental health contains. Based on the textual aspect, existing studies have worked on frequency- or score-based emotion features (Aragón et al., 2023; Zogan et al., 2022). More recent works use contextual embeddings fine-tuned on emotion-based tasks as the emotion features for mental health classification (Lara et al., 2021; Sawhney et al., 2021a). These studies mainly focus on identifying and assigning a single, fixed emotion type to each word or to the entire textual content. In contrast, our research highlights the complexity of human emotions, wherein a single word can be associated with multiple types of emotions, by integrating emotion-enriched features generated through a multi-label, corpus-based representation learning framework. In addition, we propose a new way to include acoustic features generated from the original textual posts in this task, motivated by the proven effectiveness of vocal biomarkers in indicating psychological distress and other medical conditions (Iyer et al., 2022). These acoustic features are processed alongside the emotion-enriched features as additional information. 
To the best of our knowledge, there have been no attempts to create a new modality from text-only input for unimodal text-based mental health risk detection tasks.

2.2. Multi-teacher Knowledge Distillation

To integrate the multimodal knowledge efficiently, we design our model with a knowledge distillation idea to compress a complex and large multimodal integration model into a smaller and simpler one while still retaining the accuracy and performance of the resultant model. Knowledge distillation (Hinton et al., 2015) involves transferring knowledge from a teacher model to a student model, commonly applied to compress large models by mapping intermediate layer outputs (Chen et al., 2021; Jiao et al., 2020) or minimising KL-divergence in class distribution (Mirzadeh et al., 2020). Traditionally, knowledge distillation is used within the same modality. However, recent approaches extend it to different modalities (Yang and Xu, 2021; Ni et al., 2022; Li et al., 2020). Some studies explore collaborative learning with multiple teachers for improved compression, such as (Wu et al., 2021) and (Pham et al., 2023) in language and vision models. (Tan et al., 2018) focuses on multilingual language translation, and (Vongkulbhisal et al., 2019) applies multi-teacher knowledge distillation to unify classifiers trained on distinct data sources. Inspired by this, we propose a new Multimodal Multi-Teacher Knowledge Distillation framework for mental health risk detection.

Figure 1. The overall architecture of 3M-Health: Multimodal Multi-teacher Knowledge Distillation for Mental Health Detection. Note that the teacher models are fine-tuned using each mental health dataset.

3. 3M-Health

In this section, we introduce the details of our Multimodal Multi-teacher knowledge distillation model for Mental Health detection, 3M-Health. Figure 1 illustrates the overall architecture design. This model consists of three distinct teacher models, each focusing on different modalities to independently learn diverse aspects of features crucial for interpreting mental health-related posts. The acquired features from multimodal teachers serve as a valuable source of knowledge for instructing the student model by utilising the average output distribution of the teacher models as soft targets. Note that we introduce three essential multimodal teacher models for mental health risk detection, including 1) a text-based teacher for understanding semantic aspects from input texts, 2) an emotion-based teacher for interpreting emotion aspects from input texts, and 3) an audio-based teacher for discerning emotions conveyed through audio sounds.

3.1. Multimodal Multi-Teacher Construction

This section articulates each teacher model’s objective and construction process. Teacher fine-tuning can be found in Section 3.2.

3.1.1. Text-based Teacher

The text-based teacher aims to teach contextual semantic comprehension of mental health-related textual posts. We leverage pre-trained large language models (PLMs) since contextualised embeddings from PLMs represent different meanings based on the context (e.g. blue denotes a colour in one context but a gloomy mood in an emotional one). Moreover, some words take on a different, even opposite, connotation in the medical domain (e.g., positive usually means something good, but in a medical context it often refers to the presence of a specific condition, which is typically not a desirable outcome). Inspired by this, we explore several general PLMs (BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019)) and medical/mental-health specific PLMs (MentalBERT (Ji et al., 2021) and ClinicalBERT (Wang et al., 2023)) in the experiment.

3.1.2. Emotion-based Teacher

The emotion-based teacher aims to teach emotional aspects from the input text of mental health-related posts. It is initialised with representations derived from a graph-based model (Cabral et al., 2024), which produces emotion-enriched word representations by thoroughly incorporating global and local relationships among posts and all the words within those posts. To do so, we first obtain a multi-label emotion class indicating anger, disgust, fear, sadness, surprise, negative, and other (positive sentiments and emotions are grouped into other to focus on the different negative emotions in mental health-related text) for each post using the SenticNet7 lexicon (Cambria et al., 2022; https://sentic.net/downloads/), mapping identified words to their corresponding emotion types. This emotion lexicon consists of terms $K=\{k_{1},...,k_{q}\}$, each associated with one or more emotion types from $EM=\{em_{1},...,em_{r}\}$. For each word $w_{i}$ in a post with words $W=\{w_{1},...,w_{p}\}$, we assign the emotion set $EM_{k_{j}}$ to $w_{i}$ whenever $w_{i}=k_{j}$ for some $k_{j}\in K$. Consequently, each post $d$ is associated with a multi-label class $EM_{d}=EM_{w_{1}}\cup EM_{w_{2}}\cup\dots\cup EM_{w_{p}}$. Subsequently, we construct a graph $G=(V,E,A)$ representing all posts and their word tokens, where $V$ is the set of all post nodes and token nodes; tokens are obtained through wordpiece tokenisation with emoticon preservation (to further preserve and integrate emotions in the posts, emoticons and emojis are added to the tokeniser vocabulary). Here, $E$ encompasses token-token edges $E_{w_{i},w_{j}}$, token-post edges $E_{w_{i},d_{j}}$, and post-post edges $E_{d_{i},d_{j}}$, while $A$ specifies weights between related nodes (Han et al., 2022). Post node and token node representations are initialised with the [CLS] embedding and the minimum of contextualised token embeddings from pre-trained BERT word embeddings, respectively. 
Edge values are determined by Pointwise Mutual Information (PMI) for $E_{w_{i},w_{j}}$, Term Frequency-Inverse Document Frequency (TF-IDF) for $E_{w_{i},d_{j}}$, and Jaccard similarity for $E_{d_{i},d_{j}}$. Utilising these initialised representations and edge values, a two-layer Graph Convolutional Neural Network (GCN) (Kipf and Welling, 2016) with ReLU activation is trained on the multi-label emotion classification task using the SenticNet7-derived labels. The updated second-layer hidden states are extracted and used as initial weights for fine-tuning BERT on the same multi-label emotion classification task to further comprehend the associated emotions in the posts. The updated word embeddings are extracted as the multi-emotion contextual embeddings.
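The lexicon-based labelling and the post-post edge weight described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy lexicon entries are invented (they are not SenticNet7 entries), tokens are lower-cased whitespace tokens rather than wordpieces, the fallback to the other class for posts without lexicon hits is an assumption, and computing Jaccard similarity over the posts' token sets is likewise an assumption, since the text does not state which sets are compared.

```python
from typing import Dict, List, Set

# Toy lexicon: entries are invented for illustration, NOT taken from SenticNet7.
TOY_LEXICON: Dict[str, Set[str]] = {
    "hate": {"anger", "disgust", "negative"},
    "alone": {"sadness", "negative"},
    "scared": {"fear", "negative"},
}


def post_emotions(tokens: List[str], lexicon: Dict[str, Set[str]]) -> Set[str]:
    """Union of the emotion sets of every token found in the lexicon (EM_d)."""
    labels: Set[str] = set()
    for token in tokens:
        labels |= lexicon.get(token.lower(), set())
    # Assumption: a post without any lexicon hit falls into the catch-all class.
    return labels or {"other"}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


post_a = "I hate being alone".split()
post_b = "so scared and alone".split()
labels_a = post_emotions(post_a, TOY_LEXICON)
# Assumption: the post-post edge compares the two posts' token sets.
edge_ab = jaccard({t.lower() for t in post_a}, {t.lower() for t in post_b})
```

Here `labels_a` contains anger, disgust, negative and sadness, illustrating how a single post can carry several emotion labels at once.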

3.1.3. Audio-based Teacher

Existing social media mental health datasets primarily consist of text-only samples, so the two teachers described above are based purely on text. To address the need for a more comprehensive understanding of complex mental health and emotional contexts, we propose the integration of multimodal information. According to research (Collins et al., 2023), individuals can more accurately interpret the emotions of others through listening rather than observing facial expressions/body language or reading written text. Drawing inspiration from this insight, we introduce an audio-based teacher to enhance knowledge distillation, enabling the interpretation of emotions in mental health posts through sound. To achieve this, we first employ Bark (https://github.com/suno-ai/bark), a pre-trained text-to-audio model, to generate corresponding audio for each post, as Bark can capture emotional sounds detected from the text (e.g. [laughs], [gasps] and “…” for hesitations); Section 4.2 gives a theoretical and practical comparison of text-to-audio generation APIs and a list of Bark’s sound cues. Note that Bark can generate only 13 seconds of audio per generation. Hence, we tokenise each post at the sentence level, generate audio for each sentence, and then aggregate these audio segments into a complete audio representation for the entire post. Particularly long sentences or texts that do not have punctuation are further split into chunks of at most 45 tokens.
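The segmentation step before audio generation can be sketched as below. This is only an illustration under stated assumptions: sentence boundaries are detected with a simple regular expression and tokens are whitespace-separated words, whereas the paper does not specify the sentence splitter or tokeniser used; `split_post` is a hypothetical helper name. The audio generation call itself is omitted.

```python
import re
from typing import List

MAX_TOKENS = 45  # maximum chunk size stated in the text


def split_post(post: str, max_tokens: int = MAX_TOKENS) -> List[str]:
    """Split a post into sentence-level segments of at most max_tokens tokens."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", post.strip())
    segments: List[str] = []
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenisation
        # Long or unpunctuated text is cut into consecutive chunks.
        for start in range(0, len(tokens), max_tokens):
            segments.append(" ".join(tokens[start:start + max_tokens]))
    return segments


segments = split_post("I am tired. Really tired!")
long_segments = split_post(" ".join(["word"] * 100))
```

Each returned segment would then be passed to the text-to-audio model, and the resulting clips concatenated in order to form the audio for the whole post.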

3.2. Multimodal Multi-Teacher Fine-tuning

Researchers (Jiao et al., 2020; Wu et al., 2021) emphasise the significance of distilling knowledge from the hidden states of a teacher model for effective student instruction. In this section, we describe the independent fine-tuning process of each teacher model.

We performed fine-tuning on the text-based teacher, utilising the pre-trained language model, for the mental health classification task with labels $C=\{c_{1},c_{2},...,c_{|C|}\}$ . This process enables the teacher to learn and adapt to the nuances of mental health-related contexts within each dataset.

For the emotion-based teacher, following the generation of emotion-rich representations for each post and its words, these embeddings serve as inputs for fine-tuning a Multi-Layer Perceptron (MLP) for mental health classification, operating over the labels $C$ .

We employ the Audio Spectrogram Transformer (AST) (Gong et al., 2021) for the audio-based teacher to classify each generated audio into mental health risk classes. AST is a transformer-based model that takes a sequence of audio spectrogram patches as inputs. An audio waveform of $t$ seconds is first converted into a $128\times 100t$ spectrogram based on a sequence of 128-dimensional log Mel filterbank features computed with a 25ms Hamming window every 10ms. The spectrogram is then split into a sequence of $N$ patches of size $16\times 16$ with an overlap of 6 in both the time and frequency dimensions. A special [CLS] token is added to the beginning of the sequence of spectrogram patches. After passing through the transformer encoder layers, the [CLS] embedding is fed into a linear layer with sigmoid activation to classify mental health risk class labels $C$.
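To make the patch arithmetic concrete, the sketch below counts the patches produced by a 16x16 window with an overlap of 6, which corresponds to a stride of 10. It assumes an unpadded strided split, as in a strided convolution, and an input of at least 16 frames; `patch_grid` is a hypothetical helper, not part of any AST library.

```python
from typing import Tuple


def patch_grid(n_mels: int, n_frames: int, patch: int = 16, overlap: int = 6) -> Tuple[int, int, int]:
    """Return (frequency patches, time patches, total) for an unpadded strided split."""
    stride = patch - overlap  # 16 - 6 = 10 in both dimensions
    n_freq = (n_mels - patch) // stride + 1
    n_time = (n_frames - patch) // stride + 1
    return n_freq, n_time, n_freq * n_time


# A 10-second clip at one frame every 10 ms gives 100 * 10 = 1000 frames.
grid_10s = patch_grid(n_mels=128, n_frames=1000)
```

Under these assumptions a 10-second clip yields 12 patches along the frequency axis and 99 along the time axis, so the sequence length grows linearly with the audio duration.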

Every teacher model is individually constructed and fine-tuned to facilitate optimal learning. We are aware of the concerns raised by some researchers (Wu et al., 2021) highlighting the potential inconsistency in the feature space when different teachers are separately pre-trained with distinct settings and then fine-tuned independently. Based on our initial testing, co-finetuning multimodal teachers yields little improvement; in fact, it tends to result in lower performance. We speculate that integrating multimodal information may not perform optimally during co-finetuning.

3.3. Multi-Teacher Knowledge Distillation

For the student model, we use a single modality involving textual posts as input for a pre-trained BERT, which processes the sequence of tokenised words. The student model performs the mental health risk classification task over the same class labels $C$ , with knowledge distilled from the text-based, emotion-based, and audio-based teacher models. To incorporate the acquired knowledge from these various multimodal sources, the student model is trained to minimise the distillation loss given by $L=L_{task}+L_{kd}$ . Here, $L_{task}$ represents the cross-entropy loss between the student model’s predictions and the ground truth of mental health risk categories, while $L_{kd}$ stands for the Kullback-Leibler (KL) divergence between the student model and the teacher models’ predictions. Given the presence of multiple teacher models, we calculate $L_{kd}$ by averaging the predicted probability distributions from all three teacher models.
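The objective above can be written as $L_{kd}=\mathrm{KL}(\bar{p}_{T}\,\|\,p_{S})$ with $\bar{p}_{T}=\frac{1}{3}\sum_{m=1}^{3}p_{T_{m}}$, where $p_{S}$ is the student distribution and $p_{T_{m}}$ the distribution of teacher $m$. The sketch below illustrates it in plain Python. The direction of the KL term, the absence of a softmax temperature, and the equal weighting of $L_{task}$ and $L_{kd}$ are assumptions made for illustration; the text only states that the teacher distributions are averaged.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_distributions(dists: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the teachers' class distributions (the soft target)."""
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q); eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits: Sequence[float],
                      teacher_probs: Sequence[Sequence[float]],
                      target: int) -> float:
    """L = L_task + L_kd with cross-entropy and KL to the averaged teachers."""
    student = softmax(student_logits)
    soft_target = average_distributions(teacher_probs)
    l_task = -math.log(student[target])
    l_kd = kl_divergence(soft_target, student)  # assumed direction
    return l_task + l_kd


# Hypothetical outputs of the text, emotion and audio teachers for one post.
teachers = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.1, 0.3]]
loss = distillation_loss([2.0, 0.5, 0.1], teachers, target=0)
```

When the student already reproduces the averaged teacher distribution, the KL term vanishes and only the cross-entropy term remains.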

4. Experimental Setup

4.1. Datasets

Table 1. Data statistics. Durations are in a minute:second (mm:ss) format. ^±SDCNL categorises suicide and depression-related posts.

	TwitSuicide	DEPTWEET	IdenDep	SDCNL
Task	Suicide	Depression	Depression	Suicide^±
Platform	Twitter	Twitter	Reddit	Reddit
Num. Classes	3	4	2	2
Total Samples	660	5128	1841	1895
Evaluation	10-fold	60/20/20	10-fold	80/20
Train/Val	-	4,102	-	1,516
Test	-	1,026	-	379
Length	13-147	1-926	11-17,641	13-24,590
Avg. Length	90.32	163.28	1,127.57	936.76
Words	3-31	1-101	1-3,477	2-4,411
Avg. Words	16.85	28.15	215.1	178.53
Min Duration	00:01.903	00:01.250	00:02.900	00:02.463
Max Duration	00:32.853	00:56.596	22:07.740	28:09.546
Avg. Duration	00:11.545	00:17.860	01:45.193	01:27.568

We evaluate our proposed model using four publicly available datasets related to mental health on social media. Table 1 and Figure 2 provide a summary of statistics and an illustration of class distribution.

The TwitSuicide Dataset (Long et al., 2022; data available upon request) replicates the data collection, processing, and annotation methods of O’Dea et al. (2015). A sample of 660 tweets is annotated into three risk levels. The Strongly Concerning (SC) class is assigned to posts with a convincing display of severe suicidal ideation, while Safe to Ignore (SI) shows no evidence of suicide risk. A post that does not fall into either of these categories remains in the Possibly Concerning (PC) class.

DEPTWEET (Kabir et al., 2023; https://github.com/mohsinulkabir14/DEPTWEET) is collected from Twitter using seed terms based on the Patient Health Questionnaire (PHQ-9). The dataset comprises 40,191 tweets; however, only 5,128 tweets could be retrieved during this study. The labels include Non-Depressed (ND), Mild (MI), Moderate (MO), and Severe (SE), with an imbalanced class distribution: around 80% of posts are labelled as ND and less than 2% as SE.

The Identifying Depression Dataset (IdenDep) (Pirina and Çöltekin, 2018; https://github.com/Inusette/Identifying-depression) consists of 1,841 Reddit posts, with “depression indicative” (DE) posts sourced from the Depression subreddit and non-depressive (NDE) posts from the “family” and “friendship advice” subreddits. No further manual check was done on the samples, increasing the probability of false negatives.

The SDCNL Dataset (Haque et al., 2021; https://github.com/ayaanzhaque/SDCNL) involves distinguishing between Reddit suicide-related and depression-related posts. The dataset contains 1,895 nearly balanced posts labelled as Suicide (SUI) or Depression/Not Suicide (DEP) based on their subreddit. In accordance with Benton et al. (2017), all posts are de-identified before any analysis, audio generation, and model training.

Table 2. Text statistics for each class per dataset.

TwitSuicide
Class	Total	%	Length (ave.)	Words (ave.)
Safe to Ignore	103	15.61	13-139 (77.89)	4-31 (15.25)
Possibly Concerning	264	40.00	24-147 (88.16)	4-31 (16.35)
Strongly Concerning	293	44.39	13-147 (96.65)	3-30 (17.86)
DEPTWEET
Non-Depressed	4213	82.16	1-816 (164.47)	1-101 (28.08)
Mild	606	11.82	4-885 (144.74)	1-87 (26.38)
Moderate	232	4.52	32-926 (184.95)	5-99 (33.25)
Severe	77	1.50	23-398 (178.81)	1-62 (30.57)
IdenDep
Non-Depression	548	29.77	11-17641 (1546.34)	1-3477 (295.75)
Depression	1293	70.23	11-13803 (950.09)	2-2487 (180.92)
SDCNL
Depression	915	48.28	43-16015 (1000.68)	8-3200 (192.84)
Suicide	980	51.72	13-24590 (977.07)	2-4411 (165.16)

Table 3. Audio statistics for each class per dataset in a minute:second (mm:ss) format.

Dataset	Class	Min Duration	Max Duration	Ave. Duration
TwitSuicide	Safe to Ignore	00:01.903	00:31.000	00:12.215
	Possibly Concerning	00:02.143	00:32.853	00:11.393
	Strongly Concerning	00:01.943	00:24.760	00:10.280
DEPTWEET	Non-Depressed	00:01.250	00:56.200	00:17.210
	Mild	00:02.230	00:56.596	00:15.413
	Moderate	00:04.460	00:47.770	00:18.958
	Severe	00:02.250	00:40.160	00:17.865
IdenDep	Non-Depression	00:02.900	22:07.740	02:20.941
	Depression	00:03.100	21:28.643	01:30.420
SDCNL	Depression	00:05.756	25:58.966	01:33.802
	Suicide	00:02.463	28:09.546	01:21.747

We provide a detailed breakdown of text and audio statistics for each dataset class in Tables 2 and 3 to provide more information regarding the nature of each class, which may influence model learning and performance. Notably, DEPTWEET and IdenDep datasets have highly skewed data, with 82.16% and 70.23% on a single class, respectively. Figure 3 illustrates the differences in the generated audio in terms of duration. The Reddit-based datasets, IdenDep and SDCNL, are significantly longer than the Twitter-based datasets, possibly providing more auditory information inferred from the textual posts.

4.2. Text-to-Audio Generators

In order to generate the best possible audio to represent each textual post in our benchmark datasets, we performed a theoretical and practical comparison between five publicly accessible text-to-speech and text-to-audio generative APIs.

(1) Tacotron2 (https://github.com/NVIDIA/tacotron2) (Shen et al., 2018) uses a recurrent neural network architecture to predict mel spectrogram sequences from text followed by a modified WaveNet vocoder.

(2) SpeechT5 (https://github.com/microsoft/SpeechT5) (Ao et al., 2022) unifies modalities with a shared encoder-decoder architecture that uses cross-modal vector quantisation for speech and text alignment.

(3) SpeechBrain (https://github.com/speechbrain/speechbrain/) (Ravanelli et al., 2021) is a speech toolkit offering various speech related tasks. Their text-to-speech model is based on Tacotron2 but is trained further on the LJSpeech (Ito and Johnson, 2017) and LibriTTS (Zen et al., 2019) datasets.

(4) Balacoon (https://huggingface.co/balacoon/tts) packages offer lightweight and fast text analysis and speech generation going against larger but slower TTS models. It sacrifices multi-speaker and multi-lingual features for lightning-fast speed on the CPU. The detailed model architecture was not publicly available at the time of this paper’s writing.

(5) Bark (https://github.com/suno-ai/bark) has a GPT-based architecture using a quantised audio representation that does not require the use of phonemes allowing it to generalise beyond speech, thus making it a text-to-audio model.

After comparing the five generators, we chose Bark for the expressiveness of its generated audio. The other models suffer from robotic delivery, fail to verbalise numerical figures, and read emoticons as individual punctuation marks (e.g. “>:|” as greater than, colon, pipe). In contrast, Bark produces the most natural-sounding audio, recognising textual markers such as “,” for pauses, “–” and “…” for hesitations, capitalisation for emphasis (e.g. goodbye vs. GOODBYE), and sentence punctuation to produce tonal shifts (e.g. huh? vs huh!). Bark also verbalises non-speech sounds such as [laughter], [laughs], [sighs], [music], [gasps], [clears throat], haha, uhm, waaah, and ooh. Bark’s ability to infer and convey emotions from input text alone is valuable to our mental health risk detection model since the generated sound can provide additional emotional cues.

4.3. Baselines and Metrics

We assess the performance of our model by comparing it to previously published results using post-only (in contrast to studies incorporating other components such as posted images or user network and activity) and post-level classification on the same datasets, employing identical class labels and similar evaluation setups. We use results reported in the following studies as our baselines: Bi-LSTM Char+Word (Long et al., 2022) for TwitSuicide; MLP (Tadesse et al., 2019) and EAN (Ren et al., 2021) for IdenDep; and GUSE-DENSE (Haque et al., 2021) and AugBERT+LR (Ansari et al., 2021) for SDCNL. For the DEPTWEET dataset, we use the published DistilBERT code from Kabir et al. (2023) to replicate baseline results for the retrieved dataset. In addition, we provide strong baselines from fine-tuning state-of-the-art PLMs: BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), MentalBERT, and MentalRoBERTa (Ji et al., 2021). All PLM baselines follow the training setup used by Long et al. (2022), with a batch size of 8 and a learning rate of 1e-04, trained for three epochs. Given the class imbalance, we evaluate our system based on macro F1 (F1m) and weighted F1 (F1w) scores, followed by accuracy and class F1 scores.
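The difference between the two headline metrics can be sketched as follows: macro F1 is the unweighted mean of the per-class F1 scores, while weighted F1 weights each class by its number of true samples. This is a minimal plain-Python illustration with invented labels, not the evaluation code used in the paper; it only considers classes that occur in the ground truth, which is an assumption.

```python
from typing import Dict, List, Tuple


def class_f1(y_true: List[str], y_pred: List[str], label: str) -> float:
    """F1 for one class: 2TP / (2TP + FP + FN), or 0.0 if undefined."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_weighted_f1(y_true: List[str], y_pred: List[str]) -> Tuple[float, float, Dict[str, float]]:
    """Return (macro F1, weighted F1, per-class F1)."""
    labels = sorted(set(y_true))  # assumption: only classes present in y_true
    scores = {c: class_f1(y_true, y_pred, c) for c in labels}
    macro = sum(scores.values()) / len(labels)
    weighted = sum(scores[c] * y_true.count(c) for c in labels) / len(y_true)
    return macro, weighted, scores


# Invented, imbalanced toy labels: 8 majority-class posts and 2 minority-class posts.
y_true = ["ND"] * 8 + ["SE"] * 2
y_pred = ["ND"] * 8 + ["ND", "SE"]
macro_f1, weighted_f1, per_class = macro_weighted_f1(y_true, y_pred)
```

In this toy case the minority class pulls macro F1 well below weighted F1, which is why both are reported for imbalanced datasets. As a consistency check on Table 4, the macro F1 of 76.50 reported for Ours (Text&Audio) on SDCNL equals the mean of its two class scores, 77.12 and 75.88.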

4.4. Implementation Details

We evaluate our model following established evaluation setups from previous literature using the same datasets on the same classification task setup for fair benchmark comparisons. We use 10-fold cross-validation for TwitSuicide and IdenDep, while a train/test split is used for SDCNL and DEPTWEET. Original data splits are retained when provided; otherwise, the data is randomly split (Table 1). When none is given, 10% of the training set is used for validation.
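For the datasets evaluated with 10-fold cross-validation, the splitting procedure can be sketched as below. This is an illustration only: the shuffling seed is arbitrary, the folds are not stratified by class, and `k_fold_indices` is a hypothetical helper; the paper does not state whether stratification was used.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train indices, test indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: arbitrary fixed seed
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# TwitSuicide has 660 posts, so each of the 10 folds holds 66 test posts.
splits = list(k_fold_indices(660, k=10))
```

Every post appears in exactly one test fold, so the reported scores can be averaged over the ten runs without any post being evaluated twice.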

Hyperparameter tuning is done per dataset and model setup using Optuna (https://optuna.org/) for 50 trials, optimising weighted F1 scores. The detailed search space and best-found hyperparameters may be found in the Supplementary Material. Text-based teachers are trained using ReLU and a max length of 256. Audio-based teachers are trained for 25 epochs with a 5-epoch early stop, a 512 max length, and the ReduceLROnPlateau scheduler. All inputs for the AST model are normalised to zero mean and 0.5 standard deviation. Emotion-based teachers are trained for 100 epochs with a 10-epoch early stop and a 256 max length. The student models are tuned and trained with distilled knowledge from the fine-tuned teachers using a max length of 256. All tuning was done using a 90:10 validation split and was conducted separately from the final model construction. All models are trained using an Adam optimiser on an NVIDIA TITAN RTX machine.
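The input normalisation for the AST model can be sketched as follows. It is a minimal illustration that computes the mean and standard deviation from the values passed in; in practice these statistics would presumably be computed once over the training spectrograms, which is an assumption, as is the use of the population standard deviation.

```python
import math
from typing import List


def normalise(values: List[float], target_std: float = 0.5) -> List[float]:
    """Shift to zero mean and rescale to the target standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]  # constant input carries no variation
    return [(v - mean) / std * target_std for v in values]


# Invented filterbank values for illustration.
normalised = normalise([1.0, 2.0, 3.0, 4.0])
```

Dividing by the standard deviation and multiplying by 0.5 is equivalent to dividing by twice the standard deviation.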

5. Results

5.1. Overall Performance

Table 4. Overall results using all three teacher modalities (Ours (All)) and the best partial teacher combination (Ours (Best Partial Combination)) against baselines. Class abbreviation definitions may be found in the Figure 2 caption. A full teacher combination ablation study is presented in Section 5.3. ^† indicates replicated results. Bold face indicates the best score, while the second best is underlined.

	Overall Performance			Breakdown F1 Scores
TwitSuicide	Acc	F1m	F1w	(SC)	(PC)	(SI)
Long et al. (2022)	56.67	-	-	40.00	50.00	66.00
BERT	57.58	53.60	57.25	40.00	57.00	64.00
RoBERTa	55.45	50.61	54.43	37.18	51.63	63.03
MentalBERT	57.73	52.57	57.39	35.23	56.65	65.84
MentalRoBERTa	55.91	51.49	55.60	41.62	53.05	62.81
Ours (Text&Emo)	65.76	61.96	65.46	49.72	62.34	73.81
Ours (All)	61.21	59.64	61.23	54.17	59.50	65.26
DEPTWEET	Acc	F1m	F1w	(SE)	(MO)	(MI)	(ND)
Kabir et al. (2023)^†	79.75	38.59	78.89	17.65	21.98	25.43	89.29
BERT	81.89	40.40	80.21	36.36	00.00	34.34	90.90
RoBERTa	82.18	32.04	80.14	00.00	00.00	36.77	91.41
MentalBERT	83.54	36.04	79.72	26.09	00.00	26.67	91.41
MentalRoBERTa	78.48	36.60	78.62	22.22	00.00	34.91	89.28
Ours (Text&Emo)	84.03	46.43	83.09	41.38	12.31	39.07	92.95
Ours (All)	82.77	46.20	82.61	26.09	34.34	32.32	92.04
IdenDep	Acc	F1m	F1w	(DE)	(NDE)
Tadesse et al. (2019)	91.00	-	-	93.00	-
Ren et al. (2021)	91.30	-	-	93.98	-
BERT	88.65	85.23	88.10	92.34	78.12
RoBERTa	87.18	82.85	86.34	91.47	74.24
MentalBERT	89.63	86.71	89.23	92.93	80.49
MentalRoBERTa	90.11	87.70	89.91	93.15	82.26
Ours (Text&Audio)	94.30	93.10	94.26	95.97	90.23
Ours (All)	93.92	92.58	93.85	95.73	89.43
SDCNL	Acc	F1m	F1w	(SUI)	(DEP)
Haque et al. (2021)	72.24	-	-	73.61	-
Ansari et al. (2021)	-	-	-	76.00	-
BERT	67.02	66.57	66.64	70.45	62.69
RoBERTa	70.97	70.63	70.69	73.81	67.46
MentalBERT	69.39	69.21	69.26	71.57	66.86
MentalRoBERTa	72.30	72.10	72.14	74.45	69.74
Ours (Text&Audio)	76.52	76.50	76.51	77.12	75.88
Ours (All)	75.20	74.84	74.90	77.83	71.86

We compare our model with fine-tuned PLM baselines and several published baselines that use the same mental health detection datasets and evaluation setup. Note that we select post-only mental health detection models, as mentioned in Sections 1 and 4.3. We comprehensively evaluate the overall performance and class performance in Table 4.

Overall, our model outperforms all baselines on all four benchmark datasets. Notably, our model does not have to be trained with all three teachers to achieve the best results. As shown in Table 4, the first two datasets originate from Twitter and the latter two from Reddit. The Twitter datasets produce the best results with the combination of Text and Emotion teachers, whereas the Reddit-based datasets perform best with Text and Audio knowledge distillation. We detail the efficacy of each modality teacher combination on different datasets in Section 5.3. In addition, our model trained with all three modalities still outperforms the baseline models in most cases and shows greater performance in identifying certain classes. In particular, for the Moderate (MO) class of the DEPTWEET dataset, our model trained with all three modalities achieves a 34.34 F1 score, while our model trained with the best partial combination only achieves 12.31. All fine-tuned PLM baselines fail to recognise the MO class, scoring 0.00 F1. Hence, we conclude that learning from teachers of different modalities helps our model achieve much better performance than baseline models that learn from textual inputs only. The improvement is most noticeable on the dataset with the shortest texts: on TwitSuicide, our model’s best performance is 8.36 and 8.07 points higher than the best-performing baseline on macro F1 and weighted F1, respectively. For the DEPTWEET, IdenDep, and SDCNL datasets, the best performances of our model are 6.03, 5.40, and 4.40 points higher on macro F1 and 2.88, 4.35, and 4.37 points higher on weighted F1 than the best-performing baseline, respectively.

5.2. Audio Representation Analysis

Table 5. Samples for the TwitSuicide audio spectrogram analysis. Each sample has been masked to avoid a reverse search of each post. SI: Safe to Ignore; SC: Strongly Concerning.

ID	Class	Text
0-10 seconds
214	SI	_USER_ *** i can’t get that link to work
251	SI	_USER_ or, * anyone from the * Rookies all-female racers team
355	SI	*** kill myself.. watching Drag Me To Hell
457	SI	i’m afraid my ups might be dead *** making ticking noises
495	SI	It’s too early to be awake * got up 3 1/2 hours ago! * never wake up before 8.
152	SC	_USER_ *** never wanted to be dead til now…
187	SC	_USER_ thanks…now *** me kill myself
336	SC	feeling like death *** want to die
407	SC	I *** die right now no one loves me
573	SC	*** hate my life sometimes i just want to die
10-25 seconds
92	SI	_USER_ Its a story about how success as * columnist, * helped create, returned him to alcohol & suicidal thoughts
216	SI	_USER_ Thx for your part *! Any time you reblog * I instantly get 3-5x the activity on it I normally do!
359	SI	Gosh back to * from * with a rather large BANG. Really don’t want to be here, much rather be sailing *** for lunch
506	SI	Last paper *! Lucky I’ve bought * or I will be dead.
553	SI	oh noesss * is dieing im ganna kill myself!!! * room with no phone fml.
81	SC	_USER_ ugh. * that one was awful. suicide always * gets to me…
160	SC	_USER_ * so stupid! * think I still can? I want to kill myself!
416	SC	I need to go on suicide watch. *, The Fumble, , Jose Mesa, **, and now this… Where’s my razor blade?
417	SC	I never have any one to talk to * i hate my self * kill myself if no one * say anything to me on *
589	SC	So *. I’m in pain. Sucks. That was * the point. suicide an option?

Table 6. Samples for the DEPTWEET audio spectrogram analysis. Each sample has been masked to avoid a reverse search of each post. ND: Non-Depressed; SE: Severe.

ID	Class	Text
0-10 seconds
665	ND	*** miss my sc I’m so depressed without it ☹️☹️
965	ND	_USER_ _USER_ _USER_ Frustrated *** fan hai _URL_
1142	ND	Me checking * I hate * Continues to check *** and then gets depressed
2054	ND	* so exhausted * fighting to stay up until 8pm
3107	ND	Do you feel frustrated *** on the simplest things? _USER_ …
3310	ND	* teacher is so tired of * shit
1085	SE	_USER_ sh000t me it would hurt less ***
2442	SE	* so lonely. * going to hurt someone . #depression _USER_
4389	SE	We * the shit country. * so depressed. _URL_
4552	SE	* no reason to live. * I’ll just end it . #depression _USER_
10-25 seconds
159	ND	_USER_ Neither. * people who * clinically depressed are going to be so regardless of their worldview. IMHO
1849	ND	ofc * days off im taking care of my newphew … im so tired i work every weekday * takes forever to get home *** on my days off i babysit… i don’t even get paid. im exhausted
2896	ND	Football: * revive World Cup hopes, * frustrated by *** _URL_
3382	ND	* sad of getting old it made us restless… * so MAD i’m getting old it makes me reckless!!!
4158	ND	I’ve * my toenails off and split the nail bed - the pain has progressed over * days to absolutely excruciating - so bad *** struggling to even walk. This week is going amazing
230	SE	* first guest: Me. * self-sabotage and self-destruction.
1686	SE	_USER_ Man, September was so hard * watched my gma pass away, * so much other stuff went wrong. I been depressed asf
2807	SE	_USER_ _USER_ _USER_ _USER_ _USER_ I personally can’t * 3 or4 died * from either trauma or anxiety and *** those who took their own lives because of what happened
2919	SE	*** get the hell out. so I’ll just end it . #depression _USER_
4838	SE	* thinking about suicide more and more * I don’t want to. I don’t want * that trauma on my kid. But it’s hard… * suffering from depression *** 15 years… it’s a daily battle… I’m tired

Table 7. Samples for the IdenDep audio spectrogram analysis. Each sample has been masked to avoid a reverse search of each post. NDE: Non-Depressive; DE: Depressive.

ID	Class	Text
0-10 seconds
1324	NDE	* Sympathy gift ideas + *
1427	NDE	*** friend vlog
1701	NDE	does anyone want to hear the story about * ’beef’ between *
1728	NDE	TRUST by ” THE HIPSTERS ” (ft. * and *)
1754	NDE	How to have a strong family * products or services have helped * family stay strong together?
172	DE	It’s * easier to fall back in than to fight * it
655	DE	* friends are throwing a LAN party * I wasn’t invited. *** only one who didn’t get an invitation.
904	DE	* feel bitter about everything. * bitter about being bitter.
934	DE	I’m sad I feel sad. *** I feel something.
1235	DE	… I just want to crawl in a whole and cry ***
10-25 seconds
1377	NDE	Having friends * opposite sex * in a relationship _URL_
1470	NDE	* introduce my girlfriend [18F] to my family My girlfriend lives in * I live in * * introduce her to my mom but I don’t know how
1589	NDE	365 New Ways To Hug Your Love * discover and post videos or pictures of New Ways To Hug in the new subreddit *
1616	NDE	With family being a main interest in your lives, what * would you purchase * to help the family to grow?
1824	NDE	What Would You Do? Would you move away from your family * to somewhere far where your kids would have a better education * provide for your family better, like buying a house; * moving from * to the * or *?
132	DE	Anyone else feel like everyone hates them? * paranoia? * the dark cloud over my head just gives off a shitty vibe *** people think I don’t like them and vice Versa.
505	DE	That feeling when you hate who you * but can’t * change because you are so used to being like this for * years. * a shitty person. The thought of change seems impossible *** at this point.
605	DE	Fuck me When you’re * a piece of shit * look at other girls and lie to you, while lying * next to you. I’ll never be enough. Ever. For anyone. * want to ducking die.
1027	DE	Addicted to depression * when I feel like * self-loathing and depressive * becoming less, I feel shit * don’t feel depressed anymore. * I want it to go away * part of me wants to stay depressive and feel suicidal.
1154	DE	Is it depression * don’t want to build memories anymore? * I get really nostalgic. * I don’t want to get too attached to people * just end up hurting in the future.

To assess the feasibility of the audio modality for mental health detection, we provide an illustrative visualisation of the audio embeddings, which are generated from the input text and learned via the Audio Spectrogram Transformer (AST). We apply Principal Component Analysis (PCA) to visualise the learned audio embeddings together with their corresponding mental health class labels. To emphasise the distinguishability of the embeddings, we select samples from the least and most concerning labels in each dataset, as shown in Figure 4. For each dataset, we group the generated audio by duration into 0-to-10-second and 10-to-25-second clips (note that most generated audio clips are shorter than 25 seconds). For each of these two groups, we generate the corresponding spectrograms and randomly select ten audio samples, then visualise their first two principal components after performing PCA. In Figure 4, we annotate each sample with the post ID and the audio duration for detailed comparison. Tables 5 to 7 contain de-identified and masked post contents for TwitSuicide, DEPTWEET, and IdenDep, respectively. SDCNL samples may be found in the Supplementary Material.
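The projection step described above can be sketched as follows. This is a minimal, standard-library illustration and not the authors' code: the four-dimensional "embeddings" are made-up stand-ins for AST outputs, the helper names are hypothetical, and the principal components are approximated by power iteration with deflation, whereas in practice a library PCA implementation would normally be used.

```python
import math
import random

def center(rows):
    """Subtract the per-dimension mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def matvec(matrix, vec):
    """Matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def top_eigenpair(matrix, iters=500, seed=0):
    """Approximate the dominant eigenpair by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in matrix]
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iters):
        w = matvec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in this matrix
            break
        v = [x / norm for x in w]
    eigval = sum(a * b for a, b in zip(v, matvec(matrix, v)))
    return eigval, v

def pca_2d(rows):
    """Project rows onto their first two (approximate) principal components."""
    x = center(rows)
    cov = covariance(x)
    components = []
    for _ in range(2):
        eigval, vec = top_eigenpair(cov)
        components.append(vec)
        # Deflation: remove the variance explained by the component just found.
        cov = [[cov[i][j] - eigval * vec[i] * vec[j] for j in range(len(vec))]
               for i in range(len(vec))]
    return [[sum(a * b for a, b in zip(r, c)) for c in components] for r in x]

# Hypothetical embeddings: three "least concerning" and three "most concerning" samples.
embeddings = [
    [0.0, 0.1, 0.0, 0.2], [0.2, 0.0, 0.1, 0.0], [0.1, 0.2, 0.2, 0.1],
    [5.0, 5.1, 0.1, 0.0], [5.2, 4.9, 0.0, 0.2], [4.9, 5.0, 0.2, 0.1],
]
labels = ["SI", "SI", "SI", "SC", "SC", "SC"]

for label, (pc1, pc2) in zip(labels, pca_2d(embeddings)):
    print(f"{label}: PC1={pc1:+.2f}, PC2={pc2:+.2f}")
```

In this toy case the two groups fall on opposite sides of the first principal component, which is the kind of separation the scatter plots are meant to reveal. The sign of each component is arbitrary, so only the relative positions of the points are meaningful.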

The visualisation shows a noticeable separation between mental health classes in the audio embeddings for all four datasets. In the Twitter datasets, shorter audio samples display more pronounced distinctions between the most and least concerning classes, whereas in the Reddit datasets this separation becomes more evident in longer audio segments. We attribute this primarily to Twitter posts being generally shorter, whereas Reddit posts tend to be longer.

5.3. Effectiveness of Multimodal Multi-Teachers

Table 8. Ablation study using different combinations of teacher modalities. Class abbreviation definitions may be found in the Figure 2 caption. Boldface indicates the best score, while the second best is underlined. A ✓ indicates the addition of the emotion (Emo) and/or the audio (Aud) teacher(s). Highlighted rows show the best setup.

Text^‡	Emo	Aud	Overall Performance			Breakdown F1 Scores
TwitSuicide			Acc	F1m	F1w	(SC)	(PC)	(SI)
✓	×	×	58.94	47.50	56.13	16.13	56.55	69.81
✓	✓	×	65.76	61.96	65.46	49.72	62.34	73.81
✓	×	✓	63.79	58.94	63.02	44.30	61.76	70.74
✓	✓	✓	61.21	59.64	61.23	54.17	59.50	65.26
DEPTWEET			Acc	F1m	F1w	(SE)	(MO)	(MI)	(ND)
✓	×	×	84.52	44.33	82.40	40.00	13.11	31.28	92.90
✓	✓	×	84.03	46.43	83.09	41.38	12.31	39.07	92.95
✓	×	✓	83.06	36.00	81.69	0.00	15.38	36.26	92.33
✓	✓	✓	82.77	46.20	82.61	26.09	34.34	32.32	92.04
IdenDep			Acc	F1m	F1w	(DE)	(NDE)
✓	×	×	92.32	90.67	92.26	94.60	86.74
✓	✓	×	93.86	92.47	93.78	95.71	89.23
✓	×	✓	94.30	93.10	94.26	95.97	90.23
✓	✓	✓	93.92	92.58	93.85	95.73	89.43
SDCNL			Acc	F1m	F1w	(SUI)	(DEP)
✓	×	×	75.20	75.07	75.10	76.85	73.30
✓	✓	×	72.82	72.45	72.51	75.65	69.25
✓	×	✓	76.52	76.50	76.51	77.12	75.88
✓	✓	✓	75.20	74.84	74.90	77.83	71.86

To examine the efficacy of each teacher modality and their combinations, we evaluate the performance obtained by adding an extra emotion-based teacher, an audio-based teacher, or both alongside the text-based teacher. The results are presented in Table 8.
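As a reading aid for the ablation table, the F1m column can be cross-checked against the per-class breakdown. The sketch below assumes that F1m is the macro F1, that is, the unweighted mean of the per-class F1 scores; the values are copied from the TwitSuicide rows, and the 0.02 tolerance is an assumption that allows for rounding to two decimals.

```python
def macro_f1(per_class_f1):
    """Macro F1: the unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)

# TwitSuicide rows: (teacher combination, reported F1m, per-class F1 for SC, PC, SI)
rows = [
    ("Text", 47.50, [16.13, 56.55, 69.81]),
    ("Text+Emo", 61.96, [49.72, 62.34, 73.81]),
    ("Text+Aud", 58.94, [44.30, 61.76, 70.74]),
    ("Text+Emo+Aud", 59.64, [54.17, 59.50, 65.26]),
]

for name, reported, per_class in rows:
    recomputed = macro_f1(per_class)
    # Table values are rounded, so allow a small tolerance.
    consistent = abs(recomputed - reported) <= 0.02
    print(f"{name}: reported={reported:.2f}, recomputed={recomputed:.2f}, consistent={consistent}")
```

Because macro F1 weights every class equally, a gain on a rare class such as SC moves F1m far more than it moves accuracy or weighted F1, which is why the columns can rank the teacher combinations differently.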

In general, the multi-teacher structure outperforms a single text-based teacher, although the effectiveness of the different modalities varies across datasets. For the Twitter-based datasets, combining the emotion-based and text-based teachers achieves the best results. In contrast, for the Reddit-based datasets, the combination of audio- and text-based teachers performs better. This difference may be attributed to the longer posts in Reddit-based datasets, which result in longer audio (Table 1) that contains more acoustic information beneficial to the audio-based teacher.

Moreover, due to the lengthier nature of posts in Reddit-based datasets, more SenticNet lexicon tokens are likely to be matched compared to Twitter-based datasets. This results in a higher number of generated emotion labels during the learning process of the multi-label emotion-based teacher. In Figure 5, we compare the count of generated multi-label emotion classes utilised to train the emotion-based teacher across all four datasets. It is evident that a greater proportion of posts in Reddit-based datasets match all seven emotion labels (shown in Section 3.1.2) compared to Twitter-based datasets. This potential increase in the number of matching emotion labels may present challenges in distinguishing between different emotions during the training of the emotion-based teacher, potentially impacting downstream mental health classification, especially for ambiguous classes such as Suicide (SUI) and Depression (DEP) in the SDCNL dataset.

We conclude that using teachers of multiple modalities generally helps mental health detection. These findings also suggest that the effectiveness of each modality varies across datasets with distinct characteristics, offering insights into selecting suitable modalities for improved performance in future scenarios.

5.4. Impact of Text-based Teachers

Table 9. Ablation study using different PLMs for the text-based teacher. We report results using the best-performing teacher modality combination in Table 8 and change only the text-based teacher. Class abbreviation definitions may be found in the Figure 2 caption. Boldface indicates the best score, while the second best is underlined.

	Overall Performance			Breakdown F1 Scores
TwitSuicide	Acc	F1m	F1w	(SC)	(PC)	(SI)
BERT	65.76	61.96	65.46	49.72	62.34	73.81
RoBERTa	65.15	61.67	64.70	51.19	61.43	72.40
MentalBERT	63.79	61.08	63.67	51.65	64.08	67.52
ClinicalBERT	65.30	63.23	65.25	56.38	62.18	71.13
DEPTWEET	Acc	F1m	F1w	(SE)	(MO)	(MI)	(ND)
BERT	83.84	39.80	83.84	0.00	22.22	44.05	92.93
RoBERTa	82.67	35.99	81.80	0.00	14.29	37.23	92.44
MentalBERT	84.03	46.43	83.09	41.38	12.31	39.07	92.95
ClinicalBERT	83.54	32.29	80.71	0.00	0.00	37.11	92.05
IdenDep	Acc	F1m	F1w	(DE)	(NDE)
BERT	94.30	93.10	94.26	95.97	90.23
RoBERTa	94.13	92.89	94.09	95.86	89.93
MentalBERT	93.21	91.71	93.14	95.24	88.17
ClinicalBERT	94.24	92.95	94.17	95.97	89.94
SDCNL	Acc	F1m	F1w	(SUI)	(DEP)
BERT	76.52	76.50	76.51	77.12	75.88
RoBERTa	75.20	75.03	75.07	77.07	72.99
MentalBERT	73.61	73.61	73.61	73.40	73.82
ClinicalBERT	75.46	75.46	75.45	75.07	75.84

We compare the impact of various pre-trained language models on the effectiveness of the text-based teacher. According to the results in Table 9, BERT performs best in terms of weighted F1 across all four datasets. However, the domain-specific PLMs, MentalBERT and ClinicalBERT, perform better on most of the more concerning classes. For instance, ClinicalBERT outperforms BERT by 6.66 points in F1 for the Strongly Concerning (SC) class of the TwitSuicide dataset, and MentalBERT achieves a 41.38% F1 score for the Severe (SE) class of the DEPTWEET dataset, whereas all other language models fail to recognise this class (0.00 F1). To ensure optimal performance of the text-based teacher, we employ MentalBERT for experiments on the DEPTWEET dataset and BERT for the other three datasets. Nonetheless, the overall performance of the text-based teacher is not significantly affected by the choice of pre-trained language model. More of the performance gain stems from the inclusion of different modalities, as discussed in the previous sections.
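The contrast between overall and class-level rankings can be made concrete with a small lookup over the TwitSuicide rows of Table 9. The dictionary layout and the helper name `best_by` are illustrative assumptions; the numbers are the weighted F1 and SC-class F1 values from the table.

```python
# TwitSuicide rows of Table 9: weighted F1 and F1 on the most concerning (SC) class.
twitsuicide = {
    "BERT": {"f1w": 65.46, "sc": 49.72},
    "RoBERTa": {"f1w": 64.70, "sc": 51.19},
    "MentalBERT": {"f1w": 63.67, "sc": 51.65},
    "ClinicalBERT": {"f1w": 65.25, "sc": 56.38},
}

def best_by(scores, metric):
    """Return the model name with the highest value for the given metric."""
    return max(scores, key=lambda name: scores[name][metric])

best_overall = best_by(twitsuicide, "f1w")
best_concerning = best_by(twitsuicide, "sc")
gap = twitsuicide[best_concerning]["sc"] - twitsuicide["BERT"]["sc"]

print(f"Best weighted F1: {best_overall}")
print(f"Best SC-class F1: {best_concerning} (+{gap:.2f} points over BERT)")
```

Which model counts as "best" therefore depends on whether the selection criterion is overall performance or recall of the highest-risk class, a choice that matters when the text-based teacher is picked per dataset.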

5.5. Impact of Student Model Inputs

Table 10. Ablation study using different combinations of input modalities to the student model. Boldface indicates the best score, while the second best is underlined. A ✓ indicates the addition of the emotion-based (Emo) and/or the audio-based (Aud) input features. Highlighted rows show our original proposed student setup. VT: randomly initialised vanilla transformer.

Text	Emo	Aud	Overall Performance			Breakdown F1 Scores
TwitSuicide			Acc	F1m	F1w	(SC)	(PC)	(SI)
BERT	✓	✓	51.67	45.43	50.46	26.95	52.00	57.34
BERT	✓	×	50.91	44.24	48.58	29.58	41.42	61.71
BERT	×	✓	48.64	33.87	43.31	0.00	41.07	60.55
BERT	×	×	65.76	61.96	65.46	49.72	62.34	73.81
VT	×	×	46.36	32.72	41.77	0.00	41.09	57.06
DEPTWEET			Acc	F1m	F1w	(SE)	(MO)	(MI)	(ND)
BERT	✓	✓	85.49	34.75	81.24	0.00	20.59	25.86	92.54
BERT	✓	×	83.84	32.15	80.81	0.00	0.00	36.36	92.55
BERT	×	✓	84.23	22.86	77.01	0.00	0.00	0.00	91.44
BERT	×	×	84.03	46.43	83.09	41.38	12.31	39.07	92.95
VT	×	×	84.23	22.86	77.01	0.00	0.00	0.00	91.44
IdenDep			Acc	F1m	F1w	(DE)	(NDE)
BERT	✓	✓	91.58	89.99	91.60	93.98	86.00
BERT	✓	×	91.09	89.21	91.03	93.72	84.70
BERT	×	✓	91.75	89.85	91.62	94.23	85.47
BERT	×	×	94.30	93.10	94.26	95.97	90.23
VT	×	×	75.77	61.25	70.85	84.97	37.54
SDCNL			Acc	F1m	F1w	(SUI)	(DEP)
BERT	✓	✓	67.55	67.54	67.55	67.89	67.20
BERT	✓	×	67.81	67.80	67.80	67.38	68.23
BERT	×	✓	68.34	68.32	68.31	67.57	69.07
BERT	×	×	76.52	76.50	76.51	77.12	75.88
VT	×	×	67.02	66.94	66.97	68.51	65.37

We examine different combinations of multimodal inputs to the student model in Table 10 to identify the optimal student input for knowledge distillation. In this setting, we concatenate emotion embeddings, audio embeddings, or both with the textual post embeddings from pre-trained BERT, apply a linear projection, and pass the result to the transformer layer. We also compare a randomly initialised vanilla transformer with pre-trained BERT as the student model. The results indicate that unimodal textual inputs outperform the concatenation of multimodal inputs for the student model. Moreover, pre-trained BERT yields better results than the randomly initialised vanilla transformer on all datasets except DEPTWEET accuracy, where the two are comparable. These outcomes underscore the effectiveness of the multi-aspect knowledge acquired from the multiple teachers, which guides the student to robust performance with textual inputs only.

5.6. Hyperparameter Testing

We further investigate the impact of different patch sizes for the audio-based teacher model, the Audio Spectrogram Transformer (AST) (Gong et al., 2021), while keeping the student model setup fixed. Figure 6 shows each dataset's weighted F1 score for each patch size tested. The TwitSuicide, DEPTWEET, and IdenDep datasets show relatively stable performance for patch sizes from 2 to 64; for SDCNL, however, a performance improvement may be achieved with a patch size of 32. This improvement may be due to the higher variance in audio duration among the outliers in the SDCNL dataset compared to the other three datasets, as shown in Figure 3. Although SDCNL posts are shorter on average, in both text length and audio duration, than those of the other Reddit-based dataset, SDCNL contains audio samples with longer durations, which may have benefited from a patch size of 32. However, a sharp decline in performance can be expected when the patch size is increased to 64.
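To give a sense of what the patch size controls, the sketch below counts how many patches a spectrogram is split into for a given patch size and stride. It is an illustration under stated assumptions, not the configuration used in these experiments: the function name, the 128 x 1024 spectrogram shape, the list of patch sizes, and the non-overlapping default stride are all hypothetical choices. The one overlapping case uses 16 x 16 patches with a stride of 10, the setting described in the AST paper (Gong et al., 2021).

```python
def num_patches(n_mels, n_frames, patch, stride=None):
    """Count square patches cut from an (n_mels x n_frames) spectrogram.

    stride defaults to the patch size, i.e. non-overlapping patches (an assumption).
    """
    if stride is None:
        stride = patch
    if patch > n_mels or patch > n_frames:
        return 0  # the patch does not fit into the spectrogram
    freq_steps = (n_mels - patch) // stride + 1
    time_steps = (n_frames - patch) // stride + 1
    return freq_steps * time_steps

# Hypothetical spectrogram: 128 mel bins and 1024 time frames.
for patch in (2, 4, 8, 16, 32, 64):
    print(f"patch={patch:>2}: {num_patches(128, 1024, patch)} non-overlapping patches")

# Overlapping 16 x 16 patches with stride 10 in both dimensions.
print("patch=16, stride=10:", num_patches(128, 1024, 16, stride=10))
```

Larger patches shorten the token sequence seen by the transformer but summarise each region of the spectrogram more coarsely, which is one plausible reason why very large patches may hurt performance on some datasets.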

6. Conclusion

In conclusion, our study introduces 3M-Health, a multimodal multi-teacher knowledge distillation model for mental health detection, and evaluates it comprehensively on four datasets. Our experiments demonstrate that the multimodal approach outperforms its unimodal counterparts, with the choice of modalities influencing performance across diverse datasets. Notably, incorporating audio-based information proves particularly beneficial for post-based mental health detection on Reddit-based datasets, emphasising the importance of selecting modalities according to the nature of the data. Overall, our work contributes insights into the dynamics of multimodal knowledge distillation for mental health detection, offering a promising avenue for future research in this critical domain.

References

Ansari et al. (2021) Gunjan Ansari, Muskan Garg, and Chandni Saxena. 2021. Data Augmentation for Mental Health Classification on Social Media. In Proceedings of the 18th International Conference on Natural Language Processing (ICON). NLP Association of India (NLPAI), National Institute of Technology Silchar, Silchar, India, 152–161. https://aclanthology.org/2021.icon-main.19
Ao et al. (2022) Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, and Furu Wei. 2022. SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, Dublin, Ireland, 5723–5738. https://doi.org/10.18653/v1/2022.acl-long.393
Aragón et al. (2023) Mario Ezra Aragón, Adrian Pastor López-Monroy, Luis Carlos González-Gurrola, and Manuel Montes-y Gómez. 2023. Detecting Mental Disorders in Social Media Through Emotional Patterns - The Case of Anorexia and Depression. IEEE Transactions on Affective Computing 14, 1 (2023), 211–222. https://doi.org/10.1109/TAFFC.2021.3075638
Benton et al. (2017) Adrian Benton, Glen Coppersmith, and Mark Dredze. 2017. Ethical Research Protocols for Social Media Health Research. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, Dirk Hovy, Shannon Spruit, Margaret Mitchell, Emily M. Bender, Michael Strube, and Hanna Wallach (Eds.). Association for Computational Linguistics, Valencia, Spain, 94–102. https://doi.org/10.18653/v1/W17-1612
Cabral et al. (2024) Rina Carines Cabral, Soyeon Caren Han, Josiah Poon, and Goran Nenadic. 2024. MM-EMOG: Multi-Label Emotion Graph Representation for Mental Health Classification on Social Media. Robotics 13, 3 (2024), 53.
Cambria et al. (2022) Erik Cambria, Qian Liu, Sergio Decherchi, Frank Xing, and Kenneth Kwok. 2022. SenticNet 7: A commonsense-based neurosymbolic AI framework for explainable sentiment analysis. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. 3829–3839.
Cao et al. (2022) Lei Cao, Huijun Zhang, and Ling Feng. 2022. Building and Using Personal Knowledge Graph to Improve Suicidal Ideation Detection on Social Media. IEEE Transactions on Multimedia 24 (2022), 87–102. https://doi.org/10.1109/TMM.2020.3046867
Cao et al. (2019) Lei Cao, Huijun Zhang, Ling Feng, Zihan Wei, Xin Wang, Ningyun Li, and Xiaohao He. 2019. Latent Suicide Risk Detection on Microblog via Suicide-Oriented Word Embeddings and Layered Attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 1718–1728. https://doi.org/10.18653/v1/D19-1181
Chen et al. (2021) Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. 2021. Cross-layer distillation with semantic calibration. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 7028–7036.
Collins et al. (2023) Hanne K Collins, Julia A Minson, Ariella Kristal, and Alison Wood Brooks. 2023. Conveying and detecting listening during live conversation. Journal of Experimental Psychology: General (2023).
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Garg (2023) Muskan Garg. 2023. Mental health analysis in social media posts: a survey. Archives of Computational Methods in Engineering 30, 3 (2023), 1819–1842.
Gong et al. (2021) Yuan Gong, Yu-An Chung, and James Glass. 2021. AST: Audio Spectrogram Transformer. In Proc. Interspeech 2021. 571–575. https://doi.org/10.21437/Interspeech.2021-698
Han et al. (2022) Soyeon Caren Han, Zihan Yuan, Kunze Wang, Siqu Long, and Josiah Poon. 2022. Understanding Graph Convolutional Networks for Text Classification. arXiv preprint arXiv:2203.16060 (2022).
Haque et al. (2021) Ayaan Haque, Viraaj Reddi, and Tyler Giallanza. 2021. Deep Learning for Suicide and Depression Identification with Unsupervised Label Correction. In Artificial Neural Networks and Machine Learning – ICANN 2021, Igor Farkaš, Paolo Masulli, Sebastian Otte, and Stefan Wermter (Eds.). Springer International Publishing, Cham, 436–447.
Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
Ito and Johnson (2017) Keith Ito and Linda Johnson. 2017. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/.
Iyer et al. (2022) Ravi Iyer, Maja Nedeljkovic, and Denny Meyer. 2022. Using Vocal Characteristics To Classify Psychological Distress in Adult Helpline Callers: Retrospective Observational Study. JMIR Formative Research 6, 12 (Dec. 2022), e42249. https://doi.org/10.2196/42249
Ji et al. (2021) Shaoxiong Ji, Tianlin Zhang, Luna Ansari, Jie Fu, Prayag Tiwari, and Erik Cambria. 2021. MentalBERT: Publicly available pretrained language models for mental healthcare. arXiv preprint arXiv:2110.15621 (2021).
Jiao et al. (2020) Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for Natural Language Understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 4163–4174. https://doi.org/10.18653/v1/2020.findings-emnlp.372
Kabir et al. (2023) Mohsinul Kabir, Tasnim Ahmed, Md. Bakhtiar Hasan, Md Tahmid Rahman Laskar, Tarun Kumar Joarder, Hasan Mahmud, and Kamrul Hasan. 2023. DEPTWEET: A typology for social media texts to detect depression severities. Computers in Human Behavior 139 (2023), 107503. https://doi.org/10.1016/j.chb.2022.107503
Kipf and Welling (2016) Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. https://doi.org/10.48550/ARXIV.1609.02907
Lara et al. (2021) Juan S. Lara, Mario Ezra Aragón, Fabio A. González, and Manuel Montes-y Gómez. 2021. Deep Bag-of-Sub-Emotions for Depression Detection in Social Media. In Text, Speech, and Dialogue, Kamil Ekštein, František Pártl, and Miloslav Konopík (Eds.). Springer International Publishing, Cham, 60–72.
Li et al. (2020) Kang Li, Lequan Yu, Shujun Wang, and Pheng-Ann Heng. 2020. Towards cross-modality medical image segmentation with online mutual knowledge distillation. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 775–783.
Lin et al. (2020) Chenhao Lin, Pengwei Hu, Hui Su, Shaochun Li, Jing Mei, Jie Zhou, and Henry Leung. 2020. SenseMood: Depression Detection on Social Media. In Proceedings of the 2020 International Conference on Multimedia Retrieval (Dublin, Ireland) (ICMR ’20). Association for Computing Machinery, New York, NY, USA, 407–411. https://doi.org/10.1145/3372278.3391932
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
Long et al. (2022) Siqu Long, Rina Cabral, Josiah Poon, and Soyeon Caren Han. 2022. A Quantitative and Qualitative Analysis of Suicide Ideation Detection using Deep Learning. https://doi.org/10.48550/ARXIV.2206.08673
Mirzadeh et al. (2020) Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. 2020. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 5191–5198.
Ni et al. (2022) Jianyuan Ni, Raunak Sarbajna, Yang Liu, Anne HH Ngu, and Yan Yan. 2022. Cross-modal knowledge distillation for Vision-to-Sensor action recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4448–4452.
O’Dea et al. (2015) Bridianne O’Dea, Stephen Wan, Philip J. Batterham, Alison L. Calear, Cecile Paris, and Helen Christensen. 2015. Detecting suicidality on Twitter. Internet Interventions 2, 2 (2015), 183–188. https://doi.org/10.1016/j.invent.2015.03.005
Pham et al. (2023) Cuong Pham, Tuan Hoang, and Thanh-Toan Do. 2023. Collaborative Multi-Teacher Knowledge Distillation for Learning Low Bit-width Deep Neural Networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 6435–6443.
Pirina and Çöltekin (2018) Inna Pirina and Çağrı Çöltekin. 2018. Identifying Depression on Reddit: The Effect of Training Data. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task. Association for Computational Linguistics, Brussels, Belgium, 9–12. https://doi.org/10.18653/v1/W18-5903
Ravanelli et al. (2021) Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, et al. 2021. SpeechBrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624 (2021).
Ren et al. (2021) Lu Ren, Hongfei Lin, Bo Xu, Shaowu Zhang, Liang Yang, and Shichang Sun. 2021. Depression Detection on Reddit With an Emotion-Based Attention Network: Algorithm Development and Validation. JMIR Med Inform 9, 7 (16 Jul 2021), e28754. https://doi.org/10.2196/28754
Sawhney et al. (2021a) Ramit Sawhney, Harshit Joshi, Lucie Flek, and Rajiv Ratn Shah. 2021a. PHASE: Learning Emotional Phase-aware Representations for Suicide Ideation Detection on Social Media. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, Online, 2415–2428. https://doi.org/10.18653/v1/2021.eacl-main.205
Sawhney et al. (2021b) Ramit Sawhney, Harshit Joshi, Saumya Gandhi, and Rajiv Ratn Shah. 2021b. Towards Ordinal Suicide Ideation Detection on Social Media. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (Virtual Event, Israel) (WSDM ’21). Association for Computing Machinery, New York, NY, USA, 22–30. https://doi.org/10.1145/3437963.3441805
Shen et al. (2018) Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. 2018. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4779–4783. https://doi.org/10.1109/ICASSP.2018.8461368
Tadesse et al. (2019) Michael M. Tadesse, Hongfei Lin, Bo Xu, and Liang Yang. 2019. Detection of Depression-Related Posts in Reddit Social Media Forum. IEEE Access 7 (2019), 44883–44893. https://doi.org/10.1109/ACCESS.2019.2909180
Tan et al. (2018) Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2018. Multilingual Neural Machine Translation with Knowledge Distillation. In International Conference on Learning Representations.
Vongkulbhisal et al. (2019) Jayakorn Vongkulbhisal, Phongtharin Vinayavekhin, and Marco Visentini-Scarzanella. 2019. Unifying heterogeneous classifiers with distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3175–3184.
Wang et al. (2023) Guangyu Wang, Xiaohong Liu, Zhen Ying, Guoxing Yang, Zhiwei Chen, Zhiwen Liu, Min Zhang, Hongmei Yan, Yuxing Lu, Yuanxu Gao, et al. 2023. Optimized glycemic control of type 2 diabetes with reinforcement learning: A proof-of-concept trial. Nature Medicine 29, 10 (2023), 2633–2642. https://doi.org/10.1038/s41591-023-02552-9
Wu et al. (2021) Chuhan Wu, Fangzhao Wu, and Yongfeng Huang. 2021. One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for Computational Linguistics, Online, 4408–4413. https://doi.org/10.18653/v1/2021.findings-acl.387
Yang and Xu (2021) Lehan Yang and Kele Xu. 2021. Cross modality knowledge distillation for multi-modal aerial view object classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 382–387.
Zen et al. (2019) H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu. 2019. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. In Proc. Interspeech. https://doi.org/10.21437/Interspeech.2019-2441
Zogan et al. (2022) Hamad Zogan, Imran Razzak, Xianzhi Wang, Shoaib Jameel, and Guandong Xu. 2022. Explainable depression detection with multi-aspect features using a hybrid deep learning model on social media. World Wide Web 25, 1 (2022), 281–304.