MASIVE: Open-Ended Affective State Identification in English and Spanish

Nicholas Deas, Elsbeth Turcan, Iván Pérez Mejía, Kathleen McKeown

Columbia University, Department of Computer Science
Correspondence: [ndeas,eturcan,kathy]@cs.columbia.edu

Abstract

In the field of emotion analysis, much NLP research focuses on identifying a limited number of discrete emotion categories, often applied across languages. These basic sets, however, are rarely designed with textual data in mind, and culture, language, and dialect can influence how particular emotions are interpreted. In this work, we broaden our scope to a practically unbounded set of affective states, which includes any terms that humans use to describe their experiences of feeling. We collect and publish MASIVE, a dataset of Reddit posts in English and Spanish containing over 1,000 unique affective states each. We then define the new problem of affective state identification for language generation models framed as a masked span prediction task. On this task, we find that smaller finetuned multilingual models outperform much larger LLMs, even on region-specific Spanish affective states. Additionally, we show that pretraining on MASIVE improves model performance on existing emotion benchmarks. Finally, through machine translation experiments, we find that native speaker-written data is vital to good performance on this task.


1 Introduction

In the field of emotion analysis, much NLP research focuses on identifying a limited number of discrete emotion categories, typically using basic emotion sets from the field of psychology (Plaza-del Arco et al., 2024). These basic emotion sets are rarely designed with textual expression in mind (e.g., Ekman 1984, whose model defines basic emotions by the recognizability of facial expressions), and very little research examines the validity of adapting these sets to textual data.

Emotion analysis furthermore relies on largely the same emotion categories across languages, including, in some cases, simply translating resources such as lexicons from one language into another (Mohammad, 2018) or translating finetuning and evaluation data (Isbister et al., 2021; Kathunia et al., 2024). Previous research has also shown that existing multilingual models encode meaning in an Anglocentric way (Havaldar et al., 2023). As recent studies have found that culture and language influence the particular meaning of emotional terms like "love" (Heyes, 2019), models that fail to understand this cultural context or rely on mainstream dialects may also fail to capture the nuances of an author’s emotional expression (Deas et al., 2023).

Figure 1: Paraphrased input and expected output examples from MASIVE in English and Spanish. Models are tasked with predicting affective states (highlighted), which reflect more nuanced feelings than label sets in prior work, such as the Ekman basic emotions.

In this work, we argue for a descriptive approach to emotion analysis. We broaden our scope from a small set of basic emotions to a practically unbounded set of affective states, which includes any terms that humans use to describe their experiences of feeling, including emotions, moods, and figurative expressions of feelings (e.g. "blue" as an expression of sadness instead of the color) (VandenBos, 2007). We then define the new problem of affective state identification (ASI), which is a targeted masked span prediction task: given a text description of an emotional experience, we train models to produce single-word affective states that correspond to the description. These affective states may include common emotion categories such as happy or sad, but they also allow us to incorporate nuance and intensity (e.g., elated, calm, jealous, lonely, etc.) as well as other classifications that are not typically considered emotions such as moods (e.g., longer-term feelings of being motivated or stuck).

We collect MASIVE: Multilingual Affective State Identification with Varied Expressions, a new benchmark dataset for affective state identification using Reddit data. We use a bootstrapping procedure to discover new affective state labels and collect posts containing natural emotional expressions in English and Spanish, yielding over 1,600 unique affective state labels in English and over 1,000 in Spanish (our dataset, code, and model checkpoints will be made publicly available upon publication). We evaluate our data collection methods with human annotation, finding that 88% and 72% of our automatically collected English and Spanish labels, respectively, reflect affective states, and document unique features of the datasets including negations and, in Spanish, grammatical gender. We then use this dataset to evaluate the performance of several commonly-used generative models, finding that small finetuned models generally outperform LLMs. Beyond ASI, we experiment with using our corpora as pretraining data and show that MASIVE incorporates knowledge that generalizes to existing emotion detection benchmarks. Finally, we assess finetuning and evaluating models on machine-translated data and find that original texts from native speakers are essential for performing ASI.

Our contributions in this work are as follows:

1. We introduce a novel benchmark for affective state identification with language generation models, including a significantly larger label set than prior related benchmarks;
2. We benchmark multilingual models and show that smaller, finetuned models outperform current LLMs on this dataset;
3. We analyze the behavior and performance of models on region-specific affective language, gendered language, and negations; and
4. We empirically argue that both finetuning and evaluating on texts authored by native speakers are vital for capturing nuances in multilingual affective writing.

2 Data

2.1 MASIVE Corpus

Our goal is to collect data representing the broad set of ways in which humans describe their feelings. We refer to these expressions as affective states (VandenBos, 2007); this is an umbrella term incorporating multiple kinds of feelings such as emotions and moods. We collect texts with expressions of affective states from Reddit (via the PullPush API at https://pullpush.io/) using a bootstrapping procedure. Beginning with the adjective forms of the Ekman emotions, we search for texts containing forms of “I feel <affect> and…”, “I am feeling <affect> and…”, where <affect> is replaced with each emotion term. Notably, we also search for “I don’t feel <affect> and…” and “I am not feeling <affect> and…” to better capture the diversity of ways in which authors can express feelings. We extract affective state terms that follow the “and” from the retrieved posts to form a new set of search phrases with these terms. We repeat these steps, expanding the pool of query affective states in each round. Our primary assumption is that any adjective conjuncts of the query emotion term are also affective states, regardless of whether they are canonical emotion terms. For example, if "happy" was used to query the text "I feel happy and excited," the term "excited" is both an adjective and a conjunct; the same is true of “light” in “I feel happy and light”. In contrast, in "I feel happy and want to smile", "want" is a verb and would not be considered an affective state. We evaluate this assumption in subsection 2.2.
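The bootstrapping loop described above can be sketched in a few lines. This is a minimal illustration, not the collection pipeline itself: it runs over an in-memory list of posts rather than querying the PullPush API, and the small `ADJECTIVES` set is a hypothetical stand-in for whatever part-of-speech check decides that a conjunct is an adjective. The function names `extract_conjuncts` and `bootstrap` are likewise illustrative.

```python
import re

# Hypothetical stand-in for a part-of-speech tagger: a tiny adjective lexicon.
# A real pipeline would tag the conjunct in context instead.
ADJECTIVES = {"happy", "sad", "excited", "light", "lonely", "tired", "calm"}

# English query templates from the text, including the negated forms.
TEMPLATES = [
    r"\bI (?:don't )?feel {affect} and (\w+)",
    r"\bI am (?:not )?feeling {affect} and (\w+)",
]


def extract_conjuncts(post, affect):
    """Return adjective conjuncts that follow '<affect> and' in one post."""
    found = []
    for template in TEMPLATES:
        pattern = re.compile(
            template.format(affect=re.escape(affect)), re.IGNORECASE
        )
        for match in pattern.finditer(post):
            candidate = match.group(1).lower()
            if candidate in ADJECTIVES:  # keep adjectives only
                found.append(candidate)
    return found


def bootstrap(posts, seeds, rounds=4):
    """Expand the pool of query terms for a fixed number of rounds."""
    known = set(seeds)
    frontier = set(seeds)
    for _ in range(rounds):
        discovered = set()
        for affect in frontier:
            for post in posts:
                discovered.update(extract_conjuncts(post, affect))
        frontier = discovered - known  # only query terms not yet seen
        if not frontier:
            break
        known |= frontier
    return known


posts = [
    "I feel happy and excited about the move.",
    "I feel happy and want to smile.",
    "I am not feeling excited and calm today.",
    "I don't feel calm and lonely at night.",
]
print(sorted(bootstrap(posts, seeds={"happy"})))
```

Starting from the single seed "happy", each round uses the terms discovered in the previous round as new queries, while "want" is rejected because it is not in the adjective lexicon.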

In Spanish, we conduct the same procedure using forms of “Estoy <affect> y…”, “Me siento <affect> y…”, and “Estoy sintiendo <affect> y…”. We also seed the process with the most common Spanish translations of the Ekman emotions on Reddit (see Appendix B). Additionally, as Spanish includes both masculine and feminine forms for some terms, we search for both forms where applicable. Finally, we also collect a challenge set including affective state labels associated with regional Spanish varieties, hand-selected by a native Spanish-speaker, to evaluate models’ abilities to generalize to less-represented dialects (see Appendix B).

For both English and Spanish, we run 4 rounds of bootstrapping; for the regional Spanish terms, we run only a single round to avoid introducing non-regional terms. 15 affective states were randomly sampled from both datasets, and all posts containing those 15 affective states were reserved as part of each test set to evaluate models on unseen emotions. Summary statistics describing the English and Spanish splits as well as the regional Spanish challenge set are included in Table 1, and we include a Data Statement in Appendix A.
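As a rough sketch of how the unseen challenge set could be carved out, the snippet below samples a fixed number of labels and reserves every example that mentions any of them. The data layout (a list of `(text, labels)` pairs), the function name `split_unseen`, and the seed are assumptions made for illustration only.

```python
import random


def split_unseen(examples, n_holdout=15, seed=0):
    """Reserve every example that mentions a held-out affective state."""
    labels = sorted({label for _, gold in examples for label in gold})
    rng = random.Random(seed)
    held_out = set(rng.sample(labels, min(n_holdout, len(labels))))
    seen, unseen = [], []
    for text, gold in examples:
        # An example goes to the unseen set if any of its labels is held out.
        (unseen if held_out & set(gold) else seen).append((text, gold))
    return seen, unseen, held_out


examples = [
    ("t1", ["happy"]),
    ("t2", ["sad", "tired"]),
    ("t3", ["tired"]),
    ("t4", ["calm"]),
]
seen, unseen, held_out = split_unseen(examples, n_holdout=1, seed=0)
print(held_out, len(seen), len(unseen))
```

Because a post can carry several labels, holding out a label removes every post that contains it, which is why the challenge sets in Table 1 have more affective states per text than the regular splits would suggest for 15 labels alone.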

Lang	Split	Size	Input Length	# AS/ Text	# Unique AS
En	Train	93,736	310.99	1.11	1,627
	Test	10,049	306.61	1.13	775
	Chal	4,720	306.87	1.33	15
Es	Train	32,917	154.22	1.06	1,022
	Test	6,232	161.29	1.08	730
	Chal	1,557	165.64	1.24	15
	Reg	559	233.95	1.07	59

Table 1: Summary statistics of English and Spanish MASIVE. Text lengths are measured in mT5 tokens. AS = Affective State; Chal = unseen challenge set

2.2 Data Analysis

To validate the assumptions of our bootstrapping procedure and examine how affective states are used in our dataset, we collect human evaluations of the automatically identified affective states. Judgments are conducted by 2 native Spanish-speakers in Iberian Culture studies and 2 native English-speakers in Psychology for Spanish and English respectively. We randomly sample 250 texts from each language’s test set for evaluation such that 50 texts are shared by each pair. Annotators are provided with a full Reddit post with a single automatically-identified affective state highlighted. We ask annotators to judge the term in context on 3 dimensions, beginning with whether the highlighted term reflects an affective state. If a term is judged to reflect an affective state, annotators are asked to judge whether the highlighted term better reflects an emotion or a mood (we distinguish emotions, shorter-term feelings triggered by identifiable events, from moods, longer-term feelings not necessarily triggered by an event) and whether the highlighted term is used figuratively (e.g., "blue") or literally (e.g., "sad"). All 3 dimensions are judged on 4-point Likert scales where higher values mean the term primarily reflects an affective state, an emotion, and a literal usage, respectively. Annotators achieved moderate agreement in English ($\kappa=.51$) and substantial agreement in Spanish ($\kappa=.69$). Additional details concerning human annotations are included in Appendix B.

Additionally, we analyze 2 aspects of our dataset that differentiate it from prior emotion detection benchmarks. First, because Spanish is a language with grammatical gender for adjectives, part of the affective state prediction problem in MASIVE includes choosing whether to use the masculine or feminine form in the context of the input. Second, authors in natural settings may also tend to express their feelings by stating how they do not feel (e.g., “I’m not happy, but….”), and we specifically include negations to test models’ capability to contend with this construction in both English and Spanish.

The results of the aforementioned data annotations as well as the automatically extracted statistics are included in Table 2. Human annotation results are reported as the percentage of affective states within the sample; for negations and grammatical gender, we report the percentage of texts in our datasets that include any target negations or any feminine adjectives (recall that a single datapoint may have multiple labels joined by "and"). Large majorities (88% and 72% in English and Spanish respectively) of terms were judged to reflect affective states, validating the contents of MASIVE.

	Human-Annotated			Automatic	
Lang	Aff	Fig	Emo	Neg	Fem
En	88.4%	58.8%	34.2%	7.75%	n/a
Es	71.5%	38.5%	18.5%	27.0%	28.0%

Table 2: Human and automatic analysis of how affective states in the English and Spanish datasets are used in context. Aff: Affective State; Fig: Figurative; Emo: Emotion; Neg: Negations; Fem: Feminine Form

2.3 Fixed-Label Set Data

We additionally evaluate the performance of MASIVE-finetuned models on two previously published datasets in both English and Spanish. A key distinction from MASIVE is that these datasets feature limited label sets; we describe our evaluation procedures in subsection 3.2. In English, we evaluate on GoEmotions (Demszky et al., 2020), a commonly-used emotion dataset consisting of Reddit comments; it is originally labeled with 27 distinct emotion categories, though the authors also relabel the data with the Ekman basic emotions. We additionally evaluate on EmoEvent (Plaza del Arco et al., 2020), a dataset with both English and Spanish subsets of Tweets (among other languages) also labeled with the Ekman set.

2.4 Machine-Translated Data

Finally, we conduct 2 cross-lingual experiments expanding on prior work investigating the use of machine translation and high-resource language models for inference on lower-resource languages (Isbister et al., 2021; Kathunia et al., 2024). In contrast to prior findings, however, we hypothesize that translating neither the training data nor the evaluation data will enable performance competitive with models trained on native data. First, using our natural test sets, we evaluate models finetuned on translated data. Second, we evaluate the performance of our native-trained models on translated data, mimicking the translation of lower-resource language data for inference with a model trained on a higher-resource language.

In both settings, we use bilingual Opus-MT models (Tiedemann and Thottingal, 2020) to independently translate the input documents and target affective state labels. We select Opus-MT models as they are accessible, open-source models, reflecting resources that may be used for large scale translation, and are utilized in experiments in Kathunia et al. (2024). Throughout the experiments, models finetuned on translated data are denoted _Tr. Test sets generated through machine translation are similarly denoted as En_Tr and Es_Tr.

3 Experimental Configuration

3.1 Models

We experiment with finetuning small language models on our original and machine-translated data. We also perform experiments with two Large Language Models (LLMs) in a zero-shot setting.

Finetuned Generative Models.

Most of our models are based on mT5-Large (Xue et al., 2021a). During finetuning and prediction on MASIVE, we mask the automatically identified affective state words wherever they appear and task models to fill them, mimicking mT5’s initial pretraining. We additionally experiment with T5-large (Raffel et al., 2019) for English only, as no comparable monolingual T5 checkpoint for Spanish has been made publicly available. In the results, models’ superscripts denote that a model was finetuned on our English (T5^En and mT5^En) or Spanish (mT5^Es) corpus.
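To make the masking step concrete, the sketch below replaces each affective state occurrence with a T5-style sentinel token (`<extra_id_0>`, `<extra_id_1>`, and so on, which are the sentinel names used by T5 and mT5 tokenizers) and builds the corresponding target string. The exact target layout, including whether a closing sentinel is appended, is an assumption here, and `mask_affective_states` is an illustrative name.

```python
import re


def mask_affective_states(text, affective_states):
    """Replace each affective-state occurrence with a T5-style sentinel."""
    if not affective_states:
        return text, ""
    # Longest terms first so that a longer term is never split by a shorter one.
    ordered = sorted(affective_states, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    targets = []

    def substitute(match):
        # Record the masked term and emit the next sentinel in sequence.
        targets.append(match.group(0))
        return f"<extra_id_{len(targets) - 1}>"

    masked = pattern.sub(substitute, text)
    target = " ".join(f"<extra_id_{i}> {t}" for i, t in enumerate(targets))
    return masked, target


masked, target = mask_affective_states(
    "I feel happy and excited, but not happy enough.", {"happy", "excited"}
)
print(masked)
print(target)
```

Every occurrence is masked, including repeats, so the second "happy" receives its own sentinel rather than reusing the first one.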

Large Language Models.

We evaluate two modern, open-source LLMs, Llama-3 (https://llama.meta.com/llama3) and Mixtral-Instruct (Jiang et al., 2024), as these models have been specifically evaluated in multilingual settings. We instruct these models to perform the same masked token prediction task as mT5 (see Appendix C). Due to context window constraints and input lengths, LLMs are evaluated in a zero-shot setting. Further checkpoint and generation hyperparameter details are included in Appendix C.

3.2 Metrics

3.2.1 MASIVE Evaluation

We report top-k accuracy for our models with $k\in\{1,3,5\}$, along with two generative metrics: the negative log-likelihood (NLL) of the gold affective state and the model’s log perplexity. As some samples in the datasets have multiple labels, we calculate top-k accuracy at the sample level using beam search and report average sample-level scores. In Spanish, if the gendered form of the prediction does not match that of the gold term (e.g. enojado vs. enojada), the prediction is considered incorrect, but the similarity of the prediction in these cases is captured by the top-k similarity metric, which we describe below.
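One plausible reading of sample-level top-k accuracy is sketched below. It assumes each beam hypothesis has already been parsed into one predicted term per masked slot, scores a sample as the fraction of slots whose gold term appears in that slot of any top-k hypothesis, and averages over samples. The parsing step, the per-slot scoring, and the function names are assumptions for illustration, not the evaluation code itself.

```python
def sample_topk_accuracy(beams, gold, k):
    """Fraction of masked slots whose gold term appears among the top-k beams.

    `beams` is ordered from most to least likely; each hypothesis is a list
    holding one predicted term per masked slot.
    """
    top = beams[:k]
    hits = sum(
        any(len(hyp) > slot and hyp[slot] == term for hyp in top)
        for slot, term in enumerate(gold)
    )
    return hits / len(gold)


def topk_accuracy(all_beams, all_gold, k):
    """Average the sample-level scores over the dataset."""
    scores = [sample_topk_accuracy(b, g, k) for b, g in zip(all_beams, all_gold)]
    return sum(scores) / len(scores)


all_beams = [
    [["happy", "excited"], ["glad", "excited"], ["sad", "tired"]],
    [["enojado"], ["enojada"], ["triste"]],
]
all_gold = [["glad", "excited"], ["enojada"]]

for k in (1, 3):
    print(k, topk_accuracy(all_beams, all_gold, k))
```

Matching is exact, so in the second sample the top-1 prediction "enojado" does not count as correct for the gold "enojada"; the feminine form is only credited once it appears within the top k.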

Top-k Similarity.

Because our label set is very large, we also report a measure of similarity between the model’s top predictions and the gold. Here, we rely on contextual embeddings using multilingual, pre-trained BERT-base (Devlin et al., 2019). To ensure that the similarity model encodes affective senses of each term, we embed the predicted and gold emotion terms within 100-token contexts from the original post and calculate cosine similarity between them. We report the maximum similarity of these contextual embeddings when looking at the top 1, 3, and 5 most likely model predictions. Full details are available in Appendix D.
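The top-k similarity computation reduces to a maximum over cosine similarities, as in the sketch below. The vectors are toy values; in practice they would be contextual embeddings of the predicted and gold terms, and how those embeddings are extracted from the encoder is left out here. The function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topk_similarity(pred_vectors, gold_vector, k):
    """Maximum similarity between the gold and any of the top-k predictions."""
    return max(cosine(p, gold_vector) for p in pred_vectors[:k])


# Toy embeddings, ordered from most to least likely prediction.
gold = [1.0, 0.0]
preds = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

for k in (1, 2, 3):
    print(k, round(topk_similarity(preds, gold, k), 4))
```

Because the metric takes a maximum, it can only stay the same or increase as k grows, which matches the monotone Sim@1, Sim@3, Sim@5 columns in the results tables.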

Lang	Model	NLL $\downarrow$	Log Perp $\downarrow$	Acc@1 $\uparrow$	Acc@3 $\uparrow$	Acc@5 $\uparrow$	Sim@1 $\uparrow$	Sim@3 $\uparrow$	Sim@5 $\uparrow$
En	T5^En	23.04	7.11	17.9%	27.3%	32.5%	0.556	0.665	0.711
	mT5^En	25.11	12.40	12.4%	17.5%	20.1%	0.461	0.525	0.552
	Llama-3	64.39	48.85	2.2%	3.4%	4.0%	0.436	0.470	0.487
	Mixtral	1.27		7.7%	9.3%	10.8%	0.474	0.500	0.523
Es	mT5^Es	7.14	7.05	20.1%	29.6%	34.4%	0.490	0.593	0.637
	Llama-3	80.56	63.87	2.3%	3.1%	3.6%	0.435	0.468	0.482
	Mixtral	2.02		7.4%	9.5%	11.2%	0.445	0.477	0.502

Table 3: Comparison of T5, mT5, and two LLMs on our proposed Reddit dataset, aggregated scores only. Note that the Spanish test set and the English test set are not directly comparable as noted in subsection 3.2. Bolded scores highlight the best-performing multilingual model.

Dataset	Model	P	R	F1	Acc@1	Acc@3	Acc@5	Sim@1	Sim@3	Sim@5
GoEmotions (27)	mT5	22.45	2.53	0.97	2.5%	12.9%	23.5%	0.525	0.614	0.670
	mT5^MAS	35.91	7.06	5.28	5.0%	9.6%	13.6%	0.512	0.579	0.609
GoEmotions (7)	mT5	28.28	38.63	23.53	38.5%	70.7%	86.0%	0.736	0.884	0.946
	mT5^MAS	73.42	28.74	27.79	23.8%	37.2%	45.1%	0.663	0.751	0.783
EmoEvent (En)	mT5	6.01	10.48	2.21	10.5%	72.2%	93.3%	0.638	0.884	0.972
	mT5^MAS	68.76	35.73	36.90	32.8%	60.5%	72.9%	0.719	0.858	0.903
EmoEvent (Es)	mT5	19.53	23.63	11.04	23.5%	64.1%	88.2%	0.721	0.869	0.953
	mT5^MAS	31.90	31.83	26.83	32.2%	79.0%	85.6%	0.730	0.906	0.938

Table 4: Performance of mT5 finetuned on emotion classification datasets, with and without prior finetuning on MASIVE. Bolded scores highlight the best performing model on each dataset under each metric.

3.2.2 Fixed-Label Set Evaluation

To evaluate how well our dataset imbues models with general emotional knowledge, we evaluate two variants of mT5: first, mT5 finetuned only on existing emotion benchmarks, and second, mT5 finetuned on MASIVE followed by existing benchmarks (denoted with superscript ^MAS).

To adapt the evaluation sets to our generative setting, we append "I feel <extra_id_0>" to the end of each input to match the format of our evaluation on MASIVE (see Figure 1), using adjective forms of the gold emotion labels. In this setting, we report top-k accuracy and similarity as we do for MASIVE. Additionally, to adapt our models to the fixed-label set setting, we sort the fixed set of emotion labels by their likelihood according to the model and select the most probable emotion label as the prediction. For these experiments, we report macro precision, recall, and F1 score.
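The fixed-label procedure can be illustrated as follows: each candidate label receives a model score, the highest-scoring label becomes the prediction, and macro precision, recall, and F1 are averaged over labels. The scores below are made-up log-likelihoods, and averaging over the union of gold and predicted labels is an assumption of this sketch.

```python
def predict_label(label_scores):
    """Pick the label with the highest model score (e.g. log-likelihood)."""
    return max(label_scores, key=label_scores.get)


def macro_prf(gold, pred):
    """Macro-averaged precision, recall, and F1 over all observed labels."""
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(2 * precision * recall / denom if denom else 0.0)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical per-input log-likelihoods for three candidate labels.
scored = [
    {"happy": -1.2, "sad": -3.0, "angry": -2.5},
    {"happy": -2.0, "sad": -0.7, "angry": -1.9},
    {"happy": -0.9, "sad": -1.5, "angry": -1.1},
]
gold = ["happy", "sad", "angry"]
pred = [predict_label(s) for s in scored]
print(pred, macro_prf(gold, pred))
```

Macro averaging weights every label equally, so a label that is never predicted (here "angry") pulls all three scores down regardless of how rare it is.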

4 Results

4.1 MASIVE Evaluation

Table 3 presents the performance metrics for finetuned mT5, Llama-3, and Mixtral on our English and Spanish test sets, as well as finetuned T5 for the English test set only. Among multilingual models, mT5 outperforms both LLMs on top-k accuracy for both languages (Takeaway #1), despite having drastically fewer parameters. Llama-3 occasionally refuses to make a prediction if the content discussed is sensitive (e.g., drug use); results with invalid responses filtered out are included in Appendix E. Additionally, mT5 achieves the highest top-k similarity scores among multilingual models, except for top-1 similarity in English, where Mixtral scores slightly higher (0.474 vs. 0.461). Between the LLMs, Mixtral tends to outperform Llama-3. This performance difference may be explained by the difference in size between models, as well as the fact that multilingual data was upsampled in Mixtral’s pretraining compared to prior models.

In English, the large variant of T5 has been shown to slightly outperform mT5 (Xue et al., 2021b). We find a difference in the same direction, but the performance gap on ASI is considerably larger. Because the remaining experiments include Spanish data, we focus on mT5. We note, however, that dedicated monolingual English models may offer significantly higher performance on ASI (Takeaway #2) and leave further exploration of the differences between monolingual and multilingual models to future work.

While the differences in language and content of the English and Spanish datasets prevent us from making conclusions concerning their relative difficulty, Table 3 also shows that performance in Spanish tends to be higher than in English, despite the better representation of English in pre-training and larger size of the collected English data compared to Spanish. This trend could be due to the larger set of unique affective states in our English data than Spanish, with more nuanced affective states that may be difficult for models to predict accurately.

Lang	Model	Subset	Acc@1 $\uparrow$	Acc@3 $\uparrow$	Acc@5 $\uparrow$	Sim@1 $\uparrow$	Sim@3 $\uparrow$	Sim@5 $\uparrow$
En	T5^En	Seen	31.2%	46.0%	53.0%	0.620	0.742	0.788
	T5^En	Unseen	2.9%	6.2%	9.3%	0.484	0.578	0.623
	mT5^En	Seen	22.5%	31.8%	36.6%	0.503	0.581	0.612
	mT5^En	Unseen	1.0%	1.3%	1.5%	0.415	0.463	0.484
Es	mT5^Es	Seen	26.5%	39.0%	45.1%	0.522	0.632	0.677
Es	mT5^Es	Unseen	0.8%	1.5%	2.0%	0.394	0.476	0.519

Table 5: Comparison of T5 and mT5 performance between affective states included in (Seen) and held out from (Unseen) finetuning in English and Spanish.

Model	Acc@1 $\uparrow$	Acc@3 $\uparrow$	Acc@5 $\uparrow$	Sim@1 $\uparrow$	Sim@3 $\uparrow$	Sim@5 $\uparrow$
mT5^Es	3.9%	7.3%	9.8%	0.369	0.449	0.485
Llama-3	0.0%	0.1%	0.2%	0.366	0.391	0.401
Mixtral	0.1%	0.3%	0.4%	0.341	0.343	0.343

Table 6: Evaluation of Spanish-finetuned mT5, Llama-3, and Mixtral on region-specific Spanish affective states. Bolded metrics highlight the best-performing model.

4.2 Fixed-Label Set Evaluation

To evaluate the generalized emotion detection capabilities afforded by finetuning on MASIVE, Table 4 shows the performance of mT5 finetuned on existing English and Spanish emotion benchmarks, both with and without prior finetuning on MASIVE. First, when used as a classifier, we find that mT5 finetuned on MASIVE first achieves a higher macro-F1 for all datasets. This suggests that finetuning on our corpus gives models generalizable knowledge of emotions (Takeaway #3). Because our corpora contain many more affective state labels than the evaluation datasets, models finetuned on MASIVE will include more nuanced terms than basic emotions in the top-k predictions. So, as expected, models finetuned only on the emotion benchmarks achieve higher top-k accuracy and similarity scores, as they are more likely to predict terms within the smaller label sets. The top-k similarity scores for our models, however, remain high, suggesting that the generated affective states are similar to the ground truth basic emotion labels.

4.3 Unseen and Regional Set Evaluation

To analyze how well models generalize beyond affective states explicitly included in finetuning, we present performance metrics on seen and unseen affective states in both languages in Table 5. In both languages, all models perform considerably better on affective states included in the finetuning data than on unseen affective states. T5^En, however, maintains better performance on unseen affective states than mT5^En, suggesting that monolingual models may better generalize.

In addition to unseen affective states, we present evaluation results on a subset of Spanish affective states which are region-specific in Table 6. Similarly to results on the full Spanish data, finetuned mT5^Es outperforms both LLMs in top-k accuracy and similarity. Notably, the performance of mT5^Es on this regional subset is comparable to its performance on general unseen Spanish emotions (Table 5), while Llama-3 and Mixtral, which are not explicitly finetuned on our corpora, perform significantly worse on the regional subset than they do on the Spanish data as a whole (Table 3). Because top-k accuracy drops significantly on unseen and region-specific affective states (top-k similarity as well, though less so), future work in this area should prioritize a generalized understanding of affective states, including regionalisms (Takeaway #4).

Test Set	Model	NLL $\downarrow$	Log Perp $\downarrow$	Acc@1 $\uparrow$	Acc@3 $\uparrow$	Acc@5 $\uparrow$	Sim@1 $\uparrow$	Sim@3 $\uparrow$	Sim@5 $\uparrow$
En	mT5 ${}^{En}_{S}$	11.16	11.13	12.5%	19.6%	23.5%	0.460	0.551	0.590
En	mT5 ${}^{En}_{Tr}$	18.37	17.42	1.5%	3.0%	4.2%	0.358	0.452	0.491
Es_Tr	mT5^Es	56.06	46.30	2.0%	3.2%	3.9%	0.367	0.426	0.451
Es	mT5^Es	7.14	7.05	20.1%	29.6%	34.4%	0.490	0.593	0.637
Es	mT5 ${}^{Es}_{Tr,S}$	17.74	17.34	1.6%	3.0%	4.0%	0.341	0.420	0.458
En_Tr	mT5 ${}^{En}_{S}$	28.11	27.98	1.4%	3.1%	4.4%	0.391	0.477	0.513

Table 7: Comparison of mT5 models finetuned on the original data reflecting native use and translated data on our proposed Reddit dataset, aggregated scores only. All finetuning sets are randomly subset to the same size as the smallest set, the collected Spanish training set, and results are averaged across 5 different subsets (n = 32,917). Models with subsetted data are denoted with _S.
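The subsetting described in the caption can be sketched as drawing several equally sized random samples, one per seed, and finetuning on each. The seeds, the function name `equal_size_subsets`, and the toy dataset below are assumptions for illustration; in the experiments the target size is the Spanish training set size (n = 32,917).

```python
import random


def equal_size_subsets(dataset, size, n_subsets=5, base_seed=0):
    """Draw `n_subsets` random subsets of `size` items, one per seed."""
    subsets = []
    for i in range(n_subsets):
        rng = random.Random(base_seed + i)  # a separate seed per subset
        subsets.append(rng.sample(dataset, size))  # sampling without replacement
    return subsets


toy_dataset = list(range(100))
subsets = equal_size_subsets(toy_dataset, size=30)
print([len(s) for s in subsets])
```

Metrics are then computed once per subset and averaged, which keeps differences in training set size from being confounded with the effect of native versus translated data.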

4.4 Grammatical Gender and Negations

We break down the top-k accuracy and top-k similarity results for each model by grammatical gender and negations in Figure 2. We see again that mT5 outperforms both LLMs across all subsets, and that mT5 often places the gold label among the top 3 or 5 predictions if not the top 1. In particular, mT5^Es performs better on feminine adjectives than masculine adjectives or those with only a single form, and T5^En and mT5^Es perform better on negated targets than non-negated targets (mT5^En shows the same pattern for accuracy, though not similarity). Llama-3 and Mixtral achieve highest accuracy for masculine adjectives and highest similarity for single-form adjectives, while for negations, Llama-3 performs better on non-negations and Mixtral performs slightly better on negations. These results suggest that explicit training on MASIVE may improve performance specifically on unique features of generative ASI (Takeaway #5).

4.5 Machine-Translation vs. Natural Data

Finally, we evaluate the changes in performance first when using machine-translated finetuning data and alternatively when translating evaluation data in Table 7. First, we find an expected drop in performance when models are finetuned on machine-translated data for both English and Spanish. Interestingly, the drops in accuracy and similarity metrics in Spanish (90% and 28%, respectively) are notably larger than in English (77% and 14%). This could perhaps be explained by the translation model performing better in the Spanish to English direction than English to Spanish, as well as mT5’s ability to better generalize in English than in Spanish.

As an alternative approach to finetuning on translated data, we also consider the case where data may be translated at inference time. In these cases (En_Tr and Es_Tr in Table 7), we find that performance falls. Artifacts of machine translation have been found to impact evaluation of translation models (Freitag et al., 2020), and, similarly, errors and artifacts of unnatural translation may cause these changes in performance. In contrast to prior work suggesting that performance on the target data translated into English is comparable to finetuning on the target language for tasks such as sentiment detection, our results suggest that for our task, machine-translating the evaluation data leads to poorer performance, and translating at either training or inference time results in similar performance (Takeaway #6).

5 Related Work

Emotion Taxonomies.

Many different models of human emotion have been proposed, intending to capture the universal experience of different emotions across cultures. Some of the most notable categorical models in psychology and NLP research are the Ekman (1984, 2005) basic emotion set derived from facial expression and the Plutchik (1980) basic emotion set which assumes emotions occur in opposing pairs (e.g. joy and sadness), though other models exist (e.g., Ortony et al. 1988; Oatley and Johnson-laird 1987; Johnson-Laird and Oatley 1998; PS and Mahalakshmi 2017). Multiple different dimensional models have also been proposed, situating emotions in a space governed by features such as pleasantness and activation (Plutchik, 1980; Russell and Mehrabian, 1977; Russell, 1980; Bradley et al., 1992). Many such models of emotions have been frequently compared and evaluated in psychology and as they apply to emotion detection (see Rubin and Talarico 2009; PS and Mahalakshmi 2017; Lichtenstein et al. 2008).

Emotion and Language Generation.

Numerous approaches to automated emotion detection in text have been proposed, including emotion lexicons (Strapparava and Valitutti, 2004; Staiano and Guerini, 2014; Araque et al., 2018; Mohammad and Turney, 2010) and classification models (see Acheampong et al. 2020 for a review of approaches). Most of this work focuses on small, finite emotion sets, usually Ekman or Plutchik, though some prior work has used larger sets (Sintsova et al., 2013; Liew et al., 2016; Subasic and Huettner, 2001). Mohammad and Kiritchenko (2015) in particular collect data for a very large but still strictly limited set of emotions. More recently, language generation tasks have been proposed that call for models with greater emotional understanding, such as emotional dialogue generation (Ide and Kawahara, 2021; Song et al., 2019; Firdaus et al., 2020), controllable generation (Goswamy et al., 2020; Saha et al., 2022), and emotion trigger summarization (Sosea et al., 2023; Zhan et al., 2022). Given that language generation models have been employed to unify these and other classification and generation tasks, endowing models with a greater understanding of human emotions would greatly benefit multiple applications.

Cross-cultural Emotion Perception.

Many researchers have suggested that a basic set of emotions is universal, while others have argued that emotions are shaped by culture. Past work has built on Ekman’s proposal and provided evidence that emotion categories are universal (Ekman, 1984; Hoemann et al., 2019), with Sauter (2018) finding little support for the argument that language plays a foundational role in perceiving emotions. In contrast, other work has at least in part supported differences in emotion perception and recognition across languages and cultures (Chen et al., 2023; Mesquita et al., 2016; Jackson et al., 2019), even with bilingual speakers (Caldwell-Harris and Ayçiçeği-Dinn, 2009). Past work in sentiment and emotion in NLP frequently translates English corpora to enable multilinguality (Yang and Hirschberg, 2019; Tafreshi et al., 2024). Some work, however, has demonstrated cross-cultural differences in model performance (Havaldar et al., 2023), and approaches that do not rely on machine translation have also been proposed (e.g., Rasooli et al. 2018). Our work evaluates the use of machine translation for the ASI task, and we find that machine translation may not be sufficient for cross-lingual transfer.

6 Conclusion

In this work, we introduce the novel task of affective state identification, a language generation task prioritizing the authors’ natural expressions of their feelings rather than using a prescribed set of emotion labels. For this task, we automatically collect and publish two datasets of Reddit posts in English and Spanish, both containing over 1,000 unique affective state labels.

We use these datasets to benchmark multilingual generative models, and find that (Takeaway #1) small finetuned T5 and mT5 models outperform zero-shot LLMs. Results specifically show that (Takeaway #2) T5 significantly outperforms mT5 in English on ASI, suggesting that monolingual models may be more capable. Additionally, we show that (Takeaway #3) models finetuned on our corpora transfer knowledge that generalizes to existing emotion detection benchmarks. In analyzing model performance on unseen emotions and Spanish regionalisms, we argue that (Takeaway #4) generalization to a broader set of affective states, including those from underrepresented dialects, is an important avenue for future work. With respect to grammatical gender in Spanish and negations, (Takeaway #5) finetuning on MASIVE improves performance on specific linguistic constructions unique to generative ASI. Finally, we quantify the observed performance differences when using machine-translated data at finetuning or inference time, finding that in contrast to prior work, (Takeaway #6) machine translation leads to large performance drops. We hope these results spark future work on ASI to enable prediction of more nuanced feelings in a variety of languages and contexts, and ultimately, enable prediction of an unbounded set of labels.

Limitations

We limit ourselves in this work to investigating two high-resource languages, English and Spanish, in part because for this application, we find it important that members of the research team be able to speak the languages of study fluently. Additionally, we gather data from one source, Reddit, which limits the demographics of the people whose experiences are represented in our data. This choice of data source may particularly limit our Spanish data, which includes less data and fewer labels than English (Table 1). We choose not to control for things like topic or subreddit when collecting English and Spanish data separately because we wish to collect a natural variety of data, but this also means that we do not claim our two datasets to be equivalent.

Our data gathering framework collects only explicit expressions of affective states by searching for statements including an “I feel”-style template. While we can use models trained on this type of data to predict affective state labels for any input by simply appending an “I feel” statement to be filled (see Section 3.2.2), our training targets do not include this type of data, and this paradigm impacts the types of affective states we are likely to collect.

We also acknowledge that our choices of specific resources limit our work in various ways. We use only Opus-MT models to perform our machine translation experiments because they exhibit good performance in both languages; however, it is possible that we would see different results with different translation models. Our similarity metric also uses pre-trained BERT embeddings because of the benefits of contextual embeddings and subword tokenization, but there are many other possible choices of embedding framework that may more accurately capture emotional nuances. Finally, we evaluate only open-source LLMs on our dataset.

Ethics Statement

We strictly collect publicly available user-authored texts on the pseudonymous social media website Reddit, but we acknowledge the privacy concerns of users when collecting data from social media. Accordingly, we will release the collected texts only with randomly assigned IDs and usernames stripped. We discourage others from attempting to identify authors of the texts in the collected dataset, and will remove data from the dataset upon request.

Because we rely entirely on open-source models, including open-source LLMs, and make our data available, our results are fully reproducible. We also release our code and model checkpoints along with our data. In total, our finetuning and evaluation amounts to approximately 73 hours using Nvidia A100 GPUs.

Our task allows models to predict a larger set of affective states, capturing more nuanced expressions of an author’s feelings than traditional emotion detection. At the same time, a larger label set could exacerbate the consequences of misclassification in sensitive contexts (e.g., mental health and crisis settings). In some applications of this task where this may be an important consideration, the label set can be artificially restricted, as we show in our external evaluation experiments.

Finally, the aim of predicting authors’ expressions of their own feelings can require models to generate regional or dialectal texts. Prior work has identified dialectal biases in language models (e.g., African American Language; Deas et al. 2023; Groenwold et al. 2020) and we find that all evaluated models perform poorly on regional varieties of Spanish. We hope future work makes progress toward closing performance gaps among dialects and language varieties.

Acknowledgements

This work was supported in part by grant IIS2106666 from the National Science Foundation, the Defense Advanced Research Projects Agency (DARPA) Cross-Cultural Understanding (CCU) program under Contract No HR001122C0034, National Science Foundation Graduate Research Fellowship DGE-2036197, a research gift from Amazon, the Columbia Provost Diversity Fellowship, and the Columbia School of Engineering and Applied Sciences Presidential Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and should not be interpreted as representing the official views or policies of the National Science Foundation, the Department of Defense, or the U.S. Government. We thank Julia Hirschberg and Melanie Subbiah for feedback on earlier drafts of this work.

References

Acheampong et al. (2020) Francisca Adoma Acheampong, Chen Wenyu, and Henry Nunoo-Mensah. 2020. Text-based emotion detection: Advances, challenges, and opportunities. Engineering Reports, 2(7).
Araque et al. (2018) Oscar Araque, Lorenzo Gatti, Jacopo Staiano, and Marco Guerini. 2018. Depechemood++: a bilingual emotion lexicon built through simple yet powerful techniques. Preprint, arXiv:1810.03660.
Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
Bradley et al. (1992) Margaret M. Bradley, Mark K. Greenwald, Margaret C. Petry, and Peter J. Lang. 1992. Remembering pictures: Pleasure and arousal in memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(2):379–390.
Caldwell-Harris and Ayçiçeği-Dinn (2009) Catherine L. Caldwell-Harris and Ayşe Ayçiçeği-Dinn. 2009. Emotion and lying in a non-native language. International Journal of Psychophysiology, 71(3):193–204.
Chen et al. (2023) Peiyao Chen, Ashley Chung-Fat-Yim, Taomei Guo, and Viorica Marian. 2023. Cultural background and input familiarity influence multisensory emotion perception. Cultural Diversity and Ethnic Minority Psychology.
Deas et al. (2023) Nicholas Deas, Jessica Grieser, Shana Kleiner, Desmond Patton, Elsbeth Turcan, and Kathleen McKeown. 2023. Evaluation of African American language bias in natural language generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6805–6824, Singapore. Association for Computational Linguistics.
Demszky et al. (2020) Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. 2020. GoEmotions: A dataset of fine-grained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4040–4054, Online. Association for Computational Linguistics.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Ekman (1984) Paul Ekman. 1984. Expression and the nature of emotion. Approaches to emotion, 3(19):344.
Ekman (2005) Paul Ekman. 2005. Basic emotions. In Handbook of Cognition and Emotion, pages 45–60. John Wiley & Sons, Ltd.
Firdaus et al. (2020) Mauajama Firdaus, Hardik Chauhan, Asif Ekbal, and Pushpak Bhattacharyya. 2020. Emosen: Generating sentiment and emotion controlled responses in a multimodal dialogue system. IEEE Transactions on Affective Computing, 13(3):1555–1566.
Freitag et al. (2020) Markus Freitag, David Grangier, and Isaac Caswell. 2020. BLEU might be guilty but references are not innocent. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 61–71, Online. Association for Computational Linguistics.
Goswamy et al. (2020) Tushar Goswamy, Ishika Singh, Ahsan Barkati, and Ashutosh Modi. 2020. Adapting a language model for controlled affective text generation. In Proceedings of the 28th international conference on computational linguistics, pages 2787–2801.
Groenwold et al. (2020) Sophie Groenwold, Lily Ou, Aesha Parekh, Samhita Honnavalli, Sharon Levy, Diba Mirza, and William Yang Wang. 2020. Investigating African-American Vernacular English in transformer-based text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5877–5883, Online. Association for Computational Linguistics.
Havaldar et al. (2023) Shreya Havaldar, Bhumika Singhal, Sunny Rai, Langchen Liu, Sharath Chandra Guntuku, and Lyle Ungar. 2023. Multilingual language models are not multicultural: A case study in emotion. In Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, pages 202–214, Toronto, Canada. Association for Computational Linguistics.
Heyes (2019) Cecilia Heyes. 2019. Précis of cognitive gadgets: The cultural evolution of thinking. Behavioral and Brain Sciences, 42:e169.
Hoemann et al. (2019) Katie Hoemann, Alyssa N. Crittenden, Shani Msafiri, Qiang Liu, Chaojie Li, Debi Roberson, Gregory A. Ruark, Maria Gendron, and Lisa Feldman Barrett. 2019. Context facilitates performance on a classic cross-cultural emotion perception task. Emotion, 19(7):1292–1313.
Honnibal et al. (2020) Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python.
Ide and Kawahara (2021) Tatsuya Ide and Daisuke Kawahara. 2021. Multi-task learning of generation and classification for emotion-aware dialogue response generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 119–125, Online. Association for Computational Linguistics.
Isbister et al. (2021) Tim Isbister, Fredrik Carlsson, and Magnus Sahlgren. 2021. Should we stop training more monolingual models, and simply use machine translation instead? Preprint, arXiv:2104.10441.
Jackson et al. (2019) Joshua Conrad Jackson, Joseph Watts, Teague R. Henry, Johann-Mattis List, Robert Forkel, Peter J. Mucha, Simon J. Greenhill, Russell D. Gray, and Kristen A. Lindquist. 2019. Emotion semantics show both cultural variation and universal structure. Science, 366(6472):1517–1522.
Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mixtral of experts. arXiv preprint.
Johnson-Laird and Oatley (1998) Philip N Johnson-Laird and Keith Oatley. 1998. Basic emotions, rationality, and folk theory. In Consciousness and Emotion in Cognitive Science, pages 289–311. Routledge.
Kathunia et al. (2024) Aekansh Kathunia, Mohammad Kaif, Nalin Arora, and N Narotam. 2024. Sentiment analysis across languages: Evaluation before and after machine translation to english. Preprint, arXiv:2405.02887.
Lichtenstein et al. (2008) Antje Lichtenstein, Astrid Oehme, Stefan Kupschick, and Thomas Jürgensohn. 2008. Comparing two emotion models for deriving affective states from physiological data. In Affect and Emotion in Human-Computer Interaction, pages 35–50. Springer Berlin Heidelberg.
Liew et al. (2016) Jasy Suet Yan Liew, Howard R. Turtle, and Elizabeth D. Liddy. 2016. EmoTweet-28: A fine-grained emotion corpus for sentiment analysis. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1149–1156, Portorož, Slovenia. European Language Resources Association (ELRA).
Mesquita et al. (2016) Batja Mesquita, Michael Boiger, and Jozefien De Leersnyder. 2016. The cultural construction of emotions. Current Opinion in Psychology, 8:31–36.
Mohammad and Turney (2010) Saif Mohammad and Peter Turney. 2010. Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 26–34, Los Angeles, CA. Association for Computational Linguistics.
Mohammad (2018) Saif M. Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 english words. In Proceedings of The Annual Conference of the Association for Computational Linguistics (ACL), Melbourne, Australia.
Mohammad and Kiritchenko (2015) Saif M. Mohammad and Svetlana Kiritchenko. 2015. Using hashtags to capture fine emotion categories from tweets. Computational Intelligence, 31(2):301–326.
Oatley and Johnson-laird (1987) Keith Oatley and P. N. Johnson-laird. 1987. Towards a cognitive theory of emotions. Cognition & Emotion, 1(1):29–50.
Ortony et al. (1988) Andrew Ortony, Gerald L. Clore, and Allan Collins. 1988. The Cognitive Structure of Emotions. Cambridge University Press.
Plaza-del Arco et al. (2024) Flor Miriam Plaza-del Arco, Alba A. Cercas Curry, Amanda Cercas Curry, and Dirk Hovy. 2024. Emotion analysis in NLP: Trends, gaps and roadmap for future directions. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 5696–5710, Torino, Italia. ELRA and ICCL.
Plaza del Arco et al. (2020) Flor Miriam Plaza del Arco, Carlo Strapparava, L. Alfonso Urena Lopez, and Maite Martin. 2020. EmoEvent: A multilingual emotion corpus based on different events. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1492–1498, Marseille, France. European Language Resources Association.
Plutchik (1980) Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. In Theories of emotion, pages 3–33. Elsevier.
PS and Mahalakshmi (2017) Sreeja PS and G Mahalakshmi. 2017. Emotion models: a review. International Journal of Control Theory and Applications, 10(8):651–657.
Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.
Rasooli et al. (2018) Mohammad Sadegh Rasooli, Noura Farra, Axinia Radeva, Tao Yu, and Kathleen McKeown. 2018. Cross-lingual sentiment transfer with limited resources. Machine Translation, 32(1/2):143–165.
Rubin and Talarico (2009) David C. Rubin and Jennifer M. Talarico. 2009. A comparison of dimensional models of emotion: Evidence from emotions, prototypical events, autobiographical memories, and words. Memory, 17(8):802–808.
Russell (1980) James A. Russell. 1980. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161–1178.
Russell and Mehrabian (1977) James A Russell and Albert Mehrabian. 1977. Evidence for a three-factor theory of emotions. Journal of research in Personality, 11(3):273–294.
Saha et al. (2022) Punyajoy Saha, Kanishk Singh, Adarsh Kumar, Binny Mathew, and Animesh Mukherjee. 2022. Countergedi: A controllable approach to generate polite, detoxified and emotional counterspeech. arXiv preprint arXiv:2205.04304.
Sauter (2018) Disa A. Sauter. 2018. Is there a role for language in emotion perception? Emotion Review, 10(2):111–115.
Shazeer and Stern (2018) Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. Preprint, arXiv:1804.04235.
Sintsova et al. (2013) Valentina Sintsova, Claudiu Musat, and Pearl Pu. 2013. Fine-grained emotion recognition in olympic tweets based on human computation. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 12–20, Atlanta, Georgia. Association for Computational Linguistics.
Song et al. (2019) Zhenqiao Song, Xiaoqing Zheng, Lu Liu, Mu Xu, and Xuanjing Huang. 2019. Generating responses with a specific emotion in dialog. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3685–3695, Florence, Italy. Association for Computational Linguistics.
Sosea et al. (2023) Tiberiu Sosea, Hongli Zhan, Junyi Jessy Li, and Cornelia Caragea. 2023. Unsupervised extractive summarization of emotion triggers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9550–9569, Toronto, Canada. Association for Computational Linguistics.
Staiano and Guerini (2014) Jacopo Staiano and Marco Guerini. 2014. Depechemood: a lexicon for emotion analysis from crowd-annotated news. Preprint, arXiv:1405.1605.
Strapparava and Valitutti (2004) Carlo Strapparava and Alessandro Valitutti. 2004. WordNet affect: an affective extension of WordNet. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. European Language Resources Association (ELRA).
Subasic and Huettner (2001) P. Subasic and A. Huettner. 2001. Affect analysis of text using fuzzy semantic typing. IEEE Transactions on Fuzzy Systems, 9(4):483–496.
Tafreshi et al. (2024) Shabnam Tafreshi, Shubham Vatsal, and Mona Diab. 2024. Emotion classification in low and moderate resource languages. Preprint, arXiv:2402.18424.
Tiedemann and Thottingal (2020) Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT — Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal.
VandenBos (2007) Gary R VandenBos. 2007. APA dictionary of psychology. American Psychological Association.
Xue et al. (2021a) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021a. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Xue et al. (2021b) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021b. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Yang and Hirschberg (2019) Zixiaofan Yang and Julia Hirschberg. 2019. Linguistically-informed training of acoustic word embeddings for low-resource languages. In Interspeech.
Zhan et al. (2022) Hongli Zhan, Tiberiu Sosea, Cornelia Caragea, and Junyi Jessy Li. 2022. Why do you feel this way? summarizing triggers of emotions in social media posts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9436–9453, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Appendix A Data Statement

A.1 Curation Rationale

The aim of collecting the texts contained in MASIVE was to produce both a training dataset and a benchmark for affective state identification. Affective state identification tasks models with predicting individual terms reflecting how a text’s author feels, and in particular, predicting terms that would be used by the author themself. The dataset collection process was designed to automatically extract a large set of possible affective state labels from texts where an author explicitly describes how they feel. English and Spanish versions of the dataset were collected in the same fashion to enable cross-lingual research, and a small set of regional Spanish data was collected to enable work on linguistic variation. We intend to make the dataset publicly available.

A.2 Language Variety

MASIVE contains texts in both English (en) and Spanish (es). Data collection was not restricted to a particular variety of English or Spanish, and the distribution of these varieties likely reflects the overall demographics of English- and Spanish-speaking users on Reddit. A small set of data was collected specifically to reflect Spanish specific to particular regions, including terms primarily associated with Spanish spoken in Mexico, Spain, Venezuela, and El Salvador among other regions and countries.

A.3 Annotator Demographics

Two sets of annotators were involved in validating the automatically extracted labels in MASIVE. For the English data, the annotators were 2 native English speakers who were Psychology undergraduate students. Both English data annotators were American and female. For the Spanish data, the annotators were 2 native Spanish speakers who were graduate students in the department of Latin American and Iberian Cultures. The Spanish data annotators were Colombian and Ecuadorian, and both were male.

A.4 Speech Situation

The collected texts in MASIVE were not restricted to a particular range of time, and may have been published anytime between the founding of Reddit (2005) and the time of data collection (April 2024). Texts were also not restricted to a particular place, but likely reflect the countries of origin of English- and Spanish-speaking Reddit users. All texts were originally written and published on Reddit, and may or may not have been edited before they were included in the dataset. As with most interactions through Reddit posts, the texts reflect asynchronous interactions and are likely intended for a general public audience in most cases.

A.5 Text Characteristics

The texts in MASIVE may discuss a wide variety of topics. All texts, however, contain explicit expressions of feelings or explicit mentions of terms that may reflect feelings. Thus, many texts may reflect personal narratives that provide context for an author’s feelings. As a result, the dataset may also discuss sensitive topics and include the kinds of offensive or harmful content that can be found online.

Appendix B Data Collection and Annotation

B.1 Seed Emotions

The specific adjective forms of the Ekman emotions used to seed our bootstrapping procedure are shown in Table 8. These are also the terms used as the gold in our fixed-set label evaluation, with the addition of ‘nothing’ for the no-emotion class if it is used.

For fixed-label evaluation of GoEmotions (27), the following terms are used for the expanded label set: ‘admiration’, ‘amused’, ‘angry’, ‘annoyed’, ‘approving’, ‘caring’, ‘confused’, ‘curious’, ‘desire’, ‘disappointed’, ‘disapproval’, ‘disgusted’, ‘embarrassed’, ‘excited’, ‘afraid’, ‘grateful’, ‘grief’, ‘happy’, ‘love’, ‘nervous’, ‘optimistic’, ‘proud’, ‘realized’, ‘relieved’, ‘remorseful’, ‘sad’, ‘surprised’, and ‘nothing’.

En		Es
happy	surprised	feliz	sorprendido
sad	disgusted	triste	desagradado
angry	afraid	enojado	asustado

Table 8: Seed emotions (Ekman) for each language used in collecting the Reddit corpus.

B.2 Regional Spanish Affective States

To collect affective state labels associated with one or more particular Spanish-speaking regions, we use the following set of terms: ‘mamado/a’, ‘patitieso/a’, ‘emputado/a’, ‘encandilado/a’, ‘arrechado/a’, ‘fastidiado/a’, ‘encabronado/a’, ‘hallado/a’, ‘rayado/a’, ‘achispado/a’, ‘ahuevado/a’, ‘enrabiado/a’, ‘tusa’, ‘chocho/a’, ‘encachimbado/a’, ‘bravo/a’, ‘apantallado/a’, ‘embromado/a’, ‘engorilado/a’, ‘alicaido/a’, ‘flipando/a’, ‘cagado/a’, ‘aguitado/a’, ‘engrinchado/a’, ‘chato/a’, ‘chipil’, ‘picado/a’, ‘bajoneado/a’, ‘acojonado/a’, ‘arrecho/a’.

The terms are not exhaustive, but reflect varieties of Spanish spoken in Spain, Chile, Colombia, Venezuela, Mexico, Bolivia, Argentina, Uruguay, and Paraguay.

B.3 Data Annotation

The instructions and interface given to our human annotators are shown in Figure 3 and Figure 4, respectively. Annotators were paid $23/hour for their work in accordance with the standards of their university. Each annotator completed a pilot task of 30 examples before beginning to annotate the data in order to build familiarity with the platform and task.

Appendix C Experimental Setup

C.1 Generation Configuration

Checkpoints.

Throughout our experiments, we use the large variants of T5 (770 million parameters; google-t5/t5-large) and mT5 (1.2 billion parameters; google/mt5-large). For our two LLMs, we evaluate the instruct variants of Llama-3 (8 billion parameters; meta-llama/Meta-Llama-3-8B-Instruct) loaded in bfloat16 and Mixtral (8 $\times$ 22 billion parameters; mistralai/Mixtral-8x22B-Instruct-v0.1). Mixtral is accessed through the fireworks.ai API.

Beyond the evaluated models, we use two open-source, unidirectional translation models for our translation experiments. In particular, we employ the Helsinki-NLP English-to-Spanish (Helsinki-NLP/opus-mt-en-es) and Spanish-to-English (Helsinki-NLP/opus-mt-es-en) models. We also use a multilingual BERT checkpoint as part of the similarity metric (168 million parameters; bert-base-multilingual-uncased). Finally, we rely on spaCy (Honnibal et al., 2020) to identify parts of speech in English (en_core_web_md) and Spanish (es_core_news_md) during our data collection.

Generation.

For T5, mT5, and Llama-3, we use beam search with 5 beams to generate the top-k most likely predictions (as we need only the top-5 outputs). We otherwise use the default settings of Huggingface’s GenerationConfig (e.g., no repetition penalty); though we expect a single-word output, we allow generations of up to 32 tokens.
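The beam search configuration above can be sketched as follows. This is a minimal illustration rather than the released code: `top_k_predictions` is a hypothetical helper, the keyword names follow Hugging Face's `generate` interface, and interpreting the 32-token limit as `max_new_tokens` is an assumption.

```python
# Hypothetical settings mirroring the description above.
BEAM_SEARCH_KWARGS = {
    "num_beams": 5,             # 5 beams, as stated above
    "num_return_sequences": 5,  # return every beam to obtain the top-5 outputs
    "max_new_tokens": 32,       # assumption: the 32-token cap applies to new tokens
    "do_sample": False,         # plain beam search, no sampling
}


def top_k_predictions(model, tokenizer, masked_text, k=5):
    """Return k generations for one masked input, in the order the model returns them."""
    if k > BEAM_SEARCH_KWARGS["num_beams"]:
        raise ValueError("k cannot exceed the number of beams")
    inputs = tokenizer(masked_text, return_tensors="pt")
    kwargs = {**BEAM_SEARCH_KWARGS, "num_return_sequences": k}
    outputs = model.generate(**inputs, **kwargs)
    return [tokenizer.decode(seq, skip_special_tokens=True).strip() for seq in outputs]
```

The helper expects an already loaded model and tokenizer (for example, a sequence-to-sequence checkpoint such as T5). For a decoder-only model such as Llama-3, the prompt tokens would additionally need to be removed from each output before decoding.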

The API used to run inference with Mixtral does not allow retrieving the top 5 most probable predictions as we do with the aforementioned models. Instead, Mixtral predictions are generated with a top-k of 5, and a temperature of 0.5. The top 5 candidate generations are then reranked by the log-probability according to Mixtral to be used in evaluating the ranked, top-5 predictions. Also due to accessing Mixtral through an API, we were not able to calculate the log perplexity of the ground truth labels.
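The reranking step for sampled Mixtral candidates can be illustrated with a short sketch. It assumes each candidate arrives as a `(text, log_probability)` pair; the function name, the lowercasing, and the deduplication of repeated samples are illustrative assumptions rather than details given above.

```python
def rerank_by_logprob(candidates, k=5):
    """Order sampled candidates by log-probability, highest first.

    candidates: iterable of (text, log_probability) pairs.
    Repeated samples of the same term are merged, keeping the best score
    (an assumption; the handling of duplicates is not described above).
    """
    best = {}
    for text, logprob in candidates:
        term = text.strip().lower()
        if term not in best or logprob > best[term]:
            best[term] = logprob
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]
```

The resulting list can then be scored with the same top-k metrics as the beam search outputs of the other models.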

Hyperparameters.

T5 and mT5 models are finetuned with a batch size of 4 for 3 epochs each. Model parameters are optimized using Adafactor (Shazeer and Stern, 2018) as implemented by Huggingface’s transformers with a learning rate of $1\times 10^{-4}$, Huggingface’s linear learning rate scheduler with default parameters, and a weight decay parameter (here, an L2 penalty) of 0.01. For each model, all data is tokenized using the tokenizer corresponding to its pretrained checkpoint. Any input that is longer than 512 tokens (including the end-of-sequence token) is trimmed to fit; in order to preserve the target affective state masks and the grammatical integrity of the text, this trimming removes full sentences (as parsed by nltk (Bird et al., 2009)) from the end of the text if possible (i.e., if this will not remove a target mask), or the beginning otherwise, until the text fits within 512 tokens.
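The sentence-level trimming rule can be sketched as below. To keep the example self-contained, a regular-expression splitter and a whitespace token counter stand in for nltk and the pretrained tokenizer; both stand-ins, the `<MASK>` placeholder, and the function names are assumptions. The description above does not say what happens when both the first and last sentences contain a mask, so this sketch simply drops the first sentence in that case.

```python
import re

MAX_TOKENS = 512
MASK = "<MASK>"  # placeholder for a target mask; the real mask token may differ


def split_sentences(text):
    """Naive sentence splitter standing in for nltk's sentence tokenizer."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_tokens(text):
    """Whitespace token count plus one for the end-of-sequence token (stand-in)."""
    return len(text.split()) + 1


def trim_to_fit(text, max_tokens=MAX_TOKENS, mask=MASK,
                split=split_sentences, count=count_tokens):
    """Drop whole sentences until the text fits within max_tokens."""
    sentences = split(text)
    while len(sentences) > 1 and count(" ".join(sentences)) > max_tokens:
        if mask not in sentences[-1]:
            sentences.pop()    # remove from the end when no target mask is lost
        else:
            sentences.pop(0)   # otherwise remove from the beginning
    return " ".join(sentences)
```

Passing the real sentence splitter and a tokenizer-based counter through the `split` and `count` arguments would bring the sketch closer to the setup described above.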

C.2 Prompts

Table 9 shows the prompts provided to Mixtral and Llama-3 throughout our experiments. In a minority of cases, models would reply in the form "Here is a list of terms to fill each <MASK>: ", in which case, only the terms following the colon were considered as the model’s prediction.

Lang	Prompt
En	Determine the most likely term reflecting a feeling to replace each <MASK> in the following text: "<MASKED_POST>" Provide a single emotion term for each <MASK> token. Do not introduce the answer, respond ONLY with a comma-separated list of lowercase terms:
Es	Determine the most likely term reflecting a feeling to replace each <MASK> in the following text: "<MASKED_POST>" Provide a single emotion term for each <MASK> token. Do not introduce the answer, respond ONLY with a comma-separated list of lowercase terms in Spanish:

Table 9: Prompts provided to Llama-3 and Mixtral for evaluation. At inference time, <MASKED_POST> is replaced with the input text containing masked affective states.
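As an illustration of how the English prompt in Table 9 might be filled in and how responses might be post-processed, consider the sketch below. The helper names are hypothetical, and the cleanup choices (keeping the text after the first colon, lowercasing, stripping quotes and periods) are assumptions beyond the colon rule described in this section.

```python
EN_PROMPT = (
    "Determine the most likely term reflecting a feeling to replace each <MASK> "
    'in the following text: "<MASKED_POST>" Provide a single emotion term for each '
    "<MASK> token. Do not introduce the answer, respond ONLY with a "
    "comma-separated list of lowercase terms:"
)


def build_prompt(masked_post, template=EN_PROMPT):
    """Insert the masked post into the prompt template."""
    return template.replace("<MASKED_POST>", masked_post)


def parse_response(response):
    """Extract the comma-separated terms from a model response.

    If the model introduces its answer (e.g. "Here is a list of terms ...: "),
    only the text following the first colon is kept.
    """
    body = response.split(":", 1)[-1]
    terms = [term.strip().strip('".').lower() for term in body.split(",")]
    return [term for term in terms if term]
```

For example, `parse_response("Here is a list of terms to fill each <MASK>: Sad, tired.")` keeps only the two terms after the colon.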

C.3 Machine Translation Configuration

In the finetuning experiment, we subset the English data and translated English-to-Spanish data to keep the number of training steps constant across settings. For these two models, we repeat the experiment with 5 different random subsets and report the averages across the five trials.
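The repeated-subset protocol can be sketched as follows. The function names, the seed values, and the idea of passing the finetune-and-evaluate step as a callable (`run_experiment`) are assumptions made for illustration; the subset size would be chosen to match the number of training steps in the comparison setting.

```python
import random
from statistics import mean


def sample_subset(examples, size, seed):
    """Draw a reproducible random subset of the training examples."""
    rng = random.Random(seed)
    return rng.sample(list(examples), size)


def average_over_subsets(examples, size, run_experiment, seeds=(0, 1, 2, 3, 4)):
    """Run the experiment on one random subset per seed and average the scores.

    run_experiment: hypothetical callable that finetunes and evaluates a model
    on the given subset and returns a single numeric score.
    """
    scores = [run_experiment(sample_subset(examples, size, seed)) for seed in seeds]
    return mean(scores)
```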

Appendix D Top-K Similarity

Let $P=[p_{1},p_{2},p_{3},\dots,p_{n}]$, where $n\geq k$, be a list of predictions ordered according to descending likelihood, and let $g$ be the gold label (where $p_{i}$ and $g$ are strings). Additionally, let $E(x)$ be a function on a term $x$ that incorporates 100 tokens of context, tokenizes and embeds the sequence with a pre-trained BERT tokenizer, and returns the contextual embedding corresponding to the first sub-word token in $x$. Then, we report top-k similarity specifically as

$$\text{sim}_{k}(P,g)=\max_{i\leq k}\big[\text{cosine\_sim}(E(p_{i}),E(g))\big]$$
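A small sketch of this metric is given below. The embedding function is passed in as a parameter (`embed`) and stands in for $E(x)$; in the paper, $E$ is a contextual BERT embedding, whereas the usage example relies on toy context-free vectors purely for illustration. All names are hypothetical.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similarity(predictions, gold, embed, k):
    """Maximum cosine similarity between the gold term and the top-k predictions.

    predictions: list of strings ordered by descending likelihood.
    embed: callable mapping a term to its embedding (stand-in for E).
    """
    gold_vec = embed(gold)
    return max(cosine_sim(embed(p), gold_vec) for p in predictions[:k])


# Toy embeddings for illustration only.
toy = {"sad": (1.0, 0.0), "happy": (0.0, 1.0), "unhappy": (1.0, 1.0)}
score = top_k_similarity(["happy", "unhappy", "sad"], "sad", toy.__getitem__, k=2)
```

Because the metric takes a maximum over the first $k$ predictions, it can only stay the same or increase as $k$ grows.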

Appendix E Extended Results

E.1 Limited Evaluation for Llama-3

For some inputs, Llama-3 would decline to make a prediction, particularly for inputs that discuss topics such as depression or drug use. While these are important topics for models to be able to accurately analyze as they are increasingly applied in mental health contexts, Llama-3’s behavior may unfairly skew its evaluation results. Table 10 presents updated results for Llama-3 on the subset of texts for which the model’s response followed the correct format. 87% of English, 96% of Spanish, and 98% of regional Spanish responses by Llama-3 were formatted correctly. Among the datasets, English scores improve the most given the higher percentage of invalid responses, and overall, scores improve by at most 0.6% in top-k accuracy and 0.003 in top-k similarity. These results do not alter any of our conclusions.

Lang	Acc@1 $\uparrow$	Acc@3 $\uparrow$	Acc@5 $\uparrow$	Sim@1 $\uparrow$	Sim@3 $\uparrow$	Sim@5 $\uparrow$
En	2.5%	3.9%	4.6%	0.432	0.471	0.490
Es	2.3%	3.2%	3.7%	0.436	0.470	0.484
Es (Reg)	0.0%	0.1%	0.2%	0.367	0.393	0.403

Table 10: Evaluation results of Llama-3 on each MASIVE dataset only considering samples with correctly formatted responses of the form "prediction_1, prediction_2, etc…"

E.2 Full Fixed-Label Set Results

Extended results from the fixed-label evaluation are given in Table 11. Notably, we include results using T5 in English, where T5 represents a model finetuned only on the target dataset and $\text{T5}^{MAS}$ represents a model finetuned on MASIVE and then finetuned on the target dataset. Precision, recall, and F1 are calculated by ranking the adjective forms of each emotion class (Appendix B) according to model likelihood and taking the most likely one as the predicted class, while top-k accuracy and similarity are calculated in a generative setting as in the remainder of the paper. T5 scores consistently well on F1; pretraining on MASIVE does not usually improve T5’s performance on GoEmotions, while it does for EmoEvent (En). Because MASIVE pretraining does improve performance on EmoEvent (Es), it is possible that English T5 is already a very strong baseline and potentially near the performance ceiling of generative models.

Dataset	Model	P	R	F1	Acc@1	Acc@3	Acc@5	Sim@1	Sim@3	Sim@5
GoEmotions (27)	T5	24.23	7.36	5.67	2.0%	3.3%	3.9%	0.197	0.461	0.492
GoEmotions (27)	$\text{T5}^{MAS}$	41.45	6.95	4.01	5.4%	9.3%	11.5%	0.504	0.563	0.590
GoEmotions (27)	mT5	22.45	2.53	0.97	2.5%	12.9%	23.5%	0.525	0.614	0.670
GoEmotions (27)	$\text{mT5}^{MAS}$	35.91	7.06	5.28	5.0%	9.6%	13.6%	0.512	0.579	0.609
GoEmotions (7)	T5	42.45	39.59	25.59	38.5%	47.9%	55.7%	0.734	0.775	0.804
GoEmotions (7)	$\text{T5}^{MAS}$	57.44	35.59	30.77	28.7%	43.0%	48.9%	0.669	0.758	0.784
GoEmotions (7)	mT5	28.28	38.63	23.53	38.5%	70.7%	86.0%	0.736	0.884	0.946
GoEmotions (7)	$\text{mT5}^{MAS}$	73.42	28.74	27.79	23.8%	37.2%	45.1%	0.663	0.751	0.783
EmoEvent (En)	T5	42.55	56.06	44.03	12.5%	20.5%	23.9%	0.515	0.591	0.623
EmoEvent (En)	$\text{T5}^{MAS}$	64.10	36.24	39.82	25.5%	54.5%	67.9%	0.643	0.826	0.879
EmoEvent (En)	mT5	6.01	10.48	2.21	10.5%	72.2%	93.3%	0.638	0.884	0.972
EmoEvent (En)	$\text{mT5}^{MAS}$	68.76	35.73	36.90	32.8%	60.5%	72.9%	0.719	0.858	0.903
EmoEvent (Es)	mT5	19.53	23.63	11.04	23.5%	64.1%	88.2%	0.721	0.869	0.953
EmoEvent (Es)	$\text{mT5}^{MAS}$	31.90	31.83	26.83	32.2%	79.0%	85.6%	0.730	0.906	0.938

Table 11: Fixed-label evaluation of our models on prior emotion classification datasets. The best performance under each metric for each dataset is bolded.
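The fixed-label prediction step described in this appendix (ranking the adjective form of each class by model likelihood and taking the most likely one) can be sketched as follows. The scoring function is passed in as a callable and is a stand-in for the model's likelihood of each term; the function names and the toy scores are hypothetical.

```python
# English seed terms from Table 8.
EKMAN_TERMS = ["happy", "sad", "angry", "surprised", "disgusted", "afraid"]


def rank_labels(terms, log_likelihood):
    """Order candidate label terms from most to least likely."""
    return sorted(terms, key=log_likelihood, reverse=True)


def predict_label(terms, log_likelihood):
    """Return the single most likely label term."""
    return rank_labels(terms, log_likelihood)[0]


# Toy scores standing in for per-term model log-likelihoods.
toy_scores = {"happy": -4.1, "sad": -0.7, "angry": -2.3,
              "surprised": -5.0, "disgusted": -3.8, "afraid": -1.9}
predicted = predict_label(EKMAN_TERMS, toy_scores.__getitem__)
```

Precision, recall, and F1 can then be computed from the predicted and gold classes with any standard classification metric implementation.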