MASIVE: Open-Ended Affective State Identification in English and Spanish

Nicholas Deas, Elsbeth Turcan, Iván Pérez Mejía, Kathleen McKeown

Columbia University, Department of Computer Science
Correspondence: [ndeas,eturcan,kathy]@cs.columbia.edu
Abstract

In the field of emotion analysis, much NLP research focuses on identifying a limited number of discrete emotion categories, often applied across languages. These basic sets, however, are rarely designed with textual data in mind, and culture, language, and dialect can influence how particular emotions are interpreted. In this work, we broaden our scope to a practically unbounded set of affective states, which includes any terms that humans use to describe their experiences of feeling. We collect and publish MASIVE, a dataset of Reddit posts in English and Spanish containing over 1,000 unique affective states each. We then define the new problem of affective state identification for language generation models framed as a masked span prediction task. On this task, we find that smaller finetuned multilingual models outperform much larger LLMs, even on region-specific Spanish affective states. Additionally, we show that pretraining on MASIVE improves model performance on existing emotion benchmarks. Finally, through machine translation experiments, we find that native speaker-written data is vital to good performance on this task.

1 Introduction

In the field of emotion analysis, much NLP research focuses on identifying a limited number of discrete emotion categories, typically using basic emotion sets from the field of psychology (Plaza-del Arco et al., 2024). These basic emotion sets are rarely designed with textual expression in mind (e.g., Ekman 1984, whose model defines basic emotions by the recognizability of facial expressions), and very little research examines the validity of adapting these sets to textual data.

Emotion analysis furthermore relies on largely the same emotion categories across languages, including, in some cases, simply translating resources such as lexicons from one language into another (Mohammad, 2018) or translating finetuning and evaluation data (Isbister et al., 2021; Kathunia et al., 2024). Previous research has also shown that existing multilingual models encode meaning in an Anglocentric way (Havaldar et al., 2023). As recent studies have found that culture and language influence the particular meaning of emotional terms like "love" (Heyes, 2019), models that fail to understand this cultural context or rely on mainstream dialects may also fail to capture the nuances of an author’s emotional expression (Deas et al., 2023).

Figure 1: Paraphrased input and expected output examples from MASIVE in English and Spanish. Models are tasked with predicting affective states (highlighted), which reflect more nuanced feelings than label sets in prior work, such as the Ekman basic emotions.

In this work, we argue for a descriptive approach to emotion analysis. We broaden our scope from a small set of basic emotions to a practically unbounded set of affective states, which includes any terms that humans use to describe their experiences of feeling, including emotions, moods, and figurative expressions of feelings (e.g. "blue" as an expression of sadness instead of the color) (VandenBos, 2007). We then define the new problem of affective state identification (ASI), which is a targeted masked span prediction task: given a text description of an emotional experience, we train models to produce single-word affective states that correspond to the description. These affective states may include common emotion categories such as happy or sad, but they also allow us to incorporate nuance and intensity (e.g., elated, calm, jealous, lonely, etc.) as well as other classifications that are not typically considered emotions such as moods (e.g., longer-term feelings of being motivated or stuck).

We collect MASIVE: Multilingual Affective State Identification with Varied Expressions, a new benchmark dataset for affective state identification using Reddit data. We use a bootstrapping procedure to discover new affective state labels and collect posts containing natural emotional expressions in English and Spanish, yielding over 1,600 unique affective state labels in English and over 1,000 in Spanish. (Our dataset, code, and model checkpoints will be made publicly available upon publication.) We evaluate our data collection methods with human annotation, finding that 88% and 72% of our automatically collected English and Spanish labels, respectively, reflect affective states, and document unique features of the datasets including negations and, in Spanish, grammatical gender. We then use this dataset to evaluate the performance of several commonly-used generative models, finding that small fine-tuned models generally outperform LLMs. Beyond ASI, we experiment with using our corpora as pretraining data and show that MASIVE incorporates knowledge that generalizes to existing emotion detection benchmarks. Finally, we assess finetuning and evaluating models on machine-translated data and find that original texts from native speakers are essential for performing ASI.

Our contributions in this work are as follows:

  1. We introduce a novel benchmark for affective state identification with language generation models, including a significantly larger label set than prior related benchmarks;

  2. We benchmark multilingual models and show that smaller, finetuned models outperform current LLMs on this dataset;

  3. We analyze the behavior and performance of models on region-specific affective language, gendered language, and negations; and

  4. We empirically argue that both finetuning and evaluating on texts authored by native speakers are vital for capturing nuances in multilingual affective writing.

2 Data

2.1 MASIVE Corpus

Our goal is to collect data representing the broad set of ways in which humans describe their feelings. We refer to these expressions as affective states (VandenBos, 2007); this is an umbrella term incorporating multiple kinds of feelings such as emotions and moods. We collect texts with expressions of affective states from Reddit (using the PullPush API at https://pullpush.io/) through a bootstrapping procedure. Beginning with the adjective forms of the Ekman emotions, we search for texts containing forms of “I feel <affect> and…”, “I am feeling <affect> and…”, where <affect> is replaced with each emotion term. Notably, we also search for “I don’t feel <affect> and…” and “I am not feeling <affect> and…” to better capture the diversity of ways in which authors can express feelings. We extract affective state terms that follow the “and” from the retrieved posts to form a new set of search phrases with these terms. We repeat these steps, expanding the pool of query affective states in each round. Our primary assumption is that any adjective conjuncts of the query emotion term are also affective states, regardless of whether they are canonical emotion terms. For example, if "happy" was used to query the text "I feel happy and excited," the term "excited" is both an adjective and a conjunct; the same is true of “light” in “I feel happy and light”. In contrast, in "I feel happy and want to smile", "want" is a verb and would not be considered an affective state. We evaluate this assumption in subsection 2.2.
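The bootstrapping loop described above can be sketched roughly as follows. This is a minimal illustration and not the collection code itself: the in-memory `posts` list stands in for Reddit search results, the `ADJECTIVES` set is a hypothetical stand-in for a part-of-speech tagger, and all function names are assumptions introduced for the example.

```python
import re

# Hypothetical stand-in for a POS tagger: a tiny adjective lexicon.
ADJECTIVES = {"happy", "sad", "excited", "light", "angry", "tired", "lonely"}

# English search templates from the text, lowercased for matching.
TEMPLATES = [
    "i feel {affect} and",
    "i am feeling {affect} and",
    "i don't feel {affect} and",
    "i am not feeling {affect} and",
]


def build_queries(affect_terms):
    """Instantiate every template with every query term."""
    return [t.format(affect=a) for a in sorted(affect_terms) for t in TEMPLATES]


def extract_conjuncts(text, query):
    """Return adjectives that directly follow the query phrase in the text."""
    pattern = re.escape(query) + r"\s+([a-z]+)"
    return {w for w in re.findall(pattern, text.lower()) if w in ADJECTIVES}


def bootstrap(seed_terms, posts, rounds=4):
    """Grow the pool of affective states, querying only new terms each round."""
    known = set(seed_terms)
    frontier = set(seed_terms)
    for _ in range(rounds):
        discovered = set()
        for query in build_queries(frontier):
            for post in posts:
                discovered |= extract_conjuncts(post, query)
        frontier = discovered - known
        if not frontier:
            break
        known |= frontier
    return known
```

Seeding the loop with "happy" on a handful of posts would pick up "excited" and "light" in the first round and any of their adjective conjuncts in later rounds, while a verb such as "want" is rejected by the adjective check.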

In Spanish, we conduct the same procedure using forms of “Estoy <affect> y…”, “Me siento <affect> y…”, and “Estoy sintiendo <affect> y…”. We also seed the process with the most common Spanish translations of the Ekman emotions on Reddit (see Appendix B). Additionally, as Spanish includes both masculine and feminine forms for some terms, we search for both forms where applicable. Finally, we also collect a challenge set including affective state labels associated with regional Spanish varieties, hand-selected by a native Spanish-speaker, to evaluate models’ abilities to generalize to less-represented dialects (see Appendix B).

For both English and Spanish, we run 4 rounds of bootstrapping; for the regional Spanish terms, we run only a single round to avoid introducing non-regional terms. We randomly sample 15 affective states from each dataset and reserve all posts containing those 15 affective states as part of each test set to evaluate models on unseen emotions. Summary statistics describing the English and Spanish splits as well as the regional Spanish challenge set are included in Table 1, and we include a Data Statement in Appendix A.

Lang  Split  Size    Input Length  # AS/Text  # Unique AS
En    Train  93,736  310.99        1.11       1,627
En    Test   10,049  306.61        1.13       775
En    Chal   4,720   306.87        1.33       15
Es    Train  32,917  154.22        1.06       1,022
Es    Test   6,232   161.29        1.08       730
Es    Chal   1,557   165.64        1.24       15
Es    Reg    559     233.95        1.07       59
Table 1: Summary statistics of English and Spanish MASIVE. Text lengths are measured in mT5 tokens. AS = Affective State; Chal = unseen challenge set; Reg = regional Spanish challenge set

2.2 Data Analysis

To validate the assumptions of our bootstrapping procedure and examine how affective states are used in our dataset, we collect human evaluations of the automatically identified affective states. Judgments are conducted by 2 native Spanish-speakers in Iberian Culture studies and 2 native English-speakers in Psychology for Spanish and English respectively. We randomly sample 250 texts from each language’s test set for evaluation such that 50 texts are shared by each pair. Annotators are provided with a full Reddit post with a single automatically-identified affective state highlighted. We ask annotators to judge the term in context on 3 dimensions, beginning with whether the highlighted term reflects an affective state. If a term is judged to reflect an affective state, annotators are asked to judge whether the highlighted term better reflects an emotion or a mood (we distinguish emotions, shorter-term feelings triggered by identifiable events, from moods, longer-term feelings not necessarily triggered by an event) and whether the highlighted term is used figuratively (e.g., "blue") or literally (e.g., "sad"). All 3 dimensions are judged on 4-point Likert scales where higher values mean the term primarily reflects an affective state, an emotion, and a literal usage, respectively. Annotators achieved moderate agreement in English (κ = .51) and substantial agreement in Spanish (κ = .69). Additional details concerning human annotations are included in Appendix B.
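For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Cohen's κ for two annotators over categorical judgments. The text does not state which κ variant or which binarization of the Likert ratings was used, so treating the scores as unweighted Cohen's κ is an assumption made only for illustration, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) when chance agreement is exactly 1.
    return (observed - expected) / (1 - expected)
```

Values near 0 indicate chance-level agreement and values near 1 indicate near-perfect agreement.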

Additionally, we analyze 2 aspects of our dataset that differentiate it from prior emotion detection benchmarks. First, because Spanish is a language with grammatical gender for adjectives, part of the affective state prediction problem in MASIVE includes choosing whether to use the masculine or feminine form in the context of the input. Second, authors in natural settings may also tend to express their feelings by stating how they do not feel (e.g., “I’m not happy, but….”), and we specifically include negations to test models’ capability to contend with this construction in both English and Spanish.

The results of the aforementioned data annotations as well as the automatically extracted statistics are included in Table 2. Human annotation results are reported as the percentage of affective states within the sample; for negations and grammatical gender, we report the percentage of texts in our datasets that include any target negations or any feminine adjectives (recall that a single datapoint may have multiple labels joined by "and"). Large majorities (88% and 72% in English and Spanish respectively) of terms were judged to reflect affective states, validating the contents of MASIVE.

      Human-Annotated        Automatic
Lang  Aff    Fig    Emo      Neg    Fem
En    88.4%  58.8%  34.2%    7.75%  n/a
Es    71.5%  38.5%  18.5%    27.0%  28.0%
Table 2: Human and automatic analysis of how affective states in the English and Spanish datasets are used in context. Aff: Affective State; Fig: Figurative; Emo: Emotion; Neg: Negations; Fem: Feminine Form

2.3 Fixed-Label Set Data

We additionally evaluate the performance of MASIVE-finetuned models on two previously published datasets in both English and Spanish. A key distinction from MASIVE is that these datasets feature limited label sets; we describe our evaluation procedures in subsection 3.2. In English, we evaluate on GoEmotions (Demszky et al., 2020), a commonly-used emotion dataset consisting of Reddit comments; it is originally labeled with 27 distinct emotion categories, though the authors also relabel the data with the Ekman basic emotions. We additionally evaluate on EmoEvent (Plaza del Arco et al., 2020), a dataset with both English and Spanish subsets of Tweets (among other languages) also labeled with the Ekman set.

2.4 Machine-Translated Data

Finally, we conduct 2 cross-lingual experiments expanding on prior work investigating the use of machine translation and high-resource language models for inference on lower-resource languages (Isbister et al., 2021; Kathunia et al., 2024). In contrast to prior findings, however, we hypothesize that translating neither the training nor the evaluation data will enable performance competitive with models trained on native data. First, using our natural test sets, we evaluate models finetuned on translated data. Second, we evaluate the performance of our native-trained models on translated data, mimicking the translation of lower-resource language data for inference with a model trained on a higher-resource language.

In both settings, we use bilingual Opus-MT models (Tiedemann and Thottingal, 2020) to independently translate the input documents and target affective state labels. We select Opus-MT models as they are accessible, open-source models, reflecting resources that may be used for large scale translation, and are utilized in experiments in Kathunia et al. (2024). Throughout the experiments, models finetuned on translated data are denoted Tr. Test sets generated through machine translation are similarly denoted as EnTr and EsTr.

3 Experimental Configuration

3.1 Models

We experiment with finetuning small language models on our original and machine-translated data. We also perform experiments with two Large Language Models (LLMs) in a zero-shot setting.

Finetuned Generative Models.

Most of our models are based on mT5-Large (Xue et al., 2021a). During finetuning and prediction on MASIVE, we mask the automatically identified affective state words wherever they appear and task models to fill them, mimicking mT5’s initial pretraining. We additionally experiment with T5-large (Raffel et al., 2019) for English only, as no comparable monolingual T5 checkpoint for Spanish has been made publicly available. In the results, models’ superscripts denote that a model was finetuned on our English (T5En and mT5En) or Spanish (mT5Es) corpus.
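As a rough illustration of the masking step, the sketch below replaces each affective state occurrence with a T5-style sentinel token and builds the matching target string. It assumes the standard `<extra_id_N>` sentinel convention from T5 span corruption, including a closing sentinel at the end of the target; the function name and the regex-based matching are assumptions for the example rather than the exact preprocessing used here.

```python
import re


def mask_affective_states(text, affective_states):
    """Return (masked_input, target) using T5-style sentinel tokens."""
    if not affective_states:
        return text, ""
    # Try longer terms first so that a shorter term never pre-empts a longer one.
    alternatives = sorted(affective_states, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in alternatives) + r")\b",
        flags=re.IGNORECASE,
    )
    spans = []

    def to_sentinel(match):
        # Record the masked word and emit the next sentinel in sequence.
        spans.append(match.group(0))
        return f"<extra_id_{len(spans) - 1}>"

    masked = pattern.sub(to_sentinel, text)
    pieces = [f"<extra_id_{i}> {span}" for i, span in enumerate(spans)]
    pieces.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return masked, " ".join(pieces)
```

For the input "I feel happy and excited today" with the labels "happy" and "excited", this yields the masked text "I feel <extra_id_0> and <extra_id_1> today" and a target that lists each sentinel followed by the word it replaced.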

Large Language Models.

We evaluate two modern, open-source LLMs, Llama-3 (https://llama.meta.com/llama3) and Mixtral-Instruct (Jiang et al., 2024), as these models have been specifically evaluated in multilingual settings. We instruct these models to perform the same masked token prediction task as mT5 (see Appendix C). Due to context window constraints and input lengths, LLMs are evaluated in a zero-shot setting. Further checkpoint and generation hyperparameter details are included in Appendix C.

3.2 Metrics

3.2.1 MASIVE Evaluation

We report top-k accuracy for our models with k ∈ {1, 3, 5}, along with two generative metrics: the negative log-likelihood (NLL) of the gold affective state and the model’s log perplexity. As some samples in the datasets have multiple labels, we calculate top-k accuracy at the sample level using beam search and report average sample-level scores. In Spanish, if the gendered form of the prediction does not match that of the gold term (e.g. enojado vs. enojada), the prediction is considered incorrect, but the similarity of the prediction in these cases is captured by the top-k similarity metric, which we describe below.
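One plausible reading of the sample-level top-k accuracy is sketched below: each sample is scored by the fraction of its masked slots whose gold label appears among the top-k ranked candidates, and the sample scores are then averaged. The data layout (a list of gold labels paired with one ranked candidate list per slot) and the function names are assumptions for illustration; matching is exact, so a gender mismatch such as enojado vs. enojada counts as an error.

```python
def sample_topk_accuracy(gold_labels, ranked_predictions, k):
    """Fraction of masked slots whose gold label is among the top-k candidates."""
    hits = [gold in preds[:k] for gold, preds in zip(gold_labels, ranked_predictions)]
    return sum(hits) / len(hits)


def topk_accuracy(samples, k):
    """Average of sample-level scores over (gold_labels, ranked_predictions) pairs."""
    scores = [sample_topk_accuracy(gold, preds, k) for gold, preds in samples]
    return sum(scores) / len(scores)
```

Averaging per sample first means that a post with several affective states contributes the same weight as a post with one.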

Top-k Similarity.

Because our label set is very large, we also report a measure of similarity between the model’s top predictions and the gold. Here, we rely on contextual embeddings using multilingual, pre-trained BERT-base (Devlin et al., 2019). To ensure that the similarity model encodes affective senses of each term, we embed the predicted and gold emotion terms within 100-token contexts from the original post and calculate cosine similarity between them. We report the maximum similarity of these contextual embeddings when looking at the top 1, 3, and 5 most likely model predictions. Full details are available in Appendix D.
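The aggregation step of top-k similarity can be sketched as the maximum cosine similarity between the gold embedding and the embeddings of the top-k predictions. The sketch assumes the contextual embeddings have already been computed (in the setup above, by embedding each term within its post context using multilingual BERT) and are passed in as plain lists of floats; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topk_similarity(gold_vec, ranked_pred_vecs, k):
    """Maximum similarity between the gold embedding and the top-k predictions."""
    return max(cosine(gold_vec, pred) for pred in ranked_pred_vecs[:k])
```

Because the maximum is taken over a growing candidate set, the score can only stay the same or increase as k grows, which matches the monotone pattern of Sim@1, Sim@3, and Sim@5 in the result tables.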

Lang Model NLL↓ Log Perp↓ Acc@1↑ Acc@3↑ Acc@5↑ Sim@1↑ Sim@3↑ Sim@5↑
En T5En 23.04 7.11 17.9% 27.3% 32.5% 0.556 0.665 0.711
mT5En 25.11 12.40 12.4% 17.5% 20.1% 0.461 0.525 0.552
Llama-3 64.39 48.85 2.2% 3.4% 4.0% 0.436 0.470 0.487
Mixtral 1.27 7.7% 9.3% 10.8% 0.474 0.500 0.523
Es mT5Es 7.14 7.05 20.1% 29.6% 34.4% 0.490 0.593 0.637
Llama-3 80.56 63.87 2.3% 3.1% 3.6% 0.435 0.468 0.482
Mixtral 2.02 7.4% 9.5% 11.2% 0.445 0.477 0.502
Table 3: Comparison of T5, mT5, and two LLMs on our proposed Reddit dataset, aggregated scores only. Note that the Spanish test set and the English test set are not directly comparable as noted in subsection 3.2. Bolded scores highlight the best-performing multilingual model.

Dataset          Model   P      R      F1     Acc@1  Acc@3  Acc@5  Sim@1  Sim@3  Sim@5
GoEmotions (27)  mT5     22.45  2.53   0.97   2.5%   12.9%  23.5%  0.525  0.614  0.670
GoEmotions (27)  mT5MAS  35.91  7.06   5.28   5.0%   9.6%   13.6%  0.512  0.579  0.609
GoEmotions (7)   mT5     28.28  38.63  23.53  38.5%  70.7%  86.0%  0.736  0.884  0.946
GoEmotions (7)   mT5MAS  73.42  28.74  27.79  23.8%  37.2%  45.1%  0.663  0.751  0.783
EmoEvent (En)    mT5     6.01   10.48  2.21   10.5%  72.2%  93.3%  0.638  0.884  0.972
EmoEvent (En)    mT5MAS  68.76  35.73  36.90  32.8%  60.5%  72.9%  0.719  0.858  0.903
EmoEvent (Es)    mT5     19.53  23.63  11.04  23.5%  64.1%  88.2%  0.721  0.869  0.953
EmoEvent (Es)    mT5MAS  31.90  31.83  26.83  32.2%  79.0%  85.6%  0.730  0.906  0.938

Table 4: Performance of mT5 finetuned on emotion classification datasets, with and without prior finetuning on MASIVE. Bolded scores highlight the best performing model on each dataset under each metric.

3.2.2 Fixed-Label Set Evaluation

To evaluate how well our dataset imbues models with general emotional knowledge, we evaluate two variants of mT5: first, mT5 finetuned only on existing emotion benchmarks, and second, mT5 finetuned on MASIVE followed by existing benchmarks (denoted with superscript MAS).

To adapt the evaluation sets to our generative setting, we append "I feel <extra_id_0>" to the end of each input to match the format of our evaluation on MASIVE (see Figure 1), using adjective forms of the gold emotion labels. In this setting, we report top-k accuracy and similarity as we do for MASIVE. Additionally, to adapt our models to the fixed-label set setting, we sort the fixed set of emotion labels by their likelihood according to the model and select the most probable emotion label as the prediction. For these experiments, we report macro precision, recall, and F1 score.
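The fixed-label adaptation can be sketched as follows: append the prompt suffix, score every candidate label, and take the highest-scoring one. Here `score_fn` is a hypothetical callable standing in for the model's log-likelihood of a label as the fill for the masked slot; it and the function names are assumptions for illustration only.

```python
def build_prompt(text):
    """Append the masked 'I feel' suffix used to match the MASIVE format."""
    return f"{text} I feel <extra_id_0>"


def classify(text, labels, score_fn):
    """Rank the fixed label set by model score and return (best_label, ranking).

    score_fn(prompt, label) is assumed to return a log-likelihood, so higher
    values mean the label is more probable.
    """
    prompt = build_prompt(text)
    ranked = sorted(labels, key=lambda label: score_fn(prompt, label), reverse=True)
    return ranked[0], ranked
```

Restricting the ranking to the benchmark's label set turns the generative model into a classifier, so macro precision, recall, and F1 can be computed on the top-ranked label.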

4 Results

4.1 MASIVE Evaluation

Table 3 presents the performance metrics for finetuned mT5, Llama-3, and Mixtral on our English and Spanish test sets, as well as finetuned T5 for the English test set only. Among multilingual models, mT5 outperforms both LLMs on top-k accuracy for both languages (Takeaway #1), despite having drastically fewer parameters. (Llama-3 occasionally refuses to make a prediction if the content discussed is sensitive, e.g., drug use; results with invalid responses filtered out are included in Appendix E.) Additionally, mT5 achieves the highest top-k similarity scores among multilingual models, except for top-1 similarity in English, where Mixtral scores slightly higher (0.474 vs. 0.461). Between the LLMs, Mixtral tends to outperform Llama-3. This performance difference may be explained by the difference in size between models, as well as the fact that multilingual data was upsampled in Mixtral’s pretraining compared to prior models.

In English, the large variant of T5 has been shown to slightly outperform mT5 (Xue et al., 2021b). We find a difference in the same direction, but the gap on ASI is considerably larger (17.9% vs. 12.4% top-1 accuracy in Table 3). Because the remaining experiments include Spanish data, we focus on mT5. We note, however, that dedicated monolingual English models may offer significantly higher performance on ASI (Takeaway #2) and leave further exploration of the differences between monolingual and multilingual models to future work.

While the differences in language and content of the English and Spanish datasets prevent us from making conclusions concerning their relative difficulty, Table 3 also shows that performance in Spanish tends to be higher than in English, despite the better representation of English in pre-training and larger size of the collected English data compared to Spanish. This trend could be due to the larger set of unique affective states in our English data than Spanish, with more nuanced affective states that may be difficult for models to predict accurately.

Lang  Model  Subset  Acc@1↑  Acc@3↑  Acc@5↑  Sim@1↑  Sim@3↑  Sim@5↑
En    T5En   Seen    31.2%   46.0%   53.0%   0.620   0.742   0.788
En    T5En   Unseen  2.9%    6.2%    9.3%    0.484   0.578   0.623
En    mT5En  Seen    22.5%   31.8%   36.6%   0.503   0.581   0.612
En    mT5En  Unseen  1.0%    1.3%    1.5%    0.415   0.463   0.484
Es    mT5Es  Seen    26.5%   39.0%   45.1%   0.522   0.632   0.677
Es    mT5Es  Unseen  0.8%    1.5%    2.0%    0.394   0.476   0.519
Table 5: Comparison of mT5 performance between affective states included and held out from finetuning in English and Spanish.

Model    Acc@1↑  Acc@3↑  Acc@5↑  Sim@1↑  Sim@3↑  Sim@5↑
mT5Es    3.9%    7.3%    9.8%    0.369   0.449   0.485
Llama-3  0.0%    0.1%    0.2%    0.366   0.391   0.401
Mixtral  0.1%    0.3%    0.4%    0.341   0.343   0.343
Table 6: Evaluation of Spanish-finetuned mT5, Llama-3, and Mixtral on region-specific Spanish affective states. Bolded metrics highlight the best-performing model.

4.2 Fixed-Label Set Evaluation

To evaluate the generalized emotion detection capabilities afforded by finetuning on MASIVE, Table 4 shows the performance of mT5 finetuned on existing English and Spanish emotion benchmarks, both with and without prior finetuning on MASIVE. First, when used as a classifier, we find that mT5 first finetuned on MASIVE achieves a higher macro-F1 on all datasets. This suggests that finetuning on our corpus gives models generalizable knowledge of emotions (Takeaway #3). Because our corpora contain many more affective state labels than the evaluation datasets, models finetuned on MASIVE will include more nuanced terms than basic emotions in the top-k predictions. So, as expected, models finetuned only on the emotion benchmarks generally achieve higher top-k accuracy and similarity scores, most consistently at k = 5, as they are more likely to predict terms within the smaller label sets. The top-k similarity scores for our models, however, remain high, suggesting that the generated affective states are similar to the ground truth basic emotion labels.

4.3 Unseen and Regional Set Evaluation

To analyze how well models generalize beyond affective states explicitly included in finetuning, we present performance metrics on seen and unseen affective states in both languages in Table 5. In both languages, all models perform considerably better on affective states included in the finetuning data than on unseen affective states. T5En, however, maintains better performance on unseen affective states than mT5En, suggesting that monolingual models may better generalize.

In addition to unseen affective states, we present evaluation results on a subset of Spanish affective states which are region-specific in Table 6. Similarly to results on the full Spanish data, finetuned mT5Es outperforms both LLMs in top-k accuracy and similarity. Notably, the performance of mT5Es on this regional subset is comparable to its performance on general unseen Spanish emotions (Table 5), while Llama-3 and Mixtral, which are not explicitly finetuned on our corpora, perform significantly worse on the regional subset than they do on the Spanish data as a whole (Table 3). Because top-k accuracy drops significantly on unseen and region-specific affective states (top-k similarity as well, though less so), future work in this area should prioritize a generalized understanding of affective states, including regionalisms (Takeaway #4).

Figure 2: Top-k accuracy and similarity results on subsets reflecting different linguistic constructions in MASIVE: grammatical gender of affective states in Spanish (left) and negated expressions in Spanish (center) and English (right). Shades reflect different values of k separated by small gaps, where the lightest shade represents k = 1 and the darkest shade represents k = 5.

Test Set  Model            NLL↓   Log Perp↓  Acc@1↑  Acc@3↑  Acc@5↑  Sim@1↑  Sim@3↑  Sim@5↑
En        mT5^{En}_{S}     11.16  11.13      12.5%   19.6%   23.5%   0.460   0.551   0.590
En        mT5^{En}_{Tr}    18.37  17.42      1.5%    3.0%    4.2%    0.358   0.452   0.491
EsTr      mT5^{Es}         56.06  46.30      2.0%    3.2%    3.9%    0.367   0.426   0.451
Es        mT5^{Es}         7.14   7.05       20.1%   29.6%   34.4%   0.490   0.593   0.637
Es        mT5^{Es}_{Tr,S}  17.74  17.34      1.6%    3.0%    4.0%    0.341   0.420   0.458
EnTr      mT5^{En}_{S}     28.11  27.98      1.4%    3.1%    4.4%    0.391   0.477   0.513
Table 7: Comparison of mT5 models finetuned on the original data reflecting native use and translated data on our proposed Reddit dataset, aggregated scores only. All finetuning sets are randomly subset to the same size as the smallest set, the collected Spanish training set, and results are averaged across 5 different subsets (n = 32,917). Models with subsetted data are denoted with S.

4.4 Grammatical Gender and Negations

We break down the top-k accuracy and top-k similarity results for each model by grammatical gender and negations in Figure 2. We see again that mT5 outperforms both LLMs across all subsets, and that mT5 often places the gold label among the top 3 or 5 predictions if not the top 1. In particular, mT5Es performs better on feminine adjectives than masculine adjectives or those with only a single form, and T5En and mT5Es perform better on negated targets than non-negated targets (mT5En shows the same pattern for accuracy, though not similarity). Llama-3 and Mixtral achieve highest accuracy for masculine adjectives and highest similarity for single-form adjectives, while for negations, Llama-3 performs better on non-negations and Mixtral performs slightly better on negations. These results suggest that explicit training on MASIVE may improve performance specifically on unique features of generative ASI (Takeaway #5).

4.5 Machine-Translation vs. Natural Data

Finally, we evaluate the changes in performance first when using machine-translated finetuning data and alternatively when translating evaluation data in Table 7. First, we find an expected drop in performance when models are finetuned on machine-translated data for both English and Spanish. Interestingly, the drops in accuracy and similarity metrics in Spanish (90% and 28%, respectively) are notably larger than in English (77% and 14%). This could perhaps be explained by the translation model performing better in the Spanish to English direction than English to Spanish, as well as mT5’s ability to better generalize in English than in Spanish.

As an alternative approach to finetuning on translated data, we also consider the case where data may be translated at inference time. In these cases (EnTr and EsTr in Table 7), we find that performance falls. Artifacts of machine translation have been found to impact evaluation of translation models (Freitag et al., 2020), and, similarly, errors and artifacts of unnatural translation may cause these changes in performance. In contrast to prior work suggesting that performance on the target data translated into English is comparable to finetuning on the target language for tasks such as sentiment detection, our results suggest that for our task, machine-translating the evaluation data leads to poorer performance, and translating at either training or inference time results in similar performance (Takeaway #6).

5 Related Work

Emotion Taxonomies.

Many different models of human emotion have been proposed, intending to capture the universal experience of different emotions across cultures. Some of the most notable categorical models in psychology and NLP research are the Ekman (1984, 2005) basic emotion set derived from facial expression and the Plutchik (1980) basic emotion set which assumes emotions occur in opposing pairs (e.g. joy and sadness), though other models exist (e.g., Ortony et al. 1988; Oatley and Johnson-laird 1987; Johnson-Laird and Oatley 1998; PS and Mahalakshmi 2017). Multiple different dimensional models have also been proposed, situating emotions in a space governed by features such as pleasantness and activation (Plutchik, 1980; Russell and Mehrabian, 1977; Russell, 1980; Bradley et al., 1992). Many such models of emotions have been frequently compared and evaluated in psychology and as they apply to emotion detection (see Rubin and Talarico 2009; PS and Mahalakshmi 2017; Lichtenstein et al. 2008).

Emotion and Language Generation.

Numerous approaches to automated emotion detection in text have been proposed, including emotion lexicons (Strapparava and Valitutti, 2004; Staiano and Guerini, 2014; Araque et al., 2018; Mohammad and Turney, 2010) and classification models (see Acheampong et al. 2020 for a review of approaches). Most of this work focuses on small, finite emotion sets, usually Ekman or Plutchik, though some prior work has used larger sets (Sintsova et al., 2013; Liew et al., 2016; Subasic and Huettner, 2001). Mohammad and Kiritchenko (2015) in particular collect data for a very large but still strictly limited set of emotions. More recently, language generation tasks have been proposed that call for models with greater emotional understanding, such as emotional dialogue generation (Ide and Kawahara, 2021; Song et al., 2019; Firdaus et al., 2020), controllable generation (Goswamy et al., 2020; Saha et al., 2022), and emotion trigger summarization (Sosea et al., 2023; Zhan et al., 2022). Given that language generation models have been employed to unify these and other classification and generation tasks, endowing models with a greater understanding of human emotions would greatly benefit multiple applications.

Cross-cultural Emotion Perception.

Many researchers have suggested that a basic set of emotions is universal, while others have argued that emotions are shaped by culture. Past work has built on Ekman’s proposal and provided evidence that emotion categories are universal (Ekman, 1984; Hoemann et al., 2019), with Sauter (2018) finding little support for the argument that language plays a foundational role in perceiving emotions. In contrast, other work has at least in part supported differences in emotion perception and recognition across languages and cultures (Chen et al., 2023; Mesquita et al., 2016; Jackson et al., 2019), even with bilingual speakers (Caldwell-Harris and Ayçiçeği-Dinn, 2009). Past work in sentiment and emotion in NLP frequently translates English corpora to enable multilinguality (Yang and Hirschberg, 2019; Tafreshi et al., 2024). Some work, however, has demonstrated cross-cultural differences in model performance (Havaldar et al., 2023), and approaches that do not rely on machine translation have also been proposed (e.g., Rasooli et al. 2018). Our work evaluates the use of machine translation for the ASI task, and we find that machine translation may not be sufficient for cross-lingual transfer.

6 Conclusion

In this work, we introduce the novel task of affective state identification, a language generation task prioritizing the authors’ natural expressions of their feelings rather than using a prescribed set of emotion labels. For this task, we automatically collect and publish two datasets of Reddit posts in English and Spanish, both containing over 1,000 unique affective state labels.

We use this dataset to benchmark multilingual generative models, and find that (Takeaway #1) small finetuned T5 and mT5 models outperform zero-shot LLMs. Results specifically show that (Takeaway #2) T5 significantly outperforms mT5 in English on ASI, suggesting that monolingual models may be more capable. Additionally, we show that (Takeaway #3) models finetuned on our corpora transfer knowledge that generalizes to existing emotion detection benchmarks. In analyzing model performance on unseen emotions and Spanish regionalisms, we argue that (Takeaway #4) generalization to a broader set of affective states, including those from underrepresented dialects, is an important avenue for future work. With respect to grammatical gender in Spanish and negations, (Takeaway #5) finetuning on MASIVE improves on specific linguistic constructions unique to generative ASI. Finally, we quantify the observed performance differences when using machine-translated data at finetuning or inference time, finding that in contrast to prior work, (Takeaway #6) machine translation leads to large performance drops. We hope these results spark future work into ASI to enable prediction of more nuanced feelings in a variety of languages and contexts, and ultimately, enable prediction of an unbounded set of labels.

Limitations

We limit ourselves in this work to investigating two high-resource languages, English and Spanish, in part because for this application, we find it important that members of the research team be able to speak the languages of study fluently. Additionally, we gather data from one source, Reddit, which limits the demographics of the people whose experiences are represented in our data. This choice of data source may particularly limit our Spanish data, which includes less data and fewer labels than English (Table 1). We choose not to control for things like topic or subreddit when collecting English and Spanish data separately because we wish to collect a natural variety of data, but this also means that we do not claim our two datasets to be equivalent.

Our data gathering framework collects only explicit expressions of affective states by searching for statements including an “I feel”-style template. While we can use models trained on this type of data to predict affective state labels for any input by simply appending an “I feel” statement to be filled (see subsubsection 3.2.2), our training targets do not include this type of data, and this paradigm impacts the types of affective states we are likely to collect.

We also acknowledge that our choices of specific resources limit our work in various ways. We use only Opus-MT models to perform our machine translation experiments because they exhibit good performance in both languages; however, it is possible that we would see different results with different translation models. Our similarity metric also uses pre-trained BERT embeddings because of the benefits of contextual embeddings and subword tokenization, but there are many other possible choices of embedding framework that may more accurately capture emotional nuances. Finally, we evaluate only open-source LLMs on our dataset.

Ethics Statement

We strictly collect publicly available user-authored texts on the pseudonymous social media website Reddit, but we acknowledge the privacy concerns of users when collecting data from social media. Accordingly, we will release the collected texts only with randomly assigned IDs and usernames stripped. We discourage others from attempting to identify authors of the texts in the collected dataset, and will remove data from the dataset upon request.

Because we rely entirely on open-source models, including open-source LLMs, and make our data available, our results are fully reproducible. We also release our code and model checkpoints along with our data. In total, our finetuning and evaluation amounts to approximately 73 hours using Nvidia A100 GPUs.

Our task allows models to predict a larger set of affective states, capturing more nuanced expressions of an author’s feelings than traditional emotion detection. At the same time, a larger label set could exacerbate the consequences of misclassification in sensitive contexts (e.g., mental health and crisis settings). In applications of this task where this is an important consideration, the label set can be artificially restricted, as we show in our external evaluation experiments.

Finally, the aim of predicting authors’ expressions of their own feelings can require models to generate regional or dialectal texts. Prior work has identified dialectal biases in language models (e.g., African American Language; Deas et al. 2023; Groenwold et al. 2020) and we find that all evaluated models perform poorly on regional varieties of Spanish. We hope future work makes progress toward closing performance gaps among dialects and language varieties.

Acknowledgements

This work was supported in part by grant IIS2106666 from the National Science Foundation, the Defense Advanced Research Projects Agency (DARPA) Cross-Cultural Understanding (CCU) program under Contract No HR001122C0034, National Science Foundation Graduate Research Fellowship DGE-2036197, a research gift from Amazon, the Columbia Provost Diversity Fellowship, and the Columbia School of Engineering and Applied Sciences Presidential Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and should not be interpreted as representing the official views or policies of the National Science Foundation, the Department of Defense, or the U.S. Government. We thank Julia Hirschberg and Melanie Subbiah for feedback on earlier drafts of this work.

References

Appendix A Data Statement

A.1 Curation Rationale

The aim of collecting the texts contained in MASIVE was to produce both a training dataset and benchmark for affective state identification. Affective state identification tasks models with predicting individual terms reflecting how a text’s author feels, and in particular, predicting terms that would be used by the author themself. The dataset collection process was designed to automatically extract a large set of possible affective state labels from texts where an author explicitly describes how they feel. Both an English and a Spanish version of the dataset were collected in the same fashion to enable cross-lingual research, as well as a small set of regional Spanish data to enable work on linguistic variation. We intend to make the dataset publicly available.

A.2 Language Variety

MASIVE contains texts in both English (en) and Spanish (es). Data collection was not restricted to a particular variety of English or Spanish, and the distribution of these varieties likely reflects the overall demographics of English- and Spanish-speaking users on Reddit. A small set of data was collected specifically to reflect Spanish specific to particular regions, including terms primarily associated with Spanish spoken in Mexico, Spain, Venezuela, and El Salvador, among other regions and countries.

A.3 Annotator Demographics

Two sets of annotators were involved in validating the automatically extracted labels in MASIVE. For the English data, the annotators were 2 native English speakers who were Psychology undergraduate students. Both English data annotators were American and female. For the Spanish data, the annotators were 2 native Spanish speakers who were graduate students in the department of Latin American and Iberian Cultures. The Spanish data annotators were Colombian and Ecuadorian, and both were male.

A.4 Speech Situation

The collected texts in MASIVE were not restricted to a particular range of time, and may have been published anytime between the founding of Reddit (2005) and the time of data collection (April, 2024). Texts were also not restricted to a particular place, but likely reflect the countries of origin of English- and Spanish-speaking Reddit users. All texts were originally written and published on Reddit, and may or may not have been edited by their authors before they were included in the dataset. As with most interactions through Reddit posts, the texts reflect asynchronous interactions and are likely intended for a general public audience in most cases.

A.5 Text Characteristics

The texts in MASIVE may discuss a wide variety of topics. All texts, however, contain explicit expressions of feelings or explicit mentions of terms that may reflect feelings. Thus, many texts may reflect personal narratives that provide context for an author’s feelings. As a result, the dataset may also discuss sensitive topics and include the kinds of offensive or harmful content that can be found online.

Appendix B Data Collection and Annotation

B.1 Seed Emotions

The specific adjective forms of the Ekman emotions used to seed our bootstrapping procedure are shown in Table 8. These are also the terms used as the gold in our fixed-set label evaluation, with the addition of ‘nothing’ for the no-emotion class if it is used.

For fixed-label evaluation of GoEmotions (27), the following terms are used for the expanded label set: ‘admiration’, ‘amused’, ‘angry’, ‘annoyed’, ‘approving’, ‘caring’, ‘confused’, ‘curious’, ‘desire’, ‘disappointed’, ‘disapproval’, ‘disgusted’, ‘embarrassed’, ‘excited’, ‘afraid’, ‘grateful’, ‘grief’, ‘happy’, ‘love’, ‘nervous’, ‘optimistic’, ‘proud’, ‘realized’, ‘relieved’, ‘remorseful’, ‘sad’, ‘surprised’, and ‘nothing’.

Lang Seed emotions
En happy, sad, angry, surprised, disgusted, afraid
Es feliz, triste, enojado, sorprendido, desagradado, asustado
Table 8: Seed emotions (Ekman) for each language used in collecting the Reddit corpus.

B.2 Regional Spanish Affective States

To collect affective state labels associated with one or more particular Spanish-speaking regions, we use the following set of terms: ‘mamado/a’, ‘patitieso/a’, ‘emputado/a’, ‘encandilado/a’, ‘arrechado/a’, ‘fastidiado/a’, ‘encabronado/a’, ‘hallado/a’, ‘rayado/a’, ‘achispado/a’, ‘ahuevado/a’, ‘enrabiado/a’, ‘tusa’, ‘chocho/a’, ‘encachimbado/a’, ‘bravo/a’, ‘apantallado/a’, ‘embromado/a’, ‘engorilado/a’, ‘alicaido/a’, ‘flipando/a’, ‘cagado/a’, ‘aguitado/a’, ‘engrinchado/a’, ‘chato/a’, ‘chipil’, ‘picado/a’, ‘bajoneado/a’, ‘acojonado/a’, ‘arrecho/a’.

The terms are not exhaustive, but reflect varieties of Spanish spoken in Spain, Chile, Colombia, Venezuela, Mexico, Bolivia, Argentina, Uruguay, and Paraguay.

B.3 Data Annotation

Figure 3: Instructions provided to our human annotators, including definitions. Annotators may collapse or expand the instructions at will.
Figure 4: Human annotation interface with a sample datapoint. Clicking the button to show more or less context toggles the display of the full Reddit post vs. the one-sentence context. As shown, the Emotion/Mood and Figurative Language questions only appear if the highlighted term is judged like an affective state or completely an affective state.

The instructions and interface given to our human annotators are shown in Figure 3 and Figure 4, respectively. Annotators were paid $23/hour for their work in accordance with the standards of their university. Each annotator completed a pilot task of 30 examples before beginning to annotate the data in order to build familiarity with the platform and task.

Appendix C Experimental Setup

C.1 Generation Configuration

Checkpoints.

Throughout our experiments, we use the large variants of T5 (770 million parameters; google-t5/t5-large) and mT5 (1.2 billion parameters; google/mt5-large). For our two LLMs, we evaluate the instruct variants of Llama-3 (8 billion parameters; meta-llama/Meta-Llama-3-8B-Instruct) loaded in bfloat16 and Mixtral (8×22 billion parameters; mistralai/Mixtral-8x22B-Instruct-v0.1). Mixtral is accessed through the fireworks.ai API.

Beyond the evaluated models, we use two open-source, unidirectional translation models for our translation experiments. In particular, we employ the Helsinki-NLP English-to-Spanish (Helsinki-NLP/opus-mt-en-es) and Spanish-to-English (Helsinki-NLP/opus-mt-es-en) models. We also use a multilingual BERT checkpoint as part of the similarity metric (168 million parameters; bert-base-multilingual-uncased). Finally, we rely on spaCy (Honnibal et al., 2020) to identify parts of speech in English (en_core_web_md) and Spanish (es_core_news_md) during our data collection.

Generation.

For T5, mT5, and Llama-3, we use beam search with 5 beams to generate the 5 most likely predictions, as we need only the top-5 outputs. We otherwise use the default settings of Huggingface’s GenerationConfig (e.g., no repetition penalty); though we expect a single-word output, we allow generations of up to 32 tokens.

The API used to run inference with Mixtral does not allow retrieving the top 5 most probable predictions as we do with the aforementioned models. Instead, Mixtral predictions are generated with a top-k of 5, and a temperature of 0.5. The top 5 candidate generations are then reranked by the log-probability according to Mixtral to be used in evaluating the ranked, top-5 predictions. Also due to accessing Mixtral through an API, we were not able to calculate the log perplexity of the ground truth labels.
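The reranking step for sampled candidates can be illustrated with a minimal sketch. It assumes that each sampled generation arrives with a log-probability assigned by the same model; the function name `rerank_candidates`, the case-insensitive deduplication, and the toy scores are illustrative assumptions, not details of our implementation.

```python
from typing import Dict, List, Tuple


def rerank_candidates(samples: List[Tuple[str, float]], k: int = 5) -> List[str]:
    """Deduplicate sampled generations and rank them by log-probability.

    `samples` holds (text, log_probability) pairs. Deduplicating on the
    lowercased, stripped text is an assumption made for this sketch.
    """
    best: Dict[str, float] = {}
    for text, logprob in samples:
        key = text.strip().lower()
        # Keep the highest score seen for each distinct candidate.
        if key not in best or logprob > best[key]:
            best[key] = logprob
    # Sort candidates from most to least probable and keep the top k.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy scores, invented for illustration only.
samples = [("anxious", -1.2), ("sad", -0.7), ("Anxious ", -1.0), ("tired", -2.3)]
print(rerank_candidates(samples))
```

The resulting ordered list can then be scored with the same ranked, top-5 metrics used for the beam search outputs.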

Hyperparameters.

T5 and mT5 models are finetuned with a batch size of 4 for 3 epochs each. Model parameters are optimized using Adafactor (Shazeer and Stern, 2018) as implemented by Huggingface’s transformers with a learning rate of $1\times 10^{-4}$, Huggingface’s linear learning rate scheduler with default parameters, and a weight decay parameter (here, an L2 penalty) of 0.01. For each model, all data is tokenized using the correct pretrained tokenizer corresponding to its pretrained checkpoint. Any input that is longer than 512 tokens (including the end-of-sequence token) is trimmed to fit; in order to preserve the target affective state masks and the grammatical integrity of the text, this trimming removes full sentences (as parsed by nltk (Bird et al., 2009)) from the end of the text if possible (i.e., if this will not remove a target mask), or the beginning otherwise, until the text fits within 512 tokens.
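The sentence-level trimming rule can be sketched as follows. This is an illustration under stated assumptions rather than our released code: a regular expression stands in for the nltk sentence parser, whitespace tokens stand in for the pretrained subword tokenizer, and `<MASK>` is a placeholder for the model-specific mask token. All function names are hypothetical.

```python
import re
from typing import Callable, List

MASK = "<MASK>"  # placeholder for a masked affective state (assumption)


def split_sentences(text: str) -> List[str]:
    # Naive splitter used only for illustration; the paper uses nltk.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_tokens(text: str) -> int:
    # Whitespace tokens stand in for the model's subword tokenizer.
    return len(text.split())


def trim_to_fit(text: str, max_tokens: int = 512,
                n_tokens: Callable[[str], int] = count_tokens) -> str:
    """Drop whole sentences until the text (plus EOS) fits in max_tokens."""
    sentences = split_sentences(text)
    # The "+ 1" reserves one position for the end-of-sequence token.
    while len(sentences) > 1 and n_tokens(" ".join(sentences)) + 1 > max_tokens:
        if MASK not in sentences[-1]:
            sentences.pop()      # trim from the end when no mask is lost
        elif MASK not in sentences[0]:
            sentences.pop(0)     # otherwise trim from the beginning
        else:
            break                # masks at both ends; nothing safe to drop
    return " ".join(sentences)


print(trim_to_fit("I went home. I feel <MASK> today. It rained a lot.", max_tokens=8))
```

With a real tokenizer, `n_tokens` would be replaced by a function that returns the length of the encoded input.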

C.2 Prompts

Table 9 shows the prompts provided to Mixtral and Llama-3 throughout our experiments. In a minority of cases, models would reply in the form "Here is a list of terms to fill each <MASK>: ", in which case, only the terms following the colon were considered as the model’s prediction.

Lang Prompt
En Determine the most likely term reflecting a feeling to replace each <MASK> in the following text: "<MASKED_POST>" Provide a single emotion term for each <MASK> token. Do not introduce the answer, respond ONLY with a comma-separated list of lowercase terms:
Es Determine the most likely term reflecting a feeling to replace each <MASK> in the following text: "<MASKED_POST>" Provide a single emotion term for each <MASK> token. Do not introduce the answer, respond ONLY with a comma-separated list of lowercase terms in Spanish:
Table 9: Prompts provided to Llama-3 and Mixtral for evaluation. At inference time, <MASKED_POST> is replaced with the input text containing masked affective states.
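As an illustration of how the English prompt in Table 9 and the response handling described above fit together, the following sketch fills the template and parses a reply. The helper names `build_prompt` and `parse_response` are hypothetical, and the exact cleanup rules (splitting on the last colon, stripping quotes and periods) are assumptions made for this example.

```python
from typing import List

# English prompt from Table 9; <MASKED_POST> is filled in at inference time.
PROMPT_EN = (
    "Determine the most likely term reflecting a feeling to replace each "
    '<MASK> in the following text: "<MASKED_POST>" Provide a single emotion '
    "term for each <MASK> token. Do not introduce the answer, respond ONLY "
    "with a comma-separated list of lowercase terms:"
)


def build_prompt(masked_post: str, template: str = PROMPT_EN) -> str:
    """Insert the masked post into the prompt template."""
    return template.replace("<MASKED_POST>", masked_post)


def parse_response(response: str) -> List[str]:
    """Extract a comma-separated list of lowercase terms from a model reply."""
    text = response.strip()
    # If the model introduced its answer, keep only what follows the colon.
    if ":" in text:
        text = text.rsplit(":", 1)[1]
    # Strip whitespace, surrounding quotes, and trailing periods per term.
    terms = [t.strip().strip(".\"'").lower() for t in text.split(",")]
    return [t for t in terms if t]


print(parse_response("Here is a list of terms to fill each <MASK>: sad, Tired."))
```

Each returned term corresponds, in order, to one <MASK> token in the input.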

C.3 Machine Translation Configuration

In the finetuning experiment, we subset the English data and translated English-to-Spanish data to keep the number of training steps constant across settings. For these two models, we repeat the experiment with 5 different random subsets and report the averages across the five trials.
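A minimal sketch of the repeated-subset protocol is given below. It assumes a caller-supplied `run_trial` function that finetunes and evaluates on one subset and returns a single score; that function, the seed values, and the name `average_over_subsets` are placeholders rather than part of our code.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence


def average_over_subsets(data: Sequence[str], subset_size: int,
                         run_trial: Callable[[List[str]], float],
                         seeds: Sequence[int] = (0, 1, 2, 3, 4)) -> float:
    """Average a score over random subsets of a fixed size, one per seed."""
    scores = []
    for seed in seeds:
        rng = random.Random(seed)  # a separate, reproducible generator per trial
        subset = rng.sample(list(data), subset_size)
        scores.append(run_trial(subset))
    return mean(scores)


# Toy trial that scores a subset by its size, for illustration only.
toy_data = [f"post_{i}" for i in range(10)]
print(average_over_subsets(toy_data, 3, lambda subset: float(len(subset))))
```

Fixing `subset_size` across settings is what keeps the number of training steps constant.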

Appendix D Top-K Similarity

Let $P=[p_1,p_2,p_3,\ldots,p_n]$, where $n\geq k$, be a list of predictions ordered according to descending likelihood, and let $g$ be the gold (where $p_i$ and $g$ are strings). Additionally, let $E(x)$ be a function on a term $x$ that incorporates 100 tokens of context, tokenizes and embeds the sequence with a pre-trained BERT tokenizer, and returns the contextual embedding corresponding to the first sub-word token in $x$. Then, we report top-k similarity specifically as

$$\text{sim}_k(P,g)=\max_{i\leq k}\big[\text{cosine\_sim}(E(p_i),E(g))\big]$$
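The metric can be sketched in a few lines of Python. In this illustration the embedding function is a toy lookup table with made-up two-dimensional vectors; it stands in for the contextual BERT embedding $E(x)$ and is purely an assumption for demonstration.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine_sim(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similarity(predictions: List[str], gold: str,
                     embed: Callable[[str], Vector], k: int) -> float:
    """Maximum cosine similarity between the gold and the first k predictions."""
    return max(cosine_sim(embed(p), embed(gold)) for p in predictions[:k])


# Toy embeddings, invented for illustration; not BERT outputs.
toy_vectors: Dict[str, Vector] = {
    "happy": (1.0, 0.0),
    "glad": (0.8, 0.6),
    "sad": (-1.0, 0.0),
}
ranked = ["sad", "glad", "happy"]
for k in (1, 2, 3):
    print(k, round(top_k_similarity(ranked, "happy", toy_vectors.__getitem__, k), 3))
```

Because the maximum is taken over a growing prefix of the ranked list, the score can only stay the same or increase as k grows.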

Appendix E Extended Results

E.1 Limited Evaluation for Llama-3

For some inputs, Llama-3 would decline to make a prediction, particularly for inputs that discuss topics such as depression or drug use. While these are important topics for models to be able to accurately analyze as they are increasingly applied in mental health contexts, Llama-3’s behavior may unfairly skew its evaluation results. Table 10 presents updated results for Llama-3 on the subset of texts for which the model’s response followed the correct format: 87% of English, 96% of Spanish, and 98% of regional Spanish responses by Llama-3 were formatted correctly. English scores improve the most, given the higher percentage of invalid responses, but scores improve only by up to 0.6% top-k accuracy and 0.003 top-k similarity. These results do not alter any of our conclusions.

Lang Acc@1↑ Acc@3↑ Acc@5↑ Sim@1↑ Sim@3↑ Sim@5↑
En 2.5% 3.9% 4.6% 0.432 0.471 0.490
Es 2.3% 3.2% 3.7% 0.436 0.470 0.484
Es (Reg) 0.0% 0.1% 0.2% 0.367 0.393 0.403
Table 10: Evaluation results of Llama-3 on each MASIVE dataset only considering samples with correctly formatted responses of the form "prediction_1, prediction_2, etc…"

E.2 Full Fixed-Label Set Results

Extended results from the fixed-label evaluation are given in Table 11. Notably, we include results using T5 in English, where T5 represents a model finetuned only on the target dataset and $\text{T5}^{MAS}$ represents a model finetuned on MASIVE and then finetuned on the target dataset. Precision, recall, and F1 are calculated by ranking the adjective forms of each emotion class (Appendix B) according to model likelihood and taking the most likely one as the predicted class, while top-k accuracy and similarity are calculated in a generative setting as in the remainder of the paper. T5 scores consistently well on F1; pretraining on MASIVE does not usually improve T5’s F1 on GoEmotions or EmoEvent (En). Because MASIVE pretraining does improve performance on EmoEvent (Es), it is possible that English T5 is already a very strong baseline and potentially near the performance ceiling of generative models.

Dataset Model P R F1 Acc@1 Acc@3 Acc@5 Sim@1 Sim@3 Sim@5
GoEmotions (27) T5 24.23 7.36 5.67 2.0% 3.3% 3.9% 0.197 0.461 0.492
GoEmotions (27) T5^MAS 41.45 6.95 4.01 5.4% 9.3% 11.5% 0.504 0.563 0.590
GoEmotions (27) mT5 22.45 2.53 0.97 2.5% 12.9% 23.5% 0.525 0.614 0.670
GoEmotions (27) mT5^MAS 35.91 7.06 5.28 5.0% 9.6% 13.6% 0.512 0.579 0.609
GoEmotions (7) T5 42.45 39.59 25.59 38.5% 47.9% 55.7% 0.734 0.775 0.804
GoEmotions (7) T5^MAS 57.44 35.59 30.77 28.7% 43.0% 48.9% 0.669 0.758 0.784
GoEmotions (7) mT5 28.28 38.63 23.53 38.5% 70.7% 86.0% 0.736 0.884 0.946
GoEmotions (7) mT5^MAS 73.42 28.74 27.79 23.8% 37.2% 45.1% 0.663 0.751 0.783
EmoEvent (En) T5 42.55 56.06 44.03 12.5% 20.5% 23.9% 0.515 0.591 0.623
EmoEvent (En) T5^MAS 64.10 36.24 39.82 25.5% 54.5% 67.9% 0.643 0.826 0.879
EmoEvent (En) mT5 6.01 10.48 2.21 10.5% 72.2% 93.3% 0.638 0.884 0.972
EmoEvent (En) mT5^MAS 68.76 35.73 36.90 32.8% 60.5% 72.9% 0.719 0.858 0.903
EmoEvent (Es) mT5 19.53 23.63 11.04 23.5% 64.1% 88.2% 0.721 0.869 0.953
EmoEvent (Es) mT5^MAS 31.90 31.83 26.83 32.2% 79.0% 85.6% 0.730 0.906 0.938

Table 11: Fixed-label evaluation of our models on prior emotion classification datasets. The best performance under each metric for each dataset is bolded.
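The fixed-label prediction rule used for precision, recall, and F1 can be sketched as below. The sketch assumes a scoring function that returns the model's log-likelihood of a candidate term given the input text; the toy scorer, the class names, and the function name `predict_class` are illustrative assumptions, while the adjective forms follow Appendix B.

```python
from typing import Callable, Dict


def predict_class(text: str, label_terms: Dict[str, str],
                  log_likelihood: Callable[[str, str], float]) -> str:
    """Return the class whose adjective form the model scores highest."""
    return max(label_terms,
               key=lambda label: log_likelihood(text, label_terms[label]))


# Hypothetical class names mapped to adjective forms from Appendix B.
label_terms = {
    "joy": "happy",
    "sadness": "sad",
    "anger": "angry",
    "neutral": "nothing",
}

# Toy scores standing in for model log-likelihoods (invented values).
toy_scores = {"happy": -3.1, "sad": -0.9, "angry": -2.4, "nothing": -4.0}


def toy_log_likelihood(text: str, term: str) -> float:
    return toy_scores[term]


print(predict_class("I failed my exam and I feel <MASK>.", label_terms, toy_log_likelihood))
```

Restricting the candidate set in this way is also how the label set can be limited in sensitive applications.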