Overview of the 8th Social Media Mining for Health Applications (#SMM4H) shared tasks at the AMIA 2023 Annual Symposium

Ari Z Klein; Juan M Banda; Yuting Guo; Ana Lucia Schmidt; Dongfang Xu; Ivan Flores Amaro; Raul Rodriguez-Esteban; Abeed Sarker; Graciela Gonzalez-Hernandez

doi:10.1093/jamia/ocae010

J Am Med Inform Assoc. 2024 Apr 3;31(4):991-996. doi: 10.1093/jamia/ocae010.

Authors

Ari Z Klein¹, Juan M Banda², Yuting Guo³, Ana Lucia Schmidt⁴, Dongfang Xu⁵, Ivan Flores Amaro⁵, Raul Rodriguez-Esteban⁴, Abeed Sarker³, Graciela Gonzalez-Hernandez⁵

Affiliations

¹ Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States.
² Department of Computer Science, Georgia State University, Atlanta, GA 30302, United States.
³ Department of Biomedical Informatics, Emory University, Atlanta, GA 30322, United States.
⁴ Roche Innovation Center, 4070 Basel, Switzerland.
⁵ Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA 90048, United States.

PMID: 38218723
PMCID: PMC10990511 (available on 2025-01-13)
DOI: 10.1093/jamia/ocae010

Abstract

Objective: The aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing social media data for health informatics. In this paper, we present the annotated corpora, a technical summary of participants' systems, and the performance results.

Methods: The eighth iteration of the #SMM4H shared tasks was hosted at the AMIA 2023 Annual Symposium and consisted of 5 tasks that represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary classification, multi-class classification, extraction, and normalization), and topics (COVID-19, therapies, social anxiety disorder, and adverse drug events).

Results: In total, 29 teams registered, representing 17 countries. In general, the top-performing systems used deep neural network architectures based on pre-trained transformer models. In particular, the top-performing systems for the classification tasks were based on single models that were pre-trained on social media corpora.

Conclusion: To facilitate future work, the datasets (a total of 61 353 posts) will remain available by request, and the CodaLab sites will remain active for a post-evaluation phase.

Keywords: data mining; machine learning; natural language processing; social media.

Publication types

Congress
Research Support, N.I.H., Extramural

MeSH terms

Data Mining / methods
Humans
Machine Learning
Natural Language Processing
Neural Networks, Computer
Social Media*
