Analysis of longitudinal social media for monitoring symptoms during a pandemic

Shixu Lin; Lucas Garay; Yining Hua; Zhijiang Guo; Wanxin Li; Minghui Li; Yujie Zhang; Xiaolin Xu; Jie Yang

doi:10.1016/j.jbi.2025.104778

J Biomed Inform. 2025 Jan 18:104778. doi: 10.1016/j.jbi.2025.104778. Online ahead of print.

Authors

Shixu Lin¹, Lucas Garay¹, Yining Hua², Zhijiang Guo³, Wanxin Li¹, Minghui Li¹, Yujie Zhang¹, Xiaolin Xu¹, Jie Yang⁴

Affiliations

¹ School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058, China.
² Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA; Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA 02115, USA.
³ Department of Computer Science and Technology, University of Cambridge, Cambridge, UK.
⁴ School of Public Health, Zhejiang University School of Medicine, Hangzhou 310058, China; Department of Medicine, Harvard Medical School, Boston, MA 02115, USA.

PMID: 39832606
DOI: 10.1016/j.jbi.2025.104778

Abstract

Objective: Current studies that use social media data for disease monitoring face challenges such as noisy colloquial language and insufficient tracking of users' disease progression in longitudinal data. This study aims to develop a pipeline for collecting, cleaning, and analyzing large-scale longitudinal social media data for disease monitoring, with a focus on the COVID-19 pandemic.

Materials and methods: The pipeline begins by screening COVID-19 cases from tweets posted between February 1, 2020, and April 30, 2022. Longitudinal data are then collected for each patient, covering two months before and three months after the self-report. Symptoms are extracted using Named Entity Recognition (NER) and then denoised with a model that combines a Graph Convolutional Network (GCN) and Bidirectional Encoder Representations from Transformers (BERT), so that only User Symptom Mentions (USM) are retained. Subsequently, symptoms are mapped to standardized medical concepts using the Unified Medical Language System (UMLS). Finally, symptom pattern analysis and visualization illustrate temporal changes in symptom prevalence and co-occurrence.
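The per-patient collection window described above (two months before and three months after self-reporting) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of calendar months with end-of-month clamping, and the `(date, text)` tweet representation are all assumptions made for the example.

```python
import calendar
from datetime import date


def shift_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months (assumed convention),
    clamping the day to the last day of the target month."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def observation_window(report_date: date, months_before: int = 2,
                       months_after: int = 3) -> tuple[date, date]:
    """Return the (start, end) dates of the window around a self-report."""
    return (shift_months(report_date, -months_before),
            shift_months(report_date, months_after))


def tweets_in_window(tweets, report_date: date):
    """Keep only tweets dated inside the window (bounds inclusive).

    `tweets` is a hypothetical list of (date, text) pairs for one user.
    """
    start, end = observation_window(report_date)
    return [t for t in tweets if start <= t[0] <= end]
```

Whether the study measured the window in calendar months or in a fixed number of days is not stated in the abstract; the calendar-month convention here is only one plausible reading.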

Results: This study identified 191,096 self-reported COVID-19-positive cases from COVID-19-related tweets and retrospectively collected 811,398,280 historical tweets, of which 2,120,964 contained symptom information. After denoising, 39% (832,287) of these symptom-containing tweets were retained as user-related mentions. The trained USM model achieved an average F1 score of 0.927. Further analysis revealed a higher prevalence of upper respiratory tract symptoms during the Omicron period than during the Delta and wild-type periods. Additionally, there was a pronounced co-occurrence of lower respiratory tract and nervous system symptoms during the wild-type and Delta periods.
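The kind of per-period prevalence and co-occurrence comparison reported above can be illustrated with a short sketch. It assumes a hypothetical input of one record per user, each a `(period, symptoms)` pair; the record layout, function names, and the definition of prevalence as the share of users mentioning a symptom are assumptions for illustration and are not taken from the study.

```python
from collections import Counter, defaultdict
from itertools import combinations


def prevalence_by_period(records):
    """Share of users in each period who mention each symptom.

    `records` is an iterable of (period, symptoms) pairs, one per user.
    """
    users = Counter()
    mentions = defaultdict(Counter)
    for period, symptoms in records:
        users[period] += 1
        mentions[period].update(set(symptoms))  # count each symptom once per user
    return {p: {s: n / users[p] for s, n in mentions[p].items()} for p in users}


def cooccurrence_by_period(records):
    """Count users in each period who mention both symptoms of a pair."""
    pairs = defaultdict(Counter)
    for period, symptoms in records:
        # Sort so that each unordered pair has a single canonical key.
        for a, b in combinations(sorted(set(symptoms)), 2):
            pairs[period][(a, b)] += 1
    return pairs
```

Raw pair counts depend on how common each symptom is, so a normalized measure (for example, dividing by the number of users in the period) would likely be needed before comparing periods of different size.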

Conclusion: This study established a robust framework for analyzing longitudinal social media data to monitor symptoms during a pandemic. By denoising the data to retain only user-experienced symptom mentions, our findings reveal the duration of different symptoms over time and by variant within a cohort of nearly 200,000 patients, providing critical insights into symptom trends that are often difficult to capture through traditional data sources.

Keywords: Deep learning; Natural language processing; Public health; COVID-19; Social media; Symptom surveillance.