Keyphrase Identification Using Minimal Labeled Data with Hierarchical Contexts and Transfer Learning

Rohan Goli; Keerthana Komatineni; Shailesh Alluri; Nina Hubig; Hua Min; Yang Gong; Dean F Sittig; Lior Rennert; David Robinson; Paul Biondich; Adam Wright; Christian Nøhr; Timothy Law; Arild Faxvaag; Aneesa Weaver; Ronald Gimbel; Xia Jing

doi:10.1101/2023.01.26.23285060

medRxiv [Preprint]. 2024 Nov 18:2023.01.26.23285060. doi: 10.1101/2023.01.26.23285060.

Authors

Rohan Goli¹, Keerthana Komatineni¹, Shailesh Alluri¹, Nina Hubig¹, Hua Min², Yang Gong³, Dean F Sittig³, Lior Rennert⁴, David Robinson⁵, Paul Biondich⁶, Adam Wright⁷, Christian Nøhr⁸, Timothy Law⁹, Arild Faxvaag¹⁰, Aneesa Weaver⁴, Ronald Gimbel⁴, Xia Jing⁴

Affiliations

¹ School of Computing, College of Engineering, Computing and Applied Science, Clemson University, Clemson, SC, USA.
² Department of Health Administration and Policy, College of Public Health, George Mason University, Fairfax, VA, USA.
³ School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.
⁴ Department of Public Health Sciences, College of Behavioral, Social, and Health Sciences, Clemson University, Clemson, SC, USA.
⁵ General Practitioner/Independent Consultant, Cumbria, UK.
⁶ Clem McDonald Biomedical Informatics Center, Regenstrief Institute, Department of Pediatrics, Indiana University School of Medicine, Indianapolis, IN, USA.
⁷ Vanderbilt University Medical Center, Nashville, TN, USA.
⁸ Department of Planning, Faculty of Engineering, Aalborg University, Aalborg, Denmark.
⁹ Ohio Musculoskeletal and Neurologic Institute, Ohio University, Athens, OH, USA.
¹⁰ Department of Neuromedicine and Movement Science, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, Trondheim, Norway.

Abstract

Background: Interoperable clinical decision support system (CDSS) rules provide a pathway to interoperability, a well-recognized challenge in health information technology. Building an ontology facilitates the creation of interoperable CDSS rules, and identifying keyphrases (KPs) in the existing literature is one way to build it. Ontology construction has traditionally been a manual effort by human domain experts; recent natural language processing techniques, such as KP identification, can serve as a critical, automated complement to that effort. However, KP identification itself requires human expertise, consensus, and contextual understanding for data labeling.

Methods: This paper presents a semi-supervised KP identification framework that uses minimal human-labeled data. The framework combines long short-term memory-based encoders with a conditional random field-based decoder (BiLSTM-CRF) and relies on hierarchical attention over the documents (i.e., at the word, sentence, and abstract levels) and on domain adaptation. We created synthetic labels for initial training and used human-labeled data for fine-tuning. We also tested different options during NLP preprocessing and machine learning (ML) training to optimize the ML pipeline.

Results: Our method outperforms prior neural architectures by combining initial training on synthetic labels, document-level contextual learning, language modeling, and fine-tuning with limited gold-standard labeled data. In our comparisons, the BIO encoding schema performed slightly better than BILOU, and domain adaptation techniques can improve the quality of synthetic labels. In addition, document-level context, a pre-trained language model (LM), and pre-trained word embeddings (WE) all contributed to better model performance in our tasks. Adding 2 to 4 human-labeled documents for every 100 synthetically labeled documents improves model performance without exhausting the human-labeled documents too quickly.
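
For readers unfamiliar with sequence-tagging schemas, the following is a minimal sketch of how the BIO schema named in the Results labels keyphrase tokens, shown next to the BILOU schema it is commonly compared with. The example sentence, the span format, and the function names `encode_bio` and `encode_bilou` are illustrative assumptions and are not taken from the paper's code.

```python
from typing import List, Tuple

# Assumed span format: (start, end) token indices, end exclusive.
Span = Tuple[int, int]


def encode_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """BIO: B = beginning of a keyphrase, I = inside, O = outside."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def encode_bilou(n_tokens: int, spans: List[Span]) -> List[str]:
    """BILOU: adds L = last token and U = single-token (unit) keyphrase."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if end - start == 1:
            tags[start] = "U"
        else:
            tags[start] = "B"
            for i in range(start + 1, end - 1):
                tags[i] = "I"
            tags[end - 1] = "L"
    return tags


# Hypothetical sentence with two keyphrases (illustration only).
tokens = ["Clinical", "decision", "support", "improves", "interoperability"]
keyphrase_spans = [(0, 3), (4, 5)]

bio_tags = encode_bio(len(tokens), keyphrase_spans)
bilou_tags = encode_bilou(len(tokens), keyphrase_spans)

for token, bio, bilou in zip(tokens, bio_tags, bilou_tags):
    print(f"{token:18s} {bio:3s} {bilou}")
```

BIO uses three tag classes and BILOU uses five, so with very little labeled data each BILOU class sees fewer training examples. This is one plausible reason, though not one stated in the abstract, for BIO performing slightly better in a low-resource setting.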

Conclusions: To the best of our knowledge, this is the first functional framework for KP identification in the CDSS sub-domain that is trained on limited human-labeled data. It contributes to general natural language processing (NLP) architectures in areas such as clinical NLP, where manual data labeling is challenging and lightweight deep learning models play an important role in real-time KP identification as a complement to human experts' efforts.

Keywords: Clinical Decision Support System; Domain adaptation; Hierarchical context; Minimal labeled data; Natural language processing; Semi-supervised learning.

Publication types

Preprint
