An interpretable natural language processing system for written medical examination assessment

J Biomed Inform. 2019 Oct;98:103268. doi: 10.1016/j.jbi.2019.103268. Epub 2019 Aug 14.

Abstract

Objective: The assessment of written medical examinations is a tedious and expensive process that requires significant time from medical experts. Our objective was to develop a natural language processing (NLP) system that can expedite the assessment of unstructured answers in medical examinations by automatically identifying relevant concepts in examinee responses.

Materials and methods: Our NLP system, the Intelligent Clinical Text Evaluator (INCITE), is semi-supervised. Learning from a limited set of fully annotated examples, it sequentially applies a series of customized text comparison and similarity functions to determine whether a text span represents an entry in a given reference standard. Combinations of fuzzy matching and set intersection-based methods capture both inexact matches and fragmented concepts. Customizable, dynamic similarity-based matching thresholds allow the system to be tailored to examinee responses of different lengths.
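As an illustration of the matching strategy described above, the following minimal Python sketch combines a character-level fuzzy similarity with a token set intersection score and gates the match decision with a threshold that relaxes for longer spans. The function names, scoring choices, and threshold constants here are illustrative assumptions, not INCITE's published implementation.

    # Sketch: fuzzy matching + token-set overlap with a dynamic,
    # length-dependent threshold. All constants are illustrative.
    from difflib import SequenceMatcher

    def fuzzy_similarity(a: str, b: str) -> float:
        # Character-level similarity ratio in [0, 1].
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def token_set_overlap(span: str, reference: str) -> float:
        # Fraction of reference tokens present in the span; tolerant of
        # fragmented or reordered concept mentions.
        ref = set(reference.lower().split())
        cand = set(span.lower().split())
        return len(ref & cand) / len(ref) if ref else 0.0

    def dynamic_threshold(span: str, base: float = 0.85, floor: float = 0.70) -> float:
        # Longer spans are less likely to agree character-for-character,
        # so the required similarity is relaxed as the span grows.
        n_tokens = len(span.split())
        return max(floor, base - 0.02 * max(0, n_tokens - 3))

    def matches_reference(span: str, reference: str) -> bool:
        score = max(fuzzy_similarity(span, reference),
                    token_set_overlap(span, reference))
        return score >= dynamic_threshold(span)

    # A misspelled examinee phrase still matches the reference concept.
    print(matches_reference("shortness of breth", "shortness of breath"))  # True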

Results: INCITE achieved an average F1-score of 0.89 (precision = 0.87, recall = 0.91) against human annotations over held-out evaluation data. Fuzzy text matching, dynamic thresholding, and the incorporation of supervision from annotated data yielded the largest performance gains.
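For reference, the F1-score is the harmonic mean of precision and recall, and the reported figures are internally consistent:

    F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.87 \times 0.91}{0.87 + 0.91} \approx 0.89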

Discussion: Long and non-standard expressions are difficult for INCITE to detect, but dynamic thresholding (i.e., varying the similarity threshold a text span must meet to be considered a match) mitigates the problem. Annotation variations within exams and disagreements between annotators were the primary causes of false positives. Small amounts of annotated data can significantly improve system performance.

Conclusions: The high performance and interpretability of INCITE are likely to substantially aid the assessment process and to mitigate the impact of inconsistencies in manual assessment.

Keywords: Automated assessment; Clinical notes; Natural language processing; Text mining.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Clinical Competence / standards
  • Data Collection
  • Data Curation / methods
  • Education, Medical / methods*
  • Education, Medical / standards*
  • Educational Measurement / methods*
  • Fuzzy Logic
  • Humans
  • Licensure, Medical / standards*
  • Medical Records
  • Natural Language Processing*
  • Pattern Recognition, Automated
  • Reproducibility of Results
  • Schools, Medical*
  • Software
  • Unified Medical Language System