An Evaluation of Pretrained BERT Models for Comparing Semantic Similarity Across Unstructured Clinical Trial Texts

Stud Health Technol Inform. 2022 Jan 14;289:18-21. doi: 10.3233/SHTI210848.

Abstract

Processing unstructured clinical texts is often necessary to support certain tasks in biomedicine, such as matching patients to clinical trials. Among other methods, domain-specific language models have been built to utilize free-text information. This study evaluated the performance of Bidirectional Encoder Representations from Transformers (BERT) models in assessing the similarity between clinical trial texts. We compared an unstructured aggregated summary of clinical trials reviewed at the Johns Hopkins Molecular Tumor Board with the corresponding ClinicalTrials.gov records, focusing on the titles and eligibility criteria. Seven pretrained BERT-based models were used in our analysis. Of the six biomedical-domain-specific models, only SciBERT outperformed the original BERT model by accurately assigning higher similarity scores to matched than to mismatched trials. This finding is promising and suggests that BERT and, likely, other language models may support patient-trial matching.
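
As a sketch of the similarity computation described above: the snippet below embeds two trial texts with a pretrained BERT-family model (SciBERT here, loaded via Hugging Face transformers) and scores them with cosine similarity over mean-pooled token embeddings. The pooling strategy, similarity metric, and checkpoint name are illustrative assumptions; the abstract does not specify the paper's exact configuration.

```python
# Minimal sketch: score the similarity of two clinical trial texts with a
# pretrained BERT-family model. Mean pooling and cosine similarity are
# illustrative assumptions, not necessarily the paper's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "allenai/scibert_scivocab_uncased"  # SciBERT, one of the models evaluated
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single text vector."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, hidden_dim)
    mask = inputs["attention_mask"].unsqueeze(-1)       # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two embedded texts (higher = more similar)."""
    return torch.nn.functional.cosine_similarity(embed(text_a), embed(text_b)).item()

# Hypothetical example: a tumor-board trial summary vs. a ClinicalTrials.gov
# title/eligibility text; a matched pair should score higher than a mismatched one.
summary = "Phase II trial of targeted therapy for EGFR-mutant non-small cell lung cancer."
registry = "A Study of an EGFR Inhibitor in Adults With Advanced NSCLC Harboring EGFR Mutations."
print(f"similarity = {similarity(summary, registry):.3f}")
```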

Keywords: Clinical trial; bidirectional encoder representations; word embeddings.

MeSH terms

  • Clinical Trials as Topic
  • Humans
  • Language
  • Natural Language Processing*
  • Semantics*