Radiology report annotation using intelligent word embeddings: Applied to multi-institutional chest CT cohort

Imon Banerjee; Matthew C Chen; Matthew P Lungren; Daniel L Rubin

doi:10.1016/j.jbi.2017.11.012

J Biomed Inform. 2018 Jan:77:11-20. doi: 10.1016/j.jbi.2017.11.012. Epub 2017 Nov 23.

Authors

Imon Banerjee¹, Matthew C Chen², Matthew P Lungren³, Daniel L Rubin⁴

Affiliations

¹ Department of Biomedical Data Science, Stanford University, Stanford, CA, United States.
² Department of Radiology, Stanford University, Stanford, CA, United States.
³ Department of Radiology, Stanford University, Stanford, CA, United States.
⁴ Department of Biomedical Data Science, Stanford University, Stanford, CA, United States; Department of Radiology, Stanford University, Stanford, CA, United States.

Abstract

We proposed an unsupervised hybrid method, Intelligent Word Embedding (IWE), that combines a neural embedding method with a semantic dictionary mapping technique to create dense vector representations of unstructured radiology reports. We applied IWE to generate embeddings of chest CT radiology reports from two healthcare organizations and used the vector representations to semi-automate report classification into clinically relevant categories related to the diagnosis of pulmonary embolism (PE). We benchmarked its performance against a state-of-the-art rule-based tool, PeFinder, and against out-of-the-box word2vec. On the Stanford test set, the IWE model achieved an average F1 score of 0.97, whereas PeFinder scored 0.9 and the original word2vec scored 0.94. On the UPMC dataset, the IWE model's average F1 score was 0.94, whereas PeFinder scored 0.92 and word2vec scored 0.85. The IWE model had the lowest generalization error and the highest F1 scores. Of particular interest, the IWE model (trained on the Stanford dataset) outperformed PeFinder on the UPMC dataset, which was originally used to tailor the PeFinder model.
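The hybrid idea described in the abstract, mapping report text through a semantic dictionary and then representing the report as a dense vector, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the dictionary entries, the function names `map_terms` and `report_vector`, the toy two-dimensional embedding table, and the choice of averaging word vectors to obtain a report vector are not taken from the abstract. In practice the embedding table would come from a neural embedding model such as word2vec trained on the dictionary-mapped corpus.

```python
import re

# Hypothetical semantic dictionary: surface phrases mapped to a controlled term.
# These entries are illustrative assumptions, not the dictionary used by the authors.
SEMANTIC_DICT = {
    "pulmonary embolism": "PE",
    "pulmonary emboli": "PE",
    "filling defect": "FILLING_DEFECT",
    "no evidence of": "NEGEX",
}


def map_terms(text, dictionary=SEMANTIC_DICT):
    """Lowercase the text, replace dictionary phrases with mapped terms, and tokenize."""
    text = text.lower()
    # Replace longer phrases first so that overlapping shorter phrases do not win.
    for phrase in sorted(dictionary, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, dictionary[phrase], text)
    # Keep alphabetic tokens (and underscores used in mapped terms).
    return re.findall(r"[A-Za-z_]+", text)


def report_vector(tokens, embeddings):
    """Average the vectors of all tokens found in the embedding table.

    Averaging is an assumed aggregation strategy for this sketch.
    Returns None when no token has a vector.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy embedding table standing in for vectors learned by a neural embedding model.
TOY_EMBEDDINGS = {"NEGEX": [1.0, 0.0], "PE": [0.0, 1.0]}

tokens = map_terms("No evidence of pulmonary embolism.")
vector = report_vector(tokens, TOY_EMBEDDINGS)
```

The resulting report vectors could then be passed to any standard classifier to assign the PE-related categories; the abstract does not specify which classifier was used.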

Keywords: Information extraction; Pulmonary embolism; Report annotation; Word embedding.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Humans
Machine Learning*
Natural Language Processing
Neural Networks, Computer
Predictive Value of Tests
Pulmonary Embolism
Radiographic Image Interpretation, Computer-Assisted*
Radiography, Thoracic / methods*
Radiography, Thoracic / trends
Semantics
Tomography, X-Ray Computed
