Taming Big Data: An Information Extraction Strategy for Large Clinical Text Corpora

Adi V Gundlapalli; Guy Divita; Marjorie E Carter; Andrew Redd; Matthew H Samore; Kalpana Gupta; Barbara Trautner

Stud Health Technol Inform. 2015;213:175-8.

Authors

Adi V Gundlapalli¹, Guy Divita¹, Marjorie E Carter¹, Andrew Redd¹, Matthew H Samore¹, Kalpana Gupta², Barbara Trautner³

Affiliations

¹ VA Salt Lake City Health Care System and University of Utah, Salt Lake City, UT.
² VA Boston Health Care System and Boston University, Boston, MA.
³ VA Houston Health Care System and Baylor College of Medicine, Houston, TX.

PMID: 26152985

Abstract

Concepts of interest for clinical and research purposes are not uniformly distributed across the clinical text available in electronic medical records. The purpose of our study was to identify filtering techniques that select 'high yield' documents for increased efficiency and throughput. Using two large corpora of clinical text, we demonstrate the identification of 'high yield' document sets in two unrelated domains: homelessness and indwelling urinary catheters. For homelessness, the high yield set includes homeless program and social work notes. For urinary catheters, concepts were more prevalent in notes from hospitalized patients; nursing notes accounted for a majority of the high yield set. This filtering will enable customization and refinement of information extraction pipelines to facilitate extraction of relevant concepts for clinical decision support and other uses.
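
The abstract does not say how the 'high yield' sets were computed, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes each document carries a note type label, uses plain keyword matching as a stand-in for a real concept extraction pipeline, and keeps the note types whose share of concept-bearing documents meets a chosen threshold. The function names, the threshold, and the toy notes are all hypothetical.

```python
from collections import defaultdict


def yield_by_note_type(documents, concept_terms):
    """Return {note_type: fraction of documents mentioning any concept term}.

    `documents` is an iterable of (note_type, text) pairs. Keyword matching
    is an assumption made for illustration; it stands in for whatever
    concept extraction the real pipeline performs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    terms = [term.lower() for term in concept_terms]
    for note_type, text in documents:
        totals[note_type] += 1
        lowered = text.lower()
        if any(term in lowered for term in terms):
            hits[note_type] += 1
    return {note_type: hits[note_type] / totals[note_type] for note_type in totals}


def high_yield_types(yields, threshold):
    """Return, in sorted order, the note types whose yield meets the threshold."""
    return sorted(nt for nt, share in yields.items() if share >= threshold)


# Invented toy notes, not data from the study.
documents = [
    ("nursing", "Foley catheter in place, draining clear urine."),
    ("nursing", "Indwelling urinary catheter removed this morning."),
    ("nursing", "Patient ambulating without assistance."),
    ("radiology", "Chest x-ray shows no acute process."),
    ("radiology", "No fracture identified."),
]
terms = ["foley", "urinary catheter"]  # hypothetical concept terms

yields = yield_by_note_type(documents, terms)
selected = high_yield_types(yields, threshold=0.5)  # threshold is an assumption
```

In a setup like this, only documents whose note type appears in `selected` would be passed on to the heavier extraction pipeline, which is where the throughput gain would come from.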

MeSH terms

Biomedical Research / methods*
Electronic Health Records / organization & administration*
Humans
Ill-Housed Persons
Information Storage and Retrieval / methods*
Natural Language Processing
Urinary Catheters