Background: Contemporary pulmonary embolism (PE) research often relies on data from electronic health records (EHRs) and administrative databases that use International Classification of Diseases (ICD) codes. Natural language processing (NLP) tools can be used for automated chart review and patient identification. However, the validity of ICD-10 codes and NLP algorithms for patient identification remains uncertain.
Methods: The PE-EHR+ study is designed to validate ICD-10 codes, recorded as the Principal Discharge Diagnosis or as Secondary Discharge Diagnoses, and NLP tools described in prior studies for identifying patients with PE within EHRs. Manual chart review by two independent abstractors, using predefined criteria, will serve as the reference standard. Sensitivity, specificity, and positive and negative predictive values will be determined. We will assess the discriminatory function of code subgroups for intermediate- and high-risk PE. In addition, the accuracy of NLP algorithms for identifying PE from radiology reports will be assessed.
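As an illustration of the accuracy measures named above, the following minimal Python sketch tabulates a 2x2 comparison of ICD-10 code status against the chart-review reference standard and derives sensitivity, specificity, and predictive values. It is not the study's analysis code; the names `ConfusionCounts`, `tally`, and `accuracy_metrics` are hypothetical, and the sketch assumes each patient has one binary code flag and one binary adjudicated chart-review result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """2x2 counts of code status versus chart review (hypothetical helper)."""
    tp: int  # code positive, chart review confirms PE
    fp: int  # code positive, chart review does not confirm PE
    fn: int  # code negative, chart review confirms PE
    tn: int  # code negative, chart review does not confirm PE


def tally(code_flags, chart_flags):
    """Count agreement between ICD-10 code flags and chart-review results."""
    if len(code_flags) != len(chart_flags):
        raise ValueError("inputs must have one entry per patient")
    tp = fp = fn = tn = 0
    for coded, confirmed in zip(code_flags, chart_flags):
        if coded and confirmed:
            tp += 1
        elif coded and not confirmed:
            fp += 1
        elif not coded and confirmed:
            fn += 1
        else:
            tn += 1
    return ConfusionCounts(tp, fp, fn, tn)


def accuracy_metrics(c):
    """Return the four standard diagnostic accuracy measures as a dict."""
    def ratio(num, den):
        # An empty denominator leaves the measure undefined.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(c.tp, c.tp + c.fn),
        "specificity": ratio(c.tn, c.tn + c.fp),
        "ppv": ratio(c.tp, c.tp + c.fp),
        "npv": ratio(c.tn, c.tn + c.fn),
    }


# Toy data for illustration only; these are not study results.
toy_codes = [True, True, False, False, True]
toy_chart = [True, False, False, True, True]
toy_counts = tally(toy_codes, toy_chart)
toy_metrics = accuracy_metrics(toy_counts)
```

The toy inputs are invented solely to exercise the functions. A real analysis would also report confidence intervals and handle disagreements between abstractors, neither of which this sketch attempts.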
Results: A total of 1,734 patients from the Mass General Brigham health system have been identified. These include 578 with ICD-10 Principal Discharge Diagnosis codes for PE, 578 with PE codes in a secondary position, and 578 without PE codes during the index hospitalization. Patients within each group were selected randomly from the entire pool of patients at the Mass General Brigham health system. A smaller subset of patients will also be identified from the Yale-New Haven Health System. Data validation and analyses are forthcoming.
Conclusions: The PE-EHR+ study will help validate efficient tools for identifying patients with PE in EHRs, improving the reliability of observational studies and randomized trials of patients with PE that use electronic databases.
Thieme. All rights reserved.