Purpose: Several novel therapies for castration-resistant prostate cancer (CRPC) have been approved on the basis of randomized phase III studies, and further observational research is either planned or ongoing. Accurately identifying patients with CRPC in electronic health care data is critical for quality observational research, resource allocation, and quality improvement. Previous work in this area has relied on either structured laboratory results and medication data or natural language processing (NLP) methods. However, a computable phenotype that combines structured data with NLP identifies these patients more accurately.
Methods: The Corporate Data Warehouse (CDW) of the Veterans Health Administration (VHA) was used to collect prostate cancer (PCa) diagnoses, prostate-specific antigen test results, and information on patient characteristics and medication use. The final system used for validation and subsequent analysis combined the NLP system with an algorithm based on structured laboratory and medication data to identify patients with CRPC. Patients with both a documented diagnosis of CRPC and a documented diagnosis of metastatic PCa were classified by this system as having metastatic CRPC (mCRPC).
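The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the field names are hypothetical, and the rule that either source alone is enough to flag CRPC (a logical OR) is an assumption, since the text says only that the two sources were combined.

```python
from dataclasses import dataclass


@dataclass
class PatientFlags:
    # All three field names are hypothetical, chosen for illustration only.
    nlp_crpc: bool         # NLP system found CRPC documented in clinical notes
    structured_crpc: bool  # laboratory/medication algorithm flagged CRPC
    metastatic_pca: bool   # documented diagnosis of metastatic PCa


def classify(patient: PatientFlags) -> str:
    """Return 'mCRPC', 'CRPC', or 'not CRPC' for one patient."""
    # Assumption: a flag from either source counts as CRPC (logical OR).
    has_crpc = patient.nlp_crpc or patient.structured_crpc
    # From the text: CRPC plus documented metastatic PCa is classified as mCRPC.
    if has_crpc and patient.metastatic_pca:
        return "mCRPC"
    if has_crpc:
        return "CRPC"
    return "not CRPC"


if __name__ == "__main__":
    print(classify(PatientFlags(nlp_crpc=True, structured_crpc=False, metastatic_pca=True)))
```

If the two sources were instead required to agree, the `or` would become `and`; the mCRPC step would be unchanged.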
Results: Among 1.2 million veterans with PCa, the International Classification of Diseases (ICD)-10 diagnosis code for CRPC (Z19.2) identified 3,791 patients from 2016, when the code was created, until 2022. Over the same period, the combined algorithm identified 14,103 patients, 10,312 more than ICD-10 codes alone. The combined algorithm showed a sensitivity of 97.9% and a specificity of 99.2%.
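For readers who want to check the arithmetic, the short Python sketch below recomputes the patient-count comparison from the two totals reported above and shows the standard definitions of sensitivity and specificity. The confusion-matrix counts in the usage lines are invented for illustration and are not the study's validation data.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of true CRPC cases that the algorithm flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of non-CRPC cases that the algorithm correctly leaves unflagged."""
    return true_neg / (true_neg + false_pos)


# Patient counts reported in the Results, 2016 to 2022.
icd10_patients = 3_791
combined_patients = 14_103

additional_patients = combined_patients - icd10_patients
ratio = combined_patients / icd10_patients

# Illustrative counts only (NOT from the study), to show how the metrics work.
example_sensitivity = sensitivity(true_pos=90, false_neg=10)
example_specificity = specificity(true_neg=95, false_pos=5)

if __name__ == "__main__":
    print(f"Additional patients identified: {additional_patients}")
    print(f"Combined / ICD-10 ratio: {ratio:.2f}")
    print(f"Example sensitivity: {example_sensitivity:.2f}")
    print(f"Example specificity: {example_specificity:.2f}")
```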
Conclusion: ICD-10 codes proved insufficient for capturing CRPC in the VHA CDW data. Using both structured and unstructured data identified more than three times as many patients as ICD-10 codes alone. This combined approach substantially improved the identification of real-world patients and enables high-quality observational research in mCRPC.