Objective: Large database research in axial spondyloarthritis (SpA) is limited by a lack of methods for identifying most types of axial SpA. Our objective was to develop methods for identifying axial SpA concepts in the free text of documents from electronic medical records.
Methods: Veterans with documents in the national Veterans Health Administration Corporate Data Warehouse between January 1, 2005 and June 30, 2015 were included. Methods were developed for exploring, selecting, and extracting meaningful terms that were likely to represent axial SpA concepts. Clinical experts annotated sections of text containing the meaningful terms (snippets), classifying each snippet according to whether it represented the intended axial SpA concept. Natural language processing (NLP) models were then trained to replicate the clinical experts' snippet classifications.
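As an illustration of the snippet step described in the Methods, the following is a minimal sketch of how text surrounding a meaningful term might be pulled from a note. The term patterns, the character window, the function name `extract_snippets`, and the sample note are all assumptions made for this example; they are not the term lists or tools used in the study.

```python
import re

# Hypothetical term patterns, one per concept named in this abstract.
# The study's actual term lists are not reproduced here.
CONCEPT_PATTERNS = {
    "sacroiliitis": re.compile(r"\bsacroiliitis\b", re.IGNORECASE),
    "spond*": re.compile(r"\bspond\w*", re.IGNORECASE),
    "HLA-B27": re.compile(r"\bHLA[\s-]?B[\s-]?27\b", re.IGNORECASE),
}


def extract_snippets(text, window=40):
    """Return one snippet per term match: the term plus up to `window`
    characters of context on each side (window size is an assumption)."""
    snippets = []
    for concept, pattern in CONCEPT_PATTERNS.items():
        for match in pattern.finditer(text):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            snippets.append(
                {
                    "concept": concept,
                    "term": match.group(0),
                    "snippet": text[start:end],
                }
            )
    return snippets


# Invented sample note, for illustration only.
note = (
    "MRI shows no evidence of sacroiliitis. HLA-B27 was positive. "
    "History of lumbar spondylosis."
)
for item in extract_snippets(note, window=25):
    print(item["concept"], "|", item["snippet"])
```

In this invented note the sacroiliitis mention is negated, which is the kind of case where a term match alone would presumably be insufficient and an expert (or a trained model) has to judge whether the snippet represents the intended concept.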
Results: Clinical experts selected three axial SpA concepts: sacroiliitis, terms containing the prefix spond*, and HLA-B27 positivity (HLA-B27+). With supervised machine learning on annotated snippets, NLP models were developed with accuracies of 91.1% for sacroiliitis, 93.5% for spond*, and 97.2% for HLA-B27+. With independent validation, the accuracies were 92.0% for sacroiliitis, 91.0% for spond*, and 99.0% for HLA-B27+.
Conclusion: We developed feasible and accurate methods for identifying axial SpA concepts in the free text of clinical notes. Additional research is required to determine which combinations of concepts will accurately identify axial SpA phenotypes. These novel methods will facilitate previously impractical observational research in axial SpA and may be applied to research on other diseases.
© 2016, American College of Rheumatology.