Rationale: This study is part of a larger, multi-method project to develop a questionnaire for identifying undiagnosed cases of chronic obstructive pulmonary disease (COPD) in primary care settings, with specific interest in the detection of patients with moderate to severe airway obstruction or risk of exacerbation.
Objectives: To examine 3 existing datasets for insight into key features of COPD that could be useful in the identification of undiagnosed COPD.
Methods: Random forests analyses were applied to the following databases: COPD Foundation Peak Flow Study Cohort (N=5761), Burden of Obstructive Lung Disease (BOLD) Kentucky site (N=508), and COPDGene® (N=10,214). Four scenarios were examined to find the best, smallest sets of variables that distinguished cases and controls: (1) moderate to severe COPD (forced expiratory volume in 1 second [FEV1] <50% predicted) versus no COPD; (2) undiagnosed versus diagnosed COPD; (3) COPD with and without exacerbation history; and (4) clinically significant COPD (FEV1 <60% predicted or history of acute exacerbation) versus all others.
Results: Between 4 and 8 variables were sufficient to differentiate cases from controls, with sensitivity ≥73% (range: 73-90%) and specificity ≥68% (range: 68-93%). Across scenarios, the best models included age, smoking status or history, symptoms (cough, wheeze, phlegm), general or breathing-related activity limitation, episodes of acute bronchitis, and/or missed work days and non-work activities due to breathing or health.
Conclusions: Results provide insight into variables that should be considered during the development of candidate items for a new questionnaire to identify undiagnosed cases of clinically significant COPD.
Keywords: COPD; case identification; chronic airways obstruction; data mining; primary care; random forests; screening.
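Illustrative note: the scenario 4 case definition in the Methods and the sensitivity and specificity figures in the Results can be made concrete with a minimal sketch. The function names, the record layout, and the toy values below are assumptions introduced for illustration only; they are not the study's code or data, and the random forests step itself is not reproduced here.

```python
from typing import Iterable, Tuple


def is_clinically_significant(fev1_pct_predicted: float, had_exacerbation: bool) -> bool:
    """Scenario 4 case definition as stated in the Methods:
    FEV1 <60% predicted or a history of acute exacerbation."""
    return fev1_pct_predicted < 60.0 or had_exacerbation


def sensitivity_specificity(
    actual: Iterable[bool], predicted: Iterable[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) as percentages.

    sensitivity = true positives / all actual cases
    specificity = true negatives / all actual controls
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    return 100.0 * tp / (tp + fn), 100.0 * tn / (tn + fp)


# Hypothetical toy records: (FEV1 % predicted, exacerbation history, model flag).
# These values are invented for illustration and are NOT study data.
records = [
    (45.0, False, True),
    (72.0, True, True),
    (55.0, False, False),
    (85.0, False, False),
    (90.0, False, True),
    (65.0, False, False),
]

actual = [is_clinically_significant(fev1, exac) for fev1, exac, _ in records]
predicted = [flag for _, _, flag in records]
sens, spec = sensitivity_specificity(actual, predicted)
print(f"sensitivity={sens:.1f}% specificity={spec:.1f}%")
```

In this sketch the model flag stands in for the output of whatever classifier is built on the selected questionnaire variables; only the case definition threshold is taken from the abstract.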