Computer-aided diagnosis of pulmonary nodules using a two-step approach for feature selection and classifier ensemble construction

Artif Intell Med. 2010 Sep;50(1):43-53. doi: 10.1016/j.artmed.2010.04.011. Epub 2010 May 31.

Abstract

Objective: Accurate classification methods are critical in computer-aided diagnosis (CADx) and other clinical decision support systems. Previous research has reported on methods for combining genetic algorithm (GA) feature selection with ensemble classifier systems in an effort to increase classification accuracy. In this study, we describe a CADx system for pulmonary nodules using a two-step supervised learning system combining a GA with the random subspace method (RSM), with the aim of exploring algorithm design parameters and demonstrating improved classification performance over either the GA or RSM-based ensembles alone.

Methods and materials: We used a retrospective database of 125 pulmonary nodules (63 benign; 62 malignant) with CT volumes and clinical history. A total of 216 features were derived from the segmented image data and clinical history. Ensemble classifiers using RSM or GA-based feature selection were constructed and tested via leave-one-out validation, with feature selection and classifier training executed within each iteration. We further tested a two-step approach that first uses a GA ensemble to assess the relevance of the features and then uses this information to control feature selection during a subsequent RSM step. The base classification was performed using linear discriminant analysis (LDA).
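The RSM step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, and the function names (`fit_lda`, `rsm_scores_loo`, `auc`), the ridge term, the ensemble size, and the score-averaging rule are all hypothetical choices that the abstract does not specify. The area under the ROC curve is computed with the nonparametric rank statistic, which may differ from how Az was estimated in the study.

```python
import numpy as np

def fit_lda(X, y, ridge=1e-3):
    """Two-class Fisher LDA; returns (w, b) so that score = x @ w + b."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance; the small ridge is an assumption
    # added here only for numerical stability.
    Xc = np.vstack([X0 - mu0, X1 - mu1])
    Sw = Xc.T @ Xc / (len(X) - 2) + ridge * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1)
    return w, b

def rsm_scores_loo(X, y, subset_size, n_members=25, seed=0):
    """Leave-one-out scores from a random-subspace ensemble of LDA members."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i  # hold out case i
        member_scores = []
        for _ in range(n_members):
            # Subsets are drawn inside the loop, so feature selection and
            # training never see the held-out case.
            feats = rng.choice(d, size=subset_size, replace=False)
            w, b = fit_lda(X[train][:, feats], y[train])
            member_scores.append(X[i, feats] @ w + b)
        scores[i] = np.mean(member_scores)  # average the member outputs
    return scores

def auc(scores, y):
    """Nonparametric area under the ROC curve (ties count one half)."""
    pos, neg = scores[y == 1], scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

# Synthetic stand-in data: 40 cases, 12 features, the first 3 informative.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 12))
X[y == 1, :3] += 2.0

scores = rsm_scores_loo(X, y, subset_size=4)
az = auc(scores, y)
```

Drawing the feature subsets and fitting each LDA member inside the leave-one-out loop mirrors the statement that selection and training were executed within each iteration, which keeps the held-out case from influencing its own score.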

Results: The RSM classifier alone achieved a maximum leave-one-out Az of 0.866 (95% confidence interval: 0.794-0.919) at a subset size of s=36 features. The GA ensemble yielded an Az of 0.851 (0.775-0.907). The proposed two-step algorithm produced a maximum Az value of 0.889 (0.823-0.936) when the GA ensemble was used to completely remove less relevant features from the second RSM step, with similar results obtained when the GA-LDA results were used to reduce but not eliminate the occurrence of certain features. After accounting for correlations in the data, the leave-one-out Az of the two-step method was significantly higher than that of either the RSM or the GA-LDA ensemble.
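The two variants of the second step, removing less relevant features entirely versus only reducing how often they occur, can be illustrated with relevance-weighted subset sampling. The sketch below is an assumption-laden illustration: the GA selection frequencies in `ga_freq` are made-up placeholder values, and the threshold, the penalty factor, and the function names `sampling_probs` and `draw_subsets` are hypothetical. The abstract does not state how the GA results were converted into sampling weights.

```python
import numpy as np

def sampling_probs(ga_freq, threshold, penalty):
    """Turn GA selection frequencies into RSM sampling probabilities.

    penalty = 0 removes features below the threshold entirely;
    0 < penalty < 1 keeps them but makes them less likely to be drawn.
    """
    ga_freq = np.asarray(ga_freq, dtype=float)
    weights = np.where(ga_freq >= threshold, 1.0, penalty)
    return weights / weights.sum()

def draw_subsets(probs, subset_size, n_members, seed=0):
    """Draw one feature subset (without replacement) per ensemble member."""
    rng = np.random.default_rng(seed)
    d = len(probs)
    return [rng.choice(d, size=subset_size, replace=False, p=probs)
            for _ in range(n_members)]

# Placeholder frequencies: the fraction of GA ensemble members that
# selected each of 8 hypothetical features.
ga_freq = np.array([0.9, 0.8, 0.1, 0.7, 0.05, 0.6, 0.2, 0.75])

hard = sampling_probs(ga_freq, threshold=0.5, penalty=0.0)   # remove
soft = sampling_probs(ga_freq, threshold=0.5, penalty=0.25)  # down-weight

hard_subsets = draw_subsets(hard, subset_size=3, n_members=50)
soft_subsets = draw_subsets(soft, subset_size=3, n_members=50)
```

Each drawn subset would then feed one base classifier in the RSM ensemble. With `penalty=0.0` the low-frequency features can never appear in a subset, while a small positive penalty lets them appear occasionally.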

Conclusions: We have developed a CADx system for the evaluation of pulmonary nodules based on a two-step feature selection and ensemble classifier algorithm. We have shown that by combining classifier ensemble algorithms in this two-step manner, it is possible to predict the malignancy of solitary pulmonary nodules with a performance exceeding that of either of the individual steps.

MeSH terms

  • Algorithms
  • Artificial Intelligence
  • Data Mining
  • Databases as Topic
  • Decision Support Systems, Clinical*
  • Decision Support Techniques*
  • Discriminant Analysis
  • Female
  • Humans
  • Linear Models
  • Lung Diseases / diagnostic imaging*
  • Lung Neoplasms / diagnostic imaging*
  • Male
  • Medical Informatics*
  • New York
  • Pattern Recognition, Automated
  • Predictive Value of Tests
  • Prognosis
  • Radiographic Image Interpretation, Computer-Assisted*
  • Retrospective Studies
  • Solitary Pulmonary Nodule / diagnostic imaging*
  • Tomography, X-Ray Computed*