A preprocessing method for improving data mining techniques. Application to a large medical diabetes database

Stud Health Technol Inform. 2003:95:269-74.

Abstract

The Knowledge Discovery in Databases (KDD) methodology appears well suited to the analysis of large clinical databases. In the KDD process, the preprocessing step (data cleaning and handling of missing values) is paramount, since it conditions the quality of the results obtained by data mining procedures and represents about 80% of the whole project time. The aims of the present study were to analyze this step and to provide tools for handling inconsistent data and missing values. We broke the process down into 3 main stages: data cleaning, explanatory study of missing values, and choice of the procedure used for handling missing values. The data cleaning stage was based on a system of logical rules to correct mistakes and on cluster analysis to discard the poorly filled files. The missing-data mechanism was analyzed by means of multivariate statistical procedures. Two methods of dealing with missing values were compared: imputation by the most common value (mode) and imputation using decision trees. This study was performed on a large medical diabetes database (23,601 patients) containing numerous missing values. The system of logical rules made it possible to correct mistakes in essential parameters (for example, the type of diabetes). Cluster analysis identified 10% of the files as poorly filled. After multivariate analysis, the missing-data mechanism could be considered random. For variables with a low proportion of missing values (< 10%) and a small number of categories (< 4), imputation using decision trees provided better results than imputation by mode.
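The difference between the two imputation methods compared above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the tree algorithm or the variables used, so the code below assumes a depth-one decision tree (a single split on one categorical predictor), and the field names `insulin` and `diabetes_type`, the function names, and the six toy records are all invented for illustration.

```python
from collections import Counter, defaultdict

MISSING = None  # assumption: missing values are stored as None


def mode_of(values):
    """Most common non-missing value (assumes at least one is observed)."""
    counts = Counter(v for v in values if v is not MISSING)
    return counts.most_common(1)[0][0]


def impute_by_mode(records, target):
    """Fill every missing target with the overall most common value."""
    fill = mode_of(r[target] for r in records)
    return [dict(r, **{target: fill}) if r[target] is MISSING else dict(r)
            for r in records]


def impute_by_stump(records, target, predictors):
    """Fill missing targets with a depth-one decision tree.

    The tree splits on the single predictor whose per-value majority class
    classifies the most complete records correctly. A full decision tree
    would recurse on each branch; this sketch stops after one split.
    """
    complete = [r for r in records if r[target] is not MISSING]
    overall = mode_of(r[target] for r in complete)
    best_pred, best_rules, best_hits = None, {}, -1
    for p in predictors:
        groups = defaultdict(list)
        for r in complete:
            if r[p] is not MISSING:
                groups[r[p]].append(r[target])
        # Each branch predicts the majority class of its group.
        rules = {value: mode_of(ts) for value, ts in groups.items()}
        hits = sum(Counter(ts)[rules[v]] for v, ts in groups.items())
        if hits > best_hits:
            best_pred, best_rules, best_hits = p, rules, hits
    filled = []
    for r in records:
        r = dict(r)
        if r[target] is MISSING:
            # Fall back to the overall mode for unseen or missing predictor values.
            r[target] = best_rules.get(r.get(best_pred), overall)
        filled.append(r)
    return filled


# Invented toy records, for illustration only.
records = [
    {"insulin": "yes", "diabetes_type": "type 1"},
    {"insulin": "yes", "diabetes_type": "type 1"},
    {"insulin": "no", "diabetes_type": "type 2"},
    {"insulin": "no", "diabetes_type": "type 2"},
    {"insulin": "no", "diabetes_type": "type 2"},
    {"insulin": "yes", "diabetes_type": MISSING},
]

by_mode = impute_by_mode(records, "diabetes_type")
by_tree = impute_by_stump(records, "diabetes_type", ["insulin"])
```

In this toy data the mode fills the missing entry with the globally most frequent category, whereas the tree uses the record's other field to choose a category, which is the kind of difference the comparison in the abstract measures.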

MeSH terms

  • Data Interpretation, Statistical
  • Diabetes Mellitus*
  • France
  • Humans
  • Information Storage and Retrieval / standards*
  • Medical Informatics Computing*
  • Neural Networks, Computer