Efficient treatment of outliers and class imbalance for diabetes prediction

Nonso Nnamoko; Ioannis Korkontzelos

doi:10.1016/j.artmed.2020.101815

Artif Intell Med. 2020 Apr:104:101815. doi: 10.1016/j.artmed.2020.101815. Epub 2020 Feb 10.

Authors

Nonso Nnamoko¹, Ioannis Korkontzelos²

Affiliations

¹ Department of Computer Science, Edge Hill University, Ormskirk, United Kingdom.
² Department of Computer Science, Edge Hill University, Ormskirk, United Kingdom.

PMID: 32498997
DOI: 10.1016/j.artmed.2020.101815

Abstract

Learning from data that contain outliers and class imbalance remains one of the major difficulties for machine learning classifiers. Among the numerous techniques dedicated to tackling this problem, data preprocessing solutions are known to be efficient and easy to implement. In this paper, we propose a selective data preprocessing approach that embeds knowledge of the outlier instances into an artificially generated subset to achieve an even class distribution. Outliers were first identified and oversampled (irrespective of class). The Synthetic Minority Oversampling TEchnique (SMOTE) was then used to balance the training data by introducing artificial minority instances. The aim is to balance the training dataset while controlling the effect of outliers. The experiments show that such selective oversampling empowers SMOTE, ultimately leading to improved classification performance.

Keywords: Data preprocessing; Imbalanced data; Machine learning; Outlier detection; Oversampling; SMOTE.
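
The order of operations described in the abstract (oversample outliers first, then apply SMOTE to the minority class) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how outliers are detected or how they are oversampled, so the interquartile-range rule, the duplication of outlier rows, the function names (`iqr_outlier_indices`, `smote`, `selective_oversample`) and all parameter values below are assumptions made for illustration. The sketch also assumes binary labels and at least two minority rows.

```python
import random
from statistics import quantiles


def iqr_outlier_indices(X, factor=1.5):
    """Rows with any feature outside [Q1 - factor*IQR, Q3 + factor*IQR].

    Assumption: the IQR rule stands in for whatever detector the paper uses.
    """
    flagged = set()
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        q1, _, q3 = quantiles(column, n=4)
        low, high = q1 - factor * (q3 - q1), q3 + factor * (q3 - q1)
        flagged.update(i for i, v in enumerate(column) if v < low or v > high)
    return sorted(flagged)


def smote(minority, n_new, k=5, rng=random):
    """Basic SMOTE: interpolate between a minority row and one of its
    k nearest minority neighbours (squared Euclidean distance)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


def selective_oversample(X, y, outlier_copies=1, k=5, seed=0):
    """Duplicate outliers regardless of class, then balance with SMOTE."""
    rng = random.Random(seed)
    X, y = [list(row) for row in X], list(y)

    # Step 1 (assumed form): replicate each outlier row, whatever its class.
    for i in iqr_outlier_indices(X):
        for _ in range(outlier_copies):
            X.append(list(X[i]))
            y.append(y[i])

    # Step 2: add synthetic minority rows until both classes are equal in size.
    counts = {label: y.count(label) for label in set(y)}
    minority_label = min(counts, key=counts.get)
    minority = [row for row, label in zip(X, y) if label == minority_label]
    new_rows = smote(minority, max(counts.values()) - len(minority), k, rng)
    return X + new_rows, y + [minority_label] * len(new_rows)
```

Because the outlier copies are appended before SMOTE runs, any minority-class outliers become part of the pool from which synthetic instances are interpolated, which is one plausible reading of "embedding knowledge of the outlier instances" into the generated subset.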

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Diabetes Mellitus* / diagnosis
Diabetes Mellitus* / therapy
Humans
Machine Learning*