The classification of a population by a specific trait is a major task in medicine, for example when in a diagnostic setting groups of patients with specific diseases are identified, but also when in predictive medicine a group of patients is classified into specific disease severity classes that might profit from different treatments. When the sizes of those subgroups become small, for example in rare diseases, imbalances between the classes are more the rule than the exception and make statistical classification problematic when the error rate of the minority class is high. Many observations are classified as belonging to the majority class, while the error rate of the majority class is low. This case study aims to investigate class imbalance for Random Forests and Powered Partial Least Squares Discriminant Analysis (PPLS-DA) and to evaluate the performance of these classifiers when they are combined with methods to compensate imbalance (sampling methods, cost-sensitive learning approaches). We evaluate all approaches with a scoring system taking the classification results into consideration. This case study is based on one high-dimensional multiplex autoimmune assay dataset describing immune response to antigens and consisting of two classes of patients: Rheumatoid Arthritis (RA) and Systemic Lupus Erythemathodes (SLE). Datasets with varying degrees of imbalance are created by successively reducing the class of RA patients. Our results indicate possible benefit of cost-sensitive learning approaches for Random Forests. Although further research is needed to verify our findings by investigating other datasets or large-scale simulation studies, we claim that this work has the potential to increase awareness of practitioners to this problem of class imbalance and stresses the importance of considering methods to compensate class imbalance.
Keywords: Cost-sensitive learning; Imbalanced data; PPLS-DA; Random Forests; Sampling.
© 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.