A Generative Neighborhood-Based Deep Autoencoder for Robust Imbalanced Classification

IEEE Trans Artif Intell. 2024 Jan;5(1):80-91. doi: 10.1109/TAI.2023.3249685. Epub 2023 Feb 27.

Abstract

Deep learning models have recently performed remarkably well on many classification tasks. The superior performance of deep neural networks relies on large amounts of training data, which must also have a balanced class distribution for training to be effective. However, in most real-world applications, labeled data may be limited and highly imbalanced across classes; as a result, the learning process of most classification algorithms is adversely affected, leading to unstable predictions and low performance. Three main categories of approaches address the problem of imbalanced learning: data-level methods, algorithm-level methods, and hybrid methods that combine the two. Data-generative methods are typically based on generative adversarial networks, which require significant amounts of data, while algorithm-level methods require extensive domain expertise to craft the learning objectives and are therefore less accessible to users without such knowledge. Moreover, the vast majority of these approaches are designed for and applied to imaging applications, fewer to time series, and very rarely to both. To address these issues, we introduce GENDA, a generative neighborhood-based deep autoencoder, which is simple yet effective in its design and can be successfully applied to both image and time-series data. GENDA learns latent representations that rely on the neighboring embedding space of the samples. Extensive experiments conducted on a variety of widely used real datasets demonstrate the efficacy of the proposed method.
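To make the idea of generating samples from a neighborhood in the embedding space more concrete, the following is a minimal sketch of SMOTE-style interpolation between latent codes and their nearest neighbors. It is a generic illustration and not the GENDA algorithm itself, whose details are not given in this abstract. The function name `interpolate_latent_neighbors`, its parameters, and the assumption that minority-class latent codes `z` have already been produced by some encoder are all hypothetical.

```python
import numpy as np

def interpolate_latent_neighbors(z, n_new, k=5, seed=0):
    """Create n_new synthetic latent codes by interpolating between
    existing codes and their k nearest neighbors (SMOTE-style).

    z is assumed to be an (n, d) array of minority-class latent codes
    from some encoder, with n >= 2. Hypothetical helper, not GENDA.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    n = z.shape[0]
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Pairwise squared Euclidean distances in the latent space.
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest neighbors of every point, shape (n, k).
    nbrs = np.argsort(d2, axis=1)[:, :k]

    # Pick a random base point and one of its neighbors per new sample.
    base = rng.integers(0, n, size=n_new)
    pick = nbrs[base, rng.integers(0, k, size=n_new)]

    # Random position on the segment between base and neighbor.
    lam = rng.random((n_new, 1))
    return z[base] + lam * (z[pick] - z[base])
```

In a full pipeline, the synthetic codes would presumably be passed through a decoder to obtain new minority-class samples in the input space; that step depends on the autoencoder architecture and is left out here.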

Impact statement: Imbalanced data classification is a current and important issue in many real-world learning applications, hampering most classification tasks. Fraud detection, biomedical imaging that distinguishes healthy people from patients, and object detection are some indicative domains with economic, social, and technological impact that are greatly affected by inherently imbalanced data distributions. However, the majority of existing algorithms that address the imbalanced classification problem are designed with a particular application in mind, and thus they can be used only with specific datasets and even specific hyperparameters. The generative model introduced in this paper overcomes this limitation and produces improved results for a large class of imaging and time-series data, even under severe imbalance ratios, making it quite competitive.

Keywords: Data augmentation; image data; imbalanced classification; latent space; time-series data.