Positive and unlabeled (PU) learning is a form of semi-supervised binary classification in which the learning algorithm differentiates between a set of labeled positive instances and a set of unlabeled instances containing both positives and negatives. PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the selected completely at random (SCAR) assumption, namely that positives are selected independently of their features. In many real-world applications, however, such as healthcare, positives are not SCAR (e.g., severe cases are more likely to be diagnosed). This leads to a poor estimate of the proportion, α, of positives among unlabeled examples and to poor model calibration, leaving the decision threshold for selecting positives uncertain. PU learning algorithms vary in scope: some estimate only α, others calculate the probability that each individual unlabeled instance is positive, and some do both. We propose two PU learning algorithms that estimate α, compute calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR employs a divide-and-conquer approach that clusters SNAR positives into subtypes and estimates α for each subtype by applying PULSCAR to the positives from that cluster together with all of the unlabeled examples. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.
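To make the divide-and-conquer idea concrete, the sketch below clusters the labeled positives into subtypes and estimates a per-subtype α by pairing each cluster's positives with all unlabeled examples. This is not the paper's PULSCAR estimator: as a stand-in for the per-cluster SCAR step, it uses the classic Elkan-Noto estimator (Elkan & Noto, KDD 2008), and the classifier choice, cluster count, function names, and the summing of per-subtype α values into an overall α are illustrative assumptions, not details given in the abstract.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict


def alpha_scar_elkan_noto(X_pos, X_unl, seed=0):
    """Estimate the fraction of positives among unlabeled under SCAR
    (Elkan-Noto stand-in; NOT the PULSCAR estimator)."""
    X = np.vstack([X_pos, X_unl])
    s = np.r_[np.ones(len(X_pos)), np.zeros(len(X_unl))].astype(int)  # s=1 iff labeled
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    # Out-of-fold estimates of g(x) = P(s=1 | x) to avoid optimistic bias.
    g = cross_val_predict(clf, X, s, cv=5, method="predict_proba")[:, 1]
    c = g[s == 1].mean()  # c = P(s=1 | y=1), the label frequency
    g_unl = np.clip(g[s == 0], 1e-6, 1 - 1e-6)
    # Under SCAR, P(y=1 | x, s=0) = (1-c)/c * g/(1-g); its mean over the
    # unlabeled set is the estimated alpha.
    return float(np.clip(((1 - c) / c * g_unl / (1 - g_unl)).mean(), 0.0, 1.0))


def alpha_snar_divide_and_conquer(X_pos, X_unl, n_subtypes=3, seed=0):
    """PULSNAR-style sketch: cluster SNAR positives into subtypes, then run
    the SCAR estimator on each subtype's positives vs. ALL unlabeled."""
    subtype = KMeans(n_clusters=n_subtypes, n_init=10,
                     random_state=seed).fit_predict(X_pos)
    alphas = [alpha_scar_elkan_noto(X_pos[subtype == k], X_unl, seed)
              for k in range(n_subtypes)]
    # Combining per-subtype alphas by summation is an assumption of this
    # sketch; the abstract only states that alpha is estimated per subtype.
    return alphas, min(sum(alphas), 1.0)
```

Clustering lets each subtype satisfy a SCAR-like assumption locally even when selection is biased globally; per-subtype estimates can then be combined rather than forcing one SCAR fit over heterogeneous positives.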
Keywords: Class imbalance; Class prior; Machine learning; Mixture proportion estimation; Noisy label learning; PULSCAR; PULSNAR; Positive-unlabeled learning; Probability calibration; SCAR; SNAR; Semi-supervised learning.
© 2024 Kumar and Lambert.