MMIL: A novel algorithm for disease associated cell type discovery

Erin Craig^∗1 Timothy Keyes^∗1 Jolanda Sarno^3,6 Maxim Zaslavsky⁴ Garry Nolan⁵ Kara Davis⁶ Trevor Hastie^1,2 Robert Tibshirani^1,2

(¹Department of Biomedical Data Science, Stanford University
²Department of Statistics, Stanford University
³M. Tettamanti Research Center, Pediatric Clinic, University of Milano Bicocca
⁴Department of Computer Science, Stanford University
⁵Department of Pathology, Stanford University
⁶Department of Pediatrics, Stanford University
^∗These authors contributed equally to this work.)

Abstract

Single-cell datasets often lack individual cell labels, making it challenging to identify cells associated with disease. To address this, we introduce Mixture Modeling for Multiple Instance Learning (MMIL), an expectation maximization method that enables the training and calibration of cell-level classifiers using patient-level labels. Our approach can be used to train e.g. lasso logistic regression models, gradient boosted trees, and neural networks. When applied to clinically-annotated, primary patient samples in Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL), our method accurately identifies cancer cells, generalizes across tissues and treatment timepoints, and selects biologically relevant features. In addition, MMIL is capable of incorporating cell labels into model training when they are known, providing a powerful framework for leveraging both labeled and unlabeled data simultaneously. Mixture Modeling for MIL offers a novel approach for cell classification, with significant potential to advance disease understanding and management, especially in scenarios with unknown gold-standard labels and high dimensionality.

1 Introduction

In the biomedical field, we often know whether or not a person is sick. However, we do not necessarily know which cells in a sick person’s body–which contains a combination of diseased and healthy cells–play a role in their illness. Identifying disease-associated cell populations is a common goal in translational studies of single-cell datasets, with applications in systems biology, disease monitoring, diagnostics, and the discovery of targetable cell populations for novel therapeutics. Nevertheless, classifying cells as sick or healthy can be difficult: obtaining gold standard labels can be prohibitively challenging or expensive, and it is not obvious how to predict labels without labeled training data. This is particularly true in the context of complex diseases for which true disease-associated cell populations remain unidentified, making gold-standard cell annotation impossible.

Without cell labels to train a classifier, researchers often turn to methods from multiple instance learning (MIL), which are designed for data consisting of many unlabeled cells from each labeled patient. However, the primary aim of most MIL methods is to predict new patient labels rather than cell labels. Here, we present Mixture Modeling for Multiple Instance Learning (MMIL), a method that, armed only with patient labels, uses expectation maximization (EM) [1] to train a classifier to label individual cells as disease-associated or non-disease-associated. Our approach alternately estimates cell labels and trains a classifier; at each iteration, our classifier and cell label estimates improve. MMIL is versatile and can be easily implemented as a wrapper around any classifier trained by optimizing the binomial log likelihood; for example, lasso logistic regression models, gradient boosted trees and neural networks. MMIL can also be used to improve the model calibration of any cell-level model — calibration is important because it encourages the alignment of estimated probabilities with outcomes such that e.g. 90% of the cells with a $0.9$ probability are truly disease-associated. MMIL’s calibration step can be used as a post-processing step for any cell-level model, making predictions more reliable and understandable.

We then show that MMIL identifies cancerous cells in leukemia (cancer of the blood and bone marrow). This is an important task: the early and precise identification of malignant cells is pivotal for determining treatment options and monitoring disease progression. Leukemia cells (often called “leukemic blasts”) are challenging to identify due to phenotypic characteristics resembling the healthy cells from which they originate [2]. For this reason, identifying treatment-resistant cells–referred to as “minimal residual disease” (MRD) in clinical contexts–is challenging and labor-intensive, even among highly trained clinicians using state-of-the-art blood profiling technologies [3, 4].

We use MMIL to train lasso logistic regression models on data collected from patients with either Acute Myeloid Leukemia (AML) or Acute Lymphoblastic Leukemia (ALL) as well as healthy controls. In both diseases, we demonstrate MMIL’s ability to (1) identify cancer cells accurately; (2) generalize to cells sampled from different patients, different timepoints after initiation of treatment, or different tissues than those on which the models were originally trained; and (3) select single-cell features known to be associated with leukemia. Our applications in leukemia diagnosis and treatment monitoring reveal that MMIL has broad implications in the field of single-cell analysis. Because MMIL is a direct, flexible method for identifying disease-associated cells using only patient labels, it has the potential to reveal novel biological insights and support patient monitoring, treatment and diagnosis for poorly understood diseases.

2 Results

2.1 An algorithm to classify cells without complete cell labels

We wish to train a classifier to predict whether a given cell is healthy or diseased. Typically, we would (1) gather a dataset of cells where we know which cells are diseased and which are healthy; and (2) train a model to predict cell disease status (using e.g. logistic regression, random forest, neural network). This workflow relies on knowing which cells are diseased and which are not. In our setting we have no such luxury: we do not know which cells sampled from sick patients are diseased, we only know that cells sampled from healthy people are healthy. MMIL is an algorithm that trains classifiers in this setting.

MMIL is an iterative process that relies on the following intuition. If we had a trained classifier, we could use it to predict which cells are healthy and which are not. And if we had a prediction of which cells were healthy and which were not, then we could use these predicted labels to train a classifier. MMIL fits a classifier by alternating between these two steps, and is described in Algorithm 1 (and in more detail in Appendix B).

Refer to caption — Figure 1: Mixture Modeling for Multiple Instance Learning (MMIL) detects cancer cells in Acute Myeloid Leukemia (AML) using patient labels only (A) Process to train a mixture model for multiple instance learning data. We initialize the sick person’s cell labels as $0.5$ : in this example, we assume that the prevalence of diseased cells in sick people is $50\%$ , and so each cell has a 50/50 chance of being diseased. After training the first classifier, we improve our estimates of the sick person’s cell labels. This process is repeated until convergence. (B) Schematic of model training and evaluation on the AML cohort from [5]. (C) Nonzero coefficients for the Mixture Lasso model trained to detect leukemic blasts in AML. (D) Nonzero coefficients for the ”Optimal” Lasso model trained to detect leukemic blasts in AML. (E) Receiver-Operator Characteristic (ROC) curves demonstrating individual (thin) and average (thick) performance of Optimal, Mixture, and Naive lasso models trained to detect leukemic blasts in AML. Insets indicate mean area under the ROC curve (AUC) across all patients. (F) Scatterplots representing the relationship between the gold-standard, pathologist-enumerated blast percentage for each patient (X-axis) and the model-assigned blast percentage for each patient for the Optimal (red), Mixture (blue), and Naive (yellow) lasso models. Inset text represents the Pearson correlation coefficients between the values on the X- and Y-axes.

Algorithm 1 MMIL: Mixture Models for Multiple Instance Learning

1.

Using domain knowledge, choose a value for $1-\rho$ , the probability that a cell from a sick person is diseased.
2.
Create a dataset with:
- •
  
  One copy of each cell from healthy people, assigned the label $y=0$ (healthy) and weight $w=1$ .
- •
  
  Two copies of each cell from sick people, one labeled $y=0$ (healthy) with weight $w=\rho$ and the other $y=1$ (diseased) with weight $w=1-\rho$ .
The weights reflect uncertainty. We are confident that healthy patients’ cells are healthy. But we do not know whether sick patients’ cells are diseased or healthy – the best we can do is assume that each cell is diseased with probability $1-\rho$ .
3.
Alternate:
1. (a)
  
  Train a classifier using the augmented dataset.
2. (b)
  
  Use the trained classifier to predict whether each sick patient’s cell is diseased or healthy. Update their weights using these predictions. (See Methods for details.)
After training, the classifier improves our guess for whether each sick person’s cell is diseased or healthy. Then, the new guesses are used to train a new classifier.
4.

Stop alternating when the model predictions stop changing.

2.2 Mixture modeling identifies cancer cells in Acute Myeloid Leukemia using only unlabeled cells

We now consider a publicly available dataset of 800 thousand cells collected from 13 Acute Myeloid Leukemia (AML) patients and three healthy bone marrow controls [5] (see Methods). Each AML sample is a mixture of cancer cells and healthy cells, and this cohort of patients represents a wide range of leukemic blast counts. Typically, a highly-trained hematopathologist must analyze a blood or bone marrow specimen to diagnose and track the progression of disease over time. Hematopathologists do this by manually enumerating the proportion of leukemic blasts in the sample based on each cell’s size, shape, and expression of leukemia-associated markers measured by microscopy or flow cytometry. Particularly in the context of myeloid leukemias, blast enumeration is challenging and error-prone even among expert pathologists [5, 6].

Using MMIL, we sought to automate the identification of cells specific to AML without expert cell-level annotation. Importantly, each cell in the AML dataset has a gold-standard annotation indicating whether it is a leukemic blast as evaluated by a board-certified hematopathologist. These annotations allow us to evaluate MMIL’s performance in identifying leukemic blasts compared to expert clinical judgement.

Each cell was analyzed for the presence of 32 surface proteins and 11 intracellular proteins [5] using mass cytometry (CyTOF), a high-dimensional cytometry platform similar to clinical cytometers commonly used to analyze leukemic tissue specimens in clinical laboratories [7, 8]. However, CyTOF analysis allows a high-dimensional, single-cell characterization of each sample, extending beyond the capabilities of conventional multicolor flow cytometers and making it ideal for deep phenotyping of leukemic cells in a research environment [9].

Using this dataset, we trained 3 models: (1) the “optimal” model, a lasso logistic regression model trained using gold-standard annotations (usually unknown), (2) the ”naive” model, a lasso model trained using patient labels instead of cell labels, and (3) Mixture Lasso, a lasso model trained with MMIL. When training Mixture Lasso, we assumed that $75\%$ of cells in cancer patients are healthy, based on the clinical guideline that patients with at least $25\%$ of blasts in their bone marrow receive a leukemia diagnosis[10]. We estimated the performance of each of these models using leave-one-(patient)-out cross-validation (LOOCV). In each fold of our cross-validation procedure, models were trained using all 3 healthy control marrows and 12 of the 13 AML marrows; then, the performance of each model was evaluated by comparing predictions on the held-out patient’s cells to the pathologist’s gold-standard labels (Figure 1b).

At the cell-level, Mixture Lasso achieved a mean area under the receiver-operator characteristic curve (AUROC) of 0.751 across all held-out patients during the cross-validation. By comparison, the naive model achieved worse performance with an AUROC of 0.658, and the optimal model achieved superior performance with an AUROC of 0.945 (Figure 1e). At the patient-level, when a probability threshold of 0.5 was used to classify cells as leukemic blasts (see Methods), the blast percentages that Mixture Lasso assigned to each patient in the dataset correlated strongly with pathologist-assigned blast percentages (Pearson $\rho$ = 0.88 compared to the optimal model’s $\rho$ = 0.95). The naive model’s blast percentage assignments correlated less with the pathologist gold standard (Pearson $\rho$ = 0.65).

In their original study, Tsai et al. identified two sets of markers that could be used to distinguish between different stages of myeloid development in healthy and AML cell populations. Among these were the cytoskeletal proteins $\beta$ actin, Lamin A/C, and Lamin B1; the granule-associated proteins SerpinB1, VAMP-7, ribosomal RNA (rRNA), Lactoferrin, and lysozyme; CD45; and the cell size marker wheat germ agglutinin lectin (WGA). Two of the most important markers for the pathologists’ AML blast enumeration were rRNA (highly expressed in AML blasts due to their rapid growth) and Lactoferrin (a granule protein only expressed in mature myeloid cells that remains lowly expressed in undifferentiated AML blasts). Figures 1b-c show that Lactoferrin, Lamin A/C, CD45, and rRNA were selected by both the Mixture Lasso and optimal lasso models, whereas SerpinB1 and WGA were uniquely selected by Mixture Lasso. However, Lamin B1 and $\beta$ actin were uniquely selected by the optimal lasso model, and only the optimal model selected rRNA with a positive coefficient. Thus, Mixture Lasso closely follows, but does not entirely reproduce the pathologist’s decision-making process, suggesting an ability to augment (not just replicate) existing clinical decision processes.

To visualize Mixture Lasso’s predictions, we constructed several uniform manifold approximation and projection (UMAP) [11] plots using all cells and markers in the dataset (Figure 2a-e). We find that cells close to one another in UMAP space tend to have similar Mixture Lasso probabilities, indicating that Mixture Lasso tends to identify phenotypically coherent cell populations. Likewise, regions of UMAP space with high Mixture Lasso probabilities tend to correspond to regions containing large numbers of pathologist-identified AML blasts. However, Mixture Lasso does not simply assign high probabilities to all cells collected from cancer patients; rather, it assigns high probabilities to cells in areas of high-dimensional phenotype space preferentially occupied by cells from cancer patients, but not from healthy patients. Intuitively, this makes sense: leukemic blasts should map to regions of phenotype space with little to no overlap with phenotypes present in healthy samples. This observation is quantified in (Figures 2d,f).

Protein-specific UMAP plots further indicate that Mixture Lasso assigns high disease-association probabilities to cells with unique marker expression patterns–generally, those low in Lactoferrin and high in SerpinB1, CD16, and/or CD56. Furthermore, while rRNA was not explicitly selected as a disease-associated feature in the Mixture Lasso model, cells with high probabilities also tend to express high levels of rRNA (Figure 5 and Figure 6).

2.3 Mixture modeling identifies cancer cells in Acute Myeloid Leukemia using both labeled and unlabeled cells

Next, we sought to evaluate MMIL’s performance in the setting where both labeled and unlabeled cells are present–e.g. a scenario in which most leukemic patients in a dataset do not have cell labels, but a small number of them have been annotated. We hypothesized that a model capable of leveraging both the expert labels of a clinician and a larger corpus of unlabeled data would outperform models trained using only labeled or only unlabeled data [12].

However, using expert labels to train a model on a difficult classification task can be risky: clinical misidentification of leukemic blasts is a known challenge in hematopathology, with a high degree of documented interobserver variability and methodological disagreement (Supplemental Figure 7) [6]. This is particularly concerning for training machine learning models, as the degradation of model performance due to label error is a widely characterized problem [13, 14]. Thus, we also hypothesized that models trained on both labeled and unlabeled data would be more robust to label error than models trained on labeled data alone, as they would be less susceptible to over-fitting on the imperfect labels.

To test these hypotheses, we developed a semi-supervised version of MMIL (“1-shot MMIL”) using the training procedure illustrated in (Figure 3a). Briefly, 1-shot MMIL is trained identically to MMIL with one change: during model training, a single AML patient (termed the “1-shot patient”) is chosen and their gold-standard cell labels are used for supervision instead of probabilistic labels. Like the cells from healthy patients, the 1-shot patient’s cell labels remain fixed throughout all iterations of the EM. Otherwise, model training and LOOCV are carried out as before.

To benchmark 1-shot MMIL’s performance, we compare it to 0-shot MMIL (the usual MMIL), the 0-shot naive model (a naive model trained as described before), the 1-shot naive model (a model in which the 1-shot patient’s gold-standard cell labels are used during model training, but all others use their inherited patient labels), and the 1-shot optimal model (a model trained on only the 1-shot patient using their gold-standard cell labels). Furthermore, to benchmark 1-shot MMIL’s robustness to imperfect labeling, we again fit each 1-shot model after randomly permuting 25% of the 1-shot patient’s cell labels–a proportion chosen to match the variability observed between pathologists when enumerating leukemic blasts in AML (Figures 7 and Methods). All models were lasso-regularized logistic regression classifiers.

We performed 13 1-shot experiments: one for each AML patient in the dataset to serve as the 1-shot patient. The results of these experiments are summarized in Figure 3b-e. In the “0-shot” case, we once again observed that Mixture Lasso outperformed the naive model (Figure 3b, left). In the 1-shot case, we found that Mixture Lasso’s performance improved substantially compared to 0-shot Mixture Lasso (average AUROC increase of 0.058) while maintaining its superior performance to the 1-shot naive model (Figure 3b, middle). We also found that training an optimal model using only the 1-shot patient outperforms 1-shot Mixture Lasso; however, after training labels were permuted, 1-shot Mixture Lasso outperforms both the naive model and the optimal model, whose performance degrades substantially after training on the imperfect labels (Figure 3b, right). Together, these results suggest that MMIL is capable of improving its performance using even a small number of gold-standard labels, while remaining more robust to noisy labeling than alternative models.

Based on these results, we wondered how the inclusion of the 1-shot patient during model training affected the feature sets selected by Mixture Lasso. To interrogate this, we examined the 1-shot experiment with the largest average AUROC improvement between the 0-shot and 1-shot models across all patients (patient AML-5An was the 1-shot patient). In this experiment, rRNA was selected as the most cancer-associated feature by the optimal model (Figure 3c). Interestingly, rRNA was not identified as a cancer-associated marker by the 0-shot Mixture Lasso model, but was identified by the 1-shot model (Figure 3d). In addition, several markers (CD64, CD3, CD8, and CD19) that were included in the 0-shot Mixture Lasso–but that were not used in Tsai et al.’s manual gating of leukemic blasts–were removed in the 1-shot model. This held as a general trend across all 1-shot experiments, across which rRNA was never selected as a cancer-associated marker by 0-shot Mixture Lasso, but was selected as a cancer-associated marker in 10/13 of the corresponding 1-shot Mixture Lasso. At the same time, markers CD3, CD5, CD15, and CD11c were often selected by 0-shot Mixture Lasso but are not canonically considered markers of AML blasts, and these were selected less frequently by 1-shot Mixture Lasso. Taken together, these results demonstrate that MMIL can adapt its feature selection process via semi-supervision, leveraging even a small number of expert-labels to prioritize clinically relevant markers over non-informative ones.

2.4 Mixture modeling identifies and tracks cancer cells throughout treatment progression in Acute Lymphoblastic Leukemia

Finally, we applied MMIL to a dataset of 1.1 million blood and bone marrow cells collected from three Acute Lymphoblastic Leukemia (ALL) patients and three healthy controls. This dataset contains samples collected from multiple tissues (blood and bone marrow) and multiple timepoints throughout treatment (diagnosis, Day 8 post-treatment initiation, and Day 15 post-treatment initiation). Using this dataset, we wanted to evaluate the ability of an MMIL model trained at diagnosis to accurately identify and track residual leukemic cells during treatment progression.

The identification of residual leukemic cells throughout treatment is crucial for assessing a patient’s risk of relapse and therefore for clinical decision making. Thus, MMIL’s ability to maintain strong performance despite distributional shift–i.e. a shift in the distribution of leukemic blast phenotypes over time due to treatment effects or changes in disease state–would be paramount for its application to disease monitoring in cancer.

To evaluate MMIL’s ability to detect residual leukemic cells over the course of treatment, we analyzed all cells in the ALL cohort for the presence of 29 proteins as previously described using CyTOF (see Methods). Optimal, naive, and Mixture Lasso models were trained using only bone marrow cells from the diagnostic timepoint, and model performance was once again evaluated using leave-one-patient-out cross validation. Next, we applied our diagnostic models to paired blood samples from the diagnostic timepoint to evaluate each model’s ability to generalize to a new tissue context that is more easily monitored clinically. In addition, we applied each model to paired blood samples taken at days 8 and 15 post-treatment initiation to evaluate its ability to track a patient’s disease burden over time.

Trained without access to gold standard labels, Mixture Lasso demonstrates robust performance and generalizability, with an AUROC of 0.941 at diagnosis and 0.815 at day 15 post-treatment initiation. The naive model begins with AUROC of 0.923 at diagnosis, but this degrades across time to 0.662 at Day 15 (Figure 4). This decline in performance aligns with clinical challenges encountered over the course of a patient’s treatment, including the diminishing proportion of leukemic blasts over time (as low as 0.1-5% of all cells) and phenotypic plasticity in resistant cells post-therapy [15]. Notably, the naive model’s brittle performance in response to these shifts highlights its susceptibility to false positives. Mixture Lasso, however, retains strong performance even as the percentage of leukemic cells becomes low, suggesting its potential for robust disease monitoring and minimal residual disease detection in leukemia.

As before, we analyzed Mixture Lasso’s coefficients to examine which protein markers best distinguish between healthy and leukemic cells. Among the markers with positive (cancer-associated) coefficients identified by Mixture Lasso (Supplemental Figure 9) were CD10, CD19, PAX5, and CD34—each of which has been previously described as aberrantly overexpressed in ALL [16]. For example, CD10, or “Common Acute Lymphoblastic Leukemia Antigen”(CALLA), is a surface protein expressed in primitive, undifferentiated B cell precursors as well as in the majority of ALL cells [17]. Mixture Lasso also identified CD58, a well-known marker of clinical MRD [18]. Conversely, all the proteins with negative coefficients in the Mixture Lasso model are phenotypic markers of mature B cells — such as surface immunoglobulin (IgMs), intracellular immunoglobulin M (IgMi), CD20, and the lambda chain of the B cell receptor (Supplemental Figure 9). Each of these proteins are expressed primarily during stages of B cell development that are not reached by leukemic cells, whose differentiation is blocked at earlier stages of B-cell development [7].

3 Discussion

Biomedical studies of single-cell data often involve analyzing many unlabeled cells from each patient in order to identify disease-associated cells. Finding such cells has broad-reaching applications in biomedical science, such as aiding our comprehension of disease-contributing cell populations and improving disease diagnosis and monitoring. Mixture Modeling for MIL (MMIL) is a method to train a cell classifier using patient labels only. Our approach has several attractive features. First, it adapts to different choices of classification algorithms: it is straightforward to implement as a wrapper that repeatedly calls established classification software that optimizes the binomial log likelihood. This makes it easy to implement our method within a pre-existing machine learning pipeline. Second, in the past it has been unclear how to calibrate models without labeled data; because our approach can estimate cell labels, we can use it to perform Platt scaling, a standard calibration method. This is useful because interpretable probability estimates empower straighforward downstream clinical decision-making. Finally, MMIL also supports partially labeled datasets: if a small number of cells have known positive labels, they can be included during model training to guide the classifier. This means that MMIL is able to learn from both labeled and unlabeled data simultaneously, allowing it to leverage existing knowledge while remaining robust to noisy or imperfect gold-standard labels.

Applied to a dataset of cells from patients with and without leukemia, MMIL performed and generalized better than the naive approach of training a classifier using patient labels in place of cell labels. In the context of detecting leukemic blasts in AML, we found that MMIL offered substantial improvement over the naive model. Furthermore, we observed that including a single pathologist-annotated AML sample in MMIL’s training set was sufficient not only to improve its performance, but also for it to incorporate expert knowledge into its own decision making. We also showed that MMIL is more robust to noisy labeling than traditional supervised learning methods, allowing for robust performance in the classically difficult task of leukemic blast identification.

In the context of ALL, we additionally found that MMIL offered modest improvement over the naive model when applied to patient samples from the same timepoint (diagnosis) as the samples on which the models were trained (Figure 4 A-B). However, the naive model’s performance degraded substantially when applied to samples collected at later timepoints, likely due to phenotypic shifts occuring in leukemic blasts throughout chemotherapy [19]. By contrast, MMIL’s performance was more robust at later timepoints—particularly at day 15 post-treatment initiation–with a mean AUC of 0.815 compared to the naive model’s AUC of 0.662 (Figure 4 C-D). MMIL’s robust performance across multiple treatment timepoints highlights its ability to reliably detect biomarkers for disease at the single-cell level.

Only two parameters must be set to train a Mixture Model, aside from any hyperparameters for the chosen classifier. We assume that $\zeta$ , the proportion of sick people, and $\rho$ , the proportion of healthy cells in sick people, are known. Previous work has shown that these parameters are not, in general, identifiable. Where possible, these parameters should be estimated from other datasets and/or clinical knowledge, and a sensitivity analysis should be performed to evaluate how results are affected. We also assume that healthy patients have no disease-associated cells. When healthy patients may also have a mixture of cells, we instead recommend a clustering method such as MDA2 [20].

Mixture Modeling for MIL is a flexible, easy to implement method to train a cell-level classifier using patient labels, and has potential as a useful tool for biological hypothesis generation, diagnostics, disease monitoring, and biomarker discovery. In this work, we illustrated our method using two distinct biomedical datasets that were carefully hand-labeled by experts. In the future, however, we expect our method to be most impactful in situations when gold-standard labels are unknown; for example, to identify cells associated with adverse patient outcomes in cancer or to discover specific populations of immune cells underlying autoimmune disease.

4 Methods

4.1 Related work

In multiple instance learning (MIL), data consists of labeled bags, each with many unlabeled instances. Negative bags have only negative instances, and positive bags have at least one positive instance. In our case, the “bags” are people, each with many unlabeled cells; healthy people have only healthy cells (negative instances), and sick people have at least one diseased cell (positive instance). MIL methods can be broadly categorized into three paradigms [21]: instance space, bag space, and embedded space. Bag and embedded space methods aim to label bags; the instance space paradigm, like ours, aims to label instances, and so is our primary focus. For example, multiple instance SVM (mi-SVM) [22] performs SVM with the goal of assigning all negative instances to one halfspace and at least one instance from every positive bag to the other. This method trains an SVM iteratively: first, instances are assigned labels and an SVM is fit. Then, using the fitted SVM, instances are relabeled and a new model is trained. This is repeated until the labels stabilize. Another related approach is multiple instance logistic regression with a lasso penalty (MILR) [23], which treats the instance labels as missing data, and uses expectation maximization to optimize a lasso-penalized binomial log likelihood. This is the same general approach as ours. However, in the expectation step, MILR uses the joint probability of all instances in the same bag; as a result, MILR tends to be biased when the number of instances in each bag becomes large. Our approach, Mixture Modeling for MIL, instead treats instances as the primary objects of study, and does not use their bag-level grouping, thereby avoiding issues related to computing joint probabilities of many instances. This is particularly important in medical settings where we may sample tens or hundreds of thousands of cells per person. Finally, there are many neural network architectures designed for bag space multiple instance learning [24] that can be used to label cells. One common method is to apply attention [25] to simultaneously label patients and identify the cells responsible for the predicted bag label.

Our setting is also analogous to the presence only or positive unlabeled settings, where some data points have known positive labels, and the remaining labels are unknown. We may instead refer to our data as negative only data: cells from healthy people are healthy, and therefore have known negative labels, and cells from sick people are unlabeled and may have positive or negative labels. In presence only data, unlabeled instances are randomly sampled from the population, and

P(y_{i}=1\mid x_{i},\text{$i$ is unlabeled})=P(y_{i}=1\mid x_{i}).

(1)

In our setting, though, unlabeled data have a higher-than-baseline probability of belonging to the positive class. The knowledge that a cell is from a sick person raises the probability that it is diseased:

P(y_{i}=1\mid x_{i},\text{$i$ is unlabeled})=P(y_{i}=1\mid x_{i},\text{$i$ is % from a sick person})>P(y_{i}=1\mid x_{i}).

(2)

Ward et al. [26] present a method to train a classifier with presence only data using EM. Mixture modeling for MIL is a modification of their approach to accommodate this structural difference in our data.

Mixture discriminant analysis 2 (MDA2) [20], a clustering method, is related to the problem at hand. MDA2 considers data from $K$ classes. Each class is a mixture of Gaussian distributions, the centers of which are shared across all $K$ classes. Here, we have $K=2$ classes (healthy and sick people). The healthy class is composed of just one distribution: the distribution of healthy cells. Because sick people have both diseased and healthy cells, the sick class is a mixture of two distributions: the healthy and diseased cell distributions. While MDA2 is a natural fit for our setting, it is a generative method, and it assumes the features $X$ are normally distributed in order to model the joint likelihood $P(X,\text{cell labels $y$})$ . This is not always appropriate, and we may instead prefer a discriminative method that models the conditional likelihood $P(y\mid X)$ . We may also prefer to use classifiers that identify important features (like lasso and ridge regression), or relationships between features (like gradient boosted trees and neural networks).

Alternately, a simple but problematic approach is to train cell-level classifier using patient labels in lieu of cell labels [27, 28]. In our comparison, we term this the “naive” approach. This may work when most cells sampled from sick people are diseased, as the inherited label is correct most of the time. When sick people have a smaller fraction of diseased cells, classifier performance will suffer because of high label noise. Ward et al. [26] further show that using this approach with logistic regression causes coefficients to be biased toward 0, thereby hindering discovery of biologically important features. To directly address label noise, approaches from the denoising literature (e.g. [29, 30, 31]) may be useful. Still, our method leverages the knowledge that cells from healthy people have known labels, which is not assumed by more general methods.

4.2 Simulation reveals that calibration is possible without ground truth labels

We wish to simulate a dataset with size $n=1000$ and $p=100$ such that $\rho=\zeta=0.5$ , and the relationship between $x_{i}$ and $y_{i}$ is given by ${\text{logit }P(y_{i}\mid x_{i})=\beta^{T}x_{i}}$ . In each simulation, the first $10$ coefficients of $\beta$ are drawn from a normal distribution with mean and standard deviation 1, and the remainder are 0. Then, $x_{i}\in\mathbb{R}^{100}$ is sampled from a standard multivariate normal distribution with identity covariance; $y_{i}$ is from a binomial distribution defined by $P(y_{i}\mid x_{i})$ . Instances with $y_{i}=1$ (diseased cells) necessarily have $z_{i}=1$ (sampled from sick patients), and instances with $y_{i}=0$ (healthy cells) have $z_{i}=1$ with probability $\frac{\rho\zeta}{\rho\zeta+(1-\zeta)}$ . We repeat this process until we have $1000$ instances ( $500$ train and $500$ test) satisfying the requirement $\rho=\zeta=0.5$ .

Then, as before, we trained the optimal model, the naive model, and Mixture Lasso. In addition, we also trained a multiple instance SVM model (mi-SVM, described above). We are interested in both prediction and inference, so we compare the models in terms of AUROC, AUPRC, expected calibration error, and the sum of absolute differences between $\hat{\beta}$ and $\beta$ . For the optimal model, calibration is done with Platt scaling using the true labels $y$ ; this represents the best performance we could reasonably expect. The naive model is calibrated using the incorrect labels $z$ , to stay consistent with the fact that it was trained using the incorrect labels. The Mixture Lasso and mi-SVM models are calibrated in two steps: (1) obtain held-out predictions for each point in the training set (or use a separate held out dataset) and (2) fit a Mixture Lasso model using only the held out probabilities to predict the cell labels $y$ . This results in a logistic regression model that adjusts the predicted probabilities to be better calibrated. This simulation is repeated $1000$ times. For all objectives, the optimal model (trained using the true labels) performs best. When the true labels are unavailable, Mixture Lasso outperforms the naive approach (Table 1), and performing Platt scaling using Mixture Logistic Regression greatly improves calibration.

	AUROC	AUPRC	$\|\|\hat{\beta}-\beta\|\|_{1}$	uncalibrated ECE	calibrated ECE
Optimal	$0.94\pm 0.02$	$0.86\pm 0.05$	$6.71\pm 1.58$	$0.37\pm 0.02$	$0.04\pm 0.01$
Mixture Lasso	$0.90\pm 0.03$	$0.80\pm 0.07$	$9.64\pm 2.24$	$0.11\pm 0.02$	$0.06\pm 0.02$
Naive	$0.88\pm 0.04$	$0.74\pm 0.07$	$11.08\pm 2.43$	$0.38\pm 0.02$	$0.25\pm 0.03$
mi-SVM [32]	$0.71\pm 0.05$	$0.47\pm 0.07$	—	—	$0.17\pm 0.05$

Table 1: Comparison of three Lasso logistic regression modeling approaches in 1000 simulations. Optimal was trained with the true labels

y

(unavailable in realistic settings) and naive with the observed labels

z

. The final two columns show the expected calibration error (ECE), before and after calibration; a lower ECE is preferred. Platt scaling with Mixture Logistic Regression improves calibration for Mixture Lasso, and makes calibration possible for mi-SVM. The mean and one standard deviation of each distribution is shown.

4.3 Algorithm details

Algorithm 2 Train a mixture model for multiple instance learning data

Input: Covariates $X\in\mathbb{R}^{n\times p}$ , binary sampling labels $z\in\mathbb{R}^{n}$ , $n_{1}$ of which are positive.

Assumptions: The proportion of healthy cells in sick people is $P(y=0\mid z=1)=\rho$ .
AssumptionThe proportion of cells from sick people in the prediction population is ${P(z=1)=\zeta}$ .

1. Create a label vector $y^{(0)}$ , set to $0$ for cells from healthy people and $1-\rho$ for cells from sick people.

2. Iterate to convergence. At the $i^{\text{th}}$ step:

(a)

Maximization step: Fit a model $\eta^{\star(i)}(x)$ using $X$ and $y^{(i-1)}$ . Account for biased sampling: make the case-control intercept adjustment to obtain

\eta^{(i)}(x)=\eta^{\star(i)}(x)+\log\left(\frac{(1-\rho)n_{1}}{n-(1-\rho)n_{1% }}\right)-\log\left(\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right).

(3)

(b)

Expectation step: Use $\eta^{(i)}(x)$ to define $y^{(i)}$ . Cells from healthy people have known label $y^{(i)}=0$ . Cell $j$ from a sick person has the label $y_{j}^{(i)}$ such that:

\text{logit }y^{(i)}_{j}=\eta^{(i)}(x_{j})-\log\left(\frac{\rho\zeta}{1-(1-% \rho)\zeta}\right).

(4)

Output: A fitted model $\hat{\eta}(x)$ to predict labels for new cells.

To choose hyperparameters (e.g. the regularization strength in lasso logistic regression), we recommend using cross validation. For each desired choice of hyperparameter, train a model using cells from a subset of the patients, and record the value of the observed data log likelihood (in Section B) on the cells from held out patients. After repeating this across all desired hyperparameter values and all folds, choose the hyperparameters that maximize the cross validated log likelihood.

Importantly, we have assumed that $\rho$ and $\zeta$ , the proportion of healthy cells in sick patients and the proportion of cells sampled from sick patients, are known. We make these assumptions because $\rho$ and $\zeta$ are not, in general, identifiable. Ward et al. [26] show that they are identifiable only if we make strong assumptions about the form of $\eta(x)$ ; even when it they are identifiable, their estimates have high variance. This argument is reaffirmed and expanded upon by Hastie and Fithian [33].

For large datasets, the EM procedure may be slow and it may be more appropriate to optimize the observed log likelihood (Appendix B) through stochastic gradient descent. For smaller datasets, using EM allows us to easily try many different model choices for $\eta$ , and to take advantage of off-the-shelf model fitting software in each loop of the algorithm.

As we have described it here, our algorithm relies on the use of classification software that can train a model using soft labels (values between 0 and 1). In Appendix C, we share a modification for classifiers that require hard labels (restricted to be 0 or 1).

4.4 AML dataset analysis

4.4.1 Data acquisition

Normalized, singlet-gated AML data were downloaded from FlowRepository (ID: FR-FCM-Z2E7), and gold-standard patholgist annotations were obtained via correspondence with the authorship team of Tsai et al. (2020). Due to the ambiguity of their diagnosis, the one patient in the cohort with Myelodysplastic Syndrome (MDS) was excluded from analysis. For compatability with the blast enumeration in Tsai et al. (2020) and standard pathology laboratory procedure, only CD45+ events (hematopoietic lineage cells) were analyzed.

4.4.2 Data cleaning and preprocessing

Standard preprocessing steps for mass cytometry data analysis were performed as described previously [8]. Specifically, ion counts were transformed using the hyperbolic arcsine function with a cofactor of 5, and all markers were scaled to their 99.9th percentile for comparability between markers. Additionally, all cells expressing a marker value over the 99.9th percentile were excluded from analysis in order to remove technical artifacts and outliers. All analyses were performed using the R package tidytof [34].

4.4.3 Model fitting

Models in Figure 1 and Figure 2

Model specification: MMIL was applied to the AML cohort via the Mixture Lasso, a lasso logistic regression model trained using the algorithm described in 2.1. We used a value of ( $\rho=0.75$ ) for all MMIL models, relying on the clinical knowledge that AML patients with more than 25% blasts in their bone marrow receive the clinical diagnosis of leukemia [10]. $\zeta$ was estimated using the training set of each fold (see below).

In addition to Mixture Lasso, two other models were fit and evaluated: the “optimal” model (a lasso logistic regression model trained using the true cell labels, as annotated by a pathologist) and the “naive” model (a lasso model trained using inherited patient labels instead of cell labels). For clarity, the cell labels used by Mixture Lasso are referred to as the “probabilistic labels” of each cell; the cell labels used by the optimal model are referred to as the “gold-standard” or “pathologist-annotated” labels of each cell; and the labels used by the naive model are referred to as the “inherited patient labels” of each cell. Mixture Lasso, the optimal model, and the naive model are referred to as the 3 “model classes” that we evaluated.

Hyperparameter tuning: Lasso models have a single hyperparameter: $\lambda$ , the penalty term determining the amount of regularization applied to the model’s coefficients. For each model class, we tuned over 10 values of lambda, equally spaced on a logarithmic scale between the lowest value ( $10^{-5}$ ) and the highest value (1). The optimal hyperparameter for each model class was determined using cross-validation (see below). The model predictions of the model fit with the optimal penalty parameter were used for reporting in Figure 4’s ROC curves and AUROC values.

Cross-validation and model selection: Model performance was estimated using leave-one-(patient)-out cross-validation (LOOCV), a schema in which all cells from a single AML patient are held out as a separate test set in each fold of the cross-validation. Specifically, we break the dataset into 13 folds such that each fold includes 12 AML patients and all 3 healthy controls for model training and 1 held-out AML patient for model evaluation. For each fold and model class, we fit the model on the training set and evaluate its cell-level performance on the held-out AML patient according to gold-standard, pathologist-annotated cell labels. The optimal lasso regularization penalty for each model class was chosen by selecting the penalty that optimized the average log-likelihood (for Mixture Lasso) or binomial deviance (for both other models) on the held out sample across all folds. Finally, for model interpretation, we refit each model with the optimal penalty on all 13 AML samples and all 3 healthy bone marrow samples.

0-shot and 1-shot models in Figure 3:

In this section, we refer to models trained on a dataset for whom no cell labels are known “0-shot” models (the 0 denoting the number of patients who have received expert annotation before model training). Similarly, we refer to models trained on a dataset for whom a single patient’s cell labels are known “1-shot” models (the 1 again denoting the number of patients who have received expert annotation before model training). We borrow this language from the few-shot machine learning literature, a paradigm that has been scarcely applied to multiple-instance learning problems but that is highly applicable here.

0-shot models are simply the same models described in the “Models in Figure 1 and Figure 2” section above. In the 0-shot case, MMIL models are trained using probabilisitic cell labels as described in Figure 1a, naive models are trained using inherited patient labels, and optimal models are trained using gold-standard cell labels. Notably, it is not possible to fit a 0-shot optimal modelm, as optimal models require direct cell labels. This is why the left panel of 3b has a blank space for the optimal model.

By contrast, 1-shot models are trained identically to 0-shot models except for a single change: during model training, one AML patient (termed the “1-shot patient”) is chosen such that their gold-standard cell labels are used for supervision instead of their probabilistic labels (for Mixture Lasso) or inherited patient labels (for naive models). Similarly to the cells from healthy patients, the 1-shot patient’s cell labels remain fixed throughout all iterations of the EM. Otherwise, model training, cross-validation, and model selection are carried out as before.

Note that a single ’1-shot experiment’ in the main text refers to the following training procedure:

1.

Choose one of the 13 AML patients to designate as the 1-shot patient. Designate all other 12 AML patients as ”cross-validation patients.”
2.

Fit a 0-shot Mixture Lasso model and a 0-shot naive model using 12-fold cross validation. In each fold of the cross validation, 11 cross-validation patients, the 1-shot patient, and all 3 healthy controls are included in the training set, and 1 cross-validation patient is used as the held-out test set. For Mixture Lasso, the model with the penalty parameter that optimizes the average MMIL log likelihood in the test set across all folds is chosen as the best model and is used for error reporting. For the naive model, the model with the penalty parameter that optimizes the average binomial deviance in the test set across all folds is chosen as the best model and used for error reporting.
3.

Fit a 1-shot Mixture Lasso model and a 1-shot naive model using 12-fold cross validation. In each fold of the cross validation, 11 cross-validation patients, the 1-shot patient, and all 3 healthy controls are included in the training set, and 1 cross-validation patient is used as the held-out test set. Additionally, fit an optimal model using only the 1-shot patient. For Mixture Lasso, the model with the penalty parameter that optimizes the average MMIL log likelihood in the test set across all folds is chosen as the best model and is used for error reporting. For the naive model, the model with the penalty parameter that optimizes the average binomial deviance in the test set across all folds is chosen as the best model and used for error reporting. For the optimal model, only a single patient is available for model training, so the model with the penalty parameter that optimizes the average binomial deviance for that patient is chosen as the best model and used for error reporting.
4.

Repeat step 3 after randomly permuting 25% of the 1-shot patient’s gold-standard labels.

Thus, to give each AML patient a turn being the 1-shot patient, steps 1-4 were repeated 13 times. Finally, the AUROC for each left-out patient was averaged across all 13 1-shot experiments, giving an expected AUROC across all possible choices of 1-shot patient. These AUROC values are the reported values for each patient in 3b.

4.4.4 Model evaluation and interpretation

Single-cell model evaluation:To evaluate each model class’s performance at the single-cell level, we calculated the ROC curve and corresponding AUROC for each sample using gold-standard cell labels.

Blast percentages in Figure 1e: Mixture Lasso, naive, and optimal probabilities were assigned to each cell using the models refit on all AML patients with the best penalty parameters identified by cross-validation. These probabilities were used to classify each cell in the dataset as either a leukemic blasts or not a leukemic blast with each of the 3 model classes. In the case of the optimal model, cell labels are known, so the probability cutoff was chosen to give the highest cell label classification accuracy. In the case of the naive model, cell labels are not known, so the probability cutoff was chosen to give the highest inherited patient label classification accuracy. Finally, for Mixture Lasso, the cell labels are not known, but because of Mixture Lasso’s strong calibration due to our model fitting procedure, a simply probability cutoff of 0.5 was chosen.

Using these binary classifications for each cell, blast percentages were calculated for each patient. To recalibrate the blast percentages assigned by each model with the pathologist-enumerated blast percentages, a linear regression was fit for each model class as a post-processing step. Specifically, a linear regression of the model-assigned blast percentage onto the pathologist-enumerated blast percentage was used to derive the values plotted on the y-axis of 1e.

UMAP plots: To analyze how cells assigned high MMIL probabilities arrange in high-dimensional space, we performed dimensionality reduction using uniform manifold approximation and projection (UMAP) using all cells and all markers in the dataset (2). The plotted MMIL probabilities were taken from the Mixture Lasso model refit on all AML patients using the optimal penalty parameter identified by cross-validation, and UMAP was run using the default parameters of the tidytof R package [11, 34].

In Figure 2d, local neighborhoods were constructed using a two-step process. First, density-dependent downsampling ([35, 34]) was used to select index cells from the full dataset such that all regions of phenotypic space are represented equally, with each index cell representing the center of a local neighborhood in high-dimensional space. All markers were used to calculated the local densities surrounding each cell in the dataset during density-dependent downsampling. This downsampling process selected 6,849 index cells that were approximately evenly dispersed throughout the high-dimensional point cloud. After index cells were selected, the percentage of cells in each local neighborhood collected from AML patient samples was calculated by finding the 100-nearest neighbors of each index cell in the original dataset and computing the proportion of those neighbors from cancer samples.

0-shot and 1-shot Lasso coefficient analysis: For model interpretation in Figure 3, we examined the non-negative coefficients of the Mixture Lasso model refit on all data using the optimal penalty parameter identified by cross-validation. We focused our analysis on positive (i.e. cancer-associated) lasso coefficients because, generally speaking, disease-associated features are more useful as positive biomarkers that can be used to diagnose or monitor disease. In Figure 3e, features were counted as disease-associated if they had a lasso coefficient of at least 0.01 in their Mixture Lasso model.

4.5 ALL dataset analysis

4.5.1 Data acquisition

De-identified bone marrow and peripheral blood primary samples from patients with ALL were obtained under informed consent from Lucile Packard Children’s Hospital and Stanford Hospital from the Pediatric Clinic of University of Milano-Bicocca (San Gerardo Hospital, Monza, Italy). The use of these samples was approved by the Institutional Review Boards at Lucile Packard Children’s Hospital, Stanford Hospital at Stanford University, and from the Pediatric Clinic of University of Milano-Bicocca. Cryopreserved primary bone marrow and peripheral blood samples were thawed rapidly in thawing media (RPMI 1640 supplemented with 10% fetal bovine serum, 1% penicillin-streptomycin, and glutamine, 20 U/mL sodium heparin, and 0.025 U/mL Benzonase. Cells were rested for 30 minutes at 37°C and cisplatin viability stained [36].

Each sample was analyzed for the expression of 29 proteins using mass cytometry as previously described. [8]. Briefly, cells were fixed with paraformaldehyde, washed in cell staining media (CSM) twice, followed by one wash in PBS and one wash in PBS+0.02% saponin. Blocking was performed with Human TruStain FcX receptor blocking solution (Biolegend, 422302). Cells underwent surface staining with the following surface markers: CD3, CD10, CD19, CD20, CD22, CD24, CD34, CD38, CD43, CD45, CD58, CD61, CD79b, CD123, CD127, CD179a, CD179b, Human Leukocyte Antigen-DR (HLA-DR), Immunoglobulin M (IgM), and Thymic Stromal Lymphopoietin Receptor (TSLPr).

After surface staining, cells were washed, permeabilized, and intracellular stained with the following markers: Intracellular Immunoglobulin M (IgMi), Marker of Proliferation Ki-67 (Ki67), lambda chain of the B-cell receptor, Paired Box Protein 5 (PAX5), Recombination Activating Gene 1 (RAG1), and Terminal deoxynucleotidyl transferase (TdT). Once intracellularly stained, samples were washed in CSM, Iridium intercalated, washed in CSM, followed by two washes in ultra-pure double-distilled water. To prepare for acquisition, cells were resuspended with normalization beads. Mass cytometry data were then acquired on a Helios (a 3rd generation CyTOF).

4.5.2 Data cleaning and preprocessing

4.5.3 Model fitting

Model specification: MMIL was applied to the ALL cohort via the Mixture Lasso, a lasso logistic regression model trained using the algorithm described in 2.1. We used a value of ( $\rho=0.75$ ) for all MMIL models, relying on the clinical knowledge that ALL patients with more than 25% blasts in their bone marrow receive the clinical diagnosis of leukemia [10]. $\zeta$ was estimated using the training set of each fold (see below).

Importantly, for the ALL dataset, only diagnostic bone marrow samples were used to train any models in order to mimic the clinical scenario in which a model is built at the time of diagnosis and used to track a patient’s leukemia burden over time.

Cross-validation and model selection: Model performance was estimated using leave-one-(patient)-out cross-validation (LOOCV), a schema in which all cells from a single ALL patient are held out as a separate test set in each fold of the cross-validation. Specifically, we break the dataset into 3 folds such that each fold includes 2 ALL patients and 3 healthy patients for model training and 1 held-out ALL patient for model evaluation. For each fold and model class, we fit the model on the training set and evaluate its cell-level performance on the held-out ALL patient according to gold-standard cell labels. The optimal lasso regularization penalty for each model class was chosen by selecting the penalty that optimized the average log-likelihood (for Mixture Lasso) or binomial deviance (for both other models, as is ordinarily done) on the held out sample across all folds. Finally, for model interpretation, we refit each model with the optimal penalty on all 3 diagnostic ALL and all 3 healthy bone marrow samples.

4.5.4 Model evaluation and interpretation

Single-cell model evaluation: To evaluate each model class’s performance at the single-cell level, we calculated the ROC curve and corresponding AUROC for each sample using gold-standard cell labels.

Although models were only trained using cells from the diagnostic timepoint, the models were evaluated for all available samples at all timepoints. Accordingly, ROC curves and AUROCs from different tissues and timepoints were calculated separately.

Model interpretation: For model interpretation in Figure 9, we examined the nonzero coefficients of the Mixture Lasso model refit on all data using the optimal penalty parameter identified by cross-validation.

5 Supplementary information

5.1 Ethical approval declarations

This research complies with all relevant ethical regulations. The use of the primary samples was approved by the Institutional Review Board at Lucile Packard Children’s Hospital at Stanford University. Healthy human bone marrow samples (n = 3) were purchased through AllCells. De-identified bone marrow samples from pediatric patients with BCP-ALL were obtained, under informed consent, from the Pediatric Clinic of University of Milano-Bicocca (Centro Maria Letizia Verga, Monza, Italy; n = 3). The use of these primary samples was approved by the Institutional Review Boards at both institutions. Written informed consent was obtained from the parents of the patients or their legal representatives, who agreed to the use of biological material for research and clinical studies. To protect patients’ privacy, samples have been de-identified.

5.2 Acknowledgments

EC is supported by the Stanford Data Science Scholars program and a Stanford Graduate Fellowship. MZ is supported by the Stanford Bio-X Bowes Graduate Student Fellowship. TH was partially supported by grants DMS-2013736 from the National Science Foundation, and grant R01GM134483 from the National Institutes of Health. RT is supported by the National Institutes of Health(5R01EB001988-16) and the National Science Foundation(19DMS1208164). TJK is supported by the National Institutes of Health [grant number 1F31CA239365-01], the Mark Foundation for Cancer Research, the Andrew McDonough B+ (Be Positive) Foundation, and a Point Foundation Graduate Student Scholarship. JS is supported by Associazione Italiana per la Ricerca sul Cancro (AIRC; Start-UP grant no. 27325). GPN is supported by the National Institutes of Health [grant numbers U54CA209971, U54HG010426, and U19AI100627]. KLD is supported by the National Institutes of Health [grant number R01 CA251858-01A1], the Mark Foundation for Cancer Research, the Andrew McDonough B+ (Be Positive) Foundation, the Oxnard Foundation, and the Stanford Maternal and Child Health Research Institute.

5.3 Availability of data and materials

Acute Myeloid Leukemia (AML) mass cytometry data are availability on FlowRespository, with Accession ID FR-FCM-Z2E7.

Acute Lymphoblastic Leukemia (ALL) mass cytometry data are available on Dryad, with DOI 10.5061/dryad.8gtht76vw.

5.4 Code availability

An R package to train mixture classifiers can be found on Github in the repo [XXXXXXXX].

Appendix A Supplemental Figures

Appendix B Mixture modeling uses expectation maximization

Our approach is an application of expectation maximization [1] to maximize the observed data likelihood, $\prod_{i=1}^{n}P(z_{i}\mid x_{i})$ . The EM procedure is as follows:

1.

Expectation step: At step $t$ , use $\eta^{(t-1)}(x)$ to compute $y^{(t)}=P_{\eta^{(t-1)}}(y\mid x,z)$ , our current estimates for the unobserved $y$ . For cells with $z=0$ , use the known label $y^{(t)}=0$ .
2.

Maximization step: Find $\eta^{(t)}$ that maximizes the full likelihood, $\prod_{i=1}^{n}P(z_{i},y_{i}^{(t)}\mid x_{i})$ . Adjust $\eta^{(t)}$ to account for biased sampling.

In this section, we will first address the expectation step: we will show how to compute $P(y\mid x,z)$ . Then, we will share the details for the maximization step. We will show the intercept adjustment to account for biased sampling, and then use this to write our objective, the observed data likelihood. Finally, we will show that the binomial log likelihood with respect to $y$ is proportional to the full likelihood, and so the maximization step can be achieved by solving a typical binary classification problem.

B.1 Expectation step: estimating labels

At the $t^{\text{th}}$ step of our algorithm, we update our estimate of the unknown labels $y$ as $P_{\eta^{(t-1)}}(y\mid x,z=1)$ , where

\text{logit }P_{\eta^{(t-1)}}(y\mid x,z=1)=\eta^{(t-1)}(x)-\log\left(\frac{% \rho\zeta}{1-(1-\rho)\zeta}\right).

(5)

We note that $\log\left(\frac{\rho\zeta}{1-(1-\rho)\zeta}\right)<0$ . So, this intercept adjustment raises the baseline probability that a cell is diseased, and this makes sense: knowing a cell came from a sick person should make us think the cell is more likely to be disease associated.

Now, we show our derivation of this label update, and use our assumption that ${P(z\mid y,x)=P(z\mid y)}$ . We begin by applying Bayes’ rule:

$\displaystyle P_{\eta^{(t-1)}}(y\mid x,z=1)$	$\displaystyle=\frac{P(z=1\mid y=1)P_{\eta^{(t-1)}}(y=1\mid x)}{P_{\eta^{(t-1)}% }(z=1\mid x)}$	(6)
	$\displaystyle=\frac{P_{\eta^{(t-1)}}(y=1\mid x)}{P_{\eta^{(t-1)}}(z=1\mid x)}$	(7)
	$\displaystyle=\frac{P_{\eta^{(t-1)}}(y=1\mid x)}{P(z=1\mid y=1)P_{\eta^{(t-1)}% }(y=1\mid x)+P(z=1\mid y=0)P_{\eta^{(t-1)}}(y=0\mid x)}$	(8)
	$\displaystyle=\frac{P_{\eta^{(t-1)}}(y=1\mid x)}{P_{\eta^{(t-1)}}(y=1\mid x)+P% (z=1\mid y=0)P_{\eta^{(t-1)}}(y=0\mid x)}$	(9)

In the denominator, we used the identity $P(z=1\mid y=1)=1$ . We can compute ${P(z=1\mid y=0)}$ :

P(z=1\mid y=0)=\frac{P(y=0\mid z=1)P(z=1)}{P(y=0)}=\frac{\rho\zeta}{1-(1-\rho)% \zeta}.

(10)

Now, we can plug this back in to our expression:

	$\displaystyle P_{\eta^{(t-1)}}(y\mid x,z=1)$	$\displaystyle=\frac{P_{\eta^{(t-1)}}(y=1\mid x)}{P_{\eta^{(t-1)}}(y=1\mid x)+% \frac{\rho\zeta}{1-(1-\rho)\zeta}P_{\eta^{(t-1)}}(y=0\mid x)}$		(11)
		$\displaystyle=\frac{e^{\eta^{(t-1)}(x)}}{e^{\eta^{(t-1)}(x)}+\frac{\rho\zeta}{% 1-(1-\rho)\zeta}}.$		(12)

With manipulation, this becomes:

\text{logit }P_{\eta^{(t-1)}}(y\mid x,z=1)=\eta^{(t-1)}(x)-\log\left(\frac{% \rho\zeta}{1-(1-\rho)\zeta}\right).

(13)

B.2 Maximization step: adjustment for biased sampling

Relative to the general population, we may have a biased sample of healthy and sick people. To address this, we can update the intercept of our model using the following adjustment:

\text{logit }P(y=1\mid x,s=1)=\eta(x)+\log\left(\frac{(1-\rho)n_{1}}{n-(1-\rho% )n_{1}}\right)-\log\left(\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right),

(14)

where $s=1$ indicates our sample, and $n_{1}$ and $n$ are the number of cells in our sample with label $z=1$ and the number of cells total, respectively.

Proof: We begin using Bayes rule, and we use the assumption that $P(s\mid y,x)=P(s\mid y)$ :

$\displaystyle P(y=1\mid x,s=1)$	$\displaystyle=\frac{P(s=1\mid y=1)P(y=1\mid x)}{P(s=1\mid y=1)P(y=1\mid x)+P(s% =1\mid y=0)P(y=0\mid x)}$	(15)
	$\displaystyle=\frac{\frac{P(s=1\mid y=1)}{P(s=1\mid y=0)}e^{\eta(x)}}{\frac{P(% s=1\mid y=1)}{P(s=1\mid y=0)}e^{\eta(x)}+1}$	(16)
	$\displaystyle=\frac{e^{\eta^{\star}(x)}}{e^{\eta^{\star}(x)}+1},$	(17)

where $\eta^{\star}(x)=\eta(x)+\log\left(\frac{P(s=1\mid y=1)}{P(s=1\mid y=0)}\right)$ . Now, we simplify $\log\left(\frac{P(s=1\mid y=1)}{P(s=1\mid y=0)}\right)$ :

	$\displaystyle\log\left(\frac{P(s=1\mid y=1)}{P(s=1\mid y=0)}\right)$	$\displaystyle=\log\left(\frac{\frac{P(y=1\mid s=1)P(s=1)}{P(y=1)}}{\frac{P(y=0% \mid s=1)P(s=1)}{P(y=0)}}\right)$		(18)
		$\displaystyle=\log\left(\frac{P(y=1\mid s=1)}{P(y=0\mid s=1)}\right)-\log\left% (\frac{P(y=1)}{P(y=0)}\right).$		(19)

We write out each of these probabilities. We show the full derivation for the first expression; the arguments for the the remainder are similar.

$\displaystyle P(y=1\mid s=1)$	$\displaystyle=P(y=1\mid z=1,s=1)P(z=1\mid s=1)+P(y=1\mid z=0,s=1)P(z=0\mid s=1)$	(20)
	$\displaystyle=P(y=1\mid z=1)P(z=1\mid s=1)$	(21)
	$\displaystyle=\frac{(1-\rho)n_{1}}{n}$	(22)
$\displaystyle P(y=0\mid s=1)$	$\displaystyle=\frac{n-(1-\rho)n_{1}}{n}$	(23)
$\displaystyle P(y=1)$	$\displaystyle=(1-\rho)\zeta$	(24)
$\displaystyle P(y=0)$	$\displaystyle=1-(1-\rho)\zeta.$	(25)

Now, we substitute back in:

\displaystyle\log\left(\frac{P(y=1\mid s=1)}{P(y=1\mid s=0)}\right)-\log\left(% \frac{P(y=1)}{P(y=0)}\right)

\displaystyle=\log\left(\frac{(1-\rho)n_{1}}{n-(1-\rho)n_{1}}\right)-\log\left% (\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right).

(26)

Therefore, $\eta^{\star}(x)=\eta(x)+\log\left(\frac{(1-\rho)n_{1}}{n-(1-\rho)n_{1}}\right)% -\log\left(\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right)$ .

B.3 Maximization step: the observed likelihood

The observed likelihood is given by:

	$\displaystyle L(\eta\mid z,X)$	$\displaystyle=\prod_{i=1}^{n}P(z=z_{i}\mid x_{i},s=1)$		(27)
		$\displaystyle=\prod_{i=1}^{n}\left(\frac{e^{\eta^{\star}(x)}+\frac{\rho n_{1}}% {n-n_{1}(1-\rho)}}{e^{\eta^{\star}(x)}+1}\right)^{z_{i}}\left(\frac{1-\frac{% \rho n_{1}}{n-n_{1}(1-\rho)}}{e^{\eta^{\star}(x)}+1}\right)^{1-z_{i}},$		(28)

where $\eta^{\star}(x)=\text{logit }P(y=1\mid x,s=1)=\eta(x)+\log\left(\frac{(1-\rho)% n_{1}}{n-(1-\rho)n_{1}}\right)-\log\left(\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right)$ , as in Equation 14.

Proof: Consider just one instance, $(x,z)$ . We show how to compute ${P(z=1\mid x,s=1)}$ , beginning by using the law of total probability and assuming that ${P(z\mid y,s=1)=P(z\mid y)}$ :

P(z=1\mid x,s=1)=P(z=1\mid y=1)P(y=1\mid x,s=1)+P(z=1\mid y=0)P(y=0\mid x,s=1)

(29)

Because all diseased cells come from sick people, $P(z=1\mid y=1)=1$ . To compute ${P(z=1\mid y=0)}$ , we will use Bayes’ rule. Recall that we assume $P(y=0\mid z=1)=\rho$ .

	$\displaystyle P(z=1\mid y=0,s=1)$	$\displaystyle=\frac{P(y=0\mid z=1)P(z=1\mid s=1)}{P(y=0\mid s=1)}$		(30)
		$\displaystyle=\frac{\rho\frac{n_{1}}{n}}{1-(1-\rho)\frac{n_{1}}{n}}=\frac{\rho n% _{1}}{n-(1-\rho)n_{1}}.$		(31)

Now, Expression 29 becomes:

P(z=1\mid x,s=1)=P(y=1\mid x,s=1)+\frac{\rho n_{1}}{n-(1-\rho)n_{1}}P(y=0\mid x% ,s=1).

(32)

Finally, we use Equation 14: $P(y=1\mid x,s=1)=\frac{e^{\eta^{\star}(x)}}{1+e^{\eta^{\star}(x)}}$ , where ${\eta^{\star}=\eta(x)+\log\left(\frac{(1-\rho)n_{1}}{n-(1-\rho)n_{1}}\right)-% \log\left(\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right)}$ . The contribution of one instance to the likelihood is then:

	$\displaystyle P(z=1\mid x,s=1)$	$\displaystyle=\frac{e^{\eta^{\star}(x)}}{1+e^{\eta^{\star}(x)}}+\left(\frac{% \rho n_{1}}{n-(1-\rho)n_{1}}\right)\frac{1}{1+e^{\eta^{\star}(x)}}$		(33)
		$\displaystyle=\frac{e^{\eta^{\star}(x)}+\frac{\rho n_{1}}{n-(1-\rho)n_{1}}}{1+% e^{\eta^{\star}(x)}},$		(34)

as desired, and $P(z=0\mid x,s=1)=1-P(z=1\mid x,s=1)$ .

B.4 Maximization step: the full likelihood

In the maximization step of the EM, we need to maximize the full likelihood using the current expected values of the unknown labels $y$ . We will show that the full likelihood $P(y,z\mid X)\propto P(y\mid X)$ ; this is what allows us to optimize the usual binomial log likelihood at every step of the EM. Suppose we know both $y$ and $z$ . Then, the full likelihood is:

$\displaystyle L(\eta\mid z,y,X)$	$\displaystyle=\prod_{i}P(y=y_{i},z=z_{i}\mid s_{i}=1,x_{i})$	(35)
	$\displaystyle=\prod_{i}P(z=z_{i},\mid y=y_{i},s_{i}=1,x_{i})P(y=y_{i}\mid s_{i% }=1,x_{i})$	(36)
	$\displaystyle=\prod_{i}P(z=z_{i},\mid y=y_{i},s_{i}=1)P(y=y_{i}\mid s_{i}=1,x_% {i})$	(37)
	$\displaystyle\propto\prod_{i}P(y=y_{i}\mid s_{i}=1,x_{i}).$	(38)

To go from Expression 36 to Expression 37, we use our assumption that $P(z\mid y,s=1,x)=P(z\mid y,s=1)$ : there is no systematic difference between cells from healthy and sick people, other than the fact that some cells from sick people are diseased. Now, we show that $P(z=z_{i}\mid y=y_{i},s=1)$ is a constant, based on our assumptions of the values of $P(z=1)$ and ${P(y=y_{i}\mid z=z_{i})}$ . There are three cases:

1.

$P(z=1\mid y=1,s=1)=1$
2.

$P(z=1\mid y=0,s=1)=\frac{\rho n_{1}}{n-n_{1}(1-\rho)}$ (as in Equation 31)
3.

$P(z=0\mid y=0,s=1)=\frac{n-n_{1}}{n-n_{1}(1-\rho)}$ .

We note that the fourth case, $P(z=0\mid y=1,s=1)$ , is not possible under our assumption that healthy people have no diseased cells. Therefore, to find $\eta$ that maximizes the full likelihood, we only need to find $\eta$ that maximizes the binomial log likelihood.

Appendix C Modification for learning algorithms that do not allow soft labels

For learning algorithms that do not allow “soft” labels, do the following. This is as described in the body of this text.

1.
Augment your dataset.
1. (a)
  
  Make a new covariate matrix $\tilde{X}$ containing one copy of cells from healthy people, and two copies of cells from sick people.
2. (b)
  
  Make a cell label vector $\tilde{y}$ that is $0$ for cells from healthy people. One copy of each cell from sick people should be labeled $1$ and the other $0$ .
3. (c)
  
  Make a weight label vector $w^{(0)}$ that is $1$ for cells from healthy people, $\rho$ for cells labeled $0$ from sick people, and $1-\rho$ for cells labeled $1$ from sick people.

Iterate to convergence. At the $i^{\text{th}}$ step:

(a)

Maximization step: Fit a model $\eta^{\star(i)}(x)$ using $X$ , $y$ and $w^{(i-1)}$ . Make the intercept adjustment to obtain

\eta^{(i)}(x)=\eta^{\star(i)}(x)+\log\left(\frac{(1-\rho)n_{1}}{n-(1-\rho)n_{1% }}\right)-\log\left(\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right).

(b)

Expectation step: Use $\eta^{(i)}(x)$ to define $w^{(i)}$ . Cells from healthy people have $w^{(i)}=1$ . Cell $j$ with label $y=1$ from a sick person has weight:

\text{logit }w^{(i)}_{j}=\eta^{(i)}(x_{j})-\log\left(\frac{\rho\zeta}{1-(1-% \rho)\zeta}\right).

The weight for the corresponding row with label $y=0$ has weight $1-w^{(i)}_{j}$ .

References

Dempster et al. [1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society: series B (methodological), 39(1):1–22, 1977.
Dworzak et al. [1999] Michael N Dworzak, Frank Stolz, Gertraud Fröschl, Dieter Printz, Traudl Henn, Susanna Fischer, Christine Fleischer, Oskar A Haas, Gerhard Fritsch, Helmut Gadner, et al. Detection of residual disease in pediatric b-cell precursor acute lymphoblastic leukemia by comparative phenotype mapping: a study of five cases controlled by genetic methods. Experimental hematology, 27(4):673–681, 1999.
Kruse et al. [2020] A. Kruse, N. Abdel-Azim, H. N. Kim, Y. Ruan, V. Phan, H. Ogana, W. Wang, R. Lee, E. J. Gang, S. Khazal, and Y. M. Kim. Minimal Residual Disease Detection in Acute Lymphoblastic Leukemia. Int J Mol Sci, 21(3), Feb 2020.
Chen and Wood [2017] Xueyan Chen and Brent L. Wood. Monitoring minimal residual disease in acute leukemia: Technical challenges and interpretive complexities. Blood Reviews, 31(2):63–75, 2017. ISSN 0268-960X. doi: 10.1016/j.blre.2016.09.006. URL https://doi.org/10.1016/j.blre.2016.09.006.
Tsai et al. [2020] A. G. Tsai, D. R. Glass, M. Juntilla, F. J. Hartmann, J. S. Oak, S. Fernandez-Pol, R. S. Ohgami, and S. C. Bendall. Multiplexed single-cell morphometry for hematopathology diagnostics. Nature Medicine, 26(3):408–417, Mar 2020. doi: 10.1038/s41591-020-0783-x.
Hodes et al. [2019] Aaron Hodes, Katherine R. Calvo, Alina Dulau, Irina Maric, Junfeng Sun, and Raul Braylan. The challenging task of enumerating blasts in the bone marrow. Seminars in Hematology, 56(1):58–64, 2019. ISSN 0037-1963. doi: https://doi.org/10.1053/j.seminhematol.2018.07.001. URL https://www.sciencedirect.com/science/article/pii/S0037196318300568.
Good et al. [2018] Z. Good, J. Sarno, A. Jager, N. Samusik, N. Aghaeepour, E. F. Simonds, L. White, N. J. Lacayo, W. J. Fantl, G. Fazio, G. Gaipa, A. Biondi, R. Tibshirani, S. C. Bendall, G. P. Nolan, and K. L. Davis. Single-cell developmental classification of B cell precursor acute lymphoblastic leukemia at diagnosis reveals predictors of relapse. Nat Med, 24(4):474–483, May 2018.
Jager et al. [2021] A. Jager, J. Sarno, and K. L. Davis. Mass Cytometry of Hematopoietic Cells. Methods Mol Biol, 2185:65–76, 2021.
Coustan-Smith et al. [2002] E. Coustan-Smith, J. Sancho, M. L. Hancock, B. I. Razzouk, R. C. Ribeiro, G. K. Rivera, J. E. Rubnitz, J. T. Sandlund, C. H. Pui, and D. Campana. Use of peripheral blood instead of bone marrow to monitor residual disease in children with acute lymphoblastic leukemia. Blood, 100(7):2399–2402, Oct 2002.
van der Does-van den Berg et al. [1992] A. van der Does-van den Berg, C. R. Bartram, G. Basso, Y. C. Benoit, A. Biondi, K. M. Debatin, O. A. Haas, J. Harbott, W. A. Kamps, and U. ller. Minimal requirements for the diagnosis, classification, and evaluation of the treatment of childhood acute lymphoblastic leukemia (ALL) in the ”BFM Family” Cooperative Group. Med Pediatr Oncol, 20(6):497–505, 1992.
McInnes et al. [2018] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426, 2018. URL http://arxiv.org/abs/1802.03426.
Yakimovich et al. [2021] A. Yakimovich, A. Beaugnon, Y. Huang, and E. Ozkirimli. Labels in a haystack: Approaches beyond supervised learning in biomedical applications. Patterns (N Y), 2(12):100383, Dec 2021. doi: 10.1016/j.patter.2021.100383.
Northcutt et al. [2020] C. G. Northcutt, A. Athalye, and J. Lin. Pervasive label errors in ml benchmark test sets, consequences, and benefits. NeurIPS 2020 Workshop on Security and Data Curation Workshop, 2020.
Bernhardt et al. [2022] M. Bernhardt, D.C. Castro, and R. et al. Tanno. Active label cleaning for improved dataset quality under resource constraints. Nature Communications, 13:1161, 2022. doi: 10.1038/s41467-022-28818-3. URL https://doi.org/10.1038/s41467-022-28818-3.
Sarno et al. [2023] J. Sarno, P. Domizi, Y. Liu, M. Merchant, C. B. Pedersen, D. Jedoui, A. Jager, G. P. Nolan, G. Gaipa, S. C. Bendall, F. A. Bava, and K. L. Davis. Dasatinib overcomes glucocorticoid resistance in B-cell acute lymphoblastic leukemia. Nat Commun, 14(1):2935, May 2023.
van Dongen et al. [2012] J. J. van Dongen, L. Lhermitte, S. ttcher, J. Almeida, V. H. van der Velden, J. Flores-Montero, A. Rawstron, V. Asnafi, Q. crevisse, P. Lucio, E. Mejstrikova, T. ski, T. Kalina, R. de Tute, M. ggemann, L. Sedek, M. Cullen, A. W. Langerak, A. a, E. Macintyre, M. Martin-Ayuso, O. Hrusak, M. B. Vidriales, and A. Orfao. EuroFlow antibody panels for standardized n-dimensional flow cytometric immunophenotyping of normal, reactive and malignant leukocytes. Leukemia, 26(9):1908–1975, Sep 2012.
Pui et al. [2003] C. H. Pui, J. M. Chessells, B. Camitta, A. Baruchel, A. Biondi, J. M. Boyett, A. Carroll, O. B. Eden, W. E. Evans, H. Gadner, J. Harbott, D. O. Harms, C. J. Harrison, P. L. Harrison, N. Heerema, G. Janka-Schaub, W. Kamps, G. Masera, J. Pullen, S. C. Raimondi, S. Richards, H. Riehm, S. Sallan, H. Sather, J. Shuster, L. B. Silverman, M. G. Valsecchi, E. Vilmer, Y. Zhou, P. S. Gaynon, and M. Schrappe. Clinical heterogeneity in childhood acute lymphoblastic leukemia with 11q23 rearrangements. Leukemia, 17(4):700–706, Apr 2003.
Veltroni et al. [2003] M. Veltroni, L. De Zen, M. C. Sanzari, O. Maglia, M. N. Dworzak, R. Ratei, A. Biondi, G. Basso, and G. Gaipa. Expression of CD58 in normal, regenerating and leukemic bone marrow B cells: implications for the detection of minimal residual disease in acute lymphocytic leukemia. Haematologica, 88(11):1245–1252, Nov 2003.
Borowitz et al. [2005] M. J. Borowitz, D. J. Pullen, N. Winick, P. L. Martin, W. P. Bowman, and B. Camitta. Comparison of diagnostic and relapse flow cytometry phenotypes in childhood acute lymphoblastic leukemia: implications for residual disease detection: a report from the children’s oncology group. Cytometry B Clin Cytom, 68(1):18–24, Nov 2005.
Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
Amores [2013] Jaume Amores. Multiple instance classification: Review, taxonomy and comparative study. Artificial intelligence, 201:81–105, 2013.
Andrews et al. [2002] Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multiple-instance learning. Advances in neural information processing systems, 15, 2002.
Chen et al. [2017] Ping-Yang Chen, Ching-Chuan Chen, Chun-Hao Yang, Sheng-Mao Chang, and Kuo-Jung Lee. milr: Multiple-instance logistic regression with lasso penalty. R J., 9(1):446, 2017.
Gadermayr and Tschuchnig [2022] Michael Gadermayr and Maximilian Tschuchnig. Multiple instance learning for digital pathology: A review on the state-of-the-art, limitations & future potential. arXiv preprint arXiv:2206.04425, 2022.
Bahdanau et al. [2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
Ward et al. [2009] Gill Ward, Trevor Hastie, Simon Barry, Jane Elith, and John R Leathwick. Presence-only data and the em algorithm. Biometrics, 65(2):554–563, 2009.
Ray and Craven [2005] Soumya Ray and Mark Craven. Supervised versus multiple instance learning: An empirical comparison. Proceedings of the 22nd international conference on Machine learning, pages 697–704, 2005.
Carbonneau et al. [2018] Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329–353, 2018.
Muhlenbach et al. [2004] Fabrice Muhlenbach, Stéphane Lallich, and Djamel A Zighed. Identifying and handling mislabelled instances. Journal of Intelligent Information Systems, 22(1):89–109, 2004.
Natarajan et al. [2013] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. Advances in neural information processing systems, 26, 2013.
Song et al. [2022] Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
Kent and Liou [2022] Sean Kent and Yifei Liou. mildsvm: Multiple-Instance Learning with Support Vector Machines, 2022. URL https://CRAN.R-project.org/package=mildsvm. R package version 0.4.0.
Hastie and Fithian [2013] Trevor Hastie and Will Fithian. Inference from presence-only data; the ongoing controversy. Ecography, 36(8):864–867, 2013.
Keyes et al. [2023] T. J. Keyes, A. Koladiya, Y.-C. Lo, G. P. Nolan, and K. L. Davis. tidytof: a user-friendly framework for scalable and reproducible high-dimensional cytometry data analysis. Bioinform Adv, 3:vbad071, 2023.
Qiu et al. [2011] P. Qiu, E. F. Simonds, S. C. Bendall, K. D. Gibbs, R. V Bruggner, M. D. Linderman, and S. K. Plevritis. Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE. Nature Biotechnology, 29(10):886–891, 2011. doi: 10.1038/nbt.1991.
Fienberg et al. [2012] H. G. Fienberg, E. F. Simonds, W. J. Fantl, G. P. Nolan, and B. Bodenmiller. A platinum-based covalent viability reagent for single-cell mass cytometry. Cytometry A, 81(6):467–475, Jun 2012.