MMIL: A novel algorithm for disease associated cell type discovery

Erin Craig∗1    Timothy Keyes∗1    Jolanda Sarno3,6    Maxim Zaslavsky4    Garry Nolan5    Kara Davis6    Trevor Hastie1,2    Robert Tibshirani1,2
(1Department of Biomedical Data Science, Stanford University
2Department of Statistics, Stanford University
3M. Tettamanti Research Center, Pediatric Clinic, University of Milano Bicocca
4Department of Computer Science, Stanford University
5Department of Pathology, Stanford University
6Department of Pediatrics, Stanford University
These authors contributed equally to this work.)
Abstract

Single-cell datasets often lack individual cell labels, making it challenging to identify cells associated with disease. To address this, we introduce Mixture Modeling for Multiple Instance Learning (MMIL), an expectation maximization method that enables the training and calibration of cell-level classifiers using patient-level labels. Our approach can be used to train e.g. lasso logistic regression models, gradient boosted trees, and neural networks. When applied to clinically-annotated, primary patient samples in Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL), our method accurately identifies cancer cells, generalizes across tissues and treatment timepoints, and selects biologically relevant features. In addition, MMIL is capable of incorporating cell labels into model training when they are known, providing a powerful framework for leveraging both labeled and unlabeled data simultaneously. Mixture Modeling for MIL offers a novel approach for cell classification, with significant potential to advance disease understanding and management, especially in scenarios with unknown gold-standard labels and high dimensionality.

1 Introduction

In the biomedical field, we often know whether or not a person is sick. However, we do not necessarily know which cells in a sick person’s body–which contains a combination of diseased and healthy cells–play a role in their illness. Identifying disease-associated cell populations is a common goal in translational studies of single-cell datasets, with applications in systems biology, disease monitoring, diagnostics, and the discovery of targetable cell populations for novel therapeutics. Nevertheless, classifying cells as sick or healthy can be difficult: obtaining gold standard labels can be prohibitively challenging or expensive, and it is not obvious how to predict labels without labeled training data. This is particularly true in the context of complex diseases for which true disease-associated cell populations remain unidentified, making gold-standard cell annotation impossible.

Without cell labels to train a classifier, researchers often turn to methods from multiple instance learning (MIL), which are designed for data consisting of many unlabeled cells from each labeled patient. However, the primary aim of most MIL methods is to predict new patient labels rather than cell labels. Here, we present Mixture Modeling for Multiple Instance Learning (MMIL), a method that, armed only with patient labels, uses expectation maximization (EM) [1] to train a classifier to label individual cells as disease-associated or non-disease-associated. Our approach alternately estimates cell labels and trains a classifier; at each iteration, our classifier and cell label estimates improve. MMIL is versatile and can be easily implemented as a wrapper around any classifier trained by optimizing the binomial log likelihood; for example, lasso logistic regression models, gradient boosted trees and neural networks. MMIL can also be used to improve the model calibration of any cell-level model — calibration is important because it encourages the alignment of estimated probabilities with outcomes such that e.g. 90% of the cells with a 0.90.90.90.9 probability are truly disease-associated. MMIL’s calibration step can be used as a post-processing step for any cell-level model, making predictions more reliable and understandable.

We then show that MMIL identifies cancerous cells in leukemia (cancer of the blood and bone marrow). This is an important task: the early and precise identification of malignant cells is pivotal for determining treatment options and monitoring disease progression. Leukemia cells (often called “leukemic blasts”) are challenging to identify due to phenotypic characteristics resembling the healthy cells from which they originate [2]. For this reason, identifying treatment-resistant cells–referred to as “minimal residual disease” (MRD) in clinical contexts–is challenging and labor-intensive, even among highly trained clinicians using state-of-the-art blood profiling technologies [3, 4].

We use MMIL to train lasso logistic regression models on data collected from patients with either Acute Myeloid Leukemia (AML) or Acute Lymphoblastic Leukemia (ALL) as well as healthy controls. In both diseases, we demonstrate MMIL’s ability to (1) identify cancer cells accurately; (2) generalize to cells sampled from different patients, different timepoints after initiation of treatment, or different tissues than those on which the models were originally trained; and (3) select single-cell features known to be associated with leukemia. Our applications in leukemia diagnosis and treatment monitoring reveal that MMIL has broad implications in the field of single-cell analysis. Because MMIL is a direct, flexible method for identifying disease-associated cells using only patient labels, it has the potential to reveal novel biological insights and support patient monitoring, treatment and diagnosis for poorly understood diseases.

2 Results

2.1 An algorithm to classify cells without complete cell labels

We wish to train a classifier to predict whether a given cell is healthy or diseased. Typically, we would (1) gather a dataset of cells where we know which cells are diseased and which are healthy; and (2) train a model to predict cell disease status (using e.g. logistic regression, random forest, neural network). This workflow relies on knowing which cells are diseased and which are not. In our setting we have no such luxury: we do not know which cells sampled from sick patients are diseased, we only know that cells sampled from healthy people are healthy. MMIL is an algorithm that trains classifiers in this setting.

MMIL is an iterative process that relies on the following intuition. If we had a trained classifier, we could use it to predict which cells are healthy and which are not. And if we had a prediction of which cells were healthy and which were not, then we could use these predicted labels to train a classifier. MMIL fits a classifier by alternating between these two steps, and is described in Algorithm 1 (and in more detail in Appendix B).

Refer to caption
Figure 1: Mixture Modeling for Multiple Instance Learning (MMIL) detects cancer cells in Acute Myeloid Leukemia (AML) using patient labels only (A) Process to train a mixture model for multiple instance learning data. We initialize the sick person’s cell labels as 0.50.50.50.5: in this example, we assume that the prevalence of diseased cells in sick people is 50%percent5050\%50 %, and so each cell has a 50/50 chance of being diseased. After training the first classifier, we improve our estimates of the sick person’s cell labels. This process is repeated until convergence. (B) Schematic of model training and evaluation on the AML cohort from [5]. (C) Nonzero coefficients for the Mixture Lasso model trained to detect leukemic blasts in AML. (D) Nonzero coefficients for the ”Optimal” Lasso model trained to detect leukemic blasts in AML. (E) Receiver-Operator Characteristic (ROC) curves demonstrating individual (thin) and average (thick) performance of Optimal, Mixture, and Naive lasso models trained to detect leukemic blasts in AML. Insets indicate mean area under the ROC curve (AUC) across all patients. (F) Scatterplots representing the relationship between the gold-standard, pathologist-enumerated blast percentage for each patient (X-axis) and the model-assigned blast percentage for each patient for the Optimal (red), Mixture (blue), and Naive (yellow) lasso models. Inset text represents the Pearson correlation coefficients between the values on the X- and Y-axes.
Algorithm 1 MMIL: Mixture Models for Multiple Instance Learning
  1. 1.

    Using domain knowledge, choose a value for 1ρ1𝜌1-\rho1 - italic_ρ, the probability that a cell from a sick person is diseased.

  2. 2.

    Create a dataset with:

    • One copy of each cell from healthy people, assigned the label y=0𝑦0y=0italic_y = 0 (healthy) and weight w=1𝑤1w=1italic_w = 1.

    • Two copies of each cell from sick people, one labeled y=0𝑦0y=0italic_y = 0 (healthy) with weight w=ρ𝑤𝜌w=\rhoitalic_w = italic_ρ and the other y=1𝑦1y=1italic_y = 1 (diseased) with weight w=1ρ𝑤1𝜌w=1-\rhoitalic_w = 1 - italic_ρ.

    The weights reflect uncertainty. We are confident that healthy patients’ cells are healthy. But we do not know whether sick patients’ cells are diseased or healthy – the best we can do is assume that each cell is diseased with probability 1ρ1𝜌1-\rho1 - italic_ρ.

  3. 3.

    Alternate:

    1. (a)

      Train a classifier using the augmented dataset.

    2. (b)

      Use the trained classifier to predict whether each sick patient’s cell is diseased or healthy. Update their weights using these predictions. (See Methods for details.)

    After training, the classifier improves our guess for whether each sick person’s cell is diseased or healthy. Then, the new guesses are used to train a new classifier.

  4. 4.

    Stop alternating when the model predictions stop changing.

2.2 Mixture modeling identifies cancer cells in Acute Myeloid Leukemia using only unlabeled cells

We now consider a publicly available dataset of 800 thousand cells collected from 13 Acute Myeloid Leukemia (AML) patients and three healthy bone marrow controls [5] (see Methods). Each AML sample is a mixture of cancer cells and healthy cells, and this cohort of patients represents a wide range of leukemic blast counts. Typically, a highly-trained hematopathologist must analyze a blood or bone marrow specimen to diagnose and track the progression of disease over time. Hematopathologists do this by manually enumerating the proportion of leukemic blasts in the sample based on each cell’s size, shape, and expression of leukemia-associated markers measured by microscopy or flow cytometry. Particularly in the context of myeloid leukemias, blast enumeration is challenging and error-prone even among expert pathologists [5, 6].

Using MMIL, we sought to automate the identification of cells specific to AML without expert cell-level annotation. Importantly, each cell in the AML dataset has a gold-standard annotation indicating whether it is a leukemic blast as evaluated by a board-certified hematopathologist. These annotations allow us to evaluate MMIL’s performance in identifying leukemic blasts compared to expert clinical judgement.

Each cell was analyzed for the presence of 32 surface proteins and 11 intracellular proteins [5] using mass cytometry (CyTOF), a high-dimensional cytometry platform similar to clinical cytometers commonly used to analyze leukemic tissue specimens in clinical laboratories [7, 8]. However, CyTOF analysis allows a high-dimensional, single-cell characterization of each sample, extending beyond the capabilities of conventional multicolor flow cytometers and making it ideal for deep phenotyping of leukemic cells in a research environment [9].

Using this dataset, we trained 3 models: (1) the “optimal” model, a lasso logistic regression model trained using gold-standard annotations (usually unknown), (2) the ”naive” model, a lasso model trained using patient labels instead of cell labels, and (3) Mixture Lasso, a lasso model trained with MMIL. When training Mixture Lasso, we assumed that 75%percent7575\%75 % of cells in cancer patients are healthy, based on the clinical guideline that patients with at least 25%percent2525\%25 % of blasts in their bone marrow receive a leukemia diagnosis[10]. We estimated the performance of each of these models using leave-one-(patient)-out cross-validation (LOOCV). In each fold of our cross-validation procedure, models were trained using all 3 healthy control marrows and 12 of the 13 AML marrows; then, the performance of each model was evaluated by comparing predictions on the held-out patient’s cells to the pathologist’s gold-standard labels (Figure 1b).

At the cell-level, Mixture Lasso achieved a mean area under the receiver-operator characteristic curve (AUROC) of 0.751 across all held-out patients during the cross-validation. By comparison, the naive model achieved worse performance with an AUROC of 0.658, and the optimal model achieved superior performance with an AUROC of 0.945 (Figure 1e). At the patient-level, when a probability threshold of 0.5 was used to classify cells as leukemic blasts (see Methods), the blast percentages that Mixture Lasso assigned to each patient in the dataset correlated strongly with pathologist-assigned blast percentages (Pearson ρ𝜌\rhoitalic_ρ = 0.88 compared to the optimal model’s ρ𝜌\rhoitalic_ρ = 0.95). The naive model’s blast percentage assignments correlated less with the pathologist gold standard (Pearson ρ𝜌\rhoitalic_ρ = 0.65).

In their original study, Tsai et al. identified two sets of markers that could be used to distinguish between different stages of myeloid development in healthy and AML cell populations. Among these were the cytoskeletal proteins β𝛽\betaitalic_β actin, Lamin A/C, and Lamin B1; the granule-associated proteins SerpinB1, VAMP-7, ribosomal RNA (rRNA), Lactoferrin, and lysozyme; CD45; and the cell size marker wheat germ agglutinin lectin (WGA). Two of the most important markers for the pathologists’ AML blast enumeration were rRNA (highly expressed in AML blasts due to their rapid growth) and Lactoferrin (a granule protein only expressed in mature myeloid cells that remains lowly expressed in undifferentiated AML blasts). Figures 1b-c show that Lactoferrin, Lamin A/C, CD45, and rRNA were selected by both the Mixture Lasso and optimal lasso models, whereas SerpinB1 and WGA were uniquely selected by Mixture Lasso. However, Lamin B1 and β𝛽\betaitalic_β actin were uniquely selected by the optimal lasso model, and only the optimal model selected rRNA with a positive coefficient. Thus, Mixture Lasso closely follows, but does not entirely reproduce the pathologist’s decision-making process, suggesting an ability to augment (not just replicate) existing clinical decision processes.

To visualize Mixture Lasso’s predictions, we constructed several uniform manifold approximation and projection (UMAP) [11] plots using all cells and markers in the dataset (Figure 2a-e). We find that cells close to one another in UMAP space tend to have similar Mixture Lasso probabilities, indicating that Mixture Lasso tends to identify phenotypically coherent cell populations. Likewise, regions of UMAP space with high Mixture Lasso probabilities tend to correspond to regions containing large numbers of pathologist-identified AML blasts. However, Mixture Lasso does not simply assign high probabilities to all cells collected from cancer patients; rather, it assigns high probabilities to cells in areas of high-dimensional phenotype space preferentially occupied by cells from cancer patients, but not from healthy patients. Intuitively, this makes sense: leukemic blasts should map to regions of phenotype space with little to no overlap with phenotypes present in healthy samples. This observation is quantified in (Figures 2d,f).

Refer to caption
Figure 2: MMIL identifies regions of high-dimensional phenotype space occupied by cells from AML patients, but not by cells from healthy controls. (A) A scatterplot of uniform manifold approximation and projection (UMAP) coordinates colored by Mixture Lasso probabilities. Cells with probability scores of 0 have a very small chance of being AML-associated (i.e. leukemic blasts), whereas cells with probability scores close to 1 have a high chance of being AML-associated (leukemic blasts). (B) UMAP plot as in (A), but with cells annotated as leukemic blasts by a pathologist in red and cells annotated as healthy (i.e. non-leukemic blasts) by a pathologist in blue. Note the general agreement of probabilities in (A) to red regions in (B). (C) UMAP plot as in (A), but with cells collected from cancer patients shown in orange and cells collected from healthy controls in blue. Note that regions with overlapping orange and blue cells are assigned low Mixture Lasso probabilities in (A). (D) A UMAP plot of local ”phenotypic neighborhoods” spaced throughout the high-dimensional point cloud selected by density-dependent downsampling (Methods). Neighborhoods are colored based on the proportion of cells that they contain that come from cancer patients. Note that neighborhoods exclusively comprised of cells from cancer patients are assigned high Mixture Lasso probabilities in (A). (E) A UMAP plot as in (A), but with cells colored by their expression of Lactoferrin, the marker with the largest coefficient in the Mixture Lasso model. See also Supplementary Figure 6. (F) Count heatmap of 2-dimensional bins demonstrating the correlation between the average Mixture Lasso probability in a phenotypic neighborhood (Y-axis) and the proportion of cells from cancer patients it contains (X-axis). Bins are colored by the density of neighborhoods in that region of the plot, and the red line represents the locally-weighted moving average across the x-axis. Inset text indicates the Pearson correlation between the values on the X- and Y-axes. Note: In panels A-E, UMAP coordinates were calculated using all protein markers.

Protein-specific UMAP plots further indicate that Mixture Lasso assigns high disease-association probabilities to cells with unique marker expression patterns–generally, those low in Lactoferrin and high in SerpinB1, CD16, and/or CD56. Furthermore, while rRNA was not explicitly selected as a disease-associated feature in the Mixture Lasso model, cells with high probabilities also tend to express high levels of rRNA (Figure 5 and Figure 6).

2.3 Mixture modeling identifies cancer cells in Acute Myeloid Leukemia using both labeled and unlabeled cells

Next, we sought to evaluate MMIL’s performance in the setting where both labeled and unlabeled cells are present–e.g. a scenario in which most leukemic patients in a dataset do not have cell labels, but a small number of them have been annotated. We hypothesized that a model capable of leveraging both the expert labels of a clinician and a larger corpus of unlabeled data would outperform models trained using only labeled or only unlabeled data [12].

However, using expert labels to train a model on a difficult classification task can be risky: clinical misidentification of leukemic blasts is a known challenge in hematopathology, with a high degree of documented interobserver variability and methodological disagreement (Supplemental Figure 7) [6]. This is particularly concerning for training machine learning models, as the degradation of model performance due to label error is a widely characterized problem [13, 14]. Thus, we also hypothesized that models trained on both labeled and unlabeled data would be more robust to label error than models trained on labeled data alone, as they would be less susceptible to over-fitting on the imperfect labels.

To test these hypotheses, we developed a semi-supervised version of MMIL (“1-shot MMIL”) using the training procedure illustrated in (Figure 3a). Briefly, 1-shot MMIL is trained identically to MMIL with one change: during model training, a single AML patient (termed the “1-shot patient”) is chosen and their gold-standard cell labels are used for supervision instead of probabilistic labels. Like the cells from healthy patients, the 1-shot patient’s cell labels remain fixed throughout all iterations of the EM. Otherwise, model training and LOOCV are carried out as before.

To benchmark 1-shot MMIL’s performance, we compare it to 0-shot MMIL (the usual MMIL), the 0-shot naive model (a naive model trained as described before), the 1-shot naive model (a model in which the 1-shot patient’s gold-standard cell labels are used during model training, but all others use their inherited patient labels), and the 1-shot optimal model (a model trained on only the 1-shot patient using their gold-standard cell labels). Furthermore, to benchmark 1-shot MMIL’s robustness to imperfect labeling, we again fit each 1-shot model after randomly permuting 25% of the 1-shot patient’s cell labels–a proportion chosen to match the variability observed between pathologists when enumerating leukemic blasts in AML (Figures 7 and Methods). All models were lasso-regularized logistic regression classifiers.

We performed 13 1-shot experiments: one for each AML patient in the dataset to serve as the 1-shot patient. The results of these experiments are summarized in Figure 3b-e. In the “0-shot” case, we once again observed that Mixture Lasso outperformed the naive model (Figure 3b, left). In the 1-shot case, we found that Mixture Lasso’s performance improved substantially compared to 0-shot Mixture Lasso (average AUROC increase of 0.058) while maintaining its superior performance to the 1-shot naive model (Figure 3b, middle). We also found that training an optimal model using only the 1-shot patient outperforms 1-shot Mixture Lasso; however, after training labels were permuted, 1-shot Mixture Lasso outperforms both the naive model and the optimal model, whose performance degrades substantially after training on the imperfect labels (Figure 3b, right). Together, these results suggest that MMIL is capable of improving its performance using even a small number of gold-standard labels, while remaining more robust to noisy labeling than alternative models.

Refer to caption
Figure 3: MMIL can train on labeled and unlabeled data simultaneously to incorporate expert knowledge while remaining robust to imperfect labeling. (A) Schematic of semi-supervised “0-shot” and “1-shot” MMIL experiments (also see Methods). (B) Boxplots indicating average area under the receiver-operator characteristic curve (AUROC) for Mixture Lasso (blue), Naive (orange), and Optimal models across 0-shot (left), 1-shot (with perfect labels; middle), and 1-shot (with imperfect labels; right) training procedures. “***” represents statistical significance at the level of p<0.0001𝑝0.0001p<0.0001italic_p < 0.0001 using a paired Student’s t-test with Benjamini-Hochberg correction for multiple comparisons. (C) Positive coefficients for an Optimal Lasso model fit on a single patient (AML-5An). (D) Positive Mixture Lasso coefficients after 0-shot (left) and 1-shot (right) learning. Note that rRNA, the feature with the largest Optimal Lasso coefficient in (C) and 1d, was selected with a positive coefficient only after 1-shot learning. (E) Two-sided barplot indicating how many times a feature was included in the Mixture Lasso model with a positive coefficient after 0-shot (left, blue) and 1-shot (right, orange) training. Dashed gray lines indicate the maximum number of times a feature could have been included (13, the total number of 1-shot experiments).

Based on these results, we wondered how the inclusion of the 1-shot patient during model training affected the feature sets selected by Mixture Lasso. To interrogate this, we examined the 1-shot experiment with the largest average AUROC improvement between the 0-shot and 1-shot models across all patients (patient AML-5An was the 1-shot patient). In this experiment, rRNA was selected as the most cancer-associated feature by the optimal model (Figure 3c). Interestingly, rRNA was not identified as a cancer-associated marker by the 0-shot Mixture Lasso model, but was identified by the 1-shot model (Figure 3d). In addition, several markers (CD64, CD3, CD8, and CD19) that were included in the 0-shot Mixture Lasso–but that were not used in Tsai et al.’s manual gating of leukemic blasts–were removed in the 1-shot model. This held as a general trend across all 1-shot experiments, across which rRNA was never selected as a cancer-associated marker by 0-shot Mixture Lasso, but was selected as a cancer-associated marker in 10/13 of the corresponding 1-shot Mixture Lasso. At the same time, markers CD3, CD5, CD15, and CD11c were often selected by 0-shot Mixture Lasso but are not canonically considered markers of AML blasts, and these were selected less frequently by 1-shot Mixture Lasso. Taken together, these results demonstrate that MMIL can adapt its feature selection process via semi-supervision, leveraging even a small number of expert-labels to prioritize clinically relevant markers over non-informative ones.

2.4 Mixture modeling identifies and tracks cancer cells throughout treatment progression in Acute Lymphoblastic Leukemia

Finally, we applied MMIL to a dataset of 1.1 million blood and bone marrow cells collected from three Acute Lymphoblastic Leukemia (ALL) patients and three healthy controls. This dataset contains samples collected from multiple tissues (blood and bone marrow) and multiple timepoints throughout treatment (diagnosis, Day 8 post-treatment initiation, and Day 15 post-treatment initiation). Using this dataset, we wanted to evaluate the ability of an MMIL model trained at diagnosis to accurately identify and track residual leukemic cells during treatment progression.

The identification of residual leukemic cells throughout treatment is crucial for assessing a patient’s risk of relapse and therefore for clinical decision making. Thus, MMIL’s ability to maintain strong performance despite distributional shift–i.e. a shift in the distribution of leukemic blast phenotypes over time due to treatment effects or changes in disease state–would be paramount for its application to disease monitoring in cancer.

To evaluate MMIL’s ability to detect residual leukemic cells over the course of treatment, we analyzed all cells in the ALL cohort for the presence of 29 proteins as previously described using CyTOF (see Methods). Optimal, naive, and Mixture Lasso models were trained using only bone marrow cells from the diagnostic timepoint, and model performance was once again evaluated using leave-one-patient-out cross validation. Next, we applied our diagnostic models to paired blood samples from the diagnostic timepoint to evaluate each model’s ability to generalize to a new tissue context that is more easily monitored clinically. In addition, we applied each model to paired blood samples taken at days 8 and 15 post-treatment initiation to evaluate its ability to track a patient’s disease burden over time.

Trained without access to gold standard labels, Mixture Lasso demonstrates robust performance and generalizability, with an AUROC of 0.941 at diagnosis and 0.815 at day 15 post-treatment initiation. The naive model begins with AUROC of 0.923 at diagnosis, but this degrades across time to 0.662 at Day 15 (Figure 4). This decline in performance aligns with clinical challenges encountered over the course of a patient’s treatment, including the diminishing proportion of leukemic blasts over time (as low as 0.1-5% of all cells) and phenotypic plasticity in resistant cells post-therapy [15]. Notably, the naive model’s brittle performance in response to these shifts highlights its susceptibility to false positives. Mixture Lasso, however, retains strong performance even as the percentage of leukemic cells becomes low, suggesting its potential for robust disease monitoring and minimal residual disease detection in leukemia.

Refer to caption
Figure 4: Mixture Lasso identifies leukemia cells despite being trained without any cell labels. We compare the performance of Mixture Lasso to the optimal model (trained using gold standard labels that are typically unavailable) and the naive model (trained using patient labels in place of cell labels). Mixture Lasso generalizes better than the naive approach from bone marrow to blood samples and across time, evidenced by its performance on blood samples collected at (B) diagnosis, (C) day 8 post-treatment initiation, and (D) day 15 post-treatment initiation.

As before, we analyzed Mixture Lasso’s coefficients to examine which protein markers best distinguish between healthy and leukemic cells. Among the markers with positive (cancer-associated) coefficients identified by Mixture Lasso (Supplemental Figure 9) were CD10, CD19, PAX5, and CD34—each of which has been previously described as aberrantly overexpressed in ALL [16]. For example, CD10, or “Common Acute Lymphoblastic Leukemia Antigen”(CALLA), is a surface protein expressed in primitive, undifferentiated B cell precursors as well as in the majority of ALL cells [17]. Mixture Lasso also identified CD58, a well-known marker of clinical MRD [18]. Conversely, all the proteins with negative coefficients in the Mixture Lasso model are phenotypic markers of mature B cells — such as surface immunoglobulin (IgMs), intracellular immunoglobulin M (IgMi), CD20, and the lambda chain of the B cell receptor (Supplemental Figure 9). Each of these proteins are expressed primarily during stages of B cell development that are not reached by leukemic cells, whose differentiation is blocked at earlier stages of B-cell development  [7].

3 Discussion

Biomedical studies of single-cell data often involve analyzing many unlabeled cells from each patient in order to identify disease-associated cells. Finding such cells has broad-reaching applications in biomedical science, such as aiding our comprehension of disease-contributing cell populations and improving disease diagnosis and monitoring. Mixture Modeling for MIL (MMIL) is a method to train a cell classifier using patient labels only. Our approach has several attractive features. First, it adapts to different choices of classification algorithms: it is straightforward to implement as a wrapper that repeatedly calls established classification software that optimizes the binomial log likelihood. This makes it easy to implement our method within a pre-existing machine learning pipeline. Second, in the past it has been unclear how to calibrate models without labeled data; because our approach can estimate cell labels, we can use it to perform Platt scaling, a standard calibration method. This is useful because interpretable probability estimates empower straighforward downstream clinical decision-making. Finally, MMIL also supports partially labeled datasets: if a small number of cells have known positive labels, they can be included during model training to guide the classifier. This means that MMIL is able to learn from both labeled and unlabeled data simultaneously, allowing it to leverage existing knowledge while remaining robust to noisy or imperfect gold-standard labels.

Applied to a dataset of cells from patients with and without leukemia, MMIL performed and generalized better than the naive approach of training a classifier using patient labels in place of cell labels. In the context of detecting leukemic blasts in AML, we found that MMIL offered substantial improvement over the naive model. Furthermore, we observed that including a single pathologist-annotated AML sample in MMIL’s training set was sufficient not only to improve its performance, but also for it to incorporate expert knowledge into its own decision making. We also showed that MMIL is more robust to noisy labeling than traditional supervised learning methods, allowing for robust performance in the classically difficult task of leukemic blast identification.

In the context of ALL, we additionally found that MMIL offered modest improvement over the naive model when applied to patient samples from the same timepoint (diagnosis) as the samples on which the models were trained (Figure 4 A-B). However, the naive model’s performance degraded substantially when applied to samples collected at later timepoints, likely due to phenotypic shifts occuring in leukemic blasts throughout chemotherapy [19]. By contrast, MMIL’s performance was more robust at later timepoints—particularly at day 15 post-treatment initiation–with a mean AUC of 0.815 compared to the naive model’s AUC of 0.662 (Figure 4 C-D). MMIL’s robust performance across multiple treatment timepoints highlights its ability to reliably detect biomarkers for disease at the single-cell level.

Only two parameters must be set to train a Mixture Model, aside from any hyperparameters for the chosen classifier. We assume that ζ𝜁\zetaitalic_ζ, the proportion of sick people, and ρ𝜌\rhoitalic_ρ, the proportion of healthy cells in sick people, are known. Previous work has shown that these parameters are not, in general, identifiable. Where possible, these parameters should be estimated from other datasets and/or clinical knowledge, and a sensitivity analysis should be performed to evaluate how results are affected. We also assume that healthy patients have no disease-associated cells. When healthy patients may also have a mixture of cells, we instead recommend a clustering method such as MDA2 [20].

Mixture Modeling for MIL is a flexible, easy to implement method to train a cell-level classifier using patient labels, and has potential as a useful tool for biological hypothesis generation, diagnostics, disease monitoring, and biomarker discovery. In this work, we illustrated our method using two distinct biomedical datasets that were carefully hand-labeled by experts. In the future, however, we expect our method to be most impactful in situations when gold-standard labels are unknown; for example, to identify cells associated with adverse patient outcomes in cancer or to discover specific populations of immune cells underlying autoimmune disease.

4 Methods

4.1 Related work

In multiple instance learning (MIL), data consists of labeled bags, each with many unlabeled instances. Negative bags have only negative instances, and positive bags have at least one positive instance. In our case, the “bags” are people, each with many unlabeled cells; healthy people have only healthy cells (negative instances), and sick people have at least one diseased cell (positive instance). MIL methods can be broadly categorized into three paradigms [21]: instance space, bag space, and embedded space. Bag and embedded space methods aim to label bags; the instance space paradigm, like ours, aims to label instances, and so is our primary focus. For example, multiple instance SVM (mi-SVM) [22] performs SVM with the goal of assigning all negative instances to one halfspace and at least one instance from every positive bag to the other. This method trains an SVM iteratively: first, instances are assigned labels and an SVM is fit. Then, using the fitted SVM, instances are relabeled and a new model is trained. This is repeated until the labels stabilize. Another related approach is multiple instance logistic regression with a lasso penalty (MILR) [23], which treats the instance labels as missing data, and uses expectation maximization to optimize a lasso-penalized binomial log likelihood. This is the same general approach as ours. However, in the expectation step, MILR uses the joint probability of all instances in the same bag; as a result, MILR tends to be biased when the number of instances in each bag becomes large. Our approach, Mixture Modeling for MIL, instead treats instances as the primary objects of study, and does not use their bag-level grouping, thereby avoiding issues related to computing joint probabilities of many instances. This is particularly important in medical settings where we may sample tens or hundreds of thousands of cells per person. Finally, there are many neural network architectures designed for bag space multiple instance learning [24] that can be used to label cells. One common method is to apply attention [25] to simultaneously label patients and identify the cells responsible for the predicted bag label.

Our setting is also analogous to the presence only or positive unlabeled settings, where some data points have known positive labels, and the remaining labels are unknown. We may instead refer to our data as negative only data: cells from healthy people are healthy, and therefore have known negative labels, and cells from sick people are unlabeled and may have positive or negative labels. In presence only data, unlabeled instances are randomly sampled from the population, and

P(yi=1xi,i is unlabeled)=P(yi=1xi).𝑃subscript𝑦𝑖conditional1subscript𝑥𝑖i is unlabeled𝑃subscript𝑦𝑖conditional1subscript𝑥𝑖P(y_{i}=1\mid x_{i},\text{$i$ is unlabeled})=P(y_{i}=1\mid x_{i}).italic_P ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ∣ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i is unlabeled ) = italic_P ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ∣ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) . (1)

In our setting, though, unlabeled data have a higher-than-baseline probability of belonging to the positive class. The knowledge that a cell is from a sick person raises the probability that it is diseased:

P(yi=1xi,i is unlabeled)=P(yi=1xi,i is from a sick person)>P(yi=1xi).𝑃subscript𝑦𝑖conditional1subscript𝑥𝑖i is unlabeled𝑃subscript𝑦𝑖conditional1subscript𝑥𝑖i is from a sick person𝑃subscript𝑦𝑖conditional1subscript𝑥𝑖P(y_{i}=1\mid x_{i},\text{$i$ is unlabeled})=P(y_{i}=1\mid x_{i},\text{$i$ is % from a sick person})>P(y_{i}=1\mid x_{i}).italic_P ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ∣ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i is unlabeled ) = italic_P ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ∣ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i is from a sick person ) > italic_P ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ∣ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) . (2)

Ward et al. [26] present a method to train a classifier with presence only data using EM. Mixture modeling for MIL is a modification of their approach to accommodate this structural difference in our data.

Mixture discriminant analysis 2 (MDA2) [20], a clustering method, is related to the problem at hand. MDA2 considers data from K𝐾Kitalic_K classes. Each class is a mixture of Gaussian distributions, the centers of which are shared across all K𝐾Kitalic_K classes. Here, we have K=2𝐾2K=2italic_K = 2 classes (healthy and sick people). The healthy class is composed of just one distribution: the distribution of healthy cells. Because sick people have both diseased and healthy cells, the sick class is a mixture of two distributions: the healthy and diseased cell distributions. While MDA2 is a natural fit for our setting, it is a generative method, and it assumes the features X𝑋Xitalic_X are normally distributed in order to model the joint likelihood P(X,cell labels y)𝑃𝑋cell labels yP(X,\text{cell labels $y$})italic_P ( italic_X , cell labels italic_y ). This is not always appropriate, and we may instead prefer a discriminative method that models the conditional likelihood P(yX)𝑃conditional𝑦𝑋P(y\mid X)italic_P ( italic_y ∣ italic_X ). We may also prefer to use classifiers that identify important features (like lasso and ridge regression), or relationships between features (like gradient boosted trees and neural networks).

Alternately, a simple but problematic approach is to train cell-level classifier using patient labels in lieu of cell labels [27, 28]. In our comparison, we term this the “naive” approach. This may work when most cells sampled from sick people are diseased, as the inherited label is correct most of the time. When sick people have a smaller fraction of diseased cells, classifier performance will suffer because of high label noise. Ward et al. [26] further show that using this approach with logistic regression causes coefficients to be biased toward 0, thereby hindering discovery of biologically important features. To directly address label noise, approaches from the denoising literature (e.g. [29, 30, 31]) may be useful. Still, our method leverages the knowledge that cells from healthy people have known labels, which is not assumed by more general methods.

4.2 Simulation reveals that calibration is possible without ground truth labels

We wish to simulate a dataset with size n=1000𝑛1000n=1000italic_n = 1000 and p=100𝑝100p=100italic_p = 100 such that ρ=ζ=0.5𝜌𝜁0.5\rho=\zeta=0.5italic_ρ = italic_ζ = 0.5, and the relationship between xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is given by logit P(yixi)=βTxilogit 𝑃conditionalsubscript𝑦𝑖subscript𝑥𝑖superscript𝛽𝑇subscript𝑥𝑖{\text{logit }P(y_{i}\mid x_{i})=\beta^{T}x_{i}}logit italic_P ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_β start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. In each simulation, the first 10101010 coefficients of β𝛽\betaitalic_β are drawn from a normal distribution with mean and standard deviation 1, and the remainder are 0. Then, xi100subscript𝑥𝑖superscript100x_{i}\in\mathbb{R}^{100}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 100 end_POSTSUPERSCRIPT is sampled from a standard multivariate normal distribution with identity covariance; yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is from a binomial distribution defined by P(yixi)𝑃conditionalsubscript𝑦𝑖subscript𝑥𝑖P(y_{i}\mid x_{i})italic_P ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). Instances with yi=1subscript𝑦𝑖1y_{i}=1italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 (diseased cells) necessarily have zi=1subscript𝑧𝑖1z_{i}=1italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 (sampled from sick patients), and instances with yi=0subscript𝑦𝑖0y_{i}=0italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0 (healthy cells) have zi=1subscript𝑧𝑖1z_{i}=1italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 with probability ρζρζ+(1ζ)𝜌𝜁𝜌𝜁1𝜁\frac{\rho\zeta}{\rho\zeta+(1-\zeta)}divide start_ARG italic_ρ italic_ζ end_ARG start_ARG italic_ρ italic_ζ + ( 1 - italic_ζ ) end_ARG. We repeat this process until we have 1000100010001000 instances (500500500500 train and 500500500500 test) satisfying the requirement ρ=ζ=0.5𝜌𝜁0.5\rho=\zeta=0.5italic_ρ = italic_ζ = 0.5.

Then, as before, we trained the optimal model, the naive model, and Mixture Lasso. In addition, we also trained a multiple instance SVM model (mi-SVM, described above). We are interested in both prediction and inference, so we compare the models in terms of AUROC, AUPRC, expected calibration error, and the sum of absolute differences between β^^𝛽\hat{\beta}over^ start_ARG italic_β end_ARG and β𝛽\betaitalic_β. For the optimal model, calibration is done with Platt scaling using the true labels y𝑦yitalic_y; this represents the best performance we could reasonably expect. The naive model is calibrated using the incorrect labels z𝑧zitalic_z, to stay consistent with the fact that it was trained using the incorrect labels. The Mixture Lasso and mi-SVM models are calibrated in two steps: (1) obtain held-out predictions for each point in the training set (or use a separate held out dataset) and (2) fit a Mixture Lasso model using only the held out probabilities to predict the cell labels y𝑦yitalic_y. This results in a logistic regression model that adjusts the predicted probabilities to be better calibrated. This simulation is repeated 1000100010001000 times. For all objectives, the optimal model (trained using the true labels) performs best. When the true labels are unavailable, Mixture Lasso outperforms the naive approach (Table 1), and performing Platt scaling using Mixture Logistic Regression greatly improves calibration.

AUROC AUPRC β^β1subscriptnorm^𝛽𝛽1||\hat{\beta}-\beta||_{1}| | over^ start_ARG italic_β end_ARG - italic_β | | start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT uncalibrated ECE calibrated ECE
Optimal 0.94±0.02plus-or-minus0.940.020.94\pm 0.020.94 ± 0.02 0.86±0.05plus-or-minus0.860.050.86\pm 0.050.86 ± 0.05 6.71±1.58plus-or-minus6.711.586.71\pm 1.586.71 ± 1.58 0.37±0.02plus-or-minus0.370.020.37\pm 0.020.37 ± 0.02 0.04±0.01plus-or-minus0.040.010.04\pm 0.010.04 ± 0.01
Mixture Lasso 0.90±0.03plus-or-minus0.900.030.90\pm 0.030.90 ± 0.03 0.80±0.07plus-or-minus0.800.070.80\pm 0.070.80 ± 0.07 9.64±2.24plus-or-minus9.642.249.64\pm 2.249.64 ± 2.24 0.11±0.02plus-or-minus0.110.020.11\pm 0.020.11 ± 0.02 0.06±0.02plus-or-minus0.060.020.06\pm 0.020.06 ± 0.02
Naive 0.88±0.04plus-or-minus0.880.040.88\pm 0.040.88 ± 0.04 0.74±0.07plus-or-minus0.740.070.74\pm 0.070.74 ± 0.07 11.08±2.43plus-or-minus11.082.4311.08\pm 2.4311.08 ± 2.43 0.38±0.02plus-or-minus0.380.020.38\pm 0.020.38 ± 0.02 0.25±0.03plus-or-minus0.250.030.25\pm 0.030.25 ± 0.03
mi-SVM [32] 0.71±0.05plus-or-minus0.710.050.71\pm 0.050.71 ± 0.05 0.47±0.07plus-or-minus0.470.070.47\pm 0.070.47 ± 0.07 0.17±0.05plus-or-minus0.170.050.17\pm 0.050.17 ± 0.05
Table 1: Comparison of three Lasso logistic regression modeling approaches in 1000 simulations. Optimal was trained with the true labels y𝑦yitalic_y (unavailable in realistic settings) and naive with the observed labels z𝑧zitalic_z. The final two columns show the expected calibration error (ECE), before and after calibration; a lower ECE is preferred. Platt scaling with Mixture Logistic Regression improves calibration for Mixture Lasso, and makes calibration possible for mi-SVM. The mean and one standard deviation of each distribution is shown.

4.3 Algorithm details

Algorithm 2 Train a mixture model for multiple instance learning data

Input: Covariates Xn×p𝑋superscript𝑛𝑝X\in\mathbb{R}^{n\times p}italic_X ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_p end_POSTSUPERSCRIPT, binary sampling labels zn𝑧superscript𝑛z\in\mathbb{R}^{n}italic_z ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, n1subscript𝑛1n_{1}italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT of which are positive.

Assumptions: The proportion of healthy cells in sick people is P(y=0z=1)=ρ𝑃𝑦conditional0𝑧1𝜌P(y=0\mid z=1)=\rhoitalic_P ( italic_y = 0 ∣ italic_z = 1 ) = italic_ρ.
AssumptionThe proportion of cells from sick people in the prediction population is P(z=1)=ζ𝑃𝑧1𝜁{P(z=1)=\zeta}italic_P ( italic_z = 1 ) = italic_ζ.  

1. Create a label vector y(0)superscript𝑦0y^{(0)}italic_y start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT, set to 00 for cells from healthy people and 1ρ1𝜌1-\rho1 - italic_ρ for cells from sick people.

2. Iterate to convergence. At the ithsuperscript𝑖thi^{\text{th}}italic_i start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT step:

  1. (a)

    Maximization step: Fit a model η(i)(x)superscript𝜂absent𝑖𝑥\eta^{\star(i)}(x)italic_η start_POSTSUPERSCRIPT ⋆ ( italic_i ) end_POSTSUPERSCRIPT ( italic_x ) using X𝑋Xitalic_X and y(i1)superscript𝑦𝑖1y^{(i-1)}italic_y start_POSTSUPERSCRIPT ( italic_i - 1 ) end_POSTSUPERSCRIPT. Account for biased sampling: make the case-control intercept adjustment to obtain

    η(i)(x)=η(i)(x)+log((1ρ)n1n(1ρ)n1)log((1ρ)ζ1(1ρ)ζ).superscript𝜂𝑖𝑥superscript𝜂absent𝑖𝑥1𝜌subscript𝑛1𝑛1𝜌subscript𝑛11𝜌𝜁11𝜌𝜁\eta^{(i)}(x)=\eta^{\star(i)}(x)+\log\left(\frac{(1-\rho)n_{1}}{n-(1-\rho)n_{1% }}\right)-\log\left(\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right).italic_η start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ( italic_x ) = italic_η start_POSTSUPERSCRIPT ⋆ ( italic_i ) end_POSTSUPERSCRIPT ( italic_x ) + roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ) - roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG ) . (3)
  2. (b)

    Expectation step: Use η(i)(x)superscript𝜂𝑖𝑥\eta^{(i)}(x)italic_η start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ( italic_x ) to define y(i)superscript𝑦𝑖y^{(i)}italic_y start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT. Cells from healthy people have known label y(i)=0superscript𝑦𝑖0y^{(i)}=0italic_y start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = 0. Cell j𝑗jitalic_j from a sick person has the label yj(i)superscriptsubscript𝑦𝑗𝑖y_{j}^{(i)}italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT such that:

    logit yj(i)=η(i)(xj)log(ρζ1(1ρ)ζ).logit subscriptsuperscript𝑦𝑖𝑗superscript𝜂𝑖subscript𝑥𝑗𝜌𝜁11𝜌𝜁\text{logit }y^{(i)}_{j}=\eta^{(i)}(x_{j})-\log\left(\frac{\rho\zeta}{1-(1-% \rho)\zeta}\right).logit italic_y start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_η start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) - roman_log ( divide start_ARG italic_ρ italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG ) . (4)

 

Output: A fitted model η^(x)^𝜂𝑥\hat{\eta}(x)over^ start_ARG italic_η end_ARG ( italic_x ) to predict labels for new cells.

To choose hyperparameters (e.g. the regularization strength in lasso logistic regression), we recommend using cross validation. For each desired choice of hyperparameter, train a model using cells from a subset of the patients, and record the value of the observed data log likelihood (in Section B) on the cells from held out patients. After repeating this across all desired hyperparameter values and all folds, choose the hyperparameters that maximize the cross validated log likelihood.

Importantly, we have assumed that ρ𝜌\rhoitalic_ρ and ζ𝜁\zetaitalic_ζ, the proportion of healthy cells in sick patients and the proportion of cells sampled from sick patients, are known. We make these assumptions because ρ𝜌\rhoitalic_ρ and ζ𝜁\zetaitalic_ζ are not, in general, identifiable. Ward et al. [26] show that they are identifiable only if we make strong assumptions about the form of η(x)𝜂𝑥\eta(x)italic_η ( italic_x ); even when it they are identifiable, their estimates have high variance. This argument is reaffirmed and expanded upon by Hastie and Fithian [33].

For large datasets, the EM procedure may be slow and it may be more appropriate to optimize the observed log likelihood (Appendix B) through stochastic gradient descent. For smaller datasets, using EM allows us to easily try many different model choices for η𝜂\etaitalic_η, and to take advantage of off-the-shelf model fitting software in each loop of the algorithm.

As we have described it here, our algorithm relies on the use of classification software that can train a model using soft labels (values between 0 and 1). In Appendix C, we share a modification for classifiers that require hard labels (restricted to be 0 or 1).

4.4 AML dataset analysis

4.4.1 Data acquisition

Normalized, singlet-gated AML data were downloaded from FlowRepository (ID: FR-FCM-Z2E7), and gold-standard patholgist annotations were obtained via correspondence with the authorship team of Tsai et al. (2020). Due to the ambiguity of their diagnosis, the one patient in the cohort with Myelodysplastic Syndrome (MDS) was excluded from analysis. For compatability with the blast enumeration in Tsai et al. (2020) and standard pathology laboratory procedure, only CD45+ events (hematopoietic lineage cells) were analyzed.

4.4.2 Data cleaning and preprocessing

Standard preprocessing steps for mass cytometry data analysis were performed as described previously [8]. Specifically, ion counts were transformed using the hyperbolic arcsine function with a cofactor of 5, and all markers were scaled to their 99.9th percentile for comparability between markers. Additionally, all cells expressing a marker value over the 99.9th percentile were excluded from analysis in order to remove technical artifacts and outliers. All analyses were performed using the R package tidytof [34].

4.4.3 Model fitting

Models in Figure 1 and Figure 2

Model specification: MMIL was applied to the AML cohort via the Mixture Lasso, a lasso logistic regression model trained using the algorithm described in 2.1. We used a value of (ρ=0.75𝜌0.75\rho=0.75italic_ρ = 0.75) for all MMIL models, relying on the clinical knowledge that AML patients with more than 25% blasts in their bone marrow receive the clinical diagnosis of leukemia [10]. ζ𝜁\zetaitalic_ζ was estimated using the training set of each fold (see below).

In addition to Mixture Lasso, two other models were fit and evaluated: the “optimal” model (a lasso logistic regression model trained using the true cell labels, as annotated by a pathologist) and the “naive” model (a lasso model trained using inherited patient labels instead of cell labels). For clarity, the cell labels used by Mixture Lasso are referred to as the “probabilistic labels” of each cell; the cell labels used by the optimal model are referred to as the “gold-standard” or “pathologist-annotated” labels of each cell; and the labels used by the naive model are referred to as the “inherited patient labels” of each cell. Mixture Lasso, the optimal model, and the naive model are referred to as the 3 “model classes” that we evaluated.

Hyperparameter tuning: Lasso models have a single hyperparameter: λ𝜆\lambdaitalic_λ, the penalty term determining the amount of regularization applied to the model’s coefficients. For each model class, we tuned over 10 values of lambda, equally spaced on a logarithmic scale between the lowest value (105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT) and the highest value (1). The optimal hyperparameter for each model class was determined using cross-validation (see below). The model predictions of the model fit with the optimal penalty parameter were used for reporting in Figure 4’s ROC curves and AUROC values.

Cross-validation and model selection: Model performance was estimated using leave-one-(patient)-out cross-validation (LOOCV), a schema in which all cells from a single AML patient are held out as a separate test set in each fold of the cross-validation. Specifically, we break the dataset into 13 folds such that each fold includes 12 AML patients and all 3 healthy controls for model training and 1 held-out AML patient for model evaluation. For each fold and model class, we fit the model on the training set and evaluate its cell-level performance on the held-out AML patient according to gold-standard, pathologist-annotated cell labels. The optimal lasso regularization penalty for each model class was chosen by selecting the penalty that optimized the average log-likelihood (for Mixture Lasso) or binomial deviance (for both other models) on the held out sample across all folds. Finally, for model interpretation, we refit each model with the optimal penalty on all 13 AML samples and all 3 healthy bone marrow samples.

0-shot and 1-shot models in Figure 3:

In this section, we refer to models trained on a dataset for whom no cell labels are known “0-shot” models (the 0 denoting the number of patients who have received expert annotation before model training). Similarly, we refer to models trained on a dataset for whom a single patient’s cell labels are known “1-shot” models (the 1 again denoting the number of patients who have received expert annotation before model training). We borrow this language from the few-shot machine learning literature, a paradigm that has been scarcely applied to multiple-instance learning problems but that is highly applicable here.

0-shot models are simply the same models described in the “Models in Figure 1 and Figure 2” section above. In the 0-shot case, MMIL models are trained using probabilisitic cell labels as described in Figure 1a, naive models are trained using inherited patient labels, and optimal models are trained using gold-standard cell labels. Notably, it is not possible to fit a 0-shot optimal modelm, as optimal models require direct cell labels. This is why the left panel of 3b has a blank space for the optimal model.

By contrast, 1-shot models are trained identically to 0-shot models except for a single change: during model training, one AML patient (termed the “1-shot patient”) is chosen such that their gold-standard cell labels are used for supervision instead of their probabilistic labels (for Mixture Lasso) or inherited patient labels (for naive models). Similarly to the cells from healthy patients, the 1-shot patient’s cell labels remain fixed throughout all iterations of the EM. Otherwise, model training, cross-validation, and model selection are carried out as before.

Note that a single ’1-shot experiment’ in the main text refers to the following training procedure:

  1. 1.

    Choose one of the 13 AML patients to designate as the 1-shot patient. Designate all other 12 AML patients as ”cross-validation patients.”

  2. 2.

    Fit a 0-shot Mixture Lasso model and a 0-shot naive model using 12-fold cross validation. In each fold of the cross validation, 11 cross-validation patients, the 1-shot patient, and all 3 healthy controls are included in the training set, and 1 cross-validation patient is used as the held-out test set. For Mixture Lasso, the model with the penalty parameter that optimizes the average MMIL log likelihood in the test set across all folds is chosen as the best model and is used for error reporting. For the naive model, the model with the penalty parameter that optimizes the average binomial deviance in the test set across all folds is chosen as the best model and used for error reporting.

  3. 3.

    Fit a 1-shot Mixture Lasso model and a 1-shot naive model using 12-fold cross validation. In each fold of the cross validation, 11 cross-validation patients, the 1-shot patient, and all 3 healthy controls are included in the training set, and 1 cross-validation patient is used as the held-out test set. Additionally, fit an optimal model using only the 1-shot patient. For Mixture Lasso, the model with the penalty parameter that optimizes the average MMIL log likelihood in the test set across all folds is chosen as the best model and is used for error reporting. For the naive model, the model with the penalty parameter that optimizes the average binomial deviance in the test set across all folds is chosen as the best model and used for error reporting. For the optimal model, only a single patient is available for model training, so the model with the penalty parameter that optimizes the average binomial deviance for that patient is chosen as the best model and used for error reporting.

  4. 4.

    Repeat step 3 after randomly permuting 25% of the 1-shot patient’s gold-standard labels.

Thus, to give each AML patient a turn being the 1-shot patient, steps 1-4 were repeated 13 times. Finally, the AUROC for each left-out patient was averaged across all 13 1-shot experiments, giving an expected AUROC across all possible choices of 1-shot patient. These AUROC values are the reported values for each patient in 3b.

4.4.4 Model evaluation and interpretation

Single-cell model evaluation:To evaluate each model class’s performance at the single-cell level, we calculated the ROC curve and corresponding AUROC for each sample using gold-standard cell labels.

Blast percentages in Figure 1e: Mixture Lasso, naive, and optimal probabilities were assigned to each cell using the models refit on all AML patients with the best penalty parameters identified by cross-validation. These probabilities were used to classify each cell in the dataset as either a leukemic blasts or not a leukemic blast with each of the 3 model classes. In the case of the optimal model, cell labels are known, so the probability cutoff was chosen to give the highest cell label classification accuracy. In the case of the naive model, cell labels are not known, so the probability cutoff was chosen to give the highest inherited patient label classification accuracy. Finally, for Mixture Lasso, the cell labels are not known, but because of Mixture Lasso’s strong calibration due to our model fitting procedure, a simply probability cutoff of 0.5 was chosen.

Using these binary classifications for each cell, blast percentages were calculated for each patient. To recalibrate the blast percentages assigned by each model with the pathologist-enumerated blast percentages, a linear regression was fit for each model class as a post-processing step. Specifically, a linear regression of the model-assigned blast percentage onto the pathologist-enumerated blast percentage was used to derive the values plotted on the y-axis of 1e.

UMAP plots: To analyze how cells assigned high MMIL probabilities arrange in high-dimensional space, we performed dimensionality reduction using uniform manifold approximation and projection (UMAP) using all cells and all markers in the dataset (2). The plotted MMIL probabilities were taken from the Mixture Lasso model refit on all AML patients using the optimal penalty parameter identified by cross-validation, and UMAP was run using the default parameters of the tidytof R package [11, 34].

In Figure 2d, local neighborhoods were constructed using a two-step process. First, density-dependent downsampling ([35, 34]) was used to select index cells from the full dataset such that all regions of phenotypic space are represented equally, with each index cell representing the center of a local neighborhood in high-dimensional space. All markers were used to calculated the local densities surrounding each cell in the dataset during density-dependent downsampling. This downsampling process selected 6,849 index cells that were approximately evenly dispersed throughout the high-dimensional point cloud. After index cells were selected, the percentage of cells in each local neighborhood collected from AML patient samples was calculated by finding the 100-nearest neighbors of each index cell in the original dataset and computing the proportion of those neighbors from cancer samples.

0-shot and 1-shot Lasso coefficient analysis: For model interpretation in Figure 3, we examined the non-negative coefficients of the Mixture Lasso model refit on all data using the optimal penalty parameter identified by cross-validation. We focused our analysis on positive (i.e. cancer-associated) lasso coefficients because, generally speaking, disease-associated features are more useful as positive biomarkers that can be used to diagnose or monitor disease. In Figure 3e, features were counted as disease-associated if they had a lasso coefficient of at least 0.01 in their Mixture Lasso model.

4.5 ALL dataset analysis

4.5.1 Data acquisition

De-identified bone marrow and peripheral blood primary samples from patients with ALL were obtained under informed consent from Lucile Packard Children’s Hospital and Stanford Hospital from the Pediatric Clinic of University of Milano-Bicocca (San Gerardo Hospital, Monza, Italy). The use of these samples was approved by the Institutional Review Boards at Lucile Packard Children’s Hospital, Stanford Hospital at Stanford University, and from the Pediatric Clinic of University of Milano-Bicocca. Cryopreserved primary bone marrow and peripheral blood samples were thawed rapidly in thawing media (RPMI 1640 supplemented with 10% fetal bovine serum, 1% penicillin-streptomycin, and glutamine, 20 U/mL sodium heparin, and 0.025 U/mL Benzonase. Cells were rested for 30 minutes at 37°C and cisplatin viability stained [36].

Each sample was analyzed for the expression of 29 proteins using mass cytometry as previously described. [8]. Briefly, cells were fixed with paraformaldehyde, washed in cell staining media (CSM) twice, followed by one wash in PBS and one wash in PBS+0.02% saponin. Blocking was performed with Human TruStain FcX receptor blocking solution (Biolegend, 422302). Cells underwent surface staining with the following surface markers: CD3, CD10, CD19, CD20, CD22, CD24, CD34, CD38, CD43, CD45, CD58, CD61, CD79b, CD123, CD127, CD179a, CD179b, Human Leukocyte Antigen-DR (HLA-DR), Immunoglobulin M (IgM), and Thymic Stromal Lymphopoietin Receptor (TSLPr).

After surface staining, cells were washed, permeabilized, and intracellular stained with the following markers: Intracellular Immunoglobulin M (IgMi), Marker of Proliferation Ki-67 (Ki67), lambda chain of the B-cell receptor, Paired Box Protein 5 (PAX5), Recombination Activating Gene 1 (RAG1), and Terminal deoxynucleotidyl transferase (TdT). Once intracellularly stained, samples were washed in CSM, Iridium intercalated, washed in CSM, followed by two washes in ultra-pure double-distilled water. To prepare for acquisition, cells were resuspended with normalization beads. Mass cytometry data were then acquired on a Helios (a 3rd generation CyTOF).

4.5.2 Data cleaning and preprocessing

Standard preprocessing steps for mass cytometry data analysis were performed as described previously [8]. Specifically, ion counts were transformed using the hyperbolic arcsine function with a cofactor of 5, and all markers were scaled to their 99.9th percentile for comparability between markers. Additionally, all cells expressing a marker value over the 99.9th percentile were excluded from analysis in order to remove technical artifacts and outliers. All analyses were performed using the R package tidytof [34].

4.5.3 Model fitting

Model specification: MMIL was applied to the ALL cohort via the Mixture Lasso, a lasso logistic regression model trained using the algorithm described in 2.1. We used a value of (ρ=0.75𝜌0.75\rho=0.75italic_ρ = 0.75) for all MMIL models, relying on the clinical knowledge that ALL patients with more than 25% blasts in their bone marrow receive the clinical diagnosis of leukemia [10]. ζ𝜁\zetaitalic_ζ was estimated using the training set of each fold (see below).

In addition to Mixture Lasso, two other models were fit and evaluated: the “optimal” model (a lasso logistic regression model trained using the true cell labels, as annotated by a pathologist) and the “naive” model (a lasso model trained using inherited patient labels instead of cell labels). For clarity, the cell labels used by Mixture Lasso are referred to as the “probabilistic labels” of each cell; the cell labels used by the optimal model are referred to as the “gold-standard” or “pathologist-annotated” labels of each cell; and the labels used by the naive model are referred to as the “inherited patient labels” of each cell. Mixture Lasso, the optimal model, and the naive model are referred to as the 3 ”model classes” that we evaluated.

Importantly, for the ALL dataset, only diagnostic bone marrow samples were used to train any models in order to mimic the clinical scenario in which a model is built at the time of diagnosis and used to track a patient’s leukemia burden over time.

Hyperparameter tuning: Lasso models have a single hyperparameter: λ𝜆\lambdaitalic_λ, the penalty term determining the amount of regularization applied to the model’s coefficients. For each model class, we tuned over 10 values of lambda, equally spaced on a logarithmic scale between the lowest value (105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT) and the highest value (1). The optimal hyperparameter for each model class was determined using cross-validation (see below). The model predictions of the model fit with the optimal penalty parameter were used for reporting in Figure 4’s ROC curves and AUROC values.

Cross-validation and model selection: Model performance was estimated using leave-one-(patient)-out cross-validation (LOOCV), a schema in which all cells from a single ALL patient are held out as a separate test set in each fold of the cross-validation. Specifically, we break the dataset into 3 folds such that each fold includes 2 ALL patients and 3 healthy patients for model training and 1 held-out ALL patient for model evaluation. For each fold and model class, we fit the model on the training set and evaluate its cell-level performance on the held-out ALL patient according to gold-standard cell labels. The optimal lasso regularization penalty for each model class was chosen by selecting the penalty that optimized the average log-likelihood (for Mixture Lasso) or binomial deviance (for both other models, as is ordinarily done) on the held out sample across all folds. Finally, for model interpretation, we refit each model with the optimal penalty on all 3 diagnostic ALL and all 3 healthy bone marrow samples.

4.5.4 Model evaluation and interpretation

Single-cell model evaluation: To evaluate each model class’s performance at the single-cell level, we calculated the ROC curve and corresponding AUROC for each sample using gold-standard cell labels.

Although models were only trained using cells from the diagnostic timepoint, the models were evaluated for all available samples at all timepoints. Accordingly, ROC curves and AUROCs from different tissues and timepoints were calculated separately.

Model interpretation: For model interpretation in Figure 9, we examined the nonzero coefficients of the Mixture Lasso model refit on all data using the optimal penalty parameter identified by cross-validation.

5 Supplementary information

5.1 Ethical approval declarations

This research complies with all relevant ethical regulations. The use of the primary samples was approved by the Institutional Review Board at Lucile Packard Children’s Hospital at Stanford University. Healthy human bone marrow samples (n = 3) were purchased through AllCells. De-identified bone marrow samples from pediatric patients with BCP-ALL were obtained, under informed consent, from the Pediatric Clinic of University of Milano-Bicocca (Centro Maria Letizia Verga, Monza, Italy; n = 3). The use of these primary samples was approved by the Institutional Review Boards at both institutions. Written informed consent was obtained from the parents of the patients or their legal representatives, who agreed to the use of biological material for research and clinical studies. To protect patients’ privacy, samples have been de-identified.

5.2 Acknowledgments

EC is supported by the Stanford Data Science Scholars program and a Stanford Graduate Fellowship. MZ is supported by the Stanford Bio-X Bowes Graduate Student Fellowship. TH was partially supported by grants DMS-2013736 from the National Science Foundation, and grant R01GM134483 from the National Institutes of Health. RT is supported by the National Institutes of Health(5R01EB001988-16) and the National Science Foundation(19DMS1208164). TJK is supported by the National Institutes of Health [grant number 1F31CA239365-01], the Mark Foundation for Cancer Research, the Andrew McDonough B+ (Be Positive) Foundation, and a Point Foundation Graduate Student Scholarship. JS is supported by Associazione Italiana per la Ricerca sul Cancro (AIRC; Start-UP grant no. 27325). GPN is supported by the National Institutes of Health [grant numbers U54CA209971, U54HG010426, and U19AI100627]. KLD is supported by the National Institutes of Health [grant number R01 CA251858-01A1], the Mark Foundation for Cancer Research, the Andrew McDonough B+ (Be Positive) Foundation, the Oxnard Foundation, and the Stanford Maternal and Child Health Research Institute.

5.3 Availability of data and materials

Acute Myeloid Leukemia (AML) mass cytometry data are availability on FlowRespository, with Accession ID FR-FCM-Z2E7.

Acute Lymphoblastic Leukemia (ALL) mass cytometry data are available on Dryad, with DOI 10.5061/dryad.8gtht76vw.

5.4 Code availability

An R package to train mixture classifiers can be found on Github in the repo [XXXXXXXX].

Appendix A Supplemental Figures

Refer to caption
Figure 5: Mixture Lasso selects features that discriminate between healthy cells and cancer cells (leukemic blasts) in Acute Myeloid Leukemia (AML). Density plots indicating that 5 markers (Lactoferrin, CD16, CD56, Lamin A/C, and CD33) selected by Mixture Lasso trained using only patient labels successfully separate healthy and cancer cell populations. Healthy patient IDs are the top 3 density plots in each panel (REACTIVE-3A, REACTIVE-2A, and REACTIVE-3A), and all other plots are AML patients. Note that rRNA, which strongly separates healthy and cancer cells, was not selected by Mixture Lasso, representing an interesting instance of a missed discovery.
Refer to caption
Figure 6: Mixture Lasso selects features that distinguish phenotypically distinct cell populations in Acute Myeloid Leukemia (AML). UMAP diagrams denoting the cell populations expressing 4 of the markers (Lactoferrin, SerpinB1, CD16, and CD56) selected by Mixture Lasso whe trained to detect leukemic blasts in AML using only patient labels. rRNA is also included as one of the most important markers used for manual labeling of leukemic blasts by pathologists in Tsai et al. (2020). Note that, although rRNA was not selected as a disease-associated feature by the Mixture Lasso model, it is strongly expressed by the cells assigned high probabilities by the Mixture Lasso model.
Refer to caption
Figure 7: Enumeration of leukemic blasts in AML exhibits wide variability among expert pathologists. A ball-and-stick plot indicating the range between the highest and lowest blast enumeration scores (percentage of cells) assigned to each AML patient in the Tsai et al. (2020) AML cohort among 4 board-certified hematopathologists. Among these, the median range is 22.5%. Blast enumeration was performed using light microscopy, a gold-standard of hematopathological evaluation of leukemia patients. Data were obtained via correspondence with the authorship team of Tsai et al. (2020).
Refer to caption
Figure 8: 1-shot MMIL improves performance at detecting AML blasts by incorporating gold-standard labels while remaining robust to noisy labels. This plot represents the same result as 3b, but with AUROC values averaged across patients in the left-out fold (instead of across 1-shot patients). Thus, each point represents the average AUROC across all held-out patients within each 1-shot experiment (rather than the average held-out performance on a given patient across all 1-shot experiments). ”*” indicates statistical significance at the level of p ¡ 0.05 and ”***” indicates statistical significance at the level of p ¡ 0.001 using a paired Student’s t-test with Benjamini-Hochberg correction for multiple comparisons.
Refer to caption
Figure 9: Mixture Lasso coefficients in the Acute Lymphoblastic Leukemia model.
Refer to caption
Figure 10: Mixture Lasso selects features that discriminate between healthy cells and cancer cells (leukemic blasts) in Acute Lymphoblastic Leukemia (ALL). Density plots indicating that 6 markers (CD10, CD19, PAX5, IgM (surface), CD20, and Lambda chain (of the BCR receptor) selected by Mixture Lasso trained using only patient labels successfully separate healthy and cancer cell populations across all 3 ALL patients. Patient IDs (upn60, upn62, and upn63) are indicated at left.

Appendix B Mixture modeling uses expectation maximization

Our approach is an application of expectation maximization [1] to maximize the observed data likelihood, i=1nP(zixi)superscriptsubscriptproduct𝑖1𝑛𝑃conditionalsubscript𝑧𝑖subscript𝑥𝑖\prod_{i=1}^{n}P(z_{i}\mid x_{i})∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_P ( italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). The EM procedure is as follows:

  1. 1.

    Expectation step: At step t𝑡titalic_t, use η(t1)(x)superscript𝜂𝑡1𝑥\eta^{(t-1)}(x)italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT ( italic_x ) to compute y(t)=Pη(t1)(yx,z)superscript𝑦𝑡subscript𝑃superscript𝜂𝑡1conditional𝑦𝑥𝑧y^{(t)}=P_{\eta^{(t-1)}}(y\mid x,z)italic_y start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT = italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y ∣ italic_x , italic_z ), our current estimates for the unobserved y𝑦yitalic_y. For cells with z=0𝑧0z=0italic_z = 0, use the known label y(t)=0superscript𝑦𝑡0y^{(t)}=0italic_y start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT = 0.

  2. 2.

    Maximization step: Find η(t)superscript𝜂𝑡\eta^{(t)}italic_η start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT that maximizes the full likelihood, i=1nP(zi,yi(t)xi)superscriptsubscriptproduct𝑖1𝑛𝑃subscript𝑧𝑖conditionalsuperscriptsubscript𝑦𝑖𝑡subscript𝑥𝑖\prod_{i=1}^{n}P(z_{i},y_{i}^{(t)}\mid x_{i})∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_P ( italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ∣ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). Adjust η(t)superscript𝜂𝑡\eta^{(t)}italic_η start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT to account for biased sampling.

In this section, we will first address the expectation step: we will show how to compute P(yx,z)𝑃conditional𝑦𝑥𝑧P(y\mid x,z)italic_P ( italic_y ∣ italic_x , italic_z ). Then, we will share the details for the maximization step. We will show the intercept adjustment to account for biased sampling, and then use this to write our objective, the observed data likelihood. Finally, we will show that the binomial log likelihood with respect to y𝑦yitalic_y is proportional to the full likelihood, and so the maximization step can be achieved by solving a typical binary classification problem.

B.1 Expectation step: estimating labels

At the tthsuperscript𝑡tht^{\text{th}}italic_t start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT step of our algorithm, we update our estimate of the unknown labels y𝑦yitalic_y as Pη(t1)(yx,z=1)subscript𝑃superscript𝜂𝑡1conditional𝑦𝑥𝑧1P_{\eta^{(t-1)}}(y\mid x,z=1)italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y ∣ italic_x , italic_z = 1 ), where

logit Pη(t1)(yx,z=1)=η(t1)(x)log(ρζ1(1ρ)ζ).logit subscript𝑃superscript𝜂𝑡1conditional𝑦𝑥𝑧1superscript𝜂𝑡1𝑥𝜌𝜁11𝜌𝜁\text{logit }P_{\eta^{(t-1)}}(y\mid x,z=1)=\eta^{(t-1)}(x)-\log\left(\frac{% \rho\zeta}{1-(1-\rho)\zeta}\right).logit italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y ∣ italic_x , italic_z = 1 ) = italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT ( italic_x ) - roman_log ( divide start_ARG italic_ρ italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG ) . (5)

We note that log(ρζ1(1ρ)ζ)<0𝜌𝜁11𝜌𝜁0\log\left(\frac{\rho\zeta}{1-(1-\rho)\zeta}\right)<0roman_log ( divide start_ARG italic_ρ italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG ) < 0. So, this intercept adjustment raises the baseline probability that a cell is diseased, and this makes sense: knowing a cell came from a sick person should make us think the cell is more likely to be disease associated.

Now, we show our derivation of this label update, and use our assumption that P(zy,x)=P(zy)𝑃conditional𝑧𝑦𝑥𝑃conditional𝑧𝑦{P(z\mid y,x)=P(z\mid y)}italic_P ( italic_z ∣ italic_y , italic_x ) = italic_P ( italic_z ∣ italic_y ). We begin by applying Bayes’ rule:

Pη(t1)(yx,z=1)subscript𝑃superscript𝜂𝑡1conditional𝑦𝑥𝑧1\displaystyle P_{\eta^{(t-1)}}(y\mid x,z=1)italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y ∣ italic_x , italic_z = 1 ) =P(z=1y=1)Pη(t1)(y=1x)Pη(t1)(z=1x)absent𝑃𝑧conditional1𝑦1subscript𝑃superscript𝜂𝑡1𝑦conditional1𝑥subscript𝑃superscript𝜂𝑡1𝑧conditional1𝑥\displaystyle=\frac{P(z=1\mid y=1)P_{\eta^{(t-1)}}(y=1\mid x)}{P_{\eta^{(t-1)}% }(z=1\mid x)}= divide start_ARG italic_P ( italic_z = 1 ∣ italic_y = 1 ) italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y = 1 ∣ italic_x ) end_ARG start_ARG italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_z = 1 ∣ italic_x ) end_ARG (6)
=Pη(t1)(y=1x)Pη(t1)(z=1x)absentsubscript𝑃superscript𝜂𝑡1𝑦conditional1𝑥subscript𝑃superscript𝜂𝑡1𝑧conditional1𝑥\displaystyle=\frac{P_{\eta^{(t-1)}}(y=1\mid x)}{P_{\eta^{(t-1)}}(z=1\mid x)}= divide start_ARG italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y = 1 ∣ italic_x ) end_ARG start_ARG italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_z = 1 ∣ italic_x ) end_ARG (7)
=Pη(t1)(y=1x)P(z=1y=1)Pη(t1)(y=1x)+P(z=1y=0)Pη(t1)(y=0x)absentsubscript𝑃superscript𝜂𝑡1𝑦conditional1𝑥𝑃𝑧conditional1𝑦1subscript𝑃superscript𝜂𝑡1𝑦conditional1𝑥𝑃𝑧conditional1𝑦0subscript𝑃superscript𝜂𝑡1𝑦conditional0𝑥\displaystyle=\frac{P_{\eta^{(t-1)}}(y=1\mid x)}{P(z=1\mid y=1)P_{\eta^{(t-1)}% }(y=1\mid x)+P(z=1\mid y=0)P_{\eta^{(t-1)}}(y=0\mid x)}= divide start_ARG italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y = 1 ∣ italic_x ) end_ARG start_ARG italic_P ( italic_z = 1 ∣ italic_y = 1 ) italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y = 1 ∣ italic_x ) + italic_P ( italic_z = 1 ∣ italic_y = 0 ) italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y = 0 ∣ italic_x ) end_ARG (8)
=Pη(t1)(y=1x)Pη(t1)(y=1x)+P(z=1y=0)Pη(t1)(y=0x)absentsubscript𝑃superscript𝜂𝑡1𝑦conditional1𝑥subscript𝑃superscript𝜂𝑡1𝑦conditional1𝑥𝑃𝑧conditional1𝑦0subscript𝑃superscript𝜂𝑡1𝑦conditional0𝑥\displaystyle=\frac{P_{\eta^{(t-1)}}(y=1\mid x)}{P_{\eta^{(t-1)}}(y=1\mid x)+P% (z=1\mid y=0)P_{\eta^{(t-1)}}(y=0\mid x)}= divide start_ARG italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y = 1 ∣ italic_x ) end_ARG start_ARG italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y = 1 ∣ italic_x ) + italic_P ( italic_z = 1 ∣ italic_y = 0 ) italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y = 0 ∣ italic_x ) end_ARG (9)

In the denominator, we used the identity P(z=1y=1)=1𝑃𝑧conditional1𝑦11P(z=1\mid y=1)=1italic_P ( italic_z = 1 ∣ italic_y = 1 ) = 1. We can compute P(z=1y=0)𝑃𝑧conditional1𝑦0{P(z=1\mid y=0)}italic_P ( italic_z = 1 ∣ italic_y = 0 ):

P(z=1y=0)=P(y=0z=1)P(z=1)P(y=0)=ρζ1(1ρ)ζ.𝑃𝑧conditional1𝑦0𝑃𝑦conditional0𝑧1𝑃𝑧1𝑃𝑦0𝜌𝜁11𝜌𝜁P(z=1\mid y=0)=\frac{P(y=0\mid z=1)P(z=1)}{P(y=0)}=\frac{\rho\zeta}{1-(1-\rho)% \zeta}.italic_P ( italic_z = 1 ∣ italic_y = 0 ) = divide start_ARG italic_P ( italic_y = 0 ∣ italic_z = 1 ) italic_P ( italic_z = 1 ) end_ARG start_ARG italic_P ( italic_y = 0 ) end_ARG = divide start_ARG italic_ρ italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG . (10)

Now, we can plug this back in to our expression:

Pη(t1)(yx,z=1)subscript𝑃superscript𝜂𝑡1conditional𝑦𝑥𝑧1\displaystyle P_{\eta^{(t-1)}}(y\mid x,z=1)italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y ∣ italic_x , italic_z = 1 ) =Pη(t1)(y=1x)Pη(t1)(y=1x)+ρζ1(1ρ)ζPη(t1)(y=0x)absentsubscript𝑃superscript𝜂𝑡1𝑦conditional1𝑥subscript𝑃superscript𝜂𝑡1𝑦conditional1𝑥𝜌𝜁11𝜌𝜁subscript𝑃superscript𝜂𝑡1𝑦conditional0𝑥\displaystyle=\frac{P_{\eta^{(t-1)}}(y=1\mid x)}{P_{\eta^{(t-1)}}(y=1\mid x)+% \frac{\rho\zeta}{1-(1-\rho)\zeta}P_{\eta^{(t-1)}}(y=0\mid x)}= divide start_ARG italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y = 1 ∣ italic_x ) end_ARG start_ARG italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y = 1 ∣ italic_x ) + divide start_ARG italic_ρ italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y = 0 ∣ italic_x ) end_ARG (11)
=eη(t1)(x)eη(t1)(x)+ρζ1(1ρ)ζ.absentsuperscript𝑒superscript𝜂𝑡1𝑥superscript𝑒superscript𝜂𝑡1𝑥𝜌𝜁11𝜌𝜁\displaystyle=\frac{e^{\eta^{(t-1)}(x)}}{e^{\eta^{(t-1)}(x)}+\frac{\rho\zeta}{% 1-(1-\rho)\zeta}}.= divide start_ARG italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT end_ARG start_ARG italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT + divide start_ARG italic_ρ italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG end_ARG . (12)

With manipulation, this becomes:

logit Pη(t1)(yx,z=1)=η(t1)(x)log(ρζ1(1ρ)ζ).logit subscript𝑃superscript𝜂𝑡1conditional𝑦𝑥𝑧1superscript𝜂𝑡1𝑥𝜌𝜁11𝜌𝜁\text{logit }P_{\eta^{(t-1)}}(y\mid x,z=1)=\eta^{(t-1)}(x)-\log\left(\frac{% \rho\zeta}{1-(1-\rho)\zeta}\right).logit italic_P start_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y ∣ italic_x , italic_z = 1 ) = italic_η start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT ( italic_x ) - roman_log ( divide start_ARG italic_ρ italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG ) . (13)

B.2 Maximization step: adjustment for biased sampling

Relative to the general population, we may have a biased sample of healthy and sick people. To address this, we can update the intercept of our model using the following adjustment:

logit P(y=1x,s=1)=η(x)+log((1ρ)n1n(1ρ)n1)log((1ρ)ζ1(1ρ)ζ),logit 𝑃𝑦conditional1𝑥𝑠1𝜂𝑥1𝜌subscript𝑛1𝑛1𝜌subscript𝑛11𝜌𝜁11𝜌𝜁\text{logit }P(y=1\mid x,s=1)=\eta(x)+\log\left(\frac{(1-\rho)n_{1}}{n-(1-\rho% )n_{1}}\right)-\log\left(\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right),logit italic_P ( italic_y = 1 ∣ italic_x , italic_s = 1 ) = italic_η ( italic_x ) + roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ) - roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG ) , (14)

where s=1𝑠1s=1italic_s = 1 indicates our sample, and n1subscript𝑛1n_{1}italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and n𝑛nitalic_n are the number of cells in our sample with label z=1𝑧1z=1italic_z = 1 and the number of cells total, respectively.

Proof: We begin using Bayes rule, and we use the assumption that P(sy,x)=P(sy)𝑃conditional𝑠𝑦𝑥𝑃conditional𝑠𝑦P(s\mid y,x)=P(s\mid y)italic_P ( italic_s ∣ italic_y , italic_x ) = italic_P ( italic_s ∣ italic_y ):

P(y=1x,s=1)𝑃𝑦conditional1𝑥𝑠1\displaystyle P(y=1\mid x,s=1)italic_P ( italic_y = 1 ∣ italic_x , italic_s = 1 ) =P(s=1y=1)P(y=1x)P(s=1y=1)P(y=1x)+P(s=1y=0)P(y=0x)absent𝑃𝑠conditional1𝑦1𝑃𝑦conditional1𝑥𝑃𝑠conditional1𝑦1𝑃𝑦conditional1𝑥𝑃𝑠conditional1𝑦0𝑃𝑦conditional0𝑥\displaystyle=\frac{P(s=1\mid y=1)P(y=1\mid x)}{P(s=1\mid y=1)P(y=1\mid x)+P(s% =1\mid y=0)P(y=0\mid x)}= divide start_ARG italic_P ( italic_s = 1 ∣ italic_y = 1 ) italic_P ( italic_y = 1 ∣ italic_x ) end_ARG start_ARG italic_P ( italic_s = 1 ∣ italic_y = 1 ) italic_P ( italic_y = 1 ∣ italic_x ) + italic_P ( italic_s = 1 ∣ italic_y = 0 ) italic_P ( italic_y = 0 ∣ italic_x ) end_ARG (15)
=P(s=1y=1)P(s=1y=0)eη(x)P(s=1y=1)P(s=1y=0)eη(x)+1absent𝑃𝑠conditional1𝑦1𝑃𝑠conditional1𝑦0superscript𝑒𝜂𝑥𝑃𝑠conditional1𝑦1𝑃𝑠conditional1𝑦0superscript𝑒𝜂𝑥1\displaystyle=\frac{\frac{P(s=1\mid y=1)}{P(s=1\mid y=0)}e^{\eta(x)}}{\frac{P(% s=1\mid y=1)}{P(s=1\mid y=0)}e^{\eta(x)}+1}= divide start_ARG divide start_ARG italic_P ( italic_s = 1 ∣ italic_y = 1 ) end_ARG start_ARG italic_P ( italic_s = 1 ∣ italic_y = 0 ) end_ARG italic_e start_POSTSUPERSCRIPT italic_η ( italic_x ) end_POSTSUPERSCRIPT end_ARG start_ARG divide start_ARG italic_P ( italic_s = 1 ∣ italic_y = 1 ) end_ARG start_ARG italic_P ( italic_s = 1 ∣ italic_y = 0 ) end_ARG italic_e start_POSTSUPERSCRIPT italic_η ( italic_x ) end_POSTSUPERSCRIPT + 1 end_ARG (16)
=eη(x)eη(x)+1,absentsuperscript𝑒superscript𝜂𝑥superscript𝑒superscript𝜂𝑥1\displaystyle=\frac{e^{\eta^{\star}(x)}}{e^{\eta^{\star}(x)}+1},= divide start_ARG italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT end_ARG start_ARG italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT + 1 end_ARG , (17)

where η(x)=η(x)+log(P(s=1y=1)P(s=1y=0))superscript𝜂𝑥𝜂𝑥𝑃𝑠conditional1𝑦1𝑃𝑠conditional1𝑦0\eta^{\star}(x)=\eta(x)+\log\left(\frac{P(s=1\mid y=1)}{P(s=1\mid y=0)}\right)italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) = italic_η ( italic_x ) + roman_log ( divide start_ARG italic_P ( italic_s = 1 ∣ italic_y = 1 ) end_ARG start_ARG italic_P ( italic_s = 1 ∣ italic_y = 0 ) end_ARG ). Now, we simplify log(P(s=1y=1)P(s=1y=0))𝑃𝑠conditional1𝑦1𝑃𝑠conditional1𝑦0\log\left(\frac{P(s=1\mid y=1)}{P(s=1\mid y=0)}\right)roman_log ( divide start_ARG italic_P ( italic_s = 1 ∣ italic_y = 1 ) end_ARG start_ARG italic_P ( italic_s = 1 ∣ italic_y = 0 ) end_ARG ):

log(P(s=1y=1)P(s=1y=0))𝑃𝑠conditional1𝑦1𝑃𝑠conditional1𝑦0\displaystyle\log\left(\frac{P(s=1\mid y=1)}{P(s=1\mid y=0)}\right)roman_log ( divide start_ARG italic_P ( italic_s = 1 ∣ italic_y = 1 ) end_ARG start_ARG italic_P ( italic_s = 1 ∣ italic_y = 0 ) end_ARG ) =log(P(y=1s=1)P(s=1)P(y=1)P(y=0s=1)P(s=1)P(y=0))absent𝑃𝑦conditional1𝑠1𝑃𝑠1𝑃𝑦1𝑃𝑦conditional0𝑠1𝑃𝑠1𝑃𝑦0\displaystyle=\log\left(\frac{\frac{P(y=1\mid s=1)P(s=1)}{P(y=1)}}{\frac{P(y=0% \mid s=1)P(s=1)}{P(y=0)}}\right)= roman_log ( divide start_ARG divide start_ARG italic_P ( italic_y = 1 ∣ italic_s = 1 ) italic_P ( italic_s = 1 ) end_ARG start_ARG italic_P ( italic_y = 1 ) end_ARG end_ARG start_ARG divide start_ARG italic_P ( italic_y = 0 ∣ italic_s = 1 ) italic_P ( italic_s = 1 ) end_ARG start_ARG italic_P ( italic_y = 0 ) end_ARG end_ARG ) (18)
=log(P(y=1s=1)P(y=0s=1))log(P(y=1)P(y=0)).absent𝑃𝑦conditional1𝑠1𝑃𝑦conditional0𝑠1𝑃𝑦1𝑃𝑦0\displaystyle=\log\left(\frac{P(y=1\mid s=1)}{P(y=0\mid s=1)}\right)-\log\left% (\frac{P(y=1)}{P(y=0)}\right).= roman_log ( divide start_ARG italic_P ( italic_y = 1 ∣ italic_s = 1 ) end_ARG start_ARG italic_P ( italic_y = 0 ∣ italic_s = 1 ) end_ARG ) - roman_log ( divide start_ARG italic_P ( italic_y = 1 ) end_ARG start_ARG italic_P ( italic_y = 0 ) end_ARG ) . (19)

We write out each of these probabilities. We show the full derivation for the first expression; the arguments for the the remainder are similar.

P(y=1s=1)𝑃𝑦conditional1𝑠1\displaystyle P(y=1\mid s=1)italic_P ( italic_y = 1 ∣ italic_s = 1 ) =P(y=1z=1,s=1)P(z=1s=1)+P(y=1z=0,s=1)P(z=0s=1)\displaystyle=P(y=1\mid z=1,s=1)P(z=1\mid s=1)+P(y=1\mid z=0,s=1)P(z=0\mid s=1)= italic_P ( italic_y = 1 ∣ italic_z = 1 , italic_s = 1 ) italic_P ( italic_z = 1 ∣ italic_s = 1 ) + italic_P ( italic_y = 1 ∣ italic_z = 0 , italic_s = 1 ) italic_P ( italic_z = 0 ∣ italic_s = 1 ) (20)
=P(y=1z=1)P(z=1s=1)absent𝑃𝑦conditional1𝑧1𝑃𝑧conditional1𝑠1\displaystyle=P(y=1\mid z=1)P(z=1\mid s=1)= italic_P ( italic_y = 1 ∣ italic_z = 1 ) italic_P ( italic_z = 1 ∣ italic_s = 1 ) (21)
=(1ρ)n1nabsent1𝜌subscript𝑛1𝑛\displaystyle=\frac{(1-\rho)n_{1}}{n}= divide start_ARG ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n end_ARG (22)
P(y=0s=1)𝑃𝑦conditional0𝑠1\displaystyle P(y=0\mid s=1)italic_P ( italic_y = 0 ∣ italic_s = 1 ) =n(1ρ)n1nabsent𝑛1𝜌subscript𝑛1𝑛\displaystyle=\frac{n-(1-\rho)n_{1}}{n}= divide start_ARG italic_n - ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n end_ARG (23)
P(y=1)𝑃𝑦1\displaystyle P(y=1)italic_P ( italic_y = 1 ) =(1ρ)ζabsent1𝜌𝜁\displaystyle=(1-\rho)\zeta= ( 1 - italic_ρ ) italic_ζ (24)
P(y=0)𝑃𝑦0\displaystyle P(y=0)italic_P ( italic_y = 0 ) =1(1ρ)ζ.absent11𝜌𝜁\displaystyle=1-(1-\rho)\zeta.= 1 - ( 1 - italic_ρ ) italic_ζ . (25)

Now, we substitute back in:

log(P(y=1s=1)P(y=1s=0))log(P(y=1)P(y=0))𝑃𝑦conditional1𝑠1𝑃𝑦conditional1𝑠0𝑃𝑦1𝑃𝑦0\displaystyle\log\left(\frac{P(y=1\mid s=1)}{P(y=1\mid s=0)}\right)-\log\left(% \frac{P(y=1)}{P(y=0)}\right)roman_log ( divide start_ARG italic_P ( italic_y = 1 ∣ italic_s = 1 ) end_ARG start_ARG italic_P ( italic_y = 1 ∣ italic_s = 0 ) end_ARG ) - roman_log ( divide start_ARG italic_P ( italic_y = 1 ) end_ARG start_ARG italic_P ( italic_y = 0 ) end_ARG ) =log((1ρ)n1n(1ρ)n1)log((1ρ)ζ1(1ρ)ζ).absent1𝜌subscript𝑛1𝑛1𝜌subscript𝑛11𝜌𝜁11𝜌𝜁\displaystyle=\log\left(\frac{(1-\rho)n_{1}}{n-(1-\rho)n_{1}}\right)-\log\left% (\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right).= roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ) - roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG ) . (26)

Therefore, η(x)=η(x)+log((1ρ)n1n(1ρ)n1)log((1ρ)ζ1(1ρ)ζ)superscript𝜂𝑥𝜂𝑥1𝜌subscript𝑛1𝑛1𝜌subscript𝑛11𝜌𝜁11𝜌𝜁\eta^{\star}(x)=\eta(x)+\log\left(\frac{(1-\rho)n_{1}}{n-(1-\rho)n_{1}}\right)% -\log\left(\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right)italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) = italic_η ( italic_x ) + roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ) - roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG ).

B.3 Maximization step: the observed likelihood

The observed likelihood is given by:

L(ηz,X)𝐿conditional𝜂𝑧𝑋\displaystyle L(\eta\mid z,X)italic_L ( italic_η ∣ italic_z , italic_X ) =i=1nP(z=zixi,s=1)absentsuperscriptsubscriptproduct𝑖1𝑛𝑃𝑧conditionalsubscript𝑧𝑖subscript𝑥𝑖𝑠1\displaystyle=\prod_{i=1}^{n}P(z=z_{i}\mid x_{i},s=1)= ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_P ( italic_z = italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_s = 1 ) (27)
=i=1n(eη(x)+ρn1nn1(1ρ)eη(x)+1)zi(1ρn1nn1(1ρ)eη(x)+1)1zi,absentsuperscriptsubscriptproduct𝑖1𝑛superscriptsuperscript𝑒superscript𝜂𝑥𝜌subscript𝑛1𝑛subscript𝑛11𝜌superscript𝑒superscript𝜂𝑥1subscript𝑧𝑖superscript1𝜌subscript𝑛1𝑛subscript𝑛11𝜌superscript𝑒superscript𝜂𝑥11subscript𝑧𝑖\displaystyle=\prod_{i=1}^{n}\left(\frac{e^{\eta^{\star}(x)}+\frac{\rho n_{1}}% {n-n_{1}(1-\rho)}}{e^{\eta^{\star}(x)}+1}\right)^{z_{i}}\left(\frac{1-\frac{% \rho n_{1}}{n-n_{1}(1-\rho)}}{e^{\eta^{\star}(x)}+1}\right)^{1-z_{i}},= ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( divide start_ARG italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT + divide start_ARG italic_ρ italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( 1 - italic_ρ ) end_ARG end_ARG start_ARG italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT + 1 end_ARG ) start_POSTSUPERSCRIPT italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( divide start_ARG 1 - divide start_ARG italic_ρ italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( 1 - italic_ρ ) end_ARG end_ARG start_ARG italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT + 1 end_ARG ) start_POSTSUPERSCRIPT 1 - italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , (28)

where η(x)=logit P(y=1x,s=1)=η(x)+log((1ρ)n1n(1ρ)n1)log((1ρ)ζ1(1ρ)ζ)superscript𝜂𝑥logit 𝑃𝑦conditional1𝑥𝑠1𝜂𝑥1𝜌subscript𝑛1𝑛1𝜌subscript𝑛11𝜌𝜁11𝜌𝜁\eta^{\star}(x)=\text{logit }P(y=1\mid x,s=1)=\eta(x)+\log\left(\frac{(1-\rho)% n_{1}}{n-(1-\rho)n_{1}}\right)-\log\left(\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right)italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) = logit italic_P ( italic_y = 1 ∣ italic_x , italic_s = 1 ) = italic_η ( italic_x ) + roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ) - roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG ), as in Equation 14.

Proof: Consider just one instance, (x,z)𝑥𝑧(x,z)( italic_x , italic_z ). We show how to compute P(z=1x,s=1)𝑃𝑧conditional1𝑥𝑠1{P(z=1\mid x,s=1)}italic_P ( italic_z = 1 ∣ italic_x , italic_s = 1 ), beginning by using the law of total probability and assuming that P(zy,s=1)=P(zy)𝑃conditional𝑧𝑦𝑠1𝑃conditional𝑧𝑦{P(z\mid y,s=1)=P(z\mid y)}italic_P ( italic_z ∣ italic_y , italic_s = 1 ) = italic_P ( italic_z ∣ italic_y ):

P(z=1x,s=1)=P(z=1y=1)P(y=1x,s=1)+P(z=1y=0)P(y=0x,s=1)𝑃𝑧conditional1𝑥𝑠1𝑃𝑧conditional1𝑦1𝑃𝑦conditional1𝑥𝑠1𝑃𝑧conditional1𝑦0𝑃𝑦conditional0𝑥𝑠1P(z=1\mid x,s=1)=P(z=1\mid y=1)P(y=1\mid x,s=1)+P(z=1\mid y=0)P(y=0\mid x,s=1)italic_P ( italic_z = 1 ∣ italic_x , italic_s = 1 ) = italic_P ( italic_z = 1 ∣ italic_y = 1 ) italic_P ( italic_y = 1 ∣ italic_x , italic_s = 1 ) + italic_P ( italic_z = 1 ∣ italic_y = 0 ) italic_P ( italic_y = 0 ∣ italic_x , italic_s = 1 ) (29)

Because all diseased cells come from sick people, P(z=1y=1)=1𝑃𝑧conditional1𝑦11P(z=1\mid y=1)=1italic_P ( italic_z = 1 ∣ italic_y = 1 ) = 1. To compute P(z=1y=0)𝑃𝑧conditional1𝑦0{P(z=1\mid y=0)}italic_P ( italic_z = 1 ∣ italic_y = 0 ), we will use Bayes’ rule. Recall that we assume P(y=0z=1)=ρ𝑃𝑦conditional0𝑧1𝜌P(y=0\mid z=1)=\rhoitalic_P ( italic_y = 0 ∣ italic_z = 1 ) = italic_ρ.

P(z=1y=0,s=1)\displaystyle P(z=1\mid y=0,s=1)italic_P ( italic_z = 1 ∣ italic_y = 0 , italic_s = 1 ) =P(y=0z=1)P(z=1s=1)P(y=0s=1)absent𝑃𝑦conditional0𝑧1𝑃𝑧conditional1𝑠1𝑃𝑦conditional0𝑠1\displaystyle=\frac{P(y=0\mid z=1)P(z=1\mid s=1)}{P(y=0\mid s=1)}= divide start_ARG italic_P ( italic_y = 0 ∣ italic_z = 1 ) italic_P ( italic_z = 1 ∣ italic_s = 1 ) end_ARG start_ARG italic_P ( italic_y = 0 ∣ italic_s = 1 ) end_ARG (30)
=ρn1n1(1ρ)n1n=ρn1n(1ρ)n1.absent𝜌subscript𝑛1𝑛11𝜌subscript𝑛1𝑛𝜌subscript𝑛1𝑛1𝜌subscript𝑛1\displaystyle=\frac{\rho\frac{n_{1}}{n}}{1-(1-\rho)\frac{n_{1}}{n}}=\frac{\rho n% _{1}}{n-(1-\rho)n_{1}}.= divide start_ARG italic_ρ divide start_ARG italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n end_ARG end_ARG start_ARG 1 - ( 1 - italic_ρ ) divide start_ARG italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n end_ARG end_ARG = divide start_ARG italic_ρ italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG . (31)

Now, Expression 29 becomes:

P(z=1x,s=1)=P(y=1x,s=1)+ρn1n(1ρ)n1P(y=0x,s=1).𝑃𝑧conditional1𝑥𝑠1𝑃𝑦conditional1𝑥𝑠1𝜌subscript𝑛1𝑛1𝜌subscript𝑛1𝑃𝑦conditional0𝑥𝑠1P(z=1\mid x,s=1)=P(y=1\mid x,s=1)+\frac{\rho n_{1}}{n-(1-\rho)n_{1}}P(y=0\mid x% ,s=1).italic_P ( italic_z = 1 ∣ italic_x , italic_s = 1 ) = italic_P ( italic_y = 1 ∣ italic_x , italic_s = 1 ) + divide start_ARG italic_ρ italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG italic_P ( italic_y = 0 ∣ italic_x , italic_s = 1 ) . (32)

Finally, we use Equation 14: P(y=1x,s=1)=eη(x)1+eη(x)𝑃𝑦conditional1𝑥𝑠1superscript𝑒superscript𝜂𝑥1superscript𝑒superscript𝜂𝑥P(y=1\mid x,s=1)=\frac{e^{\eta^{\star}(x)}}{1+e^{\eta^{\star}(x)}}italic_P ( italic_y = 1 ∣ italic_x , italic_s = 1 ) = divide start_ARG italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT end_ARG start_ARG 1 + italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT end_ARG, where η=η(x)+log((1ρ)n1n(1ρ)n1)log((1ρ)ζ1(1ρ)ζ)superscript𝜂𝜂𝑥1𝜌subscript𝑛1𝑛1𝜌subscript𝑛11𝜌𝜁11𝜌𝜁{\eta^{\star}=\eta(x)+\log\left(\frac{(1-\rho)n_{1}}{n-(1-\rho)n_{1}}\right)-% \log\left(\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right)}italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT = italic_η ( italic_x ) + roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ) - roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG ). The contribution of one instance to the likelihood is then:

P(z=1x,s=1)𝑃𝑧conditional1𝑥𝑠1\displaystyle P(z=1\mid x,s=1)italic_P ( italic_z = 1 ∣ italic_x , italic_s = 1 ) =eη(x)1+eη(x)+(ρn1n(1ρ)n1)11+eη(x)absentsuperscript𝑒superscript𝜂𝑥1superscript𝑒superscript𝜂𝑥𝜌subscript𝑛1𝑛1𝜌subscript𝑛111superscript𝑒superscript𝜂𝑥\displaystyle=\frac{e^{\eta^{\star}(x)}}{1+e^{\eta^{\star}(x)}}+\left(\frac{% \rho n_{1}}{n-(1-\rho)n_{1}}\right)\frac{1}{1+e^{\eta^{\star}(x)}}= divide start_ARG italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT end_ARG start_ARG 1 + italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT end_ARG + ( divide start_ARG italic_ρ italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ) divide start_ARG 1 end_ARG start_ARG 1 + italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT end_ARG (33)
=eη(x)+ρn1n(1ρ)n11+eη(x),absentsuperscript𝑒superscript𝜂𝑥𝜌subscript𝑛1𝑛1𝜌subscript𝑛11superscript𝑒superscript𝜂𝑥\displaystyle=\frac{e^{\eta^{\star}(x)}+\frac{\rho n_{1}}{n-(1-\rho)n_{1}}}{1+% e^{\eta^{\star}(x)}},= divide start_ARG italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT + divide start_ARG italic_ρ italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG end_ARG start_ARG 1 + italic_e start_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT end_ARG , (34)

as desired, and P(z=0x,s=1)=1P(z=1x,s=1)𝑃𝑧conditional0𝑥𝑠11𝑃𝑧conditional1𝑥𝑠1P(z=0\mid x,s=1)=1-P(z=1\mid x,s=1)italic_P ( italic_z = 0 ∣ italic_x , italic_s = 1 ) = 1 - italic_P ( italic_z = 1 ∣ italic_x , italic_s = 1 ).

B.4 Maximization step: the full likelihood

In the maximization step of the EM, we need to maximize the full likelihood using the current expected values of the unknown labels y𝑦yitalic_y. We will show that the full likelihood P(y,zX)P(yX)proportional-to𝑃𝑦conditional𝑧𝑋𝑃conditional𝑦𝑋P(y,z\mid X)\propto P(y\mid X)italic_P ( italic_y , italic_z ∣ italic_X ) ∝ italic_P ( italic_y ∣ italic_X ); this is what allows us to optimize the usual binomial log likelihood at every step of the EM. Suppose we know both y𝑦yitalic_y and z𝑧zitalic_z. Then, the full likelihood is:

L(ηz,y,X)𝐿conditional𝜂𝑧𝑦𝑋\displaystyle L(\eta\mid z,y,X)italic_L ( italic_η ∣ italic_z , italic_y , italic_X ) =iP(y=yi,z=zisi=1,xi)\displaystyle=\prod_{i}P(y=y_{i},z=z_{i}\mid s_{i}=1,x_{i})= ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_P ( italic_y = italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_z = italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) (35)
=iP(z=zi,y=yi,si=1,xi)P(y=yisi=1,xi)\displaystyle=\prod_{i}P(z=z_{i},\mid y=y_{i},s_{i}=1,x_{i})P(y=y_{i}\mid s_{i% }=1,x_{i})= ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_P ( italic_z = italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , ∣ italic_y = italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) italic_P ( italic_y = italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) (36)
=iP(z=zi,y=yi,si=1)P(y=yisi=1,xi)\displaystyle=\prod_{i}P(z=z_{i},\mid y=y_{i},s_{i}=1)P(y=y_{i}\mid s_{i}=1,x_% {i})= ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_P ( italic_z = italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , ∣ italic_y = italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ) italic_P ( italic_y = italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) (37)
iP(y=yisi=1,xi).\displaystyle\propto\prod_{i}P(y=y_{i}\mid s_{i}=1,x_{i}).∝ ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_P ( italic_y = italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) . (38)

To go from Expression 36 to Expression 37, we use our assumption that P(zy,s=1,x)=P(zy,s=1)𝑃conditional𝑧𝑦𝑠1𝑥𝑃conditional𝑧𝑦𝑠1P(z\mid y,s=1,x)=P(z\mid y,s=1)italic_P ( italic_z ∣ italic_y , italic_s = 1 , italic_x ) = italic_P ( italic_z ∣ italic_y , italic_s = 1 ): there is no systematic difference between cells from healthy and sick people, other than the fact that some cells from sick people are diseased. Now, we show that P(z=ziy=yi,s=1)P(z=z_{i}\mid y=y_{i},s=1)italic_P ( italic_z = italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_y = italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_s = 1 ) is a constant, based on our assumptions of the values of P(z=1)𝑃𝑧1P(z=1)italic_P ( italic_z = 1 ) and P(y=yiz=zi)𝑃𝑦conditionalsubscript𝑦𝑖𝑧subscript𝑧𝑖{P(y=y_{i}\mid z=z_{i})}italic_P ( italic_y = italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_z = italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). There are three cases:

  1. 1.

    P(z=1y=1,s=1)=1P(z=1\mid y=1,s=1)=1italic_P ( italic_z = 1 ∣ italic_y = 1 , italic_s = 1 ) = 1

  2. 2.

    P(z=1y=0,s=1)=ρn1nn1(1ρ)P(z=1\mid y=0,s=1)=\frac{\rho n_{1}}{n-n_{1}(1-\rho)}italic_P ( italic_z = 1 ∣ italic_y = 0 , italic_s = 1 ) = divide start_ARG italic_ρ italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( 1 - italic_ρ ) end_ARG (as in Equation 31)

  3. 3.

    P(z=0y=0,s=1)=nn1nn1(1ρ)P(z=0\mid y=0,s=1)=\frac{n-n_{1}}{n-n_{1}(1-\rho)}italic_P ( italic_z = 0 ∣ italic_y = 0 , italic_s = 1 ) = divide start_ARG italic_n - italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( 1 - italic_ρ ) end_ARG.

We note that the fourth case, P(z=0y=1,s=1)P(z=0\mid y=1,s=1)italic_P ( italic_z = 0 ∣ italic_y = 1 , italic_s = 1 ), is not possible under our assumption that healthy people have no diseased cells. Therefore, to find η𝜂\etaitalic_η that maximizes the full likelihood, we only need to find η𝜂\etaitalic_η that maximizes the binomial log likelihood.

Appendix C Modification for learning algorithms that do not allow soft labels

For learning algorithms that do not allow “soft” labels, do the following. This is as described in the body of this text.

  1. 1.

    Augment your dataset.

    1. (a)

      Make a new covariate matrix X~~𝑋\tilde{X}over~ start_ARG italic_X end_ARG containing one copy of cells from healthy people, and two copies of cells from sick people.

    2. (b)

      Make a cell label vector y~~𝑦\tilde{y}over~ start_ARG italic_y end_ARG that is 00 for cells from healthy people. One copy of each cell from sick people should be labeled 1111 and the other 00.

    3. (c)

      Make a weight label vector w(0)superscript𝑤0w^{(0)}italic_w start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT that is 1111 for cells from healthy people, ρ𝜌\rhoitalic_ρ for cells labeled 00 from sick people, and 1ρ1𝜌1-\rho1 - italic_ρ for cells labeled 1111 from sick people.

  2. 2.

    Iterate to convergence. At the ithsuperscript𝑖thi^{\text{th}}italic_i start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT step:

    1. (a)

      Maximization step: Fit a model η(i)(x)superscript𝜂absent𝑖𝑥\eta^{\star(i)}(x)italic_η start_POSTSUPERSCRIPT ⋆ ( italic_i ) end_POSTSUPERSCRIPT ( italic_x ) using X𝑋Xitalic_X, y𝑦yitalic_y and w(i1)superscript𝑤𝑖1w^{(i-1)}italic_w start_POSTSUPERSCRIPT ( italic_i - 1 ) end_POSTSUPERSCRIPT. Make the intercept adjustment to obtain

      η(i)(x)=η(i)(x)+log((1ρ)n1n(1ρ)n1)log((1ρ)ζ1(1ρ)ζ).superscript𝜂𝑖𝑥superscript𝜂absent𝑖𝑥1𝜌subscript𝑛1𝑛1𝜌subscript𝑛11𝜌𝜁11𝜌𝜁\eta^{(i)}(x)=\eta^{\star(i)}(x)+\log\left(\frac{(1-\rho)n_{1}}{n-(1-\rho)n_{1% }}\right)-\log\left(\frac{(1-\rho)\zeta}{1-(1-\rho)\zeta}\right).italic_η start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ( italic_x ) = italic_η start_POSTSUPERSCRIPT ⋆ ( italic_i ) end_POSTSUPERSCRIPT ( italic_x ) + roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_n - ( 1 - italic_ρ ) italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ) - roman_log ( divide start_ARG ( 1 - italic_ρ ) italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG ) .
    2. (b)

      Expectation step: Use η(i)(x)superscript𝜂𝑖𝑥\eta^{(i)}(x)italic_η start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ( italic_x ) to define w(i)superscript𝑤𝑖w^{(i)}italic_w start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT. Cells from healthy people have w(i)=1superscript𝑤𝑖1w^{(i)}=1italic_w start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = 1. Cell j𝑗jitalic_j with label y=1𝑦1y=1italic_y = 1 from a sick person has weight:

      logit wj(i)=η(i)(xj)log(ρζ1(1ρ)ζ).logit subscriptsuperscript𝑤𝑖𝑗superscript𝜂𝑖subscript𝑥𝑗𝜌𝜁11𝜌𝜁\text{logit }w^{(i)}_{j}=\eta^{(i)}(x_{j})-\log\left(\frac{\rho\zeta}{1-(1-% \rho)\zeta}\right).logit italic_w start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_η start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) - roman_log ( divide start_ARG italic_ρ italic_ζ end_ARG start_ARG 1 - ( 1 - italic_ρ ) italic_ζ end_ARG ) .

      The weight for the corresponding row with label y=0𝑦0y=0italic_y = 0 has weight 1wj(i)1subscriptsuperscript𝑤𝑖𝑗1-w^{(i)}_{j}1 - italic_w start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT.

References

  • Dempster et al. [1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society: series B (methodological), 39(1):1–22, 1977.
  • Dworzak et al. [1999] Michael N Dworzak, Frank Stolz, Gertraud Fröschl, Dieter Printz, Traudl Henn, Susanna Fischer, Christine Fleischer, Oskar A Haas, Gerhard Fritsch, Helmut Gadner, et al. Detection of residual disease in pediatric b-cell precursor acute lymphoblastic leukemia by comparative phenotype mapping: a study of five cases controlled by genetic methods. Experimental hematology, 27(4):673–681, 1999.
  • Kruse et al. [2020] A. Kruse, N. Abdel-Azim, H. N. Kim, Y. Ruan, V. Phan, H. Ogana, W. Wang, R. Lee, E. J. Gang, S. Khazal, and Y. M. Kim. Minimal Residual Disease Detection in Acute Lymphoblastic Leukemia. Int J Mol Sci, 21(3), Feb 2020.
  • Chen and Wood [2017] Xueyan Chen and Brent L. Wood. Monitoring minimal residual disease in acute leukemia: Technical challenges and interpretive complexities. Blood Reviews, 31(2):63–75, 2017. ISSN 0268-960X. doi: 10.1016/j.blre.2016.09.006. URL https://doi.org/10.1016/j.blre.2016.09.006.
  • Tsai et al. [2020] A. G. Tsai, D. R. Glass, M. Juntilla, F. J. Hartmann, J. S. Oak, S. Fernandez-Pol, R. S. Ohgami, and S. C. Bendall. Multiplexed single-cell morphometry for hematopathology diagnostics. Nature Medicine, 26(3):408–417, Mar 2020. doi: 10.1038/s41591-020-0783-x.
  • Hodes et al. [2019] Aaron Hodes, Katherine R. Calvo, Alina Dulau, Irina Maric, Junfeng Sun, and Raul Braylan. The challenging task of enumerating blasts in the bone marrow. Seminars in Hematology, 56(1):58–64, 2019. ISSN 0037-1963. doi: https://doi.org/10.1053/j.seminhematol.2018.07.001. URL https://www.sciencedirect.com/science/article/pii/S0037196318300568.
  • Good et al. [2018] Z. Good, J. Sarno, A. Jager, N. Samusik, N. Aghaeepour, E. F. Simonds, L. White, N. J. Lacayo, W. J. Fantl, G. Fazio, G. Gaipa, A. Biondi, R. Tibshirani, S. C. Bendall, G. P. Nolan, and K. L. Davis. Single-cell developmental classification of B cell precursor acute lymphoblastic leukemia at diagnosis reveals predictors of relapse. Nat Med, 24(4):474–483, May 2018.
  • Jager et al. [2021] A. Jager, J. Sarno, and K. L. Davis. Mass Cytometry of Hematopoietic Cells. Methods Mol Biol, 2185:65–76, 2021.
  • Coustan-Smith et al. [2002] E. Coustan-Smith, J. Sancho, M. L. Hancock, B. I. Razzouk, R. C. Ribeiro, G. K. Rivera, J. E. Rubnitz, J. T. Sandlund, C. H. Pui, and D. Campana. Use of peripheral blood instead of bone marrow to monitor residual disease in children with acute lymphoblastic leukemia. Blood, 100(7):2399–2402, Oct 2002.
  • van der Does-van den Berg et al. [1992] A. van der Does-van den Berg, C. R. Bartram, G. Basso, Y. C. Benoit, A. Biondi, K. M. Debatin, O. A. Haas, J. Harbott, W. A. Kamps, and U. ller. Minimal requirements for the diagnosis, classification, and evaluation of the treatment of childhood acute lymphoblastic leukemia (ALL) in the ”BFM Family” Cooperative Group. Med Pediatr Oncol, 20(6):497–505, 1992.
  • McInnes et al. [2018] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426, 2018. URL http://arxiv.org/abs/1802.03426.
  • Yakimovich et al. [2021] A. Yakimovich, A. Beaugnon, Y. Huang, and E. Ozkirimli. Labels in a haystack: Approaches beyond supervised learning in biomedical applications. Patterns (N Y), 2(12):100383, Dec 2021. doi: 10.1016/j.patter.2021.100383.
  • Northcutt et al. [2020] C. G. Northcutt, A. Athalye, and J. Lin. Pervasive label errors in ml benchmark test sets, consequences, and benefits. NeurIPS 2020 Workshop on Security and Data Curation Workshop, 2020.
  • Bernhardt et al. [2022] M. Bernhardt, D.C. Castro, and R. et al. Tanno. Active label cleaning for improved dataset quality under resource constraints. Nature Communications, 13:1161, 2022. doi: 10.1038/s41467-022-28818-3. URL https://doi.org/10.1038/s41467-022-28818-3.
  • Sarno et al. [2023] J. Sarno, P. Domizi, Y. Liu, M. Merchant, C. B. Pedersen, D. Jedoui, A. Jager, G. P. Nolan, G. Gaipa, S. C. Bendall, F. A. Bava, and K. L. Davis. Dasatinib overcomes glucocorticoid resistance in B-cell acute lymphoblastic leukemia. Nat Commun, 14(1):2935, May 2023.
  • van Dongen et al. [2012] J. J. van Dongen, L. Lhermitte, S. ttcher, J. Almeida, V. H. van der Velden, J. Flores-Montero, A. Rawstron, V. Asnafi, Q. crevisse, P. Lucio, E. Mejstrikova, T. ski, T. Kalina, R. de Tute, M. ggemann, L. Sedek, M. Cullen, A. W. Langerak, A. a, E. Macintyre, M. Martin-Ayuso, O. Hrusak, M. B. Vidriales, and A. Orfao. EuroFlow antibody panels for standardized n-dimensional flow cytometric immunophenotyping of normal, reactive and malignant leukocytes. Leukemia, 26(9):1908–1975, Sep 2012.
  • Pui et al. [2003] C. H. Pui, J. M. Chessells, B. Camitta, A. Baruchel, A. Biondi, J. M. Boyett, A. Carroll, O. B. Eden, W. E. Evans, H. Gadner, J. Harbott, D. O. Harms, C. J. Harrison, P. L. Harrison, N. Heerema, G. Janka-Schaub, W. Kamps, G. Masera, J. Pullen, S. C. Raimondi, S. Richards, H. Riehm, S. Sallan, H. Sather, J. Shuster, L. B. Silverman, M. G. Valsecchi, E. Vilmer, Y. Zhou, P. S. Gaynon, and M. Schrappe. Clinical heterogeneity in childhood acute lymphoblastic leukemia with 11q23 rearrangements. Leukemia, 17(4):700–706, Apr 2003.
  • Veltroni et al. [2003] M. Veltroni, L. De Zen, M. C. Sanzari, O. Maglia, M. N. Dworzak, R. Ratei, A. Biondi, G. Basso, and G. Gaipa. Expression of CD58 in normal, regenerating and leukemic bone marrow B cells: implications for the detection of minimal residual disease in acute lymphocytic leukemia. Haematologica, 88(11):1245–1252, Nov 2003.
  • Borowitz et al. [2005] M. J. Borowitz, D. J. Pullen, N. Winick, P. L. Martin, W. P. Bowman, and B. Camitta. Comparison of diagnostic and relapse flow cytometry phenotypes in childhood acute lymphoblastic leukemia: implications for residual disease detection: a report from the children’s oncology group. Cytometry B Clin Cytom, 68(1):18–24, Nov 2005.
  • Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
  • Amores [2013] Jaume Amores. Multiple instance classification: Review, taxonomy and comparative study. Artificial intelligence, 201:81–105, 2013.
  • Andrews et al. [2002] Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multiple-instance learning. Advances in neural information processing systems, 15, 2002.
  • Chen et al. [2017] Ping-Yang Chen, Ching-Chuan Chen, Chun-Hao Yang, Sheng-Mao Chang, and Kuo-Jung Lee. milr: Multiple-instance logistic regression with lasso penalty. R J., 9(1):446, 2017.
  • Gadermayr and Tschuchnig [2022] Michael Gadermayr and Maximilian Tschuchnig. Multiple instance learning for digital pathology: A review on the state-of-the-art, limitations & future potential. arXiv preprint arXiv:2206.04425, 2022.
  • Bahdanau et al. [2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Ward et al. [2009] Gill Ward, Trevor Hastie, Simon Barry, Jane Elith, and John R Leathwick. Presence-only data and the em algorithm. Biometrics, 65(2):554–563, 2009.
  • Ray and Craven [2005] Soumya Ray and Mark Craven. Supervised versus multiple instance learning: An empirical comparison. Proceedings of the 22nd international conference on Machine learning, pages 697–704, 2005.
  • Carbonneau et al. [2018] Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329–353, 2018.
  • Muhlenbach et al. [2004] Fabrice Muhlenbach, Stéphane Lallich, and Djamel A Zighed. Identifying and handling mislabelled instances. Journal of Intelligent Information Systems, 22(1):89–109, 2004.
  • Natarajan et al. [2013] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. Advances in neural information processing systems, 26, 2013.
  • Song et al. [2022] Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • Kent and Liou [2022] Sean Kent and Yifei Liou. mildsvm: Multiple-Instance Learning with Support Vector Machines, 2022. URL https://CRAN.R-project.org/package=mildsvm. R package version 0.4.0.
  • Hastie and Fithian [2013] Trevor Hastie and Will Fithian. Inference from presence-only data; the ongoing controversy. Ecography, 36(8):864–867, 2013.
  • Keyes et al. [2023] T. J. Keyes, A. Koladiya, Y.-C. Lo, G. P. Nolan, and K. L. Davis. tidytof: a user-friendly framework for scalable and reproducible high-dimensional cytometry data analysis. Bioinform Adv, 3:vbad071, 2023.
  • Qiu et al. [2011] P. Qiu, E. F. Simonds, S. C. Bendall, K. D. Gibbs, R. V Bruggner, M. D. Linderman, and S. K. Plevritis. Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE. Nature Biotechnology, 29(10):886–891, 2011. doi: 10.1038/nbt.1991.
  • Fienberg et al. [2012] H. G. Fienberg, E. F. Simonds, W. J. Fantl, G. P. Nolan, and B. Bodenmiller. A platinum-based covalent viability reagent for single-cell mass cytometry. Cytometry A, 81(6):467–475, Jun 2012.