Negative binomial mixture model for identification of noise in antibody-antigen specificity predictions from single-cell data

Perry T Wasdin; Alexandra A Abu-Shmais; Michael W Irvin; Matthew J Vukovich; Ivelin S Georgiev

doi:10.1093/bioadv/vbae170

Bioinform Adv. 2024 Dec 4;4(1):vbae170. doi: 10.1093/bioadv/vbae170. eCollection 2024.

Authors

Perry T Wasdin¹ ² ³, Alexandra A Abu-Shmais³ ⁴, Michael W Irvin⁵, Matthew J Vukovich³ ⁴, Ivelin S Georgiev¹ ² ³ ⁴ ⁶ ⁷ ⁸

Affiliations

¹ Program in Chemical and Physical Biology, Vanderbilt University Medical Center, Nashville, TN, 37232, United States.
² Center for Computational Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, TN, 37232, United States.
³ Vanderbilt Vaccine Center, Vanderbilt University Medical Center, Nashville, TN, 37232, United States.
⁴ Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, TN, 37232, United States.
⁵ Multiscale Modeling Group, Computational Biology Hub, Altos Laboratories, Redwood City, CA, 94065, United States.
⁶ Department of Computer Science, Vanderbilt University, Nashville, TN, 37232, United States.
⁷ Vanderbilt Institute for Infection, Immunology and Inflammation, Vanderbilt University Medical Center, Nashville, TN, 37232, United States.
⁸ Center for Structural Biology, Vanderbilt University, Nashville, TN, 37232, United States.

Abstract

Motivation: LIBRA-seq (linking B cell receptor to antigen specificity by sequencing) provides a powerful tool for interrogating the antigen-specific B cell compartment and identifying antibodies against antigen targets of interest. Identification of noise in single-cell B cell receptor sequencing data, such as LIBRA-seq, is critical for improving antigen binding predictions for downstream applications including antibody discovery and machine learning technologies.

Results: In this study, we present a method for denoising LIBRA-seq data by clustering antigen counts into signal and noise components with a negative binomial mixture model. This approach leverages single-cell sequencing reads from a large, multi-donor dataset described in a recent LIBRA-seq study to develop a data-driven means for identification of technical noise. We apply this method to nine donors representing separate LIBRA-seq experiments and show that our approach provides improved predictions for in vitro antibody-antigen binding when compared to the standard scoring method, despite variance in data size and noise structure across samples. This development will improve the ability of LIBRA-seq to identify antigen-specific B cells and contribute to providing more reliable datasets for machine learning based approaches as the corpus of single-cell B cell sequencing data continues to grow.
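The idea of clustering antigen counts into noise and signal components can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked under Availability and implementation). It assumes a two-component negative binomial mixture parameterized by mean and dispersion, and it replaces the exact maximum-likelihood M-step with a simpler weighted moment-matching update. The function names (`nb_logpmf`, `fit_nb_mixture`, `posterior_signal`) and the toy counts are hypothetical and synthetic, chosen only for illustration.

```python
import math

def nb_logpmf(k, mu, r):
    """Log-probability of count k under a negative binomial with mean mu, dispersion r."""
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))

def responsibilities(counts, weights, mus, rs):
    """E-step: posterior probability of each component for every count."""
    out = []
    for k in counts:
        logs = [math.log(w) + nb_logpmf(k, m, r)
                for w, m, r in zip(weights, mus, rs)]
        top = max(logs)  # log-sum-exp for numerical stability
        norm = top + math.log(sum(math.exp(v - top) for v in logs))
        out.append([math.exp(v - norm) for v in logs])
    return out

def fit_nb_mixture(counts, n_iter=200, r_cap=1e4):
    """Fit a noise (index 0) / signal (index 1) mixture with an EM-style loop.

    Assumption: the M-step uses weighted moment matching, not exact MLE.
    """
    ordered = sorted(counts)
    half = len(ordered) // 2
    # Initialize the noise component on the lower half, signal on the upper half.
    mus = [max(sum(ordered[:half]) / half, 1e-3),
           max(sum(ordered[half:]) / (len(ordered) - half), 1e-3)]
    rs = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        resp = responsibilities(counts, weights, mus, rs)
        for j in range(2):
            total = max(sum(row[j] for row in resp), 1e-12)
            mu = sum(row[j] * k for row, k in zip(resp, counts)) / total
            var = sum(row[j] * (k - mu) ** 2 for row, k in zip(resp, counts)) / total
            mu = max(mu, 1e-3)
            # NB variance is mu + mu^2 / r, so r = mu^2 / (var - mu) when overdispersed.
            r = mu * mu / (var - mu) if var > mu else r_cap
            mus[j] = mu
            rs[j] = min(max(r, 1e-3), r_cap)
            weights[j] = max(total / len(counts), 1e-9)
    return {"weights": weights, "mus": mus, "rs": rs}

def posterior_signal(k, model):
    """Probability that a count k belongs to the signal component."""
    return responsibilities([k], model["weights"], model["mus"], model["rs"])[0][1]

# Synthetic toy counts for one antigen: mostly low background plus a few high values.
counts = [0, 1, 0, 2, 1, 0, 3, 1, 2, 0, 1, 1, 0, 2, 4, 1, 0,
          150, 220, 180, 300, 95, 260]
model = fit_nb_mixture(counts)
for k in (1, 200):
    print(k, round(posterior_signal(k, model), 3))
```

In this sketch a cell's count would be treated as noise when its posterior signal probability falls below a chosen threshold such as 0.5. The threshold and the per-antigen fitting are assumptions of the example, not details stated in the abstract.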

Availability and implementation: All data and code are available at https://github.com/IGlab-VUMC/mixture_model_denoising.