Enhanced protein isoform characterization through long-read proteogenomics

Rachel M Miller; Ben T Jordan; Madison M Mehlferber; Erin D Jeffery; Christina Chatzipantsiou; Simi Kaur; Robert J Millikin; Yunxiang Dai; Simone Tiberi; Peter J Castaldi; Michael R Shortreed; Chance John Luckey; Ana Conesa; Lloyd M Smith; Anne Deslattes Mays; Gloria M Sheynkman

doi:10.1186/s13059-022-02624-y

Genome Biol. 2022 Mar 3;23(1):69. doi: 10.1186/s13059-022-02624-y.

Authors

Rachel M Miller¹, Ben T Jordan², Madison M Mehlferber²,³, Erin D Jeffery², Christina Chatzipantsiou⁴, Simi Kaur¹, Robert J Millikin¹, Yunxiang Dai¹, Simone Tiberi⁵,⁶, Peter J Castaldi⁷,⁸, Michael R Shortreed¹, Chance John Luckey⁹, Ana Conesa¹⁰,¹¹, Lloyd M Smith¹, Anne Deslattes Mays¹², Gloria M Sheynkman¹³,¹⁴,¹⁵

Affiliations

¹ Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA.
² Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA.
³ Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA.
⁴ Lifebit Biotech LTD., London, UK.
⁵ Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland.
⁶ Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland.
⁷ Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA.
⁸ Division of General Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, USA.
⁹ Department of Pathology, University of Virginia, Charlottesville, VA, USA.
¹⁰ Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain.
¹¹ Microbiology and Cell Science Department, Institute for Food and Agricultural Sciences, University of Florida, Gainesville, FL, USA.
¹² Office of Data Science and Sharing, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Rockville, MD, USA.
¹³ Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA.
¹⁴ Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA.
¹⁵ UVA Cancer Center, University of Virginia, Charlottesville, VA, USA.

Abstract

Background: The detection of physiologically relevant protein isoforms encoded by the human genome is critical to biomedicine. Mass spectrometry (MS)-based proteomics is the preeminent method for protein detection, but isoform-resolved proteomic analysis relies on accurate reference databases that match the sample; neither a subset nor a superset database is ideal. Long-read RNA sequencing (e.g., PacBio or Oxford Nanopore) provides full-length transcripts which can be used to predict full-length protein isoforms.

Results: We describe here a long-read proteogenomics approach for integrating sample-matched long-read RNA-seq and MS-based proteomics data to enhance isoform characterization. We introduce a classification scheme for protein isoforms, discover novel protein isoforms, and present the first protein inference algorithm for the direct incorporation of long-read transcriptome data to enable detection of protein isoforms previously intractable to MS-based detection. We have released an open-source Nextflow pipeline that integrates long-read sequencing in a proteomic workflow for isoform-resolved analysis.
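To make the idea of isoform-resolved detection concrete, the following is a minimal sketch of how peptides can be mapped back to predicted isoforms to find those with isoform-specific evidence. It is not the authors' protein inference algorithm or any part of their Nextflow pipeline: the isoform names and sequences are invented for illustration, the function names are hypothetical, and the digestion rule is a simplified trypsin model (cleave after K or R unless followed by P, no missed cleavages).

```python
import re
from collections import defaultdict


def tryptic_peptides(protein, min_len=6):
    """Simplified in-silico trypsin digest: cleave after K/R unless followed by P."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    # Drop short fragments (and the empty string left by a C-terminal K/R).
    return {p for p in pieces if len(p) >= min_len}


def peptide_index(isoforms):
    """Map each peptide to the set of isoforms whose sequence produces it."""
    index = defaultdict(set)
    for name, seq in isoforms.items():
        for pep in tryptic_peptides(seq):
            index[pep].add(name)
    return index


def isoforms_with_unique_evidence(isoforms, observed):
    """Return isoforms supported by at least one observed peptide found in no other isoform."""
    index = peptide_index(isoforms)
    supported = set()
    for pep in observed:
        owners = index.get(pep, set())
        if len(owners) == 1:
            supported |= owners
    return supported


# Hypothetical sample-specific database: two isoforms of one gene that
# differ by an internal segment (toy sequences, not real proteins).
isoforms = {
    "iso_A": "MSTAVLENPGLGRLSDSQWEVTPEAMAKHGGTLLER",
    "iso_B": "MSTAVLENPGLGRLSDSQHGGTLLER",
}

# Hypothetical peptides identified by MS: one shared, one spanning the iso_B junction.
observed = {"MSTAVLENPGLGR", "LSDSQHGGTLLER"}
print(isoforms_with_unique_evidence(isoforms, observed))
```

In this toy case only the junction-spanning peptide distinguishes the two isoforms. The shared N-terminal peptide alone could not tell them apart, which is why the content of the reference database determines which isoforms are detectable at all.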

Conclusions: Our work suggests that the incorporation of long-read sequencing and proteomic data can facilitate improved characterization of human protein isoform diversity. Our first-generation pipeline provides a strong foundation for future development of long-read proteogenomics and its adoption for both basic and translational research.

Keywords: Alternative splicing; Iso-Seq; Lifebit CloudOS; Long-read RNA-seq; Mass spectrometry-based proteomics; Nextflow; PacBio; Protein inference; Proteogenomics; SQANTI.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Alternative Splicing
Humans
Protein Isoforms / genetics
Proteogenomics*
Proteomics
Sequence Analysis, RNA / methods
Transcriptome

Substances

Protein Isoforms
