Classification of non-TCGA cancer samples to TCGA molecular subtypes using compact feature sets

Kyle Ellrott; Christopher K Wong; Christina Yau; Mauro A A Castro; Jordan A Lee; Brian J Karlberg; Jasleen K Grewal; Vincenzo Lagani; Bahar Tercan; Verena Friedl; Toshinori Hinoue; Vladislav Uzunangelov; Lindsay Westlake; Xavier Loinaz; Ina Felau; Peggy I Wang; Anab Kemal; Samantha J Caesar-Johnson; Ilya Shmulevich; Alexander J Lazar; Ioannis Tsamardinos; Katherine A Hoadley; Cancer Genome Atlas Analysis Network; A Gordon Robertson; Theo A Knijnenburg; Christopher C Benz; Joshua M Stuart; Jean C Zenklusen; Andrew D Cherniack; Peter W Laird

doi:10.1016/j.ccell.2024.12.002

Cancer Cell. 2024 Dec 30:S1535-6108(24)00477-X. doi: 10.1016/j.ccell.2024.12.002. Online ahead of print.

Authors

Kyle Ellrott¹, Christopher K Wong², Christina Yau³, Mauro A A Castro⁴, Jordan A Lee⁵, Brian J Karlberg⁵, Jasleen K Grewal⁶, Vincenzo Lagani⁷, Bahar Tercan⁸, Verena Friedl², Toshinori Hinoue⁹, Vladislav Uzunangelov², Lindsay Westlake¹⁰, Xavier Loinaz¹¹, Ina Felau¹², Peggy I Wang¹², Anab Kemal¹², Samantha J Caesar-Johnson¹², Ilya Shmulevich⁸, Alexander J Lazar¹³, Ioannis Tsamardinos¹⁴, Katherine A Hoadley¹⁵; Cancer Genome Atlas Analysis Network; A Gordon Robertson⁶, Theo A Knijnenburg⁸, Christopher C Benz¹⁶, Joshua M Stuart², Jean C Zenklusen¹², Andrew D Cherniack¹⁷, Peter W Laird¹⁸

Affiliations

¹ Oregon Health and Science University, Portland, OR 97239, USA. Electronic address: [email protected].
² Biomolecular Engineering Department, School of Engineering, University of California, Santa Cruz, Santa Cruz, CA 95064, USA.
³ University of California, San Francisco, Department of Surgery, San Francisco, CA 94158, USA; Buck Institute for Research on Aging, Novato, CA 94945, USA.
⁴ Bioinformatics and Systems Biology Laboratory, Federal University of Paraná, Curitiba, PR 81520-260, Brazil.
⁵ Oregon Health and Science University, Portland, OR 97239, USA.
⁶ Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada.
⁷ JADBio Gnosis DA, GR-700 13 Heraklion, Crete, Greece; Institute of Chemical Biology, Ilia State University, Tbilisi 0162, Georgia.
⁸ Institute for Systems Biology, 401 Terry Avenue North, Seattle, WA 98109, USA.
⁹ Department of Epigenetics, Van Andel Institute, Grand Rapids, MI 49503, USA.
¹⁰ The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA.
¹¹ The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.
¹² Center for Cancer Genomics, National Cancer Institute, Bethesda, MD 20892, USA.
¹³ Departments of Pathology & Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
¹⁴ JADBio Gnosis DA, GR-700 13 Heraklion, Crete, Greece; Department of Computer Science, University of Crete, GR-700 13 Heraklion, Crete, Greece; Institute of Applied and Computational Mathematics, Foundation for Research and Technology Hellas (FORTH), GR-700 13 Heraklion, Crete, Greece.
¹⁵ Department of Genetics, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27519, USA.
¹⁶ Buck Institute for Research on Aging, Novato, CA 94945, USA.
¹⁷ The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA; Harvard Medical School, Boston, MA 02115, USA. Electronic address: [email protected].
¹⁸ Department of Epigenetics, Van Andel Institute, Grand Rapids, MI 49503, USA. Electronic address: [email protected].

PMID: 39753139

Abstract

Molecular subtypes, such as those defined by The Cancer Genome Atlas (TCGA), delineate a cancer's underlying biology and offer hope of informing a patient's prognosis and treatment plan. However, most approaches used to discover subtypes are not suitable for assigning subtype labels to new cancer specimens from other studies or clinical trials. Here, we address this barrier by applying five different machine learning approaches to multi-omic data from 8,791 TCGA tumor samples, comprising 106 subtypes from 26 different cancer cohorts, to build models based on small numbers of features that can classify new samples into previously defined TCGA molecular subtypes, a step toward molecular subtype application in the clinic. We validate selected classifiers using external datasets. Predictive performance and classifier-selected features yield insight into the different machine learning approaches and genomic data platforms. For each cancer and data type, we provide containerized versions of the top-performing models as a public resource.
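
The general idea of a compact-feature classifier can be illustrated with a minimal sketch. It is not one of the five approaches used in the study, which the abstract does not name. As assumptions for illustration only, it ranks features by a simple between-class to within-class variance ratio, keeps the top k, and assigns new samples to the nearest subtype centroid. The data are synthetic, and every name in it (make_toy_data, select_top_k, subtype_A, and so on) is hypothetical.

```python
import random
from statistics import mean, pvariance


def feature_scores(X, y):
    """Score each feature: variance of class means / mean within-class variance."""
    labels = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        groups = [[row[j] for row, lab in zip(X, y) if lab == c] for c in labels]
        between = pvariance([mean(g) for g in groups])
        within = mean(pvariance(g) for g in groups)
        scores.append(between / (within + 1e-12))  # small constant avoids /0
    return scores


def select_top_k(X, y, k):
    """Return the indices of the k highest-scoring features."""
    scores = feature_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def fit_centroids(X, y, features):
    """Compute one centroid per label, restricted to the selected features."""
    centroids = {}
    for c in set(y):
        rows = [row for row, lab in zip(X, y) if lab == c]
        centroids[c] = [mean(r[j] for r in rows) for j in features]
    return centroids


def predict(sample, centroids, features):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((sample[j] - m) ** 2 for j, m in zip(features, centroids[c]))
    return min(centroids, key=dist)


def make_toy_data(n_per_class=30, n_features=50, seed=0):
    """Synthetic stand-in for omic data: only features 0 and 1 carry signal."""
    rng = random.Random(seed)
    shifts = {"subtype_A": (0.0, 0.0), "subtype_B": (4.0, 0.0), "subtype_C": (0.0, 4.0)}
    X, y = [], []
    for label, (s0, s1) in shifts.items():
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_features)]
            row[0] += s0
            row[1] += s1
            X.append(row)
            y.append(label)
    return X, y


# "Training cohort" and an independently drawn "external" set (both synthetic).
X_train, y_train = make_toy_data(seed=0)
X_new, y_new = make_toy_data(seed=1)

selected = select_top_k(X_train, y_train, k=2)       # compact feature set
centroids = fit_centroids(X_train, y_train, selected)
predictions = [predict(row, centroids, selected) for row in X_new]
accuracy = sum(p == t for p, t in zip(predictions, y_new)) / len(y_new)
print("selected features:", selected, "accuracy on new samples:", round(accuracy, 3))
```

Because only the selected feature indices and the per-subtype centroids are needed at prediction time, a new sample can be labeled from a small panel of measurements rather than a full multi-omic profile.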

Keywords: TCGA; artificial intelligence; biomarkers; cancer; classification; epigenomic; genomic; machine learning; molecular; pathology.