Procrustes is a machine-learning approach that removes cross-platform batch effects from clinical RNA sequencing data

Nikita Kotlov; Kirill Shaposhnikov; Cagdas Tazearslan; Madison Chasse; Artur Baisangurov; Svetlana Podsvirova; Dawn Fernandez; Mary Abdou; Leznath Kaneunyenye; Kelley Morgan; Ilya Cheremushkin; Pavel Zemskiy; Maxim Chelushkin; Maria Sorokina; Ekaterina Belova; Svetlana Khorkova; Yaroslav Lozinsky; Katerina Nuzhdina; Elena Vasileva; Dmitry Kravchenko; Kushal Suryamohan; Krystle Nomie; John Curran; Nathan Fowler; Alexander Bagaev

doi:10.1038/s42003-024-06020-z

Commun Biol. 2024 Mar 30;7(1):392. doi: 10.1038/s42003-024-06020-z.

Authors

Nikita Kotlov^#¹, Kirill Shaposhnikov^#¹, Cagdas Tazearslan¹, Madison Chasse¹, Artur Baisangurov¹, Svetlana Podsvirova¹, Dawn Fernandez¹, Mary Abdou¹, Leznath Kaneunyenye¹, Kelley Morgan¹, Ilya Cheremushkin¹, Pavel Zemskiy¹, Maxim Chelushkin¹, Maria Sorokina¹, Ekaterina Belova¹, Svetlana Khorkova¹, Yaroslav Lozinsky¹, Katerina Nuzhdina¹, Elena Vasileva¹, Dmitry Kravchenko¹, Kushal Suryamohan¹, Krystle Nomie¹, John Curran¹, Nathan Fowler², Alexander Bagaev¹

Affiliations

¹ BostonGene, Corp., Waltham, MA, 02453, USA.
² BostonGene, Corp., Waltham, MA, 02453, USA. [email protected].

^# Contributed equally.

Abstract

With the increased use of gene expression profiling for personalized oncology, optimized RNA sequencing (RNA-seq) protocols and algorithms are necessary to provide comparable expression measurements between exome capture (EC)-based and poly-A RNA-seq. Here, we developed and optimized an EC-based protocol for processing formalin-fixed, paraffin-embedded samples and a machine-learning algorithm, Procrustes, to overcome batch effects across RNA-seq data obtained using different sample preparation protocols, such as EC-based and poly-A RNA-seq. When Procrustes was applied to samples processed using EC and poly-A RNA-seq protocols, the expression of 61% of genes (N = 20,062) correlated across the two protocols (concordance correlation coefficient > 0.8, versus 26% before transformation by Procrustes), including 84% of cancer-specific and cancer microenvironment-related genes (N = 1,438; versus 36% before applying Procrustes). Benchmarking analyses also showed that Procrustes outperformed other batch correction methods. Finally, we showed that Procrustes can project RNA-seq data for a single sample to a larger cohort of RNA-seq data. Future application of Procrustes will enable direct gene expression analysis for single tumor samples to support gene expression-based treatment decisions.
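
The agreement metric quoted in the abstract, the share of genes with a concordance correlation coefficient above 0.8, can be illustrated with a minimal sketch. The abstract does not give the formula or the data layout, so this assumes Lin's concordance correlation coefficient with population (biased) variances and a samples-by-genes matrix of paired measurements. The function names `concordance_ccc` and `fraction_concordant` are hypothetical and are not taken from the paper or from any Procrustes code.

```python
import numpy as np


def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient for two paired vectors.

    Assumption: population (ddof=0) variances and covariance are used.
    Returns NaN when both vectors are constant and equal in mean,
    because the denominator is then zero.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    cov_xy = np.mean((x - mean_x) * (y - mean_y))
    denom = x.var() + y.var() + (mean_x - mean_y) ** 2
    if denom == 0:
        return float("nan")
    return float(2.0 * cov_xy / denom)


def fraction_concordant(expr_a, expr_b, threshold=0.8):
    """Fraction of genes whose CCC across paired samples exceeds threshold.

    Assumption: expr_a and expr_b have shape (n_samples, n_genes) and
    row i of each matrix comes from the same sample under two protocols.
    Genes with an undefined (NaN) CCC count as not concordant.
    """
    expr_a = np.asarray(expr_a, dtype=float)
    expr_b = np.asarray(expr_b, dtype=float)
    cccs = np.array([
        concordance_ccc(expr_a[:, g], expr_b[:, g])
        for g in range(expr_a.shape[1])
    ])
    return float(np.mean(cccs > threshold))
```

Unlike the Pearson correlation, this coefficient penalizes a constant offset between protocols: the vectors `[1, 2, 3]` and `[2, 3, 4]` are perfectly linearly correlated, yet their concordance under this definition is 4/7. This is why it suits checking whether expression values are directly comparable after batch correction.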

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Gene Expression Profiling* / methods
Humans
Machine Learning
RNA* / genetics
Sequence Analysis, RNA / methods
Tissue Fixation / methods

Substances

RNA