Using Machine Learning to Identify True Somatic Variants from Next-Generation Sequencing

Chao Wu; Xiaonan Zhao; Mark Welsh; Kellianne Costello; Kajia Cao; Ahmad Abou Tayoun; Marilyn Li; Mahdi Sarmady

doi:10.1373/clinchem.2019.308213

Clin Chem. 2020 Jan 1;66(1):239-246. doi: 10.1373/clinchem.2019.308213.

Authors

Chao Wu¹, Xiaonan Zhao¹, Mark Welsh¹, Kellianne Costello², Kajia Cao¹, Ahmad Abou Tayoun³, Marilyn Li¹,⁴, Mahdi Sarmady¹,⁵,⁴

Affiliations

¹ Division of Genomic Diagnostics, The Children's Hospital of Philadelphia, Philadelphia, PA.
² College of Science and Technology, Temple University, Philadelphia, PA.
³ Department of Genetics, Al Jalila Children's Specialty Hospital, Dubai, UAE.
⁴ Department of Pathology & Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, PA.
⁵ Center for Data-Driven Discovery in Biomedicine, Children's Hospital of Philadelphia, Philadelphia, PA.

PMID: 31672855
DOI: 10.1373/clinchem.2019.308213

Abstract

Background: Molecular profiling has become essential for tumor risk stratification and treatment selection. However, cancer genome complexity and technical artifacts make identification of real variants a challenge. Currently, clinical laboratories rely on manual screening, which is costly, subjective, and not scalable. We present a machine learning-based method to distinguish artifacts from bona fide single-nucleotide variants (SNVs) detected by next-generation sequencing from non-formalin-fixed paraffin-embedded (non-FFPE) tumor specimens.

Methods: A cohort of 11278 SNVs identified through clinical sequencing of tumor specimens was collected and divided into training, validation, and test sets. Each SNV was manually inspected and labeled as either real or artifact as part of the clinical laboratory workflow. A 3-class (real, artifact, and uncertain) model was developed on the training set, fine-tuned with the validation set, and then evaluated on the test set. Prediction intervals reflecting the certainty of the classifications were derived during the process and used to label "uncertain" variants.
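One way to picture how a prediction interval can produce a third, "uncertain" class is sketched below. This is an illustration only, not the authors' implementation: the abstract does not state the underlying model, how the intervals are computed, or which cutoffs are used. The `Interval` type, the `label_variant` function, and the cutoff values 0.9 and 0.1 are all hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical prediction interval for P(variant is real)."""
    lower: float
    upper: float


def label_variant(interval: Interval,
                  real_cutoff: float = 0.9,
                  artifact_cutoff: float = 0.1) -> str:
    """Assign one of three labels from an interval (assumed cutoffs).

    A variant gets a definitive label only when the whole interval
    falls on one side of a cutoff; otherwise it is left for manual review.
    """
    if interval.lower >= real_cutoff:
        return "real"
    if interval.upper <= artifact_cutoff:
        return "artifact"
    return "uncertain"


# Made-up intervals, purely for illustration.
examples = [Interval(0.95, 0.99), Interval(0.01, 0.05), Interval(0.40, 0.97)]
labels = [label_variant(iv) for iv in examples]
print(labels)
```

Under these assumptions, widening the gap between the two cutoffs sends more variants to manual review and reduces the chance of a definitive but wrong label.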

Results: The optimized classifier demonstrated 100% specificity and 97% sensitivity over 5587 SNVs of the test set. Overall, 1252 of 1341 true-positive variants were identified as real, 4143 of 4246 false-positive calls were deemed artifacts, whereas only 192 (3.4%) SNVs were labeled as "uncertain," with zero misclassification between the true positives and artifacts in the test set.
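The counts reported above can be cross-checked with simple arithmetic. The short sketch below uses only the numbers stated in the Results; the variable names are illustrative. It does not recompute sensitivity or specificity, because the abstract does not say how "uncertain" calls enter those figures.

```python
# Counts taken directly from the Results paragraph.
true_variants = 1341      # manually confirmed real SNVs in the test set
false_calls = 4246        # manually confirmed artifacts in the test set
called_real = 1252        # real SNVs the classifier labeled "real"
called_artifact = 4143    # artifacts the classifier labeled "artifact"

test_set_size = true_variants + false_calls
# With zero real/artifact misclassification, everything else is "uncertain".
uncertain = test_set_size - called_real - called_artifact

uncertain_pct = round(100 * uncertain / test_set_size, 1)
definitive_pct = round(100 * (test_set_size - uncertain) / test_set_size, 1)

print(test_set_size, uncertain, uncertain_pct, definitive_pct)
```

The derived values agree with the reported test-set size (5587), the number of "uncertain" SNVs (192, or 3.4%), and the 96.6% of SNVs that received definitive labels.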

Conclusions: We presented a computational classifier to identify variant artifacts detected from tumor sequencing. Overall, 96.6% of the SNVs received definitive labels and thus were exempt from manual review. This framework could improve quality and efficiency of the variant review process in clinical laboratories.

MeSH terms

False Positive Reactions
High-Throughput Nucleotide Sequencing / methods*
Humans
Machine Learning*
Neoplasms / diagnosis
Neoplasms / genetics
Polymorphism, Single Nucleotide
Sensitivity and Specificity