Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes

Yujia Bao; Zhengyi Deng; Yan Wang; Heeyoon Kim; Victor Diego Armengol; Francisco Acevedo; Nofal Ouardaoui; Cathy Wang; Giovanni Parmigiani; Regina Barzilay; Danielle Braun; Kevin S Hughes

doi:10.1200/CCI.19.00042

JCO Clin Cancer Inform. 2019 Sep:3:1-9. doi: 10.1200/CCI.19.00042.

Authors

Yujia Bao¹, Zhengyi Deng², Yan Wang², Heeyoon Kim¹, Victor Diego Armengol², Francisco Acevedo², Nofal Ouardaoui³, Cathy Wang³,⁴, Giovanni Parmigiani³,⁴, Regina Barzilay¹, Danielle Braun³,⁴, Kevin S Hughes²,⁵

Affiliations

¹ Massachusetts Institute of Technology, Boston, MA.
² Massachusetts General Hospital, Boston, MA.
³ Harvard T.H. Chan School of Public Health, Boston, MA.
⁴ Dana-Farber Cancer Institute, Boston, MA.
⁵ Harvard Medical School, Boston, MA.

Abstract

Purpose: The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools that help to monitor and prioritize the literature to understand the clinical implications of pathogenic genetic variants. We developed and evaluated two machine learning models to classify abstracts as relevant to the penetrance (the risk of cancer for germline mutation carriers) or the prevalence of germline genetic mutations.

Materials and methods: We conducted literature searches in PubMed and retrieved paper titles and abstracts to create an annotated data set for training and evaluating the two machine learning classification models. Our first model is a support vector machine (SVM), which learns a linear decision rule on the basis of the bag-of-ngrams representation of each title and abstract. Our second model is a convolutional neural network (CNN), which learns a complex nonlinear decision rule on the basis of the raw title and abstract. We evaluated the performance of the two models on the classification of papers as relevant to penetrance or prevalence.
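
To make the first model's ingredients concrete, the following is a minimal sketch of a bag-of-ngrams representation and a linear decision rule applied to it. It is not the authors' implementation: the tokenizer, the function names (bag_of_ngrams, linear_decision), and the weights are all hypothetical and chosen for illustration. A trained SVM would learn the weights and bias from the annotated abstracts.

```python
import re
from collections import Counter


def bag_of_ngrams(text, max_n=2):
    """Count every word n-gram of length 1..max_n in the lowercased text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def linear_decision(features, weights, bias=0.0):
    """Return True when the weighted sum of n-gram counts plus bias is positive."""
    score = bias + sum(weights.get(gram, 0.0) * count
                       for gram, count in features.items())
    return score > 0


# Hypothetical weights for illustration only; an SVM would learn these
# from labeled titles and abstracts rather than have them set by hand.
example_weights = {"cumulative risk": 1.5, "carriers": 0.5, "mouse": -1.0}
example_text = "Cumulative risk of breast cancer in BRCA1 mutation carriers"

example_features = bag_of_ngrams(example_text)
is_relevant = linear_decision(example_features, example_weights, bias=-1.0)
```

Because the rule is a weighted sum over n-gram counts, each weight can be read directly as evidence for or against relevance, which is one reason a linear model over this representation is easy to inspect.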

Results: For penetrance classification, we annotated 3,740 paper titles and abstracts and evaluated the two models using 10-fold cross-validation. The SVM model achieved 88.93% accuracy (the percentage of papers that were correctly classified), whereas the CNN model achieved 88.53% accuracy. For prevalence classification, we annotated 3,753 paper titles and abstracts. The SVM model achieved 88.92% accuracy and the CNN model achieved 88.52% accuracy.
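
The evaluation protocol named above, 10-fold cross-validation with accuracy as the metric, can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' code: the helper names (k_fold_indices, accuracy, cross_validate) and the train_fn interface are hypothetical, the folds are contiguous and unshuffled, and the sketch assumes at least as many items as folds.

```python
def k_fold_indices(n_items, k=10):
    """Split range(n_items) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def accuracy(predicted, actual):
    """Percentage of items whose predicted label matches the annotated label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def cross_validate(items, labels, train_fn, k=10):
    """Average held-out accuracy over k folds.

    train_fn(train_items, train_labels) is a hypothetical interface that
    returns a function mapping one item to a predicted label.
    """
    scores = []
    for fold in k_fold_indices(len(items), k):
        held_out = set(fold)
        train_idx = [i for i in range(len(items)) if i not in held_out]
        predict = train_fn([items[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        predictions = [predict(items[i]) for i in fold]
        scores.append(accuracy(predictions, [labels[i] for i in fold]))
    return sum(scores) / len(scores)
```

With 3,740 annotated papers, each of the 10 folds holds 374 papers; with 3,753, three folds hold 376 and the rest 375. In practice the data would typically be shuffled before splitting so that each fold reflects the overall label distribution.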

Conclusion: Our models achieve high accuracy in classifying abstracts as relevant to penetrance or prevalence. By facilitating literature review, this tool could help clinicians and researchers keep abreast of the burgeoning knowledge of gene-cancer associations and keep the knowledge bases for clinical decision support tools up to date.

Publication types

Meta-Analysis
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Genetic Predisposition to Disease*
Humans
Knowledge Discovery*
Machine Learning*
Medicine in Literature*
Natural Language Processing*
Neoplasms / genetics*
Oncogenes*
Polymorphism, Genetic
Prevalence
ROC Curve
Reproducibility of Results
Support Vector Machine
