An unsupervised machine learning method for discovering patient clusters based on genetic signatures

Christian Lopez; Scott Tucker; Tarik Salameh; Conrad Tucker

doi:10.1016/j.jbi.2018.07.004

J Biomed Inform. 2018 Sep:85:30-39. doi: 10.1016/j.jbi.2018.07.004. Epub 2018 Jul 29.

Authors

Christian Lopez¹, Scott Tucker², Tarik Salameh³, Conrad Tucker⁴

Affiliations

¹ Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, USA.
² Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, USA; Engineering Science and Mechanics, The Pennsylvania State University, University Park, PA 16802, USA.
³ Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, USA.
⁴ Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, USA; Engineering Design Technology and Professional Programs, The Pennsylvania State University, University Park, PA 16802, USA; Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA.

Abstract

Introduction: Many chronic disorders have genomic etiology, disease progression, clinical presentation, and response to treatment that vary on a patient-to-patient basis. Such variability creates a need to identify characteristics within patient populations that have clinically relevant predictive value in order to advance personalized medicine. Unsupervised machine learning methods are suitable to address this type of problem, in which no a priori class label information is available to guide this search. However, it is challenging for existing methods to identify cluster memberships that are not just a result of natural sampling variation. Moreover, most of the current methods require researchers to provide specific input parameters a priori.

Method: This work presents an unsupervised machine learning method to cluster patients based on their genomic makeup without providing input parameters a priori. The method implements internal validity metrics to algorithmically identify the number of clusters, as well as statistical analyses to test for the significance of the results. Furthermore, the method takes advantage of the high degree of linkage disequilibrium between single nucleotide polymorphisms. Finally, a gene pathway analysis is performed to identify potential relationships between the clusters in the context of known biological knowledge.
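As an illustration of how an internal validity metric can choose the number of clusters without a user-supplied parameter, the following minimal Python sketch may help. It is not the authors' implementation. It assumes genotypes coded as minor-allele counts (0, 1, 2), Manhattan distance, average-linkage agglomerative clustering, and the mean silhouette width as the validity metric; the abstract does not state which metrics or clustering algorithm the method uses. The function names and the toy data are hypothetical.

```python
from itertools import combinations

def manhattan(a, b):
    """Manhattan distance between two genotype vectors (assumed 0/1/2 coding)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def agglomerate(dist, n, k):
    """Average-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = sum(dist[a][b] for a in clusters[i] for b in clusters[j])
            d /= len(clusters[i]) * len(clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [0] * n
    for c, members in enumerate(clusters):
        for m in members:
            labels[m] = c
    return labels

def mean_silhouette(dist, labels):
    """Mean silhouette width; singleton clusters contribute 0 by convention."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(dist[i][j] for j in g) / len(g)
                for other, g in groups.items() if other != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(labels)

def choose_k(points, k_max):
    """Return the k in 2..k_max with the highest mean silhouette, plus labels."""
    n = len(points)
    dist = [[manhattan(p, q) for q in points] for p in points]
    scores = {k: mean_silhouette(dist, agglomerate(dist, n, k))
              for k in range(2, k_max + 1)}
    best_k = max(scores, key=scores.get)
    return best_k, agglomerate(dist, n, best_k)

# Hypothetical toy data: 9 patients x 6 SNPs, built to contain three groups.
patients = [
    [0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0],
    [2, 2, 1, 2, 2, 2], [2, 2, 2, 2, 1, 2], [2, 1, 2, 2, 2, 2],
    [0, 2, 0, 2, 0, 2], [0, 2, 1, 2, 0, 2], [0, 2, 0, 2, 0, 1],
]
best_k, labels = choose_k(patients, k_max=5)
print(best_k, labels)
```

On this toy input the silhouette peaks at three clusters, matching how the data were constructed. A sketch like this covers only the selection of the number of clusters; the significance testing, the use of linkage disequilibrium, and the pathway analysis described above are not represented.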

Datasets and results: The method is tested with a cluster validation dataset and a genomic dataset, both previously used in the literature. Benchmark results indicate that the proposed method performs best among the methods tested. Furthermore, the method is applied to a sample genome-wide study dataset of 191 multiple sclerosis patients. The results indicate that the method was able to identify genetically distinct patient clusters without the need to select parameters a priori. Additionally, Gene Ontology term analysis shows that variants identified as significantly different between clusters are enriched for protein-protein interactions, especially in immune processes and cell adhesion pathways.

Conclusion: Once links are drawn between clusters and clinically relevant outcomes, Immunochip data can be used to classify high-risk and newly diagnosed chronic disease patients into known clusters for predictive value. Further investigation can extend beyond pathway analysis to evaluate these clusters for clinical significance of genetically related characteristics such as age of onset, disease course, heritability, and response to treatment.

Keywords: Clustering analysis; Genomic similarity; Multiple sclerosis; Unsupervised machine learning.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms
Cluster Analysis*
Computational Biology
Databases, Genetic / statistics & numerical data
Gene Ontology / statistics & numerical data
Gene Regulatory Networks
Genome-Wide Association Study / statistics & numerical data
Humans
Linkage Disequilibrium*
Polymorphism, Single Nucleotide*
Precision Medicine
Unsupervised Machine Learning*
