Prediction of Human Phenotype Ontology terms by means of hierarchical ensemble methods

BMC Bioinformatics. 2017 Oct 12;18(1):449. doi: 10.1186/s12859-017-1854-y.

Abstract

Background: The prediction of human gene-abnormal phenotype associations is a fundamental step toward the discovery of novel genes associated with human disorders, especially when no genes are known to be associated with a specific disease. In this context, the Human Phenotype Ontology (HPO) provides a standard categorization of the abnormalities associated with human diseases. While the prediction of gene-disease associations has been widely investigated, the related problem of predicting gene-phenotypic feature (i.e., HPO term) associations has been largely overlooked, even though no HPO term associations are known for most human genes and the HPO is increasingly applied to relevant medical problems. Moreover, most of the methods proposed in the literature are unable to capture the hierarchical relationships between HPO terms, which results in inconsistent and relatively inaccurate predictions.

Results: We present two hierarchical ensemble methods that we formally prove to provide biologically consistent predictions according to the hierarchical structure of the HPO. The modular structure of the proposed methods, which consists of a "flat" learning step followed by a hierarchical combination of the predictions in a second step, allows the predictions of virtually any flat learning method to be enhanced. The experimental results show that hierarchical ensemble methods are able to predict novel associations between genes and abnormal phenotypes with results that are competitive with state-of-the-art algorithms and with a significant reduction in computational complexity.

Conclusions: Hierarchical ensembles are efficient computational methods that guarantee biologically meaningful predictions obeying the true path rule, and they can be used as a tool to improve HPO term predictions, and make them consistent, starting from virtually any flat learning method. The implementation of the proposed methods is available as an R package from the CRAN repository.
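
As an illustration of the two-step idea described above, the following minimal Python sketch takes per-term scores from some flat classifier and applies a simple top-down correction so that no term scores higher than any of its ancestors, which is what the true path rule requires of a consistent prediction. It is a sketch under stated assumptions, not the authors' R implementation: the function name `top_down_correct`, the toy term names, and the scores are hypothetical, and the min-over-parents rule is only one simple way to enforce consistency on a DAG.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def top_down_correct(flat_scores, parents):
    """Make flat per-term scores hierarchically consistent on a DAG.

    flat_scores: dict mapping each term to a score in [0, 1] from a flat learner.
    parents: dict mapping a term to the list of its parent terms.
    Assumption: every term that appears in `parents` also has a flat score.
    """
    # TopologicalSorter expects node -> predecessors, so parents are visited first.
    graph = {term: parents.get(term, ()) for term in flat_scores}
    corrected = {}
    for term in TopologicalSorter(graph).static_order():
        score = flat_scores[term]
        # A term may not score higher than any of its (already corrected) parents.
        for parent in parents.get(term, ()):
            score = min(score, corrected[parent])
        corrected[term] = score
    return corrected


# Toy hierarchy (hypothetical terms): C has two parents, A and B.
flat = {"root": 0.9, "A": 0.4, "B": 0.7, "C": 0.8}
parents = {"A": ["root"], "B": ["root"], "C": ["A", "B"]}
consistent = top_down_correct(flat, parents)
print(consistent)
```

In this toy case the flat score of C (0.8) exceeds that of its parent A (0.4), so the correction lowers C to 0.4 while leaving the other terms unchanged. Because the correction only reads the flat scores, any flat learner can be plugged into the first step.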

Keywords: Gene-Abnormal phenotype association; Hierarchical ensemble methods; Hierarchical multi-label classification; Human Phenotype Ontology; Human Phenotype Ontology term prediction; Phenotype gene prioritization.

MeSH terms

  • Algorithms*
  • Area Under Curve
  • Biological Ontologies*
  • Genetic Association Studies
  • Humans
  • Molecular Sequence Annotation
  • Phenotype
  • ROC Curve