Large-scale protein function prediction using heterogeneous ensembles

F1000Res. 2018 Sep 28;7:ISCB Comm J-1577. doi: 10.12688/f1000research.16415.1. eCollection 2018.

Abstract

Heterogeneous ensembles are an effective approach when the ideal data type and/or individual predictor for a given problem is unclear. These ensembles have shown promise for protein function prediction (PFP), but whether they can improve PFP at a large scale is unclear. This study critically assesses that ability for a variety of heterogeneous ensemble methods across many functional terms, proteins, and organisms. Our results show that these methods, especially Stacking using Logistic Regression, do produce more accurate predictions for a variety of Gene Ontology terms differing in size and specificity. To enable the application of these methods to other related problems, we have publicly shared the HPC-enabled code underlying this work as LargeGOPred (https://github.com/GauravPandeyLab/LargeGOPred).
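
To make "Stacking using Logistic Regression" concrete, the following is a minimal sketch of the general technique, not the LargeGOPred pipeline. It assumes scikit-learn is available and uses its StackingClassifier. The synthetic data (a stand-in for one Gene Ontology term), the three base predictors, and the AUC metric are all illustrative assumptions that do not come from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic stand-in for a single GO term.
# Rows play the role of proteins; label 1 means "annotated with the term".
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0,
)

# Heterogeneous base predictors: different algorithm families
# (hypothetical choices, picked only for illustration).
base_predictors = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("nb", GaussianNB()),
]

# Stacking: a logistic regression meta-learner is trained on the
# cross-validated probability outputs of the base predictors.
stack = StackingClassifier(
    estimators=base_predictors,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

# Score held-out "proteins" and summarize with AUC (an assumed metric).
scores = stack.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)
print(f"Stacking (LR) AUC: {auc:.3f}")
```

In a large-scale setting, a loop of this kind would be repeated independently per functional term, which is what makes the problem a natural fit for HPC parallelization.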

Keywords: protein function prediction, heterogeneous ensembles, machine learning, high-performance computing, performance evaluation

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Bacterial Proteins / genetics*
  • Gene Ontology*
  • Logistic Models
  • Machine Learning

Substances

  • Bacterial Proteins