PARROT is a flexible recurrent neural network framework for analysis of large protein datasets

Daniel Griffith; Alex S Holehouse

doi:10.7554/eLife.70576

eLife. 2021 Sep 17;10:e70576. doi: 10.7554/eLife.70576.

Authors

Daniel Griffith¹,², Alex S Holehouse¹,²

Affiliations

¹ Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St Louis, United States.
² Center for Science and Engineering of Living Systems, Washington University, St Louis, United States.

Abstract

The rise of high-throughput experiments has transformed how scientists approach biological questions. The ubiquity of large-scale assays that can test thousands of samples in a day has necessitated the development of new computational approaches to interpret these data. Among these tools, machine learning approaches are increasingly used because they can infer complex nonlinear patterns from high-dimensional data. Despite their effectiveness, machine learning (and in particular deep learning) approaches are not always accessible or easy to implement for those with limited computational expertise. Here we present PARROT, a general framework for training and applying deep learning-based predictors on large protein datasets. Using an internal recurrent neural network architecture, PARROT is capable of tackling both classification and regression tasks while requiring only raw protein sequences as input. We showcase the potential uses of PARROT on three diverse machine learning tasks: predicting phosphorylation sites, predicting transcriptional activation function of peptides generated by high-throughput reporter assays, and predicting the fibrillization propensity of amyloid beta with data generated by deep mutational scanning. Through these examples, we demonstrate that PARROT is easy to use, performs comparably to state-of-the-art computational tools, and is applicable to a wide array of biological problems.
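The abstract notes that raw protein sequences are the only required input. As a rough illustration of what feeding a raw sequence to a recurrent network can involve, the sketch below turns a sequence into one numeric vector per residue using one-hot encoding. This is an assumption made for illustration: the abstract does not specify an encoding scheme, and the names `AMINO_ACIDS`, `AA_INDEX`, and `one_hot_encode` are hypothetical and not taken from PARROT.

```python
# Hypothetical illustration; not taken from PARROT's source code.
# Assumes a one-hot encoding over the 20 canonical amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes, alphabetical order
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-element 0/1 vector for each residue in `sequence`."""
    encoded = []
    for position, residue in enumerate(sequence.upper()):
        if residue not in AA_INDEX:
            raise ValueError(
                f"unsupported residue {residue!r} at position {position}"
            )
        vector = [0] * len(AMINO_ACIDS)
        vector[AA_INDEX[residue]] = 1  # mark the column for this residue
        encoded.append(vector)
    return encoded


if __name__ == "__main__":
    matrix = one_hot_encode("MKV")
    # One row per residue, one column per amino acid type.
    print(len(matrix), len(matrix[0]))
```

A sequence of length L becomes an L × 20 matrix, which a recurrent network could then read one row at a time. Other encodings, such as learned embeddings or biophysical property scales, would fit the same interface.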

Keywords: bioinformatics; computational biology; functional annotation; high-throughput methods; human; machine learning; proteomics; systems biology.

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Computational Biology / methods*
Databases, Protein*
Deep Learning
High-Throughput Nucleotide Sequencing
Humans
Neural Networks, Computer*
Phosphorylation
Proteins / analysis
Proteins / chemistry
Proteins / metabolism
Sequence Analysis, Protein / methods*
Software

Substances

Proteins

Grants and funding

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.