The neglected giants: Uncovering the prevalence and functional groups of huge proteins in proteomes

PLoS Comput Biol. 2024 Sep 16;20(9):e1012459. doi: 10.1371/journal.pcbi.1012459. eCollection 2024 Sep.

Abstract

An often-overlooked aspect of biology is formed by the outliers of the protein length distribution, specifically those proteins with more than 5000 amino acids, which we refer to as huge proteins (HPs). By examining UniProtKB, we discovered more than 41 000 HPs throughout the tree of life, with the majority found in eukaryotes. Notably, the phyla with the highest propensity for HPs are Apicomplexa and Fornicata. Moreover, we observed that certain bacteria, such as Elusimicrobiota or Planctomycetota, have a higher tendency to encode HPs, even higher than that of the average eukaryote. To investigate whether these macro-polypeptides represent "real" proteins, we explored several indirect metrics. Additionally, orthology analyses reveal thousands of clusters of homologous HP sequences, which define functional groups related to key cellular processes such as cytoskeleton organization and functioning as chaperones or as E3-ubiquitin ligases in eukaryotes. In the case of bacteria, the major clusters have functions related to non-ribosomal peptide synthesis/polyketide synthesis, followed by pathogen-host attachment or recognition surface proteins. Further exploration of the annotations for each HP supported the previously identified functional groups. These findings underscore the need for further investigation of the cellular and ecological roles of these HPs and their potential impact on biology and biotechnology.
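
The length criterion above (more than 5000 amino acids) can be illustrated with a minimal sketch that scans FASTA-formatted sequences and keeps only those exceeding the threshold. This is not the authors' pipeline, which the abstract does not describe; the function names `parse_fasta` and `huge_proteins` and the constant `HP_MIN_LENGTH` are hypothetical, and the input is assumed to be plain FASTA text supplied as an iterable of lines.

```python
from typing import Iterable, Iterator, List, Tuple

# Threshold taken from the abstract: HPs have MORE than 5000 amino acids.
HP_MIN_LENGTH = 5000


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def huge_proteins(
    lines: Iterable[str], min_length: int = HP_MIN_LENGTH
) -> List[Tuple[str, int]]:
    """Return (header, length) for sequences strictly longer than min_length."""
    return [
        (header, len(seq))
        for header, seq in parse_fasta(lines)
        if len(seq) > min_length
    ]
```

Because the comparison is strict, a sequence of exactly 5000 residues is not counted as an HP under this sketch, which matches the wording "more than 5000 amino acids". An open file handle can be passed directly as `lines`.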

MeSH terms

  • Animals
  • Bacteria / genetics
  • Bacteria / metabolism
  • Computational Biology
  • Databases, Protein
  • Humans
  • Proteins / chemistry
  • Proteins / metabolism
  • Proteome* / metabolism
  • Proteomics / methods

Substances

  • Proteome
  • Proteins

Grants and funding

DPD was funded by the Gordon and Betty Moore Foundation grant #9733. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.