Impact of phylogeny on the inference of functional sectors from protein sequence data

Nicola Dietler; Alia Abbara; Subham Choudhury; Anne-Florence Bitbol

doi:10.1371/journal.pcbi.1012091

PLoS Comput Biol. 2024 Sep 23;20(9):e1012091. doi: 10.1371/journal.pcbi.1012091. eCollection 2024 Sep.

Authors

Nicola Dietler^{1,2}, Alia Abbara^{1,2}, Subham Choudhury^{1,2}, Anne-Florence Bitbol^{1,2}

Affiliations

¹ Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland.
² SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.

Abstract

Statistical analysis of multiple sequence alignments of homologous proteins has revealed groups of coevolving amino acids called sectors. These groups of amino-acid sites feature collective correlations in their amino-acid usage, and they are associated with functional properties. Modeling showed that nonlinear selection on an additive functional trait of a protein is generically expected to give rise to a functional sector. These modeling results motivated a principled method, called ICOD, which is designed to identify functional sectors, as well as mutational effects, from sequence data. However, a challenge for all methods aiming to identify sectors from multiple sequence alignments is that correlations in amino-acid usage can also arise from the mere fact that homologous sequences share common ancestry, i.e., from phylogeny. Here, we generate controlled synthetic data from a minimal model comprising both phylogeny and functional sectors. We use these data to dissect the impact of phylogeny on sector identification and on mutational effect inference by different methods. We find that ICOD is the most robust to phylogeny, but that conservation is also quite robust. Next, we consider natural multiple sequence alignments of protein families for which deep mutational scan experimental data are available. We show that in these natural data, conservation and ICOD best identify sites with strong functional roles, in agreement with our results on synthetic data. Importantly, these two methods have different premises, since they respectively focus on conservation and on correlations. Thus, their joint use can reveal complementary information.
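
As an illustration of what a conservation-based site score can look like, here is a minimal Python sketch. It is not the authors' implementation: the function name `site_conservation`, the entropy-based definition (log of alphabet size minus the Shannon entropy of each column), the default alphabet size of 21 (20 amino acids plus gap), and the toy alignment are all assumptions made for this example. It does not implement ICOD, which relies on correlations between sites rather than on per-site conservation.

```python
import math
from collections import Counter


def site_conservation(alignment, alphabet_size=21):
    """Return one conservation score per alignment column (in nats).

    Assumed definition (illustrative only): log(alphabet_size) minus the
    Shannon entropy of the observed symbol frequencies in the column.
    A fully conserved column gets the maximum score, log(alphabet_size).
    Gap characters are counted like any other symbol.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")

    n_seqs = len(alignment)
    scores = []
    for i in range(length):
        # Count how often each symbol appears in column i.
        counts = Counter(seq[i] for seq in alignment)
        # Shannon entropy of the empirical frequencies.
        entropy = -sum((c / n_seqs) * math.log(c / n_seqs) for c in counts.values())
        scores.append(math.log(alphabet_size) - entropy)
    return scores


# Hypothetical toy alignment: 4 sequences, 5 columns.
toy_msa = ["MKVLA", "MKILA", "MRVLG", "MKVLS"]
scores = site_conservation(toy_msa)

# Column indices sorted from most to least conserved.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this toy alignment, columns 0 and 3 are fully conserved and share the top score, while column 4, which carries three different residues, ranks last. A score of this kind uses no information about pairs of sites, which is why a correlation-based method could in principle provide complementary information.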

Copyright: © 2024 Dietler et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Algorithms
Amino Acid Sequence / genetics
Computational Biology / methods
Evolution, Molecular
Mutation
Phylogeny*
Proteins* / chemistry
Proteins* / genetics
Sequence Alignment* / methods
Sequence Alignment* / statistics & numerical data
Sequence Analysis, Protein* / methods

Substances

Proteins

Grants and funding

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851173, to A.-F. B.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.