Minus the Error: Estimating dN/dS and Testing for Natural Selection in the Presence of Residual Alignment Errors

Avery G Selberg; Maria Chikina; Tim Sackton; Spencer V Muse; Alexander G Lucaci; Steven Weaver; Anton Nekrutenko; Nathan Clark; Sergei L Kosakovsky Pond

doi:10.1101/2024.11.13.620707

bioRxiv [Preprint]. 2024 Nov 15:2024.11.13.620707. doi: 10.1101/2024.11.13.620707.

Authors

Avery G Selberg¹,², Maria Chikina³, Tim Sackton⁴, Spencer V Muse⁵, Alexander G Lucaci⁶,⁷, Steven Weaver¹,², Anton Nekrutenko⁸, Nathan Clark³, Sergei L Kosakovsky Pond¹,²

Affiliations

¹ Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA.
² Department of Biology, Temple University, Philadelphia, PA, USA.
³ Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA.
⁴ FAS Informatics Group, Harvard University, Cambridge, MA, USA.
⁵ Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA.
⁶ Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA.
⁷ Weill Cornell Medicine, The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, New York, NY, USA.
⁸ Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA.

Abstract

Errors in multiple sequence alignments (MSAs) are known to bias many comparative evolutionary methods. In natural selection analyses based on codon evolutionary models, such errors produce excessive rates of false positives. A characteristic signature of error-driven findings is an unrealistically high estimate of dN/dS (e.g., >100) that affects only a small fraction (e.g., ~0.1%) of the alignment. Despite the widespread use of codon models to assess alignment quality, their potential for error correction remains unexplored. We present BUSTED-E, a novel method designed to detect positive selection while concurrently identifying alignment errors. The method is a straightforward adaptation of BUSTED, the flexible branch-site random effects model used to fit distributions of dN/dS, with one important modification: it integrates an "error-sink" component representing an abiological evolutionary regime (dN/dS > 100) and provides the option of masking errors in the MSA for downstream analyses. On data simulated without errors, BUSTED-E shows a small loss of power, which can be mitigated by model-averaged inference. Using four published empirical datasets, we show that BUSTED-E reduces unrealistic rates of positive selection detection, often by an order of magnitude, and improves the biological interpretability of results. The errors that BUSTED-E detects are also largely distinct from those flagged by other popular alignment cleaning tools (HMMCleaner and BMGE). Overall, BUSTED-E is a robust and scalable solution for improving the accuracy of evolutionary analyses in the presence of residual alignment errors, contributing to a more nuanced understanding of natural selection and adaptive evolution.
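
The error signature described in the abstract (a very high dN/dS affecting a tiny fraction of the alignment) can be illustrated with a toy discrete rate distribution. The following is a minimal sketch, not the BUSTED-E implementation and not its statistical test: the class names, the example fitted values, and the 1% weight cap are assumptions made for illustration, and the dN/dS > 100 cutoff follows the example given in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OmegaClass:
    omega: float   # dN/dS value for this rate class
    weight: float  # proportion of the alignment assigned to this class


# Illustrative cutoffs only. The abstract cites ">100" and "~0.1%" as examples;
# the 1% cap on the error-class weight is an assumption of this sketch.
ERROR_OMEGA_MIN = 100.0
ERROR_WEIGHT_MAX = 0.01


def is_error_like(c: OmegaClass) -> bool:
    """True if a class matches the error signature: huge omega, tiny weight."""
    return c.omega > ERROR_OMEGA_MIN and c.weight <= ERROR_WEIGHT_MAX


def split_error_sink(classes):
    """Separate biologically plausible classes from error-like ones."""
    biological = [c for c in classes if not is_error_like(c)]
    error = [c for c in classes if is_error_like(c)]
    return biological, error


def mean_omega(classes):
    """Weight-normalised mean dN/dS over the given classes."""
    total = sum(c.weight for c in classes)
    return sum(c.omega * c.weight for c in classes) / total


# Hypothetical fitted distribution: 0.1% of the alignment has omega = 500.
fitted = [
    OmegaClass(omega=0.05, weight=0.900),
    OmegaClass(omega=0.80, weight=0.099),
    OmegaClass(omega=500.0, weight=0.001),
]

biological, error = split_error_sink(fitted)
naive_signal = any(c.omega > 1 for c in fitted)          # counts the error class
cleaned_signal = any(c.omega > 1 for c in biological)    # error class set aside

print(f"mean omega, all classes:        {mean_omega(fitted):.4f}")
print(f"mean omega, biological classes: {mean_omega(biological):.4f}")
print(f"omega > 1 present (naive / cleaned): {naive_signal} / {cleaned_signal}")
```

In this toy example a single class holding 0.1% of the weight accounts for most of the mean dN/dS and is the only class with dN/dS > 1, so setting it aside removes the apparent signal of positive selection.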

Publication types

Preprint