Substitution Model Adequacy and Assessing the Reliability of Estimates of Virus Evolutionary Rates and Time Scales

Sebastián Duchêne; Francesca Di Giallonardo; Edward C Holmes

doi:10.1093/molbev/msv207

Mol Biol Evol. 2016 Jan;33(1):255-67. doi: 10.1093/molbev/msv207. Epub 2015 Sep 28.

Authors

Sebastián Duchêne¹, Francesca Di Giallonardo¹, Edward C Holmes²

Affiliations

¹ Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences and Sydney Medical School, The University of Sydney, Sydney, NSW, Australia.
² Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences and Sydney Medical School, The University of Sydney, Sydney, NSW, Australia.

PMID: 26416981
DOI: 10.1093/molbev/msv207

Abstract

Determining the time scale of virus evolution is central to understanding viral origins and emergence. The phylogenetic methods commonly used for this purpose can be misleading if the substitution model makes incorrect assumptions about the data. Empirical studies consider a pool of models and select the one with the highest statistical fit. However, this does not allow the rejection of all models, even if they poorly describe the data. An alternative is to use model adequacy methods that evaluate the ability of a model to predict hypothetical future observations. This can be done by comparing the empirical data with data generated under the model in question. We conducted simulations to evaluate the sensitivity of such methods with nucleotide, amino acid, and codon data. These methods effectively detected underparameterized models, but failed to detect mutational saturation and some instances of nonstationary base composition, which can lead to biases in estimates of tree topology and length. To test the applicability of these methods with real data, we analyzed nucleotide and amino acid data sets from the genus Flavivirus of RNA viruses. In most cases the substitution models were inadequate, with the exception of a data set of relatively closely related sequences of Dengue virus, for which the GTR+Γ nucleotide and LG+Γ amino acid substitution models were adequate. Our results partly explain the lack of consensus over estimates of the long-term evolutionary time scale of these viruses, and indicate that assessing the adequacy of substitution models should be routinely used to determine whether estimates are reliable.
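The idea of comparing empirical data with data generated under the model can be illustrated with a toy parametric bootstrap in Python. This is a minimal sketch under strong simplifying assumptions and is not the procedure used in the study: it uses only two aligned sequences, the Jukes-Cantor (JC69) model, and a test statistic chosen purely for illustration (the fraction of observed differences that are transitions). All function names are hypothetical.

```python
import math
import random

BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def jc69_distance(seq1, seq2):
    """JC69 distance (expected substitutions per site) between two aligned sequences."""
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    if p >= 0.75:
        raise ValueError("sequences too divergent for a JC69 distance")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def simulate_pair_jc69(d, n_sites, rng):
    """Simulate two sequences separated by distance d under JC69."""
    # Probability that a site differs between the two sequences under JC69.
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    seq1, seq2 = [], []
    for _ in range(n_sites):
        a = rng.choice(BASES)  # equal base frequencies
        if rng.random() < p_diff:
            # All three alternative bases are equally likely under JC69.
            b = rng.choice([x for x in BASES if x != a])
        else:
            b = a
        seq1.append(a)
        seq2.append(b)
    return "".join(seq1), "".join(seq2)


def transition_fraction(seq1, seq2):
    """Illustrative test statistic: share of differences that are transitions."""
    diffs = [(a, b) for a, b in zip(seq1, seq2) if a != b]
    if not diffs:
        return 0.0
    return sum(pair in TRANSITIONS for pair in diffs) / len(diffs)


def adequacy_p_value(seq1, seq2, n_reps=200, seed=1):
    """Two-sided tail-area probability of the observed statistic under JC69."""
    rng = random.Random(seed)
    d = jc69_distance(seq1, seq2)            # fit the model to the empirical data
    observed = transition_fraction(seq1, seq2)
    simulated = [
        transition_fraction(*simulate_pair_jc69(d, len(seq1), rng))
        for _ in range(n_reps)               # generate data under the fitted model
    ]
    upper = sum(s >= observed for s in simulated) / n_reps
    lower = sum(s <= observed for s in simulated) / n_reps
    return min(1.0, 2.0 * min(upper, lower))
```

A small value from `adequacy_p_value` suggests that the fitted model does not reproduce this feature of the data, as would be expected when JC69 is applied to sequences with a strong transition bias. Real analyses use full alignments, a phylogeny, and test statistics suited to the question at hand; this sketch only shows the fit, simulate, and compare logic.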

Keywords: Bayesian model averaging; model adequacy; parametric bootstrap; posterior predictive simulation; substitution model; virus evolution.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Bayes Theorem
Cluster Analysis
Codon / genetics
Computer Simulation
Evolution, Molecular*
Models, Genetic*
Viruses / genetics*

Substances

Codon