SARS-CoV-2, which causes the current pandemic of respiratory illness, is evolving continuously and generating new variants. Nevertheless, most sequence analyses have thus far focused on nucleotide substitutions, despite the fact that insertions and deletions (indels) are equally important in the evolution of SARS-CoV-2. In this study, we analyzed 1,099,664 high-quality SARS-CoV-2 genome sequences to reconstruct the evolutionary and epidemiological histories of indels. Our analysis revealed 289 circulating indel types (237 deletion and 52 insertion types, each represented by more than ten genome sequences), among which eighteen were recurrent indel types, each represented by more than 500 genome sequences. Although indels were identified across the entire genome, most of them were located in the nsp6, S, ORF8, and N genes, among which ORF8 indel types had the highest frequency of frameshifts. Geographical and temporal analyses of these variants revealed a few changes in the dominant indel types, each accompanied by geographic expansion to different countries and continents, which resulted in the fixation of several indel types in the field, including those in the current variants of concern. Evolutionary and structural analyses revealed that indels involving the N-terminal domain regions of S were linked to three of the four variants of concern, resulting in a significantly altered S protein that might contribute to the selective advantage of the corresponding variant. In sum, our study highlights the important role of insertions and deletions in the evolution and spread of SARS-CoV-2.
Keywords: SARS-CoV-2; deletions; evolution; insertions; molecular epidemiology.
© The Author(s) 2021. Published by Oxford University Press.