Repeated sequences spread throughout the genome play important roles in shaping the structure of chromosomes and facilitating the generation of new genomic variation through structural rearrangements. Several mechanisms of structural variation formation use shared nucleotide similarity between repeated sequences as a substrate for ectopic recombination. We performed genome-wide analyses of direct and inverted intrachromosomal repeated sequence pairs with >200 bp and >80% sequence identity in three human genome assemblies: GRCh37, GRCh38, and the T2T-CHM13 alternate assembly. Overall, the composition and distribution of the identified direct and inverted repeats were similar among the three assemblies, involving 13-15% of the haploid genome, with an increased, albeit not significant, number of repeated sequences in T2T-CHM13. Interestingly, the majority of repeated sequences are shorter than 1 kb, with a median identity of 84.2%, highlighting the potential relevance of smaller, less identical repeats, such as Alu-Alu pairs, for ectopic recombination. We cross-referenced the identified repeated sequences with protein-coding genes to identify those at risk of being involved in genomic disorders. Olfactory receptor and immune response genes were enriched among those impacted. We have produced a catalog of highly identical, directly and inversely oriented intrachromosomal repeated sequences across the three currently most widely used human genome assemblies. Bioinformatic analyses of these sequences and their contribution to genome architecture can reveal regions that are susceptible to genomic instability. Understanding how their architectural features, such as identity, length, and distance, lead to genomic rearrangements can provide further insights into the molecular mechanisms leading to genomic disorders and genome evolution.
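The selection criteria stated above (intrachromosomal pairs, >200 bp, >80% identity, direct versus inverted orientation) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `RepeatCopy` record, the `classify_pair` and `pair_distance` helpers, the 0-based half-open coordinates, and the choice to apply the length threshold to the shorter copy of a pair are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the abstract; both are treated as strict inequalities.
MIN_LENGTH_BP = 200
MIN_IDENTITY = 0.80


@dataclass(frozen=True)
class RepeatCopy:
    """One copy of a repeated sequence (hypothetical record layout)."""
    chrom: str
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # exclusive
    strand: str  # "+" or "-"

    @property
    def length(self) -> int:
        return self.end - self.start


def classify_pair(a: RepeatCopy, b: RepeatCopy, identity: float) -> Optional[str]:
    """Return "direct" or "inverted" for a qualifying pair, else None."""
    if a.chrom != b.chrom:
        return None  # only intrachromosomal pairs are considered
    if min(a.length, b.length) <= MIN_LENGTH_BP:
        return None  # assumption: the shorter copy must exceed 200 bp
    if identity <= MIN_IDENTITY:
        return None
    # Same strand means the copies are in direct orientation.
    return "direct" if a.strand == b.strand else "inverted"


def pair_distance(a: RepeatCopy, b: RepeatCopy) -> int:
    """Gap in bp between two copies on the same chromosome (0 if they overlap)."""
    first, second = sorted((a, b), key=lambda c: c.start)
    return max(0, second.start - first.end)
```

For example, two 300 bp copies on chr1 in opposite orientations with 84.2% identity would be classified as an inverted pair, and their identity, length, and distance could then be tabulated as the architectural features discussed above.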
Keywords: Alus; GRCh37; GRCh38; LINEs; T2T; assembly; complex genomic rearrangements; genomic rearrangements; inversions; low-copy repeats; microhomology; recombination; repeats; segmental duplications.
Copyright © 2024. Published by Elsevier Inc.