Finding and Characterizing Repeats in Plant Genomes

Jacques Nicolas; Sébastien Tempel; Anna-Sophie Fiston-Lavier; Emira Cherif

doi:10.1007/978-1-0716-2067-0_18

Methods Mol Biol. 2022;2443:327-385. doi: 10.1007/978-1-0716-2067-0_18.

Authors

Jacques Nicolas¹, Sébastien Tempel², Anna-Sophie Fiston-Lavier³ ⁴, Emira Cherif³

Affiliations

¹ Univ Rennes, Inria, CNRS, IRISA, Rennes, France.
² Institut de Biologie du Développement de Marseille, CNRS, Univ Aix-Marseille, Marseille, France.
³ ISEM, Université Montpellier, CNRS, UM, IRD, CIRAD, EPHE, Montpellier, France.
⁴ Institut Universitaire de France (IUF), Paris, France.

PMID: 35037215

Abstract

Plant genomes contain a particularly high proportion of repeated structures of various types. This chapter proposes a guided tour of the available software that can help biologists scan sequence data automatically for these repeats or check hypothetical models intended to characterize their structures. Since transposable elements (TEs) are a major source of repeats in plants, many methods have been used or developed for this broad class of sequences. They are representative of the range of tools available for other classes of repeats, and we provide two sections on this topic (one for the analysis of genomes, one for the direct analysis of sequenced reads), as well as a selection of the main existing software. It may be hard to keep up with the profusion of proposals in this dynamic field, and the rest of the chapter is devoted to the foundations of an efficient search for repeats and more complex patterns. We first introduce the key concepts of the art of indexing and of mapping or querying sequences. We end the chapter with the more prospective issue of building models of repeat families. We first present the Machine Learning approach, which seeks to build predictors automatically for some TE families from a set of sequences known to belong to the family. A second approach, the linguistic (or syntactic) approach, allows biologists to describe models of their favorite repeat family themselves and check their validity.

Keywords: Algorithmics on words; Homology-based; Indexing; Machine Learning; Mapping; Pattern matching; Repeats; Structure-based methods; Transposon.

MeSH terms

DNA Transposable Elements / genetics
Genome, Plant*
Plants / genetics
Prospective Studies
Software*

Substances

DNA Transposable Elements