CpG islands (CGIs) are rare, interspersed DNA sequences that deviate significantly from the background genomic distribution by being GC-rich and CpG-rich; their CpG density provides a good classification feature for long noncoding RNA (lncRNA) promoters. Based on a review of previous CpG-related studies, we consider that the transcriptional regulation of about half of human genes, mostly housekeeping (HK) genes, involves CGIs, their methylation states, CpG spacing, and other chromosomal parameters. However, the precise definition of CGIs, their positioning within gene structures, and the specific CGI-associated regulatory mechanisms all remain to be elucidated at the level of individual genes and gene families, with species and lineage specificity taken into account. Although previous studies have classified CGIs into high-CpG (HCGI), intermediate-CpG (ICGI), and low-CpG (LCGI) classes based on CpG density, the correlation between CGI density and the regulation of gene expression, such as the co-regulation of HK genes by CGIs and the TATA-box, is not clear. Here, we introduce a protocol for addressing this problem in human genome annotation, based on a combination of GTEx, JBLA, and GO analysis. We then discuss why CGI-associated genes are most likely regulated by HCGIs and tend to be HK genes. The HCGI/TATA± and LCGI/TATA± combinations show different GO enrichment, whereas the ICGI/TATA± combination is less characteristic than LCGI/TATA±.
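To make "GC-rich and CpG-rich" concrete, the following minimal Python sketch computes the two quantities most often used to describe a candidate CGI: GC content and the observed/expected CpG ratio. It is an illustration only, not the protocol described here. The function names are hypothetical, and the default thresholds (length of at least 200 bp, GC content above 0.5, observed/expected ratio above 0.6) follow the widely cited Gardiner-Garden and Frommer criteria, which are an assumption and not necessarily the definition used in this work. The density cut-offs separating HCGI, ICGI, and LCGI are not given in this abstract, so the sketch does not attempt that classification.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact dinucleotide count.
    n_cpg = seq.count("CG")
    return n_cpg * len(seq) / (n_c * n_g)


def looks_like_cgi(seq: str, min_len: int = 200,
                   min_gc: float = 0.5, min_oe: float = 0.6) -> bool:
    """Apply assumed Gardiner-Garden/Frommer-style thresholds to one window."""
    return (len(seq) >= min_len
            and gc_content(seq) > min_gc
            and cpg_obs_exp(seq) > min_oe)
```

For example, `looks_like_cgi("CG" * 100)` evaluates to `True`, whereas an AT-only window of the same length fails both the GC-content and the CpG-ratio tests.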
Keywords: CpG Island; Genome analysis; Genome annotation; LncRNA; Statistical genetics.
© 2025. The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature.