Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome

Nucleic Acids Res. 2001 Feb 1;29(3):818-30. doi: 10.1093/nar/29.3.818.

Abstract

Pseudogenes are non-functioning copies of genes in genomic DNA, which may either result from reverse transcription from an mRNA transcript (processed pseudogenes) or from gene duplication and subsequent disablement (non-processed pseudogenes). As pseudogenes are apparently 'dead', they usually have a variety of obvious disablements (e.g., insertions, deletions, frameshifts and truncations) relative to their functioning homologs. We have derived an initial estimate of the size, distribution and characteristics of the pseudogene population in the Caenorhabditis elegans genome, performing a survey in 'molecular archaeology'. Corresponding to the 18 576 annotated proteins in the worm (i.e., in Wormpep18), we have found an estimated total of 2168 pseudogenes, about one for every eight genes. Few of these appear to be processed. Details of our pseudogene assignments are available from http://bioinfo.mbb.yale.edu/genome/worm/pseudogene. The population of pseudogenes differs significantly from that of genes in a number of respects: (i) pseudogenes are distributed unevenly across the genome relative to genes, with a disproportionate number on chromosome IV; (ii) the density of pseudogenes is higher on the arms of the chromosomes; (iii) the amino acid composition of pseudogenes is midway between that of genes and (translations of) random intergenic DNA, with enrichment of Phe, Ile, Leu and Lys, and depletion of Asp, Ala, Glu and Gly relative to the worm proteome; and (iv) the most common protein folds and families differ somewhat between genes and pseudogenes-whereas the most common fold found in the worm proteome is the immunoglobulin fold and the most common 'pseudofold' is the C-type lectin. In addition, the size of a gene family bears little overall relationship to the size of its corresponding pseudogene complement, indicating a highly dynamic genome. There are in fact a number of families associated with large populations of pseudogenes. For example, one family of seven-transmembrane receptors (represented by gene B0334.7) has one pseudogene for every four genes, and another uncharacterized family (represented by gene B0403.1) is approximately two-thirds pseudogenic. Furthermore, over a hundred apparent pseudogenic fragments do not have any obvious homologs in the worm.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence
  • Animals
  • Caenorhabditis elegans / genetics*
  • Chromosomes / genetics
  • DNA, Helminth / genetics
  • Genes, Helminth / genetics
  • Genome*
  • Molecular Sequence Data
  • Pseudogenes / genetics*

Substances

  • DNA, Helminth