Perspectives on Codebook: sequence specificity of uncharacterized human transcription factors

bioRxiv [Preprint]. 2024 Nov 12:2024.11.11.622097. doi: 10.1101/2024.11.11.622097.

Abstract

We describe an effort ("Codebook") to determine the sequence specificity of 332 putative and largely uncharacterized human transcription factors (TFs), as well as 61 control TFs. Nearly 5,000 independent experiments across multiple in vitro and in vivo assays produced motifs for just over half of the putative TFs analyzed (177, or 53%), most of which are unique to a single TF. The data highlight the extensive contribution of transposable elements to TF evolution, both in cis and in trans, and identify tens of thousands of conserved, base-level binding sites in the human genome. The use of multiple assays provides an unprecedented opportunity to benchmark and analyze TF sequence specificity, function, and evolution, as further explored in accompanying manuscripts. A total of 1,421 human TFs are now associated with a DNA-binding motif. Extrapolation from the Codebook benchmarking, however, suggests that many of the currently known binding motifs for well-studied TFs may inaccurately describe those TFs' true sequence preferences.

Keywords: ChIP-seq; Codebook; DNA-binding specificity; GHT-SELEX; HT-SELEX; Motif; PBM; PWM; SELEX; SMiLE-seq; TF; Transcription factor.

Publication types

  • Preprint