CaTCH: Calculating transcript complexity of human genes

MethodsX. 2024 Apr 4:12:102697. doi: 10.1016/j.mex.2024.102697. eCollection 2024 Jun.

Abstract

Findings from whole-transcriptome sequencing suggest that alternative splicing occurs in approximately 95% of human multi-exon genes and thus plays a crucial role in promoting proteome diversity. According to the latest GENCODE annotations, most genes have fewer than four transcripts, and the number of transcripts correlates positively with the number of exons. It is therefore more accurate to measure a gene's splice variant efficiency relative to its number of exons, a measure termed Transcript Complexity (TC). In addition, the theoretical number of transcripts is substantially higher than the actual number of transcripts produced by alternative splicing events, and the features restricting this phenomenon need to be explored. In this method, we extracted data on various features contributing to TC from different databases. Linear regression is used to identify the determinant features and to train and test the model of TC. The results indicate that exon length is the strongest determinant of TC, followed by coding potential, presence of chromatin signature, and 5' splice site dinucleotide. Exon length affects a gene's TC positively, whereas the other three features affect it negatively. To further classify genes based on TC, random forest is used to identify the determinant features.

• The splicing efficiency of a gene can be inferred from its transcript complexity, which is the number of transcripts per exon.
• CaTCH is a linear regression-based model for calculating the transcript complexity of human genes from exon length, coding potential, presence of chromatin signature(s), and 5' splice site dinucleotide.
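
The definition of TC as transcripts per exon, and the idea of regressing TC on a feature, can be illustrated with a minimal Python sketch. It is not the CaTCH implementation: the gene names, counts, and exon lengths are made-up toy values, the `Gene` class and both helper functions are hypothetical, and only one feature (mean exon length) is fitted by ordinary least squares, whereas the method described above uses several features.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str                # hypothetical gene label, not a real annotation
    n_transcripts: int       # annotated transcripts for the gene
    n_exons: int             # annotated exons for the gene
    mean_exon_length: float  # toy feature, in base pairs


def transcript_complexity(gene: Gene) -> float:
    """Return TC, defined in the abstract as transcripts per exon."""
    if gene.n_exons <= 0:
        raise ValueError("a gene needs at least one exon")
    return gene.n_transcripts / gene.n_exons


def fit_line(xs, ys):
    """Single-feature ordinary least squares: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data (assumed values, for illustration only).
genes = [
    Gene("GENE_A", n_transcripts=2, n_exons=10, mean_exon_length=120.0),
    Gene("GENE_B", n_transcripts=4, n_exons=8, mean_exon_length=150.0),
    Gene("GENE_C", n_transcripts=6, n_exons=6, mean_exon_length=200.0),
    Gene("GENE_D", n_transcripts=3, n_exons=12, mean_exon_length=130.0),
]

tc = [transcript_complexity(g) for g in genes]
slope, intercept = fit_line([g.mean_exon_length for g in genes], tc)

for g, value in zip(genes, tc):
    print(f"{g.name}: TC = {value:.3f}")
print(f"slope = {slope:.5f}, intercept = {intercept:.3f}")
```

In this toy data the fitted slope is positive, which matches the direction reported for exon length. The other three features would enter a multiple regression as additional columns.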

Keywords: Alternative splicing; CaTCH; Linear Regression; Random Forest; Transcript complexity.