While recent advances in next-generation sequencing technologies have enabled the creation of a multitude of databases in cancer genomic research, there is not yet a comprehensive database focusing on the annotation of driver indels (insertions and deletions). Therefore, we have developed the database of Cancer driver InDels (dbCID), a collection of known coding indels that are likely to be involved in cancer development, progression or therapy. dbCID contains experimentally supported and putative driver indels derived from manual curation of the literature and is freely available online at http://bioinfo.ahu.edu.cn:8080/dbCID. Using the data deposited in dbCID, we summarized the features of driver indels at four levels (gene, DNA, transcript and protein) by comparing them with putative neutral indels. We found that most of the genes containing driver indels in dbCID are known cancer genes that play a role in tumorigenesis. Contrary to expectation, the sequences affected by driver frameshift indels are not larger than those affected by neutral ones. In addition, both frameshift and inframe driver indels preferentially disrupt highly conserved regions, in DNA sequences as well as in protein domains. Finally, we developed a computational method for discriminating cancer driver from neutral frameshift indels based on the data deposited in dbCID. The proposed method outperformed other widely used non-cancer-specific predictors on an external test set, demonstrating the usefulness of the data deposited in dbCID. We hope that dbCID will serve as a benchmark for improving and evaluating prediction algorithms, and that the characteristics summarized here will assist in investigating the mechanism of indel-cancer association.
Keywords: database; driver mutation; frameshift indel; inframe indel.
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please email: [email protected].