INDELpred: Improving the prediction and interpretation of indel pathogenicity within the clinical genome

Yilin Wei; Tongda Zhang; Bangyao Wang; Xiaosen Jiang; Fei Ling; Mingyan Fang; Xin Jin; Yong Bai

doi:10.1016/j.xhgg.2024.100325

HGG Adv. 2024 Oct 10;5(4):100325. doi: 10.1016/j.xhgg.2024.100325. Epub 2024 Jul 10.

Authors

Yilin Wei¹, Tongda Zhang², Bangyao Wang², Xiaosen Jiang², Fei Ling³, Mingyan Fang², Xin Jin⁴, Yong Bai⁵

Affiliations

¹ School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China; BGI Research, Shenzhen 518083, China.
² BGI Research, Shenzhen 518083, China.
³ School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China.
⁴ BGI Research, Shenzhen 518083, China; The Innovation Centre of Ministry of Education for Development and Diseases, School of Medicine, South China University of Technology, Guangzhou 510006, China; Shanxi Medical University-BGI Collaborative Center for Future Medicine, Shanxi Medical University, Taiyuan 030001, China; Shenzhen Key Laboratory of Transomics Biotechnologies, BGI Research, Shenzhen, China. Electronic address: [email protected].
⁵ BGI Research, Shenzhen 518083, China. Electronic address: [email protected].

Abstract

Small insertions and deletions (indels) are an important yet challenging class of genetic variation with significant clinical implications. However, distinguishing pathogenic indels from neutral variants in clinical contexts remains an understudied problem. Here, we developed INDELpred, a machine-learning-based model for discriminating pathogenic from benign indels. INDELpred is built on key features, including allele frequency, indel length, function-based features, and gene-based features. Comprehensive evaluation showed that INDELpred outperformed competing methods in both computational efficiency and prediction accuracy. Importantly, INDELpred highlighted the crucial role of function-based features in identifying pathogenic indels, and its features are clearly interpretable, which aids understanding of disease-causing variants. We envisage INDELpred as a useful tool for detecting pathogenic indels in large-scale genomic datasets, thereby enhancing the precision of genetic diagnoses in clinical settings.
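
As an illustration of the kind of per-variant inputs the abstract names (allele frequency and indel length), the following minimal sketch derives a few simple features from VCF-style REF/ALT alleles. It is not the INDELpred implementation: the `IndelFeatures` class, the `indel_features` function, and the frameshift flag (used here as a simple stand-in for a function-based feature) are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndelFeatures:
    """Hypothetical feature record; not the feature set used by INDELpred."""
    length: int              # number of inserted or deleted bases
    is_insertion: bool       # True for insertions, False for deletions
    is_frameshift: bool      # length not divisible by 3 (only meaningful in coding sequence)
    allele_frequency: float  # population allele frequency supplied by the caller


def indel_features(ref: str, alt: str, allele_frequency: float) -> IndelFeatures:
    """Derive simple features from VCF-style REF/ALT alleles of one indel."""
    if len(ref) == len(alt):
        raise ValueError("not an indel: REF and ALT have equal length")
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    length = abs(len(alt) - len(ref))
    return IndelFeatures(
        length=length,
        is_insertion=len(alt) > len(ref),
        is_frameshift=length % 3 != 0,
        allele_frequency=allele_frequency,
    )
```

For example, `indel_features("A", "ATG", 0.001)` describes a 2-bp insertion that would shift the reading frame in coding sequence, whereas `indel_features("ACGT", "A", 0.2)` describes a 3-bp in-frame deletion. A feature record of this kind could then be passed to any classifier; the abstract does not specify which one INDELpred uses.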

Keywords: InDel; clinical genomics; machine learning; pathogenicity prediction; whole genome sequencing.

MeSH terms

Computational Biology / methods
Gene Frequency
Genome, Human / genetics
Genomics / methods
Humans
INDEL Mutation* / genetics
Machine Learning*
Software