Background: Accurate prediction of copy number variations (CNVs) from targeted capture next-generation sequencing (NGS) data relies on effective normalization of read coverage profiles. Normalization is particularly challenging because of hidden systematic biases such as GC bias, which can significantly affect the sensitivity and specificity of CNV detection. In many cases, kit manifests provide only the genome coordinates of the targeted regions, and the exact design of the oligo capture baits is not available. Although the on-target regions overlap substantially with the bait design, the missing bait information leads to less accurate normalization of the coverage data. In this study, we propose a novel approach that uses a 1D convolutional neural network (CNN) model to predict the positions of capture baits in complex whole-exome sequencing (WES) kits. By accurately identifying bait coordinates, our model enables precise normalization of GC bias across target regions, thereby improving CNV data normalization.
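To make the role of bait coordinates in GC normalization concrete, the following is a minimal sketch of one common correction scheme: per-region coverage is rescaled by the median coverage of regions with similar GC content. It is an illustration only and not necessarily the normalization used in this study; the function names, the choice of ten GC bins, and the toy values are all assumptions. The GC fraction of each region depends on which sequence is measured, which is why region coordinates that match the true baits matter.

```python
from statistics import median


def gc_fraction(seq):
    """Return the fraction of G/C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_normalize(coverage, gc, n_bins=10):
    """Rescale each region's coverage by the median coverage of its GC bin.

    coverage: mean read depth per region (hypothetical bait or target regions)
    gc:       GC fraction per region, in [0, 1]
    n_bins:   number of equal-width GC bins (an assumed, tunable value)
    """
    # Assign each region to a GC bin; a GC fraction of 1.0 goes in the last bin.
    bins = [min(int(g * n_bins), n_bins - 1) for g in gc]

    # Collect the coverage values that fall into each bin.
    by_bin = {}
    for b, c in zip(bins, coverage):
        by_bin.setdefault(b, []).append(c)
    bin_median = {b: median(values) for b, values in by_bin.items()}

    # Scale so that every GC bin ends up centred on the global median.
    global_median = median(coverage)
    return [
        c * global_median / bin_median[b] if bin_median[b] > 0 else 0.0
        for c, b in zip(coverage, bins)
    ]


# Toy data: two mid-GC regions and two high-GC regions with depressed coverage.
coverage = [100.0, 120.0, 50.0, 60.0]
gc = [0.45, 0.48, 0.72, 0.75]
normalized = gc_normalize(coverage, gc)
```

In this toy case the high-GC regions, which start at roughly half the depth of the mid-GC regions, are brought onto the same scale after correction.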
Results: We evaluated hyperparameters, model architecture, and model complexity to identify the optimal configuration for predicting the likely positions of the oligo capture baits. Our analysis shows that the CNN models outperform the dense neural network (Dense NN) for bait prediction. Batch normalization is the most important factor for stable training of the CNN models. Our results indicate that the spatial structure of the data plays an important role in prediction performance. We show that combined input data, including experimental coverage, on-target information, and sequence data, are critical for bait prediction. Furthermore, compared with the on-target information, the CNN models predicted bait positions more accurately, with predictions showing a high degree of overlap (>90%) with the true bait positions.
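One way to read the overlap criterion above is as the fraction of a true bait interval that is covered by the predicted interval. The sketch below assumes half-open (start, end) coordinates on the same chromosome and a hypothetical helper name, overlap_fraction; the exact overlap definition used in the study is not stated here, so this is an assumed interpretation.

```python
def overlap_fraction(pred, true):
    """Fraction of the true interval covered by the predicted interval.

    pred, true: (start, end) tuples, half-open, on the same chromosome.
    """
    p_start, p_end = pred
    t_start, t_end = true
    # Length of the intersection, clamped at zero for disjoint intervals.
    intersection = max(0, min(p_end, t_end) - max(p_start, t_start))
    true_length = t_end - t_start
    return intersection / true_length if true_length > 0 else 0.0


# Hypothetical 120 bp bait and a prediction shifted by 5 bp.
shifted = overlap_fraction((100, 220), (105, 225))
disjoint = overlap_fraction((0, 50), (100, 200))
```

Under this definition, a prediction shifted by 5 bp against a 120 bp bait still clears a 90% threshold, while a non-overlapping prediction scores zero.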
Conclusions: This study highlights the potential of CNN-based approaches to optimize coverage data analysis and improve copy number data normalization. Subsequent CNV detection based on these predicted coordinates facilitates more accurate measurement of coverage profiles and better normalization for GC bias. As a result, this approach could reduce systematic bias and improve the sensitivity and specificity of CNV detection in genomic studies.
Keywords: Copy number variation; Machine learning; Oligo capture baits; Targeted capture.
© 2024. The Author(s).