OCRClassifier: integrating statistical control chart into machine learning framework for better detecting open chromatin regions

Xin Lai; Min Liu; Yuqian Liu; Xiaoyan Zhu; Jiayin Wang

doi:10.3389/fgene.2024.1400228

Front Genet. 2024 Dec 4:15:1400228. doi: 10.3389/fgene.2024.1400228. eCollection 2024.

Authors

Xin Lai¹,², Min Liu¹, Yuqian Liu¹, Xiaoyan Zhu¹, Jiayin Wang¹,²

Affiliations

¹ School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China.
² Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an, China.

Abstract

Open chromatin regions (OCRs) play a crucial role in transcriptional regulation and gene expression. In recent years, there has been growing interest in using plasma cell-free DNA (cfDNA) sequencing data to detect OCRs. By analyzing the characteristics of cfDNA fragments and their sequencing coverage, researchers can differentiate OCRs from non-OCRs. However, noise and variability in cfDNA-seq data challenge the noise-tolerant, learning-based approach to OCR estimation, because its training data contain numerous noisy labels that may reduce the accuracy of the results. Current methods for detecting OCRs rely on statistical features derived from typical open and closed chromatin regions to decide whether a region is an OCR or a non-OCR. Some atypical regions, however, exhibit statistical features that fall between the two categories, which makes it difficult to classify them definitively as either open or closed chromatin regions (CCRs). These regions should be considered partially open chromatin regions (pOCRs). In this paper, we present OCRClassifier, a novel framework that combines control charts and machine learning to address the impact of a high proportion of noisy labels in the training set and to classify chromatin open states accurately into three classes. Our method comprises two control charts. We first design a robust Hotelling T² control chart and create new run rules to accurately identify reliable OCRs and CCRs within the initial training set. We then use only this pure training set, consisting of OCRs and CCRs, to create and train a sensitized T² control chart. This sensitized T² control chart is specifically designed to differentiate accurately between the three categories of chromatin state: open, partially open, and closed. Experimental results demonstrate that, under this framework, the model not only performs excellently in three-class classification but also achieves higher accuracy and sensitivity in binary classification than current state-of-the-art models.
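For readers unfamiliar with the statistic underlying both control charts, the following is a minimal sketch of the classical Hotelling T² computation. It is not the robust or sensitized chart described above, and it does not implement the authors' run rules. The restriction to two features per region, the function names `fit_reference`, `t2_statistic`, and `out_of_control`, and the caller-supplied control limit are all assumptions made for illustration.

```python
from statistics import mean


def fit_reference(points):
    """Estimate the mean vector and sample covariance of two-feature points.

    `points` is a list of (x, y) pairs, e.g. two hypothetical per-region
    features. Returns ((mx, my), (sxx, sxy, syy)).
    """
    n = len(points)
    if n < 3:
        raise ValueError("need at least three reference points")
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def t2_statistic(point, center, cov):
    """Classical Hotelling T^2 = d' S^{-1} d for one observation."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    # Closed-form inverse of the symmetric 2x2 covariance applied to d.
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def out_of_control(points, center, cov, limit):
    """Flag observations whose T^2 exceeds a caller-supplied control limit.

    How the limit is chosen (and how three classes are separated) is left
    open here; the abstract does not specify it.
    """
    return [t2_statistic(p, center, cov) > limit for p in points]
```

A chart built this way treats a region as unusual when its feature vector lies far from the reference mean relative to the reference covariance. The abstract's robust and sensitized variants refine this idea, but their details are not given in this record.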

Keywords: cell-free DNA; machine learning approach; multivariate control chart; noisy label; open chromatin region; sequencing data analysis.

Grants and funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work was supported by the National Natural Science Foundation of China (72274152) and Shaanxi’s Natural Science Basic Research Program, grant number 2020JC-01.