Background: Many important clinical decisions depend on causal knowledge (CK). Although many causal knowledge bases for medicine have been constructed, comprehensive evaluation based on real-world data and methods for handling potential knowledge noise are still lacking.
Objective: The objectives of our study are threefold: (1) to propose a framework for constructing a large-scale, high-quality causal knowledge graph (CKG); (2) to design knowledge noise reduction methods that improve the quality of the CKG; and (3) to evaluate the knowledge completeness and accuracy of the CKG using real-world data.
Material and methods: We extracted causal triples from three knowledge sources (SemMedDB, UpToDate and Churchill's Pocketbook of Differential Diagnosis) using rule-based methods and language models, performed ontological encoding, and then designed a semantic model linking electronic health record (EHR) data to the CKG to complete knowledge instantiation. We proposed two graph pruning strategies (co-occurrence ratio and causality ratio) to reduce the potential noise introduced by SemMedDB. Finally, we evaluated the CKG using diagnostic decision support (DDS) for diabetic nephropathy (DN) as a real-world case. The data came from the EHR system of a Chinese hospital and covered October 2010 to October 2020. The knowledge completeness and accuracy of the CKG were evaluated using three state-of-the-art embedding methods (R-GCN, MHGRN and MedPath), annotated clinical text, and expert review.
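The co-occurrence-based pruning idea can be illustrated with a minimal Python sketch. The abstract does not define the co-occurrence ratio, so the definition used here is an assumption made only for illustration: the fraction of patient records containing the head concept that also contain the tail concept. The function names, the threshold and the toy records are hypothetical and are not taken from the study.

```python
def cooccurrence_ratio(head, tail, records):
    """Assumed definition: share of records containing `head` that also contain `tail`."""
    with_head = [r for r in records if head in r]
    if not with_head:
        return 0.0  # head never observed, so no evidence of co-occurrence
    return sum(1 for r in with_head if tail in r) / len(with_head)


def prune_triples(triples, records, threshold):
    """Keep only (head, relation, tail) triples whose ratio reaches the threshold."""
    return [
        (h, rel, t)
        for h, rel, t in triples
        if cooccurrence_ratio(h, t, records) >= threshold
    ]


# Toy data: each record is the set of concepts observed for one patient.
records = [
    {"diabetes", "proteinuria"},
    {"diabetes", "proteinuria"},
    {"diabetes"},
    {"fracture"},
]
triples = [
    ("diabetes", "CAUSES", "proteinuria"),
    ("diabetes", "CAUSES", "fracture"),
]
kept = prune_triples(triples, records, threshold=0.5)
```

In this toy example only the first triple survives, because the second pair never co-occurs in any record. A causality-ratio filter would presumably follow the same pattern with a different scoring function.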
Results: The graph included 153,289 concepts and 1,719,968 causal triples. A total of 1427 inpatient records were used for evaluation. Combining the three knowledge sources gave better results than using SemMedDB alone (three models: area under the receiver operating characteristic curve (AUC): p < 0.01, F1: p < 0.01), and the graph covered 93.9% of the causal relations between diseases and diagnostic evidence recorded in clinical text. Among all relations related to disease progression, causal relations played a vital role in DDS of DN (three models: AUC: p > 0.05, F1: p > 0.05). After pruning, the knowledge accuracy of the CKG improved significantly (three models: AUC: p < 0.01, F1: p < 0.01; expert review: average accuracy: +5.5%).
Conclusions: The results demonstrated that our proposed CKG could completely and accurately capture the abstract CK underlying the concrete EHR data, and that the pruning strategies could improve the knowledge accuracy of the CKG. The CKG has the potential to be applied to the DDS of diseases.
Keywords: Causal knowledge; Diabetic nephropathy; Electronic health record; Knowledge graph.
Copyright © 2023 Elsevier Inc. All rights reserved.