Introduction: Regulatory agencies generate a vast amount of textual data in the review process. For example, drug labeling serves as a valuable resource for regulatory agencies, such as U.S. Food and Drug Administration (FDA) and Europe Medical Agency (EMA), to communicate drug safety and effectiveness information to healthcare professionals and patients. Drug labeling also serves as a resource for pharmacovigilance and drug safety research. Automated text classification would significantly improve the analysis of drug labeling documents and conserve reviewer resources.
Methods: We utilized artificial intelligence in this study to classify drug-induced liver injury (DILI)-related content from drug labeling documents based on FDA's DILIrank dataset. We employed text mining and XGBoost models and utilized the Preferred Terms of Medical queries for adverse event standards to simplify the elimination of common words and phrases while retaining medical standard terms for FDA and EMA drug label datasets. Then, we constructed a document term matrix using weights computed by Term Frequency-Inverse Document Frequency (TF-IDF) for each included word/term/token.
Results: The automatic text classification model exhibited robust performance in predicting DILI, achieving cross-validation AUC scores exceeding 0.90 for both drug labels from FDA and EMA and literature abstracts from the Critical Assessment of Massive Data Analysis (CAMDA).
Discussion: Moreover, the text mining and XGBoost functions demonstrated in this study can be applied to other text processing and classification tasks.
Keywords: CatBoost; LightGBM; Matthews correlation coefficient (MCC); TF-IDF; XGBoost; anatomical therapeutic chemical classification (ATC); area under the curve.
Copyright © 2024 Chen, Wu, Wingerd, Liu, Xu, Thakkar, Pedersen, Donnelly, Mann, Tong, Wolfinger and Bao.