Multicohort study testing the generalisability of the SASKit-ML stroke and PDAC prognostic model pipeline to other chronic diseases

BMJ Open. 2024 Sep 30;14(9):e088181. doi: 10.1136/bmjopen-2024-088181.

Abstract

Objectives: To validate and test the generalisability of the SASKit-ML pipeline, a prepublished feature selection and machine learning pipeline for the prediction of health deterioration after a stroke or pancreatic adenocarcinoma event, by using it to identify biomarkers of health deterioration in chronic disease.

Design: This is a validation study using a predefined protocol applied to multiple publicly available datasets, including longitudinal data from cohorts with type 2 diabetes (T2D), inflammatory bowel disease (IBD), rheumatoid arthritis (RA) and various cancers. The datasets were chosen to mimic as closely as possible the SASKit cohort, a prospective, longitudinal cohort study.

Data sources: Public data were used from the T2D (77 patients with potential pre-diabetes and 18 controls) and IBD (49 patients with IBD and 12 controls) branches of the Human Microbiome Project (HMP), RA Map (RA-MAP, 92 patients with RA, 22 controls) and The Cancer Genome Atlas (TCGA, 16 cancers).

Methods: Data integration steps were performed in accordance with the prepublished study protocol, generating features to predict disease outcomes using 10-fold cross-validated random survival forests.
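The 10-fold cross-validation scheme mentioned in the Methods can be sketched as follows. This is a minimal illustration, not the SASKit-ML implementation: the function names `k_fold_indices` and `cross_validate`, and the `fit` and `score` callables, are hypothetical. Fitting a random survival forest is left to whatever survival library the caller supplies.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled test folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, fit, score, k=10, seed=0):
    """Return one held-out score per fold.

    fit(train_idx) -> model and score(model, test_idx) -> float are
    hypothetical callables supplied by the caller (for example, a wrapper
    around a random survival forest and a C-index function).
    """
    scores = []
    for test_idx in k_fold_indices(n_samples, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        model = fit(train_idx)                  # train on the other k-1 folds
        scores.append(score(model, test_idx))   # evaluate on the held-out fold
    return scores
```

Each participant appears in exactly one test fold, so every reported score comes from data the model did not see during fitting.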

Outcome measures: Health deterioration was assessed using disease-specific clinical markers and endpoints in each cohort. In the HMP-T2D cohort, the worsening of glycated haemoglobin (HbA1c) levels (5.7% or more HbA1c in the blood), fasting plasma glucose (at least 100 mg/dL) and oral glucose tolerance test (at least 140) results was considered. For the HMP-IBD cohort, a worsening by at least 3 points of a disease-specific severity measure, the "Simple Clinical Colitis Activity Index" or the "Harvey-Bradshaw Index", indicated an event. For the RA-MAP cohort, the outcome was defined as a worsening of the "Disease Activity Score 28" or "Simple Disease Activity Index" by at least 5 points, a worsening of the "Health Assessment Questionnaire" score, or an increase in the number of swollen/tender joints. Finally, the outcome for all TCGA datasets was the progression-free interval.
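As an illustration of how the HMP-T2D thresholds above could be turned into a time-to-event label, the following sketch flags an event at the first visit where any marker reaches its threshold. The abstract does not say how the three markers are combined, so the any-of rule, the function names and the visit layout are all assumptions made for this example.

```python
# Thresholds as quoted in the abstract. The unit of the oral glucose
# tolerance test value is not stated there, so it is kept as a bare number.
HBA1C_PCT_MIN = 5.7
FPG_MG_DL_MIN = 100
OGTT_MIN = 140


def t2d_event(hba1c_pct, fpg_mg_dl, ogtt):
    """Return True if any marker meets its threshold (assumed any-of rule)."""
    return (
        hba1c_pct >= HBA1C_PCT_MIN
        or fpg_mg_dl >= FPG_MG_DL_MIN
        or ogtt >= OGTT_MIN
    )


def first_event_visit(visits):
    """Return the index of the first visit with an event.

    visits is a list of (hba1c_pct, fpg_mg_dl, ogtt) tuples in time order.
    Returns None when no visit qualifies, which a survival analysis would
    treat as a censored observation.
    """
    for index, visit in enumerate(visits):
        if t2d_event(*visit):
            return index
    return None
```

For example, `first_event_visit([(5.4, 90, 120), (5.6, 101, 130)])` returns `1`, because fasting plasma glucose reaches 100 mg/dL at the second visit.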

Results: Models for the prediction of health deterioration in T2D, IBD, RA and 16 cancers were produced. The T2D (C-index of 0.633 and Integrated Brier Score (IBS) of 0.107) and the RA (C-index of 0.654 and IBS of 0.150) models were modestly predictive. The IBD model was uninformative. TCGA models tended towards modest predictive power.
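To put the reported C-index values in context, the sketch below computes a concordance index for right-censored data in the style of Harrell's C. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The abstract does not state which C-index variant the pipeline uses, so this is an assumed, simplified version: pairs with tied times are ignored, and the function name is hypothetical.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs in which the higher risk fails first.

    times  : observed follow-up time per subject
    events : truthy if the event was observed, falsy if censored
    risks  : predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

With times `[1, 2, 3, 4]`, events `[1, 1, 0, 1]` and risks `[4, 3, 2, 1]`, all five comparable pairs are ranked correctly and the function returns `1.0`. Giving every subject the same risk score returns `0.5`.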

Conclusions: The SASKit-ML pipeline produces informative and useful features with the power to predict health deterioration in a variety of diseases and cancers; however, this performance is disease-dependent.

Keywords: DIABETES & ENDOCRINOLOGY; Health informatics; Inflammatory bowel disease; Rheumatology.

Publication types

  • Validation Study

MeSH terms

  • Aged
  • Arthritis, Rheumatoid
  • Biomarkers / blood
  • Chronic Disease
  • Cohort Studies
  • Diabetes Mellitus, Type 2* / complications
  • Female
  • Humans
  • Inflammatory Bowel Diseases
  • Longitudinal Studies
  • Machine Learning
  • Male
  • Middle Aged
  • Pancreatic Neoplasms*
  • Prognosis
  • Prospective Studies
  • Stroke*

Substances

  • Biomarkers