Refining sleep staging accuracy: transfer learning coupled with scorability models

Sleep. 2024 Nov 8;47(11):zsae202. doi: 10.1093/sleep/zsae202.

Abstract

Study objectives: This study aimed to (1) improve sleep staging accuracy through transfer learning (TL) so that it matches or exceeds human inter-expert agreement, and (2) introduce a scorability model to assess the quality and trustworthiness of automated sleep staging.

Methods: A deep neural network (base model) was trained on a large multi-site polysomnography (PSG) dataset from the United States. TL was used to calibrate the model to a reduced montage and limited samples from the Korean Genome and Epidemiology Study (KoGES) dataset. Model performance was compared to inter-expert reliability among three human experts. A scorability assessment was developed to predict the agreement between the model and human experts.

Results: Initial sleep staging by the base model showed lower agreement with experts (κ = 0.55) than the experts showed with one another (κ = 0.62). Calibration with 324 randomly sampled training cases brought model-expert agreement up to the inter-expert level. Further targeted sampling improved performance, with models exceeding inter-expert agreement (κ = 0.70). The scorability assessment, which combines biosignal quality and model confidence features, predicted model-expert agreement moderately well (R² = 0.42). For recordings with higher scorability scores, model-expert agreement was greater than inter-expert agreement. Even for recordings with lower scorability scores, model performance was comparable to inter-expert agreement.
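
The agreement values above are reported as κ. The abstract does not state which kappa variant was used or how the three experts' scores were combined, so the following is only a minimal sketch that assumes Cohen's kappa computed epoch by epoch between two label sequences. The function name `cohen_kappa` and the example hypnograms are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of
    agreement and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(rater_a)

    # Observed agreement: fraction of epochs given the same stage by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over stages of the product of marginal proportions.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # Degenerate case: both raters used a single identical label throughout.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical 10-epoch hypnograms (illustrative values only, not study data).
model_stages = ["W", "N1", "N2", "N2", "N3", "REM", "REM", "N2", "W", "N2"]
expert_stages = ["W", "N2", "N2", "N2", "N3", "REM", "N2", "N2", "W", "N1"]

kappa = cohen_kappa(model_stages, expert_stages)
print(f"kappa = {kappa:.2f}")
```

Because kappa corrects for chance agreement, a stage that dominates the night (typically N2) contributes less to the score than raw percent agreement would suggest. This is one reason κ is used when comparing model-expert agreement with inter-expert agreement.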

Conclusions: Fine-tuning a pretrained neural network through targeted TL significantly enhances sleep staging performance for an atypical montage, achieving and surpassing human expert agreement levels. The scorability assessment provides a measure of reliability that supports quality control and makes the system more practical to apply. This approach marks an important advancement in automated sleep analysis, demonstrating the potential for AI to exceed human performance in clinical settings.

Keywords: confidence; deep learning; home sleep recordings; interrater reliability; scorability; sleep staging; transfer learning.

MeSH terms

  • Adult
  • Deep Learning
  • Female
  • Humans
  • Male
  • Middle Aged
  • Neural Networks, Computer
  • Polysomnography* / methods
  • Polysomnography* / standards
  • Reproducibility of Results
  • Sleep Stages* / physiology