Refining sleep staging accuracy: transfer learning coupled with scorability models

Wolfgang Ganglberger; Samaneh Nasiri; Haoqi Sun; Soriul Kim; Chol Shin; M Brandon Westover; Robert J Thomas

doi:10.1093/sleep/zsae202

Sleep. 2024 Nov 8;47(11):zsae202. doi: 10.1093/sleep/zsae202.

Authors

Wolfgang Ganglberger¹ ² ³, Samaneh Nasiri² ³ ⁴, Haoqi Sun¹ ² ³, Soriul Kim⁵, Chol Shin⁵ ⁶, M Brandon Westover¹ ² ³, Robert J Thomas³ ⁷

Affiliations

¹ Department of Neurology, Beth Israel Deaconess Medical Center, Boston, MA, USA.
² McCance Center for Brain Health, Massachusetts General Hospital, Boston, MA, USA.
³ Division of Sleep Medicine, Harvard Medical School, Boston, MA, USA.
⁴ Biomedical Informatics & Neurology, Emory School of Medicine, Atlanta, GA, USA.
⁵ Institute of Human Genomic Study, College of Medicine, Korea University, Seoul, Republic of Korea.
⁶ Biomedical Research Center, Korea University Ansan Hospital, Ansan, Republic of Korea.
⁷ Division of Pulmonary Critical Care & Sleep Medicine, Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA.

PMID: 39215679
DOI: 10.1093/sleep/zsae202

Abstract

Study objectives: This study aimed to (1) improve sleep staging accuracy through transfer learning (TL), to achieve or exceed human inter-expert agreement and (2) introduce a scorability model to assess the quality and trustworthiness of automated sleep staging.

Methods: A deep neural network (base model) was trained on a large multi-site polysomnography (PSG) dataset from the United States. TL was used to calibrate the model to a reduced montage and limited samples from the Korean Genome and Epidemiology Study (KoGES) dataset. Model performance was compared to inter-expert reliability among three human experts. A scorability assessment was developed to predict the agreement between the model and human experts.

Results: Initial sleep staging by the base model showed lower agreement with experts (κ = 0.55) compared to the inter-expert agreement (κ = 0.62). Calibration with 324 randomly sampled training cases matched expert agreement levels. Further targeted sampling improved performance, with models exceeding inter-expert agreement (κ = 0.70). The scorability assessment, combining biosignal quality and model confidence features, predicted model-expert agreement moderately well (R² = 0.42). Recordings with higher scorability scores demonstrated greater model-expert agreement than inter-expert agreement. Even with lower scorability scores, model performance was comparable to inter-expert agreement.
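The agreement values above are reported as κ. As an illustration only, the following minimal Python sketch computes the standard unweighted Cohen's kappa between two epoch-by-epoch stage sequences. The abstract does not state which kappa variant or averaging scheme the authors used, so the function name cohen_kappa, the stage labels, and the example data are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences.

    Assumption: one label per 30-second epoch from each rater. This is the
    textbook formula, not necessarily the exact procedure used in the paper.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")

    n = len(rater_a)

    # Observed agreement: fraction of epochs with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies,
    # summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # Degenerate case: both raters used a single identical label throughout.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Made-up stage labels for eight epochs (illustrative, not study data).
model_stages = ["W", "N1", "N2", "N2", "N3", "R", "N2", "W"]
expert_stages = ["W", "N2", "N2", "N2", "N3", "R", "N1", "W"]

kappa = cohen_kappa(model_stages, expert_stages)
print(f"kappa = {kappa:.3f}")
```

Because kappa subtracts the agreement expected by chance, it is lower than raw percent agreement whenever the two raters share similar stage distributions, which is why it is the usual choice for comparing a model against human scorers.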

Conclusions: Fine-tuning a pretrained neural network through targeted TL significantly enhances sleep staging performance for an atypical montage, achieving and surpassing human expert agreement levels. The introduction of a scorability assessment provides a robust measure of reliability, ensuring quality control and enhancing the practical application of the system before deployment. This approach marks an important advancement in automated sleep analysis, demonstrating the potential for AI to exceed human performance in clinical settings.

Keywords: confidence; deep learning; home sleep recordings; interrater reliability; scorability; sleep staging; transfer learning.

© The Author(s) 2024. Published by Oxford University Press on behalf of Sleep Research Society. All rights reserved. For commercial re-use, please contact [email protected] for reprints and translation rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site—for further information please contact [email protected].

MeSH terms

Adult
Deep Learning
Female
Humans
Male
Middle Aged
Neural Networks, Computer
Polysomnography* / methods
Polysomnography* / standards
Reproducibility of Results
Sleep Stages* / physiology
