SCALPEL3: A scalable open-source library for healthcare claims databases

Int J Med Inform. 2020 Sep:141:104203. doi: 10.1016/j.ijmedinf.2020.104203. Epub 2020 May 28.

Abstract

Objective: This article introduces SCALPEL3 (Scalable Pipeline for Health Data), a scalable open-source framework for studies involving Large Observational Databases (LODs). It focuses on scalable medical concept extraction, easy interactive analysis, and helpers for data flow analysis to accelerate studies performed on LODs.

Materials and methods: Inspired by web analytics, SCALPEL3 relies on distributed computing, data denormalization and columnar storage. It was compared to the existing SAS-Oracle SNDS infrastructure by running several queries on a dataset containing a three-year history of healthcare claims for 13.7 million patients.
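
As an illustration of the denormalization and columnar storage mentioned above, the following minimal single-machine sketch joins two normalized tables into one flat table and then pivots it into columns, so that a query touching one field scans only that column. The table names, field names and values are hypothetical and do not reflect the SNDS schema or the SCALPEL3 API; the framework described here applies these ideas with distributed computing rather than in-memory Python lists.

```python
from collections import Counter, defaultdict

# Hypothetical normalized tables (illustrative values, not the SNDS schema).
patients = [
    {"patient_id": 1, "birth_year": 1950},
    {"patient_id": 2, "birth_year": 1972},
]
claims = [
    {"claim_id": 10, "patient_id": 1, "drug_code": "A"},
    {"claim_id": 11, "patient_id": 1, "drug_code": "B"},
    {"claim_id": 12, "patient_id": 2, "drug_code": "A"},
]


def denormalize(patients, claims):
    """Join each claim with its patient's attributes into one flat row."""
    patient_by_id = {p["patient_id"]: p for p in patients}
    return [{**claim, **patient_by_id[claim["patient_id"]]} for claim in claims]


def to_columnar(rows):
    """Pivot a list of row dicts into a dict mapping column name to values."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


flat = denormalize(patients, claims)
columns = to_columnar(flat)

# Counting claims per drug code reads a single column, not whole rows.
claims_per_drug = Counter(columns["drug_code"])
```

The join is paid once, when the flat table is built, so later queries need no further joins and read only the columns they use.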

Results and discussion: SCALPEL3's horizontal scalability allows it to handle large tasks faster than the existing infrastructure, while its performance is comparable when only a few executors are used. SCALPEL3 provides fine-grained interactive control of data processing through legible code, which helps to build fully reproducible studies and improves the maintainability and auditability of studies performed on LODs.

Conclusion: SCALPEL3 makes studies based on SNDS much easier and more scalable than the existing framework [1]. It is now used at the agency collecting SNDS data and at the French Ministry of Health, and will soon be used at the National Health Data Hub in France [2].

Keywords: ETL; Healthcare claims data; Interactive data manipulation; Large observational database; Reproducibility; Scalability.

MeSH terms

  • Databases, Factual
  • Delivery of Health Care*
  • France
  • Humans
  • Reproducibility of Results