Perceptions of Data Set Experts on Important Characteristics of Health Data Sets Ready for Machine Learning: A Qualitative Study

JAMA Netw Open. 2023 Dec 1;6(12):e2345892. doi: 10.1001/jamanetworkopen.2023.45892.

Abstract

Importance: The lack of data quality frameworks to guide the development of artificial intelligence (AI)-ready data sets limits their usefulness for machine learning (ML) research in health care and hinders the diagnostic excellence of developed clinical AI applications for patient care.

Objective: To discern what constitutes high-quality and useful data sets for health and biomedical ML research purposes according to subject matter experts.

Design, setting, and participants: This qualitative study interviewed data set experts, particularly those who are creators and ML researchers. Semistructured interviews were conducted in English and remotely through a secure video conferencing platform between August 23, 2022, and January 5, 2023. A total of 93 experts were invited to participate. Twenty experts were enrolled and interviewed. Using purposive sampling, experts were affiliated with a diverse representation of 16 health data sets/databases across organizational sectors. Content analysis was used to evaluate survey information and thematic analysis was used to analyze interview data.

Main outcomes and measures: Data set experts' perceptions on what makes data sets AI ready.

Results: Participants included 20 data set experts (11 [55%] men; mean [SD] age, 42 [11] years), of whom all were health data set creators, and 18 of the 20 were also ML researchers. Themes (3 main and 11 subthemes) were identified and integrated into an AI-readiness framework to show their association within the health data ecosystem. Participants partially determined the AI readiness of data sets using priority appraisal elements of accuracy, completeness, consistency, and fitness. Ethical acquisition and societal impact emerged as appraisal considerations in that participant samples have not been described to date in prior data quality frameworks. Factors that drive creation of high-quality health data sets and mitigate risks associated with data reuse in ML research were also relevant to AI readiness. The state of data availability, data quality standards, documentation, team science, and incentivization were associated with elements of AI readiness and the overall perception of data set usefulness.

Conclusions and relevance: In this qualitative study of data set experts, participants contributed to the development of a grounded framework for AI data set quality. Data set AI readiness required the concerted appraisal of many elements and the balancing of transparency and ethical reflection against pragmatic constraints. The movement toward more reliable, relevant, and ethical AI and ML applications for patient care will inevitably require strategic updates to data set creation practices.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adult
  • Artificial Intelligence*
  • Delivery of Health Care
  • Female
  • Humans
  • Machine Learning
  • Male
  • Qualitative Research