Contextualized race and ethnicity annotations for clinical text from MIMIC-III

Oliver J Bear Don't Walk 4th; Adrienne Pichon; Harry Reyes Nieva; Tony Sun; Jaan Li; Josh Joseph; Sivan Kinberg; Lauren R Richter; Salvatore Crusco; Kyle Kulas; Shaan A Ahmed; Daniel Snyder; Ashkon Rahbari; Benjamin L Ranard; Pallavi Juneja; Dina Demner-Fushman; Noémie Elhadad

doi:10.1038/s41597-024-04183-2

Sci Data. 2024 Dec 5;11(1):1332. doi: 10.1038/s41597-024-04183-2.

Authors

Oliver J Bear Don't Walk 4th¹, Adrienne Pichon², Harry Reyes Nieva²,³, Tony Sun², Jaan Li⁴,⁵, Josh Joseph³,⁶, Sivan Kinberg², Lauren R Richter², Salvatore Crusco²,⁷, Kyle Kulas², Shaan A Ahmed², Daniel Snyder², Ashkon Rahbari², Benjamin L Ranard²,⁷, Pallavi Juneja², Dina Demner-Fushman⁸, Noémie Elhadad²

Affiliations

¹ University of Washington, Seattle, Washington, USA.
² Columbia University Irving Medical Center, New York, New York, USA.
³ Harvard Medical School, Boston, Massachusetts, USA.
⁴ One Fact Foundation, Claymont, Delaware, USA.
⁵ University of Tartu, Tartu, Estonia.
⁶ Brigham and Women's Hospital, Boston, Massachusetts, USA.
⁷ NewYork-Presbyterian Hospital, New York, New York, USA.
⁸ US National Library of Medicine, Bethesda, Maryland, USA.

Abstract

Observational health research often relies on accurate and complete race and ethnicity (RE) patient information for tasks such as characterizing cohorts, assessing quality/performance metrics of hospitals and health systems, and identifying health disparities. While the electronic health record contains structured, accessible patient-level RE data, these data are often missing, inaccurate, or lacking in granular detail. Natural language processing models can be trained to identify RE in clinical text, which can supplement missing RE data in clinical data repositories. Here we describe the Contextualized Race and Ethnicity Annotations for Clinical Text (C-REACT) Dataset, which comprises 12,000 patients and 17,281 sentences from their clinical notes in the MIMIC-III dataset. Using these sentences, two sets of reference standard annotations for RE data are made available with annotation guidelines. The first set of annotations comprises highly granular information related to RE, such as preferred language and country of origin, while the second set contains RE labels annotated by physicians. This dataset can support health systems' ability to use RE data to serve health equity goals.

Publication types

Dataset

MeSH terms

Electronic Health Records*
Ethnicity*
Humans
Natural Language Processing*
Racial Groups*
