CheXpert Plus: Augmenting a Large Chest X-ray Dataset with
Text Radiology Reports, Patient Demographics and
Additional Image Formats

Pierre Chambon∗♠, Jean-Benoit Delbrouck∗♠, Thomas Sounack, Shih-Cheng Huang, Zhihong Chen, Maya Varma, Steven QH Truong, Chu The Chuong, Curtis P. Langlotz
Stanford AIMI, VinBrain
∗ Equal contribution
Abstract

Since the release of the original CheXpert paper Irvin et al. (2019) five years ago, CheXpert has become one of the most widely used and cited clinical AI datasets. The emergence of vision language models has sparked increased demand for the reports linked to CheXpert images, along with a growing interest among AI fairness researchers in obtaining demographic data. To address this, CheXpert Plus serves as a new collection of radiology data sources, made publicly available to enhance the scaling, performance, robustness, and fairness of models for all subsequent machine learning tasks in the field of radiology. CheXpert Plus is the largest text dataset publicly released in radiology, with a total of 36 million text tokens, including 13 million impression tokens. To the best of our knowledge, it represents the largest text de-identification effort in radiology, with almost 1 million PHI spans anonymized. It is only the second time that a large-scale English paired dataset has been released in radiology, thereby enabling, for the first time, cross-institution training at scale. All reports are paired with high-quality images in DICOM format, along with numerous image and patient metadata covering various clinical and socio-economic groups, as well as many pathology labels and RadGraph annotations. We hope this dataset will boost research for AI models that can further assist radiologists and help improve medical care. Data is available at the following URL: https://stanfordaimi.azurewebsites.net/datasets/5158c524-d3ab-4e02-96e9-6ee9efc110a1. Models are available at the following URL: https://github.com/Stanford-AIMI/chexpert-plus

1 Introduction

The rapid advancement of deep learning technologies has catalyzed transformative changes across numerous fields, with healthcare standing out as a particularly promising domain for such innovations. Integration of artificial intelligence into the analysis of chest X-rays has emerged as a significant area of progress and study. Early work has yielded models that rival the diagnostic capabilities of radiologists (cite CheXpert PLOS). Subsequent research has leveraged self-supervised pre-training objectives that bridge radiology reports with chest X-ray images, achieving enhanced performance metrics with considerably fewer data requirements Huang et al. (2023). More recently, the development of Vision Language Models (VLMs) aimed at generating chest X-ray reports has begun to illustrate the vast potential of AI to improve diagnostic processes. These advancements collectively move us toward a future where AI significantly enhances patient outcomes through more accurate and efficient diagnoses.

In recent years, there has been a surge in research developing Vision Language Models (VLMs) for chest X-ray analysis. Some key developments in this area include generating radiology reports Chen et al. (2020); Delbrouck et al. (2022); Hyland et al. (2023); Chaves et al. (2024), employing self-supervised learning techniques Huang et al. (2021); Zhang et al. (2022); Varma et al. (2023); Bannur et al. (2023), using stable diffusion models Chambon et al. (2022a), and applying broad reasoning foundation models Moor et al. (2023); Wu et al. (2023); Tu et al. (2024); Chen et al. (2024).

However, the success of these technologies underscores the critical need for extensive datasets that encompass both images and text to train increasingly sophisticated models. This paper presents a significant update to the CheXpert dataset, accompanied by the release of corresponding radiology reports and patient information. We have undertaken a rigorous de-identification process to ensure that these reports are devoid of any patient Protected Health Information (PHI), thereby aligning with the ethical standards required for the utilization of medical data in research. This enhancement of the CheXpert dataset marks a pivotal step forward in the development of AI tools capable of transforming patient care in radiology and beyond.

Our new dataset, referred to as CheXpert Plus, comprises the following sources of data:

Images, available in both DICOM and PNG format.

Reports, corresponding to each CheXpert image, parsed into subsections.

Demographics, including several clinical and socio-economic attributes.

14 pathology labels, automatically extracted from the reports.

RadGraph annotations, extracted separately from Findings and Impression sections.

Models, trained on these data sources for key machine learning tasks.

2 Related Work

Dataset Patients X-rays Labels Reports DICOM Meta
MIMIC-CXR Johnson et al. (2019) 65,379 377,095 14
OpenI Demner-Fushman et al. (2012) 3,996 8,121 - - -
CheXpert Irvin et al. (2019) 64,540 223,414 14 - -
BraX Reis et al. (2022) 18,442 40,967 14 - - -
CandidPTX Feng et al. (2021b) 13,744 19,234 3 - -
NIH Wang et al. (2017) 30,805 112,120 14 - - -
PadChest Bustos et al. (2020) 67,625 160,861 - * -
VinDR Nguyen et al. (2020) 15,000 15,000 14 -
MIDRC 131,351 131,351 1 - -
JF Healthcare Healthcare 16,000 16,000 1 - - -
CheXpert Plus 64,725 223,462 14
Table 1: Comparison of Chest X-ray Datasets. *PadChest reports are in Spanish

Several datasets comprising chest X-rays and their corresponding reports have been published, with the main ones outlined in Table  1. Notably, the PadChest dataset Bustos et al. (2020), created from chest X-rays collected at the Hospital Universitario de San Juan, Alicante, Spain, between January 2009 and December 2017, encompasses 109,931 studies and 168,861 images, all de-identified for research purposes. It includes radiology reports written in Spanish with comprehensive metadata. The data has been processed for quality and consistency, excluding non-compliant images based on specific criteria such as readability, incorrect modality, or inappropriate projections.

The MIMIC Chest X-ray Database Johnson et al. (2019) offers a collection of 377,110 de-identified chest radiographs in DICOM format, paired with free-text radiology reports from 227,835 studies at Beth Israel Deaconess Medical Center (Boston, MA). The de-identification process included generating random identifiers for patients and studies while preserving chronological data integrity. A custom algorithm was developed to de-identify the chest radiographs while retaining medically relevant information.

Other datasets, smaller in size, are also available to the research community, such as Open-I Demner-Fushman et al. (2012), BIMCV-COVID19 Vayá et al. (2020) and CANDID-PTX Feng et al. (2021a).

3 Dataset Composition

3.1 General Composition

CheXpert Plus is a dataset that pairs text and images, featuring 223,228 unique pairs of radiology reports and chest X-rays from 187,711 studies and 64,725 patients. A single patient may be linked to several studies, and each study may include multiple chest X-rays. The dataset comprises:

  • 223,228 unique chest X-ray images in DICOM format, each featuring 47 DICOM metadata elements. These images are also made available in PNG format (Section 3.2).

  • 187,711 unique radiology reports, each report divided into subsections extracted from the original corresponding report (Section 3.3).

  • 64,725 unique patients, with 8 de-identified demographic data points (Section 3.4).

  • 187,711 unique annotations for 14 different chest pathologies using CheXbert (Section 3.5).

  • 187,575 unique RadGraph annotations for impression sections and 47,328 unique RadGraph annotations for finding sections, using the pretrained RadGraph model (Section 3.6).

  • Several models trained on these data sources for key radiology tasks (Section 3.7).

3.2 Images

CheXpert Plus comprises 223,228 unique images available both in DICOM format and in PNG format.

The DICOM format includes image metadata attributes that encapsulate essential information for medical image processing and interpretation. In total, we release up to 47 DICOM metadata elements, listed in Appendix B. With these image metadata attributes, it is possible to convert the DICOM pixel data into the PNG format. While the DICOM format offers the most comprehensive data, the PNG format may be more straightforward to integrate into existing training pipelines.
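As a rough illustration of the conversion mentioned above, the following sketch rescales raw pixel intensities to the 8-bit range expected by a grayscale PNG. It is not the conversion used for the release: the function name `to_uint8` is hypothetical, min-max scaling is only one possible choice, and the sketch assumes that the standard DICOM attribute PhotometricInterpretation is available to decide whether intensities must be inverted.

```python
def to_uint8(pixels, photometric="MONOCHROME2"):
    """Rescale a 2D list of raw intensities to the 0..255 range.

    `photometric` mirrors the DICOM PhotometricInterpretation attribute:
    with MONOCHROME1 the lowest value is displayed as white, so the
    rescaled intensities are inverted.
    """
    flat = [value for row in pixels for value in row]
    lowest, highest = min(flat), max(flat)
    span = (highest - lowest) or 1  # avoid dividing by zero on constant images
    result = []
    for row in pixels:
        scaled = [round(255 * (value - lowest) / span) for value in row]
        if photometric == "MONOCHROME1":
            scaled = [255 - value for value in scaled]
        result.append(scaled)
    return result


# Toy 12-bit image with two rows of two pixels.
image = [[0, 2048], [4095, 1024]]
print(to_uint8(image))
print(to_uint8(image, photometric="MONOCHROME1"))
```

The rescaled rows can then be written to disk with any PNG encoder. A real pipeline would likely also take windowing attributes into account, which this sketch ignores.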

3.3 Reports

Table 2: Aggregated statistics for all reports and report sections present in CheXpert Plus. Report tokens are computed with the BERT-base-uncased tokenizer.
Report Section  Studies (Total)  Tokens (Total)  Tokens (Mean)  Tokens (Std)  Tokens (Median)  Tokens ($\eta_{.1}$)  Tokens ($\eta_{.9}$)
Full Report 187,711 36,469,132 194 62 181 135 268
Impression 187,575 13,351,758 71 33 66 34 113
Findings 47,328 4,844,613 102 56 90 49 171
Narrative 183,022 2,524,834 13 6 13 10 18
Clinical History 116,231 1,661,044 14 6 14 7 22
History 30,865 411,485 13 6 12 7 20
Comparison 178,893 1,707,006 9 6 8 4 16
Technique 9,173 105,376 11 7 10 8 14
Procedure Comments 20,563 179,681 8 4 8 8 8
End of Impression 9,033 216,481 23 32 4 2 65
Summary 160,081 4,147,222 25 16 25 8 42

CheXpert Plus contains 187,711 distinct radiology reports, each corresponding to a separate study. For studies comprising more than one image, the report compiles information extracted from all the images in the study. Therefore, a single report may describe findings from multiple images. Additionally, a single study may be preceded by one or more related studies, especially in cases where a patient has undergone multiple examinations. To facilitate the analysis of a disease’s progression over multiple studies, we have introduced the field patient_report_date_order, which ranks the studies of each patient in chronological order. This enhancement makes CheXpert Plus compatible with ML approaches that condition their predictions not only on a single study but also on its prior studies.
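The chronological ranking described above can be sketched as follows. This is only an illustration of the idea, not the code used to build the release: the record keys `patient_id` and `report_date` are hypothetical, and starting the rank at 1 is an assumption.

```python
from collections import defaultdict
from datetime import date


def add_date_order(studies):
    """Add a per-patient chronological rank to each study record (in place)."""
    by_patient = defaultdict(list)
    for study in studies:
        by_patient[study["patient_id"]].append(study)
    for patient_studies in by_patient.values():
        patient_studies.sort(key=lambda study: study["report_date"])
        for rank, study in enumerate(patient_studies, start=1):
            study["patient_report_date_order"] = rank
    return studies


def prior_studies(studies, current):
    """Return the earlier studies of the same patient, oldest first."""
    earlier = [
        study for study in studies
        if study["patient_id"] == current["patient_id"]
        and study["patient_report_date_order"] < current["patient_report_date_order"]
    ]
    return sorted(earlier, key=lambda study: study["patient_report_date_order"])


studies = add_date_order([
    {"patient_id": "p1", "report_date": date(2020, 5, 1)},
    {"patient_id": "p1", "report_date": date(2019, 1, 1)},
    {"patient_id": "p2", "report_date": date(2021, 1, 1)},
])
print([study["patient_report_date_order"] for study in studies])
```

A model that conditions on prior studies could call `prior_studies` to collect the earlier reports or images of the same patient before making a prediction.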

Section names were not used consistently by the radiologists producing these reports. Therefore, the reports from CheXpert Plus may contain as many as 11 distinct sections:

Narrative: This section outlines the type of exam and the date it was conducted.

Clinical History: This details the reasons the ordering provider requested the radiology exam. It typically includes patient metadata and symptoms.

History: Similar to the Clinical History, this section explains the reasons for the radiology exam provided by the requesting provider, often incorporating patient symptoms and a reason for the exam.

Comparison: Here, any previous studies that were used as part of the interpretation are listed.

Technique: This section provides details on how the exam was conducted, including the views captured and whether contrast material was injected.

Procedure Comments: Noted here are additional details about the procedure, such as the number of views.

Findings: Observations reported by the radiologist are listed here in detail.

Impression: This section summarizes the findings and the radiologist’s interpretation.

End of Impression: Closing remarks and often the reporting radiologist’s name are included in this section.

Summary: A brief overview of the study is provided here, sometimes alongside a number that classifies the study as normal, abnormal, or extremely abnormal.

Accession Number: A de-identified accession number is given. Although this section holds limited informational value for any ML models, it is retained to preserve the original format of the report.

Aggregated statistics for each section are made available in Table 2. In CheXpert Plus reports, an impression is almost always present.
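The released reports are already parsed into the sections listed above. For readers who want to apply a similar split to other reports, the following is a minimal sketch. It assumes that each section starts with its upper-case name followed by a colon, which may not hold for every report, and the function name `split_sections` is hypothetical.

```python
import re

SECTION_NAMES = [
    "NARRATIVE", "CLINICAL HISTORY", "HISTORY", "COMPARISON", "TECHNIQUE",
    "PROCEDURE COMMENTS", "FINDINGS", "IMPRESSION", "END OF IMPRESSION",
    "SUMMARY", "ACCESSION NUMBER",
]

# Longest names first, so that "CLINICAL HISTORY" is preferred over "HISTORY".
_HEADER = re.compile(
    r"\b("
    + "|".join(re.escape(name) for name in sorted(SECTION_NAMES, key=len, reverse=True))
    + r"):"
)


def split_sections(report):
    """Map each section header found in `report` to the text that follows it."""
    matches = list(_HEADER.finditer(report))
    sections = {}
    for index, match in enumerate(matches):
        end = matches[index + 1].start() if index + 1 < len(matches) else len(report)
        sections[match.group(1)] = report[match.end():end].strip()
    return sections


example = (
    "NARRATIVE: Chest radiograph. CLINICAL HISTORY: Cough. "
    "FINDINGS: Clear lungs. IMPRESSION: No acute disease. "
    "END OF IMPRESSION: Signed."
)
for name, text in split_sections(example).items():
    print(f"{name} -> {text}")
```

Because section names were not used consistently, a production parser would need additional rules for spelling variants and missing colons.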

3.4 Demographics

Figure 1: Frequency plots for patient metadata as available in CheXpert Plus.

Alongside the de-identified reports and images, CheXpert Plus includes de-identified demographic data. Each study is linked to the corresponding demographic data using a privacy-preserving identifier created during the de-identification process. While a patient’s insurance may have varied over time, the data contains their insurance status as of February 2024.

As shown in Figure 1, CheXpert Plus provides patient age, sex, race, ethnicity, insurance type, BMI, deceased status, and the need for an interpreter. Collectively, this demographic data allows researchers to account for subgroups of patients and improve the training of ML models relying on CheXpert Plus data sources, enhancing their fairness and robustness.

Table 3: 14 pathology labels as present in the CheXpert Plus dataset. Total counts and proportions for each pathology are reported for the training set, counting each image of the dataset per CheXpert label.
Pathology Positive (%) Uncertain (%) Negative (%) Not Mentioned (%)
Atelectasis 33,385 (14.94) 33,725 (15.09) 1,326 (0.59) 155,026 (69.37)
Cardiomegaly 26,996 (12.08) 8,095 (3.62) 11,126 (4.98) 177,245 (79.32)
Consolidation 14,790 (6.62) 27,727 (12.41) 28,116 (12.58) 152,829 (68.39)
Edema 52,245 (23.38) 12,984 (5.81) 20,735 (9.28) 137,498 (61.53)
Enlarged Cardiomediastinum 10,789 (4.83) 12,403 (5.55) 21,656 (9.69) 178,614 (79.93)
Fracture 9,049 (4.05) 644 (0.29) 2,512 (1.12) 211,257 (94.54)
Lung Lesion 9,193 (4.11) 1,486 (0.66) 1,271 (0.57) 211,512 (94.65)
Lung Opacity 105,567 (47.24) 5,602 (2.51) 6,606 (2.96) 105,687 (47.30)
No Finding 22,407 (10.03) 0 (0.00) 0 (0.00) 201,055 (89.97)
Pleural Effusion 86,174 (38.56) 11,629 (5.20) 35,425 (15.85) 90,234 (40.38)
Pleural Other 3,521 (1.58) 2,654 (1.19) 315 (0.14) 216,972 (97.10)
Pneumonia 6,042 (2.70) 18,771 (8.40) 2,806 (1.26) 195,843 (87.64)
Pneumothorax 19,453 (8.71) 3,141 (1.41) 56,347 (25.22) 144,521 (64.67)
Support Devices 116,004 (51.91) 1,079 (0.48) 6,136 (2.75) 100,243 (44.86)

3.5 Pathology Labels

CheXpert Plus lists 14 pathology labels, generated from the radiology reports using model-based extraction methods. Table 3 presents the prevalence of each pathology based on the CheXpert labeler Irvin et al. (2019). For each disease, the label can be positive (1), uncertain (-1), negative (0) or not mentioned.

In addition to these labels, CheXpert Plus also includes labels generated from CheXbert Smit et al. (2020). Due to CheXbert’s maximum input token size of 512 tokens, different parts of the report are used as input: the full report, the findings, the impression, as well as concatenations (findings - impression) and (impression - findings). When an input exceeds 512 tokens, the excess tokens are not used by the labeler.
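The five labeler inputs and the truncation rule described above can be sketched as follows. This is an illustration under stated assumptions: whitespace splitting stands in for the labeler's actual subword tokenizer, joining sections with a single space is a guess, and the function names are hypothetical.

```python
MAX_TOKENS = 512  # maximum labeler input size stated in the text


def build_inputs(full_report, findings, impression):
    """Return the five input variants fed to the labeler."""
    return {
        "full_report": full_report,
        "findings": findings,
        "impression": impression,
        "findings_impression": f"{findings} {impression}".strip(),
        "impression_findings": f"{impression} {findings}".strip(),
    }


def truncate(text, tokenize=str.split, max_tokens=MAX_TOKENS):
    """Keep the first `max_tokens` tokens; the excess tokens are dropped."""
    return tokenize(text)[:max_tokens]


variants = build_inputs(
    full_report="FINDINGS: Clear lungs. IMPRESSION: No acute disease.",
    findings="Clear lungs.",
    impression="No acute disease.",
)
for name, text in variants.items():
    print(name, len(truncate(text)))
```

The order of concatenation matters once the limit is reached: with (findings - impression), a long findings section can push the impression out of the window, while (impression - findings) always keeps the impression.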

To assess the quality of each set of labels, these labels are compared against two human-annotated test sets. The first test set comprises 1000 samples annotated by 2 board-certified radiologists based on the radiology images only, with disagreement resolution through consensus. The second test set includes 500 samples annotated by 8 board-certified radiologists based on the radiology reports only, with a majority vote of 5 radiologists. In order to have meaningful comparisons, the test sets are restricted to instances where both findings and impression are available, leading to 154 and 339 labeled samples respectively. In order to simplify this multi-label classification problem, Not Mentioned labels are mapped to Negative and Uncertain labels to Positive. The results of this analysis are detailed in Table 4.
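The label simplification and the macro-averaged metrics can be sketched as follows. This is a minimal illustration and not the evaluation script behind Table 4: representing Not Mentioned as `None`, the function names, and the toy labels are all assumptions.

```python
def binarize(label):
    """Positive (1) and Uncertain (-1) become 1; Negative (0) and Not Mentioned (None) become 0."""
    return 1 if label in (1, -1) else 0


def precision_recall_f1(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_average(reference, predicted):
    """Average precision, recall and F1 over pathologies.

    Both arguments map a pathology name to a list of raw labels.
    """
    scores = []
    for pathology in reference:
        y_true = [binarize(label) for label in reference[pathology]]
        y_pred = [binarize(label) for label in predicted[pathology]]
        scores.append(precision_recall_f1(y_true, y_pred))
    return tuple(sum(score[i] for score in scores) / len(scores) for i in range(3))


reference = {"Edema": [1, 0, None, -1], "Pneumonia": [0, 0, 1, 1]}
predicted = {"Edema": [1, 1, 0, 0], "Pneumonia": [0, None, 1, -1]}
precision, recall, f1 = macro_average(reference, predicted)
print(f"F1={f1:.2f} (Pr.={precision:.2f}, R.={recall:.2f})")
```

Because each metric is averaged separately over pathologies, the macro-averaged F1 does not have to lie between the macro-averaged precision and recall.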

Table 4: Comparison of labelers on human-annotated sets, based on image and text respectively. The metrics reported are macro-averaged across the 14 pathology labels.
Labeler  Image Test (S. 154): F1 (Pr., R.)  Text Test (S. 339): F1 (Pr., R.)
CheXpert  0.35 (0.40, 0.52)  0.92 (0.91, 0.95)
CheXbert Full Report  0.41 (0.41, 0.64)  0.74 (0.67, 0.90)
CheXbert Findings  0.44 (0.44, 0.59)  0.65 (0.65, 0.72)
CheXbert Impression  0.34 (0.39, 0.50)  0.93 (0.92, 0.94)
CheXbert Findings + Impression  0.42 (0.41, 0.64)  0.77 (0.70, 0.93)
CheXbert Impression + Findings  0.42 (0.41, 0.64)  0.77 (0.69, 0.93)

For prediction tasks based on the reports, we recommend using the CheXbert-impression labels, which achieve the highest F1-score (0.93) when evaluated against the text-based test set. In contrast, for prediction tasks based on the images, we suggest using the CheXbert-findings labels, which achieve an F1-score of 0.44 on the image-based test set. The performance gap compared to the text-based test set indicates that the radiology reports themselves might not encapsulate all the data needed to accurately classify the presence or absence of a pathology, and that labels generated synthetically from radiology reports might not be reliable for training image pathology classifiers using supervised learning. We encourage research aiming at improving the synthetic generation of pathology labels based on text.

3.6 RadGraph Annotations

Table 5: RadGraph annotations generated for the Findings and Impression section of the reports.
Category Findings (%) Impression (%)
Anatomy 740,453 (46.81) 1,730,617 (43.27)
Observation Present 685,006 (43.30) 1,762,298 (44.06)
Observation Uncertain 58,772 (3.72) 206,959 (5.17)
Observation Absent 97,234 (6.15) 298,811 (7.47)
Total Entities 1,581,863 (100) 3,999,559 (100)
Modify 658,095 (59.47) 1,640,192 (59.16)
Located at 400,132 (36.16) 977,785 (35.27)
Suggestive of 48,432 (4.38) 154,360 (5.57)
Total Relations 1,106,659 (100) 2,772,337 (100)

Table 5 summarizes the RadGraph Jain et al. (2021) annotations generated on our dataset and released as part of CheXpert Plus. These annotations cover the Findings and Impression sections of the radiology reports. The most frequent entity annotations are "Anatomy" (46.81% in Findings and 43.27% in Impression) and "Observation: Definitely Present" (43.30% in Findings and 44.06% in Impression). The categories "Observation: Uncertain" and "Observation: Definitely Absent" are less common, with frequencies ranging from 3.72% to 7.47%. Among relation annotations, "Modify" (59.47% in Findings and 59.16% in Impression) is more prevalent than "Located at" and "Suggestive of". Entity annotations amount to 1,581,863 in Findings and 3,999,559 in Impression, while relation annotations total 1,106,659 in Findings and 2,772,337 in Impression.
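Counts and proportions of this kind can be derived from per-report annotations along the following lines. This is a hedged sketch: the dictionary layout and the label strings (`ANAT-DP`, `OBS-DP`, `OBS-U`, `OBS-DA`, `modify`, `located_at`, `suggestive_of`) are assumed to match the RadGraph output format and should be checked against the released files, and the function names are hypothetical.

```python
from collections import Counter


def tally(annotation):
    """Count entity labels and relation types in one annotated section."""
    entity_counts = Counter()
    relation_counts = Counter()
    for entity in annotation["entities"].values():
        entity_counts[entity["label"]] += 1
        for relation_type, _target_id in entity["relations"]:
            relation_counts[relation_type] += 1
    return entity_counts, relation_counts


def proportions(counts):
    """Convert raw counts to percentages rounded to two decimals."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100 * count / total, 2) for name, count in counts.items()}


# Toy annotation for "Small pleural effusion. No pneumothorax."
annotation = {
    "entities": {
        "1": {"tokens": "Small", "label": "OBS-DP", "relations": [["modify", "3"]]},
        "2": {"tokens": "pleural", "label": "ANAT-DP", "relations": []},
        "3": {"tokens": "effusion", "label": "OBS-DP", "relations": [["located_at", "2"]]},
        "4": {"tokens": "pneumothorax", "label": "OBS-DA", "relations": []},
    }
}
entities, relations = tally(annotation)
print(proportions(entities))
print(proportions(relations))
```

Summing the counters over all reports of a section, rather than a single annotation, yields dataset-level statistics in the style of Table 5.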

3.7 Model Zoo

As part of CheXpert Plus, we are releasing pretrained models trained on CheXpert Plus data, incorporating recent developments in machine learning spanning natural language processing, image recognition, and generative modeling. Among these releases is a pretrained LLaMA Touvron et al. (2023) model, which generates human-like text and excels at complex language tasks. A pretrained CLIP Radford et al. (2021) model is also introduced, which improves the way visual concepts are learned from text descriptions, thereby enhancing image search and classification capabilities. The lineup also includes a pretrained VQ-GAN Esser et al. (2021) model, blending VQ-VAE and GAN technologies to produce realistic images and demonstrating generative power. The release also includes a pretrained DINOv2 Oquab et al. (2023) model that employs self-supervised learning with Vision Transformers to achieve robust visual representations. Finally, several architecture-agnostic models are included to deliver competitive performance in radiology report generation (RRG) and radiology report summarization (RRS).
These models are available at the following URL: https://github.com/Stanford-AIMI/chexpert-plus.

4 Dataset Release and Analysis

In this section, we explore how the release of the CheXpert Plus dataset was made possible, and analyze its uses and possible future work.

We begin with a comparison between CheXpert Plus  and the original CheXpert 1.0 dataset in Section 4.1, as well as other existing datasets in Section 4.2. Then, we take an in-depth look at the de-identification procedures applied to both the radiology images and reports, which are explained in Section 4.3 and Section 4.4, respectively. Finally, we discuss the applicability of this dataset in Section 4.5 and the limitations in Section 4.6.

4.1 Comparison with CheXpert 1.0

Building upon the foundational CheXpert dataset introduced in 2019, which comprised 224,316 chest radiographs from 65,240 patients annotated for 14 observations, our enhanced version, CheXpert Plus, introduces several significant advancements aimed at pushing the boundaries of medical imaging research. By transitioning to DICOM, the gold standard in medical image formats, we provide images alongside a subset of their original DICOM headers, thereby offering richer image metadata and superior image quality. CheXpert Plus further enriches the dataset by including the corresponding radiology reports, which we have parsed into sections such as medical history, findings, and impressions, enhancing the dataset’s utility for comprehensive analysis. Moreover, by incorporating detailed patient demographic data, CheXpert Plus facilitates the development of fairness-focused analyses and multimodal models capable of utilizing this information for more informed and nuanced diagnoses.

In addition, CheXpert Plus also focuses on releasing higher quality extracted labels, be it for the 14 lung diseases thanks to the CheXbert annotation tool, or for the more recent RadGraph annotations, allowing the training of classifiers directly on top of CheXpert Plus or for CheXpert- and RadGraph-based metrics to be computed.

These enhancements not only deepen the analytical potential within the medical imaging field but also significantly contribute to the evolving medical AI community, particularly as it strides towards leveraging multimodal and image-text learning paradigms, promising to enhance diagnostic accuracy and patient outcomes.

Finally, we mention that 186 studies are missing compared to the original CheXpert release, as the corresponding reports could not be recovered when preparing the CheXpert Plus release.

4.2 Head-to-head Comparison with other Radiology Datasets

While numerous chest X-ray datasets exist (Table 1), few compare to CheXpert Plus in terms of scale and modality inclusiveness. Among these, MIMIC, OpenI, and PadChest also provide radiology reports, yet PadChest’s utilization of Spanish and OpenI’s smaller scale differentiate CheXpert Plus significantly. Additionally, besides MIMIC and OpenI, only CandidPTX and VinDR offer X-rays in DICOM format, and neither matches CheXpert Plus in the sheer volume of studies.

MIMIC, the closest equivalent, features 377,095 chest X-rays with accompanying reports and includes both DICOM images and patient demographic information. CheXpert Plus surpasses MIMIC in textual depth, containing 36 million text tokens compared to MIMIC’s 34 million. Although MIMIC features a higher number of reports (227,821 unique reports compared to CheXpert Plus’s 187,711 unique reports), these reports are shorter, especially in their impression section: MIMIC impressions count 7,986,317 total tokens, versus the 13,351,758 tokens of CheXpert Plus’s impressions. For any text-image radiology task where the text being handled is restricted to the impression section, CheXpert Plus is therefore also the largest dataset available. Finally, our reports underwent a de-identification process that did not alter their structure, even including de-identified accession number sections, something that MIMIC does not offer and that makes our reports closer to the natural textual data handled during radiology exams.

Given these distinctions and advancements, CheXpert Plus represents a valuable addition to the existing landscape of medical imaging datasets, poised to significantly enhance and expand the capabilities of research in medical AI.

4.3 De-identification and release of DICOMs

The release of DICOMs required the de-identification of all DICOM metadata headers, as well as their pixel content, as displayed in Figure 2. The metadata headers were automatically de-identified and reviewed by humans to confirm that no PHI was present (corresponding to steps 5 and 6 of Figure 2). In addition, we guarantee that the pixels in the released DICOMs are identical to those in the original CheXpert 1.0 images, which were themselves cleared for public release by hiding any PHI content (steps 7 and 8). The code used to accomplish this is available in Appendix A.

4.4 De-identification and release of reports

Figure 2: The de-identification of images and reports from CheXpert Plus is an 8-step process as described in Section 4.4.

As part of the CheXpert Plus release, the reports associated with the CheXpert images underwent a de-identification process that lasted for a year and was supported by 25-30 human annotators. We counted a posteriori the presence of 853,878 total PHI spans, as defined in the Health Insurance Portability and Accountability Act of 1996 (HIPAA). Details of the types of PHI spotted in these reports are further displayed in Table 6.

Data Category Count
Dates 538,160
Age numbers 5,206
Unique identifiers 233,244
Healthcare worker names 57,274
Vendor names 9,110
Phone numbers 10,324
Hospital names 443
Patient names 10,324
Table 6: Data Summary

To the best of our knowledge, this is the largest text de-identification effort in terms of quantity of PHI reviewed. As displayed in Figure 2, these reports underwent a process in 4 steps in order to be considered de-identified and ready for public release.

Step 1 The reports were first automatically de-identified using a two-step model  Chambon et al. (2022b) that first leverages a transformer model for token-level classification into PHI categories, before replacing the true PHI spans with synthetic PHI as a "Hide in plain sight" approach  Carrell et al. (2013). This latter approach adds an additional safety factor to ensure any missed PHI, if any, cannot be easily spotted.

Step 2 Human annotators reviewed each report with the synthetic PHI highlighted by the automatic de-identifier. The human reviewers identified any PHI that was missed by the algorithm. A missed PHI can be a full span belonging to a certain PHI category, or any prefix or suffix of a PHI span partially detected by the model. Any missed PHI was highlighted by the annotators for further human review. Out of the 853,878 true PHI spans in all reports, 23 were fully missed by the model (0.003%), and 841 were partially missed by the model (0.1%).

Step 3 Any partially or fully missed PHI then underwent the same "Hide In Plain Sight" step to replace the remaining true PHI spans by synthetic PHI spans.

Step 4 A mapping of each pair of true and synthetic PHI spans was generated and reviewed for PHI by a board-certified radiologist. This combination of automated processes and human review confirmed that all true PHI spans had been replaced by synthetic PHI spans.
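The replacement of detected PHI spans by synthetic surrogates, together with the mapping reviewed in Step 4, can be sketched as follows. This is only an illustration of the "Hide in plain sight" idea and not the released pipeline: the detected spans are supplied by hand instead of coming from a transformer model, and the surrogate lists, category names, and function name are hypothetical.

```python
import random

# Hypothetical surrogate pools; a real system would generate realistic values
# per PHI category and avoid reusing the original value.
SURROGATES = {
    "NAME": ["Nguyen", "Garcia", "Okafor"],
    "DATE": ["03/14/2018", "11/30/2021"],
}


def hide_in_plain_sight(text, spans, rng):
    """Replace each (start, end, category) span with a synthetic surrogate.

    Spans are assumed not to overlap. Returns the new text and the list of
    (true PHI, synthetic PHI) pairs kept for later review.
    """
    pieces, mapping, cursor = [], [], 0
    for start, end, category in sorted(spans):
        surrogate = rng.choice(SURROGATES[category])
        pieces.append(text[cursor:start])
        pieces.append(surrogate)
        mapping.append((text[start:end], surrogate))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


report = "Seen by Dr. Smith on 01/02/2020."
name_start = report.index("Smith")
date_start = report.index("01/02/2020")
spans = [
    (name_start, name_start + len("Smith"), "NAME"),
    (date_start, date_start + len("01/02/2020"), "DATE"),
]
new_report, mapping = hide_in_plain_sight(report, spans, random.Random(0))
print(new_report)
print(mapping)
```

Because the surrogates look like genuine PHI, a true span missed by the detector is hard to tell apart from the synthetic ones, which is the additional safety factor mentioned in Step 1.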

4.5 Applicability, Usage and Future Work

The CheXpert Plus dataset is released on the Stanford AIMI Shared Dataset website and can be accessed at https://stanfordaimi.azurewebsites.net/datasets/5158c524-d3ab-4e02-96e9-6ee9efc110a1. It is associated with a Stanford University Dataset Research Use Agreement, which specifies that CheXpert Plus may not be used for any commercial purposes and is only available for research uses. In particular, you may not distribute, publish or reproduce a copy of this dataset.

As part of the CheXpert Plus release, we highlight the following main uses of the dataset:

Performance: Due to its extensive size, CheXpert Plus almost doubles the amount of publicly available English text-image pairs in radiology. Therefore, any model can leverage this dataset to improve in performance compared to models trained before this data release.

Robustness: Along with MIMIC, CheXpert Plus is the only large-scale English text-image dataset in radiology. It therefore makes it possible not only to test models on multiple institutions but also to perform cross-institution training, hopefully leading to more robust performance when evaluating on completely new institutions.

Fairness: The inclusion of extensive amounts of patient basic demographic data in CheXpert Plus enables downstream applications to account for the imbalance of patient ethnicity, sex, age or socio-economic background, therefore potentially limiting the bias of models trained for various radiology tasks.

Finally, we stress that any model trained with the help of CheXpert Plus data may still reflect biases based on patient characteristics and pathologies, among others. When using such models, researchers should always look for sources of potential distribution shifts and audit for performance disparities based on attributes such as race, ethnicity, age or socio-economic background.
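An audit for performance disparities can start from a per-subgroup metric, as in the following sketch. It is a minimal example under stated assumptions: the record keys (`sex`, `label`, `prediction`), the choice of recall as the audited metric, and the function names are illustrative and not part of the release.

```python
from collections import defaultdict


def recall_by_group(records, attribute):
    """Compute recall separately for each value of a demographic attribute."""
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        if record["label"] == 1:
            group = record[attribute]
            positives[group] += 1
            true_positives[group] += int(record["prediction"] == 1)
    return {group: true_positives[group] / positives[group] for group in positives}


def largest_gap(scores):
    """Difference between the best and worst subgroup scores."""
    return max(scores.values()) - min(scores.values())


records = [
    {"sex": "F", "label": 1, "prediction": 1},
    {"sex": "F", "label": 1, "prediction": 0},
    {"sex": "M", "label": 1, "prediction": 1},
    {"sex": "M", "label": 0, "prediction": 1},
]
scores = recall_by_group(records, "sex")
print(scores, largest_gap(scores))
```

The same function can be called with any other demographic attribute. Subgroups with few positive cases give noisy estimates, so gaps should be read together with subgroup sizes.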

4.6 Limitations of the dataset

First, as described in Section 3.3, the focus was put on collecting reports with detailed impression sections, at the cost of having fewer findings sections. As a result, impressions account for more than twice as many total tokens as findings (Table 2). Second, some pathologies such as fracture, lung lesion and pleural other are significantly under-represented, as displayed in Section 3.5, which limits the performance of any models aiming at studying these particular pathologies.

5 Conclusion

In our work, we introduced CheXpert Plus, a multi-sourced dataset comprising hundreds of thousands of images paired with texts, patient demographics, and computed pathology labels and RadGraph annotations. First, we release high-quality images in DICOM format along with DICOM metadata encapsulating image processing information, as well as PNG images. Second, we distribute the corresponding reports after a careful de-identification process and pre-parse these reports into their corresponding subsections for ease of use in various downstream tasks. Third, we provide patient metadata detailing clinical and socio-economic conditions such as sex, ethnicity, and medical insurance information, for better analysis of distribution shifts and the reduction of biases. Fourth, we release improved pathology labels to be used directly for classification or evaluation tasks. Fifth, we pre-compute RadGraph annotations for both findings and impressions, making them directly usable in existing pipelines. Finally, we release a set of models for the main radiology downstream tasks, spanning from text-to-image generation to text-to-text summarization. We hope this substantial data release helps foster the development of AI models in radiology that display improved performance, robustness, and fairness and ultimately improve patient medical care.

6 Acknowledgements

This work was supported in part by MIDRC (The Medical Imaging and Data Resource Center), funded by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health under contract 75N92020D00021.
This work was made possible by Stephanie Bogdan, who helped gather the dataset prior to its de-identification and cleaning.
We would also like to thank Bui Duc Thai Tan and Duong Thi Hong Hanh for their help with the de-identification and release of the CheXpert Plus reports.

References

  • Bannur et al. (2023) Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Perez-Garcia, Maximilian Ilse, Daniel C Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, et al. 2023. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15016–15027.
  • Bustos et al. (2020) Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria De La Iglesia-Vaya. 2020. Padchest: A large chest x-ray image dataset with multi-label annotated reports. Medical image analysis, 66:101797.
  • Carrell et al. (2013) David Carrell, Bradley Malin, John Aberdeen, Samuel Bayer, Cheryl Clark, Ben Wellner, and Lynette Hirschman. 2013. Hiding in plain sight: Use of realistic surrogates to reduce exposure of protected health information in clinical text.
  • Chambon et al. (2022a) Pierre Chambon, Christian Bluethgen, Jean-Benoit Delbrouck, Rogier Van der Sluijs, Małgorzata Połacin, Juan Manuel Zambrano Chaves, Tanishq Mathew Abraham, Shivanshu Purohit, Curtis P Langlotz, and Akshay Chaudhari. 2022a. Roentgen: Vision-language foundation model for chest x-ray generation. arXiv preprint arXiv:2211.12737.
  • Chambon et al. (2022b) Pierre J Chambon, Christopher Wu, Jackson M Steinkamp, Jason Adleberg, Tessa S Cook, and Curtis P Langlotz. 2022b. Automated deidentification of radiology reports combining transformer and “hide in plain sight” rule-based methods. Journal of the American Medical Informatics Association, 30(2):318–328.
  • Chaves et al. (2024) Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, et al. 2024. Training small multimodal models to bridge biomedical competency gap: A case study in radiology imaging. arXiv preprint arXiv:2403.08002.
  • Chen et al. (2020) Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. 2020. Generating radiology reports via memory-driven transformer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1439–1449.
  • Chen et al. (2024) Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, et al. 2024. Chexagent: Towards a foundation model for chest x-ray interpretation. arXiv preprint arXiv:2401.12208.
  • Delbrouck et al. (2022) Jean-Benoit Delbrouck, Pierre Chambon, Christian Bluethgen, Emily Tsai, Omar Almusa, and Curtis Langlotz. 2022. Improving the factual correctness of radiology report generation with semantic rewards. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4348–4360.
  • Demner-Fushman et al. (2012) Dina Demner-Fushman, Sameer Antani, Matthew Simpson, and George R Thoma. 2012. Design and development of a multimodal biomedical information retrieval system. Journal of Computing Science and Engineering, 6(2):168–177.
  • Esser et al. (2021) Patrick Esser, Robin Rombach, and Bjorn Ommer. 2021. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883.
  • Feng et al. (2021a) Sijing Feng, Damian Azzollini, Ji Soo Kim, Cheng-Kai Jin, Simon P Gordon, Jason Yeoh, Eve Kim, Mina Han, Andrew Lee, Aakash Patel, et al. 2021a. Curation of the candid-ptx dataset with free-text reports. Radiology: Artificial Intelligence, 3(6):e210136.
  • Feng et al. (2021b) Sijing Feng, Damian Azzollini, Ji Soo Kim, Cheng Kai Jin, Eve Kim, Simon Gordon, Jason Yeoh, Min A Han, Andrew Lee, Aakash Patel, Martin Urschler, Amy Fong, Cameron Simmers, Gregory Tarr, Stuart Barnard, and Ben Wilson. 2021b. CANDID-PTX. Radiology: Artificial Intelligence.
  • (14) JF Healthcare. Object-cxr - automatic detection of foreign objects on chest x-rays. https://web.archive.org/web/20201127235812/https://jfhealthcare.github.io/object-CXR/.
  • Huang et al. (2023) Shih-Cheng Huang, Anuj Pareek, Malte Jensen, Matthew P Lungren, Serena Yeung, and Akshay S Chaudhari. 2023. Self-supervised learning for medical image classification: a systematic review and implementation guidelines. NPJ Digital Medicine, 6(1):74.
  • Huang et al. (2021) Shih-Cheng Huang, Liyue Shen, Matthew P Lungren, and Serena Yeung. 2021. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3942–3951.
  • Hyland et al. (2023) Stephanie L Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C Castro, Mercy Ranjit, Anton Schwaighofer, Fernando Pérez-García, Valentina Salvatelli, Shaury Srivastav, Anja Thieme, et al. 2023. Maira-1: A specialised large multimodal model for radiology report generation. arXiv preprint arXiv:2311.13668.
  • Irvin et al. (2019) Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. 2019. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 590–597.
  • Jain et al. (2021) Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven QH Truong, Du Nguyen Duong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P. Lungren, Andrew Y. Ng, Curtis P. Langlotz, and Pranav Rajpurkar. 2021. Radgraph: Extracting clinical entities and relations from radiology reports.
  • Johnson et al. (2019) Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. 2019. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317.
  • Moor et al. (2023) Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. 2023. Med-flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pages 353–367. PMLR.
  • Nguyen et al. (2020) H Nguyen, HH Pham, NT Nguyen, DB Nguyen, M Dao, V Vu, K Lam, and LT Le. 2020. Vinbigdata chest x-ray abnormalities detection. Kaggle Competition. https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection.
  • Oquab et al. (2023) Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
  • Reis et al. (2022) Eduardo P Reis, Joselisa PQ de Paiva, Maria CB da Silva, Guilherme AS Ribeiro, Victor F Paiva, Lucas Bulgarelli, Henrique MH Lee, Paulo V Santos, Vanessa M Brito, Lucas TW Amaral, et al. 2022. Brax, brazilian labeled chest x-ray dataset. Scientific Data, 9(1):487.
  • Smit et al. (2020) Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y. Ng, and Matthew P. Lungren. 2020. Chexbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using bert.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Tu et al. (2024) Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. 2024. Towards generalist biomedical ai. NEJM AI, 1(3):AIoa2300138.
  • Varma et al. (2023) Maya Varma, Jean-Benoit Delbrouck, Sarah Hooper, Akshay Chaudhari, and Curtis Langlotz. 2023. Villa: Fine-grained vision-language representation learning from real-world data. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Vayá et al. (2020) Maria De La Iglesia Vayá, Jose Manuel Saborit, Joaquim Angel Montell, Antonio Pertusa, Aurelia Bustos, Miguel Cazorla, Joaquin Galant, Xavier Barber, Domingo Orozco-Beltrán, Francisco García-García, et al. 2020. Bimcv covid-19+: a large annotated dataset of rx and ct images from covid-19 patients. arXiv preprint arXiv:2006.01174.
  • Wang et al. (2017) Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. 2017. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2097–2106.
  • Wu et al. (2023) Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2023. Towards generalist foundation model for radiology. arXiv preprint arXiv:2308.02463.
  • Zhang et al. (2022) Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. 2022. Contrastive learning of medical visual representations from paired images and text. In Machine Learning for Healthcare Conference, pages 2–25. PMLR.

Appendix A From CheXpert Plus DICOM to CheXpert 1.0 JPG

Listing 1: Python code to convert CheXpert Plus DICOM to CheXpert 1.0 JPG.

import pydicom
import cv2
import os

def convert_dicom_to_images(input_file_path, jpg_filename):
    """
    This function converts a DICOM file to a JPEG image.

    Args:
        input_file_path (str): The path to the input DICOM file.
        jpg_filename (str): The filename (including the path) to save the JPEG image.

    Returns:
        None
    """
    # Read the DICOM file
    dcm_file = pydicom.dcmread(input_file_path)

    # Rescale the pixel array to the range [0, 255]
    rescaled_image = cv2.convertScaleAbs(dcm_file.pixel_array, alpha=(255.0 / dcm_file.pixel_array.max()))

    # If the PhotometricInterpretation is "MONOCHROME1", invert the pixel values
    if dcm_file.PhotometricInterpretation == "MONOCHROME1":
        rescaled_image = cv2.bitwise_not(rescaled_image)

    # Apply histogram equalization to enhance the contrast
    adjusted_image = cv2.equalizeHist(rescaled_image)

    # Save the adjusted image in JPG format using the specified output file path
    cv2.imwrite(jpg_filename, adjusted_image)


# Define your paths and filenames
dcm_path = 'path/to/filename.dcm'
jpg_path = 'path/to/filename.jpg'

# Check if the DICOM file exists
if os.path.isfile(dcm_path):
    # Convert the DICOM file to JPEG
    convert_dicom_to_images(dcm_path, jpg_path)
else:
    print(f"The DICOM file '{dcm_path}' does not exist.")
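
To make the first two pixel operations of Listing 1 explicit, the following is a minimal NumPy sketch of the rescaling and MONOCHROME1 inversion steps. It is an illustration rather than the released conversion code: the function names `rescale_to_uint8` and `invert_if_monochrome1` are hypothetical, the guard for an all-zero image is an added assumption that Listing 1 does not contain, and the output is only expected to agree with the OpenCV calls up to rounding. Histogram equalization is deliberately left out.

```python
import numpy as np


def rescale_to_uint8(pixels):
    """Linearly map the range [0, max] to [0, 255], as in the rescaling step of Listing 1."""
    pixels = np.asarray(pixels, dtype=np.float64)
    peak = pixels.max()
    if peak <= 0:
        # Assumption (not in Listing 1): return a black image instead of dividing by zero.
        return np.zeros(pixels.shape, dtype=np.uint8)
    # Scale, take the absolute value, round, and saturate to the 8-bit range.
    scaled = np.rint(np.abs(pixels * (255.0 / peak)))
    return np.clip(scaled, 0, 255).astype(np.uint8)


def invert_if_monochrome1(image_u8, photometric_interpretation):
    """Invert an 8-bit image when low stored values are meant to be displayed as bright."""
    if photometric_interpretation == "MONOCHROME1":
        # For 8-bit data, a bitwise NOT is the same as 255 - value.
        return 255 - image_u8
    return image_u8


# Toy 12-bit pixel values standing in for dcm_file.pixel_array.
pixels = np.array([[0, 2048], [4095, 1024]], dtype=np.uint16)
image = rescale_to_uint8(pixels)
inverted = invert_if_monochrome1(image, "MONOCHROME1")
```

The inversion matters because MONOCHROME1 files store the minimum value as white, so skipping it would yield JPG images with reversed contrast relative to MONOCHROME2 files.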

Appendix B DICOM metadata

DICOM metadata
PixelData
BitsAllocated
Rows
Columns
SamplesPerPixel
PhotometricInterpretation
PixelRepresentation
BitsStored
ImagePositionPatient
PixelSpacing
RescaleIntercept
RescaleSlope
WindowCenter
WindowWidth
Manufacturer
SliceThickness
ImageOrientationPatient
VOILUTFunction
VOILUTSequence
PresentationLUTShape
LUTExplanation
Exposure
ExposureControlMode
ExposureControlModeDescription
ExposureInuAs
RelativeXRayExposure
ExposuresOnPlate
ExposureIndex
TargetExposureIndex
ExposureTimeInuS
ExposuresOnDetectorSinceLastCalibration
DetectorTimeSinceLastExposure
TotalNumberOfExposures
ExposureStatus
ExposureTime
ExposureInmAs
ExposureModulationType
KVP
Laterality
ImageLaterality
RescaleType
XRayTubeCurrent
XRayTubeCurrentInuA
ConvolutionKernel
ViewPosition
BodyPartExamined
BurnedInAnnotation
Table 7: Combination of metadata contained in CheXpert Plus DICOM files.
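
As an illustration of how the attributes in Table 7 might be accessed, the sketch below collects a handful of them from a DICOM file with pydicom. It is not part of the release: the helper names `extract_metadata` and `read_metadata` and the selection in `IMAGE_KEYWORDS` are hypothetical. Because the table lists the combination of attributes found across files, the sketch assumes that any given file may lack some of them and returns `None` in that case.

```python
from types import SimpleNamespace

# A small, arbitrary subset of the attribute keywords listed in Table 7.
IMAGE_KEYWORDS = (
    "Rows",
    "Columns",
    "BitsStored",
    "PhotometricInterpretation",
    "ViewPosition",
    "KVP",
)


def extract_metadata(dataset, keywords=IMAGE_KEYWORDS):
    """Return {keyword: value}, using None for attributes the dataset lacks."""
    return {keyword: getattr(dataset, keyword, None) for keyword in keywords}


def read_metadata(dcm_path, keywords=IMAGE_KEYWORDS):
    """Read a DICOM header (skipping the pixel data) and extract the keywords."""
    import pydicom  # imported here so extract_metadata works without pydicom

    dataset = pydicom.dcmread(dcm_path, stop_before_pixels=True)
    return extract_metadata(dataset, keywords)


# Stand-in object with made-up values, used here in place of a real DICOM file.
example = SimpleNamespace(Rows=2048, Columns=2500, PhotometricInterpretation="MONOCHROME1")
metadata = extract_metadata(example)
```

With a real file, `read_metadata('path/to/filename.dcm')` would return the same kind of dictionary; skipping the pixel data keeps header-only scans over many files fast.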