
Large Language Models for Integrating Social Determinant of Health Data: A Case Study on Heart Failure 30-Day Readmission Prediction

Abstract

Social determinants of health (SDOH) — the myriad of circumstances in which people live, grow, and age — play an important role in health outcomes. However, existing outcome prediction models often only use proxies of SDOH as features. Recent open data initiatives present an opportunity to construct a more comprehensive view of SDOH, but manually integrating the most relevant data for individual patients becomes increasingly challenging as the volume and diversity of public SDOH data grows. Large language models (LLMs) have shown promise at automatically annotating structured data. Here, we conduct an end-to-end case study evaluating the feasibility of using LLMs to integrate SDOH data, and the utility of these SDOH features for clinical prediction. We first manually label 700+ variables from two publicly-accessible SDOH data sources to one of five semantic SDOH categories. Then, we benchmark performance of 9 open-source LLMs on this classification task. Finally, we train ML models to predict 30-day hospital readmission among 39k heart failure (HF) patients, and we compare the prediction performance of the categorized SDOH variables with standard clinical variables. Additionally, we investigate the impact of few-shot LLM prompting on LLM annotation performance, and perform a metadata ablation study on prompts to evaluate which information helps LLMs accurately annotate these variables. We find that some open-source LLMs can effectively, accurately annotate SDOH variables with zero-shot prompting without the need for fine-tuning. Crucially, when combined with standard clinical features, the LLM-annotated Neighborhood and Built Environment subset of the SDOH variables shows the best performance predicting 30-day readmission of HF patients.

Chase Fensore¹, Rodrigo M. Carrillo-Larco¹, Shivani A. Patel¹, Alanna A. Morris¹ and Joyce C. Ho¹

¹Emory University

1 Introduction

In recent years, mounting evidence has emerged that social determinants of health (SDOH) are stronger markers of health outcomes than one’s genetics (Schroeder Steven A., 2007). SDOH — the underlying conditions in which individuals live, grow, and age — impacts a diverse range of outcomes including chronic conditions, preventable hospitalizations, and mortality (National Academies of Sciences and Medicine Division and Board on Population Health and Public Health Practice and Committee on Informing the Selection of Leading Health Indicators for Healthy People and 2030, 2020; Office of Disease Prevention and Health Promotion, ). Several frameworks have been proposed to measure and describe SDOH (Hinnant et al., 2022). In general, SDOH measures are grouped into domains or categories. The widely used framework proposed by the U.S. Department of Health and Human Services (Office of Disease Prevention and Health Promotion ) identifies 5 mutually exclusive SDOH domains: (1) social and community context, (2) economic stability, (3) education access and quality, (4) neighborhood and built environment, and (5) healthcare access and quality. Each domain contains varying factors to quantify a patient’s lifelong health, improve the well-being and health outcomes of a population, and achieve health equity. Despite the awareness regarding the detrimental effects of SDOH, existing outcome prediction models often only use SDOH proxies limited to race, health insurance, and census tract poverty, and fail to comprehensively capture SDOH factors across all the domains (Liu et al., 2022).

The increase in open data initiatives presents an opportunity for researchers to paint a more comprehensive picture of patient experiences. Growing collection and aggregation efforts from government agencies and non-profit organizations have resulted in hundreds or thousands of public, neighborhood-level SDOH variables. These efforts have produced a wealth of data with the potential to better inform population health research. There are the low-dimensional SDOH indices like area deprivation index (ADI) (Maroko, 2016), social deprivation index (SDI) (Butler et al., 2013), or social vulnerability index (SVI) (Flanagan et al., 2011). There are high-dimensional SDOH databases with domain annotations including Agency for Healthcare Research and Quality (AHRQ) SDOH Database (SDOHD) (Agency for Healthcare Research and Quality (2022), AHRQ) and Centers for Disease Control and Prevention (CDC) PLACES (Centers for Disease Control and Prevention, ). There are also un-annotated databases that may contain variables relevant to SDOH such as National Neighborhood Data Archive (NaNDA) (Institute for Social Research, ), Inter-university Consortium for Political and Social Research (ICPSR), and Data.gov.

We briefly motivate the potential of SDOH on heart failure (HF) outcomes. HF is a condition caused by an abnormality in the heart, resulting in elevated levels of certain peptides or fluid buildup in the lungs or body (Bozkurt et al., 2021). In the United States, around 6.2 million adults suffered from HF in 2012, costing over $30.7 billion, and this cost is projected to rise to $69.8 billion by 2030 (CDC, 2023; Heidenreich et al., 2013). Hospitalization for HF is a significant indicator of the patient’s prognosis and a major healthcare expense. There is a high risk of mortality after hospitalization, and multiple readmissions for HF can increase the mortality risk by 20-31% (Kommuri et al., 2012). The burden of HF incidence, hospitalization, and mortality disproportionately affects racial and ethnic groups in the U.S. Black patients have a higher rate of HF incidence compared to Hispanic, White, and Chinese Americans (Bahrami et al., 2008). Further, Black patients also have a 2.5 times higher hospitalization rate for HF and higher 30-day and 1-year readmission rates compared to White patients (Rathore et al., 2003; Kommuri et al., 2012). Several SDOH indices have been shown to correlate with higher HF readmission, including SDI (Patel et al., 2020), SVI (Regmi et al., 2021), and various aspects of ADI and SDI (Li et al., 2022). Although these demonstrate the impact of SDOH, the limited number of factors may only provide a partial view of the neighborhood.

During the creation of these SDOH databases, variables can be annotated by domain experts in population health. The variable annotation process is time-consuming, but ultimately makes it easier for researchers to identify relevant variables among the thousands available. For example, AHRQ SDOHD currently includes 1366 census tract SDOH variables from 44 different data sources (Agency for Healthcare Research and Quality (2022), AHRQ). However, existing variables can be deleted or new variables added, with the latter incurring additional annotation tasks. Furthermore, domain assignment is not a straightforward task because each of the 44 datasets within AHRQ SDOHD includes variables that encompass a variety of SDOH domains. This also assumes a single agreed-upon SDOH framework which is not the case (Hinnant et al., 2022). A further complication arises from attempts to incorporate variables from unannotated datasets such as NaNDA and Data.gov. As a result, it is often time-consuming for population health researchers to identify the maximal set of relevant SDOH data for their particular research question. Thus to equip researchers with relevant, comprehensive SDOH data, there is a need for a rapid, flexible approach to annotate variables to determine their relevance to SDOH.

The rise of large language models (LLMs) has revealed surprising emergent behaviors, such as their capacity to perform well on zero-shot tasks. Recently, works have begun to explore the potential to apply LLMs to data management tasks by framing these tasks as text-based prompts. For example, GPT-3 (text-davinci-002) outperformed the previous state-of-the-art deep-learning-based approach on schema matching tasks (Narayan et al., 2022), where the goal was to identify which fields across two tables or databases correspond to each other. Fernandez et al. (2023) provide grounded examples of where LLMs can help automate tedious but easy problems in data management. With respect to public health data integration relevant to SDOH, these findings raise an important question about the role of LLMs in managing tabular SDOH data: can LLMs streamline integration of SDOH for clinical studies by performing accurate, automatic, generalizable annotation of SDOH variables?

To answer this question, we tackle the perspective of a researcher seeking to merge un-annotated, publicly available SDOH with individual patient records for a clinical prediction task. In particular, the researcher must (i) annotate the domain of the SDOH variables to determine if this is something they want to incorporate and (ii) integrate the relevant tabular SDOH data to predict patient outcomes. Without automated LLM-based domain annotation, the researcher would need to manually annotate the domain of these 1000+ SDOH variables. Rather than manually reviewing the variables, the researcher can instead prompt the LLM to automatically assign variables to a domain, and then selectively merge the domain-specific variables with patient data. We explore the practicality of this workflow from two perspectives.

First, we demonstrate the potential of leveraging un-annotated datasets to yield more comprehensive neighborhood-level SDOH factors. We manually annotate 223 SDOH variables from NaNDA, a publicly available un-annotated dataset. We merge relevant variables for 3 of the 5 SDOH domains with a private electronic health record (EHR) dataset with nearly 40,000 patients with HF. We then evaluate the impact of these newly incorporated measures towards predicting 30-day hospital readmission compared to using solely the AHRQ SDOH data.

Next, we explore the use of LLMs to provide accurate domain-annotation on both the AHRQ SDOH and NaNDA dataset. We pose the domain annotation task as a multi-class text classification task, where each SDOH variable is assigned to 1 of the 5 possible SDOH domains. We investigate various representations using attributes of each variable (e.g., variable name, variable description) and evaluate 9 different open-source LLMs with a range of model sizes. Our results using zero-shot prompts illustrate the potential effectiveness of using LLMs to provide domain annotations of SDOH datasets and of expediting the data integration of these new datasets. Code for the study can be found at: https://github.com/fensorechase/LLMs_SDOH_Integration.

Generalizable Insights about Machine Learning in the Context of Healthcare

Our approach offers several key insights for ML in healthcare:

• Our results show that incorporating additional domain-specific neighborhood-level SDOH data such as education and neighborhood can indeed boost the performance of ML models for predicting HF outcomes.

• Our results demonstrate that open-source LLMs can be used to effectively and accurately annotate the SDOH domain using zero-shot prompting without the need for fine-tuning, thereby expediting the data integration process.

• We show that descriptions of individual variables allow open-source LLMs to annotate SDOH variables accurately, highlighting the utility of incorporating free-text semantics for this data integration task.

2 Related Work

Here, we specifically focus on studies that use publicly available SDOH and studies applying LLMs for data annotation. Incorporating SDOH into clinical studies can be a tedious undertaking. In general, clinical studies using public SDOH have used only small sets of measures despite the large volume of heterogeneous SDOH data publicly available. From a data integration perspective, three bottlenecks to widespread integration of more comprehensive SDOH data include: identifying, annotating, and integrating heterogeneous public data sources to merge with patient records.

2.1 SDOH for clinical prediction tasks

A recent systematic review studying SDOH impact on HF outcomes found that works used a small range of 1 to 9 SDOH variables as independent variables (Enard et al., 2023). Of these studies, 42% used individual-level tabular SDOH from electronic medical records such as race, gender, and insurance status, whereas only 27% included public external data sources such as SDOH from national surveys. Nevertheless, there is growing evidence demonstrating the utility of neighborhood-level SDOH for predicting patient outcomes (Enard et al., 2023). Across these studies, publicly available tabular SDOH data has been the most widely used data stream. These works have chiefly focused on low-dimensional SDOH indices like ADI, SDI, or SVI, which only contain 17, 7, and 4 census measures respectively. For example, higher SDI measures are correlated with excess HF readmission and mortality (Patel et al., 2020). Another study found that 2 factors from SVI are associated with higher 30-day readmission (Regmi et al., 2021). More recently, ML classifiers trained with ADI and SDI measures yielded moderately better 30-day readmission performance and fairness with respect to sex and race than clinical features alone (Li et al., 2022). While these illustrate the potential of incorporating SDOH with individual-level clinical data, the limited number of factors within existing SDOH indices may not capture all aspects of the neighborhood.

There have been recent attempts to apply a larger set of SDOH beyond these low-dimensional indices, but the tedious data integration process has limited broader use among the public health community. One study on intensive care unit patients used a wider set of county-level SDOH measures from AHRQ SDOHD to predict patient outcomes (Yang et al., 2023). Yang et al. (2023) also found that subgroup prediction fairness was improved compared to EHR features alone. Still, existing studies have yet to combine neighborhood-level SDOH measures across heterogeneous databases, underscoring the need to improve the pipeline for integrating diverse SDOH data sources with individual-level patient records. This need will continue to grow since new data sources to more comprehensively describe SDOH are constantly emerging.

2.2 Approaches for automating data integration

ML-based automated data integration has been increasingly studied in the database community (Thirumuruganathan et al., 2020). Recent works have begun to explore LLMs to leverage the semantics of the underlying data (Narayan et al., 2022; Fernandez et al., 2023). However, in settings like SDOH databases where the attributes of each variable are generally standardized (i.e., variable name, variable description, and variable data source), the bottleneck to integrating new data sources shifts from merging the data and towards annotating variables to understand their semantics, which can be framed as a binary or multi-class text classification task.

Under the annotation (or text classification) paradigm, recent works have evaluated the viability of LLMs to annotate data via tasks including sentiment analysis (Zhang et al., 2023), hate speech detection (Zhu et al., 2023), and movie genre classification (Mohta et al., 2023). Their results suggest that annotation performance can differ significantly across datasets and LLM architectures. For example, instruction-tuned LLaMA models outperformed supervised baselines on hate speech detection (Mohta et al., 2023), while another work found that GPT-3.5 achieved an accuracy of 0.65 for sentiment analysis compared to human labels (Zhu et al., 2023). Furthermore, the majority of works using LLMs for data annotation have applied closed-source versions of GPT (e.g., ChatGPT) (Narayan et al., 2022; Gilardi et al., 2023; Sun et al., 2023), which can hinder reproducibility due to ambiguity surrounding model size and training data, and limited accessibility through privately hosted APIs. A limited number of works have explored open-source LLMs like LLaMA and Vicuna (Mohta et al., 2023). Additionally, results have indicated that instruction-tuned or “chat” versions of LLMs are better suited to these data annotation tasks (Mohta et al., 2023). The varying performance of LLMs on data annotation tasks highlights the need to evaluate these models for new domains like SDOH annotation.

3 Integrating SDOH Domain-Specific Data for HF Prediction

We first demonstrate the potential of integrating domain-specific SDOH factors using an un-annotated dataset to pair with an existing annotated SDOH dataset.

3.1 30-day HF Readmission Task Setup

Nearly 25% of hospitalized HF patients are readmitted within 30 days of discharge (Virani et al., 2020) with estimates that a quarter of the readmissions are preventable (van Walraven et al., 2012). Thus, there has been a concerted effort to use ML to improve our ability to predict the 30-day readmission risk of an HF patient (Shin et al., 2021). However, the binary classification task associated with 30-day readmission can be challenging with moderate discrimination (Enard et al., 2023; Shin et al., 2021). Since integrating SDOH has been shown to provide moderate improvement (Li et al., 2022), we hypothesize that incorporating additional SDOH measures for specific domains can improve the prediction performance of 30-day HF readmission.

3.2 Hospitalized HF Patient Dataset

We performed a retrospective analysis on a racially diverse patient population admitted at a large healthcare system in the United States with a primary or secondary discharge diagnosis of HF (based on International Classification of Diseases (ICD), Ninth or Tenth Revision, Clinical Modification codes 428.x and I50.x, respectively) for both the index hospitalization as well as any rehospitalizations from January 1, 2010 to December 31, 2018. Self-reported race and residential address were extracted from the electronic health record (EHR). Our inclusion criteria include patients $\geq 18$ years with systolic (HFrEF), diastolic (HFpEF), and other classifications of HF, and a self-reported race of Black or White. Of the 44,442 HF patients identified, 39,067 (87%) were successfully geo-coded. Table 1 summarizes the baseline characteristics for the two different racial subgroups, Black and White. The usage of this data has been approved by the Institutional Review Board (IRB).

3.3 Heart Failure (HF) Cohort

Table 1: Heart Failure patient cohort characteristics at the time of the index hospitalization. HF, Heart Failure; HFpEF, heart failure with preserved ejection fraction; HFrEF, heart failure with reduced ejection fraction; Charlson comorbidity index (CCI); Social Deprivation Index (SDI); Area Deprivation Index (ADI); Social Vulnerability Index (SVI)

	All Patients	- 30-day	+ 30-day	p
n	39066	32481	6585
Age, years (mean (SD))	70.75 (15.90)	70.78 (15.92)	70.62 (15.81)	0.446
Male (mean (SD))	0.52 (0.50)	0.52 (0.50)	0.52 (0.50)	0.254
Female (mean (SD))	0.48 (0.50)	0.48 (0.50)	0.48 (0.50)	0.254
Black (mean (SD))	0.52 (0.50)	0.50 (0.50)	0.62 (0.49)	< 0.001
White (mean (SD))	0.48 (0.50)	0.50 (0.50)	0.38 (0.49)	< 0.001
Insurance type (%)				< 0.001
Medicaid	3314 (8.5)	2595 (8.0)	719 (10.9)
Private	7629 (19.5)	6538 (20.1)	1091 (16.6)
Medicare	25991 (66.5)	21464 (66.1)	4527 (68.7)
Not recorded	2132 (5.5)	1884 (5.8)	248 (3.8)
HF classification (%)				< 0.001
Systolic (HFrEF)	18280 (46.8)	14754 (45.4)	3526 (53.5)
Diastolic (HFpEF)	9565 (24.5)	7907 (24.3)	1658 (25.2)
Other	11221 (28.7)	9820 (30.2)	1401 (21.3)
Hypertension (mean (SD))	0.67 (0.47)	0.68 (0.47)	0.67 (0.47)	0.115
Coronary Artery Disease (mean (SD))	0.47 (0.50)	0.47 (0.50)	0.47 (0.50)	0.976
Chronic Kidney Disease (mean (SD))	0.38 (0.49)	0.37 (0.48)	0.47 (0.50)	< 0.001
CCI (mean (SD))	4.25 (2.98)	4.20 (3.01)	4.49 (2.83)	< 0.001
Systolic BP, mm Hg (mean (SD))	139.57 (33.64)	139.45 (33.53)	140.14 (34.16)	0.130
Heart rate, beats per minute (mean (SD))	86.03 (25.30)	85.71 (25.14)	87.59 (25.98)	< 0.001
Respiratory rate, breaths per minute (mean (SD))	19.92 (9.31)	19.87 (9.45)	20.16 (8.64)	0.024
Length of stay (mean (SD))	8.75 (11.15)	8.81 (11.35)	8.49 (10.07)	0.032
SDI (percentile) (mean (SD))	61.26 (27.57)	60.63 (27.61)	64.28 (27.21)	< 0.001
ADI, national rank (percentile) (mean (SD))	59.32 (25.45)	58.94 (25.40)	61.10 (25.57)	< 0.001
SVI (mean (SD))	0.27 (17.72)	0.32 (16.28)	0.06 (23.50)	0.326

3.3.1 Individual-Level Clinical Data

As baseline features, we incorporate 36 individual-level clinical variables. These include the Charlson comorbidity index (CCI), vital signs, the HF sub-classification (systolic or HFrEF, diastolic or HFpEF, and other), patient demographics, and insurance status. These features are commonly used to characterize the in-hospital state of acute HF patients (Patel et al., 2020; Li et al., 2022). For clinical features, missing values were imputed with the median for each training split.

3.3.2 SDOH Data

We focus on incorporating 2 publicly accessible SDOH data sources to supplement the clinical variables: the Agency for Healthcare Research and Quality (AHRQ) Social Determinants of Health Database (SDOHD) (Agency for Healthcare Research and Quality (2022), AHRQ) and the National Neighborhood Data Archive (NaNDA) (Institute for Social Research, ).

AHRQ includes annotated SDOH indicators covering years 2009-2020 from 11 data sources at the census tract level including CDC PLACES, Opportunity Atlas, American Community Survey (ACS), and other public and private agencies. AHRQ includes measures used to construct the ADI, SDI, and SVI indices from the U.S. Census and the ACS, but it also includes SDOH features from more data sources capturing a wider range of SDOH domains. To merge with the hospitalized HF patient dataset, we restricted the census-tract level AHRQ measures to years 2010-2018, yielding 506 variables across all 5 SDOH domains from 8 data sources (see Table 7 in the Appendix). Notably, measures from a single data source can represent several SDOH domains. For example, ACS contains variables that characterize all 5 domains.

NaNDA includes variables from 28 data sources generally describing the relationships between neighborhood and health. For illustrative purposes, we chose 7 of the 28 data sources to obtain 223 variables. We selected these 7 data sources based on two criteria: first, they include data at the census tract level, and second, they have wide overlap with the hospitalized HF patient dataset from 2010-2018. These two inclusion criteria make these 7 NaNDA data sources comparable to AHRQ with respect to spatial granularity and temporal coverage, allowing for head-to-head evaluation of how SDOH measures from varying domains impact HF readmission prediction. One benefit is that NaNDA continuously updates the database with additional data sources from relevant studies. For example, new data have been updated from studies as recently as January 2024. However, unlike AHRQ, NaNDA does not have annotations for the 5 domains. Additional details on SDOH variables included from the AHRQ and the NaNDA databases can be found in Appendix A.

3.3.3 Gold Label Annotation for NaNDA

As NaNDA does not have ground truth annotations for the 5 SDOH domains, we manually annotated the 223 variables we obtained from NaNDA. We had 3 independent annotators manually assign each NaNDA variable to 1 of the 5 SDOH domains, given the variable name, variable description, and data source description. After the first round of annotation, 2 of the 3 annotators agreed on 100% of variables while there was a slight disagreement between all 3 at 9% (22/223). Between the 3 raters, the average Cohen’s $\kappa$ was 0.19, and Krippendorff’s $\alpha$ was 0.0269, indicating slight agreement. Consensus for the NaNDA annotations was reached through discussion and majority votes — these manual labels are used as the ground truth labels for NaNDA variables throughout all studies we conducted.
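As an illustration of how an average Cohen's $\kappa$ across three raters could be computed, the following minimal sketch averages $\kappa$ over all annotator pairs. The pairwise averaging scheme, the function names, and the toy labels are assumptions made for illustration; they are not the study's annotations or its exact procedure.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(r1, r2):
    """Cohen's kappa for two raters who labelled the same items."""
    n = len(r1)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance agreement from each rater's marginal label frequencies.
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(ratings):
    """Average kappa over all rater pairs (assumed aggregation scheme)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy domain labels (1-5) from three hypothetical annotators.
annotators = [
    [1, 3, 4, 4, 3, 1],
    [1, 3, 4, 3, 3, 1],
    [1, 3, 4, 4, 3, 4],
]
print(round(mean_pairwise_kappa(annotators), 3))
```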

3.3.4 Integrating SDOH Data with Clinical Data

Given census-tract level SDOH data between 2010-2018 from AHRQ ($d=506$) and NaNDA ($d=223$), we merged these SDOH features with patient records using the year of the patient’s index admission and the 11-digit census tract code of their primary residence. For missing values of SDOH variables, we imputed with the value for the most recent year available (e.g., if the value of a variable $d_{1}$ is missing for 2016 at census tract $c$, the value of $d_{1}$ from 2015 at census tract $c$ is used in its place). Table 5 in the Appendix summarizes the clinical variables, the AHRQ, and the NaNDA features.
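The merge and imputation rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the function names `lookup_sdoh` and `merge_patients`, the tract code, and the values are hypothetical, and the fallback is assumed to search only earlier years, consistent with the 2016 to 2015 example.

```python
def lookup_sdoh(sdoh, tract, year, min_year=2010):
    """Return the SDOH value for (tract, year).

    If the value is missing, fall back to the most recent earlier year
    that has a value (assumption: only earlier years are searched).
    """
    for y in range(year, min_year - 1, -1):
        value = sdoh.get((tract, y))
        if value is not None:
            return value
    return None  # no value available for this tract in the study window


def merge_patients(patients, sdoh, variable_name):
    """Attach one SDOH variable to each patient record."""
    merged = []
    for patient in patients:
        row = dict(patient)  # copy so the input records are not mutated
        row[variable_name] = lookup_sdoh(
            sdoh, patient["tract"], patient["index_year"]
        )
        merged.append(row)
    return merged


# Hypothetical variable d1, keyed by (11-digit census tract code, year).
d1 = {("13089020100", 2015): 0.42, ("13089020100", 2016): None}
patients = [{"id": 1, "tract": "13089020100", "index_year": 2016}]
print(merge_patients(patients, d1, "d1"))
```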

3.3.5 ML Models & Evaluation

We trained logistic regression (LR) models on the clinical data and the public SDOH data. (We also explored other models, including XGBoost, Random Forest, and simple multi-layer perceptrons, but found that performance was most stable with LR.) The LR models were trained and evaluated using 10-fold cross validation (CV). The performance was assessed using the mean area under the receiver operating characteristic curve (AUROC) and F1-score across the 10 folds. Features were standardized to have zero mean and unit standard deviation.
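For reference, the fold-averaged AUROC can be sketched with a rank-based definition: the probability that a randomly chosen readmitted patient receives a higher score than a randomly chosen non-readmitted patient, with ties counted as one half. The function names and the input layout are assumptions for illustration, and the quadratic loop is meant for clarity rather than for cohorts of this size.

```python
def auroc(y_true, scores):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auroc(fold_results):
    """Average AUROC over folds; each fold is a (y_true, scores) pair."""
    values = [auroc(y, s) for y, s in fold_results]
    return sum(values) / len(values)


# Toy held-out labels and predicted risks for two hypothetical folds.
folds = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.6]),
]
print(mean_auroc(folds))
```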

3.4 Impact of SDOH on 30-day HF readmission prediction

Table 2 summarizes the performance of LR models using 3 different feature sets for the 3 SDOH domains encompassed by the NaNDA variables. As can be seen from the results, AHRQ and NaNDA alone (SDOH-only models) do not outperform the clinical data itself. For example, the highest performing SDOH-only model (Domain 4: AHRQ) yields an AUROC of 0.605, less than clinical features (0.617).

Within SDOH-only models, the performance of NaNDA and AHRQ features differs among SDOH Domains 1, 3, and 4. Overall, Domain 4 SDOH measures (Neighborhood and Built Environment) from both NaNDA and AHRQ yielded the highest AUROC and F1 compared to other SDOH domains. The wide set of Domain 3 SDOH measures (Education Access and Quality) from NaNDA achieves higher AUROC and F1-score than do Domain 3 (Education Access and Quality) measures from AHRQ. However for Domains 1 and 4, AHRQ measures outperform those from NaNDA. Interestingly, Domain 1 from NaNDA contains only a single feature (the census tract population), but performs with comparable AUROC and F1-score relative to the $d$ =129 AHRQ Domain 1 features (Table 2).

Importantly, Clinical $+$ AHRQ $+$ NaNDA features from each domain significantly outperform Clinical features alone. Including all Domain 4 SDOH features yields the best performance overall, improving AUROC by $+$ 0.028 and F1-score by $+$ 0.016 relative to Clinical features alone (Table 2). Domains 1 and 3 perform with the next highest AUROC and F1-scores when combined with Clinical, still significantly outperforming the Clinical features alone.

Table 2: Comparison of AUROC performance on HF 30-day readmission for Clinical versus SDOH features by Domain. All SDOH includes domains 1, 3, and 4. Bold underlined denotes highest performance overall. Using Clinical as a benchmark, paired t-tests across 10 folds were run for Clinical versus each feature set shown. Green denotes performance better than Clinical, red denotes worse. p values are denoted as: *: p < 0.05, **: p < 0.01, ***: p < 0.001

Domain	Source	# Features	AUROC	F1-score
Clinical	—	36	0.617	0.323
All SDOH	—	456	0.598	0.301
1	AHRQ	129	0.591	0.300
	NaNDA	1	0.521	0.269
	Clinical + AHRQ + NaNDA	164	0.637***	0.334**
3	AHRQ	10	0.520	0.255
	NaNDA	157	0.569	0.283
	Clinical + AHRQ + NaNDA	203	0.629***	0.332**
4	AHRQ	94	0.605	0.308
	NaNDA	65	0.582	0.296
	Clinical + AHRQ + NaNDA	195	0.645***	0.339***
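The significance markers in Table 2 come from paired t-tests across the 10 folds. A minimal sketch of such a test is given below, assuming a two-sided test with 9 degrees of freedom and using standard critical values of the t distribution in place of an exact p-value. The per-fold AUROC values are invented for illustration and are not the study's results; the function names are likewise hypothetical.

```python
import math


def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (e.g., per-fold AUROCs)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)


# Two-sided critical values of Student's t for df = 9 (10 folds).
CRITICAL_DF9 = [(4.781, "***"), (3.250, "**"), (2.262, "*")]


def stars(t):
    """Map a t statistic (df = 9) to the table's significance markers."""
    for critical, mark in CRITICAL_DF9:
        if abs(t) > critical:
            return mark
    return ""


# Invented per-fold AUROCs for a combined feature set versus Clinical alone.
combined = [0.64, 0.66, 0.63, 0.65, 0.64, 0.66, 0.65, 0.63, 0.65, 0.64]
clinical = [0.61, 0.63, 0.62, 0.62, 0.60, 0.63, 0.62, 0.61, 0.62, 0.61]
t = paired_t_statistic(combined, clinical)
print(round(t, 2), stars(t))
```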

4 LLMs Can Help With SDOH Data Integration

4.1 SDOH Domain Classification Dataset

In this section, we evaluate the potential of using open-source large language models (LLMs) to expedite an important but time-consuming manual step in the SDOH data integration process: annotating variables to one of the 5 SDOH domains. We benchmark the performance of LLMs on this task across two separate SDOH datasets, AHRQ (d=506) and NaNDA (d=223).

4.1.1 SDOH Classification Task

We consider a set of SDOH variables, $\mathcal{D}_{SDOH}=\{(a_{i},b_{i},c_{i}),y_{i}\}_{i=1}^{N}$ , where each variable is represented as a tuple $(a_{i},b_{i},c_{i})$ denoting the respective variable metadata ( $a_{i}$ : variable name, $b_{i}$ : variable description, $c_{i}$ : description of the original data source of variable). $N$ denotes the total number of variables in the dataset. Each variable has a gold label $y_{i}$ that represents the SDOH domain which best describes the variable, and $y_{i}$ takes exactly one integer value in the set $\{1,...,5\}$ where integers denote the aforementioned nominal SDOH domains (Table 8).

Given a dataset of variables $\mathcal{D}_{SDOH}$ , the SDOH Domain classification task is: for all $i\in\{1,...,N\}$ , given a “prompt” consisting of a text string description of the classification task, options for the 5 SDOH domains, and some combination of variable metadata $(a_{i},b_{i},c_{i})$ , independently classify each variable to the gold SDOH domain which matches $y_{i}$ . Note that we use manual human annotations as the gold labels for all $y_{i}$ , as described in Section 3.3.3.

To represent the scenario of a researcher annotating a high volume of variables from heterogeneous data sources, we assume only these three standard metadata fields are available to describe each variable. We assume that each variable is annotated independently of the others, so that for all $i\in\{2,...,|\mathcal{D}_{SDOH}|\}$ , $x_{i-1}$ is not known when annotating $x_{i}$ , where $x_{i}$ denotes the $i$-th variable. Note that each variable is assumed to characterize SDOH in some way; however, in this scenario, the exact SDOH domain that the variable describes is unknown to the annotator.

4.2 Using LLMs for SDOH Domain Classification

Figure 1: Overview of using LLMs to classify SDOH variables. Created with BioRender.com

We constructed an LLM prompt template with 4 key components: a brief description of the text classification task, a specific format for the requested response, names and corresponding integers of the 5 SDOH domains, and finally a combination of metadata for the current variable ( $a_{i}$ : variable name, $b_{i}$ : variable description, $c_{i}$ : description of the original data source of variable). In order to allow the LLMs to express uncertainty, we also include a refusal label (‘?’) option within the prompt, allowing the LLM to indicate that it does not know which of the 5 SDOH domains to classify the variable into.
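A prompt with these four components and the refusal option could be assembled along the following lines. The wording of the instructions, the function name `build_prompt`, and the example variable are hypothetical and do not reproduce the template used in the study; only the domain names and their integer labels follow the framework described in the Introduction.

```python
# Domain names and integer labels as listed in the Introduction.
DOMAINS = {
    1: "Social and Community Context",
    2: "Economic Stability",
    3: "Education Access and Quality",
    4: "Neighborhood and Built Environment",
    5: "Healthcare Access and Quality",
}


def build_prompt(name=None, description=None, source=None):
    """Assemble a zero-shot prompt; the phrasing here is illustrative only."""
    lines = [
        # Component 1: task description.
        "Classify the variable below into one social determinants of "
        "health (SDOH) domain.",
        # Component 2: response format, including the refusal label.
        "Respond with a single integer from 1 to 5, or '?' if you do not know.",
        # Component 3: domain names with their integer labels.
        "Domains:",
    ]
    lines += [f"{label}: {domain}" for label, domain in DOMAINS.items()]
    # Component 4: whichever metadata fields are supplied.
    if name is not None:
        lines.append(f"Variable name: {name}")
    if description is not None:
        lines.append(f"Variable description: {description}")
    if source is not None:
        lines.append(f"Data source description: {source}")
    return "\n".join(lines)


# Hypothetical variable used only to show the rendered prompt.
print(build_prompt(name="TOTPOP", description="Total population of the tract"))
```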

To evaluate the utility of each variable metadata field for domain classification, we conducted an ablation study. Given the three metadata fields for a variable $\{(a_{i},b_{i},c_{i}),y_{i}\}$ , we selectively prompted the LLM with different combinations of these fields. The ablation study included 7 combinations of the 3 fields denoted as $A,B,C,AB,AC,BC,ABC$ , where $A$ denotes that only the variable name was included in the prompt while $AB$ denotes that both the variable name and variable description were included.
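The 7 ablation settings are the non-empty subsets of the three metadata fields. A minimal sketch of how they could be enumerated, and how a variable's metadata could be restricted to one setting, is shown below; the helper names and the example metadata are hypothetical.

```python
from itertools import combinations

# Field codes follow the notation in the text: A, B, C.
FIELDS = {
    "A": "variable name",
    "B": "variable description",
    "C": "data source description",
}


def field_combinations():
    """All non-empty subsets of the metadata fields, smallest first."""
    keys = sorted(FIELDS)
    return [
        "".join(subset)
        for size in range(1, len(keys) + 1)
        for subset in combinations(keys, size)
    ]


def select_fields(metadata, combo):
    """Keep only the metadata entries named in a combination such as 'AB'."""
    return {code: metadata[code] for code in combo}


# Hypothetical metadata for one variable.
metadata = {"A": "TOTPOP", "B": "Total population", "C": "Census survey"}
for combo in field_combinations():
    print(combo, select_fields(metadata, combo))
```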

LLMs can be prompted with varying degrees of background information about the task at hand. One way to incorporate more information is through few-shot learning (Brown et al., 2020), where solved examples of the task are provided within the prompt. To evaluate the impact of few-shot learning on LLM performance here, we construct two types of prompt templates: zero-shot prompts, which do not provide any examples of variable metadata from each domain, and 1-shot prompts, in which an example of one variable’s metadata is included for each of the 5 SDOH domains. To construct 1-shot prompts, 5 of the 506 AHRQ variables were randomly sampled (1 variable from each SDOH domain) to serve as SDOH domain examples. These 5 AHRQ variables were held out from evaluation, and were used in 1-shot prompts for LLMs on both the AHRQ and NaNDA variables. We constructed these prompt templates based on prior works applying LLMs for different text classification tasks (Touvron et al., ; Zhang et al., 2024). While other prompting strategies like Chain-of-Thought have shown success on sequential reasoning tasks such as logic puzzles (Wei et al., 2023), we do not explore their application here in an effort to minimize prompt complexity and because of the non-sequential reasoning structure of this classification task.
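The selection of 1-shot examples can be sketched as sampling one variable per domain and removing those variables from the evaluation set. The record layout, the function name, and the fixed seed are assumptions for illustration; the study's actual sampling procedure may differ.

```python
import random


def sample_one_shot_examples(variables, seed=0):
    """Pick one example per domain and hold those examples out of evaluation.

    `variables` is a list of dicts with (hypothetical) keys "name" and
    "domain"; names are assumed to be unique.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_domain = {}
    for variable in variables:
        by_domain.setdefault(variable["domain"], []).append(variable)
    examples = [rng.choice(by_domain[d]) for d in sorted(by_domain)]
    held_out = {example["name"] for example in examples}
    evaluation = [v for v in variables if v["name"] not in held_out]
    return examples, evaluation


# Toy pool: three hypothetical variables in each of the 5 domains.
pool = [
    {"name": f"var_{d}_{i}", "domain": d}
    for d in range(1, 6)
    for i in range(3)
]
examples, evaluation = sample_one_shot_examples(pool)
print(len(examples), len(evaluation))
```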

4.3 LLM Base Models & Evaluation

Nine open-source language models were evaluated in our work: Llama-2 7B-chat, Llama-2 13B-chat, Llama-2 70B-chat (Touvron et al., ), Gemma 2B-it, Gemma 7B-it (Team et al., 2024), Mistral 7B v0.1 Instruct, Mistral 7B v0.2 Instruct (Jiang et al., 2023), Flan-T5-XL, and Flan-T5-XXL (Chung et al., 2022). For each model, we use the checkpoints hosted on Hugging Face. We only use chat or instruction-tuned versions of each model (footnote 4: our experiments on base/non-chat Llama-2 models yielded suboptimal adherence to the response format). We briefly discuss the details of the four families of models below. Details on LLM hyperparameters are included in Appendix B.

Llama-2-chat: Llama-2 is a decoder-only model available in three sizes: 7B, 13B, and 70B. We include Llama-2 models because their wide range of model sizes allows us to explore the impact of model size on the data annotation task. Llama-2 models have a vocabulary size of 32,000 words (Touvron et al., ).

Gemma-it: Gemma is a dense decoder-only model available in two sizes: 2B and 7B. Gemma models have a vocabulary size of 256,000 words, significantly larger than other models included here (Team et al., 2024).

Mistral 7B Instruct: Mistral 7B models adopt a decoder-only architecture, and have a vocabulary size of 50,000 words (Jiang et al., 2023).

Flan-T5: Flan-T5 is an encoder-decoder model with a vocabulary size of 32,128 words (Chung et al., 2022). Flan-T5 models are instruction-tuned variants of the original T5 models.

To evaluate performance on the SDOH classification task, we use human annotations as the gold labels for all $y_{i}$ and we use Micro F1 to evaluate LLM performance with respect to these gold labels.
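For reference, Micro F1 pools true positives, false positives, and false negatives over all labels before computing a single F1 score. The sketch below assumes that every response, including the refusal label ‘?’, is counted as a predicted label; under that assumption each wrong prediction contributes exactly one false positive and one false negative, so Micro F1 coincides with accuracy. The function name `micro_f1` is illustrative.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 for single-label multiclass predictions."""
    labels = set(gold) | set(predicted)
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, predicted):
            if p == label and g == label:
                tp += 1  # correct prediction of this label
            elif p == label:
                fp += 1  # predicted this label, gold is another
            elif g == label:
                fn += 1  # gold is this label, predicted another
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy example with made-up labels: 3 of 5 predictions match the gold label.
gold_labels = [1, 3, 4, 4, 5]
llm_labels = [1, 3, 4, "?", 2]
score = micro_f1(gold_labels, llm_labels)  # 2*3 / (2*3 + 2 + 2) = 0.6
```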

In practice, postprocessing of the LLM response is required to extract the relevant information, in this case the integer classification $\{1,\dots,5\}$ . For example, the model may respond with the name of the SDOH domain instead of the integer label. Details on the automated postprocessing of LLM responses can be found in Appendix B.3.
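As an illustration of what such postprocessing might look like, the sketch below maps a free-text response to an integer label or to the refusal label. It is not the procedure from Appendix B.3: the function name `parse_domain` and its rules (first standalone digit from 1 to 5, then a domain-name match, otherwise ‘?’) are assumptions made for this example.

```python
import re

# Domain names and integers as listed in the prompt template.
DOMAINS = {
    1: "Social and Community Context",
    2: "Economic Stability",
    3: "Education Access and Quality",
    4: "Neighborhood and Built Environment",
    5: "Healthcare Access and Quality",
}


def parse_domain(response):
    """Map a raw LLM response to a label in {1,...,5} or '?' (hypothetical rules)."""
    text = response.strip()

    # Rule 1: a standalone digit between 1 and 5 anywhere in the response.
    match = re.search(r"(?<!\d)[1-5](?!\d)", text)
    if match:
        return int(match.group())

    # Rule 2: the response spells out a domain name instead of its integer.
    lowered = text.lower()
    for label, name in DOMAINS.items():
        if name.lower() in lowered:
            return label

    # Rule 3: anything else is treated as a refusal.
    return "?"
```

A simple rule set like this one takes the first digit it finds, so responses that mention several numbers would need stricter handling.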

4.4 Certain LLMs effectively annotate SDOH data

Zero-shot results

Tables 3 and 4 summarize zero-shot performance on the NaNDA and AHRQ datasets, respectively. Flan-T5 family models show the highest Micro F1 across both NaNDA and AHRQ. On NaNDA with zero-shot prompts, the highest Micro F1 of 0.785 was observed with Flan-T5-XXL using variable description alone (B). On AHRQ with zero-shot prompts, the highest Micro F1 of 0.850 was achieved with Flan-T5-XL when prompting with variable name and variable description (AB).

1-shot results

Tables 9 and 10 summarize 1-shot performance on the NaNDA and AHRQ datasets, respectively. Consistent with zero-shot results, Flan-T5 family models show the highest Micro F1 across both NaNDA and AHRQ. On NaNDA with 1-shot prompts, the highest Micro F1 of 0.825 was observed with Flan-T5-XXL using variable name and variable description (AB). On AHRQ with 1-shot prompts, the highest Micro F1 of 0.916 was achieved with Flan-T5-XXL when prompting with all variable metadata (ABC).

Small LLMs may outperform larger ones

In general, we observe that larger models do not always yield better performance than smaller models, even within model families. For example, with zero-shot prompts, Llama-2 7B-chat yielded a Micro F1 of 0.596 using the NaNDA variable description (B), outperforming the 13B and 70B models by $+0.219$ and $+0.269$ , respectively (Table 3). A similar trend can be observed for Flan-T5 models on AHRQ, where Flan-T5-XL outperforms the larger Flan-T5-XXL by $+0.088$ when prompted with variable name and description (AB) (Table 4). For 1-shot prompts, the larger Flan-T5-XXL always achieves a higher Micro F1 than Flan-T5-XL; however, Llama-2 7B-chat sometimes outperforms 13B and 70B for shorter prompts on NaNDA (Table 9).

Table 13 shows the average prompt length across the 7 metadata combinations and the two datasets (AHRQ, NaNDA). It is possible that LLMs performed better on average on AHRQ than on NaNDA because of minor differences in the length and content of the metadata fields. For example, AHRQ contains longer variable names (A) and variable descriptions (B) on average than NaNDA. However, NaNDA data source descriptions (C) are 2-3 times as long as those from AHRQ.

Table 3: Classifying NaNDA variables: Zero-shot LLM performance (Micro F1) across prompt context on SDOH domain multiclass (5 domains + ‘?’) text classification task. Features: A: variable name, B: variable description, C: dataset(s) from which variable was originally extracted.

Model	A	B	C	AB	AC	BC	ABC
Llama-2 7B-chat	0.085	0.596	0.291	0.489	0.291	0.323	0.354
Llama-2 13B-chat	0.166	0.377	0.673	0.247	0.112	0.274	0.220
Llama-2 70B-chat	0.175	0.327	0.291	0.256	0.251	0.368	0.341
Gemma 2B-it	0.004	0.009	0.004	0.004	0.004	0.045	0.022
Gemma 7B-it	0.040	0.341	0.027	0.085	0.265	0.224	0.112
Mistral 7B v0.1 Instruct	0.009	0.045	0.004	0.027	0.009	0.031	0.009
Mistral 7B v0.2 Instruct	0.045	0.363	0.112	0.395	0.090	0.247	0.260
Flan-T5-XL (3B)	0.045	0.632	0.229	0.623	0.143	0.552	0.453
Flan-T5-XXL (11B)	0.072	0.785	0.269	0.776	0.224	0.744	0.740

Table 4: Classifying AHRQ variables: Zero-shot LLM performance (Micro F1) across prompt context on SDOH domain multiclass (5 domains + ‘?’) text classification task. Features: A: variable name, B: variable description, C: dataset(s) from which variable was originally extracted.

Model	A	B	C	AB	AC	BC	ABC
Llama-2 7B-chat	0.347	0.297	0.064	0.253	0.194	0.246	0.335
Llama-2 13B-chat	0.152	0.433	0.020	0.469	0.218	0.467	0.471
Llama-2 70B-chat	0.315	0.535	0.186	0.505	0.311	0.533	0.523
Gemma 2B-it	0.255	0.070	0.255	0.255	0.255	0.184	0.255
Gemma 7B-it	0.479	0.541	0.445	0.575	0.447	0.603	0.571
Mistral 7B v0.1 Instruct	0.263	0.431	0.255	0.283	0.261	0.477	0.293
Mistral 7B v0.2 Instruct	0.571	0.657	0.317	0.653	0.517	0.643	0.591
Flan-T5-XL (3B)	0.295	0.772	0.257	0.850	0.341	0.693	0.701
Flan-T5-XXL (11B)	0.333	0.749	0.317	0.762	0.517	0.762	0.764

5 Discussion

Our results demonstrate the potential of leveraging large language models (LLMs) to accelerate the integration of publicly available social determinants of health (SDOH) data with clinical measures for predictive healthcare tasks. By using LLMs to automatically annotate the domains of over 700 SDOH variables from multiple data sources, researchers can bypass the need for laborious manual annotation. This allows relevant neighborhood-level measures like education and community characteristics to be efficiently integrated with individual patient data.

We found that incorporating additional domain-specific SDOH variables beyond the census-tract level AHRQ dataset boosted the performance of ML models for predicting 30-day hospital readmission in heart failure (HF) patients. Notably, open-source LLMs achieved high Micro F1 annotating SDOH variable domains using readily available metadata like variable name and description as inputs. This SDOH integration was performed using open-source LLMs and without any task-specific fine-tuning. This highlights the capability of LLMs to leverage free-text semantics for zero-shot data integration tasks in healthcare.

As more neighborhood-level data becomes publicly available through open sources such as Data.gov and ICPSR (footnote 5: Institute for Social Research (ICPSR): https://www.icpsr.umich.edu/web/about/cms/5016), automated methods like LLM-based annotation will be crucial for maximizing insights from multi-modal sources. Overall, our work highlights promising applications of large language models to augment and accelerate machine learning pipelines for improved clinical risk prediction.

Limitations and Future Work

Though we implemented a majority voting approach among three human annotators to determine the gold label for NaNDA variables, other approaches may be useful to encode variability in annotators’ decisions within the ground truth. For example, a small number of recent studies have adopted a perspectivist approach, which preserves multiple annotators’ labels as the gold labels for a given instance when evaluating LLM performance (Santurkar et al., 2023; Lee et al., 2023). This approach could be well suited to evaluating LLMs on subjective inference tasks such as sentiment analysis or controversial topics. However, instead of adopting a perspectivist approach, future work should explore the capacity of LLMs to annotate SDOH variables to specific SDOH topics, which may more precisely characterize the semantics of the variable. For example, AHRQ annotates each SDOH variable to 1 of the 5 domains, and also to 1 of 32 SDOH topics defined by AHRQ.

We included a maximum of 3 metadata fields when presenting variables to LLMs for classification (variable name, variable description, and original data source description). These were included as they are standard metadata across several SDOH databases. However, future work may draw on approaches using LLMs for entity recognition (Ashok and Lipton, 2023) to leverage unstructured or semi-structured metadata when integrating new SDOH data sources. These approaches could be especially useful for screening large sets of candidate SDOH variables from frequently updated open access databases like Data.gov.

Though they are outside the scope of this work on tabular SDOH, a small number of recent studies have explored images and unstructured text to describe SDOH. Unstructured text has been applied to measure both individual-level SDOH (e.g., clinical notes) (Patra et al., 2021) and neighborhood-level SDOH (e.g., from geo-tagged social media posts) (Zhang et al., 2022). Images have largely been studied to characterize the neighborhood and built environment domain (Xi et al., 2023). To leverage this wealth of disparate data to characterize neighborhood-level SDOH in more detail, future work may leverage LLMs to efficiently integrate multi-modal SDOH data with patient records.

References

Agency for Healthcare Research and Quality (2022) (AHRQ) Agency for Healthcare Research and Quality (AHRQ). Social determinants of health (sdoh) database data sources documentation, 2022. URL https://www.ahrq.gov/sites/default/files/wysiwyg/sdoh/SDOH-Data-Sources-Documentation-v1-Final.pdf. Accessed: 2024-04-12.
Ashok and Lipton (2023) Dhananjay Ashok and Zachary C. Lipton. PromptNER: Prompting For Named Entity Recognition, June 2023.
Bahrami et al. (2008) Hossein Bahrami, Richard Kronmal, David A. Bluemke, Jean Olson, Steven Shea, Kiang Liu, Gregory L. Burke, and João A. C. Lima. Differences in the Incidence of Congestive Heart Failure by Ethnicity. Archives of internal medicine, 168(19):2138–2145, October 2008. ISSN 0003-9926. 10.1001/archinte.168.19.2138.
Bozkurt et al. (2021) Biykem Bozkurt, Andrew J.S. Coats, Hiroyuki Tsutsui, Ca Magdy Abdelhamid, Stamatis Adamopoulos, Nancy Albert, Stefan D. Anker, John Atherton, Michael Böhm, Javed Butler, Mark H. Drazner, G. Michael Felker, Gerasimos Filippatos, Mona Fiuzat, Gregg C. Fonarow, Juan-Esteban Gomez-Mesa, Paul Heidenreich, Teruhiko Imamura, Ewa A. Jankowska, James Januzzi, Prateeti Khazanie, Koichiro Kinugawa, Carolyn S.P. Lam, Yuya Matsue, Marco Metra, Tomohito Ohtani, Massimo Francesco Piepoli, Piotr Ponikowski, Giuseppe M.C. Rosano, Yasushi Sakata, Petar Seferović, Randall C. Starling, John R. Teerlink, Orly Vardeny, Kazuhiro Yamamoto, Clyde Yancy, Jian Zhang, and Shelley Zieroth. Universal definition and classification of heart failure: A report of the Heart Failure Society of America, Heart Failure Association of the European Society of Cardiology, Japanese Heart Failure Society and Writing Committee of the Universal Definition of Heart Failure. European Journal of Heart Failure, 23(3):352–380, 2021. ISSN 1879-0844. 10.1002/ejhf.2115.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
Butler et al. (2013) Danielle C Butler, Stephen Petterson, Robert L Phillips, and Andrew W Bazemore. Measures of Social Deprivation That Predict Health Care Access and Need within a Rational Area of Primary Care Service Delivery. Health Services Research, 48(2 Pt 1):539–559, April 2013. ISSN 0017-9124. 10.1111/j.1475-6773.2012.01449.x.
CDC (2023) CDC. Heart Failure | cdc.gov. https://www.cdc.gov/heartdisease/heart_failure.htm, January 2023.
(8) Centers for Disease Control and Prevention. Places. URL https://www.cdc.gov/places.
Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling Instruction-Finetuned Language Models, December 2022.
Enard et al. (2023) Kimberly R. Enard, Alyssa M. Coleman, R. Aver Yakubu, Biona C. Butcher, Donghua Tao, and Paul J. Hauptman. Influence of Social Determinants of Health on Heart Failure Outcomes: A Systematic Review. https://www.ahajournals.org/doi/epub/10.1161/JAHA.122.026590, 2023.
Fernandez et al. (2023) Raul Castro Fernandez, Aaron J. Elmore, Michael J. Franklin, Sanjay Krishnan, and Chenhao Tan. How Large Language Models Will Disrupt Data Management. Proceedings of the VLDB Endowment, 16(11):3302–3309, July 2023. ISSN 2150-8097. 10.14778/3611479.3611527.
Flanagan et al. (2011) Barry E. Flanagan, Edward W. Gregory, Elaine J Hallisey, Janet L. Heitgerd, and Brian Lewis. A Social Vulnerability Index for Disaster Management. Journal of Homeland Security and Emergency Management, 8(1), January 2011. ISSN 1547-7355. 10.2202/1547-7355.1792.
Gilardi et al. (2023) Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30):e2305016120, July 2023. 10.1073/pnas.2305016120.
Heidenreich et al. (2013) Paul A. Heidenreich, Nancy M. Albert, Larry A. Allen, David A. Bluemke, Javed Butler, Gregg C. Fonarow, John S. Ikonomidis, Olga Khavjou, Marvin A. Konstam, Thomas M. Maddox, Graham Nichol, Michael Pham, Ileana L. Piña, and Justin G. Trogdon. Forecasting the Impact of Heart Failure in the United States. Circulation. Heart failure, 6(3):606–619, May 2013. ISSN 1941-3289. 10.1161/HHF.0b013e318291329a.
Hinnant et al. (2022) Laurie Hinnant, Sara Hairgrove, Heather Kane, Jason Williams, and Jessica Duncan Cance. Social Determinants of Health: A Review of Publicly Available Indices. RTI Press Occasional Papers. RTI Press, Research Triangle Park (NC), 2022.
(16) University of Michigan Institute for Social Research. National Neighborhood Data Archive (NaNDA). https://www.icpsr.umich.edu/web/ICPSR/series/1920.
Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, October 2023.
Kommuri et al. (2012) Naga V. A. Kommuri, Todd M. Koelling, and Scott L. Hummel. The Impact of Prior Heart Failure Hospitalizations on Long-term Mortality Differs by Baseline Risk of Death. The American Journal of Medicine, 125(2):209.e9–209.e15, February 2012. ISSN 0002-9343, 1555-7162. 10.1016/j.amjmed.2011.06.014.
Lee et al. (2023) Noah Lee, Na Min An, and James Thorne. Can large language models capture dissenting human voices?, 2023.
Li et al. (2022) Yikuan Li, Hanyin Wang, and Yuan Luo. Improving Fairness in the Prediction of Heart Failure Length of Stay and Mortality by Integrating Social Determinants of Health. Circulation: Heart Failure, 15(11):e009473, November 2022. 10.1161/CIRCHEARTFAILURE.122.009473.
Liu et al. (2022) Jing Liu, Ping Liu, Mei-Rong Lei, Hong-Wei Zhang, Ao-Lin You, and Xiao-Rong Luan. Readmission risk prediction model for patients with chronic heart failure: A systematic review and meta-analysis. Iranian Journal of Public Health, 51(7):1481, 2022.
Maroko (2016) Andrew R. Maroko. Integrating Social Determinants of Health With Treatment and Prevention: A New Tool to Assess Local Area Deprivation. Preventing Chronic Disease, 13, 2016. ISSN 1545-1151. 10.5888/pcd13.160221.
Mohta et al. (2023) Jay Mohta, Kenan Emir Ak, Yan Xu, and Mingwei Shen. Are large language models good annotators? 2023.
Narayan et al. (2022) Avanika Narayan, Ines Chami, Laurel Orr, Simran Arora, and Christopher Ré. Can Foundation Models Wrangle Your Data?, December 2022.
National Academies of Sciences, Engineering, and Medicine (2020) National Academies of Sciences, Engineering, and Medicine; Health and Medicine Division; Board on Population Health and Public Health Practice; Committee on Informing the Selection of Leading Health Indicators for Healthy People 2030. Leading health indicators 2030: advancing health, equity, and well-being. 2020.
(26) Office of Disease Prevention and Health Promotion. Social Determinants of Health - Healthy People 2030 | health.gov. https://health.gov/healthypeople/priority-areas/social-determinants-health.
Patel et al. (2020) Shivani A Patel, Maya Krasnow, Kaitlyn Long, Theresa Shirey, Neal Dickert, and Alanna A Morris. Excess 30-day heart failure readmissions and mortality in black patients increases with neighborhood deprivation. Circulation: Heart Failure, 13(12):e007947, 2020.
Patra et al. (2021) Braja G Patra, Mohit M Sharma, Veer Vekaria, Prakash Adekkanattu, Olga V Patterson, Benjamin Glicksberg, Lauren A Lepow, Euijung Ryu, Joanna M Biernacka, Al’ona Furmanchuk, Thomas J George, William Hogan, Yonghui Wu, Xi Yang, Jiang Bian, Myrna Weissman, Priya Wickramaratne, J John Mann, Mark Olfson, Thomas R Campion, Mark Weiner, and Jyotishman Pathak. Extracting social determinants of health from electronic health records using natural language processing: A systematic review. Journal of the American Medical Informatics Association : JAMIA, 28(12):2716–2727, October 2021. ISSN 1067-5027. 10.1093/jamia/ocab170.
Rathore et al. (2003) Saif S. Rathore, JoAnne M. Foody, Yongfei Wang, Grace L. Smith, Jeph Herrin, Frederick A. Masoudi, Pamela Wolfe, Edward P. Havranek, Diana L. Ordin, and Harlan M. Krumholz. Race, quality of care, and outcomes of elderly patients hospitalized with heart failure. JAMA, 289(19):2517–2524, May 2003. ISSN 0098-7484. 10.1001/jama.289.19.2517.
Regmi et al. (2021) Manjari Rani Regmi, Nitin Tandan, Priyanka Parajuli, Mukul Bhattarai, Ruby Maini, Abhishek Kulkarni, and Robert Robinson. Social vulnerability indices as a risk factor for heart failure readmissions. Clinical Medicine & Research, 19(3):116–122, 2021.
Santurkar et al. (2023) Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. Whose opinions do language models reflect? In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
Schroeder (2007) Steven A. Schroeder. We Can Do Better — Improving the Health of the American People. New England Journal of Medicine, 357(12):1221–1228, 2007. 10.1056/NEJMsa073350.
Shin et al. (2021) Sheojung Shin, Peter C Austin, Heather J Ross, Husam Abdel-Qadir, Cassandra Freitas, George Tomlinson, Davide Chicco, Meera Mahendiran, Patrick R Lawler, Filio Billia, et al. Machine learning vs. conventional statistical models for predicting heart failure readmission and mortality. ESC heart failure, 8(1):106–115, 2021.
Sun et al. (2023) Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, and Guoyin Wang. Text Classification via Large Language Models, October 2023.
Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L. Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu-hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. Gemma: Open Models Based on Gemini Research and Technology, April 2024.
Thirumuruganathan et al. (2020) Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, and AnHai Doan. Data curation with deep learning. In EDBT, pages 277–286, 2020.
(37) Hugo Touvron, Louis Martin, and Kevin Stone. Llama 2: Open Foundation and Fine-Tuned Chat Models.
van Walraven et al. (2012) Carl van Walraven, Alison Jennings, and Alan J Forster. A meta-analysis of hospital 30-day avoidable readmission rates. Journal of evaluation in clinical practice, 18(6):1211–1218, 2012.
Virani et al. (2020) Salim S Virani, Alvaro Alonso, Emelia J Benjamin, Marcio S Bittencourt, Clifton W Callaway, April P Carson, Alanna M Chamberlain, Alexander R Chang, Susan Cheng, Francesca N Delling, et al. Heart disease and stroke statistics—2020 update: a report from the american heart association. Circulation, 141(9):e139–e596, 2020.
Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, January 2023.
Xi et al. (2023) Yanxin Xi, Yu Liu, Tong Li, Jingtao Ding, Yunke Zhang, Sasu Tarkoma, Yong Li, and Pan Hui. A Satellite Imagery Dataset for Long-Term Sustainable Development in United States Cities. Scientific Data, 10(1):866, December 2023. ISSN 2052-4463. 10.1038/s41597-023-02576-3.
Yang et al. (2023) Ming Ying Yang, Gloria Hyunjung Kwak, Tom Pollard, Leo Anthony Celi, and Marzyeh Ghassemi. Evaluating the Impact of Social Determinants on Health Prediction in the Intensive Care Unit. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 333–350, August 2023. 10.1145/3600211.3604719.
Zhang et al. (2022) Juhao Zhang, Samantha Lin, Yunjie Wu, Jing Zhang, Alanna A Morris, Shivani A Patel, and Joyce C Ho. Abstract 15011: Deriving and Validating Novel Neighborhood Data for Investigation of Adverse Outcomes in Patients Hospitalized for Heart Failure: A Feasibility Study. Circulation, 146(Suppl_1), November 2022. ISSN 0009-7322, 1524-4539. 10.1161/circ.146.suppl_1.15011.
Zhang et al. (2024) Tianshu Zhang, Xiang Yue, Yifei Li, and Huan Sun. TableLlama: Towards Open Large Generalist Models for Tables, April 2024.
Zhang et al. (2023) Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, and Lidong Bing. Sentiment Analysis in the Era of Large Language Models: A Reality Check, May 2023.
Zhu et al. (2023) Yiming Zhu, Peixian Zhang, Ehsan-Ul Haq, Pan Hui, and Gareth Tyson. Can ChatGPT Reproduce Human-Generated Labels? A Study of Social Computing Tasks, April 2023.

Appendix A

Table 5: Summary of clinical and SDOH feature sets used to train ML models for heart failure (HF) 30-day readmission prediction. The dimension of each subset of features is specified by d. AHRQ: ground truth SDOH domain annotations for each variable were pulled from the AHRQ database. NaNDA: we manually annotated SDOH domains of each variable with 3 annotators.

Feature Set	# Variables	Variable Names	Personalization of Measure
Clinical	36	Demographics (d=8): Current age, Private insurance (0/1), Medicare (0/1), Medicaid (0/1), Male (0/1), Female (0/1), White (0/1), Black (0/1), Charlson Comorbidity Score (d=22): Myocardial infarction, Congestive heart failure, Peripheral vascular disease, Cerebrovascular disease, Dementia, Chronic pulmonary disease, Connective tissue disease (e.g., rheumatoid arthritis), Peptic ulcer disease, Liver disease (moderate), Diabetes without end-organ damage, Hemiplegia, Renal disease (moderate/severe), Malignant tumor, Liver disease (severe), Metastatic solid tumor, AIDS, Atrial fibrillation, Hypertension, Coronary artery disease, Chronic kidney disease, Dmany Vital signs (d=3): Systolic BP, Heart rate, Respiratory rate, HF classification (d=3): Systolic heart failure status (0/1), Diastolic heart failure status (0/1), Other heart failure classification status (0/1)	Individual level from index hospital visit
AHRQ	506	Domain 1 (d=129), Domain 2 (d=224) — not used in our HF study, Domain 3 (d=10), Domain 4 (d=94), Domain 5 (d=49) — not used in our HF study. Full list here	Census-tract level
NaNDA	223	Domain 1 (d=1), Domain 2 (d=0), Domain 3 (d=157), Domain 4 (d=65), Domain 5 (d=0) Full list here	Census-tract level

All SDOH variables used along with descriptions and annotations from AHRQ and NaNDA (our manual NaNDA annotations) can be found here with appropriate attribution: Variables Used: AHRQ and NaNDA

Table 6: Breakdown of NaNDA Datasets: 28 datasets are included in the NaNDA collection. Bolded datasets were studied in this work. 7 datasets were selectively included to supplement SDOH domains lacking in AHRQ, namely Education Access and Quality & Neighborhood and Built Environment.

NaNDA Dataset	Time Period	Geographic Unit
Personal Care Services and Laundromats	2003-2017 (2003-2015, 2017)	Census Tract
Broadband Internet Availability, Speed, and Adoption	2014-2020	Census Tract, ZIP Code Tabulation Area
Crimes	2002-2014	County
Dollar Stores	2003-2017	Census Tract
Eating and Drinking Places	2003-2017	Census Tract
Education and Training Services	2003-2017 (2003-2015, 2017)	Census Tract
Grocery Stores	2003-2017	Census Tract
Health Care Services	2003-2017	Census Tract
Internet Access	2015-2019	Census Tract, ZIP Code Tabulation Area
Land Cover	2001-2016	Census Tract, ZIP Code Tabulation Area
Law Enforcement Organizations	2003-2017	Census Tract
Liquor, Tobacco, and Convenience Stores	2003-2017	Census Tract
Neighborhood-School Gap	2009-2010, 2015-2016	Census Tract, ZIP Code Tabulation Area
Parks	2018, 2022	Census Tract, ZIP Code Tabulation Area
Polluting Sites	1987-2021	Census Tract, ZIP Code Tabulation Area
Post Offices and Banks	2003-2017	Census Tract
Primary and Secondary Roads	2010	Census Tract, ZIP Code Tabulation Area
Public Transit Stops	2016-2018	Census Tract, ZIP Code Tabulation Area
Religious, Civic, and Social Organizations	2003-2017	Census Tract
Retail Establishments	2003-2017	Census Tract
School District Characteristics and School Counts	2000-2018	Census Tract, ZCTA, School District
Social Services	2003-2017	Census Tract
Socioeconomic Status and Demographic Characteristics	2000-2020	Census Tract, ZIP Code Tabulation Area
Street Connectivity	2010, 2020	Census Tract, ZIP Code Tabulation Area
Traffic Volume	1963-2019	Census Tract, ZIP Code Tabulation Area
Urbanicity	2010	Census Tract
Voter Registration, Turnout, and Partisanship	2004-2018	County
Weather	2003-2016	County

Table 7: Breakdown of Census-tract AHRQ Data: 11 census-tract data sources are included in the AHRQ SDOHD collection. The 8 bolded datasets were studied in this work based on having at least 1 year of data from 2010-2018. Note that this summary only covers time periods from 2010-2018, and these datasets may be available for other years. Variables from each data source range over varying sets of years. Details on all 44 datasets available at the county or ZIP code level are available in the AHRQ documentation (Agency for Healthcare Research and Quality (2022), AHRQ).

AHRQ Dataset	Time Period	SDOH Domains
American Community Survey (ACS)	Mixed	Domains 1-5 (All)
CDC Social Vulnerability Index (CDCSVI)	2014, 2016, 2018	Domain 1
Opportunity Atlas (OPP)	2010, 2015	Domain 2
Food Access Research Atlas (FARA)	2010, 2015	Domain 4
Washington University Saint Louis - Atmospheric Composition Analysis Group (WUSTL)	2010-2018	Domain 4
Centers for Disease Control and Prevention Population Level Analysis and Community Estimates: Local Data for Better Health (CDCP)	2017, 2018	Domain 5
Health Resources and Services Administration (HRSA) Medically Underserved Areas (MUA)	2010-2018	Domain 5
Centers for Medicare and Medicaid Provider of Services File (POS)	2010-2018	Domain 5
Community Resilience Estimates (CRE)	2019	Domain 1
Environmental Protection Agency: National Walkability Index (EPAN)	2019	Domain 4
Homeland Infrastructure Foundation Level Data (HIFLD)	2020	Domain 5

Table 8: SDOH Domains and Topics according to Healthy People 2030 (U.S. Department of Health and Human Services). Table adapted from AHRQ SDOHD, which adopts this framework for variable annotation (Agency for Healthcare Research and Quality (2022) (AHRQ)).

SDOH Domain	SDOH Topic
1. Social and Community Context	Demographics
	Living conditions
	Disability
	Immigration
	Socioeconomic disadvantage indices
	Segregation
2. Economic Stability	Income
	Employment
	Poverty
3. Education Access and Quality	Education Attainment
	School system
	Educational funding
	Literacy
	Numeracy
4. Neighborhood and Built Environment	Housing
	Transport
	Migration
	Internet connectivity
	Environment
	Industry composition
	Social services
	Food access
	Access to exercise
	Crime
5. Healthcare Access and Quality	Health insurance status
	Characteristics of health care providers
	Characteristics of health care facilities
	Distance to provider
	Utilization and costs
	Health behaviors
	Health outcomes
	Health care quality

A.1 1-shot LLM SDOH Variable Classification

Table 9: Classifying NaNDA variables: 1-shot LLM performance (Micro F1) across prompt context on SDOH domain multiclass (5 domains + ‘?’) text classification task. Features: A: variable name, B: variable description, C: dataset(s) from which variable was originally extracted.

Model	A	B	C	AB	AC	BC	ABC
Llama-2 7B-chat	0.354	0.619	0.076	0.085	0.027	0.215	0.054
Llama-2 13B-chat	0.090	0.381	0.009	0.036	0.076	0.099	0.117
Llama-2 70B-chat	0.229	0.502	0.238	0.489	0.233	0.345	0.283
Gemma 2B-it	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Gemma 7B-it	0.004	0.004	0.000	0.004	0.000	0.004	0.004
Mistral 7B v0.1 Instruct	0.004	0.022	0.000	0.018	0.004	0.067	0.018
Mistral 7B v0.2 Instruct	0.072	0.363	0.009	0.309	0.009	0.175	0.103
Flan-T5-XL (3B)	0.054	0.529	0.166	0.170	0.166	0.448	0.233
Flan-T5-XXL (11B)	0.130	0.767	0.269	0.825	0.260	0.713	0.686

Table 10: Classifying AHRQ variables: 1-shot LLM performance (Micro F1) across prompt context on SDOH domain multiclass (5 domains + ‘?’) text classification task. Features: A: variable name, B: variable description, C: dataset(s) from which variable was originally extracted.

Model	A	B	C	AB	AC	BC	ABC
Llama-2 7B-chat	0.562	0.560	0.295	0.383	0.582	0.626	0.359
Llama-2 13B-chat	0.536	0.730	0.590	0.628	0.483	0.593	0.675
Llama-2 70B-chat	0.690	0.770	0.626	0.768	0.718	0.836	0.772
Gemma 2B-it	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Gemma 7B-it	0.397	0.483	0.572	0.259	0.165	0.467	0.255
Mistral 7B v0.1 Instruct	0.417	0.465	0.578	0.425	0.319	0.512	0.389
Mistral 7B v0.2 Instruct	0.590	0.742	0.391	0.778	0.578	0.718	0.754
Flan-T5-XL (3B)	0.277	0.746	0.289	0.445	0.311	0.802	0.481
Flan-T5-XXL (11B)	0.700	0.856	0.634	0.762	0.684	0.898	0.916
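
As a reading aid for Tables 9 and 10, the following is a minimal sketch of micro-averaged F1 over the six-way label space (five domains plus ‘?’). The function name `micro_f1` and the toy labels are illustrative assumptions, not the study's evaluation code. How non-responses were scored is not specified here, so the sketch simply treats any prediction outside the label set as a miss for the true class.

```python
from typing import Sequence

# Five SDOH domains plus the refusal symbol, encoded as strings.
LABELS = ("1", "2", "3", "4", "5", "?")


def micro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Micro-averaged F1: pool TP, FP, and FN over all labels, then compute F1 once."""
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1  # correct prediction of this label
            elif pred == label:
                fp += 1  # predicted this label, but the truth differs
            elif truth == label:
                fn += 1  # missed an item that truly has this label
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example (hypothetical labels): three of four predictions are correct.
gold = ["1", "2", "3", "4"]
pred = ["1", "2", "4", "4"]
score = micro_f1(gold, pred)
```

When every prediction falls inside the label set, each error adds one false positive and one false negative, so micro F1 coincides with plain accuracy.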

Appendix B

B.1 Prompting strategies

Table 11: Hyperparameters used for the LLM text classification.

Parameter	Value
Temperature	0.01
Top P	0.9
Max Tokens (to generate)	128

Table 12: BERT (bert-base-cased) one-shot classification results for assigning AHRQ SDOH variables (n=501) to 1 of the 5 SDOH domains. Max_sequence_length = 512, train_batch_size = 1, train_epochs = 1. Note: BERT uses WordPiece tokenization, which averages 4-5 characters/token, so even the longest feature set, ABC, fits within the max sequence length based on Table 13.

Feature set	Micro F1
A	0.19
B	0.26
C	0.10
AB	0.25
AC	0.02
BC	0.26
ABC	0.14

All LLMs were loaded in 16-bit (half) precision. Tokenizers applied right padding to the prompts. Two NVIDIA H100 GPUs were used to load LLMs and perform inference.

B.1.1 Zero-shot prompt template

You are a researcher tasked with annotating a variable describing social determinants of health (SDOH). This is a multi-class classification task. Respond with a number (1-5) or ""?"" if unsure. Give a one number response. The five domain options are:

(1) Social and Community Context,

(2) Economic Stability,

(3) Education Access and Quality,

(4) Neighborhood and Built Environment,

(5) Healthcare Access and Quality.

[INST] Variable name: \"{variable_name}\".

Variable description: \"{variable_label}\".

Data source: \"{data_source}\".

Domain:[/INST]
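
To illustrate how the template above could be filled in for the metadata ablation (feature sets A, B, C and their combinations), here is a minimal sketch. The names `HEADER`, `FIELD_LINES`, and `build_zero_shot_prompt` are hypothetical, plain quotes are used in place of the escaped quotes shown in the template, and the assumption that ablated prompts simply omit the unused lines is ours rather than a documented detail of the study.

```python
HEADER = (
    "You are a researcher tasked with annotating a variable describing social "
    "determinants of health (SDOH). This is a multi-class classification task. "
    'Respond with a number (1-5) or "?" if unsure. Give a one number response. '
    "The five domain options are:\n"
    "(1) Social and Community Context,\n"
    "(2) Economic Stability,\n"
    "(3) Education Access and Quality,\n"
    "(4) Neighborhood and Built Environment,\n"
    "(5) Healthcare Access and Quality.\n"
)

# Feature letter -> line template (A: name, B: description, C: data source).
FIELD_LINES = {
    "A": 'Variable name: "{variable_name}".',
    "B": 'Variable description: "{variable_label}".',
    "C": 'Data source: "{data_source}".',
}


def build_zero_shot_prompt(variable_name: str, variable_label: str,
                           data_source: str, features: str = "ABC") -> str:
    """Build a zero-shot prompt containing only the requested metadata lines."""
    if not features or any(f not in FIELD_LINES for f in features):
        raise ValueError("features must be a non-empty combination of A, B, C")
    values = {
        "variable_name": variable_name,
        "variable_label": variable_label,
        "data_source": data_source,
    }
    # Keep the canonical A, B, C order regardless of how `features` is written.
    lines = [FIELD_LINES[f].format(**values) for f in "ABC" if f in features]
    return HEADER + "[INST] " + "\n".join(lines) + "\nDomain:[/INST]"


prompt_ab = build_zero_shot_prompt(
    "aden_491",
    "# post offices/postal service facilities per sq mile",
    "NAICS",  # placeholder source string for illustration
    features="AB",
)
```

Calling the function with `features="ABC"` yields the full prompt, while `features="AB"` drops the data-source line, as in the AB column of the result tables.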

B.1.2 Zero-shot example LLM responses

Using zero-shot prompt ABC:

"You are a researcher tasked with annotating a variable describing social determinants of health (SDOH). This is a multi-class classification task. Respond with a number (1-5) or ""?"" if unsure. Give a one number response. The five domain options are:

(1) Social and Community Context,

(2) Economic Stability,

(3) Education Access and Quality,

(4) Neighborhood and Built Environment,

(5) Healthcare Access and Quality.

[INST] Variable name: ""aden_491"".

Variable description: ""# post offices/postal service facilities per sq mile"".

Data source: ""The 2017 North American Industry Classification System (NAICS"".

Domain:[/INST]"

True label: Domain (4)

Example responses by model for the above prompt:

Llama-2 7B-chat: Based on the information provided, I would annotate the variable "aden_491" as belonging to the domain of "Neighborhood and Built Environment". Therefore, I would respond with the number (4).

Llama-2 13B-chat: Based on the variable description and data source, I would annotate "aden_491" as belonging to the domain of (2) Economic Stability. The number of post offices or postal service facilities per square mile is likely an indicator of economic activity and stability, as it reflects the availability of essential services and infrastructure in a given area.

Llama-2 70B-chat: I’m unsure which domain the variable "aden_491" belongs to. The variable description mentions post offices/postal service facilities per sq mile, which could potentially fall under the "Neighborhood and Built Environment" domain, but it could also be argued that it belongs to the "Economic Stability" domain as it relates to the number of businesses/services available in a given area. Without further context or information, I would need more clarification to accurately assign it to a specific domain.

Gemma 2B-it: **Answer: (1) Social and Community Context**

The variable describes the number of post offices or postal service facilities per square mile, which is a measure of social and community context.

Gemma 7B-it: **Response:**

The variable "aden_491" is most closely related to the domain of "Neighborhood and Built Environment".

Mistral 7B v0.1 Instruct: 1. Social and Community Context.

Mistral 7B v0.2 Instruct: (4) Neighborhood and Built Environment.

Explanation: The number of post offices or postal service facilities per square mile can be considered a measure of access to essential services and infrastructure within a neighborhood or built environment. This information can be used to assess the availability and distribution of postal services, which can impact the ability of individuals to receive mail, access healthcare and social services, and engage in economic activities. Therefore, it is most closely related to the Neighborhood and Built Environment domain of social determinants of health.

Flan-T5-XL (3B): ?

Flan-T5-XXL (11B): 4

B.1.3 One-shot prompt template

You are a researcher tasked with annotating a variable describing social determinants of health (SDOH). This is a multi-class classification task. Respond with a number (1-5) or \"?\" if unsure. Give a one number response. The five domain options are:

(1) Social and Community Context,

(2) Economic Stability,

(3) Education Access and Quality,

(4) Neighborhood and Built Environment,

(5) Healthcare Access and Quality.

Variable name: \"ACS_PCT_OTHER_FEMALE\".

Variable description: \"Percentage of population reporting some other race alone and female\".

Data source: \"American Community Survey (ACS)\".

Domain: (1)

Variable name: \"OPP_INCAR_HSP_F_HH_50PT\".

Variable description: \"Fraction incarcerated on April 1st 2010: Hispanic female child with parents from 50th household income percentile\".

Data source: \"The Opportunity Atlas (OPP)\".

Domain: (2)

Variable name: \"ACS_PCT_POSTHS_ED\".

Variable description: \"Percentage of population with any postsecondary education (ages 25 and over)\".

Data source: \"American Community Survey (ACS)\".

Domain: (3)

Variable name: \"ACS_PCT_RENTER_HU_ABOVE65\".

Variable description: \"Percentage of renter-occupied housing units occupied by householders aged 65 and above\".

Data source: \"American Community Survey (ACS)\".

Domain: (4)

Variable name: \"CDCP_KIDNEY_DISEASE_ADULT_C\".

Variable description: \"Crude prevalence of adults aged 18 years and older who report ever having been told by a doctor, nurse, or other health professional that they have kidney disease (%)\".

Data source: \"Centers for Disease Control and Prevention Population Level Analysis and Community Estimates: Local Data for Better Health (CDCP)\".

Domain: (5)

[INST] Variable name: \"{variable_name}\".

Variable description: \"{variable_label}\".

Data source: \"{data_source}\".

Domain:[/INST]

Using 1-shot prompt ABC:

(1) Social and Community Context,

(2) Economic Stability,

(3) Education Access and Quality,

(4) Neighborhood and Built Environment,

(5) Healthcare Access and Quality.

Variable name: ""ACS_PCT_OTHER_FEMALE"".

Variable description: ""Percentage of population reporting some other race alone and female"".

Data source: ""American Community Survey (ACS)"".

Domain: (1)

Variable name: ""OPP_INCAR_HSP_F_HH_50PT"".

Variable description: ""Fraction incarcerated on April 1st 2010: Hispanic female child with parents from 50th household income percentile"".

Data source: ""The Opportunity Atlas (OPP)"".

Domain: (2)

Variable name: ""ACS_PCT_POSTHS_ED"".

Variable description: ""Percentage of population with any postsecondary education (ages 25 and over)"".

Data source: ""American Community Survey (ACS)"".

Domain: (3)

Variable name: ""ACS_PCT_RENTER_HU_ABOVE65"".

Variable description: ""Percentage of renter-occupied housing units occupied by householders aged 65 and above"".

Data source: ""American Community Survey (ACS)"".

Domain: (4)

Variable name: ""CDCP_KIDNEY_DISEASE_ADULT_C_census_tractlevel"".

Variable description: ""Crude prevalence of adults aged 18 years and older who report ever having been told by a doctor, nurse, or other health professional that they have kidney disease (%)"".

Data source: ""Centers for Disease Control and Prevention Population Level Analysis and Community Estimates: Local Data for Better Health (CDCP)"".

Domain: (5)

[INST] Variable name: ""aden_491"".

Variable description: ""# post offices/postal service facilities per sq mile"".

Data source: ""The 2017 North American Industry Classification System (NAICS"".

Domain:[/INST]"

True label: Domain (4)

Llama-2 7B-chat: (4) Neighborhood and Built Environment

Llama-2 13B-chat: Based on the variable description, I would classify "aden_491" as a variable related to the Neighborhood and Built Environment domain (Domain 4). The variable describes the number of post offices or postal service facilities per square mile, which is an indicator of access to essential services and infrastructure in a neighborhood.

Llama-2 70B-chat: (4) Neighborhood and Built Environment.

The variable "aden_491" measures the number of post offices/postal service facilities per square mile, which is a characteristic of the built environment and infrastructure of a neighborhood. This variable can impact the accessibility and availability of services, including healthcare, for residents in a particular area. Therefore, it belongs to the Neighborhood and Built Environment domain.

Gemma 2B-it:

No response from Gemma 2B-it.

Gemma 7B-it: **Response:**

The variable "ACS_PCT_OTHER_FEMALE" is in the domain of "Social and Community Context".

Mistral 7B v0.1 Instruct: 1

Mistral 7B v0.2 Instruct: (4) Neighborhood and Built Environment.

Flan-T5-XL (3B): ?

Flan-T5-XXL (11B): 4

Table 13: Prompt lengths: mean number of tokens. Lengths are stratified by number of shots (0, 1) and dataset (NaNDA and AHRQ). We calculate number of tokens in a prompt by assuming that 1 token is approximately 4 characters. Here, NaNDA has 223 variables, AHRQ has 501 variables. Features: A: variable name, B: variable description, C: dataset(s) from which variable was originally extracted.

Dataset	A	B	C	AB	AC	BC	ABC
NaNDA
Zero-shot	119	128	336	137	344	354	362
One-shot	368	301	434	368	499	612	676
AHRQ
Zero-shot	125	142	123	156	137	154	169
One-shot	483	315	221	387	292	412	483
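
The character-based heuristic behind Table 13 (1 token is approximately 4 characters) can be sketched as follows. The function names `estimate_tokens` and `mean_prompt_tokens` and the example strings are hypothetical. No real tokenizer is involved, so actual token counts for any given model will differ.

```python
from typing import Iterable


def estimate_tokens(prompt: str, chars_per_token: float = 4.0) -> float:
    """Approximate token count from character length (no tokenizer involved)."""
    return len(prompt) / chars_per_token


def mean_prompt_tokens(prompts: Iterable[str], chars_per_token: float = 4.0) -> float:
    """Mean estimated token count over a collection of prompts."""
    counts = [estimate_tokens(p, chars_per_token) for p in prompts]
    if not counts:
        raise ValueError("need at least one prompt")
    return sum(counts) / len(counts)


# Hypothetical prompts of 400 and 800 characters.
example_mean = mean_prompt_tokens(["x" * 400, "y" * 800])
```

Applying `mean_prompt_tokens` to all prompts for one dataset, shot setting, and feature set would give one cell of a table laid out like Table 13.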

B.2 Refusal and non-response

Using zero-shot or one-shot prompts, we asked the LLMs to respond with a single number enclosed in parentheses, as described in B.1. If a model was unable to decide among the five domains, we asked it to respond with a question mark (‘?’); we call this refusal. Refusal rates are shown in Tables 14 and 15.

However, models sometimes did not provide responses in the specified format; we call this non-response. Non-response is formally defined here as a response other than ‘(1)’ - ‘(5)’, ‘?’, or the exact SDOH domain names specified in the prompt, as shown in B.1. Non-response rates are shown in Tables 16 and 17.

Table 14: NaNDA: Refusal rate (zero-shot, 1-shot) by model and prompt feature set. Prompts to classify NaNDA variables to an SDOH domain. Refusal is defined as a response of ‘?’ (i.e., question symbol) or a response that the model was ‘unable’ to classify to one of the five classes. See B.3 for post-processing details. Features: A: variable name, B: variable description, C: dataset(s) from which variable was originally extracted.

Model	A	B	C	AB	AC	BC	ABC
Zero-shot
Llama-2 7B-chat	0.008	0.000	0.000	0.004	0.008	0.000	0.004
Llama-2 13B-chat	0.008	0.004	0.000	0.004	0.000	0.000	0.000
Llama-2 70B-chat	0.116	0.030	0.000	0.103	0.000	0.004	0.000
Gemma 2B-it	0.000	0.860	0.165	0.040	0.224	0.165	0.165
Gemma 7B-it	0.004	0.000	0.000	0.000	0.000	0.000	0.000
Mistral 7B v0.1 Instruct	0.040	0.000	0.000	0.000	0.004	0.000	0.000
Mistral 7B v0.2 Instruct	0.640	0.017	0.165	0.067	0.193	0.035	0.143
Flan-T5-XL (3B)	0.538	0.215	0.771	0.224	0.340	0.237	0.219
Flan-T5-XXL (11B)	0.183	0.000	0.000	0.000	0.000	0.004	0.000
1-shot
Llama-2 7B-chat	0.000	0.000	0.000	0.027	0.000	0.000	0.000
Llama-2 13B-chat	0.175	0.009	0.000	0.892	0.000	0.045	0.000
Llama-2 70B-chat	0.076	0.000	0.166	0.000	0.013	0.000	0.000
Gemma 2B-it	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Gemma 7B-it	0.323	0.000	0.166	0.157	0.152	0.058	0.027
Mistral 7B v0.1 Instruct	0.004	0.000	0.000	0.000	0.004	0.000	0.054
Mistral 7B v0.2 Instruct	0.152	0.000	0.000	0.000	0.000	0.000	0.000
Flan-T5-XL (3B)	0.359	0.386	0.830	0.238	0.197	0.233	0.108
Flan-T5-XXL (11B)	0.256	0.054	0.000	0.027	0.067	0.018	0.022

Table 15: AHRQ: Refusal rate (zero-shot, 1-shot) by model and prompt feature set. Prompts to classify AHRQ variables to an SDOH domain. Refusal is defined as a response of ‘?’ (i.e., question symbol) or a response that the model was ‘unable’ to classify to one of the five classes. See B.3 for post-processing details. Features: A: variable name, B: variable description, C: dataset(s) from which variable was originally extracted.

Model	A	B	C	AB	AC	BC	ABC
Zero-shot
Llama-2 7B-chat	0.006	0.000	0.002	0.102	0.006	0.000	0.002
Llama-2 13B-chat	0.000	0.002	0.000	0.004	0.000	0.000	0.000
Llama-2 70B-chat	0.000	0.040	0.000	0.002	0.000	0.014	0.002
Gemma 2B-it	0.000	0.812	0.096	0.002	0.002	0.188	0.000
Gemma 7B-it	0.000	0.000	0.000	0.000	0.000	0.002	0.000
Mistral 7B v0.1 Instruct	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Mistral 7B v0.2 Instruct	0.144	0.128	0.002	0.126	0.088	0.150	0.118
Flan-T5-XL (3B)	0.096	0.064	0.016	0.008	0.044	0.088	0.002
Flan-T5-XXL (11B)	0.000	0.000	0.002	0.000	0.000	0.000	0.000
1-shot
Llama-2 7B-chat	0.000	0.000	0.000	0.010	0.000	0.000	0.000
Llama-2 13B-chat	0.000	0.000	0.000	0.044	0.000	0.000	0.000
Llama-2 70B-chat	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Gemma 2B-it	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Gemma 7B-it	0.026	0.000	0.000	0.000	0.669	0.010	0.000
Mistral 7B v0.1 Instruct	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Mistral 7B v0.2 Instruct	0.000	0.000	0.000	0.012	0.000	0.000	0.000
Flan-T5-XL (3B)	0.144	0.066	0.016	0.004	0.016	0.038	0.000
Flan-T5-XXL (11B)	0.004	0.042	0.002	0.000	0.006	0.042	0.000

Table 16: NaNDA: Non-response rate (zero-shot, 1-shot) by model and prompt feature set. Prompts to classify NaNDA variables to an SDOH domain. Non-response is defined as a response other than (1) - (5), ‘?’ (i.e., question symbol), or the exact SDOH domain names specified in the prompt, as shown in B.1. A low non-response rate is a necessary condition for model understanding of the prompt. Red denotes non-zero non-response rate. See B.3 for post-processing details. Features: A: variable name, B: variable description, C: dataset(s) from which variable was originally extracted.

Model	A	B	C	AB	AC	BC	ABC
Zero-shot
Llama-2 7B-chat	0.008	0.000	0.000	0.000	0.000	0.008	0.017
Llama-2 13B-chat	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Llama-2 70B-chat	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Gemma 2B-it	0.000	0.000	0.000	0.000	0.134	0.089	0.040
Gemma 7B-it	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Mistral 7B v0.1 Instruct	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Mistral 7B v0.2 Instruct	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Flan-T5-XL (3B)	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Flan-T5-XXL (11B)	0.000	0.000	0.000	0.000	0.000	0.000	0.000
1-shot
Llama-2 7B-chat	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Llama-2 13B-chat	0.000	0.000	0.000	0.000	0.000	0.013	0.000
Llama-2 70B-chat	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Gemma 2B-it	1.000	1.000	0.973	1.000	1.000	1.000	1.000
Gemma 7B-it	0.103	0.000	0.000	0.004	0.049	0.000	0.004
Mistral 7B v0.1 Instruct	0.000	0.000	0.000	0.040	0.000	0.000	0.000
Mistral 7B v0.2 Instruct	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Flan-T5-XL (3B)	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Flan-T5-XXL (11B)	0.000	0.000	0.000	0.000	0.000	0.000	0.000

Table 17: AHRQ: Non-response rate (zero-shot, 1-shot) by model and prompt feature set. Prompts to classify AHRQ variables to an SDOH domain. Non-response is defined as a response other than (1) - (5), ‘?’ (i.e., question symbol), or the exact SDOH domain names specified in the prompt, as shown in B.1. A low non-response rate is a necessary condition for model understanding of the prompt. Red denotes non-zero non-response rate. See B.3 for post-processing details. Features: A: variable name, B: variable description, C: dataset(s) from which variable was originally extracted.

Model	A	B	C	AB	AC	BC	ABC
Zero-shot
Llama-2 7B-chat	0.004	0.000	0.000	0.002	0.000	0.000	0.010
Llama-2 13B-chat	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Llama-2 70B-chat	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Gemma 2B-it	0.000	0.014	0.000	0.034	0.000	0.012	0.012
Gemma 7B-it	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Mistral 7B v0.1 Instruct	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Mistral 7B v0.2 Instruct	0.000	0.004	0.000	0.000	0.000	0.000	0.002
Flan-T5-XL (3B)	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Flan-T5-XXL (11B)	0.000	0.000	0.000	0.000	0.000	0.000	0.000
1-shot
Llama-2 7B-chat	0.000	0.000	0.000	0.002	0.000	0.000	0.000
Llama-2 13B-chat	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Llama-2 70B-chat	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Gemma 2B-it	1.000	1.000	1.000	1.000	1.000	1.000	1.000
Gemma 7B-it	0.058	0.006	0.000	0.000	0.034	0.000	0.000
Mistral 7B v0.1 Instruct	0.000	0.000	0.000	0.004	0.000	0.000	0.000
Mistral 7B v0.2 Instruct	0.000	0.000	0.000	0.004	0.000	0.002	0.002
Flan-T5-XL (3B)	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Flan-T5-XXL (11B)	0.000	0.000	0.000	0.000	0.000	0.000	0.000

B.3 LLM Labelling Pipeline Details

Post-processing was required to map the text generated by each LLM to the annotation label space (i.e., SDOH domains 1-5). This is standard practice for automatic evaluation of LLM responses. We follow the same post-processing approach for LLM responses in the 0-shot and 1-shot settings on both the NaNDA and AHRQ datasets. First, we check whether the response contains the word entailment (i.e., “Domain:”). Second, we check for refusal: if the word ‘unable’ is present in the response, we count this as refusal, which is equivalent to a question mark (i.e., ‘?’). Third, we capture annotations adhering to the prompt format: the SDOH domain included as a “lonely” integer (e.g., “1”), as an integer surrounded by parentheses (e.g., “(1)”), or as the free-text domain name (e.g., “Social and Community Context”). Fourth, the response is counted as refusal if none of the prior conditions were met and a question mark was included in the response. Finally, if none of the previous conditions were met, the response is counted as non-adherence, meaning the LLM did not appropriately follow the prompt instructions.
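
The ordering of checks described above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the released pipeline: the name `parse_response`, the regular expression, the choice to keep only the text after a “Domain:” cue, taking the first integer found, and accepting a free-text domain name only when exactly one domain is mentioned are all our assumptions.

```python
import re

DOMAINS = {
    1: "Social and Community Context",
    2: "Economic Stability",
    3: "Education Access and Quality",
    4: "Neighborhood and Built Environment",
    5: "Healthcare Access and Quality",
}
REFUSAL = "?"
NON_ADHERENCE = "non-adherence"

# "(k)" or a standalone digit k in 1-5 that is not part of a word or decimal.
_LABEL_RE = re.compile(r"\(([1-5])\)|(?<![\w.])([1-5])(?!\w|\.\d)")


def parse_response(response: str) -> str:
    """Map raw LLM text to '1'-'5', '?' (refusal), or 'non-adherence'."""
    text = response
    # Step 1 (assumed handling): keep only the text after the last "Domain:" cue.
    if "Domain:" in text:
        text = text.rsplit("Domain:", 1)[1]
    # Step 2: the word 'unable' counts as refusal.
    if "unable" in text.lower():
        return REFUSAL
    # Step 3a: integer in parentheses or a "lonely" integer (first match wins).
    match = _LABEL_RE.search(text)
    if match:
        return match.group(1) or match.group(2)
    # Step 3b: free-text domain name, accepted only if unambiguous (assumption).
    lowered = text.lower()
    mentioned = [str(k) for k, name in DOMAINS.items() if name.lower() in lowered]
    if len(mentioned) == 1:
        return mentioned[0]
    # Step 4: a question mark with no usable label counts as refusal.
    if "?" in text:
        return REFUSAL
    # Step 5: nothing matched, so the response did not follow the prompt format.
    return NON_ADHERENCE
```

The lookbehind and lookahead in the pattern keep digits embedded in identifiers such as "aden_491" from being read as labels, which matters because several models echo the variable name in their answer.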

B.4 Annotation Efficiency

We compared the total time each human annotator took to annotate the 223 NaNDA variables against the estimated time LLMs took for inference. On average, human annotators took about 46 minutes to annotate all NaNDA variables (roughly 12.5 seconds per variable). The individual human annotation times were as follows: Annotator 1: 45 min, Annotator 2: 55 min, Annotator 3: 40 min. For the largest LLM (Llama-2 70B-chat), inference for one NaNDA variable using the zero-shot prompt was estimated to take 1-5 seconds. This assumes the language model was loaded in half precision on NVIDIA H100 GPUs. So on average, even the largest LLMs annotated SDOH variables in less time than human annotators (1-5 seconds vs. 12.5 seconds). In practice, inference using smaller models like Flan-T5-XL (3B) would likely require even less time per variable than would Llama-2 70B-chat.
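
The per-variable comparison above follows from simple arithmetic, sketched here using only the figures quoted in this section (the variable names are ours). The unrounded mean is approximately 46.7 minutes, or about 12.6 seconds per variable, close to the rounded figures given above.

```python
# Reported annotation times (minutes) for the three human annotators.
annotator_minutes = [45, 55, 40]
n_variables = 223  # number of NaNDA variables

mean_minutes = sum(annotator_minutes) / len(annotator_minutes)
seconds_per_variable = mean_minutes * 60 / n_variables

# Estimated Llama-2 70B-chat inference time per variable (seconds), zero-shot.
llm_seconds_low, llm_seconds_high = 1, 5

# How many times faster the LLM is per variable, at the slow and fast ends.
speedup_min = seconds_per_variable / llm_seconds_high
speedup_max = seconds_per_variable / llm_seconds_low

print(f"mean human time: {mean_minutes:.1f} min")
print(f"human time per variable: {seconds_per_variable:.1f} s")
print(f"estimated speedup: {speedup_min:.1f}x to {speedup_max:.1f}x")
```

Under these estimates, the largest model is roughly 2.5 to 12.6 times faster per variable than the average human annotator.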

Human annotation instructions for NaNDA can be found here: Instructions for NaNDA Variable Annotation.

B.5 Reproducibility

Code for the study can be found at: https://github.com/fensorechase/LLMs_SDOH_Integration