-
Do Responsible AI Artifacts Advance Stakeholder Goals? Four Key Barriers Perceived by Legal and Civil Stakeholders
Authors:
Anna Kawakami,
Daricia Wilkinson,
Alexandra Chouldechova
Abstract:
The responsible AI (RAI) community has introduced numerous processes and artifacts (e.g., Model Cards, Transparency Notes, Data Cards) to facilitate transparency and support the governance of AI systems. While originally designed to scaffold and document AI development processes in technology companies, these artifacts are becoming central components of regulatory compliance under recent regulatio…
▽ More
The responsible AI (RAI) community has introduced numerous processes and artifacts (e.g., Model Cards, Transparency Notes, Data Cards) to facilitate transparency and support the governance of AI systems. While originally designed to scaffold and document AI development processes in technology companies, these artifacts are becoming central components of regulatory compliance under recent regulations such as the EU AI Act. Much prior work has explored the design of new RAI artifacts or their use by practitioners within technology companies. However, as RAI artifacts begin to play key roles in enabling external oversight, it becomes critical to understand how stakeholders--particularly those situated outside of technology companies who govern and audit industry AI deployments--perceive the efficacy of RAI artifacts. In this study, we conduct semi-structured interviews and design activities with 19 government, legal, and civil society stakeholders who inform policy and advocacy around responsible AI efforts. While participants believe that RAI artifacts are a valuable contribution to the broader AI governance ecosystem, many are concerned about their potential unintended, longer-term impacts on actors outside of technology companies (e.g., downstream end-users, policymakers, civil society stakeholders). We organize these beliefs into four barriers that help explain how RAI artifacts may (inadvertently) reconfigure power relations across civil society, government, and industry, impeding civil society and legal stakeholders' ability to protect downstream end-users from potential AI harms. Participants envision how structural changes, along with changes in how RAI artifacts are designed, used, and governed, could help redirect the role of artifacts to support more collaborative and proactive external oversight of AI systems. We discuss research and policy implications for RAI artifacts.
△ Less
Submitted 21 August, 2024;
originally announced August 2024.
-
Algorithm-Assisted Decision Making and Racial Disparities in Housing: A Study of the Allegheny Housing Assessment Tool
Authors:
Lingwei Cheng,
Cameron Drayton,
Alexandra Chouldechova,
Rhema Vaithianathan
Abstract:
The demand for housing assistance across the United States far exceeds the supply, leaving housing providers the task of prioritizing clients for receipt of this limited resource. To be eligible for federal funding, local homelessness systems are required to implement assessment tools as part of their prioritization processes. The Vulnerability Index Service Prioritization Decision Assistance Tool…
▽ More
The demand for housing assistance across the United States far exceeds the supply, leaving housing providers the task of prioritizing clients for receipt of this limited resource. To be eligible for federal funding, local homelessness systems are required to implement assessment tools as part of their prioritization processes. The Vulnerability Index Service Prioritization Decision Assistance Tool (VI-SPDAT) is the most commonly used assessment tool nationwide. Recent studies have criticized the VI-SPDAT as exhibiting racial bias, which may lead to unwarranted racial disparities in housing provision. In response to these criticisms, some jurisdictions have developed alternative tools, such as the Allegheny Housing Assessment (AHA), which uses algorithms to assess clients' risk levels. Drawing on data from its deployment, we conduct descriptive and quantitative analyses to evaluate whether replacing the VI-SPDAT with the AHA affects racial disparities in housing allocation. We find that the VI-SPDAT tended to assign higher risk scores to white clients and lower risk scores to Black clients, and that white clients were served at a higher rates pre-AHA deployment. While post-deployment service decisions became better aligned with the AHA score, and the distribution of AHA scores is similar across racial groups, we do not find evidence of a corresponding decrease in disparities in service rates. We attribute the persistent disparity to the use of Alt-AHA, a survey-based tool that is used in cases of low data quality, as well as group differences in eligibility-related factors, such as chronic homelessness and veteran status. We discuss the implications for housing service systems seeking to reduce racial disparities in their service delivery.
△ Less
Submitted 27 August, 2024; v1 submitted 30 July, 2024;
originally announced July 2024.
-
A structured regression approach for evaluating model performance across intersectional subgroups
Authors:
Christine Herlihy,
Kimberly Truong,
Alexandra Chouldechova,
Miroslav Dudik
Abstract:
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups defined by combinations of demographic or other sensitive attributes. The standard approach is to stratify the evaluation data across subgroups and compute performance metrics separately for each group. However, even for moderately-sized evaluatio…
▽ More
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups defined by combinations of demographic or other sensitive attributes. The standard approach is to stratify the evaluation data across subgroups and compute performance metrics separately for each group. However, even for moderately-sized evaluation datasets, sample sizes quickly get small once considering intersectional subgroups, which greatly limits the extent to which intersectional groups are included in analysis. In this work, we introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups. We provide corresponding inference strategies for constructing confidence intervals and explore how goodness-of-fit testing can yield insight into the structure of fairness-related harms experienced by intersectional groups. We evaluate our approach on two publicly available datasets, and several variants of semi-synthetic data. The results show that our method is considerably more accurate than the standard approach, especially for small subgroups, and demonstrate how goodness-of-fit testing helps identify the key factors that drive differences in performance.
△ Less
Submitted 14 May, 2024; v1 submitted 26 January, 2024;
originally announced January 2024.
-
The Impact of Differential Feature Under-reporting on Algorithmic Fairness
Authors:
Nil-Jana Akpinar,
Zachary C. Lipton,
Alexandra Chouldechova
Abstract:
Predictive risk models in the public sector are commonly developed using administrative data that is more complete for subpopulations that more greatly rely on public services. In the United States, for instance, information on health care utilization is routinely available to government agencies for individuals supported by Medicaid and Medicare, but not for the privately insured. Critiques of pu…
▽ More
Predictive risk models in the public sector are commonly developed using administrative data that is more complete for subpopulations that more greatly rely on public services. In the United States, for instance, information on health care utilization is routinely available to government agencies for individuals supported by Medicaid and Medicare, but not for the privately insured. Critiques of public sector algorithms have identified such differential feature under-reporting as a driver of disparities in algorithmic decision-making. Yet this form of data bias remains understudied from a technical viewpoint. While prior work has examined the fairness impacts of additive feature noise and features that are clearly marked as missing, the setting of data missingness absent indicators (i.e. differential feature under-reporting) has been lacking in research attention. In this work, we present an analytically tractable model of differential feature under-reporting which we then use to characterize the impact of this kind of data bias on algorithmic fairness. We demonstrate how standard missing data methods typically fail to mitigate bias in this setting, and propose a new set of methods specifically tailored to differential feature under-reporting. Our results show that, in real world data settings, under-reporting typically leads to increasing disparities. The proposed solution methods show success in mitigating increases in unfairness.
△ Less
Submitted 3 May, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
Multi-Target Multiplicity: Flexibility and Fairness in Target Specification under Resource Constraints
Authors:
Jamelle Watson-Daniels,
Solon Barocas,
Jake M. Hofman,
Alexandra Chouldechova
Abstract:
Prediction models have been widely adopted as the basis for decision-making in domains as diverse as employment, education, lending, and health. Yet, few real world problems readily present themselves as precisely formulated prediction tasks. In particular, there are often many reasonable target variable options. Prior work has argued that this is an important and sometimes underappreciated choice…
▽ More
Prediction models have been widely adopted as the basis for decision-making in domains as diverse as employment, education, lending, and health. Yet, few real world problems readily present themselves as precisely formulated prediction tasks. In particular, there are often many reasonable target variable options. Prior work has argued that this is an important and sometimes underappreciated choice, and has also shown that target choice can have a significant impact on the fairness of the resulting model. However, the existing literature does not offer a formal framework for characterizing the extent to which target choice matters in a particular task. Our work fills this gap by drawing connections between the problem of target choice and recent work on predictive multiplicity. Specifically, we introduce a conceptual and computational framework for assessing how the choice of target affects individuals' outcomes and selection rate disparities across groups. We call this multi-target multiplicity. Along the way, we refine the study of single-target multiplicity by introducing notions of multiplicity that respect resource constraints -- a feature of many real-world tasks that is not captured by existing notions of predictive multiplicity. We apply our methods on a healthcare dataset, and show that the level of multiplicity that stems from target variable choice can be greater than that stemming from nearly-optimal models of a single target.
△ Less
Submitted 23 June, 2023;
originally announced June 2023.
-
Examining risks of racial biases in NLP tools for child protective services
Authors:
Anjalie Field,
Amanda Coston,
Nupoor Gandhi,
Alexandra Chouldechova,
Emily Putnam-Hornstein,
David Steier,
Yulia Tsvetkov
Abstract:
Although much literature has established the presence of demographic bias in natural language processing (NLP) models, most work relies on curated bias metrics that may not be reflective of real-world applications. At the same time, practitioners are increasingly using algorithmic tools in high-stakes settings, with particular recent interest in NLP. In this work, we focus on one such setting: chi…
▽ More
Although much literature has established the presence of demographic bias in natural language processing (NLP) models, most work relies on curated bias metrics that may not be reflective of real-world applications. At the same time, practitioners are increasingly using algorithmic tools in high-stakes settings, with particular recent interest in NLP. In this work, we focus on one such setting: child protective services (CPS). CPS workers often write copious free-form text notes about families they are working with, and CPS agencies are actively seeking to deploy NLP models to leverage these data. Given well-established racial bias in this setting, we investigate possible ways deployed NLP is liable to increase racial disparities. We specifically examine word statistics within notes and algorithmic fairness in risk prediction, coreference resolution, and named entity recognition (NER). We document consistent algorithmic unfairness in NER models, possible algorithmic unfairness in coreference resolution models, and little evidence of exacerbated racial bias in risk prediction. While there is existing pronounced criticism of risk prediction, our results expose previously undocumented risks of racial bias in realistic information extraction systems, highlighting potential concerns in deploying them, even though they may appear more benign. Our work serves as a rare realistic examination of NLP algorithmic fairness in a potential deployed setting and a timely investigation of a specific risk associated with deploying NLP in CPS settings.
△ Less
Submitted 30 May, 2023;
originally announced May 2023.
-
Overcoming Algorithm Aversion: A Comparison between Process and Outcome Control
Authors:
Lingwei Cheng,
Alexandra Chouldechova
Abstract:
Algorithm aversion occurs when humans are reluctant to use algorithms despite their superior performance. Studies show that giving users outcome control by providing agency over how models' predictions are incorporated into decision-making mitigates algorithm aversion. We study whether algorithm aversion is mitigated by process control, wherein users can decide what input factors and algorithms to…
▽ More
Algorithm aversion occurs when humans are reluctant to use algorithms despite their superior performance. Studies show that giving users outcome control by providing agency over how models' predictions are incorporated into decision-making mitigates algorithm aversion. We study whether algorithm aversion is mitigated by process control, wherein users can decide what input factors and algorithms to use in model training. We conduct a replication study of outcome control, and test novel process control study conditions on Amazon Mechanical Turk (MTurk) and Prolific. Our results partly confirm prior findings on the mitigating effects of outcome control, while also forefronting reproducibility challenges. We find that process control in the form of choosing the training algorithm mitigates algorithm aversion, but changing inputs does not. Furthermore, giving users both outcome and process control does not reduce algorithm aversion more than outcome or process control alone. This study contributes to design considerations around mitigating algorithm aversion.
△ Less
Submitted 22 March, 2023;
originally announced March 2023.
-
Algorithmic Fairness and Vertical Equity: Income Fairness with IRS Tax Audit Models
Authors:
Emily Black,
Hadi Elzayn,
Alexandra Chouldechova,
Jacob Goldin,
Daniel E. Ho
Abstract:
This study examines issues of algorithmic fairness in the context of systems that inform tax audit selection by the United States Internal Revenue Service (IRS). While the field of algorithmic fairness has developed primarily around notions of treating like individuals alike, we instead explore the concept of vertical equity -- appropriately accounting for relevant differences across individuals -…
▽ More
This study examines issues of algorithmic fairness in the context of systems that inform tax audit selection by the United States Internal Revenue Service (IRS). While the field of algorithmic fairness has developed primarily around notions of treating like individuals alike, we instead explore the concept of vertical equity -- appropriately accounting for relevant differences across individuals -- which is a central component of fairness in many public policy settings. Applied to the design of the U.S. individual income tax system, vertical equity relates to the fair allocation of tax and enforcement burdens across taxpayers of different income levels. Through a unique collaboration with the Treasury Department and IRS, we use access to anonymized individual taxpayer microdata, risk-selected audits, and random audits from 2010-14 to study vertical equity in tax administration. In particular, we assess how the use of modern machine learning methods for selecting audits may affect vertical equity. First, we show how the use of more flexible machine learning (classification) methods -- as opposed to simpler models -- shifts audit burdens from high to middle-income taxpayers. Second, we show that while existing algorithmic fairness techniques can mitigate some disparities across income, they can incur a steep cost to performance. Third, we show that the choice of whether to treat risk of underreporting as a classification or regression problem is highly consequential. Moving from classification to regression models to predict underreporting shifts audit burden substantially toward high income individuals, while increasing revenue. Last, we explore the role of differential audit cost in shaping the audit distribution. We show that a narrow focus on return-on-investment can undermine vertical equity. Our results have implications for the design of algorithmic tools across the public sector.
△ Less
Submitted 20 June, 2022;
originally announced June 2022.
-
Imagining new futures beyond predictive systems in child welfare: A qualitative study with impacted stakeholders
Authors:
Logan Stapleton,
Min Hun Lee,
Diana Qing,
Marya Wright,
Alexandra Chouldechova,
Kenneth Holstein,
Zhiwei Steven Wu,
Haiyi Zhu
Abstract:
Child welfare agencies across the United States are turning to data-driven predictive technologies (commonly called predictive analytics) which use government administrative data to assist workers' decision-making. While some prior work has explored impacted stakeholders' concerns with current uses of data-driven predictive risk models (PRMs), less work has asked stakeholders whether such tools ou…
▽ More
Child welfare agencies across the United States are turning to data-driven predictive technologies (commonly called predictive analytics) which use government administrative data to assist workers' decision-making. While some prior work has explored impacted stakeholders' concerns with current uses of data-driven predictive risk models (PRMs), less work has asked stakeholders whether such tools ought to be used in the first place. In this work, we conducted a set of seven design workshops with 35 stakeholders who have been impacted by the child welfare system or who work in it to understand their beliefs and concerns around PRMs, and to engage them in imagining new uses of data and technologies in the child welfare system. We found that participants worried current PRMs perpetuate or exacerbate existing problems in child welfare. Participants suggested new ways to use data and data-driven tools to better support impacted communities and suggested paths to mitigate possible harms of these tools. Participants also suggested low-tech or no-tech alternatives to PRMs to address problems in child welfare. Our study sheds light on how researchers and designers can work in solidarity with impacted communities, possibly to circumvent or oppose child welfare agencies.
△ Less
Submitted 18 May, 2022;
originally announced May 2022.
-
Doubting AI Predictions: Influence-Driven Second Opinion Recommendation
Authors:
Maria De-Arteaga,
Alexandra Chouldechova,
Artur Dubrawski
Abstract:
Effective human-AI collaboration requires a system design that provides humans with meaningful ways to make sense of and critically evaluate algorithmic recommendations. In this paper, we propose a way to augment human-AI collaboration by building on a common organizational practice: identifying experts who are likely to provide complementary opinions. When machine learning algorithms are trained…
▽ More
Effective human-AI collaboration requires a system design that provides humans with meaningful ways to make sense of and critically evaluate algorithmic recommendations. In this paper, we propose a way to augment human-AI collaboration by building on a common organizational practice: identifying experts who are likely to provide complementary opinions. When machine learning algorithms are trained to predict human-generated assessments, experts' rich multitude of perspectives is frequently lost in monolithic algorithmic recommendations. The proposed approach aims to leverage productive disagreement by (1) identifying whether some experts are likely to disagree with an algorithmic assessment and, if so, (2) recommend an expert to request a second opinion from.
△ Less
Submitted 29 April, 2022;
originally announced May 2022.
-
Heterogeneity in Algorithm-Assisted Decision-Making: A Case Study in Child Abuse Hotline Screening
Authors:
Lingwei Cheng,
Alexandra Chouldechova
Abstract:
Algorithmic risk assessment tools are now commonplace in public sector domains such as criminal justice and human services. These tools are intended to aid decision makers in systematically using rich and complex data captured in administrative systems. In this study we investigate sources of heterogeneity in the alignment between worker decisions and algorithmic risk scores in the context of a re…
▽ More
Algorithmic risk assessment tools are now commonplace in public sector domains such as criminal justice and human services. These tools are intended to aid decision makers in systematically using rich and complex data captured in administrative systems. In this study we investigate sources of heterogeneity in the alignment between worker decisions and algorithmic risk scores in the context of a real world child abuse hotline screening use case. Specifically, we focus on heterogeneity related to worker experience. We find that senior workers are far more likely to screen in referrals for investigation, even after we control for the observed algorithmic risk score and other case characteristics. We also observe that the decisions of less-experienced workers are more closely aligned with algorithmic risk scores than those of senior workers who had decision-making experience prior to the tool being introduced. While screening decisions vary across child race, we do not find evidence of racial differences in the relationship between worker experience and screening decisions. Our findings indicate that it is important for agencies and system designers to consider ways of preserving institutional knowledge when introducing algorithms into high employee turnover settings such as child welfare call screening.
△ Less
Submitted 11 April, 2022;
originally announced April 2022.
-
Human-Algorithm Collaboration: Achieving Complementarity and Avoiding Unfairness
Authors:
Kate Donahue,
Alexandra Chouldechova,
Krishnaram Kenthapadi
Abstract:
Much of machine learning research focuses on predictive accuracy: given a task, create a machine learning model (or algorithm) that maximizes accuracy. In many settings, however, the final prediction or decision of a system is under the control of a human, who uses an algorithm's output along with their own personal expertise in order to produce a combined prediction. One ultimate goal of such col…
▽ More
Much of machine learning research focuses on predictive accuracy: given a task, create a machine learning model (or algorithm) that maximizes accuracy. In many settings, however, the final prediction or decision of a system is under the control of a human, who uses an algorithm's output along with their own personal expertise in order to produce a combined prediction. One ultimate goal of such collaborative systems is "complementarity": that is, to produce lower loss (equivalently, greater payoff or utility) than either the human or algorithm alone. However, experimental results have shown that even in carefully-designed systems, complementary performance can be elusive. Our work provides three key contributions. First, we provide a theoretical framework for modeling simple human-algorithm systems and demonstrate that multiple prior analyses can be expressed within it. Next, we use this model to prove conditions where complementarity is impossible, and give constructive examples of where complementarity is achievable. Finally, we discuss the implications of our findings, especially with respect to the fairness of a classifier. In sum, these results deepen our understanding of key factors influencing the combined performance of human-algorithm systems, giving insight into how algorithmic tools can best be designed for collaborative environments.
△ Less
Submitted 1 June, 2022; v1 submitted 17 February, 2022;
originally announced February 2022.
-
The Impact of Algorithmic Risk Assessments on Human Predictions and its Analysis via Crowdsourcing Studies
Authors:
Riccardo Fogliato,
Alexandra Chouldechova,
Zachary Lipton
Abstract:
As algorithmic risk assessment instruments (RAIs) are increasingly adopted to assist decision makers, their predictive performance and potential to promote inequity have come under scrutiny. However, while most studies examine these tools in isolation, researchers have come to recognize that assessing their impact requires understanding the behavior of their human interactants. In this paper, buil…
▽ More
As algorithmic risk assessment instruments (RAIs) are increasingly adopted to assist decision makers, their predictive performance and potential to promote inequity have come under scrutiny. However, while most studies examine these tools in isolation, researchers have come to recognize that assessing their impact requires understanding the behavior of their human interactants. In this paper, building off of several recent crowdsourcing works focused on criminal justice, we conduct a vignette study in which laypersons are tasked with predicting future re-arrests. Our key findings are as follows: (1) Participants often predict that an offender will be rearrested even when they deem the likelihood of re-arrest to be well below 50%; (2) Participants do not anchor on the RAI's predictions; (3) The time spent on the survey varies widely across participants and most cases are assessed in less than 10 seconds; (4) Judicial decisions, unlike participants' predictions, depend in part on factors that are orthogonal to the likelihood of re-arrest. These results highlight the influence of several crucial but often overlooked design decisions and concerns around generalizability when constructing crowdsourcing studies to analyze the impacts of RAIs.
△ Less
Submitted 3 September, 2021;
originally announced September 2021.
-
Soliciting Stakeholders' Fairness Notions in Child Maltreatment Predictive Systems
Authors:
Hao-Fei Cheng,
Logan Stapleton,
Ruiqi Wang,
Paige Bullock,
Alexandra Chouldechova,
Zhiwei Steven Wu,
Haiyi Zhu
Abstract:
Recent work in fair machine learning has proposed dozens of technical definitions of algorithmic fairness and methods for enforcing these definitions. However, we still lack an understanding of how to develop machine learning systems with fairness criteria that reflect relevant stakeholders' nuanced viewpoints in real-world contexts. To address this gap, we propose a framework for eliciting stakeh…
▽ More
Recent work in fair machine learning has proposed dozens of technical definitions of algorithmic fairness and methods for enforcing these definitions. However, we still lack an understanding of how to develop machine learning systems with fairness criteria that reflect relevant stakeholders' nuanced viewpoints in real-world contexts. To address this gap, we propose a framework for eliciting stakeholders' subjective fairness notions. Combining a user interface that allows stakeholders to examine the data and the algorithm's predictions with an interview protocol to probe stakeholders' thoughts while they are interacting with the interface, we can identify stakeholders' fairness beliefs and principles. We conduct a user study to evaluate our framework in the setting of a child maltreatment predictive system. Our evaluations show that the framework allows stakeholders to comprehensively convey their fairness viewpoints. We also discuss how our results can inform the design of predictive systems.
△ Less
Submitted 1 February, 2021;
originally announced February 2021.
-
The effect of differential victim crime reporting on predictive policing systems
Authors:
Nil-Jana Akpinar,
Maria De-Arteaga,
Alexandra Chouldechova
Abstract:
Police departments around the world have been experimenting with forms of place-based data-driven proactive policing for over two decades. Modern incarnations of such systems are commonly known as hot spot predictive policing. These systems predict where future crime is likely to concentrate such that police can allocate patrols to these areas and deter crime before it occurs. Previous research on…
▽ More
Police departments around the world have been experimenting with forms of place-based data-driven proactive policing for over two decades. Modern incarnations of such systems are commonly known as hot spot predictive policing. These systems predict where future crime is likely to concentrate such that police can allocate patrols to these areas and deter crime before it occurs. Previous research on fairness in predictive policing has concentrated on the feedback loops which occur when models are trained on discovered crime data, but has limited implications for models trained on victim crime reporting data. We demonstrate how differential victim crime reporting rates across geographical areas can lead to outcome disparities in common crime hot spot prediction models. Our analysis is based on a simulation patterned after district-level victimization and crime reporting survey data for Bogotá, Colombia. Our results suggest that differential crime reporting rates can lead to a displacement of predicted hotspots from high crime but low reporting areas to high or medium crime and high reporting areas. This may lead to misallocations both in the form of over-policing and under-policing.
△ Less
Submitted 4 February, 2021; v1 submitted 29 January, 2021;
originally announced February 2021.
-
Leveraging Expert Consistency to Improve Algorithmic Decision Support
Authors:
Maria De-Arteaga,
Vincent Jeanselme,
Artur Dubrawski,
Alexandra Chouldechova
Abstract:
Machine learning (ML) is increasingly being used to support high-stakes decisions. However, there is frequently a construct gap: a gap between the construct of interest to the decision-making task and what is captured in proxies used as labels to train ML models. As a result, ML models may fail to capture important dimensions of decision criteria, hampering their utility for decision support. Thus…
▽ More
Machine learning (ML) is increasingly being used to support high-stakes decisions. However, there is frequently a construct gap: a gap between the construct of interest to the decision-making task and what is captured in proxies used as labels to train ML models. As a result, ML models may fail to capture important dimensions of decision criteria, hampering their utility for decision support. Thus, an essential step in the design of ML systems for decision support is selecting a target label among available proxies. In this work, we explore the use of historical expert decisions as a rich -- yet also imperfect -- source of information that can be combined with observed outcomes to narrow the construct gap. We argue that managers and system designers may be interested in learning from experts in instances where they exhibit consistency with each other, while learning from observed outcomes otherwise. We develop a methodology to enable this goal using information that is commonly available in organizational information systems. This involves two core steps. First, we propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert. Second, we introduce a label amalgamation approach that allows ML models to simultaneously learn from expert decisions and observed outcomes. Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap, yielding better predictive performance than learning from either observed outcomes or expert decisions alone.
△ Less
Submitted 3 June, 2024; v1 submitted 24 January, 2021;
originally announced January 2021.
-
Characterizing Fairness Over the Set of Good Models Under Selective Labels
Authors:
Amanda Coston,
Ashesh Rambachan,
Alexandra Chouldechova
Abstract:
Algorithmic risk assessments are used to inform decisions in a wide variety of high-stakes settings. Often multiple predictive models deliver similar overall performance but differ markedly in their predictions for individual cases, an empirical phenomenon known as the "Rashomon Effect." These models may have different properties over various groups, and therefore have different predictive fairnes…
▽ More
Algorithmic risk assessments are used to inform decisions in a wide variety of high-stakes settings. Often multiple predictive models deliver similar overall performance but differ markedly in their predictions for individual cases, an empirical phenomenon known as the "Rashomon Effect." These models may have different properties over various groups, and therefore have different predictive fairness properties. We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance, or "the set of good models." Our framework addresses the empirically relevant challenge of selectively labelled data in the setting where the selection decision and outcome are unconfounded given the observed data features. Our framework can be used to 1) replace an existing model with one that has better fairness properties; or 2) audit for predictive bias. We illustrate these uses cases on a real-world credit-scoring task and a recidivism prediction task.
△ Less
Submitted 30 April, 2021; v1 submitted 1 January, 2021;
originally announced January 2021.
-
Leveraging Administrative Data for Bias Audits: Assessing Disparate Coverage with Mobility Data for COVID-19 Policy
Authors:
Amanda Coston,
Neel Guha,
Derek Ouyang,
Lisa Lu,
Alexandra Chouldechova,
Daniel E. Ho
Abstract:
Anonymized smartphone-based mobility data has been widely adopted in devising and evaluating COVID-19 response strategies such as the targeting of public health resources. Yet little attention has been paid to measurement validity and demographic bias, due in part to the lack of documentation about which users are represented as well as the challenge of obtaining ground truth data on unique visits…
▽ More
Anonymized smartphone-based mobility data has been widely adopted in devising and evaluating COVID-19 response strategies such as the targeting of public health resources. Yet little attention has been paid to measurement validity and demographic bias, due in part to the lack of documentation about which users are represented as well as the challenge of obtaining ground truth data on unique visits and demographics. We illustrate how linking large-scale administrative data can enable auditing mobility data for bias in the absence of demographic information and ground truth labels. More precisely, we show that linking voter roll data -- containing individual-level voter turnout for specific voting locations along with race and age -- can facilitate the construction of rigorous bias and reliability tests. These tests illuminate a sampling bias that is particularly noteworthy in the pandemic context: older and non-white voters are less likely to be captured by mobility data. We show that allocating public health resources based on such mobility data could disproportionately harm high-risk elderly and minority groups.
△ Less
Submitted 15 April, 2021; v1 submitted 13 November, 2020;
originally announced November 2020.
-
Counterfactual Predictions under Runtime Confounding
Authors:
Amanda Coston,
Edward H. Kennedy,
Alexandra Chouldechova
Abstract:
Algorithms are commonly used to predict outcomes under a particular decision or intervention, such as predicting whether an offender will succeed on parole if placed under minimal supervision. Generally, to learn such counterfactual prediction models from observational data on historical decisions and corresponding outcomes, one must measure all factors that jointly affect the outcomes and the dec…
▽ More
Algorithms are commonly used to predict outcomes under a particular decision or intervention, such as predicting whether an offender will succeed on parole if placed under minimal supervision. Generally, to learn such counterfactual prediction models from observational data on historical decisions and corresponding outcomes, one must measure all factors that jointly affect the outcomes and the decision taken. Motivated by decision support applications, we study the counterfactual prediction task in the setting where all relevant factors are captured in the historical data, but it is either undesirable or impermissible to use some such factors in the prediction model. We refer to this setting as runtime confounding. We propose a doubly-robust procedure for learning counterfactual prediction models in this setting. Our theoretical analysis and experimental results suggest that our method often outperforms competing approaches. We also present a validation procedure for evaluating the performance of counterfactual prediction methods.
△ Less
Submitted 15 April, 2021; v1 submitted 30 June, 2020;
originally announced June 2020.
-
Fairness Evaluation in Presence of Biased Noisy Labels
Authors:
Riccardo Fogliato,
Max G'Sell,
Alexandra Chouldechova
Abstract:
Risk assessment tools are widely used around the country to inform decision making within the criminal justice system. Recently, considerable attention has been devoted to the question of whether such tools may suffer from racial bias. In this type of assessment, a fundamental issue is that the training and evaluation of the model is based on a variable (arrest) that may represent a noisy version…
▽ More
Risk assessment tools are widely used around the country to inform decision making within the criminal justice system. Recently, considerable attention has been devoted to the question of whether such tools may suffer from racial bias. In this type of assessment, a fundamental issue is that the training and evaluation of the model is based on a variable (arrest) that may represent a noisy version of an unobserved outcome of more central interest (offense). We propose a sensitivity analysis framework for assessing how assumptions on the noise across groups affect the predictive bias properties of the risk assessment model as a predictor of reoffense. Our experimental results on two real world criminal justice data sets demonstrate how even small biases in the observed labels may call into question the conclusions of an analysis based on the noisy outcome.
△ Less
Submitted 30 March, 2020;
originally announced March 2020.
-
A Case for Humans-in-the-Loop: Decisions in the Presence of Erroneous Algorithmic Scores
Authors:
Maria De-Arteaga,
Riccardo Fogliato,
Alexandra Chouldechova
Abstract:
The increased use of algorithmic predictions in sensitive domains has been accompanied by both enthusiasm and concern. To understand the opportunities and risks of these technologies, it is key to study how experts alter their decisions when using such tools. In this paper, we study the adoption of an algorithmic tool used to assist child maltreatment hotline screening decisions. We focus on the q…
▽ More
The increased use of algorithmic predictions in sensitive domains has been accompanied by both enthusiasm and concern. To understand the opportunities and risks of these technologies, it is key to study how experts alter their decisions when using such tools. In this paper, we study the adoption of an algorithmic tool used to assist child maltreatment hotline screening decisions. We focus on the question: Are humans capable of identifying cases in which the machine is wrong, and of overriding those recommendations? We first show that humans do alter their behavior when the tool is deployed. Then, we show that humans are less likely to adhere to the machine's recommendation when the score displayed is an incorrect estimate of risk, even when overriding the recommendation requires supervisory approval. These results highlight the risks of full automation and the importance of designing decision pipelines that provide humans with autonomy.
△ Less
Submitted 20 February, 2020; v1 submitted 19 February, 2020;
originally announced February 2020.
-
Counterfactual Risk Assessments, Evaluation, and Fairness
Authors:
Amanda Coston,
Alan Mishler,
Edward H. Kennedy,
Alexandra Chouldechova
Abstract:
Algorithmic risk assessments are increasingly used to help humans make decisions in high-stakes settings, such as medicine, criminal justice and education. In each of these cases, the purpose of the risk assessment tool is to inform actions, such as medical treatments or release conditions, often with the aim of reducing the likelihood of an adverse event such as hospital readmission or recidivism…
▽ More
Algorithmic risk assessments are increasingly used to help humans make decisions in high-stakes settings, such as medicine, criminal justice and education. In each of these cases, the purpose of the risk assessment tool is to inform actions, such as medical treatments or release conditions, often with the aim of reducing the likelihood of an adverse event such as hospital readmission or recidivism. Problematically, most tools are trained and evaluated on historical data in which the outcomes observed depend on the historical decision-making policy. These tools thus reflect risk under the historical policy, rather than under the different decision options that the tool is intended to inform. Even when tools are constructed to predict risk under a specific decision, they are often improperly evaluated as predictors of the target outcome.
Focusing on the evaluation task, in this paper we define counterfactual analogues of common predictive performance and algorithmic fairness metrics that we argue are better suited for the decision-making context. We introduce a new method for estimating the proposed metrics using doubly robust estimation. We provide theoretical results that show that only under strong conditions can fairness according to the standard metric and the counterfactual metric simultaneously hold. Consequently, fairness-promoting methods that target parity in a standard fairness metric may --- and as we show empirically, do --- induce greater imbalance in the counterfactual analogue. We provide empirical comparisons on both synthetic data and a real world child welfare dataset to demonstrate how the proposed method improves upon standard practice.
△ Less
Submitted 10 January, 2020; v1 submitted 30 August, 2019;
originally announced September 2019.
-
What's in a Name? Reducing Bias in Bios without Access to Protected Attributes
Authors:
Alexey Romanov,
Maria De-Arteaga,
Hanna Wallach,
Jennifer Chayes,
Christian Borgs,
Alexandra Chouldechova,
Sahin Geyik,
Krishnaram Kenthapadi,
Anna Rumshisky,
Adam Tauman Kalai
Abstract:
There is a growing body of work that proposes methods for mitigating bias in machine learning systems. These methods typically rely on access to protected attributes such as race, gender, or age. However, this raises two significant challenges: (1) protected attributes may not be available or it may not be legal to use them, and (2) it is often desirable to simultaneously consider multiple protect…
▽ More
There is a growing body of work that proposes methods for mitigating bias in machine learning systems. These methods typically rely on access to protected attributes such as race, gender, or age. However, this raises two significant challenges: (1) protected attributes may not be available or it may not be legal to use them, and (2) it is often desirable to simultaneously consider multiple protected attributes, as well as their intersections. In the context of mitigating bias in occupation classification, we propose a method for discouraging correlation between the predicted probability of an individual's true occupation and a word embedding of their name. This method leverages the societal biases that are encoded in word embeddings, eliminating the need for access to protected attributes. Crucially, it only requires access to individuals' names at training time and not at deployment time. We evaluate two variations of our proposed method using a large-scale dataset of online biographies. We find that both variations simultaneously reduce race and gender biases, with almost no reduction in the classifier's overall true positive rate.
△ Less
Submitted 10 April, 2019;
originally announced April 2019.
-
Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting
Authors:
Maria De-Arteaga,
Alexey Romanov,
Hanna Wallach,
Jennifer Chayes,
Christian Borgs,
Alexandra Chouldechova,
Sahin Geyik,
Krishnaram Kenthapadi,
Adam Tauman Kalai
Abstract:
We present a large-scale study of gender bias in occupation classification, a task where the use of machine learning may lead to negative outcomes on peoples' lives. We analyze the potential allocation harms that can result from semantic representation bias. To do so, we study the impact on occupation classification of including explicit gender indicators---such as first names and pronouns---in di…
▽ More
We present a large-scale study of gender bias in occupation classification, a task where the use of machine learning may lead to negative outcomes on peoples' lives. We analyze the potential allocation harms that can result from semantic representation bias. To do so, we study the impact on occupation classification of including explicit gender indicators---such as first names and pronouns---in different semantic representations of online biographies. Additionally, we quantify the bias that remains when these indicators are "scrubbed," and describe proxy behavior that occurs in the absence of explicit gender indicators. As we demonstrate, differences in true positive rates between genders are correlated with existing gender imbalances in occupations, which may compound these imbalances.
△ Less
Submitted 27 January, 2019;
originally announced January 2019.
-
The Frontiers of Fairness in Machine Learning
Authors:
Alexandra Chouldechova,
Aaron Roth
Abstract:
The last few years have seen an explosion of academic and popular interest in algorithmic fairness. Despite this interest and the volume and velocity of work that has been produced recently, the fundamental science of fairness in machine learning is still in a nascent state. In March 2018, we convened a group of experts as part of a CCC visioning workshop to assess the state of the field, and dist…
▽ More
The last few years have seen an explosion of academic and popular interest in algorithmic fairness. Despite this interest and the volume and velocity of work that has been produced recently, the fundamental science of fairness in machine learning is still in a nascent state. In March 2018, we convened a group of experts as part of a CCC visioning workshop to assess the state of the field, and distill the most promising research directions going forward. This report summarizes the findings of that workshop. Along the way, it surveys recent theoretical work in the field and points towards promising directions for research.
△ Less
Submitted 20 October, 2018;
originally announced October 2018.
-
Learning under selective labels in the presence of expert consistency
Authors:
Maria De-Arteaga,
Artur Dubrawski,
Alexandra Chouldechova
Abstract:
We explore the problem of learning under selective labels in the context of algorithm-assisted decision making. Selective labels is a pervasive selection bias problem that arises when historical decision making blinds us to the true outcome for certain instances. Examples of this are common in many applications, ranging from predicting recidivism using pre-trial release data to diagnosing patients…
▽ More
We explore the problem of learning under selective labels in the context of algorithm-assisted decision making. Selective labels is a pervasive selection bias problem that arises when historical decision making blinds us to the true outcome for certain instances. Examples of this are common in many applications, ranging from predicting recidivism using pre-trial release data to diagnosing patients. In this paper we discuss why selective labels often cannot be effectively tackled by standard methods for adjusting for sample selection bias, even if there are no unobservables. We propose a data augmentation approach that can be used to either leverage expert consistency to mitigate the partial blindness that results from selective labels, or to empirically validate whether learning under such framework may lead to unreliable models prone to systemic discrimination.
△ Less
Submitted 4 July, 2018; v1 submitted 2 July, 2018;
originally announced July 2018.
-
Does mitigating ML's impact disparity require treatment disparity?
Authors:
Zachary C. Lipton,
Alexandra Chouldechova,
Julian McAuley
Abstract:
Following related work in law and policy, two notions of disparity have come to shape the study of fairness in algorithmic decision-making. Algorithms exhibit treatment disparity if they formally treat members of protected subgroups differently; algorithms exhibit impact disparity when outcomes differ across subgroups, even if the correlation arises unintentionally. Naturally, we can achieve impac…
▽ More
Following related work in law and policy, two notions of disparity have come to shape the study of fairness in algorithmic decision-making. Algorithms exhibit treatment disparity if they formally treat members of protected subgroups differently; algorithms exhibit impact disparity when outcomes differ across subgroups, even if the correlation arises unintentionally. Naturally, we can achieve impact parity through purposeful treatment disparity. In one thread of technical work, papers aim to reconcile the two forms of parity proposing disparate learning processes (DLPs). Here, the learning algorithm can see group membership during training but produce a classifier that is group-blind at test time. In this paper, we show theoretically that: (i) When other features correlate to group membership, DLPs will (indirectly) implement treatment disparity, undermining the policy desiderata they are designed to address; (ii) When group membership is partly revealed by other features, DLPs induce within-class discrimination; and (iii) In general, DLPs provide a suboptimal trade-off between accuracy and impact parity. Based on our technical analysis, we argue that transparent treatment disparity is preferable to occluded methods for achieving impact parity. Experimental results on several real-world datasets highlight the practical consequences of applying DLPs vs. per-group thresholds.
△ Less
Submitted 11 January, 2019; v1 submitted 19 November, 2017;
originally announced November 2017.
-
Fairer and more accurate, but for whom?
Authors:
Alexandra Chouldechova,
Max G'Sell
Abstract:
Complex statistical machine learning models are increasingly being used or considered for use in high-stakes decision-making pipelines in domains such as financial services, health care, criminal justice and human services. These models are often investigated as possible improvements over more classical tools such as regression models or human judgement. While the modeling approach may be new, the…
▽ More
Complex statistical machine learning models are increasingly being used or considered for use in high-stakes decision-making pipelines in domains such as financial services, health care, criminal justice and human services. These models are often investigated as possible improvements over more classical tools such as regression models or human judgement. While the modeling approach may be new, the practice of using some form of risk assessment to inform decisions is not. When determining whether a new model should be adopted, it is therefore essential to be able to compare the proposed model to the existing approach across a range of task-relevant accuracy and fairness metrics. Looking at overall performance metrics, however, may be misleading. Even when two models have comparable overall performance, they may nevertheless disagree in their classifications on a considerable fraction of cases. In this paper we introduce a model comparison framework for automatically identifying subgroups in which the differences between models are most pronounced. Our primary focus is on identifying subgroups where the models differ in terms of fairness-related quantities such as racial or gender disparities. We present experimental results from a recidivism prediction task and a hypothetical lending example.
△ Less
Submitted 30 June, 2017;
originally announced July 2017.
-
Fair prediction with disparate impact: A study of bias in recidivism prediction instruments
Authors:
Alexandra Chouldechova
Abstract:
Recidivism prediction instruments (RPI's) provide decision makers with an assessment of the likelihood that a criminal defendant will reoffend at a future point in time. While such instruments are gaining increasing popularity across the country, their use is attracting tremendous controversy. Much of the controversy concerns potential discriminatory bias in the risk assessments that are produced.…
▽ More
Recidivism prediction instruments (RPI's) provide decision makers with an assessment of the likelihood that a criminal defendant will reoffend at a future point in time. While such instruments are gaining increasing popularity across the country, their use is attracting tremendous controversy. Much of the controversy concerns potential discriminatory bias in the risk assessments that are produced. This paper discusses several fairness criteria that have recently been applied to assess the fairness of recidivism prediction instruments. We demonstrate that the criteria cannot all be simultaneously satisfied when recidivism prevalence differs across groups. We then show how disparate impact can arise when a recidivism prediction instrument fails to satisfy the criterion of error rate balance.
△ Less
Submitted 28 February, 2017;
originally announced March 2017.
-
Fair prediction with disparate impact: A study of bias in recidivism prediction instruments
Authors:
Alexandra Chouldechova
Abstract:
Recidivism prediction instruments provide decision makers with an assessment of the likelihood that a criminal defendant will reoffend at a future point in time. While such instruments are gaining increasing popularity across the country, their use is attracting tremendous controversy. Much of the controversy concerns potential discriminatory bias in the risk assessments that are produced. This pa…
▽ More
Recidivism prediction instruments provide decision makers with an assessment of the likelihood that a criminal defendant will reoffend at a future point in time. While such instruments are gaining increasing popularity across the country, their use is attracting tremendous controversy. Much of the controversy concerns potential discriminatory bias in the risk assessments that are produced. This paper discusses a fairness criterion originating in the field of educational and psychological testing that has recently been applied to assess the fairness of recidivism prediction instruments. We demonstrate how adherence to the criterion may lead to considerable disparate impact when recidivism prevalence differs across groups.
△ Less
Submitted 24 October, 2016;
originally announced October 2016.