-
Adaptive Recruitment Resource Allocation to Improve Cohort Representativeness in Participatory Biomedical Datasets
Authors:
Victor Borza,
Andrew Estornell,
Ellen Wright Clayton,
Chien-Ju Ho,
Russell Rothman,
Yevgeniy Vorobeychik,
Bradley Malin
Abstract:
Large participatory biomedical studies, studies that recruit individuals to join a dataset, are gaining popularity and investment, especially for analysis by modern AI methods. Because they purposively recruit participants, these studies are uniquely able to address a lack of historical representation, an issue that has affected many biomedical datasets. In this work, we define representativeness…
▽ More
Large participatory biomedical studies, studies that recruit individuals to join a dataset, are gaining popularity and investment, especially for analysis by modern AI methods. Because they purposively recruit participants, these studies are uniquely able to address a lack of historical representation, an issue that has affected many biomedical datasets. In this work, we define representativeness as the similarity to a target population distribution of a set of attributes and our goal is to mirror the U.S. population across distributions of age, gender, race, and ethnicity. Many participatory studies recruit at several institutions, so we introduce a computational approach to adaptively allocate recruitment resources among sites to improve representativeness. In simulated recruitment of 10,000-participant cohorts from medical centers in the STAR Clinical Research Network, we show that our approach yields a more representative cohort than existing baselines. Thus, we highlight the value of computational modeling in guiding recruitment efforts.
△ Less
Submitted 2 August, 2024;
originally announced August 2024.
-
User-Creator Feature Dynamics in Recommender Systems with Dual Influence
Authors:
Tao Lin,
Kun Jin,
Andrew Estornell,
Xiaoying Zhang,
Yiling Chen,
Yang Liu
Abstract:
Recommender systems present relevant contents to users and help content creators reach their target audience. The dual nature of these systems influences both users and creators: users' preferences are affected by the items they are recommended, while creators are incentivized to alter their contents such that it is recommended more frequently. We define a model, called user-creator feature dynami…
▽ More
Recommender systems present relevant contents to users and help content creators reach their target audience. The dual nature of these systems influences both users and creators: users' preferences are affected by the items they are recommended, while creators are incentivized to alter their contents such that it is recommended more frequently. We define a model, called user-creator feature dynamics, to capture the dual influences of recommender systems. We prove that a recommender system with dual influence is guaranteed to polarize, causing diversity loss in the system. We then investigate, both theoretically and empirically, approaches for mitigating polarization and promoting diversity in recommender systems. Unexpectedly, we find that common diversity-promoting approaches do not work in the presence of dual influence, while relevancy-optimizing methods like top-$k$ recommendation can prevent polarization and improve diversity of the system.
△ Less
Submitted 19 July, 2024;
originally announced July 2024.
-
Dataset Representativeness and Downstream Task Fairness
Authors:
Victor Borza,
Andrew Estornell,
Chien-Ju Ho,
Bradley Malin,
Yevgeniy Vorobeychik
Abstract:
Our society collects data on people for a wide range of applications, from building a census for policy evaluation to running meaningful clinical trials. To collect data, we typically sample individuals with the goal of accurately representing a population of interest. However, current sampling processes often collect data opportunistically from data sources, which can lead to datasets that are bi…
▽ More
Our society collects data on people for a wide range of applications, from building a census for policy evaluation to running meaningful clinical trials. To collect data, we typically sample individuals with the goal of accurately representing a population of interest. However, current sampling processes often collect data opportunistically from data sources, which can lead to datasets that are biased and not representative, i.e., the collected dataset does not accurately reflect the distribution of demographics of the true population. This is a concern because subgroups within the population can be under- or over-represented in a dataset, which may harm generalizability and lead to an unequal distribution of benefits and harms from downstream tasks that use such datasets (e.g., algorithmic bias in medical decision-making algorithms). In this paper, we assess the relationship between dataset representativeness and group-fairness of classifiers trained on that dataset. We demonstrate that there is a natural tension between dataset representativeness and classifier fairness; empirically we observe that training datasets with better representativeness can frequently result in classifiers with higher rates of unfairness. We provide some intuition as to why this occurs via a set of theoretical results in the case of univariate classifiers. We also find that over-sampling underrepresented groups can result in classifiers which exhibit greater bias to those groups. Lastly, we observe that fairness-aware sampling strategies (i.e., those which are specifically designed to select data with high downstream fairness) will often over-sample members of majority groups. These results demonstrate that the relationship between dataset representativeness and downstream classifier fairness is complex; balancing these two quantities requires special care from both model- and dataset-designers.
△ Less
Submitted 28 June, 2024;
originally announced July 2024.
-
Predicting Customer Goals in Financial Institution Services: A Data-Driven LSTM Approach
Authors:
Andrew Estornell,
Stylianos Loukas Vasileiou,
William Yeoh,
Daniel Borrajo,
Rui Silva
Abstract:
In today's competitive financial landscape, understanding and anticipating customer goals is crucial for institutions to deliver a personalized and optimized user experience. This has given rise to the problem of accurately predicting customer goals and actions. Focusing on that problem, we use historical customer traces generated by a realistic simulator and present two simple models for predicti…
▽ More
In today's competitive financial landscape, understanding and anticipating customer goals is crucial for institutions to deliver a personalized and optimized user experience. This has given rise to the problem of accurately predicting customer goals and actions. Focusing on that problem, we use historical customer traces generated by a realistic simulator and present two simple models for predicting customer goals and future actions -- an LSTM model and an LSTM model enhanced with state-space graph embeddings. Our results demonstrate the effectiveness of these models when it comes to predicting customer goals and actions.
△ Less
Submitted 22 May, 2024;
originally announced June 2024.
-
Measuring and Reducing LLM Hallucination without Gold-Standard Answers
Authors:
Jiaheng Wei,
Yuanshun Yao,
Jean-Francois Ton,
Hongyi Guo,
Andrew Estornell,
Yang Liu
Abstract:
LLM hallucination, i.e. generating factually incorrect yet seemingly convincing answers, is currently a major threat to the trustworthiness and reliability of LLMs. The first step towards solving this complicated problem is to measure it. However, existing hallucination metrics require having a benchmark dataset with gold-standard answers, i.e. "best" or "correct" answers written by humans. Such r…
▽ More
LLM hallucination, i.e. generating factually incorrect yet seemingly convincing answers, is currently a major threat to the trustworthiness and reliability of LLMs. The first step towards solving this complicated problem is to measure it. However, existing hallucination metrics require having a benchmark dataset with gold-standard answers, i.e. "best" or "correct" answers written by humans. Such requirements make hallucination measurement costly and prone to human errors. In this work, we propose Factualness Evaluations via Weighting LLMs (FEWL), an innovative hallucination metric that is specifically designed for the scenario when gold-standard answers are absent. FEWL leverages the answers from off-the-shelf LLMs that serve as a proxy of gold-standard answers. The key challenge is how to quantify the expertise of reference LLMs resourcefully. We show FEWL has certain theoretical guarantees and demonstrate empirically it gives more accurate hallucination measures than naively using reference LLMs. We also show how to leverage FEWL to reduce hallucination through both in-context learning and supervised fine-tuning. Extensive experiment results on Truthful-QA, CHALE, and HaluEval datasets demonstrate the effectiveness of FEWL.
△ Less
Submitted 6 June, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
Manipulating Elections by Changing Voter Perceptions
Authors:
Junlin Wu,
Andrew Estornell,
Lecheng Kong,
Yevgeniy Vorobeychik
Abstract:
The integrity of elections is central to democratic systems. However, a myriad of malicious actors aspire to influence election outcomes for financial or political benefit. A common means to such ends is by manipulating perceptions of the voting public about select candidates, for example, through misinformation. We present a formal model of the impact of perception manipulation on election outcom…
▽ More
The integrity of elections is central to democratic systems. However, a myriad of malicious actors aspire to influence election outcomes for financial or political benefit. A common means to such ends is by manipulating perceptions of the voting public about select candidates, for example, through misinformation. We present a formal model of the impact of perception manipulation on election outcomes in the framework of spatial voting theory, in which the preferences of voters over candidates are generated based on their relative distance in the space of issues. We show that controlling elections in this model is, in general, NP-hard, whether issues are binary or real-valued. However, we demonstrate that critical to intractability is the diversity of opinions on issues exhibited by the voting public. When voter views lack diversity, and we can instead group them into a small number of categories -- for example, as a result of political polarization -- the election control problem can be solved in polynomial time in the number of issues and candidates for arbitrary scoring rules.
△ Less
Submitted 17 June, 2022; v1 submitted 29 April, 2022;
originally announced May 2022.
-
Unfairness Despite Awareness: Group-Fair Classification with Strategic Agents
Authors:
Andrew Estornell,
Sanmay Das,
Yang Liu,
Yevgeniy Vorobeychik
Abstract:
The use of algorithmic decision making systems in domains which impact the financial, social, and political well-being of people has created a demand for these decision making systems to be "fair" under some accepted notion of equity. This demand has in turn inspired a large body of work focused on the development of fair learning algorithms which are then used in lieu of their conventional counte…
▽ More
The use of algorithmic decision making systems in domains which impact the financial, social, and political well-being of people has created a demand for these decision making systems to be "fair" under some accepted notion of equity. This demand has in turn inspired a large body of work focused on the development of fair learning algorithms which are then used in lieu of their conventional counterparts. Most analysis of such fair algorithms proceeds from the assumption that the people affected by the algorithmic decisions are represented as immutable feature vectors. However, strategic agents may possess both the ability and the incentive to manipulate this observed feature vector in order to attain a more favorable outcome. We explore the impact that strategic agent behavior could have on fair classifiers and derive conditions under which this behavior leads to fair classifiers becoming less fair than their conventional counterparts under the same measure of fairness that the fair classifier takes into account. These conditions are related to the the way in which the fair classifier remedies unfairness on the original unmanipulated data: fair classifiers which remedy unfairness by becoming more selective than their conventional counterparts are the ones that become less fair than their counterparts when agents are strategic. We further demonstrate that both the increased selectiveness of the fair classifier, and consequently the loss of fairness, arises when performing fair learning on domains in which the advantaged group is overrepresented in the region near (and on the beneficial side of) the decision boundary of conventional classifiers. Finally, we observe experimentally, using several datasets and learning methods, that this fairness reversal is common, and that our theoretical characterization of the fairness reversal conditions indeed holds in most such cases.
△ Less
Submitted 5 December, 2021;
originally announced December 2021.
-
Incentivizing Truthfulness Through Audits in Strategic Classification
Authors:
Andrew Estornell,
Sanmay Das,
Yevgeniy Vorobeychik
Abstract:
In many societal resource allocation domains, machine learning methods are increasingly used to either score or rank agents in order to decide which ones should receive either resources (e.g., homeless services) or scrutiny (e.g., child welfare investigations) from social services agencies. An agency's scoring function typically operates on a feature vector that contains a combination of self-repo…
▽ More
In many societal resource allocation domains, machine learning methods are increasingly used to either score or rank agents in order to decide which ones should receive either resources (e.g., homeless services) or scrutiny (e.g., child welfare investigations) from social services agencies. An agency's scoring function typically operates on a feature vector that contains a combination of self-reported features and information available to the agency about individuals or households.This can create incentives for agents to misrepresent their self-reported features in order to receive resources or avoid scrutiny, but agencies may be able to selectively audit agents to verify the veracity of their reports.
We study the problem of optimal auditing of agents in such settings. When decisions are made using a threshold on an agent's score, the optimal audit policy has a surprisingly simple structure, uniformly auditing all agents who could benefit from lying. While this policy can, in general, be hard to compute because of the difficulty of identifying the set of agents who could benefit from lying given a complete set of reported types, we also present necessary and sufficient conditions under which it is tractable. We show that the scarce resource setting is more difficult, and exhibit an approximately optimal audit policy in this case. In addition, we show that in either setting verifying whether it is possible to incentivize exact truthfulness is hard even to approximate. However, we also exhibit sufficient conditions for solving this problem optimally, and for obtaining good approximations.
△ Less
Submitted 16 December, 2020;
originally announced December 2020.
-
Election Control by Manipulating Issue Significance
Authors:
Andrew Estornell,
Sanmay Das,
Edith Elkind,
Yevgeniy Vorobeychik
Abstract:
Integrity of elections is vital to democratic systems, but it is frequently threatened by malicious actors. The study of algorithmic complexity of the problem of manipulating election outcomes by changing its structural features is known as election control. One means of election control that has been proposed is to select a subset of issues that determine voter preferences over candidates. We stu…
▽ More
Integrity of elections is vital to democratic systems, but it is frequently threatened by malicious actors. The study of algorithmic complexity of the problem of manipulating election outcomes by changing its structural features is known as election control. One means of election control that has been proposed is to select a subset of issues that determine voter preferences over candidates. We study a variation of this model in which voters have judgments about relative importance of issues, and a malicious actor can manipulate these judgments. We show that computing effective manipulations in this model is NP-hard even with two candidates or binary issues. However, we demonstrate that the problem is tractable with a constant number of voters or issues. Additionally, while it remains intractable when voters can vote stochastically, we exhibit an important special case in which stochastic voting enables tractable manipulation.
△ Less
Submitted 19 July, 2020;
originally announced July 2020.
-
Deception through Half-Truths
Authors:
Andrew Estornell,
Sanmay Das,
Yevgeniy Vorobeychik
Abstract:
Deception is a fundamental issue across a diverse array of settings, from cybersecurity, where decoys (e.g., honeypots) are an important tool, to politics that can feature politically motivated "leaks" and fake news about candidates.Typical considerations of deception view it as providing false information.However, just as important but less frequently studied is a more tacit form where informatio…
▽ More
Deception is a fundamental issue across a diverse array of settings, from cybersecurity, where decoys (e.g., honeypots) are an important tool, to politics that can feature politically motivated "leaks" and fake news about candidates.Typical considerations of deception view it as providing false information.However, just as important but less frequently studied is a more tacit form where information is strategically hidden or leaked.We consider the problem of how much an adversary can affect a principal's decision by "half-truths", that is, by masking or hiding bits of information, when the principal is oblivious to the presence of the adversary. The principal's problem can be modeled as one of predicting future states of variables in a dynamic Bayes network, and we show that, while theoretically the principal's decisions can be made arbitrarily bad, the optimal attack is NP-hard to approximate, even under strong assumptions favoring the attacker. However, we also describe an important special case where the dependency of future states on past states is additive, in which we can efficiently compute an approximately optimal attack. Moreover, in networks with a linear transition function we can solve the problem optimally in polynomial time.
△ Less
Submitted 13 November, 2019;
originally announced November 2019.