-
Finding Convincing Views to Endorse a Claim
Authors:
Shunit Agmon,
Amir Gilad,
Brit Youngmann,
Shahar Zoarets,
Benny Kimelfeld
Abstract:
Recent studies investigated the challenge of assessing the strength of a given claim extracted from a dataset, particularly the claim's potential of being misleading and cherry-picked. We focus on claims that compare answers to an aggregate query posed on a view that selects tuples. The strength of a claim amounts to the question of how likely it is that the view is carefully chosen to support the claim, whereas less careful choices would lead to contradictory claims. We embark on the study of the reverse task that offers a complementary angle in the critical assessment of data-based claims: given a claim, find useful supporting views. The goal of this task is twofold. On the one hand, we aim to assist users in finding significant evidence of phenomena of interest. On the other hand, we wish to provide them with machinery to criticize or counter given claims by extracting evidence of opposing statements.
To be effective, the supporting sub-population should be significant and defined by a "natural" view. We discuss several measures of naturalness and propose ways of extracting the best views under each measure (and combinations thereof). The main challenge is the computational cost, as naïve search is infeasible. We devise anytime algorithms that deploy two main steps: (1) a preliminary construction of a ranked list of attribute combinations that are assessed using fast-to-compute features, and (2) an efficient search for the actual views based on each attribute combination. We present a thorough experimental study that shows the effectiveness of our algorithms in terms of quality and execution cost. We also present a user study to assess the usefulness of the naturalness measures.
Submitted 27 August, 2024;
originally announced August 2024.
-
Causal Data Integration
Authors:
Brit Youngmann,
Michael Cafarella,
Babak Salimi,
Anna Zeng
Abstract:
Causal inference is fundamental to empirical scientific discoveries in natural and social sciences; however, in the process of conducting causal inference, data management problems can lead to false discoveries. Two such problems are (i) not having all attributes required for analysis, and (ii) misidentifying which attributes are to be included in the analysis. Analysts often only have access to partial data, and they critically rely on (often unavailable or incomplete) domain knowledge to identify attributes to include for analysis, which is often given in the form of a causal DAG. We argue that data management techniques can surmount both of these challenges. In this work, we introduce the Causal Data Integration (CDI) problem, in which unobserved attributes are mined from external sources and a corresponding causal DAG is automatically built. We identify key challenges and research opportunities in designing a CDI system, and present a system architecture for solving the CDI problem. Our preliminary experimental results demonstrate that solving CDI is achievable and pave the way for future research.
Submitted 15 May, 2023;
originally announced May 2023.
-
On Explaining Confounding Bias
Authors:
Brit Youngmann,
Michael Cafarella,
Yuval Moskovitch,
Babak Salimi
Abstract:
When analyzing large datasets, analysts are often interested in the explanations for surprising or unexpected results produced by their queries. In this work, we focus on aggregate SQL queries that expose correlations in the data. A major challenge that hinders the interpretation of such queries is confounding bias, which can lead to an unexpected correlation. We generate explanations in terms of a set of confounding variables that explain the unexpected correlation observed in a query. We propose to mine candidate confounding variables from external sources since, in many real-life scenarios, the explanations are not solely contained in the input data. We present an efficient algorithm that finds the optimal subset of attributes (mined from external sources and the input dataset) that explain the unexpected correlation. This algorithm is embodied in a system called MESA. We demonstrate experimentally over multiple real-life datasets and through a user study that our approach generates insightful explanations, outperforming existing methods that search for explanations only in the input data. We further demonstrate the robustness of our system to missing data and the ability of MESA to handle input datasets containing millions of tuples and an extensive search space of candidate confounding attributes.
Submitted 6 October, 2022;
originally announced October 2022.
-
Guided Exploration of Data Summaries
Authors:
Brit Youngmann,
Sihem Amer-Yahia,
Aurélien Personnaz
Abstract:
Data summarization is the process of producing interpretable and representative subsets of an input dataset. It is usually performed following a one-shot process with the purpose of finding the best summary. A useful summary contains k individually uniform sets that are collectively diverse to be representative. Uniformity addresses interpretability and diversity addresses representativity. Finding such a summary is a difficult task when data is highly diverse and large. We examine the applicability of Exploratory Data Analysis (EDA) to data summarization and formalize EdA4Sum, the problem of guided exploration of data summaries that seeks to sequentially produce connected summaries with the goal of maximizing their cumulative utility. EdA4Sum generalizes one-shot summarization. We propose to solve it with one of two approaches: (i) Top1Sum, which chooses the most useful summary at each step; (ii) RLSum, which trains a policy with Deep Reinforcement Learning that rewards an agent for finding a diverse and new collection of uniform sets at each step. We compare these approaches with one-shot summarization and top-performing EDA solutions. We run extensive experiments on three large datasets. Our results demonstrate the superiority of our approaches for summarizing very large data, and the need to provide guidance to domain experts.
Submitted 27 May, 2022;
originally announced May 2022.
-
Algorithmic Copywriting: Automated Generation of Health-Related Advertisements to Improve their Performance
Authors:
Brit Youngmann,
Ran Gilad-Bachrach,
Danny Karmon,
Elad Yom-Tov
Abstract:
Search advertising, a popular method for online marketing, has been employed to improve health by eliciting positive behavioral change. However, writing effective advertisements requires expertise and experimentation, which may not be available to health authorities wishing to elicit such changes, especially when dealing with public health crises such as epidemic outbreaks.
Here we develop a framework, comprising two neural network models, that automatically generates ads. First, it employs a generator model, which creates ads from web pages. It then employs a translation model, which transcribes ads to improve performance.
We trained the networks using 114K health-related ads shown on Microsoft Advertising. We measure ad performance using the click-through rate (CTR).
Our experiments show that the generated advertisements received approximately the same CTR as human-authored ads. The marginal contribution of the generator model was, on average, 28% lower than that of human-authored ads, while the translator model received, on average, 32% more clicks than human-authored ads. Our analysis shows that the translator model produces ads reflecting higher values of psychological attributes associated with a user action, including higher valence and arousal, and more calls to action. In contrast, levels of these attributes in ads produced by the generator model are similar to those of human-authored ads.
Our results demonstrate the ability to automatically generate useful advertisements for the health domain. We believe that our work offers health authorities an improved ability to nudge people towards healthier behaviors while saving the time and cost needed to build effective advertising campaigns.
Submitted 12 July, 2020; v1 submitted 27 October, 2019;
originally announced October 2019.
-
Detecting Parkinson's Disease from interactions with a search engine: Is expert knowledge sufficient?
Authors:
Liron Allerhand,
Brit Youngmann,
Elad Yom-Tov,
David Arkadir
Abstract:
Parkinson's disease (PD) is a slowly progressing neurodegenerative disease with early manifestation of motor signs. Recently, there has been a growing interest in developing automatic tools that can assess motor function in PD patients. Here we show that mouse tracking data collected during people's interaction with a search engine can be used to distinguish PD patients from similar, non-diseased users and present a methodology developed for the diagnosis of PD from these data. A main challenge we address is the extraction of informative features from raw mouse tracking data. We do so in two complementary ways: First, we manually construct expert-recommended informative features, aiming to identify abnormalities in motor behaviors. Second, we use an unsupervised representation learning technique to map these raw data to high-level features. Using all the extracted features, a Random Forest classifier is then used to distinguish PD patients from controls, achieving an AUC of 0.92, while results using only expert-generated or auto-generated features are 0.87 and 0.83, respectively. Our results indicate that mouse tracking data can help in detecting users at early stages of the disease, and that both expert-generated features and unsupervised techniques for feature generation are required to achieve the best possible performance.
Submitted 3 May, 2018;
originally announced May 2018.