-
Fairness and Unfairness in Binary and Multiclass Classification: Quantifying, Calculating, and Bounding
Authors:
Sivan Sabato,
Eran Treister,
Elad Yom-Tov
Abstract:
We propose a new interpretable measure of unfairness that allows a quantitative analysis of classifier fairness, beyond a dichotomous fair/unfair distinction. We show how this measure can be calculated when the classifier's conditional confusion matrices are known. We further propose methods for auditing classifiers for their fairness when the confusion matrices cannot be obtained or even estimated. Our approach lower-bounds the unfairness of a classifier based only on aggregate statistics, which may be provided by the owner of the classifier or collected from freely available data. We use the equalized odds criterion, which we generalize to the multiclass case. We report experiments on data sets representing diverse applications, which demonstrate the effectiveness and the wide range of possible uses of the proposed methodology. An implementation of the procedures proposed in this paper, as well as the code for running the experiments, is provided at https://github.com/sivansabato/unfairness.
Submitted 5 April, 2024; v1 submitted 7 June, 2022;
originally announced June 2022.
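For readers unfamiliar with the equalized odds criterion mentioned in the abstract above, the following is a minimal sketch of how per-group conditional confusion matrices can be computed and compared. The function names and the "largest gap between groups" summary are illustrative assumptions; this is not the unfairness measure proposed in the paper, whose implementation is in the authors' repository.

```python
from collections import defaultdict


def conditional_confusion(y_true, y_pred, groups):
    """Return {group: {true_label: {pred_label: P(pred | true, group)}}}."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for t, p, g in zip(y_true, y_pred, groups):
        counts[g][t][p] += 1
    rates = {}
    for g, by_true in counts.items():
        rates[g] = {}
        for t, by_pred in by_true.items():
            total = sum(by_pred.values())
            rates[g][t] = {p: n / total for p, n in by_pred.items()}
    return rates


def equalized_odds_gap(y_true, y_pred, groups):
    """Largest between-group difference in P(pred | true) over all label pairs.

    Hypothetical summary statistic: 0.0 means the conditional confusion
    matrices agree across groups (equalized odds holds on this sample).
    Works for binary and multiclass labels alike.
    """
    rates = conditional_confusion(y_true, y_pred, groups)
    labels = set(y_true) | set(y_pred)
    gap = 0.0
    for t in labels:
        for p in labels:
            # Only compare groups in which the true label t actually occurs.
            vals = [rates[g][t].get(p, 0.0) for g in rates if t in rates[g]]
            if len(vals) >= 2:
                gap = max(gap, max(vals) - min(vals))
    return gap
```

A gap of zero on a sample does not establish fairness in the population; it only says the empirical conditional error rates coincide across groups.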
-
COVID-19 Datathon Based on Deidentified Governmental Data as an Approach for Solving Policy Challenges, Increasing Trust, and Building a Community: Case Study
Authors:
Mor Peleg,
Amnon Reichman,
Sivan Shachar,
Tamir Gadot,
Meytal Avgil Tsadok,
Maya Azaria,
Orr Dunkelman,
Shiri Hassid,
Daniella Partem,
Maya Shmailov,
Elad Yom-Tov,
Roy Cohen
Abstract:
Triggered by the COVID-19 crisis, Israel's Ministry of Health (MoH) held a virtual Datathon based on deidentified governmental data. Organized by a multidisciplinary committee, Israel's research community was invited to offer insights to COVID-19 policy challenges. The Datathon was designed to (1) develop operationalizable data-driven models to address COVID-19 health-policy challenges and (2) build a community of researchers from academia, industry, and government and rebuild their trust in the government. Three specific challenges were defined based on their relevance (significance, data availability, and potential to anonymize the data): immunization policies, special needs of the young population, and populations whose rate of compliance with COVID-19 testing is low. The MoH team extracted diverse, reliable, up-to-date, and deidentified governmental datasets for each challenge. Secure remote-access research environments with relevant data science tools were set on Amazon Web. The MoH screened the applicants and accepted around 80 participants, teaming them to balance areas of expertise as well as represent all sectors of the community. One week following the event, anonymous surveys for participants and mentors were distributed to assess overall usefulness and points for improvement. The 48-hour Datathon and pre-event sessions included 18 multidisciplinary teams, mentored by 20 data scientists, 6 epidemiologists, 5 presentation mentors, and 12 judges. The insights developed by the 3 winning teams are currently considered by the MoH as potential data science methods relevant for national policies. The most positive results were increased trust in the MoH and greater readiness to work with the government on these or future projects. Detailed feedback offered concrete lessons for improving the structure and organization of future government-led datathons.
Submitted 30 August, 2021;
originally announced August 2021.
-
Providing early indication of regional anomalies in COVID19 case counts in England using search engine queries
Authors:
Elad Yom-Tov,
Vasileios Lampos,
Ingemar J. Cox,
Michael Edelstein
Abstract:
COVID19 was first reported in England at the end of January 2020, and by mid-June over 150,000 cases were reported. We assume that, similarly to influenza-like illnesses, people who suffer from COVID19 may query for their symptoms prior to accessing the medical system (or in lieu of it). Therefore, we analyzed searches to Bing from users in England, identifying cases where unexpected rises in relevant symptom searches occurred at specific areas of the country. Our analysis shows that searches for "fever" and "cough" were the most correlated with future case counts, with searches preceding case counts by 16-17 days. Unexpected rises in search patterns were predictive of future case counts multiplying by 2.5 or more within a week, reaching an Area Under Curve (AUC) of 0.64. Similar rises in mortality were predicted with an AUC of approximately 0.61 at a lead time of 3 weeks. Thus, our metric provided Public Health England with an indication which could be used to plan the response to COVID19 and could possibly be utilized to detect regional anomalies of other pathogens.
Submitted 23 July, 2020;
originally announced July 2020.
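The abstract above reports that symptom searches preceded case counts by 16-17 days. One simple way to estimate such a lead time is to shift one daily series against the other and keep the shift with the highest correlation. The sketch below assumes two equally long daily series and uses Pearson correlation; the function names are hypothetical, and the authors' actual procedure may differ.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def best_lead(searches, cases, max_lag):
    """Return (lag, r): the number of days by which searches best precede cases."""
    best = (0, pearson(searches, cases))
    for lag in range(1, max_lag + 1):
        # Pair searches on day d with cases on day d + lag.
        r = pearson(searches[:-lag], cases[lag:])
        if r > best[1]:
            best = (lag, r)
    return best
```

Each additional day of lag shortens the overlapping window by one observation, so `max_lag` should stay well below the series length.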
-
Tracking COVID-19 using online search
Authors:
Vasileios Lampos,
Maimuna S. Majumder,
Elad Yom-Tov,
Michael Edelstein,
Simon Moura,
Yohhei Hamada,
Molebogeng X. Rangaka,
Rachel A. McKendry,
Ingemar J. Cox
Abstract:
Previous research has demonstrated that various properties of infectious diseases can be inferred from online search behaviour. In this work we use time series of online search query frequencies to gain insights about the prevalence of COVID-19 in multiple countries. We first develop unsupervised modelling techniques based on associated symptom categories identified by the United Kingdom's National Health Service and Public Health England. We then attempt to minimise an expected bias in these signals caused by public interest -- as opposed to infections -- using the proportion of news media coverage devoted to COVID-19 as a proxy indicator. Our analysis indicates that models based on online searches precede the reported confirmed cases and deaths by 16.7 (10.2 - 23.2) and 22.1 (17.4 - 26.9) days, respectively. We also investigate transfer learning techniques for mapping supervised models from countries where the spread of disease has progressed extensively to countries that are in earlier phases of their respective epidemic curves. Furthermore, we compare time series of online search activity against confirmed COVID-19 cases or deaths jointly across multiple countries, uncovering interesting querying patterns, including the finding that rarer symptoms are better predictors than common ones. Finally, we show that web searches improve the short-term forecasting accuracy of autoregressive models for COVID-19 deaths. Our work provides evidence that online search data can be used to develop complementary public health surveillance methods to help inform the COVID-19 response in conjunction with more established approaches.
Submitted 10 February, 2021; v1 submitted 18 March, 2020;
originally announced March 2020.
-
Algorithmic Copywriting: Automated Generation of Health-Related Advertisements to Improve their Performance
Authors:
Brit Youngmann,
Ran Gilad-Bachrach,
Danny Karmon,
Elad Yom-Tov
Abstract:
Search advertising, a popular method for online marketing, has been employed to improve health by eliciting positive behavioral change. However, writing effective advertisements requires expertise and experimentation, which may not be available to health authorities wishing to elicit such changes, especially when dealing with public health crises such as epidemic outbreaks.
Here we develop a framework, comprising two neural network models, that automatically generates ads. First, it employs a generator model, which creates ads from web pages. It then employs a translation model, which transcribes ads to improve performance.
We trained the networks using 114K health-related ads shown on Microsoft Advertising. We measure ad performance using click-through rate (CTR).
Our experiments show that the generated advertisements received approximately the same CTR as human-authored ads. The marginal contribution of the generator model was, on average, 28% lower than that of human-authored ads, while the translator model received, on average, 32% more clicks than human-authored ads. Our analysis shows that the translator model produces ads reflecting higher values of psychological attributes associated with a user action, including higher valence and arousal, and more calls to action. In contrast, levels of these attributes in ads produced by the generator model are similar to those of human-authored ads.
Our results demonstrate the ability to automatically generate useful advertisements for the health domain. We believe that our work offers health authorities an improved ability to nudge people towards healthier behaviors while saving the time and cost needed to build effective advertising campaigns.
Submitted 12 July, 2020; v1 submitted 27 October, 2019;
originally announced October 2019.
-
Modeling infection methods of computer malware in the presence of vaccinations using epidemiological models: An analysis of real-world data
Authors:
Elad Yom-Tov,
Nir Levy,
Amir Rubin
Abstract:
Computer malware and biological pathogens often use similar mechanisms of infection. For this reason, it has been suggested to model malware spread using epidemiological models developed to characterize the spread of biological pathogens. However, most work examining the similarities between malware and pathogens using such methods was based on theoretical analysis and simulation.
Here we extend the classical Susceptible-Infected-Recovered (SIR) epidemiological model to describe two of the most common infection methods used by malware. We fit the proposed model to malware collected over a period of one year from a major anti-malware vendor. We show that by fitting the proposed model it is possible to identify the method of transmission used by the malware, its rate of infection, and the number of machines which will be infected unless blocked by anti-virus software. The Spearman correlation between the number of actual and predicted infected machines is 0.84.
Examining cases where an anti-malware "signature" was transmitted to susceptible computers by the anti-virus provider, we show that the time to remove the malware will be short and independent of the number of infected computers if fewer than approximately 60% of susceptible computers have been infected. If more computers were infected, the time to removal will be approximately 3.2 times greater, and will depend on the fraction of infected computers.
Our results show that the application of epidemiological models of infection to malware can provide anti-virus providers with information on malware spread and its potential damage. We further propose that similarities between computer malware and biological pathogens, the availability of data on the former and the dearth of data on the latter, make malware an extremely useful model for testing interventions which could later be applied to improve medicine.
Submitted 26 August, 2019;
originally announced August 2019.
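As background for the abstract above, the classical Susceptible-Infected-Recovered (SIR) model can be simulated in a few lines. The sketch below integrates the standard equations dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I with a fixed-step Euler scheme. It covers only the classical model, not the malware-specific extension the paper proposes, and the function name, parameter names, and step size are assumptions made for illustration.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Euler integration of the classical SIR model.

    beta  : infection rate per day
    gamma : recovery (removal) rate per day
    Returns a list of (S, I, R) tuples, one per day, including day 0.
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # total population stays constant
    steps_per_day = round(1 / dt)
    trajectory = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory
```

For example, `simulate_sir(0.5, 0.1, 990, 10, 0, 100)` traces an outbreak with basic reproductive number beta/gamma = 5 in a population of 1,000 machines or people. Fitting beta and gamma to observed infection counts, as the abstract describes, would require an optimizer on top of this simulator.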
-
Privacy, Altruism, and Experience: Estimating the Perceived Value of Internet Data for Medical Uses
Authors:
Gilie Gefen,
Omer Ben-Porat,
Moshe Tennenholtz,
Elad Yom-Tov
Abstract:
People increasingly turn to the Internet when they have a medical condition. The data they create during this process is a valuable source for medical research and for future health services. However, utilizing these data could come at a cost to user privacy. Thus, it is important to balance the perceived value that users assign to these data with the value of the services derived from them. Here we describe experiments where methods from Mechanism Design were used to elicit a truthful valuation from users for their Internet data and for services to screen people for medical conditions. In these experiments, 880 people from around the world were asked to participate in an auction to provide their data for uses differing in their contribution to the participant, to society, and in the disease they addressed. Some users were offered monetary compensation for their participation, while others were asked to pay to participate. Our findings show that 99% of people were willing to contribute their data in exchange for monetary compensation and an analysis of their data, while 53% were willing to pay to have their data analyzed. The average perceived value users assigned to their data was estimated at US$49. Their value to screen them for a specific cancer was US$22 while the value of this service offered to the general public was US$22. Participants requested higher compensation when notified that their data would be used to analyze a more severe condition. They were willing to pay more to have their data analyzed when the condition was more severe, when they had higher education or if they had recently experienced a serious medical condition.
Submitted 22 March, 2020; v1 submitted 20 June, 2019;
originally announced June 2019.
-
Demographic differences in search engine use with implications for cohort selection
Authors:
Elad Yom-Tov
Abstract:
The correlation between the demographics of users and the text they write has been investigated through literary texts and, more recently, social media. However, differences pertaining to language use in search engines have not been thoroughly analyzed, especially for age and gender differences. Such differences are important especially due to the growing use of search engine data in the study of human health, where queries are used to identify patient populations.
Using data from multiple general-purpose Internet search engines gathered over a period of one month we investigate the correlation between demography (age, gender, and income) and the text of queries submitted to search engines.
Our results show that females and younger people use longer queries. This difference is such that females make approximately 25% more queries with 10 or more words. In the case of queries which identify users as having specific medical conditions we find that females make 50% more queries than expected, and that this results in patient cohorts which are highly skewed in gender and age, compared to known gender balance.
Our results indicate that in studies where demographic representation is important, such as studies of health aspects of users or evaluations of search engines for fairness, care should be taken in the selection of search engine data so as to create a representative dataset.
Submitted 15 May, 2018;
originally announced May 2018.
-
Detecting Parkinson's Disease from interactions with a search engine: Is expert knowledge sufficient?
Authors:
Liron Allerhand,
Brit Youngmann,
Elad Yom-Tov,
David Arkadir
Abstract:
Parkinson's disease (PD) is a slowly progressing neurodegenerative disease with early manifestation of motor signs. Recently, there has been a growing interest in developing automatic tools that can assess motor function in PD patients. Here we show that mouse tracking data collected during people's interaction with a search engine can be used to distinguish PD patients from similar, non-diseased users and present a methodology developed for the diagnosis of PD from these data. A main challenge we address is the extraction of informative features from raw mouse tracking data. We do so in two complementary ways: First, we manually construct expert-recommended informative features, aiming to identify abnormalities in motor behaviors. Second, we use an unsupervised representation learning technique to map these raw data to high-level features. Using all the extracted features, a Random Forest classifier is then used to distinguish PD patients from controls, achieving an AUC of 0.92, while results using only expert-generated or auto-generated features are 0.87 and 0.83, respectively. Our results indicate that mouse tracking data can help in detecting users at early stages of the disease, and that both expert-generated features and unsupervised techniques for feature generation are required to achieve the best possible performance.
Submitted 3 May, 2018;
originally announced May 2018.
-
Characterizing Efficient Referrals in Social Networks
Authors:
Reut Apel,
Elad Yom-Tov,
Moshe Tennenholtz
Abstract:
Users of social networks often focus on specific areas of that network, leading to the well-known "filter bubble" effect. Connecting people to a new area of the network in a way that will cause them to become active in that area could help alleviate this effect and improve social welfare.
Here we present a preliminary analysis of network referrals, that is, attempts by users to connect peers to other areas of the network. We classify these referrals by their efficiency, i.e., the likelihood that a referral will result in a user becoming active in the new area of the network. We show that by using features describing past experience of the referring author and the content of their messages we are able to predict whether a referral will be effective, reaching an AUC of 0.87 for those users most experienced in writing efficient referrals. Our results represent a first step towards algorithmically constructing efficient referrals with the goal of mitigating the "filter bubble" effect pervasive in online social networks.
Submitted 1 May, 2018;
originally announced May 2018.
-
Microsoft Malware Classification Challenge
Authors:
Royi Ronen,
Marian Radu,
Corina Feuerstein,
Elad Yom-Tov,
Mansour Ahmadi
Abstract:
The Microsoft Malware Classification Challenge was announced in 2015 along with a publication of a huge dataset of nearly 0.5 terabytes, consisting of disassembly and bytecode of more than 20K malware samples. Apart from serving in the Kaggle competition, the dataset has become a standard benchmark for research on modeling malware behaviour. To date, the dataset has been cited in more than 50 research papers. Here we provide a high-level comparison of the publications citing the dataset. The comparison simplifies finding potential research directions in this field and future performance evaluation of the dataset.
Submitted 22 February, 2018;
originally announced February 2018.
-
Screening for cancer using a learning Internet advertising system
Authors:
Elad Yom-Tov
Abstract:
Studies have shown that the traces people leave when browsing the internet may indicate the onset of diseases such as cancer. Here we show that the adaptive engines of advertising systems in conjunction with clinically verified questionnaires can be used to identify people who are suspected of having one of three types of solid tumor cancers.
In the first study, 308 people were recruited through ads shown on the Bing search engine to complete a clinically verified risk questionnaire. A classifier trained to predict questionnaire response using only past queries on Bing reached an Area Under the Curve of 0.64 for all three cancer types, verifying that past searches could be used to identify people with suspected cancer.
The second study was conducted using the Google ads system in the same configuration as in the first study. However, in this study the ads system was set to automatically learn to identify people with suspected cancer. A total of 70,586 people were shown the ads, and 6,484 clicked and were referred to complete the clinical questionnaires. People from countries with higher Internet access and lower life expectancy tended to click more on the ads. Over time the advertisement system learned to identify people who were likely to have symptoms consistent with suspected cancer, such that the percentage of people completing the questionnaires and found to have suspected cancer reached approximately 11% at the end of the experiment.
These results demonstrate the utility of using search engine queries to screen for possible cancer and the application of modern advertising systems to help identify people who are likely suffering from serious medical conditions. This is especially true in countries where medical services are less developed.
Submitted 8 August, 2018; v1 submitted 21 February, 2018;
originally announced February 2018.
-
Discriminative Learning of Prediction Intervals
Authors:
Nir Rosenfeld,
Yishay Mansour,
Elad Yom-Tov
Abstract:
In this work we consider the task of constructing prediction intervals in an inductive batch setting. We present a discriminative learning framework which optimizes the expected error rate under a budget constraint on the interval sizes. Most current methods for constructing prediction intervals offer guarantees for a single new test point. Applying these methods to multiple test points can result in a high computational overhead and degraded statistical guarantees. By focusing on expected errors, our method allows for variability in the per-example conditional error rates. As we demonstrate both analytically and empirically, this flexibility can increase the overall accuracy, or alternatively, reduce the average interval size.
While the problem we consider is of a regressive flavor, the loss we use is combinatorial. This allows us to provide PAC-style, finite-sample guarantees. Computationally, we show that our original objective is NP-hard, and suggest a tractable convex surrogate. We conclude with a series of experimental evaluations.
Submitted 27 February, 2018; v1 submitted 16 October, 2017;
originally announced October 2017.
-
Evidence from web-based dietary search patterns to the role of B12 deficiency in chronic pain
Authors:
Eitan Giat,
Elad Yom-Tov
Abstract:
Profound vitamin B12 deficiency is a known cause of disease, but the role of low or intermediate levels of B12 in the development of neuropathy and other neuropsychiatric symptoms, as well as the relationship between eating meat and B12 levels, are unclear. Here we use food-related internet search patterns from a sample of 8.5 million US-based people as a proxy for B12 intake and correlate these searches with internet searches related to possible effects of B12 deficiency. Food-related search patterns are highly correlated with known consumption and food-related searches (Spearman 0.69). Awareness of B12 deficiency was associated with a higher consumption of B12-rich foods and with queries for B12 supplements. Searches for terms related to neurological disorders were correlated with searches for B12-poor foods, in contrast with control terms. Popular medicines, those having fewer indications, and those which are predominantly used to treat pain are more strongly correlated with the ability to predict neuropathic pain queries using the B12 contents of food. Our findings provide evidence for the utility of using Internet search patterns to investigate health questions in large populations and suggest that low B12 intake may be associated with a broader spectrum of neurological disorders than currently appreciated.
Submitted 8 August, 2017;
originally announced August 2017.
-
Modeling influenza-like illnesses through composite compartmental models
Authors:
Nir Levy,
Michael Iv,
Elad Yom-Tov
Abstract:
Epidemiological models for the spread of pathogens in a population are usually only able to describe a single pathogen. This makes their application unrealistic in cases where multiple pathogens with similar symptoms are spreading concurrently within the same population. Here we describe a method which makes possible the application of multiple single-strain models under minimal conditions. As such, our method provides a bridge between theoretical models of epidemiology and data-driven approaches for modeling of influenza and other similar viruses.
Our model extends the Susceptible-Infected-Recovered model to higher dimensions, allowing the modeling of a population infected by multiple viruses. We further provide a method, based on an overcomplete dictionary of feasible realizations of SIR solutions, to blindly partition the time series representing the number of infected people in a population into individual components, each representing the effect of a single pathogen.
We demonstrate the applicability of our proposed method on five years of seasonal influenza-like illness (ILI) rates, estimated from Twitter data. We demonstrate that our method describes, on average, 44% of the variance in the ILI time series. The individual infectious components derived from our model are matched to known viral profiles in the populations, and we demonstrate that these profiles match independently collected epidemiological data. We further show that the basic reproductive numbers ($R_0$) of the matched components are in the range known for these pathogens.
Our results suggest that the proposed method can be applied to other pathogens and geographies, providing a simple method for estimating the parameters of epidemics in a population.
Submitted 7 June, 2017;
originally announced June 2017.
-
Automatic Representation for Lifetime Value Recommender Systems
Authors:
Assaf Hallak,
Yishay Mansour,
Elad Yom-Tov
Abstract:
Many modern commercial sites employ recommender systems to propose relevant content to users. While most systems are focused on maximizing the immediate gain (clicks, purchases or ratings), a better notion of success would be the lifetime value (LTV) of the user-system interaction. The LTV approach considers the future implications of the item recommendation, and seeks to maximize the cumulative gain over time. The Reinforcement Learning (RL) framework is the standard formulation for optimizing cumulative successes over time. However, RL is rarely used in practice due to its associated representation, optimization and validation techniques which can be complex. In this paper we propose a new architecture for combining RL with recommendation systems which obviates the need for hand-tuned features, thus automating the state-space representation construction process. We analyze the practical difficulties in this formulation and test our solutions on batch off-line real-world recommendation data.
Submitted 23 February, 2017;
originally announced February 2017.
-
Predicting drug recalls from Internet search engine queries
Authors:
Elad Yom-Tov
Abstract:
Batches of pharmaceuticals are sometimes recalled from the market when a safety issue or a defect is detected in specific production runs of a drug. Such problems are usually detected when patients or healthcare providers report abnormalities to medical authorities. Here we test the hypothesis that defective production lots can be detected earlier by monitoring queries to Internet search engines.
We extracted queries from the USA to the Bing search engine which mentioned one of 5,195 pharmaceutical drugs during 2015 and all recall notifications issued by the Food and Drug Administration (FDA) during that year. By using attributes that quantify the change in query volume at the state level, we attempted to predict whether a recall of a specific drug will be ordered by the FDA in a time horizon ranging from one to 40 days in the future.
Our results show that future drug recalls can indeed be identified with an AUC of 0.791 and a lift at 5% of approximately 6 when predicting that a recall will occur one day ahead. This performance degrades as prediction is made for longer periods ahead. The most indicative attributes for prediction are sudden spikes in query volume about a specific medicine in each state. Recalls of prescription drugs and those estimated to be of medium risk are more likely to be identified using search query data.
These findings suggest that aggregated Internet search engine data can be used to facilitate early warning of faulty batches of medicines.
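For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how AUC and lift at 5% are commonly computed from predicted scores and binary outcomes. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `auc` and `lift_at` and their inputs are hypothetical, and lift is taken here to be the positive rate among the top-scoring fraction divided by the overall positive rate.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Hypothetical helper: the fraction of (positive, negative) pairs in which
    the positive example receives the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def lift_at(scores, labels, fraction=0.05):
    """Lift at the top `fraction` of examples ranked by score.

    Assumed definition: positive rate among the top-ranked examples divided
    by the positive rate over all examples.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    return top_rate / base_rate
```

Under this definition, a lift of about 6 at 5% would mean that the 5% of cases with the highest predicted scores contain roughly six times as many recalls as a random 5% sample would.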
Submitted 27 November, 2016;
originally announced November 2016.
-
Inferring individual attributes from search engine queries and auxiliary information
Authors:
Luca Soldaini,
Elad Yom-Tov
Abstract:
Internet data has surfaced as a primary source for investigation of different aspects of human behavior. A crucial step in such studies is finding a suitable cohort (i.e., a set of users) that shares a common trait of interest to researchers. However, direct identification of users sharing this trait is often impossible, as the data available to researchers is usually anonymized to preserve user privacy. To facilitate research on specific topics of interest, especially in medicine, we introduce an algorithm for identifying a trait of interest in anonymous users. We illustrate how a small set of labeled examples, together with statistical information about the entire population, can be aggregated to obtain labels on unseen examples. We validate our approach using labeled data from the political domain.
We provide two applications of the proposed algorithm to the medical domain. In the first, we demonstrate how to identify users whose search patterns indicate they might be suffering from certain types of cancer. In the second, we detail an algorithm to predict the distribution of diseases given their incidence in a subset of the population under study, making it possible to predict disease spread from partial epidemiological data.
Submitted 26 October, 2016;
originally announced October 2016.
-
Predicting Counterfactuals from Large Historical Data and Small Randomized Trials
Authors:
Nir Rosenfeld,
Yishay Mansour,
Elad Yom-Tov
Abstract:
When a new treatment is considered for use, whether a pharmaceutical drug or a search engine ranking algorithm, a typical question that arises is, will its performance exceed that of the current treatment? The conventional way to answer this counterfactual question is to estimate the effect of the new treatment in comparison to that of the conventional treatment by running a controlled, randomized experiment. While this approach theoretically ensures an unbiased estimator, it suffers from several drawbacks, including the difficulty in finding representative experimental populations as well as the cost of running such trials. Moreover, such trials neglect the huge quantities of available control-condition data which are often completely ignored.
In this paper we propose a discriminative framework for estimating the performance of a new treatment given a large dataset of the control condition and data from a small (and possibly unrepresentative) randomized trial comparing new and old treatments. Our objective, which requires minimal assumptions on the treatments, models the relation between the outcomes of the different conditions. This allows us to not only estimate mean effects but also to generate individual predictions for examples outside the randomized sample.
We demonstrate the utility of our approach through experiments in three areas: search engine operation, treatments for diabetes patients, and market value estimation for houses. Our results demonstrate that our approach can reduce the number and size of the currently performed randomized controlled experiments, thus saving significant time, money and effort on the part of practitioners.
Submitted 26 October, 2016; v1 submitted 24 October, 2016;
originally announced October 2016.
-
A Reinforcement Learning System to Encourage Physical Activity in Diabetes Patients
Authors:
Irit Hochberg,
Guy Feraru,
Mark Kozdoba,
Shie Mannor,
Moshe Tennenholtz,
Elad Yom-Tov
Abstract:
Regular physical activity is known to be beneficial to people suffering from diabetes type 2. Nevertheless, most such people are sedentary. Smartphones create new possibilities for helping people to adhere to their physical activity goals, through continuous monitoring and communication, coupled with personalized feedback.
We provided 27 sedentary diabetes type 2 patients with a smartphone-based pedometer and a personal plan for physical activity. Patients were sent SMS messages to encourage physical activity between once a day and once per week. Messages were personalized through a Reinforcement Learning (RL) algorithm which optimized messages to improve each participant's compliance with the activity regimen. The RL algorithm was compared to a static policy for sending messages and to weekly reminders.
Our results show that participants who received messages generated by the RL algorithm increased the amount of activity and pace of walking, while the control group patients did not. Patients assigned to the RL algorithm group experienced a superior reduction in blood glucose levels (HbA1c) compared to control policies, and longer participation caused greater reductions in blood glucose levels. The learning algorithm improved gradually in predicting which messages would lead participants to exercise.
Our results suggest that a mobile phone application coupled with a learning algorithm can improve adherence to exercise in diabetic patients. As a learning algorithm is automated, and delivers personalized messages, it could be used in large populations of diabetic patients to improve health and glycemic control. Our results can be expanded to other areas where computer-led health coaching of humans may have a positive impact.
Submitted 13 May, 2016;
originally announced May 2016.
-
On the Effect of Human-Computer Interfaces on Language Expression
Authors:
Dan Pelleg,
Elad Yom-Tov,
Evgeniy Gabrilovich
Abstract:
Language expression is known to be dependent on attributes intrinsic to the author. To date, however, little attention has been devoted to the effect of interfaces used to articulate language on its expression. Here we study a large corpus of text written using different input devices and show that writers unconsciously prefer different letters depending on the interplay between their individual traits (e.g., hand laterality and injuries) and the layout of keyboards. Our results show, for the first time, how the interplay between technology and its users modifies language expression.
Submitted 1 May, 2015;
originally announced May 2015.
-
Echo chamber amplification and disagreement effects in the political activity of Twitter users
Authors:
Kirill Dyagilev,
Elad Yom-Tov
Abstract:
Online social networks have emerged as a significant platform for political discourse. In this paper we investigate what affects the level of participation of users in the political discussion. Specifically, are users more likely to be active when they are surrounded by like-minded individuals, or, alternatively, when their environment is heterogeneous, and so their messages might be carried to people with differing views? To answer this question, we analyzed the activity of about 200K Twitter users who expressed explicit support for one of the candidates of the 2012 US presidential election. We quantified the level of political activity (PA) of users by the fraction of political tweets in their posts, and analyzed the relationship between PA and measures of the users' political environment. These measures were designed to assess the likemindedness, e.g., the fraction of users with similar political views, of their virtual and geographic environments. Our results showed that high PA is usually obtained by users in a politically balanced virtual environment. This is in line with the disagreement theory of political science, which states that a user's PA is invigorated by disagreement with their peers. Our results also show that users surrounded by politically like-minded virtual peers tend to have low PA. This observation contradicts the echo chamber amplification theory, which states that a person tends to be more politically active when surrounded by like-minded people. Finally, we observe that the likemindedness of the geographical environment does not affect PA. We thus conclude that the PA of users is independent of the likemindedness of their geographical environment and is correlated with the likemindedness of their virtual environment. The exact form of correlation manifests the phenomenon of disagreement and, in a majority of settings, contradicts the echo chamber amplification theory.
Submitted 17 June, 2014; v1 submitted 19 March, 2014;
originally announced March 2014.
-
A Gaussian Belief Propagation Solver for Large Scale Support Vector Machines
Authors:
Danny Bickson,
Elad Yom-Tov,
Danny Dolev
Abstract:
Support vector machines (SVMs) are an extremely successful type of classification and regression algorithm. Building an SVM entails solving a constrained convex quadratic programming problem, which is quadratic in the number of training samples. We introduce an efficient parallel implementation of a support vector regression solver, based on the Gaussian Belief Propagation algorithm (GaBP).
In this paper, we demonstrate that methods from the complex system domain could be utilized for performing efficient distributed computation. We compare the proposed algorithm to previously proposed distributed and single-node SVM solvers. Our comparison shows that the proposed algorithm is just as accurate as these solvers, while being significantly faster, especially for large datasets. We demonstrate scalability of the proposed algorithm to up to 1,024 computing nodes and hundreds of thousands of data points using an IBM Blue Gene supercomputer. As far as we know, our work is the largest parallel implementation of belief propagation ever done, demonstrating the applicability of this algorithm for large scale distributed computing systems.
Submitted 9 October, 2008;
originally announced October 2008.