-
Retrieval-Enhanced Machine Learning: Synthesis and Opportunities
Authors:
To Eun Kim,
Alireza Salemi,
Andrew Drozdov,
Fernando Diaz,
Hamed Zamani
Abstract:
In the field of language modeling, models augmented with retrieval components have emerged as a promising solution to address several challenges faced in the natural language processing (NLP) field, including knowledge grounding, interpretability, and scalability. Despite the primary focus on NLP, we posit that the paradigm of retrieval-enhancement can be extended to a broader spectrum of machine learning (ML) such as computer vision, time series prediction, and computational biology. Therefore, this work introduces a formal framework of this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature across various ML domains with a consistent notation, which is missing from the current literature. We also find that while a number of studies employ retrieval components to augment their models, there is a lack of integration with foundational Information Retrieval (IR) research. We bridge this gap between the seminal IR research and contemporary REML studies by investigating each component that comprises the REML framework. Ultimately, the goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.
Submitted 17 July, 2024;
originally announced July 2024.
-
Extrinsic Evaluation of Cultural Competence in Large Language Models
Authors:
Shaily Bhatt,
Fernando Diaz
Abstract:
Productive interactions between diverse users and language technologies require outputs from the latter to be culturally relevant and sensitive. Prior works have evaluated models' knowledge of cultural norms, values, and artifacts, without considering how this knowledge manifests in downstream applications. In this work, we focus on extrinsic evaluation of cultural competence in two text generation tasks, open-ended question answering and story generation. We quantitatively and qualitatively evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. Although we find that model outputs do vary when varying nationalities and feature culturally relevant words, we also find weak correlations between text similarity of outputs for different countries and the cultural values of these countries. Finally, we discuss important considerations in designing comprehensive evaluation of cultural competence in user-facing tasks.
Submitted 19 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Capturing waste collection planning expert knowledge in a fitness function through preference learning
Authors:
Laura Fernández Díaz,
Miriam Fernández Díaz,
José Ramón Quevedo,
Elena Montañés
Abstract:
This paper addresses the COGERSA waste collection process. Until now, experts have designed the process manually through trial and error. The process is not globally optimized, since it has been built progressively and locally as council demands appeared. Planning optimization algorithms usually solve such problems, but they need a fitness function to evaluate the quality of a route plan. The drawback is that even experts are unable to propose one in a straightforward way, due to the complexity of the process. Hence, the goal of this paper is to build a fitness function through a preference framework, taking advantage of the available expert knowledge and expertise. Several key performance indicators, together with preference judgments, are carefully established with the experts in order to learn a promising fitness function. In particular, the additivity property of these indicators makes the task much more affordable, since it allows working with routes rather than with route plans. In addition, a feature selection analysis is performed over these indicators, since the experts suspect a potential (but unknown) redundancy among them. The experimental results confirm this hypothesis, since the best $C$-index ($98\%$ against around $94\%$) is reached when 6 or 8 out of 21 indicators are used. In particular, truck load seems to be a highly promising key performance indicator, together with the distance travelled along non-main roads. A comparison with other existing approaches shows that the proposed method clearly outperforms them, since the $C$-index goes from $72\%$ or $90\%$ to $98\%$.
Submitted 2 February, 2024;
originally announced February 2024.
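The C-index reported in the abstract above can be illustrated with a minimal Python sketch. It assumes that expert knowledge is available as pairwise judgments (better route, worse route), that a higher fitness value means a better route, and that ties earn half credit; none of these details are stated in the abstract, and the name `c_index` and the route labels are hypothetical.

```python
def c_index(scores, preferences):
    """Fraction of expert preference pairs that the fitness scores order correctly.

    scores: dict mapping a route label to its fitness value (assumed: higher is better).
    preferences: list of (better, worse) route-label pairs given by experts.
    """
    if not preferences:
        raise ValueError("at least one preference pair is required")
    concordant = 0.0
    for better, worse in preferences:
        if scores[better] > scores[worse]:
            concordant += 1.0   # fitness agrees with the expert
        elif scores[better] == scores[worse]:
            concordant += 0.5   # tie: half credit (an assumption of this sketch)
    return concordant / len(preferences)


# Hypothetical fitness values for three routes and three expert judgments.
scores = {"r1": 3.0, "r2": 2.0, "r3": 2.0}
preferences = [("r1", "r2"), ("r2", "r3"), ("r3", "r1")]
value = c_index(scores, preferences)
```

A value of 1.0 would mean the learned fitness function reproduces every expert judgment, and 0.5 is what a random scoring would achieve on average.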
-
Fairness and Consensus in Opinion Models (Technical Report)
Authors:
Jesús Aranda,
Sebastián Betancourt,
Juan Fco. Díaz,
Frank Valencia
Abstract:
We introduce a DeGroot-based model for opinion dynamics in social networks. A community of agents is represented as a weighted directed graph whose edges indicate how much agents influence one another. The model is formalized using labeled transition systems, henceforth called opinion transition systems (OTS), whose states represent the agents' opinions and whose actions are the edges of the influence graph. If a transition labeled $(i,j)$ is performed, agent $j$ updates their opinion taking into account the opinion of agent $i$ and the influence $i$ has over $j$. We study (convergence to) opinion consensus among the agents of strongly-connected graphs with influence values in the interval $(0,1)$. We show that consensus cannot be guaranteed under the standard strong fairness assumption on transition systems. We derive that consensus is guaranteed under a stronger notion from the literature of concurrent systems: bounded fairness. We argue that bounded fairness is too strong a notion for consensus, as it almost surely rules out random runs and it is not a constructive liveness property. We introduce a weaker fairness notion, called $m$-bounded fairness, and show that it guarantees consensus. The new notion includes almost surely all random runs and it is a constructive liveness property. Finally, we consider OTS with dynamic influence and show that convergence to consensus holds under $m$-bounded fairness if the influence changes within a fixed interval $[L,U]$ with $0<L<U<1$. We illustrate OTS with examples and simulations, offering insights into opinion formation under fairness and dynamic influence.
Submitted 11 July, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
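As a rough illustration of the opinion dynamics described in the abstract above, the following Python sketch performs one pairwise update per transition $(i,j)$. The update rule, opinion of $j$ plus influence times the difference to the opinion of $i$, is an assumption (a common DeGroot-style form; the abstract does not state the exact formula), and the names `step` and `run_round_robin` are hypothetical. Round-robin scheduling is used as one simple schedule in which every edge fires within a fixed window of steps.

```python
def step(opinions, influence, edge):
    """Perform the transition labeled (i, j): agent j moves toward agent i.

    Assumed rule: o_j <- o_j + w_ij * (o_i - o_j), with w_ij in (0, 1).
    """
    i, j = edge
    weight = influence[(i, j)]
    updated = list(opinions)
    updated[j] = opinions[j] + weight * (opinions[i] - opinions[j])
    return updated


def run_round_robin(opinions, influence, rounds):
    """Fire every edge once per round, in a fixed order."""
    edges = sorted(influence)
    for _ in range(rounds):
        for edge in edges:
            opinions = step(opinions, influence, edge)
    return opinions


# A strongly connected 3-cycle with all influence values equal to 0.5.
influence = {(0, 1): 0.5, (1, 2): 0.5, (2, 0): 0.5}
final = run_round_robin([0.0, 0.5, 1.0], influence, rounds=200)
spread = max(final) - min(final)  # shrinks toward 0 as opinions converge
```

Each update is a convex combination of two opinions, so values never leave the range spanned by the initial opinions; under this schedule the spread between the most extreme agents contracts every round.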
-
Density-based User Representation using Gaussian Process Regression for Multi-interest Personalized Retrieval
Authors:
Haolun Wu,
Ofer Meshi,
Masrour Zoghi,
Fernando Diaz,
Xue Liu,
Craig Boutilier,
Maryam Karimzadehgan
Abstract:
Accurate modeling of the diverse and dynamic interests of users remains a significant challenge in the design of personalized recommender systems. Existing user modeling methods, like single-point and multi-point representations, have limitations with respect to accuracy, diversity, and adaptability. To overcome these deficiencies, we introduce density-based user representations (DURs), a novel method that leverages Gaussian process regression (GPR) for effective multi-interest recommendation and retrieval. Our approach, GPR4DUR, exploits DURs to capture user interest variability without manual tuning, incorporates uncertainty-awareness, and scales well to large numbers of users. Experiments using real-world offline datasets confirm the adaptability and efficiency of GPR4DUR, while online experiments with simulated users demonstrate its ability to address the exploration-exploitation trade-off by effectively utilizing model uncertainty.
Submitted 22 May, 2024; v1 submitted 30 October, 2023;
originally announced October 2023.
-
Distributionally-Informed Recommender System Evaluation
Authors:
Michael D. Ekstrand,
Ben Carterette,
Fernando Diaz
Abstract:
Current practice for evaluating recommender systems typically focuses on point estimates of user-oriented effectiveness metrics or business metrics, sometimes combined with additional metrics for considerations such as diversity and novelty. In this paper, we argue for the need for researchers and practitioners to attend more closely to various distributions that arise from a recommender system (or other information access system) and the sources of uncertainty that lead to these distributions. One immediate implication of our argument is that both researchers and practitioners must report and examine more thoroughly the distribution of utility between and within different stakeholder groups. However, distributions of various forms arise in many more aspects of the recommender systems experimental process, and distributional thinking has substantial ramifications for how we design, evaluate, and present recommender systems evaluation and research results. Leveraging and emphasizing distributions in the evaluation of recommender systems is a necessary step to ensure that the systems provide appropriate and equitably-distributed benefit to the people they affect.
Submitted 11 September, 2023;
originally announced September 2023.
-
Fairness Through Domain Awareness: Mitigating Popularity Bias For Music Discovery
Authors:
Rebecca Salganik,
Fernando Diaz,
Golnoosh Farnadi
Abstract:
As online music platforms grow, music recommender systems play a vital role in helping users navigate and discover content within their vast musical databases. At odds with this larger goal is the presence of popularity bias, which causes algorithmic systems to favor mainstream content over potentially more relevant but niche items. In this work we explore the intrinsic relationship between music discovery and popularity bias. To mitigate this issue we propose a domain-aware, individual fairness-based approach which addresses popularity bias in graph neural network (GNN) based recommender systems. Our approach uses individual fairness to reflect a ground truth listening experience, i.e., if two songs sound similar, this similarity should be reflected in their representations. In doing so, we facilitate meaningful music discovery that is robust to popularity bias and grounded in the music domain. We apply our BOOST methodology to two discovery-based tasks, performing recommendations at both the playlist level and user level. Then, we ground our evaluation in the cold start setting, showing that our approach outperforms existing fairness benchmarks in both performance and recommendation of lesser-known content. Finally, our analysis explains why our proposed methodology is a novel and promising approach to mitigating popularity bias and improving the discovery of new and niche content in music recommender systems.
Submitted 28 August, 2023;
originally announced August 2023.
-
The Impact of Group Membership Bias on the Quality and Fairness of Exposure in Ranking
Authors:
Ali Vardasbi,
Maarten de Rijke,
Fernando Diaz,
Mostafa Dehghani
Abstract:
When learning to rank from user interactions, search and recommender systems must address biases in user behavior to provide a high-quality ranking. One type of bias that has recently been studied in the ranking literature is when sensitive attributes, such as gender, have an impact on a user's judgment about an item's utility. For example, in a search for an expertise area, some users may be biased towards clicking on male candidates over female candidates. We call this type of bias group membership bias. Increasingly, we seek rankings that are fair to individuals and sensitive groups. Merit-based fairness measures rely on the estimated utility of the items. With group membership bias, the utility of the sensitive groups is underestimated; hence, without correcting for this bias, a supposedly fair ranking is not truly fair. In this paper, we first analyze the impact of group membership bias on ranking quality as well as merit-based fairness metrics and show that group membership bias can hurt both ranking and fairness. Then, we provide a correction method for group bias that is based on the assumption that the utility score of items in different groups comes from the same distribution. This assumption has two potential issues, sparsity and equality-instead-of-equity; we use an amortized approach to address these. We show that our correction method can consistently compensate for the negative impact of group membership bias on ranking quality and fairness metrics.
Submitted 29 April, 2024; v1 submitted 5 August, 2023;
originally announced August 2023.
-
Scaling Laws Do Not Scale
Authors:
Fernando Diaz,
Michael Madaio
Abstract:
Recent work has proposed a power law relationship, referred to as "scaling laws," between the performance of artificial intelligence (AI) models and aspects of those models' design (e.g., dataset size). In other words, as the size of a dataset (or model parameters, etc.) increases, the performance of a given model trained on that dataset will correspondingly increase. However, while compelling in the aggregate, this scaling law relationship overlooks the ways that metrics used to measure performance may be precarious and contested, or may not correspond with how different groups of people may perceive the quality of models' output. In this paper, we argue that as the size of datasets used to train large AI models grows, the number of distinct communities (including demographic groups) whose data is included in a given dataset is likely to grow, each of whom may have different values. As a result, there is an increased risk that communities represented in a dataset may have values or preferences not captured by (or in the worst case, at odds with) the metrics used to evaluate model performance for scaling laws. We end the paper with implications for AI scaling laws -- that models may not, in fact, continue to improve as the datasets get larger -- at least not for all people or communities impacted by those models.
Submitted 5 July, 2023;
originally announced July 2023.
-
Best-Case Retrieval Evaluation: Improving the Sensitivity of Reciprocal Rank with Lexicographic Precision
Authors:
Fernando Diaz
Abstract:
Across a variety of ranking tasks, researchers use reciprocal rank to measure the effectiveness for users interested in exactly one relevant item. Despite its widespread use, evidence suggests that reciprocal rank is brittle when discriminating between systems. This brittleness, in turn, is compounded in modern evaluation settings where current, high-precision systems may be difficult to distinguish. We address the lack of sensitivity of reciprocal rank by introducing and connecting it to the concept of best-case retrieval, an evaluation method focusing on assessing the quality of a ranking for the most satisfied possible user across possible recall requirements. This perspective allows us to generalize reciprocal rank and define a new preference-based evaluation we call lexicographic precision or lexiprecision. By mathematical construction, we ensure that lexiprecision preserves differences detected by reciprocal rank, while empirically improving sensitivity and robustness across a broad set of retrieval and recommendation tasks.
Submitted 13 June, 2023;
originally announced June 2023.
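For readers unfamiliar with the metric discussed in the abstract above, here is a minimal Python sketch of reciprocal rank. The function name `reciprocal_rank` and the document identifiers are hypothetical, and the sketch does not implement lexiprecision, whose definition the abstract does not give.

```python
def reciprocal_rank(ranking, relevant):
    """Return 1 / (position of the first relevant item), or 0.0 if none is ranked."""
    for position, item in enumerate(ranking, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


# Two hypothetical system outputs for the same query.
relevant = {"d1", "d2"}
run_a = ["d1", "x", "d2", "y"]  # second relevant item at position 3
run_b = ["d1", "x", "y", "d2"]  # second relevant item at position 4

rr_a = reciprocal_rank(run_a, relevant)
rr_b = reciprocal_rank(run_b, relevant)
```

Both runs place a relevant item first, so reciprocal rank scores them identically even though `run_a` surfaces the second relevant item earlier. Ties of this kind are one way the limited sensitivity mentioned in the abstract can show up when comparing high-precision systems.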
-
ExoplANNET: A deep learning algorithm to detect and identify planetary signals in radial velocity data
Authors:
L. A. Nieto,
R. F. Díaz
Abstract:
The detection of exoplanets with the radial velocity method consists in detecting variations of the stellar velocity caused by an unseen sub-stellar companion. Instrumental errors, irregular time sampling, and different noise sources originating in the intrinsic variability of the star can hinder the interpretation of the data, and even lead to spurious detections. Recently, work has begun to emerge in the field of extrasolar planets that uses machine learning algorithms, some with results that exceed those obtained with the traditional techniques in the field. We seek to explore the scope of neural networks in the radial velocity method, in particular for exoplanet detection in the presence of correlated noise of stellar origin. In this work, a neural network is proposed to replace the computation of the significance of the signal detected with the radial velocity method and to classify it as of planetary origin or not. The algorithm is trained using synthetic data of systems with and without planetary companions. We injected realistic correlated noise in the simulations, based on previous studies of the behaviour of stellar activity. The performance of the network is compared to the traditional method based on null hypothesis significance testing. The network achieves 28% fewer false positives. The improvement is observed mainly in the detection of small-amplitude signals associated with low-mass planets. In addition, its execution time is five orders of magnitude faster than the traditional method. The superior performance exhibited by the algorithm has only been tested on simulated radial velocity data so far. Although in principle it should be straightforward to adapt it for use in real time series, its performance has to be tested thoroughly. Future work should permit evaluating its potential for adoption as a valuable tool for exoplanet detection.
Submitted 1 July, 2023; v1 submitted 16 March, 2023;
originally announced March 2023.
-
Recall, Robustness, and Lexicographic Evaluation
Authors:
Fernando Diaz,
Bhaskar Mitra
Abstract:
Although originally developed to evaluate sets of items, recall is often used to evaluate rankings of items, including those produced by recommender, retrieval, and other machine learning systems. The application of recall without a formal evaluative motivation has led to criticism of recall as a vague or inappropriate measure. In light of this debate, we reflect on the measurement of recall in rankings from a formal perspective. Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation. First, we formally define 'recall-orientation' as the sensitivity of a metric to a user interested in finding every relevant item. Second, we analyze recall-orientation from the perspective of robustness with respect to possible content consumers and providers, connecting recall to recent conversations about fair ranking. Finally, we extend this conceptual and theoretical treatment of recall by developing a practical preference-based evaluation method based on lexicographic comparison. Through extensive empirical analysis across three recommendation tasks and 17 information retrieval tasks, we establish that our new evaluation method, lexirecall, has convergent validity (i.e., it is correlated with existing recall metrics) and exhibits substantially higher sensitivity in terms of discriminative power and stability in the presence of missing labels. Our conceptual, theoretical, and empirical analysis substantially deepens our understanding of recall and motivates its adoption through connections to robustness and fairness.
Submitted 8 March, 2024; v1 submitted 22 February, 2023;
originally announced February 2023.
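The abstract above notes that recall, a set-based measure, is often applied to rankings. One conventional way of doing so is to truncate the ranking at a cutoff; the Python sketch below shows that convention only. The name `recall_at_k` and the item labels are hypothetical, and lexirecall itself is not implemented here.

```python
def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant items that appear in the top k of the ranking."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant items")
    retrieved = set(ranking[:k])
    return len(retrieved & set(relevant)) / len(relevant)


# Hypothetical ranking with two relevant items, "a" and "d".
ranking = ["a", "b", "c", "d"]
relevant = {"a", "d"}
shallow = recall_at_k(ranking, relevant, k=2)  # only "a" is in the top 2
deep = recall_at_k(ranking, relevant, k=4)     # both relevant items are retrieved
```

The result depends on the cutoff `k`, which is one reason a formal motivation for recall in rankings, as pursued in the paper, is useful.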
-
Commonality in Recommender Systems: Evaluating Recommender Systems to Enhance Cultural Citizenship
Authors:
Andres Ferraro,
Gustavo Ferreira,
Fernando Diaz,
Georgina Born
Abstract:
Recommender systems have become the dominant means of curating cultural content, significantly influencing individual cultural experience. Since recommender systems tend to optimize for personalized user experience, they can overlook impacts on cultural experience in the aggregate. After demonstrating that existing metrics do not center culture, we introduce a new metric, commonality, that measures the degree to which recommendations familiarize a given user population with specified categories of cultural content. We developed commonality through an interdisciplinary dialogue between researchers in computer science and the social sciences and humanities. With reference to principles underpinning public service media systems in democratic societies, we identify universality of address and content diversity in the service of strengthening cultural citizenship as particularly relevant goals for recommender systems delivering cultural content. We develop commonality as a measure of recommender system alignment with the promotion of content toward a shared cultural experience across a population of users. We empirically compare the performance of recommendation algorithms using commonality with existing metrics, demonstrating that commonality captures a novel property of system behavior complementary to existing metrics. Alongside existing fairness and diversity metrics, commonality contributes to a growing body of scholarship developing 'public good' rationales for machine learning systems.
Submitted 22 February, 2023; v1 submitted 22 February, 2023;
originally announced February 2023.
-
Unsupervised Mandarin-Cantonese Machine Translation
Authors:
Megan Dare,
Valentina Fajardo Diaz,
Averie Ho Zoen So,
Yifan Wang,
Shibingfeng Zhang
Abstract:
Advancements in unsupervised machine translation have enabled the development of machine translation systems that can translate between languages for which there is not an abundance of parallel data available. We explored unsupervised machine translation between Mandarin Chinese and Cantonese. Despite the vast number of native speakers of Cantonese, there is still no large-scale corpus for the language, due to the fact that Cantonese is primarily used for oral communication. The key contributions of our project include: 1. The creation of a new corpus containing approximately 1 million Cantonese sentences, and 2. A large-scale comparison across different model architectures, tokenization schemes, and embedding structures. Our best model trained with character-based tokenization and a Transformer architecture achieved a character-level BLEU of 25.1 when translating from Mandarin to Cantonese and of 24.4 when translating from Cantonese to Mandarin. In this paper we discuss our research process, experiments, and results.
Submitted 10 January, 2023;
originally announced January 2023.
-
Redefining Relationships in Music
Authors:
Christian Detweiler,
Beth Coleman,
Fernando Diaz,
Lieke Dom,
Chris Donahue,
Jesse Engel,
Cheng-Zhi Anna Huang,
Larry James,
Ethan Manilow,
Amanda McCroskery,
Kyle Pedersen,
Pamela Peter-Agbia,
Negar Rostamzadeh,
Robert Thomas,
Marco Zamarato,
Ben Zevenbergen
Abstract:
AI tools increasingly shape how we discover, make and experience music. While these tools can have the potential to empower creativity, they may fundamentally redefine relationships between stakeholders, to the benefit of some and the detriment of others. In this position paper, we argue that these tools will fundamentally reshape our music culture, with profound effects (for better and for worse) on creators, consumers and the commercial enterprises that often connect them. By paying careful attention to emerging Music AI technologies and developments in other creative domains and understanding the implications, people working in this space could decrease the possible negative impacts on the practice, consumption and meaning of music. Given that many of these technologies are already available, there is some urgency in conducting analyses of these technologies now. It is important that people developing and working with these tools address these issues now to help guide their evolution to be equitable and empower creativity. We identify some potential risks and opportunities associated with existing and forthcoming AI tools for music, though more work is needed to identify concrete actions which leverage the opportunities while mitigating risks.
Submitted 16 December, 2022; v1 submitted 13 December, 2022;
originally announced December 2022.
-
Striving for data-model efficiency: Identifying data externalities on group performance
Authors:
Esther Rolf,
Ben Packer,
Alex Beutel,
Fernando Diaz
Abstract:
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance. In this work, we seek to better understand how we might characterize, detect, and design for data-model synergies. We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population, a phenomenon we refer to as negative data externalities on group performance. Such externalities can arise in standard learning settings and can manifest differently depending on conditions between training set size and model size. Data externalities directly imply a lower bound on feasible model improvements, yet improving models efficiently requires understanding the underlying data-model tensions. From a broader perspective, our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
Submitted 11 November, 2022;
originally announced November 2022.
-
Retrieval Augmentation for T5 Re-ranker using External Sources
Authors:
Kai Hui,
Tao Chen,
Zhen Qin,
Honglei Zhuang,
Fernando Diaz,
Mike Bendersky,
Don Metzler
Abstract:
Retrieval augmentation has shown promising improvements in different tasks. However, whether such augmentation can assist a large language model based re-ranker remains unclear. We investigate how to augment T5-based re-rankers using high-quality information retrieved from two external corpora -- a commercial web search engine and Wikipedia. We empirically demonstrate how retrieval augmentation can substantially improve the effectiveness of T5-based re-rankers for both in-domain and zero-shot out-of-domain re-ranking tasks.
Submitted 11 October, 2022;
originally announced October 2022.
-
Analyzing the Effect of Sampling in GNNs on Individual Fairness
Authors:
Rebecca Salganik,
Fernando Diaz,
Golnoosh Farnadi
Abstract:
Graph neural network (GNN) based methods have saturated the field of recommender systems. The gains of these systems have been significant, showing the advantages of interpreting data through a network structure. However, despite the noticeable benefits of using graph structures in recommendation tasks, this representational form has also bred new challenges which exacerbate the complexity of mitigating algorithmic bias. When GNNs are integrated into downstream tasks, such as recommendation, bias mitigation can become even more difficult. Furthermore, the intractability of applying existing methods of fairness promotion to large, real world datasets places even more serious constraints on mitigation attempts. Our work sets out to fill in this gap by taking an existing method for promoting individual fairness on graphs and extending it to support mini-batch, or sub-sample based, training of a GNN, thus laying the groundwork for applying this method to a downstream recommendation task. We evaluate two popular GNN methods: Graph Convolutional Network (GCN), which trains on the entire graph, and GraphSAGE, which uses probabilistic random walks to create subgraphs for mini-batch training, and assess the effects of sub-sampling on individual fairness. We implement an individual fairness notion called REDRESS, proposed by Dong et al., which uses rank optimization to learn individually fair node, or item, embeddings. We empirically show on two real world datasets that GraphSAGE achieves not just comparable accuracy but also improved fairness compared with the GCN model. These findings have consequential ramifications for individual fairness promotion, GNNs, and, in downstream form, recommender systems, showing that mini-batch training facilitates individual fairness promotion by allowing local nuance to guide the process of fairness promotion in representation learning.
Submitted 9 September, 2022; v1 submitted 8 September, 2022;
originally announced September 2022.
-
Measuring Commonality in Recommendation of Cultural Content: Recommender Systems to Enhance Cultural Citizenship
Authors:
Andres Ferraro,
Gustavo Ferreira,
Fernando Diaz,
Georgina Born
Abstract:
Recommender systems have become the dominant means of curating cultural content, significantly influencing the nature of individual cultural experience. While the majority of research on recommender systems optimizes for personalized user experience, this paradigm does not capture the ways that recommender systems impact cultural experience in the aggregate, across populations of users. Although existing novelty, diversity, and fairness studies probe how systems relate to the broader social role of cultural content, they do not adequately center culture as a core concept and challenge. In this work, we introduce commonality as a new measure that reflects the degree to which recommendations familiarize a given user population with specified categories of cultural content. Our proposed commonality metric responds to a set of arguments developed through an interdisciplinary dialogue between researchers in computer science and the social sciences and humanities. With reference to principles underpinning non-profit, public service media systems in democratic societies, we identify universality of address and content diversity in the service of strengthening cultural citizenship as particularly relevant goals for recommender systems delivering cultural content. Taking diversity in movie recommendation as a case study in enhancing pluralistic cultural experience, we empirically compare systems' performance using commonality and existing utility, diversity, and fairness metrics. Our results demonstrate that commonality captures a property of system behavior complementary to existing metrics and suggest the need for alternative, non-personalized interventions in recommender systems oriented to strengthening cultural citizenship across populations of users. In this way, commonality contributes to a growing body of scholarship developing 'public good' rationales for digital media and ML systems.
Submitted 2 August, 2022;
originally announced August 2022.
-
On Natural Language User Profiles for Transparent and Scrutable Recommendation
Authors:
Filip Radlinski,
Krisztian Balog,
Fernando Diaz,
Lucas Dixon,
Ben Wedin
Abstract:
Natural interaction with recommendation and personalized search systems has received tremendous attention in recent years. We focus on the challenge of supporting people's understanding and control of these systems and explore a fundamentally new way of thinking about representation of knowledge in recommendation and personalization systems. Specifically, we argue that it may be both desirable and possible for algorithms that use natural language representations of users' preferences to be developed. We make the case that this could provide significantly greater transparency, as well as affordances for practical actionable interrogation of, and control over, recommendations. Moreover, we argue that such an approach, if successfully applied, may enable a major step towards systems that rely less on noisy implicit observations while increasing portability of knowledge of one's interests.
Submitted 19 May, 2022;
originally announced May 2022.
-
Retrieval-Enhanced Machine Learning
Authors:
Hamed Zamani,
Fernando Diaz,
Mostafa Dehghani,
Donald Metzler,
Michael Bendersky
Abstract:
Although information access systems have long supported people in accomplishing a wide range of tasks, we propose broadening the scope of users of information access systems to include task-driven machines, such as machine learning models. In this way, the core principles of indexing, representation, retrieval, and ranking can be applied and extended to substantially improve model generalization, scalability, robustness, and interpretability. We describe a generic retrieval-enhanced machine learning (REML) framework, which includes a number of existing models as special cases. REML challenges information retrieval conventions, presenting opportunities for novel advances in core areas, including optimization. The REML research agenda lays a foundation for a new style of information access research and paves a path towards advancing machine learning and artificial intelligence.
Submitted 2 May, 2022;
originally announced May 2022.
-
Joint Multisided Exposure Fairness for Recommendation
Authors:
Haolun Wu,
Bhaskar Mitra,
Chen Ma,
Fernando Diaz,
Xue Liu
Abstract:
Prior research on exposure fairness in the context of recommender systems has focused mostly on disparities in the exposure of individual or groups of items to individual users of the system. The problem of how individual or groups of items may be systemically under or over exposed to groups of users, or even all users, has received relatively less attention. However, such systemic disparities in information exposure can result in observable social harms, such as withholding economic opportunities from historically marginalized groups (allocative harm) or amplifying gendered and racialized stereotypes (representational harm). Previously, Diaz et al. developed the expected exposure metric -- that incorporates existing user browsing models that have previously been developed for information retrieval -- to study fairness of content exposure to individual users. We extend their proposed framework to formalize a family of exposure fairness metrics that model the problem jointly from the perspective of both the consumers and producers. Specifically, we consider group attributes for both types of stakeholders to identify and mitigate fairness concerns that go beyond individual users and items towards more systemic biases in recommendation. Furthermore, we study and discuss the relationships between the different exposure fairness dimensions proposed in this paper, as well as demonstrate how stochastic ranking policies can be optimized towards said fairness goals.
Submitted 29 April, 2022;
originally announced May 2022.
-
Offline Retrieval Evaluation Without Evaluation Metrics
Authors:
Fernando Diaz,
Andres Ferraro
Abstract:
Offline evaluation of information retrieval and recommendation has traditionally focused on distilling the quality of a ranking into a scalar metric such as average precision or normalized discounted cumulative gain. We can use this metric to compare the performance of multiple systems for the same request. Although evaluation metrics provide a convenient summary of system performance, they also collapse subtle differences across users into a single number and can carry assumptions about user behavior and utility not supported across retrieval scenarios. We propose recall-paired preference (RPP), a metric-free evaluation method based on directly computing a preference between ranked lists. RPP simulates multiple user subpopulations per query and compares systems across these pseudo-populations. Our results across multiple search and recommendation tasks demonstrate that RPP substantially improves discriminative power while correlating well with existing metrics and being equally robust to incomplete data.
Submitted 24 April, 2022;
originally announced April 2022.
-
Exposing Query Identification for Search Transparency
Authors:
Ruohan Li,
Jianxiang Li,
Bhaskar Mitra,
Fernando Diaz,
Asia J. Biega
Abstract:
Search systems control the exposure of ranked content to searchers. In many cases, creators value not only the exposure of their content but, moreover, an understanding of the specific searches where the content is surfaced. The problem of identifying which queries expose a given piece of content in the ranking results is an important and relatively under-explored search transparency challenge. Exposing queries are useful for quantifying various issues of search bias, privacy, data protection, security, and search engine optimization.
Exact identification of exposing queries in a given system is computationally expensive, especially in dynamic contexts such as web search. We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems: dense dual-encoder models and traditional BM25 models. We then propose how this approach can be improved through metric learning over the retrieval embedding space. We further derive an evaluation metric to measure the quality of a ranking of exposing queries and conduct an empirical analysis focusing on various practical aspects of approximate EQI. Overall, our work contributes a novel conception of transparency in search systems and computational means of achieving it.
Submitted 11 April, 2022; v1 submitted 14 October, 2021;
originally announced October 2021.
-
Estimation of Fair Ranking Metrics with Incomplete Judgments
Authors:
Ömer Kırnap,
Fernando Diaz,
Asia Biega,
Michael Ekstrand,
Ben Carterette,
Emine Yılmaz
Abstract:
There is increasing attention to evaluating the fairness of search system ranking decisions. These metrics often consider the membership of items to particular groups, often identified using protected attributes such as gender or ethnicity. To date, these metrics typically assume the availability and completeness of protected attribute labels of items. However, the protected attributes of individuals are rarely present, limiting the application of fair ranking metrics in large scale systems. In order to address this problem, we propose a sampling strategy and estimation technique for four fair ranking metrics. We formulate a robust and unbiased estimator which can operate even with a very limited number of labeled items. We evaluate our approach using both simulated and real world data. Our experimental results demonstrate that our method can estimate this family of fair ranking metrics and provides a robust, reliable alternative to exhaustive or random data annotation.
Submitted 11 August, 2021;
originally announced August 2021.
-
Overview of the TREC 2020 Fair Ranking Track
Authors:
Asia J. Biega,
Fernando Diaz,
Michael D. Ekstrand,
Sergey Feldman,
Sebastian Kohlmeier
Abstract:
This paper provides an overview of the NIST TREC 2020 Fair Ranking track. For 2020, we again adopted an academic search task, where we have a corpus of academic article abstracts and queries submitted to a production academic search engine. The central goal of the Fair Ranking track is to provide fair exposure to different groups of authors (a group fairness framing). We recognize that there may be multiple group definitions (e.g. based on demographics, stature, topic) and hoped for the systems to be robust to these. We expected participants to develop systems that optimize for fairness and relevance for arbitrary group definitions, and did not reveal the exact group definitions until after the evaluation runs were submitted. The track contains two tasks, reranking and retrieval, with a shared evaluation.
Submitted 11 August, 2021;
originally announced August 2021.
-
Learning to Limit Data Collection via Scaling Laws: A Computational Interpretation for the Legal Principle of Data Minimization
Authors:
Divya Shanmugam,
Samira Shabanian,
Fernando Diaz,
Michèle Finck,
Asia Biega
Abstract:
Modern machine learning systems are increasingly characterized by extensive personal data collection, despite the diminishing returns and increasing societal costs of such practices. Yet, data minimisation is one of the core data protection principles enshrined in the European Union's General Data Protection Regulation ('GDPR') and requires that only personal data that is adequate, relevant and limited to what is necessary is processed. However, the principle has seen limited adoption due to the lack of technical interpretation.
In this work, we build on literature in machine learning and law to propose FIDO, a Framework for Inhibiting Data Overcollection. FIDO learns to limit data collection based on an interpretation of data minimization tied to system performance. Concretely, FIDO provides a data collection stopping criterion by iteratively updating an estimate of the performance curve, or the relationship between dataset size and performance, as data is acquired. FIDO estimates the performance curve via a piecewise power law technique that models distinct phases of an algorithm's performance throughout data collection separately. Empirical experiments show that the framework produces accurate performance curves and data collection stopping criteria across datasets and feature acquisition algorithms. We further demonstrate that many other families of curves systematically overestimate the return on additional data. Results and analysis from our investigation offer deeper insights into the relevant considerations when designing a data minimization framework, including the impacts of active feature acquisition on individual users and the feasibility of user-specific data minimization. We conclude with practical recommendations for the implementation of data minimization.
Submitted 12 June, 2022; v1 submitted 16 July, 2021;
originally announced July 2021.
-
The Benchmark Lottery
Authors:
Mostafa Dehghani,
Yi Tay,
Alexey A. Gritsenko,
Zhe Zhao,
Neil Houlsby,
Fernando Diaz,
Donald Metzler,
Oriol Vinyals
Abstract:
The world of empirical machine learning (ML) strongly relies on benchmarks in order to determine the relative effectiveness of different algorithms and methods. This paper proposes the notion of "a benchmark lottery" that describes the overall fragility of the ML benchmarking process. The benchmark lottery postulates that many factors, other than fundamental algorithmic superiority, may lead to a method being perceived as superior. On multiple benchmark setups that are prevalent in the ML community, we show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks, highlighting the fragility of the current paradigms and potential fallacious interpretation derived from benchmarking ML methods. Given that every benchmark makes a statement about what it perceives to be important, we argue that this might lead to biased progress in the community. We discuss the implications of the observed phenomena and provide recommendations on mitigating them using multiple machine learning domains and communities as use cases, including natural language processing, computer vision, information retrieval, recommender systems, and reinforcement learning.
Submitted 14 July, 2021;
originally announced July 2021.
-
Fairness in Information Access Systems
Authors:
Michael D. Ekstrand,
Anubrata Das,
Robin Burke,
Fernando Diaz
Abstract:
Recommendation, information retrieval, and other information access systems pose unique challenges for investigating and applying the fairness and non-discrimination concepts that have been developed for studying other machine learning systems. While fair information access shares many commonalities with fair classification, the multistakeholder nature of information access applications, the rank-based problem setting, the centrality of personalization in many cases, and the role of user response complicate the problem of identifying precisely what types and operationalizations of fairness may be relevant, let alone measuring or promoting them.
In this monograph, we present a taxonomy of the various dimensions of fair information access and survey the literature to date on this new and rapidly-growing topic. We preface this with brief introductions to information access and algorithmic fairness, to facilitate use of this work by scholars with experience in one (or neither) of these fields who wish to learn about their intersection. We conclude with several open problems in fair information access, along with some suggestions for how to approach research in this space.
Submitted 12 July, 2022; v1 submitted 12 May, 2021;
originally announced May 2021.
-
Multi-FR: A Multi-objective Optimization Framework for Multi-stakeholder Fairness-aware Recommendation
Authors:
Haolun Wu,
Chen Ma,
Bhaskar Mitra,
Fernando Diaz,
Xue Liu
Abstract:
Nowadays, most online services are hosted on multi-stakeholder marketplaces, where consumers and producers may have different objectives. Conventional recommendation systems, however, mainly focus on maximizing consumers' satisfaction by recommending the most relevant items to each individual. This may result in unfair exposure of items, thus jeopardizing producer benefits. Additionally, they do not consider whether consumers from diverse demographic groups are equally satisfied. To address these limitations, we propose a multi-objective optimization framework for fairness-aware recommendation, Multi-FR, that adaptively balances accuracy and fairness for various stakeholders with a Pareto optimality guarantee. We first propose four fairness constraints on consumers and producers. In order to train the whole framework in an end-to-end way, we utilize the smooth rank and stochastic ranking policy to make these fairness criteria differentiable and friendly to back-propagation. Then, we adopt the multiple gradient descent algorithm to generate a Pareto set of solutions, from which the most appropriate one is selected by the Least Misery Strategy. The experimental results demonstrate that Multi-FR largely improves recommendation fairness on multiple stakeholders over the state-of-the-art approaches while maintaining almost the same recommendation accuracy. The training efficiency study confirms our model's ability to simultaneously optimize different fairness constraints for many stakeholders efficiently.
Submitted 9 August, 2022; v1 submitted 6 May, 2021;
originally announced May 2021.
-
Tip of the Tongue Known-Item Retrieval: A Case Study in Movie Identification
Authors:
Jaime Arguello,
Adam Ferguson,
Emery Fine,
Bhaskar Mitra,
Hamed Zamani,
Fernando Diaz
Abstract:
While current information retrieval systems are effective for known-item retrieval where the searcher provides a precise name or identifier for the item being sought, systems tend to be much less effective for cases where the searcher is unable to express a precise name or identifier. We refer to this as tip of the tongue (TOT) known-item retrieval, named after the cognitive state of not being able to retrieve an item from memory. Using movie search as a case study, we explore the characteristics of questions posed by searchers in TOT states in a community question answering website. We analyze how searchers express their information needs during TOT states in the movie domain. Specifically, what information do searchers remember about the item being sought and how do they convey this information? Our results suggest that searchers use a combination of information about: (1) the content of the item sought, (2) the context in which they previously engaged with the item, and (3) previous attempts to find the item using other resources (e.g., search engines). Additionally, searchers convey information by sometimes expressing uncertainty (i.e., hedging), opinions, emotions, and by performing relative (vs. absolute) comparisons with attributes of the item. As a result of our analysis, we believe that searchers in TOT states may require specialized query understanding methods or document representations. Finally, our preliminary retrieval experiments show the impact of each information type presented in information requests on retrieval performance.
Submitted 18 January, 2021;
originally announced January 2021.
-
Chest x-ray automated triage: a semiologic approach designed for clinical implementation, exploiting different types of labels through a combination of four Deep Learning architectures
Authors:
Candelaria Mosquera,
Facundo Nahuel Diaz,
Fernando Binder,
Jose Martin Rabellino,
Sonia Elizabeth Benitez,
Alejandro Daniel Beresñak,
Alberto Seehaus,
Gabriel Ducrey,
Jorge Alberto Ocantos,
Daniel Roberto Luna
Abstract:
BACKGROUND AND OBJECTIVES: The multiple chest x-ray datasets released in recent years have ground-truth labels intended for different computer vision tasks, suggesting that performance in automated chest x-ray interpretation might improve by using a method that can exploit diverse types of annotations. This work presents a Deep Learning method based on the late fusion of different convolutional architectures, which allows training with heterogeneous data with a simple implementation, and evaluates its performance on independent test data. We focused on obtaining a clinically useful tool that could be successfully integrated into a hospital workflow. MATERIALS AND METHODS: Based on expert opinion, we selected four target chest x-ray findings, namely lung opacities, fractures, pneumothorax and pleural effusion. For each finding we defined the most adequate type of ground-truth label, and built four training datasets combining images from public chest x-ray datasets and our institutional archive. We trained four different Deep Learning architectures and combined their outputs with a late fusion strategy, obtaining a unified tool. Performance was measured on two test datasets: an external openly-available dataset, and a retrospective institutional dataset, to estimate performance on the local population. RESULTS: The external and local test sets had 4376 and 1064 images, respectively, for which the model showed an area under the Receiver Operating Characteristics curve of 0.75 (95%CI: 0.74-0.76) and 0.87 (95%CI: 0.86-0.89) in the detection of abnormal chest x-rays. For the local population, a sensitivity of 86% (95%CI: 84-90), and a specificity of 88% (95%CI: 86-90) were obtained, with no significant differences between demographic subgroups. We present examples of heatmaps to show the accomplished level of interpretability, examining true and false positives.
Submitted 23 December, 2020;
originally announced December 2020.
-
Deep learning-based transformation of the H&E stain into special stains
Authors:
Kevin de Haan,
Yijie Zhang,
Jonathan E. Zuckerman,
Tairan Liu,
Anthony E. Sisk,
Miguel F. P. Diaz,
Kuang-Yu Jen,
Alexander Nobori,
Sofia Liou,
Sarah Zhang,
Rana Riahi,
Yair Rivenson,
W. Dean Wallace,
Aydogan Ozcan
Abstract:
Pathology is practiced by visual inspection of histochemically stained slides. Most commonly, the hematoxylin and eosin (H&E) stain is used in the diagnostic workflow and it is the gold standard for cancer diagnosis. However, in many cases, especially for non-neoplastic diseases, additional "special stains" are used to provide different levels of contrast and color to tissue components and allow pathologists to get a clearer diagnostic picture. In this study, we demonstrate the utility of supervised learning-based computational stain transformation from H&E to different special stains (Masson's Trichrome, periodic acid-Schiff and Jones silver stain) using tissue sections from kidney needle core biopsies. Based on evaluation by three renal pathologists, followed by adjudication by a fourth renal pathologist, we show that the generation of virtual special stains from existing H&E images improves the diagnosis in several non-neoplastic kidney diseases sampled from 58 unique subjects. A second study performed by three pathologists found that the quality of the special stains generated by the stain transformation network was statistically equivalent to those generated through standard histochemical staining. As the transformation of H&E images into special stains can be achieved within 1 min or less per patient core specimen slide, this stain-to-stain transformation framework can improve the quality of the preliminary diagnosis when additional special stains are needed, along with significant savings in time and cost, reducing the burden on the healthcare system and patients.
Submitted 12 August, 2021; v1 submitted 20 August, 2020;
originally announced August 2020.
-
On the Social and Technical Challenges of Web Search Autosuggestion Moderation
Authors:
Timothy J. Hazen,
Alexandra Olteanu,
Gabriella Kazai,
Fernando Diaz,
Michael Golebiewski
Abstract:
Past research shows that users benefit from systems that support them in their writing and exploration tasks. The autosuggestion feature of Web search engines is an example of such a system: It helps users in formulating their queries by offering a list of suggestions as they type. Autosuggestions are typically generated by machine learning (ML) systems trained on a corpus of search logs and document representations. Such automated methods can become prone to issues that result in problematic suggestions that are biased, racist, sexist or in other ways inappropriate. While current search engines have become increasingly proficient at suppressing such problematic suggestions, persistent issues remain. In this paper, we reflect on past efforts and on why certain issues still linger by covering explored solutions along a prototypical pipeline for identifying, detecting, and addressing problematic autosuggestions. To showcase their complexity, we discuss several dimensions of problematic suggestions, difficult issues along the pipeline, and why our discussion applies to the increasing number of applications beyond web search that implement similar textual suggestion features. By outlining persistent social and technical challenges in moderating web search suggestions, we provide a renewed call for action.
Submitted 9 July, 2020;
originally announced July 2020.
-
Analyzing and Learning from User Interactions for Search Clarification
Authors:
Hamed Zamani,
Bhaskar Mitra,
Everest Chen,
Gord Lueck,
Fernando Diaz,
Paul N. Bennett,
Nick Craswell,
Susan T. Dumais
Abstract:
Asking clarifying questions in response to search queries has been recognized as a useful technique for revealing the underlying intent of the query. Clarification has applications in retrieval systems with different interfaces, from traditional web search interfaces to limited-bandwidth interfaces such as speech-only and small-screen devices. Generation and evaluation of clarifying questions have been recently studied in the literature. However, user interaction with clarifying questions is relatively unexplored. In this paper, we conduct a comprehensive study by analyzing large-scale user interactions with clarifying questions in a major web search engine. In more detail, we analyze the user engagement received by clarifying questions based on different properties of search queries, clarifying questions, and their candidate answers. We further study click bias in the data, and show that even though reading clarifying questions and candidate answers does not take significant effort, there still exist some position and presentation biases in the data. We also propose a model for learning representations for clarifying questions based on the user interaction data as implicit feedback. The model is used for re-ranking a number of automatically generated clarifying questions for a given query. Evaluation on both click data and human-labeled data demonstrates the high quality of the proposed method.
Submitted 29 May, 2020;
originally announced June 2020.
-
Operationalizing the Legal Principle of Data Minimization for Personalization
Authors:
Asia J. Biega,
Peter Potash,
Hal Daumé III,
Fernando Diaz,
Michèle Finck
Abstract:
Article 5(1)(c) of the European Union's General Data Protection Regulation (GDPR) requires that "personal data shall be [...] adequate, relevant, and limited to what is necessary in relation to the purposes for which they are processed ('data minimisation')". To date, the legal and computational definitions of 'purpose limitation' and 'data minimization' remain largely unclear. In particular, the interpretation of these principles is an open issue for information access systems that optimize for user experience through personalization and do not strictly require personal data collection for the delivery of basic service.
In this paper, we identify a lack of a homogeneous interpretation of the data minimization principle and explore two operational definitions applicable in the context of personalization. The focus of our empirical study in the domain of recommender systems is on providing foundational insights about the (i) feasibility of different data minimization definitions, (ii) robustness of different recommendation algorithms to minimization, and (iii) performance of different minimization strategies. We find that the performance decrease incurred by data minimization might not be substantial, but that it might disparately impact different users, a finding which has implications for the viability of different formal minimization definitions. Overall, our analysis uncovers the complexities of the data minimization problem in the context of personalization and maps the remaining computational and regulatory challenges.
Submitted 27 May, 2020;
originally announced May 2020.
-
Evaluating Stochastic Rankings with Expected Exposure
Authors:
Fernando Diaz,
Bhaskar Mitra,
Michael D. Ekstrand,
Asia J. Biega,
Ben Carterette
Abstract:
We introduce the concept of expected exposure as the average attention ranked items receive from users over repeated samples of the same query. Furthermore, we advocate for the adoption of the principle of equal expected exposure: given a fixed information need, no item should receive more or less expected exposure than any other item of the same relevance grade. We argue that this principle is desirable for many retrieval objectives and scenarios, including topical diversity and fair ranking. Leveraging user models from existing retrieval metrics, we propose a general evaluation methodology based on expected exposure and draw connections to related metrics in information retrieval evaluation. Importantly, this methodology relaxes classic information retrieval assumptions, allowing a system, in response to a query, to produce a distribution over rankings instead of a single fixed ranking. We study the behavior of the expected exposure metric and stochastic rankers across a variety of information access conditions, including ad hoc retrieval and recommendation. We believe that measuring and optimizing expected exposure metrics using randomization opens a new area for retrieval algorithm development and progress.
Submitted 20 October, 2020; v1 submitted 27 April, 2020;
originally announced April 2020.
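As a rough illustration of the idea in this abstract, the sketch below averages per-item exposure over sampled rankings. The geometric browsing model (an item at 0-based rank k receives exposure `patience ** k`), the `patience` parameter, and the function name `expected_exposure` are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def expected_exposure(rankings, patience=0.5):
    """Average per-item exposure over a list of sampled rankings.

    Assumed browsing model (illustrative only): the item at 0-based
    rank k receives exposure patience ** k.
    """
    totals = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking):
            totals[item] += patience ** rank
    n = len(rankings)
    return {item: total / n for item, total in totals.items()}


# A deterministic ranker always places "a" above the equally relevant "b".
deterministic = expected_exposure([["a", "b"], ["a", "b"]])

# A stochastic ranker alternates between the two orderings.
stochastic = expected_exposure([["a", "b"], ["b", "a"]])

print(deterministic)
print(stochastic)
```

Under these assumptions, the deterministic ranker gives the two equally relevant items unequal exposure, while the ranker that randomizes over both orderings equalizes it.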
-
Overview of the TREC 2019 Fair Ranking Track
Authors:
Asia J. Biega,
Fernando Diaz,
Michael D. Ekstrand,
Sebastian Kohlmeier
Abstract:
The goal of the TREC Fair Ranking track was to develop a benchmark for evaluating retrieval systems in terms of fairness to different content providers in addition to classic notions of relevance. As part of the benchmark, we defined standardized fairness metrics with evaluation protocols and released a dataset for the fair ranking problem. The 2019 task focused on reranking academic paper abstracts given a query. The objective was to fairly represent relevant authors from several groups that were unknown at the system submission time. Thus, the track emphasized the development of systems which have robust performance across a variety of group definitions. Participants were provided with query log data (queries, documents, and relevance) from Semantic Scholar. This paper presents an overview of the track, including the task definition, descriptions of the data and the annotation process, as well as a comparison of the performance of submitted systems.
Submitted 25 March, 2020;
originally announced March 2020.
-
Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks
Authors:
Bhaskar Mitra,
Corby Rosset,
David Hawking,
Nick Craswell,
Fernando Diaz,
Emine Yilmaz
Abstract:
Classical information retrieval (IR) methods, such as query likelihood and BM25, score documents independently w.r.t. each query term, and then accumulate the scores. Assuming query term independence allows precomputing term-document scores using these models, which can be combined with specialized data structures, such as an inverted index, for efficient retrieval. Deep neural IR models, in contrast, compare the whole query to the document and are, therefore, typically employed only for late-stage re-ranking. We incorporate the query term independence assumption into three state-of-the-art neural IR models (BERT, Duet, and CKNRM) and evaluate their performance on a passage ranking task. Surprisingly, we observe no significant loss in result quality for Duet and CKNRM, and a small degradation in the case of BERT. However, by operating on each query term independently, these otherwise computationally intensive models become amenable to offline precomputation, dramatically reducing the cost of query evaluations employing state-of-the-art neural ranking models. This strategy makes it practical to use deep models for retrieval from large collections, rather than restricting their usage to late-stage re-ranking.
Submitted 8 July, 2019;
originally announced July 2019.
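The precompute-then-sum pattern described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's models: `term_score` is a hypothetical stand-in (length-normalized term frequency) for whatever per-term scorer is used, and the function names and toy documents are invented for the example.

```python
from collections import defaultdict


def term_score(term, doc_tokens):
    # Hypothetical stand-in for a per-term model score:
    # term frequency normalized by document length.
    return doc_tokens.count(term) / len(doc_tokens)


def build_index(docs):
    """Precompute term-document scores offline into an inverted index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for term in set(tokens):
            index[term][doc_id] = term_score(term, tokens)
    return index


def search(index, query):
    """Score documents by summing the precomputed per-term scores."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, score in index.get(term, {}).items():
            scores[doc_id] += score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {
    "d1": "neural ranking models",
    "d2": "classic ranking with bm25 ranking",
}
index = build_index(docs)
results = search(index, "neural ranking")
print(results)
```

Because each query term contributes independently, all calls to the scorer happen at indexing time, and query evaluation reduces to posting-list lookups and additions.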
-
Promoting Saving for College Through Data Science
Authors:
Fernando Diaz,
Natnaell Mammo
Abstract:
The cost of attending college has been steadily rising and in 10 years is estimated to reach $140,000 for a 4-year public university. Recent surveys estimate just over half of US families are saving for college. State-operated 529 college savings plans are an effective way for families to plan and save for future college costs, but only 3% of families currently use them. The Office of the Illinois State Treasurer (Treasurer) administers two 529 plans to help its residents save for college. In order to increase the number of families saving for college, the Treasurer and Civis Analytics used data science techniques to identify the people most likely to sign up for a college savings plan. In this paper, we will discuss the use of person matching to join accountholder data from the Treasurer to the Civis National File, as well as the use of lookalike modeling to identify new potential signups. In order to avoid reinforcing existing demographic imbalances in who saves for college, we ensured that the lookalike models used were racially and economically balanced. We will also discuss how these new signup targets were then individually served digital ads to encourage opening college savings accounts.
Submitted 18 October, 2017;
originally announced October 2017.
-
Auditing Search Engines for Differential Satisfaction Across Demographics
Authors:
Rishabh Mehrotra,
Ashton Anderson,
Fernando Diaz,
Amit Sharma,
Hanna Wallach,
Emine Yilmaz
Abstract:
Many online services, such as search engines, social media platforms, and digital marketplaces, are advertised as being available to any user, regardless of their age, gender, or other demographic factors. However, there are growing concerns that these services may systematically underserve some groups of users. In this paper, we present a framework for internally auditing such services for differences in user satisfaction across demographic groups, using search engines as a case study. We first explain the pitfalls of naïvely comparing the behavioral metrics that are commonly used to evaluate search engines. We then propose three methods for measuring latent differences in user satisfaction from observed differences in evaluation metrics. To develop these methods, we drew on ideas from the causal inference literature and the multilevel modeling literature. Our framework is broadly applicable to other online services, and provides general insight into interpreting their evaluation metrics.
Submitted 24 May, 2017;
originally announced May 2017.
-
Luandri: a Clean Lua Interface to the Indri Search Engine
Authors:
Bhaskar Mitra,
Fernando Diaz,
Nick Craswell
Abstract:
In recent years, the information retrieval (IR) community has witnessed the first successful applications of deep neural network models to short-text matching and ad-hoc retrieval. It is exciting to see the research on deep neural networks and IR converge on these tasks of shared interest. However, the two communities have less in common when it comes to the choice of programming languages. Indri, an indexing framework popularly used by the IR community, is written in C++, while Torch, a popular machine learning library for deep learning, is written in the light-weight scripting language Lua. To bridge this gap, we introduce Luandri (pronounced "laundry"), a simple interface for exposing the search capabilities of Indri to Torch models implemented in Lua.
Submitted 16 February, 2017;
originally announced February 2017.
-
Learning to Match Using Local and Distributed Representations of Text for Web Search
Authors:
Bhaskar Mitra,
Fernando Diaz,
Nick Craswell
Abstract:
Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favorable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or 'duet' performs significantly better than either neural network individually on a Web page ranking task, and also significantly outperforms traditional baselines and other recently proposed models based on neural networks.
Submitted 25 October, 2016;
originally announced October 2016.
-
The Social Dynamics of Language Change in Online Networks
Authors:
Rahul Goel,
Sandeep Soni,
Naman Goyal,
John Paparrizos,
Hanna Wallach,
Fernando Diaz,
Jacob Eisenstein
Abstract:
Language change is a complex social phenomenon, revealing pathways of communication and sociocultural influence. But, while language change has long been a topic of study in sociolinguistics, traditional linguistic research methods rely on circumstantial evidence, estimating the direction of change from differences between older and younger speakers. In this paper, we use a data set of several million Twitter users to track language changes in progress. First, we show that language change can be viewed as a form of social influence: we observe complex contagion for phonetic spellings and "netspeak" abbreviations (e.g., lol), but not for older dialect markers from spoken language. Next, we test whether specific types of social network connections are more influential than others, using a parametric Hawkes process model. We find that tie strength plays an important role: densely embedded social ties are significantly better conduits of linguistic influence. Geographic locality appears to play a more limited role: we find relatively little evidence to support the hypothesis that individuals are more influenced by geographically local social ties, even in their usage of geographical dialect markers.
Submitted 7 September, 2016;
originally announced September 2016.
-
Query Expansion with Locally-Trained Word Embeddings
Authors:
Fernando Diaz,
Bhaskar Mitra,
Nick Craswell
Abstract:
Continuous space word embeddings have received a great deal of attention in the natural language processing and machine learning communities for their ability to model term similarity and other relationships. We study the use of term relatedness in the context of query expansion for ad hoc information retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when trained globally, underperform corpus- and query-specific embeddings for retrieval tasks. These results suggest that other tasks benefiting from global embeddings may also benefit from local embeddings.
Submitted 22 June, 2016; v1 submitted 25 May, 2016;
originally announced May 2016.
-
Real-Time Web Scale Event Summarization Using Sequential Decision Making
Authors:
Chris Kedzie,
Fernando Diaz,
Kathleen McKeown
Abstract:
We present a system based on sequential decision making for the online summarization of massive document streams, such as those found on the web. Given an event of interest (e.g. "Boston marathon bombing"), our system is able to filter the stream for relevance and produce a series of short text updates describing the event as it unfolds over time. Unlike previous work, our approach is able to jointly model the relevance, comprehensiveness, novelty, and timeliness required by time-sensitive queries. We demonstrate a 28.3% improvement in summary F1 and a 43.8% improvement in time-sensitive F1 metrics.
Submitted 11 May, 2016;
originally announced May 2016.
-
A Discrete Firefly Algorithm to Solve a Rich Vehicle Routing Problem Modelling a Newspaper Distribution System with Recycling Policy
Authors:
E. Osaba,
Xin-She Yang,
F. Diaz,
E. Onieva,
A. D. Masegosa,
A. Perallos
Abstract:
A real-world newspaper distribution problem with a recycling policy is tackled in this work. In order to meet all the complex restrictions contained in such a problem, it has been modeled as a rich vehicle routing problem, which can be more specifically considered as an asymmetric and clustered vehicle routing problem with simultaneous pickup and deliveries, variable costs and forbidden paths (AC-VRP-SPDVCFP). This is the first study of such a problem in the literature. For this reason, a benchmark composed of 15 instances has also been proposed. In the design of this benchmark, real geographical positions have been used, located in the province of Bizkaia, Spain. For the proper treatment of this AC-VRP-SPDVCFP, a discrete firefly algorithm (DFA) has been developed. This is the first application of the firefly algorithm to any rich vehicle routing problem. To prove that the proposed DFA is a promising technique, its performance has been compared with two other well-known techniques: an evolutionary algorithm and an evolutionary simulated annealing. Our results have shown that the DFA has outperformed these two classic meta-heuristics.
Submitted 14 April, 2016;
originally announced April 2016.
-
An Improved Discrete Bat Algorithm for Symmetric and Asymmetric Traveling Salesman Problems
Authors:
Eneko Osaba,
Xin-She Yang,
Fernando Diaz,
Pedro Lopez-Garcia,
Roberto Carballedo
Abstract:
The bat algorithm is a population metaheuristic proposed in 2010 which is based on the echolocation or bio-sonar characteristics of microbats. Since its first implementation, the bat algorithm has been used in a wide range of fields. In this paper, we present a discrete version of the bat algorithm to solve the well-known symmetric and asymmetric traveling salesman problems. In addition, we propose an improvement in the basic structure of the classic bat algorithm. To prove that our proposal is a promising approximation method, we have compared its performance in 37 instances with the results obtained by five different techniques: evolutionary simulated annealing, genetic algorithm, an island based distributed genetic algorithm, a discrete firefly algorithm and an imperialist competitive algorithm. In order to obtain fair and rigorous comparisons, we have conducted three different statistical tests throughout the paper: Student's t-test, Holm's test, and the Friedman test. We have also compared the convergence behaviour shown by our proposal with that shown by the evolutionary simulated annealing and the discrete firefly algorithm. The experimentation carried out in this study has shown that the presented improved bat algorithm significantly outperforms all the other alternatives in most of the cases.
Submitted 14 April, 2016;
originally announced April 2016.
-
Exploratory Gradient Boosting for Reinforcement Learning in Complex Domains
Authors:
David Abel,
Alekh Agarwal,
Fernando Diaz,
Akshay Krishnamurthy,
Robert E. Schapire
Abstract:
High-dimensional observations and complex real-world dynamics present major challenges in reinforcement learning for both function approximation and exploration. We address both of these challenges with two complementary techniques: First, we develop a gradient-boosting style, non-parametric function approximator for learning on Q-function residuals. And second, we propose an exploration strategy inspired by the principles of state abstraction and information acquisition under uncertainty. We demonstrate the empirical effectiveness of these techniques, first, as a preliminary check, on two standard tasks (Blackjack and n-Chain), and then on two much larger and more realistic tasks with high-dimensional observation spaces. Specifically, we introduce two benchmarks built within the game Minecraft where the observations are pixel arrays of the agent's visual field. A combination of our two algorithmic techniques performs competitively on the standard reinforcement-learning tasks while consistently and substantially outperforming baselines on the two tasks with high-dimensional observation spaces. The new function approximator, exploration strategy, and evaluation benchmarks are each of independent interest in the pursuit of reinforcement-learning methods that scale to real-world domains.
Submitted 13 March, 2016;
originally announced March 2016.
-
Pseudo-Query Reformulation
Authors:
Fernando Diaz
Abstract:
Automatic query reformulation refers to rewriting a user's original query in order to improve the ranking of retrieval results compared to the original query. We present a general framework for automatic query reformulation based on discrete optimization. Our approach, referred to as pseudo-query reformulation, treats automatic query reformulation as a search problem over the graph of unweighted queries linked by minimal transformations (e.g. term additions, deletions). This framework allows us to test existing performance prediction methods as heuristics for the graph search process. We demonstrate the effectiveness of the approach on several publicly available datasets.
Submitted 14 July, 2015;
originally announced July 2015.
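The search formulation in this abstract can be illustrated with a small sketch: queries are nodes, single-term additions and deletions are edges, and a performance predictor guides the search. Everything below is an assumption made for illustration: the greedy hill-climbing strategy, the names `neighbors` and `reformulate`, and the toy predictor are hypothetical and do not reproduce the paper's method.

```python
def neighbors(query, vocabulary):
    """Queries one minimal transformation away: add or delete one term."""
    added = [query | {term} for term in vocabulary if term not in query]
    # Never delete the last remaining term, so queries stay non-empty.
    deleted = [query - {term} for term in query if len(query) > 1]
    return added + deleted


def reformulate(query, vocabulary, predictor, max_steps=5):
    """Greedy hill climbing over the query graph, guided by `predictor`."""
    current = frozenset(query)
    for _ in range(max_steps):
        best = max(neighbors(current, vocabulary), key=predictor, default=current)
        if predictor(best) <= predictor(current):
            break  # no neighbor improves the predicted performance
        current = best
    return current


# Toy predictor (hypothetical): rewards overlap with a set of "good"
# terms and applies a small penalty per query term.
GOOD_TERMS = {"jaguar", "car", "price"}


def toy_predictor(query):
    return len(query & GOOD_TERMS) - 0.1 * len(query)


result = reformulate({"jaguar", "cat"}, {"car", "price", "animal"}, toy_predictor)
print(sorted(result))
```

In a real setting the toy predictor would be replaced by a query performance prediction method, and the greedy loop could be swapped for any other graph search strategy.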