Search | arXiv e-print repository

Normalized AOPC: Fixing Misleading Faithfulness Metrics for Feature Attribution Explainability

Authors: Joakim Edin, Andreas Geert Motzfeldt, Casper L. Christensen, Tuukka Ruotsalo, Lars Maaløe, Maria Maistro

Abstract: Deep neural network predictions are notoriously difficult to interpret. Feature attribution methods aim to explain these predictions by identifying the contribution of each input feature. Faithfulness, often evaluated using the area over the perturbation curve (AOPC), reflects feature attributions' accuracy in describing the internal mechanisms of deep neural networks. However, many studies rely o… ▽ More Deep neural network predictions are notoriously difficult to interpret. Feature attribution methods aim to explain these predictions by identifying the contribution of each input feature. Faithfulness, often evaluated using the area over the perturbation curve (AOPC), reflects feature attributions' accuracy in describing the internal mechanisms of deep neural networks. However, many studies rely on AOPC to compare faithfulness across different models, which we show can lead to false conclusions about models' faithfulness. Specifically, we find that AOPC is sensitive to variations in the model, resulting in unreliable cross-model comparisons. Moreover, AOPC scores are difficult to interpret in isolation without knowing the model-specific lower and upper limits. To address these issues, we propose a normalization approach, Normalized AOPC (NAOPC), enabling consistent cross-model evaluations and more meaningful interpretation of individual scores. Our experiments demonstrate that this normalization can radically change AOPC results, questioning the conclusions of earlier studies and offering a more robust framework for assessing feature attribution faithfulness. △ Less

Submitted 15 August, 2024; originally announced August 2024.

ACM Class: I.2.0

arXiv:2407.17023 [pdf, other]

From Internal Conflict to Contextual Adaptation of Language Models

Authors: Sara Vera Marjanović, Haeun Yu, Pepa Atanasova, Maria Maistro, Christina Lioma, Isabelle Augenstein

Abstract: Knowledge-intensive language understanding tasks require Language Models (LMs) to integrate relevant context, mitigating their inherent weaknesses, such as incomplete or outdated knowledge. Nevertheless, studies indicate that LMs often ignore the provided context as it can conflict with the pre-existing LM's memory learned during pre-training. Moreover, conflicting knowledge can already be present… ▽ More Knowledge-intensive language understanding tasks require Language Models (LMs) to integrate relevant context, mitigating their inherent weaknesses, such as incomplete or outdated knowledge. Nevertheless, studies indicate that LMs often ignore the provided context as it can conflict with the pre-existing LM's memory learned during pre-training. Moreover, conflicting knowledge can already be present in the LM's parameters, termed intra-memory conflict. Existing works have studied the two types of knowledge conflicts only in isolation. We conjecture that the (degree of) intra-memory conflicts can in turn affect LM's handling of context-memory conflicts. To study this, we introduce the DYNAMICQA dataset, which includes facts with a temporal dynamic nature where a fact can change with a varying time frequency and disputable dynamic facts, which can change depending on the viewpoint. DYNAMICQA is the first to include real-world knowledge conflicts and provide context to study the link between the different types of knowledge conflicts. With the proposed dataset, we assess the use of uncertainty for measuring the intra-memory conflict and introduce a novel Coherent Persuasion (CP) score to evaluate the context's ability to sway LM's semantic output. Our extensive experiments reveal that static facts, which are unlikely to change, are more easily updated with additional context, relative to temporal and disputable facts. △ Less

Submitted 24 July, 2024; originally announced July 2024.

Comments: 22 pages, 15 figures

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2406.08958 [pdf, other]

An Unsupervised Approach to Achieve Supervised-Level Explainability in Healthcare Records

Authors: Joakim Edin, Maria Maistro, Lars Maaløe, Lasse Borgholt, Jakob D. Havtorn, Tuukka Ruotsalo

Abstract: Electronic healthcare records are vital for patient safety as they document conditions, plans, and procedures in both free text and medical codes. Language models have significantly enhanced the processing of such records, streamlining workflows and reducing manual data entry, thereby saving healthcare providers significant resources. However, the black-box nature of these models often leaves heal… ▽ More Electronic healthcare records are vital for patient safety as they document conditions, plans, and procedures in both free text and medical codes. Language models have significantly enhanced the processing of such records, streamlining workflows and reducing manual data entry, thereby saving healthcare providers significant resources. However, the black-box nature of these models often leaves healthcare professionals hesitant to trust them. State-of-the-art explainability methods increase model transparency but rely on human-annotated evidence spans, which are costly. In this study, we propose an approach to produce plausible and faithful explanations without needing such annotations. We demonstrate on the automated medical coding task that adversarial robustness training improves explanation plausibility and introduce AttInGrad, a new explanation method superior to previous ones. By combining both contributions in a fully unsupervised setup, we produce explanations of comparable quality, or better, to that of a supervised approach. We release our code and model weights. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2405.18276 [pdf, other]

doi 10.1145/3626772.3657832

Can We Trust Recommender System Fairness Evaluation? The Role of Fairness and Relevance

Authors: Theresia Veronika Rampisela, Tuukka Ruotsalo, Maria Maistro, Christina Lioma

Abstract: Relevance and fairness are two major objectives of recommender systems (RSs). Recent work proposes measures of RS fairness that are either independent from relevance (fairness-only) or conditioned on relevance (joint measures). While fairness-only measures have been studied extensively, we look into whether joint measures can be trusted. We collect all joint evaluation measures of RS relevance and… ▽ More Relevance and fairness are two major objectives of recommender systems (RSs). Recent work proposes measures of RS fairness that are either independent from relevance (fairness-only) or conditioned on relevance (joint measures). While fairness-only measures have been studied extensively, we look into whether joint measures can be trusted. We collect all joint evaluation measures of RS relevance and fairness, and ask: How much do they agree with each other? To what extent do they agree with relevance/fairness measures? How sensitive are they to changes in rank position, or to increasingly fair and relevant recommendations? We empirically study for the first time the behaviour of these measures across 4 real-world datasets and 4 recommenders. We find that most of these measures: i) correlate weakly with one another and even contradict each other at times; ii) are less sensitive to rank position changes than relevance- and fairness-only measures, meaning that they are less granular than traditional RS measures; and iii) tend to compress scores at the low end of their range, meaning that they are not very expressive. We counter the above limitations with a set of guidelines on the appropriate usage of such measures, i.e., they should be used with caution due to their tendency to contradict each other and of having a very small empirical range. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: Accepted to SIGIR 2024 as full paper

arXiv:2405.04246 [pdf, other]

doi 10.1145/3626772.3657881

Dataset and Models for Item Recommendation Using Multi-Modal User Interactions

Authors: Simone Borg Bruun, Krisztian Balog, Maria Maistro

Abstract: While recommender systems with multi-modal item representations (image, audio, and text), have been widely explored, learning recommendations from multi-modal user interactions (e.g., clicks and speech) remains an open problem. We study the case of multi-modal user interactions in a setting where users engage with a service provider through multiple channels (website and call center). In such case… ▽ More While recommender systems with multi-modal item representations (image, audio, and text), have been widely explored, learning recommendations from multi-modal user interactions (e.g., clicks and speech) remains an open problem. We study the case of multi-modal user interactions in a setting where users engage with a service provider through multiple channels (website and call center). In such cases, incomplete modalities naturally occur, since not all users interact through all the available channels. To address these challenges, we publish a real-world dataset that allows progress in this under-researched area. We further present and benchmark various methods for leveraging multi-modal user interactions for item recommendations, and propose a novel approach that specifically deals with missing modalities by mapping user interactions to a common feature space. Our analysis reveals important interactions between the different modalities and that a frequently occurring modality can enhance learning from a less frequent one. △ Less

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2403.00368 [pdf, other]

doi 10.1145/3606950

Recommending Target Actions Outside Sessions in the Data-poor Insurance Domain

Authors: Simone Borg Bruun, Christina Lioma, Maria Maistro

Abstract: Providing personalized recommendations for insurance products is particularly challenging due to the intrinsic and distinctive features of the insurance domain. First, unlike more traditional domains like retail, movie etc., a large amount of user feedback is not available and the item catalog is smaller. Second, due to the higher complexity of products, the majority of users still prefer to compl… ▽ More Providing personalized recommendations for insurance products is particularly challenging due to the intrinsic and distinctive features of the insurance domain. First, unlike more traditional domains like retail, movie etc., a large amount of user feedback is not available and the item catalog is smaller. Second, due to the higher complexity of products, the majority of users still prefer to complete their purchases over the phone instead of online. We present different recommender models to address such data scarcity in the insurance domain. We use recurrent neural networks with 3 different types of loss functions and architectures (cross-entropy, censored Weibull, attention). Our models cope with data scarcity by learning from multiple sessions and different types of user actions. Moreover, differently from previous session-based models, our models learn to predict a target action that does not happen within the session. Our models outperform state-of-the-art baselines on a real-world insurance dataset, with ca. 44K users, 16 items, 54K purchases and 117K sessions. Moreover, combining our models with demographic data boosts the performance. Analysis shows that considering multiple sessions and several types of actions are both beneficial for the models, and that our models are not unfair with respect to age, gender and income. △ Less

Submitted 1 March, 2024; originally announced March 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2211.15360

Journal ref: ACM Transactions on Recommender Systems 2023

arXiv:2311.01013 [pdf, other]

doi 10.1145/3631943

Evaluation Measures of Individual Item Fairness for Recommender Systems: A Critical Study

Authors: Theresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, Christina Lioma

Abstract: Fairness is an emerging and challenging topic in recommender systems. In recent years, various ways of evaluating and therefore improving fairness have emerged. In this study, we examine existing evaluation measures of fairness in recommender systems. Specifically, we focus solely on exposure-based fairness measures of individual items that aim to quantify the disparity in how individual items are… ▽ More Fairness is an emerging and challenging topic in recommender systems. In recent years, various ways of evaluating and therefore improving fairness have emerged. In this study, we examine existing evaluation measures of fairness in recommender systems. Specifically, we focus solely on exposure-based fairness measures of individual items that aim to quantify the disparity in how individual items are recommended to users, separate from item relevance to users. We gather all such measures and we critically analyse their theoretical properties. We identify a series of limitations in each of them, which collectively may render the affected measures hard or impossible to interpret, to compute, or to use for comparing recommendations. We resolve these limitations by redefining or correcting the affected measures, or we argue why certain limitations cannot be resolved. We further perform a comprehensive empirical analysis of both the original and our corrected versions of these fairness measures, using real-world and synthetic datasets. Our analysis provides novel insights into the relationship between measures based on different fairness concepts, and different levels of measure sensitivity and strictness. We conclude with practical suggestions of which fairness measures should be used and when. Our code is publicly available. To our knowledge, this is the first critical comparison of individual item fairness measures in recommender systems. △ Less

Submitted 2 November, 2023; originally announced November 2023.

Comments: Accepted to ACM Transactions on Recommender Systems (TORS)

arXiv:2304.10909 [pdf, other]

doi 10.1145/3539618.3591918

Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study

Authors: Joakim Edin, Alexander Junge, Jakob D. Havtorn, Lasse Borgholt, Maria Maistro, Tuukka Ruotsalo, Lars Maaløe

Abstract: Medical coding is the task of assigning medical codes to clinical free-text documentation. Healthcare professionals manually assign such codes to track patient diagnoses and treatments. Automated medical coding can considerably alleviate this administrative burden. In this paper, we reproduce, compare, and analyze state-of-the-art automated medical coding machine learning models. We show that seve… ▽ More Medical coding is the task of assigning medical codes to clinical free-text documentation. Healthcare professionals manually assign such codes to track patient diagnoses and treatments. Automated medical coding can considerably alleviate this administrative burden. In this paper, we reproduce, compare, and analyze state-of-the-art automated medical coding machine learning models. We show that several models underperform due to weak configurations, poorly sampled train-test splits, and insufficient evaluation. In previous work, the macro F1 score has been calculated sub-optimally, and our correction doubles it. We contribute a revised model comparison using stratified sampling and identical experimental setups, including hyperparameters and decision boundary tuning. We analyze prediction errors to validate and falsify assumptions of previous works. The analysis confirms that all models struggle with rare codes, while long documents only have a negligible impact. Finally, we present the first comprehensive results on the newly released MIMIC-IV dataset using the reproduced models. We release our code, model parameters, and new MIMIC-III and MIMIC-IV training and evaluation pipelines to accommodate fair future comparisons. △ Less

Submitted 21 April, 2023; originally announced April 2023.

Comments: 11 pages, 6 figures, to be published in Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23), July 23--27, 2023, Taipei, Taiwan

ACM Class: H.3.0

arXiv:2301.11009 [pdf, other]

doi 10.1007/978-3-031-28244-7_12

Graph-based Recommendation for Sparse and Heterogeneous User Interactions

Authors: Simone Borg Bruun, Kacper Kenji Lesniak, Mirko Biasini, Vittorio Carmignani, Panagiotis Filianos, Christina Lioma, Maria Maistro

Abstract: Recommender system research has oftentimes focused on approaches that operate on large-scale datasets containing millions of user interactions. However, many small businesses struggle to apply state-of-the-art models due to their very limited availability of data. We propose a graph-based recommender model which utilizes heterogeneous interactions between users and content of different types and i… ▽ More Recommender system research has oftentimes focused on approaches that operate on large-scale datasets containing millions of user interactions. However, many small businesses struggle to apply state-of-the-art models due to their very limited availability of data. We propose a graph-based recommender model which utilizes heterogeneous interactions between users and content of different types and is able to operate well on small-scale datasets. A genetic algorithm is used to find optimal weights that represent the strength of the relationship between users and content. Experiments on two real-world datasets (which we make available to the research community) show promising results (up to 7% improvement), in comparison with other state-of-the-art methods for low-data environments. These improvements are statistically significant and consistent across different data samples. △ Less

Submitted 26 January, 2023; originally announced January 2023.

arXiv:2212.02256 [pdf, other]

doi 10.1109/CoG51982.2022.9893597

Crowdsourcing Controller -- Utilizing Reliable Agents in a Multiplayer Game

Authors: Kacper Kenji Lesniak, Maria Maistro

Abstract: This paper presents a new use case for continuous crowdsourcing, where multiple players collectively control a single character in a video game. Similar approaches have already been proposed, but they suffer from certain limitations: (1) they simply consider static time frames to group real-time inputs from multiple players; (2) then they aggregate inputs with simple majority vote, i.e., each play… ▽ More This paper presents a new use case for continuous crowdsourcing, where multiple players collectively control a single character in a video game. Similar approaches have already been proposed, but they suffer from certain limitations: (1) they simply consider static time frames to group real-time inputs from multiple players; (2) then they aggregate inputs with simple majority vote, i.e., each player is uniformly weighted. We present a continuous crowdsourcing multiplayer game equipped with our Crowdsourcing Controller. The Crowdsourcing Controller addresses the above-mentioned limitations: (1) our Dynamic Input Frame approach groups incoming players' input in real-time by dynamically adjusting the frame length; (2) our Continuous Reliability System estimates players' skills by assigning them a reliability score, which is later used in a weighted majority vote to aggregate the final output command. We evaluated our Crowdsourcing Controller offline with simulated players and online with real players. Offline and online experiments show that both components of our Crowdsourcing Controller lead to higher game scores, i.e., longer playing time. Moreover, the Crowdsourcing Controller is able to correctly estimate and update players' reliability scores. △ Less

Submitted 1 December, 2022; originally announced December 2022.

arXiv:2212.01387 [pdf, other]

FullBrain: a Social E-learning Platform

Authors: Mirko Biasini, Vittorio Carmignani, Nicola Ferro, Panagiotis Filianos, Maria Maistro, Giorgio Maria di Nunzio

Abstract: We present FullBrain, a social e-learning platform where students share and track their knowledge. FullBrain users can post notes, ask questions and share learning resources in dedicated course and concept spaces. We detail two components of FullBrain: a SIR system equipped with query autocomplete and query autosuggestion, and a Leaderboard module to improve user experience. We analyzed the day-to… ▽ More We present FullBrain, a social e-learning platform where students share and track their knowledge. FullBrain users can post notes, ask questions and share learning resources in dedicated course and concept spaces. We detail two components of FullBrain: a SIR system equipped with query autocomplete and query autosuggestion, and a Leaderboard module to improve user experience. We analyzed the day-to-day users' usage of the SIR system, measuring a time-to-complete a request below 0.11s, matching or exceeding our UX targets. Moreover, we performed stress tests which lead the way for more detailed analysis. Through a preliminary user study and log data analysis, we observe that 97% of the users' activity is directed to the top 4 positions in the leaderboard. △ Less

Submitted 1 December, 2022; originally announced December 2022.

arXiv:2212.00492 [pdf, other]

doi 10.1145/3459637.3482287

Principled Multi-Aspect Evaluation Measures of Rankings

Authors: Maria Maistro, Lucas Chaves Lima, Jakob Grue Simonsen, Christina Lioma

Abstract: Information Retrieval evaluation has traditionally focused on defining principled ways of assessing the relevance of a ranked list of documents with respect to a query. Several methods extend this type of evaluation beyond relevance, making it possible to evaluate different aspects of a document ranking (e.g., relevance, usefulness, or credibility) using a single measure (multi-aspect evaluation).… ▽ More Information Retrieval evaluation has traditionally focused on defining principled ways of assessing the relevance of a ranked list of documents with respect to a query. Several methods extend this type of evaluation beyond relevance, making it possible to evaluate different aspects of a document ranking (e.g., relevance, usefulness, or credibility) using a single measure (multi-aspect evaluation). However, these methods either are (i) tailor-made for specific aspects and do not extend to other types or numbers of aspects, or (ii) have theoretical anomalies, e.g. assign maximum score to a ranking where all documents are labelled with the lowest grade with respect to all aspects (e.g., not relevant, not credible, etc.). We present a theoretically principled multi-aspect evaluation method that can be used for any number, and any type, of aspects. A thorough empirical evaluation using up to 5 aspects and a total of 425 runs officially submitted to 10 TREC tracks shows that our method is more discriminative than the state-of-the-art and overcomes theoretical limitations of the state-of-the-art. △ Less

Submitted 1 December, 2022; originally announced December 2022.

arXiv:2211.15360 [pdf, other]

doi 10.1145/3523227.3546775

Learning Recommendations from User Actions in the Item-poor Insurance Domain

Authors: Simone Borg Bruun, Maria Maistro, Christina Lioma

Abstract: While personalised recommendations are successful in domains like retail, where large volumes of user feedback on items are available, the generation of automatic recommendations in data-sparse domains, like insurance purchasing, is an open problem. The insurance domain is notoriously data-sparse because the number of products is typically low (compared to retail) and they are usually purchased to… ▽ More While personalised recommendations are successful in domains like retail, where large volumes of user feedback on items are available, the generation of automatic recommendations in data-sparse domains, like insurance purchasing, is an open problem. The insurance domain is notoriously data-sparse because the number of products is typically low (compared to retail) and they are usually purchased to last for a long time. Also, many users still prefer the telephone over the web for purchasing products, reducing the amount of web-logged user interactions. To address this, we present a recurrent neural network recommendation model that uses past user sessions as signals for learning recommendations. Learning from past user sessions allows dealing with the data scarcity of the insurance domain. Specifically, our model learns from several types of user actions that are not always associated with items, and unlike all prior session-based recommendation models, it models relationships between input sessions and a target action (purchasing insurance) that does not take place within the input sessions. Evaluation on a real-world dataset from the insurance domain (ca. 44K users, 16 items, 54K purchases, and 117K sessions) against several state-of-the-art baselines shows that our model outperforms the baselines notably. Ablation analysis shows that this is mainly due to the learning of dependencies across sessions in our model. We contribute the first ever session-based model for insurance recommendation, and make available our dataset to the research community. △ Less

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2210.10266 [pdf, ps, other]

Corrected Evaluation Results of the NTCIR WWW-2, WWW-3, and WWW-4 English Subtasks

Authors: Tetsuya Sakai, Sijie Tao, Maria Maistro, Zhumin Chu, Yujing Li, Nuo Chen, Nicola Ferro, Junjie Wang, Ian Soboroff, Yiqun Liu

Abstract: Unfortunately, the official English (sub)task results reported in the NTCIR-14 WWW-2, NTCIR-15 WWW-3, and NTCIR-16 WWW-4 overview papers are incorrect due to noise in the official qrels files; this paper reports results based on the corrected qrels files. The noise is due to a fatal bug in the backend of our relevance assessment interface. More specifically, at WWW-2, WWW-3, and WWW-4, two version… ▽ More Unfortunately, the official English (sub)task results reported in the NTCIR-14 WWW-2, NTCIR-15 WWW-3, and NTCIR-16 WWW-4 overview papers are incorrect due to noise in the official qrels files; this paper reports results based on the corrected qrels files. The noise is due to a fatal bug in the backend of our relevance assessment interface. More specifically, at WWW-2, WWW-3, and WWW-4, two versions of pool files were created for each English topic: a PRI ("prioritised") file, which uses the NTCIRPOOL script to prioritise likely relevant documents, and a RND ("randomised") file, which randomises the pooled documents. This was done for the purpose of studying the effect of document ordering for relevance assessors. However, the programmer who wrote the interface backend assumed that a combination of a topic ID and a document rank in the pool file uniquely determines a document ID; this is obviously incorrect as we have two versions of pool files. The outcome is that all the PRI-based relevance labels for the WWW-2 test collection are incorrect (while all the RND-based relevance labels are correct), and all the RND-based relevance labels for the WWW-3 and WWW-4 test collections are incorrect (while all the PRI-based relevance labels are correct). This bug was finally discovered at the NTCIR-16 WWW-4 task when the first seven authors of this paper served as Gold assessors (i.e., topic creators who define what is relevant) and closely examined the disagreements with Bronze assessors (i.e., non-topic-creators; non-experts). We would like to apologise to the WWW participants and the NTCIR chairs for the inconvenience and confusion caused due to this bug. △ Less

Submitted 18 October, 2022; originally announced October 2022.

Comments: 24 pages

arXiv:2201.07599 [pdf, other]

doi 10.1007/978-3-030-72240-1_51

repro_eval: A Python Interface to Reproducibility Measures of System-oriented IR Experiments

Authors: Timo Breuer, Nicola Ferro, Maria Maistro, Philipp Schaer

Abstract: In this work we introduce repro_eval - a tool for reactive reproducibility studies of system-oriented information retrieval (IR) experiments. The corresponding Python package provides IR researchers with measures for different levels of reproduction when evaluating their systems' outputs. By offering an easily extensible interface, we hope to stimulate common practices when conducting a reproducib… ▽ More In this work we introduce repro_eval - a tool for reactive reproducibility studies of system-oriented information retrieval (IR) experiments. The corresponding Python package provides IR researchers with measures for different levels of reproduction when evaluating their systems' outputs. By offering an easily extensible interface, we hope to stimulate common practices when conducting a reproducibility study of system-oriented IR experiments. △ Less

Submitted 19 January, 2022; originally announced January 2022.

Comments: Accepted at ECIR21. The final authenticated version is available online at https://doi.org/10.1007/978-3-030-72240-1_51

arXiv:2103.02462 [pdf, other]

University of Copenhagen Participation in TREC Health Misinformation Track 2020

Authors: Lucas Chaves Lima, Dustin Brandon Wright, Isabelle Augenstein, Maria Maistro

Abstract: In this paper, we describe our participation in the TREC Health Misinformation Track 2020. We submitted $11$ runs to the Total Recall Task and 13 runs to the Ad Hoc task. Our approach consists of 3 steps: (1) we create an initial run with BM25 and RM3; (2) we estimate credibility and misinformation scores for the documents in the initial run; (3) we merge the relevance, credibility and misinformat… ▽ More In this paper, we describe our participation in the TREC Health Misinformation Track 2020. We submitted $11$ runs to the Total Recall Task and 13 runs to the Ad Hoc task. Our approach consists of 3 steps: (1) we create an initial run with BM25 and RM3; (2) we estimate credibility and misinformation scores for the documents in the initial run; (3) we merge the relevance, credibility and misinformation scores to re-rank documents in the initial run. To estimate credibility scores, we implement a classifier which exploits features based on the content and the popularity of a document. To compute the misinformation score, we apply a stance detection approach with a pretrained Transformer language model. Finally, we use different approaches to merge scores: weighted average, the distance among score vectors and rank fusion. △ Less

Submitted 3 March, 2021; originally announced March 2021.

arXiv:2012.12366 [pdf, other]

Multi-Head Self-Attention with Role-Guided Masks

Authors: Dongsheng Wang, Casper Hansen, Lucas Chaves Lima, Christian Hansen, Maria Maistro, Jakob Grue Simonsen, Christina Lioma

Abstract: The state of the art in learning meaningful semantic representations of words is the Transformer model and its attention mechanisms. Simply put, the attention mechanisms learn to attend to specific parts of the input dispensing recurrence and convolutions. While some of the learned attention heads have been found to play linguistically interpretable roles, they can be redundant or prone to errors.… ▽ More The state of the art in learning meaningful semantic representations of words is the Transformer model and its attention mechanisms. Simply put, the attention mechanisms learn to attend to specific parts of the input dispensing recurrence and convolutions. While some of the learned attention heads have been found to play linguistically interpretable roles, they can be redundant or prone to errors. We propose a method to guide the attention heads towards roles identified in prior work as important. We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input, such that different heads are designed to play different roles. Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines. △ Less

Submitted 22 December, 2020; originally announced December 2020.

Comments: Accepted at ECIR@2021

arXiv:2011.12684 [pdf, other]

Denmark's Participation in the Search Engine TREC COVID-19 Challenge: Lessons Learned about Searching for Precise Biomedical Scientific Information on COVID-19

Authors: Lucas Chaves Lima, Casper Hansen, Christian Hansen, Dongsheng Wang, Maria Maistro, Birger Larsen, Jakob Grue Simonsen, Christina Lioma

Abstract: This report describes the participation of two Danish universities, University of Copenhagen and Aalborg University, in the international search engine competition on COVID-19 (the 2020 TREC-COVID Challenge) organised by the U.S. National Institute of Standards and Technology (NIST) and its Text Retrieval Conference (TREC) division. The aim of the competition was to find the best search engine str… ▽ More This report describes the participation of two Danish universities, University of Copenhagen and Aalborg University, in the international search engine competition on COVID-19 (the 2020 TREC-COVID Challenge) organised by the U.S. National Institute of Standards and Technology (NIST) and its Text Retrieval Conference (TREC) division. The aim of the competition was to find the best search engine strategy for retrieving precise biomedical scientific information on COVID-19 from the largest, at that point in time, dataset of curated scientific literature on COVID-19 -- the COVID-19 Open Research Dataset (CORD-19). CORD-19 was the result of a call to action to the tech community by the U.S. White House in March 2020, and was shortly thereafter posted on Kaggle as an AI competition by the Allen Institute for AI, the Chan Zuckerberg Initiative, Georgetown University's Center for Security and Emerging Technology, Microsoft, and the National Library of Medicine at the US National Institutes of Health. CORD-19 contained over 200,000 scholarly articles (of which more than 100,000 were with full text) about COVID-19, SARS-CoV-2, and related coronaviruses, gathered from curated biomedical sources. The TREC-COVID challenge asked for the best way to (a) retrieve accurate and precise scientific information, in response to some queries formulated by biomedical experts, and (b) rank this information decreasingly by its relevance to the query. In this document, we describe the TREC-COVID competition setup, our participation to it, and our resulting reflections and lessons learned about the state-of-art technology when faced with the acute task of retrieving precise scientific information from a rapidly growing corpus of literature, in response to highly specialised queries, in the middle of a pandemic. △ Less

Submitted 26 November, 2020; v1 submitted 25 November, 2020; originally announced November 2020.

arXiv:2010.13447 [pdf, other]

doi 10.1145/3397271.3401036

How to Measure the Reproducibility of System-oriented IR Experiments

Authors: Timo Breuer, Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, Philipp Schaer, Ian Soboroff

Abstract: Replicability and reproducibility of experimental results are primary concerns in all the areas of science and IR is not an exception. Besides the problem of moving the field towards more reproducible experimental practices and protocols, we also face a severe methodological issue: we do not have any means to assess when reproduced is reproduced. Moreover, we lack any reproducibility-oriented data… ▽ More Replicability and reproducibility of experimental results are primary concerns in all the areas of science and IR is not an exception. Besides the problem of moving the field towards more reproducible experimental practices and protocols, we also face a severe methodological issue: we do not have any means to assess when reproduced is reproduced. Moreover, we lack any reproducibility-oriented dataset, which would allow us to develop such methods. To address these issues, we compare several measures to objectively quantify to what extent we have replicated or reproduced a system-oriented IR experiment. These measures operate at different levels of granularity, from the fine-grained comparison of ranked lists, to the more general comparison of the obtained effects and significant differences. Moreover, we also develop a reproducibility-oriented dataset, which allows us to validate our measures and which can also be used to develop future measures. △ Less

Submitted 26 October, 2020; originally announced October 2020.

Comments: SIGIR2020 Full Conference Paper

Showing 1–19 of 19 results for author: Maistro, M