-
OARelatedWork: A Large-Scale Dataset of Related Work Sections with Full-texts from Open Access Sources
Authors:
Martin Docekal,
Martin Fajcik,
Pavel Smrz
Abstract:
This paper introduces OARelatedWork, the first large-scale multi-document summarization dataset for related work generation containing whole related work sections and full-texts of cited papers. The dataset includes 94 450 papers and 5 824 689 unique referenced papers. It was designed for the task of automatically generating related work to shift the field toward generating entire related work sec…
▽ More
This paper introduces OARelatedWork, the first large-scale multi-document summarization dataset for related work generation containing whole related work sections and full-texts of cited papers. The dataset includes 94 450 papers and 5 824 689 unique referenced papers. It was designed for the task of automatically generating related work to shift the field toward generating entire related work sections from all available content instead of generating parts of related work sections from abstracts only, which is the current mainstream in this field for abstractive approaches. We show that the estimated upper bound for extractive summarization increases by 217% in the ROUGE-2 score, when using full content instead of abstracts. Furthermore, we show the benefits of full content data on naive, oracle, traditional, and transformer-based baselines. Long outputs, such as related work sections, pose challenges for automatic evaluation metrics like BERTScore due to their limited input length. We tackle this issue by proposing and evaluating a meta-metric using BERTScore. Despite operating on smaller blocks, we show this meta-metric correlates with human judgment, comparably to the original BERTScore.
△ Less
Submitted 3 May, 2024;
originally announced May 2024.
-
IDIAPers @ Causal News Corpus 2022: Efficient Causal Relation Identification Through a Prompt-based Few-shot Approach
Authors:
Sergio Burdisso,
Juan Zuluaga-Gomez,
Esau Villatoro-Tello,
Martin Fajcik,
Muskaan Singh,
Pavel Smrz,
Petr Motlicek
Abstract:
In this paper, we describe our participation in the subtask 1 of CASE-2022, Event Causality Identification with Casual News Corpus. We address the Causal Relation Identification (CRI) task by exploiting a set of simple yet complementary techniques for fine-tuning language models (LMs) on a small number of annotated examples (i.e., a few-shot configuration). We follow a prompt-based prediction appr…
▽ More
In this paper, we describe our participation in the subtask 1 of CASE-2022, Event Causality Identification with Casual News Corpus. We address the Causal Relation Identification (CRI) task by exploiting a set of simple yet complementary techniques for fine-tuning language models (LMs) on a small number of annotated examples (i.e., a few-shot configuration). We follow a prompt-based prediction approach for fine-tuning LMs in which the CRI task is treated as a masked language modeling problem (MLM). This approach allows LMs natively pre-trained on MLM problems to directly generate textual responses to CRI-specific prompts. We compare the performance of this method against ensemble techniques trained on the entire dataset. Our best-performing submission was fine-tuned with only 256 instances per class, 15.7% of the all available data, and yet obtained the second-best precision (0.82), third-best accuracy (0.82), and an F1-score (0.85) very close to what was reported by the winner team (0.86).
△ Less
Submitted 14 October, 2022; v1 submitted 8 September, 2022;
originally announced September 2022.
-
IDIAPers @ Causal News Corpus 2022: Extracting Cause-Effect-Signal Triplets via Pre-trained Autoregressive Language Model
Authors:
Martin Fajcik,
Muskaan Singh,
Juan Zuluaga-Gomez,
Esaú Villatoro-Tello,
Sergio Burdisso,
Petr Motlicek,
Pavel Smrz
Abstract:
In this paper, we describe our shared task submissions for Subtask 2 in CASE-2022, Event Causality Identification with Casual News Corpus. The challenge focused on the automatic detection of all cause-effect-signal spans present in the sentence from news-media. We detect cause-effect-signal spans in a sentence using T5 -- a pre-trained autoregressive language model. We iteratively identify all cau…
▽ More
In this paper, we describe our shared task submissions for Subtask 2 in CASE-2022, Event Causality Identification with Casual News Corpus. The challenge focused on the automatic detection of all cause-effect-signal spans present in the sentence from news-media. We detect cause-effect-signal spans in a sentence using T5 -- a pre-trained autoregressive language model. We iteratively identify all cause-effect-signal span triplets, always conditioning the prediction of the next triplet on the previously predicted ones. To predict the triplet itself, we consider different causal relationships such as cause$\rightarrow$effect$\rightarrow$signal. Each triplet component is generated via a language model conditioned on the sentence, the previous parts of the current triplet, and previously predicted triplets. Despite training on an extremely small dataset of 160 samples, our approach achieved competitive performance, being placed second in the competition. Furthermore, we show that assuming either cause$\rightarrow$effect or effect$\rightarrow$cause order achieves similar results.
△ Less
Submitted 20 October, 2022; v1 submitted 8 September, 2022;
originally announced September 2022.
-
Claim-Dissector: An Interpretable Fact-Checking System with Joint Re-ranking and Veracity Prediction
Authors:
Martin Fajcik,
Petr Motlicek,
Pavel Smrz
Abstract:
We present Claim-Dissector: a novel latent variable model for fact-checking and analysis, which given a claim and a set of retrieved evidences jointly learns to identify: (i) the relevant evidences to the given claim, (ii) the veracity of the claim. We propose to disentangle the per-evidence relevance probability and its contribution to the final veracity probability in an interpretable way -- the…
▽ More
We present Claim-Dissector: a novel latent variable model for fact-checking and analysis, which given a claim and a set of retrieved evidences jointly learns to identify: (i) the relevant evidences to the given claim, (ii) the veracity of the claim. We propose to disentangle the per-evidence relevance probability and its contribution to the final veracity probability in an interpretable way -- the final veracity probability is proportional to a linear ensemble of per-evidence relevance probabilities. In this way, the individual contributions of evidences towards the final predicted probability can be identified. In per-evidence relevance probability, our model can further distinguish whether each relevant evidence is supporting (S) or refuting (R) the claim. This allows to quantify how much the S/R probability contributes to the final verdict or to detect disagreeing evidence.
Despite its interpretable nature, our system achieves results competitive with state-of-the-art on the FEVER dataset, as compared to typical two-stage system pipelines, while using significantly fewer parameters. It also sets new state-of-the-art on FAVIQ and RealFC datasets. Furthermore, our analysis shows that our model can learn fine-grained relevance cues while using coarse-grained supervision, and we demonstrate it in 2 ways. (i) We show that our model can achieve competitive sentence recall while using only paragraph-level relevance supervision. (ii) Traversing towards the finest granularity of relevance, we show that our model is capable of identifying relevance at the token level. To do this, we present a new benchmark TLR-FEVER focusing on token-level interpretability -- humans annotate tokens in relevant evidences they considered essential when making their judgment. Then we measure how similar are these annotations to the tokens our model is focusing on.
△ Less
Submitted 7 August, 2023; v1 submitted 28 July, 2022;
originally announced July 2022.
-
R2-D2: A Modular Baseline for Open-Domain Question Answering
Authors:
Martin Fajcik,
Martin Docekal,
Karel Ondrej,
Pavel Smrz
Abstract:
This work presents a novel four-stage open-domain QA pipeline R2-D2 (Rank twice, reaD twice). The pipeline is composed of a retriever, passage reranker, extractive reader, generative reader and a mechanism that aggregates the final prediction from all system's components. We demonstrate its strength across three open-domain QA datasets: NaturalQuestions, TriviaQA and EfficientQA, surpassing state-…
▽ More
This work presents a novel four-stage open-domain QA pipeline R2-D2 (Rank twice, reaD twice). The pipeline is composed of a retriever, passage reranker, extractive reader, generative reader and a mechanism that aggregates the final prediction from all system's components. We demonstrate its strength across three open-domain QA datasets: NaturalQuestions, TriviaQA and EfficientQA, surpassing state-of-the-art on the first two. Our analysis demonstrates that: (i) combining extractive and generative reader yields absolute improvements up to 5 exact match and it is at least twice as effective as the posterior averaging ensemble of the same models with different parameters, (ii) the extractive reader with fewer parameters can match the performance of the generative reader on extractive QA datasets.
△ Less
Submitted 8 September, 2021;
originally announced September 2021.
-
Pruning the Index Contents for Memory Efficient Open-Domain QA
Authors:
Martin Fajcik,
Martin Docekal,
Karel Ondrej,
Pavel Smrz
Abstract:
This work presents a novel pipeline that demonstrates what is achievable with a combined effort of state-of-the-art approaches. Specifically, it proposes the novel R2-D2 (Rank twice, reaD twice) pipeline composed of retriever, passage reranker, extractive reader, generative reader and a simple way to combine them. Furthermore, previous work often comes with a massive index of external documents th…
▽ More
This work presents a novel pipeline that demonstrates what is achievable with a combined effort of state-of-the-art approaches. Specifically, it proposes the novel R2-D2 (Rank twice, reaD twice) pipeline composed of retriever, passage reranker, extractive reader, generative reader and a simple way to combine them. Furthermore, previous work often comes with a massive index of external documents that scales in the order of tens of GiB. This work presents a simple approach for pruning the contents of a massive index such that the open-domain QA system altogether with index, OS, and library components fits into 6GiB docker image while retaining only 8% of original index contents and losing only 3% EM accuracy.
△ Less
Submitted 9 April, 2021; v1 submitted 21 February, 2021;
originally announced February 2021.
-
NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned
Authors:
Sewon Min,
Jordan Boyd-Graber,
Chris Alberti,
Danqi Chen,
Eunsol Choi,
Michael Collins,
Kelvin Guu,
Hannaneh Hajishirzi,
Kenton Lee,
Jennimaria Palomaki,
Colin Raffel,
Adam Roberts,
Tom Kwiatkowski,
Patrick Lewis,
Yuxiang Wu,
Heinrich Küttler,
Linqing Liu,
Pasquale Minervini,
Pontus Stenetorp,
Sebastian Riedel,
Sohee Yang,
Minjoon Seo,
Gautier Izacard,
Fabio Petroni,
Lucas Hosseini
, et al. (28 additional authors not shown)
Abstract:
We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage conte…
▽ More
We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage contestants to explore the trade-off between storing retrieval corpora or the parameters of learned models. In this report, we describe the motivation and organization of the competition, review the best submissions, and analyze system predictions to inform a discussion of evaluation for open-domain QA.
△ Less
Submitted 19 September, 2021; v1 submitted 31 December, 2020;
originally announced January 2021.
-
Rethinking the Objectives of Extractive Question Answering
Authors:
Martin Fajcik,
Josef Jon,
Pavel Smrz
Abstract:
This work demonstrates that using the objective with independence assumption for modelling the span probability $P(a_s,a_e) = P(a_s)P(a_e)$ of span starting at position $a_s$ and ending at position $a_e$ has adverse effects. Therefore we propose multiple approaches to modelling joint probability $P(a_s,a_e)$ directly. Among those, we propose a compound objective, composed from the joint probabilit…
▽ More
This work demonstrates that using the objective with independence assumption for modelling the span probability $P(a_s,a_e) = P(a_s)P(a_e)$ of span starting at position $a_s$ and ending at position $a_e$ has adverse effects. Therefore we propose multiple approaches to modelling joint probability $P(a_s,a_e)$ directly. Among those, we propose a compound objective, composed from the joint probability while still keeping the objective with independence assumption as an auxiliary objective. We find that the compound objective is consistently superior or equal to other assumptions in exact match. Additionally, we identified common errors caused by the assumption of independence and manually checked the counterpart predictions, demonstrating the impact of the compound objective on the real examples. Our findings are supported via experiments with three extractive QA models (BIDAF, BERT, ALBERT) over six datasets and our code, individual results and manual analysis are available online.
△ Less
Submitted 12 October, 2021; v1 submitted 28 August, 2020;
originally announced August 2020.
-
JokeMeter at SemEval-2020 Task 7: Convolutional humor
Authors:
Martin Docekal,
Martin Fajcik,
Josef Jon,
Pavel Smrz
Abstract:
This paper describes our system that was designed for Humor evaluation within the SemEval-2020 Task 7. The system is based on convolutional neural network architecture. We investigate the system on the official dataset, and we provide more insight to model itself to see how the learned inner features look.
This paper describes our system that was designed for Humor evaluation within the SemEval-2020 Task 7. The system is based on convolutional neural network architecture. We investigate the system on the official dataset, and we provide more insight to model itself to see how the learned inner features look.
△ Less
Submitted 25 August, 2020;
originally announced August 2020.
-
BUT-FIT at SemEval-2020 Task 4: Multilingual commonsense
Authors:
Josef Jon,
Martin Fajčík,
Martin Dočekal,
Pavel Smrž
Abstract:
This paper describes work of the BUT-FIT's team at SemEval 2020 Task 4 - Commonsense Validation and Explanation. We participated in all three subtasks. In subtasks A and B, our submissions are based on pretrained language representation models (namely ALBERT) and data augmentation. We experimented with solving the task for another language, Czech, by means of multilingual models and machine transl…
▽ More
This paper describes work of the BUT-FIT's team at SemEval 2020 Task 4 - Commonsense Validation and Explanation. We participated in all three subtasks. In subtasks A and B, our submissions are based on pretrained language representation models (namely ALBERT) and data augmentation. We experimented with solving the task for another language, Czech, by means of multilingual models and machine translated dataset, or translated model inputs. We show that with a strong machine translation system, our system can be used in another language with a small accuracy loss. In subtask C, our submission, which is based on pretrained sequence-to-sequence model (BART), ranked 1st in BLEU score ranking, however, we show that the correlation between BLEU and human evaluation, in which our submission ended up 4th, is low. We analyse the metrics used in the evaluation and we propose an additional score based on model from subtask B, which correlates well with our manual ranking, as well as reranking method based on the same principle. We performed an error and dataset analysis for all subtasks and we present our findings.
△ Less
Submitted 21 August, 2020; v1 submitted 17 August, 2020;
originally announced August 2020.
-
BUT-FIT at SemEval-2020 Task 5: Automatic detection of counterfactual statements with deep pre-trained language representation models
Authors:
Martin Fajcik,
Josef Jon,
Martin Docekal,
Pavel Smrz
Abstract:
This paper describes BUT-FIT's submission at SemEval-2020 Task 5: Modelling Causal Reasoning in Language: Detecting Counterfactuals. The challenge focused on detecting whether a given statement contains a counterfactual (Subtask 1) and extracting both antecedent and consequent parts of the counterfactual from the text (Subtask 2). We experimented with various state-of-the-art language representati…
▽ More
This paper describes BUT-FIT's submission at SemEval-2020 Task 5: Modelling Causal Reasoning in Language: Detecting Counterfactuals. The challenge focused on detecting whether a given statement contains a counterfactual (Subtask 1) and extracting both antecedent and consequent parts of the counterfactual from the text (Subtask 2). We experimented with various state-of-the-art language representation models (LRMs). We found RoBERTa LRM to perform the best in both subtasks. We achieved the first place in both exact match and F1 for Subtask 2 and ranked second for Subtask 1.
△ Less
Submitted 28 July, 2020;
originally announced July 2020.
-
BUT-FIT at SemEval-2019 Task 7: Determining the Rumour Stance with Pre-Trained Deep Bidirectional Transformers
Authors:
Martin Fajcik,
Lukáš Burget,
Pavel Smrz
Abstract:
This paper describes our system submitted to SemEval 2019 Task 7: RumourEval 2019: Determining Rumour Veracity and Support for Rumours, Subtask A (Gorrell et al., 2019). The challenge focused on classifying whether posts from Twitter and Reddit support, deny, query, or comment a hidden rumour, truthfulness of which is the topic of an underlying discussion thread. We formulate the problem as a stan…
▽ More
This paper describes our system submitted to SemEval 2019 Task 7: RumourEval 2019: Determining Rumour Veracity and Support for Rumours, Subtask A (Gorrell et al., 2019). The challenge focused on classifying whether posts from Twitter and Reddit support, deny, query, or comment a hidden rumour, truthfulness of which is the topic of an underlying discussion thread. We formulate the problem as a stance classification, determining the rumour stance of a post with respect to the previous thread post and the source thread post. The recent BERT architecture was employed to build an end-to-end system which has reached the F1 score of 61.67% on the provided test data. It finished at the 2nd place in the competition, without any hand-crafted features, only 0.2% behind the winner.
△ Less
Submitted 21 March, 2019; v1 submitted 25 February, 2019;
originally announced February 2019.
-
Automation of Processor Verification Using Recurrent Neural Networks
Authors:
Martin Fajcik,
Marcela Zachariasova,
Pavel Smrz
Abstract:
When considering simulation-based verification of processors, the current trend is to generate stimuli using pseudorandom generators (PRGs), apply them to the processor inputs and monitor the achieved coverage of its functionality in order to determine verification completeness. Stimuli can have different forms, for example, they can be represented by bit vectors applied to the input ports of the…
▽ More
When considering simulation-based verification of processors, the current trend is to generate stimuli using pseudorandom generators (PRGs), apply them to the processor inputs and monitor the achieved coverage of its functionality in order to determine verification completeness. Stimuli can have different forms, for example, they can be represented by bit vectors applied to the input ports of the processor or by programs that are loaded directly into the program memory. In this paper, we propose a new technique dynamically altering constraints for PRG via recurrent neural network, which receives a coverage feedback from the simulation of design under verification. For the demonstration purposes we used processors provided by Codasip as their coverage state space is reasonably big and differs for various kinds of processors. Nevertheless, techniques presented in this paper are widely applicable. The results of experiments show that not only the coverage closure is achieved much sooner, but we are able to isolate a small set of stimuli with high coverage that can be used for running regression tests.
△ Less
Submitted 6 March, 2018;
originally announced March 2018.