Continual Learning for
Temporal-Sensitive Question Answering

Wanqi Yang, Yunqiu Xu, Yanda Li, Kunze Wang, Binbin Huang, Ling Chen

Wanqi Yang, Yunqiu Xu, Yanda Li and Ling Chen are with University of Technology Sydney, Sydney, 2007, Australia. Kunze Wang is with University of Sydney, Sydney, 2050, Australia. Binbin Huang is with Hangzhou Dianzi University, Hangzhou, 310018, China.

Abstract

In this study, we explore an emerging research area of Continual Learning for Temporal Sensitive Question Answering (CLTSQA). Previous research has primarily focused on Temporal Sensitive Question Answering (TSQA), often overlooking the unpredictable nature of future events. In real-world applications, it’s crucial for models to continually acquire knowledge over time, rather than relying on a static, complete dataset. Our paper investigates strategies that enable models to adapt to the ever-evolving information landscape, thereby addressing the challenges inherent in CLTSQA. To support our research, we first create a novel dataset, divided into five subsets, designed specifically for various stages of continual learning. We then propose a training framework for CLTSQA that integrates temporal memory replay and temporal contrastive learning. Our experimental results highlight two significant insights: First, the CLTSQA task introduces unique challenges for existing models. Second, our proposed framework effectively navigates these challenges, resulting in improved performance.

Index Terms:
continual learning, temporal-sensitive question, question answering

I Introduction

A temporal-sensitive question is a question that involves temporal details, such that modifying the temporal information in the question results in a different answer [1]. Take the question “What was the role of Barack Hussein Obama in YEAR?” as an example. If YEAR = 2006, the answer should be “Federal Senator”; whereas if YEAR = 2016, the answer should be “President of the United States”. In everyday life, we frequently encounter questions influenced by time, with answers that can change as new events occur. This unpredictability highlights the need for a novel task called Continual Learning for Temporal Sensitive Question Answering (CLTSQA), which requires a temporal-sensitive question answering model to be learned continually as time progresses.

Although some work has been conducted in related areas, two key challenges of CLTSQA have been overlooked: the absence of a suitable dataset, and the scarcity of effective methods for continually handling temporal-sensitive questions. Regarding the first challenge, some existing works, e.g., [1, 2, 5, 4, 3], proposed new datasets for Temporal-sensitive Question Answering (TSQA) to explore a model’s sensitivity to temporal information and its ability to reason over it, but they follow the setting of traditional question answering. As shown in Fig. 1, TSQA assumes that the entire dataset is available for training the model. It lacks the ability to continuously incorporate updated and new data, which could alter the answer to a question as time progresses. Regarding the second challenge, many works have been proposed to retain a model’s performance on evolving datasets through continual learning. For example, [6] studied continual learning for a single domain (Twitter data from 2018 to 2019), and [7] worked on efficient life-long pre-training on emerging data in multiple domains. However, no existing study applies continual learning specifically to CLTSQA.

Figure 1: The difference in the training process between TSQA and CLTSQA. While TSQA assumes the availability of the whole training dataset, CLTSQA requires the model to keep ingesting up-to-date new knowledge.

The objective of the Continual Learning for Temporal Sensitive Question Answering (CLTSQA) task is to simulate a real-world scenario in which updates and new knowledge cannot be learned all at once but must be acquired through continual learning. The CLTSQA task examines both how much a model forgets knowledge from earlier time periods and how well it acquires updated and new knowledge over time. To deal with the absence of an available dataset, we construct a new dataset made up of subsets of temporal-sensitive questions, which facilitates the study of CLTSQA. Then, to make the model capable of handling temporal-sensitive questions in a continual fashion, we propose a novel framework featuring 1) temporal memory replay to alleviate the catastrophic forgetting of past knowledge; and 2) temporal contrastive learning to enhance the model’s sensitivity to temporal information and boost its performance on questions with the most up-to-date information. The experimental results show that: 1) existing models struggle with this challenging task, resulting in poor performance; 2) our proposed framework effectively helps the models to address CLTSQA, demonstrating not only improvement in answering the most up-to-date questions, but also good performance retention when answering historical questions.

The main contributions of this work are summarised as:

  • We propose a novel task called CLTSQA.

  • We propose a new dataset to deal with the absence of an available dataset and to facilitate the study of CLTSQA.

  • We propose a novel framework featured by temporal memory replay and temporal contrastive learning to deal with the model-level challenge in CLTSQA.

  • We have obtained experimental findings indicating that: 1) CLTSQA is a challenging yet promising task, and 2) our framework assists the model in effectively addressing CLTSQA.

II Related Work

II-A Temporal-Sensitive Question Answering

Some previous studies have explored the task of Temporal-sensitive Question Answering by introducing new datasets. The TempQuestions dataset [8] provides a clear definition of what constitutes a “temporal question” and utilizes specific trigger words such as “before” and “after”. To investigate temporal questions, [9] noted that answers to a question can change over time and created a dataset with 13% temporal-sensitive data. [1], [2] and [3] also created new datasets, with a primary focus on TSQA. By evaluating existing models on the proposed datasets, these works showed that answering temporal-sensitive questions is challenging, which serves as a motivation for our study. Different from them, we not only extend TSQA towards a more realistic and challenging task, CLTSQA, but also offer solutions to enhance model performance in tackling it.

In addition to the dataset, temporal-sensitive question learning requires the model to be sensitive to temporal information. Several studies have utilized pre-trained language models to aid in question comprehension. However, these models do not effectively distinguish between different temporal expressions found in free text [10, 12, 11, 5]. Inspired by the framework proposed in [13], our framework introduces temporal contrastive learning so that the model learns that the crucial factor is the variation in temporal information, rather than the specific wording of the question.

II-B Continual Learning

Numerous research efforts have been dedicated to the examination of continual learning for general QA [14, 15]. Through extensive exploration of the general question answering domain, researchers have discovered that temporal-related QA tasks pose greater challenges.

[4] proposed a dataset named StreamingQA, which aims to investigate models’ adaptation to changing knowledge. The dataset’s context spans the years 2007 to 2020, with questions that do not involve temporally sensitive information. The StreamingQA dataset employs a specific data format (question date, question, answer, document date, document), and the question date for each query is intentionally set by the author. However, datasets with additional fields and narrower timeframes do not inherently enhance the model’s robustness and generalizability. [16] designed a new continual learning task called continual knowledge learning (CKL). From a task-oriented perspective, the aim of CKL is to consistently enhance the internal knowledge of the language model through ongoing pre-training on new datasets. A noteworthy distinction is that CKL predominantly concentrates on enriching the internal knowledge within the pre-trained model, encompassing a broader domain. In contrast, CLTSQA places a stronger emphasis on a downstream task, wherein the model continuously learns and adapts to temporal-sensitive question answering. Moreover, some temporal-related QA datasets for continual learning were proposed in [6] and [17]. [6] extracted data from Twitter and divided the data into subsets of three months each for continual learning, and [17] employed the difference between consecutive snapshots of English Wikipedia and English Wikidata for both training and evaluation purposes. However, they simply used existing classical methods [19, 18, 22, 20, 21] that can alleviate catastrophic forgetting in continual learning, instead of proposing improvement strategies based on their datasets.

III Preliminaries

TSQA

The Temporal Sensitive Question Answering (TSQA) task aims to investigate the model’s sensitivity and reasoning capabilities concerning temporal information. In the TSQA, the model is provided with a context $c$ (e.g., a document, or a series of sentences) and a question $q$ as the input. Then, the model is required to predict the answer $a$ through either extracting from $c$, or selecting one from a set of answer candidates. The specific task setup for TSQA involves training the model on an entire dataset. In order to answer temporal-sensitive questions, the model is required to not only pay specific attention to temporal information within the question, but also be capable of reasoning over the implicit temporal information within the context.

CLTSQA

The TSQA task assumes that the model is trained on a complete dataset; it therefore cannot continuously integrate updated or new data with temporal information. To relax this assumption, and thus bridge the gap between TSQA and real-world temporal-sensitive problems, we propose a new task, CLTSQA, which forces the model to learn and perform inference in a continual learning manner. The major difference between the two tasks lies in the dataset and training settings. Instead of assuming the availability of a whole dataset, CLTSQA requires the model to stay aware of the latest knowledge while not forgetting old knowledge. The training data is divided into $K$ subsets $\mathcal{D}=\{\mathcal{D}_1,\dots,\mathcal{D}_K\}$, with each subset covering time points that are chronologically earlier than those in the subsequent subset, i.e., $t_{\mathcal{D}_{k-1}}<t_{\mathcal{D}_k}$.
Given an initial model $M_0$, it is trained on the subsets in sequence to obtain the corresponding models $M_1,M_2,\dots,M_K$, where $M_k$ denotes the model after training on $\mathcal{D}_1,\mathcal{D}_2,\dots,\mathcal{D}_k$. Each subsequent model loads the weights of the previous model and continues training. The model $M_k$ is required to perform well on the current subset $\mathcal{D}_k$, while not suffering significant performance decay on the previous subsets $\overline{\mathcal{D}}_{k-1}$.
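
The sequential protocol above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `continual_train`, `toy_train` and `toy_eval` are hypothetical, the "model" is a plain dictionary standing in for a neural QA model, and for brevity each subset is reused as its own evaluation set, whereas the paper evaluates on separate dev and test splits.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[str, str]  # (question, answer); a simplified stand-in for a QA sample


def continual_train(
    model,
    subsets: Sequence[List[Sample]],
    train_fn: Callable,
    eval_fn: Callable,
) -> List[Dict[int, float]]:
    """Train on D_1..D_K in order; after step k, score M_k on D_1..D_k."""
    history = []
    for k, subset in enumerate(subsets, start=1):
        # M_k starts from the weights of M_{k-1} and continues training on D_k.
        model = train_fn(model, subset)
        # Evaluate on the current subset and on every earlier one to expose forgetting.
        scores = {j: eval_fn(model, subsets[j - 1]) for j in range(1, k + 1)}
        history.append(scores)
    return history


# Toy stand-ins (assumptions): the "model" memorises question -> answer pairs.
def toy_train(model: dict, subset: List[Sample]) -> dict:
    updated = dict(model)
    updated.update(subset)
    return updated


def toy_eval(model: dict, subset: List[Sample]) -> float:
    return sum(model.get(q) == a for q, a in subset) / len(subset)


d1 = [("What was the role of Barack Hussein Obama in 2006?", "Federal Senator")]
d2 = [("What was the role of Barack Hussein Obama in 2016?", "President of the United States")]
history = continual_train({}, [d1, d2], toy_train, toy_eval)
```

Each entry of `history` maps a subset index to the score of the model at that stage, so a drop in the score of an early subset across stages is a direct readout of forgetting.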

IV CLTSQA Dataset

Figure 2: Examples of CLTSQA-Data. The top part shows the dataset divided based on time intervals. In the bottom part, the left side represents the context and the right side represents the corresponding question-target pairs.

In this section, we introduce a new dataset, CLTSQA-Data, with the aim of addressing the aforementioned data-level challenge. Our dataset is built on the basis of TimeQA [1], which extracts time-evolving contexts from WikiData and generates question-answer pairs from these contexts using manually written templates.

We chose a collection of 20,000 questions and 5,000 contexts sourced from TimeQA. Moreover, we produced a higher volume of context-specific temporal-sensitive questions. As a result, our dataset now encompasses a total of 50,000 questions and 5,000 contexts. We then divide the whole dataset into $K$ temporal-sensitive subsets $\mathcal{D}=\{\mathcal{D}_1,\mathcal{D}_2,\dots,\mathcal{D}_K\}$. Fig. 2 shows some examples, where each subset $\mathcal{D}_k$ consists of questions within a specific time range $[t_k^{start},t_k^{end}]$. We keep the original context unchanged and generate questions based on it, then assign them to subsets with non-overlapping time ranges. For example, given a long context “Introduction of Barack Hussein Obama”, which ranges from 1961 to 2017, we generate a series of related questions, such as “What position did Barack Hussein Obama take in 1963?”, “What position was held by Barack Hussein Obama in 1995?”, “Barack Hussein Obama took which position in 2010?”, then put them into different subsets based on time periods. Besides the explicit questions, whose answers can be directly extracted from the context, we also generate the more challenging implicit questions, whose answers cannot be directly obtained and require the model to reason over implicit temporal relations.
For example, given the context “Barack Hussein Obama won re-election in the 2012 presidential election”, the answer to the question “Who is the President of the United States in 2014” should be “Barack Hussein Obama”.
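
As a rough illustration of the assignment step, the sketch below routes a question to a subset according to the year it mentions, using the year boundaries listed in Table I. The function name `subset_index` and the regular expression for spotting a four-digit year are assumptions made for this example; the actual pipeline may identify temporal expressions differently.

```python
import bisect
import re

# Inclusive upper bounds of the first four subsets, taken from Table I;
# any later year falls into the fifth subset.
UPPER_BOUNDS = [1939, 1976, 1998, 2009]

# Assumption: the temporal information is a single four-digit year.
YEAR_PATTERN = re.compile(r"\b(\d{4})\b")


def subset_index(question: str):
    """Return the 1-based subset index for the first year in the question, or None."""
    match = YEAR_PATTERN.search(question)
    if match is None:
        return None
    year = int(match.group(1))
    # bisect_left counts how many upper bounds are strictly below the year.
    return bisect.bisect_left(UPPER_BOUNDS, year) + 1


questions = [
    "What position did Barack Hussein Obama take in 1963?",
    "What position was held by Barack Hussein Obama in 1995?",
    "Barack Hussein Obama took which position in 2010?",
]
assignment = {q: subset_index(q) for q in questions}
```

With these boundaries, the three example questions land in the second, third and fifth subsets respectively.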

TABLE I: The statistics of CLTSQA-Data divided by subsets & question types.

                            Train     Dev    Test
Subset1 (190-1939)           7091    1562    1455
Subset2 (1940-1976)          6957    1405    1531
Subset3 (1977-1998)          6962    1493    1494
Subset4 (1999-2009)          7216    1415    1584
Subset5 (2010-now)           6788    1549    1344
Easy Reasoning               4068     909     880
Common Sense                 3252     730     728
Multi-descriptions Join      6128    1412    1260
Multi-paragraphs Join       15265    3097    3211
Unanswerable                 6301    1276    1329
Total                       35014    7424    7408

Figure 3: An overview of the CLTSQA task with our framework. The top panel illustrates the sequential learning of different subsets. The bottom panel represents our approach of loading the pre-trained weights of the previous model for the next model, while incorporating temporal memory replay and temporal contrastive learning.

Table I shows the statistics of the CLTSQA-Data dataset. Our dataset contains a total of 50,000 questions and 5,000 contexts. We construct $K=5$ subsets, which cover time spans of varying length to ensure that they contain a similar amount of data. The questions can be divided into 5 types:

  • Easy reasoning, where the temporal information in the question is explicitly specified in the context.

  • Joining commonsense, which requires the model to understand temporal commonsense knowledge, for example that 2010 is included within 2008-2017.

  • Joining multiple descriptions, which requires the model to reason over multiple descriptions within the same paragraph of the context.

  • Joining multiple paragraphs, which is a multi-paragraph extension of joining multiple descriptions: the model is required to reason over the context across multiple paragraphs. This type is not limited to adjoining paragraphs; it also extends to cases where significant temporal gaps exist between the paragraphs that must be integrated. Consider the introductory passage about Giorgos Dedes, where the initial paragraph gives his birth year as 1943 and subsequent paragraphs narrate his life at ages 30 and 40. Without incorporating contextual information from the earlier paragraphs, it is difficult to answer questions such as “Which team did Giorgos Dedes play for in 1973/1983?”. This underscores the importance of integrating old and new text, and the importance of continual learning.

  • Unanswerable, where the answer could not be found or reasoned from the context. According to the description in a context, “Barack Hussein Obama was born in August 1961”, we cannot answer the question “What position did Barack Hussein Obama hold in 1960?”.
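
To make the "joining commonsense" and "unanswerable" types concrete, the following toy sketch treats a context as a list of time-scoped facts. The fact list, the placeholder answer string and the function name `answer_for_year` are hypothetical; real contexts are free text, and the model has to infer such intervals itself.

```python
def answer_for_year(facts, year):
    """Return the answer whose [start, end] interval contains the year, else None.

    facts: list of (start_year, end_year, answer) tuples.
    """
    for start, end, answer in facts:
        if start <= year <= end:
            return answer
    # No interval covers the year: the question is unanswerable from this context.
    return None


# Hypothetical time-scoped fact with a placeholder answer.
facts = [(2008, 2017, "position A")]

in_range = answer_for_year(facts, 2010)      # 2010 lies within 2008-2017
out_of_range = answer_for_year(facts, 1960)  # nothing in the context covers 1960
```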

Figure 4: Illustration of temporal contrastive learning, including generation process of contrastive and similar questions as well as model’s learning process.

V CLTSQA Framework

In this section, we propose a model-agnostic framework, CLTSQA-Framework, to address the aforementioned model-level challenge, thus helping an arbitrary model to learn the CLTSQA task. Fig. 3 gives an overview of our framework, which consists of two key features: 1) temporal memory replay, and 2) temporal contrastive learning.

Initialized with a pre-trained language model $M_0$, we follow the task setting in the Preliminaries section to sequentially train the model on different subsets, where $M_i$ denotes the model obtained by training $M_{i-1}$ (already trained on the previous subsets $\overline{\mathcal{D}}_{i-1}$) on the subset $\mathcal{D}_i$. The first key feature is temporal memory replay, which is inherited from continual learning and alleviates the forgetting problem during training on the new subset. Specifically, a portion of the data from the time periods preceding the new subset is stored, and then replayed during the learning process of the new subset. The second key feature is temporal contrastive learning, which aims at enhancing the model’s sensitivity to the temporal information within the questions. Specifically, it involves creating two additional questions based on the original question, and then combining a context with each of these questions to form three separate inputs for the model.

V-A Temporal Memory Replay

One of the key properties of the CLTSQA task is the continual learning process, which is always accompanied by the catastrophic forgetting problem: the model tends to “forget” old knowledge while ingesting new knowledge [23]. For temporal-sensitive questions in particular, after acquiring knowledge about a new question that shares a similar context with an old question except for the temporal information, the model might encounter difficulties when trying to answer the old question again. For example, the model might have trouble answering “Who is the president of the United States in 2009?” after learning the new knowledge about “Who is the president of the United States after 2020?”. Motivated by memory replay [24], which helps the model to remember old knowledge by retaining some old training data and reusing them in the subsequent training process, we propose a temporal memory replay strategy for dealing with catastrophic forgetting of the data from previous time periods. Specifically, as the choice of which data to retain plays a crucial role in temporal memory replay, we aim to prioritize the model’s attention towards data that are 1) easily learnable, for efficiently keeping previous knowledge, and 2) susceptible to distraction within the new dataset.

Take the model $M_{i-1}$ as an example, which has been sequentially trained on the previous subsets $\overline{\mathcal{D}}_{i-1}$, and will be trained on the current subset $\mathcal{D}_i$. 1) To better retain data from previous time periods, we remove the top $\mu$ of the hardest samples from the preceding subsets $\overline{\mathcal{D}}_{i-1}$, while retaining the easily learnable ones. This approach mitigates the challenge of data forgetting. Notably, the term “hard sample” describes a sample that received the lowest evaluation score among the previous subsets. 2) From a temporal perspective, we select a part ($\nu$) of the data from previous time periods that have the same context but different answers, and incorporate them into the new subset. By introducing these distractors, we aim to enhance the model’s robustness and its sensitivity to temporal information.
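
The two selection rules can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes that $\mu$ and $\nu$ are fractions in $[0,1]$, that each previous sample carries a per-sample evaluation `score`, and that distractors are drawn uniformly at random. The function name `select_replay_data` and the dictionary keys are hypothetical, and how the retained samples are mixed into the new subset is left open.

```python
import random


def select_replay_data(previous, current, mu, nu, seed=0):
    """Pick replay data from previous subsets for training on the current subset.

    previous: list of dicts with keys "context", "question", "answer", "score".
    current:  list of dicts with keys "context", "question", "answer".
    Returns (easy_samples, distractors).
    """
    # 1) Remove the top-mu fraction of the hardest (lowest-scoring) previous samples.
    ranked = sorted(previous, key=lambda s: s["score"])
    n_drop = int(mu * len(ranked))
    easy_samples = ranked[n_drop:]

    # 2) Distractors: previous samples whose context reappears in the current
    #    subset but whose answer differs from the answers seen there.
    answers_by_context = {}
    for sample in current:
        answers_by_context.setdefault(sample["context"], set()).add(sample["answer"])
    candidates = [
        s for s in previous
        if s["context"] in answers_by_context
        and s["answer"] not in answers_by_context[s["context"]]
    ]
    rng = random.Random(seed)
    distractors = rng.sample(candidates, int(nu * len(candidates)))
    return easy_samples, distractors


previous = [
    {"context": "c1", "question": "q in 1995?", "answer": "a1", "score": 0.1},
    {"context": "c1", "question": "q in 1997?", "answer": "a2", "score": 0.9},
    {"context": "c2", "question": "r in 1990?", "answer": "b1", "score": 0.5},
    {"context": "c3", "question": "s in 1992?", "answer": "d1", "score": 0.8},
]
current = [{"context": "c1", "question": "q in 2010?", "answer": "a3"}]
easy, distractors = select_replay_data(previous, current, mu=0.25, nu=1.0)
```

In this toy run the lowest-scoring sample is dropped from the retained set, and both earlier questions on context `c1` qualify as distractors because their answers differ from the one in the current subset.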

V-B Temporal Contrastive Learning

CLTSQA-Data generates multiple questions based on a single context, where the questions have identical content but vary in their temporal information and expression. To enhance the model’s sensitivity to temporal information in questions and acknowledge that differences in question expression do not affect the answer, the strategy of temporal contrastive learning is employed. Fig. 4 shows the strategy encompassing the generation procedure for contrasting and similar questions, along with the learning process employed by the model.

Generation of Contrastive and Similar Question.

We generate a contrastive question $q_{contrast}$ and a similar question $q_{similar}$ for the original question $q$ of each sample in the training dataset.

To create the contrastive question $q_{contrast}$, we simply substitute the temporal information in the original question with different temporal references while keeping everything else unchanged. For example, the contrastive question of the original question “What position did Barack Hussein Obama hold in 2010?” is “What position did Barack Hussein Obama hold in 1995?”. It should be emphasized that the answer to the contrastive question consistently differs from the answer to the original question, thereby ensuring their distinctiveness.

To generate a similar question $q_{similar}$, we maintain the temporal information while modifying the wording of the question. If alternative expressions of the original question are available in the CLTSQA-Data dataset $\mathcal{D}$, we substitute the expression of the original question with one of those alternatives. For example, the original question “What position did Barack Hussein Obama hold in 2010?” can be transformed into the similar question “Barack Hussein Obama took which position in 2010?”. If no other expression exists in the CLTSQA-Data dataset, we process the question with word segmentation and randomly rearrange the positions of the tokens in the question, excluding the temporal information. For example, the original question is “What position did Barack Hussein Obama hold in 2010?”, and its similar question is “position What Barack Hussein Obama did hold in 2010?”. The studies in [31] and [32] demonstrate that word order does not have a significant impact on model performance across various downstream tasks, including Question Answering (QA). Therefore, we employ the aforementioned approach to maintain consistency between similar questions and the original question.
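
A minimal sketch of the two generation rules is given below, covering year substitution and the token-shuffling fallback only. Several details are assumptions for illustration: the temporal information is taken to be a four-digit year together with the token immediately before it (e.g. “in 2010?”), tokens are split on whitespace, and the helper names `make_contrastive` and `make_similar` are hypothetical. The sketch does not check that the contrastive question actually has a different answer, which the procedure above requires.

```python
import random
import re

YEAR = re.compile(r"\b\d{4}\b")  # assumption: temporal info is a four-digit year


def make_contrastive(question: str, new_year: int) -> str:
    """Swap the year for a different one; everything else stays unchanged."""
    return YEAR.sub(str(new_year), question, count=1)


def make_similar(question: str, rng: random.Random) -> str:
    """Shuffle the non-temporal tokens while keeping the temporal tokens in place."""
    tokens = question.split()
    fixed = set()
    for i, token in enumerate(tokens):
        if YEAR.search(token):
            fixed.add(i)
            if i > 0:
                fixed.add(i - 1)  # assumption: keep the preceding preposition with the year
    movable = [i for i in range(len(tokens)) if i not in fixed]
    shuffled = [tokens[i] for i in movable]
    rng.shuffle(shuffled)
    result = list(tokens)
    for position, token in zip(movable, shuffled):
        result[position] = token
    return " ".join(result)


original = "What position did Barack Hussein Obama hold in 2010?"
contrastive = make_contrastive(original, 1995)
similar = make_similar(original, random.Random(0))
```

The similar question contains exactly the same tokens as the original and still ends in “in 2010?”, while the contrastive question differs only in the year.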

Temporal Contrastive Learning.

As Fig. 4 shows, we concatenate a context $c$ with the original question $q_{ori}$, the contrastive question $q_{con}$ and the similar question $q_{sim}$, respectively, to form the three inputs $\mathbf{x}=\{q_{ori},c\}$, $\mathbf{x}_{con}=\{q_{con},c\}$ and $\mathbf{x}_{sim}=\{q_{sim},c\}$ of the model. These inputs are passed through the model, obtaining three representations $a_{ori}$, $a_{con}$ and $a_{sim}$.

We first apply the TripletMarginLoss [25] function over $a_{ori}$, $a_{con}$ and $a_{sim}$ to obtain $L_{triple}$.

$$T(s,p,n)=\max\{d(s_i,p_i)-d(s_i,n_i)+\mathrm{margin},\,0\} \tag{1}$$

where

$$d(x,y)=\|x-y\|_p \tag{2}$$

and

$$L_{triple}=T(a_{ori},a_{sim},a_{con}) \tag{3}$$

Then $a_{ori}$ and $a_{sim}$ are processed by a linear layer to obtain the representations $\hat{a}_{ori}$ and $\hat{a}_{sim}$. We obtain the answer prediction loss $L_{predict}$ by applying the CrossEntropy function over the target label $a_{target}$ and the representation $\hat{a}_{ori}$. Likewise, we obtain the similar loss $L_{similar}$ by applying the CrossEntropy function over the target label $a_{target}$ and the representation $\hat{a}_{sim}$.

Finally, we combine $L_{predict}$, $L_{similar}$ and $L_{triple}$ as the final objective function:

$$Loss=\alpha L_{predict}+\beta L_{similar}+\gamma L_{triple} \tag{4}$$

where $\alpha>0$, $\beta>0$, $\gamma>0$ are weight factors.
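
Equations (1) to (4) can be traced with a small, dependency-free Python sketch. It is a stand-in for a tensor-library implementation, not the training code itself: representations are plain lists of floats, the linear layer is omitted, and the margin, the norm order $p=2$ and the weight values are illustrative assumptions rather than settings reported in the paper.

```python
import math


def p_norm_distance(x, y, p=2.0):
    """Eq. (2): d(x, y) = ||x - y||_p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def triplet_margin(anchor, positive, negative, margin=1.0, p=2.0):
    """Eq. (1): max{d(s, p) - d(s, n) + margin, 0}."""
    gap = p_norm_distance(anchor, positive, p) - p_norm_distance(anchor, negative, p)
    return max(gap + margin, 0.0)


def cross_entropy(logits, target):
    """Cross-entropy of a single example, using a numerically stable log-sum-exp."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def total_loss(l_predict, l_similar, l_triple, alpha=1.0, beta=1.0, gamma=1.0):
    """Eq. (4): weighted sum of the three losses."""
    return alpha * l_predict + beta * l_similar + gamma * l_triple


# Toy representations (assumptions): the similar question sits close to the
# original one, the contrastive question sits far away.
a_ori, a_sim, a_con = [0.0, 0.0], [0.0, 0.0], [3.0, 4.0]
l_triple = triplet_margin(a_ori, a_sim, a_con)        # Eq. (3): anchor, positive, negative
l_predict = cross_entropy([2.0, 0.0, 0.0], target=0)  # toy logits for the original question
l_similar = cross_entropy([1.5, 0.0, 0.0], target=0)  # toy logits for the similar question
loss = total_loss(l_predict, l_similar, l_triple)
```

Following Eq. (3), the similar question plays the role of the positive and the contrastive question the role of the negative, so the triplet term vanishes once the contrastive representation is at least one margin further from the original than the similar one.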


TABLE II: Results of models’ final performance after sequentially training on the 5 subsets. “FiD-CLTSQA” (“BigBird-CLTSQA”) and “FiD-baseline” (“BigBird-baseline”) denote the model trained with / without the proposed CLTSQA-Framework, respectively.

                  Subset1                     Subset2                     Subset3                     Subset4                     Subset5
                  Dev           Test          Dev           Test          Dev           Test          Dev           Test          Dev           Test
Model             EM     F1     EM     F1     EM     F1     EM     F1     EM     F1     EM     F1     EM     F1     EM     F1     EM     F1     EM     F1
FiD-Baseline      34.25  45.29  29.97  40.22  43.84  53.89  39.26  50.95  40.32  50.60  40.43  49.48  42.26  53.37  46.28  56.60  47.97  55.60  49.03  56.29
FiD-CLTSQA        42.45  52.01  39.31  49.55  47.97  57.95  47.49  57.76  47.96  57.05  48.06  56.84  46.08  55.95  49.12  59.16  49.71  57.48  49.03  57.06
BigBird-Baseline  31.24  40.55  29.48  38.81  35.16  45.14  35.66  44.68  26.59  36.04  32.46  40.85  35.76  43.58  37.94  46.48  41.58  48.08  41.74  50.16
BigBird-CLTSQA    35.21  43.54  33.81  41.49  42.63  51.57  42.91  50.64  38.25  45.53  42.24  50.22  39.93  47.90  39.02  46.83  43.77  49.72  44.72  50.42

TABLE III: Ablation results of model variants after sequentially training on the 5 subsets. “TMR” and “TCL” denote “Temporal Memory Replay” and “Temporal Contrastive Learning”, respectively.
Subset1 Subset2 Subset3 Subset4 Subset5
Dev Test Dev Test Dev Test Dev Test Dev Test
EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1
FiD-CLTSQA 42.45 52.01 39.31 49.55 47.97 57.95 47.49 57.76 47.96 57.05 48.06 56.84 46.08 55.95 49.12 59.16 49.71 57.48 49.03 57.06
w/o TCL 42.06 52.44 38.63 49.12 45.84 55.63 45.07 55.49 45.41 55.68 48.13 57.22 44.73 54.81 46.28 56.93 48.42 55.72 47.32 54.81
w/o TMR 15.94 19.09 17.59 20.99 17.86 21.53 19.33 23.43 17.95 22.82 18.94 22.14 42.83 52.65 43.81 53.40 48.93 56.91 49.48 57.40
FiD-Baseline 34.25 45.29 29.97 40.22 43.84 53.89 39.26 50.95 40.32 50.60 40.43 49.48 42.26 53.37 46.28 56.60 47.97 55.60 49.03 56.29
TABLE IV: Ablation results of model variants after sequentially training on the 5 subsets. “MR” and “TMR” denote “Memory Replay” and “Temporal Memory Replay”, respectively.
Subset1 Subset2 Subset3 Subset4 Subset5
EM F1 EM F1 EM F1 EM F1 EM F1
FiD-Baseline with MR 40.91 51.51 54.62 56.46 45.75 55.28 43.46 53.71 47.19 54.92
FiD-Baseline with TMR 42.06 52.44 45.84 55.63 45.41 55.68 44.73 54.81 48.42 55.72

VI Experiments

In this section, we conduct experiments on the CLTSQA task to answer the following three research questions: 1) whether the novel CLTSQA task poses new challenges to existing QA models; 2) whether our framework helps the models deal with the CLTSQA task; and 3) which part of our framework contributes more to the performance improvement.

Data

We conduct the experiments on the proposed CLTSQA-Data dataset. Specifically, we use $K = 5$ subsets, each of which consists of around 7,000 training questions, 1,500 validation questions and 1,500 testing questions. Table I shows the statistics of the subsets.

Model

As described in Sec. V, our framework is model-agnostic and can be applied to arbitrary QA models. We use the following two models as our baselines:

  • FiD [26], whose objective is to generate answers sequentially, token by token, in an auto-regressive manner. It has achieved impressive performance on Natural Questions [27] and TriviaQA [28].

  • BigBird [29], which introduces a sparse attention mechanism that enhances performance across various tasks involving extensive contextual information. This model focuses on extracting the answers from a given sequence and has achieved remarkable outcomes in question answering.

Training

We follow [26] and [29] to construct FiD and BigBird, and initialize both baselines with weights pre-trained on Natural Questions. For temporal memory replay, we set $\mu = 10\%$ and $\nu = 10\%$. For temporal contrastive learning, we set $\alpha : \beta : \gamma = 1 : 0.5 : 0.5$. We train the model sequentially on the 5 subsets. For each subset, we train the model for 8 epochs with a batch size of 1. The model is optimized using AdamW [30] with a learning rate of $5 \times 10^{-5}$.
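The training schedule above can be sketched as a loop over subsets and epochs. This is only an illustration under stated assumptions: `TrainConfig`, `continual_train` and the `train_epoch` callback are hypothetical names, and the callback stands in for one pass of the actual FiD or BigBird training code, which is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters reported in the Training paragraph."""
    num_subsets: int = 5
    epochs_per_subset: int = 8
    batch_size: int = 1
    learning_rate: float = 5e-5
    replay_mu: float = 0.10  # mu for temporal memory replay
    replay_nu: float = 0.10  # nu for temporal memory replay
    loss_weights: Tuple[float, float, float] = (1.0, 0.5, 0.5)  # alpha : beta : gamma


def continual_train(subsets: Sequence[list],
                    train_epoch: Callable[[int, int, list], None],
                    cfg: TrainConfig = TrainConfig()) -> List[Tuple[int, int]]:
    """Visit the subsets in temporal order and train several epochs on each."""
    schedule = []
    for stage, data in enumerate(subsets[:cfg.num_subsets], start=1):
        for epoch in range(1, cfg.epochs_per_subset + 1):
            train_epoch(stage, epoch, data)  # hypothetical: one pass over `data`
            schedule.append((stage, epoch))
    return schedule
```

With the default configuration, the callback is invoked 5 × 8 = 40 times, and the model never revisits an earlier subset except through whatever replay samples are mixed into `data`.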

Evaluation

After training on a subset, we evaluate the model on the testing set of this subset as well as all previous subsets. We use exact match (EM) and F1 score as the evaluation metrics.
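For concreteness, the two metrics and the evaluation schedule can be sketched as below. The answer normalization (lowercasing, removing punctuation and English articles) follows the common SQuAD-style convention, which is an assumption here because the text does not specify it; the function names are illustrative. An empty string stands for an unanswerable question.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, label: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(label))


def f1_score(prediction: str, label: str) -> float:
    """Token-level F1 between the normalized prediction and label."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(label).split()
    if not pred_tokens or not gold_tokens:
        # Unanswerable case: full credit only when both sides are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def subsets_to_evaluate(stage: int) -> List[int]:
    """After training on subset `stage`, evaluate on subsets 1..stage."""
    return list(range(1, stage + 1))
```

For instance, the prediction “University Hall” against the label “St Andrews University” shares one of two predicted tokens and one of three gold tokens, giving an F1 of 0.4 and an EM of 0.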

VII Results and Discussions

VII-A Main Results

Table II shows the models’ evaluation performance after sequential training on the five subsets. “FiD-CLTSQA” (“BigBird-CLTSQA”) and “FiD-Baseline” (“BigBird-Baseline”) denote the model trained with / without the proposed CLTSQA-Framework, respectively. The baselines, which are trained sequentially without temporal memory replay or temporal contrastive learning, exhibit poor performance. In particular, they perform worst when evaluated on Subset1, which has the greatest temporal distance from the most up-to-date subset (Subset5). These observations answer our first research question: current QA models face challenges when tackling the CLTSQA task.

The proposed CLTSQA-Framework clearly improves model performance, especially on the earlier subsets. Taking the earliest subset, Subset1, as an example, the BigBird model equipped with CLTSQA-Framework achieves a relative increase of 14.69% in EM and 6.91% in F1 on the test set (“BigBird-CLTSQA” vs. “BigBird-Baseline”). A larger improvement is observed for FiD, with a relative increase of 31.16% in EM and 23.20% in F1 (“FiD-CLTSQA” vs. “FiD-Baseline”). These observations answer our second research question: the proposed framework helps the models deal with the CLTSQA task.
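The percentages quoted for Subset1 are consistent with relative increases computed from the Subset1 test scores in Table II. The short check below reproduces that arithmetic; the helper name `relative_increase` is illustrative.

```python
def relative_increase(baseline: float, improved: float) -> float:
    """Relative increase of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Subset1 test scores (EM, F1) copied from Table II.
scores = {
    "BigBird-Baseline": (29.48, 38.81),
    "BigBird-CLTSQA": (33.81, 41.49),
    "FiD-Baseline": (29.97, 40.22),
    "FiD-CLTSQA": (39.31, 49.55),
}

for name in ("BigBird", "FiD"):
    base_em, base_f1 = scores[f"{name}-Baseline"]
    ours_em, ours_f1 = scores[f"{name}-CLTSQA"]
    print(f"{name}: EM +{relative_increase(base_em, ours_em):.2f}%, "
          f"F1 +{relative_increase(base_f1, ours_f1):.2f}%")
```

In absolute terms, these correspond to gains of 4.33 EM and 2.68 F1 points for BigBird, and 9.34 EM and 9.33 F1 points for FiD.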

The performance improvement can be attributed to the two strategies introduced by the proposed CLTSQA-Framework: 1) temporal memory replay, which helps the model retain old knowledge while ingesting the latest knowledge; and 2) temporal contrastive learning, which helps the model acquire representations that capture and distinguish the temporal information in the question, thus enhancing its ability to answer temporal-sensitive questions. To validate these strategies, Fig. 5 shows the testing performance of the “FiD-Baseline” and “FiD-CLTSQA” models at different training stages, where $M_i$ denotes the model after training on subset $\overline{\mathcal{D}}_i$. While “FiD-Baseline” suffers a performance drop on Subset 1, Subset 2 and Subset 3 as training progresses, “FiD-CLTSQA” retains its performance on those subsets throughout the training process, which validates the first strategy. The second strategy can be validated from two perspectives. First, beyond retaining performance, the model with CLTSQA-Framework even improves on Subset 1 as training progresses, which shows an enhanced ability to answer temporal-sensitive questions. Second, on the up-to-date subsets such as Subset 4 and Subset 5, where there is less need to retain old knowledge, the model with CLTSQA-Framework still obtains better performance. Table V gives some examples of answers generated by “FiD-Baseline” and “FiD-CLTSQA”.
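The retention behaviour on Subset 1 can also be read off the per-stage numbers reported in the appendix (Table IX for “FiD-Baseline” and Table XII for “FiD-CLTSQA”). The sketch below compares the Subset1 test EM right after the first stage with the value after the last stage; the helper name is illustrative and the measure is a simple difference, not a metric defined in the paper.

```python
# Subset1 test EM after each training stage M1..M5.
# Values copied from Table IX (FiD-Baseline) and Table XII (FiD-CLTSQA).
subset1_test_em = {
    "FiD-Baseline": [33.54, 35.33, 32.10, 31.89, 29.97],
    "FiD-CLTSQA": [36.22, 33.68, 37.18, 36.08, 39.31],
}


def change_since_first_stage(history):
    """Final score minus the score right after the subset was first learned."""
    return history[-1] - history[0]


for model, history in subset1_test_em.items():
    print(f"{model}: {change_since_first_stage(history):+.2f} EM from M1 to M5")
```

On these numbers, the baseline loses 3.57 EM points on Subset1 between $M_1$ and $M_5$, whereas the model trained with CLTSQA-Framework gains 3.09 points.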

TABLE V: Examples of answers generated by “FiD-Baseline” and “FiD-CLTSQA”.
the most up-to-date data (evaluated on $\mathcal{D}_5^{test}$)
context: He signed for South Coast Wolves after transferring from Sydney United ahead of the 2011 NSW Premier League season. Timpano left South Coast Wolves, signing for Dapto Dandaloo Fury ahead of their 2015 Illawarra Premier League campaign.
question: Which team did Jacob Timpano play for in 2013?
FiD-Baseline $M_5$: “Sydney United”
FiD-CLTSQA $M_5$: “South Coast Wolves”
label: “South Coast Wolves”
context: Praveen Kumar was initially with the Royal Challengers Bangalore until 2010. In the Indian Premier League he played for Kings XI Punjab from 2011 to 2013.
question: Which team did Praveen Kumar play for in 2010?
FiD-Baseline $M_5$: “ ” (unanswerable)
FiD-CLTSQA $M_5$: “Royal Challengers Bangalore”
label: “Royal Challengers Bangalore”
context: She was founder and chair of the Graduate Design Program at California College of the Arts (2006–2012).
question: What was the name of the employer Brenda Laurel work for in 2012?
FiD-Baseline $M_5$: “California College of the Arts”
FiD-CLTSQA $M_5$: “ ” (unanswerable)
label: “ ” (unanswerable)
previous data (evaluated on $\mathcal{D}_1^{test}$)
context: It is known that Vytautas himself knew and spoke in the Lithuanian language with Jogaila. Struggle for power 1377–1384.
question: What was the residence of Vytautas in 1384?
FiD-Baseline $M_1$: “ ” (unanswerable)
FiD-Baseline $M_5$: “Lithuania”
FiD-CLTSQA $M_1$: “ ” (unanswerable)
FiD-CLTSQA $M_5$: “ ” (unanswerable)
label: “ ” (unanswerable)
context: University Hall, the first residential hall for women students in Scotland, was founded at St Andrews University in 1895; Louisa Lumsden was appointed its first warden.
question: Which employer did Louisa Lumsden work for in 1895?
FiD-Baseline $M_1$: “St Andrews University”
FiD-Baseline $M_5$: “University Hall”
FiD-CLTSQA $M_1$: “St Andrews University”
FiD-CLTSQA $M_5$: “St Andrews University”
label: “St Andrews University”
context: He was appointed Lord Advocate in 1775. His name appears in the 1776 minute book of the Poker Club. 2nd Earl of Shelburne and Pitt, he entered the cabinet in 1791 as Secretary of State for the Home Department.
question: Which position did Henry Dundas, 1st Viscount Melville hold in 1776?
FiD-Baseline $M_1$: “Lord Advocate”
FiD-Baseline $M_5$: “ ” (unanswerable)
FiD-CLTSQA $M_1$: “Lord Advocate”
FiD-CLTSQA $M_5$: “Lord Advocate”
label: “Lord Advocate”
Figure 5: The testing performance of “FiD-Baseline” and “FiD-CLTSQA” models in different training stages. The proposed framework effectively helps the FiD model to retain the performance on old subsets throughout the training process.

VII-B Ablation Studies

VII-B1 The Contributions of TMR and TCL

In order to further investigate the contributions of the two strategies brought by CLTSQA-Framework, we conduct ablation studies by building two more model variants upon “FiD-CLTSQA”:

  • FiD-CLTSQA w/o TCL, which only applies temporal memory replay.

  • FiD-CLTSQA w/o TMR, which only applies temporal contrastive learning.

Table III shows the final evaluation results, where “FiD-CLTSQA w/o TCL w/o TMR” is the baseline model “FiD-Baseline”. The results answer our third research question: temporal memory replay effectively alleviates forgetting of previous knowledge and thus plays the more important role on the old subsets (“FiD-CLTSQA” vs. “FiD-CLTSQA w/o TMR”). In contrast, temporal contrastive learning brings a smaller improvement that holds on most subsets and metrics (“FiD-CLTSQA” vs. “FiD-CLTSQA w/o TCL”). Overall, the CLTSQA-Framework benefits from both components.
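To make the comparison concrete, the per-subset gap between the full model and each ablated variant can be computed directly from the test EM columns of Table III. This is a small illustrative calculation; `gap_to_full` is a hypothetical helper name.

```python
# Test EM on Subset1..Subset5, copied from Table III.
test_em = {
    "FiD-CLTSQA": [39.31, 47.49, 48.06, 49.12, 49.03],
    "w/o TCL": [38.63, 45.07, 48.13, 46.28, 47.32],
    "w/o TMR": [17.59, 19.33, 18.94, 43.81, 49.48],
}


def gap_to_full(variant):
    """Per-subset EM lost when a component is removed (positive: full model is better)."""
    full = test_em["FiD-CLTSQA"]
    return [round(f - a, 2) for f, a in zip(full, test_em[variant])]


for variant in ("w/o TCL", "w/o TMR"):
    print(variant, gap_to_full(variant))
```

Removing TMR costs more than 20 EM points on each of Subset1 to Subset3, while removing TCL costs at most about 3 points on any subset, which matches the relative importance described above.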

VII-B2 The Novelty of TMR

To highlight the novelty of temporal memory replay, we conduct a comparative experiment with two more model variants built upon “FiD-Baseline”:

  • FiD-Baseline with MR, which only applies plain memory replay: it selects 10% of the old knowledge from each previous subset and reuses it in the subsequent training process.

  • FiD-Baseline with TMR, which only applies the temporal memory replay described in Section V-A.

The experimental results in Table IV show that temporal memory replay outperforms plain memory replay on most subsets.
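The plain memory replay (MR) variant can be sketched as follows. The text only states that 10% of each previous subset is reused, so uniform random sampling is an assumption made for this illustration, and `build_replay_training_set` is a hypothetical name. The temporal selection used by TMR (Section V-A) is not reproduced here.

```python
import random
from typing import List, Sequence


def build_replay_training_set(previous_subsets: Sequence[List[dict]],
                              current_subset: List[dict],
                              ratio: float = 0.10,
                              seed: int = 0) -> List[dict]:
    """Current subset plus a random `ratio` sample from each previous subset."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    replayed = []
    for old in previous_subsets:
        k = int(len(old) * ratio)  # number of replayed questions (truncated)
        # Assumption: uniform sampling without replacement.
        replayed.extend(rng.sample(old, k))
    return list(current_subset) + replayed
```

With two earlier subsets of 100 questions each and a current subset of 50 questions, the resulting training set contains 50 + 10 + 10 = 70 questions.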

VIII Conclusion

In this study, we pioneered a novel task, Continual Learning for Temporal Sensitive Question Answering (CLTSQA). We first introduced a new dataset, CLTSQA-Data, to facilitate research in this area, followed by the introduction of a novel framework, CLTSQA-Framework, designed to assist models in handling temporally-sensitive QA in a continual learning context. Our experimental results revealed that while the CLTSQA task poses fresh challenges for existing models, the proposed framework effectively equips the model to overcome these hurdles, resulting in improved performance. We are confident that our contributions, encompassing both the dataset and the framework, will stimulate future research in this innovative direction. As we move forward, there is a need for further exploration of datasets and models to delve deeper into the complexities of CLTSQA.

References

  • [1] W. Chen, X. Wang, and W. Y. Wang, “A dataset for answering time-sensitive questions,” 2021.
  • [2] M. J. Zhang and E. Choi, “Situatedqa: Incorporating extra-linguistic contexts into qa,” 2021.
  • [3] J. Wang, A. Jatowt, and M. Yoshikawa, “Archivalqa: A large-scale benchmark dataset for open-domain question answering over historical news collections,” pp. 3025–3035, 2022.
  • [4] A. Liska, T. Kocisky, E. Gribovskaya, T. Terzi, E. Sezener, D. Agrawal, D. Cyprien De Masson, T. Scholtes, M. Zaheer, S. Young et al., “Streamingqa: A benchmark for adaptation to new knowledge over time in question answering models,” PMLR, pp. 13604–13622, 2022.
  • [5] B. Dhingra, J. R. Cole, J. M. Eisenschlos, D. Gillick, J. Eisenstein, and W. W. Cohen, “Time-aware language models as temporal knowledge bases,” pp. 257–273, 2022.
  • [6] D. Loureiro, F. Barbieri, L. Neves, L. E. Anke, and J. Camacho-Collados, “Timelms: Diachronic language models from twitter,” arXiv preprint arXiv:2202.03829, 2022.
  • [7] Y. Qin, J. Zhang, Y. Lin, Z. Liu, P. Li, M. Sun, and J. Zhou, “Elle: Efficient lifelong pre-training for emerging data,” arXiv preprint arXiv:2203.06311, 2022.
  • [8] Z. Jia, A. Abujabal, R. Saha Roy, J. Strötgen, and G. Weikum, “Tempquestions: A benchmark for temporal question answering,” in Companion Proceedings of the The Web Conference 2018, 2018, pp. 1057–1062.
  • [9] S. Min, J. Michael, H. Hajishirzi, and L. Zettlemoyer, “Ambigqa: Answering ambiguous open-domain questions,” arXiv preprint arXiv:2004.10645, 2020.
  • [10] Q. Ning, H. Wu, R. Han, N. Peng, M. Gardner, and D. Roth, “Torque: A reading comprehension dataset of temporal ordering questions,” arXiv preprint arXiv:2005.00242, 2020.
  • [11] C. Shang, P. Qi, G. Wang, J. Huang, Y. Wu, and B. Zhou, “Open temporal relation extraction for question answering,” in 3rd Conference on Automated Knowledge Base Construction, 2021.
  • [12] R. Han, X. Ren, and N. Peng, “Econet: effective continual pretraining of language models for event temporal reasoning,” arXiv preprint arXiv:2012.15283, 2020.
  • [13] C. Shang, G. Wang, P. Qi, and J. Huang, “Improving time sensitivity for question answering over temporal knowledge graphs,” arXiv preprint arXiv:2203.00255, 2022.
  • [14] M. Biesialska, K. Biesialska, and M. R. Costa-Jussa, “Continual lifelong learning in natural language processing: A survey,” arXiv preprint arXiv:2012.09823, 2020.
  • [15] Z. Ke and B. Liu, “Continual learning of natural language processing tasks: A survey,” arXiv preprint arXiv:2211.12701, 2022.
  • [16] J. Jang, S. Ye, S. Yang, J. Shin, J. Han, G. Kim, S. J. Choi, and M. Seo, “Towards continual knowledge learning of language models,” arXiv preprint arXiv:2110.03215, 2021.
  • [17] J. Jang, S. Ye, C. Lee, S. Yang, J. Shin, J. Han, G. Kim, and M. Seo, “Temporalwiki: A lifelong benchmark for training and evaluating ever-evolving language models,” arXiv preprint arXiv:2204.14211, 2022.
  • [18] S. Chen, Y. Hou, Y. Cui, W. Che, T. Liu, and X. Yu, “Recall and learn: Fine-tuning deep pretrained language models with less forgetting,” arXiv preprint arXiv:2004.12651, 2020.
  • [19] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
  • [20] T. He, J. Liu, K. Cho, M. Ott, B. Liu, J. Glass, and F. Peng, “Analyzing the forgetting problem in pretrain-finetuning of open-domain dialogue response models,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 1121–1133.
  • [21] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
  • [22] R. Wang, D. Tang, N. Duan, Z. Wei, X. Huang, G. Cao, D. Jiang, M. Zhou et al., “K-adapter: Infusing knowledge into pre-trained models with adapters,” arXiv preprint arXiv:2002.01808, 2020.
  • [23] M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” in Psychology of learning and motivation. Elsevier, 1989, vol. 24, pp. 109–165.
  • [24] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl: Incremental classifier and representation learning,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 2001–2010.
  • [25] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk, “Learning local feature descriptors with triplets and shallow convolutional neural networks,” in BMVC, vol. 1, no. 2, 2016, p. 3.
  • [26] G. Izacard and E. Grave, “Leveraging passage retrieval with generative models for open domain question answering,” arXiv preprint arXiv:2007.01282, 2020.
  • [27] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee et al., “Natural questions: a benchmark for question answering research,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 453–466, 2019.
  • [28] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,” arXiv preprint arXiv:1705.03551, 2017.
  • [29] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang et al., “Big bird: Transformers for longer sequences,” Advances in neural information processing systems, vol. 33, pp. 17283–17297, 2020.
  • [30] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [31] K. Sinha, P. Parthasarathi, J. Pineau, and A. Williams, “Unnatural language inference,” arXiv preprint arXiv:2101.00010, 2020.
  • [32] K. Sinha, R. Jia, D. Hupkes, J. Pineau, A. Williams, and D. Kiela, “Masked language modeling and the distributional hypothesis: Order word matters pre-training for little,” arXiv preprint arXiv:2104.06644, 2021.

Appendix A CLTSQA-Data Statistics

Distribution of Question Types in CLTSQA-Data

We investigated the various question types present in our dataset, which encompassed Easy Reasoning, Joining Commonsense, Joining Multiple Descriptions, Joining Multiple Paragraphs, and Unanswerable. Furthermore, we calculated the distribution of these question types within the entire dataset as Fig. 6 shows.

Figure 6: Distribution of question types within the entire dataset.
Examples in Question Types

Table VIII presents examples of the five question types in CLTSQA-Data, each with its context, question, and answer.

Appendix B Ablation Study on Temporal Memory Replay

We investigated the performance of temporal memory replay with and without the step of removing hard samples. We assess model $M_5$ on $\overline{\mathcal{D}}_5^{dev}$. Fig. 7 shows that temporal memory replay with the hard-sample removal step performs better.

Figure 7: Ablation study on temporal memory replay with / without the step of removing hard samples.
Experimental Parameters

The parameter settings for the two models, FiD and BigBird, used in the experiment are illustrated in Table VI and Table VII, respectively.

TABLE VI: FiD Model Parameters.
Parameters FiD
max_query_length 36
max_sequence_length 4096
max_answer_length 60
max_text_length 180
learning_rate 5e-5
adam_epsilon 1e-8
pre_gpu_train_batch_size 1
pre_gpu_eval_batch_size 1
n_gpu 4
num_train_epochs 8
TABLE VII: BigBird Model Parameters.
Parameters BigBird
max_query_length 36
max_sequence_length 3600
doc_stride 2048
learning_rate 5e-5
adam_epsilon 1e-8
pre_gpu_train_batch_size 1
pre_gpu_eval_batch_size 1
n_gpu 4
num_train_epochs 8
TABLE VIII: Examples of question types in CLTSQA-Data.
Easy Reasoning
context: … Benedek Jávor, a proponent of the agreement, resigned from his position of parliamentary group leader, and Bernadett Szél were elected co-presidents of the LMP during the partys congress on 24 March 2013 …
question: Who was the head of LMP – Hungary’s Green Party in 2013?
label: “Bernadett Szél”
context: … University Hall, the first residential hall for women students in Scotland, was founded at St Andrews University in 1895; Louisa Lumsden was appointed its first warden …
question: Which employer did Louisa Lumsden work for in 1895?
label: “St Andrews University”
Joining Commonsense
context: He was purchased by the Kolkata Knight Riders at the 2011 IPL auctions for the next 3 years.
question: Which team did the player Eoin Morgan belong to in 2012?
label: “Kolkata Knight Riders”
context: He was Professor of Ancient History at the University of St Andrews from 1998 to 2014.
question: Greg Woolf was an employee for whom in 2010?
label: “University of St Andrews”
Joining Multiple Descriptions
context: In April 2014, Pohjanpalo renewed his contract with HJK, extending it to 2018. At the same time HJK extended his loan a further two years, which Pohjanpalo spent on loan at Fortuna Düsseldorf.
question: Which team did Joel Pohjanpalo play for in 2015?
label: “Fortuna Düsseldorf”
context: Tavares became CEO of Groupe PSA in 2014. Until January 16, 2021, he became the first chief executive officer of the multinational automobile group Stellantis.
question: Carlos Tavares was an employee from whom in 2015?
label: “Groupe PSA”
Joining Multiple Paragraphs
context: (paragraphs 1) … He has been chair of the Labour Party in the House of Representatives since 10 June 2010 … (paragraphs 2) On 20 February 2012, he resigned as leader of the Labour Party, …
question: What position did Job Cohen take in 2011?
label: “Leader of the Labour Party”
context: (paragraphs 1) … He was appointed Lord Advocate in 1775. His name appears in the 1776 minute book of the Poker Club… (paragraphs 2) 2nd Earl of Shelburne and Pitt, he entered the cabinet in 1791 as Secretary of State for the Home Department.
question: Which position did Henry Dundas, 1st Viscount Melville hold in 1776?
label: “Lord Advocate”
Unanswerable
context: … In February 2016, she shared first place with Anastasia Bodnaruk and Soumya Swaminathan in the women’s event of the Moscow Open, finishing third on tiebreak. In 2017 she competed again in the World Youth U16 Olympiad for Russia and her team won the gold medal …
question: Which title was conferred to Alexandra Obolentseva in 2017?
label: “ ”
context: … The P class were later re-allocated to shunting and station pilot duties. All eight locomotives passed into Southern Railway ownership at The Grouping in 1923 …
question: What operated SECR P class in 1921?
label: “ ”
Experimental Results

Tables IX, X, XI and XII show the per-stage performance of FiD without CLTSQA-Framework, FiD with Temporal Memory Replay, FiD with Temporal Contrastive Learning, and FiD with CLTSQA-Framework, respectively. Each model $M_i$ is assessed on $\overline{\mathcal{D}}_i^{dev}$ and $\overline{\mathcal{D}}_i^{test}$.

TABLE IX: FiD-Baseline
Subset1 Subset2 Subset3 Subset4 Subset5
Dev Test Dev Test Dev Test Dev Test Dev Test
EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1
$M_1$ 39.76 50.66 33.54 46.04
$M_2$ 39.18 49.52 35.33 45.10 45.69 55.26 44.94 55.57
$M_3$ 37.39 47.77 32.10 42.99 45.27 55.46 43.31 54.77 46.01 55.46 48.13 56.71
$M_4$ 36.04 47.64 31.89 43.42 43.27 54.13 41.48 54.24 42.80 53.47 45.72 54.47 46.22 57.02 48.61 58.62
$M_5$ 34.25 45.29 29.97 40.22 43.84 53.89 39.26 50.95 40.32 50.60 40.43 49.48 42.26 53.37 46.28 56.60 47.97 55.60 49.03 56.29
TABLE X: Temporal Memory Replay
Subset1 Subset2 Subset3 Subset4 Subset5
Dev Test Dev Test Dev Test Dev Test Dev Test
EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1
$M_1$ 39.76 50.66 33.54 46.04
$M_2$ 40.72 50.92 35.33 46.42 44.41 53.68 45.85 56.06
$M_3$ 40.08 50.30 35.74 47.16 44.84 54.70 44.02 55.11 44.68 54.81 46.72 55.93
$M_4$ 40.91 50.38 37.66 47.87 46.05 55.01 44.74 55.21 47.22 57.00 50.87 59.72 43.25 53.98 49.57 59.35
$M_5$ 42.06 52.44 38.63 49.12 45.84 55.63 45.07 55.49 45.41 55.68 48.13 57.22 44.73 54.81 46.28 56.93 48.42 55.72 47.32 54.81
TABLE XI: Temporal Contrastive Learning
Subset1 Subset2 Subset3 Subset4 Subset5
Dev Test Dev Test Dev Test Dev Test Dev Test
EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1
$M_1$ 40.65 51.21 36.22 47.59
$M_2$ 33.29 43.64 30.03 40.26 46.76 58.10 44.35 55.79
$M_3$ 28.62 39.09 24.74 34.47 47.12 58.07 44.09 55.96 48.09 57.73 49.93 58.57
$M_4$ 26.12 36.58 21.31 32.22 41.35 53.97 41.54 53.27 46.01 56.86 47.19 57.22 45.86 56.61 49.68 60.02
$M_5$ 15.94 19.09 17.59 20.99 17.86 21.53 19.33 23.43 17.95 22.82 18.94 22.14 42.83 52.65 43.81 53.40 48.93 56.91 49.48 57.40
TABLE XII: FiD with CLTSQA-Framework
Subset1 Subset2 Subset3 Subset4 Subset5
Dev Test Dev Test Dev Test Dev Test Dev Test
EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1 EM F1
$M_1$ 40.65 51.21 36.22 47.59
$M_2$ 40.72 50.95 33.68 45.72 48.11 57.57 46.83 56.72
$M_3$ 42.64 52.14 37.18 48.04 48.11 57.82 48.01 57.99 47.76 56.43 49.33 57.72
$M_4$ 44.43 53.85 36.08 46.85 48.90 59.03 47.03 58.81 50.77 59.83 50.28 59.68 46.72 57.17 49.74 59.29
$M_5$ 42.45 52.01 39.31 49.55 47.97 57.95 47.49 57.76 47.96 57.05 48.06 56.84 46.08 55.95 49.12 59.16 49.71 57.48 49.03 57.06