Continual Learning for
Temporal-Sensitive Question Answering

Wanqi Yang, Yunqiu Xu, Yanda Li, Kunze Wang, Binbin Huang, Ling Chen

Wanqi Yang, Yunqiu Xu, Yanda Li and Ling Chen are with University of Technology Sydney, Sydney, 2007, Australia. Kunze Wang is with University of Sydney, Sydney, 2050, Australia. Binbin Huang is with Hangzhou Dianzi University, Hangzhou, 310018, China.

Abstract

In this study, we explore an emerging research area of Continual Learning for Temporal Sensitive Question Answering (CLTSQA). Previous research has primarily focused on Temporal Sensitive Question Answering (TSQA), often overlooking the unpredictable nature of future events. In real-world applications, it’s crucial for models to continually acquire knowledge over time, rather than relying on a static, complete dataset. Our paper investigates strategies that enable models to adapt to the ever-evolving information landscape, thereby addressing the challenges inherent in CLTSQA. To support our research, we first create a novel dataset, divided into five subsets, designed specifically for various stages of continual learning. We then propose a training framework for CLTSQA that integrates temporal memory replay and temporal contrastive learning. Our experimental results highlight two significant insights: First, the CLTSQA task introduces unique challenges for existing models. Second, our proposed framework effectively navigates these challenges, resulting in improved performance.

Index Terms:

continual learning, temporal-sensitive question, question answering

I Introduction

A temporal-sensitive question is a question that involves temporal details, such that modifying the temporal information in the question results in a different answer [1]. Take the question “What was the role of Barack Hussein Obama in YEAR?” as an example. If YEAR = 2006, the answer should be “Federal Senator”; whereas if YEAR = 2016, the answer should be “President of the United States”. In everyday life, we frequently encounter questions influenced by time, with answers that can change as new events occur. This unpredictability highlights the need for a novel task called Continual Learning for Temporal Sensitive Question Answering (CLTSQA), which requires a temporal-sensitive question answering model to be learned continually as time progresses.

Although some work has been conducted in related areas, two key challenges of CLTSQA have been overlooked: the absence of a suitable dataset, and the scarcity of effective methods for continually handling temporal-sensitive questions. Regarding the first challenge, some existing works, e.g., [1, 2, 5, 4, 3], proposed new datasets for Temporal-sensitive Question Answering (TSQA) to explore a model’s sensitivity to temporal information and its ability to reason over it, but they follow the setting of traditional question answering. As shown in Fig. 1, TSQA assumes that the entire dataset is available for training the model. It lacks the ability to continuously incorporate updated and new data, which could alter the answer to a question as time progresses. Regarding the second challenge, many works have been proposed to retain a model’s performance on an evolving dataset through continual learning. For example, [6] studied continual learning for a single domain (Twitter data from 2018 to 2019), and [7] worked on efficient lifelong pre-training on emerging data in multiple domains. However, to our knowledge, no existing study applies continual learning specifically to CLTSQA.

Figure 1: The difference of training process between TSQA and CLTSQA. While TSQA assumes the availability of the whole training dataset, CLTSQA requires the model to keep ingesting up-to-date new knowledge.

The objective of the Continual Learning for Temporal Sensitive Question Answering (CLTSQA) task is to simulate a real-world scenario where updates and new knowledge cannot be learned all at once but require continual learning. The CLTSQA task examines both how much a model forgets knowledge from earlier time periods and how well it acquires updated and new knowledge over time. To deal with the absence of an available dataset, we construct a new dataset that includes subsets of temporal-sensitive questions, thereby facilitating the study of CLTSQA. Then, to make the model capable of effectively handling temporal-sensitive questions in a continual fashion, we propose a novel framework featuring 1) temporal memory replay to alleviate the catastrophic forgetting of past knowledge; and 2) temporal contrastive learning to enhance the model’s sensitivity to temporal information and boost its performance on questions with the most up-to-date information. The experimental results show that: 1) existing models struggle with this challenging task, resulting in poor performance; 2) our proposed framework can effectively help the models to address CLTSQA, demonstrating not only improvement in answering the most up-to-date questions, but also good performance retention when answering historical questions.

The main contributions of this work are summarised as:

• We propose a novel task called CLTSQA.
• We propose a new dataset to address the absence of an available dataset and facilitate the study of CLTSQA.
• We propose a novel framework featuring temporal memory replay and temporal contrastive learning to deal with the model-level challenge in CLTSQA.
• We report experimental findings indicating that: 1) CLTSQA is a challenging yet promising task, and 2) our framework assists the model in effectively addressing CLTSQA.

II Related Work

II-A Temporal-Sensitive Question Answering

Some previous studies have explored the task of Temporal-sensitive Question Answering by introducing new datasets. The TempQuestions dataset [8] provides a clear definition of what constitutes a “temporal question” and utilizes specific trigger words such as “before” and “after”. To investigate temporal questions, [9] observed that answers to a question can change over time and created a dataset with 13% temporal-sensitive data. [1], [2] and [3] also created new datasets, but with a primary focus on TSQA. By evaluating existing models on the proposed datasets, these works showed that answering temporal-sensitive questions is challenging, which serves as a motivation of our study. Different from them, we not only extend TSQA towards a more realistic and challenging task, CLTSQA, but also offer solutions to enhance model performance in tackling it.

In addition to the dataset, temporal-sensitive question learning requires the model to be sensitive to temporal information. Several studies have utilized pre-trained language models to aid in question comprehension. However, these models do not effectively distinguish between different temporal expressions found in free text [10, 12, 11, 5]. Inspired by the framework proposed in [13], our framework adopts temporal contrastive learning so that the model learns that the crucial factor is the variation in temporal information, rather than the specific wording of the question.

II-B Continual Learning

Numerous research efforts have been dedicated to the examination of continual learning for general QA [14, 15]. Through extensive exploration of the general question answering domain, researchers have discovered that temporal-related QA tasks pose greater challenges.

[4] proposed a dataset named StreamingQA, which aims to investigate models’ adaptation to changing knowledge. The dataset’s context spans the years 2007 to 2020, with questions that do not involve temporally sensitive information. The StreamingQA dataset employs a specific data format (question date, question, answer, document date, document), and the question date for each query is intentionally set by the author. However, datasets with additional fields and narrower timeframes do not inherently enhance the model’s robustness and generalizability. [16] designed a new continual learning task called continual knowledge learning (CKL). From a task-oriented perspective, the aim of CKL is to consistently enhance the internal knowledge of the language model through ongoing pre-training on new datasets. A noteworthy distinction is that CKL predominantly concentrates on enriching the internal knowledge within the pre-trained model, encompassing a broader domain. In contrast, CLTSQA places a stronger emphasis on a downstream task, wherein the model continuously learns and adapts to temporal-sensitive question answering. Moreover, some temporal-related QA datasets for continual learning were proposed in [6] and [17]. [6] extracted data from Twitter and divided the data into subsets of three months each for continual learning, while [17] employed the difference between consecutive snapshots of English Wikipedia and English Wikidata for both training and evaluation purposes. However, they simply used existing classical methods [19, 18, 22, 20, 21] that can alleviate catastrophic forgetting in continual learning, instead of proposing improvement strategies based on their datasets.

III Preliminaries

TSQA

The Temporal Sensitive Question Answering (TSQA) task aims to investigate the model’s sensitivity and reasoning capabilities concerning temporal information. In TSQA, the model is provided with a context $c$ (e.g., a document, or a series of sentences) and a question $q$ as the input. Then, the model is required to predict the answer $a$ either by extracting it from $c$ or by selecting one from a set of answer candidates. The specific task setup for TSQA involves training the model on an entire dataset. In order to answer temporal-sensitive questions, the model is required to not only pay specific attention to temporal information within the question, but also be capable of reasoning over the implicit temporal information within the context.

CLTSQA

The TSQA task assumes that the model is trained on a complete dataset; it therefore cannot continuously integrate updated or new data with temporal information. To relax this assumption, and thus bridge the gap between TSQA and real-world temporal-sensitive problems, we propose a new task, CLTSQA, which forces the model to learn and perform inference in a continual learning manner. The major difference between the two tasks lies in the dataset and training settings. Instead of assuming the availability of a whole dataset, in CLTSQA we require the model to stay aware of the latest knowledge while not forgetting old knowledge. The training data is divided into $K$ subsets $\mathcal{D}=\{\mathcal{D}_{1},\dots,\mathcal{D}_{K}\}$, with each subset covering time points that are chronologically earlier than those in the subsequent subset, i.e., $t_{\mathcal{D}_{k-1}}<t_{\mathcal{D}_{k}}$. Given an initial model $M_{0}$, it is trained on the subsets in order to obtain the corresponding models $M_{1},M_{2},...,M_{K}$, where $M_{k}$ denotes the model after training on $\mathcal{D}_{1},\mathcal{D}_{2},...,\mathcal{D}_{k}$. Each model loads the weights of its predecessor and continues training on the next subset. The model $M_{k}$ is required to perform well on the current subset $\mathcal{D}_{k}$ without suffering significant performance decay on the previous subsets $\mathcal{\overline{D}}_{k-1}$, where $\mathcal{\overline{D}}_{k-1}$ denotes $\mathcal{D}_{1},\dots,\mathcal{D}_{k-1}$.
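The sequential protocol described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `train_fn` and `eval_fn` are hypothetical callables standing in for model-specific training and evaluation routines.

```python
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
Subset = TypeVar("Subset")


def continual_train(
    model: Model,
    subsets: Sequence[Subset],
    train_fn: Callable[[Model, Subset], Model],  # hypothetical training routine
    eval_fn: Callable[[Model, Subset], float],   # hypothetical evaluation routine
) -> Tuple[Model, Dict[int, List[float]]]:
    """Train on D_1..D_K in order; after stage k, score M_k on D_1..D_k."""
    history: Dict[int, List[float]] = {}
    for k, subset in enumerate(subsets, start=1):
        # M_k starts from the weights of M_{k-1} and is trained only on D_k.
        model = train_fn(model, subset)
        # Evaluate on the current subset and on every earlier subset.
        history[k] = [eval_fn(model, past) for past in subsets[:k]]
    return model, history
```

A drop in the scores stored for an early subset as `k` grows is the forgetting that the task is designed to expose.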

IV CLTSQA Dataset

In this section, we introduce a new dataset - CLTSQA-Data, with the aim of addressing the aforementioned data-level challenge. Our dataset is built on the basis of TimeQA [1], which extracts time-evolving contexts from WikiData, and generates question-answer pairs from these contexts by some manual templates.

We chose a collection of 20,000 questions and 5,000 contexts sourced from TimeQA, and then generated additional context-specific temporal-sensitive questions. As a result, our dataset encompasses a total of 50,000 questions and 5,000 contexts. We then divide the whole dataset into $K$ temporal-sensitive subsets $\mathcal{D}=\{\mathcal{D}_{1},\mathcal{D}_{2},\dots,\mathcal{D}_{K}\}$. Fig. 2 shows some examples, where each subset $\mathcal{D}_{k}$ consists of questions within a specific time range $[t_{k}^{start},t_{k}^{end}]$. We keep the original context unchanged and generate questions based on it, then assign them to subsets with non-overlapping time ranges. For example, given a long context “Introduction of Barack Hussein Obama”, which ranges from 1961 to 2017, we generate a series of related questions, such as “What position did Barack Hussein Obama take in 1963?”, “What position was held by Barack Hussein Obama in 1995?”, “Barack Hussein Obama took which position in 2010?”, then put them into different subsets based on time periods. Besides the explicit questions, whose answers can be directly extracted from the context, we also generate more challenging implicit questions, whose answers cannot be directly obtained and require the model to reason over implicit temporal relations. For example, given the context “Barack Hussein Obama won re-election in the 2012 presidential election”, the answer to the question “Who is the President of the United States in 2014?” should be “Barack Hussein Obama”.
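The assignment of questions to time-based subsets can be illustrated as follows. This is a sketch only: the upper bounds are the subset boundaries listed in Table I, while extracting the year with a regular expression and skipping questions without an explicit year are assumptions made for illustration, since the construction pipeline itself is not specified here.

```python
import re
from bisect import bisect_left
from collections import defaultdict

# Inclusive upper bounds of Subset1..Subset4, taken from Table I;
# any later year falls into Subset5.
UPPER_BOUNDS = [1939, 1976, 1998, 2009]
# Assumption: the question mentions one explicit 3-4 digit year.
YEAR_PATTERN = re.compile(r"\b(\d{3,4})\b")


def subset_index(year: int) -> int:
    """Return the 1-based index of the subset whose time range contains `year`."""
    return bisect_left(UPPER_BOUNDS, year) + 1


def split_by_time(questions):
    """Group questions by the subset of the first year they mention."""
    subsets = defaultdict(list)
    for question in questions:
        match = YEAR_PATTERN.search(question)
        if match is None:
            continue  # questions without an explicit year are skipped in this sketch
        subsets[subset_index(int(match.group(1)))].append(question)
    return dict(subsets)
```

With these boundaries, the three example questions about 1963, 1995 and 2010 fall into Subset2, Subset3 and Subset5, respectively.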

TABLE I: The statistics of CLTSQA-Data divided by subsets & question types.

	Train	Dev	Test
Subset1 (190-1939)	7091	1562	1455
Subset2 (1940-1976)	6957	1405	1531
Subset3 (1977-1998)	6962	1493	1494
Subset4 (1999-2009)	7216	1415	1584
Subset5 (2010-now)	6788	1549	1344
Easy Reasoning	4068	909	880
Common Sense	3252	730	728
Multi-descriptions Join	6128	1412	1260
Multi-paragraphs Join	15265	3097	3211
Unanswerable	6301	1276	1329
Total	35014	7424	7408

Table I shows the statistics of the CLTSQA-Data dataset. Our dataset contains a total of 50,000 questions and 5,000 contexts. We construct $K=5$ subsets, which span time ranges of varying length so that they contain similar amounts of data. The questions can be divided into 5 types:

• Easy reasoning, where the temporal information in the question is explicitly specified in the context.
• Joining commonsense, which requires the model to understand temporal commonsense knowledge, e.g., that 2010 is included within 2008-2017.
• Joining multiple descriptions, which requires the model to reason over multiple descriptions within the same paragraph.
• Joining multiple paragraphs, which is a multi-paragraph extension of Joining multiple descriptions: the model is required to reason over the context across multiple paragraphs. This type is not limited to adjoining paragraphs; it also covers cases where significant temporal gaps exist between the paragraphs that must be integrated. For example, in the introductory passage about Giorgos Dedes, the initial paragraph gives his birth year as 1943, and subsequent paragraphs narrate his life at ages 30 and 40. Without incorporating the contextual information from the earlier paragraph, it is difficult to answer questions such as “Which team did Giorgos Dedes play for in 1973/1983?”. This underscores the importance of integrating old and new text, and of continual learning.
• Unanswerable, where the answer cannot be found in or inferred from the context. For example, given the context “Barack Hussein Obama was born in August 1961”, the question “What position did Barack Hussein Obama hold in 1960?” cannot be answered.
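The temporal commonsense behind the “Joining commonsense” and “Unanswerable” types amounts to an interval check, sketched below. The `Fact` records and the inclusive treatment of both endpoints are assumptions for illustration; the dataset may handle boundary years differently.

```python
from typing import List, Optional, Tuple

# (start_year, end_year, answer); illustrative records, not dataset entries.
Fact = Tuple[int, int, str]


def answer_at(facts: List[Fact], year: int) -> Optional[str]:
    """Return the answer whose interval contains `year`, or None if unanswerable."""
    for start, end, answer in facts:
        if start <= year <= end:  # assumption: both endpoints are inclusive
            return answer
    return None
```

A query for 2010 against an interval 2008-2017 returns the stored answer, whereas a query for a year outside every interval returns `None`, which corresponds to an unanswerable question.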

V CLTSQA Framework

In this section, we propose a model-agnostic framework, CLTSQA-Framework, to address the aforementioned model-level challenge, thus helping an arbitrary model to learn the CLTSQA task. Fig. 3 gives an overview of our framework, which consists of two key features: 1) temporal memory replay, and 2) temporal contrastive learning.

Initialized with a pre-trained language model $M_{0}$, we follow the task setting in the Preliminaries section to sequentially train the model on different subsets, where $M_{i}$ denotes the model obtained by training $M_{i-1}$ on the subset $\mathcal{D}_{i}$. The first key feature is temporal memory replay, which is inherited from continual learning to alleviate the forgetting problem during training on a new subset. Specifically, a portion of the data from the time periods preceding the new subset is stored, and then replayed during the learning process of the new subset. The second key feature is temporal contrastive learning, which aims at enhancing the model’s sensitivity to the temporal information within the questions. Specifically, it involves creating two additional questions based on the original question, and then combining the context with each of the three questions to form three separate inputs for the model.

V-A Temporal Memory Replay

One of the key properties of the CLTSQA task is the continual learning process, which is always accompanied by the catastrophic forgetting problem: the model tends to “forget” old knowledge while ingesting new knowledge [23]. For temporal-sensitive questions in particular, after acquiring knowledge about a new question that shares a similar context with an old question except for the temporal information, the model might have difficulty answering the old question again. For example, the model might struggle to answer “Who is the president of United States in 2009” after learning the new knowledge about “Who is the president of United States after 2020?”. Motivated by memory replay [24], which helps the model to remember old knowledge by retaining some old training data and reusing them in the subsequent training process, we propose a temporal memory replay strategy for dealing with catastrophic forgetting of data from previous time periods. Specifically, as the choice of which data to retain plays a crucial role in temporal memory replay, we prioritize data that are 1) easily learnable, so that previous knowledge is kept efficiently, and 2) susceptible to distraction by the new dataset.

Take the model $M_{i-1}$ as an example, which has been sequentially trained on the previous subsets $\mathcal{\overline{D}}_{i-1}$, and will be trained on the current subset $\mathcal{D}_{i}$. 1) To better retain data from previous time periods, we remove the top $\mu$ fraction of the hardest samples from the preceding subsets $\mathcal{\overline{D}}_{i-1}$, while retaining the easily learnable ones. This approach mitigates the challenge of data forgetting. Here, “hard samples” are the samples that received the lowest evaluation scores in the previous subsets. 2) From a temporal perspective, we select a portion ($\nu$) of the data from previous time periods that have the same context but different answers, and incorporate them into the new subset. By introducing these distractors, we aim to enhance the model’s robustness and its sensitivity to temporal information.
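The two selection rules can be sketched as follows. Several details are assumptions made for illustration rather than the authors' procedure: the `Sample` record and its fields are hypothetical, “different answers” is interpreted as differing from every answer the current subset gives for the same context, distractors are drawn from the retained easy samples, and $\nu$ is interpreted as a fraction of the previous data.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Sample:  # hypothetical record type
    context_id: str     # identifies the shared context
    question: str
    answer: str
    score: float = 0.0  # evaluation score of the previous model on this sample


def select_replay(
    previous: List[Sample], current: List[Sample], mu: float = 0.1, nu: float = 0.1
) -> Tuple[List[Sample], List[Sample]]:
    """Return (easy samples kept from earlier subsets, temporal distractors)."""
    # 1) Drop the top-mu fraction of the hardest (lowest-scoring) samples.
    ranked = sorted(previous, key=lambda s: s.score)
    easy_pool = ranked[int(len(ranked) * mu):]

    # 2) Collect the answers that the current subset gives for each context.
    current_answers: Dict[str, Set[str]] = {}
    for sample in current:
        current_answers.setdefault(sample.context_id, set()).add(sample.answer)

    # Distractors share a context with the current subset but answer differently.
    candidates = [
        s for s in easy_pool
        if s.context_id in current_answers
        and s.answer not in current_answers[s.context_id]
    ]
    budget = int(len(previous) * nu)  # assumption: nu is a fraction of previous data
    return easy_pool, candidates[:budget]
```

The default values `mu=0.1` and `nu=0.1` mirror the settings reported in the Training paragraph.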

V-B Temporal Contrastive Learning

CLTSQA-Data generates multiple questions based on a single context, where the questions have identical content but vary in their temporal information and expression. To enhance the model’s sensitivity to temporal information in questions and acknowledge that differences in question expression do not affect the answer, the strategy of temporal contrastive learning is employed. Fig. 4 shows the strategy encompassing the generation procedure for contrasting and similar questions, along with the learning process employed by the model.

Generation of Contrastive and Similar Question.

We generate a contrastive question $q_{contrast}$ and a similar question $q_{similar}$ for the original question $q$ of each sample in the training dataset.

To create the contrastive question $q_{contrast}$ , we simply substitute the temporal information in the original question with different temporal references while keeping everything else unchanged. For example, the contrastive question of the original question “What position did Barack Hussein Obama hold in 2010?” is “What position did Barack Hussein Obama hold in 1995?”. It should be emphasized that the answer to the contrastive question consistently differs from the answer to the original question, thereby ensuring their distinctiveness.

To generate a similar question $q_{similar}$, we maintain the temporal information while modifying the wording of the question. If alternative expressions of the original question are available in the CLTSQA-Data dataset $\mathcal{D}$, we substitute the expression of the original question with one of those alternatives. For example, the original question “What position did Barack Hussein Obama hold in 2010?” can be transformed into the similar question “Barack Hussein Obama took which position in 2010?”. If no other expression exists in the CLTSQA-Data dataset, we process the question with word segmentation and randomly rearrange the positions of the tokens in the question, excluding the temporal information. For example, if the original question is “What position did Barack Hussein Obama hold in 2010?”, its similar question may be “position What Barack Hussein Obama did hold in 2010?”. The studies conducted by [31] and [32] demonstrate that word order does not have a significant impact on model performance across various downstream tasks, including Question Answering (QA). Therefore, we employ the aforementioned approach to strive for consistency between similar questions and the original question.
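Both generation steps can be sketched with plain string operations. The sketch makes simplifying assumptions: the temporal information is a single explicit year, tokens are split on whitespace, and the token just before the year (typically a preposition such as “in”) is kept in place together with everything after it. Checking that the contrastive question really has a different answer requires the context and is not covered here.

```python
import random
import re

# Assumption: the temporal information is one explicit 3-4 digit year.
YEAR = re.compile(r"\b\d{3,4}\b")


def make_contrastive(question: str, new_year: int) -> str:
    """Replace the year in the question with a different year; keep the rest."""
    match = YEAR.search(question)
    if match is None or int(match.group()) == new_year:
        raise ValueError("question needs a year that differs from new_year")
    return question[:match.start()] + str(new_year) + question[match.end():]


def make_similar(question: str, rng: random.Random) -> str:
    """Shuffle the non-temporal tokens while keeping the temporal part fixed."""
    tokens = question.split()
    year_pos = next((i for i, tok in enumerate(tokens) if YEAR.search(tok)), None)
    if year_pos is None:
        return question  # nothing temporal to protect; leave the question as-is
    start = max(year_pos - 1, 0)  # also keep the token before the year in place
    head, tail = tokens[:start], tokens[start:]
    rng.shuffle(head)
    return " ".join(head + tail)
```

For the question “What position did Barack Hussein Obama hold in 2010?”, `make_contrastive(question, 1995)` yields the contrastive question used as the example above, and `make_similar` returns a reordering that still ends in “in 2010?”.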

Temporal Contrastive Learning.

As Fig. 4 shows, we concatenate the context $c$ with the original question ${q}_{ori}$ (i.e., $q$), the contrastive question ${q}_{con}$ (i.e., $q_{contrast}$) and the similar question ${q}_{sim}$ (i.e., $q_{similar}$), respectively, to form the three model inputs $\mathbf{x}=\{{q}_{ori},{c}\}$, $\mathbf{x}_{con}=\{{q}_{con},{c}\}$ and $\mathbf{x}_{sim}=\{{q}_{sim},{c}\}$. These inputs are passed through the model to obtain three representations ${a}_{ori}$, ${a}_{con}$ and ${a}_{sim}$.

We first apply TripletMarginLoss [25] function over $a_{ori}$ , $a_{con}$ and $a_{sim}$ to obtain $L_{triple}$ .

$$T(s,p,n)=\max\{d(s_{i},p_{i})-d(s_{i},n_{i})+\mathit{margin},\,0\}\tag{1}$$

where

$$d(x,y)=\lVert x-y\rVert_{p}\tag{2}$$

and

$$L_{triple}=T(a_{ori},a_{sim},a_{con})\tag{3}$$

Then ${a}_{ori}$ and ${a}_{sim}$ are processed by a linear layer to obtain representations $\hat{a}_{ori}$ and $\hat{a}_{sim}$. We obtain the answer prediction loss $L_{predict}$ by applying the CrossEntropy function over the target label $a_{target}$ and the representation $\hat{a}_{ori}$. Likewise, we obtain the similar loss $L_{similar}$ by applying the CrossEntropy function over the target label $a_{target}$ and the representation $\hat{a}_{sim}$.

Finally we combine $L_{predict}$ , $L_{similar}$ and $L_{triple}$ as the final objective function loss:

$$Loss=\alpha L_{predict}+\beta L_{similar}+\gamma L_{triple}\tag{4}$$

where $\alpha>0,\beta>0,\gamma>0$ are weight factors.
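Equations (1), (2) and (4) can be written out in plain Python for a single sample, as sketched below. The defaults `margin=1.0` and `p=2.0` are assumptions (they match the defaults of PyTorch's `TripletMarginLoss`, but the values used in the experiments are not stated), and the default weights follow the ratio $1:0.5:0.5$ reported in the Training paragraph.

```python
from typing import Sequence


def p_distance(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """Eq. (2): d(x, y) = ||x - y||_p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def triplet_margin(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumption: not specified in the text
    p: float = 2.0,       # assumption: not specified in the text
) -> float:
    """Eq. (1) for one sample: pull the positive closer than the negative."""
    gap = p_distance(anchor, positive, p) - p_distance(anchor, negative, p)
    return max(gap + margin, 0.0)


def total_loss(
    l_predict: float,
    l_similar: float,
    l_triple: float,
    alpha: float = 1.0,
    beta: float = 0.5,
    gamma: float = 0.5,
) -> float:
    """Eq. (4): weighted sum of the three loss terms."""
    return alpha * l_predict + beta * l_similar + gamma * l_triple
```

Following Eq. (3), the anchor is $a_{ori}$, the positive is $a_{sim}$ and the negative is $a_{con}$, so the triplet term is zero once the similar question lies closer to the original than the contrastive question by at least the margin.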

TABLE II: Results of models’ final performance after sequentially training on the 5 subsets. “FiD-CLTSQA” (“BigBird-CLTSQA”) and “FiD-baseline” (“BigBird-baseline”) denote the model trained with / without the proposed CLTSQA-Framework, respectively.

	Subset1				Subset2				Subset3				Subset4				Subset5
	Dev		Test		Dev		Test		Dev		Test		Dev		Test		Dev		Test
	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1
FiD-Baseline	34.25	45.29	29.97	40.22	43.84	53.89	39.26	50.95	40.32	50.60	40.43	49.48	42.26	53.37	46.28	56.60	47.97	55.60	49.03	56.29
FiD-CLTSQA	42.45	52.01	39.31	49.55	47.97	57.95	47.49	57.76	47.96	57.05	48.06	56.84	46.08	55.95	49.12	59.16	49.71	57.48	49.03	57.06
BigBird-Baseline	31.24	40.55	29.48	38.81	35.16	45.14	35.66	44.68	26.59	36.04	32.46	40.85	35.76	43.58	37.94	46.48	41.58	48.08	41.74	50.16
BigBird-CLTSQA	35.21	43.54	33.81	41.49	42.63	51.57	42.91	50.64	38.25	45.53	42.24	50.22	39.93	47.90	39.02	46.83	43.77	49.72	44.72	50.42

TABLE III: Ablation results of model variants after sequentially training on the 5 subsets. “TMR” and “TCL” denote “Temporal Memory Replay” and “Temporal Contrastive Learning”, respectively.

	Subset1				Subset2				Subset3				Subset4				Subset5
	Dev		Test		Dev		Test		Dev		Test		Dev		Test		Dev		Test
	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1
FiD-CLTSQA	42.45	52.01	39.31	49.55	47.97	57.95	47.49	57.76	47.96	57.05	48.06	56.84	46.08	55.95	49.12	59.16	49.71	57.48	49.03	57.06
w/o TCL	42.06	52.44	38.63	49.12	45.84	55.63	45.07	55.49	45.41	55.68	48.13	57.22	44.73	54.81	46.28	56.93	48.42	55.72	47.32	54.81
w/o TMR	15.94	19.09	17.59	20.99	17.86	21.53	19.33	23.43	17.95	22.82	18.94	22.14	42.83	52.65	43.81	53.40	48.93	56.91	49.48	57.40
FiD-Baseline	34.25	45.29	29.97	40.22	43.84	53.89	39.26	50.95	40.32	50.60	40.43	49.48	42.26	53.37	46.28	56.60	47.97	55.60	49.03	56.29

TABLE IV: Ablation results of model variants after sequentially training on the 5 subsets. “MR” and “TMR” denote “Memory Replay” and “Temporal Memory Replay”, respectively.

	Subset1		Subset2		Subset3		Subset4		Subset5
	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1
FiD-Baseline with MR	40.91	51.51	54.62	56.46	45.75	55.28	43.46	53.71	47.19	54.92
FiD-Baseline with TMR	42.06	52.44	45.84	55.63	45.41	55.68	44.73	54.81	48.42	55.72

VI Experiments

In this section, we conduct experiments for the CLTSQA task, and would like to answer the following three research questions: 1) whether the novel task CLTSQA poses new challenges to the existing QA models; 2) whether our framework helps the models to deal with the CLTSQA task; and 3) which part of our framework contributes more to the performance improvement.

Data

We conduct the experiment upon the proposed CLTSQA-Data dataset. Specifically, we use $K=5$ subsets, each of which consists of around 7,000 training questions, 1,500 validation questions and 1,500 testing questions. Table I shows the statistics of the subsets.

Model

As illustrated in Sec. V, our framework is model-agnostic and can be applied to arbitrary QA models. We use the following two models as our baselines:

•

FiD [26], whose objective is to generate answers sequentially, token by token, in an auto-regressive manner. It has achieved impressive performance on Natural Questions [27] and TriviaQA [28].
•

BigBird [29], which introduces a sparse attention mechanism that enhances performance across various tasks involving extensive contextual information. This model focuses on extracting the answers from a given sequence and has achieved remarkable outcomes in question answering.

Training

We follow [26] and [29] to construct FiD and BigBird, and initialize the baselines with weights pre-trained on Natural Questions. For temporal memory replay, we set $\mu=10\%$ and $\nu=10\%$. For temporal contrastive learning, we set $\alpha:\beta:\gamma=1:0.5:0.5$. During training, we sequentially train the model on the 5 subsets. For each subset, we train the model for 8 epochs with a batch size of 1. The model is optimized using AdamW [30] with a learning rate of $5\times10^{-5}$.

Evaluation

After training on a subset, we evaluate the model on the testing set of this subset as well as all previous subsets. We use exact match (EM) and F1 score as the evaluation metrics.
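For reference, the two metrics can be computed as sketched below. The answer normalization (lowercasing, removing punctuation and English articles) follows the convention of the SQuAD evaluation script; using that convention here is an assumption, since the exact normalization is not specified in the text.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # An empty answer stands for "unanswerable": credit only if both are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level EM and F1 are the averages of these per-question scores, usually reported as percentages.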

VII Results and Discussions

VII-A Main Results

Table II shows the models’ evaluation performance after sequentially training on the five subsets. “FiD-CLTSQA” (“BigBird-CLTSQA”) and “FiD-Baseline” (“BigBird-Baseline”) denote the model trained with / without the proposed CLTSQA-Framework, respectively. The baselines (“FiD-Baseline” and “BigBird-Baseline”), which are trained in a sequential manner but without utilizing the proposed framework (i.e., no temporal memory replay or temporal contrastive learning), exhibit poor performance. In particular, the baselines perform worst when evaluated on Subset1, which has the greatest temporal difference from the most up-to-date subset (Subset5). Such observations answer our first research question: the current QA models face challenges when tackling the CLTSQA task.

When it comes to the proposed CLTSQA-Framework, it is evident that this framework helps the models to obtain improved performance, especially on the “earlier” subsets. Taking the test set of the earliest subset, Subset1, as an example, when equipped with CLTSQA-Framework, the BigBird model demonstrates a 14.69% relative increase in EM and a 6.91% relative increase in F1 (“BigBird-CLTSQA” vs. “BigBird-Baseline”). A more significant performance improvement can be observed for FiD, which demonstrates a 31.16% relative increase in EM and a 23.20% relative increase in F1 (“FiD-CLTSQA” vs. “FiD-Baseline”). Such observations answer our second research question: the proposed framework helps the models to deal with the CLTSQA task.
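The percentages quoted above are relative gains over the baseline, computed from the Subset1 test columns of Table II. The short calculation below makes the arithmetic explicit; the function name `relative_gain` is only illustrative.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


# Subset1 test scores from Table II: (baseline, with CLTSQA-Framework).
subset1_test = {
    "BigBird EM": (29.48, 33.81),
    "BigBird F1": (38.81, 41.49),
    "FiD EM": (29.97, 39.31),
    "FiD F1": (40.22, 49.55),
}

gains = {
    name: round(relative_gain(base, ours), 2)
    for name, (base, ours) in subset1_test.items()
}
```

For instance, the BigBird EM gain is $(33.81-29.48)/29.48\approx 14.69\%$, which corresponds to an absolute gain of 4.33 points.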

The significant performance improvement can be attributed to two strategies introduced by the proposed CLTSQA-Framework: 1) temporal memory replay, which helps the model to retain old knowledge when ingesting the latest knowledge; and 2) temporal contrastive learning, which helps the model to acquire representations that capture and distinguish the temporal information present in the question, thus enhancing the model’s ability to answer temporal-sensitive questions. To validate these strategies, Fig. 5 shows the testing performance of the “FiD-Baseline” and “FiD-CLTSQA” models at different training stages, where $M_{i}$ denotes the model after training on the subsets $\mathcal{\overline{D}}_{i}$. It can be observed that while “FiD-Baseline” suffers a performance drop on Subset 1, Subset 2 and Subset 3 as training progresses, “FiD-CLTSQA” retains its performance on those subsets throughout the training process, validating the first strategy. The second strategy can be validated from two perspectives. Firstly, going beyond retaining performance, the model with CLTSQA-Framework can even improve its performance on Subset 1 as training progresses, showing an enhanced ability to answer temporal-sensitive questions. Secondly, on the up-to-date subsets such as Subset 4 and Subset 5, where there is less need to retain old knowledge, the model with CLTSQA-Framework still obtains better performance. Table V gives some examples of answers generated by “FiD-Baseline” and “FiD-CLTSQA”.

TABLE V: Examples of answers generated by “FiD-Baseline” and “FiD-CLTSQA”.

The most up-to-date data (evaluated on $\mathcal{D}_{5}^{test}$)

context: He signed for South Coast Wolves after transferring from Sydney United ahead of the 2011 NSW Premier League season. Timpano left South Coast Wolves, signing for Dapto Dandaloo Fury ahead of their 2015 Illawarra Premier League campaign.
question: Which team did Jacob Timpano play for in 2013?
FiD-Baseline $M_{5}$: “Sydney United”
FiD-CLTSQA $M_{5}$: “South Coast Wolves”
label: “South Coast Wolves”

context: Praveen Kumar was initially with the Royal Challengers Bangalore until 2010. In the Indian Premier League he played for Kings XI Punjab from 2011 to 2013.
question: Which team did Praveen Kumar play for in 2010?
FiD-Baseline $M_{5}$: “ ” (unanswerable)
FiD-CLTSQA $M_{5}$: “Royal Challengers Bangalore”
label: “Royal Challengers Bangalore”

context: She was founder and chair of the Graduate Design Program at California College of the Arts (2006–2012).
question: What was the name of the employer Brenda Laurel work for in 2012?
FiD-Baseline $M_{5}$: “California College of the Arts”
FiD-CLTSQA $M_{5}$: “ ” (unanswerable)
label: “ ” (unanswerable)

Previous data (evaluated on $\mathcal{D}_{1}^{test}$)

context: It is known that Vytautas himself knew and spoke in the Lithuanian language with Jogaila. Struggle for power 1377–1384.
question: What was the residence of Vytautas in 1384?
FiD-Baseline $M_{1}$: “ ” (unanswerable)
FiD-Baseline $M_{5}$: “Lithuania”
FiD-CLTSQA $M_{1}$: “ ” (unanswerable)
FiD-CLTSQA $M_{5}$: “ ” (unanswerable)
label: “ ” (unanswerable)

context: University Hall, the first residential hall for women students in Scotland, was founded at St Andrews University in 1895; Louisa Lumsden was appointed its first warden.
question: Which employer did Louisa Lumsden work for in 1895?
FiD-Baseline $M_{1}$: “St Andrews University”
FiD-Baseline $M_{5}$: “University Hall”
FiD-CLTSQA $M_{1}$: “St Andrews University”
FiD-CLTSQA $M_{5}$: “St Andrews University”
label: “St Andrews University”

context: He was appointed Lord Advocate in 1775. His name appears in the 1776 minute book of the Poker Club. 2nd Earl of Shelburne and Pitt, he entered the cabinet in 1791 as Secretary of State for the Home Department.
question: Which position did Henry Dundas, 1st Viscount Melville hold in 1776?
FiD-Baseline $M_{1}$: “Lord Advocate”
FiD-Baseline $M_{5}$: “ ” (unanswerable)
FiD-CLTSQA $M_{1}$: “Lord Advocate”
FiD-CLTSQA $M_{5}$: “Lord Advocate”
label: “Lord Advocate”

VII-B Ablation Studies

VII-B1 The Contributions of TMR and TCL

To further investigate the contributions of the two strategies in the CLTSQA-Framework, we conduct ablation studies with two more variants of “FiD-CLTSQA”:

• FiD-CLTSQA w/o TCL, which applies only temporal memory replay.
• FiD-CLTSQA w/o TMR, which applies only temporal contrastive learning.

Table III shows the final evaluation results, where “FiD-CLTSQA w/o TCL w/o TMR” is the baseline model “FiD-Baseline”. The results answer our third research question: temporal memory replay effectively alleviates forgetting of previous knowledge and thus plays a more important role on the old subsets (“FiD-CLTSQA” vs. “FiD-CLTSQA w/o TMR”). In contrast, temporal contrastive learning brings a less significant but consistent performance improvement across all subsets (“FiD-CLTSQA” vs. “FiD-CLTSQA w/o TCL”). Overall, the CLTSQA-Framework benefits from both components.

VII-B2 The Novelty of TMR

To highlight the novelty of temporal memory replay, we conduct a comparative experiment with two more variants of “FiD-Baseline”:

• FiD-Baseline with MR, which applies only plain memory replay: 10% of the old knowledge is selected from each previous subset and reused in the subsequent training process.
• FiD-Baseline with TMR, which applies only the temporal memory replay described in Section V-A.

The experimental results in Table IV show that our temporal memory replay method outperforms plain memory replay.
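The plain memory replay (MR) variant described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the 10% of each previous subset is drawn uniformly at random, and the function names `sample_replay_buffer` and `build_stage_training_set` are hypothetical. Temporal memory replay (Section V-A) uses a different selection procedure that is not reproduced here.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_replay_buffer(
    previous_subsets: Sequence[Sequence[T]],
    ratio: float = 0.10,
    seed: int = 0,
) -> List[T]:
    """Keep `ratio` of the examples from each previous subset.

    Uniform random selection is an assumption of this sketch; the text
    only states that 10% of each previous subset is selected.
    """
    rng = random.Random(seed)
    buffer: List[T] = []
    for subset in previous_subsets:
        k = round(len(subset) * ratio)
        buffer.extend(rng.sample(list(subset), k))
    return buffer


def build_stage_training_set(
    subsets: Sequence[Sequence[T]],
    stage: int,
    ratio: float = 0.10,
    seed: int = 0,
) -> List[T]:
    """Training data for stage `stage` (0-based): the current subset
    plus the examples replayed from all earlier subsets."""
    replay = sample_replay_buffer(subsets[:stage], ratio, seed)
    return list(subsets[stage]) + replay
```

With a fixed seed, the examples replayed from a given old subset are the same at every later stage, because the subsets are always sampled in the same order from a freshly seeded generator.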

VIII Conclusion

In this study, we pioneered a novel task, Continual Learning for Temporal Sensitive Question Answering (CLTSQA). We first introduced a new dataset, CLTSQA-Data, to facilitate research in this area, followed by the introduction of a novel framework, CLTSQA-Framework, designed to assist models in handling temporally-sensitive QA in a continual learning context. Our experimental results revealed that while the CLTSQA task poses fresh challenges for existing models, the proposed framework effectively equips the model to overcome these hurdles, resulting in improved performance. We are confident that our contributions, encompassing both the dataset and the framework, will stimulate future research in this innovative direction. As we move forward, there is a need for further exploration of datasets and models to delve deeper into the complexities of CLTSQA.

References

[1] W. Chen, X. Wang, and W. Y. Wang, “A dataset for answering time-sensitive questions,” 2021.
[2] M. J. Zhang and E. Choi, “Situatedqa: Incorporating extra-linguistic contexts into qa,” 2021.
[3] J. Wang, A. Jatowt, and M. Yoshikawa, “Archivalqa: A large-scale benchmark dataset for open-domain question answering over historical news collections,” pp. 3025–3035, 2022.
[4] A. Liska, T. Kocisky, E. Gribovskaya, T. Terzi, E. Sezener, D. Agrawal, D. Cyprien De Masson, T. Scholtes, M. Zaheer, S. Young et al., “Streamingqa: A benchmark for adaptation to new knowledge over time in question answering models,” PMLR, pp. 13 604–13 622, 2022.
[5] B. Dhingra, J. R. Cole, J. M. Eisenschlos, D. Gillick, J. Eisenstein, and W. W. Cohen, “Time-aware language models as temporal knowledge bases,” pp. 257–273, 2022.
[6] D. Loureiro, F. Barbieri, L. Neves, L. E. Anke, and J. Camacho-Collados, “Timelms: Diachronic language models from twitter,” arXiv preprint arXiv:2202.03829, 2022.
[7] Y. Qin, J. Zhang, Y. Lin, Z. Liu, P. Li, M. Sun, and J. Zhou, “Elle: Efficient lifelong pre-training for emerging data,” arXiv preprint arXiv:2203.06311, 2022.
[8] Z. Jia, A. Abujabal, R. Saha Roy, J. Strötgen, and G. Weikum, “Tempquestions: A benchmark for temporal question answering,” in Companion Proceedings of the The Web Conference 2018, 2018, pp. 1057–1062.
[9] S. Min, J. Michael, H. Hajishirzi, and L. Zettlemoyer, “Ambigqa: Answering ambiguous open-domain questions,” arXiv preprint arXiv:2004.10645, 2020.
[10] Q. Ning, H. Wu, R. Han, N. Peng, M. Gardner, and D. Roth, “Torque: A reading comprehension dataset of temporal ordering questions,” arXiv preprint arXiv:2005.00242, 2020.
[11] C. Shang, P. Qi, G. Wang, J. Huang, Y. Wu, and B. Zhou, “Open temporal relation extraction for question answering,” in 3rd Conference on Automated Knowledge Base Construction, 2021.
[12] R. Han, X. Ren, and N. Peng, “Econet: effective continual pretraining of language models for event temporal reasoning,” arXiv preprint arXiv:2012.15283, 2020.
[13] C. Shang, G. Wang, P. Qi, and J. Huang, “Improving time sensitivity for question answering over temporal knowledge graphs,” arXiv preprint arXiv:2203.00255, 2022.
[14] M. Biesialska, K. Biesialska, and M. R. Costa-Jussa, “Continual lifelong learning in natural language processing: A survey,” arXiv preprint arXiv:2012.09823, 2020.
[15] Z. Ke and B. Liu, “Continual learning of natural language processing tasks: A survey,” arXiv preprint arXiv:2211.12701, 2022.
[16] J. Jang, S. Ye, S. Yang, J. Shin, J. Han, G. Kim, S. J. Choi, and M. Seo, “Towards continual knowledge learning of language models,” arXiv preprint arXiv:2110.03215, 2021.
[17] J. Jang, S. Ye, C. Lee, S. Yang, J. Shin, J. Han, G. Kim, and M. Seo, “Temporalwiki: A lifelong benchmark for training and evaluating ever-evolving language models,” arXiv preprint arXiv:2204.14211, 2022.
[18] S. Chen, Y. Hou, Y. Cui, W. Che, T. Liu, and X. Yu, “Recall and learn: Fine-tuning deep pretrained language models with less forgetting,” arXiv preprint arXiv:2004.12651, 2020.
[19] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
[20] T. He, J. Liu, K. Cho, M. Ott, B. Liu, J. Glass, and F. Peng, “Analyzing the forgetting problem in pretrain-finetuning of open-domain dialogue response models,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 1121–1133.
[21] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
[22] R. Wang, D. Tang, N. Duan, Z. Wei, X. Huang, G. Cao, D. Jiang, M. Zhou et al., “K-adapter: Infusing knowledge into pre-trained models with adapters,” arXiv preprint arXiv:2002.01808, 2020.
[23] M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” in Psychology of learning and motivation. Elsevier, 1989, vol. 24, pp. 109–165.
[24] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl: Incremental classifier and representation learning,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 2001–2010.
[25] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk, “Learning local feature descriptors with triplets and shallow convolutional neural networks.” in Bmvc, vol. 1, no. 2, 2016, p. 3.
[26] G. Izacard and E. Grave, “Leveraging passage retrieval with generative models for open domain question answering,” arXiv preprint arXiv:2007.01282, 2020.
[27] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee et al., “Natural questions: a benchmark for question answering research,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 453–466, 2019.
[28] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,” arXiv preprint arXiv:1705.03551, 2017.
[29] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang et al., “Big bird: Transformers for longer sequences,” Advances in neural information processing systems, vol. 33, pp. 17 283–17 297, 2020.
[30] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[31] K. Sinha, P. Parthasarathi, J. Pineau, and A. Williams, “Unnatural language inference,” arXiv preprint arXiv:2101.00010, 2020.
[32] K. Sinha, R. Jia, D. Hupkes, J. Pineau, A. Williams, and D. Kiela, “Masked language modeling and the distributional hypothesis: Order word matters pre-training for little,” arXiv preprint arXiv:2104.06644, 2021.

Appendix A CLTSQA-Data Statistics

Distribution of Question Types in CLTSQA-Data

We investigated the various question types present in our dataset, which encompassed Easy Reasoning, Joining Commonsense, Joining Multiple Descriptions, Joining Multiple Paragraphs, and Unanswerable. Furthermore, we calculated the distribution of these question types within the entire dataset as Fig. 6 shows.

Examples in Question Types

Table VIII presents examples of the five question types in CLTSQA-Data, each with its context, question, and answer.

Appendix B Ablation Study on Temporal Memory Replay

We investigated the performance of temporal memory replay with and without the step of removing hard samples. We assess model $M_{5}$ on $\mathcal{\overline{D}}_{5}^{dev}$. Fig. 7 shows that temporal memory replay with the hard-sample removal step performs better.

Experimental Parameters

The parameter settings for the two models, FiD and BigBird, used in the experiment are illustrated in Table VI and Table VII, respectively.

TABLE VI: FiD Model Parameters.

Parameters	FiD
max_query_length	36
max_sequence_length	4096
max_answer_length	60
max_text_length	180
learning_rate	$5\times10^{-5}$
adam_epsilon	$1\times10^{-8}$
pre_gpu_train_batch_size	1
pre_gpu_eval_batch_size	1
n_gpu	4
num_train_epochs	8

TABLE VII: BigBird Model Parameters.

Parameters	BigBird
max_query_length	36
max_sequence_length	3600
doc_stride	2048
learning_rate	$5\times10^{-5}$
adam_epsilon	$1\times10^{-8}$
pre_gpu_train_batch_size	1
pre_gpu_eval_batch_size	1
n_gpu	4
num_train_epochs	8

TABLE VIII: Examples of question types in CLTSQA-Data.

Easy Reasoning
context:	… Benedek Jávor, a proponent of the agreement, resigned from his position of parliamentary group leader, and Bernadett Szél were elected co-presidents of the LMP during the partys congress on 24 March 2013 …
question:	Who was the head of LMP – Hungary’s Green Party in 2013?
label:	“Bernadett Szél”
context:	… University Hall, the first residential hall for women students in Scotland, was founded at St Andrews University in 1895; Louisa Lumsden was appointed its first warden …
question:	Which employer did Louisa Lumsden work for in 1895?
label:	“St Andrews University”
Joining Commonsense
context:	He was purchased by the Kolkata Knight Riders at the 2011 IPL auctions for the next 3 years.
question:	Which team did the player Eoin Morgan belong to in 2012?
label:	“Kolkata Knight Riders”
context:	He was Professor of Ancient History at the University of St Andrews from 1998 to 2014.
question:	Greg Woolf was an employee for whom in 2010?
label:	“University of St Andrews”
Joining Multiple Descriptions
context:	In April 2014, Pohjanpalo renewed his contract with HJK, extending it to 2018. At the same time HJK extended his loan a further two years, which Pohjanpalo spent on loan at Fortuna Düsseldorf.
question:	Which team did Joel Pohjanpalo play for in 2015?
label:	“Fortuna Düsseldorf”
context:	Tavares became CEO of Groupe PSA in 2014. Until January 16, 2021, he became the first chief executive officer of the multinational automobile group Stellantis.
question:	Carlos Tavares was an employee from whom in 2015?
label:	“Groupe PSA”
Joining Multiple Paragraphs
context:	(paragraphs 1) … He has been chair of the Labour Party in the House of Representatives since 10 June 2010 … (paragraphs 2) On 20 February 2012, he resigned as leader of the Labour Party, …
question:	What position did Job Cohen take in 2011?
label:	“Leader of the Labour Party”
context:	(paragraphs 1) … He was appointed Lord Advocate in 1775. His name appears in the 1776 minute book of the Poker Club… (paragraphs 2) 2nd Earl of Shelburne and Pitt, he entered the cabinet in 1791 as Secretary of State for the Home Department.
question:	Which position did Henry Dundas, 1st Viscount Melville hold in 1776?
label:	“Lord Advocate”
Unanswerable
context:	… In February 2016, she shared first place with Anastasia Bodnaruk and Soumya Swaminathan in the women’s event of the Moscow Open, finishing third on tiebreak. In 2017 she competed again in the World Youth U16 Olympiad for Russia and her team won the gold medal …
question:	Which title was conferred to Alexandra Obolentseva in 2017?
label:	“ ”
context:	… The P class were later re-allocated to shunting and station pilot duties. All eight locomotives passed into Southern Railway ownership at The Grouping in 1923 …
question:	What operated SECR P class in 1921?
label:	“ ”

Experimental Results

Tables IX, X, XI and XII report the stage-wise performance of FiD without CLTSQA-Framework, FiD with Temporal Memory Replay, FiD with Temporal Contrastive Learning, and FiD with CLTSQA-Framework, respectively. Each model $M_{i}$ is assessed on the dev and test splits of every subset up to and including $\mathcal{\overline{D}}_{i}$.
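Tables IX to XII report exact match (EM) and F1. The sketch below shows how such scores are commonly computed for question answering. It assumes SQuAD-style answer normalization and token-level F1, and it assumes that an unanswerable question is represented by an empty string, as in the label column of Table V. The scoring script actually used in the experiments may differ, and the function names are illustrative.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, squeeze whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, label: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(label))


def token_f1(prediction: str, label: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(label).split()
    if not pred_tokens or not gold_tokens:
        # Unanswerable case: credit only if both sides are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, the Table V prediction “University Hall” against the label “St Andrews University” scores an EM of 0 and an F1 of 0.4 (one shared token, precision 1/2, recall 1/3), while an empty prediction for an unanswerable question scores 1 on both metrics.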

TABLE IX: FiD-Baseline

	Subset1				Subset2				Subset3				Subset4				Subset5
	Dev		Test		Dev		Test		Dev		Test		Dev		Test		Dev		Test
	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1
$M_{1}$	39.76	50.66	33.54	46.04
$M_{2}$	39.18	49.52	35.33	45.10	45.69	55.26	44.94	55.57
$M_{3}$	37.39	47.77	32.10	42.99	45.27	55.46	43.31	54.77	46.01	55.46	48.13	56.71
$M_{4}$	36.04	47.64	31.89	43.42	43.27	54.13	41.48	54.24	42.80	53.47	45.72	54.47	46.22	57.02	48.61	58.62
$M_{5}$	34.25	45.29	29.97	40.22	43.84	53.89	39.26	50.95	40.32	50.60	40.43	49.48	42.26	53.37	46.28	56.60	47.97	55.60	49.03	56.29

TABLE X: Temporal Memory Replay

	Subset1				Subset2				Subset3				Subset4				Subset5
	Dev		Test		Dev		Test		Dev		Test		Dev		Test		Dev		Test
	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1
$M_{1}$	39.76	50.66	33.54	46.04
$M_{2}$	40.72	50.92	35.33	46.42	44.41	53.68	45.85	56.06
$M_{3}$	40.08	50.30	35.74	47.16	44.84	54.70	44.02	55.11	44.68	54.81	46.72	55.93
$M_{4}$	40.91	50.38	37.66	47.87	46.05	55.01	44.74	55.21	47.22	57.00	50.87	59.72	43.25	53.98	49.57	59.35
$M_{5}$	42.06	52.44	38.63	49.12	45.84	55.63	45.07	55.49	45.41	55.68	48.13	57.22	44.73	54.81	46.28	56.93	48.42	55.72	47.32	54.81

TABLE XI: Temporal Contrastive Learning

	Subset1				Subset2				Subset3				Subset4				Subset5
	Dev		Test		Dev		Test		Dev		Test		Dev		Test		Dev		Test
	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1
$M_{1}$	40.65	51.21	36.22	47.59
$M_{2}$	33.29	43.64	30.03	40.26	46.76	58.10	44.35	55.79
$M_{3}$	28.62	39.09	24.74	34.47	47.12	58.07	44.09	55.96	48.09	57.73	49.93	58.57
$M_{4}$	26.12	36.58	21.31	32.22	41.35	53.97	41.54	53.27	46.01	56.86	47.19	57.22	45.86	56.61	49.68	60.02
$M_{5}$	15.94	19.09	17.59	20.99	17.86	21.53	19.33	23.43	17.95	22.82	18.94	22.14	42.83	52.65	43.81	53.40	48.93	56.91	49.48	57.40

TABLE XII: FiD with CLTSQA-Framework

	Subset1				Subset2				Subset3				Subset4				Subset5
	Dev		Test		Dev		Test		Dev		Test		Dev		Test		Dev		Test
	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1
$M_{1}$	40.65	51.21	36.22	47.59
$M_{2}$	40.72	50.95	33.68	45.72	48.11	57.57	46.83	56.72
$M_{3}$	42.64	52.14	37.18	48.04	48.11	57.82	48.01	57.99	47.76	56.43	49.33	57.72
$M_{4}$	44.43	53.85	36.08	46.85	48.90	59.03	47.03	58.81	50.77	59.83	50.28	59.68	46.72	57.17	49.74	59.29
$M_{5}$	42.45	52.01	39.31	49.55	47.97	57.95	47.49	57.76	47.96	57.05	48.06	56.84	46.08	55.95	49.12	59.16	49.71	57.48	49.03	57.06
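The stage-wise numbers make the forgetting behaviour easy to quantify. As a small worked example, the snippet below takes the test F1 of Subsets 1 to 3 from Tables IX and XII, once right after the subset is first trained on (model $M_{i}$) and once at the end of training (model $M_{5}$), and reports the change. The dictionary layout and the helper name `f1_change` are choices made for this illustration only.

```python
# Test F1 copied from Tables IX (FiD-Baseline) and XII (FiD-CLTSQA).
# Each entry maps subset index i -> (F1 of M_i on subset i, F1 of M_5 on subset i).
TEST_F1 = {
    "FiD-Baseline": {1: (46.04, 40.22), 2: (55.57, 50.95), 3: (56.71, 49.48)},
    "FiD-CLTSQA": {1: (47.59, 49.55), 2: (56.72, 57.76), 3: (57.72, 56.84)},
}


def f1_change(first: float, final: float) -> float:
    """Change in F1 between the first and the final evaluation.

    A negative value means the model forgot part of what it had learned
    about that subset.
    """
    return round(final - first, 2)


changes = {
    model: {subset: f1_change(*pair) for subset, pair in per_subset.items()}
    for model, per_subset in TEST_F1.items()
}

for model, per_subset in changes.items():
    for subset, delta in per_subset.items():
        print(f"{model}  Subset {subset}: {delta:+.2f} F1")
```

For “FiD-Baseline” the change is -5.82, -4.62 and -7.23 F1 on Subsets 1, 2 and 3, whereas for “FiD-CLTSQA” it is +1.96, +1.04 and -0.88, in line with the discussion of Fig. 5.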
