Temporal Knowledge Graph Question Answering: A Survey

Miao Su, Zixuan Li, Zhuo Chen, Long Bai, Xiaolong Jin*, Jiafeng Guo*
CAS Key Laboratory of Network Data Science and Technology,
Institute of Computing Technology, Chinese Academy of Sciences
Correspondence: [email protected]
* Corresponding authors.
Abstract

Knowledge Base Question Answering (KBQA) is a long-standing field that answers questions based on knowledge bases. Recently, the evolving dynamics of knowledge have attracted growing interest in Temporal Knowledge Graph Question Answering (TKGQA), an emerging task that answers temporal questions. However, this field grapples with ambiguities in defining temporal questions and lacks a systematic categorization of existing methods for TKGQA. In response, this paper provides a thorough survey from two perspectives: the taxonomy of temporal questions and the methodological categorization for TKGQA. Specifically, we first establish a detailed taxonomy of temporal questions engaged in prior studies. Subsequently, we provide a comprehensive review of TKGQA techniques of two categories: semantic parsing-based and TKG embedding-based. Building on this review, the paper outlines potential research directions aimed at advancing the field of TKGQA. This work aims to serve as a comprehensive reference for TKGQA and to stimulate further research.



1 Introduction

Knowledge Base Question Answering (KBQA) aims to answer natural language questions based on existing Knowledge Bases (KBs) Dong et al. (2015). It has garnered significant attention from academia and industry due to its important role in various intelligent applications across multiple fields Zhou et al. (2018). A crucial subtask within KBQA is Temporal Knowledge Graph Question Answering (TKGQA), which specifically addresses temporal questions using Temporal Knowledge Graphs (TKGs) Leblay and Chekol (2018a). Temporal questions include temporal constraints or require timestamped answers, reflecting the dynamic and evolving nature of real-world events. The answer can vary significantly with different time constraints. For example, the answer to “Who won the UFC’s strawweight championship in 2022?” is “Carla Esparza”, while the answer to “Who won the UFC’s strawweight championship in 2024?” is “Weili Zhang”. Existing KBQA methods, even those designed for complex questions, struggle with temporal questions Jia et al. (2018b); Sun et al. (2019); Pramanik et al. (2021); Bast and Haussmann (2015); Abujabal et al. (2017).

Despite growing interest in TKGQA Chen et al. (2024); Gao et al. (2024); Du et al. (2024); Huang et al. (2024); Xue et al. (2024), the field still grapples with several challenges: (1) Ambiguities in the classification of temporal questions. As illustrated in Table 1, existing methods vary in their understanding of temporal questions, often concentrating on specific types of questions. No comprehensive review yet covers all existing temporal question types. (2) Lack of systematic categorization of existing methods. Existing surveys primarily focus on static factual questions and their related KBQA methods Fu et al. (2020); Lan et al. (2021); Gu et al. (2022); Chakraborty et al. (2021). Because TKGQA requires dedicated handling of time, an exhaustive review of TKGQA methods is needed.

Dataset | KG/TKG | Representation Form | Question Types
--- | --- | --- | ---
TempQuestions | Freebase | CVT | Explicit, Implicit, Ordinal, Temp.Answer
TimeQuestions | Wikidata | triples, n-ary tuple (n>3) | Explicit, Implicit, Ordinal, Temp.Answer
CronQuestions | Wikidata | quintuple | SimpleTime, SimpleEntity, Before/After, First/Last, TimeJoin
MultiTQ | ICEWS05-15 | quadruple | Equal, Before/After, First/Last, Equal Multi, Before Last, After First
Table 1: TKGQA datasets, their background temporal knowledge graphs, the representation form of the temporal facts therein, and their question types.

To address the above challenges, this paper provides a thorough survey from two perspectives: the taxonomy of temporal questions and the methodological categorization for TKGQA. Specifically, we first establish a unified taxonomy that encompasses existing temporal question types and definitions, providing a standardized reference that could be widely adopted. Subsequently, we systematically categorize existing methods into semantic parsing-based and TKG embedding-based. Within each category, we highlight how the methods address temporal questions. We identify the temporal question types that each method can solve and summarize them in a table to analyze the focus of existing methods and the question types that lack attention. Building on this review, we further analyze future research directions. To the best of our knowledge, this is the first comprehensive survey on the TKGQA task. This work aims to stimulate further research and foster innovation in the field by serving as a comprehensive reference for TKGQA.

The rest of this paper is organized as follows. In §2, we define the concepts relevant to TKGQA and the task itself. In §3, we classify temporal questions across all datasets based on question content (§3.1), answer type (§3.2), and complexity (§3.3). In §4, we introduce the two categories of TKGQA methods; in §4.1, we detail semantic parsing-based methods, while in §4.2, we elaborate on TKG embedding-based methods; in §4.3, we align each method with the specific types of questions it is designed to solve, providing a detailed table for summary. In §5, we explore new frontiers, summarize their challenges, and highlight opportunities for further research. We conclude this survey in §6. Additionally, in Appendix A, we provide a detailed description of the existing TKGQA datasets (§A.1), including the knowledge graphs behind them; introduce the evaluation metrics (§A.2) for the TKGQA task; and provide a leaderboard to illustrate the latest research progress (§A.3).

2 Preliminary

Temporal Knowledge Graph. A TKG is usually denoted as $\mathcal{G}=(\mathcal{E},\mathcal{R},\mathcal{T},\mathcal{F})$, where $\mathcal{E}$, $\mathcal{R}$, $\mathcal{T}$, and $\mathcal{F}$ represent the entities, relations, timestamps, and facts, respectively Cai et al. (2024). A temporal fact $f\in\mathcal{F}$ comprises one or more entities, relations, and associated timestamps. It can be represented in various forms, including Compound Value Types (CVTs), triples, n-ary tuples, quintuples, and quadruples.

Temporal Question.

A temporal question contains at least one temporal constraint or requires timestamps as its answer Jia et al. (2018a). A temporal constraint involves a combination of a temporal expression and a temporal word, setting a condition about a specific time point or interval that the answer must meet (e.g., "in 1996"). Temporal expressions refer to time points or intervals with varying levels of granularity in natural language (e.g., "May 11th, 2024") Pustejovsky et al. ; Huang (2018). Temporal words indicate the temporal relationships between temporal expressions and act as trigger words that impose constraints on the answers (e.g., “in”, “after”, or “during”).

Temporal Knowledge Graph Question Answering.

Given the temporal knowledge graph $\mathcal{G}$ and a temporal question $q$ in natural language, the TKGQA task aims to answer $q$ using either a set of entities $\{e \mid e\in\mathcal{E}\}$ or timestamps $\{\tau \mid \tau\in\mathcal{T}\}$ from $\mathcal{G}$.
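To make the two answer types concrete, the following is a minimal sketch that assumes the quadruple representation and year granularity. The `Fact` type and both helper functions are hypothetical names introduced for illustration only; the single fact is the currency example from §3.3.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One temporal fact in quadruple form: (subject, relation, object, time)."""
    subject: str
    relation: str
    obj: str
    time: int  # assumption: year granularity


# Toy TKG holding the example fact <Germany, currency, Euro, 2012>.
TOY_TKG = [Fact("Germany", "currency", "Euro", 2012)]


def answer_entities(tkg, subject, relation, time):
    """Entity-typed answer: objects of facts matching subject, relation, and time."""
    return {f.obj for f in tkg
            if (f.subject, f.relation, f.time) == (subject, relation, time)}


def answer_timestamps(tkg, subject, relation, obj):
    """Timestamp-typed answer: times of facts matching subject, relation, and object."""
    return {f.time for f in tkg
            if (f.subject, f.relation, f.obj) == (subject, relation, obj)}
```

Both functions return sets, mirroring the definition of an answer as a set of entities or a set of timestamps drawn from the graph.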

3 Taxonomy of Temporal Questions

Figure 1: Taxonomy of temporal questions from three aspects, including (a) Question Content; (b) Answer Type and (c) Complexity.

We categorize the questions based on three aspects as illustrated in Figure 1: 1) Question Content: We use several time-related dimensions in question content to categorize the questions, as these dimensions naturally differentiate how questions are answered. 2) Answer Type: We classify the questions based on the answer types; unlike KBQA questions with a single answer type (i.e., entity), temporal questions encompass various types of answers. 3) Complexity: Similar to KBQA, we categorize the questions by their complexity Hu et al. (2018); Luo et al. (2018).

3.1 Question Content

Temporal Granularity.

Questions can be categorized by the temporal granularity of their temporal expressions, with “year” being the most common, followed by “day” and “month”.

Temporal Expression.

Questions can be classified as explicit or implicit based on the nature of their temporal expressions. All time points can be normalized to a standard format, such as 2024-08-09. An explicit temporal expression can be normalized without additional context (e.g., “September 2023” as 2023-09). An implicit temporal expression, such as an event name or a phrase with a temporal scope (e.g., “2024 Paris Olympics”), requires contextual information to be normalized into a specific interval Jia et al. (2018a).
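The difference between the two kinds of expression can be sketched as follows. This is only an illustration: the `normalize` function, the month/year pattern, and the `EVENT_INTERVALS` lookup are assumptions made here, and the lookup stands in for whatever contextual knowledge a real system would consult.

```python
import re

MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Hypothetical context: an implicit expression needs external knowledge
# (here a hand-written table) before it can be mapped to an interval.
EVENT_INTERVALS = {"2024 Paris Olympics": ("2024-07-26", "2024-08-11")}


def normalize(expression):
    """Return a (begin, end) pair of normalized strings, or None if unknown."""
    # Explicit "Month YYYY": normalizable from the string alone.
    match = re.fullmatch(r"([A-Z][a-z]+) (\d{4})", expression)
    if match and match.group(1) in MONTHS:
        value = f"{match.group(2)}-{MONTHS[match.group(1)]:02d}"
        return (value, value)
    # Explicit bare year.
    if re.fullmatch(r"\d{4}", expression):
        return (expression, expression)
    # Implicit expression: resolved only through the context table.
    return EVENT_INTERVALS.get(expression)
```

An explicit expression such as “September 2023” is resolved by the pattern alone, while “2024 Paris Olympics” yields an interval only because the table supplies it.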

Temporal Constraints.

The types of temporal constraints mirror those of temporal relations between temporal expressions. We simplify Allen’s interval algebra for temporal reasoning Allen (1983) into six types of relations: Before/After, Equal, Overlap, During/Include, Start/End, and Ordinal. Their formalizations use the following notation:

  • $[begin_{ans}, end_{ans}]$: This represents the time interval or specific time point where the answer is located.

  • $[begin_{cons}, end_{cons}]$: This denotes the range of the temporal constraint. When $begin_{cons} = end_{cons}$, it signifies a specific point in time.

A summary of the meanings of these temporal constraint types is provided in Table 2.

Constraint Type | Formalization
--- | ---
Before | $end_{ans} \leq begin_{cons}$
After | $begin_{ans} \geq end_{cons}$
Equal | $begin_{ans} = begin_{cons},\ end_{ans} = end_{cons}$
Overlap | $begin_{ans} \leq end_{cons} \leq end_{ans}$ or $begin_{ans} \leq begin_{cons} \leq end_{ans}$
During | $begin_{cons} \leq begin_{ans} \leq end_{ans} \leq end_{cons}$
Include | $begin_{ans} \leq begin_{cons} \leq end_{cons} \leq end_{ans}$
End | $begin_{cons} \leq begin_{ans} \leq end_{cons} = end_{ans}$
Start | $begin_{ans} = begin_{cons} \leq end_{ans} \leq end_{cons}$
Table 2: Formalization of constraint types.

The Ordinal type requires facts to be arranged in chronological order.
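The formalizations in Table 2 translate directly into interval predicates. The sketch below assumes integer timestamps (for example, years); the `Interval` type, the `RELATIONS` dictionary, and the `order_chronologically` helper are names introduced here for illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    begin: int
    end: int  # begin == end denotes a single time point


# Each predicate takes the answer interval `a` and the constraint interval `c`.
RELATIONS = {
    "Before":  lambda a, c: a.end <= c.begin,
    "After":   lambda a, c: a.begin >= c.end,
    "Equal":   lambda a, c: a.begin == c.begin and a.end == c.end,
    "Overlap": lambda a, c: (a.begin <= c.end <= a.end
                             or a.begin <= c.begin <= a.end),
    "During":  lambda a, c: c.begin <= a.begin <= a.end <= c.end,
    "Include": lambda a, c: a.begin <= c.begin <= c.end <= a.end,
    "End":     lambda a, c: c.begin <= a.begin <= c.end == a.end,
    "Start":   lambda a, c: a.begin == c.begin <= a.end <= c.end,
}


def order_chronologically(intervals):
    """Ordinal constraints need facts sorted by time; here, by (begin, end)."""
    return sorted(intervals)
```

For instance, with `Interval(2001, 2009)` as the answer and `Interval(2009, 2017)` as the constraint, the Before predicate holds because the table uses a non-strict inequality.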

Temporal Constraints Composition.

Temporal constraints composition occurs when multiple temporal constraints are in one question. For instance, “Who was the first to request a meeting with Togo in 2005?” combines an Equal type constraint “in 2005” with an Ordinal type constraint “first”. The answer must satisfy both. This combination represents a more complex and challenging type of question.
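A composition of an Equal and an Ordinal constraint can be sketched as a filter followed by a selection. The facts below are invented placeholders (the actors and dates are not taken from any dataset), and `first_in_year` is a hypothetical helper that assumes ISO-formatted day-level timestamps.

```python
# Invented toy facts: (subject, relation, object, ISO date).
MEETING_FACTS = [
    ("Actor A", "request a meeting", "Togo", "2004-11-02"),
    ("Actor B", "request a meeting", "Togo", "2005-03-14"),
    ("Actor C", "request a meeting", "Togo", "2005-01-20"),
    ("Actor D", "request a meeting", "Togo", "2006-05-09"),
]


def first_in_year(facts, relation, obj, year):
    """Apply the Equal constraint (the year), then the Ordinal constraint (first)."""
    # Equal: keep only facts whose date falls inside the requested year.
    in_year = [f for f in facts
               if f[1] == relation and f[2] == obj
               and f[3].startswith(f"{year}-")]
    if not in_year:
        return None
    # Ordinal: ISO dates sort chronologically as strings, so min() is the earliest.
    return min(in_year, key=lambda f: f[3])[0]
```

The order of the two steps matters: taking the earliest fact overall and then checking its year would return nothing for this toy data, since the earliest fact dates from 2004.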

3.2 Answer Type

Temporal questions can require answers that are either collections of entities or collections of timestamps, with the granularity of the timestamps varying based on the specific question. The type of answer is guided by the question word, such as “who” for an entity and “what year” for a timestamp.

3.3 Complexity

KBQA works define complex questions as those requiring retrieval of answers from more than one fact Hu et al. (2018); Dubey et al. (2019). Inspired by these works, we also categorize temporal questions based on complexity. Specifically, we classify temporal questions into simple and complex categories.

Simple questions.

Simple questions rely on a single fact for resolution. For instance, “What currency was used in Germany in 2012?” requires retrieving only one fact <Germany, currency, Euro, 2012>.

Complex questions.

Complex questions require the integration of multiple facts. For example, the question “Who was the US President before Obama?” first establishes the time constraint “before 2009” based on the fact <Obama, President of, USA, 2009, 2017>. The system then identifies the individual who served immediately prior, confirmed by the fact <George W. Bush, President of, USA, 2001, 2009>, thus identifying George W. Bush. This multi-step reasoning process illustrates the complexity of such questions.
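The two reasoning steps in this example can be sketched as follows, assuming quintuple facts of the form (subject, relation, object, start, end). The first two facts come from the example above; the third is added here only so that the second step has more than one candidate. The function name `predecessor` is hypothetical.

```python
PRESIDENT_FACTS = [
    ("Obama", "President of", "USA", 2009, 2017),
    ("George W. Bush", "President of", "USA", 2001, 2009),
    ("Bill Clinton", "President of", "USA", 1993, 2001),  # added for illustration
]


def predecessor(facts, person, relation, obj):
    """Answer 'who held <relation> of <obj> before <person>?' in two steps."""
    # Step 1: resolve the implicit constraint from the anchor fact ("before 2009").
    anchor_start = next(
        (f[3] for f in facts
         if f[0] == person and f[1] == relation and f[2] == obj),
        None)
    if anchor_start is None:
        return None
    # Step 2: among facts ending no later than that start, take the latest one.
    earlier = [f for f in facts
               if f[1] == relation and f[2] == obj and f[4] <= anchor_start]
    return max(earlier, key=lambda f: f[4])[0] if earlier else None
```

Neither step can be skipped: the first fact supplies the constraint and the second supplies the answer, which is what makes the question complex under the definition above.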

4 Two Categories of TKGQA Methods

Since TKGQA is a crucial subtask within KBQA, many TKGQA methods have been developed to enrich and improve upon KBQA approaches. KBQA methods are categorized into Semantic Parsing-based (SP-based) and Information Retrieval-based (IR-based) methods by existing surveys Fu et al. (2020); Lan et al. (2021, 2022). Building on this categorization, we classify TKGQA methods into Semantic Parsing-based (SP-based) and TKG Embedding-based (TKGE-based) methods. Unlike IR-based methods in KBQA, TKGE-based methods view TKGQA as a TKG completion task Cai et al. (2023); Leblay and Chekol (2018b); Han et al. (2021) and do not always retrieve a question subgraph. The following sections delve into the details of these two categories of TKGQA methods.

4.1 Semantic Parsing-based Methods

As illustrated in Figure 2, SP-based methods usually have four steps: question understanding, logical parsing, TKG grounding, and query execution. The question understanding module converts unstructured text into encoded questions, facilitating downstream parsing. Next, the logical parsing module transforms the encoded question into uninstantiated logical forms, which are then grounded with the TKG elements through TKG grounding to get executable queries. Finally, the executable queries are processed and executed against the TKG to obtain the final answers during the query execution phase.

Figure 2: Overall procedure of SP-based methods.

4.1.1 Question Understanding

The question understanding module analyzes the input question to generate an encoded representation. This module is sometimes simplified to tagging or extracting logical candidates such as temporal words, entities, and timestamps. Abstract Meaning Representation (AMR) Kapanipathi et al. (2020) is one of the most widely used representations for KBQA questions. SYGMA Neelam et al. (2021) uses AMR to capture temporal words as part of the :time relation and to handle implicit temporal constraints. Kannen et al. (2023) and Long et al. (2022) also employ AMR to identify question constituents. SF-TQA Ding et al. (2023) fine-tunes BERT Devlin et al. (2019) to annotate elements determined by TimeML Pustejovsky et al. relations. Given their strong performance on text generation and induction, Large Language Models (LLMs) have been applied to generate a simplified version of logical forms directly Chen et al. (2024) and to induce step-wise abstract methodological guidance for the question at hand Chen et al. (2023a).

4.1.2 Logical Parsing

Logical parsing transforms the encoded question into an uninstantiated logical form. TEQUILA uses the existing KBQA engines AQQU (Bast and Haussmann, 2015) and QUINT (Abujabal et al., 2017) to answer the sub-questions; these engines primarily rely on predefined rules or templates to parse questions and derive logical forms Fu et al. (2020). Early TKGQA approaches also employed rule-based translation, further incorporating time-related rules. SYGMA introduces KB-agnostic rules into $\lambda$-expressions Cai and Yates (2013) to match temporal constraints indicated by the :time relation in AMR. Built on SYGMA, Kannen et al. (2023) decompose the $\lambda$-expression into main-$\lambda$ and aux-$\lambda$, with the former containing the primary event questioned and the latter containing the temporal constraint.

Additionally, many methods design specialized logical forms to represent temporal information Long et al. (2022). Ding et al. (2023) introduce the Semantic Framework of Temporal Constraints (SF-TCons), which captures temporal constraints and their interpretation structures. Six interpretation structures (IS) are summarized based on the intrinsic connection between events and their connectors. For example, the IS-1 Comparison structure ‘COMPARE⟨ INCLUDES, time(“direct”), “1960” ⟩’ in Figure 3 states that the time of the “direct” event should be included by “1960”. After linking, it can be transformed into the query graph shown beneath it. Prog-TQA expands temporal operators based on the Knowledge-oriented Programming Language (KoPL) Cao et al. (2022), which enables a more concise implementation of temporal logical queries than KBQA logical forms such as SPARQL Polleres . ARI defines specialized actions for precise information retrieval, such as “getBetween(entities,Time1,Time2)”, which identifies entities/events that occurred between two specific times. An action sequence generated by the LLM can be viewed as a logical form here.
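To illustrate what an action such as “getBetween(entities,Time1,Time2)” might compute, the sketch below gives one possible reading over quadruple facts. It is not ARI’s implementation: the function `get_between`, the inclusive boundaries, the matching on either subject or object, and the toy facts are all assumptions made here.

```python
# Invented toy facts: (subject, relation, object, ISO date).
EVENT_FACTS = [
    ("Actor A", "visit", "Actor B", "2005-02-01"),
    ("Actor B", "consult", "Actor C", "2005-06-15"),
    ("Actor A", "praise", "Actor C", "2007-09-30"),
]


def get_between(facts, entities, time1, time2):
    """Hypothetical analogue of getBetween: facts that mention any of
    `entities` and whose timestamp lies in [time1, time2] (inclusive)."""
    wanted = set(entities)
    return [f for f in facts
            if (f[0] in wanted or f[2] in wanted)
            and time1 <= f[3] <= time2]
```

A sequence of such calls, each consuming the result of the previous one, is what allows an action sequence to play the role of a logical form.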

4.1.3 TKG Grounding

TKG grounding grounds the elements in the unbound logical form with the entities, relations, and timestamps in the TKG. A series of methods are employed in this module, including rule-based approaches Neelam et al. (2021), BERT representation similarity Yih et al. (2015), fuzzy matching algorithms Chen et al. (2024), and an off-the-shelf Named Entity Linking (NEL) model Chen et al. (2023a).
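As one illustration of the fuzzy-matching option, a mention can be grounded to the closest entity label with Python’s standard `difflib` module. This is a generic sketch rather than the procedure of any cited method; the entity vocabulary, the `ground` helper, and the 0.6 cutoff are assumptions.

```python
import difflib

# Hypothetical entity vocabulary of a TKG.
TKG_ENTITIES = ["George W. Bush", "Barack Obama", "Germany", "Togo"]


def ground(mention, vocabulary, cutoff=0.6):
    """Return the vocabulary entry most similar to `mention`, or None
    when no entry reaches the similarity cutoff."""
    matches = difflib.get_close_matches(mention, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Character-level similarity tolerates misspellings such as “Barak Obama” but not short aliases: a bare “Obama” falls below the default cutoff against “Barack Obama”, which is one reason embedding similarity or a dedicated entity linker may be used instead.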

Figure 3: Semantic framework of temporal constraints.

4.1.4 Query Execution

The query execution module runs the grounded logical form against the TKG to retrieve the final answers. Some methods conduct temporal reasoning in this module. TEQUILA casts the time ranges of sub-question answers into intervals and conducts reasoning based on the rules in Table 2. AE-TQ conducts temporal reasoning using semantic information structures (SISs): an SIS that contains the temporal information computes a temporal constraint, which is then used to filter the candidate answers retrieved by another SIS. ARI performs knowledge-based interaction for multi-step inference Gu and Su (2022); the LLM generates and executes actions on the TKG iteratively until the final state provides the answer. Other methods try to enhance model robustness by generating multiple queries: SF-TQA generates multiple candidate queries and scores the pairs of input questions and serialized queries with BERT. Prog-TQA identifies potential errors in KoPL programs and generates corrected versions. Correct programs are collected and used to fine-tune the LLM iteratively for self-improvement Huang et al. (2022).

To mitigate the TKG’s incompleteness, Kannen et al. (2023) propose a targeted temporal fact extraction technique, in which a reading comprehension question answering (RCQA) style model is used to obtain missing facts and complete the query.

4.2 TKG Embedding-based Methods

As illustrated in Figure 4, TKGE-based methods typically involve three steps: TKG embedding, question embedding, and answer ranking. In these methods, questions and candidate answers (i.e., entities and timestamps) are converted into embeddings through the question embedding and TKG embedding modules, respectively. The question embedding is then projected into $Q_{ent}$ and $Q_{time}$ for ranking entities and timestamps during the answer ranking process.

Figure 4: Overall procedure of TKGE-based methods.

4.2.1 TKG Embedding

The TKG embedding module generates embeddings of TKG elements. The entity and timestamp embeddings are filtered and augmented to create a pool of candidate answers. EXAQT Jia et al. (2021) follows a line of KBQA research Sun et al. (2018a); Yasunaga et al. (2022), employing relational graph convolutional networks (R-GCNs) to update and derive the candidates’ embeddings. The entity embeddings are initialized with Wikipedia2Vec Yamada et al. (2020) and augmented with timestamp encodings Zhang et al. (2020a), time-aware entity embeddings, temporal signals Setzer (2001a), temporal question categories Jia et al. (2018b), and attention over temporal relations.

CronKGQA Saxena et al. (2021) initially encodes all elements of the TKG using the TComplEx model Lacroix et al. (2020), a tensor factorization model designed for temporal knowledge graph completion Cai et al. (2023) that captures complex patterns and temporal dependencies within multi-relational data. TSQA Shang et al. (2022a) highlights that TComplEx ignores the temporal order between quadruples; it incorporates a temporal order loss during the training of TComplEx, inspired by position embeddings in transformers Vaswani et al. (2023).
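For readers unfamiliar with TComplEx, the sketch below shows the scoring function as we understand it from Lacroix et al. (2020): the real part of a four-way product of complex embeddings, with the object embedding conjugated. This survey does not state the formula, so the exact form should be treated as an assumption; the function names and the pure-Python representation (lists of `complex` values) are illustrative.

```python
def tcomplex_score(u_s, v_r, u_o, w_t):
    """Assumed TComplEx score: Re(sum_k u_s[k] * v_r[k] * w_t[k] * conj(u_o[k]))."""
    total = sum(s * r * t * o.conjugate()
                for s, r, o, t in zip(u_s, v_r, u_o, w_t))
    return total.real


def rank_objects(u_s, v_r, w_t, candidates):
    """Rank candidate object names by descending score.

    `candidates` maps a name to its complex embedding (a list of complex)."""
    return sorted(candidates,
                  key=lambda name: tcomplex_score(u_s, v_r, candidates[name], w_t),
                  reverse=True)
```

Ranking every candidate entity for a fixed subject, relation, and timestamp is what lets a completion model of this kind double as an answer scorer.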

To reduce the search space, EXAQT generates compact question subgraphs using Group Steiner Trees (GSTs) Li et al. (2016). SubGTR Chen et al. (2022) crops question subgraphs using temporal constraints.

To address the inconsistency between a question’s granularity and the TKG’s temporal granularity, MultiQA Chen et al. (2023b) employs multi-granularity temporal aggregation. It splices days within each month or year interval, adds position vectors, and then fuses the information using the transformer.

4.2.2 Question Embedding

The question embedding module embeds the temporal question, analyzing its semantics and incorporating time-relevant information. EXAQT embeds the question words with Wikipedia2Vec Yamada et al. (2020) and encodes them with an LSTM Hochreiter and Schmidhuber (1997). It then concatenates the result with temporal category and temporal signal word encodings and updates it using R-GCN. CronKGQA Saxena et al. (2021) encodes the question with BERT. TempoQR Mavromatis et al. (2021) further leverages TKG embeddings to ground questions with their specific entities and respective time scopes. It replaces the BERT token embeddings of entities and timestamps with their pre-trained TKG embeddings and adds time position to the entity tokens. TSIQA Xiao et al. (2022) derives the time position of entities based on the assumption that entities with co-sharing relations correspond to related timestamps.

Many methods use GNNs to further integrate the graph structure into the question embedding; the value of an edge in the graph is the concatenation of relation and timestamp, i.e., $r \| t$, which is specific to TKGQA tasks. TwiRGCN Sharma et al. (2022) computes question-dependent edge weights to modulate R-GCN messages, enhancing messages through relevant edges and diminishing those from irrelevant ones. LGQA Liu et al. (2023b) fuses global (i.e., sentence-level semantic) and local (i.e., entity-level graphical) information with transformers. GenTKGQA Gao et al. (2024) retrieves a question-relevant subgraph through the LLM’s extraction ability Sun et al. (2023) and uses a pre-trained T-GNN layer Veličković et al. (2018) to embed elements in the subgraph into “virtual knowledge indicators” that represent the question. M3TQA Zha et al. (2024) designs a multi-stage aggregation module, enabling asynchronous alignment and fusion of bidirectional heterogeneous information from the PLMs Devlin et al. (2019); Liu et al. (2019) and GNNs.

To emphasize the importance of different knowledge for the question, JMFRN Huang et al. (2024) aggregates entity and timestamp information of retrieved facts using time-aware and entity-aware attention Vaswani et al. (2023). TMA Liu et al. (2023a) selects facts with similar semantics for three kinds of token-level attention. A gating mechanism integrates these representations to enhance the question embedding.

To enhance the model’s sensitivity to temporal words, TSQA and TSIQA alter temporal words (e.g., replacing “before” with “after”) to construct contrastive questions and apply both order loss and answer loss for contrastive learning.
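The construction of a contrastive question can be sketched as a word-level swap. Only the “before”/“after” pair is taken from the text; the “first”/“last” pair, the `SWAPS` table, and the `contrastive_question` helper are assumptions added for illustration.

```python
import re

# "before" <-> "after" follows the text; "first" <-> "last" is an assumed extension.
SWAPS = {"before": "after", "after": "before", "first": "last", "last": "first"}
_PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", flags=re.IGNORECASE)


def contrastive_question(question):
    """Swap each temporal word for its opposite; return None if none is present."""
    def swap(match):
        word = match.group(0)
        replacement = SWAPS[word.lower()]
        # Preserve an initial capital, e.g. "Before" -> "After".
        return replacement.capitalize() if word[0].isupper() else replacement

    rewritten = _PATTERN.sub(swap, question)
    return rewritten if rewritten != question else None
```

The original and rewritten questions differ in a single temporal word, so a model that answers both identically has evidently ignored that word.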

Various approaches extract implicit temporal features from questions: CTRN Jiao et al. (2023) uses multi-head self-attention, GCN Sun et al. (2018b), and CNN Pota et al. (2020) to capture these features and fuse them with augmented BERT representations, while SERQA Du et al. (2024) integrates temporal constraint features computed from syntactic information in constituent and dependency trees Sun et al. (2022); Zhang et al. (2020b); Wang et al. (2023); Liang et al. (2022) combined with Masked Self-Attention (MSA).

To enhance the interpretability of reasoning on implicit temporal questions, SubGTR designs an implicit expression parsing module to rewrite their temporal constraints explicitly.

4.2.3 Answer Ranking

The answer ranking module ranks candidate answers based on the question and candidate answer embeddings. TKGE-based models employ various techniques: leveraging TComplEx scoring functions Saxena et al. (2021); Mavromatis et al. (2021), applying temporal activation functions to satisfy time constraints Chen et al. (2022), introducing gating mechanisms Sharma et al. (2022) or type discrimination losses Huang et al. (2024) to distinguish among answer types, and fine-tuning an LLM to list the most relevant answers Ye et al. (2023).

Method Category Question Content Answer Type Complexity
\LongunderstackTime
Granularity \LongunderstackTime
Expression \LongunderstackTemporal
Constraint \LongunderstackTemporal
Constraints
Composition Entity Time Simple Complex

Year

Month

Day

Explicit

Implicit

Overlap

Before/After

Ordinal

Equal

During/Include

Start/End

w/ Comp.

w/o Comp.

Year

Month

Day

Semantic Parsing-based
TEQUILA Jia et al. (2018b) \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ
SYGMA Neelam et al. (2021) \circ \circ \circ \circ \circ \bullet \bullet \circ \circ \bullet \circ \circ \circ \circ \circ \circ \circ \circ
AE-TQ Long et al. (2022) \circ \circ \circ \bullet \bullet \bullet \circ \circ \circ \circ \circ \circ \circ \bullet
SF-TQA Ding et al. (2023) \circ \circ \circ \bullet \bullet \bullet \bullet \bullet \bullet \bullet \circ \circ \circ \circ \circ \circ \circ \circ \bullet
ARI Chen et al. (2023a) \circ \circ \circ \circ \circ \circ \circ \circ \circ \bullet \circ \circ \circ \circ \circ \circ \circ \bullet
Best of Both Kannen et al. (2023) \circ \circ \circ \circ \bullet \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ
Prog-TQA Chen et al. (2024) \bullet \bullet \bullet \circ \circ \bullet \bullet \bullet \bullet \bullet \bullet \circ \circ \circ \bullet \bullet \bullet \circ \bullet
TKG Embedding-based
CronKGQA Saxena et al. (2021) \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \bullet \circ
EXAQT Jia et al. (2021) \circ \bullet \circ \circ \circ \circ \circ \circ \circ \circ \circ \bullet \circ \circ
TempoQR Mavromatis et al. (2021) \circ \circ \circ \bullet \bullet \bullet \circ \bullet \circ \circ \circ \circ \circ \circ
TSQA Shang et al. (2022b) \circ \circ \circ \circ \bullet \bullet \bullet \bullet \circ \circ \circ \circ \circ \bullet
CTRN Jiao et al. (2023) \circ \circ \circ \bullet \bullet \bullet \circ \bullet \circ \circ \circ \circ \circ \bullet
SubGTR Chen et al. (2022) \circ \circ \bullet \bullet \bullet \bullet \circ \bullet \circ \circ \circ \circ \circ \bullet
TwiRGCN Sharma et al. (2022) \circ \circ \bullet \circ \circ \bullet \circ \circ \circ \circ \circ \circ \circ \circ
TSIQA Xiao et al. (2022) \circ \circ \circ \circ \bullet \bullet \bullet \bullet \circ \circ \circ \circ \circ \bullet
TMA Liu et al. (2023a) \circ \circ \circ \bullet \bullet \bullet \circ \bullet \circ \circ \circ \circ \circ \circ
MultiQA Chen et al. (2023b) \bullet \bullet \bullet \circ \circ \circ \circ \circ \circ \circ \circ \circ \circ \bullet \bullet \bullet \circ \circ
LGQA Liu et al. (2023b) \circ \circ \circ \circ \circ \circ \bullet \circ \circ \bullet \circ \circ \circ \circ \circ \circ \circ \bullet
JMFRN Huang et al. (2024) \circ \circ \bullet \circ \circ \bullet \circ \circ \circ \circ \circ \circ \circ \bullet
SERQA Du et al. (2024) \circ \circ \circ \circ \circ \circ \bullet \bullet \circ \circ \circ \circ \circ \circ \circ \circ \circ \bullet
QC-MHM Xue et al. (2024) \circ \bullet \bullet \bullet \bullet \bullet \circ \circ \circ \circ \circ \bullet \circ \circ
GenTKGQA Gao et al. (2024) \circ \circ \circ \bullet \bullet \bullet \circ \circ \circ \circ \circ \bullet \bullet \bullet
$M^3$TQA Zha et al. (2024) \circ \circ \circ \bullet \bullet \circ \circ \circ \circ \circ \circ \circ \circ \circ \bullet \bullet \bullet \circ \bullet
Table 3: Question category coverage comparison across TKGQA methods. A \circ indicates that the method can solve the corresponding question category. A \bullet indicates that the method focuses on or specializes in solving this question category.

4.3 Question Category Coverage Comparison Across TKGQA Methods

Building on the question taxonomy and the overview of methods, we match each type of temporal question with the methods designed to address it and summarize the result in Table 3. The table shows that finer-grained time granularities have come into focus over time. Implicit questions have received more attention than explicit ones. Among temporal constraints, before/after and ordinal questions have received the most attention, followed by during/include and overlap; start/end and equal questions have received less attention because fewer datasets present them as separate categories. Methods increasingly focus on more complex questions; however, the most complex type, temporal constraint composition, has received little attention.

5 Future Directions

This section discusses emerging frontiers for TKGQA, aiming to stimulate further research in this field.

5.1 Introduce More Question Types

While existing datasets already cover some types of temporal questions, many more remain to be explored in the real world. 1) More combinations of existing question types: “Who was the first person to win a medal during the 2024 Olympic Games?” 2) More time granularities: some questions demand finer-grained granularities, such as “When was the Long March 1 launched?” 3) Questions whose answers depend on the time at which they are posed: “Where are the seneca indians now?” Jia et al. (2021); Liška et al. (2022) 4) Questions that require predicting the future: “Will the Palestinian-Israeli conflict end next year?” Jin et al. (2021); Ding et al. (2022b, a) 5) Commonsense temporal questions: “How often are the Olympics held?”

5.2 Enhance Model Robustness

Most existing TKGQA datasets provide entity and temporal annotations Saxena et al. (2021); Jia et al. (2021); Neelam et al. (2022), which greatly reduces the task’s difficulty. Results on datasets without such annotations depend on the quality of NEL tools or temporal annotators Chen et al. (2023b), which undermines model robustness. Robust models should perform well on datasets with no additional annotations and generalize to unseen entities and relations Chen et al. (2022). In addition, most existing datasets are generated from templates and lack diversity: they cover very few event types and remain single-domain. These issues can be addressed in future work.

5.3 Multi-modal TKGQA

Current TKGQA systems mainly handle plain text input. However, we experience the world through multiple modalities (e.g., language and images). Therefore, building a multi-modal TKGQA system is an important direction to investigate Yu et al. (2023). A non-trivial challenge is how to align features across modalities and make them complement each other so that the temporal aspects of a question are better understood.

5.4 LLM for TKGQA

Recently, Large Language Models (LLMs) have gained significant attention for their remarkable performance across a wide range of Natural Language Processing (NLP) tasks Touvron et al. ; OpenAI (2024); Team and Googlba (2024). Existing research has also explored applying LLMs in KBQA scenarios, employing both few-shot and zero-shot learning paradigms Nie et al. (2024); Sun et al. (2024); Jiang et al. (2023); Baek et al. (2023); Li et al. (2023a, b).

However, several critical challenges remain to be addressed in applying LLMs to TKGQA. First, LLMs currently have significant shortcomings in understanding temporal expressions Chu et al. (2023), which is crucial for TKGQA. Second, LLMs perform poorly in symbolic temporal reasoning, especially in multi-step tasks Chu et al. (2023); Tan et al. (2023); Qin et al. (2021). Enhancing these capabilities is essential for complex temporal questions; approaches such as temporal span extraction pre-training, supervised fine-tuning, and time-sensitive reinforcement learning may help Tan et al. (2023).

Several emerging opportunities could further enhance the capabilities of LLMs in TKGQA systems:

  • Multi-Agent Collaborative and Interactive Reasoning for TKGQA. Recent LLM work has shifted its focus from traditional NLP tasks to language agents in simulation environments that mimic real-world scenarios Zhang et al. (2024). Qian et al. (2024) investigate interactive reasoning and collective intelligence for autonomously solving complex problems. These ideas may be further explored for temporal reasoning over temporal questions.

  • Diverse Data Generation. Numerous studies have demonstrated the effectiveness of large models in data generation Chung et al. (2023), which can be used to enhance the diversity of TKGQA datasets.

  • Supplementing Knowledge. The language model itself can serve as a TKG, as demonstrated by Dhingra et al. (2022). Additionally, LLMs possess temporal commonsense Chu et al. (2023), which is often absent in traditional temporal knowledge graphs. This temporal knowledge can complement existing TKGs for TKGQA.

6 Conclusion

In this paper, we provided an in-depth analysis of the emerging field of TKGQA with a new taxonomy of temporal questions and a systematic categorization of existing methods. We showed which types of temporal questions existing methods focus on and which they neglect, pointing to future research directions. We also discussed new trends in this research field, hoping to encourage further breakthroughs.

Limitations

This study offers a comprehensive review of the TKGQA task. However, our primary focus is on temporal question answering specifically based on temporal knowledge graphs, and we do not delve into other temporal question answering tasks based on texts or heterogeneous sources. Furthermore, the descriptions within this survey are deliberately brief to ensure a broad coverage of the topic while adhering to page constraints. Rather than presenting the works in an unstructured sequence, we organize them into meaningful, structured groups. We aim for this work to serve as an index, guiding readers to more detailed information in the referenced works.

References

Appendix A Appendix

A.1 Datasets

This section introduces the datasets, including their background TKG, size, etc. We provide a question category coverage comparison across TKGQA datasets in Table 4.

TempQuestions

Jia et al. (2018a) is a benchmark dataset derived from Freebase Bollacker et al. (2008), where temporal knowledge is stored using compound value types (CVTs). Examples of these CVTs include footballPlayer.team.joinedOnDate, footballPlayer.team.leftOnDate, marriage.date, amusement_parks.ride.opened, and amusement_parks.ride.closed. It includes 1,271 questions with temporal signals, question types, and data sources for testing and evaluation.

TempQA-WD

Neelam et al. (2021) is an adaptation of TempQuestions for Wikidata Vrandečić and Krötzsch (2014), fulfilling the need identified by Neelam et al. (2021) for a multi-KG-based dataset to evaluate their model. The dataset comprises:

  • 839 questions with corresponding Wikidata SPARQL queries, answers, and categories, along with TempQuestions’ information.

  • 175 questions with AMR, lambda expressions, entities, relations, and KB-specific lambda expressions, in addition to the above information.

TimeQuestions

Jia et al. (2021) is based on Wikidata, which stores temporal facts either as plain triples, such as <Malia Obama, date of birth, 04-07-1998>, or as triples with qualifiers that carry additional temporal knowledge, such as <Barack Obama, position held, President of the US; start date, 20-01-2009; end date, 20-01-2017>. Jia et al. (2021) searched eight KG-QA datasets for time-related questions and mapped them to Wikidata. Questions in each benchmark are tagged for temporal expressions using SUTime Chang and Manning and HeidelTime Strötgen and Gertz (2010), and for signal words using a dictionary compiled by Setzer (2001b); each question is also manually tagged with its temporal question category. In total, TimeQuestions comprises 16,859 questions.

CronQuestions

Saxena et al. (2021) utilizes a subset of Wikidata that includes facts annotated with temporal information Lacroix et al. (2020), such as <Barack Obama, held position, President of USA, 2008, 2016>. Entities extracted from Wikidata with both “start time” and “end time” annotations are transformed into event format (e.g., <WWII, significant event, occurred, 1939, 1945>). The dataset comprises a temporal KG with 125k entities and 328k facts (including 5k event facts), and 410k natural language questions requiring temporal reasoning.
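To make the interval-annotated fact format concrete, the sketch below stores the two example facts quoted above and answers a question with a year constraint by checking interval membership. The `TemporalFact` class and the `subjects_holding` helper are hypothetical names introduced for illustration and are not part of any dataset release.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TemporalFact:
    subject: str
    relation: str
    obj: str
    start: int  # first year in which the fact holds
    end: int    # last year in which the fact holds

    def holds_in(self, year: int) -> bool:
        """True if the fact's interval contains the given year."""
        return self.start <= year <= self.end

# The two example facts quoted in the text.
facts = [
    TemporalFact("Barack Obama", "held position", "President of USA", 2008, 2016),
    TemporalFact("WWII", "significant event", "occurred", 1939, 1945),
]

def subjects_holding(facts, relation, obj, year):
    """Subjects for which (subject, relation, obj) holds in the given year."""
    return [f.subject for f in facts
            if f.relation == relation and f.obj == obj and f.holds_in(year)]

answer = subjects_holding(facts, "held position", "President of USA", 2010)
```

A question such as "Who held the position of President of USA in 2010?" then reduces to a filter over facts whose interval contains 2010.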

Complex-CronQuestions

Chen et al. (2022) observe that existing benchmarks contain many pseudo-temporal questions. For instance, for the question “What’s the first award Carlo Taverna got?” there is only one fact related to Carlo Taverna in the TKG, which makes the temporal word “first” meaningless as a constraint. They remove all simple and pseudo-temporal questions from CronQuestions and filter out questions with fewer than five relevant facts.
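The filtering idea can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that a fact is "relevant" when its subject or object is an entity annotated in the question, and all facts other than the entity name from the example above are made up.

```python
def count_relevant_facts(question_entities, facts):
    """Count facts whose subject or object is an annotated question entity.

    Assumption: this is one plausible reading of "relevant facts".
    """
    entities = set(question_entities)
    return sum(1 for s, _r, o, _start, _end in facts
               if s in entities or o in entities)

def filter_questions(questions, facts, min_facts=5):
    """Keep questions backed by at least `min_facts` relevant facts."""
    return [q for q in questions
            if count_relevant_facts(q["entities"], facts) >= min_facts]

# Hypothetical toy TKG: one fact for Carlo Taverna, five for "Entity A".
facts = [("Carlo Taverna", "award received", "Award X", 1990, 1990)]
facts += [("Entity A", "award received", f"Award {i}", 2000 + i, 2000 + i)
          for i in range(5)]

questions = [
    {"text": "What's the first award Carlo Taverna got?", "entities": ["Carlo Taverna"]},
    {"text": "What's the first award Entity A got?", "entities": ["Entity A"]},
]
kept = filter_questions(questions, facts)
```

With a single supporting fact, the word "first" imposes no real constraint, so the Carlo Taverna question is dropped while the second question is kept.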

MultiTQ

Chen et al. (2023b) is a dataset derived from ICEWS05-15 García-Durán et al. (2018), where all facts are standardized as quadruples $(s, r, o, t)$. ICEWS05-15 is notable for its rich semantic information, with a higher average number of relation types per entity than other TKGs. The MultiTQ dataset features several advantages, including its large scale, ample relations, and multiple temporal granularities. ICEWS provides time information at day granularity, while the authors generate coarser granularities, such as month and year, for the questions. MultiTQ contains 500,000 questions, making it a significant resource for temporal question-answering research.
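The relationship between day-level facts and coarser question granularities can be sketched as below. The quadruples, entity names, and helper functions are hypothetical and only illustrate how a day-level timestamp can be mapped to a month or year before matching a question's time constraint.

```python
from datetime import date

def to_granularity(day, granularity):
    """Map a day-level timestamp to a day, month, or year label."""
    if granularity == "day":
        return f"{day.year:04d}-{day.month:02d}-{day.day:02d}"
    if granularity == "month":
        return f"{day.year:04d}-{day.month:02d}"
    if granularity == "year":
        return f"{day.year:04d}"
    raise ValueError(f"unknown granularity: {granularity}")

# Hypothetical quadruples (s, r, o, t) with day-level timestamps.
quadruples = [
    ("Entity A", "visit", "Entity B", date(2015, 6, 1)),
    ("Entity A", "visit", "Entity C", date(2015, 6, 20)),
    ("Entity A", "visit", "Entity D", date(2015, 9, 3)),
]

def objects_in_period(quadruples, subject, relation, period, granularity):
    """Objects of (subject, relation) facts whose timestamp falls in `period`."""
    return [o for s, r, o, t in quadruples
            if s == subject and r == relation
            and to_granularity(t, granularity) == period]

june_visits = objects_in_period(quadruples, "Entity A", "visit", "2015-06", "month")
```

The same facts therefore answer day-, month-, and year-level questions, depending on the granularity at which the timestamp is compared.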

Category | TempQuestions | TempQA-WD | TimeQuestions | CronQuestions | Complex-CronQuestions | MultiTQ
Question Content: Time Granularity (Year, Month, Day)
Question Content: Time Expression (Explicit, Implicit)
Question Content: Temporal Constraint (Overlap, Equal, Start/End, During/Include, Before/After, Ordinal)
Question Content: Temporal Constraint Composition (w/ Composition, w/o Composition)
Answer Type: Entity; Time (Year, Month, Day)
Complexity: Simple, Complex
Table 4: Question category coverage comparison across TKGQA datasets
Datasets | SP-based Top-1 Hits@1 | SP-based Top-1 F1 | SP-based Top-2 Hits@1 | SP-based Top-2 F1
TempQuestions Jia et al. (2018a) | 41.2 Ding et al. (2023) | 41.1 Ding et al. (2023) | 36.2 Jia et al. (2018b) | 37.5 Jia et al. (2018b)
TempQA-WD Neelam et al. (2021) | - | 41.6 Kannen et al. (2023) | - | 32.0 Neelam et al. (2021)
TimeQuestions Jia et al. (2021) | 56.5 Jia et al. (2021) | 52.7 Ding et al. (2023) | 53.9 Ding et al. (2023) | 49.9 Kannen et al. (2023)
CronQuestions Saxena et al. (2021) | 93.7 Chen et al. (2024) | 97.3 Chen et al. (2024) | 70.0 Chen et al. (2023a) | -
Complex-CronQuestions Chen et al. (2022) | - | - | - | -
MultiTQ Chen et al. (2023b) | 79.7 Chen et al. (2024) | 91.0 Chen et al. (2024) | 38.0 Chen et al. (2023a) | -
(a) Leaderboard for TKGQA datasets (SP-based).

Datasets | TKGE-based Top-1 Hits@1 | TKGE-based Top-1 Hits@10 | TKGE-based Top-2 Hits@1 | TKGE-based Top-2 Hits@10
TempQuestions Jia et al. (2018a) | - | - | - | -
TempQA-WD Neelam et al. (2021) | - | - | - | -
TimeQuestions Jia et al. (2021) | 62.8 Huang et al. (2024) | - | 60.5 Sharma et al. (2022) | -
CronQuestions Saxena et al. (2021) | 97.8 Gao et al. (2024) | 99.3 Chen et al. (2022) | 97.1 Xue et al. (2024) | 99.2 Xue et al. (2024); Zha et al. (2024)
Complex-CronQuestions Chen et al. (2022) | 92.0 Chen et al. (2022) | 98.6 Chen et al. (2022) | 79.2 Mavromatis et al. (2021) | 95.9 Mavromatis et al. (2021)
MultiTQ Chen et al. (2023b) | 29.3 Chen et al. (2023b) | 63.5 Chen et al. (2023b) | - | -
(b) Leaderboard for TKGQA datasets (TKGE-based).
Table 5: Leaderboard for TKGQA datasets.

A.2 Evaluation Metrics

Hits@k: This is the most widely used metric for the TKGQA task. TKGQA methods use Hits@1 (accuracy), Hits@3, Hits@5, and Hits@10 for evaluation. For each question, the metric is one if a correct answer appears in the first $k$ positions and zero otherwise; the reported score is the average over all questions.
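A minimal sketch of Hits@k is given below, assuming each prediction is a ranked list of answers and each gold label is a set of correct answers. The function names and toy data are illustrative only.

```python
def hits_at_k(ranked_answers, gold_answers, k):
    """1.0 if any of the top-k ranked answers is correct, else 0.0."""
    return 1.0 if any(a in gold_answers for a in ranked_answers[:k]) else 0.0

def mean_hits_at_k(all_ranked, all_gold, k):
    """Average Hits@k over a list of questions."""
    scores = [hits_at_k(r, g, k) for r, g in zip(all_ranked, all_gold)]
    return sum(scores) / len(scores)

# Toy predictions for two questions (made-up answers).
all_ranked = [["2016", "2008", "2012"], ["Entity B", "Entity A"]]
all_gold = [{"2008"}, {"Entity C"}]

hits1 = mean_hits_at_k(all_ranked, all_gold, k=1)
hits3 = mean_hits_at_k(all_ranked, all_gold, k=3)
```

In this toy example no question is answered correctly at rank 1, while the first question has a correct answer within the top 3.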

Precision, Recall and F1 score:

These metrics are widely used for the KBQA task. Precision is the ratio of correct predictions to all predicted answers. Recall is the ratio of correct predictions to all ground-truth answers. The F1 score is the harmonic mean of precision and recall.
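For set-valued answers, these quantities can be computed as in the sketch below, where F1 is taken as the harmonic mean of precision and recall (the standard definition). The function name and the toy answer sets are assumptions for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 (harmonic mean) for one question."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: three predicted answers, two gold answers, one overlap.
p, r, f1 = precision_recall_f1({"a", "b", "c"}, {"a", "d"})
```

Here precision is 1/3, recall is 1/2, and F1 is 0.4; dataset-level scores are usually obtained by averaging the per-question values.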

P@1:

Precision at the top rank is one if the highest ranked answer is correct and zero otherwise.

MRR:

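For each question, the reciprocal rank is the reciprocal of the first rank at which a correct answer appears, and zero if no correct answer appears in the ranked list. MRR (mean reciprocal rank) is the average of this value over all questions.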
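The rank-based metrics P@1 and MRR can be sketched as follows, again assuming ranked answer lists and gold answer sets. The function names and toy data are illustrative.

```python
def precision_at_1(ranked_answers, gold_answers):
    """1.0 if the top-ranked answer is correct, else 0.0."""
    return 1.0 if ranked_answers and ranked_answers[0] in gold_answers else 0.0

def reciprocal_rank(ranked_answers, gold_answers):
    """1 / rank of the first correct answer, or 0.0 if none is ranked."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer in gold_answers:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(all_ranked, all_gold):
    """Average reciprocal rank over a list of questions."""
    ranks = [reciprocal_rank(r, g) for r, g in zip(all_ranked, all_gold)]
    return sum(ranks) / len(ranks)

# Toy predictions: correct answer at rank 2, rank 1, and missing.
all_ranked = [["x", "a"], ["b", "y"], ["z"]]
all_gold = [{"a"}, {"b"}, {"c"}]
mrr = mean_reciprocal_rank(all_ranked, all_gold)
```

The three questions contribute reciprocal ranks of 1/2, 1, and 0, so the MRR is 0.5.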

Average number of reasoning steps:

ARI Chen et al. (2023a) uses this metric, defined as the number of reasoning steps taken per question, averaged over all test questions.

A.3 Leaderboard

Table 5 presents a leaderboard featuring the top-2 TKGQA models across all mentioned datasets. For semantic parsing-based methods, the widely used metrics are Hits@1 and F1. For TKG embedding-based methods, the commonly used metrics are Hits@1 and Hits@10.

Based on Table 5, we make the following observations: (1) Both SP-based and TKGE-based methods have been developed to address the TKGQA task, with no clear indication that either is superior. According to the Hits@1 results, TKGE-based methods perform better on CronQuestions and TimeQuestions, while SP-based methods excel on MultiTQ. (2) SP-based methods cover more benchmarks than TKGE-based methods. This may be attributed to the flexibility and expressiveness of logical forms, which allow SP-based methods to address a wider range of question types.