A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting

He Chang (Communication University of China), Chenchen Ye (University of California, Los Angeles), Zhulin Tao (Communication University of China), Jie Wu (Communication University of China), Zhengmao Yang (Zhejiang University), Yunshan Ma (National University of Singapore), Xianglin Huang (Communication University of China), and Tat-Seng Chua (National University of Singapore)
Abstract.

Recently, Large Language Models (LLMs) have demonstrated great potential in various data mining tasks, such as knowledge question answering, mathematical reasoning, and commonsense reasoning. However, the reasoning capability of LLMs on temporal event forecasting has been under-explored. To systematically investigate their abilities, we conduct a comprehensive evaluation of LLM-based methods for temporal event forecasting. Due to the lack of a high-quality dataset that involves both graph and textual data, we first construct a benchmark dataset, named MidEast-TE-mini. Based on this dataset, we design a series of baseline methods, characterized by various input formats and retrieval augmented generation (RAG) modules. From extensive experiments, we find that directly integrating raw texts into the input of LLMs does not enhance zero-shot extrapolation performance. In contrast, incorporating raw texts in specific complex events and fine-tuning LLMs significantly improves performance. Moreover, enhanced with retrieval modules, LLMs can effectively capture temporal relational patterns hidden in historical events. Meanwhile, issues such as popularity bias and the long-tail problem still persist in LLMs, particularly in the RAG-based method. These findings not only deepen our understanding of LLM-based event forecasting methods but also highlight several promising research directions. We believe that this comprehensive evaluation, along with the identified research opportunities, will significantly contribute to future research on temporal event forecasting through LLMs.

Keywords: Temporal Event Forecasting, Temporal Knowledge Graph, Large Language Model, Retrieval Augmented Generation
CCS Concepts: Computing methodologies → Temporal reasoning; Information systems → Specialized information retrieval

1. Introduction

Figure 1. Illustration of leveraging LLMs for temporal event forecasting. Given the complex event Israeli-Palestinian conflict, three formats of historical event representations, i.e., text (left side), graph (right side), or graph-text (both), can be fed into the LLMs, and the LLMs are expected to answer input questions about what will happen in the future.

Temporal event forecasting aims to predict future events based on observed historical events (Zhao, 2022). This is a fascinating task that paves the way for humans to master the operating rules of the world. It also has significant practical value, such as offering early warning of critical events like civil unrest or regional conflicts (Leetaru and Schrodt, 2013; O’brien, 2010; Deng et al., 2020). Given this value and impact, temporal event forecasting has garnered increasing interest in both academic and research communities, and various studies have been conducted in recent years (Cai et al., 2022; Li et al., 2021b; Park et al., 2022; Ma et al., 2023b; Jin et al., 2021; Lv et al., 2020).

Among these studies, the representative formulations of temporal events can be broadly categorized into three formats (Zhao, 2022): graph-based, text-based, and graph-text hybrid. Specifically, the graph-based approach represents each event, i.e., the so-called atomic event (Dettmers et al., 2018; Shang et al., 2019; Schlichtkrull et al., 2018), in a structured format, i.e., a quadruple $(s, r, o, t)$, where $s$, $r$, $o$, and $t$ correspond to the subject, relation (the type of the atomic event), object, and timestamp, respectively. As illustrated in Figure 1, each atomic event (e.g., (Palestine, use conventional military force, Israel, t-2)) is extracted from a textual document (e.g., "Palestinian militant groups led by Hamas launch over 3,500 rockets from the Gaza Strip into Israel."), and multiple related atomic events form a complex event (Li et al., 2021c; Ma et al., 2023a, b), such as the Israeli-Palestinian conflict. Following such a structured representation, which is also termed a Temporal Knowledge Graph (TKG), various methods (Jin et al., 2020; Li et al., 2021b; Park et al., 2022; Ma et al., 2023b) that target modeling the temporal and relational patterns have been proposed and have achieved remarkable progress (Cai et al., 2022). In contrast, text-based methods, such as event script prediction (Lv et al., 2020) or forecast question answering (ForecastQA) (Jin et al., 2021), are characterized by directly consuming textual representations, which encapsulate more fine-grained details and contexts. Such detailed and contextual information is often ignored by graph-based methods since it is either not included in the ontology or not extracted by the information extraction system. Consequently, text-based event forecasting (Jin et al., 2021; Lv et al., 2020) concentrates more on the capabilities of natural language understanding and reasoning.
The third branch, i.e., the graph-text hybrid method, aims to take advantage of the merits of both formats: structured reasoning and fine-grained information. Nevertheless, existing graph-text hybrid methods (Deng et al., 2020, 2021; Ma et al., 2023a) primarily treat text as side information to be integrated into the graph-based backbones, without truly conducting reasoning and forecasting on the texts themselves.

With the striking success of ChatGPT (https://chat.openai.com), LLMs have demonstrated impressive performance in various tasks (Kojima et al., 2022; Ouyang et al., 2022; Touvron et al., 2023), including event forecasting (Xu et al., 2023; Lee et al., 2023; Liao et al., 2023; Luo et al., 2024). These works pioneer the application of LLMs to event forecasting through in-context learning (ICL) (Lee et al., 2023), instruction tuning (Xu et al., 2023; Luo et al., 2024), and retrieval augmented generation (RAG) (Sun et al., 2023; Liao et al., 2023). However, on the one hand, most of these works focus only on the graph-based formulation (Liao et al., 2023; Luo et al., 2024) through discrete graph reasoning, while ignoring the broad contextual information retained by natural language. Integrating the raw texts, from which the structured events are extracted, into the LLM-based forecasting process is therefore a natural and rational direction. On the other hand, since event forecasting mostly concerns critical scenarios such as international relations and domestic stability, trustworthiness and explainability are essential, while solutions that rely solely on the internal knowledge of LLMs suffer from the problem of hallucination (Lee et al., 2023). Although RAG (Lewis et al., 2020) is a promising solution, the current graph-based RAG (Liao et al., 2023; Sun et al., 2023) is vulnerable to noisy events obtained from unreliable event extraction systems. Therefore, the performance of text-based or graph-text hybrid RAG on temporal event forecasting is worth further exploration.

Moreover, several previous works, such as SeCoGD [27], have identified that event forecasting suffers from severe popularity bias. Specifically, conventional methods harness historical events to provide additional contextual information for event prediction. LLMs, in contrast, can sometimes provide correct answers even without relevant historical events, since they implicitly store extensive knowledge in their parameters. As shown in Figure 1, although Iraq did not appear in previous knowledge graphs, the model’s understanding of the relationship between Iraq and Iran led to the correct answer. However, when the object to be predicted is an entity that appears for the first time or only rarely, such as Egypt in the figure, the performance of LLMs remains uncertain.

To fill this gap, we aim to conduct a comprehensive study to evaluate the performance of LLMs on temporal event forecasting. The main obstacle, however, lies in the lack of well-established benchmark datasets. Most existing works use graph-only datasets, e.g., GDELT (Leetaru and Schrodt, 2013) and ICEWS (O’brien, 2010), or text-only datasets, e.g., ForecastQA (Jin et al., 2021). There are some text-enriched event graph datasets (Ma et al., 2023a, b), which enrich an existing structured TKG dataset with the raw texts downloaded through the news article URLs offered by GDELT. Unfortunately, the structured events in these datasets are highly noisy, because they are either taken from the original GDELT dataset (Ma et al., 2023a) or extracted using poorly performing systems (Ma et al., 2023b), with only about 50% accuracy, as demonstrated in Section 2.2.

In this work, to conduct the evaluation, we first construct an exploratory benchmark dataset based on a previous dataset, MidEast-TE (Ma et al., 2023b), and name our dataset MidEast-TE-mini (MidEast-TE-m for short). Specifically, we sample a subset from MidEast-TE (Ma et al., 2023b) and extract structured events using the state-of-the-art (SOTA) LLM GPT-4 (https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo). Based on this dataset, we design a series of baseline methods to evaluate the performance of LLMs on temporal event forecasting. We are particularly interested in the following research questions: (1) How do LLMs perform on different formulations of event forecasting, including different input modalities and forecasting objectives, and how do different LLM backbones and fine-tuning affect the forecasting results? (2) How does RAG perform on event forecasting, and how do different retrieval models and settings affect the forecasting performance? (3) How do various methods perform w.r.t. popularity bias (long-tail) issues? To answer these research questions, we conduct extensive experiments to verify the functionality of every designed component. The experimental results unveil multiple interesting phenomena, which not only deepen our understanding of LLM-based event forecasting methods but also surface several interesting research directions. The main contributions of this work are as follows:

  • To the best of our knowledge, we are the first to systematically evaluate LLM-based methods on text-involved temporal event forecasting.

  • To facilitate the evaluation, we construct a benchmark dataset with both raw texts and high-quality structured events.

  • We design a list of baseline methods and LLM-based methods catering to multiple evaluation criteria. Extensive experiments have unveiled meaningful insights and led to several noteworthy and valuable directions for future research.

2. Preliminary

We first introduce the problem formulation of temporal event forecasting. Then, we present the dataset construction pipeline as well as the statistics and human evaluation results of our dataset.

2.1. Problem Formulation

Conventional temporal event forecasting methods define each event as a quadruple $(s, r, o, t)$, referred to as an atomic event. Following recent studies (Li et al., 2021c; Ma et al., 2023b), atomic events are grouped into complex events (CEs), each of which comprises a set of correlated atomic events and represents a major event that spans a longer time period and covers more entities. For example, all the events within the red dashed box in Figure 1 illustrate a complex event, specifically the Israeli-Palestinian conflict. Consequently, we define an atomic event at timestamp $t$ as $(s, r, o, t, c)$, where $s \in \mathcal{E}$, $r \in \mathcal{R}$, and $o \in \mathcal{E}$ represent the subject, relation, and object, respectively; $\mathcal{E}$ and $\mathcal{R}$ are the entity set and relation set, and $c$ denotes the complex event type. Notably, the timestamp $t$ is converted into a relative timestamp.
All the atomic events at the same timestamp $t$ form an event graph denoted as $G_t = \{(s_n, r_n, o_n, t, c_n)\}_{n=1}^{N}$, where $(s_n, r_n, o_n, t, c_n)$ denotes the $n$-th event and $N$ is the number of events.

In addition to the pure graph-based representation, we aim to incorporate raw news text into temporal event forecasting. Specifically, each event graph $G_t$ is extracted from a list of news documents $D_t = [d_1, d_2, \ldots, d_K]$, where $d_k \in \mathcal{D}$ denotes the $k$-th document at timestamp $t$, $K$ is the number of documents at that timestamp, and $\mathcal{D}$ denotes the document set. It is noteworthy that the same document $d$ can be involved in multiple events. Given the historical event graphs or news documents prior to $t$, denoted as $\mathbf{G}_{<t} = \{G_0, G_1, \ldots, G_{t-1}\}$ and $\mathbf{D}_{<t} = \{D_0, D_1, \ldots, D_{t-1}\}$, respectively, and a query, represented as $(s, r, t)$, temporal event forecasting aims to predict the missing object.
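To make the formulation concrete, the following minimal Python sketch represents atomic events $(s, r, o, t, c)$, groups them into per-timestamp event graphs $G_t$, and selects the history $\mathbf{G}_{<t}$ for a query $(s, r, t)$. The class and function names are illustrative assumptions, not part of the paper. Only the first toy event follows Figure 1; the other two are invented for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class AtomicEvent(NamedTuple):
    """One atomic event (s, r, o, t, c)."""
    subject: str
    relation: str
    obj: str
    timestamp: int      # relative timestamp
    complex_event: str  # complex event type c


class Query(NamedTuple):
    """A forecasting query (s, r, t); the object is what must be predicted."""
    subject: str
    relation: str
    timestamp: int


def build_event_graphs(events):
    """Group atomic events by timestamp, giving {t: G_t}."""
    graphs = defaultdict(list)
    for event in events:
        graphs[event.timestamp].append(event)
    return dict(graphs)


def history_before(graphs, t):
    """Return G_{<t}: all event graphs with a timestamp strictly before t."""
    return {ts: g for ts, g in sorted(graphs.items()) if ts < t}


# Toy data: the first event mirrors Figure 1, the rest are hypothetical.
CE = "Israeli-Palestinian conflict"
events = [
    AtomicEvent("Palestine", "Use conventional military force", "Israel", 0, CE),
    AtomicEvent("Israel", "Use conventional military force", "Palestine", 1, CE),
    AtomicEvent("Egypt", "Mediate", "Israel", 1, CE),
]
graphs = build_event_graphs(events)
query = Query("Egypt", "Mediate", 2)
history = history_before(graphs, query.timestamp)
```

Keeping the complex event label on every event lets the same list serve both the subject-centric and the complex-event-centric history construction described in Section 3.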

2.2. Dataset Construction

We first describe the data source that we utilized for curating the dataset. Then we present the dataset construction pipeline, followed by the human evaluation results.

2.2.1. Data Source

We construct our dataset based on a subset of MidEast-TE (Ma et al., 2023b); therefore, we name it MidEast-TE-mini, or MidEast-TE-m for short. The original MidEast-TE dataset includes both structured atomic events and news articles, which have already been clustered into complex events. Given the large scale of MidEast-TE, we sample a subset of complex events from it and keep the associated news articles as the original documents to build our dataset. Specifically, the time spans of the complex events in MidEast-TE vary widely, ranging from several days to more than three months. To reduce the potential bias introduced by outliers, we sample only 120 complex events whose time spans are within 40-60 days. We do not cover shorter or longer complex events in this exploratory dataset and leave them for future work.
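The sampling step above can be sketched as follows. This is a minimal illustration under stated assumptions: the time span of a complex event is taken as the difference between its last and first day, the 40-60 day bounds are treated as inclusive, and the function name, the input layout, and the random seed are hypothetical.

```python
import random
from collections import defaultdict


def sample_complex_events(events, min_span=40, max_span=60, n_samples=120, seed=0):
    """Keep complex events whose time span (in days) lies in [min_span, max_span],
    then sample at most n_samples of them.

    events: iterable of (complex_event_id, day) pairs.
    """
    days = defaultdict(list)
    for ce_id, day in events:
        days[ce_id].append(day)

    # Span is assumed to be last day minus first day of the complex event.
    eligible = sorted(
        ce_id for ce_id, ds in days.items()
        if min_span <= max(ds) - min(ds) <= max_span
    )
    rng = random.Random(seed)
    return rng.sample(eligible, min(n_samples, len(eligible)))


# Toy data: only "ce_a" (45 days) falls inside the 40-60 day window.
toy_events = [
    ("ce_a", 0), ("ce_a", 45),
    ("ce_b", 0), ("ce_b", 3),
    ("ce_c", 10), ("ce_c", 110),
]
sampled = sample_complex_events(toy_events)
```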

2.2.2. Construction Pipeline

The overall pipeline of the dataset construction involves two consecutive components: Event Extraction and Entity Linking.

Event Extraction. We instruct GPT-4 to perform the task of event extraction. Since the set of event types is quite large (>200), putting all the event definitions into prompts is costly and may decrease performance. Therefore, we follow the previous approach (Ma et al., 2023b) and conduct a multi-level event extraction that exploits the three-layer hierarchical structure of the CAMEO (Boschee et al., 2015) ontology. We include the description of each event type defined in CAMEO in the prompt. During the first-level extraction, we break down each article into multiple sentence-level chunks, each with around 150 tokens. Then, we extract the first-layer atomic events from each chunk. After recognizing atomic events based on first-level event types, for the second-level extraction, we group 15 atomic events with the same first-level event type and their respective chunks into a single prompt. This approach aims to share the prompt as much as possible while maintaining high-quality extraction. The same procedure is then applied to the third-level extraction.
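The chunking and batching logic of this multi-level extraction can be sketched as below. This is a rough illustration, not the authors' implementation: tokens are approximated by whitespace-separated words, the function names and the input layout are hypothetical, and the actual GPT-4 calls are omitted.

```python
from collections import defaultdict


def split_into_chunks(sentences, max_tokens=150):
    """Greedily pack whole sentences into chunks of at most ~max_tokens tokens.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


def batch_by_first_level_type(extractions, batch_size=15):
    """Group first-level extractions by event type, then cut each group into
    batches of at most batch_size so that one prompt serves a whole batch.

    extractions: list of (first_level_type, event, chunk) triples.
    """
    by_type = defaultdict(list)
    for event_type, event, chunk in extractions:
        by_type[event_type].append((event, chunk))

    batches = []
    for event_type, items in by_type.items():
        for start in range(0, len(items), batch_size):
            batches.append((event_type, items[start:start + batch_size]))
    return batches


# Toy usage with placeholder sentences and illustrative type labels.
toy_sentences = ["a " * 100, "b " * 100, "c " * 40]
toy_chunks = split_into_chunks(toy_sentences)
toy_extractions = [("FIGHT", f"event_{i}", "chunk") for i in range(20)]
toy_extractions += [("CONSULT", f"event_{i}", "chunk") for i in range(3)]
toy_batches = batch_by_first_level_type(toy_extractions)
```

Each batch would then be placed in a single second-level prompt together with the definitions of the sub-types of its first-level type, which keeps the prompt short compared with listing all event types at once.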

Entity Linking. Because our event extraction process has no predefined entity set, the same entity often appears repeatedly in diverse surface forms. We therefore add an entity linking step using GPT-4. We first apply K-means to cluster all original entities into multiple groups. Then, we batch the entities from each cluster and ask GPT-4 to link them.
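The cluster-then-batch idea can be illustrated with the sketch below. The paper does not state which entity features K-means operates on, so the character-bigram vectors here are purely an assumption chosen to keep the example dependency-free. The batch size and all function names are likewise hypothetical, and the GPT-4 linking call itself is left out.

```python
import random
from collections import Counter, defaultdict


def char_bigram_vector(name):
    """Toy embedding (an assumption): counts of lowercase character bigrams."""
    text = name.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors stored as dicts."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


def mean_vector(vectors):
    """Component-wise mean of sparse vectors."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {k: count / len(vectors) for k, count in total.items()}


def kmeans(names, k, n_iters=10, seed=0):
    """Plain Lloyd's k-means; returns {cluster_id: [entity names]}."""
    vectors = [char_bigram_vector(name) for name in names]
    centroids = [dict(v) for v in random.Random(seed).sample(vectors, k)]
    assignment = [0] * len(names)
    for _ in range(n_iters):
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    clusters = defaultdict(list)
    for name, a in zip(names, assignment):
        clusters[a].append(name)
    return dict(clusters)


def batch_clusters(clusters, batch_size=20):
    """Cut each cluster into batches, one batch per entity-linking prompt."""
    return [
        members[i:i + batch_size]
        for members in clusters.values()
        for i in range(0, len(members), batch_size)
    ]


names = ["Israel", "State of Israel", "Israeli government",
         "Egypt", "Egyptian FM", "Arab Republic of Egypt"]
clusters = kmeans(names, k=2)
batches = batch_clusters(clusters, batch_size=2)
```

Clustering first means that each prompt only contains entities that are plausible duplicates of one another, so the LLM never has to compare every entity against every other one.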

2.2.3. Human Evaluation

Table 1 details the statistics of our curated dataset MidEast-TE-m, where #atomic events, #CEs, and #docs represent the number of atomic events, complex events, and news documents, respectively. Additionally, $|\mathcal{E}|$ and $|\mathcal{R}|$ refer to the number of entities and relations. We split the dataset into train, validation, and test sets in a temporal manner. Specifically, we use the atomic events in the last year for testing, those in the second-to-last year for validation, and those in the remaining roughly five years for training. Note that the number of atomic events differs from the number of documents, as a single news document may correspond to multiple events.

Table 1. Statistics of our curated dataset MidEast-TE-m
Dataset #atomic events #CEs $|\mathcal{E}|$ $|\mathcal{R}|$ #docs
train 8,999 88 4,529 254 2,647
val 1,777 19 989 191 473
test 1,766 18 1,146 198 572
total 12,542 120 5,909 267 3,692
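The temporal split described above can be sketched as follows. The sketch assumes that relative timestamps count days and that a year is 365 days; both are assumptions for illustration, as are the function name and the toy events.

```python
def temporal_split(events, val_start, test_start):
    """Split events chronologically by timestamp.

    events: iterable of (event, timestamp) pairs.
    val_start / test_start: first timestamp of the validation / test period.
    """
    train, val, test = [], [], []
    for event, t in events:
        if t >= test_start:
            test.append(event)
        elif t >= val_start:
            val.append(event)
        else:
            train.append(event)
    return train, val, test


DAYS_PER_YEAR = 365  # assumption: relative timestamps are measured in days

toy = [("e0", 0), ("e1", 1500), ("e2", 1900), ("e3", 2300)]
last_day = max(t for _, t in toy)
test_start = last_day - DAYS_PER_YEAR + 1   # last year -> test
val_start = test_start - DAYS_PER_YEAR      # second-to-last year -> validation
train, val, test = temporal_split(toy, val_start, test_start)
```

Splitting by time rather than at random keeps every test event strictly later than the training events, which matches the extrapolation setting of forecasting.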

In order to evaluate the quality of the benchmark dataset, we conduct human evaluation. Specifically, given a news article and the extracted events, we ask the human evaluator to tell whether the extracted events are valid or not based on the original article. We randomly sample 20 documents from the document set of the dataset, and conduct human evaluation based on the following criteria:

  • Time: the atomic events extracted are events that have already occurred or are currently happening, rather than future events.

  • Relation: the extracted atomic events appear in the original text and faithfully reflect the semantics of the original text.

  • Entity: the extracted entities are concrete and real-world entities, and they appear in the original news article.

Note that we evaluate every extracted atomic event once, going through the three criteria sequentially. For example, if we find that the Time is incorrect, we skip the remaining two criteria. Table 2 shows the accuracy and error types of the event extraction results for our dataset MidEast-TE-m and for the previous datasets MidEast-TE (Ma et al., 2023b) and GDELT-TE (Leetaru and Schrodt, 2013). MidEast-TE is constructed using Vicuna-7b (Chiang et al., 2023), while GDELT-TE is curated using a proprietary extraction system. Clearly, the dataset constructed by our pipeline is the most accurate one: built with GPT-4, it has the highest extraction accuracy, which makes the subsequent forecasting research more valid and reliable. However, we admit that an overall accuracy of 72.6% is still unsatisfactory in practice, and further efforts will be devoted to improving event extraction performance.

Table 2. Error analysis of the event extraction results in different datasets.

Dataset #atomic events Acc.(%) error type (%): time relation entity
GDELT-TE (Leetaru and Schrodt, 2013) 148 29.73 3.85 31.73 64.42
MidEast-TE (Ma et al., 2023b) 35 55.56 0 92.86 7.14
MidEast-TE-m (ours) 78 72.60 16.67 27.78 55.56
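The sequential checking protocol can be sketched as below. Since each row of error-type percentages in Table 2 sums to roughly 100, the sketch assumes that these percentages are shares among the erroneous events. The data layout, the function names, and the toy judgements are hypothetical and do not reproduce the actual annotations.

```python
from collections import Counter

CRITERIA = ("time", "relation", "entity")


def first_failed_criterion(judgement):
    """Return the first criterion judged incorrect, or None if the event is valid.

    judgement: dict mapping each criterion name to True (passes) or False (fails).
    Checking stops at the first failure, so later criteria are never counted.
    """
    for criterion in CRITERIA:
        if not judgement[criterion]:
            return criterion
    return None


def summarize(judgements):
    """Compute accuracy (%) and the share (%) of each error type among errors."""
    failures = [first_failed_criterion(j) for j in judgements]
    errors = Counter(f for f in failures if f is not None)
    n_errors = sum(errors.values())
    accuracy = 100.0 * (len(failures) - n_errors) / len(failures)
    shares = {
        c: (100.0 * errors[c] / n_errors if n_errors else 0.0) for c in CRITERIA
    }
    return accuracy, shares


# Toy annotations: two valid events, one time error, one entity error.
toy_judgements = [
    {"time": True, "relation": True, "entity": True},
    {"time": True, "relation": True, "entity": True},
    {"time": False, "relation": True, "entity": False},  # counted as a time error only
    {"time": True, "relation": True, "entity": False},
]
accuracy, shares = summarize(toy_judgements)
```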

3. Methods

Figure 2. Illustration of rule-based history and retrieved history. The rule-based history is constructed by a set of predefined rules. In contrast, the retrieved history dynamically searches context from the temporal knowledge graph or news documents according to the current query.
Table 3. Input format of history
Graph-only (Pakistan Foreign Ministry, Cooperate economically, Egyptian counterpart officials, 2190); (Shah Mahmood Qureshi, Express intent to provide economic aid, Egypt FM Shoukry, 2190)
Text-only [Date]2190: Pakistani Foreign Minister Shah Mahmood Qureshi expressed great compatibility in visions between Cairo and Islamabad, highlighting the potential for cooperation in eradicating terrorism and intolerance during his visit to Egypt.
Mixed [Date]2189: Foreign Minister Shah Mahmood Qureshi visited Egypt for a two-day official visit to discuss bilateral relations and economic cooperation with his Egyptian counterpart, Sameh Hassan Shoukry, and to promote trade relations with Africa. (Shah Mahmood Qureshi, Make optimistic comment, Pakistan); (Egyptian Ambassador, Host a visit, Chief of Army Staff (COAS) General Qamar Javed Bajwa)

We systematically design a series of LLM-based methods for event forecasting, categorized into three streams: pure graph-based, pure text-based, and mixed graph-text methods. Notably, we only introduce methods that employ LLMs here, while non-LLM baselines are described in Section 4.1. Additionally, we develop two methods for constructing event history to assist LLM reasoning in temporal event forecasting, as illustrated in Figure 2: rule-based history and retrieved history.

3.1. Graph-only Methods

Graph-only methods utilize discrete event graphs as inputs, leveraging the inherent domain knowledge of LLMs to reason over historical events and subsequently make forecasts. To capitalize on contextual information and historical developments related to the current query event, we develop two methods for extracting pertinent historical events from event graphs. The input format of the graph is shown in Table 3.

Rule-based History. Typically, the causes of an event can be categorized into internal and external reasons. Internal reasons pertain to factors inherent to the subject of the current event, while external reasons originate from the surrounding environment. Similarly, in temporal event forecasting, both internal and external reasons must be considered. On the one hand, we reconstruct the event graph $\mathbf{G}_{<t}^{q} = \{G_0^{q}, G_1^{q}, \ldots, G_{t-1}^{q}\}$ from a global view to provide historical patterns and trends of the current event, where $G_{t'}^{q}$ represents historical events at timestamp $t'$ with the same subject as the query. On the other hand, we extract the graph from a local view to offer background information or specific situations for LLM reasoning.
Formally, we denote the complex event graph as $\mathbf{G}_{<t}^{c} = \{G_0^{c}, G_1^{c}, \ldots, G_{t-1}^{c}\}$, where $G_{t'}^{c}$ represents historical events at timestamp $t'$ within the same complex event. Because the background information of the current event is specific to its stage, we only retain historical events from the most recent two days in relative time.
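The two views can be sketched in a few lines. The sketch makes two assumptions that the text leaves open: the two-day window is applied to the local (complex event) view only, and "most recent two days" means the relative timestamps $t-2$ and $t-1$. Function names and toy events are hypothetical.

```python
def global_history(events, query_subject, query_t):
    """G^q_{<t}: past events that share the query subject."""
    return [e for e in events if e[3] < query_t and e[0] == query_subject]


def local_history(events, query_ce, query_t, window=2):
    """G^c_{<t}: past events in the same complex event, limited to the
    `window` most recent days before the query (an assumed reading)."""
    return [
        e for e in events
        if e[4] == query_ce and query_t - window <= e[3] < query_t
    ]


def rule_based_history(events, query_subject, query_ce, query_t, window=2):
    """Union of both views, de-duplicated and sorted chronologically."""
    merged = set(global_history(events, query_subject, query_t))
    merged |= set(local_history(events, query_ce, query_t, window))
    return sorted(merged, key=lambda e: (e[3], e))


# Toy events as (s, r, o, t, c) tuples; all values are illustrative.
toy_events = [
    ("Egypt", "Host a visit", "Pakistan", 5, "ce1"),   # same subject -> global view
    ("Israel", "Make statement", "Egypt", 8, "ce1"),   # same CE, recent -> local view
    ("Israel", "Make statement", "Iran", 3, "ce1"),    # same CE but too old
    ("Egypt", "Consult", "Israel", 9, "ce2"),          # same subject -> global view
    ("Iran", "Threaten", "Israel", 9, "ce2"),          # unrelated to the query
]
history = rule_based_history(toy_events, "Egypt", "ce1", query_t=10)
```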

Retrieved History. Historical event graphs serve as rich sources of contextual information. However, because temporal event graphs are intricate, the existing graph still contains substantial noise. Therefore, unlike rule-based history, we retrieve relevant entities from the graph to obtain the corresponding historical events. Specifically, we first extract entities from the global-view historical events that share the query's subject. LLMs are then employed to eliminate irrelevant entities. The prompt for graph retrieval is given in Appendix A.1. After identifying the pertinent entities $\mathbf{E}_{<t} = \{E_0, E_1, \ldots, E_{t-1}\}$, we reconstruct the history based on the events involving these entities.
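A rough sketch of this retrieve-then-reconstruct procedure follows. The LLM-based relevance filter is replaced by a caller-supplied function `is_relevant`, which is a hypothetical stand-in for the prompt in Appendix A.1. Treating an event as "involving" an entity when the entity is its subject or object is also an assumption.

```python
def retrieve_graph_history(events, query_subject, query_relation, query_t, is_relevant):
    """Build the retrieved history for a query (s, r, t).

    events: (s, r, o, t, c) tuples.
    is_relevant: callable(entity, subject, relation) -> bool, standing in
                 for the LLM that removes irrelevant entities.
    """
    past = [e for e in events if e[3] < query_t]
    # Step 1: candidate entities co-occurring with the query subject.
    candidates = {e[2] for e in past if e[0] == query_subject}
    # Step 2: filter candidates (an LLM call in the described method).
    kept = {ent for ent in candidates if is_relevant(ent, query_subject, query_relation)}
    # Step 3: rebuild the history from events involving the kept entities.
    history = [e for e in past if e[0] in kept or e[2] in kept]
    return sorted(history, key=lambda e: (e[3], e))


# Toy events and a trivial stand-in filter; all values are illustrative.
toy_events = [
    ("Egypt", "Host a visit", "Pakistan", 5, "ce1"),
    ("Egypt", "Criticize", "Weather Agency", 6, "ce1"),
    ("Pakistan", "Express intent to cooperate", "Egypt", 7, "ce1"),
    ("Iran", "Threaten", "Israel", 7, "ce2"),
    ("Pakistan", "Sign agreement", "Egypt", 12, "ce1"),  # after the query time
]


def toy_filter(entity, subject, relation):
    return entity != "Weather Agency"


retrieved = retrieve_graph_history(
    toy_events, "Egypt", "Cooperate economically", 10, toy_filter
)
```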

3.2. Text-only Methods

Text-only methods utilize raw textual data for forecasting. Table 3 illustrates the input format of the text. Each atomic event originates from a specific news document. Similar to graph-only methods, relevant event history is obtained through rules or retrieval.

Rule-based History. Generally, news documents provide more granular contextual information for temporal event forecasting, so exploring textual information relevant to the current event is a critical step. First, we identify the historical events from the global-view and local-view event graphs and find the corresponding news document set $\mathbf{D}_{<t}^{p} = \{D_0^{p}, D_1^{p}, \ldots, D_{t-1}^{p}\}$ through the links between documents and events. Considering that news documents usually contain a large amount of description irrelevant to the query event, we utilize an LLM to summarize the news document set, providing concise summaries $\mathbf{S}_{<t}^{p} = \{S_0^{p}, S_1^{p}, \ldots, S_{t-1}^{p}\}$ that retain only the core thematic information of the news. The prompt for news text summarization is given in Appendix A.1.
Unlike methods that rely solely on temporal event graphs, the text-only method aims to predict the missing objects based on historical news summaries.

Retrieved History. Inspired by recent research on RAG (Sun et al., 2023; Liao et al., 2023), we utilize embeddings from an LLM to directly retrieve relevant news text from the set of historical news documents $\mathbf{D}_{<t}^{c} = \{D_0^{c}, D_1^{c}, \ldots, D_{t-1}^{c}\}$ in the same complex event. Following this, we sort the retrieved news texts by time and filter out historical texts that are excessively outdated relative to the event timestamp. Similarly to rule-based history, we employ an LLM to generate concise summaries $\mathbf{S}_{<t}^{c} = \{S_0^{c}, S_1^{c}, \ldots, S_{t-1}^{c}\}$ for the filtered news text set, which serve as contextual information for predicting the missing object in the current event.
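The retrieval step can be sketched as an embedding similarity search followed by a recency filter. The paper does not specify the similarity measure, the number of retrieved documents, or the cutoff for "excessively outdated", so cosine similarity, `top_k`, and `max_age` below are assumptions. The hand-written vectors stand in for LLM embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_documents(query_vec, docs, query_t, top_k=3, max_age=30):
    """Retrieve past documents for a query.

    docs: list of (doc_id, timestamp, embedding) from the same complex event.
    Ranks past documents by similarity, keeps the top_k, drops those older
    than max_age days, and returns the rest in chronological order.
    """
    past = [d for d in docs if d[1] < query_t]
    ranked = sorted(past, key=lambda d: cosine(query_vec, d[2]), reverse=True)[:top_k]
    recent = [d for d in ranked if query_t - d[1] <= max_age]
    return sorted(recent, key=lambda d: d[1])


# Toy documents with 2-dimensional placeholder embeddings.
toy_docs = [
    ("d1", 1, [1.0, 0.0]),    # similar but very old
    ("d2", 50, [0.9, 0.1]),
    ("d3", 55, [0.0, 1.0]),   # dissimilar
    ("d4", 58, [0.8, 0.2]),
    ("d5", 70, [1.0, 0.0]),   # after the query time
]
retrieved_docs = retrieve_documents([1.0, 0.0], toy_docs, query_t=60)
retrieved_ids = [doc_id for doc_id, _, _ in retrieved_docs]
```

The retained documents would then be summarized by the LLM before being placed in the forecasting prompt.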

3.3. Graph-and-Text (Mixed) Methods

Graph-and-text methods leverage both graph and text data as input. The input format of mixed methods is shown in Table 3. Unlike the above two streams, the graph-and-text method combines event graphs and news documents, with the event graph supplying structured information and the news documents providing fine-grained background information. The two specific implementations of this method are as follows:

Rule-based History. The rule-based history with graph-and-text integrates graph-only history with text-only history to form a comprehensive graph-and-text history. The input of LLM includes not only historical events but also their corresponding news summaries. The graph history supplies pertinent temporal event graph structure details. Concurrently, the text history furnishes detailed background information in a granular manner.

Retrieved History. Rather than directly concatenating graph-only history with text-only history, the retrieval-based method first obtains the relevant entity set from the historical event graph. Compared to the text-only retrieval method, the mixed retrieval method retrieves news text based not only on the subject of the query event but also on the relevant entity set, thus augmenting the text exploration with additional contextual information. The retrieved news text, in turn, better assists the LLM in conducting structured reasoning.

4. Experiments

Implementing these methods, we are particularly interested in answering the following research questions:

  • RQ1: What is the overall performance of LLM-based methods, with various input modalities and forecasting objectives?

  • RQ2: Is RAG helpful and how do various retrieval-relevant settings affect the performance?

  • RQ3: How does the popularity bias (long-tail) issue affect the forecasting performance?

4.1. Experimental Settings

To evaluate various methods, we conduct experiments on our proposed dataset MidEast-TE-mini, as described in Section 2.2. We follow previous approaches and use Accuracy (Acc) as the evaluation metric. For non-zero-shot methods, we train the models on the training set, select the best-performing model on the validation set, and report its performance on the testing set. For zero-shot methods, we directly evaluate them on the testing set.
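
For reference, accuracy in the multiple-choice setting can be written as below. The notation is introduced here purely for illustration and is an assumption, not taken from the original text: $\mathcal{Q}$ denotes the set of test queries, $o_q$ the ground-truth answer of query $q$, and $\hat{o}_q$ the answer selected by the model.

```latex
\mathrm{Acc} = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \mathbb{1}\left[\hat{o}_q = o_q\right]
```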

4.1.1. Compared Methods

In addition to LLM-based methods, we implement a list of representative graph-based methods, described below:

  • ConvTransE (Shang et al., 2019): it is a static knowledge graph (KG) representation learning method, which uses both a convolutional neural network and a translational operation to capture the patterns within triplets.

  • RGCN (Schlichtkrull et al., 2018): this is also a static KG representation learning model, which leverages graph convolutional neural network to model various relations among entities.

  • RE-GCN (Li et al., 2021b): RE-GCN is a SOTA method for temporal knowledge graph (TKG) reasoning, which leverages GNN and RNN to capture relational and temporal patterns, respectively.

  • CMF (Deng et al., 2021): CMF proposes a context-based feature fusion method, which integrates multilevel features including event frequency, news documents, and event graphs.

  • LoGo (Ma et al., 2023b): LoGo is the SOTA method for temporal complex event (TCE) forecasting, which employs two RT-GNN modules to model both local (within complex event) and global (across all complex events) contexts. Note that only LoGo leverages the complex event information, while the above methods do not use it.

  • ForecastQA (Jin et al., 2021): ForecastQA simulates the forecasting scenario on temporal news documents and designs a method based on pretrained language models to make a forecasting judgment with the retrieved documents.

  • CoH (Luo et al., 2024): CoH is an LLM-based method that constructs the event history with designed rules and employs fine-tuning to understand textual graph information.

  • GenTKG (Liao et al., 2023): this is a retrieval-augmented generation framework that incorporates temporal logical rules into the retrieval process. Additionally, fine-tuning is employed to adapt LLM for the task of temporal event forecasting.

Table 4. Performance (Accuracy) comparison of LLM-based methods and non-LLM methods. "N/A" stands for "Not Applicable".
| Model Type | Model | Backbone | Entity: Graph-only | Entity: Text-only | Entity: Mixed | Relation: Graph-only | Relation: Text-only | Relation: Mixed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Non-LLM | ConvTransE (Shang et al., 2019) | GNN | 0.3737 | N/A | N/A | 0.7327 | N/A | N/A |
| Non-LLM | RGCN (Schlichtkrull et al., 2018) | GNN | 0.3777 | N/A | N/A | 0.7203 | N/A | N/A |
| Non-LLM | RE-GCN (Li et al., 2021b) | GNN | 0.3879 | N/A | N/A | 0.7333 | N/A | N/A |
| Non-LLM | CMF (Deng et al., 2021) | GNN | N/A | N/A | 0.3783 | N/A | N/A | 0.7265 |
| Non-LLM | LoGo (Ma et al., 2023b) | GNN | 0.3969 | N/A | N/A | 0.7406 | N/A | N/A |
| Non-LLM | ForecastQA (Jin et al., 2021) | BERT | N/A | 0.3901 | N/A | N/A | 0.7389 | N/A |
| Zero-shot | rule-based history | gpt-3.5-turbo | 0.3290 | 0.3233 | 0.3154 | 0.5515 | 0.4807 | 0.5464 |
| Zero-shot | retrieved history | gpt-3.5-turbo | 0.3533 | 0.3228 | 0.3539 | 0.5498 | 0.4637 | 0.5402 |
| Fine-tune | CoH (Luo et al., 2024) | llama2-7b | 0.4856 | N/A | N/A | 0.7763 | N/A | N/A |
| Fine-tune | GenTKG (Liao et al., 2023) | llama2-7b | 0.4785 | N/A | N/A | 0.7907 | N/A | N/A |
| Fine-tune | rule-based history | vicuna-7b | 0.5271 | 0.5266 | 0.5798 | 0.8023 | 0.8103 | 0.8058 |

4.1.2. Implementation Details

Owing to the large entity set and the limited context length of LLMs, we simplify the event forecasting setting into a Multiple-Choice Question (MCQ) problem and adopt negative sampling strategies to construct candidate sets. Specifically, we randomly sample two entities from $G_{t-1}^{c}$, two entities from $G_{t-2}^{c}$, and one entity from $\mathbf{G}^{q}$. As a result, the size of the object candidate set is 6 (the five sampled negatives plus the ground-truth object). For the LLM-based methods in this paper, we employ a variety of LLMs, including Llama2-7b (https://huggingface.co/meta-llama/Llama-2-7b-chat), Vicuna-7b (https://huggingface.co/lmsys/vicuna-7b-v1.5), and gpt-3.5-turbo (https://platform.openai.com/docs/models/gpt-3-5-turbo). We set the temperature to 0 and the seed parameter to a fixed integer for reproducibility. The maximum output length is set to 256 tokens to prevent invalid responses.
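
The candidate-set construction can be sketched as below. This is a hedged illustration rather than the authors' code: the function name `build_candidates` and its parameters are hypothetical, and we assume that sampled negatives exclude the ground-truth object and do not repeat, which the text does not state explicitly.

```python
import random


def build_candidates(truth, prev1_entities, prev2_entities, gq_entities, seed=0):
    """Return a shuffled 6-way candidate list: 5 sampled negatives plus `truth`."""
    rng = random.Random(seed)  # fixed seed for reproducibility

    def sample(pool, k, taken):
        # Assumption: exclude the ground truth and already-sampled entities.
        options = sorted(set(pool) - taken)
        picked = rng.sample(options, k)  # raises ValueError if the pool is too small
        taken.update(picked)
        return picked

    taken = {truth}
    negatives = (
        sample(prev1_entities, 2, taken)   # two entities from G_{t-1}^c
        + sample(prev2_entities, 2, taken) # two entities from G_{t-2}^c
        + sample(gq_entities, 1, taken)    # one entity from G^q
    )
    candidates = negatives + [truth]
    rng.shuffle(candidates)  # avoid a fixed position for the correct answer
    return candidates
```

Shuffling the final list is a design choice made here so that the position of the correct option carries no signal.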

For the rule-based method, we generate summaries for all documents in advance with Vicuna-7b. Notably, we further apply fine-tuning (instruction tuning with QLoRA (Dettmers et al., 2023)) to these open-source LLMs. Since the retrieved history directly retrieves news text, we utilize gpt-3.5-turbo to generate summaries for the retrieved news text set and keep the retrieval models fixed without tuning. The retrieval models include three different retrievers: BM25 (Robertson et al., 2009), Contriever (Izacard et al., 2021), and LlamaIndex (Liu, 2022). To ensure fairness, for all the baselines and LLM-based methods, we search the historical length over $\{3,7,15,30,90\}$ days. Moreover, considering the limitation of the context window, the maximum numbers of events and texts in the history are set to 20 and 5, respectively.

4.2. Performance Comparison (RQ1)

We aim to compare the performance between various non-LLM methods and LLM-based methods, in terms of different input and output settings, as well as various negative sampling strategies.

4.2.1. Performance w.r.t. Various Inputs and Outputs.

Table 4 presents the overall performance comparison on the tasks of object entity prediction and relation prediction. We make the following observations. First, the LLM-based method proposed in this paper achieves the best performance in the fine-tuning setting. These preliminary results imply that fine-tuning LLMs has great potential for event forecasting, and more sophisticated fine-tuning is worth further exploration. Compared to the other fine-tuned LLM-based methods, such as CoH and GenTKG, the introduction of complex events and news text provides more fine-grained information for temporal event forecasting. Moreover, enhanced with the retrieval modules, LLM-based methods in the zero-shot setting achieve reasonable performance on both object entity and relation prediction. Nevertheless, they still underperform conventional non-LLM methods on all three input formats, i.e., Graph-only, Text-only, and Mixed, demonstrating that non-LLM methods remain competitive.

Second, surprisingly, incorporating raw text into the input does not have a positive effect in the zero-shot setting: the performance in the text-only setting is generally worse than in the graph-only and mixed settings. This may be because the raw textual inputs are longer than the graph-only inputs, which makes reasoning more difficult.

Third, comparing the two forecasting tasks, the absolute performance of relation prediction is higher than that of object entity prediction by a clear margin, indicating that forecasting entities is more challenging than forecasting relations in the domain of international cooperation and conflict events. There are two possible reasons. (i) The entity set is much larger than the relation set; therefore, predicting an entity is more difficult than predicting a relation, given the larger candidate set. (ii) When predicting relations, the input question includes both the subject and the object entity, whereas the subject entity and the relation are the input when predicting the object entity. We deem that entities carry more information than relations in event forecasting, so the relation prediction question offers more information and is easier.

4.2.2. Performance w.r.t. Backbone LLMs

The capability of the backbone LLM plays a crucial role in various tasks. In addition to commercial proprietary LLMs, we are also interested in the performance of open-source LLMs. Open-source LLMs can be adapted and re-configured and, more importantly, are safer than commercial LLMs with respect to privacy. Hence, we choose two popular open-source LLMs, i.e., LLaMA2-7b and Vicuna-7b, to test. The results are presented in Table 5. Both LLaMA2-7b and Vicuna-7b perform significantly worse than gpt-3.5-turbo in the zero-shot setting, owing to the inherent capacity gap between smaller open-source models and powerful commercial models. Nevertheless, under the fine-tuning setting, all the open-source models exhibit significant performance gains. They are not only better than their zero-shot counterparts but also outperform gpt-3.5-turbo by a large margin. This phenomenon indicates that supervised instruction fine-tuning is crucial for enhancing the ability of LLMs to learn temporal relational patterns. Additionally, the fine-tuned Vicuna-7b outperforms the fine-tuned LLaMA2-7b, implying that more advanced LLM capabilities are more effective at capturing evolutionary patterns in historical events.

Table 5. Performance comparison w.r.t. different backbone LLMs and fine-tuning. "N/A" means "Not Applicable".
| Setting | Model | LLaMA2-7b | Vicuna-7b | gpt-3.5-turbo |
| --- | --- | --- | --- | --- |
| Zero-shot | without history | 0.1976 | 0.1812 | 0.2746 |
| Zero-shot | rule-based history graph | 0.2225 | 0.2027 | 0.3290 |
| Zero-shot | rule-based history text | 0.2639 | 0.2384 | 0.3233 |
| Zero-shot | rule-based history Mix | 0.2157 | 0.2524 | 0.3154 |
| Fine-tune | rule-based history graph | 0.5057 | 0.5271 | N/A |
| Fine-tune | rule-based history text | 0.4892 | 0.5266 | N/A |
| Fine-tune | rule-based history Mix | 0.4977 | 0.5798 | N/A |

4.3. Effects of RAG (RQ2)

We conduct an investigation into the characteristics of retrieval in temporal event forecasting from multiple perspectives, including retrieval model, retrieval scope, and historical length.

4.3.1. Performance w.r.t. Various Retrieval Models

Different retrieval models may affect the performance. To study this, for the text-involved settings, i.e., Text-only and Mixed, we try three retrieval models: the classical BM25 and the more recent Contriever and LlamaIndex. From the results in Table 7, we observe that LlamaIndex > Contriever > BM25 in terms of forecasting accuracy in both settings. These results support our hypothesis that a stronger retriever yields better forecasting performance, and suggest that incorporating more powerful retrievers, or even designing event-forecasting-oriented retrievers, is a promising direction. For graph-only methods, incorporating graph retrieval models into LLMs remains under-explored and deserves extensive study in the future.

4.3.2. Performance w.r.t. Various Retrieval Scopes

Table 6. Performance comparison w.r.t. applying different retrieval scopes. "N/A" means "Not Applicable". "*" indicates no historical input, which falls under the graph-only setting.
| Model (gpt-3.5-turbo) | Graph-only | Text-only | Mixed |
| --- | --- | --- | --- |
| without history | 0.2746* | N/A | N/A |
| global history | 0.3403 | 0.2429 | 0.3267 |
| complex event | 0.3533 | 0.3228 | 0.3539 |

To comprehensively examine the influence of retrieval scope on temporal event forecasting, we compare the RAG model under three retrieval options: without history (i.e., without using retrieval), global retrieval (i.e., retrieving from the global context that includes all complex events), and complex event (i.e., retrieving from within the query's complex event). From the results in Table 6, we find that regardless of whether retrieval is performed on graphs, text, or a combination of both, complex event retrieval surpasses the other options. This can be primarily attributed to the fact that a complex event provides a comprehensive depiction of the development and evolution of a sequence of events, offering rich contextual information while significantly narrowing the retrieval scope. In contrast, global retrieval introduces a considerable amount of noise, which lowers retrieval quality compared to complex event retrieval. Nonetheless, it still provides some historical information: in the graph-only and mixed settings, it yields a noticeable improvement over the absence of retrieval (0.3403 and 0.3267 vs. 0.2746), whereas global text-only retrieval (0.2429) falls below that baseline. Overall, these findings highlight the effectiveness of retrieval enhancement in LLMs for predicting temporal events.
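
The three retrieval scopes differ only in which historical items the retriever is allowed to search. A minimal sketch of that restriction follows; the `Event` fields, the scope labels, and the function name `candidate_pool` are hypothetical and chosen here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    subject: str
    relation: str
    obj: str
    day: int
    complex_event_id: str


def candidate_pool(events, query_day, scope, complex_event_id=None):
    """Select the historical events a retriever is allowed to search."""
    if scope == "none":            # "without history": nothing is retrieved
        return []
    past = [e for e in events if e.day < query_day]
    if scope == "global":          # search across all complex events
        return past
    if scope == "complex_event":   # search only the query's own complex event
        return [e for e in past if e.complex_event_id == complex_event_id]
    raise ValueError(f"unknown scope: {scope}")
```

Any retriever can then be run over the returned pool, so the scope and the retrieval model remain independent choices.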

Table 7. Performance comparison of using different retrieval models on the Text-only and Mixed settings.
| Retriever | Text-only | Mixed |
| --- | --- | --- |
| BM25 (Robertson et al., 2009) | 0.2893 | 0.3129 |
| Contriever (Izacard et al., 2021) | 0.3015 | 0.3285 |
| LlamaIndex (Liu, 2022) | 0.3228 | 0.3539 |

4.3.3. Performance w.r.t. Varying Historical Length

The scope of history can be interpreted as the spatial dimension of the world, while the temporal dimension also matters. Especially in temporal event forecasting, properly setting the historical length, even when retrieval models are used, may play a crucial role in effective forecasting. Driven by this hypothesis, we test our default model, "retrieved history", under different historical lengths from $\{3,7,15,30,90\}$ days. In other words, we constrain the retrieval models to search within the specified period of history. For example, a historical length of three days means that the retrieval models can only search the events/articles from the past three days. The results are shown in Figure 3. For graph-involved methods (Graph-only and Mixed), the performance first grows and then drops as the historical length increases. This makes sense: when the given history is very short, the information useful for sound forecasting is likely to lie further in the past, so no matter how powerful the retrieval model is, it is bottlenecked by the limited accessible history. As the historical length increases, more information is brought into view, resulting in improved performance. However, a longer history also brings more noisy information, which is why the performance starts dropping once the historical length exceeds some threshold. This phenomenon does not appear in the Text-only setting, whose performance is poor overall. A possible reason is that the text-based retrieval models constrain the effectiveness of the retrieval process, making the forecasting method insensitive to historical length.
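
The historical-length constraint amounts to a time-window filter applied before retrieval. The sketch below assumes integer day timestamps and a hypothetical helper `within_window`; it shows how the accessible history grows across the five tested lengths.

```python
HISTORY_LENGTHS = (3, 7, 15, 30, 90)  # days, as tested in this section


def within_window(items, query_day, history_days):
    """Keep (day, payload) items from the last `history_days` days before the query."""
    return [(day, x) for day, x in items if 0 < query_day - day <= history_days]


def accessible_counts(items, query_day):
    # Number of searchable items for each historical length.
    return {h: len(within_window(items, query_day, h)) for h in HISTORY_LENGTHS}
```

A longer window never removes items, so any drop in accuracy at larger lengths has to come from what the retriever selects out of the larger pool, which is consistent with the noise explanation above.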

Figure 3. Performance comparison considering varying historical length for the model of "retrieved history".

4.4. Effects of Popularity Bias (RQ3)

Several previous works, such as SeCoGD (Ma et al., 2023a), have identified that event forecasting suffers from severe popularity bias. To study this problem, we group the testing atomic events into four clusters according to the sparsity (popularity) degree of the object entity within each quadruple. (Alternative ways to measure the sparsity degree of an atomic event are to use the occurrence frequency of the subject entity or the average frequency of both subject and object; we leave these additional settings for future work.) As shown in Figure 4, each bar corresponds to one group of entities, and its x-axis label denotes the occurrence frequency span. The height of the bar represents the number of atomic events whose object entities fall in the group; we also separately annotate the number of atomic events in the train/val/test sets for each group. The line denotes the number of entities (including both subject and object entities) in each group. Evidently, with a similar number of atomic events, the sparser groups contain larger sets of entities while the denser groups contain fewer, exhibiting a significant popularity bias (long-tail) phenomenon.
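
One way to realize this grouping is to count how often each object entity occurs and bucket test events by frequency span. The sketch below is illustrative only: the function name `sparsity_groups` and the cut-off values in the usage note are hypothetical, and the actual frequency spans are those shown in Figure 4.

```python
from bisect import bisect_right
from collections import Counter


def sparsity_groups(all_objects, test_objects, boundaries):
    """Assign each test event a group index based on its object's frequency.

    `boundaries` are ascending frequency cut-offs, each being the inclusive
    lower bound of the next group; three cut-offs give four groups, with
    group 0 the sparsest.
    """
    freq = Counter(all_objects)
    return [bisect_right(boundaries, freq[o]) for o in test_objects]
```

For example, with cut-offs `[2, 10, 30]`, an object seen once falls in group 0 and an object seen 50 times falls in group 3.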

To study the popularity bias issue, we break down the performance of several methods by sparsity group, as illustrated in Figure 5. Each line corresponds to the performance of one sparsity group across the methods, where the top sub-figure presents the two denser groups and the bottom sub-figure shows the two sparser groups. Interestingly, we observe that the non-LLM method, i.e., LoGo, exhibits significant performance gaps between sparse and dense groups. In contrast, LLM-based methods, regardless of the specific setting, have much smaller gaps. This is because LoGo has been trained on the training set and thus simply fits the entity distribution in the dataset. As a result, LoGo demonstrates the strongest overall performance; however, in long-tail sparse entity groups, it performs much worse than LLM-based models. This suggests that LLMs without fine-tuning are more robust to popularity bias and can generate better results for long-tail sparse entities. Among all the LLM-based methods, retrieved graph demonstrates the largest performance gap between sparse and dense groups. This raises a concern that graph-based retrieval may exacerbate the popularity bias issue, since the retrieval modules are also affected by popularity bias.
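
The per-group breakdown can be computed with a few lines of bookkeeping. This is a minimal sketch with a hypothetical function name, assuming parallel lists of predictions, ground-truth answers, and group indices.

```python
from collections import defaultdict


def accuracy_by_group(predictions, answers, groups):
    """Return {group: accuracy} for parallel lists of predictions, answers, groups."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pred, gold, g in zip(predictions, answers, groups):
        totals[g] += 1
        hits[g] += int(pred == gold)
    return {g: hits[g] / totals[g] for g in totals}
```

Comparing the resulting values between the sparsest and densest groups gives the performance gap discussed above.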

Figure 4. The statistics of different sparsity (density) groups, where bars show the number of atomic events and lines indicate the number of entities in each group.
Figure 5. The breakdown performance comparison on different sparsity (density) groups, where the top sub-figure shows two denser (more popular) groups while the bottom sub-figure illustrates two sparser groups. The horizontal axis corresponds to multiple methods including both non-LLM method (LoGo) and LLM-based methods.

5. Related Work

We briefly survey the related works from two perspectives: general works on temporal event forecasting and more specifically, large language model-based methods in temporal event forecasting.

5.1. Temporal Event Forecasting

Temporal event forecasting aims to predict future events based on the observed historical events. Different formulations have been raised for this task (Zhao, 2022). They can generally be grouped into three categories based on the event format: time series, unstructured textual events, and structured events.

Some works model event occurrence or features using time series (Weidmann and Ward, 2010; Benjamin et al., 2023; Morstatter, 2021), but they fail to represent the multiple relations among entities and the multi-line nature of events. Several studies also explore the unstructured textual representation of temporal events, where each atomic event is generated from multiple documents in the form of a summary (Gholipour Ghalandari et al., 2020) or phrases (Jiao et al., 2023) and is chained in temporal order. Natural language event descriptions contain more fine-grained details but also lead to higher information complexity, which impedes the formulation and performance of the downstream forecasting task. ForecastQA (Jin et al., 2021) formulates the event forecasting task in a question-answering manner with a retrieval database consisting of all documents that contain historical events. However, due to the lack of structured extracted events, the generation of QA pairs requires heavy human labeling and extensive domain knowledge.

Considerable effort has also been devoted to structured event forecasting. The major line of work represents temporal events using temporal knowledge graphs (TKGs), where each atomic event is a time-stamped link (Leetaru and Schrodt, 2013; O’brien, 2010). Numerous methods have been proposed for forecasting by aggregating the temporal and relational patterns among entities (Jin et al., 2020; Li et al., 2021b; Park et al., 2022), retrieving relevant historical events (Zhu et al., 2020; Sun et al., 2021; Li et al., 2021a), or modeling the continuous-time development of events (Trivedi et al., 2017; Ding et al., 2021). Some methods have also tried to incorporate textual event information into TKGs. Glean (Deng et al., 2020) and CMF (Deng et al., 2021) fuse textual embeddings into graph edges, SeCoGD (Ma et al., 2023a) uses textual topic modeling to disentangle subgraphs, and the dataset MidEast-TE and the model LoGo (Ma et al., 2023b) leverage text clustering to construct complex events for forecasting with local and global contexts. However, all of them still conduct the forecast reasoning on graphs, while in our work, we investigate forecasting in a hybrid setting that leverages both text and graph.

5.2. LLMs in Temporal Event Forecasting

Generative Language Models (LMs), especially Large Language Models (LLMs) (OpenAI, 2023), have exhibited superior capability in language understanding and reasoning in various tasks and domains, such as science (Chen et al., 2023; Wang et al., 2023; Lu et al., 2024) and healthcare (Wu et al., 2023; Thirunavukarasu et al., 2023). Researchers have also conducted various studies on LLMs for temporal reasoning. One line of work focuses on temporal understanding, where LLMs are tested on temporal event ordering or storyline understanding (Tan et al., 2023; Ning et al., 2020; Zhou et al., 2019; Zhang and Choi, 2021; Wang and Zhao, 2023). Compared to the understanding task, the forecasting task is generally more difficult, since the prediction targets do not appear in the input and the model must therefore conduct the necessary inferences. Several studies have leveraged LLMs for temporal event forecasting by converting the TKG formulation into text sequences and converting missing object prediction into next-token prediction (Xu et al., 2023). GPT-NeoX-ICL (Lee et al., 2023) leverages the in-context learning of LLMs and constructs prompts as a list of historical events, each in quadruple format. GenTKG (Liao et al., 2023) improves the selection of historical event inputs with a temporal logical rule-based retrieval strategy. LAMP (Shi et al., 2023) instead applies LLMs to perform abductive reasoning to assist the retrieval. However, all of these works only investigate LLMs on temporal event forecasting with structured graph event data. The evaluation of LLMs on textual temporal event forecasting remains under-explored, not to mention the hybrid setting of graph and text that we also study in this work.

6. Conclusion and Future Work

In this paper, we systematically evaluated LLM-based methods on the task of text-involved temporal event forecasting. Specifically, we first built a benchmark dataset by using the SOTA LLM GPT-4 as the event extractor. Then we designed a series of LLM-based event forecasting models equipped with multiple configurable components, including optional input modalities, different forecasting objectives, different backbone LLMs, fine-tuning, and RAG with various settings. We also studied how popularity bias affects temporal event forecasting. By implementing all these model variations, we obtained a comprehensive understanding of how LLMs currently perform on text-involved temporal event forecasting. More importantly, from the extensive evaluation, we pinpointed several key research questions that are meaningful yet challenging. First, developing effective text-based retrieval models for temporal event forecasting is essential, given the large amount of unstructured news articles posted online every day. Second, popularity bias and long-tail issues are still severe, and pertinent measurement and mitigation approaches are highly desired. Finally, developing event forecasting-oriented LLMs, i.e., tuning a task-specific LLM, seems to be the most promising direction.

Moving forward, there is a lot of work to be done to address the identified key challenges, and here we highlight two aspects. First, building larger, more accurate, and more versatile benchmark datasets is pressing. Event forecasting is characterized by highly domain-specific properties; therefore, it is hard to draw generalizable conclusions from small-scale, noisy datasets in one or a small number of domains, which is also the major limitation of this work. Second, focusing on task-level characteristics instead of an arms race on overall performance is necessary and more valuable. Event forecasting is a long-standing but slow-developing research area, one non-negligible reason for which is the under-exploration and poor understanding of its problem settings.

Appendix A Appendix

A.1. Prompts

In this section, we show the prompts designed for various modules. Table 9 and Table 10 show the prompts of news summarization and graph retrieval, respectively. As shown in Table 8, the red section represents the rule-based method, while the blue section represents the retrieval-based method.

Table 8. Prompt template of event forecasting
You are an assistant to perform event forecasting with the following rules:
1. The atomic event is the basic unit describing a specific event, typically presented in the form of a quadruple (S, R, O, T), where S represents the subject, R represents the relation, O represents the object, and T represents the relative time.
2. Complex Event, which is composed of a set of atomic events, describes the temporal evolution process of multiple atomic events.
3. Please remember the meanings of the following identifiers:
   [Query] represents the event to be predicted in the form of (S, O, T).
   [Nearest Events] represents a list of (summaries / atomic events / summaries and atomic events) in the complex event that is relatively closer in time to the predicted event.
   [Further Events] represents a list of (summaries / atomic events / summaries and atomic events) in the complex event that is relatively further in time from the predicted event.
   [Related Events] encompasses the past (summaries / atomic events / summaries and atomic events) that contain the same subject or object as the question.
   [Relevant Event] represents a list of (atomic events / summaries and atomic events) relevant to the query.
   [Relevant News Text] represents background information about subject.
   [Options] represents the candidate set of the missing object.
4. Given a query of (S, O, T) in the future and the list of historical events until t, event forecasting aims to predict the missing object.

Table 9. Prompt template of summarization
You are an assistant to summarize news text with the following rules:
1. You need to generate a concise summary based on the news text.
2. Your response should only include the generated summary.

Table 10. Prompt template of graph retrieval
You are an assistant to find relevant entities with the following rules:
1. [Subject] represents the event subject in a specific event. [Candidate Set] represents a list of entities.
2. You need to select the entities that may be relevant to [Subject].

References

  • Benjamin et al. (2023) Daniel M Benjamin, Fred Morstatter, Ali E Abbas, Andres Abeliuk, Pavel Atanasov, Stephen Bennett, Andreas Beger, Saurabh Birari, David V Budescu, Michele Catasta, et al. 2023. Hybrid forecasting of geopolitical events. AI Magazine (2023).
  • Boschee et al. (2015) Elizabeth Boschee, Jennifer Lautenschlager, Sean O’Brien, Steve Shellman, James Starz, and Michael Ward. 2015. CAMEO.CDB.09b5.pdf. In ICEWS Coded Event Data. Harvard Dataverse. https://doi.org/10.7910/DVN/28075/SCJPXX
  • Cai et al. (2022) Borui Cai, Yong Xiang, Longxiang Gao, He Zhang, Yunfeng Li, and Jianxin Li. 2022. Temporal Knowledge Graph Completion: A Survey. CoRR abs/2201.08236 (2022).
  • Chen et al. (2023) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. In TMLR.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
  • Deng et al. (2020) Songgaojun Deng, Huzefa Rangwala, and Yue Ning. 2020. Dynamic Knowledge Graph based Multi-Event Forecasting. In KDD. ACM, 1585–1595.
  • Deng et al. (2021) Songgaojun Deng, Huzefa Rangwala, and Yue Ning. 2021. Understanding Event Predictions via Contextualized Multilevel Feature Learning. In CIKM. ACM, 342–351.
  • Dettmers et al. (2018) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2D Knowledge Graph Embeddings. In AAAI. AAAI Press, 1811–1818.
  • Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. CoRR abs/2305.14314 (2023).
  • Ding et al. (2021) Zifeng Ding, Zhen Han, Yunpu Ma, and Volker Tresp. 2021. Temporal Knowledge Graph Forecasting with Neural ODE. CoRR abs/2101.05151 (2021). https://api.semanticscholar.org/CorpusID:231592393
  • Gholipour Ghalandari et al. (2020) Demian Gholipour Ghalandari, Chris Hokamp, Nghia The Pham, John Glover, and Georgiana Ifrim. 2020. A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 1302–1308. https://doi.org/10.18653/v1/2020.acl-main.120
  • Izacard et al. (2021) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118 (2021).
  • Jiao et al. (2023) Yizhu Jiao, Ming Zhong, Jiaming Shen, Yunyi Zhang, Chao Zhang, and Jiawei Han. 2023. Unsupervised Event Chain Mining from Multiple Documents. In WWW. ACM, 1948–1959.
  • Jin et al. (2021) Woojeong Jin, Rahul Khanna, Suji Kim, Dong-Ho Lee, Fred Morstatter, Aram Galstyan, and Xiang Ren. 2021. ForecastQA: A Question Answering Challenge for Event Forecasting with Temporal Text Data. In ACL/IJCNLP (1). Association for Computational Linguistics, 4636–4650.
  • Jin et al. (2020) Woojeong Jin, Meng Qu, Xisen Jin, and Xiang Ren. 2020. Recurrent Event Network: Autoregressive Structure Inference over Temporal Knowledge Graphs. In EMNLP (1). Association for Computational Linguistics, 6669–6683.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. In NeurIPS.
  • Lee et al. (2023) Dong-Ho Lee, Kian Ahrabian, Woojeong Jin, Fred Morstatter, and Jay Pujara. 2023. Temporal Knowledge Graph Forecasting Without Knowledge Using In-Context Learning. In EMNLP. Association for Computational Linguistics, 544–557.
  • Leetaru and Schrodt (2013) Kalev Leetaru and Philip A Schrodt. 2013. GDELT: Global data on events, location, and tone, 1979–2012. In ISA annual convention, Vol. 2. Citeseer, 1–49.
  • Lewis et al. (2020) Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In NeurIPS.
  • Li et al. (2021c) Manling Li, Sha Li, Zhenhailong Wang, Lifu Huang, Kyunghyun Cho, Heng Ji, Jiawei Han, and Clare R. Voss. 2021c. The Future is not One-dimensional: Complex Event Schema Induction by Graph Modeling for Event Prediction. In EMNLP (1). Association for Computational Linguistics, 5203–5215.
  • Li et al. (2021a) Zixuan Li, Xiaolong Jin, Saiping Guan, Wei Li, Jiafeng Guo, Yuanzhuo Wang, and Xueqi Cheng. 2021a. Search from History and Reason for Future: Two-stage Reasoning on Temporal Knowledge Graphs. In ACL. https://api.semanticscholar.org/CorpusID:235266233
  • Li et al. (2021b) Zixuan Li, Xiaolong Jin, Wei Li, Saiping Guan, Jiafeng Guo, Huawei Shen, Yuanzhuo Wang, and Xueqi Cheng. 2021b. Temporal Knowledge Graph Reasoning Based on Evolutional Representation Learning. In SIGIR. ACM, 408–417.
  • Liao et al. (2023) Ruotong Liao, Xu Jia, Yunpu Ma, and Volker Tresp. 2023. GenTKG: Generative Forecasting on Temporal Knowledge Graph. CoRR abs/2310.07793 (2023).
  • Liu (2022) Jerry Liu. 2022. LlamaIndex. https://doi.org/10.5281/zenodo.1234
  • Lu et al. (2024) Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2024. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. In ICLR.
  • Luo et al. (2024) Ruilin Luo, Tianle Gu, Haoling Li, Junzhe Li, Zicheng Lin, Jiayi Li, and Yujiu Yang. 2024. Chain of History: Learning and Forecasting with LLMs for Temporal Knowledge Graph Completion. CoRR abs/2401.06072 (2024).
  • Lv et al. (2020) Shangwen Lv, Fuqing Zhu, and Songlin Hu. 2020. Integrating external event knowledge for script learning. In Proceedings of the 28th International Conference on Computational Linguistics. 306–315.
  • Ma et al. (2023a) Yunshan Ma, Chenchen Ye, Zijian Wu, Xiang Wang, Yixin Cao, and Tat-Seng Chua. 2023a. Context-aware Event Forecasting via Graph Disentanglement. In KDD. ACM, 1643–1652.
  • Ma et al. (2023b) Yunshan Ma, Chenchen Ye, Zijian Wu, Xiang Wang, Yixin Cao, Liang Pang, and Tat-Seng Chua. 2023b. Structured, Complex and Time-complete Temporal Event Forecasting. CoRR abs/2312.01052 (2023).
  • Morstatter (2021) Fred Morstatter. 2021. RCT-B. (2021). https://doi.org/10.7910/DVN/ROTHFT
  • Ning et al. (2020) Qiang Ning, Hao Wu, Rujun Han, Nanyun Peng, Matt Gardner, and Dan Roth. 2020. TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions. In EMNLP. 1158–1172. https://doi.org/10.18653/v1/2020.emnlp-main.88
  • O’Brien (2010) Sean P O’Brien. 2010. Crisis early warning and decision support: Contemporary approaches and thoughts on future research. International Studies Review 12, 1 (2010), 87–104.
  • OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. CoRR abs/2303.08774 (2023).
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In NeurIPS.
  • Park et al. (2022) Namyong Park, Fuchen Liu, Purvanshi Mehta, Dana Cristofor, Christos Faloutsos, and Yuxiao Dong. 2022. EvoKG: Jointly Modeling Event Time and Network Structure for Reasoning over Temporal Knowledge Graphs. In WSDM. ACM, 794–803.
  • Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389.
  • Schlichtkrull et al. (2018) Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling Relational Data with Graph Convolutional Networks. In ESWC (Lecture Notes in Computer Science, Vol. 10843). Springer, 593–607.
  • Shang et al. (2019) Chao Shang, Yun Tang, Jing Huang, Jinbo Bi, Xiaodong He, and Bowen Zhou. 2019. End-to-End Structure-Aware Convolutional Networks for Knowledge Base Completion. In AAAI. AAAI Press, 3060–3067.
  • Shi et al. (2023) Xiaoming Shi, Siqiao Xue, Kangrui Wang, Fan Zhou, James Y. Zhang, Jun Zhou, Chenhao Tan, and Hongyuan Mei. 2023. Language Models Can Improve Event Prediction by Few-Shot Abductive Reasoning. In NeurIPS.
  • Sun et al. (2021) Haohai Sun, Jialu Zhong, Yunpu Ma, Zhen Han, and Kun He. 2021. TimeTraveler: Reinforcement Learning for Temporal Knowledge Graph Forecasting. In EMNLP. https://api.semanticscholar.org/CorpusID:237454564
  • Sun et al. (2023) Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Heung-Yeung Shum, and Jian Guo. 2023. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model with Knowledge Graph. CoRR abs/2307.07697 (2023).
  • Tan et al. (2023) Qingyu Tan, Hwee Tou Ng, and Lidong Bing. 2023. Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models. In ACL. Association for Computational Linguistics, 14820–14835. https://doi.org/10.18653/v1/2023.acl-long.828
  • Thirunavukarasu et al. (2023) Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature Medicine 29 (2023), 1930–1940.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. CoRR abs/2302.13971 (2023).
  • Trivedi et al. (2017) Rakshit Trivedi, Hanjun Dai, Yichen Wang, and Le Song. 2017. Know-evolve: deep temporal reasoning for dynamic knowledge graphs. In ICML. 3462–3471.
  • Wang et al. (2023) Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. 2023. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. (2023). arXiv:2307.10635
  • Wang and Zhao (2023) Yuqing Wang and Yun Zhao. 2023. TRAM: Benchmarking Temporal Reasoning for Large Language Models. (2023). arXiv:2310.00835
  • Weidmann and Ward (2010) Nils B Weidmann and Michael D Ward. 2010. Predicting conflict in space and time. Journal of Conflict Resolution 54, 6 (2010), 883–901.
  • Wu et al. (2023) Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2023. PMC-LLaMA: Further Finetuning LLaMA on Medical Papers. CoRR abs/2304.14454 (2023). https://api.semanticscholar.org/CorpusID:263888272
  • Xu et al. (2023) Wenjie Xu, Ben Liu, Miao Peng, Xu Jia, and Min Peng. 2023. Pre-trained Language Model with Prompts for Temporal Knowledge Graph Completion. In ACL (Findings). Association for Computational Linguistics, 7790–7803.
  • Zhang and Choi (2021) Michael Zhang and Eunsol Choi. 2021. SituatedQA: Incorporating Extra-Linguistic Contexts into QA. In EMNLP.
  • Zhao (2022) Liang Zhao. 2022. Event Prediction in the Big Data Era: A Systematic Survey. ACM Comput. Surv. 54, 5 (2022), 94:1–94:37.
  • Zhou et al. (2019) Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. “Going on a vacation” takes longer than “Going for a walk”: A Study of Temporal Commonsense Understanding. In EMNLP. 3363–3369. https://doi.org/10.18653/v1/D19-1332
  • Zhu et al. (2020) Cunchao Zhu, Muhao Chen, Changjun Fan, Guangquan Cheng, and Yan Zhan. 2020. Learning from History: Modeling Temporal Knowledge Graphs with Sequential Copy-Generation Networks. In AAAI Conference on Artificial Intelligence. https://api.semanticscholar.org/CorpusID:229180723