MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models

Haoxuan Li (University of Electronic Science and Technology of China, Chengdu, China), Zhengmao Yang (Zhejiang University, Hangzhou, China), Yunshan Ma (National University of Singapore, Singapore), Yi Bin (Tongji University, Shanghai, China; National University of Singapore, Singapore), Yang Yang (University of Electronic Science and Technology of China, Chengdu, China), and Tat-Seng Chua (National University of Singapore, Singapore)
(2024)
Abstract.

We study an emerging and intriguing problem: multimodal temporal event forecasting with large language models. Compared to the text and graph modalities, the use of images for temporal event forecasting has not been fully explored, especially in the era of large language models (LLMs). To bridge this gap, we are particularly interested in two key questions: 1) why images help in temporal event forecasting, and 2) how to integrate images into the LLM-based forecasting framework. To answer these research questions, we identify two essential functions that images play in the scenario of temporal event forecasting, i.e., highlighting and complementary. Then, we develop a novel framework, named MM-Forecast. It employs an Image Function Identification module to recognize these functions as verbal descriptions using multimodal large language models (MLLMs), and subsequently incorporates these function descriptions into LLM-based forecasting models. To evaluate our approach, we construct a new multimodal dataset, MidEast-TE-mm, by extending an existing event dataset, MidEast-TE-mini, with images. Empirical studies demonstrate that MM-Forecast can correctly identify the image functions and, furthermore, that incorporating these verbal function descriptions significantly improves the forecasting performance. The dataset, code, and prompts are available at https://github.com/LuminosityX/MM-Forecast.

Temporal Event Forecasting, Multimodal Event Forecasting, Multimodal Large Language Model
Journal year: 2024. Copyright: ACM licensed. Conference: Proceedings of the 32nd ACM International Conference on Multimedia (MM ’24), October 28-November 1, 2024, Melbourne, VIC, Australia. DOI: 10.1145/3664647.3681593. ISBN: 979-8-4007-0686-8/24/10. CCS: Information systems, Multimedia and multimodal retrieval; Computing methodologies, Temporal reasoning.

1. Introduction

Temporal event forecasting aims to predict future events according to the events observed in history. The forecasting of critical events, such as pandemic outbreaks, civil unrest, and international conflicts, can help shape policies in advance and minimize potential impacts (Zhao, 2021). Due to its great potential application value, temporal event forecasting (Jin et al., 2021; Ma et al., 2023b; Lv et al., 2020; Liang et al., 2024) has garnered increasing attention from both the academic and industrial communities. Despite promising progress, current methods have ignored rich multimodal information, e.g., images, leaving an unexplored research gap.

With the enormous success of LLMs, an increasing number of studies (Lee et al., 2023; Liao et al., 2023; Luo et al., 2024; Xu et al., 2023; Zhang et al., 2024; Ye et al., 2024; Chang et al., 2024) have been exploring LLMs to tackle the temporal event forecasting problem. These pioneering works leverage techniques such as in-context learning (ICL) (Lee et al., 2023; Zhang et al., 2024), instruction tuning (Luo et al., 2024; Xu et al., 2023), and retrieval-augmented generation (RAG) (Sun et al., 2023; Chang et al., 2024). Compared to LLM-based methods, traditional methods have several shortcomings in terms of effectiveness, flexibility, and scalability. Specifically, traditional non-LLM methods (Jin et al., 2020; Li et al., 2021; Ma et al., 2023b; Park et al., 2022; Ma et al., 2023a), whether based on structured or unstructured data, typically require large-scale well-annotated datasets. Moreover, model selection is often a challenge for these traditional methods due to high computational costs. Additionally, traditional methods generally require separate training for different datasets; as a result, they often struggle to adapt quickly to frequent dataset changes and temporal shifts. Therefore, applying LLMs to the task of temporal event forecasting is a worthwhile direction to explore (Lee et al., 2023). However, all of the existing LLM-based methods consider only a single modality, such as text (Lee et al., 2023) or graph (Luo et al., 2024), while ignoring the prevalent visual modality, i.e., images. Some previous works (Li et al., 2024a, 2023) have shown that images are helpful in multimodal event detection (Li et al., 2024a) and extraction (Tong et al., 2020; Li et al., 2022), but none of them investigates the utility of images in temporal event forecasting.

Figure 1. Illustration of our motivation about why images will help in temporal event forecasting. We identify two essential functions of images, i.e., highlighting and complementary. By offering auxiliary highlighting or complementary information, images enhance the understanding of temporal events, thus boosting the forecasting performance.

To bridge this gap, we aim to integrate images into temporal event forecasting and construct multimodal temporal event forecasting models. However, this is a non-trivial objective due to the following challenges. First, it is necessary to clarify the function of visual information relative to other modalities, i.e., the interplay between visual and non-visual modalities, and to figure out how this function can contribute to the task of temporal event forecasting. Second, previous works (Tong et al., 2020) that explore the image function typically require large amounts of labeled training data. Additionally, images serve different functions for different tasks, so these methods often struggle to generalize effectively to other task definitions. Therefore, there is a pressing need to design an effective method to identify the function between modalities and seamlessly integrate it into LLM-based forecasting models.

To address the aforementioned challenges, we propose a novel framework for multimodal temporal event forecasting, named MM-Forecast. Specifically, we identify two essential functions of images, i.e., highlighting and complementary. As illustrated in Figure 1, when the function of the associated image is highlighting, the image emphasizes the key events. In contrast, when the function of the associated image is complementary, the image provides supplementary information that complements the textual content. To recognize these two types of functions, we propose an Image Function Identification module based on Multimodal LLMs (MLLMs), chosen for their superior multimodal understanding and reasoning capabilities in zero-shot settings (Li et al., 2024b). This module is designed to recognize the function of images in historical events, and then transform this information into verbal descriptions that can be seamlessly integrated into the LLM-based event forecasting model. We integrate the Image Function Identification module into two distinct LLM-based forecasting models, i.e., one based on the in-context learning (ICL) method (Lee et al., 2023), and the other based on the retrieval-augmented generation (RAG) technique (Lewis et al., 2020). To evaluate our approach, we construct an exploratory dataset by incorporating images into an existing dataset, MidEast-TE-mini (Chang et al., 2024). We name this new dataset MidEast-TE-multimodal (MidEast-TE-mm for short). In the final evaluation, with the enhancement of visual information, the temporal event forecasting task achieves superior forecasting accuracy compared to the unimodal approach. The experimental results illustrate that our method accurately recognizes the function of images in various aspects. Furthermore, the findings demonstrate that multimodal temporal forecasting is a promising research direction worthy of further exploration. The main contributions are as follows:

  • To the best of our knowledge, this is the first comprehensive study of exploring visual information for temporal event forecasting in the era of LLMs.

  • We identify two main functions that images play in temporal event forecasting, and design a framework to recognise and integrate visual information into LLM-based forecasting models.

  • Extensive experiments show that our framework is able to identify the functions of images and that visual information can enhance the performance of temporal event forecasting. Furthermore, these findings point to several noteworthy and promising directions for future research.

2. Related Works

We survey the related works of temporal event forecasting and LLMs for event analysis.

2.1. Temporal Event Forecasting

Temporal event forecasting centers on predicting future event occurrences based on historical events, and the typical approaches can be categorized by event format: time series, structured, and unstructured events. In the time series paradigm, existing works (Liang et al., 2024; Benjamin et al., 2023; Morstatter, 2021) typically represent events as an ordered sequence of data points that describe the progression of actions or occurrences. However, this paradigm inherently fails to represent multiple relationships among entities. Alternatively, another branch of works (Shang et al., 2019; Dettmers et al., 2018; Sun et al., 2019; Yang et al., 2015; Ma et al., 2023a; Jin et al., 2020; Li et al., 2021; Park et al., 2022) focuses on the prediction of structured events, i.e., using graphs to represent events, which is known as the temporal knowledge graph (TKG). Recent works (Ma et al., 2023a, b) introduce context into temporal event forecasting models, enhancing the prediction performance by elaborating the event’s occurrence situation. In addition, several studies have explored the use of unstructured textual representations of temporal events, where each atomic event is generated from multi-document summaries (Gholipour Ghalandari et al., 2020) or event chains (Jiao et al., 2023). Nonetheless, all of them design forecasting models that rely on single-modality data. Some works (Tong et al., 2020; Li et al., 2022) explore the utility of images in the event extraction task, but none of them investigates the utility of images in temporal event forecasting.

2.2. LLMs for Event Analysis

The tremendous success of LLMs in recent years, exemplified by ChatGPT and its numerous successors (Touvron et al., 2023; Zhang et al., 2022; Chowdhery et al., 2023; Chiang et al., 2023), has inspired researchers to explore the application of these powerful models to various event-related tasks (Deng et al., 2024; Liao et al., 2023; Lee et al., 2023; Zhang et al., 2024; Ye et al., 2024; Chang et al., 2024). One area of research focuses on temporal understanding, where LLMs are tested on the task of temporal event ordering or storyline understanding (Ning et al., 2020; Zhang and Choi, 2021; Zhou et al., 2019). More works focus on leveraging LLMs to tackle the typical task of temporal reasoning (Tan et al., 2023; Wang and Zhao, 2023), while the task of forecasting receives much less attention. Deng et al. (Deng et al., 2024) surveyed the recent advances in event modeling, ranging from graph neural networks to LLMs. Specifically, GENTKG (Liao et al., 2023) improves the selection of historical event inputs by a temporal logical rule-based retrieval strategy. Beyond specific methods, more works are focusing on benchmarking LLMs’ capability in temporal event forecasting. Zhang et al. (Zhang et al., 2024) propose a method to evaluate the proficiency of LLMs in handling temporal dynamics and understanding extensive text through three distinct tasks. Ye et al. (Ye et al., 2024) introduce a novel benchmarking environment designed to rigorously assess and advance the capabilities of LLM agents for international event forecasting over time. Furthermore, Chang et al. (Chang et al., 2024) propose a unified dataset of structured and unstructured data, and systematically evaluate LLM-based methods on the task of text-involved temporal event forecasting. However, these existing LLM-based methods still rely solely on single-modality data, potentially missing valuable information from other modalities, such as images.
With the success of LLMs, MLLMs such as LLaVA (Liu et al., 2024) and Gemini (Team et al., 2023) have emerged as a promising means of unifying visual and textual modalities. These MLLMs have demonstrated impressive performance gains across various vision-language tasks (Alayrac et al., 2022; Liu et al., 2024; Bin et al., 2023; Ding et al., 2024), suggesting their potential for temporal event forecasting by leveraging visual information.

3. Our Approach: MM-Forecast

Figure 2. The schematic overview of MM-Forecast. Given historical events in either unstructured or structured format (left), our Image Function Identification module (middle) recognizes the image functions as verbal descriptions, which are then fed into the LLM-based forecasting model (right). Our framework can handle both structured and unstructured events, and it is compatible with popular LLM components for event forecasting, i.e., ICL and RAG.

The overall framework of our proposed approach is depicted in Figure 2. We first formally define the multimodal temporal event forecasting task in Section 3.1. Second, we specifically introduce the key module of Image Function Identification in Section 3.2. Finally, we elaborate on how to integrate the recognized image functions into LLM-based forecasting models in Section 3.3.

3.1. Problem Formulation

To formally define the problem, we separate it into two sub-tasks according to the different representations of historical information.

Structured Event Forecasting (Graph; "Graph" is used interchangeably to refer to this setting). This formulation defines each event as a quadruple $(s, r, o, t)$, which is also called an atomic event, where $s$, $r$, $o$, and $t$ correspond to the subject, relation (relation and event type are used interchangeably in this work), object, and timestamp. At each timestamp $t$, all the quadruples form an event graph, denoted as $G_t = \{(s, r, o, t)\}^{N}$, where $N$ is the number of events at timestamp $t$. Recent work (Ma et al., 2023b) introduces the concept of complex event (CE) into the structured event representation by document clustering, elaborating the event’s occurrence situation or context. Specifically, each atomic event is extended from a quadruple to a quintuple, i.e., $(s, r, o, t, c)$, where $s \in \mathcal{E}$, $r \in \mathcal{R}$, $o \in \mathcal{E}$, and $c \in \mathcal{C}$ represent the subject, relation, object, and CE, respectively; $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{C}$ are the entity set, relation set, and complex context set. Correspondingly, the event graph at each timestamp is extended to $G_t = \{(s, r, o, t, c)\}^{N}$. Furthermore, in addition to the event graph, there are images associated with structured events, denoted as $V_t = \{v_1, v_2, \ldots, v_M\}$, where $M$ is the number of images at timestamp $t$. Finally, the structured event forecasting task can be formulated as follows: given the historical event graphs $G_{<t} = \{G_0, G_1, \ldots, G_{t-1}\}$ and associated images $V_{<t} = \{V_0, V_1, \ldots, V_{t-1}\}$ before timestamp $t$, and a query $(s, r, t)$ or $(s, o, t)$, the goal is to predict the missing object $o$ or relation $r$.

Unstructured Event Forecasting (Text; "Text" is used interchangeably to refer to this setting). In addition to the structured event representation, we also consider the unstructured event representation, where the historical information is represented in the form of textual sub-events, i.e., $A_t = \{a_1, a_2, \ldots, a_K\}$ and $A_t \in \mathcal{A}$, where $a_k$ denotes the $k$-th textual sub-event and $\mathcal{A}$ denotes the corpus of textual sub-events. The textual sub-events are obtained by summarizing the content of news articles. Similar to structured event forecasting, textual sub-events have associated images, denoted as $V_{<t} = \{V_0, V_1, \ldots, V_{t-1}\}$. The unstructured event forecasting task can be formulated as: given the historical textual sub-events $A_{<t} = \{A_0, A_1, \ldots, A_{t-1}\}$ and associated images $V_{<t} = \{V_0, V_1, \ldots, V_{t-1}\}$ before timestamp $t$, and a query $(s, r, t)$ or $(s, o, t)$, the goal is to predict the missing object $o$ or relation $r$.
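As a reading aid, the formulation above can be sketched as plain data structures. This is a minimal illustration only, not the authors' implementation: the names `AtomicEvent`, `Query`, and `history_before` are hypothetical, and timestamps are assumed to be integers.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class AtomicEvent:
    """One atomic event (s, r, o, t), optionally extended with a complex event c."""
    subject: str
    relation: str
    obj: str
    timestamp: int                       # assumption: integer timestamps
    complex_event: Optional[str] = None  # None for a plain quadruple


@dataclass(frozen=True)
class Query:
    """A query (s, r, t) or (s, o, t): exactly one of relation/obj is known."""
    subject: str
    timestamp: int
    relation: Optional[str] = None
    obj: Optional[str] = None

    def target(self) -> str:
        """Return which element is missing and must be predicted."""
        if (self.relation is None) == (self.obj is None):
            raise ValueError("exactly one of relation/obj must be provided")
        return "object" if self.obj is None else "relation"


def history_before(events: Iterable[AtomicEvent], t: int) -> Dict[int, List[AtomicEvent]]:
    """Group events with timestamp < t into per-timestamp graphs G_0 .. G_{t-1}."""
    graphs: Dict[int, List[AtomicEvent]] = {}
    for event in events:
        if event.timestamp < t:
            graphs.setdefault(event.timestamp, []).append(event)
    return dict(sorted(graphs.items()))
```

The image sets $V_{<t}$ could be grouped by timestamp in the same way; they are omitted here to keep the sketch short.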

3.2. Image Function Identification

Identifying the function of images in temporal event forecasting is the key to utilizing multimodal visual information. In news articles, images play a vital role not only in attracting readers but also in completing and enriching the textual content, especially key event content. During the dataset construction stage, we categorize the image functions into three categories, i.e., highlighting, complementary, and irrelevant. Excluding the irrelevant images, the others serve distinct roles in the temporal event forecasting task. We propose an Image Function Identification module to recognize these functions as verbal descriptions using MLLMs, and subsequently incorporate these function descriptions into LLM-based forecasting models. Specifically, when the function of the associated image is highlighting, the visual elements directly support and highlight the key sub-events described in the text. These "highlighting" sub-events, substantiated by corroborating information across modalities, can be identified as key events. To determine which sub-event is a key event, we leverage the MLLMs to analyze the images and sub-events from multiple aspects, including main objects, celebrities, activities, environment, and labeled items. When the function of the associated image is complementary, the visual content contains information that supplements and extends what is covered in the news text. To extract the relevant supplementary information more effectively, we consider the following aspects: 1) identify the main subject of the image as the central point, 2) directly relate the extracted information to the news event in the article, 3) prioritize the most newsworthy visual elements, 4) ensure all information comes directly from the provided news article without fabrication, and 5) aim for a concise summary using clear language.
By analyzing the interplay between visual images and textual content within news articles, we can gain a more comprehensive understanding of the underlying events and better contextualize the temporal evolution of historical events. Ultimately, the prompts utilized in making predictions are shown below:

SYSTEM:
You are an assistant to perform event forecasting with the following rules:
1. The atomic event is the basic unit describing a specific event, typically presented in the form of a quadruple (S, R, O, T), where S represents the subject, R represent the relation, O represents the object, and T represents the relative time.
2. When formulating the ultimate prediction, the preeminent factor to be meticulously weighed and scrutinized is the [Key Events]. Complementing this paramount consideration is the [Related events], which, though ancillary in nature, serves as a valuable adjunct, furnishing pertinent contextual details and auxiliary insights to fortify the predictive analysis.
3. Given a query of (S, R, T) in the future and the list of historical events until t, event forecasting aims to predict the missing object.
USER:
[Query]: (S, O/R, T)
[Key Events]: xxx.
[Related Events]: xxx.
[Options]: A.xxx B.xxx C.xxx D.xxx E.xxx

The key events are explicitly highlighted within the prompts, while complementary information is provided as additional relevant events.
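To make the USER part of the template concrete, the following is a minimal sketch of how such a message might be assembled from already-identified key and related events. The function names, the "; " separator between events, and the "None" placeholder for an empty list are assumptions for illustration; the released prompts may be formatted differently.

```python
from string import ascii_uppercase
from typing import Sequence, Tuple


def format_tuple(event: Tuple) -> str:
    """Render an event or query tuple as '(a, b, c)'."""
    return "(" + ", ".join(str(x) for x in event) + ")"


def format_events(events: Sequence[Tuple]) -> str:
    """Join events with '; ' (assumed separator); 'None' if the list is empty."""
    if not events:
        return "None"
    return "; ".join(format_tuple(e) for e in events)


def build_user_prompt(
    query: Tuple,
    key_events: Sequence[Tuple],
    related_events: Sequence[Tuple],
    options: Sequence[str],
) -> str:
    """Fill the [Query]/[Key Events]/[Related Events]/[Options] slots."""
    if len(options) > len(ascii_uppercase):
        raise ValueError("too many options to label with single letters")
    labelled = " ".join(
        f"{letter}.{option}" for letter, option in zip(ascii_uppercase, options)
    )
    lines = [
        "[Query]: " + format_tuple(query),
        "[Key Events]: " + format_events(key_events) + ".",
        "[Related Events]: " + format_events(related_events) + ".",
        "[Options]: " + labelled,
    ]
    return "\n".join(lines)
```

Keeping key events and related events in separate slots mirrors rule 2 of the system prompt, which asks the model to weigh the key events first and treat related events as supporting context.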

3.3. Forecasting Framework

Given there are few established studies of using LLMs for event forecasting, we consider two representative approaches, i.e., In-context Learning (ICL) (Lee et al., 2023) and Retrieval Augmented Generation (RAG) (Lewis et al., 2020). Each of these two methods can accept both structured and unstructured historical input, and answer the structured forecasting questions.

3.3.1. In-context Learning (ICL)

In-context learning leverages both intrinsic and extrinsic factors to construct historical events. Specifically, the intrinsic factors of an event are related to its inherent elements, particularly the subject. In contrast, the extrinsic factors are driven by the contextual environment surrounding the event. Therefore, whether the data is structured or unstructured, we construct the historical events based on the subject and the complex event, separately. The details are as follows:

  • Structured Data. For structured data, the method takes the discrete event graph as the input. To capture the intrinsic factors, we use the subject of the current event as a guiding clue to construct the historical event graph $\mathbf{G}_{<t}^{s} = \{G_0^{s}, G_1^{s}, \ldots, G_{t-1}^{s}\}$, where $G_t^{s}$ represents the historical event graph at timestamp $t$ with the same subject as the current event. To account for the extrinsic factors, we construct the historical event graph from the complex event, i.e., $\mathbf{G}_{<t}^{c} = \{G_0^{c}, G_1^{c}, \ldots, G_{t-1}^{c}\}$, where $G_t^{c}$ represents the historical event graph at timestamp $t$ with the same complex event as the current event. Finally, with the highlighting and complementary functions of the images, the input historical event graph is $\mathbf{G}_{input} = [\mathbf{G}_k, \mathbf{G}_r, \mathbf{G}_c]$, where $\mathbf{G}_{input} \in \mathbf{G}_{<t}^{s} \bigcup \mathbf{G}_{<t}^{c}$, and $\mathbf{G}_k$, $\mathbf{G}_r$, and $\mathbf{G}_c$ denote the key events, the remaining events, and the complementary events, respectively.

  • Unstructured Data. For unstructured data, the method takes the textual sub-events as input. First, we identify the events via the historical event graphs built from the subject and the complex event, and find the corresponding textual sub-event sets $\mathbf{A}_{<t}^{s} = \{A_0^{s}, A_1^{s}, \ldots, A_{t-1}^{s}\}$ and $\mathbf{A}_{<t}^{c} = \{A_0^{c}, A_1^{c}, \ldots, A_{t-1}^{c}\}$ through the relationships between textual sub-events and graph sub-events. Then, with the highlighting and complementary functions of the images, the input historical textual sub-events are similarly $\mathbf{A}_{input} = [\mathbf{A}_k, \mathbf{A}_r, \mathbf{A}_c]$, where $\mathbf{A}_{input} \in \mathbf{A}_{<t}^{s} \bigcup \mathbf{A}_{<t}^{c}$, and $\mathbf{A}_k$, $\mathbf{A}_r$, and $\mathbf{A}_c$ denote the key events, the remaining events, and the complementary events, respectively.
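The construction of the ICL input for structured data can be sketched as follows. This is an illustrative reading of the description above, not the released code: events are assumed to be plain `(s, r, o, t, c)` tuples, the set of key events is assumed to have been flagged already by the Image Function Identification module, and the complementary events are assumed to be supplied by the caller.

```python
from typing import List, Sequence, Set, Tuple

Event = Tuple[str, str, str, int, str]  # (subject, relation, object, timestamp, complex event)


def build_icl_input(
    events: Sequence[Event],
    query_subject: str,
    query_complex_event: str,
    t: int,
    key_events: Set[Event],
    complementary_events: Sequence[Event],
) -> Tuple[List[Event], List[Event], List[Event]]:
    """Return (G_k, G_r, G_c) for a query at timestamp t.

    The candidate pool is the union of past events that share the query's
    subject (intrinsic factors) or its complex event (extrinsic factors).
    """
    pool = [
        e for e in events
        if e[3] < t and (e[0] == query_subject or e[4] == query_complex_event)
    ]
    pool.sort(key=lambda e: e[3])                 # chronological order
    g_k = [e for e in pool if e in key_events]    # highlighted by images
    g_r = [e for e in pool if e not in key_events]
    g_c = list(complementary_events)              # assumed given by the caller
    return g_k, g_r, g_c
```

The unstructured case would follow the same pattern, with each selected graph event replaced by its linked textual sub-event.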

3.3.2. Retrieval Augmented Generation (RAG)

Despite the rich information provided by in-context learning methods, the inherent nature of temporal events means that the collected historical events still contain substantial noise. Inspired by recent research on RAG (Lewis et al., 2020), we also adopt the retrieve-then-generate paradigm to find the most relevant historical events and mitigate the problem of noise. Similar to the ICL methods, we utilize two forms of data representation, structured data and unstructured data:

  • Structured Data. Due to the structured nature of the data representation, the event graphs adhere to a unified quintuple format. Therefore, we first retrieve the entities that have interacted with the subject of the query event. Once we have obtained the related entity set, we construct the history from the historical events whose subject or object is within this set. Similarly, through the function of images, the retrieved history is also divided into key events and complementary events.

  • Unstructured Data. Unlike structured data, unstructured data allows us to use embedding techniques to directly retrieve relevant news events from a set of historical news articles. We then filter the retrieved news events by timestamp, eliminating outdated and irrelevant events. We also select the key events and complementary information based on the images, feed them to the model according to the prompts described in Section 3.2, and finally obtain the prediction results.
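The structured retrieval step described above can be sketched as follows. This is a simplified illustration under stated assumptions: a four-field `Event` tuple stands in for the quintuple format, timestamps are integer day indices, and including the query subject itself in the related entity set is our choice rather than something the text specifies.

```python
from typing import List, NamedTuple, Set


class Event(NamedTuple):
    subject: str
    relation: str
    obj: str
    day: int  # timestamp as an integer day index (assumption)


def related_entities(past: List[Event], query_subject: str) -> Set[str]:
    """Entities that have interacted with the query subject, plus the subject."""
    related = {query_subject}
    for e in past:
        if e.subject == query_subject:
            related.add(e.obj)
        elif e.obj == query_subject:
            related.add(e.subject)
    return related


def retrieve_history(events: List[Event], query_subject: str, query_day: int) -> List[Event]:
    """Keep past events whose subject or object lies in the related entity set."""
    past = [e for e in events if e.day < query_day]
    entities = related_entities(past, query_subject)
    return [e for e in past if e.subject in entities or e.obj in entities]


events = [
    Event("Egypt", "host a visit", "Pakistan", 1),
    Event("Pakistan", "consult", "Qatar", 2),
    Event("Iran", "criticize", "Israel", 3),
    Event("Egypt", "sign agreement", "Qatar", 9),
]
demo = retrieve_history(events, "Egypt", query_day=5)
```

Here the second event is retrieved even though Egypt does not appear in it, because Pakistan has previously interacted with the query subject.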

4. Experiments

Table 1. Performance (accuracy) comparison between zero-shot LLM-based methods and non-LLM methods on object entity prediction and relation prediction. For LLM-based methods, we include multiple backbones with two representative forecasting methods, i.e., ICL and RAG. Results of our methods are marked "(ours)"; their key novelty lies in leveraging images through the Image Function Identification module. Gemini models: https://ai.google.dev/models/gemini. GPT-3.5-Turbo: https://platform.openai.com/docs/models/gpt-3-5-turbo.
| Model Type/Backbone | Forecasting Model | Multimodal Model | Object Entity Prediction (Text) | Object Entity Prediction (Graph) | Relation Prediction (Text) | Relation Prediction (Graph) |
| --- | --- | --- | --- | --- | --- | --- |
| Non-LLM | ConvTransE (Shang et al., 2019) | Uni-modal | N.A. | 0.3737 | N.A. | 0.7327 |
| Non-LLM | RGCN (Schlichtkrull et al., 2018) | Uni-modal | N.A. | 0.3777 | N.A. | 0.7203 |
| Non-LLM | RE-GCN (Li et al., 2021) | Uni-modal | N.A. | 0.3879 | N.A. | 0.7333 |
| Non-LLM | LoGo (Ma et al., 2023b) | Uni-modal | N.A. | 0.3969 | N.A. | 0.7406 |
| Gemini-1.0-Pro-Vision | ICL (Lee et al., 2023) | MLLM | 0.3023 | 0.3319 | 0.5541 | 0.6085 |
| Gemini-1.0-Pro-Vision | RAG (Lewis et al., 2020) | MLLM | 0.3305 | 0.3465 | 0.5769 | 0.5848 |
| Gemini-1.0-Pro | ICL (Lee et al., 2023) | Uni-modal | 0.3312 | 0.3657 | 0.5900 | 0.6257 |
| Gemini-1.0-Pro | ICL (Lee et al., 2023) | MM-Forecast (ours) | 0.3527 | 0.3837 | 0.6087 | 0.6324 |
| Gemini-1.0-Pro | RAG (Lewis et al., 2020) | Uni-modal | 0.3340 | 0.3669 | 0.6081 | 0.5866 |
| Gemini-1.0-Pro | RAG (Lewis et al., 2020) | MM-Forecast (ours) | 0.3425 | 0.3692 | 0.6121 | 0.5991 |
| GPT-3.5-Turbo | ICL (Lee et al., 2023) | Uni-modal | 0.3063 | 0.3431 | 0.4847 | 0.5345 |
| GPT-3.5-Turbo | ICL (Lee et al., 2023) | MM-Forecast (ours) | 0.3414 | 0.3522 | 0.5317 | 0.5521 |
| GPT-3.5-Turbo | RAG (Lewis et al., 2020) | Uni-modal | 0.3272 | 0.3397 | 0.4943 | 0.4666 |
| GPT-3.5-Turbo | RAG (Lewis et al., 2020) | MM-Forecast (ours) | 0.3652 | 0.3647 | 0.5152 | 0.5113 |

We conduct experiments to evaluate the proposed approach, and answer the following research questions:

  • RQ1: What is the overall performance of temporal event forecasting methods when visual information is included?

  • RQ2: How do the highlighting and complementary functions of images affect the forecasting performance?

  • RQ3: How do different LLM backbones as well as fine-tuning affect the performance?

4.1. Experimental Settings

We introduce the experimental settings, including the dataset, the methods compared, and the implementation details.

4.1.1. Dataset

We build our dataset on MidEast-TE-mini (Chang et al., 2024), which includes structured atomic events and news articles. We aim to add images that correspond to the events in the dataset, so that the data also covers the visual modality. An intuitive way is to download the web page from the URL provided by the original dataset. However, the original web pages typically contain many irrelevant images, such as advertisements, which are cumbersome and difficult to filter out accurately. Instead of solving this problem directly, we use Google Image Search (https://images.google.com/) to search for images with the news article title as the query. Among the returned images, we select the top-ranked ones as the images associated with the news article. To further filter out irrelevant images, we instruct the Gemini-1.0-Pro-Vision model to determine the relevance of each image to its news article, choosing among three options: highlighting, complementary, and irrelevant. Highlighting means that the image closely matches the content of the news, and complementary means that the image supplements the content of the news. Images that fall into neither category are regarded as irrelevant and are removed. Finally, we name our dataset MidEast-TE-multimodal, or MidEast-TE-mm for short.
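The filtering step can be sketched as below. The `classify` callable is a hypothetical stand-in for a call to an MLLM; the real prompt and API call are not reproduced here, and treating unparseable answers as irrelevant is our assumption.

```python
from typing import Callable, Dict, List

LABELS = ("highlighting", "complementary", "irrelevant")


def filter_images(
    article_text: str,
    image_ids: List[str],
    classify: Callable[[str, str], str],
) -> Dict[str, str]:
    """Keep images labelled highlighting or complementary; drop the rest."""
    kept: Dict[str, str] = {}
    for image_id in image_ids:
        label = classify(article_text, image_id).strip().lower()
        if label not in LABELS:
            label = "irrelevant"  # unparseable answers are dropped (assumption)
        if label != "irrelevant":
            kept[image_id] = label
    return kept


def fake_classify(article_text: str, image_id: str) -> str:
    """Toy classifier for demonstration only; a real one would query an MLLM."""
    answers = {"img1": "Highlighting", "img2": "irrelevant", "img3": "complementary"}
    return answers.get(image_id, "unknown")


kept = filter_images("news text", ["img1", "img2", "img3", "img4"], fake_classify)
```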

4.1.2. Compared Methods

The compared methods are categorised into non-LLM-based and LLM-based methods. Non-LLM-based methods involve only the text or graph modalities, since their architectures are fixed. We train these models on the training set, select the best-performing model on the validation set, and report its results on the testing set. For LLM-based methods, we use proprietary LLMs because of their superior performance compared to open-source LLMs. Because these models are used as-is, testing is done in a zero-shot manner, i.e., they are directly tested on the testing set. The specific methods are listed below:

  • ConvTransE (Shang et al., 2019): This method employs a convolutional neural network (CNN) and a translational operation to capture the relational patterns within triplet data.

  • RGCN (Schlichtkrull et al., 2018): RGCN leverages a graph convolutional neural network (GCN) to capture the diverse relations between entities.

  • RE-GCN (Li et al., 2021): RE-GCN utilizes a combination of GCN and recurrent neural network (RNN) to capture both the relational patterns and temporal dynamics.

  • LoGo (Ma et al., 2023b): This method models relationships within and between complex events from both local and global perspectives.

  • GPT-3.5-Turbo (https://platform.openai.com/docs/models/gpt-3-5-turbo): The GPT-3.5-Turbo model is a widely used version of the GPT (Generative Pre-trained Transformer) language model developed by OpenAI (https://openai.com/).

  • Gemini-1.0 (https://ai.google.dev/models/gemini): Gemini-1.0 is a cutting-edge family of multimodal models developed by the Gemini Team at Google.

4.1.3. Implementation Details

To ensure reproducibility, we fix the temperature of the proprietary LLMs to 0 and set the seed parameter to a constant value. When forecasting, we limit the maximum output length to 256 tokens to prevent invalid responses. To ensure fairness across the experiments, the retrievable history is limited to 30 days. The retrieval models that we employ are BM25 (Robertson et al., 2009), Contriever (Izacard et al., 2021), and LlamaIndex (Liu, 2022). Additionally, considering the limited context window, we restrict the maximum number of sub-events in the historical context to 50. Following previous work (Chang et al., 2024), we use Accuracy (Acc) as the evaluation metric.
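A minimal sketch of the history constraints and the metric follows. The 30-day window and the cap of 50 sub-events come from the text; representing events as `(day, text)` pairs, treating the window boundary as inclusive, and keeping the most recent events when the cap is exceeded are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

HISTORY_DAYS = 30    # retrieval window stated in the text
MAX_SUB_EVENTS = 50  # context cap stated in the text


def select_history(events: List[Tuple[int, str]], query_day: int) -> List[Tuple[int, str]]:
    """Keep events inside the window, then cap their number."""
    in_window = [e for e in events if query_day - HISTORY_DAYS <= e[0] < query_day]
    in_window.sort(key=lambda e: e[0])
    # Keeping the most recent events is an assumption, not a stated rule.
    return in_window[-MAX_SUB_EVENTS:]


def accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their target."""
    if not targets:
        return 0.0
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


sparse = [(d, f"event {d}") for d in range(100)]
dense = [(d, f"event {d}-{i}") for d in range(100) for i in range(2)]
selected_sparse = select_history(sparse, query_day=100)
selected_dense = select_history(dense, query_day=100)
```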

4.2. Performance Comparison (RQ1)

We analyze our model’s performance by comparing it with various baseline methods across different experimental settings, input forms, and retrieval models.

4.2.1. Performance w.r.t. Various Settings.

The overall performance comparison is presented in Table 1. To comprehensively evaluate the methods, we conduct experiments along multiple dimensions, including the format of data representation (Text or Graph), the construction of historical information (RAG-based or ICL-based), and the prediction objective (Object or Relation). We make the following observations.

First, enhancing LLM-based methods with visual information consistently improves their accuracy across all experimental settings. This demonstrates that MM-Forecast makes effective use of visual information, leading to a better contextual understanding of the history. Hence, our method strengthens the inference ability of LLMs and yields more accurate event forecasts.

Second, even though the performance of all LLM-based methods improves, they still generally under-perform the traditional Non-LLM methods. The reason is that the LLM-based methods are tested in a zero-shot manner, whereas the Non-LLM methods are trained with supervision and therefore remain competitive. Notably, with MM-Forecast, LLM-based methods achieve performance close to, and in some cases better than, Non-LLM methods on the object entity prediction task (e.g., 0.3837 for Gemini-1.0-Pro with ICL versus 0.3737 for ConvTransE in the graph setting).

Table 2. The results of using different retrieval models.
| Retriever | Gemini-1.0-Pro | GPT-3.5-Turbo |
| --- | --- | --- |
| BM25 (Robertson et al., 2009) | 0.3272 | 0.3318 |
| Contriever (Izacard et al., 2021) | 0.3335 | 0.3431 |
| LlamaIndex (Liu, 2022) | 0.3425 | 0.3652 |

Third, the relation prediction task exhibits higher accuracy than the object entity prediction task, which suggests that forecasting entities is more challenging than forecasting relations. There are two potential reasons. First, the set of entities (5909) is much larger than the set of relation types (267), so predicting a specific entity is inherently more difficult given the larger candidate pool. Second, we believe that the information carried by entities is more explicit; thus, predicting a relation when two entities are given is easier than predicting an object when the subject and relation are given.
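As a rough illustration of the candidate-pool argument, assume (purely for illustration; this baseline is not evaluated in the experiments) a forecaster that guesses uniformly at random over the candidate set. Its expected accuracy would be:

```latex
\[
\mathrm{Acc}^{\mathrm{rand}}_{\mathrm{entity}} = \frac{1}{5909} \approx 1.7\times 10^{-4},
\qquad
\mathrm{Acc}^{\mathrm{rand}}_{\mathrm{relation}} = \frac{1}{267} \approx 3.7\times 10^{-3},
\qquad
\frac{\mathrm{Acc}^{\mathrm{rand}}_{\mathrm{relation}}}{\mathrm{Acc}^{\mathrm{rand}}_{\mathrm{entity}}} = \frac{5909}{267} \approx 22.1 .
\]
```

Under this assumption, the chance level for relation prediction is roughly 22 times higher than for entity prediction, which is in line with the first reason given above.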

Figure 3. Ablation study of each type of image function.

4.2.2. Performance w.r.t. Directly Using Images.

To illustrate the limitations of existing MLLMs in temporal event forecasting, we also conduct experiments in which the Gemini-1.0-Pro-Vision model (Team et al., 2023) directly consumes the images in the sub-events. Specifically, this approach relies on the inherent image understanding capabilities of Gemini-1.0-Pro-Vision, which embeds image patches as features and concatenates these image features with textual features. From Table 1, we observe that the accuracy of using images directly is not only lower than that of MM-Forecast, but also lower than that of the methods using only textual data (Uni-modal methods). This illustrates that existing proprietary MLLMs still struggle to forecast events effectively from multiple images, and it reflects the advantage of MM-Forecast.

4.2.3. Performance w.r.t. Various Retrieval Models.

The choice of retrieval model may have a significant impact on forecasting. The experiments here involve only unstructured event forecasting, since the structured approach retrieves by keyword search. To explore the effect of the retrieval model, we plug three different retrieval models, i.e., BM25 (Robertson et al., 2009), Contriever (Izacard et al., 2021), and LlamaIndex (Liu, 2022), into our forecasting framework and compare the forecasting results. From the results in Table 2, we observe that performance improves progressively with stronger retrieval models: LlamaIndex performs best, followed by Contriever and then BM25. These results indicate that stronger retrieval capabilities lead to better forecasting performance, suggesting that retrieval-oriented method design is a promising direction for future research. This is consistent with observations from recent work (Chang et al., 2024).
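For readers unfamiliar with the weakest of the three retrievers, the following is a generic textbook sketch of Okapi BM25 scoring. It is not the implementation used in the experiments; the defaults `k1=1.5` and `b=0.75`, the whitespace tokenization, and the non-negative IDF variant are common choices that we assume here, and the document list is assumed to be non-empty.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "egypt and pakistan hold bilateral talks".split(),
    "oil prices rise in global markets".split(),
    "pakistan foreign minister visits egypt".split(),
]
scores = bm25_scores("pakistan egypt".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
```

Both matching documents contain each query term once, so the shorter one ranks first because of BM25's length normalization. Dense retrievers such as Contriever replace this lexical score with embedding similarity.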

4.3. Study of the Image Functions (RQ2)

4.3.1. Effects of Image Functions

We conduct ablation experiments on the highlighting and complementary functions of images. The results are shown in Figure 3. First, the model that leverages both the highlighting of key events and the complementary information performs best across the experimental settings, and the model with only key events highlighted is second best. This illustrates the effectiveness of the highlighting function and shows that highlighting and complementary information reinforce each other to achieve even better prediction results. Second, in some settings (Text-ICL, Text-RAG), the model with only complementary information performs even worse than the baseline model. A possible reason is that the complementary information also introduces additional noise and therefore degrades performance. Third, the RAG-based method performs noticeably worse than the ICL-based method on the relation prediction task, whereas no such gap exists on the entity prediction task. This may be because relation prediction is easier than object entity prediction, as discussed in Section 4.2.1. As a result, the ICL-based history may already contain enough information for accurate relation prediction, whereas the retrieval model may fail to retrieve the relevant information.

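Table 3. The accuracy of image function identification.
| Function | GPT-4-Vision (Text) | GPT-4-Vision (Graph) | Human (Text) | Human (Graph) |
| --- | --- | --- | --- | --- |
| Highlighting | 0.68 | 0.68 | 0.73 | 0.83 |
| Complementary | 0.88 | 0.93 | 0.87 | 0.86 |

Table 4. Result comparison between using our identified and randomly-assigned image functions.
| Model | Setting | Object (Text) | Object (Graph) | Relation (Text) | Relation (Graph) |
| --- | --- | --- | --- | --- | --- |
| GPT-3.5-Turbo | Random | 0.3284 | 0.3394 | 0.5156 | 0.5249 |
| GPT-3.5-Turbo | Ours | 0.3414 | 0.3522 | 0.5317 | 0.5521 |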
Figure 4. Case study: two examples in which considering the highlighting and complementary functions of images yields better forecasting results than the baselines.

4.3.2. Analysis of the Image Function Identification

In addition to analyzing overall forecasting performance, we conduct an in-depth study to directly assess the efficacy of the highlighting and complementary functions. Specifically, we design additional experiments at the data level and the prompt level to further verify the functions of images. At the data level, we randomly sample 100 images from each of the two categories and then judge the correctness of the classification using the powerful MLLM GPT-4-Vision (https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4) and human evaluation. As shown in Table 3, the classification of both highlighting and complementary functions shows high accuracy. Furthermore, the accuracy for highlighting is lower than that for complementary in all settings, which is likely due to its stricter definition. The high accuracy under both LLM and human judgment indicates that the images we use can indeed serve the highlighting and complementary functions. Beyond directly assessing the quality of image function identification, we conduct another ablation study by replacing our identified functions with randomly selected sub-events. As shown in Table 4, random selection of sub-events leads to a decrease in forecasting accuracy, indicating that correct image function identification is crucial to forecasting.
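To make the size of the effect in Table 4 concrete, the short snippet below computes the absolute accuracy gain of the identified functions over random assignment for GPT-3.5-Turbo. The numbers are copied from Table 4; the dictionary keys are our own labels.

```python
random_assign = {
    "object/text": 0.3284,
    "object/graph": 0.3394,
    "relation/text": 0.5156,
    "relation/graph": 0.5249,
}
identified = {
    "object/text": 0.3414,
    "object/graph": 0.3522,
    "relation/text": 0.5317,
    "relation/graph": 0.5521,
}

# Absolute gain of identified over randomly assigned image functions.
gains = {k: round(identified[k] - random_assign[k], 4) for k in identified}
largest = max(gains, key=gains.get)
```

The gain is positive in all four settings and largest for relation prediction on graph input.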

Finally, in addition to the quantitative evaluation, we conduct a qualitative analysis and present two examples in Figure 4. The first image emphasizes the event of Makhdoom Shah Mahmood Qureshi’s visit to Abdel Fattah Al-Sisi, highlighting their efforts to strengthen and diversify bilateral relations. This highlighting function leads to a correct prediction of the event type. The second image provides supplementary information about the meeting between the two politicians, enabling an accurate answer to the forecasting question.

4.4. Performance on Open-source and Fine-tuned LLMs (RQ3)

All of the above LLM-based forecasting backbones are proprietary LLMs used in a zero-shot manner without any fine-tuning. We are interested in how our method performs on open-source LLMs, especially fine-tuned ones. To address this question, we select one of the most popular open-source LLMs, Vicuna-7B, to replace the forecasting backbone in our framework, and evaluate it both in a zero-shot manner and after fine-tuning with typical instruction tuning using QLoRA (Dettmers et al., 2023). The results of object entity prediction are presented in Table 5, which also includes the best results of proprietary LLMs and non-LLM methods. We observe that the zero-shot performance of Vicuna-7B is worse than that of the proprietary LLMs, owing to the inherent capacity gap. However, after fine-tuning, Vicuna-7B achieves substantial gains, not only surpassing the proprietary LLMs but also outperforming all the non-LLM methods. In addition to object entity prediction, we also fine-tune on the relation prediction task, as shown in Table 6. In both the text and graph settings, the relation prediction results are consistent with those of entity prediction, i.e., the fine-tuned LLM achieves the best performance. These results demonstrate the significant potential of fine-tuning LLMs for temporal event forecasting.
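Instruction tuning requires each forecasting example to be cast as an instruction/response pair. The sketch below shows one plausible way to do this; the field names, the instruction wording, and the event formatting are hypothetical and are not the format used in the experiments.

```python
import json
from typing import Dict, List


def to_instruction_sample(history: List[str], query: str, answer: str) -> Dict[str, str]:
    """Pack one forecasting example into an instruction/input/output record."""
    return {
        "instruction": "Given the historical events, predict the missing object entity.",
        "input": "\n".join(history) + f"\nQuery: {query}",
        "output": answer,
    }


sample = to_instruction_sample(
    ["2023-01-01: A, consult, B", "2023-01-03: B, host a visit, A"],
    "2023-01-05: A, sign agreement, ?",
    "B",
)
line = json.dumps(sample)  # one JSON line per training example
```

Records of this kind can then be fed to a standard supervised fine-tuning loop; the QLoRA-specific parts (4-bit quantization and low-rank adapters) are orthogonal to the data format and are omitted here.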

Table 5. Performance of fine-tuned LLMs and their comparison with proprietary LLMs and non-LLM methods.
| Setting | Model | Vicuna-7B | LLM | Non-LLM |
| --- | --- | --- | --- | --- |
| zero-shot | MM-Forecast-text-h | 0.2723 | 0.3527 | N.A. |
| zero-shot | MM-Forecast-graph-h | 0.2502 | 0.3837 | N.A. |
| fine-tune | MM-Forecast-text-h | 0.4490 | N.A. | N.A. |
| fine-tune | MM-Forecast-graph-h | 0.5480 | N.A. | 0.3969 |

Table 6. Performance of fine-tuned LLMs on relation prediction.
| Model | Vicuna-7B | LLM | Non-LLM |
| --- | --- | --- | --- |
| MM-Forecast-text-h | 0.7809 | 0.6087 | N.A. |
| MM-Forecast-graph-h | 0.7901 | 0.6324 | 0.7406 |

5. Conclusion and Future Work

In this paper, we studied an emerging and interesting problem of multimodal temporal event forecasting. We identified two essential image functions in the scenario of temporal event forecasting, i.e., highlighting and complementary. Then, we introduced MM-Forecast, a novel framework that leverages visual information to enhance temporal event forecasting. By recognizing the highlighting and complementary functions of images and translating them into verbal descriptions, we were able to seamlessly integrate this visual information into LLM-based forecasting models, which in turn improved temporal event forecasting performance.

Looking ahead, there are numerous avenues for future work. In particular, we highlight three aspects that warrant further exploration. First, relationships among multiple images need to be considered: images in related historical events are inherently related, and these relationships are also important for event forecasting. Second, seeing is believing: beyond improving accuracy, images can also affect the credibility or trustworthiness of forecasts. Third, our current solution is still a multi-step pipeline, and devising an end-to-end approach using MLLMs is an intriguing direction for future work.

Acknowledgements.
This work is partially supported by the National Natural Science Foundation of China under grant 62220106008, U20B2063 and 62102070. This work is also partially supported by Sichuan Science and Technology Program under grant 2023NSFSC1392. This research is also supported by Asian Institute of Digital Finance and NExT Research Center.

References

  • Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. In NeurIPS.
  • Benjamin et al. (2023) Daniel M Benjamin, Fred Morstatter, Ali E Abbas, Andres Abeliuk, Pavel Atanasov, Stephen Bennett, Andreas Beger, Saurabh Birari, David V Budescu, Michele Catasta, et al. 2023. Hybrid forecasting of geopolitical events. AI Magazine (2023).
  • Bin et al. (2023) Yi Bin, Haoxuan Li, Yahui Xu, Xing Xu, Yang Yang, and Heng Tao Shen. 2023. Unifying two-stream encoders with transformers for cross-modal retrieval. In Proceedings of the 31st ACM International Conference on Multimedia. 3041–3050.
  • Chang et al. (2024) He Chang, Chenchen Ye, Zhulin Tao, Jie Wu, Zhengmao Yang, Yunshan Ma, Xianglin Huang, and Tat-Seng Chua. 2024. A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting. arXiv preprint arXiv:2407.11638 (2024).
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
  • Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
  • Deng et al. (2024) Songgaojun Deng, Maarten de Rijke, and Yue Ning. 2024. Advances in Human Event Modeling: From Graph Neural Networks to Language Models. (2024).
  • Dettmers et al. (2018) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2D Knowledge Graph Embeddings. In AAAI. AAAI Press, 1811–1818.
  • Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. CoRR abs/2305.14314 (2023).
  • Ding et al. (2024) Yujuan Ding, Yunshan Ma, Wenqi Fan, Yige Yao, Tat-Seng Chua, and Qing Li. 2024. Fashionregen: Llm-empowered fashion report generation. In Companion Proceedings of the ACM on Web Conference 2024. 991–994.
  • Gholipour Ghalandari et al. (2020) Demian Gholipour Ghalandari, Chris Hokamp, Nghia The Pham, John Glover, and Georgiana Ifrim. 2020. A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 1302–1308. https://doi.org/10.18653/v1/2020.acl-main.120
  • Izacard et al. (2021) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118 (2021).
  • Jiao et al. (2023) Yizhu Jiao, Ming Zhong, Jiaming Shen, Yunyi Zhang, Chao Zhang, and Jiawei Han. 2023. Unsupervised Event Chain Mining from Multiple Documents. In WWW. ACM, 1948–1959.
  • Jin et al. (2021) Woojeong Jin, Rahul Khanna, Suji Kim, Dong-Ho Lee, Fred Morstatter, Aram Galstyan, and Xiang Ren. 2021. ForecastQA: A Question Answering Challenge for Event Forecasting with Temporal Text Data. In ACL/IJCNLP (1). Association for Computational Linguistics, 4636–4650.
  • Jin et al. (2020) Woojeong Jin, Meng Qu, Xisen Jin, and Xiang Ren. 2020. Recurrent Event Network: Autoregressive Structure Inference over Temporal Knowledge Graphs. In EMNLP (1). Association for Computational Linguistics, 6669–6683.
  • Lee et al. (2023) Dong-Ho Lee, Kian Ahrabian, Woojeong Jin, Fred Morstatter, and Jay Pujara. 2023. Temporal Knowledge Graph Forecasting Without Knowledge Using In-Context Learning. In EMNLP. Association for Computational Linguistics, 544–557.
  • Lewis et al. (2020) Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In NeurIPS.
  • Li et al. (2024b) Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao, et al. 2024b. Multimodal foundation models: From specialists to general-purpose assistants. Foundations and Trends® in Computer Graphics and Vision 16, 1-2 (2024), 1–214.
  • Li et al. (2023) Haoxuan Li, Yi Bin, Junrong Liao, Yang Yang, and Heng Tao Shen. 2023. Your negative may not be true negative: Boosting image-text matching with false negative elimination. In Proceedings of the 31st ACM International Conference on Multimedia. 924–934.
  • Li et al. (2024a) Jun Li, Yi Bin, Liang Peng, Yang Yang, Yangyang Li, Hao Jin, and Zi Huang. 2024a. Focusing on Relevant Responses for Multi-modal Rumor Detection. IEEE Transactions on Knowledge and Data Engineering (2024).
  • Li et al. (2022) Manling Li, Ruochen Xu, Shuohang Wang, Luowei Zhou, Xudong Lin, Chenguang Zhu, Michael Zeng, Heng Ji, and Shih-Fu Chang. 2022. CLIP-Event: Connecting Text and Images with Event Structures. In CVPR. IEEE, 16399–16408.
  • Li et al. (2021) Zixuan Li, Xiaolong Jin, Wei Li, Saiping Guan, Jiafeng Guo, Huawei Shen, Yuanzhuo Wang, and Xueqi Cheng. 2021. Temporal Knowledge Graph Reasoning Based on Evolutional Representation Learning. In SIGIR. ACM, 408–417.
  • Liang et al. (2024) Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, and Qingsong Wen. 2024. Foundation Models for Time Series Analysis: A Tutorial and Survey. arXiv preprint arXiv:2403.14735 (2024).
  • Liao et al. (2023) Ruotong Liao, Xu Jia, Yunpu Ma, and Volker Tresp. 2023. GenTKG: Generative Forecasting on Temporal Knowledge Graph. CoRR abs/2310.07793 (2023).
  • Liu et al. (2024) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. In NeurIPS.
  • Liu (2022) Jerry Liu. 2022. LlamaIndex. https://doi.org/10.5281/zenodo.1234
  • Luo et al. (2024) Ruilin Luo, Tianle Gu, Haoling Li, Junzhe Li, Zicheng Lin, Jiayi Li, and Yujiu Yang. 2024. Chain of History: Learning and Forecasting with LLMs for Temporal Knowledge Graph Completion. CoRR abs/2401.06072 (2024).
  • Lv et al. (2020) Shangwen Lv, Fuqing Zhu, and Songlin Hu. 2020. Integrating external event knowledge for script learning. In Proceedings of the 28th International Conference on Computational Linguistics. 306–315.
  • Ma et al. (2023a) Yunshan Ma, Chenchen Ye, Zijian Wu, Xiang Wang, Yixin Cao, and Tat-Seng Chua. 2023a. Context-aware Event Forecasting via Graph Disentanglement. In KDD. ACM, 1643–1652.
  • Ma et al. (2023b) Yunshan Ma, Chenchen Ye, Zijian Wu, Xiang Wang, Yixin Cao, Liang Pang, and Tat-Seng Chua. 2023b. Structured, Complex and Time-complete Temporal Event Forecasting. CoRR abs/2312.01052 (2023).
  • Morstatter (2021) Fred Morstatter. 2021. RCT-B. (2021). https://doi.org/10.7910/DVN/ROTHFT
  • Ning et al. (2020) Qiang Ning, Hao Wu, Rujun Han, Nanyun Peng, Matt Gardner, and Dan Roth. 2020. TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions. In EMNLP. 1158–1172. https://doi.org/10.18653/v1/2020.emnlp-main.88
  • Park et al. (2022) Namyong Park, Fuchen Liu, Purvanshi Mehta, Dana Cristofor, Christos Faloutsos, and Yuxiao Dong. 2022. EvoKG: Jointly Modeling Event Time and Network Structure for Reasoning over Temporal Knowledge Graphs. In WSDM. ACM, 794–803.
  • Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389.
  • Schlichtkrull et al. (2018) Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling Relational Data with Graph Convolutional Networks. In ESWC (Lecture Notes in Computer Science, Vol. 10843). Springer, 593–607.
  • Shang et al. (2019) Chao Shang, Yun Tang, Jing Huang, Jinbo Bi, Xiaodong He, and Bowen Zhou. 2019. End-to-End Structure-Aware Convolutional Networks for Knowledge Base Completion. In AAAI. AAAI Press, 3060–3067.
  • Sun et al. (2023) Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Heung-Yeung Shum, and Jian Guo. 2023. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model with Knowledge Graph. CoRR abs/2307.07697 (2023).
  • Sun et al. (2019) Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space. In ICLR (Poster). OpenReview.net.
  • Tan et al. (2023) Qingyu Tan, Hwee Tou Ng, and Lidong Bing. 2023. Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models. In ACL. Association for Computational Linguistics, 14820–14835. https://doi.org/10.18653/v1/2023.acl-long.828
  • Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
  • Tong et al. (2020) Meihan Tong, Shuai Wang, Yixin Cao, Bin Xu, Juanzi Li, Lei Hou, and Tat-Seng Chua. 2020. Image Enhanced Event Detection in News Articles. In AAAI. AAAI Press, 9040–9047.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. CoRR abs/2302.13971 (2023).
  • Wang and Zhao (2023) Yuqing Wang and Yun Zhao. 2023. TRAM: Benchmarking Temporal Reasoning for Large Language Models. (2023). arXiv:2310.00835
  • Xu et al. (2023) Wenjie Xu, Ben Liu, Miao Peng, Xu Jia, and Min Peng. 2023. Pre-trained Language Model with Prompts for Temporal Knowledge Graph Completion. In ACL (Findings). Association for Computational Linguistics, 7790–7803.
  • Yang et al. (2015) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In ICLR (Poster).
  • Ye et al. (2024) Chenchen Ye, Ziniu Hu, Yihe Deng, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, and Wei Wang. 2024. MIRAI: Evaluating LLM Agents for Event Forecasting. arXiv preprint arXiv:2407.01231 (2024).
  • Zhang and Choi (2021) Michael Zhang and Eunsol Choi. 2021. SituatedQA: Incorporating Extra-Linguistic Contexts into QA. In EMNLP.
  • Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022).
  • Zhang et al. (2024) Zhihan Zhang, Yixin Cao, Chenchen Ye, Yunshan Ma, Lizi Liao, and Tat-Seng Chua. 2024. Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding. arXiv preprint arXiv:2406.02472 (2024).
  • Zhao (2021) Liang Zhao. 2021. Event prediction in the big data era: A systematic survey. ACM Computing Surveys (CSUR) 54, 5 (2021), 1–37.
  • Zhou et al. (2019) Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. “Going on a vacation” takes longer than “Going for a walk”: A Study of Temporal Commonsense Understanding. In EMNLP. 3363–3369. https://doi.org/10.18653/v1/D19-1332

Appendix A Appendix

Table 7. Prompts of image function identification module.
Identification You are a professional news writer.
Please judge the relationship between images and news based on the following rules:
1. Final judgment please choose between [highlighting, complementary, irrelevant].
2. The relationship between an image and a news article is highlighting if the image’s subject matter and depicted
      event are highly related to the news and the specific event shown in the image is already mentioned in detail in
      the article’s description.
3. The relationship between an image and a news article is complementary if the image’s overall theme and
      background information are highly related to the news, but the specific event depicted in the image is not
      mentioned in detail in the article, and the visual information in the image can complement the news story as
      a whole.
4. Except in cases where the relationship is highlighting or complementary, in other cases, the relationship
      between the image and the text is irrelevant.
Highlighting You are a professional news writer.
Please determine which sub-event in the news the image is most relevant to based on the following rules:
1. For the final judgement, please answer with the serial number of the sub-event. For example: [The number of
      the sub-event most relevant to the image is 1.]
2. Identify the main subjects or objects prominently featured in the image. Sub-events that provide details,
      background information or context directly about these central visual elements are highly relevant.
3. If people are depicted, identify who those individuals are. Sub-events involving those particular people should
      take priority.
4. Analyze the overall activities, actions, emotions or mood being portrayed in the image. Relevant sub-events
      likely delve into similar situations, occurrences or sentiments illustrated.
5. Take note of the specific location, setting or environment depicted in the image. Prioritize sub-events that
      discuss that geographic area, type of place, or related events.
6. Look for any text, logos, labeled items or signs visible in the image content. Sub-events elaborating on the
      organizations, companies, products or public figures represented by those texts are applicable.
Complementary: You are a professional news writer.
Please extract the image information according to the following rules based on the content of the provided news:
1. Extract the image information as a sub-event. Instead of multiple sub-events.
2. The phrases: [In the image], [The image shows], [In the picture], [The image is], [In the photo], etc, should
      never appear in the summarised sub-event.
3. Identify the primary focus or subject of the image that represents the core piece of information being conveyed.
      This main subject should serve as the central point around which the image information is extracted.
4. Directly relate the extracted image information to the associated news event covered in the article. The image
      summary should complement and enhance the understanding of the news content, not introduce unrelated
      information.
5. Prioritize and emphasize the most newsworthy and significant details visible in the image. These could include
      specific actions, emotions, or identifying characteristics of the main subject.
6. Ensure that all information included in the image summary originates directly from the provided image
      and news article. Avoid introducing fabricated content, speculative details.
7. Aim for a succinct summary, using clear and straightforward language. Avoid excessive detail or subjective
      commentary.
8. Maintain an objective and impartial tone when describing the image. Avoid inserting personal opinions or
      interpretations.
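
The prompts in Table 7 constrain the shape of the model's replies: the highlighting prompt asks for a sentence ending in a sub-event serial number, and the complementary prompt forbids phrases such as "The image shows". A minimal sketch of how such replies might be post-processed is given below. The function names and the fallback behavior are illustrative assumptions and are not taken from the released code.

```python
import re
from typing import List, Optional

# Phrases that rule 2 of the complementary prompt says must never appear.
BANNED_PHRASES = (
    "in the image",
    "the image shows",
    "in the picture",
    "the image is",
    "in the photo",
)


def parse_sub_event_number(reply: str) -> Optional[int]:
    """Extract the sub-event serial number from a highlighting reply.

    Prefers the sentence template requested by the prompt and falls back to
    the first integer in the reply (the fallback is an assumption).
    """
    match = re.search(r"relevant to the image is\s*(\d+)", reply)
    if match is None:
        match = re.search(r"(\d+)", reply)
    return int(match.group(1)) if match else None


def find_banned_phrases(summary: str) -> List[str]:
    """Return the forbidden phrases that occur in a complementary summary."""
    lowered = summary.lower()
    return [phrase for phrase in BANNED_PHRASES if phrase in lowered]
```

A reply for which `parse_sub_event_number` returns `None`, or a summary for which `find_banned_phrases` returns a non-empty list, could be sent back to the model for another attempt.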

A.1. Prompts: Image Function

This section lists the prompts used in the Image Function Identification module. As shown in Table 7, the first row is the prompt for image function recognition, which judges the function of an image mainly from two perspectives: its subject and background, and the specific event it depicts. The last two rows are the function-specific prompts, which carry out the highlighting and complementary functions respectively and transform the image information into verbal descriptions. Finally, these verbal descriptions are integrated into the LLM-based event forecasting model.
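
The two-stage use of these prompts can be sketched as follows: the identification prompt yields one of three labels, and that label selects the follow-up prompt. This is a minimal sketch under stated assumptions. The `ask_mllm` callable, the `prompts` dictionary keys, and the choice to discard irrelevant or ambiguous cases are hypothetical and do not describe the released implementation.

```python
from typing import Callable, Dict, Optional

# The three labels allowed by rule 1 of the identification prompt.
LABELS = ("highlighting", "complementary", "irrelevant")


def parse_function_label(reply: str) -> Optional[str]:
    """Return the label named in the reply, or None if zero or several appear."""
    lowered = reply.lower()
    found = [label for label in LABELS if label in lowered]
    return found[0] if len(found) == 1 else None


def describe_image(
    news: str,
    image_path: str,
    ask_mllm: Callable[[str, str, str], str],  # hypothetical: (prompt, news, image) -> reply
    prompts: Dict[str, str],  # keys assumed: "identification", "highlighting", "complementary"
) -> Optional[Dict[str, str]]:
    """Identify the image function, then request its verbal description."""
    label = parse_function_label(ask_mllm(prompts["identification"], news, image_path))
    if label is None or label == "irrelevant":
        # Assumption: images with no usable function contribute nothing.
        return None
    description = ask_mllm(prompts[label], news, image_path)
    return {"function": label, "description": description}
```

The returned description is the verbal information that would then be passed to the forecasting model alongside the textual events.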

A.2. Case study: Image Function

To further illustrate that our approach identifies the truly key events and the required complementary information, we provide additional examples. In the first example of the highlighting function, the image directly depicts Ocasio-Cortez against a background that appears to be a Congressional site, thereby emphasizing the relevant key event. Correspondingly, the key event also mentions the relationship between Congress and Ocasio-Cortez, and an accurate prediction is achieved. In the second example of the highlighting function, the key event highlighted by the image directly mentions the disqualification of Ali Larijani from the election, which aligns with the result to be predicted and with the information needed to reach it. In the first example of the complementary function, the image provides information about the signing of a free trade agreement between Turkey and the United Kingdom. While enhanced trade has the potential to lead to employment and economic growth, the image offers complementary information on the role of labor, and an accurate forecast is achieved. In the second example of the complementary function, the image shows Bernie Sanders, who is a democratic progressive socialist like Ocasio-Cortez. The two share many commonalities and connections to Congress, which provides supplementary information for predicting the outcome more accurately. These examples illustrate the two distinct functions, highlighting key events and providing complementary information, and support the effectiveness of our approach in leveraging multimodal information for temporal event forecasting.

Figure 5. The case study of highlighting function of image.
Figure 6. The case study of highlighting function of image.
Figure 7. The case study of complementary function of image.
Figure 8. The case study of complementary function of image.