ESQA: Event Sequences Question Answering

Irina Abdullaeva
AIRI
Moscow, Russia
Andrei Filatov
Sber AI, Skoltech
Moscow, Russia
Mikhail Orlov
Sber AI Lab
Moscow, Russia
Ivan Karpukhin
Sber AI Lab
Moscow, Russia
Viacheslav Vasilev
Sber AI, MIPT
Moscow, Russia
Denis Dimitrov
Sber AI, AIRI
Moscow, Russia
Andrey Kuznetsov
AIRI, Sber AI
Moscow, Russia
Ivan Kireev
Sber AI Lab
Moscow, Russia
Andrey Savchenko
Sber AI Lab
Moscow, Russia

Abstract

Event sequences (ESs) arise in many practical domains including finance, retail, social networks, and healthcare. In the context of machine learning, event sequences can be seen as a special type of tabular data with annotated timestamps. Despite the importance of ES modeling and analysis, little effort has been made to adapt large language models (LLMs) to the ES domain. In this paper, we highlight the common difficulties of ES processing and propose a novel solution capable of solving multiple downstream tasks with little or no finetuning. In particular, we address the problem of working with long sequences and improve the processing of time and numeric features. The resulting method, called ESQA, effectively utilizes the power of LLMs and, according to extensive experiments, achieves state-of-the-art results in the ES domain.

1 Introduction

Temporal data often comes in the form of event sequences, where each event is characterized by the arrival time and additional structured data. This type of data is widespread in domains like geoscience (Bergen et al., 2019), healthcare (Esteva et al., 2019), sociology (Hossain et al., 2020), industry (Choi et al., 2021), e-commerce (Ni et al., 2018) and finance (Babaev et al., 2022). Event sequences combine properties of time series and tabular data while having major differences. Unlike time series, events can arrive with irregular time steps and can have structured annotations, similar to tabular datasets. Unlike tabular data, events have timestamps and associated order. These differences require special data processing, modeling, and inference approaches.

The new frontier in machine learning, especially in deep learning, focuses on adapting large language models (LLMs) to domains beyond language. The reason behind this adaptation is that LLMs can use additional information not found in domain-specific data, process the textual context of the underlying task, generate answers in free natural-language form, justify their decisions, and support dialog with the user. The potential benefits of using LLMs include improved modeling quality and generalization. The latter means that the hybrid model can solve new problems with little or no finetuning, which greatly increases the applicability of the model and reduces development costs. Successful applications of LLMs were demonstrated for both time series (Cai et al., 2023) and tabular datasets (Dinh et al., 2022), but no effort has been made to adapt LLMs to event sequences such as financial transactions, electronic health records, and activity on different devices. These data characterise human life and are used to personalise many AI services across different domains.

Event sequence processing with LLMs encounters several difficulties. First, structured data must be effectively encoded at the LLM’s input. Textual representation considerably increases the sequence length and cannot be processed efficiently by modern Transformer models due to their quadratic complexity. Second, the desired method must be capable of processing long input sequences, even when the downstream tasks require historical data analysis. The problem is similar to the first but focuses on the model architecture rather than input processing. Finally, time features and event order must be properly provided to the model, as they constitute the essence of event sequences and include important information for solving downstream tasks.

In this paper, we propose a new neural architecture, called ESQA, that exploits the power of LLMs to model event sequences and to solve associated practical tasks. In particular, we develop, for the first time, a question-answering approach with an LLM backbone in the event sequences domain. We show that the proposed model is capable of solving multiple downstream tasks without finetuning. When finetuned, ESQA outperforms other methods and achieves a new state of the art.

2 Background

Event Sequences. We assume that events, denoted as $e_{i}$ , are arranged in sequences $S_{n}={\{e_{i}\}}_{i=1}^{I_{n}}$ based on their association with a common entity. Here, ${I_{n}}$ represents the number of events in the sequence $S_{n}$ . An entity could represent a bank customer or a web user, while the events within the sequence might include actions like a completed transaction or a series of clicks. These events are connected by a temporal order: $t(e_{i})<t(e_{i+1})$ , where $t(.)$ indicates the time at which the event occurs. Event sequences encompass a diverse range of attributes, with each event, $e_{i}$ , characterized by a set of features ${\{c_{j}\}}_{j=1}^{C}$ . These features can be depicted as a vector of values with dimension $C$ . Additionally, $Y_{m}$ represents the target variable vector for the problem at hand, which may be based on the value of a sequence feature $c_{m}$ or external variables, such as a bank client’s default status. Attributes of events comprise both numeric $c_{j}^{num}$ and categorical features $c_{j}^{cat}$ of various types. Categorical features define attribute values within a finite set of categories $c_{j}^{cat}\in|c_{j}|=\{cat_{j;1},...,cat_{j;K_{j}}\}$ , where $K_{j}$ denotes the number of possible values for the feature $c_{j}^{cat}$ (Lane, 2003). Numerical features $c_{j}^{num}\in\mathbb{R}$ are those represented as numbers, allowing meaningful arithmetic operations to be performed (Lane, 2003).
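The notation above can be made concrete with a small data structure. The following is a minimal sketch, not the authors' implementation: the class names `Event` and `EventSequence` and the feature name `mcc` are hypothetical and chosen only for illustration. It stores the events of one entity and enforces the strict temporal order $t(e_{i})<t(e_{i+1})$.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

FeatureValue = Union[str, float]  # categorical or numeric feature value


@dataclass
class Event:
    time: float                        # t(e_i), e.g. a Unix timestamp
    features: Dict[str, FeatureValue]  # the C features of the event


@dataclass
class EventSequence:
    entity_id: str                     # e.g. a bank customer or web user
    events: List[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        # Enforce t(e_i) < t(e_{i+1}) as assumed in the text.
        if self.events and event.time <= self.events[-1].time:
            raise ValueError("events must be strictly increasing in time")
        self.events.append(event)


seq = EventSequence("client_1")
seq.append(Event(1.0, {"mcc": "5411", "amount": 12.5}))
seq.append(Event(2.5, {"mcc": "5812", "amount": 40.0}))
print(len(seq.events))
```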

LLMs for Tabular Data. Large Language Models (LLMs) are a family of neural architectures pretrained on a large corpus of texts. LLMs accept inputs in the form of text and generate textual output. In practice, LLM architecture is composed of three main blocks. The first one is an embedding layer, that converts input text to a sequence of numeric vectors known as embeddings. The second block, the backbone, transforms input embeddings to the output embeddings sequence with possibly different length. The final part of the model maps embeddings to the output text.

There are two main approaches for encoding tabular data at the input of an LLM. The first one is to provide a description of each table field in textual form (Dinh et al., 2022). This approach offers little flexibility and produces extremely long input sequences. The second approach is to replace the embedding layer with a newly designed module capable of directly encoding table fields to embeddings with the required number of features. The latter approach is also known as embedding injection and usually achieves better results (Koh et al., 2023; Huang et al., 2023).

Question Answering with LLMs. The popular way to solve problems with LLMs is to design a question such that a valid answer to this question solves the problem (Dinh et al., 2022). The question must include the context, i.e. all necessary data required for reasoning, and the task definition. This way LLM input is usually composed of the context, task, and connecting words indicating the boundaries of each part.

3 Event Sequences Question Answering

The general view of the proposed model, called Event Sequences Question Answering (ESQA), is presented in Figure 1. Below we will give a detailed description of the model’s input and the backbone LLM.

3.1 Questions and answers construction

The concept behind this method is to frame all tasks involving temporally structured data as natural language questions and answers. Each task from ${\{task_{m}\}}_{m=1}^{M}$ takes the form $\{Q_{m},X_{m},A_{m}\}$ , where $Q_{m}$ is the question that defines the problem, $X_{m}$ represents the input data, and $A_{m}$ is the answer sought based on the target variable $Y_{m}$ .

A question $Q_{m}$ consists of two components: the prefix and the question body. The prefix initiates the query token sequence and is placed before embeddings of other modalities. The question body then describes the task in textual form. For example, the task of determining the most frequent value of feature $c_{m}$ is represented as: “What is the most frequent value of $c_{m}$ in the entire dataset?”.

Given the nature of time-structured data, we classify questions into two types: extractive and predictive. Extractive questions focus on tasks involving existing event sequences, such as computing statistics or identifying trends and characteristics. Predictive questions, on the other hand, pertain to tasks concerning the prediction of future events or attributes based on available data.

Tasks and their corresponding questions can also be categorized based on the type of response sought: binary, multiple choice, or open-ended. Binary questions seek a straightforward answer, either as $A_{m}\in\{0,1\}$ or in the form of “Yes” or “No”. For instance, a question like “Is drinking water the most frequently purchased product?” can be answered with a simple “Yes” or “No”.

In contrast to binary questions, multiple-choice and open-ended questions assume a specific answer corresponding to the required feature, whether numerical $A_{m}\in\mathbb{R}$ or categorical $A_{m}\in|c_{j}|=\{cat_{1},\dots,cat_{K}\}$ . Multiple-choice questions provide a list of possible answer choices. For example, one might ask “What is the most frequently purchased product? Options: black tea; bread; drinking water; grapes.”. Open-ended questions, on the other hand, prompt a direct response, such as “What is the name of the most frequently purchased product? Please provide the name in your response.”.
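The three response types can be illustrated by a small rendering routine. This is a sketch under stated assumptions: the `Question` class and the `render` function are hypothetical names, and the templates simply reproduce the example wordings quoted above rather than the exact prompts used in the experiments.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Question:
    body: str                            # textual task description
    kind: str                            # "binary", "multiple_choice" or "open_ended"
    options: Optional[List[str]] = None  # only used for multiple choice


def render(question: Question) -> str:
    """Render the question body; the prefix and event embeddings are handled elsewhere."""
    if question.kind == "binary":
        return question.body
    if question.kind == "multiple_choice":
        if not question.options:
            raise ValueError("multiple-choice questions need a list of options")
        return f"{question.body} Options: {'; '.join(question.options)}."
    if question.kind == "open_ended":
        return f"{question.body} Please provide the name in your response."
    raise ValueError(f"unknown question kind: {question.kind}")


multiple_choice = Question(
    "What is the most frequently purchased product?",
    "multiple_choice",
    ["black tea", "bread", "drinking water", "grapes"],
)
print(render(multiple_choice))
```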

3.2 Events embeddings

To address the integration of event sequences into a language model, we propose adapting the method outlined in previous works (Koh et al., 2023; Huang et al., 2023). This involves embedding multi-modal information into an LLM, parameterized by ${\theta}$ , by directly mapping it into the intrinsic embedding space $E^{\theta}$ , bypassing the discrete text token layer. To achieve this, we introduce a trainable mapping $\phi:Z\rightarrow E^{\theta}$ , where $Z$ represents the observation space of temporally structured data. This mapping converts the data into a sequence of $f$ -dimensional vectors in $E^{\theta}$ , which are then integrated into a sequence of text embeddings. This interleaving of modalities creates a multi-modal input for the LLM.

Figure 1: Model architecture. The components of the approach that do not require training are colored in blue. Components whose weights are optimised during training are colored in orange. The trainable embeddings and associated tokens are colored in red.

ESQA represents all event features as trainable embeddings. It is achieved by encoding each value $x_{ij}$ of a categorical or integer numeric feature $c_{j}$ with a sequential index $k_{x_{ij}}$ based on the total number of unique values for that feature $k=[0,\dots,K_{j}]$ . This index uniquely identifies the embedding $emb_{k}$ of a feature value in the embedding matrix $W_{e}$ . The embedding dimension is selected based on the formula: $dim(emb_{k})=\lceil{\lambda\times K_{j}^{\mu}}\rceil$ . The coefficients $\lambda=1.6$ and $\mu=0.56$ have been chosen empirically.
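As a worked illustration of the index assignment and the embedding-size rule, the sketch below evaluates $\lceil\lambda\times K_{j}^{\mu}\rceil$ for a few values of $K_{j}$. The helper names `build_index` and `embedding_dim` and the sample MCC codes are assumptions made for this example; how indices are ordered in the actual implementation is not specified in the text.

```python
import math

LAMBDA, MU = 1.6, 0.56  # coefficients quoted in the text


def embedding_dim(num_unique: int) -> int:
    """Embedding size ceil(lambda * K^mu) for a feature with K unique values."""
    return math.ceil(LAMBDA * num_unique ** MU)


def build_index(values):
    """Assign sequential indices to distinct values in order of first appearance."""
    index = {}
    for value in values:
        index.setdefault(value, len(index))
    return index


mcc_codes = ["5411", "5812", "5411", "4111"]  # illustrative feature values
index = build_index(mcc_codes)
print(index)
print(embedding_dim(len(index)), embedding_dim(100))
```

With these coefficients the embedding size grows sublinearly: a feature with 100 distinct values receives a 22-dimensional embedding.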

Numerical features in the form of real numbers are discretized into non-overlapping intervals: $B_{j}^{1},\dots,B_{j}^{n}$ , $B_{j}^{i}=[b_{j}^{i-1},b_{j}^{i})$ . The distribution of the feature $c_{j}$ in the training sample is used to determine these intervals. The number of intervals is chosen based on the approach in (Doane, 1976), using the formula $n=1+\log_{2}(N)+\log_{2}(1+\frac{|g_{1}|}{\sigma_{g_{1}}})$ , where $N$ is the number of observations of the feature in the training sample, $g_{1}$ is the estimated third-moment skewness of the distribution and $\sigma_{g_{1}}=\sqrt{\frac{6(N-2)}{(N+1)(N+3)}}$ . This method is particularly suited for distributions of features that deviate significantly from the normal distribution.
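Doane's rule can be evaluated directly from a sample. The sketch below is an illustration, not the authors' code: the function name `doane_bins` is hypothetical, the skewness is the biased moment estimator, and rounding the result up to an integer is an assumption, since the text does not state how the value is rounded.

```python
import math


def doane_bins(sample):
    """Number of intervals by Doane's rule for a list of real values."""
    count = len(sample)
    if count < 3:
        raise ValueError("Doane's rule needs at least three observations")
    mean = sum(sample) / count
    m2 = sum((x - mean) ** 2 for x in sample) / count  # second central moment
    m3 = sum((x - mean) ** 3 for x in sample) / count  # third central moment
    if m2 == 0:
        return 1  # constant feature: a single interval suffices
    g1 = m3 / m2 ** 1.5  # third-moment skewness
    sigma_g1 = math.sqrt(6 * (count - 2) / ((count + 1) * (count + 3)))
    # Rounding up is an assumption made for this sketch.
    return math.ceil(1 + math.log2(count) + math.log2(1 + abs(g1) / sigma_g1))


symmetric = list(range(1, 65))  # zero skewness, 64 observations
skewed = [1] * 60 + [50] * 4    # strongly right-skewed sample
print(doane_bins(symmetric), doane_bins(skewed))
```

For the symmetric sample the skewness term vanishes and the rule reduces to $1+\log_{2}(64)=7$ intervals, while the skewed sample of the same size receives more intervals.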

Once the intervals have been defined, Eq. 3.2 determines the value $x_{j;disc}^{num}$ of the $j$-th numerical feature:

$$x_{j;disc}^{num}=\begin{cases}b_{j}^{0},&x_{ij}<b_{j}^{0},\\ b_{j}^{n},&x_{ij}\geq b_{j}^{n},\\ b_{j}^{i},&b_{j}^{i-1}\leq x_{ij}<b_{j}^{i}.\end{cases}$$
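The piecewise rule maps each raw value to the upper edge of the interval that contains it and clamps values outside the observed range. A minimal sketch follows, assuming the sorted interval edges $b_{j}^{0},\dots,b_{j}^{n}$ are given as a list; the function name `discretize` and the edge values are illustrative only.

```python
from bisect import bisect_right


def discretize(value, edges):
    """Apply the piecewise rule to one value, given sorted edges [b^0, ..., b^n]."""
    # bisect_right returns i such that edges[i-1] <= value < edges[i];
    # it is 0 below b^0 and len(edges) at or above b^n, so the upper case is clamped.
    position = min(bisect_right(edges, value), len(edges) - 1)
    return edges[position]


edges = [0.0, 10.0, 20.0, 30.0]  # hypothetical edges b^0..b^3
for raw in (-5.0, 0.0, 9.99, 10.0, 25.0, 30.0, 100.0):
    print(raw, "->", discretize(raw, edges))
```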

The resulting feature embeddings are concatenated into a tensor $e_{i}^{emb}$ of dimension $dim(e_{i}^{emb})=\sum_{j=1}^{C}|c_{j}|$ , which describes a single event $e_{i}$ from the sequence. A vector representation of sequence $S_{n}$ is formed by combining vector representations of individual events into a joint tensor $S_{n}^{emb}$ , shown in Fig. 2a.

3.3 Encoder

After the initial layer of input data embeddings, vectorized event sequences are fed into a specialized encoder model (Fig. 2b). This module, based on the architecture of the Transformer decoder, processes sequences of events in an autoregressive manner by predicting each subsequent event. For our implementation, we used both Whisper-tiny and Whisper-small models (Radford et al., 2022), initialized with weights pre-trained on audio data. The input tensor for the encoder comprises concatenated feature embedding vectors for all events $S_{n}^{emb}$ (Section 3.3.1) and has a size of $dim(S_{n}^{emb})=(I_{n},dim(e_{i}^{emb}))$ . The encoder processes this tensor autoregressively, similar to the sequence of text token embeddings, resulting in a sequence of vectors $\tilde{S_{n}}^{emb}$ with a size $dim(\tilde{S_{n}}^{emb})=(I_{n},d_{enc})$ . Here, $d_{enc}$ represents the output layer dimensionality of the encoder model. To ensure compatibility between the dimensions of the input embeddings of the event sequences $dim(S_{n}^{emb})$ and the embedding layer of the encoder model $d_{enc}$ , we used a linear projection layer.
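The shape bookkeeping described above, from per-feature embeddings to the encoder input, can be traced with a toy example. This is a sketch with randomly initialised, untrained weights; the feature names, table sizes, and $d_{enc}=6$ are hypothetical values chosen to keep the example small, and the Whisper encoder itself is not reproduced.

```python
import random

random.seed(0)


def random_matrix(rows, cols):
    """Stand-in for a trainable weight matrix (rows x cols)."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


# Hypothetical embedding tables: 3 MCC codes (dim 3) and 5 amount bins (dim 4).
tables = {"mcc": random_matrix(3, 3), "amount_bin": random_matrix(5, 4)}
event_dim = sum(len(table[0]) for table in tables.values())  # 3 + 4 = 7
d_enc = 6                                                    # assumed encoder width
projection = random_matrix(event_dim, d_enc)


def embed_event(event):
    """Concatenate the embeddings of all feature indices of one event."""
    vector = []
    for name, table in tables.items():
        vector.extend(table[event[name]])
    return vector


def project(vector, weight):
    """Linear projection of one event vector to the encoder dimension."""
    return [sum(v * w for v, w in zip(vector, column)) for column in zip(*weight)]


sequence = [
    {"mcc": 0, "amount_bin": 2},
    {"mcc": 1, "amount_bin": 4},
    {"mcc": 0, "amount_bin": 0},
]
embedded = [embed_event(event) for event in sequence]     # shape (I_n, 7)
projected = [project(vec, projection) for vec in embedded]  # shape (I_n, d_enc)
print(len(embedded), len(embedded[0]))
print(len(projected), len(projected[0]))
```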

This choice of encoder architecture is motivated both by the temporal nature of the event sequences, which aligns with autoregressive modelling, and by the results of a series of experiments. Appendix A.1 provides a detailed description of the experiments and their results.

3.4 Connector

The output representation of the event sequence encoder grows in dimensionality as the number of events in each sequence increases. This size is crucial, as it must fit within a common multimodal embedding sequence, impacting the extension of the language model’s context length. Our goals are to shorten the event sequence length without significant information loss and to adapt each event’s vectorized representation to match the language model’s embedding dimension. To achieve this, we propose an intermediate connection layer between the event sequence encoder and the LLM. We suggest using the Query Transformer model, or Q-Former (Li et al., 2023), to efficiently extract features from the encoder output.

The Q-Former architecture (Fig. 3) includes two transformer submodules: a novel modality transformer (originally an image transformer) that works with a fixed image encoder for feature extraction, and a text transformer that functions as both an encoder and a decoder. A set of trainable query embeddings $q$ serves as the input for the novel modality transformer. These queries engage in self-attention, interacting with each other and with the fixed modality features through cross-attention layers in every other transformer block.

In our approach, Q-Former produces a fixed set of query vectors $q$ for each event sequence, which are then passed to the LLM. We use a single fully-connected layer to project the output query vectors into the language model’s text embedding dimension. In this study, we initialize Q-Former with the weights of the BLIP-2 model trained with FLAN-T5-xl (Li et al., 2023). The architecture and initialization of the connection layer were chosen based on a series of experiments detailed in Appendix A.2.
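The length-reduction property of the connector can be shown with a heavily simplified stand-in for Q-Former. The sketch below keeps only one ingredient, a fixed set of queries cross-attending to the encoder output, and omits the learned projections, self-attention, feed-forward layers, and multiple heads of the real architecture. All names and sizes are assumptions made for illustration.

```python
import math
import random


def cross_attend(queries, encoder_outputs):
    """Single-head scaled dot-product cross-attention without learned projections."""
    dim = len(queries[0])
    outputs = []
    for query in queries:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in encoder_outputs
        ]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(score - top) for score in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[j] for w, value in zip(weights, encoder_outputs))
            for j in range(dim)
        ])
    return outputs


def random_vectors(count, dim):
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


random.seed(0)
d_enc, num_queries = 8, 4                    # illustrative sizes
queries = random_vectors(num_queries, d_enc)  # stand-in for trainable queries
short_sequence = random_vectors(10, d_enc)    # encoder output for 10 events
long_sequence = random_vectors(500, d_enc)    # encoder output for 500 events

short_out = cross_attend(queries, short_sequence)
long_out = cross_attend(queries, long_sequence)
print(len(short_out), len(long_out))
```

Regardless of how many events the encoder processed, the connector emits the same number of vectors, which is what keeps the LLM context length bounded.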

3.5 Language Model

As the backbone for the pre-trained LLM, our approach utilizes the FLAN-T5 family of encoder-decoder models (Wei et al., 2021). Any process of fine-tuning model parameters influences the model’s proficiency in a specific domain but also causes it to "forget" essential general and linguistic knowledge. To preserve this knowledge and save computational resources, we have frozen most of the LLM parameters. Studies (Lu et al., 2021; Zhou et al., 2023) indicate that freezing most of the model’s weights often yields better results than fully fine-tuning a pre-trained LLM.

To efficiently select a limited set of trainable parameters, we propose using Parameter-Efficient Fine-Tuning (PEFT) methods. Specifically, we employed the Low-Rank Adaptation (LoRA) approach (Hu et al., 2021), which keeps most of the model weights frozen while adding trainable rank decomposition matrices to a subset of the parameters.
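The idea of a low-rank update on top of a frozen weight matrix can be sketched in a few lines. This is a generic illustration of LoRA as described by Hu et al. (2021), not the configuration used in the experiments: the matrix sizes, the rank, the scale, and the helper names are all assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, scale=1.0):
    """Return x W + scale * (x A) B; only A and B would receive gradients."""
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    return [
        [b + scale * u for b, u in zip(base_row, update_row)]
        for base_row, update_row in zip(base, update)
    ]


random.seed(0)
d_in, d_out, rank = 16, 16, 2  # illustrative sizes, not the paper's settings
w_frozen = [[random.gauss(0.0, 0.1) for _ in range(d_out)] for _ in range(d_in)]
lora_a = [[random.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(d_in)]
lora_b = [[0.0] * d_out for _ in range(rank)]  # zero init: no change at the start
x = [[random.gauss(0.0, 1.0) for _ in range(d_in)]]

frozen_params = d_in * d_out
trainable_params = rank * (d_in + d_out)
print(frozen_params, trainable_params)
```

Because the second low-rank factor starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training modifies only the much smaller set of adapter parameters.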

4 Experiments

In this section, we begin by presenting the evaluation details, which include the comparison methods and the datasets used for evaluation. Following this, we conduct a series of systematic experiments to showcase the capability of the developed ESQA approach in addressing a diverse range of problems based on event sequences.

4.1 Experimental setup

Datasets. Sequences of events are prevalent across various domains and tasks, with a particularly high demand for analyzing such data in the fintech sector. In this field, transactional activity of individuals serves as the primary source of information. Consequently, we have chosen to utilize a collection of datasets containing customer transactions from banks and marketplaces as examples of event sequences. The sensitivity of the information in these datasets has significantly influenced our choice and the number of datasets used, given the limited availability of public datasets in this area.

We selected five publicly available datasets with event sequences: AlfaBattle2.0 (Evgeny and Max, 2021), Age Group Prediction Competition (Sirius, 2020), X5 Retail Hero: Uplift Modelling for Promotional Campaign (Babaev et al., 2022), Taiwan Default of Credit Card Customers (Yeh and hui Lien, 2009), and Gender Prediction Competition (Max, 2019). The data were divided into training and validation sets, ensuring no overlap by unique client identifiers. The proportions for partitioning these sets were chosen independently for each dataset. A detailed description of the datasets used to evaluate the approach’s quality is provided in Appendix B.1.

Baselines. We have chosen representative baseline approaches for analysing event sequences, which have demonstrated effectiveness across various benchmarks. For the AlfaBattle and Taiwan Default of Credit Card Clients datasets, we implemented, trained, and fine-tuned the baseline models ourselves to achieve optimal results. For other benchmarks, we relied on the findings from CoLES (Babaev et al., 2022), which provide the most current and comprehensive empirical studies of event sequences.

For the next event prediction task we also provide calculated statistical baselines for both numerical and categorical target features to compare the quality of prediction tasks in zero-shot setups. These include, for example, the prediction of the mean or median value for numerical target variables and the prediction of the most frequent value for categorical attributes.
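Such statistical baselines are simple constant predictors. The sketch below shows one way to compute them; the function names and the toy training values are hypothetical, and the exact procedure used for the reported numbers is described in Appendix B.2 rather than here.

```python
from collections import Counter
from statistics import mean, median


def categorical_baseline(train_values):
    """Always predict the most frequent training value."""
    return Counter(train_values).most_common(1)[0][0]


def numerical_baselines(train_values):
    """Always predict the training mean or median."""
    return {"mean": mean(train_values), "median": median(train_values)}


def accuracy(prediction, targets):
    return sum(target == prediction for target in targets) / len(targets)


def mean_absolute_error(prediction, targets):
    return sum(abs(target - prediction) for target in targets) / len(targets)


train_mcc = ["5411", "5411", "5812"]  # illustrative values only
train_amount = [1.0, 2.0, 6.0]
most_frequent = categorical_baseline(train_mcc)
constants = numerical_baselines(train_amount)
print(most_frequent, constants)
print(accuracy(most_frequent, ["5411", "4111"]))
print(mean_absolute_error(constants["median"], [2.0, 4.0]))
```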

A complete list and detailed description of baselines is provided in Appendix B.2.

4.2 Experimental results

4.2.1 Main results

Each dataset used for model evaluation corresponds to a specific downstream task. For instance, the AlfaBattle dataset is utilized for predicting a bank customer’s loan default, while the Age dataset is employed for predicting the age group. It is important to highlight that the AlfaBattle dataset is highly imbalanced, with the positive class constituting less than 3%, while the Gender dataset has a slight over-representation of the class denoting male gender. Therefore, we used the ROC-AUC metric for problems with binary target variables and class imbalance. For multiclass classification with balanced classes, we employed Accuracy. A more detailed explanation of the metric calculation methodology and the assessment of response quality for ESQA is provided in Appendix B.0.2. The results of the experiments on the downstream tasks for datasets described in Section 4.1 are summarized in Table 1.
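The motivation for using ROC-AUC under class imbalance can be seen from its pairwise definition: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, so it does not depend on the class ratio. The following sketch computes it by direct pairwise comparison; it is an illustration on toy data, not the evaluation code described in the appendix.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 0, 0, 1]            # toy imbalanced sample
scores = [0.1, 0.2, 0.3, 0.8, 0.7]  # hypothetical model scores
constant = [0.0] * len(labels)      # a model that ignores its input

always_negative_accuracy = sum(y == 0 for y in labels) / len(labels)
print(roc_auc(labels, scores), roc_auc(labels, constant), always_negative_accuracy)
```

In this toy case a predictor that always outputs the majority class reaches 80% accuracy but only 0.5 ROC-AUC, which is why accuracy is reserved for the balanced multiclass tasks.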

Table 1: A comparison of ESQA on the downstream tasks of the five event sequence datasets with the baseline approaches, both described in Section 4.1. The best results are highlighted in bold and the second best results are underlined.

Dataset	AlfaBattle	Age	Gender	X5	Taiwan
Metric	AUCROC	Accuracy	AUCROC	Accuracy	AUCROC
Handcrafted feat.	0.7792	0.629	0.877	0.547	0.784
Randomly init. RNN	0.6456	0.375	0.593	-	0.722
CPC	0.7919	0.602	0.851	0.525	-
Barlow Twins	0.7878	0.634	0.865	-	-
CoLES	0.7921	0.640	0.881	0.539	-
NSP	0.7655	0.621	0.852	0.425	-
RTD	0.7910	0.631	0.855	0.520	-
SOP	0.7238	0.512	0.785	0.428	-
MLM NSP	0.7591	-	-	-	-
TabFormer	0.7862	-	-	-	-
GPT	0.7737	-	-	-	-
ESQA (ours)	0.7568	0.699	0.850	0.598	0.793

The results indicate that the ESQA approach is broadly comparable in quality to both self-supervised contrastive and supervised methods. Notably, on the Age and X5 datasets, ESQA surpasses all baseline scores (0.699 versus 0.640 and 0.598 versus 0.547, respectively), which suggests that it is well suited to multi-class classification tasks with balanced classes. For the Taiwan dataset, baseline results are available only for handcrafted features and the randomly initialized RNN; ESQA outperforms both (0.793 versus 0.784 and 0.722).

However, for the client default problem on the AlfaBattle dataset and the Gender dataset, the results are less clear-cut. The CoLES contrastive approach achieves the highest quality for these problems. While ESQA slightly lags behind CoLES, it still shows competitive performance, closely following models like Barlow Twins and RTD, and outperforming the SOP approach. It is important to note that both datasets exhibit class imbalance, which is especially pronounced in the AlfaBattle case.

This leads us to conclude that the ESQA approach generally performs classification tasks as well as, or better than, the selected baseline methods. However, it is significantly affected by the imbalance of the target variable. This limitation can be attributed to the nature of LLMs, originally designed to extract common patterns from text data to model complex language structures.

4.2.2 Predictive tasks

The majority of tasks involving event sequences require answering predictive questions about event features. To address such challenges, we propose utilizing the ESQA approach in a multi-task setting, enabling simultaneous predictions of all features of the next event in the sequence. Experimental results for predictive questions against baselines are detailed in Table 2.

Table 2: Comparison of ESQA with the baseline approaches presented in Section 4.1 for predicting attributes of the next transaction on the AlfaBattle dataset. The best results are highlighted in bold and the second best results are underlined.

Attribute	MCC code	Amount	Hour diff
Metric	Acc. / F1	MAE / MSE	MAE / MSE
CoLES	0.440 / 0.351	0.197 / 0.082	36.05 / 1586.52
CPC	0.475 / 0.411	0.196 / 0.074	34.89 / 1508.71
RNN with CoLES	0.469 / 0.411	0.184 / 0.077	32.25 / 1573.02
CatBoost	0.440 / 0.367	0.190 / 0.090	34.40 / 1613.41
GPT with descr.	0.462 / 0.423	0.179 / 0.083	32.63 / 1726.42
Text LLM	0.382 / 0.381	0.103 / 0.0176	116.38 / 62161
ESQA (ours)	0.546 / 0.546	0.191 / 0.1021	18.313 / 1033.87

On categorical feature prediction tasks, such as MCC code prediction, ESQA achieves the highest Accuracy/F1 scores (0.546 / 0.546), outperforming all other models; the closest in accuracy is CPC (0.475). This indicates that ESQA is particularly effective in handling categorical prediction tasks within the transaction history context.

In predicting the numerical Amount attribute, the Text LLM achieves the lowest MAE/MSE (0.103 / 0.0176). ESQA’s MAE (0.191) is on par with the remaining baselines (0.179 to 0.197), although its MSE (0.1021) is the highest in the table. ESQA is therefore not the top performer here, but it maintains reasonable accuracy across different types of prediction tasks.

For the temporal Hour diff attribute, ESQA significantly outperforms all other models (MAE/MSE of 18.313 / 1033.87). The next best model by MAE, RNN with CoLES, reaches 32.25, which indicates that ESQA handles temporal prediction tasks effectively.

4.2.3 Generalization abilities

LLMs possess an extraordinary capacity to generalise to novel, previously unseen tasks. Our method maintains the integrity of the language model’s weights, thereby preserving its inherent capabilities. Moreover, by training adaptors within the attention layers, we expand the domain of zero-shot tasks from exclusively text-based tasks to those based on event sequences. We evaluated the ESQA approach’s adaptability to new predictive tasks following comprehensive pre-training on a set of contextual tasks. The model was trained in multi-task setting on all event features of the AlfaBattle dataset and was subsequently tested in a zero-shot setting across various predictive tasks within the same dataset. Table 3 presents a comparison of our experimental results against statistical baselines, a text baseline, and an ESQA model specifically trained on those predictive tasks.

Table 3: Comparison of the generalisation abilities of the ESQA approach with the statistical baselines presented in Section 4.1 and a text-based approach. ESQA trained on predictive tasks in a multi-task setting is referred to as ’ESQA m/t’, while ESQA trained on contextual tasks and adapted to new tasks in a zero-shot setting is referred to as ’ESQA z/s’. The best results are highlighted in bold and the second best results are underlined.

Attribute	Stat. baseline	Text-only	ESQA m/t	ESQA z/s
MCC code, acc.	0.388	0.382	0.546	0.381
MCC category, acc.	0.437	0.402	0.588	0.435
Amount, MAE/MSE	0.241	0.103 / 0.018	0.191 / 0.102	0.389 / 0.228
City, acc.	0.704	0.691	0.731	0.343
Country, acc.	0.970	0.970	0.972	0.971
Currency, acc.	0.987	0.986	0.987	0.988
Op. type gr., acc.	0.766	0.733	0.840	0.781
Op. type, acc.	0.499	0.393	0.633	0.543
Op. kind, acc.	0.548	0.494	0.693	0.598
Days before, MAE/MSE	140.5 / 23823.3	10.5 / 657.2	6.3 / 195.9	11.394 / 666.2
Hour diff, MAE/MSE	36.33	116.4 / 62161	18.3 / 1033.9	48.85 / 3980

For the MCC code and MCC category attributes, ESQA multi-task outperforms all baselines, indicating its strength in handling categorical predictions. However, in the zero-shot setting, ESQA’s performance is comparable to the statistical baseline, suggesting room for improvement in scenarios without task-specific training. In predicting the Amount attribute, the text-only approach achieves the best MAE/MSE, while ESQA multi-task shows competitive performance, demonstrating its robustness in handling regression tasks despite not being the top performer. However, the regression problem on real numbers with a large number of decimals is still a challenging task for zero-shot ESQA, which performed poorly. For temporal predictions like Days before and Hour diff, ESQA multi-task significantly outperforms other approaches, showcasing its superior capability in modelling temporal patterns. Overall, ESQA zero-shot performance, while not leading, still provides valuable insights into ESQA’s versatility and potential for improvement in less customised settings.

5 Related work

Event sequences. Temporal Point Processes (TPPs) and their marked variants (MTPPs) can be seen as the simplest forms of event sequences. Previous research focused on accurate next event prediction with or without neural networks (Liniger, 2009; Mei and Eisner, 2017; Xue et al., 2023). Another branch of research addressed event streams of the general form (Padhi et al., 2021; Babaev et al., 2022; McDermott et al., 2024). To the best of our knowledge, question answering with LLMs has not previously been applied to TPPs or event sequence modeling.

Structured modeling with neural networks. The problem of modeling multiple heterogeneous features with neural networks was addressed in tabular neural networks (Yin et al., 2020; Iida et al., 2021; Padhi et al., 2021; Hegselmann et al., 2022; Yang et al., 2022; Dinh et al., 2022). We reuse best practices for making embeddings from categorical and numeric features. At the same time, event sequences require analysis of multiple events at once, while tabular datasets can be processed one row at a time. To this end, ESQA applies a sequence encoder and adapts Q-Former, neither of which is used in tabular neural networks.

LLMs for time series. Previous works made use of LLMs in the context of time series analysis (Gruver et al., 2023; Cai et al., 2023; Zhang et al., 2024). Unlike time series models, ESQA implements a novel context encoding and can process complex data structures.

6 Conclusion

In this paper, we introduced Event Sequences Question Answering (ESQA), a novel approach for modelling event sequences with LLMs. Our empirical results demonstrate that our approach performs robustly across various datasets. For several downstream problems, ESQA performs at least as well as specialised baselines (Table 1), and for the task of predicting the attributes of the next event, it significantly surpasses baseline methods (Table 2). Furthermore, we have shown that ESQA can handle multiple tasks simultaneously without any special fine-tuning (Table 3), highlighting its ability to adapt swiftly to new tasks without the need for complex and time-consuming training. These findings position ESQA as a promising approach that leverages the strong generalisation capabilities of LLM backbones for the field of event sequences.

Limitations. Our research has certain limitations. In processing numerical features, ESQA employs value discretization, which introduces an inherent discretization error. This error is significantly influenced by the number of discretization buckets and the ranges of the actual feature values. To mitigate this error, we conducted several additional experiments to refine the pre-processing method. Furthermore, handling time features in event sequences requires special attention. We are actively exploring ways to enhance temporal feature processing within ESQA. In future work, we will focus on implementing these improvements and addressing the challenge of dealing with unbalanced classes.

References

Babaev et al. [2022] Dmitrii Babaev, Nikita Ovsov, Ivan Kireev, Maria Ivanova, Gleb Gusev, Ivan Nazarov, and Alexander Tuzhilin. Coles: Contrastive learning for event sequences with self-supervision, 2022. URL http://dx.doi.org/10.1145/3514221.3526129.
Beltagy et al. [2020] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. ArXiv, abs/2004.05150, 2020. URL https://api.semanticscholar.org/CorpusID:215737171.
Bergen et al. [2019] Karianne J. Bergen, Paul A. Johnson, Maarten V. de Hoop, and Gregory C. Beroza. Machine learning for data-driven discovery in solid earth geoscience. Science, 363(6433):eaau0323, 2019. doi: 10.1126/science.aau0323. URL https://www.science.org/doi/abs/10.1126/science.aau0323.
Cai et al. [2023] Yifu Cai, Mononito Goswami, Arjun Choudhry, Arvind Srinivasan, and Artur Dubrawski. Jolt: Jointly learned representations of language and time-series. In Deep Generative Models for Health Workshop NeurIPS 2023, 2023.
Choi et al. [2021] Kukjin Choi, Jihun Yi, Changhwa Park, and Sungroh Yoon. Deep learning for anomaly detection in time-series data: Review, analysis, and guidelines. IEEE Access, 9:120043–120065, 2021. doi: 10.1109/ACCESS.2021.3107975.
Clark et al. [2020] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.
Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2019. URL https://api.semanticscholar.org/CorpusID:52967399.
Dinh et al. [2022] Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, and Kangwook Lee. Lift: Language-interfaced fine-tuning for non-language machine learning tasks. Advances in Neural Information Processing Systems, 35:11763–11784, 2022.
Doane [1976] David P. Doane. Aesthetic frequency classifications. The American Statistician, 30:181–183, 1976. URL https://api.semanticscholar.org/CorpusID:119563223.
Evgeny and Max [2021] Smirnov Evgeny and Mayer Max. Alfabattle2.0, 2021. URL https://github.com/smirnovevgeny/AlfaBattle2.0.
Gruver et al. [2023] Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters, 2023.
Hegselmann et al. [2022] Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David A. Sontag. Tabllm: Few-shot classification of tabular data with large language models, 2022. URL https://api.semanticscholar.org/CorpusID:252992811.
Hossain et al. [2020] Sohrab Hossain, Ahmed Abtahee, Imran Kashem, Mohammed Moshiul Hoque, and Iqbal H. Sarker. Crime prediction using spatio-temporal data, 2020.
Hu et al. [2021] J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://api.semanticscholar.org/CorpusID:235458009.
Huang et al. [2023] Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Language is not all you need: Aligning perception with language models, 2023. URL https://api.semanticscholar.org/CorpusID:257219775.
Iida et al. [2021] Hiroshi Iida, Dung Ngoc Thai, Varun Manjunatha, and Mohit Iyyer. Tabbie: Pretrained representations of tabular data, 2021. URL https://api.semanticscholar.org/CorpusID:233864627.
Ke et al. [2017] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/CorpusID:3815895.
Koh et al. [2023] Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal inputs and outputs, 2023. URL https://api.semanticscholar.org/CorpusID:258947258.
Lan et al. [2019] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. ArXiv, abs/1909.11942, 2019. URL https://api.semanticscholar.org/CorpusID:202888986.
Lane [2003] D. Lane. Introduction to statistics, 2003. URL https://books.google.ru/books?id=jzyAzQEACAAJ.
Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023. URL https://api.semanticscholar.org/CorpusID:256390509.
Liniger [2009] Thomas Liniger. Multivariate hawkes processes. PhD thesis, ETH Zurich, 2009.
Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. ArXiv, abs/1711.05101, 2017. URL https://api.semanticscholar.org/CorpusID:3312944.
Lu et al. [2021] Kevin Lu, Aditya Grover, P. Abbeel, and Igor Mordatch. Pretrained transformers as universal computation engines, 2021. URL https://api.semanticscholar.org/CorpusID:232168936.
Max [2019] Valeriy Max. Python and data analysis: Final project, 2019. URL https://kaggle.com/competitions/python-and-analyze-data-final-project.
McDermott et al. [2024] Matthew McDermott, Bret Nestor, Peniel Argaw, and Isaac S Kohane. Event stream gpt: a data pre-processing and modeling library for generative, pre-trained transformers over continuous-time sequences of complex events. Advances in Neural Information Processing Systems, 36, 2024.
Mei and Eisner [2017] Hongyuan Mei and Jason M Eisner. The neural hawkes process: A neurally self-modulating multivariate point process. Advances in neural information processing systems, 30, 2017.
Ni et al. [2018] Yabo Ni, Dan Ou, Shichen Liu, Xiang Li, Wenwu Ou, Anxiang Zeng, and Luo Si. Perceive your users in depth: Learning universal user representations from multiple e-commerce tasks, 2018.
Ostroumova et al. [2017] Liudmila Ostroumova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. Catboost: unbiased boosting with categorical features. In Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/CorpusID:5044218.
Padhi et al. [2020] Inkit Padhi, Yair Schiff, Igor Melnyk, Mattia Rigotti, Youssef Mroueh, Pierre L. Dognin, Jerret Ross, Ravi Nair, and Erik Altman. Tabular transformers for modeling multivariate time series. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3565–3569, 2020. URL https://api.semanticscholar.org/CorpusID:226237049.
Padhi et al. [2021] Inkit Padhi, Yair Schiff, Igor Melnyk, Mattia Rigotti, Youssef Mroueh, Pierre Dognin, Jerret Ross, Ravi Nair, and Erik Altman. Tabular transformers for modeling multivariate time series, 2021.
Radford et al. [2019] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URL https://api.semanticscholar.org/CorpusID:160025533.
Radford et al. [2022] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022. URL https://api.semanticscholar.org/CorpusID:252923993.
Sirius [2020] Educational Center Sirius. Age group prediction competition, 2020. URL https://ods.ai/competitions/sberbank-sirius-lesson/data.
van den Oord et al. [2018] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2018. URL https://api.semanticscholar.org/CorpusID:49670925.
Wei et al. [2021] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners, 2021. URL https://api.semanticscholar.org/CorpusID:237416585.
Xue et al. [2023] Siqiao Xue, Xiaoming Shi, Zhixuan Chu, Yan Wang, Fan Zhou, Hongyan Hao, Caigao Jiang, Chen Pan, Yi Xu, James Y Zhang, et al. Easytpp: Towards open benchmarking the temporal point processes. arXiv preprint arXiv:2307.08097, 2023.
Yang et al. [2022] Jingfeng Yang, Aditya Gupta, Shyam Upadhyay, Luheng He, Rahul Goel, and Shachi Paul. Tableformer: Robust transformer modeling for table-text encoding, 2022. URL https://api.semanticscholar.org/CorpusID:247187588.
Yeh and hui Lien [2009] I-Cheng Yeh and Che hui Lien. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Syst. Appl., 36:2473–2480, 2009. URL https://api.semanticscholar.org/CorpusID:15696161.
Yin et al. [2020] Pengcheng Yin, Graham Neubig, Wen tau Yih, and Sebastian Riedel. Tabert: Pretraining for joint understanding of textual and tabular data, 2020. URL https://api.semanticscholar.org/CorpusID:218674345.
Zbontar et al. [2021] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. ArXiv, abs/2103.03230, 2021. URL https://api.semanticscholar.org/CorpusID:232110471.
Zhang et al. [2024] Xiyuan Zhang, Ranak Roy Chowdhury, Rajesh K Gupta, and Jingbo Shang. Large language models for time series: A survey. arXiv preprint arXiv:2402.01801, 2024.
Zhou et al. [2023] Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. One Fits All: Power general time series analysis by pretrained lm, 2023.

Appendix A Architecture components selection

A.1 Event sequence encoder architecture selection

Training a language model to understand a different modality is not a novel challenge and has been addressed for various data types. Therefore, in our experiments on the event sequence encoder architecture, we built upon advancements from other modalities. We focused on established models for three highly developed modalities: text, images, and audio. For text architectures, we examined several models including encoder-only models like BERT (base, large), encoder-decoder models like T5 (small, base, large), and decoder-only models like GPT (base, medium, large). For image architectures, we used ViT (base, large). For audio models, we considered various versions of Whisper (tiny, small, medium), utilizing only the decoder part of the Whisper architecture.

The models were compared based on their ability to predict the default of a bank client in the AlfaBattle 2.0 dataset, a binary problem where the task is to determine if a bank client will repay a loan based on their transaction history over two years. We used AUC as the metric for comparison. All models were trained from scratch.

We maintained a consistent training scheme across all experiments, employing Adam as the optimizer with a learning rate of 1e-4. A linear warm-up of the learning rate was applied for the first epoch, followed by a linear decay to zero after 10 epochs. To ensure compatibility between the dimensions of transaction embeddings and the dimensions of pretrained model embeddings, we used a linear layer for text and audio models. Since ViT models cannot process sequences, we addressed this issue by applying a single layer of cross-attention to a fixed number of learnable latent tokens.
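The learning rate schedule described above can be sketched as follows. This is a minimal illustration, assuming that the warm-up spans exactly the steps of the first epoch and that the decay reaches zero at the end of the tenth epoch; the function name `learning_rate` and the `steps_per_epoch` argument are hypothetical and not taken from our code.

```python
def learning_rate(step, steps_per_epoch, peak_lr=1e-4, total_epochs=10):
    """Linear warm-up during the first epoch, then linear decay to zero.

    Assumption: `step` is a zero-based global optimizer step.
    """
    warmup_steps = steps_per_epoch
    total_steps = steps_per_epoch * total_epochs
    if step < warmup_steps:
        # Ramp up linearly so that the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay linearly from peak_lr to zero over the remaining epochs.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for s in (0, 99, 550, 1000):
        print(s, learning_rate(s, steps_per_epoch=100))
```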

As shown in Table 4, decoder-only models outperformed both encoder-only and encoder-decoder models in event sequence encoding in almost all setups. Specifically, experiments with text models demonstrated that the decoder-only GPT2 model outperformed the encoder-decoder T5 model, while BERT training did not converge. Similarly, audio architectures, of which we used only the decoder part, also showed superior performance. Table 4 also shows that larger models within the same family often perform worse than smaller ones. Our analysis suggests that enlarging the encoder leads to overfitting, which is the primary reason for this degradation.

Table 4: Table comparing different architectures for predicting default of the client on the AlfaBattle dataset. The best results are highlighted in bold and the second best results are underlined.

Architecture	Type	Number of parameters	AUC
GPT2 Base	Decoder	124M	0.7869
GPT2 Medium	Decoder	355M	0.7833
GPT2 Large	Decoder	774M	0.7747
Whisper-tiny	Decoder	29M	0.7892
Whisper-small	Decoder	153M	0.7894
Whisper-medium	Decoder	456M	0.7715
T5 Small	Encoder-Decoder	60M	0.7721
T5 Base	Encoder-Decoder	223M	0.7756
T5 Large	Encoder-Decoder	770M	Diverged
BERT Base	Encoder	110M	Diverged
BERT Large	Encoder	335M	Diverged
ViT Base	Encoder	85M	0.7822
ViT Large	Encoder	302M	0.7639

After determining the type of architecture (i.e., the decoder), we conducted further experiments to identify the specific type and size of the decoder architecture. We compared Whisper-tiny, Whisper-small, and GPT2-base, as they produced the best results. Additionally, we evaluated various types and sizes of recurrent architectures: GRU-1, GRU-6, GRU-12, LSTM-1, and LSTM-4, where the number indicates the number of layers used in each model. The embedding size for all recurrent models was set to 1024.

Table 5: Table comparing different decoder architectures presented in Section A.1 for predicting default and attributes of the next transaction on the AlfaBattle dataset. The best results are highlighted in bold and the second best results are underlined.

Architecture	# params.	Amount	MCC Category	24-hour acc	Default
Metric		MSE	Accuracy	Accuracy	AUC
Whisper-tiny	29 M.	0.0660	0.4861	Diverged	0.7892
Whisper-small	153 M.	0.0656	0.4896	0.645	0.7894
GPT-2-base	100 M.	0.0657	0.4888	Diverged	0.7869
GRU-1	0.3 M.	0.0668	0.4817	0.418	0.7854
GRU-big	16 M.	0.0670	0.4805	Diverged	0.7578
GRU-large	35 M.	0.0662	0.4815	Diverged	0.7732
LSTM-1	0.4 M.	0.0669	0.4830	0.634	0.7710
LSTM-4	2 M.	0.0664	0.4858	0.655	0.7664

Table 5 indicates that transformer architectures outperformed recurrent models. Scaling up recurrent models did not significantly enhance their quality and sometimes even degraded their performance. Given the similar results among transformer architectures, we selected Whisper-small as the optimal model for all ESQA experiments.

A.2 Connector architecture selection

Table 6: Table comparing different connector architectures for better modalities alignment. Q-Former architecture based connectors with initialisations from BLIP-2 (Li et al., 2023) pretrained weights are labelled ‘w. init.’, without initialisation are indicated by ‘w/o. init.’. The best results are highlighted in bold and the second best results are underlined.

Architecture	# params.	MCC code	MCC category	Amount
Metric		Accuracy	Accuracy	MSE
Linear	197 k.	0.501	0.574	0.0174
2 x Linear	920 k.	0.523	0.561	0.0196
RNN (LSTM)	3.94 M.	0.509	0.558	0.0220
Transformer	1.1 M.	0.478	0.529	0.1361
2 x Transformer	2.09 M.	0.519	0.555	0.0168
Q-Former-small	14.7 M.	0.519	0.579	0.0162
Q-Former-base (w/o init.)	96 M.	0.526	0.570	0.0189
Q-Former-base (w. init.)	96 M.	0.527	0.569	0.0177

Integrating multiple modalities within a single approach centered around an LLM requires mapping new modalities into a textual model. Employing a separate encoder for each modality simplifies the task to finding an efficient architecture for mapping each modality’s vector space to the LLM embedding text space. When analyzing event sequences, processing extended data sequences presents challenges due to increased context length, which leads to higher computational complexity. In some instances, the sequence length may surpass the maximum context length of the language model.

To address these challenges, we conducted experiments to determine the optimal architecture for the connection layer between the event sequence encoder and the LLM. We evaluated several potential implementations: a single linear layer, a transformer layer, and two model sizes of the Q-Former architecture. Additionally, we investigated the impact of initialization on problem-solving quality and training speed by initializing the Q-Former with weights from the pre-trained visual-text model BLIP-2, based on FLAN-T5. In all experiments, we tackled three tasks in a multi-task mode using the AlfaBattle dataset. The components used in all experiments included Whisper-tiny as the transaction encoder and FLAN-T5-small as the language model. Performance was measured at 20 epochs, with fixed batch size, learning rate, and optimization parameters. We used multi-class accuracy for classification tasks and MSE for numerical response prediction tasks as target metrics.

The results revealed that simply increasing the number of trainable parameters does not necessarily enhance task solution quality. A linear layer with a small number of parameters performed worse than Q-Former-small, which also trained much faster. However, adding more simple identical blocks within a single connector, such as ‘2 x Linear’, did not significantly improve performance. On the other hand, more complex blocks, such as ‘2 x Transformer’, showed substantial quality gains. Increasing the model size to Q-Former-base yielded mixed results: while MCC code prediction quality improved by 2%, the metrics for MCC category prediction and numerical attribute Amount declined.

Additional initialization with weights from visual-text pre-training marginally improved the MCC code prediction task but slightly degraded the metrics for the other two tasks. The overall impact of initialization was minimal, indicating few common patterns between extracting salient information from images and deriving dependencies from event sequences. This discrepancy is expected due to the lack of temporal dependence within a single image, in contrast to the strong temporal dependence between events in a sequence.

Therefore, we selected the Q-Former-base model without initialization, anticipating an increase in the number of tasks our approach can handle simultaneously. This model offers a sufficient margin for increasing the complexity of future experiments.

Appendix B Implementation details

B.0.1 Training and hyper-parameters

All experiments for ESQA described below utilised consistent hyperparameters and approach components, unless otherwise specified. We employed the AdamW optimizer (Loshchilov and Hutter, 2017) with parameters $\beta_{1}=0.9$ , $\beta_{2}=0.98$ , and a weight decay of 0.01. Cosine learning rate decay with restarts was applied, featuring different peak learning rates for each dataset and varying numbers of warm-up steps. In our experiments, LoRA (Hu et al., 2021) with a rank of $r=16$ was applied only to the matrices $W_{q}$ and $W_{v}$ of the self-attention and encoder-decoder attention layers. The LoRA scaling factor was set to 32, and the dropout rate to 0.05. The number of trainable parameters in the language model was calculated as $\theta^{train}=2\times L\times d_{model}\times r$ , where $L$ is the number of layers and $d_{model}$ is the internal dimensionality of the language model. The rank of trainable decomposition matrices is denoted by $r$ . Therefore, the number of trainable parameters in each FLAN-T5 model did not exceed 0.9% of the total parameters (Table 8). All models were trained using 6 Nvidia A100 (80G) GPUs. The training hyperparameters are summarised in Table 7.
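As a sanity check on the parameter counts in Table 8, the number of LoRA parameters can be sketched as below. The layer counts and projection sizes are assumptions taken from the publicly released FLAN-T5 configurations, not values stated in this paper, and the helper names are hypothetical. Each adapted matrix contributes $r\times(d_{in}+d_{out})$ parameters, which equals $2\times d_{model}\times r$ whenever the projection is square; under this reading, $L$ counts the adapted matrices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class T5Shape:
    d_model: int     # hidden size of the language model
    inner_dim: int   # num_heads * d_kv, the output size of W_q and W_v
    enc_layers: int
    dec_layers: int


# Assumed shapes (public FLAN-T5 configurations).
SHAPES = {
    "FLAN-T5-small": T5Shape(512, 384, 8, 8),
    "FLAN-T5-base": T5Shape(768, 768, 12, 12),
    "FLAN-T5-large": T5Shape(1024, 1024, 24, 24),
    "FLAN-T5-xl": T5Shape(2048, 2048, 24, 24),
    "FLAN-T5-xxl": T5Shape(4096, 4096, 24, 24),
}


def lora_trainable_params(shape, rank=16, adapted_per_block=2):
    """Count LoRA parameters when only W_q and W_v are adapted."""
    # Encoder self-attention + decoder self-attention + decoder cross-attention.
    attention_blocks = shape.enc_layers + 2 * shape.dec_layers
    # Each adapted matrix gets A (rank x d_model) and B (inner_dim x rank).
    per_matrix = rank * (shape.d_model + shape.inner_dim)
    return attention_blocks * adapted_per_block * per_matrix


if __name__ == "__main__":
    for name, shape in SHAPES.items():
        print(f"{name}: {lora_trainable_params(shape) / 1e6:.3f} M")
```

Under these assumed shapes the counts agree with the absolute numbers reported in Table 8.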

Table 7: Hyperparameters used for ESQA training. In all experiments, the Whisper-small model architecture was used as the encoder.

Dataset	AlfaBattle2	Age	Gender	X5	Taiwan
LLM	FLAN-T5-xl	FLAN-T5-xl	FLAN-T5-large	FLAN-T5-xl	FLAN-T5-xl
Emb. size	201	110	74	163	100
Learn. rate	3e-4	3e-4	1e-4	1e-4	1e-4
Warm-up steps	4 k.	1 k.	1 k.	4 k.	1 k.
Max. epochs	40	10	10	20	30
Batch size	300	250	50	250	50
Min seq. len.	50	0	0	0	6
Max seq. len.	750	1500	1500	750	6

Table 8: Trainable parameters of LLM with LoRA.

Model	% trainable params.	# trainable params.
FLAN-T5-small	0.8862	0.688 M.
FLAN-T5-base	0.7096	1.77 M.
FLAN-T5-large	0.5989	4.72 M.
FLAN-T5-xl	0.3301	9.44 M.
FLAN-T5-xxl	0.1692	18.87 M.

B.0.2 Evaluation strategy

We employed several classical machine learning metrics to thoroughly evaluate the proposed approach. As previously mentioned, ESQA is designed to handle tasks that can be framed as binary or multi-class classification as well as regression settings.

Classification Metrics. We utilised classification metrics for tasks that involved predicting a categorical feature of the next event or a characteristic of the entire sequence (e.g., default of a bank customer). For non-binary target tasks, we used Accuracy and F1-score. For binary target tasks, we employed the Area Under the Receiver Operating Characteristic curve (ROC-AUC). The model with the highest performance on these metrics was deemed the best.

To calculate the classification metrics Accuracy and F1 score using the language model’s response in the question-answer format, we applied the following process. The question body was followed by an instruction specifying the format of the answer to clearly define the structure of the language model’s output. The tokens predicted by the language model were then decoded into text, and the segments containing the desired answer were extracted. These extracted values $y$ were compared to the target $\hat{y}$ in a classification format, where the number of classes matched the cardinality of the predicted value. Subsequently, Accuracy and F1 were calculated as follows:

\text{Accuracy}(y,\hat{y})=\frac{1}{n_{\text{samples}}}\sum_{i=0}^{n_{\text{samples}}-1}1(\hat{y}_{i}=y_{i})

\text{F1}=\frac{2*\text{TP}}{2*\text{TP}+\text{FP}+\text{FN}}

In this context, $TP$ represents the number of true positives, $FN$ stands for the number of false negatives and $FP$ denotes the number of false positives.
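The procedure above can be illustrated with a short sketch. The answer pattern `Answer: <label>` is a hypothetical response format chosen for the example, not the exact instruction used in our prompts. The F1 function is computed for a single class treated as positive, since the averaging over classes is not fixed by the formula above. Responses from which no answer can be extracted are counted as errors in this sketch.

```python
import re

# Hypothetical answer format: "... Answer: <label>"
ANSWER_PATTERN = re.compile(r"answer\s*:\s*(\S+)", re.IGNORECASE)


def extract_label(text):
    """Return the label found in a decoded response, or None if absent."""
    match = ANSWER_PATTERN.search(text)
    return match.group(1).strip(".,") if match else None


def accuracy(predicted, target):
    """Fraction of samples where the extracted label equals the target."""
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


def f1_score(predicted, target, positive):
    """F1 = 2TP / (2TP + FP + FN) for one class treated as positive."""
    pairs = list(zip(predicted, target))
    tp = sum(p == positive and t == positive for p, t in pairs)
    fp = sum(p == positive and t != positive for p, t in pairs)
    fn = sum(p != positive and t == positive for p, t in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


decoded = ["Answer: 3", "Answer: 1", "The answer: 3.", "no idea"]
targets = ["3", "2", "3", "3"]
labels = [extract_label(text) for text in decoded]
print(accuracy(labels, targets), f1_score(labels, targets, positive="3"))
```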

In calculating the ROC-AUC metric, we utilised the difference between the probabilities of the positive and negative response tokens.
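A minimal sketch of this scoring scheme is given below. It assumes that the probabilities of the positive and negative answer tokens have already been read from the language model output distribution. The AUC is computed with the pairwise ranking definition in plain Python, and the function names are hypothetical.

```python
def answer_score(p_positive, p_negative):
    """Ranking score: difference between the two answer-token probabilities."""
    return p_positive - p_negative


def roc_auc(scores, labels):
    """Probability that a random positive is scored above a random negative.

    Ties contribute 0.5. Labels are expected to be 0 or 1.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# (p_positive_token, p_negative_token) per sample: illustrative values only.
token_probs = [(0.7, 0.2), (0.6, 0.3), (0.4, 0.5), (0.1, 0.8)]
labels = [1, 0, 1, 0]
scores = [answer_score(p, n) for p, n in token_probs]
print(roc_auc(scores, labels))
```

The quadratic pairwise loop is used only for readability; a rank-based implementation would be preferable on large validation sets.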

Regression Metrics. To evaluate prediction performance for tasks with real-valued target variables, we employed Mean Absolute Error (MAE) and Mean Squared Error (MSE) metrics. For calculating these regression metrics, each question was accompanied by instructions specifying the format and range of the expected answer. The required numerical values (both real and integer) were then extracted from the LLM’s textual predictions according to the given response structure. Instances where the prediction could not be interpreted as a number were excluded from the final metric calculation (we made this assumption based on the rarity of such instances, given the clarity of the questions and the accompanying guidance provided for answering them). The selected numerical responses, denoted as $y$ , were compared with the target values $\hat{y}$ for accurate assessment:

\text{MAE}(y,\hat{y})=\frac{1}{n_{\text{samples}}}\sum_{i=0}^{n_{\text{samples}}-1}\left|y_{i}-\hat{y}_{i}\right|

\text{MSE}(y,\hat{y})=\frac{1}{n_{\text{samples}}}\sum_{i=0}^{n_{\text{samples}}-1}(y_{i}-\hat{y}_{i})^{2}
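The regression evaluation can be sketched as follows. The regular expression used to locate a number in the response is an assumption made for illustration, and the function names are hypothetical. Responses without a parsable number are dropped before averaging, as described above.

```python
import re

# Assumed pattern: the first signed integer or decimal number in the response.
NUMBER_PATTERN = re.compile(r"[-+]?\d+(?:\.\d+)?")


def extract_number(text):
    """Return the first number in a decoded response, or None if there is none."""
    match = NUMBER_PATTERN.search(text)
    return float(match.group()) if match else None


def regression_metrics(decoded, targets):
    """Compute (MAE, MSE, number of skipped samples) over parsable responses."""
    parsed = [(extract_number(text), target) for text, target in zip(decoded, targets)]
    kept = [(y, y_hat) for y, y_hat in parsed if y is not None]
    if not kept:
        raise ValueError("no response could be interpreted as a number")
    n = len(kept)
    mae = sum(abs(y - y_hat) for y, y_hat in kept) / n
    mse = sum((y - y_hat) ** 2 for y, y_hat in kept) / n
    return mae, mse, len(parsed) - n


decoded = ["0.25", "amount: 0.5", "unknown", "1.0"]
targets = [0.25, 0.75, 0.1, 0.5]
mae, mse, skipped = regression_metrics(decoded, targets)
print(mae, mse, skipped)
```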

B.1 Detailed datasets description

A complete list of the datasets and a description of each dataset is given below. Main statistics and descriptions for each dataset are provided in Table 9.

Table 9: Statistics of the datasets used for models evaluation.

Dataset	AlfaBattle	Age	Gender	X5	Taiwan
# events	443 M.	44 M.	6.85 M.	45.8 M.	0.18 M.
# sequences	1.47 M.	30 K.	9.2 K.	400 K.	30 K.
Avg. seq. len.	881.7	862.4	446.6	114.3	6
# numeric	3	1	3	3	3
# categorical	15	2	2	3	5
# classes	2	4	2	4	2
train/val split %	70/30	90/10	90/10	90/10	90/10

AlfaBattle2.0 dataset. The AlfaBattle2.0 dataset (Evgeny and Max, 2021) consists of transaction activity records of bank customers over a two-year period, capturing spending, payments, and transfers. The primary goal is to estimate the probability of a customer defaulting on a loan within a given timeframe. The default rate in this dataset is 2.76%. Each customer is associated with a sequence of transactions, each described by 18 features: 3 numeric and 15 categorical. The numeric features include the normalized transaction amount, the number of hours since the customer’s last transaction, and the number of days until the loan is disbursed. The categorical features encompass various identifiers: the merchant’s code and category, the currency and type of payment card, and the city, country, etc. All categories are encoded with numeric values to ensure the dataset remains anonymized. The temporal component is defined by the attributes of hour, day of the week and week of the year, which in combination form the transaction date and time.

Age Group Prediction Competition. This dataset (Sirius, 2020) comprises anonymized transaction records of bank customers, with the aim of predicting the age group of each client based on their transactions. Each transaction is characterised by three features: a discrete MCC (Merchant Category Code) identifying the type of merchant, the transaction date, and the transaction amount. Transactions can be grouped according to the unique customer identifier specified in the transaction description. The merchant identifier is also provided in text form, with categories such as ’bookshop’, ’ATM’, ’pharmacy’, etc. This allows for a more detailed and nuanced analysis of spending patterns related to different age groups.

Gender Prediction Competition. The primary goal of this competition is to predict the gender of bank customers based on their transaction activity (Max, 2019). The dataset includes historical transaction and transfer data spanning one year and three months. Each transaction record is associated with a unique client ID and contains the time and date of the transaction, its type, the transaction amount, and a discrete identifier for the merchant point. The transaction amount is not normalised and can indicate both inflows and outflows of funds. A negative value signifies a debit, while a positive value denotes a credit to the account.

Taiwan Default of Credit Card Clients. This dataset (Yeh and hui Lien, 2009) includes customer transaction data from April to September 2005, and it is used to predict whether a customer will repay their borrowed credit. Each record in the dataset contains 8 attributes (3 numeric and 5 categorical, as listed in Table 9). Some attributes describe the customer’s characteristics, such as age, education level, and marital status, while the remaining attributes provide details about the history of loan repayments.

X5 Retail Hero: Uplift Modeling for Promotional Campaign. Initially designed for an uplift modeling competition, this dataset focuses on predicting a customer’s age based on their purchasing activity (Babaev et al., 2022). Each purchase in the dataset is characterized by the time of the transaction, product type, segment, purchase amount, and the type of loyalty program associated with the customer.

B.2 Baselines implementation details

Below, we provide details about the architectures and hyperparameters of the baseline approaches used in our study.

Handcrafted features with LightGBM: This baseline aggregates numerical feature values across buckets and includes statistics such as count, mean, variance, minimum, and maximum. The LightGBM classifier (Ke et al., 2017) is then used for prediction.
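A minimal sketch of such an aggregation is shown below. It assumes that buckets are defined by the values of one categorical field and that population variance is used; both choices, as well as the field names `mcc` and `amount` and the function name, are illustrative. The resulting flat feature dictionary is the kind of input that would then be passed to a gradient boosting classifier.

```python
from collections import defaultdict
from statistics import mean, pvariance


def aggregate_features(events, numeric_key, bucket_key):
    """Aggregate one numeric field per bucket of a categorical field."""
    buckets = defaultdict(list)
    for event in events:
        buckets[event[bucket_key]].append(event[numeric_key])

    features = {}
    for bucket, values in sorted(buckets.items()):
        prefix = f"{numeric_key}_{bucket_key}={bucket}"
        features[f"{prefix}_count"] = len(values)
        features[f"{prefix}_mean"] = mean(values)
        features[f"{prefix}_var"] = pvariance(values)  # population variance (assumption)
        features[f"{prefix}_min"] = min(values)
        features[f"{prefix}_max"] = max(values)
    return features


events = [
    {"mcc": 1, "amount": 2.0},
    {"mcc": 1, "amount": 4.0},
    {"mcc": 2, "amount": 1.0},
]
features = aggregate_features(events, numeric_key="amount", bucket_key="mcc")
print(features)
```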

Randomly initialised RNN encoder: This approach utilizes a randomly initialized and untrained RNN sequence encoder based on a unidirectional Gated Recurrent Unit (GRU) with a single hidden layer of size 1024. The resulting 1024-dimensional event sequence representations are used with LightGBM to solve the downstream task.

CoLES (Contrastive Learning for Event Sequences): This method employs a self-supervised contrastive pretraining approach called CoLES (Babaev et al., 2022) to generate vector representations of event sequences. The encoder is a recurrent neural network (RNN) GRU with one hidden layer of size 1024, producing final embeddings of the same size. A supervised classifier based on LightGBM is then trained using the pretrained embeddings.

CPC (Contrastive Predictive Coding): This approach uses a similar sequence encoder architecture to CoLES but applies the Contrastive Predictive Coding (CPC) method (van den Oord et al., 2018) for pretraining. CPC is a self-supervised technique for learning vector representations using an autoregressive model for non-discrete data sequences.

Barlow Twins: This method follows the same scheme and sequence encoder architecture as CoLES and CPC but implements a Barlow Twins Loss (Zbontar et al., 2021) for encoder pre-training. LightGBM is then used on the obtained embeddings for solving the downstream problem.

NSP (Next Sequence Prediction): This baseline employs an RNN sequence encoder with a unidirectional GRU and a single hidden layer of size 1024, pretrained on the Next Sequence Prediction task (Devlin et al., 2019). The resulting 1024-dimensional embeddings are used with LightGBM for the downstream task.

RTD (Replaced Token Detection): Similar in architecture to the NSP baseline, this approach uses the Replaced Token Detection loss function from the ELECTRA paper (Clark et al., 2020).

SOP (Sequences Order Prediction): Identical in architecture to NSP and RTD, this baseline uses the Sequences Order Prediction loss function from the ALBERT work (Lan et al., 2019).

MLM NSP (Masked Language Modelling with Next Sentence Prediction): This approach uses a LongFormer (Beltagy et al., 2020) with 4 attention heads, 8 hidden layers of dimension 2048, and a maximum of 2000 positions as an event encoder. The output embedding size is 2048. The encoder is pretrained using a combination of Masked Language Model and Next Sentence Prediction tasks as in BERT (Devlin et al., 2019). LightGBM is then used on the obtained embeddings for the downstream task.

TabFormer: This approach implements the TabFormer method (Padhi et al., 2020), utilizing a LongFormer (Beltagy et al., 2020) with 4 attention heads, 8 hidden layers of dimension 2048, and a maximum of 2000 positions as the sequence encoder. The output embedding size is 2048. The encoder is pretrained using the Masked Language Modelling (MLM) task (Devlin et al., 2019). LightGBM is then used on the obtained embeddings for solving the downstream problem.

GPT: This approach uses a GPT-2 architecture (Radford et al., 2019) as the event sequence encoder, with 12 layers, 12 heads per layer, and position encoding up to 2056 positions. The embedding dimension is 768. The encoder is pretrained on an autoregressive task of predicting the fields of the next transaction, each using a separate head. LightGBM is used on the obtained embeddings for the downstream task.

RNN with CoLES: This baseline differs from the standard CoLES approach by adding several MLP heads to the event sequence encoder after contrastive pre-training. This architecture is then end-to-end trained on the target task.

GPT with descr.: This approach modifies the conventional GPT-2 baseline by applying discretization to the numerical features of events.

CatBoost: A simple implementation of the CatBoost algorithm (Ostroumova et al., 2017) trained on event features.

Text LLM: This text-based LLM approach serializes event features into a string using a template, selecting only the attributes necessary for the task while ignoring others due to the long token sequence. The length of event sequences is also reduced to fit the language model’s context. For this baseline, we used the FLAN-T5-xl (Wei et al., 2021) model.
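The serialization step of this baseline can be sketched as follows. The template string, the choice to keep the most recent events when truncating, and the field names are all assumptions made for illustration; the paper does not fix them here.

```python
def serialize_events(events, fields, max_events, template="{name} is {value}"):
    """Turn the selected fields of the last `max_events` events into text."""
    if max_events <= 0:
        raise ValueError("max_events must be positive")
    recent = events[-max_events:]  # truncate to fit the context (assumption: keep latest)
    lines = []
    for index, event in enumerate(recent, start=1):
        parts = ", ".join(template.format(name=f, value=event[f]) for f in fields)
        lines.append(f"Event {index}: {parts}.")
    return "\n".join(lines)


events = [
    {"mcc": 4111, "amount": 0.05, "city": 7},
    {"mcc": 5411, "amount": 0.12, "city": 7},
    {"mcc": 5812, "amount": 0.4, "city": 3},
]
# Only the attributes needed for the task are serialized; "city" is ignored.
prompt_context = serialize_events(events, fields=["mcc", "amount"], max_events=2)
print(prompt_context)
```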

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
Answer: [Yes]
Justification: The main claims made in the abstract and introduction match the theoretical and experimental results from Section 4.2. The sections ‘Predictive tasks’ and ‘Generalisation abilities’ confirm that the approach is able to solve multiple tasks with and without fine-tuning. Results supporting the ability to solve downstream tasks are presented in the ‘Main results’ section.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Limitations are discussed in a separate “Limitations” section (Section 6).
Guidelines:
- The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that are not acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [N/A]
Justification: This paper does not include theoretical results.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: The paper includes a detailed description of the configurations and parameters of the experiments. The implementation code of the neural network architecture and supporting artefacts will be published once the paper has been accepted for publication.
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [No]
Justification: The implementation code of the neural network architecture and supporting artefacts will be published after the article has been accepted for publication.
Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: The paper includes a detailed description of the configurations and hyperparameters of the experiments in Sections B.0.1 and B.0.2 of the Supplementary materials.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [No]
Justification: Statistical errors are not reported in the results tables due to space limitations; they can be provided later if required. For both the baselines and the presented approach, statistical errors were calculated over 3 runs of the training pipeline.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: The technical details for launching the training and inference of the presented approach are described in Section B.0.1.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that did not make it into the paper).
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: The authors of this paper have reviewed the NeurIPS Code of Ethics.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [N/A]
Justification: We are aware of the potential risks of using predictive models in real-world applications and business environments, such as incorrect predictions or leakage of sensitive data. Regarding the first risk, we maintain that at this stage of the work, a pure prototype version is presented, with no practical implementation. Its predictions are advisory and need to be verified by a human. To address the latter risk, all datasets have been thoroughly anonymised for sensitive and confidential information and cleaned of any initial ethical bias.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [No]
Justification: At this stage of development, the approach is assumed to be purely advisory, with human verification of its responses to guard against errors.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: All of the methods, algorithms and datasets mentioned in the paper include references to the original papers, sources and contributors.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset’s creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [N/A]
Justification: The paper does not present new assets.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [N/A]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [N/A]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.