Integrating Large Language Models with Graph-based Reasoning for Conversational Question Answering

Parag Jain   Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
[email protected]    [email protected]

We focus on a conversational question answering task which combines the challenges of understanding questions in context and reasoning over evidence gathered from heterogeneous sources like text, knowledge graphs, tables, and infoboxes. Our method utilizes a graph structured representation to aggregate information about a question and its context (i.e., the conversation so far and evidence retrieved to find an answer), while also harnessing the reasoning and text generation capabilities of large language models (LLMs). Graph embeddings are directly injected into the LLM, bypassing the token embedding layers, and learned end-to-end by minimizing cross-entropy. Our model maintains a memory module to track and update past evidence, thus influencing the graph’s structure, as the conversation evolves. Experimental results on the ConvMix benchmark Christmann et al. (2022a) show that graph embeddings enhance the LLM’s ability to reason, while the memory module provides robustness against noise and retrieval errors.

1 Introduction

Conversational question answering is an information seeking task where users engage in interactive conversations with AI systems Choi et al. (2018); Reddy et al. (2019); Dalton et al. (2022). Unlike traditional question answering applications Rajpurkar et al. (2016), conversational systems are expected to track the context of a conversation, i.e., remember previous questions and answers to provide relevant responses in an ongoing dialogue. The majority of prior work has studied different instantiations of conversational question answering, based on the simplifying assumption that answers can be found in a single information source. Examples include querying knowledge graphs such as Wikidata Perez-Beltrachini et al. (2023); Christmann et al. (2022a); Saha et al. (2018), identifying answer spans in Wikipedia articles Reddy et al. (2019); Choi et al. (2018), and searching for answers in table cells Iyyer et al. (2017).

In this paper we focus on conversational question answering over multiple and heterogeneous information sources. Figure 1 shows an example interaction from ConvMix (Christmann et al., 2022b), a recently curated dataset, which combines the challenges of understanding questions in context, and retrieving their answers from multiple sources. As can be seen, answers are located in knowledge base triples (response to Q1), infoboxes (responses to Q4 and Q5), and tables (responses to Q2 and Q3). It is also possible for an answer to be found in different sources which may in turn disagree. Moreover, the interaction in Figure 1 displays the hallmarks of naturalistic dialogue. The second question (Fact Rank?) can only be interpreted by taking into account the topic of the conversation (i.e., the album Kid A) mentioned in the previous utterance. Follow-on questions are short and may seem ungrammatical taken out of context. As the conversation unfolds, the topic shifts from the album Kid A to the Rolling Stone magazine; Q4 in Figure 1 has no dependencies on previous utterances and a hypothetical system would have to recognize that a new topic is being introduced.

Figure 1: Example interaction (left) from the ConvMix development set Christmann et al. (2022b) and relevant evidence at query Q3 (right). Utterances Q1–Q3 explore the topic of album Kid A. Q4 transitions to the topic of Rolling Stone magazine. The evidence is retrieved from diverse sources highlighted in red. Wikipedia text and tables are prepended with their respective article title. Known entities are shown in blue. Underlined entities are identified through string matching.
Figure 2: Graph for retrieved evidence (subset) from Figure 1. Tokens within each instance create local subgraphs in the form of a linear chain. Local subgraphs are connected through common entities (within <n> – </n>) to build a global graph. Same color highlights connections between similar entities (some edges are omitted for clarity).

We propose a modeling approach to conversational question answering which integrates large language models (LLMs) with graph-based reasoning. The core idea is to represent information about a question and its context (i.e., the conversation so far and sources retrieved to find an answer) through a dynamically generated graph whose size varies with each utterance. Our method utilizes a graph structured representation Gori et al. (2005); Scarselli et al. (2009) to aggregate information (and resolve conflicts) from multiple sources, while also harnessing the reasoning and text generation capabilities of LLMs. Our graph network is efficiently trained using gradients from the LLM. Graph embeddings are directly injected into the LLM, bypassing the token embedding layers, and learned end-to-end by minimizing cross-entropy loss. To manage topic shifts and keep track of the conversation flow, we introduce a memory module that stores evidence used to answer previous questions, thus allowing past information to be re-used for answering future questions. Our contributions are:

  • A method to aggregate evidence from multiple sources into a dynamic graph representation for conversational question answering.

  • We efficiently integrate the evidence-based graph with LLMs for end-to-end training.

  • We keep track of past evidence in a memory module which is updated as the conversation evolves and influences the graph structure and its representation.

  • Extensive experiments on the ConvMix dataset Christmann et al. (2022b) demonstrate that graph structure enhances the LLM’s ability to reason over multiple sources, while the memory module affords robustness to noise and retrieval errors.

2 Related Work

Conversational Question Answering

Most previous work on conversational question answering operates over a single information source such as a knowledge graph, text passage, or table (Choi et al., 2018; Reddy et al., 2019; Perez-Beltrachini et al., 2023; Iyyer et al., 2017). Existing models tend to be specialized, catering to isolated modalities (e.g., text or tables), while a few approaches adopt graph-based representations to organize the conversation and available information (Shen et al., 2019; Jain and Lapata, 2023; Kacupaj et al., 2021; Mueller et al., 2019). A notable exception is Christmann et al. (2023), who propose an end-to-end model for multiple information sources. Specifically, their method constructs a heterogeneous graph based on evidence retrieved from tables, infoboxes, text snippets, and Wikidata triples. This graph is iteratively pruned at inference time to a smaller subgraph containing the answer (i.e., an entity node) to the question.

Our work also integrates information from multiple sources into a graph. However, we do not model question answering as a classification task, but instead propose a generative model. We leverage graph representations and the reasoning capabilities of language models, without relying on specialized inference procedures.

LLMs with Graphs

A common approach to encoding graph structure for LLMs involves describing the graph in natural language so that it resembles text (Ye et al., 2023; Wang et al., 2024). There is no consensus on how to convert graphs to text, and most methods rely on hand-crafted rules. Previous efforts have shown it is challenging for LLMs to reason over graph representations (Fatemi et al., 2024; Huang et al., 2024), even when explicit prompts are given that describe the structure of the graph in natural language Huang et al. (2024). Performance tends to be brittle and task dependent Wang et al. (2024); Fatemi et al. (2024).

Our work proposes a parameter-efficient method for learning task-specific graph representations. It is closest to Perozzi et al. (2024), who use graph embeddings as soft-prompts to represent structured data for LLMs. In a similar vein, Chai et al. (2023) use prefix-tuning to integrate graph embeddings with LLM attention layers. Their approach shows promising results on small graphs with a few nodes ($\sim$20) and limited variability. It also relies on the architecture of the LLM and may not seamlessly integrate with other models, e.g., Mixture-of-Experts (MoE; Shazeer et al. 2017; Jacobs et al. 1991).

Retrieval-augmented Generation

Our work integrates LLMs with graph structural information based on evidence retrieved from the Wikidata knowledge graph Vrandečić and Krötzsch (2014), Wikipedia text, tables, and infoboxes. Although we do not focus on retrieval as such, it plays a key role in identifying information for building the graph. Our approach can thus be viewed as a variant of retrieval augmented generation (RAG), since it conditions generation on freshly retrieved evidence based on user queries (Izacard et al., 2024; Khandelwal et al., 2020; Guu et al., 2020).

3 Overview

We assume a conversational question answering setting Christmann et al. (2022b) that requires reasoning over Wikipedia facts attested in multiple sources such as text, tables, infoboxes, and the Wikidata knowledge graph (KG). Given interaction $I$, our task is to answer question $q_t$ at turn $t$, taking into account retrieved evidence $r_t$ and previous turns $I[:t-1]$ which consist of questions and their answers $\langle q_t, a_t \rangle$ (see Figure 1). To accommodate information from the conversation so far, we concatenate question $q_t$ at turn $t$ with previous question-answer pairs, i.e., $Q_t = [q_1, a_1 \ldots q_{t-1}, a_{t-1}, q_t]$, and use this to retrieve evidence.
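The construction of the contextualized query can be sketched in a few lines. This is a minimal illustration only: the function name `build_query` and the whitespace separator are assumptions, since the text specifies the order of the concatenated elements but not how they are joined.

```python
from typing import List, Tuple


def build_query(history: List[Tuple[str, str]], question: str, sep: str = " ") -> str:
    """Concatenate previous (question, answer) pairs with the current question.

    `sep` is an assumed separator; the paper does not state one.
    """
    parts: List[str] = []
    for past_question, past_answer in history:
        parts.extend([past_question, past_answer])
    parts.append(question)
    return sep.join(parts)


# Placeholder strings standing in for q_1, a_1, q_2, a_2 and the current q_3.
history = [("q1", "a1"), ("q2", "a2")]
query = build_query(history, "q3")
```

At the first turn the history is empty, so the query reduces to the question itself.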

Figure 3: Sketch of proposed architecture. 1 shows query Q3 from the interaction in Figure 1. 2 shows KG triples retrieved with CLOCQ and their entities ( 3). Wikipedia articles for 3 are parsed to extract sentences, infoboxes and tables. In 4, retrieved evidence is ranked based on the current query using BM25. 5 creates an instruction prompt based on the input query (see Appendix A for the prompt template). In 6, a graph is constructed based on top ranked instances. 7 depicts the learned graph neural network. Graph node embeddings are initialized using LLM token embeddings that are separate from the base model. 8 shows the final embeddings which are passed to the LLM and are obtained by concatenating prompt (prefix, suffix) and graph embeddings (shown in different colors). 9 is the LLM without the token embedding layer.

As depicted in Figure 3, we adopt a modular approach. Given query $Q_t$, we retrieve and rank relevant evidence (Section 4.1). We next organize retrieved information into a graph (Section 4.3) and learn graph embeddings using Graph Attention Networks (GAT; Velickovic et al. 2018; Brody et al. 2022). Finally, graph embeddings are injected into an LLM by skipping the token embedding layer (Section 4.5). Unlike Christmann et al. (2023) who extract answers from retrieved evidence, we generate them. Our model $\mathcal{M}$ is thus formulated as:

$$a_t = \mathcal{M}\left(I[:t-1], q_t, r_t; \Theta\right) \tag{1}$$

where $q_t$ is the current question, $r_t$ is the graph representing retrieved evidence, $I[:t-1]$ are previous turns, and $\Theta$ the parameters of our model which are fine-tuned on task-specific data (Section 4.6).

4 Model

4.1 Evidence Retrieval

We adopt the retrieval pipeline outlined in Christmann et al. (2022b). As mentioned earlier, information is obtained from Wikipedia pages and the Wikidata KG using a query based on the current question concatenated with previous question-answer pairs. Retrieval takes place in two stages. Initially, evidence is retrieved from the Wikidata KG, and then followed by retrieval from Wikipedia.

We extract Wikidata triples (see 2 in Figure 3) using CLOCQ (Christmann et al., 2022a), a retrieval engine specifically tailored to question answering over knowledge bases. It preprocesses the knowledge graph in a memory efficient manner and returns the top-$k$ triples based on query terms. Figure 3 shows a subset of relevant triples retrieved for Q3 along with the KG entities $E_{\mathcal{E}}$.

We next obtain evidence pertaining to additional Wikipedia sources by retrieving articles corresponding to the entities in $E_{\mathcal{E}}$. These pages are subsequently processed to extract text, tables, and infoboxes (see 3 in Figure 3). Tables are linearized by individually transforming each row into text and concatenating it with corresponding column headers. Infoboxes are linearized in a similar fashion by concatenating key-value pairs with header information (if available). KB triples are linearized by a simple concatenation of individual elements. Wikipedia text is split into sentences, each of which serves as a separate piece of evidence.
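The linearization rules above can be illustrated with a short sketch. The exact templates are not given in the text, so the `header: value` format, the title prefix, and all function names below are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def linearize_table_row(title: str, headers: Sequence[str], row: Sequence[str]) -> str:
    """Turn one table row into text, pairing each cell with its column header."""
    cells = [f"{header}: {value}" for header, value in zip(headers, row)]
    return f"{title}, " + ", ".join(cells)


def linearize_infobox(title: str, entries: Dict[str, str]) -> List[str]:
    """Emit one string per key-value pair (assumed granularity), prefixed by the title."""
    return [f"{title}, {key}: {value}" for key, value in entries.items()]


def linearize_triple(triple: Tuple[str, str, str]) -> str:
    """Concatenate subject, predicate, and object."""
    return " ".join(triple)


# Toy inputs with placeholder values.
row_text = linearize_table_row("Title", ["Col A", "Col B"], ["1", "2"])
triple_text = linearize_triple(("subject", "predicate", "object"))
```

Each returned string is one piece of evidence that can later be ranked and tokenized.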

The evidence collected at this stage can be extensive, potentially comprising several thousand instances, which would in turn lead to a very large graph (see Section 4.3). To manage this, we employ BM25 (Robertson and Zaragoza, 2009) to rank the evidence against the query and retain only the best scoring instances (see 4 in Figure 3). Let $E_t$ denote the set of top-$k$ retrieved instances at turn $t$.
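A minimal sketch of this ranking step follows. It implements one common Okapi BM25 variant (with a non-negative IDF) in plain Python; the whitespace tokenization, the parameter values `k1` and `b`, and the function name `bm25_top_k` are assumptions and do not necessarily match the configuration used in the experiments.

```python
import math
from collections import Counter
from typing import List


def bm25_top_k(query: str, evidence: List[str], k: int = 3,
               k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Score each evidence string against the query and return the k best."""
    docs = [e.lower().split() for e in evidence]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many evidence strings each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(query.lower().split())

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)

    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return [evidence[i] for i in ranked[:k]]


# Toy evidence pool; the strings are placeholders.
pool = ["alpha beta", "gamma delta", "alpha alpha gamma"]
top = bm25_top_k("gamma delta", pool, k=2)
```

Evidence that shares no term with the query receives a score of zero and is ranked last.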

4.2 Evidence Memory

By design, we retrieve new evidence at every turn $t$, which may suggest that every question introduces a new topic. However, a well-known property of conversational dialogue is topic inertia (Chai and Jin, 2004), i.e., users tend to explore the same topic for a while before switching to a new topic (see the interaction in Figure 1). We propose to keep track of past topics through a memory module which stores previously retrieved pieces of evidence to be re-utilized and re-ranked against $Q_t$. Specifically, at each turn $t$ we define evidence memory $M_t$ as,

$$M_t = \oplus \, \{E_j \mid j \in [1 \ldots t-1]\} \tag{2}$$

where $\oplus$ denotes concatenation. We replace a proportion (e.g., one third) of low-ranked instances from $E_t$ with the top-ranking ones from $M_t$. We employ the Sentence-BERT model Reimers and Gurevych (2019) to re-rank the evidence stored in $M_t$, using $Q_t$ as a query.
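The memory update can be sketched as follows. This is an illustration under stated assumptions: `overlap_score` is a hypothetical token-overlap stand-in for Sentence-BERT similarity, the current evidence list is assumed to be ordered best-first, and the de-duplication and backfill behaviour are design choices of this sketch rather than details given in the text.

```python
from typing import Callable, List


def overlap_score(query: str, text: str) -> float:
    """Hypothetical stand-in for a Sentence-BERT similarity score."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def build_memory(past_evidence: List[List[str]]) -> List[str]:
    """Eq. (2): concatenate the evidence sets E_1 ... E_{t-1}."""
    return [item for turn in past_evidence for item in turn]


def merge_with_memory(current: List[str], memory: List[str], query: str,
                      score: Callable[[str, str], float] = overlap_score,
                      proportion: float = 1 / 3) -> List[str]:
    """Replace the lowest-ranked share of `current` with top-ranked memory items."""
    n_replace = min(len(current), max(0, round(len(current) * proportion)))
    split = len(current) - n_replace
    kept, dropped = current[:split], current[split:]
    # Skip duplicates and anything already kept from the current turn.
    candidates = [m for m in dict.fromkeys(memory) if m not in kept]
    reranked = sorted(candidates, key=lambda m: score(query, m), reverse=True)
    chosen = reranked[:n_replace]
    # If memory is too small, fall back to the items that would have been dropped.
    backfill = [d for d in dropped if d not in chosen][: n_replace - len(chosen)]
    return kept + chosen + backfill


memory = build_memory([["m alpha"], ["m beta"]])
merged = merge_with_memory(["c1", "c2", "c3"], memory, query="beta")
```

With an empty memory (the first turn), the function returns the current evidence unchanged.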

4.3 Graph Construction

Retrieved information is organized into a graph (see  6, Figure 3) by first converting individual pieces of evidence into a linear chain. Local subgraphs are then merged into a global graph by linking common entities between them. Figure 2 shows example graphs with local and global connections.

To construct a local graph, evidence from different sources is linearized (as discussed in Section 4.1) and tokenized using a base LLM tokenizer. Tokens within each instance are treated as graph nodes connected in a linear chain. In other words, evidence $w$ with tokens $w_1 \ldots w_{|w|}$ is represented by local sub-graph $w_1 \rightarrow w_2 \rightarrow \ldots \rightarrow w_{|w|}$.

Connecting different pieces of evidence together is critical for enabling more global reasoning. We create a global graph by linking similar entities across local subgraphs. In this context, entities are KG items but also text spans in Wikipedia text, infoboxes, and tables gathered during retrieval. We identify entity spans by performing string matching against KG entities. In Figure 2, such entities are enclosed in <n> node </n> tags. Finally, entity spans referring to the same entity are linked, thus creating a more globally connected graph.
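The two-level construction can be sketched as below. Several details are assumptions made for illustration: whitespace tokenization replaces the LLM tokenizer, an entity mention is represented by the node of its first token, and entity links are added in both directions. The function name `build_evidence_graph` is hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def build_evidence_graph(evidence: List[str],
                         entities: List[str]) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Return (nodes, edges): token nodes, chain edges, and entity-link edges."""
    nodes: List[str] = []
    edges: List[Tuple[int, int]] = []
    mentions: Dict[str, List[int]] = {}

    for text in evidence:
        tokens = text.split()
        offset = len(nodes)
        nodes.extend(tokens)
        # Local subgraph: linear chain w1 -> w2 -> ... -> w|w|.
        edges.extend((offset + i, offset + i + 1) for i in range(len(tokens) - 1))
        # Entity spans found by exact string (token sequence) matching.
        for entity in entities:
            ent_tokens = entity.split()
            if not ent_tokens:
                continue
            for start in range(len(tokens) - len(ent_tokens) + 1):
                if tokens[start:start + len(ent_tokens)] == ent_tokens:
                    mentions.setdefault(entity, []).append(offset + start)

    # Global graph: link all mentions of the same entity.
    for node_ids in mentions.values():
        for a, b in combinations(node_ids, 2):
            edges.extend([(a, b), (b, a)])
    return nodes, edges


# Toy evidence strings mentioning the same entity.
nodes, edges = build_evidence_graph(["Kid A album", "ranking of Kid A"], ["Kid A"])
```

Note that no chain edge connects the last token of one evidence string to the first token of the next; separate pieces of evidence are connected only through shared entities.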

4.4 Graph Encoder

Our model generates an answer at each turn $t$ given query $Q_t$ and graph $\mathcal{G}_t$ representing relevant evidence (see Figure 3). More formally, $\mathcal{G}_t = (\mathcal{V}, \mathcal{E})$ is a directed graph with nodes $\mathcal{V} = \{v_1, v_2, \dots, v_n\}$ and edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$.

We do not learn graph node embeddings from scratch. Instead, we initialize them using token embeddings from a large language model (see  7, Figure 3). This step is crucial for achieving feature alignment between the evidence graph and the downstream LLM. Generally, integrating LLMs with information from a different modality necessitates aligning features between them. For example, vision-language models like BLIP-2 (Li et al., 2023) and LLaVA (Liu et al., 2023) perform feature alignment by heavily pretraining a network whose goal is to act as a bridge between a frozen image encoder and a frozen LLM. This approach requires large amounts of pretraining data (as well as computational resources) which are not readily available for our task. We found that simply initializing graph node embeddings with token embeddings from a base LLM is effective and crucial for achieving good performance.

Let $\{x_i \mid i \in [1, n]\}$ denote the set of initial node embeddings. We learn graph structure representations with the Graph Attention Network (GAT; Velickovic et al. 2018; Brody et al. 2022), a neural network architecture designed for handling graph-structured data. It is computationally efficient, it requires less memory and storage compared to other deep learning models, and is applicable to inductive problems. GAT uses the attention mechanism to weigh the importance of neighboring nodes when aggregating information in a graph. Attention between two nodes is calculated as:

$$\alpha_{ij} = \frac{\exp\bigl(\psi(x_i, x_j)\bigr)}{\sum_{k \in \mathcal{N}_i} \exp\bigl(\psi(x_i, x_k)\bigr)} \tag{3}$$

where $\mathcal{N}_i = \{v_j \in \mathcal{V} \mid (j, i) \in \mathcal{E}\}$ are the neighbors of node $v_i$, and $\alpha_{ij}$ is the attention score between node embeddings $x_i$ and $x_j$. Following Brody et al. (2022), we compute the scoring function $\psi$ as:

$$\psi(x_i, x_j) = a^{T} \operatorname{LeakyReLU}\bigl(W \cdot [x_i \oplus x_j]\bigr) \tag{4}$$

where $\cdot^{T}$ represents transposition and $\oplus$ is the concatenation operation. Attention coefficients corresponding to each node $i$ are then used to compute a linear combination of the features corresponding to neighboring nodes as:

$$x_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W x_j\right) \tag{5}$$
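To make Equations (3) to (5) concrete, the following plain-Python sketch computes the attention weights and the updated embedding for a single node. It is illustrative only: it uses two separate weight matrices (`w_att` for the scoring function and `w_out` for aggregation) because the two uses of $W$ operate on inputs of different sizes, it takes `tanh` as a default choice for $\sigma$, and all names are hypothetical. An actual implementation would rely on a library layer such as those in PyTorch Geometric.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def leaky_relu(v: Vector, slope: float = 0.2) -> Vector:
    return [x if x > 0 else slope * x for x in v]


def attention_weights(i: int, x: List[Vector], neighbors: Dict[int, List[int]],
                      a: Vector, w_att: Matrix) -> Dict[int, float]:
    """Eq. (3)-(4): softmax of a^T LeakyReLU(W [x_i ; x_j]) over neighbors of i."""
    scores = {
        j: sum(ak * hk for ak, hk in zip(a, leaky_relu(matvec(w_att, x[i] + x[j]))))
        for j in neighbors.get(i, [])
    }
    if not scores:
        return {}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}


def gat_update(i: int, x: List[Vector], neighbors: Dict[int, List[int]],
               a: Vector, w_att: Matrix, w_out: Matrix,
               activation: Callable[[float], float] = math.tanh) -> Vector:
    """Eq. (5): activation of the attention-weighted sum of W x_j."""
    alpha = attention_weights(i, x, neighbors, a, w_att)
    out = [0.0] * len(w_out)
    for j, a_ij in alpha.items():
        for k, value in enumerate(matvec(w_out, x[j])):
            out[k] += a_ij * value
    return [activation(z) for z in out]


# Toy example: node 0 receives edges from nodes 1 and 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1, 2]}
a = [0.0, 0.0]  # a zero attention vector yields uniform weights
w_att = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
w_out = [[1.0, 0.0], [0.0, 1.0]]
alpha = attention_weights(0, x, neighbors, a, w_att)
updated = gat_update(0, x, neighbors, a, w_att, w_out, activation=lambda z: z)
```

In the toy example the attention vector is zero, so both neighbors receive weight 0.5 and the update is simply their average. A node without incoming edges receives an all-zero aggregate in this sketch; library implementations commonly avoid this case by adding self-loops.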

4.5 Integration with LLMs

The LLM takes as input a composite embedding consisting of the graph embeddings discussed above, and embeddings corresponding to a prompt prefix $\mathsf{P}_{\text{prefix}}$, and a prompt suffix $\mathsf{P}_{\text{suffix}}$ (see 5 in Figure 3). $\mathsf{P}_{\text{prefix}}$ is an initial instruction prompt and $\mathsf{P}_{\text{suffix}}$ represents the conversational query at turn $t$ to be answered. See Appendix A (Figure 6) for an example prompt. More formally, LLM input embeddings are obtained as:

$$\mathsf{H} = \mathsf{H}_{\text{prefix}} \oplus \mathsf{H}_{g} \oplus \mathsf{H}_{\text{suffix}} \tag{6}$$

where $\mathsf{H}_{g}$ is the list of embeddings of all graph nodes and $\mathsf{H}_{\text{prefix}}$ is the text embedding of $\mathsf{P}_{\text{prefix}}$:

$$\mathsf{H}_{\text{prefix}} = \operatorname{Embed}(\operatorname{Tok}(\mathsf{P}_{\text{prefix}})) \tag{7}$$

where $\operatorname{Tok}$ and $\operatorname{Embed}$ are the base LLM tokenizer and embedding layer, respectively. $\mathsf{P}_{\text{suffix}}$ is encoded in a similar manner using Equation (7) to obtain $\mathsf{H}_{\text{suffix}}$. We use the embeddings obtained with Equation (6) as the initial token embeddings for the pretrained LLM.
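The composition in Equation (6) amounts to concatenating three embedding sequences along the sequence axis. The sketch below shows this at the level of plain lists; the toy embedding table, the pre-tokenized ids, and the function names are assumptions for illustration, and a real implementation would operate on tensors.

```python
from typing import Dict, List

Vector = List[float]


def embed(token_ids: List[int], table: Dict[int, Vector]) -> List[Vector]:
    """Eq. (7), after tokenization: look up one embedding per token id."""
    return [table[token_id] for token_id in token_ids]


def compose_inputs(prefix_ids: List[int], graph_embeddings: List[Vector],
                   suffix_ids: List[int], table: Dict[int, Vector]) -> List[Vector]:
    """Eq. (6): H = H_prefix (+) H_g (+) H_suffix along the sequence axis."""
    h_prefix = embed(prefix_ids, table)
    h_suffix = embed(suffix_ids, table)
    return h_prefix + graph_embeddings + h_suffix


# Toy 2-dimensional embedding table and a single graph node embedding.
table = {0: [0.0, 0.0], 1: [1.0, 1.0]}
h = compose_inputs([0, 1], [[0.5, 0.5]], [1], table)
```

The resulting sequence length is the sum of the prefix, graph, and suffix lengths, and graph node embeddings must share the dimensionality of the token embeddings. In Hugging Face Transformers, one way to feed such a matrix to a causal language model is the `inputs_embeds` argument of the forward pass; whether the authors' adapted Mistral implementation does exactly this is not stated in the text.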

4.6 Training

Our model is trained end-to-end by optimizing cross-entropy loss. For all variants (with and without graph structure), the loss is calculated on completion tokens only, i.e., prompt tokens do not observe any loss. This is similar to setting the prompt loss weight to 0 (Wang et al., 2023).

Given training instance $\langle I[:t-1], q_t, r_t; \Theta \rangle$, and sequence of gold output tokens $\langle a^1_t, a^2_t, \dots, a^{|a_t|}_t \rangle$, we minimize token-level cross-entropy as:

$$\mathcal{L}\left(\hat{a}^i_t\right) = -\log p\left(a^i_t \mid I[:t-1], q_t, r_t; \Theta\right) \tag{8}$$

where $\hat{a}^i_t$ denotes the predicted output token at decoder step $i$. We use a mixed approach for training the whole network. Our graph network is trained from scratch, whereas the base LLM is updated using LoRA (Hu et al., 2022) in a parameter efficient manner. We perform inference based on the conversation context (i.e., $I[:t-1]$) and current query $q_t$.
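The completion-only objective can be illustrated with a small sketch that takes per-token log-probabilities of the gold tokens and a mask marking completion positions. Averaging over completion tokens is an assumption of this sketch (Equation (8) defines the per-token loss only), and the function name is hypothetical.

```python
import math
from typing import List


def completion_only_loss(token_log_probs: List[float], is_completion: List[bool]) -> float:
    """Mean negative log-likelihood over completion tokens; prompt tokens are ignored."""
    if len(token_log_probs) != len(is_completion):
        raise ValueError("log-probabilities and mask must have the same length")
    losses = [-lp for lp, keep in zip(token_log_probs, is_completion) if keep]
    if not losses:
        return 0.0
    return sum(losses) / len(losses)


# One prompt token followed by two completion tokens (toy probabilities).
log_probs = [math.log(0.1), math.log(0.5), math.log(0.25)]
mask = [False, True, True]
loss = completion_only_loss(log_probs, mask)
```

Here the first position belongs to the prompt and contributes nothing, which corresponds to a prompt loss weight of 0.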

5 Experimental Setup

We use Mistral-7B-Instruct-v0.2 Jiang et al. (2023) as our base model, given its good performance across complex reasoning tasks, and wider context window of 32K tokens. Recall that we retrieve and encode a large number of instances as evidence for a question. Our implementation predominantly relies on PyTorch (Paszke et al., 2019). We adapt the Mistral implementation available in the HuggingFace Transformers library (Wolf et al., 2020). For developing the graph neural network, we utilize PyTorch Geometric (PyG; Fey and Lenssen 2019). We use Hugging Face’s TRL (Transformer Reinforcement Learning) library (von Werra et al., 2020) for fine-tuning the model without the graph. Additional training parameters and prompts can be found in Appendices B and A, respectively.

5.1 Dataset

Entities covered        5,418
Long-tail entities      2,511
Conversations           2,800
Number of turns         5
Split ratio             60:20:20
ConvMix-10T test set
Conversations           200
Number of turns         10
Domains: Books, Movies, Music, TV series, Soccer
Answer source: Text, Tables, Infobox, Wikidata
Table 1: ConvMix dataset statistics. Long-tail entities are those attested in fewer than 50 KG facts.

We evaluate our work on ConvMix (Christmann et al., 2022b), a conversational question answering dataset that requires reasoning over heterogeneous sources, specifically Wikipedia text, infoboxes, tables, and the Wikidata KG. Aside from reasoning, the conversational nature of ConvMix requires handling discourse phenomena, such as coreference, ellipsis, and topic-shift (Sun and Chai, 2007; Jain and Lapata, 2021). Table 1 summarizes various dataset statistics. As can be seen (first block), the main dataset (ConvMix-5T) contains 2,800 conversations, each with five turns (i.e., question-answer pairs), split into training, development, and test set. In addition, ConvMix-10T is a separate test set used to measure generalization on longer interactions. It contains 200 conversations, each 10 turns long (see last block in Table 1). We follow the splits provided in Christmann et al. (2022b) and report results on both test sets combined.

5.2 Evaluation Metrics

Our model generates answers which may be valid but not identical to the gold standard (e.g., United States, United States of America, and USA are all paraphrases of the same concept). When there is no exact match, we follow previous work Christmann et al. (2022b) and try to normalize the answer to its canonical form. We use the Levenshtein distance (Levenshtein, 1965) to measure the similarity of the generated answer with entities in our retrieved evidence set. The entity with the smallest distance is used as the answer in such cases. We report H@1 (i.e., precision at 1) and H@5 (i.e., whether an answer match is found within the top 5 matching entities).
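The normalization and scoring procedure can be sketched as follows. The Levenshtein routine is the standard dynamic-programming algorithm; lowercasing before comparison and the function names `rank_entities` and `hit_at_k` are assumptions of this sketch, not details taken from the evaluation code.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + (char_a != char_b)))  # substitution
        prev = curr
    return prev[-1]


def rank_entities(generated: str, entities: List[str]) -> List[str]:
    """Order candidate entities by increasing distance to the generated answer."""
    return sorted(entities, key=lambda e: levenshtein(generated.lower(), e.lower()))


def hit_at_k(generated: str, gold: str, entities: List[str], k: int) -> bool:
    """Exact match counts as a hit; otherwise check the k closest entities."""
    if generated == gold:
        return True
    return gold in rank_entities(generated, entities)[:k]


# A misspelled answer is normalized to the closest entity in the evidence set.
candidates = ["United States", "United Kingdom"]
hit = hit_at_k("Unted States", "United States", candidates, k=1)
```

H@1 and H@5 are then the fraction of questions for which `hit_at_k` returns true with `k=1` and `k=5`, respectively.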

6 Results

Our experiments were designed to assess whether graph structure enhances LLM performance for our conversational question-answering task. Our results are summarized in Table 2.

We evaluate our approach against Mistral-7B variants without graph structure. Specifically, we compare against (a) Mistral-7B zero-shot prompted with top-$k$ retrieved instances and the conversational history, i.e., the current query concatenated with previous QA pairs (see Appendix A for the prompt); and (b) Mistral-7B fine-tuned on the ConvMix training set using LoRA (Mistral-7B + FT) and top-$k$ retrieved instances. We present three variants of our model, fine-tuned with graph embeddings (Mistral-7B + Graph) and additionally with a memory management component (+Memory, +Rand Memory).

We also compare with several state-of-the-art systems built on top of T5 Raffel et al. (2020). T5-FiD (Christmann et al., 2022b) is a fusion-in-decoder model which acts as a “generative reader” and is trained on (top-$k$) retrieved instances and gold answers. Specifically, query-evidence pairs are encoded independently, and passed on to the decoder to generate an answer. We also report results with a T5-based model (T5-FiD + Question rewriting) which rewrites the question based on the conversational history context Raposo et al. (2022); Elgohary et al. (2019) and a related approach (T5-FiD + Question resolution) which performs query resolution, i.e., by appending relevant terms from previous question-answer pairs to the current question (Voskarides et al., 2020). All FiD models are based on T5-base (Christmann et al., 2022b).

Finally, although not directly comparable, we report the performance of EXPLAIGNN (Christmann et al., 2023) and Convinse T5-FiD Christmann et al. (2022b). EXPLAIGNN is a classification model that identifies entity nodes in a graph as answer predictions. It learns a task-specific structured representation optimized for better retrieval and query understanding. The learned representation is then used to train a classifier based on graph neural networks, tying the two together. Convinse T5-FiD is similar in that it also learns a task-specific structured representation for retrieval and query understanding, without, however, creating a graph.

All models in Table 2 use the same retrieval engine (i.e., CLOCQ; Christmann et al. 2022a) which allows us to focus on architectural differences and compare models on equal footing.

Models                              H@1     H@5
Mistral-7B zero-shot                0.292   0.346
Mistral-7B + FT                     0.350   0.400
Mistral-7B + Graph                  0.425   0.459
Mistral-7B + Graph + Memory         0.445   0.512
Mistral-7B + Graph + Rand Memory    0.425   0.461
T5-FiD                              0.300   0.350
T5-FiD + Question resolution        0.282   0.297
T5-FiD + Question rewriting         0.271   0.285
Convinse T5-FiD                     0.342   0.386
EXPLAIGNN                           0.406   0.561
Table 2: Model performance on the ConvMix dataset (results are averaged over the ConvMix-5T and ConvMix-10T test sets). H@1 represents precision at 1 and H@5 represents a match within the top 5. A fine-tuned Mistral-7B with graph embeddings and a memory module achieves the best H@1, while EXPLAIGNN achieves the best H@5.

Integrating LLMs with graph-based reasoning boosts conversational QA performance.

As shown in Table 2, Mistral-7B + Graph is superior to a plain fine-tuned version of Mistral-7B (+ FT) by a large margin (H@1 of 0.425 vs. 0.350). This suggests that organizing and representing retrieved evidence as a graph improves reasoning compared to processing pieces of evidence independently. Perhaps unsurprisingly, fine-tuning generally improves Mistral’s performance on the conversational QA task over a zero-shot model. This is due to an improved understanding of task requirements, such as regular shifts in focus and the expected answer format. For example, the model learns to avoid verbosity in answers and respond using dataset-specific conventions such as spelling out the month in dates (e.g., 2 October 2002 instead of 2/10/2002). The performance of the T5-FiD systems is comparable to zero-shot Mistral-7B. In general, we observe that performance improvements are not simply due to increased model size. Rather, it is important to model the conversational nature of the task and interpret the retrieved information more globally.

Adding a memory module improves QA precision.

Table 2 shows that results further improve when a memory module is added to our model (+Graph +Memory). Recall that previously retrieved instances are kept in memory and re-ranked against the current query. To further assess the usefulness of re-ranking, we conducted a controlled experiment where evidence was selected randomly from the memory. We observe that random selection (+Rand Memory) amounts to not having a memory component at all (H@1 of 0.425 in both cases, compared to 0.445 with re-ranking).

[Plot (a): domains Books, Movies, Music, TV Series, Soccer; models Mistral-7B+Graph, Mistral-7B+FT, Mistral-7B (zero-shot).]
(a) Model performance across domains (ConvMix test set).
(b) Model performance across modalities (ConvMix test set).
[Plot (c): x-axis Position (turns 1 to 10), y-axis H@1; models Mistral-7B+Graph+Memory, Mistral-7B+Graph, Mistral-7B+FT, Mistral-7B.]
(c) Model performance at different turn positions (ConvMix test set).
Figure 4: Analysis experiments for different model variants based on Mistral-7B prompted in a zero-shot setting, fine-tuned on ConvMix without graph embeddings (+FT), with graph embeddings (+Graph), and with a memory module (+Graph +Memory). Performance degrades with numbers, tables, and later conversation turns.

It is challenging to provide accurate answers to questions that require numerical responses.

Figure 4(a) shows model performance broken down by question domain. Overall, we observe similar trends across domains, with TV Series and Soccer being most challenging. Performance for these domains decreases by approximately 10 percentage points, e.g., in comparison to Books. To uncover the reason for this gap, we further investigate whether there is an effect of answer type. We automatically annotate the ConvMix development set with the following answer categories: strings, dates, and numbers (we use regular expressions and python-dateutil to categorize the answers). The results in Table 3 (top) show average H@1 stratified by different answer types.
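To make the annotation step concrete, the sketch below assigns one of the three categories to an answer string. It is only an approximation of the procedure described above: to stay dependency-free it replaces python-dateutil with `datetime.strptime` over a handful of assumed date layouts, and the regular expression for numbers is likewise an assumption.

```python
import re
from datetime import datetime

# Assumed date layouts (hypothetical); python-dateutil accepts many more.
DATE_FORMATS = ("%d %B %Y", "%B %d, %Y", "%B %Y", "%Y-%m-%d")

# Assumed pattern: optional sign, digits, optional , or . separated groups.
NUMBER_RE = re.compile(r"^[+-]?\d+(?:[.,]\d+)*$")


def answer_type(answer: str) -> str:
    """Return 'date', 'number', or 'string' for a gold answer."""
    text = answer.strip()
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            continue
    if NUMBER_RE.match(text):
        return "number"
    return "string"
```

Note that a bare year such as 2009 is labeled as a number here; whether the original annotation treats it as a date is not stated.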

We observe that questions with numeric answers are harder compared to other categories. There are several reasons for this, including variability in numerical reasoning performance due to the base model’s choice of tokenization for numeric data (Singh and Strouse, 2024; Sun et al., 2023), as well as the effect of pre-training data on the output predictions and their probability (McCoy et al., 2023). Table 3(b) (bottom) reveals that the proportion of instances with numeric answers is highest for the TV Series and Soccer domains, thus explaining why performance drops for these domains.

(a)
Answer Type    Date    String    Number
H@1            0.50    0.45      0.14

(b)
Domain      Books    Movies    Music    TV series    Soccer
% Number    3.9      2.1       5.0      10.0         7.9

Table 3: Model performance (Mistral-7B + Graph + Memory) across answer types (top) and proportion of numeric answers per domain (bottom); both on the ConvMix dev set.

It is challenging to extract accurate information from tables.

Figure 4(b) shows how performance varies depending on the source of the answer. Across models, we observe that performance deteriorates when the answers are located in tables. By contrast, performance is generally better when answers are found in the knowledge graph. We believe this performance gap is due to how tabular information is linearized. Unlike the knowledge graph, from which facts can be easily extracted, Wikipedia tables often have complex hierarchical structure Parikh et al. (2020), making it challenging to achieve clean and robust linearization Alonso et al. (2023).
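As an illustration of what linearization involves, the sketch below flattens a table row into a "header value" string. The output format is inferred from the table-derived evidence shown in the example prompt of Figure 5 and is an assumption; the linearizer actually used is not described here, and the function names are hypothetical.

```python
from typing import Dict, List


def linearize_row(row: Dict[str, str]) -> str:
    """Flatten one table row into 'Header value, Header value, ...'.

    Empty cells are skipped. Nested or spanning headers have no natural
    place in this flat format, which is one source of noisy evidence.
    """
    return ", ".join(f"{header} {value}" for header, value in row.items() if value)


def linearize_table(rows: List[Dict[str, str]]) -> List[str]:
    """Linearize every row independently, one evidence string per row."""
    return [linearize_row(row) for row in rows]
```

For a row with columns Publication, Country, Accolade, Year, and Rank, this yields a string of the same shape as the fourth evidence item in Figure 5.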

It is more difficult to answer questions occurring later in the conversation.

In Figure 4(c) we examine how performance varies with conversation length. Ideally, a model should be able to answer questions irrespective of where these occur (e.g., beginning or end). As mentioned in Section 5.1, ConvMix contains conversations with a maximum length of 10 turns. The results in Figure 4(c) show a general decrease in performance as the dialogue progresses. Initial questions tend to be more complex while follow-on questions often extend or elaborate upon the initial topic (Chai and Jin, 2004; Jain and Lapata, 2021). Our results show that graph-enhanced models generally outperform LLM variants which do not organize the retrieved information in any way. Furthermore, we observe that having a memory (of previously retrieved instances) is particularly helpful in longer interactions. Keeping track of past evidence helps ameliorate retrieval errors which might erroneously steer the model towards new topics. Aside from contextual factors, the quality of retrieval largely influences model precision, as approximately half of the answers cannot be found even at the beginning of the dialogue (see turn 1 in Figure 4(c)).

7 Conclusion

In this paper we propose a method to aggregate evidence from multiple sources into a dynamic graph representation for conversational question answering. We demonstrate how this graph can be efficiently integrated with large language models (LLMs) for end-to-end training, enhancing the model’s ability to handle evolving conversational contexts. Our approach maintains a memory module to track and update past evidence, thus influencing the graph’s structure and representation, as the conversation evolves. Experiments on the ConvMix dataset show that the graph enhances the LLM’s ability to reason over multiple modalities, while the memory module provides robustness against noise and retrieval errors. In the future, we would like to improve information retrieval for our task by using pretrained embeddings for better entity linking. We could also adopt a structured memory module for more complex reasoning.

8 Limitations

Our experiments are limited to one dataset (i.e., ConvMix) and one language, namely English. It would be interesting to see if our findings generalize to other datasets which are conversational in nature but do not target our specific question answering task. For example, SPICE (Perez-Beltrachini et al., 2023) is a recently released conversational semantic parsing dataset where utterances are translated into executable semantic parses (in this case SPARQL queries). It would also be interesting to examine how our model handles languages other than English; however, we are not aware of any multi-lingual or cross-lingual datasets for conversational question answering.

In this work, we do not study the effect of various prompting techniques on our task. In experiments, we found Mistral-7B’s performance superior to Llama2-7B Touvron et al. (2023); however, we did not perform an in-depth study on prompts and models. Measuring the effect of these factors on our task and model performance is non-trivial and a topic for future work.


Appendix A Prompt Description

Figure 5 shows an example prompt for the Mistral-7B model without graph embeddings (see Mistral-7B zero-shot in Table 2). The prompt includes a sequence of retrieved and ranked pieces of evidence, each encapsulated within <evidence>--</evidence> tags. We represent the past interaction $\text{I}[:t-1]$ as a series of question and answer pairs. The same prompt is used for fine-tuning (see Mistral-7B + FT in Table 2) with the subsequent response as the gold output tokens (see Section 4.6 for details).

Figure 6 shows an example prompt for the graph-based model (all model variants with +Graph in Table 2). The prompt consists of three parts: the initial instructions, which we refer to as $\mathsf{P}_{\text{prefix}}$; a sequence of graph node embeddings, represented as graph_node_embedding; and the conversational query, which we denote as $\mathsf{P}_{\text{suffix}}$.
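The three-part layout can be sketched as a concatenation at the embedding level: prefix and suffix tokens go through the token embedding table, while graph node embeddings are spliced in directly between them. The toy code below uses plain Python lists instead of tensors, and the names `embed_tokens` and `build_input_embeddings` are hypothetical; it assumes the graph node embeddings have already been projected to the LLM's embedding size.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def embed_tokens(tokens: Sequence[str], table: Dict[str, Vector]) -> List[Vector]:
    """Look up each token in a (toy) token-embedding table."""
    return [table[token] for token in tokens]


def build_input_embeddings(
    prefix_tokens: Sequence[str],
    graph_node_embeddings: Sequence[Vector],
    suffix_tokens: Sequence[str],
    table: Dict[str, Vector],
) -> List[Vector]:
    """Return [embedded prefix ; graph node embeddings ; embedded suffix].

    Graph vectors bypass the embedding table, so they must already have
    the same dimensionality as the token embeddings.
    """
    dim = len(next(iter(table.values())))
    if any(len(vec) != dim for vec in graph_node_embeddings):
        raise ValueError("graph node embeddings must match the token embedding size")
    prefix = embed_tokens(prefix_tokens, table)
    suffix = embed_tokens(suffix_tokens, table)
    return prefix + [list(vec) for vec in graph_node_embeddings] + suffix
```

The resulting sequence length is the number of prefix tokens plus the number of graph nodes plus the number of suffix tokens, so it changes with the amount of retrieved evidence.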

Appendix B Training Details

Table 4 lists the hyper-parameters employed to train our model. Implementation details are discussed in Section 5. During the fine-tuning of the base language model, only the query, key, and value projection parameters are updated.

Parameter                Value
Graph layers             2
Graph heads              2
LoRA rank                128
LoRA $\alpha$            32
LoRA dropout             0.05
GAT dropout              0.5
Optimizer                Adam Kingma and Ba (2015)
Learning rate            5e-5
Batch size               1
Gradient accumulation    4
Table 4: Hyperparameter values used for our model.
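Two derived quantities follow from Table 4 and are worked out in the sketch below. It assumes training on a single device (so the effective batch size is the batch size times the accumulation steps) and the standard LoRA scaling factor of α divided by the rank; whether the implementation uses a different scaling variant is not stated. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values copied from Table 4; field names are illustrative only."""
    graph_layers: int = 2
    graph_heads: int = 2
    lora_rank: int = 128
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    gat_dropout: float = 0.5
    learning_rate: float = 5e-5
    batch_size: int = 1
    gradient_accumulation: int = 4

    @property
    def effective_batch_size(self) -> int:
        # Assumes a single device; multiply by the device count otherwise.
        return self.batch_size * self.gradient_accumulation

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scaling (alpha / rank); an assumption of this sketch.
        return self.lora_alpha / self.lora_rank


config = TrainConfig()
print(config.effective_batch_size, config.lora_scaling)
```

Under these assumptions, each optimizer step accumulates gradients over 4 examples, and the low-rank update is scaled by 32 / 128 = 0.25.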
Prompt: Mistral-7B zero-shot and fine-tuned without graph embeddings
[INST] You are a helpful assistant. Using the following facts:
<evidence>Kid A, publication, 2 October 2000</evidence>
<evidence>Rolling Stone, Editor, Noah Shachtman</evidence>
<evidence>Rolling Stone, Catgories, Popular culture</evidence>
<evidence>Publication Fact, Country UK, Accolade The 100 Best Albums of the 2000s, Year 2010, Rank 7</evidence>
<evidence>Publication Rolling Stone, Country US, Accolade The 100 Best Albums of the decade, Year 2009, Rank 1</evidence>
<evidence>Rolling Stone was founded in San Francisco in 1967 by Jann Wenner and Ralph J. Gleason.</evidence>
Answer the following conversational query as a simple key fact without description: [/INST]
Question: What is the release date of album Kid A?
Answer: 2 October 2000
Question: Fact Rank?
Answer: 7
Question: Ranking on Rolling Stone in 2009?
Answer:
Figure 5: Example prompt for models which do not employ graph embeddings. Only a few relevant pieces of evidence are shown, for the sake of brevity.
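A prompt of the kind shown in Figure 5 could be assembled as in the sketch below. The wording of the instruction is copied from the figure, but the exact whitespace between segments is an assumption (the figure does not show it unambiguously), and `build_prompt` is a hypothetical helper rather than the authors' code.

```python
from typing import List, Tuple


def build_prompt(evidence: List[str], history: List[Tuple[str, str]], question: str) -> str:
    """Assemble the text-only prompt: facts, past QA pairs, current question."""
    facts = " ".join(f"<evidence>{item}</evidence>" for item in evidence)
    instruction = (
        "[INST] You are a helpful assistant. Using the following facts: "
        f"{facts} Answer the following conversational query as a simple "
        "key fact without description: [/INST]"
    )
    parts = [instruction]
    # Past interaction: one 'Question: ... Answer: ...' pair per earlier turn.
    parts.extend(f"Question: {q} Answer: {a}" for q, a in history)
    # The current question ends with an open 'Answer:' for the model to complete.
    parts.append(f"Question: {question} Answer:")
    return " ".join(parts)
```

For fine-tuning, the gold answer would be appended after the final "Answer:" as the target tokens.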
Prompt: Mistral-7B fine-tuned with graph embeddings
[INST] You are a helpful assistant. Using the following facts:
[graph_node_embedding_1, graph_node_embedding_2, ... , graph_node_embedding_n]
Answer the following conversational query as a simple key fact without description: [/INST]
Question: What is the release date of album Kid A?
Answer: 2 October 2000
Question: Fact Rank?
Answer: 7
Question: Ranking on Rolling Stone in 2009?
Answer:
Figure 6: Example prompt for graph-based models. We use $\mathsf{P}_{\text{prefix}}$ and $\mathsf{P}_{\text{suffix}}$ to denote the instructions before and after the graph_node_embeddings, respectively. The number of graph node embeddings is dynamic and varies with the evidence that has been retrieved.