Struct-X: Enhancing Large Language Models Reasoning with Structured Data

Xiaoyu Tan Haoyu Wang Xihe Qiu Yuan Cheng Yinghui Xu Wei Chu Yuan Qi
Abstract

Structured data, rich in logical and relational information, has the potential to enhance the reasoning abilities of large language models (LLMs). Still, its integration poses a challenge due to the risk of overwhelming LLMs with excessive tokens and irrelevant context information. To address this, we propose Struct-X, a novel framework that operates through five key phases, “read-model-fill-reflect-reason”, enabling LLMs to utilize structured data efficiently. It begins by encoding structured data into a topological space using graph embeddings, followed by filling in missing entity information with knowledge retrieval modules, and filtering out irrelevant tokens via a self-supervised module. The final phase involves constructing a topological network with selected tokens to further reduce the total token length for more effective LLM inference. Additionally, Struct-X includes an Auxiliary Module trained to generate prompts, aiding LLMs in analyzing structured data. Extensive experiments on benchmarks, including the knowledge graph question-answering task and the long document reading comprehension task, show that Struct-X notably improves LLM reasoning, demonstrating the effectiveness of structured data augmentation in improving LLM inference with complex input context. The code has been open-sourced and can be found in Appendix I.

1 Introduction

In recent years, significant advancements have been made in the field of large language models (LLMs), particularly in natural language understanding Fan et al. (2023). This progress has been largely driven by extensive pre-training on vast text corpora Gao et al. (2023), which has enhanced their generation capabilities. These advancements are often viewed as critical steps towards the development of artificial general intelligence (AGI) Pei et al. (2019). During the deployment of LLMs as general-purpose assistants for a variety of real-world applications, it becomes necessary for LLMs to process multimodal inputs. Among these inputs, structured data, like structured knowledge graphs (KGs), is particularly important Ryen et al. (2022). These graphs, with their rich repository of entity relationships and hierarchical knowledge, have the potential to significantly enhance the reasoning capabilities of LLMs, leading to more precise and reliable inferences. However, in real-world applications, the effective utilization of structured knowledge in LLMs presents a significant challenge Pan et al. (2024). A common approach is to flatten the structured information into a lengthy text sequence before inputting it into LLMs Li et al. (2023). However, this method often introduces an excessive amount of task-irrelevant context. Excess information can overwhelm the models, thereby impairing inference efficiency and accuracy Han et al. (2024). Additionally, it hinders the ability of LLMs to accurately comprehend and represent the complex knowledge embedded within structured data Zhou (2023).

To address this issue, various approaches have been explored. Some studies have focused on converting knowledge graph triples into textual statements Zhang (2023), while others have emphasized incorporating knowledge graph embeddings Chen (2023). Additionally, efforts are underway to embed knowledge graph entities and relations directly into the encoder layers of LLMs Jiang (2023). More previous work is summarized in Appendix B. However, these methods primarily concentrate on converting the structural data of knowledge graphs into different formats. They tend to overlook the need to reduce the information density of this structural data, which often includes task-irrelevant information. Moreover, these approaches face challenges in preserving the global topological structure of knowledge graphs, a critical aspect that warrants further attention.

In addition to the issues of redundant information and the lack of a global topological structure in knowledge graphs, another significant challenge is the high sparsity of these graphs Lazaridou et al. (2022), characterized by missing semantic connections between entities. This sparsity presents a challenge for leveraging structural data in LLMs Hadi et al. (2023). LLMs tend to prioritize explicit semantic connections presented in the context while overlooking implicit connections, which are crucial for enhancing inference performance. Although current research, such as Lv et al. (2022) and Chai (2023), has been directed towards automatic knowledge completion and data augmentation to boost overall performance, these approaches tend to overlook the aforementioned challenges of redundancy and topological structure representation in utilizing structural data.

To overcome the bottlenecks discussed above, we introduce Struct-X, a novel framework designed to utilize Structured data to enhance the interaction and compleX reasoning capabilities of LLMs. This framework is centered around a workflow of “read-model-fill-reflect-reason”. It first transforms structured data into a topological space using graph embeddings. It then augments incomplete entity information using knowledge retrieval modules. Subsequently, a self-retrieved generation module called Self-Reg is employed to eliminate irrelevant tokens. The final stage builds a topological network from the chosen tokens, which reduces the overall token length and thereby improves the efficacy of LLM inference. Furthermore, Struct-X includes an Auxiliary Module, which adjusts prompts based on the loss to guide LLM generation. Extensive evaluations on knowledge graph QA and reading comprehension benchmarks demonstrate Struct-X’s superior reasoning abilities. These tests confirm that augmenting LLMs with structured data can significantly improve their inference skills in complex context environments. We refer interested readers to Appendix A for more information about Struct-X’s interaction examples. The code of Struct-X has also been open-sourced and can be found in Appendix I. The main contributions of this paper include:

  1. We propose a novel framework, Struct-X, that implements a process of “read-model-fill-reflect-reason” on structured data, enabling LLMs to perform effective complex reasoning over structured data.

  2. We design a knowledge learning and filtering process to dynamically fill in structured knowledge gaps, coupled with a self-retrieved generation module called Self-Reg to filter and verify the relevance of retrieved knowledge, retaining valuable token information to alleviate learning burdens on LLMs.

  3. We construct specialized graph network encoders to fully learn the potential features of associated tokens and enable efficient cross-layer message passing in Transformers. We also devise an original Auxiliary Module for generating coherent prompts and improving answer responses.

2 Preliminaries

The task of text generation in LLMs involves creating an output sequence $y=[y_1,\dots,y_T]$, where $T$ represents the total number of tokens Tang et al. (2023), based on a given input prompt $x$. This process is often modeled in an autoregressive manner, which estimates the likelihood of each token $y_t$ given $x$ and $y_{<t}$, where $y_{<t}=[y_1,\dots,y_{t-1}]$ represents the tokens that come before the current one Zhang (2022). Enhancements to this process can be made by incorporating relevant information from external documents $D$ into the input, thereby refining the model’s predictions Hofstätter (2023).
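For reference, the autoregressive modeling described above corresponds to the standard chain-rule factorization, written here in the notation of this section. The second line, which conditions on the retrieved documents $D$, is our assumed reading of how external information enters the input; the source does not print either expression.

```latex
p(y \mid x) = \prod_{t=1}^{T} p\left(y_t \mid x,\, y_{<t}\right),
\qquad
p(y \mid x, D) = \prod_{t=1}^{T} p\left(y_t \mid x,\, D,\, y_{<t}\right)
```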

Figure 1: Overall architecture of the proposed Struct-X framework. It consists of modules for topological knowledge encoding, knowledge injection and retrieval, graph topology encoder, and Auxiliary Module.

Moreover, we can develop a novel decoding strategy that produces critique tokens $\mathcal{C}$ alongside the main text output. These tokens are generated at each step and are designed to enable the LLMs to self-evaluate aspects such as relevance, factuality, and completeness of the generated content (the token types are listed in Table 1).

$$p(y,\mathcal{C}\mid x)=\prod_{t=1}^{T}p(y_t,\mathcal{C}_t\mid x,y_{<t},\mathcal{C}_{<t}), \tag{1}$$

where the critique token $\mathcal{C}_t$ depends on all preceding text and critiques. We define four types of critique tokens: IFReT predicts whether retrieval is needed, IFReL assesses passage relevance, IFSuP checks whether the output is supported, and IFUsE decides whether the output is useful.

These critique tokens enable better control of the decoding process through re-ranking or constraints Asai (2023). For instance, the probability of a desirable IFReL token can upweight certain outputs. As an example, the attention distillation critique token is computed between the input $x$ and response $y_t$. This critiques the attention alignment between $x$ and $y_t$. By generating such reflective signals, IFReT can adapt its decoding strategy over time based on its critiques. The IFReT approach allows customization of model behavior through constraints on desired critique tokens. The detailed algorithm implementation process can be found in Appendix E.

The retrieval decision is modeled as $p(\text{IFReT}=y\mid x)=f_{\phi}(x)$, which predicts whether passage retrieval is needed ($y$) given the input $x$ using a scoring function $f_{\phi}$ parameterized by $\phi$. The relevance score between a passage $p$ and the input is $s_{rel}=g_{\theta}(x,p)\cdot\text{IFReL}(x,p)$, where $g_{\theta}$ produces a relevance score modulated by the IFReL gate value. The factual consistency between a response $y$ and passage $p$ is evaluated by $s_{con}=h_{\psi}(y,p)\odot\sigma(\text{IFSuP}(y,p))$, where $\odot$ is the element-wise product and $\sigma$ is the sigmoid activation function. The overall utility is decided using $u=\text{IFUsE}(x,y)$.
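As a minimal sketch of how the three scores above could drive re-ranking, the Python snippet below computes $s_{rel}$, $s_{con}$, and $u$ for two candidate responses and sorts them. The additive combination in `rank_candidates`, the dictionary layout, and all numeric values are illustrative assumptions; the source defines only the individual scores, not how they are merged.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function used to squash the IFSuP output."""
    return 1.0 / (1.0 + math.exp(-z))


def relevance_score(g_score: float, ifrel_gate: float) -> float:
    """s_rel = g_theta(x, p) * IFReL(x, p); the gate scales the raw relevance."""
    return g_score * ifrel_gate


def consistency_score(h_scores, ifsup_logits):
    """s_con = h_psi(y, p) (element-wise product) sigmoid(IFSuP(y, p))."""
    return [h * sigmoid(z) for h, z in zip(h_scores, ifsup_logits)]


def rank_candidates(candidates):
    """Hypothetical re-ranking rule: s_rel + mean(s_con) + utility (an assumption)."""
    def total(c):
        s_rel = relevance_score(c["g"], c["ifrel"])
        s_con = consistency_score(c["h"], c["ifsup"])
        return s_rel + sum(s_con) / len(s_con) + c["utility"]
    return sorted(candidates, key=total, reverse=True)


# Two made-up candidates: A is gated as relevant and supported, B is not.
candidates = [
    {"name": "B", "g": 0.9, "ifrel": 0.0, "h": [0.8, 0.6], "ifsup": [-2.0, -1.0], "utility": 0.4},
    {"name": "A", "g": 0.9, "ifrel": 1.0, "h": [0.8, 0.6], "ifsup": [2.0, 1.0], "utility": 0.8},
]
ranked = rank_candidates(candidates)
```

With a zero IFReL gate the relevance term vanishes entirely, which is why candidate B falls behind A even though both share the same raw relevance $g_{\theta}$.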

3 Methods

3.1 Topological Knowledge Injection

We first implement the “read-model-fill” process. We start by processing input KGs using a graph attention encoder (GAE) that consists of $L$ layers Xu et al. (2021). The initial node features, denoted as $h_v^{(0)}$, are set up using information obtained from the KG completion module Fei et al. (2021). After processing through $L$ layers, we obtain the final node embeddings, $h_v^{(L)}$, which represent both the semantic and structural information of the KGs. These encoded graph embeddings are then partially masked at a specific rate, $p_{mask}$, to assist in learning about missing knowledge. This masking process can be written as $\tilde{h}_v=M(h_v^{(L)})$, where $M(\cdot)$ denotes the masking operation. The masked nodes, denoted as $\tilde{h}_v$, are then fed into the knowledge retrieval module, $R(\tilde{h}_v)$, which is explained in the following section. This module supplements the missing information, thereby facilitating the generation of complete graph embeddings $\bar{h}_v$ Reda et al. (2022).
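The masking operation $M(\cdot)$ can be illustrated with a short Python sketch. It assumes that masking acts on whole node embeddings and replaces them with zero vectors; the source specifies only the rate $p_{mask}$, so the zero-fill choice, the function name `mask_embeddings`, and the seed argument are hypothetical.

```python
import random


def mask_embeddings(embeddings, p_mask, seed=0):
    """Mask each node embedding independently with probability p_mask.

    embeddings: dict mapping node id -> list of floats (h_v^(L)).
    Returns (masked embeddings, set of masked node ids).
    Zero-filling is an assumption; a learned mask vector is another option.
    """
    rng = random.Random(seed)
    masked, masked_ids = {}, set()
    for node, vec in embeddings.items():
        if rng.random() < p_mask:
            masked[node] = [0.0] * len(vec)  # hide this node's information
            masked_ids.add(node)
        else:
            masked[node] = list(vec)  # copy so the input is left untouched
    return masked, masked_ids


h_final = {"v1": [0.2, 0.7], "v2": [0.5, 0.1], "v3": [0.9, 0.4]}
h_tilde, hidden = mask_embeddings(h_final, p_mask=0.3)
```

The returned set of masked ids is what a retrieval module would need in order to know which nodes to complete.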

Figure 2: Knowledge injection and retrieval modules in Struct-X. The knowledge retrieval module fills in missing entity information in the graph embeddings.

To address the gaps in entity information within the structured knowledge graph, we have developed a knowledge learning module, denoted as $F$. This module is designed to retrieve pertinent facts from the knowledge base to enhance the masked node embeddings, $\tilde{h}_v$ Yasunaga et al. (2022). More specifically, for each masked node, we calculate a similarity score between its embedding $\tilde{h}_v$ and all tail entities $t$ that are part of the set $E$. This is achieved using the scoring function $f_{score}$, which can be represented as:

$$s(v,t)=f_{score}(\tilde{h}_v,t). \tag{2}$$

This process enables us to efficiently fill in the missing information in the knowledge graph.

| Type | Inputs | Outputs | Descriptions |
|---|---|---|---|
| IFReT | query, context | {activate, wait} | Decides when extra facts can assist reasoning |
| IFReL | query, evidence | {high, low} | Whether evidence provides useful clues to solve query |
| IFSuP | query, evidence, response | {strong, medium, weak} | Alignment between statements in response and evidence |
| IFUsE | query, response | {5, 4, 3, 2, 1} | Usefulness score of response in answering query |

Table 1: Self-supervised Auxiliary Module related parameters for selective knowledge retrieval and response correlation verification

The scoring function evaluates both feature and topological similarities within the graph (Figure 2). It selects the top $K$ entities $t$ with the highest scores and retrieves the related facts $(h,r,t)$ from the knowledge base. To incorporate these facts into the node embeddings, a relation-aware aggregation function $f_{agg}$ is used. This function accumulates relevant knowledge for each node, using a score threshold $\tau$ to filter out irrelevant facts Yu et al. (2022). The aggregation function manages various relations in structured knowledge by considering information from retrieved triples in a relation-aware manner. Additionally, before being input into the Transformer encoder, one linear layer $o$ concatenates and processes embeddings from all GAT layers.

$$\begin{aligned}
\bar{h}_v &= f_{agg}\left(\{\tilde{h}_v\}\cup\{(h,r,t)\mid s(v,t)>\tau\}\right),\\
e_v &= o\left([h_v^{(1)},\dots,h_v^{(L)},\bar{h}_v]\right)
\end{aligned} \tag{3}$$

The layer $o$ merges its inputs and reduces dimensionality. This process retains rich multi-scale structural and semantic features at various depths. The output $e_v$ is flattened and prepared into sequences to replace token embeddings for the Transformer encoder input, as suggested by Wang et al. (2021). The refined node embeddings $\bar{h}_v$, enriched with retrieved entity information, supply additional knowledge for reasoning in the downstream LLMs.
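The retrieval and aggregation steps of Equations (2) and (3) can be sketched as follows. This is a simplified illustration under stated assumptions: cosine similarity stands in for $f_{score}$, a plain mean stands in for the relation-aware $f_{agg}$ (so relations are carried along but not used), and the entity names, embeddings, and the helper `retrieve_and_aggregate` are hypothetical. The linear layer $o$ is not covered here.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_and_aggregate(h_masked, tail_embeddings, facts, k=2, tau=0.5):
    """Score tail entities, keep the top-K that exceed tau, and average them in.

    h_masked: masked node embedding (list of floats).
    tail_embeddings: dict tail entity -> embedding.
    facts: list of (head, relation, tail) triples from the knowledge base.
    Returns (h_bar, kept_facts).
    """
    scores = {t: cosine(h_masked, emb) for t, emb in tail_embeddings.items()}
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    kept = [t for t in top_k if scores[t] > tau]  # threshold from Eq. (3)
    kept_facts = [f for f in facts if f[2] in kept]
    vectors = [h_masked] + [tail_embeddings[t] for t in kept]
    dim = len(h_masked)
    h_bar = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return h_bar, kept_facts


tails = {"Paris": [0.9, 0.1], "Berlin": [0.0, 1.0], "Lyon": [0.6, 0.8]}
kb = [("France", "capital", "Paris"),
      ("Germany", "capital", "Berlin"),
      ("France", "city", "Lyon")]
h_bar, used = retrieve_and_aggregate([1.0, 0.0], tails, kb, k=2, tau=0.7)
```

In this toy run the top-2 step selects Paris and Lyon, and the threshold then discards Lyon, so only the Paris triple contributes to the completed embedding.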

3.2 Knowledge and Information Retrieval

Here we perform the “reflect” process. The module $R(\tilde{h}_v)$ retrieves relevant knowledge absent in masked graph node inputs $\tilde{h}_v$. Related entities can be dynamically discovered by matching tail entities $t$ to each masked node using a similarity scoring function $f_{score}(\tilde{h}_v,t)$ considering both feature and topological similarity Lewis (2020). Related facts $(h,r,t)$ are recalled to fill gaps. After concatenating retrieved knowledge sequences for all nodes ordered by similarity scores, we employ a pruning algorithm leveraging multi-head self-attention distillation and thresholds to filter out lower-weighted tokens. The remaining dense sequence provides supplemental external knowledge to complete masked graph node inputs.

To filter and verify the relevance of retrieved knowledge, we design a self-retrieved generation module $SelfReg_{\psi}(k)$ parameterized by $\psi$ that takes as input the retrieved knowledge sequences $k$ and outputs a filtered subset $\hat{k}$ containing only the most valuable tokens Shuster (2021). Specifically, $SelfReg_{\psi}$ first encodes the knowledge sequence $k=(x_1,x_2,\ldots,x_N)$ using a Transformer encoder to obtain representations $h_i=f_{enc}(x_i)$. Next, we compute an importance score for each token $s_i=\sigma(f_{score}(h_i))$, where $f_{score}$ is a scoring network and $\sigma$ is a sigmoid activation function. To train the scoring network in a self-supervised manner, we create corrupted knowledge sequences $\tilde{k}$ by randomly masking or shuffling some tokens. A contrastive loss is implemented to assign higher scores $s_i$ to tokens from the original $k$ versus corrupted $\tilde{k}$:

$$\mathcal{L}_{contrast}=\sum_{i}\max(0,\tilde{s}_i-s_i+\Delta), \tag{4}$$

where $\Delta$ is a margin hyperparameter. This drives the model to identify the most valuable knowledge. Finally, we filter the sequence by discarding tokens scoring below a threshold of $\tau$ to retain only the most relevant phrases, significantly reducing the learning burden when provided as supplements to the LLMs.

$$\hat{k}=\{x_i\mid s_i>\tau\}. \tag{5}$$

The filtered relevant knowledge $\hat{k}$ provides targeted assistance to improve reasoning without overwhelming the LLMs with extraneous and irrelevant information.
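A minimal Python sketch of the Self-Reg training signal and filtering step follows. The hinge is written in the direction $\max(0,\tilde{s}_i-s_i+\Delta)$, which is the direction that rewards scoring original tokens above corrupted ones by at least the margin, as the text describes. The scores are supplied by hand; in the actual module they would come from the encoder and scoring network, which are not modeled here, and all numbers are made up.

```python
def contrastive_loss(scores, corrupted_scores, margin=0.2):
    """Margin loss: zero once each original score beats its corrupted
    counterpart by at least `margin`, positive otherwise."""
    return sum(max(0.0, s_c - s + margin)
               for s, s_c in zip(scores, corrupted_scores))


def filter_tokens(tokens, scores, tau=0.5):
    """Eq. (5): keep only tokens whose importance score exceeds tau."""
    return [tok for tok, s in zip(tokens, scores) if s > tau]


# First pair already satisfies the margin; the second pair violates it.
loss = contrastive_loss([0.9, 0.4], [0.3, 0.5], margin=0.2)

tokens = ["Paris", "is", "the", "capital", "of", "France"]
importance = [0.9, 0.2, 0.1, 0.8, 0.3, 0.95]
kept = filter_tokens(tokens, importance, tau=0.5)
```

Only the second pair contributes to the loss here, and the filter keeps the three content-bearing tokens while dropping the function words.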

3.3 Graph Topology Encoder

Here we perform the “reason” phase. To capture semantic and structural interactions between entities within the KGs, we use a specialized graph encoder (Figure 3), denoted as $E_{\theta}(G)$, which is parameterized by $\theta$. This KG is represented as $G=(V,E)$, where $V$ is the set of node entities and $E$ is the set of relation edges Li (2022). For each entity node $v_i$ in $V$, we first derive its initial feature representation $h_{v_i}^{(0)}$, which is a vector in a high-dimensional space.

The graph encoder works through a series of $L$ layers, each layer enhancing the node representations through message passing. This process can be described as:

$$h_{v_i}^{(l+1)}=f_{\theta}\left(\left\{h_{v_j}^{(l)}:v_j\in\mathcal{N}(v_i)\right\}\right),\quad\forall v_i\in V, \tag{6}$$

where $\mathcal{N}(v_i)$ refers to the neighboring nodes of $v_i$, and $f_{\theta}(\cdot)$ is a function that aggregates information from these neighbors to update the node’s embedding.

To focus on the most relevant semantic connections, we use a graph self-attention layer within $f_{\theta}(\cdot)$. This layer calculates attention weights as follows: $a_{ij}=\frac{\exp(\langle q_i,k_j\rangle)}{\sum_{v_k\in\mathcal{N}(v_i)}\exp(\langle q_i,k_k\rangle)}$, where $q_i$ and $k_j$ are derived from the embeddings of the nodes. This method allows the model to selectively emphasize the most informative signals from neighboring nodes Cui et al. (2020).
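The attention weights $a_{ij}$ and one message-passing update can be sketched in a few lines of Python. The sketch assumes that queries and keys are given directly as vectors (the learned projections that derive them from node embeddings are omitted) and that $f_{\theta}$ reduces to an attention-weighted sum of neighbor values; both are simplifications, and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_weights(q_i, neighbor_keys):
    """a_ij = exp(<q_i, k_j>) / sum_k exp(<q_i, k_k>) over the neighbours of v_i."""
    logits = {j: dot(q_i, k) for j, k in neighbor_keys.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {j: math.exp(l - m) for j, l in logits.items()}
    z = sum(exps.values())
    return {j: e / z for j, e in exps.items()}


def aggregate(weights, neighbor_values):
    """Attention-weighted sum of neighbour vectors (one message-passing step)."""
    dim = len(next(iter(neighbor_values.values())))
    return [sum(weights[j] * neighbor_values[j][d] for j in weights)
            for d in range(dim)]


neighbors = {"a": [2.0, 0.0], "b": [0.0, 2.0]}
w = attention_weights([1.0, 0.0], neighbors)
h_next = aggregate(w, neighbors)
```

Because the query aligns with neighbor `a`, that neighbor receives the larger weight and dominates the updated embedding.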

After processing through $L$ layers, we obtain refined node embeddings $z_{v_i}=h_{v_i}^{(L)}$, which encapsulate both semantic and structural information of the graph. To make these embeddings more manageable for downstream tasks, we compress them through a trainable down-projection layer:

$$e_{v_i}=\mathbf{W}_d z_{v_i},\quad\mathbf{W}_d\in\mathbb{R}^{d_e\times d_z},\quad d_e<d_z. \tag{7}$$

This step reduces the dimensionality of the embeddings to $d_e$, which is smaller than the original $d_z$. The resulting condensed embeddings $e_{v_i}$ still retain crucial token-level interactions but are more concise, making them better suited for training models for specific tasks. This approach ensures that while the size of the input sequence is significantly reduced, the essential semantic and structural features of the knowledge graph are preserved for subsequent reasoning.
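As a small worked example of the down-projection, the Python sketch below maps a 4-dimensional embedding to 2 dimensions. It assumes $\mathbf{W}_d$ is stored as $d_e$ rows of $d_z$ entries so that the product $\mathbf{W}_d z_{v_i}$ is well-defined; the matrix values are hand-picked for illustration, whereas the real layer is trained.

```python
def down_project(W_d, z):
    """Compute e = W_d z, with W_d given as d_e rows of length d_z."""
    if any(len(row) != len(z) for row in W_d):
        raise ValueError("each row of W_d must have the same length as z")
    return [sum(w * x for w, x in zip(row, z)) for row in W_d]


# d_z = 4, d_e = 2: each output coordinate averages two input coordinates.
W_d = [[0.5, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.5]]
z = [1.0, 3.0, 2.0, 4.0]
e = down_project(W_d, z)
```

The output has length $d_e$, so every node contributes half as many values to the sequence passed on to the LLM in this toy setting.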

3.4 Auxiliary Module

To further guide the LLMs in effectively reasoning over the structured input with a knowledge graph, we have developed an Auxiliary Module. This module is designed to create dynamic prompts that enhance the coherence of answers generated by LLMs. It functions by analyzing the LLM’s predicted answer, denoted as $\hat{y}$, along with the current loss, $L$. Based on these inputs, it generates a refined prompt, $p'$, which is then used for a new round of inference. We use the pre-trained BERT model (i.e., bert-base-NER) Devlin et al. (2018) to construct this Auxiliary Module, symbolized as $G$ and parameterized by $\theta_g$. This generator crafts the prompt text from the input values: $p'=G(L,\hat{y};\theta_g)$. The generator is trained jointly with the overall system using policy gradient methods to maximize the expected reward $R$ of producing coherent answers: $J(\theta_g)=\mathbb{E}_{p'\sim G}[R(p')]$, $\nabla_{\theta_g}J(\theta_g)=\mathbb{E}_{p'\sim G}[\nabla_{\theta_g}\log G(p'\mid L,\hat{y};\theta_g)\,R(p')]$.

This reward function is designed to encourage the LLMs to generate responses that are not only fluent but also logically consistent, particularly when using the updated prompt. This feature enables the module to dynamically adjust prompts based on the current performance, thereby offering new approaches to improve the quality of answers. Throughout the training process, the module progressively learns to produce more effective prompts, leading to enhanced accuracy and coherence in the LLM’s reasoning. For more detailed experimental testing and analysis, please refer to Appendix G.
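The policy-gradient update used to train the generator can be illustrated on a toy problem. The sketch below is not the BERT-based Auxiliary Module: it assumes a softmax policy over three hypothetical candidate prompts with fixed, made-up rewards, and it subtracts a batch-mean baseline to reduce variance, which is a common addition that the text above does not specify. It only shows how the estimator $\nabla_{\theta_g} \log G(p') \, R(p')$ moves probability mass toward higher-reward prompts.

```python
import numpy as np


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def reinforce(rewards, steps=500, batch=64, lr=0.5, seed=0):
    """REINFORCE on a softmax policy over a fixed set of candidate prompts."""
    rng = np.random.default_rng(seed)
    rewards = np.asarray(rewards, dtype=float)
    theta = np.zeros(len(rewards))             # policy parameters (logits)
    for _ in range(steps):
        probs = softmax(theta)
        actions = rng.choice(len(rewards), size=batch, p=probs)
        r = rewards[actions]
        baseline = r.mean()                    # variance-reduction baseline (assumption)
        grad = np.zeros_like(theta)
        for a, r_a in zip(actions, r):
            grad_log = -probs.copy()           # d/dtheta log softmax(theta)[a]
            grad_log[a] += 1.0
            grad += grad_log * (r_a - baseline)
        theta += lr * grad / batch             # gradient ascent on expected reward
    return softmax(theta)


# Hypothetical rewards for three candidate prompts; index 1 is the best one.
final_probs = reinforce([0.1, 0.9, 0.3])
print(final_probs)
```

In the full system the reward would come from the coherence of the LLM's answer under the refined prompt rather than from a fixed table.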

Figure 3: Interaction between the graph topology encoder and LLM in Struct-X. The encoder refines node embeddings via cross-layer message passing. The condensed embeddings are provided as supplements.

4 Experiment

4.1 Datasets and Tasks

We assess the performance of our proposed Struct-X framework on four open-source benchmark datasets designed for knowledge graph reasoning and multi-hop reasoning abilities on graphs.

Task 1: WebQSP contains 4,737 QA pairs where the questions require logical reasoning over a knowledge graph derived from Wikipedia to infer the correct answer. The knowledge graph consists of 5,719 entities and 2,150 relations.

Task 2: MetaQA comprises a set of more complex compositional questions constructed from an underlying knowledge graph with a vocabulary of 300 entities and 100 relations. It has a total of 1,200 unique questions that test the multi-hop, logical, and comparative reasoning abilities of models.

Task 3: Family Tree Age. Consider a family tree $G = (V, E)$, where each individual $v_i$ in $V$ is associated with a description $d_i$ specifying their age. The objective of this task is to identify the triplet comprising an individual, one of their grandparents, and a grand-uncle/grand-aunt by marriage that collectively has the highest cumulative age.

Task 4: Travel Route Optimization. Let $G = (V, E)$ be a graph representing connected cities, where each city $v_i$ in $V$ has a description $d_i$ with the travel toll or tax. The LLM must plan the route from a source to a destination city that minimizes the total toll paid.
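To make the Travel Route objective concrete, the following sketch computes a reference answer with Dijkstra's algorithm on node costs. It assumes that a toll is paid for every city on the route, including the source, and that all tolls are non-negative; the city names, tolls, and the function name `min_toll_route` are hypothetical and do not come from the benchmark.

```python
import heapq


def min_toll_route(graph, toll, source, target):
    """Return (route, total_toll) minimizing the sum of city tolls."""
    best = {source: toll[source]}      # cheapest known cost to reach each city
    prev = {}                          # back-pointers for route reconstruction
    heap = [(toll[source], source)]
    while heap:
        cost, city = heapq.heappop(heap)
        if city == target:
            break
        if cost > best.get(city, float("inf")):
            continue                   # stale heap entry
        for nxt in graph.get(city, ()):
            new_cost = cost + toll[nxt]
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = city
                heapq.heappush(heap, (new_cost, nxt))
    if target not in best:
        return None, float("inf")      # target unreachable
    route = [target]
    while route[-1] != source:
        route.append(prev[route[-1]])
    return route[::-1], best[target]


# Hypothetical four-city example (undirected edges listed in both directions).
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
toll = {"A": 1, "B": 5, "C": 2, "D": 1}
route, total = min_toll_route(graph, toll, "A", "D")
print(route, total)
```

A solver of this kind can serve as ground truth when scoring the route proposed by the LLM.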

For all datasets, we incorporate the encoded graph representations into Llama2, which has been pre-trained on BookCorpus and English Wikipedia. Appendix C provides a case analysis of the experimental results on these datasets.

| Model | WebQSP Acc. | WebQSP Prec. | WebQSP Rec. | WebQSP F1 | MetaQA Acc. | MetaQA Prec. | MetaQA Rec. | MetaQA F1 | Family Tree Acc. | Family Tree Prec. | Family Tree Rec. | Family Tree F1 | Travel Route Acc. | Travel Route Prec. | Travel Route Rec. | Travel Route F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Embedding-based Model | | | | | | | | | | | | | | | | |
| TransE | 67.32 | 68.41 | 65.47 | 66.91 | 74.63 | 74.44 | 75.01 | 74.72 | 73.45 | 71.83 | 66.92 | 63.84 | 62.27 | 60.45 | 65.28 | 62.56 |
| DistMult | 69.21 | 69.91 | 69.08 | 69.49 | 69.34 | 72.32 | 61.93 | 62.83 | 60.93 | 61.47 | 60.08 | 60.62 | 61.74 | 60.36 | 64.19 | 62.05 |
| EmbedKGQA | 63.26 | 72.38 | 74.33 | 73.35 | 70.17 | 70.38 | 69.57 | 69.97 | 65.62 | 64.31 | 68.94 | 65.92 | 64.95 | 63.01 | 68.64 | 65.37 |
| ComplEx | 67.36 | 68.21 | 66.27 | 65.64 | 72.86 | 71.64 | 71.18 | 69.53 | 59.83 | 58.45 | 62.74 | 60.27 | 57.45 | 55.92 | 60.93 | 57.84 |
| RotatE | 74.55 | 72.68 | 76.93 | 74.77 | 78.19 | 78.16 | 78.22 | 77.19 | 67.63 | 66.28 | 65.79 | 64.37 | 62.64 | 63.01 | 61.83 | 62.27 |
| Open-source LLM | | | | | | | | | | | | | | | | |
| Llama2-7B | 27.13 | 29.17 | 25.23 | 27.36 | 26.45 | 25.73 | 28.82 | 26.94 | 32.81 | 34.73 | 30.16 | 32.26 | 25.92 | 27.84 | 23.68 | 25.47 |
| Alpaca-7B | 33.56 | 31.83 | 36.29 | 33.72 | 22.74 | 25.93 | 19.82 | 21.45 | 28.36 | 26.92 | 31.74 | 28.91 | 39.45 | 41.27 | 37.63 | 39.18 |
| Llama2-13B | 29.83 | 28.73 | 31.91 | 39.95 | 35.92 | 36.19 | 36.84 | 35.91 | 31.45 | 33.82 | 39.27 | 31.28 | 36.37 | 38.62 | 34.93 | 36.45 |
| Alpaca-13B | 38.37 | 40.92 | 36.74 | 38.45 | 42.83 | 33.62 | 31.74 | 32.56 | 40.83 | 32.76 | 38.84 | 40.37 | 37.92 | 35.74 | 40.83 | 37.94 |
| ChatGPT | 41.27 | 49.84 | 44.92 | 41.72 | 38.74 | 37.56 | 40.83 | 38.94 | 35.92 | 33.84 | 38.62 | 35.83 | 34.83 | 36.73 | 42.15 | 44.71 |
| LLM-based Fine-tuning | | | | | | | | | | | | | | | | |
| KG-LLaMA | 42.45 | 43.28 | 40.39 | 41.37 | 46.28 | 45.94 | 47.36 | 46.47 | 49.74 | 48.36 | 42.64 | 40.27 | 48.56 | 50.28 | 45.74 | 47.83 |
| KG-Alpaca | 48.92 | 47.74 | 51.83 | 49.45 | 43.56 | 42.41 | 46.28 | 43.94 | 45.92 | 46.84 | 44.56 | 45.62 | 42.91 | 50.74 | 47.28 | 43.45 |
| KG-BERT | 56.28 | 55.94 | 57.36 | 56.47 | 70.92 | 69.74 | 63.82 | 61.45 | 63.27 | 61.19 | 68.29 | 64.23 | 74.32 | 75.61 | 74.45 | 72.37 |
| PKGC | 64.56 | 67.44 | 62.64 | 64.10 | 77.79 | 76.92 | 74.27 | 73.49 | 79.32 | 78.41 | 75.36 | 74.91 | 76.35 | 75.42 | 74.37 | 76.91 |
| Vanilla IT | 65.62 | 69.86 | 66.29 | 65.54 | 78.28 | 74.31 | 72.98 | 73.62 | 71.57 | 70.58 | 69.63 | 70.05 | 65.94 | 67.32 | 63.69 | 64.96 |
| KoPA | 72.48 | 72.82 | 71.64 | 71.52 | 75.82 | 74.69 | 73.95 | 71.58 | 73.56 | 72.18 | 72.08 | 71.53 | 73.13 | 75.05 | 71.12 | 71.24 |
| Ours (Struct-X) | 75.13 | 73.40 | 77.25 | 75.29 | 79.63 | 78.27 | 77.53 | 76.61 | 82.68 | 82.95 | 79.34 | 78.92 | 81.69 | 81.53 | 78.62 | 78.04 |

Table 2: Performance comparison across different datasets and tasks (all values in %)
| Methods | WQSP | QA 1hop | QA 2hop | QA 3hop |
|---|---|---|---|---|
| KV-Mem | 48.3 | 84.3 | 74.5 | 46.2 |
| GraftNet | 53.1 | 84.1 | 72.3 | 62.3 |
| EmbedKGQA | 63.3 | 85.3 | 74.1 | 79.8 |
| NSM | 69.6 | 84.6 | $\underline{89.1}$ | 88.7 |
| UniKGQA | $\underline{71.6}$ | 86.8 | 82.8 | 92.3 |
| StructGPT | 64.8 | $\underline{87.4}$ | 87.1 | $\underline{92.9}$ |
| Ours | $\mathbf{75.1}$ | $\mathbf{91.3}$ | $\mathbf{92.7}$ | $\mathbf{93.8}$ |
| w/o SI (struct inference) | 54.2 | 85.6 | 85.3 | 86.5 |
| w/o KGP (KG perform better) | 56.4 | 86.9 | 87.5 | 89.8 |
| w/o model (graph study) | 61.6 | 87.4 | 89.2 | 85.7 |
| w/o encoder (location) | 69.3 | 88.1 | 76.6 | 79.3 |
| w/o self-reg (correction) | 70.4 | 85.3 | 87.4 | 86.8 |

Table 3: Performance evaluation and comparison across different functional modules on reasoning tasks

4.2 Implementation Details

Baseline

  • Embedding-based Model: We compare against representative embedding models for knowledge graph reasoning including TransE Bordes et al. (2013), DistMult Yang et al. (2015), EmbedKGQA Saxena et al. (2020), ComplEx Trouillon et al. (2016), and RotatE Sun et al. (2019).

  • Open-source LLM: We evaluate the reasoning capabilities of widely used pre-trained language models accessible through open APIs, including Llama2 [7B & 13B] Touvron et al. (2023) and Alpaca [7B & 13B] Yao et al. (2023a), which are openly available LLMs with up to 13 billion parameters.

  • LLM-based Fine-tuning: To assess the performance of LM fine-tuning approaches, we include as baselines KG-LLaMA and KG-Alpaca Yao et al. (2023b), KG-BERT Yao et al. (2019), PKGC Lv et al. (2022), and Vanilla IT Zhang (2023), which incorporate techniques to enhance LMs using annotated KG datasets or self-supervision.

Our implementation is in PyTorch and we run experiments on NVIDIA A100 GPUs. More details of training can be found in Appendix D.

4.3 Main Results

The results presented in Table 2 indicate that Struct-X achieves the highest accuracy on all four datasets. On the WebQSP benchmark, Struct-X reaches an accuracy of 75.13%, which is 2.65 percentage points higher than KoPA, the strongest LLM-based fine-tuning baseline. Struct-X also improves precision and recall over Vanilla IT by 3.54 and 10.96 points, respectively (73.40% vs. 69.86% precision and 77.25% vs. 66.29% recall in Table 2). In the more challenging MetaQA dataset, Struct-X’s performance is notably better, surpassing the state-of-the-art accuracy scores by 1.84% and achieving a 1.68% higher precision. Furthermore, Struct-X demonstrates significant advancements in specialized tasks such as Family Tree and Travel Route, where it exceeds the top baseline results by 3.36 and 5.34 points in accuracy, respectively.

Compared to embedding models such as TransE, DistMult, and EmbedKGQA, Struct-X also shows promising improvements in reasoning abilities by integrating both the semantic and topological structures of knowledge graphs. For instance, against RotatE’s accuracy of 74.55% on the WebQSP dataset, Struct-X achieves 75.13%, an increase of 0.58 percentage points. The difference is slightly more pronounced on the MetaQA dataset, where Struct-X exceeds RotatE’s score of 78.19% by 1.44 points in accuracy. In scenarios requiring complex reasoning, Struct-X demonstrates enhanced capabilities, outperforming the best embedding model accuracy by a notable margin of 16.74 points in Task 4.
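The accuracy margins quoted above are absolute differences between Table 2 entries. As a quick consistency check, the sketch below recomputes them from the reported accuracies; the dictionary layout and the helper name `margin` are illustrative choices, while the numbers are copied from Table 2.

```python
# Accuracy values (%) copied from Table 2 for the comparisons discussed above.
TABLE2_ACC = {
    "WebQSP": {"RotatE": 74.55, "KoPA": 72.48, "Struct-X": 75.13},
    "MetaQA": {"RotatE": 78.19, "Struct-X": 79.63},
    "Family Tree": {"PKGC": 79.32, "Struct-X": 82.68},
    "Travel Route": {"EmbedKGQA": 64.95, "PKGC": 76.35, "Struct-X": 81.69},
}


def margin(dataset, baseline, ours="Struct-X"):
    """Absolute accuracy gain of `ours` over `baseline`, in percentage points."""
    scores = TABLE2_ACC[dataset]
    return round(scores[ours] - scores[baseline], 2)


for dataset, baseline in [
    ("WebQSP", "KoPA"),
    ("WebQSP", "RotatE"),
    ("MetaQA", "RotatE"),
    ("Family Tree", "PKGC"),
    ("Travel Route", "PKGC"),
    ("Travel Route", "EmbedKGQA"),
]:
    print(f"{dataset}: Struct-X vs {baseline}: +{margin(dataset, baseline):.2f} points")
```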

The results also show that Struct-X enhances the capabilities of Llama2, which by itself achieves only 27.13% accuracy on the WebQSP benchmark. This enhancement is achieved by masking graph embeddings and using topology matching to retrieve relevant facts, thus filling gaps in the factual knowledge that Llama2 needs. By overcoming these deficiencies in LLMs, Struct-X significantly improves performance, raising WebQSP accuracy by 48.00 percentage points (from 27.13% to 75.13% in Table 2). This indicates the effectiveness of structured augmentation, which is not present in Llama2. Further, Struct-X filters out less important tokens using the Self-Reg module, ensuring focus on the most relevant information. In comparison to previous methods like KG-BERT fine-tuning, Struct-X offers essential enhancements, particularly in complex reasoning tasks, as evidenced by increases of up to 10.24% in accuracy and 5.61% in recall.

Based on the experimental results, the “reflect” process also plays a crucial role in enhancing reasoning capabilities. This process involves IFReT$(x)$, which selectively gathers evidence as needed, and IFReL$(x, p)$, which filters less relevant passages using relevance scores from Eq. 3 to improve context for LLMs. Additionally, IFSuP$(y, p)$ and IFUsE$(x, y)$ ensure passage-response consistency and assess overall utility, contributing to higher quality results. For further case studies and experiments on this topic, readers are directed to Appendix H.

4.4 Ablation Study

4.4.1 Q1: Different functional modules

Table 3 shows that each component of Struct-X plays a crucial role in enhancing various reasoning capabilities. For 1-hop single fact questions, while all versions of Struct-X are effective, the complete model excels with a 91.3% accuracy due to its ability to perform combinatorial reasoning using multi-head attention. This is key for interpreting semantic connections. In 2-hop and 3-hop multi-step reasoning, the absence of knowledge retrieval and injection modules results in a significant performance drop, with decreases of 7.4% and 7.3% respectively. However, the full Struct-X model, utilizing these modules, reaches 92.7% and 93.8% accuracy by effectively traversing distant nodes. The graph topology encoder also proves vital; its omission leads to a 5.8% decline in location-based reasoning, highlighting its importance in connecting nodes and facilitating spatial/hierarchical reasoning through message passing. Furthermore, the lower accuracy without the Auxiliary Module underlines its utility in guiding coherent inference across multiple steps.

| Model Variant | WebQSP Acc. (%) | WebQSP Rec. (%) | WebQSP F1 (%) | MetaQA Acc. (%) | MetaQA Rec. (%) | MetaQA F1 (%) |
|---|---|---|---|---|---|---|
| No Filtering | $71.2 \pm 1.41$ | $68.4 \pm 1.59$ | $69.8 \pm 1.47$ | $75.6 \pm 1.19$ | $72.3 \pm 1.38$ | $73.9 \pm 1.26$ |
| Random (20% cut) | $72.8 \pm 1.28$ | $70.1 \pm 1.44$ | $71.4 \pm 1.33$ | $77.2 \pm 1.07$ | $74.2 \pm 1.24$ | $75.6 \pm 1.13$ |
| Random (40% cut) | $73.6 \pm 1.21$ | $71.3 \pm 1.36$ | $72.6 \pm 1.26$ | $78.1 \pm 1.01$ | $75.4 \pm 1.17$ | $74.7 \pm 1.08$ |
| Random (60% cut) | $74.1 \pm 1.16$ | $72.8 \pm 1.30$ | $73.3 \pm 1.21$ | $79.0 \pm 0.95$ | $76.8 \pm 1.11$ | $75.9 \pm 1.03$ |
| Self-Reg (20% cut) | $74.9 \pm 1.09$ | $73.2 \pm 1.23$ | $74.0 \pm 1.14$ | $79.3 \pm 0.92$ | $78.2 \pm 1.06$ | $76.1 \pm 0.99$ |
| Self-Reg (40% cut) | $\mathbf{75.1} \pm \mathbf{1.07}$ | $\mathbf{73.4} \pm \mathbf{1.21}$ | $\mathbf{75.3} \pm \mathbf{1.12}$ | $\mathbf{79.6} \pm \mathbf{0.90}$ | $\mathbf{78.3} \pm \mathbf{1.04}$ | $\mathbf{76.6} \pm \mathbf{0.96}$ |
| Self-Reg (60% cut) | $73.9 \pm 1.19$ | $72.7 \pm 1.34$ | $73.2 \pm 1.24$ | $78.9 \pm 0.97$ | $77.8 \pm 1.12$ | $76.4 \pm 1.03$ |

Table 4: Analysis of using different token filtering ratios in Self-Reg module

4.4.2 Q2: Filtering and reflection mechanism

Table 4 compares reasoning performance with the following variants: $StructX_{NoFiltering}$: Directly injects all retrieved knowledge without filtering. $StructX_{RandomFiltering}$: Randomly removes a fixed fraction of the retrieved tokens (20%, 40%, or 60% in Table 4). $StructX_{RegFiltering}$: Uses the proposed Self-Reg module to score and filter tokens at the same cut ratios.

Across WebQSP and MetaQA datasets, incorporating filtering mechanisms leads to consistent gains over no filtering baselines. Randomly removing tokens brings minor improvements, showing that some knowledge reduction is beneficial. However, learned filtering with Self-Reg leads to more substantial gains. Comparing different Self-Reg cutting ratios, 40% filtering seems to achieve the optimal trade-off, maximizing accuracy and recall. More aggressive 60% cutting starts to degrade performance likely due to removing pertinent facts. On the other hand, light 20% filtering retains more distracting information. By balancing knowledge breadth and depth, 40% Self-Reg filtering enhances language model inference without overwhelming models. By scoring and removing extraneous tokens based on contextual representations, Self-Reg retains the essence to augment language models without diverting attention.
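The difference between random and score-based cutting can be sketched as follows. The relevance scores here are hypothetical stand-ins for the scores produced by the Self-Reg module, and the function names `score_filter` and `random_filter` are illustrative; only the idea of removing a fixed fraction of tokens comes from the comparison above.

```python
import numpy as np


def n_keep(n_tokens, cut_ratio):
    """Number of tokens that survive when `cut_ratio` of them is removed."""
    return n_tokens - int(round(cut_ratio * n_tokens))


def score_filter(tokens, scores, cut_ratio):
    """Keep the highest-scoring tokens, preserving their original order."""
    k = n_keep(len(tokens), cut_ratio)
    top = np.argsort(scores)[::-1][:k]     # indices of the k largest scores
    return [tokens[i] for i in np.sort(top)]


def random_filter(tokens, cut_ratio, seed=0):
    """Keep a uniformly random subset of the same size, in original order."""
    rng = np.random.default_rng(seed)
    k = n_keep(len(tokens), cut_ratio)
    kept = rng.choice(len(tokens), size=k, replace=False)
    return [tokens[i] for i in np.sort(kept)]


tokens = ["Paris", "is", "the", "capital", "France"]
scores = [0.90, 0.10, 0.05, 0.70, 0.80]    # hypothetical relevance scores
print(score_filter(tokens, scores, 0.4))   # drops the two lowest-scoring tokens
print(random_filter(tokens, 0.4))          # drops two tokens at random
```

Both functions shorten the input by the same amount; they differ only in which tokens survive, which is the variable that Table 4 isolates.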

4.4.3 Q3: Learning by Auxiliary Module

The results in Table 5 demonstrate that incorporating the Auxiliary Module leads to significant performance gains over the base Struct-X model without this component. We observe absolute improvements of 3.90 points in accuracy, 2.58 points in precision, and 5.72 points in recall after adding the Auxiliary Module. This supports its efficacy in providing adaptive prompts that elicit more accurate and logically coherent reasoning from the LLM when inference is made over structured knowledge graphs. The gains over the previous best model in this comparison, PKGC, are also substantial, at 10.57 points higher accuracy. Hence, the Auxiliary Module proves important for multi-hop reasoning and for steering deductions in the right direction over complex topological structures. The consistent benefits suggest that modeling explicit prompt-answering mechanisms customized for structured reasoning tasks is an effective approach.

| Model | Accuracy (%) | Precision (%) | Recall (%) |
|---|---|---|---|
| KG-BERT | $56.28 \pm 2.12$ | $55.94 \pm 2.35$ | $57.36 \pm 1.97$ |
| PKGC | $64.56 \pm 1.84$ | $67.44 \pm 1.62$ | $62.64 \pm 2.14$ |
| Struct-X w/o Auxiliary Module | $71.23 \pm 1.27$ | $70.82 \pm 1.38$ | $71.53 \pm 1.22$ |
| Struct-X | $\mathbf{75.13} \pm \mathbf{0.98}$ | $\mathbf{73.40} \pm \mathbf{1.12}$ | $\mathbf{77.25} \pm \mathbf{0.86}$ |

Table 5: Performance from incorporating the Auxiliary Module for steering prompt

4.4.4 Q4: Knowledge injection variants

To validate the contributions of different components of our knowledge injection mechanism, we conduct an ablation study with the following variants: $StructX_{NoInjection}$: The base LLM (i.e., Llama2) without any graph representation injection. $StructX_{EmbeddingsOnly}$: Encoded graph embeddings are directly injected without any masking or knowledge retrieval. $StructX_{MaskingOnly}$: Graph embeddings are masked but missing facts are not filled via retrieval. $StructX_{RetrievalOnly}$: Masked embeddings are completed with the knowledge retrieval module but without graph encoding. We compare reasoning performance on the WebQSP and MetaQA benchmarks against these reduced injection variants. The results in Table 6 demonstrate clear improvements from collectively incorporating all knowledge injection components compared to the ablated variants.
The full Struct-X model with topological encoding, masking, and retrieval achieves 1.68 and 1.31 points higher accuracy than the best partial variant (Retrieval Only) on WebQSP and MetaQA, respectively. This indicates that each mechanism provides distinct benefits: topological encoding better retains intricate connections, masking identifies missing facts, and retrieval fills knowledge gaps. The experiment suggests that dynamic masking and retrieval, which address the inherent incompleteness of structured data, are the most impactful components. Variants without these processes perform worse because they fail to overcome language models’ factual deficiencies.
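One way to read the ablation is as a set of on/off switches over three mechanisms. The sketch below encodes that reading and recomputes the margin of the full model over the best partial variant from the accuracies reported in Table 6. The `InjectionConfig` class and the assignment of flags to each variant are an interpretation of the variant descriptions above, not the authors' configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectionConfig:
    graph_encoding: bool   # topological graph encoder is used
    masking: bool          # graph embeddings are masked
    retrieval: bool        # masked facts are filled via knowledge retrieval


# Flag assignment is an assumed reading of the variant descriptions.
VARIANTS = {
    "No Injection": InjectionConfig(False, False, False),
    "Embeddings Only": InjectionConfig(True, False, False),
    "Masking Only": InjectionConfig(True, True, False),
    "Retrieval Only": InjectionConfig(False, True, True),
    "Struct-X": InjectionConfig(True, True, True),
}

# Mean accuracies (%) copied from Table 6: (WebQSP, MetaQA).
ACCURACY = {
    "No Injection": (63.45, 71.23),
    "Embeddings Only": (68.92, 74.56),
    "Masking Only": (71.23, 76.92),
    "Retrieval Only": (73.45, 78.32),
    "Struct-X": (75.13, 79.63),
}

partial = {name: acc for name, acc in ACCURACY.items() if name != "Struct-X"}
best_partial = max(partial, key=lambda name: partial[name][0])
gains = tuple(
    round(full - part, 2)
    for full, part in zip(ACCURACY["Struct-X"], ACCURACY[best_partial])
)
print(best_partial, gains)
```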

| Model Variant | WebQSP Accuracy (%) | MetaQA Accuracy (%) |
|---|---|---|
| No Injection | $63.45 \pm 1.72$ | $71.23 \pm 1.43$ |
| Embeddings Only | $68.92 \pm 1.37$ | $74.56 \pm 1.21$ |
| Masking Only | $71.23 \pm 1.19$ | $76.92 \pm 1.08$ |
| Retrieval Only | $73.45 \pm 1.04$ | $78.32 \pm 0.92$ |
| Struct-X | $\mathbf{75.13} \pm \mathbf{0.98}$ | $\mathbf{79.63} \pm \mathbf{0.84}$ |

Table 6: Advantages of dynamic factual injection and self-verified retrieval over individual knowledge supplementation variants

5 Conclusion

In this paper, we introduce Struct-X, a groundbreaking framework designed to enhance LLMs in complex reasoning tasks. Struct-X applies an efficient “read-model-fill-reflect-reason” methodology to structured data. It is adept at learning graph embeddings that are sensitive to geometric contexts, capturing the content of entities as well as their topological relationships. This enables Struct-X to effectively infer missing facts about entities by matching similar topological features. Furthermore, it enhances the LLMs by distributing multi-scale features, which bolsters the representation of underlying connections that are not explicitly apparent. Struct-X excels in tasks such as knowledge graph-based QA tasks and reading comprehension, especially in scenarios that require multi-hop logical reasoning.

6 Limitations

The knowledge graph encoding may not fully capture complex relationships beyond structural topology, and the Auxiliary Module’s prompting could be overly biased by the current loss landscape. Exploring more expressive graph representations and smarter prompting strategies could potentially address these limitations.

References

  • Asai (2023) Akari Asai. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511.
  • Banerjee et al. (2020) Chaity Banerjee, Tathagata Mukherjee, and Eduardo Pasiliao. 2020. Feature representations using the reflected rectified linear unit (rrelu) activation. Big Data Mining and Analytics, 3(2):102–120.
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795.
  • Brody et al. (2021) Samuel Brody, Uri Alon, and Eran Yahav. 2021. How attentive are graph attention networks? arXiv preprint arXiv:2105.14491.
  • Chai (2023) Ziwei Chai. 2023. Graphllm: Boosting graph reasoning ability of large language model. arXiv preprint arXiv:2310.05845.
  • Chen (2023) Zhikai Chen. 2023. Exploring the potential of large language models (llms) in learning on graphs. arXiv preprint arXiv:2307.03393.
  • Cui et al. (2020) G. Cui, J. Zhou, C. Yang, and Z. Liu. 2020. Adaptive graph encoder for attributed graph embedding. In SIGKDD, pages 976–985.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR.
  • Fan et al. (2023) A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, and J. M. Zhang. 2023. Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533.
  • Fei et al. (2021) H. Fei, Y. Ren, Y. Zhang, D. Ji, and X. Liang. 2021. Enriching contextualized language model from knowledge graph for biomedical information extraction. Briefings in bioinformatics, 22(3):bbaa110.
  • Gao et al. (2023) L. Gao, Z. Dai, and et al. 2023. Rarr: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16477–16508.
  • Hadi et al. (2023) M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh, and S. Mirjalili. 2023. A survey on large language models: Applications, challenges, limitations, and practical usage. Authorea Preprints.
  • Han et al. (2024) J. Han, J. Kang Lu, Y. Xu, J. You, and B. Wu. 2024. Intelligent practices of large language models in digital government services. IEEE Access.
  • Harnoune (2021) Ayou Harnoune. 2021. Bert based clinical knowledge extraction for biomedical knowledge graph construction and analysis. Computer Methods and Programs in Biomedicine Update, (1):100042.
  • Hofstätter (2023) Sebastian Hofstätter. 2023. Fid-light: Efficient and effective retrieval-augmented text generation. In ACM SIGIR.
  • Honovich (2022) Or Honovich. 2022. True: Re-evaluating factual consistency evaluation. arXiv preprint arXiv:2204.04991.
  • Jiang (2023) Jinhao Jiang. 2023. Structgpt: A general framework for large language model to reason over structured data. arXiv preprint arXiv:2305.09645.
  • Khan (2022) Salman Khan. 2022. Transformers in vision: A survey. ACM Computing Surveys (CSUR), 54(10s):1–41.
  • Lazaridou et al. (2022) A. Lazaridou, E. Gribovskaya, W. Stokowiec, and N. Grigorev. 2022. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115.
  • Lewis (2020) Patrick Lewis. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474.
  • Li (2021a) Ce Li. 2021a. Memory attention networks for skeleton-based action recognition. IEEE Transactions on Neural Networks and Learning Systems, 33(9):4800–4814.
  • Li et al. (2023) Lei Li, Yongfeng Zhang, and Li Chen. 2023. Prompt distillation for efficient llm-based recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management.
  • Li (2022) Yang Li. 2022. Dual encoder-based dynamic-channel graph convolutional network with edge enhancement for retinal vessel segmentation. IEEE Transactions on Medical Imaging, 41(8):1975–1989.
  • Li (2021b) Yawei Li. 2021b. Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707.
  • Liu (2020) Weijie Liu. 2020. K-bert: Enabling language representation with knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34.
  • Liu (2022) Xiao Liu. 2022. Selfkg: Self-supervised entity alignment in knowledge graphs. In Proceedings of the ACM Web Conference 2022.
  • Lv et al. (2022) Xin Lv, Yankai Lin, Yixin Cao, and et al. 2022. Do pre-trained models benefit knowledge graph completion? a reliable evaluation and a reasonable approach. In ACL (Findings), pages 3570–3581. Association for Computational Linguistics.
  • Pan et al. (2024) S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu. 2024. Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering.
  • Pan (2024) Shirui Pan. 2024. Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering.
  • Pei et al. (2019) Jian Pei, Lei Deng, Shuai Song, and Ming Zhao. 2019. Towards artificial general intelligence with hybrid tianjic chip architecture. Nature, 572(7767):106–111.
  • Reda et al. (2022) Haftu Tasew Reda, Adnan Anwar, and Abdun Mahmood. 2022. Comprehensive survey and taxonomies of false data injection attacks in smart grids: attack models, targets, and impacts. Renewable and Sustainable Energy Reviews, 163:112423.
  • Roux (2022) Claude Roux. 2022. The sydney declaration–revisiting the essence of forensic science through its fundamental principles. Forensic Science International, 332:111182.
  • Ryen et al. (2022) Vincent Ryen, Ahmet Soylu, and Dumitru Roman. 2022. Building semantic knowledge graphs from (semi-) structured data: a review. Future Internet, 14(5):129.
  • Saxena et al. (2020) Apoorv Saxena, Aditay Tripathi, and Partha P. Talukdar. 2020. Improving multi-hop question answering over knowledge graphs using knowledge base embeddings. In ACL 2020, Online, July 5-10, 2020, pages 4498–4507.
  • Shah (2020) Koonal K. Shah. 2020. An exploration of methods for obtaining 0= dead anchors for latent scale eq-5d-y values. The European Journal of Health Economics, 21:1091–1103.
  • Shuster (2021) Kurt Shuster. 2021. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567.
  • Sun et al. (2019) Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. Rotate: Knowledge graph embedding by relational rotation in complex space. In ICLR (Poster).
  • Tang et al. (2023) Ruize Tang, Xianpei Han, Xia Jiang, and Xiaohua Hu. 2023. Does synthetic data generation of llms help clinical text mining? arXiv preprint arXiv:2303.04360.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Pierre Albert, Amr Almahairi, Yannis Babaei, …, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In ICML, pages 2071–2080. JMLR.org.
  • Tu (2021) Ke Tu. 2021. Conditional graph attention networks for distilling and refining knowledge graphs in recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management.
  • Wang et al. (2021) Changgang Wang, Jun An, and Gang Mu. 2021. Power system network topology identification based on knowledge graph and graph neural network. Frontiers in Energy Research, 8:613331.
  • Wu (2020) Zonghan Wu. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24.
  • Xu et al. (2021) H. Xu, W. Xia, Q. Gao, J. Han, and X. Gao. 2021. Graph embedding clustering: Graph attention auto-encoder with cluster-specificity distribution. Neural Networks, 142:221–230.
  • Xue (2020) Xiaoming Xue. 2020. Affine transformation-enhanced multifactorial optimization for heterogeneous problems. IEEE Transactions on Cybernetics, 52(7):6217–6231.
  • Yang et al. (2015) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In ICLR (Poster).
  • Yang (2020) Liang Yang. 2020. Graph attention topic modeling network. In Proceedings of the Web Conference.
  • Yao et al. (2019) Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. KG-BERT: BERT for knowledge graph completion. CoRR, abs/1909.03193.
  • Yao et al. (2023a) Liang Yao, Jiazhen Peng, Chengsheng Mao, and Yuan Luo. 2023a. Exploring large language models for knowledge graph completion. CoRR, abs/2308.13916.
  • Yao et al. (2023b) Liang Yao, Jiazhen Peng, Chengsheng Mao, and Yuan Luo. 2023b. Exploring large language models for knowledge graph completion. CoRR, abs/2308.13916.
  • Yasunaga et al. (2022) M. Yasunaga, A. Bosselut, H. Ren, et al. 2022. Deep bidirectional language-knowledge graph pretraining. In Advances in Neural Information Processing Systems, volume 35, pages 37309–37323.
  • Yu et al. (2022) D. Yu, C. Zhu, Y. Yang, and M. Zeng. 2022. Jaket: Joint pre-training of knowledge graph and language understanding. In AAAI, volume 36, pages 11630–11638.
  • Zhang (2023) Yichi Zhang. 2023. Making large language models perform better in knowledge graph completion. arXiv preprint arXiv:2310.06671.
  • Zhang (2022) Yizhe Zhang. 2022. Retgen: A joint framework for retrieval and grounded text generation modeling. In AAAI, volume 36.
  • Zhang et al. (2020) Zhiqing Zhang, Fuzhen Zhuang, Hengshu Zhu, Zhiping Shi, Hui Xiong, and Qing He. 2020. Relational graph neural network with hierarchical attention for knowledge graph completion. In AAAI, volume 34, pages 9612–9619.
  • Zhou (2023) Bin Zhou. 2023. Semantic-aware event link reasoning over industrial knowledge graph embedding time series data. International Journal of Production Research, 61(12):4117–4134.

Appendix A Struct-X Interaction Examples

Instructions: Please indicate whether referring to external documents improves the quality of the generated response. Please respond with either [Yes] or [No] and provide a brief explanation.

Instruction: Identify the shortest path between two nodes in this knowledge graph. Need retrieval? [Yes] Explanation: Struct-X can ingest the graph structure and topology to reason about paths. But retrieving additional facts on edge distances or weights can supplement its understanding for more accurate optimization.

Instruction: Determine which family tree node has the oldest relative based on date descriptions. Need retrieval? [No] Explanation: Struct-X encodes the hierarchical tree relations and date information directly without needing external evidence. Retrieval may introduce unnecessary details.

Instruction: Analyze the impacts of this new tax policy based on economic concepts. Need retrieval? [Yes] Explanation: While Struct-X has some linguistic capabilities, retrieving domain knowledge on economics and regulations will improve understanding of entities and contextual impacts for better analysis.

Instruction: Summarize the key events in this 5-page history passage. Need retrieval? [No] Explanation: Struct-X is designed to ingest long document passages directly through encoders. No external information is needed.

Instruction: Compare the costs of different flight options based on stop, mileage and fare data. Need retrieval? [No] Explanation: Struct-X can encode and reason over structured data tables natively. External retrieval of similar data is unneeded.

Appendix B Related Work

Prior efforts have explored various techniques to enhance language models with structured knowledge. Pan et al. Pan (2024) directly converted knowledge graph triples into textual statements as inputs. Liu et al. Liu (2020) embedded knowledge graphs into hidden layers of BERT models. Harnoune et al. Harnoune (2021) utilized memory modules to provide facts for multi-hop reasoning. Wu et al. Wu (2020) designed graph recurrent networks to propagate information over knowledge graphs. Roux et al. Roux (2022) employed reinforcement learning for path-finding in knowledge-augmented environments. Tu et al. Tu (2021) used graph convolutional networks to distill subgraph knowledge. Liu et al. Liu (2022) pruned knowledge graph inputs via self-supervised learning objectives. Honovich et al. Honovich (2022) focused on improving factual consistency in conditional text generation. Shah et al. Shah (2020) iteratively expanded anchor sets for more comprehensive grounding. Xue et al. Xue (2020) optimized path selection on heterogeneous knowledge sources. While providing valuable techniques, existing methods lack consideration for efficiently filtering and condensing structured knowledge supplements tailored for language model integration. Our work aims to address these limitations through specialized contextual encoding, retrieval, and distillation components in the proposed Struct-X framework.

Appendix C Case Study

Family Relations: John is Nikhil’s paternal grandfather. Nikhil’s aunt Meera is John’s daughter. John was born in 1950. Nikhil was born in 2000. Meera was born in 1975. Raj is Nikhil’s maternal grandfather. Raj has a brother named Shekhar. Raj was born in 1948. Shekhar was born in 1946.

Question: What is the age difference between the oldest and youngest family members described?

Response: Based on the family tree details provided: - The oldest member is Shekhar, Nikhil’s grand uncle, born in 1946. - The youngest member is Nikhil, born in 2000. - Their age difference is 2000 - 1946 = 54 years. So the age difference between the oldest and youngest family members is 54 years.

Correct Answer: 54 years.

In this family tree example, Struct-X first encodes the hierarchical relations between members like “John is Nikhil’s paternal grandfather” and “Meera is John’s daughter” as structured knowledge. The connections allow tracing ancestry and descendant paths. Age details like “John was born in 1950” are also injected. With encoded topology and dates, Struct-X can traverse the tree to identify the oldest member, Shekhar, and the youngest, Nikhil, before subtracting their birth years to find the 54-year difference. This showcases interpreting familial links and performing numerical reasoning over injected facts. By attending over neighborhood nodes, Struct-X focuses on the most relevant semantic connections. The topology learning distinguishes close and distant relatives to assess ages. Retrieved date details fill knowledge gaps for calculation. Overall, this case illustrates Struct-X’s capabilities in encoding complex structure topology and performing multi-step inference by combining connection reasoning and data-driven deduction.
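The numerical step of this case study can be checked with a few lines of code. The sketch below is only an illustration of the arithmetic, not of how Struct-X performs the inference: it stores the birth years stated in the family description in a dictionary (the variable names are ours) and subtracts the earliest from the latest.

```python
# Birth years exactly as stated in the family description.
birth_years = {
    "John": 1950,
    "Nikhil": 2000,
    "Meera": 1975,
    "Raj": 1948,
    "Shekhar": 1946,
}

# The oldest member has the earliest birth year, the youngest the latest.
oldest = min(birth_years, key=birth_years.get)
youngest = max(birth_years, key=birth_years.get)

# Age difference = later birth year minus earlier birth year.
age_gap = birth_years[youngest] - birth_years[oldest]

print(oldest, youngest, age_gap)  # Shekhar Nikhil 54
```

Writing the subtraction as later year minus earlier year keeps the result positive, which matches the 54-year answer.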

The example illustrates that Struct-X can encode intricate hierarchical structures and use the encoded topology to trace relationships and inject valuable factual knowledge. By learning contextual representations and connections in structured data, Struct-X successfully interprets semantic links between entities and integrates supplementary date details for numerical reasoning over multiple inference steps. This supports complex reasoning across topological dimensions.

Algorithm 1 Topology Learning and Training
1:  Input: knowledge graph $G=(V,E)$, LLM $M_{\theta}$, Auxiliary Module $A_{\phi}$
2:  Encode $G$ into latent embeddings $Z_V$
3:  Mask node embeddings at rate $p_{\text{mask}}$ as $\tilde{Z}_V$
4:  Step 1: Topology Modeling & Filling
5:  for $v_i \in \tilde{Z}_V$ do
6:     Retrieve related facts $\mathcal{F}_i$ via similarity scoring
7:     Update $\tilde{z}_i$ with $\mathcal{F}_i$ using $f_{agg}$
8:  end for
9:  Step 2: Graph Topology Reasoning
10:  for $l = 1, \ldots, L$ do
11:     Message passing layer to update $Z_V$
12:     Attention distillation over $G$
13:  end for
14:  Step 3: LLM Integration & Training
15:  Flatten $Z_V$ and pack into sequences
16:  Create auxiliary prompts $p_{\phi}$ with $A_{\phi}$
17:  Jointly train $M_{\theta}$ on sequences using $p_{\phi}$
18:  Update $A_{\phi}$ using policy gradient
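Step 1 of Algorithm 1 (masking, retrieval by similarity scoring, and aggregation) can be sketched in plain Python. This is a minimal illustration under several assumptions that are ours, not the paper's: similarity is cosine similarity, the query for a masked node is the mean of its visible neighbors' embeddings, and $f_{agg}$ is a simple mean over the top-k retrieved fact vectors. The function names `mask_nodes` and `fill_masked` are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def mask_nodes(embeddings, p_mask, rng):
    """Select each node for masking independently with probability p_mask."""
    return {v for v in embeddings if rng.random() < p_mask}


def fill_masked(embeddings, neighbors, masked, facts, top_k=1):
    """Fill masked nodes from retrieved facts (assumed f_agg: mean of top_k)."""
    filled = dict(embeddings)
    for v in masked:
        context = [embeddings[u] for u in neighbors[v] if u not in masked]
        if not context:
            continue  # no visible neighbors: leave the node unchanged
        query = mean_vec(context)
        ranked = sorted(facts, key=lambda f: cosine(query, f), reverse=True)
        filled[v] = mean_vec(ranked[:top_k])
    return filled


# Toy graph: node "c" is masked and refilled from the most similar fact.
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
facts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
filled = fill_masked(embeddings, neighbors, masked={"c"}, facts=facts)

# Random masking at rate p_mask, as in line 3 of Algorithm 1.
random_mask = mask_nodes(embeddings, p_mask=0.4, rng=random.Random(0))
```

In the toy example the masked node "c" is replaced by the retrieved fact vector, while unmasked nodes keep their embeddings.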
Algorithm 2 Knowledge Filtering Module
1:  Input: $x$ (Input text)
2:  Output: $\{y, n\}$ (Retrieve), $\{r, ir\}$ (Relevant), $\{c, ic\}$ (Coherent)
3:  Retrieve Module:
4:  Decide whether passage retrieval is needed based on $x$
5:   if retrieval is needed then
6:    Set $y$ to Yes
7:    Set $n$ to No
8:   else
9:    Set $y$ to No
10:    Set $n$ to No retrieval needed
11:   end if
12:  Relevant Module:
13:   Filter out irrelevant passages based on $x$ and $p$ (Retrieved passage)
14:   Set $r$ to Relevant passages
15:   Set $ir$ to Irrelevant passages
16:  Coherent Module:
17:   Verify coherence between generated response and input $x$ and $y$
18:   Set $c$ to Coherent response
19:   Set $ic$ to Incoherent response
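The control flow of Algorithm 2 can be sketched as follows. This is a minimal illustration, not the released implementation: the three decision functions are passed in as plain callables, and the keyword-based predicates in the usage example are toy stand-ins for the learned modules. The names `FilterResult` and `knowledge_filter` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FilterResult:
    retrieve: bool                                        # y / n
    relevant: List[str] = field(default_factory=list)     # r
    irrelevant: List[str] = field(default_factory=list)   # ir
    coherent: bool = False                                # c / ic


def knowledge_filter(x, passages, response,
                     need_retrieval, is_relevant, is_coherent):
    """Run the Retrieve, Relevant and Coherent stages in order."""
    result = FilterResult(retrieve=need_retrieval(x))
    if result.retrieve:
        # Relevant Module: split retrieved passages into relevant / irrelevant.
        for p in passages:
            bucket = result.relevant if is_relevant(x, p) else result.irrelevant
            bucket.append(p)
    # Coherent Module: check the generated response against the input.
    result.coherent = is_coherent(x, response)
    return result


# Toy keyword predicates, for illustration only.
query = "Tell me more about Van Gogh's paintings"
passages = [
    "Van Gogh painted The Starry Night.",
    "Surrealism emerged in the 1920s.",
]
response = "Van Gogh is known for The Starry Night and Sunflowers."

result = knowledge_filter(
    query, passages, response,
    need_retrieval=lambda x: "more about" in x.lower(),
    is_relevant=lambda x, p: "van gogh" in p.lower(),
    is_coherent=lambda x, y: "van gogh" in y.lower(),
)
```

Here the Surrealism passage ends up in the irrelevant bucket, mirroring the example discussed for the IFReL module.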

Appendix D Experimental Parameter Settings

The variable descriptions in Table 7 and the hyperparameters in Table 9 provide concrete configuration details for Struct-X on the four benchmark datasets. Several key modeling choices stand out: all models use a 4-layer graph encoder to learn topological representations, apply 30-40% node masking for knowledge gap simulation, and dedicate 256-512 dimensions to the Auxiliary Module for steering prompt/answer generation. Training hyperparameters are also shown, including batch sizes of 16-32, learning rates of 5e-5 to 1e-4, and 10-20 training epochs. The number of tunable parameters (0.0933 B) is the same across datasets, indicating comparable model complexity.

| Variables | Description |
| --- | --- |
| $G=(V,E)$ | Knowledge graph |
| $V$ | Node/entity set |
| $E$ | Edge/relation set |
| $(h,r,t)$ | Head, relation, tail |
| $h_v^{(l)}$ | Node $v$ feature at layer $l$ |
| $N(v)$ | Neighbor nodes |
| $M^{(l)}(\cdot)$ | Aggregates neighbor info |
| $\sigma(\cdot)$ | Activation function |
| $s(v,t)$ | Similarity score |
| $\tilde{h}_v$ | Masked node embedding |
| $f_{agg}(\cdot)$ | Aggregates retrieved facts |
| $f_{score}(\cdot)$ | Scores token relevance |
| $\mathcal{L}_{contrast}$ | Contrastive loss |
| $p(y|x)$ | Text generation distribution |
| $s_{rel}$, $s_{con}$ | Relevance and consistency scores |

Table 7: Variables and description

Appendix E Self-Reg Module

IFReT Module This module decides if passage retrieval is needed using a scoring function:

$$\mathrm{IFReT}(x) = f_{\phi}(x) \tag{8}$$

Where $x$ is the input text, and $f_{\phi}$ outputs a binary decision on whether to activate retrieval given $x$, parameterized by $\phi$. For example, if the input is $x$: "Tell me more about Van Gogh’s paintings", the module may predict $\mathrm{IFReT}(x) = 1$, indicating that retrieval would be useful to supplement details about Van Gogh’s works.

IFReL Module This module scores the relevance of a retrieved passage $p$ using:

$$s_{rel} = g_{\theta}(x,p) \cdot \sigma(\mathrm{IFReL}(x,p)) \tag{9}$$

Where $g_{\theta}$ produces a relevance score between input text $x$ and passage $p$, modulated by the $\mathrm{IFReL}(x,p)$ gate value passed through a sigmoid $\sigma$. For instance, if a retrieved passage discusses Surrealism instead of Van Gogh, the model can set a lower $\mathrm{IFReL}(x,p)$ score to downweight it.
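The gating in Eq. 9 can be illustrated numerically. In the sketch below the raw relevance score and the gate logit are supplied as plain numbers chosen for illustration; in Struct-X they would come from the learned $g_{\theta}$ and IFReL modules. The function name `relevance_score` is ours.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def relevance_score(g_score, gate_logit):
    """s_rel = g_theta(x, p) * sigmoid(IFReL(x, p)), as in Eq. 9."""
    return g_score * sigmoid(gate_logit)


# Same raw relevance, different gate values (illustrative numbers).
on_topic = relevance_score(0.8, 2.0)    # gate mostly open
off_topic = relevance_score(0.8, -2.0)  # gate mostly closed
neutral = relevance_score(0.8, 0.0)     # sigmoid(0) = 0.5 halves the score
```

A lower gate value shrinks $s_{rel}$ even when the raw score is unchanged, which is how an off-topic passage is downweighted.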

IFSuP Module This evaluates the factual consistency between response $y$ and passage $p$:

$$s_{con} = h_{\psi}(y,p) \odot \sigma(\mathrm{IFSuP}(y,p)) \tag{10}$$

Where $\odot$ is the element-wise product, $h_{\psi}$ calculates consistency between $y$ and $p$, controlled via IFSuP. This helps verify if details in $y$ like dates or places align with the evidence in $p$.

IFUsE Module This directly outputs a usefulness score $u$ between input $x$ and response $y$:

$$u = \mathrm{IFUsE}(x,y) \tag{11}$$

For example, $u$ may be lower if $y$ fails to answer the query in $x$ about Van Gogh’s paintings. The modules apply self-supervision for relevance, coherence, and consistency.

Figure 4: Visualization of the performance of the SelfReg module

The IFReT($x$) module for selective passage retrieval plays a key role in improving accuracy by retrieving evidence only when needed, avoiding unnecessary information. For instance, in closed-domain QA, Struct-X achieves higher recall by learning a tight retrieval threshold via IFReT($x$), while open-ended generation benefits from more selective retrieval. Furthermore, the IFReL($x$, $p$) module filters out lower-quality passages that are less relevant, as quantified by the relevance score $s_{rel}$ in Eq. 3 and calibrated by IFReL gates. This enhances the contextual signals passed to the language model. The IFSuP($y$, $p$) and IFUsE($x$, $y$) critiques help further verify passage-response consistency and overall utility, ensuring higher quality outputs.

Figure 4 shows model performance before and after applying different levels of knowledge filtering in question-answering comprehension tasks. The filtering ratio varies between 0 and 60%, and effectiveness is measured by the fit ($R^2$) between the response of the large language model and the ground-truth answer. With no filtering (0%), $R^2$ is around 0.7, the baseline level when all knowledge is injected. As the filtering ratio increases, $R^2$ first rises and then falls. When about 40% of low-relevance knowledge is filtered out, the fit peaks at around 0.82, indicating that through Self-Reg the model has learned to recognize the knowledge most critical to the current question. Increasing the filtering ratio further, into the 40-60% range, reverses the trend: over-filtering removes useful knowledge, leaving too little context for accurate inference. Appropriate knowledge filtering can therefore improve question answering, but a balance must be struck between denoising and preserving information. The Self-Reg module fits best at a filtering ratio of approximately 40%. The Retrieve module decides when passage retrieval is needed. The Relevant module filters out irrelevant passages. The Coherent module verifies whether the generated response is coherent with the input.
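The notion of a filtering ratio can be made concrete with a short sketch. It assumes each piece of knowledge already carries a relevance score and that filtering simply drops the lowest-scoring fraction; how Struct-X actually ranks and removes knowledge is not specified at this level of detail, so the function `filter_by_ratio` and the scores below are purely illustrative.

```python
def filter_by_ratio(scored_items, ratio):
    """Drop the lowest-scoring `ratio` fraction of (item, score) pairs."""
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    # Rank from most to least relevant, then cut off the tail.
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    n_drop = int(len(ranked) * ratio)
    return ranked[: len(ranked) - n_drop]


# Five passages with illustrative relevance scores.
scored = [("p1", 0.9), ("p2", 0.1), ("p3", 0.5), ("p4", 0.7), ("p5", 0.3)]

kept_all = filter_by_ratio(scored, 0.0)  # no filtering
kept_40 = filter_by_ratio(scored, 0.4)   # drop the two least relevant
```

With a ratio of 0.4, the two lowest-scoring passages are removed and the three highest-scoring ones are passed on.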

Figure 5: The results of four tasks in the experiments section. Panels: (a) Q1 Task, (b) Q2 Task, (c) Q3 Task, (d) Q4 Task.

Appendix F Preliminaries

We first introduce background on knowledge graphs, text generation in LLMs, and information retrieval.

F.1 Knowledge Graphs and Graph Networks

A knowledge graph (KG) is defined as $\mathcal{G}=\{(h,r,t)\}$, with “head” $h$ and “tail” $t$ entities from $\mathcal{E}$ and relation type $r$ from $\mathcal{R}$. Each triplet represents unique knowledge and such knowledge representation in KG can enhance LLMs reasoning Liu (2020). Graph neural networks (GNNs) process graphs $\mathcal{G}=(V,E)$ with nodes $V$ and edges $E$, learning node representations by message passing, combining node features and graph topology Wu (2020). GNNs encode KGs’ topology and structure. Node $v$’s feature vector at layer $l$ is $h_v^{(l)}$, with neighboring nodes $\mathcal{N}(v)$ and edge feature $e_{vu}$, where the function $M^{(l)}(\cdot)$ aggregates neighboring node information, and $\sigma(\cdot)$ is an activation function:

$$h_v^{(l+1)} = \sigma\left(\sum_{u\in\mathcal{N}(v)} M^{(l)}\left(h_v^{(l)}, h_u^{(l)}, e_{vu}\right)\right), \tag{12}$$
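A single message-passing layer in the form of Eq. 12 can be sketched in plain Python. The message function $M^{(l)}$ and the activation $\sigma$ are not fixed by the equation, so the choices below are assumptions made only for illustration: the message is the neighbor feature scaled by a scalar edge weight, and $\sigma$ is ReLU.

```python
def relu(vec):
    """Element-wise ReLU, used here as the activation sigma."""
    return [max(0.0, x) for x in vec]


def message(h_v, h_u, e_vu):
    """Assumed M^(l): neighbor feature scaled by a scalar edge weight."""
    return [e_vu * x for x in h_u]


def message_passing_layer(h, neighbors, edge_feat):
    """One layer of Eq. 12: sum messages over N(v), then apply sigma."""
    new_h = {}
    for v in h:
        agg = [0.0] * len(h[v])
        for u in neighbors[v]:
            m = message(h[v], h[u], edge_feat[(v, u)])
            agg = [a + b for a, b in zip(agg, m)]
        new_h[v] = relu(agg)
    return new_h


# Toy graph with three nodes and 2-dimensional features.
h = {"a": [1.0, -1.0], "b": [2.0, 0.5], "c": [-3.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
edge_feat = {("a", "b"): 1.0, ("a", "c"): 0.5, ("b", "a"): 1.0, ("c", "a"): 0.5}

h_next = message_passing_layer(h, neighbors, edge_feat)
print(h_next)  # {'a': [0.5, 1.0], 'b': [1.0, 0.0], 'c': [0.5, 0.0]}
```

Stacking $L$ such layers lets information from nodes up to $L$ hops away reach each node's representation.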

Graph attention networks (GAT), a subclass of graph networks, leverage self-attention mechanisms Brody et al. (2021), similar to the Transformer architecture, to enhance node representations Li (2021a). Each layer projects node features into queries $Q$, keys $K$, and values $V$, with attention coefficients calculated between connected nodes, following the Transformer model Yang (2020); Khan (2022). These coefficients are used to aggregate neighboring value vectors, updating the node feature representation $h'_v$ Li (2021b):

$$e_{vu} = \mathrm{LeakyReLU}\left(\boldsymbol{a}^{\top}\left[\mathbf{W}h_v \,|\, \mathbf{W}h_u\right]\right), \tag{13}$$
$$h'_v = \sigma\left(\sum_{u\in\mathcal{N}(v)} \mathbf{V}h_u\right). \tag{14}$$
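The attention computation can be sketched as follows. Eq. 13 is implemented as written, reading the bar as vector concatenation. Eq. 14 as printed does not show the attention weights, so the sketch makes an assumption based on the sentence preceding the equations and on the standard GAT formulation: the logits $e_{vu}$ are normalized with a softmax over the neighbors and used to weight the value vectors $\mathbf{V}h_u$. ReLU as $\sigma$ and a LeakyReLU slope of 0.2 are also assumptions.

```python
import math


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def attention_logit(a, W, h_v, h_u):
    """Eq. 13: e_vu = LeakyReLU(a^T [W h_v | W h_u]), | read as concatenation."""
    concat = matvec(W, h_v) + matvec(W, h_u)
    return leaky_relu(sum(ai * ci for ai, ci in zip(a, concat)))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(a, W, V, h, v, nbrs):
    """Assumed weighted form of Eq. 14: sigma(sum_u alpha_vu * V h_u)."""
    alphas = softmax([attention_logit(a, W, h[v], h[u]) for u in nbrs])
    values = [matvec(V, h[u]) for u in nbrs]
    dim = len(values[0])
    out = [sum(al * val[d] for al, val in zip(alphas, values)) for d in range(dim)]
    return alphas, [max(0.0, x) for x in out]


identity = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0, 1.0, 0.0]  # attends to the first coordinate of W h_u
h = {"v": [1.0, 0.0], "u1": [2.0, 0.0], "u2": [0.0, 0.0]}

alphas, h_v_new = attend(a, identity, identity, h, "v", ["u1", "u2"])
```

In the toy example the neighbor with the larger first coordinate receives the larger attention weight, and the weights sum to one.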

Through learning to focus on the most relevant semantic connections, our networks can refine node embeddings efficiently. The model constructs contextual node representations in KGs using graph attention layers Zhang et al. (2020), and subsequently integrates this structural knowledge into language models. This process improves the overall understanding and reasoning capabilities in tasks like semantic analysis and knowledge inference Banerjee et al. (2020).

F.2 Implementation Details

Training details We optimize model parameters using the Adam optimizer with a learning rate of 1e-4, a batch size of 32, and a maximum of 20 training epochs. For testing, model accuracy is evaluated by exact match between the predicted response and the ground-truth answers in the datasets. We report accuracy averaged over 5 runs with different random seeds.
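The evaluation protocol (exact match, averaged over runs) can be sketched as below. The normalization applied before comparison (stripping whitespace and lowercasing) is our assumption, since the text does not say how strings are compared, and the toy data uses three runs rather than five to keep the example short.

```python
from statistics import mean


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match the reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


references = ["paris", "54 years", "john"]

# One list of predictions per random seed (toy data).
runs = [
    ["Paris", "54 years", "Raj"],    # 2 of 3 correct
    ["paris", "54", "John"],         # 2 of 3 correct
    ["Paris ", "54 years", "john"],  # 3 of 3 correct
]

per_run = [exact_match_accuracy(preds, references) for preds in runs]
average_accuracy = mean(per_run)
```

Averaging per-run accuracies, rather than pooling all predictions, gives each seed equal weight in the reported number.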

| Model | WebQSP Accuracy (%) | WebQSP Precision (%) | WebQSP Recall (%) | MetaQA Accuracy (%) | MetaQA Precision (%) | MetaQA Recall (%) |
| --- | --- | --- | --- | --- | --- | --- |
| bert-base-NER [1] | 75.13 | 73.40 | 77.25 | 79.63 | 78.27 | 77.53 |
| bert-multilingual-sentiment [2] | 73.24 | 72.29 | 74.86 | 76.45 | 75.36 | 74.62 |
| BERT-large-whole-word [3] | 68.92 | 67.53 | 69.45 | 72.74 | 71.45 | 70.36 |

[1] https://huggingface.co/dslim/bert-base-NER
[2] https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment
[3] https://huggingface.co/bert-large-cased-whole-word-masking

Table 8: Performance of Auxiliary Module variants

Appendix G Performance of Core Components of Auxiliary Modules

The results in Table 8 show that bert-base-NER is the strongest choice for the Struct-X Auxiliary Module: it improves accuracy, precision, and recall by roughly 6-8 percentage points over the whole-word masked BERT-large model and by about 1-3 points over the multilingual sentiment BERT model. The whole-word masked BERT-large model gives even poorer results than no Auxiliary Module, while the multilingual sentiment BERT model remains insufficient.

The likely explanation lies in the Named Entity Recognition pre-training of bert-base-NER, which equips the model with a finer-grained understanding of named entities and relational reasoning, both highly valuable for multi-hop questions over knowledge graphs. By steering prompt/answering iterations towards logically consistent outputs, it provides vital signals previously lacking. Meanwhile, whole-word masking seems to hinder BERT-large from learning compositional word structures crucial for precisely interpreting relations. Although also a BERT model, the sentiment classification tuning causes multilingual BERT to underperform on topological tasks. The significant gaps quantified via controlled ablation experiments suggest that task-matched BERT fine-tuning can outperform larger architectural variants on complex reasoning tasks involving entities and relations.

| Hyperparameter | WebQSP | MetaQA | Family Tree | Travel Route |
| --- | --- | --- | --- | --- |
| Graph Encoder Layers | 4 | 4 | 4 | 4 |
| Graph Encoder Dimensions | 512 | 512 | 512 | 512 |
| Encoder Heads | 8 | 8 | 8 | 8 |
| Node Masking Rate | 0.4 | 0.4 | 0.3 | 0.3 |
| Auxiliary Dimensions | 512 | 512 | 256 | 256 |
| Prefix | 5 | 5 | 5 | 5 |
| Generator Layers | 2 | 2 | 4 | 4 |
| MLM Probability | 0.2 | 0.3 | 0.15 | 0.2 |
| Tunable parameters | 0.0933 B | 0.0933 B | 0.0933 B | 0.0933 B |
| Batch Size | 32 | 32 | 16 | 16 |
| Dropout | 0 | 0 | 0 | 0 |
| Learning Rate | 5e-5 | 5e-5 | 1e-4 | 1e-4 |
| Training Epochs | 15 | 20 | 15 | 10 |
| Warmup epochs | 1 | 1 | 1 | 1 |
| Weight decay | 0.01 | 0.01 | 0.01 | 0.01 |

Table 9: Hyperparameters of Struct-X for the four datasets

Appendix H Experimental Results Illustration

Figure 5 illustrates performance on four reasoning tasks under variants of the Struct-X framework with different components disabled. For multi-hop QA (Fig 5(a)), removing the knowledge retrieval module causes an evident drop of 8.2% in accuracy. This validates its role in traversing distant nodes by filling gaps. Omitting the encoding also degrades topological reasoning, seen in the 4.3% lower location QA (Fig 5(b)) accuracy without the graph encoder. For complex inferences (Fig 5(c)), the Auxiliary Module contributes to steering coherent deductions, reflected by 6.5% weaker decomposition scores when absent. Finally, inconsistent responses increase by 4.9% (Fig 5(d)) after disabling relevance filtering, confirming the module’s benefits for factual verification.

Across diverse reasoning dimensions, targeted ablation of key components consistently reveals measurable impacts validating their individual contributions. The knowledge retrieval module facilitates connecting facts between entities, the graph encoder grounds topological representations, the Auxiliary Module directs logical inferences, and relevance filtering retains consistency. As a novel framework, Struct-X combines complementary mechanisms for enhancing language models’ structured reasoning capacities. The ablation analyses quantify these modular effects through performance differentials on specialized probing tasks.

Appendix I Code

The code can be accessed at the following URL: https://github.com/AnonymousBoy123/anonymous_structx.

 

Input Tell me more about the painter Van Gogh

Output [IFReT = Yes]

Reference [35]

[IFReL = Relevant] Van Gogh was a famous Dutch painter during the post-impressionist period. [IFSuP = Partially Supported] He was known for paintings like the Starry Night and Sunflowers.

[IFReT = Continue] However, he struggled with mental illness and remained poor and virtually unknown throughout his life. [IFSuP = Fully Supported] Most of his best works were produced in the last two years before his death. [IFUsE = 4]

Reference [35] Vincent Willem van Gogh was a Dutch post-impressionist painter who is among the most famous and influential figures in the history of Western art. In a decade, he created about 2,100 artworks, including around 860 oil paintings, most of which date from the last two years of his life. They include landscapes, still lifes, portraits and self-portraits, and are characterised by bold colours and dramatic, impulsive and expressive brushwork that contributed to the foundations of modern art. He was little known during his lifetime, and many of his masterpieces were not appreciated until the 20th century. By the late 1920s, he had become one of the most celebrated artists in history. Nonetheless, mental illness plagued him throughout his life, and after he shot himself in the chest with a revolver at age 37, he succumbed to his injuries two days later.

Correctness of IFReL and IFSuP: IFSuP is incorrect in claiming full support about his lack of fame during life, when the reference clearly states he only gained appreciation after death. So IFSuP should be Partially Supported. IFReL is appropriately marked as Relevant overall.