Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval

Shengjie Ma, Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Chengjin Xu, IDEA Research, International Digital Economy Academy, Shenzhen, Guangdong, China
Xuhui Jiang, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Muzhi Li, Dept. of Computer Science & Engineering, The Chinese University of Hong Kong, Sha Tin, Hong Kong, China
Huaren Qu, The Hong Kong University of Science and Technology, Guangzhou, Guangdong, China
Jian Guo, IDEA Research, International Digital Economy Academy, Shenzhen, Guangdong, China
(2023)
Abstract.

Retrieval-augmented generation (RAG) has significantly advanced large language models (LLMs) by enabling dynamic information retrieval to mitigate knowledge gaps and hallucinations in generated content. However, these systems often falter with complex reasoning and consistency across diverse queries. In this work, we present Think-on-Graph 2.0 (ToG2.0), an enhanced RAG framework that aligns questions with the knowledge graph and uses it as a navigational tool, which deepens and refines the RAG paradigm for information collection and integration. The KG-guided navigation fosters deep and long-range associations to uphold logical consistency and optimize the scope of retrieval for precision and interoperability. In conjunction, factual consistency can be better ensured through semantic similarity guided by precise directives. ToG2.0 not only improves the accuracy and reliability of LLMs' responses but also demonstrates the potential of hybrid structured knowledge systems to significantly advance LLM reasoning, aligning it closer to human-like performance. We conducted extensive experiments on four public datasets to demonstrate the advantages of our method compared to the baseline.

journalyear: 2023; price: 15.00; doi: 10.1145/3539618.3591766; isbn: 978-1-4503-9408-6/23/07; ccs: Information systems → Users and interactive retrieval

1. Introduction

Figure 1. Comparison of traditional RAG (a), KG-based generation (b), and graph-guided RAG (c). The example illustrates the limitations of pure semantic retrieval and pure knowledge graph-augmented frameworks in complex QA tasks, and the advantages of our proposed KG+RAG framework. (a) In the semantic retrieval-based RAG paradigm, the dataset is too large, resulting in low information density. Traditional retrieval systems also struggle to capture deep connections between facts and thus fail to focus on the key points of the question. (b) In the KG path inference-based LLM augmentation paradigm, the information provided by triples lacks both depth and detail, and some information may be missing altogether because the KG is incomplete. (c) The proposed ToG2.0 combines the advantages of both approaches: the KG helps to capture deep connections between different facts and precisely narrows the search scope, while entity context-based retrieval supplements the information missing from the knowledge graph's triple-path reasoning.

Retrieval-augmented generation (RAG) has emerged as a promising solution to the knowledge deficiencies and hallucination issues of large language models (LLMs). By dynamically retrieving relevant information from external sources, RAG systems allow LLMs to transcend the limitations of their static knowledge and to address highly diverse and complex applications (Zhao et al., 2024). Despite this potential, and despite researchers' attempts to incorporate various additional processes into RAG (such as knowledge preprocessing, fine-grained retrieval, and the generation of thought chains), LLMs still struggle to build human-like insight into complex tasks. Such insight involves a motivated, continuous effort to understand connections (among objects, numbers, concepts, and events) in order to anticipate their trajectories and act effectively.

Most recent RAG implementations rely heavily on vector retrieval of text (Ding et al., 2024). In this context, vector embeddings are numerical representations of words, phrases, or entire documents that capture their meanings in a semantic space. RAG systems identify potentially relevant text chunks or documents in knowledge sources by comparing the similarities between vector embeddings. While vector embeddings are effective at capturing surface-level semantic similarities, they are not efficient for every task. Indiscriminately retrieving k text fragments for a query significantly increases the input length for the model. Vector embeddings also have several limitations in capturing long-range associations between different types of knowledge: 1. Shallow Correlation Capture: Simple vector-based matching might miss conceptual correlations, such as the one between the Global Financial Crisis and the 2008 Recession. 2. Difficulty in Aggregating Diverse Facts: Retrieval that relies solely on vector embeddings struggles to connect related but not directly similar knowledge, such as linking "Harry Potter" and "Fantastic Beasts" as works by J.K. Rowling. 3. Inability to Handle Complex Logic: Simple vector-based retrieval is not suitable for multi-step reasoning or for tracking logical links between different information fragments unless all these fragments are pre-divided and encoded, which becomes highly inefficient for many potential reasoning types. As shown in Figure 1(a), in naive RAG, where generic retrievers search through large-scale corpora, recall often remains at the level of superficial semantic similarity. This is because the similarity modeled by the retrievers is aligned with human understanding, not with that of LLMs. Moreover, training retrievers specifically for certain datasets and tasks is impractical and inconvenient.
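To make the retrieval step described above concrete, the following is a minimal sketch of top-k selection by cosine similarity. It is an illustration only: the function names and the toy two-dimensional vectors are assumptions, and a real system would obtain the vectors from an embedding model rather than hand-written lists.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_chunks(query_vec, chunk_vecs, k):
    """Return the indices of the k chunks most similar to the query.

    Ties are broken by the original chunk order.
    """
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]

# Toy example: three chunk embeddings and one query embedding.
chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
selected = top_k_chunks([1.0, 0.0], chunks, k=2)
```

Because the ranking depends only on pairwise similarity to the query, two chunks that are related to each other but not to the query wording (the "Harry Potter" and "Fantastic Beasts" case) are never linked by this procedure.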

To address these challenges, researchers such as Lee et al. (2020) have enhanced retrieval units from words and phrases to document paragraphs, aiming for finer-grained relevant information at the cost of increased retrieval complexity. ITER-RETGEN (Shao et al., 2023) follows an iterative strategy, merging retrieval and generation in a loop that alternates between "retrieval-augmented generation" and "generation-augmented retrieval". Trivedi et al. (2023) combined RAG with the Chain-of-Thought (CoT) method (Wei et al., 2022a), alternating between CoT-guided retrieval and retrieval-supported CoT, which significantly improved GPT-3's performance on various Q&A tasks. Despite these optimizations, traditional query-to-document or query-to-paragraph retrieval still struggles to focus accurately on the relations between key points within complex questions, resulting in low information density and ineffective long-range knowledge association. In addition, coarse-grained retrieval over multiple iterations can introduce more noise and even harmful disturbance, further limiting the accuracy and reliability of RAG.

Some researchers have introduced structured external knowledge graphs (KGs) into RAG, such as ToG (Sun et al., 2024), which searches for valid information based on the triple relationships in Wikipedia KG. KGs are sophisticated frameworks that encapsulate the essence of data interconnectivity, not only cataloging information but also elucidating the context and multi-level interrelations among entities. The structured nature of KGs enhances the transparency and explainability of the systems. While powerful for structuring high-level concepts and relationships, KGs are inherently limited in comprehensiveness and detail, as shown in Figure 1(b). Their highly generalized nature, which facilitates broad overviews of connected data, often precludes them from providing the fine-grained details necessary for nuanced understanding and analysis. This lack of detail can be a significant hurdle when precision and specificity are required, highlighting a fundamental trade-off between generalization and granularity. Challenges also remain in effectively identifying and mitigating noise, ambiguity, and incompleteness in KGs (Tian et al., 2022).

Figure 2. The detailed structure of ToG2.0.

The synergy between knowledge graphs and unstructured documents for RAG is becoming increasingly crucial. Therefore, we propose Think-on-Graph 2.0 (ToG2.0), an advanced RAG paradigm with graph-guided knowledge retrieval for deep and interpretable reasoning. ToG2.0 integrates unstructured knowledge from documents with structured insights from knowledge graphs (KGs), with the KG serving as a roadmap for complex problem-solving. By aligning questions with the knowledge graph and using it as a navigational tool, this approach deepens and refines the RAG paradigm for information collection and integration. It not only ensures semantic similarity at the level of factual consistency but also fosters long-range associations to uphold logical consistency. The proposed paradigm brings LLMs closer to the way humans reason and solve problems: examining current clues, associating potential entities based on an existing knowledge framework, and continuously digging into the topic until the answer is found.

Figure 1(c) shows a simple case of ToG2.0. It draws on the ToG approach to multi-hop search within knowledge graphs, starting from key entities identified in the query and exploring outward along entities and relations with a prompt-driven inferential process. It combines logical chain extension based on triples with unstructured contextual knowledge of the relevant entities, and it improves the methods for ranking and selecting relevant entities and relations, thus integrating and utilizing heterogeneous external knowledge more effectively. Specifically, ToG2.0 uses the entities encountered during exploration to limit the scale of the corpus for retrieval, enhancing efficiency. It also ranks and selects entities based on the query, the current triple chains, and the references retrieved from the current entity's context, which reduces entity ambiguity and keeps the next exploration step on an accurate course. In practice, balancing reasoning speed and answer quality is crucial. For complex problems requiring high accuracy, deeper retrieval may be necessary. Most advanced RAG systems enhance generated results at the cost of more intermediate processes and more frequent LLM calls. ToG2.0 incorporates several strategies to balance reasoning speed and answer quality. First, Topic Pruning: general entities such as "country," "gender," and "film" are excluded from the query so that the selected entities best serve as starting points for reasoning, reducing the total number of exploration links. Second, Relation Pruning Optimization: instead of calling the LLM once for every entity as in ToG, ToG2.0 lets the LLM select relations for multiple entities in a single call, reducing the number and duration of LLM calls. Finally, DPR-based Entity Ranking: dense passage retrieval replaces the LLM calls used in ToG for entity ranking, with better stability, accuracy, and runtime efficiency.
Through these innovations, ToG2.0 aims to bring the performance and reliability of RAG systems closer to those of humans.

2. Related Works

2.1. Retrieval Augmented Generation with Knowledge Graph

RAG aims to offer real-time knowledge updates and effective utilization of external knowledge sources with high interpretability. An important factor is the granularity of the retrieved data. Coarse-grained retrieval units can in theory provide more relevant information for the problem, but they may also contain redundant content that distracts the retriever and the language model in downstream tasks. Fine-grained retrieval units, on the other hand, increase the burden of retrieval and guarantee neither semantic integrity nor coverage of the required knowledge (Gao et al., 2024). This low information density and low utility are due to the inherent limitations of semantic retrieval.

KGs offer dynamic, explicit, and structured representations of knowledge. This structured knowledge representation is particularly beneficial for LLMs because it introduces a level of interpretability and precision in the knowledge that LLMs can access. Early approaches (Sun et al., 2020; Peters et al., 2019; Huang et al., 2024; Liu et al., 2020) focused on embedding knowledge from KGs directly into the neural networks of LLMs, either during pretraining or during fine-tuning. The intent was to infuse the models with rich, structured knowledge from the foundational stages of model training. Despite the initial promise, integrating KGs directly into LLMs introduced challenges. As noted by Hu et al. (2023), this integration reduces the natural explainability of the models. It also makes updating the knowledge base more complex and less efficient, as any change in the KG requires retraining or significant adjustments to the model. More recent studies (Jiang et al., 2024) have shifted towards using KGs to augment LLMs externally rather than embedding the knowledge directly into the models. These approaches translate relevant structured knowledge from KGs into textual prompts that are then fed to LLMs. The process typically follows a fixed pipeline in which extra information from KGs is retrieved to enhance the LLM prompts. Integrating KGs with LLMs in this way offers numerous advantages but also faces distinct challenges and limitations, such as the incompleteness and ambiguity discussed in the previous section.

In this work, we aim to integrate both KG and unstructured data with LLM, leveraging the strengths of both to mitigate their respective weaknesses.

3. Methodology

The proposed method, Explore & Examine on Graph, first utilizes the LLM to evaluate the query and initialize proper starting points for reasoning. ToG2.0 then activates the internal knowledge and reading comprehension abilities of the LLM to efficiently identify multi-granularity local clues that support reasoning. These clues progressively assemble the supporting information chain and finally complete the global chain of thought from the question to the answer. In the following sections, we explain each step in detail.

3.1. ToG2.0 Initialization

Selecting appropriate starting points for specific queries can facilitate much more streamlined reasoning. For example: "Among the founders of Tencent company, who has been a member of the National People's Congress?" In this case, a broad or poorly chosen starting point such as "Member of the National People's Congress" could lead to the pitfall of sifting through large amounts of irrelevant data, causing time-consuming and less focused exploration. An effective starting point would be the entity "Founders of Tencent". This principle is essential in reasoning tasks, especially in open-domain question answering, where questions are highly varied. Therefore, given a question $q$, ToG2.0 first performs Named Entity Recognition (NER) and Topic Prune (TP), which prompts the LLM to evaluate $q$ and the entities appearing in it, selecting topic entities $E_{topic}^{0} = \{e_1, e_2, \ldots, e_N\}$ that are appropriate to serve as starting points for the question.

In complex reasoning, the implicit correlation between the question and effective intermediate clue sentences often goes unrecognized by both sparse retrieval models and dense pre-trained retrieval models. To address this limitation, we prompt the LLM to formulate a clue query $q_j^0$ based on the current context for every topic entity $e_j$, which orients the next step of exploring relations and contexts. Taking the Tencent question above as an example again, based on the entity "National People's Congress", the LLM may generate a clue query that suggests gathering information about their political roles or affiliations.
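The initialization phase can be sketched as two prompting steps: topic pruning over the recognized entities, followed by one clue query per retained entity. The sketch below is an assumption-laden illustration, not the prompts used in this work: the prompt wording, the `;`-separated reply format, and the `llm` callable (any function that maps a prompt string to a reply string) are all hypothetical.

```python
from typing import Callable, Dict, List

def initialize(question: str,
               candidate_entities: List[str],
               llm: Callable[[str], str]) -> Dict[str, str]:
    """Return {topic_entity: clue_query} for the entities kept by topic pruning."""
    # Step 1 (Topic Prune): ask the LLM which recognized entities are good starting points.
    tp_prompt = (
        "Question: " + question + "\n"
        "Candidate entities: " + "; ".join(candidate_entities) + "\n"
        "List the entities that are suitable starting points for reasoning, "
        "separated by ';'."
    )
    kept = [name.strip() for name in llm(tp_prompt).split(";") if name.strip()]
    # Keep only names that were actually offered, in case the reply drifts.
    topics = [name for name in kept if name in candidate_entities]

    # Step 2 (clue query): one follow-up query per retained topic entity.
    clue_queries: Dict[str, str] = {}
    for entity in topics:
        cq_prompt = (
            "Question: " + question + "\n"
            "Topic entity: " + entity + "\n"
            "Write one clue query describing what to look for next."
        )
        clue_queries[entity] = llm(cq_prompt).strip()
    return clue_queries
```

Filtering the reply against the candidate list is a design choice of this sketch: it prevents an entity that the LLM invents from becoming a starting point in the graph.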

3.2. Reasoning with Graph-driven Knowledge Retrieval

Next, we introduce how ToG2.0 iteratively utilizes structured and unstructured knowledge for reasoning. Formally, in the $(i+1)$-th iteration, given the original question $q$, the clue queries from the $i$-th iteration $Q_c^i = \{q_1^i, q_2^i, \ldots, q_N^i\}$, the topic entities $E_{topic}^i = \{e_1, e_2, \ldots, e_N\}$, and their preceding triple paths $\mathbf{P}^i = \{P_1^i, P_2^i, \ldots, P_N^i\}$ with $P_j^i = \{p_j^0, p_j^1, \ldots, p_j^i\}$, each iteration includes three steps: relation prune (RP), entity prune (EP), and examine and reasoning (ER). Note that $i = 0$ indicates the initialization phase, in which $P^0$ is empty.

| Group | Model | LLM | WebQSP | HotpotQA | QALD-10-en | FEVER |
|---|---|---|---|---|---|---|
| Ours | ToG2.0 | GPT-3.5-turbo | 81.13 | 40.91 | 54.05 | 58.54 |
| Baseline | Vanilla | GPT-3.5-turbo | 74.55 | 28.89 | 42.04 | 52.10 |
| Baseline | CoT | GPT-3.5-turbo | 59.93 | 34.40 | 42.94 | 57.80 |
| Baseline | CoK | GPT-3.5-turbo | - | 35.40 | - | 63.50 |
| Baseline | ToG | GPT-3.5-turbo | 76.20 | 26.30 | 50.20 | 52.70 |

Table 1. Performance of our method versus various baselines. The evaluation metric for FEVER is Accuracy, while the metric for WebQSP, HotpotQA, and QALD-10-en is Exact Match (EM).
| Model | LLM | WebQSP | HotpotQA | QALD-10-en | FEVER |
|---|---|---|---|---|---|
| Vanilla | Llama-2-13b | 53.25 | 16.23 | 36.04 | 42.10 |
| Vanilla | GPT-3.5-turbo | 74.55 | 28.89 | 42.04 | 52.10 |
| ToG2.0 (w/o TP, RC, clue_query) | GPT-3.5-turbo | 78.70 | 39.29 | 51.05 | 56.30 |
| ToG2.0 (w/o TP, RC, clue_query) | Llama-2-13b | 76.22 | 29.15 | 48.64 | 49.17 |
| ToG2.0 (w/o TP, clue_query) | GPT-3.5-turbo | 76.43 | 38.64 | 49.85 | 56.04 |
| ToG2.0 (w/o TP) | GPT-3.5-turbo | 77.62 | 39.61 | 52.85 | 56.46 |
| ToG2.0 | GPT-3.5-turbo | 81.13 | 40.91 | 54.05 | 58.54 |

Table 2. Ablation study: the influence of each module in ToG2.0 on the final performance. The evaluation metric for FEVER is Accuracy, while the metric for WebQSP, HotpotQA, and QALD-10-en is Exact Match (EM). clue_query: query reformulation. TP: topic prune. RC: relation prune combination.

Relation Prune (RP): Based on $q$ and $Q_c^i$, we prompt the LLM to select the relations that are most likely to find entities containing helpful context information for solving $q$ and that match the description of $Q_c^i$. Unlike selecting relations for a single topic entity at a time, we provide GPT-3.5 with all topic entities within a single prompt. This approach not only reduces the number of API calls, thereby accelerating inference time, but also enables the LLM to simultaneously consider the interconnections between multiple reasoning paths, allowing it to make selections from a more global perspective. The selected relations for topic entity $e_j$ are denoted as $R_j = \{r_{j1}, r_{j2}, \ldots, r_{jW}\}$, where $W$ denotes the hyper-parameter width.
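The single-prompt relation selection described above can be sketched as one prompt builder plus one reply parser. This is a minimal illustration under stated assumptions: the prompt layout, the `entity: rel1; rel2` reply format, and both function names are hypothetical and are not taken from the prompts used in this work.

```python
from typing import Dict, List

def build_relation_prune_prompt(question: str,
                                clue_queries: Dict[str, str],
                                relations: Dict[str, List[str]],
                                width: int) -> str:
    """Pack every topic entity and its candidate relations into ONE prompt."""
    lines = [
        f"Question: {question}",
        f"For each entity below, select up to {width} relations.",
    ]
    for entity, candidates in relations.items():
        lines.append(f"Entity: {entity}")
        lines.append(f"Clue query: {clue_queries.get(entity, '')}")
        lines.append("Relations: " + "; ".join(candidates))
    return "\n".join(lines)

def parse_selection(reply: str,
                    relations: Dict[str, List[str]],
                    width: int) -> Dict[str, List[str]]:
    """Parse reply lines of the assumed form 'entity: rel1; rel2'."""
    selected: Dict[str, List[str]] = {}
    for line in reply.splitlines():
        if ":" not in line:
            continue
        entity, _, rest = line.partition(":")
        entity = entity.strip()
        if entity not in relations:
            continue  # ignore entities that were never offered
        picked = [r.strip() for r in rest.split(";") if r.strip() in relations[entity]]
        selected[entity] = picked[:width]  # enforce the width W
    return selected
```

With this layout, the number of LLM calls per iteration is one regardless of how many topic entities are active, at the cost of a longer and more demanding prompt.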

Entity Prune (EP): Given a topic entity $e_j$ and one of the selected relations $r_{jk}$, ToG2.0 identifies all interconnected candidate entity nodes $\{e_{jkl}\}$ within the Wiki Knowledge Graph (KG) and gets their associated Wikipedia page documents $D_{jkl}$ through a locally deployed service. The document context of each candidate entity is first segmented into appropriately sized chunks $\{t_{jklm}\}$. Subsequently, a two-stage search $F_{retr}$ is employed, utilizing pre-trained language models for all candidate entities' chunks.
Formally, $s_{jklm} = F_{retr}([q, q_j^i, p_{jkl}^i], t_{jklm})$ denotes the relevance score of the $m$-th paragraph of the $l$-th candidate, where $p_{jkl}^i$ denotes the triples from which the current candidate entity is derived. Then, the ranking score of a candidate entity $e_{jkl}$ is calculated as the exponentially decayed weighted sum of the scores of its chunks that rank in the top-$K$, and the weight for the $i$-th ranked chunk is calculated as $w = e^{-\alpha \cdot i}$, where $K$ and $\alpha$ are hyper-parameters.
Finally, the top-$W$ candidate entities are selected as the new topic entities $E_{topic}^{i+1}$ for the next iteration, and the corresponding preceding triple paths $\mathbf{P}^{i+1}$ are updated.
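The entity ranking rule can be illustrated with a short sketch. It assumes that chunks from all candidate entities are ranked together, that ranks start at 1, and that the relevance scores have already been produced by the retriever; the function name and the toy scores are hypothetical.

```python
import math
from collections import defaultdict

def rank_entities(chunk_scores, top_k, alpha, width):
    """Rank candidate entities by the decayed sum of their top-K chunk scores.

    chunk_scores: list of (entity, relevance_score) pairs, one pair per chunk.
    Returns up to `width` (entity, ranking_score) pairs, best first.
    """
    # Keep only the K highest-scoring chunks across all candidates.
    ranked = sorted(chunk_scores, key=lambda pair: -pair[1])[:top_k]
    totals = defaultdict(float)
    for rank, (entity, score) in enumerate(ranked, start=1):
        # Weight of the chunk at this rank: w = exp(-alpha * rank).
        totals[entity] += math.exp(-alpha * rank) * score
    ordered = sorted(totals.items(), key=lambda pair: -pair[1])
    return ordered[:width]

# Toy example: entity "A" owns the 1st and 3rd ranked chunks, "B" the 2nd.
result = rank_entities(
    [("A", 0.9), ("B", 0.8), ("A", 0.7), ("C", 0.1)],
    top_k=3, alpha=0.5, width=2,
)
```

In this example the chunk of "C" falls outside the top-K and contributes nothing, so "C" cannot become a topic entity in the next iteration.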

Examine and reasoning (ER): Following RP and EP, we give the LLM carefully aggregated references, including $q$, $Q_c^i$, $\mathbf{P}^{i+1}$, and the top $L$ ($L \leq K$) chunks. The LLM is then prompted to examine the logical coherence and the completeness of the factual evidence. If the LLM believes it can answer the question, the iteration ends. If not, a new clue query is generated for the next round based on the question and the collected contextual clues.
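Taken together, the three steps form a loop that stops as soon as the examination step yields an answer. The skeleton below is only a sketch of that control flow: the three step functions are passed in as callables with hypothetical signatures, and the `max_depth` cap is an assumption added so that the loop always terminates.

```python
def tog2_loop(question, topics, clue_queries,
              relation_prune, entity_prune, examine, max_depth):
    """Run RP -> EP -> ER until ER returns an answer or max_depth is reached.

    Returns (answer, iterations_used); answer is None if no answer was found.
    """
    paths = []  # preceding triple paths; empty at initialization
    for depth in range(max_depth):
        # RP: choose relations for all current topic entities at once.
        relations = relation_prune(question, clue_queries, topics)
        # EP: follow the relations, rank candidates, update topics and paths.
        topics, paths, chunks = entity_prune(
            question, clue_queries, topics, relations, paths)
        # ER: either answer, or produce new clue queries for the next round.
        answer, clue_queries = examine(question, clue_queries, paths, chunks)
        if answer is not None:
            return answer, depth + 1
    return None, max_depth
```

Passing the steps as callables keeps the skeleton independent of any particular LLM, retriever, or knowledge graph backend.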

4. Experiments

4.1. Datasets and Metrics

We evaluated our method on two multi-hop KBQA datasets, WebQSP (Yih et al., 2016) and QALD-10-en (Usbeck et al., 2023), a multi-hop complex document QA dataset, HotpotQA (Yang et al., 2018), and a fact verification dataset, FEVER (Thorne et al., 2018). The evaluation metric for FEVER is Accuracy, while the metric for WebQSP, HotpotQA, and QALD-10-en is Exact Match (EM).
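For reference, the two metrics can be computed as in the sketch below. The normalization shown (lowercasing, stripping punctuation and English articles, collapsing whitespace) is a common convention for Exact Match and is an assumption here; the exact matching rule used in the experiments is not specified in this section, and the function names are illustrative.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer, else 0.0."""
    target = normalize(prediction)
    return float(any(target == normalize(gold) for gold in gold_answers))

def mean_score(scores):
    """Average per-example scores and report them as a percentage."""
    return 100.0 * sum(scores) / len(scores)

def accuracy(predicted_labels, gold_labels):
    """Percentage of examples whose predicted label equals the gold label."""
    hits = [float(p == g) for p, g in zip(predicted_labels, gold_labels)]
    return mean_score(hits)
```

Both metrics are reported on a 0 to 100 scale, which is the scale used in Tables 1 and 2.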

4.2. Baselines

We compare ToG2.0 with both widely used baselines and state-of-the-art methods to provide a more comprehensive overview: 1) Standard prompting (Vanilla Prompt) directly answers the question. 2) Chain-of-Thought (CoT) (Wei et al., 2022b) generates several intermediate rationales before the final answer to improve the complex reasoning capability of LLMs. 3) Chain-of-Knowledge (CoK) (Li et al., 2024) is an LLM framework augmented with heterogeneous knowledge sources. 4) Think-on-Graph (ToG) (Sun et al., 2024) is a KG-based method that searches for useful triples for reasoning.

4.3. Implementation Details

In this study, considering the experimental costs and for ease of comparison with other baselines, we conducted experiments on two LLMs: GPT-3.5-turbo and Llama-2-13B-chat. We used the OpenAI API to access GPT-3.5-turbo, while Llama-2-13B-chat was deployed on 8 A100-40G GPUs without quantization. Consistent with the ToG settings, we set the temperature parameter to 0.4 during exploration and 0 during reasoning. The maximum token length for generation was capped at 256. For context retrieval, we utilized the pre-trained BGE-embedding model without any fine-tuning. We chose Wikidata as the knowledge source for all experiments. During the TP, RC, relation pruning, and reasoning stages, we employed 2-shot demonstrations for all prompts.

4.4. Main Results

As shown in Table 1, we compare our proposed method, ToG2.0, with the baselines: vanilla prompting, Chain-of-Thought (CoT), the current SOTA baseline Chain-of-Knowledge (CoK), and the original ToG. The evaluation metrics are Exact Match (EM) for WebQSP, HotpotQA, and QALD-10-en, and Accuracy for FEVER.

ToG2.0 outperforms the other baselines on WebQSP, HotpotQA, and QALD-10-en. Notably, on HotpotQA, it surpasses the SOTA baseline CoK by 5.51 percentage points. Compared to the original ToG, ToG2.0 achieves a substantial improvement on HotpotQA (14.61 points) and notable gains on the other datasets (4.93 points on WebQSP, 3.85 on QALD-10-en, and 5.84 on FEVER). This demonstrates the advantages of our "KG+context" framework in addressing complex problems. The performance on the fact-checking dataset FEVER is lower than that of CoK (58.54 vs. 63.50), which may be due to CoK utilizing more knowledge sources and an additional LLM self-verification mechanism. To save computational costs and reduce inference latency, we decided not to use a self-verification mechanism; it could be added in the future depending on application requirements.
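The margins quoted above are absolute differences between rows of Table 1 and can be re-derived with a few lines of arithmetic. The snippet below simply copies the relevant Table 1 values; the dictionary layout and the `gain` helper are illustrative names.

```python
# Scores copied from Table 1 (EM for the QA datasets, Accuracy for FEVER).
TABLE1 = {
    "ToG2.0": {"WebQSP": 81.13, "HotpotQA": 40.91, "QALD-10-en": 54.05, "FEVER": 58.54},
    "CoK": {"HotpotQA": 35.40, "FEVER": 63.50},
    "ToG": {"WebQSP": 76.20, "HotpotQA": 26.30, "QALD-10-en": 50.20, "FEVER": 52.70},
}

def gain(method, baseline, dataset):
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(TABLE1[method][dataset] - TABLE1[baseline][dataset], 2)

# Gains of ToG2.0 over the original ToG on every dataset.
gains_over_tog = {name: gain("ToG2.0", "ToG", name) for name in TABLE1["ToG"]}
```

These are differences in percentage points rather than relative improvements; for example, the HotpotQA gain over ToG corresponds to a relative increase of more than 50% over the ToG score of 26.30.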

4.5. Ablation Study

To evaluate the contribution of each component in Tog2.0, we conducted comprehensive ablation experiments across all datasets.

The effect of Topic Prune (TP) is more pronounced on WebQSP than on the other three datasets, possibly because WebQSP questions contain a higher proportion of general entities, which tend to introduce unnecessary noise.

Although Relation prune Combination (RC) may slightly decrease performance, because it is harder for the LLM to handle multiple tasks within a single prompt, it brings a significant reduction in the number of inferences and in latency. Assuming a reasoning depth of $N$ and a width of $W$, the complexity can theoretically be reduced from $O(W^N)$ to $O(N)$.
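The reduction in relation-pruning calls can be made concrete with a small count. The sketch assumes a worst case that is not spelled out in the text: exploration starts from a single topic entity and the set of topic entities multiplies by $W$ at every level. Under that assumption, per-entity prompting needs a geometric number of calls, which stays within the $O(W^N)$ bound, while combined prompting needs one call per level.

```python
def per_entity_calls(width, depth):
    """Relation-pruning calls when every topic entity gets its own prompt.

    Level 0 has one entity; each level is assumed to be `width` times larger.
    """
    return sum(width ** level for level in range(depth))

def combined_calls(depth):
    """Relation-pruning calls when all topic entities share one prompt per level."""
    return depth

# Width 3, depth 3: 1 + 3 + 9 per-entity calls versus 3 combined calls.
comparison = (per_entity_calls(3, 3), combined_calls(3))
```

If the frontier is instead capped at $W$ entities per level, the per-entity count grows only linearly in the depth, so the saving from combination is a factor of about $W$ rather than an exponential one.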

Additionally, the clue query brought relatively consistent improvements on every dataset, indicating that adaptive query optimization helps the LLM better understand the task. We also tested the vanilla baseline and the basic ToG2.0 process on Llama-2-13B. On this less capable LLM, ToG2.0 brings a larger performance improvement, which suggests that ToG2.0 might be more adaptable. Weaker LLMs often encounter bottlenecks when handling complex tasks, while ToG2.0 utilizes knowledge graphs as clues to optimize the reasoning path and reduce task complexity. It then uses entity context to further guide the model to focus on relevant information, thereby improving task understanding and response accuracy. In contrast, GPT-3.5, with its higher inherent capabilities, may not exhibit as large an improvement because it is already closer to its performance ceiling.

References

  • Ding et al. (2024) Yujuan Ding, Wenqi Fan, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A survey on rag meets llms: Towards retrieval-augmented large language models. arXiv preprint arXiv:2405.06211 (2024).
  • Gao et al. (2024) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997 [cs.CL] https://arxiv.org/abs/2312.10997
  • Hu et al. (2023) Linmei Hu, Zeyi Liu, Ziwang Zhao, Lei Hou, Liqiang Nie, and Juanzi Li. 2023. A survey of knowledge enhanced pre-trained language models. IEEE Transactions on Knowledge and Data Engineering (2023).
  • Huang et al. (2024) Rikui Huang, Wei Wei, Xiaoye Qu, Wenfeng Xie, Xianling Mao, and Dangyang Chen. 2024. Joint Multi-Facts Reasoning Network for Complex Temporal Question Answering Over Knowledge Graph. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 10331–10335.
  • Jiang et al. (2024) Zhouyu Jiang, Ling Zhong, Mengshu Sun, Jun Xu, Rui Sun, Hui Cai, Shuhan Luo, and Zhiqiang Zhang. 2024. Efficient Knowledge Infusion via KG-LLM Alignment. arXiv:2406.03746 [cs.CL] https://arxiv.org/abs/2406.03746
  • Lee et al. (2020) Jinhyuk Lee, Minjoon Seo, Hannaneh Hajishirzi, and Jaewoo Kang. 2020. Contextualized Sparse Representations for Real-Time Open-Domain Question Answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 912–919. https://doi.org/10.18653/v1/2020.acl-main.85
  • Li et al. (2024) Xingxuan Li, Ruochen Zhao, Yew Ken Chia, Bosheng Ding, Shafiq Joty, Soujanya Poria, and Lidong Bing. 2024. Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources. In International Conference on Learning Representations. https://openreview.net/forum?id=cPgh4gWZlz
  • Liu et al. (2020) Daizong Liu, Xiaoye Qu, Jianfeng Dong, and Pan Zhou. 2020. Reasoning step-by-step: Temporal sentence localization in videos via deep rectification-modulation network. In Proceedings of the 28th International Conference on Computational Linguistics. 1841–1851.
  • Peters et al. (2019) Matthew E Peters, Mark Neumann, Robert L Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A Smith. 2019. Knowledge enhanced contextual word representations. arXiv preprint arXiv:1909.04164 (2019).
  • Shao et al. (2023) Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. arXiv:2305.15294 [cs.CL] https://arxiv.org/abs/2305.15294
  • Sun et al. (2024) Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel M. Ni, Heung-Yeung Shum, and Jian Guo. 2024. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. arXiv:2307.07697 [cs.CL] https://arxiv.org/abs/2307.07697
  • Sun et al. (2020) Yawei Sun, Lingling Zhang, Gong Cheng, and Yuzhong Qu. 2020. SPARQA: Skeleton-based Semantic Parsing for Complex Questions over Knowledge Bases. CoRR abs/2003.13956 (2020). arXiv:2003.13956 https://arxiv.org/abs/2003.13956
  • Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. arXiv preprint arXiv:1803.05355 (2018).
  • Tian et al. (2022) Ling Tian, Xue Zhou, Yan-Ping Wu, Wang-Tao Zhou, Jin-Hao Zhang, and Tian-Shu Zhang. 2022. Knowledge graph and knowledge reasoning: A systematic review. Journal of Electronic Science and Technology 20, 2 (2022), 100159.
  • Trivedi et al. (2023) Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. arXiv:2212.10509 [cs.CL] https://arxiv.org/abs/2212.10509
  • Usbeck et al. (2023) Ricardo Usbeck, Xi Yan, Aleksandr Perevalov, Longquan Jiang, Julius Schulz, Angelie Kraft, Cedric Möller, Junbo Huang, Jan Reineke, Axel-Cyrille Ngonga Ngomo, et al. 2023. QALD-10–The 10th challenge on question answering over linked data. Semantic Web Preprint (2023), 1–15.
  • Wei et al. (2022a) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. 2022a. Chain of Thought Prompting Elicits Reasoning in Large Language Models. CoRR abs/2201.11903 (2022). arXiv:2201.11903 https://arxiv.org/abs/2201.11903
  • Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600 (2018).
  • Yih et al. (2016) Wen-tau Yih, Matthew Richardson, Christopher Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 201–206.
  • Zhao et al. (2024) Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. 2024. Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473 (2024).