Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval

Shengjie Ma (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China), Chengjin Xu (IDEA Research, International Digital Economy Academy, Shenzhen, Guangdong, China), Xuhui Jiang (Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China), Muzhi Li (Dept. of Computer Science & Engineering, The Chinese University of Hong Kong, Sha Tin, Hong Kong, China), Huaren Qu (The Hong Kong University of Science and Technology, Guangzhou, Guangdong, China), and Jian Guo (IDEA Research, International Digital Economy Academy, Shenzhen, Guangdong, China)

(2023)

Abstract.

Retrieval-augmented generation (RAG) has significantly advanced large language models (LLMs) by enabling dynamic information retrieval to mitigate knowledge gaps and hallucinations in generated content. However, these systems often falter with complex reasoning and consistency across diverse queries. In this work, we present Think-on-Graph 2.0 (ToG_2.0), an enhanced RAG framework that aligns questions with the knowledge graph and uses it as a navigational tool, which deepens and refines the RAG paradigm for information collection and integration. The KG-guided navigation fosters deep and long-range associations to uphold logical consistency and optimizes the scope of retrieval for precision and interpretability. In conjunction, factual consistency can be better ensured through semantic similarity guided by precise directives. ToG_2.0 not only improves the accuracy and reliability of LLMs’ responses but also demonstrates the potential of hybrid structured knowledge systems to significantly advance LLM reasoning, bringing it closer to human-like performance. We conducted extensive experiments on four public datasets to demonstrate the advantages of our method over the baselines.



1. Introduction

Figure 1. Comparison of traditional RAG (a), KG-based generation (b) and Graph-guided RAG (c). The example illustrates the limitations of pure semantic retrieval and pure knowledge graph-augmented frameworks in complex QA tasks, and the advantages of our proposed KG+RAG framework. (a) In the semantic retrieval-based RAG paradigm, the corpus to be searched is too large, resulting in low information density. Traditional retrieval systems also struggle to capture deep connections between facts, and thus fail to focus on the key points of the question. (b) In the KG path inference-based LLM augmentation paradigm, the information provided by triples lacks both depth and detail, and some information may be missing altogether because the KG is incomplete. (c) The proposed ToG_2.0 combines the advantages of both approaches. The KG helps to understand deep connections between different facts and precisely narrows down the search scope, while entity context-based retrieval supplements the information missing from the knowledge graph’s triple path reasoning.

Retrieval-augmented generation (RAG) has emerged as a promising solution to the knowledge deficiencies and hallucination issues of large language models (LLMs). By dynamically retrieving relevant information from external sources, RAG systems allow LLMs to transcend the inherent limitations of static knowledge, enabling them to address applications of high diversity and complexity (Zhao et al., 2024). Despite this potential, and despite researchers’ attempts to incorporate various complex additional processes into RAG (such as knowledge preprocessing, fine-grained retrieval, and chain-of-thought generation), LLMs still struggle to build human-like insight into complex tasks. Such insight involves a motivated, continuous effort to understand connections (among objects, numbers, concepts, and events) in order to anticipate their trajectories and act effectively.

Most recent RAG implementations rely heavily on vector retrieval of text (Ding et al., 2024). In this context, vector embeddings are numerical representations of words, phrases, or entire documents that capture their meanings in a semantic space. RAG systems identify potentially relevant text chunks or documents from knowledge sources by comparing the similarities between vector embeddings. While vector embeddings are effective for capturing surface-level semantic similarities, they are not efficient for every task. Indiscriminately retrieving k text fragments for a query significantly increases the input length for the model. Additionally, vector embeddings have several limitations in capturing long-range associations between various types of knowledge:

1. Shallow Correlation Capture: Simple vector-based matching might miss conceptual correlations, such as between the Global Financial Crisis and the 2008 Recession.

2. Difficulty in Aggregating Diverse Facts: Retrieval that relies solely on vector embeddings struggles to connect related but not directly similar knowledge, such as linking “Harry Potter” and “Fantastic Beasts” as works by J.K. Rowling.

3. Inability to Handle Complex Logic: Simple vector-based retrieval is not suitable for multi-step reasoning or for tracking logical links between different information fragments unless all these fragments are pre-divided and encoded, which becomes highly inefficient for many potential reasoning types.

As shown in Figure 1(a), in naive RAG, where generic retrievers search through large-scale corpora, recall often remains at the level of superficial semantic similarity. This is because the similarity modeled by the retrievers is aligned with human understanding, not with LLMs. Moreover, training retrievers specifically for certain datasets and tasks is impractical and inconvenient.
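The vector retrieval step described above can be sketched as a top-k search by cosine similarity. This is a minimal illustration, not the retriever used in the paper: the function names `cosine` and `top_k` and the three-dimensional toy vectors are assumptions made up for the example, and a real system would obtain the vectors from an embedding model.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query_vec: Sequence[float],
          chunk_vecs: List[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) for the k most similar chunks."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, for illustration only).
chunks = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]
hits = top_k(query, chunks, k=2)
print(hits)
```

Every one of the k returned chunks is appended to the prompt regardless of whether it is logically connected to the question, which is the source of the input-length growth noted above.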

To address these challenges, researchers such as Lee et al. (2020) have varied the retrieval unit, from words and phrases to document paragraphs, aiming for finer-grained relevant information at the cost of increased retrieval complexity. ITER-RETGEN (Shao et al., 2023) follows an iterative strategy, merging retrieval and generation in a loop that alternates between “retrieval-augmented generation” and “generation-augmented retrieval”. Trivedi et al. (2023) combined RAG with the Chain of Thought (CoT) (Wei et al., 2022a) method, alternating between CoT-guided retrieval and retrieval-supported CoT processes, significantly improving GPT-3’s performance on various Q&A tasks. Despite these optimizations, traditional query-to-document or query-to-paragraph retrieval still struggles to focus accurately on relations between key points within complex questions, resulting in low information density and ineffective long-range knowledge association. In addition, coarse-grained retrieval over multiple iterations can introduce more noise and even harmful disturbance, further limiting the accuracy and reliability of RAG.

Some researchers have introduced structured external knowledge graphs (KGs) into RAG, such as ToG (Sun et al., 2024), which searches for valid information based on the triple relationships in Wikipedia KG. KGs are sophisticated frameworks that encapsulate the essence of data interconnectivity, not only cataloging information but also elucidating the context and multi-level interrelations among entities. The structured nature of KGs enhances the transparency and explainability of the systems. While powerful for structuring high-level concepts and relationships, KGs inherently possess limitations in comprehensiveness and detail, as shown in Figure 1(b). Their highly generalized nature, which facilitates broad overviews of connected data, often precludes them from providing the fine-grained details necessary for nuanced understanding and analysis. This lack of detailed information can be a significant hurdle when precision and specificity are required, highlighting a fundamental trade-off between generalization and granularity. Challenges also remain in effectively identifying and mitigating noise, ambiguity and incompleteness in KGs (Tian et al., 2022).

The synergy between knowledge graphs and unstructured documents for RAG is becoming increasingly crucial. Therefore, we propose Think-on-Graph 2.0 (ToG_2.0), an advanced RAG paradigm with graph-guided knowledge retrieval for deep and interpretable reasoning. ToG_2.0 integrates unstructured knowledge from documents with structured insights from knowledge graphs (KGs), which serve as a roadmap for complex problem-solving. By aligning questions with the knowledge graph and using it as a navigational tool, this approach deepens and refines the RAG paradigm for information collection and integration: it not only ensures semantic similarity at the level of factual consistency but also fosters long-range associations to uphold logical consistency. The proposed paradigm brings LLMs closer to how humans reason and solve problems: examining current clues, associating potential entities based on an existing knowledge framework, and continuously digging into the topic until the answer is found.

Figure 1(c) shows a simple case of ToG_2.0, which draws on the ToG approach to multi-hop search within knowledge graphs: it starts from key entities identified in the query and explores outward over entities and relationships through a prompt-driven inferential process. It combines triple-based extension of logical chains with unstructured contextual knowledge about the relevant entities and improves the methods for ranking and selecting entities and relations, thus integrating and utilizing heterogeneous external knowledge more effectively. Specifically, ToG_2.0 uses the entities encountered during exploration to limit the scale of the corpus for retrieval, enhancing efficiency. It also ranks and selects entities based on the query, the current triple chains, and references retrieved from the current entity’s context, which reduces entity ambiguity and keeps the next exploration step on an accurate course.

In practice, balancing reasoning speed and answer quality is crucial. For complex problems requiring high accuracy, deeper retrieval may be necessary. Most advanced RAG systems enhance generated results at the cost of more intermediate processes and more frequent LLM calls. ToG_2.0 incorporates three strategies to balance reasoning speed and answer quality. First, Topic Pruning: general entities such as “country,” “gender,” and “film” are excluded from the query so that the selected entities are those that best serve as starting points for reasoning, reducing the total number of exploration links. Second, Relation Pruning Optimization: instead of calling the LLM once per entity as in ToG, ToG_2.0 lets the LLM select relations for multiple entities in a single call, reducing the number and duration of LLM calls. Finally, DPR-based Entity Ranking: dense passage retrieval replaces the LLM calls that ToG uses for entity ranking, offering better stability, accuracy, and runtime efficiency. Through these innovations, ToG_2.0 aims to align the performance and reliability of RAG systems with those of humans.

2. Related Works

2.1. Retrieval Augmented Generation with Knowledge Graph

RAG aims to offer real-time knowledge updates and effective utilization of external knowledge sources with high interpretability. An important factor is the granularity of the retrieved data. Coarse-grained retrieval units can in theory provide more relevant information for the problem, but they may also contain redundant content, which could distract the retriever and language models in downstream tasks. Fine-grained retrieval units, on the other hand, increase the burden of retrieval and guarantee neither semantic integrity nor coverage of the required knowledge (Gao et al., 2024). This low information density and low utility are due to the inherent limitations of semantic retrieval.

KGs offer dynamic, explicit, and structured representations of knowledge. This structured knowledge representation is particularly beneficial for LLMs because it introduces a level of interpretability and precision in the knowledge that LLMs can access. Early approaches (Sun et al., 2020; Peters et al., 2019; Huang et al., 2024; Liu et al., 2020) focused on embedding knowledge from KGs directly into the neural networks of LLMs. This embedding could occur either during the pretraining phase or during fine-tuning. The intent was to infuse the models with rich, structured knowledge right from the foundational stages of model training. Despite the initial promise, integrating KGs directly into LLMs introduced challenges. As noted by Hu et al. (2023), this integration reduces the natural explainability of the models. It also makes updating the knowledge base more complex and less efficient, as any change in the KG requires retraining or significant adjustments to the model. More recent studies (Jiang et al., 2024) have shifted towards using KGs to augment LLMs externally rather than embedding the knowledge directly into the models. Those approaches translate relevant structured knowledge from KGs into textual prompts that are then fed to LLMs. The process typically follows a fixed pipeline in which extra information from KGs is retrieved to enhance the LLM prompts. Integrating KGs with LLMs offers numerous advantages but also faces distinct challenges and limitations, such as the incompleteness and ambiguity discussed in Section 1.

In this work, we aim to integrate both KG and unstructured data with LLM, leveraging the strengths of both to mitigate their respective weaknesses.

3. Methodology

The proposed method, Explore & Examine on Graph, first uses the LLM to evaluate the query and initialize proper starting points for reasoning. Following this, ToG_2.0 activates the internal knowledge and reading comprehension abilities of the LLM to efficiently identify multi-granularity local clues that support reasoning. These clues progressively assemble the supporting information chain and finally complete the global chain of thought from the question to the answer. In the following sections, we explain each step in detail.

3.1. ToG_2.0 Initialization

Selecting appropriate starting points for a specific query can make reasoning much more streamlined. Consider the question “Among the founders of Tencent company, who has been a member of the National People’s Congress?”. Here, a broad or poorly chosen starting point such as “Member of the National People’s Congress” forces the system to sift through large amounts of irrelevant data, making exploration time-consuming and less focused. An effective starting point would instead focus on the entity “Founders of Tencent”. This principle is essential in reasoning tasks, especially in open-domain question answering, where questions are highly varied. Therefore, given a question $q$ , ToG_2.0 first performs Named Entity Recognition (NER) and Topic Prune (TP), which prompts the LLM to evaluate $q$ and the entities appearing in it, selecting topic entities $E_{topic}^{0}=\{e_{1},e_{2},\ldots,e_{N}\}$ that are appropriate to serve as starting points for the question.

In complex reasoning, the implicit correlation between the question and effective intermediate clue sentences often goes unrecognized by both sparse retrieval models and dense pre-trained retrieval models. To address this limitation, we prompt the LLM to formulate a clue query $q_{j}^{0}$ based on the current context for every topic entity $e_{j}$ , which orients the next step of exploring relations and contexts. Taking the Tencent question above as an example again, for the entity “National People’s Congress”, the LLM may generate a clue query that suggests gathering information about the founders’ political roles or affiliations.
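The initialization step can be sketched as follows. Note two loud assumptions: the paper performs Topic Prune by prompting the LLM, whereas this sketch swaps in a fixed stoplist (`GENERAL_ENTITIES`) purely so that the example runs without a model, and the `llm` argument is a hypothetical callable that maps a prompt string to a reply string. The prompt wording is also invented for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stoplist standing in for the LLM-based Topic Prune judgment.
GENERAL_ENTITIES = {"country", "gender", "film"}


def topic_prune(entities: List[str]) -> List[str]:
    """Drop overly general entities so they are not used as starting points."""
    return [e for e in entities if e.lower() not in GENERAL_ENTITIES]


def make_clue_queries(question: str,
                      topics: List[str],
                      llm: Callable[[str], str]) -> Dict[str, str]:
    """Ask the (hypothetical) LLM for one clue query per topic entity."""
    clue_queries = {}
    for entity in topics:
        prompt = (f"Question: {question}\n"
                  f"Entity: {entity}\n"
                  "Write one clue query:")
        clue_queries[entity] = llm(prompt)
    return clue_queries


# Usage with a stub in place of a real model call.
topics = topic_prune(["Tencent", "Country", "National People's Congress"])
clues = make_clue_queries("Who founded Tencent?", topics,
                          llm=lambda prompt: "stub clue query")
print(topics, clues)
```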

3.2. Reasoning with Graph-driven Knowledge Retrieval

Next, we introduce how ToG_2.0 iteratively uses structured and unstructured knowledge for reasoning. Formally, the $(i+1)$-th iteration is given the original question $q$ , the clue queries from the $i$-th iteration $Q_{c}^{i}=\{q_{1}^{i},q_{2}^{i},\ldots,q_{N}^{i}\}$ , the topic entities $E_{topic}^{i}=\{e_{1},e_{2},\ldots,e_{N}\}$ and their preceding triple paths $\mathbf{P}^{i}=\{P_{1}^{i},P_{2}^{i},\ldots,P_{N}^{i}\}$ , where $P_{j}^{i}=\{p_{j}^{0},p_{j}^{1},\ldots,p_{j}^{i}\}$ . Each iteration includes three steps: relation prune (RP), entity prune (EP), and examine and reasoning (ER). Note that $i=0$ indicates the initialization phase, and $\mathbf{P}^{0}$ is empty.

Method	LLM	WebQSP	HotpotQA	QALD-10-en	FEVER
Ours
ToG_2.0	GPT-3.5-turbo	81.13	40.91	54.05	58.54
Baselines
Vanilla	GPT-3.5-turbo	74.55	28.89	42.04	52.10
CoT	GPT-3.5-turbo	59.93	34.40	42.94	57.80
CoK	GPT-3.5-turbo	-	35.40	-	63.50
ToG	GPT-3.5-turbo	76.20	26.30	50.20	52.70

Table 1. Performance of our method versus various baselines. The evaluation metric for FEVER is Accuracy, while the metric for WebQSP, HotpotQA, and QALD-10-en is Exact Match (EM).

Model	LLM	WebQSP	HotpotQA	QALD-10-en	FEVER
Vanilla	Llama-2-13b	53.25	16.23	36.04	42.10
Vanilla	GPT-3.5-turbo	74.55	28.89	42.04	52.10
ToG_2.0 (w/o TP, RC, clue_query)	GPT-3.5-turbo	78.70	39.29	51.05	56.30
ToG_2.0 (w/o TP, RC, clue_query)	Llama-2-13b	76.22	29.15	48.64	49.17
ToG_2.0 (w/o TP, clue_query)	GPT-3.5-turbo	76.43	38.64	49.85	56.04
ToG_2.0 (w/o TP)	GPT-3.5-turbo	77.62	39.61	52.85	56.46
ToG_2.0	GPT-3.5-turbo	81.13	40.91	54.05	58.54

Table 2. Ablation study: the influence of each module in ToG_2.0 on the final performance. The evaluation metric for FEVER is Accuracy, while the metric for WebQSP, HotpotQA, and QALD-10-en is Exact Match (EM). clue_query: query reformulation. TP: topic prune. RC: relation prune combination.

Relation Prune (RP): Based on $q$ and $Q_{c}^{i}$ , we prompt the LLM to select the relations that are most likely to lead to entities containing helpful context information for solving $q$ and that match the description of $Q_{c}^{i}$ . Rather than selecting relations for a single topic entity at a time, we provide the LLM with all topic entities within a single prompt. This approach not only reduces the number of API calls, thereby shortening inference time, but also enables the LLM to consider the interconnections between multiple reasoning paths simultaneously, allowing it to make selections from a more global perspective. The selected relations for topic entity $e_{j}$ are denoted as $R_{j}=\{r_{j1},r_{j2},\ldots,r_{jW}\}$ , where $W$ denotes the width hyper-parameter.
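The batching idea in relation prune can be sketched as building one prompt that covers every topic entity and then parsing one reply. The prompt wording, the `entity -> rel1; rel2` reply format, and both function names are assumptions made for illustration; the paper does not specify its prompt template here.

```python
from typing import Dict, List


def build_relation_prune_prompt(question: str,
                                clue_queries: Dict[str, str],
                                candidates: Dict[str, List[str]],
                                width: int) -> str:
    """Put all topic entities and their candidate relations into one prompt."""
    lines = [f"Question: {question}",
             f"For each entity, select at most {width} relations.",
             ""]
    for entity, relations in candidates.items():
        lines.append(f"Entity: {entity}")
        lines.append(f"Clue query: {clue_queries.get(entity, '')}")
        lines.append("Candidate relations: " + "; ".join(relations))
        lines.append("")
    return "\n".join(lines).rstrip()


def parse_selection(reply: str,
                    candidates: Dict[str, List[str]],
                    width: int) -> Dict[str, List[str]]:
    """Parse 'entity -> rel1; rel2' lines, keeping only known relations."""
    selected: Dict[str, List[str]] = {entity: [] for entity in candidates}
    for line in reply.splitlines():
        entity, sep, rest = line.partition(" -> ")
        entity = entity.strip()
        if not sep or entity not in candidates:
            continue  # ignore lines that do not follow the assumed format
        picked = [r.strip() for r in rest.split(";")]
        valid = [r for r in picked if r in candidates[entity]]
        selected[entity] = valid[:width]  # enforce the width W
    return selected
```

Filtering the reply against the candidate list guards against relations the model invents, and truncating to `width` keeps the per-entity selection within $W$.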

Entity Prune (EP): Given a topic entity $e_{j}$ and one of its selected relations $r_{jk}$ , ToG_2.0 identifies all connected candidate entity nodes $\{e_{jkl}\}$ within the Wiki Knowledge Graph (KG) and obtains their associated Wikipedia page documents $D_{jkl}$ through a locally deployed service. The document context of each candidate entity is first segmented into appropriately sized chunks $\{t_{jklm}\}$ . Subsequently, a two-stage search $F_{retr}$ , which uses pre-trained language models, is applied to the chunks of all candidate entities. Formally, $s_{jklm}=F_{retr}([q,q_{j}^{i},p_{jkl}^{i}],t_{jklm})$ denotes the relevance score of the $m$-th chunk of the $l$-th candidate, where $p_{jkl}^{i}$ is the triple from which the current candidate entity is derived. The ranking score of a candidate entity $e_{jkl}$ is then calculated as the exponentially decayed weighted sum of the scores of its chunks that rank in the top $K$ , where the weight for the $i$-th ranked chunk is $w=e^{-\alpha\cdot i}$ (here $i$ indexes the chunk rank, not the iteration) and $K$ and $\alpha$ are hyper-parameters. Finally, the top-$W$ candidate entities are selected as the new topic entities $E_{topic}^{i+1}$ for the next iteration, and the corresponding preceding triple paths $\mathbf{P}^{i+1}$ are updated.
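The entity ranking rule can be sketched as below. Two points are assumptions rather than statements from the paper: chunks from all candidate entities are ranked together in one global list, and ranks are 1-based, so the best chunk receives weight $e^{-\alpha}$. The relevance scores would come from $F_{retr}$; here they are toy numbers, and `rank_entities` is a hypothetical name.

```python
import math
from typing import Dict, List, Tuple


def rank_entities(chunk_scores: Dict[str, List[float]],
                  top_k: int,
                  alpha: float,
                  width: int) -> List[Tuple[str, float]]:
    """Score each entity by its chunks that land in the global top-K.

    A chunk at (1-based, assumed) rank r contributes exp(-alpha * r) * score
    to the entity it belongs to. Returns the `width` best entities.
    """
    flat = [(score, entity)
            for entity, scores in chunk_scores.items()
            for score in scores]
    flat.sort(key=lambda pair: pair[0], reverse=True)

    totals = {entity: 0.0 for entity in chunk_scores}
    for rank, (score, entity) in enumerate(flat[:top_k], start=1):
        totals[entity] += math.exp(-alpha * rank) * score

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:width]


# Toy relevance scores for three candidate entities (illustrative values).
scores = {"A": [0.9, 0.2], "B": [0.8, 0.7], "C": [0.1]}
new_topics = rank_entities(scores, top_k=3, alpha=0.5, width=2)
print(new_topics)
```

In this toy case entity B has two chunks in the top three, but A still ranks first because the decay strongly favors the single best chunk; a smaller `alpha` would shift the balance toward entities with many moderately relevant chunks.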

Examine and reasoning (ER): Following RP and EP, we give the LLM carefully aggregated references, including $q$ , $Q_{c}^{i}$ , $\mathbf{P}^{i+1}$ and the top $L$ ( $L\leq K$ ) chunks. The LLM is then prompted to examine the logical coherence and the completeness of the factual evidence. If the LLM judges that it can answer the question, the iteration ends. If not, a new clue query is generated for the next round based on the question and the collected contextual clues.
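Putting the three steps together, the iteration can be sketched as a loop over RP, EP, and ER. The three step functions are passed in as hypothetical callables, their signatures are assumptions chosen for this sketch, and returning `None` when `max_depth` is exhausted is also an assumption, since the text does not say what happens in that case.

```python
from typing import Callable, List, Optional


def tog2_loop(question: str,
              topics: List[str],
              clues: List[str],
              relation_prune: Callable,
              entity_prune: Callable,
              examine: Callable,
              max_depth: int = 3) -> Optional[str]:
    """Iterate RP -> EP -> ER until the examiner returns an answer."""
    paths: list = []  # P^0 is empty at initialization
    for _ in range(max_depth):
        # RP: one batched call selects relations for all topic entities.
        relations = relation_prune(question, clues, topics)
        # EP: rank candidates, pick new topic entities, extend triple paths.
        topics, paths, chunks = entity_prune(question, clues, topics,
                                             relations, paths)
        # ER: either answer, or produce clue queries for the next round.
        answer, clues = examine(question, clues, paths, chunks)
        if answer is not None:
            return answer
    return None  # assumption: no answer within max_depth iterations
```

Passing the steps as callables keeps the control flow separate from the model and retrieval back ends, so each step can be stubbed when testing the loop itself.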

4. Experiments

4.1. Datasets and Metrics

We evaluated our method on two multi-hop KBQA datasets, WebQSP (Yih et al., 2016) and QALD-10-en (Usbeck et al., 2023), a multi-hop complex document QA dataset, HotpotQA (Yang et al., 2018), and a fact verification dataset, FEVER (Thorne et al., 2018). The evaluation metric for FEVER is Accuracy, while the metric for WebQSP, HotpotQA, and QALD-10-en is Exact Match (EM).
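For reference, Exact Match can be computed as sketched below. The normalization shown (lowercasing, stripping punctuation and English articles) is a commonly used convention and an assumption here; the paper does not state which normalization its evaluation applies, and the function names are hypothetical.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return normalize(prediction) in {normalize(g) for g in gold_answers}


def em_score(predictions: List[str], golds: List[List[str]]) -> float:
    """Percentage of questions answered with an exact match."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(predictions)
```

Accuracy on FEVER follows the same pattern, with the gold set containing the single correct verification label for each claim.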

4.2. Baselines

We compare ToG_2.0 with both widely used baselines and state-of-the-art methods to provide a more comprehensive overview: 1) Standard prompting (Vanilla Prompt) directly answers the question. 2) Chain-of-thought (CoT) (Wei et al., 2022b) generates several intermediate rationales before the final answer to improve the complex reasoning capability of LLMs. 3) Chain-of-Knowledge (CoK) (Li et al., 2024) is an LLM framework augmented with heterogeneous knowledge sources. 4) Think-on-Graph (ToG) (Sun et al., 2024) is a KG-based method that searches for useful triples for reasoning.

4.3. Implementation Details

In this study, considering the experimental costs and for ease of comparison with other baselines, we conducted experiments on two LLMs: GPT-3.5-turbo and Llama-2-13b-chat. We used the OpenAI API to access GPT-3.5-turbo, while Llama-2-13b-chat was deployed on 8 A100-40G GPUs without quantization. Consistent with the ToG settings, we set the temperature parameter to 0.4 during exploration and 0 during reasoning. The maximum token length for generation was capped at 256. For context retrieval, we used the pre-trained BGE embedding model without any fine-tuning. We chose Wikidata as the knowledge source for all experiments. During the TP, RC, relation pruning, and reasoning stages, we employed 2-shot demonstrations in all prompts.

4.4. Main Results

As shown in Table 1, we analyze the performance of our proposed method, ToG_2.0, in comparison with the baselines: standard (vanilla) prompting, Chain-of-Thought (CoT), the original ToG, and the current SOTA baseline (CoK). The evaluation metrics are Exact Match (EM) for WebQSP, HotpotQA, and QALD-10-en, and Accuracy for FEVER.

We note that ToG_2.0 outperforms the other baselines on WebQSP, HotpotQA and QALD-10-en. Notably, on HotpotQA, it surpasses the SOTA baseline CoK by 5.51 percentage points. Compared to the original ToG, ToG_2.0 achieves a substantial improvement on HotpotQA (14.61 points) and also notable gains on the other datasets (4.93 points on WebQSP, 3.85 on QALD-10-en and 5.84 on FEVER). This demonstrates the advantages of our “KG+context” framework in addressing complex problems. On the fact-checking dataset FEVER, ToG_2.0 trails CoK (58.54 vs. 63.50), which may be due to CoK using more knowledge sources and an additional LLM self-verification mechanism. To save computational cost and reduce inference latency, we decided not to use a self-verification mechanism; it could be added in the future depending on application requirements.
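The gains quoted above follow directly from Table 1, as the short check below shows. The dictionary simply transcribes the ToG_2.0, ToG, and CoK rows; differences are absolute differences in score (percentage points), and `gain` is a helper name introduced for this check.

```python
# Rows transcribed from Table 1 (GPT-3.5-turbo).
table1 = {
    "ToG_2.0": {"WebQSP": 81.13, "HotpotQA": 40.91,
                "QALD-10-en": 54.05, "FEVER": 58.54},
    "ToG": {"WebQSP": 76.20, "HotpotQA": 26.30,
            "QALD-10-en": 50.20, "FEVER": 52.70},
    "CoK": {"HotpotQA": 35.40, "FEVER": 63.50},  # "-" entries omitted
}


def gain(ours: str, baseline: str) -> dict:
    """Absolute score difference on every dataset the baseline reports."""
    return {dataset: round(table1[ours][dataset] - score, 2)
            for dataset, score in table1[baseline].items()}


print(gain("ToG_2.0", "ToG"))
print(gain("ToG_2.0", "CoK"))
```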

4.5. Ablation Study

To evaluate the contribution of each component in ToG_2.0, we conducted ablation experiments across all datasets; the results are reported in Table 2.

The effect of Topic Prune (TP) is more pronounced on WebQSP than on the other three datasets, possibly because WebQSP questions contain a higher proportion of general entities, which tend to introduce unnecessary noise.

Although relation prune combination (RC) may slightly decrease performance, because it is harder for the LLM to handle multiple tasks within a single prompt, the benefit is a significant reduction in the number of inference calls and in latency. Assuming a reasoning depth of $N$ and a width of $W$ , the complexity can theoretically be reduced from $O(W^{N})$ to $O(N)$ .
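One way to read this bound is to count relation-prune calls under a simple model. The model is an assumption made for this sketch, not something the paper spells out: without batching, one LLM call is issued per frontier entity and the frontier holds $W^{d}$ entities at hop $d$; with batching, one call per hop covers the whole frontier.

```python
def calls_per_entity(width: int, depth: int) -> int:
    """One call per frontier entity, with width**d entities at hop d (assumed)."""
    return sum(width ** d for d in range(1, depth + 1))


def calls_batched(depth: int) -> int:
    """One call per hop, covering all topic entities at once."""
    return depth


for w, n in [(2, 3), (3, 3), (3, 5)]:
    print(f"W={w}, N={n}: per-entity={calls_per_entity(w, n)}, "
          f"batched={calls_batched(n)}")
```

Under this model the per-entity count is dominated by its last term, $W^{N}$, while the batched count grows linearly in $N$, matching the stated reduction.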

Additionally, the clue query brings relatively consistent improvements on each dataset, indicating that adaptive query optimization can help the LLM better understand the task. We also tested the vanilla baseline and the basic ToG_2.0 process (without TP, RC, and clue query) on Llama-2-13b. On this less capable LLM, ToG_2.0 brings a larger performance improvement than on GPT-3.5-turbo, which suggests that ToG_2.0 might be more adaptable. Weaker LLMs often encounter bottlenecks when handling complex tasks, while ToG_2.0 uses knowledge graphs as clues to optimize the reasoning path and reduce task complexity. It then uses entity context to further guide the model to focus on relevant information, thereby improving task understanding and response accuracy. In contrast, GPT-3.5, owing to its higher inherent capabilities, may not exhibit as large an improvement because it is already closer to its performance ceiling.
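The comparison between the two LLMs can be checked against Table 2. The values below transcribe the Vanilla rows and the basic ToG_2.0 rows (w/o TP, RC, clue_query) for each model; the variable names are chosen for this check only.

```python
datasets = ["WebQSP", "HotpotQA", "QALD-10-en", "FEVER"]

# Rows transcribed from Table 2.
vanilla = {
    "GPT-3.5-turbo": [74.55, 28.89, 42.04, 52.10],
    "Llama-2-13b": [53.25, 16.23, 36.04, 42.10],
}
basic_tog2 = {  # ToG_2.0 (w/o TP, RC, clue_query)
    "GPT-3.5-turbo": [78.70, 39.29, 51.05, 56.30],
    "Llama-2-13b": [76.22, 29.15, 48.64, 49.17],
}

# Absolute gain of basic ToG_2.0 over vanilla, per model and dataset.
gains = {
    llm: {d: round(after - before, 2)
          for d, before, after in zip(datasets, vanilla[llm], basic_tog2[llm])}
    for llm in vanilla
}

for llm, per_dataset in gains.items():
    print(llm, per_dataset)
```

On every dataset the gain for Llama-2-13b exceeds the gain for GPT-3.5-turbo, with the largest gap on WebQSP.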

References

Ding et al. (2024) Yujuan Ding, Wenqi Fan, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A survey on rag meets llms: Towards retrieval-augmented large language models. arXiv preprint arXiv:2405.06211 (2024).
Gao et al. (2024) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997 [cs.CL] https://arxiv.org/abs/2312.10997
Hu et al. (2023) Linmei Hu, Zeyi Liu, Ziwang Zhao, Lei Hou, Liqiang Nie, and Juanzi Li. 2023. A survey of knowledge enhanced pre-trained language models. IEEE Transactions on Knowledge and Data Engineering (2023).
Huang et al. (2024) Rikui Huang, Wei Wei, Xiaoye Qu, Wenfeng Xie, Xianling Mao, and Dangyang Chen. 2024. Joint Multi-Facts Reasoning Network for Complex Temporal Question Answering Over Knowledge Graph. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 10331–10335.
Jiang et al. (2024) Zhouyu Jiang, Ling Zhong, Mengshu Sun, Jun Xu, Rui Sun, Hui Cai, Shuhan Luo, and Zhiqiang Zhang. 2024. Efficient Knowledge Infusion via KG-LLM Alignment. arXiv:2406.03746 [cs.CL] https://arxiv.org/abs/2406.03746
Lee et al. (2020) Jinhyuk Lee, Minjoon Seo, Hannaneh Hajishirzi, and Jaewoo Kang. 2020. Contextualized Sparse Representations for Real-Time Open-Domain Question Answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 912–919. https://doi.org/10.18653/v1/2020.acl-main.85
Li et al. (2024) Xingxuan Li, Ruochen Zhao, Yew Ken Chia, Bosheng Ding, Shafiq Joty, Soujanya Poria, and Lidong Bing. 2024. Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources. In International Conference on Learning Representations. https://openreview.net/forum?id=cPgh4gWZlz
Liu et al. (2020) Daizong Liu, Xiaoye Qu, Jianfeng Dong, and Pan Zhou. 2020. Reasoning step-by-step: Temporal sentence localization in videos via deep rectification-modulation network. In Proceedings of the 28th International Conference on Computational Linguistics. 1841–1851.
Peters et al. (2019) Matthew E Peters, Mark Neumann, Robert L Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A Smith. 2019. Knowledge enhanced contextual word representations. arXiv preprint arXiv:1909.04164 (2019).
Shao et al. (2023) Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. arXiv:2305.15294 [cs.CL] https://arxiv.org/abs/2305.15294
Sun et al. (2024) Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel M. Ni, Heung-Yeung Shum, and Jian Guo. 2024. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. arXiv:2307.07697 [cs.CL] https://arxiv.org/abs/2307.07697
Sun et al. (2020) Yawei Sun, Lingling Zhang, Gong Cheng, and Yuzhong Qu. 2020. SPARQA: Skeleton-based Semantic Parsing for Complex Questions over Knowledge Bases. CoRR abs/2003.13956 (2020). arXiv:2003.13956 https://arxiv.org/abs/2003.13956
Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. arXiv preprint arXiv:1803.05355 (2018).
Tian et al. (2022) Ling Tian, Xue Zhou, Yan-Ping Wu, Wang-Tao Zhou, Jin-Hao Zhang, and Tian-Shu Zhang. 2022. Knowledge graph and knowledge reasoning: A systematic review. Journal of Electronic Science and Technology 20, 2 (2022), 100159.
Trivedi et al. (2023) Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. arXiv:2212.10509 [cs.CL] https://arxiv.org/abs/2212.10509
Usbeck et al. (2023) Ricardo Usbeck, Xi Yan, Aleksandr Perevalov, Longquan Jiang, Julius Schulz, Angelie Kraft, Cedric Möller, Junbo Huang, Jan Reineke, Axel-Cyrille Ngonga Ngomo, et al. 2023. QALD-10–The 10th challenge on question answering over linked data. Semantic Web Preprint (2023), 1–15.
Wei et al. (2022a) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. 2022a. Chain of Thought Prompting Elicits Reasoning in Large Language Models. CoRR abs/2201.11903 (2022). arXiv:2201.11903 https://arxiv.org/abs/2201.11903
Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600 (2018).
Yih et al. (2016) Wen-tau Yih, Matthew Richardson, Christopher Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 201–206.
Zhao et al. (2024) Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. 2024. Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473 (2024).