Spinach: SPARQL-Based Information Navigation
for Challenging Real-World Questions

Shicheng Liu  Sina J. Semnani  Harold Triedman§  Jialiang Xu
Isaac Dan Zhao  Monica S. Lam
Stanford University  § Wikimedia Foundation
{shicheng, sinaj, lam}@cs.stanford.edu
[email protected]
Abstract

Recent work integrating Large Language Models (LLMs) has led to significant improvements in the Knowledge Base Question Answering (KBQA) task. However, we posit that existing KBQA datasets that either have simple questions, use synthetically generated logical forms, or are based on small knowledge base (KB) schemas, do not capture the true complexity of KBQA tasks.

To address this, we introduce the Spinach dataset, an expert-annotated KBQA dataset collected from forum discussions on Wikidata’s “Request a Query” forum with 320 decontextualized question-SPARQL pairs. Much more complex than existing datasets, Spinach calls for strong KBQA systems that do not rely on training data to learn the KB schema, but can dynamically explore large and often incomplete schemas and reason about them.

Along with the dataset, we introduce the Spinach agent, a new KBQA approach that mimics how a human expert would write SPARQLs for such challenging questions. Experiments on existing datasets show Spinach’s capability in KBQA, achieving a new state of the art on the QALD-7, QALD-9 Plus and QALD-10 datasets by 30.1%, 27.0%, and 10.0% in F1, respectively, and coming within 1.6% of the fine-tuned LLaMA SOTA model on WikiWebQuestions. On our new Spinach dataset, Spinach agent outperforms all baselines, including the best GPT-4-based KBQA agent, by 38.1% in F1. (Code and data available at https://github.com/stanford-oval/spinach)


*Equal contribution

1 Introduction

Figure 1(a) An example forum discussion
Question: Who are the doctoral advisors of Leonhard Euler,
and their advisors, and so on? In addition, who are his doctoral student,
grand-student, great-grand-student, and great-great-grand-students?
Each tuple in the result should contain both the student and the doctoral advisor.
SPARQL: SELECT ?doctor ?doctorMaster WHERE {
       { ?root (wdt:P184*) ?doctor. }
        UNION
        { ?root (wdt:P185/(wdt:P185?)/(wdt:P185?)/(wdt:P185?)) ?doctor. }
        ?doctor wdt:P184 ?doctorMaster.
        VALUES ?root {
             wd:Q7604
        }
}
Figure 1(b) The corresponding decontextualized example included in the validation set of the Spinach dataset. The SPARQL is based on the “Refined SPARQL” proposed by the user TomT0m. Projection fields asking for labels and the optional image clause are removed in accordance with our methodology in Section 3.1.

Wikidata (Vrandečić and Krötzsch, 2014), one of the largest publicly available knowledge bases, currently contains 15 billion facts and is estimated to grow at a rate of 1 billion triples per year (MediaWiki, 2024b). It is of significant value to many scientific communities, including Mathematics (Scharpf et al., 2021), Biology (Mitraka et al., 2015; Pfundner et al., 2015), Education (Sigalov and Nachmias, 2023), Linguistics (Turki et al., 2017; lu Yu and Qiao, 2017), and the Social Sciences (Kapsalis, 2019; Zhao, 2022), among many others (Farda-Sarbas and Müller-Birn, 2019; Turki et al., 2023). Effective access to Wikidata data can be challenging. To address this, the Wikidata Request a Query forum (https://www.wikidata.org/wiki/Wikidata:Request_a_query) has been created so that users can ask questions, propose solutions, and participate in follow-up conversations. Figure 1(a) shows one such conversation, in which a user wants to obtain the academic genealogy of Leonhard Euler.

Although numerous datasets have been proposed for the Knowledge Base Question Answering (KBQA) task, they either contain only simple questions (Yih et al., 2016; Bordes et al., 2015; Usbeck et al., 2017; Perevalov et al., 2022; Usbeck et al., 2023; Xu et al., 2023; Korablinov and Braslavski, 2020; Rybin et al., 2021) or synthetically generated complex logical forms (Bao et al., 2016; Gu et al., 2021; Talmor and Berant, 2018; Keysers et al., 2020; Cao et al., 2022a; Dubey et al., 2019). Datasets with synthetically generated logical forms often overestimate the performance of KBQA systems, and performing well on them does not equate to the ability to handle real-world queries (Oren et al., 2021; Campagna et al., 2022). Thus, the community needs a high-quality dataset with organic, real-world queries that capture the true complexity of KBQA tasks.

In this paper, we propose using the Wikidata Request a Query forum to build and evaluate next-generation KBQA systems. We introduce the Spinach dataset, a new, expert-annotated KBQA dataset featuring decontextualized question-SPARQL pairs derived from complex discussions on real-life forums. Figure 1(b) presents an example obtained from the discussion in Figure 1(a).

As we show, current KBQA approaches (Xu et al., 2023; Sun et al., 2024a) cannot handle the complexity of these real-life queries. The state-of-the-art (SOTA) approach, ToG (Sun et al., 2024a), which integrates LLMs with KG reasoning, scores only 1.8 EM and 7.2 F1 on this dataset. Therefore, we propose a new LLM-augmented KBQA approach, Spinach: SPARQL-Based Information Navigation for Challenging Real-World Questions, which is designed with the primary goal of mimicking how an expert would tackle the task. We show that the Spinach agent establishes a new state of the art on popular datasets such as QALD-7, QALD-9 Plus, and QALD-10 and comes within 1.6% of the fine-tuned SOTA on WikiWebQuestions. Although Spinach outperforms strong previous work on the Spinach dataset, its results indicate that further research is necessary to fully address real-life KBQA.

2 Related Work

2.1 KBQA benchmarks

| Dataset | avg. clauses | avg. projs | avg. rels | avg. subjs | avg. preds | avg. objs | avg. lits |
|---|---|---|---|---|---|---|---|
| Natural questions w/ annotated logical forms | | | | | | | |
| WikiWebQuestions (Xu et al., 2023) | 2.63 | 1 | 1.53 | 1.25 | 1.52 | 1.53 | 0.04 |
| QALD-9 Plus (Perevalov et al., 2022) | 3.14 | 1 | 1.77 | 1.26 | 1.70 | 1.78 | 0.05 |
| QALD-10 (Usbeck et al., 2023) | 2.38 | 1 | 1.27 | 1.19 | 1.17 | 1.32 | 0.05 |
| RuBQ (Korablinov and Braslavski, 2020) | 2.17 | 1 | 1.12 | 1.03 | 1.11 | 1.07 | 0.01 |
| SimpleQuestionsWikidata (Diefenbach et al., 2017b) | 2.00 | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| Synthetic logical forms w/ synthetic or paraphrased questions | | | | | | | |
| CWQ (Talmor and Berant, 2018) | 5.19 | 1 | 2.80 | 1.87 | 2.62 | 3.38 | 0.11 |
| GrailQA (Gu et al., 2021) | 7.10 | 1 | 3.02 | 1.97 | 2.43 | 3.90 | 0.08 |
| KQA Pro (Cao et al., 2022a) | 6.34 | 1 | 5.01 | 2.77 | 3.94 | 2.43 | 2.37 |
| MCWQ (Cui et al., 2022) | 6.34 | 1 | 5.09 | 2.67 | 3.53 | 3.37 | 0.00 |
| LC-QuAD-2 (Dubey et al., 2019) | 3.65 | 1 | 2.07 | 1.51 | 2.05 | 2.07 | 0.22 |
| Natural logical form w/ annotated questions | | | | | | | |
| Spinach (Ours) | 8.89 | 2.50 | 4.03 | 1.76 | 3.55 | 4.53 | 0.46 |
Table 1: Quantitative measurements on different KBQA datasets. For datasets originally based on Freebase, we calculate their corresponding datasets in Wikidata, if available. Refer to Appendix A for definitions of each metric.

Over the past decade, numerous Knowledge Base Question Answering (KBQA) benchmarks with logical forms have been introduced and can be classified into two categories:

(1) Datasets with natural questions originally collected through search engines or crowd-sourcing. This includes the popular dataset WebQuestionSP (Yih et al., 2016), along with the QALD datasets (Usbeck et al., 2017, 2018, 2023; Perevalov et al., 2022), RuBQ (Korablinov and Braslavski, 2020), and SimpleQuestions (Bordes et al., 2015), among others;

(2) Datasets with synthetically generated logical forms and questions, where some datasets paraphrase synthetically generated questions via crowdsourcing. This category includes ComplexWebQuestions (Talmor and Berant, 2018), GrailQA (Gu et al., 2021), KQA Pro (Cao et al., 2022a), CFQ (Keysers et al., 2020), and LC-QuAD-2 (Dubey et al., 2019), among others.

Some of the aforementioned datasets were originally based on Freebase (Bollacker et al., 2008) or DBPedia (Lehmann et al., 2015). Recently, however, the KBQA community has been shifting toward using Wikidata as the underlying knowledge base for KBQA datasets, due to its larger size and continuous updates from community contributions. (For more on why Wikidata is better suited for KBQA benchmarks, refer to Section 1.1 of Usbeck et al. (2023).) Several datasets have been converted to use Wikidata: WebQuestionSP has been converted to WikiWebQuestions (Xu et al., 2023); SimpleQuestions has a Wikidata version (Diefenbach et al., 2017b); and CFQ has been converted to MCWQ (Cui et al., 2022).

In Table 1, we present quantitative statistics on these datasets. Existing datasets with natural questions typically involve relatively simple queries (e.g., 2.63 clauses per query for WikiWebQuestions). In contrast, synthetically generated datasets generally contain more complex queries by design. However, due to the limited natural language variety between training and evaluation data, models can achieve artificially high accuracy. For instance, a simple semantic parser based on the BART model (Lewis et al., 2020) can achieve an accuracy of over 90% on KQA Pro even without an entity linking module (Cao et al., 2022a). Furthermore, systems that excel on synthetic datasets do not necessarily perform well on natural, complex tasks (Oren et al., 2021; Campagna et al., 2022). This indicates a need for a high-quality dataset with natural, complex queries and logical forms in the KBQA community.

2.2 KBQA approaches

Current KBQA systems can be classified mainly into three categories: (1) subgraph retrieval approaches (Sun et al., 2018, 2019; Sen et al., 2021; Verga et al., 2021; Mavromatis and Karypis, 2022), (2) semantic parsing systems (Yih et al., 2015, 2016; Luo et al., 2018; Lan and Jiang, 2020; Das et al., 2021; Ye et al., 2022; Cao et al., 2022b; Gu and Su, 2022; Xu et al., 2023), and more recently, (3) LLM-based graph exploration agents (Sun et al., 2024a; Xiong et al., 2024). Some works also use a combination of these techniques (Yu et al., 2023; Luo et al., 2024a, b).

Due to the sheer size of Wikidata, none of the previous subgraph retrieval approaches have tackled the problem of embedding the entire graph, instead operating only on a small fixed subset of Wikidata. This limitation renders them inapplicable in real-world scenarios where access to the entire graph is needed. Among semantic parsing systems, Xu et al. (2023) fine-tune LLaMA Touvron et al. (2023) with a modified SPARQL syntax and achieve state-of-the-art results on 2 KBQA benchmarks. Among LLM-based approaches, Sun et al. (2024a) instruct an LLM to dynamically explore the graph to fetch answers, achieving SOTA on 6 KBQA datasets. In Section 5, we benchmark the performance of these two systems on our new Spinach dataset.

3 The Spinach Dataset

The archives of the Wikidata Request a Query forum from July 2016 to the present date are available online (https://www.wikidata.org/wiki/Wikidata:Request_a_query/Archive). These conversations are real and organic, reflecting the kind of queries that practitioners are interested in. Of the discussions up to May 2024, 2780 contain a valid SPARQL query. We only keep those conversations whose last-mentioned query returns non-empty results under 10 MB. Of the 2171 discussions left, we randomly sample 920 conversations for consideration. These conversations span many domains, and typically do not specify the exact natural language question corresponding to the SPARQLs. Thus, manual processing is required to convert them to a KBQA dataset. The Wikidata experts among the authors of this paper manually inspected these conversations to produce a dataset with 155 examples in the validation set and 165 examples in the test set, as described below. Additional details on this process can be found in Appendix B.1.
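The filtering and sampling steps described above can be sketched as follows. This is only an illustration under stated assumptions: the `Discussion` record, its field names, the byte convention used for "10 MB", and the fixed random seed are all hypothetical and are not taken from the released code.

```python
import random
from dataclasses import dataclass

# "Under 10 MB" comes from the text; counting 1 MB as 1024 * 1024 bytes is an assumption.
MAX_RESULT_BYTES = 10 * 1024 * 1024


@dataclass
class Discussion:
    """Hypothetical record for one archived forum discussion."""
    title: str
    last_query: str    # last SPARQL query mentioned in the thread
    result_rows: int   # number of rows that query returned
    result_bytes: int  # size of the serialized result


def keep(d: Discussion) -> bool:
    # Keep a discussion only if its last-mentioned query returns a
    # non-empty result that is smaller than the size limit.
    return d.result_rows > 0 and d.result_bytes < MAX_RESULT_BYTES


def select_candidates(discussions, sample_size, seed=0):
    # Filter first, then draw a uniform random sample for manual inspection.
    kept = [d for d in discussions if keep(d)]
    rng = random.Random(seed)
    return rng.sample(kept, min(sample_size, len(kept)))
```

With the numbers reported in the text, `discussions` would hold the 2780 threads with a valid query, the filter would leave 2171, and `sample_size` would be 920.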

3.1 Choosing and Trimming the Queries

Our goal in building this dataset is to have examples focus more on end-users simply accessing the Wikidata content, as opposed to performing obscure optimizations or formatting. As such, we manually remove or exclude the following:

Wikimedia presentation queries: We filter out clauses for analyzing or improving Wikimedia itself, such as asking which articles exist in one language edition of Wikipedia but not in another.

Overly complicated SPARQL: We exclude certain conversations where users ask for help debugging their complex SPARQL queries.

Queries obscured by optimizations: Because Wikidata restricts query runtime to 60 seconds (MediaWiki, 2024a), users may include clauses to optimize the performance, which are often not directly relevant to answering the question. We remove these clauses.

Formatting clauses: We remove clauses that only format the results.

Appendix B.2 and B.3 show specific examples of these modifications and exclusions.

3.2 Annotating Natural Questions

For each conversation, the experts annotate a self-contained, decontextualized natural language question that accurately captures the meaning of the user-written SPARQL. We follow these steps:

Disambiguate entities and properties: Entities and properties are important parts of a query. For ambiguous entities, the verbalization includes additional information to resolve the ambiguity. As for properties, the verbalization should distinguish between similar properties. For example, instead of asking “where a movie takes place”, we distinguish between the “narrative location” and the “filming location”; instead of asking “where a person comes from”, we distinguish between the “country of citizenship” and the “country of birth”.

Natural verbalizations: Whenever possible, to the extent that no ambiguities are introduced, the question should refrain from directly using entity and property names, instead using a more natural way to express the meaning. For instance, instead of asking “what is the point of time of the goal?”, a more natural question with the same level of accuracy like “when does the goal take place?” should be used.

Accurately capturing OPTIONAL clauses and projections: In SPARQL, OPTIONAL clauses are used to include additional data that may or may not exist for queried items without excluding results lacking this optional data (https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial#OPTIONAL). The verbalization should specify which fields are optional with clauses such as “if available”. Similarly, the verbalization should accurately capture what is requested in the projections, using phrases like: “For each result, return their name and location” if necessary.
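To see why an “if available” phrasing matters, the effect of an OPTIONAL clause can be imitated on toy data as a left join, in contrast to a plain join that drops items without the extra field. The sketch below is a simplification, not SPARQL semantics in full: it assumes at most one optional match per item, and the helper names and sample rows are hypothetical.

```python
def left_join(required, optional, key):
    """Imitate OPTIONAL: keep every required row, add optional fields when present."""
    index = {row[key]: row for row in optional}
    return [{**row, **index.get(row[key], {})} for row in required]


def inner_join(required, optional, key):
    """Imitate a mandatory clause: rows without a match are dropped."""
    index = {row[key]: row for row in optional}
    return [{**row, **index[row[key]]} for row in required if row[key] in index]


# Toy data: two items, only one of which has an image.
items = [{"item": "Q1"}, {"item": "Q2"}]
images = [{"item": "Q1", "image": "a.jpg"}]

with_optional = left_join(items, images, key="item")      # both items survive
without_optional = inner_join(items, images, key="item")  # only Q1 survives
```

A question verbalized without “if available” would correspond to the second result and would silently change the gold answer set.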

As illustrated in Table 1, the Spinach dataset is the first dataset with both natural and complex logical forms, which represents the real-world KBQA needs of Wikidata users.

4 The Spinach Agent

The Spinach agent is an LLM-augmented knowledge graph exploration agent (Section 2.2), where an LLM is instructed to explore the knowledge graph, until an appropriate answer is found or another stop condition is met. However, unlike prior work, we design Spinach with the primary goal of mimicking a human expert writing a SPARQL query. An expert starts by writing simple queries and looking up Wikidata entity or property pages when needed, all to understand the structure of the knowledge graph and what connections exist. This is especially important for Wikidata due to its anomalous structure (Shenoy et al., 2022). An expert then might add new SPARQL clauses to build towards the final SPARQL, checking her work along the way by executing intermediate queries and eyeballing the results. Potentially, she might go back to the drawing board and double-check her assumptions if a query fails.

Following this design principle, Spinach improves upon several aspects of previous agent-based KBQA systems:

First, Spinach uses the full expressiveness of SPARQL for exploration. This contrasts with Sun et al. (2024a); Xiong et al. (2024); Sun et al. (2024b), which explore the knowledge graph one edge at a time. As a result, Spinach weaves the exploration and writing of SPARQL together. It is instructed to try many SPARQLs and fail early. It learns about the structure of the knowledge graph from the results of its queries (or lack thereof), or realizes its incorrect assumptions, and decides to revisit them. This is a key difference from Sun et al. (2024b, a), where the answer is generated only when exploration is done.

Second, during its exploration, Spinach does not keep track of a subgraph as its state. Instead, we define the state to be the full history of actions and their results so far. Limiting the state to a subgraph Sun et al. (2024b, a) means that the agent would categorically not support questions whose answer contains a large number of entities (e.g. “Which actors graduated from Harvard?”) or computation (e.g. “What is the tallest mountain?”).

Third, while most KBQA agents start their exploration from entities they detect from the question, Spinach does not assume access to entities. This, for example, gives the agent the flexibility to start by searching for relevant relations if that is a more suitable way to explore. For example, when faced with the question “What are the items listed on the National Heritage List for England?”, it can find the designated property “National Heritage List for England number” (P1216) for this purpose instead of trying to use the entity “National Heritage List for England” (Q6973052), which makes the process much simpler.

Concretely, the Spinach agent runs for $N$ time steps. At time step $n$, it takes the current state as input and outputs a thought $t_n$ (Yao et al., 2023) and one of the possible actions $a_n$ from the set of all possible actions. The action is then executed outside of the agent, and the results are returned to the agent as observation $o_n$. In the rest of this section, we go into the details of the Spinach agent. Figure 2 shows an example of how Spinach answers a question.

4.1 State

The state at time step $n$ is $\mathcal{S}_n = \{t_1, a_1, o_1, \ldots, t_{n-1}, a_{n-1}, o_{n-1}\}$, that is, the list of all thoughts, actions, and observations up to and including the previous time step. $\mathcal{S}_1$ is set to $\emptyset$.
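One way to picture this state is as an append-only list of (thought, action, observation) triples that is rendered into text for the LLM. The sketch below is an assumption about a possible representation; the class names, the `render` format, and the string-typed fields are hypothetical and not taken from the released implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Step:
    """One time step: the thought, the chosen action, and its observation."""
    thought: str
    action: str
    observation: str


@dataclass
class State:
    """Full history of steps so far; an empty list plays the role of the initial state."""
    steps: List[Step] = field(default_factory=list)

    def append(self, thought: str, action: str, observation: str) -> None:
        self.steps.append(Step(thought, action, observation))

    def render(self) -> str:
        # Flatten the history into text that could be placed in a prompt.
        lines = []
        for i, step in enumerate(self.steps, start=1):
            lines.append(f"Thought {i}: {step.thought}")
            lines.append(f"Action {i}: {step.action}")
            lines.append(f"Observation {i}: {step.observation}")
        return "\n".join(lines)
```

Because the state is a history rather than a subgraph, nothing in it limits how many entities the final answer may contain.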

Complex KBQA tasks require high reasoning ability. One promising approach to improve the reasoning capabilities of LLMs is ReAct prompting Yao et al. (2023), in which the LLM is instructed to output its reasoning trace in natural language first, before selecting an action. We require the agent to output a thought at the beginning of each time step.

At time step $n$, the Spinach agent takes $a_n$, one of the following five actions, and receives the result of the execution as observation $o_n$. These actions are selected for their similarity to what a human expert can do.

search_wikidata(string) is equivalent to using the text search box at wikidata.org. This action searches Wikidata for items (entities or properties) that match a given string using the wbsearchentities module of the Wikidata API (https://www.wikidata.org/w/api.php). This API finds matches using human-readable labels and aliases. The main use for this action is to find relevant QIDs and PIDs.

get_wikidata_entry(QID) is equivalent to visiting the Wikidata page for an entity, e.g. https://www.wikidata.org/wiki/Q219563. This action retrieves all outgoing edges, i.e. linked entities, properties, and qualifiers of a specified Wikidata entity using its QID.

get_property_examples(PID) provides a few examples demonstrating the use of the specified property in Wikidata. The main use of this action is to understand how a property is used in Wikidata, if it is not clear from its label alone.

execute_sparql(SPARQL) is equivalent to using the Wikidata Query Service at https://query.wikidata.org/. This action runs a SPARQL query on Wikidata and returns its results.

stop() marks the last executed SPARQL query as the final answer and ends the process.
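As an illustration of the first action, the request behind search_wikidata could be assembled as below. The sketch only builds the URL and does not call the network. The function name, the default values, and the choice of English as the search language are assumptions; the query parameters themselves (`action`, `search`, `language`, `type`, `limit`, `format`) are standard wbsearchentities parameters.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(text, kind="item", limit=8, language="en"):
    """Build a wbsearchentities request URL (hypothetical helper, no network call)."""
    params = {
        "action": "wbsearchentities",
        "search": text,        # free-text string matched against labels and aliases
        "language": language,  # language of the labels to search in
        "type": kind,          # "item" for QIDs, "property" for PIDs
        "limit": limit,        # maximum number of hits to return
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


# An agent would typically issue one request per type to collect both QIDs and PIDs.
entity_url = build_search_url("Leonhard Euler", kind="item")
property_url = build_search_url("doctoral advisor", kind="property", limit=4)
```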

Importantly, the results of the execution of each action are put in a human-readable format to make it easier for the LLM to process. To limit the amount of information that the agent has to process, we limit the output of search results to at most 8 entities and 4 properties, and limit large results of SPARQL queries to the first and last 5 rows. The outputs of get_wikidata_entry() are pruned using a separate few-shot prompt (Prompt  5) to only keep entries that are relevant to the question.
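The truncation of large SPARQL results can be sketched as follows. The first-and-last-5-rows rule is from the text; the function name and the wording of the placeholder line are assumptions.

```python
def truncate_rows(rows, head=5, tail=5):
    """Keep the first `head` and last `tail` rows, replacing the middle with a marker."""
    if len(rows) <= head + tail:
        return list(rows)  # small results are shown in full
    omitted = len(rows) - head - tail
    marker = f"... ({omitted} rows omitted) ..."
    # Slice from an explicit index so that tail=0 keeps no trailing rows.
    return list(rows[:head]) + [marker] + list(rows[len(rows) - tail:])
```

Showing both ends of a result lets the agent see the extremes of an ordered result, for example the smallest and largest values after an ORDER BY, without paying for every row in the prompt.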

Figure 2: The sequence of 13 actions that the Spinach agent takes to answer a sample question from the Spinach validation set. Here, the agent goes through several distinct phases, only with the high-level instruction in Section 4.2. Note that every step includes a thought, action and observation, but some are omitted here for brevity. Full version available in Listing 1.

4.2 Policy

The policy of Spinach is implemented using a zero-shot prompt that only includes high-level instructions such as “start by constructing very simple queries and gradually build towards the complete query” and “confirm all your assumptions about the structure of Wikidata before proceeding” (Prompt 4). The decision of which action to select at each time step is left to the LLM. We made this choice after experimenting with more restrictive policies: we empirically observe that trying to account for all the edge cases the agent might encounter confuses the agent and hurts performance.

The agent loop constructs the state and uses this policy prompt to ask the LLM to output a thought and an action. It then executes the action and adds the observation to the state. The agent loop continues until either the stop() action is selected or it reaches the predefined maximum number of 30 actions.

In practice, the LLM can occasionally fall into a cycle of repeating the same action and argument over and over again, or deviate from its instructions and call stop() even though the last SPARQL execution did not return any results. To remedy this, the agent loop checks for these two situations and resets the state to before the first time the repeated action was performed, or to the beginning, respectively, and continues from there. Note that the policy LLM is run using sampling with a high temperature of 1.0 to encourage exploration, so rerunning from a previous state does not repeat the exact same sequence of actions.
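The loop and its two recovery rules can be sketched as below. This is a minimal illustration under several assumptions: `policy` and `execute` are hypothetical callables standing in for the LLM call and the action executor, a "cycle" is detected once the same action and argument already appear `repeat_limit` times in the state, and every policy call counts toward the 30-action budget (including calls discarded by a reset) so that the loop always terminates.

```python
from typing import Callable, List, Optional, Tuple

# (thought, action name, argument, observation rows)
Step = Tuple[str, str, str, list]


def run_agent(
    policy: Callable[[List[Step]], Tuple[str, str, str]],
    execute: Callable[[str, str], list],
    max_actions: int = 30,
    repeat_limit: int = 2,
) -> Optional[str]:
    """Return the final SPARQL query, or None if the action budget runs out."""
    state: List[Step] = []
    for _ in range(max_actions):
        thought, name, arg = policy(state)

        if name == "stop":
            sparql_steps = [s for s in state if s[1] == "execute_sparql"]
            if sparql_steps and sparql_steps[-1][3]:
                return sparql_steps[-1][2]  # last executed query is the answer
            state = []  # premature stop: restart from the beginning
            continue

        repeats = [i for i, s in enumerate(state) if (s[1], s[2]) == (name, arg)]
        if len(repeats) >= repeat_limit:
            state = state[: repeats[0]]  # rewind to before the first occurrence
            continue

        observation = execute(name, arg)
        state.append((thought, name, arg, observation))
    return None
```

Since the policy is sampled at a high temperature, calling it again on a rewound state is expected to produce a different continuation; a deterministic policy would simply exhaust the budget.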

In the example in Figure 2, Spinach goes through the following four phases, just by following the high-level instructions we provided: it (1) looks up relevant PIDs and QIDs; (2) attempts to write the SPARQL in one go, which fails due to a hallucinated QID that the agent did not previously check; (3) starts the debugging process by executing simpler SPARQL queries and double-checking its assumptions, realizing its mistake after looking at the result of a get_wikidata_entry call; and finally (4) moves on to write the correct SPARQL query.

5 Experiments

5.1 Evaluation Metrics

Prior works mostly use two metrics: Exact Match (EM) and F1 (Yih et al., 2016; Usbeck et al., 2023). As shown in Table 1, the Spinach dataset contains 2.50 projections on average for each query, whereas prior datasets all contain only one field in the projection. This means that naively applying EM and F1 is not possible; therefore, we propose a generalization of these metrics.

Concretely, during evaluation, for each data point $x$, a gold result $y$ and a predicted result $y'$ are given, where:

$$y, y' \in \{[y_1, y_2, \cdots]\} \cup \{\text{True}, \text{False}\}$$

each $y_i$ being either an entity (e.g. ‘Q123’) or a literal (e.g. ‘123’). For each data point, EM is calculated to be 1 if $y = y'$ and 0 otherwise. If both results are lists, F1 is calculated as the harmonic mean of precision and recall:

$$F_1(y, y') = \frac{2\,\textit{tp}}{2\,\textit{tp} + \textit{fp} + \textit{fn}} \tag{1}$$

where tp denotes the number of true positives ($|y \cap y'|$), fp denotes the number of false positives ($|y' - y|$), and fn denotes the number of false negatives ($|y - y'|$). If one of $y, y'$ is a boolean, then $F_1(y, y')$ is set to 1 if $y = y'$ and 0 otherwise (Usbeck et al., 2023).
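The traditional single-projection F1 defined above can be written out as a short function. The function name, the treatment of lists as sets, and returning 0 when both lists are empty are assumptions made for this sketch.

```python
def f1_score(gold, pred):
    """Traditional F1 between two answer lists, or between boolean answers."""
    # If either side is a boolean, F1 is 1 only when the two answers are equal.
    if isinstance(gold, bool) or isinstance(pred, bool):
        return 1.0 if gold == pred else 0.0

    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # answers present in both
    fp = len(pred_set - gold_set)  # predicted but not in gold
    fn = len(gold_set - pred_set)  # in gold but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

For example, with gold answers `["Q1", "Q2"]` and predictions `["Q2", "Q3"]`, there is one true positive, one false positive, and one false negative, giving 2 / (2 + 1 + 1) = 0.5.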

In prior datasets, each $y_1, y_2, \cdots$ is only an entity or literal, but this is not necessarily the case in real-life SPARQLs. For instance, in Figure 1(b), the user is interested in a tuple of the doctoral advisee and advisor for each result, where each $y_i$ is a list of entities and literals. In other words, the results are in the general case a matrix, not a list. Every row in the matrix is one result, and each column corresponds to one projection in the query.

From the user’s perspective, a good system could opt to return additional information when it helps clarify or enrich the results. For instance, for the query “what is the county with most people in South Dakota?”, an LLM-based system can choose to return the top county along with its population, even though the gold SPARQL might only include the county in projection. Penalizing the system for the additional projection columns is not meaningful in this case. To reflect this, we introduce a row-major generalization of EM and F1 to handle matrix-wise comparisons in real-life KBQA tasks, where each row is handled similarly to the traditional F1 score, and extra columns are not penalized.

Concretely, given two matrices of the form:

$$\mathbf{y} = \begin{pmatrix} y_{11} & \cdots & y_{1m} \\ \vdots & \ddots & \vdots \\ y_{n1} & \cdots & y_{nm} \end{pmatrix} \quad \mathbf{y'} = \begin{pmatrix} y'_{11} & \cdots & y'_{1m'} \\ \vdots & \ddots & \vdots \\ y'_{n'1} & \cdots & y'_{n'm'} \end{pmatrix}$$

we calculate the recall between a gold row $y_i = (y_{i1}, \cdots, y_{im})$ and a predicted row $y'_{i'} = (y'_{i'1}, \cdots, y'_{i'm'})$ as

$$\text{recall}(y_i, y'_{i'}) = \frac{|y_i \cap y'_{i'}|}{|y_i|}$$

An assignment algorithm (SciPy, 2024) is then run between the two matrices to find the row-by-row matching with the highest cumulative recall score (without matching rows with 0 recall). Each matched row pair $(y_i, y'_{i'})$ adds $\text{recall}(y_i, y'_{i'})$ to the true positives and $1 - \text{recall}(y_i, y'_{i'})$ to the false negatives. Each unmatched gold row counts as 1 false negative, and each unmatched prediction row counts as 1 false positive. The final $F_1$ is then calculated according to Equation 1, and EM is defined to be 1 if the row-major $F_1$ score is 1 and 0 otherwise. Note that with this definition, EM and $F_1$ are exactly equal to EM and $F_1$ in the traditional sense when there is only one projection.
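The row-major metric can be sketched as follows. To keep the example free of dependencies, the assignment step is done here by exhaustive search over row pairings, which is only practical for tiny inputs; this is a swapped-in technique, and a solver such as `scipy.optimize.linear_sum_assignment` would be the practical choice. Treating each row as a set, assuming gold rows are non-empty, and returning 1 when both matrices are empty are further assumptions, and all function names are hypothetical.

```python
from itertools import permutations


def row_recall(gold_row, pred_row):
    """Fraction of the gold row's values found in the predicted row."""
    gold_set = set(gold_row)  # assumed non-empty
    return len(gold_set & set(pred_row)) / len(gold_set)


def best_matching(gold, pred):
    """Exhaustively find the row pairing with the highest total recall."""
    n, m = len(gold), len(pred)
    if n <= m:
        candidates = ([(i, j) for i, j in enumerate(p)] for p in permutations(range(m), n))
    else:
        candidates = ([(i, j) for j, i in enumerate(p)] for p in permutations(range(n), m))
    best_pairs, best_total = [], -1.0
    for pairs in candidates:
        total = sum(row_recall(gold[i], pred[j]) for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    # Pairs with zero recall do not count as matches.
    return [(i, j) for i, j in best_pairs if row_recall(gold[i], pred[j]) > 0]


def row_major_f1(gold, pred):
    matched = best_matching(gold, pred)
    tp = sum(row_recall(gold[i], pred[j]) for i, j in matched)
    fn = sum(1 - row_recall(gold[i], pred[j]) for i, j in matched)
    fn += len(gold) - len(matched)  # unmatched gold rows
    fp = len(pred) - len(matched)   # unmatched predicted rows
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def row_major_em(gold, pred):
    return 1 if row_major_f1(gold, pred) == 1.0 else 0
```

Because recall is computed from the gold row's point of view, a predicted row with an extra column, such as a county together with its population, still scores a recall of 1 against a gold row that contains only the county.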

5.2 Spinach Agent on Prior Datasets

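| System | QALD-7 (Task 4) Test EM | QALD-7 (Task 4) Test F1 | QALD-9 Plus (en) Test EM | QALD-9 Plus (en) Test F1 | QALD-10 (en) Full Set EM | QALD-10 (en) Full Set F1 | QALD-10 (en) Subset in ToG EM | QALD-10 (en) Subset in ToG F1 | WikiWebQuestions Dev EM | WikiWebQuestions Dev F1 | WikiWebQuestions Test EM | WikiWebQuestions Test F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| STAGG (Yih et al., 2016) | - | 19.0 | - | - | - | - | - | - | - | - | - | - |
| GGNN (Sorokin and Gurevych, 2018) | - | 21.3 | - | - | - | - | - | - | - | - | - | - |
| LingTeQA (To and Reformat, 2020) | - | 34.0 | - | - | - | - | - | - | - | - | - | - |
| Baramiia et al. (2022) | - | - | - | - | - | 42.8 | - | - | - | - | - | - |
| Shivashankar et al. (2022) | - | - | - | - | - | 49.1 | - | - | - | - | - | - |
| QAnswer (Diefenbach et al., 2017a) | - | 40.0 | - | 44.6 | - | 57.8 | - | - | - | - | - | - |
| SPARQL-QA (Borroto et al., 2022) | - | - | - | - | - | 59.5 | - | - | - | - | - | - |
| Liu et al. (2024) | - | - | - | - | 56.5 | - | - | - | - | - | - | - |
| 0-shot ToG (GPT-4) (Sun et al., 2024a) | - | - | - | - | - | - | 54.7 | - | - | - | - | - |
| Fine-tuned WikiSP (Xu et al., 2023) | 38.0 | 43.6 | - | - | - | - | - | - | 75.6 | 76.9 | 65.5 | 71.9 |
| 0-shot Spinach agent (GPT-4o) (Ours) | 62.2 | 74.6 | 58.3 | 71.6 | 63.1 | 69.5 | 64.7 | 72.4 | 61.2 | 72.3 | 59.9 | 70.3 |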
Table 2: Performance of the Spinach agent and prior works on 4 prior datasets. Sun et al. (2024a) only evaluated on the subset of non-boolean questions of QALD-10, which we denote as “Subset in ToG”. Spinach achieves the new SOTA on QALD-7, QALD-9 Plus, and QALD-10. On WikiWebQuestions, it comes within 1.6% F1 of the SOTA WikiSP model fine-tuned on the dataset.

We evaluate our approach on four previous Wikidata datasets. We experiment with three QALD challenges that have annotated Wikidata SPARQLs: QALD-7 (Task 4 for Wikidata) (Usbeck et al., 2017), QALD-9 Plus (Perevalov et al., 2022), and the English subset of QALD-10 (Usbeck et al., 2023). In addition, we evaluate on WikiWebQuestions (Xu et al., 2023), the Wikidata version of the popular WebQuestionsSP dataset.

As shown in Table 2, the Spinach agent achieves a new SOTA on the three QALD datasets and comes within 1.6% F1 of the fine-tuned SOTA (WikiSP) on WikiWebQuestions. In particular, our 0-shot agent outperforms WikiSP by 24.2% EM and 31.0% F1 on QALD-7 (Task 4), showing its strong cross-dataset generalization compared to a fine-tuned model. Our approach also outperforms the GPT-4-based ToG (Sun et al., 2024a) by 10.0% EM on the same subset of QALD-10 used for evaluation in its paper.

5.3 Prior Work on the Spinach Dataset

| System | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|
| Direct GPT-4o Question Answering | 0.0 | 3.9 | 0.0 | 4.0 |
| GPT-4o Generating SPARQL | 1.3 | 5.4 | 0.6 | 3.9 |
| Fine-tuned WikiSP (Xu et al., 2023) | 1.3 | 3.5 | 1.2 | 7.1 |
| 0-shot ToG (GPT-4) (Sun et al., 2024a) | 3.9 | 9.8 | 1.8 | 7.2 |
| 0-shot Spinach agent (GPT-4o) (Ours) | 21.4 | 46.4 | 16.4 | 45.3 |
Table 3: Evaluation of the Spinach agent and prior works on the Spinach dataset.

We benchmark the performance of four baseline systems on the Spinach dataset. These baselines are: (1) directly asking GPT-4o to answer the question without writing SPARQL, (2) directly asking GPT-4o to write a SPARQL query, (3) WikiSP (Xu et al., 2023), a fine-tuned LLaMA (7B-parameter) model, and (4) the GPT-4-based ToG agent (Sun et al., 2024a).

WikiSP expects a Named Entity Disambiguation (NED) module to predict the relevant entities given a question. The original NED module was based on a fine-tuned version of ReFinED (Ayoola et al., 2022). To make the task easier for WikiSP, we directly use the gold entities that appear in the annotated SPARQL as inputs. Similarly, the ToG (Sun et al., 2024a) pipeline uses an LLM to extract “topic entities” from the input question as the starting point of its knowledge graph exploration. We observed that, when given the gold entities as topic entities, the model was unable to return any exploration paths for the first 80 questions in the validation set and simply defaulted to GPT-4. To make the task even easier for ToG, we sample up to 20 entities from the results of the gold SPARQL queries as the topic entities. Additional details on our setup and comparison of baseline systems can be found in Appendix C.

As shown in Table 3, the Spinach agent drastically outperforms all baselines. The median number of actions Spinach takes is 8. In particular, we highlight that the low accuracy of the GPT-4o QA system shows that this dataset mainly consists of long-tail knowledge unfamiliar to LLMs. For reference, GPT-4 was able to achieve 90.5% EM on WebQuestionsSP (Tan et al., 2023). Although Spinach achieves the SOTA on this dataset, its lower performance relative to previous datasets indicates that there is a lot of room for improving KBQA systems in future work.

5.4 Error Analysis

We randomly sample 20 cases on the Spinach dataset where the Spinach agent achieves less than 0.05 F1 and conduct an error analysis. We observe that:

  • Property-related problems: 40% of errors are due to problems with properties. This includes cases when the Spinach agent fails to fetch the correct property or incorrectly uses a property (e.g. using it as a wdt: relation as opposed to the correct ps: or pq: qualifier).

  • Complicated SPARQL: 30% of errors are due to failure of the Spinach agent to write complex SPARQL to fetch results, e.g., a complicated filter for finding people born in a specific month. [Footnote 7: An interesting observation is that in 2 cases, GPT-4o ended the reasoning with “I will fetch all related records and proceed to filter the results in Python” after observing its previous SPARQL returned no results, suggesting there may be benefits to leveraging programming languages that are more familiar to LLMs.]

  • Not enough exploration: 15% of errors occur because the LLM reaches the maximum allowed number of actions before it has explored enough. In these cases, the LLM has determined the infeasibility of its prior approaches and is still actively exploring new methods when the limit is hit.

  • Inaccurate semantic parsing: 10% of errors are due to the LLM injecting a clause that should not exist. For instance, when asked “what items were published in …”, the LLM assumes that the results have to be an instance of the domain entity “book”, but there are many other types of results returned by the gold query (e.g. “printed matter” and “legal act”).

  • Formatting issues: 5% of errors are due to format errors. This one case is due to the LLM returning the date in full (“June 23 2021”) even though the question only asked for the year (“2021”).

6 Conclusion

We propose the expert-annotated Spinach dataset drawn from real-world queries and introduce the Spinach agent that mimics how a human expert writes SPARQL queries. Experiments on prior datasets show that the Spinach agent achieves the new SOTA on 3 QALD datasets and comes within 1.6% F1 of the fine-tuned SOTA model on WikiWebQuestions. While the agent achieves a SOTA result on the new Spinach dataset, its relatively low performance there suggests ample opportunities for enhancing KBQA systems in future research.

Ethical Considerations

We release the Spinach dataset in this paper. This dataset is built using publicly available data in the Wikidata “Request a Query” forum, which is licensed under the Creative Commons CC0 License. We do not expect any harm to result from the usage of this dataset. The generation and annotation of the dataset were done by the authors of this paper.

No GPU-based training was carried out in our experiments, as we mainly used the OpenAI API to call the GPT-4o model. To evaluate the LLaMA baseline from Xu et al. (2023), we used a Linux server with one NVIDIA A100 GPU for <1 hour.

To facilitate further research, we release our code and data for the Spinach agent and the Spinach dataset. The Spinach agent is released under Apache License, version 2.0. The Spinach dataset, derived from the Wikidata Request a Query forum, is released under the CC BY-SA 4.0 license, the same license that covers the forum.

Limitations

Since the Spinach agent makes multiple LLM calls for each question, its latency and cost are higher than those of simpler systems. Other multi-stage KBQA pipelines with LLMs, such as Sun et al. (2024a), suffer from the same limitations. This seems to be the price for a more accurate KBQA system. However, model distillation has recently been shown to be effective in improving model efficiency (Semnani et al., 2023). We leave the exploration of this direction for future work.

As observed in recent works (Kamalloo et al., 2023; Zhang et al., 2024), automatic metrics such as F1 and EM often fail to accurately measure the capabilities of LLM-based systems. By verbalizing the SPARQL queries as accurately as possible, we aim to minimize the difference between automatic metrics and manual inspection. We thus expect these automatic metrics to still play a pivotal role in the evaluation of future work on the Spinach dataset.

Annotated datasets can also suffer from annotation errors and mistakes (Zang et al., 2020). We address this limitation by having experts annotate the dataset and independently cross-validate each other’s annotations, as discussed in Appendix B.1.

Acknowledgements

This work is supported in part by the Verdant Foundation, Microsoft Azure AI credit, KDDI, and the Stanford Human-Centered Artificial Intelligence (HAI) Institute.

References

Appendix A Definition of metrics used in Table 1

For each query, we define the number of clauses as the number of atomic nodes of a SPARQL abstract syntax tree (AST), where an atomic node is defined as one of: a projection clause (i.e., SELECT ... WHERE), a single subject-relation-object clause, a GROUP BY clause, a HAVING clause, a filter clause, a sorting clause, a MINUS clause, and a join clause. The number of projections is defined as the number of fields in the final SELECT ... WHERE clause. The number of relations is defined as the number of subject-relation-object clauses. The number of subjects is defined as the number of unique subjects (either variables or entities) in the subject-relation-object clauses. The number of predicates is defined as the number of unique properties (i.e., PIDs of the form P...). The number of objects is defined as the number of unique objects (either variables or entities) in the subject-relation-object clauses, where we also count the number of y occurring in each VALUES clause of the form VALUES ?x {y}. The number of literals is defined as the number of unique strings (e.g. "wikidata") or numbers (e.g. 123.45).

Similar to prior work (Gu et al., 2021), we use hand-crafted rules (including regular expressions) to retrieve these metrics.
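As a rough illustration of what such regular-expression rules might look like, the sketch below counts two of the metrics defined above: unique predicates (PIDs) and unique string literals. The patterns and function names are our own assumptions for illustration and are not the rules used to produce Table 1. In particular, the PID pattern would also count a PID that appears inside a quoted string.

```python
import re

# Assumed pattern: a Wikidata property ID is "P" followed by digits, e.g. wdt:P69.
PID_PATTERN = re.compile(r"\bP\d+\b")

# Assumed pattern: a double-quoted string literal, allowing escaped characters.
STRING_LITERAL_PATTERN = re.compile(r'"(?:[^"\\]|\\.)*"')


def count_predicates(sparql: str) -> int:
    """Number of unique PIDs mentioned anywhere in the query text."""
    return len(set(PID_PATTERN.findall(sparql)))


def count_string_literals(sparql: str) -> int:
    """Number of unique double-quoted string literals in the query text."""
    return len(set(STRING_LITERAL_PATTERN.findall(sparql)))


# Example query taken from the agent trace in Appendix E.
EXAMPLE_QUERY = """
SELECT ?instrument ?instrumentLabel (COUNT(?student) AS ?count) WHERE {
  ?student wdt:P1303 ?instrument;
           wdt:P1416 wd:Q98035717;
           wdt:P69 wd:Q219563.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?instrument ?instrumentLabel
"""

print(count_predicates(EXAMPLE_QUERY), count_string_literals(EXAMPLE_QUERY))
```

Metrics that depend on clause structure, such as the number of relations or projections, need more than token-level patterns and are not covered by this sketch.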

Appendix B Additional details on dataset collection

B.1 Additional Details on Annotation

Expert annotations: The Spinach dataset is annotated by 3 authors of this paper who are experts with extensive knowledge of SPARQL and Wikidata. The 3 experts first engaged in a long discussion with specific examples to standardize the annotation procedure. Two experts then annotated the dev and test sets, with the third expert double-checking and validating the annotations.

LLM suggestions: To facilitate the annotation process, we used GPT-4o to generate preliminary annotation suggestions for the experts. For each example, the experts are shown (1) the original SPARQL query with properties and entities substituted with labels, (2) a link to the specific forum discussion, (3) two LLM-suggested verbalizations (one more verbose and one more natural), and (4) an LLM suggestion of whether to modify, include, or exclude the query, with its reasoning. For each data example, the experts executed numerous intermediate SPARQL queries on the Wikidata site to verify the query’s validity and finalize the annotated SPARQL.

Automatic exclusion of mwapi in queries: Some SPARQL queries on the forum make use of mwapi [Footnote 8: https://en.wikibooks.org/wiki/SPARQL/SERVICE_-_mwapi]. These APIs are mostly used for optimization or are otherwise very specific to the structure of Wikimedia sites. We note that removing these APIs sometimes leads to small changes in the results (most likely because the API outputs are out of date relative to the Wikidata-direct outputs). To standardize the dataset and avoid issues during evaluation, we use regex to delete all occurrences of mwapi calls in the SPARQLs before executing them.
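The paper does not give the regular expression used for this step, so the following is only a sketch of one way it could be done. It assumes that an mwapi call appears as a `SERVICE wikibase:mwapi { ... }` block and that the block body contains no nested braces. The names `MWAPI_BLOCK` and `strip_mwapi` are hypothetical.

```python
import re

# Assumption: mwapi usage takes the form `SERVICE wikibase:mwapi { ... }`
# and the braces of that block are not nested.
MWAPI_BLOCK = re.compile(r"SERVICE\s+wikibase:mwapi\s*\{[^{}]*\}", re.IGNORECASE)


def strip_mwapi(sparql: str) -> str:
    """Return the query text with every mwapi SERVICE block removed."""
    return MWAPI_BLOCK.sub("", sparql)


# Illustrative query (not taken from the dataset).
example = """SELECT ?item WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "Search" .
  }
  ?item wdt:P31 wd:Q5 .
}"""

print(strip_mwapi(example))
```

A query whose remaining clauses depend on variables bound only inside the removed block would need manual review, which this sketch does not attempt.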

B.2 Examples of queries modified

The following are examples of how we modify target SPARQLs.

Wikimedia presentation queries: This discussion, titled “Two-letter genera, on Wikispecies” in July of 2020, involves the following snippet, which retrieves information from species.wikimedia.org for certain taxa:

?wikispecies schema:about ?item .
?wikispecies schema:isPartOf <https://species.wikimedia.org/> .

This, however, is very specific to the structure of different Wikimedia sites. Removing these two clauses does not modify the core parts of the meaning. We thus remove these two clauses and the corresponding target ?wikispecies in the projection. The resulting SPARQL is then included in the Spinach validation set.

Queries obscured by optimizations: This discussion, titled “Persons from a certain time period and country” in September of 2020, contains a query that makes use of the following optimization:

hint:Prior hint:rangeSafe true .

which is used to speed up the succeeding filter. Removing this optimization in this case does not significantly increase SPARQL execution time. We thus remove this clause and include the example in the Spinach validation set.

Formatting clauses: This discussion, titled “Query all taxons which are trees”, results in a SPARQL of the following structure:

SELECT
?taxon ?sample (GROUP_CONCAT(DISTINCT str(?commonname); separator = "//") as ?commonnames)
WHERE
{
...
}
GROUP BY ?taxon ?sample

Using a GROUP BY, the query concatenates all common names into one single string, separated by "//". However, it is very difficult to accurately capture this in natural language, and a system could return the results in different orders, raising issues for evaluation. We thus exclude this concatenation and instead change it to a counting operation:

SELECT
?taxon ?sample (COUNT(?commonname) as ?commonnamecount)
WHERE
{
...
}
GROUP BY ?taxon ?sample

The resulting SPARQL is then included in the Spinach validation set.

B.3 Examples of queries excluded

The following are examples of when, after modifications, no meaningful part of the SPARQL remains. The discussions are thus excluded from our dataset.

Wikimedia presentation queries: The discussion, titled “?item =wd:QXXX” in April of 2020, results in a SPARQL that is only trying to fetch Wikipedia pages that are about the item “Gambling, Gods And LSD”. The usage of the schema:about property is very specific to Wikimedia and could raise confusion about the meaning of “about”. It is thus excluded. The discussion, titled “Bengali Wikipedia articles with no Wikidata statement” in March of 2018, results in a SPARQL trying to fetch Wikidata items with no statements that are the topic of Bengali Wikipedia articles. The wikibase:statements, schema:about, schema:isPartOf, and wikibase:sitelinks properties are all very specific to the structure of Wikimedia sites and can create confusion for KBQA systems. It is thus excluded.

Questions on complex SPARQL code: This discussion, titled “Scatterplot query” in March of 2021, is an example where the original requester comes in with a complicated SPARQL query whose meaning is already difficult to accurately express in English. The final SPARQL from the conversation (hyperlinked in “a bit more baroque, with axes”) only adds to the complexity.

Queries obscured by optimizations: This discussion, titled “Section “P31wdt:P279 wd:Q16917 in wdt:P131wdt:P131 wd:Q25 (query optimization)”, involves only refactoring a query into two sub-queries, which are then joined together to avoid timeouts. The two queries are semantically equivalent. Due to reproducibility challenges from this refactoring, this discussion is excluded.

Formatting clauses: The discussion, titled “preferred format for id” in February of 2022, only contains discussions on different ways of processing strings in SPARQL and is thus excluded. The discussion, titled “Custom link formatting in WDQS results” in June of 2018, results only in a SPARQL that focuses on converting string formats after minimal use of Wikidata (fetching the sitelinks of one item) and is thus excluded.

Appendix C System and Evaluation Details

Unless otherwise specified, LLMs are used with greedy decoding, i.e. with temperature = 0, with the exception of the Spinach agent’s policy prompt, which is run with temperature = 1 and nucleus sampling (Holtzman et al., 2019) with p = 0.9.

The ToG agent by default builds a local version of Wikidata using the simple-wikidata-db library [Footnote 9: https://github.com/neelguha/simple-wikidata-db]. However, building this index from scratch is extremely computationally expensive. Sun et al. (2024b) report that they need to deploy the Wikidata dump across five AWS EC2 instances, each consisting of a 768GB machine with 48 cores. Fahl et al. (2022) reported various other methods of hosting Wikidata locally. We attempted to use QLever as the SPARQL engine but failed to do so on the latest Wikidata dump. For the evaluation of ToG, we re-implemented the same logic using dynamic Wikidata API calls.

We adopted the same hyperparameters for ToG as used in the original paper. For the experiment in Section 5.3, we used D=3 and T=3 as the hyperparameters for graph exploration.

The GPT-4 and ToG systems return results as entity label strings instead of entity IDs; for these systems, we convert the gold SPARQL output to contain only English labels before calculating the EM and F1 metrics.

Appendix D Prompts used in the Spinach system

# instruction
Your task is to write a Wikidata SPARQL query to answer the given question. Follow a step-by-step process:
1. Start by constructing very simple fragments of the SPARQL query.
2. Execute each fragment to verify its correctness. Adjust as needed based on your the observations.
3. Confirm all your assumptions about the structure of Wikidata before proceeding.
4. Gradually build the complete SPARQL query by adding one piece at a time.
5. Do NOT repeat the same action, as the results will be the same.
6. The question is guaranteed to have an answer in Wikidata, so continue until you find it.
7. If the user is asking a True/False question with only one answer, use ASK WHERE to fetch a True/False answer at the very end.
8. In the final SPARQL projections, do not only ask for labels. Ask for the actual entities whenever needed (e.g. instead of doing SELECT xLabel‘, do SELECT x‘).
9. If the final result was contained in last rounds get_wikidata_entry and you are ready to stop, use execute_sparql and generate a SPARQL to retrieve that results.
Form exactly one "Thought" and perform exactly one "Action", then wait for the "Observation".
Possible actions are:
- get_wikidata_entry(QID): Retrieves all outgoing edges (linked entities, properties, and qualifiers) of a specified Wikidata entity using its QID.
- search_wikidata(string): Searches Wikidata for entities or properties matching the given string.
- get_property_examples(PID): Provides a few examples demonstrating the use of the specified property (PID) in Wikidata.
- execute_sparql(SPARQL): Runs a SPARQL query on Wikidata and returns a truncated result set for brevity.
- stop(): Marks the last executed SPARQL query as the final answer and ends the process.
# input
Question: {{ question }}
{% if action_history %}
{% for i in range(0, action_history|length) %}
{{ action_history[i] }}
{% endfor %}
{% endif %}
Output one "Thought" and one "Action":
Table 4: The policy prompt of the Spinach agent.
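The Thought/Action/Observation cycle that this prompt drives can be sketched as a simple loop. The code below is an illustrative outline under stated assumptions, not the released implementation: `policy` stands in for the LLM call that renders the prompt and returns one thought and one action string, `tools` maps action names to hypothetical callables, the default `max_actions=10` is an arbitrary placeholder, and returning the last executed query when the budget runs out is our own choice.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Matches an action of the form name(argument), e.g. search_wikidata("piano").
ACTION_PATTERN = re.compile(r"^(\w+)\((.*)\)$", re.DOTALL)


def parse_action(text: str) -> Tuple[str, str]:
    """Split 'name(arg)' into the action name and its unquoted argument."""
    match = ACTION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Unparseable action: {text!r}")
    name, arg = match.groups()
    # Drop surrounding single, double, or triple quotes around the argument.
    return name, arg.strip().strip("\"'").strip()


def run_agent(
    question: str,
    policy: Callable[[str, List[str]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_actions: int = 10,  # placeholder budget, not the paper's setting
) -> Optional[str]:
    """Run one Thought/Action/Observation loop and return the final SPARQL."""
    history: List[str] = []
    last_sparql: Optional[str] = None
    for _ in range(max_actions):
        thought, action = policy(question, history)
        name, arg = parse_action(action)
        if name == "stop":
            # stop() marks the last executed SPARQL query as the final answer.
            return last_sparql
        if name == "execute_sparql":
            last_sparql = arg
        tool = tools.get(name)
        observation = tool(arg) if tool else f"Unknown action: {name}"
        history.append(
            f"Thought: {thought}\n>> {action}\nObservation: {observation}"
        )
    # Assumption: fall back to the last executed query if the budget runs out.
    return last_sparql
```

The `history` list plays the role of `action_history` in the prompt template, so each new policy call sees every earlier thought, action, and observation.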
# instruction
At each turn, you are given a Wikidata entry and a question.
You want to write a SPARQL query that answers the question.
As the first step, remove the parts of the Wikidata entry that could not be potentially helpful when writing the SPARQL.
The output should be a json object containing part of the Wikidata entry.
# few-shot example 1, input
Wikidata entry for OneRepublic (Q1438730, OneRepublic is an American pop rock band formed in Colorado Springs, Colorado, in 2002):
{
"instance of (P31)": "musical group (Q215380)",
...
"social media followers (P8687)": {
"3134158": {
"Qualifiers": [
{
"point in time (P585)": "4 February 2023"
}
]
},
"3276596": {
"Qualifiers": [
{
"point in time (P585)": "6 January 2021"
}
]
},
"3178896": {
"Qualifiers": [
{
"point in time (P585)": "2 March 2022"
}
]
},
"3720919": {
"Qualifiers": [
{
"point in time (P585)": "10 May 2018"
}
]
}
}
}
Question: "What is the title of the second single on OneRepublics third album Native?"
# few-shot example 1, output
{
"instance of (P31)": "musical group (Q215380)",
"record label (P264)": "Columbia Records (Q183387)",
"discography (P358)": "OneRepublic discography (Q935670)",
}
# few-shot example 2, input
Wikidata entry for Barack Obama Sr. (Q649593, Barack Obama Sr. was an economist and the father of Barack Obama jr.):
{
"place of birth (P19)": {
"Nyang’oma Kogelo (Q1027168)": {
"Qualifiers": [
{
"country (P17)": "Kenya Colony"
}
]
}
},
...
}
Question: "where did barack obama sr. attend school?"
# few-shot example 2, output
{
"place of birth (P19)": {
"Nyang’oma Kogelo (Q1027168)": {
"Qualifiers": [
{
"country (P17)": "Kenya Colony"
}
]
}
},
"place of death (P20)": "Nairobi (Q3870)",
...
}
# input
Wikidata entry for "{{ entity_and_description }}":
{{ outgoing_edges }}
Question: "{{ question }}"
Table 5: The 2-shot prompt used to prune the output of the get_wikidata_entry() action. Here we omit parts of the few-shot example inputs for space.

Appendix E Full Example Trace of the Spinach Agent

Listing 1 contains a full sequence of thoughts, actions and observations of the Spinach agent for the given input question.

"""
Question: SteplanI hav
"""
Thought: First, I need to identify the properties and entities related to people affiliated with the University of Washington School of Music, their educational background, and the musical instruments they play. This will help in formulating the final SPARQL query.
>> search_wikidata("University of Washington School of Music")
Observation: University of Washington School of Music (Q98035717): school of music
Thought: Now that I have identified the entity for the University of Washington School of Music, I need to find the properties that can help me identify people affiliated with it and the properties related to their education and musical instruments.
>> get_wikidata_entry("Q98035717")
Observation:
{
instance of (P31): {
faculty (Q180958): {
Qualifiers: [
{
of (P642): University of Washington (Q219563)
}
]
},
music school (Q1021290): {}
},
field of work (P101): [
higher education (Q136822),
postgraduate education (Q141212),
musicology (Q164204),
music (Q638),
research (Q42240),
ethnomusicology (Q208365)
],
located in the administrative territorial entity (P131): Seattle (Q5083),
location (P276): Music Building (Q98690890),
has subsidiary (P355): University of Washington Laboratory for Music Cognition, Culture & Learning (Q101157561),
part of (P361): University of Washington College of Arts and Sciences (Q7896566),
has part(s) (P527): University of Washington Laboratory for Music Cognition, Culture & Learning (Q101157561),
parent organization (P749): University of Washington College of Arts and Sciences (Q7896566)
}
Thought: To find people affiliated with the University of Washington School of Music, I need to look for properties that define such an affiliation. Additionally, I need to identify properties related to their education at the University of Washington and the musical instruments they play. I will start by searching for the property that links people to their affiliations.
>> search_wikidata("affiliation")
Observation:
- affiliation (Q46135267)
connection, formally or informally, with an organization or intellectual movement
- filiation (Q5861944)
system for recognizing the lineage or kinship of a person in anthropology
- affiliation (Q2620373)
archaic term for the legal establishment of paternity
- affiliated entity (Q69467106)
inverse property label item for P1416
- Affiliation (Q93774359)
scholarly article
- Affiliations Committee (Q46815761)
Wikimedia community-run committee
- Wikimedia Affiliations Committee wiki (Q33120941)
Wikimedia private wiki
- Affiliation and participation (Q107433952)
masters thesis of Hannah Marie Mayfield
- affiliation (P1416)
organization that a person or organization is affiliated with (not necessarily member of or employed by)
Data Type: wikibase-item
- affiliation string (P6424)
qualifier to provide the published string form of affiliation attached to an author
Data Type: string
- Olympedia affiliations ID (P9447)
identifier for the affiliation of athletes in Olympedia athlete database
Data Type: external-id (Represents an identifier used in an external system.)
Thought: The property ‘affiliation (P1416)’ seems to be the most relevant for linking people to the University of Washington School of Music. Next, I need to identify the property for their education at the University of Washington and the property that links people to the musical instruments they play.
>> search_wikidata("educated at")
Observation:
- educated at (P69)
educational institution attended by subject
Data Type: wikibase-item
Thought: I have identified the property ‘educated at (P69)’ to link people to the University of Washington. Next, I need to find the property that links people to the musical instruments they play.
>> search_wikidata("musical instrument")
Observation:
- musical instrument (Q34379)
device created or adapted to make musical sounds
- heraldic musical instrument (Q56877088)
category of heraldic charges
- Musical instrument (Q102413357)
Oil of canvas by Alla Grigoryan
- Musical Instrument (Q102413736)
painting by Musical instrument
- Musical instrument (Q86150836)
sculpture in the church Brother Klaus in the city of Bern, Switzerland
- Musical Instrument (Q80002888)
musical instrument by unknown artist (1929.322)
- Musical instrument (Q123046258)
US patent 11482197
- musical instrument maker (Q1955150)
manufacturer of musical instruments
- instrument (P1303)
musical instrument that a person plays or teaches or used in a music occupation
Data Type: wikibase-item
- MIMO instrument ID (P3763)
identifier for a musical instrument, on the instrument keyword thesaurus from Musical Instruments Museums Online
Data Type: external-id (Represents an identifier used in an external system.)
Thought: I have identified the relevant properties: ‘affiliation (P1416)’ to link people to the University of Washington School of Music, ‘educated at (P69)’ to link people to the University of Washington, and ‘musical instrument (P1303)’ to identify the instruments they play. I will now construct a SPARQL query that combines these properties to get the required information.
>> execute_sparql("""
SELECT ?instrument (COUNT(*) AS ?count) WHERE {
?person wdt:P69 wd:Q1063349;
wdt:P1416 wd:Q98035717;
wdt:P1303 ?instrument.
}
GROUP BY ?instrument
"""
)
Observation:
Query returned empty result.
Thought: It appears that the query returned an empty result. This could be due to several reasons: either there are no people matching all criteria, or the properties might not be used exactly as I assumed. To ensure accuracy, I will individually verify if there are people affiliated with the University of Washington School of Music who have also been educated at the University of Washington.
>> execute_sparql(
"""
SELECT ?person WHERE {
?person wdt:P69 wd:Q1063349;
wdt:P1416 wd:Q98035717.
}
LIMIT 10
"""
)
Observation: Query returned empty result.
Thought: The absence of results could indicate that the properties or the specific entities used in the query might not be correct or commonly utilized in this context. I will check if there are people who have been educated at the University of Washington (Q1063349) and then see if I can refine the affiliation aspect.
>> execute_sparql(
"""
SELECT ?person WHERE { ?person wdt:P69 wd:Q1063349. }
LIMIT 10
"""
)
Observation: Query returned empty result.
Thought: Since my previous queries returned empty results, I need to verify if there are any individuals in Wikidata who have been educated at the University of Washington (Q1063349). This will help me confirm if I am using the correct properties and entities.
>> get_wikidata_entry("Q1063349")
Observation:
{
country (P17): Russia (Q159),
instance of (P31): hamlet (Q5084),
located in the administrative territorial entity (P131): Tavtimanovsky selsoviet (Q4449505),
population (P1082): {
11: {
Qualifiers: [
{
point in time (P585): 1 January 2010,
determination method (P459): census (Q39825)
},
{
determination method (P459): census (Q39825),
point in time (P585): 1 January 2002
}
]
},
9: {
Qualifiers: [
{
point in time (P585): 1 January 2009,
determination method (P459): census (Q39825)
}
]
}
}
}
Thought: The query results indicate that there may be an issue with the QID I used for the University of Washington (Q1063349). It seems to correspond to a location in Russia rather than the university. I should search for the correct QID for the University of Washington.
>> search_wikidata("University of Washington")
Observation:
- University of Washington (Q219563)
public research university in Seattle, Washington, United States
- University of Washington Collection (NAID 931) (Q59502962)
collection in the National Archives and Records Administrations holdings
- University of Washington Department of Chemistry (Q97958839)
academic department at the University of Washington
- University of Washington Department of Architecture (Q98844905)
academic department at the University of Washington
- University of Washington Department of Civil and Environmental Engineering (Q97990078)
academic department at the University of Washington
- University of Washington Department of Chemical Engineering (Q99196105)
academic department at the University of Washington
- University of Washington Paul G. Allen School of Computer Science & Engineering (Q98186807)
school of the University of Washington
- University of Washington School of Music (Q98035717)
school of music
Thought: The correct QID for the University of Washington is Q219563. I should now update my SPARQL query to use this correct QID for identifying people who have been educated at the University of Washington and who are affiliated with the University of Washington School of Music.
>> execute_sparql(
"""
SELECT ?person WHERE {
?person wdt:P69 wd:Q219563;
wdt:P1416 wd:Q98035717.
}
LIMIT 10
"""
)
Observation:
| Person |
|--------------|
| Q107692821 |
| Q107705263 |
| Q107706107 |
| Q107707997 |
| Q107708199 |
| Q107708456 |
| Q107720948 |
| Q107773759 |
| Q107782800 |
| Q107789859 |
Thought: Now that I have identified individuals who have been educated at the University of Washington (Q219563) and are affiliated with the University of Washington School of Music (Q98035717), I will update my SPARQL query to find the musical instruments played by these individuals and count the occurrences of each instrument.
>> execute_sparql(
"""
SELECT ?instrument ?instrumentLabel (COUNT(?student) AS ?count) WHERE {
?student wdt:P1303 ?instrument;
wdt:P1416 wd:Q98035717;
wdt:P69 wd:Q219563.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?instrument ?instrumentLabel
"""
)
Observation:
| instrument | instrumentLabel | count |
|--------------|-----------------|-------|
| Q5994 | piano | 99 |
| Q1467960 | mbira | 2 |
| Q8350 | trombone | 11 |
| Q8338 | trumpet | 8 |
| Q17172850 | voice | 32 |
| ... | ... | ... |
| Q302497 | mandolin | 1 |
| Q187851 | recorder | 1 |
| Q185041 | cor anglais | 1 |
| Q83509 | piccolo | 1 |
Thought: The query successfully returned the musical instruments played by individuals who are affiliated with the University of Washington School of Music and have been educated at the University of Washington, along with the counts of each instrument. I will mark this query as complete.
>> stop()
Listing 1: Full version of the example in Figure 2