Spinach: SPARQL-Based Information Navigation
for Challenging Real-World Questions

Shicheng Liu^†^∗ Sina J. Semnani^†^∗ Harold Triedman^§ Jialiang Xu^†
Isaac Dan Zhao^† Monica S. Lam^†
^† Stanford University ^§ Wikimedia Foundation
{shicheng, sinaj, lam}@cs.stanford.edu
[email protected]

Abstract

Recent work integrating Large Language Models (LLMs) has led to significant improvements in the Knowledge Base Question Answering (KBQA) task. However, we posit that existing KBQA datasets that either have simple questions, use synthetically generated logical forms, or are based on small knowledge base (KB) schemas, do not capture the true complexity of KBQA tasks.

To address this, we introduce the Spinach dataset, an expert-annotated KBQA dataset collected from forum discussions on Wikidata’s “Request a Query” forum with 320 decontextualized question-SPARQL pairs. Much more complex than existing datasets, Spinach calls for strong KBQA systems that do not rely on training data to learn the KB schema, but can dynamically explore large and often incomplete schemas and reason about them.

Along with the dataset, we introduce the Spinach agent, a new KBQA approach that mimics how a human expert would write SPARQLs for such challenging questions. Experiments on existing datasets show Spinach’s capability in KBQA, achieving a new state of the art on the QALD-7, QALD-9 Plus and QALD-10 datasets by 31.0%, 27.0%, and 10.0% in F1, respectively, and coming within 1.6% of the fine-tuned LLaMA SOTA model on WikiWebQuestions. On our new Spinach dataset, the Spinach agent outperforms all baselines, including the best GPT-4-based KBQA agent, by 38.1% in F1. Code and data are available at https://github.com/stanford-oval/spinach


∗ Equal contribution

1 Introduction

Figure 1(a): An example forum discussion.

Wikidata (Vrandečić and Krötzsch, 2014), one of the largest publicly available knowledge bases, currently contains 15 billion facts and is estimated to grow at a rate of 1 billion triples per year (MediaWiki, 2024b). It is of significant value to many scientific communities, including Mathematics (Scharpf et al., 2021), Biology (Mitraka et al., 2015; Pfundner et al., 2015), Education (Sigalov and Nachmias, 2023), Linguistics (Turki et al., 2017; lu Yu and Qiao, 2017), and the Social Sciences (Kapsalis, 2019; Zhao, 2022), among many others (Farda-Sarbas and Müller-Birn, 2019; Turki et al., 2023). However, accessing Wikidata effectively can be challenging. To address this, the Wikidata Request a Query forum (https://www.wikidata.org/wiki/Wikidata:Request_a_query) has been created so that users can ask questions, propose solutions, and participate in follow-up conversations. Figure 1(a) shows one such conversation, in which a user wants to obtain the academic genealogy of Leonhard Euler.

Although numerous datasets have been proposed for the Knowledge Base Question Answering (KBQA) task, they either contain only simple questions (Yih et al., 2016; Bordes et al., 2015; Usbeck et al., 2017; Perevalov et al., 2022; Usbeck et al., 2023; Xu et al., 2023; Korablinov and Braslavski, 2020; Rybin et al., 2021) or synthetically generated complex logical forms (Bao et al., 2016; Gu et al., 2021; Talmor and Berant, 2018; Keysers et al., 2020; Cao et al., 2022a; Dubey et al., 2019). Datasets with synthetically generated logical forms often overestimate the performance of KBQA systems, and performing well on them does not equate to the ability to handle real-world queries (Oren et al., 2021; Campagna et al., 2022). Thus, the community needs a high-quality dataset with organic, real-world queries that capture the true complexity of KBQA tasks.

In this paper, we propose using the Wikidata Request a Query forum to build and evaluate next-generation KBQA systems. We introduce the Spinach dataset, a new, expert-annotated KBQA dataset featuring decontextualized question-SPARQL pairs derived from complex discussions on real-life forums. Figure 1(b) presents an example obtained from the discussion in Figure 1(a).

As we show, current KBQA approaches (Xu et al., 2023; Sun et al., 2024a) cannot handle the complexity of these real-life queries. The state-of-the-art (SOTA) approach, ToG (Sun et al., 2024a), which integrates LLMs with KG reasoning, scores only 1.8 EM and 7.2 F1 on this dataset. Therefore, we propose a new LLM-augmented KBQA approach, Spinach: SPARQL-Based Information Navigation for Challenging Real-World Questions, which is designed with the primary goal of mimicking how an expert would tackle the task. We show that the Spinach agent establishes a new state of the art on popular datasets such as QALD-7, QALD-9 Plus, and QALD-10 and comes within 1.6% of the fine-tuned SOTA on WikiWebQuestions. Although the Spinach agent outperforms strong previous work on the Spinach dataset, its absolute scores there indicate that further research is necessary to fully address real-life KBQA.

2 Related Work

2.1 KBQA benchmarks

Natural questions w/ annotated logical forms
	avg. clauses	avg. projs	avg. rels	avg. subjs	avg. preds	avg. objs	avg. lits
WikiWebQuestions (Xu et al., 2023)	2.63	1	1.53	1.25	1.52	1.53	0.04
QALD-9 Plus (Perevalov et al., 2022)	3.14	1	1.77	1.26	1.70	1.78	0.05
QALD-10 (Usbeck et al., 2023)	2.38	1	1.27	1.19	1.17	1.32	0.05
RuBQ (Korablinov and Braslavski, 2020)	2.17	1	1.12	1.03	1.11	1.07	0.01
SimpleQuestionsWikidata (Diefenbach et al., 2017b)	2.00	1	1.00	1.00	1.00	1.00	0.00
Synthetic logical forms w/ synthetic or paraphrased questions
CWQ (Talmor and Berant, 2018)	5.19	1	2.80	1.87	2.62	3.38	0.11
GrailQA (Gu et al., 2021)	7.10	1	3.02	1.97	2.43	3.90	0.08
KQA Pro (Cao et al., 2022a)	6.34	1	5.01	2.77	3.94	2.43	2.37
MCWQ (Cui et al., 2022)	6.34	1	5.09	2.67	3.53	3.37	0.00
LC-QuAD-2 (Dubey et al., 2019)	3.65	1	2.07	1.51	2.05	2.07	0.22
Natural logical form w/ annotated questions
Spinach (Ours)	8.89	2.50	4.03	1.76	3.55	4.53	0.46

Table 1: Quantitative measurements on different KBQA datasets. For datasets originally based on Freebase, we calculate their corresponding datasets in Wikidata, if available. Refer to Appendix A for definitions of each metric.

Over the past decade, numerous Knowledge Base Question Answering (KBQA) benchmarks with logical forms have been introduced and can be classified into two categories:

Datasets with natural questions originally collected through search engines or crowd-sourcing. This includes the popular dataset WebQuestionSP (Yih et al., 2016), along with the QALD datasets (Usbeck et al., 2017, 2018, 2023; Perevalov et al., 2022), RuBQ (Korablinov and Braslavski, 2020), and SimpleQuestions (Bordes et al., 2015), among others;

Datasets with synthetically generated logical forms and questions, where some datasets paraphrase synthetically generated questions via crowdsourcing. This category includes ComplexWebQuestions (Talmor and Berant, 2018), GrailQA (Gu et al., 2021), KQA Pro (Cao et al., 2022a), CFQ (Keysers et al., 2020), and LC-QuAD-2 (Dubey et al., 2019), among others.

Some of the aforementioned datasets were originally based on Freebase (Bollacker et al., 2008) or DBPedia (Lehmann et al., 2015). Recently, however, the KBQA community has been shifting toward using Wikidata as the underlying knowledge base for KBQA datasets, due to its larger size and continuous updates from community contributions (see Section 1.1 of Usbeck et al., 2023, for more on why Wikidata is better suited for KBQA benchmarks). Several datasets have been converted to use Wikidata: WebQuestionSP has been converted to WikiWebQuestions (Xu et al., 2023); SimpleQuestions has a Wikidata version (Diefenbach et al., 2017b); and CFQ has been converted to MCWQ (Cui et al., 2022).

In Table 1, we present quantitative statistics on these datasets. Existing datasets with natural questions typically involve relatively simple queries (e.g., 2.63 clauses per query for WikiWebQuestions). In contrast, synthetically generated datasets generally contain more complex queries by design. However, due to the limited natural language variety between training and evaluation data, models can achieve artificially high accuracy. For instance, a simple semantic parser based on the BART model (Lewis et al., 2020) can achieve an accuracy of over 90% on KQA Pro even without an entity linking module (Cao et al., 2022a). Furthermore, systems that excel at synthetic datasets do not necessarily perform well on natural, complex tasks (Oren et al., 2021; Campagna et al., 2022). This indicates a need in the KBQA community for a high-quality dataset with natural, complex queries and logical forms.

2.2 KBQA approaches

Current KBQA systems can be classified mainly into three categories: (1) subgraph retrieval approaches (Sun et al., 2018, 2019; Sen et al., 2021; Verga et al., 2021; Mavromatis and Karypis, 2022), (2) semantic parsing systems (Yih et al., 2015, 2016; Luo et al., 2018; Lan and Jiang, 2020; Das et al., 2021; Ye et al., 2022; Cao et al., 2022b; Gu and Su, 2022; Xu et al., 2023), and more recently, (3) LLM-based graph exploration agents (Sun et al., 2024a; Xiong et al., 2024). Some works also use a combination of these techniques (Yu et al., 2023; Luo et al., 2024a, b).

Due to the sheer size of Wikidata, none of the previous subgraph retrieval approaches have tackled the problem of embedding the entire graph, instead operating only on a small fixed subset of Wikidata. This limitation renders them inapplicable in real-world scenarios where access to the entire graph is needed. Among semantic parsing systems, Xu et al. (2023) fine-tune LLaMA (Touvron et al., 2023) with a modified SPARQL syntax and achieve state-of-the-art results on 2 KBQA benchmarks. Among LLM-based approaches, Sun et al. (2024a) instruct an LLM to dynamically explore the graph to fetch answers, achieving SOTA on 6 KBQA datasets. In Section 5, we benchmark the performance of these two systems on our new Spinach dataset.

3 The Spinach Dataset

The archives of the Wikidata Request a Query forum from July 2016 to the present date are available online (https://www.wikidata.org/wiki/Wikidata:Request_a_query/Archive). These conversations are real and organic, reflecting the kind of queries that practitioners are interested in. Among the discussions up to May 2024, there are 2780 with a valid SPARQL query. We only keep those conversations whose last-mentioned query returns non-empty results under 10 MB. Of the 2171 discussions left, we randomly sample 920 conversations for consideration. These conversations span many domains, and typically do not specify the exact natural language corresponding to the SPARQLs. Thus, manual processing is required to convert them to a KBQA dataset. The Wikidata experts among the authors of this paper manually inspected these conversations to produce a dataset with 155 examples in the validation set and 165 examples in the test set, as described below. Additional details on this process can be found in Appendix B.1.

3.1 Choosing and Trimming the Queries

Our goal in building this dataset is to have examples focus on end-users simply accessing the Wikidata content, as opposed to performing obscure optimizations or formatting. As such, we manually remove the following SPARQL clauses or exclude the following conversations:

Wikimedia presentation queries: We filter out clauses for analyzing or improving Wikimedia itself, such as asking which articles exist in one language in Wikipedia and not the other.

Overly complicated SPARQL: We exclude certain conversations where users ask for help debugging their complex SPARQL queries.

Queries obscured by optimizations: Because Wikidata restricts query runtime to 60 seconds (MediaWiki, 2024a), users may include clauses to optimize the performance, which are often not directly relevant to answering the question. We remove these clauses.

Formatting clauses: We remove clauses that only format the results.

Appendix B.2 and B.3 show specific examples of these modifications and exclusions.

3.2 Annotating Natural Questions

For each conversation, the experts annotate a self-contained, decontextualized natural language question that accurately captures the meaning of the user-written SPARQL. We follow these steps:

Disambiguate entities and properties: Entities and properties are important parts of a query. For ambiguous entities, the verbalization includes additional information to resolve the ambiguity. As for the properties, the verbalization should distinguish between similar properties. For example, instead of asking “where a movie takes place”, we distinguish between the “narrative location” and the “filming location”; instead of asking “where a person comes from”, we distinguish between the “country of citizenship” and the “country of birth”.

Natural verbalizations: Whenever possible, to the extent that no ambiguities are introduced, the question should refrain from directly using entity and property names, instead using a more natural way to express the meaning. For instance, instead of asking “what is the point of time of the goal?”, a more natural question with the same level of accuracy like “when does the goal take place?” should be used.

Accurately capturing OPTIONAL clauses and projections: In SPARQL, OPTIONAL clauses are used to include additional data that may or may not exist for queried items without excluding results lacking this optional data (see https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial#OPTIONAL). The verbalization should specify which fields are optional with clauses such as “if available”. Similarly, the verbalization should accurately capture what is requested in the projections, using phrases like “For each result, return their name and location” if necessary.

As illustrated in Table 1, the Spinach dataset is the first dataset with both natural and complex logical forms, which represents the real-world KBQA needs of Wikidata users.

4 The Spinach Agent

The Spinach agent is an LLM-augmented knowledge graph exploration agent (Section 2.2), where an LLM is instructed to explore the knowledge graph until an appropriate answer is found or another stop condition is met. However, unlike prior work, we design Spinach with the primary goal of mimicking a human expert writing a SPARQL query. An expert starts by writing simple queries and looking up Wikidata entity or property pages when needed, all to understand the structure of the knowledge graph and what connections exist. This is especially important for Wikidata due to its anomalous structure (Shenoy et al., 2022). An expert then might add new SPARQL clauses to build towards the final SPARQL, checking her work along the way by executing intermediate queries and eyeballing the results. Potentially, she might go back to the drawing board and double-check her assumptions if a query fails.

Following this design principle, Spinach improves upon several aspects of previous agent-based KBQA systems:

First, Spinach uses the full expressiveness of SPARQL for exploration. This contrasts with Sun et al. (2024a), Xiong et al. (2024), and Sun et al. (2024b), which explore the knowledge graph one edge at a time. As a result, Spinach weaves the exploration and writing of SPARQL together. It is instructed to try many SPARQLs and fail early. It learns about the structure of the knowledge graph from the results of its queries (or lack thereof), or realizes its incorrect assumptions, and decides to revisit them. This is a key difference from Sun et al. (2024a, b), where the answer is generated only when exploration is done.

Second, during its exploration, Spinach does not keep track of a subgraph as its state. Instead, we define the state to be the full history of actions and their results so far. Limiting the state to a subgraph (Sun et al., 2024a, b) means that the agent would categorically not support questions whose answer contains a large number of entities (e.g. “Which actors graduated from Harvard?”) or requires computation (e.g. “What is the tallest mountain?”).

Third, while most KBQA agents start their exploration from entities they detect from the question, Spinach does not assume access to entities. This, for example, gives the agent the flexibility to start by searching for relevant relations if that is a more suitable way to explore. For example, when faced with the question “What are the items listed on the National Heritage List for England?”, it can find the designated property “National Heritage List for England number” (P1216) for this purpose instead of trying to use the entity “National Heritage List for England” (Q6973052), which makes the process much simpler.

Concretely, the Spinach agent runs for $N$ time steps. At time step $n$, it takes the current state as input and outputs a thought $t_{n}$ (Yao et al., 2023) and one action $a_{n}$ from the set of all possible actions. The action is then executed outside of the agent, and the results are returned to the agent as observation $o_{n}$. In the rest of this section, we go into the details of the Spinach agent. Figure 2 shows an example of how Spinach answers a question.

4.1 State

The state at time step $n$ is $\mathcal{S}_{n}=\{t_{1},a_{1},o_{1},...,t_{n-1},a_{n-1},o_{n-1}\}$, that is, the list of all thoughts, actions, and observations up to and including the previous time step. $\mathcal{S}_{1}$ is set to $\emptyset$.
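As a minimal sketch of this definition, the state can be held as an append-only list of (thought, action, observation) steps. The names `Step`, `State`, and `render` are illustrative assumptions and are not taken from the Spinach codebase; the text rendering is just one plausible way to place the history into a prompt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Step:
    """One time step: the thought t_n, the action a_n, and the observation o_n."""
    thought: str
    action: str        # e.g. 'search_wikidata("Leonhard Euler")'
    observation: str   # human-readable result of executing the action


@dataclass
class State:
    """S_n: every step taken before time step n. S_1 is empty."""
    steps: List[Step] = field(default_factory=list)

    def append(self, step: Step) -> None:
        self.steps.append(step)

    def render(self) -> str:
        """Serialize the history into plain text (hypothetical prompt format)."""
        lines = []
        for i, s in enumerate(self.steps, start=1):
            lines.append(f"Thought {i}: {s.thought}")
            lines.append(f"Action {i}: {s.action}")
            lines.append(f"Observation {i}: {s.observation}")
        return "\n".join(lines)
```

Because the state is the full history rather than a subgraph, nothing in this structure limits how many entities an answer may contain.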

Complex KBQA tasks require high reasoning ability. One promising approach to improve the reasoning capabilities of LLMs is ReAct prompting (Yao et al., 2023), in which the LLM is instructed to output its reasoning trace in natural language first, before selecting an action. We require the agent to output a thought at the beginning of each time step.

At time step $n$, the Spinach agent takes $a_{n}$, one of the following five actions, and receives the result of the execution as observation $o_{n}$. These actions are selected for their similarity to what a human expert can do.

search_wikidata(string) is equivalent to using the text search box at wikidata.org. This action searches Wikidata for items (entities or properties) that match a given string using the wbsearchentities module of the Wikidata API (https://www.wikidata.org/w/api.php). This API finds matches using human-readable labels and aliases. The main use for this action is to find relevant QIDs and PIDs.

get_wikidata_entry(QID) is equivalent to visiting the Wikidata page for an entity, e.g. https://www.wikidata.org/wiki/Q219563. This action retrieves all outgoing edges, i.e. linked entities, properties, and qualifiers of a specified Wikidata entity using its QID.

get_property_examples(PID) provides a few examples demonstrating the use of the specified property in Wikidata. The main use of this action is to understand how a property is used in Wikidata, if it is not clear from its label alone.

execute_sparql(SPARQL) is equivalent to using the Wikidata Query Service at https://query.wikidata.org/. This action runs a SPARQL query on Wikidata and returns its results.

stop() marks the last executed SPARQL query as the final answer and ends the process.
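To make the two network-backed actions more concrete, the sketch below only builds the request URLs for a `wbsearchentities` lookup and for the public SPARQL endpoint, without sending anything. The helper names (`search_url`, `sparql_url`) and the specific parameter values (`language="en"`, the default `limit`, `format="json"`) are assumptions chosen for illustration; the paper does not describe how Spinach issues these requests.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def search_url(text: str, kind: str = "item", limit: int = 8) -> str:
    """URL for a wbsearchentities lookup; kind is 'item' or 'property'."""
    params = {
        "action": "wbsearchentities",
        "search": text,
        "language": "en",   # assumed: search English labels and aliases
        "type": kind,
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def sparql_url(query: str) -> str:
    """URL that asks the Wikidata Query Service to run `query` and return JSON."""
    return f"{SPARQL_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"


if __name__ == "__main__":
    print(search_url("Leonhard Euler"))
    print(search_url("doctoral advisor", kind="property", limit=4))
    print(sparql_url("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"))
```

Keeping URL construction separate from the HTTP call makes it easy to log exactly what the agent asked for at each step.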

Importantly, the results of the execution of each action are put in a human-readable format to make it easier for the LLM to process. To limit the amount of information that the agent has to process, we limit the output of search results to at most 8 entities and 4 properties, and limit large results of SPARQL queries to the first and last 5 rows. The outputs of get_wikidata_entry() are pruned using a separate few-shot prompt (Prompt 5) to only keep entries that are relevant to the question.
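The truncation rules in this paragraph can be sketched as two small helpers. The function names, the `"id"` field used to tell entities from properties, and the text of the omission marker are assumptions made for this example only.

```python
from typing import List, Sequence


def truncate_rows(rows: Sequence[object], keep: int = 5) -> List[object]:
    """Keep the first and last `keep` rows and note how many were left out."""
    if len(rows) <= 2 * keep:
        return list(rows)
    omitted = len(rows) - 2 * keep
    return [*rows[:keep], f"... {omitted} rows omitted ...", *rows[-keep:]]


def cap_search_results(items: Sequence[dict], max_entities: int = 8,
                       max_properties: int = 4) -> List[dict]:
    """Keep at most 8 entities (QIDs) and 4 properties (PIDs), in original order."""
    kept: List[dict] = []
    n_entities = n_properties = 0
    for item in items:
        if item["id"].startswith("Q") and n_entities < max_entities:
            kept.append(item)
            n_entities += 1
        elif item["id"].startswith("P") and n_properties < max_properties:
            kept.append(item)
            n_properties += 1
    return kept
```

Showing both the head and the tail of a large result lets the model notice ordering problems (for example a wrong sort direction) that the first rows alone would hide.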

4.2 Policy

The policy of Spinach is implemented using a zero-shot prompt that only includes high-level instructions such as “start by constructing very simple queries and gradually build towards the complete query” and “confirm all your assumptions about the structure of Wikidata before proceeding” (Prompt 4). The decision of selecting the action at each time step is left to the LLM. We arrived at this design after experimenting with more restrictive policies: we empirically observe that trying to account for all the various edge cases the agent might encounter can confuse the agent and hurt performance.

The agent loop constructs the state and uses this policy prompt to ask the LLM to output a thought and an action. It then executes the action and adds the observation to the state. The agent loop continues until either the stop() action is selected or it reaches the predefined maximum number of 30 actions.
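The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: `policy` stands in for the LLM call with the policy prompt, `tools` maps action names to hypothetical executors, and the tuple layout of a step is invented for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (thought, action name, argument, observation); layout is an assumption.
Step = Tuple[str, str, str, str]
# Stand-in for the LLM: maps (question, state) to (thought, action, argument).
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]

MAX_ACTIONS = 30  # maximum number of actions stated in the text


def run_agent(question: str, policy: Policy,
              tools: Dict[str, Callable[[str], str]],
              max_actions: int = MAX_ACTIONS) -> Optional[str]:
    """Run the thought/action/observation loop; return the last executed SPARQL."""
    state: List[Step] = []
    last_sparql: Optional[str] = None
    for _ in range(max_actions):
        thought, action, argument = policy(question, state)
        if action == "stop":
            break  # stop() marks the last executed query as the final answer
        observation = tools[action](argument)  # executed outside the agent
        if action == "execute_sparql":
            last_sparql = argument
        state.append((thought, action, argument, observation))
    return last_sparql
```

Passing the policy and tools in as plain callables keeps the loop testable with scripted stand-ins before any LLM or network access is wired up.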

In practice, the LLM can occasionally fall into a cycle of repeating the same action and argument over and over again, or deviate from its instructions and call stop() even though the last SPARQL execution did not return any results. To remedy this, the agent loop checks for these two situations and resets the state to just before the first time the repeated action was performed, or to the beginning, respectively, and continues from there. Note that the policy LLM is run using sampling with a high temperature of 1.0 to encourage exploration, so rerunning from a previous state does not repeat the exact same sequence of actions.
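The two recovery rules can be sketched as pure functions over the history. The text does not say how many repetitions count as a cycle, so this example assumes that a single exact repeat of an (action, argument) pair is enough; the function names and step layout are likewise illustrative.

```python
from typing import List, Tuple

# (thought, action name, argument, observation); layout is an assumption.
Step = Tuple[str, str, str, str]


def rollback_if_repeated(state: List[Step], action: str, argument: str) -> List[Step]:
    """If (action, argument) was already tried, drop history from its first use onward."""
    for i, (_, past_action, past_argument, _) in enumerate(state):
        if (past_action, past_argument) == (action, argument):
            return state[:i]  # state just before the first occurrence
    return state


def rollback_if_empty_stop(state: List[Step], last_sparql_had_results: bool) -> List[Step]:
    """If stop() follows a SPARQL query with no results, restart from the empty state."""
    return state if last_sparql_had_results else []
```

Because the policy samples at temperature 1.0, resuming from the shortened history is expected to lead down a different path rather than into the same cycle.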

In the example in Figure 2, Spinach goes through the following four phases, just by following the high-level instructions we provided: it (1) looks up relevant PIDs and QIDs; (2) attempts to write the SPARQL in one go, which fails due to a hallucinated QID that the agent did not previously check; (3) starts the debugging process by executing simpler SPARQL queries and double-checking its assumptions, realizing its mistake after looking at the result of a get_wikidata_entry call; and finally (4) moves on to write the correct SPARQL query.

5 Experiments

5.1 Evaluation Metrics

Prior works mostly use two metrics: Exact Match (EM) and F1 (Yih et al., 2016; Usbeck et al., 2023). As shown in Table 1, the Spinach dataset contains 2.50 projections on average for each query, whereas prior datasets all contain only one field in the projection. This means that naively applying EM and F1 is not possible; therefore, we propose a generalization of these metrics.

Concretely, during evaluation, for each data point $x$, a gold result $y$ and a predicted result $y^{\prime}$ are given, where

$$y,y^{\prime}\in\{[y_{1},y_{2},\cdots]\}\cup\{\text{True},\text{False}\}$$

with each $y_{i}$ being either an entity (e.g. ‘Q123’) or a literal (e.g. ‘123’). For each data point, EM is calculated to be $1$ if $y=y^{\prime}$ and $0$ otherwise. If both results are lists, F1 is calculated as the harmonic mean of precision and recall:

$$F_{1}(y,y^{\prime})=\frac{2\,\textit{tp}}{2\,\textit{tp}+\textit{fp}+\textit{fn}}\tag{1}$$

where tp denotes the number of true positives ($|y\cap y^{\prime}|$), fp denotes the number of false positives ($|y^{\prime}-y|$), and fn denotes the number of false negatives ($|y-y^{\prime}|$). If one of $y,y^{\prime}$ is a boolean, then $F_{1}(y,y^{\prime})$ is set to $1$ if $y=y^{\prime}$ and $0$ otherwise (Usbeck et al., 2023).
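A minimal sketch of these single-projection metrics follows. It reads list results as sets, as the $|y\cap y^{\prime}|$ notation suggests; returning $1$ when both lists are empty is an assumption, since the text does not cover that case.

```python
from typing import List, Union

Result = Union[List[str], bool]


def exact_match(gold: Result, pred: Result) -> float:
    """EM: 1 if the two results are equal (lists compared as sets), else 0."""
    if isinstance(gold, bool) or isinstance(pred, bool):
        return float(gold == pred)
    return float(set(gold) == set(pred))


def f1(gold: Result, pred: Result) -> float:
    """F1 = 2tp / (2tp + fp + fn); for booleans it reduces to equality."""
    if isinstance(gold, bool) or isinstance(pred, bool):
        return float(gold == pred)
    g, p = set(gold), set(pred)
    tp, fp, fn = len(g & p), len(p - g), len(g - p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # both empty: assumed perfect


if __name__ == "__main__":
    # tp = 2 (Q2, Q3), fp = 1 (Q4), fn = 1 (Q1), so F1 = 4 / 6.
    print(f1(["Q1", "Q2", "Q3"], ["Q2", "Q3", "Q4"]))
```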

In prior datasets, each $y_{1},y_{2},\cdots$ is only an entity or literal, but this is not necessarily the case in real-life SPARQLs. For instance, in Figure 1(b), the user is interested in a tuple of the doctoral advisee and advisor for each result, where each $y_{i}$ is a list of entities and literals. In other words, the results are in the general case a matrix, not a list. Every row in the matrix is one result, and each column corresponds to one projection in the query.

From the user’s perspective, a good system could opt to return additional information when it helps clarify or enrich the results. For instance, for the query “what is the county with most people in South Dakota?”, an LLM-based system can choose to return the top county along with its population, even though the gold SPARQL might only include the county in the projection. Penalizing the system for the additional projection columns is not meaningful in this case. To reflect this, we introduce a row-major generalization of EM and F1 to handle matrix-wise comparisons in real-life KBQA tasks, where each row is handled similarly to the traditional F1 score, and extra columns are not penalized.

Concretely, given two matrices of the form:

$$\mathbf{y}=\begin{pmatrix}y_{11}&\cdots&y_{1m}\\ \vdots&\ddots&\vdots\\ y_{n1}&\cdots&y_{nm}\end{pmatrix},\quad\mathbf{y^{\prime}}=\begin{pmatrix}y^{\prime}_{11}&\cdots&y^{\prime}_{1m^{\prime}}\\ \vdots&\ddots&\vdots\\ y^{\prime}_{n^{\prime}1}&\cdots&y^{\prime}_{n^{\prime}m^{\prime}}\end{pmatrix}$$

we calculate the recall between a gold row $y_{i}=(y_{i1},\cdots,y_{im})$ and a predicted row $y^{\prime}_{i^{\prime}}=(y^{\prime}_{i^{\prime}1},\cdots,y^{\prime}_{i^{\prime}m^{\prime}})$ as

$$\text{recall}(y_{i},y^{\prime}_{i^{\prime}})=\frac{|y_{i}\cap y^{\prime}_{i^{\prime}}|}{|y_{i}|}$$

An assignment algorithm (SciPy, 2024) is then run between the two matrices to find the row-by-row matching with the highest cumulative recall score (without matching rows with $0$ recall). Each matched row $(y_{i},y^{\prime}_{i^{\prime}})$ adds $\text{recall}(y_{i},y^{\prime}_{i^{\prime}})$ to the true positives and $1-\text{recall}(y_{i},y^{\prime}_{i^{\prime}})$ to the false negatives. Each unmatched gold row counts as $1$ false negative, and each unmatched prediction row counts as $1$ false positive. The final $F_{1}$ is then calculated according to Equation 1, and EM is defined to be $1$ if the row-major $F_{1}$ score is $1$ and $0$ otherwise. Note that with this definition, EM and $F_{1}$ are exactly equal to EM and $F_{1}$ in the traditional sense when there is only one projection.
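The row-major metric can be sketched as below. To stay within the standard library, this example swaps the assignment solver for an exhaustive search over permutations, which is only practical for tiny matrices; a real evaluation would use a proper assignment solver as the text describes. Treating rows as sets, the arbitrary tie-breaking among equally good matchings, and all function names are assumptions of this sketch.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Row = Sequence[str]
Match = Tuple[int, int, float]  # (gold row index, predicted row index, recall)


def row_recall(gold_row: Row, pred_row: Row) -> float:
    """Fraction of the gold row's cells that also appear in the predicted row."""
    gold_cells = set(gold_row)
    return len(gold_cells & set(pred_row)) / len(gold_cells)


def best_matching(gold: List[Row], pred: List[Row]) -> List[Match]:
    """Exhaustively find the one-to-one row matching with the highest total recall."""
    gold_is_smaller = len(gold) <= len(pred)
    n_small, n_large = sorted((len(gold), len(pred)))
    best: List[Match] = []
    best_total = -1.0
    for perm in permutations(range(n_large), n_small):
        pairs = [(i, j) if gold_is_smaller else (j, i) for i, j in enumerate(perm)]
        scored = [(g, p, row_recall(gold[g], pred[p])) for g, p in pairs]
        matched = [m for m in scored if m[2] > 0]  # never match rows with 0 recall
        total = sum(r for _, _, r in matched)
        if total > best_total:
            best_total, best = total, matched
    return best


def row_major_f1(gold: List[Row], pred: List[Row]) -> float:
    """Row-major F1: extra predicted columns are free, extra rows are not."""
    matched = best_matching(gold, pred)
    tp = sum(r for _, _, r in matched)
    fn = sum(1 - r for _, _, r in matched) + (len(gold) - len(matched))
    fp = len(pred) - len(matched)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # both empty: assumed perfect


def row_major_em(gold: List[Row], pred: List[Row]) -> float:
    return float(row_major_f1(gold, pred) == 1.0)
```

As a usage example, with gold rows `[("Q1", "Q2"), ("Q3", "Q4")]` and predicted rows `[("Q3", "Q4", "1707"), ("Q1", "Q9"), ("Q7", "Q8")]`, the best matching pairs the second gold row with the first prediction (recall 1) and the first gold row with the second prediction (recall 0.5). This gives tp = 1.5, fn = 0.5, and fp = 1 for the unmatched third prediction, so $F_{1}=3/4.5\approx 0.667$.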

5.2 Spinach Agent on Prior Datasets

	QALD-7 (Task 4)		QALD-9 Plus (en)		QALD-10 (en)				WikiWebQuestions
	Test		Test		Full Set		Subset in ToG		Dev		Test
	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1	EM	F1
STAGG Yih et al. (2016)	-	19.0	-	-	-	-	-	-	-	-	-	-
GGNN Sorokin and Gurevych (2018)	-	21.3	-	-	-	-	-	-	-	-	-	-
LingTeQA To and Reformat (2020)	-	34.0	-	-	-	-	-	-	-	-	-	-
Baramiia et al. (2022)	-	-	-	-	-	42.8	-	-	-	-	-	-
Shivashankar et al. (2022)	-	-	-	-	-	49.1	-	-	-	-	-	-
QAnswer (Diefenbach et al., 2017a)	-	40.0	-	44.6	-	57.8	-	-	-	-	-	-
SPARQL-QA (Borroto et al., 2022)	-	-	-	-	-	59.5	-	-	-	-	-	-
Liu et al. (2024)	-	-	-	-	56.5	-	-	-	-	-	-	-
0-shot ToG (GPT-4) (Sun et al., 2024a)	-	-	-	-	-	-	54.7	-	-	-	-	-
Fine-tuned WikiSP (Xu et al., 2023)	38.0	43.6	-	-	-	-	-	-	75.6	76.9	65.5	71.9
0-shot Spinach agent (GPT-4o) (Ours)	62.2	74.6	58.3	71.6	63.1	69.5	64.7	72.4	61.2	72.3	59.9	70.3

Table 2: Performance of the Spinach agent and prior works on 4 prior datasets. Sun et al. (2024a) only evaluated on the subset of non-boolean questions of QALD-10, which we denote as “Subset in ToG”. Spinach achieves the new SOTA on QALD-7, QALD-9 Plus, and QALD-10. On WikiWebQuestions, it comes within 1.6% F1 of the SOTA WikiSP fine-tuned on the dataset.

We evaluate our approach on four previous Wikidata datasets. We experiment with three QALD challenges that have annotated Wikidata SPARQLs: QALD-7 (Task 4 for Wikidata) (Usbeck et al., 2017), QALD-9 Plus (Perevalov et al., 2022), and the English subset of QALD-10 (Usbeck et al., 2023). In addition, we evaluate on WikiWebQuestions (Xu et al., 2023), the Wikidata version of the popular WebQuestionSP dataset.

As shown in Table 2, the Spinach agent achieves a new SOTA on the three QALD datasets and comes within 1.6% F1 of the fine-tuned SOTA (WikiSP) on WikiWebQuestions. In particular, our 0-shot agent outperforms WikiSP by 24.2% EM and 31.0% F1 on QALD-7 (Task 4), showing its impressive cross-dataset generalization capability compared to a fine-tuned model. Our approach also outperforms the GPT-4-based ToG (Sun et al., 2024a) by 10.0% EM on the same subset of QALD-10 used for evaluation in its paper.

5.3 Prior Work on the Spinach Dataset

	Dev		Test
	EM	F1	EM	F1
Direct GPT-4o Question Answering	0.0	3.9	0.0	4.0
GPT-4o Generating SPARQL	1.3	5.4	0.6	3.9
Fine-tuned WikiSP (Xu et al., 2023)	1.3	3.5	1.2	7.1
0-shot ToG (GPT-4) (Sun et al., 2024a)	3.9	9.8	1.8	7.2
0-shot Spinach agent (GPT-4o) (Ours)	21.4	46.4	16.4	45.3

Table 3: Evaluation of the Spinach agent and prior works on the Spinach dataset.

We benchmark the performance of four baseline systems on the Spinach dataset. These baselines are: (1) directly asking GPT-4o to answer the question without writing SPARQL, (2) directly asking GPT-4o to write a SPARQL query, (3) WikiSP (Xu et al., 2023), which is a fine-tuned LLaMA (7B-parameter) model, and (4) the GPT-4-based ToG agent (Sun et al., 2024a).

WikiSP expects a Named Entity Disambiguation (NED) module to predict the relevant entities given a question. The original NED module was based on a fine-tuned version of ReFinED (Ayoola et al., 2022). To make the task easier for WikiSP, we directly use the gold entities that appear in the annotated SPARQL as inputs. Similarly, the ToG (Sun et al., 2024a) pipeline uses an LLM to extract “topic entities” from the input question as the starting point of its knowledge graph exploration. We observed that, when using the gold entities as topic entities, the model was unable to return any exploration paths for the first 80 questions in the validation set and simply defaulted to GPT-4. To make the task even easier for ToG, we sample up to 20 entities from the results of the gold SPARQL queries as the topic entities. Additional details on our setup and comparison of baseline systems can be found in Appendix C.

As shown in Table 3, the Spinach agent drastically outperforms all baselines. The median number of actions Spinach takes is 8. In particular, we highlight that the low accuracy of the GPT-4o QA system shows that this dataset mainly consists of long-tail knowledge unfamiliar to LLMs. For reference, GPT-4 was able to achieve 90.5% EM on WebQuestionSP (Tan et al., 2023). Although Spinach achieves the SOTA on this dataset, its lower performance relative to that achieved on previous datasets indicates that there is a lot of room for improving KBQA systems in future work.

5.4 Error Analysis

We randomly sample 20 cases on the Spinach dataset where the Spinach agent achieves less than 0.05 F1 and conduct an error analysis. We observe that:

• Property-related problems: 40% of errors are due to problems with properties. This includes cases when the Spinach agent fails to fetch the correct property or incorrectly uses a property (e.g. using it as a wdt: relation as opposed to the correct ps: or pq: qualifier).

• Complicated SPARQL: 30% of errors are due to failure of the Spinach agent to write complex SPARQL to fetch results, e.g., a complicated filter for finding people born in a specific month. (An interesting observation is that in 2 cases, GPT-4o ended the reasoning with “I will fetch all related records and proceed to filter the results in Python” after observing its previous SPARQL returned no results, suggesting there may be benefits to leveraging programming languages that are more familiar to LLMs.)

• Not enough exploration: 15% of errors are due to insufficient exploration performed by the LLM before reaching the maximum allowed number of actions. Here, the LLM has determined the infeasibility of its prior approaches and is actively exploring new methods.

• Inaccurate semantic parsing: 10% of errors are due to the LLM injecting a clause that should not exist. For instance, when asked “what items were published in …”, the LLM assumes that the results have to be an instance of the domain entity “book”, but there are many other types of results returned by the gold query (e.g. “printed matter” and “legal act”).

• Formatting issues: 5% of errors are due to format errors. This one case is due to the LLM returning the date in full (“June 23 2021”) even though the question only asked for the year (“2021”).

6 Conclusion

We propose the expert-annotated Spinach dataset drawn from real-world queries and introduce the Spinach agent that mimics how a human expert writes SPARQL queries. Experiments on prior datasets show that the Spinach agent achieves the new SOTA on 3 QALD datasets and comes within 1.6% F1 of the fine-tuned SOTA model on WikiWebQuestions. While it achieves a SOTA result on the new Spinach dataset, its relatively low performance there suggests ample opportunities for enhancing KBQA systems in future research.

Ethical Considerations

We release the Spinach dataset in this paper. This dataset is built using publicly available data in the Wikidata “Request a Query” forum, which is licensed under the Creative Commons CC0 License. We do not expect any harm to result from the use of this dataset. The generation and annotation of the dataset were done by the authors of this paper.

No GPU-based training was carried out in our experiments, as we mainly used the OpenAI API to call the GPT-4o model. To evaluate the LLaMA baseline from Xu et al. (2023), we used a Linux server with one NVIDIA A100 GPU for <1 hour.

To facilitate further research, we release our code and data for the Spinach agent and the Spinach dataset. The Spinach agent is released under Apache License, version 2.0. The Spinach dataset, derived from the Wikidata Request a Query forum, is released under the CC BY-SA 4.0 license, the same license that covers the forum.

Limitations

Since the Spinach agent makes multiple LLM calls for each question, its latency and cost are higher than those of simpler systems. Other multi-stage KBQA pipelines with LLMs, such as Sun et al. (2024a), suffer from the same limitations. This seems to be the price for a more accurate KBQA system. However, model distillation has recently been shown to be effective in improving model efficiency (Semnani et al., 2023). We leave the exploration of this direction for future work.

As observed in recent works (Kamalloo et al., 2023; Zhang et al., 2024), automatic metrics such as F1 and EM often fail to accurately measure the capabilities of LLM-based systems. By verbalizing the SPARQL queries as accurately as possible, we aim to minimize the difference between automatic metrics and manual inspection. We thus expect these automatic metrics to still play a pivotal role in the evaluation of future work on the Spinach dataset.

Annotated datasets can also suffer from annotation errors and mistakes (Zang et al., 2020). We address this limitation by having experts annotate the dataset and independently cross-validating each other’s annotations, as discussed in Appendix B.1.

Acknowledgements

This work is supported in part by the Verdant Foundation, Microsoft Azure AI credit, KDDI, and the Stanford Human-Centered Artificial Intelligence (HAI) Institute.

References

Ayoola et al. (2022) Tom Ayoola, Shubhi Tyagi, Joseph Fisher, Christos Christodoulopoulos, and Andrea Pierleoni. 2022. ReFinED: An efficient zero-shot-capable approach to end-to-end entity linking. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, pages 209–220, Hybrid: Seattle, Washington + Online. Association for Computational Linguistics.
Bao et al. (2016) Junwei Bao, Nan Duan, Zhao Yan, Ming Zhou, and Tiejun Zhao. 2016. Constraint-based question answering with knowledge graph. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2503–2514, Osaka, Japan. The COLING 2016 Organizing Committee.
Baramiia et al. (2022) Nikita Baramiia, Alina Rogulina, Sergey Petrakov, Valerii Kornilov, and Anton Razzhigaev. 2022. Ranking approach to monolingual question answering over knowledge graphs. In NLIWoD@ ESWC, pages 32–37.
Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, page 1247–1250, New York, NY, USA. Association for Computing Machinery.
Bordes et al. (2015) Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. Preprint, arXiv:1506.02075.
Borroto et al. (2022) Manuel Borroto, Francesco Ricca, Bernardo Cuteri, and Vito Barbara. 2022. Sparql-qa enters the qald challenge. In Proceedings of the 7th Natural Language Interfaces for the Web of Data (NLIWoD) co-located with the 19th European Semantic Web Conference, Hersonissos, Greece, volume 3196, pages 25–31.
Campagna et al. (2022) Giovanni Campagna, Sina Semnani, Ryan Kearns, Lucas Jun Koba Sato, Silei Xu, and Monica Lam. 2022. A few-shot semantic parser for Wizard-of-Oz dialogues with the precise ThingTalk representation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 4021–4034, Dublin, Ireland. Association for Computational Linguistics.
Cao et al. (2022a) Shulin Cao, Jiaxin Shi, Liangming Pan, Lunyiu Nie, Yutong Xiang, Lei Hou, Juanzi Li, Bin He, and Hanwang Zhang. 2022a. KQA pro: A dataset with explicit compositional programs for complex question answering over knowledge base. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6101–6119, Dublin, Ireland. Association for Computational Linguistics.
Cao et al. (2022b) Shulin Cao, Jiaxin Shi, Zijun Yao, Xin Lv, Jifan Yu, Lei Hou, Juanzi Li, Zhiyuan Liu, and Jinghui Xiao. 2022b. Program transfer for answering complex questions over knowledge bases. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8128–8140, Dublin, Ireland. Association for Computational Linguistics.
Cui et al. (2022) Ruixiang Cui, Rahul Aralikatte, Heather Lent, and Daniel Hershcovich. 2022. Compositional generalization in multilingual semantic parsing over Wikidata. Transactions of the Association for Computational Linguistics, 10:937–955.
Das et al. (2021) Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya Godbole, Ethan Perez, Jay Yoon Lee, Lizhen Tan, Lazaros Polymenakos, and Andrew McCallum. 2021. Case-based reasoning for natural language queries over knowledge bases. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9594–9611, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Diefenbach et al. (2017a) Dennis Diefenbach, Kamal Singh, and Pierre Maret. 2017a. Wdaqua-core0: A question answering component for the research community. In Semantic Web Challenges: 4th SemWebEval Challenge at ESWC 2017, Portoroz, Slovenia, May 28-June 1, 2017, Revised Selected Papers, pages 84–89. Springer.
Diefenbach et al. (2017b) Dennis Diefenbach, Thomas Pellissier Tanon, Kamal Deep Singh, and Pierre Maret. 2017b. Question answering benchmarks for wikidata. In Proceedings of the ISWC 2017 Posters & Demonstrations and Industry Tracks co-located with 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, October 23rd - to - 25th, 2017.
Dubey et al. (2019) Mohnish Dubey, Debayan Banerjee, Abdelrahman Abdelkawi, and Jens Lehmann. 2019. Lc-quad 2.0: A large dataset for complex question answering over wikidata and dbpedia. In The Semantic Web – ISWC 2019: 18th International Semantic Web Conference, Auckland, New Zealand, October 26–30, 2019, Proceedings, Part II, page 69–78, Berlin, Heidelberg. Springer-Verlag.
Fahl et al. (2022) Wolfgang Fahl, Tim Holzheim, Andrea Westerinen, Christoph Lange, and Stefan Decker. 2022. Getting and hosting your own copy of wikidata.
Farda-Sarbas and Müller-Birn (2019) Mariam Farda-Sarbas and Claudia Müller-Birn. 2019. Wikidata from a research perspective – a systematic mapping study of wikidata. Preprint, arXiv:1908.11153.
Gu et al. (2021) Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su. 2021. Beyond i.i.d.: Three levels of generalization for question answering on knowledge bases. In Proceedings of the Web Conference 2021, WWW ’21, page 3477–3488, New York, NY, USA. Association for Computing Machinery.
Gu and Su (2022) Yu Gu and Yu Su. 2022. ArcaneQA: Dynamic program induction and contextualized encoding for knowledge base question answering. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1718–1731, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Holtzman et al. (2019) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
Kamalloo et al. (2023) Ehsan Kamalloo, Nouha Dziri, Charles Clarke, and Davood Rafiei. 2023. Evaluating open-domain question answering in the era of large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5591–5606, Toronto, Canada. Association for Computational Linguistics.
Kapsalis (2019) Effie Kapsalis. 2019. Wikidata: Recruiting the crowd to power access to digital archives. Journal of Radio & Audio Media, 26(1):134–142.
Keysers et al. (2020) Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. 2020. Measuring compositional generalization: A comprehensive method on realistic data. In International Conference on Learning Representations.
Korablinov and Braslavski (2020) Vladislav Korablinov and Pavel Braslavski. 2020. Rubq: A russian dataset for question answering over wikidata. In The Semantic Web – ISWC 2020, pages 97–110, Cham. Springer International Publishing.
Lan and Jiang (2020) Yunshi Lan and Jing Jiang. 2020. Query graph generation for answering multi-hop complex questions from knowledge bases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 969–974, Online. Association for Computational Linguistics.
Lehmann et al. (2015) Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. Dbpedia - A large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web, 6(2):167–195.
Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
Liu et al. (2024) Jiaxiang Liu, Tong Zhou, Yubo Chen, Kang Liu, and Jun Zhao. 2024. Enhancing large language models with pseudo- and multisource- knowledge graphs for open-ended question answering. Preprint, arXiv:2402.09911.
lu Yu and Qiao (2017) Xue lu Yu and Lin Qiao. 2017. Meronymy relation extraction based on 3-motif in wikidata. DEStech Transactions on Computer Science and Engineering.
Luo et al. (2024a) Haoran Luo, Haihong E, Zichen Tang, Shiyao Peng, Yikai Guo, Wentai Zhang, Chenghao Ma, Guanting Dong, Meina Song, and Wei Lin. 2024a. ChatKBQA: A generate-then-retrieve framework for knowledge base question answering with fine-tuned large language models.
Luo et al. (2018) Kangqi Luo, Fengli Lin, Xusheng Luo, and Kenny Zhu. 2018. Knowledge base question answering via encoding of complex query graphs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2185–2194, Brussels, Belgium. Association for Computational Linguistics.
Luo et al. (2024b) Linhao Luo, Yuan-Fang Li, Reza Haf, and Shirui Pan. 2024b. Reasoning on graphs: Faithful and interpretable large language model reasoning. In The Twelfth International Conference on Learning Representations.
Mavromatis and Karypis (2022) Costas Mavromatis and George Karypis. 2022. ReaRev: Adaptive reasoning for question answering over knowledge graphs. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2447–2458, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
MediaWiki (2024a) MediaWiki. 2024a. Wikidata query service/user manual — mediawiki,. [Online; accessed 13-June-2024].
MediaWiki (2024b) MediaWiki. 2024b. Wikidata: Sparql query service/wdqs graph split. Accessed: 2024-06-12.
Mitraka et al. (2015) Elvira Mitraka, Andra Waagmeester, Sebastian Burgstaller-Muehlbacher, Lynn M. Schriml, Andrew I. Su, and Benjamin M. Good. 2015. Wikidata: A platform for data integration and dissemination for the life sciences and beyond. bioRxiv.
Oren et al. (2021) Inbar Oren, Jonathan Herzig, and Jonathan Berant. 2021. Finding needles in a haystack: Sampling structurally-diverse training sets from synthetic data for compositional generalization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10793–10809, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Perevalov et al. (2022) Aleksandr Perevalov, Dennis Diefenbach, Ricardo Usbeck, and Andreas Both. 2022. Qald-9-plus: A multilingual dataset for question answering over dbpedia and wikidata translated by native speakers. In 2022 IEEE 16th International Conference on Semantic Computing (ICSC), pages 229–234.
Pfundner et al. (2015) Alexander Pfundner, Tobias Schönberg, John Horn, Richard D Boyce, and Matthias Samwald. 2015. Utilizing the wikidata system to improve the quality of medical content in wikipedia in diverse languages: a pilot study. Journal of Medical Internet Research, 17(5):e110.
Rybin et al. (2021) Ivan Rybin, Vladislav Korablinov, Pavel Efimov, and Pavel Braslavski. 2021. Rubq 2.0: An innovated russian question answering dataset. In The Semantic Web, pages 532–547, Cham. Springer International Publishing.
Scharpf et al. (2021) Philipp Scharpf, Moritz Schubotz, and Bela Gipp. 2021. Mathematics in wikidata. In International Semantic Web Conference (ISWC) 2021. Virtual event.
SciPy (2024) SciPy. 2024. scipy.optimize.linear_sum_assignment. Accessed: 2024-07-11.
Semnani et al. (2023) Sina Semnani, Violet Yao, Heidi Zhang, and Monica Lam. 2023. WikiChat: Stopping the hallucination of large language model chatbots by few-shot grounding on Wikipedia. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2387–2413, Singapore. Association for Computational Linguistics.
Sen et al. (2021) Priyanka Sen, Armin Oliya, and Amir Saffari. 2021. Expanding end-to-end question answering on differentiable knowledge graphs with intersection. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8805–8812, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Shenoy et al. (2022) Kartik Shenoy, Filip Ilievski, Daniel Garijo, Daniel Schwabe, and Pedro Szekely. 2022. A study of the quality of wikidata. Journal of Web Semantics, 72:100679.
Shivashankar et al. (2022) Kanchan Shivashankar, Khaoula Benmaarouf, and Nadine Steinmetz. 2022. From graph to graph: Amr to sparql. In Proceedings of the 7th Natural Language Interfaces for the Web of Data (NLIWoD) co-located with the 19th European Semantic Web Conference (ESWC 2022), Hersonissos, Greece, 29th May.
Sigalov and Nachmias (2023) Shani Evenstein Sigalov and Rafi Nachmias. 2023. Investigating the potential of the semantic web for education: Exploring wikidata as a learning platform. Education and Information Technologies, pages 1 – 50.
Sorokin and Gurevych (2018) Daniil Sorokin and Iryna Gurevych. 2018. Modeling semantics with gated graph neural networks for knowledge base question answering. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3306–3317, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Sun et al. (2019) Haitian Sun, Tania Bedrax-Weiss, and William Cohen. 2019. PullNet: Open domain question answering with iterative retrieval on knowledge bases and text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2380–2390, Hong Kong, China. Association for Computational Linguistics.
Sun et al. (2018) Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. 2018. Open domain question answering using early fusion of knowledge bases and text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4231–4242, Brussels, Belgium. Association for Computational Linguistics.
Sun et al. (2024a) Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung-Yeung Shum, and Jian Guo. 2024a. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. In The Twelfth International Conference on Learning Representations.
Sun et al. (2024b) Lei Sun, Zhengwei Tao, Youdi Li, and Hiroshi Arakawa. 2024b. Oda: Observation-driven agent for integrating llms and knowledge graphs. Preprint, arXiv:2404.07677.
Talmor and Berant (2018) Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 641–651, New Orleans, Louisiana. Association for Computational Linguistics.
Tan et al. (2023) Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, Yongrui Chen, and Guilin Qi. 2023. Can chatgpt replace traditional kbqa models? an in-depth analysis of the question answering performance of the gpt llm family. Preprint, arXiv:2303.07992.
To and Reformat (2020) Nhuan D. To and Marek Reformat. 2020. Question-answering system with linguistic terms over rdf knowledge graphs. In 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 4236–4243.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. Preprint, arXiv:2302.13971.
Turki et al. (2023) Houcemeddine Turki, Mohamed Ali Hadj Taieb, Mohamed Ben Aouicha, Lane Rasberry, and Daniel Mietchen. 2023. [novel] ten years of wikidata: A bibliometric study. In The 4th Wikidata Workshop.
Turki et al. (2017) Houcemeddine Turki, Denny Vrandecic, Helmi Hamdi, and Imed Adel. 2017. Using wikidata as a multi-lingual multi-dialectal dictionary for arabic dialects. In 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA), pages 437–442.
Usbeck et al. (2018) Ricardo Usbeck, Ria Hari Gusmita, Axel-Cyrille Ngonga Ngomo, and Muhammad Saleem. 2018. 9th challenge on question answering over linked data (qald-9) (invited paper). In Semdeep/NLIWoD@ISWC.
Usbeck et al. (2017) Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Bastian Haarmann, Anastasia Krithara, Michael Röder, and Giulio Napolitano. 2017. 7th open challenge on question answering over linked data (qald-7). In Semantic web evaluation challenge, pages 59–69. Springer.
Usbeck et al. (2023) Ricardo Usbeck, Xi Yan, Aleksandr Perevalov, Longquan Jiang, Julius Schulz, Angelie Kraft, Cedric Möller, Junbo Huang, Jan Reineke, Axel-Cyrille Ngonga Ngomo, Muhammad Saleem, and Andreas Both. 2023. Qald-10 – the 10th challenge on question answering over linked data. Semantic Web, semantic-web/sw233471.
Verga et al. (2021) Pat Verga, Haitian Sun, Livio Baldini Soares, and William Cohen. 2021. Adaptable and interpretable neural MemoryOver symbolic knowledge. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3678–3691, Online. Association for Computational Linguistics.
Vrandečić and Krötzsch (2014) Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10):78–85.
Xiong et al. (2024) Guanming Xiong, Junwei Bao, and Wen Zhao. 2024. Interactive-kbqa: Multi-turn interactions for knowledge base question answering with large language models. Preprint, arXiv:2402.15131.
Xu et al. (2023) Silei Xu, Shicheng Liu, Theo Culhane, Elizaveta Pertseva, Meng-Hsi Wu, Sina Semnani, and Monica Lam. 2023. Fine-tuned LLMs know more, hallucinate less with few-shot sequence-to-sequence semantic parsing over Wikidata. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5778–5791, Singapore. Association for Computational Linguistics.
Yao et al. (2023) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR).
Ye et al. (2022) Xi Ye, Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou, and Caiming Xiong. 2022. RNG-KBQA: Generation augmented iterative ranking for knowledge base question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6032–6043, Dublin, Ireland. Association for Computational Linguistics.
Yih et al. (2015) Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1321–1331, Beijing, China. Association for Computational Linguistics.
Yih et al. (2016) Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 201–206, Berlin, Germany. Association for Computational Linguistics.
Yu et al. (2023) Donghan Yu, Sheng Zhang, Patrick Ng, Henghui Zhu, Alexander Hanbo Li, Jun Wang, Yiqun Hu, William Yang Wang, Zhiguo Wang, and Bing Xiang. 2023. DecAF: Joint decoding of answers and logical forms for question answering over knowledge bases. In The Eleventh International Conference on Learning Representations.
Zang et al. (2020) Xiaoxue Zang, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, and Jindong Chen. 2020. MultiWOZ 2.2 : A dialogue dataset with additional annotation corrections and state tracking baselines. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 109–117, Online. Association for Computational Linguistics.
Zhang et al. (2024) Heidi C. Zhang, Sina J. Semnani, Farhad Ghassemi, Jialiang Xu, Shicheng Liu, and Monica S. Lam. 2024. Spaghetti: Open-domain question answering from heterogeneous data sources with retrieval and semantic parsing. Preprint, arXiv:2406.00562.
Zhao (2022) Fudie Zhao. 2022. A systematic review of Wikidata in Digital Humanities projects. Digital Scholarship in the Humanities, 38(2):852–874.

Appendix A Definition of metrics used in Table 1

For each query, we define the number of clauses as the number of atomic nodes of a SPARQL abstract syntax tree (AST), where an atomic node is one of: a projection clause (i.e., SELECT ... WHERE), a single subject-relation-object clause, a group by clause, a having clause, a filter clause, a sorting clause, a MINUS clause, and a join clause. The number of projections is defined as the number of fields in the final SELECT ... WHERE clause. The number of relations is defined as the number of subject-relation-object clauses. The number of subjects is defined as the number of unique subjects (either variables or entities) across all subject-relation-object clauses. The number of predicates is defined as the number of unique properties (i.e., PIDs of the form P...). The number of objects is defined as the number of unique objects (either variables or entities) across all subject-relation-object clauses, where we also count each y occurring in a VALUES clause of the form VALUES ?x {y}. The number of literals is defined as the number of unique strings (e.g. “wikidata”) or numbers (e.g. 123.45).

Similar to prior work (Gu et al., 2021), we use hand-crafted rules (including regular expressions) to retrieve these metrics.
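As a rough illustration of what such rules could look like, the sketch below counts two of the metrics (unique predicates and unique literals) with regular expressions. The function name `count_query_stats`, the chosen prefixes, and the patterns are assumptions made for this example and are not the rules used to produce Table 1.

```python
import re

def count_query_stats(sparql: str) -> dict:
    """Rough, regex-based counts of unique predicates and literals (illustrative only)."""
    # Unique property IDs (P...) used with common Wikidata prefixes.
    predicates = set(re.findall(r"\b(?:wdt|p|ps|pq):(P\d+)\b", sparql))
    # Unique double-quoted string literals (escaped quotes are tolerated).
    string_pattern = r'"((?:[^"\\]|\\.)*)"'
    strings = set(re.findall(string_pattern, sparql))
    # Blank out strings so digits inside them are not counted as numbers.
    without_strings = re.sub(string_pattern, " ", sparql)
    # Standalone numbers; digits inside IDs such as Q5 or P31 are skipped.
    numbers = set(
        re.findall(r"(?<![\w:?.])-?\d+(?:\.\d+)?(?![\w.])", without_strings)
    )
    return {"predicates": len(predicates), "literals": len(strings) + len(numbers)}

# Hypothetical example query, not taken from the dataset.
example = """
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q5 ;
        wdt:P569 ?dob ;
        p:P39 ?statement .
  ?statement ps:P39 wd:Q11696 .
  ?item rdfs:label "wikidata"@en .
  FILTER(YEAR(?dob) > 1950)
}
"""
stats = count_query_stats(example)
```

A full implementation would also need rules for the remaining metrics (clauses, projections, subjects, objects), which depend more heavily on the query structure than on token patterns.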

Appendix B Additional details on dataset collection

B.1 Additional Details on Annotation

Expert annotations: The Spinach dataset is annotated by 3 authors of this paper who are experts with extensive knowledge of SPARQL and Wikidata. The 3 experts first engaged in a long discussion with specific examples to standardize the annotation procedure. 2 experts then annotated the dev and test sets, with the 3rd expert double-checking and validating the annotations.

LLM suggestions: To facilitate the annotation process, we used GPT-4o to generate preliminary annotation suggestions for the experts. For each example, the experts are shown (1) the original SPARQL query with properties and entities substituted with labels, (2) a link to the specific forum discussion, (3) two LLM-suggested verbalizations (one more verbose and one more natural), and (4) an LLM suggestion of whether to modify, include, or exclude the query, along with its reasoning. For each data example, the experts executed numerous intermediate SPARQL queries on the Wikidata site to verify the query’s validity and finalize the annotated SPARQL.

Automatic exclusion of mwapi in queries: Some SPARQL queries on the forum make use of mwapi.⁸ [Footnote 8: https://en.wikibooks.org/wiki/SPARQL/SERVICE_-_mwapi] These APIs are mostly used for optimization or are otherwise very specific to the structure of Wikimedia sites. We note that removing these APIs sometimes leads to small changes in the results (most likely due to discrepancies between outdated API outputs and the outputs obtained directly from Wikidata). To standardize the dataset and avoid issues during evaluation, we use regex to delete all occurrences of mwapi in the SPARQL queries before executing them.
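A minimal sketch of this kind of regex-based removal is shown below. The pattern `MWAPI_BLOCK` and the helper `strip_mwapi` are assumptions for illustration; the sketch only handles `SERVICE wikibase:mwapi { ... }` blocks without nested braces and is not necessarily the expression used to build the dataset.

```python
import re

# Assumed pattern: a SERVICE wikibase:mwapi block whose body has no nested braces.
MWAPI_BLOCK = re.compile(
    r"SERVICE\s+wikibase:mwapi\s*\{[^{}]*\}\s*\.?",
    re.IGNORECASE,
)

def strip_mwapi(sparql: str) -> str:
    """Delete every matched mwapi SERVICE block from the query text."""
    return MWAPI_BLOCK.sub("", sparql)

# Hypothetical query used only to demonstrate the mechanical step.
query = """SELECT ?item WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org" ;
                    wikibase:api "EntitySearch" ;
                    mwapi:search "spinach" ;
                    mwapi:language "en" .
    ?item wikibase:apiOutputItem mwapi:item .
  }
  ?item wdt:P31 wd:Q5 .
}"""
cleaned = strip_mwapi(query)
print(cleaned)
```

Because deleting such a block can change which variables are bound, the cleaned query still needs to be executed and checked, in line with the observation above that results may change slightly.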

B.2 Examples of queries modified

The following are examples of how we modify target SPARQLs.

Wikimedia presentation queries: This discussion, titled “Two-letter genera, on Wikispecies” in July of 2020, involves the following snippet, which retrieves information from species.wikimedia.org for certain taxa:

⬇

?wikispecies schema:about ?item .

?wikispecies schema:isPartOf <https://species.wikimedia.org/> .

This, however, is very specific to the structure of different Wikimedia sites. Removing these two clauses does not modify the core parts of the meaning. We thus remove these two clauses and the corresponding target ?wikispecies in the projection. The resulting SPARQL is then included in the Spinach validation set.

Queries obscured by optimizations: This discussion, titled “Persons from a certain time period and country” in September of 2020, contains a query that makes use of the following optimization:

⬇

hint:Prior hint:rangeSafe true .

which is used to speed up the succeeding filter. Removing this optimization in this case does not significantly increase the SPARQL execution time. We thus remove this clause and include the example in the Spinach validation set.

Formatting clauses: This discussion, titled “Query all taxons which are trees”, results in a SPARQL of the following structure:

⬇

SELECT

?taxon ?sample (GROUP_CONCAT(DISTINCT str(?commonname); separator = "//") as ?commonnames)

WHERE

{

...

}

GROUP BY ?taxon ?sample

Using a group by, the query concatenates all common names into one single string, separated by "//". However, it is very difficult to accurately capture this in natural language, and a system could return the results in different orders, raising issues for evaluation. We thus exclude this concatenation and instead change it to a counting operation:

⬇

SELECT

?taxon ?sample (COUNT(?commonname) as ?commonnamecount)

WHERE

{

...

}

GROUP BY ?taxon ?sample

The resulting SPARQL is then included in the Spinach validation set.

B.3 Examples of queries excluded

The following are examples of when, after modifications, no meaningful part of the SPARQL remains. The discussions are thus excluded from our dataset.

Wikimedia presentation queries: The discussion, titled “?item =wd:QXXX” in April of 2020, results in a SPARQL that only tries to fetch Wikipedia pages that are about the item “Gambling, Gods And LSD”. The usage of the schema:about property is very specific to Wikimedia and could cause confusion about the meaning of “about”. It is thus excluded. The discussion, titled “Bengali Wikipedia articles with no Wikidata statement” in March of 2018, results in a SPARQL that tries to fetch Wikidata items with no statements that are the topic of Bengali Wikipedia articles. The wikibase:statements, schema:about, schema:isPartOf, and wikibase:sitelinks properties are all very specific to the structure of Wikimedia sites and can cause confusion for KBQA systems. It is thus excluded.

Questions on complex SPARQL code: This discussion, titled “Scatterplot query” in March of 2021, is an example where the original requester comes in with a complicated SPARQL whose meaning is already difficult to accurately express in English. The final SPARQL from the conversation (hyperlinked in “a bit more baroque, with axes”) only adds to the complexity.

Queries obscured by optimizations: This discussion, titled “P31/wdt:P279* wd:Q16917 in wdt:P131/wdt:P131* wd:Q25 (query optimization)”, involves only the refactoring of a query into two sub-queries, which are then joined together to avoid timeouts. The two queries are semantically equivalent. Due to reproducibility challenges from this refactoring, this discussion is excluded.

Formatting clauses: The discussion, titled “preferred format for id” in February of 2022, only contains discussions on different ways of processing strings in SPARQL and is thus excluded. The discussion, titled “Custom link formatting in WDQS results” in June of 2018, only results in a SPARQL that focuses on converting string formats after minimal use of Wikidata (fetching the sitelinks of one item) and is thus excluded.

Appendix C System and Evaluation Details

Unless otherwise specified, LLMs are used with greedy decoding, i.e. with $temperature=0$, with the exception of the Spinach agent’s policy prompt, which is run with $temperature=1$ and nucleus sampling (Holtzman et al., 2019) with $p=0.9$.
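For readers unfamiliar with nucleus sampling, the following sketch shows the idea on a toy next-token distribution: only the smallest set of highest-probability tokens whose cumulative mass reaches $p$ is kept, and one token is sampled from that set. The function `nucleus_sample` and the toy probabilities are assumptions for illustration; in our experiments the sampling is performed by the LLM provider, not by code like this.

```python
import random

def nucleus_sample(probs, p=0.9, rng=random):
    """Sample an index from `probs`, restricted to the top-p nucleus."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:  # stop once the cumulative mass reaches p
            break
    # Sample within the nucleus, proportionally to the original probabilities.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

# Toy distribution over four tokens; with p=0.9 the last token is never chosen.
toy_probs = [0.5, 0.3, 0.15, 0.05]
token = nucleus_sample(toy_probs, p=0.9, rng=random.Random(0))
```

With $temperature=1$ the model's probabilities are used unchanged, so no rescaling step appears in the sketch.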

The ToG agent by default builds a local version of Wikidata using the simple-wikidata-db library.⁹ [Footnote 9: https://github.com/neelguha/simple-wikidata-db] However, building this index from scratch is extremely computationally expensive. Sun et al. (2024b) report that they needed to deploy the Wikidata dump across five AWS EC2 instances, each consisting of a 768GB machine with 48 cores. Fahl et al. (2022) report various other methods of hosting Wikidata locally. We attempted to use QLever as the SPARQL engine but failed to do so on the latest Wikidata dump. For the evaluation of ToG, we re-implemented the same logic using dynamic Wikidata API calls.

We adopted the same hyperparameters for ToG as used in the original paper. For the experiment in Section 5.3, we used D=3 and T=3 as the hyperparameters for graph exploration.

The GPT-4 and ToG systems return entity labels as strings rather than entity IDs; we therefore convert the gold SPARQL output to contain only English labels before calculating the EM and F1 metrics.
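The sketch below illustrates a label-based comparison of this kind in a simplified, set-based form. The function `f1_and_em` and the case-insensitive normalization are assumptions for illustration and do not necessarily match the exact metric implementation used in our evaluation.

```python
def f1_and_em(predicted_labels, gold_labels):
    """Set-based F1 and exact match over answer labels (simplified sketch)."""
    # Assumed normalization: trim whitespace and compare case-insensitively.
    pred = {label.strip().casefold() for label in predicted_labels}
    gold = {label.strip().casefold() for label in gold_labels}
    if not pred and not gold:
        return 1.0, True  # both empty: treated as a perfect match
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return f1, pred == gold

# Hypothetical system output compared against hypothetical gold labels.
f1, em = f1_and_em(["Barack Obama", "Joe Biden"], ["barack obama"])
```

Here one of the two predicted labels matches the single gold label, so precision is 0.5, recall is 1.0, and the exact match fails.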

Appendix D Prompts used in the Spinach system

⬇

# instruction

Your task is to write a Wikidata SPARQL query to answer the given question. Follow a step-by-step process:

1. Start by constructing very simple fragments of the SPARQL query.

2. Execute each fragment to verify its correctness. Adjust as needed based on your the observations.

3. Confirm all your assumptions about the structure of Wikidata before proceeding.

4. Gradually build the complete SPARQL query by adding one piece at a time.

5. Do NOT repeat the same action, as the results will be the same.

6. The question is guaranteed to have an answer in Wikidata, so continue until you find it.

7. If the user is asking a True/False question with only one answer, use ASK WHERE to fetch a True/False answer at the very end.

8. In the final SPARQL projections, do not only ask for labels. Ask for the actual entities whenever needed (e.g. instead of doing ‘SELECT xLabel‘, do ‘SELECT x‘).

9. If the final result was contained in last round’s ‘get_wikidata_entry‘ and you are ready to stop, use ‘execute_sparql‘ and generate a SPARQL to retrieve that results.

Form exactly one "Thought" and perform exactly one "Action", then wait for the "Observation".

Possible actions are:

- get_wikidata_entry(QID): Retrieves all outgoing edges (linked entities, properties, and qualifiers) of a specified Wikidata entity using its QID.

- search_wikidata(string): Searches Wikidata for entities or properties matching the given string.

- get_property_examples(PID): Provides a few examples demonstrating the use of the specified property (PID) in Wikidata.

- execute_sparql(SPARQL): Runs a SPARQL query on Wikidata and returns a truncated result set for brevity.

- stop(): Marks the last executed SPARQL query as the final answer and ends the process.

# input

Question: {{ question }}

{% if action_history %}

{% for i in range(0, action_history|length) %}

{% endfor %}

{% endif %}

Output one "Thought" and one "Action":

Table 4: The policy prompt of the Spinach agent.
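The interaction pattern described by this prompt (one thought, one action, one observation per turn, with `stop()` selecting the last executed SPARQL) can be sketched as a small control loop. This is not the released implementation: `parse_action`, `run_episode`, the `policy` and `tools` callables, and the `max_steps` cap are hypothetical names and parameters introduced for illustration. Only the five action names come from the prompt.

```python
import ast
import re

# Action names listed in the policy prompt.
ACTIONS = {
    "get_wikidata_entry",
    "search_wikidata",
    "get_property_examples",
    "execute_sparql",
    "stop",
}


def parse_action(text):
    """Split an action such as search_wikidata("x") into (name, argument)."""
    cleaned = text.strip().lstrip(">").strip()  # tolerate a leading ">>" marker
    match = re.fullmatch(r"(\w+)\((.*)\)", cleaned, flags=re.DOTALL)
    if match is None or match.group(1) not in ACTIONS:
        raise ValueError(f"unrecognized action: {text!r}")
    name, raw_arg = match.group(1), match.group(2).strip()
    # The argument is a quoted (possibly triple-quoted) string literal, or empty.
    arg = ast.literal_eval(raw_arg) if raw_arg else None
    return name, arg


def run_episode(policy, tools, max_steps=15):
    """Run thought/action/observation turns until stop() or the step cap.

    policy(history) returns (thought, action_text); tools maps each action
    name to a callable that returns an observation. Both are assumptions.
    """
    history, last_sparql = [], None
    for _ in range(max_steps):
        thought, action_text = policy(history)
        name, arg = parse_action(action_text)
        if name == "stop":
            break  # the last executed SPARQL becomes the final answer
        if name == "execute_sparql":
            last_sparql = arg
        observation = tools[name](arg)
        history.append((thought, action_text, observation))
    return last_sparql, history
```

Keeping the full `history` and passing it back to the policy on every turn corresponds to the `action_history` variable that the prompt template iterates over.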

⬇

# instruction

At each turn, you are given a Wikidata entry and a question.

You want to write a SPARQL query that answers the question.

As the first step, remove the parts of the Wikidata entry that could not be potentially helpful when writing the SPARQL.

The output should be a json object containing part of the Wikidata entry.

# few-shot example 1, input

Wikidata entry for OneRepublic (Q1438730, ’OneRepublic’ is an American pop rock band formed in Colorado Springs, Colorado, in 2002):

{

"instance of (P31)": "musical group (Q215380)",

...

"social media followers (P8687)": {

"3134158": {

"Qualifiers": [

{

"point in time (P585)": "4 February 2023"

}

]

"3276596": {

"Qualifiers": [

{

"point in time (P585)": "6 January 2021"

}

]

"3178896": {

"Qualifiers": [

{

"point in time (P585)": "2 March 2022"

}

]

"3720919": {

"Qualifiers": [

{

"point in time (P585)": "10 May 2018"

}

]

}

Question: "What is the title of the second single on OneRepublic’s third album Native?"

# few-shot example 1, output

{

"instance of (P31)": "musical group (Q215380)",

"record label (P264)": "Columbia Records (Q183387)",

"discography (P358)": "OneRepublic discography (Q935670)",

}

# few-shot example 2, input

Wikidata entry for Barack Obama Sr. (Q649593, Barack Obama Sr. was an economist and the father of Barack Obama jr.):

{

"place of birth (P19)": {

"Nyang’oma Kogelo (Q1027168)": {

"Qualifiers": [

{

"country (P17)": "Kenya Colony"

}

]

}

...

}

Question: "where did barack obama sr. attend school?"

# few-shot example 2, output

{

"place of birth (P19)": {

"Nyang’oma Kogelo (Q1027168)": {

"Qualifiers": [

{

"country (P17)": "Kenya Colony"

}

]

}

"place of death (P20)": "Nairobi (Q3870)",

...

}

# input

Wikidata entry for "{{ entity_and_description }}":

Question: "{{ question }}"

Table 5: The 2-shot prompt used to prune the output of the get_wikidata_entry() action. Parts of the few-shot example inputs are omitted for space.

Appendix E Full Example Trace of the Spinach Agent

Listing 1 contains a full sequence of thoughts, actions and observations of the Spinach agent for the given input question.

⬇

"""

Question: SteplanI hav

"""

Thought: First, I need to identify the properties and entities related to people affiliated with the University of Washington School of Music, their educational background, and the musical instruments they play. This will help in formulating the final SPARQL query.

>> search_wikidata("University of Washington School of Music")

Observation: University of Washington School of Music (Q98035717): school of music

Thought: Now that I have identified the entity for the University of Washington School of Music, I need to find the properties that can help me identify people affiliated with it and the properties related to their education and musical instruments.

>> get_wikidata_entry("Q98035717")

Observation:

{

instance of (P31): {

faculty (Q180958): {

Qualifiers: [

{

of (P642): University of Washington (Q219563)

}

]

music school (Q1021290): {}

field of work (P101): [

higher education (Q136822),

postgraduate education (Q141212),

musicology (Q164204),

music (Q638),

research (Q42240),

ethnomusicology (Q208365)

located in the administrative territorial entity (P131): Seattle (Q5083),

location (P276): Music Building (Q98690890),

has subsidiary (P355): University of Washington Laboratory for Music Cognition, Culture & Learning (Q101157561),

part of (P361): University of Washington College of Arts and Sciences (Q7896566),

has part(s) (P527): University of Washington Laboratory for Music Cognition, Culture & Learning (Q101157561),

parent organization (P749): University of Washington College of Arts and Sciences (Q7896566)

}

Thought: To find people affiliated with the University of Washington School of Music, I need to look for properties that define such an affiliation. Additionally, I need to identify properties related to their education at the University of Washington and the musical instruments they play. I will start by searching for the property that links people to their affiliations.

>> search_wikidata("affiliation")

Observation:

- affiliation (Q46135267)

connection, formally or informally, with an organization or intellectual movement

- filiation (Q5861944)

system for recognizing the lineage or kinship of a person in anthropology

- affiliation (Q2620373)

archaic term for the legal establishment of paternity

- affiliated entity (Q69467106)

inverse property label item for P1416

- Affiliation (Q93774359)

scholarly article

- Affiliations Committee (Q46815761)

Wikimedia community-run committee

- Wikimedia Affiliations Committee wiki (Q33120941)

Wikimedia private wiki

- Affiliation and participation (Q107433952)

master’s thesis of Hannah Marie Mayfield

- affiliation (P1416)

organization that a person or organization is affiliated with (not necessarily member of or employed by)

Data Type: wikibase-item

- affiliation string (P6424)

qualifier to provide the published string form of affiliation attached to an author

Data Type: string

- Olympedia affiliations ID (P9447)

identifier for the affiliation of athletes in Olympedia athlete database

Data Type: external-id (Represents an identifier used in an external system.)

Thought: The property ‘affiliation (P1416)’ seems to be the most relevant for linking people to the University of Washington School of Music. Next, I need to identify the property for their education at the University of Washington and the property that links people to the musical instruments they play.

>> search_wikidata("educated at")

Observation:

- educated at (P69)

educational institution attended by subject

Data Type: wikibase-item

Thought: I have identified the property ‘educated at (P69)’ to link people to the University of Washington. Next, I need to find the property that links people to the musical instruments they play.

>> search_wikidata("musical instrument")

Observation:

- musical instrument (Q34379)

device created or adapted to make musical sounds

- heraldic musical instrument (Q56877088)

category of heraldic charges

- Musical instrument (Q102413357)

Oil of canvas by Alla Grigoryan

- Musical Instrument (Q102413736)

painting by Musical instrument

- Musical instrument (Q86150836)

sculpture in the church Brother Klaus in the city of Bern, Switzerland

- Musical Instrument (Q80002888)

musical instrument by unknown artist (1929.322)

- Musical instrument (Q123046258)

US patent 11482197

- musical instrument maker (Q1955150)

manufacturer of musical instruments

- instrument (P1303)

musical instrument that a person plays or teaches or used in a music occupation

Data Type: wikibase-item

- MIMO instrument ID (P3763)

identifier for a musical instrument, on the instrument keyword thesaurus from Musical Instruments Museums Online

Data Type: external-id (Represents an identifier used in an external system.)

Thought: I have identified the relevant properties: ‘affiliation (P1416)’ to link people to the University of Washington School of Music, ‘educated at (P69)’ to link people to the University of Washington, and ‘musical instrument (P1303)’ to identify the instruments they play. I will now construct a SPARQL query that combines these properties to get the required information.

>> execute_sparql("""

SELECT ?instrument (COUNT(*) AS ?count) WHERE {

?person wdt:P69 wd:Q1063349;

wdt:P1416 wd:Q98035717;

wdt:P1303 ?instrument.

}

GROUP BY ?instrument

"""

)

Observation:

Query returned empty result.

Thought: It appears that the query returned an empty result. This could be due to several reasons: either there are no people matching all criteria, or the properties might not be used exactly as I assumed. To ensure accuracy, I will individually verify if there are people affiliated with the University of Washington School of Music who have also been educated at the University of Washington.

>> execute_sparql(

"""

SELECT ?person WHERE {

?person wdt:P69 wd:Q1063349;

wdt:P1416 wd:Q98035717.

}

LIMIT 10

"""

)

Observation: Query returned empty result.

Thought: The absence of results could indicate that the properties or the specific entities used in the query might not be correct or commonly utilized in this context. I will check if there are people who have been educated at the University of Washington (Q1063349) and then see if I can refine the affiliation aspect.

>> execute_sparql(

"""

SELECT ?person WHERE { ?person wdt:P69 wd:Q1063349. }

LIMIT 10

"""

)

Observation: Query returned empty result.

Thought: Since my previous queries returned empty results, I need to verify if there are any individuals in Wikidata who have been educated at the University of Washington (Q1063349). This will help me confirm if I am using the correct properties and entities.

>> get_wikidata_entry("Q1063349")

Observation:

{

country (P17): Russia (Q159),

instance of (P31): hamlet (Q5084),

located in the administrative territorial entity (P131): Tavtimanovsky selsoviet (Q4449505),

population (P1082): {

11: {

Qualifiers: [

{

point in time (P585): 1 January 2010,

determination method (P459): census (Q39825)

{

determination method (P459): census (Q39825),

point in time (P585): 1 January 2002

}

]

9: {

Qualifiers: [

{

point in time (P585): 1 January 2009,

determination method (P459): census (Q39825)

}

]

}

Thought: The query results indicate that there may be an issue with the QID I used for the University of Washington (Q1063349). It seems to correspond to a location in Russia rather than the university. I should search for the correct QID for the University of Washington.

>> search_wikidata("University of Washington")

Observation:

- University of Washington (Q219563)

public research university in Seattle, Washington, United States

- University of Washington Collection (NAID 931) (Q59502962)

collection in the National Archives and Records Administration’s holdings

- University of Washington Department of Chemistry (Q97958839)

academic department at the University of Washington

- University of Washington Department of Architecture (Q98844905)

academic department at the University of Washington

- University of Washington Department of Civil and Environmental Engineering (Q97990078)

academic department at the University of Washington

- University of Washington Department of Chemical Engineering (Q99196105)

academic department at the University of Washington

- University of Washington Paul G. Allen School of Computer Science & Engineering (Q98186807)

school of the University of Washington

- University of Washington School of Music (Q98035717)

school of music

Thought: The correct QID for the University of Washington is Q219563. I should now update my SPARQL query to use this correct QID for identifying people who have been educated at the University of Washington and who are affiliated with the University of Washington School of Music.

>> execute_sparql(

"""

SELECT ?person WHERE {

?person wdt:P69 wd:Q219563;

wdt:P1416 wd:Q98035717.

}

LIMIT 10

"""

)

Observation:

| Person |

|--------------|

| Q107692821 |

| Q107705263 |

| Q107706107 |

| Q107707997 |

| Q107708199 |

| Q107708456 |

| Q107720948 |

| Q107773759 |

| Q107782800 |

| Q107789859 |

Thought: Now that I have identified individuals who have been educated at the University of Washington (Q219563) and are affiliated with the University of Washington School of Music (Q98035717), I will update my SPARQL query to find the musical instruments played by these individuals and count the occurrences of each instrument.

>> execute_sparql(

"""

SELECT ?instrument ?instrumentLabel (COUNT(?student) AS ?count) WHERE {

?student wdt:P1303 ?instrument;

wdt:P1416 wd:Q98035717;

wdt:P69 wd:Q219563.

SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }

}

GROUP BY ?instrument ?instrumentLabel

"""

)

Observation:

| instrument | instrumentLabel | count |

|--------------|-----------------|-------|

| Q5994 | piano | 99 |

| Q1467960 | mbira | 2 |

| Q8350 | trombone | 11 |

| Q8338 | trumpet | 8 |

| Q17172850 | voice | 32 |

| ... | ... | ... |

| Q302497 | mandolin | 1 |

| Q187851 | recorder | 1 |

| Q185041 | cor anglais | 1 |

| Q83509 | piccolo | 1 |

Thought: The query successfully returned the musical instruments played by individuals who are affiliated with the University of Washington School of Music and have been educated at the University of Washington, along with the counts of each instrument. I will mark this query as complete.

>> stop()

Listing 1: Full version of the example in Figure 2
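For readers who want to re-run the final query from Listing 1, the sketch below builds a request against the public Wikidata Query Service endpoint using only the Python standard library. It is an assumption that this public endpoint is an appropriate stand-in for the `execute_sparql` action. The `User-Agent` value and the function names `build_wdqs_request` and `run_query` are hypothetical, and results may differ from the listing as Wikidata changes over time.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Final query from the trace above (prefixes are predefined by the query service).
FINAL_QUERY = """
SELECT ?instrument ?instrumentLabel (COUNT(?student) AS ?count) WHERE {
  ?student wdt:P1303 ?instrument;
           wdt:P1416 wd:Q98035717;
           wdt:P69 wd:Q219563.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?instrument ?instrumentLabel
"""


def build_wdqs_request(sparql):
    """Build a GET request that asks the endpoint for SPARQL JSON results."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": sparql})
    headers = {
        "Accept": "application/sparql-results+json",
        "User-Agent": "spinach-trace-example/0.1 (illustrative placeholder)",
    }
    return urllib.request.Request(url, headers=headers)


def run_query(sparql, timeout=60):
    """Send the query over the network and return the list of result bindings."""
    with urllib.request.urlopen(build_wdqs_request(sparql), timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]
```

Calling `run_query(FINAL_QUERY)` would return one binding per instrument, each with `instrument`, `instrumentLabel`, and `count` fields in the standard SPARQL JSON results layout.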