Crafting the Path: Robust Query Rewriting for Information Retrieval

Ingeol Baek, Jimin Lee, Joonho Yang, Hwanhee Lee^$\dagger$
Department of Artificial Intelligence, Chung-Ang University, Seoul, Korea
[email protected], {ljm1690, plm3332, hwanheelee}@cau.ac.kr

Abstract

Query rewriting aims to generate a new query that complements the original query to improve the information retrieval system. Recent studies on query rewriting, such as query2doc (Q2D), query2expand (Q2E), and query2cot (Q2C), rely on the internal knowledge of Large Language Models (LLMs) to generate a relevant passage that adds information to the query. However, the effectiveness of these methods may decline markedly when the required knowledge is not contained in the model's parameters. In this paper, we propose a novel structured query rewriting method called Crafting The Path, tailored for retrieval systems. Crafting The Path involves a three-step process in which each step crafts query-related information needed to find the passages to be retrieved. Specifically, Crafting The Path begins with Query Concept Comprehension, proceeds to Query Type Identification, and finally conducts Expected Answer Extraction. Experimental results show that our method outperforms previous rewriting methods, especially in domains less familiar to LLMs. We demonstrate that our method is less dependent on the internal parameter knowledge of the model and generates queries with fewer factual inaccuracies. Furthermore, we observe that Crafting The Path has lower latency than the baselines.


$\dagger$ Corresponding author.

1 Introduction

In an open-domain QA system, document retrievers are used to retrieve the information needed to answer a given query. Query rewriting reformulates the original query to help the retrieval system find relevant passages. Recent work on query rewriting uses Large Language Models (LLMs) (Brown et al., 2020; OpenAI, 2022, 2023; Touvron et al., 2023) to generate additional information. Specifically, these studies generate a relevant passage for a given query by leveraging the pre-trained knowledge of LLMs, and using the resulting queries significantly increases the performance of retrieval systems. Several LLM-based rewriting methods have recently been introduced, such as query2doc (Q2D) (Wang et al., 2023c), query2expand (Q2E), and query2cot (Q2C) (Jagerman et al., 2023). Q2D generates a pseudo-document based on the original query, which is then used as input for the retriever. Similarly, Q2C employs a Chain-of-Thought (Wei et al., 2022) approach, and Q2E generates semantically equivalent queries. These approaches rewrite the original query into a form similar to the passages in the corpus, and they yield larger performance improvements than using the base query alone.

The fundamental reason for using retrieval systems in open-domain QA is to draw on external knowledge when the QA system lacks the knowledge to generate a correct answer (Lewis et al., 2020). However, relying heavily on the inherent knowledge of LLMs often leads to irrelevant information and numerous factual errors in the reformulated queries. As the examples on the right of Figure 1 show, Q2C asserts that coffee originated in ancient Egypt or Yemen, whereas its actual origin is Ethiopia. Meanwhile, Q2D discusses a legendary story about "goats" without providing any information related to the origin of the word "coffee". Such misinformation and unrelated content in the reformulated queries can significantly degrade retrieval performance.

In this paper, we propose a novel query rewriting method, Crafting The Path, a fine-grained query reformulation technique based on a structured reasoning process. Instead of simply generating a passage similar to the candidate documents, Crafting The Path focuses on identifying what information needs to be found to solve the given query. The method comprises three steps. The first step, Query Concept Comprehension, provides fundamental background knowledge. Offering basic factual information reduces the likelihood of including incorrect information and helps the retrieval system clearly understand the main topic. The second step, Query Type Identification, specifies the required information in order to filter out irrelevant information. Finally, through the Expected Answer Extraction step, the retriever model identifies the essential information it needs to find, facilitating the extraction of accurate passages. This structured, step-by-step process minimizes unnecessary inferences by the model, thereby reducing the possibility of factual errors.

Crafting The Path outperforms all of the baseline methods and produces fewer factual errors, as shown by a 10% higher FActScore (Min et al., 2023) compared to the baselines. Additionally, our approach performs better when the model's internal parameters lack the relevant knowledge. This is evidenced by a 3.57% performance increase on queries that the rewriting model fails to answer correctly in a closed-book QA setting. Furthermore, our method shows 7.3% lower latency compared to the baselines.

2 Related Work

Information Retrieval

Information retrieval is the process of obtaining relevant information from a database based on given queries. The two main approaches to information retrieval are sparse retrieval and dense retrieval. A prominent example of sparse retrieval is BM25 (Robertson and Jones, 1976; Robertson et al., 1995), which serves as a ranking function to evaluate the relevance between a given query and documents. In contrast, dense retrieval (Xiong et al., 2021; Qu et al., 2021) fetches passages that exhibit high similarity to the query using document embeddings. This approach typically uses pre-trained language models such as BERT (Devlin et al., 2018) as the encoder, and some methods fine-tune these encoders (Karpukhin et al., 2020; Wang et al., 2023b). In this work, we improve the performance of both sparse and dense retrieval by focusing on the query rewriting method.

Figure 1: Overview of our proposed query rewriting method Crafting The Path, along with the rewritten query examples of query2doc (Q2D) and query2cot (Q2C) methodologies.

LLM Based Query Rewriting

Query rewriting refers to the task of modifying the original query to improve the search results of information retrieval systems. Recent studies on query rewriting mainly use large language models to create relevant information for the given query. query2doc (Q2D) (Wang et al., 2023c) generates a pseudo-document based on the original query, which is then used as input for the retriever. Similarly, query2cot (Q2C) employs a Chain of Thought (CoT) (Wei et al., 2022) approach, and query2expand (Q2E) (Jagerman et al., 2023) generates semantically equivalent queries. These approaches rewrite the original query into a form similar to the passages in the corpus, which leads to significant performance improvement compared to using the base query alone. Furthermore, Rewrite-Retrieve-Read (Ma et al., 2023) enhances rewriting performance by incorporating reinforcement learning. Another study, ITER-RETGEN (Shao et al., 2023), improves query quality by feeding the query and retrieved documents into a language model for rewriting, followed by a repeated retrieval process. Rephrase and Respond (Deng et al., 2023) argues that for effective rewriting, queries should be rephrased in a manner that is easier for LLMs to understand. For conversational search, Yoon et al. (2024) propose a method that generates a variety of queries and uses the rank of retrieved passages to train the LLM on only the optimal queries. This process is further refined using a DPO (Rafailov et al., 2023) approach to create optimal queries.

Our work has similarities with recent efforts like Q2D and Q2E, particularly in using linguistic techniques to expand queries. However, unlike previous approaches, we focus on generating only the essential information necessary for information retrieval and aim to improve performance by reducing hallucinations in the reformulated query.

3 Crafting The Path

Our proposed rewriting method, Crafting The Path, is composed of the following three steps.

3.1 Query Rewriting via Crafting The Path

3.1.1 Step 1: Query Concept Comprehension

We begin with the Query Concept Comprehension step, which generates additional information that serves as the contextual background for the existing query. This step enriches the direct information about the object the query asks about. For example, in Figure 1, when asking about the origin of "coffee," this step provides a detailed description of "coffee". The Query Concept Comprehension step helps identify the information to be searched for in the next step.

3.1.2 Step 2: Query Type Identification

Based on the specific information obtained from the original query and Step 1, we proceed to the Query Type Identification step. In this step, we generate the type of information needed to retrieve relevant passages. Specifically, we create categories for the query that help filter out irrelevant information. For example, to answer the question about the origin of coffee in Figure 1, one must search for the historical context and etymology of coffee. Following this intuition, Query Type Identification aims to find the type of information that the ground-truth passage is likely to include, as in Figure 1.

3.1.3 Step 3: Expected Answer Extraction

The final step extracts expected answers for the query based on the information generated in the previous steps. As in Figure 1, the details regarding the origin of coffee being Ethiopia and its etymology enable the retriever model to identify the required information, facilitating the extraction of accurate passages.

Input Prompt for Crafting The Path
Instruction: By following the requirements, write 3 steps related to the Query and answer in the same format as the example.
Requirements:
1. In step1, generate the contextual background from the existing query is extracted.
2. In step2, generate what information is needed to solve the question.
3. In step3, generate expected answer based on query, step1, and step2.
4. If you think there is no more suitable answer, end up with ’None’.
Query 1: what is the number one formula one car?
Step 1: Formula One (F1) is the highest class of international automobile racing competition held by the FIA.
Step 2: To know the best car, you have to look at the race records.
Step 3: Red Bull Racing’s RB20 is the best car.
(4-shot examples) …
Query 5:

Table 1: Prompt used for Crafting The Path.

These three distinct steps offer a form of query rewriting that enhances the retrieval of more accurate information and minimizes the inclusion of incorrect information. We implement all of these steps in Crafting The Path with a single LLM call using the prompt in Table 1. Specifically, we provide the role of each step with the examples in the prompts. Additionally, to avoid producing inaccurate information, we instruct the model to generate “None” when it lacks certain knowledge, providing clear guidance for the retriever system’s input.
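The single-call design can be sketched as follows: assemble a few-shot prompt in the style of Table 1, then parse the three steps from the model's completion. This is a minimal illustration, not the authors' implementation. The function names, the regex-based parsing, and the choice to drop steps answered with "None" are all assumptions, and the requirements list from Table 1 is omitted for brevity.

```python
import re

# Instruction line taken from Table 1; the numbered requirements are omitted here.
INSTRUCTION = (
    "Instruction: By following the requirements, write 3 steps related to "
    "the Query and answer in the same format as the example."
)

# Matches lines such as "Step 2: To know the best car, ..."
STEP_RE = re.compile(r"^Step\s*([123]):\s*(.*)$", re.MULTILINE)


def build_prompt(examples, query):
    """Build a few-shot prompt.

    examples: list of (query, [step1, step2, step3]) pairs (hypothetical format).
    """
    lines = [INSTRUCTION]
    for i, (ex_query, steps) in enumerate(examples, start=1):
        lines.append(f"Query {i}: {ex_query}")
        for k, step in enumerate(steps, start=1):
            lines.append(f"Step {k}: {step}")
    # The final query is left open for the model to complete.
    lines.append(f"Query {len(examples) + 1}: {query}")
    return "\n".join(lines)


def parse_steps(completion):
    """Return the texts of Steps 1-3 in order.

    Assumption: a step answered with 'None' is dropped from the result.
    """
    found = {int(n): text.strip() for n, text in STEP_RE.findall(completion)}
    steps = [found.get(k, "") for k in (1, 2, 3)]
    return [s for s in steps if s and s.strip("'\"\u2018\u2019 .").lower() != "none"]


def rewritten_query(completion):
    """Join the surviving steps into one rewritten query string."""
    return " ".join(parse_steps(completion))
```

The output of `rewritten_query` would then serve as the rewritten query passed on to the retriever.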

3.2 Passage Retriever

Constructing Inputs of Retriever. To construct the final query $q^{+}$ for sparse retrieval, we repeat the original query $q$ three times and concatenate it with the rewritten query $QR$, as shown in Eq. 1. For dense retrieval, a [SEP] token is inserted between the query and the $QR$ to differentiate them, as shown in Eq. 2.

$\text{Sparse: }q^{+}=\text{concat(}\{q\}\times\text{3},\ QR\text{)}$ .

(1)

$\text{Dense: }q^{+}=\text{concat(}q,\text{[SEP]},\ QR\text{)}$ .

(2)
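The two input formats in Eq. 1 and Eq. 2 can be sketched as simple string operations. Joining with single spaces and writing the separator as the literal string "[SEP]" are assumptions; an actual tokenizer may insert its own separator token.

```python
def sparse_input(query, rewritten, repeats=3):
    """Eq. 1: repeat the original query `repeats` times, then append QR."""
    return " ".join([query] * repeats + [rewritten])


def dense_input(query, rewritten, sep_token="[SEP]"):
    """Eq. 2: place a separator token between the original query and QR."""
    return f"{query} {sep_token} {rewritten}"
```

Repeating $q$ presumably keeps the original query terms from being outweighed by the longer rewritten text under term-frequency-based scoring such as BM25.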

Training Dense Retriever To train a dense retriever, we utilize the Binary Passage Retrieval (BPR) loss (Yamada et al., 2021) as follows to reduce the memory usage:

$\mathcal{L}_{\text{cand}}=\sum_{j=1}^{n}\max\left(0,\,-\left(\langle\mathbf{\tilde{h}}_{q_{i}},\mathbf{\tilde{h}}_{p_{i}^{+}}\rangle-\langle\mathbf{\tilde{h}}_{q_{i}},\mathbf{\tilde{h}}_{p_{i,j}^{-}}\rangle\right)+\alpha\right)$ .

(3)

$\mathcal{L}_{\text{rerank}}=-\log\frac{\exp(\langle\mathbf{e}_{q_{i}},\mathbf{\tilde{h}}_{p_{i}^{+}}\rangle)}{\exp(\langle\mathbf{e}_{q_{i}},\mathbf{\tilde{h}}_{p_{i}^{+}}\rangle)+\sum_{j=1}^{n}\exp(\langle\mathbf{e}_{q_{i}},\mathbf{\tilde{h}}_{p_{i,j}^{-}}\rangle)}$ ,

(4)

where $\mathcal{D}=\{\langle q_{i},p^{+}_{i},p^{-}_{i,1},\cdots,p^{-}_{i,n}\rangle\}_{i=1}^{m}$ denotes the training set of $m$ instances, $p^{+}_{i}$ denotes a positive passage, and $p^{-}_{i,j}$ denotes a negative passage. We compute the embedding $\mathbf{e}\in\mathbb{R}^{d}$ using an encoder, and $\mathbf{\tilde{h}}_{q}$ and $\mathbf{\tilde{h}}_{p}$ represent the hash codes for a query and a passage, respectively. $\mathcal{L}_{\text{cand}}$ is a ranking loss that identifies positive passages, and $\alpha$ is the margin enforced between the positive and negative scores. $\mathcal{L}_{\text{rerank}}$ minimizes the negative log-likelihood of the positive passage. Finally, we employ the BPR loss as follows:

$\mathcal{L}_{\text{bpr}}=\mathcal{L}_{\text{cand}}+\mathcal{L}_{\text{rerank}}$ .

(5)
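As a rough numerical sketch of the combined objective for a single training instance, the code below uses plain Python lists as vectors. The candidate loss is written in the usual margin-ranking form, which penalizes a negative passage whose hash-code score comes within $\alpha$ of the positive passage's score. The function names and the list-based representation are assumptions for illustration; an actual training setup would use a tensor library and differentiable approximations of the hash codes.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cand_loss(h_q, h_pos, h_negs, alpha):
    """Margin ranking loss over hash codes, summed over negative passages."""
    pos = dot(h_q, h_pos)
    return sum(max(0.0, -(pos - dot(h_q, h_neg)) + alpha) for h_neg in h_negs)


def rerank_loss(e_q, h_pos, h_negs):
    """Negative log-likelihood of the positive passage (softmax over scores)."""
    scores = [dot(e_q, h_pos)] + [dot(e_q, h_neg) for h_neg in h_negs]
    m = max(scores)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]


def bpr_loss(e_q, h_q, h_pos, h_negs, alpha):
    """Sum of the candidate-generation and reranking losses."""
    return cand_loss(h_q, h_pos, h_negs, alpha) + rerank_loss(e_q, h_pos, h_negs)
```

Here `e_q` is the continuous query embedding and `h_q`, `h_pos`, and `h_negs` are hash codes with entries in {-1, +1}.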

4 Experiments

4.1 Experimental Setup

Datasets

We use the MS-MARCO passage dataset (Campos et al., 2016) to train the retriever. Additionally, to demonstrate the robustness of our model on unseen data, we use nine retrieval datasets from BEIR (Thakur et al., 2021) for our main experiment. We use the nDCG@10 metric to evaluate the quality of the top 10 search results based on their relevance and order.
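For reference, nDCG@10 for a single query can be computed as in the sketch below. It assumes the linear-gain formulation (gain equal to the relevance label); some toolkits instead use an exponential gain of $2^{rel}-1$, and the exact variant used in the experiments is not stated here. The function names and the `qrels` dictionary format are illustrative.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in retrieved order.
    qrels: dict mapping document id -> graded relevance (missing ids count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

The dataset-level score is the mean of `ndcg_at_k` over all queries.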

Baselines

To analyze query rewriting methods based on LLMs, we use three baselines: query2doc (Q2D), query2cot (Q2C), and query2expand (Q2E). All rewriting methods use 4-shot prompts. For Q2D, Q2E, and Q2C, we follow the prompts from Wang et al. (2023c) and Jagerman et al. (2023). We employ Mistral-7b (Jiang et al., 2023) and Phi-2 (Microsoft, 2023) as the query rewriting models for all methods, including Crafting The Path. We conduct experiments with all methods on the MS-MARCO and 9 BEIR datasets.

Implementation Details

For reliable experiments, we train five retriever models with different seeds for each method. To evaluate the query rewriting methods using dense retrieval models, we use a total of 20 models. For each dataset, we construct 5 (fine-tuned retrieval models) $\times$ 4 (rewriting methods) = 20 embedding vectors, and compute the mean and variance. In experiments with the query rewriting method using Phi-2, we obtain results using models trained with the new queries written by Mistral-7b. We employ all-mpnet-base-v2 (https://huggingface.co/sentence-transformers/all-mpnet-base-v2) as our pre-trained dense retrieval model.

Rewriter, Retriever	Method	scifact	trec-covid	nfcorpus	quora	scidocs	hotpotqa	dbpedia	fiqa	fever	Avg
Mistral-7b, Dense (FT)	Ours	58.41 (±0.10)	64.59 (±0.86)	32.28 (±0.03)	75.21 (±2.27)	18.39 (±0.10)	50.66 (±0.36)	41.78 (±0.18)	40.74 (±0.09)	63.21 (±0.26)	49.47
Mistral-7b, Dense (FT)	Q2D	59.40 (±0.07)	61.92 (±1.07)	32.03 (±0.07)	74.79 (±0.41)	18.40 (±0.03)	52.05 (±0.59)	41.46 (±0.48)	40.37 (±0.41)	64.26 (±0.60)	49.41
Mistral-7b, Dense (FT)	Q2E	57.06 (±2.11)	55.89 (±2.38)	31.98 (±0.23)	71.98 (±0.98)	18.29 (±0.01)	48.57 (±0.50)	39.13 (±0.23)	40.90 (±0.71)	60.85 (±1.20)	47.19
Mistral-7b, Dense (FT)	Q2C	59.37 (±1.83)	63.14 (±6.22)	32.15 (±0.16)	74.33 (±2.43)	18.31 (±0.05)	52.27 (±1.18)	41.75 (±0.14)	40.60 (±0.31)	62.60 (±0.48)	49.39
Phi-2, Dense (FT)	Ours	58.40 (±0.09)	63.24 (±1.70)	32.60 (±0.02)	75.44 (±1.96)	17.99 (±0.06)	49.84 (±0.23)	40.32 (±0.20)	40.31 (±0.09)	62.12 (±0.39)	48.92
Phi-2, Dense (FT)	Q2D	58.23 (±0.29)	59.50 (±0.68)	31.52 (±0.13)	75.55 (±0.43)	18.05 (±0.01)	49.28 (±0.80)	39.68 (±0.71)	39.81 (±0.04)	62.73 (±1.01)	48.26
Phi-2, Dense (FT)	Q2E	57.07 (±3.23)	56.58 (±2.82)	31.47 (±0.16)	72.46 (±1.45)	18.26 (±0.03)	46.94 (±0.43)	38.44 (±0.08)	40.61 (±0.53)	59.84 (±1.26)	46.85
Phi-2, Dense (FT)	Q2C	57.84 (±1.36)	62.23 (±4.59)	32.20 (±0.17)	73.56 (±4.99)	18.00 (±0.02)	49.39 (±0.89)	40.20 (±0.16)	39.61 (±0.45)	60.95 (±0.27)	48.22
Mistral-7b, Sparse (BM25)	Ours	70.78	74.79	35.51	75.82	16.10	57.41	46.68	29.12	63.66	52.21
Mistral-7b, Sparse (BM25)	Q2D	71.14	67.73	35.01	70.37	15.68	58.82	41.36	28.67	67.72	50.72
Mistral-7b, Sparse (BM25)	Q2E	68.61	66.29	35.14	76.66	16.13	54.58	42.69	28.15	52.98	49.03
Mistral-7b, Sparse (BM25)	Q2C	71.63	74.45	35.29	74.86	16.10	58.71	43.16	30.91	64.36	52.16
Phi-2, Sparse (BM25)	Ours	68.48	69.71	34.38	74.47	15.47	55.20	42.93	28.91	59.20	49.86
Phi-2, Sparse (BM25)	Q2D	67.15	59.42	32.83	71.45	14.95	52.61	37.50	25.10	57.67	46.52
Phi-2, Sparse (BM25)	Q2E	68.71	66.94	33.51	77.35	16.21	53.35	39.18	26.72	52.53	48.28
Phi-2, Sparse (BM25)	Q2C	69.14	69.57	35.21	75.33	15.70	55.39	39.64	28.37	59.57	49.77

Table 2: Experimental results on BEIR dataset. Highest performance is highlighted in bold, and the second highest is underlined.

4.2 Results

Main Results

As shown in Table 2, Crafting The Path outperforms all baselines, including Q2D, Q2C, and Q2E, in average score across the 9 BEIR datasets, and it does so for every combination of retriever type and rewriting model. This indicates that Crafting The Path applies robustly across domains and rewriting model sizes with both sparse and dense retrievers. Moreover, we observe significant performance improvements on the trec-covid (Wang et al., 2020; Voorhees et al., 2020) and nfcorpus (Boteva et al., 2016) datasets for both retriever types compared to previous methods. In particular, the trec-covid dataset requires searching for the latest information on queries about COVID-19, and Crafting The Path proves more effective in finding such recent information. However, previous methods outperform our approach on datasets such as FEVER (Thorne et al., 2018). The FEVER dataset is based on Wikipedia, which is frequently used as LLM pre-training data. Because the training data for Mistral-7b and Phi-2 are not disclosed, it is difficult to verify directly what knowledge the models hold in their parameters (Wang et al., 2023a). We therefore measure the impact of internal model knowledge through the experiments in Table 5. Overall, the results in Table 2 show that the structured rewriting of Crafting The Path yields performance improvements on these BEIR datasets.

Ablation Study

We conduct an ablation study on Crafting The Path. We measure the performance on the MS-MARCO dataset by excluding each step as shown in Table 3. We experiment with omitting Step 3 and also conduct experiments excluding both Steps 2 and 3. Performance declines with the removal of each step, demonstrating that each step is essential for performance improvement.

	MS-MARCO Passage dev
	nDCG	MRR	Recall@1K
Crafting The Path	45.42	33.11	97.05
w/o step3	44.99	32.91	96.67
w/o step2, 3	44.59	32.42	96.17

Table 3: Ablation study on Crafting The Path method.

FActScore (BEIR)	Crafting The Path	Q2D	Q2C
Mistral-7b	0.718	0.506	0.711
Phi-2	0.765	0.460	0.675

Table 4: Average FActScore (Min et al., 2023) on each method.

4.3 Analysis

Evaluating Factuality of Queries

To determine the impact of factual errors introduced during query rewriting on retrieval performance, we use FActScore to measure the factual accuracy of queries rewritten by Mistral-7b and Phi-2 with three rewriting methods. Unlike Min et al. (2023), who use atomic facts, we simply split the content into sentences. For each sentence, we use the gold-label passage as evidence to judge it True or False. If the factuality evaluation results for the three sentences of a rewritten query are True, True, False, we assign a score of 2/3. We then average the scores over the measured queries. In Table 4, Crafting The Path achieves the highest FActScore, and thus the fewest factual errors, with both the Mistral-7b and Phi-2 models.
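The sentence-level scoring described above reduces to a simple average, sketched below. The True/False labels are assumed to come from a separate judge that checks each sentence against the gold passage; the function names and the convention of scoring an empty query as 0 are assumptions.

```python
def query_factuality(sentence_labels):
    """Fraction of sentences judged True for one rewritten query."""
    if not sentence_labels:
        return 0.0  # assumption: an empty rewritten query scores 0
    return sum(sentence_labels) / len(sentence_labels)


def average_factuality(per_query_labels):
    """Mean of the per-query scores over all measured queries."""
    scores = [query_factuality(labels) for labels in per_query_labels]
    return sum(scores) / len(scores) if scores else 0.0
```

For the example in the text, `query_factuality([True, True, False])` evaluates to 2/3.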

hotpotQA	Correct Answer			Incorrect Answer
Mistral-7b	Ours	Q2D	Q2C	Ours	Q2D	Q2C
nDCG@10	77.80	76.00	79.27	60.05	55.57	58.77
MRR	90.04	87.73	90.90	78.76	71.71	75.35
Phi-2	Ours	Q2D	Q2C	Ours	Q2D	Q2C
nDCG@10	74.72	71.03	75.43	59.42	46.93	57.64
MRR	87.95	86.17	87.75	79.16	64.63	75.59

Table 5: The impact of reliance on rewriting model internal knowledge.

Measuring the Reliance on Internal Knowledge

To evaluate how much each LLM relies on its internal parameter knowledge during query reformulation, we divide the dataset into problems for which the LLM can generate the correct answer in a closed-book setting and problems for which it cannot. Based on this division, we apply three rewriting methods and measure the scores shown in Table 5. With both Mistral-7b and Phi-2, our rewriting method outperforms previous rewriting approaches in the Incorrect Answer cases. This demonstrates that our method achieves more effective information retrieval when the model needs to search for unknown information, which aligns with the original purpose of using RAG (Lewis et al., 2020). In this experiment, we use the Contriever (Izacard et al., 2022) model as the dense retriever.

Answer Modification

In Table 6, rewriting the 5,447 entries of the HotpotQA dev dataset (Yang et al., 2018) with Mistral-7b yields new queries that contain the answer in 1,746, 1,623, and 1,752 cases for Crafting The Path, Q2D, and Q2C, respectively. To evaluate how much each method relies on internal knowledge to generate answers, we conduct three experiments. First, in the New Query setting, we directly use the new queries generated by the LLM. Second, in the Replace Answer to [MASK] setting, if the new query contains an answer, we mask the answer portion. Finally, in the Delete New Queries with Answer setting, we exclude a query from the evaluation if at least one of the three rewriting methods generates a new query containing the answer, which reduces the evaluation dataset to 3,212 entries. Our approach achieves the best performance across all three experiments. Notably, the difference in nDCG@10 between our method and Q2C is 0.45 in the first experiment and 1.75 in the third. This gap suggests that our approach is less affected by the presence or absence of answers in the queries than existing methods.
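The masking and filtering settings can be sketched as below. How answers are detected in the rewritten queries is not specified in the text, so the case-insensitive substring match used here is an assumption, as are the function names and the dictionary layout of each example. Answers are assumed to be non-empty strings.

```python
import re


def contains_answer(new_query, answer):
    """True if the answer string occurs in the rewritten query (case-insensitive)."""
    return re.search(re.escape(answer), new_query, flags=re.IGNORECASE) is not None


def mask_answer(new_query, answer, mask="[MASK]"):
    """Replace Answer to [MASK]: hide every occurrence of the answer."""
    return re.sub(re.escape(answer), lambda _: mask, new_query, flags=re.IGNORECASE)


def keep_answer_free(examples):
    """Delete New Queries with Answer: drop an example if ANY method leaked the answer.

    examples: list of dicts with keys 'answer' (str) and
    'rewrites' (dict mapping method name -> rewritten query).
    """
    return [
        ex
        for ex in examples
        if not any(contains_answer(q, ex["answer"]) for q in ex["rewrites"].values())
    ]
```

Filtering on the union over methods keeps the evaluation set identical for every method, so their scores remain directly comparable.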

Mistral-7b	nDCG@10	MRR
	hotpotqa
New Query
Crafting The Path	65.49	82.21
query2doc (Q2D)	61.83	76.62
query2cot (Q2C)	65.04	80.11
Replace Answer to [MASK]
Crafting The Path	64.09	81.45
query2doc (Q2D)	61.08	76.36
query2cot (Q2C)	63.92	79.59
Delete New Queries with Answer
Crafting The Path	58.29	77.46
query2doc (Q2D)	53.39	69.91
query2cot (Q2C)	56.54	73.68

Table 6: The results of an answer modification experiment.

	LLM call	Index search
Crafting The Path	5648.9ms	261.44ms
query2doc (Q2D)	8167.5ms	335.65ms
query2cot (Q2C)	6094.2ms	281.98ms

Table 7: Latency analysis of each method.

Latency

Since all three methods use the same retriever, we primarily compare LLM call latency to assess the speed of each method. We retrieve results for 1,000 entries from the MS-MARCO passage dev dataset and average the outcomes over 100 repetitions. As reported in Table 7, we measure the latency incurred when the model generates a new query (LLM call) and when BM25 searches for relevant passages (index search). Our method generates less redundant information, resulting in lower latency than Q2D. Our method also shows latency comparable to Q2C because the generated texts are of similar length.
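A timing harness for this kind of measurement could look like the sketch below. It reflects one reading of the protocol (every input is timed in each repetition, and the result is the mean per-call time); the function name and arguments are hypothetical, and `fn` stands in for either the LLM call or the index search.

```python
import time


def mean_latency_ms(fn, inputs, repetitions=100):
    """Mean wall-clock time per call of fn, in milliseconds."""
    total = 0.0
    for _ in range(repetitions):
        for item in inputs:
            start = time.perf_counter()
            fn(item)
            total += time.perf_counter() - start
    return 1000.0 * total / (repetitions * len(inputs))
```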

5 Conclusion

We present Crafting The Path, an approach built on a structured three-step process that focuses not merely on generating additional information for the query but primarily on generating what information to find from the query. Our method relies less on internal model parameters and demonstrates robust performance across various models. Additionally, the method generates fewer factual errors and delivers improved out-of-domain performance with lower latency than previous methods.

Acknowledgement

This research was supported by Institute for Information & Communications Technology Planning & Evaluation (IITP) through the Korea government (MSIT) under Grant No. 2021-0-01341 (Artificial Intelligence Graduate School Program (Chung-Ang University)).

References

Boteva et al. (2016) Vera Boteva, Demian Gholipour, Artem Sokolov, and Stefan Riezler. 2016. A full-text learning to rank dataset for medical information retrieval. In Proceedings of the European Conference on Information Retrieval (ECIR). Springer.
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Campos et al. (2016) Daniel Fernando Campos, Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, Li Deng, and Bhaskar Mitra. 2016. Ms marco: A human generated machine reading comprehension dataset. ArXiv, abs/1611.09268.
Deng et al. (2023) Yihe Deng, Weitong Zhang, Zixiang Chen, and Quanquan Gu. 2023. Rephrase and respond: Let large language models ask better questions for themselves. arXiv preprint arXiv:2311.04205.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Izacard et al. (2022) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2022. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research.
Jagerman et al. (2023) Rolf Jagerman, Honglei Zhuang, Zhen Qin, Xuanhui Wang, and Michael Bendersky. 2023. Query expansion by prompting large language models. arXiv preprint arXiv:2305.03653.
Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA. Curran Associates Inc.
Ma et al. (2023) Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023. Query rewriting in retrieval-augmented large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5303–5315, Singapore. Association for Computational Linguistics.
Microsoft (2023) Microsoft. 2023. Microsoft research blog. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/.
Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251.
OpenAI (2022) OpenAI. 2022. Chatgpt blog post. https://openai.com/blog/chatgpt.
OpenAI (2023) OpenAI. 2023. Gpt-4 technical report.
Qu et al. (2021) Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2021. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5835–5847, Online. Association for Computational Linguistics.
Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems.
Robertson and Jones (1976) Stephen E Robertson and K Sparck Jones. 1976. Relevance weighting of search terms. Journal of the American Society for Information science, 27(3):129–146.
Robertson et al. (1995) Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at trec-3. Nist Special Publication Sp, 109:109.
Shao et al. (2023) Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9248–9274, Singapore. Association for Computational Linguistics.
Thakur et al. (2021) Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Voorhees et al. (2020) E. Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, W. Hersh, Kyle Lo, Kirk Roberts, I. Soboroff, and Lucy Lu Wang. 2020. TREC-COVID: Constructing a pandemic information retrieval test collection. ArXiv, abs/2005.04474.
Wang et al. (2023a) Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, et al. 2023a. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity. arXiv preprint arXiv:2310.07521.
Wang et al. (2023b) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2023b. SimLM: Pre-training with representation bottleneck for dense passage retrieval. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2244–2258, Toronto, Canada. Association for Computational Linguistics.
Wang et al. (2023c) Liang Wang, Nan Yang, and Furu Wei. 2023c. Query2doc: Query expansion with large language models. arXiv preprint arXiv:2303.07678.
Wang et al. (2020) Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, K. Funk, Rodney Michael Kinney, Ziyang Liu, W. Merrill, P. Mooney, D. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, B. Stilson, A. Wade, K. Wang, Christopher Wilhelm, Boya Xie, D. Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 open research dataset. ArXiv.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
Xiong et al. (2021) Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
Yamada et al. (2021) Ikuya Yamada, Akari Asai, and Hannaneh Hajishirzi. 2021. Efficient passage retrieval with hashing for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 979–986, Online. Association for Computational Linguistics.
Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
Yoon et al. (2024) Chanwoong Yoon, Gangwoo Kim, Byeongguk Jeon, Sungdong Kim, Yohan Jo, and Jaewoo Kang. 2024. Ask optimal questions: Aligning large language models with retriever’s preference in conversational search. arXiv preprint arXiv:2402.11827.

Appendix A Appendix

A.1 Hyperparameters

Name	Value
Learning rate	2e-5
PLM	all-mpnet-base-v2
Batch size	128
Epochs	3
Learning rate decay	linear
Warmup steps	1000
Binary loss margin	2.0
Similarity function	dot score
Query length	128
Passage length	128

Table 8: Hyperparameters used to train the dense retrieval model.
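The settings in Table 8 can be collected in a small configuration object, as in the sketch below. It is illustrative only: the field names, the `total_steps` argument, and the exact shape of the schedule (linear warmup from zero followed by linear decay to zero) are assumptions, since the table lists only the values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrieverTrainingConfig:
    # Values copied from Table 8; the field names themselves are hypothetical.
    learning_rate: float = 2e-5
    plm: str = "all-mpnet-base-v2"
    batch_size: int = 128
    epochs: int = 3
    lr_decay: str = "linear"
    warmup_steps: int = 1000
    binary_loss_margin: float = 2.0
    similarity: str = "dot"
    max_query_length: int = 128
    max_passage_length: int = 128


def linear_schedule_lr(step: int, total_steps: int, cfg: RetrieverTrainingConfig) -> float:
    """Learning rate at `step`, ASSUMING linear warmup then linear decay to zero."""
    if step < cfg.warmup_steps:
        # Warmup phase: ramp from 0 up to the base learning rate.
        return cfg.learning_rate * step / cfg.warmup_steps
    # Decay phase: fall linearly from the base learning rate to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return cfg.learning_rate * remaining / max(1, total_steps - cfg.warmup_steps)
```

With a hypothetical `total_steps` of 10000, the rate peaks at 2e-5 once step 1000 is reached and returns to zero at the final step.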

A.2 Prompts

prompts

Instruction: Based on the example below, write 3 steps related to the Query and answer in the same format as the example.

Requirements:

1. In step1, sub-information from the existing query is extracted.

2. In step2, please generate what information is needed to solve the question.

3. In step3, an answer is generated based on Query, step1, and step2.

4. If you don’t have certain information, generate ’None’.

5. Please prioritize your most confident predictions.

Example:

Query: where is the Danube?

step1: The Danube is Europe’s second-longest river, flowing through Central and Eastern Europe, from Germany to the Black Sea.

step2: To locate the Danube precisely, geographical knowledge or a map of Europe highlighting rivers is necessary.

step3: The Danube flows through 10 countries: Germany, Austria, Slovakia, Hungary, Croatia, Serbia, Bulgaria, Romania, Moldova, and Ukraine, before emptying into the Black Sea.

Query: what is the number one formula one car?

step1: Formula One (F1) is the highest class of international automobile racing competition held by the FIA.

step2: To know the best car, you have to look at the race records.

step3: Red Bull Racing’s RB20 is the best car.

Query: which movie did Michael Winder write?

step1: Michael Winder is a screenwriter involved in the film industry, potentially credited with writing one or more movies.

step2: To identify the movie(s) Michael Winder wrote, access to a film database or filmography reference is needed.

step3: Michael Winder wrote the movie "In Time" (2011).

Query: who’s the director of Predators?

step1: "Predators" is a film, and like all films, it has a director responsible for overseeing the creative aspects of the production.

step2: To identify the director of "Predators," one needs access to movie databases, film credits, or industry knowledge about this specific film.

step3: Nimród Antal is the director of "Predators" (2010).

Query:

Table 9: The full prompt used for Crafting The Path method.
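A response that follows the format of Table 9 can be split back into its three steps with a few lines of Python. The sketch below is a hedged illustration rather than the authors' implementation: the function names are hypothetical, each step is assumed to fit on one line, and both the handling of 'None' steps (dropped) and the way the remaining steps are joined to the query (space-separated concatenation) are assumptions.

```python
import re
from typing import Dict

# Matches lines such as "step1: ..." as they appear in the Table 9 examples.
STEP_LINE = re.compile(r"^step\s*([123])\s*:\s*(.*)$", re.IGNORECASE)


def parse_steps(response: str) -> Dict[int, str]:
    """Collect the text of step1, step2, and step3 from a model response."""
    steps: Dict[int, str] = {}
    for line in response.splitlines():
        match = STEP_LINE.match(line.strip())
        if match:
            steps[int(match.group(1))] = match.group(2).strip()
    return steps


def is_none(text: str) -> bool:
    """True when a step is the 'None' placeholder described in requirement 4."""
    return text.strip(" .'\"’‘").lower() == "none"


def build_rewritten_query(query: str, steps: Dict[int, str]) -> str:
    """Append informative steps to the query (ASSUMED concatenation scheme)."""
    kept = [steps[i] for i in sorted(steps) if steps[i] and not is_none(steps[i])]
    return " ".join([query] + kept)
```

Dropping 'None' steps keeps the placeholder word from being added to the rewritten query as a spurious search term.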

prompts

Instruction:

You are good at writing Passage. You are asked to write a passage that answers the given query. Do not ask the user for further clarification.

Requirements:

1. Please write it in a similar format to the example

2. Please prioritize your most confident predictions.

Example:

Query: what state is this zip code 85282

Passage: Welcome to TEMPE, AZ 85282. 85282 is a rural zip code in Tempe, Arizona. The population is primarily white, and mostly single. At $200,200 the average home value here is a bit higher than average for the Phoenix-Mesa-Scottsdale metro area, so this probably isn’t the place to look for housing bargains.5282 Zip code is located in the Mountain time zone at 33 degrees latitude (Fun Fact: this is the same latitude as Damascus, Syria!) and -112 degrees longitude.

Query: why is gibbs model of reflection good

Passage: In this reflection, I am going to use Gibbs (1988) Reflective Cycle. This model is a recognised framework for my reflection. Gibbs (1988) consists of six stages to complete one cycle which is able to improve my nursing practice continuously and learning from the experience for better practice in the future.n conclusion of my reflective assignment, I mention the model that I chose, Gibbs (1988) Reflective Cycle as my framework of my reflective. I state the reasons why I am choosing the model as well as some discussion on the important of doing reflection in nursing practice.

Query: what does a thousand pardons means

Passage: Oh, that’s all right, that’s all right, give us a rest; never mind about the direction, hang the direction - I beg pardon, I beg a thousand pardons, I am not well to-day; pay no attention when I soliloquize, it is an old habit, an old, bad habit, and hard to get rid of when one’s digestion is all disordered with eating food that was raised forever and ever before he was born; good land! a man can’t keep his functions regular on spring chickens thirteen hundred years old.

Query: what is a macro warning

Passage: Macro virus warning appears when no macros exist in the file in Word. When you open a Microsoft Word 2002 document or template, you may receive the following macro virus warning, even though the document or template does not contain macros: C:\<path>\<file name>contains macros. Macros may contain viruses.

Query:

Table 10: The full prompt used for query2doc (Q2D) method.

prompts

Instruction:

Based on the example below, write keywords. Do not ask the user for further clarification

Requirements:

1. Please write it in a similar format to the example

2. Please prioritize your most confident predictions.

Example:

Query: how to include bullets in excel

Keywords: insert bullet points in excel

Query: positive predictive value formula

Keywords: calculating positive predictive value

Query: house for sale bridgewater ma

Keywords: homes for sale in bridgewater

Query: r text command

Keywords: text processing in r

Query:

Table 11: The full prompt used for query2expand (Q2E) method.
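The few-shot layout of Table 11 can be assembled programmatically, as in the sketch below. The instruction, requirements, and example pairs are copied from the table. The function name, the single-newline separators, and the placement of the new query directly after the final `Query:` are assumptions, since the table shows only that the prompt ends with an open `Query:` slot.

```python
from typing import List, Tuple

# Text copied from Table 11.
INSTRUCTION = (
    "Based on the example below, write keywords. "
    "Do not ask the user for further clarification"
)
REQUIREMENTS: List[str] = [
    "Please write it in a similar format to the example",
    "Please prioritize your most confident predictions.",
]
EXAMPLES: List[Tuple[str, str]] = [
    ("how to include bullets in excel", "insert bullet points in excel"),
    ("positive predictive value formula", "calculating positive predictive value"),
    ("house for sale bridgewater ma", "homes for sale in bridgewater"),
    ("r text command", "text processing in r"),
]


def build_q2e_prompt(query: str) -> str:
    """Assemble the Q2E prompt and fill the trailing 'Query:' slot (assumed layout)."""
    lines = ["Instruction:", INSTRUCTION, "Requirements:"]
    lines += [f"{i}. {req}" for i, req in enumerate(REQUIREMENTS, start=1)]
    lines.append("Example:")
    for example_query, keywords in EXAMPLES:
        lines.append(f"Query: {example_query}")
        lines.append(f"Keywords: {keywords}")
    # The new query goes into the open slot at the end of the prompt.
    lines.append(f"Query: {query}")
    return "\n".join(lines)
```

Keeping the examples as data makes it easy to swap in the demonstrations of the other prompt tables, which share the same instruction, requirements, and examples structure.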

prompts

Instruction:

Answer the following query. Give the rationale before answering:

Requirements:

1. Please write it in a similar format to the example

2. Please prioritize your most confident predictions.

3. Let’s think step by step.

Query: what does folic acid do

Answer: Folic acid aids in DNA synthesis, cell division, and red blood cell formation. It’s vital for fetal development during pregnancy, preventing neural tube defects, and supporting general health.

Query: what is calomel powder used for?

Answer: Calomel powder, historically used in medicine, served as a purgative, diuretic, and syphilis treatment. Its usage declined due to the toxic effects of mercury, leading to safer alternatives. Today, it’s largely obsolete in medical practice.

Query: what county is dewitt michigan in?

Answer: DeWitt, Michigan, is located in Clinton County. This geographic classification helps in understanding local governance, services, and regional affiliations, essential for residents and researchers.

Query: the importance of minerals in diet

Answer: Minerals are crucial for bodily functions, including bone health, fluid balance, and muscle function. They support metabolic processes and the nervous system, highlighting their essential role in maintaining overall health and preventing deficiencies.

Query:

Table 12: The full prompt used for query2cot (Q2C) method.