Fine-Tuning and Prompt Optimization:
Two Great Steps that Work Better Together

Dilara Soylu Christopher Potts Omar Khattab
Stanford University

Abstract

Natural Language Processing (NLP) systems are increasingly taking the form of multi-stage pipelines involving multiple distinct language models (LMs) and prompting strategies. Here we address the question of how to fine-tune such systems to improve their performance. We cast this as a problem of optimizing the underlying LM weights and the prompting strategies together, and consider a challenging but highly realistic scenario in which we have no gold labels for any intermediate stages in the pipeline. To address this challenge, we evaluate approximate optimization strategies in which we bootstrap training labels for all pipeline stages and use these to optimize the pipeline’s prompts and fine-tune its weights alternatingly. In experiments with multi-hop QA, mathematical reasoning, and feature-based classification, we find that simple approaches for optimizing the prompts and weights together outperform directly optimizing weights alone and prompts alone by up to 65% and 5%, respectively, on average across LMs and tasks. We will release our new optimizers in DSPy at http://dspy.ai.

1 Introduction

While the capabilities of language models (LMs) continue to grow, recent work has shown the potential of building more powerful Natural Language Processing (NLP) systems by composing multiple skills of LMs into pipelines. Examples of this include systems for retrieval-augmented generation Guu et al. (2020); Lewis et al. (2020), multi-hop reasoning Qi et al. (2021); Khattab et al. (2021), information extraction Pourreza and Rafiei (2023); D’Oosterlinck et al. (2024), and other sophisticated pipelines Dohan et al. (2022); Khattab et al. (2022); Beurer-Kellner et al. (2023); Schlag et al. (2023).

Such LM Programs offer much more control for designing NLP systems, as they break down problems into modular, more manageable sub-tasks that can be assigned to LMs. If we could teach these LMs to accurately conduct their easier sub-tasks and to communicate effectively within multi-stage pipelines, this could greatly expand the scope of reliable NLP systems we can build.

To this end, Khattab et al. (2023) recently introduced the DSPy framework for defining and automatically optimizing LM Programs. In it, a program is defined as a function $\Phi$ that composes a set of stages, which we will refer to as language modules $M=\langle M_{1},\ldots,M_{|M|}\rangle$ , into a pipeline. Each module $M_{i}$ specifies a fuzzy natural-language transformation (e.g., generating a summary of a supplied document) that needs to be learned. To do so, each module learns a particular prompt (template) $\pi$ to make a call to a particular LM with weights $\theta$ . The optimization problem is then defined as maximizing the expected performance (per a downstream metric $\mu$ ) of the program $\Phi$ over a set of inputs by updating each module’s $\pi$ and $\theta$ .

Existing work Khattab et al. (2023); Opsahl-Ong et al. (2024) has studied optimizing the discrete string prompt of each module and has considered simple approaches for fine-tuning each module’s LM weights. In this empirical study, we investigate updating each module’s prompt and LM weights together to maximize a downstream metric on the final output of the program. Doing this is challenging as $\Phi$ is not generally differentiable and its modules $M_{i}$ generally lack labeled outputs and exhibit sophisticated dependencies. Moreover, in realistic settings, the training set is usually very small and only a small number of LM calls are possible for training and inference.

To address this challenge, we propose to alternate between optimizing prompts and fine-tuning LM weights and evaluate approximate optimization strategies in which we bootstrap training labels for all pipeline modules. In experiments with multi-hop QA (HotPotQA), mathematical reasoning (GSM8K), and feature-based classification (Iris), we show that these tandem strategies are highly effective across three different LMs, leading to 5–78% gains for HotPotQA, 2.5–10% gains for GSM8K, and -5.9–136% gains for Iris against prompts only and weights only strategies, averaged across mistral-7b-instruct-v0.2, llama-2-7b-chat, and llama-3-8b-instruct.

2 Problem Statement

We are given an LM program $\Phi$ , which operates like a blackbox function $\Phi:\mathcal{X}\to\mathcal{Y}$ , in which $\mathcal{X}$ and $\mathcal{Y}$ are typically in natural language (e.g., questions and their program generated answers, respectively). For example, we may have a program $\Phi$ for answering complex questions with short factoid answers. In the course of its execution, $\Phi$ makes one or more calls to each of $|M|\geq 1$ language modules, $M=\langle M_{1},\ldots,M_{|M|}\rangle$ .

For example, the program may implement a multi-hop, retrieval-augmented pipeline for question answering. This common pipeline Qi et al. (2021); Khattab et al. (2021); Press et al. (2023); Khattab et al. (2022) breaks down the input into sub-questions that are used to iteratively find relevant passages (e.g., from a corpus like Wikipedia) until the question can be faithfully answered. In general terms, each module $M_{i}:\mathcal{X}_{i}\to\mathcal{Y}_{i}$ is a declarative LM invocation that defines, in inherently fuzzy natural-language terms, an input $\mathcal{X}_{i}$ domain (like a user-supplied question and a set of retrieved passages) and an output $\mathcal{Y}_{i}$ co-domain (like a search query to find additional relevant passages).

We seek to implement each language module as some specific, well-tuned strategy for invoking an underlying language model $\mathbf{LM}$ . Concretely, we assume that a module $M_{i}$ will be fully implemented by specifying (1) the string prompt $\pi_{i}$ in which the module inputs $\mathcal{X}_{i}$ are plugged in to decode the module outputs $\mathcal{Y}_{i}$ and (2) the floating-point weights $\theta_{i}$ assigned to the parameters of $\mathbf{LM}$ in the course of this module. We refer to the version of $\Phi$ in which the prompts and LM weights are assigned explicitly to $\Pi$ and $\Theta$ , respectively, as $\Phi_{\langle\Theta,\Pi\rangle}$ .

Given nothing but a small training set $X=\{(x_{1},m_{1}),\ldots,(x_{|X|},m_{|X|})\}$ of inputs $x_{i}\in\mathcal{X}$ and optional metadata like output labels or other hints $m_{i}\in\mathcal{M}$ that can be used for determining the correctness of a given program run, and a metric $\mu:\mathcal{Y}\times\mathcal{M}\to\mathbb{R}$ , our goal is to optimize $\Phi$ , that is, configure its modules’ prompts and LM weights to maximize the following objective:

$$\operatorname*{arg\,max}_{\Theta,\Pi}\,\frac{1}{|X|}\sum_{(x,m)\in X}\mu(\Phi_{\langle\Theta,\Pi\rangle}(x),m)$$

Researchers tuning LM pipelines are in effect seeking to achieve this objective. It is also a very large subspace of the optimization problem in the DSPy framework (http://dspy.ai) for LM programs. Unfortunately, this problem is intractable: we don’t have gradients or intermediate output labels to optimize each module, so we seek approximate strategies for such optimization.
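To make the objective concrete, the following minimal sketch evaluates the quantity being maximized for one fixed assignment of prompts and weights. The names `average_metric`, `toy_program`, and `exact_match` are illustrative assumptions and not part of DSPy; any callable program and metric could be substituted.

```python
from typing import Any, Callable, Sequence, Tuple

Example = Tuple[Any, Any]  # (input x, metadata m), e.g. (question, gold answer)


def average_metric(
    program: Callable[[Any], Any],
    dataset: Sequence[Example],
    metric: Callable[[Any, Any], float],
) -> float:
    """Mean of metric(program(x), m) over the dataset, i.e. the objective
    evaluated for one fixed configuration of the program."""
    if not dataset:
        raise ValueError("dataset must be non-empty")
    return sum(metric(program(x), m) for x, m in dataset) / len(dataset)


# Hypothetical stand-ins for an LM program and a binary accuracy metric.
def toy_program(question: str) -> str:
    return {"2+2?": "4", "3+3?": "6"}.get(question, "unknown")


def exact_match(prediction: str, gold: str) -> float:
    return 1.0 if prediction == gold else 0.0


train = [("2+2?", "4"), ("3+3?", "6"), ("5+5?", "10")]
score = average_metric(toy_program, train, exact_match)
```

An optimizer in this setting would search over configurations of the program and keep the one with the highest such score; the search itself is what the strategies in the next section approximate.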

3 Alternating Prompt and Weight Optimization Steps for LM Programs

We now introduce the BetterTogether algorithm, which simply alternates prompt and weight optimization steps for LM programs. We hypothesize that, when an LM is used to teach itself how to tackle the task defined by an LM program, optimizing prompts and fine-tuning LM weights are both essential to achieve the highest quality. In particular, we expect that (1) prompt optimization before fine-tuning can lead to more successful datapoints for fine-tuning and (2) prompt optimization after fine-tuning can make adjustments to the behavior of the LM program that lead to higher quality. Considering that fine-tuning is often perceived as a more powerful tool, this can be surprising, especially when both forms of optimization are ultimately applied over the same set of training inputs $X$ .

Algorithm 1 BetterTogether: Optimizing LM programs by alternating prompt and weight optimization steps, instantiated in Algorithm 2

Input: Program $\Phi_{\langle\Theta,\Pi\rangle}=\Phi_{\Theta}\odot\Phi_{\Pi}$ , with module weights $\Theta=[\theta_{1},\ldots,\theta_{|\Phi|}]$ and module prompts $\Pi=[\pi_{1},\ldots,\pi_{|\Phi|}]$ ; Training Set $X$ and Metric $\mu$

1: function BetterTogether( $\Phi_{\langle\Theta,\Pi\rangle}$ , $X$ , $\mu$ )
2:     $\Pi^{\prime}\leftarrow$ OptimizePrompts( $\Phi_{\langle\Theta,\Pi\rangle}$ , $X$ , $\mu$ )
3:     $\Theta^{\prime}\leftarrow$ FinetuneWeights( $\Phi_{\langle\Theta,\Pi^{\prime}\rangle}$ , $X$ , $\mu$ )
4:     $\Pi^{\prime\prime}\leftarrow$ OptimizePrompts( $\Phi_{\langle\Theta^{\prime},\Pi\rangle}$ , $X$ , $\mu$ )
5:     return $\Phi_{\langle\Theta^{\prime},\Pi^{\prime\prime}\rangle}$
6: end function

Accordingly, the general optimization framework for our algorithm is defined in Algorithm 1. Given a program $\Phi$ , the algorithm begins by optimizing $\Phi$ ’s prompts, then fine-tuning its set of LM weights, and finally optimizing its prompts again. In principle, each of these steps could be treated as optional. This will define the different possible combinations that we will seek to evaluate in Section 4. Specifically, we are interested in the quality of (1) the vanilla program $\Phi$ with simple user-supplied instructions as the prompts and no fine-tuning of $\mathbf{LM}$ , (2) optimizing the prompts only, (3) optimizing the weights only, (4) optimizing the prompts twice, i.e. using the prompt-optimized $\Phi$ as a starting point for a second round of prompt optimization, (5) optimizing the weights twice, (6) optimizing the prompts then the weights, (7) vice versa, and (8) optimizing the prompts, weights, then prompts. Overall, we expect the final three to consistently outperform the first five.
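The control flow of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative transcription only: the `Program` container and the toy optimizers are hypothetical stand-ins, and the optimizers are passed in as plain callables rather than DSPy objects.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, List, Sequence


@dataclass(frozen=True)
class Program:
    """Hypothetical container: one weight handle and one prompt per module."""
    weights: List[Any]
    prompts: List[str]


def better_together(
    program: Program,
    trainset: Sequence[Any],
    metric: Callable[..., float],
    optimize_prompts: Callable[..., List[str]],
    finetune_weights: Callable[..., List[Any]],
) -> Program:
    # Step 1: optimize the prompts under the initial weights.
    prompts_1 = optimize_prompts(program, trainset, metric)
    # Step 2: fine-tune the weights using the prompt-optimized program.
    weights_1 = finetune_weights(replace(program, prompts=prompts_1), trainset, metric)
    # Step 3: optimize the prompts again for the fine-tuned weights.
    # As written in Algorithm 1, this step starts from the original prompts.
    prompts_2 = optimize_prompts(replace(program, weights=weights_1), trainset, metric)
    return Program(weights=weights_1, prompts=prompts_2)


# Toy optimizers that only make the data flow visible.
def toy_optimize_prompts(program, trainset, metric):
    return [p + " +demos" for p in program.prompts]


def toy_finetune_weights(program, trainset, metric):
    return [w + 1 for w in program.weights]


initial = Program(weights=[0], prompts=["answer:"])
result = better_together(initial, [], lambda y, m: 0.0,
                         toy_optimize_prompts, toy_finetune_weights)
```

Skipping a step corresponds to replacing the respective callable with one that returns the program's current prompts or weights unchanged, which is how the prompt-only and weight-only strategies fit the same template.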

For Algorithm 1 to be complete, we need to instantiate Lines 2–4 with specific approaches for prompt optimization and LM fine-tuning. For this, we choose the Bootstrap- $*$ family of algorithms from Khattab et al. (2023), which work by executing an initial version of the program on input examples $(x_{i},m_{i})\in X$ and recording the inputs/outputs observed at each module when the final output is “correct”, i.e., $\mu(\Phi(x_{i}),m_{i})\geq\lambda$ for some threshold $\lambda$ (e.g., $1.0$ for binary accuracy). This is important to note: in line with our formulation, our prompt and weight optimization regimes are not simply training on hand-labeled data but on self-generated program traces.
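The bootstrapping step described here can be illustrated with a small sketch. It assumes a hypothetical `run_with_trace` callable that returns both the final output and the per-module input/output records; the actual tracing in DSPy is handled by the framework and is not reproduced here.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

ModuleCall = Dict[str, Any]  # {"module": name, "inputs": {...}, "output": ...}
Trace = List[ModuleCall]


def bootstrap_traces(
    run_with_trace: Callable[[Any], Tuple[Any, Trace]],
    dataset: Sequence[Tuple[Any, Any]],
    metric: Callable[[Any, Any], float],
    threshold: float = 1.0,
) -> List[Trace]:
    """Run the program on every (x, m) and keep the module-level trace only
    when the final output scores at least `threshold` under the metric."""
    kept: List[Trace] = []
    for x, m in dataset:
        output, trace = run_with_trace(x)
        if metric(output, m) >= threshold:
            kept.append(trace)
    return kept


# Hypothetical two-module program: a query step followed by an answer step.
def toy_run_with_trace(question: str) -> Tuple[str, Trace]:
    query = question.lower().rstrip("?")
    answer = {"capital of france": "Paris", "capital of peru": "Lima"}.get(query, "unknown")
    trace = [
        {"module": "generate_query", "inputs": {"question": question}, "output": query},
        {"module": "generate_answer", "inputs": {"query": query}, "output": answer},
    ]
    return answer, trace


def accuracy(prediction: str, gold: str) -> float:
    return 1.0 if prediction == gold else 0.0


trainset = [
    ("Capital of France?", "Paris"),
    ("Capital of Peru?", "Lima"),
    ("Capital of Chile?", "Santiago"),
]
kept_traces = bootstrap_traces(toy_run_with_trace, trainset, accuracy, threshold=1.0)
```

Note that the intermediate `generate_query` outputs are never checked directly: they are retained only because the final answer was judged correct, which is what allows training without gold labels for intermediate stages.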

Algorithm 2 Instantiating Algorithm 1’s prompt & weight optimizers with bootstrapping algorithms

Input: Training Set $X$ and Metric $\mu$

1: function BootstrapFewShotRS( $\Phi_{\langle\Theta,\Pi\rangle}$ , $X$ , $\mu$ )
2:     $T,V\leftarrow$ SplitIntoTrainAndValidation( $X$ )
3:     $\tau\leftarrow$ BootstrapTraces( $\Phi_{\langle\Theta,\Pi\rangle}$ , $T$ )
4:     $\tau\leftarrow$ FilterTraces( $\tau$ , $\mu$ )
5:     Initialize attempts list $\mathcal{A}\leftarrow\{\}$
6:     for $\tau^{\prime}\in$ SampleFewShotSubsets( $\tau$ )
7:         $\Pi^{\prime}\leftarrow$ ConstructFewShotPrompts( $\tau^{\prime}$ )
8:         $\sigma\leftarrow\frac{1}{|V|}\sum_{\langle x_{i},m_{i}\rangle\in V}\mu(\Phi_{\langle\Theta,\Pi^{\prime}\rangle}(x_{i}),m_{i})$
9:         Extend $\mathcal{A}$ with $(\sigma,\Pi^{\prime})$
10:     end for
11:     return $\Pi_{\max}$ , $\mathcal{A}$ ’s highest-scoring prompts sequence
12: end function
13:
14: function BootstrapFinetune( $\Phi_{\langle\Theta,\Pi\rangle}$ , $X$ , $\mu$ )
15:     $\tau\leftarrow$ BootstrapTraces( $\Phi_{\langle\Theta,\Pi\rangle}$ , $X$ )
16:     $\tau\leftarrow$ FilterTraces( $\tau$ , $\mu$ )
17:     $\Theta^{\prime}\leftarrow$ TrainLM( $\tau$ )
18:     return $\Theta^{\prime}$
19: end function
20:
21: Set OptimizePrompts as BootstrapFewShotRS
22: Set FinetuneWeights as BootstrapFinetune

Algorithm 2 shows the instantiations for Lines 2–4 of our Algorithm 1. For prompt optimization, we use BootstrapFewShotRS (BFRS) of DSPy, which self-generates potential few-shot examples of every module and applies a form of random search (RS) to select the specific generated few-shot examples that are used for prompting. Overall, BFRS first divides $X$ into a training split $T$ and a validation split $V$ (Line 2). It then executes the provided $\Phi_{\langle\Theta,\Pi\rangle}$ on the training inputs, collecting input–output pairs for every module in $\Phi$ for each $x_{i}\in T$ . This is called a trace $\tau$ , and we keep only the traces assigned high scores by $\mu$ (Line 4). Given all of these traces, BFRS samples multiple different subsets of a few traces $\tau^{\prime}$ (Line 6), each of them containing a potential few-shot example for each module in $\Phi$ , and ultimately selects the subset that, when used to construct few-shot prompts (Line 7), achieves the highest score (Line 8). This simple search strategy is known to consistently lead to large quality improvements in prompting LM programs Khattab et al. (2023); Opsahl-Ong et al. (2024), often outperforming manually or automatically optimizing prompt instructions or writing examples by hand.
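The random-search loop of BFRS can be sketched as follows. This is a simplified illustration and not DSPy's implementation: it treats each trace as a single demonstration, and `build_program` is a hypothetical callable that turns a set of demonstrations into a runnable program. The defaults of 6 candidates and 3 demonstrations echo the settings reported in Appendix C.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple


def random_search_few_shot(
    traces: Sequence[Any],
    build_program: Callable[[List[Any]], Callable[[Any], Any]],
    valset: Sequence[Tuple[Any, Any]],
    metric: Callable[[Any, Any], float],
    num_candidates: int = 6,
    demos_per_prompt: int = 3,
    seed: int = 0,
) -> Tuple[float, List[Any]]:
    """Sample candidate demo subsets, score each on the validation split,
    and return the best (score, demos) pair."""
    if not valset:
        raise ValueError("valset must be non-empty")
    rng = random.Random(seed)
    best_score, best_demos = float("-inf"), []
    for _ in range(num_candidates):
        demos = rng.sample(list(traces), min(demos_per_prompt, len(traces)))
        candidate = build_program(demos)
        score = sum(metric(candidate(x), m) for x, m in valset) / len(valset)
        if score > best_score:
            best_score, best_demos = score, demos
    return best_score, best_demos


# Toy "program": answers only the questions that appear among its demos.
def build_lookup_program(demos):
    memory = dict(demos)
    return lambda question: memory.get(question, "unknown")


def accuracy(prediction, gold):
    return 1.0 if prediction == gold else 0.0


traces = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
valset = [("q1", "a1"), ("q2", "a2"), ("q5", "a5")]
best_score, best_demos = random_search_few_shot(
    traces, build_lookup_program, valset, accuracy,
    num_candidates=6, demos_per_prompt=2, seed=0,
)
```

Because selection is driven only by the validation score, the search needs no gradients and no labels for intermediate modules.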

For fine-tuning, we extend BootstrapFinetune (BFT) of DSPy, which self-generates a large number of examples for every module and combines them into one dataset to fine-tune the LM weights with an implicit multi-task objective, where the sub-tasks are the modules’ roles. Existing work has only considered BFT in a very narrow setting for LM programs: on HotPotQA, Khattab et al. (2023) train a T5-Large model using traces from a few-shot Llama2-13b program, without considering either getting an LM to teach itself via BFT or a role for BFRS in the fine-tuned program. In this work, we focus on allowing models to teach themselves and self-improve. We propose for the first time combining the strategies of BFRS and BFT via alternation to get the same LM to teach itself far better than either prompt or weight optimization in isolation. (We could also test similar ideas in scenarios where a larger model does the bootstrapping for a smaller LM. This may lead to even higher results but is outside our scope.)
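The way BFT pools examples from all modules into one dataset can be illustrated with a short sketch. The record layout and the `render_prompt` helper are assumptions made for illustration; the actual training format depends on the fine-tuning toolkit, and the training call itself (TrainLM in Algorithm 2) is not shown.

```python
from typing import Any, Callable, Dict, List

ModuleCall = Dict[str, Any]  # {"module": name, "inputs": {...}, "output": ...}


def build_finetuning_dataset(
    traces: List[List[ModuleCall]],
    render_prompt: Callable[[str, Dict[str, Any]], str],
) -> List[Dict[str, str]]:
    """Flatten filtered traces into prompt/completion records. Every module
    call becomes one training example, so a single LM is trained on all
    module roles at once (the implicit multi-task objective)."""
    records = []
    for trace in traces:
        for call in trace:
            records.append({
                "module": call["module"],
                "prompt": render_prompt(call["module"], call["inputs"]),
                "completion": str(call["output"]),
            })
    return records


# Hypothetical prompt renderer: module name followed by its input fields.
def render_prompt(module: str, inputs: Dict[str, Any]) -> str:
    fields = "\n".join(f"{key}: {value}" for key, value in inputs.items())
    return f"[{module}]\n{fields}\n"


filtered_traces = [
    [
        {"module": "generate_query", "inputs": {"question": "Capital of France?"}, "output": "capital of france"},
        {"module": "generate_answer", "inputs": {"query": "capital of france"}, "output": "Paris"},
    ],
    [
        {"module": "generate_query", "inputs": {"question": "Capital of Peru?"}, "output": "capital of peru"},
        {"module": "generate_answer", "inputs": {"query": "capital of peru"}, "output": "Lima"},
    ],
]
dataset = build_finetuning_dataset(filtered_traces, render_prompt)
```

In the alternating strategies, the prompts rendered here would be the ones produced by the preceding prompt-optimization step, so the fine-tuning data reflects the optimized prompts.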

4 Experimental Evaluation

	mistral-7b-instruct-v0.2			llama-2-7b-chat			llama-3-8b-instruct
Strategy	HotPotQA	GSM8K	Iris	HotPotQA	GSM8K	Iris	HotPotQA	GSM8K	Iris
Vanilla Zero-shot	17.2	40.3	20.0	13.2	24.0	0.0	31.6	72.7	34.0
Prompt Optimization ( $\Pi$ )	33.8	46.4	52.0	33.3	26.0	56.0	46.9	77.9	78.7
Weight Optimization ( $\Theta$ )	22.9	40.7	28.7	12.2	24.0	-	34.8	75.1	31.3
$\Pi\rightarrow\Pi$	33.8	47.7	64.0	32.6	24.7	64.0	46.5	77.6	77.3
$\Theta\rightarrow\Theta$	24.0	42.8	31.3	13.0	24.1	-	34.4	44.1	30.7
$\Pi\rightarrow\Theta$	36.3	47.3	24.7	32.7	27.3	29.3	42.8	77.6	34.7
$\Theta\rightarrow\Pi$	33.0	48.3	65.3	34.2	26.6	-	43.6	78.9	83.3
$\Pi\rightarrow\Theta\rightarrow\Pi$	37.6	46.8	57.3	34.8	26.3	49.3	46.7	77.0	79.3

Table 1: Main Results. Percentage accuracies of strategies consisting of prompt optimization ( $\Pi$ ), weight optimization ( $\Theta$ ), and their permutations on HotPotQA, GSM8K, and Iris evaluated on mistral-7b-instruct-v0.2, llama-2-7b-chat, and llama-3-8b-instruct. Reported is the average performance of three runs on held-out test sets using different random seeds. Settings that include weight optimization as the first step rely on the data-points bootstrapped using the “Vanilla Zero-shot” setting. Since llama-2-7b-chat answered no data-points correctly on the Iris dataset in the “Vanilla Zero-shot” setting, there were no bootstrapped examples for weight optimization either. Settings that could not be run for this reason are marked with “-”.

We now seek to evaluate our hypothesis on the importance of optimizing both prompts and LM weights of LM programs. We conduct our evaluation across three datasets that span different tasks (and thus LM programs) each. In particular, we use HotPotQA Yang et al. (2018) for multi-hop reasoning, GSM8K Cobbe et al. (2021) for arithmetic reasoning, and Iris Fisher (1988) for classification. Unless otherwise specified, we use $1000$ training set and $500$ development set examples for each dataset. We conduct our main experiments using the same model for prompt optimization, bootstrapping training traces, and fine-tuning. We experiment with three models: mistral-7b-instruct-v0.2 Jiang et al. (2023), llama-2-7b-chat Touvron et al. (2023), llama-3-8b-instruct MetaAI (2024).

We implement all of our programs and optimizers as extensions to the DSPy framework. All evaluation results are the average of three random seeds, which are used to shuffle our training sets before optimization. Full text for programs is shared in Appendix A. Appendices B and C report the license information for all LMs and datasets used as well as our implementation details (e.g., hyperparameters and software), respectively.

Multi-hop Reasoning

HotPotQA (in the “fullwiki” setting) is a question answering task in which systems must find two Wikipedia pages via search and use them to answer a factoid question. Therefore it can be implemented as a program that has three LM modules: the first two for generating search queries (i.e., hops) and the last one for generating an answer. Each module uses Chain-of-Thought (CoT; Wei et al. 2022) to generate its outputs, producing a reasoning string before the search query or the answer. Search queries are passed to a frozen ColBERTv2 Santhanam et al. (2022) retriever. Accuracy is measured using the exact match score of the answer with the ground truth answer for the given question, after normalizing case, stripping surrounding whitespace characters, and removing punctuation. We use a held-out set of $1500$ examples from the official development set to report our final results, since the official test set is not public.
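The exact match metric described above can be sketched as follows. This minimal version applies only the three normalization steps named in the text; the metric actually used in the experiments may apply further normalization, and the function names are illustrative.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation, and strip surrounding whitespace."""
    return text.lower().translate(_PUNCTUATION_TABLE).strip()


def exact_match(prediction: str, gold: str) -> bool:
    """True when prediction and gold answer agree after normalization."""
    return normalize_answer(prediction) == normalize_answer(gold)
```

For instance, `exact_match(" Paris. ", "paris")` holds under this normalization, while `exact_match("London", "Paris")` does not.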

Arithmetic Reasoning

GSM8K is a popular benchmark consisting of grade school math problems. We implement it as an LM program with a single module using CoT prompting, where the LM generates a reasoning string followed by an answer. We report our final results on the entire held-out test set of GSM8K, with $1319$ examples.

Classification

Iris is a classic classification task in machine learning, where the task is to classify species of Iris flowers. We use a single-module CoT DSPy program for Iris with the goal of assessing whether it being a feature-based classification task gives a large advantage to methods based entirely on gradient descent (fine-tuning). This tests the extrapolation of our hypothesis to a very different setting from the other two tasks. We report our results on a test set of $50$ examples due to the size of the Iris dataset.

5 Results & Discussion

Table 1 reports how each of the strategies described in Section 3 performs on the held-out test sets of our datasets. Reported values are averaged across three runs with unique random seeds. Appendix D separately reports the results from each run.

In 7 of the 9 dataset and LM pairs, the best-performing strategy is one that uses prompt ( $\Pi$ ) and weight ( $\Theta$ ) optimization steps together, although there is no clear winner among the three methods that optimize both. Overall, optimizing prompts is essential on all the tasks, but optimizing prompts and weights together leads to strong gains over the best setting that only optimizes one of the two.
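The count of 7 out of 9 can be checked directly against Table 1. The sketch below transcribes the averaged accuracies, writing P for $\Pi$ and W for $\Theta$ and using `None` for settings that could not be run; the variable and function names are illustrative.

```python
# Columns: (LM, task) pairs in the same left-to-right order as Table 1.
COLUMNS = [
    (lm, task)
    for lm in ("mistral-7b-instruct-v0.2", "llama-2-7b-chat", "llama-3-8b-instruct")
    for task in ("HotPotQA", "GSM8K", "Iris")
]

TABLE_1 = {
    "Vanilla Zero-shot": [17.2, 40.3, 20.0, 13.2, 24.0, 0.0, 31.6, 72.7, 34.0],
    "P": [33.8, 46.4, 52.0, 33.3, 26.0, 56.0, 46.9, 77.9, 78.7],
    "W": [22.9, 40.7, 28.7, 12.2, 24.0, None, 34.8, 75.1, 31.3],
    "P->P": [33.8, 47.7, 64.0, 32.6, 24.7, 64.0, 46.5, 77.6, 77.3],
    "W->W": [24.0, 42.8, 31.3, 13.0, 24.1, None, 34.4, 44.1, 30.7],
    "P->W": [36.3, 47.3, 24.7, 32.7, 27.3, 29.3, 42.8, 77.6, 34.7],
    "W->P": [33.0, 48.3, 65.3, 34.2, 26.6, None, 43.6, 78.9, 83.3],
    "P->W->P": [37.6, 46.8, 57.3, 34.8, 26.3, 49.3, 46.7, 77.0, 79.3],
}

# Strategies that apply both prompt and weight optimization.
COMBINED = {"P->W", "W->P", "P->W->P"}


def best_strategy(column_index: int) -> str:
    """Name of the highest-scoring strategy in one column, skipping missing runs."""
    scored = {
        name: row[column_index]
        for name, row in TABLE_1.items()
        if row[column_index] is not None
    }
    return max(scored, key=scored.get)


winners = {column: best_strategy(i) for i, column in enumerate(COLUMNS)}
num_combined_wins = sum(winner in COMBINED for winner in winners.values())
```

The two exceptions are llama-2-7b-chat on Iris, where prompt optimization applied twice scores highest, and llama-3-8b-instruct on HotPotQA, where a single round of prompt optimization does.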

In summary, we have proposed to alternate between prompt optimization and fine-tuning LM weights. In experiments with multi-hop QA (HotPotQA), mathematical reasoning (GSM8K), and feature-based classification (Iris), we show that our strategies are highly effective for getting an LM to teach itself to perform an LM program via bootstrapping, leading to 5–78% gains for HotPotQA, 2.5–10% gains for GSM8K, and -5.9–136% gains for Iris.

6 Limitations

While this short paper presents strong evidence from nine case studies in total, spanning three tasks (and their corresponding LM programs) and three LMs, it is possible that other tasks, programs, or LMs will change the pattern in unforeseen ways. In particular, we have only experimented with weight optimization in the form of LoRA fine-tuning of pre-trained models. It is in principle possible that some other fine-tuning strategy would be so powerful and cost-effective as to remove the need for prompt optimization.

In addition, though we expect our findings to inform many researchers and practitioners interested in optimizing LM programs, and encourage them to explore optimizing prompts and fine-tuning LM weights together, we do not yet understand why both are important. The role of prompt optimization and the role of fine-tuning multi-stage LM programs are both new, and the relative lack of deep understanding of these roles in the emerging literature could pose risks in unanticipated interactions between these components, compared with standard gradient descent for neural networks, which has been studied for decades.

Acknowledgments

D.S. is supported by the Ravi Family Graduate Fellowship. This work was partially supported by IBM as a founding member of the Stanford Institute for Human-Centered Artificial Intelligence (HAI), and by the HAI Hoffman–Yee Grant “Dendritic Computation for Knowledge Systems”.

References

Beurer-Kellner et al. (2023) Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. 2023. Prompting is programming: A query language for large language models. Proc. ACM Program. Lang., 7(PLDI).
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
Dohan et al. (2022) David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A. Saurous, Jascha Sohl-dickstein, Kevin Murphy, and Charles Sutton. 2022. Language model cascades. Preprint, arXiv:2207.10342.
D’Oosterlinck et al. (2024) Karel D’Oosterlinck, Omar Khattab, François Remy, Thomas Demeester, Chris Develder, and Christopher Potts. 2024. In-context learning for extreme multi-label classification. Preprint, arXiv:2401.12178.
Fisher (1988) Ronald A. Fisher. 1988. Iris. UCI Machine Learning Repository.
Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929–3938. PMLR.
Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. Preprint, arXiv:2106.09685.
HuggingFace (2023) HuggingFace. 2023. Text generation inference.
Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
Khattab et al. (2021) Omar Khattab, Christopher Potts, and Matei Zaharia. 2021. Baleen: Robust multi-hop reasoning at scale via condensed retrieval. In Thirty-Fifth Conference on Neural Information Processing Systems.
Khattab et al. (2022) Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. Preprint, arXiv:2212.14024.
Khattab et al. (2023) Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling declarative language model calls into self-improving pipelines. Preprint, arXiv:2310.03714.
Lewis et al. (2020) Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Merkel (2014) Dirk Merkel. 2014. Docker: Lightweight Linux containers for consistent development and deployment. Linux Journal, 2014(239):2.
MetaAI (2024) MetaAI. 2024. Meta llama 3.
Opsahl-Ong et al. (2024) Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. 2024. Optimizing instructions and demonstrations for multi-stage language model programs. Preprint, arXiv:2406.11695.
Pourreza and Rafiei (2023) Mohammadreza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction. Preprint, arXiv:2304.11015.
Press et al. (2023) Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, Singapore. Association for Computational Linguistics.
Qi et al. (2021) Peng Qi, Haejun Lee, Tg Sido, and Christopher Manning. 2021. Answering open-domain questions of varying reasoning steps from text. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3599–3614, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Santhanam et al. (2022) Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3715–3734, Seattle, United States. Association for Computational Linguistics.
Schlag et al. (2023) Imanol Schlag, Sainbayar Sukhbaatar, Asli Celikyilmaz, Wen-tau Yih, Jason Weston, Jürgen Schmidhuber, and Xian Li. 2023. Large language model programs. Preprint, arXiv:2305.05364.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Appendices

Appendix A Programs

The DSPy programs for HotPotQA, GSM8K, and Iris are shared in Snippets 1, 2, 3, respectively.

    import dsp  # assumed: the legacy `dsp` package bundled with DSPy, which provides dsp.utils.deduplicate
    import dspy


    class HotPotQAProgram(dspy.Module):
        def __init__(self, passages_per_hop=3):
            super().__init__()

            self.retrieve = dspy.Retrieve(k=passages_per_hop)
            # One query-generation module per hop, so each hop learns its own prompt.
            self.generate_query = [dspy.ChainOfThought("context, question -> search_query") for _ in range(2)]
            self.generate_answer = dspy.ChainOfThought("context, question -> answer")

        def forward(self, question):
            context = []

            for hop in range(2):
                search_query = self.generate_query[hop](context=context, question=question).search_query
                passages = self.retrieve(search_query).passages
                # Accumulate retrieved passages across hops, dropping duplicates.
                context = dsp.utils.deduplicate(context + passages)

            # Return the prediction together with the context it was based on.
            return self.generate_answer(context=context, question=question).copy(context=context)

Snippet 1: DSPy program for HotPotQA.

    import dspy


    class CoTProgram(dspy.Module):
        def __init__(self):
            super().__init__()
            # Single Chain-of-Thought module: reasoning string followed by the answer.
            self.prog = dspy.ChainOfThought("question -> answer")

        def forward(self, question):
            return self.prog(question=question)

Snippet 2: DSPy program for GSM8K.

    import dspy


    class IrisSignature(dspy.Signature):
        "Given the petal and sepal dimensions in cm, predict the iris species."

        petal_length = dspy.InputField()
        petal_width = dspy.InputField()
        sepal_length = dspy.InputField()
        sepal_width = dspy.InputField()
        answer = dspy.OutputField(desc='setosa, versicolour, or virginica')


    class IrisProgram(dspy.Module):
        def __init__(self):
            super().__init__()  # added for consistency with Snippets 1 and 2
            self.pred = dspy.ChainOfThought(IrisSignature)

        def forward(self, petal_length, petal_width, sepal_length, sepal_width):
            return self.pred(petal_length=petal_length, petal_width=petal_width, sepal_length=sepal_length, sepal_width=sepal_width)

Snippet 3: DSPy program for Iris, provided to us by the DSPy team.

Appendix B Asset Information

We share the associated licenses for the models and datasets we used below. For models, we list the specific HuggingFace model id we used to retrieve the respective weights.

1. mistralai/Mistral-7b-Instruct-v0.2: Apache License 2.0
2. meta-llama/Llama-2-7b-chat-hf: Meta Llama 2 Community License at https://ai.meta.com/llama/license/
3. meta-llama/Meta-Llama-3-8B-Instruct: Meta Llama 3 Community License at https://llama.meta.com/llama3/license/
4. HotPotQA: Apache License 2.0
5. GSM8K: MIT License
6. Iris: Creative Commons Attribution 4.0 International (CC BY 4.0)

All the LMs used in this work are intended for use in English.

Appendix C Implementation Details

In this section, we share the implementation details as it pertains to sizes of the splits, LM sampling, fine-tuning, and compute requirements. We also share the details for how we compute the gains reported throughout the paper.

Split Sizes

For optimizing prompt templates with BootstrapFewshotRandomSearch (BFRS), we sub-sample $100$ examples from the training set for the BFRS training set and $250$ examples for its validation set. We allow BFRS to use up to $3$ bootstrapped as well as $3$ labeled in-context examples to search over $6$ candidate few-shot prompts.

The original Iris dataset has a total of $150$ examples across all the splits. We re-split all the data-points into train, development, and test sets, each with $50$ examples. We use this test set to report our final numbers. From the training split, we use a $15$ / $35$ sub-split for internal prompt-optimization training and validation, respectively.
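The Iris split sizes can be illustrated with a short sketch. The text does not state how examples were assigned to the splits, so the seeded random shuffle below is an assumption, as are the function and key names.

```python
import random
from typing import Any, Dict, List, Sequence


def split_iris(examples: Sequence[Any], seed: int = 0) -> Dict[str, List[Any]]:
    """Split 150 examples into 50/50/50 train/dev/test, then carve a 15/35
    sub-split out of the training set for prompt-optimization training and
    validation. The random assignment is an assumption of this sketch."""
    if len(examples) != 150:
        raise ValueError("expected the 150 examples of the Iris dataset")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    train, dev, test = shuffled[:50], shuffled[50:100], shuffled[100:]
    return {
        "train": train,
        "dev": dev,
        "test": test,
        "bfrs_train": train[:15],  # used to bootstrap few-shot examples
        "bfrs_val": train[15:],    # used to score candidate prompts
    }


splits = split_iris(range(150), seed=0)
```

With only 15 examples available for bootstrapping, a program that answers none of them correctly yields no traces at all, which is the situation reported for llama-2-7b-chat in Table 1.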

Sampling

For sampling, we host our models in Docker Merkel (2014) instances through HuggingFace’s text-generation-inference HuggingFace (2023) toolkit. We keep the sampling parameters the same across all experiments, using TopK sampling with a temperature of $0.1$ , and top_k of $0.97$ , until the model either generates a stopping string or a total of $1024$ tokens (including the tokens in the prompt, if supplied).

Fine-tuning

For fine-tuning, we use Low Rank Adaptation (LoRA) Hu et al. (2021) to train the query and key self-attention layers of our models, using a LoRA rank of $32$, alpha of $64$, and no dropout. We fine-tune all of our models for $5$ epochs using bfloat16 precision, with a learning rate of $1\mathrm{e}{-5}$ and an effective batch size of $8$. We use more than one gradient accumulation step to reach this effective batch size without having to fit the whole batch in memory at once.
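The relationship between the effective batch size and gradient accumulation can be sketched as follows. The hyperparameter values are the ones stated above; the per-device batch size of $2$, the single-device default, and the helper name `accumulation_steps` are assumptions for illustration only, since the text does not report them.

```python
# Hyperparameters as stated in the text.
LORA_RANK = 32
LORA_ALPHA = 64
LORA_DROPOUT = 0.0
EPOCHS = 5
LEARNING_RATE = 1e-5
EFFECTIVE_BATCH_SIZE = 8


def accumulation_steps(effective_batch_size, per_device_batch_size, n_devices=1):
    """Number of micro-batches to accumulate before each optimizer step.

    Hypothetical helper: effective batch size =
    per-device batch size * number of devices * accumulation steps.
    """
    per_step = per_device_batch_size * n_devices
    if effective_batch_size % per_step != 0:
        raise ValueError("effective batch size must be a multiple of the per-step batch")
    return effective_batch_size // per_step


# Assumed per-device batch size of 2 on one device (not stated in the paper).
steps = accumulation_steps(EFFECTIVE_BATCH_SIZE, per_device_batch_size=2)

# In the standard LoRA formulation, the low-rank update is scaled by alpha / rank.
lora_scaling = LORA_ALPHA / LORA_RANK
```

Under these assumptions, gradients from four micro-batches of two examples each are accumulated per optimizer step, and the LoRA update is scaled by a factor of two.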

Compute Requirement

We use A100 GPUs to run our experiments. The total time it takes to run the experiments varies based on the strategy, LM, and dataset. Producing Table 1 took approximately $75$ GPU hours in total.

Appendix D Extended Results

The results shared in Table 1 are the average of three runs. Tables 2, 3, and 4 show the breakdown of the individual runs for HotPotQA, GSM8K, and Iris, respectively.

Strategy	mistral-7b-instruct-v0.2				llama-2-7b-chat				llama-3-8b-instruct
Strategy	Run 1	Run 2	Run 3	Avg	Run 1	Run 2	Run 3	Avg	Run 1	Run 2	Run 3	Avg
Vanilla Zero-shot	17.2	17.2	17.2	17.2	13.2	13.2	13.2	13.2	31.6	31.6	31.6	31.6
Prompt Optimization ( $\Pi$ )	32.7	34.7	34.0	33.8	33.3	33.3	33.4	33.3	45.7	47.4	47.5	46.9
Weight Optimization ( $\Theta$ )	22.0	23.1	23.5	22.9	12.4	11.8	12.3	12.2	34.9	35.3	34.3	34.8
$\Pi\rightarrow\Pi$	31.7	36.0	33.7	33.8	31.7	33.1	33.1	32.6	47.3	45.4	46.7	46.5
$\Theta\rightarrow\Theta$	24.1	23.9	23.9	24.0	12.4	13.5	13.3	13.0	35.1	34.1	34.1	34.4
$\Pi\rightarrow\Theta$	34.9	39.1	34.9	36.3	32.8	32.3	33.1	32.7	40.6	42.1	45.7	42.8
$\Theta\rightarrow\Pi$	29.3	33.8	35.8	33.0	36.0	33.4	33.1	34.2	44.5	40.9	45.3	43.6
$\Pi\rightarrow\Theta\rightarrow\Pi$	34.9	40.7	37.2	37.6	34.7	34.5	35.3	34.8	46.5	47.1	46.4	46.7

Table 2: Results of HotPotQA Runs. Percentage accuracies of strategies consisting of prompt optimization ($\Pi$), weight optimization ($\Theta$), and their permutations for HotPotQA evaluated on mistral-7b-instruct-v0.2, llama-2-7b-chat, and llama-3-8b-instruct. We report the performance of three runs on held-out test sets using different random seeds, along with their average.
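The "Avg" columns are the arithmetic mean of the three runs, rounded to one decimal place. A minimal sketch that recomputes a few of the mistral-7b-instruct-v0.2 averages from the HotPotQA runs; the dictionary layout and the helper name `run_average` are illustrative choices, not the authors' code:

```python
# Per-run accuracies for mistral-7b-instruct-v0.2 on HotPotQA (from Table 2).
hotpotqa_mistral_runs = {
    "Vanilla Zero-shot": [17.2, 17.2, 17.2],
    "Prompt Optimization": [32.7, 34.7, 34.0],
    "Weight Optimization": [22.0, 23.1, 23.5],
    "Pi -> Theta -> Pi": [34.9, 40.7, 37.2],
}


def run_average(scores):
    """Arithmetic mean of the per-run scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


averages = {name: run_average(runs) for name, runs in hotpotqa_mistral_runs.items()}
```

The recomputed values match the reported "Avg" entries for these rows (17.2, 33.8, 22.9, and 37.6).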

Strategy	mistral-7b-instruct-v0.2				llama-2-7b-chat				llama-3-8b-instruct
Strategy	Run 1	Run 2	Run 3	Avg	Run 1	Run 2	Run 3	Avg	Run 1	Run 2	Run 3	Avg
Vanilla Zero-shot	40.3	40.3	40.3	40.3	24.0	24.0	24.0	24.0	72.7	72.7	72.7	72.7
Prompt Optimization ( $\Pi$ )	45.0	47.2	47.1	46.4	27.3	25.1	25.5	26.0	76.9	77.9	78.9	77.9
Weight Optimization ( $\Theta$ )	40.8	40.0	41.2	40.7	23.7	24.2	24.0	24.0	75.7	74.8	74.8	75.1
$\Pi\rightarrow\Pi$	46.3	47.2	49.6	47.7	28.4	24.0	21.8	24.7	76.5	80.1	76.1	77.6
$\Theta\rightarrow\Theta$	42.9	41.8	43.8	42.8	24.0	24.3	24.0	24.1	52.2	36.6	43.4	44.0
$\Pi\rightarrow\Theta$	46.4	47.3	48.2	47.3	27.8	28.1	25.9	27.3	77.6	75.4	79.8	77.6
$\Theta\rightarrow\Pi$	50.1	46.0	48.8	48.3	26.8	26.1	27.0	26.6	78.5	79.8	78.4	78.9
$\Pi\rightarrow\Theta\rightarrow\Pi$	44.9	48.5	47.1	46.8	27.1	25.9	25.9	26.3	77.6	75.4	77.8	77.0

Table 3: Results of GSM8K Runs. Percentage accuracies of strategies consisting of prompt optimization ($\Pi$), weight optimization ($\Theta$), and their permutations for GSM8K evaluated on mistral-7b-instruct-v0.2, llama-2-7b-chat, and llama-3-8b-instruct. We report the performance of three runs on held-out test sets using different random seeds, along with their average.

Strategy	mistral-7b-instruct-v0.2				llama-2-7b-chat				llama-3-8b-instruct
Strategy	Run 1	Run 2	Run 3	Avg	Run 1	Run 2	Run 3	Avg	Run 1	Run 2	Run 3	Avg
Vanilla Zero-shot	20.0	20.0	20.0	20.0	00.0	00.0	00.0	00.0	34.0	34.0	34.0	34.0
Prompt Optimization ( $\Pi$ )	50.0	56.0	50.0	52.0	42.0	56.0	70.0	56.0	82.0	64.0	90.0	78.7
Weight Optimization ( $\Theta$ )	26.0	28.0	32.0	28.7	-	-	-	-	32.0	30.0	32.0	31.3
$\Pi\rightarrow\Pi$	74.0	54.0	64.0	64.0	62.0	74.0	56.0	64.0	86.0	72.0	74.0	77.3
$\Theta\rightarrow\Theta$	28.0	32.0	34.0	31.3	-	-	-	-	32.0	30.0	30.0	30.7
$\Pi\rightarrow\Theta$	22.0	26.0	26.0	24.7	30.0	28.0	30.0	29.3	36.0	32.0	36.0	34.7
$\Theta\rightarrow\Pi$	60.0	68.0	68.0	65.3	-	-	-	-	76.0	80.0	94.0	83.3
$\Pi\rightarrow\Theta\rightarrow\Pi$	40.0	54.0	78.0	57.3	62.0	32.0	54.0	49.3	92.0	86.0	60.0	79.3

Table 4: Results of Iris Runs. Percentage accuracies of strategies consisting of prompt optimization ($\Pi$), weight optimization ($\Theta$), and their permutations for Iris evaluated on mistral-7b-instruct-v0.2, llama-2-7b-chat, and llama-3-8b-instruct. We report the performance of three runs on held-out test sets using different random seeds, along with their average. Settings that include weight optimization as the first step rely on the data-points bootstrapped using the “Vanilla Zero-shot” setting. Since llama-2-7b-chat did not answer any data-points correctly in the “Vanilla Zero-shot” setting, there were no bootstrapped examples to use for weight optimization either. Settings that could not be run for this reason are marked with “-”.
