Fine-Tuning and Prompt Optimization:
Two Great Steps that Work Better Together

Dilara Soylu     Christopher Potts     Omar Khattab
Stanford University
Abstract

Natural Language Processing (NLP) systems are increasingly taking the form of multi-stage pipelines involving multiple distinct language models (LMs) and prompting strategies. Here we address the question of how to fine-tune such systems to improve their performance. We cast this as a problem of optimizing the underlying LM weights and the prompting strategies together, and consider a challenging but highly realistic scenario in which we have no gold labels for any intermediate stages in the pipeline. To address this challenge, we evaluate approximate optimization strategies in which we bootstrap training labels for all pipeline stages and use these to optimize the pipeline’s prompts and fine-tune its weights alternatingly. In experiments with multi-hop QA, mathematical reasoning, and feature-based classification, we find that simple approaches for optimizing the prompts and weights together outperform directly optimizing weights alone and prompts alone by up to 65% and 5%, respectively, on average across LMs and tasks. We will release our new optimizers in DSPy at http://dspy.ai.


1 Introduction

While the capabilities of language models (LMs) continue to grow, recent work has shown the potential of building more powerful Natural Language Processing (NLP) systems by composing multiple skills of LMs into pipelines. Examples of this include systems for retrieval-augmented generation Guu et al. (2020); Lewis et al. (2020), multi-hop reasoning Qi et al. (2021); Khattab et al. (2021), information extraction Pourreza and Rafiei (2023); D’Oosterlinck et al. (2024), and other sophisticated pipelines Dohan et al. (2022); Khattab et al. (2022); Beurer-Kellner et al. (2023); Schlag et al. (2023).

Such LM Programs offer much more control for designing NLP systems, as they break down problems into modular, more manageable sub-tasks that can be assigned to LMs. If we could teach these LMs to accurately conduct their easier sub-tasks and to communicate effectively within multi-stage pipelines, this could greatly expand the scope of reliable NLP systems we can build.

To this end, Khattab et al. (2023) recently introduced the DSPy framework for defining and automatically optimizing LM Programs. In it, a program is defined as a function $\Phi$ that composes a set of stages, which we will refer to as language modules $M = \langle M_1, \ldots, M_{|M|} \rangle$, into a pipeline. Each module $M_i$ specifies a fuzzy natural-language transformation (e.g., generating a summary of a supplied document) that needs to be learned. To do so, each module learns a particular prompt (template) $\pi$ to make a call to a particular LM with weights $\theta$. The optimization problem is then defined as maximizing the expected performance (per a downstream metric $\mu$) of the program $\Phi$ over a set of inputs by updating each module’s $\pi$ and $\theta$.

Existing work Khattab et al. (2023); Opsahl-Ong et al. (2024) has studied optimizing the discrete string prompt of each module and has considered simple approaches for fine-tuning each module’s LM weights. In this empirical study, we investigate updating each module’s prompt and LM weights together to maximize a downstream metric on the final output of the program. Doing this is challenging as $\Phi$ is not generally differentiable and its modules $M_i$ generally lack labeled outputs and exhibit sophisticated dependencies. Moreover, in realistic settings, the training set is usually very small and only a small number of LM calls are possible for training and inference.

To address this challenge, we propose to alternate between optimizing prompts and fine-tuning LM weights and evaluate approximate optimization strategies in which we bootstrap training labels for all pipeline modules. In experiments with multi-hop QA (HotPotQA), mathematical reasoning (GSM8K), and feature-based classification (Iris), we show that these tandem strategies are highly effective across three different LMs, leading to 5–78% gains for HotPotQA, 2.5–10% gains for GSM8K, and gains ranging from −5.9% to 136% for Iris against prompts-only and weights-only strategies, averaged across mistral-7b-instruct-v0.2, llama-2-7b-chat, and llama-3-8b-instruct.

2 Problem Statement

We are given an LM program $\Phi$, which operates like a blackbox function $\Phi : \mathcal{X} \to \mathcal{Y}$, in which $\mathcal{X}$ and $\mathcal{Y}$ are typically in natural language (e.g., questions and their program-generated answers, respectively). For example, we may have a program $\Phi$ for answering complex questions with short factoid answers. In the course of its execution, $\Phi$ makes one or more calls to each of $|M| \geq 1$ language modules, $M = \langle M_1, \ldots, M_{|M|} \rangle$.

For example, the program may implement a multi-hop, retrieval-augmented pipeline for question answering. This common pipeline Qi et al. (2021); Khattab et al. (2021); Press et al. (2023); Khattab et al. (2022) breaks down the input into sub-questions that are used to iteratively find relevant passages (e.g., from a corpus like Wikipedia) until the question can be faithfully answered. In general terms, each module $M_i : \mathcal{X}_i \to \mathcal{Y}_i$ is a declarative LM invocation that defines, in inherently fuzzy natural-language terms, an input domain $\mathcal{X}_i$ (like a user-supplied question and a set of retrieved passages) and an output co-domain $\mathcal{Y}_i$ (like a search query to find additional relevant passages).

We seek to implement each language module as some specific, well-tuned strategy for invoking an underlying language model $\mathbf{LM}$. Concretely, we assume that a module $M_i$ will be fully implemented by specifying (1) the string prompt $\pi_i$ in which the module inputs $\mathcal{X}_i$ are plugged in to decode the module outputs $\mathcal{Y}_i$ and (2) the floating-point weights $\theta_i$ assigned to the parameters of $\mathbf{LM}$ in the course of this module. We refer to the version of $\Phi$ in which the prompts and LM weights are assigned explicitly to $\Pi$ and $\Theta$, respectively, as $\Phi_{\langle\Theta,\Pi\rangle}$.

Given nothing but a small training set $X = \{(x_1, m_1), \ldots, (x_{|X|}, m_{|X|})\}$ of inputs $x_i \in \mathcal{X}$ and optional metadata like output labels or other hints $m_i \in \mathcal{M}$ that can be used for determining the correctness of a given program run, and a metric $\mu : \mathcal{Y} \times \mathcal{M} \to \mathbb{R}$, our goal is to optimize $\Phi$, that is, configure its modules’ prompts and LM weights to maximize the following objective:

$$\operatorname*{arg\,max}_{\Theta,\Pi} \; \frac{1}{|X|} \sum_{(x,m) \in X} \mu\left(\Phi_{\langle\Theta,\Pi\rangle}(x),\, m\right)$$

Researchers tuning LM pipelines are in effect seeking to achieve this objective. It is also a very large subspace of the optimization problem in the DSPy framework (http://dspy.ai) for LM programs. Unfortunately, this problem is intractable: we don’t have gradients or intermediate output labels to optimize each module, so we seek approximate strategies for such optimization.
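As a rough illustration of the objective above, the following Python sketch scores a program by its average metric over the training set and approximates the arg max by choosing the best of a finite list of candidate programs. The names `average_metric`, `select_best`, and the toy programs are hypothetical and are not part of DSPy; a real LM program would call language models rather than plain functions.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical aliases for illustration only.
Program = Callable[[str], str]        # maps an input x to an output y
Metric = Callable[[str, str], float]  # scores an output y against metadata m
Example = Tuple[str, str]             # one (x, m) pair from the training set


def average_metric(program: Program, train_set: Sequence[Example], metric: Metric) -> float:
    """Mean of metric(program(x), m) over all (x, m) pairs."""
    if not train_set:
        raise ValueError("train_set must be non-empty")
    return sum(metric(program(x), m) for x, m in train_set) / len(train_set)


def select_best(candidates: Sequence[Program], train_set: Sequence[Example], metric: Metric) -> Program:
    """Approximate the arg max by scoring a finite set of candidate programs."""
    return max(candidates, key=lambda program: average_metric(program, train_set, metric))


def exact(prediction: str, label: str) -> float:
    """Binary accuracy: 1.0 if the prediction equals the label, else 0.0."""
    return float(prediction == label)


def always_four(question: str) -> str:
    """Toy stand-in for a poorly configured program."""
    return "4"


def adder(question: str) -> str:
    """Toy stand-in for a well-configured program: sums terms like '2+2'."""
    return str(sum(int(term) for term in question.split("+")))


train_set = [("2+2", "4"), ("3+3", "6")]
best = select_best([always_four, adder], train_set, exact)
print(best.__name__, average_metric(best, train_set, exact))
```

In this toy setting the candidates are two fixed functions; in the setting described here, each candidate would correspond to one assignment of prompts and weights to the program's modules.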

3 Alternating Prompt and Weight Optimization Steps for LM Programs

We now introduce the BetterTogether algorithm, which simply alternates prompt and weight optimization steps for LM programs. We hypothesize that, when an LM is used to teach itself how to tackle the task defined by an LM program, optimizing prompts and fine-tuning LM weights are both essential to achieve the highest quality. In particular, we expect that (1) prompt optimization before fine-tuning can lead to more successful datapoints for fine-tuning and (2) prompt optimization after fine-tuning can make adjustments to the behavior of the LM program that lead to higher quality. Considering that fine-tuning is often perceived as a more powerful tool, this can be surprising, especially when both forms of optimization are ultimately applied over the same set of training inputs $X$.

Algorithm 1 BetterTogether: Optimizing LM programs by alternating prompt and weight optimization steps, instantiated in Algorithm 2
Program $\Phi_{\langle\Theta,\Pi\rangle} = \Phi_{\Theta} \odot \Phi_{\Pi}$,
        with module weights $\Theta = [\theta_1, \ldots, \theta_{|\Phi|}]$
        and module prompts $\Pi = [\pi_1, \ldots, \pi_{|\Phi|}]$
Training Set $X$ and Metric $\mu$
1: function BetterTogether($\Phi_{\langle\Theta,\Pi\rangle}$, $X$, $\mu$)
2:     $\Pi' \leftarrow$ OptimizePrompts($\Phi_{\langle\Theta,\Pi\rangle}$, $X$, $\mu$)
3:     $\Theta' \leftarrow$ FinetuneWeights($\Phi_{\langle\Theta,\Pi'\rangle}$, $X$, $\mu$)
4:     $\Pi'' \leftarrow$ OptimizePrompts($\Phi_{\langle\Theta',\Pi\rangle}$, $X$, $\mu$)
5:     return $\Phi_{\langle\Theta',\Pi''\rangle}$
6: end function

Accordingly, the general optimization framework for our algorithm is defined in Algorithm 1. Given a program $\Phi$, the algorithm begins by optimizing $\Phi$’s prompts, then fine-tuning its set of LM weights, and finally optimizing its prompts again. In principle, each of these steps could be treated as optional. This will define the different possible combinations that we will seek to evaluate in Section 4. Specifically, we are interested in the quality of (1) the vanilla program $\Phi$ with simple user-supplied instructions as the prompts and no fine-tuning of $\mathbf{LM}$, (2) optimizing the prompts only, (3) optimizing the weights only, (4) optimizing the prompts twice, i.e. using the prompt-optimized $\Phi$ as a starting point for a second round of prompt optimization, (5) optimizing the weights twice, (6) optimizing the prompts then the weights, (7) vice versa, and (8) optimizing the prompts, weights, then prompts. Overall, we expect the final three to consistently outperform the first five.
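The eight configurations listed above can be sketched as short sequences of optional steps. The Python below is a minimal illustration under stated assumptions: a program's state is reduced to a hypothetical `(weights, prompts)` pair of strings, the strategy codes (`"P"` for prompt optimization, `"W"` for weight fine-tuning) are invented for this sketch, and each step simply receives the state produced by the previous one. The toy optimizers only tag the state so the order of steps is visible.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in: a program state is a (weights, prompts) pair.
State = Tuple[str, str]
Step = Callable[[State], State]

# The eight settings: vanilla, P, W, P->P, W->W, P->W, W->P, P->W->P.
STRATEGIES: List[str] = ["", "P", "W", "PP", "WW", "PW", "WP", "PWP"]


def run_strategy(state: State, strategy: str, optimize_prompts: Step, finetune_weights: Step) -> State:
    """Apply the steps named in `strategy` from left to right."""
    steps: Dict[str, Step] = {"P": optimize_prompts, "W": finetune_weights}
    for symbol in strategy:
        state = steps[symbol](state)
    return state


def toy_optimize_prompts(state: State) -> State:
    """Toy optimizer: marks the prompts as optimized, leaves weights alone."""
    weights, prompts = state
    return weights, prompts + "+P"


def toy_finetune_weights(state: State) -> State:
    """Toy optimizer: marks the weights as fine-tuned, leaves prompts alone."""
    weights, prompts = state
    return weights + "+W", prompts


for strategy in STRATEGIES:
    result = run_strategy(("theta", "pi"), strategy, toy_optimize_prompts, toy_finetune_weights)
    print(repr(strategy), result)
```

Encoding each setting as a string keeps the comparison uniform: the same driver runs the vanilla program (empty string) and the full prompts, weights, prompts sequence.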

For our algorithm in Algorithm 1 to be complete, we need to instantiate Lines 2–4 with specific approaches for prompt optimization and LM fine-tuning. For this, we choose the Bootstrap-* family of algorithms from Khattab et al. (2023), which work by executing an initial version of the program on input examples $(x_i, m_i) \in X$ and recording the inputs/outputs observed at each module when the final output is “correct”, i.e., $\mu(\Phi(x_i), m_i) \geq \lambda$ for some threshold $\lambda$ (e.g., $1.0$ for binary accuracy). This is important to note: in line with our formulation, our prompt and weight optimization regimes are not simply training on hand-labeled data but on self-generated program traces.
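The bootstrapping idea described above can be sketched in a few lines of Python: run the program on each training input, record what every module saw and produced, and keep the record only when the final output clears the threshold. Everything here is an illustrative assumption rather than DSPy's implementation: the `Trace` dataclass, the `bootstrap_traces` helper, and the toy two-module program are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

ModuleCall = Tuple[str, str, str]  # (module name, module input, module output)


@dataclass
class Trace:
    example_input: str
    module_calls: List[ModuleCall]
    final_output: str


def bootstrap_traces(
    run_program: Callable[[str], Tuple[List[ModuleCall], str]],
    train_set: Sequence[Tuple[str, str]],
    metric: Callable[[str, str], float],
    threshold: float = 1.0,
) -> List[Trace]:
    """Keep traces only for examples whose final output scores >= threshold."""
    kept: List[Trace] = []
    for x, m in train_set:
        module_calls, y = run_program(x)
        if metric(y, m) >= threshold:
            kept.append(Trace(x, module_calls, y))
    return kept


# Toy two-module program: build a query, then look up an answer.
KNOWLEDGE: Dict[str, str] = {"PARIS": "FRANCE", "ROME": "ITALY"}


def toy_program(question: str) -> Tuple[List[ModuleCall], str]:
    query = question.upper()
    answer = KNOWLEDGE.get(query, "unknown")
    calls = [("generate_query", question, query), ("generate_answer", query, answer)]
    return calls, answer


def exact(prediction: str, label: str) -> float:
    return float(prediction == label)


# The second label is deliberately one the toy program gets wrong.
train_set = [("paris", "FRANCE"), ("rome", "SPAIN")]
traces = bootstrap_traces(toy_program, train_set, exact)
print([t.example_input for t in traces])
```

Only the example with a correct final answer contributes module-level records, which is how intermediate "labels" are obtained without any gold annotations for the modules themselves.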

Algorithm 2 Instantiating Algorithm 1’s prompt & weight optimizers with bootstrapping algorithms
Training Set $X$ and Metric $\mu$
1: function BootstrapFewShotRS($\Phi_{\langle\Theta,\Pi\rangle}$, $X$, $\mu$)
2:     $T, V \leftarrow$ SplitIntoTrainAndValidation($X$)
3:     $\tau \leftarrow$ BootstrapTraces($\Phi_{\langle\Theta,\Pi\rangle}$, $T$)
4:     $\tau \leftarrow$ FilterTraces($\tau$, $\mu$)
5:     Initialize attempts list $\mathcal{A} \leftarrow \{\}$
6:     for $\tau' \in$ SampleFewShotSubsets($\tau$) do
7:         $\Pi' \leftarrow$ ConstructFewShotPrompts($\tau'$)
8:         $\sigma \leftarrow \frac{1}{|V|} \sum_{\langle x_i, m_i \rangle \in V} \mu(\Phi_{\langle\Theta,\Pi'\rangle}(x_i), m_i)$
9:         Extend $\mathcal{A}$ with $(\sigma, \Pi')$
10:     end for
11:     return $\Pi_{\max}$, $\mathcal{A}$’s highest-scoring prompts sequence
12: end function
13:
14: function BootstrapFinetune($\Phi_{\langle\Theta,\Pi\rangle}$, $X$, $\mu$)
15:     $\tau \leftarrow$ BootstrapTraces($\Phi_{\langle\Theta,\Pi\rangle}$, $X$)
16:     $\tau \leftarrow$ FilterTraces($\tau$, $\mu$)
17:     $\Theta' \leftarrow$ TrainLM($\tau$)
18:     return $\Theta'$
19: end function
20:
21: Set OptimizePrompts as BootstrapFewShotRS
22: Set FinetuneWeights as BootstrapFinetune

Algorithm 2 shows the instantiations for Lines 2–4 of our Algorithm 1. For prompt optimization, we use BootstrapFewShotRS (BFRS) of DSPy, which self-generates potential few-shot examples of every module and applies a form of random search (RS) to select the specific generated few-shot examples that are used for prompting. Overall, BFRS first divides $X$ into a training split $T$ and a validation split $V$ (Line 2). It then executes the provided $\Phi_{\langle\Theta,\Pi\rangle}$ on the training inputs, collecting input–output pairs for every module in $\Phi$ for each $x_i \in T$. This is called a trace $\tau$, and we keep only the traces assigned high scores by $\mu$ (Line 4). Given all of these traces, BFRS samples multiple different subsets of a few traces $\tau'$ (Line 6), each of them containing a potential few-shot example for each module in $\Phi$, and ultimately selects the subset that, when used to construct few-shot prompts (Line 7), achieves the highest score (Line 8). This simple search strategy is known to consistently lead to large quality improvements in prompting LM programs Khattab et al. (2023); Opsahl-Ong et al. (2024), often outperforming manually or automatically optimizing prompt instructions or writing examples by hand.
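The random-search step of BFRS described above can be sketched as follows. This is a simplified illustration under stated assumptions, not DSPy's implementation: `random_search_few_shot` and `build_lookup_program` are hypothetical names, demonstrations are plain `(input, output)` pairs rather than per-module traces, and the toy "program" just memorizes its demonstrations where a real one would place them in an LM prompt. The validation set is assumed to be non-empty.

```python
import random
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[str, str]     # a bootstrapped (input, output) demonstration
Example = Tuple[str, str]  # a validation (x, m) pair
Program = Callable[[str], str]


def random_search_few_shot(
    traces: Sequence[Demo],
    build_program: Callable[[List[Demo]], Program],
    val_set: Sequence[Example],
    metric: Callable[[str, str], float],
    num_candidates: int = 8,
    shots: int = 2,
    seed: int = 0,
) -> Tuple[List[Demo], float]:
    """Sample demo subsets, score each on the validation set, keep the best."""
    rng = random.Random(seed)
    best_score, best_demos = float("-inf"), []
    for _ in range(num_candidates):
        # Sample a candidate few-shot subset without replacement.
        demos = rng.sample(list(traces), k=min(shots, len(traces)))
        program = build_program(demos)
        # Average metric of the candidate program over the validation split.
        score = sum(metric(program(x), m) for x, m in val_set) / len(val_set)
        if score > best_score:
            best_score, best_demos = score, demos
    return best_demos, best_score


def build_lookup_program(demos: List[Demo]) -> Program:
    """Toy 'few-shot' program: answers only what its demonstrations cover."""
    memory = dict(demos)
    return lambda x: memory.get(x, "?")


def exact(prediction: str, label: str) -> float:
    return float(prediction == label)


traces = [("a", "1"), ("b", "2"), ("c", "3")]
val_set = [("a", "1"), ("b", "2")]
demos, score = random_search_few_shot(traces, build_lookup_program, val_set, exact)
print(demos, score)
```

Because candidates are compared only by their validation score, the search needs no gradients and treats the program as a black box, which matches the setting in which $\Phi$ is not differentiable.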

For fine-tuning, we extend BootstrapFinetune (BFT) of DSPy, which self-generates a large number of examples for every module and combines them into one dataset to fine-tune the LM weights with an implicit multi-task objective, where the sub-tasks are the modules’ roles. Existing work has only considered BFT in a very narrow setting for LM programs: on HotPotQA, Khattab et al. (2023) train a T5-Large model using traces from a few-shot Llama2-13b program, without considering getting an LM to teach itself via BFT or considering a role for BFRS in the fine-tuned program. In this work, we focus on allowing models to teach themselves and self-improve. We propose for the first time combining the strategies of BFRS and BFT via alternation to get the same LM to teach itself far better than either prompt or weight optimization in isolation. (We could also test similar ideas in scenarios where a larger model does the bootstrapping for a smaller LM. This may lead to even higher results but is outside our scope.)
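To make the "one dataset, many sub-tasks" idea concrete, the sketch below flattens filtered traces from all modules into a single list of prompt/completion pairs. It is a hypothetical illustration: `build_finetune_dataset`, the prompt templates, and the `{input}` placeholder convention are assumptions made for this example, and the actual training step (TrainLM in Algorithm 2) is left out.

```python
from typing import Dict, List, Tuple

ModuleCall = Tuple[str, str, str]  # (module name, module input, module output)


def build_finetune_dataset(
    traces: List[List[ModuleCall]],
    prompts: Dict[str, str],
) -> List[Dict[str, str]]:
    """Merge every module call from every trace into one training set.

    Each record pairs the module's rendered prompt with the output the module
    produced in a successful run, so one LM is trained on all modules' roles.
    """
    dataset: List[Dict[str, str]] = []
    for trace in traces:
        for module_name, module_input, module_output in trace:
            # Render the module's prompt template with its observed input.
            prompt = prompts[module_name].format(input=module_input)
            dataset.append({"prompt": prompt, "completion": module_output})
    return dataset


# Hypothetical prompt templates for a two-module program.
prompts = {
    "generate_query": "Write a search query for: {input}",
    "generate_answer": "Answer using: {input}",
}

# One filtered trace with one call per module.
traces = [[
    ("generate_query", "Who wrote Hamlet?", "Hamlet author"),
    ("generate_answer", "Hamlet author", "William Shakespeare"),
]]

dataset = build_finetune_dataset(traces, prompts)
for record in dataset:
    print(record)
```

Since the current prompts are baked into the training records, the dataset differs depending on whether prompt optimization ran before fine-tuning, which is one reason the order of steps can matter.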

4 Experimental Evaluation

| Strategy | mistral-7b-instruct-v0.2 / HotPotQA | mistral-7b-instruct-v0.2 / GSM8K | mistral-7b-instruct-v0.2 / Iris | llama-2-7b-chat / HotPotQA | llama-2-7b-chat / GSM8K | llama-2-7b-chat / Iris | llama-3-8b-instruct / HotPotQA | llama-3-8b-instruct / GSM8K | llama-3-8b-instruct / Iris |
|---|---|---|---|---|---|---|---|---|---|
| Vanilla Zero-shot | 17.2 | 40.3 | 20.0 | 13.2 | 24.0 | 0.0 | 31.6 | 72.7 | 34.0 |
| Prompt Optimization ($\Pi$) | 33.8 | 46.4 | 52.0 | 33.3 | 26.0 | 56.0 | 46.9 | 77.9 | 78.7 |
| Weight Optimization ($\Theta$) | 22.9 | 40.7 | 28.7 | 12.2 | 24.0 | – | 34.8 | 75.1 | 31.3 |
| $\Pi \rightarrow \Pi$ | 33.8 | 47.7 | 64.0 | 32.6 | 24.7 | 64.0 | 46.5 | 77.6 | 77.3 |
| $\Theta \rightarrow \Theta$ | 24.0 | 42.8 | 31.3 | 13.0 | 24.1 | – | 34.4 | 44.1 | 30.7 |
| $\Pi \rightarrow \Theta$ | 36.3 | 47.3 | 24.7 | 32.7 | 27.3 | 29.3 | 42.8 | 77.6 | 34.7 |
| $\Theta \rightarrow \Pi$ | 33.0 | 48.3 | 65.3 | 34.2 | 26.6 | – | 43.6 | 78.9 | 83.3 |
| $\Pi \rightarrow \Theta \rightarrow \Pi$ | 37.6 | 46.8 | 57.3 | 34.8 | 26.3 | 49.3 | 46.7 | 77.0 | 79.3 |
Table 1: Main Results. Percentage accuracies of strategies consisting of prompt optimization ($\Pi$), weight optimization ($\Theta$), and their permutations on HotPotQA, GSM8K, and Iris, evaluated on mistral-7b-instruct-v0.2, llama-2-7b-chat, and llama-3-8b-instruct. Reported values are the average performance of three runs on held-out test sets using different random seeds. Settings that include weight optimization as the first step rely on the data points bootstrapped using the “Vanilla Zero-shot” setting. Since llama-2-7b-chat answered no data points correctly on the Iris dataset in the “Vanilla Zero-shot” setting, there were no bootstrapped examples for weight optimization either. Settings that could not be run for this reason are marked with “–”.

We now seek to evaluate our hypothesis on the importance of optimizing both prompts and LM weights of LM programs. We conduct our evaluation across three datasets, each spanning a different task (and thus a different LM program). In particular, we use HotPotQA Yang et al. (2018) for multi-hop reasoning, GSM8K Cobbe et al. (2021) for arithmetic reasoning, and Iris Fisher (1988) for classification. Unless otherwise specified, we use 1000 training set and 500 development set examples for each dataset. We conduct our main experiments using the same model for prompt optimization, bootstrapping training traces, and fine-tuning. We experiment with three models: mistral-7b-instruct-v0.2 Jiang et al. (2023), llama-2-7b-chat Touvron et al. (2023), and llama-3-8b-instruct MetaAI (2024).

We implement all of our programs and optimizers as extensions to the DSPy framework. All evaluation results are the average of three random seeds, which are used to shuffle our training sets before optimization. Full text for programs is shared in Appendix A. Appendices B and C report the license information for all LMs and datasets used as well as our implementation details (e.g., hyperparameters and software), respectively.

Multi-hop Reasoning

HotPotQA (in the “fullwiki” setting) is a question answering task in which systems must find two Wikipedia pages via search and use them to answer a factoid question. Therefore it can be implemented as a program that has three LM modules: the first two for generating search queries (i.e., hops) and the last one for generating an answer. Each module uses Chain-of-Thought (CoT; Wei et al. 2022) to generate its outputs, producing a reasoning string before the search query or the answer. Search queries are passed to a frozen ColBERTv2 Santhanam et al. (2022) retriever. Accuracy is measured using the exact match score of the answer with the ground truth answer for the given question, after normalizing case, stripping surrounding whitespace characters, and removing punctuation. We use a held-out set of 1500 examples from the official development set to report our final results, since the official test set is not public.
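The exact match metric described above can be sketched in plain Python. This is an assumption-laden illustration rather than the evaluation code used in the experiments: `normalize_answer` and `exact_match` are hypothetical names, only ASCII punctuation from `string.punctuation` is removed, and the real implementation may apply further normalization.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation, and strip surrounding whitespace."""
    return text.lower().translate(_PUNCTUATION_TABLE).strip()


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


print(exact_match("  Paris. ", "paris"))
print(exact_match("Paris", "London"))
```

A metric of this shape returns only 0.0 or 1.0, so a threshold of 1.0 keeps exactly the traces whose final answer matches the ground truth after normalization.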

Arithmetic Reasoning

GSM8K is a popular benchmark consisting of grade school math problems. We implement it as an LM program with a single module using CoT prompting, where the LM generates a reasoning string followed by an answer. We report our final results on the entire held-out test set of GSM8K, with 1319 examples.

Classification

Iris is a classic classification task in machine learning, where the task is to classify species of Iris flowers. We use a single-module CoT DSPy program for Iris to assess whether its nature as a feature-based classification task gives a large advantage to methods based entirely on gradient descent (fine-tuning). This tests the extrapolation of our hypothesis to a very different setting from the other two tasks. We report our results on a test set of 50 examples due to the size of the Iris dataset.

5 Results & Discussion

Table 1 reports how each of the strategies described in Section 3 performs on the held-out test sets of our datasets. Reported values are averaged across three runs with unique random seeds. Appendix D separately reports the results from each run.

In 7 out of the 9 dataset and LM pairs, we observe that the best-performing strategies are those that utilize prompt ($\Pi$) and weight ($\Theta$) optimization steps together, although there is no clear winner among the three methods that optimize both. Overall, optimizing prompts is essential on all the tasks, but optimizing prompts and weights together leads to strong gains over the best setting that only optimizes one of the two.

In summary, we have proposed to alternate between prompt optimization and fine-tuning LM weights. In experiments with multi-hop QA (HotPotQA), mathematical reasoning (GSM8K), and feature-based classification (Iris), we show that our strategies are highly effective for getting an LM to teach itself to perform an LM program via bootstrapping, leading to 5–78% gains for HotPotQA, 2.5–10% gains for GSM8K, and gains ranging from −5.9% to 136% for Iris.
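For readers who want to relate Table 1 to percentage gains, the sketch below shows the basic relative-gain arithmetic on a single table cell. It is only an illustration: the helper name `relative_gain` is hypothetical, the example uses the mistral-7b-instruct-v0.2 HotPotQA column, and it does not attempt to reproduce how the reported ranges are averaged across LMs.

```python
def relative_gain(new_score: float, baseline: float) -> float:
    """Percentage improvement of new_score over baseline."""
    if baseline <= 0:
        # E.g., a 0.0 baseline accuracy makes the ratio undefined.
        raise ValueError("relative gain is undefined for a non-positive baseline")
    return 100.0 * (new_score - baseline) / baseline


# Values taken from Table 1 (mistral-7b-instruct-v0.2, HotPotQA).
combined = 37.6      # prompts, then weights, then prompts
prompts_only = 33.8  # prompt optimization alone
weights_only = 22.9  # weight optimization alone

print(round(relative_gain(combined, prompts_only), 1))
print(round(relative_gain(combined, weights_only), 1))
```

For this one cell, the combined strategy improves on prompt optimization alone by roughly 11% and on weight optimization alone by roughly 64%, in relative terms.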

6 Limitations

While this short paper presents strong evidence from nine case studies in total, spanning three tasks (and their corresponding LM programs) and three LMs, it is possible that other tasks, programs, or LMs will change the pattern in unforeseen ways. In particular, we have only experimented with weight optimization in the form of LoRA fine-tuning of pre-trained models. It is in principle possible that some other fine-tuning strategy would be so powerful and cost-effective as to remove the need for prompt optimization.

In addition, though we expect our findings to inform many researchers and practitioners interested in optimizing LM programs, and encourage them to explore optimizing prompts and fine-tuning LM weights together, we do not yet understand why both are important. The role of prompt optimization and the role of fine-tuning multi-stage LM programs are both new, and the relative lack of deep understanding of these roles in the emerging literature could pose risks in unanticipated interactions between these components, compared with standard gradient descent for neural networks, which has been studied for decades.

Acknowledgments

D.S. is supported by Ravi Family Graduate Fellowship. This work was partially supported by IBM as a founding member of the Stanford Institute for Human-Centered Artificial Intelligence (HAI), and by the HAI Hoffman–Yee Grant “Dendritic Computation for Knowledge Systems”.

References

  • Beurer-Kellner et al. (2023) Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. 2023. Prompting is programming: A query language for large language models. Proc. ACM Program. Lang., 7(PLDI).
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
  • Dohan et al. (2022) David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A. Saurous, Jascha Sohl-dickstein, Kevin Murphy, and Charles Sutton. 2022. Language model cascades. Preprint, arXiv:2207.10342.
  • D’Oosterlinck et al. (2024) Karel D’Oosterlinck, Omar Khattab, François Remy, Thomas Demeester, Chris Develder, and Christopher Potts. 2024. In-context learning for extreme multi-label classification. Preprint, arXiv:2401.12178.
  • Fisher (1988) Ronald A. Fisher. 1988. Iris. UCI Machine Learning Repository.
  • Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929–3938. PMLR.
  • Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. Preprint, arXiv:2106.09685.
  • HuggingFace (2023) HuggingFace. 2023. Text generation inference.
  • Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
  • Khattab et al. (2021) Omar Khattab, Christopher Potts, and Matei Zaharia. 2021. Baleen: Robust multi-hop reasoning at scale via condensed retrieval. In Thirty-Fifth Conference on Neural Information Processing Systems.
  • Khattab et al. (2022) Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. Preprint, arXiv:2212.14024.
  • Khattab et al. (2023) Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling declarative language model calls into self-improving pipelines. Preprint, arXiv:2310.03714.
  • Lewis et al. (2020) Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Merkel (2014) Dirk Merkel. 2014. Docker: Lightweight Linux containers for consistent development and deployment. Linux Journal, 2014(239):2.
  • MetaAI (2024) MetaAI. 2024. Meta llama 3.
  • Opsahl-Ong et al. (2024) Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. 2024. Optimizing instructions and demonstrations for multi-stage language model programs. Preprint, arXiv:2406.11695.
  • Pourreza and Rafiei (2023) Mohammadreza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction. Preprint, arXiv:2304.11015.
  • Press et al. (2023) Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, Singapore. Association for Computational Linguistics.
  • Qi et al. (2021) Peng Qi, Haejun Lee, Tg Sido, and Christopher Manning. 2021. Answering open-domain questions of varying reasoning steps from text. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3599–3614, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Santhanam et al. (2022) Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3715–3734, Seattle, United States. Association for Computational Linguistics.
  • Schlag et al. (2023) Imanol Schlag, Sainbayar Sukhbaatar, Asli Celikyilmaz, Wen-tau Yih, Jason Weston, Jürgen Schmidhuber, and Xian Li. 2023. Large language model programs. Preprint, arXiv:2305.05364.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Appendices

Appendix A Programs

The DSPy programs for HotPotQA, GSM8K, and Iris are shared in Snippets 1, 2, and 3, respectively.

import dsp  # provides dsp.utils.deduplicate
import dspy


class HotPotQAProgram(dspy.Module):
    def __init__(self, passages_per_hop=3):
        super().__init__()

        # One retriever, one query generator per hop, and a final answer generator.
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_query = [dspy.ChainOfThought("context, question -> search_query") for _ in range(2)]
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []

        # Each hop writes a search query from the context so far and adds new passages.
        for hop in range(2):
            search_query = self.generate_query[hop](context=context, question=question).search_query
            passages = self.retrieve(search_query).passages
            context = dsp.utils.deduplicate(context + passages)

        return self.generate_answer(context=context, question=question).copy(context=context)
Snippet 1: DSPy program for HotPotQA.
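To make the control flow of Snippet 1 easier to follow without an LM or a retrieval index, the following is a minimal, library-free sketch of the same two-hop loop. The helpers `toy_generate_query` and `toy_retrieve` are hypothetical stand-ins for the LM query generator and the retriever, and `deduplicate` is our own order-preserving implementation, which we assume matches the intent of `dsp.utils.deduplicate`.

```python
from typing import Callable, List


def deduplicate(items: List[str]) -> List[str]:
    """Drop repeated passages while keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


def two_hop_context(
    question: str,
    generate_query: Callable[[List[str], str], str],
    retrieve: Callable[[str], List[str]],
    hops: int = 2,
) -> List[str]:
    """Accumulate retrieved passages over a fixed number of hops."""
    context: List[str] = []
    for _ in range(hops):
        # The query for each hop is conditioned on everything retrieved so far.
        query = generate_query(context, question)
        context = deduplicate(context + retrieve(query))
    return context


# Hypothetical stand-in for the LM: append the latest passage to the question.
def toy_generate_query(context: List[str], question: str) -> str:
    return question if not context else f"{question} {context[-1]}"


# Hypothetical stand-in for the retriever: results depend on the query text.
def toy_retrieve(query: str) -> List[str]:
    return ["p2", "p3"] if "p2" in query else ["p1", "p2"]


context = two_hop_context("who?", toy_generate_query, toy_retrieve)
print(context)
```

In this toy run the second hop retrieves one passage that is already in the context, and deduplication keeps only its first occurrence.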
import dspy


class CoTProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        # A single chain-of-thought step that maps a question to an answer.
        self.prog = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.prog(question=question)
Snippet 2: DSPy program for GSM8K.
import dspy


class IrisSignature(dspy.Signature):
    "Given the petal and sepal dimensions in cm, predict the iris species."

    petal_length = dspy.InputField()
    petal_width = dspy.InputField()
    sepal_length = dspy.InputField()
    sepal_width = dspy.InputField()
    answer = dspy.OutputField(desc='setosa, versicolour, or virginica')


class IrisProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.pred = dspy.ChainOfThought(IrisSignature)

    def forward(self, petal_length, petal_width, sepal_length, sepal_width):
        return self.pred(petal_length=petal_length, petal_width=petal_width, sepal_length=sepal_length, sepal_width=sepal_width)
Snippet 3: DSPy program for Iris, provided to us by the DSPy team.

Appendix B Asset Information

We share the associated licenses for the models and datasets we used below. For models, we list the specific HuggingFace model id we used to retrieve the respective weights.

  1. mistralai/Mistral-7b-Instruct-v0.2: Apache License 2.0

  2. meta-llama/Llama-2-7b-chat-hf: Meta Llama 2 Community License at https://ai.meta.com/llama/license/

  3. meta-llama/Meta-Llama-3-8B-Instruct: Meta Llama 3 Community License at https://llama.meta.com/llama3/license/

  4. HotPotQA: Apache License 2.0

  5. GSM8K: MIT License

  6. Iris: Creative Commons Attribution 4.0 International (CC BY 4.0)

All the LMs used in this work are intended for use in English.

Appendix C Implementation Details

In this section, we share implementation details pertaining to the sizes of the splits, LM sampling, fine-tuning, and compute requirements. We also share the details of how we compute the gains reported throughout the paper.

Split Sizes

For optimizing prompt templates with BootstrapFewshotRandomSearch (BFRS), we sub-sample 100 examples from the training set for the BFRS training set and 250 examples for its validation set. We allow BFRS to use up to 3 bootstrapped as well as 3 labeled in-context examples to search over 6 candidate few-shot prompts.

The original Iris dataset has a total of 150 examples across all the splits. We re-split all the data points into train, development, and test sets, each with 50 examples. We use this test set to report our final numbers. From the training split, we use a 15 / 35 sub-split for internal prompt-optimization training and validation, respectively.
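The Iris re-split described above can be sketched as follows. This is only an illustration under stated assumptions: the helper `split_indices`, the use of a seeded uniform shuffle, and the specific seed values are ours and are not specified in the text; only the split sizes (50 / 50 / 50 and 15 / 35) come from the description.

```python
import random
from typing import List, Sequence


def split_indices(n: int, sizes: Sequence[int], seed: int = 0) -> List[List[int]]:
    """Shuffle range(n) and cut it into consecutive chunks of the given sizes."""
    if sum(sizes) > n:
        raise ValueError("split sizes exceed the number of examples")
    order = list(range(n))
    random.Random(seed).shuffle(order)  # seeded so the split is reproducible
    chunks, start = [], 0
    for size in sizes:
        chunks.append(order[start:start + size])
        start += size
    return chunks


# Iris: 150 examples -> 50 train / 50 development / 50 test.
train_idx, dev_idx, test_idx = split_indices(150, [50, 50, 50], seed=0)

# Sub-split the training indices 15 / 35 for prompt-optimization
# training and validation (the seed here is an arbitrary assumption).
sub_train, sub_val = split_indices(len(train_idx), [15, 35], seed=1)
po_train_idx = [train_idx[i] for i in sub_train]
po_val_idx = [train_idx[i] for i in sub_val]

print(len(train_idx), len(dev_idx), len(test_idx), len(po_train_idx), len(po_val_idx))
```

Working with indices rather than examples keeps the sketch independent of how the dataset is loaded.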

Sampling

For sampling, we host our models in Docker Merkel (2014) instances through HuggingFace’s text-generation-inference HuggingFace (2023) toolkit. We keep the sampling parameters the same across all experiments, using TopK sampling with a temperature of 0.1 and top_k of 0.97, until the model either generates a stopping string or reaches a total of 1024 tokens (including the tokens in the prompt, if supplied).

Fine-tuning

For fine-tuning, we use Low Rank Adaptation (LoRA) Hu et al. (2021) to train the query and key self-attention layers of our models, using a LoRA rank of 32 and an alpha of 64, with no dropout. We fine-tune all of our models for 5 epochs using bfloat16 precision, with a learning rate of 1e-5 and an effective batch size of 8. We use more than one gradient accumulation step so that we can reach this effective batch size without having to fit the whole batch in memory at once.
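As an illustration of how these hyperparameters relate to one another, the sketch below collects them in a plain configuration object and derives the number of gradient accumulation steps. The class `FineTuneConfig`, the function `accumulation_steps`, and the per-device batch size of 2 are hypothetical and chosen only for the example; the text does not state the per-device batch size or the number of GPUs per run. The scaling factor alpha / rank follows the LoRA formulation of Hu et al. (2021).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as stated in the text; field names are our own."""
    lora_rank: int = 32
    lora_alpha: int = 64
    lora_dropout: float = 0.0
    target_layers: Tuple[str, ...] = ("query", "key")
    epochs: int = 5
    learning_rate: float = 1e-5
    effective_batch_size: int = 8
    precision: str = "bfloat16"


def accumulation_steps(effective_batch_size: int,
                       per_device_batch_size: int,
                       num_devices: int = 1) -> int:
    """Number of micro-batches to accumulate before each optimizer step."""
    per_step = per_device_batch_size * num_devices
    if per_step <= 0 or effective_batch_size % per_step != 0:
        raise ValueError("effective batch size must be a multiple of the per-step batch")
    return effective_batch_size // per_step


config = FineTuneConfig()

# LoRA scales the low-rank update by alpha / rank.
lora_scaling = config.lora_alpha / config.lora_rank

# Hypothetical setting: 2 examples fit in memory on a single GPU.
steps = accumulation_steps(config.effective_batch_size, per_device_batch_size=2)

print(lora_scaling, steps)
```

With these assumed values, gradients from four micro-batches of two examples are accumulated before each update, which gives the effective batch size of 8.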

Compute Requirement

We use A100 GPUs to run our experiments. The total time it takes to run the experiments varies based on the strategy, LM, and dataset. Producing Table 1 took approximately 75 GPU hours in total.

Appendix D Extended Results

The results shared in Table 1 are the average of three runs. Tables 2, 3, and 4 show the breakdown of the individual runs for HotPotQA, GSM8K, and Iris, respectively.

| Strategy | mistral Run 1 | mistral Run 2 | mistral Run 3 | mistral Avg | llama-2 Run 1 | llama-2 Run 2 | llama-2 Run 3 | llama-2 Avg | llama-3 Run 1 | llama-3 Run 2 | llama-3 Run 3 | llama-3 Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla Zero-shot | 17.2 | 17.2 | 17.2 | 17.2 | 13.2 | 13.2 | 13.2 | 13.2 | 31.6 | 31.6 | 31.6 | 31.6 |
| Prompt Optimization (Π) | 32.7 | 34.7 | 34.0 | 33.8 | 33.3 | 33.3 | 33.4 | 33.3 | 45.7 | 47.4 | 47.5 | 46.9 |
| Weight Optimization (Θ) | 22.0 | 23.1 | 23.5 | 22.9 | 12.4 | 11.8 | 12.3 | 12.2 | 34.9 | 35.3 | 34.3 | 34.8 |
| Π → Π | 31.7 | 36.0 | 33.7 | 33.8 | 31.7 | 33.1 | 33.1 | 32.6 | 47.3 | 45.4 | 46.7 | 46.5 |
| Θ → Θ | 24.1 | 23.9 | 23.9 | 24.0 | 12.4 | 13.5 | 13.3 | 13.0 | 35.1 | 34.1 | 34.1 | 34.4 |
| Π → Θ | 34.9 | 39.1 | 34.9 | 36.3 | 32.8 | 32.3 | 33.1 | 32.7 | 40.6 | 42.1 | 45.7 | 42.8 |
| Θ → Π | 29.3 | 33.8 | 35.8 | 33.0 | 36.0 | 33.4 | 33.1 | 34.2 | 44.5 | 40.9 | 45.3 | 43.6 |
| Π → Θ → Π | 34.9 | 40.7 | 37.2 | 37.6 | 34.7 | 34.5 | 35.3 | 34.8 | 46.5 | 47.1 | 46.4 | 46.7 |

Table 2: Results of HotPotQA Runs. Percentage accuracies of strategies consisting of prompt optimization (Π), weight optimization (Θ), and their permutations for HotPotQA, evaluated on mistral-7b-instruct-v0.2 (mistral), llama-2-7b-chat (llama-2), and llama-3-8b-instruct (llama-3). We report the performance of three runs with different random seeds on the held-out test set, along with their average.
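The Avg columns can be reproduced from the individual runs with a few lines of Python. The sketch below assumes that Avg is the arithmetic mean of the three runs rounded to one decimal place; the helper `average_runs` is our own, and the two rows are copied from the mistral-7b-instruct-v0.2 columns of Table 2.

```python
from statistics import mean
from typing import Sequence


def average_runs(scores: Sequence[float], ndigits: int = 1) -> float:
    """Arithmetic mean of per-run accuracies, rounded for reporting."""
    return round(mean(scores), ndigits)


# HotPotQA, mistral-7b-instruct-v0.2 (values from Table 2).
prompt_opt_runs = [32.7, 34.7, 34.0]   # Prompt Optimization (Π)
prompt_then_weight_runs = [34.9, 39.1, 34.9]  # Π → Θ

prompt_opt_avg = average_runs(prompt_opt_runs)
prompt_then_weight_avg = average_runs(prompt_then_weight_runs)

print(prompt_opt_avg, prompt_then_weight_avg)
```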
| Strategy | mistral Run 1 | mistral Run 2 | mistral Run 3 | mistral Avg | llama-2 Run 1 | llama-2 Run 2 | llama-2 Run 3 | llama-2 Avg | llama-3 Run 1 | llama-3 Run 2 | llama-3 Run 3 | llama-3 Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla Zero-shot | 40.3 | 40.3 | 40.3 | 40.3 | 24.0 | 24.0 | 24.0 | 24.0 | 72.7 | 72.7 | 72.7 | 72.7 |
| Prompt Optimization (Π) | 45.0 | 47.2 | 47.1 | 46.4 | 27.3 | 25.1 | 25.5 | 26.0 | 76.9 | 77.9 | 78.9 | 77.9 |
| Weight Optimization (Θ) | 40.8 | 40.0 | 41.2 | 40.7 | 23.7 | 24.2 | 24.0 | 24.0 | 75.7 | 74.8 | 74.8 | 75.1 |
| Π → Π | 46.3 | 47.2 | 49.6 | 47.7 | 28.4 | 24.0 | 21.8 | 24.7 | 76.5 | 80.1 | 76.1 | 77.6 |
| Θ → Θ | 42.9 | 41.8 | 43.8 | 42.8 | 24.0 | 24.3 | 24.0 | 24.1 | 52.2 | 36.6 | 43.4 | 44.0 |
| Π → Θ | 46.4 | 47.3 | 48.2 | 47.3 | 27.8 | 28.1 | 25.9 | 27.3 | 77.6 | 75.4 | 79.8 | 77.6 |
| Θ → Π | 50.1 | 46.0 | 48.8 | 48.3 | 26.8 | 26.1 | 27.0 | 26.6 | 78.5 | 79.8 | 78.4 | 78.9 |
| Π → Θ → Π | 44.9 | 48.5 | 47.1 | 46.8 | 27.1 | 25.9 | 25.9 | 26.3 | 77.6 | 75.4 | 77.8 | 77.0 |

Table 3: Results of GSM8K Runs. Percentage accuracies of strategies consisting of prompt optimization (Π), weight optimization (Θ), and their permutations for GSM8K, evaluated on mistral-7b-instruct-v0.2 (mistral), llama-2-7b-chat (llama-2), and llama-3-8b-instruct (llama-3). We report the performance of three runs with different random seeds on the held-out test set, along with their average.
| Strategy | mistral Run 1 | mistral Run 2 | mistral Run 3 | mistral Avg | llama-2 Run 1 | llama-2 Run 2 | llama-2 Run 3 | llama-2 Avg | llama-3 Run 1 | llama-3 Run 2 | llama-3 Run 3 | llama-3 Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla Zero-shot | 20.0 | 20.0 | 20.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 34.0 | 34.0 | 34.0 | 34.0 |
| Prompt Optimization (Π) | 50.0 | 56.0 | 50.0 | 52.0 | 42.0 | 56.0 | 70.0 | 56.0 | 82.0 | 64.0 | 90.0 | 78.7 |
| Weight Optimization (Θ) | 26.0 | 28.0 | 32.0 | 28.7 | – | – | – | – | 32.0 | 30.0 | 32.0 | 31.3 |
| Π → Π | 74.0 | 54.0 | 64.0 | 64.0 | 62.0 | 74.0 | 56.0 | 64.0 | 86.0 | 72.0 | 74.0 | 77.3 |
| Θ → Θ | 28.0 | 32.0 | 34.0 | 31.3 | – | – | – | – | 32.0 | 30.0 | 30.0 | 30.7 |
| Π → Θ | 22.0 | 26.0 | 26.0 | 24.7 | 30.0 | 28.0 | 30.0 | 29.3 | 36.0 | 32.0 | 36.0 | 34.7 |
| Θ → Π | 60.0 | 68.0 | 68.0 | 65.3 | – | – | – | – | 76.0 | 80.0 | 94.0 | 83.3 |
| Π → Θ → Π | 40.0 | 54.0 | 78.0 | 57.3 | 62.0 | 32.0 | 54.0 | 49.3 | 92.0 | 86.0 | 60.0 | 79.3 |

Table 4: Results of Iris Runs. Percentage accuracies of strategies consisting of prompt optimization (Π), weight optimization (Θ), and their permutations for Iris, evaluated on mistral-7b-instruct-v0.2 (mistral), llama-2-7b-chat (llama-2), and llama-3-8b-instruct (llama-3). We report the performance of three runs with different random seeds on the held-out test set, along with their average. Settings that include weight optimization as the first step rely on the data points bootstrapped using the “Vanilla Zero-shot” setting. Since llama-2-7b-chat did not answer any data point correctly in the “Vanilla Zero-shot” setting, there were no bootstrapped examples for weight optimization either. Settings that could not be run for this reason are marked with “–”.
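The dependence described in the caption of Table 4 can be made concrete with a small sketch of bootstrapping by filtering: only examples that the current program answers correctly are kept as training data, so a program with 0% accuracy yields an empty training set. The function `bootstrap_examples`, the toy predictors, and the exact-match metric below are hypothetical and are not the DSPy implementation.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def bootstrap_examples(
    examples: List[Example],
    predict: Callable[[Example], str],
    is_correct: Callable[[Example, str], bool],
) -> List[Example]:
    """Keep only the examples whose prediction passes the metric."""
    kept = []
    for example in examples:
        prediction = predict(example)
        if is_correct(example, prediction):
            kept.append({**example, "prediction": prediction})
    return kept


def exact_match(example: Example, prediction: str) -> bool:
    return prediction == example["answer"]


# Toy data and two hypothetical programs: one never correct, one always correct.
examples = [
    {"input": "flower-1", "answer": "setosa"},
    {"input": "flower-2", "answer": "virginica"},
]


def always_wrong(example: Example) -> str:
    return "unknown"


def oracle(example: Example) -> str:
    return example["answer"]


bootstrapped = bootstrap_examples(examples, always_wrong, exact_match)
can_finetune = len(bootstrapped) > 0  # no correct traces, nothing to train on

print(len(bootstrapped), can_finetune)
```

Prompt optimization applied first can raise accuracy above zero, which is consistent with the Π → Θ and Π → Θ → Π rows being available for llama-2-7b-chat while the Θ-first rows are not.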