
Weak-to-Strong Reasoning

Yuqing Yang2,4  Yan Ma2,3,4  Pengfei Liu1,3,4 (Corresponding Author)
1Shanghai Jiao Tong University  2Fudan University
3Shanghai AI Laboratory  4Generative AI Research Lab (GAIR)
{yuqingyang21, yanma23}@m.fudan.edu.cn  [email protected]
Abstract

When large language models (LLMs) exceed human-level capabilities, it becomes increasingly challenging to provide full-scale and accurate supervision for these models. Weak-to-strong learning, which leverages a less capable model to unlock the latent abilities of a stronger model, proves valuable in this context. Yet, the efficacy of this approach for complex reasoning tasks is still untested. Furthermore, tackling reasoning tasks under the weak-to-strong setting currently lacks efficient methods to keep the strong model from blindly imitating the weak supervisor, including its errors. In this paper, we introduce a progressive learning framework that enables the strong model to autonomously refine its training data, without requiring input from either a more advanced model or human-annotated data. This framework begins with supervised fine-tuning on a selectively curated, small but high-quality dataset, followed by preference optimization on contrastive samples identified by the strong model itself. Extensive experiments on the GSM8K and MATH datasets demonstrate that our method significantly enhances the reasoning capabilities of Llama2-70b using three separate weak models. This method is further validated in a forward-looking experimental setup, where Llama3-8b-instruct effectively supervises Llama3-70b on the highly challenging OlympicArena dataset. This work paves the way for a more scalable and sophisticated strategy to enhance AI reasoning powers. All relevant code and resources are available at https://github.com/GAIR-NLP/weak-to-strong-reasoning.

Figure 1: (a): Test accuracy on GSM8K (Cobbe et al., 2021) using Llama2-7b to supervise Llama2-70b. (b): Test accuracy on OlympicArena (Huang et al., 2024) using Llama3-8b-instruct to supervise Llama3-70b. “Weak Floor” refers to the results of the weak model. “Full Weak FT” refers to the results of the baseline where the strong model is naively fine-tuned on the full dataset generated by the weak model. “Our Stage I” represents the results from the first stage of supervised fine-tuning using our proposed weak-to-strong method. Note that our method in Stage I produces three variants of enhanced strong models and we present the best results here. “Our Stage II” denotes the results from the second stage of preference optimization using our method.

1 Introduction

A student need not be inferior to the teacher; a teacher need not be wiser than the student.
On Teachers

Figure 2: Illustration of weak-to-strong reasoning through the strong model self-refining its training data.

As the pursuit of Artificial General Intelligence (AGI) advances, the creation of superintelligent systems (models that exceed human cognitive capabilities) remains a key ambition within the field (Robert, 2017; Altman et al., 2023; Puthumanaillam et al., 2024). This quest introduces a host of challenges, especially concerning the supervision and learning paradigms for these advanced AI models. Conventional supervision methods, which typically depend on human oversight (Christiano et al., 2017; Ouyang et al., 2022; Sun et al., 2024) or guidance (i.e., distilled knowledge) from more advanced models (Bai et al., 2022; Lee et al., 2023; Peng et al., 2023), become inadequate as the capabilities of AI exceed those of their supervisors (Bowman et al., 2022; Sang et al., 2024). To address this issue, we focus on the weak-to-strong learning paradigm (Burns et al., 2023), which operates under a unique task setting where only a less capable model and a stronger but not fully utilized model are available. (Similar to Burns et al. (2023), we define “strong model” in the context of LLMs, taking into account their characteristics: LLMs often contain the knowledge and capabilities needed to perform specific tasks, but these have not yet been fully elicited (Zhou et al., 2024). Typically, the term refers to stronger and larger pre-trained language models whose capabilities have not been fully realized yet.)

The central question of weak-to-strong learning is whether models with limited capabilities can effectively guide the development of more advanced, stronger models. Burns et al. (2023) have demonstrated its feasibility in classification, chess, and reward modeling tasks. However, the applicability of this setup to more complex reasoning tasks, which demand more than mere extrapolation or pattern recognition, remains an open question. Complex reasoning represents a key aspect of human cognition, crucial for assessing whether LLMs can emulate or surpass human-like capabilities in comprehending the world, making decisions, and solving problems (Qiao et al., 2023; Huang and Chang, 2023; Chang et al., 2023). Given the complexity and the critical nature of these tasks, applying the weak-to-strong learning framework to advanced reasoning challenges is essential, particularly within the broader context of achieving superintelligence.

Burns et al. (2023) suggest that naively fine-tuning strong models on the full set of noisy data produced by weak models, which we call full weak fine-tuning, can consistently improve their performance over the weaker counterparts. However, this approach is still far from recovering the full capabilities of strong models, and our experiments show that it loses effectiveness when facing more complex reasoning challenges. They also propose an auxiliary confidence loss to mitigate the issue of strong models imitating the errors of their supervisors. However, this method is tailored to classification tasks with a fixed set of labels and does not naturally extend to open-ended generation tasks, including reasoning. Currently, there is a lack of effective methods beyond naive fine-tuning to prevent overfitting to weak errors and to further elicit the intrinsic reasoning abilities of strong models within the weak-to-strong reasoning framework.

To achieve the above goal, we introduce a progressive refinement learning framework, guided by the principle that a model can enhance its capabilities more effectively by initially focusing on smaller, more reliable subsets of data, and then iteratively expanding its learning scope, as illustrated in Fig. 2. In the first stage, we hypothesize that it is more advantageous to utilize smaller quantities of data that are likely to be more accurate. We achieve this by combining weak data, generated by the less capable model, with data self-generated by the more advanced model through in-context learning. This blend is then used to selectively curate datasets for subsequent supervised fine-tuning. In the second stage, having developed a strong model with improved reasoning capabilities, we use it to construct contrastive samples for preference optimization (Rafailov et al., 2023; Hong et al., 2024), enabling the model to learn effectively from the errors of the weaker model.

In implementation, we employ Llama2-70b (Touvron et al., 2023) as the strong model, test three separate weak models: Llama2-7b, Gemma-2b (Mesnard et al., 2024), and Mistral-7b (Jiang et al., 2023), and conduct experiments on the commonly used math reasoning datasets GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). Experimental results reveal that:

  1. Full weak fine-tuning, while effective in classification tasks, falls short for complex reasoning tasks.

  2. Our proposed method significantly outperforms the full weak fine-tuning method, achieving a 26.99-point improvement on GSM8K when supervised solely by the weak model (i.e., Gemma-2b) after the first stage of training ($\mathcal{M} \to \mathcal{M}_{\text{plus}}$), and further enhances performance by an additional 8.49 points through preference optimization without knowing the gold answer ($\mathcal{M}_{\text{plus}} \to \mathcal{M}_{\text{pro}}$).

  3. Our proposed preference optimization phase enables the strong model to learn from errors made by the weak supervisor, ultimately surpassing the strong model fine-tuned on gold-standard solutions (i.e., strong ceiling) in challenging scenarios, such as level 4-5 MATH problems.

To more accurately approximate future scenarios, we conduct additional experiments on OlympicArena (Huang et al., 2024), an extremely challenging dataset with no definitive ground truth answers. Llama3-8b-instruct (AI@Meta, 2024), despite its smaller size, has been aligned and proves able to effectively supervise the larger Llama3-70b, whose potential has not yet been fully realized. Moreover, our proposed two-stage training approach outperforms full weak fine-tuning by 3.19 points.

2 Preliminaries

2.1 Typical Learning Paradigms for LLMs

G.T. Answer Stronger Model
Generic-supervised -
Distillation-based
Self-improvement -
Semi-supervised -
Weak-to-strong
Table 1: Typical Learning Paradigms for LLMs. “” and “” indicate whether supervision is required, and “” indicates it is optional. “G.T.” represents Ground Truth.

We outline common learning paradigms in large model training, primarily characterized by the need for ground truth answers and supervision from stronger models as shown in Tab. 1.

Generic-Supervised Learning

When training LLMs, it is ideal to have a sufficient amount of training data with ground truth answers, which we refer to as the generic-supervised learning paradigm (Ouyang et al., 2022; Yuan et al., 2023). However, acquiring such data is often labor-intensive and can sometimes be impossible. As a result, various learning paradigms have emerged to reduce the dependence on data quality and quantity while still improving performance.

Distillation-based Learning

In the current context, a strong model like Llama2-70b can still be improved by seeking help from a stronger model like GPT-4 (OpenAI, 2023), even without ground truth. Hence, many existing works suggest that a stronger model act as a teacher model to provide specific feedback to improve the targeted model (Lee et al., 2023; Peng et al., 2023; An et al., 2023; Agarwal et al., 2023; Chen et al., 2023). This paradigm can be viewed as distilling the stronger teacher model’s knowledge. Nonetheless, merely imitating the teacher model is not a long-term solution; imitation models only slightly close the performance gap to the teacher model on tasks not well-represented in the imitation data (Gudibande et al., 2023). Furthermore, distillation learning primarily benefits models that are less capable than the teacher model.

Self-Improvement Learning

Considering the high costs of annotating training data by humans or stronger proprietary models, a line of work relies on the correct responses generated by the model itself to update it. For example, Zelikman et al. (2022), Yuan et al. (2023), Singh et al. (2023), and Hosseini et al. (2024) filter solutions according to the correctness of final answers, while Lightman et al. (2023) and Lin et al. (2024) employ reward models trained on gold annotations to score self-generated content. Whether using binary labels or fine-grained feedback, this paradigm still requires ground truth to assess the usability of the model’s self-generated responses. Without ground truth answers, self-improvement leads to minimal performance gains and may even degrade performance (Huang et al., 2023; Tyen et al., 2023).

Semi-Supervised Learning

Drawing on semi-supervised learning in traditional machine learning, another type of LLM learning depends not on extensive labeling but instead on a small, high-quality seed dataset. Tong et al. (2024) have demonstrated improvement by learning the differences between self-generated responses and expert-annotated responses. We also include the trending research topic of easy-to-hard generalization (Hase et al., 2024; Sun et al., 2024) in this category, where models are trained to tackle complex tasks by learning from human annotations on easier tasks. This line of research inevitably requires access to a small yet high-quality set of standard answers.

Weak-to-Strong Learning

In scenarios where models surpass human capabilities, the challenge of providing comprehensive and precise supervision for complex tasks intensifies, particularly as neither ground truth nor a superior model for supervisory guidance exists. This absence underscores the critical importance of weak-to-strong learning approaches. Such methods uniquely leverage weaker supervisory signals to recover latent knowledge from already powerful models. For example, fine-tuning GPT-4 with a GPT-2-level supervisor can recover close to GPT-3.5-level performance on certain tasks (Burns et al., 2023). This strategy holds profound implications for advancing human societal progress by equipping LLMs with the capabilities to address currently unsolvable mathematical and physical challenges. Unlike other learning paradigms, weak-to-strong learning operates under comparatively relaxed conditions, opening expansive opportunities for exploration and innovation.

2.2 Weak-to-Strong Reasoning Setup

Role        weak model                                        strong model    task question
Analogue    Llama2-7b + SFT($\mathcal{D}_{\text{gold},1}$)    Llama2-70b      $\mathcal{Q} \in \text{GSM8K}$, $\mathcal{Q} \in \text{MATH}$
Table 2: Weak-to-Strong Reasoning Setup.

In this paper, we address reasoning tasks in the weak-to-strong setting, as illustrated in Tab. 2. First, we examine mathematical reasoning tasks, such as those in GSM8K and MATH. These tasks require each step of the reasoning process to demonstrate fundamental mathematical problem-solving skills, including problem comprehension and algebraic operations, and to build upon the previous steps. This imposes higher demands on the model’s learning and generalization capabilities. Unlike classification tasks, where models can rely on superficial pattern extrapolation or recognition, reasoning tasks offer minimal benefit from guessing. Then, we use a weak model (e.g., Llama2-7b) with a certain degree of mathematical problem-solving ability, denoted as $m$ (otherwise, the weak model can hardly provide useful supervision). This model acts analogously to human supervisors with limited expertise in the era of superintelligence. Besides, we only have a set of questions $\mathcal{Q} = \{q_i\}$ without ground truth answers, and the goal is to improve the reasoning capability of a strong model $\mathcal{M}$ (e.g., Llama2-70b). To implement this, following Burns et al. (2023), we randomly divide the original training set into two equal parts, $\mathcal{D}_{\text{gold},1}$ and $\mathcal{D}_{\text{gold},2}$. The weak model is initially fine-tuned using $\mathcal{D}_{\text{gold},1}$, where the gold solutions are available, resulting in a weak model with some problem-solving capability, i.e., $m$. In contrast, the strong model can only access the questions from $\mathcal{D}_{\text{gold},2}$, without reasoning chains or final answers, i.e., $\mathcal{Q}$.
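The data split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_training_set`, the dictionary keys `question`, `solution`, and `answer`, and the fixed seed are all assumptions made for the example.

```python
import random


def split_training_set(examples, seed=0):
    """Randomly split a training set into two halves.

    The first half (D_gold,1) keeps its gold solutions and is meant for
    fine-tuning the weak model. From the second half (D_gold,2) only the
    questions are returned, since the strong model must not see gold
    reasoning chains or final answers.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    d_gold_1 = shuffled[:half]
    d_gold_2 = shuffled[half:]
    questions = [ex["question"] for ex in d_gold_2]
    return d_gold_1, questions


# Hypothetical toy data standing in for GSM8K or MATH training examples.
examples = [
    {"question": f"q{i}", "solution": f"s{i}", "answer": str(i)}
    for i in range(10)
]
d_gold_1, questions = split_training_set(examples)
```

With an odd number of examples the second half receives the extra item, so the two parts are only approximately equal in that case.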

3 Methodology

In this section, we propose a weak-to-strong training method designed to maximize the use of weak data and to elicit the strong model’s innate talent. First, we identify potentially positive samples in the absence of ground truth and external signals. During Stage I, we exclusively utilize this subset of data for supervised fine-tuning. Then, once the strong model has achieved a certain level of reasoning proficiency, we employ the full weak data in Stage II, particularly the potentially negative samples, via preference learning-based approaches like DPO (Rafailov et al., 2023), encouraging the strong model to learn from mistakes made by the weaker model. The whole framework is depicted in Fig. 3.

3.1 Stage I: Learn from “Positive” Samples

Given a weak model $m$ and a series of math problems $\mathcal{Q}$ without ground truth, $m$ generates weak data $\mathcal{D}_{\text{weak}} = \{q_i, c_{\text{weak},i}, a_{\text{weak},i}\}$, where $q_i \in \mathcal{Q}$, $c_{\text{weak},i}$ represents a reasoning chain, and $a_{\text{weak},i}$ represents the final answer. The correctness of $a_{\text{weak},i}$ is unknown. The central challenge is: how can we maximize the use of $m$ and $\mathcal{D}_{\text{weak}}$ to fully enhance and recover the mathematical reasoning capabilities of a stronger model $\mathcal{M}$?

3.1.1 Full Weak Fine-Tuning

Our initial strategy is to fine-tune the stronger model $\mathcal{M}$ across the entirety of the weak dataset $\mathcal{D}_{\text{weak}}$. While prior research (Burns et al., 2023) has validated the effectiveness of this approach in text classification tasks, its efficacy in reasoning tasks remains unexplored. We have therefore embarked on an investigation to determine whether the phenomenon of weak-to-strong generalization can also enhance the reasoning capabilities of $\mathcal{M}$ in this less examined domain.

3.1.2 Weak In-Context Learning

Another straightforward approach is in-context learning (ICL; Dong et al., 2023b), which requires only several training samples as demonstrations in the prompt. Specifically, we randomly select four samples from $\mathcal{D}_{\text{weak}}$ as demonstrations. Since we do not have access to the ground truth, these demonstrations cannot be guaranteed to be correct.
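As a rough illustration of weak in-context learning, the sketch below draws four random demonstrations from the weak data and assembles a few-shot prompt. The prompt template, the function name `build_weak_icl_prompt`, and the keys `question`, `chain`, and `answer` are assumptions for this example; the source does not specify the exact prompt format.

```python
import random


def build_weak_icl_prompt(weak_data, question, k=4, seed=0):
    """Build a few-shot prompt from k randomly chosen weak demonstrations.

    The demonstrations come from the weak model, so they may contain
    errors; no ground truth is consulted at any point.
    """
    demos = random.Random(seed).sample(weak_data, k)
    blocks = [
        f"Question: {d['question']}\n"
        f"Answer: {d['chain']} The answer is {d['answer']}."
        for d in demos
    ]
    # The target question is appended last, with the answer left open.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical weak data: question, reasoning chain, final answer.
weak_data = [
    {"question": f"What is {i} + {i}?", "chain": f"{i} + {i} = {2 * i}.", "answer": str(2 * i)}
    for i in range(1, 9)
]
prompt = build_weak_icl_prompt(weak_data, "What is 12 + 30?")
```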

Figure 3: Overview of our method evolving from $\mathcal{M} \to \mathcal{M}_{\text{plus}} \to \mathcal{M}_{\text{pro}}$. Left: we utilize final answer consistency to selectively filter weak and icl data from diverse sources, which is used to fine-tune the strong model $\mathcal{M}$ and obtain $\mathcal{M}_{\text{plus}}$ with enhanced mathematical reasoning capabilities. Right: we leverage the confidence of $\mathcal{M}_{\text{plus}}$ to identify contrastive samples for preference optimization, resulting in a more robust strong model $\mathcal{M}_{\text{pro}}$.

3.1.3 Weak-ICL Fine-Tuning

Given that models can mimic weak errors through supervised fine-tuning (Charikar et al., 2024; Lang et al., 2024), we propose refining $\mathcal{D}_{\text{weak}}$ before use, instead of using all data blindly. Additionally, we seek to harness the innate abilities of the strong model activated via in-context learning. Building on these two ideas, we introduce weak-icl fine-tuning, employing both weak data $\mathcal{D}_{\text{weak}}$ and “icl data” $\mathcal{D}_{\text{icl}} = \{q_i, c_{\text{icl},i}, a_{\text{icl},i}\}$ as higher-quality supervision signals, where $q_i \in \mathcal{Q}$, and $c_{\text{icl},i}$ and $a_{\text{icl},i}$ are generated by $\mathcal{M}$ with few-shot demonstrations. (Experiments in §4.3 show that although ICL is affected by demonstration selection, our method achieves further improvements beyond ICL.)

Note that, for both $\mathcal{D}_{\text{weak}}$ and $\mathcal{D}_{\text{icl}}$, we cannot determine whether a certain answer is correct or not. Nonetheless, when two models, employing distinct data representations, converge on the same answer in an open-ended task, it is indicative of a higher likelihood of accuracy. This phenomenon supports the reliability of the results when consistency is observed across different methodologies. We thus compare $\mathcal{D}_{\text{weak}}$ and $\mathcal{D}_{\text{icl}}$ generated by the weak model and strong model, respectively, and select $\hat{\mathcal{D}}_{\text{weak}}$ and $\hat{\mathcal{D}}_{\text{icl}}$ if $a_{\text{weak},i} = a_{\text{icl},i}$, for subsequent supervised fine-tuning. We call this approach final answer consistency. Considering the combination of the two sets of data, we can obtain three versions of enhanced fine-tuned strong models:

  • $\mathcal{M}_{\text{weak-ft}}$: $\mathcal{M}$ fine-tuned on $\hat{\mathcal{D}}_{\text{weak}}$.

  • $\mathcal{M}_{\text{icl-ft}}$: $\mathcal{M}$ fine-tuned on $\hat{\mathcal{D}}_{\text{icl}}$.

  • $\mathcal{M}_{\text{hybrid-ft}}$: $\mathcal{M}$ fine-tuned on the union of $\hat{\mathcal{D}}_{\text{weak}}$ and $\hat{\mathcal{D}}_{\text{icl}}$.
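The final answer consistency filter and the three resulting training sets can be sketched as below. This is a minimal sketch under stated assumptions: the helper `normalize` (which only strips whitespace, thousands separators, and a trailing period) is a hypothetical stand-in for whatever answer matching is actually used, and the keys `question`, `chain`, and `answer` are illustrative names.

```python
def normalize(answer):
    """Light normalization so that '1,000' and '1000.' compare equal (an assumption)."""
    return answer.strip().replace(",", "").rstrip(".")


def final_answer_consistency(d_weak, d_icl):
    """Keep, for each question, the weak and icl samples whose final answers agree."""
    icl_by_question = {ex["question"]: ex for ex in d_icl}
    d_weak_hat, d_icl_hat = [], []
    for weak_ex in d_weak:
        icl_ex = icl_by_question.get(weak_ex["question"])
        if icl_ex is None:
            continue
        if normalize(weak_ex["answer"]) == normalize(icl_ex["answer"]):
            d_weak_hat.append(weak_ex)
            d_icl_hat.append(icl_ex)
    return d_weak_hat, d_icl_hat


# Toy data: the two sources disagree on q1 and agree on q2.
d_weak = [
    {"question": "q1", "chain": "10+5=15.", "answer": "15"},
    {"question": "q2", "chain": "10*100=1,000.", "answer": "1,000"},
]
d_icl = [
    {"question": "q1", "chain": "2*10+5=25.", "answer": "25"},
    {"question": "q2", "chain": "100*10=1000.", "answer": "1000"},
]
d_weak_hat, d_icl_hat = final_answer_consistency(d_weak, d_icl)
d_hybrid_hat = d_weak_hat + d_icl_hat  # training set for the hybrid variant
```

Fine-tuning on `d_weak_hat`, `d_icl_hat`, and `d_hybrid_hat` then corresponds to the three model variants listed above. Agreement does not prove correctness; it only raises the likelihood that a sample is correct.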

Iterative Training

Upon closer examination of $\mathcal{M}_{\text{weak-ft}}$ and $\mathcal{M}_{\text{icl-ft}}$, we see that they still satisfy the condition of having different data representations, as they are trained on data from different sources: $\hat{\mathcal{D}}_{\text{weak}}$ is generated by the weak model, whereas $\hat{\mathcal{D}}_{\text{icl}}$ primarily originates from the strong model itself. Hence, we can perform iterative training to bootstrap performance. We denote the initial round of supervised fine-tuning data as $\hat{\mathcal{D}}_{\text{weak}}^{1}$ and $\hat{\mathcal{D}}_{\text{icl}}^{1}$, resulting in models $\mathcal{M}_{\text{weak-ft}}^{1}$, $\mathcal{M}_{\text{icl-ft}}^{1}$, and $\mathcal{M}_{\text{hybrid-ft}}^{1}$. In the second iteration, we obtain zero-shot solutions from $\mathcal{M}_{\text{weak-ft}}^{1}$ applied to $\mathcal{Q}$ to construct $\mathcal{D}_{\text{weak}}^{2}$, and those from $\mathcal{M}_{\text{icl-ft}}^{1}$ to construct $\mathcal{D}_{\text{icl}}^{2}$. Here, the subscripts “weak” and “icl” indicate the initial data source. Then we apply final answer consistency to obtain $\hat{\mathcal{D}}_{\text{weak}}^{2}$ and $\hat{\mathcal{D}}_{\text{icl}}^{2}$. Following another round of supervised fine-tuning, we have:

  • $\mathcal{M}_{\text{weak-ft}}^{2}$: $\mathcal{M}$ fine-tuned on $\hat{\mathcal{D}}_{\text{weak}}^{2}$.

  • $\mathcal{M}_{\text{icl-ft}}^{2}$: $\mathcal{M}$ fine-tuned on $\hat{\mathcal{D}}_{\text{icl}}^{2}$.

  • $\mathcal{M}_{\text{hybrid-ft}}^{2}$: $\mathcal{M}$ fine-tuned on the union of $\hat{\mathcal{D}}_{\text{weak}}^{2}$ and $\hat{\mathcal{D}}_{\text{icl}}^{2}$.

Note that the iterative training step is optional; it may lead to performance degradation when data quality is too low or the model overfits.
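The control flow of the iterative procedure can be sketched as follows. This is a schematic under stated assumptions, not the authors' implementation: `fine_tune` and `generate` are hypothetical callables supplied by the caller, and the toy versions below exist only so the sketch runs end to end. Exact string equality stands in for answer matching.

```python
def consistent_subsets(d_weak, d_icl):
    """Keep only samples whose final answers agree across the two sources."""
    icl_by_question = {ex["question"]: ex for ex in d_icl}
    weak_hat, icl_hat = [], []
    for weak_ex in d_weak:
        icl_ex = icl_by_question.get(weak_ex["question"])
        if icl_ex is not None and weak_ex["answer"] == icl_ex["answer"]:
            weak_hat.append(weak_ex)
            icl_hat.append(icl_ex)
    return weak_hat, icl_hat


def iterative_weak_icl_ft(base_model, questions, d_weak, d_icl,
                          fine_tune, generate, rounds=2):
    """Run `rounds` iterations of filter -> fine-tune -> regenerate.

    Every round fine-tunes the base model (not the previous round's
    checkpoint) on the freshly filtered data.
    """
    hybrid_ft = None
    for t in range(1, rounds + 1):
        weak_hat, icl_hat = consistent_subsets(d_weak, d_icl)
        weak_ft = fine_tune(base_model, weak_hat)
        icl_ft = fine_tune(base_model, icl_hat)
        hybrid_ft = fine_tune(base_model, weak_hat + icl_hat)
        if t < rounds:
            # Zero-shot solutions from the two variants seed the next round.
            d_weak = generate(weak_ft, questions)
            d_icl = generate(icl_ft, questions)
    return hybrid_ft


# Toy stand-ins: a "model" is a dict recording how much data it was trained on.
def toy_fine_tune(model, data):
    return {"base": model, "n_train": len(data)}


def toy_generate(model, questions):
    return [{"question": q, "answer": "42"} for q in questions]


questions = ["q1", "q2", "q3"]
d_weak = [{"question": q, "answer": a} for q, a in zip(questions, ["1", "2", "3"])]
d_icl = [{"question": q, "answer": a} for q, a in zip(questions, ["1", "9", "3"])]
m_hybrid = iterative_weak_icl_ft("M", questions, d_weak, d_icl,
                                 toy_fine_tune, toy_generate)
```

Setting `rounds=1` recovers the single-round variant, which matches the remark that the iterative step is optional.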

3.2 Stage II: Learn from “Negative” Samples

We denote the final iteration of $\mathcal{M}_{\text{hybrid-ft}}$ from Stage I as $\mathcal{M}_{\text{plus}}$, which has learned dual mathematical solutions and holds potential for further enhancement. Next, we apply preference optimization techniques to strategically utilize the potentially erroneous subset of the original weak dataset $\mathcal{D}_{\text{weak}} = \{q_i, c_{\text{weak},i}, a_{\text{weak},i}\}$ generated by $m$, which allows the strong model to identify and avoid similar errors in future reasoning processes. The key factor lies in how to construct contrastive samples for learning.

Question ($q_i$): John has five more roommates than twice as many as Bob. If Bob has 10 roommates, how many roommates does John have?
Weak Response ($\{c_{\text{weak},i}, a_{\text{weak},i}\}$): John has 10+5=15 roommates. The answer is 15.
Self Response 1 ($\{c_{\text{strong},i}^{1}, a_{\text{strong},i}^{1}\} \in A_{\text{strong},i}^{+}$): Bob has 10 roommates. Twice as many as Bob is 2*10 = 20 roommates. John has 5 more roommates than twice as many as Bob, so John has 20+5 = 25 roommates. The answer is 25.
Self Response 2 ($\{c_{\text{strong},i}^{2}, a_{\text{strong},i}^{2}\} \in A_{\text{strong},i}^{+}$): Let x be the number of roommates Bob has. John has 5 more roommates than twice as many as Bob, so John has 2x+5 roommates. Bob has 10 roommates, so x=10. John has 2*10+5 = 25 roommates. The answer is 25.
Table 3: A real case example. Given a math question, the incorrect “weak response” is generated by $m$, while the two correct “self responses” are sampled from $A_{\text{strong},i}^{+}$ self-generated by $\mathcal{M}_{\text{plus}}$. Benefiting from dual solutions in the training data during Stage I, $\mathcal{M}_{\text{plus}}$ is able to generate different reasoning paths that converge to the same final answer. Through Stage II, $\mathcal{M}_{\text{plus}}$ learns to avoid $m$’s error of overlooking the key word “twice” in calculations.

Without access to ground truth, the current strong model with enhanced reasoning capabilities identifies the most likely correct answers based on its confidence. Specifically, for each question $q_{i}\in\mathcal{Q}$, we sample $n$ responses from $\mathcal{M}_{\text{plus}}$, and define the probability of the answer that appears most frequently among these responses as confidence. When the confidence falls below a specified threshold $\tau$, we consider the model’s judgment on this question unreliable and therefore discard it. Conversely, if the confidence is no less than $\tau$, we regard the model as capable of solving the question and proceed to construct contrastive samples as follows.

  • For a question $q_{i}$ where $\mathcal{M}_{\text{plus}}$ is confident, we denote the most confident answer as $a_{\text{strong},i}^{+}$, with $P(a_{\text{strong},i}^{+})\geq\tau$. It can be considered as the “correct” answer according to $\mathcal{M}_{\text{plus}}$. For instance, if we set $\tau=0.6$ and 8 out of 10 sampled responses have the same final answer “4.2”, we say that $\mathcal{M}_{\text{plus}}$ considers “4.2” to be the correct answer to this question, i.e., $a_{\text{strong},i}^{+}=4.2$.

  • Then we divide the sampled $n$ responses of $\mathcal{M}_{\text{plus}}$ to $q_{i}$ into two sets: $A_{\text{strong},i}^{+}=\{c_{\text{strong},i}^{j},a_{\text{strong},i}^{j}\}$ where $a_{\text{strong},i}^{j}=a_{\text{strong},i}^{+}$; $A_{\text{strong},i}^{-}=\{c_{\text{strong},i}^{k},a_{\text{strong},i}^{k}\}$ where $a_{\text{strong},i}^{k}\neq a_{\text{strong},i}^{+}$. In the above example, $|A_{\text{strong},i}^{+}|=8$ and $|A_{\text{strong},i}^{-}|=2$.

  • If the weak model holds an answer that the enhanced model considers “correct”, that is, $a_{\text{weak},i}=a_{\text{strong},i}^{+}$, we treat the weak model’s response $\{c_{\text{weak},i},a_{\text{weak},i}\}$ as the chosen response and randomly select a rejected response from $A_{\text{strong},i}^{-}$. Otherwise, if $a_{\text{weak},i}\neq a_{\text{strong},i}^{+}$, we treat $\{c_{\text{weak},i},a_{\text{weak},i}\}$ as the rejected response and randomly select a chosen response from $A_{\text{strong},i}^{+}$. Examples are shown in Tab. 3.
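The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the function name `build_preference_pair`, the data layout (a list of `(response_text, final_answer)` tuples), and the choice to skip a question when the weak answer agrees with the majority but no disagreeing sample exists are all assumptions made for the example.

```python
import random
from collections import Counter


def build_preference_pair(weak_response, weak_answer, strong_samples, tau=0.6, rng=None):
    """Build one (chosen, rejected) pair for a single question.

    strong_samples: list of (response_text, final_answer) tuples sampled
    from the enhanced strong model. Returns None if the question is discarded.
    """
    rng = rng or random.Random(0)  # fixed seed only for reproducibility of the sketch

    # Confidence = relative frequency of the most common final answer.
    counts = Counter(answer for _, answer in strong_samples)
    top_answer, top_count = counts.most_common(1)[0]
    confidence = top_count / len(strong_samples)
    if confidence < tau:
        return None  # judgment considered unreliable, question discarded

    # Split the samples into A+ (agree with top answer) and A- (disagree).
    positives = [text for text, answer in strong_samples if answer == top_answer]
    negatives = [text for text, answer in strong_samples if answer != top_answer]

    if weak_answer == top_answer:
        if not negatives:
            return None  # assumption: no pair can be formed without a rejected sample
        return {"chosen": weak_response, "rejected": rng.choice(negatives)}
    return {"chosen": rng.choice(positives), "rejected": weak_response}
```

With the example from the first bullet (8 of 10 samples agreeing and $\tau=0.6$), the confidence is 0.8, so the question is kept and the weak response becomes either the chosen or the rejected side depending on whether its final answer matches the majority answer.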

Further training $\mathcal{M}_{\text{plus}}$ on these samples enables it to distinguish between correct and incorrect solutions, leading to a stronger model $\mathcal{M}_{\text{pro}}$.

Figure 4: Main results of Stage I. “Iter. 0” presents the performance of two baselines, where “weak” indicates full weak fine-tuning, i.e., naively fine-tuning on the entire weak data, and “icl” refers to weak ICL without fine-tuning. Models connected by a line share the same training data sources. Results below “strong ceiling” present test accuracy via greedy decoding, while those above show pass@k scores ($k=10$ and $\text{temperature}=1.0$). For simplicity, we only present the pass@k scores of $\mathcal{M}_{\text{hybrid-ft}}$ and checkpoints that surpass it using greedy decoding, and full results are provided in §A.4.2.

4 Experiments

4.1 Datasets

| | # $\mathcal{D}_{\text{gold},1}$ | # $\mathcal{D}_{\text{gold},2}$ | # Test |
| --- | --- | --- | --- |
| GSM8K | 7,000 | 7,000 | 1,319 |
| MATH | 6,000 | 6,000 | 500 |

Table 4: Data Statistics. $\mathcal{D}_{\text{gold},1}$ and $\mathcal{D}_{\text{gold},2}$ are subsets of the training set. The weak model uses $\mathcal{D}_{\text{gold},1}$ to cultivate initial mathematical skills, while the strong model can only access questions from $\mathcal{D}_{\text{gold},2}$ without ground truths.

GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) are two widely used datasets for mathematical reasoning, and MATH comprises more challenging competition problems. The data statistics we use are presented in Tab. 4. Particularly, to ensure a sufficient amount of training data for developing preliminary mathematical skills in the weak model, we augment the GSM8K training set with the data constructed by Chern et al. (2023). Further details are available in §A.1.

4.2 Experiment Settings

We use Llama2-70b as the strong model and employ three weak models from different families: Llama2-7b, Gemma-2b, and Mistral-7b. We apply full parameter fine-tuning to the weak models on $\mathcal{D}_{\text{gold},1}$, and consistently adopt LoRA (Hu et al., 2022) to fine-tune the strong model. In Stage I, we perform two rounds of iterations on GSM8K and one round on MATH according to the principles of iteration outlined in §3.1. In Stage II, we adopt two preference learning-based approaches, DPO (Rafailov et al., 2023) and its variant ORPO (Hong et al., 2024). Details are provided in §A.2.
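For reference, the standard DPO objective (Rafailov et al., 2023) for a single chosen/rejected pair can be sketched as below. The function name and the default `beta=0.1` are illustrative assumptions; the sketch operates on precomputed sequence log-probabilities and is not tied to any particular training library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    beta=0.1 is an assumed value for illustration only.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model, the margin is zero and the loss is $\log 2$; the loss decreases as the policy raises the likelihood of the chosen response relative to the rejected one.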

We evaluate the accuracy on the test set. The performance of the weak model $m$ is defined as the “weak floor”. The performance of the strong model $\mathcal{M}$, fine-tuned with data containing gold solutions from $\mathcal{D}_{\text{gold},2}$, is termed the “strong ceiling”. It represents the upper limit of the capabilities that the strong model can achieve with $\mathcal{D}_{\text{gold},2}$.

4.3 Results of Stage I

The main results of Stage I on both GSM8K and MATH datasets are depicted in Fig. 4. Notably, in the MATH experiments, we randomly sample additional data that is not chosen based on the final answer consistency, due to the small amount available. Please refer to §A.4.1 for details. According to Fig. 4, we have the following observations.

Weak-ICL fine-tuning demonstrates a notable enhancement. Using our proposed method, the performance of the strong model, supervised only by the weak Gemma-2b with 25.17 accuracy on GSM8K (without any gold answers), can be improved up to 60.12, surpassing naive full weak fine-tuning by 31.08, and $\mathcal{M}_{\text{plus}}$ (i.e., $\mathcal{M}_{\text{hybrid-ft}}^{2}$) outperforms naive full weak fine-tuning by 26.99. This verifies the effectiveness of data refining before supervised fine-tuning. Also, experimental results show that the mathematical reasoning capabilities of the strong model are increasingly recovered as the weak model improves, a conclusion verified by Liu and Alahi (2024) on classification tasks. In detail, the performance on GSM8K gradually improves for Gemma-2b, Llama2-7b, and Mistral-7b ($25.17\to 33.81\to 59.51$). Hence, the maximum performance of the strong model, trained with data generated by these models, also progressively increases ($60.12\to 63.76\to 68.39$).
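Reading both reported gaps as relative to naive full weak fine-tuning with Gemma-2b on GSM8K, the implied absolute accuracies can be worked out directly:

```latex
\begin{align*}
\text{full weak fine-tuning} &= 60.12 - 31.08 = 29.04,\\
\mathcal{M}_{\text{plus}} &= 29.04 + 26.99 = 56.03.
\end{align*}
```

The second value agrees with the Stage I entry for Gemma-2b on GSM8K in Tab. 5.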

$\mathcal{M}_{\text{hybrid-ft}}$ achieves the highest pass@k scores. As expected, $\mathcal{M}_{\text{hybrid-ft}}$ achieves the highest pass@k scores in the weak-to-strong setting, benefiting from its training data that incorporates two types of solutions: one from the weak model and another from the strong model. This diversity enhances the robustness of the model by reducing the likelihood of overfitting. Additionally, the performance of $\mathcal{M}_{\text{icl-ft}}$ generally surpasses that of $\mathcal{M}_{\text{weak-ft}}$, which can be attributed to variations in process-level accuracy and possibly the solution format. Detailed analyses are conducted in §A.3.

Naive fine-tuning is inadequate for weak-to-strong reasoning. When using Gemma-2b as the weak model on the MATH dataset, full weak fine-tuning underperforms the weak floor (10.0 vs. 11.6). This indicates that naive fine-tuning, though successfully applied to classification, chess, and reward modeling tasks (Burns et al., 2023), falls short for intricate reasoning tasks, particularly those of substantial difficulty like questions in MATH. In contrast, our weak-icl fine-tuning method bridges the gap, offering an effective and scalable solution for the weak-to-strong reasoning challenge.

Effect of ICL Performance
Figure 5: Results on GSM8K supervised by Gemma-2b. Two of the plotted series are under original demonstrations, and the other two are under carefully selected demonstrations.

Given that the efficacy of weak-icl fine-tuning partially depends on the effectiveness of weak ICL, we further explore how enhancing ICL performance through careful selection of demonstrations affects the performance of weak-icl fine-tuning. Fig. 5 shows the test accuracy on GSM8K using Gemma-2b as the weak model under a different set of demonstrations.

The results indicate that the performance of weak ICL with this particular group of demonstrations increases from the original 56.48 to 64.06. We then regenerate $\mathcal{D}_{\text{icl}}$ with these demonstrations in the prompt and fine-tune the strong model on $\hat{\mathcal{D}}_{\text{icl}}$, which is selectively curated through final answer consistency. This further improves performance from 64.06 to 64.75, confirming the utility of self-directed data curation. It is worth noting that although weak ICL holds the potential for high performance, selecting effective demonstrations in a weak-to-strong framework is non-trivial and beyond the scope of this paper.
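The curation step mentioned above can be illustrated with a small sketch. It assumes that "final answer consistency" means keeping a question only when the weak model's final answer and the strong model's ICL final answer agree; the function names, the dictionary layout, and the answer normalization are hypothetical choices for this example.

```python
def normalize_answer(answer):
    """Hypothetical normalization: strip whitespace and thousands separators,
    and compare numerically when the answer parses as a number."""
    text = str(answer).strip().replace(",", "")
    try:
        return float(text)
    except ValueError:
        return text


def filter_by_answer_consistency(weak_data, icl_data):
    """Keep only questions on which both sources reach the same final answer.

    weak_data / icl_data: dict mapping question -> (solution_text, final_answer).
    Returns the curated weak and ICL subsets as two dicts.
    """
    curated_weak, curated_icl = {}, {}
    for question, (weak_solution, weak_answer) in weak_data.items():
        if question not in icl_data:
            continue
        icl_solution, icl_answer = icl_data[question]
        if normalize_answer(weak_answer) == normalize_answer(icl_answer):
            curated_weak[question] = (weak_solution, weak_answer)
            curated_icl[question] = (icl_solution, icl_answer)
    return curated_weak, curated_icl
```

No gold labels are involved in this filter: agreement between two independent sources is used as a proxy for correctness.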

4.4 Results of Stage II

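| Dataset | Weak Model | Test Accuracy: I | Test Accuracy: II. DPO | Test Accuracy: II. ORPO |
| --- | --- | --- | --- | --- |
| GSM8K | Llama2-7b | 62.62 | 66.19 (+3.57) | 68.16 (+5.54) |
| GSM8K | Gemma-2b | 56.03 | 64.52 (+8.49) | 63.91 (+7.88) |
| GSM8K | Mistral-7b | 68.39 | 70.96 (+2.57) | 72.18 (+3.79) |
| MATH | Llama2-7b | 14.00 | 12.00 (-2.00) | 15.00 (+1.00) |
| MATH | Gemma-2b | 14.20 | 11.60 (-2.60) | 16.00 (+1.80) |
| MATH | Mistral-7b | 14.80 | 13.40 (-1.40) | 17.00 (+2.20) |

Table 5: Main results of Stage II.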

As discussed in §3.2, we employ the final iteration of $\mathcal{M}_{\text{hybrid-ft}}$ as $\mathcal{M}_{\text{plus}}$ for subsequent preference learning. The experimental results in §4.3 validate that this checkpoint achieves higher pass@k and possesses significant potential for further refinement.

As shown in Tab. 5, our method for constructing positive and negative samples effectively enhances the strong model’s math reasoning capabilities. On GSM8K, both DPO and ORPO consistently achieve significant improvements using our constructed datasets, notably resulting in an increase of 8.49 points when supervised by Gemma-2b. Despite the inherently challenging nature of MATH problems, which compromises the strong model’s judgment and introduces inaccuracies in the training data, our method still achieves improvements on MATH through ORPO by at least 1 point.$^{4}$

$^{4}$Pang et al. (2024); Xu et al. (2024); Yuan et al. (2024) demonstrate that DPO can cause performance degradation on MATH due to the lack of regularization in its loss.

Data Construction Recipe

When constructing preference data, we always use weak responses generated by the weak model as one of the chosen/rejected responses, instead of relying exclusively on self-generated data. We also test the self-generated setting on GSM8K using Llama2-7b as the weak model, where both chosen and rejected responses are generated by the strong model itself. The DPO test accuracy in this setting is 62.40 (-0.22), indicating a slight performance degradation. Without ground truth, the constructed positive and negative samples actually correspond to the more frequently and less frequently occurring answers, respectively, and are related to the answers the model tends to choose. Since preference optimization essentially performs ranking, the potential benefit of this self-generated setting is minimal. Therefore, incorporating weak data signals in the preference data construction process proves to be a better approach.

4.5 Analysis

Figure 6: Test accuracy across varying difficulty levels on the MATH test set. We use ORPO to obtain $\mathcal{M}_{\text{pro}}$.

For further analysis, we examine the accuracy across different difficulty levels in the MATH test set (see §A.1.2 for data statistics).

As shown in Fig. 6, the strong model exhibits better generalization on easier problems. Specifically, even though Llama2-7b achieves only 6.98 points accuracy on level 1 problems, Llama2-70b can achieve an accuracy exceeding 30 points after training using this weak supervision. For more challenging problems (levels 4-5), $\mathcal{M}_{\text{pro}}$, enhanced with ORPO, even surpasses the strong ceiling obtained by supervised fine-tuning solely on gold solutions. This phenomenon serves to validate the effectiveness of learning from incorrect data.

4.6 Experiments Closer to Future Scenarios

| | Test Accuracy |
| --- | --- |
| Weak Floor | 11.82 |
| Full Weak FT | 12.46 |
| Weak ICL | 8.63 |
| $\mathcal{M}_{\text{weak-ft}}^{1}$ | 12.78 |
| $\mathcal{M}_{\text{icl-ft}}^{1}$ | 9.58 |
| $\mathcal{M}_{\text{hybrid-ft}}^{1}$ | 11.18 |
| $\mathcal{M}_{\text{weak-ft}}^{2}$ | $\underline{13.10}$ |
| $\mathcal{M}_{\text{icl-ft}}^{2}$ | 11.50 |
| $\mathcal{M}_{\text{hybrid-ft}}^{2}$ ($\mathcal{M}_{\text{plus}}$) | 11.82 |
| $\mathcal{M}_{\text{pro}}$ | $\mathbf{15.65}$ |

Table 6: Results on OlympicArena using the Llama3 family. The best result is in bold, and the best result of supervised fine-tuning is underlined.

In preliminary tests with Llama3-70b (AI@Meta, 2024), we observe that on GSM8K and MATH, Llama3-70b can largely unlock its potential through in-context learning, with marginal or even adverse impacts from parameter updates due to training instabilities. Consequently, we focus on a more challenging dataset developed after the release of Llama3-70b, OlympicArena (Huang et al., 2024), to simulate a more realistic future scenario.

We only consider English questions in OlympicArena, excluding the CODE (Code Generation) and OT (Others) problem types that require case-based or expert evaluation. This results in 6,020 training data without solutions and final answers, and 313 test data with final answers to assess the performance of different methods. We use Llama3-8b-instruct (without initial fine-tuning on a subset of training data) as the weak model and Llama3-70b as the strong model to be improved. The hyperparameters are consistent with those used for GSM8K. This configuration more closely resembles future real-world weak-to-strong scenarios.

Experimental results are displayed in Tab. 6. “Weak Floor” represents the zero-shot performance of Llama3-8b-instruct, “Full Weak FT” denotes the performance of Llama3-70b after supervised fine-tuning on the full set (i.e., 6,020) of weak solutions generated by Llama3-8b-instruct on the training set, and “Weak ICL” indicates the performance of Llama3-70b under 4-shot weak demonstrations generated by Llama3-8b-instruct. Despite having more parameters, Llama3-70b under in-context learning still underperforms the zero-shot Llama3-8b-instruct due to insufficient mining capabilities.

$\mathcal{M}_{\text{weak-ft}}^{1}$, obtained by our proposed weak-icl fine-tuning method, achieves higher performance than Full Weak FT with fewer training examples (i.e., 746), outperforming it by 0.32 points. After the second stage of preference optimization, which further exploits the weak model and training questions without answers, the strong model’s performance reaches 15.65, an improvement of 3.19 points over Full Weak FT. This demonstrates the robustness and generalizability of our method in scenarios closer to future conditions.

5 Related Work

5.1 LLM Training

LLMs can enhance their ability to solve tasks and better align with human instructions through a supervised fine-tuning (SFT) phase (Zhang et al., 2023; Dong et al., 2023a; Lv et al., 2023b, a). This phase heavily relies on the quality of training data, as previous studies (Zhou et al., 2023a; Wang et al., 2023b) demonstrate that higher data quality translates to substantial gains in model performance. In this paper, we investigate the potential of learning from weak supervisions.

To further align LLMs with human values and enable learning from both positive and negative feedback, additional training is required, such as reinforcement learning from human feedback (RLHF, Ouyang et al. (2022); Bai et al. (2022)) and direct preference optimization (DPO, Rafailov et al. (2023)). In particular, DPO reparameterizes reward functions in RLHF and has been widely used due to its simplicity. Several variants of DPO have since emerged to further enhance its stability and performance, such as ORPO (Hong et al., 2024) and SimPO (Meng et al., 2024). This paper explores the capabilities of DPO and ORPO using our constructed contrastive samples in a weak-to-strong setting.

5.2 Mathematical Reasoning

The exploration of mathematical reasoning within LLMs has been a focal point for evaluating their cognitive capabilities akin to human reasoning (Qiao et al., 2023; Lu et al., 2023). Researchers have developed various methods to enhance the mathematical reasoning capabilities of LLMs after pre-training, which can be broadly classified into two categories: (1) Prompting: Some works (Kojima et al., 2022; Wei et al., 2022; Zhou et al., 2023b; Liu et al., 2023) aim to elicit the intrinsic reasoning abilities of LLMs through specific prompt engineering, without updating the model parameters; (2) Fine-tuning: Another line of studies focuses on generating a more extensive and higher-quality collection of question-answer pairs (Yu et al., 2023; Wang et al., 2023c, a). Through supervised fine-tuning and preference optimization (Luo et al., 2023; Azerbayev et al., 2023; Mitra et al., 2024; Xu et al., 2024), the models can achieve significant improvements in their mathematical problem-solving capabilities.

6 Conclusion

In this paper, we explore the efficacy of the weak-to-strong framework in complex reasoning tasks. We introduce a new method that elicits strong capabilities using weak supervisions, without relying on annotations from humans or more advanced models. This method focuses on the strong model’s ability to autonomously refine its training data, even if it has not learned the task before. By iteratively expanding its learning scope, the strong model continuously broadens its reasoning skills. This self-directed data curation is crucial for scaling up the enhancement of AI reasoning capabilities, making the model more independent and effective in its developmental trajectory. Through this work, we seek to illuminate new pathways for AI development, emphasizing the critical role of innovative model supervision in advancing AGI and beyond.

Limitations

In our experiments, we use Llama2-70b and Llama3-70b as a proxy for hypothetical superintelligent models of the future. We acknowledge that there might be performance discrepancies compared to a genuine future advanced model. Nonetheless, our efforts lay the groundwork for investigating methodologies in weak-to-strong reasoning. Additionally, this paper does not explore supervision at the process level, such as the model’s ability to learn from partially correct data (Ni et al., 2023; Lightman et al., 2023). In the weak-to-strong scenario, the presence of non-negligible errors and noise at the process level yields only limited performance improvements in our early experiments, thereby posing challenges for future research.

Acknowledgements

We sincerely thank Xuefeng Li, Haoyang Zou, and Ting Wu for their valuable insights during discussions, which greatly enhanced the quality of this work.

References

Appendix A Appendix

A.1 Dataset Details

A.1.1 Dataset Construction

For GSM8K, we evenly divide the original training dataset of 7,473 samples into two subsets, $\mathcal{D}_{\text{gold},1}$ and $\mathcal{D}_{\text{gold},2}$. Additionally, we supplement both $\mathcal{D}_{\text{gold},1}$ and $\mathcal{D}_{\text{gold},2}$ with data of the same distribution developed by Chern et al. (2023), until each contains 7,000 samples. Thus, the weak model uses $\mathcal{D}_{\text{gold},1}$, which includes both questions and gold solutions, to obtain basic problem-solving capabilities. Meanwhile, the strong model can only access a training dataset $\mathcal{Q}=\{q_{i}\}$, where $q_{i}\in\mathcal{D}_{\text{gold},2}$, consisting of 7,000 mathematical problems without ground truth answers. GSM8K also includes 1,319 test samples.
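The GSM8K split described above can be sketched as follows. This is only an illustration under stated assumptions: the random shuffle, the seed, the `"question"` field name, and the function name `build_splits` are not specified in the paper and are introduced here for the example.

```python
import random


def build_splits(original, supplement, target_size=7000, seed=0):
    """Split `original` evenly into two subsets and top each up from
    `supplement` until it holds `target_size` examples.

    Each example is assumed to be a dict with at least a "question" key.
    Returns (gold_1, gold_2, questions), where `questions` is the
    answer-free view of gold_2 that the strong model is allowed to see.
    """
    rng = random.Random(seed)

    shuffled = list(original)
    rng.shuffle(shuffled)  # assumption: the even split is random
    half = len(shuffled) // 2
    gold_1, gold_2 = shuffled[:half], shuffled[half:]

    extra = list(supplement)
    rng.shuffle(extra)
    need_1 = max(0, target_size - len(gold_1))
    need_2 = max(0, target_size - len(gold_2))
    if need_1 + need_2 > len(extra):
        raise ValueError("not enough supplementary examples")

    # Disjoint slices of the supplement go to the two subsets.
    gold_1 = gold_1 + extra[:need_1]
    gold_2 = gold_2 + extra[need_1:need_1 + need_2]

    questions = [example["question"] for example in gold_2]
    return gold_1, gold_2, questions
```

With 7,473 original samples, the two halves contain 3,736 and 3,737 examples, so 6,527 supplementary examples are needed in total to reach 7,000 per subset.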

For MATH, we employ the same subset of 500 representative problems as the test set, identical to that used in Lightman et al. (2023). We then split the remaining 12,000 samples evenly between $\mathcal{D}_{\text{gold},1}$ and $\mathcal{D}_{\text{gold},2}$, each containing 6,000 samples.

A.1.2 Statistics of MATH test set

| # L1 | # L2 | # L3 | # L4 | # L5 | # Total |
| --- | --- | --- | --- | --- | --- |
| 43 | 90 | 105 | 128 | 134 | 500 |

Table 7: Data statistics of the MATH test set.

The distribution of difficulty levels across the 500 test data samples in MATH is listed in Tab. 7.

A.2 Training Details

For supervised fine-tuning in Stage I, we adopt LoRA to fine-tune the strong model $\mathcal{M}$ with a learning rate of $1\times 10^{-4}$ and search for weight decay in the set $\{0,0.01\}$. We run 2 epochs on GSM8K and 3 epochs on MATH, with a batch size of 8. In Stage II, we employ two preference optimization methods. For DPO, we train the enhanced strong model $\mathcal{M}_{\text{plus}}$ with a learning rate of $1\times 10^{-5}$ and run 1 epoch. For ORPO, we search for $\beta$ in the set $\{0.1,0.5,1.0\}$ with a learning rate of $3\times 10^{-5}$ and run 1 epoch. All experiments are conducted using A100 GPUs.
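For convenience, the hyperparameters listed above can be collected into a small configuration sketch. The dictionary layout, the key names, and the `expand_grid` helper are hypothetical and chosen only for illustration; the values are those stated in this section, and settings not reported here (e.g., LoRA rank) are deliberately left out.

```python
from itertools import product

# Stage I: supervised fine-tuning of the strong model with LoRA.
STAGE_I_SFT = {
    "method": "LoRA",
    "learning_rate": 1e-4,
    "weight_decay_grid": [0.0, 0.01],  # searched values
    "batch_size": 8,
    "epochs": {"GSM8K": 2, "MATH": 3},
}

# Stage II: preference optimization starting from the Stage I checkpoint.
STAGE_II = {
    "DPO": {"learning_rate": 1e-5, "epochs": 1},
    "ORPO": {"learning_rate": 3e-5, "epochs": 1, "beta_grid": [0.1, 0.5, 1.0]},
}


def expand_grid(config):
    """Yield one concrete run config per combination of the `*_grid` entries."""
    grid_keys = [key for key in config if key.endswith("_grid")]
    fixed = {key: value for key, value in config.items() if not key.endswith("_grid")}
    for values in product(*(config[key] for key in grid_keys)):
        run = dict(fixed)
        for key, value in zip(grid_keys, values):
            run[key[: -len("_grid")]] = value  # e.g. "beta_grid" -> "beta"
        yield run
```

Expanding `STAGE_II["ORPO"]` yields three runs, one per candidate $\beta$, while a config without any searched entry yields a single run.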

When constructing contrastive samples in Stage II, we sample $n=10$ responses at $\text{temperature}=1.0$, and use a confidence threshold of $\tau=0.6$. Normally, we evaluate using greedy decoding. For calculating pass@k, we set $k=10$ at $\text{temperature}=1.0$.
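The two evaluation modes can be contrasted with a short sketch. It assumes that pass@k is computed by drawing exactly $k$ samples per question and counting the question as solved if any sample is correct; the exact estimator is not spelled out in this section, and the function names are illustrative.

```python
def greedy_accuracy(correct_flags):
    """Accuracy (in %) when one greedy-decoded answer is scored per question."""
    return 100.0 * sum(correct_flags) / len(correct_flags)


def pass_at_k(samples_correct, k=10):
    """pass@k (in %) under the assumption described above.

    samples_correct: one list of booleans per question, marking whether
    each sampled response reached the correct final answer.
    """
    solved = 0
    for flags in samples_correct:
        if len(flags) < k:
            raise ValueError("fewer than k samples for a question")
        solved += any(flags[:k])  # solved if at least one of the k samples is correct
    return 100.0 * solved / len(samples_correct)
```

A model whose samples are diverse can score higher on pass@k than a model with better greedy accuracy, because only one of the $k$ samples needs to be correct.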

A.3 Additional Analysis

A.3.1 Diversity Analysis

Figure 7: Frequency distribution of the number of distinct solutions on GSM8K supervised by Llama2-7b.

To investigate why $\mathcal{M}_{\text{hybrid-ft}}$ achieves high pass@k scores despite lower greedy decoding results, we explore the diversity of responses generated by $\mathcal{M}_{\text{hybrid-ft}}$ and $\mathcal{M}_{\text{icl-ft}}$. We specifically examine the frequency distribution of the number of distinct solutions for each question across the two strong model checkpoints.

Given a question from $\mathcal{D}_{\text{gold},2}$, we sample $n=10$ responses at $\text{temperature}=1.0$ for each checkpoint. We consider two responses distinct if their ROUGE-L similarity is less than 0.7. We then compute the number of clusters formed by these distinct responses and plot their frequency distribution in Fig. 7.
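The counting procedure can be sketched as below. The text does not specify the ROUGE-L variant, the tokenization, or the clustering rule, so this sketch assumes whitespace tokens, the LCS-based F1 score, and a greedy pass in which a response opens a new cluster only if it is distinct from every existing cluster representative. All function names are hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(a: str, b: str) -> float:
    """ROUGE-L F1 between two strings on whitespace tokens (assumed variant)."""
    ta, tb = a.split(), b.split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_length(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(ta), lcs / len(tb)
    return 2 * precision * recall / (precision + recall)


def count_distinct_solutions(responses: Sequence[str], threshold: float = 0.7) -> int:
    """Greedy clustering: a response starts a new cluster only if its ROUGE-L
    similarity to every existing representative is below the threshold."""
    representatives: list[str] = []
    for resp in responses:
        if all(rouge_l_f1(resp, rep) < threshold for rep in representatives):
            representatives.append(resp)
    return len(representatives)
```

Applying `count_distinct_solutions` to the 10 sampled responses of each question gives one integer between 1 and 10 per question; the histogram of these integers is the kind of distribution plotted in Fig. 7. The greedy rule is order-dependent, so other clustering choices may give slightly different counts.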

As shown in Fig. 7, $\mathcal{M}_{\text{icl-ft}}^{2}$ tends to produce nearly the same sampled responses for each question in more than 36% of the instances. This indicates a limited exploration of problem-solving paths and difficulty in generating diverse, correct solutions during the sampling process. In contrast, $\mathcal{M}_{\text{hybrid-ft}}^{2}$ generates a variety of responses, increasing its hit rate with multiple sampling and thus achieving higher pass@k scores. Additionally, diverse solutions are crucial for robust outcomes and model generalization (Yu et al., 2024; Wu et al., 2024). Diverse solutions also help keep positive and negative samples distinct, which supports selecting $\mathcal{M}_{\text{hybrid-ft}}^{2}$ for preference optimization in Stage II.

A.3.2 Training Accuracy of Stage I

| Dataset | Weak Model | Training Data | Final Answer | Process-Level |
|---|---|---|---|---|
| GSM8K | Llama2-7b | $\hat{\mathcal{D}}_{\text{weak}}^{1}$ | 89.82 | 72.50 |
| GSM8K | Llama2-7b | $\hat{\mathcal{D}}_{\text{icl}}^{1}$ | 89.82 | 76.50 |
| GSM8K | Gemma-2b | $\hat{\mathcal{D}}_{\text{weak}}^{1}$ | 87.97 | 73.10 |
| GSM8K | Gemma-2b | $\hat{\mathcal{D}}_{\text{icl}}^{1}$ | 87.97 | 73.80 |
| GSM8K | Mistral-7b | $\hat{\mathcal{D}}_{\text{weak}}^{1}$ | 92.38 | 80.10 |
| GSM8K | Mistral-7b | $\hat{\mathcal{D}}_{\text{icl}}^{1}$ | 92.38 | 77.90 |
| MATH | Llama2-7b | $\hat{\mathcal{D}}_{\text{weak}}$ | 46.11 | 32.04 |
| MATH | Llama2-7b | $\hat{\mathcal{D}}_{\text{icl}}$ | 46.11 | 39.22 |
| MATH | Gemma-2b | $\hat{\mathcal{D}}_{\text{weak}}$ | 30.40 | 26.30 |
| MATH | Gemma-2b | $\hat{\mathcal{D}}_{\text{icl}}$ | 31.90 | 29.90 |
| MATH | Mistral-7b | $\hat{\mathcal{D}}_{\text{weak}}$ | 24.75 | 21.50 |
| MATH | Mistral-7b | $\hat{\mathcal{D}}_{\text{icl}}$ | 25.25 | 25.60 |

Table 8: Training accuracy of Stage I.

Tab. 8 presents the final answer accuracy and process-level accuracy for both weak data and icl data utilized in the initial round. [Footnote 5: The relatively low accuracy observed in MATH explains why we choose to perform one round of iteration.] To compute process-level accuracy, we randomly sample a maximum of 1,000 training samples from each of the weak data and icl data and evaluate them using GPT-4o, following Xia et al. (2024) and Zeng et al. (2023); the prompt we use is shown in Tab. 13. A solution counts as correct at the process level only if none of its intermediate reasoning steps contains an error.

The results show that, despite consistent final answer accuracy (with the exceptions of Gemma-2b and Mistral-7b on MATH using augmented training data), there are noticeable differences in process-level performance, leading to variations in the effectiveness of $\mathcal{M}_{\text{weak-ft}}$ and $\mathcal{M}_{\text{icl-ft}}$. Moreover, it is counterintuitive that models trained on icl data with relatively low process-level accuracy achieve higher performance. This might be because the models prefer self-generated solutions and can more effectively learn those that better align with their inherent distribution (Panickssery et al., 2024; Ren et al., 2024; Fan et al., 2024).

A.4 Additional Experiments

| Dataset | Weak Model | Strong Model | Greedy Decoding | Pass@k |
|---|---|---|---|---|
| GSM8K | Llama2-7b | $\mathcal{M}_{\text{weak-ft}}^{2}$ | 57.47 | 77.26 |
| GSM8K | Llama2-7b | $\mathcal{M}_{\text{icl-ft}}^{2}$ | **63.76** | 81.05 |
| GSM8K | Llama2-7b | $\mathcal{M}_{\text{hybrid-ft}}^{2}$ | 62.62 | **86.28** |
| GSM8K | Gemma-2b | $\mathcal{M}_{\text{weak-ft}}^{2}$ | 45.03 | 71.49 |
| GSM8K | Gemma-2b | $\mathcal{M}_{\text{icl-ft}}^{2}$ | **60.12** | 80.14 |
| GSM8K | Gemma-2b | $\mathcal{M}_{\text{hybrid-ft}}^{2}$ | 56.03 | **85.14** |
| GSM8K | Mistral-7b | $\mathcal{M}_{\text{weak-ft}}^{2}$ | 66.72 | 85.67 |
| GSM8K | Mistral-7b | $\mathcal{M}_{\text{icl-ft}}^{2}$ | 66.64 | 84.08 |
| GSM8K | Mistral-7b | $\mathcal{M}_{\text{hybrid-ft}}^{2}$ | **68.39** | **88.70** |
| MATH | Llama2-7b | $\mathcal{M}_{\text{weak-ft}}^{1}$ | 10.80 | 34.80 |
| MATH | Llama2-7b | $\mathcal{M}_{\text{icl-ft}}^{1}$ | 11.80 | **35.00** |
| MATH | Llama2-7b | $\mathcal{M}_{\text{hybrid-ft}}^{1}$ | **14.00** | 33.60 |
| MATH | Gemma-2b | $\mathcal{M}_{\text{weak-ft}}^{1}$ | **14.80** | 38.80 |
| MATH | Gemma-2b | $\mathcal{M}_{\text{icl-ft}}^{1}$ | 13.60 | 33.60 |
| MATH | Gemma-2b | $\mathcal{M}_{\text{hybrid-ft}}^{1}$ | **14.80** | **39.60** |
| MATH | Mistral-7b | $\mathcal{M}_{\text{weak-ft}}^{1}$ | 10.80 | 34.20 |
| MATH | Mistral-7b | $\mathcal{M}_{\text{icl-ft}}^{1}$ | **15.60** | 31.60 |
| MATH | Mistral-7b | $\mathcal{M}_{\text{hybrid-ft}}^{1}$ | 14.20 | **38.40** |

Table 9: Greedy decoding and pass@k results ($k=10$ and $\text{temperature}=1.0$) for the three variants of enhanced strong models obtained through weak-icl fine-tuning. The best results are in bold.

| Weak Model | Method | Test Acc. | # Training Data |
|---|---|---|---|
| Gemma-2b | SFT on Full Weak | 10.00 | 6,000 |
| Gemma-2b | SFT on Gold Weak | 15.60 | 644 |
| Gemma-2b | $\mathcal{M}_{\text{weak-ft}}^{1}$ | 11.00 | 448 |
| Gemma-2b | $\mathcal{M}_{\text{icl-ft}}^{1}$ | 11.40 | 448 |
| Gemma-2b | $\mathcal{M}_{\text{hybrid-ft}}^{1}$ | 13.20 | $448 \times 2$ |
| Mistral-7b | SFT on Full Weak | 14.40 | 6,000 |
| Mistral-7b | SFT on Gold Weak | 16.60 | 861 |
| Mistral-7b | $\mathcal{M}_{\text{weak-ft}}^{1}$ | 12.40 | 584 |
| Mistral-7b | $\mathcal{M}_{\text{icl-ft}}^{1}$ | 15.60 | 584 |
| Mistral-7b | $\mathcal{M}_{\text{hybrid-ft}}^{1}$ | 14.20 | $584 \times 2$ |

Table 10: Stage I results on MATH without augmenting training data. “Test Acc.” refers to Test Accuracy.

| Dataset | Weak Model | Full Weak FT | Weak-ICL FT |
|---|---|---|---|
| GSM8K | Llama2-7b | 22.47 | 78.53 |
| GSM8K | Gemma-2b | 8.27 | 75.71 |
| GSM8K | Mistral-7b | 14.63 | 71.38 |
| MATH | Llama2-7b | 10.45 | 71.64 |
| MATH | Gemma-2b | -25.81 | 64.52 |
| MATH | Mistral-7b | 19.05 | 28.57 |

Table 11: Performance Gap Recovered (PGR) in Stage I.

A.4.1 Details of Stage I on MATH

In the Stage I experiment on the MATH dataset, we find that the amount of training data selected via final answer consistency is too limited for the strong model to learn effective features through supervised fine-tuning. To address this, we randomly sample additional inconsistent data. Based on the weak models' performance (Llama2-7b $<$ Gemma-2b $<$ Mistral-7b on MATH), we supplement the data (both $\hat{\mathcal{D}}_{\text{weak}}$ and $\hat{\mathcal{D}}_{\text{icl}}$) to 1,000 instances for Gemma-2b and 2,000 instances for Mistral-7b, and present the results in Fig. 4. The original amount of training data and test accuracy for these two weak models are shown in Tab. 10.
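The supplementing step can be sketched as follows, assuming the additional examples are drawn uniformly at random without replacement from the pool of inconsistent data. The function name `supplement_training_set`, its arguments, and the fixed seed are hypothetical choices made for illustration.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def supplement_training_set(
    consistent: Sequence[T],
    inconsistent: Sequence[T],
    target_size: int,
    seed: int = 0,
) -> list[T]:
    """Keep all consistency-filtered examples and top up with randomly
    sampled inconsistent examples until the target size is reached."""
    needed = max(0, target_size - len(consistent))
    if needed > len(inconsistent):
        raise ValueError("not enough inconsistent examples to reach the target size")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return list(consistent) + rng.sample(list(inconsistent), needed)


# Example with placeholder IDs: 448 consistent examples out of a pool of 6,000,
# topped up to 1,000 as described for Gemma-2b.
consistent_ids = list(range(448))
inconsistent_ids = list(range(448, 6000))
train_ids = supplement_training_set(consistent_ids, inconsistent_ids, target_size=1000)
```

With these sizes, 552 inconsistent examples are added to the 448 consistent ones.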

A.4.2 Pass@k Results

Tab. 9 summarizes the greedy decoding and pass@k results for the three variants of enhanced strong models obtained through weak-icl fine-tuning. Notably, $\mathcal{M}_{\text{hybrid-ft}}$ utilizes a training set that combines those used by $\mathcal{M}_{\text{weak-ft}}$ and $\mathcal{M}_{\text{icl-ft}}$. The results indicate that $\mathcal{M}_{\text{hybrid-ft}}$ outperforms its counterparts in terms of pass@k, with margins of up to 5.23 points. The only exception occurs on the MATH dataset supervised by Llama2-7b, where the underperformance is likely due to limited training data.

The superior performance of $\mathcal{M}_{\text{hybrid-ft}}$ can be attributed to the diversity of solutions in its training set (verified in §A.3.1), validating our approach of adopting the final iteration of $\mathcal{M}_{\text{hybrid-ft}}$ from Stage I for preference optimization in Stage II. It is important to note that while higher pass@k scores suggest greater potential, the true challenge lies in effectively harnessing this potential, particularly in the weak-to-strong setting where no ground truths are available. Our proposed weak-to-strong preference optimization in Stage II addresses this challenge, transforming theoretical potential into tangible performance gains in greedy decoding, as shown in §4.4.

A.4.3 PGR of Stage I

Burns et al. (2023) propose a new metric called performance gap recovered (PGR) to measure the fraction of the performance gap that can be recovered through weak supervision, as illustrated in Eq. 1. Tab. 11 displays the results of the naive full weak fine-tuning (i.e., Full Weak FT) and our best weak-icl fine-tuning (i.e., Weak-ICL FT) in terms of PGR, which also demonstrate that our method can outperform the simple competitor. However, the variations in PGR across different weak models do not provide meaningful insights. In the experiments described in the main text, we use test accuracy instead to provide a more detailed depiction of model performance.

$$\text{PGR} = \frac{\text{weak-to-strong} - \text{weak floor}}{\text{strong ceiling} - \text{weak floor}}. \tag{1}$$
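As a small illustration of Eq. 1, the function below computes PGR from three accuracies. The function name and the input numbers are hypothetical placeholders, not results from the paper; the output is a fraction, whereas Tab. 11 reports PGR as a percentage.

```python
def performance_gap_recovered(
    weak_to_strong: float, weak_floor: float, strong_ceiling: float
) -> float:
    """Fraction of the weak-floor to strong-ceiling gap recovered (Eq. 1)."""
    gap = strong_ceiling - weak_floor
    if gap == 0:
        raise ValueError("strong ceiling equals weak floor; PGR is undefined")
    return (weak_to_strong - weak_floor) / gap


# Hypothetical accuracies, chosen only to show the behaviour of the metric.
recovered = performance_gap_recovered(weak_to_strong=50.0, weak_floor=20.0, strong_ceiling=60.0)
regressed = performance_gap_recovered(weak_to_strong=15.0, weak_floor=20.0, strong_ceiling=60.0)
```

In the first call the strong model recovers three quarters of the gap. In the second it scores below the weak floor, so PGR is negative, which is how a value such as the negative Full Weak FT entry in Tab. 11 can arise.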

A.4.4 Effect of SFT Data

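| Weak Model | SFT Data | Test Accuracy |
|---|---|---|
| Llama2-7b | Full Weak | 42.38 |
| Llama2-7b | Gold Weak | 54.21 (+11.83) |
| Llama2-7b | Our Weak | 53.68 (+11.30) |
| Llama2-7b | Full ICL | 59.14 |
| Llama2-7b | Gold ICL | 64.29 (+5.15) |
| Llama2-7b | Our ICL | 61.71 (+2.57) |
| Gemma-2b | Full Weak | 29.04 |
| Gemma-2b | Gold Weak | 46.40 (+17.36) |
| Gemma-2b | Our Weak | 42.91 (+13.87) |
| Gemma-2b | Full ICL | 58.61 |
| Gemma-2b | Gold ICL | 63.86 (+5.25) |
| Gemma-2b | Our ICL | 59.21 (+0.60) |
| Mistral-7b | Full Weak | 61.33 |
| Mistral-7b | Gold Weak | 67.55 (+6.22) |
| Mistral-7b | Our Weak | 65.96 (+4.63) |
| Mistral-7b | Full ICL | 62.32 |
| Mistral-7b | Gold ICL | 66.64 (+4.32) |
| Mistral-7b | Our ICL | 65.43 (+3.11) |

Table 12: Detailed results of Stage I on GSM8K.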

Tab. 12 presents more detailed comparative experimental results of Stage I on GSM8K. “Full Weak” denotes full weak fine-tuning, “Our Weak” is equivalent to $\mathcal{M}_{\text{weak-ft}}^{1}$, and “Our ICL” is equivalent to $\mathcal{M}_{\text{icl-ft}}^{1}$. “Gold Weak” refers to the scenario where weak data with correct final answers are filtered and used for supervised fine-tuning, which is impossible in the weak-to-strong setting and just used for experimental analysis. Similarly, “Gold ICL” refers to the scenario where solutions with correct final answers, generated by the strong model via weak ICL, are filtered.

Compared to using a large volume of noisy data (i.e., Full Weak and Full ICL), reducing the data quantity while enhancing data quality can significantly improve the accuracy of the trained model, with potential gains over 17 points. Although our method performs slightly lower than the gold results, it proves highly effective and stable in scenarios where obtaining the ground truth is impossible.

Question:
{question}
Student Solution:
{solution}
Your task involves three parts:
1. **Step-by-step Evaluation:** Go through the student solution carefully and identify key errors and potential misunderstandings that led to the incorrect solution.
2. **Final Judgement:** Provide an overall judgement on the correctness of the student’s solution.
3. **First Error Step:** If the solution is incorrect, generate the step number where the first error occurs, otherwise generate N/A here.
Here’s the format I want:
Step-by-step Evaluation: [Provide a step by step examination of the student solution and identify key errors and misunderstandings here.]
Final Judgement: [Insert only **correct** or **wrong** here]
First Error Step: [Insert either N/A or the step number where the first error occurs]
Please follow this format without any additional introductory or concluding statements.
Table 13: Prompt used to evaluate process-level accuracy.
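Given evaluator outputs in the format requested by the prompt in Tab. 13, process-level accuracy can be computed with a sketch like the one below. The parsing rule and the choice to skip outputs without a recognizable verdict are assumptions; the function names `parse_judgement` and `process_level_accuracy` are hypothetical.

```python
import re
from typing import Iterable, Optional

# Matches "Final Judgement: correct", "Final Judgement: **wrong**", and similar.
JUDGEMENT_RE = re.compile(r"Final Judgement:\s*\**\s*(correct|wrong)\b", re.IGNORECASE)


def parse_judgement(evaluation: str) -> Optional[bool]:
    """Return True for 'correct', False for 'wrong', None if no verdict is found."""
    match = JUDGEMENT_RE.search(evaluation)
    if match is None:
        return None
    return match.group(1).lower() == "correct"


def process_level_accuracy(evaluations: Iterable[str]) -> float:
    """Percentage of parsed evaluations judged correct.

    Assumption: outputs without a parseable verdict are skipped
    rather than counted as wrong.
    """
    verdicts = [parse_judgement(e) for e in evaluations]
    parsed = [v for v in verdicts if v is not None]
    if not parsed:
        raise ValueError("no parseable judgements")
    return 100.0 * sum(parsed) / len(parsed)
```

Counting unparseable outputs as wrong instead would give a more conservative estimate; which convention applies here is not stated in the text.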