Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models

Jung Hyun Lee*1†, June Yong Yang*2†, Byeongho Heo3, Dongyoon Han3, Kang Min Yoo1,4†

1NAVER Cloud, 2KAIST AI, 3NAVER AI Lab, 4SNU AI Center
†[email protected], [email protected], [email protected]
*Equal contribution.
Preprint.
Abstract

Large Language Models (LLMs) have demonstrated impressive problem-solving capabilities in mathematics through step-by-step reasoning chains. However, because language models generate text autoregressively, token by token, they are susceptible to reasoning errors that degrade the quality of subsequent reasoning chains and the final answer. Recent works have proposed adopting external verifiers to guide the generation of reasoning paths, but existing works utilize models that have been trained with step-by-step labels to assess the correctness of token-by-token reasoning chains. Consequently, they struggle to recognize discriminative details of tokens within a reasoning path and lack the ability to evaluate whether an intermediate reasoning path is on a promising track toward the correct final answer. To amend the lack of sound and token-grained math-verification signals, we devise a novel training scheme for verifiers that applies token-level supervision with the expected cumulative reward (i.e., value). Furthermore, we propose a practical formulation of the cumulative reward by reducing it to the probability of future correctness of the final answer, thereby enabling the empirical estimation of the value. Experimental results on mathematical reasoning benchmarks show that the Token-Supervised Value Model (TVM) can outperform step-by-step verifiers on GSM8K and MATH with Mistral and Llama.



1 Introduction


Figure 1: Illustrative comparison of token-level supervision (TVM; ours) with outcome supervision (ORM) and process supervision (PRM). We provide two examples for each correct and wrong reasoning path. In reasoning step 4 of each example, both ORM and PRM use uniform labels judged by the correctness of either an entire reasoning path or step, which poses challenges for recognizing discriminative details of tokens within a reasoning path. On the other hand, TVM is supervised with distinct per-token labels, thus enabling the distinction of the details of tokens within a reasoning path and leading to more precise outcomes (see Fig. 2).

Large language models (LLMs) pre-trained on massive data have achieved human-level performance across a wide range of tasks in natural language processing (Maslej et al., 2024). A notable exception to this trend is complex multi-step reasoning tasks such as mathematical problem solving, where current state-of-the-art LLMs still struggle to attain near-human performance. Previous studies have been focused on enhancing the reasoning capabilities of LLMs through: encouraging LLMs to generate step-by-step thought processes via few-shot or zero-shot prompting (Wei et al., 2022; Kojima et al., 2022); fine-tuning LLMs with question-solution pairs to generate intermediate reasoning steps before producing a final answer (Cobbe et al., 2021; Luo et al., 2023; Yu et al., 2023; Yuan et al., 2023); and employing aggregation techniques such as majority voting over final answers extracted from solutions generated by LLMs (Wang et al., 2023).

However, when LLMs are left to their own devices to solve given problems, they remain error-prone due to their autoregressive nature in generating reasoning paths. If an LLM, by chance, produces a single error during generation, the reasoning path can be easily steered towards a wrong answer. This would worsen for LLMs when they face more complex reasoning tasks such as advanced-level mathematical problems in the MATH dataset (Hendrycks et al., 2021). To address this, researchers have focused on providing external aid to the LLM by training verifiers to assess the correctness of generated reasoning paths.

Existing verifiers can be categorized into two types: outcome-supervised reward models (ORMs) and process-supervised reward models (PRMs). ORMs (Cobbe et al., 2021; Uesato et al., 2022; Yu et al., 2024) are trained to assess the correctness of a reasoning path by labeling each token as either correct or incorrect solely based on whether the final answer in the reasoning path is correct. PRMs (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2024) are trained with step-level labels to assess the correctness of each reasoning step, and they are generally preferred over ORMs due to the finer resolution of assessment in practice. Despite being proposed to assist LLMs, current verifiers may retain a fundamental misalignment with their per-token granularity. Since ORMs and PRMs employ uniform labels according to the correctness of either a whole reasoning path or step, respectively (Fig. 1), we argue that they were not designed to (i) learn the discriminative details of tokens within a reasoning path or (ii) evaluate whether an intermediate reasoning path is on a promising track toward the correct final answer.

In this paper, we propose the Token-supervised Value Model (TVM), a novel verifier that supervises each token in a reasoning path with a distinctive label, training each token with the expected cumulative reward. Unlike ORMs and PRMs, our token-level supervision with distinct per-token value labels along a reasoning path (Fig. 1) equips TVMs with the ability to capture the discriminative details of tokens within a reasoning path (see Fig. 2). Furthermore, providing the theoretical insight that the value of each token is equivalent to the probability of reaching the correct final answer from that token, we propose to label each token via empirical value estimation along sampled reasoning paths. TVM is trained to predict the probability of a per-token intermediate reasoning path being on a promising track toward the correct final answer. Therefore, TVM can choose, among candidate reasoning paths, those most likely to reach the correct final answer, whether they are partial or complete. Our contributions are threefold:

  • We propose the Token-supervised Value Model (TVM), a new verifier capable of capturing token-wise details via direct supervision with the expected cumulative reward (i.e., value) for each token along a reasoning path.

  • We generate per-token labels for verifier supervision via empirical value estimation, which allows TVM to predict the probability of an intermediate reasoning path reaching the correct final answer.

  • We show that TVM achieves performance improvements on GSM8K and MATH benchmarks across LLMs under 10B parameters, compared to ORMs and PRMs.

2 Background

This section reviews existing verifier frameworks for enhancing the mathematical reasoning capabilities of LLMs. Sec. 2.1 outlines the preliminary setups for training verifiers in mathematical reasoning verification. The subsequent sections revisit two existing types of supervision for verifier training: outcome supervision (Sec. 2.2) and process supervision (Sec. 2.3).

2.1 Training Verifiers for Mathematical Reasoning

The mathematical reasoning capabilities of LLMs can be enhanced by employing reward models as external verifiers to assess the generated reasoning paths (Cobbe et al., 2021; Uesato et al., 2022; Lightman et al., 2023; Yu et al., 2024; Wang et al., 2024). The verifier is generally trained via supervised learning on a dataset obtained by sampling multiple reasoning paths per training problem using an LLM. Specifically, given a training problem $q_{tr}$ as an input, the LLM first generates $N_{tr}$ reasoning paths, where the $n$-th reasoning path is comprised of reasoning steps $\{s_{n,j}\}_{j=1}^{S_n}$ and a final answer $a_n$ for $n = 1, \cdots, N_{tr}$. In token-level notation, the $n$-th reasoning path can also be expressed as a sequence of tokens $\{t_{n,k}\}_{k=1}^{T_n}$. Hereafter, $\{s_{n,\cdot}\}_1^j$ and $\{t_{n,\cdot}\}_1^k$ denote $\{s_{n,1}, \cdots, s_{n,j}\}$ and $\{t_{n,1}, \cdots, t_{n,k}\}$, respectively. The final answer $a_n$ is correct if it is equal to the ground truth answer $\hat{a}$, and incorrect otherwise. Based on the correctness of the sampled reasoning paths, supervision is traditionally given in two ways: (i) outcome supervision (Cobbe et al., 2021; Uesato et al., 2022; Yu et al., 2024) and (ii) process supervision (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2024).

2.2 Outcome Supervision

Prior works (Cobbe et al., 2021; Uesato et al., 2022; Yu et al., 2024) employ outcome supervision to label an entire reasoning path as correct if its final answer is correct (Fig. 1). The outcome reward function $r_o(\cdot)$ is the correctness of the final answer:

$$r_o(a_n) = \begin{cases} 1 & \text{if } a_n = \hat{a} \\ 0 & \text{if } a_n \neq \hat{a} \end{cases} \tag{3}$$

for $n = 1, \cdots, N_{tr}$. An outcome-supervised reward model (ORM) $f_{ORM}$ is trained with every token in a reasoning path labeled as the outcome reward (Eq. 3). The ORM loss $\mathcal{L}_{ORM}$ is defined as

$$\mathcal{L}_{ORM} = \sum_{n,k}^{N_{tr},T_n} \ell\left(r_o(a_n), f_{ORM}(q_{tr}, \{t_{n,\cdot}\}_1^k)\right). \tag{4}$$

The mean squared error is typically used as a loss function $\ell(\cdot)$ in Eq. 4. Cobbe et al. (2021) demonstrated that a token-level verifier trained to judge the correctness after every token performs better than a solution-level verifier trained to determine the correctness only after the final token.
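As a minimal sketch of the labeling scheme behind Eq. 4, the snippet below assigns the outcome reward of Eq. 3 to every token prefix of a sampled path and evaluates a mean squared error against verifier scores. The function names, the toy tokens, and the score values are illustrative assumptions rather than part of the original work, and answers are compared as plain strings for simplicity. The loss is averaged over tokens here, which differs from the sum in Eq. 4 only by a constant factor.

```python
from typing import List


def outcome_reward(final_answer: str, ground_truth: str) -> float:
    """Eq. 3: 1 if the final answer matches the ground truth, else 0."""
    return 1.0 if final_answer == ground_truth else 0.0


def orm_token_labels(tokens: List[str], final_answer: str, ground_truth: str) -> List[float]:
    """Outcome supervision: every token prefix of a path gets the same label."""
    return [outcome_reward(final_answer, ground_truth)] * len(tokens)


def mse(labels: List[float], predictions: List[float]) -> float:
    """Mean squared error between labels and (hypothetical) verifier scores."""
    assert len(labels) == len(predictions)
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)


# Toy example with made-up tokens and scores, for illustration only.
tokens = ["120", "-", "95", "=", "25"]
labels = orm_token_labels(tokens, final_answer="25", ground_truth="25")
scores = [0.5, 0.5, 0.5, 0.5, 1.0]
loss = mse(labels, scores)
```

Note that the label list is constant along the path: under outcome supervision, nothing distinguishes an early token from a late one.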

Interestingly, Yu et al. (2024) showed that ORMs can be alternatively described as modeling the cumulative reward for each token, where all intermediate rewards are zero (i.e., $r(t_{n,k}) = 0$ for every $n$ and $k$) and the discount factor $\gamma$ is set to 1. The cumulative reward following an intermediate token $t_{n,k}$, $R(t_{n,k})$, is calculated as

$$\begin{aligned} R(t_{n,k}) &= r(t_{n,k+1}) + \cdots + r(t_{n,T_n}) + r_o(a_n) \\ &= \begin{cases} 0 + \cdots + 0 + 1 = 1 & \text{if } a_n = \hat{a} \\ 0 + \cdots + 0 + 0 = 0 & \text{if } a_n \neq \hat{a}, \end{cases} \end{aligned} \tag{5}$$

which is equivalent to $r_o(a_n)$ in Eq. 3. This entails that an intermediate reasoning path is labeled as correct if the final answer is correct, and vice versa. In this sense, ORMs can indirectly and implicitly learn the potential correctness of an intermediate reasoning path (Yu et al., 2024).
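The arithmetic of Eq. 5 can be checked with a few lines of code. The sketch below computes the return that follows each token of a single path; the helper name `returns_after_each_token` and its argument layout are assumptions made for illustration. With zero intermediate rewards and a discount factor of 1, every entry collapses to the outcome reward.

```python
from typing import List


def returns_after_each_token(intermediate_rewards: List[float],
                             outcome_reward: float,
                             gamma: float = 1.0) -> List[float]:
    """Return R(t_k) for k = 1..T: the discounted sum of all rewards that
    follow token k, ending with the outcome reward of the final answer."""
    # Full reward sequence: r(t_1), ..., r(t_T), r_o(a).
    rewards = list(intermediate_rewards) + [outcome_reward]
    returns = []
    for k in range(1, len(intermediate_rewards) + 1):
        future = rewards[k:]  # rewards[k] is r(t_{k+1}) for 1-indexed k
        returns.append(sum(gamma ** i * r for i, r in enumerate(future)))
    return returns


# Four tokens, no intermediate reward, correct vs. incorrect final answer.
correct_path = returns_after_each_token([0.0] * 4, outcome_reward=1.0)
wrong_path = returns_after_each_token([0.0] * 4, outcome_reward=0.0)
```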

2.3 Process Supervision

Process supervision enables a more accurate assessment of a reasoning path by explicitly training a verifier on the correctness of each step with step-level supervision (Lightman et al., 2023). The correctness of each reasoning step is labeled either via human annotation (Uesato et al., 2022; Lightman et al., 2023) or via automation (Wang et al., 2024). Since acquiring human annotations is labor-intensive and costly, we mainly focus on process supervision without human annotations.

Following Wang et al. (2024), an intermediate reasoning step $s_{n,j}$ can be labeled as correct if at least one of the reasoning paths starting from $s_{n,j}$ reaches the correct final answer $\hat{a}$ (Fig. 1). In practice, $s_{n,j}$ is annotated by sampling a fixed number of reasoning paths conditioned on a sequence of intermediate reasoning steps $\{s_{n,\cdot}\}_1^j = \{s_{n,1}, \cdots, s_{n,j}\}$. If at least one of the sampled reasoning paths reaches the correct final answer, $s_{n,j}$ is labeled as correct with the process reward $r_p(s_{n,j}) = 1$. Otherwise, $s_{n,j}$ is labeled as incorrect and $r_p(s_{n,j}) = 0$. Using the per-step labels obtained through automation, a Process-supervised Reward Model (PRM) is trained to provide a step-level assessment by minimizing the following loss:

$$\mathcal{L}_{PRM} = \sum_{n,j}^{N_{tr},S_n} \ell\left(r_p(s_{n,j}), f_{PRM}(q_{tr}, \{s_{n,\cdot}\}_1^j)\right), \tag{6}$$

where $\ell$ denotes the binary cross entropy loss.
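The automated labeling rule described above can be sketched as follows. Here `sample_final_answers` is a hypothetical callable standing in for LLM rollouts conditioned on a step prefix, and `fake_rollouts`, the step strings, and the probabilities are made up for illustration; none of these names come from the original work. The loss is averaged over steps, which matches Eq. 6 up to a constant factor.

```python
import math
from typing import Callable, List


def label_steps(steps: List[str],
                ground_truth: str,
                sample_final_answers: Callable[[List[str]], List[str]]) -> List[int]:
    """Assign r_p(s_j) = 1 if any completion sampled from steps[:j] reaches
    the ground-truth answer, and 0 otherwise."""
    labels = []
    for j in range(1, len(steps) + 1):
        answers = sample_final_answers(steps[:j])  # hypothetical LLM rollouts
        labels.append(1 if any(a == ground_truth for a in answers) else 0)
    return labels


def binary_cross_entropy(labels: List[int], probs: List[float], eps: float = 1e-12) -> float:
    """Binary cross entropy averaged over steps; probs are clipped for stability."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def fake_rollouts(prefix: List[str]) -> List[str]:
    # Stand-in sampler: once the made-up faulty step appears, no rollout recovers.
    if "step 3 (faulty)" in prefix:
        return ["215", "215"]
    return ["25", "215"]


steps = ["step 1", "step 2", "step 3 (faulty)"]
step_labels = label_steps(steps, "25", fake_rollouts)
prm_loss = binary_cross_entropy(step_labels, [0.9, 0.8, 0.1])
```

All tokens inside one step share that step's label, which is the uniformity that Fig. 1 contrasts with per-token labels.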

Figure 2: Illustration of reasoning paths ranked highest by ORM/PRM/TVM (ours) among 256 candidate reasoning paths for a test problem from GSM8K. We use Mistral-7B-MetaMath and illustrate practical failure cases of ORM and PRM compared to ours. (a) The reasoning step 4 begins with “So the total difference” but ends with a summation. Hence, as soon as step 4 is finished, the TVM score decreases dramatically. (b) The reasoning step 5 starts with “Thus, the total difference” but ends in subtracting a large number (“120”) from a small number (“95”), which is the exact opposite of the definition of difference. Thus, the TVM score declines. (c) The reasoning step 3 opens with “The total difference”, ending in subtracting the small number (“95”) from the large number (“120”), which is finally correct. As a result, immediately after the token “=” emerges, the TVM score rises while the PRM score remains intact due to its step-wise assessment. Therefore, TVM can filter out (a) and (b) while selecting (c) with the highest score, enabling token-level discrimination within a reasoning path.

3 Method

In this section, we introduce our proposed method, the Token-supervised Value Model (TVM), a novel verifier trained with a token-level supervision strategy to directly estimate the expected cumulative reward (i.e., value) for each token along a reasoning path. We also describe how to empirically estimate per-token value labels from $N_{tr}$ generated reasoning paths for token-level supervision.

3.1 Motivation

As mentioned in Sec. 2, both outcome supervision (ORMs) and process supervision (PRMs) utilize homogeneous labels determined by the correctness of either the entire reasoning path or step (Fig. 1). Consequently, we hypothesize that they are likely to be neither explicitly nor directly trained to (i) learn the discriminative details of tokens within a reasoning path or (ii) evaluate whether an intermediate reasoning path is on a promising track toward the correct final answer.

We elucidate our assertion through cases observed in practice, as illustrated in Fig. 2. In the reasoning path ranked highest by ORM ((a) in Fig. 2), reasoning step 4 begins with “So the total difference” but ends with a summation, where a logical error occurs. However, ORM is unable to catch the error and maintains a score over 0.4, the highest score among 256 candidate reasoning paths. In the reasoning path ranked highest by PRM ((b) in Fig. 2), reasoning step 5 starts with “Thus, the total difference” but ends in subtracting a larger number (“120”) from a smaller number (“95”), which is the exact opposite of the definition of difference. In the reasoning path, “120” appears in reasoning step 4 after “95” appears in reasoning step 3. Since PRMs focus on assessing the correctness of the current reasoning step, the sequential appearance of numbers and the resulting subtraction are considered correct by the PRM even though the reasoning path is unlikely to lead to a correct answer. The observed failures inspire the proposal of a token-level value supervision strategy for training verifiers.

Figure 3: Illustration of empirical value estimation using Eq. 11. For a single training problem, $N_{tr}$ reasoning-answer pairs are sampled using an LLM. Here, let $N_{tr} = 3$ for convenience. (1) All three sentences begin with the same tokens $\{t_{1,k}\}_{k=1}^{a-1}$, and only one of them reaches the correct final answer (357). Accordingly, every token of $\{t_{1,k}\}_{k=1}^{a-1}$ is labeled as $1/3 = 0.33$. (2) At the $a$-th position, however, only one sentence starts with $t_{1,a}$, which reaches an incorrect final answer (656). Thus, all tokens after $t_{1,a}$ are labeled as $0/1 = 0$. (3) The remaining two sentences continue with the same tokens $\{t_{2,k}\}_{k=a}^{b-1}$, only one of which is correct. Hence, every token of $\{t_{2,k}\}_{k=a}^{b-1}$ is labeled as $1/2 = 0.5$. (4) Finally, at the $b$-th position, which one is correct is pre-determined. As a result, all tokens after $t_{2,b}$ are labeled as $0/1 = 0$, whereas all tokens after $t_{3,b}$ as $1/1 = 1$.

3.2 Token-level Value Supervision

To overcome the issues above, we propose a new verifier based on token-level supervision with distinctive token-wise labels according to the potential of tokens in deducing the correct final answer. A natural choice to appropriately reflect the token-wise potential is prospective value modeling (Sutton and Barto, 2018), which is fine-grained and future-oriented compared to retrospective cumulative reward modeling (Eq. 5). Accordingly, we construct a supervision scheme for token $t_{n,k}$ in a reasoning path $\{t_{n,\cdot}\}_1^k = \{t_{n,1}, \cdots, t_{n,k}\}$ with the expected cumulative reward (i.e., value):

$$V(t_{n,k}) = \mathbb{E}\Big[\sum_{l=1}^{\infty} \gamma^{l-1} r(t_{n,k+l}) \,\Big|\, q_{tr}, \{t_{n,\cdot}\}_1^k\Big], \tag{7}$$

where $r(\cdot)$ and $\gamma$ denote a reward function and the discount factor, respectively.

The primary challenge in training value models as verifiers is estimating the value labels of a generated reasoning path (Yu et al., 2024). However, under the specific outcome reward formulation of Eq. 3 and no intermediate rewards, the expected cumulative reward (Eq. 7) reduces to the probability of reaching the correct final answer conditioned on the question $q_{tr}$ and the intermediate reasoning path $\{t_{n,\cdot}\}_1^k$. This probability can be straightforwardly computed from generated reasoning paths and indicates whether an intermediate reasoning path (i.e., $\{t_{n,\cdot}\}_1^k$) is on a promising track toward the correct final answer.

Proposition 3.1.

Let the reward function $r(t_{n,k})$ be defined as Eq. 3, which includes only the outcome reward with the discount factor $\gamma = 1$ and no intermediate reward (i.e., $r(t_{n,k}) = 0$ except the final answer). Then, the expected cumulative reward (Eq. 7) is equivalent to the probability of reaching the correct final answer conditioned on $q_{tr}$ and $\{t_{n,\cdot}\}_1^k = \{t_{n,1}, \cdots, t_{n,k}\}$:

$$\mathbb{E}\Big[\sum_{l=1}^{\infty} \gamma^{l-1} r(t_{n,k+l}) \,\Big|\, q_{tr}, \{t_{n,\cdot}\}_1^k\Big] = \mathbb{P}\big(\textit{the final answer will be } \hat{a} \,\big|\, q_{tr}, \{t_{n,\cdot}\}_1^k\big). \tag{8}$$
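An informal sketch of why the identity holds under the stated assumptions ($\gamma = 1$, zero intermediate rewards, and the binary outcome reward of Eq. 3) is given below. It is not necessarily the authors' own proof, and it additionally assumes that every sampled path terminates with a final answer, written $a$ here, with the conditioning abbreviated as $c = (q_{tr}, \{t_{n,\cdot}\}_1^k)$:

```latex
\begin{align*}
\mathbb{E}\Big[\sum_{l=1}^{\infty} \gamma^{l-1} r(t_{n,k+l}) \,\Big|\, c\Big]
  &= \mathbb{E}\big[\, r_o(a) \,\big|\, c \,\big]
     && \text{(only the outcome reward is nonzero, } \gamma = 1\text{)} \\
  &= 1 \cdot \mathbb{P}(a = \hat{a} \mid c) + 0 \cdot \mathbb{P}(a \neq \hat{a} \mid c)
     && \text{(} r_o \in \{0, 1\} \text{ by Eq. 3)} \\
  &= \mathbb{P}(\text{the final answer will be } \hat{a} \mid c).
\end{align*}
```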

The right-hand side of Eq. 8 can be empirically estimated from generated reasoning paths by calculating the proportion of correct reasoning paths starting from $\{t_{n,\cdot}\}_1^k$ among total reasoning paths starting from $\{t_{n,\cdot}\}_1^k$ (see Sec. 3.3).
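The counting procedure just described can be sketched as follows, under the assumption that each sampled path is available as a list of tokens with a correctness flag. The function name, the toy tokens, and the choice to count identical sampled paths separately are illustrative assumptions. The toy paths are arranged to mirror the 1/3, 1/2, 0, and 1 labels of Fig. 3.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def empirical_value_labels(paths: List[List[str]], correct: List[bool]) -> List[List[float]]:
    """For each path n and position k, return the fraction of sampled paths
    sharing the prefix t_{n,1..k} that reach the correct final answer."""
    total: Dict[Tuple[str, ...], int] = defaultdict(int)
    hits: Dict[Tuple[str, ...], int] = defaultdict(int)
    # First pass: count, for every prefix, how many paths pass through it
    # and how many of those end in the correct answer.
    for tokens, is_correct in zip(paths, correct):
        for k in range(1, len(tokens) + 1):
            prefix = tuple(tokens[:k])
            total[prefix] += 1
            hits[prefix] += int(is_correct)
    # Second pass: read off the per-token label for each path.
    return [
        [hits[tuple(tokens[:k])] / total[tuple(tokens[:k])]
         for k in range(1, len(tokens) + 1)]
        for tokens in paths
    ]


# Three toy paths: all share "A B"; the last two also share "C D".
paths = [
    ["A", "B", "X"],
    ["A", "B", "C", "D", "Y"],
    ["A", "B", "C", "D", "Z"],
]
correct = [False, False, True]
value_labels = empirical_value_labels(paths, correct)
```

Materializing every prefix as a tuple is quadratic in path length; a trie over tokens would give the same counts more economically, but the dictionary version keeps the sketch short.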

Following Proposition 3.1, we train the Token-supervised Value Model (TVM) by supervising each token with a value label empirically estimated as the probability of reaching the correct final answer given until that token. The objective of TVM is

TVM=n,k(n,k,fTVM(qtr,{tn,}1k))subscript𝑇𝑉𝑀subscript𝑛𝑘subscript𝑛𝑘subscript𝑓𝑇𝑉𝑀subscript𝑞𝑡𝑟superscriptsubscriptsubscript𝑡𝑛1𝑘\displaystyle\mathcal{L}_{TVM}=\sum_{n,k}\leavevmode\resizebox{238.49231pt}{}{% $\ell\left(\mathbb{P}_{n,k},f_{TVM}(q_{tr},\{t_{n,\cdot}\}_{1}^{k})\right)$}caligraphic_L start_POSTSUBSCRIPT italic_T italic_V italic_M end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT roman_ℓ ( blackboard_P start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT italic_T italic_V italic_M end_POSTSUBSCRIPT ( italic_q start_POSTSUBSCRIPT italic_t italic_r end_POSTSUBSCRIPT , { italic_t start_POSTSUBSCRIPT italic_n , ⋅ end_POSTSUBSCRIPT } start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) ) (9)

for $n = 1, \cdots, N_{tr}$ and $k = 1, \cdots, T_{n}$, where $\mathbb{P}_{n,k}$ indicates the right-hand side of Eq. 8 and the loss function $\ell$ is the mean squared error.
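The objective in Eq. 9 can be sketched in a few lines, assuming the verifier emits one scalar per token. The function name `tvm_mse_loss`, the list-of-lists layout, and the choice to average (rather than sum) the squared errors over tokens are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def tvm_mse_loss(
    predictions: Sequence[Sequence[float]],
    value_labels: Sequence[Sequence[float]],
) -> float:
    """Average squared error between per-token predictions and value labels.

    predictions[n][k] plays the role of f_TVM(q_tr, {t_n}_1^k) and
    value_labels[n][k] the role of P_{n,k}; both names are hypothetical.
    """
    squared_errors = []
    for pred_path, label_path in zip(predictions, value_labels):
        if len(pred_path) != len(label_path):
            raise ValueError("each path needs one prediction per labeled token")
        for pred, label in zip(pred_path, label_path):
            squared_errors.append((pred - label) ** 2)
    if not squared_errors:
        raise ValueError("no tokens to supervise")
    # Averaging over tokens is an assumption; Eq. 9 is written as a sum.
    return sum(squared_errors) / len(squared_errors)
```

Every token contributes its own term, so a path whose early tokens already carry low value labels is penalized at those tokens and not only at the final answer.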

Table 1: Accuracy (%) of Mistral-7B, Mistral-7B-MetaMath, Llama3-8B, and Llama3-8B-MetaMath on the GSM8K benchmark under best-of-N search ($N=256$) and verifier-guided step-level beam search ($K=40$, $b=10$). "BS" stands for beam search. The best result within each search strategy is in bold.

| Search Strategy | Method | Mistral-7B | Mistral-7B-MetaMath | Llama3-8B | Llama3-8B-MetaMath |
|---|---|---|---|---|---|
| Self-Consistency | - | 79.23 | 83.90 | 80.97 | 85.44 |
| Best-of-N Search | ORM | 85.52 | 86.20 | 87.79 | 89.77 |
| Best-of-N Search | Math-Shepherd | - | 87.10 | - | 89.23 |
| Best-of-N Search | TVM (Ours) | **88.17** | **88.86** | **88.70** | **90.37** |
| Verifier-guided Step-level BS | OVM | 86.73 | 87.79 | 88.10 | 89.69 |
| Verifier-guided Step-level BS | TVM (Ours) | **87.72** | **88.78** | **89.01** | **90.30** |

Compared to existing verifiers, the resolution of assessment provided by the proposed token-level value supervision adequately matches the token-wise granularity of LLMs, thereby being able to capture the discriminative details of tokens within a reasoning path (Fig. 2). In contrast to ORMs, TVM is trained to directly estimate the probability of an intermediate reasoning path being on a promising track toward the correct final answer (Proposition 3.1). As a result, TVM can choose the reasoning path most likely to reach the correct final answer among candidate reasoning paths, whether they are partial or complete.

During inference, TVM can be employed either to search for the reasoning path most likely to be correct among complete reasoning paths generated by an LLM (Lightman et al., 2023) or to identify promising candidates likely to reach the correct final answer among partially generated reasoning paths. For the latter, we conduct a detailed study in Sec. 4.3 in the setting of verifier-guided step-level beam search (Yu et al., 2024).

3.3 Empirical Value Estimation

As discussed in Sec. 3.2, Proposition 3.1 alleviates the practical challenges of value estimation (Eq. 7) by formulating the value as the ratio of correct reasoning paths to total reasoning paths. Following Eq. 7 and Eq. 8, the estimated value for each token $t_{n,k}$ can be represented as

$$\begin{aligned}
V(t_{n,k}) &= \mathbb{E}\Big[\sum_{l=1}^{\infty} \gamma^{l-1} r(t_{n,k+l}) \,\Big|\, q_{tr}, \{t_{n,\cdot}\}_{1}^{k}\Big] \\
&= \mathbb{P}(\textit{the final answer will be } \hat{a} \mid q_{tr}, \{t_{n,\cdot}\}_{1}^{k}) \\
&= \frac{\mathbb{P}(\{t_{n,\cdot}\}_{1}^{k} \cap \textit{the final answer will be } \hat{a} \mid q_{tr})}{\mathbb{P}(\{t_{n,\cdot}\}_{1}^{k} \mid q_{tr})}.
\end{aligned} \tag{10}$$

In practice, the numerator and the denominator of Eq. 10 can each be estimated empirically from the $N_{tr}$ generated reasoning paths: the numerator as the fraction of the $N_{tr}$ paths that start from $\{t_{n,\cdot}\}_{1}^{k}$ and are correct, and the denominator as the fraction of the $N_{tr}$ paths that start from $\{t_{n,\cdot}\}_{1}^{k}$. The value label of each token $V(t_{n,k})$ is assigned as

$$\frac{\sum_{n'=1}^{N_{tr}} \mathbb{I}\big(\{t_{n',\cdot}\}_{1}^{k} = \{t_{n,\cdot}\}_{1}^{k} \cap a_{n'} = \hat{a}\big) / N_{tr}}{\sum_{n'=1}^{N_{tr}} \mathbb{I}\big(\{t_{n',\cdot}\}_{1}^{k} = \{t_{n,\cdot}\}_{1}^{k}\big) / N_{tr}} \tag{11}$$

where $\mathbb{I}(\cdot)$ is the indicator function and $N_{tr}$ cancels out. The overall procedure of empirical value estimation is described in Figure 3. The overall algorithm is deferred to Appendix C.
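The counting in Eq. 11 can be illustrated with a short sketch. It assumes each sampled reasoning path is available as a sequence of token IDs together with a flag saying whether its final answer equals $\hat{a}$; the function name `estimate_token_values` and the dictionary-of-prefixes layout are assumptions for illustration, not the algorithm given in Appendix C.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def estimate_token_values(
    paths: Sequence[Sequence[int]],
    is_correct: Sequence[bool],
) -> List[List[float]]:
    """Label each token with (#correct paths sharing its prefix) / (#paths sharing its prefix)."""
    total: Dict[Tuple[int, ...], int] = defaultdict(int)
    correct: Dict[Tuple[int, ...], int] = defaultdict(int)

    # First pass: count, for every prefix {t}_1^k, how many sampled paths
    # start with it and how many of those reach the correct final answer.
    for tokens, ok in zip(paths, is_correct):
        for k in range(1, len(tokens) + 1):
            prefix = tuple(tokens[:k])
            total[prefix] += 1
            correct[prefix] += int(ok)

    # Second pass: the label of token k in path n is the ratio of the two
    # counts for its prefix (N_tr cancels, as noted for Eq. 11).
    labels: List[List[float]] = []
    for tokens in paths:
        path_labels = []
        for k in range(1, len(tokens) + 1):
            prefix = tuple(tokens[:k])
            path_labels.append(correct[prefix] / total[prefix])
        labels.append(path_labels)
    return labels
```

For three toy paths `[1, 2, 3]` (correct), `[1, 2, 4]` (wrong), and `[1, 5, 6]` (wrong), the shared first token is labeled 1/3, the prefix `[1, 2]` is labeled 1/2, and the last token of each path is labeled 1 or 0 according to its own outcome. Storing every prefix as a tuple is simple but memory-hungry; a trie over token IDs would be a natural alternative for long paths.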

4 Experiments

To demonstrate the efficacy of TVM in improving the mathematical reasoning capabilities of LLMs, we conduct extensive experiments on the GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) benchmarks. Our experiments are based on the following large language models: 1) Mistral-7B (Jiang et al., 2023) and Llama3-8B (AI@Meta, 2024); 2) those fine-tuned on MetaMath (Yu et al., 2023). We use two existing verifier utilization strategies: (i) best-of-N search and (ii) verifier-guided step-level beam search.

Table 2: Accuracy (%) of Mistral-7B-MetaMath and Llama3-8B-MetaMath on the MATH benchmark under best-of-N search ($N=256$) and verifier-guided step-level beam search ($K=40$, $b=10$). "BS" stands for beam search. The best result within each search strategy is in bold.

| Search Strategy | Method | Mistral-7B-MetaMath | Llama3-8B-MetaMath |
|---|---|---|---|
| Self-Consistency | - | 35.10 | 42.40 |
| Best-of-N Search | ORM | 36.40 | **43.60** |
| Best-of-N Search | Math-Shepherd | 37.30 | 43.40 |
| Best-of-N Search | TVM (Ours) | **37.40** | 43.40 |
| Verifier-guided Step-level BS | OVM | 36.60 | 42.40 |
| Verifier-guided Step-level BS | TVM (Ours) | **39.20** | **45.20** |

Best-of-N search.

The best-of-N search strategy introduced in Lightman et al. (2023) is a conventional experimental setting to evaluate the performance of a verifier. For every test problem, an LLM first generates $N$ complete reasoning paths. The reasoning path ranked highest by the verifier is chosen as the final candidate. For all experiments, we set $N=256$ following Wang et al. (2024) unless specified otherwise.
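Best-of-N selection reduces to an argmax over verifier scores. The sketch below assumes the $N$ complete reasoning paths have already been sampled and that a verifier is exposed as a callable returning one scalar per path; `best_of_n` and `score` are hypothetical names, and how a verifier turns a complete path into a single score is left abstract here.

```python
from typing import Callable, Sequence


def best_of_n(paths: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the complete reasoning path that the verifier ranks highest."""
    if not paths:
        raise ValueError("need at least one candidate reasoning path")
    # max() keeps the first path in case of ties, which is an arbitrary choice.
    return max(paths, key=score)
```

Because the verifier only ranks finished paths, all $N$ generations must complete before any selection happens, which is why the cost of this strategy grows directly with $N$.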

Verifier-guided step-level beam search (BS).

To prevent errors in an intermediate reasoning step from propagating to subsequent steps, Yu et al. (2024) proposed guided decoding during intermediate reasoning steps via a verifier, a search strategy we call verifier-guided step-level beam search. For a test problem, after an LLM partially generates $K$ reasoning paths, each containing only the first intermediate reasoning step, the verifier-guided step-level beam search strategy alternates between the following two steps until all $K$ partially generated reasoning paths are complete: (i) a verifier selects the top-$b$ ($< K$) ranked partially generated reasoning paths, and (ii) the LLM generates $K/b$ subsequent intermediate reasoning steps for each path chosen by the verifier. Among the $K$ complete reasoning paths, the one scored highest by the verifier is selected. Thanks to verifier intervention in generating each intermediate reasoning step, the performance of verifier-guided step-level beam search can be similar to that of best-of-N search in Tables 1 and 2 even though $K$ is much smaller than $N$.
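The alternation between selection and expansion can be sketched as follows. The callables `generate_step`, `is_complete`, and `score` are hypothetical stand-ins for the LLM, a completion check, and the verifier; carrying already-complete paths forward unchanged and capping the loop with `max_steps` are assumptions of this sketch, not details stated in the paper.

```python
from typing import Callable, List


def step_level_beam_search(
    question: str,
    generate_step: Callable[[str, str], str],  # (question, partial path) -> next step
    is_complete: Callable[[str], bool],
    score: Callable[[str], float],             # verifier score of a (partial) path
    K: int,
    b: int,
    max_steps: int = 32,                       # safety cap, an assumption of this sketch
) -> str:
    """Keep the top-b paths by verifier score, expand each K/b times, repeat."""
    if not (0 < b < K) or K % b != 0:
        raise ValueError("require 0 < b < K and K divisible by b")

    # Start with K partial paths that contain only the first reasoning step.
    paths: List[str] = [generate_step(question, "") for _ in range(K)]

    for _ in range(max_steps):
        if all(is_complete(p) for p in paths):
            break
        # (i) The verifier selects the top-b ranked paths.
        top = sorted(paths, key=score, reverse=True)[:b]
        # (ii) Each selected path is extended by K/b sampled next steps.
        paths = []
        for p in top:
            for _ in range(K // b):
                # Complete paths are carried forward unchanged (assumption).
                paths.append(p if is_complete(p) else p + generate_step(question, p))

    # Among the final K paths, return the one the verifier scores highest.
    return max(paths, key=score)
```

The verifier is queried on partial paths at every step, so its ability to judge whether an unfinished path is still on track matters more here than under best-of-N search.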

4.1 Grade School Mathematics (GSM8K)

Setups.

An LLM is fine-tuned on the training dataset of GSM8K for two epochs with a batch size of 128 and a learning rate of 1e-5. Then, we sample $N_{tr}=100$ reasoning paths per training problem with a temperature of 0.7 from the fine-tuned LLM and label each token in a reasoning path according to Eq. 11. Finally, TVM initialized from either the same LLM or the fine-tuned LLM is trained on this dataset for one epoch with a batch size of 512 and a learning rate of either 2e-6 or 1e-5. More experimental details are deferred to Appendix E.

Results.

In the case of best-of-N search, we compare TVM with ORM (Cobbe et al., 2021) and Math-Shepherd (Wang et al., 2024), a PRM without human annotations, as explained in Sec. 2. As all experimental results in Wang et al. (2024) are only based on LLMs fine-tuned on MetaMath, we also evaluate Math-Shepherd only for Mistral-7B-MetaMath and Llama3-8B-MetaMath. Despite using large $N$, Table 1 shows that TVM surpasses ORM and Math-Shepherd by 0.6 to 2.6%p and self-consistency by 4.9 to 8.9%p across the board.

Under the verifier-guided step-level beam search strategy, we primarily compare TVM against OVM (Yu et al., 2024) because Yu et al. (2024) confirmed that step-level beam search guided by a token-level verifier performs significantly better than that guided by a sentence-level value model (Feng et al., 2024) on GSM8K. Further comparison to Feng et al. (2024) is presented in Appendix B. In Table 1, TVM also consistently outperforms OVM by 0.6 to 1.0%p.

One might wonder why the accuracy of OVM ($K=40$) for Mistral-7B is much higher than that of OVM ($K=100$) reported in Yu et al. (2024). This discrepancy arises because, in our experiments, some tokens (e.g., <<, >>) are correctly converted to token IDs by the Mistral-7B tokenizer.

4.2 Advanced Mathematics (MATH)

Setups.

We employ LLMs fine-tuned on MetaMath (Mistral-7B-MetaMath and Llama3-8B-MetaMath) without any further fine-tuning on the training dataset of MATH in order to sample reasoning paths in a newline-delimited format. Following Lightman et al. (2023); Wang et al. (2024), we also use 500 test MATH problems for evaluation, which is the same test dataset as that of Lightman et al. (2023), incorporating the remaining 4500 test problems into the training dataset of MATH. For each training problem, an LLM fine-tuned on MetaMath generates $N_{tr}=25$ reasoning paths with a temperature of 0.7, with each token labeled according to Eq. 11. Then, we train TVM initialized from the same fine-tuned LLM for one epoch on this dataset with a batch size of 512 and a learning rate of 2e-6. Further experimental details are given in Appendix E.

Results.

Similar to Sec. 4.1, Table 2 compares (i) TVM's best-of-N search performance with ORM and Math-Shepherd and (ii) TVM-guided step-level beam search to ORM-guided step-level beam search (i.e., OVM). In the former case, the performance of TVM is slightly superior or almost comparable to that of ORM and Math-Shepherd. This might be due to the fact that an LLM is extremely prone to producing errors in the process of generating $N$ reasoning paths for difficult MATH problems. However, when capitalizing on the verifier-guided step-level beam search strategy, not only does TVM outperform OVM by 2.6 to 2.8%p, but TVM-guided step-level beam search also exhibits much better performance than best-of-N search by any verifier even though $K=40$ is much smaller than $N=256$.

Figure 4: Illustration of OVM's and TVM's predictions under verifier-guided step-level beam search. In the third reasoning step, OVM assigns the highest score to a wrong intermediate reasoning path and a low score to a correct one, whereas TVM assigns the highest score to a correct intermediate path and a low score to a wrong one.

4.3 Analyses on Verifier-guided Step-level BS

Case study.

To validate the superiority of TVM over OVM in predicting whether an intermediate reasoning path is on a promising track toward the correct answer, we compare OVM's and TVM's predictions for a test problem in the GSM8K benchmark. As illustrated in Fig. 4, in the third reasoning step, OVM assigns the highest score to a wrong intermediate reasoning path and a low score to a correct one. This occurs because OVM is inherently identical to an ORM, which is trained to learn the potential correctness of an intermediate reasoning path only implicitly and indirectly. In contrast, TVM assigns the highest score to a correct intermediate path and a low score to a wrong one. As TVM is trained to directly and explicitly estimate the probability of reaching the correct final answer for each token along a reasoning path, TVM can effectively predict at inference whether an intermediate reasoning path is on a promising track toward the correct answer.

Table 3: Mean and standard deviation of TVM's accuracy (%) for Mistral-7B and Mistral-7B-MetaMath on the GSM8K benchmark according to varying sizes of $K$ and $b$ when employing verifier-guided step-level beam search. Three random trials are carried out.

| $K$, $b$ | Mistral-7B | Mistral-7B-MetaMath |
|---|---|---|
| 40, 10 | 87.69 ± 0.22 | 88.70 ± 0.16 |
| 80, 20 | 87.89 ± 0.35 | 88.75 ± 0.20 |
| 100, 25 | 87.92 ± 0.13 | 88.80 ± 0.07 |

Beam size study.

To investigate whether the accuracy of TVM improves with larger values of $K$ and $b$ in verifier-guided step-level beam search, we conduct experiments using TVM with varying sizes of $K$ and $b$ for Mistral-7B and Mistral-7B-MetaMath on the GSM8K benchmark. Table 3 shows that the mean accuracy of TVM on GSM8K increases only slightly as both $K$ and $b$ grow (by 0.23%p for Mistral-7B and 0.10%p for Mistral-7B-MetaMath from $K=40$ to $K=100$), suggesting that the gains are close to saturating.

5 Related Work

Best-of-N search.

For $N$ complete reasoning paths, a verifier (Cobbe et al., 2021; Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2024) ranks the paths and picks the highest-scored one. Although best-of-N search using a verifier shows much superior performance compared to verifier-free strategies such as self-consistency (Wang et al., 2023), it still shares the same drawback as self-consistency in that a large number of generated reasoning paths are required to solve challenging reasoning problems.

Step-level beam search.

In contrast to the selection among complete reasoning paths, several studies have focused on step-level beam searches for partial reasoning paths. Step-level beam search can be divided into (i) verifier-free step-level beam search and (ii) verifier-guided step-level beam search.

Under the verifier-free step-level beam search strategy, Yao et al. (2023) and Hao et al. (2023) allow value estimation by prompting LLMs to sample or simulate long-term outcomes during inference. Alternatively, Feng et al. (2024) and Yu et al. (2024) introduce step-level beam search guided by a sentence-level value model and an outcome-supervised reward model, respectively. Although Feng et al. (2024) and Yu et al. (2024) show that verifier-guided step-level beam search achieves significant accuracy improvements over the verifier-free one, each approach has its own weakness. As delineated in Yu et al. (2024), a sentence-level value model is unsuitable for step-level beam search. In addition, Yu et al. (2024) use an outcome-supervised reward model, not a value model. As a result, there is still room for improvement in the performance of verifier-guided step-level beam search.

6 Conclusion

In this paper, we introduce a novel verifier termed the Token-supervised Value Model (TVM). This model uses per-token value labels to guide LLMs toward promising mathematical reasoning paths. Unlike traditional verifiers, which lack token-level labels and thus cannot precisely evaluate intermediate steps in reasoning paths, TVM can estimate the expected cumulative reward for each token. This enables TVM to capture detailed token-level information and to assess more precisely whether an intermediate reasoning path leads to the correct answer. Experimental results on benchmarks such as GSM8K and MATH show that TVM outperforms previous verifiers across 7B-scale LLMs, including Mistral-7B and Llama3-8B, demonstrating its enhanced accuracy and effectiveness.

Limitations

Our method has demonstrated significant improvements over previous competing methods, but resource constraints limited us from running further experiments. Our TVM was primarily evaluated using 7B-scale models for mathematical reasoning, but it can be applied to larger models and extended to other domains. Additionally, our model could be utilized as a value model in reinforcement learning, such as in Proximal Policy Optimization training (Schulman et al., 2017; Zheng et al., 2023), to supervise LLMs.

References

  • AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
  • Feng et al. (2024) Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. 2024. Alphazero-like tree-search can guide large language model decoding and training. Preprint, arXiv:2309.17179.
  • Hao et al. (2023) Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. In The 2023 Conference on Empirical Methods in Natural Language Processing.
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. NeurIPS.
  • Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.
  • Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  • Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s verify step by step. Preprint, arXiv:2305.20050.
  • Luo et al. (2023) Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583.
  • Maslej et al. (2024) Nestor Maslej, Loredana Fattorini, Raymond Perrault, Vanessa Parli, Anka Reuel, Erik Brynjolfsson, John Etchemendy, Katrina Ligett, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Russell Wald, and Jack Clark. 2024. Artificial intelligence index report 2024. Preprint, arXiv:2405.19522.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. Preprint, arXiv:1707.06347.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
  • Uesato et al. (2022) Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275.
  • Wang et al. (2024) Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. 2024. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. Preprint, arXiv:2312.08935.
  • Wang et al. (2023) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
  • Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate problem solving with large language models. Preprint, arXiv:2305.10601.
  • Yu et al. (2024) Fei Yu, Anningzhe Gao, and Benyou Wang. 2024. Ovm, outcome-supervised value models for planning in mathematical reasoning. Preprint, arXiv:2311.09724.
  • Yu et al. (2023) Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
  • Yuan et al. (2023) Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. 2023. Scaling relationship on learning mathematical reasoning with large language models. Preprint, arXiv:2308.01825.
  • Zheng et al. (2023) Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu, Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. 2023. Secrets of rlhf in large language models part i: Ppo. Preprint, arXiv:2307.04964.

Appendix A Proof of Proposition 3.1

Let the reward function $r(t_{n,k})$ be defined as Eq. 3, which includes only the outcome reward with the discount factor $\gamma=1$ and no intermediate reward (i.e., $r(t_{n,k})=0$ except the final answer). Then, $\sum_{l=1}^{\infty}\gamma^{l-1}r(t_{n,k+l}) = \sum_{l=1}^{\infty}r(t_{n,k+l})$ becomes either one or zero, depending on whether the resulting final answer will be $\hat{a}$ or not, respectively. As a result, the expected cumulative reward (Eq. 7) is written as

$$\begin{aligned}
&\mathbb{E}\Big[\sum_{l=1}^{\infty} \gamma^{l-1} r(t_{n,k+l}) \,\Big|\, q_{tr}, \{t_{n,\cdot}\}_{1}^{k}\Big] \\
&= \mathbb{E}\Big[\sum_{l=1}^{\infty} r(t_{n,k+l}) \,\Big|\, q_{tr}, \{t_{n,\cdot}\}_{1}^{k}\Big] \quad (\because \gamma = 1) \\
&= \sum_{r=0}^{1} r \cdot \mathbb{P}\Big(\sum_{l=1}^{\infty} r(t_{n,k+l}) = r \,\Big|\, q_{tr}, \{t_{n,\cdot}\}_{1}^{k}\Big) \quad \Big(\because \sum_{l=1}^{\infty} r(t_{n,k+l}) = 0 \text{ or } 1\Big) \\
&= \mathbb{P}\Big(\sum_{l=1}^{\infty} r(t_{n,k+l}) = 1 \,\Big|\, q_{tr}, \{t_{n,\cdot}\}_{1}^{k}\Big) \\
&= \mathbb{P}(\textit{the final answer will be } \hat{a} \mid q_{tr}, \{t_{n,\cdot}\}_{1}^{k}),
\end{aligned}$$

because $\sum_{l=1}^{\infty} r(t_{n,k+l}) = 1$ only if the resulting final answer will be $\hat{a}$.

Appendix B Additional Comparison of TVM with a sentence-level value model (Feng et al., 2024) as well as OVM

Although Yu et al. (2024) corroborated that step-level beam search guided by a token-level verifier performs significantly better than that guided by a sentence-level value model (Feng et al., 2024) on GSM8K, we additionally compare our TVM with a sentence-level value model (Feng et al., 2024) as well as OVM for Llama2-7B.

Table 4: Mean and standard deviation of accuracy of a sentence-level value model (Feng et al., 2024), OVM, and TVM for Llama2-7B (Touvron et al., 2023) on the GSM8K benchmark in the case of $K = 10$ under the verifier-guided step-level beam search strategy. For TVM, three random trials are conducted.

| Search Strategy | Method | Llama2-7B |
| --- | --- | --- |
| Verifier-guided Step-level Beam Search | Feng et al. (2024) | 52.20 ± 0.90 |
| | OVM | 66.50 ± 0.20 |
| | TVM (Ours) | 66.82 ± 0.38 |

As seen in Table 4, TVM attains the highest mean accuracy (66.82), ahead of both OVM (66.50) and the sentence-level value model of Feng et al. (2024) (52.20). As explained in Yu et al. (2024), under the verifier-guided step-level beam search strategy, an outcome-supervised reward model can act as a value model, whereas a process-supervised reward model cannot. As a result, the sentence-level value model is less accurate than both OVM and TVM.

Appendix C Algorithm of Empirical Value Estimation

Algorithm 1 Empirical Value Estimation
For a question $q_{tr}$, $N_{tr}$ reasoning paths, each consisting of $\{t_{n,k}\}_{k=1}^{T_n}$ and a final answer $a_n$, the ground truth answer $\hat{a}$, and the outcome reward function $r_o(a_n)$ in Eq. 3 for $n = 1, \cdots, N_{tr}$.
$H \leftarrow$ dict()
for $n = 1, \cdots, N_{tr}$ do
     for $k = 1, \cdots, T_n$ do
         if not $H$.containsKey$[t_{n,1}, \cdots, t_{n,k}]$ then
              $H$.insert$([t_{n,1}, \cdots, t_{n,k}], (r_o(a_n), 1))$
         else
              $(c, t) \leftarrow H$.get$[t_{n,1}, \cdots, t_{n,k}]$
              $H$.insert$([t_{n,1}, \cdots, t_{n,k}], (c + r_o(a_n), t + 1))$
         end if
     end for
end for
for $n = 1, \cdots, N_{tr}$ do
     for $k = 1, \cdots, T_n$ do
         $(c, t) \leftarrow H$.get$[t_{n,1}, \cdots, t_{n,k}]$
         // $c$ means the number of correct reasoning paths starting from $t_{n,1}, \cdots, t_{n,k}$
         // $t$ indicates the number of total reasoning paths starting from $t_{n,1}, \cdots, t_{n,k}$
         $V(t_{n,k}) = \frac{c}{t}$ ▷ Eq. 11
     end for
end for
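
Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `empirical_values`, the representation of each reasoning path as a list of tokens, and the toy paths at the bottom are all hypothetical. The entry `rewards[n]` plays the role of $r_o(a_n)$, and the dictionary `counts` plays the role of $H$.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def empirical_values(
    paths: Sequence[Sequence[Hashable]],
    rewards: Sequence[int],
) -> List[List[float]]:
    """Return c / t for every token position of every sampled path.

    paths[n] is the token sequence of the n-th reasoning path and
    rewards[n] is its outcome reward (1 if the final answer is correct,
    0 otherwise). Both names are hypothetical, chosen for this sketch.
    """
    if len(paths) != len(rewards):
        raise ValueError("need exactly one outcome reward per path")

    # First pass: for each prefix, accumulate
    # [number of correct paths, number of total paths] sharing that prefix.
    counts: Dict[Tuple[Hashable, ...], List[int]] = defaultdict(lambda: [0, 0])
    for tokens, reward in zip(paths, rewards):
        for k in range(1, len(tokens) + 1):
            entry = counts[tuple(tokens[:k])]
            entry[0] += reward
            entry[1] += 1

    # Second pass: the value of token k on path n is c / t for its prefix.
    values: List[List[float]] = []
    for tokens in paths:
        row = []
        for k in range(1, len(tokens) + 1):
            c, t = counts[tuple(tokens[:k])]
            row.append(c / t)
        values.append(row)
    return values


# Toy example (made-up tokens): three paths, only the first one is correct.
paths = [
    ["x", "=", "2", "+", "3"],
    ["x", "=", "2", "*", "3"],
    ["x", "=", "6"],
]
rewards = [1, 0, 0]
values = empirical_values(paths, rewards)
```

In the toy example, the shared prefix `["x", "="]` receives the value 1/3 because one of the three paths passing through it is correct, while the tokens after the paths diverge receive 1 or 0. Keying the dictionary on full prefix tuples keeps the sketch close to the pseudocode; a trie would avoid re-copying prefixes for long paths.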

Appendix D Compute Analysis between Best-of-N Search and Verifier-guided Step-level Beam Search

Table 5: Execution time of best-of-N search without and with vLLM (Kwon et al., 2023) and verifier-guided step-level beam search on the GSM8K and MATH benchmarks when using 8 × NVIDIA A100-80GB GPUs and Mistral-7B-MetaMath.

| Search Strategy | GSM8K | MATH |
| --- | --- | --- |
| Best-of-N search w/o vLLM | 6.5 hours | 22 hours |
| Best-of-N search w/ vLLM | 2.1 hours | 2.4 hours |
| Verifier-guided step-level beam search | $\mathbf{0.9}$ hours | $\mathbf{1.1}$ hours |

Appendix E Implementation Details

In both Sec. 4.1 and Sec. 4.2, following Cobbe et al. (2021), we use both a language modeling objective and the verification objective in Eq. 9, with 20% dropout (Srivastava et al., 2014). Additionally, we employ the same architecture as Cobbe et al. (2021), a language model extended with a scalar head composed of a single gain parameter and a single bias parameter, to output a score for each token in a reasoning path. We use the AdamW optimizer with a linear scheduler. We generate $N_{tr}$ reasoning paths with a temperature of 0.7, a top-k of 50, and a top-p of 1.0. Note that for every experiment, a verifier has the same model size and architecture as an LLM used to generate $N_{tr}$ reasoning paths.
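
To make the scalar head concrete, here is a minimal sketch of a head with one gain and one bias parameter. It assumes that the language model supplies one scalar logit per token position, which is a detail not spelled out in this appendix. The class name `ScalarHead` and the example numbers are hypothetical, and the parameters are plain floats rather than trainable tensors. Any squashing of the score into [0, 1] is left out, since that depends on the verification objective in Eq. 9.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ScalarHead:
    """Hypothetical scalar head: a single gain and a single bias parameter."""

    gain: float = 1.0
    bias: float = 0.0

    def __call__(self, per_token_logits: Sequence[float]) -> List[float]:
        # Assumption: per_token_logits[k] is one scalar emitted by the
        # language model at token position k of the reasoning path.
        return [self.gain * z + self.bias for z in per_token_logits]


# Made-up logits for a three-token reasoning path.
head = ScalarHead(gain=2.0, bias=-1.0)
scores = head([0.0, 0.5, 1.5])
```

The head adds only two parameters on top of the language model, so the verifier keeps essentially the same size as the generator, in line with the note above.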

Table 6: Learning rate, batch size, and verifier initialization for training TVM when using Mistral-7B, Mistral-7B-MetaMath, Llama3-8B, and Llama3-8B-MetaMath to generate $N_{tr} = 100$ reasoning paths per training problem of GSM8K in Sec. 4.1. "Fine-tuned Mistral-7B" and "fine-tuned Llama3-8B" denote the models fine-tuned on the training dataset of GSM8K as described in Sec. 4.1.

| | Mistral-7B | Mistral-7B-MetaMath | Llama3-8B | Llama3-8B-MetaMath |
| --- | --- | --- | --- | --- |
| Learning rate | 2e-6 | 2e-6 | 1e-5 | 2e-6 |
| Batch size | 512 | 512 | 512 | 512 |
| Verifier initialization | fine-tuned Mistral-7B | Mistral-7B-MetaMath | fine-tuned Llama3-8B | Llama3-8B-MetaMath |

Table 7: Learning rate, batch size, and verifier initialization for training TVM when using Mistral-7B-MetaMath and Llama3-8B-MetaMath to generate $N_{tr} = 25$ reasoning paths per training problem of MATH in Sec. 4.2.

| | Mistral-7B-MetaMath | Llama3-8B-MetaMath |
| --- | --- | --- |
| Learning rate | 2e-6 | 2e-6 |
| Batch size | 512 | 512 |
| Verifier initialization | Mistral-7B-MetaMath | Llama3-8B-MetaMath |

For both best-of-N search and verifier-guided step-level beam search, we also use a temperature of 0.7, a top-k of 50, and a top-p of 1.0. The maximum new token length is set to 400 for GSM8K and 1024 for MATH, respectively.
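
For readers less familiar with these decoding knobs, the sketch below shows one common way that temperature, top-k, and top-p interact when sampling a single token. It is a generic illustration, not the decoding code used in the experiments: the function name `sample_token`, the order of operations (temperature, then top-k, then top-p), and the toy logits are assumptions. With top-p set to 1.0, as in the settings above, the nucleus filter keeps every token that survives top-k.

```python
import math
import random
from typing import Optional, Sequence


def sample_token(
    logits: Sequence[float],
    temperature: float = 0.7,
    top_k: int = 50,
    top_p: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one token index from raw logits (hypothetical helper)."""
    rng = rng or random.Random()

    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [z / temperature for z in logits]

    # Top-k: keep only the k highest-scoring token indices.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k] if top_k > 0 else order

    # Softmax over the kept tokens (shifted by the max for stability).
    peak = scaled[kept[0]]
    weights = [math.exp(scaled[i] - peak) for i in kept]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    cumulative = 0.0
    cutoff = len(kept)
    for j, p in enumerate(probs):
        cumulative += p
        if cumulative >= top_p:
            cutoff = j + 1
            break
    kept, probs = kept[:cutoff], probs[:cutoff]

    # random.choices renormalizes the remaining weights internally.
    return rng.choices(kept, weights=probs, k=1)[0]


# Toy logits over a four-token vocabulary, sampled with the settings above.
token = sample_token([0.1, 2.0, -1.0, 0.5], rng=random.Random(0))
```

In practice, generation libraries expose these settings as sampling parameters, so a hand-written filter like this is useful mainly for understanding what the reported values control.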