Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models

Jung Hyun Lee^{1 $\dagger$}, June Yong Yang^{2 $\dagger$}, Byeongho Heo³, Dongyoon Han³, Kang Min Yoo^{1,4 $\dagger$}

¹NAVER Cloud, ²KAIST AI, ³NAVER AI Lab, ⁴SNU AI Center
^$\dagger$[email protected], [email protected], [email protected] Equal contribution.
Preprint.

Abstract

Large Language Models (LLMs) have demonstrated impressive problem-solving capabilities in mathematics through step-by-step reasoning chains. However, because language models generate autoregressively, token by token, they are susceptible to reasoning errors that degrade the quality of subsequent reasoning chains and the final answer. Recent works have proposed adopting external verifiers to guide the generation of reasoning paths, but existing works utilize models that have been trained with step-by-step labels to assess the correctness of token-by-token reasoning chains. Consequently, they struggle to recognize discriminative details of tokens within a reasoning path and lack the ability to evaluate whether an intermediate reasoning path is on a promising track toward the correct final answer. To amend the lack of sound and token-grained math-verification signals, we devise a novel training scheme for verifiers that applies token-level supervision with the expected cumulative reward (i.e., value). Furthermore, we propose a practical formulation of the cumulative reward by reducing it to the probability of future correctness of the final answer, thereby enabling the empirical estimation of the value. Experimental results on mathematical reasoning benchmarks show that the Token-Supervised Value Model (TVM) can outperform step-by-step verifiers on GSM8K and MATH with Mistral and Llama.


1 Introduction

Figure 1: Illustrative comparison of token-level supervision (TVM; ours) with outcome supervision (ORM) and process supervision (PRM). We provide two examples for each correct and wrong reasoning path. In reasoning step 4 of each example, both ORM and PRM use uniform labels judged by the correctness of either an entire reasoning path or step, which poses challenges for recognizing discriminative details of tokens within a reasoning path. On the other hand, TVM is supervised with distinct per-token labels, thus enabling the distinction of the details of tokens within a reasoning path and leading to more precise outcomes (see Fig. 2).

Large language models (LLMs) pre-trained on massive data have achieved human-level performance across a wide range of tasks in natural language processing (Maslej et al., 2024). A notable exception to this trend is complex multi-step reasoning tasks such as mathematical problem solving, where current state-of-the-art LLMs still struggle to attain near-human performance. Previous studies have been focused on enhancing the reasoning capabilities of LLMs through: encouraging LLMs to generate step-by-step thought processes via few-shot or zero-shot prompting (Wei et al., 2022; Kojima et al., 2022); fine-tuning LLMs with question-solution pairs to generate intermediate reasoning steps before producing a final answer (Cobbe et al., 2021; Luo et al., 2023; Yu et al., 2023; Yuan et al., 2023); and employing aggregation techniques such as majority voting over final answers extracted from solutions generated by LLMs (Wang et al., 2023).

However, when LLMs are left to their own devices to solve given problems, they remain error-prone due to their autoregressive nature in generating reasoning paths. If an LLM, by chance, produces a single error during generation, the reasoning path can be easily steered towards a wrong answer. This would worsen for LLMs when they face more complex reasoning tasks such as advanced-level mathematical problems in the MATH dataset (Hendrycks et al., 2021). To address this, researchers have focused on providing external aid to the LLM by training verifiers to assess the correctness of generated reasoning paths.

Existing verifiers can be categorized into two types: outcome-supervised reward models (ORMs) and process-supervised reward models (PRMs). ORMs (Cobbe et al., 2021; Uesato et al., 2022; Yu et al., 2024) are trained to assess the correctness of a reasoning path by labeling each token as either correct or incorrect solely based on whether the final answer in the reasoning path is correct. PRMs (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2024) are trained with step-level labels to assess the correctness of each reasoning step, and they are generally preferred over ORMs in practice due to their finer resolution of assessment. Despite being proposed to assist LLMs, current verifiers may retain a fundamental misalignment with the per-token granularity of LLMs. Since ORMs and PRMs employ uniform labels according to the correctness of either a whole reasoning path or step, respectively (Fig. 1), we argue that they were not designed to (i) learn the discriminative details of tokens within a reasoning path or (ii) evaluate whether an intermediate reasoning path is on a promising track toward the correct final answer.

In this paper, we propose the Token-supervised Value Model (TVM), a novel verifier that supervises each token in a reasoning path with a distinctive label, training each token with the expected cumulative reward. Unlike ORMs and PRMs, our token-level supervision with distinct per-token value labels along a reasoning path (Fig. 1) equips TVMs with the ability to capture the discriminative details of tokens within a reasoning path (see Fig. 2). Furthermore, providing a theoretical insight that the value of each token is equivalent to the probability of reaching the correct final answer from that token, we propose to label each token via empirical value estimation along sampled reasoning paths. TVM is trained to predict the probability of a per-token intermediate reasoning path being on a promising track toward the correct final answer. Therefore, TVM can choose, among candidate reasoning paths, those most likely to reach the correct final answer, whether they are partial or complete. Our contributions are threefold:

• We propose the Token-supervised Value Model (TVM), a new verifier capable of capturing token-wise details via direct supervision with the expected cumulative reward (i.e., value) for each token along a reasoning path.

• We generate per-token labels for verifier supervision via empirical value estimation, which allows TVM to predict the probability of an intermediate reasoning path reaching the correct final answer.

• We show that TVM achieves performance improvements on GSM8K and MATH benchmarks across LLMs under 10B parameters, compared to ORMs and PRMs.

2 Background

This section reviews existing verifier frameworks for enhancing the mathematical reasoning capabilities of LLMs. Sec. 2.1 outlines the preliminary setups for training verifiers in mathematical reasoning verification. The subsequent sections revisit two existing types of supervision for verifier training: outcome supervision (Sec. 2.2) and process supervision (Sec. 2.3).

2.1 Training Verifiers for Mathematical Reasoning

The mathematical reasoning capabilities of LLMs can be enhanced by employing reward models as external verifiers to assess the generated reasoning paths (Cobbe et al., 2021; Uesato et al., 2022; Lightman et al., 2023; Yu et al., 2024; Wang et al., 2024). The verifier is generally trained via supervised learning on a dataset obtained by sampling multiple reasoning paths per training problem using an LLM. Specifically, given a training problem $q_{tr}$ as an input, the LLM first generates $N_{tr}$ reasoning paths, where the $n$-th reasoning path consists of reasoning steps $\{s_{n,j}\}_{j=1}^{S_{n}}$ and a final answer $a_{n}$ for $n=1,\cdots,N_{tr}$. In token-level notation, the $n$-th reasoning path can also be expressed as a sequence of tokens $\{t_{n,k}\}^{T_{n}}_{k=1}$. Hereafter, $\{s_{n,\cdot}\}_{1}^{j}$ and $\{t_{n,\cdot}\}_{1}^{k}$ denote $\{s_{n,1},\cdots,s_{n,j}\}$ and $\{t_{n,1},\cdots,t_{n,k}\}$, respectively. The final answer $a_{n}$ is correct if it is equal to the ground truth answer $\hat{a}$, and incorrect otherwise. Based on the correctness of the sampled reasoning paths, supervision is traditionally given in two ways: (i) outcome supervision (Cobbe et al., 2021; Uesato et al., 2022; Yu et al., 2024) and (ii) process supervision (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2024).

2.2 Outcome Supervision

Prior works (Cobbe et al., 2021; Uesato et al., 2022; Yu et al., 2024) employ outcome supervision to label an entire reasoning path as correct if its final answer is correct (Fig. 1). The outcome reward function $r_{o}(\cdot)$ is the correctness of the final answer:

	$\displaystyle r_{o}(a_{n})=\begin{cases}1&\text{if }a_{n}=\hat{a}\\ 0&\text{if }a_{n}\neq\hat{a}\end{cases}$		(3)

for $n=1,\cdots,N_{tr}$ . An outcome-supervised reward model (ORM) $f_{ORM}$ is trained with every token in a reasoning path labeled as the outcome reward (Eq. 3). The ORM loss $\mathcal{L}_{ORM}$ is defined as

	$\displaystyle\mathcal{L}_{ORM}=\sum_{n=1}^{N_{tr}}\sum_{k=1}^{T_{n}}\ell\left(r_{o}(a_{n}),f_{ORM}(q_{tr},\{t_{n,\cdot}\}_{1}^{k})\right).$		(4)

The mean squared error is typically used as a loss function $\ell(\cdot)$ in Eq. 4. Cobbe et al. (2021) demonstrated that a token-level verifier trained to judge the correctness after every token performs better than a solution-level verifier trained to determine the correctness only after the final token.

Interestingly, Yu et al. (2024) showed that ORMs can be alternatively described as modeling the cumulative reward for each token, where all intermediate rewards are zero (i.e., $r(t_{n,k})=0$ for every $n$ and $k$ ) and the discount factor $\gamma$ is set to 1. The cumulative reward following an intermediate token $t_{n,k}$ , $R(t_{n,k})$ is calculated as

	$\displaystyle R(t_{n,k})=r(t_{n,k+1})+\cdots+r(t_{n,T_{n}})+r_{o}(a_{n})$
	$\displaystyle=\begin{cases}0+\cdots+0+1=1&\text{if }a_{n}=\hat{a}\\ 0+\cdots+0+0=0&\text{if }a_{n}\neq\hat{a},\end{cases}$		(5)

which is equivalent to $r_{o}(a_{n})$ in Eq. 3. This entails that an intermediate reasoning path is labeled as correct if the final answer is correct, and vice versa. In this sense, ORMs can indirectly and implicitly learn the potential correctness of an intermediate reasoning path (Yu et al., 2024).
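As a minimal sketch of the cumulative-reward view in Eq. 5, the snippet below computes the label of every token when all intermediate rewards are zero and $\gamma=1$. The function name `orm_token_labels` and the toy tokens are illustrative assumptions, not part of the original work.

```python
def orm_token_labels(tokens, final_answer, ground_truth):
    """Label each token with the cumulative reward of Eq. 5 (gamma = 1)."""
    # Outcome reward r_o(a_n) from Eq. 3.
    outcome = 1.0 if final_answer == ground_truth else 0.0
    # Assumption from the text: every intermediate reward is zero.
    intermediate = [0.0] * len(tokens)
    labels = []
    for k in range(1, len(tokens) + 1):
        # Rewards that follow token k, ending with the outcome reward.
        future = intermediate[k:] + [outcome]
        labels.append(sum(future))
    return labels


correct_labels = orm_token_labels(["6", "*", "3", "=", "18"], "18", "18")
wrong_labels = orm_token_labels(["6", "+", "3", "=", "9"], "9", "18")
```

Every token in a path receives the same label, which is the uniformity that the outcome-supervised setting implies.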

2.3 Process Supervision

Process supervision enables a more accurate assessment of a reasoning path by explicitly training a verifier on the correctness of each step with step-level supervision (Lightman et al., 2023). The correctness of each reasoning step is labeled either via human annotation (Uesato et al., 2022; Lightman et al., 2023) or via automation (Wang et al., 2024). Since acquiring human annotations is labor-intensive and costly, we mainly focus on process supervision without human annotations.

Following Wang et al. (2024), an intermediate reasoning step $s_{n,j}$ can be labeled as correct if at least one of the reasoning paths starting from $s_{n,j}$ reaches the correct final answer $\hat{a}$ (Fig. 1). In practice, $s_{n,j}$ is annotated by sampling a fixed number of reasoning paths conditioned on a sequence of intermediate reasoning steps $\{s_{n,\cdot}\}_{1}^{j}=\{s_{n,1},\cdots,s_{n,j}\}$. If at least one of the sampled reasoning paths reaches the correct final answer, $s_{n,j}$ is labeled as correct with the process reward $r_{p}(s_{n,j})=1$. Otherwise, $s_{n,j}$ is labeled as incorrect and $r_{p}(s_{n,j})=0$. Using the per-step labels obtained through automation, a Process-supervised Reward Model (PRM) is trained to provide a step-level assessment by minimizing the following loss:

	$\displaystyle\mathcal{L}_{PRM}=\sum_{n=1}^{N_{tr}}\sum_{j=1}^{S_{n}}\ell\left(r_{p}(s_{n,j}),f_{PRM}(q_{tr},\{s_{n,\cdot}\}_{1}^{j})\right),$		(6)

where $\ell$ denotes the binary cross entropy loss.
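The automated step labeling described above can be sketched as follows. This is an illustration under stated assumptions: the final answers of the rollouts sampled from each step prefix are taken as given (the sampling itself is not shown), and the names `process_reward` and `label_steps` are hypothetical.

```python
def process_reward(rollout_answers, ground_truth):
    """Return r_p = 1 if at least one sampled completion is correct, else 0."""
    return 1 if any(answer == ground_truth for answer in rollout_answers) else 0


def label_steps(rollouts_per_step, ground_truth):
    """Label each step s_{n,j} from the answers of rollouts conditioned on steps 1..j."""
    return [process_reward(answers, ground_truth) for answers in rollouts_per_step]


# Toy example: rollouts from step 1 sometimes succeed, rollouts from step 2 never do.
step_labels = label_steps([["18", "17", "18"], ["17", "16", "17"]], "18")
```

Because the label depends only on whether any rollout succeeds, a step followed by one correct rollout out of many receives the same label as a step whose rollouts are all correct.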

3 Method

In this section, we introduce our proposed method, the Token-supervised Value Model (TVM), a novel verifier trained with a token-level supervision strategy to directly estimate the expected cumulative reward (i.e., value) for each token along a reasoning path. We also describe how to empirically estimate per-token value labels from $N_{tr}$ generated reasoning paths for token-level supervision.

3.1 Motivation

As mentioned in Sec. 2, both outcome supervision (ORMs) and process supervision (PRMs) utilize homogeneous labels determined by the correctness of either the entire reasoning path or step (Fig. 1). Consequently, we hypothesize that they are likely to be neither explicitly nor directly trained to (i) learn the discriminative details of tokens within a reasoning path or (ii) evaluate whether an intermediate reasoning path is on a promising track toward the correct final answer.

We elucidate our assertion through cases observed in practice, as illustrated in Fig. 2. In the reasoning path ranked highest by ORM (Fig. 2(a)), reasoning step 4 begins with “So the total difference” but ends with a summation, where a logical error occurs. However, ORM is unable to catch the error and maintains a score over $0.4$, the highest score among $256$ candidate reasoning paths. In the reasoning path ranked highest by PRM (Fig. 2(b)), reasoning step 5 starts with “Thus, the total difference” but ends in subtracting a larger number (“120”) from a smaller number (“95”), which is the exact opposite of the definition of difference. In the reasoning path, “120” appears in reasoning step 4 after “95” appears in reasoning step 3. Since PRMs focus on assessing the correctness of the current reasoning step, the sequential appearance of numbers and the resulting subtraction are considered correct by the PRM even though the reasoning path is unlikely to lead to a correct answer. The observed failures inspire the proposal of a token-level value supervision strategy for training verifiers.

3.2 Token-level Value Supervision

To overcome the issues above, we propose a new verifier based on token-level supervision with distinctive token-wise labels according to the potential of tokens in deducing the correct final answer. A natural choice to appropriately reflect the token-wise potential is prospective value modeling (Sutton and Barto, 2018), which is fine-grained and future-oriented compared to retrospective cumulative reward modeling (Eq. 5). Accordingly, we construct a supervision scheme for token $t_{n,k}$ in a reasoning path $\{t_{n,\cdot}\}_{1}^{k}=\{t_{n,1},\cdots,t_{n,k}\}$ with the expected cumulative reward (i.e., value):

	$\displaystyle V(t_{n,k})=\mathbb{E}\Big[\sum_{l=1}^{\infty}\gamma^{l-1}r(t_{n,k+l})\,\Big|\,q_{tr},\{t_{n,\cdot}\}_{1}^{k}\Big],$		(7)

where $r(\cdot)$ and $\gamma$ denote a reward function and the discount factor, respectively.

The primary challenge in training value models as verifiers is estimating the value labels of a generated reasoning path (Yu et al., 2024). However, under the outcome reward formulation of Eq. 3 with no intermediate rewards, the expected cumulative reward (Eq. 7) reduces to the probability of reaching the correct final answer conditioned on the question $q_{tr}$ and the intermediate reasoning path $\{t_{n,\cdot}\}_{1}^{k}$. This probability can be computed directly from generated reasoning paths and indicates whether the intermediate reasoning path is on a promising track toward the correct final answer.

Proposition 3.1.

Let the reward function $r(t_{n,k})$ be defined as Eq. 3, which includes only the outcome reward with the discount factor $\gamma=1$ and no intermediate reward (i.e., $r(t_{n,k})=0$ except the final answer). Then, the expected cumulative reward (Eq. 7) is equivalent to the probability of reaching the correct final answer conditioned on $q_{tr}$ and $\{t_{n,\cdot}\}_{1}^{k}=\{t_{n,1},\cdots,t_{n,k}\}$ :

	$\displaystyle\mathbb{E}\Big[\sum_{l=1}^{\infty}\gamma^{l-1}r(t_{n,k+l})\,\Big|\,q_{tr},\{t_{n,\cdot}\}_{1}^{k}\Big]$		(8)
	$\displaystyle=\mathbb{P}(\textit{the final answer will be }\hat{a}\mid q_{tr},\{t_{n,\cdot}\}_{1}^{k}).$

The right-hand side of Eq. 8 can be empirically estimated from generated reasoning paths by calculating the proportion of correct reasoning paths starting from $\{t_{n,\cdot}\}_{1}^{k}$ among total reasoning paths starting from $\{t_{n,\cdot}\}_{1}^{k}$ (see Sec. 3.3).
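The identity in Eq. 8 can be motivated by the following short sketch, which is an informal argument rather than the authors' proof. It assumes that every completion terminates with some final answer, written here as the random variable $a$ (a symbol introduced only for this illustration). With $\gamma=1$ and zero intermediate rewards, the sum collapses to the outcome reward $r_{o}(a)$, and the expectation of a binary reward is the probability that it equals one:

```latex
\begin{align*}
\mathbb{E}\Big[\sum_{l=1}^{\infty}\gamma^{l-1}r(t_{n,k+l})\,\Big|\,q_{tr},\{t_{n,\cdot}\}_{1}^{k}\Big]
&=\mathbb{E}\big[r_{o}(a)\,\big|\,q_{tr},\{t_{n,\cdot}\}_{1}^{k}\big]\\
&=1\cdot\mathbb{P}(a=\hat{a}\mid q_{tr},\{t_{n,\cdot}\}_{1}^{k})
 +0\cdot\mathbb{P}(a\neq\hat{a}\mid q_{tr},\{t_{n,\cdot}\}_{1}^{k})\\
&=\mathbb{P}(\textit{the final answer will be }\hat{a}\mid q_{tr},\{t_{n,\cdot}\}_{1}^{k}).
\end{align*}
```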

Following Proposition 3.1, we train the Token-supervised Value Model (TVM) by supervising each token with a value label empirically estimated as the probability of reaching the correct final answer given the reasoning path up to and including that token. The objective of TVM is

	$\displaystyle\mathcal{L}_{TVM}=\sum_{n,k}\ell\left(\mathbb{P}_{n,k},f_{TVM}(q_{tr},\{t_{n,\cdot}\}_{1}^{k})\right)$		(9)

for $n=1,\cdots,N_{tr}$ and $k=1,\cdots,T_{n}$ , where $\mathbb{P}_{n,k}$ indicates the right-hand side of Eq. 8 and the loss function $\ell$ is the mean squared error.
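To make the objective in Eq. 9 concrete, the sketch below sums the squared error between per-token value labels and per-token predictions over all paths. It is written in plain Python for readability; the function name `tvm_loss` and the toy numbers are assumptions for illustration, and an actual training setup would compute this on model outputs in a deep learning framework.

```python
def tvm_loss(value_labels, predictions):
    """Sum of squared errors over all (path n, token k) pairs, as in Eq. 9."""
    total = 0.0
    for labels_n, preds_n in zip(value_labels, predictions):
        for p_nk, f_nk in zip(labels_n, preds_n):
            total += (p_nk - f_nk) ** 2
    return total


# One toy path with three tokens whose value labels differ per token.
labels = [[0.5, 0.5, 1.0]]
preds = [[0.5, 0.0, 1.0]]
loss = tvm_loss(labels, preds)
```

Unlike the uniform ORM and PRM labels, the targets here may change from one token to the next within the same path.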

Table 1: Accuracy (%) of Mistral-7B, Mistral-7B-MetaMath, Llama3-8B, and Llama3-8B-MetaMath on the GSM8K benchmark under best-of-N search ($N=256$) and verifier-guided step-level beam search ($K=40$, $b=10$). "BS" stands for beam search. Bold marks the best result within each search strategy.

Search Strategy	Method	Mistral-7B	Mistral-7B-MetaMath	Llama3-8B	Llama3-8B-MetaMath
	Self-Consistency	$79.23$	$83.90$	$80.97$	$85.44$
Best-of-N Search	ORM	$85.52$	$86.20$	$87.79$	$89.77$
Best-of-N Search	Math-Shepherd	-	$87.10$	-	$89.23$
Best-of-N Search	TVM (Ours)	$\mathbf{88.17}$	$\mathbf{88.86}$	$\mathbf{88.70}$	$\mathbf{90.37}$
Verifier-guided Step-level BS	OVM	$86.73$	$87.79$	$88.10$	$89.69$
Verifier-guided Step-level BS	TVM (Ours)	$\mathbf{87.72}$	$\mathbf{88.78}$	$\mathbf{89.01}$	$\mathbf{90.30}$

Compared to existing verifiers, the resolution of assessment provided by the proposed token-level value supervision adequately matches the token-wise granularity of LLMs, thereby being able to capture the discriminative details of tokens within a reasoning path (Fig. 2). In contrast to ORMs, TVM is trained to directly estimate the probability of an intermediate reasoning path being on a promising track toward the correct final answer (Proposition 3.1). As a result, TVM can choose the reasoning path most likely to reach the correct final answer among candidate reasoning paths, whether they are partial or complete.

During inference, TVM can be employed either to search for the reasoning path most likely to be correct among complete reasoning paths generated by an LLM (Lightman et al., 2023) or to distinguish prospective candidates likely to reach the correct final answer among partially generated reasoning paths. For the latter, we conduct a detailed study in Sec. 4.3 in the setting of verifier-guided step-level beam search (Yu et al., 2024).

3.3 Empirical Value Estimation

As discussed in Sec. 3.2, Proposition 3.1 alleviates the practical challenges of value estimation (Eq. 7) by formulating the value as the ratio of correct reasoning paths to total reasoning paths. Following Eq. 7 and Eq. 8, the estimated value for each token $t_{n,k}$ can be represented as

	$\displaystyle V(t_{n,k})=\mathbb{E}\Big[\sum_{l=1}^{\infty}\gamma^{l-1}r(t_{n,k+l})\,\Big|\,q_{tr},\{t_{n,\cdot}\}_{1}^{k}\Big]$
	$\displaystyle=\mathbb{P}(\textit{the final answer will be }\hat{a}\mid q_{tr},\{t_{n,\cdot}\}_{1}^{k})$
	$\displaystyle=\frac{\mathbb{P}(\{t_{n,\cdot}\}_{1}^{k}\cap\textit{the final answer will be }\hat{a}\mid q_{tr})}{\mathbb{P}(\{t_{n,\cdot}\}_{1}^{k}\mid q_{tr})}.$		(10)

In practice, the numerator and denominator of Eq. 10 can be empirically estimated from the $N_{tr}$ generated reasoning paths as the fraction of paths that start with $\{t_{n,\cdot}\}_{1}^{k}$ and reach the correct final answer, and the fraction of paths that start with $\{t_{n,\cdot}\}_{1}^{k}$, respectively. The value label of each token $V(t_{n,k})$ is assigned as

	$\displaystyle\frac{\sum_{n^{\prime}=1}^{N_{tr}}\mathbb{I}(\{t_{n^{\prime},\cdot}\}_{1}^{k}=\{t_{n,\cdot}\}_{1}^{k}\cap a_{n^{\prime}}=\hat{a})/N_{tr}}{\sum_{n^{\prime}=1}^{N_{tr}}\mathbb{I}(\{t_{n^{\prime},\cdot}\}_{1}^{k}=\{t_{n,\cdot}\}_{1}^{k})/N_{tr}},$		(11)

where $\mathbb{I}(\cdot)$ is the indicator function and $N_{tr}$ cancels out. The overall procedure of empirical value estimation is described in Figure 3. The overall algorithm is deferred to Appendix C.
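A minimal sketch of the estimator in Eq. 11 is given below. It is not the algorithm of Appendix C, which is not reproduced here; it simply counts, for every token prefix, how many sampled paths share that prefix and how many of those end in the correct answer. The function name `estimate_token_values` and the toy paths are assumptions for illustration, and storing prefixes as tuples is chosen for clarity rather than memory efficiency (a trie would avoid the repeated prefixes).

```python
from collections import defaultdict


def estimate_token_values(paths, answers, ground_truth):
    """Return values[n][k-1] = (#correct paths sharing the first k tokens of
    path n) / (#paths sharing the first k tokens of path n), as in Eq. 11."""
    total = defaultdict(int)    # prefix -> number of paths starting with it
    correct = defaultdict(int)  # prefix -> number of those that are correct
    for tokens, answer in zip(paths, answers):
        is_correct = int(answer == ground_truth)
        for k in range(1, len(tokens) + 1):
            prefix = tuple(tokens[:k])
            total[prefix] += 1
            correct[prefix] += is_correct
    values = []
    for tokens in paths:
        row = []
        for k in range(1, len(tokens) + 1):
            prefix = tuple(tokens[:k])
            row.append(correct[prefix] / total[prefix])
        values.append(row)
    return values


# Three toy paths for one problem whose ground-truth answer is "18".
paths = [["A", "B", "C"], ["A", "B", "D"], ["A", "E"]]
answers = ["18", "17", "18"]
values = estimate_token_values(paths, answers, "18")
```

In this toy case the shared first token "A" receives the label 2/3, the shared prefix "A B" receives 1/2, and the diverging last tokens receive 1 or 0 depending on the outcome of their own path.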

4 Experiments

To demonstrate the efficacy of TVM in improving the mathematical reasoning capabilities of LLMs, we conduct extensive experiments on the GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) benchmarks. Our experiments are based on the following large language models: 1) Mistral-7B (Jiang et al., 2023) and Llama3-8B (AI@Meta, 2024); 2) those fine-tuned on MetaMath (Yu et al., 2023). We use two existing verifier utilization strategies: (i) best-of-N search and (ii) verifier-guided step-level beam search.

Table 2: Accuracy (%) of Mistral-7B-MetaMath and Llama3-8B-MetaMath on the MATH benchmark under best-of-N search ($N=256$) and verifier-guided step-level beam search ($K=40$, $b=10$). "BS" stands for beam search. Bold marks the best result within each search strategy.

Search Strategy	Method	Mistral-7B-MetaMath	Llama3-8B-MetaMath
	Self-Consistency	$35.10$	$42.40$
Best-of-N Search	ORM	$36.40$	$\mathbf{43.60}$
Best-of-N Search	Math-Shepherd	$37.30$	$43.40$
Best-of-N Search	TVM (Ours)	$\mathbf{37.40}$	$43.40$
Verifier-guided Step-level BS	OVM	$36.60$	$42.40$
Verifier-guided Step-level BS	TVM (Ours)	$\mathbf{39.20}$	$\mathbf{45.20}$

Best-of-N search.

The best-of-N search strategy introduced in Lightman et al. (2023) is a conventional experimental setting to evaluate the performance of a verifier. For every test problem, an LLM first generates $N$ complete reasoning paths. The reasoning path ranked highest by the verifier is chosen as the final candidate. For all experiments, we set $N=256$ following Wang et al. (2024) unless specified otherwise.
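The selection step of best-of-N search can be sketched in a few lines. The verifier is represented by a hypothetical callable `score_fn(question, path)`; how a trained verifier turns its per-token predictions into one score for a complete path is not specified in this section, so that detail is left to the callable.

```python
def best_of_n(question, candidates, score_fn):
    """Return the complete reasoning path that the verifier scores highest."""
    return max(candidates, key=lambda path: score_fn(question, path))


# Toy verifier scores for three candidate paths (illustrative values only).
toy_scores = {"path A": 0.41, "path B": 0.87, "path C": 0.12}
chosen = best_of_n("q", list(toy_scores), lambda question, path: toy_scores[path])
```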

Verifier-guided step-level beam search (BS).

To prevent errors in an intermediate reasoning step from propagating to subsequent steps, Yu et al. (2024) proposed guided decoding during intermediate reasoning steps via a verifier, a search strategy we call verifier-guided step-level beam search. For a test problem, after an LLM partially generates $K$ reasoning paths each containing only the first intermediate reasoning step, the verifier-guided step-level beam search strategy alternates between the following two steps until all $K$ partially generated reasoning paths are complete: (i) a verifier selects the top-$b$ $(<K)$ ranked partially generated reasoning paths, and (ii) the LLM generates $K/b$ subsequent intermediate reasoning steps for each path chosen by the verifier. Among the $K$ complete reasoning paths, the one scored highest by the verifier is selected. Thanks to verifier intervention in generating each intermediate reasoning step, the performance of verifier-guided step-level beam search can be similar to that of best-of-N search in Tables 1 and 2, even with $K$ much smaller than $N$.
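The alternation between verifier selection and LLM expansion can be sketched as follows. The callables `generate_steps`, `is_complete`, and `score_fn` are hypothetical stand-ins for the LLM sampler, the end-of-solution check, and the verifier. Two details are assumptions of this sketch rather than statements from the text: a selected path that is already complete is carried forward unchanged, and `max_rounds` bounds the loop.

```python
def step_level_beam_search(question, generate_steps, is_complete, score_fn,
                           K=40, b=10, max_rounds=50):
    """Verifier-guided step-level beam search with K candidates and beam size b."""
    # Start with K partial paths, each holding only the first reasoning step.
    beams = [[step] for step in generate_steps(question, [], K)]
    for _ in range(max_rounds):
        if all(is_complete(path) for path in beams):
            break
        # (i) The verifier keeps the top-b partial paths.
        ranked = sorted(beams, key=lambda p: score_fn(question, p), reverse=True)
        beams = []
        for path in ranked[:b]:
            if is_complete(path):
                beams.append(path)  # assumption: finished paths are kept as-is
            else:
                # (ii) The LLM proposes K/b next steps for each kept path.
                for step in generate_steps(question, path, K // b):
                    beams.append(path + [step])
    # Return the path that the verifier scores highest at the end.
    return max(beams, key=lambda p: score_fn(question, p))


# Toy stand-ins: steps are digit strings, a path is complete after 3 steps,
# and the "verifier" prefers paths with a larger digit sum.
def toy_generate(question, path, num):
    return [str(i) for i in range(num)]


def toy_complete(path):
    return len(path) >= 3


def toy_score(question, path):
    return sum(int(step) for step in path)


best = step_level_beam_search("q", toy_generate, toy_complete, toy_score, K=4, b=2)
```

With the toy stand-ins, the search keeps the two highest-scoring prefixes at every round and returns the path `["3", "1", "1"]`.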

4.1 Grade School Mathematics (GSM8K)

Setups.

An LLM is fine-tuned on the training dataset of GSM8K for two epochs with a batch size of $128$ and a learning rate of $1\times 10^{-5}$. Then, we sample $N_{tr}=100$ reasoning paths per training problem with a temperature of $0.7$ from the fine-tuned LLM and label each token in a reasoning path according to Eq. 11. Finally, TVM initialized from either the same LLM or the fine-tuned LLM is trained on this dataset for one epoch with a batch size of $512$ and a learning rate of either $2\times 10^{-6}$ or $1\times 10^{-5}$. More experimental details are deferred to Appendix E.

Results.

In the case of best-of-N search, we compare TVM with ORM (Cobbe et al., 2021) and Math-Shepherd (Wang et al., 2024), a PRM without human annotations, as explained in Sec. 2. As all experimental results in Wang et al. (2024) are based only on LLMs fine-tuned on MetaMath, we also evaluate Math-Shepherd only for Mistral-7B-MetaMath and Llama3-8B-MetaMath. Despite using large $N$, Table 1 shows that TVM surpasses ORM and Math-Shepherd by 0.6 to 2.6%p and self-consistency by 4.9 to 8.9%p across the board.

Under the verifier-guided step-level beam search strategy, we primarily compare TVM against OVM (Yu et al., 2024) because Yu et al. (2024) confirmed that step-level beam search guided by a token-level verifier performs significantly better than that guided by a sentence-level value model (Feng et al., 2024) on GSM8K. Further comparison to Feng et al. (2024) is presented in Appendix B. In Table 1, TVM also consistently outperforms OVM by 0.6 to 1.0%p.

One might wonder why the accuracy of OVM ( $K=40$ ) for Mistral-7B is much higher than that of OVM ( $K=100$ ) reported in Yu et al. (2024). This discrepancy arises because, in our experiments, some tokens (e.g., $<<$ , $>>$ ) are correctly converted to token IDs by the Mistral-7B tokenizer.

4.2 Advanced Mathematics (MATH)

Setups.

We employ LLMs fine-tuned on MetaMath (Mistral-7B-MetaMath and Llama3-8B-MetaMath) without any further fine-tuning on the training dataset of MATH in order to sample reasoning paths in a newline-delimited format. Following Lightman et al. (2023) and Wang et al. (2024), we use $500$ test MATH problems for evaluation, which is the same test dataset as in Lightman et al. (2023), incorporating the remaining $4500$ test problems into the training dataset of MATH. For each training problem, an LLM fine-tuned on MetaMath generates $N_{tr}=25$ reasoning paths with a temperature of $0.7$, with each token labeled according to Eq. 11. Then, we train TVM initialized from the same fine-tuned LLM for one epoch on this dataset with a batch size of $512$ and a learning rate of $2\times 10^{-6}$. Further experimental details are given in Appendix E.

Results.

Similar to Sec. 4.1, Table 2 compares (i) TVM’s best-of-N search performance with ORM and Math-Shepherd and (ii) TVM-guided step-level beam search to ORM-guided step-level beam search (i.e., OVM). In the former case, the performance of TVM is slightly superior or almost comparable to that of ORM and Math-Shepherd. This might be due to the fact that an LLM is extremely prone to producing errors in the process of generating $N$ reasoning paths for difficult MATH problems. However, when capitalizing on the verifier-guided step-level beam search strategy, not only does TVM outperform OVM by 2.6 to 2.8%p, but TVM-guided step-level beam search also performs much better than best-of-N search with any verifier, even though $K=40$ is much smaller than $N=256$.

4.3 Analyses on Verifier-guided Step-level BS

Case study.

To validate the superiority of TVM over OVM in predicting whether an intermediate reasoning path is on a promising track toward the correct answer, for a test problem in the GSM8K benchmark, we compare OVM’s and TVM’s predictions. As illustrated in Fig. 4, in the third reasoning step, OVM incorrectly predicts a wrong intermediate reasoning path with the highest score while assigning a low score to a correct path. This occurs because OVM is inherently identical to ORM trained to implicitly and indirectly learn the potential correctness of an intermediate reasoning path. In contrast, TVM accurately predicts a correct intermediate path with the highest score and a wrong one with a low score. As TVM is trained to directly and explicitly estimate the probability of reaching the correct final answer for each token along a reasoning path, TVM can effectively predict at inference whether an intermediate reasoning path is on a promising track toward the correct answer.

Table 3: Mean and standard deviation of TVM’s accuracy for Mistral-7B and Mistral-7B-MetaMath on the GSM8K benchmark according to varying sizes of $K$ and $b$ when employing verifier-guided step-level beam search. Three random trials are carried out.

$K$ , $b$	Mistral-7B	Mistral-7B-MetaMath
$40$ , $10$	87.69 $\pm 0.22$	88.70 $\pm 0.16$
$80$ , $20$	87.89 $\pm 0.35$	88.75 $\pm 0.20$
$100$ , $25$	87.92 $\pm 0.13$	88.80 $\pm 0.07$

Beam size study.

To investigate whether the accuracy of TVM improves with larger values of $K$ and $b$ in verifier-guided step-level beam search, we conduct experiments using TVM with varying sizes of $K$ and $b$ for Mistral-7B and Mistral-7B-MetaMath on the GSM8K benchmark. Table 3 shows that the accuracy of TVM on GSM8K increases as both $K$ and $b$ grow, but reaches a saturation point when $K=100$ and $b=25$ .

5 Related Work

Best-of-N search.

For $N$ complete reasoning paths, a verifier (Cobbe et al., 2021; Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2024) ranks the paths and picks the highest-scored one. Although best-of-N search using a verifier performs much better than verifier-free strategies such as self-consistency (Wang et al., 2023), it shares the same drawback as self-consistency: a large number of generated reasoning paths is required to solve challenging reasoning problems.

Step-level beam search.

In contrast to the selection among complete reasoning paths, several studies have focused on step-level beam searches for partial reasoning paths. Step-level beam search can be divided into (i) verifier-free step-level beam search and (ii) verifier-guided step-level beam search.

Under the verifier-free step-level beam search strategy, Yao et al. (2023) and Hao et al. (2023) allow value estimation by prompting LLMs to sample or simulate long-term outcomes during inference. Alternatively, Feng et al. (2024) and Yu et al. (2024) introduce step-level beam search guided by a sentence-level value model and an outcome-supervised reward model, respectively. Although Feng et al. (2024) and Yu et al. (2024) show that verifier-guided step-level beam search achieves significant accuracy improvements over its verifier-free counterpart, each approach has its own weakness. As delineated in Yu et al. (2024), a sentence-level value model is unsuitable for step-level beam search. In addition, Yu et al. (2024) use an outcome-supervised reward model, not a value model. As a result, there is still room for improvement in the performance of verifier-guided step-level beam search.

6 Conclusion

In this paper, we introduce a novel verifier termed the Token-supervised Value Model (TVM). This model uses per-token value labels to guide LLMs toward promising mathematical reasoning paths. Unlike traditional verifiers, which lack token-level labels and thus cannot precisely evaluate intermediate steps in reasoning paths, TVM can estimate the expected cumulative reward for each token. This enables TVM to identify detailed token-level information and to assess more precisely whether an intermediate reasoning path leads to the correct answer. Experimental results on benchmarks such as GSM8K and MATH show that TVM outperforms previous verifiers across 7B-scale LLMs, including Mistral-7B and Llama3-8B, demonstrating its enhanced accuracy and effectiveness.

Limitations

Our method has demonstrated significant improvements over previous competing methods, but resource constraints limited us from running further experiments. Our TVM was primarily evaluated using 7B-scale models for mathematical reasoning, but it can be applied to larger models and extended to other domains. Additionally, our model could be utilized as a value model in reinforcement learning, such as in Proximal Policy Optimization training (Schulman et al., 2017; Zheng et al., 2023), to supervise LLMs.

References

AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
Feng et al. (2024) Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. 2024. Alphazero-like tree-search can guide large language model decoding and training. Preprint, arXiv:2309.17179.
Hao et al. (2023) Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. In The 2023 Conference on Empirical Methods in Natural Language Processing.
Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. NeurIPS.
Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.
Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s verify step by step. Preprint, arXiv:2305.20050.
Luo et al. (2023) Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. Preprint, arXiv:2308.09583.
Maslej et al. (2024) Nestor Maslej, Loredana Fattorini, Raymond Perrault, Vanessa Parli, Anka Reuel, Erik Brynjolfsson, John Etchemendy, Katrina Ligett, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Russell Wald, and Jack Clark. 2024. Artificial intelligence index report 2024. Preprint, arXiv:2405.19522.
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. Preprint, arXiv:1707.06347.
Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958.
Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
Uesato et al. (2022) Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. Solving math word problems with process- and outcome-based feedback. Preprint, arXiv:2211.14275.
Wang et al. (2024) Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. 2024. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. Preprint, arXiv:2312.08935.
Wang et al. (2023) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate problem solving with large language models. Preprint, arXiv:2305.10601.
Yu et al. (2024) Fei Yu, Anningzhe Gao, and Benyou Wang. 2024. Ovm, outcome-supervised value models for planning in mathematical reasoning. Preprint, arXiv:2311.09724.
Yu et al. (2023) Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical questions for large language models. Preprint, arXiv:2309.12284.
Yuan et al. (2023) Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. 2023. Scaling relationship on learning mathematical reasoning with large language models. Preprint, arXiv:2308.01825.
Zheng et al. (2023) Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu, Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. 2023. Secrets of rlhf in large language models part i: Ppo. Preprint, arXiv:2307.04964.

Appendix A Proof of Proposition 3.1

Let the reward function $r(t_{n,k})$ be defined as in Eq. 3, which includes only the outcome reward, with the discount factor $\gamma=1$ and no intermediate reward (i.e., $r(t_{n,k})=0$ except at the final answer). Then, $\sum_{l=1}^{\infty}\gamma^{l-1}r(t_{n,k+l})=\sum_{l=1}^{\infty}r(t_{n,k+l})$ is either one or zero, depending on whether the resulting final answer is $\hat{a}$ or not, respectively. As a result, the expected cumulative reward (Eq. 7) can be written as

	$\displaystyle\mathbb{E}\big[\sum_{l=1}^{\infty}\gamma^{l-1}r(t_{n,k+l})\,\big|\,q_{tr},\{t_{n,\cdot}\}_{1}^{k}\big]$
	$\displaystyle=\mathbb{E}\big[\sum_{l=1}^{\infty}r(t_{n,k+l})\,\big|\,q_{tr},\{t_{n,\cdot}\}_{1}^{k}\big]\quad(\because\gamma=1)$
	$\displaystyle=\sum_{r=0}^{1}r\cdot\mathbb{P}\big(\sum_{l=1}^{\infty}r(t_{n,k+l})=r\,\big|\,q_{tr},\{t_{n,\cdot}\}_{1}^{k}\big)\quad(\because\sum_{l=1}^{\infty}r(t_{n,k+l})=0\text{ or }1)$
	$\displaystyle=\mathbb{P}\big(\sum_{l=1}^{\infty}r(t_{n,k+l})=1\,\big|\,q_{tr},\{t_{n,\cdot}\}_{1}^{k}\big)$
	$\displaystyle=\mathbb{P}(\textit{the final answer will be }\hat{a}\mid q_{tr},\{t_{n,\cdot}\}_{1}^{k}),$

because $\sum_{l=1}^{\infty}r(t_{n,k+l})=1$ if and only if the resulting final answer is $\hat{a}$.
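
To illustrate the role of $\gamma=1$ in this argument, the following worked step is a sketch under one added assumption: the single nonzero reward is emitted at a future offset $L$, where $L$ is a hypothetical symbol introduced only for this illustration and is not part of the paper's notation.

```latex
\begin{align*}
\sum_{l=1}^{\infty}\gamma^{l-1} r(t_{n,k+l})
  &= \gamma^{L-1}\,\mathbb{1}[\text{the final answer is } \hat{a}], \\
\mathbb{E}\Big[\sum_{l=1}^{\infty}\gamma^{l-1} r(t_{n,k+l}) \,\Big|\, q_{tr},\{t_{n,\cdot}\}_{1}^{k}\Big]
  &= \mathbb{E}\Big[\gamma^{L-1}\,\mathbb{1}[\text{the final answer is } \hat{a}] \,\Big|\, q_{tr},\{t_{n,\cdot}\}_{1}^{k}\Big].
\end{align*}
```

With $\gamma=1$ the factor $\gamma^{L-1}$ equals one and the right-hand side reduces to the conditional probability derived above. With $\gamma<1$ the expectation would additionally depend on how many tokens remain before the final answer, so it would no longer be a pure probability of correctness.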

Appendix B Additional Comparison of TVM with a sentence-level value model (Feng et al., 2024) as well as OVM

Although Yu et al. (2024) corroborated that step-level beam search guided by a token-level verifier performs significantly better than that guided by a sentence-level value model (Feng et al., 2024) on GSM8K, we additionally compare our TVM with a sentence-level value model (Feng et al., 2024) as well as OVM for Llama2-7B.

Table 4: Mean and standard deviation of accuracy of a sentence-level value model (Feng et al., 2024), OVM, and TVM for Llama2-7B (Touvron et al., 2023) on the GSM8K benchmark in the case of $K=10$ under the verifier-guided step-level beam search strategy. For TVM, three random trials are conducted.

Search Strategy	Method	Llama2-7B
Verifier-guided step-level beam search	Feng et al. (2024)	$52.20 \pm 0.90$
Verifier-guided step-level beam search	OVM	$66.50 \pm 0.20$
Verifier-guided step-level beam search	TVM (Ours)	$66.82 \pm 0.38$

As seen in Table 4, TVM attains the highest mean accuracy: it exceeds the sentence-level value model (Feng et al., 2024) by more than 14 points (66.82 vs. 52.20) and OVM by a small margin (66.82 vs. 66.50) that lies within one standard deviation of the three TVM trials. As explained in Yu et al. (2024), under the verifier-guided step-level beam search strategy, outcome-supervised reward models can act as value models, but process-supervised reward models cannot. As a result, the accuracy of a sentence-level value model is worse than that of OVM and TVM.

Appendix C Algorithm of Empirical Value Estimation

Algorithm 1 Empirical Value Estimation

Input: for a question $q_{tr}$, $N_{tr}$ reasoning paths, each consisting of $\{t_{n,k}\}_{k=1}^{T_{n}}$ and a final answer $a_{n}$, the ground truth answer $\hat{a}$, and the outcome reward function $r_{o}(a_{n})$ in Eq. 3 for $n=1,\cdots,N_{tr}$

$H\leftarrow$ dict()
for $n=1,\cdots,N_{tr}$ do
    for $k=1,\cdots,T_{n}$ do
        if not $H$.containsKey$[t_{n,1},\cdots,t_{n,k}]$ then
            $H$.insert$([t_{n,1},\cdots,t_{n,k}],(r_{o}(a_{n}),1))$
        else
            $(c,t)\leftarrow H$.get$[t_{n,1},\cdots,t_{n,k}]$
            $H$.insert$([t_{n,1},\cdots,t_{n,k}],(c+r_{o}(a_{n}),t+1))$
        end if
    end for
end for
for $n=1,\cdots,N_{tr}$ do
    for $k=1,\cdots,T_{n}$ do
        $(c,t)\leftarrow H$.get$[t_{n,1},\cdots,t_{n,k}]$
        $\triangleright$ $c$ means the number of correct reasoning paths starting from $t_{n,1},\cdots,t_{n,k}$
        $\triangleright$ $t$ indicates the number of total reasoning paths starting from $t_{n,1},\cdots,t_{n,k}$
        $V(t_{n,k})=\frac{c}{t}$    $\triangleright$ Eq. 11
    end for
end for
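
Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `empirical_values`, the representation of tokens as strings, and the toy paths are all hypothetical, and the outcome reward is assumed to be 1 when the final answer matches the ground truth and 0 otherwise.

```python
from typing import Dict, List, Sequence, Tuple


def empirical_values(
    paths: Sequence[Sequence[str]],
    answers: Sequence[str],
    ground_truth: str,
) -> List[List[float]]:
    """Return the empirical value c / t for every token of every sampled path."""
    # H in Algorithm 1: maps a token prefix to (c, t), the number of correct
    # paths and the number of all paths that start with that prefix.
    counts: Dict[Tuple[str, ...], Tuple[int, int]] = {}
    for tokens, answer in zip(paths, answers):
        # Assumed outcome reward: 1 for a correct final answer, else 0.
        reward = 1 if answer == ground_truth else 0
        for k in range(1, len(tokens) + 1):
            prefix = tuple(tokens[:k])
            c, t = counts.get(prefix, (0, 0))
            counts[prefix] = (c + reward, t + 1)

    # Second pass: read off V(t_{n,k}) = c / t for each prefix of each path.
    values: List[List[float]] = []
    for tokens in paths:
        row = []
        for k in range(1, len(tokens) + 1):
            c, t = counts[tuple(tokens[:k])]
            row.append(c / t)
        values.append(row)
    return values


# Toy data: three sampled paths for one question whose ground truth is "5".
paths = [
    ["2", "+", "3", "=", "5"],
    ["2", "+", "3", "=", "6"],
    ["2", "*", "3", "=", "6"],
]
answers = ["5", "6", "6"]
values = empirical_values(paths, answers, ground_truth="5")
```

In this toy case the shared first token `"2"` receives the value 1/3, because one of the three paths that begin with it is correct, while the prefix `"2", "+"` receives 1/2. Building a tuple for each prefix keeps the sketch short at the cost of quadratic work per path; a trie would avoid that.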

Appendix D Compute Analysis between Best-of-N Search and Verifier-guided Step-level Beam Search

Table 5: Execution time of best-of-N search without and with vLLM (Kwon et al., 2023) and verifier-guided step-level beam search on the GSM8K and MATH benchmarks when using 8 x NVIDIA A100-80GB GPUs and Mistral-7B-MetaMath.

Search Strategy	GSM8K	MATH
Best-of-N search w/o vLLM	$6.5$ hours	$22$ hours
Best-of-N search w/ vLLM	$2.1$ hours	$2.4$ hours
Verifier-guided step-level beam search	$\mathbf{0.9}$ hours	$\mathbf{1.1}$ hours

Appendix E Implementation Details

In both Sec. 4.1 and Sec. 4.2, following Cobbe et al. (2021), we use both a language modeling objective and the verification objective in Eq. 9, with $20\%$ dropout (Srivastava et al., 2014). Additionally, we employ the same architecture as Cobbe et al. (2021), a language model extended with a scalar head composed of a single gain parameter and a single bias parameter, to output a score for each token in a reasoning path. We use the AdamW optimizer with a linear scheduler. We generate $N_{tr}$ reasoning paths with a temperature of $0.7$, a top-k of $50$, and a top-p of $1.0$. Note that in every experiment, the verifier has the same model size and architecture as the LLM used to generate the $N_{tr}$ reasoning paths.
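
The scalar head described above can be sketched as below. The text only states that the head consists of one gain and one bias parameter; applying them to the logit of one designated special token is assumed here, following the description in Cobbe et al. (2021). The names `token_scores` and `special_token_id`, and all numbers, are hypothetical.

```python
from typing import List, Sequence


def token_scores(
    logits: Sequence[Sequence[float]],
    special_token_id: int,
    gain: float,
    bias: float,
) -> List[float]:
    """Map per-position vocabulary logits to one scalar score per token."""
    # Assumption: the head scales and shifts the logit of a single special
    # token at every position, yielding one score per token in the path.
    return [gain * row[special_token_id] + bias for row in logits]


# Toy logits for a 3-token path over a 4-entry vocabulary (made-up numbers).
logits = [
    [0.1, -1.0, 2.0, 0.5],
    [0.3, 0.0, -2.0, 1.5],
    [1.2, 0.4, 0.0, -0.5],
]
scores = token_scores(logits, special_token_id=2, gain=0.5, bias=0.25)
```

Because the head adds only two trainable parameters, the verifier stays the same size as the underlying language model, which is consistent with the note above about matching model size and architecture.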

Table 6: Learning rate, batch size, and verifier initialization for training TVM when using Mistral-7B, Mistral-7B-MetaMath, Llama3-8B, and Llama3-8B-MetaMath to generate $N_{tr}=100$ reasoning paths per training problem of GSM8K in Sec. 4.1. "Fine-tuned Mistral-7B" and "fine-tuned Llama3-8B" denote models fine-tuned on the training dataset of GSM8K as described in Sec. 4.1.

	Mistral-7B	Mistral-7B-MetaMath	Llama3-8B	Llama3-8B-MetaMath
Learning rate	2e-6	2e-6	1e-5	2e-6
Batch size	$512$	$512$	$512$	$512$
Verifier initialization	fine-tuned Mistral-7B	Mistral-7B-MetaMath	fine-tuned Llama3-8B	Llama3-8B-MetaMath

Table 7: Learning rate, batch size, and verifier initialization for training TVM when using Mistral-7B-MetaMath and Llama3-8B-MetaMath to generate $N_{tr}=25$ reasoning paths per training problem of MATH in Sec. 4.2.

	Mistral-7B-MetaMath	Llama3-8B-MetaMath
Learning rate	2e-6	2e-6
Batch size	$512$	$512$
Verifier initialization	Mistral-7B-MetaMath	Llama3-8B-MetaMath

For both best-of-N search and verifier-guided step-level beam search, we also use a temperature of $0.7$, a top-k of $50$, and a top-p of $1.0$. The maximum new token length is set to $400$ for GSM8K and $1024$ for MATH.
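
To make the effect of these sampling settings concrete, the sketch below computes the next-token distribution that results from temperature scaling, top-k truncation, and top-p (nucleus) filtering, in that order. It is an illustration only: the function name `filtered_distribution` and the toy logits are hypothetical, and the order of the three steps is an assumption rather than something the paper specifies.

```python
import math
from typing import List, Sequence


def filtered_distribution(
    logits: Sequence[float],
    temperature: float = 0.7,
    top_k: int = 50,
    top_p: float = 1.0,
) -> List[float]:
    """Return next-token probabilities after temperature, top-k, and top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep the k most probable token ids and renormalize over them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the shortest ranked prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i] / k_mass
        if mass >= top_p:
            break

    out = [0.0] * len(probs)
    for i in kept:
        out[i] = (probs[i] / k_mass) / mass
    return out


# Toy 4-token vocabulary; top_k is lowered to 2 so the truncation is visible.
dist = filtered_distribution([2.0, 1.0, 0.0, -1.0], top_k=2)
```

With a top-p of $1.0$, as used here, the nucleus step removes nothing, so only the temperature and the top-k cutoff shape the distribution. A token can then be drawn with `random.choices(range(len(dist)), weights=dist)`.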

	$\displaystyle V(t_{n,k})=\mathbb{E}\big[\sum_{l=1}^{\infty}\gamma^{l-1}r(t_{n,k+l})\,\big|\,q_{tr},\{t_{n,\cdot}\}_{1}^{k}\big]$
	$\displaystyle=\mathbb{P}(\textit{the final answer will be }\hat{a}\mid q_{tr},\{t_{n,\cdot}\}_{1}^{k})$
	$\displaystyle=\frac{\mathbb{P}(\{t_{n,\cdot}\}_{1}^{k}\cap\textit{the final answer will be }\hat{a}\mid q_{tr})}{\mathbb{P}(\{t_{n,\cdot}\}_{1}^{k}\mid q_{tr})}.$		(10)
