Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning
Abstract
Self-consistency (SC), a widely used decoding strategy for chain-of-thought reasoning, shows significant gains across various multi-step reasoning tasks but comes with a high cost due to multiple sampling with the preset size. Its variants, Adaptive self-consistency (ASC) and Early-stopping self-consistency (ESC), dynamically adjust the number of samples based on the posterior distribution of a set of pre-samples, reducing the cost of SC with minimal impact on performance. Both methods, however, do not exploit the prior information about question difficulty. It often results in unnecessary repeated sampling for easy questions that could be accurately answered with just one attempt, wasting resources. To tackle this problem, we propose Difficulty-Adaptive Self-Consistency (DSC), which leverages the difficulty information from both prior and posterior perspectives to adaptively allocate inference resources, further reducing the cost of SC. To demonstrate the effectiveness of DSC, we conduct extensive experiments on three popular categories of reasoning tasks: arithmetic, commonsense and symbolic reasoning on six benchmarks. The empirical results show that DSC consistently surpasses the strong baseline ASC and ESC in terms of costs by a significant margin, while attaining comparable performances.
Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning
Xinglin Wang1, Shaoxiong Feng2, Yiwei Li1, Peiwen Yuan1, Yueqi Zhang1, Boyuan Pan2, Heda Wang2, Yao Hu2, Kan Li1222Corresponding author. 1 School of Computer Science, Beijing Institute of Technology 2 Xiaohongshu Inc {wangxinglin,liyiwei,peiwenyuan, zhangyq, likan}@bit.edu.cn {shaoxiongfeng2023,whd.thu}@gmail.com {panboyuan,xiahou}@xiaohongshu.com
1 Introduction
Large language models (LLMs) have exhibited strong reasoning capabilities (Bubeck et al., 2023), especially with chain-of-thought (CoT) prompting (Wei et al., 2022b). Based on this, Wang et al. (2022) introduced a simple decoding strategy called self-consistency (SC) to further improve reasoning performance, leveraging the fact that challenging reasoning tasks typically require more reasoning paths to arrive at the correct answer. In contrast to the standard chain-of-thought prompting which only generates the greedy one, this method samples multiple reasoning paths according to a preset sample size, and then derives the final answer through majority-voting-based scheme.
Model | Input | Output | ||
---|---|---|---|---|
Token | Cost | Token | Cost | |
GPT-3.5-Turbo | 846.3 | 0.0004 | 163.7 | 0.0002 |
GPT-4 | 846.3 | 0.0254 | 142.1 | 0.0085 |
Despite generally leading to improvements, SC introduces a significant overhead proportional to the number of sampled outputs. As LLMs continue to grow in size and complexity, the sampling time and computational costs associated with majority voting become increasingly challenging. Recently, some works seek to reduce the cost of SC by dynamically adjusting the number of samples based on the posterior distribution of pre-samples. ASC (Aggarwal et al., 2023) samples one by one, and stops sampling when the existing sample answers have established a clear majority as judged by a lightweight stopping criterion. Alternatively, ESC (Li et al., 2023) divides the large preset sample size into several sequential small windows, and stop sampling when answers within a window are all the same.
However, both approaches still suffer from the following two shortcomings: (1) Both require a certain amount of pre-sampling for all problems, which still results in redundant waste. As shown in Figure 1, there is a high overlap of inference results between SC and CoT, which suggests that only a small portion of questions (3.8% for GPT-4) benefit from SC. Generating multiple samples for the remaining questions compared to CoT (only sampling once) results in a significant waste of costs. Although ASC and ESC somewhat reduce the ineffective cost of SC by decreasing average sample size, there remains redundant sampling for simple problems111At least 3 samples per problem for ASC and 4 for ESC.. (2) Multiple re-samples bring additional significant input costs. Both ASC and ESC focus solely on reducing the number of outputs, without considering the extra input costs brought by multiple re-samples (Table 1). Consequently, for problems requiring multiple sampling times, ASC and ESC will introduce substantial extra input costs, sometimes outweighing the savings achieved through output sampling reduction (Table 3).
To alleviate these issues, we take inspiration from the strategies humans employ when solving reasoning problems within a limited total time. As illustrated in Figure 2, humans pre-assess the difficulty of the problems before reasoning, and adaptively allocating problem-solving time based on the accessed difficulty. In light of this, we propose Difficulty-Adaptive Self-Consistency (DSC), a novel method which leverages the difficulty information from both prior and posterior perspectives to further reduce the inference computational cost of SC. As illustrated in Figure 3, DSC consists of three steps. Firstly, we propose Difficulty Ranking algorithm which utilizes the LLM itself to rank the difficulty of the given problem set222Although using LLM for difficulty ranking will introduce additional cost, our later experiments will prove that this part of the cost is totally acceptable.. Based on this, we propose a Problem Partition strategy that divides the problem set into two parts, easy and hard, using the information regarding difficulty rankings. For problems belonging to easy part, a single CoT sampling is performed, which saves cost without sacrificing performance. Lastly, we propose Sample Size Pre-Allocation algorithm to predict the sample size needed for problems belonging to hard part, which reduce the number of re-samples and save the cost of inputs.
We evaluate DSC on a wide range of arithmetic, commonsense and symbolic reasoning tasks over GPT-3.5-Turbo and GPT-4 models. The empirical results demonstrate that DSC outperforms the existing strong baselines ASC and ESC by a significant margin on six popular benchmarks, while attaining comparable performances. Further experiments confirm the effectiveness of the proposed three-step algorithms: Difficulty Ranking, Problem Partition, and Sample Size Pre-Allocation.
In summary, this work includes the following key contributions:
-
•
We analyzed two common issues present in existing SC variants, ESC and ASC, designed for cost-efficient decoding.
-
•
We proposed Difficulty-Adaptive Self-Consistency, a novel method consisting of three steps to fully utilize the difficulty information of given problems so as to reduce the inference computational cost.
-
•
We conducted extensive experiments on six popular benchmarks and validated the effectiveness of our proposed method.
2 Methodology
The core idea behind DSC is to fully utilize the difficulty information of given problems, allocating computational resources based on their respective levels of difficulty. The overall workflow of DSC is presented in Figure 3, including three steps: Difficulty Ranking, Problem Partition and Sample Size Pre-Allocation.
2.1 Difficulty Ranking
As problems usually do not carry difficulty label themselves, we seek to utilize the powerful comparative and ranking capabilities of LLM to obtain the difficulty rank of the given problem set. Considering the limited context window size of LLM and its lost in the middle (Liu et al., 2024) issue in understanding long contexts, we randomly divide the problem set of into batches of size () and let LLM rank the difficulty of the problems in each batch333See Appendix A.1 for corresponding prompts.. Since a single random split only allows us to obtain the difficulty ranking of each problem within the corresponding batch, we perform times random batch splitting, and sort the average difficulty rankings of each problem in different batches to obtain the difficulty ranking of entire problem set. Algorithm 1 illustrates the specific implementation of difficulty ranking.
2.2 Problem Partition
After obtaining the problem set with difficulty ranking, we aim to find which part of the problems only require LLM to perform CoT sampling once. A simple and intuitive idea is that when LLM is very confident about a number of continuous problems sorted by difficulty (the results of all pre-samples are the same), it can be inferred that the problems easier than these problems only need one CoT sampling. Guided by this, we design the Problem Partition algorithm, illustrated in Algorithm 2. Specifically, we pre-sample the problem set sorted by difficulty from hard to easy, and store the entropy of pre-sampling results of each problem in a list. We stop pre-sampling when the latest items in the list are all zero. The remaining problems without pre-sampling are then divided into the easy part, and a single CoT sampling is performed for each.
2.3 Sample Size Pre-Allocation
After performing CoT sampling on the easy part problems, we seek to predict and allocate the sample size for the problems belonging to hard part, so as to mitigate the substantial cost brought by multiple re-samples. Considering that the required sample size for each problem should be similar to those needed for problems of comparable difficulty, we predict the sample size of the current problem based on the total sample size of the nearest easier444As the pre-allocated sample size could be larger than the required sample size (causing higher output costs), we use the sample size of easier rather than harder neighboring problems for prediction. problems. Algorithm 3 shows the workflow of Sample Size Pre-Allocation. Specifically, we sample questions belonging to the hard part from easy to difficult. For the current question to be inferred, we predict its pre-allocation sample size based on the average total sample size of its previous questions. Then, we judge the distribution of the current samples based on the stopping criteria 555Following ASC, we use Dirichlet Stopping Criteria. Please refer to Appendix A.6 for more details.. When the criteria is not met, we re-sample based on the expansion window , until the sampling distribution meets the criteria for stopping sampling or the number of samples reaches the max sample size . After the sampling for the current question ends, we add its samples and total sample size to the and lists respectively, in order to pre-allocate the sample size for the next question.
3 Experiments
3.1 Experimental Setup
3.1.1 Benchmarks
We evaluate the proposed DSC on six benchmark datasets from three categories of reasoning tasks: For arithmetic reasoning, we consider MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021). MultiArith (Roy and Roth, 2015), SVAMP (Patel et al., 2021), AddSub (Hosseini et al., 2014) and ASDiv (Miao et al., 2020) are not chosen in this paper because they are relatively simple. For commonsense reasoning, CommonsenseQA (Talmor et al., 2019) and StrategyQA (Geva et al., 2021) are used. For symbolic reasoning, we use Last Letter Concatenation and Coin Flip from Wei et al. (2022a). The data version is from Kojima et al. (2022).
3.1.2 Baselines
We compare DSC to the following self-consistency methods: (1) SC (Wang et al., 2023) is the standard self-consistency which samples multiple reasoning paths and then derives the final answer through majority-voting; (2) ASC (Aggarwal et al., 2023) samples one by one, and stops sampling when the existing samples meet a designed stopping criteria, which measures the LLM’s confidence in its current samples; (3) ESC (Li et al., 2023) proposes using small window detection to save cost, which divides the large preset sample size into several sequential small windows, and stops sampling when answers within a window are all the same. Specifically, ASC and ESC are two strong baselines for cost-efficient self-consistency methods, we reproduce both methods according to their original implementation.
3.1.3 Implementation details
We perform experiments using two powerful LLMs: GPT-4 (OpenAI, 2023) and GPT-3.5-Turbo666We use the ”2023-05-15” version of API for both.. We calculate the cost based on the price of the API we use. All experiments are conducted in the few-shot setting without training or fine-tuning the language models. To ensure a fair comparison, we use the same prompts as Wei et al. (2022a). Following Li et al. (2023), The sampling temperature for MATH is 0.5 while for other datasets is 0.7. For difficulty ranking, we use CoT sampling. We set the default parameters as follows: batch size as 8 and random split rounds as 5 for Difficulty Ranking; pre-sample size as 4 and judge window size as 32 for Problem Partition; extend window size as 4 and prediction window size as 16 for Sample Size Pre-Allocation; max sample size as 40 for all baselines. All experiments are repeated 100 times and the average performance is reported. Unless otherwise specified, the reported cost of DSC includes the cost brought by all three sub-steps.
Model | Method | MATH | GSM8K | CSQA | SQA | Letter | Coinflip | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cost | Acc | Cost | Acc | Cost | Acc | Cost | Acc | Cost | Acc | Cost | Acc | ||
GPT-4 | SC | 0.4220 | 58.52 | 0.3509 | 95.81 | 0.1280 | 85.82 | 0.1493 | 81.89 | 0.1691 | 94.51 | 0.1528 | 100 |
ASC | 0.5726 | 58.48 | 0.1712 | 95.79 | 0.1639 | 85.82 | 0.1323 | 81.89 | 0.0602 | 94.50 | 0.0657 | 100 | |
ESC | 0.4062 | 58.49 | 0.1014 | 95.80 | 0.0767 | 85.81 | 0.0647 | 81.89 | 0.0375 | 94.51 | 0.0338 | 100 | |
DSC | 0.3142 | 58.51 | 0.0699 | 95.79 | 0.0598 | 85.81 | 0.0516 | 81.88 | 0.0259 | 94.50 | 0.0264 | 100 | |
GPT-3.5-Turbo | SC | 0.0181 | 49.43 | 0.0094 | 83.22 | 0.0030 | 76.80 | 0.0043 | 71.47 | 0.0043 | 80.44 | 0.0037 | 76.60 |
ASC | 0.0193 | 49.39 | 0.0064 | 83.18 | 0.0034 | 76.78 | 0.0032 | 71.47 | 0.0018 | 80.43 | 0.0034 | 76.59 | |
ESC | 0.0172 | 49.41 | 0.0048 | 83.19 | 0.0017 | 76.79 | 0.0019 | 71.47 | 0.0014 | 80.43 | 0.0020 | 76.59 | |
DSC | 0.0148 | 49.42 | 0.0036 | 83.19 | 0.0012 | 76.79 | 0.0015 | 71.46 | 0.0011 | 80.42 | 0.0016 | 76.60 |
Model | Method | 16 | 24 | 32 | 40 | 48 | |||||
---|---|---|---|---|---|---|---|---|---|---|---|
Cost | Acc | Cost | Acc | Cost | Acc | Cost | Acc | Cost | Acc | ||
GPT-4 | SC | 0.1791 - | 57.34 | 0.2602 - | 57.95 | 0.3411 - | 58.27 | 0.4220 - | 58.52 | 0.5030 - | 58.67 |
ASC | 0.3038 | 57.33 | 0.4038 | 57.93 | 0.4923 | 58.23 | 0.5726 | 58.48 | 0.6470 | 58.62 | |
ESC | 0.1870 | 57.34 | 0.2631 | 57.94 | 0.3361 | 58.25 | 0.4062 | 58.49 | 0.4723 | 58.64 | |
DSC | 0.1659 | 57.34 | 0.2192 | 57.94 | 0.2680 | 58.25 | 0.3142 | 58.51 | 0.3585 | 58.66 | |
GPT-3.5-Turbo | SC | 0.0074 - | 47.98 | 0.0110 - | 48.71 | 0.0145 - | 49.13 | 0.0181 - | 49.43 | 0.0216 - | 49.67 |
ASC | 0.0095 | 47.97 | 0.0131 | 48.69 | 0.0163 | 49.10 | 0.0193 | 49.39 | 0.0220 | 49.62 | |
ESC | 0.0075 | 47.98 | 0.0108 | 48.70 | 0.0140 | 49.12 | 0.0172 | 49.41 | 0.0203 | 49.65 | |
DSC | 0.0069 | 47.98 | 0.0097 | 48.71 | 0.0123 | 49.13 | 0.0148 | 49.42 | 0.0171 | 49.66 |
3.2 Main Results
DSC significantly reduces costs while barely affecting performance.
Table 2 summarizes the cost and accuracy of SC, ASC, ESC, and proposed DSC for each dataset. We show that DSC consistently outperforms all baselines on cost by a significant margin across all datasets, while barely affecting performance. Specifically, DSC reduces the cost on GPT-4 by an average of 65.29% and on GPT-3.5-Turbo by 56.04% compared to SC. In comparison to the strongest baseline method ESC, DSC reduces the cost on GPT-4 by an average of 24.81% and on GPT-3.5-Turbo by 21.86%, which demonstrates the effectiveness of DSC.
DSC is scalable across various max sampling size.
We conduct experiments with various max sampling size to validate the scalability of ESC. Table 3 shows the performance across different maximum sampling sizes. First we can see the performance of SC continuously improves as max sampling size increases, which is consistent with the results in (Wang et al., 2023). On this basis, ESC can significantly save costs while maintaining performance for different . On the contrary, due to the substantial additional input cost brought by multiple re-samples, ASC and ESC result in higher cost than SC when the cost savings of output tokens are limited.
Model | Metric | Geometry | Counting & | Prealgebra | Intermediate | Number | Precalculus | Algebra | Avg |
---|---|---|---|---|---|---|---|---|---|
Probability | Algebra | Theory | |||||||
GPT-4 | Spearman | 0.540 | 0.578 | 0.565 | 0.654 | 0.654 | 0.684 | 0.691 | 0.624 |
Pearson | 0.535 | 0.585 | 0.567 | 0.660 | 0.654 | 0.691 | 0.695 | 0.627 | |
Kendall | 0.417 | 0.448 | 0.440 | 0.519 | 0.520 | 0.547 | 0.547 | 0.491 | |
GPT-3.5-Turbo | Spearman | 0.415 | 0.307 | 0.362 | 0.484 | 0.445 | 0.497 | 0.573 | 0.440 |
Pearson | 0.415 | 0.314 | 0.370 | 0.495 | 0.452 | 0.514 | 0.578 | 0.448 | |
Kendall | 0.318 | 0.233 | 0.279 | 0.375 | 0.344 | 0.383 | 0.446 | 0.340 |
Model | Method | MATH | GSM8K | CSQA | SQA | Letter | Coinflip | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Input | Output | Input | Output | Input | Output | Input | Output | Input | Output | Input | Output | ||
GPT-4 | SC | 576 | 6746 | 846 | 5426 | 726 | 1770 | 611 | 2184 | 302 | 2667 | 428 | 2332 |
ASC | 11100 | 3994 | 4266 | 719 | 4805 | 327 | 3726 | 351 | 1397 | 308 | 1722 | 234 | |
ESC | 3574 | 4983 | 1423 | 982 | 1644 | 455 | 1228 | 465 | 454 | 401 | 540 | 294 | |
DSC | 834 | 4606 | 975 | 495 | 978 | 271 | 822 | 310 | 314 | 113 | 428 | 58 | |
w/o step 2 | 847 | 4701 | 1035 | 755 | 1081 | 370 | 873 | 381 | 353 | 313 | 431 | 235 | |
w/o step 3 | 2896 | 4142 | 1033 | 479 | 1068 | 253 | 903 | 290 | 328 | 108 | 428 | 58 | |
GPT-3.5-Turbo | SC | 576 | 11851 | 846 | 6005 | 726 | 1749 | 611 | 2644 | 302 | 2735 | 428 | 2316 |
ASC | 13421 | 8380 | 8075 | 1577 | 5633 | 365 | 4698 | 535 | 2150 | 487 | 4932 | 624 | |
ESC | 4223 | 10045 | 2889 | 2273 | 1946 | 506 | 1599 | 731 | 742 | 673 | 1613 | 819 | |
DSC | 793 | 9456 | 1410 | 1809 | 1140 | 376 | 985 | 591 | 470 | 535 | 704 | 736 | |
w/o step 2 | 799 | 9610 | 1426 | 1879 | 1172 | 401 | 997 | 605 | 482 | 584 | 707 | 761 | |
w/o step 3 | 3485 | 8634 | 2144 | 1679 | 1467 | 362 | 1238 | 560 | 568 | 513 | 1290 | 648 |
Task | MATH | GSM8K | CSQA | SQA | Letter | Coinflip |
---|---|---|---|---|---|---|
Ranking | 403 | 364 | 261 | 130 | 161 | 308 |
Reasoning | 576 | 846 | 726 | 611 | 302 | 428 |
Model | Metric | Unmixed | Mixed |
---|---|---|---|
GPT-4 | Spearman | 0.6451 | 0.5148 |
Pearson | 0.6487 | 0.5197 | |
Kendall | 0.5060 | 0.3931 | |
GPT-3.5-Turbo | Spearman | 0.4828 | 0.4070 |
Pearson | 0.4887 | 0.4145 | |
Kendall | 0.3697 | 0.3073 |
3.3 Analysis of Three Sub-steps
To further validate the effectiveness of proposed three sub-steps, including Difficulty Ranking, Problem Partition, and Sample Size Pre-Allocation, we conduct experimental analyses on each of them respectively.
3.3.1 Difficulty Ranking
Difficulty Ranking exhibits a good correlation with humans.
As shown in Table 4, we calculate Spearman, Pearson, and Kendall correlation coefficients on seven subsets of MATH benchmark. Overall, GPT-4 demonstrates a high consistency with human labels, while the weaker GPT-3.5-Turbo shows a moderate consistency. Specifically, both GPT-4 and GPT-3.5-Turbo rank weakly on Geometry difficulty, which may be due to the fact that text LLMs cannot intuitively assess the difficulty of geometry problems through visual information like humans can.
Difficulty Ranking produces an acceptable cost.
As we instruct LLM to produce extremely concise outputs for Difficulty Ranking, the cost of output tokens brought by Difficulty ranking is very low777With an average consumption of 15 tokens for each question.. Table 6 gives comparison of input tokens brought by Difficulty Ranking and few-shot reasoning across six datasets. We show that the average input cost for each problem’s difficulty ranking is less than the average single input cost for its few-shot reasoning. Considering that ESC and ASC require multiple re-samples for reasoning (multiple few-shot inputs), this additional input cost of Difficulty Ranking is totally acceptable.
Mixing types of questions lowers the effectiveness of Difficulty Ranking.
To explore whether Difficulty Ranking can effectively rank problems of different types in terms of difficulty, we compare the performance of Difficulty Ranking under mixed and unmixed problem types on MATH, as shown in Table 7. Specifically, for unmixed, we perform Difficulty Ranking on seven subsets separately; for mixed, we directly execute Difficulty Ranking on random shuffled MATH dataset, and calculate the correlation on the entire dataset for both. The results indicate that mixing different types of questions for LLM to rank can lead to a decrease in performance. This might be due to the challenge for LLM in adhering to the same assessment standards for various types of questions.888It could also lead to inaccurate ranking for humans.
3.3.2 Problem Partition
Problem Partition proves to be effective across all datasets.
As shown in Table 5, we conduct an ablation study on Problem Partition (step 2). We show that the removal of Problem Partition results in an increase in both input and output tokens of DSC across all six datasets, which validates the effectiveness and generalizability of Problem Partition. Furthermore, we find that the removal of Problem Partition has a small impact on output tokens when it comes to MATH dataset, which could be due to the fact that for datasets with overall higher difficulty, the easy part problem (only require one CoT sample) for LLM is fewer.
Judge window size is essential for the accuracy of DSC.
Regarding the intuitive idea proposed in Section 2.2, a straightforward question is how large the judge window size needs to be to ensure the accuracy on simpler problems with single CoT sampling. To answer this question, we conduct an experiment of DSC under different judge window size with GPT-4 on GSM8K. As shown in Figure 4, the accuracy increases with the rise of until it reaches 32. Meanwhile, the cost of DSC also increases with the enlargement of , as the easy part divided by the problem partition decreases with its increase. Therefore, it is a good choice to set a smaller judge window size while ensuring accuracy.999We select 32 as the default value for .
3.3.3 Sample Size Pre-Allocation
Sample Size Pre-Allocation significantly alleviates the substantial input token costs brought by multiple re-samples.
We conduct an ablation study of proposed Sample Size Pre-Allocation (step 3) to validate its effectiveness. As shown in Table 5, with the removal of Sample Size Pre-Allocation, the count of input tokens on DSC rose on all datasets on GPT-3.5-Turbo and GPT-4 (with the exception of GPT-4 on Coinflip), validating the effectiveness and generalizability of Sample Size Pre-Allocation. Moreover, we notice that the input tokens of ASC and ESC far exceed that of SC, which once again confirms that multiple re-samples will bring a large amount of additional overhead. Meanwhile, DSC significantly reduced the number of input tokens through Pre-Allocation, keeping it comparable to SC. In addition, we find that Sample Size Pre-Allocation leads to a slight increase in output tokens (due to the predicted sample size of some questions exceeding the actual requirement). Overall, we believe it is a good choice to trade a small number of output tokens for a significant saving in input tokens.
Prediction window size is the key to trade output tokens for more savings in input tokens.
Given that Sample Size Pre-Allocation can trade output tokens for savings in input tokens, a natural question is when the max total savings can be achieved. As shown in Figure 5, we count the tokens of input and output and the corresponding cost of DSC under different prediction window size . The experimental results show that the cost of DSC decreases rapidly first and then slighly increases with the increase of , and the input tokens also follow a similar trend. This is because as increases, the sample size of the current question is predicted by more simpler questions. When the value of is large enough, the predicted sample size will gradually be less than its actual value, leading to the need for more re-samples. In contrast, the tokens of the output part gradually decrease, which is due to as m increases, the predicted sample size will continue to decrease, leading to a gradual decrease in the situation where the predicted sample size is greater than its actual need, saving the output tokens from redundant sampling. In summary, it is suggested to select within a relatively moderate range (e.g. from 8 to 32).
4 Related Work
Chain-of-thought Reasoning
Chain-of-thought prompting has been proven to be an effective method of solving complex reasoning problems (Wei et al., 2022a). By following the pattern of gradually solving sub-problems, few-shot CoT (Fu et al., 2023) are capable of stimulating LLM reasoning abilities. On this basis, Least-to-most prompting (Zhou et al., 2023) suggests explicitly splitting the problem and solving them step by step. Zheng et al. (2023) reaches the final answer by iteratively generating answers and using the previously generated answers as context hints.
Self-Consistency
Self-consistency (Wang et al., 2023) refers to a simple decoding strategy for further improving reasoning performance, leveraging the fact that complex reasoning tasks typically allow for more than one correct reasoning path. Li et al. (2024) assign appropriate weights for answer aggregation to achieve adaptive self-consistency. Jain et al. (2023) and Wang et al. (2024) extend it for open-ended generation tasks like code generation and text summarization. However, they require multiple sampling with the pre-set size, which will incur much more computation cost. To realize cost-efficient self-consistency, Aggarwal et al. (2023) introduce an adaptive stopping criterion based on the amount of agreement between the samples so far. Li et al. (2023) divides the large preset sample size into several sequential small windows, and stop sampling when answers within a window are all the same.
5 Conclusion
In this work, we propose a novel cost-efficient decoding method called Difficulty-Adaptive Self-Consistency (DSC), which fully leverages the difficulty information of given problem set to allocate computational resources based on their respective levels of difficulty, so as to alleviate issues of redundant sampling on simple questions and multiple re-sampling limitations that current cost-efficient self-consistency methods face. Extensive experiments show that DSC consistently surpasses two strong cost-effcient self-consistency baselines ASC and ESC by a significant margin on six popular benchmarks, while attaining comparable accuracy. Additional experiments confirm the effectiveness of the proposed three-step algorithms: Difficulty Ranking, Problem Partition, and Sample Size Pre-Allocation.
Limitations
Despite the remarkable efficiency gain on variety of reasoning tasks, the current implementation of DSC still suffers from the following two limitations:
-
•
As illustrated in analysis of Difficulty Ranking, DSC is difficult to apply directly in scenarios that require real-time reasoning for mixed-type problems. Further classification of problems by type or further optimization of the current Difficulty Ranking algorithm is needed.
-
•
Given that DSC demands an awareness of the test set to rank samples based on their difficulty, its use could be restricted in scenarios where a single user is only permitted one input at a time. Nevertheless, the application scenarios of DSC include but are not limited to: (1) The server end (such as OpenAI company, etc) simultaneously receives a large number of query requests from users. (2) A user possesses a batch of data that requires a one-time inference.
References
- Aggarwal et al. (2023) Pranjal Aggarwal, Aman Madaan, Yiming Yang, et al. 2023. Let’s sample step by step: Adaptive-consistency for efficient reasoning and coding with llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12375–12396.
- Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. CoRR, abs/2110.14168.
- Fu et al. (2023) Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. 2023. Specializing smaller language models towards multi-step reasoning. CoRR, abs/2301.12726.
- Geva et al. (2021) Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Trans. Assoc. Comput. Linguistics, 9:346–361.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.
- Hosseini et al. (2014) Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 523–533. ACL.
- Jain et al. (2023) Siddhartha Jain, Xiaofei Ma, Anoop Deoras, and Bing Xiang. 2023. Self-consistency for open-ended generations. CoRR, abs/2307.06857.
- Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
- Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In NeurIPS.
- Li et al. (2024) Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Bin Sun, Xinglin Wang, Heda Wang, and Kan Li. 2024. Turning dust into gold: Distilling complex reasoning capabilities from llms by leveraging negative data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18591–18599.
- Li et al. (2023) Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li. 2023. Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. In The Twelfth International Conference on Learning Representations.
- Liu et al. (2024) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 11:157–173.
- Miao et al. (2020) Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing english math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 975–984. Association for Computational Linguistics.
- OpenAI (2023) OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
- Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 2080–2094. Association for Computational Linguistics.
- Roy and Roth (2015) Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1743–1752. The Association for Computational Linguistics.
- Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4149–4158. Association for Computational Linguistics.
- Wang et al. (2024) Xinglin Wang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Boyuan Pan, Heda Wang, Yao Hu, and Kan Li. 2024. Integrate the essence and eliminate the dross: Fine-grained self-consistency for free-form language generation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11782–11794.
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
- Wang et al. (2023) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Wei et al. (2022a) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022a. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.
- Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
- Zheng et al. (2023) Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. Progressive-hint prompting improves reasoning in large language models. CoRR, abs/2304.09797.
- Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
Appendix A Appendix
A.1 Prompt for Difficulty Ranking
Your task is to rank the given questions from easy to hard based on their difficulty level. Questions to be evaluated:
Q1:{Question 1}
Q2:{Question 2}
…
Qn: {Question n}
The output format should be a comma-separated list containing the Qnumber of corresponding question. Do not give any explanation.
Difficulty Ranking result (from easy to hard):
Method | Input tokens | Output tokens | Cost | Acc |
---|---|---|---|---|
SC | 846 | 6045 | 0.3881 | 73.19 |
ASC | 15982 | 3265 | 0.6754 | 73.14 |
ESC | 4458 | 4415 | 0.3986 | 73.18 |
DSC | 1617 | 3857 | 0.2799 | 73.18 |
A.2 Generalizability of DSC On Small Open-source Language Model
To further explore the effectiveness of DSC on small open-source models, we conduct experiments on the open-source model Mistral-7B-Instruct-v0.3 (Jiang et al., 2023) using GSM8K dataset. As shown in Table 8, we convert the token cost into price according to that of GPT-4 for a simpler comparison and keep the other settings completely consistent with the main experiment (Please note that for DSC, the cost of step 1 and step 2 is included.). The experimental results indicate that DSC has the potential to work effectively on smaller models.
A.3 Comparison of Inference Time Between Different Methods
Considering the inference time is important for real world scenarios, we calculate the inference time of different methods on the MATH test set (5000 questions) with GPT-4. As shown in Table 9, for DSC, the ranking time corresponds to the time produced by step 1, while sampling time corresponds to the time generated by step 2 and step 3. The results show that, compared to SC, both ASC and ESC require significantly more time due to the need for a large amount of resampling. DSC, on the other hand, effectively mitigates this issue by pre-allocating the sample size for questions.
A.4 Comparison with ESC under Different Window Size
To further demonstrate the effectiveness of DSC compared to ESC, we make a comparison between the performance of DSC and ESC under different window sizes on the MATH dataset using GPT-4, while maintaining the rest of the settings completely consistent with the main experiment. As shown in Table 10, when the window size is greater than or equal to 5, ESC can maintain its Accuracy and its cost increases as the window size enlarges. However, when the window size is less than 5, the Accuracy of ESC drops significantly due to excessively relaxed constraints. Overall, the performance of DSC consistently surpasses that of ESC by a large margin.
A.5 Hyperparameter Experiment of Difficulty Ranking
As shown in Figure 6, we conduct hyperparameter experiments on the proposed Difficulty Ranking algorithm. The experimental results indicate that a large batch size () leads to a decrease in LLM ranking performance, which is consistent with our expectations. For both GPT-3.5-Turbo and GPT-4, difficulty ranking approximately converges in the 5th round. Therefore, we choose 5 and 8 as the default values for iteration and batch size for Difficulty Ranking, respectively.
Time (hours) | Ranking Time | Sampling Time | Total Time |
---|---|---|---|
SC | 0 | 11.75 | 11.75 |
ASC | 0 | 221 | 221 |
ESC | 0 | 57.26 | 57.26 |
DSC | 2.26 | 17.08 | 19.34 |
Window | 3 | 4 | 5 | 6 | 7 | DSC |
---|---|---|---|---|---|---|
Acc | 58.10 | 58.40 | 58.49 | 58.51 | 58.51 | 58.51 |
Cost | 0.3505 | 0.3905 | 0.4062 | 0.4135 | 0.4173 | 0.3142 |
A.6 Background of ESC and ASC
ESC
ESC proposes extension window sampling, where the extension window is a fixed preset value , Each time, entries are sampled through the LLM and added to the sampling set. If the answers of the current samples are the same or the preset maximum sampling value is reached, the sampling is stopped. The core idea of ESC is that if the model’s one-time sample answers are completely the same, it can be considered that it has a high confidence in this answer, and sampling can be stopped.
ASC
ASC proposes the Dirichlet Stopping Criteria, where sampling is conducted one by one. After each sampling, the Dirichlet Stopping Criteria is used to determine whether all current samples meet a specific distribution. If they do, the sampling is stopped. If not, the one-by-one sampling continues until the preset maximum sampling value is reached. The Dirichlet Stopping Criteria is shown in Equation 1, where represents the current set of samples, is the number of , is the preset threshold (set to 0.95 for ASC in default), and is the probability of the most frequently occurring answer in the set . The core idea of ASC is to measure the two answers with the highest probability in the current set of samples. If the difference between the probability of the answer with the highest probability and the second highest probability exceeds a threshold (), it can be considered that the model is very confident in the answer with the highest probability, and sampling can be stopped.
(1) |