Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model

Xunyu Zhu^1,2, Jian Li^1,2, Yong Liu³, Can Ma^1,2, Weiping Wang^1,2
¹Institute of Information Engineering, Chinese Academy of Sciences
²School of Cyber Security, University of Chinese Academy of Sciences
³Gaoling School of Artificial Intelligence, Renmin University of China
{zhuxunyu, lijian9026, macan, wangweiping}@iie.ac.cn, [email protected]
Corresponding author

Abstract

Large Language Models (LLMs) have demonstrated exceptional proficiency in mathematical reasoning tasks due to their extensive parameter counts and training on vast datasets. Despite these capabilities, deploying LLMs is hindered by their computational demands. Distilling LLM mathematical reasoning into Smaller Language Models (SLMs) has emerged as a solution to this challenge, although these smaller models often suffer from errors in calculation and semantic understanding. Prior work has proposed Program-of-Thought Distillation (PoTD) to avoid calculation error. To further address semantic understanding errors, we propose Key-Point-Driven Mathematical Reasoning Distillation (KPDD). KPDD enhances the reasoning performance of SLMs by breaking down the problem-solving process into three stages: Core Question Extraction, Problem-Solving Information Extraction, and Step-by-Step Solution. This method is further divided into KPDD-CoT, which generates Chain-of-Thought rationales, and KPDD-PoT, which creates Program-of-Thought rationales. The experiment results show that KPDD-CoT significantly improves reasoning abilities, while KPDD-PoT achieves state-of-the-art performance in mathematical reasoning tasks. Our approach effectively mitigates misunderstanding errors, advancing the deployment of efficient and capable SLMs.

1 Introduction

Large language models (LLMs) based on Transformer architectures represent a significant advancement in natural language processing. Notable models such as LLaMA (Touvron et al., 2023a), GPT-4 (OpenAI, 2023), and PaLM (Chowdhery et al., 2023) feature hundreds of billions of parameters. Trained on extensive text datasets, these models exhibit exceptional proficiency across a broad range of downstream tasks.

Recent studies (Chen et al., 2023; Wang et al., 2023b, a; Liu et al., 2023) have enhanced the mathematical reasoning abilities of LLMs through Chain-of-Thought (CoT) prompting, which generates intermediate steps to solve complex problems. However, deploying these models remains challenging due to their size and computational demands. For instance, the GPT-3 model (Brown et al., 2020) requires at least 350GB of FP16 storage and multiple A100 GPUs with 80GB of memory each for efficient inference.

Figure 1: Error analysis of 50 GSM8K problems with incorrect answers returned by CoTD using FlanT5-Base. The experimental results indicate that multiple errors may exist in the reasoning process of CoTD, with understanding errors and calculation errors being the major factors affecting CoTD’s reasoning performance.

Recent work (Magister et al., 2023; Shridhar et al., 2023; Ho et al., 2023; Fu et al., 2023) investigates distilling LLM reasoning into SLMs (under 1B parameters) for broader deployment. This involves using LLMs to create enriched datasets with detailed reasoning paths, which are then used to fine-tune SLMs, endowing them with advanced reasoning abilities. For example, Chain-of-Thought Distillation (CoTD) (Ho et al., 2023) encapsulates the reasoning process into textual rationales. However, there is a significant performance gap between SLMs and LLMs. Prior work (Wei et al., 2022) identifies three main error types in CoT reasoning: (1) Calculation errors: Incorrect calculations leading to wrong answers. (2) Missing-step errors: Omissions of intermediate reasoning steps, especially in multi-step problems. (3) Semantic misunderstanding errors: Errors in understanding the problem or maintaining coherent reasoning, often due to insufficient model capability. To explore the reasons for the performance gap between SLMs and LLMs, we conducted the same error analysis on CoTD. Our preliminary experiments (shown in Figure 1) reveal numerous error combinations in CoTD, with calculation and semantic misunderstanding errors being the most prevalent. Zhu et al. (2023a) proposed Program-of-Thought Distillation (PoTD) to mitigate calculation errors by formulating the reasoning process as a Python program executed by an external interpreter. This approach allows the SLM to focus on generating the program, avoiding calculation errors and improving reasoning performance. Given these circumstances, our paper focuses on addressing semantic misunderstanding errors in CoTD to further enhance the reasoning performance of SLMs.

In our paper, we propose a novel mathematical reasoning distillation method called Key-Point-Driven Mathematical Reasoning Distillation (KPDD) to enhance the mathematical reasoning performance of SLMs. KPDD breaks the reasoning process into three parts: (1) Core Question Extraction: Identifies the core question from the original problem. (2) Problem-Solving Information Extraction: Extracts relevant data and information needed to solve the problem. (3) Step-by-Step Solution: Uses the extracted key points to solve the problem in a step-by-step manner. The third part is further divided into two formats, KPDD-CoT and KPDD-PoT: (1) KPDD-CoT: Generates rationales in the form of Chain-of-Thought (CoT). This method focuses on reducing misunderstanding errors and explicitly illustrates the reasoning process, aiding in error analysis. (2) KPDD-PoT: Generates rationales in the form of Program-of-Thought (PoT). This approach not only reduces misunderstanding errors but also avoids calculation errors, further enhancing the SLM’s mathematical reasoning performance.

We assessed KPDD across FlanT5 models from Small (0.06B) to Large (0.76B) on four mathematical reasoning datasets. The results show that KPDD-CoT significantly enhances SLMs’ reasoning abilities, while KPDD-PoT enables SLMs to achieve state-of-the-art (SOTA) mathematical reasoning performance. For instance, with KPDD-CoT, FlanT5-Large achieved an average accuracy of 24.71% on these datasets, and KPDD-PoT elevated FlanT5-Large to an average accuracy of 63.83%. Furthermore, our error analysis on KPDD confirms that KPDD effectively mitigates misunderstanding errors, thereby improving the mathematical reasoning performance of SLMs.

Our contributions are summarized as follows:

1. Our study reveals that misunderstanding errors and calculation errors are the major factors limiting CoTD’s reasoning.

2. We propose Key-Point-Driven Mathematical Reasoning Distillation (KPDD) to alleviate misunderstanding errors and effectively improve the reasoning performance of SLMs.

3. Extensive experiments show that KPDD outperforms other methods across various benchmarks and achieves new state-of-the-art results on these mathematical reasoning datasets.

2 Related Work

2.1 Mathematical Reasoning

Mathematical reasoning tasks, exemplified by benchmarks such as GSM8K (Cobbe et al., 2021) and SVAMP (Patel et al., 2021), present a substantial challenge for LLMs. To enhance LLMs’ performance in this domain, researchers have identified two primary strategies.

Chain-of-Thought Reasoning. LLMs’ reasoning ability can be enhanced by prompting them to articulate intermediate steps towards a solution, as demonstrated by Wei et al. (2022). This insight has spurred various advancements (Chen et al., 2023; Wang et al., 2023b, a; Liu et al., 2023) that refine reasoning paths: Chen et al. (2023) prompt LLMs to generate executable code; Wang et al. (2023b) use multiple reasoning paths with a voting mechanism; Wang et al. (2023a) have LLMs create a plan before reasoning; Liu et al. (2023) employ diverse reasoning prompts for problem-solving; Zhong et al. (2024) encourage LLMs to deeply understand problems and leverage key information for better reasoning. Building on these methods, our work introduces Key-Point-Driven Mathematical Reasoning Distillation (KPDD) to further enhance SLMs’ mathematical reasoning.

Finetuning-based Reasoning. This line of work refines LLMs such as Llama2 (Touvron et al., 2023b), Qwen (Bai et al., 2023), and Baichuan2 (Yang et al., 2023) by integrating techniques from advanced models like GPT-4 (OpenAI, 2023) and PaLM-2 (Anil et al., 2023). Notably, Yuan et al. (2023) utilize Rejection Sampling Fine-Tuning (RFT) to enhance LLMs’ mathematical reasoning, while WizardMath (Luo et al., 2023) employs Reinforcement Learning from Evolved Instructions Feedback (RLEIF) to improve LLaMA-2’s reasoning abilities. MAmmoTH (Yue et al., 2023) combines Chain-of-Thought (CoT) and Program-of-Thought (PoT) rationales for more effective instruction-tuning of LLMs in math problem-solving. Despite their effectiveness, the large model sizes of these LLMs continue to limit their deployment efficiency.

2.2 Knowledge Distillation

Knowledge Distillation optimizes LLMs for practical use by transferring knowledge from larger models to smaller, efficient ones (Zhu et al., 2023b). Recent research (Magister et al., 2023; Shridhar et al., 2023; Ho et al., 2023; Fu et al., 2023) focuses on endowing compact models ($\leq$ 1B parameters) like T5 (Raffel et al., 2020) and GPT-2 (Radford et al., 2019) with advanced reasoning capabilities from LLMs such as GPT-4 (OpenAI, 2023) and PaLM-2 (Anil et al., 2023). For instance, Ho et al. (2023) fine-tune student models using accurate reasoning paths from LLMs, Shridhar et al. (2023) train dual-model systems on sub-questions and solutions, and Fu et al. (2023) propose scaling down general competencies of smaller models to enhance task-specific performance. Our work introduces a novel distillation approach where two SLMs independently extract the core question and key problem-solving information from an original question. These key points are then utilized to guide another SLM in solving the original question effectively.

3 Method

In this work, we introduce a novel distillation method for mathematical reasoning tasks called Key-Point-Driven Distillation (KPDD), structured into three stages: (1) Stage 1: KPDD distills the first SLM to extract the core question from the original question. (2) Stage 2: KPDD distills the second SLM to extract problem-solving information from the original question. (3) Stage 3: KPDD distills the third SLM to solve the original problem using the core question and problem-solving information. In Stage 3, we prompt the LLM to construct two types of reasoning datasets: (1) CoT Rationales: These are more comprehensible to both humans and LLMs, showcasing a detailed reasoning process. (2) PoT Rationales: These rationales delegate computational tasks to an external Python interpreter, thereby avoiding calculation errors.

3.1 Data Generation from LLMs

Our KPDD method begins by creating a mathematical reasoning dataset from LLMs, which is then used for SLM fine-tuning. In our paper, we use in-context learning (Dong et al., 2023; Min et al., 2022; Rubin et al., 2022) to prompt LLMs for constructing the reasoning dataset. Furthermore, in stage 3, our KPDD method employs two distillation approaches: one distills the SLM to generate CoT rationales for problem-solving, and the other distills the SLM to generate PoT rationales for problem-solving. In other words, our KPDD method can be divided into two approaches: KPDD-CoT and KPDD-PoT.

3.1.1 Data Generation for KPDD-CoT

Given a mathematical dataset $\mathcal{D}$ , each entry $(x,y)$ pairs a question $x$ with its answer $y$ . As illustrated in Figure 2, we select $k$ samples $\{(x_{1},y_{1}),\ldots,(x_{k},y_{k})\}$ from $\mathcal{D}$ and manually craft reasoning processes. Each reasoning process $c$ includes a core question, problem-solving information, and rationales in CoT format. These elements are separated by HTML tags: "<core>{core question}</core><info>{problem-solving information}</info><cot>{rationales in CoT format}</cot>". These form contextualized instances $\{(x_{1},c_{1},y_{1}),\ldots,(x_{k},c_{k},y_{k})\}$ , compiled into a demonstration set $\mathcal{D}_{c}$ . We then prompt LLMs with the demonstration set $\mathcal{D}_{c}$ , a new question, and the instruction "Firstly, let’s extract the most comprehensive and detailed key question. Then, let’s identify and list the most useful information related to the question. Finally, let’s understand the key question and the problem-solving information, solve the question step by step, and show the answer." to generate the reasoning process for the new question. The KPDD-CoT dataset generation is formalized as:

$$c_{i}=f_{\mathcal{M}}(x_{i},\mathcal{D}_{c}), \tag{1}$$

where $\mathcal{M}$ denotes the LLM, $f$ is the decoding function, and $i$ denotes the index in $\mathcal{D}$ . This yields the KPDD-CoT dataset $\mathcal{D}_{C}$ , composed of triplets $(x,c,y)$ .
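The prompting and tagging scheme described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the "Question:/Answer:" layout, the function names `build_prompt` and `parse_reasoning`, and the choice to return `None` for malformed output are all assumptions. Only the instruction text and the `<core>`, `<info>`, `<cot>` tags come from the description above.

```python
import re
from typing import Dict, List, Optional, Tuple

# Instruction text quoted from the paper (straight apostrophes used here).
INSTRUCTION = (
    "Firstly, let's extract the most comprehensive and detailed key question. "
    "Then, let's identify and list the most useful information related to the question. "
    "Finally, let's understand the key question and the problem-solving information, "
    "solve the question step by step, and show the answer."
)


def build_prompt(demos: List[Tuple[str, str]], question: str) -> str:
    """Join k (question, reasoning process) demonstrations with a new question.

    The "Question:/Answer:" layout is a hypothetical template.
    """
    parts = [f"Question: {x}\n{INSTRUCTION}\nAnswer: {c}" for x, c in demos]
    parts.append(f"Question: {question}\n{INSTRUCTION}\nAnswer:")
    return "\n\n".join(parts)


# Matches the three tagged fields in the order given in the paper.
TAG_PATTERN = re.compile(
    r"<core>(?P<core>.*?)</core>\s*"
    r"<info>(?P<info>.*?)</info>\s*"
    r"<cot>(?P<cot>.*?)</cot>",
    re.DOTALL,
)


def parse_reasoning(text: str) -> Optional[Dict[str, str]]:
    """Split an LLM response into core question, information, and CoT rationale."""
    match = TAG_PATTERN.search(text)
    if match is None:
        return None  # malformed output; a caller could drop such samples
    return {name: value.strip() for name, value in match.groupdict().items()}
```

Parsing the response into separate fields at generation time makes it straightforward to build the per-stage training subsets later on.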

Data Filtering. After generating the KPDD-CoT dataset with LLMs, we validate each reasoning process against the gold-standard answer to ensure the quality of the reasoning dataset $\mathcal{D}_{C}$. Entries whose reasoning process disagrees with the gold-standard answer are excluded from $\mathcal{D}_{C}$. Removing these incorrect examples makes the training data more accurate and reliable, which in turn improves the performance of the fine-tuned SLMs on mathematical reasoning tasks.
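One way to implement this filter is sketched below. The paper does not specify how the predicted answer is read from a CoT rationale, so this sketch assumes the last number in the rationale is the answer. The helper names, the record keys `"cot"` and `"y"`, and the tolerance are hypothetical.

```python
import re
from typing import Any, Dict, List, Optional

# A number with optional sign, thousands separators, and decimal part.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(rationale: str) -> Optional[float]:
    """Return the last number in the rationale (assumed to be the answer)."""
    matches = NUMBER.findall(rationale)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def keep_example(rationale: str, gold: float, tol: float = 1e-4) -> bool:
    """Keep a sample only if its final number agrees with the gold answer."""
    predicted = final_number(rationale)
    return predicted is not None and abs(predicted - gold) <= tol


def filter_dataset(examples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop every record whose CoT rationale disagrees with its gold answer."""
    return [ex for ex in examples if keep_example(str(ex["cot"]), float(ex["y"]))]
```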

3.1.2 Data Generation for KPDD-PoT

Similar to KPDD-CoT, the initial phase of KPDD-PoT creates a dataset from LLMs, setting the stage for SLM fine-tuning. For KPDD-PoT dataset generation, we also choose $k$ samples $\{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{k},y_{k})\}$ from $\mathcal{D}$ and manually create reasoning processes $p$ , where each reasoning process includes a core question, problem-solving information, and rationales in PoT format. These elements are also separated by HTML tags: "<core>{core question}</core><info>{problem-solving information}</info><pot>{rationales in PoT format}</pot>". These form contextualized instances $\{(x_{1},p_{1},y_{1}),(x_{2},p_{2},y_{2}),\ldots,(x_{k},p_{k},y_{k})\}$ , which are compiled into a demonstration set $\mathcal{D}_{p}$ . We then prompt the LLM with the demonstration set $\mathcal{D}_{p}$ , a new question, and the instruction "Firstly, let’s extract the most comprehensive and detailed key question. Then, let’s identify and list the most useful information related to the question. Finally, let’s understand the key question and the problem-solving information, and generate the python code (return ans) to solve the question." to generate a reasoning process for the new question. Figure 3 outlines this data generation process, and the KPDD-PoT dataset generation is formalized as:

$$p_{i}=f_{\mathcal{M}}(x_{i},\mathcal{D}_{p}), \tag{2}$$

where $\mathcal{M}$ is the LLM, $f$ denotes the greedy decoding function, and $i$ is the index of the instance $(x,y)$ in $\mathcal{D}$ . This yields a KPDD-PoT dataset $\mathcal{D}_{P}$ , organized as triplets $(x,p,y)$ .

Data Filtering. Following KPDD-PoT dataset generation by LLMs, each program in the reasoning process is validated with an external Python interpreter to ensure the quality of the initial dataset $\mathcal{D}_{P}$ . Programs that fail to compile or produce incorrect results are discarded, which removes flawed instances and improves the dataset’s quality.
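The interpreter-based check can be sketched as follows. This is an illustrative sketch under stated assumptions: it assumes each generated program leaves its result in a variable named `ans` (suggested by the "(return ans)" hint in the instruction, but not confirmed as the exact program shape), and the helper names and tolerance are hypothetical. It uses the built-in `exec` with no sandbox or timeout, so it should not be run on untrusted programs as-is.

```python
from typing import Any, Dict, Optional


def run_program(program: str) -> Optional[Any]:
    """Execute a PoT rationale and return the value bound to `ans`, if any.

    Assumption: the program stores its final result in a variable `ans`.
    """
    namespace: Dict[str, Any] = {}
    try:
        exec(program, namespace)  # SyntaxError and runtime errors are both caught
    except Exception:
        return None
    return namespace.get("ans")


def keep_program(program: str, gold: float, tol: float = 1e-4) -> bool:
    """Keep a program only if it runs and its `ans` matches the gold answer."""
    ans = run_program(program)
    if ans is None:
        return False
    try:
        return abs(float(ans) - gold) <= tol
    except (TypeError, ValueError):
        return False  # `ans` is not numeric
```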

3.2 Fine-tuning SLMs

After constructing these reasoning datasets, we use them to fine-tune the SLMs. In KPDD, we fine-tune three SLMs: the first, called KPDD-CoT/PoT-core, extracts the core question from the original problem; the second, called KPDD-CoT/PoT-info, extracts the problem-solving information; and the third, called KPDD-CoT/PoT-solve, uses both the core question and the problem-solving information to solve the original question.

3.2.1 Fine-tuning SLMs for KPDD-CoT

Firstly, we construct a core question subset from the KPDD-CoT dataset, denoted as $\mathcal{D}_{CC}$ . Each sample in this subset can be represented as $(x,cc)$ , where $x$ represents the original question and $cc$ represents the core question. For each training instance $(x,cc)$ from $\mathcal{D}_{CC}$ , we prepend the prompt $p_{cc}$ "Let’s extract the most comprehensive and detailed core question." to the question $x$ . This guides the KPDD-CoT-core in fine-tuning to accurately extract the corresponding core question $cc$ . The fine-tuning loss function can be represented as follows:

$$\mathcal{L}=-\sum_{i=1}^{N}\sum_{t=1}^{T}\log P({cc}^{i}_{t}\mid{cc}^{i}_{<t},x^{i},p_{cc}), \tag{3}$$

where $N$ is the number of examples in $\mathcal{D}_{CC}$ , $p_{cc}$ is the prompt, and ${cc}_{:T}$ is the sequence of the core question.
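As a small worked example of this objective, the sketch below evaluates the summed negative log-likelihood from per-token probabilities. The function name and the toy probabilities are illustrative assumptions. In practice the probabilities come from the SLM's softmax output, and the sequence length $T$ varies per example.

```python
import math
from typing import List


def sequence_nll(token_probs: List[List[float]]) -> float:
    """Summed negative log-likelihood over N target sequences.

    token_probs[i][t] stands for P(cc^i_t | cc^i_<t, x^i, p_cc), i.e. the
    probability the model assigns to the correct t-th token of example i.
    """
    return -sum(math.log(p) for sequence in token_probs for p in sequence)


# Toy batch: two examples with 2 and 1 target tokens respectively.
loss = sequence_nll([[0.5, 0.5], [0.25]])
# -(ln 0.5 + ln 0.5 + ln 0.25) = ln 16
```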

Then, we construct a problem-solving information subset from the KPDD-CoT dataset, denoted as $\mathcal{D}_{CI}$ . Each sample in this subset can be represented as $(x,ci)$ , where $x$ represents the original question and $ci$ represents the problem-solving information. For each training instance $(x,ci)$ from $\mathcal{D}_{CI}$ , we prepend the prompt $p_{ci}$ "Let’s identify and list the most useful information related to the question." to the question $x$ . This guides the KPDD-CoT-info in fine-tuning to accurately extract the corresponding problem-solving information $ci$ . The fine-tuning loss function can be represented as follows:

$$\mathcal{L}=-\sum_{i=1}^{N}\sum_{t=1}^{T}\log P({ci}^{i}_{t}\mid{ci}^{i}_{<t},x^{i},p_{ci}), \tag{4}$$

where $N$ is the number of examples in $\mathcal{D}_{CI}$ , $p_{ci}$ is the prompt, and ${ci}_{:T}$ is the sequence of the problem-solving information.

Finally, we construct a problem-solving subset from the KPDD-CoT dataset, denoted as $\mathcal{D}_{CS}$ . Each sample in this subset can be represented as $(x,cc,ci,cs)$ , where $x$ represents the original question, $cc$ represents the core question, $ci$ represents the problem-solving information, and $cs$ represents the rationales in CoT format. For each training instance $(x,cc,ci,cs)$ from $\mathcal{D}_{CS}$ , we integrate the original question $x$ , the core question $cc$ , the problem-solving information $ci$ , and the prompt $p_{cs}$ "Let’s understand the core question and the problem-solving information, solve the question step by step, and show the answer." to construct a new input. This guides the KPDD-CoT-solve in fine-tuning to generate rationales $cs$ for solving the original question in CoT format. The fine-tuning loss function can be represented as follows:

$$\mathcal{L}=-\sum_{i=1}^{N}\sum_{t=1}^{T}\log P({cs}^{i}_{t}\mid{cs}^{i}_{<t},x^{i},cc^{i},ci^{i},p_{cs}), \tag{5}$$

where $N$ is the number of examples in $\mathcal{D}_{CS}$ , $p_{cs}$ is the prompt, and ${cs}_{:T}$ is the sequence of the rationale in CoT format.
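To make the three training subsets concrete, the sketch below turns one record $(x,cc,ci,cs)$ into the (input, target) pairs for the core, info, and solve models. The prompt strings are quoted from the text. The function name `make_pairs` and the exact way the solve input joins $x$, $cc$, $ci$, and $p_{cs}$ are assumptions, since the paper only says these are "integrated".

```python
from typing import Dict, Tuple

# Prompts quoted from the paper (straight apostrophes used here).
P_CC = "Let's extract the most comprehensive and detailed core question."
P_CI = "Let's identify and list the most useful information related to the question."
P_CS = (
    "Let's understand the core question and the problem-solving information, "
    "solve the question step by step, and show the answer."
)


def make_pairs(x: str, cc: str, ci: str, cs: str) -> Dict[str, Tuple[str, str]]:
    """Build (input, target) pairs for the three SLMs from one record."""
    solve_input = (
        f"Question: {x}\n"          # hypothetical field labels
        f"Core question: {cc}\n"
        f"Information: {ci}\n"
        f"{P_CS}"
    )
    return {
        "core": (f"{P_CC} {x}", cc),   # prompt prepended to the question
        "info": (f"{P_CI} {x}", ci),
        "solve": (solve_input, cs),
    }
```

For KPDD-PoT, the same construction would apply with the PoT instruction and the program as the solve target.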

3.2.2 Fine-tuning SLMs for KPDD-PoT

In KPDD-PoT, aside from replacing the KPDD-CoT dataset with the KPDD-PoT dataset, the fine-tuning method for KPDD-PoT-core is the same as that of KPDD-CoT-core, and the fine-tuning method for KPDD-PoT-info is the same as that of KPDD-CoT-info. The fine-tuning method of KPDD-PoT-solve, however, differs from that of KPDD-CoT-solve, mainly in the input instruction. Specifically, when fine-tuning KPDD-PoT-solve, the input instruction is: "Let’s understand the core question and the problem-solving information, and generate the python code (return ans) to solve the question." This instruction guides the model not only to understand the core question and the problem-solving information but also to generate Python code that computes the answer. This approach leverages the model’s code generation ability, which can be particularly effective for solving mathematical problems programmatically.

Moreover, the fine-tuning loss functions for the SLMs in KPDD-PoT are identical to those in KPDD-CoT. This ensures that the optimization process remains consistent across both methods, focusing on minimizing the discrepancies between the model’s output and the expected solutions.

3.3 Inference-time Predictions

Figure 4 illustrates the inference process of KPDD. After fine-tuning, the process for solving a given question involves three main steps:

1. Core Question Extraction: First, we use the KPDD-CoT/PoT-core model to extract the core question from the original problem. This step isolates the essential part of the problem that needs to be addressed.

2. Problem-Solving Information Extraction: Next, the KPDD-CoT/PoT-info model extracts the relevant problem-solving information. This model identifies and lists the necessary context and data required to solve the core question.

3. Solution Generation: Finally, based on the original question, the core question, and the problem-solving information, the KPDD-CoT/PoT-solve model generates rationales in either CoT or PoT format to solve the original question. For KPDD-PoT, this involves generating Python code that can compute the answer.

This structured approach ensures that each model focuses on a specific aspect of the problem-solving process, leading to more accurate and reliable solutions.
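The three inference steps can be sketched as a simple pipeline over three text-to-text callables. This is a minimal sketch, not the authors' implementation: `kpdd_infer`, the `Generate` type, and the input layout for the solve model are assumptions, and each callable stands in for a fine-tuned SLM's generation function.

```python
from typing import Callable, Dict

# Any function that maps an input string to generated text (hypothetical type).
Generate = Callable[[str], str]

P_CC = "Let's extract the most comprehensive and detailed core question."
P_CI = "Let's identify and list the most useful information related to the question."
P_CS = (
    "Let's understand the core question and the problem-solving information, "
    "solve the question step by step, and show the answer."
)


def kpdd_infer(
    question: str, core: Generate, info: Generate, solve: Generate
) -> Dict[str, str]:
    """Run core extraction, information extraction, then solution generation."""
    cc = core(f"{P_CC} {question}")   # step 1: core question
    ci = info(f"{P_CI} {question}")   # step 2: problem-solving information
    solve_input = (                   # step 3: assumed input layout
        f"Question: {question}\nCore question: {cc}\nInformation: {ci}\n{P_CS}"
    )
    return {"core": cc, "info": ci, "rationale": solve(solve_input)}
```

For KPDD-PoT, the returned rationale would be a Python program that is then passed to an external interpreter to obtain the final answer.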

4 Experiments

Split	Dataset	Size
Train	GSM8K	7473
Train	(+) augmented	29892
Test	GSM8K	1319
Test	ASDiv	2096
Test	SVAMP	1000
Test	MultiArith	600

Table 1: Statistics of the datasets used in our experiments. "Augmented" means that we ran data synthesis 4 times on the GSM8K training set.

4.1 Dataset

In our paper, the training dataset is derived from the GSM8K training set, which comprises diverse grade school math word problems (Cobbe et al., 2021). Additionally, the mathematical reasoning capabilities of the SLMs are evaluated using the GSM8K test set, along with other datasets including ASDiv, which contains diverse math word problems (Miao et al., 2020), SVAMP, which features math word problems with varying structures (Patel et al., 2021), and MultiArith, which consists of arithmetic word problems (Roy and Roth, 2015). The statistics of these datasets are summarized in Table 1. This comprehensive evaluation approach ensures that the SLMs’ mathematical reasoning capabilities are thoroughly tested across a variety of problem types and structures, providing a robust assessment of their performance.

4.2 Implementation

We employ GPT-4 as the teacher LLM to construct our training dataset and utilize FlanT5 models—Small (60M), Base (250M), and Large (760M) (Chung et al., 2022)—as student SLMs. We manually create 8 demonstrations to guide GPT-4 in generating 4 reasoning paths for each dataset (KPDD-CoT and KPDD-PoT). Fine-tuning of all student SLMs is conducted using the Huggingface library (Wolf et al., 2020) on an NVIDIA 3090 GPU with 24 GB RAM. The learning rate for fine-tuning is set to 5e-4, with a total of 10 fine-tuning epochs.

4.3 Baselines

Proprietary Large Language Models We present CoT prompting results from an array of SoTA LLMs, such as OpenAI’s GPT-4, ChatGPT (gpt-3.5-turbo), Google’s PaLM-2, and Anthropic’s Claude-2.

Open-Source Large Language Models We present the mathematical reasoning performance of Llama-2-7B, CodeLLaMA-7B, and their fine-tuned versions, such as Platypus-2, WizardMath, and TORA.

Fine-tuned Small Language Models We present prior work that fine-tunes SLMs under 1B parameters: Ho et al. (2023) fine-tune GPT-3-ada, Fu et al. (2023) fine-tune FlanT5, and Shridhar et al. (2023) fine-tune GPT-2.

Proprietary Large Language Models
Models	#Params	GSM8K	ASDiv	SVAMP	MultiArith	AVG
GPT-4 (OpenAI, 2023)	-	92.0	91.3	93.1	-	92.13
ChatGPT	-	80.8	87.3	83.0	-	83.7
Claude-2 (Anthropic, 2023)	-	85.2	-	-	-	85.2
PaLM-2 (Anil et al., 2023)	540B	80.7	-	-	-	80.7
Open-Source Large Language Models
Llama-2 (Touvron et al., 2023b)	7B	13.3	50.7	38.0	-	34
CodeLLaMA (Rozière et al., 2023)	7B	34.0	61.4	59.0	-	51.46
Platypus-2 (Lee et al., 2023)	7B	14.4	47.9	36.7	-	33
WizardMath (Luo et al., 2023)	7B	54.9	59.1	57.3	-	57.1
TORA (Gou et al., 2023)	7B	68.8	73.9	68.2	-	70.3
Fine-tuned Small Language Models
Ho et al. (2023)	0.3B	3.11	-	-	-	3.11
Fu et al. (2023)	0.76B	20.2	23.8	20.4	38.5	25.72
Fu et al. (2023)	0.25B	13.4	20.9	14.2	29.7	19.55
Shridhar et al. (2023)	0.77B	17.89	-	18.14	-	18.01
Zhu et al. (2023a)	0.77B	39.2	51.2	48.2	79.2	54.45
Our fine-tuned Small Language Models
FlanT5-Small	0.06B	2.1	2.8	2.1	4.0	2.75
(+) KPDD-CoT		7.58	8.73	6.9	7.83	7.76
(+) KPDD-PoT		20.77	40.07	34.1	44.16	34.93
FlanT5-Base	0.25B	3.0	4.2	3.8	7.0	4.5
(+) KPDD-CoT		14.63	14.93	13.8	21.5	16.21
(+) KPDD-PoT		34.57	52.29	50.5	73.66	52.75
FlanT5-Large	0.76B	6.9	10.1	6.8	13.0	9.2
(+) KPDD-CoT		21.75	22.51	19.1	35.5	24.71
(+) KPDD-PoT		46.32	59.92	61.6	87.5	63.83

Table 2: Overall test set performance. We use KPDD to fine-tune SLMs, and evaluate them on four mathematical reasoning datasets, i.e., GSM8K, ASDiv, SVAMP, and MultiArith. The experiment results show that KPDD-CoT can effectively improve SLMs’ reasoning performance, and KPDD-PoT makes SLMs achieve SOTA reasoning performance.

4.4 Main Results

Table 2 showcases our method’s performance on four mathematical datasets, revealing key insights:

1. KPDD-CoT Enhances Mathematical Reasoning: KPDD-CoT significantly improves the mathematical reasoning capabilities of SLMs, with absolute improvements in average accuracy over the original FlanT5 models ranging from 5.01% (FlanT5-Small) to 15.51% (FlanT5-Large). Traditional baselines typically rely on CoTD, which involves generating numerous steps and performing extensive calculations. However, CoTD often encounters semantic misunderstanding errors that hinder the improvement of SLMs’ mathematical reasoning abilities. In contrast, KPDD-CoT employs extra SLMs to extract key points (the core question and the problem-solving information) of the question and uses these key points to guide the SLMs’ reasoning. This approach significantly reduces the semantic misunderstanding errors of CoTD, making KPDD-CoT better suited for improving the mathematical reasoning ability of SLMs.

2. KPDD-PoT Outperforms State-of-the-Art: KPDD-PoT surpasses previous state-of-the-art fine-tuned SLMs at all scales, with absolute improvements in average accuracy over the original FlanT5 models between 32.18% (FlanT5-Small) and 54.63% (FlanT5-Large). Furthermore, KPDD-PoT’s accuracy is higher than that of KPDD-CoT, highlighting the advantage of rationales in PoT format in enhancing SLMs’ reasoning capabilities. Our analysis finds that the mathematical reasoning performance of CoTD is limited not only by semantic misunderstanding errors but also by calculation errors. PoTD converts rationales from CoT format into PoT format, formulating the reasoning process as a Python program and sending it to an external Python interpreter to generate the final answer. This method transfers numerical computation from SLMs to a Python interpreter, avoiding calculation errors. Additionally, by extracting key points of the question, KPDD-PoT implicitly enhances the SLMs’ understanding of the question, thereby improving their overall mathematical reasoning capabilities.

3. Importance of Model Size: The efficacy of mathematical reasoning distillation in SLMs is highly dependent on model size; larger models assimilate more reasoning knowledge, leading to superior performance. For instance, under KPDD-PoT, FlanT5-Small achieves 20.77% accuracy on GSM8K, FlanT5-Base reaches 34.57%, and FlanT5-Large attains 46.32%.

4. Strong Transferability of KPDD: KPDD exhibits strong transferability. The distillation dataset of KPDD is constructed from the GSM8K training dataset, and we evaluate our SLMs on several mathematical reasoning datasets, including the GSM8K test dataset, ASDiv dataset, SVAMP dataset, and MultiArith dataset. Our experimental results show that KPDD not only achieves good reasoning performance on the GSM8K test dataset but also performs well on the ASDiv, SVAMP, and MultiArith datasets. These results demonstrate that KPDD has strong transferability and further corroborate that SLMs do not improve their reasoning performance through data leakage.
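The improvement ranges quoted in the first two points can be recomputed from the AVG column of Table 2. The short sketch below does this arithmetic. The dictionary layout and the function name `gains` are illustrative choices, and the numbers are copied from the table.

```python
from typing import Dict

# AVG column of Table 2 for the original and KPDD-tuned FlanT5 models.
AVG: Dict[str, Dict[str, float]] = {
    "FlanT5-Small": {"base": 2.75, "KPDD-CoT": 7.76, "KPDD-PoT": 34.93},
    "FlanT5-Base": {"base": 4.5, "KPDD-CoT": 16.21, "KPDD-PoT": 52.75},
    "FlanT5-Large": {"base": 9.2, "KPDD-CoT": 24.71, "KPDD-PoT": 63.83},
}


def gains(method: str) -> Dict[str, float]:
    """Absolute gain in average accuracy over the original model, per size."""
    return {model: round(row[method] - row["base"], 2) for model, row in AVG.items()}


cot_gains = gains("KPDD-CoT")  # Small: 5.01, Base: 11.71, Large: 15.51
pot_gains = gains("KPDD-PoT")  # Small: 32.18, Base: 48.25, Large: 54.63
```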

Category	Core	Info	Solve	GSM8K	ASDiv	SVAMP	MultiArith	AVG
1	$\times$	$\times$	$\times$	3.0	4.2	3.8	7.0	4.5
2	$\times$	$\times$	$\checkmark$	8.71	9.2	8.2	10.33	9.11
3	$\checkmark$	$\times$	$\checkmark$	9.02	9.25	8.9	11.5	9.66
4	$\times$	$\checkmark$	$\checkmark$	8.87	9.73	8.9	11.0	9.59
5	$\checkmark$	$\checkmark$	$\checkmark$	9.17	9.92	9.03	11.83	9.98

Table 3: Effect of Different Components in KPDD-CoT. We consider five different categories to analyse the effect of different components in KPDD-CoT. The experiment result shows that key points in questions can deepen SLMs’ understanding of the questions, and combining several key points can provide richer information, leading to further improvements in SLMs’ reasoning abilities.

Category	Core	Info	Solve	GSM8K	ASDiv	SVAMP	MultiArith	AVG
1	$\times$	$\times$	$\times$	3.0	4.2	3.8	7.0	4.5
2	$\times$	$\times$	$\checkmark$	19.40	44.32	40.6	45.33	37.41
3	$\checkmark$	$\times$	$\checkmark$	23.19	45.89	44.1	53.33	41.62
4	$\times$	$\checkmark$	$\checkmark$	25.39	46.85	44.6	57.33	43.54
5	$\checkmark$	$\checkmark$	$\checkmark$	27.06	49.33	46.1	58.33	45.20

Table 4: Effect of Different Components in KPDD-PoT. We consider five different categories to analyse the effect of different components in KPDD-PoT. The experiment result shows that key points in questions can deepen SLMs’ understanding of the questions, and combining several key points can provide richer information, leading to further improvements in SLMs’ reasoning abilities.

4.5 Effect of Different Components in KPDD

In this subsection, we examine the impact of the individual components of KPDD. We consider five categories: 1. Original SLMs without any fine-tuning; 2. SLMs with original CoT/PoT distillation; 3. SLMs with core question distillation combined with CoT/PoT distillation; 4. SLMs with problem-solving information distillation combined with CoT/PoT distillation; 5. SLMs with KPDD. For each of the latter four categories, we construct a corresponding reasoning dataset containing a single reasoning path per question. We use FlanT5-Base as the base SLM and fine-tune it on these reasoning datasets. To evaluate the reasoning capabilities of the resulting SLMs, we test them on the GSM8K test dataset, as well as on the ASDiv, SVAMP, and MultiArith datasets.

Tables 3 and 4 present the results of our experiments, from which we make several observations: (1) We observe a significant performance improvement in Category 2 compared to original SLMs. Specifically, under CoT reasoning, Category 2 achieves an average accuracy gain of 4.61% across multiple datasets, while under PoT reasoning, it achieves a substantial average accuracy improvement of 32.91%. These experimental results indicate that CoTD and PoTD can markedly enhance the mathematical reasoning ability of SLMs. (2) We find that Categories 3 and 4 exhibit a further performance increase relative to Category 2. Specifically, in the context of CoT reasoning, Categories 3 and 4 achieve average accuracy gains of 0.55% and 0.45% respectively over Category 2 across multiple datasets. Under PoT reasoning, the gains are more pronounced with Categories 3 and 4 achieving average accuracy improvements of 4.21% and 6.13% respectively. This suggests that SLMs can deepen their understanding of questions by focusing on key points, thereby further enhancing their mathematical reasoning ability. (3) In Category 5, we combine the core questions with the problem-solving information to guide SLMs in addressing the questions. The results are promising: Category 5 achieves an average accuracy of 9.98% under CoT reasoning and a remarkable 45.20% under PoT reasoning across multiple datasets. This indicates that key points in questions play a crucial role in boosting the reasoning capabilities of SLMs, and that combining several key points provides richer information, leading to further improvements in their reasoning abilities.

Category	Core	Info	Solve	GSM8K	ASDiv	SVAMP	MultiArith	AVG
$\mathrm{I}$	1^*	1	1	7.88	4.72	5.4	10.66	7.16
$\mathrm{II}$	1	1	2	9.09	9.44	8.2	11.33	9.51
$\mathrm{III}$	1	2	2	8.41	7.72	6.7	11.24	8.51
$\mathrm{IV}$	2	1	2	7.80	7.58	7.1	11.16	8.41
$\mathrm{V}$	1	2	3	9.17	9.92	9.03	11.83	9.98

* The index of the SLM.

Table 5: Effect of SLM Quantity in KPDD-CoT. We consider five different categories to analyse the effect of SLM quantity in KPDD-CoT. The experimental results show that for KPDD-CoT, using a separate SLM for each component is necessary to maximize the reasoning performance of KPDD-CoT.

Category	Core	Info	Solve	GSM8K	ASDiv	SVAMP	MultiArith	AVG
$\mathrm{I}$	1^*	1	1	24.18	44.32	41.19	48.66	39.61
$\mathrm{II}$	1	1	2	26.0	42.69	42.69	55.83	41.80
$\mathrm{III}$	1	2	2	24.79	46.37	40.6	49.16	40.23
$\mathrm{IV}$	2	1	2	24.63	45.37	41.3	49.33	40.15
$\mathrm{V}$	1	2	3	27.06	49.33	46.1	58.33	45.20

* The index of the SLM.

Table 6: Effect of SLM Quantity in KPDD-PoT. We consider five different categories to analyse the effect of SLM quantity in KPDD-PoT. The experimental results show that for KPDD-PoT, using a separate SLM for each component is necessary to maximize the reasoning performance of KPDD-PoT.

4.6 Effect of SLM Quantity in KPDD

In this subsection, we investigate the impact of SLM quantity in KPDD. We consider five distinct categories: $\mathrm{I}$. Using one SLM to simultaneously extract the core question and problem-solving information, and solve the original question; $\mathrm{II}$. Using one SLM to extract the core question and problem-solving information, and another SLM to solve the original question; $\mathrm{III}$. Using one SLM to extract the core question, and another SLM to extract the problem-solving information and solve the original question; $\mathrm{IV}$. Using one SLM to extract the problem-solving information, and another SLM to extract the core question and solve the original question; $\mathrm{V}$. Using one SLM to extract the core question, another SLM to extract the problem-solving information, and a third SLM to solve the original question. For each category, we create a corresponding reasoning dataset containing a single reasoning path per question. We use FlanT5-base as the base SLM and fine-tune it on these reasoning datasets. To assess their reasoning capabilities, we evaluate these SLMs on the GSM8K test dataset, as well as on the ASDiv, SVAMP, and MultiArith datasets.
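The five configurations can be read off the Core/Info/Solve columns of Tables 5 and 6. The following minimal sketch encodes that assignment and routes a question through the three stages. The `Model` type, the `run_kpdd` helper, and the prompt strings are hypothetical stand-ins for fine-tuned SLMs and are not taken from our implementation.

```python
from typing import Callable, Dict

# SLM index per component, copied from the Core/Info/Solve columns of Tables 5 and 6.
ASSIGNMENT: Dict[str, Dict[str, int]] = {
    "I":   {"core": 1, "info": 1, "solve": 1},
    "II":  {"core": 1, "info": 1, "solve": 2},
    "III": {"core": 1, "info": 2, "solve": 2},
    "IV":  {"core": 2, "info": 1, "solve": 2},
    "V":   {"core": 1, "info": 2, "solve": 3},
}

# Hypothetical model interface: prompt text in, generated text out.
Model = Callable[[str], str]


def num_slms(category: str) -> int:
    """Number of distinct SLMs that a category requires."""
    return len(set(ASSIGNMENT[category].values()))


def solve_has_dedicated_slm(category: str) -> bool:
    """True if the solving stage does not share its SLM with an extraction stage."""
    a = ASSIGNMENT[category]
    return a["solve"] not in (a["core"], a["info"])


def run_kpdd(question: str, category: str, models: Dict[int, Model]) -> str:
    """Run the three stages with the SLMs assigned to the given category.

    The prompt templates are illustrative assumptions only.
    """
    a = ASSIGNMENT[category]
    core = models[a["core"]](f"extract core question: {question}")
    info = models[a["info"]](f"extract problem-solving information: {question}")
    prompt = (
        f"question: {question}\n"
        f"core question: {core}\n"
        f"problem-solving information: {info}\n"
        "solve:"
    )
    return models[a["solve"]](prompt)
```

Under this encoding, only Categories $\mathrm{II}$ and $\mathrm{V}$ give the solving stage an SLM that it does not share with an extraction stage.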

Tables 5 and 6 present the results of our experiments, from which we make several observations: (1) Compared to other categories, Category $\mathrm{I}$ performed worst. For KPDD-CoT, Category $\mathrm{I}$ achieved an average accuracy of 7.16% across multiple datasets, while for KPDD-PoT, it achieved an average accuracy of 39.61%. This suggests that the limited model size of a single SLM hinders its performance across multiple tasks. (2) Category $\mathrm{II}$ outperformed Categories $\mathrm{III}$ and $\mathrm{IV}$ in reasoning performance. For KPDD-CoT, Category $\mathrm{II}$ achieved an average accuracy of 9.51% across multiple datasets, while for KPDD-PoT, it achieved an average accuracy of 41.80%. We attribute this result to the importance of the KPDD-CoT/PoT-solve component: Category $\mathrm{II}$ is the only two-SLM configuration in which the solving phase does not share its SLM with an extraction phase. (3) For KPDD-CoT, Category $\mathrm{V}$ achieved an average accuracy of 9.98% across multiple datasets, while for KPDD-PoT, it achieved an average accuracy of 45.20%. This is the highest reasoning performance among all categories, indicating that our approach of using a separate SLM for each component maximizes the performance of each component, thereby maximizing the reasoning performance of KPDD.
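As a quick check of the comparison above, the short sketch below ranks the categories by the AVG column of Tables 5 and 6 and computes gaps in percentage points. The `ranking` and `gain` helpers are illustrative names introduced here.

```python
# AVG column of Tables 5 and 6 (accuracy in %).
AVG = {
    "KPDD-CoT": {"I": 7.16, "II": 9.51, "III": 8.51, "IV": 8.41, "V": 9.98},
    "KPDD-PoT": {"I": 39.61, "II": 41.80, "III": 40.23, "IV": 40.15, "V": 45.20},
}


def ranking(variant: str) -> list:
    """Categories sorted from highest to lowest average accuracy."""
    scores = AVG[variant]
    return sorted(scores, key=scores.get, reverse=True)


def gain(variant: str, better: str, worse: str) -> float:
    """Difference in average accuracy, in percentage points."""
    return round(AVG[variant][better] - AVG[variant][worse], 2)


for variant in AVG:
    print(variant, ranking(variant), gain(variant, "V", "I"), gain(variant, "V", "II"))
```

Both variants yield the same ordering ($\mathrm{V}$, $\mathrm{II}$, $\mathrm{III}$, $\mathrm{IV}$, $\mathrm{I}$). The gap between Category $\mathrm{V}$ and Category $\mathrm{I}$ is 2.82 points for KPDD-CoT and 5.59 points for KPDD-PoT.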

4.7 Diverse Reasoning Paths Improve SLMs’ Reasoning Performance

In this subsection, we fine-tune CodeT5-Base on our reasoning datasets, which differ in the number of reasoning paths they contain per question, to analyze how the number of reasoning paths in the training data influences the reasoning performance of SLMs.

Figure 5 presents the results of our experiments, which demonstrate that a variety of reasoning paths can bolster the reasoning performance of SLMs. For instance, CodeT5-Base, when trained on a KPDD-PoT dataset featuring four reasoning paths, attains a 34.57% accuracy on the GSM8K test dataset and a 52.29% accuracy on ASDiv. In contrast, CodeT5-Base trained on a KPDD-PoT dataset with only one reasoning path achieves 49.33% accuracy on the GSM8K test dataset and 46.1% accuracy on ASDiv. This suggests that the inclusion of multiple reasoning paths in training data can significantly enhance the model’s performance, particularly in tasks requiring explanation generation.
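To make the notion of a dataset with k reasoning paths concrete, the sketch below keeps at most k distinct rationales per question. It assumes the candidate rationales have already been generated and filtered; the `build_dataset` helper and the toy rationales are hypothetical and only illustrate the data layout, not our actual construction procedure.

```python
from typing import Dict, List, Tuple


def build_dataset(rationales: Dict[str, List[str]], k: int) -> List[Tuple[str, str]]:
    """Return (question, rationale) pairs with at most k distinct rationales per question."""
    pairs: List[Tuple[str, str]] = []
    for question, paths in rationales.items():
        unique = list(dict.fromkeys(paths))  # drop duplicates, keep first-seen order
        pairs.extend((question, path) for path in unique[:k])
    return pairs


# Toy PoT-style rationales for one question (illustrative only).
toy = {
    "Tom has 3 apples and buys 2 more. How many apples does he have?": [
        "ans = 3 + 2",
        "start = 3\nbought = 2\nans = start + bought",
        "ans = 3 + 2",  # duplicate of the first path, removed by deduplication
    ]
}

single_path = build_dataset(toy, k=1)
multi_path = build_dataset(toy, k=4)
print(len(single_path), len(multi_path))
```

With k = 1 each question contributes one training pair, while larger k adds further distinct rationales for the same question whenever they are available.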

4.8 Error Analysis

In this subsection, our aim is to verify whether KPDD can indeed reduce semantic misunderstanding errors. KPDD-PoT embeds the reasoning process implicitly within its rationales, which makes error analysis on rationales in PoT format challenging. Conversely, rationales in CoT format explicitly contain the reasoning steps, allowing us to see how the SLM solves a question step by step and thus facilitating error analysis. Therefore, in this part, we focus on error analysis for rationales in CoT format. To this end, we randomly sample 100 examples from GSM8K/SVAMP and perform error analysis on the questions with incorrect answers. For a better understanding of KPDD’s effect, we also consider three other scenarios: (1) vanilla CoTD, (2) reasoning that combines vanilla CoTD and core question extraction, and (3) reasoning that combines vanilla CoTD and problem-solving information extraction. Furthermore, to simplify our analysis, we use FlanT5-base as our SLMs, and the corresponding reasoning datasets still contain a single reasoning path per question.

The detailed quantitative results are illustrated in Figure 6. By analyzing the experimental results, we found that: (1) Combination of Multiple Errors in SLMs: SLMs tend to exhibit combinations of multiple errors, with calculation errors having the most significant impact on reasoning performance. Specifically, vanilla CoTD on the GSM8K dataset showed 51 understanding errors, 79 calculation errors, and 34 missing-step errors, resulting in a total of 164 errors. This total far exceeds the number of sampled problems, because a single incorrect answer can contain several types of errors, and calculation errors outnumber the other types. Similar results were observed on the SVAMP dataset. This explains why PoTD achieves better reasoning performance than CoTD: PoTD converts vanilla rationales into Python programs, delegating the calculation process to an external Python interpreter to avoid calculation errors. (2) Reduction of Understanding Errors with Key Points: Introducing key points of the original questions effectively reduces understanding errors. Specifically, when core questions were introduced in vanilla CoTD, the number of understanding errors on the GSM8K dataset decreased to 50, and on the SVAMP dataset, it decreased to 53. When problem-solving information was introduced in vanilla CoTD, the number of understanding errors decreased to 48 on GSM8K and to 51 on SVAMP. These results indicate that key points of the original questions help SLMs better understand the questions, thereby reducing understanding errors and improving reasoning performance. (3) Further Reduction of Understanding Errors with Multiple Key Points: Combining multiple key points can further reduce understanding errors. Specifically, KPDD reduced the number of understanding errors to 46 on GSM8K and to 50 on SVAMP. This suggests that KPDD’s method of integrating multiple key points can deepen SLMs’ understanding of the original questions, further reducing understanding errors and enhancing reasoning performance.
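The counts above can be reproduced with a simple tally in which each incorrectly answered question carries a set of error labels, so the total number of errors may exceed the number of questions. The sketch below is illustrative: the label names and the `tally` helper are assumptions, while the numbers are those reported for GSM8K in this subsection.

```python
from collections import Counter
from typing import Iterable, Set

# Error counts reported for vanilla CoTD on the GSM8K sample.
VANILLA_COTD_GSM8K = Counter(understanding=51, calculation=79, missing_step=34)

# Understanding errors on GSM8K for each setting, as reported in the text.
UNDERSTANDING_GSM8K = {
    "vanilla CoTD": 51,
    "CoTD + core question": 50,
    "CoTD + problem-solving information": 48,
    "KPDD": 46,
}


def tally(annotations: Iterable[Set[str]]) -> Counter:
    """Count error types; one question may contribute several labels."""
    counts: Counter = Counter()
    for labels in annotations:
        counts.update(labels)
    return counts


total = sum(VANILLA_COTD_GSM8K.values())
most_common_type = VANILLA_COTD_GSM8K.most_common(1)[0][0]
reduction = UNDERSTANDING_GSM8K["vanilla CoTD"] - UNDERSTANDING_GSM8K["KPDD"]
print(total, most_common_type, reduction)

# Toy annotations: two wrong answers yield three error labels in total.
toy = tally([{"understanding", "calculation"}, {"calculation"}])
print(sum(toy.values()))
```

The tally gives 164 errors in total, with calculation errors as the most frequent type, and KPDD removes 5 of the 51 understanding errors observed for vanilla CoTD on GSM8K.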

5 Conclusion

In this paper, we propose Key-Point-Driven Distillation (KPDD) for enhancing mathematical reasoning in Small Language Models (SLMs). Our approach leverages the extraction of key points from questions to improve understanding and reduce errors in reasoning tasks. Experimental results demonstrate that KPDD significantly reduces understanding errors compared to conventional mathematical reasoning distillation methods. However, PoTD implicitly embeds the reasoning process within the generated program, making it difficult to analyze misunderstandings. In future work, we will explore methods that facilitate error analysis for PoTD.

References

Anil et al. (2023) Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernández Ábrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan A. Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vladimir Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, et al. 2023. Palm 2 technical report. CoRR, abs/2305.10403.
Anthropic (2023) Anthropic. 2023. Model card and evaluations for claude models. Anthropic blog.
Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen technical report. CoRR, abs/2309.16609.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
Chen et al. (2023) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research.
Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2023. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res., 24:240:1–240:113.
Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models. CoRR, abs/2210.11416.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. CoRR, abs/2110.14168.
Dong et al. (2023) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. 2023. A survey for in-context learning. CoRR, abs/2301.00234.
Fu et al. (2023) Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. 2023. Specializing smaller language models towards multi-step reasoning. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 10421–10430. PMLR.
Gou et al. (2023) Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Tora: A tool-integrated reasoning agent for mathematical problem solving. CoRR, abs/2309.17452.
Ho et al. (2023) Namgyu Ho, Laura Schmid, and Se-Young Yun. 2023. Large language models are reasoning teachers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14852–14882, Toronto, Canada. Association for Computational Linguistics.
Lee et al. (2023) Ariel N. Lee, Cole J. Hunter, and Nataniel Ruiz. 2023. Platypus: Quick, cheap, and powerful refinement of llms. CoRR, abs/2308.07317.
Liu et al. (2023) Tengxiao Liu, Qipeng Guo, Yuqing Yang, Xiangkun Hu, Yue Zhang, Xipeng Qiu, and Zheng Zhang. 2023. Plan, verify and switch: Integrated reasoning with diverse X-of-thoughts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2807–2822, Singapore. Association for Computational Linguistics.
Luo et al. (2023) Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. CoRR, abs/2308.09583.
Magister et al. (2023) Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. 2023. Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1773–1781, Toronto, Canada. Association for Computational Linguistics.
Miao et al. (2020) Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984, Online. Association for Computational Linguistics.
Min et al. (2022) Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
OpenAI (2023) OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
Roy and Roth (2015) Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752, Lisbon, Portugal. Association for Computational Linguistics.
Rozière et al. (2023) Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton-Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code llama: Open foundation models for code. CoRR, abs/2308.12950.
Rubin et al. (2022) Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, Seattle, United States. Association for Computational Linguistics.
Shridhar et al. (2023) Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. 2023. Distilling reasoning capabilities into smaller language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7059–7073, Toronto, Canada. Association for Computational Linguistics.
Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971.
Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
Wang et al. (2023a) Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023a. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2609–2634, Toronto, Canada. Association for Computational Linguistics.
Wang et al. (2023b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023b. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Yang et al. (2023) Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, and Zhiying Wu. 2023. Baichuan 2: Open large-scale language models. CoRR, abs/2309.10305.
Yuan et al. (2023) Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. 2023. Scaling relationship on learning mathematical reasoning with large language models. CoRR, abs/2308.01825.
Yue et al. (2023) Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2023. Mammoth: Building math generalist models through hybrid instruction tuning. CoRR, abs/2309.05653.
Zhong et al. (2024) Qihuang Zhong, Kang Wang, Ziyang Xu, Juhua Liu, Liang Ding, Bo Du, and Dacheng Tao. 2024. Achieving >97% on GSM8K: deeply understanding the problems makes llms better reasoners. CoRR, abs/2404.14963.
Zhu et al. (2023a) Xuekai Zhu, Biqing Qi, Kaiyan Zhang, Xingwei Long, and Bowen Zhou. 2023a. Pad: Program-aided distillation specializes large models in reasoning. CoRR, abs/2305.13888.
Zhu et al. (2023b) Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. 2023b. A survey on model compression for large language models. CoRR, abs/2308.07633.