Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks

Dharunish Yugeswardeenoo           Kevin Zhu           Sean O’Brien
Algoverse AI Research
[email protected], [email protected]
Abstract

Although LLMs have the potential to transform many fields, they still underperform humans in reasoning tasks. Existing methods induce the model to produce step-by-step calculations, but this research explores the question: Does making the LLM analyze the question improve its performance? We propose a novel prompting strategy called Question Analysis Prompting (QAP), in which the model is prompted to explain the question in n words before solving. The value of n influences the length of response generated by the model. QAP is evaluated on GPT 3.5 Turbo and GPT 4 Turbo on arithmetic datasets GSM8K, AQuA, and SAT and commonsense dataset StrategyQA. QAP is compared with other state-of- the-art prompts including Chain-of-Thought (CoT), Plan and Solve Prompting (PS+) and Take A Deep Breath (TADB). QAP outperforms all state-of-the-art prompts on AQuA and SAT datasets on both GPT3.5 and GPT4. QAP consistently ranks among the top-2 prompts on 75% of the tests. A key factor of QAP performance can be attributed to response length, where detailed responses are beneficial when answering harder questions, but can negatively affect easy questions.

Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks


Dharunish Yugeswardeenoo           Kevin Zhu           Sean O’Brien Algoverse AI Research [email protected], [email protected]


1 Introduction

Large language models (LLMs) have recently shown rapid improvement across a host of standard natural language processing (NLP) tasks, including arithmetic, commonsense and symbolic reasoning. (Brown et al., 2020) Although these models show improved ability to understand and generate text (OpenAI, 2023), their performance can still be further improved. One solution is to encourage the model to think step-by-step. Using chain-of-thought prompting (Wei et al., 2022), LLMs are given Q&A exemplars which are designed to elicit a structured step-by-step response from the model. Many newly developed strategies meant to improve LLM performance have been focused on sophisticating the model’s step-by-step calculation (Gu et al., 2023). Despite SoTA prompts’ remarkable success across various tasks, their accuracies can still be further improved. In this work, we explore ways to improve the model reasoning not only in the answer steps, but also how the model interprets the question itself. By making the model to explicitly interpret the question, we maximize its understanding of the question and minimize missed key information. This paper introduces Question-Analysis Prompting (QAP), a simple zero-shot prompting strategy that induces the model to first explain the question before solving. This method is adaptable to various problem difficulties and shows promising results in math and commonsense reasoning across different model sizes.

2 Prompt Design

The key principle behind QAP is that the model should reiterate the problem before solving. Another principle is that we should be able to control how much the model explains so that we can adapt the prompt to different model sizes and problem complexities. The specific prompt used is as follows:

"Explain this problem to me in at least n𝑛nitalic_n words. Then solve for the answer."

In this work, we experiment with n = 25, 50, 100, 150, 200. The versions of these prompts are named QAPn𝑛nitalic_n. Although the model is not constrained to generating fewer than n tokens in its summary, we find that the number of tokens in the response correlates strongly with the choice of n𝑛nitalic_n. We show specific examples of the impacts of n𝑛nitalic_n in the Appendix.

3 Prompt Impact

In Figure 1, we highlight the structure of a standard QAP output. First, the model breaks down the question in its own words and provides in-depth analysis on each event. We notice a direct relationship between the explanation and the answer steps. Each calculation is previously mentioned in the explanation portion, and this proves that the explanation has allowed the model to plan its approach even before solving. As a result, there is a significant increase in step-by-step calculation and a decreased chance of missed steps.

Refer to caption
Figure 1: Example of QAP prompting - shows how the prompt triggers explanation of the question followed by an approach to solve the problem, detailed steps, finally leading to correct answer

4 Experimental Setup

4.1 Benchmarks

We evaluate the effectiveness of QAP on three arithmetic reasoning datasets. These include grade-school math questions from GSM8K (Cobbe et al., 2021), algebraic word problems from AQuA (Ling et al., 2017), and SAT math problems from AGIEval (Zhong et al., 2023). For commonsense reasoning, we evaluate on open-domain questions that require implicit reasoning, from StrategyQA (Geva et al., 2021). We evaluate on the test sets of all benchmarks, as some proprietary models are partially trained on the training set of such tasks. (OpenAI, 2023)

4.2 Models

We specifically choose our models to observe the prompts’ impacts across differences in model size. The smaller model is GPT3.5 Turbo with version gpt-3.5-turbo-0613. Our larger model is GPT4 Turbo with version gpt-4-1106-preview (OpenAI, 2023). For both of the models we used the OpenAI API 111https://platform.openai.com/docs/api-reference/chat for running our experiments.

4.3 Prompts

For all datasets and models, we experiment with different variations of QAP. We utilize QAP25, QAP50, QAP100, QAP150, and QAP200. We compare the performance of QAP with the baseline (no prompt). Additionally we compare QAP with two different zero-shot prompts, TADB - "Take a deep breath and work on this problem step-by-step" (Yang et al., 2023) and PS+ (Plan and Solve Plus) (Wang et al., 2023). Finally we also compare QAP with 8-shot chain-of-thought prompting.

4.4 Results

The results for GPT-3.5 and GPT-4 Turbo are shown in Table  Table 1 and Table  Table 2 respectively. General word counts are shown in Figure 7.

Arithmetic Reasoning: On GPT 3.5 Turbo, a variant of QAP is the top performer in 2 out of 3 arithmetic tasks. QAP shows significant gains on AQuA and SAT. With GPT-4 Turbo, QAP performs the best in the same 2 out of 3 arithmetic tasks. This suggests that QAP may be more beneficial on questions involving algebraic and higher-level problem solving; additionally, GPT-4 is trained at least in part on GSM8K (OpenAI, 2023) and thus its performance may be less sensitive to prompting changes.

Prompt GSM8K AQuA SAT StratQA
Baseline 78.7 52.8 70.9 65.1
QAP25 67.1 39.4 35.0 63.1
QAP50 77.8 50.0 52.7 61.4
QAP100 77.4 53.9 75.0 57.1
QAP150 78.5 59.4 78.6 53.2
QAP200 76.8 52.4 75.0 51.8
TADB 78.5 57.1 74.5 62.9
CoT 79.0 53.1 65.9 59.2
PS+ 74.7 35.0 70.9 35.6
Table 1: Results for GPT-3.5 Turbo (highest scores bolded)
Prompt GSM8K AQuA SAT StratQA
Baseline 95.3 78.7 96.8 76.3
QAP25 94.8 77.6 94.5 77.6
QAP50 93.4 79.1 95.9 76.9
QAP100 94.6 75.6 96.8 77.2
QAP150 94.7 78.0 97.3 77.6
QAP200 95.0 76.4 98.2 75.9
TADB 95.1 78.7 96.8 78.0
CoT 95.6 74.4 95.0 75.1
PS+ 94.8 52.8 97.3 77.1
Table 2: Results for GPT 4 Turbo. (highest scores bolded)

Commonsense Reasoning:. On StrategyQA, QAP consistently performs second-best when compared to other prompts. On both models, QAP25 is the highest QAP performer. This suggests that fewer-word explanations benefit commonsense reasoning. This is because too much explanation can cause the model to confuse a simple answer 6 While there is a decline in performance as n𝑛nitalic_n increases on the 3.5 model, the larger GPT-4 Turbo model yields similar performances across all QAP variants.

5 Analysis

Question Difficulties Based On Baseline Performance: Within a given dataset, the difficulty of the individual question may vary. We propose a method to measure question difficulty based on performance with the baseline prompt. If the model can answer the problem correctly with the baseline prompt, then we consider the question to be easy; otherwise the question is hard. We analyze the performance of different prompts across “easy” and “hard” questions. Table 3 and Table 4 shows that QAP consistently outperforms other prompts in the “hard” category.

Impact Of Word Counts On Question Difficulties: QAP generates higher word counts for both “easy" and “hard" questions ( Table 5 and  Table 6 ), despite performing lower on “easy” questions. Although more step-by-step thought processes are encouraged to avoid mistakes during reasoning, this suggests that over-explanation can negatively impact the mode (also shown in Figure  Figure 5). Thus, the most suitable word count to solve a problem will vary from task to task; longer explanations are best suited to more complicated questions for which baseline prompting fails.

Downsides Of Smaller QAPs: Despite high performance on StrategyQA, QAP25 performs poorly on arithmetic datasets (mostly SAT and AQuA) using GPT-3.5 Turbo. Due to a small value of n, the model outputs are unfinished responses (i.e. the model stops midway through its reasoning steps) (shown in Figure 8) On SAT math, 51% of responses were incomplete for QAP25. On AQuA, 19% of responses were incomplete for QAP25.

Refer to caption
Figure 2: We consider difficulty of the problem based on baseline’s results. E.g., an incorrect answer is “hard” and a correct answer is “easy”. Left chart shows accuracy within each difficulty. Right chart shows mean (average) word count for within each difficulty. All results for each prompt are shown in Table: 6 and Table:4

6 Additional Studies

Placement of the prompt: In this evaluation, we studied the impact of prompt placement on performance using GSM8K dataset. Two options for prompt placement were considered, Q_Begin - adding the prompt before the question and Q_End - adding the prompt after the question. Both placements provided similar results on GPT-3.5 and GPT-4. Results shown in the rest of the paper are based on Q_End.

Two-stage QAP: In this approach, we performed the prompting in two stages. In the first stage the model is prompted with, “Explain this problem to me in at least 50 words WITHOUT SOLVING.” In the second stage, the model is prompted again with the question and the explanation from the first stage. On GSM8K and AQuA, the model not only explained the problem, but also outlined steps needed to solve it. However, the accuracy was almost 50% worse than single stage prompting.

7 Related Work

In one-shot and few-shot prompting, the model is given one or more input/output examples which will serve as a demonstration for it to solve the problem using in-context learning (Mahabadi et al., 2022). QAP is a zero-shot prompt. In zero-shot prompting the model does not receive exemplars, but is given a specially crafted instruction on how to approach the task (Kojima et al., 2022).

Chain of Thought: Chain-of-thought reasoning is a notable few-shot (zero-shot also exists (Yang et al., 2023) example in which the model is shown how to express its reasoning steps (Wei et al., 2022). This approach was highly effective as the model would replicate these exemplars, and their accuracies improved drastically. CoT encouraged the model to think step-by-step, and this concept would be repeating theme among other zero-shot counterparts.

TADB: Among different variants of Zero-Shot CoT, the TADB prompt (Yang et al., 2023) was derived using an optimization objective to find instructions that would maximize task accuracy. The eventual prompt was "Take a deep breath, and work on this problem step by step". TADB is an example of how the wording of a prompt can drastically impact responses.

Plan and Solve Prompting Plus: Another zero-shot prompt is Plan-and-Solve Prompting (Wang et al., 2022). There were two versions to this prompt. The first simply asked the model devise a plan and solve step-by-step. The second version (PS+) extended the prompt by specifically asking the prompt to extract relevant variables and their corresponding numerals and to calculate intermediate results. We used PS+ on our experiments. One difference between PS+ and QAP is that PS+ prompt is more specific to math datasets - as it instructs to extract variables, intermediate results etc, whereas QAP is more general. Also, PS+ prompts the model to understand the problem, but it is not clear if model should output anything specific to the question itself, but QAP explicitly instructs the model to explain the problem in n words.

Question Decomposition: Question Decomposition (Radhakrishnan et al., 2023) strategy causes the model to break down the question by creating sub-questions. The model answers each of these sub-questions and it ties together all the sub-answers into a final answer. It considers two methods for decomposition, Factored Decomposition and CoT Decomposition. In factored decomposition each sub-question is answered in a separate context. CoT decomposition is an intermediate between factored decomposition and CoT. It enforces one context for sub-question, sub-answer and the answer to the original question. The analysis of question decomposition shows reduced bias and ignored reasoning, improves the faithfulness of a model-generated reasoning over CoT while retaining the performance gains of CoT.

8 Conclusion

In this paper, we explored the approach of question analysis prompting to improve LLM accuracy across math and commonsense reasoning. The ability of this prompting method to perform well in diverse model types and tasks difficulty and type of tasks seems promising. To our best understanding, QAP is the first zero-shot prompt to introduce adaptability with a configurable parameter. We plan to extend this work further by combining QAP with other prompt strategies,  applying decoding strategies and evaluating multi-modal tasks.

9 Limitations

There are a few limitations of QAP. First, LLMs are sensitive to the prompt’s word choice, particularly for zero-shot prompts. As a result so small changes to the prompt wording can impact the model’s performance. For example, the current QAP prompt asks the model to "solve" for the answer. While this works well for math tasks, it may not be optimal for commonsense tasks. Secondly, the results in this paper are based on four datasets and a single class of aligned models; further results should evaluate on more diverse and multi-modal datasets, as well as a greater variety of models. Finally, more robust methods (e.g., based on a classifier) to determine the choice of the parameter n𝑛nitalic_n should be investigated to go beyond manual selection.

10 Ethics

We experimented on three arithmetic datasets: GSM8K (Cobbe et al., 2021), AQuA (Ling et al., 2017), and AGIEval SAT Math (Zhong et al., 2023). For commonsense reasoning, used StrategyQA (Geva et al., 2021). GSM8K use the MIT License code, while AQUA and StrategyQA use the Apache-2.0 code. QAP and the prompts used in this work do not jeopardize the safety of others. They do not include any wording which may deem offensive to any individual or group.

References

  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  • Geva et al. (2021) Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361.
  • Gu et al. (2023) Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, and Philip Torr. 2023. A systematic survey of prompt engineering on vision-language foundation models. arXiv preprint arXiv:2307.12980.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.
  • Ling et al. (2017) Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146.
  • Mahabadi et al. (2022) Rabeeh Karimi Mahabadi, Luke Zettlemoyer, James Henderson, Marzieh Saeidi, Lambert Mathias, Veselin Stoyanov, and Majid Yazdani. 2022. Perfect: Prompt-free and efficient few-shot learning with language models. arXiv preprint arXiv:2204.01172.
  • OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. ArXiv, abs/2303.08774.
  • Radhakrishnan et al. (2023) Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson E. Denison, Danny Hernandez, Esin Durmus, Evan Hubinger, John Kernion, Kamil.e Lukovsiut.e, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Sam McCandlish, Sheer El Showk, Tamera Lanham, Tim Maxwell, Venkat Chandrasekaran, Zac Hatfield-Dodds, Jared Kaplan, Janina Brauner, Sam Bowman, and Ethan Perez. 2023. Question decomposition improves the faithfulness of model-generated reasoning. ArXiv, abs/2307.11768.
  • Wang et al. (2023) Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  • Yang et al. (2023) Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. 2023. Large language models as optimizers. arXiv preprint arXiv:2309.03409.
  • Zhong et al. (2023) Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. 2023. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364.

Appendix A Appendix

A.1 Analysis of Accuracy Based On Question Difficulty

Performance of prompts on problems categorized into easy and hard - where easy problems are those where baseline prompt leads to a correct answer and hard problems are those where baseline prompt leads to a wrong answer. For each category the % of correct answers are calculated by number of correct answers(per prompt) over the total number of problems in that category (easy or hard)

Prompt Easy Hard
QAP25 84.7 30.1
QAP50 90.0 36.7
QAP100 91.5 39.5
QAP150 92.3 43.2
QAP200 91.1 41.3
TADB 93.6 34.9
CoT 92.6 35.0
PS+ 88.2 31.5
Table 3: Accuracy for Arithmetic Reasoning
Prompt Easy Hard
QAP25 89.5 24.3
QAP50 87.7 24.6
QAP100 83.8 26.9
QAP150 81.4 27.0
QAP200 80.0 25.0
TADB 91.3 20.3
CoT 85.8 27.3
PS+ 70.6 21.1
Table 4: Accuracy for Commonsense Reasoning

A.2 Analysis of Word Count based on Question Difficulty

Median word count generated by various prompts on all datasets and models categorized into easy and hard - where easy problems are those where baseline prompt leads to a correct answer and hard problems are those where baseline prompt leads to a wrong answer.

Prompt Easy Hard
QAP25 94.6 126.7
QAP50 123.6 158.5
QAP100 200.4 229.6
QAP150 224.4 257.9
QAP200 270.0 301.0
TADB 146.3 214.5
CoT 99.4 128.3
PS+ 197.8 216.3
Table 5: Mean word count for Arithmetic Reasoning
Prompt Easy Hard
QAP25 36.9 38.7
QAP50 71.5 73.8
QAP100 183.8 192.3
QAP150 215.8 220.4
QAP200 268.8 274.6
TADB 37.5 58.0
CoT 29.1 30.9
PS+ 162.4 179.0
Table 6: Mean word count for Commonsense Reasoning

A.3 Example Explanations

Refer to caption
Figure 3: Examples of QAP inducing explanations of the question on GSM8K, AQuA, and StrategyQA. The prompts include QAP50, QAP150, QAP50 respectively. Pink highlights key phrases (math reasoning) and orange highloghts represents useful background information (commonsense reasoning).

A.4 Impact of Changing n

Refer to caption
Figure 4: This comparison shows how responses vary when changing n. This is only the answer portion. This was experimented on QAP50 and QAP20 on GSM8K on AQuA. Blue represents a QAP200 section which provides more detail than QAP100’s (Red) response on the same step. Green represents a section that QAP200 had that QAP100 did not have at all.

A.5 Large value of n for simple problems hurts the performance

Refer to caption
Figure 5: Example in which over-explanation can negatively impact a response. QAP50 acquires the correct answer (34), but QAP200 does not. In fact, QAP200 reaches the correct answer, but additional explanation leads to a wrong answer.
Refer to caption
Figure 6: Example in which over-explanation negatively impacts a commonsense reasoning response. The comparison shows that more words can confuse the model.

A.6 Word Counts for all datasets with GPT 3.5 and GPT 4

Refer to caption
Figure 7: Median word counts in response for all datasets using GPT 3.5 Turbo and GPT 4 Turbo

A.7 QAP25 Unfinished Response

Refer to caption
Figure 8: Example in which QAP25 outputs an unfinished response on the SAT dataset.