A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks

Shubham Vatsal & Harsh Dubey
Department of Computer Science
New York University, CIMS
New York, USA
{sv2128,hd2225}@nyu.edu
Abstract

Large language models (LLMs) have shown remarkable performance on many different Natural Language Processing (NLP) tasks. Prompt engineering plays a key role in extending the existing abilities of LLMs to achieve significant performance gains on various NLP tasks. Prompt engineering requires composing natural language instructions called prompts to elicit knowledge from LLMs in a structured way. Unlike previous state-of-the-art (SoTA) models, prompt engineering does not require extensive parameter re-training or fine-tuning based on the given NLP task and thus operates solely on the embedded knowledge of LLMs. Additionally, LLM enthusiasts can extract LLMs’ knowledge through a basic natural language conversational exchange or prompt engineering, allowing more and more people, even those without a deep mathematical machine learning background, to experiment with LLMs. With prompt engineering gaining popularity in the last two years, researchers have come up with numerous techniques for designing prompts to improve the accuracy of information extraction from LLMs. In this paper, we summarize different prompting techniques and group them based on the NLP tasks that they have been used for. We further highlight the performance of these prompting strategies on various datasets belonging to each NLP task, discuss the corresponding LLMs used, present a taxonomy diagram and discuss the possible SoTA for specific datasets. In total, we read and present a survey of 44 research papers which cover 39 different prompting methods on 29 different NLP tasks, most of which have been published in the last two years.

1 Introduction

Artificial Intelligence has advanced significantly with the introduction of LLMs. LLMs are trained on huge corpora of text documents with millions and billions of tokens. It has been shown that as the number of model parameters increases, the performance of machine learning models improves, and such has been the case with these LLMs. They have attained unprecedented performance on a wide array of NLP tasks Chang et al. (2023), because of which they have attracted a lot of interest from academia and different industries including medicine, law, finance and more. The present phase of research on LLMs focuses on their reasoning capacity via prompts rather than just next token prediction, which has opened a new field of research around prompt engineering.

Prompt engineering is the process of creating natural language instructions, or prompts, to extract knowledge from LLMs in an organized manner. Prompt engineering, in contrast to earlier conventional models, relies only on the embedded knowledge of LLMs and does not require extensive parameter re-training or fine-tuning based on the underlying NLP task. Understanding model parameters in terms of real world knowledge embedded in them is beyond human capabilities and hence this new field of prompt engineering has caught everyone’s attention as it allows natural language exchange between researchers and LLMs to achieve the goals of the underlying NLP task.

In this work, we enumerate several prompting strategies and group them according to the NLP tasks that they have been used for. We provide a taxonomy diagram, tabulate the prompting techniques tried on various datasets for different NLP tasks, discuss the LLMs employed, and list potential SoTA methods for each dataset. As a part of this survey, we have reviewed and analyzed 44 research papers in total, the majority of which have been published in the previous two years and cover 39 prompting techniques applied on 29 different NLP tasks. There have been few prior systematic surveys on prompt engineering. Sahoo et al. (2024) surveys 29 prompting technique papers based on their applications. This is a very broad categorization, as a single application can encapsulate numerous NLP tasks. For example, one of the applications which they discuss is reasoning and logic, which can cover a plethora of NLP tasks like commonsense reasoning, mathematical problem solving, multi-hop reasoning etc. This is different from our approach, as we take a more granular categorization of prompting strategies based on the NLP tasks. Edemacu & Wu (2024) provides an overview of privacy protection prompting methods and thus focuses on a comparatively small sub-field of prompt engineering. Chen et al. (2023) limits the discussion of prompting strategies to some 9-10 methodologies and also does not categorize them based on the NLP tasks.

The rest of the paper is organized in the following way. Section 2 talks about various prompt engineering techniques and section 3 highlights different NLP tasks. The sub-sections of section 3 discuss different prompting strategies that have been applied on a given NLP task and their corresponding results. Section 4 concludes the paper.

2 Prompt Engineering Techniques

In this section, we briefly discuss different prompting methods and how they improved on existing performance at the time they were published. An important thing to note here is that most of the following prompting strategies have been evaluated in at least two different variations or settings: zero-shot and few-shot. Some of the prompting techniques may inherently exist in only the zero-shot or only the few-shot variation, with no possibility of any other variation. In the zero-shot setting Radford et al. (2019), there is no training data involved and an LLM is asked to perform a task through prompt instructions while relying completely on its embedded knowledge learnt during its pre-training phase. On the other hand, in the few-shot variation Brown et al. (2020), a few training datapoints are provided along with task-based prompt instructions for better comprehension of the task. The results from various prompt engineering works have shown that few-shot variations help improve performance, but this comes at the cost of carefully preparing few-shot datapoints, as the LLM can show unexplained bias towards the curated few-shot datapoints.
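The difference between the two settings can be illustrated with a minimal sketch. The function name `build_prompt`, the `Q:`/`A:` layout and the sentiment example are illustrative assumptions, not a format prescribed by any of the surveyed papers.

```python
def build_prompt(instruction, query, examples=()):
    """Build a zero-shot prompt (no examples) or a few-shot prompt.

    The Q:/A: layout is an assumed convention for illustration only.
    """
    parts = [instruction]
    for example_query, example_answer in examples:
        # Each few-shot datapoint is shown as a solved query.
        parts.append(f"Q: {example_query}\nA: {example_answer}")
    # The actual query is left unanswered for the LLM to complete.
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


instruction = "Classify the sentiment as positive or negative."
zero_shot = build_prompt(instruction, "The plot was dull.")
few_shot = build_prompt(
    instruction,
    "The plot was dull.",
    examples=[("I loved every minute.", "positive"),
              ("A waste of time.", "negative")],
)
```

The only difference between the two prompts is the presence of the solved datapoints, which is where the curation cost and the possible bias mentioned above come from.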

2.1 Basic/Standard/Vanilla Prompting

Basic prompting refers to the method of directly posing a query to the LLM without any engineering around it to improve the LLM’s performance, which is the core goal behind most of the prompting strategies. Basic prompting also goes by the name of Standard or Vanilla prompting in different research papers.

2.2 Chain-of-Thought (CoT)

In this prompting strategy Wei et al. (2022), the authors build on the idea of how human beings break a complex problem into smaller, easier sub-problems before arriving at the final solution. Along similar lines, the authors investigate how the ability of LLMs to do complicated reasoning is enhanced by producing a chain of thought, or a sequence of intermediate reasoning steps. The results show a considerable improvement over Basic prompting, with the maximum difference between CoT and Basic prompting results being around 39% for the Mathematical Problem Solving task and around 26% for the Commonsense Reasoning task. This work opened a new direction of research for the field of prompt engineering.
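As a rough illustration, a few-shot CoT prompt differs from a Basic few-shot prompt in that the exemplar answer spells out its intermediate steps. The exemplar text, the helper names and the "The answer is" suffix used for parsing are assumptions made for this sketch.

```python
import re

# One hand-written exemplar whose answer includes the reasoning steps.
COT_EXEMPLAR = (
    "Q: A shop has 3 boxes with 4 pens each. How many pens are there?\n"
    "A: Each box holds 4 pens and there are 3 boxes, so 3 * 4 = 12. "
    "The answer is 12."
)


def cot_prompt(query):
    """Prepend the reasoning exemplar so the model imitates its style."""
    return f"{COT_EXEMPLAR}\n\nQ: {query}\nA:"


def extract_answer(completion):
    """Pull the final number out of a completion ending in 'The answer is N'."""
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None
```

A completion such as "2 + 3 = 5. The answer is 5." would be parsed to the string "5", while a completion without the expected suffix yields `None`.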

2.3 Self-Consistency

The Self-Consistency Wang et al. (2022) prompting technique is based on the intuition that complex reasoning problems can be solved in multiple ways and hence the correct answer can be reached via different reasoning paths. Self-Consistency uses a novel decoding strategy, unlike the greedy one used by CoT, and consists of three important steps. The first step requires prompting the LLM using CoT, the second step samples diverse reasoning paths from the LLM’s decoder and the final step involves choosing the most consistent answer across multiple reasoning paths. Self-Consistency on average achieves an 11% gain on the Mathematical Problem Solving task, a 3% gain on the Commonsense Reasoning task and a 6% gain on the Multi-Hop Reasoning task when compared to CoT.
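The final step can be sketched as a majority vote over sampled answers. Here `sample_chain` is a hypothetical stand-in for a stochastic CoT call to an LLM that returns a (reasoning, answer) pair; it is not part of any real API.

```python
from collections import Counter


def self_consistency(sample_chain, query, num_samples=5):
    """Sample several reasoning paths and return the most frequent answer.

    sample_chain(query) -> (reasoning, answer) is a hypothetical,
    temperature-sampled LLM call supplied by the caller.
    """
    answers = [sample_chain(query)[1] for _ in range(num_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    # Also report the fraction of paths that agree with the winner.
    return answer, votes / num_samples
```

The agreement fraction is an extra returned here for convenience; it is the kind of signal that later methods in this section reuse to detect uncertain outputs.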

2.4 Ensemble Refinement (ER)

This prompting method has been discussed in Singhal et al. (2023). It builds on top of CoT and Self-Consistency. ER consists of two stages. First, given a few-shot CoT prompt and a query, the LLM is made to produce multiple generations by adjusting its temperature. Each generation contains a reasoning and an answer for the query. Next, the LLM is conditioned on the original prompt, query and the concatenated generations from the previous stage to generate a better explanation and an answer. This second stage is done multiple times, followed by a majority vote over these second-stage answers, just as in Self-Consistency, to select the final answer. ER is seen to perform better than CoT and Self-Consistency across many datasets belonging to the Context-Free Question-Answering task.
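The two stages can be sketched as follows. The `generate` callable, the draft formatting and the default sample counts are assumptions for illustration; the source does not specify them.

```python
from collections import Counter


def ensemble_refinement(generate, prompt, query, num_first=3, num_second=3):
    """Two-stage refinement followed by a majority vote.

    generate(text) -> (reasoning, answer) is a hypothetical LLM call
    sampled at a non-zero temperature.
    """
    # Stage 1: sample several reasoning/answer pairs for the query.
    first = [generate(f"{prompt}\n\nQ: {query}\nA:") for _ in range(num_first)]
    drafts = "\n".join(
        f"Draft {i + 1}: {reasoning} Answer: {answer}"
        for i, (reasoning, answer) in enumerate(first)
    )
    # Stage 2: condition on prompt, query and the concatenated drafts,
    # repeat several times, then vote over the refined answers.
    refined_prompt = f"{prompt}\n\nQ: {query}\n{drafts}\nRefined A:"
    second = [generate(refined_prompt)[1] for _ in range(num_second)]
    return Counter(second).most_common(1)[0][0]
```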

2.5 Automatic Chain-of-Thought (Auto-CoT)

In this work Zhang et al. (2022), the authors address the problem faced by few-shot CoT or manual CoT, which is the need to curate good quality training datapoints. Auto-CoT consists of two primary steps. The first one requires dividing the queries of a given dataset into a few clusters. The second one involves choosing a representative query from each cluster and then generating its corresponding reasoning chain using zero-shot CoT. The authors claim that Auto-CoT either outperforms or matches the performance of few-shot CoT across Mathematical Problem Solving, Multi-Hop Reasoning and Commonsense Reasoning tasks. This indicates that the step of curating training datapoints for few-shot or manual CoT can be avoided.

2.6 Complex CoT

Fu et al. (2022) introduces a new prompting strategy which aims at choosing complex datapoint prompts over simpler ones. The complexity of a datapoint is defined here by the number of reasoning steps involved with it. The authors hypothesize that the LLMs’ reasoning performance can increase if complex datapoints are used as in-context training examples, as they already subsume simpler datapoints. Another important aspect of Complex CoT, apart from using complex datapoints as training examples, is that during decoding, just like Self-Consistency, out of N sampled reasoning chains the majority answer over the top K most complex chains is chosen as the final answer. One other baseline prompting method introduced in this paper is Random CoT. In Random CoT, the datapoints are randomly sampled without regard to their complexity. Complex CoT achieves an average accuracy gain of 5.3% and up to 18% accuracy improvement across various datasets of Mathematical Problem Solving, Commonsense Reasoning, Table-Based Mathematical Problem Solving and Multi-Hop Reasoning tasks.
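The decoding rule described above can be sketched in a few lines. Representing a chain as a list of step strings, so that its length serves as the complexity measure, is an assumption of this sketch.

```python
from collections import Counter


def complexity_vote(chains, top_k):
    """Majority vote restricted to the top_k chains with the most steps.

    chains is a list of (reasoning_steps, answer) pairs, where
    reasoning_steps is a list with one entry per reasoning step.
    """
    ranked = sorted(chains, key=lambda chain: len(chain[0]), reverse=True)
    answers = [answer for _, answer in ranked[:top_k]]
    return Counter(answers).most_common(1)[0][0]


chains = [
    (["a"], "7"),
    (["a", "b", "c"], "9"),
    (["a", "b", "c", "d"], "9"),
    (["a", "b"], "7"),
    (["a"], "7"),
]
```

With these sample chains, a vote over all five favours "7", whereas restricting the vote to the three longest chains favours "9", which shows how the complexity filter can change the outcome relative to plain Self-Consistency.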

2.7 Program-of-Thoughts (PoT)

The authors of Chen et al. (2022a) build on CoT, but in contrast to CoT, which uses LLMs to perform both reasoning and computation, PoT generates Python programs and thus delegates the computation part to a Python interpreter. This work argues that reducing the LLM’s responsibilities makes it more accurate, especially for numerical reasoning. PoT achieves an average performance gain over CoT of around 12% across Mathematical Problem Solving, Table-Based Mathematical Problem Solving, Contextual Question-Answering and Conversational Contextual Question-Answering tasks.
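The split of responsibilities can be sketched as below. The string `generated` stands in for a program an LLM might write for a compound-interest question, and the convention that the result is stored in a variable named `ans` is an assumption of this sketch.

```python
def run_program(program, result_name="ans"):
    """Execute model-written Python and read back the answer variable.

    Builtins are emptied as a basic precaution; this is not a real
    sandbox and untrusted code should be isolated properly.
    """
    namespace = {}
    exec(program, {"__builtins__": {}}, namespace)
    return namespace[result_name]


# Hypothetical model output: the LLM writes the reasoning as code and
# leaves the arithmetic to the interpreter.
generated = (
    "principal = 1000\n"
    "rate = 0.05\n"
    "years = 3\n"
    "ans = principal * (1 + rate) ** years\n"
)

result = run_program(generated)
```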

2.8 Least-to-Most

Least-to-Most Zhou et al. (2022) prompting technique tries to address the problem of CoT where CoT fails to accurately solve problems harder than the exemplars shown in the prompts. It consists of two stages. First, the LLM is prompted to decompose a given problem into sub-problems. Next, the LLM is prompted to solve the sub-problems in a sequential manner. The answer to any sub-problem depends on the answer of the previous sub-problem. The authors show that Least-to-Most prompting is able to significantly outperform CoT and Basic prompting methods on Commonsense Reasoning, Language-Based Task Completion, Mathematical Problem Solving and Contextual Question-Answering tasks.
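The two stages can be sketched as a decomposition call followed by a sequential loop. The `llm` callable and the prompt wording are hypothetical placeholders rather than the prompts used in the paper.

```python
def least_to_most(llm, problem):
    """Decompose a problem, then solve the sub-problems in order.

    llm(prompt) -> str is a hypothetical completion function.
    """
    # Stage 1: ask for sub-problems, one per line.
    decomposition = llm(
        f"Break this problem into simpler sub-problems, one per line:\n{problem}"
    )
    subproblems = [line.strip() for line in decomposition.splitlines() if line.strip()]

    # Stage 2: solve sequentially, carrying earlier answers forward.
    context = problem
    answer = None
    for subproblem in subproblems:
        answer = llm(f"{context}\nQ: {subproblem}\nA:")
        # Earlier answers stay in the context so that later
        # sub-problems can build on them.
        context += f"\nQ: {subproblem}\nA: {answer}"
    return answer
```

The answer to the last sub-problem is returned as the answer to the original problem, which assumes the decomposition ends with the original question or an equivalent of it.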

2.9 Chain-of-Symbol (CoS)

CoS Hu et al. (2023) builds on the idea of CoT. In conventional CoT, the intermediate reasoning steps are represented in natural language. While this approach has shown remarkable results in many cases, it can include incorrect or redundant information as well. The authors of this work hypothesize that spatial descriptions are hard to express in natural language, which makes them difficult for LLMs to understand. Instead, expressing these relationships using symbols in word sequences can be a better form of representation for LLMs. CoS achieves an improvement of up to 60.8% accuracy for the Spatial Question-Answering task.

2.10 Structured Chain-of-Thought (SCoT)

The intuition behind SCoT Li et al. (2023b) is that structuring intermediate reasoning steps using program structures like sequencing, branching and looping helps in more accurate code generation than having intermediate reasoning steps in natural language, as we see in conventional CoT. The authors claim that the former approach more closely mimics a human developer’s thought process than the latter one, and this has been confirmed by the final results, as SCoT outperforms CoT by up to 13.79% for the Code Generation task.

2.11 Plan-and-Solve (PS)

Wang et al. (2023) discusses and tries to address three shortcomings of CoT which are calculation errors, missing-step errors and semantic misunderstanding errors. PS contains two components where the first one requires devising a plan to divide the entire problem into smaller sub-problems and the second one needs to carry out these sub-problems according to the plan. A better version of PS called PS+ adds more detailed instructions which helps in improving the quality of reasoning steps. PS prompting method improves the accuracy over CoT by at least 5% for almost all the datasets in Mathematical Problem Solving task in zero-shot setting. Similarly, for the Commonsense Reasoning task, it consistently outperforms CoT by at least 5% in zero-shot setting whereas for the Multi-Hop Reasoning task it gets around 2% better accuracy score.

2.12 MathPrompter

Imani et al. (2023) tries to address two key problems of CoT for the Mathematical Problem Solving task: (1) the lack of validity of the steps followed by CoT for solving a problem; (2) how confident an LLM is in its predictions. The MathPrompter prompting strategy consists of 4 steps in total. (I) Given a query, the first step generates an algebraic expression for the query which replaces the numerical values by variables. (II) Next, the LLM is prompted to solve the query analytically, either by deriving the algebraic expression or by writing a Python function. (III) Third, the query in step (I) is solved by assigning different values to the variables. (IV) If the solutions in (III) are correct over N iterations, the variables are finally replaced with the original query values and the answer is computed. If not, then steps (II), (III) and (IV) are repeated. MathPrompter is able to improve the performance on a dataset belonging to the Mathematical Problem Solving task from 78.7% to 92.5%.
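Steps (III) and (IV) can be sketched as a cross-check between the two model-written solutions. This sketch assumes that "correct" in step (III) means the algebraic expression and the Python function agree on random inputs; the function name `solve`, the value range and the fixed seed are also assumptions.

```python
import random


def mathprompter_check(algebraic_expr, python_source, original_values,
                       trials=5, seed=0):
    """Return the answer if both solutions agree on random inputs, else None.

    algebraic_expr and python_source stand in for LLM outputs from
    step (II); python_source is assumed to define solve(**variables).
    """
    namespace = {}
    exec(python_source, namespace)
    solve = namespace["solve"]
    rng = random.Random(seed)
    for _ in range(trials):
        # Step (III): assign random values to the variables.
        values = {name: rng.randint(1, 100) for name in original_values}
        expected = eval(algebraic_expr, {"__builtins__": {}}, dict(values))
        if expected != solve(**values):
            return None  # disagreement: the caller should re-prompt
    # Step (IV): substitute the original values from the query.
    return solve(**original_values)


answer = mathprompter_check(
    "A * B",
    "def solve(A, B):\n    return A * B\n",
    {"A": 6, "B": 4},
)
```

Returning `None` on disagreement leaves the retry policy to the caller, which corresponds to repeating steps (II) to (IV).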

2.13 Contrastive CoT/ Contrastive Self-Consistency

The authors of Chia et al. (2023) claim that Contrastive CoT or Contrastive Self-Consistency is a general enhancement of CoT or Self-Consistency. The inspiration for this prompting approach is based on how humans can learn from both positive as well as negative examples. Along similar lines, in this prompting technique, both positive and negative demonstrations are provided to enhance the reasoning capabilities of the LLM. Contrastive CoT achieves an average improvement of 10% over conventional CoT for the Mathematical Problem Solving task across multiple datasets. Similarly, Contrastive Self-Consistency is able to outperform conventional Self-Consistency by over 15% for the Mathematical Problem Solving task across multiple datasets. For the Multi-Hop Reasoning task, both Contrastive CoT and Contrastive Self-Consistency have more than 10% gains over their conventional counterparts.

2.14 Federated Same/Different Parameter Self-Consistency/CoT (Fed-SP/DP-SC/CoT)

Introduced in Liu et al. (2023), this prompting method is based on the core idea of improving the reasoning capabilities of LLMs by using synonymous crowd-sourced queries. There are two slightly different variations of this prompting method. The first one is Fed-SP-SC where the crowd-sourced queries are paraphrased versions of the original query but with same parameters. Parameters here can refer to the numeric values in Mathematical Problem Solving task datapoints. For Fed-SP-SC, the answers are directly generated first and then Self-Consistency is applied on top of it. The other one is Fed-DP-CoT. In Fed-DP-CoT, LLMs are used to first generate answers to different queries and then they are federated by forming CoT to provide hints to the LLMs. The results for these methods on Mathematical Problem Solving task show that they are able to do better than conventional CoT by at least 10% and up to 20%.

2.15 Analogical Reasoning

The authors of this work Yasunaga et al. (2023) draw their inspiration from a psychological notion, analogical reasoning, where people use pertinent prior experiences to solve new problems. In the realm of LLMs, the authors first prompt them to generate examples similar to that of the original problem followed by solving them and then proceed to answer the original problem. The results show that Analogical Reasoning is able to achieve an average accuracy gain of 4% when compared to CoT across Mathematical Problem Solving, Code Generation, Logical Reasoning and Commonsense Reasoning tasks.

2.16 Synthetic Prompting

The authors of Shao et al. (2023) come up with Synthetic prompting, which uses LLMs to generate synthetic examples that are added to the existing hand-crafted examples seen in a conventional few-shot setting. This prompting method involves two steps: (1) the backward step, where the LLM synthesizes a query based on a self-generated reasoning chain; and (2) the forward step, where the LLM generates a reasoning chain for the synthesized query, making the reasoning chain more accurate. Finally, to choose the best examples, this work uses an in-cluster complexity measure, and the most complex examples with the longest reasoning chains are used during inference. The results show Synthetic prompting achieving up to 15.6% absolute gains when experimented with different Mathematical Problem Solving, Commonsense Reasoning and Logical Reasoning task datasets.

2.17 Tree-of-Thoughts (ToT)

ToT Yao et al. (2024) prompting technique has been drawn from the idea that any kind of problem solving requires searching through a combinatorial space represented as a tree where each node represents a partial solution and each branch corresponds to an operator that modifies it. Now, the decision about which branch to choose is determined by heuristics that help to navigate the problem-space and guide the problem-solver towards a solution. Based on this idea, the authors propose ToT which actively maintains a tree of thoughts where each thought is a coherent language sequence that serves as an intermediate reasoning step toward problem solving. This framework allows LLMs to evaluate the progress generated by thoughts while trying to solve the problem. ToT further incorporates search techniques such as breadth-first or depth-first search with the model’s ability to generate and evaluate thoughts. ToT achieves 65% better success rate than CoT on Mathematical Problem Solving task and around 40% better success rate on different Logical Reasoning task datasets. ToT further achieves coherency score of 7.56 where CoT gets only 6.93 on an average on Free Response task.
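The breadth-first variant can be sketched as a beam-limited search over partial solutions. In this sketch `propose` and `score` stand in for the LLM's thought generator and evaluator, and the toy arithmetic problem (reach 10 from 1 using "+3" or "*2") replaces a real reasoning task; all of these are assumptions for illustration.

```python
def tree_of_thoughts_bfs(root, propose, score, is_solution,
                         beam_width=2, max_depth=3):
    """Breadth-first search over partial solutions ("thoughts").

    propose(state) -> list of next states and score(state) -> float
    are hypothetical stand-ins for LLM generation and evaluation.
    """
    frontier = [root]
    for _ in range(max_depth):
        candidates = [nxt for state in frontier for nxt in propose(state)]
        for candidate in candidates:
            if is_solution(candidate):
                return candidate
        # Keep only the most promising partial solutions for the next level.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
        if not frontier:
            break
    return None


def propose(state):
    # A state is (current value, path of operations taken so far).
    value, path = state
    return [(value + 3, path + "+3"), (value * 2, path + "*2")]


result = tree_of_thoughts_bfs(
    root=(1, ""),
    propose=propose,
    score=lambda state: -abs(10 - state[0]),
    is_solution=lambda state: state[0] == 10,
)
```

Pruning to `beam_width` states per level is what keeps the search tractable; a depth-first variant would instead follow one branch and backtrack when the evaluator judges it unpromising.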

2.18 Logical Thoughts (LoT)

In this work Zhao et al. (2023b), the authors investigate the usage of logical equivalence in order to improve the zero-shot reasoning abilities of an LLM. In addition to allowing the LLM to reason step-by-step, LoT also allows the LLM to verify step-by-step in accordance with the guidelines provided by the Reductio ad Absurdum principle and, if needed, amend the reasoning chain to ensure a valid inference. LoT is able to surpass CoT in Mathematical Problem Solving task by a maximum of 3.7%, Commonsense Reasoning task by a maximum of 16.2%, Logical Reasoning task by a maximum of 2.5%, Causal Reasoning task by a maximum of 15.8% and Social Reasoning task by a maximum of 10% accuracy.

2.19 Maieutic Prompting

By using deep recursive reasoning to elicit abductive explanations for a variety of hypotheses, Maieutic prompting Jung et al. (2022) encourages the LLM to produce consistent responses by collaboratively eliminating alternatives that contradict one another. The generation process of Maieutic prompting derives a tree structure of generated propositions, where one proposition establishes a logical ground for the correctness of another. Finally, to infer the answer to the original query, the degree to which the LLM believes each proposition and the logical connections between propositions in the maieutic tree are measured. The results for Maieutic prompting on the Commonsense Reasoning task show that it is able to achieve up to 20% better accuracy when compared to Basic prompting, CoT, Self-Consistency and GKP Liu et al. (2021) while performing competitively with supervised models.

2.20 Verify-and-Edit (VE)

Zhao et al. (2023a) focuses on developing a technique which can post-edit the reasoning chains generated by CoT for more factually aligned outputs. This method consists of three stages: (1) the deciding when to edit stage where the authors use Self-Consistency to find uncertain outputs; (2) the how to edit rationales stage where the authors edit CoT reasoning chains of uncertain outputs by searching for supporting facts from external knowledge sources and (3) the reasoning stage where the edited rationales from previous stage are used to come up with final answers. VE is able to outperform CoT, Self-Consistency and Basic prompting by up to 10% on Multi-Hop Reasoning task and by up to 2% on Truthfulness task.

2.21 Reason + Act (ReAct)

Yao et al. (2022b) presents ReAct, which combines reasoning and acting with LLMs to solve diverse language reasoning and decision making tasks. In order to enable the model to perform dynamic reasoning to build and modify high-level plans for acting (reason to act), ReAct prompts LLMs to generate verbal reasoning traces and actions related to a task in an interleaved manner. Another prompting method similar to ReAct discussed in Yao et al. (2022b) is Act, which removes the thoughts or reasoning from ReAct trajectories but performs worse than ReAct in all the discussed tasks. For Multi-Hop Reasoning and Truthfulness tasks, ReAct is able to perform better than Basic prompting while being competitive with CoT. When ReAct is combined with CoT or Self-Consistency, it is able to get better results than CoT. For the Language-Based Task Completion task, ReAct outperforms reinforcement learning methods with an absolute improvement of more than 10% in success rates individually on different datasets.

2.22 Active-Prompt

Diao et al. (2023) proposes Active-Prompt to help LLMs adapt to different tasks with task-specific examples by identifying the most relevant datapoints to be used as examples while prompting the LLM in a few-shot setting. Active-Prompt is a four-step technique. In the first step, the LLM is prompted k times for each query in the training set to generate k possible answers with their corresponding reasoning chains. The next step requires calculating the uncertainty metric based on the answers generated in step one. In the third step, the top n most uncertain queries are selected and annotated by humans. In the final step, the new annotated examples are used to do few-shot prompting for the test data. The authors also introduce a different version of Active-Prompt called Random CoT where, in step 3, the n queries are selected randomly rather than based on the uncertainty metric. The results show that Active-Prompt is able to get better results than Self-Consistency, CoT, Auto-CoT and Random CoT across multiple datasets for Mathematical Problem Solving, Commonsense Reasoning and Multi-Hop Reasoning tasks.
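Steps two and three can be sketched as below. The text above does not fix the uncertainty metric, so this sketch assumes a simple disagreement score, the number of distinct answers divided by k; the function names are likewise illustrative.

```python
def disagreement(answers):
    """Assumed uncertainty metric: share of distinct answers among k samples."""
    return len(set(answers)) / len(answers)


def select_for_annotation(sampled_answers, n):
    """Pick the n queries whose sampled answers disagree the most.

    sampled_answers maps each training query to its k sampled answers.
    """
    ranked = sorted(
        sampled_answers,
        key=lambda query: disagreement(sampled_answers[query]),
        reverse=True,
    )
    return ranked[:n]


sampled = {
    "q1": ["4", "4", "4", "4"],
    "q2": ["1", "2", "3", "4"],
    "q3": ["7", "7", "8", "7"],
}
to_annotate = select_for_annotation(sampled, n=2)
```

The selected queries would then be handed to human annotators, and the annotated examples reused as few-shot demonstrations.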

2.23 Thread-of-Thought (ThoT)

Zhou et al. (2023) proposes a prompting method focusing on handling long chaotic contexts. It is based on the idea that there is an unbroken flow of thought that people retain when going through a large amount of information, enabling the selective extraction of pertinent data and the rejection of irrelevant ones. This balance of attention across a document’s sections is important for accurate interpretation and response to the information supplied. ThoT consists of two steps. The first one requires the LLM to analyze and summarize the different sections of the context. In the second step, the LLM is prompted to answer the asked query based on the output of the first step. ThoT is able to outperform CoT and Basic prompting techniques by achieving a score of around 0.56 exact match in the Context-Free Question-Answering task. For the Dialogue System task, ThoT is able to get the highest average score of 3.8, again surpassing other discussed prompting techniques.

2.24 Implicit Retrieval Augmented Generation (Implicit RAG)

Contrary to conventional RAG Lewis et al. (2020), Implicit RAG Vatsal & Singh (2024); Vatsal et al. (2024) asks the LLM itself to retrieve important chunks or sections from the given context and then proceed to answer the asked query. This technique requires tuning of two hyper-parameters. The first one is the number of sections to extract, whereas the second one is the number of words in each section. Implicit RAG achieves SoTA results on the Contextual Question-Answering task in Vatsal et al. (2024) on the Patient Case Reports dataset, and achieves either SoTA or close to SoTA results on biomedical Contextual Question-Answering task datasets in Vatsal & Singh (2024).
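The role of the two hyper-parameters can be illustrated with a prompt template. The wording of the instruction and the default values are assumptions of this sketch, not the prompts used in the cited works.

```python
def implicit_rag_prompt(context, query, num_sections=3, words_per_section=50):
    """Ask the LLM to retrieve sections itself before answering.

    num_sections and words_per_section are the two tunable
    hyper-parameters; the phrasing is illustrative only.
    """
    return (
        f"Context:\n{context}\n\n"
        f"First, extract the {num_sections} sections of the context most "
        f"relevant to the question, each at most {words_per_section} words long. "
        f"Then answer the question using only those sections.\n\n"
        f"Question: {query}"
    )
```

Unlike conventional RAG, no external retriever or index appears here: retrieval and answering happen inside a single LLM call.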

2.25 System 2 Attention (S2A)

LLMs can often end up making erroneous judgments when presented with irrelevant context. Weston & Sukhbaatar (2023) tries to address this issue with a two-step prompting strategy. The first step instructs the LLM to regenerate a given context such that the regenerated version does not contain any irrelevant parts that could adversely affect the output. The second step then instructs the LLM to produce the final response using the regenerated context from step 1. The results show that S2A is able to outperform Basic, CoT as well as Instructed prompting Shi et al. (2023) over different Truthfulness task datasets.
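The two steps map directly onto two LLM calls, as in the following sketch. The `llm` callable and the prompt wording are hypothetical.

```python
def system2_attention(llm, context, query):
    """Regenerate the context, then answer from the regenerated version.

    llm(prompt) -> str is a hypothetical completion function.
    """
    # Step 1: regenerate the context without irrelevant material.
    regenerated = llm(
        "Rewrite the context below so that it keeps only the parts that are "
        "relevant to the question.\n"
        f"Context: {context}\nQuestion: {query}"
    )
    # Step 2: answer using the regenerated context only; the original
    # context is deliberately not included in this prompt.
    return llm(f"Context: {regenerated}\nQuestion: {query}\nAnswer:")
```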

2.26 Instructed Prompting

Instructed prompting Shi et al. (2023) again revolves around the same idea as that of S2A, which tries to address the issue of LLMs getting distracted by irrelevant context. It consists of only one step of explicitly instructing the language model to ignore irrelevant information in the problem description. Instructed prompting is able to achieve 88.2 normalized micro accuracy for the Truthfulness task and is able to surpass all its counterparts including CoT, Least-To-Most, Program prompting and Self-Consistency. The Program prompting Chowdhery et al. (2023) strategy here tries to solve a problem by writing a Python program for it. Later, the correctness of the written program is verified by running the Python code using an external Python interpreter to obtain the final answer.

2.27 Chain-of-Verification (CoVe)

LLMs are prone to generating factually incorrect information, called hallucination. The authors of Dhuliawala et al. (2023) try to address this problem of hallucination and improve performance via CoVe. CoVe performs four core steps. First, the LLM generates a baseline response for a given query. Second, using both the original query and the baseline response from step one, it generates a list of verification queries that are capable of checking if there are any errors in the baseline response. Third, it generates answers to all the verification queries from step two. Fourth, it corrects all the mistakes in the baseline response detected after step three and produces a revised response. The results show that CoVe is able to outperform CoT and Basic prompting by roughly 10% or more on Context-Free Question-Answering, Contextual Question-Answering and Free Response tasks.
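The four steps can be sketched as a short pipeline. The `llm` callable, the prompt wording and the one-question-per-line format are assumptions made for illustration.

```python
def chain_of_verification(llm, query):
    """Draft, plan verification questions, answer them, then revise.

    llm(prompt) -> str is a hypothetical completion function.
    """
    # Step 1: baseline response.
    baseline = llm(f"Q: {query}\nA:")
    # Step 2: verification questions derived from query and baseline.
    plan = llm(
        f"Question: {query}\nDraft answer: {baseline}\n"
        "List verification questions that would expose errors in the draft, "
        "one per line."
    )
    checks = [line.strip() for line in plan.splitlines() if line.strip()]
    # Step 3: answer every verification question.
    findings = [(check, llm(f"Q: {check}\nA:")) for check in checks]
    evidence = "\n".join(f"{check} -> {reply}" for check, reply in findings)
    # Step 4: revise the baseline in light of the findings.
    return llm(
        f"Question: {query}\nDraft answer: {baseline}\n"
        f"Verification results:\n{evidence}\nRevised answer:"
    )
```

For a draft with m verification questions this makes m + 3 LLM calls, which is the main cost of the method relative to a single CoT call.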

2.28 Chain-of-Knowledge (CoK)

Similar to CoVe, CoK Li et al. (2023c) tries to address the issue of hallucination to get more accurate results. It is a three-stage prompting technique. The first stage is reasoning preparation where, given a query, CoK prepares several preliminary rationales and answers while identifying the relevant knowledge domains. The second stage is dynamic knowledge adaptation where, if there is no majority consensus among the answers, CoK corrects the rationales step by step by adapting knowledge from the domains identified in stage one. The third stage is answer consolidation, which uses the corrected rationales from stage two as a better foundation for the final answer. CoK surpasses CoT, Self-Consistency, VE and Basic prompting across Context-Free Question-Answering, Table-Based Question-Answering, Multi-Hop Reasoning and Truthfulness tasks and shows an improvement of at least 3%, 3%, 1% and 1% respectively.

2.29 Chain-of-Code (CoC)

In this work Li et al. (2023a), the authors propose an extension to make an LLM’s code-oriented reasoning better. Here, the LLM not only writes code for a program but also selectively simulates the interpreter by producing the expected outputs of certain lines of code which cannot actually be executed by an interpreter. The main idea is to motivate LLMs to format semantic sub-tasks in a program as flexible pseudocode that may be explicitly caught and passed off to an LLM for emulation at runtime, which the authors call an LMulator. Experiments demonstrate CoC surpassing CoT and other baselines across a variety of tasks including Recommender System, Causal Reasoning, Commonsense Reasoning, Spatial Question-Answering, Emotion/Sentiment Understanding, Machine Translation, Logical Reasoning, Table-Based Mathematical Problem Solving and Mathematical Problem Solving.

2.30 Program-aided Language Models (PAL)

Gao et al. (2023) proposes a prompting strategy that uses an LLM to read natural language problems and generate interleaved natural language and programming language statements as reasoning steps. Finally, a Python interpreter is used to execute the programming statements to get the answer. The results show that PAL performs better than its counterparts like CoT and Basic prompting across multiple NLP tasks including Mathematical Problem Solving, Table-Based Mathematical Problem Solving, Commonsense Reasoning and Logical Reasoning.

2.31 Binder

The authors claim Binder Cheng et al. (2022) to be a training-free neural-symbolic technique that maps an input to a program which (I) enables binding of a single API of LLM functionalities to a programming language such as Python or SQL in order to increase its coverage of grammar and to address a wider range of queries; (II) uses an LLM as the underlying model as well as the program parser during execution; (III) needs only a few in-context sample annotations. The Binder pipeline has two stages. First, in the parsing stage, the LLM maps the input to a program given the query and knowledge sources. Second, in the execution stage, the LLM returns values in the chosen programming language and finally the program is run using an interpreter. Binder is able to get better accuracy when compared to previous methodologies which required explicit training or fine-tuning for Table-Based Truthfulness and Table-Based Question-Answering tasks.

2.32 Dater

Ye et al. (2023) explore the idea of few-shot learning with LLMs to decompose evidence and queries for efficient table-based reasoning. This prompting strategy involves three important steps. It starts with decomposing a huge table into relevant smaller sub-tables given the query. Next, the SQL programming language is used to decompose the complex natural language query into logical and numerical computations. Finally, the sub-tables and sub-queries from the previous two steps are used to arrive at the final answer in a few-shot setting. The results show that Dater surpasses previous methodologies that required explicit fine-tuning by at least 2% on the Table-Based Truthfulness task. Similarly, for the Table-Based Question-Answering task, it outperforms such methods by at least 1%. Dater also does better than Binder on both of the above-mentioned tasks.
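The three steps can be sketched as follows. The helper names, the toy table and the statement are assumptions for illustration, and the choices an LLM would make (which rows and columns to keep, which SQL to emit) are hard-coded; the final LLM call is represented only by the prompt that would be sent.

```python
import sqlite3
from typing import Dict, List

Table = List[Dict[str, object]]


def select_sub_table(table: Table, columns: List[str], row_ids: List[int]) -> Table:
    """Step 1: keep only the rows and columns judged relevant to the query."""
    return [{c: table[i][c] for c in columns} for i in row_ids]


def run_sub_query(sub_table: Table, sql: str):
    """Step 2: evaluate the numerical part of the question with SQL."""
    conn = sqlite3.connect(":memory:")
    columns = list(sub_table[0].keys())
    conn.execute(f"CREATE TABLE t ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO t VALUES ({placeholders})",
        [tuple(row[c] for c in columns) for row in sub_table],
    )
    (value,) = conn.execute(sql).fetchone()
    conn.close()
    return value


table: Table = [
    {"team": "A", "year": 2020, "wins": 12, "coach": "X"},
    {"team": "B", "year": 2020, "wins": 9, "coach": "Y"},
    {"team": "A", "year": 2021, "wins": 15, "coach": "X"},
]
statement = "Team A won more than 25 games in total."

# Hard-coded stand-ins for the LLM's decomposition decisions.
sub_table = select_sub_table(table, columns=["team", "wins"], row_ids=[0, 2])
total_wins = run_sub_query(sub_table, "SELECT SUM(wins) FROM t")

# Step 3: the sub-table and sub-query result would go into a few-shot prompt.
final_prompt = (
    f"Sub-table: {sub_table}\n"
    f"Sub-query result: total wins = {total_wins}\n"
    f"Statement: {statement}\nIs the statement true or false?"
)
```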

2.33 Chain-of-Table

In Wang et al. (2024), the authors build upon the well-known CoT prompting technique and bring it to the tabular setting. This multi-step tabular prompting approach leads to more accurate table understanding. Chain-of-Table is a three-step prompting technique. The first step instructs the LLM to dynamically plan the next table operation by in-context learning. An operation here could be anything from adding columns to sorting rows. The second step generates arguments for the selected table operation. The first two steps help in transforming the table and creating various intermediate table representations with the goal of answering the original query. In the final step, the last table representation from the first two steps is used to answer the query. Chain-of-Table achieves SoTA performance on Table-Based Question-Answering and Table-Based Truthfulness tasks. For the Table-Based Question-Answering task, it improves on the prior SoTA results by around 3% on average, whereas for the Table-Based Truthfulness task the average improvement is around 1.5%.
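The plan, argument-generation and answer steps can be sketched as a small loop. Everything below is an illustrative assumption: the operation pool is a toy one and may differ from the pool used by Wang et al. (2024), and the three `scripted_*` functions replay fixed decisions where a real system would prompt an LLM.

```python
from typing import Callable, Dict, List, Tuple

Table = List[Dict[str, object]]

# A toy operation pool; each operation maps (table, args) to a new table.
OPERATIONS: Dict[str, Callable[[Table, dict], Table]] = {
    "select_rows": lambda t, a: [r for r in t if r[a["column"]] == a["value"]],
    "sort_by": lambda t, a: sorted(
        t, key=lambda r: r[a["column"]], reverse=a.get("descending", False)
    ),
}


def chain_of_table(table, question, plan_next, generate_args, answer, max_steps=5):
    """Iteratively transform the table, then answer from the final table."""
    chain: List[Tuple[str, dict]] = []
    for _ in range(max_steps):
        op = plan_next(table, question, chain)     # step 1: plan the next operation
        if op == "END":
            break
        args = generate_args(table, question, op)  # step 2: generate its arguments
        table = OPERATIONS[op](table, args)        # intermediate table representation
        chain.append((op, args))
    return answer(table, question), chain          # step 3: answer the query


# Hypothetical stand-ins for the three LLM calls.
def scripted_plan(table, question, chain):
    return ["select_rows", "sort_by", "END"][len(chain)]


def scripted_args(table, question, op):
    return {
        "select_rows": {"column": "country", "value": "Italy"},
        "sort_by": {"column": "time"},
    }[op]


def scripted_answer(table, question):
    return table[0]["rider"]


riders: Table = [
    {"rider": "Rossi", "country": "Italy", "time": 41.2},
    {"rider": "Dupont", "country": "France", "time": 40.1},
    {"rider": "Bianchi", "country": "Italy", "time": 40.8},
]
result, chain = chain_of_table(
    riders, "Which Italian rider was fastest?", scripted_plan, scripted_args, scripted_answer
)
```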

2.34 Decomposed Prompting (DecomP)

Khot et al. (2022) propose the DecomP technique, which decomposes a complex problem into simpler sub-problems and then delegates these to sub-problem-specific LLMs, which have their own prompts and decomposers to further decompose the sub-problems. The decomposers can resort to hierarchical decomposition or recursive decomposition, or make external API calls to solve the sub-problem. DecomP outperforms CoT and Least-to-Most by 25% on average in terms of exact match on the Commonsense Reasoning task. For the Multi-Hop Reasoning task, DecomP comfortably outperforms CoT on four different datasets.
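A minimal sketch of the decompose-and-delegate control flow is given below, using a last-letter-concatenation style task. The task names, the `decomp_solve` function and the plain Python handlers are assumptions: in DecomP each handler would be a sub-task-specific prompted LLM or an external API call, and the decomposition itself would be produced by a decomposer prompt.

```python
from typing import Any, Callable, Dict, List

Handler = Callable[[Any], Any]


def decomp_solve(
    task: str,
    payload: Any,
    decomposers: Dict[str, List[str]],
    handlers: Dict[str, Handler],
) -> Any:
    """Solve `task` directly if a handler exists; otherwise decompose it.

    Sub-tasks are solved in order, each receiving the previous result, and a
    sub-task may itself be decomposed further (hierarchical decomposition).
    """
    if task in handlers:
        return handlers[task](payload)
    for sub_task in decomposers[task]:
        payload = decomp_solve(sub_task, payload, decomposers, handlers)
    return payload


# Hypothetical decomposition table and sub-task handlers.
decomposers = {
    "last_letter_concat": ["split_words", "collect_last_letters"],
    "collect_last_letters": ["take_last_letters", "merge_letters"],
}
handlers: Dict[str, Handler] = {
    "split_words": lambda text: text.split(),
    "take_last_letters": lambda words: [w[-1] for w in words],
    "merge_letters": lambda letters: "".join(letters),
}

answer = decomp_solve("last_letter_concat", "Jack Ryan Smith", decomposers, handlers)
```

Because each sub-task has its own handler, a weak handler can be replaced or re-prompted without touching the others, which is the modularity argument behind the approach.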

2.35 Three-Hop Reasoning (THOR)

The authors of Fei et al. (2023) propose THOR to mimic a human-like reasoning process for the Emotion/Sentiment Understanding task. THOR consists of three steps. In the first step, the LLM is asked to identify the aspect mentioned in the given query. Next, based on the output of the previous step and the original query, the LLM is asked to describe in detail the underlying opinion embedded in the query. Finally, all of the above information is combined and the LLM is asked to infer the sentiment polarity associated with the given query. THOR significantly surpasses prior SoTA supervised as well as zero-shot models on multiple Emotion/Sentiment Understanding task datasets.
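The three hops can be sketched as chained prompts in which each hop feeds on the previous outputs. The prompt wording below is a paraphrase written for this example and not the authors’ templates, and `fake_llm` is a hypothetical stub that replays canned replies.

```python
from typing import Callable, List, Tuple


def thor(sentence: str, target: str, llm: Callable[[str], str]) -> Tuple[str, List[str]]:
    """Run three chained prompts: aspect, then opinion, then polarity."""
    context = f'Sentence: "{sentence}"'
    aspect = llm(f"{context}\nWhich aspect of {target} is mentioned?")
    opinion = llm(
        f"{context}\nAspect: {aspect}\n"
        f"What is the underlying opinion towards this aspect of {target}, and why?"
    )
    polarity = llm(
        f"{context}\nAspect: {aspect}\nOpinion: {opinion}\n"
        f"What is the sentiment polarity towards {target}?"
    )
    return polarity, [aspect, opinion]


prompts_seen: List[str] = []
canned = iter(["battery life", "the battery drains too quickly", "negative"])


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: logs the prompt, replays canned text."""
    prompts_seen.append(prompt)
    return next(canned)


polarity, hops = thor(
    "I had to charge the laptop twice before lunch.", "the laptop", fake_llm
)
```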

2.36 Metacognitive Prompting (MP)

MP Wang & Zhao (2023) is based on the concept of metacognition, which is derived from cognitive psychology and relates to an individual’s awareness of and self-reflection on their cognitive processes. It consists of five stages: 1) understanding the input text, 2) making a preliminary judgment, 3) critically evaluating this preliminary analysis, 4) reaching a final decision accompanied by an explanation of the reasoning, and 5) evaluating the confidence level in the entire process. The results show that MP consistently outperforms CoT and PS across numerous NLP tasks including Paraphrasing, Natural Language Inference, Contextual Question-Answering, Word Sense Disambiguation, Named Entity Recognition, Relation Extraction and Multilabel Text Classification.
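One way to picture the five stages is as a single prompt that asks the model to work through them in order. The sketch below assumes this single-prompt form and uses stage wording paraphrased from the description above; `build_mp_prompt` and the example task are hypothetical and do not reproduce the authors’ template.

```python
MP_STAGES = [
    "Clarify your understanding of the input text.",
    "Make a preliminary judgment.",
    "Critically evaluate this preliminary analysis.",
    "State your final decision and explain your reasoning.",
    "Rate your confidence in the entire process.",
]


def build_mp_prompt(task_instruction: str, text: str) -> str:
    """Compose one prompt that walks the model through the five stages."""
    numbered = "\n".join(
        f"{i}. {stage}" for i, stage in enumerate(MP_STAGES, start=1)
    )
    return (
        f"{task_instruction}\n\n"
        f"Input: {text}\n\n"
        f"Work through these stages in order:\n{numbered}"
    )


prompt = build_mp_prompt(
    "Decide whether the two sentences are paraphrases of each other.",
    "Sentence A: The cat sat on the mat. Sentence B: A cat was sitting on the mat.",
)
```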

2.37 Chain-of-Event (CoE)

Bao et al. (2024) propose CoE for the Summarization task. CoE has four sequential steps. The first one focuses on specific event extraction. Next, the events extracted in step one are analyzed and generalized into a more concise and refined form. Third, the events generalized in the previous step are filtered and only those that cover most of the text are selected. In the last step, the events selected in step three are integrated based on their chronological order of importance. The results show that CoE performs better than CoT across two Summarization datasets in terms of ROUGE score while also being more concise.
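The four steps form a pipeline in which each step consumes the output of the one before it. The sketch below shows only that control flow: the instruction strings are paraphrases written for this example, and `ScriptedLLM` is a hypothetical stub that records prompts and replays fixed replies.

```python
from typing import Callable, List, Tuple

COE_STEPS: List[str] = [
    "Extract the specific events mentioned in the text below.",
    "Generalize the events below into a more concise and refined form.",
    "Filter the events below, keeping only those that cover most of the text.",
    "Integrate the events below into a short summary.",
]


def chain_of_event(text: str, llm: Callable[[str], str]) -> Tuple[str, List[str]]:
    """Run the four steps in order, feeding each output to the next step."""
    current = text
    trace: List[str] = []
    for instruction in COE_STEPS:
        current = llm(f"{instruction}\n\n{current}")
        trace.append(current)
    return current, trace


class ScriptedLLM:
    """Hypothetical stand-in for an LLM: logs prompts, replays fixed replies."""

    def __init__(self, replies: List[str]) -> None:
        self.replies = list(replies)
        self.prompts: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.replies[len(self.prompts) - 1]


llm = ScriptedLLM(["raw events", "general events", "key events", "final summary"])
summary, trace = chain_of_event("A long dialogue or article goes here.", llm)
```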

2.38 Basic with Term Definitions

This is one of the prompting methods discussed in Vatsal et al. (2024). In this method, basic prompt instructions are enhanced with medical term definitions, based on the hypothesis that adding these definitions would help the LLM gain more context while answering the query. However, the results show that these term definitions do not really help, possibly because their narrow knowledge scope may conflict with the larger knowledge base of the LLM.

2.39 Basic + Annotation Guideline-Based Prompting + Error Analysis-Based Prompting

Hu et al. (2024) test LLM capabilities on the clinical Named Entity Recognition task. This prompting strategy has three different components. The basic component gives the LLM rudimentary information about the task and the format in which it should output the results. The annotation guideline component contains entity definitions and linguistic rules derived from the annotation guidelines. The error analysis component incorporates additional instructions following error analysis of LLM outputs using the training data. The authors have also experimented with different versions of this prompting method by creating different combinations of the above-mentioned components. This prompting method achieves an average exact match F1 score of 0.57 on multiple datasets belonging to the Named Entity Recognition task.
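The component structure lends itself to simple prompt composition, sketched below. The component texts are placeholders invented for this example, and `all_versions` enumerates every subset of the optional components, which may be more combinations than the authors actually evaluated.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def build_prompt(basic: str, optional: Dict[str, str], use: Sequence[str]) -> str:
    """Join the basic component with the selected optional components."""
    sections = [basic] + [f"{name}:\n{optional[name]}" for name in use]
    return "\n\n".join(sections)


def all_versions(optional: Dict[str, str]) -> List[Tuple[str, ...]]:
    """Every subset of the optional components, from none to all."""
    names = list(optional)
    return [
        combo for size in range(len(names) + 1) for combo in combinations(names, size)
    ]


# Placeholder component texts (assumptions, not taken from Hu et al. (2024)).
basic = "Task: mark every clinical entity in the note. Output one entity per line."
optional = {
    "Annotation guidelines": "Placeholder entity definitions and linguistic rules.",
    "Error analysis notes": "Placeholder instructions derived from observed errors.",
}

versions = all_versions(optional)
full_prompt = build_prompt(basic, optional, versions[-1])
```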

3 Prompt Engineering on Different NLP Tasks

Research papers differ in how they categorize a dataset under an NLP task, and the categorization varies from one work to another. In this section, we try to standardize this and put a structure around these prior ways of categorization by defining different NLP tasks and placing different datasets under these tasks. We further discuss the various prompting methods that have been used for these tasks. A taxonomy diagram reflecting this can be seen in Figure 1. An important thing to note here is that a dataset can quite possibly belong to several NLP tasks at the same time. That, however, would entangle the structured analysis of how prompting techniques perform across various NLP tasks. Therefore, in our work, we make sure that a dataset belongs to only one NLP task, the one with which it is most strongly associated. The following sub-sections each define a different NLP task, the corresponding datasets and the various prompting strategies that have been applied to those datasets. They further contain the potential SoTA prompting technique for each dataset. The performance of a prompting method varies based on the LLM used. Therefore, we have also included a list of LLMs that were used along with prompting strategies on a given dataset. For the SoTA, we have only mentioned the name of the prompting method, as in many cases a particular LLM has not been experimented with a given prompting method, making it unclear if it could have achieved SoTA performance. Hence, if any LLM from the list of LLMs, along with a prompting strategy, has been used to experiment with the given dataset and achieved the best performance, we have designated that as the SoTA regardless of the exact LLM used for that technique. Another point to highlight is that in many works, the authors have experimented with different versions of the same dataset, making an absolute comparison between the different prompting techniques applied to them difficult. Based on our understanding, we have considered all the above-mentioned factors and used our best judgment when selecting the SoTA for each dataset.

NLP Tasks
- Table-Based Truthfulness: Basic, CoT, Binder, Dater, Chain-of-Table [Wang et al. (2024), Cheng et al. (2022), Ye et al. (2023)]
- Truthfulness: S2A, CoT, Instructed Prompting, Basic, Act, ReAct, Self-Consistency, VE, CoK, Least-to-Most [Weston & Sukhbaatar (2023), Shi et al. (2023)]
- Free Response: Basic, CoT, Self-Consistency, ToT, CoVe [Yao et al. (2024), Dhuliawala et al. (2023)]
- Code Generation: Analogical Reasoning, CoT, Basic, SCoT [Yasunaga et al. (2023), Li et al. (2023b)]
- Dialogue System: Basic, CoT, ThoT [Zhou et al. (2023)]
- Conversational Contextual Question-Answering: PoT, CoT, Self-Consistency, PAL [Chen et al. (2022a)]
- Spatial Question-Answering: CoT, CoS, Basic, CoC [Hu et al. (2023), Li et al. (2023a)]
- Context-Free Question-Answering: Basic, CoT, ThoT, CoVe, Self-Consistency, VE, CoK, ER [Wang et al. (2022), Zhou et al. (2023), Dhuliawala et al. (2023), Li et al. (2023a), Nori et al. (2023), Singhal et al. (2023), Liévin et al. (2024)]
- Contextual Question-Answering: Basic, Implicit RAG, CoT, Analogical Reasoning, CoVe, PoT, Self-Consistency, Basic with Term Definitions, Least-to-Most, PS, MP [Vatsal & Singh (2024), Dhuliawala et al. (2023), Chen et al. (2022a), Vatsal et al. (2024), Zhou et al. (2022), Wang & Zhao (2023)]
- Social Reasoning: CoT, LoT [Zhao et al. (2023b)]
- Causal Reasoning: CoT, LoT, Basic, CoC [Zhao et al. (2023b), Li et al. (2023a)]
- Multi-Hop Reasoning: Basic, CoT, Auto-CoT, Self-Consistency, Contrastive CoT, Contrastive Self-Consistency, Random CoT, Active-Prompt, Complex CoT, Act, ReAct, VE, CoK, Least-to-Most, DecomP, PS [Wei et al. (2022), Zhang et al. (2022), Wang et al. (2022), Yao et al. (2022b), Li et al. (2023c), Chia et al. (2023), Diao et al. (2023), Fu et al. (2022), Khot et al. (2022), Wang et al. (2023), Zhao et al. (2023a)]
- Commonsense Reasoning: CoT, DecomP, Basic, Self-Consistency, GKP, Maieutic Prompting, CoC, LoT, Auto-CoT, PS, Random CoT, Active-Prompt, Least-to-Most, PAL, Complex CoT, PoT, Analogical Reasoning, Synthetic Prompting [Yasunaga et al. (2023), Wei et al. (2022), Zhang et al. (2022), Wang et al. (2022), Zhao et al. (2023b), Li et al. (2023a), Gao et al. (2023), Diao et al. (2023), Shao et al. (2023), Jung et al. (2022), Zhou et al. (2022), Fu et al. (2022), Khot et al. (2022), Wang et al. (2023)]
- Logical Reasoning: Basic, CoT, PAL, Synthetic Prompting, CoC, LoT, ToT, Analogical Reasoning [Yasunaga et al. (2023), Yao et al. (2024), Zhao et al. (2023b), Li et al. (2023a), Gao et al. (2023), Shao et al. (2023)]
- Mathematical Problem Solving: CoT, Random CoT, Complex CoT, Basic, PAL, Synthetic Prompting, Contrastive CoT, Contrastive Self-Consistency, CoC, Auto-CoT, Self-Consistency, Active-Prompt, PS, PoT, MathPrompter, ToT, LoT, Fed-SP-SC, Fed-DP-CoT, Analogical Reasoning, Least-to-Most [Yasunaga et al. (2023), Wei et al. (2022), Zhang et al. (2022), Wang et al. (2022), Yao et al. (2024), Zhao et al. (2023b), Chen et al. (2022a), Li et al. (2023a), Gao et al. (2023), Liu et al. (2023), Chia et al. (2023), Diao et al. (2023), Shao et al. (2023), Zhou et al. (2022), Imani et al. (2023), Fu et al. (2022), Wang et al. (2023)]
- Multilabel Text Classification: CoT, PS, Self-Consistency, MP [Wang & Zhao (2023)]
- Language-Based Task Completion: Basic, CoT, Act, ReAct, Least-to-Most [Wei et al. (2022), Yao et al. (2022b), Zhou et al. (2022)]
- Relation Extraction: CoT, PS, Self-Consistency, MP [Wang & Zhao (2023)]
- Natural Language Inference: CoT, PS, Self-Consistency, MP [Wang & Zhao (2023)]
- Stance Detection: Basic, CoT [Zhang et al. (2023b)]
- Paraphrasing: CoT, PS, Self-Consistency, MP [Wang & Zhao (2023)]
- Summarization: CoE, Basic [Bao et al. (2024)]
- Word Sense Disambiguation: CoT, PS, Self-Consistency, MP [Wang & Zhao (2023)]
- Named Entity Recognition: Basic, Basic + Annotation Guideline-Based Prompting, Basic + Annotation Guideline-Based Prompting + Error Analysis-Based Prompting, CoT, PS, Self-Consistency, MP [Hu et al. (2024), Tang et al. (2024), Wang & Zhao (2023)]
- Machine Translation: Basic, CoT, CoC, Basic + Variations [Li et al. (2023a), Zhang et al. (2023a)]
- Emotion/Sentiment Understanding: Basic, CoT, CoC, THOR, Basic + Variations [Li et al. (2023a), Fei et al. (2023), Fatouros et al. (2023)]
- Recommender System: Basic, CoT, CoC [Li et al. (2023a)]
- Table-Based Mathematical Problem Solving: PoT, CoT, Self-Consistency, PAL, Basic, CoC, Random CoT, Complex CoT [Chen et al. (2022a), Li et al. (2023a), Gao et al. (2023), Fu et al. (2022)]
- Table-Based Question-Answering: Basic, CoT, Binder, Dater, Chain-of-Table, Self-Consistency, VE, CoK [Wang et al. (2024), Li et al. (2023c), Cheng et al. (2022), Ye et al. (2023)]

Figure 1: Taxonomy Diagram of Prompt Engineering Methods Across Different NLP Tasks
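The SoTA designation rule described in this section, where the best score across all LLMs decides the method and the LLM itself is not reported, can be sketched as follows. The function name is hypothetical, and the method names, LLM names and scores are placeholders invented for this example, not results from any surveyed paper.

```python
from typing import Dict, List, Tuple


def designate_sota(results: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the prompting method(s) with the best score across all LLMs.

    `results` maps (prompting method, LLM) pairs to a score on one dataset.
    Ties are kept, so more than one method can be designated.
    """
    best = max(results.values())
    return sorted({method for (method, _llm), score in results.items() if score == best})


# Placeholder scores for a single dataset (illustrative only).
results = {
    ("Method A", "LLM X"): 0.61,
    ("Method A", "LLM Y"): 0.74,
    ("Method B", "LLM X"): 0.70,
}
sota = designate_sota(results)
```

Note that Method B beats Method A on LLM X in this toy example, yet Method A is designated because its best pairing scores highest, which is exactly the caveat about LLM dependence raised above.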

3.1 Mathematical Problem Solving

This task measures a model’s ability to perform any kind of mathematical computation in a non-tabular setting. The datasets that we came across while reading up on different prompting methods for this task are GSM8K Cobbe et al. (2021), MATH Hendrycks et al. (2021), SVAMP Patel et al. (2021), ASDiv Miao et al. (2021), AQuA Ling et al. (2017), MAWPS Koncel-Kedziorski et al. (2016), MultiArith Koncel-Kedziorski et al. (2016), AddSub Koncel-Kedziorski et al. (2016), SingleEq Koncel-Kedziorski et al. (2016), Game of 24 Yao et al. (2024), Multi-Step Arithmetic Srivastava et al. (2022), GSM-HARD Gao et al. (2023), SingleOp Koncel-Kedziorski et al. (2016) and MathQA Amini et al. (2019). Table 1 lists the above-mentioned datasets and the prompting methods that have been evaluated on them, along with the best performing prompting strategy.

3.2 Logical Reasoning

The Logical Reasoning task checks whether a model understands natural language well enough to follow a set of commands with inputs and solve a given problem. The datasets that we covered while reading up on different prompting strategies for this task are Word Sorting Srivastava et al. (2022), Temporal Sequences Srivastava et al. (2022), Formal Fallacies Srivastava et al. (2022), Mini Crosswords Yao et al. (2024), Object Counting Srivastava et al. (2022), Logical Deduction Srivastava et al. (2022), Boolean Expressions Srivastava et al. (2022), Tracking Shuffled Objects Srivastava et al. (2022), Web of Lies Srivastava et al. (2022), Dyck Languages Srivastava et al. (2022), Geometric Shapes Srivastava et al. (2022) and Repeat Copy Logic Srivastava et al. (2022). Table 2 contains the above-mentioned datasets and the prompting techniques that have been evaluated on them, along with the best performing prompting method.

Table 1: Prompt Engineering Analysis for Mathematical Problem Solving Task

| Dataset | Prompting Strategies | LLM(s) | SoTA |
| --- | --- | --- | --- |
| GSM8K | Basic, Analogical Reasoning, CoT, Auto-CoT, Self-Consistency, LoT, PoT, PAL, CoC, Contrastive CoT, Contrastive Self-Consistency, Least-to-Most, Synthetic Prompting, Random CoT, Complex CoT, Active-Prompt, Fed-SP-SC, Fed-DP-CoT, PS | GPT-3.5-Turbo, GPT-4, PaLM 2-L, GPT-3 (Text-Davinci-002), LaMDA-137B, PaLM-540B, UL2-20B, Codex (Code-Davinci-002), GPT-3, Codex (Code-Davinci-001), Vicuna-7B, Vicuna-13B, Vicuna-33B, CodeGen (Codegen-16B-Multi), CodeGen (Codegen-16B-Mono), CodeT5+, Xgen, PaLM, LaMDA, PaLM 2-S, GPT-3.5 (Text-Davinci-003), Minerva-540B, InstructGPT (Text-Davinci-003), DiVeRSe | PoT |
| MATH | Analogical Reasoning, CoT | GPT-3.5-Turbo, GPT-4, PaLM 2-L | Analogical Reasoning |
| SVAMP | Basic, CoT, Auto-CoT, Self-Consistency, PAL, PoT, Random CoT, Active-Prompt, Synthetic Prompting, Contrastive CoT, Contrastive Self-Consistency, Fed-SP-SC, Fed-DP-CoT, PS | GPT-3 (Text-Davinci-002), LaMDA-137B, PaLM-540B, UL2-20B, Codex (Code-Davinci-002), GPT-3, Codex (Code-Davinci-001), GPT-3.5-Turbo, CodeGen (Codegen-16B-Multi), CodeGen (Codegen-16B-Mono), CodeT5+, Xgen, PaLM, LaMDA, Minerva-540B, GPT-3.5 (Text-Davinci-003), InstructGPT (Text-Davinci-003) | PoT |
| ASDiv | Basic, CoT, Self-Consistency, PAL, Contrastive CoT, Contrastive Self-Consistency, Synthetic Prompting, Auto-CoT, Random CoT, Active-Prompt | GPT-3 (Text-Davinci-002), LaMDA-137B, PaLM-540B, UL2-20B, Codex (Code-Davinci-002), GPT-3, Codex (Code-Davinci-001), Minerva-540B, GPT-3.5-Turbo, InstructGPT (Text-Davinci-003), GPT-3.5 (Text-Davinci-003) | Contrastive Self-Consistency |
| AQuA | Basic, CoT, Auto-CoT, Self-Consistency, LoT, PoT, Contrastive CoT, Contrastive Self-Consistency, Random CoT, Active-Prompt, PS | GPT-3 (Text-Davinci-002), LaMDA-137B, PaLM-540B, UL2-20B, Codex (Code-Davinci-002), GPT-3, Codex (Code-Davinci-001), GPT-3.5-Turbo, GPT-4, Vicuna-7B, Vicuna-13B, Vicuna-33B, CodeGen (Codegen-16B-Multi), CodeGen (Codegen-16B-Mono), CodeT5+, Xgen, PaLM, LaMDA, GPT-3.5 (Text-Davinci-003) | PoT |
| MAWPS | Basic, CoT | GPT-3 (Text-Davinci-002), LaMDA-137B, PaLM-540B, UL2-20B, Codex (Code-Davinci-002) | CoT |
| Game of 24 | Basic, CoT, Self-Consistency, ToT | GPT-4 | ToT |
| MultiArith | Basic, CoT, Auto-CoT, Self-Consistency, PoT, PAL, MathPrompter, Random CoT, Complex CoT, PS | GPT-3 (Text-Davinci-002), Codex (Code-Davinci-002), GPT-3, LaMDA-137B, PaLM-540B, UL2-20B, Codex (Code-Davinci-001), GPT-3.5-Turbo, CodeGen (Codegen-16B-Multi), CodeGen (Codegen-16B-Mono), CodeT5+, Xgen, PaLM, LaMDA, Minerva-540B, GPT-3.5 (Text-Davinci-003), DiVeRSe | Self-Consistency |
| Multi-Step Arithmetic | Basic, CoT, CoC | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoC |
| AddSub | Basic, CoT, Auto-CoT, Self-Consistency, PAL, PoT, PS | GPT-3 (Text-Davinci-002), GPT-3.5 (Text-Davinci-003), Codex (Code-Davinci-002), UL2-20B, LaMDA-137B, PaLM-540B, Minerva-540B | PAL |
| SingleEq | Basic, CoT, Auto-CoT, PAL, Self-Consistency, Random CoT, Active-Prompt, PS, PoT | GPT-3 (Text-Davinci-002), Codex (Code-Davinci-002), UL2-20B, LaMDA-137B, PaLM-540B, Minerva-540B, GPT-3.5 (Text-Davinci-003) | Active-Prompt |
| GSM-HARD | Basic, CoT, PAL, Contrastive CoT, Contrastive Self-Consistency, Synthetic Prompting | Codex (Code-Davinci-002), UL2-20B, LaMDA-137B, PaLM-540B, Minerva-540B, GPT-3.5-Turbo, InstructGPT (Text-Davinci-003) | Synthetic Prompting |
| SingleOp | Basic, CoT, PAL, Synthetic Prompting | Codex (Code-Davinci-002), UL2-20B, LaMDA-137B, PaLM-540B, Minerva-540B, InstructGPT (Text-Davinci-003), GPT-3 (Text-Davinci-002) | Synthetic Prompting |
| MathQA | CoT, Random CoT, Complex CoT | LaMDA-137B, PaLM-540B, Minerva-540B, GPT-3 (Text-Davinci-002), Codex (Code-Davinci-002), DiVeRSe | Complex CoT |

Table 2: Prompt Engineering Analysis for Logical Reasoning Task

| Dataset | Prompting Strategies | LLM(s) | SoTA |
| --- | --- | --- | --- |
| Word Sorting | Basic, Analogical Reasoning, CoT, CoC | GPT-3.5-Turbo, GPT-4, PaLM 2-L, PaLM 2-S, GPT-3.5 (Text-Davinci-003) | CoC |
| Logical Deduction | Basic, Analogical Reasoning, CoT, CoC | GPT-3.5-Turbo, GPT-4, PaLM 2-L, PaLM 2-S, GPT-3.5 (Text-Davinci-003) | CoC |
| Temporal Sequences | Basic, Analogical Reasoning, CoT, CoC | GPT-3.5-Turbo, GPT-4, PaLM 2-L, PaLM 2-S, GPT-3.5 (Text-Davinci-003) | CoC |
| Formal Fallacies | Basic, Analogical Reasoning, CoT, CoC | GPT-3.5-Turbo, GPT-4, PaLM 2-L, PaLM 2-S, GPT-3.5 (Text-Davinci-003) | Analogical Reasoning |
| Mini Crosswords | Basic, CoT, ToT | GPT-4 | ToT |
| Tracking Shuffled Objects | Basic, CoT, LoT, CoC | GPT-3.5-Turbo, GPT-4, Vicuna-7B, Vicuna-13B, Vicuna-33B, PaLM 2-S, GPT-3.5 (Text-Davinci-003) | CoT, LoT, CoC |
| Object Counting | Basic, CoT, CoC, PAL | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4, Codex (Code-Davinci-002), UL2-20B, LaMDA-137B, PaLM-540B, Minerva-540B | CoC |
| Boolean Expressions | Basic, CoT, CoC | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoC |
| Web of Lies | Basic, CoT, CoC | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoT |
| Dyck Languages | Basic, CoT, CoC | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoC |
| Geometric Shapes | Basic, CoT, CoC | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoC |
| Repeat Copy Logic | Basic, CoT, PAL, Synthetic Prompting | Codex (Code-Davinci-002), UL2-20B, LaMDA-137B, PaLM-540B, Minerva-540B, InstructGPT (Text-Davinci-003) | PAL |

3.3 Commonsense Reasoning

In contrast to the Logical Reasoning task, the Commonsense Reasoning task measures a model’s ability to make judgements based on common practical knowledge, often referred to as commonsense by humans. It does not involve solving a problem to arrive at an answer. Rather, it is more a form of inherent general knowledge. The datasets that we discovered while surveying different prompting methods for this task include Reasoning about Colored Objects Srivastava et al. (2022), CSQA Talmor et al. (2018), Date Understanding Srivastava et al. (2022), Sports Understanding Srivastava et al. (2022), Last Letter Concatenation Wei et al. (2022), Coin Flip Wei et al. (2022), Odd One Out Srivastava et al. (2022), Disambiguation QA Srivastava et al. (2022), Hyperbaton Srivastava et al. (2022), Com2Sense Singh et al. (2021), CSQA 2.0 Talmor et al. (2022), Creak Onoe et al. (2021) and List Reversal Khot et al. (2022). Table 3 shows the above-mentioned datasets and the prompting strategies that have been evaluated on them, along with the best performing prompting method.

Table 3: Prompt Engineering Analysis for Commonsense Reasoning Task

| Dataset | Prompting Strategies | LLM(s) | SoTA |
| --- | --- | --- | --- |
| Reasoning about Colored Objects | Analogical Reasoning, CoT, Basic, CoC, PAL, Synthetic Prompting | PaLM 2-L, PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4, UL2-20B, LaMDA-137B, PaLM-540B, Minerva-540B, InstructGPT (Text-Davinci-003), Codex (Code-Davinci-002) | Synthetic Prompting |
| CSQA | Basic, CoT, Auto-CoT, Self-Consistency, Random CoT, Active-Prompt, PoT, PS | Codex (Code-Davinci-001), Codex (Code-Davinci-002), GPT-3, GPT-3 (Text-Davinci-002), GPT-3.5 (Text-Davinci-003), LaMDA-137B, PaLM-540B, UL2-20B | Active-Prompt |
| Last Letter Concatenation | Basic, CoT, Auto-CoT, Self-Consistency, LoT, Random CoT, Active-Prompt, Least-to-Most, DecomP, PS | Codex (Code-Davinci-001), Codex (Code-Davinci-002), GPT-3, GPT-3 (Text-Davinci-002), GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4, InstructGPT (Text-Davinci-001), InstructGPT (Text-Davinci-002), LaMDA-137B, PaLM-540B, UL2-20B, Vicuna-13B, Vicuna-33B, Vicuna-7B | DecomP |
| CSQA 2.0 | Basic, CoT, Self-Consistency, GKP, Maieutic Prompting | InstructGPT (Text-Davinci-001) | Maieutic Prompting |
| Date Understanding | Basic, CoT, LoT, CoC, PAL, Complex CoT | Codex (Code-Davinci-002), DiVeRSe, GPT-3 (Text-Davinci-002), GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4, LaMDA-137B, Minerva-540B, PaLM 2-S, PaLM-540B, UL2-20B, Vicuna-13B, Vicuna-33B, Vicuna-7B | Complex CoT |
| Sports Understanding | Basic, CoT, CoC | GPT-3 (Text-Davinci-002), LaMDA-137B, PaLM-540B, UL2-20B, Codex (Code-Davinci-002), PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoT |
| Coin Flip | Basic, CoT, Auto-CoT, Self-Consistency, PS | GPT-3 (Text-Davinci-002), LaMDA-137B, PaLM-540B, UL2-20B, Codex (Code-Davinci-002), GPT-3, Codex (Code-Davinci-001) | Auto-CoT |
| Odd One Out | CoT, LoT | GPT-3.5-Turbo, GPT-4, Vicuna-7B, Vicuna-13B, Vicuna-33B | LoT |
| Disambiguation QA | Basic, CoT, CoC | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoC |
| Hyperbaton | Basic, CoT, CoC | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoC |
| Com2Sense | Basic, CoT, Self-Consistency, GKP, Maieutic Prompting | InstructGPT (Text-Davinci-001) | Maieutic Prompting |
| Creak | Basic, CoT, Self-Consistency, GKP, Maieutic Prompting | InstructGPT (Text-Davinci-001) | Maieutic Prompting |
| List Reversal | CoT, DecomP | InstructGPT (Text-Davinci-002), InstructGPT (Text-Davinci-001), Codex (Code-Davinci-002) | DecomP |

3.4 Multi-Hop Reasoning

The Multi-Hop Reasoning task assesses how good a model is at connecting pieces of evidence from different parts of a context to answer a given query. The datasets that we covered while reading up on different prompting strategies for this task are StrategyQA Geva et al. (2021), HotpotQA Yang et al. (2018), Bamboogle Press et al. (2022), CommaQA-E Khot et al. (2021), MuSiQue Trivedi et al. (2022) and 2WikiMultihopQA Ho et al. (2020). Table 4 lists the above-mentioned datasets and the prompting methods that have been evaluated on them, along with the best performing prompting strategy.

3.5 Causal Reasoning

The Causal Reasoning task checks a model’s ability to deal with cause and effect. We came across two datasets while reading up on different prompting techniques for this task, namely Cause And Effect Srivastava et al. (2022) and Causal Judgement Srivastava et al. (2022). Table 5 shows the above-mentioned datasets and the prompting techniques that have been evaluated on them, along with the best performing prompting method.

Table 4: Prompt Engineering Analysis for Multi-Hop Reasoning Task

| Dataset | Prompting Strategies | LLM(s) | SoTA |
| --- | --- | --- | --- |
| StrategyQA | Basic, CoT, Auto-CoT, Self-Consistency, Contrastive CoT, Contrastive Self-Consistency, Random CoT, Active-Prompt, Complex CoT, PS | GPT-3, GPT-3 (Text-Davinci-002), GPT-3.5 (Text-Davinci-003), LaMDA-137B, PaLM-540B, UL2-20B, Codex (Code-Davinci-002), Codex (Code-Davinci-001), GPT-3.5-Turbo, Minerva-540B, DiVeRSe | Active-Prompt |
| HotpotQA | Basic, CoT, Act, ReAct, Self-Consistency, VE, CoK, DecomP, Least-to-Most | PaLM-540B, GPT-3 (Text-Davinci-002), GPT-3.5-Turbo, InstructGPT (Text-Davinci-002), InstructGPT (Text-Davinci-001), Codex (Code-Davinci-002) | CoK |
| CommaQA-E | CoT, DecomP | InstructGPT (Text-Davinci-002), InstructGPT (Text-Davinci-001), Codex (Code-Davinci-002) | DecomP |
| MuSiQue | Basic, CoT, DecomP | InstructGPT (Text-Davinci-002), InstructGPT (Text-Davinci-001), Codex (Code-Davinci-002) | DecomP |
| 2WikiMultihopQA | Basic, CoT, DecomP | InstructGPT (Text-Davinci-002), InstructGPT (Text-Davinci-001), Codex (Code-Davinci-002) | DecomP |

Table 5: Prompt Engineering Analysis for Causal Reasoning Task

| Dataset | Prompting Strategies | LLM(s) | SoTA |
| --- | --- | --- | --- |
| Cause And Effect | CoT, LoT | GPT-3.5-Turbo, GPT-4, Vicuna-7B, Vicuna-13B, Vicuna-33B | LoT |
| Causal Judgement | Basic, CoT, CoC | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | Basic, CoT |

3.6 Social Reasoning

This task tests a model’s ability to reason about human social interactions. We discovered only one dataset while surveying different prompting techniques for this task, which is SocialQA Srivastava et al. (2022). Table 6 contains this dataset and the prompting methods that have been evaluated on it, along with the best performing prompting strategy.

3.7 Contextual Question-Answering

This task measures a model’s ability to answer a query solely by relying on a given context. The datasets that we covered while reading up on different prompting methods for this task are ProcessBank Berant et al. (2014), BioMRC Pappas et al. (2020), MASH-QA Zhu et al. (2020), CliCR Šuster & Daelemans (2018), MultiSpanQA Li et al. (2022), FinQA Chen et al. (2021b), TAT-QA Zhu et al. (2021), Patient Case Reports Vatsal & Singh (2024), Drop Dua et al. (2019) and BoolQ Clark et al. (2019). Table 7 lists the above-mentioned datasets and the prompting methods that have been evaluated on them, along with the best performing prompting technique.

Table 6: Prompt Engineering Analysis for Social Reasoning Task

| Dataset | Prompting Strategies | LLM(s) | SoTA |
| --- | --- | --- | --- |
| SocialQA | CoT, LoT | GPT-3.5-Turbo, GPT-4, Vicuna-7B, Vicuna-13B, Vicuna-33B | LoT |

Table 7: Prompt Engineering Analysis for Contextual Question-Answering Task

| Dataset | Prompting Strategies | LLM(s) | SoTA |
| --- | --- | --- | --- |
| ProcessBank | Basic, Implicit RAG, CoT, Analogical Reasoning | GPT-4 | Implicit RAG |
| BioMRC | Basic, Implicit RAG, CoT, Analogical Reasoning | GPT-4 | Basic |
| MASH-QA | Basic, Implicit RAG, CoT, Analogical Reasoning | GPT-4 | Basic |
| CliCR | Basic, Implicit RAG, CoT, Analogical Reasoning | GPT-4 | Implicit RAG, Analogical Reasoning |
| MultiSpanQA | Basic, CoT, CoVe | LLaMA-65B, LLaMA-2-70B Chat | CoVe |
| FinQA | PoT, CoT, Self-Consistency | Codex (Code-Davinci-002), GPT-3 (Text-Davinci-002), GPT-3.5-Turbo, CodeGen (Codegen-16B-Multi and Codegen-16B-Mono), CodeT5+, Xgen, PaLM, LaMDA | PoT |
| TAT-QA | PoT, CoT, Self-Consistency | Codex (Code-Davinci-002), GPT-3 (Text-Davinci-002), GPT-3.5-Turbo, CodeGen (Codegen-16B-Multi and Codegen-16B-Mono), CodeT5+, Xgen, PaLM, LaMDA | PoT |
| Patient Case Reports | Implicit RAG, CoT, Analogical Reasoning, Basic, Basic with Term Definitions | GPT-4 | Implicit RAG |
| Drop | Basic, CoT, Least-to-Most | GPT-3 (Text-Davinci-002), Codex (Code-Davinci-002), Codex (Code-Davinci-001) | Least-to-Most |
| BoolQ | CoT, PS, Self-Consistency, MP | Llama-2-13B-Chat, GPT-3.5-Turbo, GPT-4, PaLM-Bison-Chat | MP |

3.8 Context-Free Question-Answering

In contrast to the Contextual Question-Answering task, the Context-Free Question-Answering task relies on the model’s embedded knowledge base or any open-source knowledge base, such as Wikipedia, to answer a query instead of using only the context provided. The datasets that we discovered while surveying different prompting techniques for this task are PopQA Mallen et al. (2022), EntityQ Sciavolino et al. (2021), Wikidata Dhuliawala et al. (2023), Wiki-Category List Dhuliawala et al. (2023), MedMCQA Pal et al. (2022), MMLU Physics Hendrycks et al. (2020), MMLU Biology Hendrycks et al. (2020), USMLE Sample Exam Nori et al. (2023), USMLE Self Assessments Nori et al. (2023), MedQA Jin et al. (2021), PubMedQA Jin et al. (2019), MMLU Hendrycks et al. (2020) and AI2 Reasoning Challenge Clark et al. (2018). Table 8 lists the above-mentioned datasets and the prompting strategies that have been evaluated on them, along with the best performing prompting strategy.

Table 8: Prompt Engineering Analysis for Context-Free Question-Answering Task
Dataset | Prompting Strategies | LLM(s) | SoTA
PopQA | Basic, CoT, ThoT | GPT-4, GPT-3.5-Turbo, LLaMA-2-7B-Chat, LLaMA-2-13B-Chat, LLaMA-2-70B-Chat, Vicuna-7B, Vicuna-13B, Vicuna-33B | ThoT
EntityQ | Basic, CoT, ThoT | GPT-4, GPT-3.5-Turbo, LLaMA-2-7B-Chat, LLaMA-2-13B-Chat, LLaMA-2-70B-Chat, Vicuna-7B, Vicuna-13B, Vicuna-33B | ThoT
Wikidata | Basic, CoT, CoVe | LLaMA-65B, LLaMA-2-70B Chat | CoVe
Wiki-Category List | Basic, CoT, CoVe | LLaMA-65B, LLaMA-2-70B Chat | CoVe
MedMCQA | Basic, CoT, Self-Consistency, VE, CoK, ER | GPT-3.5-Turbo, GPT-4, GPT-3.5, InstructGPT (Text-Davinci-002), Flan-PaLM 540B, Med-PaLM, Med-PaLM 2, Flan-PaLM, GPT-4-Base, Codex (Code-Davinci-002), LLaMA-2-70B, LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B Chat, LLaMA-2-7B Chat, LLaMA-2-13B Chat, GPT-NeoX, MPT-Instruct-7B, MPT-Instruct-30B, Falcon-Instruct-7B, Falcon-Instruct-40B, Guanaco-33B, Guanaco-65B, Vicuna-1.3-7B, Vicuna-1.3-13B, Vicuna-1.3-33B, Vicuna-1.5-7B, Vicuna-1.5-13B, U-PaLM-540B, Flan-U-PaLM-540B, Med-PaLM V2-540B | Basic
MedQA | Basic, CoT, Self-Consistency, ER | GPT-4, GPT-3.5, GPT-3.5-Turbo, InstructGPT (Text-Davinci-002), Flan-PaLM 540B, Med-PaLM, Med-PaLM 2, Flan-PaLM, GPT-4-Base, Codex (Code-Davinci-002), LLaMA-2-70B, LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B Chat, LLaMA-2-7B Chat, LLaMA-2-13B Chat, GPT-NeoX, MPT-Instruct-7B, MPT-Instruct-30B, Falcon-Instruct-7B, Falcon-Instruct-40B, Guanaco-33B, Guanaco-65B, Vicuna-1.3-7B, Vicuna-1.3-13B, Vicuna-1.3-33B, Vicuna-1.5-7B, Vicuna-1.5-13B, U-PaLM-540B, Flan-U-PaLM-540B, Med-PaLM V2-540B | Basic
MMLU Physics | Basic, CoT, Self-Consistency, VE, CoK | GPT-3.5-Turbo | CoK
MMLU Biology | Basic, CoT, Self-Consistency, VE, CoK | GPT-3.5-Turbo | CoK
USMLE Sample Exam | Basic | GPT-4, GPT-3.5, GPT-3.5-Turbo, InstructGPT (Text-Davinci-002), Flan-PaLM 540B, Med-PaLM | Basic
USMLE Self Assessments | Basic | GPT-4, GPT-3.5, GPT-3.5-Turbo, InstructGPT (Text-Davinci-002), Flan-PaLM 540B, Med-PaLM | Basic
AI2 Reasoning Challenge | CoT, Self-Consistency | GPT-3, LaMDA-137B, PaLM-540B, UL2-20B, Codex (Code-Davinci-001), Codex (Code-Davinci-002) | Self-Consistency
PubMedQA | Basic, CoT, Self-Consistency, ER | GPT-4, GPT-3.5, GPT-3.5-Turbo, InstructGPT (Text-Davinci-002), Flan-PaLM 540B, Med-PaLM, Med-PaLM 2, Flan-PaLM, GPT-4-Base, Codex (Code-Davinci-002), LLaMA-2-70B, LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B Chat, LLaMA-2-7B Chat, LLaMA-2-13B Chat, GPT-NeoX, MPT-Instruct-7B, MPT-Instruct-30B, Falcon-Instruct-7B, Falcon-Instruct-40B, Guanaco-33B, Guanaco-65B, Vicuna-1.3-7B, Vicuna-1.3-13B, Vicuna-1.3-33B, Vicuna-1.5-7B, Vicuna-1.5-13B, U-PaLM-540B, Flan-U-PaLM-540B, Med-PaLM V2-540B | Basic
MMLU | Basic, CoT, Self-Consistency, ER | Med-PaLM 2, Flan-PaLM, GPT-4-Base, GPT-4, GPT-3.5, GPT-3.5-Turbo, InstructGPT (Text-Davinci-002), Flan-PaLM 540B, Med-PaLM, Codex (Code-Davinci-002), LLaMA-2-70B, LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B Chat, LLaMA-2-7B Chat, LLaMA-2-13B Chat, GPT-NeoX, MPT-Instruct-7B, MPT-Instruct-30B, Falcon-Instruct-7B, Falcon-Instruct-40B, Guanaco-33B, Guanaco-65B, Vicuna-1.3-7B, Vicuna-1.3-13B, Vicuna-1.3-33B, Vicuna-1.5-7B, Vicuna-1.5-13B, U-PaLM-540B, Flan-U-PaLM-540B, Med-PaLM V2-540B | Basic

3.9 Spatial Question-Answering

The Spatial Question-Answering task measures a model’s ability to deal with spatial reasoning, which is a cognitive process based on the construction of mental representations for spatial objects, relations, and transformations. The various datasets which we came across while reading up on different prompting techniques for this task include Brick World Hu et al. (2023), NLVR-Based Manipulation Hu et al. (2023), Natural Language Navigation Hu et al. (2023), Spartun Mirzaee & Kordjamshidi (2022) and Navigate Srivastava et al. (2022). Table 9 contains the above-mentioned datasets and the different prompting methods that have been experimented on them, along with the best performing prompting strategy.

3.10 Conversational Contextual Question-Answering

In this task, the model is assessed based on its understanding of a given text extract and how well it is able to answer a series of interconnected queries that appear in a conversational format. A key thing to note here is that each query may depend on previous queries’ answers. We covered only one dataset while reading up on different prompting methods for this task, which is ConvFinQA Chen et al. (2022b). Table 10 shows the above-mentioned dataset and the different prompting methods that have been experimented on it, along with the best performing prompting strategy.
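Because each query may depend on earlier answers, a prompt for this task typically has to carry the conversation so far. The following is a minimal sketch under that assumption; the function name `build_conversational_prompt`, the `Q1`/`A1` turn labels and the example figures are hypothetical and are not drawn from ConvFinQA or any surveyed paper.

```python
from typing import List, Tuple


def build_conversational_prompt(
    passage: str,
    history: List[Tuple[str, str]],
    question: str,
) -> str:
    """Build a prompt that replays earlier (question, answer) turns.

    Hypothetical helper: earlier turns are included verbatim so that a
    follow-up query can refer back to previous answers.
    """
    lines = [f"Passage: {passage}", ""]
    for i, (past_question, past_answer) in enumerate(history, start=1):
        lines.append(f"Q{i}: {past_question}")
        lines.append(f"A{i}: {past_answer}")
    turn = len(history) + 1
    lines.append(f"Q{turn}: {question}")
    lines.append(f"A{turn}:")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_conversational_prompt(
        passage="Revenue was 120 in 2020 and 150 in 2021.",
        history=[("What was revenue in 2020?", "120")],
        question="By how much did it grow the next year?",
    )
    print(prompt)
```

In the usage above, the second question only makes sense once the first turn is visible in the prompt, which is the dependency this task is designed to test.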

Table 9: Prompt Engineering Analysis for Spatial Question-Answering Task
Dataset | Prompting Strategies | LLM(s) | SoTA
Brick World | CoT, CoS | GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoS
NLVR-Based Manipulation | CoT, CoS | GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoS
Natural Language Navigation | CoT, CoS | GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoS
Spartun | CoT, CoS | GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoS
Navigate | Basic, CoT, CoC | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoT

Table 10: Prompt Engineering Analysis for Conversational Contextual Question-Answering Task
Dataset | Prompting Strategies | LLM(s) | SoTA
ConvFinQA | PoT, CoT, Self-Consistency, PAL | Codex (Code-Davinci-002), GPT-3 (Text-Davinci-002), GPT-3.5-Turbo, CodeGen (Codegen-16B-Multi), CodeGen (Codegen-16B-Mono), CodeT5+, Xgen, PaLM, LaMDA | PoT

3.11 Dialogue System

The Dialogue System task checks a model’s ability to perform language generation in a user-to-machine conversation setting or to answer queries given an already generated conversation. When the text extract in Conversational Contextual Question-Answering is itself a conversation, there can be a strong overlap between these two tasks, but based on the datasets and prompting techniques encountered during our survey, we decided to keep them as separate tasks. We discovered only one dataset while surveying different prompting methods for this task, which is Multi-Turn Conversation Response (MTCR) Zhou et al. (2023). Table 11 lists the above-mentioned dataset and the different prompting strategies that have been experimented on it, along with the best performing prompting technique.

3.12 Code Generation

This task involves all the cases where the input to the model or its final output is programming language code. The different datasets which we came across while reading up on different prompting strategies for this task are Codeforce Scraping Yasunaga et al. (2023), HumanEval Chen et al. (2021a), MBPP Austin et al. (2021) and MBCPP Athiwaratkun et al. (2022). Table 12 contains the above-mentioned datasets and the different prompting techniques that have been experimented on them, along with the best performing prompting strategy.

Table 11: Prompt Engineering Analysis for Dialogue System Task
Dataset | Prompting Strategies | LLM(s) | SoTA
MTCR | Basic, CoT, ThoT | GPT-4, GPT-3.5-Turbo, LLaMA-2-7B-Chat, LLaMA-2-13B-Chat, LLaMA-2-70B-Chat, Vicuna-7B, Vicuna-13B, Vicuna-33B | ThoT

Table 12: Prompt Engineering Analysis for Code Generation Task
Dataset | Prompting Strategies | LLM(s) | SoTA
Codeforce Scraping | Analogical Reasoning, CoT | GPT-3.5-Turbo, GPT-4, PaLM 2-L | Analogical Reasoning
HumanEval | Basic, SCoT, CoT | Codex (Code-Davinci-002), GPT-3.5-Turbo | SCoT
MBPP | Basic, SCoT, CoT | Codex (Code-Davinci-002), GPT-3.5-Turbo | SCoT
MBCPP | Basic, SCoT, CoT | Codex (Code-Davinci-002), GPT-3.5-Turbo | SCoT

3.13 Free Response

This task assesses a model’s ability to generate an unconstrained textual response. The various datasets which we covered while reading up on different prompting methods for this task include Creative Writing Yao et al. (2024) and Longform Generation of Biographies Min et al. (2023). Table 13 lists the above-mentioned datasets and the different prompting strategies that have been experimented on them, along with the best technique.

Table 13: Prompt Engineering Analysis for Free Response Task
Dataset | Prompting Strategies | LLM(s) | SoTA
Creative Writing | Basic, CoT, Self-Consistency, ToT | GPT-4 | ToT
Longform Generation of Biographies | Basic, CoT, CoVe | LLaMA-65B, LLaMA-2-70B Chat | CoVe

3.14 Truthfulness

This task assesses a model’s ability to communicate factually and not spread any kind of misinformation. It does not represent a model’s capability to understand a given context; rather, it is focused on the model not making false statements based on its understanding. The various datasets that we discovered while surveying different prompting strategies for this task are SycophancyEval (https://github.com/meg-tong/sycophancy-eval), GSM-IC Shi et al. (2023) and Fever Thorne et al. (2018). Table 14 shows the above-mentioned datasets and the different prompting techniques that have been experimented on them, along with the best performing prompting technique.

Table 14: Prompt Engineering Analysis for Truthfulness Task
Dataset | Prompting Strategies | LLM(s) | SoTA
SycophancyEval | S2A, CoT, Instructed Prompting | LLaMA-2-70B-Chat | S2A
Longform Generation | S2A, CoT, Instructed Prompting | LLaMA-2-70B-Chat | S2A
Fever | Basic, CoT, Act, ReAct, Self-Consistency, VE, CoK | PaLM-540B, GPT-3.5 (Text-Davinci-002), GPT-3.5-Turbo, InstructGPT (Text-Davinci-003) | ReAct
GSM-IC | CoT, Least-to-Most, Instructed Prompting, Self-Consistency, S2A | Codex (Code-Davinci-002), GPT-3.5 (Text-Davinci-003), LLaMA-2-70B-Chat | Least-to-Most

3.15 Table-Based Truthfulness

This task is an extension of the Truthfulness task and measures a model’s ability to communicate factually and not spread any kind of misinformation in a tabular setting. The only dataset we came across while reading up on different prompting methods for this task is TabFact Chen et al. (2019). Table 15 contains the above-mentioned dataset and the different prompting strategies that have been experimented on it, along with the best performing prompting strategy.
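In a tabular setting, the table usually has to be turned into text before a statement can be checked against it. The sketch below shows one simple way to do this; the helper names `serialize_table` and `build_fact_check_prompt`, the pipe-delimited layout and the 'true'/'false' answer wording are illustrative assumptions, not the format used by TabFact or by the surveyed prompting methods.

```python
from typing import Sequence


def serialize_table(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Flatten a table into pipe-delimited text, one row per line."""
    lines = [" | ".join(header)]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)


def build_fact_check_prompt(
    header: Sequence[str],
    rows: Sequence[Sequence[object]],
    statement: str,
) -> str:
    """Build a basic prompt asking whether the table supports a statement."""
    return (
        "Table:\n"
        f"{serialize_table(header, rows)}\n\n"
        f"Statement: {statement}\n"
        "Is the statement supported by the table? Answer 'true' or 'false'.\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(
        build_fact_check_prompt(
            ["city", "population"],
            [["Oslo", 700000], ["Bergen", 290000]],
            "Oslo has a larger population than Bergen.",
        )
    )
```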

Table 15: Prompt Engineering Analysis for Table-Based Truthfulness Task
Dataset | Prompting Strategies | LLM(s) | SoTA
TabFact | Basic, CoT, Binder, Dater, Chain-of-Table | PaLM 2-S, GPT-3.5-Turbo, LLaMA-2-17B-Chat | Chain-of-Table

3.16 Table-Based Question-Answering

This task involves any kind of question-answering in a tabular setting. It can be considered a superset of other kinds of table-based tasks like Table-Based Truthfulness or Table-Based Mathematical Problem Solving. In this work, in order to avoid any confusion, we have captured under this task only those datasets which do not fall under more specific table-based tasks like Table-Based Truthfulness or Table-Based Mathematical Problem Solving. We came across only two datasets while reading up on different prompting strategies for this task, which are FeTaQA Nan et al. (2022) and WikiTQ Pasupat & Liang (2015). Table 16 shows the above-mentioned datasets and the different prompting methods that have been experimented on them, along with the best performing prompting strategy.

3.17 Table-Based Mathematical Problem Solving

This task is an extension of the Mathematical Problem Solving task and measures a model’s ability to perform any kind of mathematical computation in a tabular setting. The different datasets which we covered while reading up on different prompting techniques for this task include TabMWP Lu et al. (2022) and Penguins in a Table Srivastava et al. (2022). Table 17 lists the above-mentioned datasets and the different prompting methods that have been experimented on them, along with the best performing prompting strategy.

3.18 Recommender System

This task measures a model’s ability to process a given input and suggest, as output, the set of items which are most relevant from a list of possible items. We discovered only one dataset while surveying different prompting techniques for this task, which is Movie Recommendation Srivastava et al. (2022). Table 18 lists the above-mentioned dataset and the different prompting methods that have been experimented on it, along with the best performing prompting technique.

Table 16: Prompt Engineering Analysis for Table-Based Question-Answering Task
Dataset | Prompting Strategies | LLM(s) | SoTA
WikiTQ | Basic, CoT, Binder, Dater, Chain-of-Table | PaLM 2-S, GPT-3.5-Turbo, LLaMA-2-17B-Chat, Codex (Code-Davinci-002) | Chain-of-Table
FeTaQA | Basic, CoT, Dater, Chain-of-Table, Self-Consistency, VE, CoK | PaLM 2-S, GPT-3.5-Turbo, LLaMA-2-17B-Chat, Codex (Code-Davinci-002) | Chain-of-Table

Table 17: Prompt Engineering Analysis for Table-Based Mathematical Problem Solving Task
Dataset | Prompting Strategies | LLM(s) | SoTA
TabMWP | PoT, CoT, Self-Consistency, PAL | Codex (Code-Davinci-002), GPT-3 (Text-Davinci-002), GPT-3.5-Turbo, CodeGen (Codegen-16B-Multi), CodeGen (Codegen-16B-Mono), CodeT5+, Xgen, PaLM, LaMDA | PoT
Penguins in a Table | Basic, CoT, CoC, PAL, Random CoT, Complex CoT | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4, Codex (Code-Davinci-002), UL2-20B, LaMDA-137B, PaLM-540B, Minerva-540B, GPT-3 (Text-Davinci-002), DiVeRSe | PAL

Table 18: Prompt Engineering Analysis for Recommender System Task
Dataset | Prompting Strategies | LLM(s) | SoTA
Movie Recommendation | Basic, CoT, CoC | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4, Codex (Code-Davinci-002) | Basic

3.19 Emotion/Sentiment Understanding

This task checks how good a model is at understanding human emotions or sentiments. The various datasets which we came across while reading up on different prompting methods for this task include Ruin Names Srivastava et al. (2022), SemEval14 Laptop and Restaurant Pontiki et al. (2016) and Forex Fatouros et al. (2023). Table 19 contains the above-mentioned datasets and the different prompting techniques that have been experimented on them, along with the best performing prompting strategy.

3.20 Machine Translation

In this task, a model is tested on its ability to translate between two languages. The different datasets which we came across while reading up on different prompting techniques for this task include Salient Translation Error Detection Srivastava et al. (2022), FLORES Costa-jussà et al. (2022), WMT21 Farhad et al. (2021), Multi-Domain Aharoni & Goldberg (2020) and PDC Sun et al. (2020). Table 20 lists the above-mentioned datasets and the different prompting methods that have been experimented on them, along with the best performing prompting strategy.

Table 19: Prompt Engineering Analysis for Emotion/Sentiment Understanding Task
Dataset | Prompting Strategies | LLM(s) | SoTA
Snarks | Basic, CoT, CoC | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | CoC
Ruin Names | Basic, CoT, CoC | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | Basic
SemEval14 Laptop and Restaurant | THOR, CoT | Flan-T5-250M (Base), Flan-T5-780M (Large), Flan-T5-3B (XL), Flan-T5-11B (XXL), GPT3-350M, GPT3-1.3B, GPT3-6.7B, GPT3-175B, GPT-3.5-Turbo | THOR
Forex | Basic, Basic + Variations | GPT-3.5-Turbo | Basic + Variations

Table 20: Prompt Engineering Analysis for Machine Translation Task
Dataset | Prompting Strategies | LLM(s) | SoTA
Salient Translation Error Detection | Basic, CoT, CoC | PaLM 2-S, GPT-3.5 (Text-Davinci-003), GPT-3.5-Turbo, GPT-4 | Basic
FLORES | Basic, Basic + Variations | GLM-130B | Basic + Variations
WMT21 | Basic, Basic + Variations | GLM-130B | Basic + Variations
Multi-Domain | Basic, Basic + Variations | GLM-130B | Basic + Variations
PDC | Basic, Basic + Variations | GLM-130B | Basic + Variations

3.21 Named Entity Recognition

The Named Entity Recognition task aims at identifying predefined classes or categories of objects in a given input text. The different datasets that we discovered while surveying different prompting techniques for this task are MTSamples Uzuner et al. (2011), VAERS Du et al. (2021), Research Papers Tang et al. (2024) and BC5CDR-chem Li et al. (2016). Table 21 shows the above-mentioned datasets and the different prompting strategies that have been experimented on them, along with the best performing prompting strategy.

Table 21: Prompt Engineering Analysis for Named Entity Recognition Task
Dataset | Prompting Strategies | LLM(s) | SoTA
MTSamples | Basic, Basic + Annotation Guideline-Based Prompting, Basic + Annotation Guideline-Based Prompting + Error Analysis-Based Prompting | GPT-3.5-Turbo, GPT-4 | Basic + Annotation Guideline-Based Prompting + Error Analysis-Based Prompting
VAERS | Basic, Basic + Annotation Guideline-Based Prompting, Basic + Annotation Guideline-Based Prompting + Error Analysis-Based Prompting | GPT-3.5-Turbo, GPT-4 | Basic + Annotation Guideline-Based Prompting + Error Analysis-Based Prompting
Research Papers | Basic, CoT | GPT-3.5-Turbo, GPT-4 | Basic
BC5CDR-chem | CoT, PS, Self-Consistency, MP | Llama-2-13B-Chat, GPT-3.5-Turbo, GPT-4, PaLM-Bison-Chat | MP

3.22 Word Sense Disambiguation

The Word Sense Disambiguation task checks how good a model is at deciphering the different meanings of a word in different contextual surroundings. We came across only one dataset while reading up on different prompting methods for this task, which is WiC Pilehvar & Camacho-Collados (2018). Table 22 shows the above-mentioned dataset and the different prompting techniques that have been experimented on it, along with the best performing prompting method.

3.23 Summarization

This task tests a model’s ability to break down a lengthy piece of input text into smaller chunks while ensuring that vital information is retained in these smaller chunks. We covered only one dataset while reading up on different prompting methods for this task which is CCTC Bao et al. (2024). Table 23 contains the above-mentioned dataset and the different prompting techniques that have been experimented on it, along with the best performing prompting strategy.

Table 22: Prompt Engineering Analysis for Word Sense Disambiguation Task
Dataset | Prompting Strategies | LLM(s) | SoTA
WiC | CoT, PS, Self-Consistency, MP | Llama-2-13B-Chat, GPT-3.5-Turbo, GPT-4, PaLM-Bison-Chat | MP

Table 23: Prompt Engineering Analysis for Summarization Task
Dataset | Prompting Strategies | LLM(s) | SoTA
WCEP | Basic, CoE | ChatGLM2-6B | CoE
CCTC | Basic, CoE | ChatGLM2-6B | CoE

3.24 Paraphrasing

The Paraphrasing task aims at rewriting a given piece of input text using different words while keeping the semantics of the original input text the same. A key difference between the Summarization task and the Paraphrasing task is that the main goal of Summarization is to shorten the output text with respect to the input text, whereas Paraphrasing just focuses on using different words during its rewriting process. We discovered only one dataset while surveying different prompting methods for this task, which is QQP (https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs). Table 24 lists the above-mentioned dataset and the different prompting methods that have been experimented on it, along with the best performing prompting technique.

Table 24: Prompt Engineering Analysis for Paraphrasing Task
Dataset | Prompting Strategies | LLM(s) | SoTA
QQP | CoT, PS, Self-Consistency, MP | Llama-2-13B-Chat, GPT-3.5-Turbo, GPT-4, PaLM-Bison-Chat | MP

3.25 Stance Detection

This task evaluates a model’s ability to determine from a text whether its author is in favor of or against a topic, target or object of evaluation. The different datasets which we came across while reading up on different prompting techniques for this task are SemEval-2016 Mohammad et al. (2016), VAST Allaway & McKeown (2020) and P-Stance Li et al. (2021). Table 25 shows the above-mentioned datasets and the different prompting methods that have been experimented on them, along with the best performing prompting technique.

Table 25: Prompt Engineering Analysis for Stance Detection Task
Dataset | Prompting Strategies | LLM(s) | SoTA
SemEval-2016 | CoT | GPT-3.5-Turbo | CoT
VAST | CoT | GPT-3.5-Turbo | CoT
P-Stance | CoT | GPT-3.5-Turbo | CoT

3.26 Natural Language Inference

The main objective of this task is to determine whether a hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given a premise. The different datasets which we covered while reading up on different prompting methods for this task are QNLI Rajpurkar et al. (2016) and MedNLI Romanov & Shivade (2018). Table 26 contains the above-mentioned datasets and the different prompting strategies that have been experimented on them, along with the best performing prompting method.
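The three-way decision described above can be sketched as a basic prompt plus a small parser that maps a free-text model response back to one of the three labels. The names `build_nli_prompt` and `parse_nli_label` and the prompt wording are hypothetical; individual datasets may use a different label set.

```python
from typing import Optional

# The three labels named in the task definition above.
LABELS = ("entailment", "contradiction", "neutral")


def build_nli_prompt(premise: str, hypothesis: str) -> str:
    """Build a basic prompt asking for one of the three NLI labels."""
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Answer with one word: {', '.join(LABELS)}.\n"
        "Label:"
    )


def parse_nli_label(response: str) -> Optional[str]:
    """Return the label the response starts with, or None if unrecognized."""
    text = response.strip().lower()
    for label in LABELS:
        if text.startswith(label):
            return label
    return None


if __name__ == "__main__":
    print(build_nli_prompt("A man is playing a guitar.", "A person makes music."))
    print(parse_nli_label(" Entailment."))
```

Returning `None` for unrecognized responses keeps malformed outputs separate from genuine predictions when scoring.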

Table 26: Prompt Engineering Analysis for Natural Language Inference Task
Dataset | Prompting Strategies | LLM(s) | SoTA
QNLI | CoT, PS, Self-Consistency, MP | Llama-2-13B-Chat, GPT-3.5-Turbo, GPT-4, PaLM-Bison-Chat | MP
MedNLI | CoT, PS, Self-Consistency, MP | Llama-2-13B-Chat, GPT-3.5-Turbo, GPT-4, PaLM-Bison-Chat | MP

3.27 Relation Extraction

Relation Extraction evaluates a model’s ability to identify semantic relationships between predefined classes or categories of objects or named entities. We came across only one dataset while reading up on different prompting techniques for this task, which is DDI Segura-Bedmar et al. (2013). Table 27 shows the above-mentioned dataset and the different prompting methods that have been experimented on it, along with the best performing prompting strategy.

Table 27: Prompt Engineering Analysis for Relation Extraction Task
Dataset | Prompting Strategies | LLM(s) | SoTA
DDI | CoT, PS, Self-Consistency, MP | Llama-2-13B-Chat, GPT-3.5-Turbo, GPT-4, PaLM-Bison-Chat | MP

3.28 Language-Based Task Completion

The main objective of this task is to check how good a model is at following a sequence of language-based navigational commands to decide on the actions required to complete a task. The different datasets that we discovered while surveying different prompting strategies for this task are ALFWorld Shridhar et al. (2020), WebShop Yao et al. (2022a), SayCan Ahn et al. (2022) and Scan Lake & Baroni (2018). Table 28 lists the above-mentioned datasets and the different prompting methods that have been experimented on them, along with the best performing prompting method.

Table 28: Prompt Engineering Analysis for Language-Based Task Completion Task
Dataset | Prompting Strategies | LLM(s) | SoTA
ALFWorld | Act, ReAct | PaLM-540B, GPT-3 (Text-Davinci-002) | ReAct
Scan | Basic, CoT, Least-to-Most | GPT-3 (Text-Davinci-002), Codex (Code-Davinci-002), Codex (Code-Davinci-001) | Least-to-Most
WebShop | Act, ReAct | PaLM-540B, GPT-3 (Text-Davinci-002) | ReAct
SayCan | Basic, CoT | GPT-3 (Text-Davinci-002), LaMDA-137B, PaLM-540B, UL2-20B, Codex (Code-Davinci-002) | CoT

3.29 Multilabel Text Classification

This task measures a model’s ability to assign each input to a set of predefined target labels. It can encapsulate many of the above-mentioned tasks, such as Stance Detection and Named Entity Recognition, but in order to keep the task definitions as disjoint as possible for a better survey of prompting methods, we have included under this task only those datasets which could not be suitably categorized under any of the above-discussed tasks. The different datasets which we covered while reading up on different prompting strategies for this task include EUR-LEX Chalkidis et al. (2021), UNFAIR-ToS Lippi et al. (2019) and LEDGAR Tuggener et al. (2020). Table 29 contains the above-mentioned datasets and the different prompting strategies that have been experimented on them, along with the best performing prompting method.
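Since the output here is a set of labels rather than a single one, a model response has to be mapped back onto the predefined label inventory. The sketch below assumes the model is asked to return a comma-separated list; the function name `parse_labels` and the example label names are hypothetical and are not taken from the listed datasets.

```python
from typing import Iterable, Set


def parse_labels(response: str, allowed: Iterable[str]) -> Set[str]:
    """Map a comma-separated model response onto a predefined label set.

    Matching is case-insensitive; anything outside `allowed` is dropped.
    """
    lookup = {label.lower(): label for label in allowed}
    found: Set[str] = set()
    for piece in response.split(","):
        key = piece.strip().lower()
        if key in lookup:
            found.add(lookup[key])
    return found


if __name__ == "__main__":
    allowed_labels = ["Arbitration", "Jurisdiction", "Choice of Law"]
    print(parse_labels("arbitration, Jurisdiction, something else", allowed_labels))
```

Dropping out-of-inventory strings is a design choice in this sketch: it keeps evaluation restricted to the predefined target labels, at the cost of silently ignoring labels the model invents.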

Table 29: Prompt Engineering Analysis for Multilabel Text Classification Task
Dataset | Prompting Strategies | LLM(s) | SoTA
EUR-LEX | CoT, PS, Self-Consistency, MP | Llama-2-13B-Chat, GPT-3.5-Turbo, GPT-4, PaLM-Bison-Chat | MP
UNFAIR-ToS | CoT, PS, Self-Consistency, MP | Llama-2-13B-Chat, GPT-3.5-Turbo, GPT-4, PaLM-Bison-Chat | MP
LEDGAR | CoT, PS, Self-Consistency, MP | Llama-2-13B-Chat, GPT-3.5-Turbo, GPT-4, PaLM-Bison-Chat | MP

4 Conclusion

Prompt engineering has become indispensable in the present realm of LLMs. It plays a crucial role in realising the full potential of LLMs through various measures. In this work, we conduct an in-depth survey of 44 research papers covering 39 prompting strategies across 29 different NLP tasks. We present this pictorially through a taxonomy diagram. We try to standardize the categorization of different datasets into 29 NLP tasks and discuss the overall effect of recent prompting techniques across them, while also listing the potential SoTA prompting method for each dataset.

References

  • Aharoni & Goldberg (2020) Roee Aharoni and Yoav Goldberg. Unsupervised domain clusters in pretrained language models. arXiv preprint arXiv:2004.02105, 2020.
  • Ahn et al. (2022) Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
  • Allaway & McKeown (2020) Emily Allaway and Kathleen McKeown. Zero-shot stance detection: A dataset and model using generalized topic representations. arXiv preprint arXiv:2010.03640, 2020.
  • Amini et al. (2019) Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019.
  • Athiwaratkun et al. (2022) Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. Multi-lingual evaluation of code generation models. arXiv preprint arXiv:2210.14868, 2022.
  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  • Bao et al. (2024) Songlin Bao, Tiantian Li, and Bin Cao. Chain-of-event prompting for multi-document summarization by large language models. International Journal of Web Information Systems, (ahead-of-print), 2024.
  • Berant et al. (2014) Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Abby Vander Linden, Brittany Harding, Brad Huang, Peter Clark, and Christopher D Manning. Modeling biological processes for reading comprehension. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp.  1499–1510, 2014.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chalkidis et al. (2021) Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. Multieurlex–a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. arXiv preprint arXiv:2109.00904, 2021.
  • Chang et al. (2023) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.
  • Chen et al. (2023) Banghao Chen, Zhaofeng Zhang, Nicolas Langrené, and Shengxin Zhu. Unleashing the potential of prompt engineering in large language models: a comprehensive review. arXiv preprint arXiv:2310.14735, 2023.
  • Chen et al. (2021a) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021a.
  • Chen et al. (2019) Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. Tabfact: A large-scale dataset for table-based fact verification. arXiv preprint arXiv:1909.02164, 2019.
  • Chen et al. (2022a) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022a.
  • Chen et al. (2021b) Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, et al. Finqa: A dataset of numerical reasoning over financial data. arXiv preprint arXiv:2109.00122, 2021b.
  • Chen et al. (2022b) Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. Convfinqa: Exploring the chain of numerical reasoning in conversational finance question answering. arXiv preprint arXiv:2210.03849, 2022b.
  • Cheng et al. (2022) Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, et al. Binding language models in symbolic languages. arXiv preprint arXiv:2210.02875, 2022.
  • Chia et al. (2023) Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, and Lidong Bing. Contrastive chain-of-thought prompting. arXiv preprint arXiv:2311.09277, 2023.
  • Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
  • Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • Costa-jussà et al. (2022) Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672, 2022.
  • Dhuliawala et al. (2023) Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495, 2023.
  • Diao et al. (2023) Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. Active prompting with chain-of-thought for large language models. arXiv preprint arXiv:2302.12246, 2023.
  • Du et al. (2021) Jingcheng Du, Yang Xiang, Madhuri Sankaranarayanapillai, Meng Zhang, Jingqi Wang, Yuqi Si, Huy Anh Pham, Hua Xu, Yong Chen, and Cui Tao. Extracting postmarketing adverse events from safety reports in the vaccine adverse event reporting system (vaers) using deep learning. Journal of the American Medical Informatics Association, 28(7):1393–1400, 2021.
  • Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161, 2019.
  • Edemacu & Wu (2024) Kennedy Edemacu and Xintao Wu. Privacy preserving prompt engineering: A survey. arXiv preprint arXiv:2404.06001, 2024.
  • Farhad et al. (2021) Akhbardeh Farhad, Arkhangorodsky Arkady, Biesialska Magdalena, Bojar Ondřej, Chatterjee Rajen, Chaudhary Vishrav, Marta R Costa-jussa, España-Bonet Cristina, Fan Angela, Federmann Christian, et al. Findings of the 2021 conference on machine translation (wmt21). In Proceedings of the Sixth Conference on Machine Translation, pp.  1–88. Association for Computational Linguistics, 2021.
  • Fatouros et al. (2023) Georgios Fatouros, John Soldatos, Kalliopi Kouroumali, Georgios Makridis, and Dimosthenis Kyriazis. Transforming sentiment analysis in the financial domain with chatgpt. Machine Learning with Applications, 14:100508, 2023.
  • Fei et al. (2023) Hao Fei, Bobo Li, Qian Liu, Lidong Bing, Fei Li, and Tat-Seng Chua. Reasoning implicit sentiment with chain-of-thought prompting. arXiv preprint arXiv:2305.11255, 2023.
  • Fu et al. (2022) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. Complexity-based prompting for multi-step reasoning. In The Eleventh International Conference on Learning Representations, 2022.
  • Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp.  10764–10799. PMLR, 2023.
  • Geva et al. (2021) Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021.
  • Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
  • Ho et al. (2020) Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060, 2020.
  • Hu et al. (2023) Hanxu Hu, Hongyuan Lu, Huajian Zhang, Yun-Ze Song, Wai Lam, and Yue Zhang. Chain-of-symbol prompting elicits planning in large language models. arXiv preprint arXiv:2305.10276, 2023.
  • Hu et al. (2024) Yan Hu, Qingyu Chen, Jingcheng Du, Xueqing Peng, Vipina Kuttichi Keloth, Xu Zuo, Yujia Zhou, Zehan Li, Xiaoqian Jiang, Zhiyong Lu, et al. Improving large language models for clinical named entity recognition via prompt engineering. Journal of the American Medical Informatics Association, pp.  ocad259, 2024.
  • Imani et al. (2023) Shima Imani, Liang Du, and Harsh Shrivastava. Mathprompter: Mathematical reasoning using large language models. arXiv preprint arXiv:2303.05398, 2023.
  • Jin et al. (2021) Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.
  • Jin et al. (2019) Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019.
  • Jung et al. (2022) Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic prompting: Logically consistent reasoning with recursive explanations. arXiv preprint arXiv:2205.11822, 2022.
  • Khot et al. (2021) Tushar Khot, Kyle Richardson, Daniel Khashabi, and Ashish Sabharwal. Hey ai, can you solve complex tasks by talking to agents? arXiv preprint arXiv:2110.08542, 2021.
  • Khot et al. (2022) Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022.
  • Koncel-Kedziorski et al. (2016) Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. Mawps: A math word problem repository. In Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies, pp.  1152–1157, 2016.
  • Lake & Baroni (2018) Brenden Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International conference on machine learning, pp.  2873–2882. PMLR, 2018.
  • Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  • Li et al. (2023a) Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, and Brian Ichter. Chain of code: Reasoning with a language model-augmented code emulator. arXiv preprint arXiv:2312.04474, 2023a.
  • Li et al. (2022) Haonan Li, Martin Tomko, Maria Vasardani, and Timothy Baldwin. Multispanqa: A dataset for multi-span question answering. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  1250–1260, 2022.
  • Li et al. (2023b) Jia Li, Ge Li, Yongmin Li, and Zhi Jin. Structured chain-of-thought prompting for code generation. arXiv preprint arXiv:2305.06599, 2023b.
  • Li et al. (2016) Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. Biocreative v cdr task corpus: a resource for chemical disease relation extraction. Database, 2016, 2016.
  • Li et al. (2023c) Xingxuan Li, Ruochen Zhao, Yew Ken Chia, Bosheng Ding, Shafiq Joty, Soujanya Poria, and Lidong Bing. Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources. In The Twelfth International Conference on Learning Representations, 2023c.
  • Li et al. (2021) Yingjie Li, Tiberiu Sosea, Aditya Sawant, Ajith Jayaraman Nair, Diana Inkpen, and Cornelia Caragea. P-stance: A large dataset for stance detection in political domain. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp.  2355–2365, 2021.
  • Liévin et al. (2024) Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, and Ole Winther. Can large language models reason about medical questions? Patterns, 5(3), 2024.
  • Ling et al. (2017) Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146, 2017.
  • Lippi et al. (2019) Marco Lippi, Przemysław Pałka, Giuseppe Contissa, Francesca Lagioia, Hans-Wolfgang Micklitz, Giovanni Sartor, and Paolo Torroni. Claudette: an automated detector of potentially unfair clauses in online terms of service. Artificial Intelligence and Law, 27:117–139, 2019.
  • Liu et al. (2021) Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Hannaneh Hajishirzi. Generated knowledge prompting for commonsense reasoning. arXiv preprint arXiv:2110.08387, 2021.
  • Liu et al. (2023) Xiangyang Liu, Tianqi Pang, and Chenyou Fan. Federated prompting and chain-of-thought reasoning for improving llms answering. In International Conference on Knowledge Science, Engineering and Management, pp.  3–11. Springer, 2023.
  • Lu et al. (2022) Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. arXiv preprint arXiv:2209.14610, 2022.
  • Mallen et al. (2022) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022.
  • Miao et al. (2021) Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing english math word problem solvers. arXiv preprint arXiv:2106.15772, 2021.
  • Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251, 2023.
  • Mirzaee & Kordjamshidi (2022) Roshanak Mirzaee and Parisa Kordjamshidi. Transfer learning with synthetic corpora for spatial role labeling and reasoning. arXiv preprint arXiv:2210.16952, 2022.
  • Mohammad et al. (2016) Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. Semeval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016), pp.  31–41, 2016.
  • Nan et al. (2022) Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victoria Lin, Neha Verma, Rui Zhang, Wojciech Kryściński, Hailey Schoelkopf, Riley Kong, Xiangru Tang, et al. Fetaqa: Free-form table question answering. Transactions of the Association for Computational Linguistics, 10:35–49, 2022.
  • Nori et al. (2023) Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.
  • Onoe et al. (2021) Yasumasa Onoe, Michael JQ Zhang, Eunsol Choi, and Greg Durrett. Creak: A dataset for commonsense reasoning over entity knowledge. arXiv preprint arXiv:2109.01653, 2021.
  • Pal et al. (2022) Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pp.  248–260. PMLR, 2022.
  • Pappas et al. (2020) Dimitris Pappas, Petros Stavropoulos, Ion Androutsopoulos, and Ryan McDonald. Biomrc: A dataset for biomedical machine reading comprehension. In Proceedings of the 19th SIGBioMed workshop on biomedical language processing, pp.  140–149, 2020.
  • Pasupat & Liang (2015) Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305, 2015.
  • Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? arXiv preprint arXiv:2103.07191, 2021.
  • Pilehvar & Camacho-Collados (2018) Mohammad Taher Pilehvar and Jose Camacho-Collados. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. arXiv preprint arXiv:1808.09121, 2018.
  • Pontiki et al. (2016) Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammed Al-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, et al. Semeval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp.  19–30. Association for Computational Linguistics, 2016.
  • Press et al. (2022) Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
  • Romanov & Shivade (2018) Alexey Romanov and Chaitanya Shivade. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, 2018.
  • Sahoo et al. (2024) Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927, 2024.
  • Sciavolino et al. (2021) Christopher Sciavolino, Zexuan Zhong, Jinhyuk Lee, and Danqi Chen. Simple entity-centric questions challenge dense retrievers. arXiv preprint arXiv:2109.08535, 2021.
  • Segura-Bedmar et al. (2013) Isabel Segura-Bedmar, Paloma Martínez, and María Herrero-Zazo. Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (ddiextraction 2013). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pp.  341–350, 2013.
  • Shao et al. (2023) Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Synthetic prompting: Generating chain-of-thought demonstrations for large language models. In International Conference on Machine Learning, pp.  30706–30775. PMLR, 2023.
  • Shi et al. (2023) Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pp.  31210–31227. PMLR, 2023.
  • Shridhar et al. (2020) Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020.
  • Singh et al. (2021) Shikhar Singh, Nuan Wen, Yu Hou, Pegah Alipoormolabashi, Te-Lin Wu, Xuezhe Ma, and Nanyun Peng. Com2sense: A commonsense reasoning benchmark with complementary sentences. arXiv preprint arXiv:2106.00969, 2021.
  • Singhal et al. (2023) Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023.
  • Srivastava et al. (2022) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
  • Sun et al. (2020) Zewei Sun, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Shujian Huang, Jiajun Chen, and Lei Li. Rethinking document-level neural machine translation. arXiv preprint arXiv:2010.08961, 2020.
  • Šuster & Daelemans (2018) Simon Šuster and Walter Daelemans. Clicr: a dataset of clinical case reports for machine reading comprehension. arXiv preprint arXiv:1803.09720, 2018.
  • Talmor et al. (2018) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018.
  • Talmor et al. (2022) Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, and Jonathan Berant. Commonsenseqa 2.0: Exposing the limits of ai through gamification. arXiv preprint arXiv:2201.05320, 2022.
  • Tang et al. (2024) Yiyi Tang, Ziyan Xiao, Xue Li, Qingpeng Zhang, Esther WY Chan, Ian CK Wong, and Research Data Collaboration Task Force. Large language model in medical information extraction from titles and abstracts with prompt engineering strategies: A comparative study of gpt-3.5 and gpt-4. medRxiv, pp.  2024–03, 2024.
  • Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. Fever: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355, 2018.
  • Trivedi et al. (2022) Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022.
  • Tuggener et al. (2020) Don Tuggener, Pius Von Däniken, Thomas Peetz, and Mark Cieliebak. Ledgar: A large-scale multi-label corpus for text classification of legal provisions in contracts. In Proceedings of the twelfth language resources and evaluation conference, pp.  1235–1241, 2020.
  • Uzuner et al. (2011) Özlem Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2010 i2b2/va challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556, 2011.
  • Vatsal & Singh (2024) Shubham Vatsal and Ayush Singh. Can gpt redefine medical understanding? evaluating gpt on biomedical machine reading comprehension. arXiv preprint arXiv:2405.18682, 2024.
  • Vatsal et al. (2024) Shubham Vatsal, Ayush Singh, and Shabnam Tafreshi. Can gpt improve the state of prior authorization via guideline based automated question answering? arXiv preprint arXiv:2402.18419, 2024.
  • Wang et al. (2023) Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091, 2023.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
  • Wang & Zhao (2023) Yuqing Wang and Yun Zhao. Metacognitive prompting improves understanding in large language models. arXiv preprint arXiv:2308.05342, 2023.
  • Wang et al. (2024) Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, et al. Chain-of-table: Evolving tables in the reasoning chain for table understanding. arXiv preprint arXiv:2401.04398, 2024.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
  • Weston & Sukhbaatar (2023) Jason Weston and Sainbayar Sukhbaatar. System 2 attention (is something you might need too). arXiv preprint arXiv:2311.11829, 2023.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
  • Yao et al. (2022a) Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022a.
  • Yao et al. (2022b) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022b.
  • Yao et al. (2024) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
  • Yasunaga et al. (2023) Michihiro Yasunaga, Xinyun Chen, Yujia Li, Panupong Pasupat, Jure Leskovec, Percy Liang, Ed H Chi, and Denny Zhou. Large language models as analogical reasoners. arXiv preprint arXiv:2310.01714, 2023.
  • Ye et al. (2023) Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. Large language models are versatile decomposers: Decompose evidence and questions for table-based reasoning. arXiv preprint arXiv:2301.13808, 2023.
  • Zhang et al. (2023a) Biao Zhang, Barry Haddow, and Alexandra Birch. Prompting large language model for machine translation: A case study. In International Conference on Machine Learning, pp.  41092–41110. PMLR, 2023a.
  • Zhang et al. (2023b) Bowen Zhang, Xianghua Fu, Daijun Ding, Hu Huang, Yangyang Li, and Liwen Jing. Investigating chain-of-thought with chatgpt for stance detection on social media. arXiv preprint arXiv:2304.03087, 2023b.
  • Zhang et al. (2022) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022.
  • Zhao et al. (2023a) Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. Verify-and-edit: A knowledge-enhanced chain-of-thought framework. arXiv preprint arXiv:2305.03268, 2023a.
  • Zhao et al. (2023b) Xufeng Zhao, Mengdi Li, Wenhao Lu, Cornelius Weber, Jae Hee Lee, Kun Chu, and Stefan Wermter. Enhancing zero-shot chain-of-thought reasoning in large language models through logic. arXiv preprint arXiv:2309.13339, 2023b.
  • Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
  • Zhou et al. (2023) Yucheng Zhou, Xiubo Geng, Tao Shen, Chongyang Tao, Guodong Long, Jian-Guang Lou, and Jianbing Shen. Thread of thought unraveling chaotic contexts. arXiv preprint arXiv:2311.08734, 2023.
  • Zhu et al. (2021) Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. Tat-qa: A question answering benchmark on a hybrid of tabular and textual content in finance. arXiv preprint arXiv:2105.07624, 2021.
  • Zhu et al. (2020) Ming Zhu, Aman Ahuja, Da-Cheng Juan, Wei Wei, and Chandan K Reddy. Question answering with long multiple-span answers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.  3840–3849, 2020.