CoT Rerailer: Enhancing the Reliability of Large Language Models in Complex Reasoning Tasks through Error Detection and Correction

Guangya Wan¹*, Yuqi Wu²*, Jie Chen², Sheng Li¹
¹School of Data Science, University of Virginia
²Department of Electrical and Computer Engineering, University of Alberta
{wxr9et,shengli}@virginia.edu, [email protected], [email protected]
*These authors contributed equally to this work.

Abstract

Chain-of-Thought (CoT) prompting enhances Large Language Models’ (LLMs) complex reasoning abilities by generating intermediate steps. However, these steps can introduce hallucinations and accumulate errors. We propose the “CoT Rerailer” to address these challenges, employing self-consistency and multi-agent debate systems to identify and rectify errors in the reasoning process. The CoT Rerailer first selects the most logically correct Reasoning Path (RP) using consistency checks and critical evaluation by automated agents. It then engages a multi-agent debate system to propose and validate corrections to ensure the generation of an error-free intermediate logical path. The corrected steps are then used to generate a revised reasoning chain to further reduce hallucinations and enhance answer quality. We demonstrate the effectiveness of our approach across diverse question-answering datasets in various knowledge domains. The CoT Rerailer enhances the reliability of LLM-generated reasoning, contributing to more trustworthy AI-driven decision-making processes.


1 Introduction

The development of Large Language Models (LLMs) has revolutionized the field of Natural Language Processing (NLP), marking a significant departure from traditional statistical or deep learning-based NLP methodologies. Unlike their predecessors, LLMs are trained on a vast corpus encompassing billions of tokens, endowing them with unprecedented capabilities in text generation, reasoning, and other linguistic tasks. Recent iterations of LLMs, such as ChatGPT (Radford et al., 2018), GPT-4 (Achiam et al., 2023), Llama2 (Touvron et al., 2023), and Claude (Anthropic, 2024), have showcased remarkable proficiency in producing content that is not only realistic but also rich in information. However, a significant challenge encountered with LLMs is their propensity to generate misleading or unfounded content, commonly termed "hallucination" (Huang et al., 2023a). Hallucinations can manifest as deviations from user instructions, contradictory responses, or the production of details misaligned with factual reality. Detecting and mitigating these hallucinations is crucial for ensuring the reliability and trustworthiness of LLM-generated content.

The Chain-of-Thought (CoT) (Wei et al., 2022) prompting technique has been proposed to mitigate hallucinations by encouraging LLMs to tackle questions in a step-by-step manner when generating reasoning paths (RPs), mimicking human problem-solving processes. However, the effectiveness of the CoT method is limited by the next-token prediction mechanism inherent to LLMs (Delétang et al., 2023; Bubeck et al., 2023), which can lead to a cascade of errors if inaccuracies or hallucinations occur at intermediate stages. The Self-Consistency (SC) approach (Wang et al., 2023) builds upon the CoT work by sampling multiple reasoning chains and aggregating their outputs using majority voting, but it also has higher computational costs and lacks explicit error detection or prevention mechanisms for intermediate steps.

Multi-agent debate (MAD) systems have emerged as another promising approach to mitigate hallucinations. By assigning roles to multiple LLMs (Li et al., 2023) and enabling their cooperation (Talebirad and Nadiri, 2023), these systems leverage the debate process to amplify correct answers and allow LLMs to challenge and refine each other’s responses. This interactive process facilitates the identification and correction of potential hallucinations, leading to more reliable and trustworthy outputs. However, these debates often remain at a higher level and do not directly address intermediate errors, which could further improve reliability if detected and resolved.

Building upon the ideas of SC and MAD, we developed the “CoT Rerailer”, a pipeline that combines these approaches to detect and minimize hallucinatory responses (Fig. 1). The CoT Rerailer comprises two key processes: the identification of derailments and the rerailment procedure. The derailment identification process distinguishes consistent from inconsistent RPs and passes the responses with the fewest internal mistakes to the rerailment process. The rerailment process then thoroughly checks each intermediate step and, if errors are detected, mitigates them by engaging a multi-agent debate system. This approach enhances the error correction capability of LLM reasoning and promotes the mitigation of intermediate-step hallucinations. It outperforms existing methods, including CoT, SC, and MAD, across a diverse range of datasets. Our key contributions are:

1. We propose a novel “CoT Rerailer” pipeline that enhances the interpretability and reliability of Large Language Models by identifying and rectifying hallucinations in the generated reasoning paths.
2. We introduce a unique combination of consistency checks and MAD to efficiently and effectively detect and mitigate hallucinations in the reasoning process while minimizing computational overhead.
3. We extensively test and benchmark the CoT Rerailer pipeline on four commonly used Question Answering datasets, demonstrating its efficiency, effectiveness, and versatility in detecting and reducing hallucinations, improving accuracy, and lowering the computational cost of generated responses compared to existing methods.

Figure 1: Schematics of LLM problem-solving approaches. Each rectangular box represents an intermediate step generated by the LLM when solving the problem.

2 Related Work

Chain-of-Thought Prompting. Chain-of-thought (CoT) prompting (Wei et al., 2022) has been introduced to enhance language model reasoning by generating intermediate steps. This approach has inspired numerous works (Diao et al., 2023; Zhang et al., 2023b; Kojima et al., 2022; Zhou et al., 2023) focusing on step-by-step reasoning. Another foundational work is Self-Consistency (Wang et al., 2023), which extends CoT prompting by sampling multiple reasoning paths and taking a majority vote on the final answer, improving performance on various reasoning tasks. Our proposed CoT Rerailer resembles these approaches in that it generalizes the CoT method, but it focuses on detecting and correcting intermediate errors to mitigate hallucinations. It differs from methods like PAL (Gao et al., 2023), which leverages external solvers, and Tree-of-Thoughts (ToT) (Yao et al., 2024), which proposes multiple candidate thoughts in a tree structure at each step. Instead, our method directly identifies hallucinated reasoning chains and corrects the root causes within the problematic chain through a two-step pipeline, without utilizing external resources.

Reasoning with Multi-Agent Systems. Recent works have explored assigning roles to multiple LLMs (Li et al., 2023) and enabling LLM cooperation (Talebirad and Nadiri, 2023) to distill reasoning capabilities. Multi-agent debating frameworks (Liang et al., 2023; Chan et al., 2024) have shown the potential to handle complex reasoning compared to single LLMs. In particular, Du et al. (2023) demonstrate that their multi-agent debate approach outperforms single-model baselines on various reasoning, factuality, and question-answering tasks, implying that using multiple model agents and multiple rounds of debate enhances performance. Their findings suggest that the debate process not only amplifies correct answers but also enables models to converge on the correct answer even when all models initially make incorrect predictions. Building upon these ideas, we incorporate a multi-agent debate component in the Rerailment Process of our CoT Rerailer to enhance error tracing abilities and mitigate intermediate-step hallucinations.

Hallucination. Hallucination is a critical issue in applying LLMs to real-world applications. Many surveys provide comprehensive examinations of this topic (Huang et al., 2023b; Zhang et al., 2023a). Existing methods can be divided into two main categories: hallucination detection and hallucination mitigation. Most of these methods build on the idea of self-consistency for detection or mitigation. For hallucination detection, one closely related work is SelfCheckGPT (Manakul et al., 2023), which leverages self-consistency by comparing multiple sampled responses and measuring their consistency to detect hallucinations in a zero-resource setting. For hallucination mitigation, ChatProtect (Mündler et al., 2024) targets self-contradictions, which occur when an LM generates two logically inconsistent sentences within the same context. It identifies and corrects them through an iterative process that refines the text to remove contradictory information while preserving fluency and informativeness. Our CoT Rerailer distinguishes itself through a two-step pipeline that combines the ideas of self-consistency and multi-agent debate to enhance the reliability of hallucination detection and mitigation.

3 Methods

Preliminary. Our focus is on reasoning-based question-answering (QA) tasks, which aim to mimic human-like reasoning by generating coherent, logical sequences of thought leading to an answer. Formally, a QA task is defined as a tuple $(Q,C,A)$, where $Q$ is a question, $C$ is the context providing necessary background information, and $A$ is the ground-truth answer. The goal for LLMs is to produce a sequence of tokens $T=(t_{1},t_{2},\dots,t_{n})$ that accurately answers $Q$ through logically coherent intermediate reasoning steps $S=(s_{1},s_{2},\dots,s_{m})$, where each step $s_{i}$ is a contiguous span of tokens $(t_{l_{i}},\dots,t_{r_{i}})\subseteq T$. A special case of QA tasks is multiple-choice question-answering (MCQA), where the LLM is provided with a set of predefined answer options $O=(o_{1},o_{2},\dots,o_{k})$ in addition to the question $Q$ and context $C$. MCQA tasks can be formally defined as a tuple $(Q,C,O,A)$. The fixed answer format of MCQA makes evaluation more straightforward than for open-ended QA tasks.

In this work, we primarily focus on MCQA tasks and simple open-ended QA tasks, such as math questions, where the answer format is relatively simple to parse and evaluate. However, the format of these reasoning tasks, coupled with the intrinsic next-token prediction mechanism of LLMs, makes them susceptible to accumulating errors, thus leading to hallucinated output.

3.1 CoT-Rerailer

To address the issue of error accumulation in the reasoning-based QA tasks, we introduce the “CoT Rerailer” (Fig. 2), which reduces hallucination and optimizes computational resources by selectively processing reasoning chains $S=(s_{1},s_{2},\dots,s_{m})$ that benefit from intervention. Our method identifies and corrects the root causes of errors early in the $S$ , ensuring each step $s_{i}$ is logically coherent and factually accurate. This iterative correction of intermediate steps prevents the compounding of errors, maintaining $S$ on a sound logical track toward the correct answer $A$ .

3.1.1 Derailment Identifier

Our analysis begins by detecting hallucinations in LLM-generated answers $A=(a_{1},a_{2},\dots,a_{k})$ similar to the principle of self-consistency (Wang et al., 2023). Since uncertainty in answer generation strongly indicates hallucination (Zhang et al., 2023a), we identify questions $Q$ yielding inconsistent RPs $S_{i}$ for further analysis.

Consistency. Consistency measures the LLM’s ability to generate coherent answers across multiple iterations of the same $Q$. Given a set of generated RPs $\{S_{1},S_{2},\dots,S_{n}\}$, the responses are considered inconsistent if there exist at least two RPs $S_{j}$ and $S_{k}$ whose corresponding final answers $a_{j}$ and $a_{k}$ differ, i.e., $a_{j}\neq a_{k}$. We generate multiple $S$ for each $Q$. When the answers are consistent, we output the answer directly; when they are not, we pass the inconsistent chains $S_{i}$ on to the subsequent steps of the pipeline. This filtering improves efficiency and reduces the overall hallucination of our pipeline.
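The consistency check can be illustrated with a minimal Python sketch. It assumes each sampled RP has already been reduced to its final answer string; the helper names `normalize`, `is_consistent`, and `route_question` are hypothetical and do not come from the paper.

```python
from typing import Optional, Sequence


def normalize(answer: str) -> str:
    """Canonicalize an answer so trivially different spellings compare equal."""
    return answer.strip().lower()


def is_consistent(final_answers: Sequence[str]) -> bool:
    """Return True when every sampled RP ends in the same final answer."""
    return len({normalize(a) for a in final_answers}) <= 1


def route_question(final_answers: Sequence[str]) -> Optional[str]:
    """Return the agreed answer, or None if the RPs must go on to the Judge."""
    if not final_answers:
        raise ValueError("need at least one sampled reasoning path")
    if is_consistent(final_answers):
        return normalize(final_answers[0])
    return None  # inconsistent: hand the RPs to the rest of the pipeline


print(route_question(["B", "b ", "B"]))  # all three samples agree
print(route_question(["B", "C", "B"]))   # disagreement, so None is returned
```

Unlike majority voting in SC, this check never picks a winner among disagreeing answers; it only decides whether a question needs further processing.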

Judge. From the inconsistent RPs $S_{i}$ generated during the consistency assessment, the Judge selects the one RP with the fewest hallucinated intermediate steps $s_{i}$. Given the context $C$ and $Q$, the Judge, which is itself an LLM, evaluates each RP $S_{i}$ and prioritizes the most coherent and contextually relevant RP $S_{s}$. This complements the consistency check and serves as an additional safeguard against hallucination. Both the consistency check and the Judge aim to reduce hallucinations and minimize the workload of the CoT Rerailer pipeline. By selecting the "least incorrect" RP $S_{s}$, which is likely to contain the fewest incorrect intermediate steps $s_{i}$, the Judge further reduces the burden on the rerailment process.
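As a rough sketch of the selection step, the snippet below picks the RP with the fewest flagged steps. The scoring callable stands in for the LLM Judge; scoring by a count of flagged steps is an assumption made for illustration, and `select_least_incorrect` and `toy_judge` are hypothetical names.

```python
from typing import Callable, List, Sequence

ReasoningPath = List[str]  # one string per intermediate step


def select_least_incorrect(
    paths: Sequence[ReasoningPath],
    count_flagged_steps: Callable[[ReasoningPath], int],
) -> ReasoningPath:
    """Pick the RP with the fewest flagged steps (on ties, the first one wins)."""
    if not paths:
        raise ValueError("no reasoning paths to judge")
    return min(paths, key=count_flagged_steps)


def toy_judge(path: ReasoningPath) -> int:
    """Toy stand-in for the LLM Judge: flag any step containing the marker '??'."""
    return sum("??" in step for step in path)


candidates = [
    ["2 + 2 = 5 ??", "5 * 3 = 15"],
    ["2 + 2 = 4", "4 * 3 = 12"],
    ["2 + 2 = 4", "4 * 3 = 13 ??"],
]
print(select_least_incorrect(candidates, toy_judge))
```

Passing the judge in as a callable keeps the selection logic testable without any model call; in practice the callable would wrap a prompt to the Judge LLM.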

3.1.2 Rerailment Process

Our Rerailment Process, inspired by the concept of multi-agent debate (MAD) (Talebirad and Nadiri, 2023), consists of three key stages: (1) identifying the root cause of the error within the selected RP $S_{s}$ , (2) correcting the erroneous intermediate steps $s_{i}$ through multi-agent debates, and (3) generating a new, logically coherent RP $S_{c}$ based on the corrected information. This method addresses hallucinatory responses by identifying the intermediate mistakes and proposing a revised RP that closely aligns with factual and logical reasoning. Here is a detailed explanation:

Intermediate-Step Evaluator: Identifying the root cause of errors in intermediate steps. To effectively trace the root cause of errors in the selected RP $S_{s}$ , we dissect the RP into distinct steps $s_{i}$ and evaluate each one individually. At the heart of this process is the Intermediate-Step Evaluator, tasked with identifying instances of factuality or faithfulness hallucinations within each targeted step $s_{i}$ . For any given step $s_{i}$ , our methodology presents only the segments leading up to $s_{i}$ —while concealing subsequent steps—to the evaluator.

The preceding steps contain essential background knowledge that could influence the outcome of $s_{i}$ , aiding the evaluator in assessing the step’s accuracy. The assumption here is that the previous steps ( $s_{1}$ to $s_{i-1}$ ) have already been evaluated or corrected, thus considered error-free. Also, by masking the steps that follow $s_{i}$ , we prevent potential biases in the evaluation process that could arise from erroneous or misleading information contained within the $S_{s}$ . Moreover, limiting the information presented to the Evaluator only to include preceding steps reduces the likelihood of re-hallucination.
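The masking scheme can be sketched in a few lines of Python. This is an illustrative sketch only: `masked_views` is a hypothetical helper, and the RP is assumed to be already split into a list of step strings.

```python
from typing import Iterator, List, Tuple


def masked_views(steps: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (preceding_steps, target_step) for every step s_i, in order.

    Steps after s_i are never included, mirroring the masking described above.
    """
    for i, target in enumerate(steps):
        yield steps[:i], target


rp = ["Let x be the price.", "Then 3x = 12.", "So x = 5."]
for context, target in masked_views(rp):
    print(len(context), "visible step(s) before:", target)
```

Each view would be shown to the evaluator on its own, so an error in a later step cannot influence the judgment of an earlier one.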

Debate Mitigator: Validating and refining corrections through multi-agent debate. To validate the accuracy of the proposed correction and ensure it effectively addresses the hallucination, we employ a multi-agent debate technique (Du et al., 2023), introducing the Debate Mitigator as shown in Fig. 3. The Debate Mitigator’s role is to inspect the evaluator’s initial correction through a series of debates, critically assessing whether the suggested modification indeed rectifies the hallucination identified in the original step $s_{i}$. The debate process is designed to be iterative, with multiple stages of discussion aimed at challenging and refining the proposed correction. This method is based on the premise that, through successive rounds of debate, agents will converge toward a consensus, thereby significantly reducing the likelihood of persistent hallucinations. Finally, the verified correction replaces the incorrect step $s_{i}$ in $S_{s}$, and the steps up to the correction are passed to the Re-answer agent to complete the final mitigation.
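The iterative debate can be sketched as a loop that runs until the agents agree. Everything here is an assumption for illustration: the `Agent` signature, the default of three rounds, and the fallback to the most common proposal when no consensus is reached are not specified in the text, and the toy agents stand in for LLM calls.

```python
from typing import Callable, List, Sequence

# An agent sees the original step and the previous round's proposals,
# and returns its own (possibly revised) proposal.
Agent = Callable[[str, Sequence[str]], str]


def debate(
    original_step: str,
    initial_correction: str,
    agents: Sequence[Agent],
    max_rounds: int = 3,
) -> str:
    """Run rounds of debate until all agents agree or max_rounds is reached."""
    if not agents:
        raise ValueError("debate needs at least one agent")
    proposals: List[str] = [initial_correction] * len(agents)
    for _ in range(max_rounds):
        proposals = [agent(original_step, proposals) for agent in agents]
        if len(set(proposals)) == 1:  # consensus reached
            return proposals[0]
    # No consensus: fall back to the most common proposal (an assumption).
    return max(proposals, key=proposals.count)


def stubborn(step: str, proposals: Sequence[str]) -> str:
    """Toy agent that always insists on the same refined correction."""
    return "3x = 12, so x = 4"


def follower(step: str, proposals: Sequence[str]) -> str:
    """Toy agent that adopts whatever the first agent said last round."""
    return proposals[0]


print(debate("So x = 5.", "x = 4", [stubborn, follower]))
```

In this toy run the agents disagree after the first round and converge in the second, which is the behaviour the consensus premise relies on.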

Re-answer: Generating the final corrected reasoning chain. After the proposed corrections are checked by the Debate Mitigator, we generate a new RP $S_{c}$ based on the corrected intermediate steps $s_{i}$. This process uses the original RP $S_{s}$, up to and including the corrected step, as a foundation for developing subsequent steps and formulating a final answer $A$. The Re-answer agent takes $S_{c}$, which is treated as an accurate initial attempt, and uses it to guide the generation of the remaining steps. By focusing on the corrected segments, the Re-answer agent ensures that the new RP $S_{f}$ builds logically and coherently from the validated starting point $S_{c}$. Generating $S_{f}$ in this way ensures that the entire reasoning sequence reflects the corrections and insights gained through the root cause analysis and debate phases.

Algorithm 1 CoT Rerailer

Input: Reasoning Path RP, Question Q, Subject S, Total Steps num_steps
Output: Rerailed Path rerailed_RP, Rerailed Answer rerailed_answer
for i = 1 to num_steps do
    index ← i
    masked_RP ← RP[:index]
    hallucination, Correction ← Evaluator(Q, S, masked_RP, RP[index-1])
    if hallucination then
        corrected ← DebateMitigator(Q, S, masked_RP, Correction)
        RP[index-1] ← corrected
        rerailed_RP, rerailed_answer ← Reanswer(Q, S, RP)
        return rerailed_RP, rerailed_answer
    end if
end for
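Algorithm 1 can be transcribed into runnable Python as a sketch. The three agents are passed in as plain callables whose signatures are assumed from the argument lists in the algorithm; the toy agents at the bottom are hypothetical stand-ins for LLM calls. The algorithm does not state what happens when no step is flagged, so returning `None` in that case is an assumption.

```python
from typing import Callable, List, Optional, Tuple

Evaluator = Callable[[str, str, List[str], str], Tuple[bool, str]]
Mitigator = Callable[[str, str, List[str], str], str]
Reanswer = Callable[[str, str, List[str]], Tuple[List[str], str]]


def cot_rerailer(
    rp: List[str],
    question: str,
    subject: str,
    evaluator: Evaluator,
    debate_mitigator: Mitigator,
    reanswer: Reanswer,
) -> Optional[Tuple[List[str], str]]:
    """Fix the first step flagged as hallucinated, then re-answer.

    Returns None when no step is flagged (unspecified in Algorithm 1).
    """
    rp = list(rp)  # work on a copy so the caller's path is untouched
    for index in range(1, len(rp) + 1):
        masked_rp = rp[:index]  # steps 1..index; later steps stay hidden
        hallucination, correction = evaluator(
            question, subject, masked_rp, rp[index - 1]
        )
        if hallucination:
            corrected = debate_mitigator(question, subject, masked_rp, correction)
            rp[index - 1] = corrected
            return reanswer(question, subject, rp)
    return None


# Toy stand-ins for the three LLM agents (purely illustrative).
def toy_evaluator(q, s, masked_rp, step):
    return ("= 5" in step, "So x = 4.")


def toy_mitigator(q, s, masked_rp, correction):
    return correction  # accept the proposed correction unchanged


def toy_reanswer(q, s, rp):
    return rp, rp[-1]


path = ["Let x be the price.", "Then 3x = 12.", "So x = 5."]
result = cot_rerailer(
    path, "What is x?", "Math", toy_evaluator, toy_mitigator, toy_reanswer
)
print(result)
```

As in the algorithm, the loop returns as soon as one step has been corrected and re-answered, so steps after the first flagged one are regenerated rather than evaluated individually.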

4 Experiments

Our study evaluates the CoT Rerailer pipeline using a streamlined experimental framework tailored to diverse knowledge domains. We classify the test datasets, including MathQA (Amini et al., 2019), GSM8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2021), and BigBench (Srivastava et al., 2023), into three broad categories: Commonsense Reasoning, Math, and Advanced Math & Science. The detailed mapping of subjects to the broad categories is given in Appendix A.2.1. For computational efficiency, we subdivide each category into three segments, each comprising on average 800 randomly chosen test cases. This approach ensures thorough coverage across different domains while keeping the dataset size manageable. We use GPT-4 throughout our experiments. Detailed information, including hyperparameter settings, prompt designs, category definitions, and the code, is available in Appendices A.1, A.2.1, and A.3. We use CoT prompting, SC prompting, and MAD prompting as baselines for comparison with the CoT Rerailer. For SC prompting, we adopt a fixed sampling budget of 40, as recommended by the original authors. For MAD prompting, we use 2 debate agents and 3 rounds of debate.

4.1 Main Results

Intermediate Error Detection: As the core capability of the Rerailer prompting method, intermediate error tracking and detection are important for ensuring error correction and QA accuracy. Our CoT Rerailer operates on the principle that errors in intermediate steps can cascade and affect the accuracy of subsequent steps. Fig. 4 illustrates the Rerailer’s ability to detect and correct such errors in sample basic and advanced math problems, resulting in a correct final answer. In the first example, the original RP makes a counting mistake in step 3, concluding with ‘a 1 followed by 27 nines’ instead of the correct ‘26 nines, a five, and a four’. The original RP follows this unfaithful intermediate step and arrives at an incorrect final answer. The Rerailer, however, identifies and rectifies this error, allowing the Re-answer agent to generate a corrected RP that effectively addresses the hallucination. In the second example, a differential equation, the original reasoning path wrongly assumes that a simple integral can be applied to both sides to solve the equation. The Rerailer identifies this principal mistake and proposes that an integrating factor is needed to process the equation further, since the variables are mixed together (see Fig. 7 in the Appendix for more examples).

Under the hood, the CoT Rerailer comprises three key components: the Step Evaluator, the Debate Mitigator, and the Re-answer agent. The rerailment process starts by checking the correctness of each proposed step of the CoT RP. Once the Step Evaluator identifies a potential hallucination in a particular step (Fig. 10 and Fig. 11), it raises the step hallucination flag together with its verification reasoning and proposes a corrected version of the hallucinated step. To ensure that the proposed correction indeed captures the mistake, the Debate Mitigator (Fig. 12 and Fig. 13) debates the proposed correction. Depending on the debate results, the correction is either justified or altered. Finally, after all RP steps have been evaluated or corrected, the Re-answer agent generates the corrected CoT reasoning path and answer based on the newly proposed reasoning steps (Fig. 14). Through this iterative intermediate-step-checking mechanism, our pipeline delivers a more reliable RP and helps users correct the errors in the original reasoning steps.

Pipeline QA Performance: Besides the specific example demonstrations, we also verify the correction capability of the CoT Rerailer through QA performance. Table 1 summarizes the performance of our proposed method compared to standard Chain-of-Thought (CoT) prompting, Self-Consistency (SC), and Multi-agent Debate (MAD) prompting across different datasets and categories. The CoT Rerailer pipeline consistently improves LLM response accuracy, both overall and across the individual domains. Please refer to Appendix 6 for more detailed performance on different subjects.

Table 1: Accuracy of Different Methods Across Main Categories. CoT: Chain-of-Thought; SC: Self-Consistency; MAD: Multi-agent Debate; CM: Commonsense; Adv. Math: Advanced Math.

Models	Methods	CM Reasoning	Math	Adv. Math & Science	Overall
Claude-3	CoT	0.718	0.652	0.683	0.686
Claude-3	SC	0.736	0.667	0.706	0.705
Claude-3	MAD	0.717	0.654	0.688	0.688
Claude-3	Rerailer	0.757	0.669	0.729	0.722
GPT-3.5	CoT	0.727	0.625	0.669	0.675
GPT-3.5	SC	0.738	0.640	0.690	0.692
GPT-3.5	MAD	0.727	0.667	0.693	0.697
GPT-3.5	Rerailer	0.740	0.694	0.745	0.730
GPT-4	CoT	0.728	0.675	0.680	0.694
GPT-4	SC	0.740	0.690	0.701	0.711
GPT-4	MAD	0.746	0.689	0.692	0.708
GPT-4	Rerailer	0.760	0.716	0.759	0.748

4.2 Ablation Study

Table 2: Ablation Study of Our Model Across Different Components and Categories with GPT-4

Method	CM Reasoning	Math	Adv. Math & Science	Overall
Full Pipeline	0.420	0.557	0.668	0.585
Without Judge	0.400	0.518	0.479	0.472
Without Debate	0.343	0.593	0.548	0.516

Through a series of ablation studies, we assess the contribution of individual components within the CoT Rerailer to its overall effectiveness.

Number of Generated RPs in Derailment: In the Derailment Identification step of our CoT Rerailer pipeline, we generate multiple RPs for each question to assess the consistency of the model’s responses. To determine the optimal number of samples, we first followed the practice of the original self-consistency work (Wang et al., 2023) and generated 40 samples for selected questions from our dataset. We then varied the number of generated samples from 1 to 40 to find the most efficient and effective number for our pipeline. Figure 18 in Appendix A.6 presents the accuracy of the Derailment Identification step for each number of samples. We observed that increasing the number of samples generally improved performance, but the rate of improvement diminishes beyond 3 samples. Considering the trade-off between computational cost and performance, we chose to generate 3 samples for each question in the Derailment Identification step of our pipeline.

Judge vs. Random Selection: The Judge plays a crucial role in selecting the best RP from the set of inconsistent RPs to reduce computational overhead. Demonstrations of the Judge agent are shown in Fig. 8 and Fig. 9. When the Judge is removed from the pipeline, an RP is instead selected at random from the inconsistent RPs and used in the subsequent steps. The resulting performance drop (overall accuracy falls from 0.585 to 0.472) appears in the "Without Judge" row of Table 2.

Multi-agent Debate: The multi-agent debate component of the CoT Rerailer pipeline verifies and refines the selected RPs after the derailment step through a process of argument and counterargument. When the debate component is removed, the output of the evaluator is fed directly back to the LLM to re-answer the question without further verification. The resulting drop in overall accuracy (from 0.585 to 0.516) appears in the "Without Debate" row of Table 2.

Cost Analysis: In Table 3, we compare the average API and time costs of basic CoT prompting, SC prompting, MAD prompting, and the Rerailer. CoT prompting involves only a single API request per question, so its cost is significantly lower than that of SC and the Rerailer. As a trade-off, its accuracy is relatively low and its reasoning paths often contain unfaithful intermediate hallucinations. SC improves overall accuracy by about 2% on average yet requires 40-64 times the sampling cost, depending on the predefined sampling budget, which amounts to $2,400 in API cost and 244.4 hours for 1000 questions. For MAD prompting, the exact cost depends on the number of agents and rounds of debate. In our replication setting, MAD costs approximately $360 in API fees and 36.7 hours for 1000 questions, improving accuracy by only about 1% at 6x the cost of CoT prompting. Finally, for the proposed CoT Rerailer, even though the rerailment stage itself involves many LLM API calls, the derailment identifier significantly reduces this cost because not all RPs need to enter rerailment. Therefore, with a 5% accuracy improvement, the Rerailer requires less API cost and computational time than both SC and MAD.

Table 3: Cost Analysis of Methods on GPT-4 on Average per one thousand questions. ACC: Accuracy; hrs: hours.

Method	ACC (%)	Cost ($)	Time (hrs)
CoT	69.4	60	6.1
SC	71.1	2400	244.4
MAD	70.8	360	36.7
Rerailer	74.8	343.7	35.0
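The cost-versus-accuracy trade-off in Table 3 can be made explicit with a short calculation. The figures below are copied from Table 3; the helper `relative_to_cot` is a hypothetical name introduced only for this worked example.

```python
from typing import Tuple

# Figures copied from Table 3 (GPT-4, per 1000 questions).
TABLE3 = {
    "CoT": {"acc": 69.4, "cost": 60.0, "hours": 6.1},
    "SC": {"acc": 71.1, "cost": 2400.0, "hours": 244.4},
    "MAD": {"acc": 70.8, "cost": 360.0, "hours": 36.7},
    "Rerailer": {"acc": 74.8, "cost": 343.7, "hours": 35.0},
}


def relative_to_cot(method: str) -> Tuple[float, float]:
    """Return (cost multiple of CoT, accuracy gain in points over CoT)."""
    row, base = TABLE3[method], TABLE3["CoT"]
    return row["cost"] / base["cost"], row["acc"] - base["acc"]


for name in TABLE3:
    cost_ratio, acc_gain = relative_to_cot(name)
    print(f"{name}: {cost_ratio:.1f}x CoT cost, {acc_gain:+.1f} accuracy points")
```

By this calculation SC spends 40x the CoT budget for a gain of 1.7 points and MAD spends 6x for 1.4 points, while the Rerailer spends about 5.7x for 5.4 points.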

Error Analysis: Despite these advancements, hallucinations were not entirely eliminated. In Fig. 5, we compare the error-correcting ability of SC, MAD, and the Rerailer. The upper-left corner of the confusion matrix represents the case where all methods fail to correct the RP. The main sources of this error type include inaccuracies in the original dataset, lack of background knowledge, and ambiguous questions. For instance, discrepancies between the LLM’s answers and the ground-truth responses often highlighted errors in the dataset rather than in the LLM’s reasoning. Some questions required external information that was not available to the LLM, leading to incorrect responses. Ambiguous questions also made it difficult to arrive at accurate answers without further clarification. In addition, we observed LLM-specific errors, such as the Step Evaluator flagging minor variations in expression. In one case, the LLM considered two equivalent equations to be different, leading to a false positive hallucination detection. This highlights the need for further refinement in distinguishing genuine hallucinations from minor variations in expression. Detailed error analysis cases are provided in Figs. 15-17 in the Appendix. It is also worth noting that the Rerailer generally demonstrates better correction capability than SC and MAD. This is reflected in the upper-right corner of the confusion matrix, where the original RP fails to answer the question correctly while the Rerailer succeeds.

5 Conclusion

In this work, we introduced the CoT Rerailer, a novel pipeline that combines consistency checks and multi-agent debate to detect and mitigate hallucinations in the reasoning paths generated by large language models. Our extensive experiments on diverse question-answering datasets demonstrate the effectiveness of the CoT Rerailer in improving the accuracy and reliability of the generated responses while maintaining computational efficiency compared to existing methods. The CoT Rerailer’s ability to identify and correct errors in intermediate reasoning steps is a significant advancement in enhancing the interpretability and trustworthiness of LLM-generated content. By leveraging the strengths of self-consistency and multi-agent debate, our pipeline effectively traces the root causes of hallucinations and proposes corrected reasoning paths that align with factual and logical reasoning. The improved performance of the CoT Rerailer across various knowledge domains highlights its versatility and potential for real-world applications. As LLMs continue to play an increasingly important role in AI-driven decision-making processes, the development of methods like the CoT Rerailer is crucial for ensuring the reliability and transparency of these systems.

6 Limitations

Despite its advancements, the CoT Rerailer faces challenges, including increased computational demands and randomness in the quality of the source RPs, which can affect its evaluation. Furthermore, its performance may be limited on questions requiring knowledge beyond the LLM’s training corpus. Finally, we apply our method only to MCQA tasks and some simple open-ended math problems; its performance on other tasks that require complex reasoning remains under-explored. Future work should aim to enhance computational efficiency, possibly through advanced filtering or parallel processing, and to integrate external knowledge sources, for example through a RAG system, to overcome the limitations of current LLM training and improve reproducibility, which could also further justify our hallucination mitigation strategies. Additionally, enhancing the Step Evaluator’s ability to distinguish between genuine hallucinations and minor variations in expression could further reduce false positive detections and improve the pipeline’s precision.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Amini et al. (2019) Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota. Association for Computational Linguistics.
Anthropic (2024) Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku.
Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
Chan et al. (2024) Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2024. Chateval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
Delétang et al. (2023) Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, et al. 2023. Language modeling is compression. arXiv preprint arXiv:2309.10668.
Diao et al. (2023) Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. 2023. Active prompting with chain-of-thought for large language models. Preprint, arXiv:2302.12246.
Du et al. (2023) Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325.
Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Pal: Program-aided language models. Preprint, arXiv:2211.10435.
Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Preprint, arXiv:2009.03300.
Huang et al. (2023a) Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2023a. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798.
Huang et al. (2023b) L Huang, W Yu, W Ma, W Zhong, Z Feng, H Wang, Q Chen, W Peng, X Feng, B Qin, et al. 2023b. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.
Kojima et al. (2022) Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199–22213.
Li et al. (2023) Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative agents for "mind" exploration of large language model society. In Thirty-seventh Conference on Neural Information Processing Systems.
Liang et al. (2023) Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging divergent thinking in large language models through multi-agent debate. Preprint, arXiv:2305.19118.
Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark JF Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
Mündler et al. (2024) Niels Mündler, Jingxuan He, Slobodan Jenko, and Martin Vechev. 2024. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. In The Twelfth International Conference on Learning Representations.
Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training.
Srivastava et al. (2023) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Preprint, arXiv:2206.04615.
Talebirad and Nadiri (2023) Yashar Talebirad and Amirhossein Nadiri. 2023. Multi-agent collaboration: Harnessing the power of intelligent llm agents. Preprint, arXiv:2306.03314.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
Wang et al. (2023) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. Preprint, arXiv:2203.11171.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
Yao et al. (2024) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36.
Zhang et al. (2023a) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023a. Siren’s song in the ai ocean: A survey on hallucination in large language models. Preprint, arXiv:2309.01219.
Zhang et al. (2023b) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2023b. Automatic chain of thought prompting in large language models. In The Eleventh International Conference on Learning Representations.
Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. 2023. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations.

Appendix A Appendix

A.1 Package Used and Code

To generate LLM responses and parse out answers, we use the packages "langchain", "langchain_openai", "langchain_anthropic", "langchain_community", and "langchain_core" provided by LangChain (https://www.langchain.com/). In addition, we use "pandas" for data processing, "matplotlib" for visualization, and "numpy" for basic mathematical manipulation.

Our code is available in an anonymized repository: https://anonymous.4open.science/r/rerailer-0031/Readme.md

A.2 Experiments Details

This section provides details that help readers better understand our work. It is not exhaustive given space and time limits, but it covers what we consider necessary to understand what we did and what we found in this study.

A.2.1 Experiments Setup

LLM Sources In this study, we use the OpenAI GPT models "gpt-4" and "gpt-3.5-turbo" and the Anthropic Claude model "claude-3-sonnet".

Data Sources The datasets used in our experiments are derived from a variety of sources, each contributing to the diversity of subjects and complexity of questions analyzed. Below we detail the origin of each subject’s data:

• Big Bench: This dataset contributes to the Date Understanding and Disambiguation subjects, providing a focused set of questions that test the model’s ability to process and understand dates and time-related queries.
• MathQA: The Math subject is sourced from the MathQA dataset, which includes a wide range of mathematical problem-solving questions designed to test computational and reasoning skills.
• GSM8K: Data for the Challenging Math subject comes from the GSM8K dataset, known for its complex mathematics questions that require advanced problem-solving capabilities.
• MMLU (Test Set): The majority of subjects, including Philosophy, Jurisprudence, International Law, Professional Law, Business Ethics, College Chemistry, College Medicine, College Physics, College Biology, College Mathematics, Abstract Algebra, Formal Logic, Professional Accounting, College Computer Science, Econometrics, and Electrical Engineering, are derived from the test set of the Massive Multitask Language Understanding (MMLU) dataset. This dataset is notable for its broad coverage of subjects, offering a rigorous testing ground for our models across a wide spectrum of disciplines.

Subjects Descriptions In our experiments, we utilize a diverse range of subjects to evaluate the performance of our models. These subjects are grouped into broader categories to facilitate analysis and understanding. Below is a description of these subjects and their corresponding broader category:

• Commonsense Reasoning: This broad category encompasses subjects that deal with the fundamentals of commonsense and social science subjects. It contains 787 questions spanning:
  - Disambiguation
  - Date Understanding
  - Philosophy
  - Jurisprudence
  - International Law
  - Professional Law
  - Business Ethics
• Advanced Math & Science: Focused on challenging math and college-level natural sciences, this theme covers various scientific disciplines that explore the natural world and contains 1025 questions. Subjects under this theme are:
  - Challenging Math
  - Professional Accounting
  - College Chemistry
  - College Medicine
  - College Physics
  - College Biology
  - College Mathematics
  - College Computer Science
  - Electrical Engineering
• Elementary Math: This theme includes categories related to foundational mathematical concepts and high school-level statistics. The category contains 665 questions and the specific subjects are:
  - Math
  - High School Statistics
  - Abstract Algebra
  - Elementary Mathematics
  - Formal Logic

The categorization is designed to reflect the diversity and scope of the subjects our models are evaluated against, ensuring a comprehensive assessment across a wide array of knowledge domains. To obtain the LLM RPs for the experiments, we spent roughly 2,000 USD and 250 hours in total on LLM API usage.

Key Code Explanation

We explain two parts of our code in detail: (1) how we define the LLM model class and parse the result, and (2) how we check consistency for the parsed output.

We first define a ChatModelWorker class to handle the interactions with the LLMs. The class is initialized with the following parameters:

• output_parser: An instance of the output parser used to extract the formatted output from the LLM’s response.
• temperature (optional): The temperature value for controlling the randomness of the LLM’s output. The default is 0.
• model (optional): The name of the LLM to use. The default is "gpt-4".

The class reads the API key from a file named api_key.txt and initializes an instance of the ChatOpenAI class from the langchain_openai package with the specified parameters.

The ChatModelWorker class includes a method called prompt_temps that generates prompt templates for the LLM. It takes the following parameters:

• sys_temp: The system message template.
• human_temp: The human message template.
• format_instructions: Additional formatting instructions for the LLM.

The method creates instances of SystemMessagePromptTemplate and HumanMessagePromptTemplate from the provided templates and combines them into a ChatPromptTemplate along with the formatting instructions.

The chain_generator method of the ChatModelWorker class generates an LLMChain object that connects the LLM with the prompt templates. It takes the following parameters:

• template: The system message template.
• human_template: The human message template.

The method retrieves the formatting instructions from the output_parser and creates an LLMChain instance using the LLM, prompt templates, and formatting instructions.

The output_repraser function is a utility function that takes the raw output string from the LLM and parses it into a dictionary. It assumes that the output is in JSON format and is enclosed within triple backticks.

The function performs the following steps:

1. Strips the leading and trailing triple backticks and newline characters from the input string.
2. Parses the resulting JSON string into a dictionary using the json.loads function.
3. Returns the parsed dictionary.
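The three steps above can be sketched as follows. This is a minimal illustration rather than the code in our repository: the handling of an optional "json" language tag after the opening backticks is an assumption added for robustness, and the FENCE constant is a hypothetical helper name.

```python
import json

# Three backticks, built programmatically to keep this listing easy to embed.
FENCE = "`" * 3


def output_repraser(raw_output: str) -> dict:
    """Parse a JSON payload wrapped in triple backticks into a dictionary."""
    text = raw_output.strip()
    # Step 1: strip the leading fence (assumption: it may carry a "json" tag).
    if text.startswith(FENCE):
        text = text[len(FENCE):]
        if text.lower().startswith("json"):
            text = text[len("json"):]
    # Step 1 (continued): strip the trailing fence and surrounding newlines.
    if text.endswith(FENCE):
        text = text[:-len(FENCE)]
    # Steps 2 and 3: parse the JSON string and return the dictionary.
    return json.loads(text.strip())


if __name__ == "__main__":
    raw = FENCE + 'json\n{"answer": "B", "confidence": 0.9}\n' + FENCE
    print(output_repraser(raw))
```

If the model returns text that is not valid JSON, json.loads raises json.JSONDecodeError, so a caller would typically catch that exception and retry the request.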

All of our created agents are based on the above code templates.

We next explain how we check consistency among the outputs and the correct answers.

The check_consistency function assesses the consistency of a list of options, applying several criteria to deem them consistent:

1. First, the function defines a set of valid_options containing the characters {'A', 'B', 'C', 'D', 'E', 'F'}.
2. It checks if all options in the input list are long strings, specifically with a length greater than 30 characters. If this condition is met, the options are considered consistent, and the function returns True.
3. Next, the function cleans the options by removing any special characters and converting the text to uppercase. This step ensures that only alphanumeric characters and spaces are considered in the consistency check.
4. The function then verifies if all cleaned options:
   (a) Begin with the same character that exists in the valid_options set, and
   (b) Are shorter than 40 characters.
   If these conditions hold, the options are deemed consistent, returning True.
5. Lastly, it checks if all cleaned options are identical. If so, the function concludes the options are consistent and returns True.
6. If none of the above criteria are met, the options are considered inconsistent, and the function returns False.
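A minimal sketch of these criteria is given below. It follows the listed steps but is not copied from our repository; the exact regular expression and the whitespace trimming after cleaning are assumptions.

```python
import re

VALID_OPTIONS = {"A", "B", "C", "D", "E", "F"}


def check_consistency(options):
    """Return True when the parsed answers agree under the criteria above."""
    # Step 2: all answers are long free-text strings (more than 30 characters).
    if all(len(opt) > 30 for opt in options):
        return True
    # Step 3: keep only alphanumeric characters and spaces, then uppercase.
    # Assumption: leading and trailing spaces are trimmed after cleaning.
    cleaned = [re.sub(r"[^A-Za-z0-9 ]", "", opt).strip().upper() for opt in options]
    # Step 4: short answers that all begin with the same valid option letter.
    first_chars = {c[:1] for c in cleaned}
    if (
        len(first_chars) == 1
        and first_chars <= VALID_OPTIONS
        and all(len(c) < 40 for c in cleaned)
    ):
        return True
    # Step 5: all cleaned answers are identical.
    if len(set(cleaned)) == 1:
        return True
    # Step 6: none of the criteria hold.
    return False


if __name__ == "__main__":
    print(check_consistency(["(A) 12", "A.", "a) twelve"]))
    print(check_consistency(["A", "B", "A"]))
```

Note that, as described in step 2, a list of long free-text answers is treated as consistent without comparing their content, so this check is most informative for multiple-choice and short numeric answers.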

A.2.2 Main Results

In this subsection, we will introduce some detailed information about our experiments.

The provided plots compare the accuracy of different prompting methods, including Self-Consistency, Multi-agent Debate and Rerailer, across subjects (Figure 6).

In Figure 6, we can observe that the Output_Answer consistently outperforms the other baseline answer types across all broad categories (Advanced Math, Applied Science, Elementary Math, Law & Philosophy, and Natural Science). This suggests that the Derailment Identification step is effective in identifying and filtering out incorrect answers, leading to improved accuracy. We also observe that the Multi-step answer has the best performance overall, which is also reflected in the main text.

Overall, these plots highlight the effectiveness of our pipeline in improving the accuracy of the question-answering system across various categories.

We will present our results from our second pipeline using Confusion Matrices. A confusion matrix summarizes the performance of a classification model by comparing the predicted labels to the actual labels. In our project, the confusion matrix evaluates the effectiveness of our pipeline in correcting the answers from the Derailment Identification step.

The confusion matrix consists of four components: True Positive (TP), True Negative (TN), False Negative (FN), and False Positive (FP). TP indicates that both the RP and Rerailer-corrected RP answers are correct. TN means the raw RP answer is incorrect, but the pipeline successfully corrects it. FN indicates that both the raw and corrected RP answers are incorrect, and the pipeline fails to correct the answer. FP means the raw RP answer is correct, but the pipeline introduces an error by modifying it.
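The four outcomes can be tallied as in the following sketch, which uses the definitions given above (they differ from the conventional classification usage of these terms). The function names and the boolean input format are hypothetical and chosen for illustration only.

```python
from collections import Counter


def rerailer_outcome(raw_correct: bool, corrected_correct: bool) -> str:
    """Map one question to TP, TN, FN, or FP using the definitions above."""
    if raw_correct and corrected_correct:
        return "TP"  # correct before and after the Rerailer
    if not raw_correct and corrected_correct:
        return "TN"  # incorrect raw answer that the pipeline corrects
    if not raw_correct and not corrected_correct:
        return "FN"  # incorrect raw answer that the pipeline fails to correct
    return "FP"      # correct raw answer that the pipeline turns incorrect


def confusion_counts(pairs):
    """Count outcomes over (raw_correct, corrected_correct) pairs."""
    counts = Counter(rerailer_outcome(raw, fixed) for raw, fixed in pairs)
    return {label: counts.get(label, 0) for label in ("TP", "TN", "FN", "FP")}


if __name__ == "__main__":
    # Toy input, not experimental data.
    toy = [(True, True), (False, True), (False, False), (True, False), (False, True)]
    print(confusion_counts(toy))
```

Running the same tally per broad category and once over all questions would yield the per-category and overall matrices.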

We calculate confusion matrices for each broad category and the overall dataset in Figure 5 to analyze the model’s performance within each category and across all categories.

A.2.3 Descriptive Statistics

All the reported results are based on a single run due to budget constraints. Re-running the experiments would cost additional thousands of dollars. However, we verify the robustness of the proposed method by including different categories and subjects.

A.3 Prompts and Case Studies

In this section, we include our prompts for each component and their corresponding results. Another hallucination mitigation example, for a physics question, is shown in Fig. 7.

A.3.1 Raw CoT Generator

In our study, the primary goal was to rerail a hallucinated RP. Hence, we use the basic CoT prompt design with a zero-shot learning approach. Our prompt was defined as follows:

System Message: You are a professional specialized in {subject}. You need to help me answer the given question. Notice that you need to solve the question step by step and as detailed as possible. Do not jump to the answer directly. If it is a computational question, please provide me with the detailed calculation in your steps, not just say the method! Your intermediate steps and thoughts are critical!

Human Message: The question can be found in {question}

A.3.2 Derailment Identification

The first step of the proposed pipeline consists of a consistency filter with a judge. The consistency filter removes questions that the LLM answers confidently, that is, questions for which it produces similar answers across all sampled RPs. When the LLM produces inconsistent RPs, we use a Judge agent to determine which RP is the most logically sound. The Judge agent selects the best RP from three candidates and passes it to the Rerailer for further mitigation (see Fig. 8, 9). The Judge agent prompt is defined as follows:

System Message: You are a professional specialized in {subject}. A Chain of Thought (COT) is a step-by-step reasoning process used to solve a problem or answer a question. You have been presented with three different RPs below for the question "{question}". Please carefully analyze these RPs and provide your assessment on which one is the most logically sound based on the given information and your expertise in the subject.

Human Message: "Here are the three Reasoning Paths (RPs) for your analysis:" "RP 1: {rp1}" "RP 2: {rp2}" "RP 3: {rp3}"
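The routing logic of this step can be sketched as below. This is an illustrative outline under stated assumptions: is_consistent and judge are hypothetical callables standing in for the consistency check and the LLM-based Judge agent, and the stubs in the usage example do not call any model.

```python
from typing import Callable, Optional, Sequence


def derailment_identification(
    rps: Sequence[str],
    answers: Sequence[str],
    is_consistent: Callable[[Sequence[str]], bool],
    judge: Callable[[Sequence[str]], int],
) -> Optional[str]:
    """Return the RP to send to the Rerailer, or None if the answers agree."""
    # Consistent answers are treated as confident and skip further mitigation.
    if is_consistent(answers):
        return None
    # Otherwise the judge picks the index of the most logically sound RP.
    return rps[judge(rps)]


if __name__ == "__main__":
    # Hypothetical stubs: exact-match consistency and a judge that picks RP 2.
    same_answer = lambda answers: len(set(answers)) == 1
    pick_second = lambda rps: 1
    paths = ["RP 1 text", "RP 2 text", "RP 3 text"]
    print(derailment_identification(paths, ["A", "A", "A"], same_answer, pick_second))
    print(derailment_identification(paths, ["A", "C", "A"], same_answer, pick_second))
```

Keeping the judge behind a callable makes it possible to test the routing without spending API budget.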

A.3.3 Step Evaluator Agent

As a critical component of the CoT Rerailer, the Step Evaluator agent checks each intermediate step for hallucination. We tried multiple prompts and found that directly asking whether the LLM thought a step was "correct" was ambiguous. Hence, in our prompt engineering, we formally defined the term hallucination, inspired by []’s definition. Instead of asking for "correctness", we required the LLM to determine whether there were any logic mistakes (factuality hallucination) or inconsistencies (faithfulness hallucination). Some of the generated results can be found in Fig. 10, 11, and the prompt was defined as:

System Message: You are a professional specialized in {subject}. You need to help me verify my steps when I solve the question. I am currently at step #{current_step}. Before you perform the task, I want you to keep in mind several definitions for my possible mistakes. 1. Factuality: This type of error emphasizes the discrepancy between generated content and verifiable real-world facts, including factual inconsistency or fabrication. In mathematics, for instance, it may represent the computational error. 2. Faithfulness: This type of error refers to the divergence of my step analysis from the original question or previous steps, as well as consistency within my steps. In mathematics, for instance, it may represent that I understood the question wrongly or my proposed step is inconsistent with my previous step. Based on my current step response, question, previous steps, and my error definitions, help me verify if any of the mistakes (factuality or faithfulness) occur in my analysis. Notice that skipping a step should not be considered an error as long as the calculation is correct! For instance, 2x+2 should be the same as 2+2x. Also, 2x+2+3 should be the same as 2x+5 At step 1, since we have no step 0, instead, the factuality and faithfulness check should reflect if I correctly understood the answer. Do not detect any minor hallucinations! In other words, only targeting the mistakes that contain calculation errors or apparent logical flaws or contradict real-world facts! If the provided step acknowledges the mistake, you need to capture it and correct it. If you see any step ends up with ’verified’ it means it has been checked without any mistake, so just consider it as correct and do not have to give the verification. Simply say step hallucination is [NO]

Human Message: Here is my complete thought process {RP} and this is the original question {question}

A.3.4 Debate Mitigator Agent

To verify whether the correction proposed by the Step Evaluator agent was truly correct, our Debate Mitigator agents conducted a multi-agent debate. As with the Evaluator, it was necessary to give the debate agents the formal definition of hallucination. Fig. 12, 13 show examples of the process, and the prompt was defined as follows:

System Message: You are a professional specialized in {subject}. You need to help me verify my steps when I solve the question. I am currently at step #{current_step}. 1. Factuality: This type of error emphasizes the discrepancy between generated content and verifiable real-world facts, including factual inconsistency or fabrication. In mathematics, for instance, it may represent the computational error. 2. Faithfulness: This type of error refers to the divergence of my step analysis from the original question or previous steps, as well as consistency within my steps. In mathematics, for instance, it may represent that I understood the question wrongly or my proposed step is inconsistent with my previous step. Other agents helped me identify the error I made in the current step. Your goal is to debate with the other agents and justify if their corrections were correct based on my question, and thought process. Please use Critical Thinking and only capture the significant mistake that will lead to the wrong answer. Errors like different interpretations should be ignored.

Human Message: Here is my complete thought process {RP} and this is the original question {question}. The full response from the other agents was given as {response}

A.3.5 Re-answer Agent

Finally, the corrected step, along with the previously verified steps, was used as the initial thought process to prompt the LLM to regenerate the thought chain. In our experiments, we observed that correcting only one step was sometimes not sufficient to mitigate mistakes, since the newly generated RP can also suffer from hallucinations. However, such hallucinations become less likely, or the RP contains fewer errors, because the first few checked steps are already verified. An example illustration can be found in Fig. 14. The prompt for the re-answer agent was defined as follows:

System Message: You are a professional specialized in {subject}. Your task is to help me answer the question based on my initial thoughts. I will provide you with several steps of my attempt. Your task is to CONTINUE my thought process and then answer my question step by step. Also, a maximum of 12 steps are allowed and you can assume my initial thoughts had been checked since could be trusted. Remember, your response should based on my initial thoughts!

Human Message: Here is my question:{question}. And my initial thought process is given as {RP}
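One way to assemble the initial thought process passed as {RP} is sketched below. The step numbering, the "(verified)" suffix, and the function name are assumptions made for illustration; they are not taken from our released code.

```python
def build_initial_thoughts(steps, error_index, corrected_step):
    """Join the verified steps before the error with the corrected step.

    steps: the original reasoning steps, in order.
    error_index: zero-based position of the first hallucinated step.
    corrected_step: the replacement text accepted after the debate.
    """
    # Keep only the steps that were checked before the error was found.
    trusted = list(steps[:error_index])
    # Replace the faulty step with its correction; later steps are discarded
    # so the model regenerates them from the corrected prefix.
    trusted.append(corrected_step)
    return "\n".join(
        f"Step {number}: {text} (verified)"
        for number, text in enumerate(trusted, start=1)
    )


if __name__ == "__main__":
    original = ["Let x be the speed.", "Then 2x + 2 = 10, so x = 5.", "Answer: 5."]
    print(build_initial_thoughts(original, 1, "Then 2x + 2 = 10, so x = 4."))
```

Discarding the steps after the corrected one reflects the idea that they may depend on the error and should be regenerated.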

A.4 Error Analysis Case Study

In our experiments, three typical types of errors potentially degraded our performance: wrong ground truth, lack of background information, and ambiguous questions. Detailed examples are shown in Fig. 15, 16, 17.

A.5 Number of Generated Samples

A.6 Selecting the Number of Generated Samples in the Derailment Process

In the Derailment Identification step of our CoT Rerailer pipeline, we generate multiple reasoning paths (RPs) for each question to assess the consistency of the model’s responses. To determine the optimal number of samples, we first followed the practice of the original self-consistency work (Wang et al., 2023) and generated 40 samples for selected questions from our dataset. We then varied the number of generated samples from 1 to 40 to find the most efficient and effective number for our pipeline. Figure 18 presents the percentage of correct answers (i.e., accuracy) of the Derailment Identification step for each number of samples. We observed that increasing the number of samples generally improved performance, but the rate of improvement diminished beyond 3 samples, with accuracy plateauing at around 54% when using 3 or more samples.

Considering the trade-off between computational cost and performance, we chose to generate 3 samples for each question in the Derailment Identification step of our pipeline. This choice strikes a balance between effectively identifying inconsistent responses and maintaining computational efficiency. While generating more samples might slightly improve the performance, the additional computational cost outweighs the marginal gains in accuracy beyond 3 samples.
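The selection rule described here can be expressed as a small helper that returns the smallest number of samples after which one more sample yields less than a chosen gain. This is a sketch only: the threshold and the accuracy values in the usage example are placeholders for illustration, not measurements from our experiments.

```python
def smallest_sufficient_k(accuracy_by_k, min_gain):
    """Return the smallest k whose next step improves accuracy by < min_gain."""
    ks = sorted(accuracy_by_k)
    for k, next_k in zip(ks, ks[1:]):
        # Stop at the first point where the marginal gain is negligible.
        if accuracy_by_k[next_k] - accuracy_by_k[k] < min_gain:
            return k
    # No plateau found: fall back to the largest number of samples tried.
    return ks[-1]


if __name__ == "__main__":
    # Placeholder accuracies keyed by number of samples (not real results).
    toy_curve = {1: 0.48, 2: 0.51, 3: 0.54, 4: 0.541, 5: 0.542}
    print(smallest_sufficient_k(toy_curve, min_gain=0.01))
```

The min_gain threshold encodes the cost and performance trade-off: a larger value favors fewer samples and lower API cost.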