Review-Feedback-Reason (ReFeR): A Novel Framework for NLG Evaluation and Reasoning

Yaswanth Narsupalli1*, Abhranil Chandra2*, Sreevatsa Muppirala1,
Manish Gupta3, Pawan Goyal1
1IIT Kharagpur, 2University of Waterloo, 3Microsoft, India
[email protected], [email protected], [email protected]
* denotes the first two authors contributed equally.
Abstract

Assessing the quality of Natural Language Generation (NLG) outputs, such as those produced by large language models (LLMs), poses significant challenges. Traditional approaches involve either resource-intensive human evaluations or automatic metrics, which often exhibit a low correlation with human judgment. In this study, we propose Review-Feedback-Reason (ReFeR), a novel evaluation framework for NLG using LLM agents. We rigorously test ReFeR using two pre-existing benchmark datasets on diverse NLG tasks. The proposed framework not only enhances the accuracy of NLG evaluation, surpassing previous benchmarks by ~20%, but also generates constructive feedback and significantly improves collective reasoning. This feedback is then leveraged for the creation of instruction-tuning datasets, which, when used to fine-tune smaller models like Mistral-7B, make them strong evaluators, yielding a better correlation with human evaluations and performance nearly on par with GPT-3.5. We highlight the effectiveness of our methodology through its application on three reasoning benchmarks, where it outperforms most of the state-of-the-art methods, and also outperforms the reasoning capabilities of models like GPT-3.5 Turbo by ~11.67% and GPT-4 by ~1% on average. (Footnote 1: The code will be provided after review.)


1 Introduction

The rapid production of content by Foundation Models (FMs) (Bommasani et al., 2021), including large language models (LLMs), poses a challenge to traditional human-centric evaluation methods and conventional linguistic metrics like BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005), which often misalign with human judgment and face limitations in assessing creative or nuanced responses. Recent developments suggest using LLMs as novel, reference-independent evaluators by assessing text quality based on predicted sequence likelihoods, bypassing the need for direct reference comparisons (Chen et al., 2023). Recently, there has been work (Liu et al., 2023b; Chiang and Lee, 2023) on improving the evaluation capability of individual LLMs. Surprisingly, although an ensemble of multiple LLMs is expected to perform better, there has not been much work on how to align evaluations using multiple LLMs with human judgments.

We thus introduce the Review-Feedback-Reason (ReFeR) framework, drawing inspiration from the academic peer review process to enhance the evaluation of NLG outputs. By using LLMs as evaluators and feedback providers in a system akin to academic peer review, ReFeR enables a nuanced and comprehensive evaluation of NLG tasks across various domains, promoting model self-improvement, explainability, and robustness in complex scenarios. The paper outlines ReFeR’s methodology, including its unique evaluation schema (that diverges from existing benchmarks as outlined by Liu et al. (2023b) and Chiang and Lee (2023)) and the strategic use of LLM agents in roles parallel to peer reviewers and area chairs, facilitating a multi-dimensional evaluation and generating constructive feedback for model refinement.

Figure 1: Illustration of the ReFeR framework on the SummEval dataset. A case study example is given in Appendix E.

The framework’s efficacy has been validated across two established benchmarks. The ReFeR framework not only surpasses existing methods on benchmarks but also suggests its potential as a universal tool for evaluating NLG text. Furthermore, ReFeR’s feedback mechanism has enabled the production of instruction-tuning datasets, which are used to fine-tune smaller models to evaluative standards nearly matching those of at least 25× larger models like GPT-3.5, achieving a closer correlation with human evaluation. The framework’s review and feedback mechanisms enhance its performance on reasoning tasks despite not being specifically pre-trained or fine-tuned for the purpose.

To summarize, the primary contributions of our research are as follows: (1) Introduction of a general-purpose NLG evaluation framework, called ReFeR, inspired by the traditional academic peer review system. (2) Development of a novel evaluation schema, incorporating an evaluation guidelines module alongside a critical comments module. (3) Automated creation of an instruction-tuning dataset from the framework’s feedback outputs, designed to enhance smaller models to achieve the evaluative prowess of their larger counterparts. (4) Empirical validation of the framework’s capability to show enhanced reasoning skills on three benchmarks.

2 ReFeR Methodology for NLG Evaluation

In this section, we introduce ReFeR, a versatile framework for evaluating NLG tasks by using LLM agents in roles analogous to peer reviewers and area chairs, as found in the academic peer review process.

2.1 ReFeR Framework

The challenge of evaluating work without a predefined correct answer, such as determining the quality of a research paper, is traditionally addressed in academia through the peer review system. In this process, subject-matter experts, called peer reviewers, independently review submissions. Authors then have an opportunity to address any concerns raised. Finally, senior researchers serving as area chairs review the adjusted feedback and make the final decision on whether to accept or reject the submission. Our framework draws inspiration from this process and aims to evaluate NLG outputs by replicating this academic review methodology.

The framework is structured into three distinct modules, as depicted in Fig. 1. The first module, the Peer Review Body, consists of three LLM agents. Each agent independently evaluates a specific NLG output, providing a comment and a rating. The following module is the Critic Module (optional), wherein another LLM agent, emulating a critic, assesses the evaluations made by the peer reviewers. The peer reviewers can then revisit these interactions and adjust their assessments before forwarding their final reviews to the Area Chair Module. The final module features an LLM agent acting as an Area Chair, who considers the conclusive reviews to perform the ultimate evaluation of the NLG output. We Reason using the LLM agents as peers and Area Chair, take the Feedback of the peers and pass it to the Area Chair, and finally give a Review or score to the NLG text. Hence, our framework is named ReFeR.
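The three-module flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `Review` record, the `Peer`, `Critic`, and `AreaChair` signatures, and the `refer_evaluate` function are hypothetical names introduced here, and each agent is assumed to wrap a call to some LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Review:
    comment: str   # free-text critical comment
    rating: float  # numeric score for the dimension being judged


# Hypothetical agent signatures; each would wrap a call to some LLM.
Peer = Callable[[str, Optional[str]], Review]      # (nlg_output, critique) -> review
Critic = Callable[[str, List[Review]], str]        # (nlg_output, reviews) -> critique
AreaChair = Callable[[str, List[Review]], Review]  # (nlg_output, reviews) -> final review


def refer_evaluate(nlg_output: str, peers: List[Peer], area_chair: AreaChair,
                   critic: Optional[Critic] = None) -> Review:
    # Module 1: peers review independently, without seeing each other's output.
    reviews = [peer(nlg_output, None) for peer in peers]
    # Module 2 (optional): a critic assesses the reviews, then peers may revise.
    if critic is not None:
        critique = critic(nlg_output, reviews)
        reviews = [peer(nlg_output, critique) for peer in peers]
    # Module 3: the area chair reads the final reviews and gives the verdict.
    return area_chair(nlg_output, reviews)


if __name__ == "__main__":
    # Stub agents standing in for real LLM calls.
    def make_peer(score: float) -> Peer:
        return lambda text, critique: Review(f"stub comment ({score})", score)

    def mean_chair(text: str, reviews: List[Review]) -> Review:
        avg = sum(r.rating for r in reviews) / len(reviews)
        return Review("stub verdict", avg)

    final = refer_evaluate("A generated summary.",
                           [make_peer(3), make_peer(4), make_peer(5)], mean_chair)
    print(final.rating)  # 4.0
```

Passing `critic=None` corresponds to skipping the optional Critic Module, in which case the area chair receives the peers' first-pass reviews unchanged.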

Figure 2: Evaluation Schema for ReFeR’s prompt.

2.2 Evaluation Schema

An important aspect of assessing NLG outputs with LLM agents involves crafting prompts that elicit the highest quality evaluations. Prior work G-Eval by Liu et al. (2023b) introduced a structured evaluation schema, which organized the prompt into sections: task introduction, evaluation criteria, steps for evaluation, input presentation, and an evaluation form designed to output a numerical rating. Subsequently, Chiang and Lee (2023) demonstrated that a Chain of Thought (CoT) approach does not consistently yield the most accurate correlations with human judgment. They proposed an adjusted schema named Analyze-Rate, which prioritizes an analytical review followed by the scoring. This method showed improved performance over the G-Eval schema.

To further refine this approach, we introduce “evaluation guidelines” to enhance the peer reviewer’s understanding of the scoring criteria, much like guidelines provided in traditional academic review processes. This modification posits that clear guidelines can improve evaluation accuracy by standardizing the scoring rationale. Evaluation guidelines can be automatically generated by prompting an LLM with examples from the dataset. We call this process ‘Auto Prompt’. Another possible way to include evaluation guidelines is to use the manually written human annotation guidelines of the dataset. We also changed the evaluation form to include a critical comment or reasoning for the given score. The proposed evaluation schema is shown in Fig. 2. Asking for such an explanation alongside the score improves performance, as previously shown by Chiang and Lee (2023).
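As a rough illustration of how such a schema could be assembled into a prompt, the sketch below concatenates named sections in a fixed order. The section names and their order are assumptions based on the description above (the G-Eval sections plus evaluation guidelines and a form asking for a comment and a score); the exact layout used in Fig. 2 may differ, and `build_prompt` and the example strings are hypothetical.

```python
from typing import Dict

# Assumed section order; not taken from the paper's actual prompt files.
SECTION_ORDER = [
    "Task Introduction",
    "Evaluation Criteria",
    "Evaluation Guidelines",
    "Evaluation Steps",
    "Input",
    "Evaluation Form",
]


def build_prompt(sections: Dict[str, str]) -> str:
    """Join the schema sections into one prompt, failing loudly if one is missing."""
    missing = [name for name in SECTION_ORDER if name not in sections]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    return "\n\n".join(f"{name}:\n{sections[name].strip()}" for name in SECTION_ORDER)


if __name__ == "__main__":
    # Placeholder texts for illustration only.
    prompt = build_prompt({
        "Task Introduction": "You will be given a news article and one summary of it.",
        "Evaluation Criteria": "Coherence (1-5): the collective quality of all sentences.",
        "Evaluation Guidelines": "5: well structured ... 1: disjointed sentences.",
        "Evaluation Steps": "Read the article, then the summary, then compare them.",
        "Input": "Article: <article text>\nSummary: <summary text>",
        "Evaluation Form": "Critical comment:\nScore (1-5):",
    })
    print(prompt)
```

Keeping the guidelines as a separate section makes it easy to swap between automatically generated guidelines and the dataset's human annotation guidelines without touching the rest of the prompt.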

3 Datasets

Following previous works (Zhong et al., 2022; Liu et al., 2023b; Chiang and Lee, 2023), our meta-evaluations predominantly utilize two datasets (SummEval, TopicalChat), each designed to test distinct evaluation capabilities of our framework for Summarization and Dialog Generation tasks.

SummEval (Fabbri et al., 2021) provides human assessments on four critical dimensions of summarization quality: fluency, coherence, consistency, and relevance, utilizing the CNN/DailyMail dataset (Hermann et al., 2015) as its foundation. Considering computational budget and time constraints, we selected only 200 samples for our experiments.

TopicalChat (Gopalakrishnan et al., 2019) establishes a framework for the meta-evaluation of evaluators in dialogue response generation systems, specifically those that incorporate knowledge elements. Our approach adheres to the methodology outlined in Zhong et al. (2022), employing human ratings to assess dialogues on four attributes: coherence, engagingness, groundedness, and naturalness.

4 Experiments and Results for NLG Evaluation

4.1 Baselines

While the current landscape of models for evaluating NLG responses includes reference-free methods such as BERTScore, GPTScore and UniEval (Zhang et al., 2020; Fu et al., 2023; Zhong et al., 2022), we do not consider these models as baselines given that they were clearly surpassed by G-Eval (Liu et al., 2023b) and later works. Given that our work primarily proposes an LLM-based evaluation, we compare primarily against G-Eval (Liu et al., 2023b) and Analyze-Rate (Chiang and Lee, 2023). G-Eval deploys a single LLM agent. This agent employs Auto-CoT (chain of thought) reasoning and a form-filling paradigm to ascertain the quality of NLG outputs, delivering only scores for the specific dimensions under scrutiny. Analyze-Rate builds upon G-Eval, advocating for an enhanced prompt structure. This methodology incorporates a preliminary analysis phase before scoring, aiming to enrich the evaluative process for NLG tasks.

Models                          Coherence       Consistency     Fluency         Relevance       Avg
                                ρ       τ       ρ       τ       ρ       τ       ρ       τ       ρ       τ
Baselines
  (Peer) GPT-3.5                0.357   0.291   0.363   0.330†  0.237   0.204   0.279   0.226   0.309   0.262
  (Peer) Mixtral                0.296   0.240   0.350   0.317   0.363   0.324†  0.293   0.243   0.326   0.281
  (Peer) Gemini                 0.370   0.296   0.289   0.268   0.309   0.267   0.210   0.168   0.295   0.250
  Analyze-Rate (GPT-3.5, n=20)  0.404†  0.299   0.370†  0.303   0.377†  0.301   0.348   0.252   0.375†  0.289†
  G-Eval (GPT-3.5, n=20)        0.393   0.295†  0.175   0.143   0.268   0.216   0.355†  0.261†  0.297   0.229
ReFeR (Ours)                    0.502*  0.374*  0.428*  0.370*  0.414*  0.330*  0.459*  0.338*  0.450*  0.353*
Table 1: Spearman Correlation (ρ) and Kendall-Tau Correlation (τ) on SummEval. The best and second-best values per column are marked with * and †, respectively. ReFeR results are without the Critic LLM. For ReFeR results with different Critic strategies, see Appendix D.1.

Models                          Coherence       Engagingness    Groundedness    Naturalness     Avg
                                ρ       τ       ρ       τ       ρ       τ       ρ       τ       ρ       τ
Baselines
  (Peer) GPT-3.5                0.417   0.350   0.519   0.439   0.527   0.493   0.416   0.348   0.470   0.407
  (Peer) Mixtral                0.424   0.358   0.532   0.456   0.443   0.410   0.451   0.376   0.463   0.400
  (Peer) Gemini                 0.363   0.303   0.477   0.398   0.539   0.504   0.398   0.333   0.444   0.385
  Analyze-Rate (GPT-3.5, n=20)  0.506†  0.384†  0.637†  0.480†  0.646†  0.546†  0.522†  0.391†  0.578†  0.450†
  G-Eval (GPT-3.5, n=20)        0.472   0.356   0.618   0.474   0.456   0.377   0.501   0.373   0.512   0.395
ReFeR (Ours)                    0.514*  0.390*  0.651*  0.502*  0.678*  0.590*  0.544*  0.414*  0.597*  0.474*
Table 2: Spearman Correlation (ρ) and Kendall-Tau Correlation (τ) on TopicalChat. The best and second-best values per column are marked with * and †, respectively. ReFeR results are without the Critic LLM. For ReFeR results with different Critic strategies, see Appendix D.1.

4.2 Experimental Setup

Because G-Eval did not release its prompts for the TopicalChat dataset, Chiang and Lee (2023) crafted new prompts for the TopicalChat benchmark, mirroring the original G-Eval design. Consequently, we used their G-Eval prompts for TopicalChat in our experiments. This keeps our baselines consistent with prior evaluations and accounts for the significant impact that prompt modifications can have on outcomes.

Our experimental framework employs GPT-3.5 Turbo (2023-06-13) (OpenAI, 2023), Gemini-Pro (Team et al., 2023), and Mixtral 8x7B (Jiang et al., 2024) as peer evaluators, and GPT-3.5 Turbo (with number of responses generated per prompt n = 20) acting solely as the Area Chair and Critic LLM. Appendix B provides details of LLM hyper-parameters. Following Fu et al. (2023) and Liu et al. (2023b), we primarily report Spearman correlations (ρ) between the scores generated by our framework and those annotated by humans, and we use this as the primary differentiator to find the best model.
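For readers who want to reproduce this kind of meta-evaluation, the following is a minimal, dependency-free sketch of the two correlation measures. It is an illustration rather than the evaluation code used in the paper: it assumes correlations are computed over one pooled list of (human, model) score pairs, it uses the tie-corrected tau-b variant of Kendall's tau, and the function names are introduced here for illustration.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b: concordant minus discordant pairs, corrected for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither factor
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + only_x_tied) * (pairs + only_y_tied))


if __name__ == "__main__":
    human = [1, 2, 3, 4, 5]   # made-up human ratings
    model = [2, 1, 4, 3, 5]   # made-up evaluator scores
    print(round(spearman(human, model), 3))       # 0.8
    print(round(kendall_tau_b(human, model), 3))  # 0.6
```

In practice a library routine would normally be used instead of hand-written loops; the point of the sketch is only to make explicit what ρ and τ measure.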

4.3 Main Results for NLG Evaluation

We assess ReFeR’s performance through a series of experiments, employing a diverse array of LLM agents as peers and an Area Chair.

Tables 1 and 2 show ρ and τ for SummEval and TopicalChat, respectively. Results are shown for the individual performance of each of the 3 peers, and for G-Eval and Analyze-Rate (both with GPT-3.5, n=20) as baseline methods. Our framework’s results here do not use the Critic LLM phase, but we show results for 4 variations of the ReFeR framework with the critic phase in Appendix D.1. Comparing ReFeR’s outcomes against those of G-Eval and Analyze-Rate, and against the individual peers’ scores, shows the substantial improvement brought by our framework. Specifically, ReFeR improves the average Spearman correlation by ~20% on the SummEval dataset and by ~3% on the TopicalChat dataset relative to the best baseline.

Table 1 compares different models based on coherence, consistency, fluency, and relevance on SummEval. Table 2 compares different models based on various aspects of conversation quality (coherence, engagingness, groundedness, and naturalness) and average scores on TopicalChat. We make the following observations for both tables: (1) The ReFeR framework performs better than the baselines in terms of Spearman correlation (ρ) and Kendall-Tau correlation (τ). We give a more detailed analysis of the framework’s capabilities in Section 4.4. (2) The best baseline is Analyze-Rate, as expected.

We chose to report main results without the (optional) critic module because as highlighted by Laban et al. (2024), existing LLMs often exhibit fluctuating stances under scrutiny regarding their response confidence, suggesting a propensity for opinion revision. This observation implies that, for optimal correlation scores, it may be advantageous to bypass the critic module until such foundational issues within LLMs are addressed, at which point its incorporation could yield further benefits. However, for completeness, we report results using multiple critic LLMs in Appendix D.1.

Further, prompt sensitivity is a fundamental constraint of LLMs (Sclar et al., 2024; Loya et al., 2023). A poorly constructed prompt can skew results, leading to outcomes that deviate from expected benchmarks. This observation is further validated by the findings of Chiang and Lee (2023) in the automated evaluation domain. When the prompt is not optimized, the results can be misleading and can lead researchers to conclude, wrongly, that a method is inaccurate. Conclusive results therefore require a well-crafted, manually engineered prompt. This led us to investigate which prompt should be used for each LLM agent. We discuss this in detail in the next subsection.

4.4 Ablations

Prompt Ablations

To identify the most effective prompt for a task, we designed prompts aimed at achieving the highest correlation. We employed the same three models used in our main experiments (GPT-3.5 Turbo, Gemini-Pro, Mixtral-8x7B) and conducted ablation studies on the SummEval dataset by varying the prompts. We utilized two different prompt schemas for this experiment: Analyze-Rate and Eval Guidelines. Table 3 presents the average Spearman score for all three models, broken down by metric and averaged over two runs. It is evident that Analyze-Rate and Eval Guidelines perform very similarly, with Analyze-Rate being marginally better by 0.0001. When observing the average ρ of all three peers, we see that the Eval Guidelines prompt works better for GPT-3.5 and Mixtral, while the Analyze-Rate prompt is better for Gemini. This leads to the important observation that the best-performing prompt for one model may not be the best for another.

The results in Table 3 might suggest that using the Eval Guidelines prompt for both the peers and the Area Chair would yield the best results. However, to verify this, we conducted further experiments by permuting the Analyze-Rate and Eval Guidelines prompts for peers and the Area Chair.

Prompt            Model           Coherence    Consistency  Fluency      Relevance    Average
Analyze-Rate      GPT-3.5         0.337        0.333        0.270        0.298        0.309
                  Mixtral         0.291        0.383        0.365        0.237        0.319
                  Gemini          0.362        0.323        0.220        0.228        0.283
                  Average Peers   0.330        0.346        0.285        0.255        0.3039
Eval Guidelines   GPT-3.5         0.452        0.278        0.328        0.336        0.348
                  Mixtral         0.308        0.311        0.350        0.366        0.334
                  Gemini          0.241        0.258        0.201        0.217        0.229
                  Average Peers   0.334        0.286        0.293        0.306        0.3038
Table 3: Prompt Ablation of Peers on SummEval.

Table 4 displays the outcomes when the prompts were permuted between the peers and the Area Chair. We found that the optimal combination was not using Analyze-Rate for both the peers and the Area Chair, but rather using Analyze-Rate for the peers and Eval Guidelines for the Area Chair on the SummEval dataset. Although the difference in average correlation across all four metrics is only 0.005, a closer examination of the metric-wise differences reveals that Row 2 outperforms Row 1 in all metrics except for the Relevance metric. This indicates that refining the Eval Guidelines prompt for the Relevance metric could yield even better results. As previously mentioned, identifying the optimal prompt is always challenging. Therefore, we did not further explore improvements to the Eval Guidelines prompt, as this is not the primary focus of our paper.

Peer Prompt       AC Prompt         Coh      Con      Flu      Rel      Avg
Analyze-Rate      Analyze-Rate      0.463    0.404    0.380    0.535    0.445
Analyze-Rate      Eval Guidelines   0.502    0.428    0.414    0.459    0.450
Eval Guidelines   Analyze-Rate      0.459    0.403    0.377    0.474    0.428
Eval Guidelines   Eval Guidelines   0.480    0.392    0.350    0.463    0.421
Table 4: Prompt Ablation by varying both peer prompt and area chair prompt on SummEval.

Our current conclusion is that prompt modification can enhance scores, and we leave the task of finding the best method for determining the optimal prompt to future research. Another important observation is that just because a prompt works best for a model does not guarantee that using the same prompt in a framework setting would yield better results. This suggests that users might need to perform experiments on their downstream applications to determine the best working prompt for their use case.

Performance Ablations

To understand the overall percentage gains from different parts of the framework, we conducted a performance ablation experiment where we added each component incrementally and observed the improvement in overall performance. Table 5 shows the results of this experiment. We start with the base model, i.e., a single peer (GPT-3.5 Turbo with n = 1). We compute the average Spearman correlation across the four metrics of SummEval (coherence, consistency, fluency, relevance) for all the models listed and then calculate their relative percentage gain with respect to the base model (Row 1). For the second row, we increased the hyperparameter to n = 20 and observed a gain of +21.29%, highlighting the importance of this hyperparameter. We suspect that this is the main reason behind the performance of G-Eval and Analyze-Rate. Then we use all three peers individually and take the average of the peers, resulting in a decline in performance (1.6%) compared to the base model.

Models                    Avg ρ    % Gain     Reason
GPT-3.5 (n=1)             0.309    -          -
GPT-3.5 (n=20)            0.375    +21.29%    n=20
Average of Peers          0.304    -1.603%    3 Peers (No AC)
AR (Peers + AC)           0.365    +18.05%    3 Peers + AC
AR (Peers + AC (n=20))    0.445    +44.20%    n=20 for AC
AR (Peers) + EG (AC)      0.450    +45.84%    Full Framework
Table 5: Performance Ablation on the SummEval dataset. AR stands for the Analyze-Rate prompt, EG stands for the Eval Guidelines prompt, and AC stands for Area Chair. n is the hyperparameter that sets how many responses the model generates for each prompt. Avg ρ is the average Spearman correlation across the 4 metrics of the SummEval dataset. % Gain is relative to the 1st row.

Next, we add the Area Chair to the framework (with n = 1 and the Analyze-Rate prompt for both Peers and Area Chair), and we see a percentage gain of +18.05% relative to the base model. This shows the importance of the Area Chair in reconciling all the evaluations and providing a better overall evaluation. We then set n = 20 for the Area Chair, which improves the gain to +44.2% relative to the base model. From Table 4, we see that Row 2 performs best, which tells us which prompts should finally be used in the framework to achieve maximum gain. This configuration is represented in the last row of Table 5, which shows the overall percentage gain our framework brings.
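The relative gains in Table 5 follow the usual formula, gain = 100 × (ρ_model − ρ_base) / ρ_base. The short sketch below recomputes them from the rounded correlations printed in the table; the dictionary and function names are introduced here for illustration only.

```python
BASE_RHO = 0.309  # GPT-3.5 (n=1), first row of Table 5

# Average Spearman correlations as printed (rounded) in Table 5.
AVG_RHO = {
    "GPT-3.5 (n=20)": 0.375,
    "Average of Peers": 0.304,
    "AR (Peers + AC)": 0.365,
    "AR (Peers + AC (n=20))": 0.445,
    "AR (Peers) + EG (AC)": 0.450,
}


def relative_gain(rho: float, base: float = BASE_RHO) -> float:
    """Percentage change of rho relative to the base model."""
    return 100.0 * (rho - base) / base


if __name__ == "__main__":
    for name, rho in AVG_RHO.items():
        print(f"{name:<26} {relative_gain(rho):+.2f}%")
```

Recomputed this way, the gains land close to, but not exactly on, the reported figures (for example, roughly +21.4% against the reported +21.29% for n=20). A plausible explanation, though not one stated in the paper, is that the reported gains were computed from unrounded correlations.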

5 Instruction-Tuning of Small LLMs using Area Chair Outputs

Utilizing the analysis produced by larger LLMs (“Area Chairs”), we enhance smaller LLMs through instruction-tuning, using a dataset crafted from comprehensive evaluations. We take the analysis feedback generated within the ReFeR framework and transform it into a resource for instruction tuning. This fine-tuning significantly improves smaller models’ performance, enabling them to approach, and on some dimensions surpass, their larger counterparts in evaluation tasks. We use Mistral-7B, since it is much cheaper to operate than GPT-3.5 Turbo. We used the instruction-tuning dataset (the final output of the Area Chair) of the ReFeR framework as the training data, holding out the same 200 test samples we used for the experiments on the SummEval dataset, i.e., the training data is 1350 samples, with 50 samples for development and the same 200 samples for testing. For the TopicalChat dataset, we used 45 samples as test data and the remaining samples as training and development data.
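The split described above can be sketched as follows. This is a hedged illustration, not the authors' data pipeline: the record fields (`id`, `prompt`, `area_chair_output`), the instruction/output format, and the random choice of development samples are all assumptions made here.

```python
import random
from typing import Dict, List, Set, Tuple


def to_record(sample: Dict) -> Dict[str, str]:
    """Turn one evaluated sample into an instruction-tuning pair (assumed format)."""
    return {"instruction": sample["prompt"], "output": sample["area_chair_output"]}


def split_dataset(samples: List[Dict], test_ids: Set[int], dev_size: int = 50,
                  seed: int = 0) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Hold out the fixed test samples, then carve a dev set out of the rest."""
    test = [s for s in samples if s["id"] in test_ids]
    rest = [s for s in samples if s["id"] not in test_ids]
    random.Random(seed).shuffle(rest)  # seeded so the split is reproducible
    dev, train = rest[:dev_size], rest[dev_size:]
    return ([to_record(s) for s in train],
            [to_record(s) for s in dev],
            [to_record(s) for s in test])


if __name__ == "__main__":
    # Synthetic stand-ins for the 1600 evaluated SummEval samples.
    samples = [{"id": i, "prompt": f"prompt {i}", "area_chair_output": f"analysis {i}"}
               for i in range(1600)]
    train, dev, test = split_dataset(samples, test_ids=set(range(200)))
    print(len(train), len(dev), len(test))  # 1350 50 200
```

Fixing the test identifiers up front, rather than sampling them, keeps the fine-tuned model's test set identical to the 200 samples used in the main experiments.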

Models                          Coh      Con      Flu      Rel      Avg
Mistral-7B-non-finetuned        0.284    0.210    0.158    0.240    0.223
GPT-3.5 (n=1)                   0.357    0.363    0.237    0.279    0.309
Mistral-7B-finetuned (ReFeR)    0.372    0.255    0.289    0.258    0.293
Table 6: Performance Comparison of Finetuned vs Non-Finetuned Models (Spearman Correlation (ρ) metric) on the entire 200 test samples of the SummEval dataset.

Models                          Coh      Eng      Gro      Nat      Avg
Mistral-7B-non-finetuned        0.136    0.205    0.086    0.087    0.128
GPT-3.5 (n=1)                   0.437    0.531    0.497    0.544    0.502
Mistral-7B-finetuned (ReFeR)    0.544    0.389    0.287    0.385    0.401
Table 7: Performance Comparison of Finetuned vs Non-Finetuned Models (Spearman Correlation (ρ) metric) on 45 samples of the TopicalChat dataset.

This method, by integrating detailed feedback into training, enables smaller models to perform detailed, accurate assessments, demonstrating the efficacy and potential of feedback-driven instructional tuning in NLG evaluation. A case study example is given in Appendix F.

The comparative analysis presented in Tables 6 and 7 establishes the enhanced performance of the fine-tuned model. Notably, the fine-tuned Mistral-7B model achieves an average Spearman correlation of 0.293 and 0.401 for SummEval and TopicalChat respectively, surpassing its non-fine-tuned counterpart, which stands at 0.223 and 0.128. These results not only underscore the effectiveness of fine-tuning in augmenting model performance but also illustrate the competitive edge that fine-tuned, smaller models gain against the considerably larger and more resource-intensive GPT-3.5 Turbo model. Mistral-7B-Finetuned comes close to GPT-3.5 despite GPT-3.5’s 25-fold larger size and associated operational costs. This finding demonstrates the computational efficiency and cost-effectiveness of employing smaller models enhanced through feedback-driven fine-tuning in lieu of relying on larger models for comparable levels of evaluative accuracy.

6 Collective Reasoning through ReFeR

An important question related to the deployment of our proposed framework is whether it can amplify collective reasoning beyond the capabilities of the individual agents. This question is especially important in the context of complex problem-solving and evaluative tasks. To systematically explore this aspect, we aligned our investigation with the methodology outlined by Chen et al. (2024), focusing our analysis on reasoning performance on the Grade School Math 8K (GSM8K) dataset (Cobbe et al., 2021) testing math reasoning, StrategyQA (Geva et al., 2021) testing deductive reasoning, and CSQA (Talmor et al., 2019) testing commonsense reasoning.

In alignment with our evaluation schema, we crafted prompts that mirrored the structure and objectives of the framework’s intended application. In the first step, each peer agent generates an answer and the corresponding reasoning, which are then consolidated and presented to an area chair. In the second step, the area chair formulates a final response to the reasoning problem. Sample reasoning prompts used with the framework are given in Appendix I.

The outcomes of this experiment are documented in Table 8. Results of ReConCile are sourced from Chen et al. (2024), who used GPT-3.5 Turbo, Claude-2 and Bard. A direct comparison between our framework and ReConCile is not feasible due to the use of distinct peer groups and different test samples in each framework. Through this empirical assessment, we sought to quantify the extent to which the collective intelligence harnessed through our framework surpasses the reasoning limitations of individual agents. Some examples of how our framework improves on the reasoning of each peer are given in Appendix G.

Models        GSM8k    StrategyQA    CSQA
GPT-3.5       71       70            72
Mixtral       74       77            71
Gemini-Pro    77       71            73
ReConCile     85       79            75
GPT-4         95       72            78
ReFeR         87       81            80
Table 8: Accuracy (%) for the reasoning tasks across the GSM8k, StrategyQA and CSQA benchmarks. Following ReConCile, we also report our numbers on 100 test samples for each dataset.

We assess the collective reasoning proficiency of the framework against that of individual peer models and the current state-of-the-art reasoning framework, ReConCile (Chen et al., 2024). Despite not being primarily designed to augment reasoning capabilities, our proposed ReFeR framework matches or surpasses the performance of ReConCile on all three datasets, achieving a collective reasoning score of 87.0 compared to ReConCile’s average of 85.3 on GSM8K, 81.0 in comparison to 75.6 on StrategyQA, and 80 in comparison to 74.7 on CSQA. Its proximity to GPT-4’s score of 95.0 on GSM8K, together with its surpassing GPT-4 on StrategyQA and CSQA, not only highlights ReFeR’s competitive reasoning ability but also illustrates its cost-effectiveness relative to the significantly higher computational and financial demands of operating GPT-4 (OpenAI, 2024).
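The headline comparison against GPT-3.5 Turbo and GPT-4 can be checked directly from Table 8. The sketch below averages each model's accuracy over the three benchmarks and takes the difference; it assumes, as the numbers suggest, that the reported margins are differences in mean accuracy, i.e., percentage points rather than relative percentages. The names are illustrative.

```python
from typing import Dict, Tuple

# Accuracy (%) on GSM8k, StrategyQA, CSQA, copied from Table 8.
ACCURACY: Dict[str, Tuple[int, int, int]] = {
    "GPT-3.5": (71, 70, 72),
    "GPT-4": (95, 72, 78),
    "ReFeR": (87, 81, 80),
}


def mean_accuracy(model: str) -> float:
    scores = ACCURACY[model]
    return sum(scores) / len(scores)


def margin_over(baseline: str, system: str = "ReFeR") -> float:
    """Difference in mean accuracy, in percentage points."""
    return mean_accuracy(system) - mean_accuracy(baseline)


if __name__ == "__main__":
    print(f"ReFeR mean accuracy: {mean_accuracy('ReFeR'):.2f}")
    print(f"margin over GPT-3.5: {margin_over('GPT-3.5'):.2f}")
    print(f"margin over GPT-4:   {margin_over('GPT-4'):.2f}")
```

Under this reading, the margins come out at about 11.67 points over GPT-3.5 and about 1 point over GPT-4, the latter being driven by StrategyQA and CSQA, since GPT-4 remains ahead on GSM8k.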

7 Related Work

Evaluation using LLMs. Fu et al. (2023) propose GPTScore, a framework that evaluates texts with generative pre-training models like GPT-3, assuming that a generative pre-training model will assign a higher probability to high-quality generated text following a given instruction and context. Wang et al. (2023) conducted a preliminary survey of using ChatGPT as an NLG evaluator. Kocmi and Federmann (2023) proposed to use GPT models for evaluating machine translation tasks. Hada et al. (2023) investigate whether LLM-based evaluators can help scale up multilingual evaluation. Liu et al. (2023b) introduced G-Eval, a novel framework using large language models through a chain-of-thoughts (CoT) approach combined with a form-filling methodology to evaluate natural language generation outputs. Chiang and Lee (2023) subsequently demonstrated that G-Eval’s implementation of an automated CoT does not consistently align with human evaluations. Furthermore, they highlighted the limitations of restricting LLMs to solely numeric evaluations within G-Eval, prompting our investigation into generating evaluative outputs that include both critical commentary and numerical ratings. Jiang et al. (2023) introduced TIGERScore, a metric designed to offer detailed error analysis for identifying specific inaccuracies within generated texts, moving beyond mere scoring. This metric is underpinned by the use of Llama-2, which was fine-tuned on a proprietary dataset. Shu et al. (2024) introduced FusionEval, a novel evaluation framework that leverages auxiliary evaluators such as NLI, BLEURT, and SBLEURT to analyze questions for assessment. This analysis is then conveyed to a primary large language model, which assigns the final score. Liu et al. (2023a) developed X-Eval, a two-stage instruction tuning framework designed to evaluate texts across both familiar and novel dimensions, tailored to the specific needs of end-users. Chan et al. (2023) proposed ChatEval, a multi-agent referee system that employs autonomous debating among the same agents with different personas to evaluate the quality of generated responses. While their approach shares similarities with our work, it primarily relies on a debate and discussion methodology utilizing the same models under varied personas. In contrast, our method employs diverse models acting as peers and area chairs and incorporates a significantly richer evaluation schema.

Reasoning using Multiple LLMs as Peers. Chen et al. (2024) unveiled ReConcile, a multi-model, multi-agent framework structured akin to a round table conference among various LLM agents. Their findings suggest that LLMs exhibit enhanced reasoning capabilities when engaging in discussions and reaching consensus. Xu et al. (2023) introduced a novel framework aimed at augmenting reasoning abilities, drawing inspiration from the academic peer review process. This approach emphasizes iterative improvement through feedback from peer evaluations, distinguishing it from our methodology, which does not facilitate direct communication between peers but instead involves an area chair reviewing all peer responses. Pham et al. (2023) advocated for the use of embeddings as a communication medium within multi-agent frameworks to optimize reasoning. Conversely, Du et al. (2023) focus on using solutions from other peers to enhance an individual’s reasoning, employing a repetitive improvement cycle. Lastly, Wang et al. (2024) propose a strategy for selecting the most coherent response from multiple reasoning chains, offering a different perspective on achieving consensus and enhancing reasoning accuracy.

8 Conclusion

We introduce ReFeR, an NLG evaluation framework inspired by the open-review academic peer review process, which also improves collective reasoning ability. Our framework comprises three stages of evaluation. First, a peer review body of three peers independently assesses the NLG text. This is followed by an optional critic phase, in which a single model reassesses the initial reviews. Finally, area chairs use the three peer reviews to deliver a conclusive score and a detailed analysis for enhancement. This analysis serves as a foundation for creating an instruction-tuning dataset, useful in elevating the performance of smaller LLMs to match that of larger LLM evaluators. Our empirical analysis shows that the framework achieves a 20% average improvement in Spearman correlation on the SummEval dataset and a 3% average improvement on the TopicalChat dataset compared to the best baseline. Using the derived instruction-tuning dataset to fine-tune Mistral-7B led to 31% and 213% increases in average Spearman correlation on the SummEval and TopicalChat datasets, respectively, a significant leap over the non-fine-tuned model. Our framework also introduces an evaluation schema incorporating two modules: Evaluation Guidelines, which provide clear scoring directives to LLMs, and an Explanation section for articulating the rationale behind assigned scores. We conduct detailed ablation experiments that identify the performance gains contributed by different parts of the framework. Despite not being primarily designed to augment reasoning capabilities, our framework surpasses the ReConcile reasoning framework, with accuracies of 87% on GSM8K, 80% on CSQA, and 81% on StrategyQA. The framework’s collective reasoning also outperforms models like GPT-3.5 Turbo by ∼11.67% and GPT-4 by ∼1% on average.

9 Limitations

This work explores the ReFeR framework with only three peer agents, one critic, and one Area Chair. Due to computation and time limitations, we did not explore the effects of having more peers and area chairs. We also explore only a limited number of LLMs, such as GPT-3.5 Turbo, Gemini-Pro, and Mixtral-8x7B. Future work can explore other LLMs, including open-source LLMs like Llama-3. We also limited our experiments to two English datasets; extending the evaluation to multilingual settings would be interesting. Regarding the reasoning benchmark experiments, future work could compare ReFeR with other SOTA methods apart from ReConcile. One of the most important problems with this type of evaluation using LLM agents is prompt sensitivity. A bad prompt reduces performance drastically, and it is difficult to find an optimized prompt for any new task. We did not optimize our prompts using state-of-the-art techniques and leave prompt optimization for future work. Another vital direction we did not focus on is the method of communication between peers and Area Chairs, i.e., how LLM agents can communicate effectively with one another.

10 Ethics Statement

This work adheres to ethical guidelines for research and publishing. All data utilized has been sourced responsibly, with proper acknowledgment of existing works and adherence to the fair use policy. No part of this research involves deception, coercion, or any form of unethical treatment of subjects. We do not foresee any ethical problems in using our system, except where a user tries to manipulate the automatic evaluation so that responses are evaluated in their favor.

11 Potential Risks

There is a potential risk of misuse, where the framework could be employed to generate misleading or manipulative evaluations. Ensuring ethical usage of the framework is important to prevent scenarios where the generated feedback and evaluations are used to deceive or unfairly influence decisions. Because LLMs are continually updated, maintaining consistency in evaluations over time is also at risk. Finally, one of the primary risks is the inherent bias in the LLMs used as evaluators.

Acknowledgments

We would really like to thank Microsoft Research for their Accelerating Foundation Models Research initiative for generously funding our entire research and helping us with the GPUs and API credits. We would also like to thank Ganesh Jawahar for helping us with brainstorming the problem statement initially.

References

Appendix A Scoring Function

In their seminal work, Liu et al. (2023b) broached the subject of a post-evaluation scoring function designed to alleviate inherent biases and discrepancies within scoring mechanisms. However, the intricacies and the practical application of this scoring function remained undisclosed, echoing the reservations posited by Chiang and Lee (2023). In our approach, we similarly refrain from integrating an unspecified scoring function into our schema. This decision stems from the aspiration to ensure clarity and reproducibility in our methodology. Despite this, the potential benefits of incorporating a scoring function should not be underestimated, particularly in addressing two significant challenges highlighted by Liu et al. (2023b). The first is the propensity of scoring outcomes to gravitate towards a dominant value, which yields low variability and a diminished correlation with human assessments. The second is that Large Language Models (LLMs) generate only integer values for scores, which precludes fractional evaluations and leads to a proliferation of ties that mask the nuanced differences among Natural Language Generation (NLG) outputs.

To confront these challenges, both Liu et al. (2023b) and Chiang and Lee (2023) have explored the utilization of the “n” parameter in LLMs, notably OpenAI’s GPT-3.5. This parameter, which dictates the quantity of generated outputs per given prompt, serves as a cornerstone in their strategy to yield decimal scores. By calculating the average of these multiple outputs, they endeavored to engender a scoring system characterized by enhanced variance and distribution more closely aligned with human evaluative patterns. Although Liu et al. (2023b) alluded to the employment of log probabilities within their scoring function, their implementation primarily leveraged a straightforward averaging mechanism. This discrepancy is presumed to arise from the unavailability of a log probabilities functionality in versions of GPT-3.5-turbo and subsequent iterations. In light of this limitation, our framework adopts a simplistic averaging approach whenever the “n” value exceeds unity, thereby ensuring consistency and uniformity in our evaluative processes.
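The averaging step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `averaged_score` and the sample ratings are hypothetical, and the ratings are assumed to have already been parsed from the n sampled responses.

```python
def averaged_score(sampled_ratings):
    """Return the mean of the ratings parsed from n sampled responses."""
    if not sampled_ratings:
        raise ValueError("at least one sampled rating is required")
    return sum(sampled_ratings) / len(sampled_ratings)


# Hypothetical ratings for two NLG outputs, with n = 4 samples each.
# A single integer sample could tie both outputs at 2; the averages separate them.
output_a = averaged_score([2, 2, 3, 2])  # 2.25
output_b = averaged_score([2, 3, 3, 3])  # 2.75
```

Even a small n turns integer ratings into fractional scores, which is what reduces the number of ties between outputs.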

Furthermore, the application of log probability within the scoring function emerges as a feasible approach solely under the condition that the evaluation conforms to the methodology outlined in (Liu et al., 2023b), focusing exclusively on the generation of scores. This technique is predicated on the calculation of probabilities associated with the generation of specific outputs, offering a nuanced metric for evaluation. However, this method’s relevance diminishes when the evaluative process extends beyond mere scoring to encompass reasoning or the generation of critical commentary prior to the assignment of a score. In such contexts, where evaluative narratives or qualitative feedback precede quantitative scoring, the direct application of log probabilities becomes less pertinent. The essence of incorporating critical commentary or explanatory feedback is to shed light on the rationale behind the score, thus providing a comprehensive understanding of the evaluated output’s strengths and weaknesses. In these scenarios, the scoring mechanism necessitates a more adaptable and interpretive approach, one that transcends the straightforward application of mathematical probabilities and ventures into the realm of qualitative assessment. Consequently, while log probabilities offer a rigorous and mathematically grounded method for score calculation in certain instances, their utility is contextually bound and may not align with evaluative frameworks that prioritize explanatory or critical analysis alongside numerical scoring.
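For illustration, one common way to turn log probabilities into a score is to take the expected rating under the model's distribution over candidate score tokens. The sketch below assumes that formulation; the text above does not specify the exact function, and the function name, the renormalization step, and the example log probabilities are all hypothetical.

```python
import math


def expected_score(score_logprobs):
    """Expected rating under the (renormalized) distribution over score tokens.

    score_logprobs maps each candidate rating (e.g. 1..5) to the log probability
    the model assigned to emitting that rating as the next token.
    """
    probs = {score: math.exp(lp) for score, lp in score_logprobs.items()}
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("no probability mass on any candidate rating")
    # Renormalize so that mass on non-score tokens does not shrink the result.
    return sum(score * p / total for score, p in probs.items())


# Hypothetical log probabilities: the mass is split between ratings 2 and 3.
example = expected_score({2: math.log(0.6), 3: math.log(0.4)})  # close to 2.4
```

This only works cleanly when the rating is the first thing the model emits. Once an analysis is generated before the rating, the score-token probabilities are conditioned on that particular sampled analysis, which is why averaging over n sampled responses is the simpler choice in that setting.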

Appendix B Hyperparameters

Regarding the selection of hyperparameters for LLM agents, we adhered to default settings with exceptions for ‘n’ and ‘temperature’. Echoing findings from Chiang and Lee (2023), we set the temperature to 1 across all tasks to optimize NLG task evaluations. The ‘n’ parameter, dictating the number of responses generated per prompt, played a crucial role in our methodology. Following the precedent set by Liu et al. (2023b), who utilized n = 20 to average out scores from multiple responses, we explored the impact of varying ‘n’ on evaluation outcomes. Preliminary experiments demonstrate the influence of higher ‘n’ values on achieving more representative scores. All experiments were conducted on an A100 (80GB) GPU server.

Appendix C Is this a General Purpose Framework?

The ReFeR framework shows its efficacy for NLG assessment, utilizing Large Language Models (LLMs) as its cornerstone evaluative agents. The framework encourages a shift towards a more nuanced examination of NLG outputs, fostering a structured approach that emphasizes review, feedback, and reasoning processes. However, the framework is modality independent by default, and its applicability can be extended beyond text to a broader spectrum of data modalities by using the capabilities of Multi-Modal Foundation Models (FMs) in these domains (Li et al., 2023). Also, with the increased use of external knowledge and tools in conjunction with LLMs (Schick et al., 2023; Patil et al., 2023), these can be used for improved peer evaluation and further feedback and reasoning. Such components can be added modularly without any change to the framework, just as we experiment with different peers.

The review, feedback, and reasoning modules make the ReFeR framework useful even beyond the evaluation of NLG content, making it an effective generator of instruction-tuning data for fine-tuning smaller models and an effective reasoning module for complex tasks.

Due to constraints of computing and time, we could not verify these claims using multimodal LLMs and external tools. This remains a promising direction to extend our work. This adaptability and easy extensibility not only broaden the framework’s applicability across diverse AI outputs but also highlight its potential to evolve alongside technological progress in the field of generative models. Thus, the ReFeR framework reflects the ongoing evolution in the evaluation of multimedia content, offering nuanced and multidimensional assessments that reflect the complexity and diversity of modern AI-generated outputs.

Appendix D Critic Communication Strategies and Discussion

D.1 Our Proposed Communication Strategies

The Critic module serves as a crucial second step in our ReFeR framework. This module operates by submitting the initial evaluations, comprising both scores and commentary generated by peer agents, to a distinct critic LLM agent. This agent then decides whether the task at hand needs to be re-evaluated. Significantly, the critic LLM’s feedback is designed to enable peer agents to refine and enhance their evaluations. In some strategies, the critic LLM is instead used to give feedback on the peer reviews to the Area Chair. Consequently, this process gives rise to a fundamental question: How can effective communication between peer LLM agents and the critic LLM agent be established?

| Model | Coherence ρ | Coherence τ | Consistency ρ | Consistency τ | Fluency ρ | Fluency τ | Relevance ρ | Relevance τ | Average ρ | Average τ |
|---|---|---|---|---|---|---|---|---|---|---|
| ReFeR (Ours), No Critic Phase | 0.502 | 0.374 | **0.428** | **0.370** | **0.414** | **0.330** | **0.459** | **0.338** | **0.450** | **0.353** |
| Individual Peer Eval | 0.500 | **0.381** | 0.331 | 0.305 | 0.393 | 0.309 | 0.433 | 0.306 | 0.414 | 0.326 |
| Collective Peer Eval | 0.505 | _0.377_ | 0.381 | 0.321 | 0.390 | _0.311_ | 0.419 | 0.307 | 0.424 | _0.329_ |
| Weighted Feedback | **0.510** | 0.376 | 0.346 | 0.288 | 0.375 | 0.300 | 0.435 | _0.328_ | 0.417 | 0.323 |
| Critic Comment Feedback | 0.425 | 0.309 | 0.332 | 0.270 | 0.294 | 0.233 | 0.328 | 0.248 | 0.345 | 0.265 |

Table 9: Spearman Correlation (ρ) and Kendall-Tau Correlation (τ) on SummEval. The best and second-best per column have been highlighted in bold and underlined, respectively.

| Model | Coherence ρ | Coherence τ | Engagingness ρ | Engagingness τ | Groundedness ρ | Groundedness τ | Naturalness ρ | Naturalness τ | Average ρ | Average τ |
|---|---|---|---|---|---|---|---|---|---|---|
| ReFeR (Ours), No Critic Phase | 0.514 | 0.390 | **0.651** | **0.502** | **0.678** | **0.590** | 0.544 | 0.414 | **0.597** | **0.474** |
| Individual Peer Eval | _0.527_ | _0.400_ | _0.650_ | _0.499_ | 0.628 | 0.546 | **0.556** | **0.426** | 0.590 | 0.468 |
| Collective Peer Eval | _0.527_ | 0.399 | 0.643 | 0.495 | _0.659_ | _0.571_ | 0.536 | 0.414 | _0.591_ | _0.470_ |
| Weighted Feedback | **0.535** | **0.403** | 0.635 | 0.483 | 0.638 | 0.553 | _0.546_ | _0.417_ | 0.588 | 0.464 |
| Critic Comment Feedback | 0.481 | 0.369 | 0.507 | 0.383 | 0.526 | 0.454 | 0.437 | 0.326 | 0.488 | 0.383 |

Table 10: Spearman Correlation (ρ) and Kendall-Tau Correlation (τ) on TopicalChat. The best and second-best per column have been highlighted in bold and underlined, respectively.
Figure 3: Critic Communication Strategies

Hence, we propose multiple communication strategies as follows:

(1) Individual Peer Evaluation: In this strategy, the peer evaluations are given to the critic module separately. The critic agent is tasked with assessing these responses individually, determining the need for re-evaluation, and suggesting enhancements to bolster the evaluation process. Using this feedback from the critic, the peer LLMs do a re-evaluation of the sample.

(2) Collective Peer Evaluation: This strategy explores the efficacy of submitting all peer evaluations to the critic simultaneously. The goal is to ascertain how such an aggregate submission impacts the critique process. Feedback from the critic is structured in a JSON format, ensuring clear delineation of comments for each peer and facilitating targeted improvements. This strategy significantly reduces the number of calls made to the critic. Like the previous strategy, it is a re-evaluation strategy.

(3) Weighted Feedback: Here, the critic is requested to assign weights to each peer’s evaluation. These weighted assessments are intended for subsequent review by an area chair, offering a nuanced perspective on the relative merits of each evaluation. In this strategy, the weights are directly relayed to the Area Chair as feedback along with the peer evaluations.

(4) Critic Feedback to Area Chair (listed as Critic Comment Feedback in Tables 9 and 10): This strategy transmits the critic’s commentary on the peer evaluations directly to the area chairs, with no re-evaluation. This approach prioritizes the provision of expert insight at the decision-making apex of the evaluative hierarchy.
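For the Collective Peer Evaluation strategy, the routing of critic feedback back to the peers might look like the sketch below. It is an illustration under stated assumptions: the critic is assumed to return valid JSON whose "evaluators" field is a list, the field names follow the critic prompt in Appendix H, and the function name and example comments are hypothetical.

```python
import json


def peers_to_reevaluate(critic_output):
    """Map each peer flagged for re-evaluation to the critic comment it should receive."""
    feedback = json.loads(critic_output)
    return {
        entry["evaluator"]: entry["critic_comment"]
        for entry in feedback["evaluators"]
        if entry["re_evaluation"].strip().lower() == "yes"
    }


# Hypothetical critic output covering three peers in a single call.
critic_output = json.dumps({
    "evaluators": [
        {"evaluator": "Assistant 1", "critic_comment": "Justify the rating.", "re_evaluation": "Yes"},
        {"evaluator": "Assistant 2", "critic_comment": "Evaluation is sound.", "re_evaluation": "No"},
        {"evaluator": "Assistant 3", "critic_comment": "Too harsh on structure.", "re_evaluation": "Yes"},
    ]
})
flagged = peers_to_reevaluate(critic_output)  # only Assistant 1 and Assistant 3 re-evaluate
```

Here one critic call serves all three peers, whereas the Individual Peer Evaluation strategy would issue one critic call per peer.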

D.2 Discussion about the Communication Strategies

Table 9 compares the communication strategies on coherence, consistency, fluency, and relevance on SummEval. We make the following observations: (1) ReFeR with No Critic provides the best average ρ and τ. (2) In terms of average ρ and τ, the Collective Peer Evaluation strategy is the second best. Individual Peer Evaluation is worse than the collective strategy, even though the critic gives feedback to each peer individually and it incurs three times the cost of its collective counterpart. (3) Strategies like Critic Comment Feedback (where the critic gives comments on the peer evaluations to the Area Chair) give the model too much information through the prompt, which reduces its performance significantly.

Table 10 compares the communication strategies on aspects of conversation quality (coherence, engagingness, groundedness, and naturalness) and on average scores on TopicalChat. We make the following observations: (1) ReFeR with No Critic provides the best average ρ and τ. (2) In terms of ρ and τ, ReFeR with No Critic is the best for engagingness and groundedness, but the Weighted Feedback strategy is the best for coherence and Individual Peer Eval is the best for naturalness. (3) ReFeR with Critic Comment Feedback (direct critic feedback to the Area Chair) is the worst-performing variant, since it obtains the lowest ρ and τ values among the ReFeR variants.
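For readers unfamiliar with the two statistics, the sketch below computes a Spearman ρ and a Kendall τ between human and model scores in pure Python. It is an illustration only: the tie handling (average ranks, τ-b) is an assumption rather than a description of the authors' evaluation code, and the score lists are made up.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b: concordant minus discordant pairs, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + ties_x) * (pairs + ties_y))


# Made-up human and model scores for four outputs; only one pair is ordered differently.
human = [2.0, 1.0, 4.5, 3.0]
model = [2.5, 1.5, 3.5, 4.0]
rho = spearman_rho(human, model)   # 0.8
tau = kendall_tau_b(human, model)  # 4 / 6, about 0.667
```

Both statistics depend only on the ordering of the scores, so an evaluator is rewarded for ranking outputs the way humans do rather than for matching the absolute ratings.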

Appendix E Case Study of Evaluation Using Our Framework


Article: Paul Merson has restarted his row with Andros Townsend after the Tottenham midfielder was brought on with only seven minutes remaining in his team’s 0-0 draw with Burnley on Sunday…

Summary: Paul Merson has restarted his row with Andros Townsend. The Tottenham midfielder was brought on with only seven minutes remaining in his team’s 0-0 draw with Burnley. Andros Townsend scores England’s equaliser in their 1-1 friendly draw with Italy in Turin.

Human Annotated Scores:

  • Coherence: 2.33

Peer 1’s Evaluation:

Analysis: The summary includes the main topic and key points of the news article, but the sentences are not well-structured or organized. It seems to be a combination of different sentences from the source text without a clear flow or logical order.
Rating: 2

Peer 2’s Evaluation:

Analysis: The summary lacks a clear and logical order in presenting the information, and the connection between the two events (Townsend’s substitution and his goal for England) is not explicitly stated.
Rating: 3.0.

Peer 3’s Evaluation:

Analysis: The summary lacks proper structure and does not provide clear transitions between sentences. The sentences appear to be a collection of disjointed facts without a coherent narrative.
Rating: 1

AC’s Evaluation:

[ Analysis: The summary accurately covers some key points from the article, including the restart of the row between Paul Merson and Andros Townsend and Townsend’s goal for England against Italy. However, it fails to mention the specific incident with Townsend being brought on late in the game against Burnley. The summary lacks a clear structure and logical flow, as it jumps from one point to another without a cohesive progression. Rating: 2.5,
… (19 more responses like this) ]

Final rating: 2.35
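The responses in this case study follow an "Analysis: … Rating: …" layout, so the numeric rating has to be extracted from free text before any averaging. A minimal sketch of that extraction is shown below; the regular expression and function name are assumptions for illustration, and the analysis text in the example strings is abbreviated.

```python
import re

# Matches an integer or decimal number that follows the literal "Rating:".
RATING_PATTERN = re.compile(r"Rating:\s*(\d+(?:\.\d+)?)")


def parse_rating(response):
    """Return the first numeric rating after 'Rating:', or None if there is none."""
    match = RATING_PATTERN.search(response)
    return float(match.group(1)) if match else None


# Abbreviated versions of the three peer responses above.
peer_responses = [
    "Analysis: (abbreviated)\nRating: 2",
    "Analysis: (abbreviated)\nRating: 3.0.",
    "Analysis: (abbreviated)\nRating: 1",
]
ratings = [parse_rating(r) for r in peer_responses]  # [2.0, 3.0, 1.0]
```

Returning `None` for responses without a rating lets the caller decide whether to drop or resample them instead of silently treating them as a score.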

Appendix F Case Study for Instruction Tuning Experiments


Article: Paul Merson has restarted his row with Andros Townsend after the Tottenham midfielder was brought on with only seven minutes remaining in his team’s 0-0 draw with Burnley on Sunday…

Summary: Paul merson was brought on with only seven minutes remaining in his team ’s 0-0 draw with burnley . Andros townsend scored the tottenham midfielder in the 89th minute . Paul merson had another dig at andros townsend after his appearance . The midfielder had been brought on to the england squad last week . Click here for all the latest arsenal news news .,

Human Annotated Scores:

  • Coherence: 1.33

Mistral-7B’s Evaluation (Non Finetuned):

Analysis: The given coherence example is about a summary of a news article. The summary is meant to be a concise summary that covers the main points of the news article. If each sentence of the summary is relevant to the main topic and key points mentioned in the news article, then the summary can have a high coherence rating. Rating: 4

Mistral-7B’s Evaluation (Finetuned):

Analysis: The summary seems to lack coherence as it is disjointed and does not link one idea into the next effectively. The central idea about Merson criticizing Townsend and Townsend scoring for England get mixed up in the middle, lacking a logical flow of information. Furthermore, the sentence “The midfielder had been brought on to the England squad last week” seems out of place and does not correlate well with the rest of the content, hence, compromising overall coherence. Rating: 1.4

Appendix G Example of Reasoning improved by Framework


Example from GSM8k:

QUESTION: Solo has to read 4 pages from his Science textbook, 20 pages from his Social Studies textbook, 7 pages from his History textbook and 8 pages from his Geography textbook. Solo read 15 pages on Monday. If he has 4 more days to complete all his reading, how many pages does he need to read, on average, in one day?

GOLD ANSWER: 6 pages

PEER 1’S RESPONSE: To find the total number of pages Solo needs to read, we add up the pages from each textbook: 4 (Science) + 20 (Social Studies) + 7 (History) + 8 (Geography) = 39 pages. Solo needs to read, on average, 9.75 pages per day.

PEER 2’S RESPONSE: Total pages to read: 4 (Science) + 20 (Social Studies) + 7 (History) + 8 (Geography) = 40 pages. After subtracting 15 pages already read, Solo needs to read 25 more pages. He needs to read an average of 6.25 pages per day.

PEER 3’S RESPONSE: Total pages to read: 4 (Science) + 20 (Social Studies) + 7 (History) + 8 (Geography) = 39 pages. Solo needs to read, on average, 9.75 pages per day.

AREA CHAIR’S RESPONSE: Total pages to read: 4 + 20 + 7 + 8 = 39 pages. After subtracting 15 pages already read, Solo needs to read 24 more pages. He needs to read an average of 6 pages per day. Peer 1 and Peer 3 provided the correct calculations but made a mistake in their final answers by not subtracting the number of pages Solo read on Monday from the total number of pages. They both stated that Solo needs to read, on average, 9.75 pages per day, which is incorrect. Peer 2 made a calculation mistake. My solution corrects these errors and provides the accurate calculation for the average number of pages Solo needs to read per day.
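The arithmetic in this example can be checked directly. The short sketch below reproduces the gold answer and the two kinds of peer error; the function name is hypothetical and the numbers come from the question and responses above.

```python
def pages_per_day(pages_by_subject, already_read, days_left):
    """Average pages per day needed to finish the remaining reading."""
    total = sum(pages_by_subject.values())
    return (total - already_read) / days_left


pages = {"Science": 4, "Social Studies": 20, "History": 7, "Geography": 8}

correct = pages_per_day(pages, already_read=15, days_left=4)        # (39 - 15) / 4 = 6.0
no_subtraction = pages_per_day(pages, already_read=0, days_left=4)  # 39 / 4 = 9.75 (Peers 1 and 3)
wrong_total = (40 - 15) / 4                                         # 6.25 (Peer 2 summed to 40)
```

The area chair's answer matches `correct`, while each peer's answer corresponds to one identifiable slip: Peers 1 and 3 skipped the subtraction, and Peer 2 mis-added the total.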

Appendix H Sample Prompts for SummEval dataset

Peer Prompt (Analyze Rate): You will be given one summary written for a news article. Your task is to rate the summary on one metric. Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed. Evaluation Criteria: Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic." Evaluation Steps: 1. Read the news article carefully and identify the main topic and key points. 2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order. 3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria. Example: Source Text: {{Full Article}} Summary: {{Summary of Article}} Evaluation Form (Answer by starting with "Analysis:" to analyze the given example regarding the evaluation criteria as concise as possible, and then give the numeric rating on the next line by "Rating:): - Coherence:

Peer Prompt (Eval Guidelines): You will be given one summary written for a news article. Your task is to rate the summary on one metric. Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed. Evaluation Criteria: Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic." Evaluation Steps: 1. Read the news article carefully and identify the main topic and key points. 2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order. 3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria. 4. Scoring Guidelines: Score = 5: The summary fully captures all key points of the article with an accurate and logical flow, without any significant omissions or irrelevant information. 4 ≤ Score < 5: Most key points are included with a generally logical sequence, albeit with minor omissions or slight inclusions of less relevant information. 3 ≤ Score < 4: Some key points are present, but others are missing, and the flow has noticeable gaps or jumps, including some irrelevant details. 2 ≤ Score < 3: Several key points are missed, and the flow is disjointed with significant omissions or inaccuracies, and noticeable irrelevant content. 1 ≤ Score < 2: Fails to represent the article accurately, lacks coherence and logical flow, with major elements missing or misrepresented, and significant irrelevant details. Example: Source Text: {{Full Article}} Summary: {{Summary of Article}} Evaluation Form (Answer by starting with "Analysis:" to analyze the given example regarding the evaluation criteria as concise as possible, and then give the numeric rating on the next line by "Rating:): - Coherence:

Peer Prompt (Re-Evaluation): You will be given one summary written for a news article. Your task is to re-evaluate the summary based on your previous evaluation, which will also be provided. Please consider the Critic Comment on your intial evalution when re-evaluating. Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed. Evaluation Criteria: Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic." Evaluation Steps: 1. Read the news article carefully and identify the main topic and key points. 2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order. 3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria. Example: Source Text: {{Full Article}} Summary: {{Summary of Article}} Initial Evaluation: {{initial_eval}} Critic Comment: {{Critic Response}} Re-Evaluation Instructions: 1. Analysis: Based on the critic’s comment provided, re-evaluate the summary for coherence. 2. Rating: Provide a numeric rating for coherence based on your revised evaluation.

Critic Prompt: (For Individual Peer Eval Strategy) You will be provided with a news article summary and the initial evaluation from a large language model (LLM), referred to as the assistant’s evaluation. The assistant’s evaluation includes a brief analysis by the assistant and a rating given by the assistant. Your task is to correct one aspect of the assistant’s evaluation based on a specific metric and provide feedback to the LLM in the form of a critic comment. Additionally, you need to determine whether there is a need for re-evaluation. Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed. Evaluation Criteria: Coherence (1-5) - The collective quality of all sentences. The summary should be well-structured and well-organized, not just a heap of related information, but building from sentence to a coherent body of information about a topic. Example: Source Text: {{Full Article}} Summary: {{Summary of Article}} Assistant’s Evaluation: {{Peer Response}} Evaluation Form: -Critic Comment: Provide concise feedback to the assistant regarding the evaluation. -Re-Evaluation: Yes/No, based on whether you believe there is a need for re-evaluation.


Critic Prompt: (For Collective Peer Eval Strategy)

You will be provided with a news article summary and the initial evaluation from three large language models (LLMs), referred to as the assistant’s evaluation. The assistant’s evaluation includes a brief analysis by the assistant and a rating given by the assistant.

Your task is to correct one aspect of each assistant’s evaluation based on a specific metric and provide feedback to the LLM in the form of a critic comment. Additionally, you need to determine whether there is a need for re-evaluation for each assistant.

Please carefully review and understand these instructions. Keep this document open for reference while reviewing.

Evaluation Criteria:

Coherence (1-5) - The collective quality of all sentences. The summary should be well-structured and well-organized, not just a heap of related information, but building from sentence to a coherent body of information about a topic.

Example:

Source Text: {{Full Article}}

Summary: {{Summary of Article}}

First Assistant’s Evaluation: {{Peer Response}}

Second Assistant’s Evaluation: {{Peer Response2}}

Third Assistant’s Evaluation: {{Peer Response3}}

Evaluation Form:

-Critic Comment: Provide concise feedback to the assistant regarding the evaluation.

-Re-Evaluation: Yes/No, based on whether you believe there is a need for re-evaluation.

Provide your feedback for each assistant in the following format:

{
"evaluators":
{
"evaluator": "Assistant 1",
"critic_comment": "Your feedback for Assistant 1’s evaluation.",
"re_evaluation": "Yes/No"
},
{
"evaluator": "Assistant 2",
"critic_comment": "Your feedback for Assistant 2’s evaluation.",
"re_evaluation": "Yes/No"
},
{
"evaluator": "Assistant 3",
"critic_comment": "Your feedback for Assistant 3’s evaluation.",
"re_evaluation": "Yes/No"
}
}

Please provide the critic comments and re-evaluation decisions for each assistant model as requested.


Critic Prompt: (For Weighted Feedback Strategy)

You will be provided with a news article summary and the initial evaluation from three large language models (LLMs), referred to as the assistant’s evaluation. The assistant’s evaluation includes a brief analysis by the assistant and a rating given by the assistant.

Your task is to correct one aspect of each assistant’s evaluation based on a specific metric and provide feedback to the LLM in the form of a critic comment. Additionally, you need to provide a weightage for the assistant’s evaluation.

Please carefully review and understand these instructions. Keep this document open for reference while reviewing.

Evaluation Criteria:

Coherence (1-5) - The collective quality of all sentences. The summary should be well-structured and well-organized, not just a heap of related information, but building from sentence to a coherent body of information about a topic.

Example:

Source Text: {{Full Article}}

Summary: {{Summary of Article}}

First Assistant’s Evaluation: {{Peer Response}}

Second Assistant’s Evaluation: {{Peer Response2}}

Third Assistant’s Evaluation: {{Peer Response3}}

Evaluation Form:

-Critic Comment: Provide concise feedback to the assistant regarding the evaluation.

-Weightage: Provide a weightage for the assistant’s evaluation, indicating the quality of the evaluation. Use a scale of 0-1, where 0 is the lowest and 1 is the highest.

Provide your feedback for each assistant in the following format:

{
"evaluators":
{
"evaluator": "Assistant 1",
"critic_comment": "Your feedback for Assistant 1’s evaluation.",
"weightage": "Weightage value (0-1)"
},
{
"evaluator": "Assistant 2",
"critic_comment": "Your feedback for Assistant 2’s evaluation.",
"weightage": "Weightage value (0-1)"
},
{
"evaluator": "Assistant 3",
"critic_comment": "Your feedback for Assistant 3’s evaluation.",
"weightage": "Weightage value (0-1)"
}
}

Please provide the critic comments and weightage for each assistant model as requested.
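
Before the weightages are passed on, the critic's reply has to be parsed and checked against the 0-1 scale. The sketch below illustrates one way to do this. It assumes the reply is valid JSON with "evaluators" as a list, and that out-of-range values are clamped rather than rejected; both choices, and the name `parse_weightages`, are assumptions and not part of the original framework.

```python
import json


def parse_weightages(critic_output: str) -> dict:
    """Map each evaluator name to its weightage, clamped to the range [0, 1]."""
    data = json.loads(critic_output)
    weights = {}
    for entry in data["evaluators"]:
        # The template asks for the value as a string, so convert it first.
        value = float(entry["weightage"])
        weights[entry["evaluator"]] = min(1.0, max(0.0, value))
    return weights


# Hypothetical critic reply, for illustration only.
example_weighted_reply = json.dumps({
    "evaluators": [
        {"evaluator": "Assistant 1", "critic_comment": "Sound analysis.",
         "weightage": "0.8"},
        {"evaluator": "Assistant 2", "critic_comment": "Rating is too generous.",
         "weightage": "0.35"},
        {"evaluator": "Assistant 3", "critic_comment": "Thorough and accurate.",
         "weightage": "1.4"},  # out of range on purpose; clamped to 1.0
    ]
})

print(parse_weightages(example_weighted_reply))
```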

Area Chair Prompt: (For No Critic, Individual and Collective Strategies)

You will be given one summary written for a news article and you will also receive initial evaluations from three large language models, referred to as the assistants’ evaluations. Your task is to rate the summary on one metric.

Please read the instructions and criteria below carefully and use them as a guide in your evaluation.

Evaluation Criteria:

Coherence (1-5) - Assess the structural and organizational quality of the summary. It should present information logically and clearly, relating to the main topic of the news article. Consider if the summary is well-structured, if it progresses logically from point to point, and if it effectively encapsulates the key points of the article.

Evaluation Guidelines:

1. Read the news article to understand the main topic and key points.

2. Review the summary. Analyze if it accurately and logically covers the main points of the article.

3. Rate the summary’s coherence on a scale of 1 to 5, with 1 being the least coherent and 5 being the most coherent.

4. Scoring Guidelines:

$\mathbf{Score = 5}$: The summary fully captures all key points of the article with an accurate and logical flow, without any significant omissions or irrelevant information.

$\mathbf{4 \leq Score < 5}$: Most key points are included with a generally logical sequence, albeit with minor omissions or slight inclusions of less relevant information.

$\mathbf{3 \leq Score < 4}$: Some key points are present, but others are missing, and the flow has noticeable gaps or jumps, including some irrelevant details.

$\mathbf{2 \leq Score < 3}$: Several key points are missed, and the flow is disjointed with significant omissions or inaccuracies, and noticeable irrelevant content.

$\mathbf{1 \leq Score < 2}$: Fails to represent the article accurately, lacks coherence and logical flow, with major elements missing or misrepresented, and significant irrelevant details.

Example:

Source Text: {{Full Article}}

Summary: {{Summary of Article}}

First Assistant’s Evaluation: {{Peer_response1}}

Second Assistant’s Evaluation: {{Peer_response2}}

Third Assistant’s Evaluation: {{Peer_response3}}

Evaluation Form (Please provide your analysis and rating as follows):

- Analysis: [Your detailed analysis here, focusing on the structural and logical flow of the summary in relation to the source text.]

- Rating: [Your coherence rating here on a scale from 1 to 5.]
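
The Evaluation Form above asks for an "Analysis" field followed by a "Rating" field, so the Area Chair's reply can be read back with a small parser. The sketch below is illustrative only: the function name `parse_area_chair_response`, the regular expressions, and the decision to reject ratings outside the 1-5 scale are assumptions rather than details taken from the framework.

```python
import re


def parse_area_chair_response(text: str):
    """Extract (analysis, rating) from a reply that follows the Evaluation Form."""
    rating_match = re.search(r"Rating:\s*\[?\s*([0-9]+(?:\.[0-9]+)?)", text)
    if rating_match is None:
        raise ValueError("no rating found in the response")
    rating = float(rating_match.group(1))
    # The prompt defines coherence on a 1-5 scale (fractional scores allowed).
    if not 1.0 <= rating <= 5.0:
        raise ValueError(f"rating {rating} is outside the 1-5 scale")
    # Everything between "Analysis:" and the "Rating:" bullet is the analysis.
    analysis_match = re.search(
        r"Analysis:\s*(.*?)\s*-?\s*Rating:", text, flags=re.DOTALL
    )
    analysis = analysis_match.group(1).strip() if analysis_match else ""
    return analysis, rating


# Hypothetical reply, for illustration only.
reply = (
    "- Analysis: The summary follows the article's order but omits the outcome.\n"
    "- Rating: 3.5"
)
analysis, rating = parse_area_chair_response(reply)
print(rating)
print(analysis)
```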

Area Chair Prompt: (For Critic Comment Feedback Strategy)

You will be given one summary written for a news article and you will also receive initial evaluations from three large language models, referred to as the assistants’ evaluations. You will also receive critic comments by another LLM for each of these assistants’ evaluations. Consider these in your evaluation. Your task is to rate the summary on one metric.

Please read the instructions and criteria below carefully and use them as a guide in your evaluation.

(Evaluation Criteria and Evaluation Guidelines same as above)

Example:

Source Text: {{Full Article}}

Summary: {{Summary of Article}}

First Assistant’s Evaluation: {{Peer_response1}}

Critic Comment: {{Critic_Comment1}}

Second Assistant’s Evaluation: {{Peer_response2}}

Critic Comment: {{Critic_Comment2}}

Third Assistant’s Evaluation: {{Peer_response3}}

Critic Comment: {{Critic_Comment3}}

Evaluation Form (Please provide your analysis and rating as follows):

- Analysis: [Your detailed analysis here, focusing on the structural and logical flow of the summary in relation to the source text.]

- Rating: [Your coherence rating here on a scale from 1 to 5.]

Area Chair Prompt: (For Weighted Feedback Strategy)

You will be given one summary written for a news article and you will also receive initial evaluations from three large language models, referred to as the assistants’ evaluations. You will also receive a weightage, assigned by another critic LLM, for each of these assistants’ evaluations. Consider these in your evaluation. Your task is to rate the summary on one metric.

Please read the instructions and criteria below carefully and use them as a guide in your evaluation.

(Evaluation Criteria and Evaluation Guidelines same as above)

Example:

Source Text: {{Full Article}}

Summary: {{Summary of Article}}

First Assistant’s Evaluation: {{Peer_response1}}

Weightage: {{weightage1}}

Second Assistant’s Evaluation: {{Peer_response2}}

Weightage: {{weightage2}}

Third Assistant’s Evaluation: {{Peer_response3}}

Weightage: {{weightage3}}

Evaluation Form (Please provide your analysis and rating as follows):

- Analysis: [Your detailed analysis here, focusing on the structural and logical flow of the summary in relation to the source text.]

- Rating: [Your coherence rating here on a scale from 1 to 5.]
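
The double-brace placeholders in these templates (for example {{Peer_response1}} and {{weightage1}}) have to be filled with the peers' evaluations and the critic's output before the prompt is sent to the Area Chair. A minimal sketch of that substitution step follows. The helper name `fill_template` and the sample values are hypothetical, and raising an error for an unfilled placeholder is a design assumption, chosen so that an incomplete prompt is never sent silently.

```python
import re


def fill_template(template: str, values: dict) -> str:
    """Replace every {{name}} placeholder in the template with values[name]."""

    def substitute(match):
        name = match.group(1).strip()
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return str(values[name])

    # A single pass, so text inserted for one placeholder is never re-scanned.
    return re.sub(r"\{\{(.*?)\}\}", substitute, template)


# Fragment of the weighted-feedback template, with hypothetical values.
fragment = (
    "First Assistant's Evaluation: {{Peer_response1}}\n"
    "Weightage: {{weightage1}}"
)
filled = fill_template(
    fragment,
    {"Peer_response1": "Analysis: Logical flow. Rating: 4", "weightage1": 0.8},
)
print(filled)
```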

Appendix I Sample Prompts for GSM8k Dataset

Peer Prompt:

You will be provided with a problem that requires logical reasoning, mathematical calculation, or both. Your task is to solve the problem accurately, providing not just the answer but also a clear explanation of the steps taken to reach that answer. It is crucial to understand the problem thoroughly and apply the correct principles or formulas to solve it.

Instructions:

1. Read the problem statement carefully, ensuring you understand all the details and what is required for the solution.

2. Work through the problem logically and methodically, explaining your reasoning and the steps you take to solve the problem.

3. Provide the final answer clearly, specifying it as a numerical value or a specific explanation as required by the problem statement.

Problem Statement: {{Problem}}

Evaluation Form:

- Analysis: Start with "Analysis:" to provide a concise and structured explanation of the steps and reasoning used to solve the problem. Ensure your analysis is clear and follows a logical sequence.

- Final Answer: Clearly state the final answer only (number) on the line after your analysis.

Area Chair Prompt:

You will be provided with a problem that requires logical reasoning, mathematical calculation, or both. Alongside the problem, you will also see solutions provided by three other Language Models (LLMs). Your task is to analyze these solutions, identify any inaccuracies or inconsistencies, and then solve the problem accurately yourself. This requires not just providing the correct answer but also a clear explanation of the steps taken to reach that answer, especially in light of the previous solutions provided. It is crucial to understand the problem thoroughly and apply the correct principles or formulas to solve it correctly.

Instructions:

1. Read the problem statement carefully, ensuring you understand all the details and what is required for the solution.

2. Review the solutions provided by the three other LLMs. Note any common mistakes or discrepancies in their reasoning or calculations.

3. Work through the problem logically and methodically, explaining your reasoning and the steps you take to solve the problem.

4. Provide the final answer clearly, specifying it as a numerical value or a specific explanation as required by the problem statement.

Problem Statement: {{Problem}}

Solution by other LLMs:

LLM1 Answer: {{Peer_response1}}

LLM2 Answer: {{Peer_response2}}

LLM3 Answer: {{Peer_response3}}

Evaluation Form:

- Analysis: Start with "Analysis:" to provide a concise and structured explanation of the steps and reasoning used to solve the problem. Ensure your analysis is clear and follows a logical sequence.

- Final Answer: Clearly state the final answer only (number) on the line after your analysis.
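
Because both GSM8k prompts ask for a numeric "Final Answer" after the analysis, the answer can be recovered from a reply with a short parser. The sketch below is one possible approach under stated assumptions: the function name `extract_final_answer` is hypothetical, the last occurrence of "Final Answer" is taken as authoritative, and thousands separators and currency symbols are tolerated. None of these choices is specified in the prompts themselves.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[float]:
    """Return the first number that follows the last "Final Answer" marker."""
    _, marker, tail = response.rpartition("Final Answer")
    if not marker:
        return None  # the reply did not follow the Evaluation Form
    # Accept an optional sign, thousands separators, and a decimal part.
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", tail)
    if not numbers:
        return None
    return float(numbers[0].replace(",", ""))


# Hypothetical reply, for illustration only.
reply_text = (
    "Analysis: 16 eggs are laid, 3 are eaten and 4 are baked, so 9 are sold. "
    "At $2 each that gives 9 * 2 = 18.\n"
    "Final Answer: 18"
)
print(extract_final_answer(reply_text))
```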