What’s Wrong? Refining Meeting Summaries with LLM Feedback

Frederic Kirstein^1,2, Terry Ruas¹, Bela Gipp¹
¹Georg-August-Universität Göttingen, Germany
²[email protected]

Abstract

Meeting summarization has become a critical task since digital encounters have become a common practice. Large language models (LLMs) show great potential in summarization, offering enhanced coherence and context understanding compared to traditional methods. However, they still struggle to maintain relevance and avoid hallucination. We introduce a multi-LLM correction approach for meeting summarization using a two-phase process that mimics the human review process: mistake identification and summary refinement. We release QMSum Mistake, a dataset of 200 automatically generated meeting summaries annotated by humans on nine error types, including structural, omission, and irrelevance errors. Our experiments show that these errors can be identified with high accuracy by an LLM. We transform identified mistakes into actionable feedback to improve the quality of a given summary measured by relevance, informativeness, conciseness, and coherence. This post-hoc refinement effectively improves summary quality by leveraging multiple LLMs to validate output quality. Our multi-LLM approach for meeting summarization shows potential for similar complex text generation tasks requiring robustness, action planning, and discussion towards a goal.


1 Introduction

Figure 1: Overview of the two-stage refinement protocol displaying the assessed variants. The Mistake Identification block is analyzed in Section 4 and the Refinement block in Section 5.

Meeting summaries are essential for professional conversations: they serve as a reference for subsequent processes, update absentees, and reinforce the most important topics discussed. The growing importance of summarization systems is evident from the recent release of tools in virtual meeting software (e.g., Zoom (https://www.zoom.com/en/ai-assistant), Microsoft Teams (https://copilot.cloud.microsoft), Google Meet (https://support.google.com/meet/)). Still, meeting summarization faces challenges, such as handling spoken language idiosyncrasies and identifying salient content Kirstein et al. (2024a). Existing techniques, like AMR-graphs for capturing speaker relations Hua et al. (2023), are often tailored to specific backbone models, typically using BART Lewis et al. (2020), PEGASUS Zhang et al. (2020a) or their variations. Recent explorations of large language models (LLMs) for meeting summarization reveal their strong capabilities (e.g., high-quality summaries of long inputs) Laskar et al. (2023). However, these LLM-generated summaries are still error-prone Kirstein et al. (2024b) and costly to fine-tune Chauhan et al. (2022); Wang et al. (2022).

The shift to LLMs as backbone models raises the question of how to better use their capabilities and mitigate their weaknesses. (Self-)correction through few-shot prompting improves LLM performance by asking the model to review and correct its output Pan et al. (2023). While successful in various tasks (e.g., question answering Jiang et al. (2024), reasoning Madaan et al. (2021), and summarization Saunders et al. (2022)), self-correction still falls short in identifying and correcting errors Huang et al. (2024). To address this, Tyen et al. (2024) propose a multi-LLM refinement process for reasoning tasks, leading to a more robust correction approach.

Analogous to how humans iterate over suggestions and edits when writing texts, we explore how LLMs may be employed in the same way to improve meeting summarization in a two-stage approach consisting of mistake identification in an existing summary and a subsequent refinement (Figure 1). For mistake identification, we annotate QMSum Zhong et al. (2021) on nine error types (e.g., omission, structural mistakes) Kirstein et al. (2024b); Chang et al. (2024). GPT-4 Turbo (referred to as GPT4 throughout the paper) identifies errors with $\sim$ 89% accuracy on average, but it struggles with irrelevance ( $\sim$ 81%) and hallucination ( $\sim$ 72%) errors. We achieve the best results on the mistake identification task using multiple LLM instances, one for each error type, and Chain-of-Thought (CoT) prompting Wei et al. (2023). For the refinement stage, we use an additional model instance to adjust an erroneous summary according to the detailed feedback from the mistake identification stage. We explore what content a refinement model requires, considering the CoT explanation from the mistake identification task, a correction suggestion, and the original meeting transcript as additional information sources for pointed-out mistakes. We further analyze whether the feedback should be passed through an intermediate planning stage that extracts which content to add, remove, or rewrite in a summary. We identify strong quality improvements for refined summaries over the original ones and baselines when using the CoT explanation from the mistake identification as feedback alongside the erroneous summary without additional processing. Our contributions are summarized as follows:

• QMSum Mistake, a dataset of 200 meeting summaries and human-annotated errors (the dataset will later be available through Huggingface and the project-accompanying Github repository).

• A multi-LLM approach to finding mistakes in meeting summaries considering different prompting approaches.

• A transformation of identified mistakes into actionable feedback to refine an erroneous summary and derive a refinement protocol.

Dataset	# Meetings	# Turns	# Speakers	# Len. of Meet.	# Len. of Gold Sum.	# Len. of Aut. Sum.
AMI	124 (113)	535.6	4.0	6007.7	108.8	112.4
ICSI	52 (42)	819.0	6.3	13317.3	103.0	108.2
WPCP	24 (14)	207.7	34.1	13761.9	129.5	112.9
QMSum Mistake	200 (169)	556.8	9.2	9069.8	109.1	116.9

Table 1: Statistics for the QMSum Mistake dataset. Values are averages of the respective categories. Lengths (Len.) are in number of words. In # Meetings, values in parentheses are the number of erroneous samples.

2 Related Work

Meeting Summarization and its parent domain dialogue summarization are transitioning from traditional encoder-decoder models to LLMs. Traditional models, such as BART Lewis et al. (2020) and PEGASUS Zhang et al. (2020a), improved through techniques tailored to specific challenges like language, structure, comprehension, speaker, salience, and factuality Kirstein et al. (2024a, b). These models integrated methods such as AMR-graphs for speaker relations Hua et al. (2023), role vectors for speaker correlation Asi et al. (2022); Naraki et al. (2022), and additional training stages to bridge the gap between pre-training on written texts and spoken dialogue tasks Raffel et al. (2020); Khalifa et al. (2021); Lee et al. (2021b). Recently, LLMs have been explored for meeting summarization by prompting the model to create a TL;DR Laskar et al. (2023); Kirstein et al. (2024b), showing comparable performance to specialized encoder-decoder models but with better context comprehension. They thereby use LLMs without any adaptations and serve as the first works to report on LLM performance on meeting summarization. Our work examines the effectiveness of LLMs as post-processors for summaries, assessing if this approach can achieve high-quality summaries without requiring techniques tailored to a specific challenge of meeting summarization. We compare this against original summaries, single-LLM baselines, and human summaries, providing an updated benchmark for LLMs in meeting summarization. For the creation of QMSum Mistake, we extend the work by Kirstein et al. (2024b), refining their definition of errors.

Self-correction methods have been extensively studied in recent literature Pan et al. (2023), including training-time correction strategies like Reinforcement Learning from Human Feedback (RLHF) Ouyang et al. (2022) and self-improvement techniques Huang et al. (2024). Our feedback and refinement method falls into the category of post-hoc correction, which is applied to outputs already generated. Previous post-hoc correction methods, such as Reflexion Shinn et al. (2023) and RCI Kim et al. (2023), focus on reasoning errors and often degrade performance without oracle labels Huang et al. (2024). Our work applies post-hoc correction to meeting summarization, focusing on qualitative improvements with independent models, and extends the exploration to other model families and related summarization domains. Our approach is informed by the two-stage setup of Tyen et al. (2024), which we extend with an extensive mistake identification architecture and a multi-stage refinement.

3 QMSum Mistake Dataset

QMSum Mistake consists of 200 samples, with 169 (85%) automatically created meeting summaries annotated on nine error types (Section 3.1) and 31 error-free summaries serving as controls to analyze if the mistake identification is too sensitive. Table 1 provides dataset statistics. The samples stem from QMSum’s Zhong et al. (2021) training and test sets, including AMI (staged business meetings) Carletta et al. (2005), ICSI (academic meetings) Janin et al. (2003), and parliament meetings. As gold summaries lack typical errors of automatic summaries, we generate summaries using encoder-decoder models (i.e., LED Beltagy et al. (2020), DialogLED Zhong et al. (2022), PEGASUS-X Phang et al. (2022)) for more severe mistakes in automatic summaries such as coreference and structure errors and LLMs (i.e., GPT-3.5, Phi-3 mini 128k Abdin et al. (2024)) for subtle errors such as relevance. Models have a context size of at least 16k to fit the entire meeting in the input, use default settings, and generate up to 200 tokens to match gold summary lengths. Table 9 shows examples of varying summarization styles and quality levels. The human annotation process, which achieved an average Krippendorff’s alpha of 0.780 (see Table 5), is described in Appendix D.

3.1 Observable errors

Error Type	Transcript	Definition
Redundancy RED	not required	The summary contains repeated or redundant information, which does not help the understanding or contextualization.
Incoherence INC	not required	The model generates summaries containing characteristics that disrupt the logical flow, relevance, or clarity of content either within a sentence (intra-sentence) or across sentences (inter-sentence).
Language LAN	not required	The model uses inappropriate, incorrect (ungrammatical), or ambiguous language or fails to capture unique linguistic styles.
Omission (partial, total) P-OM, T-OM	required	Missing information from the meeting, such as significant decisions or actions. Total omission: Relevant topics and key points are not stated. Partial omission: Salient topics are mentioned but not captured in detail.
Coreference COR	required	The model fails to resolve a reference to a participant or entity, misattributes statements, or omits necessary mentions.
Hallucination HAL	required	The model produces inconsistencies not aligned with the meeting content. Intrinsic: Misrepresents information from the transcript. Extrinsic: Introduces content not present in the transcript.
Structure STR	required	The model misrepresents the order or logic of the meeting’s discourse, misplacing topics or events.
Irrelevance IRR	required	The summary includes information that is unrelated or not central to the main topics or objectives of the meeting.

Table 2: Definition of the nine error types annotated in QMSum Mistake based on existing error types Kirstein et al. (2024b); Chang et al. (2024)

We refine existing error types Kirstein et al. (2024b); Chang et al. (2024) into nine error types with minimal overlap. Table 2 holds the short definitions. Preliminary testing and annotator feedback inform the refinement of the error types and point out overlap in the existing error definitions, which makes a clear distinction difficult. This leads to major adaptations to precisely delimit the repetition, incoherence, structure, and linguistic inaccuracy errors, while the omission errors undergo minor tweaks in wording. Intrinsic and extrinsic hallucination errors are merged into a single category to reduce overlap in edge cases between the two. The initial observations further indicate that errors so far were designed to capture missing or incorrect information, not the inclusion of unrelated content, which our summary-generating models tend to produce. Thus, we add the ’Irrelevance’ category.

4 Mistake Identification

Table 3 shows the accuracy of GPT4 (gpt-4-turbo-2024-04-09, default settings, temperature = 0) in identifying summarization-related errors (Section 3.1) on the QMSum Mistake dataset. We chose GPT4 for its context size, understanding capabilities, robustness to handle spoken language, and superior results compared to Gemini Team et al. (2024) and Phi Abdin et al. (2024) in early experiments. We provide complementary analysis for the discarded models in Appendix B.

4.1 Mistake identification protocol (MIP)

We consider two prompting strategies to identify possible mistakes in a summary: direct and CoT prompting. In direct prompting Tyen et al. (2024), the model is given the predicted summary and, when required (see Table 2), the meeting transcript, and outputs ’Yes’ or ’No’ for each error to indicate its existence. For CoT prompting Wei et al. (2023), we extend direct prompting by having the model explain why a passage is erroneous following the ’let’s think step by step’ approach, allowing for detailed analysis of the model’s understanding.

As GPT4 is not specifically trained to identify errors, we enrich the mistake identification prompt with few-shot examples of erroneous summaries (non-overlapping with our test set). The mistake identification prompt consists of four parts: the model role and error definition for context, two few-shot examples of the error type, an optional request for the CoT prompting, and the primary task of reporting the error’s existence. We include more details on the exact prompt in Appendix D.
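The four-part prompt structure described above can be sketched as follows. This is a minimal illustration only: the function name `build_mip_prompt`, its parameters, and all prompt wording are assumptions made for this sketch, not the prompt from Appendix D.

```python
from typing import List, Optional


def build_mip_prompt(
    error_name: str,
    definition: str,
    few_shot_examples: List[str],
    summary: str,
    transcript: Optional[str] = None,
    use_cot: bool = False,
) -> str:
    """Assemble a four-part mistake identification prompt (wording is hypothetical)."""
    # Part 1: model role and error definition for context.
    parts = [
        "You are reviewing an automatically generated meeting summary.",
        f"Error type: {error_name}. Definition: {definition}",
    ]
    # Part 2: few-shot examples of the error type (the paper uses two).
    for i, example in enumerate(few_shot_examples, start=1):
        parts.append(f"Example {i}: {example}")
    # Part 3: optional request for CoT prompting.
    if use_cot:
        parts.append("Let's think step by step and explain why a passage is erroneous.")
    # The transcript is attached only for error types that require it (Table 2).
    if transcript is not None:
        parts.append(f"Transcript:\n{transcript}")
    parts.append(f"Summary:\n{summary}")
    # Part 4: primary task of reporting the error's existence.
    parts.append(f"Does the summary contain a {error_name} error? Answer 'Yes' or 'No'.")
    return "\n\n".join(parts)
```

Leaving `transcript` as `None` corresponds to the error types marked "not required" in Table 2, and `use_cot` switches between the direct and CoT variants.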

We consider two setups to explore the MIP: a single-instance setup, in which one GPT4 is asked to detect all error types at once Zhang et al. (2023a), and a multi-instance architecture Mousavi et al. (2023) using one GPT4 instance for each error type.
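The difference between the two setups can be sketched as below. The model call is abstracted as a hypothetical `ask_model` callable (any function mapping a prompt string to a response string), the prompt wording and the line-based answer format of the single-instance variant are assumptions, and the error labels follow Table 3.

```python
from typing import Callable, Dict

# Error labels as used in Table 3.
ERROR_TYPES = ["P-OM", "T-OM", "REP", "INC", "COR", "HAL", "LAN", "STR", "IRR"]


def multi_instance_detect(summary: str, ask_model: Callable[[str], str]) -> Dict[str, bool]:
    """One model call per error type; each call sees a single definition."""
    results = {}
    for error in ERROR_TYPES:
        prompt = f"Error type: {error}\nSummary: {summary}\nAnswer 'Yes' or 'No'."
        answer = ask_model(prompt)
        results[error] = answer.strip().lower().startswith("yes")
    return results


def single_instance_detect(summary: str, ask_model: Callable[[str], str]) -> Dict[str, bool]:
    """One model call covering all error types at once."""
    prompt = (
        "For each error type, answer on its own line as '<TYPE>: Yes' or '<TYPE>: No'.\n"
        f"Error types: {', '.join(ERROR_TYPES)}\nSummary: {summary}"
    )
    answer = ask_model(prompt)
    # Error types missing from the response default to "not detected".
    results = {error: False for error in ERROR_TYPES}
    for line in answer.splitlines():
        name, _, verdict = line.partition(":")
        if name.strip() in results:
            results[name.strip()] = verdict.strip().lower().startswith("yes")
    return results
```

In practice `ask_model` would wrap an LLM API call; for a dry run it can be replaced by a stub that returns canned answers.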

4.2 Mistake identification discussion

While both setups achieve high accuracy scores, the single-instance setup struggles to consistently beat an always true baseline on the whole QMSum Mistake dataset. Overall, this aligns with the hypothesis behind current LLM-based automatic metrics that leverage similar models to assess text characteristics such as fluency, readability, or clarity Li et al. (2024).

Impact of mistake identification protocol on accuracy of error detection.

	single-instance		multi-instance		always true
Error	direct	CoT	direct	CoT
P-OM	75.0	82.2	82.5	84.5	79.0
T-OM	78.5	81.5	87.0	90.0	81.0
REP	73.0	72.0	92.0	95.5	48.5
INC	73.0	66.5	83.0	89.5	39.0
COR	76.0	63.0	85.0	91.5	19.0
HAL	42.5	59.0	73.5	72.0	62.0
LAN	61.5	68.5	77.5	88.5	43.0
STR	71.0	62.5	69.5	87.0	47.0
IRR	60.5	59.0	76.5	81.0	51.0

(a) Results on the whole QMSum Mistake dataset.

	single-instance		multi-instance		always true
Error	Direct	CoT	Direct	CoT
P-OM	86.4	89.9	93.5	94.1	93.5
T-OM	87.0	87.6	94.7	94.1	95.9
REP	68.0	66.9	90.5	94.7	57.4
INC	68.0	60.4	79.9	88.2	46.2
COR	71.6	61.5	82.8	89.9	22.5
HAL	50.3	60.4	75.7	75.1	73.3
LAN	61.5	63.9	75.7	82.2	50.9
STR	66.9	60.9	67.5	89.9	55.6
IRR	55.6	60.4	76.9	82.8	60.4

(b) Results on the erroneous samples of QMSum Mistake.

Table 3: Mistake identification accuracy of GPT4 for all MIP variants. Always True baseline provided for reference. Best values are bold.

Comparing results across the four MIP variants (Table 3), we find that accuracy in detecting mistakes increases significantly on all error types when using a multi-instance setup compared to the single-instance approach. While the difference between single and multi-instance is comparatively small ( $\sim$ 7%) for both omission error types (T-OM, P-OM), the accuracy can deviate by up to $\sim$ 29.5% in the case of HAL. Figure 2 shows that the average accuracy across all error types gains at least 13.5% when using multiple LLM instances for detection, which aligns with recent works Huang et al. (2024); Tyen et al. (2024). We observe that the average false negative rate decreases by $\sim$ 27 percentage points from single (CoT) (30.0%, worst overall) to multi (CoT) (3.4%, best overall). We hypothesize that the weaker single-model performance may stem from the extended content and the additional tasks that a single model must handle compared to the multi-instance setting. As a result, the single-instance approach is unable to process the long dependencies, which limits contextualization and comprehension Lee et al. (2021a). In addition, while the multi-instance setup benefits from CoT prompting, the single-instance setup is negatively affected, with an increase in the false negative rate. The CoT explanations support these observations: they show inconsistency in assessing the requested error types due to misunderstanding of the definitions.

Considering the multi-instance approach as better suited for mistake identification, we conclude that CoT prompting is beneficial, improving accuracy even further to close to 90%. Note that the CoT explanation might contain wrong statements while the resulting error identification is still correct, which has also been observed in tasks such as sorting and logical deduction Tyen et al. (2024).

The nearly constant average false positive rate (between 12.4% and 15.4%) across all MIPs (Figure 2) suggests a model tendency to point out non-existing errors, which we interpret as oversensitivity to error types. Analyzing the accuracy change between the whole dataset (Table 3a) and the erroneous subset (Table 3b), we find that GPT4 tends to falsely flag T-OM, P-OM, STR, HAL, and IRR errors. We derive from this observation that the model expects a content-richer summary, seeing additional content as relevant. Our results suggest that the multi-instance setup with CoT prompting provides the most reliable mistake identification, which we use for the following experiments.
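For reference, the quantities discussed in this section can be computed from binary labels as in the sketch below. It uses the standard definitions (false positive rate over error-free samples, false negative rate over erroneous samples); whether Figure 2 normalizes the rates in exactly this way is an assumption. The always-true baseline predicts an error for every sample, so its accuracy equals the share of samples that actually contain the error.

```python
from typing import Dict, List


def detection_metrics(gold: List[bool], pred: List[bool]) -> Dict[str, float]:
    """Accuracy, error rates, and the always-true baseline for one error type."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)

    def ratio(numerator: int, denominator: int) -> float:
        # Guard against error types with no positive or no negative samples.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": (tp + tn) / len(gold),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
        # Predicting 'Yes' everywhere is correct exactly on the erroneous samples.
        "always_true_accuracy": (tp + fn) / len(gold),
    }
```

A rare error type has a low always-true accuracy, while a frequent one has a high baseline that is harder to beat.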

Difficulties in identifying errors.

Based on the best MIP’s accuracy, we categorize errors into three groups: reliable ( $\geq$ 90.0%: COR, REP, T-OM), good ( $\geq$ 85.0%: INC, LAN, STR), and hard to detect (<85.0%: P-OM, IRR, HAL). In the following, we discuss the difficulties related to each category by analyzing the model’s CoT explanations to identify patterns and possible reasons why it struggles (due to the amount of data, the model responses considered for this section will be shared upon acceptance).
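The grouping can be reproduced from the multi-instance CoT column of Table 3(a), assuming that this column is the "best MIP" the thresholds refer to. The names `MULTI_COT_ACCURACY` and `difficulty_group` are introduced here for illustration.

```python
# Multi-instance CoT accuracy on the whole dataset, copied from Table 3(a).
MULTI_COT_ACCURACY = {
    "P-OM": 84.5, "T-OM": 90.0, "REP": 95.5,
    "INC": 89.5, "COR": 91.5, "HAL": 72.0,
    "LAN": 88.5, "STR": 87.0, "IRR": 81.0,
}


def difficulty_group(accuracy: float) -> str:
    """Map an accuracy (in percent) to the three groups used in the text."""
    if accuracy >= 90.0:
        return "reliable"
    if accuracy >= 85.0:
        return "good"
    return "hard"


groups = {}
for error, accuracy in MULTI_COT_ACCURACY.items():
    groups.setdefault(difficulty_group(accuracy), []).append(error)

for name in ("reliable", "good", "hard"):
    print(name, sorted(groups[name]))
```

T-OM sits exactly on the 90.0% boundary and P-OM only half a point below the 85.0% boundary, so both assignments are sensitive to small accuracy changes.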

Errors from the reliable group have descriptions close to what an LLM without access to our definitions would generate when prompted to define the error. The rare accuracy decreases are related to oversensitivity cases, e.g., assigning a T-OM error when expecting more details, indicating that the model may apply error detection rules too strictly. False identification of COR errors typically occurs when conversations become less structured, and multiple participants mention similar information, as in the samples derived from the AMI dataset.

For errors from the good group, the main issue is the model’s tendency to fail to properly contextualize error definitions and to apply them too strictly compared to human annotators. For example, a summary’s linearity may be counted as a STR error because the summary does not preserve the identical structure of the meeting. False detection of LAN errors includes marking domain-typical terms (e.g., grad student for graduate student in ICSI) as mistakes. The model also orients itself on the transcript’s language level, which makes fractured and brainstorming-like content (e.g., conversations from ICSI) difficult to assess.

Errors from the hard group are challenging mainly due to the model’s difficulty in understanding the error type. In the context of HAL the model occasionally looks for closely related errors (e.g., T-OM, COR), leading to wrong detection. GPT4 struggles with P-OM and IRR due to the inherent subjectivity, which we also observe in the slightly lower inter-annotator agreement scores during the QMSum Mistake annotation (Table 5). We conclude that GPT4 applies error detection definitions slightly too strictly and mistakes related to subjectivity are influenced by the model’s heuristic.

5 Summary Refinement

Building on the finding that an LLM can identify typical meeting summarization errors (Section 4.2), we analyze how the quality of original predicted summaries changes when an LLM refines them based on identified mistakes. Our multi-model refinement approach mimics a four-stage human review process to form a refinement protocol (Figure 1): (1) locating errors using the best-performing MIP, (2) generating feedback on identified errors (feedback protocol), (3) structuring feedback (transfer protocol), and (4) refinement. In the following, we explore the setup of the feedback and transfer protocols to derive a refinement protocol for meeting summarization.

5.1 Feedback protocol (FP)

Feedback on an error can range from pointing out its existence, similar to someone highlighting a text passage and leaving a short comment, to in-depth explanations of what is wrong with the marked passage and rewrite suggestions. Following this analogy, our feedback protocol consists of an essential part and an additional detail part. The essential part includes minimal feedback on the existence of an error type and a short explanation of why and where it was detected, but may not mention all error instances. The additional detail part considers three optional information sources: CoT explanation Wei et al. (2023), correction suggestion Zhang et al. (2023a), and the original transcript. The CoT explanation, the output of the MIP’s CoT prompting (Section 4.1), contains all observed error instances and details on why they are considered errors. It helps the refinement model derive a rewriting plan through detailed, structured information but may lead to confusion if the reasoning is wrong Tyen et al. (2024). Correction suggestions provide examples of how to correct the error, either as tips or precise rewrites that can be directly applied. The transcript provides all available information in its original form, allowing the refinement model to decide whether to accept or reject the feedback and how to integrate it. The three optional information sources can be combined, which lets us determine how much information is required and whether feedback without a transcript is as informative as feedback with the transcript added for lookup.
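Combining three optional sources with the always-present essential part yields eight feedback protocol variants, which matches the eight FP rows per transfer protocol in Table 4. The sketch below enumerates them; the labels follow Table 4, while `feedback_variants`, `build_feedback`, and the plain-text layout are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Optional information sources, labeled as in Table 4.
OPTIONAL_SOURCES = ("CoT", "Cor", "Tra")


def feedback_variants() -> List[Tuple[str, ...]]:
    """All subsets of the optional sources; the empty subset is 'essential only'."""
    variants = []
    for size in range(len(OPTIONAL_SOURCES) + 1):
        variants.extend(combinations(OPTIONAL_SOURCES, size))
    return variants


def build_feedback(essential: str, sources: Dict[str, str], variant: Tuple[str, ...]) -> str:
    """Concatenate the essential part with the selected optional sources."""
    blocks = [f"Essential: {essential}"]
    for name in variant:
        blocks.append(f"{name}: {sources[name]}")
    return "\n".join(blocks)


for variant in feedback_variants():
    print("+".join(variant) if variant else "essential only")
```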

TP	FP	Overall	REL	INF	CON	COH
		(Ranking $\downarrow$ )	(Likert $\uparrow$ )	(Likert $\uparrow$ )	(Likert $\uparrow$ )	(Likert $\uparrow$ )
direct	essential only	5.44	3.08	2.99	3.29	3.14
	CoT	3.75	3.10	3.14	3.46	3.20
	Cor	3.79	3.04	2.83	3.57	3.23
	CoT+Cor	4.11	3.11	2.88	3.40	3.09
	Tra	4.68	3.12	2.93	3.65	3.37
	Tra + CoT	4.74	3.14	3.36	3.67	3.56
	Tra + Cor	4.93	3.10	3.14	3.68	3.44
	CoT+Cor+Tra	5.10	3.05	3.05	3.43	3.18
consolidated	essential only	6.10	2.53	2.27	2.58	2.36
	CoT	5.61	2.69	2.62	2.99	2.70
	Cor	6.07	2.96	2.85	3.22	2.98
	CoT+Cor	6.40	2.93	2.92	3.34	3.03
	Tra	4.86	3.08	3.12	3.50	3.33
	Tra + CoT	4.89	3.04	3.05	3.49	3.22
	Tra + Cor	4.88	3.11	3.29	3.60	3.59
	CoT+Cor+Tra	4.92	3.21	3.18	3.70	3.46
	GOLD	4.04	3.08	3.05	3.53	3.21
	ORIG	6.75	2.28	2.15	2.41	2.22
	GPT-S	4.84	3.00	3.00	3.40	3.10
	GPT-R	4.82	3.09	3.09	3.72	3.44

Table 4: Quality reporting of refined summaries for all Transfer Protocol (TP) and Feedback Protocol (FP) combinations (CoT = CoT explanation, Cor = correction, Tra = Transcript). Ranking is the average ranking across all samples. Lower ranking scores indicate higher preference (1 (always preferred) to 20 (always disliked)). REL, INF, CON, COH are the AUTOCALIBRATE Likert scores on relevance, informativeness, conciseness, and coherence using a 5-step Likert scale (1 (worst) to 5 (best)). Best scores per TP are bold, best scores overall are underlined.

5.2 Transfer protocol (TP)

We consider two approaches for structuring feedback for the refinement model: direct feedback Mousavi et al. (2023) and consolidation Zhang et al. (2023a). Direct feedback transfers derived feedback without additional processing, stating whether an error type is observed or not. In the case of CoT explanation, it informs the model step-by-step which sentences are erroneous or error-free, why they are correct or incorrect, and what should be changed (or kept) to have a correct sentence. Consolidation considers only identified errors and generates an editing plan using an intermediate LLM, extracting what information to add, remove, or alter from the feedback protocol. The consolidation protocol does not affect an appended transcript.
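The two transfer protocols can be contrasted in a short sketch. The intermediate LLM is abstracted as a hypothetical `plan_model` callable, and the prompt wording, the bracketed feedback layout, and all function names are assumptions rather than the setup used in the experiments.

```python
from typing import Callable, Dict, Optional


def direct_transfer(feedback: Dict[str, str]) -> str:
    """Pass the feedback for every error type unchanged, observed or not."""
    return "\n".join(f"[{error}] {text}" for error, text in feedback.items())


def consolidated_transfer(
    feedback: Dict[str, str],
    detected: Dict[str, bool],
    plan_model: Callable[[str], str],
) -> str:
    """Keep only identified errors and let an intermediate model derive an editing plan."""
    only_errors = {e: t for e, t in feedback.items() if detected.get(e, False)}
    prompt = (
        "Extract what to add, remove, or alter in the summary from this feedback:\n"
        + direct_transfer(only_errors)
    )
    return plan_model(prompt)


def refinement_prompt(summary: str, transferred: str, transcript: Optional[str] = None) -> str:
    """Build the refiner input; the transcript is appended untouched by either protocol."""
    parts = [f"Feedback:\n{transferred}", f"Summary:\n{summary}"]
    if transcript is not None:
        parts.append(f"Transcript:\n{transcript}")
    return "\n\n".join(parts)
```

Because the transcript is attached only in `refinement_prompt`, it bypasses consolidation, mirroring the statement that the consolidation protocol does not affect an appended transcript.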

5.3 Experimental setup

We refine the erroneous summaries from QMSum Mistake using each refinement protocol variant with the multi-instance CoT-prompted MIP. GPT4 is used as the backbone model for the refiner and the optional intermediate LLM that consolidates feedback, with other model families explored in Appendix B. We focus the following experiment on evaluating how summary quality changes based on feedback and show a setup for a meeting summarization refinement protocol. We consider a one-shot improvement here and provide insights on multi-round improvement in Figure 3. To help understand and categorize the quality changes, we report metric results for the original erroneous summaries (ORIG), error-free QMSum gold summaries (GOLD), summaries generated by one GPT4 (GPT-S), and summaries refined by one GPT4 (GPT-R; prompt: ’Refine this summary by considering the transcript’) as references in Table 4.

5.4 Evaluation approach

ROUGE Lin (2004) and BERTScore Zhang et al. (2020b), established metrics for meeting summary evaluation Kirstein et al. (2024a), yield scores too similar for interpretation across protocol variants (see Table 8). As human evaluation on all generated refined summaries (total $\sim$ 3.4k) is infeasible, we use the LLM-based metric AUTOCALIBRATE Liu et al. (2023) to report Likert scores on relevance (REL), informativeness (INF), conciseness (CON), and coherence (COH). Since this metric is not developed for meeting summarization, we assess alignment with human judgment by having six annotators rate a subset of 200 summaries according to AUTOCALIBRATE prompts (inter-annotator agreement (Krippendorff’s alpha): REL: 0.775, INF: 0.798, CON: 0.833, COH: 0.803). As the LLM-based evaluation aligns sufficiently with annotator labels (accuracy: 89.1%), we use AUTOCALIBRATE as our main quality proxy. Nevertheless, we manually check every fourth score tuple and model reasoning to confirm alignment with the evaluation task and human judgment. In case of misalignment, three annotators would instead rate the summary. As AUTOCALIBRATE only assesses specific characteristics and does not consider omission, hallucination, or repetition, we also set up a GPT4-powered ranking system, motivated by typical human annotation rankings, based on observable errors from Section 3.1 (see Appendix D for prompt details). We follow the approach used before to ensure reliability and alignment with human annotations (inter-annotator agreement: 0.784 Krippendorff’s alpha, GPT4 acc.: 92.1%).

5.5 Summary refinement discussion

Influence of feedback and transfer protocols on quality.

Table 4 shows the overall ranking and Likert scores of each refinement protocol variant. ORIG summaries are consistently ranked lowest, indicating that refinement positively influences quality, as observed in the assigned Likert scores.

Having only the essential part in the FP leads to minor improvements in ranking and Likert scores for both TPs compared to the ORIG summary, but falls behind the scores of most protocol variants using additional information. This indicates that pointing out errors on a high level already leads to quality improvement. The smaller gain is expected, as the minimalistic explanation may not contain every error instance, precise reasoning, or all information needed to resolve specific errors such as omission. Comparing the essential-only scores of both TPs reveals that the Likert scores and rankings differ notably between the two, with the Likert scores of the consolidated TP being $\sim$ 0.7 points lower. We derive from this observation that the provided feedback influences scores and leads to quality changes.

For the direct TP, CoT explanation and correction are ranked higher (avg. ranks $\sim$ 3.75) than the GPT-S summaries (avg. rank 4.84) and are close to GOLD summaries (avg. rank 4.04). CoT explanation and correction-based refinement outperform transcript-based refinements in overall ranking (avg. rank 4.68 to 5.10) but fall behind in Likert scores, which appears counter-intuitive. The ranking LLM’s reasoning reveals that transcript-based refinement contains repetitions, fails to separate topics, and lacks details, leading to an overall worse rating compared to CoT and correction. As the longer prompt when providing transcripts (avg. 20k tokens with transcript, 4k tokens without transcript) is still only a sixth of GPT4’s context size, we hypothesize that the additional task of cross-checking errors with the transcript may confuse the model due to content repetition and noise in the form of unnecessary details. CoT explanation and correction appear as a lean alternative containing relevant information for quality improvement.

CoT explanation and correction (avg. ranks $\sim$ 3.75) both outperform the combined use of the two (avg. ranking of 5.1 with and 4.11 without transcript). The analysis of the ranking model’s explanation shows that the repetition of content in CoT and correction can lead to multiple occurrences of the same information, while contradicting content may lead to the inclusion of wrong information (see an example in Figure 9).

For the consolidation TP, FPs without transcripts barely improve summary quality (avg. ranks range from 5.61 to 6.40). Transcript-using variants perform similarly to their direct TP counterparts but with rankings and scores closer together. Further, their scores are close to the GPT-R results. This indicates that the consolidated feedback has less influence on refinement than the direct feedback and that the refinement model relies more on the transcript to rewrite the summary. The refiner model’s reasoning reveals that the refinement approaches with consolidation TP and without transcript access often omit details and lack conciseness, which is also observable in the Likert scores (e.g., CON up to 0.47 points down). We conclude that the consolidated approach, effective in short news summarization Zhang et al. (2023a), does not perform well for meeting summarization, likely because the format compresses information about individual errors too much, making it hard for the refinement model to interpret when the total number of errors is large. This can happen especially with long meetings (input texts of 16k tokens) compared to news summarization with input texts of 200 tokens. We leave to future work the exploration of a similar planning setup that applies the described consolidation structure to each error-related feedback block rather than to the whole feedback.

We conclude that the feedback from the MIP containing CoT explanations already provides a strong foundation for improving ORIG summaries and bringing their quality close to that of a human summary. Correction suggestion is a promising alternative to CoT explanation as FP, with comparable quality ratings, which leaves room for further research on when to use which FP.

6 Final Considerations

In this paper, we investigated GPT4’s ability to find mistakes in a given meeting summary and refine the summary accordingly. We found that GPT4 achieves a high accuracy of $\sim$ 89% on average, measured against human labels, in identifying typical mistakes (e.g., repetition of content) when using a dedicated model instance paired with CoT prompting to identify individual errors. However, it struggles with error types that resemble one another or are subjective, such as hallucination (72% acc.), which it confuses with related errors such as omission, and irrelevance (81% acc.). We showed strong evidence that a dedicated LLM can refine a summary based on identified errors. By providing a CoT explanation for each error type containing reasoning why and where an error was observed, we significantly improve the relevance, informativeness, conciseness, and coherence of summaries. These refined summaries are comparable in quality with error-free gold summaries. Our post-hoc refinement approach can be applied to refine meeting summaries generated by traditional models and LLMs and marks an early entry into methods that unlock the full potential of LLMs for meeting summarization. We leave the development of more sophisticated refinement protocols, e.g., using multi-agent discussion, and the application of our multi-LLM approach to similar complex text generation tasks (e.g., story writing to reflect on a given setting) and real-world applications (e.g., assisting LLM agents in checking the outcome of a task) to future work. We release QMSum Mistake to encourage research on refinement.

Acknowledgements

This work was supported by the Lower Saxony Ministry of Science and Culture and the VW Foundation. Frederic Kirstein was supported by the Mercedes-Benz AG Research and Development.

Potential Impact

The multi-LLM approach proposed here, influenced by psychological observations on productivity and collaboration, exemplifies how other academic fields can inform NLP research Wahle et al. (2023b). This work demonstrates the potential for enhancing complex text generation tasks requiring robust output such as machine translation Feng et al. (2024), reasoning Kalyanpur et al. (2024), question answering Kim et al. (2024), or paraphrasing Becker et al. (2023); Wahle et al. (2023a), that may benefit from an output-challenging system that assesses content alignment. By incorporating multi-LLM strategies and personalization, we open new avenues for improving NLP outputs across various applications, underscoring the value of interdisciplinary approaches in advancing NLP technologies and their real-world applicability.

Limitations

Although our proposed QMSum Mistake might seem small (i.e., 200 samples), its size is comparable to the original QMSum dataset (i.e., 232 samples). We contribute to extending the original dataset with careful human error annotations for almost all examples available. Another possible limitation of our work is the use of only GPT4 in our main experiments. We chose GPT4 because of its large context size (e.g., 128k tokens) and better initial results in identifying errors. Having humans evaluate error annotation and refinement for multiple models would be time-consuming and financially infeasible. However, we report detailed results in Appendix B to provide insights into other language model families and different models (e.g., Phi Abdin et al. (2024), Gemini Team et al. (2024)) considered in our study. We evaluate their performance on mistake identification and quality changes when refining a summary.

Ethics Statement and Broader Impact

Our research abides by ethical guidelines for AI research and is committed to privacy, confidentiality, and intellectual property rights. We’ve ensured that the datasets in our study, publicly available, do not house sensitive or personal details. While our study leverages existing resources and generative models, it’s important to note that these models can possess biases and may occasionally generate summaries with distortions, biases, or inappropriate content. To counteract this, we’ve configured our models to omit potentially harmful or unsafe content. While our research aims to enhance meeting summarization to benefit communication and productivity across sectors, we’re acutely aware of the ethical challenges posed by AI in this domain. Meeting summarization models must be wielded with respect to privacy and consent, especially when processing sensitive or confidential material. It’s paramount that these models neither violate privacy nor perpetuate harmful biases. As the field evolves, we stress the importance of maintaining these ethical considerations and encourage fellow researchers to uphold them, ensuring that AI advancements in meeting summarization are both beneficial and ethically grounded. An integral aspect of our ethical commitment is reflected in our approach to annotator recruitment and management. The team of annotators, consisting of interns, student assistants, and doctoral students, was meticulously selected through internal channels. This strategy was chosen to uphold a high standard of annotation quality—a quality we found challenging to guarantee through external platforms such as Amazon Mechanical Turk. Ensuring fair compensation, these annotators were remunerated in accordance with institutional guidelines for their respective positions. Further, flexibility in the annotation process was also a priority. Annotators had the freedom to choose their working times and environments to prevent fatigue from affecting their judgment.

References

Abdin et al. (2024) Marah Abdin, Sam Ade Jacobs, and Ammar Ahmad Awan. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. Preprint, arxiv:2404.14219.
Asi et al. (2022) Abedelkadir Asi, Song Wang, Roy Eisenstadt, Dean Geckt, Yarin Kuper, Yi Mao, and Royi Ronen. 2022. An End-to-End Dialogue Summarization System for Sales Calls. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, pages 45–53, Hybrid: Seattle, Washington + Online. Association for Computational Linguistics.
Becker et al. (2023) Jonas Becker, Jan Philip Wahle, Terry Ruas, and Bela Gipp. 2023. Paraphrase Detection: Human vs. Machine Content. Preprint, arXiv:2303.13989.
Beltagy et al. (2020) Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. https://arxiv.org/abs/2004.05150v2.
Carletta et al. (2005) Jean Carletta, Wessel Kraaij, Simone Ashby, Sebastien Bourban, Michael Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, M. Kronenthal, G. Lathoud, Michael Lincoln, A. Lisowska, W. Post, D. Reidsma, P. Wellner, and L. McCowan. 2005. The AMI Meeting Corpus. Proceedings of Symposium on Annotating and Measuring Meeting Behavior.
Chang et al. (2024) Yapei Chang, Kyle Lo, Tanya Goyal, and Mohit Iyyer. 2024. BooookScore: A systematic exploration of book-length summarization in the era of LLMs. Preprint, arxiv:2310.00785.
Chauhan et al. (2022) Vipul Chauhan, Prasenjeet Roy, Lipika Dey, and Tushar Goel. 2022. TCS_WITM_2022 @ DialogSum : Topic oriented Summarization using Transformer based Encoder Decoder Model. In Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges, pages 104–109, Waterville, Maine, USA and virtual meeting. Association for Computational Linguistics.
Feng et al. (2024) Zhaopeng Feng, Yan Zhang, Hao Li, Bei Wu, Jiayu Liao, Wenqiang Liu, Jun Lang, Yang Feng, Jian Wu, and Zuozhu Liu. 2024. TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement. Preprint, arXiv:2402.16379.
Hua et al. (2023) Yilun Hua, Zhaoyuan Deng, and Kathleen McKeown. 2023. Improving Long Dialogue Summarization with Semantic Graph Representation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13851–13883, Toronto, Canada. Association for Computational Linguistics.
Huang et al. (2024) Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. Large Language Models Cannot Self-Correct Reasoning Yet. Preprint, arxiv:2310.01798.
Janin et al. (2003) A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters. 2003. The ICSI Meeting Corpus. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03)., volume 1, pages I–I.
Jiang et al. (2024) Dongwei Jiang, Jingyu Zhang, Orion Weller, Nathaniel Weir, Benjamin Van Durme, and Daniel Khashabi. 2024. SELF-[IN]CORRECT: LLMs Struggle with Refining Self-Generated Responses. Preprint, arxiv:2404.04298.
Kalyanpur et al. (2024) Aditya Kalyanpur, Kailash Saravanakumar, Victor Barres, Jennifer Chu-Carroll, David Melville, and David Ferrucci. 2024. LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic. Preprint, arXiv:2406.17663.
Khalifa et al. (2021) Muhammad Khalifa, Miguel Ballesteros, and Kathleen McKeown. 2021. A Bag of Tricks for Dialogue Summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8014–8022, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Kim et al. (2023) Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. Language Models can Solve Computer Tasks. Preprint, arxiv:2303.17491.
Kim et al. (2024) Jaehyung Kim, Dongyoung Kim, and Yiming Yang. 2024. Learning to Correct for QA Reasoning with Black-box LLMs. Preprint, arXiv:2406.18695.
Kirstein et al. (2024a) Frederic Kirstein, Jan Philip Wahle, Bela Gipp, and Terry Ruas. 2024a. CADS: A Systematic Literature Review on the Challenges of Abstractive Dialogue Summarization. Preprint, arxiv:2406.07494.
Kirstein et al. (2024b) Frederic Kirstein, Jan Philip Wahle, Terry Ruas, and Bela Gipp. 2024b. What’s under the hood: Investigating Automatic Metrics on Meeting Summarization. https://arxiv.org/abs/2404.11124v1.
Krippendorff (1970) Klaus Krippendorff. 1970. Bivariate Agreement Coefficients for Reliability of Data. Sociological Methodology, 2:139–150.
Laskar et al. (2023) Md Tahmid Rahman Laskar, Xue-Yong Fu, Cheng Chen, and Shashi Bhushan Tn. 2023. Building Real-World Meeting Summarization Systems using Large Language Models: A Practical Perspective. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 343–352, Singapore. Association for Computational Linguistics.
Lee et al. (2021a) Kahyun Lee, Mehmet Kayaalp, Sam Henry, and Özlem Uzuner. 2021a. A Context-Enhanced De-identification System. ACM Transactions on Computing for Healthcare, 3(1):6:1–6:14.
Lee et al. (2021b) Seolhwa Lee, Kisu Yang, Chanjun Park, João Sedoc, and Heuiseok Lim. 2021b. Who Speaks Like a Style of Vitamin: Towards Syntax-Aware Dialogue Summarization Using Multi-Task Learning. IEEE Access, 9:168889–168898.
Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
Li et al. (2024) Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, and Chongyang Tao. 2024. Leveraging Large Language Models for NLG Evaluation: A Survey. Preprint, arxiv:2401.07103.
Lin (2004) Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Liu et al. (2023) Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, and Qi Zhang. 2023. Calibrating LLM-Based Evaluator. Preprint, arxiv:2309.13308.
Madaan et al. (2021) Aman Madaan, Niket Tandon, Dheeraj Rajagopal, Peter Clark, Yiming Yang, and Eduard Hovy. 2021. Think about it! Improving defeasible reasoning by first modeling the question scenario. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6291–6310, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Mousavi et al. (2023) Sajad Mousavi, Ricardo Luna Gutierrez, Desik Rengarajan, Vineet Gundecha, Ashwin Ramesh Babu, Avisek Naug, Antonio Guillen, and Soumyendu Sarkar. 2023. N-CRITICS: Self-Refinement of Large Language Models with Ensemble of Critics.
Naraki et al. (2022) Yuji Naraki, Tetsuya Sakai, and Yoshihiko Hayashi. 2022. Evaluating the Effects of Embedding with Speaker Identity Information in Dialogue Summarization. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 298–304, Marseille, France. European Language Resources Association.
Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. Preprint, arxiv:2203.02155.
Pan et al. (2023) Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. 2023. Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies. Preprint, arxiv:2308.03188.
Phang et al. (2022) Jason Phang, Yao Zhao, and Peter J. Liu. 2022. Investigating Efficiently Extending Transformers for Long Input Summarization. Preprint, arxiv:2208.04347.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):140:5485–140:5551.
Saunders et al. (2022) William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. 2022. Self-critiquing models for assisting human evaluators. Preprint, arxiv:2206.05802.
Shinn et al. (2023) Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. Preprint, arxiv:2303.11366.
Team et al. (2024) Gemini Team, Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, and Timothy Lillicrap. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. Preprint, arxiv:2403.05530.
Tyen et al. (2024) Gladys Tyen, Hassan Mansoor, Victor Cărbune, Peter Chen, and Tony Mak. 2024. LLMs cannot find reasoning errors, but can correct them given the error location. Preprint, arxiv:2311.08516.
Wahle et al. (2023a) Jan Philip Wahle, Bela Gipp, and Terry Ruas. 2023a. Paraphrase Types for Generation and Detection. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12148–12164, Singapore. Association for Computational Linguistics.
Wahle et al. (2023b) Jan Philip Wahle, Terry Ruas, Mohamed Abdalla, Bela Gipp, and Saif Mohammad. 2023b. We are Who We Cite: Bridges of Influence Between Natural Language Processing and Other Academic Fields. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12896–12913, Singapore. Association for Computational Linguistics.
Wang et al. (2022) Bin Wang, Chen Zhang, Yan Zhang, Yiming Chen, and Haizhou Li. 2022. Analyzing and Evaluating Faithfulness in Dialogue Summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4897–4908, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Preprint, arxiv:2201.11903.
Zhang et al. (2023a) Haopeng Zhang, Xiao Liu, and Jiawei Zhang. 2023a. SummIt: Iterative Text Summarization via ChatGPT. Preprint, arxiv:2305.14835.
Zhang et al. (2020a) Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of ICML’20, pages 11328–11339. JMLR.org.
Zhang et al. (2020b) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020b. BERTScore: Evaluating Text Generation with BERT. Preprint, arxiv:1904.09675.
Zhang et al. (2023b) Yusen Zhang, Yang Liu, Ziyi Yang, Yuwei Fang, Yulong Chen, Dragomir Radev, Chenguang Zhu, Michael Zeng, and Rui Zhang. 2023b. MACSum : Controllable Summarization with Mixed Attributes. Transactions of the Association for Computational Linguistics, 11:787–803.
Zhong et al. (2022) Ming Zhong, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2022. DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization. Preprint, arxiv:2109.02492.
Zhong et al. (2021) Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, and Dragomir Radev. 2021. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5905–5921, Online. Association for Computational Linguistics.

Appendix A Human Annotation

We adapt proven methodologies for a thorough human annotation Zhang et al. (2023b). Six graduate students (footnote 9: the origin of the funds and annotators will be disclosed later to avoid revealing the authors’ identity), aged 22 to 28, from diverse academic backgrounds (e.g., computer science, psychology, communication science), proficient in English and familiar with meeting summarization, participate as annotators. Each annotates 2-4 subsets of QMSum Mistake (35 samples per subset, each containing transcripts, gold summaries, and model-generated summaries), with at least three annotators per sample. The annotators identify errors by answering yes/no questions (e.g., "Does the summary contain repetition?") and provide reasoning for observed mistakes for quality assessment. To ensure reliable and consistent annotation, we implement several measures. Inter-annotator agreement is assessed using Krippendorff’s alpha Krippendorff (1970), achieving an average of 0.764 (see Table 5), indicating moderate to strong agreement on the assessment. A training course is run before the annotation task to train annotators, refine guidelines, and identify new error types. Annotators practice on QMSum summaries produced by the LED model, which are not used for the final QMSum Mistake dataset. During the actual annotation task, we add gold summaries that the annotators should be able to evaluate correctly. If they fail to do so, their annotation is rejected, and their understanding of the task is discussed. Regular review meetings are held to maintain consistency in quality, align understanding, and conduct ongoing quality control. An expert annotator is available to discuss complex issues during the annotation process.

Assessed Characteristic	Krippendorff’s $\alpha$
Omission (partial)	0.787
Omission (total)	0.834
Repetition	0.889
Incoherence	0.764
Coreference	0.719
Hallucination	0.764
Language	0.748
Structure	0.795
Irrelevance	0.719

Table 5: Inter-rater reliability for the human annotations, measured by Krippendorff’s alpha. Scores $\geq$ 0.667 mean moderate agreement, scores $\geq$ 0.8 mean strong agreement.
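To make the agreement measure concrete, the following is a minimal Python sketch of Krippendorff’s alpha for nominal labels such as the yes/no error annotations, computed as $\alpha = 1 - D_o / D_e$ from a coincidence matrix. The function name and the toy ratings are assumptions for illustration only; they are not the annotation data or the tooling used for Table 5.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels that the
    annotators assigned to one sample (missing ratings are omitted).
    """
    # Only samples with at least two ratings contribute pairable values.
    units = [u for u in units if len(u) >= 2]

    # Coincidence matrix: every ordered pair of ratings within a sample,
    # weighted by 1 / (number of ratings in that sample - 1).
    coincidences = Counter()
    for u in units:
        weight = 1.0 / (len(u) - 1)
        for a, b in permutations(u, 2):
            coincidences[(a, b)] += weight

    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (a, _b), w in coincidences.items():
        totals[a] += w
    n = sum(totals.values())

    # Observed and expected disagreement (both scaled by n; the factor cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)

    if expected == 0:
        # Only one label was ever used, so alpha is undefined;
        # this sketch returns 1.0 by convention (an assumption).
        return 1.0
    return 1.0 - observed / expected


# Toy example (invented): four samples, two annotators each.
toy_ratings = [["yes", "yes"], ["yes", "no"], ["no", "no"], ["no", "no"]]
alpha = krippendorff_alpha_nominal(toy_ratings)
print(round(alpha, 3))  # 0.533
```

With the thresholds from Table 5, this toy value would fall below the 0.667 mark for moderate agreement.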

Appendix B Exploring Additional Model Families and Setups

In this section, we task models from the Phi and Gemini families on the mistake identification and refinement tasks. Particularly, we consider Gemini Flash (Gemini) and the 3.4B parameter Phi-3 mini 128k (Phi). We chose these models because their context size is large enough to fit a meeting transcript without requiring major architecture adaptation and because they are available. We further opt for smaller model versions compared to GPT4 to analyze the performance differences. We perform the experiments on 25% of the erroneous QMSum Mistake samples to derive initial trends.

B.1 Mistake Identification with smaller models

Error	Gemini	Phi	GPT4
P-OM	87.5	87.5	87.5
T-OM	75.0	75.0	92.5
REP	35.0	32.5	90.0
INC	62.5	32.5	95.0
COR	15.0	7.5	92.5
HAL	57.5	57.5	57.5
LAN	35.0	35.0	72.5
STR	37.5	20.0	92.5
IRR	60.0	60.0	77.5

Table 6: Mistake finding accuracy of Gemini, Phi, GPT4 on a subset of QMSum Mistake.

Table 6 shows the accuracies of these models in identifying errors, all using the best MIP protocol identified in Section 4, which combines multiple model instances with CoT prompting. As expected, Gemini and Phi show weaker accuracy, which can mostly be attributed to their smaller model sizes. Notably, Phi struggles to report errors in the prompted output format, similar to how GPT4 struggles in the single-instance setup, while Gemini is closer in its answer pattern to what we observed for GPT4 in the single-instance setup. Phi and Gemini also show an oversensitivity to errors, as we hypothesize for GPT4 (Section 4.2). This oversensitivity is more pronounced for the smaller Phi model than for Gemini. It leads to identical accuracies for P-OM and HAL, as all models always report these errors as present. The models’ reasoning for the scores provides further support for this hypothesis. For example, Gemini reports the mention of participants’ names as an unnecessary repetition. We conclude that even though these models have a similar (Phi) or larger (Gemini) context size compared to GPT4, their significantly lower parameter counts hurt task understanding and contextualization. Further, the oversensitivity appears to be linked to a model’s understanding capabilities, which in the considered case is connected to the model size.
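The identical P-OM and HAL accuracies across models follow from how accuracy behaves for a detector that always reports an error: its accuracy equals the share of samples in which the annotators marked that error. A minimal Python sketch with invented labels illustrates this (the 40 samples and 23 positives are assumptions chosen only to reproduce the 57.5 reported for HAL; they are not taken from the dataset):

```python
def accuracy(predictions, labels):
    """Percentage of predictions that match the human labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Hypothetical human labels: True means "hallucination present".
human_labels = [True] * 23 + [False] * 17

# An oversensitive model that flags every summary as erroneous.
always_true = [True] * len(human_labels)

print(accuracy(always_true, human_labels))  # 57.5
```

Any two models that flag every sample obtain the same score on the same labels, regardless of their other capabilities.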

B.2 Refinement Performance with Smaller Models

Table 7 reports the quality of one-round refined summaries using Phi and GPT4 on the subset of QMSum Mistake. Note that Gemini is not reported here as the model consistently did not provide any refinements. Both models were prompted with the best-performing refinement protocol, i.e., multiple instances with CoT prompting for mistake identification, CoT explanation as feedback, and direct feedback as transfer protocol. We follow the evaluation approach in Section 5.4. We observe that even though Phi does not reliably detect errors, the exhaustive pointing out of possible error cases and the refinement step help to improve the quality, raising the Likert scores by 0.4 to 0.8 points. However, it is noteworthy that Phi sometimes struggles with refining a summary and instead elaborates on the given feedback. We therefore conclude that Phi is capable of refining a summary given a list of observed errors and reasoning for the observation, but the smallest model struggles with task understanding. Hence, with adaptations such as few-shot examples or by using Phi-3 small, Phi may be a cheap alternative to GPT4 for summary refinement.

	OVR $\downarrow$	REL $\uparrow$	INF $\uparrow$	CON $\uparrow$	COH $\uparrow$
GPT4	1.24	3.05	3.07	3.21	2.98
Phi	1.84	2.78	2.98	2.93	3.04
GOLD	1.43	3.08	3.05	3.53	3.21
ORIG	2.77	2.28	2.15	2.41	2.22

Table 7: Ranking and scoring of Phi and GPT4 according to their quality. OVR is the overall ranking, with lower scores indicating a more preferred summary. REL, INF, CON, and COH are relevance, informativeness, conciseness, and coherence. The scoring uses a 5-step Likert scale, with 1 being the worst and 5 the best.
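The per-criterion changes relative to the unrefined ORIG summaries can be recomputed directly from the Likert scores in Table 7. The short Python sketch below does this; the dictionary layout and the helper name `likert_gains` are illustrative assumptions, while the numbers are copied from the table.

```python
# Likert scores copied from Table 7 (REL, INF, CON, COH).
TABLE_7 = {
    "GPT4": {"REL": 3.05, "INF": 3.07, "CON": 3.21, "COH": 2.98},
    "Phi": {"REL": 2.78, "INF": 2.98, "CON": 2.93, "COH": 3.04},
    "GOLD": {"REL": 3.08, "INF": 3.05, "CON": 3.53, "COH": 3.21},
    "ORIG": {"REL": 2.28, "INF": 2.15, "CON": 2.41, "COH": 2.22},
}


def likert_gains(system, baseline="ORIG"):
    """Score difference per criterion between a system and a baseline."""
    return {
        criterion: round(TABLE_7[system][criterion] - TABLE_7[baseline][criterion], 2)
        for criterion in TABLE_7[system]
    }


print(likert_gains("Phi"))   # {'REL': 0.5, 'INF': 0.83, 'CON': 0.52, 'COH': 0.82}
print(likert_gains("GPT4"))  # {'REL': 0.77, 'INF': 0.92, 'CON': 0.8, 'COH': 0.76}
```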

B.3 Multiple rounds

So far, we have explored the application of the refinement concept in a single round, with one pass of mistake identification and summary refinement. In the following, we explore how the refinement quality changes when GPT4 can reconsider the generated summary for 10 rounds. We keep the best-performing setup (multi-instance with CoT prompting for MIP, CoT explanation FP, direct feedback TP) and use the small subset of QMSum Mistake. We report the ranking of the different summaries in Figure 3, observing that while the one-round performance is strong enough to improve a given summary to a quality level comparable to a human summary, it can be further improved. From the ranking model’s reasoning, we observe that this improvement mainly involves reducing remaining omission errors and fitting the summary better to the comprehensiveness GPT4 asks for. Notably, we observe instances of strong degradation, e.g., in round 6, which follows a previous trend of reduced quality. We derive from this that while there may be potential to further improve summaries by applying the refinement protocol multiple times, the gains may quickly saturate, and unwanted errors are introduced. From the ranking model’s explanation, we observe that this correlates with an increase in repetition and hallucination. We conclude that multiple rounds of refinement can potentially further improve summaries, but this requires dedicated research.
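The multi-round setup can be sketched as a simple loop that alternates mistake identification and refinement. The Python sketch below is only an illustration under stated assumptions: `identify` and `refine` are hypothetical stand-ins for the LLM calls, the early stop when no feedback is returned is our own addition, and the toy functions in the usage part are invented string operations rather than model behavior.

```python
from typing import Callable, List


def refine_for_rounds(
    summary: str,
    transcript: str,
    identify: Callable[[str, str], List[str]],
    refine: Callable[[str, str, List[str]], str],
    max_rounds: int = 10,
) -> List[str]:
    """Run up to max_rounds of identify-then-refine.

    Returns every intermediate summary so that all rounds can be ranked
    afterwards, since later rounds may degrade quality.
    """
    history = [summary]
    for _ in range(max_rounds):
        feedback = identify(history[-1], transcript)
        if not feedback:
            # Assumption: stop early when no mistakes are reported.
            break
        history.append(refine(history[-1], transcript, feedback))
    return history


# Toy stand-ins (hypothetical): flag and remove one repeated word per round.
def toy_identify(summary: str, transcript: str) -> List[str]:
    return ["repetition"] if "again" in summary else []


def toy_refine(summary: str, transcript: str, feedback: List[str]) -> str:
    return summary.replace(" again", "", 1)


history = refine_for_rounds("We met again again.", "", toy_identify, toy_refine)
print(history)  # ['We met again again.', 'We met again.', 'We met.']
```

Keeping the full history rather than only the final summary allows selecting the best-ranked round afterwards.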

Appendix C QMSum Mistake: Varying Summarization Styles and Quality Levels of Models

TP	FP	BS	R-1	R-2	RLS
dir	essential	16.20	33.73	07.46	20.53
dir	CoT	16.16	33.89	07.57	20.41
dir	Cor	16.19	33.89	07.52	20.39
dir	CoT+Cor	16.35	33.90	07.56	20.58
dir	Tra	15.28	33.89	07.82	20.99
dir	Tra+CoT+Cor	15.12	33.78	07.94	21.31
con	essential	14.27	29.79	05.58	18.26
con	CoT	14.28	29.36	05.43	18.12
con	Cor	15.11	29.64	05.55	18.37
con	CoT+Cor	15.15	29.71	05.71	18.13
con	Tra	14.96	29.90	05.55	18.55
con	Tra+CoT+Cor	14.98	30.07	05.76	18.47

Table 8: Scores of the established evaluation metrics BERTScore (BS) and ROUGE (R-1 = ROUGE 1, R-2 = ROUGE 2, RLS = ROUGE LSum).

We show one example of QMSum Mistake for each used language model in Table 9.

Architecture	Language Model	Summary	Errors
enc-dec	LED	The group discussed the components design of the remote control. Firstly, they discussed the shape of the buttons, the size of the screen, the colour of the rubber case, and the material. Then, they talked about the material design. After that, the group discussed how the remote should look like a fruit and vegetable. Finally, the team discussed the design of a touch-based graphical interface system. The team decided to use the wheely material instead of a rubber case for the LCD screen. Then the group talked about how the material should be used for the remote, and how it should be easy to switch to.	REP, LAN, INC, P-OM, T-OM, HAL
	DialogLED	This was the kick-off meeting for the new remote control project. Project Manager started the meeting by introducing the team members to each other. Industrial Designer introduced the group as the project manager. Marketing and User Interface gave a brief about the project and the design of the remote control. Then Industrial Designer gave the brief about their favourite characteristics of their favourite animals. Then the group got to draw their favourite animal and sum up their favorite characteristics of it. Industrial designer gave the final product and the team got to choose the favourite animal of the group.	REP, LAN, INC, P-OM, T-OM, COR, HAL, STR
	PEGASUS-X	Project Manager introduced some new project requirements that came in, and then Industrial Designer discussed the working design of the remote control. User Interface and Industrial Designer talked about the buttons and what it does and that sort of thing. Project Manager discussed the target age group and how it’s gonna appeal to their target group. Marketing and User Interface discussed the functions of the circle disc, the touch pad, and the volume and channel changers. Industrial Designer said that it would be easy to implement and that it’d be compatible with D-V-D-type remotes.	LAN, P-OM, T-OM
autoreg	GPT-3.5	The conversation is about a team discussing the design of a new remote control for a television. The project manager mentions that the remote control should be original, trendy, and user-friendly. The team discusses different ideas for the remote control, including using speech recognition for navigation and having a light adaptation system. They also discuss the importance of optimizing the number of buttons and the size of the remote control. The industrial designer suggests gathering data from programs or content and navigating through them using the remote control. The team agrees to discuss technical points and gather more information in the next meeting. They also mention the need to assign roles and responsibilities for different aspects of the design. The conversation ends with the project manager suggesting a follow-up meeting in five minutes to discuss other aspects of the design.	P-OM, T-OM, HAL, IRR
	Phi-3 mini	The group discussed the progress on digits, specifically the almost completion of the TI-digits project and the plan to train a recognizer with the data. They also talked about the possibility of having transcribers manually input the digits from a meeting recording, which would result in a more complete transcript. There was a debate on whether this task should be done by the transcribers or automated. Additionally, they touched on the quality of the current models and the potential for improvement, as well as the differences in pronunciation when reading digits compared to conversational speech. They considered using Switchboard data for further research and discussed the need for a more comprehensive annotation system that includes articulatory features.	P-OM, T-OM, HAL

Table 9: Samples of the QMSum Mistake dataset, one for each used language model. In the architecture column, enc-dec means encoder-decoder and autoreg stands for autoregressive. The errors column presents the human-annotated errors for each summary.

Appendix D Prompts

In the following, we present the prompts used to identify mistakes (Figure 4), to consolidate feedback (Figure 5), and to refine a summary (Figure 6). Figure 7 shows a few-shot example of P-OM. Figure 8 provides the prompt template for LLM-based ranking.

Figure 4: MIP prompt template in the format for multi-instance usage. In the single-instance setup, the definition and example blocks are repeated for every error type.

Figure 5: Prompt template used to consolidate feedback for the consolidation TP. The model is tasked to extract from the exhaustive feedback what the refinement model should consider for editing.

Figure 6: The summary-refining sub-prompt.

Figure 7: A few-shot example as it is shown to the model in the MIP prompt (Figure 4). This few-shot example covers a major P-OM case.

Figure 8: The template prompt for ranking summaries according to their performance on the errors described in Section 3.1.

Appendix E Additional Content on Summary Refinement

E.1 Established Metrics’ scores

Table 8 reports the BERTScore Zhang et al. (2020b) (re-weighted) and ROUGE Lin (2004) scores for different combinations of FP and TP. Note that the scores are very close to each other, with only slight variation, which does not allow for a thorough analysis.
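For readers unfamiliar with the overlap metrics in Table 8, the following Python sketch shows the idea behind ROUGE-1 as a unigram-overlap F1 score. It is a simplified illustration under stated assumptions (whitespace tokenization, lowercasing, no stemming), not the ROUGE package used for the reported scores, and the example sentences are invented.

```python
from collections import Counter


def rouge_1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over clipped unigram overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as in the reference.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Invented example: four of five unigrams overlap in both directions.
score = rouge_1_f1(
    "the team discussed the remote",
    "the team discussed the budget",
)
print(round(score, 2))  # 0.8
```

Because such metrics only count surface overlap with the reference, two refinements that fix different errors can receive nearly identical scores.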

E.2 Correction and CoT are contradictory

Figure 9 demonstrates a case of contradicting information in CoT explanation and correction suggestion.

Figure 9: Confusion between CoT content and Correction suggestion.