OpenDebateEvidence: A Massive-Scale Argument Mining and Summarization Dataset

Allen Roush
Wand.ai
[email protected]
&Yusuf Shabazz*
Howard Community College
[email protected]
&Arvind Balaji
Texas AM University
[email protected]
&Peter Zhang
UC Berkeley
[email protected]
&Stefano Mezza
University of New South Wales
[email protected]
&Markus Zhang
Stanford University
[email protected]
&Sanjay Basu
Oracle Corporation
[email protected]
&Sriram Vishwanath
University of Texas
[email protected]
&Mehdi Fatemi
Wand.ai
[email protected]
&Ravid Shwartz Ziv
Wand.ai
New York University
[email protected]
Denotes equal contribution
Abstract

We introduce OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence. OpenDebateEvidence captures the complexity of arguments in high school and college debates, providing valuable resources for training and evaluation. Our extensive experiments demonstrate the efficacy of fine-tuning state-of-the-art large language models for argumentative abstractive summarization across various methods, models, and datasets. By providing this comprehensive resource, we aim to advance computational argumentation and support practical applications for debaters, educators, and researchers. OpenDebateEvidence is publicly available to support further research and innovation in computational argumentation. Access it here: https://huggingface.co/datasets/Yusuf5/OpenCaselist.

1 Introduction

Argument mining plays a pivotal role in developing advanced language models (LLMs) capable of sophisticated reasoning and understanding. Engaging with complex argumentative texts enhances LLMs’ abilities to comprehend, generate, and evaluate arguments. This improves their performance in applications such as legal document analysis, educational tools, and more.

Existing argument mining datasets, such as DebateSum introduced by Roush & Balaji (2020), are limited in scope. DebateSum, with 240,566 examples, primarily focuses on pre-season evidence from summer camps, excluding the rich argumentative structures in regular-season debates. This limitation affects dataset size, representativeness, and utility for large-scale argument mining.

To address these gaps, we introduce OpenDebateEvidence, a large-scale dataset for argument mining and summarization sourced from the OpenCaseList project (Hardy, 2024). This dataset comprises 3.5 million documents, making it the most extensive collection of debate evidence available. It captures the full spectrum of arguments presented throughout the debate season. OpenDebateEvidence’s comprehensive nature, with its detailed metadata, makes it highly valuable for training language models.

In this paper, we provide an in-depth overview of OpenDebateEvidence, detailing our data collection and preprocessing methods. We demonstrate that training LLMs on OpenDebateEvidence significantly improves their performance not only on this dataset but also on other related argumentative datasets. We conducted extensive evaluation experiments using state-of-the-art language models: LLaMA3-8B 111https://huggingface.co/meta-llama/Meta-Llama-3-8B and Mistral-7B222https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2. These models were fine-tuned using advanced techniques such as Low-Rank Adaptation (LoRA) (Hu et al., 2021), Representation Fine-Tuning (ReFT) (Wu et al., 2024), and Orthogonalization (Arditi et al., 2023). The results show substantial improvements in model performance compared to those trained on previous argument mining datasets. This underlines OpenDebateEvidence’s effectiveness in enhancing argument-mining capabilities.

Our contributions are:

  1. 1.

    We introduce OpenDebateEvidence, the largest and most comprehensive dataset for argument mining and summarization, encompassing 3.5 million documents with detailed metadata.

  2. 2.

    We provide rich metadata that facilitates various NLP tasks and applications, enhancing the dataset’s utility for researchers and practitioners.

  3. 3.

    We demonstrate significant performance improvements of state-of-the-art language models not only on OpenDebateEvidence but also on other related argumentative datasets through extensive fine-tuning experiments.

  4. 4.

    We evaluate the dataset’s effectiveness in different scenarios and methods, including various fine-tuning techniques such as Low-Rank Adaptation (LoRA), Representation Fine-Tuning (ReFT), and Orthogonalization, showcasing substantial gains in model performance.

  5. 5.

    We discuss the practical applications of OpenDebateEvidence in various fields such as legal document analysis, educational tools, and AI model development, highlighting its real-world relevance and impact.

Our experiments highlight that training on OpenDebateEvidence not only enhances model performance on this dataset but also significantly improves results on other related argumentative datasets. This underscores the dataset’s superiority and its potential to drive advancements in computational argumentation research.

2 Background and Related Work

Competitive debate in the United States encompasses several prominent styles, each with unique formats, rules, and emphasis. The three most notable styles are Policy Debate, Lincoln-Douglas Debate, and Public Forum Debate, popular at both high school and collegiate levels. While sharing structural similarities, these debate formats differ in focus, speech times, and the importance placed on evidence. OpenDebateEvidence includes evidence from all of these formats.

2.1 Policy Debate

The High School and College Policy Debate, also known as the "cross-examination debate" (CX), involves two teams of students arguing for and against a specific policy proposal based on an annually changing broad resolution. Each debate round lasts about 90 minutes, comprising eight speeches (four by each team and two by each speaker) and cross-examination periods. The structure includes constructive speeches followed by refutations, with cross-examination periods allowing debaters to clarify arguments or challenge assumptions. New arguments are restricted to constructive speeches.

During a debate, teams present evidence from various sources to support their arguments. This evidence is usually in the form of written "cards" 111Policy Debaters used to literally cut their evidence out of magazines and glue it onto physical cards; while this has fallen out of fashion, the name stuck., such as research publications, academic articles, news reports, or government documents. Figure 1 shows an example of a "card." The quality and quantity of evidence used in a debate round often determine the winner. Policy Debate is unique among competitive debate styles in that the quality of the speech act is secondary222Leading to a peculiar phenomenon known as ”speed reading” or ”spreading” to be normalized: https://en.wikipedia.org/wiki/Spreading_(debate) compared to the quality, quantity, and factuality of the evidence.

Refer to caption
Figure 1: An example of a piece of debate evidence, colloquially known as a "card," from OpenDebateEvidence before parsing. Lines 1 and 2 are the hat and the pocket, used for organizing the evidence by argument and speech. Lines 3-4 are the "tag," a biased abstractive summary of the document. The beginning of line 5 shows the author and the year. The rest of lines 5-8 provide the evidence’s citation. The remainder of the document is the evidence itself. Underlined, bolded, or boxed parts are crucial for the argument, and highlighted sections are read aloud during the speech. These elements form various hierarchical levels of biased token-level extractive summaries.

2.2 Lincoln Douglas Debate

Lincoln-Douglas Debate (LD), a one-on-one format with a bimonthly topic, originated from the historic debates between Abraham Lincoln and Stephen Douglas. Popular in high school and college competitions, LD debates share structural similarities with Policy Debate but feature shorter speech times and cross-examination periods. LD debates emphasize ethical and moral reasoning, focusing more on philosophical arguments rather than policy implications. However, they still prioritize the quality and quantity of evidence presented.

2.3 Public Forum Debate

Public Forum Debate is a two-on-two format debating a monthly topic designed to be accessible to a broader audience. Compared to Policy and LD debates, Public Forum rounds have shorter speaking times and place less emphasis on evidence. Public Forum Debate constitutes a smaller portion of the evidence in OpenDebateEvidence and was not included in DebateSum.

2.4 Existing Datasets and Research

Significant prior work in argument mining has focused on competitive formal debate. IBM’s Project Debater has been a leading effort, publishing extensively on argument detection (Ein-Dor et al., 2019), argument quality (Gleize et al., 2019), key point analysis/summarization (Bar-Haim et al., 2020; Magnusson & Friedman, 2021), and autonomous debating systems (Slonim et al., 2021). However, their work does not focus on the real-world competitive debate evidence found in our dataset.

Other notable contributions include VivesDebate, a multilingual audio dataset of debate tournaments (Ruiz-Dolz & Iranzo-Sánchez, 2024); ArgAnalysis35K, focusing on single argument analysis pairs in evidence-free parliamentary debate (Joshi et al., 2023); IAM (Integrated Argument Mining), a highly annotated dataset for integrated argument mining tasks with only 1,00010001{,}0001 , 000 articles (Cheng et al., 2022); and DebateSum, a dataset with 240,566240566240{,}566240 , 566 examples focusing on pre-season debate evidence (Roush & Balaji, 2020). Additionally, several legal summarization datasets have been developed, including ArgLegalSumm (Elaraby & Litman, 2022), Multi-LexSum (Shen et al., 2022), and datasets targeting Indian and British case law (Shukla et al., 2022), which together total fewer than 10,0001000010{,}00010 , 000 examples.

Other resources include logos.app Community (2024c), debate.cards (Community, 2024b), and contention.ai (Community, 2024a), which index various debate evidence and generate new evidence from web searches. Datasets targeting biased or query-focused summarization include QBSUM, a Chinese dataset with 49,0004900049{,}00049 , 000 samples (Zhao et al., 2021); QMSum, which studies meeting summarization with 1,80818081{,}8081 , 808 samples (Zhong et al., 2021); and LMGQS, a dataset with over 1 million documents converted to query-focused summarization (Xu et al., 2023). In contrast, our dataset is fully human-created and human-annotated by active debate competitors.

Compared to these datasets, OpenDebateEvidence offers a significantly larger scale and scope, with over 3.5 million documents enriched with detailed metadata.

3 OpenDebateEvidence Dataset

3.1 Data Collection

OpenDebateEvidence is sourced from the OpenCaseList project (Hardy, 2024), an online platform where high school and college debate teams disclose and open-source their evidence. The dataset contains over 3.5 million documents, covering all NSDA debate topics from 2012 to 2023.333A list of these topics can be found here. Each document corresponds to a single piece of evidence used in a debate, categorized by debate format (Policy, LD, Public Forum), and includes comprehensive metadata such as author, date, title, source, citation details, and the debate round in which it was used444An example of some of these downloads can be found here.

The dataset also includes standardized tags to describe the type of argument made by the document, such as topicality, disadvantages, advantages, and counter plans, along with details of the structure and location in the debate file from which the document was extracted. To protect privacy, identifying information has been anonymized.

3.2 Data Preprocessing

Debate evidence is stored in the .docx file format, requiring a specialized parsing process to extract relevant information. The parsing pipeline begins by unzipping the .docx file to access the internal XML files. Ensuring accurate preprocessing is paramount for maintaining dataset quality. This process involves detailed steps to preserve the integrity and consistency of the data, including tokenization, simplification, and structuring of text blocks, followed by extracting and organizing individual debate cards into a structured format that captures both metadata and content.

The XML files are parsed to extract formatting details such as underlining, bold, and highlighting. Next, the document undergoes tokenization, creating a structured representation with text blocks representing paragraphs or coherent units of text along with their formatting information. A simplification step removes unnecessary formatting and merges adjacent tokens with similar styling.

To extract individual debate cards, the parsing procedure identifies card boundaries based on formatting and structure, extracting components such as the tag, citation, and body text. This information is organized into a structured format that captures the metadata and content of each debate card. Finally, the parsed dataset is converted back into a cleaned Hugging Face dataset, providing a human-readable version of the dataset. This structured dataset serves as the foundation for further natural language processing tasks.

3.3 Data Deduplication

After parsing the dataset and extracting individual cards, identifying and removing duplicates is essential to ensure data quality. Deduplication involves comparing the textual content of each card to identify those sharing significant portions of their text. This process enhances dataset usability by eliminating redundancy, ensuring each unique argument is represented only once.

The deduplication algorithm splits each card’s text into sentences. These sentences are then preprocessed by removing non-letter characters and converting them to lowercase. Short sentences below a certain length are filtered out to focus on meaningful content.

The algorithm retrieves and compares card IDs with a significant number of shared sentences. If the number of matching sentences exceeds a predefined threshold and their positions within the cards are within a certain range, the cards are considered duplicates. Duplicate clusters are formed by identifying all cards connected through shared sentences. A representative card is then selected from each cluster based on factors such as sentence count and content quality, and duplicates are removed iteratively.

3.4 Data Statistics

The OpenDebateEvidence dataset offers a comprehensive collection of over 3.5 million documents categorized by debate format (Policy, Lincoln-Douglas, and Public Forum). Each document is enriched with extensive metadata, including author, date, title, source, citation details, and debate round information. Standardized tags describe the type of argument, such as topicality, disadvantages, advantages, and counterplans.

Policy Debate evidence constitutes approximately two-thirds of the dataset, Lincoln-Douglas Debate evidence comprises about one-third, and Public Forum Debate evidence makes up a smaller percentage. Spanning topics from 2012 to 2023, the dataset represents over 1,400 schools and includes contributions from more than 3,200 authors.

Key statistics of the dataset are provided in Table 1, and more detailed statistics and information can be found in Appendix F.

Table 1: Key Statistics of OpenDebateEvidence Dataset
Feature Count
Total Documents 3,571,098
Policy Debate Evidence 2,380,600
Lincoln-Douglas Debate Evidence 1,164,132
Public Forum Debate Evidence 26,366
Years Covered 2012-2023
Average Document Length (characters) 3,556
Total Schools Represented 1,423
Unique Authors 3,217
Unique Topics 68

3.5 Rich Metadata for Argument Structure

Each evidence document is organized with a “hat,” “pocket,” and “tag” to represent its role within a debate case.

The “pocket” indicates the top-level speech section the evidence supports, such as “1NC” for the first negative constructive speech. The “hat” denotes the broad argument category, like “Oil Disadvantage,” which aligns with a structured argument against an affirmative case. The “tag” provides a concise, biased summary of the specific argument made by the evidence. Debaters often create these tags first and then find the evidence that fits the tag.

This metadata encodes the rhetorical structure and purpose of the evidence in a practical and real-world context. The “hat” and “pocket” provide the argument’s context, while the “tag” offers a concise summary of the core claim.

For argument mining, this metadata offers valuable semantic annotations for training models on argument components and relations. “Hats” and “pockets” help models learn the overarching structure, while “tags” summarize key points.

For summarization, the hierarchical metadata enables multi-level summaries: “pockets” for high-level overviews, “hats” for key categories, and “tags” for concise core claims. The biased nature of “tags” illustrates how debaters rhetorically summarize their claims and arguments. OpenDebateEvidence is particularly rich as it includes both hierarchical biased abstractive and token-level extractive summaries.

4 Experiments

To evaluate the efficacy of the OpenDebateEvidence dataset for argument mining and summarization, we conducted a series of fine-tuning experiments using state-of-the-art language models. We also evaluated the performance of these models on two related datasets.

4.1 Experimental Setup

We employed three recent fine-tuning techniques for adapting our models to OpenDebateEvidence: Low-Rank Adaptation (LoRA) (Hu et al., 2021), Representation Fine-Tuning (ReFT) (Wu et al., 2024), and Orthogonalization (Arditi et al., 2023). These methods are chosen for their parameter efficiency and ability to prevent catastrophic forgetting. The details of these techniques are provided in Appendix D.

We perform our experiments on three datasets: OpenDebateEvidence, DebateSum, which is also a dataset of Policy Debate Evidence, and the billsum dataset from Kornilova & Eidelman (2019), a dataset of US legislation and summaries, to illustrate our fine-tuned models capabilities at performing argumentative summarization in many contexts.

We conducted two types of experiments: traditional NLP evaluation metrics and using GPT-4o as a judge model. All experiments were conducted on a 4xA100 machine from Microsoft Azure with parallelism, attention optimization, and 16-bit quantization enabled. All decoding/sampling settings were kept default. The seed value of "42" was used wherever possible.

4.1.1 Traditional NLP Metrics

For the traditional NLP metrics, we evaluated the models on validation datasets of the whole BillSum dataset and 10,0001000010{,}00010 , 000 examples from OpenDebateEvidence. Each model was tasked with generating a short "abstract" summarizing the key arguments made in each document. We computed ROUGE F1 scores between the generated text and the ground-truth "tag" provided in the OpenDebateEvidence metadata and the reference summaries in BillSum. For more details see Appendix E. Additionally, we evaluated each language model’s perplexity on the sampled subsets to assess how well the models captured the overall distribution of debate and legislative language.

4.1.2 LLM as Judge

In the "LLM as Judge" experiments, we evaluated the quality of the generated abstracts using GPT-4o as the judge. Each model’s output was assessed on two criteria: the quality of the output and the quality of supporting the argument, both rated on a scale from 1 to 10. The evaluation was conducted on 1,00010001{,}0001 , 000 results from both datasets. This approach allows us to measure not only the linguistic quality of the summaries but also their effectiveness in supporting the arguments. For more details see Section E.2.2 .

4.2 Results

The results of our experiments are shown in Tables 2, 3 and 4. Both the Mistral-7B and LLaMA3-8B models achieved promising results, with the LLaMA3-8B models generally outperforming the Mistral-7B models across all ROUGE metrics. Notably, fine-tuning on a larger subset of the dataset (1,000,00010000001{,}000{,}0001 , 000 , 000 examples) significantly improved performance, as evidenced by the LLaMA3-8B LoRA (1M Ex) model. This model achieved the highest scores across traditional NLP metrics and the LLM as Judge evaluations, demonstrating that extensive fine-tuning on domain-specific data is crucial for optimizing model performance in argument mining tasks.

Our results demonstrate the significant impact of using the OpenDebateEvidence dataset on model performance. On all three datasets, our fine-tuned models showed substantial improvements over the baseline models. The LLaMA3-8B LoRA model, fine-tuned on 1,000,00010000001{,}000{,}0001 , 000 , 000 examples from OpenDebateEvidence, achieved the highest scores across all traditional NLP metrics and the LLM as Judge evaluations. This highlights the importance of extensive fine-tuning on domain-specific data.

4.2.1 OpenDebateEvidence Performance

For the OpenDebateEvidence dataset (Table 2), the LLaMA3-8B models generally outperformed the Mistral-7B models across all ROUGE metrics. Fine-tuning techniques, particularly LoRA with 1,000,00010000001{,}000{,}0001 , 000 , 000 examples, significantly enhanced the models’ ability to generate high-quality summaries that effectively support arguments. Notably, the LoRA technique improved performance by effectively adapting model parameters with minimal additional computational resources. ReFT showed strong performance, indicating its ability to modify hidden representations in targeted subspaces, improving summarization quality. Orthogonalization, while effective, showed relatively less improvement than LoRA and ReFT, likely due to its focus on controlling specific features in the residual stream.

Table 2: Performance on OpenDebateEvidence. ROUGE F1 scores and perplexity on 10,0001000010{,}00010 , 000 sampled documents, and LLM as Judge scores on 1,00010001{,}0001 , 000 results. Scores are averaged over three runs. R-1, R-2, and R-L denote ROUGE-1, ROUGE-2, and ROUGE-L respectively. Error bars represent one standard error over 3 trials.
Model R-1 R-2 R-L Perplexity Output Quality Support Quality
Mistral-7B
Base 27.8±0.3plus-or-minus27.80.327.8\pm 0.327.8 ± 0.3 8.2±0.5plus-or-minus8.20.58.2\pm 0.58.2 ± 0.5 24.5±0.8plus-or-minus24.50.824.5\pm 0.824.5 ± 0.8 150.2±5.1plus-or-minus150.25.1150.2\pm 5.1150.2 ± 5.1 7.5±0.2plus-or-minus7.50.27.5\pm 0.27.5 ± 0.2 7.3±0.2plus-or-minus7.30.27.3\pm 0.27.3 ± 0.2
LoRA 30.1±0.5plus-or-minus30.10.530.1\pm 0.530.1 ± 0.5 9.4±0.2plus-or-minus9.40.29.4\pm 0.29.4 ± 0.2 25.8±0.6plus-or-minus25.80.625.8\pm 0.625.8 ± 0.6 33.9±2.3plus-or-minus33.92.333.9\pm 2.333.9 ± 2.3 7.7±0.2plus-or-minus7.70.27.7\pm 0.27.7 ± 0.2 7.5±0.2plus-or-minus7.50.27.5\pm 0.27.5 ± 0.2
ReFT 29.9±0.2plus-or-minus29.90.229.9\pm 0.229.9 ± 0.2 9.3±0.1plus-or-minus9.30.19.3\pm 0.19.3 ± 0.1 25.6±0.6plus-or-minus25.60.625.6\pm 0.625.6 ± 0.6 50.3±3.4plus-or-minus50.33.450.3\pm 3.450.3 ± 3.4 7.6±0.3plus-or-minus7.60.37.6\pm 0.37.6 ± 0.3 7.4±0.3plus-or-minus7.40.37.4\pm 0.37.4 ± 0.3
Orthogonal 27.9±0.5plus-or-minus27.90.527.9\pm 0.527.9 ± 0.5 8.3±0.2plus-or-minus8.30.28.3\pm 0.28.3 ± 0.2 24.7±1.2plus-or-minus24.71.224.7\pm 1.224.7 ± 1.2 76.4±4.4plus-or-minus76.44.476.4\pm 4.476.4 ± 4.4 7.6±0.2plus-or-minus7.60.27.6\pm 0.27.6 ± 0.2 7.4±0.2plus-or-minus7.40.27.4\pm 0.27.4 ± 0.2
LLaMA3-8B
Base 25.4±0.6plus-or-minus25.40.625.4\pm 0.625.4 ± 0.6 7.6±0.2plus-or-minus7.60.27.6\pm 0.27.6 ± 0.2 22.8±1.3plus-or-minus22.81.322.8\pm 1.322.8 ± 1.3 100.3±5.3plus-or-minus100.35.3100.3\pm 5.3100.3 ± 5.3 7.2±0.3plus-or-minus7.20.37.2\pm 0.37.2 ± 0.3 7.0±0.3plus-or-minus7.00.37.0\pm 0.37.0 ± 0.3
LoRA 25.7±0.7plus-or-minus25.70.725.7\pm 0.725.7 ± 0.7 7.8±0.1plus-or-minus7.80.17.8\pm 0.17.8 ± 0.1 23.0±0.7plus-or-minus23.00.723.0\pm 0.723.0 ± 0.7 77.5±4.6plus-or-minus77.54.677.5\pm 4.677.5 ± 4.6 7.3±0.2plus-or-minus7.30.27.3\pm 0.27.3 ± 0.2 7.1±0.2plus-or-minus7.10.27.1\pm 0.27.1 ± 0.2
ReFT 27.6±1.0plus-or-minus27.61.027.6\pm 1.027.6 ± 1.0 8.7±0.4plus-or-minus8.70.48.7\pm 0.48.7 ± 0.4 24.9±0.7plus-or-minus24.90.724.9\pm 0.724.9 ± 0.7 47.8±1.9plus-or-minus47.81.947.8\pm 1.947.8 ± 1.9 7.3±0.3plus-or-minus7.30.37.3\pm 0.37.3 ± 0.3 7.1±0.3plus-or-minus7.10.37.1\pm 0.37.1 ± 0.3
Orthogonal 25.5±1.2plus-or-minus25.51.225.5\pm 1.225.5 ± 1.2 7.7±0.4plus-or-minus7.70.47.7\pm 0.47.7 ± 0.4 22.9±2.1plus-or-minus22.92.122.9\pm 2.122.9 ± 2.1 88.0±5.7plus-or-minus88.05.788.0\pm 5.788.0 ± 5.7 7.2±0.3plus-or-minus7.20.37.2\pm 0.37.2 ± 0.3 7.0±0.3plus-or-minus7.00.37.0\pm 0.37.0 ± 0.3
LoRA (1M Ex) 32.2±1.5plus-or-minus32.21.5\mathbf{32.2\pm 1.5}bold_32.2 ± bold_1.5 9.9±1.0plus-or-minus9.91.0\mathbf{9.9\pm 1.0}bold_9.9 ± bold_1.0 27.4±1.2plus-or-minus27.41.2\mathbf{27.4\pm 1.2}bold_27.4 ± bold_1.2 21.8±2.5plus-or-minus21.82.5\mathbf{21.8\pm 2.5}bold_21.8 ± bold_2.5 8.0±0.2plus-or-minus8.00.2\mathbf{8.0\pm 0.2}bold_8.0 ± bold_0.2 7.8±0.2plus-or-minus7.80.2\mathbf{7.8\pm 0.2}bold_7.8 ± bold_0.2

4.2.2 BillSum Performance

Similarly, for the BillSum dataset (Table 3), the models fine-tuned on OpenDebateEvidence demonstrated superior performance compared to their baselines. This suggests that training on OpenDebateEvidence can enhance the models’ capabilities in reasoning and summarization, even when applied to a different domain such as legislative texts. The results on BillSum confirm the transferability of the fine-tuning techniques, with LoRA once again showing the most significant improvements. This technique’s efficiency in parameter adaptation appears to be particularly beneficial for handling diverse datasets. ReFT also performed well, indicating its robustness in capturing complex argument structures across different domains. Orthogonalization, while showing improvement, was less impactful compared to the other techniques.

Table 3: Performance on BillSum. ROUGE F1 scores and perplexity on 10,0001000010{,}00010 , 000 sampled documents, and LLM as Judge scores on 1,00010001{,}0001 , 000 results. Scores are averaged over three runs. R-1, R-2, and R-L denote ROUGE-1, ROUGE-2, and ROUGE-L respectively. Error bars represent one standard error over 3 trials.
Model R-1 R-2 R-L Perplexity Output Quality Support Quality
Mistral-7B
Base 44.8±0.3plus-or-minus44.80.344.8\pm 0.344.8 ± 0.3 21.2±0.5plus-or-minus21.20.521.2\pm 0.521.2 ± 0.5 40.5±0.8plus-or-minus40.50.840.5\pm 0.840.5 ± 0.8 25.2±1.1plus-or-minus25.21.125.2\pm 1.125.2 ± 1.1 7.2±0.3plus-or-minus7.20.37.2\pm 0.37.2 ± 0.3 7.0±0.3plus-or-minus7.00.37.0\pm 0.37.0 ± 0.3
LoRA 47.1±0.5plus-or-minus47.10.547.1\pm 0.547.1 ± 0.5 23.4±0.2plus-or-minus23.40.223.4\pm 0.223.4 ± 0.2 42.8±0.6plus-or-minus42.80.642.8\pm 0.642.8 ± 0.6 23.9±0.3plus-or-minus23.90.323.9\pm 0.323.9 ± 0.3 7.4±0.2plus-or-minus7.40.27.4\pm 0.27.4 ± 0.2 7.2±0.2plus-or-minus7.20.27.2\pm 0.27.2 ± 0.2
ReFT 46.9±0.2plus-or-minus46.90.246.9\pm 0.246.9 ± 0.2 23.3±0.1plus-or-minus23.30.123.3\pm 0.123.3 ± 0.1 42.6±0.6plus-or-minus42.60.642.6\pm 0.642.6 ± 0.6 24.3±0.4plus-or-minus24.30.424.3\pm 0.424.3 ± 0.4 7.5±0.3plus-or-minus7.50.37.5\pm 0.37.5 ± 0.3 7.3±0.3plus-or-minus7.30.37.3\pm 0.37.3 ± 0.3
Orthogonal 44.9±0.5plus-or-minus44.90.544.9\pm 0.544.9 ± 0.5 21.3±0.2plus-or-minus21.30.221.3\pm 0.221.3 ± 0.2 40.7±1.2plus-or-minus40.71.240.7\pm 1.240.7 ± 1.2 25.4±1.4plus-or-minus25.41.425.4\pm 1.425.4 ± 1.4 7.3±0.2plus-or-minus7.30.27.3\pm 0.27.3 ± 0.2 7.1±0.2plus-or-minus7.10.27.1\pm 0.27.1 ± 0.2
LLaMA3-8B
Base 42.4±0.6plus-or-minus42.40.642.4\pm 0.642.4 ± 0.6 19.6±0.2plus-or-minus19.60.219.6\pm 0.219.6 ± 0.2 38.8±1.3plus-or-minus38.81.338.8\pm 1.338.8 ± 1.3 27.3±1.3plus-or-minus27.31.327.3\pm 1.327.3 ± 1.3 7.0±0.3plus-or-minus7.00.37.0\pm 0.37.0 ± 0.3 6.8±0.3plus-or-minus6.80.36.8\pm 0.36.8 ± 0.3
LoRA 42.7±0.7plus-or-minus42.70.742.7\pm 0.742.7 ± 0.7 19.8±0.1plus-or-minus19.80.119.8\pm 0.119.8 ± 0.1 39.0±0.7plus-or-minus39.00.739.0\pm 0.739.0 ± 0.7 27.0±1.1plus-or-minus27.01.127.0\pm 1.127.0 ± 1.1 7.1±0.2plus-or-minus7.10.27.1\pm 0.27.1 ± 0.2 6.9±0.2plus-or-minus6.90.26.9\pm 0.26.9 ± 0.2
ReFT 44.6±1.0plus-or-minus44.61.044.6\pm 1.044.6 ± 1.0 20.7±0.4plus-or-minus20.70.420.7\pm 0.420.7 ± 0.4 40.9±0.7plus-or-minus40.90.740.9\pm 0.740.9 ± 0.7 26.8±1.0plus-or-minus26.81.026.8\pm 1.026.8 ± 1.0 7.2±0.3plus-or-minus7.20.37.2\pm 0.37.2 ± 0.3 7.0±0.3plus-or-minus7.00.37.0\pm 0.37.0 ± 0.3
Orthogonal 42.5±1.2plus-or-minus42.51.242.5\pm 1.242.5 ± 1.2 19.7±0.4plus-or-minus19.70.419.7\pm 0.419.7 ± 0.4 38.9±2.1plus-or-minus38.92.138.9\pm 2.138.9 ± 2.1 27.5±1.5plus-or-minus27.51.527.5\pm 1.527.5 ± 1.5 7.1±0.3plus-or-minus7.10.37.1\pm 0.37.1 ± 0.3 6.9±0.3plus-or-minus6.90.36.9\pm 0.36.9 ± 0.3
LoRA (1M Ex) 48.2±1.5plus-or-minus48.21.5\mathbf{48.2\pm 1.5}bold_48.2 ± bold_1.5 24.9±1.0plus-or-minus24.91.0\mathbf{24.9\pm 1.0}bold_24.9 ± bold_1.0 43.4±1.2plus-or-minus43.41.2\mathbf{43.4\pm 1.2}bold_43.4 ± bold_1.2 21.8±0.5plus-or-minus21.80.5\mathbf{21.8\pm 0.5}bold_21.8 ± bold_0.5 7.8±0.2plus-or-minus7.80.2\mathbf{7.8\pm 0.2}bold_7.8 ± bold_0.2 7.6±0.2plus-or-minus7.60.2\mathbf{7.6\pm 0.2}bold_7.6 ± bold_0.2

4.2.3 DebateSum Performance

In evaluating the DebateSum dataset (Table 4), the fine-tuned models demonstrated notable improvements over their base counterparts. The use of advanced fine-tuning techniques, particularly LoRA with a larger dataset of 1,000,00010000001{,}000{,}0001 , 000 , 000 examples, significantly boosted the models’ performance across all ROUGE metrics. This was evident in both the Mistral-7B and LLaMA3-8B models, highlighting the effectiveness of LoRA in enhancing summarization capabilities. The ReFT technique also showed robust results, suggesting its strong ability to refine hidden representations for better summarization quality. While Orthogonalization improved the performance, it was comparatively less effective than LoRA and ReFT, likely due to its narrower focus on specific feature control within the residual streams.

Table 4: Performance on DebateSum. ROUGE F1 scores and perplexity on 10,0001000010{,}00010 , 000 sampled documents, and LLM as Judge scores on 1,00010001{,}0001 , 000 results. Scores are averaged over three runs. R-1, R-2, and R-L denote ROUGE-1, ROUGE-2, and ROUGE-L respectively. Error bars represent one standard error over 3 trials.
Model R-1 R-2 R-L Perplexity Output Quality Support Quality
Mistral-7B
Base 26.3±0.4plus-or-minus26.30.426.3\pm 0.426.3 ± 0.4 7.5±0.3plus-or-minus7.50.37.5\pm 0.37.5 ± 0.3 23.1±0.6plus-or-minus23.10.623.1\pm 0.623.1 ± 0.6 130.5±4.2plus-or-minus130.54.2130.5\pm 4.2130.5 ± 4.2 7.3±0.3plus-or-minus7.30.37.3\pm 0.37.3 ± 0.3 7.0±0.3plus-or-minus7.00.37.0\pm 0.37.0 ± 0.3
LoRA 28.5±0.6plus-or-minus28.50.628.5\pm 0.628.5 ± 0.6 8.9±0.3plus-or-minus8.90.38.9\pm 0.38.9 ± 0.3 24.7±0.5plus-or-minus24.70.524.7\pm 0.524.7 ± 0.5 30.2±2.0plus-or-minus30.22.030.2\pm 2.030.2 ± 2.0 7.5±0.3plus-or-minus7.50.37.5\pm 0.37.5 ± 0.3 7.3±0.3plus-or-minus7.30.37.3\pm 0.37.3 ± 0.3
ReFT 28.3±0.3plus-or-minus28.30.328.3\pm 0.328.3 ± 0.3 8.8±0.2plus-or-minus8.80.28.8\pm 0.28.8 ± 0.2 24.5±0.4plus-or-minus24.50.424.5\pm 0.424.5 ± 0.4 45.1±2.7plus-or-minus45.12.745.1\pm 2.745.1 ± 2.7 7.4±0.2plus-or-minus7.40.27.4\pm 0.27.4 ± 0.2 7.2±0.2plus-or-minus7.20.27.2\pm 0.27.2 ± 0.2
Orthogonal 26.4±0.5plus-or-minus26.40.526.4\pm 0.526.4 ± 0.5 7.6±0.2plus-or-minus7.60.27.6\pm 0.27.6 ± 0.2 23.3±0.7plus-or-minus23.30.723.3\pm 0.723.3 ± 0.7 70.3±3.5plus-or-minus70.33.570.3\pm 3.570.3 ± 3.5 7.4±0.3plus-or-minus7.40.37.4\pm 0.37.4 ± 0.3 7.2±0.3plus-or-minus7.20.37.2\pm 0.37.2 ± 0.3
LLaMA3-8B
Base 24.2±0.5plus-or-minus24.20.524.2\pm 0.524.2 ± 0.5 6.9±0.3plus-or-minus6.90.36.9\pm 0.36.9 ± 0.3 21.9±0.8plus-or-minus21.90.821.9\pm 0.821.9 ± 0.8 95.7±4.5plus-or-minus95.74.595.7\pm 4.595.7 ± 4.5 7.0±0.3plus-or-minus7.00.37.0\pm 0.37.0 ± 0.3 6.7±0.3plus-or-minus6.70.36.7\pm 0.36.7 ± 0.3
LoRA 24.5±0.6plus-or-minus24.50.624.5\pm 0.624.5 ± 0.6 7.0±0.2plus-or-minus7.00.27.0\pm 0.27.0 ± 0.2 22.1±0.6plus-or-minus22.10.622.1\pm 0.622.1 ± 0.6 73.9±3.2plus-or-minus73.93.273.9\pm 3.273.9 ± 3.2 7.1±0.2plus-or-minus7.10.27.1\pm 0.27.1 ± 0.2 6.8±0.2plus-or-minus6.80.26.8\pm 0.26.8 ± 0.2
ReFT 26.6±0.9plus-or-minus26.60.926.6\pm 0.926.6 ± 0.9 8.0±0.3plus-or-minus8.00.38.0\pm 0.38.0 ± 0.3 23.8±0.7plus-or-minus23.80.723.8\pm 0.723.8 ± 0.7 44.7±1.7plus-or-minus44.71.744.7\pm 1.744.7 ± 1.7 7.1±0.3plus-or-minus7.10.37.1\pm 0.37.1 ± 0.3 6.9±0.3plus-or-minus6.90.36.9\pm 0.36.9 ± 0.3
Orthogonal 24.3±1.1plus-or-minus24.31.124.3\pm 1.124.3 ± 1.1 7.0±0.4plus-or-minus7.00.47.0\pm 0.47.0 ± 0.4 22.0±1.9plus-or-minus22.01.922.0\pm 1.922.0 ± 1.9 82.5±4.2plus-or-minus82.54.282.5\pm 4.282.5 ± 4.2 7.0±0.3plus-or-minus7.00.37.0\pm 0.37.0 ± 0.3 6.8±0.3plus-or-minus6.80.36.8\pm 0.36.8 ± 0.3
LoRA (1M Ex) 30.4±1.4plus-or-minus30.41.4\mathbf{30.4\pm 1.4}bold_30.4 ± bold_1.4 9.4±0.9plus-or-minus9.40.9\mathbf{9.4\pm 0.9}bold_9.4 ± bold_0.9 26.5±1.1plus-or-minus26.51.1\mathbf{26.5\pm 1.1}bold_26.5 ± bold_1.1 19.8±1.7plus-or-minus19.81.7\mathbf{19.8\pm 1.7}bold_19.8 ± bold_1.7 7.8±0.3plus-or-minus7.80.3\mathbf{7.8\pm 0.3}bold_7.8 ± bold_0.3 7.6±0.3plus-or-minus7.60.3\mathbf{7.6\pm 0.3}bold_7.6 ± bold_0.3

5 Potential Applications and Future Directions

5.1 Argument Quality Assessment

The extensive collection of real-world debate arguments and rich metadata in the dataset provides a unique opportunity to study argument quality. By analyzing factors such as duplicate count, position in the debate round, and debate outcomes, researchers can develop models to automatically assess persuasiveness, relevance, and overall quality. This could lead to tools that offer real-time feedback and suggestions for debaters, improving debate communities’ inclusivity by reducing reliance on expensive human judges.

5.2 Multi-level Argument Summarization

The hierarchical structure of arguments in the dataset enables research into multi-level argument summarization. Models can generate summaries at various granularities, from concise one-sentence summaries to detailed overviews. This aligns with the emerging interest in query-focused and hierarchical summarization in the NLP community.

5.3 Argument Generation and Rebuttal

With its diverse collection of arguments and counterarguments, OpenDebateEvidence is valuable for developing argument generation models. By studying successful debaters’ patterns and strategies, researchers can create systems that generate persuasive and relevant arguments on given topics. Additionally, the dataset’s balanced coverage of affirmative and negative arguments enables the development of rebuttal generation models that counter opposing arguments.

5.4 Cross-domain Argument Mining

While primarily focused on competitive debate formats, the argumentation skills and techniques in the dataset are applicable across domains such as legal reasoning, policy-making, and online discussions. Researchers can develop general argument mining models for diverse argumentative texts, advancing areas like legal document analysis, opinion mining, and fact-checking. Integrating OpenDebateEvidence with fact-checking and misinformation detection datasets could yield robust models for identifying and countering misleading claims in public discourse.

5.5 Understanding Persuasion and Sentiment

The dataset captures both arguments’ logical structure and debaters’ rhetorical strategies and emotional appeals. By studying the interplay between rational argumentation and affective language, researchers can develop sophisticated models for understanding sentiment’s role in persuasion and decision-making. This has applications in political science, marketing, and human-computer interaction.

5.6 Debate Coaching and Education

OpenDebateEvidence holds significant potential for debate coaching and education. Analyzing successful arguments and strategies can help coaches identify best practices and develop more effective training programs. The dataset can also serve as a resource for creating educational materials such as argument templates, case studies, and interactive learning tools. This will support aspiring debaters in skill development.

6 Conclusion

In this paper, we introduce OpenDebateEvidence, a large-scale dataset for argument mining and summarization, comprising over 3.5 million documents from the OpenCaseList project. After extensive preprocessing and deduplication, we created a high-quality dataset enriched with metadata that captures the hierarchical structure and semantics of debate arguments. Our experiments demonstrated the potential of fine-tuning modern large language models for argumentative abstractive summarization in a parameter-efficient manner. The results showed significant improvements in performance on the OpenDebateEvidence, DebateSum, and BillSum datasets, validating the effectiveness of our approach.

By providing this resource to the community, we aim to advance computational argumentation and support practical applications for debaters, educators, and researchers. The OpenDebateEvidence dataset, with its rich metadata and diverse collection of debate formats, offers an excellent resource for developing and evaluating argument mining and summarization models.

Future work includes exploring additional fine-tuning techniques and expanding the dataset to include more diverse debate formats. We also plan to investigate the integration of multimodal data to enhance argument comprehension and explore cross-linguistic adaptations to broaden the applicability of our models. By continuing to refine and expand this resource, we hope to further enhance language models’ capabilities in understanding and generating complex argumentative discourse.

References

  • Arditi et al. (2023) Andy Arditi, Oscar Obeso, Aaquib111, wesg, and Neel Nanda. Refusal in llms is mediated by a single direction. https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction, 2023. Accessed: 2024-06-18.
  • Bar-Haim et al. (2020) Roy Bar-Haim, Lilach Eden, Roni Friedman, Yoav Kantor, Dan Lahav, and Noam Slonim. From arguments to key points: Towards automatic argument summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  4029–4039, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.371. URL https://aclanthology.org/2020.acl-main.371.
  • Cheng et al. (2022) Liying Cheng, Lidong Bing, Ruidan He, Qian Yu, Yan Zhang, and Luo Si. IAM: A comprehensive and large-scale dataset for integrated argument mining tasks. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  2277–2287, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.162. URL https://aclanthology.org/2022.acl-long.162.
  • Community (2024a) Contention AI Community. Contention ai. https://contention.ai, 2024a. Accessed: 2024-06-18.
  • Community (2024b) Debate Cards Community. Debate cards. http://debate.cards, 2024b. Accessed: 2024-06-18.
  • Community (2024c) Logos Debate Community. Logos debate. https://logos-debate.netlify.app, 2024c. Accessed: 2024-06-18.
  • Ein-Dor et al. (2019) Liat Ein-Dor, Eyal Shnarch, Lena Dankin, Alon Halfon, Benjamin Sznajder, Ariel Gera, Carlos Alzate, Martin Gleize, Leshem Choshen, Yufang Hou, Yonatan Bilu, Ranit Aharonov, and Noam Slonim. Corpus wide argument mining – a working solution, 2019.
  • Elaraby & Litman (2022) Mohamed Elaraby and Diane Litman. ArgLegalSumm: Improving abstractive summarization of legal documents with argument mining. In Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, and Seung-Hoon Na (eds.), Proceedings of the 29th International Conference on Computational Linguistics, pp.  6187–6194, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.540.
  • Gleize et al. (2019) Martin Gleize, Eyal Shnarch, Leshem Choshen, Lena Dankin, Guy Moshkowich, Ranit Aharonov, and Noam Slonim. Are you convinced? choosing the more convincing evidence with a Siamese network. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  967–976, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1093. URL https://aclanthology.org/P19-1093.
  • Hardy (2024) Aaron Hardy. Opencaselist project. https://opencaselist.com/, 2024. Accessed: 2024-06-18.
  • Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
  • Joshi et al. (2023) Omkar Joshi, Priya Pitre, and Yashodhara Haribhakta. ArgAnalysis35K : A large-scale dataset for argument quality analysis. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  13916–13931, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.778. URL https://aclanthology.org/2023.acl-long.778.
  • Kornilova & Eidelman (2019) Anastassia Kornilova and Vladimir Eidelman. BillSum: A corpus for automatic summarization of US legislation. In Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, and Fei Liu (eds.), Proceedings of the 2nd Workshop on New Frontiers in Summarization, pp.  48–56, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-5406. URL https://aclanthology.org/D19-5406.
  • Magnusson & Friedman (2021) Ian Magnusson and Scott Friedman. Extracting fine-grained knowledge graphs of scientific claims: Dataset and transformer-based results. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  4651–4658, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.381. URL https://aclanthology.org/2021.emnlp-main.381.
  • Mangrulkar et al. (2022) Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
  • Roush & Balaji (2020) Allen Roush and Arvind Balaji. DebateSum: A large-scale argument mining and summarization dataset. In Elena Cabrio and Serena Villata (eds.), Proceedings of the 7th Workshop on Argument Mining, pp.  1–7, Online, December 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.argmining-1.1.
  • Ruiz-Dolz & Iranzo-Sánchez (2024) Ramon Ruiz-Dolz and Javier Iranzo-Sánchez. Vivesdebate-speech: A corpus of spoken argumentation to leverage audio features for argument mining, 2024.
  • Shen et al. (2022) Zejiang Shen, Kyle Lo, Lauren Yu, Nathan Dahlberg, Margo Schlanger, and Doug Downey. Multi-lexsum: Real-world summaries of civil rights lawsuits at multiple granularities, 2022.
  • Shukla et al. (2022) Neelesh Shukla, Amit Vaid, Raghu Katikeri, Sangeeth Keeriyadath, and Msp Raja. DiMSum: Distributed and multilingual summarization of financial narratives. In Mahmoud El-Haj, Paul Rayson, and Nadhem Zmandar (eds.), Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022, pp.  65–72, Marseille, France, June 2022. European Language Resources Association. URL https://aclanthology.org/2022.fnp-1.9.
  • Slonim et al. (2021) Noam Slonim, Yonatan Bilu, Carlos Alzate, et al. An autonomous debating system. Nature, 591:379–384, 2021. doi: 10.1038/s41586-021-03215-w. URL https://doi.org/10.1038/s41586-021-03215-w.
  • Wu et al. (2024) Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. Reft: Representation finetuning for language models, 2024.
  • Xu et al. (2023) Ruochen Xu, Song Wang, Yang Liu, Shuohang Wang, Yichong Xu, Dan Iter, Pengcheng He, Chenguang Zhu, and Michael Zeng. LMGQS: A large-scale dataset for query-focused summarization. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.  14764–14776, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.984. URL https://aclanthology.org/2023.findings-emnlp.984.
  • Zhao et al. (2021) Mingjun Zhao, Shengli Yan, Bang Liu, Xinwang Zhong, Qian Hao, Haolan Chen, Di Niu, Bowei Long, and Weidong Guo. Qbsum: A large-scale query-based document summarization dataset from real-world applications. Computer Speech and amp; Language, 66:101166, March 2021. ISSN 0885-2308. doi: 10.1016/j.csl.2020.101166. URL http://dx.doi.org/10.1016/j.csl.2020.101166.
  • Zhong et al. (2021) Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, and Dragomir Radev. Qmsum: A new benchmark for query-based multi-domain meeting summarization, 2021.

Appendix A Limitations

A.1 Representation Bias

While the OpenDebateEvidence dataset is extensive, it may not fully represent the diversity of argumentation styles and topics across all debate communities. The dataset primarily includes evidence from American high school and college debates, and therefore might not capture the nuances of debates in other cultural or educational contexts, or in other languages.

Format-Specific Challenges

The unique formatting conventions used in debate evidence may present challenges for standard natural language processing tools. The presence of shorthand, abbreviations, and specialized jargon may require additional preprocessing or specialized models to accurately interpret and analyze the text.

Incomplete or Inconsistent Metadata

While the dataset includes extensive metadata, there may be inconsistencies or gaps in this information. For example, citation details might be missing or incorrect for some documents, and the standardized tags describing the type of argument might not be uniformly applied across all documents.

Potential Noise and Redundancy

The dataset’s size and diversity may also introduce noise and redundancy. Duplicate documents, irrelevant content, or errors in formatting and citation may exist within the dataset, potentially affecting the quality of the analyses in spite of efforts taken to reduce or eliminate this.

Limited Accessibility to Public Forum Debate Evidence

With Public Forum Debate making up such a small percentage of the evidence included within OpenDebateEvidence, research focusing on this specific debate format may face limitations in terms of data quantity and diversity.

Appendix B Ethics Statement

The OpenDebateEvidence dataset, presented in this paper, derives from openly shared debate evidence across various educational forums and debate formats. This dataset strictly adheres to the principles of fair use, focusing on academic and research intent. The files which make up OpenDebateEvidence have been hosted online in some cases for over a decade without any known ethical issues arising as a result of it.

We performed this research and released this dataset with the full blessing and support of the OpenCaseList project.

Appendix C Social Impacts

Our introduction of OpenDebateEvidence, a comprehensive dataset sourced from the American Competitive Debate community, is poised to have significant positive societal impacts. By offering a rich collection of over 3.5 million documents with detailed metadata, this dataset provides an unparalleled resource for training and evaluating language models in the domain of argument mining and summarization.

The comprehensive nature of OpenDebateEvidence, capturing the nuanced complexity of arguments in high school and college debates, will enable more rigorous and representative assessments of language models. This, in turn, will drive advancements in computational argumentation research and applications.

Practitioners and researchers will benefit from this benchmark, which is designed to reflect real-world argumentative scenarios more accurately. The dataset’s ability to enhance model performance across various argumentative tasks suggests its utility in improving the robustness and reliability of language technologies.

Moreover, by making OpenDebateEvidence publicly available, we encourage broader participation and innovation in this field. This democratization of resources can lead to more diverse contributions and perspectives, fostering a more inclusive research environment.

In summary, we believe our work will accelerate research, improve model evaluation and training, and ultimately enhance the capabilities of language models in handling complex argumentative texts, with no foreseeable negative societal impacts.

Appendix D Fine-Tuning Techniques

D.1 Low-Rank Adaptation (LoRA)

LoRA introduces low-rank matrices into the model’s architecture, reducing the number of trainable parameters. Given a weight matrix Wd×k𝑊superscript𝑑𝑘W\in\mathbb{R}^{d\times k}italic_W ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_k end_POSTSUPERSCRIPT, LoRA decomposes it into two low-rank matrices Ad×r𝐴superscript𝑑𝑟A\in\mathbb{R}^{d\times r}italic_A ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_r end_POSTSUPERSCRIPT and Br×k𝐵superscript𝑟𝑘B\in\mathbb{R}^{r\times k}italic_B ∈ blackboard_R start_POSTSUPERSCRIPT italic_r × italic_k end_POSTSUPERSCRIPT, where rmin(d,k)much-less-than𝑟𝑑𝑘r\ll\min(d,k)italic_r ≪ roman_min ( italic_d , italic_k ). The updated weight matrix is then W=W+ABsuperscript𝑊𝑊𝐴𝐵W^{\prime}=W+ABitalic_W start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = italic_W + italic_A italic_B. We used the PeFT Mangrulkar et al. (2022) package with default settings (rank 8).

D.2 Representation Fine-Tuning (ReFT)

ReFT modifies hidden representations through targeted interventions in specific subspaces. Low-rank Linear Subspace ReFT (LoReFT) is defined as:

ΦLoReFT(h)=h+R(Wh+bRh)subscriptΦLoReFTsuperscript𝑅top𝑊𝑏𝑅\Phi_{\text{LoReFT}}(h)=h+R^{\top}(Wh+b-Rh)roman_Φ start_POSTSUBSCRIPT LoReFT end_POSTSUBSCRIPT ( italic_h ) = italic_h + italic_R start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( italic_W italic_h + italic_b - italic_R italic_h )

where Rr×d𝑅superscript𝑟𝑑R\in\mathbb{R}^{r\times d}italic_R ∈ blackboard_R start_POSTSUPERSCRIPT italic_r × italic_d end_POSTSUPERSCRIPT has orthonormal rows. We used the PyReft package with default settings (rank 4).

D.3 Orthogonalization

Orthogonalization controls specific features in the model’s residual stream by modifying the weights. Given a direction r^d^𝑟superscript𝑑\hat{r}\in\mathbb{R}^{d}over^ start_ARG italic_r end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, each weight matrix Woutd×dinputsubscript𝑊outsuperscript𝑑subscript𝑑inputW_{\text{out}}\in\mathbb{R}^{d\times d_{\text{input}}}italic_W start_POSTSUBSCRIPT out end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_d start_POSTSUBSCRIPT input end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is modified as:

Wout=Woutr^r^Woutsubscriptsuperscript𝑊outsubscript𝑊out^𝑟superscript^𝑟topsubscript𝑊outW^{\prime}_{\text{out}}=W_{\text{out}}-\hat{r}\hat{r}^{\top}W_{\text{out}}italic_W start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT out end_POSTSUBSCRIPT = italic_W start_POSTSUBSCRIPT out end_POSTSUBSCRIPT - over^ start_ARG italic_r end_ARG over^ start_ARG italic_r end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_W start_POSTSUBSCRIPT out end_POSTSUBSCRIPT

We used the notebook for this process.

Appendix E Prompts Used in Experiments

E.1 OpenDebateEvidence/DebateSum

E.1.1 Traditional NLP Metrics Prompt

SYSTEM PROMPT: You are a Policy Debater.
USER PROMPT:
DOCUMENT: <full text of the document>
Provide an abstractive summary/card-tag of the argument made in the document above.
ABSTRACT:

E.1.2 LLM as Judge Prompt

SYSTEM PROMPT: You are a Policy Debate Judge.
USER PROMPT:
DOCUMENT: <full text of the document>
ABSTRACT: <generated abstract>
Score the abstract from 0-10 on it’s how well it supports the documents argument, and on its general quality

E.2 BillSum

E.2.1 Traditional NLP Metrics Prompt

SYSTEM PROMPT: You are a lawmaker.
USER PROMPT:
DOCUMENT: <full text of the document>
Provide an abstractive summary of the law made in the document above.
ABSTRACT:

E.2.2 LLM as Judge Prompt

SYSTEM PROMPT: You are a lawmaker.
USER PROMPT:
DOCUMENT: <full text of the document>
ABSTRACT: <generated abstract>
Score the abstract from 0-10 on it’s how well it supports the documents argument, and on its general quality

Appendix F OpenDebateEvidence Dataset Details

F.1 Dataset Card for OpenDebateEvidence

  • - Homepage: Here

  • - Repository (Access/download dataset and code): Here

  • - Croissant Metadata: Here

F.1.1 Dataset Summary

This dataset is a gargantoum follow-up to DebateSum, which includes a ton of improvements

Among those improvements are the following:

  • Massively increased size (about 25X the size of DebateSum), including nearly all debate evidence ever open sourced over the past 20 years from High School and College Public Forum, Policy, and Lincoln Douglas debate leagues

  • Far more metadata: Lots of new columns indicating everything from the number of times a piece of evidence has been seen (a good heuristic for evidence quality) to the teams and tournaments and rounds where a piece of evidence was deployed

  • Better deduplication and parsing techniques, including better accounting of the hierarchical nature that debaters use for underlining evidence

F.1.2 Supported Tasks and Leaderboards

This dataset is useful for text generation, summarization, information retrieval, question answering, and related tasks. This dataset is further highly useful as a "trustworthy" dataset. All evidence within it has corresponding citations and is in general "factual" or grounded in facts. We do the evaluation in our paper, establishing the first "leaderboard" for measuring the performance of models trained on this dataset.

F.1.3 Languages

English with very minor exceptions (i.e. evidence from performance cases using non-English evidence to make anti-colonialist arguments)

F.1.4 Dataset Creation

Gathered from the OpenCaseList project with their enthusiastic permission.

F.1.5 Source Data

Debate Evidence from NDCA/NDT debate leagues from 2002-2022.

F.1.6 Dataset Format

This dataset is originally contained in csv files, which were auto-converted into the parqueet dataset format by Huggingface. It’s available for download and consumption in both formats.

F.1.7 Hosting, licensing, and maintenance plan

We host and maintain our dataset on Huggingface through its "dataset" feature. We plan to update this dataset every year with new evidence as it is released by debaters, causing this to be a "living" dataset. We pledge to make sure that this dataset remains accessible for the foreseeable future, and the ability to regenerate this dataset is always preserved as its source documents are freely downloadable on OpenCaseList’s website.

F.1.8 Discussion of Biases

Competitive debate at the highest levels has increasingly rewarded teams who cite particular subfields of philosophy. A partial list of these highly represented topics is given below

  • Postmodernism

  • Poststructuralism

  • Frankfurt School

  • Critical Theory

  • Critical Race Theory

  • Queer Theory

  • Feminism

These cannons are dominated by so called "left-wing" thinkers and have mostly marginalized so called "right-wing" thinkers within them with some notable exceptions

Note that despite a strong "left-wing" bias, large swaths of left-wing thought, such as anarchism, are relatively absent.

Beyond this, most of the evidence was gathered with the argument being made first, and the evidence found after-the-fact to support it. This means that while the evidence is almost all "truthful", a lot of important information which might not help support an argument may be omitted.

F.1.9 Other Known Limitations

There are cases of academic dishonesty within this dataset (i.e. evidence that had specific insertions made by a debater which weren’t in the original text). It’s also possible that the source had changed in-between when it was cited and retrieved. We believe that this is extremely rare in practice, affecting no more than  200 examples.

F.1.10 Consent

We got the enthusiastic consent and approval to use this data from the OpenCaseList project. Debaters who submit their evidence there fully consent for this evidence to be freely used, including for curated datasets like this

F.1.11 Personal Information

We removed all Personal Information from the metadata of this evidence (first/last name of debaters).

F.1.12 Licensing Information

All data within this dataset is clearly used within an extracurricular, educational activity. This means that any "copyright" issues from reproduction of copyrighted articles within the dataset are allowed and exempted under US copyright law. The OpenCaseList project fully blesses this work and has published this evidence with a permissive, MIT license.

F.1.13 Author Statement

We, the authors, bear all responsibility in case of violation of rights, etc., and we confirm that the data is licensed with an MIT license.

Table 5: Description of OpenDebateEvidence Columns
Column Name Description
id Unique identifier for the row
tag Biased abstractive summary of the evidence / argument made by debater with evidence.
cite String indicating the short citation of the source used for the evidence
fullcite Full citation of the source used for the evidence
summary Underlined longer word level extractive summary of the evidence, note that summary is biased
towards supporting the tag argument
spoken Highlighted shorter extractive summary of the evidence / The spoken text of the evidence, note
that summary is biased towards supporting the tag argument
fulltext The full text of the evidence
textLength The length of the text in the evidence in characters
markup The full text of the evidence with HTML markup for parsing / visualization purposes
pocket String indicating the virtual “pocket” (top level section, usually the speech name) in which the
evidence is stored within its original document
hat String indicating the virtual “hat” (medium level section, usually the broad type of argument)
in which the evidence is stored within its original document
block String indicating the virtual “block” (low level section, usually the specific type of argument)
in which the evidence is stored within its original document
bucketId Unique identifier for the bucket in which the evidence is stored
duplicateCount The number of duplicates of the evidence. This acts as a rough proxy for evidence quality, as
good evidence will be duplicated across many debate files
fileId Unique identifier for the file in which the evidence is stored
filePath The file path of the file in which the evidence is stored
roundId Unique identifier for the debate round in which the evidence was used
side The debate side on which the evidence was used (Affirmative or Negative)
tournament The name of the tournament in which the evidence was used
round The round number in which the evidence was used
opponent The name of the opposing team in the debate round in which the evidence was used
judge The name of the judge in the debate round in which the evidence was used
report A report associated with the evidence filled out by one of the debaters, usually summarizing
the arguments presented
opensourcePath The path to the open-source repository in which the evidence is stored
caselistUpdatedAt The date on which the caselist was last updated
teamId Unique identifier for the team
teamName The name of the team
teamDisplayName The display name of the team
teamNotes Notes associated with the team
debater1First The first name of the first debater of the team
debater1Last The last name of the first debater of the team
debater2First The first name of the second debater of the team
debater2Last The last name of the second debater of the team
schoolId Unique identifier for the school
schoolName The name of the school
schoolDisplayName The display name of the school
state The state in which the school is located
chapterId Unique identifier for the chapter
caselistId Unique identifier for the caselist
caselistName The name of the caselist
caselistDisplayName The display name of the caselist
year The year in which the debate round took place
event The event in which the debate round took place
level The level of the debate (e.g., college, high school, etc.)
teamSize The number of debaters on the team
Table 6: Sample Data Row from OpenDebateEvidence
Column Name Sample Data
id 282,369
tag “Biodiversity loss causes human extinction.”
cite “McCarthy 18”
fullcite “Joe McCarthy 18. Staff Writer…”
summary “As the sixth mass extinction event accelerates…”
spoken “As the sixth mass extinction accelerates humans ris…”
fulltext “As the sixth mass extinction event accelerates around the world…”
textLength 3,556
markup “<h4>Biodiversity loss causes human extinction.</h4><p>Joe <strong>McCarthy 18…’
pocket “1NC”
hat “OFF”
block “1NC—DA”
bucketId 18,967
duplicateCount 122
fileId 3,564
filePath “./documents/ndtceda22/Emory/KiLo/Emory-KiLo-Aff-JW-Round-3.docx”
roundId 932,619
side “A”
tournament “JW Patterson Debates hosted by UK”
round “3”
opponent “West Georgia CL”
judge “Ka***”
report
“1AC - Manoomin 1NC - T Subsets States CP Human Right CP Rights K Politics DA
Fetal Personhood DA AI Bad DA 2NC - K Case 1NR - Case T 2NR - T”
opensourcePath “ndtceda22/Emory/KiLo/Emory-KiLo-Aff-JW-Patterson-Debates.docx”
caselistUpdatedAt “2022-10-05 19:30:41”
teamId 80,494
teamName “KiLo”
teamDisplayName “Emory KiLo”
debater1First “Aa***”
debater1Last “Ki***”
debater2First “Lu***”
debater2Last “Lo***”
schoolId 27,030
schoolName “Emory”
schoolDisplayName “Emory”
caselistId 2,001
caselistName “ndtceda22”
caselistDisplayName “NDT/CEDA College 2022-23”
year 2,022
event “cx”
level “college”
teamSize 2
Feature Top Categories/Values Counts
Year (Top 5) 2020 850,607
2021 787,685
2019 609,703
2018 423,378
2017 258,742
Event cx 2,380,600
ld 1,164,132
pf 26,366
Caselist DisplayName (Top 3) HS LD 2020-21 383,489
HS LD 2021-22 326,166
HS Policy 2021-22 262,292
TeamSize 2 2,406,966
1 1,164,132
Level hs 2,144,757
college 1,426,341
State (Top 5) None 1,875,021
CA 450,734
TX 335,087
GA 108,011
IL 94,084
Side N 1,992,850
A 1,578,248
DuplicateCount (Top 5) 1 1,176,971
2 114,268
3 92,667
4 78,804
5 67,166
Table 7: Sample statistics from the OpenDebateEvidence dataset.