OpenDebateEvidence: A Massive-Scale Argument Mining and Summarization Dataset
Abstract
We introduce OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence. OpenDebateEvidence captures the complexity of arguments in high school and college debates, providing valuable resources for training and evaluation. Our extensive experiments demonstrate the efficacy of fine-tuning state-of-the-art large language models for argumentative abstractive summarization across various methods, models, and datasets. By providing this comprehensive resource, we aim to advance computational argumentation and support practical applications for debaters, educators, and researchers. OpenDebateEvidence is publicly available to support further research and innovation in computational argumentation. Access it here: https://huggingface.co/datasets/Yusuf5/OpenCaselist.
1 Introduction
Argument mining plays a pivotal role in developing advanced language models (LLMs) capable of sophisticated reasoning and understanding. Engaging with complex argumentative texts enhances LLMs’ abilities to comprehend, generate, and evaluate arguments. This improves their performance in applications such as legal document analysis, educational tools, and more.
Existing argument mining datasets, such as DebateSum introduced by Roush & Balaji (2020), are limited in scope. DebateSum, with 240,566 examples, primarily focuses on pre-season evidence from summer camps, excluding the rich argumentative structures in regular-season debates. This limitation affects dataset size, representativeness, and utility for large-scale argument mining.
To address these gaps, we introduce OpenDebateEvidence, a large-scale dataset for argument mining and summarization sourced from the OpenCaseList project (Hardy, 2024). This dataset comprises 3.5 million documents, making it the most extensive collection of debate evidence available. It captures the full spectrum of arguments presented throughout the debate season. OpenDebateEvidence’s comprehensive nature, with its detailed metadata, makes it highly valuable for training language models.
In this paper, we provide an in-depth overview of OpenDebateEvidence, detailing our data collection and preprocessing methods. We demonstrate that training LLMs on OpenDebateEvidence significantly improves their performance not only on this dataset but also on other related argumentative datasets. We conducted extensive evaluation experiments using state-of-the-art language models: LLaMA3-8B 111https://huggingface.co/meta-llama/Meta-Llama-3-8B and Mistral-7B222https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2. These models were fine-tuned using advanced techniques such as Low-Rank Adaptation (LoRA) (Hu et al., 2021), Representation Fine-Tuning (ReFT) (Wu et al., 2024), and Orthogonalization (Arditi et al., 2023). The results show substantial improvements in model performance compared to those trained on previous argument mining datasets. This underlines OpenDebateEvidence’s effectiveness in enhancing argument-mining capabilities.
Our contributions are:
-
1.
We introduce OpenDebateEvidence, the largest and most comprehensive dataset for argument mining and summarization, encompassing 3.5 million documents with detailed metadata.
-
2.
We provide rich metadata that facilitates various NLP tasks and applications, enhancing the dataset’s utility for researchers and practitioners.
-
3.
We demonstrate significant performance improvements of state-of-the-art language models not only on OpenDebateEvidence but also on other related argumentative datasets through extensive fine-tuning experiments.
-
4.
We evaluate the dataset’s effectiveness in different scenarios and methods, including various fine-tuning techniques such as Low-Rank Adaptation (LoRA), Representation Fine-Tuning (ReFT), and Orthogonalization, showcasing substantial gains in model performance.
-
5.
We discuss the practical applications of OpenDebateEvidence in various fields such as legal document analysis, educational tools, and AI model development, highlighting its real-world relevance and impact.
Our experiments highlight that training on OpenDebateEvidence not only enhances model performance on this dataset but also significantly improves results on other related argumentative datasets. This underscores the dataset’s superiority and its potential to drive advancements in computational argumentation research.
2 Background and Related Work
Competitive debate in the United States encompasses several prominent styles, each with unique formats, rules, and emphasis. The three most notable styles are Policy Debate, Lincoln-Douglas Debate, and Public Forum Debate, popular at both high school and collegiate levels. While sharing structural similarities, these debate formats differ in focus, speech times, and the importance placed on evidence. OpenDebateEvidence includes evidence from all of these formats.
2.1 Policy Debate
The High School and College Policy Debate, also known as the "cross-examination debate" (CX), involves two teams of students arguing for and against a specific policy proposal based on an annually changing broad resolution. Each debate round lasts about 90 minutes, comprising eight speeches (four by each team and two by each speaker) and cross-examination periods. The structure includes constructive speeches followed by refutations, with cross-examination periods allowing debaters to clarify arguments or challenge assumptions. New arguments are restricted to constructive speeches.
During a debate, teams present evidence from various sources to support their arguments. This evidence is usually in the form of written "cards" 111Policy Debaters used to literally cut their evidence out of magazines and glue it onto physical cards; while this has fallen out of fashion, the name stuck., such as research publications, academic articles, news reports, or government documents. Figure 1 shows an example of a "card." The quality and quantity of evidence used in a debate round often determine the winner. Policy Debate is unique among competitive debate styles in that the quality of the speech act is secondary222Leading to a peculiar phenomenon known as ”speed reading” or ”spreading” to be normalized: https://en.wikipedia.org/wiki/Spreading_(debate) compared to the quality, quantity, and factuality of the evidence.
![Refer to caption](extracted/5713445/pic1.png)
2.2 Lincoln Douglas Debate
Lincoln-Douglas Debate (LD), a one-on-one format with a bimonthly topic, originated from the historic debates between Abraham Lincoln and Stephen Douglas. Popular in high school and college competitions, LD debates share structural similarities with Policy Debate but feature shorter speech times and cross-examination periods. LD debates emphasize ethical and moral reasoning, focusing more on philosophical arguments rather than policy implications. However, they still prioritize the quality and quantity of evidence presented.
2.3 Public Forum Debate
Public Forum Debate is a two-on-two format debating a monthly topic designed to be accessible to a broader audience. Compared to Policy and LD debates, Public Forum rounds have shorter speaking times and place less emphasis on evidence. Public Forum Debate constitutes a smaller portion of the evidence in OpenDebateEvidence and was not included in DebateSum.
2.4 Existing Datasets and Research
Significant prior work in argument mining has focused on competitive formal debate. IBM’s Project Debater has been a leading effort, publishing extensively on argument detection (Ein-Dor et al., 2019), argument quality (Gleize et al., 2019), key point analysis/summarization (Bar-Haim et al., 2020; Magnusson & Friedman, 2021), and autonomous debating systems (Slonim et al., 2021). However, their work does not focus on the real-world competitive debate evidence found in our dataset.
Other notable contributions include VivesDebate, a multilingual audio dataset of debate tournaments (Ruiz-Dolz & Iranzo-Sánchez, 2024); ArgAnalysis35K, focusing on single argument analysis pairs in evidence-free parliamentary debate (Joshi et al., 2023); IAM (Integrated Argument Mining), a highly annotated dataset for integrated argument mining tasks with only articles (Cheng et al., 2022); and DebateSum, a dataset with examples focusing on pre-season debate evidence (Roush & Balaji, 2020). Additionally, several legal summarization datasets have been developed, including ArgLegalSumm (Elaraby & Litman, 2022), Multi-LexSum (Shen et al., 2022), and datasets targeting Indian and British case law (Shukla et al., 2022), which together total fewer than examples.
Other resources include logos.app Community (2024c), debate.cards (Community, 2024b), and contention.ai (Community, 2024a), which index various debate evidence and generate new evidence from web searches. Datasets targeting biased or query-focused summarization include QBSUM, a Chinese dataset with samples (Zhao et al., 2021); QMSum, which studies meeting summarization with samples (Zhong et al., 2021); and LMGQS, a dataset with over 1 million documents converted to query-focused summarization (Xu et al., 2023). In contrast, our dataset is fully human-created and human-annotated by active debate competitors.
Compared to these datasets, OpenDebateEvidence offers a significantly larger scale and scope, with over 3.5 million documents enriched with detailed metadata.
3 OpenDebateEvidence Dataset
3.1 Data Collection
OpenDebateEvidence is sourced from the OpenCaseList project (Hardy, 2024), an online platform where high school and college debate teams disclose and open-source their evidence. The dataset contains over 3.5 million documents, covering all NSDA debate topics from 2012 to 2023.333A list of these topics can be found here. Each document corresponds to a single piece of evidence used in a debate, categorized by debate format (Policy, LD, Public Forum), and includes comprehensive metadata such as author, date, title, source, citation details, and the debate round in which it was used444An example of some of these downloads can be found here.
The dataset also includes standardized tags to describe the type of argument made by the document, such as topicality, disadvantages, advantages, and counter plans, along with details of the structure and location in the debate file from which the document was extracted. To protect privacy, identifying information has been anonymized.
3.2 Data Preprocessing
Debate evidence is stored in the .docx file format, requiring a specialized parsing process to extract relevant information. The parsing pipeline begins by unzipping the .docx file to access the internal XML files. Ensuring accurate preprocessing is paramount for maintaining dataset quality. This process involves detailed steps to preserve the integrity and consistency of the data, including tokenization, simplification, and structuring of text blocks, followed by extracting and organizing individual debate cards into a structured format that captures both metadata and content.
The XML files are parsed to extract formatting details such as underlining, bold, and highlighting. Next, the document undergoes tokenization, creating a structured representation with text blocks representing paragraphs or coherent units of text along with their formatting information. A simplification step removes unnecessary formatting and merges adjacent tokens with similar styling.
To extract individual debate cards, the parsing procedure identifies card boundaries based on formatting and structure, extracting components such as the tag, citation, and body text. This information is organized into a structured format that captures the metadata and content of each debate card. Finally, the parsed dataset is converted back into a cleaned Hugging Face dataset, providing a human-readable version of the dataset. This structured dataset serves as the foundation for further natural language processing tasks.
3.3 Data Deduplication
After parsing the dataset and extracting individual cards, identifying and removing duplicates is essential to ensure data quality. Deduplication involves comparing the textual content of each card to identify those sharing significant portions of their text. This process enhances dataset usability by eliminating redundancy, ensuring each unique argument is represented only once.
The deduplication algorithm splits each card’s text into sentences. These sentences are then preprocessed by removing non-letter characters and converting them to lowercase. Short sentences below a certain length are filtered out to focus on meaningful content.
The algorithm retrieves and compares card IDs with a significant number of shared sentences. If the number of matching sentences exceeds a predefined threshold and their positions within the cards are within a certain range, the cards are considered duplicates. Duplicate clusters are formed by identifying all cards connected through shared sentences. A representative card is then selected from each cluster based on factors such as sentence count and content quality, and duplicates are removed iteratively.
3.4 Data Statistics
The OpenDebateEvidence dataset offers a comprehensive collection of over 3.5 million documents categorized by debate format (Policy, Lincoln-Douglas, and Public Forum). Each document is enriched with extensive metadata, including author, date, title, source, citation details, and debate round information. Standardized tags describe the type of argument, such as topicality, disadvantages, advantages, and counterplans.
Policy Debate evidence constitutes approximately two-thirds of the dataset, Lincoln-Douglas Debate evidence comprises about one-third, and Public Forum Debate evidence makes up a smaller percentage. Spanning topics from 2012 to 2023, the dataset represents over 1,400 schools and includes contributions from more than 3,200 authors.
Key statistics of the dataset are provided in Table 1, and more detailed statistics and information can be found in Appendix F.
Feature | Count |
---|---|
Total Documents | 3,571,098 |
Policy Debate Evidence | 2,380,600 |
Lincoln-Douglas Debate Evidence | 1,164,132 |
Public Forum Debate Evidence | 26,366 |
Years Covered | 2012-2023 |
Average Document Length (characters) | 3,556 |
Total Schools Represented | 1,423 |
Unique Authors | 3,217 |
Unique Topics | 68 |
3.5 Rich Metadata for Argument Structure
Each evidence document is organized with a “hat,” “pocket,” and “tag” to represent its role within a debate case.
The “pocket” indicates the top-level speech section the evidence supports, such as “1NC” for the first negative constructive speech. The “hat” denotes the broad argument category, like “Oil Disadvantage,” which aligns with a structured argument against an affirmative case. The “tag” provides a concise, biased summary of the specific argument made by the evidence. Debaters often create these tags first and then find the evidence that fits the tag.
This metadata encodes the rhetorical structure and purpose of the evidence in a practical and real-world context. The “hat” and “pocket” provide the argument’s context, while the “tag” offers a concise summary of the core claim.
For argument mining, this metadata offers valuable semantic annotations for training models on argument components and relations. “Hats” and “pockets” help models learn the overarching structure, while “tags” summarize key points.
For summarization, the hierarchical metadata enables multi-level summaries: “pockets” for high-level overviews, “hats” for key categories, and “tags” for concise core claims. The biased nature of “tags” illustrates how debaters rhetorically summarize their claims and arguments. OpenDebateEvidence is particularly rich as it includes both hierarchical biased abstractive and token-level extractive summaries.
4 Experiments
To evaluate the efficacy of the OpenDebateEvidence dataset for argument mining and summarization, we conducted a series of fine-tuning experiments using state-of-the-art language models. We also evaluated the performance of these models on two related datasets.
4.1 Experimental Setup
We employed three recent fine-tuning techniques for adapting our models to OpenDebateEvidence: Low-Rank Adaptation (LoRA) (Hu et al., 2021), Representation Fine-Tuning (ReFT) (Wu et al., 2024), and Orthogonalization (Arditi et al., 2023). These methods are chosen for their parameter efficiency and ability to prevent catastrophic forgetting. The details of these techniques are provided in Appendix D.
We perform our experiments on three datasets: OpenDebateEvidence, DebateSum, which is also a dataset of Policy Debate Evidence, and the billsum dataset from Kornilova & Eidelman (2019), a dataset of US legislation and summaries, to illustrate our fine-tuned models capabilities at performing argumentative summarization in many contexts.
We conducted two types of experiments: traditional NLP evaluation metrics and using GPT-4o as a judge model. All experiments were conducted on a 4xA100 machine from Microsoft Azure with parallelism, attention optimization, and 16-bit quantization enabled. All decoding/sampling settings were kept default. The seed value of "42" was used wherever possible.
4.1.1 Traditional NLP Metrics
For the traditional NLP metrics, we evaluated the models on validation datasets of the whole BillSum dataset and examples from OpenDebateEvidence. Each model was tasked with generating a short "abstract" summarizing the key arguments made in each document. We computed ROUGE F1 scores between the generated text and the ground-truth "tag" provided in the OpenDebateEvidence metadata and the reference summaries in BillSum. For more details see Appendix E. Additionally, we evaluated each language model’s perplexity on the sampled subsets to assess how well the models captured the overall distribution of debate and legislative language.
4.1.2 LLM as Judge
In the "LLM as Judge" experiments, we evaluated the quality of the generated abstracts using GPT-4o as the judge. Each model’s output was assessed on two criteria: the quality of the output and the quality of supporting the argument, both rated on a scale from 1 to 10. The evaluation was conducted on results from both datasets. This approach allows us to measure not only the linguistic quality of the summaries but also their effectiveness in supporting the arguments. For more details see Section E.2.2 .
4.2 Results
The results of our experiments are shown in Tables 2, 3 and 4. Both the Mistral-7B and LLaMA3-8B models achieved promising results, with the LLaMA3-8B models generally outperforming the Mistral-7B models across all ROUGE metrics. Notably, fine-tuning on a larger subset of the dataset ( examples) significantly improved performance, as evidenced by the LLaMA3-8B LoRA (1M Ex) model. This model achieved the highest scores across traditional NLP metrics and the LLM as Judge evaluations, demonstrating that extensive fine-tuning on domain-specific data is crucial for optimizing model performance in argument mining tasks.
Our results demonstrate the significant impact of using the OpenDebateEvidence dataset on model performance. On all three datasets, our fine-tuned models showed substantial improvements over the baseline models. The LLaMA3-8B LoRA model, fine-tuned on examples from OpenDebateEvidence, achieved the highest scores across all traditional NLP metrics and the LLM as Judge evaluations. This highlights the importance of extensive fine-tuning on domain-specific data.
4.2.1 OpenDebateEvidence Performance
For the OpenDebateEvidence dataset (Table 2), the LLaMA3-8B models generally outperformed the Mistral-7B models across all ROUGE metrics. Fine-tuning techniques, particularly LoRA with examples, significantly enhanced the models’ ability to generate high-quality summaries that effectively support arguments. Notably, the LoRA technique improved performance by effectively adapting model parameters with minimal additional computational resources. ReFT showed strong performance, indicating its ability to modify hidden representations in targeted subspaces, improving summarization quality. Orthogonalization, while effective, showed relatively less improvement than LoRA and ReFT, likely due to its focus on controlling specific features in the residual stream.
Model | R-1 | R-2 | R-L | Perplexity | Output Quality | Support Quality |
---|---|---|---|---|---|---|
Mistral-7B | ||||||
Base | ||||||
LoRA | ||||||
ReFT | ||||||
Orthogonal | ||||||
LLaMA3-8B | ||||||
Base | ||||||
LoRA | ||||||
ReFT | ||||||
Orthogonal | ||||||
LoRA (1M Ex) |
4.2.2 BillSum Performance
Similarly, for the BillSum dataset (Table 3), the models fine-tuned on OpenDebateEvidence demonstrated superior performance compared to their baselines. This suggests that training on OpenDebateEvidence can enhance the models’ capabilities in reasoning and summarization, even when applied to a different domain such as legislative texts. The results on BillSum confirm the transferability of the fine-tuning techniques, with LoRA once again showing the most significant improvements. This technique’s efficiency in parameter adaptation appears to be particularly beneficial for handling diverse datasets. ReFT also performed well, indicating its robustness in capturing complex argument structures across different domains. Orthogonalization, while showing improvement, was less impactful compared to the other techniques.
Model | R-1 | R-2 | R-L | Perplexity | Output Quality | Support Quality |
---|---|---|---|---|---|---|
Mistral-7B | ||||||
Base | ||||||
LoRA | ||||||
ReFT | ||||||
Orthogonal | ||||||
LLaMA3-8B | ||||||
Base | ||||||
LoRA | ||||||
ReFT | ||||||
Orthogonal | ||||||
LoRA (1M Ex) |
4.2.3 DebateSum Performance
In evaluating the DebateSum dataset (Table 4), the fine-tuned models demonstrated notable improvements over their base counterparts. The use of advanced fine-tuning techniques, particularly LoRA with a larger dataset of examples, significantly boosted the models’ performance across all ROUGE metrics. This was evident in both the Mistral-7B and LLaMA3-8B models, highlighting the effectiveness of LoRA in enhancing summarization capabilities. The ReFT technique also showed robust results, suggesting its strong ability to refine hidden representations for better summarization quality. While Orthogonalization improved the performance, it was comparatively less effective than LoRA and ReFT, likely due to its narrower focus on specific feature control within the residual streams.
Model | R-1 | R-2 | R-L | Perplexity | Output Quality | Support Quality |
---|---|---|---|---|---|---|
Mistral-7B | ||||||
Base | ||||||
LoRA | ||||||
ReFT | ||||||
Orthogonal | ||||||
LLaMA3-8B | ||||||
Base | ||||||
LoRA | ||||||
ReFT | ||||||
Orthogonal | ||||||
LoRA (1M Ex) |
5 Potential Applications and Future Directions
5.1 Argument Quality Assessment
The extensive collection of real-world debate arguments and rich metadata in the dataset provides a unique opportunity to study argument quality. By analyzing factors such as duplicate count, position in the debate round, and debate outcomes, researchers can develop models to automatically assess persuasiveness, relevance, and overall quality. This could lead to tools that offer real-time feedback and suggestions for debaters, improving debate communities’ inclusivity by reducing reliance on expensive human judges.
5.2 Multi-level Argument Summarization
The hierarchical structure of arguments in the dataset enables research into multi-level argument summarization. Models can generate summaries at various granularities, from concise one-sentence summaries to detailed overviews. This aligns with the emerging interest in query-focused and hierarchical summarization in the NLP community.
5.3 Argument Generation and Rebuttal
With its diverse collection of arguments and counterarguments, OpenDebateEvidence is valuable for developing argument generation models. By studying successful debaters’ patterns and strategies, researchers can create systems that generate persuasive and relevant arguments on given topics. Additionally, the dataset’s balanced coverage of affirmative and negative arguments enables the development of rebuttal generation models that counter opposing arguments.
5.4 Cross-domain Argument Mining
While primarily focused on competitive debate formats, the argumentation skills and techniques in the dataset are applicable across domains such as legal reasoning, policy-making, and online discussions. Researchers can develop general argument mining models for diverse argumentative texts, advancing areas like legal document analysis, opinion mining, and fact-checking. Integrating OpenDebateEvidence with fact-checking and misinformation detection datasets could yield robust models for identifying and countering misleading claims in public discourse.
5.5 Understanding Persuasion and Sentiment
The dataset captures both arguments’ logical structure and debaters’ rhetorical strategies and emotional appeals. By studying the interplay between rational argumentation and affective language, researchers can develop sophisticated models for understanding sentiment’s role in persuasion and decision-making. This has applications in political science, marketing, and human-computer interaction.
5.6 Debate Coaching and Education
OpenDebateEvidence holds significant potential for debate coaching and education. Analyzing successful arguments and strategies can help coaches identify best practices and develop more effective training programs. The dataset can also serve as a resource for creating educational materials such as argument templates, case studies, and interactive learning tools. This will support aspiring debaters in skill development.
6 Conclusion
In this paper, we introduce OpenDebateEvidence, a large-scale dataset for argument mining and summarization, comprising over 3.5 million documents from the OpenCaseList project. After extensive preprocessing and deduplication, we created a high-quality dataset enriched with metadata that captures the hierarchical structure and semantics of debate arguments. Our experiments demonstrated the potential of fine-tuning modern large language models for argumentative abstractive summarization in a parameter-efficient manner. The results showed significant improvements in performance on the OpenDebateEvidence, DebateSum, and BillSum datasets, validating the effectiveness of our approach.
By providing this resource to the community, we aim to advance computational argumentation and support practical applications for debaters, educators, and researchers. The OpenDebateEvidence dataset, with its rich metadata and diverse collection of debate formats, offers an excellent resource for developing and evaluating argument mining and summarization models.
Future work includes exploring additional fine-tuning techniques and expanding the dataset to include more diverse debate formats. We also plan to investigate the integration of multimodal data to enhance argument comprehension and explore cross-linguistic adaptations to broaden the applicability of our models. By continuing to refine and expand this resource, we hope to further enhance language models’ capabilities in understanding and generating complex argumentative discourse.
References
- Arditi et al. (2023) Andy Arditi, Oscar Obeso, Aaquib111, wesg, and Neel Nanda. Refusal in llms is mediated by a single direction. https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction, 2023. Accessed: 2024-06-18.
- Bar-Haim et al. (2020) Roy Bar-Haim, Lilach Eden, Roni Friedman, Yoav Kantor, Dan Lahav, and Noam Slonim. From arguments to key points: Towards automatic argument summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4029–4039, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.371. URL https://aclanthology.org/2020.acl-main.371.
- Cheng et al. (2022) Liying Cheng, Lidong Bing, Ruidan He, Qian Yu, Yan Zhang, and Luo Si. IAM: A comprehensive and large-scale dataset for integrated argument mining tasks. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2277–2287, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.162. URL https://aclanthology.org/2022.acl-long.162.
- Community (2024a) Contention AI Community. Contention ai. https://contention.ai, 2024a. Accessed: 2024-06-18.
- Community (2024b) Debate Cards Community. Debate cards. http://debate.cards, 2024b. Accessed: 2024-06-18.
- Community (2024c) Logos Debate Community. Logos debate. https://logos-debate.netlify.app, 2024c. Accessed: 2024-06-18.
- Ein-Dor et al. (2019) Liat Ein-Dor, Eyal Shnarch, Lena Dankin, Alon Halfon, Benjamin Sznajder, Ariel Gera, Carlos Alzate, Martin Gleize, Leshem Choshen, Yufang Hou, Yonatan Bilu, Ranit Aharonov, and Noam Slonim. Corpus wide argument mining – a working solution, 2019.
- Elaraby & Litman (2022) Mohamed Elaraby and Diane Litman. ArgLegalSumm: Improving abstractive summarization of legal documents with argument mining. In Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, and Seung-Hoon Na (eds.), Proceedings of the 29th International Conference on Computational Linguistics, pp. 6187–6194, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.540.
- Gleize et al. (2019) Martin Gleize, Eyal Shnarch, Leshem Choshen, Lena Dankin, Guy Moshkowich, Ranit Aharonov, and Noam Slonim. Are you convinced? choosing the more convincing evidence with a Siamese network. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 967–976, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1093. URL https://aclanthology.org/P19-1093.
- Hardy (2024) Aaron Hardy. Opencaselist project. https://opencaselist.com/, 2024. Accessed: 2024-06-18.
- Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
- Joshi et al. (2023) Omkar Joshi, Priya Pitre, and Yashodhara Haribhakta. ArgAnalysis35K : A large-scale dataset for argument quality analysis. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13916–13931, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.778. URL https://aclanthology.org/2023.acl-long.778.
- Kornilova & Eidelman (2019) Anastassia Kornilova and Vladimir Eidelman. BillSum: A corpus for automatic summarization of US legislation. In Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, and Fei Liu (eds.), Proceedings of the 2nd Workshop on New Frontiers in Summarization, pp. 48–56, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-5406. URL https://aclanthology.org/D19-5406.
- Magnusson & Friedman (2021) Ian Magnusson and Scott Friedman. Extracting fine-grained knowledge graphs of scientific claims: Dataset and transformer-based results. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 4651–4658, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.381. URL https://aclanthology.org/2021.emnlp-main.381.
- Mangrulkar et al. (2022) Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
- Roush & Balaji (2020) Allen Roush and Arvind Balaji. DebateSum: A large-scale argument mining and summarization dataset. In Elena Cabrio and Serena Villata (eds.), Proceedings of the 7th Workshop on Argument Mining, pp. 1–7, Online, December 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.argmining-1.1.
- Ruiz-Dolz & Iranzo-Sánchez (2024) Ramon Ruiz-Dolz and Javier Iranzo-Sánchez. Vivesdebate-speech: A corpus of spoken argumentation to leverage audio features for argument mining, 2024.
- Shen et al. (2022) Zejiang Shen, Kyle Lo, Lauren Yu, Nathan Dahlberg, Margo Schlanger, and Doug Downey. Multi-lexsum: Real-world summaries of civil rights lawsuits at multiple granularities, 2022.
- Shukla et al. (2022) Neelesh Shukla, Amit Vaid, Raghu Katikeri, Sangeeth Keeriyadath, and Msp Raja. DiMSum: Distributed and multilingual summarization of financial narratives. In Mahmoud El-Haj, Paul Rayson, and Nadhem Zmandar (eds.), Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022, pp. 65–72, Marseille, France, June 2022. European Language Resources Association. URL https://aclanthology.org/2022.fnp-1.9.
- Slonim et al. (2021) Noam Slonim, Yonatan Bilu, Carlos Alzate, et al. An autonomous debating system. Nature, 591:379–384, 2021. doi: 10.1038/s41586-021-03215-w. URL https://doi.org/10.1038/s41586-021-03215-w.
- Wu et al. (2024) Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. Reft: Representation finetuning for language models, 2024.
- Xu et al. (2023) Ruochen Xu, Song Wang, Yang Liu, Shuohang Wang, Yichong Xu, Dan Iter, Pengcheng He, Chenguang Zhu, and Michael Zeng. LMGQS: A large-scale dataset for query-focused summarization. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 14764–14776, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.984. URL https://aclanthology.org/2023.findings-emnlp.984.
- Zhao et al. (2021) Mingjun Zhao, Shengli Yan, Bang Liu, Xinwang Zhong, Qian Hao, Haolan Chen, Di Niu, Bowei Long, and Weidong Guo. Qbsum: A large-scale query-based document summarization dataset from real-world applications. Computer Speech and amp; Language, 66:101166, March 2021. ISSN 0885-2308. doi: 10.1016/j.csl.2020.101166. URL http://dx.doi.org/10.1016/j.csl.2020.101166.
- Zhong et al. (2021) Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, and Dragomir Radev. Qmsum: A new benchmark for query-based multi-domain meeting summarization, 2021.
Appendix A Limitations
A.1 Representation Bias
While the OpenDebateEvidence dataset is extensive, it may not fully represent the diversity of argumentation styles and topics across all debate communities. The dataset primarily includes evidence from American high school and college debates, and therefore might not capture the nuances of debates in other cultural or educational contexts, or in other languages.
Format-Specific Challenges
The unique formatting conventions used in debate evidence may present challenges for standard natural language processing tools. The presence of shorthand, abbreviations, and specialized jargon may require additional preprocessing or specialized models to accurately interpret and analyze the text.
Incomplete or Inconsistent Metadata
While the dataset includes extensive metadata, there may be inconsistencies or gaps in this information. For example, citation details might be missing or incorrect for some documents, and the standardized tags describing the type of argument might not be uniformly applied across all documents.
Potential Noise and Redundancy
The dataset’s size and diversity may also introduce noise and redundancy. Duplicate documents, irrelevant content, or errors in formatting and citation may exist within the dataset, potentially affecting the quality of the analyses in spite of efforts taken to reduce or eliminate this.
Limited Accessibility to Public Forum Debate Evidence
With Public Forum Debate making up such a small percentage of the evidence included within OpenDebateEvidence, research focusing on this specific debate format may face limitations in terms of data quantity and diversity.
Appendix B Ethics Statement
The OpenDebateEvidence dataset, presented in this paper, derives from openly shared debate evidence across various educational forums and debate formats. This dataset strictly adheres to the principles of fair use, focusing on academic and research intent. The files which make up OpenDebateEvidence have been hosted online in some cases for over a decade without any known ethical issues arising as a result of it.
We performed this research and released this dataset with the full blessing and support of the OpenCaseList project.
Appendix C Social Impacts
Our introduction of OpenDebateEvidence, a comprehensive dataset sourced from the American Competitive Debate community, is poised to have significant positive societal impacts. By offering a rich collection of over 3.5 million documents with detailed metadata, this dataset provides an unparalleled resource for training and evaluating language models in the domain of argument mining and summarization.
The comprehensive nature of OpenDebateEvidence, capturing the nuanced complexity of arguments in high school and college debates, will enable more rigorous and representative assessments of language models. This, in turn, will drive advancements in computational argumentation research and applications.
Practitioners and researchers will benefit from this benchmark, which is designed to reflect real-world argumentative scenarios more accurately. The dataset’s ability to enhance model performance across various argumentative tasks suggests its utility in improving the robustness and reliability of language technologies.
Moreover, by making OpenDebateEvidence publicly available, we encourage broader participation and innovation in this field. This democratization of resources can lead to more diverse contributions and perspectives, fostering a more inclusive research environment.
In summary, we believe our work will accelerate research, improve model evaluation and training, and ultimately enhance the capabilities of language models in handling complex argumentative texts, with no foreseeable negative societal impacts.
Appendix D Fine-Tuning Techniques
D.1 Low-Rank Adaptation (LoRA)
LoRA introduces low-rank matrices into the model’s architecture, reducing the number of trainable parameters. Given a weight matrix , LoRA decomposes it into two low-rank matrices and , where . The updated weight matrix is then . We used the PeFT Mangrulkar et al. (2022) package with default settings (rank 8).
D.2 Representation Fine-Tuning (ReFT)
ReFT modifies hidden representations through targeted interventions in specific subspaces. Low-rank Linear Subspace ReFT (LoReFT) is defined as:
where has orthonormal rows. We used the PyReft package with default settings (rank 4).
D.3 Orthogonalization
Orthogonalization controls specific features in the model’s residual stream by modifying the weights. Given a direction , each weight matrix is modified as:
We used the notebook for this process.
Appendix E Prompts Used in Experiments
E.1 OpenDebateEvidence/DebateSum
E.1.1 Traditional NLP Metrics Prompt
SYSTEM PROMPT: You are a Policy Debater.
USER PROMPT:
DOCUMENT: <full text of the document>
Provide an abstractive summary/card-tag of the argument made in the document above.
ABSTRACT:
E.1.2 LLM as Judge Prompt
SYSTEM PROMPT: You are a Policy Debate Judge.
USER PROMPT:
DOCUMENT: <full text of the document>
ABSTRACT: <generated abstract>
Score the abstract from 0-10 on it’s how well it supports the documents argument, and on its general quality
E.2 BillSum
E.2.1 Traditional NLP Metrics Prompt
SYSTEM PROMPT: You are a lawmaker.
USER PROMPT:
DOCUMENT: <full text of the document>
Provide an abstractive summary of the law made in the document above.
ABSTRACT:
E.2.2 LLM as Judge Prompt
SYSTEM PROMPT: You are a lawmaker.
USER PROMPT:
DOCUMENT: <full text of the document>
ABSTRACT: <generated abstract>
Score the abstract from 0-10 on it’s how well it supports the documents argument, and on its general quality
Appendix F OpenDebateEvidence Dataset Details
F.1 Dataset Card for OpenDebateEvidence
F.1.1 Dataset Summary
This dataset is a gargantoum follow-up to DebateSum, which includes a ton of improvements
Among those improvements are the following:
-
•
Massively increased size (about 25X the size of DebateSum), including nearly all debate evidence ever open sourced over the past 20 years from High School and College Public Forum, Policy, and Lincoln Douglas debate leagues
-
•
Far more metadata: Lots of new columns indicating everything from the number of times a piece of evidence has been seen (a good heuristic for evidence quality) to the teams and tournaments and rounds where a piece of evidence was deployed
-
•
Better deduplication and parsing techniques, including better accounting of the hierarchical nature that debaters use for underlining evidence
F.1.2 Supported Tasks and Leaderboards
This dataset is useful for text generation, summarization, information retrieval, question answering, and related tasks. This dataset is further highly useful as a "trustworthy" dataset. All evidence within it has corresponding citations and is in general "factual" or grounded in facts. We do the evaluation in our paper, establishing the first "leaderboard" for measuring the performance of models trained on this dataset.
F.1.3 Languages
English with very minor exceptions (i.e. evidence from performance cases using non-English evidence to make anti-colonialist arguments)
F.1.4 Dataset Creation
Gathered from the OpenCaseList project with their enthusiastic permission.
F.1.5 Source Data
Debate Evidence from NDCA/NDT debate leagues from 2002-2022.
F.1.6 Dataset Format
This dataset is originally contained in csv files, which were auto-converted into the parqueet dataset format by Huggingface. It’s available for download and consumption in both formats.
F.1.7 Hosting, licensing, and maintenance plan
We host and maintain our dataset on Huggingface through its "dataset" feature. We plan to update this dataset every year with new evidence as it is released by debaters, causing this to be a "living" dataset. We pledge to make sure that this dataset remains accessible for the foreseeable future, and the ability to regenerate this dataset is always preserved as its source documents are freely downloadable on OpenCaseList’s website.
F.1.8 Discussion of Biases
Competitive debate at the highest levels has increasingly rewarded teams who cite particular subfields of philosophy. A partial list of these highly represented topics is given below
-
•
Postmodernism
-
•
Poststructuralism
-
•
Frankfurt School
-
•
Critical Theory
-
•
Critical Race Theory
-
•
Queer Theory
-
•
Feminism
These cannons are dominated by so called "left-wing" thinkers and have mostly marginalized so called "right-wing" thinkers within them with some notable exceptions
Note that despite a strong "left-wing" bias, large swaths of left-wing thought, such as anarchism, are relatively absent.
Beyond this, most of the evidence was gathered with the argument being made first, and the evidence found after-the-fact to support it. This means that while the evidence is almost all "truthful", a lot of important information which might not help support an argument may be omitted.
F.1.9 Other Known Limitations
There are cases of academic dishonesty within this dataset (i.e. evidence that had specific insertions made by a debater which weren’t in the original text). It’s also possible that the source had changed in-between when it was cited and retrieved. We believe that this is extremely rare in practice, affecting no more than 200 examples.
F.1.10 Consent
We got the enthusiastic consent and approval to use this data from the OpenCaseList project. Debaters who submit their evidence there fully consent for this evidence to be freely used, including for curated datasets like this
F.1.11 Personal Information
We removed all Personal Information from the metadata of this evidence (first/last name of debaters).
F.1.12 Licensing Information
All data within this dataset is clearly used within an extracurricular, educational activity. This means that any "copyright" issues from reproduction of copyrighted articles within the dataset are allowed and exempted under US copyright law. The OpenCaseList project fully blesses this work and has published this evidence with a permissive, MIT license.
F.1.13 Author Statement
We, the authors, bear all responsibility in case of violation of rights, etc., and we confirm that the data is licensed with an MIT license.
Column Name | Description |
---|---|
id | Unique identifier for the row |
tag | Biased abstractive summary of the evidence / argument made by debater with evidence. |
cite | String indicating the short citation of the source used for the evidence |
fullcite | Full citation of the source used for the evidence |
summary | Underlined longer word level extractive summary of the evidence, note that summary is biased |
towards supporting the tag argument | |
spoken | Highlighted shorter extractive summary of the evidence / The spoken text of the evidence, note |
that summary is biased towards supporting the tag argument | |
fulltext | The full text of the evidence |
textLength | The length of the text in the evidence in characters |
markup | The full text of the evidence with HTML markup for parsing / visualization purposes |
String indicating the virtual “pocket” (top level section, usually the speech name) in which the | |
evidence is stored within its original document | |
hat | String indicating the virtual “hat” (medium level section, usually the broad type of argument) |
in which the evidence is stored within its original document | |
block | String indicating the virtual “block” (low level section, usually the specific type of argument) |
in which the evidence is stored within its original document | |
bucketId | Unique identifier for the bucket in which the evidence is stored |
duplicateCount | The number of duplicates of the evidence. This acts as a rough proxy for evidence quality, as |
good evidence will be duplicated across many debate files | |
fileId | Unique identifier for the file in which the evidence is stored |
filePath | The file path of the file in which the evidence is stored |
roundId | Unique identifier for the debate round in which the evidence was used |
side | The debate side on which the evidence was used (Affirmative or Negative) |
tournament | The name of the tournament in which the evidence was used |
round | The round number in which the evidence was used |
opponent | The name of the opposing team in the debate round in which the evidence was used |
judge | The name of the judge in the debate round in which the evidence was used |
report | A report associated with the evidence filled out by one of the debaters, usually summarizing |
the arguments presented | |
opensourcePath | The path to the open-source repository in which the evidence is stored |
caselistUpdatedAt | The date on which the caselist was last updated |
teamId | Unique identifier for the team |
teamName | The name of the team |
teamDisplayName | The display name of the team |
teamNotes | Notes associated with the team |
debater1First | The first name of the first debater of the team |
debater1Last | The last name of the first debater of the team |
debater2First | The first name of the second debater of the team |
debater2Last | The last name of the second debater of the team |
schoolId | Unique identifier for the school |
schoolName | The name of the school |
schoolDisplayName | The display name of the school |
state | The state in which the school is located |
chapterId | Unique identifier for the chapter |
caselistId | Unique identifier for the caselist |
caselistName | The name of the caselist |
caselistDisplayName | The display name of the caselist |
year | The year in which the debate round took place |
event | The event in which the debate round took place |
level | The level of the debate (e.g., college, high school, etc.) |
teamSize | The number of debaters on the team |
Column Name | Sample Data | ||
---|---|---|---|
id | 282,369 | ||
tag | “Biodiversity loss causes human extinction.” | ||
cite | “McCarthy 18” | ||
fullcite | “Joe McCarthy 18. Staff Writer…” | ||
summary | “As the sixth mass extinction event accelerates…” | ||
spoken | “As the sixth mass extinction accelerates humans ris…” | ||
fulltext | “As the sixth mass extinction event accelerates around the world…” | ||
textLength | 3,556 | ||
markup | “<h4>Biodiversity loss causes human extinction.</h4><p>Joe <strong>McCarthy 18…’ | ||
“1NC” | |||
hat | “OFF” | ||
block | “1NC—DA” | ||
bucketId | 18,967 | ||
duplicateCount | 122 | ||
fileId | 3,564 | ||
filePath | “./documents/ndtceda22/Emory/KiLo/Emory-KiLo-Aff-JW-Round-3.docx” | ||
roundId | 932,619 | ||
side | “A” | ||
tournament | “JW Patterson Debates hosted by UK” | ||
round | “3” | ||
opponent | “West Georgia CL” | ||
judge | “Ka***” | ||
report |
|
||
opensourcePath | “ndtceda22/Emory/KiLo/Emory-KiLo-Aff-JW-Patterson-Debates.docx” | ||
caselistUpdatedAt | “2022-10-05 19:30:41” | ||
teamId | 80,494 | ||
teamName | “KiLo” | ||
teamDisplayName | “Emory KiLo” | ||
debater1First | “Aa***” | ||
debater1Last | “Ki***” | ||
debater2First | “Lu***” | ||
debater2Last | “Lo***” | ||
schoolId | 27,030 | ||
schoolName | “Emory” | ||
schoolDisplayName | “Emory” | ||
caselistId | 2,001 | ||
caselistName | “ndtceda22” | ||
caselistDisplayName | “NDT/CEDA College 2022-23” | ||
year | 2,022 | ||
event | “cx” | ||
level | “college” | ||
teamSize | 2 |
Feature | Top Categories/Values | Counts |
Year (Top 5) | 2020 | 850,607 |
2021 | 787,685 | |
2019 | 609,703 | |
2018 | 423,378 | |
2017 | 258,742 | |
Event | cx | 2,380,600 |
ld | 1,164,132 | |
pf | 26,366 | |
Caselist DisplayName (Top 3) | HS LD 2020-21 | 383,489 |
HS LD 2021-22 | 326,166 | |
HS Policy 2021-22 | 262,292 | |
TeamSize | 2 | 2,406,966 |
1 | 1,164,132 | |
Level | hs | 2,144,757 |
college | 1,426,341 | |
State (Top 5) | None | 1,875,021 |
CA | 450,734 | |
TX | 335,087 | |
GA | 108,011 | |
IL | 94,084 | |
Side | N | 1,992,850 |
A | 1,578,248 | |
DuplicateCount (Top 5) | 1 | 1,176,971 |
2 | 114,268 | |
3 | 92,667 | |
4 | 78,804 | |
5 | 67,166 |