1 Dept. of Computer Science & Engineering, University of Moratuwa, Sri Lanka
  Email: {kushan.22,nisansa}@cse.mrt.ac.lk
2 ConscientAI, Sri Lanka
  Email: {kushan,cd}@conscient.ai

M2DS: Multilingual Dataset for Multi-document Summarisation

Kushan Hewapathirana (ORCID 0009-0008-1580-0699) 1,2; Nisansa de Silva (ORCID 0000-0002-5361-4810) 1; C.D. Athuraliya (ORCID 0009-0007-4696-5210) 2
Abstract

In the rapidly evolving digital era, there is an increasing demand for concise information as individuals seek to distil key insights from various sources. Recent attention from researchers on Multi-document Summarisation (MDS) has resulted in diverse datasets covering customer reviews, academic papers, medical and legal documents, and news articles. However, the English-centric nature of these datasets has left a conspicuous void for multilingual datasets in today’s globalised digital landscape, where linguistic diversity is celebrated. Media platforms such as the British Broadcasting Corporation (BBC) have disseminated news in 20+ languages for decades. With only 380 million people speaking English as their first language, accounting for less than 5% of the global population, the vast majority primarily relies on other languages. These facts underscore the need for inclusivity in MDS research, utilising resources from diverse languages. Recognising this gap, we present the Multilingual Dataset for Multi-document Summarisation (M2DS), which, to the best of our knowledge, is the first dataset of its kind. It includes document-summary pairs in five languages from BBC articles published between 2010 and 2023. This paper introduces M2DS, emphasising its unique multilingual aspect, and includes baseline scores from state-of-the-art MDS models evaluated on our dataset.

Keywords: Multi-document Summarisation · Multilingual · Natural Language Processing

1 Introduction

The art of document summarisation relies on intricate language skills: the ability to navigate through extensive texts, extract important information, and distil it into concise summaries. In recent years, the surge in deep learning within Natural Language Processing (NLP) has sparked significant interest among researchers in this particular task [28, 2, 1]. Summarisation stands as a significant challenge in NLP, gaining paramount importance as the demand for easily digestible content continues to soar [21, 9].

The field of multi-document summarisation (MDS) faces a shortage of comprehensive datasets, in contrast to the progress made in single-document summarisation (SDS). While SDS datasets have expanded to include multilingual summarisation, MDS is still relatively new but shows promise. Recent MDS research has explored various domains, such as customer reviews, academic papers, medical and legal documents, and news articles, with a predominant focus on the English language [28, 2, 1, 21]. Despite the availability of extensive SDS datasets like CNN/Daily Mail [15], Gigaword Corpus [31], Newsroom corpus [13], and New York Times [35], datasets designed for versatile MDS applications remain scarce. Existing MDS datasets such as DUC (https://duc.nist.gov), TAC (https://tac.nist.gov), and Multi-News [10] predominantly serve the news domain and are limited to the English language.

However, in a world boasting over 7,000 languages, the crucial requirement for multilingual approaches in MDS has become evident. Consider English, with its lexicon of over 171,146 words and a staggering 1.5 billion speakers, versus languages like Sinhala, spoken by approximately 16 million people in Sri Lanka [8]. According to the 26th edition of Ethnologue, published in 2023, only 380 million people speak English as their first language, which accounts for less than 5% of the global population. The total English-speaking population (first-language and second-language speakers combined) amounts to 20% of the global population as of 2023, which means that the vast majority of people depend primarily on other languages [8]. This underscores the importance of an inclusive approach in MDS research, where multilingual models cater for diverse languages.

To address this, the research introduces the Multilingual Dataset for Multi-document Summarisation (M2DS). This dataset aims to facilitate the development of robust MDS models across diverse languages, including low-resource languages, for real-world applications. Covering languages such as English, Japanese, Korean, Tamil, and Sinhala, M2DS is considered a pioneering effort in multilingual MDS, complementing existing single-document summarisation datasets in the multilingual domain.

2 Related Work

This section aims to delve into the MDS landscape, exploring existing datasets, multilingual text summarisation datasets, and current state-of-the-art models. This provides insights into the diverse facets and recent advancements in MDS.

2.1 Major MDS Datasets Across Diverse Domains

Despite being essential for various applications, MDS datasets are relatively scarce compared to SDS datasets. However, the following key datasets have significantly influenced summarisation research. The DUC and TAC datasets set early benchmarks in the news domain [28, 2]. The Multi-News [10] dataset offers substantial size and traceability in the news domain. WikiSum [24] leverages Wikipedia and search engine results for abstractive summarisation challenges. Multi-XScience [27] blends arXiv (https://arxiv.org) papers and the Microsoft Academic Graph (MAG) [37] for scientific writing challenges. BigSurvey [25] and MS^2 [6] contribute to scientific writing, focusing on comprehensive summaries and consolidating conflicting evidence, respectively.

Domain-specific datasets like Rotten Tomatoes [19] and WikiHow [17] diversify summarisation research into movie reviews and knowledge base articles. In customer reviews, Opinosis [11] and OPOSUM [3] are significant, with Opinosis providing professionally written golden summaries for model training and evaluation, and OPOSUM including domain and polarity information across six product categories.

However, a notable gap exists in these datasets: they primarily cater to English, highlighting the pressing need for multilingual datasets. Embracing linguistic diversity can drive global advancements in summarisation research, marking a fertile ground for innovation and exploration. The development of summarisation datasets in multiple languages stands as a promising avenue for future research and inclusivity in the field.

2.2 Existing MDS Models

Transformer architecture-based models, particularly those pre-trained on large datasets, have gained attention for their ability to capture inter-document relationships and generate informative summaries. Examples include BERTSUM [26], using a hierarchical encoder, BART [20], designed as a denoising auto-encoder, PEGASUS [42], leveraging self-supervised learning, and T5 [33], a text-to-text transformer.

In the MDS domain, PRIMERA [41], based on the LongFormer Encoder-Decoder (LED) [4] architecture, stands out, surpassing previous models with a synthetic summary generation strategy during pre-training. DAMEN [30], tailored for the medical domain, combines BERT models with discriminative methods. CGSUM [5] introduces a citation-guided summarization approach for scientific papers.

Despite these advancements, challenges persist in accurately reflecting conflicting information, especially in multi-document scenarios [7]. In multilingual MDS, progress is limited, often relying on linear programming models, and summaries are often in English rather than the original languages, limiting language coverage [29].

2.3 Prior Work on Multilingual MDS

The Workshop on Multilingual Summarisation (MultiLing; https://aclanthology.org/venues/multiling/) within the ACL Anthology has been a crucial focal point in multilingual summarisation research, and the 2013 workshop specifically focused on Multilingual MDS [12]. During this event, a Multilingual MDS corpus was constructed, featuring languages like Arabic, English, Greek, Chinese, Romanian, Czech, Hebrew, and Spanish. The corpus creation involved selecting English texts and employing a sentence-by-sentence translation approach for the featured languages [21, 9].

In terms of model concepts, Marina et al. (2013) [29] introduced a novel text representation model extending the classic Vector Space Model [34] to hyperplanes and half-spaces. They reformulated the extractive summarisation problem as an optimisation task using linear programming, addressing the challenge of representing a large number of extracts without explicit computation. The optimal solution was found by minimising a distance function in polynomial time. While an evaluation was not conducted, the authors suggested potential assessments using Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores [23]. This approach to multilingual MDS, built on new text representation models and optimisation strategies, has laid a foundation for further exploration and evaluation in subsequent research [29].

2.4 Existing Multilingual Text Summarisation Datasets

In recent years, there has been a notable increase in research exploring the benefits of summarising across diverse languages, particularly in bilingual settings [39]. Our focus is directed towards datasets relevant to multilingual summarising, covering SDS, MDS, and Cross-Lingual Summarization (CLS).

While multilingual SDS has seen significant progress, research in multilingual MDS is limited. CLS, involving generating summaries in one language for documents in another language, has gained momentum with the development of multilingual SDS. Many multilingual SDS efforts have transitioned to include CLS components in their datasets [21, 9, 39, 14].

Key multilingual SDS datasets include MLSUM [36], featuring 1.5 million news articles across multiple languages; XL-Sum [14], a diverse dataset containing 1.35 million articles in 44 languages; WikiLingua [18], one of the largest parallel multilingual summarisation datasets; MLGSum [40], drawing from various news providers; and M3LS [39], comprising over 1.11 million multilingual multi-modal instances across 20 languages. Despite their contributions to SDS, the exploration of multilingual MDS remains limited due to the absence of a dedicated high-quality dataset.

3 M2DS Dataset

Figure 1: Process of dataset development. The golden summary for each original article was generated by logically combining its own summary and summaries of its related articles, whereas the original article and the related articles served as the collection of multi-documents. These pairs formed multi-document clusters.

This section provides an overview of the data sources, collection, and pre-processing procedures employed in this study. The dataset, named M2DS, consists of news articles in five languages, each paired with professionally written summaries sourced from the BBC. The summaries, crafted by editors, include links to the original articles for reference. The study emphasises transparency and reproducibility, with a commitment to providing links and scripts for replicating the dataset from the specified sources.

Dataset                 | No. of documents | No. of clusters | Avg. no. of documents per cluster | Domain
Multi-News              | 56.0k*           | 16.0k           | 3.5*                              | News articles
Multi-Xscience          | 40.0k*           | 14.0k           | 2.8*                              | Related work section in scientific articles
Wikisum¦                | 1.5M*            | 37.5k           | 40.0*                             | Wikipedia articles¦
BigSurvey-MDS¢          | 430.0k*          | 7.0k            | 61.4*                             | Human-written survey papers on various domains¢
PEERSUM                 | 11.9k‖           | 1.5k            | 7.8‖                              | Peer reviews of scientific publications
MS^2                    | 470.0k†          | 20.0k           | 23.5†                             | Reviews of scientific publications in medical domain
Rotten Tomato Dataset↑  | 244.0k‡          | 9.0k            | 26.8‡                             | Movie reviews
M2DS                    | 180.0k           | 51.5k           | 3.5                               | News articles
 - English              | 67.0k            | 17.0k           | 3.9                               |
 - Tamil                | 32.0k            | 10.0k           | 3.2                               |
 - Japanese             | 29.0k            | 11.0k           | 2.6                               |
 - Korean               | 27.0k            | 8.0k            | 3.4                               |
 - Sinhala              | 23.5k            | 5.5k            | 4.2                               |
Table 1: MDS dataset statistics. The sources are as follows: *Xiao et al. (2022) [41], DeYoung et al. (2021) [6], DeYoung et al. (2023) [7], Fabbri et al. (2019) [10], Lu et al. (2020) [27], Li et al. (2022) [22], ¦Liu et al. (2018) [24], ¢Liu et al. (2023) [25], ↑Leon et al. (2020) [19].

3.1 Dataset Development

In the rapidly changing digital landscape, the significant increase in online news articles has led to a growing demand for concise and informative content. To address this need, we introduce the Multilingual Dataset for Multi-document Summarisation (M2DS). Emphasising language inclusivity, the dataset focuses on linguistic diversity and uses BBC News as the primary source due to its global coverage and the availability of its articles in multiple languages.

We utilised the M3LS dataset [39] to extract links of parsed articles in each language. This dataset served as a valuable foundation for creating our dataset by providing corresponding Twitter page links for each BBC news article. To ensure the reliability of the M3LS dataset, its authors conducted a manual assessment of article and summary quality, evaluating factors such as informativeness, length, and the ability to capture essential information. This assessment involved a review of 100 articles in four languages from their dataset. For each article, the authors read the text and assigned the golden summary a score from 1 to 5, with 5 representing the best possible summary, one that captures most of the crucial information from the given article, and 1 representing the worst. Notably, more than 70 articles across the evaluated languages received a score of over 4 out of 5 in their analysis [39].

Assuming uniformity in the quality of articles published by BBC across various domains, the authors extrapolated that this high-quality standard holds true for every language in their dataset [39]. This verification process ensured the overall quality and reliability of the BBC articles and their summaries derived from the M3LS dataset for our study.

Once we completed the first stage, we had an SDS dataset similar to the M3LS dataset, consisting of (1) BBC articles extracted from the links included in the M3LS dataset, (2) the corresponding summaries, and (3) links to related articles listed in the original articles. The transformation from an SDS dataset to an MDS dataset involved extracting article-summary pairs from the related links, as illustrated in Figure 1. To ensure the quality of these summaries and the relatedness of articles in each cluster, we manually verified a sample of 10 clusters, containing 2-10 articles per cluster, for the English, Tamil, and Sinhala languages.
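The SDS-to-MDS transformation described above can be sketched as follows. This is a minimal illustration, not the released pipeline: the `Article` record and `build_cluster` function are hypothetical names, the 2-10 cluster size bounds are taken from the dataset description, and plain concatenation stands in for the "logical combination" of summaries shown in Figure 1.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical SDS record: one article, its summary, and its related links."""
    url: str
    text: str
    summary: str
    related_urls: list[str] = field(default_factory=list)


def build_cluster(article, articles_by_url, min_docs=2, max_docs=10):
    """Return a multi-document cluster for `article`, or None if it is too small."""
    members = [article]
    seen = {article.url}
    for url in article.related_urls:
        related = articles_by_url.get(url)
        if related is None or url in seen:
            continue  # skip links that were not scraped or that repeat
        members.append(related)
        seen.add(url)
        if len(members) == max_docs:
            break
    if len(members) < min_docs:
        return None
    return {
        "documents": [a.text for a in members],
        # Assumption: concatenation is only a stand-in for how summaries are combined.
        "summary": " ".join(a.summary for a in members),
    }


corpus = {
    "u1": Article("u1", "Text one.", "Summary one.", ["u2", "u3", "u2"]),
    "u2": Article("u2", "Text two.", "Summary two."),
}
cluster = build_cluster(corpus["u1"], corpus)
```

Here the unresolved link `u3` and the repeated link `u2` are skipped, so the cluster holds two documents. An article with no resolvable related links yields no cluster.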

The articles sourced from the M3LS dataset are dated between 2010 and 2021. To expand temporal coverage, we collected links from the front page of the BBC News site, focusing on articles from December 2021 to December 2023. We removed repeated links so that each document cluster is unique and the dataset is free from duplication. The dataset is structured in the Hugging Face DatasetDict format, offering ease of access. The dataset can be found at https://huggingface.co/datasets/KushanH/m2ds and https://osf.io/7gjtm/files/osfstorage. The code and pre-trained models are available at https://github.com/KushanMH/m2ds.
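Link deduplication of the kind described above might look like the sketch below. The normalisation rule (ignore scheme, fragment, query string, and trailing slash) is an assumption made for illustration, and the example URLs are hypothetical. The paper does not specify how repeated links were detected.

```python
from urllib.parse import urlsplit


def normalise(link):
    """Reduce a link to host + path so trivially different spellings compare equal."""
    parts = urlsplit(link.strip())
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"


def deduplicate_links(links):
    """Keep the first occurrence of each normalised link, preserving order."""
    seen = set()
    unique = []
    for link in links:
        key = normalise(link)
        if key not in seen:
            seen.add(key)
            unique.append(link)
    return unique


# Hypothetical links, used only to exercise the function.
links = [
    "https://www.bbc.com/news/world-1",
    "https://www.bbc.com/news/world-1/",
    "http://www.bbc.com/news/world-1#comments",
    "https://www.bbc.com/news/world-2",
]
unique_links = deduplicate_links(links)
```

Only the first and last links survive, because the second and third normalise to the same key as the first.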

3.2 Dataset Composition

The M2DS dataset encompasses articles in Sinhala, English, Japanese, Korean, and Tamil languages. Our aspiration is for the M2DS dataset to serve as a catalyst, sparking research interest in languages that have received less exploration. Each language-specific cluster within the dataset comprises two to ten documents.

The M2DS dataset comprises 180,000 documents organised into 51,500 clusters, with an average of 3.5 documents per cluster (see Table 1). English-language news articles contribute the highest number of documents at 67,000, whereas Sinhala has the lowest count at 23,500. The average number of documents per cluster varies across languages, with Japanese having the lowest at 2.6 and Sinhala the highest at 4.2.
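As a quick sanity check, the per-cluster averages can be recomputed from the rounded counts in Table 1. The snippet below uses only figures reported in the table. Because those counts are rounded to the nearest 0.5k or 1k, the recomputed averages are expected to match the reported ones only to within about 0.1.

```python
# (documents, clusters) per language, as reported (rounded) in Table 1
counts = {
    "English": (67_000, 17_000),
    "Tamil": (32_000, 10_000),
    "Japanese": (29_000, 11_000),
    "Korean": (27_000, 8_000),
    "Sinhala": (23_500, 5_500),
}
reported_avg = {"English": 3.9, "Tamil": 3.2, "Japanese": 2.6, "Korean": 3.4, "Sinhala": 4.2}

# Average documents per cluster, recomputed from the rounded counts
recomputed_avg = {lang: docs / clusters for lang, (docs, clusters) in counts.items()}

total_clusters = sum(clusters for _, clusters in counts.values())
overall_avg = 180_000 / total_clusters  # 180,000 is the reported overall document count

for lang, value in recomputed_avg.items():
    print(f"{lang}: recomputed {value:.2f}, reported {reported_avg[lang]}")
print(f"Overall: {overall_avg:.2f} documents per cluster over {total_clusters} clusters")
```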

Dataset         | Metric | PRIMERA | PEGASUS | LED
Multi-News      | R-1    | 42.0*   | 32.0*   | 17.3*
                | R-2    | 13.6*   | 10.1*   | 3.7*
                | R-L    | 20.8*   | 16.7*   | 10.4*
Multi-Xscience  | R-1    | 29.1*   | 27.6*   | 14.6*
                | R-2    | 4.6*    | 4.6*    | 1.9*
                | R-L    | 15.7*   | 15.3*   | 9.9*
WikiSum         | R-1    | 28.0*   | 24.6*   | 10.5*
                | R-2    | 8.0*    | 5.5*    | 2.4*
                | R-L    | 18.0*   | 15.0*   | 8.6*
Rotten Tomatoes | R-1    | 25.4•   | 27.4•   | 25.6•
                | R-2    | 8.4•    | 9.5•    | 8.0•
                | R-L    | 19.8•   | 21.1•   | 19.6•
Table 2: ROUGE scores of selected models on different domain datasets. Note: This study utilises multiple datasets from various domains. (Results obtained from Hewapathirana et al. (2023) [16].) The Multi-News dataset [10] consists of news articles, Multi-XScience [27] focuses on scientific papers, WikiSum [24] provides Wikipedia summaries, and Rotten Tomatoes [19] covers movie reviews. These diverse datasets offer valuable resources for training and evaluating summarisation models. The sources for the results are as follows: *Xiao et al. (2022) [41] and •DeYoung et al. (2023) [7].

3.3 Dataset Comparison

As the M2DS dataset is the first of its kind, there is no multilingual MDS dataset to compare it against. We therefore compared it with existing MDS datasets across various domains, considering the total number of documents, the number of clusters, and the number of documents per cluster. This approach shows where our dataset sits within the landscape of English-centric MDS datasets.

To offer a holistic view, we present both aggregated numbers and language-specific statistics. This multi-faceted analysis allows for a nuanced understanding of the dataset’s characteristics in comparison to established datasets as shown in Table 1. Comparatively, when assessing M2DS against other MDS datasets, our dataset stands out with a significant overall number of documents. However, on a language-wise comparison, it exhibits a relatively lower count per language, emphasising the importance of considering linguistic variations in dataset analysis.

It is important to note that certain statistical metrics, such as average sentence length, average token count, and average word count per article and per cluster, were not included in the comparison. The rationale behind this omission is the inherent linguistic differences across languages. For instance, languages like Japanese and Korean may convey the same meaning with fewer words, or in some cases, they might encapsulate an entire sentence in a single character. Consequently, direct comparisons based on these metrics could be misleading due to the diverse linguistic structures and expressions employed by different languages.

4 Experiments

In this section, we outline the experiments conducted, taking the dataset size relative to existing English-centric MDS datasets into consideration. The dataset was partitioned into training, testing, and validation sets, following a 90-5-5 split for languages other than English [39]. For English, we adopted an 80-10-10 split, aligning with the practices of previous researchers in MDS dataset creation [25, 24, 17]. A meticulous evaluation of various MDS models was carried out to establish robust baselines. Additionally, we explored the efficacy of open-source large language models, aiming to set a strong baseline for future research.
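The partitioning scheme can be sketched as below. The ratios (90-5-5 for non-English languages, 80-10-10 for English) come from the text above. The shuffling, the fixed seed, and the function name `split_clusters` are assumptions made for illustration and may differ from the released splits.

```python
import random

# Train / test / validation percentages per language, as stated in the text
SPLIT_PERCENTAGES = {"English": (80, 10, 10)}
DEFAULT_PERCENTAGES = (90, 5, 5)


def split_clusters(clusters, language, seed=0):
    """Shuffle clusters reproducibly and cut them into train, test, and validation."""
    train_pct, test_pct, _ = SPLIT_PERCENTAGES.get(language, DEFAULT_PERCENTAGES)
    items = list(clusters)
    random.Random(seed).shuffle(items)  # assumption: random assignment with a fixed seed
    n_train = len(items) * train_pct // 100
    n_test = len(items) * test_pct // 100
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]  # remainder goes to validation
    return train, test, validation


sinhala_train, sinhala_test, sinhala_val = split_clusters(range(200), "Sinhala")
english_train, english_test, english_val = split_clusters(range(200), "English")
```

Integer arithmetic is used for the cut points so that the split sizes do not depend on floating-point rounding.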

4.1 Pre-trained Model Selection

In the context of multilingual MDS, the literature notably lacks transformer-based models. Recognising the robustness of such models, we evaluated pre-trained models to establish baselines. Model selection was guided by an extensive literature review, considering factors like model performance, ROUGE scores, publication year, and venue.

The evaluation focused on three summarising models: PRIMERA, PEGASUS, and LED. PRIMERA demonstrated superior performance in previous studies, while PEGASUS showed superior sentiment understanding, particularly on the Rotten Tomatoes dataset. LED, a widely used pre-trained model, served as a baseline in the existing literature (See Table 2).

Among the models under consideration, PRIMERA emerged as a promising choice due to its distinctive approach to MDS [16]. Seeking to minimize dependency on dataset-specific modeling, it consolidates multiple documents into a single extended sequence, employing the LED architecture known for its computational efficiency. PRIMERA incorporates a sparse “local+global” attention mechanism in the encoder and introduces special document separator tokens (<doc-sep>) to indicate document boundaries. Inspired by models like PEGASUS, PRIMERA adopts a unique masking strategy based on the Entity Pyramid framework, to address the limitations in selecting representative information for summarisation [41, 20, 4].
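To make the input format concrete, the sketch below concatenates a cluster into a single sequence with `<doc-sep>` markers between documents. It is only an illustration of the idea: whitespace tokens stand in for the model's subword tokens, and the 4096-token budget and the equal per-document truncation are assumptions rather than details confirmed in this paper.

```python
DOC_SEP = "<doc-sep>"


def build_single_sequence(documents, max_tokens=4096):
    """Join documents with separator tokens, truncating each to an equal share."""
    if not documents:
        raise ValueError("a cluster needs at least one document")
    # Reserve one slot per separator, then share the rest evenly between documents.
    budget = max_tokens - (len(documents) - 1)
    per_doc = max(budget // len(documents), 1)
    truncated = [" ".join(doc.split()[:per_doc]) for doc in documents]
    return f" {DOC_SEP} ".join(truncated)


sequence = build_single_sequence(["a b c d", "e f g h"], max_tokens=5)
```

With a budget of five tokens and two documents, one slot goes to the separator and each document keeps its first two tokens.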

Language | Metric | LEAD-3 | RANDOM | CENTROID | PRIMERA | PEGASUS | LED  | Llama 2
Sinhala  | R-1    | 0.06   | 5.7    | 4.5      | 5.7     | 4.1     | 3.6  | 20.2
         | R-2    | 0.0    | 0.05   | 0.1      | 2.2     | 2.1     | 1.9  | 6.5
         | R-L    | 0.06   | 5.1    | 3.9      | 3.2     | 2.8     | 2.9  | 17.3
Japanese | R-1    | 3.5    | 2.3    | 1.9      | 6.3     | 5.7     | 5.9  | 7.7
         | R-2    | 0.0    | 0.01   | 0.05     | 3.2     | 1.3     | 1.4  | 0.8
         | R-L    | 3.5    | 1.9    | 1.7      | 4.1     | 3.3     | 2.7  | 6.8
Korean   | R-1    | 2.4    | 1.4    | 1.3      | 5.4     | 5.5     | 4.6  | 8.5
         | R-2    | 0.4    | 0.02   | 0.03     | 1.1     | 1.4     | 0.8  | 1.0
         | R-L    | 2.3    | 1.3    | 1.3      | 2.3     | 2.9     | 1.9  | 8.1
Tamil    | R-1    | 6.8    | 1.6    | 2.2      | 4.4     | 3.8     | 3.7  | 10.2
         | R-2    | 0.9    | 0.0    | 0.06     | 1.1     | 0.7     | 0.4  | 3.1
         | R-L    | 6.2    | 1.6    | 1.9      | 2.2     | 1.7     | 1.3  | 9.8
English  | R-1    | 1.2    | 6.4    | 7.6      | 28.7    | 22.5    | 20.5 | 20.8
         | R-2    | 0.0    | 0.05   | 3.8      | 12.3    | 9.9     | 10.1 | 13.5
         | R-L    | 1.1    | 5.7    | 7.6      | 17.1    | 14.7    | 15.2 | 19.2
Table 3: Comparison of performance across fine-tuned models on the M2DS dataset

4.2 Baselines

For our baseline models, we explore simpler extractive approaches and statistical methods alongside pretrained models. In the extractive category, we employ LEAD-3 and RANDOM [39]. LEAD-3 extracts the first three sentences from the source text as the final summary, while RANDOM recursively selects words randomly from the source text until the threshold summary length is reached. These approaches serve as unbiased reference points for understanding and comparing more complex models. In the statistical approach, we experiment with CENTROID, inspired by [32]. CENTROID ranks sentences based on centrality scores derived from the words within each sentence, utilising TF-IDF scores to measure word similarity. We extract top sentences from each ranking until the threshold summary length is achieved.
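The three simple baselines can be sketched in a few lines of Python. This is an illustrative approximation, not the evaluated implementation: the regular-expression sentence splitter, sampling words with replacement in RANDOM, and scoring sentences by their average TF-IDF centroid weight in CENTROID are all assumptions.

```python
import math
import random
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter on ., ! and ? (an assumption; real scripts need language-aware rules)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def lead3(text):
    """LEAD-3: the first three sentences of the source."""
    return " ".join(split_sentences(text)[:3])


def random_words(text, max_words, seed=0):
    """RANDOM: draw words from the source (with replacement) up to the length threshold."""
    words = text.split()
    rng = random.Random(seed)
    return " ".join(rng.choice(words) for _ in range(max_words)) if words else ""


def centroid(text, max_words):
    """CENTROID-style ranking: prefer sentences whose words carry high TF-IDF weight."""
    sentences = split_sentences(text)
    tokenised = [s.lower().split() for s in sentences]
    n = len(tokenised)
    doc_freq = Counter(w for toks in tokenised for w in set(toks))
    term_freq = Counter(w for toks in tokenised for w in toks)
    weight = {w: term_freq[w] * (math.log(n / doc_freq[w]) + 1.0) for w in term_freq}
    scores = [sum(weight[w] for w in set(toks)) / len(toks) for toks in tokenised]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        # Always keep the top sentence; stop once the next one would exceed the budget.
        if chosen and used + len(tokenised[i]) > max_words:
            break
        chosen.append(i)
        used += len(tokenised[i])
    return " ".join(sentences[i] for i in sorted(chosen))  # restore source order


sample = "A one. B two. C three. D four."
lead = lead3(sample)
rand = random_words(sample, max_words=5)
cent = centroid(sample, max_words=4)
```

The selected sentences are emitted in their original order so that the extract stays readable.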

Moving to pre-trained models, we select PRIMERA, PEGASUS, and LED, training them on each language’s respective training set. For tokenization, we use a space-based tokenizer for Sinhala and Tamil, the original tokenizer for PRIMERA and LED in other languages, and a space-based tokenizer for PEGASUS in all languages except English. For English, we report results both with and without fine-tuning.
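For reference, the sketch below shows how ROUGE-1, ROUGE-2, and ROUGE-L F1 scores can be computed over space-based tokens. It is a simplified stand-in for the official ROUGE toolkit (no stemming, no stopword handling, single reference), and it is not necessarily the scoring script used for the tables in this paper. The tables report scores scaled by 100.

```python
from collections import Counter


def tokenize(text):
    """Space-based tokenisation, as used here for Sinhala and Tamil."""
    return text.split()


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, n_candidate, n_reference):
    if overlap == 0 or n_candidate == 0 or n_reference == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """F1 over clipped n-gram matches between candidate and reference."""
    cand = ngrams(tokenize(candidate), n)
    ref = ngrams(tokenize(reference), n)
    overlap = sum((cand & ref).values())  # multiset intersection clips the counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    cand, ref = tokenize(candidate), tokenize(reference)
    return f1(lcs_length(cand, ref), len(cand), len(ref))


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
r1 = rouge_n(candidate, reference, 1)  # 5 of 6 unigrams match
r2 = rouge_n(candidate, reference, 2)  # 3 of 5 bigrams match
rl = rouge_l(candidate, reference)     # LCS is "the cat on the mat"
```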

Additionally, we present baseline scores for Llama 2, an open Large Language Model (LLM) [38]. Llama 2, the successor to Llama 1, was introduced by Meta AI (https://ai.meta.com); we use the 7-billion-parameter causal decoder-only variant. We limit ourselves to open LLMs to ensure reproducibility within the research community.

Language | Metric | PRIMERA | PRIMERA (fine-tuned) | PEGASUS | PEGASUS (fine-tuned) | LED  | LED (fine-tuned)
English  | R-1    | 23.6    | 28.7                 | 18.6    | 22.5                 | 17.1 | 20.5
         | R-2    | 8.8     | 12.3                 | 9.1     | 9.9                  | 7.1  | 10.1
         | R-L    | 13.6    | 17.1                 | 12.4    | 14.7                 | 13.2 | 15.2
Table 4: Comparison of performance across models originally trained on English datasets, on English articles of the M2DS dataset

5 Analysis and Discussion

In our baseline evaluations, Llama 2 7B achieves the highest scores on most language-metric combinations (see Table 3). PRIMERA attains the highest ROUGE-1 score on English (28.7 against 20.8 for Llama 2), which suggests that it captures patterns specific to that language. However, the state-of-the-art MDS models fine-tuned on our dataset score noticeably lower than they previously did on English-centric news domain datasets, as depicted in Table 2. This drop could stem from the models struggling to capture language-specific information unique to each language in our multilingual dataset.

A noteworthy observation is the lower scores in LEAD-3, which extracts only the first three sentences as the summary. This suggests that our dataset exhibits better quality, addressing the issues found in TAC/DUC datasets, where the first three sentences often serve as summaries, leading models to learn biased patterns.

Contrary to the trend of using LLMs for MDS, our findings suggest that simpler models designed specifically for MDS, such as PRIMERA, may be more effective. On English, PRIMERA reaches a higher ROUGE-1 score than Llama 2 (28.7 against 20.8), although Llama 2 leads on ROUGE-2 and ROUGE-L. Designing task-specific models like PRIMERA, which perform well without extensive fine-tuning, could be a more effective approach under resource constraints. At the same time, Llama 2 achieves competitive results without fine-tuning, highlighting its potential for zero-shot learning and its effectiveness across diverse datasets compared to models trained on individual datasets.

Furthermore, it is essential to emphasise the scalability of models like Llama 2, indicating their potential for handling larger datasets and their adaptability across various domains. Additionally, future research should explore Transfer Learning techniques to enhance the performance of MDS models across different languages, minimising the observed drop in performance. Finally, understanding the impact of dataset quality on model evaluation is crucial, and our dataset’s higher quality, as reflected in low LEAD-3 scores, underscores the significance of curating datasets that truly represent the summarisation task.

Additionally, we compared the performance of PRIMERA and the other models with and without fine-tuning on the English subset of our dataset (see Table 4). Although PRIMERA surpasses the other models in the zero-shot setting, its scores are lower than those it achieves on existing news domain datasets. This suggests that our dataset presents challenges for models, underscoring its quality. Furthermore, all the models improve when they are fine-tuned. For instance, PRIMERA’s ROUGE-1 score increased from 23.6 to 28.7, the largest improvement among the three models.
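The fine-tuning gains implied by Table 4 can be computed directly from the reported scores. The snippet below uses only the values in that table.

```python
# (without fine-tuning, fine-tuned) scores on English, copied from Table 4
table4 = {
    "PRIMERA": {"R-1": (23.6, 28.7), "R-2": (8.8, 12.3), "R-L": (13.6, 17.1)},
    "PEGASUS": {"R-1": (18.6, 22.5), "R-2": (9.1, 9.9), "R-L": (12.4, 14.7)},
    "LED": {"R-1": (17.1, 20.5), "R-2": (7.1, 10.1), "R-L": (13.2, 15.2)},
}

# Absolute gain per model and metric, rounded to one decimal like the table
gains = {
    model: {metric: round(tuned - base, 1) for metric, (base, tuned) in scores.items()}
    for model, scores in table4.items()
}

# Model with the largest gain for each metric
largest_gain = {
    metric: max(gains, key=lambda model: gains[model][metric])
    for metric in ("R-1", "R-2", "R-L")
}
```

PRIMERA gains 5.1 points on ROUGE-1, against 3.9 for PEGASUS and 3.4 for LED, and it also shows the largest gain on ROUGE-2 and ROUGE-L.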

6 Conclusion and Future Directions

The study introduces the M2DS dataset to fill the gap in multilingual MDS datasets. While existing MDS datasets have made strides in various domains, they mostly focus on English, leaving a void in multilingual representation. M2DS, with document-summary pairs in five languages, stands out as the pioneering multilingual MDS dataset.

The comparison of M2DS with existing datasets demonstrates its potential and unique contribution to the field. Baseline scores from state-of-the-art MDS techniques provide a benchmark for future research in multilingual settings. Llama 2 7B achieves the highest scores in most settings, while PRIMERA attains the highest ROUGE-1 score on English, indicating its effectiveness on that language.

The introduction of M2DS opens avenues for future research, enabling researchers to enhance the robustness of MDS models across diverse linguistic contexts. Possible directions include language-specific model tuning, exploring multilingual model development, and extending M2DS into diverse domains beyond news articles for broader applications.

References

  • [1] Abid, A.M.: Multi-document text summarization using deep belief network. International Journal of Advances in Scientific Research and Engineering (IJASRE) (2022)
  • [2] Afsharizadeh, M., Ebrahimpour-Komleh, H., et al.: A survey on multi-document summarization and domain-oriented approaches. Journal of Information Systems and Telecommunication (JIST) 1(37), 68 (2022)
  • [3] Angelidis, S., Lapata, M.: Summarizing opinions: Aspect extraction meets sentiment prediction and they are both weakly supervised. In: EMNLP. pp. 3675–3686 (2018)
  • [4] Beltagy, I., Peters, M.E., et al.: Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020)
  • [5] Chen, J., Cai, C., Jiang, X., Chen, K.: Comparative graph-based summarization of scientific papers guided by comparative citations. In: Proceedings of the 29th International Conference on Computational Linguistics. pp. 5978–5988 (2022)
  • [6] DeYoung, J., Beltagy, I., van Zuylen, M., Kuehl, B., Wang, L.L.: MS^2: Multi-Document Summarization of Medical Studies. In: EMNLP. pp. 7494–7513 (2021)
  • [7] DeYoung, J., Martinez, S.C., Marshall, I.J., Wallace, B.C.: Do multi-document summarization models synthesize? arXiv preprint arXiv:2301.13844 (2023)
  • [8] Eberhard, D.M., Simons, G.F., Fennig, C.D. (eds.): Ethnologue: Languages of the Americas and the Pacific (2023)
  • [9] Elhadad, M., Miranda-Jiménez, S., Steinberger, J., Giannakopoulos, G.: Multi-document multilingual summarization corpus preparation, part 2: Czech, hebrew and spanish. In: Proceedings of the MultiLing 2013 Workshop on Multilingual Multi-document Summarization. pp. 13–19 (2013)
  • [10] Fabbri, A.R., Li, I., She, T., Li, S., Radev, D.: Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. In: ACL. pp. 1074–1084 (2019)
  • [11] Ganesan, K., Zhai, C., Han, J.: Opinosis: A graph based approach to abstractive summarization of highly redundant opinions. In: Coling 2010. pp. 340–348 (2010)
  • [12] Giannakopoulos, G.: Multi-document multilingual summarization and evaluation tracks in acl 2013 multiling workshop. In: Proceedings of the multiling 2013 workshop on multilingual multi-document summarization. pp. 20–28 (2013)
  • [13] Grusky, M., Naaman, M., Artzi, Y.: Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In: ACL. pp. 708–719 (2018)
  • [14] Hasan, T., Bhattacharjee, A., et al.: XL-Sum: Large-scale multilingual abstractive summarization for 44 languages. In: ACL-IJCNLP 2021. pp. 4693–4703 (2021)
  • [15] Hermann, K.M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., Blunsom, P.: Teaching machines to read and comprehend. Advances in neural information processing systems 28 (2015)
  • [16] Hewapathirana, K., De Silva, N., Athuraliya, C.D.: Multi-document summarization: A comparative evaluation. In: 2023 IEEE 17th International Conference on Industrial and Information Systems (ICIIS). pp. 19–24. IEEE (2023)
  • [17] Koupaee, M., Wang, W.Y.: Wikihow: A large scale text summarization dataset. arXiv preprint arXiv:1810.09305 (2018)
  • [18] Ladhak, F., Durmus, E., Cardie, C., Mckeown, K.: Wikilingua: A new benchmark dataset for cross-lingual abstractive summarization. In: EMNLP 2020. pp. 4034–4048 (2020)
  • [19] Leon, S.: Rotten tomatoes movies and critic reviews dataset. https://bit.ly/RTdataset (2020), (Accessed on 06/24/2023)
  • [20] Lewis, M., Liu, Y., et al.: Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: ACL. pp. 7871–7880 (2020)
  • [21] Li, L., Forăscu, C., El-Haj, M., Giannakopoulos, G.: Multi-document multilingual summarization corpus preparation, part 1: Arabic, english, greek, chinese, romanian. In: Proceedings of the multiling 2013 workshop on multilingual multi-document summarization. pp. 1–12 (2013)
  • [22] Li, M., Qi, J., Lau, J.H.: Peersum: A peer review dataset for abstractive multi-document summarization. arXiv preprint arXiv:2203.01769 (2022)
  • [23] Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out. pp. 74–81 (2004)
  • [24] Liu, P.J., Saleh, M., et al.: Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198 (2018)
  • [25] Liu, S., Cao, J., Yang, R., Wen, Z.: Generating a structured summary of numerous academic papers: Dataset and method. arXiv preprint arXiv:2302.04580 (2023)
  • [26] Liu, Y., Lapata, M.: Text summarization with pretrained encoders. In: EMNLP-IJCNLP. pp. 3730–3740 (2019)
  • [27] Lu, Y., Dong, Y., Charlin, L.: Multi-xscience: A large-scale dataset for extreme multi-document summarization of scientific articles. In: EMNLP. pp. 8068–8074 (2020)
  • [28] Ma, C., Zhang, W.E., et al.: Multi-document summarization via deep learning techniques: A survey. ACM Computing Surveys (CSUR) (2020)
  • [29] Litvak, M., Vanetik, N.: Multilingual multi-document summarization with POLY. In: Proceedings of the MultiLing 2013 Workshop on Multilingual Multi-document Summarization (2013)
  • [30] Moro, G., Ragazzi, L., Valgimigli, L., Freddi, D.: Discriminative marginalized probabilistic neural method for multi-document summarization of medical literature. In: ACL. pp. 180–189 (2022)
  • [31] Napoles, C., Gormley, M.R., Van Durme, B.: Annotated gigaword. In: Proceedings of the joint workshop on automatic knowledge base construction and web-scale knowledge extraction (AKBC-WEKEX). pp. 95–100 (2012)
  • [32] Radev, D., Jing, H., Budzikowska, M.: Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In: NAACL-ANLP 2000 Workshop: Automatic Summarization (2000)
  • [33] Raffel, C., Shazeer, N., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21(1), 5485–5551 (2020)
  • [34] Salton, G.: A vector space model for information retrieval. Journal of the ASIS pp. 613–620 (1975)
  • [35] Sandhaus, E.: The new york times annotated corpus. (2008), https://catalog.ldc.upenn.edu/LDC2008T19
  • [36] Scialom, T., Dray, P.A., et al.: MLSUM: The multilingual summarization corpus. In: EMNLP. pp. 8051–8067 (2020)
  • [37] Sinha, A., Shen, Z., et al.: An overview of microsoft academic service (mas) and applications. In: WWW. pp. 243–246 (2015)
  • [38] Touvron, H., Martin, L., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  • [39] Verma, Y., Jangra, A., Verma, R., Saha, S.: Large scale multi-lingual multi-modal summarization dataset. In: ACL. pp. 3602–3614 (2023)
  • [40] Wang, D., Chen, J., Zhou, H., Qiu, X., Li, L.: Contrastive aligned joint learning for multilingual summarization. In: ACL-IJCNLP 2021. pp. 2739–2750 (2021)
  • [41] Xiao, W., Beltagy, I., Carenini, G., Cohan, A.: Primera: Pyramid-based masked sentence pre-training for multi-document summarization. In: ACL. pp. 5245–5263 (2022)
  • [42] Zhang, J., Zhao, Y., et al.: Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In: ICML. pp. 11328–11339. PMLR (2020)