Rethinking Transformer-based Multi-document Summarization: An Empirical Investigation

Congbo Ma1, Wei Emma Zhang2, Dileepa Pitawela2, Haojie Zhuang2, Yanfeng Shu3
1Macquarie University, Sydney, Australia, 2The University of Adelaide, Adelaide, Australia,
3CSIRO, Australia. [email protected], [email protected]
{wei.e.zhang, dileepa.pitawela, haojie.zhuang}@adelaide.edu.au
Abstract

Transformer-based models have driven the growth of multi-document summarization (MDS). Given the huge impact and widespread adoption of Transformer-based models in various natural language processing tasks, investigating their performance and behaviors in the context of MDS is crucial for advancing the field and enhancing summary quality. To thoroughly examine the behaviors of Transformer-based MDS models, this paper presents five empirical studies on (1) quantitatively measuring the impact of document boundary separators; (2) exploring the effectiveness of different mainstream Transformer structures; (3) examining the sensitivity of the encoder and decoder; (4) discussing different training strategies; and (5) investigating repetition in summary generation. The experimental results on prevalent MDS datasets and eleven evaluation metrics show the influence of document boundary separators, the granularity of features at different levels, and different model training strategies. The results also reveal that the decoder is more sensitive to noise than the encoder. This underscores the important role played by the decoder and suggests a potential direction for future research in MDS. Furthermore, the experimental results indicate that the repetition problem in generated summaries correlates with high uncertainty scores.



1 Introduction

Innovations and recent developments in the Transformer architecture Vaswani et al. (2017) have driven progress in multi-document summarization (MDS) Ma et al. (2022a). This motivates us to study the behaviors of Transformer-based MDS models. Through these analyses, we aim to provide a thorough understanding of MDS and its intricacies within the MDS model framework. We undertake a comprehensive investigation from five distinct perspectives covering the Transformer-based MDS model design pipeline: (1) Document input perspective: we conduct experiments to quantitatively assess the impact of document boundary separators; (2) Transformer structure perspective: we explore the effectiveness of different mainstream Transformer structures; (3) Encoder and decoder significance perspective: we design empirical studies that add noise to the encoder and decoder; (4) Training strategy perspective: we restructure the source documents and include self-supervised learning; (5) Summary generation perspective: we explore the uncertainties that arise when repetition problems occur in the summary generation process.

The primary distinction between single-document summarization (SDS) and MDS lies in the number of source documents. One straightforward way to convert MDS to SDS is to concatenate text spans and process them as a flat sequence Liu et al. (2018); Chu and Liu (2019); Brazinskas et al. (2020); Mao et al. (2020); Zhao et al. (2022). One way to help models detect and model document-to-document relationships in a flat sequence is to use document boundary separators Fabbri et al. (2019); Xiao et al. (2022). However, there is a notable gap in the current literature regarding a qualitative and quantitative examination of the influence of document boundary separators. This gap motivates us to investigate whether these separators enhance model performance and foster awareness of document boundaries within the feature space of MDS models. Through experiments conducted on three distinct Transformer structures, we find that the impact of document boundary separators varies among models with differing hierarchies. Uncertainty analysis is a pivotal approach to examining and assessing generation systems Xu et al. (2020), and it can serve as an important indicator of how the model behaves during summary generation. We therefore investigate the variation of summary prediction uncertainty by exploring the relations between separators and the predictive uncertainty of the structures. Measuring uncertainty in the context of summarization can provide insights into how the presence of document boundary separators affects the behavior of Transformer-based models and their summarization outcomes. By quantifying uncertainty through entropy calculations, we gain a deeper understanding of the level of confidence or ambiguity the model has in its generated summaries.

Instead of simply concatenating all the input documents into a flat sequence and applying SDS models, the hierarchical Transformer structure Liu and Lapata (2019); Pasunuru et al. (2021); Li et al. (2020) has been proposed specifically for MDS tasks. This structure encodes multiple documents in a hierarchical manner, capturing cross-document relations through an attention mechanism. The hierarchical Transformer structure contains a low-level Transformer that encodes tokens and a high-level Transformer that encodes coarser-grained textual units. This motivates us to further explore the influence of different hierarchies on MDS performance. We explore the effect of different granularities of the high-level Transformer on the performance of MDS models. In this paper, we consider sentence-level and document-level features as different granularities. Based on the empirical studies, our findings indicate that for MDS tasks involving relatively short documents, flat Transformer models are a suitable choice. In addition, the hierarchical structure performs better with the coarser, document-level granularity in the high-level Transformer.

In addition to exploring the hierarchical structure of Transformer-based MDS models, we explore the Transformer's internal structure. Among existing Transformer-based MDS methods, many focus on modifying the components of the encoder Liu and Lapata (2019); Pasunuru et al. (2021); Liu et al. (2021); Ma et al. (2022b), and fewer works attend to improving the decoder Jin and Wan (2020); Liu et al. (2022) to cater to the requirements of MDS tasks. This motivates us to explore the robustness of the encoder and decoder to interference under the same noise conditions. To this end, we add Gaussian noise in the parameter space of the encoder or decoder. The experimental results indicate that the decoder is more sensitive than the encoder in MDS scenarios. This finding underscores the need for increased attention to decoder enhancements in future research within the MDS community.

Based on the analysis of Transformer-based MDS models, we also explore different training strategies for further enhancing the performance of MDS models. Different training strategies offer different ways to use the available data and optimize model performance. By investigating diverse training strategies, we aim to identify the most effective methods for training MDS models, leveraging the characteristics of the dataset and the summarization task at hand. These strategies involve using pseudo datasets, fine-tuning on original datasets, or a combination of both. To generate pseudo data, we treat individual documents in a document set as pseudo-summaries and create multiple sets of pseudo-document-summary pairs. We evaluate three training approaches: training exclusively on the pseudo dataset, mixing the pseudo dataset with the original dataset, and a two-step process of training on the pseudo dataset followed by fine-tuning on the original dataset. The experimental results show that the pretrain-finetune strategy outperforms the other training strategies on most metrics, leading to improved summarization quality. The analysis of feature distributions further supports this finding, highlighting the alignment between the finetuned model and the baseline model. These results provide insight into the effectiveness of the pretrain-finetune approach in enhancing summarization performance. The findings of this study can guide future research and development in abstractive summarization, emphasizing the importance of training strategies for achieving higher-quality summaries.

Moreover, while the different Transformer structures and training strategies vary in performance, we observe repetitive patterns in the generated summaries, indicating a potential issue that needs to be addressed in abstractive summarization systems. Liu et al. (2023) gave two possible reasons behind the repetition problem in abstractive summarization: (1) attending to the same location in the source and (2) attending to similar but different sentences in the source. In this paper, we explore the cause of repetition problems in abstractive summarization by examining predictive uncertainty. We quantify uncertainty scores at each time step during the summary generation process. The analysis aims to observe how the uncertainty score changes when repetition occurs, allowing us to identify the positions at which uncertainty is localized in repetitive behavior. The analysis reveals that as the model generates repetitive sentences or words, the uncertainty score rises, indicating decreased confidence regarding the appropriateness and relevance of the repeated elements in the summary. Understanding this relationship can help in developing strategies to mitigate repetition and improve the quality of generated summaries.

2 Methodology

We introduce how to design the MDS experiments from the following angles: input data, Transformer structures, training strategies and summary generation. Therefore, we design five experiments to evaluate the behaviors of Transformer-based MDS models: (1) the measurable impact of document boundary separators; (2) the effectiveness of different Transformer structures; (3) the sensitivity of the encoder and decoder; (4) different training strategies; (5) repetition in document generation.

2.1 The Measurable Impact of Document Separators

We modify the source documents rather than the summarization models, converting each document set to the format $\mathcal{D}=\{\mathbf{d}^{1},\mathbf{s},\mathbf{d}^{2},\mathbf{s},\ldots,\mathbf{s},\mathbf{d}^{N}\}$, where $N$ is the number of documents in a document set $\mathcal{D}$, $\mathbf{d}^{n}$ represents the $n$-th document in the set, and $\mathbf{s}$ denotes the special separator token. We investigate different Transformer models on two MDS datasets with eleven evaluation metrics to explore the impact of the document boundary separators qualitatively and quantitatively. We analyze and compare the prediction uncertainty across datasets and source-document formats by inspecting entropy values during summary generation. We aim to understand how the decision to add document boundary separators is reflected in the model's uncertainty. In the generation process, each predictive position $\mathbf{X}_{i}$ has an outcome probability distribution over $\mathbf{x}_{i1},\ldots,\mathbf{x}_{im}$, where $m$ is the size of the corpus pool. We use entropy as an uncertainty measurement, which can be calculated as follows:

$$H(\mathbf{X}_{i}) = -\sum_{j=1}^{m} P(\mathbf{x}_{ij}) \log P(\mathbf{x}_{ij}) \tag{1}$$

Because the corpus pool is large and the prediction distribution is usually long-tailed Xu et al. (2020), we sort the prediction distribution of $\mathbf{X}_{i}$ in descending order, take the minimal set of tokens whose cumulative probability exceeds 0.95, and then renormalize the distribution. We calculate the entropy value based on the new distribution $P^{\prime}(\mathbf{x}_{ij})$. Entropy lets us gauge how probability is spread across tokens at each predictive position of a summary. Higher entropy values indicate a wider spread of probabilities, suggesting that the model is less certain about the most appropriate token to choose. Conversely, lower entropy values suggest that the model is more confident in its token predictions. Quantifying uncertainty through entropy, together with a qualitative analysis, enables us to assess how the introduction of document boundary separators influences the summaries generated by Transformer-based models. This approach helps us understand the impact of document boundary separators on the MDS process and the behavior of these models when handling multiple document inputs.
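The truncation and renormalization step described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `truncated_entropy`, the plain-list input, and the use of the natural logarithm are assumptions (the paper does not state the base of the logarithm).

```python
import math


def truncated_entropy(probs, mass=0.95):
    """Entropy of the smallest top-probability set whose mass exceeds `mass`.

    `probs` is a hypothetical per-position distribution over the corpus pool.
    The natural logarithm is an assumption; the paper does not give the base.
    """
    kept, total = [], 0.0
    # Walk the distribution from the most to the least probable token.
    for p in sorted(probs, reverse=True):
        kept.append(p)
        total += p
        if total > mass:
            break
    # Renormalize the kept probabilities so that they sum to one.
    renormalized = [p / total for p in kept]
    return -sum(p * math.log(p) for p in renormalized if p > 0.0)


peaked = truncated_entropy([0.97, 0.01, 0.01, 0.01])
uniform = truncated_entropy([0.25, 0.25, 0.25, 0.25])
```

In this toy usage, the peaked distribution keeps a single token and so has zero entropy, while the uniform distribution keeps all four tokens and has entropy log 4.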

2.2 The Effectiveness of Different Transformer Structures

Transformer structures have become an essential component of many state-of-the-art natural language processing models. However, the design of the Transformer architecture can vary dramatically, and different structures may impact the performance of the model on different tasks. In this study, we aim to evaluate the effectiveness of different Transformer structures for MDS tasks. Specifically, we focus on two types of structures: flat Transformer and hierarchical Transformer.

The flat Transformer consists of a single stack of self-attention and feed-forward neural network layers that processes the input tokens as one flat sequence. In contrast, the hierarchical Transformer has a more complex structure, where the input tokens are first grouped into sentences or documents and then processed by local and global Transformer layers. To explore the hierarchical Transformer structure, we investigate two different granularities of the high-level Transformer: sentence-level and document-level. Building on the work of Liu and Lapata (2019), we modify the local Transformer layers to encode individual documents. The global Transformer layers are then able to exchange information at the sentence or document level.

Our analysis is motivated by the need to better understand how different Transformer structures can impact the performance of MDS models. By comparing the performance of the flat Transformer and hierarchical Transformer structures, we aim to identify which structure is more effective for multiple document summarization data.

2.3 The Sensitivity of Encoder and Decoder

In summarization tasks, the encoder plays a crucial role in extracting representations from the input text, while the decoder is responsible for generating the output summary, which requires producing coherent and meaningful language. Given the intricate nature of summary generation, the decoder’s role demands fine-grained control and precision, making it potentially more sensitive than the encoder. To explore the sensitivity of the encoder-decoder in Transformer-based summarization models, we add Gaussian noise at the parameter space of the encoder or decoder. We devise this experiment based on the intuition that a module (whether it’s the encoder or decoder) exhibits varying sensitivity to noise, thereby signifying the differing degrees of importance each module holds for overall performance. Formally, we have:

$$z = f(x; \Theta + \alpha n), \quad n \sim N(\mu, \delta) \tag{2}$$

where $f(\cdot)$ is the component of the Transformer; $\Theta$ denotes the parameters of $f(\cdot)$; $n$ represents Gaussian noise; $\mu$ and $\delta$ are the mean and variance of the Gaussian noise; and $\alpha$ is the weighting factor.
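As a rough illustration of Equation 2, the sketch below perturbs only the parameters of one module with scaled Gaussian noise. It makes several assumptions that are not in the paper: parameters are stored as a dictionary of flat lists, the `encoder.` and `decoder.` name prefixes are hypothetical, and the defaults $\mu = 0$ and $\delta = 1$ are placeholders because the paper does not report the values used.

```python
import math
import random


def add_parameter_noise(params, alpha, prefix, mu=0.0, variance=1.0, seed=0):
    """Return a copy of `params` with Theta + alpha * n for one module.

    `params` maps hypothetical parameter names to flat lists of floats.
    Only names starting with `prefix` (e.g. "encoder." or "decoder.")
    are perturbed; all other parameters are copied unchanged.
    """
    rng = random.Random(seed)
    std = math.sqrt(variance)  # random.gauss expects a standard deviation
    noisy = {}
    for name, weights in params.items():
        if name.startswith(prefix):
            noisy[name] = [w + alpha * rng.gauss(mu, std) for w in weights]
        else:
            noisy[name] = list(weights)
    return noisy


toy_params = {"encoder.w": [0.1, 0.2], "decoder.w": [0.3, 0.4]}
decoder_noisy = add_parameter_noise(toy_params, alpha=1e-2, prefix="decoder.")
```

Calling the function once with `prefix="encoder."` and once with `prefix="decoder."` at the same `alpha` gives the two conditions that are compared under identical noise levels.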

2.4 Different Training Strategies

In this study, we aim to investigate the impact of different training strategies on Transformer models for abstractive summarization. While the preceding sections examine the components of Transformer models, the specific influence of training strategies remains unexplored. Our objective is to identify the most effective training strategies by leveraging the inherent characteristics of MDS datasets, without the need for external data sources. To create pseudo data from the characteristics of MDS datasets, we adopt a straightforward approach. We treat one document from a given document set as a pseudo-summary and the remaining documents as input documents. This process is repeated, selecting each document in the set in turn as the pseudo-summary, until all input documents have served as pseudo-summaries. Consequently, we generate multiple sets of pseudo-document-summary pairs, which we refer to as the pseudo-MDS dataset. The original MDS dataset is denoted as the original dataset in the subsequent analysis.
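The leave-one-out construction described above can be sketched as follows. The function name `build_pseudo_pairs` and the list-of-strings representation are illustrative assumptions, as is the choice to skip sets with fewer than two documents, which the paper does not discuss.

```python
def build_pseudo_pairs(document_set):
    """Build (source documents, pseudo-summary) pairs from one document set.

    Each document takes a turn as the pseudo-summary while the remaining
    documents form the input. Sets with fewer than two documents yield no
    pairs (an assumption made here so that the input is never empty).
    """
    if len(document_set) < 2:
        return []
    pairs = []
    for i, pseudo_summary in enumerate(document_set):
        sources = document_set[:i] + document_set[i + 1:]
        pairs.append((sources, pseudo_summary))
    return pairs


pairs = build_pseudo_pairs(["doc A", "doc B", "doc C"])
```

A set of $N$ documents therefore contributes $N$ pseudo pairs, each with $N-1$ source documents.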

To evaluate the effectiveness of different training strategies, we design three distinct approaches. Firstly, we train the MDS model exclusively on the pseudo dataset. Secondly, we mix the pseudo dataset with the original dataset, creating a comprehensive mega dataset, on which the MDS model is trained. Lastly, we employ a two-step process, initially training the model on the pseudo dataset and subsequently fine-tuning it on the original dataset.

2.5 Repetition in Document Generation

For abstractive MDS, a persistent challenge arises from the inclination of models to produce repetitive sentences or words during the summarization process. This tendency creates a loop that is difficult to break, hampering the generation of accurate summaries. To analyze what may cause repetitive problems, we delve into an analysis of prediction uncertainty, examining uncertainty scores throughout the generation process and localizing uncertainty to certain positions in a repetition behavior.

To quantify uncertainty, we employ Equation 1, which calculates the uncertainty score for each time slot during the summarization generation. By applying this equation, we obtain a measure of uncertainty that corresponds to the level of doubt or ambiguity associated with the generated output. The analysis focuses on observing how the uncertainty score evolves in response to the occurrence of repetition phenomena.
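One way to relate repetition to per-step uncertainty is sketched below. The paper does not specify how repeated positions are identified, so the repeated n-gram criterion, the default `n=3`, and both function names are assumptions made purely for illustration; the entropy values are taken as given, one per generated token.

```python
def repeated_positions(tokens, n=3):
    """Indices of tokens that complete an n-gram already seen earlier."""
    seen = set()
    flagged = []
    for end in range(n, len(tokens) + 1):
        gram = tuple(tokens[end - n:end])
        if gram in seen:
            flagged.append(end - 1)  # index of the last token of the n-gram
        seen.add(gram)
    return flagged


def mean_entropy_split(entropies, flagged):
    """Mean entropy at flagged positions and at all remaining positions."""
    flagged_set = set(flagged)
    rep = [h for i, h in enumerate(entropies) if i in flagged_set]
    rest = [h for i, h in enumerate(entropies) if i not in flagged_set]

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(rep), mean(rest)


tokens = "the cat sat . the cat sat .".split()
flagged = repeated_positions(tokens)
```

Comparing the two means returned by `mean_entropy_split` gives a coarse view of whether uncertainty is higher where repetition occurs.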

3 Empirical Studies and Analyses

3.1 Settings for Empirical Studies

We evaluate the performance of three Transformer models: Vanilla Transformer (VT) Vaswani et al. (2017), Vanilla Transformer with copy mechanism (VTC), and modified Hierarchical Transformer (HT) Liu and Lapata (2019). These models are assessed on two widely used MDS datasets: Multi-XScience Lu et al. (2020) and Multi-News Fabbri et al. (2019). To comprehensively analyze their performance, we employ eleven evaluation metrics: ROUGE Lin (2004), including ROUGE-1 (R-1), ROUGE-2 (R-2), ROUGE-L (R-L), ROUGE-SU (R-SU) and ROUGE-WE (R-WE) Ng and Abrecht (2015); BLEU Papineni et al. (2002); S3 Peyrard et al. (2017), including pyramid (pyr) and responsiveness (resp) scores; BertScore (BS) Zhang et al. (2020); Relevance (Rel) Peyrard (2019); and Redundancy (Red) Peyrard (2019).

| Dataset | Model | R-1↑ | R-2↑ | R-L↑ | R-SU↑ | R-WE↑ | BLEU↑ | S3 (pyr/resp)↑ | BS↑ | Red↓ | Rel↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Multi-XScience | VT | 0.2714 | 0.0490 | 0.1030 | 0.0784 | 0.1523 | 2.9773 | 0.2103/0.3609 | 0.5330 | -4.0712 | -5.8352 |
| Multi-XScience | VT w/o S | 0.2670 | 0.0480 | 0.1553 | 0.0767 | 0.1580 | 3.3623 | 0.2202/0.3663 | 0.5405 | -6.1908 | -4.8609 |
| Multi-XScience | VTC | 0.2635 | 0.0483 | 0.1499 | 0.0734 | 0.1659 | 4.6037 | 0.2561/0.3885 | 0.5590 | -7.0585 | -4.5802 |
| Multi-XScience | VTC w/o S | 0.2713 | 0.0468 | 0.1502 | 0.0780 | 0.1702 | 4.7615 | 0.2554/0.3861 | 0.5621 | -7.8402 | -4.2908 |
| Multi-XScience | HT | 0.2571 | 0.0483 | 0.1615 | 0.0692 | 0.1407 | 7.1501 | 0.1769/0.3473 | 0.5303 | -4.6987 | -8.0379 |
| Multi-XScience | HT w/o S | 0.2216 | 0.0376 | 0.1446 | 0.0521 | 0.1100 | 5.2862 | 0.1428/0.3295 | 0.5108 | -4.0142 | -11.6068 |
| Multi-News | VT | 0.2445 | 0.0523 | 0.1301 | 0.0603 | 0.1480 | 2.0054 | 0.1380/0.3212 | 0.4622 | -5.7674 | -7.4220 |
| Multi-News | VT w/o S | 0.2555 | 0.0550 | 0.1347 | 0.0651 | 0.1491 | 2.0193 | 0.1384/0.3214 | 0.4605 | -5.2098 | -8.0488 |
| Multi-News | VTC | 0.4233 | 0.1471 | 0.2059 | 0.1625 | 0.2860 | 11.3861 | 0.3778/0.4871 | 0.5955 | -6.0966 | 3.9027 |
| Multi-News | VTC w/o S | 0.4363 | 0.1555 | 0.2053 | 0.1698 | 0.2885 | 13.015 | 0.3967/0.5017 | 0.5916 | -6.2869 | 3.8355 |
| Multi-News | HT | 0.2349 | 0.0371 | 0.1352 | 0.0598 | 0.1154 | 3.5434 | 0.1097/0.3074 | 0.4987 | -5.0249 | -17.1520 |
| Multi-News | HT w/o S | 0.2304 | 0.0384 | 0.1430 | 0.0580 | 0.1193 | 3.0499 | 0.1023/0.3031 | 0.4966 | -4.9433 | -16.8205 |

Table 1: Evaluation results on Multi-XScience and Multi-News datasets, both with and without the document boundary separators. "S" indicates document separators.

3.2 Impact of Document Separators

We investigate the VT, VTC, and HT models on both datasets and report the eleven evaluation metrics to explore the impact of the document boundary separators. Interestingly, Table 1 shows that adding separators reduces model performance in half of the cases (3 out of 6). For example, VT with separators performs worse on Multi-News (8 of the 11 evaluation metrics are worse), and VTC with separators performs worse on both Multi-XScience (9 of the 11 metrics are worse) and Multi-News (8 of the 11 metrics are worse). These results indicate that separators in the input documents are not very helpful for flat Transformer models. However, the HT model achieves better performance on both datasets with document boundary separators.

Another interesting finding is that ROUGE, the most commonly used metric, in a few cases shows the opposite result from the other evaluation metrics. For instance, on the Multi-XScience dataset, VT with document boundary separators shows better ROUGE results than VT without them, which contradicts the results on R-WE, BLEU, S3, BertScore, Redundancy and Relevance. This indicates that the ROUGE-centric evaluation system needs to be updated and that the assessment of summarization cannot rely solely on ROUGE.

Figure 1: The uncertainty scores of VTC on Multi-News and Multi-XScience. The x-axis and y-axis show the uncertainty score and the number of tokens, respectively.

We also examine the relation between document boundary separators and token uncertainty scores. Figure 1 shows the uncertainty scores of tokens generated by the VTC models on both datasets. Surprisingly, the figure shows that separators are associated with high uncertainty scores, which means that the separators increase the predictive uncertainty of the models. A possible reason is that the separators have no semantic relation to the source documents and may act as noise that increases the predictive uncertainty. The median uncertainty score on Multi-News is larger than that on Multi-XScience, in line with the sizes of the datasets.

Figure 2: Performance of document-level (green line) and sentence-level (orange line) HT models on the Multi-XScience (left) and Multi-News (right) datasets. BLEU, Redundancy and Relevance are scaled to the range 0 to 0.6 so that all points fall within the plot boundary.

3.3 Quantitative Performance on Different Transformer Structures

We investigate (1) the effectiveness of different Transformer architectures, namely the flat Transformer (VT, VTC) and the hierarchical Transformer (HT), and (2) the influence of different granularities within the hierarchical Transformer structure. The results are also reported in Table 1. On most evaluation metrics, the HT model does not match the two flat Transformer models on either dataset. Two potential reasons are: (1) the pipeline of the HT model is longer than that of the flat Transformer models, which makes the HT model harder to train; (2) Multi-XScience and Multi-News are not long-document summarization datasets: their average document lengths are 778.08 and 2103.49, respectively. These results suggest that the HT model may be better suited to lengthy documents and that flat Transformer models are a good choice for tasks with shorter documents.

As mentioned in Section 2.2, to evaluate the influence of different granularities within the hierarchical Transformer structure, we modify the local Transformer layers to encode individual documents. Figure 2 shows the performance of the document-level and sentence-level HT models. All metrics show better performance with the document-level HT than with the sentence-level HT, as the green line exceeds the boundary of the orange line in every dimension (for redundancy, lower is better). This trend implies that a coarser, document-level granularity is more favorable for the hierarchical Transformer structure.

3.4 Quantitative Performance on the Sensitivity of Encoder and Decoder

| Dataset | Model | R-1↑ | R-2↑ | R-L↑ | R-SU↑ | R-WE↑ | BLEU↑ | S3 (pyr/resp)↑ | BS↑ | Red↓ | Rel↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Multi-XScience | En ($\alpha$=1e-3) | 0.2656 | 0.0477 | 0.1507 | 0.0739 | 0.1660 | 4.6288 | 0.2560/0.3881 | 0.5593 | -5.2615 | 2.5252 |
| Multi-XScience | De ($\alpha$=1e-3) | 0.2637 | 0.0483 | 0.1499 | 0.0735 | 0.1676 | 4.8116 | 0.2573/0.3890 | 0.5608 | -5.2806 | 2.5377 |
| Multi-XScience | En ($\alpha$=1e-2) | 0.2433 | 0.0412 | 0.1386 | 0.0650 | 0.1523 | 4.0228 | 0.2276/0.3713 | 0.5506 | -5.2222 | 2.4878 |
| Multi-XScience | De ($\alpha$=1e-2) | 0.2130 | 0.0362 | 0.1277 | 0.0512 | 0.1333 | 2.6732 | 0.1933/0.3535 | 0.5189 | -4.5961 | 2.4406 |
| Multi-XScience | En ($\alpha$=1e-1) | 0.0305 | 0.0019 | 0.0232 | 0.0035 | 0.0057 | 0.0979 | -0.0786/0.2085 | 0.3631 | -1.8267 | 0.1420 |
| Multi-XScience | De ($\alpha$=1e-1) | 0.0282 | 0.0039 | 0.0259 | 0.0019 | 0.0012 | 0.0422 | -0.0350/0.2347 | 0.3533 | -0.9935 | 1.2109 |
| Multi-News | En ($\alpha$=1e-3) | 0.4178 | 0.1439 | 0.2063 | 0.1598 | 0.2817 | 10.5326 | 0.3345/0.4623 | 0.5943 | -5.8867 | 3.8567 |
| Multi-News | De ($\alpha$=1e-3) | 0.4172 | 0.1427 | 0.2053 | 0.1589 | 0.2802 | 10.6737 | 0.3348/0.4625 | 0.5941 | -5.8923 | 3.8533 |
| Multi-News | En ($\alpha$=1e-2) | 0.2899 | 0.0689 | 0.1405 | 0.0888 | 0.2095 | 5.5596 | 0.2260/0.3778 | 0.5335 | -5.2695 | 3.7854 |
| Multi-News | De ($\alpha$=1e-2) | 0.2248 | 0.0602 | 0.1134 | 0.0706 | 0.1842 | 4.0850 | 0.2288/0.3793 | 0.4972 | -4.7247 | 3.3888 |
| Multi-News | En ($\alpha$=1e-1) | 0.0938 | 0.0049 | 0.0724 | 0.0101 | 0.0266 | 0.0549 | -0.0499/0.2223 | 0.3330 | -1.2151 | 1.7586 |
| Multi-News | De ($\alpha$=1e-1) | 0.0458 | 0.0018 | 0.0330 | 0.0041 | 0.0186 | 0.0476 | -0.0537/0.2207 | 0.3410 | -2.3011 | 1.3539 |

Table 2: Evaluation results on Multi-XScience and Multi-News datasets about the encoder-decoder structure. "En" and "De" denote noise added to the encoder and the decoder, respectively.

To investigate the hypothesis in Section 2.3, we select the VTC model as the foundation for evaluating the sensitivity of the encoder-decoder structure on the Multi-XScience and Multi-News datasets. Table 2 shows large differences in performance between adding noise to the encoder and adding noise to the decoder in highly noisy scenarios (with $\alpha = 1\text{e-}1$ and $\alpha = 1\text{e-}2$). Specifically, in noisy conditions, adding noise to the decoder has a more substantial impact on performance than adding noise to the encoder. However, as the noise level decreases, the performance gap between the two conditions narrows. This observation supports our initial hypothesis that the decoder is more sensitive than the encoder. The potential reasons are: (1) errors or inaccuracies in the decoder can have a cascading effect on subsequent tokens generated during decoding. This error propagation can make the decoder more sensitive to small perturbations, as any mistakes or noise introduced during decoding can be amplified and affect the overall quality of the generated summary; (2) Transformer-based models often employ an attention mechanism that allows the decoder to focus on different parts of the encoded input during decoding. The decoder must attend effectively to relevant information, and even slight perturbations can change the attention weights and subsequently influence the decoding process. These findings underscore the crucial role played by the decoder in the overall summarization process.

3.5 Quantitative Performance of Different Training Strategies

Figure 3: The feature visualization with PCA of VTC, VTC with self-supervised training, and VTC with finetuning after self-supervised training, on Multi-News (first panel) and Multi-XScience (second panel).

The experimental results presented in Table 3 provide an overview of the performance of the VTC model trained using different training strategies on the Multi-XScience and Multi-News datasets. In the table, VTC is trained on the original pairs of document sets and golden summaries. The "finetune" strategy (also referred to as pretrain-finetune) trains the model on the pseudo dataset (introduced in Section 2.4) first and then fine-tunes it on the original dataset. The "self-supervised" strategy trains the VTC model exclusively on the pseudo dataset. The "mix" strategy trains the model on a combination of the pseudo dataset and the original dataset. By comparing the results obtained from these different training strategies, we aim to identify the most effective approach for each dataset.

For Multi-XScience, the results show that the VTC (pretrain-finetune) strategy outperforms the VTC trained on the original dataset across most metrics, indicating the effectiveness of the pretrain-finetune strategy in improving summarization quality. In contrast, VTC (self-supervised) performs worse than VTC (pretrain-finetune), suggesting that self-supervised training alone is less effective for this dataset.

Similarly, for the Multi-News dataset, the VTC model achieves good performance across all metrics, with higher scores under the VTC (pretrain-finetune) strategy on most of them, indicating improved summarization quality. Conversely, the VTC (self-supervised) and VTC (mix) strategies yield lower performance than the other strategies.

| Datasets | Models | R-1 ↑ | R-2 ↑ | R-L ↑ | R-SU ↑ | R-WE ↑ | BLEU ↑ | S3 (pyr/resp) ↑ | BS ↑ | Red ↓ | Rel ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Multi-XScience | VTC | 0.2635 | 0.0483 | 0.1499 | 0.0734 | 0.1659 | 4.6037 | 0.2561/0.3885 | 0.5590 | -7.0585 | -4.5802 |
| Multi-XScience | VTC (finetune) | 0.2955 | 0.0558 | 0.1671 | 0.0879 | 0.1770 | 3.9727 | 0.2569/0.3886 | 0.5511 | -5.0020 | 2.5824 |
| Multi-XScience | VTC (self-supervised) | 0.2585 | 0.0368 | 0.1471 | 0.0678 | 0.1325 | 1.2885 | 0.1694/0.3343 | 0.5173 | -5.3546 | 2.2064 |
| Multi-XScience | VTC (mix) | 0.2547 | 0.0350 | 0.1468 | 0.0653 | 0.1324 | 1.2922 | 0.1526/0.3246 | 0.5176 | -5.3285 | 2.1945 |
| Multi-News | VTC | 0.4233 | 0.1471 | 0.2059 | 0.1625 | 0.2860 | 11.3861 | 0.3778/0.4871 | 0.5955 | -6.0966 | 3.9027 |
| Multi-News | VTC (finetune) | 0.4271 | 0.1509 | 0.2084 | 0.1643 | 0.2886 | 11.5514 | 0.3893/0.4960 | 0.6004 | -6.2075 | 3.9135 |
| Multi-News | VTC (self-supervised) | 0.2724 | 0.0484 | 0.1349 | 0.0738 | 0.1399 | 2.8583 | 0.1281/0.3159 | 0.4737 | -5.5046 | 2.4027 |
| Multi-News | VTC (mix) | 0.3046 | 0.0673 | 0.1485 | 0.0938 | 0.1728 | 5.2611 | 0.1909/0.3595 | 0.4979 | -5.8684 | 2.7281 |

Table 3: Different training strategies on Multi-News and Multi-XScience datasets.

The comparison of these training strategies reveals that the pretrain-finetune approach consistently leads to better summarization performance than the baseline VTC model and the other training strategies, highlighting its effectiveness in improving summarization quality.

To find the potential reason why the finetune strategy works well, we visualize the feature distributions of three training strategies, VTC, VTC (self-supervised) and VTC (finetune), using Principal Component Analysis (PCA), as illustrated in Figure 3. For Multi-News, the encoder features of VTC (self-supervised) and VTC (finetune) overlap, while remaining distant from those of the plain VTC. In contrast, for Multi-XScience, the features of VTC (finetune) are more similar to those of the plain VTC but still noticeably distinct from those of VTC (self-supervised). This observation is consistent with the performance results presented in Table 3. For Multi-XScience, finetuning the model after self-supervised training significantly improves performance compared to the VTC. However, when the model is only trained with self-supervised learning, it performs worse than the VTC. This discrepancy can be attributed to the fact that the features of the finetuned model closely align with the distribution of the VTC model, since both models possess better representations for the final prediction. Conversely, for Multi-News, the finetuned model exhibits only marginal improvements over the VTC. This also explains the overlap between the features of the finetuned model and the self-supervised model, as finetuning adjusts the feature distribution towards the ‘genuine’ distribution, albeit to a limited extent.

Figure 4: The relationship between uncertainty scores and token repetitions on different summaries.

3.6 The Relation Between Repetition and Uncertainty

We examine the correlation between repetition and uncertainty during summary generation. To assess uncertainty, we compute a score for each generated token. Two summaries are presented: one featuring repetition and the other a standard summary without repetition. The results are shown in Figure 4. The X-axis represents token indexes, while the Y-axis shows the uncertainty score of each token. In summary #1, where no repetition occurs, the token uncertainties remain within a “normal” range. This suggests that the model successfully avoids repetitive patterns, resulting in lower uncertainty scores throughout the summary generation process. Conversely, in summary #2, we observe a distinct pattern: as the repetition of tokens or phrases begins, the uncertainty scores escalate rapidly.

By comparing uncertainty scores across different time slots, we gain insights into the relationship between repetition and uncertainty in abstractive summarization. When a repetition phenomenon occurs, we observe notable changes in the uncertainty score, indicating a correlation between the two factors. Specifically, as the model generates repetitive sentences or words, the uncertainty score tends to increase. This increase in uncertainty suggests that the model becomes less confident and more uncertain about the appropriateness or relevance of the repeated elements within the summary. By understanding this relationship, we can devise strategies to mitigate repetition and subsequently enhance the quality of generated summaries. By reducing uncertainty through the minimization of repetition, we pave the way for more accurate and reliable abstractive summarization.
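A per-token analysis of this kind can be sketched as follows. The sketch assumes that the uncertainty of a token is the Shannon entropy of the model's next-token distribution, which may differ from the score defined earlier in the paper. It also flags a repetition whenever an n-gram reappears. The function names and the toy inputs are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def repeated_ngram_positions(tokens, n=3):
    """Indexes at which an n-gram starts that has already appeared earlier."""
    seen = set()
    repeats = []
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            repeats.append(i)
        seen.add(gram)
    return repeats


# Toy summary in which the second sentence repeats the first.
tokens = "the plan was approved . the plan was approved .".split()
repeats = repeated_ngram_positions(tokens, n=3)

# Toy distributions: a confident prediction and a maximally uncertain one.
confident = token_entropy([0.97, 0.01, 0.01, 0.01])
uncertain = token_entropy([0.25, 0.25, 0.25, 0.25])
```

Plotting the per-token scores against the token index and marking the positions in `repeats` gives a view comparable to Figure 4.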

4 Conclusion and Discussion

This study attempts to empirically examine the influences on Transformer behaviors from five important perspectives: document boundary separators, Transformer structures, the sensitivity of the encoder-decoder architecture, training strategies, and the relationship between repetition and uncertainty in generated summaries. We first explore the impact of separators on two flat Transformers and one hierarchical Transformer.

Experiments indicate that adding separators makes hierarchical Transformers aware of document boundaries, unlike flat Transformers. This suggests that for models handling complex structures, separators can enhance performance. The necessity of adopting separators should be considered depending on the Transformer structure applied.

The experiments exploring Transformer structures demonstrate that a higher level of granularity is favorable for the hierarchical Transformer structure. They also show that the simpler flat Transformer performs better on the Multi-XScience and Multi-News datasets than the more complicated hierarchical Transformer. Flat Transformer models are therefore sufficient for MDS tasks with relatively short documents.

Furthermore, adding noise to the decoder affects performance more than adding noise to the encoder. This sensitivity is likely due to error propagation during decoding and the attention mechanism’s dependence on accurate encoding. These results emphasize the decoder’s crucial role in producing high-quality summaries and its significant impact on the summarization process.

The pretrain-finetune strategy, which first trains the model on the pseudo labels and then fine-tunes it on the original dataset, consistently leads to improved summarization performance compared to the other training strategies. This finding highlights the effectiveness of the pretrain-finetune strategy in enhancing MDS model performance.

Moreover, the analysis of the relations between repetition and uncertainty provides valuable insights into improving the quality of generated summaries. The findings suggest that as repetition occurs in the summaries, there is a noticeable increase in uncertainty scores. By recognizing this relationship, strategies can be developed to mitigate repetition and reduce uncertainty, ultimately enhancing the overall quality of abstractive summaries. These insights contribute to the advancement of abstractive summarization techniques and open avenues for further research in improving the reliability and effectiveness of summary generation.

We also point out possible directions for future MDS work: (1) evaluate the generated summaries with multiple evaluation metrics; (2) add higher-level granularity information into the models; (3) investigate MDS methods for particularly long input documents; (4) pay more attention to the decoder when designing Transformer-based summarization models; (5) reduce the sudden sharp increases and the high values of the uncertainty score during the summary generation process.

Limitations

The original Hierarchical Transformer (HT) model is trained on four GPUs (NVIDIA TITAN Xp) for 500,000 steps, but with an unspecified batch size. To keep the comparison fair given our limited computational resources, all the models reported in this paper are trained on the same single GPU, which in turn influences the setting of the batch size. This may affect the performance of the HT model.

References

A Appendix

A.1 Implementation Details

The training of all models begins with an initial learning rate of 2. An initial warm-up phase spans the first 8,000 steps, followed by a subsequent multi-step learning rate reduction. During the training process, a batch size of 4,096 is utilized, and the optimization is performed for 20,000 steps using the Adam optimizer. A dropout rate of 0.2 is employed to enhance model robustness. All experiments are conducted on a single NVIDIA 3090 GPU with one Intel i9-10900X CPU. The operating environment is provided by Ubuntu 22.04.3 LTS.

A.2 Summarization Models

Vanilla Transformer (VT) Vaswani et al. (2017) is a sequence-to-sequence model that was proposed for the machine translation task. It was subsequently generalized to various NLP tasks due to its strong performance Lin et al. (2021).

Vanilla Transformer with Copy Mechanism (VTC). This variant adds a mechanism that copies the attention distribution of one randomly chosen attention head from the encoder side into the decoder, so that the generated text becomes less repetitive and more factually accurate. We implement the VT and VTC based on https://github.com/Alex-Fabbri/Multi-News/tree/master/code/OpenNMT-py-baselines.

Hierarchical Transformer (HT) Liu and Lapata (2019) uses a hierarchical attention structure to attend to long sequences effectively and capture cross-paragraph contextual relationships. The local Transformer layers encode individual paragraphs, and the global Transformer layers exchange paragraph-level information from the local layers across paragraphs.

A.3 Datasets

The empirical studies are based on two widely used MDS datasets: Multi-XScience Lu et al. (2020) and Multi-News Fabbri et al. (2019). Multi-XScience contains data from scientific articles. The task of this dataset is to generate the related work section of a target paper based on its abstract and the abstracts of the articles it refers to. Multi-News collects news articles from the site "newser.com." Each set of source documents has a professionally written summary and the task is to generate that summary based on the sources. Table 4 describes the statistics of these two datasets, including the size of the train, test, and validation set, the average document length, and the average summary length.

| Datasets | Train / Test / Validation | Average Document Length | Average Summary Length |
|---|---|---|---|
| Multi-XScience | 30,369 / 5,093 / 5,066 | 778.08 | 116.44 |
| Multi-News | 44,972 / 5,622 / 5,622 | 2,103.49 | 263.66 |

Table 4: Description of two used multi-document summarization datasets: Multi-News and Multi-XScience.

A.4 Data Processing

For the Multi-XScience and Multi-News datasets, the source documents are separated by a special token named “story_separator_special_tag”. The length of the input documents is restricted to 1024 tokens. In each document set, the number of tokens for one document is $\frac{1024}{N}$, where $N$ is the number of documents in the document set. Shorter documents are repeated to fill the 1024-token quota. In the Multi-XScience dataset, the citations in the sources and targets are replaced by a common token ‘@cite’.
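The truncation scheme described above can be sketched as follows. This is an illustration rather than the authors' preprocessing code. It assumes whitespace tokenization, integer division when 1024 is not divisible by $N$, and that the separator tokens do not count towards the quota. The function names `split_budget` and `build_input` are hypothetical.

```python
SEPARATOR = "story_separator_special_tag"
MAX_TOKENS = 1024


def split_budget(documents, max_tokens=MAX_TOKENS):
    """Give each of the N documents max_tokens // N tokens.

    Longer documents are truncated; shorter ones are repeated until
    their share of the quota is filled.
    """
    budget = max_tokens // len(documents)
    pieces = []
    for doc in documents:
        tokens = doc.split()  # whitespace tokenization (assumption)
        if not tokens:
            pieces.append([])
            continue
        repeats = budget // len(tokens) + 1
        pieces.append((tokens * repeats)[:budget])
    return pieces


def build_input(documents, max_tokens=MAX_TOKENS):
    """Join the per-document pieces with the separator token."""
    pieces = split_budget(documents, max_tokens)
    return f" {SEPARATOR} ".join(" ".join(piece) for piece in pieces)


# Toy document set: one very short and one very long document.
docs = ["a b c", " ".join(str(i) for i in range(2000))]
pieces = split_budget(docs)
model_input = build_input(docs)
```

With two documents, each receives 512 tokens: the short one is repeated cyclically and the long one is cut after its first 512 tokens.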

A.5 Evaluation Metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Lin (2004) is a set of evaluation metrics that compare the overlapping textual units between generated summaries and golden summaries, including ROUGE-1 (R-1), ROUGE-2 (R-2), ROUGE-L (R-L) and ROUGE-SU (R-SU). R-1 and R-2 measure the overlapping unigrams and bigrams respectively, while R-L is based on the longest common subsequence. R-SU measures the co-occurrence of unigrams and skip-bigrams. The parameters of ROUGE are -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -a -m.
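The n-gram overlap behind R-1 and R-2 can be sketched as follows. This is a simplified illustration, not the official ROUGE toolkit used for the reported scores: it assumes whitespace tokenization and a single reference, and it omits stemming and confidence intervals. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f1) of the clipped n-gram overlap."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
r1 = rouge_n(candidate, reference, n=1)
r2 = rouge_n(candidate, reference, n=2)
```

In this example five of the six reference unigrams and three of the five reference bigrams are matched.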

ROUGE-WE (R-WE) Ng and Abrecht (2015) is a variant of the ROUGE metric that replaces the hard lexical matching in ROUGE-N with a soft matching based on the cosine similarity of word embeddings. The soft matching provides a more forgiving evaluation by not requiring exact lexical matches, thus allowing for variations in word choice and phrasing.

BLEU (BiLingual Evaluation Understudy) Papineni et al. (2002) computes the geometric average of the modified n-gram precisions and multiplies it by a brevity penalty term.
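The two ingredients of BLEU, the modified n-gram precision and the brevity penalty, can be sketched as follows. This is a simplified single-reference, sentence-level illustration without smoothing that returns a value in [0, 1]; it is not the scorer used for the reported results. The function names and the example sentences are hypothetical.

```python
import math
from collections import Counter


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0


def bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of the 1..max_n precisions."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing: any zero precision gives a zero score
    log_average = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_average)


reference = "the cat sat on the mat".split()
shorter = "the cat sat on the".split()
score_same = bleu(reference, reference)
score_short = bleu(shorter, reference)
```

The shorter candidate matches every one of its n-grams, so its score is determined entirely by the brevity penalty, exp(1 - 6/5).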

S3 Peyrard et al. (2017) is a model-based metric that considers the features from other evaluation metrics, including R-N, R-L, R-WE and JS-divergence, to produce pyramid (pyr) and responsiveness (resp) scores.

BertScore (BS) Zhang et al. (2020) measures the soft overlap between the BERT token embeddings of the machine-generated summaries and those of the golden summaries. The model type of BertScore is bert-base-uncased.

Relevance (Rel) Peyrard (2019) calculates cross-entropy over individually constructed probability distributions for a summary $S$ and a source $D$ using their own semantic units $\omega$: $Relevance(S, D) = \sum_{\omega_i} P_S(\omega_i) \cdot \log(P_D(\omega_i))$, where the probability distributions of the summary and the source document are given by $P_S$ and $P_D$ respectively.

Redundancy (Red) Peyrard (2019) evaluates the quality of the accumulation of information in the candidate summaries: $Redundancy(S) = \sum_{\omega_i} P_S(\omega_i) \cdot \log(P_S(\omega_i))$.
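The Relevance and Redundancy formulas can be sketched as follows. The sketch assumes that the semantic units are whitespace-separated words, that the logarithm is natural, and that summary units missing from the source receive a small floor probability to avoid log(0). None of these choices is stated in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def distribution(tokens):
    """Relative frequency of each semantic unit (here: each word)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {unit: count / total for unit, count in counts.items()}


def redundancy(summary_tokens):
    """Red(S) = sum_i P_S(w_i) * log(P_S(w_i)), the negative entropy of S."""
    p_s = distribution(summary_tokens)
    return sum(p * math.log(p) for p in p_s.values())


def relevance(summary_tokens, source_tokens, floor=1e-12):
    """Rel(S, D) = sum_i P_S(w_i) * log(P_D(w_i)).

    The floor for units unseen in the source is an assumption of this sketch.
    """
    p_s = distribution(summary_tokens)
    p_d = distribution(source_tokens)
    return sum(p * math.log(p_d.get(unit, floor)) for unit, p in p_s.items())


varied = redundancy("a b c d".split())      # every unit appears once
repetitive = redundancy("a a a a".split())  # one unit repeated
rel = relevance("a b".split(), "a b c d".split())
```

A summary that repeats a single unit reaches the maximum value of 0, while a summary with varied units scores lower, which agrees with the ↓ direction of Red in Table 3.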

A.6 Visualization on the Impact of Document Separators

We compare the embedding space of the tokens after they are fed into the encoder, with and without document separators, using t-SNE visualization (Figure 5). After the token representations pass through the hierarchical Transformer encoder, the cluster boundaries of documents with separators are easier to identify in the embedding space. In contrast, the two flat Transformer models have difficulty distinguishing the document cluster boundaries in the embedding space after the token representations pass through the Transformer encoder. Potentially, the hierarchical Transformer relies more on the structural information of documents to compose the final summaries, while the flat Transformers do not.

Figure 5: t-SNE visualization of two embedding spaces on the Multi-News dataset with the VT, VTC and HT models: (1) token representations before feeding into the Transformer encoder; (2) token representations after feeding into the Transformer encoder. The figures in the 1st row are the visualization with document separators and those in the 2nd row are the visualization without document separators. Panels: (VT); (VTC); (HT).