License: CC BY 4.0
arXiv:2404.05243v1 [cs.CL] 08 Apr 2024

Product Description and QA Assisted Self-Supervised Opinion Summarization

Tejpalsingh Siledar*♣, Rupasai Rangaraju*♣, Sri Raghava Muddu*♣,
Swaprava Nath♣, Pushpak Bhattacharyya♣,
Suman Banerjee♠, Amey Patil♠, Sudhanshu Shekhar Singh♠,
Muthusamy Chelliah♠, Nikesh Garera♠
♣ Computer Science and Engineering, IIT Bombay, India
♠ Flipkart, India
{tejpalsingh, rupasai, sriraghava, swaprava, pb}@cse.iitb.ac.in
Abstract

In e-commerce, opinion summarization is the process of summarizing the consensus opinions found in product reviews. However, the potential of additional sources such as product description and question-answers (QA) has been considered less often. Moreover, the absence of any supervised training data makes this task challenging. To address this, we propose a novel synthetic dataset creation (SDC) strategy that leverages information from reviews as well as additional sources for selecting one of the reviews as a pseudo-summary to enable supervised training. Our Multi-Encoder Decoder framework for Opinion Summarization (MEDOS) employs a separate encoder for each source, enabling effective selection of information while generating the summary. For evaluation, due to the unavailability of test sets with additional sources, we extend the Amazon, Oposum+, and Flipkart test sets and leverage ChatGPT (https://chat.openai.com/, gpt-3.5 August 3 version) to annotate summaries. Experiments across nine test sets demonstrate that the combination of our SDC approach and MEDOS model achieves on average a 14.5% improvement in ROUGE-1 F1 over the SOTA. Moreover, comparative analysis underlines the significance of incorporating additional sources for generating more informative summaries. Human evaluations further indicate that MEDOS scores relatively higher in coherence and fluency, with 0.41 and 0.5 respectively (on a scale of -1 to 1), compared to existing models. To the best of our knowledge, we are the first to generate opinion summaries leveraging additional sources in a self-supervised setting.

* Equal contribution.


1 Introduction

In the e-commerce domain, reviews play a vital role in making informed decisions. However, due to the recent proliferation of online reviews, going through all the product reviews before making a decision is challenging. Opinion summarization provides a solution by summarizing the opinions presented in the reviews (Hu and Liu, 2006; Wang and Ling, 2016; Angelidis and Lapata, 2018).

MultimodalSum

I bought this product to scan my negatives. It does not work with Windows XP. I have tried to contact the company several times and have not received a response. I am very disappointed in the product. I would not recommend it to anyone.

Our Model (MEDOS)

I purchased the VuPoint FS-C1-VP Film and Slide Digital Converter to scan my 35mm film and slide negatives. It is not compatible with Windows XP. The software does not work with Windows 7 or 8. I have tried to contact the company and they do not respond to my emails. I would not recommend this product to anyone.

Table 1: MultimodalSum vs. MEDOS generated summary for a product from the Amazon test set. Information drawn from the product description and from question-answers is shown in bold and underlined, respectively. Our model is able to capture essential information from the product description and question-answers that is not found in the reviews. This makes our model-generated summaries more informative while still retaining the consensus opinions from reviews, as evident in the above example.

However, general text summarization (Nallapati et al., 2016; See et al., 2017; Liu and Lapata, 2019) usually relies on reference summaries, which are very difficult to obtain at a large scale for opinion summarization. As a result, recent studies (Bražinskas et al., 2020; Elsahar et al., 2021) enable self-supervision by curating synthetic pairs from a review corpus: one of the reviews is sampled as a pseudo-summary and the remaining reviews are treated as the input.

Motivation  In e-commerce, users' opinions are expressed through various sources such as product ratings, reviews, review upvotes and downvotes, and question-answers. Additionally, each product comes with a description, product specifications, product images, price, etc. Considering such additional sources apart from reviews is vital in generating opinion summaries that are well-rounded and informative. Specifically, descriptions offer nuanced details about various aspects, while question-answers provide additional perspectives on specific queries, both of which can be valuable. Table 1 shows an example of the influence of product description and question-answers. However, acquiring annotated training datasets becomes expensive and impractical as the number of sources increases. This makes it essential to devise effective synthetic dataset creation strategies that enable supervised training of models using multiple sources.

Problem Statement  We propose a novel synthetic dataset creation approach that uses additional sources such as product description and question-answers (QA) along with reviews to generate synthetic quadruplets of the form {input reviews, description, question-answers, pseudo-summary}, enabling end-to-end supervised training. We also propose a multi-encoder decoder model for opinion summarization (MEDOS) that effectively selects information from either the product description or question-answers while summarizing reviews. For evaluation, due to the unavailability of test sets that have annotated summaries written considering such additional sources (except for Flipkart (Siledar et al., 2023b)), we extend the available e-commerce test sets by including these additional sources and leveraging ChatGPT (OpenAI, 2023) to annotate (Gilardi et al., 2023; Huang et al., 2023) summaries.
Input: Reviews, Description, Question-Answers
Output: Opinion Summary

Our contributions are:

  1. A novel synthetic dataset creation (SDC) approach that enables supervised training in the presence of additional sources without the need for any annotated training datasets. We propose a Multi-Encoder Decoder framework for Opinion Summarization (MEDOS; code and data: https://github.com/tjsiledar/MEDOS) to effectively fuse information from reviews, product description, and question-answers (QA) (Section 3, 4 & 5). To the best of our knowledge, we are the first to do multi-source self-supervised opinion summarization.

  2. Extensions to e-commerce test sets, namely Amazon (Bražinskas et al., 2020) and Oposum+ (Amplayo et al., 2021), to include additional sources. For comparison, we extend Amazon, Oposum+, and Flipkart by curating six new test sets: Amazon R, Amazon RDQ, Oposum+ R, Oposum+ RDQ, Flipkart R, and Flipkart RDQ, leveraging ChatGPT to annotate summaries. We extend the test sets to contain 662 opinion summaries across six curated test sets (Section 6.2, Table 2).

  3. Experimental demonstrations of our SDC approach and MEDOS model in outperforming the SOTA model on nine test sets on average by 14.5% in ROUGE-1 F1 (Section 7).

  4. Comparative and qualitative analysis indicating the importance of sources such as product description and question-answers in generating more informative summaries compared to existing models (Section 7, Table 4 & 5).

| | Amazon (Original) | Oposum+ (Original) | Flipkart (Original) | Amazon GPT-R (Ours) | Oposum+ GPT-R (Ours) | Flipkart GPT-R (Ours) | Amazon GPT-RDQ (Ours) | Oposum+ GPT-RDQ (Ours) | Flipkart GPT-RDQ (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| #domains | 4 | 6 | 3 | 4 | 6 | 3 | 4 | 6 | 3 |
| #test set | 32 | 30 | 145 | 32 | 30 | 145 | 32 | 30 | 145 |
| #reviews/product | 8 | 10 | 10 | 8 | 10 | 10 | 8 | 10 | 10 |
| #summaries/product | 3 | 3 | 1 | $\mathbf{3}$ | $\mathbf{3}$ | $\mathbf{1}$ | $\mathbf{3}$ | $\mathbf{3}$ | $\mathbf{1}$ |
| #summaries | 96 | 90 | 145 | $\mathbf{96}$ | $\mathbf{90}$ | $\mathbf{145}$ | $\mathbf{96}$ | $\mathbf{90}$ | $\mathbf{145}$ |
| #descriptions | - | - | - | - | - | - | 21 | 17 | 145 |
| #question-answers | - | - | - | - | - | - | 11 | 10 | 145 |
Table 2: Statistics for original and extended test sets. GPT-R indicates the use of reviews whereas GPT-RDQ indicates the use of reviews, description, and question-answers to generate summaries using ChatGPT. Bold represents our contributions. In the respective extended versions, reviews are the same as the original.
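As a quick consistency check on the 662 total, assuming it counts the summaries of the six extended test sets (the GPT-R and GPT-RDQ versions of Amazon, Oposum+, and Flipkart), the #summaries row of Table 2 gives:

```latex
\begin{align*}
\underbrace{(96 + 90 + 145)}_{\text{GPT-R}} + \underbrace{(96 + 90 + 145)}_{\text{GPT-RDQ}} = 331 + 331 = 662
\end{align*}
```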

2 Related Work

Self-supervised Opinion Summarization.  Recent approaches use self-supervision by considering one of the reviews as a pseudo-summary. Bražinskas et al. (2020) randomly selected $N$ reviews per entity to construct $N$ {pseudo-summary, reviews} pairs. Amplayo and Lapata (2020) sampled a review randomly and generated noisy versions of it as input reviews. Amplayo et al. (2020) used aspect and sentiment distributions to sample pseudo-summaries. Elsahar et al. (2021) selected reviews similar to a randomly sampled pseudo-summary as input reviews, based on TF-IDF cosine similarity. Wang and Wan (2021) aimed at reducing opinion redundancy and constructed highly relevant {reviews, pseudo-summary} pairs by learning aspect and sentiment embeddings. Im et al. (2021) used a synthetic dataset creation strategy similar to Bražinskas et al. (2020) and extended it to a multimodal version. Ke et al. (2022) captured the consistency of aspects and sentiment between reviews and pseudo-summary using constrained sampling. Siledar et al. (2023a) used lexical and semantic similarities for creating synthetic datasets. Our work is most similar to Elsahar et al. (2021) in using cosine similarity to select input reviews and pseudo-summary pairs. However, we use review embeddings to compute similarity instead of TF-IDF scores. Additionally, our pseudo-summary selection considers additional sources such as product description and question-answers as well. Our synthetic dataset creation strategy ensures that the selected pseudo-summary is highly relevant to all our input sources. Recent opinion summarization systems (Bhaskar et al., 2023; Hosking et al., 2023) include a large number of reviews. However, we limit our work to a fixed number of reviews to enable a fair comparison with previous approaches.

Additional sources for Opinion Summarization.  Zhao and Chaturvedi (2020) used aspects identified from product description to perform extractive aspect-based opinion summarization. Li et al. (2020) proposed a supervised multimodal summarization model to effectively generate summaries using reviews, product image, product title, and product details. Im et al. (2021) proposed a self-supervised multimodal training pipeline to generate summaries using reviews, images, and meta-data. Siledar et al. (2023b) performed supervised opinion summarization using simple rules to generate summaries separately in the form of verdict, pros, cons, and additional information using reviews, description, specifications, and question-answers. Our work takes inspiration from Im et al. (2021) in utilizing a multi-encoder framework to effectively fuse information from various sources. However, in our setting the additional sources are all text, and our approach of forming highly relevant synthetic pairs using these sources helps in capturing relevant information. Also, our approach differs from Siledar et al. (2023b) in training models in an end-to-end fashion without the aid of supervised summaries.

3 Problem Formulation

Preliminaries. For a specific product or an entity, $R = \{r_1, \dots, r_N\}$ is the set of $N$ reviews, $D$ represents the product description, and $Q = \{q_1, \dots, q_M\}$ represents a set of $M$ question-answer pairs such that $q_i$ represents the $i^{th}$ question concatenated with its corresponding answer.

Opinion Summarization. The task of opinion summarization is to generate an opinion summary $s$ given a set of reviews $R$ for an entity (e.g. product or business). Rush et al. (2015) defined the task of abstractive summarization as:

\begin{align}
s^{*} &= \underset{s}{\text{argmax}}\; g(s, R), \tag{1}\\
g(s, R) &= \log\, p(s \mid R; \theta), \tag{2}\\
&\approx \sum_{i=0}^{J-1} \log\, p(s_{i+1} \mid s_w, R; \theta), \tag{3}
\end{align}

where $g$ is a scoring function defined as a conditional log probability of the summary given the input, $s_w = s_{[i-w+1, \dots, i]}$ for a window size $w$, $\theta$ is the neural network parameters, and $|s| = J$. For opinion summarization, the input is a review set $R$ and the output is the opinion summary $s$. The conditional probability can be modeled using Transformers (Vaswani et al., 2017) as:

\begin{align}
p(s_{i+1} \mid s_w, R; \theta) &\propto \rho(\text{FFN}(\text{C-Attn}(\mathbf{a_R}, \mathbf{e_{s_w}}))), \tag{4}\\
\mathbf{a_R} = \text{S-Attn}(\text{Enc}(R)), &\quad \mathbf{e_{s_w}} = \text{Emb}(s_w), \tag{5}
\end{align}

where $\rho$ is the softmax function, FFN is the feed-forward network, C-Attn is the cross-attention network, S-Attn is the self-attention network, Enc is the encoder, and Emb is the embedding layer.
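Equation 3 scores a candidate summary as a sum of next-token log probabilities, each conditioned on the previous $w$ tokens and the reviews, and Eq. 1 picks the highest-scoring candidate. The sketch below illustrates only that factorization. `toy_next_token_prob` is a hypothetical stand-in for $p(s_{i+1} \mid s_w, R; \theta)$ that simply favors words occurring in the reviews; it is not the Transformer of Eqs. 4 and 5, and all names and data are illustrative assumptions.

```python
import math

def toy_next_token_prob(token, context, reviews, vocab):
    """Hypothetical stand-in for p(s_{i+1} | s_w, R; theta).

    Every vocabulary word gets a base weight of 1 plus 1 per occurrence
    in the reviews. The context window is accepted but ignored here.
    """
    review_tokens = [t for r in reviews for t in r.split()]
    weights = {v: 1 + review_tokens.count(v) for v in vocab}
    return weights[token] / sum(weights.values())

def summary_score(summary_tokens, reviews, vocab, w=2):
    """g(s, R) as in Eq. 3: sum of log p(s_{i+1} | s_w, R) for i = 0..J-1."""
    total = 0.0
    for i in range(len(summary_tokens)):
        context = summary_tokens[max(0, i - w):i]  # the last w tokens, s_w
        prob = toy_next_token_prob(summary_tokens[i], context, reviews, vocab)
        total += math.log(prob)
    return total

reviews = ["great battery life", "battery lasts long"]
vocab = ["great", "battery", "life", "lasts", "long", "bad"]
candidates = [["great", "battery"], ["bad", "bad"]]
# Eq. 1: choose the candidate with the highest score.
best = max(candidates, key=lambda s: summary_score(s, reviews, vocab))
```

In a real system the candidate set is not enumerated; decoding (for example greedy or beam search) approximates the argmax token by token.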

Additional Sources. In the presence of additional sources such as product description and question-answers, the equations for modeling abstractive summarization can be written as:

\begin{align}
s^{*} &= \underset{s}{\text{argmax}}\; g(s, R, D, Q), \tag{6}\\
g(s, R, D, Q) &= \log\, p(s \mid R, D, Q; \theta), \tag{7}\\
&\approx \sum_{i=0}^{J-1} \log\, p(s_{i+1} \mid s_w, R, D, Q; \theta), \tag{8}
\end{align}

Using transformers, this can be modeled as:

\begin{align}
p(s_{i+1} \mid s_w, R, D, Q; \theta) &\propto \rho(\text{FFN}(\text{C-Attn}(\mathbf{a_f}, \mathbf{e_{s_w}}))), \tag{9}\\
\mathbf{e_{s_w}} &= \text{Emb}(s_w), \tag{10}
\end{align}

where $\mathbf{a_f}$ is the fused attention. We propose a Multi-Encoder Decoder framework, MEDOS (Section 5, Figure 1), to create the fused attention $\mathbf{a_f}$ (Eq. 11).

| Model | Amazon R1 ↑ | Amazon R2 ↑ | Amazon RL ↑ | Amazon GPT-R R1 ↑ | Amazon GPT-R R2 ↑ | Amazon GPT-R RL ↑ | Amazon GPT-RDQ R1 ↑ | Amazon GPT-RDQ R2 ↑ | Amazon GPT-RDQ RL ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Random | 27.86 | 3.87 | 16.68 | 20.69 | 1.56 | 12.55 | 18.83 | 1.45 | 12.03 |
| Oracle | 44.47 | 13.83 | 30.85 | 33.69 | 6.04 | 22.88 | 31.83 | 5.77 | 22.04 |
| Clustroid | 29.27 | 4.41 | 17.78 | 22.74 | 2.16 | 14.03 | 21.31 | 2.57 | 13.38 |
| LexRank | 29.46 | 5.53 | 17.74 | 22.82 | 3.08 | 13.77 | 19.30 | 4.31 | 12.90 |
| QT | 34.04 | 7.03 | 18.08 | 23.01 | 2.48 | 12.05 | 21.78 | 3.25 | 12.36 |
| CopyCat | 31.97 | 5.81 | 20.16 | 20.09 | 1.79 | 12.94 | 20.54 | 1.94 | 13.85 |
| PlanSum | 32.87 | 6.12 | 19.05 | 20.49 | 1.76 | 12.44 | 19.09 | 1.58 | 12.02 |
| ConsistSum | 33.32 | 5.94 | $\mathbf{21.41}$ | - | - | - | - | - | - |
| MultimodalSum | 34.19 | 7.05 | 20.81 | 21.43 | 1.58 | 13.20 | 20.39 | 2.08 | 12.83 |
| TransSum | 34.23 | 7.24 | 20.49 | - | - | - | - | - | - |
| COOP | $\mathbf{36.57}$ | 7.23 | 21.24 | - | - | - | - | - | - |
| T5-concat | 28.04 | 4.46 | 16.39 | 21.28 | $\mathbf{2.57}$ | 13.00 | 20.61 | 2.72 | 13.33 |
| BART-concat | 32.35 | 6.49 | 19.78 | 22.32 | 2.27 | 13.74 | 21.75 | 2.39 | 13.57 |
| MEDOS | 34.63 | $\mathbf{7.48}$ | 20.97 | $\mathbf{23.92}^{*}$ | 2.27* | $\mathbf{14.69}^{*}$ | $\mathbf{25.44}^{*}$ | $\mathbf{4.16}^{*}$ | $\mathbf{16.45}^{*}$ |
Table 3: Results on the Amazon test set and its extensions. R, D, Q indicate the presence of reviews, description, and question-answers respectively in the input. abs? indicates abstractive systems. Bold and underline indicate the best and second-best scores among abstractive systems. * indicates a p-value < 0.05 on a paired t-test against MultimodalSum. Overall, our combination of the SDC approach and MEDOS outperforms baselines across all three test sets.
Algorithm 1 SDC using Additional Sources
Input: Reviews $R$, $\mathbf{e_R} \in \mathbb{R}^{N \times d}$, product description $D$, $\mathbf{e_D} \in \mathbb{R}^{1 \times d}$, and question-answer pairs $Q$, $q \in Q$, $\mathbf{e_q} \in \mathbb{R}^{1 \times d}$ for a product. Functions $sim$, $diag$, and $mean$.
Initialize $Z = []$
for each product do
    $M \leftarrow diag(sim(\mathbf{e_R}, \mathbf{e_R}), 0)$    $\{\in \mathbb{R}^{N \times N}\}$
    $ds \leftarrow sim(\mathbf{e_R}, \mathbf{e_D})$    $\{\in \mathbb{R}^{N \times 1}\}$
    for $q \in Q$ do
        $qs \mathrel{+}= sim(\mathbf{e_R}, \mathbf{e_q})$    $\{\in \mathbb{R}^{N \times 1}\}$
    end for
    $qs \leftarrow mean(qs)$    $\{\in \mathbb{R}^{N \times 1}\}$
    $ss \leftarrow \lambda_1 \cdot ds + \lambda_2 \cdot qs$
    $R_p \leftarrow$ top-p reviews using $ss$
    for $r \in R_p$ do
        $T \leftarrow$ top-k reviews for $r$ using $M$
        $Z.insert(\{T, D, Q, r\})$
    end for
end for
Return $Z$

4 Synthetic Dataset Creation (SDC)

Before discussing the details of our framework, we formalize the synthetic dataset creation process used to train these models. In the absence of supervised datasets, most recent approaches (Bražinskas et al., 2020; Im et al., 2021) resort to self-supervision wherein {input reviews, pseudo-summary} pairs are constructed.

Following Bražinskas et al. (2020), we can assume that a review $r \in R$ can serve as a summary for a set of reviews $T \subseteq R - \{r\}$. This lets us create training points $(T, r)$, i.e. {input reviews, pseudo-summary}, similar to what the model will experience during inference. $T$ is fixed to size $k$, enabling comparison with existing works.
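A minimal sketch of this leave-one-out construction, assuming $T$ is drawn uniformly at random from $R - \{r\}$. The sampling scheme and the name `make_pairs` are illustrative assumptions, not taken from the paper's code.

```python
import random

def make_pairs(reviews, k, seed=0):
    """Build (T, r) pairs: each review r is a pseudo-summary for k other reviews."""
    rng = random.Random(seed)
    pairs = []
    for idx, r in enumerate(reviews):
        others = reviews[:idx] + reviews[idx + 1:]  # R - {r}
        if len(others) < k:
            continue  # too few remaining reviews to form a size-k input set
        T = rng.sample(others, k)  # fixed-size input set
        pairs.append((T, r))
    return pairs

pairs = make_pairs(["review a", "review b", "review c", "review d"], k=2)
```

Each product with $N > k$ reviews yields $N$ training points under this scheme, one per held-out review.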

However, in the presence of additional sources such as product description $D$ and question-answer pairs $Q$, we slightly modify this definition. Instead of synthetic pairs, we construct synthetic quadruplets of the form: {input reviews, product description, question-answers, pseudo-summary}.

Algorithm 1 details the process of generating synthetic quadruplets. We generate multiple such quadruplets out of reviews $R$, product description $D$, and question-answer pairs $Q$ for a specific product. The overall idea for synthetic dataset creation is to choose relevant quadruplets for training. Here we consider a quadruplet relevant if it best aids our model in learning the task of opinion summarization using multiple sources.

The intuition is to first select a pseudo-summary $r$ that is the closest to both $D$ and $Q$. We measure closeness in terms of cosine similarity $sim$ between their embeddings (SBERT (Reimers and Gurevych, 2019)). This selection ensures that the pseudo-summary $r$ contains information relevant to both $D$ and $Q$, so that the model learns to pick information from these two sources as well during training. Next, using the selected pseudo-summary $r$, we look for its $k$ closest reviews, which act as its input review set $T$; this ensures that the model learns the task of summarization.

More formally, we first compute a matrix $M \in \mathbb{R}^{N \times N}$ by computing cosine similarity between embeddings of each review pair $(r_a, r_b)$ where $r_a, r_b \in R$. We set all diagonal entries of $M$ to zero to remove self-comparisons using the $diag$ function. Next, we compute $ds \in \mathbb{R}^{N \times 1}$ by computing cosine similarity between the embeddings of each review $r_a$ and $D$. We also compute $qs \in \mathbb{R}^{N \times 1}$ by computing cosine similarity between the embeddings of each review $r_a$ and all $q \in Q$ and taking their $mean$.
Finally, we compute $ss \in \mathbb{R}^{N \times 1}$ as $\lambda_1 \cdot ds + \lambda_2 \cdot qs$, where $\lambda_1, \lambda_2$ are parameters set to $0.5$ for our experiments. We select $R_p \subseteq R$ reviews for forming $p$ synthetic quadruplets by taking the top-p scores from $ss$. For each review $r \in R_p$, we get the top-k reviews $T$ from $R - \{r\}$ using the scores corresponding to the review $r$ from $M$. This lets us form synthetic quadruplet instances such as $\{T, D, Q, r\}$ for model training.
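The steps of Algorithm 1 for a single product can be sketched in plain Python as follows. This is an illustrative version under stated assumptions, not the released implementation: embeddings are passed in as precomputed lists of floats (the paper uses SBERT), $Q$ is assumed non-empty, $k \le N - 1$, and the names `cosine`, `top_indices`, and `sdc_quadruplets` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_indices(scores, n):
    """Indices of the n largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

def sdc_quadruplets(reviews, e_R, D, e_D, Q, e_Q, p, k, lam1=0.5, lam2=0.5):
    """Return up to p quadruplets (T, D, Q, r) for one product."""
    N = len(reviews)
    # M: review-review similarities with the diagonal set to zero.
    M = [[0.0 if a == b else cosine(e_R[a], e_R[b]) for b in range(N)]
         for a in range(N)]
    # ds: similarity of each review to the product description.
    ds = [cosine(e_R[a], e_D) for a in range(N)]
    # qs: mean similarity of each review to all question-answer pairs.
    qs = [sum(cosine(e_R[a], e_q) for e_q in e_Q) / len(e_Q) for a in range(N)]
    ss = [lam1 * d + lam2 * q for d, q in zip(ds, qs)]
    Z = []
    for r_idx in top_indices(ss, p):  # pseudo-summary candidates R_p
        row = list(M[r_idx])
        row[r_idx] = float("-inf")  # assumption: force r out of T even if other scores are negative
        T = [reviews[j] for j in top_indices(row, k)]
        Z.append((T, D, Q, reviews[r_idx]))
    return Z

reviews = ["r0", "r1", "r2", "r3"]
e_R = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
Z = sdc_quadruplets(reviews, e_R, "desc", [1.0, 0.0],
                    ["qa1", "qa2"], [[1.0, 0.0], [2.0, 0.0]], p=1, k=2)
```

In this toy example `r0` points in the same direction as the description and both question-answer embeddings, so it is chosen as the pseudo-summary, and its two most similar reviews form $T$.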

Figure 1: Framework of our MEDOS model that takes reviews, description, and question-answers (QA) as the input. During inference, the model generates a summary whereas during training the model uses pseudo-summary obtained through SDC process for learning.

5 Model Framework (MEDOS)

Figure 1 represents our multi-encoder framework, where each source passes through its separate encoder to generate separate attentions: $\mathbf{a_R} = \text{S-Attn}(\text{Enc}(T))$, $\mathbf{a_D} = \text{S-Attn}(\text{Enc}(D))$, and $\mathbf{a_Q} = \text{S-Attn}(\text{Enc}(Q))$. The fused attention $\mathbf{a_f}$ is then computed as:

$$\mathbf{a_f} = \mathbf{a_R} + \alpha \odot \mathbf{a_D} + \beta \odot \mathbf{a_Q} \tag{11}$$

where $\odot$ represents element-wise multiplication, and $\alpha$ and $\beta$ act as gates regulating the flow of information from the product description and question-answers, computed as $\alpha = \phi([\mathbf{a_R}; \mathbf{a_D}]\mathbf{W_\alpha})$ and $\beta = \phi([\mathbf{a_R}; \mathbf{a_Q}]\mathbf{W_\beta})$. Here $\mathbf{W_\alpha}, \mathbf{W_\beta}$ are learned parameters and $\phi(\mathbf{x}) = \text{ReLU}(\tanh(\mathbf{x}))$ is the activation, following Im et al. (2021).
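To make the gating concrete, the fusion step can be sketched in NumPy as follows. The sketch assumes that the three attention outputs share the same shape $(L, d)$ and that the gate weights have shape $(2d, d)$; these shapes and the function names `phi` and `fuse` are illustrative assumptions, not details stated in the paper.

```python
import numpy as np


def phi(x: np.ndarray) -> np.ndarray:
    """phi(x) = ReLU(tanh(x)); every output lies in [0, 1)."""
    return np.maximum(np.tanh(x), 0.0)


def fuse(a_r, a_d, a_q, w_alpha, w_beta):
    """Compute a_f = a_R + alpha * a_D + beta * a_Q with element-wise gates.

    a_r, a_d, a_q: (L, d) attention outputs (equal shapes are an assumption).
    w_alpha, w_beta: (2d, d) gate weights, which are learned in the real model.
    """
    # [a_R; a_D] and [a_R; a_Q] are concatenations along the feature axis.
    alpha = phi(np.concatenate([a_r, a_d], axis=-1) @ w_alpha)
    beta = phi(np.concatenate([a_r, a_q], axis=-1) @ w_beta)
    return a_r + alpha * a_d + beta * a_q
```

Because $\phi$ clips negative pre-activations to zero, a gate can switch off a source entirely for a given position, in which case the fused attention falls back to the review attention alone.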

Product 1: I purchased the VuPoint FS-C1-VP Film and Slide Digital Converter to scan my 35mm film and slide negatives. It is not compatible with Windows XP. The software does not work with Windows 7 or 8. I have tried to contact the company and they do not respond to my emails. I would not recommend this product to anyone.

Product 2: The Marpac TSC 330 Travel Sound Conditioner is a great little machine. It is small enough to travel with, but big enough to be used at home. The sound quality is great and it is easy to use. The only thing I don’t like about it is that it doesn’t have a volume control.

Product 3: The Sony Speaker Dock is a great product. The sound is great and the remote control works great. The only thing I don’t like about it is that it doesn’t charge my iphone 4s. I have to buy an adaptor for that.

Product 4: The Opteka HG-1 Heavy-Duty Aluminum Ultra HandGrip Handheld Stabilization System for DSLR and Video Cameras is a great product. I use it with my Nikon Coolpix L820 and it works great. It is a little heavy, but that is to be expected for a small camera.

Table 4: Qualitative Analysis. MEDOS-generated summaries for four different products from the Amazon test set utilizing reviews, description, and question-answers. Information assisted by the product description is indicated in bold, whereas information assisted by the question-answers is underlined.

6 Experiments

6.1 Datasets

We conducted experiments on Amazon (He and McAuley, 2016; Bražinskas et al., 2020), Oposum+ (Amplayo et al., 2021), and Flipkart (Siledar et al., 2023b). Statistics are in Table 2. Using our SDC strategy, we created 387k and 313k instances from Amazon and Oposum+ respectively to enable supervised training. Due to the unavailability of review data in the case of Flipkart, we used the Amazon data to train models. Refer to Appendix E.

6.2 Test Dataset Extension

In the absence of any test sets that contain additional sources, we extended Amazon, Oposum+, and Flipkart to contain such sources and leveraged ChatGPT to annotate summaries using reviews and additional sources as input, amounting to 662 opinion summaries in total. Statistics for the extended versions of the test sets are in Table 2. For the extensions, we obtain the additional sources (except for Flipkart) from the Amazon data (He and McAuley, 2016). We leverage ChatGPT as our annotator following recent works (Gilardi et al., 2023; Huang et al., 2023). For each test set, we curated GPT-R, in which summaries are generated using only reviews, and GPT-RDQ, in which summaries are generated using reviews, description, and question-answers. We investigated multiple prompts before finalizing the best one (Appendix B). We employed three professionals to evaluate the annotation quality on informativeness, faithfulness, coherence, conciseness, and fluency using a 5-point scale. Statistics are in Table 15. The Inter-Rater Reliability computed using Fleiss’ Kappa was 0.23, 0.41, and 0.42 for human-annotated, GPT-R, and GPT-RDQ summaries, which are considered fair, moderate, and moderate agreement respectively (Landis and Koch, 1977). Refer to Appendix C & H.
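For readers unfamiliar with the agreement statistic, the following sketch shows the standard Fleiss' Kappa computation together with the Landis and Koch (1977) agreement bands. How the 5-point ratings were arranged into a count matrix for this study is not described here, so the input format (one row per rated item, one column per category) and the function names are assumptions made for illustration.

```python
import numpy as np


def fleiss_kappa(counts) -> float:
    """Fleiss' kappa for an (items, categories) matrix of rating counts.

    Every row must sum to the same number of raters n.
    """
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    # Per-item agreement P_i and overall category proportions p_j.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_bar = p_i.mean()
    p_e = np.square(p_j).sum()
    return float((p_bar - p_e) / (1.0 - p_e))


def landis_koch_label(kappa: float) -> str:
    """Agreement bands from Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"
```

Under these bands, the reported values of 0.23, 0.41, and 0.42 map to fair, moderate, and moderate agreement, matching the interpretation given above.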

6.3 Baseline Models

Extractive Approaches. Random selects a random review from the input as a lower bound. Oracle is the extractive upper bound computed by selecting input sentences with the highest R1 to gold summary. Clustroid (Bražinskas et al., 2020) selects the review with the highest RL score with respect to other reviews. LexRank (Erkan and Radev, 2004) selects the most salient sentences from the input using BERT (Devlin et al., 2019) encodings to represent sentences. QT (Angelidis et al., 2021) represents opinions in quantized space.

Abstractive Approaches. CopyCat (Bražinskas et al., 2020) is a hierarchical variational autoencoder that learns a latent code of the summary. PlanSum (Amplayo and Lapata, 2020) uses content plans to generate synthetic datasets. ConsistSum (Ke et al., 2022) uses aspect and sentiment distribution to generate review-summary pairs. MultimodalSum (Im et al., 2021) generates summaries using multimodal data such as text, images, and meta-data. TransSum (Wang and Wan, 2021) uses aspect and sentiment embeddings to construct synthetic datasets. COOP (Iso et al., 2021) searches for convex combinations of latent vectors to generate summaries. AceSum (Amplayo et al., 2021) uses silver-labeled data obtained through seed words to train the model. SW-LOO (Shen et al., 2023) uses the aspect seed words to construct synthetic datasets, whereas NLI-LOO uses only aspects. AceSum$_{ext}$, SW-LOO$_{ext}$, and NLI-LOO$_{ext}$ are the respective extractive versions. ASBOS (Siledar et al., 2023b) uses aspect-sentiment to filter sentences and generate supervised summaries.

Multi-source Approaches. Due to the absence of any unsupervised approaches that use additional sources as input, we fine-tune two models using our synthetic dataset for a fair comparison. BART-concat and T5-concat use BART (Lewis et al., 2019) and T5 (Raffel et al., 2020) respectively, with the sources concatenated into a single input text. Refer to Appendix F.

6.4 Implementation Details

We used the bart-large (Lewis et al., 2019) and t5-large (Raffel et al., 2020) models from HuggingFace (Wolf et al., 2019). A learning rate of $2e-6$, a batch size of 8, and 5 epochs perform the best on the dev sets (Appendix G). During inference, we set the beam size to 5 and no repeat ngram to 3. For encoding, we use all-MiniLM-L12-v2 from SBERT (Reimers and Gurevych, 2019). For SDC, $k=8$ for Amazon and $k=10$ for Oposum+ and Flipkart, whereas the top-p selection is done using the 85th percentile. The number of QAs is fixed to $M=10$.
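The percentile-based top-p selection mentioned above can be illustrated with a short NumPy sketch. It assumes that a review qualifies as a pseudo-summary when its combined score is at or above the 85th percentile of the scores for its product; the exact thresholding rule and the function name `select_by_percentile` are assumptions for illustration.

```python
import numpy as np


def select_by_percentile(ss: np.ndarray, percentile: float = 85.0) -> np.ndarray:
    """Indices of reviews whose score is at or above the given percentile of ss."""
    threshold = np.percentile(ss, percentile)
    return np.flatnonzero(ss >= threshold)
```

With this rule the number of selected pseudo-summaries scales with the number of reviews per product instead of being a fixed count.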

Gold (Amazon GPT-RDQ)

The VuPoint Film and Slide Scanner presents a mixed picture. It claims ease & efficiency for converting film to digital format, but user experiences differ. While some commend its user-friendliness and recommend it, others report frustrating issues like bleached images, compatibility problems, and subpar results. The need for XP compatibility limits its usefulness for modern systems. It offers potential for simple scanning but may require persistence to achieve desired outcomes.

MultimodalSum

I bought this product to scan my negatives. It does not work with Windows XP. I have tried to contact the company several times and have not received a response. I am very disappointed in the product. I would not recommend it to anyone.

BART-concat

I bought this Scanner to scan my slides and film. I have Windows XP and it does not work with it. I tried to download the drivers from the web site but they are not available. I called the company and they said they would send me a new one but I have not received it yet. I am very disappointed.

Our Model (MEDOS)

I purchased the VuPoint FS-C1-VP Film and Slide Digital Converter to scan my 35mm film and slide negatives. It is not compatible with Windows XP. The software does not work with Windows 7 or 8. I have tried to contact the company and they do not respond to my emails. I would not recommend this product to anyone.

Table 5: Comparative Analysis. ChatGPT-generated summary using reviews, description, and question-answers (GPT-RDQ), followed by different model-generated summaries for an Amazon test set product. Information assisted by the description and question-answers is in bold and underlined respectively. MEDOS is able to capture vital information from additional sources, which would not be possible using only reviews.

7 Results and Analysis

Automatic Evaluation.  We use the ROUGE-{1,2,L} F1 scores (Lin, 2004) (R1, R2 & RL) to assess the quality of the generated summaries. Tables 3, 11 & 12 present the results on Amazon, Oposum+, and Flipkart and their variants respectively. In general, we observe that our MEDOS model performs better than the baselines and outperforms MultimodalSum on all nine test sets. Better results on the GPT-RDQ versions are expected, as our model and these test sets use all sources for generating summaries. However, we observe that our models perform much better even on the original and GPT-R test sets. We believe the reason is that, in the presence of multiple sources, our models are better at figuring out what information is essential and needs to be presented in the summary. Our approach to creating synthetic datasets plays a vital role in this: by showing the model the most relevant pseudo-summary, selected by taking all the sources into consideration, our models learn the task of opinion summarization better, as evidenced by the results. Next, in almost all cases, we observe that MEDOS performs better than the single-encoder models that simply concatenate the inputs (BART-concat & T5-concat). Owing to its multi-encoder framework, MEDOS is able to selectively choose relevant information from the product description and question-answers. Additionally, we observe that single-encoder models encounter context limitations in most cases and are thereby unable to leverage the additional sources fully.
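As a reminder of what R1 measures, the sketch below computes a simplified ROUGE-1 F1 as the harmonic mean of unigram precision and recall. It lowercases and splits on whitespace only; the official ROUGE package (Lin, 2004) applies its own tokenization and optional stemming, so scores from this sketch will not match reported numbers exactly. The function name `rouge1_f1` is illustrative.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For instance, comparing "the sound is great" against "the sound quality is great" gives a precision of 1.0 and a recall of 0.8, and hence an F1 of 8/9.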

Qualitative Analysis.  Table 4 presents the summaries generated by our MEDOS model for four different products from the Amazon test set. Product descriptions typically contain brand names as well as aspect-specific details. We observe that MEDOS excels at picking up these specific names and including them in the generated summaries at appropriate places, ensuring that the summaries are coherent. For example, 35mm film in product 1 is an essential piece of information that gets included in the summary. MEDOS also demonstrates the ability to pick relevant information from question-answers while keeping the opinions being summarized in context. In product 4, the MEDOS model additionally gathers the compatibility with the Nikon Coolpix L820 and the weight of the product from question-answers. Overall, MEDOS, due to its multi-encoder architecture and assistance from synthetic datasets during training, learns to fuse relevant information well.

Comparative Analysis.  Sample summaries generated by our model and some baselines for an Amazon test set product are shown in Table 5. MultimodalSum uses reviews, images, and meta-data, whereas Gold (Amazon GPT-RDQ), BART-concat, and our model use reviews, product description, and question-answers. In comparison to MultimodalSum, which also uses the product description as part of the meta-data, MEDOS captures details better, such as VuPoint FS-C1-VP Film and Slide Digital Converter (brand name) and 35mm film (information present only in the description). In the presence of QA, MEDOS is able to provide relevant additional context to the information present in reviews: it picks details about Windows 7 and 8 from question-answers and presents them along with Windows XP. Finally, MEDOS does a better job than BART-concat in capturing details, which we intuit is due to its multi-encoder framework. Additionally, the overall retention of the consensus opinions from the reviews is unaffected.

Error Analysis.  Unfortunately, our models are also prone to occasional hallucinations. For example, the summary for product 3 in Table 4 mentions that an adaptor is needed to charge the iPhone 4s. Although the need for an adaptor for some models is mentioned in question-answers and the iPhone 4s is mentioned in reviews, there is no evidence that the iPhone 4s needs an adaptor. We attribute such hallucinations to the model treating brand names such as iPhone 4s, iPhone 5, etc. as the same.

Ablation Study.  Table 6 presents the ablation study of our MEDOS model using different sources on the Amazon GPT-RDQ test set. Results indicate that the combination of all sources performs the best. Intuitively, a higher score on Amazon GPT-RDQ summaries indicates that our model is leveraging the additional sources to generate more informative summaries. Without question-answers, we observe a drop of about 2 R1 points (25.44 to 23.54), whereas without the description the drop is about 5 R1 points (25.44 to 20.05). As expected, the utility of the description is higher than that of the question-answers. Descriptions contain aspect-specific details which help in enriching the summaries. In contrast, question-answers provide information related to specific queries about the product, which may or may not contribute to the overall summary. The distinction is evident, as using only reviews and question-answers results in poorer performance than using only reviews and description.

Amazon GPT-RDQ
MEDOS | R1 ↑ | R2 ↑ | RL ↑
w. Reviews + Description + QA | $\mathbf{25.44}$ | $\mathbf{4.16}$ | $\mathbf{16.45}$
w. Reviews + Description | 23.54 | 2.43 | 14.81
w. Reviews + QA | 20.05 | 1.36 | 12.90
w. Reviews | 21.26 | 2.22 | 13.68
Table 6: Ablation study on Amazon GPT-RDQ. The highest utility comes from adding the description. QA in the presence of reviews and description aids the best.

Human Evaluation. Table 7 shows the Best-Worst Scaling (Louviere et al., 2015) results, assessing the quality of opinion summaries. Six Masters’ students aged 21-30 evaluated the model-generated summaries on: faithfulness, coherence, conciseness, and fluency. Each evaluator assigned a score of +1 for best, -1 for worst, and 0 for the remaining models. Final scores were computed by averaging the scores from all the evaluators. Notably, MEDOS achieved the best scores on all criteria.

Amazon | Faithfulness ↑ | Coherence ↑ | Conciseness ↑ | Fluency ↑
PlanSum | -0.50 | -0.66 | -0.63 | -0.68
MultimodalSum | 0.17 | 0.16 | 0.22 | 0.14
BART-concat | 0.05 | 0.08 | 0.07 | 0.10
MEDOS | $\mathbf{0.21}$ | $\mathbf{0.41}$ | $\mathbf{0.23}$ | $\mathbf{0.50}$
Table 7: Best-Worst Scaling. MEDOS generated summaries received better scores on all four criteria in human evaluation using the best-worst scaling method.
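The scoring rule described under Human Evaluation can be sketched as follows. The sketch assumes that each judgment names exactly one best and one worst model for a given criterion, and that the final score of a model is its summed score divided by the number of judgments; the model names and judgments in the usage note are placeholders, not data from the study.

```python
from collections import defaultdict


def best_worst_scores(judgments, models):
    """Average +1 (best) / -1 (worst) / 0 (otherwise) per model.

    judgments: list of (best_model, worst_model) pairs, one per judgment.
    models: all model names being compared.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    totals = defaultdict(float)
    for best, worst in judgments:
        totals[best] += 1.0
        totals[worst] -= 1.0
    # Models never picked as best or worst keep a total of 0.
    return {m: totals[m] / len(judgments) for m in models}
```

For example, with four hypothetical judgments `[("A", "B"), ("A", "C"), ("D", "B"), ("A", "B")]` over models A to D, model A scores 0.75 and model B scores -0.75. Scores always lie between -1 and 1 and sum to zero across models.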

SDC approach effectiveness. Our SDC approach selects the pseudo-summary based on description and QA first, followed by reviews. This ensures that the model sees relevant information during training thereby learning two things: picking of relevant information from additional sources and generating opinion summaries. Table 8 reports the results obtained using different SDC approaches.

Amazon GPT-RDQ
SDC approach | R1 ↑ | R2 ↑ | RL ↑
Our approach | $\mathbf{25.44}$ | $\mathbf{4.16}$ | $\mathbf{16.45}$
Using only reviews for selection | 21.36 | 2.04 | 13.86
Random selection | 14.31 | 0.48 | 10.20
Table 8: SDC approach analysis. Our approach that uses description and question-answers along with reviews for selecting pseudo-summary performs the best.

Quantification of information captured. We measure the R1 scores of generated summaries against each source on the Amazon test set to quantify the amount of information captured. Figure 2 shows that our MEDOS-generated summaries achieve an R1 of 18.64, 11.82, and 5.81 for reviews, description, and question-answers, compared to 18.63, 8.28, and 5.46 for MultimodalSum. The nearly identical review R1 for MEDOS and MultimodalSum suggests that even when additional information is present, MEDOS effectively captures the crucial details from reviews. Next, MEDOS is better than both MultimodalSum and BART-concat in leveraging the information from the description. Finally, for QA, the R1 for MultimodalSum acts as a baseline, as it does not use any QA during summarization. We observe that BART-concat performs worse, whereas MEDOS is able to capture relevant information.

Figure 2: Quantification of information captured. MEDOS captures a similar amount of information from reviews as that of MultimodalSum, performs better for description, and picks relevant details from QA.

MEDOS performance. We test the performance of the MEDOS model by varying the number of parameters. Specifically, we use two variants of BART, i.e. bart-base and bart-large, and report the results in Table 9. We observe that the bart-base variant of MEDOS with just 0.3B parameters outperforms the single-encoder models T5-concat and BART-concat (which uses bart-large). Comparing the two variants of MEDOS, we find that the bart-large version, as expected, performs better than bart-base due to its larger number of parameters. Overall, our findings indicate that the multi-encoder framework performs better and is able to capture details from different sources effectively.

Amazon GPT-RDQ
Model | mul? | #parameters | R1 ↑ | R2 ↑ | RL ↑
T5-concat | no | 0.7B | 20.61 | 2.72 | 13.33
BART-concat | no | 0.4B | 21.75 | 2.39 | 13.57
MEDOS bart-base | yes | 0.3B | 22.21 | 3.38 | 15.31
MEDOS bart-large | yes | 0.8B | $\mathbf{25.44}$ | $\mathbf{4.16}$ | $\mathbf{16.45}$
Table 9: MEDOS Results. Comparison of MEDOS summaries for different parameter sizes. mul? indicates whether a model uses multiple encoders. #parameters indicates the number of parameters in billions (B).

LLMs on Multi-source Opinion Summarization. Recently, large language models (LLMs) have shown remarkable performance on many tasks. For a fair comparison to baselines, we kept the focus of our work on smaller models in a self-supervised setting. For completeness, we test the instruct models Claude-2 (https://www.anthropic.com/index/claude-2), Chatglm2-6b (Du et al., 2022), Llama-2-70b-chat (https://huggingface.co/meta-llama/Llama-2-70b-chat-hf), and Llama-2-7b-chat (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) (Touvron et al., 2023) on the task of multi-source opinion summarization. The training details of these models are not public, and they could possibly have had access to the test sets as part of their training. We use the same GPT-RDQ prompts as in Appendix B to generate summaries using the LLMs. We observe that our MEDOS model with just 0.8B parameters performs comparably to Claude-2 with 130B parameters and Chatglm2-6b (https://huggingface.co/THUDM/chatglm2-6b) with 6B parameters. Although the Llama models with 70B and 7B parameters perform considerably better, MEDOS provides a cheaper alternative for task-specific models.

Amazon GPT-RDQ
Model | #parameters | R1 ↑ | R2 ↑ | RL ↑
Claude-2 | 130B | 31.11 | 4.73 | 16.67
Llama-2-70b-chat | 70B | $\mathbf{32.77}$ | $\mathbf{7.84}$ | $\mathbf{20.28}$
Chatglm2-6b | 6B | 27.31 | 4.72 | 16.80
Llama-2-7b-chat | 7B | 32.43 | 7.33 | 20.27
MEDOS | 0.8B | 25.44 | 4.16 | 16.45
Table 10: LLM results on Amazon GPT-RDQ test set compared to MEDOS.

8 Conclusion and Future Work

We proposed a novel approach to create synthetic datasets by harnessing information from reviews and additional sources such as product description and question-answers. This method enables supervised training of models without the need for expensive annotated training datasets. Our proposed framework MEDOS uses separate encoders for selectively fusing information from these sources to generate an opinion summary. For evaluation, due to the absence of any test sets that contained such additional sources and annotated summaries, we extended the already available e-commerce test sets with additional sources and leveraged ChatGPT to annotate summaries. This resulted in six additional test sets with 662 opinion summaries in total. Results show that our synthetic dataset approach and MEDOS framework outperform the SOTA model on average by 14.5% and the simple input concatenation baseline by 6.5% across all nine test sets. Through qualitative and comparative analysis, we demonstrated that our model-generated summaries are more informative, which emphasizes the importance of including additional sources for comprehensive summaries.

One direction for future work is to expand these frameworks to encompass more reviews and all available sources, creating thorough product summaries.

Limitations

Although our work uses a multi-encoder framework, it is still limited by the size of the input. In e-commerce, the number of reviews for a product can run into the tens of thousands, which cannot be supported directly by current model architectures. There has been research on increasing the context limits of the latest large language models; however, the performance of such models still needs to be tested on larger inputs for the task of opinion summarization. It becomes even more challenging to integrate the additional sources found on product pages of e-commerce websites to provide an overall well-rounded product summary. Finally, we did not consider large language models (LLMs) in our work, as our goal was to push for improvements in smaller models for multi-source opinion summarization utilizing only the available product corpus, without the need for expensive large-scale annotated datasets and compute-intensive large-scale models. Our models do not use any LLM signals or LLM-generated data for training and rely only on the product corpus for learning the task of multi-source opinion summarization.

Ethical Considerations

We perform our experiments on existing opinion summarization datasets as well as extend the test sets by generating summaries using ChatGPT. Some of the examples in these datasets might not be appropriate for everyone. Our models may also propagate these unintended biases due to the nature of the datasets. We urge the research community to use our models and these test sets with caution and we are fully committed to removing any discrepancies in the existing datasets in the future.

References

  • Amplayo et al. (2020) Reinald Kim Amplayo, Stefanos Angelidis, and Mirella Lapata. 2020. Unsupervised opinion summarization with content planning. In AAAI Conference on Artificial Intelligence.
  • Amplayo et al. (2021) Reinald Kim Amplayo, Stefanos Angelidis, and Mirella Lapata. 2021. Aspect-controllable opinion summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6578–6593, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Amplayo and Lapata (2020) Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised opinion summarization with noising and denoising. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1934–1945, Online. Association for Computational Linguistics.
  • Angelidis et al. (2021) Stefanos Angelidis, Reinald Kim Amplayo, Yoshihiko Suhara, Xiaolan Wang, and Mirella Lapata. 2021. Extractive opinion summarization in quantized transformer spaces. Transactions of the Association for Computational Linguistics, 9:277–293.
  • Angelidis and Lapata (2018) Stefanos Angelidis and Mirella Lapata. 2018. Summarizing opinions: Aspect extraction meets sentiment prediction and they are both weakly supervised. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3675–3686, Brussels, Belgium. Association for Computational Linguistics.
  • Bhaskar et al. (2023) Adithya Bhaskar, Alex Fabbri, and Greg Durrett. 2023. Prompted opinion summarization with GPT-3.5. In Findings of the Association for Computational Linguistics: ACL 2023, pages 9282–9300, Toronto, Canada. Association for Computational Linguistics.
  • Bražinskas et al. (2020) Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised opinion summarization as copycat-review generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5151–5169, Online. Association for Computational Linguistics.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
  • Elsahar et al. (2021) Hady Elsahar, Maximin Coavoux, Jos Rozen, and Matthias Gallé. 2021. Self-supervised and controlled multi-document opinion summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1646–1662, Online. Association for Computational Linguistics.
  • Erkan and Radev (2004) Günes Erkan and Dragomir R. Radev. 2004. Lexrank: Graph-based lexical centrality as salience in text summarization. J. Artif. Intell. Res., 22:457–479.
  • Gilardi et al. (2023) Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30).
  • He and McAuley (2016) Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16, page 507–517, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Steering Committee.
  • Hosking et al. (2023) Tom Hosking, Hao Tang, and Mirella Lapata. 2023. Attributable and scalable opinion summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8488–8505, Toronto, Canada. Association for Computational Linguistics.
  • Hu and Liu (2006) Minqing Hu and Bing Liu. 2006. Opinion extraction and summarization on the web. In AAAI, volume 7, pages 1621–1624.
  • Huang et al. (2023) Fan Huang, Haewoon Kwak, and Jisun An. 2023. Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech. In Companion Proceedings of the ACM Web Conference 2023. ACM.
  • Im et al. (2021) Jinbae Im, Moonki Kim, Hoyeop Lee, Hyunsouk Cho, and Sehee Chung. 2021. Self-supervised multimodal opinion summarization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 388–403, Online. Association for Computational Linguistics.
  • Iso et al. (2021) Hayate Iso, Xiaolan Wang, Yoshihiko Suhara, Stefanos Angelidis, and Wang-Chiew Tan. 2021. Convex Aggregation for Opinion Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3885–3903, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Ke et al. (2022) Wenjun Ke, Jinhua Gao, Huawei Shen, and Xueqi Cheng. 2022. Consistsum: Unsupervised opinion summarization with the consistency of aspect, sentiment and semantic. Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  • Landis and Koch (1977) J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, pages 159–174.
  • Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdel rahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Annual Meeting of the Association for Computational Linguistics.
  • Li et al. (2020) Haoran Li, Peng Yuan, Song Xu, Youzheng Wu, Xiaodong He, and Bowen Zhou. 2020. Aspect-aware multimodal summarization for chinese e-commerce products. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8188–8195.
  • Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Liu and Lapata (2019) Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. ArXiv, abs/1908.08345.
  • Louviere et al. (2015) Jordan J. Louviere, Terry N. Flynn, and Anthony A. J. Marley. 2015. Best-worst scaling: Theory, methods and applications.
  • Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Conference on Computational Natural Language Learning.
  • OpenAI (2023) OpenAI. 2023. ChatGPT (August 3 Version). https://chat.openai.com.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  • Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.
  • See et al. (2017) A. See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. ArXiv, abs/1704.04368.
  • Shen et al. (2023) Ming Shen, Jie Ma, Shuai Wang, Yogarshi Vyas, Kalpit Dixit, Miguel Ballesteros, and Yassine Benajiba. 2023. Simple yet effective synthetic dataset construction for unsupervised opinion summarization. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1898–1911, Dubrovnik, Croatia. Association for Computational Linguistics.
  • Siledar et al. (2023a) Tejpalsingh Siledar, Suman Banerjee, Amey Patil, Sudhanshu Singh, Muthusamy Chelliah, Nikesh Garera, and Pushpak Bhattacharyya. 2023a. Synthesize, if you do not have: Effective synthetic dataset creation strategies for self-supervised opinion summarization in E-commerce. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13480–13491, Singapore. Association for Computational Linguistics.
  • Siledar et al. (2023b) Tejpalsingh Siledar, Jigar Makwana, and Pushpak Bhattacharyya. 2023b. Aspect-sentiment-based opinion summarization using multiple information sources. Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD).
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Wang and Wan (2021) Ke Wang and Xiaojun Wan. 2021. TransSum: Translating aspect and sentiment embeddings for self-supervised opinion summarization. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 729–742, Online. Association for Computational Linguistics.
  • Wang and Ling (2016) Lu Wang and Wang Ling. 2016. Neural network-based abstract generation for opinions and arguments. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 47–57, San Diego, California. Association for Computational Linguistics.
  • Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Transformers: State-of-the-art natural language processing. In Conference on Empirical Methods in Natural Language Processing.
  • Zhao and Chaturvedi (2020) Chao Zhao and Snigdha Chaturvedi. 2020. Weakly-supervised opinion summarization by leveraging external information. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9644–9651.

Appendix A Results on Oposum+ and Flipkart datasets

Results on Oposum+ and Flipkart and their corresponding extended test sets are reported in Tables 11 and 12 respectively.

| Model | Oposum+ R1 ↑ | Oposum+ R2 ↑ | Oposum+ RL ↑ | Oposum+ GPT-R R1 ↑ | Oposum+ GPT-R R2 ↑ | Oposum+ GPT-R RL ↑ | Oposum+ GPT-RDQ R1 ↑ | Oposum+ GPT-RDQ R2 ↑ | Oposum+ GPT-RDQ RL ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Random | 33.63 | 10.79 | 19.82 | 24.08 | 2.38 | 13.25 | 23.68 | 2.12 | 12.98 |
| Oracle | 77.31 | 70.30 | 74.35 | 36.87 | 7.41 | 23.88 | 36.28 | 7.44 | 23.87 |
| QT | 37.72 | 14.65 | 21.69 | 25.82 | 3.47 | 14.01 | 25.81 | 3.21 | 14.13 |
| AceSum_ext | 38.48 | 15.17 | 22.82 | - | - | - | - | - | - |
| SW-LOO_ext | 40.45 | 19.13 | 23.20 | - | - | - | - | - | - |
| NLI-LOO_ext | 39.79 | 18.33 | 23.49 | - | - | - | - | - | - |
| CopyCat | 29.80 | 5.61 | 17.97 | 22.41 | 2.30 | 13.94 | 22.38 | 2.03 | 14.06 |
| AceSum | 32.98 | 10.72 | 20.27 | 22.78 | 3.59 | 13.20 | 23.54 | 3.51 | 13.88 |
| PlanSum | 30.26 | 5.29 | 17.48 | 22.37 | 2.05 | 13.32 | 22.64 | 2.25 | 13.71 |
| MultimodalSum | 33.08 | 7.46 | 19.75 | 23.35 | 2.98 | 14.53 | 23.73 | 2.80 | 14.70 |
| SW-LOO | 36.19 | **12.17** | 21.11 | - | - | - | - | - | - |
| NLI-LOO | 31.22 | 9.93 | 19.08 | - | - | - | - | - | - |
| T5-concat | 30.84 | 11.08 | 21.01 | 21.98 | 2.84 | 12.91 | 20.41 | 2.31 | 12.73 |
| BART-concat | 34.76 | 9.12 | 20.64 | 25.64 | 3.47 | 15.29 | 25.62 | **3.36** | 15.91 |
| MEDOS | **36.57*** | 8.79* | **21.35*** | **26.82*** | **3.67*** | **15.92*** | **26.32*** | 3.34* | **16.10*** |

Table 11: Results on Oposum+ test set and its extensions. R, D, Q indicate the presence of reviews, description, and question-answers respectively in the input. abs? indicates abstractive systems. Bold and underline indicate best and second-best scores using abstractive systems. * indicates p-value < 0.05 on paired t-test against MultimodalSum. Overall our combination of SDC approach and MEDOS model outperforms baselines across all three test sets.
| Model | Flipkart R1 ↑ | Flipkart R2 ↑ | Flipkart RL ↑ | Flipkart GPT-R R1 ↑ | Flipkart GPT-R R2 ↑ | Flipkart GPT-R RL ↑ | Flipkart GPT-RDQ R1 ↑ | Flipkart GPT-RDQ R2 ↑ | Flipkart GPT-RDQ RL ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Random | 19.50 | 2.50 | 10.89 | 24.22 | 4.40 | 14.10 | 18.04 | 2.26 | 10.51 |
| Oracle | 34.07 | 6.34 | 21.30 | 38.35 | 9.98 | 24.81 | 29.47 | 5.12 | 19.20 |
| Clustroid | 21.42 | 3.01 | 12.08 | 27.76 | 5.56 | 16.77 | 10.17 | 1.45 | 7.74 |
| LexRank | 21.57 | 2.66 | 11.88 | 28.19 | 5.91 | 16.92 | 19.65 | 3.03 | 12.15 |
| QT | 25.18 | 3.62 | 13.05 | 30.94 | 5.96 | 15.34 | 22.92 | 2.95 | 11.97 |
| ASBOS† | 32.55 | 6.44 | 17.03 | 28.27 | 4.05 | 14.30 | 27.32 | 4.95 | 14.83 |
| CopyCat | 18.38 | 1.81 | 11.99 | 21.68 | 2.13 | 13.92 | 17.84 | 1.25 | 11.70 |
| PlanSum | 19.96 | 2.70 | 12.86 | 21.17 | 2.23 | 13.48 | 17.34 | 1.49 | 11.68 |
| MultimodalSum | 21.76 | 3.23 | 13.57 | 23.60 | 2.78 | 15.01 | 19.04 | 1.79 | 12.24 |
| T5-concat | 20.41 | 2.83 | 11.80 | 26.70 | **5.75** | 16.65 | 20.14 | 3.00 | 12.31 |
| BART-concat | 22.35 | 4.46 | 15.53 | **27.27** | 4.51 | **17.22** | 23.29 | 3.13 | 14.98 |
| MEDOS | **25.97*** | **5.29*** | **16.05*** | 26.29* | 4.03* | 16.59* | **23.92*** | **4.30*** | **16.35*** |

Table 12: Results on Flipkart test set and its extensions. R, D, Q indicate the presence of reviews, description, and question-answers respectively in the input. abs? indicates abstractive systems. Bold and underline indicate best and second-best scores using abstractive systems. * indicates p-value < 0.05 on paired t-test against MultimodalSum. † represents supervised systems. Overall our combination of SDC approach and MEDOS outperforms baselines.
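The significance markers (*) in Tables 11 and 12 refer to a paired t-test against MultimodalSum. The following is a minimal sketch of the test statistic only, assuming the test is run on per-example scores of two systems over the same test instances; the function name and the input layout are illustrative and are not taken from the paper. The p-value would then be read from a Student t distribution with the returned degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-example scores.

    scores_a and scores_b are hypothetical lists holding one score per test
    instance for two systems, aligned by index.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    # The test works on the per-instance differences.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Sample standard deviation (n - 1 in the denominator).
    # Undefined (division by zero) when all differences are identical.
    sd = stdev(diffs)
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1
```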
| Rating | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Informativeness | very poor | poor | acceptable | good | very good |
| Faithfulness | all hallucinated | somewhat verifiable | moderate hallucination | slight hallucination | no hallucination |
| Coherence | very poor | poor | acceptable | good | very good |
| Conciseness | verbose | moderately verbose | slightly verbose | almost concise | concise |
| Fluency | ungrammatical | slightly fluent | somewhat fluent | mostly fluent | fluent |

Table 13: Human evaluation metrics. We use a scale of 1-5 to rate summaries on five evaluation metrics.

Appendix B GPT Prompts

GPT-R prompt: Following are the reviews for a product. Generate a summary of the opinions as a review itself with a word limit of under 100 words. Use information from the given reviews only to generate the summary.
reviews: [r1,…,rk]

GPT-RDQ prompt: Following are the reviews, description, and question-answers for a product. Generate a summary of the opinions as a review itself with a word limit of under 100 words. Use information from the given reviews, description, and question-answers only to generate the summary.
reviews: [r1,…,rk]
description : "…"
question-answers: [q1,..,qM]
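As an illustration of how the two prompts above could be assembled programmatically, here is a minimal sketch. The instruction strings are copied from the prompts; the function names and the way the lists are serialized (comma-separated inside square brackets) are assumptions, since the appendix only shows the placeholders.

```python
GPT_R_INSTRUCTION = (
    "Following are the reviews for a product. Generate a summary of the "
    "opinions as a review itself with a word limit of under 100 words. "
    "Use information from the given reviews only to generate the summary."
)

GPT_RDQ_INSTRUCTION = (
    "Following are the reviews, description, and question-answers for a "
    "product. Generate a summary of the opinions as a review itself with a "
    "word limit of under 100 words. Use information from the given reviews, "
    "description, and question-answers only to generate the summary."
)


def _as_list(items):
    # Assumed serialization: "[item1, item2, ...]".
    return "[" + ", ".join(items) + "]"


def build_gpt_r_prompt(reviews):
    """Prompt that exposes only the reviews."""
    return f"{GPT_R_INSTRUCTION}\nreviews: {_as_list(reviews)}"


def build_gpt_rdq_prompt(reviews, description, question_answers):
    """Prompt that exposes reviews, description, and question-answers."""
    return (
        f"{GPT_RDQ_INSTRUCTION}\n"
        f"reviews: {_as_list(reviews)}\n"
        f'description : "{description}"\n'
        f"question-answers: {_as_list(question_answers)}"
    )
```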

Appendix C Evaluation Metric

We use various metrics to qualitatively evaluate our model-generated summaries as well as ChatGPT-annotated summaries. We use the following:

  1. Informativeness: how much of the information is captured?
  2. Faithfulness: how consistent are the opinions compared to reference summaries?
  3. Coherence: is the summary well organized and easy to read?
  4. Conciseness: is the summary concise yet informative?
  5. Fluency: is the summary fluent and grammatical?

Appendix D ChatGPT Annotation Quality

We assessed the GPT-generated summaries against human-written summaries on 5 metrics, namely Informativeness, Faithfulness, Coherence, Conciseness, and Fluency. Results are presented in Table 15. We compare the ChatGPT-generated summaries against the human-annotated summaries for different test sets and report the results in Table 14. For ChatGPT-generated summaries refer to Table 19. GPT-R represents ChatGPT summaries using only reviews as input whereas GPT-RDQ represents ChatGPT summaries using reviews, description and question-answers.

| Test set | No. of summaries | ChatGPT generated R1 ↑ | ChatGPT generated R2 ↑ | ChatGPT generated RL ↑ |
|---|---|---|---|---|
| Amazon | 96 | 25.09 | 2.58 | 14.02 |
| Oposum+ | 90 | 30.01 | 4.42 | 15.30 |
| Flipkart | 145 | 30.20 | 4.18 | 15.74 |

Table 14: ChatGPT Results. Comparison of ChatGPT summaries with human-annotated summaries for different test sets.
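The R1 columns report ROUGE-1 F1, which is based on unigram overlap between a candidate and a reference summary. The sketch below is a simplified illustration that assumes lowercased whitespace tokenization and a single reference. The ROUGE package (Lin, 2004) applies its own tokenization and optional stemming, so this function will not reproduce the reported numbers exactly, and it returns values in [0, 1] rather than on the 0-100 scale used in the tables.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1 from clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```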
| Summaries | Info. ↑ | Faith. ↑ | Coh. ↑ | Con. ↑ | Flu. ↑ |
|---|---|---|---|---|---|
| Human | 3.88 | 3.91 | 3.68 | 3.83 | 3.62 |
| GPT-R | 4.02 | 4.13 | 4.02 | 4.09 | 3.98 |
| GPT-RDQ | 4.10 | 4.16 | 4.16 | 4.23 | 4.16 |

Table 15: Annotation quality. Both GPT-R and GPT-RDQ summaries score higher on all the metrics on average compared to human-annotated summaries. Scores range from 1-5. Info-informativeness, Faith-faithfulness, Coh-coherence, Con-conciseness, Flu-fluency.

Appendix E Dataset Details

Amazon: Amazon contains reviews from 4 domains: electronics, home & kitchen, personal care, and clothing, shoes & jewelry. The evaluation set contains 3 summaries and 8 reviews per product. The training set contains ~1M reviews over 90K products.

Oposum+: Oposum+ contains reviews from 6 domains: bags, bluetooth headsets, boots, keyboards, televisions, and vacuums. The evaluation set contains 3 extractive summaries and 10 reviews per product. The training set contains ~4.13M reviews over 95K products.

Flipkart: Flipkart contains reviews from 3 domains: laptops, mobiles, and tablets. The test set has 1 summary per product. The original test set contains 1K reviews per product on average. We downsample this to 10 reviews per product (randomly) for comparison.
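The random downsampling of the Flipkart test set to 10 reviews per product can be sketched as follows. This is an illustration only: the dictionary layout, the function name, and the fixed seed are assumptions, and the appendix does not say whether a seed was used.

```python
import random


def downsample_reviews(reviews_by_product, k=10, seed=0):
    """Keep at most k randomly chosen reviews per product.

    reviews_by_product is a hypothetical mapping: product id -> list of reviews.
    """
    rng = random.Random(seed)  # assumed seed, for reproducibility only
    sampled = {}
    for product_id, reviews in reviews_by_product.items():
        if len(reviews) <= k:
            # Nothing to drop when a product already has k or fewer reviews.
            sampled[product_id] = list(reviews)
        else:
            # Sampling without replacement.
            sampled[product_id] = rng.sample(reviews, k)
    return sampled
```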

Appendix F Single-Encoder Baseline

Figure 3: Framework of the baseline model that takes reviews, description, and QA as the input. A simple concatenation (+) of the input sources is used to generate a summary. During inference, the model generates a summary whereas during training the model uses pseudo-summary obtained through SDC process for learning.

In the single-encoder framework, we concatenate reviews, product description, and question-answers using a separator symbol (</s>). This concatenated text $c_{rdq}$ goes through an encoder to get the fused attention $\mathbf{a_f}$ as:

$$\mathbf{a_f} = \text{S-Attn}(\text{Enc}(c_{rdq})) \tag{12}$$

During training, the summary will be the pseudo-summary $r$ and the input $c_{rdq}$ will be formed using $T, D, Q$ from the synthetic quadruplet. Figure 3 describes the single-encoder architecture. We use BART and T5 as our baseline models.
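The concatenation step for the single-encoder baseline can be sketched as below. The separator symbol comes from the text; joining the items inside each source with a space and placing one separator between sources is an assumption, as is the function name.

```python
def build_concat_input(reviews, description, question_answers, sep="</s>"):
    """Build the single-encoder input from the three sources.

    Assumed layout: reviews <sep> description <sep> question-answers.
    """
    sources = [
        " ".join(reviews),           # all reviews as one block
        description,                 # product description
        " ".join(question_answers),  # all question-answer pairs as one block
    ]
    return f" {sep} ".join(s.strip() for s in sources)
```

For example, `build_concat_input(["r1", "r2"], "d", ["q1 a1"])` yields `r1 r2 </s> d </s> q1 a1`, which would then be tokenized and passed to the BART or T5 encoder.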

Appendix G Implementation Details

We used the Adam (Kingma and Ba, 2015) optimizer with eps of 1e-4 and linear weight decay to optimize our models. We use learning rates in [1e-6, 2e-6, 1e-5, 2e-5] and batch sizes in [8, 16] as our hyperparameters. All experiments use NVIDIA A100-SXM4-80GB GPUs.
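The hyperparameter values listed above can be enumerated as a grid. The sketch below assumes every learning rate is paired with every batch size; the appendix does not state whether the full grid was searched, and the dictionary keys are illustrative names.

```python
from itertools import product

LEARNING_RATES = [1e-6, 2e-6, 1e-5, 2e-5]
BATCH_SIZES = [8, 16]
ADAM_EPS = 1e-4


def hyperparameter_grid():
    """List every (learning rate, batch size) combination with the fixed eps."""
    return [
        {"learning_rate": lr, "batch_size": bs, "adam_eps": ADAM_EPS}
        for lr, bs in product(LEARNING_RATES, BATCH_SIZES)
    ]
```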

Appendix H Inter-Rater Reliability

We employed three professionals proficient in English, aged 23 to 34. Two evaluators were male and one was female. They were provided with detailed evaluation instructions along with examples to rate summaries on different criteria as shown in Table 13. Each instance of the dataset was rated once and the work was equally divided among the three evaluators. 100 summaries were randomly chosen for evaluation and each evaluator annotated 50 summaries (25 unique and 25 common among all evaluators to compute Inter-Rater Reliability). Results of the evaluation can be found in Table 15. Before proceeding to the final annotation, we first conducted a pilot study with 10 randomly sampled summaries. Table 16 shows the results of Fleiss’ Kappa computed on different criteria.

| Metric | Human-annotated | GPT-R | GPT-RDQ |
|---|---|---|---|
| Informativeness | 0.22 | 0.43 | 0.45 |
| Factuality | 0.24 | 0.36 | 0.44 |
| Coherence | 0.25 | 0.42 | 0.41 |
| Conciseness | 0.21 | 0.38 | 0.40 |
| Fluency | 0.24 | 0.45 | 0.41 |
| Overall | 0.23 | 0.41 | 0.42 |

Table 16: Fleiss’ Kappa. We compute the Inter-Rater Reliability for human-annotated, GPT-R and GPT-RDQ on five metrics. GPT-R and GPT-RDQ scored higher on all the metrics compared to human summaries.
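For readers unfamiliar with the statistic, Fleiss' Kappa compares the observed agreement among raters with the agreement expected by chance. The sketch below uses the standard formula and assumes each item (here, each commonly annotated summary) receives the same number of ratings on the 1-5 scale. The input layout and function name are illustrative, not the authors' evaluation script.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that each have the same number of ratings.

    ratings: list of per-item lists of labels, e.g. [[4, 4, 5], [3, 4, 4]].
    categories: all possible labels, e.g. [1, 2, 3, 4, 5].
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who gave item i the j-th category.
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Observed agreement per item, averaged over items.
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items
    # Chance agreement from the overall share of each category.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in p_cats)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1 - p_expected)
```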

Appendix I SDC Approach Effectiveness

The novelty of our SDC approach lies in utilizing descriptions and question-answer pairs in the selection of pseudo-summaries in the most effective manner. The initial selection based on descriptions and question-answers ensures that the chosen pseudo-summary exhibits information overlap between these sources. This, in turn, aids the model in learning to extract information from these diverse inputs during the summarization process. Moreover, our strategy involves using the selected pseudo-summary to then identify the input reviews that are the most semantically close to it. This dual-step process enhances the model’s learning of the opinion summarization task. Table 8 contains the results obtained using different SDC approaches. We find that our approach of creating synthetic datasets performs the best.
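The two-step selection described above can be sketched roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: token-level Jaccard overlap stands in for whatever similarity measures the SDC strategy actually uses, and the function names, the output keys, and the default `k` are hypothetical.

```python
def _tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    """Stand-in similarity: token-set overlap (an assumption, see lead-in)."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def make_training_example(reviews, description, question_answers, k=8):
    """Pick a pseudo-summary, then the k reviews closest to it."""
    additional = " ".join([description] + list(question_answers))
    # Step 1: the review that overlaps most with description and QA
    # becomes the pseudo-summary.
    summary_idx = max(
        range(len(reviews)), key=lambda i: jaccard(reviews[i], additional)
    )
    pseudo_summary = reviews[summary_idx]
    # Step 2: rank the remaining reviews by closeness to the pseudo-summary.
    others = [r for i, r in enumerate(reviews) if i != summary_idx]
    ranked = sorted(others, key=lambda r: jaccard(r, pseudo_summary), reverse=True)
    return {
        "summary": pseudo_summary,
        "reviews": ranked[:k],
        "description": description,
        "question_answers": list(question_answers),
    }
```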

Appendix J MEDOS vs. LLMs?

Table 20 compares the summaries generated by LLMs with those of our MEDOS model. We find that MEDOS captures most user opinions in its summary. LLMs, however, go a step further and include additional details that give a more comprehensive view of various product aspects. Even so, MEDOS is significantly smaller and relies solely on an unlabeled corpus to build its synthetic datasets, yet it extracts the crucial user opinions without the extensive resources and fine-tuning required by LLMs, which often have billions of parameters and are trained on billions of tokens.

Our primary goal was to leverage existing product data and refine smaller models like BART for multi-source opinion summarization, and to evaluate their effectiveness against ChatGPT. Prioritizing these smaller models aims to improve accessibility and deployability, particularly on devices with limited resources. Although LLMs perform better, achieving high-quality outputs with smaller models under these constraints is a notable result. Insights gained from this effort can potentially improve the data efficiency of larger models in the future. Beyond cost-effectiveness, MEDOS offers a path to substantial results with reduced computational and data needs.

Appendix K Summary Lengths

Table 17 reports the mean summary length and mean standard deviations for summaries across three test sets: Amazon, Oposum+, and Flipkart.

| Summaries | Amazon μ | Amazon σ | Oposum+ μ | Oposum+ σ | Flipkart μ | Flipkart σ |
|---|---|---|---|---|---|---|
| Human annotated | 55.20 | 12.98 | 82.16 | 20.54 | 118.86 | 37.11 |
| GPT-R | 58.31 | 13.01 | 89.61 | 8.90 | 82.71 | 13.54 |
| GPT-RDQ | 53.64 | 12.28 | 81.57 | 12.88 | 84.44 | 12.15 |
| MultimodalSum | 49.03 | 4.63 | 46.00 | 5.33 | 42.30 | 4.76 |
| MEDOS | 47.75 | 5.73 | 57.36 | 7.09 | 51.79 | 8.28 |

Table 17: Mean summary length (μ) and mean standard deviation (σ) for summaries corresponding to the three test sets: Amazon, Oposum+, and Flipkart.
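Length statistics of this kind can be computed with a few lines of Python. The sketch assumes that length is counted in whitespace-separated words and that σ is the population standard deviation over a set of summaries; the appendix does not specify the tokenization or exactly how the standard deviations are averaged.

```python
from statistics import mean, pstdev


def length_stats(summaries):
    """Return (mean length, standard deviation) in whitespace tokens."""
    lengths = [len(s.split()) for s in summaries]
    return mean(lengths), pstdev(lengths)
```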

Appendix L Example

Table 18 shows the reviews, product description, and question-answers for a sample product from the Amazon test set. Table 19 contains the human-annotated summaries from the original test set and our ChatGPT-generated (GPT-R and GPT-RDQ) summaries followed by different model-generated summaries for the same product.

Reviews

Exactly as described, at 8 + oz. of solid metal this grip offers a stable way to hold your lightweight digital camera, without putting you fingers in front of the lens or flash. I find it works well with the Kodak PlaySmart Video camera and the Nikon S9100 point and shoot. Opteka HG-5 Pistol Handgrip Stabilizer for Point-n-Shoot, DSLR and Video Cameras

Probably the best and the least expensive stick I’ve ever owned and I love it. I use this with my GoPro HD Hero2. It’s a bit heavy but the construct is very good. You can use this as a weapon too. lol

Bought this as part of the stabilizer rig then realized that this was easier to use alone than the rig itself. I am going to use it with a camera with an active stabilizer. Videos looked good. Will update after I use it this weekend. It looks good and is built solid.

I use this with a dual-camera mount and I like this because of the heft / weight and it stays pretty secure whether I use it with the mount or on my flip video camera or snapshot. I’d recommend this handle highly.

I was planning to use this with my D7000 + Battery Grip + 80-200 f / 2.8 lens, but when I received it, I changed my mind. It just does not look like it can handle that load. I put it on my Panasonic GF2, and it performs very nicely. Would highly recommend it for lighter cameras.

The unit is quite sturdy. I bought it to replace the pistol grip unit also featured because the pistol grip locking mechanism did seem to want to lock tight. This unit locks in very tightly and also feels professional. A great purchase for the money.

This is the 2nd stabilizer that I’ve purchased one for my Sony a99 and one for my Sony a33. I can’t speak highly enough about this handy little item! It’s perfectly sized and the ergonomics is ideal! Two thumbs up!!

A low cost device that I bought and paired with a cell phone reduce jittery videos. Works pretty well for handheld use even when walking. The thread seemed a little recessed at first until I moved the washer flat. I recommend this product for anyone who records videos often for friends, and family especially with your cell phone.

Product Description

Opteka HG-1 Heavy-Duty Aluminum Ultra HandGrip Handheld Stabilization System for DSLR and Video Cameras. The Opteka HG-1 HandGrip Stabilization System is a video stabilization device designed specifically for point-and-shoots, Digital SLR cameras and compact camcorders. The Handgrip keeps your hands off the camera and allows you to capture videos from difficult angles. SpecificationsColor:Black; RedMaterials:Aluminum; Foam PaddedThread Size:1/4"Dimensions (HxLxW):6.25" x 1.5" x 1.5" (15.8cm x 3.8cm x 3.8cm)Weight:8.4oz (240g)

Question-Answers

What is on the bottom end? Is there a 1/4 - 20 female connection on the bottom?

Yes it does have a 1/4 - 20 female connection very handy, i hope this helps you.

Does this work with nikon d800

It’ll work with any camera that has a standard thread tripod socket. Note there is a male post at the top AND a female socket on the bottom. One of the handiest gadgets I’ve ever bought! If it only came in blue

Can this be used on a Nikon Coolpix L820, or is that camera too big / insufficient size?

Yes you can. Use Can be used By any camera or camcorder Threaded for a tripod

Is the thread 1/4-20

Yes, 1/4 -20 (1/2 inch long) for standard tripod mount. threads right into the bottom of any small and midsized camera

If my arm shakes a lot, will this help?

Probably not. I recommend you check out a mono pod or tripod. There also is a gimble style stabilizer that may help you but I’ve never used one so if you try it let me know how it works. Hope I answered your question.

I assume this can be screwed directly into a Canon VIXIA HF20?

If your camera is threaded for a tripod it will work. It’s a little heavy but it’s handy to have

Does this handle water resistant and/or underwater?

It’s just a chunk of machined aluminum with a foam handle. Get it wet, it’ll dry off. If it’s screwed into an underwater housing you’re golden, but if you just dunk your camera under water while attached to this attractive red and black handle, you’ll have a nice expensive paperweight Comprende?

Table 18: Reviews, Product Description, and Question-Answers for a sample product from the Amazon test set.
Human-annotated summaries

Exactly as described. Looks good with a solid build and the unit is quite sturdy. This unit locks in very tightly and also feels professional. It is a great purchase for the money. I’d recommend this handle highly.

This stabilizer is perfect for light weight cameras and can also be used with a cell phone to help diminish jitter. You can feel the heft in this stabilizer and know that it will stay secure. It feels weighty and professional. I would say to use caution when using it with larger cameras but it is perfect for smaller cameras. Overall, it is recommended.

A great all-around, sturdy stick for cameras. It is very versatile and quite ideal for lightweight cameras especially, and this stick is highly recommended for its great price-point and overall versatility.

GPT-R summaries

The Opteka HG-5 Pistol Handgrip Stabilizer delivers on its promise. With a solid metal build, it secures lightweight cameras effectively. It impressed users across various devices, from Nikon to GoPro, proving its versatility. Sturdy and easy to handle, it’s a budget-friendly choice for smoother videos. Whether used independently or as part of a rig, it offers stability and control. An ideal solution for steady shots, even when on the move.

The Opteka HG-5 Pistol Handgrip Stabilizer is a solid metal grip offering stability for lightweight cameras. Versatile with various camera types, it’s praised for its heft, secure hold, and professional feel. Some found it surprisingly effective for reducing jittery videos, while others appreciated its durability and ease of use. Though not suitable for heavy setups, it’s ideal for lighter cameras, making it a recommended, cost-effective choice for steady and smooth video capture.

Opteka HG-5 Pistol Handgrip Stabilizer is a solid metal grip offering stability for lightweight digital cameras like GoPro and point-and-shoots. It’s versatile, working well both with rigs and standalone. Sturdy and secure, it handles various cameras effectively. A worthwhile purchase for smoother videos, whether shooting casually or on-the-go.

GPT-RDQ summaries

The Opteka HG-1 HandGrip Stabilization System is a versatile tool for capturing steady videos. Its sturdy build provides stability for lightweight digital cameras and camcorders. Users appreciate its ergonomic design and efficient angle adjustments. The grip’s thread and weight offer ease of use, enhancing video quality even while on the move. An affordable solution for jitter-free recordings, especially useful for handheld shots and tricky angles.

Opteka HG-1 HandGrip offers solid stabilization for cameras. With its durable build and foam-padded handle, it adds stability to your shots. It accommodates various cameras, making it versatile for point-and-shoots, DSLRs, and compact camcorders. It’s especially handy for reducing jittery videos and capturing challenging angles. Its 1/4"-20 thread size ensures compatibility with most devices. The grip is praised for its heft and comfortable ergonomics. A cost-effective tool to improve video stability and angles for diverse cameras.

The Opteka HG-1 HandGrip Stabilization System proves its worth. Sturdy, versatile, and well-crafted, it stabilizes cameras for smooth video capture. Its ergonomic design offers a secure hold while preventing shaky shots. A valuable addition, especially for lightweight point-and-shoots and smaller cameras. Worth every penny for steady, shake-free videos.

PlanSum

i bought this for my nikon d7000 and it works great. i have had no problems with it. the only thing i don’t like is that it is a bit bulky, but it’s not a big deal for me. if you are looking for a good camera, this is the best way to go.

MultimodalSum

This is a great product. I use it with my Canon T3i and it works great. It is a little heavy, but that is to be expected. I like the fact that it has a solid feel to it. I would recommend this product to anyone.

BART-concat

I bought this to use with the Opteka HG-5 Pistol Handgrip Stabilizer for my Canon Powershot T3i and it works great. It is a bit heavy, but that is to be expected with a tripod.

Our Model (MEDOS)

The Opteka HG-1 Heavy-Duty Aluminum Ultra HandGrip Handheld Stabilization System for DSLR and Video Cameras is a great product. I use it with my Nikon Coolpix L820 and it works great. It is a little heavy, but that is to be expected for a small camera.

Table 19: Three human-annotated summaries, three GPT-R summaries, and three GPT-RDQ summaries, followed by the summaries from different models for the same product as in Table 18.
Gold (Amazon GPT-RDQ)

The VuPoint Film and Slide Scanner presents a mixed picture. It claims ease & efficiency for converting film to digital format, but user experiences differ. While some commend its user-friendliness and recommend it, others report frustrating issues like bleached images, compatibility problems, and subpar results. The need for XP compatibility limits its usefulness for modern systems. It offers potential for simple scanning but may require persistence to achieve desired outcomes.

Chatglm2-6b

scanner from VuPoint. The scanner has worked great for me and I have been able to scan over 1, 200 35mm negatives and a dozen slides. I have also been able to upload a few black and white negatives from 1963. The scanner is user-friendly and efficient tool. The software that comes with the scanner is good for Windows XP and Vista, but the company prefers to sell other products. I am looking for a scanner that can work with more modern versions of Windows, but I am not satisfied with the VuPoint company. I am also waiting for a replacement unit to be delivered.

Claude-2

Reviews of this scanner are quite mixed. Some find it easy to use and good for digitizing old negatives and slides. But many cite issues getting it to work properly, with several noting it is not compatible with Windows versions after XP. Complaints include poor image quality, too much light bleaching images out, and tech support being expensive. The included software also gets some negative feedback for lacking updates. A few reviewers say the holders are fiddly for fitting negatives into. Overall while some find it efficient for converting film, many note disappointing quality and outdated software/hardware compatibility as negatives.

Llama-2-7b-chat

The VuPoint FS-C1-VP Film and Slide Digital Converter has received mixed reviews. Some users have had issues with image quality, light bleaching, and poor customer support, while others have found the scanner easy to use and effective at converting slides and negatives to digital format. The product’s compatibility with newer operating systems and software updates has also been a concern for some reviewers. Overall, the product’s performance and customer service have been a disappointment for some buyers.

Llama-2-70b-chat

The VuPoint FS-C1-VP Film and Slide Digital Converter has mixed reviews. Some reviewers find it easy to use and efficient for scanning 35mm film negatives and slides, while others have issues with image quality, software compatibility, and customer support. One reviewer had a positive experience with the company’s customer service, while another found it unhelpful. The product’s ability to scan images at 5.0 Mega pixel quality and automatic color balance and exposure control are mentioned in the description, but some reviewers still express dissatisfaction with the product’s overall performance.

Our Model (MEDOS)

I purchased the VuPoint FS-C1-VP Film and Slide Digital Converter to scan my 35mm film and slide negatives. It is not compatible with Windows XP. The software does not work with Windows 7 or 8. I have tried to contact the company and they do not respond to my emails. I would not recommend this product to anyone.

Table 20: Comparative analysis with LLM-generated summaries. The ChatGPT-generated summary using reviews, description, and question-answers (GPT-RDQ) is followed by summaries generated by different LLMs and by our MEDOS model for an Amazon test set product.