UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches

Chao Wang¹, Neo Wu¹, Lin Ning¹, Luyang Liu¹, Jun Xie¹, Shawn O’Banion¹, Bradley Green¹

Abstract

Large language models (LLMs) have shown remarkable capabilities in generating user summaries from long lists of raw user activity data. These summaries capture essential user information such as preferences and interests, and are therefore invaluable for LLM-based personalization applications such as explainable recommender systems. However, the development of new summarization techniques is hindered by the lack of ground-truth labels, the inherent subjectivity of user summaries, and the cost and time required for human evaluation. To address these challenges, we introduce UserSumBench, a benchmark framework designed to facilitate iterative development of LLM-based summarization approaches. This framework offers two key components: (1) A reference-free summary quality metric. We show that this metric is effective and aligned with human preferences across three diverse datasets (MovieLens, Yelp, and Amazon Review). (2) A novel, robust summarization method that leverages a time-hierarchical summarizer and a self-critique verifier to produce high-quality summaries while mitigating hallucination. This method serves as a strong baseline for further innovation in summarization techniques.

1 Introduction

User activity timelines, including data such as place visit histories, product reviews, movie ratings, and other digital interactions, offer valuable insights into individual preferences, behaviors, and evolving interests. These timelines are crucial for applications like personalized recommendations and user behavior analysis (Wang et al. 2019). Summarizing these timelines into concise, actionable insights is essential for enhancing recommendation systems and understanding user engagement trends. For example, as shown in Figure 2, the next product prediction accuracy of an LLM-based model on the Amazon Review dataset (Ni, Li, and McAuley 2019) significantly improves when using summaries instead of raw activity timelines.

However, generating high-quality user summaries is challenging due to the complexity and diversity of user timelines. The subjective nature of summary evaluation and the lack of standardized ground-truth datasets further complicate the process. Current methods often rely on simplistic heuristics or models that struggle with these issues (Giarelis, Mastrokostas, and Karacapilidis 2023). Moreover, the absence of standardized benchmarks or reliable metrics hampers the evaluation of summarization effectiveness (Fabbri et al. 2021; Lloret, Plaza, and Aker 2018).

To tackle these challenges, we introduce UserSumBench, a comprehensive benchmark framework specifically designed to evaluate user summarization approaches by assessing the quality of user summaries generated from activity timelines. UserSumBench consists of two key components: a robust, reference-free summary quality metric and a strong baseline summarization approach.

In UserSumBench, the proposed quality metric evaluates the effectiveness of user summarization approaches by measuring how accurately the generated summaries predict future user activities. This metric offers a quantitative assessment of how well the summaries capture key aspects of user behavior (see Figure 1) and has demonstrated strong alignment with human ratings.

The proposed strong baseline summarization approach employs a time-hierarchical and self-critique method. This approach uses an LLM for initial summarization, followed by iterative refinement to reduce hallucinations and improve summary quality. This baseline not only validates the benchmark metrics but also serves as a foundation for future innovations in summarization techniques.

Key Contributions:

• Introduction of a quality metric for evaluating user summarization approaches through user summaries, demonstrating strong alignment with human ratings, thereby validating its effectiveness and simplifying the evaluation process.
• Introduction of a strong baseline summarization approach, a time-hierarchical and self-critique method, setting the foundation for future advancements in summarization techniques.

Figure 1: Evaluating summary quality through future activity prediction tasks, where LLMs predict the most likely user queries based on generated summaries of past activities.

2 Related Works

The lack of standardized benchmarks has long been a challenge in evaluating summarization approaches (Fabbri et al. 2021). While datasets like MovieLens (Harper and Konstan 2015) and Amazon Reviews (Ni, Li, and McAuley 2019) offer comprehensive logs of user activities, they lack corresponding ground-truth summaries, complicating the assessment of summarization techniques. Efforts have been made to create datasets that pair user activities with manually crafted summaries. For example, Liu et al. (2023) emphasized the importance of performance prediction in summarization evaluation, advocating for benchmarks that measure the predictive power of generated summaries. Similarly, Akkasi, Fraser, and Komeili (2023) explored reference-free evaluation methods, proposing metrics designed to assess summaries’ ability to convey essential content relevant to future activities. These studies highlight the limitations of current evaluation practices and the need for more predictive and reliable benchmarks.

Recent research has focused on establishing standardized evaluation methodologies for summarization (Fabbri et al. 2021; Liu et al. 2023; Chen et al. 2024). Fabbri et al. (2021) addressed the shortcomings in existing summarization evaluation methods by reassessing 14 automatic evaluation metrics using outputs from recent neural summarization models. They also released a toolkit of evaluation metrics to promote consistency in reporting results. Chen et al. (2024) proposed a facet-aware evaluation paradigm for scientific abstracts, introducing benchmarks that enable more nuanced comparisons of evaluation metrics in this context.

3 UserSumBench Framework

In this section, we present the two key components of the UserSumBench framework: Benchmark Metrics and Hierarchy-Critique Summary Generation. The Benchmark Metrics assess summarization approaches using various criteria, such as future user activity predictions, to ensure a comprehensive evaluation of the generated summaries. The Hierarchy-Critique Summary Generation method provides a strong baseline for evaluating and improving these techniques.

3.1 Benchmark Metrics

UserSumBench includes three evaluation metrics to assess different aspects of user summarization approaches.

Quality Metric

Quality Metric is designed to evaluate the predictive accuracy of user summaries in forecasting future activities. The evaluation involves splitting each user activity timeline into past and future activities (see Figure 1). Summaries are generated from the past activities, and the quality of these summaries is measured by how well they predict the future activities. The quality of a user summary, $Q_{s}$ , is computed by aggregating performance across multiple future activity prediction tasks.

$$Q_{s}=I\left[\sum_{t\in T_{s}}q_{s,t}\geq m\right] \tag{1}$$

Here, $q_{s,t}$ denotes the binary prediction outcome for summary $s$ on task $t$ from the set of tasks $T_{s}$; $m$ is the threshold for the number of correct predictions required; and $I[\cdot]$ is the indicator function, where a result of 1 indicates a “Good” summary and 0 indicates a “Bad” summary.

To evaluate a summarization approach on a generated summary set $S$, the Quality Metric (QM) is calculated as the fraction of summaries in $S$ that are classified as “Good”.

$$QM=\frac{\sum_{s\in S}Q_{s}}{|S|} \tag{2}$$
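Equations 1 and 2 can be illustrated with a minimal Python sketch. The function names and the example outcomes below are hypothetical and serve only to make the arithmetic concrete; the threshold value `m = 3` is used as an illustrative default.

```python
from typing import Sequence


def summary_quality(outcomes: Sequence[int], m: int = 3) -> int:
    """Q_s (Equation 1): 1 ("Good") if at least m tasks were predicted correctly."""
    return int(sum(outcomes) >= m)


def quality_metric(all_outcomes: Sequence[Sequence[int]], m: int = 3) -> float:
    """QM (Equation 2): fraction of summaries classified as "Good"."""
    if not all_outcomes:
        raise ValueError("need at least one summary")
    return sum(summary_quality(o, m) for o in all_outcomes) / len(all_outcomes)


# Hypothetical binary outcomes q_{s,t} for three summaries on four tasks each.
example = [[1, 1, 1, 0], [1, 0, 0, 1], [1, 1, 1, 1]]
print(quality_metric(example))  # two of the three summaries reach m = 3
```

Because $Q_{s}$ is thresholded, a summary that narrowly misses $m$ correct predictions contributes nothing to QM, which is why the choice of $m$ matters when comparing approaches.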

Instruction Following Metric

The Instruction Following Metric evaluates how well the user summaries adhere to specific constraints, such as a word limit, as introduced in other works (Skopek et al. 2023). For a summarization approach, the Instruction Following Metric (IFM) is defined as the proportion of summaries within the set $S$ that meet the word limit constraint $X$.

$$IFM=\frac{|\{s\in S\mid\text{length}(s)\leq X\}|}{|S|} \tag{3}$$

While this metric was originally proposed in earlier studies, it remains a useful measure for assessing how effectively a summarization approach adheres to the given prompt instructions.
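A minimal sketch of Equation 3 follows. It assumes that $\text{length}(s)$ is a whitespace-delimited word count, which the text does not specify; the function names are hypothetical.

```python
from typing import Sequence


def word_count(summary: str) -> int:
    # Assumption: length(s) is the number of whitespace-separated tokens.
    return len(summary.split())


def instruction_following_metric(summaries: Sequence[str], max_words: int) -> float:
    """IFM (Equation 3): share of summaries that respect the word limit X."""
    if not summaries:
        raise ValueError("need at least one summary")
    within_limit = sum(word_count(s) <= max_words for s in summaries)
    return within_limit / len(summaries)


# Hypothetical summaries checked against a four-word limit.
print(instruction_following_metric(["a b c", "a b c d e f"], max_words=4))
```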

Information Density Metric

The Information Density Metric assesses the conciseness and informativeness of user summaries, combining elements of both the Quality Metric and the Instruction Following Metric. This metric evaluates the balance between the length of a summary and its effectiveness in predicting future activities. For each summary $s$ in the set $S$ with associated task set $T_{s}$, the average task prediction accuracy is divided by the length of the summary; the Information Density Metric (IDM) is the mean of these ratios over $S$.

$$IDM=\frac{1}{|S|}\sum_{s\in S}\left(\frac{\frac{1}{|T_{s}|}\sum_{t\in T_{s}}\text{acc}(t)}{\text{length}(s)}\right) \tag{4}$$

This metric provides a quantitative approach to evaluate the trade-off between informativeness and brevity in user summaries, ensuring that summaries are both concise and meaningful.
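Equation 4 can be sketched as below, assuming that $\text{acc}(t)$ is a binary outcome per task and that length is measured in whitespace-delimited words. Both are assumptions made for illustration, as are the function name and the example data.

```python
from typing import Sequence, Tuple


def information_density_metric(records: Sequence[Tuple[str, Sequence[int]]]) -> float:
    """IDM (Equation 4): mean over summaries of (average task accuracy / length)."""
    if not records:
        raise ValueError("need at least one summary")
    total = 0.0
    for summary, outcomes in records:
        length = len(summary.split())  # assumed word-count length
        if length == 0 or not outcomes:
            raise ValueError("each summary needs text and at least one task")
        accuracy = sum(outcomes) / len(outcomes)
        total += accuracy / length
    return total / len(records)


# Hypothetical (summary, task outcomes) pairs.
records = [
    ("likes sci-fi and thrillers", [1, 1, 0, 1]),  # 0.75 accuracy over 4 words
    ("enjoys hiking", [1, 0, 0, 0]),               # 0.25 accuracy over 2 words
]
print(information_density_metric(records))
```

Under this definition, a shorter summary with the same accuracy scores higher, so the metric rewards brevity only when predictive power is preserved.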

3.2 Hierarchy-Critique Summary Generation

UserSumBench introduces a Time-Hierarchical and Self-Critique (Hierarchy-Critique) summarization approach (see Figure 3(2)), which outperforms a simpler single-step method (see Figure 3(1); prompt details are in Appendix A.1) in the comparison reported in Section 4.2. The Hierarchy-Critique approach is designed to generate factually consistent summaries by mitigating hallucinations, while its time-hierarchical structure keeps computation efficient.

The Hierarchy-Critique approach works by first segmenting a user’s activity history into manageable time intervals, ensuring each segment meets a minimum activity threshold to provide a comprehensive representation of the user’s behavior. This segmentation allows LLMs to process the data efficiently within their context window limits. As depicted in Figure 4, the summarizer model generates an initial summary for each segment (refer to Appendix A.2). These segment summaries are then refined by a verifier model (Wang et al. 2023), whose role is to identify and correct potential hallucinations (e.g., query inconsistencies, factual inaccuracies), ensuring that each segment accurately reflects the user’s activities. Finally, the refined segment summaries are synthesized into a cohesive summary that encapsulates the user’s overall behavior and preferences (refer to Appendix A.7), with all time segments combined in chronological order.
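The segmentation step can be sketched as follows. The text does not specify how segment boundaries are chosen, so this sketch assumes fixed-width time windows that are extended until they hold a minimum number of activities; the `(timestamp, description)` schema, the function name, and the parameters are all hypothetical.

```python
from typing import List, Tuple

Activity = Tuple[int, str]  # (timestamp in seconds, description); assumed schema


def segment_timeline(
    activities: List[Activity], window_seconds: int, min_activities: int
) -> List[List[Activity]]:
    """Split a timeline into chronological segments of at least min_activities items."""
    ordered = sorted(activities, key=lambda a: a[0])
    segments: List[List[Activity]] = []
    current: List[Activity] = []
    window_end = None
    for act in ordered:
        if window_end is None:
            window_end = act[0] + window_seconds
        # Close the segment only once the window has elapsed AND it is large enough.
        if act[0] >= window_end and len(current) >= min_activities:
            segments.append(current)
            current = []
            window_end = act[0] + window_seconds
        current.append(act)
    if current:
        # Fold a too-small trailing segment into the previous one.
        if segments and len(current) < min_activities:
            segments[-1].extend(current)
        else:
            segments.append(current)
    return segments


timeline = [(0, "a"), (1, "b"), (2, "c"), (100, "d"), (101, "e"), (200, "f")]
for seg in segment_timeline(timeline, window_seconds=50, min_activities=2):
    print([name for _, name in seg])
```

Each resulting segment would then be passed to the summarizer and verifier independently, which keeps every prompt within the context window.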

Algorithm 1 Hierarchy-Critique Summary Generation

Input: User activity history $\Phi$
Parameter: Time segments $\{T_{1},T_{2},\ldots,T_{n}\}$ , Summarizer model $LLM_{sum}$ , Verifier model $LLM_{ver}$
Output: Factually consistent user summary $S$

1: Divide user activity history $\Phi$ into time segments $\{T_{1},T_{2},\ldots,T_{n}\}$
2: for each time segment $T_{i}$ do
3:     Generate initial summary $S_{i}$ for segment $T_{i}$ using $LLM_{sum}$
4:     Refine $S_{i}$ using $LLM_{ver}$:
5:         Identify and correct Query Inconsistency
6:         Identify and correct Fact Inconsistency:
7:             Extract key KG entities $\{e_{i,1},e_{i,2},\ldots,e_{i,k}\}$ from $S_{i}$
8:             Generate question-answer pairs $\{(q_{i,1},a_{i,1}),(q_{i,2},a_{i,2}),\ldots,(q_{i,k},a_{i,k})\}$ using $LLM_{ver}$ and $S_{i}$
9:             Verify consistency of each pair $(q_{i,j},a_{i,j})$ with user activities $\Phi$
10:            For inconsistent pairs, regenerate $S_{i}$ incorporating feedback
11: end for
12: Synthesize segment summaries $\{S_{1},S_{2},\ldots,S_{n}\}$ into a single cohesive summary $S$
13: return $S$

As illustrated in Figure 4, the Hierarchy-Critique approach employs two LLMs (a summarizer and a verifier) to iteratively refine segment summaries. These LLMs can be the same model or different models, with the verifier focusing on identifying specific types of hallucinations:

• Query Consistency: Ensuring that the summary is relevant to the initial query. The verifier checks for consistency between the query and the summary based on a provided prompt (refer to Appendix A.4).
• Fact Consistency: Ensuring that the information in the generated summary is accurate and consistent with the user’s activities. To identify factual inconsistencies, we propose using the Question Generation - Question Answering (QG-QA) method (Xu et al. 2024), which operates on key Knowledge Graph (KG) entities (Singhal 2012) extracted from the summary. This method involves two components: Question Generation and Question Answering.
  - Question Generation (QG): Given a summary and a KG entity (e.g., Hiking /m/012v4j) extracted from the summary, the verifier generates a question-answer pair based on the summary context using a user-specified prompt (refer to Appendix A.5). The answer is the KG entity.
  - Question Answering (QA): The consistency of these question-answer pairs is then verified against the user’s activities using the verifier, guided by a user-specified prompt (refer to Appendix A.6).

Let $S$ be the summary candidate based on the retrieved activities $\Phi$, $e_{k}$ the $k$-th KG entity in $S$, $(q_{k},a_{k})$ its corresponding question-answer pair, and $h_{k}$ the binary result (consistent, inconsistent) of $(q_{k},a_{k})$ compared to $\Phi$. The QG-QA process can be represented by Equations 5 and 6.

$$\text{QG:}\quad\left\{\left(q_{k},a_{k}\right)\right\}=LLM_{QG}\left(S,\left\{e_{k}\right\}\right) \tag{5}$$

$$\text{QA:}\quad\left\{h_{k}\right\}=LLM_{QA}\left(\Phi,\left\{\left(q_{k},a_{k}\right)\right\}\right) \tag{6}$$

For any inconsistent question-answer pairs identified by $h_{k}$, the summary is regenerated by incorporating feedback from these hallucinations into the original prompt (refer to Appendix A.3), as described in Figure 4 and detailed in Algorithm 1.
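The control flow of the QG-QA refinement can be sketched as below. This is not the implementation used in our experiments: the LLM calls are replaced by caller-supplied functions, the toy stand-ins use simple string matching, and the `max_rounds` cap is an assumption added so the loop always terminates.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, answer); the answer is the KG entity


def refine_summary(
    summary: str,
    activities: List[str],
    extract_entities: Callable[[str], List[str]],
    generate_qa: Callable[[str, str], QAPair],
    is_consistent: Callable[[List[str], QAPair], bool],
    regenerate: Callable[[str, List[QAPair]], str],
    max_rounds: int = 3,  # assumed cap, not taken from the paper
) -> str:
    """Repeat the QG-QA check until no inconsistent pair remains."""
    for _ in range(max_rounds):
        pairs = [generate_qa(summary, e) for e in extract_entities(summary)]  # Eq. 5
        bad = [p for p in pairs if not is_consistent(activities, p)]          # Eq. 6
        if not bad:
            break
        summary = regenerate(summary, bad)  # feed the hallucinations back
    return summary


# Toy stand-ins for the verifier and summarizer (string matching, no LLM).
VOCAB = ["hiking", "cooking", "office supplies"]
toy_entities = lambda s: [v for v in VOCAB if v in s]
toy_qa = lambda s, e: (f"Is the user interested in {e}?", e)
toy_check = lambda acts, pair: any(pair[1] in a for a in acts)


def toy_regen(s: str, bad: List[QAPair]) -> str:
    for _, answer in bad:
        s = s.replace(answer, "").replace("  ", " ")
    return s.strip()


acts = ["bought hiking boots", "watched a cooking show"]
draft = "Enjoys hiking cooking office supplies"
refined = refine_summary(draft, acts, toy_entities, toy_qa, toy_check, toy_regen)
print(refined)
```

In practice, each callable would wrap a prompt to the summarizer or verifier model, and the regeneration step would rewrite the summary rather than delete text.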

4 Evaluation

In this section, we validate the benchmark metrics and evaluate the Hierarchy-Critique summarization approach within the UserSumBench framework.

4.1 Validating Benchmark Metrics

To validate the UserSumBench benchmark metrics, we study their alignment with human ratings on three public user activity datasets: MovieLens 1M (Harper and Konstan 2015), Yelp (Yelp 2023), and Amazon Review (Ni, Li, and McAuley 2019).

Datasets and Evaluation Tasks

This section outlines the dataset preparation and the prediction tasks used for evaluation.

User timelines from the three datasets were filtered based on activity count $L_{\Phi}$ to ensure a balance between data sufficiency and computational efficiency:

• $L_{\Phi}\geq N_{low}$: Timelines with fewer than 50 activities were excluded to ensure sufficient context for generating robust summaries.
• $L_{\Phi}\leq N_{up}$: Timelines with more than 200 activities were truncated to the most recent 200 to focus on recent behavior patterns and manage computational load.

These thresholds ($N_{low}=50$ and $N_{up}=200$) were chosen to balance informative summaries with processing efficiency. Table 1 shows the number of examples used for evaluation after filtering.
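The filtering rule can be expressed in a few lines. The sketch assumes each timeline is a list ordered from oldest to newest activity; the function name is hypothetical.

```python
from typing import List, Optional


def filter_timeline(
    activities: List[str], n_low: int = 50, n_up: int = 200
) -> Optional[List[str]]:
    """Drop timelines shorter than n_low; keep only the n_up most recent activities."""
    if len(activities) < n_low:
        return None  # excluded: too little context
    return activities[-n_up:]  # assumes chronological order, oldest first


short = [f"activity {i}" for i in range(49)]
long = [f"activity {i}" for i in range(300)]
print(filter_timeline(short))      # excluded timeline
print(len(filter_timeline(long)))  # truncated to the most recent 200
```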

Dataset	Number of Examples
MovieLens	4297
Yelp	6798
Amazon Review	5046

Table 1: Number of examples used for evaluating summarization metrics across different datasets.

Four prediction tasks were defined to evaluate the quality of the summaries generated from these datasets. Each task was structured as a multiple-choice question with one correct answer and four randomly selected incorrect options. The order of all choices was randomized during execution to ensure robustness.

• $t_{1}$: Predict the user’s next activity (e.g., next watched movie name, next purchased product name) based solely on the summary, assessing the summary’s ability to encapsulate immediate user behavior.
• $t_{2}$: Predict the user’s next activity considering both the summary and the $N_{r}$ most recent activities, evaluating the summary’s effectiveness in conjunction with recent user data.
• $t_{3}$: Predict the category of the user’s next activity (e.g., next watched movie genre, next purchased product category) based on the summary alone, testing the summary’s ability to generalize user preferences.
• $t_{4}$: Predict the category of the user’s next activity using both the summary and the $N_{r}$ most recent activities, examining how well the summary integrates with recent behavior to provide accurate insights.

These tasks were designed to comprehensively assess different aspects of summary quality. By default, $N_{r}=20$ and the threshold $m=3$ (see Equation 1).
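The construction of one multiple-choice question (one correct answer, four random incorrect options, shuffled order) can be sketched as follows. The candidate pool, the function name, and the use of a seeded generator are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple


def build_question(
    correct: str, candidate_pool: Sequence[str], rng: random.Random, num_distractors: int = 4
) -> Tuple[List[str], int]:
    """Return shuffled options and the index of the correct answer."""
    # De-duplicate the pool and exclude the correct answer before sampling.
    distractor_pool = sorted(set(candidate_pool) - {correct})
    options = rng.sample(distractor_pool, num_distractors) + [correct]
    rng.shuffle(options)  # randomize option order
    return options, options.index(correct)


# Hypothetical movie pool; "Heat" stands in for the user's actual next activity.
pool = ["Heat", "Alien", "Big", "Casino", "Fargo", "Speed", "Seven"]
options, answer_index = build_question("Heat", pool, random.Random(0))
print(options, answer_index)
```

With five options, a random guess is correct 20% of the time, which gives a reference point for interpreting per-task accuracy.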

Quality Metric vs. Human Ratings

To validate the effectiveness of the proposed Quality Metric (refer to Section 3.1), we conducted a human rating exercise.

We selected 200 examples from each of the three summary-extended datasets (MovieLens, Yelp, and Amazon Reviews), resulting in a total of 600 examples. These examples were based on user activity histories from English-speaking residents in the United States, and the summaries were generated using the single-step summarization approach (see Figure 3(1)) with the Gemini Advanced model (Team et al. 2023).

Six human raters (three male and three female), all English-speaking residents of the United States familiar with common user activities, participated in the evaluation. Each dataset was evaluated by two raters (one male and one female) to enhance robustness and minimize bias, with each rater assessing 100 examples.

Given the subjective nature of summary evaluation, particularly for criteria like “Fact Hallucinations Verification,” a unified rating guideline was implemented to standardize the process across all raters. This guideline was designed to ensure consistency and reliability in the ratings, classifying summaries as either “Good” or “Bad” based on the following criteria:

• Query Hallucinations Verification: Assesses whether the summary accurately reflects the specified summary query. For instance, if the query is to “summarize a user’s movie-watching preferences,” but the summary only lists movie-watching activities without indicating any preferences, the summary would fail this check.
• Fact Hallucinations Verification: Evaluates whether the details in the summary align with the context of the input activities. For example, if a user’s recent activity history shows a focus on purchasing toys, but the summary inaccurately states that the user mainly purchased office supplies, it would be marked as inaccurate.
• Top Category Recall: Checks whether at least one of the top three categories, such as movie genres, is mentioned in the summary. Very popular categories like “Restaurant” in the Yelp dataset are excluded from the top three to ensure meaningful assessment.

This structured approach, with clearly defined evaluation criteria and qualified raters, ensures that the human annotation process is consistent and capable of producing reliable comparisons between prediction results and human ratings.

After obtaining prediction results and human ratings for the same user summaries, we categorized each summary into one of four possible outcomes based on the alignment between the prediction and the human rating. We then calculated the Metric-Annotator Agreement (MAA) using Equation 7, which represents the percentage of cases where the evaluation metrics and human annotators agreed on the quality of the summary. Higher scores indicate better alignment.

$$MAA=\frac{TP+TN}{TP+FP+FN+TN} \tag{7}$$

In this equation, the four possible outcomes are as follows:

• $TP$ (True Positive): Both the prediction and the human rating classify the summary as “Good.”
• $FP$ (False Positive): The prediction classifies the summary as “Good,” but the human rating classifies it as “Bad.”
• $FN$ (False Negative): The prediction classifies the summary as “Bad,” but the human rating classifies it as “Good.”
• $TN$ (True Negative): Both the prediction and the human rating classify the summary as “Bad.”
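Equation 7 amounts to the share of summaries on which the metric and the annotator give the same label. A minimal sketch with hypothetical labels (1 for “Good”, 0 for “Bad”):

```python
from typing import Sequence


def metric_annotator_agreement(predicted: Sequence[int], human: Sequence[int]) -> float:
    """MAA (Equation 7): (TP + TN) / (TP + FP + FN + TN)."""
    if len(predicted) != len(human) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = sum(p == 1 and h == 1 for p, h in zip(predicted, human))
    tn = sum(p == 0 and h == 0 for p, h in zip(predicted, human))
    fp = sum(p == 1 and h == 0 for p, h in zip(predicted, human))
    fn = sum(p == 0 and h == 1 for p, h in zip(predicted, human))
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical labels for five summaries.
predicted = [1, 1, 0, 0, 1]
human = [1, 0, 0, 1, 1]
print(metric_annotator_agreement(predicted, human))  # 3 of 5 labels agree
```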

As shown in Table 2, the proposed Quality Metric demonstrated strong alignment with human ratings, with agreement above 70% on all three datasets.

Dataset	MAA
MovieLens	71.0%
Yelp	74.0%
Amazon Review	73.0%

Table 2: Metric-Annotator Agreement (MAA) between Quality Metric and human ratings across different datasets.

To further evaluate the performance of the proposed benchmark metrics, we applied them to three popular models: Gemini 1.5 Pro (Reid et al. 2024), GPT-4o (Achiam et al. 2023), and Claude 3 Haiku (Anthropic 2023). Detailed comparisons can be found in Appendix B.

Metric	MovieLens	Yelp	Amazon
ROUGE-2	0.163	0.037	0.289
ROUGE-L	0.151	0.147	0.338
BLEU	0.049	0.043	-0.077
BERTScore-precision	0.077	-0.021	0.254
BERTScore-recall	-0.006	0.090	0.146
BERTScore-F1	-0.119	0.071	-0.127
BLEURT	-0.082	0.088	-0.023
AutoEval	0.204	0.120	0.490
Quality $Q_{s}$	0.363	0.458	0.366

Table 3: Comparison of different evaluation metrics and summary quality measurement $Q_{s}$ across the MovieLens, Yelp, and Amazon Review datasets.

Metric	Dataset	Single-Step	Hierarchy-Critique	Increment Percentage
Quality Metric	MovieLens	0.557 $\pm$ 0.002	0.586 $\pm$ 0.003	5.21%
Quality Metric	Yelp	0.459 $\pm$ 0.002	0.512 $\pm$ 0.003	11.55%
Quality Metric	Amazon Review	0.616 $\pm$ 0.002	0.631 $\pm$ 0.002	2.44%
Instruction Following Metric	MovieLens	0.836	0.988	18.18%
Instruction Following Metric	Yelp	0.791	0.992	25.41%
Instruction Following Metric	Amazon Review	0.842	0.999	18.65%
Information Density Metric (x0.1%)	MovieLens	3.788 $\pm$ 0.004	5.184 $\pm$ 0.005	36.85%
Information Density Metric (x0.1%)	Yelp	3.561 $\pm$ 0.004	4.721 $\pm$ 0.005	32.58%
Information Density Metric (x0.1%)	Amazon Review	2.145 $\pm$ 0.002	3.202 $\pm$ 0.003	49.28%

Table 4: Comparison of quality metric, instruction following metric, and information density metric between Single-Step and Hierarchy-Critique approaches across different datasets.

In addition to UserSumBench benchmark metrics, we also considered other reference-free evaluation metrics that do not require ground-truth summaries for user summary evaluation. These metrics compare the generated summary against user activities without relying on a reference summary.

• ROUGE-2 (Lin 2004): Measures the overlap of bigrams between the generated summary and user activities.
• ROUGE-L (Lin 2004): Measures the Longest Common Subsequence (LCS) between the generated summary and user activities.
• BLEU (Papineni et al. 2002): Evaluates the precision of n-grams in the generated summary compared to user activities.
• BERTScore (Zhang et al. 2019): Uses BERT embeddings to compute precision, recall, and F1 score based on the similarity of words in the generated summary and user activities.
  - BERTScore-precision: Measures the precision of embedding overlap.
  - BERTScore-recall: Measures the recall of embedding overlap.
  - BERTScore-F1: Combines precision and recall to provide an F1 score.
• BLEURT (Sellam, Das, and Parikh 2020): Uses a pre-trained model to evaluate the quality of the generated text based on human judgments compared to user activities.
• AutoEval (Chiang and Lee 2023): Uses an LLM to automatically evaluate summaries without requiring reference summaries, focusing on coherence and relevance.

To further validate our approach, we calculated the Pearson correlation coefficient between human ratings and the reference-free metrics, including the proposed summary quality measurement $Q_{s}$ (see Equation 1). In this comparison, BLEURT used a pre-trained BLEURT-base-128 model (Sellam, Das, and Parikh 2020), while AutoEval was based on the Gemini 1.5 Pro model. $Q_{s}$ also employed the Gemini 1.5 Pro model for future activity predictions. As shown in Table 3, $Q_{s}$ achieved the highest correlation with human ratings on MovieLens and Yelp and was the most consistent across the three datasets (0.363 to 0.458); only AutoEval scored higher on Amazon Review (0.490 versus 0.366). We attribute this to the metric being specifically tailored to the characteristics of user timeline activities. Although the correlation is not perfect, $Q_{s}$ remains useful for tasks such as weakly supervised learning (Zhou 2018), where ground-truth summaries are limited.
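For reference, the Pearson correlation between binary human ratings and a per-summary metric score can be computed as in the sketch below. The example labels are hypothetical and do not reproduce any value in Table 3.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical human ratings and Q_s values (1 = "Good", 0 = "Bad").
human_ratings = [1, 1, 0, 0, 1, 0]
q_s = [1, 0, 0, 0, 1, 1]
print(round(pearson(human_ratings, q_s), 3))
```

When both variables are binary, as with $Q_{s}$ and the human labels, this coefficient coincides with the phi coefficient of the corresponding 2x2 contingency table.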

4.2 Evaluating Summarization Approaches

In this section, we evaluate our proposed Hierarchy-Critique summarization approach using the UserSumBench benchmark metrics. For consistency, a single model, Gemini 1.5 Flash (Reid et al. 2024), was used for summarization, self-critique verification, and the prediction tasks. To validate the robustness of the benchmarks, we conducted three prediction runs on all three datasets and report the mean $\pm$ standard deviation for the Quality Metric and the Information Density Metric. The results, presented in Table 4, demonstrate that the Hierarchy-Critique approach consistently outperforms the Single-Step approach across all three evaluation metrics. Additionally, the stability of the benchmark metrics across the repeated prediction runs confirms their robustness.

• Quality Metric: The Hierarchy-Critique approach achieved superior scores on the MovieLens, Yelp, and Amazon Review datasets, with increases of 5.21%, 11.55%, and 2.44% respectively. These gains indicate that the iterative refinement process effectively reduces hallucinations and enhances the overall consistency and quality of the summaries.
• Instruction Following Metric: Significant improvements were observed with the Hierarchy-Critique approach, particularly on the Yelp dataset, which showed a 25.41% increase, followed by MovieLens and Amazon Review with gains of 18.18% and 18.65% respectively. This suggests that the Hierarchy-Critique method is more adept at adhering to prompt constraints, likely due to its effective segmentation and summarization of user activities.
• Information Density Metric: The Hierarchy-Critique approach demonstrated substantial gains in this metric, with increases of 36.85%, 32.58%, and 49.28% for the MovieLens, Yelp, and Amazon Review datasets respectively. This shows that the approach not only produces accurate summaries but also ensures that these summaries are concise and rich in information, effectively balancing brevity with informativeness.

These findings underscore the superiority of the Hierarchy-Critique approach over the Single-Step method, particularly in generating higher-quality, instruction-compliant, and information-dense summaries. The effectiveness of time segmentation, combined with iterative refinement and verification processes, plays a pivotal role in these improvements, establishing the Hierarchy-Critique approach as a more robust and reliable option for summarizing user activity timelines.
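The percentage gains quoted above are relative increases over the Single-Step score, and the reported uncertainty is a mean and standard deviation over repeated runs. The sketch below reproduces the Quality Metric gains from the mean scores in Table 4; the per-run values passed to `mean_sd` are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics
from typing import Sequence, Tuple


def relative_increase(baseline: float, improved: float) -> float:
    """Percentage gain of improved over baseline."""
    return (improved - baseline) / baseline * 100.0


def mean_sd(runs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated prediction runs."""
    return statistics.mean(runs), statistics.stdev(runs)


# Quality Metric means from Table 4 (Single-Step, Hierarchy-Critique).
print(round(relative_increase(0.557, 0.586), 2))  # MovieLens: 5.21
print(round(relative_increase(0.459, 0.512), 2))  # Yelp: 11.55
print(round(relative_increase(0.616, 0.631), 2))  # Amazon Review: 2.44

# Hypothetical scores from three runs, for illustration only.
print(mean_sd([0.584, 0.586, 0.588]))
```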

5 Conclusion and Future Work

In this paper, we introduced UserSumBench, a comprehensive benchmark framework specifically designed to evaluate user summarization approaches through the assessment of user summaries generated from activity timelines. Our key contributions include (1) a reference-free quality metric for assessing user summarization approaches through summaries based on future user activity predictions, which has demonstrated strong effectiveness and close alignment with human preferences across three diverse datasets (MovieLens, Yelp, and Amazon Review), and (2) a robust summarization baseline method that combines a time-hierarchical summarizer with a self-critique verifier, yielding high-quality summaries while effectively minimizing hallucinations.

The strong alignment between our proposed quality metric and human ratings establishes UserSumBench as a reliable and efficient tool for automated evaluation of user summarization approaches. By offering a cost-effective solution, UserSumBench addresses the pressing need for standardized evaluation methods in the field of user summarization.

Looking ahead, we plan to expand UserSumBench by integrating real-time summarization techniques and exploring its applicability across additional domains beyond the current datasets. These enhancements will further broaden the utility and impact of the framework. Additionally, we aim to encourage the broader research community to adopt UserSumBench, fostering its potential to standardize user summary evaluation practices and drive innovation in the development of more accurate and robust summarization techniques. Ultimately, we believe UserSumBench will play a significant role in advancing personalization, recommendation systems, and user understanding in the digital landscape.

References

Achiam et al. (2023) Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Akkasi, Fraser, and Komeili (2023) Akkasi, A.; Fraser, K. C.; and Komeili, M. 2023. Reference-free summarization evaluation with large language models. In Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems, 193–201.
Anthropic (2023) Anthropic. 2023. The Claude 3 Model Family: Opus, Sonnet, Haiku.
Chen et al. (2024) Chen, X.; Wang, T.; Zhu, Q.; Guo, T.; Gao, S.; Lu, Z.; Gao, X.; and Zhang, X. 2024. Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark. arXiv preprint arXiv:2402.14359.
Chiang and Lee (2023) Chiang, C.-H.; and Lee, H.-y. 2023. A closer look into automatic evaluation using large language models. arXiv preprint arXiv:2310.05657.
Fabbri et al. (2021) Fabbri, A. R.; Kryściński, W.; McCann, B.; Xiong, C.; Socher, R.; and Radev, D. 2021. Summeval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9: 391–409.
Giarelis, Mastrokostas, and Karacapilidis (2023) Giarelis, N.; Mastrokostas, C.; and Karacapilidis, N. 2023. Abstractive vs. Extractive Summarization: An Experimental Review. Applied Sciences, 13(13): 7620.
Harper and Konstan (2015) Harper, F. M.; and Konstan, J. A. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4): 1–19.
Lin (2004) Lin, C.-Y. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, 74–81.
Liu et al. (2023) Liu, Y.; Fabbri, A. R.; Chen, J.; Zhao, Y.; Han, S.; Joty, S.; Liu, P.; Radev, D.; Wu, C.-S.; and Cohan, A. 2023. Benchmarking generation and evaluation capabilities of large language models for instruction controllable summarization. arXiv preprint arXiv:2311.09184.
Lloret, Plaza, and Aker (2018) Lloret, E.; Plaza, L.; and Aker, A. 2018. The challenging task of summary evaluation: an overview. Language Resources and Evaluation, 52: 101–148.
Ni, Li, and McAuley (2019) Ni, J.; Li, J.; and McAuley, J. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), 188–197.
Papineni et al. (2002) Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 311–318.
Reid et al. (2024) Reid, M.; Savinov, N.; Teplyashin, D.; Lepikhin, D.; Lillicrap, T.; Alayrac, J.-b.; Soricut, R.; Lazaridou, A.; Firat, O.; Schrittwieser, J.; et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
Sellam, Das, and Parikh (2020) Sellam, T.; Das, D.; and Parikh, A. P. 2020. BLEURT: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696.
Singhal (2012) Singhal, A. 2012. Introducing the Knowledge Graph, things, not strings. https://blog.google/products/search/introducing-knowledge-graph-things-not. Blog post.
Skopek et al. (2023) Skopek, O.; Aralikatte, R.; Gooding, S.; and Carbune, V. 2023. Towards better evaluation of instruction-following: A case-study in summarization. arXiv preprint arXiv:2310.08394.
Team et al. (2023) Team, G.; Anil, R.; Borgeaud, S.; Wu, Y.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A. M.; Hauth, A.; et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
Wang et al. (2023) Wang, R.; Wang, H.; Mi, F.; Chen, Y.; Xu, R.; and Wong, K.-F. 2023. Self-critique prompting with large language models for inductive instructions. arXiv preprint arXiv:2305.13733.
Wang et al. (2019) Wang, S.; Hu, L.; Wang, Y.; Cao, L.; Sheng, Q. Z.; and Orgun, M. 2019. Sequential recommender systems: challenges, progress and prospects. arXiv preprint arXiv:2001.04830.
Xu et al. (2024) Xu, L.; Su, Z.; Yu, M.; Xu, J.; Choi, J. D.; Zhou, J.; and Liu, F. 2024. Identifying Factual Inconsistency in Summaries: Towards Effective Utilization of Large Language Model. arXiv preprint arXiv:2402.12821.
Yelp (2023) Yelp. 2023. Yelp dataset. https://www.yelp.com/dataset. Dataset document.
Zhang et al. (2019) Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
Zhou (2018) Zhou, Z.-H. 2018. A brief introduction to weakly supervised learning. National Science Review, 5(1): 44–53.

Appendix A Prompt Examples

A.1 LLM Prompt Template for Single-Step Summarization

—– Instructions —–

Summarize my “User Activities” and provide insights that address the query: “{query}”. Adhere to the instructions below.

1. The summary should have the format with **Summary** and **Insights**.

2. The summary should take into account the changes of my long-term interest over time.

3. My activities are cataloged in the following “User Activities” section, with each separated by a newline.

4. Limit the summary to no more than {max_words} words.

—– User Activities —–
{user_activities}

Here we define:

{query}: Query intention of the summarization. For example, “Summarize my long-term movie watching preference”.

{max_words}: Max number of words for the summary.

{user_activities}: List of user activities.
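The placeholders above can be filled with ordinary string formatting. The following is a minimal sketch, not the authors' implementation: the function name `build_single_step_prompt` and the sample activities are hypothetical, and the section delimiters and quotation marks are rendered as plain ASCII, which is an assumption about the original prompt text.

```python
# Template text follows Appendix A.1. The ASCII delimiters and straight
# quotes are assumptions; the helper name and sample data are hypothetical.
SINGLE_STEP_TEMPLATE = """----- Instructions -----
Summarize my "User Activities" and provide insights that address the query: "{query}". Adhere to the instructions below.
1. The summary should have the format with **Summary** and **Insights**.
2. The summary should take into account the changes of my long-term interest over time.
3. My activities are cataloged in the following "User Activities" section, with each separated by a newline.
4. Limit the summary to no more than {max_words} words.
----- User Activities -----
{user_activities}"""


def build_single_step_prompt(query, activities, max_words):
    """Fill the three A.1 placeholders; activities are joined one per line."""
    return SINGLE_STEP_TEMPLATE.format(
        query=query,
        max_words=max_words,
        user_activities="\n".join(activities),
    )


prompt = build_single_step_prompt(
    query="Summarize my long-term movie watching preference",
    activities=["rated 'Movie A' 5 stars", "rated 'Movie B' 2 stars"],  # made-up data
    max_words=200,
)
```

Because the activity strings are passed as format arguments rather than concatenated into the template first, any braces they contain are left untouched.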

A.2 LLM Prompt Template for Segment Summarization

—– Instructions —–

Summarize my “User Activities” related to the specified time range “{time_range}” and provide insights that address the query: “{query}”. Adhere to the instructions below.

1. The summary should have the format with **Summary** and **Insights**.

2. The summary should take into account the changes of my long-term interest over time.

3. My activities related to this time range are cataloged in the following “User Activities” section, with each separated by a newline.

4. Limit the summary to no more than {max_words} words.

—– User Activities —–
{user_activities}

Here we define:

{time_range}: Time range of the segment user activities.

{query}: Query intention of the summarization. For example, “Summarize my long-term movie watching preference”.

{max_words}: Max number of words for the summary.

{user_activities}: List of user activities of the segment.

A.3 LLM Prompt Template for Segment Summarization with Feedback

—– Instructions —–

Summarize my “User Activities” related to the specified time range “{time_range}”, revise the below “Previous Summary” to be consistent with every “Question” and its corresponding “ReferenceAnswer” in the below “Previous Question Answers”. Adhere to the instructions below.

1. For each “Question” in “Previous Question Answers”, the “Answer” is derived from the “Previous Summary”, while the “ReferenceAnswer” is based on my “User Activities”.

2. Modify the “Previous Summary” to incorporate the “ReferenceAnswer” rather than the “Answer” for each “Question” in the “Previous Question Answers”.

3. Ensure the new summary provides insights that address the query: “{query}”.

4. The new summary should have the format with **Summary** and **Insights**.

5. The new summary should take into account the changes of my long-term interest over time.

6. My activities related to this time range are cataloged in the following “User Activities” section, with each separated by a newline.

7. Limit the summary to no more than {max_words} words.

—– Previous Summary —–
{previous_summary}

—– Previous Question Answers —–
{previous_question_answer_pairs}

—– User Activities —–
{user_activities}

Here we define:

{time_range}: Time range of the segment user activities.

{query}: Query intention of the summarization. For example, “Summarize my long-term movie watching preference”.

{max_words}: Max number of words for the summary.

{previous_summary}: Previous generated summary.

{previous_question_answer_pairs}: Previous list of generated QA pairs.

{user_activities}: List of user activities of the segment.

A.4 LLM Prompt Template for Query Consistency

—– Instructions —–

Evaluate the relevance of a summary under the following “Summary” section to a query under the following “Query” section.

Return “consistent” if the summary aligns with the query, and “inconsistent” if the summary is unrelated to the query.

The response is only single word “consistent” or “inconsistent” without any explanation.

—– Summary —–
{summary}

—– Query —–
{query}

Here we define:

{summary}: Provided summary.

{query}: Query intention of the summarization.
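Since the prompt asks for a single-word reply, downstream code only needs to normalize that word. A small sketch under stated assumptions: the helper name `parse_consistency_label` is hypothetical, and the tolerance for stray whitespace, quotes, or a trailing period is an added safeguard rather than something the paper specifies.

```python
# Characters a model might wrap around the one-word label (an assumption).
TRIM_CHARS = " .\"'“”\n"


def parse_consistency_label(response: str) -> bool:
    """Map the one-word reply to True (consistent) or False (inconsistent)."""
    label = response.strip(TRIM_CHARS).lower()
    if label == "consistent":
        return True
    if label == "inconsistent":
        return False
    # Anything else violates the prompt's output contract.
    raise ValueError(f"unexpected label: {response!r}")
```

Raising on unexpected output, instead of defaulting to one label, keeps malformed replies from silently skewing an evaluation.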

A.5 LLM Prompt Template for Question Generation

—– Instructions —–

Given the below “KG Entities” and “Summary”, adhere to the instructions below to create “Question-Answer Pairs”:

1. Each pair must be related to a specific KG entity.

2. The answer must be the KG entity itself.

3. Formulate questions that are directly relevant to the KG entity within the context of the summary.

4. Avoid creating questions that are open-ended.

5. Use the following format for your response, as shown under the “Question-Answer Pairs” of the “Example” section below: [Question#1: “Question”, Answer#1: “Answer”].

Following the above steps and an example under the following “Example” section, create “Question-Answer Pairs” based on the task under the “Task” section.

—– Example —–
“KG Entities”:
hiking
pop music

“Summary”:

**Summary:**

The user demonstrates a robust long-term interest in outdoor and musical activities. Specifically, they are drawn to hiking and pop music.

**Insights:**

* Sports Recreation and Fitness: The user has a sustained interest in hiking, engaging regularly in this activity, which indicates a preference for exploring nature and challenging terrains.

* Entertainment Media and Arts: The user enjoys pop music, known for its wide appeal and catchy melodies, reflecting a consistent interest in this genre.

“Question-Answer Pairs”:

[Question#1: “What outdoor activity is the user mainly interested in according to their searches and discussions?”, Answer#1: “hiking”]

[Question#2: “What genre of music does the user prefer, known for its wide appeal and catchy melodies?”, Answer#2: “pop music”]

—– Task —–
“KG Entities”:
{kg_entities}

“Summary”:
{summary}

Here we define:

{kg_entities}: List of KG entities extracted from the summary.

{summary}: Provided summary.
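The bracketed response format requested in this prompt is regular enough to parse with a single pattern. The sketch below is illustrative only: `QA_PATTERN` and `parse_qa_pairs` are hypothetical names, and accepting both straight and curly quotation marks is an assumption about what a model may emit.

```python
import re

# Matches [Question#N: "…", Answer#N: "…"]; the backreference \1 requires
# the question and answer to carry the same index.
QA_PATTERN = re.compile(
    r'\[Question#(\d+):\s*["“”](.*?)["“”],\s*Answer#\1:\s*["“”](.*?)["“”]\]',
    re.DOTALL,
)


def parse_qa_pairs(text):
    """Return a list of (question, answer) tuples in order of appearance."""
    return [(m.group(2), m.group(3)) for m in QA_PATTERN.finditer(text)]


example = (
    "[Question#1: “What outdoor activity is the user mainly interested in "
    "according to their searches and discussions?”, Answer#1: “hiking”]\n\n"
    "[Question#2: “What genre of music does the user prefer, known for its "
    "wide appeal and catchy melodies?”, Answer#2: “pop music”]"
)
pairs = parse_qa_pairs(example)
```

On the example response from the prompt, `pairs` holds two tuples whose answers are the KG entities "hiking" and "pop music".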

A.6 LLM Prompt Template for Question Answering

—– Instructions —–

Given the below “Question-Answer Pairs” and my “User Activities”, judge if each question-answer pair is consistent with my activities. Adhere to the instructions below.

1. Each “Judgement” is composed of “Status” and “ReferenceAnswer” like the following format: [Status#1: “Status”, ReferenceAnswer#1: “ReferenceAnswer”].

2. The “Status” should be labeled as “consistent” or “inconsistent”.

3. A “consistent” status means the question-answer pair aligns with or does not contradict the information provided in my activities. The “ReferenceAnswer” should be “none”.

4. An “inconsistent” status means the question-answer pair conflicts directly with or is contradicted by the information provided in my activities. The “ReferenceAnswer” should be a new answer of the question based on my activities.

5. Use the following format for your response, as shown under the “Judgements” of the following “Example” section below like [Status#2: “Status”, ReferenceAnswer#2: “ReferenceAnswer”].

6. Match each judgement to its corresponding question-answer pair by their sequence, such as [Status#2: “Status”, ReferenceAnswer#2: “ReferenceAnswer”] pertains to [Question#2: “Question”, Answer#2: “Answer”].

Following the above steps and an example under the following “Example” section, create judgements based on the task under the “Task” section.

—– Example —–
“Question-Answer Pairs”:

[Question#1: “What outdoor activity is the user mainly interested in according to their searches and discussions?”, Answer#1: “hiking”]

[Question#2: “What genre of music does the user prefer, known for its wide appeal and catchy melodies?”, Answer#2: “rock music”]

“User Activities”:
searched “Pop music trends in the 2020s” around Sat 05/15/2004 4PM
searched “Best coffee brewing methods for home” around Fri 06/11/2004 6PM
searched “How to prepare for a multi-day hiking trip” around Wed 07/21/2004 7PM
searched “The evolution of electronic elements in pop music” around Sun 04/28/2004 5PM
searched “Hiking trails with the best views in the U.S.” around Mon 04/29/2004 1PM

“Judgements”:
[Status#1: “consistent”, ReferenceAnswer#1: “none”]
[Status#2: “inconsistent”, ReferenceAnswer#2: “pop music”]

—– Task —–
“Question-Answer Pairs”:
{question_answer_pairs}

“User Activities”:
{user_activities}

Here we define:

{question_answer_pairs}: List of generated QA pairs related to the KG entities.

{user_activities}: User activities which are used to generate the summary.
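The judgement format can be parsed the same way, and the statuses can then be tallied. This is a hedged sketch: the names `parse_judgements` and `consistent_fraction` are hypothetical, and the fraction computed here is only one simple way to aggregate the statuses, which may differ from the metric defined in the main text.

```python
import re

# Matches [Status#N: "consistent|inconsistent", ReferenceAnswer#N: "…"].
JUDGEMENT_PATTERN = re.compile(
    r'\[Status#(\d+):\s*["“”](consistent|inconsistent)["“”],\s*'
    r'ReferenceAnswer#\1:\s*["“”](.*?)["“”]\]',
    re.DOTALL,
)


def parse_judgements(text):
    """Return {pair_index: (status, reference_answer or None)}."""
    judgements = {}
    for m in JUDGEMENT_PATTERN.finditer(text):
        reference = m.group(3)
        # The prompt uses the literal "none" when no correction is needed.
        judgements[int(m.group(1))] = (
            m.group(2),
            None if reference.lower() == "none" else reference,
        )
    return judgements


def consistent_fraction(judgements):
    """Share of QA pairs judged consistent (illustrative aggregation only)."""
    if not judgements:
        return 0.0
    hits = sum(status == "consistent" for status, _ in judgements.values())
    return hits / len(judgements)


example = (
    "[Status#1: “consistent”, ReferenceAnswer#1: “none”]\n"
    "[Status#2: “inconsistent”, ReferenceAnswer#2: “pop music”]"
)
parsed = parse_judgements(example)
```

Keying the result by index preserves the sequence-based matching to question-answer pairs that instruction 6 requires, and the non-`None` reference answers are the corrections that the feedback prompt in A.3 expects.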

A.7 LLM Prompt Template for Combining Time Segments

—– Instructions —–

Combine all the time segment summaries under the “Time Segment Summaries” section. Adhere to the instructions below.

1. The combined summary should offer insights relevant to the query: “{query}”.

2. Format the combined summary with sections labeled **Summary** and **Insights**.

3. Focus the combined summary on my recent interest preferences (recent time segments).

4. Limit the combined summary to no more than {max_words} words.

—– Time Segment Summaries —–
Summary of Time Segment “{time_range}”: {segment_summary}

…

Here we define:

{query}: Query intention of the summarization.

{max_words}: Max number of words for the summary.

{time_range}: Time range of a segment.

{segment_summary}: Summary of a time segment.
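Unlike the earlier templates, this one repeats a line per time segment, so the prompt has to be assembled from a list. A minimal sketch, assuming segments arrive as chronologically ordered `(time_range, segment_summary)` pairs: the helper name, the blank-line separator between segments, the ASCII delimiters, and the sample data are all hypothetical.

```python
# Header text follows Appendix A.7; delimiters and quotes are rendered in
# plain ASCII, which is an assumption about the original prompt.
COMBINE_HEADER = """----- Instructions -----
Combine all the time segment summaries under the "Time Segment Summaries" section. Adhere to the instructions below.
1. The combined summary should offer insights relevant to the query: "{query}".
2. Format the combined summary with sections labeled **Summary** and **Insights**.
3. Focus the combined summary on my recent interest preferences (recent time segments).
4. Limit the combined summary to no more than {max_words} words.
----- Time Segment Summaries -----
"""


def build_combine_prompt(query, max_words, segments):
    """segments: list of (time_range, segment_summary), oldest first."""
    header = COMBINE_HEADER.format(query=query, max_words=max_words)
    body = "\n\n".join(
        f'Summary of Time Segment "{time_range}": {summary}'
        for time_range, summary in segments
    )
    return header + body


combined = build_combine_prompt(
    query="Summarize my long-term movie watching preference",
    max_words=200,
    segments=[("2019", "Mostly comedies."), ("2020", "Shift to documentaries.")],  # made-up data
)
```

Only the header goes through `str.format`; the segment summaries are appended afterwards, so model-generated text containing braces cannot break the formatting step.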

Appendix B Analysis of Benchmark Metrics on Popular Models

We assessed three popular models for summarizing user activities (refer to Appendix A.1 for the prompt details) using the UserSumBench benchmark metrics. The models evaluated were Gemini 1.5 Pro (Reid et al. 2024), GPT-4o (Achiam et al. 2023), and Claude 3 Haiku (Anthropic 2023). The summaries generated by these models were subsequently used for predicting future activities with the Gemini 1.5 Flash model (Reid et al. 2024). The results are presented in Figure 5.

Quality Metric

Figure 5 (1) shows the quality metric for the three models across the MovieLens, Yelp, and Amazon Review datasets. Gemini 1.5 Pro and GPT-4o exhibit similar performance in the three datasets. In the MovieLens and Yelp datasets, GPT-4o achieves the highest quality metric, surpassing both Gemini 1.5 Pro and Claude 3 Haiku. Claude 3 Haiku consistently shows slightly lower performance across all datasets but remains competitive, particularly in the Amazon Review dataset where it closely follows the other two models.

Instruction Following Metric

Figure 5 (2) compares the models based on the instruction-following metric. Gemini 1.5 Pro performs very well in all datasets, particularly in the Yelp and Amazon Review datasets, where it significantly outperforms the other models. Claude 3 Haiku also performs well, especially in the MovieLens dataset. GPT-4o, however, shows significantly lower performance in following instructions, particularly in the Yelp and Amazon Review datasets.

GPT-4o’s weaker performance on the instruction-following metric, particularly on summary word limits, may stem from its design and optimization focus. Whereas some models may be explicitly tuned for concise responses or instruction adherence, GPT-4o might prioritize detailed, comprehensive content at the expense of brevity, which would explain its weaker adherence to strict word limits.

While GPT-4o’s tendency to generate more detailed responses may hinder its ability to strictly follow word limits, it may simultaneously contribute to its higher quality in predicting future activities. The additional detail and context provided in longer summaries could lead to richer representations of user behaviors and preferences, enhancing predictive accuracy on tasks such as those in the MovieLens and Yelp datasets as shown in Figure 5 (1).

In summary, GPT-4o’s architecture or training objectives may emphasize content richness over strict instruction adherence, which, while beneficial for some tasks, leads to challenges in settings where brevity and instruction following are critical.

Information Density Metric

Figure 5 (3) presents the information density metric. Gemini 1.5 Pro demonstrates the highest performance across all datasets. Claude 3 Haiku also performs well but is consistently outperformed by Gemini 1.5 Pro. GPT-4o shows the lowest performance, particularly in the Yelp and Amazon Review datasets, where the difference is most pronounced.

Overall Analysis

Overall, Gemini 1.5 Pro exhibits the best performance across all benchmark metrics and datasets, particularly excelling in instruction following and information density metrics. GPT-4o, while competitive in the quality metric, falls behind significantly in instruction following and information density metrics. Claude 3 Haiku shows consistent performance across the board but does not surpass Gemini 1.5 Pro in most of the metrics.

These results indicate that Gemini 1.5 Pro is the most effective model among the three for generating high-quality, instruction-following, and information-dense summaries. This superior performance makes it a preferable choice for applications requiring detailed and accurate summarization capabilities.