UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches

Chao Wang1, Neo Wu1, Lin Ning1, Luyang Liu1, Jun Xie1, Shawn O’Banion1, Bradley Green1
Abstract

Large language models (LLMs) have shown remarkable capabilities in generating user summaries from long lists of raw user activity data. These summaries capture essential user information such as preferences and interests, and are therefore invaluable for LLM-based personalization applications such as explainable recommender systems. However, the development of new summarization techniques is hindered by the lack of ground-truth labels, the inherent subjectivity of user summaries, and the cost and time required for human evaluation. To address these challenges, we introduce UserSumBench, a benchmark framework designed to facilitate iterative development of LLM-based summarization approaches. The framework offers two key components: (1) A reference-free summary quality metric. We show that this metric is effective and aligned with human preferences across three diverse datasets (MovieLens, Yelp, and Amazon Review). (2) A novel, robust summarization method that leverages a time-hierarchical summarizer and a self-critique verifier to produce high-quality summaries while eliminating hallucination. This method serves as a strong baseline for further innovation in summarization techniques.

1 Introduction

User activity timelines, including data such as place visit histories, product reviews, movie ratings, and other digital interactions, offer valuable insights into individual preferences, behaviors, and evolving interests. These timelines are crucial for applications like personalized recommendations and user behavior analysis (Wang et al. 2019). Summarizing these timelines into concise, actionable insights is essential for enhancing recommendation systems and understanding user engagement trends. For example, as shown in Figure 2, the next product prediction accuracy of an LLM-based model on the Amazon Review dataset (Ni, Li, and McAuley 2019) significantly improves when using summaries instead of raw activity timelines.

However, generating high-quality user summaries is challenging due to the complexity and diversity of user timelines. The subjective nature of summary evaluation and the lack of standardized ground-truth datasets further complicate the process. Current methods often rely on simplistic heuristics or models that struggle with these issues (Giarelis, Mastrokostas, and Karacapilidis 2023). Moreover, the absence of standardized benchmarks or reliable metrics hampers the evaluation of summarization effectiveness (Fabbri et al. 2021; Lloret, Plaza, and Aker 2018).

To tackle these challenges, we introduce UserSumBench, a comprehensive benchmark framework specifically designed to evaluate user summarization approaches by assessing the quality of user summaries generated from activity timelines. UserSumBench consists of two key components: a robust, reference-free summary quality metric and a strong baseline summarization approach.

In UserSumBench, the proposed quality metric evaluates the effectiveness of user summarization approaches by measuring how accurately the generated summaries predict future user activities. This metric offers a quantitative assessment of how well the summaries capture key aspects of user behavior (see Figure 1) and has demonstrated strong alignment with human ratings.

The proposed strong baseline summarization approach employs a time-hierarchical and self-critique method. This approach uses an LLM for initial summarization, followed by iterative refinement to reduce hallucinations and improve summary quality. This baseline not only validates the benchmark metrics but also serves as a foundation for future innovations in summarization techniques.

Key Contributions:

  • Introduction of a quality metric for evaluating user summarization approaches through user summaries, demonstrating strong alignment with human ratings, thereby validating its effectiveness and simplifying the evaluation process.

  • Introduction of a strong baseline summarization approach, a time-hierarchical and self-critique method, setting the foundation for future advancements in summarization techniques.

Figure 1: Evaluating summary quality through future activity prediction tasks, where LLMs predict the most likely user queries based on generated summaries of past activities.
Figure 2: Comparison of next product prediction accuracy across different contexts in the Amazon Review dataset, contrasting performance using raw timelines versus summarized data.

2 Related Works

The lack of standardized benchmarks has long been a challenge in evaluating summarization approaches (Fabbri et al. 2021). While datasets like MovieLens (Harper and Konstan 2015) and Amazon Reviews (Ni, Li, and McAuley 2019) offer comprehensive logs of user activities, they lack corresponding ground-truth summaries, complicating the assessment of summarization techniques. Efforts have been made to create datasets that pair user activities with manually crafted summaries. For example, Liu et al. (Liu et al. 2023) emphasized the importance of performance prediction in summarization evaluation, advocating for benchmarks that measure the predictive power of generated summaries. Similarly, Akkasi et al. (Akkasi, Fraser, and Komeili 2023) explored reference-free evaluation methods, proposing metrics designed to assess summaries’ ability to convey essential content relevant to future activities. These studies highlight the limitations of current evaluation practices and the need for more predictive and reliable benchmarks.

Recent research has focused on establishing standardized evaluation methodologies for summarization (Fabbri et al. 2021; Liu et al. 2023; Chen et al. 2024). Fabbri et al. (Fabbri et al. 2021) addressed the shortcomings in existing summarization evaluation methods by reassessing 14 automatic evaluation metrics using outputs from recent neural summarization models. They also released a toolkit of evaluation metrics to promote consistency in reporting results. Chen et al. (Chen et al. 2024) proposed a facet-aware evaluation paradigm for scientific abstracts, introducing benchmarks that enable more nuanced comparisons of evaluation metrics in this context.

3 UserSumBench Framework

In this section, we present the two key components of the UserSumBench framework: Benchmark Metrics and Hierarchy-Critique Summary Generation. The Benchmark Metrics assess summarization approaches using various criteria, such as future user activity predictions, to ensure a comprehensive evaluation of the generated summaries. The Hierarchy-Critique Summary Generation method provides a strong baseline for evaluating and improving these techniques.

3.1 Benchmark Metrics

UserSumBench includes three evaluation metrics to assess different aspects of user summarization approaches.

Quality Metric

Quality Metric is designed to evaluate the predictive accuracy of user summaries in forecasting future activities. The evaluation involves splitting each user activity timeline into past and future activities (see Figure 1). Summaries are generated from the past activities, and the quality of these summaries is measured by how well they predict the future activities. The quality of a user summary, $Q_s$, is computed by aggregating performance across multiple future activity prediction tasks.

$$Q_s = I\left[\sum_{t}^{T_s} q_{s,t} \geq m\right] \tag{1}$$

Here, $q_{s,t}$ denotes the binary prediction outcome for summary $s$ on task $t$ from the set of tasks $T_s$; $m$ is the threshold for the number of correct predictions required; $I[\cdot]$ is the indicator function, where a result of 1 indicates a "Good" summary, and 0 indicates a "Bad" summary.

To evaluate a summarization approach on a generated summary set $S$, the Quality Metric (QM) is calculated as the percentage of summaries classified as "Good" based on the qualities of these summaries.

$$QM = \frac{\sum_{s \in S} Q_s}{|S|} \tag{2}$$
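Equations 1 and 2 can be illustrated with a minimal Python sketch. The function names and the toy outcomes below are assumptions made for illustration; only the thresholding and averaging follow the definitions above, and $m = 3$ is the default reported in Section 4.1.

```python
from typing import Sequence


def summary_quality(outcomes: Sequence[int], m: int) -> int:
    """Equation 1: return 1 ("Good") if at least m binary task outcomes are correct."""
    return 1 if sum(outcomes) >= m else 0


def quality_metric(all_outcomes: Sequence[Sequence[int]], m: int) -> float:
    """Equation 2: fraction of summaries in the set that are classified as "Good"."""
    if not all_outcomes:
        raise ValueError("the summary set S must not be empty")
    return sum(summary_quality(o, m) for o in all_outcomes) / len(all_outcomes)


# Toy data (made up): three summaries, four binary task outcomes each.
toy_outcomes = [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
print(quality_metric(toy_outcomes, m=3))  # two of the three summaries reach the threshold
```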

Instruction Following Metric

The Instruction Following Metric evaluates how well the user summaries adhere to specific constraints, such as a word limit, as introduced in other works (Skopek et al. 2023). For a summarization approach, the Instruction Following Metric (IFM) is defined as the proportion of summaries within the set $S$ that meet the word limit constraint $X$.

$$IFM = \frac{|\{s \in S \mid \text{length}(s) \leq X\}|}{|S|} \tag{3}$$

While this metric was originally proposed in earlier studies, it remains a useful measure for assessing how effectively a summarization approach adheres to the given prompt instructions.
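As a rough sketch of Equation 3, the snippet below counts summaries that stay within a word limit. It assumes that length(s) is the number of whitespace-separated words, which the text does not specify; the function name is hypothetical.

```python
from typing import Sequence


def instruction_following_metric(summaries: Sequence[str], word_limit: int) -> float:
    """Equation 3: share of summaries whose word count is at most word_limit.

    Assumption: length(s) is measured in whitespace-separated words.
    """
    if not summaries:
        raise ValueError("the summary set S must not be empty")
    within_limit = sum(1 for s in summaries if len(s.split()) <= word_limit)
    return within_limit / len(summaries)


# Toy example (made up): one summary respects a 4-word limit, one does not.
print(instruction_following_metric(["likes sci-fi movies", "a b c d e f"], word_limit=4))
```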

Information Density Metric

The Information Density Metric assesses the conciseness and informativeness of user summaries, combining elements of both the Quality Metric and the Instruction Following Metric. This metric evaluates the balance between the length of a summary and its effectiveness in predicting future activities. For each summary $s$ in the set $S$, and each associated task set $T_s$, the Information Density Metric (IDM) is calculated by dividing the average task prediction accuracy by the length of the summary.

$$IDM = \frac{1}{|S|} \sum_{s \in S} \left( \frac{\frac{1}{|T_s|} \sum_{t \in T_s} \text{acc}(t)}{\text{length}(s)} \right) \tag{4}$$

This metric provides a quantitative approach to evaluate the trade-off between informativeness and brevity in user summaries, ensuring that summaries are both concise and meaningful.
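Equation 4 can be sketched in the same way. The record layout (summary text paired with per-task accuracies) and the word-based length are assumptions for illustration only.

```python
from typing import Sequence, Tuple


def information_density_metric(records: Sequence[Tuple[str, Sequence[float]]]) -> float:
    """Equation 4: mean over summaries of (average task accuracy / summary length).

    Each record is (summary_text, [acc(t) for t in T_s]).
    Assumption: length(s) is the number of whitespace-separated words.
    """
    if not records:
        raise ValueError("the summary set S must not be empty")
    total = 0.0
    for summary, accuracies in records:
        length = len(summary.split())
        if length == 0 or not accuracies:
            raise ValueError("each summary needs text and at least one task accuracy")
        total += (sum(accuracies) / len(accuracies)) / length
    return total / len(records)


# Toy data (made up): a 4-word summary with 50% accuracy and a 2-word summary with 100%.
toy_records = [("a b c d", [1, 1, 0, 0]), ("a b", [1, 1])]
print(information_density_metric(toy_records))
```

Because accuracy is divided by length, a shorter summary with the same predictive accuracy receives a higher score, which is the trade-off the metric is meant to expose.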

3.2 Hierarchy-Critique Summary Generation

UserSumBench introduces a Time-Hierarchical and Self-Critique (Hierarchy-Critique) summarization approach (see Figure 3(2)), which is demonstrated to outperform a simpler single-step method (see Figure 3(1), refer to Appendix A.1 for the prompt details). For more details on this comparison, please refer to Section 4.2. The Hierarchy-Critique approach is designed to address the challenges of generating factually consistent summaries by mitigating hallucinations while maintaining computational efficiency through a time-hierarchical structure.

Figure 3: Comparison of summarization approaches: (1) Single-step summarization approach, where a summary is generated directly from the user’s activity history; (2) Time-hierarchical and self-critique summarization approach, which involves segmenting the user’s activity history over time, summarizing each segment, and iteratively refining the summaries before combining them into a final summary.

The Hierarchy-Critique approach works by first segmenting a user’s activity history into manageable time intervals, ensuring each segment meets a minimum activity threshold to provide a comprehensive representation of the user’s behavior. This segmentation allows LLMs to process the data efficiently within their context window limits. As depicted in Figure 4, the summarizer model generates an initial summary for each segment (refer to Appendix A.2). These segment summaries are then refined by a verifier model (Wang et al. 2023), whose role is to identify and correct potential hallucinations (e.g., query inconsistencies, factual inaccuracies), ensuring that each segment accurately reflects the user’s activities. Finally, the refined segment summaries are synthesized into a cohesive summary that encapsulates the user’s overall behavior and preferences (refer to Appendix A.7), with all time segments combined in chronological order.

Algorithm 1 Hierarchy-Critique Summary Generation

Input: User activity history $\Phi$
Parameter: Time segments $\{T_1, T_2, \ldots, T_n\}$, Summarizer model $LLM_{sum}$, Verifier model $LLM_{ver}$
Output: Factually consistent user summary $S$

1:  Divide user activity history $\Phi$ into time segments $\{T_1, T_2, \ldots, T_n\}$
2:  for each time segment $T_i$ do
3:     Generate initial summary $S_i$ for segment $T_i$ using $LLM_{sum}$
4:     Refine $S_i$ using $LLM_{ver}$:
5:      Identify and correct Query Inconsistency
6:      Identify and correct Fact Inconsistency:
7:       Extract key KG entities $\{e_{i,1}, e_{i,2}, \ldots, e_{i,k}\}$ from $S_i$
8:       Generate question-answer pairs $\{(q_{i,1}, a_{i,1}), (q_{i,2}, a_{i,2}), \ldots, (q_{i,k}, a_{i,k})\}$ using $LLM_{ver}$ on $S_i$
9:       Verify consistency of each pair $(q_{i,j}, a_{i,j})$ with user activities $\Phi$
10:       For inconsistent pairs, regenerate $S_i$ incorporating feedback
11:  end for
12:  Synthesize segment summaries $\{S_1, S_2, \ldots, S_n\}$ into a single cohesive summary $S$
13:  return $S$
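The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the three callables stand in for LLM calls and are hypothetical, the critique is run against each segment's own activities, and `max_iters` is an assumed name for the iteration threshold mentioned in Figure 4.

```python
from typing import Callable, List, Optional

# Hypothetical stand-ins for LLM calls (names and signatures are assumptions).
Summarize = Callable[[List[str], Optional[str]], str]  # (activities, feedback) -> summary
Critique = Callable[[str, List[str]], Optional[str]]   # (summary, activities) -> feedback or None
Synthesize = Callable[[List[str]], str]                 # segment summaries -> final summary


def hierarchy_critique(
    segments: List[List[str]],
    summarize: Summarize,
    critique: Critique,
    synthesize: Synthesize,
    max_iters: int = 3,
) -> str:
    """Summarize each time segment, refine it with a verifier, then merge the results."""
    refined: List[str] = []
    for activities in segments:  # segments are assumed to be in chronological order
        summary = summarize(activities, None)
        for _ in range(max_iters):
            feedback = critique(summary, activities)
            if feedback is None:  # verifier found nothing to correct
                break
            summary = summarize(activities, feedback)
        refined.append(summary)
    return synthesize(refined)


# Toy stubs (made up) so that the sketch runs without any model.
def fake_summarize(activities: List[str], feedback: Optional[str]) -> str:
    return "likes " + activities[0] if feedback else "likes everything"


def fake_critique(summary: str, activities: List[str]) -> Optional[str]:
    return None if any(a in summary for a in activities) else "mention an observed activity"


def fake_synthesize(parts: List[str]) -> str:
    return " | ".join(parts)


print(hierarchy_critique([["hiking"], ["jazz"]], fake_summarize, fake_critique, fake_synthesize))
```

Passing the summarizer and verifier as separate callables mirrors the statement that the two roles may be served by the same model or by different models.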

As illustrated in Figure 4, the Hierarchy-Critique approach employs two LLMs (a summarizer and a verifier) to iteratively refine segment summaries. These LLMs can be the same model or different models, with the verifier focusing on identifying specific types of hallucinations:

Figure 4: Diagram of the iterative summarization refinement process, where the LLM Summarizer generates an initial segment summary, which is then iteratively critiqued and refined by the LLM Verifier until an optimized summary is achieved or a specified iteration threshold is met.
  • Query Consistency: Ensuring that the summary is relevant to the initial query. The verifier checks for consistency between the query and the summary based on a provided prompt (refer to Appendix A.4).

  • Fact Consistency: Ensuring that the information in the generated summary is accurate and consistent with the user’s activities. To identify factual inconsistencies, we propose using the Question Generation - Question Answering (QG-QA) method (Xu et al. 2024), which operates on key Knowledge Graph (KG) entities (Singhal 2012) extracted from the summary. This method involves two components: Question Generation and Question Answering.

    - Question Generation (QG): Given a summary and a KG entity (e.g., Hiking /m/012v4j) extracted from the summary, the verifier generates a question-answer pair based on the summary context using a user-specified prompt (refer to Appendix A.5). The answer is the KG entity.

    - Question Answering (QA): The consistency of these question-answer pairs is then verified against the user's activities using the verifier, guided by a user-specified prompt (refer to Appendix A.6).

    Let $S$ be the summary candidate based on the retrieved activities $\Phi$, $e_k$ the $k$-th KG entity in $S$, $(q_k, a_k)$ its corresponding question-answer pair, and $h_k$ the binary result (consistent, inconsistent) of $(q_k, a_k)$ compared to $\Phi$. The QG-QA process can be represented by Equations 5 and 6.

    $$\text{QG:} \quad \{(q_k, a_k)\} = LLM_{QG}(S, \{e_k\}) \tag{5}$$
    $$\text{QA:} \quad \{h_k\} = LLM_{QA}(\Phi, \{(q_k, a_k)\}) \tag{6}$$

    For any inconsistent question-answer pairs identified by $h_k$, the summary is regenerated by incorporating feedback from these hallucinations into the original prompt (refer to Appendix A.3), as described in Figure 4 and detailed in Algorithm 1.
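The QG-QA check in Equations 5 and 6 can be sketched as a function that returns the inconsistent question-answer pairs, which would then be fed back into the summarization prompt. The callables `generate_qa` and `is_consistent` are hypothetical placeholders for the two verifier prompts, and entity extraction is assumed to have happened already.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, answer); the answer is the KG entity


def qg_qa_inconsistencies(
    summary: str,
    entities: List[str],
    activities: List[str],
    generate_qa: Callable[[str, str], QAPair],
    is_consistent: Callable[[List[str], str, str], bool],
) -> List[QAPair]:
    """Return the QA pairs judged inconsistent with the activities (empty list if none)."""
    pairs = [generate_qa(summary, entity) for entity in entities]   # Equation 5
    flags = [is_consistent(activities, q, a) for q, a in pairs]     # Equation 6, the h_k values
    return [pair for pair, ok in zip(pairs, flags) if not ok]


# Toy stubs (made up): a fixed question template and a substring check instead of an LLM.
def toy_generate_qa(summary: str, entity: str) -> QAPair:
    return ("Which interest does the summary mention?", entity)


def toy_is_consistent(activities: List[str], question: str, answer: str) -> bool:
    return any(answer.lower() in activity.lower() for activity in activities)


bad_pairs = qg_qa_inconsistencies(
    "Enjoys hiking and opera",
    ["hiking", "opera"],
    ["Bought hiking boots", "Visited a trailhead"],
    toy_generate_qa,
    toy_is_consistent,
)
print(bad_pairs)  # only the "opera" pair is unsupported by the toy activities
```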

4 Evaluation

In this section, we validate the benchmark metrics and evaluate the Hierarchy-Critique summarization approach within the UserSumBench framework.

4.1 Validating Benchmark Metrics

To validate the UserSumBench benchmark metrics, we study their alignment with human ratings on three public user activity datasets: MovieLens 1M (Harper and Konstan 2015), Yelp (Yelp), and Amazon Review (Ni, Li, and McAuley 2019).

Datasets and Evaluation Tasks

This section outlines the dataset preparation and the prediction tasks used for evaluation.

User timelines from the three datasets were filtered based on activity count $L_\Phi$ to ensure a balance between data sufficiency and computational efficiency:

  • $L_\Phi \geq N_{low}$: Timelines with fewer than 50 activities were excluded to ensure sufficient context for generating robust summaries.

  • $L_\Phi \leq N_{up}$: Timelines with more than 200 activities were truncated to the most recent 200 to focus on recent behavior patterns and manage computational load.

These thresholds ($N_{low} = 50$ and $N_{up} = 200$) were chosen to balance informative summaries with processing efficiency. Table 1 shows the number of examples used for evaluation after filtering.
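The filtering rule can be expressed in a few lines of Python. The sketch assumes each timeline is a list ordered from oldest to newest activity; the function name is hypothetical.

```python
from typing import List, Optional


def filter_timeline(activities: List[str], n_low: int = 50, n_up: int = 200) -> Optional[List[str]]:
    """Drop timelines shorter than n_low; keep at most the n_up most recent activities.

    Assumption: activities are ordered from oldest to newest.
    """
    if len(activities) < n_low:
        return None  # excluded: not enough context for a robust summary
    return activities[-n_up:]  # truncation keeps the most recent n_up entries


short = [f"activity {i}" for i in range(49)]
long = [f"activity {i}" for i in range(250)]
print(filter_timeline(short))      # None
print(len(filter_timeline(long)))  # 200
```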

Dataset Number of Examples
MovieLens 4297
Yelp 6798
Amazon Review 5046
Table 1: Number of examples used for evaluating summarization metrics across different datasets.

Four prediction tasks were defined to evaluate the quality of the summaries generated from these datasets. Each task was structured as a multiple-choice question with one correct answer and four randomly selected incorrect options. The order of all choices was randomized during execution to ensure robustness.

  • $t_1$: Predict the user's next activity (e.g., next watched movie name, next purchased product name) based solely on the summary, assessing the summary's ability to encapsulate immediate user behavior.

  • $t_2$: Predict the user's next activity considering both the summary and the $N_r$ most recent activities, evaluating the summary's effectiveness in conjunction with recent user data.

  • $t_3$: Predict the category of the user's next activity (e.g., next watched movie genre, next purchased product category) based on the summary alone, testing the summary's ability to generalize user preferences.

  • $t_4$: Predict the category of the user's next activity using both the summary and the $N_r$ most recent activities, examining how well the summary integrates with recent behavior to provide accurate insights.

These tasks were designed to comprehensively assess different aspects of summary quality. By default, $N_r = 20$ and the threshold $m = 3$ (see Equation 1), so a summary is rated "Good" when at least three of the four tasks are predicted correctly.
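A multiple-choice item of the kind described above (one correct answer, four random distractors, shuffled order) could be assembled as in the sketch below. How distractors were actually sampled is not specified in the text; uniform sampling from a candidate pool, and the function name, are assumptions.

```python
import random
from typing import List, Tuple


def build_multiple_choice(
    correct: str, candidate_pool: List[str], rng: random.Random, n_distractors: int = 4
) -> Tuple[List[str], int]:
    """Return shuffled options and the index of the correct answer.

    Assumption: distractors are drawn uniformly from the pool, excluding the correct answer.
    """
    distractor_pool = [c for c in dict.fromkeys(candidate_pool) if c != correct]
    if len(distractor_pool) < n_distractors:
        raise ValueError("not enough distinct incorrect candidates")
    options = rng.sample(distractor_pool, n_distractors) + [correct]
    rng.shuffle(options)  # randomize position so the answer slot carries no signal
    return options, options.index(correct)


rng = random.Random(0)  # fixed seed only to make the toy example repeatable
pool = ["Heat", "Alien", "Big", "Casablanca", "Dune", "Fargo"]
options, answer_index = build_multiple_choice("Heat", pool, rng)
print(options, answer_index)
```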

Quality Metric vs. Human Ratings

To validate the effectiveness of the proposed Quality Metric (refer to Section 3.1), we conducted a human rating exercise.

We selected 200 examples from each of the three summary-extended datasets (MovieLens, Yelp, and Amazon Reviews), resulting in a total of 600 examples. These examples were based on user activity histories from English-speaking residents in the United States, and the summaries were generated using the single-step summarization approach (see Figure 3(1)) with the Gemini Advanced model (Team et al. 2023).

Six human raters (three male and three female), all English-speaking USA residents familiar with common user activities, participated in the evaluation. Each dataset was evaluated by two raters (one male and one female) to enhance robustness and minimize bias, with each rater assessing 100 examples.

Given the subjective nature of summary evaluation, particularly for criteria like "Fact Hallucinations Verification," a unified rating guideline was implemented to standardize the process across all raters. This guideline was designed to ensure consistency and reliability in the ratings, classifying summaries as either "Good" or "Bad" based on the following criteria:

  • Query Hallucinations Verification: Assesses whether the summary accurately reflects the specified summary query. For instance, if the query is to "summarize a user's movie-watching preferences," but the summary only lists movie-watching activities without indicating any preferences, the summary would fail this check.

  • Fact Hallucinations Verification: Evaluates whether the details in the summary align with the context of the input activities. For example, if a user's recent activity history shows a focus on purchasing toys, but the summary inaccurately states that the user mainly purchased office supplies, it would be marked as inaccurate.

  • Top Category Recall: Checks whether at least one of the top three categories, such as movie genres, is mentioned in the summary. Very popular categories like "Restaurant" in the Yelp dataset are excluded from the top three to ensure meaningful assessment.

This structured approach, with clearly defined evaluation criteria and qualified raters, ensures that the human annotation process is consistent and capable of producing reliable comparisons between prediction results and human ratings.

After obtaining prediction results and human ratings for the same user summaries, we categorized each summary into one of four possible outcomes based on the alignment between the prediction and the human rating. We then calculated the Metric-Annotator Agreement (MAA) using Equation 7, which represents the percentage of cases where the evaluation metrics and human annotators agreed on the quality of the summary. Higher scores indicate better alignment.

$$MAA = \frac{TP + TN}{TP + FP + FN + TN} \tag{7}$$

In this equation, the four possible outcomes are as follows:

  • $TP$ (True Positive): Both the prediction and the human rating classify the summary as "Good."

  • $FP$ (False Positive): The prediction classifies the summary as "Good," but the human rating classifies it as "Bad."

  • $FN$ (False Negative): The prediction classifies the summary as "Bad," but the human rating classifies it as "Good."

  • $TN$ (True Negative): Both the prediction and the human rating classify the summary as "Bad."
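Equation 7 amounts to plain accuracy over paired binary labels. The sketch below counts the four outcomes explicitly; the label encoding (1 for "Good", 0 for "Bad") and the toy labels are assumptions.

```python
from typing import Sequence


def metric_annotator_agreement(metric_labels: Sequence[int], human_labels: Sequence[int]) -> float:
    """Equation 7: (TP + TN) / (TP + FP + FN + TN), with 1 = "Good" and 0 = "Bad"."""
    if len(metric_labels) != len(human_labels) or not metric_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(metric_labels, human_labels))
    tp = sum(1 for m, h in pairs if m == 1 and h == 1)
    fp = sum(1 for m, h in pairs if m == 1 and h == 0)
    fn = sum(1 for m, h in pairs if m == 0 and h == 1)
    tn = sum(1 for m, h in pairs if m == 0 and h == 0)
    return (tp + tn) / (tp + fp + fn + tn)


# Toy labels (made up): the metric and the rater agree on 3 of 5 summaries.
print(metric_annotator_agreement([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```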

As shown in Table 2, the proposed Quality Metric demonstrated strong alignment with human ratings, with above 70% agreement across all datasets.

Dataset MAA
MovieLens 71.0%
Yelp 74.0%
Amazon Review 73.0%
Table 2: Metric-Annotator Agreement (MAA) between Quality Metric and human ratings across different datasets.

To further evaluate the performance of the proposed benchmark metrics, we applied them to three popular models: Gemini 1.5 Pro (Reid et al. 2024), GPT-4o (Achiam et al. 2023), and Claude 3 Haiku (Anthropic 2023). Detailed comparisons can be found in Appendix B.

Metric MovieLens Yelp Amazon
ROUGE-2 0.163 0.037 0.289
ROUGE-L 0.151 0.147 0.338
BLEU 0.049 0.043 -0.077
BERTScore-precision 0.077 -0.021 0.254
BERTScore-recall -0.006 0.090 0.146
BERTScore-F1 -0.119 0.071 -0.127
BLEURT -0.082 0.088 -0.023
AutoEval 0.204 0.120 0.490
Quality $Q_s$ 0.363 0.458 0.366
Table 3: Comparison of different evaluation metrics and summary quality measurement $Q_s$ across the MovieLens, Yelp, and Amazon Review datasets.
Metric | Dataset | Single-Step | Hierarchy-Critique | Increment Percentage
Quality Metric | MovieLens | 0.557 ± 0.002 | 0.586 ± 0.003 | 5.21%
Quality Metric | Yelp | 0.459 ± 0.002 | 0.512 ± 0.003 | 11.55%
Quality Metric | Amazon Review | 0.616 ± 0.002 | 0.631 ± 0.002 | 2.44%
Instruction Following Metric | MovieLens | 0.836 | 0.988 | 18.18%
Instruction Following Metric | Yelp | 0.791 | 0.992 | 25.41%
Instruction Following Metric | Amazon Review | 0.842 | 0.999 | 18.65%
Information Density Metric (x0.1%) | MovieLens | 3.788 ± 0.004 | 5.184 ± 0.005 | 36.85%
Information Density Metric (x0.1%) | Yelp | 3.561 ± 0.004 | 4.721 ± 0.005 | 32.58%
Information Density Metric (x0.1%) | Amazon Review | 2.145 ± 0.002 | 3.202 ± 0.003 | 49.28%
Table 4: Comparison of quality metric, instruction following metric, and information density metric between Single-Step and Hierarchy-Critique approaches across different datasets.

In addition to UserSumBench benchmark metrics, we also considered other reference-free evaluation metrics that do not require ground-truth summaries for user summary evaluation. These metrics compare the generated summary against user activities without relying on a reference summary.

  • ROUGE-2 (Lin 2004): Measures the overlap of bigrams between the generated summary and user activities.

  • ROUGE-L (Lin 2004): Measures the Longest Common Subsequence (LCS) between the generated summary and user activities.

  • BLEU (Papineni et al. 2002): Evaluates the precision of n-grams in the generated summary compared to user activities.

  • BERTScore (Zhang et al. 2019): Uses BERT embeddings to compute precision, recall, and F1 score based on the similarity of words in the generated summary and user activities.

    - BERTScore-precision: Measures the precision of embedding overlap.

    - BERTScore-recall: Measures the recall of embedding overlap.

    - BERTScore-F1: Combines precision and recall to provide an F1 score.

  • BLEURT (Sellam, Das, and Parikh 2020): Uses a pre-trained model to evaluate the quality of the generated text based on human judgments compared to user activities.

  • AutoEval (Chiang and Lee 2023): Adopts an LLM to automatically evaluate summaries without requiring reference summaries, focusing on coherence and relevance.
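To make the reference-free use of an overlap metric concrete, the sketch below computes a simplified, recall-only ROUGE-2 in which the user activity text plays the role of the reference. It omits stemming and other preprocessing found in standard ROUGE tooling, and it is not the evaluation code used in this work.

```python
from collections import Counter


def bigram_counts(text: str) -> Counter:
    """Count adjacent token pairs after lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_recall(summary: str, activity_text: str) -> float:
    """Simplified ROUGE-2 recall: clipped bigram overlap divided by reference bigram count.

    Assumption: the concatenated user activities act as the reference text.
    """
    reference = bigram_counts(activity_text)
    if not reference:
        return 0.0
    candidate = bigram_counts(summary)
    overlap = sum(min(count, candidate[bigram]) for bigram, count in reference.items())
    return overlap / sum(reference.values())


print(rouge2_recall("watched the matrix twice", "watched the matrix"))
```

Because such scores reward surface overlap with the raw activities, a summary that abstracts preferences in its own words can score low even when it is accurate.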

To further validate our approach, we calculated the Pearson Correlation Coefficient between human ratings and the reference-free metrics, including the proposed summary quality measurement $Q_s$ (see Equation 1). In this comparison, BLEURT utilized a pre-trained BLEURT-base-128 model (Sellam, Das, and Parikh 2020) for evaluation, while AutoEval was based on the Gemini 1.5 Pro model. $Q_s$ also employed the Gemini 1.5 Pro model for future activity predictions. As shown in Table 3, our metric showed the most consistent correlation with human ratings across the three datasets and the highest correlation on MovieLens and Yelp (AutoEval scored higher on Amazon Review only), as it is specifically tailored to the characteristics of user timeline activities. Despite not having a perfect correlation, $Q_s$ remains highly useful for tasks like weakly supervised learning (Zhou 2018), where ground-truth summaries are limited.
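For reference, the Pearson Correlation Coefficient between a metric's scores and binary human ratings can be computed as in the following sketch. The toy vectors are made up and do not correspond to any value in Table 3.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


# Toy data (made up): human "Good"/"Bad" ratings as 1/0 and a metric's scores.
human = [1, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.7, 0.6, 0.4]
print(pearson(scores, human))
```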

4.2 Evaluating Summarization Approaches

In this section, we evaluate our proposed Hierarchy-Critique summarization approach using the UserSumBench benchmark metrics. For consistency, all LLMs in this experiment used the same model: Gemini 1.5 Flash (Reid et al. 2024), across summarization, self-critique verification, and prediction tasks. To validate the robustness of the benchmarks, we conducted three prediction runs on all three datasets to calculate the $Mean \pm Sd$ values for the Quality Metric and the Information Density Metric. The results, presented in Table 4, demonstrate that the Hierarchy-Critique approach consistently outperforms the Single-Step approach across various evaluation metrics. Additionally, the stability of the benchmark metrics across multiple predictions confirms their robustness.

  • Quality Metric: The Hierarchy-Critique approach achieved superior scores on the MovieLens, Yelp, and Amazon Review datasets, with increases of 5.21%, 11.55%, and 2.44% respectively. These gains indicate that the iterative refinement process effectively reduces hallucinations and enhances the overall consistency and quality of the summaries.

  • Instruction Following Metric: Significant improvements were observed with the Hierarchy-Critique approach, particularly in the Yelp dataset, which showed a 25.41% increase, followed by MovieLens and Amazon Review with gains of 18.18% and 18.65% respectively. This suggests that the Hierarchy-Critique method is more adept at adhering to prompt constraints, likely due to its effective segmentation and summarization of user activities.

  • Information Density Metric: The Hierarchy-Critique approach demonstrated substantial gains in this metric, with increases of 36.85%, 32.58%, and 49.28% for the MovieLens, Yelp, and Amazon Review datasets respectively. This shows that the approach not only produces accurate summaries but also ensures that these summaries are concise and rich in information, effectively balancing brevity with informativeness.

These findings underscore the superiority of the Hierarchy-Critique approach over the Single-Step method, particularly in generating higher-quality, instruction-compliant, and information-dense summaries. The effectiveness of time segmentation, combined with iterative refinement and verification processes, plays a pivotal role in these improvements, establishing the Hierarchy-Critique approach as a more robust and reliable option for summarizing user activity timelines.
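
The Mean ± Sd values and percentage increases discussed in this section can be derived from per-run scores as in the following sketch. Two points are assumptions, since the text does not state them: the standard deviation is taken as the sample standard deviation, and each reported increase is read as a relative gain over the Single-Step score. The function names are hypothetical.

```python
import statistics


def mean_sd(run_scores):
    """Mean and sample standard deviation over repeated prediction runs."""
    return statistics.mean(run_scores), statistics.stdev(run_scores)


def relative_gain_pct(baseline, improved):
    """Percentage increase of `improved` over `baseline` (assumed definition)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0
```

With only three runs per dataset, the standard deviation is a rough indicator of stability rather than a precise estimate.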

5 Conclusion and Future Work

In this paper, we introduced UserSumBench, a benchmark framework for evaluating user summarization approaches by assessing the user summaries they generate from activity timelines. Our key contributions include (1) a reference-free quality metric that scores summaries by how well they support predictions of future user activities, and which has demonstrated strong effectiveness and close alignment with human preferences across three diverse datasets (MovieLens, Yelp, and Amazon Review), and (2) a robust summarization baseline method that combines a time-hierarchical summarizer with a self-critique verifier, yielding high-quality summaries while effectively minimizing hallucinations.

The strong alignment between our proposed quality metric and human ratings establishes UserSumBench as a reliable and efficient tool for automated evaluation of user summarization approaches. By offering a cost-effective solution, UserSumBench addresses the pressing need for standardized evaluation methods in the field of user summarization.

Looking ahead, we plan to expand UserSumBench by integrating real-time summarization techniques and exploring its applicability across additional domains beyond the current datasets. These enhancements will further broaden the utility and impact of the framework. Additionally, we aim to encourage the broader research community to adopt UserSumBench, fostering its potential to standardize user summary evaluation practices and drive innovation in the development of more accurate and robust summarization techniques. Ultimately, we believe UserSumBench will play a significant role in advancing personalization, recommendation systems, and user understanding in the digital landscape.

References

  • Achiam et al. (2023) Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  • Akkasi, Fraser, and Komeili (2023) Akkasi, A.; Fraser, K. C.; and Komeili, M. 2023. Reference-free summarization evaluation with large language models. In Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems, 193–201.
  • Anthropic (2023) Anthropic. 2023. The Claude 3 Model Family: Opus, Sonnet, Haiku.
  • Chen et al. (2024) Chen, X.; Wang, T.; Zhu, Q.; Guo, T.; Gao, S.; Lu, Z.; Gao, X.; and Zhang, X. 2024. Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark. arXiv preprint arXiv:2402.14359.
  • Chiang and Lee (2023) Chiang, C.-H.; and Lee, H.-y. 2023. A closer look into automatic evaluation using large language models. arXiv preprint arXiv:2310.05657.
  • Fabbri et al. (2021) Fabbri, A. R.; Kryściński, W.; McCann, B.; Xiong, C.; Socher, R.; and Radev, D. 2021. Summeval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9: 391–409.
  • Giarelis, Mastrokostas, and Karacapilidis (2023) Giarelis, N.; Mastrokostas, C.; and Karacapilidis, N. 2023. Abstractive vs. Extractive Summarization: An Experimental Review. Applied Sciences, 13(13): 7620.
  • Harper and Konstan (2015) Harper, F. M.; and Konstan, J. A. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4): 1–19.
  • Lin (2004) Lin, C.-Y. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, 74–81.
  • Liu et al. (2023) Liu, Y.; Fabbri, A. R.; Chen, J.; Zhao, Y.; Han, S.; Joty, S.; Liu, P.; Radev, D.; Wu, C.-S.; and Cohan, A. 2023. Benchmarking generation and evaluation capabilities of large language models for instruction controllable summarization. arXiv preprint arXiv:2311.09184.
  • Lloret, Plaza, and Aker (2018) Lloret, E.; Plaza, L.; and Aker, A. 2018. The challenging task of summary evaluation: an overview. Language Resources and Evaluation, 52: 101–148.
  • Ni, Li, and McAuley (2019) Ni, J.; Li, J.; and McAuley, J. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), 188–197.
  • Papineni et al. (2002) Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 311–318.
  • Reid et al. (2024) Reid, M.; Savinov, N.; Teplyashin, D.; Lepikhin, D.; Lillicrap, T.; Alayrac, J.-b.; Soricut, R.; Lazaridou, A.; Firat, O.; Schrittwieser, J.; et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
  • Sellam, Das, and Parikh (2020) Sellam, T.; Das, D.; and Parikh, A. P. 2020. BLEURT: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696.
  • Singhal (2012) Singhal, A. 2012. Introducing the Knowledge Graph, things, not strings. https://blog.google/products/search/introducing-knowledge-graph-things-not. Blog post.
  • Skopek et al. (2023) Skopek, O.; Aralikatte, R.; Gooding, S.; and Carbune, V. 2023. Towards better evaluation of instruction-following: A case-study in summarization. arXiv preprint arXiv:2310.08394.
  • Team et al. (2023) Team, G.; Anil, R.; Borgeaud, S.; Wu, Y.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A. M.; Hauth, A.; et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  • Wang et al. (2023) Wang, R.; Wang, H.; Mi, F.; Chen, Y.; Xu, R.; and Wong, K.-F. 2023. Self-critique prompting with large language models for inductive instructions. arXiv preprint arXiv:2305.13733.
  • Wang et al. (2019) Wang, S.; Hu, L.; Wang, Y.; Cao, L.; Sheng, Q. Z.; and Orgun, M. 2019. Sequential recommender systems: challenges, progress and prospects. arXiv preprint arXiv:2001.04830.
  • Xu et al. (2024) Xu, L.; Su, Z.; Yu, M.; Xu, J.; Choi, J. D.; Zhou, J.; and Liu, F. 2024. Identifying Factual Inconsistency in Summaries: Towards Effective Utilization of Large Language Model. arXiv preprint arXiv:2402.12821.
  • Yelp (2023) Yelp. 2023. Yelp dataset. https://www.yelp.com/dataset. Dataset document.
  • Zhang et al. (2019) Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
  • Zhou (2018) Zhou, Z.-H. 2018. A brief introduction to weakly supervised learning. National science review, 5(1): 44–53.

Appendix A Prompt Examples

A.1 LLM Prompt Template for Single-Step Summarization

----- Instructions -----

Summarize my “User Activities” and provide insights that address the query: “{query}”. Adhere to the instructions below.

1. The summary should have the format with **Summary** and **Insights**.

2. The summary should take into account the changes of my long-term interest over time.

3. My activities are cataloged in the following “User Activities” section, with each separated by a newline.

4. Limit the summary to no more than {max_words} words.

----- User Activities -----
{user_activities}

Here we define:

{query}: Query intention of the summarization. For example, “Summarize my long-term movie watching preference”.

{max_words}: Max number of words for the summary.

{user_activities}: List of user activities.
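
As an illustration of how the placeholders above are filled, the sketch below instantiates the single-step template with Python string formatting. The constant and function names are hypothetical, the straight quotes and ASCII delimiter lines are assumptions about the original formatting, and joining activities with newlines follows instruction 3 of the template.

```python
SINGLE_STEP_TEMPLATE = """----- Instructions -----
Summarize my "User Activities" and provide insights that address the query: "{query}". Adhere to the instructions below.
1. The summary should have the format with **Summary** and **Insights**.
2. The summary should take into account the changes of my long-term interest over time.
3. My activities are cataloged in the following "User Activities" section, with each separated by a newline.
4. Limit the summary to no more than {max_words} words.
----- User Activities -----
{user_activities}"""


def build_single_step_prompt(query, max_words, activities):
    """Fill the template; `activities` is a list of activity strings."""
    return SINGLE_STEP_TEMPLATE.format(
        query=query,
        max_words=max_words,
        user_activities="\n".join(activities),
    )
```

The resulting string would then be sent to whichever LLM is under evaluation; the call to the model itself is deliberately left out because it depends on the provider.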

A.2 LLM Prompt Template for Segment Summarization

----- Instructions -----

Summarize my “User Activities” related to the specified time range “{time_range}” and provide insights that address the query: “{query}”. Adhere to the instructions below.

1. The summary should have the format with **Summary** and **Insights**.

2. The summary should take into account the changes of my long-term interest over time.

3. My activities related to this time range are cataloged in the following “User Activities” section, with each separated by a newline.

4. Limit the summary to no more than {max_words} words.

----- User Activities -----
{user_activities}

Here we define:

{time_range}: Time range of the segment user activities.

{query}: Query intention of the summarization. For example, “Summarize my long-term movie watching preference”.

{max_words}: Max number of words for the summary.

{user_activities}: List of user activities of the segment.

A.3 LLM Prompt Template for Segment Summarization with Feedback

----- Instructions -----

Summarize my “User Activities” related to the specified time range “{time_range}”, revise the below “Previous Summary” to be consistent with every “Question” and its corresponding “ReferenceAnswer” in the below “Previous Question Answers”. Adhere to the instructions below.

1. For each “Question” in “Previous Question Answers”, the “Answer” is derived from the “Previous Summary”, while the “ReferenceAnswer” is based on my “User Activities”.

2. Modify the “Previous Summary” to incorporate the “ReferenceAnswer” rather than the “Answer” for each “Question” in the “Previous Question Answers”.

3. Ensure the new summary provides insights that address the query: “{query}”.

4. The new summary should have the format with **Summary** and **Insights**.

5. The new summary should take into account the changes of my long-term interest over time.

6. My activities related to this time range are cataloged in the following “User Activities” section, with each separated by a newline.

7. Limit the summary to no more than {max_words} words.

----- Previous Summary -----
{previous_summary}

----- Previous Question Answers -----
{previous_question_answer_pairs}

----- User Activities -----
{user_activities}

Here we define:

{time_range}: Time range of the segment user activities.

{query}: Query intention of the summarization. For example, “Summarize my long-term movie watching preference”.

{max_words}: Max number of words for the summary.

{previous_summary}: Previous generated summary.

{previous_question_answer_pairs}: Previous list of generated QA pairs.

{user_activities}: List of user activities of the segment.
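
The template does not show how {previous_question_answer_pairs} is serialized. The sketch below is one possible rendering, assuming each item carries the question, the answer derived from the previous summary, and the reference answer derived from the user activities. The bracketed layout mirrors the Question/Answer format used in the other templates but is an assumption, as is the function name.

```python
def render_previous_qa(items):
    """Render (question, answer, reference_answer) triples, one per line.

    The bracketed layout is an assumed serialization, not taken from the paper.
    """
    lines = []
    for i, (question, answer, reference) in enumerate(items, start=1):
        lines.append(
            f'[Question#{i}: "{question}", Answer#{i}: "{answer}", '
            f'ReferenceAnswer#{i}: "{reference}"]'
        )
    return "\n".join(lines)
```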

A.4 LLM Prompt Template for Query Consistency

----- Instructions -----

Evaluate the relevance of a summary under the following “Summary” section to a query under the following “Query” section.

Return “consistent” if the summary aligns with the query, and “inconsistent” if the summary is unrelated to the query.

The response is only single word “consistent” or “inconsistent” without any explanation.

----- Summary -----
{summary}

----- Query -----
{query}

Here we define:

{summary}: Provided summary.

{query}: Query intention of the summarization.
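
Since the prompt asks for a single-word verdict, the response can be mapped to a boolean with a small parser such as the sketch below. The tolerance for surrounding whitespace, quotes, and a trailing period is an assumption about typical model output, and the function name is hypothetical.

```python
def parse_consistency(response):
    """Return True for "consistent", False for "inconsistent", None otherwise."""
    word = response.strip().strip('."\'“”').lower()
    if word == "consistent":
        return True
    if word == "inconsistent":
        return False
    return None  # unexpected output; the caller decides how to handle it
```

Returning None for anything else keeps malformed responses from being silently counted as either verdict.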

A.5 LLM Prompt Template for Question Generation

----- Instructions -----

Given the below “KG Entities” and “Summary”, adhere to the instructions below to create “Question-Answer Pairs”:

1. Each pair must be related to a specific KG entity.

2. The answer must be the KG entity itself.

3. Formulate questions that are directly relevant to the KG entity within the context of the summary.

4. Avoid creating questions that are open-ended.

5. Use the following format for your response, as shown under the “Question-Answer Pairs” of the “Example” section below: [Question#1: “Question”, Answer#1: “Answer”].

Following the above steps and an example under the following “Example” section, create “Question-Answer Pairs” based on the task under the “Task” section.

----- Example -----
“KG Entities”:
hiking
pop music

“Summary”:

**Summary:**

The user demonstrates a robust long-term interest in outdoor and musical activities. Specifically, they are drawn to hiking and pop music.

**Insights:**

* Sports Recreation and Fitness: The user has a sustained interest in hiking, engaging regularly in this activity, which indicates a preference for exploring nature and challenging terrains.

* Entertainment Media and Arts: The user enjoys pop music, known for its wide appeal and catchy melodies, reflecting a consistent interest in this genre.

“Question-Answer Pairs”:

[Question#1: “What outdoor activity is the user mainly interested in according to their searches and discussions?”, Answer#1: “hiking”]

[Question#2: “What genre of music does the user prefer, known for its wide appeal and catchy melodies?”, Answer#2: “pop music”]

----- Task -----
“KG Entities”:
{kg_entities}

“Summary”:
{summary}

Here we define:

{kg_entities}: List of KG entities extracted from the summary.

{summary}: Provided summary.
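
The bracketed response format requested in instruction 5 can be parsed with a regular expression, as in the following sketch. Accepting both straight and curly quotes and discarding pairs whose question and answer indices disagree are assumptions made for robustness; the pattern and function name are hypothetical.

```python
import re

# Matches: [Question#1: "…", Answer#1: "…"] with straight or curly quotes.
_QA_PATTERN = re.compile(
    r'\[\s*Question#(\d+):\s*["“”](.*?)["“”]\s*,'
    r'\s*Answer#(\d+):\s*["“”](.*?)["“”]\s*\]',
    re.DOTALL,
)


def parse_qa_pairs(text):
    """Return a list of (index, question, answer) tuples found in `text`."""
    pairs = []
    for q_idx, question, a_idx, answer in _QA_PATTERN.findall(text):
        if q_idx == a_idx:  # skip pairs whose indices do not match
            pairs.append((int(q_idx), question.strip(), answer.strip()))
    return pairs
```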

A.6 LLM Prompt Template for Question Answering

----- Instructions -----

Given the below “Question-Answer Pairs” and my “User Activities”, judge if each question-answer pair is consistent with my activities. Adhere to the instructions below.

1. Each “Judgement” is composed of “Status” and “ReferenceAnswer” like the following format: [Status#1: “Status”, ReferenceAnswer#1: “ReferenceAnswer”].

2. The “Status” should be labeled as “consistent” or “inconsistent”.

3. A “consistent” status means the question-answer pair aligns with or does not contradict the information provided in my activities. The “ReferenceAnswer” should be “none”.

4. An “inconsistent” status means the question-answer pair conflicts directly with or is contradicted by the information provided in my activities. The “ReferenceAnswer” should be a new answer of the question based on my activities.

5. Use the following format for your response, as shown under the “Judgements” of the following “Example” section below like [Status#2: “Status”, ReferenceAnswer#2: “ReferenceAnswer”].

6. Match each judgement to its corresponding question-answer pair by their sequence, such as [Status#2: “Status”, ReferenceAnswer#2: “ReferenceAnswer”] pertains to [Question#2: “Question”, Answer#2: “Answer”].

Following the above steps and an example under the following “Example” section, create judgements based on the task under the “Task” section.

----- Example -----
“Question-Answer Pairs”:

[Question#1: “What outdoor activity is the user mainly interested in according to their searches and discussions?”, Answer#1: “hiking”]

[Question#2: “What genre of music does the user prefer, known for its wide appeal and catchy melodies?”, Answer#2: “rock music”]

“User Activities”:
searched “Pop music trends in the 2020s” around Sat 05/15/2004 4PM
searched “Best coffee brewing methods for home” around Fri 06/11/2004 6PM
searched “How to prepare for a multi-day hiking trip” around Wed 07/21/2004 7PM
searched “The evolution of electronic elements in pop music” around Sun 04/28/2004 5PM
searched “Hiking trails with the best views in the U.S.” around Mon 04/29/2004 1PM

“Judgements”:
[Status#1: “consistent”, ReferenceAnswer#1: “none”]
[Status#2: “inconsistent”, ReferenceAnswer#2: “pop music”]

----- Task -----
“Question-Answer Pairs”:
{question_answer_pairs}

“User Activities”:
{user_activities}

Here we define:

{question_answer_pairs}: List of generated QA pairs related to the KG entities.

{user_activities}: User activities which are used to generate the summary.
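
A response in the judgement format above can be turned into a lookup from pair index to verdict, which makes it easy to collect the reference answers for inconsistent pairs. The sketch below is one way to do this; the quote handling, the decision to ignore malformed entries, and the names are assumptions.

```python
import re

# Matches: [Status#1: "…", ReferenceAnswer#1: "…"] with straight or curly quotes.
_JUDGEMENT_PATTERN = re.compile(
    r'\[\s*Status#(\d+):\s*["“”](.*?)["“”]\s*,'
    r'\s*ReferenceAnswer#(\d+):\s*["“”](.*?)["“”]\s*\]',
    re.DOTALL,
)


def parse_judgements(text):
    """Map pair index -> (is_consistent, reference_answer or None)."""
    result = {}
    for s_idx, status, r_idx, reference in _JUDGEMENT_PATTERN.findall(text):
        status = status.strip().lower()
        if s_idx != r_idx or status not in ("consistent", "inconsistent"):
            continue  # ignore malformed entries
        consistent = status == "consistent"
        result[int(s_idx)] = (consistent, None if consistent else reference.strip())
    return result
```

Applied to the “Judgements” in the example, this would mark pair 1 as consistent and return “pop music” as the reference answer for pair 2.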

A.7 LLM Prompt Template for Combining Time Segments

----- Instructions -----

Combine all the time segment summaries under the “Time Segment Summaries” section. Adhere to the instructions below.

1. The combined summary should offer insights relevant to the query: “{query}”.

2. Format the combined summary with sections labeled **Summary** and **Insights**.

3. Focus the combined summary on my recent interest preferences (recent time segments).

4. Limit the combined summary to no more than {max_words} words.

----- Time Segment Summaries -----
Summary of Time Segment “{time_range}”: {segment_summary}

Here we define:

{query}: Query intention of the summarization.

{max_words}: Max number of words for the summary.

{time_range}: Time range of a segment.

{segment_summary}: Summary of a time segment.
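
The “Time Segment Summaries” section contains one line per segment. A minimal sketch of how that section might be assembled is shown below; repeating the line for each segment in chronological order and using straight quotes are assumptions, and the function name is hypothetical.

```python
def render_segment_summaries(segments):
    """Render (time_range, segment_summary) pairs, one line per segment.

    `segments` is assumed to be ordered from oldest to most recent.
    """
    return "\n".join(
        f'Summary of Time Segment "{time_range}": {summary}'
        for time_range, summary in segments
    )
```

Keeping the segments in chronological order gives the model the context it needs to follow instruction 3, which asks it to focus on the most recent segments.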

Appendix B Analysis of Benchmark Metrics on Popular Models

Figure 5: Comparison of different models (Gemini 1.5 Pro, GPT-4o, and Claude 3 Haiku) on various metrics across the MovieLens, Yelp, and Amazon Review datasets: (1) Quality Metric, (2) Instruction Following Metric, and (3) Information Density Metric.

We assessed three popular models for summarizing user activities (refer to Appendix A.1 for the prompt details) using the UserSumBench benchmark metrics. The models evaluated were Gemini 1.5 Pro (Reid et al. 2024), GPT-4o (Achiam et al. 2023), and Claude 3 Haiku (Anthropic 2023). The summaries generated by these models were subsequently used for predicting future activities with the Gemini 1.5 Flash model (Reid et al. 2024). The results are presented in Figure 5.

Quality Metric

Figure 5 (1) shows the quality metric for the three models across the MovieLens, Yelp, and Amazon Review datasets. Gemini 1.5 Pro and GPT-4o exhibit similar performance in the three datasets. In the MovieLens and Yelp datasets, GPT-4o achieves the highest quality metric, surpassing both Gemini 1.5 Pro and Claude 3 Haiku. Claude 3 Haiku consistently shows slightly lower performance across all datasets but remains competitive, particularly in the Amazon Review dataset where it closely follows the other two models.

Instruction Following Metric

Figure 5 (2) compares the models based on the instruction-following metric. Gemini 1.5 Pro performs very well in all datasets, particularly in the Yelp and Amazon Review datasets, where it significantly outperforms the other models. Claude 3 Haiku also performs well, especially in the MovieLens dataset. GPT-4o, however, shows significantly lower performance in following instructions, particularly in the Yelp and Amazon Review datasets.

The observed poorer performance of GPT-4o on the instruction-following metric, particularly regarding summary word limits, could be attributed to its inherent design and optimization focus. Unlike some models that may be explicitly tuned for concise responses or instruction adherence, GPT-4o might prioritize generating detailed, comprehensive content, even at the expense of brevity. This tendency to produce more elaborate summaries could explain its weaker performance in adhering to strict word limits.

While GPT-4o’s tendency to generate more detailed responses may hinder its ability to strictly follow word limits, it may simultaneously contribute to its higher quality in predicting future activities. The additional detail and context provided in longer summaries could lead to richer representations of user behaviors and preferences, enhancing predictive accuracy on tasks such as those in the MovieLens and Yelp datasets as shown in Figure 5 (1).

In summary, GPT-4o’s architecture or training objectives may emphasize content richness over strict instruction adherence, which, while beneficial for some tasks, leads to challenges in settings where brevity and instruction following are critical.

Information Density Metric

Figure 5 (3) presents the information density metric. Gemini 1.5 Pro demonstrates the highest performance across all datasets. Claude 3 Haiku also performs well but is consistently outperformed by Gemini 1.5 Pro. GPT-4o shows the lowest performance, particularly in the Yelp and Amazon Review datasets, where the difference is most pronounced.

Overall Analysis

Overall, Gemini 1.5 Pro exhibits the strongest performance across the benchmark metrics and datasets taken together, particularly excelling in the instruction following and information density metrics. GPT-4o, while competitive in the quality metric, where it scores highest on MovieLens and Yelp, falls behind significantly in the instruction following and information density metrics. Claude 3 Haiku shows consistent performance across the board but does not surpass Gemini 1.5 Pro in most of the metrics.

These results indicate that Gemini 1.5 Pro is the most effective model among the three for generating high-quality, instruction-following, and information-dense summaries. This superior performance makes it a preferable choice for applications requiring detailed and accurate summarization capabilities.