A Novel Problem and Dataset for Summarization of Planning-Like (PL) Tasks

Vishal Pallagani
University of South Carolina
[email protected]
Biplav Srivastava
University of South Carolina
[email protected]
Nitin Gupta
University of South Carolina
[email protected]
Abstract

Text summarization is a well-studied problem that deals with deriving insights from unstructured text consumed by humans, and it has found extensive business applications. However, many real-life tasks involve generating a series of actions to achieve specific goals, such as workflows, recipes, dialogs, and travel plans. We refer to them as planning-like (PL) tasks, noting that the main commonality they share is control flow information, which may be partially specified. Their structure presents an opportunity to create more practical summaries to help users make quick decisions. We investigate this observation by introducing a novel plan summarization problem, presenting a dataset, and providing a baseline method for generating PL summaries. Using quantitative metrics and qualitative user studies to establish baselines, we evaluate the plan summaries from our method and large language models. We believe the novel problem and dataset can reinvigorate research in summarization, which some consider a solved problem.

1 Introduction

Text summarization is a crucial task in natural language processing (NLP) that focuses on condensing large volumes of unstructured text into concise and informative summaries (Luhn, 1958). This task has significant applications in various domains such as news aggregation, document summarization, and content recommendation systems (El-Kassas et al., 2021). Traditional summarization techniques can be broadly categorized into extractive (Gupta and Lehal, 2010) and abstractive methods (Gupta and Gupta, 2019). Extractive summarization selects key sentences or phrases from the original text, whereas abstractive summarization generates new sentences that capture the essence of the text. Recently, large language models (LLMs) have demonstrated remarkable capabilities, outperforming human summaries (Pu et al., 2023) on several datasets such as Multi-News (Fabbri et al., 2019) and MediaSum (Zhu et al., 2021).

Despite its extensive applications, text summarization has primarily concentrated on static documents, overlooking dynamic tasks that involve sequences of actions aimed at achieving specific goals. We refer to these tasks as planning-like (PL) tasks (Srivastava and Pallagani, 2024). Examples of PL tasks include workflows, recipes, dialogs, and travel plans, which often contain control flow information critical for execution. For instance, consider the task of cooking a cheese sandwich. Numerous recipes exist for making a cheese sandwich, each with varying ingredients and steps. A summary for this PL task aims to condense these multiple recipes into a single, coherent summary. This summary would allow a knowledgeable user to quickly make a cheese sandwich based on the brief summary or help a user decide which recipe best suits their needs based on the ingredients they have available. This approach can be considered similar to multi-document summarization on a high level, where information from multiple sources is synthesized into a concise summary (Goldstein et al., 2000). By summarizing multiple action sequences into coherent and actionable insights, we provide users with valuable information and facilitate quicker decision-making.

Consider another example of routes from Google Maps, a commercial service offering travel routes between selected locations. In Figure 1, we provide an instance where the user wants to find driving routes between Manhattan, New York, and Pleasantville, New York. Google Maps offers multiple route options visually on the map and provides a summary of three possible routes to reach the destination. This summary focuses on the critical roads, estimated travel time, and distance. This allows the user to choose their preferred route without going through the complete step-by-step instructions for all three options. Each summary in Box 1 can be expanded to reveal more detailed summaries, including additional key roads or waypoints. This capability enables quick decision-making and efficient route planning, illustrating the utility of summarization in PL tasks.

Figure 1: Google Maps summarizes three possible driving routes from Manhattan to Pleasantville, New York. The initial view (Box 1) includes key information like critical roads, estimated travel time, and distance, aiding quick decision-making. Detailed step-by-step directions can be accessed by expanding each summary present in Box 2.

To address the gap in summarization literature for PL tasks, we introduce the novel problem of summarizing planning-like (PL) tasks, which we also refer to as plan summarization or PL summaries. Plan summarization aims to create concise and coherent summaries of action sequences that achieve specific goals, thereby facilitating quick understanding and decision-making. Unlike traditional text summarization, plan summarization must account for the executability and logical flow of actions.

We present a new dataset, called PLANTS (https://github.com/VishalPallagani/PLANTS-benchmark), specifically designed for plan summarization tasks, encompassing diverse domains such as automated plans, recipes, and travel plans. Additionally, we propose a baseline method for generating PL summaries. Our evaluation includes comparisons with summaries generated by both extractive and abstractive methods through a user study. We believe that introducing the plan summarization problem and providing a relevant dataset will spark renewed interest in the summarization research community. Our contributions are fourfold: (1) definition of the planning task summarization problem; (2) creation of a dataset tailored for PL tasks; (3) development of a baseline method for generating summaries; (4) initial evaluation of how users perceive PL summaries from the baseline method and LLMs.

2 Planning-like Tasks

Planning-like tasks involve a series of actions required to achieve specific goals. These tasks are defined and explored in Srivastava and Pallagani (2024). In this paper, we focus on three primary domains of PL tasks: automated plans, recipes, and travel routes. Each of these domains involves unique challenges and characteristics that necessitate effective summarization for better user comprehension and decision-making.

Automated Plans
Automated planning (Ghallab et al., 2004) involves creating action sequences for intelligent agents to achieve specified goals. In automated planning, a problem is typically represented as a tuple consisting of states, actions, and goals. The objective is to generate an automated plan that transitions the system from the initial state to the goal state while satisfying certain constraints. The semantics of automated plans require them to be sound and feasible, meaning each action must be executable in the given context, and the sequence must logically lead to the achievement of the goal. Summarizing automated plans helps in quickly understanding the essential steps and ensuring all actions are executable.

Recipes
In the domain of culinary arts, recipes are structured sequences of actions aimed at preparing specific dishes. Each recipe includes a list of ingredients and step-by-step instructions for combining them. Given the multitude of recipes available for a single dish, there can be significant variation in ingredients and preparation methods. This diversity makes it challenging for users to quickly identify the essential components and steps needed to prepare a dish. Summarizing recipes allows users to identify must-have ingredients and critical steps, making it easier to choose or adapt a recipe based on available ingredients.

Travel Routes
Travel planning involves creating efficient paths from a starting location to a destination. This process includes determining the optimal route, considering factors such as distance, travel time, and road conditions. Travel routes are complex, often involving multiple possible paths and decisions about which roads or highways to take. Summarizing travel routes provides a clear overview of the main paths, travel times, and distances, aiding in quick decision-making and efficient route planning.

These PL tasks, as summarized in Table 1, highlight the different characteristics and requirements across domains. Summarizing these tasks enhances usability and accessibility, providing users with concise, actionable insights for efficient decision-making and task execution.

Table 1: Characterizing Planning-like Tasks.
Domain State Representation Control Flow Data Flow Auto Generation Auto Execution Comments
Automated Plans Full initial state, partial goal state Minimal Precise action sequences ensuring sound execution.
Recipes List of ingredients and steps Moderate Structured instructions for food preparation with variations across different recipes.
Travel Routes Start and destination points Extensive Step-by-step travel paths with critical roads, travel times, and distances to reach the destination.

3 Planning Task Summarization

Planning task summarization involves generating a concise summary of multiple plans that achieve the same goal. In various domains, such as travel planning, recipe generation, and automated planning, it is common to have multiple possible plans to reach a desired outcome. Each plan may differ in the sequence and number of actions required. Inspired by early work on process summarization (Srivastava, 2010), our approach aims to enhance user comprehension and facilitate better decision-making by providing a summary that consolidates these multiple plans into a single, coherent overview, highlighting the key actions and considerations for achieving the goal.

We formally define the planning task summarization problem as follows. Given a set of plans $P=\{p_{1},p_{2},\ldots,p_{n}\}$, each plan $p_{i}$ consists of a sequence of actions $\{a_{i1},a_{i2},\ldots,a_{im}\}$ designed to achieve a common goal $G$. The task is to produce a summary plan $P^{*}$ that is a function of the size and number of actions constrained by metadata. Mathematically, this can be expressed as:

$$P^{*}=\text{Summarize}(P,\text{constraints})$$

where the constraints may be in terms of textual features (e.g., maximum allowable characters, words or lines) or plan features (e.g., maximum number of actions) in the summary plan. Hence, it is expected that $|P^{*}| \ll |P|$. These constraints ensure that the summary plan remains concise and focused on the most critical actions necessary to achieve the goal.
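
To make the definition concrete, the following minimal Python sketch represents a plan as a list of action strings and checks whether a candidate summary respects textual and plan-level constraints. The names `Constraints`, `satisfies`, and `total_actions` are hypothetical and are not part of the PLANTS release; measuring the size of $P$ as the total number of actions is likewise an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

Plan = List[str]  # a plan is an ordered list of action strings


@dataclass
class Constraints:
    """Hypothetical container for summary constraints (assumed field names)."""
    max_words: Optional[int] = None    # textual feature
    max_actions: Optional[int] = None  # plan feature


def satisfies(summary: Plan, constraints: Constraints) -> bool:
    """Return True if the summary respects every constraint that is set."""
    n_words = sum(len(action.split()) for action in summary)
    if constraints.max_words is not None and n_words > constraints.max_words:
        return False
    if constraints.max_actions is not None and len(summary) > constraints.max_actions:
        return False
    return True


def total_actions(plans: List[Plan]) -> int:
    """Size of the plan set, measured here as the total number of actions."""
    return sum(len(plan) for plan in plans)


plans = [
    ["toast bread", "add cheese", "grill sandwich"],
    ["butter bread", "add cheese", "add tomato", "grill sandwich"],
]
summary = ["add cheese", "grill sandwich"]
limits = Constraints(max_words=10, max_actions=3)
print(satisfies(summary, limits), len(summary) < total_actions(plans))
```

Keeping the constraints in a separate object lets the same summarizer be evaluated under different budgets without changing its interface.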

Several challenges arise in the planning task summarization process. Different plans might take varied approaches to achieve the same goal, making it challenging to create a summary that captures the essential steps without losing critical diversity. Additionally, the summary must strictly adhere to the provided constraints, ensuring it remains concise and relevant. Another significant challenge is the selection of actions from the original plans to include in the summary. The goal is to ensure that the summary is representative of the original plans and efficient in the number of actions.

4 PLANTS Dataset

In this section, we introduce the PLANTS dataset, specifically designed for planning task summarization. The dataset encompasses three distinct planning-like tasks: automated plans, recipes, and travel routes. For each task, we have curated 10 different problems/goals. Each goal has 5 different plans for automated plans and recipes, and 3 different plans for travel routes, resulting in a total of 130 diverse plans in the dataset (see Figure 2).

Figure 2: Distribution of problems and plans across domains. Left: shows the number of problems per domain, with each domain having 10 problems. Right: displays the average number of plans per problem for each domain.

Automated Plans: For generating automated plans, we utilized five classical planning domains from the downward-benchmarks (Basel, 2024): blocks, driverlog, mprime, openstacks-strips, and queen-split. These domains are released as part of the International Planning Competition (IPC) (ICAPS, 2022). The downward-benchmarks repository includes both the domains and their corresponding problems, where the goals are defined. We selected two distinct problems (i.e., goals) from each planning domain, resulting in a total of ten unique goals. Each problem was solved using SymK (Speck et al., 2020), a state-of-the-art classical optimal and top-k planner based on symbolic search that extends Fast Downward (Helmert, 2006). We set k to 5, generating five different plans for each problem. This approach ensures that our dataset contains a variety of viable solutions for each planning problem, providing a robust basis for summarization.

Recipes: For the recipes, we manually selected ten distinct and commonly made dishes, such as cheese sandwich, guacamole, and omelette, from the Recipe1M+ dataset (Marín et al., 2021). Recipe1M+ is a large-scale dataset containing over one million recipes with associated images and instructions. Assumption: To ensure diversity in preparation methods, we assume that distinct ingredient lists will result in different preparation steps. Based on this assumption, we extracted five different recipes for each dish by calculating the Jaccard similarity between the ingredient lists and selecting recipes with low similarity scores. This method ensures that the chosen recipes have varied ingredients, leading to diverse preparation steps. Specifically, we only extracted the ingredients and step-by-step instructions for each recipe. This manual selection and extraction process ensures that our dataset includes multiple viable approaches to achieve the same culinary goal, providing a robust basis for summarization.
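
The selection of low-similarity recipes can be sketched as follows. This is an illustrative Python example, not the released generator: the Jaccard formula is standard, but the greedy strategy in `select_diverse`, the function names, and the toy ingredient sets are assumptions, since the exact selection procedure and similarity threshold are not specified here.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; taken as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_diverse(recipes: Dict[str, Set[str]], k: int) -> List[str]:
    """Greedy selection (an assumed strategy): seed with the least similar
    pair, then repeatedly add the recipe whose highest similarity to the
    already chosen recipes is lowest."""
    names = sorted(recipes)
    if k >= len(names):
        return names
    seed = min(
        combinations(names, 2),
        key=lambda pair: jaccard(recipes[pair[0]], recipes[pair[1]]),
    )
    chosen = list(seed)
    while len(chosen) < k:
        rest = [n for n in names if n not in chosen]
        best = min(
            rest,
            key=lambda n: max(jaccard(recipes[n], recipes[c]) for c in chosen),
        )
        chosen.append(best)
    return chosen[:k]


# Toy ingredient lists for one dish (hypothetical data).
recipes = {
    "r1": {"bread", "cheddar", "butter"},
    "r2": {"bread", "cheddar", "butter", "tomato"},
    "r3": {"bread", "mozzarella", "pesto"},
    "r4": {"tortilla", "cheddar", "salsa"},
}
print(select_diverse(recipes, 3))
```

In this toy example the near-duplicate pair `r1` and `r2` is never selected together, which is the behavior the low-similarity criterion is meant to produce.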

Travel Routes: For the travel routes, we manually selected ten different pairs of start and destination coordinates to ensure a diverse set of route planning problems. The coordinates were chosen to cover a variety of urban layouts, providing a comprehensive testbed for summarization. We utilized the OpenStreetMap (OSM) API (Haklay and Weber, 2008), a collaborative mapping project that provides free geographic data and mapping services, to generate routes between these coordinates. The OSM API allows for the extraction of detailed route information, including road networks and step-by-step directions. For each pair of coordinates, the API generates at most three distinct routes, ensuring that the routes are unique by default. We extracted the step-by-step directions for each route, including the sequence of roads and waypoints. This approach ensures that our dataset captures a variety of viable travel options for each route planning problem.

5 Experimental Settings

In this section, we describe the different models used for plan summary generation and also discuss the user study settings. The constraints applied to these models and the prompt templates used for GPT-4o are detailed in Supplementary Material (Section 3).

5.1 Models

For each task, we use GPT-4o as the representative LLM and abstractive technique for obtaining plan summaries. For extractive summarization, we use TextRank. Additionally, we developed a new frequency-based baseline method for extractive plan summarization. Each approach receives as input a set of plans to generate a summary. For automated plans and recipes, each set contains 5 plans, and for travel routes, each set contains 3 plans.

Algorithm 1 Baseline: Plan Summary Generation
Input: List of plans data, each plan is a list of actions
Output: Summary of the planning task problem
Function parse_data(data)
   Initialize parsed_data as an empty list
   for each plan in data do
      Parse actions in the plan and add to parsed_data
   end for
   return parsed_data
Function ngrams(lst, n)
   Generate and return n-grams from list lst
Function analyze_text_view(parsed_data, ngram_size)
   Initialize all_items as an empty list
   for each action in parsed_data do
      Add action to all_items
   end for
   Count and filter items and n-grams in all_items
   return Filtered items and n-grams
Function analyze_plan_view(parsed_data)
   Extract and count actions and secondary mentions in parsed_data
   Find the shortest plan and most common action sequence
   return Analysis of plan view
Function generate_summary(text_view, plan_view)
   Summarize common actions, secondary mentions, shortest plan, and common action sequences
   return Summary

Algorithm 1 outlines our baseline method, which involves parsing the plans to extract actions and creating a structured representation of the data. This structured data is then analyzed in two views: text view and plan view. The text view analysis identifies common items and n-grams by counting the frequency of individual actions and sequences of actions. The plan view analysis examines the structure and sequence of actions, identifying the most common actions, secondary mentions (such as objects or ingredients), the shortest plan, and the most common action sequences. The results from these analyses are combined to generate a plan summary.
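
The two-view analysis can be illustrated with a short Python sketch. It is a simplified reading of Algorithm 1 under stated assumptions, not the released implementation: actions are assumed to be already parsed into whitespace-separated strings, the first token of an action is treated as the verb and the remaining tokens as secondary mentions, and the `min_count` and `top` parameters are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Plan = List[str]  # each action is a string such as "stack a b"


def ngrams(items: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a list."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def analyze_text_view(plans: List[Plan], ngram_size: int = 2,
                      min_count: int = 2) -> Dict[str, Counter]:
    """Count actions and action n-grams, keeping those seen >= min_count times."""
    items = Counter(action for plan in plans for action in plan)
    grams = Counter(g for plan in plans for g in ngrams(plan, ngram_size))
    return {
        "items": Counter({k: c for k, c in items.items() if c >= min_count}),
        "ngrams": Counter({k: c for k, c in grams.items() if c >= min_count}),
    }


def analyze_plan_view(plans: List[Plan]) -> Dict[str, object]:
    """Count verbs (first token) and secondary mentions (remaining tokens),
    and record the shortest plan."""
    verbs: Counter = Counter()
    mentions: Counter = Counter()
    for plan in plans:
        for action in plan:
            tokens = action.split()
            if tokens:
                verbs[tokens[0]] += 1
                mentions.update(tokens[1:])
    return {"verbs": verbs, "mentions": mentions,
            "shortest_plan": min(plans, key=len)}


def generate_summary(text_view, plan_view, top: int = 3) -> str:
    """Assemble a short textual summary from both views."""
    lines = [
        "Common actions: "
        + ", ".join(a for a, _ in text_view["items"].most_common(top)),
        "Secondary mentions: "
        + ", ".join(m for m, _ in plan_view["mentions"].most_common(top)),
        "Shortest plan: " + " -> ".join(plan_view["shortest_plan"]),
    ]
    common_seq = text_view["ngrams"].most_common(1)
    if common_seq:
        lines.append("Most common sequence: " + " -> ".join(common_seq[0][0]))
    return "\n".join(lines)


# Toy blocks-world style plans (hypothetical data).
plans = [
    ["unstack b a", "put-down b", "pick-up a", "stack a b"],
    ["unstack b a", "put-down b", "pick-up a", "stack a b", "pick-up c"],
    ["unstack b a", "stack b c", "pick-up a", "stack a b"],
]
text_view = analyze_text_view(plans)
plan_view = analyze_plan_view(plans)
print(generate_summary(text_view, plan_view))
```

Because every element of the output is copied from the input plans, a summary built this way is extractive and cannot introduce actions that do not occur in any plan.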

5.2 User Study

To assess the ease of understanding, clarity for action, and overall preference for the summaries, we conducted a human evaluation involving ten annotators. The annotators were students (undergraduate and graduate) and faculty staff, all with an understanding of the three PL tasks: automated plans, recipes, and travel routes. For each PL task, we provided the annotators with the actual plans and presented them with summaries generated by three different methods: GPT-4o (abstractive), TextRank (extractive), and our frequency-based baseline method (extractive). To ensure the reliability of our results, we calculated the overall inter-annotator agreement using Cohen’s kappa coefficient (Cohen, 1968). We found that the agreement among annotators was acceptable, with a coefficient of 0.72.
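
For readers unfamiliar with the agreement statistic, the sketch below computes unweighted Cohen's kappa for two annotators and averages it over all annotator pairs. Two caveats apply: the cited work (Cohen, 1968) describes the weighted variant, and how agreement was aggregated across ten annotators is not stated here, so the pairwise averaging in `mean_pairwise_kappa` is an assumption. The ratings are toy values, not study data.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("rating sequences must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from the two annotators' marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(ratings: List[Sequence[int]]) -> float:
    """Average kappa over all annotator pairs (an assumed aggregation)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(x, y) for x, y in pairs) / len(pairs)


# Toy ratings from three annotators over six items (hypothetical data).
ratings = [
    [1, 2, 3, 1, 2, 3],
    [1, 2, 3, 1, 2, 2],
    [1, 2, 3, 1, 2, 3],
]
print(round(mean_pairwise_kappa(ratings), 3))
```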

6 Experimental Results

Experiment 1: Comparing the number of tokens across the summaries

Figure 3 shows the boxplot comparing the token counts across three summarization methods: baseline, TextRank, and GPT-4o. The median token count for baseline is around 53, indicating consistent summary lengths with minimal variability. TextRank exhibits significant variability, with a median token count lower than baseline, reflecting diverse summary lengths. GPT-4o displays the highest median token count at approximately 176.5, indicating longer and more detailed summaries, with a wider interquartile range. This analysis highlights the differences in summary lengths, providing insights into the summarization characteristics of each method.

Figure 3: Comparison of token counts across different summarization approaches.

Experiment 2: Comparing the information-richness of the summaries

In this experiment, we measure the lexical density of summaries generated by baseline, TextRank, and GPT-4o to evaluate their information richness. Lexical density is calculated as the proportion of content words—nouns, verbs, adjectives, and adverbs—to the total number of words in a summary. Figure 4 shows the lexical density of the three summary methods across 30 planning summarization tasks in the benchmark dataset. GPT-4o consistently achieves the highest lexical density, indicating it produces the most information-rich summaries. The baseline demonstrates moderate lexical density, followed by TextRank, which exhibits the lowest and most variable lexical density.
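
The ratio itself is simple to compute once content words are identified. The sketch below is a rough, self-contained approximation: instead of a part-of-speech tagger (the tagger used is not specified here), it treats every token that is not in a small hand-written list of function words as a content word. The list `FUNCTION_WORDS` and the tokenization pattern are assumptions, so the resulting values will differ from tagger-based figures.

```python
import re

# Small illustrative list of English function words (an assumption, not exhaustive).
FUNCTION_WORDS = frozenset({
    "a", "an", "the", "and", "or", "but", "if", "then", "of", "to", "in",
    "on", "at", "by", "for", "with", "from", "into", "onto", "as", "is",
    "are", "was", "were", "be", "been", "it", "its", "this", "that",
    "these", "those", "you", "your", "we", "they", "he", "she", "not",
})


def lexical_density(text: str) -> float:
    """Approximate lexical density: content tokens / all tokens."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)


print(lexical_density("Toast the bread and add cheese"))
```

A tagger-based implementation would replace the membership test with a check that the token's part of speech is a noun, verb, adjective, or adverb.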

Figure 4: Comparison of lexical density across different summarization approaches to understand their information-richness.

Experiment 3: Comparing the ease of understanding of the summaries

From the user studies, we obtained results on how easy it is to understand a summary to take an action. Each summary was rated on a scale from 1 to 5, with 1 being very difficult to understand and 5 being very easy to understand. The average ease of understanding scores are presented in Table 2. GPT-4o received the highest ease of understanding scores across the three PL tasks. For automated plans, the baseline approach ranked second, while TextRank was rated second for recipes and travel routes.

Table 2: Ease of understanding scores for the summaries across three different planning tasks.
PL Task | Baseline | TextRank | GPT-4o
Automated Plans | 3.16 | 2.39 | 4.09
Recipes | 2.77 | 3.41 | 4.68
Travel Routes | 2.70 | 3.45 | 3.99

Experiment 4: User preference of the summaries

The user study was also used to rank the summaries based on preferences. The aggregate preferences for each summary choice were then analyzed. For automated plans, GPT-4o was the first preference for 76% of users, followed by the baseline approach as the second preference for 44%, and TextRank as the third preference for 59%, as shown in Table 3. GPT-4o received the first preference across all three planning tasks, with TextRank and the baseline approach varying in their ranking depending on the specific task.

Table 3: Order preference percentages for each summary across different PL tasks.
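PL Task | Summary | 1st Preference | 2nd Preference | 3rd Preference
Automated Plans | Baseline | 15% | 44% | 41%
Automated Plans | TextRank | 9% | 32% | 59%
Automated Plans | GPT-4o | 76% | 24% | 0%
Recipes | Baseline | 10% | 20% | 70%
Recipes | TextRank | 7% | 67% | 26%
Recipes | GPT-4o | 83% | 13% | 4%
Travel Routes | Baseline | 15% | 13% | 72%
Travel Routes | TextRank | 34% | 46% | 20%
Travel Routes | GPT-4o | 51% | 41% | 8%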

7 Conclusion

In this work, we introduced the novel problem of planning task summarization. To address this problem, we developed the PLANTS dataset, encompassing three distinct PL tasks: automated plans, recipes, and travel routes. Alongside the dataset, we also presented a frequency-based baseline method for plan summarization. We evaluated both abstractive and extractive summarization methods for planning task summarization through user studies and empirical analysis. Our findings indicate that while GPT-4o is the preferred approach for generating plan summaries due to its detailed and information-rich outputs, further evaluation is needed to verify if these summaries maintain the executional semantics of PL tasks. The issue of hallucination in abstractive methods remains a significant challenge that warrants further investigation. Additionally, there is a need to develop evaluation metrics specifically tailored for PL task summaries to ensure their effectiveness and reliability.

We believe this work represents an initial effort towards advancing research in planning task summarization. The broader impact of this research could influence various domains, including robotics, dialog agents, and planning agents. We hope our contributions will inspire further advancements and exploration in this field, ultimately leading to more robust and efficient summarization techniques, datasets, and evaluation metrics for the problem of planning task summarization.

8 Limitations

Size of the Dataset: While the PLANTS dataset provides a valuable starting point for planning task summarization, it includes only 10 problems per domain, with 5 plans each for automated plans and recipes, and 3 plans each for travel routes. This limited size may not fully capture the variability and complexity of real-world planning tasks. Additionally, the dataset does not include gold summaries, as it is challenging to obtain authoritative summaries for PL tasks due to their inherent variability and subjective nature. However, to facilitate future research, we release the generators used to create this dataset, allowing for the development of larger and more diverse datasets across these domains.

Evaluation Metrics: The evaluation metrics employed in this study, such as human preference and ease of understanding, are inherently subjective and may not fully reflect the executional semantics of the plans.

Inter-Annotator Agreement: Although we measured inter-annotator agreement using Cohen’s kappa and found it to be acceptable, the subjective nature of human evaluation introduces potential variability in judgments. Future work could explore more rigorous training for annotators.

9 Ethics Statement

The development and evaluation of the PLANTS dataset were conducted with strict adherence to ethical standards. All data were sourced from publicly available repositories, ensuring compliance with usage terms and privacy regulations. Human evaluators, consisting of graduate students and professors with domain expertise, participated voluntarily and provided informed consent. Their responses were anonymized to maintain privacy. The dataset and evaluation methods were designed to minimize bias and ensure accuracy. We release the dataset generators for research purposes, encouraging responsible use in compliance with ethical guidelines. This work aims to benefit multiple domains, including robotics and planning agents, and we advocate for the responsible deployment of summarization technologies to avoid potential harm.

Acknowledgements

We would like to thank Amitava Das for discussions related to textual summarization and for helping us build parallels to planning task summarization.

References

  • Basel [2024] AI Basel. downward-benchmarks: A collection of planning benchmarks for the fast downward planner, 2024. URL https://github.com/aibasel/downward-benchmarks. Accessed: 2024-06-02.
  • Cohen [1968] Jacob Cohen. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin, 70(4):213, 1968.
  • El-Kassas et al. [2021] Wafaa S El-Kassas, Cherif R Salama, Ahmed A Rafea, and Hoda K Mohamed. Automatic text summarization: A comprehensive survey. Expert systems with applications, 165:113679, 2021.
  • Fabbri et al. [2019] Alexander R Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R Radev. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. arXiv preprint arXiv:1906.01749, 2019.
  • Ghallab et al. [2004] Malik Ghallab, Dana Nau, and Paolo Traverso. Automated Planning: theory and practice. Elsevier, 2004.
  • Goldstein et al. [2000] Jade Goldstein, Vibhu O Mittal, Jaime G Carbonell, and Mark Kantrowitz. Multi-document summarization by sentence extraction. In NAACL-ANLP 2000 workshop: automatic summarization, 2000.
  • Gupta and Gupta [2019] Som Gupta and Sanjai Kumar Gupta. Abstractive summarization: An overview of the state of the art. Expert Systems with Applications, 121:49–65, 2019.
  • Gupta and Lehal [2010] Vishal Gupta and Gurpreet Singh Lehal. A survey of text summarization extractive techniques. Journal of emerging technologies in web intelligence, 2(3):258–268, 2010.
  • Haklay and Weber [2008] Mordechai Haklay and Patrick Weber. Openstreetmap: User-generated street maps. IEEE Pervasive computing, 7(4):12–18, 2008.
  • Helmert [2006] Malte Helmert. The fast downward planning system. Journal of Artificial Intelligence Research, 26:191–246, 2006.
  • ICAPS [2022] ICAPS. International planning competitions at international conference on automated planning and scheduling (icaps). In https://www.icaps-conference.org/competitions/, 2022.
  • Luhn [1958] Hans Peter Luhn. The automatic creation of literature abstracts. IBM Journal of research and development, 2(2):159–165, 1958.
  • Marín et al. [2021] Javier Marín, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber, and Antonio Torralba. Recipe1m+: A dataset for learning cross-modal embeddings for cooking recipes and food images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):187–203, 2021.
  • Pu et al. [2023] Xiao Pu, Mingqi Gao, and Xiaojun Wan. Summarization is (almost) dead. arXiv preprint arXiv:2309.09558, 2023.
  • Speck et al. [2020] David Speck, Robert Mattmüller, and Bernhard Nebel. Symbolic top-k planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9967–9974, 2020.
  • Srivastava [2010] Biplav Srivastava. Summarizing processes. Technical report, IBM Research Report RI100008, IBM Research-India, 2010. Available online …, 2010.
  • Srivastava and Pallagani [2024] Biplav Srivastava and Vishal Pallagani. The case for developing a foundation model for planning-like tasks from scratch. arXiv preprint arXiv:2404.04540, 2024.
  • Zhu et al. [2021] Chenguang Zhu, Yang Liu, Jie Mei, and Michael Zeng. Mediasum: A large-scale media interview dataset for dialogue summarization. arXiv preprint arXiv:2103.06410, 2021.

Checklist

  1. For all authors…
    (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]
    (b) Did you describe the limitations of your work? [Yes]
    (c) Did you discuss any potential negative societal impacts of your work? [Yes]
    (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

  2. If you are including theoretical results…
    (a) Did you state the full set of assumptions of all theoretical results? [N/A]
    (b) Did you include complete proofs of all theoretical results? [N/A]

  3. If you ran experiments (e.g. for benchmarks)…
    (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
    (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
    (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
    (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
    (a) If your work uses existing assets, did you cite the creators? [Yes]
    (b) Did you mention the license of the assets? [Yes]
    (c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
    (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [Yes]
    (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes]

  5. If you used crowdsourcing or conducted research with human subjects…
    (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes]
    (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [Yes]
    (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]