IDAT: A Multi-Modal Dataset and Toolkit for Building and Evaluating Interactive Task-Solving Agents

Shrestha Mohanty1, Negar Arabzadeh2, Andrea Tupini3, Yuxuan Sun4,
Alexey Skrynnik5, Artem Zholus6, Marc-Alexandre Côté3, Julia Kiseleva3
Massachusetts Institute of Technology1, University of Waterloo2,
Meta AI4, AIRI5, École Polytechnique de Montréal6, Microsoft Research3
Abstract

Seamless interaction between AI agents and humans using natural language remains a key goal in AI research. This paper addresses the challenges of developing interactive agents capable of understanding and executing grounded natural language instructions through the IGLU competition at NeurIPS. Despite advancements, challenges such as a scarcity of appropriate datasets and the need for effective evaluation platforms persist. We introduce a scalable data collection tool for gathering interactive grounded language instructions within a Minecraft-like environment, resulting in a Multi-Modal dataset with around 9,000 utterances and over 1,000 clarification questions. Additionally, we present a Human-in-the-Loop interactive evaluation platform for qualitative analysis and comparison of agent performance through multi-turn communication with human annotators. We offer to the community these assets referred to as IDAT (IGLU Dataset And Toolkit) which aim to advance the development of intelligent, interactive AI agents and provide essential resources for further research.

1 Introduction

One of the enduring goals of artificially intelligent (AI) agents [83] is to seamlessly interact with humans using natural language. This capability allows AI agents to learn new skills [55, 90, 79] or assist in solving tasks [68, 34, 44]. To achieve this, AI agents must be able to comprehend [50, 49] and respond to human language, executing instructions across various environments [69]. Over the years, researchers have developed numerous tasks to address this challenge, often focusing on scenarios where humans provide instructions to achieve specific goals [26, 68]. For example, in the blocks world task, the agent must understand human instructions to move blocks on a grid [83, 11]. Other setups use Minecraft [27, 24] for tasks such as moving objects [1], simulating human behavior [62], or performing household tasks [68, 82]. However, human instructions are often inherently ambiguous. To complete these tasks successfully, agents need to engage in conversations by asking clarifying questions [4, 67, 64], thereby creating a more user-friendly interface [56].

To advance and emphasize this objective of interaction-driven agent building, we organized the Interactive Grounded Language Understanding (IGLU) competition at NeurIPS in 2021 [36] and 2022 [37]. The primary aim of this competition was to foster the development of interactive agents capable of comprehending and executing grounded natural language instructions, particularly emphasizing the nuances of natural language dialogues and clarifications. The overarching goal of IGLU is to equip researchers with the data, tools, and insights necessary to evaluate the efficacy of interactive multi-turn communication with humans. The first significant challenge hindering the exploration of building interactive agents is the scarcity of appropriate datasets. Moreover, the data collection process is time-consuming and difficult to set up, requiring scalable, flexible, and easily extendable data collection tools. Another crucial requirement is an effective evaluation process and platform. Given the nature of the problem under consideration, an interactive and open evaluation platform is needed. This interactive “human-in-the-loop” evaluation is necessary because automatic metrics such as accuracy do not thoroughly explain the performance of agents and may not correlate well with human preferences for answers [7, 8]. Our interactive evaluation tools provide a critical supplement to automatic evaluation metrics, providing deeper qualitative insights and ensuring the robustness and validity of the evaluation process. Such an evaluation platform also addresses concerns around data leakage from benchmark datasets into training data, as highlighted in some recent studies [9]. Finally, after running this competition for two years, the task’s complexity is evident from the low scores in both offline and human evaluations of the agents. This emphasizes the need to release the dataset and tools to enable further research in this direction.

IDAT (IGLU Dataset And Toolkit) aims to address these challenges by making the following contributions:

  1. C1

    Data Collection Tool: A scalable tool designed for efficiently gathering interactive grounded language instructions and clarifying questions within a Minecraft-like, voxel world environment that can be run in a web browser, making it accessible to a large number of annotators in a crowdsourcing platform (Sec. 3). This tool also offers a high degree of extensibility, enabling researchers to expand existing datasets and collect more data in a customized setting.

  2. C2

    Multi-Modal Dataset: Based on the building structures task in a 3D voxel world, the dataset includes around 9,000 natural language utterances, consisting of instructions given by annotators to build a structure followed by the corresponding world states, actions performed by the annotators, as well as images of the voxel world. Additionally, the datasets contain 1,182 clarification questions posed by builders when instructions are ambiguous (Sec. 4).

  3. C3

    Human-in-the-Loop Interactive Evaluation Platform: An interactive platform that facilitates human multi-turn communication with reinforcement learning (RL) agents by allowing annotators to compare the performance of multiple agents and providing additional qualitative analysis into their performance, thus leading to new insights into the interactive evaluation process. We released a dataset consisting of 45 pairs of comparison games (Sec. 5).

The corpus collected using our data collection tool was leveraged and deployed during the competition, with over 55 teams utilizing it. This adoption highlights the utility of the dataset and corresponding tools in enabling research on the development of intelligent interactive agents. All of the above resources are publicly available under the MIT license in our repositories: datasets (https://github.com/microsoft/iglu-datasets), data collection tool (https://github.com/iglu-contest/dataset-collection-and-evaluation), and human-in-the-loop evaluation platform (https://github.com/microsoft/greenlands). By sharing these resources with the community, our aim is to facilitate further advances in research and development, fostering the creation of more capable and interactive AI agents in a transparent manner.

2 Interactive Grounded Language Understanding (IGLU) Setup

Figure 1: Interactive Grounded Language Understanding (IGLU) Setup

The IGLU competitions in 2021 [36] and 2022 [37] address the challenge of developing interactive agents capable of learning to solve building tasks through grounded natural language instructions in a collaborative environment. An interactive agent is defined as one that accurately follows instructions, requests clarification when necessary, and swiftly adapts to newly acquired skills.

To approximate this scenario and simplify the study to obtain easily interpretable findings, allowing us to understand general principles, we propose the following simulated setup: The architect and builder communicate via a chat interface in a 3D environment. The architect provides the builder with grounded instructions on constructing the target structure. The builder may either seek clarification if the instructions are incomplete or ambiguous or proceed to execute the instructions. The architect has the capability to observe the builder’s actions.

To broaden participation, the competition included two tasks: (a) an Interaction Focused Task, and (b) an Agent Building Task. This task setup, inspired by a collaborative building task by Narayan-Chen et al., involves interactions between architects and builders to build a structure (Fig. 2(a)).

Interaction Focused Task Inspired by previous works on agents seeking clarification [4, 19], we split it into the following research questions:

  1. RQ1

    When to ask a clarifying question?
    Given an instruction from the architect, a model needs to predict whether the instruction is sufficient to complete the described task or if further clarification is needed.

  2. RQ2

    What clarifying question to ask?
    If the given instruction from the architect is ambiguous, a clarifying question should be raised.

Agent Building Task This task involved building agents that take the instructions and use them to navigate and place colored blocks within the building area from a first-person perspective. The RL agent receives a score reflecting the degree of completeness of the constructed structure compared to the ground truth target structure.

3 Data Collection Tool

We developed a scalable open-source data collection tool (https://github.com/iglu-contest/dataset-collection-and-evaluation) to facilitate the collection of multi-modal corpora (Sec. 4) for the collaborative building task [55, 31] using the setup described in Sec. 2. Unlike the data collection environment established by [55], which utilizes the Malmo platform and requires a Minecraft game server [32], our tool is entirely developed in JavaScript. This approach eliminates the need to set up a Minecraft game server, significantly simplifying the process. Additionally, our tool is highly scalable, allowing for efficient expansion and integration with crowdsourcing platforms such as Amazon MTurk, so it can readily be used to collect more data.

Voxel World Environment We harnessed a Minecraft-like game environment called CraftAssist voxel world [27, 73] for our data collection tool, which provides an immersive platform for agents to learn from language instructions and engage in fundamental navigation and building tasks, driven by its unique physics characteristics and its 3D world representation. In the CraftAssist voxel grid world, agents perform building actions within an 11 × 11 × 9 sized build region [55] that can be recorded as action states and retrieved for future sessions. The integrated CraftAssist library supports actions such as picking, placing, and removing blocks of different colors within the voxel world. Additionally, agents can jump to place blocks, enabling the creation of structures with varying complexity. This approach ensures scalability and facilitates extensive experimentation and development within the platform. Fig. 2(b) gives the visualization of the voxel world environment in our platform.

To reduce user friction in giving and comprehending instructions, we embedded a compass on the ground of the voxel world to aid users in understanding spatial orientations. Then in the architect task, we ask the architect to explicitly specify the view of the current structure on which the instruction is based from one of the five orientations: northward, southward, eastward, westward, or from top. Later in the builder task, we put the builder in the same orientation before providing the instructions from the architect. In this way, the architect and the builder are able to establish a shared understanding of the spatial attributes of the target structure asynchronously over multiple turns. For each task, we record the following information: gameId, stepId, and avatarInfo. avatarInfo contains the agent’s spatial coordinates (x, y, z) and its corresponding pitch and yaw angles. Additionally, for the builder agent, we record a tape of the agent’s actions (movement, block placement) along with the discrete world state changes. We also record the architect’s instructions and the builder’s clarification questions.
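To make the recorded fields concrete, the following is a minimal sketch of how one recorded step could be represented and serialized. Only gameId, stepId, and avatarInfo (with x, y, z, pitch, yaw) come from the description above; the class names, the remaining field names, and the JSON layout are illustrative assumptions and do not reflect the tool's actual storage format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class AvatarInfo:
    # Spatial coordinates and view angles, as described in the text.
    x: float
    y: float
    z: float
    pitch: float
    yaw: float


@dataclass
class StepRecord:
    gameId: str
    stepId: int
    avatarInfo: AvatarInfo
    # The fields below are hypothetical names chosen for illustration.
    instruction: Optional[str] = None           # architect's instruction
    clarifying_question: Optional[str] = None   # builder's question, if any
    actions: List[dict] = field(default_factory=list)  # builder action tape


def to_json(record: StepRecord) -> str:
    """Serialize a step record (including the nested avatar info) to JSON."""
    return json.dumps(asdict(record))


example = StepRecord(
    gameId="game-001",
    stepId=3,
    avatarInfo=AvatarInfo(x=0.0, y=63.0, z=-2.0, pitch=15.0, yaw=90.0),
    instruction="Facing north, place two red blocks on top of the blue one.",
    actions=[{"type": "place", "pos": [1, 1, 0], "color": "red"}],
)
print(to_json(example))
```

Keeping the avatar pose alongside each step would allow the builder to be placed in the same orientation the architect used when writing the instruction.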

Data Collection Setup Our tool for collaborative building tasks is designed to be scalable and easily deployable to collect large datasets efficiently. It facilitates the collection of multi-modal collaborative building tasks, seamlessly integrating with crowd-sourcing platforms for efficient participant scaling. Furthermore, we enhance the data collection process by introducing asynchronous turn-taking. This means the tool no longer relies on having the same set of annotators online throughout the game. We have implemented checks to prevent a single annotator from taking on both architect and builder roles for the same structure. Importantly, this asynchronous approach allows for the simultaneous launch of multiple structures. Annotators can work on different structures concurrently without waiting for responses, saving time and making the process scalable. To facilitate clear instruction following for annotators, we utilize cardinal directions like North, South, East, and West within the voxel world.

Figure 2: (a) The architecture of the data collection tool. (b) The IGLU dataset collection pipeline.

We obtained approval from the Institutional Review Board (IRB) to conduct the study. We used Amazon Mechanical Turk (MTurk) as the crowd-sourcing platform to get annotations from 230 unique annotators who provided consent to be part of the study and were paid $15 per hour. We did not collect any personally identifiable information as part of the study. Each annotator submits a task referred to as a HIT (Human Intelligence Task). A HIT consists of the CraftAssist voxel world [73] described in Sec. 3 along with a HIT survey. The HIT survey is customizable for different tasks and includes rules for a given task and a form where instructions can be submitted or clarifying questions asked for the building task. Finally, the data is stored in two kinds of data stores for ease of access: Tables are used to save game ids, instructions, and clarifying questions, while the Object Store is used for storing files with game world states and collected actions. Although this data collection tool is currently used for the multi-turn interactions setup we described, it can be easily customized to support other general setups to collect interaction dialogs from human annotators, actions, and world states to solve building tasks.

4 IDAT Dataset

The IDAT dataset is a comprehensive multi-modal dataset that includes instruction utterances, voxel world states at each action, and the corresponding images. Following the previously described methodology, we provide a two-part dataset: a seed dataset and the IGLU dataset. The datasets and accompanying code for analysis and visualization are publicly available at https://github.com/microsoft/iglu-datasets.

4.1 Seed Dataset

The seed dataset comprises multi-turn dialog sequences aimed at collaboratively building a target structure. A complete session of dialogues to achieve the target structure is referred to as a game as shown in Fig. 3. In each turn, an annotator assumes the role of either the architect or the builder. Architects are randomly assigned a target structure from a diverse set of structures. They provide the next step instruction for the Builder. The Builder starts from scratch at the beginning of a game or builds on intermediate results by executing the Architect’s instructions. If the instruction is unclear, the Builder can pose a clarifying question.

Tab. 1 shows the summary of the Seed dataset. 31 target structures are presented to the annotators to build. We process and clean the data by filtering out missing and low-quality submissions such as very short instructions having fewer than five words. Finally, we have 127 completed game sessions, with the median duration of a game being around 16 minutes and the average number of turns taken to complete a game being 14. A game session is considered complete when the Builder completes a given target structure after interacting with and following instructions provided by the Architect. This is denoted by the Architect marking the structure as “complete”. Across all the games, we have 811 utterances or dialog interactions between the Architect and Builder annotators. The average length of instructions provided by the Architects is around 19 words, and the number of clarifying questions asked by the Builders is 126. On average, 2 clarifying questions are asked per game.

The target structures have been designed to ensure a variety of building types with varying levels of difficulty. To provide a deeper understanding of the target structures in our multi-turn dataset, we performed manual labeling on the 31 structures. The types of structures and their corresponding number of structures (in brackets) in the dataset are as follows: 1. flat [7]: all blocks on the ground; 2. flying [27]: there are blocks that cannot be fully added without removing some other blocks; 3. diagonal [6]: some blocks are adjacent diagonally; 4. tricky [6]: some blocks are hidden or they should be placed in a specific order; 5. tall [25]: a structure cannot be built without the agent being high enough (the placement radius is 3 blocks). These labels are not mutually exclusive, so one structure can belong to multiple categories. We consider different categories of structures to ensure the agent uses various skills and abilities to complete the target structures. For instance, if all the structures are flat, the agent will not learn to use other actions, such as flying. This diversity is essential for training a robust and adaptable agent.

4.2 IGLU-Dataset

The multi-turn data collection process described in the previous section is fairly complex and tricky to scale. We simplify the process to be a single turn where all required attributes are captured in one shot. We first remove the complexity of building a predefined target structure. Instead, annotators are asked to perform some free-form building actions within the voxel world, while providing instructions that should allow another annotator to rebuild the same structure. These single-turn task segments enable asynchronous collaboration between annotators. This process enables the data collection at a significantly faster pace, leading to a larger corpus comprising natural language instructions, corresponding actions performed based on those instructions, and a set of clarifying questions. We record and save actions performed by annotators in a key-value pair format that stores the movement of the agent and positional changes of blocks within the voxel world.

Target Structures                      31
Completed Games                        127
Median Duration of Completed Games     16 mins
Avg. Turns of Completed Games          14
No. Instructions                       811
Avg. Length of Instructions            19.32 words
No. Clarifying Questions               126
Avg. Clarifying Questions per Game     2
Table 1: Overview of Seed Dataset

Instructions                           Count (train/test)
Total                                  8136 (6843/1293)
Clear                                  7080 (5951/1129)
Ambiguous or Clarifying Questions      1056 (892/164)

Avg. Length (in words)
Instructions                           18.29
Clarifying Questions                   12.05
Table 2: Overview of the IGLU Dataset

We utilized the Seed dataset to provide diverse starting canvases for annotators as follows:

  • An annotator is assigned a world state from the Multi-Turn dataset as the starting point for their building task (Fig. 2(b): Ideation Stage).

  • The annotator is prompted to perform a sequence of actions for a duration of one minute.

  • Then, the annotator is required to describe their actions in the form of a natural language instruction.

  • Another annotator is shown the instructions and asked to perform the steps mentioned. If the instruction is unclear, the annotator specifies it as thus and asks clarification questions (Fig. 2(b): Clarification Question Stage).

Tab. 2 presents a summary of the IGLU dataset, which consists of 8,136 pairs of actions and instructions. We clean the collected Single-Turn dataset by filtering out low-quality samples, e.g. those with very short instructions (< 5 words) or those coming from annotators who gave low-quality instructions (e.g. providing the same instruction repeatedly). In the final set, instructions consist of on average 18 words, indicating the instructions are descriptive enough for 1-minute building actions.
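The cleaning step described above can be sketched as follows. The five-word minimum comes from the text; the sample layout (dictionaries with "annotator" and "instruction" keys), the function name, and the max_repeats threshold used to flag annotators who repeat the same instruction are hypothetical choices made for illustration, not the procedure actually used to build the dataset.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def filter_samples(samples, min_words=5, max_repeats=2):
    """Drop short instructions and all samples from repetitive annotators.

    samples: list of dicts with 'annotator' and 'instruction' keys (assumed layout).
    min_words: minimum instruction length in words (from the text: 5).
    max_repeats: hypothetical threshold; an annotator who submits the same
        normalized instruction more than this many times is flagged.
    """
    counts = Counter((s["annotator"], normalize(s["instruction"])) for s in samples)
    flagged = {annotator for (annotator, _), n in counts.items() if n > max_repeats}
    return [
        s for s in samples
        if s["annotator"] not in flagged
        and len(s["instruction"].split()) >= min_words
    ]
```

Flagging whole annotators rather than single duplicates is one possible reading of "annotators who gave low-quality instructions"; a per-sample deduplication would be a gentler alternative.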

In the above process, if an annotator marks the provided instruction as ambiguous to execute, they are supposed to issue a clarifying question. Otherwise, the submission is filtered out with a warning provided to the annotator. This was to ensure that every instruction annotated as “not clear” is accompanied by at least one clarifying question. Out of 8,136 instructions, 1,056 (12.98%) were marked as Not Clear, thus being ambiguous, and 7,080 (87.02%) as Clear instructions. Hence, we have 1,056 clarifying questions, one for each ambiguous instruction. The average length of clarifying questions is around 12 words. Tab. 5 in the appendix exemplifies a few instructions marked as being unclear, along with clarifying questions issued by annotators.

The majority of clarifying questions fall into the following categories: 1. Color: Questions clarifying the color of the blocks to be used. 2. Direction/Orientation: Questions clarifying the direction and orientation in the world. 3. Number of blocks: Questions that clarify the number of blocks to be placed. 4. Identifying blocks to be changed: Questions clarifying which blocks need to be changed. For deeper insight, we reassessed the annotations for 100 randomly selected instructions to gauge the level of agreement among the annotators. The agreement rate among the three annotators for these 100 instructions falls within the range interpreted as “fair” according to Krippendorff’s agreement measure. This suggests that the interpretation of ambiguous instructions can be highly subjective, which further emphasizes the complexity of such a task. While one annotator may perceive an instruction as clear, another may find it ambiguous. Furthermore, different annotators may ask different clarifying questions about the same instruction, as they may identify unclear aspects from different perspectives.

The single-turn approach offers several advantages over the sequential nature of the multi-turn process of the seed dataset, one of which is the independence of each sample, allowing for easier utilization in different tasks. Each turn can be interpreted as a complete set of information, enabling flexibility in the data collection as well as its uses. This independence allows researchers to extract valuable insights and information from individual turns without considering the entire dialogue sequence. Moreover, the single-turn approach allows for collecting multiple clarifying questions for each instruction, augmenting the richness and diversity of the dataset and enabling a deeper understanding of the nuances and challenges in generating clarifying questions. Both the seed and IGLU datasets offer extensive potential for studying various research questions concerning grounded language interactive agents. These datasets can be further expanded using our data collection tool.

5 IGLU Evaluation

While our focus in this paper is not on the solutions or baselines presented during the competition, we note them to underscore the need for the evaluation protocol we employed during the competition. This includes the development of an online interactive human evaluation platform which is a major contribution of this work. This evaluation platform serves as a crucial supplement to offline evaluation metrics, ensuring the robustness and validity of the evaluation process of interactive agents and allowing for deeper qualitative insights.

5.1 Offline Evaluation

Interaction Focused Task Evaluation:

  1. RQ1

    When? It is evaluated as a binary classification problem: Does the provided instruction require a clarifying question? We use the macro average F1 score to evaluate classifiers based on instructions marked as unclear in the corpus, ensuring a balanced measure of both precision and recall across the two classes.

  2. RQ2

    What? It is evaluated based on the quality of selected clarifying questions for unclear cases.

We formulate the problem of ranking a pool of clarifying questions instead of generating the questions for several reasons. Generating clarifying questions in a collaborative environment is challenging, as shown in [36]. If clarifying questions already exist in a pool, finding the most appropriate ones becomes a more manageable task than generating them from scratch [4]. Additionally, the evaluation of classification and ranking tasks is much more well-established compared to generation tasks, as there may be multiple correct clarifying questions for any given scenario. Therefore, ranking a pool of clarifying questions allows for better evaluation and control over the output. We assess how well the model can rank a list of human-issued clarifying questions in the corpus for a given ambiguous instruction. The model’s effectiveness is measured using Mean Reciprocal Rank (MRR). The average F1 score of the top three participants for RQ1 is 0.76. For ranking clarifying questions, the top three teams achieved an average MRR of 0.58. These results indicate that significant room for improvement remains, highlighting the challenges associated with these tasks.
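For reference, the two offline metrics named above can be sketched in a few lines of plain Python. This is a generic illustration of macro-averaged F1 and Mean Reciprocal Rank, not the competition's scoring code; the function names and the input layout (label lists for RQ1, and for RQ2 one ranked list of question identifiers plus one gold identifier per instruction) are assumptions.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 of a single class, treating `cls` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 (1 = needs clarification, 0 = clear)."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the gold question; contributes 0 if it is absent."""
    total = 0.0
    for ranked, target in zip(rankings, gold):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(rankings)
```

Because macro averaging weights both classes equally, a classifier that always predicts "clear" is penalized even though roughly 87% of instructions are clear.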

Agent Building Task Evaluation To evaluate an RL agent, the evaluation system executes two episodes for each task using a held-out test set. Each task begins with a specific initial grid configuration and a designated target grid. The primary evaluation metric is the F1 score, computed as in Algorithm 1 (in appendix). This score is derived by comparing the predicted modifications (the differences between the initial world and the final snapshot of the building zone) to the ground truth, which includes the required blocks to be added or removed. Scores for each task are computed as a weighted average, with weights based on the total number of blocks that needed modification. Participants’ models are required to complete two runs per task across a total of 96 tasks, resulting in 192 episodes. All tasks must be completed within a 60-minute timeframe. (The system specifications for the machine running the submissions were as follows: 1 NVIDIA T4 GPU with 16 GB of memory, 8 vCPUs, 56 GB of RAM.)
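The idea of scoring predicted modifications against required modifications can be illustrated with the sketch below. It is one possible reading of the description above and is not Algorithm 1 from the appendix: the sparse grid representation (a dict mapping (x, y, z) to a block color id, with missing cells empty), the function names, and the choice to weight each task by its number of required modifications are all assumptions.

```python
def modifications(before, after):
    """Cells whose block differs between two worlds, mapped to the new block id.

    Worlds are dicts {(x, y, z): color_id}; a missing cell is empty (id 0),
    so a removed block shows up as a modification to 0.
    """
    cells = set(before) | set(after)
    return {c: after.get(c, 0) for c in cells if before.get(c, 0) != after.get(c, 0)}


def task_f1(initial, final, target):
    """F1 between the agent's modifications and the required modifications."""
    predicted = modifications(initial, final)
    required = modifications(initial, target)
    # A predicted change counts only if the same cell needed the same block.
    tp = sum(1 for cell, block in predicted.items() if required.get(cell) == block)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(required)
    return 2 * precision * recall / (precision + recall)


def weighted_score(tasks):
    """Average task F1, weighted by the number of blocks needing modification."""
    total, total_weight = 0.0, 0
    for initial, final, target in tasks:
        weight = len(modifications(initial, target))
        total += weight * task_f1(initial, final, target)
        total_weight += weight
    return total / total_weight if total_weight else 0.0
```

Note that a structure built correctly but shifted by one cell scores poorly under this cell-wise comparison, which is consistent with the later observation that the offline metric does not account for shifts or translations.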

5.2 Human-in-the-Loop Interactive Online Evaluation: Greenlands Platform

To facilitate the evaluation of the RL agents by human participants, we developed an interactive evaluation platform. Greenlands (https://github.com/microsoft/greenlands) hosts agents on a Minecraft server, enabling human evaluators, sourced from a crowdsourcing platform (Amazon MTurk), to interact with and assess the agents’ performance in real time. Our findings suggest that while current RL agents exhibit a degree of functionality, they fall short of human expectations in terms of interactivity and reliability. The technical design of the platform is provided in appendix F.

Agent   Total Games   Total Wins    Total Losses   Wins Against                      Losses Against
B       30            17 (56.67%)   13 (43.33%)    MHB: 7 (53.85%); P: 10 (58.82%)   MHB: 6 (46.15%); P: 7 (41.18%)
MHB     28            15 (53.57%)   13 (46.43%)    B: 6 (46.15%); P: 9 (60.00%)      B: 7 (53.85%); P: 6 (40.00%)
P       32            13 (40.62%)   19 (59.38%)    B: 7 (41.18%); MHB: 6 (40.00%)    MHB: 9 (60.00%); B: 10 (58.82%)
Table 3: Human evaluation results for top 3 performing agents.

Our evaluation focuses on the top agents from IGLU 2022 [37], Brain Agent (B) and MHB-Pegasus (P), and a baseline model developed by the IGLU team to serve as a control (MHB) [69], which achieved the following F1 scores in the offline evaluation: (1) B: 0.254, (2) P: 0.178, (3) MHB: 0.150.

Our human evaluation protocol involved participants playing two separate games of the interactive collaborative building task, each featuring a different agent in random order. After interacting with both agents, participants were asked to identify which agent they perceived as superior and to provide qualitative feedback on each agent’s behavior. This comparative approach mitigates the inherent subjectivity by focusing on the relative performance. Participants were blinded to the identity of the agents, anonymized as Agent 1 and Agent 2 (https://github.com/iglu-contest/dataset-collection-and-evaluation). To ensure a fair comparison, both games assigned to a participant within a single MTurk HIT involved the same task, with identical initial and target structures. These tasks were randomly selected from our test set.

5.2.1 Human Evaluation Results and Discussion

We recorded a total of 45 MTurk assignments. The human evaluations, summarized in Tab. 3, suggest a correlation between human preferences and offline evaluation scores, with Brain Agent generally preferred over MHB-Pegasus. However, the generalizability of these results may be limited. Examples of human feedback on the performance of each agent are provided in appendix F.2.
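The per-agent totals in Tab. 3 follow directly from the pairwise outcomes of the 45 comparisons, as the short sketch below illustrates. The pairwise counts are taken from Tab. 3; the dictionary layout and the function name are assumptions made for this example.

```python
# (winner, loser) -> number of comparisons won, taken from Tab. 3.
pairwise_wins = {
    ("B", "MHB"): 7, ("MHB", "B"): 6,
    ("B", "P"): 10, ("P", "B"): 7,
    ("MHB", "P"): 9, ("P", "MHB"): 6,
}


def summarize(pairwise):
    """Aggregate pairwise outcomes into per-agent games, wins, losses, win rate."""
    agents = sorted({agent for pair in pairwise for agent in pair})
    summary = {}
    for agent in agents:
        wins = sum(n for (winner, _), n in pairwise.items() if winner == agent)
        losses = sum(n for (_, loser), n in pairwise.items() if loser == agent)
        games = wins + losses
        summary[agent] = {
            "games": games,
            "wins": wins,
            "losses": losses,
            "win_rate": 100.0 * wins / games if games else 0.0,
        }
    return summary


for agent, stats in summarize(pairwise_wins).items():
    print(agent, stats["games"], stats["wins"], stats["losses"],
          f"{stats['win_rate']:.2f}%")
```

Each comparison contributes one game to both agents involved, which is why the per-agent game counts sum to 90 for 45 assignments.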

Upon reviewing the qualitative feedback, we consistently see that none of the agents met human expectations or completed the tasks. Through our analysis, we identified three predominant concerns across all agents, as reported by the participants: responsiveness to commands, precision in executing actions, and compliance with given instructions.

Aligning training scenarios with the complexities of the real world is a challenging problem for interactive agents. This difficulty is evident in both offline and online evaluations of the agents. Interestingly, despite the agents’ generally poor performance, there was a discernible alignment between human preferences and the outcomes of offline evaluations. This suggests that even in the presence of task completion deficits, the behavioral patterns exhibited by agents can significantly influence human perceptions of their capabilities.

However, the offline F1 score metric fails to reliably identify specific issues affecting the agent’s performance. Additionally, human evaluators tend to provide specific instructions, especially when correcting the agents’ actions, introducing a level of complexity that the metrics used in offline evaluations, which do not account for shifts or translations, fail to capture. These findings highlight the importance of integrating human evaluations into the development cycle of interactive agents. They also point to the need for an approach that considers not only an agent’s task performance but also its behavioral interactions, as both are integral to the human experience. This emphasizes the necessity of a dynamic evaluation environment and the definition of multi-dimensional utilities to gain a deeper understanding of agent systems, which cannot be fully captured through single offline metrics. Future studies should incorporate more granular response options to capture a comprehensive range of human feedback, such as allowing evaluators to express a neutral stance when they have no clear preference.

6 Related Work

Evolution of NLIs and Applications Early work in Natural Language Interfaces (NLIs) [84, 17, 29] laid the foundation for understanding and designing effective interfaces for human language communication with computers. In recent years, there has been a resurgence of interest in NLIs due to advances in language understanding capabilities driven by large-scale deep learning models [21, 47, 16, 2, 66, 12, 60, 15] and the increasing demand for various applications such as virtual assistants, dialog systems [40, 42, 13, 41, 43], and question answering systems [45, 46, 22, 91]. NLIs now extend beyond traditional databases to encompass knowledge bases [18, 10], robots [77], personal assistants [35, 34], and other forms of interaction [25, 20, 88, 71].

Agent Interactivity and Learning The focus has shifted towards interactivity and continuous learning [54, 33], enabling agents to interact with users [85], learn new tasks from instructions [39, 50, 75], assess their uncertainty [86], ask clarifying questions [3, 4, 5, 6], and leverage feedback from humans to correct mistakes [23, 58, 57, 51]. Currently, LLMs are also being studied for their ability to assess uncertainty and their own errors [64, 65]. Newer directions are studying ways of identifying possible multi-modal utility of agentic systems [7, 8, 63].

Grounded Language Understanding This paper focuses on grounded language understanding, that is, connecting natural language instructions with real-world or simulated environment context and taking corresponding actions [30, 52, 48]. This is crucial to enabling more effective communication between humans and intelligent agents. Our work focuses specifically on tackling grounded language understanding in the context of collaborative building tasks performed by agents [14, 53, 69].

Leveraging Minecraft We select Minecraft for grounded language understanding due to its distinct advantages. Szlam et al. [74] highlight the benefits of an open interactive assistant in Minecraft. The game’s 3D voxel grid world and adherence to simple physics rules provide ample research scenarios for reinforcement learning experimentation [30]. Minecraft’s interactive nature, player interactions, and dialog exchanges offer diverse opportunities for grounded natural language understanding [87, 70, 55]. The game’s immense popularity ensures enthusiastic player interaction, facilitating rich human-in-the-loop studies. A further advantage is the availability of a highly developed set of tools for logging agents’ interactions and deploying agents for human-in-the-loop evaluation, including Malmo [32], CraftAssist [27], TaskWorldMod [59], MC-Saar-Instruct [38], and IGLU GridWorld [92]. Among the Minecraft-based related works, MineDojo [24] is similar to IGLU in that both are designed to develop intelligent agents within the expansive Minecraft environment. While MineDojo aims to build versatile agents capable of performing diverse tasks through an internet-scale knowledge base, IGLU seeks to enhance interactive agents that can understand and act on grounded natural language instructions, with a strong emphasis on natural language dialogue and clarification. An extensive review and comparison of relevant platforms is provided in Tab. 4 of the appendix.

7 Conclusion

In conclusion, we introduce IDAT, comprising a dataset, tools, and an evaluation platform tailored for the development of interaction-driven agents. The dataset comprises approximately 9,000 instructions and over 1,000 clarifying questions, along with corresponding actions and grid world states for interactive building tasks in a Minecraft-like environment. The released data collection tool is scalable, supports our task setup, and can be seamlessly integrated with crowdsourcing platforms. This adaptable tool enables the collection of tailored data for specific use cases, and we recommend the collection of new test datasets to address data leakage issues. Moreover, the human-in-the-loop interactive evaluation platform we introduce provides a robust qualitative assessment of interactive agents. The efficacy of these resources was demonstrated through the NeurIPS IGLU competition, where interactive agents learned from natural language instructions. All resources, including the dataset, data collection tool, and evaluation platform, are publicly accessible to support future research endeavors.

The complexity of the task is highlighted by the low scores observed in both offline and human evaluations of the agents. The emergence of large language and multi-modal models such as GPT-4o and Gemini [76] offers a promising avenue for narrowing this gap, potentially equipping agents with the capability to interpret and respond to human communication in ways that more closely mirror natural human interactions. Future research should investigate the integration of these advanced models to bolster the agents’ adaptability and fluency in human-like dialogue, thereby enhancing the overall naturalness and effectiveness of these interactions.

8 Limitations

This work focuses on a single environment, Minecraft, which does not perfectly replicate real-world environments. It nonetheless serves as a valuable platform for training agents on fundamental tasks using natural language, which is particularly relevant given the current performance limitations observed in agent-building tasks. Some may find the scale of the dataset limiting. However, the developed data collection tool is designed to facilitate the efficient gathering of additional data, thereby addressing this limitation.

Acknowledgments and Disclosure of Funding

We would like to express our gratitude to the many individuals who made our work possible. Our amazing co-organizers of the competition (Milagro Teruel, Arthur Szlam, Mikhail Burtsev, Mohammad Aliannejadi, Ziming Li, Zoya Volovikova, Aleksandr Panov, and Kavya Srinet) provided invaluable assistance and contributions that were essential in building the evaluation platform, providing feedback on the data collection platform, and organizing the competition. We are grateful to Ahmed Awadallah for their guidance and support throughout this project, and to the AICrowd team for their support in hosting the competition. We extend our thanks to the team at Microsoft, including Lars Liden, Matt Mazzola, Swadheen Shukla, Qianqian Qi, Piali Choudhury, Curtis von Veh, Sam Yeh, and Jianfeng Gao, whose expertise and commitment were instrumental in the development of the Greenlands platform. Their collaborative spirit helped bring our vision to fruition. Our advisory board and previous co-organizers of the competition also deserve thanks for their input and advice. Finally, special thanks to Microsoft for their funding and overall support in making this project possible.

References

  • Abramson et al. [2020] Josh Abramson, Arun Ahuja, Iain Barr, Arthur Brussee, Federico Carnevale, Mary Cassin, Rachita Chhaparia, Stephen Clark, Bogdan Damoc, Andrew Dudzik, et al. Imitating interactive intelligence. arXiv preprint arXiv:2012.05672, 2020.
  • Adiwardana et al. [2020] Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020.
  • Aliannejadi et al. [2020] Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeff Dalton, and Mikhail Burtsev. Convai3: Generating clarifying questions for open-domain dialogue systems (clariq). 2020.
  • Aliannejadi et al. [2021] Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeff Dalton, and Mikhail Burtsev. Building and evaluating open-domain dialogue corpora with clarifying questions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4473–4484, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.367. URL https://aclanthology.org/2021.emnlp-main.367.
  • Arabzadeh et al. [2022] Negar Arabzadeh, Mahsa Seifikar, and Charles LA Clarke. Unsupervised question clarity prediction through retrieved item coherency. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 3811–3816, 2022.
  • Arabzadeh et al. [2023] Negar Arabzadeh, Ali Ahmadvand, Julia Kiseleva, Yang Liu, Ahmed Hassan Awadallah, Ming Zhong, and Milad Shokouhi. PREME: Preference-based meeting exploration through an interactive questionnaire. In Andreas Vlachos and Isabelle Augenstein, editors, Findings of the Association for Computational Linguistics: EACL 2023, pages 331–342, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-eacl.25. URL https://aclanthology.org/2023.findings-eacl.25.
  • Arabzadeh et al. [2024a] Negar Arabzadeh, Siqing Huo, Nikhil Mehta, Qingyun Wu, Chi Wang, Ahmed Awadallah, Charles LA Clarke, and Julia Kiseleva. Assessing and verifying task utility in llm-powered applications. arXiv preprint arXiv:2405.02178, 2024a.
  • Arabzadeh et al. [2024b] Negar Arabzadeh, Julia Kiseleva, Qingyun Wu, Chi Wang, Ahmed Awadallah, Victor Dibia, Adam Fourney, and Charles Clarke. Towards better human-agent alignment: Assessing task utility in llm-powered applications. arXiv preprint arXiv:2402.09015, 2024b.
  • Balloccu et al. [2024] Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondrej Dusek. Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs. In Yvette Graham and Matthew Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 67–93, St. Julian’s, Malta, March 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.eacl-long.5.
  • Berant et al. [2013] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1533–1544, 2013.
  • Bisk et al. [2016] Yonatan Bisk, Deniz Yuret, and Daniel Marcu. Natural language communication with robots. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 751–761, 2016.
  • Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. 2020.
  • Burtsev et al. [2017] Mikhail Burtsev, Aleksandr Chuklin, Julia Kiseleva, and Alexey Borisov. Search-oriented conversational ai (scai). In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, pages 333–334, 2017.
  • Carta et al. [2023] Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves Oudeyer. Grounding large language models in interactive environments with online reinforcement learning. arXiv preprint arXiv:2302.02662, 2023.
  • Chowdhery et al. [2022] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • Clark et al. [2020] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.
  • Codd [1974] Edgar F Codd. Seven steps to rendezvous with the casual user. IBM Corporation, 1974.
  • Copestake and Jones [1990] Ann Copestake and Karen Sparck Jones. Natural language interfaces to databases. 1990.
  • Dalton et al. [2020] Jeff Dalton, Aleksandr Chuklin, Julia Kiseleva, and Mikhail Burtsev. Proceedings of the 5th international workshop on search-oriented conversational ai (scai). In Proceedings of the 5th International Workshop on Search-Oriented Conversational AI (SCAI), 2020.
  • Desai et al. [2016] Aditya Desai, Sumit Gulwani, Vineet Hingorani, Nidhi Jain, Amey Karkare, Mark Marron, Subhajit Roy, et al. Program synthesis using natural language. In Proceedings of the 38th International Conference on Software Engineering, pages 345–356. ACM, 2016.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2018.
  • Dinan et al. [2020] Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. The second conversational intelligence challenge (convai2). In The NeurIPS’18 Competition, pages 187–208. Springer, Cham, 2020.
  • Elgohary et al. [2020] Ahmed Elgohary, Saghar Hosseini, and Ahmed Hassan Awadallah. Speak to your parser: Interactive text-to-SQL with natural language feedback. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2065–2077, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.187. URL https://www.aclweb.org/anthology/2020.acl-main.187.
  • Fan et al. [2022] Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • Fast et al. [2018] Ethan Fast, Binbin Chen, Julia Mendelsohn, Jonathan Bassen, and Michael S Bernstein. Iris: A conversational agent for complex tasks. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 473. ACM, 2018.
  • Gluck and Laird [2018] Kevin A Gluck and John E Laird. Interactive task learning: Humans, robots, and agents acquiring new tasks through natural interactions. The MIT Press, 2018.
  • Gray et al. [2019] Jonathan Gray, Kavya Srinet, Yacine Jernite, Haonan Yu, Zhuoyuan Chen, Demi Guo, Siddharth Goyal, C. Lawrence Zitnick, and Arthur Szlam. CraftAssist: A Framework for Dialogue-enabled Interactive Agents. arXiv:1907.08584 [cs], July 2019. URL http://arxiv.org/abs/1907.08584. arXiv: 1907.08584.
  • Guss et al. [2019] William H Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela Veloso, and Ruslan Salakhutdinov. Minerl: A large-scale dataset of minecraft demonstrations. arXiv preprint arXiv:1907.13440, 2019.
  • Hendrix et al. [1978] Gary G Hendrix, Earl D Sacerdoti, Daniel Sagalowicz, and Jonathan Slocum. Developing a natural language interface to complex data. ACM Transactions on Database Systems (TODS), 3(2):105–147, 1978.
  • Hermann et al. [2017] Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wojciech Marian Czarnecki, Max Jaderberg, Denis Teplyashin, Marcus Wainwright, Chris Apps, Demis Hassabis, and Phil Blunsom. Grounded language learning in a simulated 3d world. CoRR, abs/1706.06551, 2017. URL http://arxiv.org/abs/1706.06551.
  • Jayannavar et al. [2020] Prashant Jayannavar, Anjali Narayan-Chen, and Julia Hockenmaier. Learning to execute instructions in a minecraft dialogue. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 2589–2602, 2020.
  • Johnson et al. [2016] Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. In IJCAI, pages 4246–4247. Citeseer, 2016.
  • Kiseleva et al. [2014] Julia Kiseleva, Eric Crestan, Riccardo Brigo, and Roland Dittel. Modelling and detecting changes in user satisfaction. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1449–1458, 2014.
  • Kiseleva et al. [2016a] Julia Kiseleva, Kyle Williams, Ahmed Hassan Awadallah, Aidan C Crook, Imed Zitouni, and Tasos Anastasakos. Predicting user satisfaction with intelligent assistants. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 45–54, 2016a.
  • Kiseleva et al. [2016b] Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C Crook, Imed Zitouni, and Tasos Anastasakos. Understanding user satisfaction with intelligent assistants. In Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval, pages 121–130, 2016b.
  • Kiseleva et al. [2022a] Julia Kiseleva, Ziming Li, Mohammad Aliannejadi, Shrestha Mohanty, Maartje ter Hoeve, Mikhail Burtsev, Alexey Skrynnik, Artem Zholus, Aleksandr Panov, Kavya Srinet, Arthur Szlam, Yuxuan Sun, Katja Hofmann, Marc-Alexandre Côté, Ahmed Awadallah, Linar Abdrazakov, Igor Churin, Putra Manggala, Kata Naszadi, Michiel van der Meer, and Taewoon Kim. Interactive grounded language understanding in a collaborative environment: Iglu 2021. In NeurIPS 2021 Competitions and Demonstrations Track, pages 146–161. PMLR, 2022a.
  • Kiseleva et al. [2022b] Julia Kiseleva, Alexey Skrynnik, Artem Zholus, Shrestha Mohanty, Negar Arabzadeh, Marc-Alexandre Côté, Mohammad Aliannejadi, Milagro Teruel, Ziming Li, Mikhail Burtsev, et al. Interactive grounded language understanding in a collaborative environment: Retrospective on iglu 2022 competition. In NeurIPS 2022 Competition Track, pages 204–216. PMLR, 2022b.
  • Köhn et al. [2020] Arne Köhn, Julia Wichlacz, Christine Schäfer, Alvaro Torralba, Jörg Hoffmann, and Alexander Koller. Mc-saar-instruct: a platform for minecraft instruction giving agents. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 53–56, 2020.
  • Li et al. [2020a] Toby Jia-Jun Li, Tom Mitchell, and Brad Myers. Interactive task learning from GUI-grounded natural language instructions and demonstrations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, July 2020a.
  • Li et al. [2019] Ziming Li, Julia Kiseleva, and Maarten De Rijke. Dialogue generation: From imitation learning to inverse reinforcement learning. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 6722–6729, 2019.
  • Li et al. [2020b] Ziming Li, Julia Kiseleva, and Maarten de Rijke. Rethinking supervised learning and reinforcement learning in task-oriented dialogue systems. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3537–3546, Online, November 2020b. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.316. URL https://aclanthology.org/2020.findings-emnlp.316.
  • Li et al. [2020c] Ziming Li, Sungjin Lee, Baolin Peng, Jinchao Li, Julia Kiseleva, Maarten de Rijke, Shahin Shayandeh, and Jianfeng Gao. Guided dialogue policy learning without adversarial learning in the loop. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2308–2317, Online, November 2020c. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.209. URL https://aclanthology.org/2020.findings-emnlp.209.
  • Li et al. [2021a] Ziming Li, Julia Kiseleva, and Maarten de Rijke. Improving response quality with backward reasoning in open-domain dialogue systems. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1940–1944, 2021a.
  • Li et al. [2021b] Ziming Li, Dookun Park, Julia Kiseleva, Young-Bum Kim, and Sungjin Lee. Deus: A data-driven approach to estimate user satisfaction in multi-turn dialogues. arXiv preprint arXiv:2103.01287, 2021b.
  • Liu and Lane [2017] Bing Liu and Ian Lane. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 482–489. IEEE, 2017.
  • Liu and Lane [2018] Bing Liu and Ian Lane. Adversarial learning of task-oriented neural dialog models. In Proceedings of the SIGDIAL 2018 Conference, pages 350–359, 2018.
  • Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019. URL http://arxiv.org/abs/1907.11692.
  • Ma et al. [2022] Ziqiao Ma, Ben VanDerPloeg, Cristian-Paul Bara, Huang Yidong, Eui-In Kim, Felix Gervits, Matthew Marge, and Joyce Chai. Dorothie: Spoken dialogue for handling unexpected situations in interactive autonomous driving agents, 2022.
  • Mehta and Goldwasser [2019] Nikhil Mehta and Dan Goldwasser. Improving natural language interaction with robots using advice. arXiv preprint arXiv:1905.04655, 2019.
  • Mehta et al. [2023] Nikhil Mehta, Milagro Teruel, Patricio Figueroa Sanz, Xin Deng, Ahmed Hassan Awadallah, and Julia Kiseleva. Improving grounded language understanding in a collaborative environment by interacting with agents through help feedback. arXiv preprint arXiv:2304.10750, 2023.
  • Milani et al. [2024] Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, and Rohin Shah. Bedd: The minerl basalt evaluation and demonstrations dataset for training and benchmarking agents that solve fuzzy tasks. Advances in Neural Information Processing Systems, 36, 2024.
  • Mitsuda et al. [2022] Koh Mitsuda, Ryuichiro Higashinaka, Yuhei Oga, and Sen Yoshida. Dialogue collection for recording the process of building common ground in a collaborative task. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5749–5758, Marseille, France, June 2022. European Language Resources Association. URL https://aclanthology.org/2022.lrec-1.618.
  • Mohanty et al. [2022] Shrestha Mohanty, Negar Arabzadeh, Milagro Teruel, Yuxuan Sun, Artem Zholus, Alexey Skrynnik, Mikhail Burtsev, Kavya Srinet, Aleksandr Panov, Arthur Szlam, Marc-Alexandre Côté, and Julia Kiseleva. Collecting interactive multi-modal datasets for grounded language understanding. arXiv preprint arXiv:2211.06552, 2022.
  • Mohanty et al. [2023] Shrestha Mohanty, Negar Arabzadeh, Julia Kiseleva, Artem Zholus, Milagro Teruel, Ahmed Awadallah, Yuxuan Sun, Kavya Srinet, and Arthur Szlam. Transforming human-centered ai collaboration: Redefining embodied agents capabilities through interactive grounded language instructions. arXiv preprint arXiv:2305.10783, 2023.
  • Narayan-Chen et al. [2019] Anjali Narayan-Chen, Prashant Jayannavar, and Julia Hockenmaier. Collaborative dialogue in Minecraft. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5405–5415, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1537. URL https://aclanthology.org/P19-1537.
  • Nass and Moon [2000] Clifford Nass and Youngme Moon. Machines and mindlessness: Social responses to computers. Journal of social issues, 56(1):81–103, 2000.
  • Nguyen and Daumé III [2019] Khanh Nguyen and Hal Daumé III. Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning, 2019.
  • Nguyen et al. [2022] Khanh Nguyen, Yonatan Bisk, and Hal Daumé III. A framework for learning to request rich and contextually useful information from humans, 2022.
  • Ogawa et al. [2020] Haruna Ogawa, Hitoshi Nishikawa, Takenobu Tokunaga, and Hikaru Yokono. Gamification platform for collecting task-oriented dialogue data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 7084–7093, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://www.aclweb.org/anthology/2020.lrec-1.876.
  • OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
  • Padmakumar et al. [2022] Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gökhan Tür, and Dilek Hakkani-Tür. Teach: Task-driven embodied agents that chat. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 2017–2025. AAAI Press, 2022. doi: 10.1609/aaai.v36i2.20097. URL https://doi.org/10.1609/aaai.v36i2.20097.
  • Park et al. [2023] Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442, 2023.
  • Pramanick et al. [2022] Pradip Pramanick, Chayan Sarkar, Sayan Paul, Ruddra dev Roychoudhury, and Brojeshwar Bhowmick. Doro: Disambiguation of referred object for embodied agents, 2022.
  • Press et al. [2022] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022.
  • Ren et al. [2023] Allen Z Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, et al. Robots that ask for help: Uncertainty alignment for large language model planners. In Conference on Robot Learning, pages 661–682. PMLR, 2023.
  • Roller et al. [2020] Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M Smith, et al. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637, 2020.
  • Shi et al. [2022] Zhengxiang Shi, Yue Feng, and Aldo Lipani. Learning to execute or ask clarification questions. arXiv preprint arXiv:2204.08373, 2022.
  • Shridhar et al. [2020] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020.
  • Skrynnik et al. [2022] Alexey Skrynnik, Zoya Volovikova, Marc-Alexandre Côté, Anton Voronov, Artem Zholus, Negar Arabzadeh, Shrestha Mohanty, Milagro Teruel, Ahmed Awadallah, Aleksandr Panov, Mikhail Burtsev, and Julia Kiseleva. Learning to solve voxel building embodied tasks from pixels and natural language instructions. arXiv preprint arXiv:2211.00688, 2022.
  • Srinet et al. [2020] Kavya Srinet, Yacine Jernite, Jonathan Gray, and Arthur Szlam. CraftAssist instruction parsing: Semantic parsing for a voxel-world assistant. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4693–4714, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.427. URL https://www.aclweb.org/anthology/2020.acl-main.427.
  • Su et al. [2017] Yu Su, Ahmed Hassan Awadallah, Madian Khabsa, Patrick Pantel, Michael Gamon, and Mark Encarnacion. Building natural language interfaces to web apis. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 177–186. ACM, 2017.
  • Suhr et al. [2019] Alane Suhr, Claudia Yan, Jacob Schluger, Stanley Yu, Hadi Khader, Marwa Mouallem, Iris Zhang, and Yoav Artzi. Executing instructions in situated collaborative interactions. CoRR, abs/1910.03655, 2019. URL http://arxiv.org/abs/1910.03655.
  • Sun et al. [2022] Yuxuan Sun, Ethan Carlson, Rebecca Qian, Kavya Srinet, and Arthur Szlam. Many episode learning in a modular embodied agent via end-to-end interaction. arXiv preprint arXiv:2204.08687, 2022.
  • Szlam et al. [2019] Arthur Szlam, Jonathan Gray, Kavya Srinet, Yacine Jernite, Armand Joulin, Gabriel Synnaeve, Douwe Kiela, Haonan Yu, Zhuoyuan Chen, Siddharth Goyal, Demi Guo, Danielle Rothermel, C. Lawrence Zitnick, and Jason Weston. Why Build an Assistant in Minecraft? arXiv:1907.09273 [cs], July 2019. URL http://arxiv.org/abs/1907.09273. arXiv: 1907.09273.
  • Team et al. [2021] DeepMind Interactive Agents Team, Josh Abramson, Arun Ahuja, Arthur Brussee, Federico Carnevale, Mary Cassin, Felix Fischer, Petko Georgiev, Alex Goldin, Mansi Gupta, et al. Creating multimodal interactive agents with imitation and self-supervised learning. arXiv preprint arXiv:2112.03763, 2021.
  • Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Tellex et al. [2011] Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
  • Thomason et al. [2019] Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. Vision-and-dialog navigation. CoRR, abs/1907.04957, 2019. URL http://arxiv.org/abs/1907.04957.
  • Wang et al. [2023a] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.
  • Wang et al. [2016] Sida I. Wang, Percy Liang, and Christopher D. Manning. Learning language games through interaction. CoRR, abs/1606.02447, 2016. URL http://arxiv.org/abs/1606.02447.
  • Wang et al. [2017] Sida I. Wang, Samuel Ginn, Percy Liang, and Christopher D. Manning. Naturalizing a programming language via interactive learning. CoRR, abs/1704.06956, 2017. URL http://arxiv.org/abs/1704.06956.
  • Wang et al. [2023b] Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Felipe Vieira Frujeri, Neel Joshi, and Marc Pollefeys. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In ICCV 2023, September 2023b.
  • Winograd [1972] Terry Winograd. Understanding natural language. Cognitive psychology, 3(1):1–191, 1972.
  • Woods et al. [1972] W. A. Woods, Ronald M Kaplan, and Bonnie L. Webber. The lunar sciences natural language information system: Final report. BBN Report 2378, 1972.
  • Wu et al. [2023] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
  • Yao et al. [2019] Ziyu Yao, Yu Su, Huan Sun, and Wen-tau Yih. Model-based interactive semantic parsing: A unified framework and a text-to-SQL case study. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5447–5458, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1547. URL https://www.aclweb.org/anthology/D19-1547.
  • Yao et al. [2020] Ziyu Yao, Yiqi Tang, Wen-tau Yih, Huan Sun, and Yu Su. An imitation game for learning semantic parsers from user interaction. 2020.
  • Young et al. [2013] Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. Pomdp-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179, 2013.
  • Zhang et al. [2020] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR, 2020.
  • Zhang et al. [2021] Yi Zhang, Sujay Kumar Jauhar, Julia Kiseleva, Ryen White, and Dan Roth. Learning to decompose and organize complex tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2726–2735, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.217. URL https://aclanthology.org/2021.naacl-main.217.
  • Zhang et al. [2019] Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536, 2019.
  • Zholus et al. [2022] Artem Zholus, Alexey Skrynnik, Shrestha Mohanty, Zoya Volovikova, Julia Kiseleva, Arthur Szlam, Marc-Alexandre Côté, and Aleksandr I Panov. Iglu gridworld: Simple and fast environment for embodied dialog agents. arXiv preprint arXiv:2206.00142, 2022.

Checklist

  1. For all authors…

    (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] The main claims are reflected in the paper’s contributions.

    (b) Did you describe the limitations of your work? [Yes] See Sec. 8.

    (c) Did you discuss any potential negative societal impacts of your work? [Yes]

    (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

  2. If you are including theoretical results…

    (a) Did you state the full set of assumptions of all theoretical results? [N/A]

    (b) Did you include complete proofs of all theoretical results? [N/A]

  3. If you ran experiments (e.g. for benchmarks)…

    (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]

    (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]

    (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]

    (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    (a) If your work uses existing assets, did you cite the creators? [Yes] We have cited the authors of previous relevant works throughout the paper.

    (b) Did you mention the license of the assets? [Yes] See Sec. 1.

    (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] This paper focuses on the dataset, its analysis, and the accompanying tools. We have made all data, tools, and code publicly available on GitHub, with direct links provided in the introduction (Sec. 1) and relevant sections.

    (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [Yes] We went through IRB and obtained consent from participants. See Sec. 3.

    (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] We note in Sec. 3 that we did not collect or use any personally identifiable information.

  5. If you used crowdsourcing or conducted research with human subjects…

    (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] We have included screenshots and instructions.

    (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

    (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [Yes] Please see Sec. 3.

Appendix A Comparison between related platforms

Tab. 4 showcases a comparison of the IGLU dataset with other related platforms across several dimensions, including dataset size, support for collaborative instructions between humans and AI, availability of data collection and training environments, and provision of a human evaluation platform. As depicted in this table, the IGLU dataset distinguishes itself by offering a comprehensive suite of features, including a relatively large dataset size, tools for collaborative interactions, accessible data collection and training environments, and a robust human-in-the-loop evaluation platform. This positions IGLU as a versatile and valuable resource in the field of interactive grounded language understanding.

Dataset | Setting | Size of dataset | Collaborative instructional (AI/Human) | Data collection tool availability | Training environment availability | Human evaluation platform
SHRDLURN [80] | Building game | 100 games (10,223 utterances)
Voxelurn [81] | Building structures | 230 structures (36,589 utterances)
CEREAL-BAR [72] | Collaborative games | 1202
ALFRED [68] | Household tasks | 25,743
CVDN [78] | Navigation | 2050
TEACh [61] | Household tasks | 3215
MineDojo [24] | Minecraft | 730K YouTube videos, 7K Wiki pages, 340K Reddit posts | N/A N/A
MineRL [28] | Minecraft | 500 video hours
HoloAssist [82] | Physical tasks | 166 video hours | N/A N/A
IGLU (our work) | Collaborative building | 8,947 utterances/1,182 clarifying questions
Table 4: Comparison between relevant platforms.

Appendix B Data Collection Tool

Figure 3: Example of seed data collection, where the Architect can see the goal structure and provides instructions for the Builder. The blue arrows indicate turns for the first goal structure, the orange arrows indicate turns for the second goal structure. Annotators can switch roles between architect and builder for different structures.

Seed Data Collection: Figure 3 illustrates the seed multi-turn interaction data collection. The Architect observes the goal structure and gives instructions to the Builder. The blue arrows mark the turns associated with the first goal structure, and the orange arrows mark the turns for the second goal structure. In the example, different annotators (1, 3, 2, 4, and 6) collaborate to construct Structure 1, and annotators can switch roles between Architect and Builder for different structures.

Figure 4 shows the MTurk views of the Data Collection Tool (Section 3) for the Seed Dataset (Section 4.1). In the Architect Task, the Architect provides instructions to the Builder based on the provided target structure. In the Builder Task, the instructions and the structure built so far are shown; the Builder either marks the instructions as unclear or follows them by adjusting blocks in the voxel world. Finally, in the Intermediate Architect Task, the Architect is shown the progress of the structure built so far and provides the next instruction.

Figure 4: MTurk view of the data collection tool, panels (a), (b), and (c).

B.1 Data Schema

This section describes the schema of the data collected in the architect and builder tasks, along with a shortened example record for illustration.

Listing 1: Architect data schema
{
  "$id": "iglu.architect.schema.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Data schema for architect in IGLU",
  "type": "object",
  "properties": {
    "gameId": {
      "description": "unique id for each game session (where a target structure is defined)",
      "type": "integer"
    },
    "stepId": {
      "description": "a monotonically increasing id, identifying which step the architect is in",
      "type": "integer"
    },
    "avatarInfo": {
      "type": "object",
      "properties": {
        "perspective": {
          "description": "the perspective from which the architect gives the command",
          "type": "string",
          "enum": [
            "north",
            "south",
            "east",
            "west"
          ]
        }
      }
    },
    "command": {
      "description": "the command the architect gives after seeing the target structure and the current world state",
      "type": "string"
    }
  }
}
Listing 2: Builder data schema
{
  "$id": "iglu.builder.schema.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Data schema for builder in IGLU",
  "type": "object",
  "properties": {
    "gameId": {
      "description": "unique id for each game session (where a target structure is defined)",
      "type": "integer"
    },
    "stepId": {
      "description": "a monotonically increasing id, identifying which step the builder is in",
      "type": "integer"
    },
    "avatarInfo": {
      "type": "object",
      "properties": {
        "pos": {
          "description": "an array of three floats representing avatar’s position. i.e. [x, y, z]",
          "type": "array"
        },
        "look": {
          "description": "an array of two floats representing avatar’s pitch and yaw. i.e. [pitch, yaw]",
          "type": "array"
        }
      }
    },
    "worldEndingState": {
      "description": "the ending state of the world after the builder has interacted with it",
      "type": "object",
      "properties": {
        "blocks": {
          "description": "An array of blocks info",
          "type": "array",
          "items": {
            "description": "An array of four elements: [x, y, z, blockId]",
            "type": "array"
          }
        }
      }
    },
    "tape": {
      "description": "A string representation of the tape recording builder’s interaction and world state changes, see example_data.txt",
      "type": "string"
    },
    "clarification_question": {
      "description": "The question builder asks for clarification when they feel confused about their task",
      "type": "string"
    }
  }
}
Listing 3: Example of collected data (shortened)
{
  "gameId": 19,
  "stepId": 1,
  "avatarInfo": {
    "pos": [
      -0.5333829883845848,
      65.07999999999996,
      -3.6806624583844014
    ],
    "look": [
      -1.0720000000000007,
      -15.771999999999965
    ]
  },
  "worldEndingState": {
    "blocks": [
      [-2, 63, 1, 50],
      [-1, 63, -2, 57],
      [-1, 63, 1, 50],
      [0, 63, -3, 57],
      [1, 63, 0, 57]
    ]
  },
  "tape": [
    "0 set_look (-0.004, 0)",
    "1 set_look (-0.044, -0.042)",
    "2 action step_backward",
    "3 pos_change (-0.10159854456559483, 63, 0.014814775657966633)",
    "4 action select_and_place_block 50 1 63 0",
    "5 block_change (1, 63, 0, 0, 50)",
    "..."
  ],
  "clarification_question": "null"
}
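
To make the record layout concrete, the following minimal Python sketch loads a record shaped like Listing 3 and decodes its block list and tape entries. The helper names (parse_tape_entry, blocks_by_position) are hypothetical and not part of the IGLU tooling. The sketch assumes that every tape entry starts with an integer step index followed by an event name, as in the example, and it leaves the event arguments as an uninterpreted string; elision markers such as "..." would need to be filtered out first.

```python
import json

# A record shaped like Listing 3 (values copied from the shortened example).
RECORD_JSON = """
{
  "gameId": 19,
  "stepId": 1,
  "worldEndingState": {
    "blocks": [[-2, 63, 1, 50], [-1, 63, -2, 57], [1, 63, 0, 57]]
  },
  "tape": [
    "0 set_look (-0.004, 0)",
    "2 action step_backward",
    "4 action select_and_place_block 50 1 63 0",
    "5 block_change (1, 63, 0, 0, 50)"
  ],
  "clarification_question": "null"
}
"""


def parse_tape_entry(entry):
    """Split one tape line into (step index, event name, raw argument string)."""
    index, event, *rest = entry.split(" ", 2)
    return int(index), event, rest[0] if rest else ""


def blocks_by_position(record):
    """Map (x, y, z) -> blockId from the [x, y, z, blockId] arrays."""
    blocks = record["worldEndingState"]["blocks"]
    return {(x, y, z): block_id for x, y, z, block_id in blocks}


record = json.loads(RECORD_JSON)
events = [parse_tape_entry(line) for line in record["tape"]]
blocks = blocks_by_position(record)

# The example stores a missing question as the string "null"; treat it as absent.
question = record["clarification_question"]
has_question = question not in (None, "", "null")
```

Keeping the tape arguments as raw strings avoids guessing the meaning of each field, which the schema does not specify.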

Appendix C IGLU Dataset

Examples of IGLU-Dataset: Tab. 5 provides examples of instructions marked as unclear in the dataset along with different kinds of clarifying questions posed by annotators (Sec. 4.2). Clarifying questions cover topics such as color, direction, and identification of blocks.

Instruction | Clarifying Question
Place four blocks to the east of the highest block, horizontally. | Which color blocks?
Destroy 2 purple blocks and then build 3 green blocks diagonally. | Which two purple blocks need to be destroyed?
Destroy the 3 stacked red blocks on the east side. Replace them with 3 stacked blue boxes | Which three of the four stacked red blocks on the east side need to be destroyed?
Make a rectangle that is the width and height of the blue shape and fill it in with purple blocks. | Which side I need to make the rectangle is not clear
Facing South remove the rightmost purple block. Place a row of three orange blocks to the left of the upper leftmost purple block. Place two orange blocks above and below the leftmost orange block. | Which one of the rightmost blocks should be removed?
Facing north and purple-green blocks will be arranged one by one. | Where would you like to place the purple and green blocks exactly?
Table 5: Examples of Unclear Instructions and corresponding Clarifying Questions

Appendix D IGLU-2022 Evaluation protocol

Participating in the IGLU competition involves three phases:

  1. Sign Up: Participants must register on the AIcrowd website to access the competition details and starting kits.

  2. Prepare and Train: After registering and accepting the rules of the competition, participants can start from the prepared baselines and instructions. This involves configuring and training hybrid, RL, and NLP models to interact with the IGLU environment.

  3. Submit Models: Once training is complete, participants submit their models to AIcrowd for automated evaluation. The performance is assessed over a fixed number of episodes, and results are ranked on the competition leaderboard.

Input: $G$, the current state of the grid; $G_0$, the initial state of the grid; $T$, the target state of the grid.
Output: $F_1$, the computed $F_1$ score.
$M \leftarrow G - G_0$;
  // Compute the difference between current and initial grids
$A \leftarrow \text{argmax-intersection}(M, T)$;
  // Find indices where current grid’s modifications best intersect with the target
$I \leftarrow \text{intersection}(M, A)$;
  // Calculate the number of correct modifications
$P \leftarrow \frac{I}{|T|}$;
  // Calculate precision as the ratio of correct modifications to target size
$R \leftarrow \frac{I}{|\{i : M[i] \neq 0\}|}$;
  // Calculate recall as the ratio of correct modifications to all modifications
$F_1 \leftarrow \frac{2 \cdot P \cdot R}{P + R}$;
  // Compute $F_1$ score, the harmonic mean of precision and recall
return $F_1$;
Algorithm 1: Computation of the $F_1$ score
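
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the competition's evaluation code: grids are represented as dictionaries mapping (x, y, z) to a non-zero block id, only placed or recolored blocks count as modifications, and argmax-intersection is assumed to search over horizontal translations within a hypothetical max_shift window. The names f1_score and max_shift are introduced here for illustration, and P and R follow the definitions given in Algorithm 1.

```python
from itertools import product


def f1_score(grid, initial_grid, target, max_shift=2):
    """Sketch of Algorithm 1 on sparse grids: dict (x, y, z) -> block id."""
    # M: cells whose block differs from the initial grid (placements only).
    modified = {pos: b for pos, b in grid.items() if initial_grid.get(pos) != b}
    if not modified or not target:
        return 0.0

    # A and I: try every horizontal shift (assumption) and keep the largest
    # number of modified blocks that coincide with the target.
    best = 0
    shifts = range(-max_shift, max_shift + 1)
    for dx, dz in product(shifts, repeat=2):
        hits = sum(
            1
            for (x, y, z), b in modified.items()
            if target.get((x + dx, y, z + dz)) == b
        )
        best = max(best, hits)
    if best == 0:
        return 0.0

    precision = best / len(target)    # P as defined in Algorithm 1
    recall = best / len(modified)     # R as defined in Algorithm 1
    return 2 * precision * recall / (precision + recall)


# Toy example: two of three target blocks are built (shifted by one cell
# along z), plus one stray block of another color.
target = {(0, 0, 0): 1, (1, 0, 0): 1, (2, 0, 0): 1}
built = {(1, 0, 1): 1, (2, 0, 1): 1, (5, 0, 5): 2}
score = f1_score(built, {}, target)
```

Because $F_1$ is symmetric in $P$ and $R$, the score does not depend on which of the two ratios is labeled precision.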

Appendix E IGLU 2022 Winning Solutions of Agent Building Task

Table 6 presents the results of the winners of the RL task and compares them with the proposed baseline. The Happy Iglu team won by a significant margin, offering a multimodal end-to-end solution. Team FelipeB and the Chuang team improved the NLP part of the MHB (Multitask Hierarchical Builder) baseline to arrive at their solutions. A more comprehensive overview of the solutions is provided below.

Team | Approach | $F_1$ | Precision | Recall | Ep. Length | # of Submissions
Happy Iglu | Brain Agent | 0.254 | 0.331 | 0.264 | 391 | 89
FelipeB | MHB-Pegasus | 0.178 | 0.335 | 0.153 | 283 | 18
Chuang | MHB-Tuned | 0.156 | 0.303 | 0.138 | 294 | 31
Baseline (ours) | MHB | 0.150 | 0.256 | 0.134 | 281 | -
Table 6: Results of the winners of Building Task.
First Place: Happy Iglu Team

The Happy Iglu Team developed an end-to-end RL approach, called Brain Agent, to effectively address the challenges in the IGLU environment. Their approach encompassed several main strategies. Firstly, they crafted a sophisticated reward function that integrates task-specific rewards and penalties. They used the $F_1$ score for evaluation, parameter tuning, and selecting the best model during training.

The team employed advanced representation learning techniques to distill relevant information from high-dimensional inputs such as grid and target_grid for the value function, incorporating additional features like compass orientation and color count. Information about grid and target_grid was absent during testing but utilized in training exclusively by the critic. An auxiliary loss—a grid reconstruction loss—was applied to optimize state utilization, ensuring the agent properly memorized the current environmental state. To address partial observability, the processing of past observation trajectories utilized the TrXL transformer architecture.

Lastly, COCO-LM-large was utilized to generate embedding vectors for each instructional input (utterance). These components were combined, resulting in high performance scores in the IGLU environment. The model was trained using the Brain Agent distributed RL framework (https://github.com/kakaobrain/brain-agent).

Second Place: Team FelipeB

This solution focused on addressing the limitations of the MHB baseline’s NLP component, particularly the low performance of the T5 model, which was reflected in its low BLEU score. To solve this, the Pegasus model, pre-trained for summarization [89], was chosen to translate utterances from the architect into commands. The Pegasus-Large model was trained using the same data augmentations as the original baseline. With careful hyperparameter selection and the replacement of the T5 model, the BLEU score improved from 0.3 to 0.95, and the $F_1$ score of the entire pipeline rose from 0.15 to 0.178, contributing greatly to the success in the competition.

Third Place: Team Chuang

The team focused on transforming the problem of creating a voxel grid into a text-to-video task, using a temporal dimension to represent the grid’s third dimension. They utilized an open-source video diffusion model, enhanced with context prompting by integrating the starting grid into each language instruction, improving the model’s ability to generate the desired output. Applied to the IGLU task, this approach outperformed the T5 model in local tests but faced challenges in external validation.
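
The idea of treating one grid axis as time can be illustrated with a short Python sketch. It is not the team's code: the text does not say which axis was mapped to time, so the sketch assumes the vertical axis y, and the function names grid_to_frames and frames_to_grid are hypothetical.

```python
def grid_to_frames(blocks, size_x, size_y, size_z):
    """Turn a sparse voxel grid into size_y frames of size_x by size_z cells.

    blocks maps (x, y, z) -> block id; empty cells are encoded as 0.
    Assumption: the vertical axis y plays the role of the video's time axis.
    """
    frames = [[[0] * size_z for _ in range(size_x)] for _ in range(size_y)]
    for (x, y, z), block_id in blocks.items():
        frames[y][x][z] = block_id
    return frames


def frames_to_grid(frames):
    """Inverse mapping: read the frames back into a sparse voxel grid."""
    return {
        (x, y, z): block_id
        for y, frame in enumerate(frames)
        for x, row in enumerate(frame)
        for z, block_id in enumerate(row)
        if block_id
    }


# Toy example: a two-block tower next to a single block.
blocks = {(0, 0, 0): 1, (0, 1, 0): 1, (2, 0, 1): 3}
frames = grid_to_frames(blocks, size_x=3, size_y=2, size_z=2)
restored = frames_to_grid(frames)
```

Under this encoding, each generated video frame corresponds to one horizontal layer of the structure.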

Baseline: Multitask Hierarchical Builder

The MHB baseline incorporates three core components to handle task execution based on given instructions:

NLP Module: Utilizes a finetuned T5 encoder-decoder transformer model to predict block coordinates and IDs based on textual instructions. This model is specifically trained on the IGLU dataset to generate sequences of building commands from dialogues. To handle changes in context, it incorporates the last few interactions during fine-tuning and inference to improve prediction accuracy.

Heuristic Module: This Python-based module processes the output from the NLP module to sequentially generate block placement or removal actions. It employs heuristics to determine the sequence of these actions, ensuring each block is handled individually, which aligns with the atomic operational nature of the subsequent RL module.

RL Module: Operates on visual input from the environment, along with data about the inventory and a target block, to execute the physical task of placing or removing a block. This module uses a convolutional ResNet architecture combined with an LSTM to integrate and process environmental data and execute actions based on a reinforcement learning policy trained with the Asynchronous PPO algorithm. A detailed overview of the baseline can be found at https://gitlab.aicrowd.com/aicrowd/challenges/iglu-challenge-2022/iglu-2022-rl-mhb-baseline.
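
The role of the Heuristic Module, turning a predicted set of blocks into a sequence of single-block subtasks, can be sketched as follows. This is an illustrative assumption rather than the baseline's implementation: the names BlockOp and plan_block_ops are hypothetical, and the ordering rule (remove from the top down, then place from the bottom up) is one plausible heuristic chosen here so that placed blocks rest on existing ones.

```python
from typing import NamedTuple


class BlockOp(NamedTuple):
    kind: str       # "place" or "remove"
    x: int
    y: int
    z: int
    block_id: int


def plan_block_ops(current, predicted):
    """Order single-block subtasks that turn `current` into `predicted`.

    Both arguments map (x, y, z) -> block id. Each returned BlockOp is one
    atomic subtask that could be handed to a block-level policy.
    """
    ops = []
    # Remove blocks that are missing or recolored in the prediction,
    # starting from the highest layer.
    stale = [pos for pos, b in current.items() if predicted.get(pos) != b]
    for x, y, z in sorted(stale, key=lambda p: (-p[1], p[0], p[2])):
        ops.append(BlockOp("remove", x, y, z, current[(x, y, z)]))
    # Place missing blocks layer by layer, lowest layer first.
    missing = [pos for pos, b in predicted.items() if current.get(pos) != b]
    for x, y, z in sorted(missing, key=lambda p: (p[1], p[0], p[2])):
        ops.append(BlockOp("place", x, y, z, predicted[(x, y, z)]))
    return ops


# Toy example: recolor the middle block of a tower and add one on top.
current = {(0, 0, 0): 1, (0, 1, 0): 2}
predicted = {(0, 0, 0): 1, (0, 1, 0): 3, (0, 2, 0): 3}
ops = plan_block_ops(current, predicted)
```

Emitting one operation per block matches the description above, in which the RL module handles a single target block at a time.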

Appendix F Human Evaluation Platform Details

This section provides a technical overview of the Greenlands platform. A more detailed description can be found in the project’s code repository (https://github.com/microsoft/greenlands/blob/main/Docs/Home.md).

The Greenlands platform is an integration of three principal components:

  C1. Server: This central server operates a customized version of the standard Minecraft server, enabling human-agent interaction through specialized behaviors and commands. It is responsible for coordinating human players, pairing them with agents, and managing game progression by tracking in-game events, initializing game worlds, and monitoring the completion of games.

  C2. Service: This is a standalone server that performs dual functions: it stores configurations for tasks designated for human evaluation and provides a user interface for competition organizers, such as those from IGLU, to administer these tasks.

  C3. Agent Toolkit: Acting as a Python-based wrapper for the IGLU agents, the Agent Toolkit executes these models within their original training environment, Gridworld, capturing environmental changes and relaying them to the Minecraft server to synchronize the agent’s actions with the human player’s experience.

A bi-directional communication channel facilitates the exchange of game events between the Minecraft server and each Agent Toolkit instance. These events encompass a set of discrete actions within the game: (1) player joining the game, (2) chat interactions, (3) player movements, (4) block placements, (5) block removals, (6) turn endings, and (7) game conclusions.

Incoming events are processed in the corresponding game world, either within the Minecraft server or the Agent Toolkit. Upon initialization of an Agent Toolkit instance, it attempts to connect to the Minecraft server, establishing a communication link that allows the server to recognize active agents available for gameplay.
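
As a rough illustration of such an event channel, the Python sketch below models the seven event types and a JSON round trip for one message. The class names, field names, and wire format are assumptions made for this example and do not describe the actual Greenlands protocol.

```python
import json
from dataclasses import dataclass, field
from enum import Enum


class GameEvent(Enum):
    """The seven event types listed above (identifier names are hypothetical)."""
    PLAYER_JOIN = "player_join"
    CHAT = "chat"
    PLAYER_MOVE = "player_move"
    BLOCK_PLACE = "block_place"
    BLOCK_REMOVE = "block_remove"
    TURN_END = "turn_end"
    GAME_END = "game_end"


@dataclass
class EventMessage:
    """One event exchanged between the server and an Agent Toolkit instance."""
    game_id: str
    event: GameEvent
    payload: dict = field(default_factory=dict)

    def to_json(self):
        return json.dumps(
            {"game_id": self.game_id, "event": self.event.value, "payload": self.payload}
        )

    @classmethod
    def from_json(cls, raw):
        data = json.loads(raw)
        return cls(data["game_id"], GameEvent(data["event"]), data["payload"])


# Round trip of a chat event, as either side of the channel might see it.
sent = EventMessage("game-1", GameEvent.CHAT, {"text": "place a red block"})
received = EventMessage.from_json(sent.to_json())
```

Tagging each message with a game identifier is what would let a single process route events for several concurrent games.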

The Agent Toolkit is designed to enable a single agent instance to concurrently participate in multiple games, assuming the agent model does not maintain an internal state between steps in the environment. For the purposes of our human evaluation, however, each Agent Toolkit instance was restricted to a single game to optimize inference speed and simplify monitoring, although multiple agent instances ran simultaneously as separate Agent Toolkit processes.

It is important to note that the Gridworld environment, where the agents are executed, does not replicate the exact same physics as Minecraft. It also differs slightly in specific action parameters, such as the block placement/removal radius and the permissible collision boundaries. Consequently, agent physics is not applied within the Minecraft server; instead, agent actions are first simulated in the Agent Toolkit, and then the final state is mirrored in the Minecraft environment. In contrast, the human player’s interactions are processed directly by the Minecraft server, with relevant state information transmitted to the Agent Toolkit so the model can consume it.

A human participant wanting to play with an agent would need to go through the following sequence:

  • Acquire a join code from the competition organizers, which was created beforehand through Service’s web interface and specifies the agent and task for the game.

  • Connect to the Greenlands Minecraft server endpoint, entering the Lobby World where the sole possible action is to input the Join Code.

  • Upon code submission, the server alerts the designated agent that a game will commence, generates a new Game World with pre-set structures, and places the human as the architect and the agent as the builder. The architect has the ability to fly around the world, observing both the target structure and the agent within its build area. The agent is confined to its build area and is unable to traverse outside of it or interact with elements beyond its designated borders.

  • The game officially begins when the human, acting as the architect, compares the current state of the agent’s build area with the target structure. The human then formulates and sends an utterance to guide the agent, who serves as the builder, towards achieving the goal. After issuing this instruction, the human ends their turn using a specific command provided by the platform. Subsequently, the agent takes its turn, receiving the current world state and the entire chat history as input. It is then instructed to perform actions until it either exceeds a predefined maximum number of steps or determines that its turn is complete.

  • The turn-based interaction is conducted in a loop until the human player either (a) acknowledges that the agent has accurately completed the target structure, or (b) determines that the agent has reached an irrecoverable state and cannot complete the structure. At this point, the human issues an End Game command, which includes an indication of whether the game concluded successfully or not.

  • Upon the game’s conclusion, the platform dismantles the Game World and informs the corresponding Agent Toolkit instance that the session has ended, preparing it for subsequent matches. The human participant is then teleported back to the Lobby World, and the server issues a Completion Code to the participant via the chat box. The participant enters this code into the appropriate field of the MTurk task, and the competition organizers use it to query the Service and retrieve the complete log of the game, including the human player’s assessment of whether the game concluded successfully.
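
The turn-based loop described in the steps above can be summarized in a short Python sketch. It is a simplified model rather than the platform's code: the function run_game, its callback signatures, and the limits max_agent_steps and max_turns are hypothetical, and the human and the agent are replaced by stub functions.

```python
def run_game(get_instruction, agent_step, human_verdict,
             max_agent_steps=100, max_turns=20):
    """Alternate human and agent turns until the human ends the game.

    get_instruction(world, chat) -> str         architect utterance
    agent_step(world, chat) -> bool             True when the agent ends its turn
    human_verdict(world) -> True, False, None   None means keep playing
    """
    world, chat = {}, []
    for turn in range(1, max_turns + 1):
        # Human turn: inspect the build area, send an utterance, end the turn.
        chat.append(get_instruction(world, chat))
        # Agent turn: act on the world state and the full chat history until
        # it ends its turn or reaches the step limit.
        for _ in range(max_agent_steps):
            if agent_step(world, chat):
                break
        # The human ends the game on success or on an irrecoverable state.
        verdict = human_verdict(world)
        if verdict is not None:
            return {"turns": turn, "success": verdict}
    return {"turns": max_turns, "success": False}


# Toy usage: the stub agent places one block per turn along the x axis.
target = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}


def stub_instruction(world, chat):
    return f"place block number {len(world) + 1}"


def stub_agent(world, chat):
    world[(len(world), 0, 0)] = 1
    return True  # end the turn after a single action


def stub_verdict(world):
    return True if set(world) == target else None


result = run_game(stub_instruction, stub_agent, stub_verdict)
```

The turn cap is an addition of this sketch; in the procedure above, the game ends only when the human issues the End Game command.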

F.1 Gameplay Screenshots

The following images illustrate the experience of a human participant from the moment they join the Greenlands Minecraft server until they finish the game and obtain their confirmation code.

Figure 5: The human participant is spawned in the Lobby world when they join the server. It’s a flat world where the only action they’re allowed to do is to paste a Join Code in the chat box.
Figure 6: Initial view that the human participants see when they first join a game. The target structure can be seen on the left side, and the agent and its initial structure can be seen on the right. The agent’s build zone has cardinal directions to make it easier for the human to provide instructions with absolute directions rather than having to rely on relative left, right. At the start of the game, the participant is also provided with instructions detailing their role and goal for this session.
Figure 7: A human participant providing instructions to the agent (seen on the left side of the picture), and then ending their turn.
Figure 8: The agent has just performed its action in response to the human’s instructions, and has now ended its turn.
Figure 9: The human participant has finished the game and is sent back to the Lobby world. The chat box explains to them how to get the confirmation code for the game.

F.2 Examples of Human Annotations

Below is a list of comments provided by the human participants for each of the agents. These have been manually chosen as representative of each agent’s performance, and have been slightly paraphrased for correctness and readability.

Brain Agent

  • "The agent built a lot and was even able to break things but they were not able to choose colors right or destroy the right blocks."

  • "This agent was able to move around the structure and place blocks but it would always also instantly delete them. It was not able to figure out height also and kept building too high."

  • "The agent was largely unresponsive not really doing anything no matter the command whether it be to build or to break."

  • "It was able to build 3 blocks of blue like I wanted, but it was the wrong way. Then I wanted it to fix its mistakes, but the AI broke and started building and destroying blocks randomly."

These comments show that Brain Agent usually acted, although it was sometimes unresponsive (ending its turn immediately without performing any action). It also tended to partially follow the user’s instructions, especially during the first turn, before starting to perform random actions.

MHB-Pegasus

  • "The agent placed the blocks at the wrong location, ignores the location I ask them to place blocks at, kept building in the middle, and ignored my locations I was giving them."

  • "The agent was completely unresponsive not even really moving much just receiving commands and not acting on them at all."

  • "The agent placed a lot of blocks and got rather close to what was supposed to be the structure but they placed some wrong ones and could not destroy any blocks."

  • "The agent was able to place a lot of blocks but none that were part of my commands or even the right color at times. Along with that they did not even break them once placed."

For the MHB-Pegasus agent we again see it suffers from the unresponsiveness problem. As with Brain Agent, it seems to align to human instructions for the first few turns but later devolves into random action.

MHB

  • "The agent did not listen to my commands at all. It just did nothing. The first command had it back up and look down, then it refused to do anything else."

  • "The agent placed the blocks on the incorrect side of the grid. The agent followed part of my command, and placed 3 red blocks, but they were placed improperly at the wrong location."

  • "The agent was able to place and break blocks but did not follow any of my commands besides breaking the wrong blocks they placed."

  • "The agent was incapable of turning and was stuck building in one direction with the wrong colors."

Overall, all three agents suffer from the same problems: (1) ending up in a state where they cannot decide on a next action and ending their turn prematurely, even though the human clearly tells them what to do; (2) obeying only part of an instruction, for example placing blocks in the correct location but with the wrong colors, or vice versa; and (3) acting sensibly only for the first few turns.