MetaTool: Facilitating Large Language Models to Master Tools with Meta-task Augmentation

Xiaohan Wang, Dian Li, Yilin Zhao, Sinbadliu, Hui Wang
Foundation Technology Center, Tencent PCG
{shawnbywang, goodli, yilinnzhao, sinbadliu, joltwang}@tencent.com
Corresponding author
Abstract

Utilizing complex tools with Large Language Models (LLMs) is a critical component for grounding AI agents in various real-world scenarios. The core challenge of manipulating tools lies in understanding their usage and functionality. The prevailing approach involves few-shot prompting with demonstrations or fine-tuning on expert trajectories. However, for complex tools and tasks, mere in-context demonstrations may fail to cover sufficient knowledge. Training-based methods are also constrained by the high cost of dataset construction and limited generalizability. In this paper, we introduce a new tool learning methodology (MetaTool) that is generalizable for mastering any reusable toolset. Our approach includes a self-supervised data augmentation technique that enables LLMs to gain a comprehensive understanding of various tools, thereby improving their ability to complete tasks effectively. We develop a series of meta-tasks that involve predicting masked factors of tool execution. These self-supervised tasks enable the automatic generation of high-quality QA data concerning tool comprehension. By incorporating meta-task data into the instruction tuning process, the proposed MetaTool model significantly outperforms open-source models and is comparable to GPT-4/GPT-3.5 on multiple tool-oriented tasks.

1 Introduction

Distinguished from other species, a critical characteristic of human beings’ advanced intelligence is the use of complex tools, which expands the frontiers neural intelligence can reach. With the advent of powerful foundation models (e.g. large language models, multi-modal models), AI has the potential to solve complex tasks as a general agent, equipped with the abilities to make long-term plans, use external tools, reflect on its own behavior, etc. Using tools endows LLMs with the power of external mechanisms and lets them exert effect on a larger scale.

Existing tool learning research mainly falls into two paradigms: tool-augmented learning and tool-oriented learning Qin et al. (2023b). The former aims at augmenting the model with complementary resources (e.g. retriever, search engine), and the latter focuses on achieving certain task objectives with tools (e.g. web navigation Rawles et al. (2023); Hong et al. (2024), embodied manipulation Chi et al. (2023)). While augmenting LLMs with tools requires appropriate tool selection, tool-oriented tasks raise more challenges in tool manipulation, given that they are oriented toward tool output and state change. This work focuses on learning tool manipulation in the tool-oriented paradigm with large language models.

To utilize tools with LLMs, a mainstream way is to provide the “cookbook” of tools with zero-shot prompting or demonstrations of tool usage with few-shot prompting Xu et al. (2023); Brown et al. (2020). This may work on simple toolsets; however, for complex tools like software or machines, demonstrations cannot exhaust all scenarios, and manuals are also limited in length. Ultimately, it’s impractical to expect a system to be intelligent enough to master any tool without experience of using it. Besides, prior training-based methods mainly adopt supervised fine-tuning with annotated solutions on the basis of pre-trained LLMs Qin et al. (2023c); Patil et al. (2023). Apart from the difficulty of annotating the optimal actions for complex tasks, training on a limited amount of data is prone to overfitting action patterns. Without truly understanding the dynamics of tool execution, it’s hard to generalize to diverse task scenarios through flexible tool manipulation. For instance, in Figure 1, understanding the usage or functionality of hammers (e.g. nailing, smashing) enables the robot to build a cabin better and generalize the skills to chair or table construction.

Figure 1: Paradigm comparison between existing tool learning methods and proposed meta-task augmentation.

To address the issues above, our insight is that generalizable tool manipulation should be achieved on the foundation of comprehensive tool understanding, which is learnable with a practical amount of data. In this paper, we propose a data augmentation method (MetaTool) that enhances an LLM’s understanding of an external toolset and boosts the learning of tool-oriented tasks. Given a callable toolset (e.g. APIs, programs), a meta-set consisting of question-answering data of 6 meta-tasks is constructed by calling the tools in a self-supervised way. The meta-tasks are designed concerning the causality of the toolset as an autonomous system and its functionality as a function. Then we augment the solution data of tool-oriented tasks with the meta-set to fine-tune the pre-trained LLM. Evaluated on three tool-oriented tasks, our method significantly improves the success rate (+22.7%) of the open-source LLM (e.g. LLaMA-3) and is competitive with GPT-4/3.5-turbo. Moreover, we also explore the mechanism of enhancing tool manipulation capability with meta-task data. The overall contribution is threefold:

  • We introduce a new tool learning paradigm that facilitates the task performance of LLMs with task-agnostic tool understanding.

  • We propose an integral set of self-supervised tasks that dissect the tool execution process and enable efficient data generation and augmentation.

  • Extensive evaluation on tool-oriented tasks verifies that MetaTool significantly enhances open-source LLMs compared with conventional instruction tuning methods.

2 Related Works

2.1 Tool learning

Recent studies have shed light on the potential of utilizing tools to augment LLMs with external factual knowledge Qin et al. (2023a); Nakano et al. (2021); Song et al. (2023); Hao et al. (2024); Shen et al. (2024); Gao et al. (2023); Wu et al. (2023); Qian et al. (2023); Zhuang et al. (2024); Schick et al. (2024) and complete tasks in complex environments Gupta & Kembhavi (2023). With the burgeoning intelligence in reasoning and perception, LLMs’ tool-use capability can be widely applied in the automation of various domains including Embodied AI Wang et al. (2024c; b), web manipulation Rawles et al. (2023); Hong et al. (2024); Yang et al. (2023); Deng et al. (2024); He et al. (2024); Zhou et al. (2023), and image/video editing Wang et al. (2024a); Argaw et al. (2022); Hang et al. (2024); Fu et al. (2023). Effectively mastering complex tools challenges the model to comprehend the precondition and potential outcome of using tools. In this paper, we aim to facilitate LLMs for tool-oriented tasks by learning robust tool understanding.

2.2 Tool understanding

As noted by Hernik & Csibra (2009), when learning to utilize a specific tool, children perceive it as an object with particular functions, engaging in a cognitive process to understand its purpose and operation. Analogously, a comprehensive understanding of the tools’ functionalities is indispensable for enabling the controller to use tools proficiently. In real-world scenarios, tools are typically accompanied by a manual (or tutorial), which provides sufficient relevant details about their functionalities and usage. Endowed with strong few-shot learning Brown et al. (2020) and zero-shot learning Wei et al. (2021) capabilities, foundation models can be prompted to unravel tools’ functionalities and comprehend how to use them. To this end, we can construct suitable task-specific prompts either through manual design Vemprala et al. (2024) or retrieval Zhou et al. (2022). However, prompting is restricted by input context length, so the situation becomes more challenging with multiple complex tools that have long descriptions. While most training-based tool learning methods rely on extensive expert-annotated solution data for goal-oriented tasks, the knowledge contained in the tool execution process itself remains unutilized. We propose a self-supervised data augmentation method to efficiently endow LLMs with the comprehension of a set of tools.

3 Method

In this section, we first formalize the tool-oriented task with a closed toolset. Then we define six general meta-tasks that are key to tool understanding and show how to generate datasets of them in an integral self-supervised way. In the end, we describe several training schemes to augment the tool-oriented training with meta-task data.

3.1 Problem Formalization

A tool task can be defined as a tuple $\left\langle\mathcal{S},\mathcal{A},\mathcal{T},g\right\rangle$, where $\mathcal{S},\mathcal{A},\mathcal{T}$ is the state space, action space, and toolset, and $g$ is the goal state of the task. Toolset $\mathcal{T}=\{t\}_{N}$ consists of $N$ tools, each as a state transition function $s^{\prime}=t(s,\theta)$ that formalizes the outcome of state change when feeding the input parameters $\theta$ into the tool. An action $a=\left\langle t,\theta\right\rangle\in\mathcal{A}$ specifies the tool and its input. As an autonomous agent, an LLM should iteratively respond with actions and inputs according to the state until it reaches the goal. Broadly, when the tools can not alter any external state, tool output like retrieval results can be regarded as the state thus the goal state is the desired information.
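
The formalization above can be sketched in a few lines of Python. This is only an illustrative reading, not the authors' implementation: the names `ToolTask`, `run_episode`, and the toy `append` tool are hypothetical, states are assumed to be plain strings, and the goal state g is represented by a goal-test predicate so that goals such as "contains a substring" also fit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str                              # assumption: states are plain strings
Tool = Callable[[State, str], State]     # s' = t(s, theta)
Action = Tuple[str, str]                 # a = <tool name, theta>


@dataclass
class ToolTask:
    toolset: Dict[str, Tool]             # T: tool name -> transition function
    is_goal: Callable[[State], bool]     # stands in for the goal state g


def run_episode(task: ToolTask, policy: Callable[[State], Action],
                s0: State, max_steps: int = 20):
    """Apply actions chosen by `policy` until the goal test passes."""
    s = s0
    trajectory: List[Tuple[State, Action, State]] = []
    for _ in range(max_steps):
        if task.is_goal(s):
            break
        name, theta = policy(s)
        s_next = task.toolset[name](s, theta)   # s' = t(s, theta)
        trajectory.append((s, (name, theta), s_next))
        s = s_next
    return s, trajectory


# Toy usage: a single hypothetical tool that appends its input to the string.
toy = ToolTask(toolset={"append": lambda s, theta: s + theta},
               is_goal=lambda s: s == "aaa")
final_state, steps = run_episode(toy, lambda s: ("append", "a"), "")
```

In this sketch the policy plays the role of the LLM agent, and each recorded `(s, a, s')` triple is one step of tool execution.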

Figure 2: Illustration of developing self-supervised meta-tasks from unsupervised tool execution process.

3.2 Self-supervised Meta-tasks for Tool Understanding

We enhance the tool understanding of the model with self-supervised surrogate (pretext) tasks instead of in-context descriptions or demonstrations. Formally, we regard tools as external systems that implement state transition mappings. Tool understanding, therefore, involves comprehending the perception-action process of these systems (referred to as tool execution) and should be generalizable to various task objectives.

Meta-task definition. We first generate single-step tool execution data $\mathcal{D}=\{s,a,s^{\prime}\}$ by stochastically sampling an initial state $s\in\mathcal{S}$ and an action $a\in\mathcal{A}$, and obtaining the tool output $s^{\prime}$. Six surrogate tasks (meta-tasks) are designed based on the unsupervised dataset $\mathcal{D}$. Basically, the model is required to predict masked factors of the execution process, in line with the idea of the Masked Autoencoder (MAE) He et al. (2022). We define the meta-tasks as below:

  • Effect: The model predicts the outcome state $P(s^{\prime}|a,s)$ given the initial state and the action.

  • Decision-making: The model decides a feasible action $P(a|s,s^{\prime})$ given the initial and outcome state.

  • Reversion: The model deduces the initial state $P(s|a,s^{\prime})$ given the action and the outcome state.

  • Input Boundary: The model determines whether an action can be successfully executed given the current state: $P(\mathbbm{1}_{s^{\prime}\neq s}|a,s)$.

  • Output Boundary: The model determines whether a state can be reached with any action given the current state: $P(\mathbbm{1}_{\exists(t,\theta),s^{\prime}=t(s,\theta)}|s,s^{\prime})$.

  • Counterfact: The model predicts the new outcome state $P(s^{\prime\prime}|a,s^{\prime},a^{\prime})$ if a new action $a^{\prime}$ were executed given that the current action $a$ results in the current outcome $s^{\prime}$.

The effect, decision-making, and reversion meta-tasks emphasize the causality of a tool, regarding the action as an intervention on the state Pearl (2009); Pearl & Mackenzie (2018) and the outcome as a causal effect determined by the tool mechanism. On top of that, the counterfact task is the composition of reversion and effect: it imagines an outcome that differs from the fact in the effect task and raises higher requirements for causal reasoning Bareinboim et al. (2015); Zhang & Bareinboim (2016). When implemented as APIs, tools may receive non-executable inputs and produce ineffective outcomes. Thus the input and output domains are also distinctive features of a tool as a function. The input boundary meta-task emphasizes tool affordance, that is, which actions can be executed given the situation and the precondition. The output boundary meta-task emphasizes the functionality of tools, that is, which goals can and cannot be achieved given the current state.
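
To make the masking idea concrete, the following sketch derives one (input, target) pair per meta-task from a single execution step. It is a hedged illustration under stated assumptions: the helper `build_meta_examples`, the toy bounded-counter tool `step`, and the way the alternative action and candidate state are supplied are all hypothetical, and a non-executable action is assumed to leave the state unchanged, as the indicator in the input boundary definition suggests.

```python
def build_meta_examples(s, a, step, actions, alt_action, candidate):
    """Return {meta-task: (visible factors, masked target)} for one step."""
    s_next = step(s, a)
    return {
        "effect": ((s, a), s_next),                      # predict s'
        "decision_making": ((s, s_next), a),             # predict a
        "reversion": ((a, s_next), s),                   # predict s
        "input_boundary": ((s, a), s_next != s),         # did a change the state?
        "output_boundary": ((s, candidate),              # is candidate reachable?
                            any(step(s, b) == candidate for b in actions)),
        "counterfact": ((a, s_next, alt_action),         # predict s'' under a'
                        step(s, alt_action)),
    }


def step(s, a):
    """Toy tool environment: a counter clipped to the range [0, 3]."""
    delta = {"inc": 1, "dec": -1}[a]
    return min(3, max(0, s + delta))


examples = build_meta_examples(0, "inc", step, ["inc", "dec"],
                               alt_action="dec", candidate=2)
```

Here the output boundary label is computed by enumerating a finite action list, which is only feasible for small action spaces; larger toolsets would need sampling or tool-specific checks.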

Figure 3: Demonstrations of meta-tasks and tool-oriented solution data of SAW task (refer to section 4.1). We diversify the questions and annotated thoughts in the datasets with multiple templates generated by GPT-4.

Metaset construction. Based on the single-step data $\mathcal{D}$, datasets of meta-tasks (referred to as metasets) are constructed by automatically generating question-answering pairs, showcased in Figure 2. For each sample and each meta-task, we insert the variables of states and actions into 5 templates (diversified with GPT-4) to obtain diverse QA data.
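
A minimal sketch of template-based QA generation for the effect meta-task is given below. The template wordings and the function `make_effect_qa` are hypothetical stand-ins (the actual templates are diversified with GPT-4 and are not reproduced here), and drawing one template at random per sample is an assumption about how the 5 templates are used.

```python
import random

# Hypothetical question templates for the effect meta-task.
EFFECT_TEMPLATES = [
    "The current state is {s}. What is the state after executing {a}?",
    "Given the state {s}, predict the outcome of the action {a}.",
    "If {a} is executed on the state {s}, what will the new state be?",
    "Starting from {s}, the agent calls {a}. Describe the resulting state.",
    "State: {s}. Action: {a}. What does the state become?",
]


def make_effect_qa(s, a, s_next, rng):
    """Fill one randomly chosen template with a sampled (s, a, s') triple."""
    template = rng.choice(EFFECT_TEMPLATES)
    return {"question": template.format(s=s, a=a), "answer": str(s_next)}


qa = make_effect_qa("'ab'", "Swap('a')", "'ba'", random.Random(0))
```

The other meta-tasks would follow the same pattern with their own template lists and with a different factor of the triple used as the answer.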

3.3 Meta-task Augmentation

With the data of meta-tasks, we explore several ways to augment the tool manipulation ability needed to achieve task goals: 1) In-context learning: To enhance the tool understanding of LLMs in a training-free way, we incorporate several demonstrations of each meta-task in the system prompt. Existing works may include demonstrations in the tool cookbook, yet not in a systematic way. 2) Self-supervised learning: Since we aim to build the model’s tool understanding as the foundation of tool-oriented learning, an intuitive approach is to train the LLM first on the metasets as surrogate tasks and then on the solution data of tool-oriented tasks. To maintain the general ability of the model in the first stage, only the parameters of the query and value projection layers of the Transformer are updated instead of full-parameter training. We also propose to train the model on metasets separately to build the tool-use ability step by step. 3) Data augmentation: We also utilize the metasets as augmented data for conventional instruction tuning, in which the metasets are mixed with solution data and the model is trained on the mixture uniformly. The model trained on the mixed data is referred to as MetaTool.
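
The difference between the two training-based schemes is mainly a matter of how the data is scheduled, which the sketch below tries to show. The function names and the `source` tag are hypothetical, and the uniform shuffle is only one plausible reading of training on the mixed data "uniformly".

```python
import random


def mix_datasets(metasets, solution_data, seed=0):
    """Data augmentation: pool all metasets with solution data and shuffle."""
    mixed = [dict(ex, source=name)
             for name, data in metasets.items() for ex in data]
    mixed += [dict(ex, source="solution") for ex in solution_data]
    random.Random(seed).shuffle(mixed)
    return mixed


def two_stage_schedule(metasets, solution_data):
    """Self-supervised learning: metasets first, then solution data."""
    stage1 = [ex for data in metasets.values() for ex in data]
    return [stage1, list(solution_data)]


metasets = {"effect": [{"q": "e1"}, {"q": "e2"}], "reversion": [{"q": "r1"}]}
solutions = [{"q": "s1"}, {"q": "s2"}]
mixed = mix_datasets(metasets, solutions)
stages = two_stage_schedule(metasets, solutions)
```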

4 Experiments

4.1 Task Setup

To evaluate LLMs’ ability on tool-oriented tasks, we develop 3 tasks emphasizing closed toolset manipulation rather than long-term planning and reflection which are also crucial abilities for AI agents. The key challenge of these tasks is effectively utilizing tools to achieve the goal, which requires the model to understand the rules (preconditions) and the mechanisms of the toolset. The task definition and dataset construction are elaborated below.

SpellAnyWord (SAW). In this task, the agent needs to sequentially construct a string that contains the target string as a continuous substring. The initial state of the task is an empty string. Two non-degradable tools (functions) are available: hinzufügen: to add two adjacent letters in the alphabet to the end of the current string. The tool input $\theta$ should be the preceding letter (e.g. passing ’a’ to hinzufügen on the current string ’’ will result in ’ab’). Swap: to swap the position of two adjacent letters in the current string. The input should be the preceding letter (e.g. passing ’a’ to Swap on ’ab’ will result in ’ba’). An example task: The target string is ’any’. A successful action sequence can be [hinzufügen(’a’), hinzufügen(’n’), hinzufügen(’y’), Swap(’a’), Swap(’o’)], which will result in the state sequence [’ab’, ’abno’, ’abnoyz’, ’banoyz’, ’banyoz’], and the final string ’banyoz’ has ’any’ as a substring.
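
The two SAW tools are simple enough to sketch directly. In the code below, `add_pair` and `swap_pair` are hypothetical Python names for the hinzufügen and Swap tools; treating invalid inputs as no-ops and swapping the first occurrence of the given letter are assumptions, since the task description does not specify either. The replay at the end uses a sequence of calls that reproduces the state sequence ['ab', 'abno', 'abnoyz', 'banoyz', 'banyoz'] listed above.

```python
import string

ALPHABET = string.ascii_lowercase


def add_pair(state, letter):
    """hinzufügen tool: append `letter` and the next letter of the alphabet."""
    i = ALPHABET.find(letter)
    if len(letter) != 1 or i < 0 or i == len(ALPHABET) - 1:
        return state  # assumption: invalid input leaves the state unchanged
    return state + ALPHABET[i] + ALPHABET[i + 1]


def swap_pair(state, letter):
    """Swap tool: swap `letter` with the letter that follows it in the string."""
    i = state.find(letter)  # assumption: the first occurrence is used
    if len(letter) != 1 or i < 0 or i == len(state) - 1:
        return state
    chars = list(state)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


# Replay a sequence of tool calls for the target string 'any'.
actions = [(add_pair, "a"), (add_pair, "n"), (add_pair, "y"),
           (swap_pair, "a"), (swap_pair, "o")]
state, states = "", []
for tool, theta in actions:
    state = tool(state, theta)
    states.append(state)
```

After the replay, `states` holds the five intermediate strings and the final string contains 'any' as a substring.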

BlocksWorld (BW). In this scenario, the agent needs to stack several blocks on the table into a target state with one hand. Only one block can be moved at a time. Two tools (functions) are available: Pick: to pick a block in the hand. The tool input should be the target block indicated by its color (e.g. Pick(’yellow’)). A block cannot be picked if there are blocks on top of it or there is already a block in the hand. Stack: to stack the block in the hand onto the target block or table. The input should be the color of the target block or ’table’ (e.g. Stack(’white’), Stack(’table’)). A block cannot be stacked if the target block already has another block on top of it or if there is no block in the hand.
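
The preconditions of Pick and Stack can be written down as a small state-transition sketch. The state layout (an `on` mapping from each block to what it rests on, plus a `hand` slot) and the function names are hypothetical choices for illustration, and invalid actions are assumed to return the state unchanged.

```python
def is_clear(state, block):
    """A block is clear when no other block rests on it."""
    return block not in state["on"].values()


def pick(state, block):
    """Pick tool: needs an empty hand and a clear block."""
    if state["hand"] is not None or block not in state["on"] \
            or not is_clear(state, block):
        return state  # precondition violated: state unchanged (assumption)
    on = {b: under for b, under in state["on"].items() if b != block}
    return {"on": on, "hand": block}


def stack(state, target):
    """Stack tool: needs a held block and a clear target (or the table)."""
    if state["hand"] is None:
        return state
    if target != "table" and (target not in state["on"]
                              or not is_clear(state, target)):
        return state
    return {"on": {**state["on"], state["hand"]: target}, "hand": None}


# Goal: blue on yellow, starting with red on top of yellow.
s0 = {"on": {"yellow": "table", "red": "yellow", "blue": "table"}, "hand": None}
s = stack(pick(s0, "red"), "table")     # clear the yellow block first
s = stack(pick(s, "blue"), "yellow")    # then place blue on yellow
```

Calling `pick(s0, "yellow")` in the initial state returns `s0` itself, which is the kind of invalid action the input boundary meta-task asks the model to recognize.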

Logistics (LOG). The agent needs to solve a logistics problem by arranging trucks and airplanes to transport the package to the target location. Locations are grouped by cities. Trucks can be used to move packages between locations in the same city and planes can be used to move packages between cities. Two tools (functions) are available: Truck: to transport the truck and the package (if there is any) from one location to another. Plane: to transport the airplane and the package (if there is any) from one location to another. The tool input should be the starting and ending location indicated by numbers. (e.g. Truck(1,2), Plane(2,4)). An action is invalid when there is no truck or airplane at the starting location.
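
A compact sketch of the Truck and Plane tools follows. Several details are assumptions made only for illustration: the location-to-city map `CITY_OF`, a single package that is carried automatically whenever a vehicle leaves its location, planes being usable only between different cities, and invalid actions returning the state unchanged.

```python
# Hypothetical map from location number to city (not taken from the paper).
CITY_OF = {1: "A", 2: "A", 3: "B", 4: "B"}


def _move(state, fleet, src, dst):
    """Move one vehicle of `fleet` from src to dst, carrying the package."""
    vehicles = list(state[fleet])
    if src not in vehicles:
        return state  # no vehicle at the starting location: invalid action
    vehicles[vehicles.index(src)] = dst
    package = dst if state["package"] == src else state["package"]
    return {**state, fleet: tuple(vehicles), "package": package}


def truck(state, src, dst):
    """Truck tool: valid only between two locations of the same city."""
    if src not in CITY_OF or dst not in CITY_OF or CITY_OF[src] != CITY_OF[dst]:
        return state
    return _move(state, "trucks", src, dst)


def plane(state, src, dst):
    """Plane tool: valid only between locations of different cities."""
    if src not in CITY_OF or dst not in CITY_OF or CITY_OF[src] == CITY_OF[dst]:
        return state
    return _move(state, "planes", src, dst)


s0 = {"trucks": (1, 3), "planes": (2,), "package": 1}
s1 = truck(s0, 1, 2)    # Truck(1,2): package moves to location 2
s2 = plane(s1, 2, 4)    # Plane(2,4): package flies to the other city
```

From `s2`, calling `truck(s2, 4, 3)` is invalid because no truck is at location 4, so the agent would first have to bring the truck from location 3.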

Dataset collection. For the SAW task, we randomly sample 2k target strings (from 2 letters to 10 letters) as task goals. We modify the BW and LOG tasks from the prior LLM benchmark Valmeekam et al. (2024) into the tool-use version, thus 2k goals for each task are adopted following the original configuration. Optimal action sequences are obtained with a heuristic strategy as the solution data. For each annotated action, we generate a thought with one of 5 templates (diversified with GPT-4). The thoughts analyze the situation and what to do, following ReAct Yao et al. (2022), to leverage the model’s reasoning ability. Thus the solution contains a sequence of thought-tool-input tuples. Besides the 3 tasks we define, MetaTool can be easily generalized to other tool-oriented tasks by writing the system prompt and developing the tool functions or APIs as the external environment. The meta-task data can then be generated automatically.

4.2 Implementation Details

Our model is fine-tuned based on LLaMA3-8b-instruct AI@Meta (2024) with the parameter-efficient fine-tuning method QLoRA Dettmers et al. (2024) on 8 A100 GPUs. We utilize the instruction-tuned version of LLaMA3 since comprehending tool-oriented tasks with specific objectives is the basis of tool understanding and manipulation. For tool-oriented task training, we construct instruction-solution pairs from the task goals and annotated solutions. For meta-task training, we formulate the task objectives and the tool execution outcome as question-answering pairs. For each task, we train the model on 10k meta-task data and 10k solution data for 3 epochs with the AdamW optimizer and a learning rate of 2e-4. The models are tested in a simulated environment that receives the action of using a tool and returns the outcome and current state. We evaluate the model performance on 100 unseen cases for each of the three tasks.
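
The evaluation protocol described above, where an agent interacts with a simulated environment and success is counted per case, might be sketched as follows. The function `success_rate`, the step budget, and the toy counter environment are hypothetical; the real agent would be the fine-tuned LLM and the environment would wrap the task's tool functions.

```python
def success_rate(cases, agent, env_step, max_steps=30):
    """Percentage of cases in which the agent reaches the goal in time."""
    successes = 0
    for initial_state, is_goal in cases:
        state = initial_state
        for _ in range(max_steps):
            if is_goal(state):
                break
            # The environment receives the action and returns the new state.
            state = env_step(state, agent(state))
        successes += bool(is_goal(state))
    return 100.0 * successes / len(cases)


# Toy check: a counter environment and an agent that always increments.
cases = [(0, lambda s: s == 2), (0, lambda s: s == -1)]
sr = success_rate(cases, agent=lambda s: "inc",
                  env_step=lambda s, a: s + 1 if a == "inc" else s - 1)
```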

Models               SAW    BW     LOG
GPT-3.5-turbo        22.6   27.0   51.0
GPT-3.5-turbo-IC     20.2   21.0   43.0
GPT-4                28.6   88.0   46.0
GPT-4-IC             27.4   80.0   37.0
Vicuna-7b             4.8   17.0    0.0
LLaMA3-8b-instruct    6.0   19.0    6.0
LLaMA3-IC             4.8   18.0    2.0
LLaMA3-SS             9.5   22.0   12.0
MetaTool             32.1   38.0   29.0
Table 1: Overall comparison of success rates (SR%). IC: in-context learning, SS: self-supervised learning
E D I O C S SAW BW LOG
✗ 15.5 31.0 10.0
✗ ✗ ✗ ✗ ✗ 9.5 21.0 8.0
✗ 9.5 27.0 11.0
✗ 18.5 29.0 9.0
✗ 17.3 32.0 18.0
✗ 16.1 32.0 6.0
✗ 19.6 37.0 14.0
✓ ✓ ✓ ✓ ✓ ✓ 32.1 38.0 29.0
Table 2: Ablation results. E: effect meta-set, D: decision-making meta-set, I: input boundary meta-set, O: output boundary meta-set, C: counterfact meta-set, S: solution dataset.

4.3 Results Analysis

Overall comparison. We evaluate the success rate (SR%) of completing each task and show the performances of several models in Table 1. Overall, SOTA closed-source LLMs show impressive zero-shot performance on tool-oriented tasks compared with open-source LLMs including LLaMA3 and Vicuna. By training on both meta-tasks and solution data, our model MetaTool gains a significant improvement (+22.7% SR on average) over the LLaMA3 baseline and surpasses GPT-4 on SAW (+3.5% SR) and GPT-3.5-turbo on BW (+11.0% SR). Both GPT and LLaMA3 perform worse when provided with meta-task demonstrations (IC), since demonstrating a limited number of cases can be redundant or misleading without proper design. LLaMA3-SS, which is trained on meta-tasks first, gains only limited improvement over the baseline. We conjecture that learning meta-tasks without practicing tool manipulation (training on action sequences) cannot effectively turn tool understanding into tool-use ability. In addition, fine-tuning on specific QA data may affect the basic linguistic ability of the model.
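
The headline numbers in this paragraph follow directly from Table 1, as the short check below shows. Only the rows needed for the comparison are copied, and the dictionary layout is just a convenience for this illustration.

```python
# Success rates (SAW, BW, LOG) copied from Table 1.
table1 = {
    "GPT-3.5-turbo": (22.6, 27.0, 51.0),
    "GPT-4": (28.6, 88.0, 46.0),
    "LLaMA3-8b-instruct": (6.0, 19.0, 6.0),
    "MetaTool": (32.1, 38.0, 29.0),
}


def mean(values):
    return sum(values) / len(values)


# Average gain of MetaTool over the LLaMA3 baseline across the three tasks.
avg_gain = mean(table1["MetaTool"]) - mean(table1["LLaMA3-8b-instruct"])
# Margin over GPT-4 on SAW and over GPT-3.5-turbo on BW.
saw_margin = table1["MetaTool"][0] - table1["GPT-4"][0]
bw_margin = table1["MetaTool"][1] - table1["GPT-3.5-turbo"][1]
```

Rounded to one decimal place, these give the +22.7, +3.5, and +11.0 points quoted in the text.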

Figure 4: Case study of MetaTool compared with 2 baselines on BlocksWorld task. Actions in red denote invalid ones (e.g. pick up a block at the bottom). LLaMA3-solution is the LLaMA model trained on task solution data.

Ablation study. We study the ablation of different data components and report the performances in Table 2. Merely training on solution data improves the model performance to an extent compared with the baseline LLaMA3 (+8.5% on average). It is worth noting that training only on meta-tasks can improve the model’s zero-shot performance on tool-oriented tasks (row 2 of Table 2), in contrast to providing demonstrations of meta-tasks in the system prompt (LLaMA3-IC in Table 1). When the QA data of each meta-task is removed, the model performance shows varying degrees of degradation, which verifies the benefit of the meta-tasks. The meta-tasks of effect and decision-making have a relatively greater influence on the model’s tool understanding capability. Theoretically, these meta-tasks represent the causal mechanism of tools, which is the basis of high-level understanding or skills.

Case study. As showcased in Figure 4, the agent is required to construct stacks containing a blue block on top of a yellow block from a pile of 4 blocks. With mere descriptions of tools in the prompts, LLaMA3 fails to understand the preconditions of using tools, resulting in invalid actions. Trained on tool-oriented solution data, LLaMA3-solution successfully lifts the yellow block but fails to achieve the task goal step by step and falls into repetitive loops. The proposed MetaTool model achieves the target state with an effective action sequence (although still not with optimal efficiency) and corresponding reasoning. These 3 models correspond to the 3 paradigms illustrated in Figure 1. The results show that LLMs can learn tool manipulation better on the basis of robust tool understanding.

5 Conclusion and Future Discussion

In this work, we show that tool understanding can be disentangled from other agentic capabilities, such as planning and decision-making, that facilitate tool-use tasks. While multiple abilities have emerged in prior open-source LLMs, comprehending tools as external mechanisms and functions is the foundation of tool usage and remains insufficient. We propose the MetaTool method to train an LLM with self-supervised surrogate tasks concerning the functionality and causality of tools. Our model achieves a 22.7% SR improvement on average on three tool-oriented tasks and showcases effective tool manipulation ability.

Despite the availability of some tool learning methods, the gap between open-source large language models and SOTA closed-source models remains non-negligible. One potential way to close that gap is to improve LLMs’ zero/few-shot tool understanding capability so that they can effectively infer tool usage and functionality from the prompts. Another future direction for learning complex tools is to improve data efficiency so that the model can be trained to master certain downstream tasks with minimum effort. Generally, we propose MetaTool as a methodology to pave the way for future tool-use research.

References

  • AI@Meta (2024) AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
  • Argaw et al. (2022) Dawit Mureja Argaw, Fabian Caba Heilbron, Joon-Young Lee, Markus Woodson, and In So Kweon. The anatomy of video editing: A dataset and benchmark suite for ai-assisted video editing. In European Conference on Computer Vision, pp.  201–218. Springer, 2022.
  • Bareinboim et al. (2015) Elias Bareinboim, Andrew Forney, and Judea Pearl. Bandits with unobserved confounders: A causal approach. Advances in Neural Information Processing Systems, 28, 2015.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chi et al. (2023) Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.
  • Deng et al. (2024) Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024.
  • Dettmers et al. (2024) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
  • Fu et al. (2023) Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102, 2023.
  • Gao et al. (2023) Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, and Mike Zheng Shou. Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn. arXiv preprint arXiv:2306.08640, 2023.
  • Gupta & Kembhavi (2023) Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14953–14962, 2023.
  • Hang et al. (2024) Tiankai Hang, Shuyang Gu, Dong Chen, Xin Geng, and Baining Guo. Cca: Collaborative competitive agents for image editing. arXiv preprint arXiv:2401.13011, 2024.
  • Hao et al. (2024) Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. Advances in neural information processing systems, 36, 2024.
  • He et al. (2024) Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024.
  • He et al. (2022) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  16000–16009, 2022.
  • Hernik & Csibra (2009) Mikolaj Hernik and Gergely Csibra. Functional understanding facilitates learning about tools in human children. Current opinion in neurobiology, 19(1):34–38, 2009.
  • Hong et al. (2024) Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14281–14290, 2024.
  • Nakano et al. (2021) Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
  • Patil et al. (2023) Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023.
  • Pearl (2009) Judea Pearl. Causal inference in statistics: An overview. 2009.
  • Pearl & Mackenzie (2018) Judea Pearl and Dana Mackenzie. The book of why: the new science of cause and effect. Basic books, 2018.
  • Qian et al. (2023) Cheng Qian, Chi Han, Yi R Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. Creator: Disentangling abstract and concrete reasonings of large language models through tool creation. arXiv preprint arXiv:2305.14318, 2023.
  • Qin et al. (2023a) Yujia Qin, Zihan Cai, Dian Jin, Lan Yan, Shihao Liang, Kunlun Zhu, Yankai Lin, Xu Han, Ning Ding, Huadong Wang, et al. Webcpm: Interactive web search for chinese long-form question answering. arXiv preprint arXiv:2305.06849, 2023a.
  • Qin et al. (2023b) Yujia Qin, Shengding Hu, Yankai Lin, et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354, 2023b.
  • Qin et al. (2023c) Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023c.
  • Rawles et al. (2023) Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088, 2023.
  • Schick et al. (2024) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2024.
  • Shen et al. (2024) Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36, 2024.
  • Song et al. (2023) Yifan Song, Weimin Xiong, Dawei Zhu, Cheng Li, Ke Wang, Ye Tian, and Sujian Li. Restgpt: Connecting large language models with real-world applications via restful apis. arXiv preprint arXiv:2306.06624, 2023.
  • Valmeekam et al. (2024) Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change. Advances in Neural Information Processing Systems, 36, 2024.
  • Vemprala et al. (2024) Sai H Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. Chatgpt for robotics: Design principles and model abilities. IEEE Access, 2024.
  • Wang et al. (2024a) Bryan Wang, Yuliang Li, Zhaoyang Lv, Haijun Xia, Yan Xu, and Raj Sodhi. Lave: Llm-powered agent assistance and language augmentation for video editing. arXiv preprint arXiv:2402.10294, 2024a.
  • Wang et al. (2024b) Xiaohan Wang, Yuehu Liu, Xinhang Song, Yuyi Liu, Sixian Zhang, and Shuqiang Jiang. An interactive navigation method with effect-oriented affordance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  16446–16456, 2024b.
  • Wang et al. (2024c) Xiaohan Wang, Yuehu Liu, Xinhang Song, Beibei Wang, and Shuqiang Jiang. Camp: Causal multi-policy planning for interactive navigation in multi-room scenes. Advances in Neural Information Processing Systems, 36, 2024c.
  • Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  • Wu et al. (2023) Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.
  • Xu et al. (2023) Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. On the tool manipulation capability of open-source large language models. arXiv preprint arXiv:2305.16504, 2023.
  • Yang et al. (2023) Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023.
  • Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
  • Zhang & Bareinboim (2016) Junzhe Zhang and Elias Bareinboim. Markov decision processes with unobserved confounders: A causal approach. Purdue AI Lab, West Lafayette, IN, USA, Tech. Rep, 2016.
  • Zhou et al. (2022) Shuyan Zhou, Uri Alon, Frank F Xu, Zhiruo Wang, Zhengbao Jiang, and Graham Neubig. Docprompting: Generating code by retrieving the docs. arXiv preprint arXiv:2207.05987, 2022.
  • Zhou et al. (2023) Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.
  • Zhuang et al. (2024) Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. Toolqa: A dataset for llm question answering with external tools. Advances in Neural Information Processing Systems, 36, 2024.