AutoFlow: Automated Workflow Generation for
Large Language Model Agents

Zelong Li
Rutgers University
Shuyuan Xu
Rutgers University
Kai Mei
Rutgers University
Wenyue Hua
Rutgers University
Balaji Rama
Independent Researcher
Om Raheja
Independent Researcher
Hao Wang
Rutgers University
He Zhu
Rutgers University
Yongfeng Zhang
Rutgers University
Abstract

Recent advancements in Large Language Models (LLMs) have shown significant progress in understanding complex natural language. One important application of LLMs is the LLM-based AI agent, which leverages the abilities of LLMs together with external tools to solve complex tasks. To ensure that LLM agents follow an effective and reliable procedure when solving a given task, manually designed workflows are usually used to guide the agents' working mechanism. However, manually designing workflows requires considerable effort and domain knowledge, making it difficult to develop and deploy agents at massive scale. To address these issues, we propose AutoFlow, a framework designed to automatically generate workflows for agents to solve complex tasks. AutoFlow uses natural language programs as the format of agent workflows and employs a workflow optimization procedure to iteratively improve workflow quality. In addition, this work offers two workflow generation methods, a fine-tuning-based method and an in-context-based method, making the AutoFlow framework applicable to both open-source and closed-source LLMs. Experimental results show that our framework can produce robust and reliable agent workflows. We believe that the automatic generation and interpretation of workflows in natural language represent a promising paradigm for solving complex tasks, particularly with the rapid development of LLMs. The source code of this work is available at https://github.com/agiresearch/AutoFlow.

1 Introduction

Recent advancements in Large Language Models (LLMs) have demonstrated substantial progress in understanding and processing complex natural language. These developments have opened up a wide array of applications, among which the deployment of LLM-based AI agents stands out. These agents leverage the capabilities of LLMs along with external tools to tackle intricate tasks, ranging from data analysis [8], software development [24, 36], scientific research [3], travel planning [47] to many other decision-making processes in various domains.

One of the critical aspects of ensuring that LLM-based AI agents operate effectively and reliably is the design of workflows that guide their task-solving procedures. For example, an LLM-based agent for fake news detection may execute under the following workflow designed by information and communication experts [25]: 1) Check the URL, 2) Check the language, 3) Commonsense evaluation, 4) Standpoint evaluation, 5) Summarize the findings, and 6) Classification. The agent executes the workflow step by step, and each step may call the LLM or external tools to gather useful information for the final summarization and classification.

Traditionally, these workflows are manually crafted, requiring significant effort and deep domain knowledge. This manual process poses a substantial barrier to the large-scale development and deployment of AI agents, as it is both time-consuming and resource-intensive.

To address the challenges associated with manual workflow design, this paper proposes AutoFlow, a novel framework aimed at the automatic generation of workflows for AI agents to solve complex tasks. AutoFlow represents workflows in the form of natural language programs [48], facilitating easier comprehension and interaction. Central to AutoFlow is a workflow optimization procedure that iteratively refines the quality of the generated workflows, ensuring robustness and reliability.

Technically, AutoFlow introduces two workflow generation methods: a fine-tuning-based method and an in-context-based method. The fine-tuning-based approach customizes the workflow generation process for specific tasks and domains by adjusting the parameters of the LLM. In contrast, the in-context-based method uses contextual information to guide the generation process without the need for extensive fine-tuning, making it suitable for both open-source and closed-source LLMs. More specifically, as shown in Figure 1, the user provides a workflow generation query that describes the type of tasks. Based on the query, the generator LLM generates a workflow, and the frozen interpreter LLM executes the generated workflow on the dataset, with the evaluated performance serving as the reward. Then, AutoFlow uses reinforcement learning (RL) to update the generator LLM with the reward. This process constitutes one training iteration, and the generator LLM is expected to learn to generate effective workflows after several iterations.

Figure 1: The overall generation process of AutoFlow using reinforcement learning reward for LLMs

Our experimental results validate the effectiveness of the AutoFlow framework: the workflows generated by AutoFlow outperform manually designed ones while remaining readable, and they enable AI agents to perform complex tasks with a high degree of reliability. The automatic generation and interpretation of workflows in natural language not only streamline the development process but also represent a promising paradigm for addressing complex problems, especially in the context of the rapid evolution of LLM technologies. In summary, this paper makes the following contributions:

  • We introduce AutoFlow, a framework that automatically generates workflows in natural language so that the workflows can be precisely interpreted by LLMs while reducing human effort.

  • We propose two methods, the fine-tuning method and the in-context learning method, to incorporate RL in the workflow generation process for both open-source and closed-source LLMs.

  • We conduct experiments on benchmark tasks to validate the AutoFlow framework, showing that it achieves higher valid plan rates and better overall performance while keeping the generated natural language workflows readable by humans.

In the following part of this paper, we first review the related work in Section 2. In Section 3, we introduce how to represent workflows in natural language and our motivations. In Section 4, we demonstrate the detailed design of our AutoFlow framework, including two learning methods, fine-tuning and in-context learning methods. We provide and analyze the experimental results on benchmark datasets in Section 5, and finally conclude our work and suggest potential avenues for future research in Section 6.

2 Related Work

2.1 LLM Agents and Workflow

An AI agent is an autonomous entity capable of making decisions and executing actions in a given environment to effectively handle various complex tasks [31, 9, 39, 46]. Recently, with the rapid advancement of Large Language Models (LLMs), LLM-based AI agents have become an important type of agent for complex task solving [8, 24, 36], such as reasoning, planning, and coding.

Reasoning: LLMs typically break down complex tasks into a series of steps, constituting a chain of reasoning [41]. Approaches such as Chain of Thought (CoT) and its derivatives [41, 22], including tree [50] and graph structures [2], are commonly used. The self-consistency method [40] samples multiple reasoning paths and selects the most consistent outcome through voting.

Planning: Planning tasks require LLMs to generate a sequence of actions to achieve specific goals [10]. Recent studies have designed platforms to test LLMs’ planning abilities in areas such as expert model integration [8], travel task planning [47], and tool usage [53]. However, a known issue is that LLMs may generate non-executable, invalid, or grammatically wrong plans, such as using a piece of text as input to an image-processing tool. To address this problem, some studies [8, 53] apply a post-processing method that uses the LLM itself as a parser to extract a chain of tools from the generated text. Further, recent attempts integrate finite state machines into LLMs to enhance human controllability of LLM planning [26, 45]. The ReAct approach [51] also uses external tools such as search engines to improve LLM planning. In this work, we build on these ideas to enhance the executability of the generated workflows.

Coding: LLMs can generate code to solve complex tasks, reducing the need for manual programming [30, 48, 16, 28, 6, 35, 32, 4]. However, the generated code may contain errors or fail to meet user requirements. To mitigate these issues, workflow-based methods have been proposed, including manually designed and automatically generated workflows [17, 44, 54]. Another research direction involves using LLMs for natural language programming, leveraging their strong natural language understanding abilities. A notable example is the CoRE language [48], which unifies natural language programming, pseudo-code programming, and workflow programming under the same framework using LLM as interpreter. Our work follows the workflow concept in natural language programming and develops an automated workflow generation framework to reduce human labor.

2.2 Automated Machine Learning

Automated Machine Learning (AutoML) aims to reduce the human labor involved in designing and deploying machine learning techniques, simplifying the application of ML to real-world problems. There are three main types of AutoML techniques [49, 27]:

Automated Model Selection: Tools such as Auto-sklearn [7] and Auto-WEKA [23] automatically select the best machine learning model from a library of models and hyper-parameter settings.

Automated Feature Engineering: Tools such as Data Science Machine [18], ExploreKit [19], and VEST [5] generate or select useful features without manual intervention, since feature engineering significantly impacts model performance in many applications.

Neural Architecture Search (NAS): Methods such as ENAS [34], DARTS [29], NASH [38], GNAS [13], and AmoebaNet-A [37] discover effective neural network architectures for specific tasks without manual design. Experiments show that networks generated through NAS can match or even outperform human-designed architectures across various tasks.

AutoML systems typically involve two main components for training: a controller, which is a machine learning model responsible for sampling model selections, and a child model, which comprises the parameters of the machine learning model to be created and used for the task at hand. In our work, we follow this training paradigm, using a workflow generator LLM as the controller, and the generated workflow along with a workflow interpreter LLM as the child model. More details of the proposed technique are introduced in Section 4.

3 Preliminary and Background

3.1 Natural Language Programs as Workflows

In this section, we introduce how to use natural language programs as a representation of workflows. Specifically, we will use the Code Representation and Execution (CoRE) system [48] as an example to show how to construct workflows as natural language programs and how the LLM Agent follows the workflow by executing the natural language program.

3.1.1 CoRE Language Syntax

The CoRE language defines four components to organize workflows as natural language instructions.

  • Step Name is used to uniquely identify each step of the workflow.

  • Step Type defines the type of instruction for each step. There are three different types of steps:

    - Process: The process step transitions to the next specified step after executing the current step.

    - Decision: Similar to conditional statements (e.g., “if-else”), the decision step is used for branching the program flow based on evaluated conditions.

    - Terminal: The terminal step represents the end of the program.

  • Step Instruction is a natural language instruction to be executed in the step.

  • Step Connection points to the next step, which establishes the program execution flow.

An example workflow for image-text processing on the OpenAGI benchmark is shown as follows:

Step 1:::Process:::Identify the input data type based on the objective.:::next::Step 2
Step 2:::Process:::Identify the output data type based on the objective.:::next::Step 3
Step 3:::Process:::Select tools in the provided tool list to generate a plan.:::next::Step 4
Step 4:::Decision:::Check whether every tool in the plan is in the provided tool list.:::Yes::Step 5::No::Step 3
Step 5:::Decision:::Check whether the output data type of the previous tool is the input data type of the next tool.:::Yes::Step 6::No::Step 3
Step 6:::Terminal:::Output the plan by listing the tool names.:::

In this paper, we use ‘:::’ to delimit the above four components in each step.
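The four-component syntax can be read mechanically. The following is a minimal parsing sketch, not the CoRE implementation: the `Step` class and the function names are hypothetical, and the handling of `::` inside the connection field is inferred from the example workflow above rather than from a formal grammar.

```python
from dataclasses import dataclass

# The example workflow from this section (Step 5 written on one line).
EXAMPLE_WORKFLOW = """\
Step 1:::Process:::Identify the input data type based on the objective.:::next::Step 2
Step 2:::Process:::Identify the output data type based on the objective.:::next::Step 3
Step 3:::Process:::Select tools in the provided tool list to generate a plan.:::next::Step 4
Step 4:::Decision:::Check whether every tool in the plan is in the provided tool list.:::Yes::Step 5::No::Step 3
Step 5:::Decision:::Check whether the output data type of the previous tool is the input data type of the next tool.:::Yes::Step 6::No::Step 3
Step 6:::Terminal:::Output the plan by listing the tool names.:::
"""


@dataclass
class Step:
    name: str          # e.g. "Step 4"
    step_type: str     # "Process", "Decision", or "Terminal"
    instruction: str   # natural language instruction
    connection: dict   # branch label -> next step name (empty for Terminal)


def parse_step(line: str) -> Step:
    """Split one workflow line into its four ':::'-delimited components."""
    name, step_type, instruction, connection = (
        part.strip() for part in line.split(":::")
    )
    # Assumption: inside the connection field, '::' alternates branch labels
    # and target steps, e.g. "Yes::Step 5::No::Step 3" or "next::Step 2".
    tokens = [t.strip() for t in connection.split("::")] if connection else []
    branches = dict(zip(tokens[0::2], tokens[1::2]))
    return Step(name, step_type, instruction, branches)


def parse_workflow(text: str) -> dict:
    """Parse a multi-line workflow into a mapping from step name to Step."""
    steps = [parse_step(line) for line in text.splitlines() if line.strip()]
    return {step.name: step for step in steps}


workflow = parse_workflow(EXAMPLE_WORKFLOW)
```

A line that does not contain exactly four components raises a `ValueError` during unpacking, which is one simple way to flag a malformed step.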

3.1.2 LLM as Interpreter for Workflow Execution

To process and execute the workflow in the CoRE language, the system uses an LLM as an interpreter. The LLM interpreter executes instructions step by step. Concretely, the execution of one step can be divided into four procedures in the CoRE system.

First, the LLM decides which information from memory may be needed to execute the current step and retrieves it. Second, the system integrates the retrieved information with the instruction of that step into a structured prompt, which the LLM processes to generate an initial response. Third, to extend the LLM’s capability, the system may use external tools: based on the initial response, the LLM determines whether an external tool is required, and if so, it decides the tool name and tool arguments, executes the tool, and incorporates the results into memory. Finally, the LLM decides which step to execute next based on the output of the current step.
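The per-step procedure described above can be sketched as a small loop. This is an illustrative simplification under stated assumptions, not the CoRE system's code: `llm` is a hypothetical callable mapping a prompt string to a response string, `tools` maps hypothetical tool names to single-argument callables, the prompt wording is invented, and the tool argument is simply the current response rather than arguments chosen by the LLM.

```python
def run_workflow(steps, llm, tools, task, max_steps=50):
    """Execute a workflow step by step with an LLM acting as interpreter.

    steps: dict mapping step name -> (step_type, instruction, branches)
    llm:   hypothetical callable, prompt string -> response string
    tools: dict mapping tool name -> callable taking one string argument
    """
    memory = [f"Task: {task}"]
    current = "Step 1"
    for _ in range(max_steps):
        step_type, instruction, branches = steps[current]

        # 1) Ask the LLM which memory entries are relevant to this step.
        relevant = llm(
            f"Select relevant notes for: {instruction}\n" + "\n".join(memory)
        )

        # 2) Combine retrieved information and instruction into one prompt.
        response = llm(f"Context:\n{relevant}\nInstruction: {instruction}")

        # 3) Optionally call an external tool and keep its result.
        tool_name = llm(
            f"Name one tool from {sorted(tools)} or 'none' for: {response}"
        ).strip()
        if tool_name in tools:
            response = tools[tool_name](response)
        memory.append(f"{current}: {response}")

        # 4) Decide which step to execute next.
        if step_type == "Terminal":
            return response, memory
        if step_type == "Process":
            current = branches["next"]
        else:  # Decision: the LLM picks one of the branch labels
            label = llm(
                f"Answer with one of {sorted(branches)}: {instruction}\n{response}"
            ).strip()
            current = branches[label]  # KeyError if the label is not a branch
    raise RuntimeError("workflow did not terminate within max_steps")
```

The `max_steps` guard is an added safety measure for workflows whose decision steps loop back, as Steps 4 and 5 do in the earlier example.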

3.2 Motivation

The CoRE system enables users to write workflows in natural language, which unifies natural language programming, pseudo-code programming, and workflow programming. Although the entry barrier is lower than coding in programming languages, constructing workflows in natural language still requires much human labor and domain expertise. Inspired by Automated Machine Learning (AutoML) [14], we would like to automatically learn the best workflow based on the given task and training data. Since the instructions in the CoRE language are written in natural language and LLMs have a strong ability for natural language understanding, we also use an LLM as the workflow generator. To distinguish it from the interpreter LLM described in Section 3.1, we denote the LLM that learns to generate workflows as the Workflow Generator LLM, and name the LLM that interprets and executes workflows the Workflow Interpreter LLM, consistent with Figure 1. In this way, users only need to provide a high-level description of the task and the corresponding dataset, and the generator LLM can generate a workflow in the CoRE language for the interpreter LLM to execute on the given task. This process is intended to minimize human effort and to search automatically for the best workflow, regardless of the user's knowledge of workflow design.

4 The AutoFlow Framework

(a) AutoFlow generation process based on fine-tuning method with RL reward for open-source LLMs.
(b) AutoFlow generation process based on in-context learning with RL reward for closed-source LLMs.
Figure 2: Overview for workflow generation with AutoFlow, using OpenAGI [8] tasks as an example

In this section, we introduce the two methods of applying the AutoFlow framework to the workflow generator LLM, i.e., the fine-tuning method for open-source LLMs and the in-context learning method for closed-source LLMs.

4.1 Fine-tuning Method for Workflow Generation with Open-source LLMs

We use a LoRA adapter [12] for fine-tuning open-source LLMs as workflow generators. The training process is shown in Figure 2.

First, the workflow generator LLM receives a few-shot example workflow and a description of the task from users as the input query. Although the CoRE language has minimal grammar requirements and the instructions are written in natural language, which can be well learned and generated by LLMs, an example workflow can help the workflow generator LLM better understand the grammar of the CoRE language. The natural language description of the task helps the generator LLM understand the application scenarios of the workflow to be generated. Taking the text and image processing tasks in the OpenAGI benchmark [8] as an example, the task description could be “Provide a workflow with several steps. The workflow can guide the LLM to design plans for a type of complex tasks related to text and image processing using the provided tools”.

Second, the generator LLM produces an executable workflow based on the input query. Closed-source LLMs such as GPT-4 can directly generate a grammatically valid workflow given the few-shot example. However, open-source LLMs such as Mixtral-8x7B cannot consistently generate grammatically valid workflows even if few-shot example workflows are provided. To solve the problem, we follow the post-processing strategy in previous work [8, 53] and use GPT-4 as a parser to revise the output workflow into a grammatically valid one.
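To make "grammatically valid" concrete, the sketch below checks a generated workflow against the syntax of Section 3.1.1. It is an assumption-laden illustration rather than part of AutoFlow: the paper relies on GPT-4 to revise invalid output, whereas this hypothetical `grammar_errors` helper only reports problems, and the specific rules it enforces are inferred from the example workflow.

```python
VALID_TYPES = {"Process", "Decision", "Terminal"}


def grammar_errors(workflow_text):
    """Return a list of human-readable grammar problems (empty if none found)."""
    errors, names, targets = [], set(), []
    lines = [line for line in workflow_text.splitlines() if line.strip()]
    for number, line in enumerate(lines, 1):
        parts = [p.strip() for p in line.split(":::")]
        if len(parts) != 4:
            errors.append(
                f"line {number}: expected 4 ':::'-separated fields, got {len(parts)}"
            )
            continue
        name, step_type, instruction, connection = parts
        names.add(name)
        if step_type not in VALID_TYPES:
            errors.append(f"{name}: unknown step type {step_type!r}")
        if not instruction:
            errors.append(f"{name}: empty instruction")
        tokens = connection.split("::") if connection else []
        # Assumption: non-terminal steps need label/target pairs such as
        # "next::Step 2" or "Yes::Step 5::No::Step 3".
        if step_type != "Terminal" and (not tokens or len(tokens) % 2):
            errors.append(f"{name}: malformed connection {connection!r}")
        targets += [(name, t.strip()) for t in tokens[1::2]]
    # Every connection must point at a step that is defined in the workflow.
    errors += [
        f"{src}: unknown target {dst!r}" for src, dst in targets if dst not in names
    ]
    return errors
```

A check of this kind could decide whether a parser pass is needed at all, although the source does not state that AutoFlow performs such a check.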

Third, the generated workflow will be executed by the interpreter LLM to obtain its performance on the validation dataset. Then, the generator LLM is updated based on the workflow’s performance on the validation dataset. Specifically, we use reinforcement learning (RL) to update the parameters of the LoRA adapter of the generator LLM, with the average metrics of all data instances on the validation dataset as the reward.

These three steps together constitute one iteration of the fine-tuning process. The fine-tuning process terminates and the final workflow is produced when the termination condition is met, namely when the difference in reward between two consecutive iterations is smaller than a threshold. After the iterative optimization process, the workflow generator LLM produces the optimal workflow for the task based on the execution feedback.
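The iterate, score, update, and stop pattern can be illustrated with a toy example. This sketch replaces the generator LLM and its LoRA adapter with a softmax policy over a fixed list of candidate workflows, uses plain gradient ascent rather than the optimizer used in the experiments, and treats `candidate_scores` as a stand-in for validation-set performance; all names and default values are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr):
    """One REINFORCE update: logits += lr * reward * grad log pi(action).

    For a softmax policy, d log pi(action) / d logit_j = 1[j == action] - p_j.
    """
    probs = softmax(logits)
    return [
        x + lr * reward * ((1.0 if j == action else 0.0) - p)
        for j, (x, p) in enumerate(zip(logits, probs))
    ]


def optimize(candidate_scores, lr=0.5, threshold=1e-4, max_iters=30, seed=0):
    """Toy loop: sample a workflow, score it, update the policy, check the stop rule."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidate_scores)
    prev_reward = None
    for iteration in range(1, max_iters + 1):
        # Sample one candidate workflow index from the current policy.
        action = rng.choices(range(len(logits)), weights=softmax(logits))[0]
        # Stand-in for executing the workflow and averaging the metric.
        reward = candidate_scores[action]
        logits = reinforce_step(logits, action, reward, lr)
        # Stop when the reward changes by less than the threshold.
        if prev_reward is not None and abs(reward - prev_reward) < threshold:
            break
        prev_reward = reward
    return action, reward, iteration
```

With deterministic scores, the stop rule fires as soon as the same candidate is sampled twice in a row, so the toy mainly shows the control flow rather than realistic convergence behavior.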

4.2 In-context Learning Method for Workflow Generation with Closed-source LLMs

For closed-source LLMs such as GPT-4, we use in-context learning to avoid fine-tuning the parameters. As shown in Figure 2, the AutoFlow framework again requires an example workflow and a description of the task, and feeds them as the input query to the workflow generator LLM. After GPT-4 generates the workflow, we do not use a parser to revise it, since GPT-4 can follow the CoRE grammar demonstrated by the example workflow well. Then, the interpreter LLM executes the workflow to evaluate its performance on the validation dataset as the reward, which is the same process as in the fine-tuning method. The difference is that, in the next step, the AutoFlow framework directly includes the reward value in the query and prompts the generator LLM to generate a new workflow given the performance of the previously generated workflow, such as “The execution performance of the previous workflow is 0.6415. Provide a new workflow that can gain a better performance”. The whole process is demonstrated in Figure 2.

We will show in the experiments that closed-source LLMs such as GPT-4 can make good use of the reward values in the prompt to refine the workflow and finally obtain the optimal workflow with the in-context learning method.
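The feedback loop of the in-context method can be sketched as follows. The `generator` and `evaluate` callables are hypothetical stand-ins for the closed-source generator LLM and for the interpreter's validation-set evaluation. Including the previous workflow text in the prompt and keeping track of the best workflow seen so far are assumptions of this sketch, not details stated in the paper; only the feedback sentence follows the quoted wording.

```python
def in_context_search(generator, evaluate, task_description, example_workflow,
                      iterations=30):
    """Iteratively query a generator, feeding back the previous reward in the prompt.

    generator: hypothetical callable, prompt string -> workflow string
    evaluate:  hypothetical callable, workflow string -> float score
    """
    base = f"{task_description}\nExample workflow:\n{example_workflow}\n"
    prompt = base
    best_workflow, best_reward = None, float("-inf")
    for _ in range(iterations):
        workflow = generator(prompt)
        reward = evaluate(workflow)
        # Assumption: remember the best-scoring workflow across iterations.
        if reward > best_reward:
            best_workflow, best_reward = workflow, reward
        # The reward is placed in the next query instead of updating parameters.
        prompt = (
            base
            + f"Previous workflow:\n{workflow}\n"
            + f"The execution performance of the previous workflow is {reward:.4f}. "
            + "Provide a new workflow that can gain a better performance."
        )
    return best_workflow, best_reward
```

Because no parameters change, the only state carried between iterations is the prompt itself, which is what makes the approach usable with models that expose only a text interface.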

5 Experiments

5.1 Backbone Large Language Model (LLM)

We conduct experiments on both closed-source and open-source LLMs:

  • GPT-4 [33] (Closed-source) is a generative pre-trained transformer of OpenAI. In this work, we use the GPT-4-1106-preview version.

  • Mixtral-8x7B [15] (Open-source) is a pre-trained generative Sparse Mixture of Experts with 46.7 billion parameters.

In our experiment, we apply these two types of LLMs for both workflow generator LLM and interpreter LLM. Thus, there are four combinations in total.

5.2 Planning Schema of LLMs

We adopt the following LLM-based agent planning schema:

  • Zero-shot Learning (Zero) directly inputs the query to the LLM.

  • Chain-of-Thought (CoT) [41] induces the LLM to generate a coherent language sequence that serves as a meaningful intermediate step bridging the input query and the output answer.

  • Few-shot Learning (Few) presents a set of high-quality demonstrations in the prompt, each consisting of both input and desired output on the target task.

  • CoRE [48] uses a manually designed workflow with LLM as an interpreter.

  • AutoFlow is our proposed framework that can automatically generate workflows.

5.3 Benchmark Datasets

We conduct experiments on a benchmark dataset, OpenAGI [8]. The OpenAGI benchmark tasks are categorized based on their output type and ground-truth label type (Task 1, 2, and 3). Then, based on different task types, different metrics are employed to gauge the performance: CLIP Score [11], assessing the similarity between text and image, is utilized for Text-to-Image tasks (Task 1); BERT Score [55], evaluating text generation with BERT, is applied when both data labels and the expected outputs are texts (Task 2); and ViT Score [43] gauges the similarity between the image label and image output (Task 3).

5.4 Implementation Details

Our framework and all baselines are implemented in PyTorch, an open-source library. We follow the implementation settings of the OpenAGI platform [8] for zero-shot and few-shot learning. We leverage the DSPy framework [20, 21] to apply the CoT strategy to the OpenAGI platform. We also tried the Program-of-Thought [6] and ReAct [52] strategies on the OpenAGI platform. However, the ReAct strategy requires text observations, which is unsuitable for our OpenAGI tasks since some observations are in image format, and Program-of-Thought cannot generate executable code. Thus, we did not include them as baselines.

For the hyper-parameter settings of the AutoFlow framework, we set the number of iterations for the workflow generator LLM to 30. When the open-source LLM, Mixtral, serves as the generator LLM, we use REINFORCE [42] as the core reinforcement learning (RL) algorithm, with the average score on the training dataset as the reward. We use Adam as the optimizer with a learning rate of 0.001 for RL. We also apply Low-Rank Adaptation (LoRA) [12] with rank 8 to Mixtral for efficient fine-tuning.

5.5 Experimental Analysis

We conduct the experiments on the OpenAGI [8] benchmark dataset. For a fair comparison, each table shows the results obtained with the same workflow interpreter LLM. Specifically, the results of using the open-source LLM, Mixtral, as the LLM interpreter are shown in Table 1, and the results of using the closed-source LLM, GPT-4, as the LLM interpreter are shown in Table 2. Each row stands for a type of task, and each column represents a planning schema of the LLM interpreter. From these two tables, we can see that, after applying our AutoFlow framework, the average score over tasks is significantly better than the baselines. Compared to the best baseline, CoRE, AutoFlow has over 40% improvement when using Mixtral as the LLM interpreter, and over 5% improvement when using GPT-4 as the interpreter LLM. For the score of each type of task, our AutoFlow also reaches the highest one. Thus, the experimental results validate that AutoFlow is effective and can generate workflows with better performance than manually designed ones.

An interesting observation is that the best average score when using Mixtral as the LLM interpreter is achieved by AutoFlow with GPT-4 as the workflow generator, while the best average score when using GPT-4 as the LLM interpreter is achieved by AutoFlow with Mixtral as the workflow generator. This observation suggests that combining different systems (Mixtral and GPT-4) for the LLM interpreter and the workflow generator might lead to a synergistic effect in which the strengths of one system complement the weaknesses of the other, which helps to better solve complex multi-step tasks.

| Task (Metric) | Zero | CoT | Few | CoRE | AutoFlow (GPT) | AutoFlow (Mixtral) |
|---|---|---|---|---|---|---|
| Task 1 (CLIP Score) | 0.0 | 0.0 | 0.1839 | 0.1825 | **0.2441** | 0.1831 |
| Task 2 (BERT Score) | 0.1092 | 0.1987 | 0.0687 | 0.2593 | 0.3017 | **0.3133** |
| Task 3 (ViT Score) | 0.1949 | 0.1562 | 0.5501 | 0.2437 | **0.5720** | 0.4907 |
| Average over tasks | 0.1206 | 0.1736 | 0.1887 | 0.2483 | **0.3597** | 0.3442 |

Table 1: Performance on OpenAGI when using the open-source LLM, Mixtral, as the LLM interpreter for all tasks and learning schema. Zero is for Zero-shot Learning, Few is for Few-shot Learning. The boldface numbers denote the highest score under each task type using the same LLM.

| Task (Metric) | Zero | CoT | Few | CoRE | AutoFlow (GPT) | AutoFlow (Mixtral) |
|---|---|---|---|---|---|---|
| Task 1 (CLIP Score) | 0.0 | 0.2732 | 0.3055 | 0.1368 | **0.3049** | 0.3032 |
| Task 2 (BERT Score) | 0.2076 | 0.2266 | 0.6307 | 0.6505 | 0.6628 | **0.7014** |
| Task 3 (ViT Score) | 0.5058 | 0.6736 | 0.6480 | 0.6480 | **0.6899** | 0.6119 |
| Average over tasks | 0.2378 | 0.3359 | 0.5281 | 0.6104 | 0.6415 | **0.6501** |

Table 2: Performance on OpenAGI using the closed-source LLM, GPT-4, as the LLM interpreter for all tasks and learning schema. Zero is for Zero-shot Learning, Few is for Few-shot Learning. The boldface numbers denote the highest score under each task type using the same LLM.
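The relative improvements over CoRE quoted in the analysis can be recomputed from the "Average over tasks" rows of the two tables. The short sketch below does this arithmetic; the helper name and the dictionary layout are illustrative choices, and the numbers are copied from the tables.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# "Average over tasks" values, keyed by interpreter LLM.
core_avg = {"Mixtral": 0.2483, "GPT-4": 0.6104}
autoflow_avg = {
    "Mixtral": {"AutoFlow (GPT)": 0.3597, "AutoFlow (Mixtral)": 0.3442},
    "GPT-4": {"AutoFlow (GPT)": 0.6415, "AutoFlow (Mixtral)": 0.6501},
}

gains = {
    (interpreter, variant): relative_improvement(score, core_avg[interpreter])
    for interpreter, variants in autoflow_avg.items()
    for variant, score in variants.items()
}

for (interpreter, variant), gain in gains.items():
    print(f"{interpreter} interpreter, {variant}: {gain:.1f}% over CoRE")
```

Under this calculation, the best AutoFlow variant gains roughly 45% over CoRE with the Mixtral interpreter and roughly 6.5% with the GPT-4 interpreter.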

6 Conclusions and Future Work

In this study, we introduce the AutoFlow framework to use Large Language Models (LLMs) for automatically generating effective workflows for agents. We propose two learning methods for AutoFlow, the fine-tuning method when using open-source LLM as workflow generator, and the in-context learning method when using closed-source LLM as workflow generator. Compared to manually designed workflows, automatically generated workflows can reach better performance and significantly reduce the human labor, leading to a higher degree of automation.

Although AutoFlow demonstrates promising results, there is still room for improvement. For example, the learning process for the workflow generator LLM uses reinforcement learning, which may not be the most efficient compared to some gradient-based methods or few-shot learning methods. Future studies may evaluate the efficacy of other learning methods. As another example, the workflow generator and interpreter LLMs in the AutoFlow framework currently work together under a collaborative learning paradigm. Instead, we may try other learning paradigms such as the teacher-student paradigm or the adversarial learning paradigm.

References

  • Besta et al. [2024] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. 2024. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 17682–17690.
  • Boiko et al. [2023] Daniil A. Boiko, Robert MacKnight, and Gabe Gomes. 2023. Emergent autonomous scientific research capabilities of large language models. arXiv:2304.05332 [physics.chem-ph]
  • Cai et al. [2024] Yuzhe Cai, Shaoguang Mao, Wenshan Wu, Zehua Wang, Yaobo Liang, Tao Ge, Chenfei Wu, Wang You, Ting Song, Yan Xia, Jonathan Tien, Nan Duan, and Furu Wei. 2024. Low-code LLM: Graphical User Interface over Large Language Models. arXiv:2304.08103 [cs.CL]
  • Cerqueira et al. [2021] Vitor Cerqueira, Nuno Moniz, and Carlos Soares. 2021. Vest: Automatic feature engineering for forecasting. Machine Learning (2021), 1–23.
  • Chen et al. [2023] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Transactions on Machine Learning Research (2023).
  • Feurer et al. [2015] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and Robust Automated Machine Learning. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/file/11d0e6287202fced83f79975ec59a3a6-Paper.pdf
  • Ge et al. [2023a] Yingqiang Ge, Wenyue Hua, Kai Mei, Jianchao Ji, Juntao Tan, Shuyuan Xu, Zelong Li, and Yongfeng Zhang. 2023a. OpenAGI: When LLM Meets Domain Experts. In Advances in Neural Information Processing Systems (NeurIPS) (2023).
  • Ge et al. [2023b] Yingqiang Ge, Yujie Ren, Wenyue Hua, Shuyuan Xu, Juntao Tan, and Yongfeng Zhang. 2023b. LLM as OS, Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem. arXiv e-prints (2023), arXiv–2312.
  • Hao et al. [2023] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992 (2023).
  • Hessel et al. [2021] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2021. CLIPScore: A Reference-free Evaluation Metric for Image Captioning.
  • Hu et al. [2021] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL]
  • Huang et al. [2018] Siyu Huang, Xi Li, Zhi-Qi Cheng, Zhongfei Zhang, and Alexander Hauptmann. 2018. Gnas: A greedy neural architecture search method for multi-attribute learning. In Proceedings of the 26th ACM international conference on Multimedia. 2049–2057.
  • Hutter et al. [2019] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. 2019. Automated machine learning: methods, systems, challenges. Springer Nature.
  • Jiang et al. [2024] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024).
  • Jojic et al. [2023] Ana Jojic, Zhen Wang, and Nebojsa Jojic. 2023. Gpt is becoming a turing machine: Here are some ways to program it. arXiv preprint arXiv:2303.14310 (2023).
  • Josifoski et al. [2023] Martin Josifoski, Lars Klein, Maxime Peyrard, Yifei Li, Saibo Geng, Julian Paul Schnitzler, Yuxing Yao, Jiheng Wei, Debjit Paul, and Robert West. 2023. Flows: Building Blocks of Reasoning and Collaborating AI. arXiv:2308.01285 [cs.AI]
  • Kanter and Veeramachaneni [2015] James Max Kanter and Kalyan Veeramachaneni. 2015. Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE international conference on data science and advanced analytics (DSAA). IEEE, 1–10.
  • Katz et al. [2016] Gilad Katz, Eui Chul Richard Shin, and Dawn Song. 2016. Explorekit: Automatic feature generation and selection. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 979–984.
  • Khattab et al. [2022] Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP. arXiv preprint arXiv:2212.14024 (2022).
  • Khattab et al. [2023] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv preprint arXiv:2310.03714 (2023).
  • Kojima et al. [2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems 35 (2022), 22199–22213.
  • Kotthoff et al. [2019] Lars Kotthoff, Chris Thornton, Holger H Hoos, Frank Hutter, and Kevin Leyton-Brown. 2019. Auto-WEKA: Automatic model selection and hyperparameter optimization in WEKA. In Automated Machine Learning. Springer, Cham, 81–95.
  • Li et al. [2023] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Li et al. [2024b] Xinyi Li, Yongfeng Zhang, and Edward C Malthouse. 2024b. Large Language Model Agent for Fake News Detection. arXiv preprint arXiv:2405.01593 (2024).
  • Li et al. [2024a] Zelong Li, Wenyue Hua, Hao Wang, He Zhu, and Yongfeng Zhang. 2024a. Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents. arXiv:2402.00798 (2024).
  • Li et al. [2022] Zelong Li, Jianchao Ji, Yingqiang Ge, and Yongfeng Zhang. 2022. AutoLossGen: Automatic Loss Function Generation for Recommender Systems. SIGIR (2022).
  • Liu et al. [2023] Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. 2023. LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477 (2023).
  • Liu et al. [2019] Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019. DARTS: Differentiable Architecture Search. In International Conference on Learning Representations. https://openreview.net/forum?id=S1eYHoC5FX
  • Lyu et al. [2023] Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. 2023. Faithful chain-of-thought reasoning. arXiv preprint arXiv:2301.13379 (2023).
  • Mei et al. [2024] Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. 2024. AIOS: LLM Agent Operating System. arXiv (2024).
  • Nijkamp et al. [2022] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
  • OpenAI [2023] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
  • Pham et al. [2018] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. 2018. Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning. PMLR, 4095–4104.
  • Poesia et al. [2022] Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. 2022. Synchromesh: Reliable code generation from pre-trained language models. arXiv preprint arXiv:2201.11227 (2022).
  • Qian et al. [2023] Chen Qian, Xin Cong, Wei Liu, Cheng Yang, Weize Chen, Yusheng Su, Yufan Dang, Jiahao Li, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023. Communicative Agents for Software Development. arXiv:2307.07924 [cs.SE]
  • Real et al. [2019] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized Evolution for Image Classifier Architecture Search. Proceedings of the AAAI Conference on Artificial Intelligence 33, 01 (Jul. 2019), 4780–4789. https://doi.org/10.1609/aaai.v33i01.33014780
  • Thomas Elsken [2018] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2018. Simple and efficient architecture search for Convolutional Neural Networks. https://openreview.net/forum?id=SySaJ0xCZ
  • Wang et al. [2023] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. 2023. A Survey on Large Language Model based Autonomous Agents. arXiv:2308.11432 [cs.AI]
  • Wang et al. [2022] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).
  • Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
  • Williams [1992] Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (1992), 229–256.
  • Wu et al. [2020] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. 2020. Visual Transformers: Token-based Image Representation and Processing for Computer Vision. arXiv:2006.03677 [cs.CV]
  • Wu et al. [2023] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155 [cs.AI]
  • Wu et al. [2024] Yiran Wu, Tianwei Yue, Shaokun Zhang, Chi Wang, and Qingyun Wu. 2024. StateFlow: Enhancing LLM Task-Solving through State-Driven Workflows. arXiv:2403.11322 [cs.CL]
  • Xi et al. [2023] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, and Tao Gui. 2023. The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv:2309.07864 [cs.AI]
  • Xie et al. [2024] Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. 2024. TravelPlanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622 (2024).
  • Xu et al. [2024] Shuyuan Xu, Zelong Li, Kai Mei, and Yongfeng Zhang. 2024. CoRE: LLM as Interpreter for Natural Language Programming, Pseudo-Code Programming, and Flow Programming of AI Agents. arXiv:2405.06907 [cs.CL]
  • Yao et al. [2018] Quanming Yao, Mengshuo Wang, Yuqiang Chen, Wenyuan Dai, Yu-Feng Li, Wei-Wei Tu, Qiang Yang, and Yang Yu. 2018. Taking human out of learning applications: A survey on automated machine learning. arXiv preprint arXiv:1810.13306 (2018).
  • Yao et al. [2024] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36 (2024).
  • Yao et al. [2022] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629 (2022).
  • Yao et al. [2023] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR).
  • Yuan et al. [2024] Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Ren Kan, Dongsheng Li, and Deqing Yang. 2024. EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction. arXiv preprint arXiv:2401.06201 (2024).
  • Zeng et al. [2024] Zhen Zeng, William Watson, Nicole Cho, Saba Rahimi, Shayleen Reynolds, Tucker Balch, and Manuela Veloso. 2024. FlowMind: Automatic Workflow Generation with LLMs. arXiv:2404.13050 [cs.CL]
  • Zhang et al. [2020] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.