InferAct: Inferring Safe Actions for LLM-Based Agents Through Preemptive Evaluation and Human Feedback

Haishuo Fang¹ Xiaodan Zhu^1,2 Iryna Gurevych¹
¹Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and
Hessian Center for AI (hessian.AI), Technical University of Darmstadt, Germany
²Department of Electrical and Computer Engineering & Ingenuity Labs Research Institute,
Queen’s University, Canada
¹www.ukp.tu-darmstadt.de ²[email protected]

Abstract

A crucial requirement for deploying LLM-based agents in real-life applications is the robustness against risky or even irreversible mistakes. However, the existing research lacks a focus on preemptive evaluation of reasoning trajectories performed by LLM agents, leading to a gap in ensuring safe and reliable operations. To explore better solutions, this paper introduces InferAct, a novel approach that leverages the Theory-of-Mind capability of LLMs to proactively detect potential errors before critical actions are executed (e.g., ‘buy-now’ in automatic online trading or web shopping). InferAct is also capable of integrating human feedback to prevent irreversible risks as well as enhance the actor agent’s decision-making process. Experiments on three widely-used tasks demonstrate the effectiveness of InferAct. The proposed solution presents a novel approach and concrete contributions towards developing LLM agents that can be safely deployed in different environments involving critical decision-making.¹

¹ https://github.com/UKPLab/arxiv2024-inferact


1 Introduction

The advancement of Large Language Models (LLMs) has spawned a variety of LLM-based agents that are capable of completing complex tasks such as navigating the web Zhou et al. (2024b), managing databases Wang et al. (2023a), and generating code Wang et al. (2024). These agents’ capabilities and potentials have drawn significant research interest recently Yao et al. (2023); Liu et al. (2024); Wu et al. (2024); Xie et al. (2024); Fang et al. (2024). However, to deploy the models to real-life applications, the robustness against costly or sometimes irreversible mistakes is crucial. For instance, an incorrect purchase made by a web shopping agent can lead to a significant monetary loss, while a household agent mishandling kitchen equipment can pose serious safety risks.

Figure 1: An example of our proposed preemptive evaluation workflow: The critical action heat taken by the Actor agent in a household task triggers the critic to evaluate whether the Actor agent is on track before execution. The critic alerts the human to intervene after it detects that the agent is most likely off track, avoiding any potential negative consequences.

However, the existing research in LLM agents lacks a focus on robust modeling that proactively evaluates the decision process before executing any critical actions. This leads to a gap in ensuring safe and reliable operations. In response to these challenges, we introduce InferAct, an approach designed to evaluate whether an Actor agent is on track before any critical action is executed, and to solicit human intervention if potential errors are detected (cf. Figure 1). This mechanism aims to enhance safety and prevent negative consequences resulting from risky executions. Current studies Shinn et al. (2023); Yao et al. (2024); Zhou et al. (2024a); Kim et al. (2023b) overlook potential risks incurred by executing critical actions (e.g., ‘buy-now’ in automatic online trading or web shopping) and assume that feedback indicating success or failure can be obtained after the action is executed.

We argue that this assumption is impractical in real-world settings, particularly when failures carry severe penalties (e.g., property damage, financial loss) or when obtaining human feedback is costly.

Unlike the above studies, our proposed method, InferAct, does not rely on post-execution feedback. Instead, it leverages real-time assessment to mitigate risks before any detrimental outcome materializes. By mimicking the vigilance of a human overseer, InferAct does not merely observe the actions taken by agents but infers the agent’s intent behind those actions. This ability to infer intent is known in cognitive science as Theory of Mind (ToM) Premack and Woodruff (1978), which enables humans to interpret the behavior of others by attributing mental states such as beliefs and intentions to them. The most recent work Strachan et al. (2024) has shown that GPT-4 models performed at, or even sometimes above, human levels in several ToM aspects such as identifying indirect requests and false beliefs. Building on the ToM capability of LLMs, InferAct interprets the intent behind action chains executed by agents, identifying deviations when these actions stray from their intended goals. If the intentions inferred from the action chains suggest a potential deviation or error, InferAct proactively alerts humans to provide feedback. The feedback not only prevents undesirable outcomes from critical actions but also offers guidance to refine the decision-making ability of the Actor agent. Ultimately, this enhances the performance and trustworthiness of LLM agents.

To evaluate the effectiveness of InferAct, we conduct experiments in three distinct environments, including a Web shopping task Yao et al. (2022), a household task Shridhar et al. (2021), and a search-based Question Answering task Yang et al. (2018). Our experiments demonstrate that InferAct achieves the state-of-the-art performance across these tasks with various LLMs (e.g. GPT-4-turbo, GPT-3.5-turbo, and Llama-3-70B) as the back-ends. By incorporating human feedback, InferAct significantly reduces the risks caused by erroneous actions and improves the performance of the Actor agent compared with alternative methods.

We further evaluate different methods in high-stakes conditions including high-priced purchases in web shopping and high-risk operations in the household task. The results reaffirm that InferAct possesses superior error detection capabilities in these scenarios. When combined with the risk-aware prompt, InferAct effectively minimizes the losses (e.g. monetary loss) incurred by undetected adverse actions compared with alternative methods. To summarize, our contributions are as follows:

• We propose a preemptive evaluation workflow for LLM-based agents involved in critical decision-making, which integrates human feedback to enhance the safety and performance of agents.
• We introduce InferAct, a novel approach that applies the Theory of Mind (ToM) capabilities of LLMs to assist humans in preemptively detecting potential risks of LLM agents in critical scenarios. Our experiments show that InferAct achieves state-of-the-art performance in detecting erroneous actions on three tasks with different LLMs as the back-ends.
• InferAct has proven effective when combined with both binary and natural feedback, significantly enhancing the performance of LLM agents compared to alternative methods.
• Our experiments in high-stakes setups show the efficacy of InferAct. When equipped with risk-aware prompts, the improvement of InferAct is evident not only in preventing the execution of incorrect critical actions but also in minimizing losses incurred from undetected incorrect actions.

2 Related Work

Trustworthiness of LLM Agents.

As LLM agents gain the capability to interact with external environments to complete various tasks, it becomes crucial to address the potential irreversible consequences of their actions and determine when human oversight is necessary. However, this area of research is still largely unexplored. The emulation method has been proposed to assess risks of API calls by utilizing LLMs as a sandbox environment Ruan et al. (2024); Hua et al. (2024). For details about these works, please refer to Appendix C. However, emulation-based methods may not always align with the execution in complex real-world environments. InferAct is the first work to explore the preemptive evaluation mechanism with human feedback for LLM agents in real-world environments (e.g. Web shopping).

Evaluation and Feedback Acquisition of LLM Agents in critical scenarios.

Current research generally assumes that feedback is either available post-execution Shinn et al. (2023); Yao et al. (2024); Zhou et al. (2024a); Kim et al. (2023b) or completely unavailable during task inference Kim et al. (2023a); Song et al. (2024); Zhao et al. (2024). Typically, the post-execution feedback is autonomously obtained after executing terminal actions such as a ‘buy-now’ command in online shopping. However, this does not necessarily reflect real-world scenarios where such direct correctness feedback is often absent. In such cases, the only feedback that might be available after terminal actions is human feedback, which assesses whether the agent has adequately fulfilled the given instructions.

Without the assumption of post-execution feedback, studies have explored how to use gold labels or human feedback to acquire insights during offline learning. Related studies include Co-learning Qian et al. (2023), ExpeL Zhao et al. (2024), and ETO Song et al. (2024). For more information about these works, please refer to Appendix C. Unlike these works, which use offline learning, our work focuses on real-time error detection and the strategic acquisition of human feedback during online operations, especially for irreversible actions.

Machine Theory-of-Mind.

Theory-of-Mind (ToM) is the cognitive capability that allows humans to understand and attribute mental states like beliefs and intentions to themselves and others, allowing for the prediction of behavior Premack and Woodruff (1978). ToM includes a series of tasks such as inferring others’ intent based on interconnected actions or reflecting on someone else’s mental states. The emergent ToM ability in LLMs has sparked lots of research interest. Recent studies Kosinski (2023); Bubeck et al. (2023) show that GPT models, much like humans, can exhibit strong ToM abilities but may falter with minor alterations in the false belief task Shapira et al. (2024); Ullman (2023). A comprehensive study by Strachan et al. (2024) compared LLMs to 1,907 human participants and found GPT models excel in interpreting beliefs, intentions, and non-literal expressions but falter in recognizing faux pas. Previous studies mostly focus on the evaluation of the ToM ability of LLMs. To our knowledge, we are the first to leverage the ToM ability of LLMs to assist humans in detecting off-track behaviors of LLM agents in critical decision-making scenarios.

3 The Approach

This section describes the mechanism InferAct uses to assess the reasoning process of the Actor, i.e., the agent that performs the user’s task. Humans have a strong ToM ability to infer other people’s intentions from their behaviors, without access to others’ internal thoughts. Inspired by this, we leverage the ToM ability of LLMs to deduce the intended tasks behind the sequences of actions and observations the Actor made during task execution. The key idea is: by comparing the tasks inferred from the Actor’s actions with the actual tasks given by the user, InferAct is able to detect whether the Actor has deviated from the user’s task during the execution process. To fulfill this, we design two components: the Task Inference Unit and the Task Verification Unit (cf. Figure 3).

The Task Inference Unit.

This unit is responsible for inferring intended tasks from the action chain performed by the Actor. The action chain, denoted as $S$, comprises a sequence of $\langle$Action, Observation$\rangle$ pairs, $\{a_{1},o_{1},\dots,a_{m},o_{m}\}$. The Actor operates under the ReAct Yao et al. (2023) framework, which typically consists of the sequence of $\langle$Thought, Action, Observation$\rangle$. However, for the purpose of unbiased task inference, the Thought component is excluded to form $S$. The rationale is that Thought records the internal deliberations and plans of the Actor during task resolution, which might contain information about the user’s task. For instance, the first Thought of the Actor in Figure 2 explicitly states the task to ‘find 66 inches blackout shades’. Excluding the Thought component ensures that task inference remains impartial and is not influenced by direct internal cues from the Actor, which is crucial for verifying whether the actions performed by the Actor align with the user’s specified task.
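As an illustration of how $S$ can be formed, the following minimal sketch drops the Thought of every ReAct step and keeps only the $\langle$Action, Observation$\rangle$ pairs. The `Step` type, the function names, and the rendered text layout are assumptions made for this example; they are not taken from the released implementation.

```python
from typing import NamedTuple


class Step(NamedTuple):
    """One ReAct step (hypothetical container used only in this sketch)."""
    thought: str
    action: str
    observation: str


def build_action_chain(trajectory: list[Step]) -> list[tuple[str, str]]:
    """Form S by keeping (action, observation) pairs and discarding every Thought."""
    return [(step.action, step.observation) for step in trajectory]


def render_action_chain(chain: list[tuple[str, str]]) -> str:
    """Serialize S as plain text so it can be placed into a prompt."""
    lines = []
    for i, (action, observation) in enumerate(chain, start=1):
        lines.append(f"Action {i}: {action}")
        lines.append(f"Observation {i}: {observation}")
    return "\n".join(lines)


# Illustrative trajectory; the actions and observations are made up.
trajectory = [
    Step("I need to find 66 inches blackout shades.",
         "search[blackout shades]", "A results page listing several shades."),
    Step("The first item looks relevant.",
         "click[first item]", "A product page for custom cut-to-size shades."),
]
print(render_action_chain(build_action_chain(trajectory)))
```

Because the Thought text never enters `render_action_chain`, the task statement contained in the first Thought cannot leak into the input of the Task Inference Unit.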

Specifically, we instruct LLMs with prompt $P^{i}$ (cf. Appendix A) to infer the $N$ most probable tasks $T=\{t_{1},t_{2},\dots,t_{N}\}$ that the action chain intends to solve.

$$T=\mathrm{LLM}(P^{i},S)$$

Due to the diversity and the varying granularity of tasks performed by the Actor, we opt for generating the $N$ most probable tasks rather than a single possible one. This mirrors the human ToM ability to consider multiple plausible intentions or objectives from observed action chains. Once inferred tasks are obtained, along with the user’s original task $t^{*}$, we format them into a Multiple-Choice Question (MCQ) framework.

$$MCQ=\{C_{1},\dots,C_{N},C_{N+1}\}$$

where $C_{j}=t_{j}$ for $j=1,2,\dots,N$ and $C_{N+1}=t^{*}$.

Each choice in the $MCQ$ represents a task, and the $MCQ$ serves as the input for the Task Verification Unit, which evaluates the alignment between the action chain $S$ and the original task $t^{*}$.
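The construction of the $MCQ$ can be sketched as follows. This is a minimal illustration, assuming the choices are labeled with capital letters; the labeling scheme and the function name `build_mcq` are hypothetical and only the ordering ($C_{1},\dots,C_{N}$ followed by $C_{N+1}=t^{*}$) follows the definition above.

```python
import string


def build_mcq(inferred_tasks: list[str], user_task: str) -> dict[str, str]:
    """Return the choices C_1..C_N (inferred tasks) followed by C_{N+1} = t*."""
    choices = list(inferred_tasks) + [user_task]
    labels = string.ascii_uppercase
    if len(choices) > len(labels):
        raise ValueError("too many choices for single-letter labels")
    return {labels[i]: choice for i, choice in enumerate(choices)}


# Illustrative inputs based on the vanity bench example used in this paper.
mcq = build_mcq(
    ["buy a grey vanity bench with metal legs", "buy a bench for a bedroom"],
    "buy a white vanity bench under $100",
)
for label, task in mcq.items():
    print(f"{label}. {task}")
```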

The Task Verification Unit.

Upon assembling the $MCQ$ set, the Task Verification Unit prompts the LLM with $P^{v}$ to assign a probability to each choice $C_{j}$, indicating the likelihood that it is fulfilled or on track to be fulfilled by the action chain $S$. The prompt $P^{v}$ is detailed in Appendix A.

$$P=\{p_{1},p_{2},\dots,p_{N},p_{t^{*}}\}=\mathrm{LLM}(P^{v},S,MCQ)$$

where $p_{j}=\Pr(C_{j}\ \text{is correct}\mid S)$ for each choice in the $MCQ$.

In our experiments, we directly prompt LLMs to generate verbalized probabilities $p_{j}$ with justifications, expressed in the token space of the LLM. This suits commercial LLMs, for which token logits might be unavailable. Given that LLMs can be sensitive to the choice order Robinson and Wingate (2023), we aggregate the probability $p_{t^{*}}$ across different positions (refer to Appendix B). How to enhance the reliability of verbalized probability has been extensively investigated Mielke et al. (2022); Tian et al. (2023); Li et al. (2024); Ulmer et al. (2024). Among them, we adopt the Top-$k$ prompting strategy proposed by Tian et al. (2023) as it showed promising results in the following experiments (Section 5). It should be noted that InferAct is flexible with different probability estimation methods.
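One way to reduce sensitivity to the choice order is sketched below. The exact aggregation procedure is described in Appendix B and is not reproduced here, so this example assumes the simplest variant: the user’s task $t^{*}$ is placed at every position in turn and the resulting probabilities are averaged. The `score` callable is a hypothetical stand-in for the LLM call with prompt $P^{v}$.

```python
from statistics import mean
from typing import Callable, Sequence


def positional_average(
    inferred_tasks: Sequence[str],
    user_task: str,
    score: Callable[[list[str]], list[float]],
) -> float:
    """Average p_{t*} over every position t* can occupy among the choices.

    `score` stands in for the verification LLM: it receives the ordered
    choices and returns one probability per choice (assumption of this sketch).
    """
    probabilities = []
    for position in range(len(inferred_tasks) + 1):
        choices = list(inferred_tasks)
        choices.insert(position, user_task)
        probabilities.append(score(choices)[position])
    return mean(probabilities)


# A toy scorer with a deliberate position bias: later choices get higher scores.
def biased_scorer(choices: list[str]) -> list[float]:
    return [0.1 * (index + 1) for index in range(len(choices))]


print(positional_average(["task a", "task b"], "user task", biased_scorer))
```

With the toy scorer, a single fixed position would yield anywhere between 0.1 and 0.3 for the same task, whereas the average is independent of where the user’s task happens to be listed.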

In contrast to the typical $MCQ$ where options are mutually exclusive and their prediction probabilities sum to 1.0, we consider the verification process as a multi-label task. This means that the sum of the assigned probabilities to each option does not need to be 1.0, reflecting the fact that one action chain $S$ might fulfill multiple tasks. The inferred tasks from the Task Inference Unit can vary in granularity from the original task $t^{*}$, but are not mutually exclusive. For instance, an action chain $S$ that fulfills the specific, fine-grained inferred task (e.g., buy a grey vanity bench with metal legs) can also complete a more general, coarse-grained user’s instruction (e.g., buy a vanity bench). The multi-label setting provides LLMs with more flexibility to assign appropriate probabilities to the user’s task $t^{*}$, contextualized by the other options in this scenario.

InferAct is performed before any critical actions, i.e., irreversible actions with bad consequences. If $p_{t^{*}}$ is low, it indicates that the Actor is likely to deviate from its intended goal. In such a case, InferAct alerts humans to intervene. The feedback provided by human subjects is appended to the input context of the Actor for the next trial. Human feedback not only prevents and mitigates negative consequences from the execution of critical actions, but also improves the Actor’s performance without the cost of failure. Regarding the forms of human feedback, in Section 5.2, we explore two typical types: binary and natural-language feedback. InferAct leverages the ToM ability of LLMs to understand the intent of the Actor’s behaviors and detect errors. InferAct with elicited human feedback can ensure that the Actor remains aligned with intended goals, thus minimizing risks and improving performance.
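The decision made before a critical action can be sketched as a simple gate on $p_{t^{*}}$. The default threshold of 0.5, the `Verdict` type, and the feedback prefix are assumptions for illustration only; in our evaluation the operating threshold is chosen from the precision-recall curve (Section 4.2), not fixed in advance.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    p_user_task: float   # aggregated probability p_{t*}
    alert_human: bool    # True if the critical action should be held back


def preemptive_check(p_user_task: float, threshold: float = 0.5) -> Verdict:
    """Flag the trajectory when p_{t*} falls below the (hypothetical) threshold."""
    if not 0.0 <= p_user_task <= 1.0:
        raise ValueError("p_user_task must be a probability in [0, 1]")
    return Verdict(p_user_task, alert_human=p_user_task < threshold)


def next_trial_context(context: str, feedback: str) -> str:
    """Append human feedback to the Actor's input context for the next trial."""
    return f"{context}\nFeedback from the previous trial: {feedback}"


verdict = preemptive_check(0.0)
if verdict.alert_human:
    context = next_trial_context(
        "Task: buy 66x66 inch blackout shades.",
        "The selected item is a custom cut-to-size product, not 66x66 inches.",
    )
    print(context)
```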

4 Experimental Setup

4.1 Tasks

In this section, we evaluate InferAct on three distinct tasks commonly used in LLM agents: WebShop Yao et al. (2022), HotPotQA Yang et al. (2018) and ALFWorld Shridhar et al. (2021). We define critical actions in these tasks.

WebShop.

The WebShop Yao et al. (2022) is an online shopping benchmark where an agent navigates an online store to fulfill user requests, such as purchasing a white vanity bench under $100. The agent’s actions include searching and clicking through the website, with the critical action being a click[Buy Now] due to its financial implications.

HotPotQA.

As a Wikipedia-based question-answering task, HotPotQA Yang et al. (2018) in the agent setup Yao et al. (2023) challenges agents to find correct answers using Wikipedia APIs. The APIs include search[entity], lookup[string] and finish[answer]. The critical action is finish[answer] as it often affects the user’s satisfaction with the system, e.g., in the context of customer service.

ALFWorld.

In this household task Shridhar et al. (2021), agents perform a variety of actions to fulfill the user’s task, such as Pick & Place, Clean & Place, Heat & Place, and Cool & Place. The critical actions include Clean, Heat, and Cool, since these actions involve potentially irreversible physical state changes to the objects being operated on. For example, if the agent cleans something that should not be wet, it could damage the item. In addition, task completion is also a critical action.

The detailed descriptions of these tasks and the corresponding data size used for evaluation can be found in Appendix E.

4.2 Evaluation Metrics

As we aim to identify unsafe reasoning trajectories before critical actions are executed, we measure how well the model can identify them. We report the Area Under the Precision-Recall Curve (AUC-PR), as well as recall, precision, and the corresponding F1-score at the optimal threshold on the precision-recall curve.

4.3 Baselines and Backbone LLMs

As there is no previous work on fine-tuned critics in these tasks, we include three widely used prompting-based methods as baselines. Detailed prompts are included in Appendix A.

Standard Evaluation Prompt.

Similar to self-refinement Madaan et al. (2023) and Prospector Kim et al. (2023a), this method directly prompts LLMs to evaluate the correctness of the reasoning trajectory performed by the Actor.

Standard Evaluation with Self-Consistency.

Based on the standard evaluation prompt, self-consistency Wang et al. (2023b) evaluates the reasoning trajectory $m$ times and leverages the majority voting as the final evaluation. The sampling time $m$ is set to five in our experiments.
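The majority vote over $m$ sampled evaluations amounts to the following minimal sketch. The verdict strings and the function name are assumptions; with $m=5$ and two possible verdicts, a tie cannot occur.

```python
from collections import Counter


def self_consistent_verdict(verdicts: list[str]) -> str:
    """Return the most frequent verdict among the sampled evaluations."""
    if not verdicts:
        raise ValueError("at least one sampled verdict is required")
    return Counter(verdicts).most_common(1)[0][0]


samples = ["incorrect", "correct", "incorrect", "correct", "incorrect"]  # m = 5
print(self_consistent_verdict(samples))
```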

Multi-step Evaluation.

This approach evaluates the reasoning trajectory step-by-step. LLMs are prompted to generate a verbalized probability $P_{i}$ to estimate the correctness of each step $S_{i}$. The overall score is aggregated based on the step-level estimate. In our experiments, we compare the performance of four different aggregation methods $\{Min,Max,Mean,Product\}$.
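The four aggregation methods can be sketched as follows. The step probabilities in the example are made up, and the dictionary and function names are hypothetical.

```python
import math
from statistics import mean

# Map each aggregation method name to the function that implements it.
AGGREGATORS = {
    "Min": min,
    "Max": max,
    "Mean": mean,
    "Product": math.prod,
}


def aggregate(step_probabilities: list[float], method: str) -> float:
    """Combine step-level probabilities P_i into one trajectory-level score."""
    if not step_probabilities:
        raise ValueError("at least one step probability is required")
    return AGGREGATORS[method](step_probabilities)


steps = [0.9, 0.5, 0.8]  # illustrative values only
for name in AGGREGATORS:
    print(name, aggregate(steps, name))
```

Note that Min and Product are dominated by the single lowest step score, so one step that the LLM rates poorly lowers the score of the entire trajectory.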

Regarding back-end LLMs, we use gpt-4-1106-preview Achiam et al. (2023) as the Actor agent to perform the user’s task. For baseline methods, both commercial and open-source LLMs are adopted as the back-ends, including Llama-3 (70B) AI@Meta (2024), gpt-3.5-turbo-0613, and gpt-4-1106-preview. The implementation details of experiments can be found in Appendix B.

5 Experiment Results and Analysis

5.1 Overall Performance

Models	Methods	WebShop				HotPotQA				ALFWorld				Avg
Models	Methods	Rec	Prec	F1	AUC-PR	Rec	Prec	F1	AUC-PR	Rec	Prec	F1	AUC-PR	F1	AUC-PR
GPT-4-turbo	Standard Eval	39.6	72.0	51.1	—	27.9	65.5	39.2	—	87.2	54.7	67.2	—	52.5	—
	Standard Eval-SC (M=5)	40.7	73.3	52.3	—	26.5	66.7	37.9	—	82.6	51.1	66.1	—	52.1	—
	Multi-step Evaluation	91.3	68.7	78.4	64.5	75.0	37.5	50.0	42.5	66.0	30.7	41.9	44.4	56.8	50.5
	InferAct	98.9	67.2	80.0	73.8	80.9	36.2	50.0	45.0	100.0	61.0	75.8	75.3	68.6	64.7
GPT-3.5-turbo	Standard Eval	9.9	64.3	17.1	—	19.1	40.6	26.0	—	59.5	33.7	43.1	—	28.7	—
	Standard Eval-SC (M=5)	10.4	65.5	17.9	—	19.1	43.3	26.5	—	48.9	30.7	37.7	—	27.4	—
	Multi-step Evaluation	59.3	61.4	60.3	58.6	86.8	31.1	45.8	38.3	61.7	27.9	38.4	24.1	48.2	40.3
	InferAct	96.7	67.4	79.6	67.7	95.6	30.4	46.5	39.4	97.8	36.8	53.5	38.9	59.9	48.3
Llama-3-70B	Standard Eval	1.6	60.0	3.2	—	11.8	80.0	20.5	—	50.0	92.0	64.8	—	29.5	—
	Standard Eval-SC (M=5)	2.7	83.3	5.3	—	11.8	80.0	20.5	—	48.9	92.0	63.9	—	29.9	—
	Multi-step Evaluation	90.1	67.5	77.2	64.2	85.3	31.0	45.5	44.4	69.6	31.3	43.2	21.0	55.3	43.2
	InferAct	97.8	68.1	80.4	74.1	97.1	31.3	47.3	44.6	97.9	51.7	67.7	63.8	65.1	60.8

Table 1: InferAct outperforms alternative methods across three tasks. As the standard evaluation method directly outputs correctness or incorrectness, no AUC-PR exists (represented by —). The best result among different aggregation methods of the Multi-step Evaluation is reported here (refer to Appendix D for complete results).

As illustrated in Table 1, InferAct consistently surpasses alternative methods across different benchmarks, demonstrating robust performance with both commercial and open-source LLMs. Notably, InferAct (GPT-4-turbo) achieves the best average F1-score and AUC-PR on these tasks, reflecting the strong ToM capability of GPT-4-turbo.

On WebShop, InferAct outperforms all baseline methods across different back-end LLMs. For instance, with GPT-4-turbo, InferAct achieves an F1-score that is 28.9% higher than Standard Evaluation, while with GPT-3.5-turbo, InferAct outperforms Multi-step Evaluation by 19.3% (F1-score). A significant challenge in WebShop evaluation lies in comprehending the subtle semantic differences between similar items and product attributes, such as distinguishing between a box spring foundation and a bed with a box spring, or between dark brown and coffee brown hair dye. Baseline methods struggle with these nuanced differences.

Unlike baselines, which directly contrast the Actor’s reasoning trajectory with the user’s task, InferAct addresses the challenge by performing backward inference. It infers a set of plausible instructions that could have led to this action chain. For instance, as depicted in Figure 2 (C), InferAct infers three instructions related to custom cut-to-size blackout shades based on the Actor’s action chain. However, the user explicitly requests 66×66 inch blackout shades. Such discrepancies are overlooked by other methods but are successfully identified by InferAct, which assigns a zero likelihood to the user’s actual task, as shown in Figure 2 (D).

HotPotQA is an information-seeking task. While the Multi-step Evaluation method achieves competitive results, and even matches InferAct’s F1-score with GPT-4-turbo, InferAct still delivers the best performance across the three back-end LLMs. The performance gains of InferAct are less pronounced on HotPotQA than on WebShop and ALFWorld, primarily because the multi-step method benefits from the LLMs’ internal knowledge on this particular task. InferAct can showcase its advantage when the reasoning path is flawed or the LLM’s internal knowledge is unreliable. For instance, when a user asks how many personnel the Navy that had Gilliam-class attack transports has, baseline methods fail to detect that the Actor missed the specific detail ‘that had Gilliam-class attack transports’. InferAct pinpoints this omission by inferring that the action chain is more likely to answer a question about the number of personnel of the ‘Navy’ in general than the original, more specific query concerning the Navy with Gilliam-class attack transports.

The Multi-step Evaluation method achieves the second-best F1-score on WebShop and performs similarly to InferAct on HotPotQA. However, its effectiveness notably declines in the ALFWorld task, where the Actor needs to perform more exploration steps to locate the required items (such as a cup, mug, or pan). These exploration steps are assigned low scores, strongly affecting the overall accuracy of multi-step evaluations across different aggregation methods (see Appendix D for results). This issue does not hinder InferAct, which outperforms Multi-step Evaluation and Standard Evaluation by 33.9% and 8.6% respectively with GPT-4-turbo as the back-end.

5.2 The Synergy of InferAct and the Actor

The critics attempt to proactively identify potential risks before executing critical actions, allowing for human involvement to help mitigate the potential negative outcomes through feedback. Our study investigates both binary Liu et al. (2018); Shi et al. (2021) and Natural-Language (NL) feedback Tandon et al. (2022); Madaan et al. (2022). Binary feedback, ideal for users seeking minimal engagement, provides the Actor with clear ‘correct’ or ‘incorrect’ signals. In our experiments, we use the gold labels from the dataset to provide such signals. This information enables the Actor to perform self-reflection Shinn et al. (2023) for subsequent trials. For more detailed insights, NL feedback is suitable. We utilize GPT-4-turbo to craft NL feedback by comparing a gold outcome (e.g., the correct product in WebShop) with the predicted one (refer to Appendix A.5 for prompts), which mimics what humans may say when seeing the differences. Previous work Bai et al. (2022); Lee et al. (2024) has suggested that the feedback generated by advanced LLMs (e.g., GPT-4, PaLM) could be on par with the feedback sourced from humans in some summarization, dialogue generation, and categorization tasks. This allows us to simulate human feedback in a scalable and immediate way.

Table 2 and Figure 4 demonstrate InferAct’s effectiveness across three tasks with both binary and NL feedback. The Actor, guided by InferAct, consistently outperforms baselines over three iterations using both binary and NL feedback. For instance, InferAct with NL feedback surpasses the second-best method, Multi-step Evaluation, by 8.3% on WebShop. Moreover, we compared our method against the upper-bound scenario where the Actor always receives feedback after completing terminal actions without any critic involved. As depicted in Table 2, InferAct performs competitively, trailing by only 0.3% in WebShop and 2% in HotPotQA with binary feedback, while achieving equivalent performance in ALFWorld. This competitive edge is attributed to two factors: InferAct consistently achieves high recall across all tasks (Table 1), and there are many challenging cases that remain unsolved even with post-execution feedback. Figure 4 further illustrates that NL feedback significantly boosts the Actor’s performance over iterations when compared to binary feedback, highlighting the value of richer, more informative feedback mechanisms in complex decision-making tasks.

Method	Feedback Type	#Iteration	WebShop	HotPotQA	ALFWorld
		N=0	30.0	57.3	64.9
Standard Eval	Binary	N=1	32.0	61.7	67.9
	NL	N=1	39.7	66.3	74.6
	Binary	N=3	34.3	61.7	71.6
	NL	N=3	42.3	70.0	83.6
Multi-step Eval	Binary	N=1	32.0	62.7	67.9
	NL	N=1	42.3	73.3	71.6
	Binary	N=3	35.3	63.3	70.1
	NL	N=3	45.7	80.3	76.1
InferAct	Binary	N=1	33.7	63.3	70.9
	NL	N=1	48.0	73.3	76.9
	Binary	N=3	39.0	64.3	75.4
	NL	N=3	56.3	80.3	87.3
Post-Execution	Binary	N=3	39.3	66.3	75.4
	NL	N=3	57.0	80.6	87.3

Table 2: The Actor equipped with InferAct achieves the highest success rate with both binary and Natural Language (NL) feedback. The best performance with NL feedback is in bold while the best performance with binary feedback is marked with underline. As the performance of Standard Eval-SC is similar to Standard Eval in Table 1, we exclude it to reduce costs.

5.3 Evaluation with High-Stakes Actions

The overall evaluation presented in Section 5.1 does not consider the costs of adverse actions. In reality, high-stakes decisions may carry more significant consequences than low-stakes counterparts. Recognizing this, we specifically explore the performance of InferAct and other methods using GPT-4-turbo under high-stakes conditions. Specifically in WebShop, we mimic costly decisions by considering the purchases with prices exceeding $60, representing the top one-third (66.6th percentile) of prices within the dataset. For ALFWorld, actions such as Heat and Cool are considered high-stakes considering their irreversible impact on the physical state of objects. For HotPotQA, it is not intuitive to mimic a costly setting.

Furthermore, to quantitatively assess the implications of errors, we consider the cost metric, which measures the negative impact of incorrect decisions (false negatives). In WebShop, this involves calculating the price associated with incorrectly selected products, while for ALFWorld, we count the number of misoperations. This metric complements conventional evaluations such as F1-score, providing a comprehensive view of the performance of these critics. To enhance the critics’ sensitivity to risks, we integrate risk-aware prompts (refer to Appendix A.4). Table 3 reaffirms the efficacy of InferAct; with the risk-aware prompt, InferAct achieves the best performance in all metrics. In ALFWorld, however, the addition of the risk-aware prompt does not alter the performance, indicating that all methods are insensitive to this feature. In WebShop, although adding a risk-aware prompt might not always lead to a higher F1-score, it effectively reduces the costs associated with undetected adverse actions for all evaluated critics. This is exemplified by both Multi-step Evaluation and the Standard Evaluation method, where the precision deteriorates while the cost is reduced. As shown in Figure 5, more cases are predicted as positive after integrating the risk-aware prompt. This means these methods tend to be more cautious about expensive purchases. For InferAct, although the recall and precision remain unchanged, the cost also decreases.
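The cost metric can be sketched as below, assuming each evaluated trajectory is recorded as a triple of whether the Actor was actually off track, whether the critic flagged it, and the price of the item involved. The record layout, function names, and prices are illustrative assumptions.

```python
# Each record: (is_off_track, flagged_by_critic, price_in_dollars)
Record = tuple[bool, bool, float]


def undetected_cost(records: list[Record]) -> float:
    """WebShop-style cost: total price of off-track purchases the critic missed."""
    return sum(price for off_track, flagged, price in records if off_track and not flagged)


def undetected_count(records: list[Record]) -> int:
    """ALFWorld-style cost: number of off-track operations the critic missed."""
    return sum(1 for off_track, flagged, _ in records if off_track and not flagged)


records = [
    (True, False, 61.0),   # missed error: contributes to the cost
    (True, True, 120.0),   # detected error: execution is prevented, no cost
    (False, False, 64.5),  # correct purchase: no cost
    (True, False, 89.5),   # missed error: contributes to the cost
]
print(undetected_cost(records), undetected_count(records))
```

Two critics with the same recall can therefore differ in cost, since the cost depends on which errors are missed and how expensive they are.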

Methods	WebShop				ALFWorld
Methods	Rec	Prec	F1	Cost	Rec	Prec	F1	Cost
Standard Eval
w/o risk aware	32.6	71.4	44.8	$5646.8	100.0	44.2	61.3	0
w risk aware	43.5	69.0	53.3	$4616.5	100.0	44.2	61.3	0
Multi-step Eval
w/o risk aware	89.1	74.5	81.2	$686.5	94.7	42.9	59.0	1
w risk aware	89.1	70.7	78.8	$603.5	94.7	42.9	59.0	1
InferAct
w/o risk aware	95.7	73.3	83.0	$228.0	100.0	46.3	63.3	0
w risk aware	95.7	73.3	83.0	$170.0	100.0	46.3	63.3	0

Table 3: InferAct achieves the best performance under high-stakes conditions.

6 Conclusion

Performing real-time evaluation of the reasoning process of LLM agents before executing costly or irreversible actions is crucial for deploying such models in many real-life applications, yet it remains significantly understudied. This paper proposes InferAct, built on the Theory-of-Mind abilities of LLMs, which aims to proactively assess the risk and alert humans when needed, thereby mitigating or preventing negative outcomes before they occur. Experiments demonstrate the superior performance of InferAct across different environments and the benefit of human feedback. Further findings in the high-stakes setting reveal that, when equipped with the risk-aware prompt, InferAct improves its robustness and behaves more cautiously when facing costly decisions, consequently reducing the risk and expense of incorrect decisions. This makes InferAct a valuable tool for LLM agents in applications. InferAct sets baselines for further research that emphasizes proactively guiding LLM agents in order to develop trustworthy systems.

7 Limitations

Despite the efficacy of InferAct in preemptive adverse action detection for LLM agents, several limitations warrant mention and provide avenues for future research. First, as InferAct leverages the ToM ability of LLMs, smaller LLMs may exhibit suboptimal performance in comparison to their larger counterparts due to limitations in their ToM and instruction-following abilities.

Second, the scope of our high-stakes experiments is currently confined to simulations within online shopping and household environments. This limited scope may not adequately capture the complexity of high-stakes scenarios in other critical fields such as healthcare and finance. For instance, risk measurement in finance Tarantino (2010) involves multifaceted variables and interactions that are significantly more complex than the cost metric used in our study. Developing effective preemptive evaluation approaches to enhance the safety of LLM-based agents in different fields is an important direction for future work. Additionally, our focus was on the immediate and direct consequences of critical actions, without delving into the long-term and indirect effects that may hold substantial importance Lindner et al. (2021).

Third, while we demonstrate the effectiveness of InferAct in integrating binary and natural language feedback to make agents' reasoning safer and more accurate, natural language feedback is inherently variable due to individual differences in expression and language proficiency. Investigating how such variability influences the interpretation and subsequent actions of LLM agents is an interesting topic for future research.

Acknowledgments

This work was supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) through the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Education and Research. We gratefully acknowledge the support of Microsoft with a grant for access to OpenAI GPT models via the Azure cloud (Accelerate Foundation Model Academic Research).

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
Almeida et al. (2024) Guilherme F.C.F. Almeida, José Luiz Nunes, Neele Engelmann, Alex Wiegmann, and Marcelo de Araújo. 2024. Exploring the psychology of LLMs’ moral and legal reasoning. Artificial Intelligence, 333:104145.
Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
Fang et al. (2024) Haishuo Fang, Xiaodan Zhu, and Iryna Gurevych. 2024. DARA: Decomposition-alignment-reasoning autonomous language agent for question answering over knowledge graphs. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand. Association for Computational Linguistics.
Hagendorff (2023) Thilo Hagendorff. 2023. Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods. arXiv preprint arXiv:2303.13988.
Hagendorff et al. (2023) Thilo Hagendorff, Sarah Fabi, and Michal Kosinski. 2023. Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT. Nature Computational Science, 3(10):833–838.
Hua et al. (2024) Wenyue Hua, Xianjun Yang, Zelong Li, Cheng Wei, and Yongfeng Zhang. 2024. TrustAgent: Towards safe and trustworthy LLM-based agents through agent constitution. arXiv preprint arXiv:2402.01586.
Kim et al. (2023a) Byoungjip Kim, Youngsoo Jang, Lajanugen Logeswaran, Geon-Hyeong Kim, Yu Jin Kim, Honglak Lee, and Moontae Lee. 2023a. Prospector: Improving LLM agents with self-asking and trajectory ranking. In NeurIPS 2023 Foundation Models for Decision Making Workshop.
Kim et al. (2023b) Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023b. Language models can solve computer tasks. In Advances in Neural Information Processing Systems, volume 36, pages 39648–39677. Curran Associates, Inc.
Kosinski (2023) Michal Kosinski. 2023. Theory of mind might have spontaneously emerged in large language models. arXiv preprint arXiv:2302.02083.
Lee et al. (2024) Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. 2024. RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback. In Forty-first International Conference on Machine Learning.
Li et al. (2024) Moxin Li, Wenjie Wang, Fuli Feng, Fengbin Zhu, Qifan Wang, and Tat-Seng Chua. 2024. Think twice before assure: Confidence estimation for large language models through reflection on multiple answers. arXiv preprint arXiv:2403.09972.
Lindner et al. (2021) David Lindner, Hoda Heidari, and Andreas Krause. 2021. Addressing the long-term impact of ML decisions via policy regret. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 537–544. International Joint Conferences on Artificial Intelligence Organization. Main Track.
Liu et al. (2018) Bing Liu, Gokhan Tür, Dilek Hakkani-Tür, Pararth Shah, and Larry Heck. 2018. Dialogue learning with human teaching and feedback in end-to-end trainable task-oriented dialogue systems. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2060–2069, New Orleans, Louisiana. Association for Computational Linguistics.
Liu et al. (2024) Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. 2024. Agentbench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations.
Madaan et al. (2022) Aman Madaan, Niket Tandon, Peter Clark, and Yiming Yang. 2022. Memory-assisted prompt editing to improve GPT-3 after deployment. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2833–2861, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, pages 46534–46594. Curran Associates, Inc.
Mielke et al. (2022) Sabrina J. Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. 2022. Reducing conversational agents’ overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857–872.
Premack and Woodruff (1978) David Premack and Guy Woodruff. 1978. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4):515–526.
Qian et al. (2023) Chen Qian, Yufan Dang, Jiahao Li, Wei Liu, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. 2023. Experiential co-learning of software-developing agents. arXiv preprint arXiv:2312.17025.
Robinson and Wingate (2023) Joshua Robinson and David Wingate. 2023. Leveraging large language models for multiple choice question answering. In The Eleventh International Conference on Learning Representations.
Ruan et al. (2024) Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. 2024. Identifying the risks of LM agents with an LM-emulated sandbox. In The Twelfth International Conference on Learning Representations.
Shapira et al. (2024) Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, and Vered Shwartz. 2024. Clever Hans or neural theory of mind? Stress testing social reasoning in large language models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2257–2273, St. Julian’s, Malta. Association for Computational Linguistics.
Shi et al. (2021) Weiyan Shi, Yu Li, Saurav Sahay, and Zhou Yu. 2021. Refine and imitate: Reducing repetition and inconsistency in persuasion dialogues via reinforcement learning and human demonstration. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3478–3492, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
Shridhar et al. (2021) Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2021. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations.
Song et al. (2024) Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. 2024. Trial and error: Exploration-based trajectory optimization for LLM agents. arXiv preprint arXiv:2403.02502.
Strachan et al. (2024) James WA Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, et al. 2024. Testing theory of mind in large language models and humans. Nature Human Behaviour, pages 1–11.
Tandon et al. (2022) Niket Tandon, Aman Madaan, Peter Clark, and Yiming Yang. 2022. Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 339–352, Seattle, United States. Association for Computational Linguistics.
Tarantino (2010) Anthony Tarantino. 2010. Essentials of risk management in finance, volume 53. John Wiley & Sons.
Tian et al. (2023) Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5433–5442, Singapore. Association for Computational Linguistics.
Ullman (2023) Tomer Ullman. 2023. Large language models fail on trivial alterations to theory-of-mind tasks. arXiv preprint arXiv:2302.08399.
Ulmer et al. (2024) Dennis Ulmer, Martin Gubri, Hwaran Lee, Sangdoo Yun, and Seong Joon Oh. 2024. Calibrating large language models using their generations only. arXiv preprint arXiv:2403.05973.
Wang et al. (2023a) Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Qian-Wen Zhang, Zhao Yan, and Zhoujun Li. 2023a. MAC-SQL: Multi-agent collaboration for text-to-SQL. arXiv preprint arXiv:2312.11242.
Wang et al. (2024) Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024. Executable code actions elicit better LLM agents. In ICLR 2024 Workshop on Large Language Model (LLM) Agents.
Wang et al. (2023b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023b. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
Wu et al. (2024) Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. 2024. OS-copilot: Towards generalist computer agents with self-improvement. In ICLR 2024 Workshop on Large Language Model (LLM) Agents.
Xie et al. (2024) Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. 2024. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972.
Xu et al. (2024) Ruoxi Xu, Yingfei Sun, Mengjie Ren, Shiguang Guo, Ruotong Pan, Hongyu Lin, Le Sun, and Xianpei Han. 2024. AI for social science and social science of AI: A survey. Information Processing & Management, 61(3):103665.
Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
Yao et al. (2022) Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. WebShop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems, volume 35, pages 20744–20757. Curran Associates, Inc.
Yao et al. (2023) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.
Yao et al. (2024) Weiran Yao, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Yihao Feng, Le Xue, Rithesh R N, Zeyuan Chen, Jianguo Zhang, Devansh Arpit, Ran Xu, Phil L Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. 2024. Retroformer: Retrospective large language agents with policy gradient optimization. In The Twelfth International Conference on Learning Representations.
Zhao et al. (2024) Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. 2024. ExpeL: LLM agents are experiential learners. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada, pages 19632–19642. AAAI Press.
Zhou et al. (2024a) Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2024a. Language agent tree search unifies reasoning, acting, and planning in language models. In Forty-first International Conference on Machine Learning.
Zhou et al. (2024b) Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024b. WebArena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations.

Appendix A Instructions for different Methods