InvAgent: A Large Language Model based Multi-Agent System for Inventory Management in Supply Chains

Yinzhu Quan
Georgia Institute of Technology
Atlanta, GA 30332, USA
[email protected]
Zefang Liu
Georgia Institute of Technology
Atlanta, GA 30332, USA
[email protected]
These authors contributed equally to this work.

Abstract

Supply chain management (SCM) involves coordinating the flow of goods, information, and finances across various entities to deliver products efficiently. Effective inventory management is crucial in today’s volatile, uncertain, complex, and ambiguous (VUCA) world. Previous research has demonstrated the superiority of heuristic methods and reinforcement learning applications in inventory management. However, the application of large language models (LLMs) as autonomous agents in multi-agent systems for inventory management remains underexplored. This study introduces a novel approach using LLMs to manage multi-agent inventory systems. Leveraging their zero-shot learning capabilities, our model, InvAgent, enhances resilience and improves efficiency across the supply chain network. Our contributions include utilizing LLMs for zero-shot learning to enable adaptive and informed decision-making without prior training, providing significant explainability and clarity through Chain-of-Thought (CoT), and demonstrating dynamic adaptability to varying demand scenarios while minimizing costs and avoiding stockouts. Extensive evaluations across different scenarios highlight the efficiency of our model in SCM.


1 Introduction

Supply chain management (SCM) involves coordinating and managing the flow of goods, information, and finances across various interconnected entities, from suppliers to consumers, to deliver products efficiently and effectively. Inventory management, a critical component of SCM, focuses on overseeing and controlling the ordering, storage, and use of components and finished products. In today’s volatile, uncertain, complex, and ambiguous (VUCA) world, effective inventory management is essential for aligning supply with demand, minimizing costs, and enhancing the resilience of supply chains. This ensures that companies can adapt to disruptions Quan et al. (2023), optimize resources Abaku et al. (2024), and maintain seamless operations Yasmin (2024) in a highly interconnected and dynamic market environment.

Previous research in inventory management has explored various applications of heuristic methods, such as the beer distribution game Goodwin and Franklin (1994); Edali and Yasarcan (2014); Oroojlooyjadid et al. (2022). Numerous reinforcement learning models have also been investigated, including decentralized inventory management Mousa et al. (2024) and adaptive supply chain synchronization Kegenbekov and Jackson (2021). However, these methods often require sophisticated design and extensive training resources, and they lack explainability. In contrast, large language models (LLMs) present a promising alternative, offering adaptive decision-making without prior training and enhanced interpretability. Recent studies have started to utilize LLMs in supply chain research, as demonstrated by Li et al. (2023a), Quan and Liu (2024), and Singla et al. (2023).

Figure 1: The framework of InvAgent, an LLM-based zero-shot multi-agent inventory management system. First, the user proxy resets the environment at the beginning of the first round. Second, the user proxy requests the state of the current round for each stage from the environment. Then, the user proxy provides the current state to each stage and requests an action from it. Finally, all agents take their actions together and the environment moves to the next state.

LLMs are increasingly utilized as autonomous agents in multi-agent systems, showcasing advanced planning, decision-making, and simulation capabilities across diverse domains Guo et al. (2024) such as gaming Mao et al. (2023) and financial markets Li et al. (2023c). However, the application of LLMs to the multi-agent inventory management problem (IMP) within supply chains remains relatively underexplored. In this study, we propose InvAgent (code available at https://github.com/zefang-liu/InvAgent), a zero-shot multi-agent inventory management system utilizing LLMs. Our approach leverages the reasoning and decision-making capabilities of LLMs to enhance system resilience and foster collaboration across the components of the supply chain network. The framework is shown in Figure 1.

The contributions of this paper are as follows:

1. We leverage LLMs to manage multi-agent inventory systems as zero-shot learners, enabling adaptive and informed decision-making without prior training or specific examples.
2. Our model offers significant explainability and clarity, enhanced by Chain-of-Thought (CoT) for reasoning, making it easier to understand and trust, and more reliable compared to traditional heuristic and reinforcement learning models.
3. Our model adapts dynamically to varying demand scenarios, minimizing costs and avoiding stockouts, demonstrating efficiency in supply chain management through extensive evaluation across different scenarios.

2 Related Work

LLM-Based Multi-Agent System Applications in Economics. LLM-based multi-agent systems (MASs) have been used in economic and financial trading simulations to model human behavior. They enable agents with specific endowments, information, and preferences to interact in scenarios like macroeconomic activities Li et al. (2023b), information marketplaces Weiss et al. (2023), financial trading Li et al. (2023c), and virtual town simulations Zhao et al. (2023). These agents operate in both cooperative and decentralized environments, demonstrating diverse applications in economic studies Guo et al. (2024).

Multi-Agent System Applications in Supply Chain. Several studies explore the potential of MASs to enhance supply chain efficiency and responsiveness, addressing challenges that range from integration to dynamic adaptation and coordination. Nissen (2001) examines the integration of supply chains using agent-based technologies, highlighting how agents can facilitate more efficient and responsive supply chain operations. Kaihara (2003) discusses the application of MASs in modeling supply chains that operate in dynamic environments, focusing on how agents can adapt to changes and uncertainties. Moyaux et al. (2003) explore how multi-agent coordination mechanisms can help reduce the bullwhip effect in supply chains, using a token-based approach to enhance collaboration and information sharing among agents.

Multi-Agent Reinforcement Learning Applications in Supply Chain. Research in multi-agent reinforcement learning (MARL) for SCM focuses on optimizing interactions and cooperation among multiple agents in dynamic environments. Oroojlooyjadid et al. (2022) propose the Shaped-Reward Deep Q-Network (SRDQN) algorithm for RL in the beer distribution game, where agents optimize behaviors through rewards and punishments to improve performance. Hori and Matsui (2023) enhance cooperative policies in the beer game using reward shaping techniques based on mechanism design applied to SRDQN, improving performance in multi-agent settings. Additionally, OR-Gym Hubbs et al. (2020) is an open-source library that benchmarks RL solutions against heuristic models in operations research problems including SCM.

3 Methodology

This section outlines the methodological framework, starting with the definition of a multi-period, multi-echelon production-inventory system. We then propose InvAgent, a large language model (LLM) based multi-agent inventory management system designed for supply chain optimization.

3.1 Problem Definition

A multi-period, multi-echelon production-inventory system for a single non-perishable product is designed for illustrating and simulating a typical multi-stage supply chain. As shown in Figure 2, each stage in this supply chain consists of an inventory holding area and a production area. The inventory holding area stores the materials necessary for production at that stage. One unit of inventory produces one unit of product at each stage. There are lead times for transferring products between stages. The outgoing material from stage $i$ serves as the feed material for production at stage $i-1$ . Stages are numbered in ascending order: $0,1,...,M-1$ , with stage 0 being the retailer. Production at each stage is limited by the stage’s production capacity and available inventory. Figure 2 depicts the flow of raw materials through various stages of production and inventory management, ultimately culminating in the fulfillment of customer demand at the retail level.

There are $T$ periods in each simulation, starting from 1, with $t=0$ used for the initial condition of the supply chain. At the beginning of each time period, the following sequence of events occurs:

1. Check deliveries: Each stage receives incoming inventory replenishment shipments that have arrived after the stage’s respective lead time.
2. Check orders and demands: Each stage places replenishment orders to their respective suppliers. Replenishment orders are filled according to the available production capacity and inventory at the suppliers. Customer demand occurs at the retailer and is filled based on the available inventory at the retailer.
3. Deliver orders and demands: Each stage delivers as many products as possible to satisfy downstream demand or replenishment orders. Unfulfilled sales and replenishment orders are backlogged, with backlogged sales taking priority in the following period.
4. Compute profits: Each stage computes the profit and cost for product sales, material orders, backlog penalties, and surplus inventory holding costs.

Notation	Definition
$m$	Stage, where $m\in\mathcal{M}=\{0,1,2,\ldots,M-1\}$
$t$	Period, where $t\in\mathcal{T}=\{0,1,2,\ldots,T\}$
$I_{m,t}$	Inventory at the end of period $t$
$\hat{I}_{m,t}$	Desired inventory at the end of period $t$
$O_{m,t}$	Requested order placed during period $t$
$R_{m,t}$	Fulfilled order during period $t$
$D_{t}$	Customer demand during period $t$
$S_{m,t}$	Sales during period $t$
$B_{m,t}$	Backlog at the end of period $t$
$L_{m}$	Lead times between stage $m+1$ and stage $m$
$L_{\max}$	Maximum lead time in the system
$P_{m,t}$	Profit at stage $m$ during period $t$
$c_{m}$	Production capacity at stage $m$
$p_{m}$	Unit sale price
$r_{m}$	Unit order (procurement) cost
$k_{m}$	Unit penalty for unfulfilled order
$h_{m}$	Unit inventory holding cost

Table 1: Notations and definitions for parameters.

With the notation defined in Table 1, the entire inventory management problem (IMP), inspired by Hubbs et al. (2020), can be expressed using the following equations:

	$\displaystyle I_{m,t}=I_{m,t-1}+R_{m,t-L_{m}}-S_{m,t},\qquad\forall m\in\mathcal{M},$		(1)
	$\displaystyle R_{m,t}=\min(B_{m+1,t-1}+O_{m,t},\;c_{m+1},\;I_{m+1,t-1}+R_{m+1,t-L_{m+1}}),\qquad\forall m=0,\ldots,M-2,$		(2a)
	$\displaystyle R_{M-1,t}=O_{M-1,t},$		(2b)
	$\displaystyle S_{m,t}=R_{m-1,t},\qquad\forall m=1,\ldots,M-1,$		(3a)
	$\displaystyle S_{0,t}=\min(B_{0,t-1}+D_{t},\;c_{0},\;I_{0,t-1}+R_{0,t-L_{0}}),$		(3b)
	$\displaystyle B_{m,t}=B_{m,t-1}+O_{m-1,t}-S_{m,t},\qquad\forall m=1,\ldots,M-1,$		(4a)
	$\displaystyle B_{0,t}=B_{0,t-1}+D_{t}-S_{0,t},$		(4b)
	$\displaystyle P_{m,t}=p_{m}S_{m,t}-r_{m}R_{m,t}-k_{m}B_{m,t}-h_{m}I_{m,t},\qquad\forall m\in\mathcal{M}.$		(5)

In Equation 1, the inventory at stage $m$ at the end of period $t$ equals the final inventory of the previous period, plus the fulfilled order placed $L_{m}$ periods ago, minus the sales during the current period. In Equation 2a, the fulfilled order at stage $m$ placed during period $t$ is determined by (1) the previous backlog at the upstream stage plus the newly requested order, (2) the upstream stage’s production capacity, and (3) the total available inventory at the upstream stage $m+1$ at the start of period $t$, including leftover stock from the previous period and newly arrived orders after accounting for lead time. The fulfilled order is the minimum of these three quantities, so it never exceeds any of these constraints. This keeps the supply chain within its capacity and inventory limits, preventing overcommitment and stockouts. Equation 2b states that requested orders at the uppermost stage are always fulfilled, because we assume an unlimited supply of raw materials. Sales are always equal to the fulfilled orders of the downstream stage, except at stage 0 (retailer), as shown in Equation 3a. In Equation 3b, sales at stage 0 (retailer) during period $t$ are the minimum of three quantities: (1) the previous backlog at stage 0 plus the current customer demand, (2) the production capacity of stage 0, and (3) the total available inventory at stage 0 at the start of period $t$, which includes leftover stock from the previous period and newly fulfilled orders after accounting for the lead time. This ensures that sales at the retailer do not exceed the total demand, production capacity, or available stock. In Equation 4a, the backlog at stage $m$ at the end of period $t$ is the previous period’s backlog at stage $m$ plus the order requested by the downstream stage $m-1$, minus the sales at stage $m$, for all stages except the retailer. In Equation 4b, the backlog at stage 0 (retailer) is calculated similarly to Equation 4a, but the requested order is replaced by customer demand because the retailer is directly in contact with customers. In Equation 5, the profit at each stage $m$ during period $t$ is calculated as the sales revenue minus the procurement costs, unfulfilled demand costs, and inventory holding costs.
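To make the bookkeeping in Equations 1 to 5 concrete, the following is a minimal sketch of a single-period update. It is not the authors' implementation: the function names `arriving` and `step`, the argument layout, and the convention that fulfilled orders before period 1 are zero are all assumptions made for illustration.

```python
def arriving(R_hist, m, t, lead):
    """R_{m, t-L_m}: order fulfilled L_m periods ago (assumed 0 before period 1)."""
    tau = t - lead[m]
    return R_hist[m][tau] if tau >= 1 else 0


def step(t, inv, backlog, R_hist, orders, demand,
         cap, price, cost, penalty, hold, lead):
    """Advance one period. inv and backlog hold I_{m,t-1} and B_{m,t-1};
    R_hist[m][tau] stores R_{m,tau}, with index 0 used as a placeholder."""
    M = len(inv)
    # Stock available at the start of period t: I_{m,t-1} + R_{m,t-L_m}
    avail = [inv[m] + arriving(R_hist, m, t, lead) for m in range(M)]

    R = [0] * M
    for m in range(M - 1):                                   # Eq. (2a)
        R[m] = min(backlog[m + 1] + orders[m], cap[m + 1], avail[m + 1])
    R[M - 1] = orders[M - 1]                                 # Eq. (2b)

    S = [0] * M
    S[0] = min(backlog[0] + demand, cap[0], avail[0])        # Eq. (3b)
    for m in range(1, M):                                    # Eq. (3a)
        S[m] = R[m - 1]

    new_inv = [avail[m] - S[m] for m in range(M)]            # Eq. (1)
    new_backlog = [backlog[0] + demand - S[0]]               # Eq. (4b)
    new_backlog += [backlog[m] + orders[m - 1] - S[m]        # Eq. (4a)
                    for m in range(1, M)]
    profit = [price[m] * S[m] - cost[m] * R[m]               # Eq. (5)
              - penalty[m] * new_backlog[m] - hold[m] * new_inv[m]
              for m in range(M)]

    for m in range(M):
        R_hist[m].append(R[m])                               # record R_{m,t}
    return new_inv, new_backlog, profit
```

With the constant-demand parameters of Table 2 (inventory 12, lead time 2, capacity 20, zero prices and order costs, unit backlog and holding costs), ordering 4 units at every stage in period 1 leaves 8 units of inventory at each stage, no backlog, and a total profit of -32 under this sketch.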

3.2 InvAgent

In this work, we propose InvAgent, an LLM-based multi-agent inventory management system for supply chain optimization. InvAgent includes several key agents: one user proxy and one agent for each stage. The user proxy serves as an intermediary between the environment and all supply chain agents, facilitating communication and managing the exchange of data. The InvAgent framework is illustrated in Figure 1 and follows these steps:

1. The user proxy resets the environment at the beginning of the first round.
2. The user proxy requests the state of the current round for each stage from the environment.
3. The user proxy provides the state to each stage and requests the action from it.
4. The user proxy sends the agent actions to the environment and obtains the next state and the reward for this step.
5. The user proxy determines whether the simulation is terminated; if not, the simulation moves to step 2.
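The five steps above can be sketched as a plain control loop. This is only an illustration under stated assumptions: `ToyEnv` is a hypothetical stand-in with a placeholder reward (it is neither the authors' environment nor the Gymnasium API), and each stage agent is modeled as an ordinary callable rather than an LLM-backed agent.

```python
class ToyEnv:
    """Hypothetical stand-in for the supply chain environment."""

    def __init__(self, num_stages=4, num_periods=12):
        self.num_stages = num_stages
        self.num_periods = num_periods
        self.period = 0

    def _states(self):
        # One (placeholder) state per stage for the current round.
        return {m: {"period": self.period + 1, "stage": m}
                for m in range(self.num_stages)}

    def reset(self):
        self.period = 0
        return self._states()

    def step(self, actions):
        # Placeholder reward: each ordered unit costs 1 (illustration only).
        self.period += 1
        rewards = {m: -float(a) for m, a in actions.items()}
        terminated = self.period >= self.num_periods
        return self._states(), rewards, terminated


def run_episode(env, agents):
    """User-proxy loop: reset, query states, collect actions, step, repeat."""
    states = env.reset()                                    # step 1
    total_reward, terminated = 0.0, False
    while not terminated:                                   # step 5
        # steps 2-3: hand each stage its state and ask for an action
        actions = {m: agents[m](states[m]) for m in states}
        # step 4: send all actions to the environment together
        states, rewards, terminated = env.step(actions)
        total_reward += sum(rewards.values())
    return total_reward
```

In an LLM-backed version, each callable in `agents` would format the state into a prompt, query the model, and parse the returned order quantity.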

At the beginning of the simulation, we create system messages for the agents (Figure 3), which provide essential information such as definitions, roles, and goals in the supply chain. The state $s_{m,t}$ and action $a_{m,t}$ of an agent are defined as $s_{m,t}=[c_{m},p_{m},r_{m},k_{m},h_{m},L_{m},I_{m,t-1},B_{m,t-1},B_{m+1,t-1},S_{m,t-L_{\max}},\dots,S_{m,t-1},0,\dots,0,R_{m,t-L_{m}},\dots,R_{m,t-1}]$ and $a_{m,t}=O_{m,t}$, where the state includes the stage features, current inventory, backlog, upstream backlog, recent sales, and arriving deliveries with left zero padding.
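As an illustration of the state layout, the sketch below assembles $s_{m,t}$ for one stage. The helper names `last_n` and `build_state` are hypothetical, and two details are assumptions not stated in the text: the upstream backlog of the topmost stage is taken as 0, and histories shorter than the requested window are left-padded with zeros.

```python
def last_n(values, n):
    """Last n entries of a history, left-padded with zeros if it is shorter."""
    tail = list(values)[-n:] if n > 0 else []
    return [0] * (n - len(tail)) + tail


def build_state(m, cap, price, cost, penalty, hold, lead,
                inv_prev, backlog_prev, sales_hist, fulfilled_hist):
    """Assemble s_{m,t}; sales_hist[m] and fulfilled_hist[m] are ordered
    oldest to newest and end at period t-1."""
    M, L_max = len(cap), max(lead)
    # Assumption: the topmost stage has no upstream backlog.
    upstream_backlog = backlog_prev[m + 1] if m + 1 < M else 0
    recent_sales = last_n(sales_hist[m], L_max)        # S_{m,t-L_max..t-1}
    deliveries = last_n(fulfilled_hist[m], lead[m])    # R_{m,t-L_m..t-1}
    padding = [0] * (L_max - lead[m])                  # left zero padding
    return [cap[m], price[m], cost[m], penalty[m], hold[m], lead[m],
            inv_prev[m], backlog_prev[m], upstream_backlog,
            *recent_sales, *padding, *deliveries]
```

Padding the delivery window to length $L_{\max}$ gives every stage a state of the same length, which is convenient when a single policy is shared across stages.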

Figure 3: System messages providing essential information, such as definitions, roles, and goals in the supply chain.

The prompt, as designed in Figure 4, provides the state and requests actions from each agent, ensuring effective decision-making and clear communication within the supply chain. It includes contextual information such as the current period, the stage (counted from 1 instead of 0 in the prompt to avoid confusing the LLM), and the number of stages, to position the model within the supply chain. The state description (Figure 5) provides a comprehensive snapshot of inventory levels, backlogs, previous sales, and incoming deliveries, enabling informed decisions. Demand (Figure 6) and downstream order (Figure 7) details help match supply with immediate needs, allowing upstream suppliers to quickly respond to downstream orders or demands. The strategy description (Figure 8) outlines guidelines such as considering lead times and avoiding overordering to maintain inventory balance. By requesting reasoning before specifying the action, the prompt promotes transparency and interpretability in decision-making. This design leverages LLMs’ capabilities to enhance inventory management, ensuring decisions are well-informed, transparent, and aligned with the supply chain strategy. One example of the prompt and the response from GPT-4 is shown in Appendix A.
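Because the prompt asks for reasoning first and the action last, the order quantity has to be extracted from free-form text. The sketch below shows one way this could be done, assuming the action appears as a bracketed non-negative integer such as `[4]`. The function name `parse_order`, the fallback value, and the choice to take the last bracketed number are assumptions for illustration, not the authors' parser.

```python
import re


def parse_order(reply: str, default: int = 0) -> int:
    """Return the last bracketed non-negative integer in an LLM reply.

    Assumption: the reasoning comes first and the action, e.g. "[4]",
    comes last, so the final match is treated as the order quantity.
    """
    matches = re.findall(r"\[(\d+)\]", reply)
    return int(matches[-1]) if matches else default


# Hypothetical reply text, for illustration only.
reply = "Inventory covers two periods of demand, so I order [4]."
order = parse_order(reply)
```

Falling back to a default order when no bracketed number is found keeps the simulation running, although a stricter design might re-query the model instead.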

Figure 4: Prompt provided to LLMs for inventory management simulation. State description, demand description, downstream order description, and strategy description are shown in Figures 5, 6, 7, and 8, respectively.

Figure 5: State descriptions providing the current state for each agent in each period. For the previous sales, we select the most recent $L_{\max}$ periods, and for the arriving deliveries, we select the next $L_{m}$ periods.

Figure 6: Demand descriptions for different demand scenarios included in the LLM prompt.

Figure 7: The downstream order from the previous stage to the current stage in one round, which delivers downstream information faster.

Figure 8: Strategy description introducing the golden rule of the problem and providing the LLM with suggestions for decision-making.

The features of the prompt design are as follows:

• Zero-Shot Learning. Our prompt operates on a zero-shot basis, without providing any specific examples to the LLM. The model must therefore generate responses based solely on its pre-existing knowledge and the information presented in the prompt.
• Demand Description. Unlike reinforcement learning, our approach has no training process through which the model could learn the environment and the demand pattern, so a clear and detailed description of the demand is crucial for accurate understanding and effective responses.
• Downstream Order. The prompt includes the downstream order, so that information is delivered swiftly and shared efficiently between stages.
• Human-Crafted Strategy. The inherent strategy of the LLM is generally sufficient for simple scenarios, such as constant demands. However, for more complex scenarios like seasonal demands, it is assumed that additional human-crafted strategies can be helpful for the LLM’s decision-making.
• Chain-of-Thought (CoT). The CoT approach can enhance the explainability of the results. By guiding the LLM through a structured reasoning process, CoT helps the model to better understand the scenario and improve its reasoning capabilities, ultimately leading to more accurate and reliable outcomes.

4 Experiments

In this section, we evaluate the performance of InvAgent, our proposed large language model (LLM) based multi-agent inventory management system, by describing experimental scenarios, baseline models, and the experimental setup. We then present results showing InvAgent’s adaptability and efficiency, concluding with ablation studies to assess the influence of various prompt components in dynamic supply chain management.

4.1 Experiment Scenarios

Scenario	Constant	Variable	Larger	Seasonal	Normal
Number of Stages	4	4	4	4	4
Number of Periods	12	12	12	12	12
Initial Inventories	[12, 12, 12, 12]	[12, 12, 12, 12]	[12, 12, 12, 12]	[12, 12, 12, 12]	[12, 14, 16, 18]
Lead Times	[2, 2, 2, 2]	[2, 2, 2, 2]	[2, 2, 2, 2]	[2, 2, 2, 2]	[1, 2, 3, 4]
Demand	4	$\mathcal{U}(0,4)$	$\mathcal{U}(0,8)$	$\mathcal{C}(4,8)$	$\mathcal{N}(4,2^{2})$
Product Capacities	[20, 20, 20, 20]	[20, 20, 20, 20]	[20, 20, 20, 20]	[20, 20, 20, 20]	[20, 22, 24, 26]
Sales Prices	[0, 0, 0, 0]	[0, 0, 0, 0]	[5, 5, 5, 5]	[5, 5, 5, 5]	[9, 8, 7, 6]
Order Costs	[0, 0, 0, 0]	[0, 0, 0, 0]	[5, 5, 5, 5]	[5, 5, 5, 5]	[8, 7, 6, 5]
Backlog Costs	[1, 1, 1, 1]	[1, 1, 1, 1]	[1, 1, 1, 1]	[1, 1, 1, 1]	[1, 1, 1, 1]
Holding Costs	[1, 1, 1, 1]	[1, 1, 1, 1]	[1, 1, 1, 1]	[1, 1, 1, 1]	[1, 1, 1, 1]

Table 2: Parameter settings for different supply chain scenarios.

We describe the various experiment scenarios designed to evaluate the performance of our inventory management system in a multi-echelon supply chain. Each scenario introduces specific conditions to rigorously test the robustness and adaptability of the proposed model. Parameter settings for these scenarios are summarized in Table 2.

In the first scenario, a four-stage supply chain is tested with a constant demand of 4 units per period over 12 periods, starting with 12 units of inventory per stage and a lead time of 2 periods. This scenario tests the basic functionality of the model under stable conditions. The second scenario introduces variable demand, uniformly ranging between 0 and 4 units per period, adding randomness to evaluate the system’s ability to manage fluctuating demand while maintaining efficient inventory. The third scenario further increases demand variability, with a uniform distribution between 0 and 8 units per period, and sets the unit sales price and unit order cost to 5 at every stage, testing the model’s capability to handle high variability and financial impacts. The fourth scenario simulates seasonal demand with a leaping pattern ranging from 4 to 8 units per period, keeping the same financial parameters as the third scenario, to evaluate the system’s performance under predictable but varying demand patterns. Finally, the fifth scenario features normally distributed demand with a mean of 4 units and a standard deviation of 2 units per period, together with lead times, initial inventories, production capacities, sales prices, and order costs that vary across the stages. This scenario tests the system’s performance under more realistic demand fluctuations and varying operational constraints. Together, these scenarios provide a comprehensive test bed for evaluating the efficacy and adaptability of our multi-agent system in managing dynamic inventory across a multi-echelon supply chain.

4.2 Baselines

We have four baselines: two heuristic policies, namely the base-stock policy and the tracking demand policy, and two reinforcement learning (RL) policies, specifically independent proximal policy optimization (IPPO) and multi-agent proximal policy optimization (MAPPO).

Two heuristic baselines are designed based on the desired inventory levels, where each stage aims to maintain sufficient inventory to fulfill customer demands or downstream orders. The stage order (action) is computed by

	$\displaystyle O_{m,t}=\min(\max(0,\hat{O}_{m,t}),c_{m}),$		(6)

where the $\hat{O}_{m,t}$ is determined by the desired inventory $\hat{I}_{m,t}$ , current inventory $I_{m,t-1}$ , upstream backlog order $B_{m+1,t-1}$ , and cumulative sum of arriving deliveries as

	$\displaystyle\hat{O}_{m,t}=\hat{I}_{m,t}-I_{m,t-1}-B_{m+1,t-1}-\sum_{\Delta t=1}^{L_{m}}R_{m,t-\Delta t}.$		(7)

When an order is placed, the stage replenishes its inventory to this target level by ordering the difference between the desired inventory level and the current inventory position, which accounts for upstream backlog orders and arriving deliveries. Based on different choices of the desired inventory, we introduce two heuristic policies as follows.

Base-Stock Policy. The base-stock policy Lee et al. (1997); Oroojlooyjadid et al. (2022) is an inventory management strategy where each stage in the supply chain maintains a constant inventory level, or base-stock level. Here, the desired inventory level is set equal to the production capacity:

	$\displaystyle\hat{I}_{m,t}=c_{m}.$		(8)

Tracking Demand Policy. The tracking demand policy is an inventory management strategy that adjusts orders based on observed demand (or sale) patterns rather than maintaining a constant base-stock level. By dynamically aligning supply with actual consumption, this policy ensures a responsive and efficient inventory system. In this policy, the desired inventory is set as:

	$\displaystyle\hat{I}_{m,t}=\bar{S}_{m,t-1}L_{m}+B_{m,t-1},$		(9)

where the average sale for recent rounds is

	$\displaystyle\bar{S}_{m,t-1}=\frac{1}{L_{\max}}\sum_{\Delta t=1}^{L_{\max}}S_{m,t-\Delta t}.$		(10)

More heuristic baselines are discussed in Appendix B.
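The two heuristic policies in Equations 6 to 10 reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and sales histories shorter than $L_{\max}$ are treated as if the missing periods had zero sales, which is an assumption.

```python
def heuristic_order(desired_inv, inv_prev, upstream_backlog,
                    recent_fulfilled, capacity):
    """Eqs. (6)-(7): order up to the desired inventory, clipped to [0, c_m].

    recent_fulfilled holds R_{m,t-dt} for dt = 1..L_m (deliveries in transit).
    """
    raw = desired_inv - inv_prev - upstream_backlog - sum(recent_fulfilled)
    return min(max(0, raw), capacity)


def base_stock_target(capacity):
    """Eq. (8): the desired inventory equals the production capacity."""
    return capacity


def tracking_demand_target(recent_sales, lead_time, backlog_prev, L_max):
    """Eqs. (9)-(10): average recent sales times lead time, plus own backlog."""
    avg_sales = sum(recent_sales[-L_max:]) / L_max
    return avg_sales * lead_time + backlog_prev


# Base-stock: target 20, 12 on hand, 8 units in transit -> nothing to order.
base_order = heuristic_order(base_stock_target(20), 12, 0, [4, 4], 20)

# Tracking demand: average sales 4 over L_max = 2, lead time 2 -> target 8.
target = tracking_demand_target([4, 4], lead_time=2, backlog_prev=0, L_max=2)
track_order = heuristic_order(target, 4, 0, [0, 0], 20)
```

The clipping in Equation 6 matters in both directions: a stage that already holds more than its target orders nothing rather than a negative amount, and no order exceeds the capacity $c_{m}$.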

Independent Proximal Policy Optimization (IPPO) with Parameter Sharing. Proximal policy optimization (PPO) Schulman et al. (2017) updates policies by iteratively sampling data through interaction with the environment and optimizing a clipped surrogate objective function using multiple epochs of stochastic gradient ascent. Independent PPO (IPPO) De Witt et al. (2020) is an RL approach where each agent is independently trained with the PPO algorithm. Here we employ parameter sharing, using the same policy parameters for all agents to improve learning efficiency and coordination.
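For reference, the clipped surrogate objective mentioned above takes the following standard form in Schulman et al. (2017). The symbols $\rho_{t}$, $\hat{A}_{t}$, and $\epsilon$ are not part of this paper's notation and are introduced here only for illustration; $\rho_{t}$ is used for the probability ratio to avoid a clash with the order cost $r_{m}$.

```latex
\rho_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t} \mid s_{t})},
\qquad
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_{t}\left[\min\left(\rho_{t}(\theta)\,\hat{A}_{t},\;
\mathrm{clip}\left(\rho_{t}(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{t}\right)\right]
```

Here $\hat{A}_{t}$ is an advantage estimate and $\epsilon$ is the clipping range; the clipping removes the incentive to move the new policy far from the one that collected the data.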

Multi-Agent Proximal Policy Optimization (MAPPO). Multi-agent PPO (MAPPO) Yu et al. (2022) is an extension of the PPO algorithm designed for multi-agent environments. It utilizes a centralized value function that takes into account global information from all agents, improving variance reduction and stability during training.

4.3 Experiment Settings

The performance of our model InvAgent is evaluated using the total reward across all stages and all periods of one simulation (episode), and the reported numbers are averaged over 5 episodes for each experiment to reduce uncertainty. We use Python packages including AutoGen Wu et al. (2023), Gymnasium Towers et al. (2023), and RLlib Liang et al. (2018), together with LLMs including GPT-4, GPT-4O, and GPT-4-Turbo Achiam et al. (2023). For the constant demand scenario, we change the last part of the prompt in Figure 4 to "([0], [4], or [8] only)" to boost InvAgent’s performance.

The performance of the baseline models is evaluated based on the episode reward averaged over 100 episodes. For reinforcement learning (RL), we explore various hyper-parameter settings, including the number of hidden units ([128, 128] and [256, 256]), activation function (ReLU), learning rate (1e-4, 5e-4, 1e-3), training batch size (500, 1000, 2000), stochastic gradient descent (SGD) minibatch size (32, 64, 128), number of SGD iterations (5, 10, 20), number of training iterations (500, 800, 1000, 1500), and discount factor (1.0). For each experiment, we randomly select 20 combinations of these hyper-parameters and keep the best one. The final hyper-parameters used for all scenarios are presented in Appendix C. The RL experiments are conducted on an NVIDIA A10 GPU.

4.4 Experiment Results

Model	Constant	Variable	Larger	Seasonal	Normal
Base-Stock	-296.00 (0.00)	-523.69 (49.15)	-392.21 (111.79)	-274.29 (40.75)	-322.44 (99.59)
Tracking-Demand	-360.00 (0.00)	-412.41 (41.76)	-265.07 (99.67)	-421.90 (55.18)	-232.20 (75.45)
IPPO	-132.17 (40.17)	-389.55 (40.28)	-202.39 (92.96)	-126.73 (183.63)	-102.90 (64.68)
MAPPO	-129.81 (16.02)	-391.53 (34.09)	-106.79 (109.86)	-99.39 (126.09)	-41.98 (75.22)
InvAgent (w/o strategy)	-156.00 (0.00)	-336.60 (43.24)	-350.20 (149.57)	-488.00 (114.82)	-172.60 (104.70)
InvAgent (w/ strategy)	-200.00 (0.00)	-377.60 (53.50)	-357.60 (50.04)	-420.60 (225.42)	-192.40 (98.51)

Table 3: Mean episode rewards and standard deviations (in parentheses) for base-stock, tracking-demand, IPPO, MAPPO, and InvAgent (with and without the hand-crafted strategy) models under various demand scenarios.

Our experiment results, as shown in Table 3, highlight the performance of various models across different demand scenarios. The InvAgent model demonstrates competitive performance, particularly in the variable demand scenario, where InvAgent (without the hand-crafted strategy) achieves the highest mean episode rewards. While the MAPPO model exhibits the best performance in the other demand scenarios, InvAgent’s zero-shot capability and adaptability offer significant benefits. This adaptability allows InvAgent to make reasonable decisions and understand concepts without specific examples, showcasing a level of generalization and adaptability akin to human intuition.

When compared to heuristic baselines, InvAgent shows notable advantages. Unlike the base-stock policy and tracking-demand policy, which rely on fixed or historical data and often struggle with fluctuating demands, InvAgent adapts to the conditions observed in each period, reducing inventory costs and stockouts. InvAgent performs particularly well in the variable demand scenario, indicating its effectiveness in managing unpredictable demand patterns. In Table 3, InvAgent achieves higher rewards (lower costs) than both heuristic baselines in the constant, variable, and normal scenarios, while in the larger and seasonal scenarios one of the heuristic baselines remains ahead.

In comparison to RL models, InvAgent offers distinct strengths despite certain performance limitations. RL models like MAPPO and IPPO achieve higher metrics in several scenarios due to extensive training but come with increased complexity and potential instability. RL training often requires significant computational resources and time, with risks of overfitting. In contrast, InvAgent provides key advantages in explainability and ease of implementation, making reasonable decisions without prior training. Although InvAgent does not always outperform RL models, its stability, simplicity, and interpretability make it a valuable alternative for dynamic inventory management.

The comparison of the InvAgent model with and without strategy shows that the inherent strategy of the LLM is generally sufficient for many simpler scenarios, such as constant and variable demands. However, in more complex scenarios like seasonal demands, the addition of a human-crafted strategy enhances the LLM’s decision-making. This demonstrates that while the model’s built-in capabilities are adequate for straightforward situations, incorporating explicit strategies proves beneficial for managing more complex and patterned demand scenarios, thereby improving performance and adaptability.

4.5 Ablation Studies

We conduct ablation studies to evaluate the impact of various prompt components in InvAgent under the variable demand scenario, as detailed in Table 4. This scenario introduces randomness with demand varying uniformly between 0 and 4 units per period over 12 periods. The results show that the contributions of different components to the model’s performance vary significantly. The prompt without the strategy component performs best, so it serves as the reference against which the other settings are compared.

Model	Demand	Downstream	Strategy	CoT	History	Reward	$\Delta$ %
GPT-4	✓	✓	✗	✓	✓	-336.60 (43.24)	0.00%
GPT-4	✓	✓	✓	✓	✓	-377.60 (53.50)	-12.18%
GPT-4	✗	✓	✓	✓	✓	-349.40 (29.43)	-3.80%
GPT-4	✓	✗	✓	✓	✓	-419.00 (35.91)	-24.48%
GPT-4	✗	✗	✓	✓	✓	-379.40 (40.03)	-12.72%
GPT-4	✓	✓	✗	✓	✗	-339.20 (46.63)	-0.77%
GPT-4	✓	✓	✓	✗	✓	-369.80 (36.83)	-9.86%
GPT-4	✓	✓	✓	✓	✗	-387.40 (11.09)	-15.09%
GPT-4O	✓	✓	✓	✓	✓	-405.00 (35.14)	-20.32%
GPT-4-Turbo	✓	✓	✓	✓	✓	-636.40 (195.26)	-89.07%

Table 4: Ablation studies on different prompt settings of the InvAgent for the variable demand scenario, where each reward is averaged from 5 experiments and the standard deviation is reported in parentheses. The percentage change in the reward compared to the first result is also included.
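The percentage column in Table 4 is consistent with measuring each reward against the first row and normalizing by the magnitude of that reference reward. The short sketch below shows this calculation; the formula is inferred from the reported numbers rather than stated in the text, so it should be read as an assumption, and the name `relative_change` is hypothetical.

```python
def relative_change(reward, reference):
    """Percentage change of a reward relative to a reference reward.

    Dividing by abs(reference) keeps the sign intuitive for negative
    rewards: a lower (worse) reward yields a negative percentage.
    """
    return 100.0 * (reward - reference) / abs(reference)


reference = -336.60  # first row of Table 4 (prompt without strategy)
full_prompt = round(relative_change(-377.60, reference), 2)
no_downstream = round(relative_change(-419.00, reference), 2)
```

Under this formula, the full prompt evaluates to -12.18 and the prompt without the downstream order to -24.48, matching the table.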

Overall, the ablation study confirms the robustness and adaptability of our model in dynamic supply chain environments. Components such as the demand description and downstream order are particularly essential for optimizing performance under variable demands. By keeping agent histories, all previous messages in the entire episode (simulation) are retained, allowing stage agents to use the entire chat history for context in their decision-making. Additionally, structured reasoning through Chain-of-Thought (CoT) also plays a crucial role. These findings emphasize the importance of each component in achieving effective and efficient inventory management, guiding future improvements and applications in more complex scenarios.

5 Conclusion

In this study, we demonstrate the effectiveness of using large language models (LLMs) as autonomous agents for multi-agent inventory management in supply chain optimization. Our novel model, InvAgent, leverages the zero-shot learning capabilities of LLMs, enabling them to make adaptive and informed decisions without prior training. The integration of structured reasoning through the Chain-of-Thought (CoT) methodology further enhances the explainability and transparency of the model, making it more reliable and easier to trust than traditional heuristic or reinforcement learning models. The experimental results show that our model performs competitively, achieving lower costs and better adaptability than heuristic policies across various demand scenarios. This highlights the potential of LLMs to notably improve supply chain management by reducing inventory costs and minimizing stockouts.

In the future, we will fine-tune our model using reinforcement learning to enhance decision-making capabilities, allowing the LLMs to learn and optimize strategies over iterations. We also plan to use real-world data to evaluate the efficiency of our model and the utility of the agents. For real data with seasonality, decomposing data into level, trend, and seasonality components will be explored to refine predictive accuracy. Combining human-crafted strategies with the LLMs’ inherent capabilities will be another focus to handle varied and unpredictable demand patterns more robustly.

Limitations

Scope of Data. The primary limitation of this study is the dependence on simulated scenarios and synthetic data, which may not fully capture the complexities and variabilities of real-world supply chain environments. This reliance restricts the generalizability of our findings to actual supply chain operations, which may exhibit different behaviors and challenges.

Computational Resources. In our model, we need to call the OpenAI API, which incurs significant costs. Despite this limitation, our approach offers valuable insights into the use of large language models for dynamic inventory management, suggesting the need for further research and refinement.

Ethical Considerations

This study adheres to ethical AI principles, ensuring transparency, fairness, and accountability. We use only synthetic data, avoiding any private or sensitive information. Rigorous standards are maintained in experiment design and result evaluation.

References

Abaku et al. (2024) Emmanuel Adeyemi Abaku, Tolulope Esther Edunjobi, and Agnes Clare Odimarha. 2024. Theoretical approaches to AI in supply chain optimization: Pathways to efficiency and resilience. International Journal of Science and Technology Research Archive, 6(1):092–107.
Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
De Witt et al. (2020) Christian Schroeder De Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS Torr, Mingfei Sun, and Shimon Whiteson. 2020. Is independent learning all you need in the StarCraft multi-agent challenge? arXiv preprint arXiv:2011.09533.
Edali and Yasarcan (2014) Mert Edali and Hakan Yasarcan. 2014. A mathematical model of the beer game. Journal of Artificial Societies and Social Simulation, 17(4):2.
Goodwin and Franklin (1994) Jack S Goodwin and Stephen G Franklin. 1994. The beer distribution game: using simulation to teach systems thinking. Journal of Management Development, 13(8):7–15.
Guo et al. (2024) Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680.
Hori and Matsui (2023) Masaaki Hori and Toshihiro Matsui. 2023. Improving multi-agent reinforcement learning for beer game by reward design based on payment mechanism. International Journal of Smart Computing and Artificial Intelligence, 7(2).
Hubbs et al. (2020) Christian D Hubbs, Hector D Perez, Owais Sarwar, Nikolaos V Sahinidis, Ignacio E Grossmann, and John M Wassick. 2020. OR-Gym: A reinforcement learning library for operations research problems. arXiv preprint arXiv:2008.06319.
Kaihara (2003) Toshiya Kaihara. 2003. Multi-agent based supply chain modelling with dynamic environment. International Journal of Production Economics, 85(2):263–269.
Kegenbekov and Jackson (2021) Zhandos Kegenbekov and Ilya Jackson. 2021. Adaptive supply chain: Demand–supply synchronization using deep reinforcement learning. Algorithms, 14(8):240.
Lee et al. (1997) Hau L Lee, Venkata Padmanabhan, and Seungjin Whang. 1997. Information distortion in a supply chain: The bullwhip effect. Management Science, 43(4):546–558.
Li et al. (2023a) Beibin Li, Konstantina Mellou, Bo Zhang, Jeevan Pathuri, and Ishai Menache. 2023a. Large language models for supply chain optimization. arXiv preprint arXiv:2307.03875.
Li et al. (2023b) Nian Li, Chen Gao, Yong Li, and Qingmin Liao. 2023b. Large language model-empowered agents for simulating macroeconomic activities. arXiv preprint arXiv:2310.10436.
Li et al. (2023c) Yang Li, Yangyang Yu, Haohang Li, Zhi Chen, and Khaldoun Khashanah. 2023c. TradingGPT: Multi-agent system with layered memory and distinct characters for enhanced financial trading performance. arXiv preprint arXiv:2309.03736.
Liang et al. (2018) Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, and Ion Stoica. 2018. RLlib: Abstractions for distributed reinforcement learning. In International Conference on Machine Learning (ICML).
Mao et al. (2023) Shaoguang Mao, Yuzhe Cai, Yan Xia, Wenshan Wu, Xun Wang, Fengyi Wang, Tao Ge, and Furu Wei. 2023. Alympics: Language agents meet game theory. arXiv preprint arXiv:2311.03220.
Mousa et al. (2024) Marwan Mousa, Damien van de Berg, Niki Kotecha, Ehecatl Antonio del Rio-Chanona, and Max Mowbray. 2024. An analysis of multi-agent reinforcement learning for decentralized inventory control systems. Computers & Chemical Engineering, page 108783.
Moyaux et al. (2003) Thierry Moyaux, Brahim Chaib-Draa, and Sophie D’Amours. 2003. Multi-agent coordination based on tokens: Reduction of the bullwhip effect in a forest supply chain. In Proceedings of the second international joint conference on autonomous agents and multiagent systems, pages 670–677.
Nissen (2001) Mark E Nissen. 2001. Agent-based supply chain integration. Information Technology and Management, 2:289–312.
Oroojlooyjadid et al. (2022) Afshin Oroojlooyjadid, MohammadReza Nazari, Lawrence V Snyder, and Martin Takáč. 2022. A deep Q-network for the beer game: Deep reinforcement learning for inventory optimization. Manufacturing & Service Operations Management, 24(1):285–304.
Quan and Liu (2024) Yinzhu Quan and Zefang Liu. 2024. EconLogicQA: A question-answering benchmark for evaluating large language models in economic sequential reasoning. arXiv preprint arXiv:2405.07938.
Quan et al. (2023) Yinzhu Quan, Ashwin Pothen, and Benoit Montreuil. 2023. Predictive demand disruption signals for supply chain networks. In IISE Annual Conference and Expo. IISE.
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Singla et al. (2023) Tanmay Singla, Dharun Anandayuvaraj, Kelechi G Kalu, Taylor R Schorlemmer, and James C Davis. 2023. An empirical study on using large language models to analyze software supply chain security failures. In Proceedings of the 2023 Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses, pages 5–15.
Towers et al. (2023) Mark Towers, Jordan K. Terry, Ariel Kwiatkowski, John U. Balis, Gianluca de Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Arjun KG, Markus Krimmel, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Andrew Tan Jin Shen, and Omar G. Younis. 2023. Gymnasium.
Weiss et al. (2023) Martin Weiss, Nasim Rahaman, Manuel Wuthrich, Yoshua Bengio, Li Erran Li, Bernhard Schölkopf, and Christopher Pal. 2023. Rethinking the buyer’s inspection paradox in information markets with language agents.
Wu et al. (2023) Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. 2023. AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155.
Yasmin (2024) Ghazala Yasmin. 2024. Supply chain management: Ensuring seamless operations. Journal of Management Science Research Review, 2(1):55–66.
Yu et al. (2022) Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. 2022. The surprising effectiveness of PPO in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35:24611–24624.
Zhao et al. (2023) Qinlin Zhao, Jindong Wang, Yixuan Zhang, Yiqiao Jin, Kaijie Zhu, Hao Chen, and Xing Xie. 2023. CompeteAI: Understanding the competition behaviors in large language model-based agents. arXiv preprint arXiv:2310.17512.

Appendix A Prompt and Response Example

An example of the InvAgent prompt and response for the constant demand scenario with GPT-4 is presented in Figure 9. The prompt includes a detailed state description, a demand description specifying the expected demand at the retailer, and a strategy description advising aligning open orders with expected downstream orders and backlog after considering lead times and the bullwhip effect. In the response, the Retailer Agent, considering the current inventory suffices for up to 3 rounds of maximal demand and the 2-round lead time, decides not to place an order this round, aiming to prevent excessive inventory, as articulated in the agent’s reasoned response.

Figure 9: Example of the prompt and response of InvAgent for the constant demand scenario with GPT-4.

Appendix B Evaluation Results of Heuristic Baseline Variants

The evaluation results of the base-stock policy and the tracking demand policy variants are displayed in Table 5. These evaluations provide insights into how different inventory policies perform under diverse demand scenarios. The first section of the table shows the performance of the base-stock policy with different desired inventory levels based on production capacities $c_{m}$. The results indicate that maintaining a lower desired inventory level generally results in better performance across varying demand conditions. The second section of the table includes five variants of the tracking demand policy, denoted by different formulas involving the sales $S_{m,t-1}$, lead time $L_{m}$, and backlog $B_{m,t-1}$. While no single variant consistently outperforms the others, averaging sales typically helps manage variable demands across most scenarios.

| Desired Inventory | Constant | Variable | Larger | Seasonal | Normal |
|---|---|---|---|---|---|
| $0.8c_{m}$ | -208.00 (0.00) | -435.69 (49.15) | -234.28 (102.81) | -207.75 (34.67) | -150.67 (101.80) |
| $0.9c_{m}$ | -252.00 (0.00) | -479.69 (49.15) | -310.74 (109.16) | -229.08 (34.66) | -226.31 (103.32) |
| $c_{m}$ | -296.00 (0.00) | -523.69 (49.15) | -392.21 (111.79) | -274.29 (40.75) | -322.44 (99.59) |
| $S_{m,t-1}L_{m}+B_{m,t-1}$ | -364.00 (0.00) | -390.17 (44.24) | -393.31 (79.05) | -525.84 (47.85) | -283.39 (61.83) |
| $S_{m,t-1}(L_{m}+1)+B_{m,t-1}$ | -120.00 (0.00) | -395.68 (41.86) | -470.55 (76.73) | -524.26 (64.68) | -351.23 (90.08) |
| $\bar{S}_{m,t-1}L_{m}+B_{m,t-1}$ | -360.00 (0.00) | -412.41 (41.76) | -265.07 (99.67) | -421.90 (55.18) | -232.20 (75.45) |
| $\bar{S}_{m,t-1}(L_{m}+1)+B_{m,t-1}$ | -252.00 (0.00) | -382.77 (48.50) | -489.75 (110.96) | -610.03 (94.43) | -177.54 (70.87) |
| $1.2\bar{S}_{m,t-1}L_{m}+B_{m,t-1}$ | -361.00 (0.00) | -397.22 (50.02) | -325.81 (98.39) | -479.07 (69.47) | -218.98 (73.26) |

Table 5: Evaluation results (averaged episode rewards and their standard deviations) for different heuristic model variants under various demand scenarios.
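To make the two families of heuristics concrete, the sketch below shows how each row of Table 5 could be turned into a desired inventory level and then into an order. The target formulas follow the table. The ordering rule in `order_quantity` (close the gap between the target and on-hand plus open orders, clipped to the capacity) is our assumption, since this appendix gives only the target formulas. All function and parameter names are hypothetical.

```python
from statistics import mean


def base_stock_target(capacity: float, factor: float = 1.0) -> float:
    """Desired inventory as a fraction of the production capacity c_m."""
    return factor * capacity


def tracking_demand_target(
    sales_history: list[float],
    lead_time: int,
    backlog: float,
    use_average: bool = False,
    extra_period: bool = False,
    scale: float = 1.0,
) -> float:
    """Desired inventory for the tracking demand variants in Table 5.

    use_average  -> use mean past sales (S-bar) instead of the last sales S_{m,t-1}
    extra_period -> cover (L_m + 1) periods instead of L_m
    scale        -> multiplier on the sales term (1.2 in the last variant)
    """
    sales = mean(sales_history) if use_average else sales_history[-1]
    horizon = lead_time + 1 if extra_period else lead_time
    return scale * sales * horizon + backlog


def order_quantity(target: float, on_hand: float, open_orders: float,
                   capacity: float) -> int:
    """Assumed ordering rule: close the gap to the target, within [0, capacity]."""
    gap = target - (on_hand + open_orders)
    return int(max(0, min(capacity, round(gap))))


# Example: averaged sales, lead time of 2 periods, backlog of 1 unit.
target = tracking_demand_target([2, 4, 3], lead_time=2, backlog=1, use_average=True)
print(target, order_quantity(target, on_hand=3, open_orders=2, capacity=20))
```

Writing the variants as flags on a single function makes it easy to see that they differ only in the sales estimate, the coverage horizon, and a safety multiplier.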

Appendix C Reinforcement Learning Baseline Settings

The hyperparameter settings used for the independent proximal policy optimization (IPPO) with parameter sharing and multi-agent proximal policy optimization (MAPPO) baselines are provided in Table 6 and Table 7, respectively.

| Hyperparameter | Constant | Variable | Larger | Seasonal | Normal |
|---|---|---|---|---|---|
| Number of Hidden Units | [128, 128] | [256, 256] | [128, 128] | [128, 128] | [128, 128] |
| Activation Function | ReLU | ReLU | ReLU | ReLU | ReLU |
| Learning Rate | 0.0001 | 0.0001 | 0.001 | 0.0005 | 0.0005 |
| Training Batch Size | 1000 | 1000 | 2000 | 2000 | 1000 |
| SGD Minibatch Size | 128 | 128 | 128 | 128 | 128 |
| Number of SGD Iterations | 5 | 10 | 5 | 5 | 5 |
| Number of Training Iterations | 1000 | 1500 | 1000 | 800 | 500 |

Table 6: Hyperparameters for the independent proximal policy optimization (IPPO) with parameter sharing baseline.

| Hyperparameter | Constant | Variable | Larger | Seasonal | Normal |
|---|---|---|---|---|---|
| Number of Hidden Units | [128, 128] | [128, 128] | [128, 128] | [256, 256] | [128, 128] |
| Activation Function | ReLU | ReLU | ReLU | ReLU | ReLU |
| Learning Rate | 0.0001 | 0.0001 | 0.001 | 0.0001 | 0.0001 |
| Training Batch Size | 500 | 2000 | 2000 | 1000 | 500 |
| SGD Minibatch Size | 128 | 32 | 32 | 128 | 128 |
| Number of SGD Iterations | 10 | 5 | 10 | 10 | 10 |
| Number of Training Iterations | 500 | 500 | 800 | 1500 | 500 |

Table 7: Hyperparameters for the multi-agent proximal policy optimization (MAPPO) baseline.
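For readers who wish to reproduce a baseline, one column of these tables can be written as a plain configuration dictionary. The sketch below encodes the Variable column of Table 6. The key names are an assumption modeled on the dictionary-style PPO settings of RLlib (Liang et al., 2018); they are not taken from this paper and may differ between library versions.

```python
# Table 6, "Variable" column (IPPO with parameter sharing).
# Key names are an assumption modeled on RLlib's dictionary-style PPO
# settings; they are not taken from the paper and may differ by version.
ippo_variable_config = {
    "model": {
        "fcnet_hiddens": [256, 256],   # Number of Hidden Units
        "fcnet_activation": "relu",    # Activation Function
    },
    "lr": 1e-4,                        # Learning Rate
    "train_batch_size": 1000,          # Training Batch Size
    "sgd_minibatch_size": 128,         # SGD Minibatch Size
    "num_sgd_iter": 10,                # Number of SGD Iterations
}
NUM_TRAINING_ITERATIONS = 1500         # Number of Training Iterations

# Rough count of environment steps sampled over the whole run, assuming
# one training batch is collected per training iteration.
total_sampled_steps = (
    ippo_variable_config["train_batch_size"] * NUM_TRAINING_ITERATIONS
)
print(total_sampled_steps)
```

Keeping the number of training iterations outside the dictionary reflects that it controls the outer training loop rather than a single update step.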

Appendix D Case Studies

In this section, we present two case studies of InvAgent under different demand scenarios. The first examines a variable demand scenario without the strategy component in the prompt, while the second looks at a seasonal demand scenario with the strategy in place. The supply chain in these scenarios comprises four stages, listed from downstream to upstream: the retailer, wholesaler, distributor, and manufacturer.

D.1 Variable Demand Scenario

In the variable demand scenario, Figure 10 shows how LLMs take actions (orders) in response to changes in the demand, inventory, backlog, and profit. At the start of the simulation (episode), when demand first appears, the retailer begins to respond to the change. Initially, the retailer’s inventory decreases because retailers are the first to supply customers, followed by the wholesaler, distributor, and manufacturer. Due to the lead time from the upstream suppliers, the retailer’s inventory cannot stabilize immediately, even after placing orders with the upstream wholesaler. After several ordering cycles, the retailer’s inventory eventually reaches a relatively steady state.

Another interesting phenomenon occurs in the middle of the simulation, when the distributor’s backlog reaches its peak. The distributor’s inventory starts decreasing several periods earlier and runs out completely in period 6. Because the distributor fails to restock in time as its inventory dwindles, a large backlog builds up in period 7. To prevent this, the distributor should place orders at least one lead time before its inventory is expected to run out.
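The timing argument can be illustrated with a short calculation. The sketch below assumes a constant expected demand per period, an order placed in period $t$ arriving in period $t$ plus the lead time, and no replenishment already in transit. These are simplifications for illustration, and the function names are hypothetical.

```python
import math


def periods_until_stockout(inventory: float, demand_per_period: float) -> int:
    """Number of full periods that the on-hand inventory can cover."""
    if demand_per_period <= 0:
        raise ValueError("demand_per_period must be positive")
    return math.floor(inventory / demand_per_period)


def latest_order_period(current_period: int, inventory: float,
                        demand_per_period: float, lead_time: int) -> int:
    """Latest period in which an order still arrives by the stockout period.

    Assumes an order placed in period t arrives in period t + lead_time and
    that no other replenishment is already in transit.
    """
    stockout_period = current_period + periods_until_stockout(
        inventory, demand_per_period
    )
    # The order cannot be placed in the past, hence the lower bound.
    return max(current_period, stockout_period - lead_time)


# Example: in period 3 with 6 units on hand, 2 units of demand per period,
# and a lead time of 2 periods.
print(latest_order_period(current_period=3, inventory=6,
                          demand_per_period=2, lead_time=2))
```

In this example the stock covers three periods, so it would run out in period 6 and the order has to be placed by period 4.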

D.2 Seasonal Demand Scenario

In the seasonal demand scenario, Figure 11 shows how the LLM agents take actions (orders) in response to changes in demand, inventory, backlog, and profit. In this scenario, all agents are informed of the demand distribution in each period. Specifically, the demand follows a uniform distribution, $\mathcal{U}(0,4)$, from periods 1 to 4, and a different uniform distribution, $\mathcal{U}(5,8)$, from periods 5 to 12. During periods 3 and 4, all four agents, particularly the distributor and manufacturer, attempt to order large quantities of products from their upstream suppliers. The manufacturer’s inventory is exhausted by the high volume of downstream orders. This leads to a spike in backlog during period 4, causing the manufacturer’s profit to reach its minimum value. In the subsequent periods, the manufacturer continues to order raw materials, which helps mitigate the backlog and demonstrates the flexibility and resilience of our model.
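The seasonal demand pattern described above can be simulated with a few lines of code. The sketch below assumes that demand is an integer number of units drawn with inclusive bounds, which the text does not state explicitly. The function names and the seed are hypothetical.

```python
import random


def seasonal_demand(period: int, rng: random.Random) -> int:
    """Sample customer demand for a 1-indexed period of the seasonal scenario.

    Assumption: demand is an integer drawn uniformly with inclusive bounds,
    U(0, 4) for periods 1 to 4 and U(5, 8) for periods 5 onward.
    """
    low, high = (0, 4) if period <= 4 else (5, 8)
    return rng.randint(low, high)


def sample_episode(num_periods: int = 12, seed: int = 0) -> list[int]:
    """Draw one demand trajectory for an episode of `num_periods` periods."""
    rng = random.Random(seed)
    return [seasonal_demand(t, rng) for t in range(1, num_periods + 1)]


demands = sample_episode()
print(demands)
```

The jump in the lower bound from 0 to 5 at period 5 is what the upstream agents anticipate when they increase their orders in periods 3 and 4.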