Multi-Agent Learning of Efficient Fulfilment and Routing Strategies in E-Commerce

Authors: Omkar Shelke, Pranavi Pathakota, Anandsingh Chauhan, Harshad Khadilkar, Hardik Meisheri, Balaraman Ravindran

Abstract: This paper presents an integrated algorithmic framework for minimising product delivery costs in e-commerce (known as the cost-to-serve or C2S). One of the major challenges in e-commerce is the large volume of spatio-temporally diverse orders from multiple customers, each of which has to be fulfilled from one of several warehouses using a fleet of vehicles. This results in two levels of decision-making: (i) selection of a fulfillment node for each order (including the option of deferral to a future time), and then (ii) routing of vehicles (each of which can carry multiple orders originating from the same warehouse). We propose an approach that combines graph neural networks and reinforcement learning to train the node selection and vehicle routing agents. We include real-world constraints such as warehouse inventory capacity, vehicle characteristics such as travel times, service times, carrying capacity, and customer constraints including time windows for delivery. The complexity of this problem arises from the fact that outcomes (rewards) are driven both by the fulfillment node mapping as well as the routing algorithms, and are spatio-temporally distributed. Our experiments show that this algorithmic pipeline outperforms pure heuristic policies.

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2306.15913

DCT: Dual Channel Training of Action Embeddings for Reinforcement Learning with Large Discrete Action Spaces

Authors: Pranavi Pathakota, Hardik Meisheri, Harshad Khadilkar

Abstract: The ability to learn robust policies while generalizing over large discrete action spaces is an open challenge for intelligent systems, especially in noisy environments that face the curse of dimensionality. In this paper, we present a novel framework to efficiently learn action embeddings that simultaneously allow us to reconstruct the original action as well as to predict the expected future state. We describe an encoder-decoder architecture for action embeddings with a dual channel loss that balances between action reconstruction and state prediction accuracy. We use the trained decoder in conjunction with a standard reinforcement learning algorithm that produces actions in the embedding space. Our architecture is able to outperform two competitive baselines in two diverse environments: a 2D maze environment with more than 4000 discrete noisy actions, and a product recommendation task that uses real-world e-commerce transaction data. Empirical results show that the model results in cleaner action embeddings, and the improved representations help learn better policies with earlier convergence.

Submitted 28 June, 2023; originally announced June 2023.

Comments: 17 pages

arXiv:2210.17296

Using Contrastive Samples for Identifying and Leveraging Possible Causal Relationships in Reinforcement Learning

Authors: Harshad Khadilkar, Hardik Meisheri

Abstract: A significant challenge in reinforcement learning is quantifying the complex relationship between actions and long-term rewards. The effects may manifest themselves over a long sequence of state-action pairs, making them hard to pinpoint. In this paper, we propose a method to link transitions with significant deviations in state with unusually large variations in subsequent rewards. Such transitions are marked as possible causal effects, and the corresponding state-action pairs are added to a separate replay buffer. In addition, we include contrastive samples corresponding to transitions from a similar state but with differing actions. Including this Contrastive Experience Replay (CER) during training is shown to outperform standard value-based methods on 2D navigation tasks. We believe that CER can be useful for a broad class of learning tasks, including for any off-policy reinforcement learning algorithm.

Submitted 28 October, 2022; originally announced October 2022.

arXiv:2203.00885

A Learning Based Framework for Handling Uncertain Lead Times in Multi-Product Inventory Management

Authors: Hardik Meisheri, Somjit Nath, Mayank Baranwal, Harshad Khadilkar

Abstract: Most existing literature on supply chain and inventory management considers stochastic demand processes with zero or constant lead times. While it is true that in certain niche scenarios uncertainty in lead times can be ignored, most real-world scenarios exhibit stochasticity in lead times. These random fluctuations can be caused by uncertainty in the arrival of raw materials at the manufacturer's end, delays in transportation, an unforeseen surge in demand, and switching to a different vendor, to name a few. Stochasticity in lead times is known to severely degrade the performance of an inventory management system, and it is only fair to bridge this gap in supply chain systems through a principled approach. Motivated by the recently introduced delay-resolved deep Q-learning (DRDQN) algorithm, this paper develops a reinforcement learning based paradigm for handling uncertainty in lead times (action delay). Through empirical evaluations, it is further shown that inventory management with uncertain lead times is not only equivalent to that of delay in information sharing across multiple echelons (observation delay), but that a model trained to handle one kind of delay is also capable of handling delays of another kind without being retrained. Finally, we apply the delay-resolved framework to scenarios comprising multiple products subjected to stochasticity in lead times, and elucidate how the delay-resolved framework negates the effect of any delay to achieve near-optimal performance.

Submitted 8 March, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

arXiv:2203.00874

Follow your Nose: Using General Value Functions for Directed Exploration in Reinforcement Learning

Authors: Durgesh Kalwar, Omkar Shelke, Somjit Nath, Hardik Meisheri, Harshad Khadilkar

Abstract: Improving sample efficiency is a key challenge in reinforcement learning, especially in environments with large state spaces and sparse rewards. In the literature, this is addressed either through the use of auxiliary tasks (subgoals) or through clever exploration strategies. Exploration methods have been used to sample better trajectories in large environments, while auxiliary tasks have been incorporated where the reward is sparse. However, few studies have attempted to tackle both large scale and reward sparsity at the same time. This paper explores the idea of combining exploration with auxiliary task learning using General Value Functions (GVFs) and a directed exploration strategy. We present a way to learn value functions which can be used to sample actions and provide directed exploration. Experiments on navigation tasks with varying grid sizes demonstrate the performance advantages over several competitive baselines.

Submitted 27 February, 2023; v1 submitted 2 March, 2022; originally announced March 2022.

arXiv:2112.08736

Learning to Minimize Cost-to-Serve for Multi-Node Multi-Product Order Fulfilment in Electronic Commerce

Authors: Pranavi Pathakota, Kunwar Zaid, Anulekha Dhara, Hardik Meisheri, Shaun D Souza, Dheeraj Shah, Harshad Khadilkar

Abstract: We describe a novel decision-making problem developed in response to the demands of retail electronic commerce (e-commerce). While working with logistics and retail industry business collaborators, we found that the cost of delivery of products from the most opportune node in the supply chain (a quantity called the cost-to-serve or CTS) is a key challenge. The large scale, high stochasticity, and large geographical spread of e-commerce supply chains make this setting ideal for a carefully designed data-driven decision-making algorithm. In this preliminary work, we focus on the specific subproblem of delivering multiple products in arbitrary quantities from any warehouse to multiple customers in each time period. We compare the relative performance and computational efficiency of several baselines, including heuristics and mixed-integer linear programming. We show that a reinforcement learning based algorithm is competitive with these policies, with the potential of efficient scale-up in the real world.

Submitted 16 December, 2021; originally announced December 2021.

arXiv:2102.11762

doi 10.1145/3493700.3493709

School of hard knocks: Curriculum analysis for Pommerman with a fixed computational budget

Authors: Omkar Shelke, Hardik Meisheri, Harshad Khadilkar

Abstract: Pommerman is a hybrid cooperative/adversarial multi-agent environment, with challenging characteristics in terms of partial observability, limited or no communication, sparse and delayed rewards, and restrictive computational time limits. This makes it a challenging environment for reinforcement learning (RL) approaches. In this paper, we focus on developing a curriculum for learning a robust and promising policy in a constrained computational budget of 100,000 games, starting from a fixed base policy (which is itself trained to imitate a noisy expert policy). All RL algorithms starting from the base policy use vanilla proximal-policy optimization (PPO) with the same reward function, and the only difference between their training is the mix and sequence of opponent policies. One expects that beginning training with simpler opponents and then gradually increasing the opponent difficulty will facilitate faster learning, leading to more robust policies compared against a baseline where all available opponent policies are introduced from the start. We test this hypothesis and show that within constrained computational budgets, it is in fact better to "learn in the school of hard knocks", i.e., against all available opponent policies nearly from the start. We also include ablation studies where we study the effect of modifying the base environment properties of ammo and bomb blast strength on the agent performance.

Submitted 24 February, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

Comments: 8 pages, Submitted to ALA workshop 2021

Journal ref: CODS-COMAD 2022: 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD)

arXiv:2011.00424

Sample Efficient Training in Multi-Agent Adversarial Games with Limited Teammate Communication

Authors: Hardik Meisheri, Harshad Khadilkar

Abstract: We describe our solution approach for Pommerman TeamRadio, a competition environment associated with NeurIPS 2019. The defining feature of our algorithm is achieving sample efficiency within a restrictive computational budget while beating the previous year's learning agents. The proposed algorithm (i) uses imitation learning to seed the policy, (ii) explicitly defines the communication protocol between the two teammates, (iii) shapes the reward to provide a richer feedback signal to each agent during training, and (iv) uses masking for catastrophically bad actions. We describe extensive tests against baselines, including those from the 2019 competition leaderboard, and also a specific investigation of the learned policy and the effect of each modification on performance. We show that the proposed approach is able to achieve competitive performance within half a million games of training, significantly faster than other studies in the literature.

Submitted 1 November, 2020; originally announced November 2020.

arXiv:2006.04037

Reinforcement Learning for Multi-Product Multi-Node Inventory Management in Supply Chains

Authors: Nazneen N Sultana, Hardik Meisheri, Vinita Baniwal, Somjit Nath, Balaraman Ravindran, Harshad Khadilkar

Abstract: This paper describes the application of reinforcement learning (RL) to multi-product inventory management in supply chains. The problem description and solution are both adapted from a real-world business solution. The novelty of this problem with respect to supply chain literature is (i) we consider concurrent inventory management of a large number (50 to 1000) of products with shared capacity, (ii) we consider a multi-node supply chain consisting of a warehouse which supplies three stores, (iii) the warehouse, stores, and transportation from warehouse to stores have finite capacities, (iv) warehouse and store replenishment happen at different time scales and with realistic time lags, and (v) demand for products at the stores is stochastic. We describe a novel formulation in a multi-agent (hierarchical) reinforcement learning framework that can be used for parallelised decision-making, and use the advantage actor critic (A2C) algorithm with quantised action spaces to solve the problem. Experiments show that the proposed approach is able to handle a multi-objective reward comprised of maximising product sales and minimising wastage of perishable products.

Submitted 7 June, 2020; originally announced June 2020.

arXiv:1911.04947

Accelerating Training in Pommerman with Imitation and Reinforcement Learning

Authors: Hardik Meisheri, Omkar Shelke, Richa Verma, Harshad Khadilkar

Abstract: The Pommerman simulation was recently developed to mimic the classic Japanese game Bomberman, and focuses on competitive gameplay in a multi-agent setting. We focus on the 2×2 team version of Pommerman, developed for a competition at NeurIPS 2018. Our methodology involves training an agent initially through imitation learning on a noisy expert policy, followed by a proximal-policy optimization (PPO) reinforcement learning algorithm. The basic PPO approach is modified for stable transition from the imitation learning phase through reward shaping, action filters based on heuristics, and curriculum learning. The proposed methodology is able to beat heuristic and pure reinforcement learning baselines with a combined 100,000 training games, significantly faster than other non-tree-search methods in literature. We present results against multiple agents provided by the developers of the simulation, including some that we have enhanced. We include a sensitivity analysis over different parameters, and highlight undesirable effects of some strategies that initially appear promising. Since Pommerman is a complex multi-agent competitive environment, the strategies developed here provide insights into several real-world problems with characteristics such as partial observability, decentralized execution (without communication), and very sparse and delayed rewards.

Submitted 13 November, 2019; v1 submitted 12 November, 2019; originally announced November 2019.

Comments: Presented at Deep Reinforcement Learning workshop, NeurIPS-2019

arXiv:1911.02771

doi 10.1007/978-3-030-33698-1_8

Characterizing behavioral trends in a community driven discussion platform

Authors: Sachin Thukral, Arnab Chatterjee, Hardik Meisheri, Tushar Kataria, Aman Agarwal, Ishan Verma, Lipika Dey

Abstract: This article presents a systematic analysis of the patterns of behavior of individuals as well as groups observed in community-driven platforms for discussion like Reddit, where users usually exchange information and viewpoints on their topics of interest. We perform a statistical analysis of the behavior of posts and model the users' interactions around them. A platform like Reddit, which has grown exponentially from a very small community to one of the largest social networks, harbors a variety of user behavior in terms of activity owing to its large user base and popularity. Our work provides interesting insights about a huge number of inactive posts which fail to attract attention despite their authors exhibiting Cyborg-like behavior to attract attention. We also observe that short-lived yet extremely active posts emulate a phenomenon like Mayfly Buzz. A method is presented to study the activity around highly active posts and to determine the presence of Limelight hogging activity. We also present a systematic analysis to study the presence of controversies in posts. We analyzed data from two periods of one-year duration, separated by a few years in time, to understand how social media has evolved through the years.

Submitted 7 November, 2019; originally announced November 2019.

Comments: 19 pages. Extended version of arxiv:1809.07087. Springer Lecture Notes Format, to be published in Lecture Notes in Social Networks (Springer)

arXiv:1910.00211

Reinforcement Learning for Multi-Objective Optimization of Online Decisions in High-Dimensional Systems

Authors: Hardik Meisheri, Vinita Baniwal, Nazneen N Sultana, Balaraman Ravindran, Harshad Khadilkar

Abstract: This paper describes a purely data-driven solution to a class of sequential decision-making problems with a large number of concurrent online decisions, with applications to computing systems and operations research. We assume that while the micro-level behaviour of the system can be broadly captured by analytical expressions or simulation, the macro-level or emergent behaviour is complicated by non-linearity, constraints, and stochasticity. If we represent the set of concurrent decisions to be computed as a vector, each element of the vector is assumed to be a continuous variable, and the number of such elements is arbitrarily large and variable from one problem instance to another. We first formulate the decision-making problem as a canonical reinforcement learning (RL) problem, which can be solved using purely data-driven techniques. We modify a standard approach known as advantage actor critic (A2C) to ensure its suitability to the problem at hand, and compare its performance to that of baseline approaches on the specific instance of a multi-product inventory management task. The key modifications include a parallelised formulation of the decision-making task, and a training procedure that explicitly recognises the quantitative relationship between different decisions. We also present experimental results probing the learned policies, and their robustness to variations in the data.

Submitted 1 October, 2019; originally announced October 2019.

Comments: 22 pages, 10 figures

arXiv:1809.07087

Analyzing behavioral trends in community driven discussion platforms like Reddit

Authors: Sachin Thukral, Hardik Meisheri, Tushar Kataria, Aman Agarwal, Ishan Verma, Arnab Chatterjee, Lipika Dey

Abstract: The aim of this paper is to present methods to systematically analyze individual and group behavioral patterns observed in community driven discussion platforms like Reddit, where users exchange information and views on various topics of current interest. We conduct this study by analyzing the statistical behavior of posts and modeling user interactions around them. We have chosen Reddit as an example, since it has grown exponentially from a small community to one of the biggest social network platforms in recent times. Due to its large user base and popularity, a variety of behavior is present among users in terms of their activity. Our study provides interesting insights about a large number of inactive posts which fail to gather attention despite their authors exhibiting Cyborg-like behavior to draw attention. We also present interesting insights about short-lived but extremely active posts emulating a phenomenon like Mayfly Buzz. Further, we present methods to find the nature of activity around highly active posts to determine the presence of Limelight hogging activity, if any. We analyzed over 2 million posts and more than 7 million user responses to them during the whole of 2008, and over 63 million posts and over 608 million user responses to them from August 2014 to July 2015, amounting to two one-year periods, in order to understand how the social media space has evolved over the years.

Submitted 19 September, 2018; originally announced September 2018.

Comments: 8 pages, 9 figs, ASONAM 2018

arXiv:1802.09046

Multiclass Common Spatial Pattern for EEG based Brain Computer Interface with Adaptive Learning Classifier

Authors: Hardik Meisheri, Nagraj Ramrao, Suman Mitra

Abstract: In Brain Computer Interface (BCI), data generated from Electroencephalogram (EEG) is non-stationary, has a low signal-to-noise ratio, and is contaminated with artifacts. The Common Spatial Pattern (CSP) algorithm has been proved to be effective in BCI for extracting features in motor imagery tasks, but it is prone to overfitting. Many algorithms have been devised to regularize CSP for the two-class problem; however, they have not been effective when applied to multiclass CSP. Outliers present in the data affect the extracted CSP features and reduce the performance of the system. In addition, the non-stationarity present in the features extracted from the CSP presents a challenge in classification. We propose a method to identify and remove artifacts present in the data during the pre-processing stage; this helps in calculating eigenvectors, which in turn generates better CSP features. To handle the non-stationarity, the Self-Regulated Interval Type-2 Neuro-Fuzzy Inference System (SRIT2NFIS) was proposed in the literature for the two-class EEG classification problem. This paper extends the SRIT2NFIS to multiclass using Joint Approximate Diagonalization (JAD). The results on a standard data set from BCI competition IV show a significant increase in accuracy over the current state-of-the-art methods for multiclass classification.

Submitted 6 March, 2021; v1 submitted 25 February, 2018; originally announced February 2018.

arXiv:1710.02745

Multi-Document Summarization using Distributed Bag-of-Words Model

Authors: Kaustubh Mani, Ishan Verma, Hardik Meisheri, Lipika Dey

Abstract: As the number of documents on the web is growing exponentially, multi-document summarization is becoming more and more important since it can provide the main ideas in a document set in a short time. In this paper, we present an unsupervised centroid-based document-level reconstruction framework using the distributed bag of words model. Specifically, our approach selects summary sentences in order to minimize the reconstruction error between the summary and the documents. We apply sentence selection and beam search to further improve the performance of our model. Experimental results on two different datasets show significant performance gains compared with the state-of-the-art baselines.

Submitted 11 June, 2018; v1 submitted 7 October, 2017; originally announced October 2017.
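
Illustrative sketch (not from the paper): the abstract above describes selecting summary sentences so as to minimize a reconstruction error between the summary and the documents. A minimal way to picture this is greedy selection against a centroid, shown below. The function names, the squared Euclidean error, and the greedy strategy are assumptions made for illustration only; the paper itself uses distributed bag-of-words embeddings and beam search, neither of which is reproduced here.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance, used here as a stand-in reconstruction error."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_select(sentence_vecs, k):
    """Greedily pick up to k sentence indices whose centroid stays closest
    to the centroid of all sentences (hypothetical helper, not the paper's method)."""
    target = centroid(sentence_vecs)
    chosen = []
    for _ in range(min(k, len(sentence_vecs))):
        best_i, best_err = None, math.inf
        for i in range(len(sentence_vecs)):
            if i in chosen:
                continue
            # Error of the candidate summary formed by adding sentence i.
            err = sq_dist(centroid([sentence_vecs[j] for j in chosen + [i]]), target)
            if err < best_err:
                best_i, best_err = i, err
        chosen.append(best_i)
    return chosen


if __name__ == "__main__":
    # Toy 2-D "sentence embeddings"; real embeddings would come from a trained model.
    vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(greedy_select(vecs, 1))
```

A beam search would keep several partial summaries per step instead of only the single best one, trading extra computation for a lower chance of being trapped by an early greedy choice.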

Showing 1–15 of 15 results for author: Meisheri, H