Search | arXiv e-print repository

Fully Dynamic Online Selection through Online Contention Resolution Schemes

Authors: Vashist Avadhanula, Andrea Celli, Riccardo Colini-Baldeschi, Stefano Leonardi, Matteo Russo

Abstract: We study fully dynamic online selection problems in an adversarial/stochastic setting that includes Bayesian online selection, prophet inequalities, posted price mechanisms, and stochastic probing problems subject to combinatorial constraints. In the classical ``incremental'' version of the problem, selected elements remain active until the end of the input sequence. On the other hand, in the full… ▽ More We study fully dynamic online selection problems in an adversarial/stochastic setting that includes Bayesian online selection, prophet inequalities, posted price mechanisms, and stochastic probing problems subject to combinatorial constraints. In the classical ``incremental'' version of the problem, selected elements remain active until the end of the input sequence. On the other hand, in the fully dynamic version of the problem, elements stay active for a limited time interval, and then leave. This models, for example, the online matching of tasks to workers with task/worker-dependent working times, and sequential posted pricing of perishable goods. A successful approach to online selection problems in the adversarial setting is given by the notion of Online Contention Resolution Scheme (OCRS), that uses a priori information to formulate a linear relaxation of the underlying optimization problem, whose optimal fractional solution is rounded online for any adversarial order of the input sequence. Our main contribution is providing a general method for constructing an OCRS for fully dynamic online selection problems. Then, we show how to employ such OCRS to construct no-regret algorithms in a partial information model with semi-bandit feedback and adversarial inputs. △ Less

Submitted 8 January, 2023; originally announced January 2023.

arXiv:2211.06516 [pdf, other]

Bandits for Online Calibration: An Application to Content Moderation on Social Media Platforms

Authors: Vashist Avadhanula, Omar Abdul Baki, Hamsa Bastani, Osbert Bastani, Caner Gocmen, Daniel Haimovich, Darren Hwang, Dima Karamshuk, Thomas Leeper, Jiayuan Ma, Gregory Macnamara, Jake Mullett, Christopher Palow, Sung Park, Varun S Rajagopal, Kevin Schaeffer, Parikshit Shah, Deeksha Sinha, Nicolas Stier-Moses, Peng Xu

Abstract: We describe the current content moderation strategy employed by Meta to remove policy-violating content from its platforms. Meta relies on both handcrafted and learned risk models to flag potentially violating content for human review. Our approach aggregates these risk models into a single ranking score, calibrating them to prioritize more reliable risk models. A key challenge is that violation t… ▽ More We describe the current content moderation strategy employed by Meta to remove policy-violating content from its platforms. Meta relies on both handcrafted and learned risk models to flag potentially violating content for human review. Our approach aggregates these risk models into a single ranking score, calibrating them to prioritize more reliable risk models. A key challenge is that violation trends change over time, affecting which risk models are most reliable. Our system additionally handles production challenges such as changing risk models and novel risk models. We use a contextual bandit to update the calibration in response to such trends. Our approach increases Meta's top-line metric for measuring the effectiveness of its content moderation strategy by 13%. △ Less

Submitted 11 November, 2022; originally announced November 2022.

arXiv:2202.05194 [pdf, other]

Robust and fair work allocation

Authors: Amine Allouah, Christian Kroer, Xuan Zhang, Vashist Avadhanula, Anil Dania, Caner Gocmen, Sergey Pupyrev, Parikshit Shah, Nicolas Stier

Abstract: In today's digital world, interaction with online platforms is ubiquitous, and thus content moderation is important for protecting users from content that do not comply with pre-established community guidelines. Having a robust content moderation system throughout every stage of planning is particularly important. We study the short-term planning problem of allocating human content reviewers to di… ▽ More In today's digital world, interaction with online platforms is ubiquitous, and thus content moderation is important for protecting users from content that do not comply with pre-established community guidelines. Having a robust content moderation system throughout every stage of planning is particularly important. We study the short-term planning problem of allocating human content reviewers to different harmful content categories. We use tools from fair division and study the application of competitive equilibrium and leximin allocation rules. Furthermore, we incorporate, to the traditional Fisher market setup, novel aspects that are of practical importance. The first aspect is the forecasted workload of different content categories. We show how a formulation that is inspired by the celebrated Eisenberg-Gale program allows us to find an allocation that not only satisfies the forecasted workload, but also fairly allocates the remaining reviewing hours among all content categories. The resulting allocation is also robust as the additional allocation provides a guardrail in cases where the actual workload deviates from the predicted workload. The second practical consideration is time dependent allocation that is motivated by the fact that partners need scheduling guidance for the reviewers across days to achieve efficiency. To address the time component, we introduce new extensions of the various fair allocation approaches for the single-time period setting, and we show that many properties extend in essence, albeit with some modifications. Related to the time component, we additionally investigate how to satisfy markets' desire for smooth allocation (e.g., partners for content reviewers prefer an allocation that does not vary much from time to time, to minimize staffing switch). We demonstrate the performance of our proposed approaches through real-world data obtained from Meta. △ Less

Submitted 14 February, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

arXiv:2112.06517 [pdf, other]

Top $K$ Ranking for Multi-Armed Bandit with Noisy Evaluations

Authors: Evrard Garcelon, Vashist Avadhanula, Alessandro Lazaric, Matteo Pirotta

Abstract: We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy independent, and possibly biased, \emph{evaluations} of the true reward of each arm and it selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we… ▽ More We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy independent, and possibly biased, \emph{evaluations} of the true reward of each arm and it selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated. First, we show a $\widetilde{O}(T^{2/3})$ regret in the general case when the observation functions are a genearalized linear function of the true rewards. On the other hand, we show that an improved $\widetilde{O}(\sqrt{T})$ regret can be derived when the observation functions are noisy linear functions of the true rewards. Finally, we report an empirical validation that confirms our theoretical findings, provides a thorough comparison to alternative approaches, and further supports the interest of this setting in practice. △ Less

Submitted 12 April, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

arXiv:2103.16816 [pdf, other]

QUEST: Queue Simulation for Content Moderation at Scale

Authors: Rahul Makhijani, Parikshit Shah, Vashist Avadhanula, Caner Gocmen, Nicolás E. Stier-Moses, Julián Mestre

Abstract: Moderating content in social media platforms is a formidable challenge due to the unprecedented scale of such systems, which typically handle billions of posts per day. Some of the largest platforms such as Facebook blend machine learning with manual review of platform content by thousands of reviewers. Operating a large-scale human review system poses interesting and challenging methodological qu… ▽ More Moderating content in social media platforms is a formidable challenge due to the unprecedented scale of such systems, which typically handle billions of posts per day. Some of the largest platforms such as Facebook blend machine learning with manual review of platform content by thousands of reviewers. Operating a large-scale human review system poses interesting and challenging methodological questions that can be addressed with operations research techniques. We investigate the problem of optimally operating such a review system at scale using ideas from queueing theory and simulation. △ Less

Submitted 31 March, 2021; originally announced March 2021.

arXiv:2103.10246 [pdf, other]

Stochastic Bandits for Multi-platform Budget Optimization in Online Advertising

Authors: Vashist Avadhanula, Riccardo Colini-Baldeschi, Stefano Leonardi, Karthik Abinav Sankararaman, Okke Schrijvers

Abstract: We study the problem of an online advertising system that wants to optimally spend an advertiser's given budget for a campaign across multiple platforms, without knowing the value for showing an ad to the users on those platforms. We model this challenging practical application as a Stochastic Bandits with Knapsacks problem over $T$ rounds of bidding with the set of arms given by the set of distin… ▽ More We study the problem of an online advertising system that wants to optimally spend an advertiser's given budget for a campaign across multiple platforms, without knowing the value for showing an ad to the users on those platforms. We model this challenging practical application as a Stochastic Bandits with Knapsacks problem over $T$ rounds of bidding with the set of arms given by the set of distinct bidding $m$-tuples, where $m$ is the number of platforms. We modify the algorithm proposed in Badanidiyuru \emph{et al.,} to extend it to the case of multiple platforms to obtain an algorithm for both the discrete and continuous bid-spaces. Namely, for discrete bid spaces we give an algorithm with regret $O\left(OPT \sqrt {\frac{mn}{B} }+ \sqrt{mn OPT}\right)$, where $OPT$ is the performance of the optimal algorithm that knows the distributions. For continuous bid spaces the regret of our algorithm is $\tilde{O}\left(m^{1/3} \cdot \min\left\{ B^{2/3}, (m T)^{2/3} \right\} \right)$. When restricted to this special-case, this bound improves over Sankararaman and Slivkins in the regime $OPT \ll T$, as is the case in the particular application at hand. Second, we show an $ Ω\left (\sqrt {m OPT} \right)$ lower bound for the discrete case and an $Ω\left( m^{1/3} B^{2/3}\right)$ lower bound for the continuous setting, almost matching the upper bounds. Finally, we use a real-world data set from a large internet online advertising company with multiple ad platforms and show that our algorithms outperform common benchmarks and satisfy the required properties warranted in the real-world application. △ Less

Submitted 25 March, 2021; v1 submitted 16 March, 2021; originally announced March 2021.

arXiv:2011.14033 [pdf, other]

A Tractable Online Learning Algorithm for the Multinomial Logit Contextual Bandit

Authors: Priyank Agrawal, Theja Tulabandhula, Vashist Avadhanula

Abstract: In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem, where a decision-maker offers a subset (assortment) of products to a consumer and observes the response in every round. Consumers purchase products to maximize their utility. We assume that a set of attributes describe the products, and the mean utility of… ▽ More In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem, where a decision-maker offers a subset (assortment) of products to a consumer and observes the response in every round. Consumers purchase products to maximize their utility. We assume that a set of attributes describe the products, and the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the decision maker problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon $T$. Though this problem has attracted considerable attention in recent times, many existing methods often involve solving an intractable non-convex optimization problem. Their theoretical performance guarantees depend on a problem-dependent parameter which could be prohibitively large. In particular, existing algorithms for this problem have regret bounded by $O(\sqrt{κd T})$, where $κ$ is a problem-dependent constant that can have an exponential dependency on the number of attributes. In this paper, we propose an optimistic algorithm and show that the regret is bounded by $O(\sqrt{dT} + κ)$, significantly improving the performance over existing methods. Further, we propose a convex relaxation of the optimization step, which allows for tractable decision-making while retaining the favourable regret guarantee. △ Less

Submitted 14 April, 2024; v1 submitted 27 November, 2020; originally announced November 2020.

Comments: Bug fixed

arXiv:2011.01488 [pdf, other]

Multi-armed Bandits with Cost Subsidy

Authors: Deeksha Sinha, Karthik Abinav Sankararama, Abbas Kazerouni, Vashist Avadhanula

Abstract: In this paper, we consider a novel variant of the multi-armed bandit (MAB) problem, MAB with cost subsidy, which models many real-life applications where the learning agent has to pay to select an arm and is concerned about optimizing cumulative costs and rewards. We present two applications, intelligent SMS routing problem and ad audience optimization problem faced by several businesses (especial… ▽ More In this paper, we consider a novel variant of the multi-armed bandit (MAB) problem, MAB with cost subsidy, which models many real-life applications where the learning agent has to pay to select an arm and is concerned about optimizing cumulative costs and rewards. We present two applications, intelligent SMS routing problem and ad audience optimization problem faced by several businesses (especially online platforms), and show how our problem uniquely captures key features of these applications. We show that naive generalizations of existing MAB algorithms like Upper Confidence Bound and Thompson Sampling do not perform well for this problem. We then establish a fundamental lower bound on the performance of any online learning algorithm for this problem, highlighting the hardness of our problem in comparison to the classical MAB problem. We also present a simple variant of explore-then-commit and establish near-optimal regret bounds for this algorithm. Lastly, we perform extensive numerical simulations to understand the behavior of a suite of algorithms for various instances and recommend a practical guide to employ different algorithms. △ Less

Submitted 15 March, 2021; v1 submitted 3 November, 2020; originally announced November 2020.

arXiv:1911.00638 [pdf, other]

Thompson Sampling for Contextual Bandit Problems with Auxiliary Safety Constraints

Authors: Samuel Daulton, Shaun Singh, Vashist Avadhanula, Drew Dimmery, Eytan Bakshy

Abstract: Recent advances in contextual bandit optimization and reinforcement learning have garnered interest in applying these methods to real-world sequential decision making problems. Real-world applications frequently have constraints with respect to a currently deployed policy. Many of the existing constraint-aware algorithms consider problems with a single objective (the reward) and a constraint on th… ▽ More Recent advances in contextual bandit optimization and reinforcement learning have garnered interest in applying these methods to real-world sequential decision making problems. Real-world applications frequently have constraints with respect to a currently deployed policy. Many of the existing constraint-aware algorithms consider problems with a single objective (the reward) and a constraint on the reward with respect to a baseline policy. However, many important applications involve multiple competing objectives and auxiliary constraints. In this paper, we propose a novel Thompson sampling algorithm for multi-outcome contextual bandit problems with auxiliary constraints. We empirically evaluate our algorithm on a synthetic problem. Lastly, we apply our method to a real world video transcoding problem and provide a practical way for navigating the trade-off between safety and performance using Bayesian optimization. △ Less

Submitted 1 November, 2019; originally announced November 2019.

Comments: To appear at NeurIPS 2019, Workshop on Safety and Robustness in Decision Making. 11 pages (including references and appendix)

arXiv:1706.03880 [pdf, other]

MNL-Bandit: A Dynamic Learning Approach to Assortment Selection

Authors: Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, Assaf Zeevi

Abstract: We consider a dynamic assortment selection problem, where in every round the retailer offers a subset (assortment) of $N$ substitutable products to a consumer, who selects one of these products according to a multinomial logit (MNL) choice model. The retailer observes this choice and the objective is to dynamically learn the model parameters, while optimizing cumulative revenues over a selling hor… ▽ More We consider a dynamic assortment selection problem, where in every round the retailer offers a subset (assortment) of $N$ substitutable products to a consumer, who selects one of these products according to a multinomial logit (MNL) choice model. The retailer observes this choice and the objective is to dynamically learn the model parameters, while optimizing cumulative revenues over a selling horizon of length $T$. We refer to this exploration-exploitation formulation as the MNL-Bandit problem. Existing methods for this problem follow an "explore-then-exploit" approach, which estimate parameters to a desired accuracy and then, treating these estimates as if they are the correct parameter values, offers the optimal assortment based on these estimates. These approaches require certain a priori knowledge of "separability", determined by the true parameters of the underlying MNL model, and this in turn is critical in determining the length of the exploration period. (Separability refers to the distinguishability of the true optimal assortment from the other sub-optimal alternatives.) In this paper, we give an efficient algorithm that simultaneously explores and exploits, achieving performance independent of the underlying parameters. The algorithm can be implemented in a fully online manner, without knowledge of the horizon length $T$. Furthermore, the algorithm is adaptive in the sense that its performance is near-optimal in both the "well separated" case, as well as the general parameter setting where this separation need not hold. △ Less

Submitted 29 June, 2018; v1 submitted 12 June, 2017; originally announced June 2017.

arXiv:1706.00977 [pdf, other]

Thompson Sampling for the MNL-Bandit

Authors: Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, Assaf Zeevi

Abstract: We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms), and observes a (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multino… ▽ More We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms), and observes a (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative rewards over a finite horizon $T$, or alternatively, minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective, and arise in several important application domains. We present an approach to adapt Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance. △ Less

Submitted 3 January, 2019; v1 submitted 3 June, 2017; originally announced June 2017.

Comments: Accepted for presentation at Conference on Learning Theory (COLT) 2017

Showing 1–11 of 11 results for author: Avadhanula, V