Social Learning with Bounded Rationality: Negative Reviews Persist under Newest First

Authors: Jackie Baek, Atanas Dinev, Thodoris Lykouris

Abstract: We study a model of social learning from reviews where customers are computationally limited and make purchases based on reading only the first few reviews displayed by the platform. Under this bounded rationality, we establish that the review ordering policy can have a significant impact. In particular, the popular Newest First ordering induces a negative review to persist as the most recent review longer than a positive review. This phenomenon, which we term the Cost of Newest First, can make the long-term revenue unboundedly lower than a counterpart where reviews are exogenously drawn for each customer. We show that the impact of the Cost of Newest First can be mitigated under dynamic pricing, which allows the price to depend on the set of displayed reviews. Under the optimal dynamic pricing policy, the revenue loss is at most a factor of 2. On the way, we identify a structural property for this optimal dynamic pricing: the prices should ensure that the probability of a purchase is always the same, regardless of the state of reviews. We also study an extension of the model where customers put more weight on more recent reviews (and discount older reviews based on their time of posting), and we show that Newest First is still not the optimal ordering policy if customers discount slowly. Lastly, we corroborate our theoretical findings using a real-world review dataset. We find that the average rating of the first page of reviews is statistically significantly smaller than the overall average rating, which is in line with our theoretical results.

Submitted 22 August, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: An extended abstract appeared at the Twenty-Fifth ACM Conference on Economics and Computation (EC 2024)

arXiv:2402.12237 [pdf, other]

Learning to Defer in Content Moderation: The Human-AI Interplay

Authors: Thodoris Lykouris, Wentao Weng

Abstract: Successful content moderation in online platforms relies on a human-AI collaboration approach. A typical heuristic estimates the expected harmfulness of a post and uses fixed thresholds to decide whether to remove it and whether to send it for human review. This disregards the prediction uncertainty, the time-varying element of human review capacity and post arrivals, and the selective sampling in the dataset (humans only review posts filtered by the admission algorithm). In this paper, we introduce a model to capture the human-AI interplay in content moderation. The algorithm observes contextual information for incoming posts, makes classification and admission decisions, and schedules posts for human review. Only admitted posts receive human reviews on their harmfulness. These reviews help educate the machine-learning algorithms but are delayed due to congestion in the human review system. The classical learning-theoretic way to capture this human-AI interplay is via the framework of learning to defer, where the algorithm has the option to defer a classification task to humans for a fixed cost and immediately receive feedback. Our model contributes to this literature by introducing congestion in the human review system. Moreover, unlike work on online learning with delayed feedback where the delay in the feedback is exogenous to the algorithm's decisions, the delay in our model is endogenous to both the admission and the scheduling decisions.
We propose a near-optimal learning algorithm that carefully balances the classification loss from a selectively sampled dataset, the idiosyncratic loss of non-reviewed posts, and the delay loss of having congestion in the human review system. To the best of our knowledge, this is the first result for online learning in contextual queueing systems and hence our analytical framework may be of independent interest.

Submitted 2 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

arXiv:2308.07817 [pdf, other]

Quantifying the Cost of Learning in Queueing Systems

Authors: Daniel Freund, Thodoris Lykouris, Wentao Weng

Abstract: Queueing systems are widely applicable stochastic models with use cases in communication networks, healthcare, service systems, etc. Although their optimal control has been extensively studied, most existing approaches assume perfect knowledge of the system parameters. Of course, this assumption rarely holds in practice where there is parameter uncertainty, thus motivating a recent line of work on bandit learning for queueing systems. This nascent stream of research focuses on the asymptotic performance of the proposed algorithms. In this paper, we argue that an asymptotic metric, which focuses on late-stage performance, is insufficient to capture the intrinsic statistical complexity of learning in queueing systems which typically occurs in the early stage. Instead, we propose the Cost of Learning in Queueing (CLQ), a new metric that quantifies the maximum increase in time-averaged queue length caused by parameter uncertainty. We characterize the CLQ of a single queue multi-server system, and then extend these results to multi-queue multi-server systems and networks of queues. In establishing our results, we propose a unified analysis framework for CLQ that bridges Lyapunov and bandit analysis, provides guarantees for a wide range of algorithms, and could be of independent interest.

Submitted 27 October, 2023; v1 submitted 15 August, 2023; originally announced August 2023.

Comments: A condensed version of this work was accepted for presentation at the Conference on Neural Information Processing Systems (NeurIPS 2023). Compared to the first version of the paper, the current version expands the comparison with related work

arXiv:2301.10642 [pdf, other]

Group fairness in dynamic refugee assignment

Authors: Daniel Freund, Thodoris Lykouris, Elisabeth Paulson, Bradley Sturt, Wentao Weng

Abstract: Ensuring that refugees and asylum seekers thrive (e.g., find employment) in their host countries is a profound humanitarian goal, and a primary driver of employment is the geographic location within a host country to which the refugee or asylum seeker is assigned. Recent research has proposed and implemented algorithms that assign refugees and asylum seekers to geographic locations in a manner that maximizes the average employment across all arriving refugees. While these algorithms can have substantial overall positive impact, using data from two industry collaborators we show that the impact of these algorithms can vary widely across key subgroups based on country of origin, age, or educational background. Thus motivated, we develop a simple and interpretable framework for incorporating group fairness into the dynamic refugee assignment problem. In particular, the framework can flexibly incorporate many existing and future definitions of group fairness from the literature (e.g., maxmin, randomized, and proportionally-optimized within-group). Equipped with our framework, we propose two bid-price algorithms that maximize overall employment while simultaneously yielding provable group fairness guarantees. Through extensive numerical experiments using various definitions of group fairness and real-world data from the U.S. and the Netherlands, we show that our algorithms can yield substantial improvements in group fairness compared to an offline benchmark fairness constraints, with only small relative decreases ($\approx$ 1%-5%) in global performance.

Submitted 11 January, 2024; v1 submitted 25 January, 2023; originally announced January 2023.

arXiv:2208.09407 [pdf, other]

Learning in Stackelberg Games with Non-myopic Agents

Authors: Nika Haghtalab, Thodoris Lykouris, Sloan Nietert, Alex Wei

Abstract: We study Stackelberg games where a principal repeatedly interacts with a long-lived, non-myopic agent, without knowing the agent's payoff function. Although learning in Stackelberg games is well-understood when the agent is myopic, non-myopic agents pose additional complications. In particular, non-myopic agents may strategically select actions that are inferior in the present to mislead the principal's learning algorithm and obtain better outcomes in the future. We provide a general framework that reduces learning in the presence of non-myopic agents to robust bandit optimization in the presence of myopic agents. Through the design and analysis of minimally reactive bandit algorithms, our reduction trades off the statistical efficiency of the principal's learning algorithm against its effectiveness in inducing near-best-responses. We apply this framework to Stackelberg security games (SSGs), pricing with unknown demand curve, strategic classification, and general finite Stackelberg games. In each setting, we characterize the type and impact of misspecifications present in near-best-responses and develop a learning algorithm robust to such misspecifications. Along the way, we improve the query complexity of learning in SSGs with $n$ targets from the state-of-the-art $O(n^3)$ to a near-optimal $\widetilde{O}(n)$ by uncovering a fundamental structural property of such games. This result is of independent interest beyond learning with non-myopic agents.

Submitted 19 August, 2022; originally announced August 2022.

Comments: An extended abstract of this work appeared at the ACM Conference on Economics and Computation (EC) 2022

arXiv:2206.03324 [pdf, other]

Efficient decentralized multi-agent learning in asymmetric bipartite queueing systems

Authors: Daniel Freund, Thodoris Lykouris, Wentao Weng

Abstract: We study decentralized multi-agent learning in bipartite queueing systems, a standard model for service systems. In particular, N agents request service from K servers in a fully decentralized way, i.e., by running the same algorithm without communication. Previous decentralized algorithms are restricted to symmetric systems, have performance that is degrading exponentially in the number of servers, require communication through shared randomness and unique agent identities, and are computationally demanding. In contrast, we provide a simple learning algorithm that, when run decentrally by each agent, leads the queueing system to have efficient performance in general asymmetric bipartite queueing systems while also having additional robustness properties. Along the way, we provide the first provably efficient UCB-based algorithm for the centralized case of the problem.

Submitted 5 August, 2023; v1 submitted 5 June, 2022; originally announced June 2022.

Comments: To appear in Operations Research. A preliminary version of this work was accepted for presentation at the Conference on Learning Theory (COLT) 2022. Compared to the first version of the paper, the current version expands upon the related work and adds intuition on the technical content

arXiv:2107.01509 [pdf, other]

Bayesian decision-making under misspecified priors with applications to meta-learning

Authors: Max Simchowitz, Christopher Tosh, Akshay Krishnamurthy, Daniel Hsu, Thodoris Lykouris, Miroslav Dudík, Robert E. Schapire

Abstract: Thompson sampling and other Bayesian sequential decision-making algorithms are among the most popular approaches to tackle explore/exploit trade-offs in (contextual) bandits. The choice of prior in these algorithms offers flexibility to encode domain knowledge but can also lead to poor performance when misspecified. In this paper, we demonstrate that performance degrades gracefully with misspecification. We prove that the expected reward accrued by Thompson sampling (TS) with a misspecified prior differs by at most $\tilde{\mathcal{O}}(H^2 ε)$ from TS with a well specified prior, where $ε$ is the total-variation distance between priors and $H$ is the learning horizon. Our bound does not require the prior to have any parametric form. For priors with bounded support, our bound is independent of the cardinality or structure of the action space, and we show that it is tight up to universal constants in the worst case. Building on our sensitivity analysis, we establish generic PAC guarantees for algorithms in the recently studied Bayesian meta-learning setting and derive corollaries for various families of priors. Our results generalize along two axes: (1) they apply to a broader family of Bayesian decision-making algorithms, including a Monte-Carlo implementation of the knowledge gradient algorithm (KG), and (2) they apply to Bayesian POMDPs, the most general Bayesian decision-making setting, encompassing contextual bandits as a special case.
Through numerical simulations, we illustrate how prior misspecification and the deployment of one-step look-ahead (as in KG) can impact the convergence of meta-learning in multi-armed and contextual bandits with structured and correlated priors.

Submitted 3 July, 2021; originally announced July 2021.

arXiv:2007.07990 [pdf, other]

Static pricing for multi-unit prophet inequalities

Authors: Shuchi Chawla, Nikhil Devanur, Thodoris Lykouris

Abstract: We study a pricing problem where a seller has $k$ identical copies of a product, buyers arrive sequentially, and the seller prices the items aiming to maximize social welfare. When $k=1$, this is the so-called "prophet inequality" problem for which there is a simple pricing scheme achieving a competitive ratio of $1/2$. On the other end of the spectrum, as $k$ goes to infinity, the asymptotic performance of both static and adaptive pricing is well understood. We provide a static pricing scheme for the small-supply regime: where $k$ is small but larger than $1$. Prior to our work, the best competitive ratio known for this setting was the $1/2$ that follows from the single-unit prophet inequality. Our pricing scheme is easy to describe as well as practical -- it is anonymous, non-adaptive, and order-oblivious. We pick a single price that equalizes the expected fraction of items sold and the probability that the supply does not sell out before all customers are served; this price is then offered to each customer while supply lasts. This extends an approach introduced by Samuel-Cahn for the case of $k=1$. This pricing scheme achieves a competitive ratio that increases gradually with the supply. Subsequent work by Jiang, Ma, and Zhang shows that our pricing scheme is the optimal static pricing for every value of $k$.

Submitted 20 June, 2023; v1 submitted 15 July, 2020; originally announced July 2020.
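
For the single-unit case ($k=1$) mentioned in the abstract above, the Samuel-Cahn approach posts one price equal to the median of the distribution of the maximum buyer value. The following is a minimal illustrative sketch of that single-unit rule only, not of the paper's multi-unit scheme. It assumes independent buyer values with known CDFs; the function names `median_of_max_price` and `uniform_cdf`, the bisection tolerance, and the uniform example are hypothetical choices made for illustration.

```python
from typing import Callable, Sequence


def median_of_max_price(
    cdfs: Sequence[Callable[[float], float]],
    lo: float,
    hi: float,
    tol: float = 1e-9,
) -> float:
    """Return a price p in [lo, hi] with P(max value <= p) close to 1/2.

    Assumes independent values, so P(max <= p) is the product of the CDFs,
    and that this product is continuous and crosses 1/2 between lo and hi.
    """

    def prob_no_sale(p: float) -> float:
        # Probability that every buyer's value is at most p.
        prob = 1.0
        for cdf in cdfs:
            prob *= cdf(p)
        return prob

    # Bisection on the nondecreasing function prob_no_sale.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prob_no_sale(mid) < 0.5:
            lo = mid  # price too low: the item sells more than half the time
        else:
            hi = mid
    return (lo + hi) / 2.0


def uniform_cdf(x: float) -> float:
    """CDF of a Uniform[0, 1] value (hypothetical example distribution)."""
    return min(1.0, max(0.0, x))


# Three i.i.d. Uniform[0, 1] buyers: P(max <= p) = p**3, so p = 0.5 ** (1/3).
price = median_of_max_price([uniform_cdf] * 3, lo=0.0, hi=1.0)
```

In this example the computed price can be compared against the closed form `0.5 ** (1 / 3)`.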

arXiv:2006.05051 [pdf, other]

Constrained episodic reinforcement learning in concave-convex and knapsack settings

Authors: Kianté Brantley, Miroslav Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, Wen Sun

Abstract: We propose an algorithm for tabular episodic reinforcement learning with constraints. We provide a modular analysis with strong theoretical guarantees for settings with concave rewards and convex constraints, and for settings with hard constraints (knapsacks). Most of the previous work in constrained reinforcement learning is limited to linear constraints, and the remaining work focuses on either the feasibility question or settings with a single episode. Our experiments demonstrate that the proposed algorithm significantly outperforms these approaches in existing constrained episodic environments.

Submitted 5 June, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

Comments: The NeurIPS 2020 version of this paper includes a small bug, leading to an incorrect dependence on H in Theorem 3.4. This version fixes it by adjusting Eq. (9), Theorem 3.4 and the relevant proofs. Changes in the main text are noted in red. Changes in the appendix are limited to Appendices B.1, B.5, and B.6 and the statement of Lemma F.3

arXiv:2003.02287 [pdf, other]

Bandits with adversarial scaling

Authors: Thodoris Lykouris, Vahab Mirrokni, Renato Paes Leme

Abstract: We study "adversarial scaling", a multi-armed bandit model where rewards have a stochastic and an adversarial component. Our model captures display advertising where the "click-through-rate" can be decomposed to a (fixed across time) arm-quality component and a non-stochastic user-relevance component (fixed across arms). Despite the relative stochasticity of our model, we demonstrate two settings where most bandit algorithms suffer. On the positive side, we show that two algorithms, one from the action elimination and one from the mirror descent family, are adaptive enough to be robust to adversarial scaling. Our results shed light on the robustness of adaptive parameter selection in stochastic bandits, which may be of independent interest.

Submitted 28 August, 2020; v1 submitted 4 March, 2020; originally announced March 2020.

Comments: Appeared in ICML 2020

arXiv:2002.11650 [pdf, other]

Contextual Search in the Presence of Adversarial Corruptions

Authors: Akshay Krishnamurthy, Thodoris Lykouris, Chara Podimata, Robert Schapire

Abstract: We study contextual search, a generalization of binary search in higher dimensions, which captures settings such as feature-based dynamic pricing. Standard formulations of this problem assume that agents act in accordance with a specific homogeneous response model. In practice, however, some responses may be adversarially corrupted. Existing algorithms heavily depend on the assumed response model being (approximately) accurate for all agents and have poor performance in the presence of even a few such arbitrary misspecifications. We initiate the study of contextual search when some of the agents can behave in ways inconsistent with the underlying response model. In particular, we provide two algorithms, one based on multidimensional binary search methods and one based on gradient descent. We show that these algorithms attain near-optimal regret in the absence of adversarial corruptions and their performance degrades gracefully with the number of such agents, providing the first results for contextual search in any adversarial noise model. Our techniques draw inspiration from learning theory, game theory, high-dimensional geometry, and convex analysis.

Submitted 6 August, 2022; v1 submitted 26 February, 2020; originally announced February 2020.

Comments: The first version was titled "Corrupted multidimensional binary search: Learning in the presence of irrational agents". An 8-page extended abstract titled "Contextual search in the presence of irrational agents" appeared at the 53rd ACM Symposium on the Theory of Computing (STOC '21)

arXiv:1911.08689 [pdf, ps, other]

Corruption-robust exploration in episodic reinforcement learning

Authors: Thodoris Lykouris, Max Simchowitz, Aleksandrs Slivkins, Wen Sun

Abstract: We initiate the study of multi-stage episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system, extending recent results for the special case of stochastic bandits. We provide a framework which modifies the aggressive exploration enjoyed by existing reinforcement learning approaches based on "optimism in the face of uncertainty", by complementing them with principles from "action elimination". Importantly, our framework circumvents the major challenges posed by naively applying action elimination in the RL setting, as formalized by a lower bound we demonstrate. Our framework yields efficient algorithms which (a) attain near-optimal regret in the absence of corruptions and (b) adapt to unknown levels of corruption, enjoying regret guarantees which degrade gracefully in the total corruption encountered. To showcase the generality of our approach, we derive results for both tabular settings (where states and actions are finite) as well as linear-function-approximation settings (where the dynamics and rewards admit a linear underlying representation). Notably, our work provides the first sublinear regret guarantee which accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning.

Submitted 31 October, 2023; v1 submitted 19 November, 2019; originally announced November 2019.

Comments: Accepted in Mathematics of Operations Research. Preliminary version was accepted for presentation at COLT'21

arXiv:1909.08375 [pdf, other]

Advancing subgroup fairness via sleeping experts

Authors: Avrim Blum, Thodoris Lykouris

Abstract: We study methods for improving fairness to subgroups in settings with overlapping populations and sequential predictions. Classical notions of fairness focus on the balance of some property across different populations. However, in many applications the goal of the different groups is not to be predicted equally but rather to be predicted well. We demonstrate that the task of satisfying this guarantee for multiple overlapping groups is not straightforward and show that for the simple objective of unweighted average of false negative and false positive rate, satisfying this for overlapping populations can be statistically impossible even when we are provided predictors that perform well separately on each subgroup. On the positive side, we show that when individuals are equally important to the different groups they belong to, this goal is achievable; to do so, we draw a connection to the sleeping experts literature in online learning. Motivated by the one-sided feedback in natural settings of interest, we extend our results to such a feedback model. We also provide a game-theoretic interpretation of our results, examining the incentives of participants to join the system and to provide the system full information about predictors they may possess. We end with several interesting open problems concerning the strength of guarantees that can be achieved in a computationally efficient manner.

Submitted 2 December, 2019; v1 submitted 18 September, 2019; originally announced September 2019.

Comments: To appear in ITCS 2020

arXiv:1905.09898 [pdf, ps, other]

Feedback graph regret bounds for Thompson Sampling and UCB

Authors: Thodoris Lykouris, Eva Tardos, Drishti Wali

Abstract: We study the stochastic multi-armed bandit problem with the graph-based feedback structure introduced by Mannor and Shamir. We analyze the performance of the two most prominent stochastic bandit algorithms, Thompson Sampling and Upper Confidence Bound (UCB), in the graph-based feedback setting. We show that these algorithms achieve regret guarantees that combine the graph structure and the gaps between the means of the arm distributions. Surprisingly, this holds despite the fact that these algorithms do not explicitly use the graph structure to select arms; they observe the additional feedback but do not explore based on it. Towards this result, we introduce a "layering technique" highlighting the commonalities in the two algorithms.

Submitted 14 February, 2020; v1 submitted 23 May, 2019; originally announced May 2019.

Comments: Appeared in ALT 2020

arXiv:1810.11829 [pdf, ps, other]

On preserving non-discrimination when combining expert advice

Authors: Avrim Blum, Suriya Gunasekar, Thodoris Lykouris, Nathan Srebro

Abstract: We study the interplay between sequential decision making and avoiding discrimination against protected groups, when examples arrive online and do not follow distributional assumptions. We consider the most basic extension of classical online learning: "Given a class of predictors that are individually non-discriminatory with respect to a particular metric, how can we combine them to perform as well as the best predictor, while preserving non-discrimination?" Surprisingly, we show that this task is unachievable for the prevalent notion of "equalized odds" that requires equal false negative rates and equal false positive rates across groups. On the positive side, for another notion of non-discrimination, "equalized error rates", we show that running separate instances of the classical multiplicative weights algorithm for each group achieves this guarantee. Interestingly, even for this notion, we show that algorithms with stronger performance guarantees than multiplicative weights cannot preserve non-discrimination.

Submitted 29 March, 2019; v1 submitted 28 October, 2018; originally announced October 2018.

Comments: Appeared in NIPS 2018
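
The positive result in the abstract above relies on running a separate instance of the classical multiplicative weights algorithm for each group. A minimal sketch of that idea follows, assuming losses in $[0, 1]$ and full-information feedback. The class names (`MultiplicativeWeights`, `PerGroupMultiplicativeWeights`), the learning rate of 0.5, and the group labels are hypothetical illustrations, not the paper's notation or tuning.

```python
import math
from collections import defaultdict
from typing import DefaultDict, Hashable, List, Sequence


class MultiplicativeWeights:
    """Classical multiplicative weights (Hedge) over a fixed set of experts."""

    def __init__(self, num_experts: int, learning_rate: float = 0.5) -> None:
        self.learning_rate = learning_rate
        self.weights: List[float] = [1.0] * num_experts

    def distribution(self) -> List[float]:
        """Probability of following each expert in the next round."""
        total = sum(self.weights)
        return [w / total for w in self.weights]

    def update(self, losses: Sequence[float]) -> None:
        """Shrink each expert's weight exponentially in its loss (in [0, 1])."""
        self.weights = [
            w * math.exp(-self.learning_rate * loss)
            for w, loss in zip(self.weights, losses)
        ]


class PerGroupMultiplicativeWeights:
    """Keeps one independent MultiplicativeWeights instance per group."""

    def __init__(self, num_experts: int, learning_rate: float = 0.5) -> None:
        self._instances: DefaultDict[Hashable, MultiplicativeWeights] = defaultdict(
            lambda: MultiplicativeWeights(num_experts, learning_rate)
        )

    def distribution(self, group: Hashable) -> List[float]:
        return self._instances[group].distribution()

    def update(self, group: Hashable, losses: Sequence[float]) -> None:
        # Only the instance of the arriving example's group is updated.
        self._instances[group].update(losses)


# Hypothetical usage: expert 0 errs on an example from group "A".
learner = PerGroupMultiplicativeWeights(num_experts=2)
learner.update("A", [1.0, 0.0])
dist_a = learner.distribution("A")  # now favors expert 1
dist_b = learner.distribution("B")  # untouched, still uniform
```

Because each group's weights evolve only on that group's examples, an update for one group leaves the distribution used for every other group unchanged.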

arXiv:1803.09353 [pdf, ps, other]

Stochastic bandits robust to adversarial corruptions

Authors: Thodoris Lykouris, Vahab Mirrokni, Renato Paes Leme

Abstract: We introduce a new model of stochastic bandits with adversarial corruptions which aims to capture settings where most of the input follows a stochastic pattern but some fraction of it can be adversarially changed to trick the algorithm, e.g., click fraud, fake reviews and email spam. The goal of this model is to encourage the design of bandit algorithms that (i) work well in mixed adversarial and stochastic models, and (ii) whose performance deteriorates gracefully as we move from fully stochastic to fully adversarial models. In our model, the rewards for all arms are initially drawn from a distribution and are then altered by an adaptive adversary. We provide a simple algorithm whose performance gracefully degrades with the total corruption the adversary injected in the data, measured by the sum across rounds of the biggest alteration the adversary made in the data in that round; this total corruption is denoted by $C$. Our algorithm provides a guarantee that retains the optimal guarantee (up to a logarithmic term) if the input is stochastic and whose performance degrades linearly to the amount of corruption $C$, while crucially being agnostic to it. We also provide a lower bound showing that this linear degradation is necessary if the algorithm achieves optimal performance in the stochastic setting (the lower bound works even for a known amount of corruption, a special case in which our algorithm achieves optimal performance without the extra logarithm).

Submitted 25 March, 2018; originally announced March 2018.

Comments: To appear in STOC 2018

arXiv:1802.05399 [pdf, other]

Competitive caching with machine learned advice

Authors: Thodoris Lykouris, Sergei Vassilvitskii

Abstract: Traditional online algorithms encapsulate decision making under uncertainty, and give ways to hedge against all possible future events, while guaranteeing a nearly optimal solution as compared to an offline optimum. On the other hand, machine learning algorithms are in the business of extrapolating patterns found in the data to predict the future, and usually come with strong guarantees on the expected generalization error. In this work we develop a framework for augmenting online algorithms with a machine learned oracle to achieve competitive ratios that provably improve upon unconditional worst case lower bounds when the oracle has low error. Our approach treats the oracle as a complete black box, and is not dependent on its inner workings, or the exact distribution of its errors. We apply this framework to the traditional caching problem -- creating an eviction strategy for a cache of size $k$. We demonstrate that naively following the oracle's recommendations may lead to very poor performance, even when the average error is quite low. Instead we show how to modify the Marker algorithm to take into account the oracle's predictions, and prove that this combined approach achieves a competitive ratio that both (i) decreases as the oracle's error decreases, and (ii) is always capped by $O(\log k)$, which can be achieved without any oracle input. We complement our results with an empirical evaluation of our algorithm on real world datasets, and show that it performs well empirically even using simple off-the-shelf predictions.

Submitted 21 August, 2020; v1 submitted 14 February, 2018; originally announced February 2018.

Comments: Preliminary versions appeared in ICML 18 and SysML 18. The current version improves the presentation of the suggested framework (Section 2.2), provides a more clear discussion on how it can be more broadly applied, and fixes some more minor presentation issues in other sections

arXiv:1711.03639 [pdf, ps, other]

Small-loss bounds for online learning with partial information

Authors: Thodoris Lykouris, Karthik Sridharan, Eva Tardos

Abstract: We consider the problem of adversarial (non-stochastic) online learning with partial information feedback, where at each round, a decision maker selects an action from a finite set of alternatives. We develop a black-box approach for such problems where the learner observes as feedback only losses of a subset of the actions that includes the selected action. When losses of actions are non-negative, under the graph-based feedback model introduced by Mannor and Shamir, we offer algorithms that attain the so called "small-loss" $o(αL^{\star})$ regret bounds with high probability, where $α$ is the independence number of the graph, and $L^{\star}$ is the loss of the best action. Prior to our work, there was no data-dependent guarantee for general feedback graphs even for pseudo-regret (without dependence on the number of actions, i.e. utilizing the increased information feedback). Taking advantage of the black-box nature of our technique, we extend our results to many other applications such as semi-bandits (including routing in networks), contextual bandits (even with an infinite comparator class), as well as learning with slowly changing (shifting) comparators. In the special case of classical bandit and semi-bandit problems, we provide optimal small-loss, high-probability guarantees of $\tilde{O}(\sqrt{dL^{\star}})$ for actual regret, where $d$ is the number of actions, answering open questions of Neu. Previous bounds for bandits and semi-bandits were known only for pseudo-regret and only in expectation. We also offer an optimal $\tilde{O}(\sqrt{κL^{\star}})$ regret guarantee for fixed feedback graphs with clique-partition number at most $κ$.

Submitted 26 July, 2021; v1 submitted 9 November, 2017; originally announced November 2017.

Comments: The current version represents the content that will appear in Mathematics of Operations Research. An extended abstract of the paper appeared at the 31st Annual Conference on Learning Theory (COLT 2018)

arXiv:1608.06819 [pdf, ps, other]

Pricing and Optimization in Shared Vehicle Systems: An Approximation Framework

Authors: Siddhartha Banerjee, Daniel Freund, Thodoris Lykouris

Abstract: Optimizing shared vehicle systems (bike/scooter/car/ride-sharing) is more challenging compared to traditional resource allocation settings due to the presence of complex network externalities -- changes in the demand/supply at any location affect future supply throughout the system within short timescales. These externalities are well captured by steady-state Markovian models, which are therefore widely used to analyze such systems. However, using such models to design pricing and other control policies is computationally difficult since the resulting optimization problems are high-dimensional and non-convex. To this end, we develop a rigorous approximation framework for shared vehicle systems, providing a unified approach for a wide range of controls (pricing, matching, rebalancing), objective functions (throughput, revenue, welfare), and system constraints (travel-times, welfare benchmarks, posted-price constraints). Our approach is based on the analysis of natural convex relaxations, and obtains as special cases existing approximate-optimal policies for limited settings, asymptotic-optimality results, and heuristic policies. The resulting guarantees are non-asymptotic and parametric, and provide operational insights into the design of real-world systems. In particular, for any shared vehicle system with $n$ stations and $m$ vehicles, our framework obtains an approximation ratio of $1+(n-1)/m$, which is particularly meaningful when $m/n$, the average number of vehicles per station, is large, as is often the case in practice.

Submitted 10 May, 2021; v1 submitted 24 August, 2016; originally announced August 2016.

Comments: The current version represents the content that will appear in Operations Research. A one-page abstract of the paper appeared at the 18th ACM Conference on Economics and Computation (EC 2017)

arXiv:1606.06244 [pdf, ps, other]

Learning in Games: Robustness of Fast Convergence

Authors: Dylan J. Foster, Zhiyuan Li, Thodoris Lykouris, Karthik Sridharan, Eva Tardos

Abstract: We show that learning algorithms satisfying a low approximate regret property experience fast convergence to approximate optimality in a large class of repeated games. Our property, which simply requires that each learner has small regret compared to a $(1+ε)$-multiplicative approximation to the best action in hindsight, is ubiquitous among learning algorithms; it is satisfied even by the vanilla Hedge forecaster. Our results improve upon recent work of Syrgkanis et al. [SALS15] in a number of ways. We require only that players observe payoffs under other players' realized actions, as opposed to expected payoffs. We further show that convergence occurs with high probability, and show convergence under bandit feedback. Finally, we improve upon the speed of convergence by a factor of $n$, the number of players. Both the scope of settings and the class of algorithms for which our analysis provides fast convergence are considerably broader than in previous work. Our framework applies to dynamic population games via a low approximate regret property for shifting experts. Here we strengthen the results of Lykouris et al. [LST16] in two ways: We allow players to select learning algorithms from a larger class, which includes a minor variant of the basic Hedge algorithm, and we increase the maximum churn in players for which approximate optimality is achieved. In the bandit setting we present a new algorithm which provides a "small loss"-type bound with improved dependence on the number of actions in utility settings, and is both simple and efficient. This result may be of independent interest.

Submitted 16 December, 2016; v1 submitted 20 June, 2016; originally announced June 2016.

Comments: 27 pages. NIPS 2016

arXiv:1505.00391 [pdf, ps, other]

Learning and Efficiency in Games with Dynamic Population

Authors: Thodoris Lykouris, Vasilis Syrgkanis, Eva Tardos

Abstract: We study the quality of outcomes in repeated games when the population of players is dynamically changing and participants use learning algorithms to adapt to the changing environment. Game theory classically considers Nash equilibria of one-shot games, while in practice many games are played repeatedly, and in such games players often use algorithmic tools to learn to play in the given environment. Most previous work on learning in repeated games assumes that the population playing the game is static over time. We analyze the efficiency of repeated games in dynamically changing environments, motivated by application domains such as Internet ad-auctions and packet routing. We prove that, in many classes of games, if players choose their strategies in a way that guarantees low adaptive regret, then high social welfare is ensured, even under very frequent changes. In fact, in large markets learning players achieve asymptotically optimal social welfare despite high turnover. Previous work has only shown that high welfare is guaranteed for learning outcomes in static environments. Our work extends these results to more realistic settings where participation is drastically evolving over time.

Submitted 22 May, 2020; v1 submitted 2 May, 2015; originally announced May 2015.

Comments: Preliminary version appeared in ACM Symposium on Discrete Algorithms 2016 (SODA 2016). This version adds a major new result: asymptotic optimality of simultaneous second-price auctions with dynamic population. Presentation is significantly simplified and all results are presented parametrically

Showing 1–21 of 21 results for author: Lykouris, T