
Incentive-compatible Bandits: Importance Weighting No More

Authors: Julian Zimmert, Teodor V. Marinov

Abstract: We study the problem of incentive-compatible online learning with bandit feedback. In this class of problems, the experts are self-interested agents who might misrepresent their preferences with the goal of being selected most often. The goal is to devise algorithms which are simultaneously incentive-compatible, that is, the experts are incentivised to report their true preferences, and have no regret with respect to the preferences of the best fixed expert in hindsight. Freeman et al. (2020) propose an algorithm with optimal $O(\sqrt{T \log(K)})$ regret in the full information setting and $O(T^{2/3}(K\log(K))^{1/3})$ regret in the bandit setting. In this work we propose the first incentive-compatible algorithms that enjoy $O(\sqrt{KT})$ regret bounds. We further demonstrate how simple loss-biasing allows the algorithm proposed in Freeman et al. (2020) to enjoy $\tilde O(\sqrt{KT})$ regret. As a byproduct of our approach we obtain the first bandit algorithm with nearly optimal regret bounds in the adversarial setting which works entirely on the observed loss sequence without the need for importance-weighted estimators. Finally, we provide an incentive-compatible algorithm that enjoys asymptotically optimal best-of-both-worlds regret guarantees, i.e., logarithmic regret in the stochastic regime as well as worst-case $O(\sqrt{KT})$ regret.

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2403.19462 [pdf, other]

Offline Imitation Learning from Multiple Baselines with Applications to Compiler Optimization

Authors: Teodor V. Marinov, Alekh Agarwal, Mircea Trofin

Abstract: This work studies a Reinforcement Learning (RL) problem in which we are given a set of trajectories collected with K baseline policies. Each of these policies can be quite suboptimal in isolation, and have strong performance in complementary parts of the state space. The goal is to learn a policy which performs as well as the best combination of baselines on the entire state space. We propose a simple imitation learning based algorithm, show a sample complexity bound on its accuracy and prove that the algorithm is minimax optimal by showing a matching lower bound. Further, we apply the algorithm in the setting of machine learning guided compiler optimization to learn policies for inlining programs with the objective of creating a small binary. We demonstrate that we can learn a policy that outperforms an initial policy learned via standard RL through a few iterations of our approach.

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2305.17040 [pdf, other]

A Mechanism for Sample-Efficient In-Context Learning for Sparse Retrieval Tasks

Authors: Jacob Abernethy, Alekh Agarwal, Teodor V. Marinov, Manfred K. Warmuth

Abstract: We study the phenomenon of \textit{in-context learning} (ICL) exhibited by large language models, where they can adapt to a new learning task, given a handful of labeled examples, without any explicit parameter optimization. Our goal is to explain how a pre-trained transformer model is able to perform ICL under reasonable assumptions on the pre-training process and the downstream tasks. We posit a mechanism whereby a transformer can achieve the following: (a) receive an i.i.d. sequence of examples which have been converted into a prompt using potentially-ambiguous delimiters, (b) correctly segment the prompt into examples and labels, (c) infer from the data a \textit{sparse linear regressor} hypothesis, and finally (d) apply this hypothesis on the given test example and return a predicted label. We establish that this entire procedure is implementable using the transformer mechanism, and we give sample complexity guarantees for this learning framework. Our empirical findings validate the challenge of segmentation, and we show a correspondence between our posited mechanisms and observed attention maps for step (c).

Submitted 26 May, 2023; originally announced May 2023.

arXiv:2302.03784 [pdf, ps, other]

Leveraging User-Triggered Supervision in Contextual Bandits

Authors: Alekh Agarwal, Claudio Gentile, Teodor V. Marinov

Abstract: We study contextual bandit (CB) problems, where the user can sometimes respond with the best action in a given context. Such an interaction arises, for example, in text prediction or autocompletion settings, where a poor suggestion is simply ignored and the user enters the desired text instead. Crucially, this extra feedback is user-triggered on only a subset of the contexts. We develop a new framework to leverage such signals, while being robust to their biased nature. We also augment standard CB algorithms to leverage the signal, and show improved regret guarantees for the resulting algorithms under a variety of conditions on the helpfulness of and bias inherent in this feedback.

Submitted 7 February, 2023; originally announced February 2023.

arXiv:2206.10022 [pdf, other]

Stochastic Online Learning with Feedback Graphs: Finite-Time and Asymptotic Optimality

Authors: Teodor V. Marinov, Mehryar Mohri, Julian Zimmert

Abstract: We revisit the problem of stochastic online learning with feedback graphs, with the goal of devising algorithms that are optimal, up to constants, both asymptotically and in finite time. We show that, surprisingly, the notion of optimal finite-time regret is not a uniquely defined property in this context and that, in general, it is decoupled from the asymptotic rate. We discuss alternative choices and propose a notion of finite-time optimality that we argue is \emph{meaningful}. For that notion, we give an algorithm that admits quasi-optimal regret both in finite-time and asymptotically.

Submitted 20 June, 2022; originally announced June 2022.

arXiv:2206.01836 [pdf, ps, other]

Dimension Independent Generalization of DP-SGD for Overparameterized Smooth Convex Optimization

Authors: Yi-An Ma, Teodor Vanislavov Marinov, Tong Zhang

Abstract: This paper considers the generalization performance of differentially private convex learning. We demonstrate that the convergence analysis of Langevin algorithms can be used to obtain new generalization bounds with differential privacy guarantees for DP-SGD. More specifically, by using some recently obtained dimension-independent convergence results for stochastic Langevin algorithms with convex objective functions, we obtain $O(n^{-1/4})$ privacy guarantees for DP-SGD with the optimal excess generalization error of $\tilde{O}(n^{-1/2})$ for certain classes of overparameterized smooth convex optimization problems. This improves previous DP-SGD results for such problems that contain explicit dimension dependencies, so that the resulting generalization bounds become unsuitable for overparameterized models used in practical applications.

Submitted 3 June, 2022; originally announced June 2022.

arXiv:2110.13282 [pdf, ps, other]

The Pareto Frontier of model selection for general Contextual Bandits

Authors: Teodor V. Marinov, Julian Zimmert

Abstract: Recent progress in model selection raises the question of the fundamental limits of these techniques. Under specific scrutiny has been model selection for general contextual bandits with nested policy classes, resulting in a COLT2020 open problem. It asks whether it is possible to obtain simultaneously the optimal single algorithm guarantees over all policies in a nested sequence of policy classes, or if otherwise this is possible for a trade-off $α\in[\frac{1}{2},1)$ between complexity term and time: $\ln(|Π_m|)^{1-α}T^α$. We give a disappointing answer to this question. Even in the purely stochastic regime, the desired results are unobtainable. We present a Pareto frontier of up to logarithmic factors matching upper and lower bounds, thereby proving that an increase in the complexity term $\ln(|Π_m|)$ independent of $T$ is unavoidable for general policy classes. As a side result, we also resolve a COLT2016 open problem concerning second-order bounds in full-information games.

Submitted 25 October, 2021; originally announced October 2021.

arXiv:2107.01264 [pdf, other]

Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning

Authors: Christoph Dann, Teodor V. Marinov, Mehryar Mohri, Julian Zimmert

Abstract: We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes. Compared to prior work, our bounds depend on alternative definitions of gaps. These definitions are based on the insight that, in order to achieve a favorable regret, an algorithm does not need to learn how to behave optimally in states that are not reached by an optimal policy. We prove tighter upper regret bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs. Our results show that optimistic algorithms cannot achieve the information-theoretic lower bounds even in deterministic MDPs unless there is a unique optimal policy.

Submitted 26 October, 2021; v1 submitted 2 July, 2021; originally announced July 2021.

arXiv:2006.09255 [pdf, other]

Corralling Stochastic Bandit Algorithms

Authors: Raman Arora, Teodor V. Marinov, Mehryar Mohri

Abstract: We study the problem of corralling stochastic bandit algorithms, that is combining multiple bandit algorithms designed for a stochastic environment, with the goal of devising a corralling algorithm that performs almost as well as the best base algorithm. We give two general algorithms for this setting, which we show benefit from favorable regret guarantees. We show that the regret of the corralling algorithms is no worse than that of the best algorithm containing the arm with the highest reward, and depends on the gap between the highest reward and other rewards.

Submitted 28 February, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

arXiv:2002.09609 [pdf, ps, other]

Private Stochastic Convex Optimization: Efficient Algorithms for Non-smooth Objectives

Authors: Raman Arora, Teodor V. Marinov, Enayat Ullah

Abstract: In this paper, we revisit the problem of private stochastic convex optimization. We propose an algorithm based on noisy mirror descent, which achieves optimal rates both in terms of statistical complexity and number of queries to a first-order stochastic oracle in the regime when the privacy parameter is inversely proportional to the number of samples.

Submitted 17 November, 2020; v1 submitted 21 February, 2020; originally announced February 2020.

arXiv:1907.12189 [pdf, ps, other]

Bandits with Feedback Graphs and Switching Costs

Authors: Raman Arora, Teodor V. Marinov, Mehryar Mohri

Abstract: We study the adversarial multi-armed bandit problem where partial observations are available and where, in addition to the loss incurred for each action, a \emph{switching cost} is incurred for shifting to a new action. All previously known results incur a factor proportional to the independence number of the feedback graph. We give a new algorithm whose regret guarantee depends only on the domination number of the graph. We further supplement that result with a lower bound. Finally, we also give a new algorithm with improved policy regret bounds when partial counterfactual feedback is available.

Submitted 22 March, 2020; v1 submitted 28 July, 2019; originally announced July 2019.

Comments: Camera ready from NeurIPS 2019, new algorithm and improved results in Section 3.2

arXiv:1811.04127 [pdf, ps, other]

Policy Regret in Repeated Games

Authors: Raman Arora, Michael Dinitz, Teodor V. Marinov, Mehryar Mohri

Abstract: The notion of \emph{policy regret} in online learning is a well-defined performance measure for the common scenario of adaptive adversaries, which more traditional quantities such as external regret do not take into account. We revisit the notion of policy regret and first show that there are online learning settings in which policy regret and external regret are incompatible: any sequence of play that achieves a favorable regret with respect to one definition must do poorly with respect to the other. We then focus on the game-theoretic setting where the adversary is a self-interested agent. In that setting, we show that external regret and policy regret are not in conflict and, in fact, that a wide class of algorithms can ensure a favorable regret with respect to both definitions, so long as the adversary is also using such an algorithm. We also show that the sequence of play of no-policy regret algorithms converges to a \emph{policy equilibrium}, a new notion of equilibrium that we introduce. Relating this back to external regret, we show that coarse correlated equilibria, which no-external regret players converge to, are a strict subset of policy equilibria. Thus, in game-theoretic settings, every sequence of play with no external regret also admits no policy regret, but the converse does not hold.

Submitted 22 March, 2020; v1 submitted 9 November, 2018; originally announced November 2018.

Comments: Camera ready from NeurIPS 2018; 25 pages; Slightly updated results and proofs for Section 3 and Section 4

arXiv:1808.00934 [pdf, other]

Streaming Kernel PCA with $\tilde{O}(\sqrt{n})$ Random Features

Authors: Enayat Ullah, Poorya Mianjy, Teodor V. Marinov, Raman Arora

Abstract: We study the statistical and computational aspects of kernel principal component analysis using random Fourier features and show that under mild assumptions, $O(\sqrt{n} \log n)$ features suffice to achieve $O(1/ε^2)$ sample complexity. Furthermore, we give a memory-efficient streaming algorithm based on the classical Oja's algorithm that achieves this rate.

Submitted 15 November, 2018; v1 submitted 2 August, 2018; originally announced August 2018.

Comments: Advances in Neural Information Processing Systems (NIPS), 2018. 42 pages, 3 figures

arXiv:1702.06818 [pdf, other]

Stochastic Approximation for Canonical Correlation Analysis

Authors: Raman Arora, Teodor V. Marinov, Poorya Mianjy, Nathan Srebro

Abstract: We propose novel first-order stochastic approximation algorithms for canonical correlation analysis (CCA). Algorithms presented are instances of inexact matrix stochastic gradient (MSG) and inexact matrix exponentiated gradient (MEG), and achieve $ε$-suboptimality in the population objective in $\operatorname{poly}(\frac{1}ε)$ iterations. We also consider practical variants of the proposed algorithms and compare them with other methods for CCA both theoretically and empirically.

Submitted 26 February, 2018; v1 submitted 22 February, 2017; originally announced February 2017.

Showing 1–14 of 14 results for author: Marinov, T V