-
Testing the Feasibility of Linear Programs with Bandit Feedback
Authors:
Aditya Gangrade,
Aditya Gopalan,
Venkatesh Saligrama,
Clayton Scott
Abstract:
While the recent literature has seen a surge in the study of constrained bandit problems, all existing methods for these begin by assuming the feasibility of the underlying problem. We initiate the study of testing such feasibility assumptions, and in particular address the problem in the linear bandit setting, thus characterising the costs of feasibility testing for an unknown linear program usin…
▽ More
While the recent literature has seen a surge in the study of constrained bandit problems, all existing methods for these begin by assuming the feasibility of the underlying problem. We initiate the study of testing such feasibility assumptions, and in particular address the problem in the linear bandit setting, thus characterising the costs of feasibility testing for an unknown linear program using bandit feedback. Concretely, we test if $\exists x: Ax \ge 0$ for an unknown $A \in \mathbb{R}^{m \times d}$, by playing a sequence of actions $x_t\in \mathbb{R}^d$, and observing $Ax_t + \mathrm{noise}$ in response. By identifying the hypothesis as determining the sign of the value of a minimax game, we construct a novel test based on low-regret algorithms and a nonasymptotic law of iterated logarithms. We prove that this test is reliable, and adapts to the `signal level,' $Γ,$ of any instance, with mean sample costs scaling as $\widetilde{O}(d^2/Γ^2)$. We complement this by a minimax lower bound of $Ω(d/Γ^2)$ for sample costs of reliable tests, dominating prior asymptotic lower bounds by capturing the dependence on $d$, and thus elucidating a basic insight missing in the extant literature on such problems.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
A Multilingual Virtual Guide for Self-Attachment Technique
Authors:
Alicia Jiayun Law,
Ruoyu Hu,
Lisa Alazraki,
Anandha Gopalan,
Neophytos Polydorou,
Abbas Edalat
Abstract:
In this work, we propose a computational framework that leverages existing out-of-language data to create a conversational agent for the delivery of Self-Attachment Technique (SAT) in Mandarin. Our framework does not require large-scale human translations, yet it achieves a comparable performance whilst also maintaining safety and reliability. We propose two different methods of augmenting availab…
▽ More
In this work, we propose a computational framework that leverages existing out-of-language data to create a conversational agent for the delivery of Self-Attachment Technique (SAT) in Mandarin. Our framework does not require large-scale human translations, yet it achieves a comparable performance whilst also maintaining safety and reliability. We propose two different methods of augmenting available response data through empathetic rewriting. We evaluate our chatbot against a previous, English-only SAT chatbot through non-clinical human trials (N=42), each lasting five days, and quantitatively show that we are able to attain a comparable level of performance to the English SAT chatbot. We provide qualitative analysis on the limitations of our study and suggestions with the aim of guiding future improvements.
△ Less
Submitted 25 October, 2023;
originally announced October 2023.
-
Bad Values but Good Behavior: Learning Highly Misspecified Bandits and MDPs
Authors:
Debangshu Banerjee,
Aditya Gopalan
Abstract:
Parametric, feature-based reward models are employed by a variety of algorithms in decision-making settings such as bandits and Markov decision processes (MDPs). The typical assumption under which the algorithms are analysed is realizability, i.e., that the true values of actions are perfectly explained by some parametric model in the class. We are, however, interested in the situation where the t…
▽ More
Parametric, feature-based reward models are employed by a variety of algorithms in decision-making settings such as bandits and Markov decision processes (MDPs). The typical assumption under which the algorithms are analysed is realizability, i.e., that the true values of actions are perfectly explained by some parametric model in the class. We are, however, interested in the situation where the true values are (significantly) misspecified with respect to the model class. For parameterized bandits, contextual bandits and MDPs, we identify structural conditions, depending on the problem instance and model class, under which basic algorithms such as $ε$-greedy, LinUCB and fitted Q-learning provably learn optimal policies under even highly misspecified models. This is in contrast to existing worst-case results for, say misspecified bandits, which show regret bounds that scale linearly with time, and shows that there can be a nontrivially large set of bandit instances that are robust to misspecification.
△ Less
Submitted 22 February, 2024; v1 submitted 13 October, 2023;
originally announced October 2023.
-
A Unified Framework for Discovering Discrete Symmetries
Authors:
Pavan Karjol,
Rohan Kashyap,
Aditya Gopalan,
Prathosh A. P
Abstract:
We consider the problem of learning a function respecting a symmetry from among a class of symmetries. We develop a unified framework that enables symmetry discovery across a broad range of subgroups including locally symmetric, dihedral and cyclic subgroups. At the core of the framework is a novel architecture composed of linear, matrix-valued and non-linear functions that expresses functions inv…
▽ More
We consider the problem of learning a function respecting a symmetry from among a class of symmetries. We develop a unified framework that enables symmetry discovery across a broad range of subgroups including locally symmetric, dihedral and cyclic subgroups. At the core of the framework is a novel architecture composed of linear, matrix-valued and non-linear functions that expresses functions invariant to these subgroups in a principled manner. The structure of the architecture enables us to leverage multi-armed bandit algorithms and gradient descent to efficiently optimize over the linear and the non-linear functions, respectively, and to infer the symmetry that is ultimately learnt. We also discuss the necessity of the matrix-valued functions in the architecture. Experiments on image-digit sum and polynomial regression tasks demonstrate the effectiveness of our approach.
△ Less
Submitted 27 October, 2023; v1 submitted 6 September, 2023;
originally announced September 2023.
-
Robust Traffic Light Detection Using Salience-Sensitive Loss: Computational Framework and Evaluations
Authors:
Ross Greer,
Akshay Gopalkrishnan,
Jacob Landgren,
Lulua Rakla,
Anish Gopalan,
Mohan Trivedi
Abstract:
One of the most important tasks for ensuring safe autonomous driving systems is accurately detecting road traffic lights and accurately determining how they impact the driver's actions. In various real-world driving situations, a scene may have numerous traffic lights with varying levels of relevance to the driver, and thus, distinguishing and detecting the lights that are relevant to the driver a…
▽ More
One of the most important tasks for ensuring safe autonomous driving systems is accurately detecting road traffic lights and accurately determining how they impact the driver's actions. In various real-world driving situations, a scene may have numerous traffic lights with varying levels of relevance to the driver, and thus, distinguishing and detecting the lights that are relevant to the driver and influence the driver's actions is a critical safety task. This paper proposes a traffic light detection model which focuses on this task by first defining salient lights as the lights that affect the driver's future decisions. We then use this salience property to construct the LAVA Salient Lights Dataset, the first US traffic light dataset with an annotated salience property. Subsequently, we train a Deformable DETR object detection transformer model using Salience-Sensitive Focal Loss to emphasize stronger performance on salient traffic lights, showing that a model trained with this loss function has stronger recall than one trained without.
△ Less
Submitted 8 May, 2023;
originally announced May 2023.
-
(Safe) SMART Hands: Hand Activity Analysis and Distraction Alerts Using a Multi-Camera Framework
Authors:
Ross Greer,
Lulua Rakla,
Anish Gopalan,
Mohan Trivedi
Abstract:
Manual (hand-related) activity is a significant source of crash risk while driving. Accordingly, analysis of hand position and hand activity occupation is a useful component to understanding a driver's readiness to take control of a vehicle. Visual sensing through cameras provides a passive means of observing the hands, but its effectiveness varies depending on camera location. We introduce an alg…
▽ More
Manual (hand-related) activity is a significant source of crash risk while driving. Accordingly, analysis of hand position and hand activity occupation is a useful component to understanding a driver's readiness to take control of a vehicle. Visual sensing through cameras provides a passive means of observing the hands, but its effectiveness varies depending on camera location. We introduce an algorithmic framework, SMART Hands, for accurate hand classification with an ensemble of camera views using machine learning. We illustrate the effectiveness of this framework in a 4-camera setup, reaching 98% classification accuracy on a variety of locations and held objects for both of the driver's hands. We conclude that this multi-camera framework can be extended to additional tasks such as gaze and pose analysis, with further applications in driver and passenger safety.
△ Less
Submitted 29 January, 2023; v1 submitted 14 January, 2023;
originally announced January 2023.
-
On the Minimax Regret for Linear Bandits in a wide variety of Action Spaces
Authors:
Debangshu Banerjee,
Aditya Gopalan
Abstract:
As noted in the works of \cite{lattimore2020bandit}, it has been mentioned that it is an open problem to characterize the minimax regret of linear bandits in a wide variety of action spaces. In this article we present an optimal regret lower bound for a wide class of convex action spaces.
As noted in the works of \cite{lattimore2020bandit}, it has been mentioned that it is an open problem to characterize the minimax regret of linear bandits in a wide variety of action spaces. In this article we present an optimal regret lower bound for a wide class of convex action spaces.
△ Less
Submitted 9 January, 2023;
originally announced January 2023.
-
Exploration in Linear Bandits with Rich Action Sets and its Implications for Inference
Authors:
Debangshu Banerjee,
Avishek Ghosh,
Sayak Ray Chowdhury,
Aditya Gopalan
Abstract:
We present a non-asymptotic lower bound on the eigenspectrum of the design matrix generated by any linear bandit algorithm with sub-linear regret when the action set has well-behaved curvature. Specifically, we show that the minimum eigenvalue of the expected design matrix grows as $Ω(\sqrt{n})$ whenever the expected cumulative regret of the algorithm is $O(\sqrt{n})$, where $n$ is the learning ho…
▽ More
We present a non-asymptotic lower bound on the eigenspectrum of the design matrix generated by any linear bandit algorithm with sub-linear regret when the action set has well-behaved curvature. Specifically, we show that the minimum eigenvalue of the expected design matrix grows as $Ω(\sqrt{n})$ whenever the expected cumulative regret of the algorithm is $O(\sqrt{n})$, where $n$ is the learning horizon, and the action-space has a constant Hessian around the optimal arm. This shows that such action-spaces force a polynomial lower bound rather than a logarithmic lower bound, as shown by \cite{lattimore2017end}, in discrete (i.e., well-separated) action spaces. Furthermore, while the previous result is shown to hold only in the asymptotic regime (as $n \to \infty$), our result for these "locally rich" action spaces is any-time. Additionally, under a mild technical assumption, we obtain a similar lower bound on the minimum eigen value holding with high probability.
We apply our result to two practical scenarios -- \emph{model selection} and \emph{clustering} in linear bandits. For model selection, we show that an epoch-based linear bandit algorithm adapts to the true model complexity at a rate exponential in the number of epochs, by virtue of our novel spectral bound. For clustering, we consider a multi agent framework where we show, by leveraging the spectral result, that no forced exploration is necessary -- the agents can run a linear bandit algorithm and estimate their underlying parameters at once, and hence incur a low regret.
△ Less
Submitted 7 January, 2023; v1 submitted 23 July, 2022;
originally announced July 2022.
-
Actor-Critic based Improper Reinforcement Learning
Authors:
Mohammadi Zaki,
Avinash Mohan,
Aditya Gopalan,
Shie Mannor
Abstract:
We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a…
▽ More
We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials.
Towards this, we propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic (AC) based scheme and a Natural Actor-Critic (NAC) scheme depending on the available information. Both algorithms operate over a class of improper mixtures of the given controllers. For the first case, we derive convergence rate guarantees assuming access to a gradient oracle. For the AC-based approach we provide convergence rate guarantees to a stationary point in the basic AC case and to a global optimum in the NAC case. Numerical results on (i) the standard control theoretic benchmark of stabilizing an cartpole; and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable.
△ Less
Submitted 19 July, 2022;
originally announced July 2022.
-
Demystifying Approximate Value-based RL with $ε$-greedy Exploration: A Differential Inclusion View
Authors:
Aditya Gopalan,
Gugan Thoppe
Abstract:
Q-learning and SARSA with $ε$-greedy exploration are leading reinforcement learning methods. Their tabular forms converge to the optimal Q-function under reasonable conditions. However, with function approximation, these methods exhibit strange behaviors such as policy oscillation, chattering, and convergence to different attractors (possibly even the worst policy) on different runs, apart from th…
▽ More
Q-learning and SARSA with $ε$-greedy exploration are leading reinforcement learning methods. Their tabular forms converge to the optimal Q-function under reasonable conditions. However, with function approximation, these methods exhibit strange behaviors such as policy oscillation, chattering, and convergence to different attractors (possibly even the worst policy) on different runs, apart from the usual instability. A theory to explain these phenomena has been a long-standing open problem, even for basic linear function approximation (Sutton, 1999). Our work uses differential inclusion to provide the first framework for resolving this problem. We also provide numerical examples to illustrate our framework's prowess in explaining these algorithms' behaviors.
△ Less
Submitted 10 February, 2023; v1 submitted 26 May, 2022;
originally announced May 2022.
-
Adaptive Estimation of Random Vectors with Bandit Feedback: A mean-squared error viewpoint
Authors:
Dipayan Sen,
L. A. Prashanth,
Aditya Gopalan
Abstract:
We consider the problem of sequentially learning to estimate, in the mean squared error (MSE) sense, a Gaussian $K$-vector of unknown covariance by observing only $m < K$ of its entries in each round. We first establish a concentration bound for MSE estimation. We then frame the estimation problem with bandit feedback, and propose a variant of the successive elimination algorithm. We also derive a…
▽ More
We consider the problem of sequentially learning to estimate, in the mean squared error (MSE) sense, a Gaussian $K$-vector of unknown covariance by observing only $m < K$ of its entries in each round. We first establish a concentration bound for MSE estimation. We then frame the estimation problem with bandit feedback, and propose a variant of the successive elimination algorithm. We also derive a minimax lower bound to understand the fundamental limit on the sample complexity of this problem.
△ Less
Submitted 11 January, 2024; v1 submitted 31 March, 2022;
originally announced March 2022.
-
Bregman Deviations of Generic Exponential Families
Authors:
Sayak Ray Chowdhury,
Patrick Saux,
Odalric-Ambrym Maillard,
Aditya Gopalan
Abstract:
We revisit the method of mixture technique, also known as the Laplace method, to study the concentration phenomenon in generic exponential families. Combining the properties of Bregman divergence associated with log-partition function of the family with the method of mixtures for super-martingales, we establish a generic bound controlling the Bregman divergence between the parameter of the family…
▽ More
We revisit the method of mixture technique, also known as the Laplace method, to study the concentration phenomenon in generic exponential families. Combining the properties of Bregman divergence associated with log-partition function of the family with the method of mixtures for super-martingales, we establish a generic bound controlling the Bregman divergence between the parameter of the family and a finite sample estimate of the parameter. Our bound is time-uniform and makes appear a quantity extending the classical information gain to exponential families, which we call the Bregman information gain. For the practitioner, we instantiate this novel bound to several classical families, e.g., Gaussian, Bernoulli, Exponential, Weibull, Pareto, Poisson and Chi-square yielding explicit forms of the confidence sets and the Bregman information gain. We further numerically compare the resulting confidence bounds to state-of-the-art alternatives for time-uniform concentration and show that this novel method yields competitive results. Finally, we highlight the benefit of our concentration bounds on some illustrative applications.
△ Less
Submitted 13 July, 2023; v1 submitted 18 January, 2022;
originally announced January 2022.
-
CO-STAR: Conceptualisation of Stereotypes for Analysis and Reasoning
Authors:
Teyun Kwon,
Anandha Gopalan
Abstract:
Warning: this paper contains material which may be offensive or upsetting.
While much of recent work has focused on the detection of hate speech and overtly offensive content, very little research has explored the more subtle but equally harmful language in the form of implied stereotypes. This is a challenging domain, made even more so by the fact that humans often struggle to understand and re…
▽ More
Warning: this paper contains material which may be offensive or upsetting.
While much of recent work has focused on the detection of hate speech and overtly offensive content, very little research has explored the more subtle but equally harmful language in the form of implied stereotypes. This is a challenging domain, made even more so by the fact that humans often struggle to understand and reason about stereotypes. We build on existing literature and present CO-STAR (COnceptualisation of STereotypes for Analysis and Reasoning), a novel framework which encodes the underlying concepts of implied stereotypes. We also introduce the CO-STAR training data set, which contains just over 12K structured annotations of implied stereotypes and stereotype conceptualisations, and achieve state-of-the-art results after training and manual evaluation. The CO-STAR models are, however, limited in their ability to understand more complex and subtly worded stereotypes, and our research motivates future work in developing models with more sophisticated methods for encoding common-sense knowledge.
△ Less
Submitted 1 December, 2021;
originally announced December 2021.
-
On Slowly-varying Non-stationary Bandits
Authors:
Ramakrishnan Krishnamurthy,
Aditya Gopalan
Abstract:
We consider minimisation of dynamic regret in non-stationary bandits with a slowly varying property. Namely, we assume that arms' rewards are stochastic and independent over time, but that the absolute difference between the expected rewards of any arm at any two consecutive time-steps is at most a drift limit $δ> 0$. For this setting that has not received enough attention in the past, we give a n…
▽ More
We consider minimisation of dynamic regret in non-stationary bandits with a slowly varying property. Namely, we assume that arms' rewards are stochastic and independent over time, but that the absolute difference between the expected rewards of any arm at any two consecutive time-steps is at most a drift limit $δ> 0$. For this setting that has not received enough attention in the past, we give a new algorithm which extends naturally the well-known Successive Elimination algorithm to the non-stationary bandit setting. We establish the first instance-dependent regret upper bound for slowly varying non-stationary bandits. The analysis in turn relies on a novel characterization of the instance as a detectable gap profile that depends on the expected arm reward differences. We also provide the first minimax regret lower bound for this problem, enabling us to show that our algorithm is essentially minimax optimal. Also, this lower bound we obtain matches that of the more general total variation-budgeted bandits problem, establishing that the seemingly easier former problem is at least as hard as the more general latter problem in the minimax sense. We complement our theoretical results with experimental illustrations.
△ Less
Submitted 25 October, 2021;
originally announced October 2021.
-
Data Flow Dissemination in a Network
Authors:
Aditya Gopalan,
Alexander Stolyar
Abstract:
We consider the following network model motivated, in particular, by blockchains and peer-to-peer live streaming. Data packet flows arrive at the network nodes and need to be disseminated to all other nodes. Packets are relayed through the network via links of finite capacity. A packet leaves the network when it is disseminated to all nodes. Our focus is on two communication disciplines, which det…
▽ More
We consider the following network model motivated, in particular, by blockchains and peer-to-peer live streaming. Data packet flows arrive at the network nodes and need to be disseminated to all other nodes. Packets are relayed through the network via links of finite capacity. A packet leaves the network when it is disseminated to all nodes. Our focus is on two communication disciplines, which determine the order in which packets are transmitted over each link, namely {\em Random-Useful} (RU) and {\em Oldest-Useful} (OU). We show that RU has the maximum stability region in a general network. For the OU we demonstrate that, somewhat surprisingly, it does {\em not} in general have the maximum stability region. We prove that OU does achieve maximum stability in the important special case of a symmetric network, given by the full graph with equal capacities on all links and equal arrival rates at all nodes. We also give other stability results, and compare different disciplines' performances in a symmetric system via simulation. Finally, we study the cumulative delays experienced by a packet as it propagates through the symmetric system, specifically the delay asymptotic behavior as $N \to \infty$. We put forward some conjectures about this behavior, supported by heuristic arguments and simulation experiments.
△ Less
Submitted 5 September, 2023; v1 submitted 18 October, 2021;
originally announced October 2021.
-
Bandit Quickest Changepoint Detection
Authors:
Aditya Gopalan,
Venkatesh Saligrama,
Braghadeesh Lakshminarayanan
Abstract:
Many industrial and security applications employ a suite of sensors for detecting abrupt changes in temporal behavior patterns. These abrupt changes typically manifest locally, rendering only a small subset of sensors informative. Continuous monitoring of every sensor can be expensive due to resource constraints, and serves as a motivation for the bandit quickest changepoint detection problem, whe…
▽ More
Many industrial and security applications employ a suite of sensors for detecting abrupt changes in temporal behavior patterns. These abrupt changes typically manifest locally, rendering only a small subset of sensors informative. Continuous monitoring of every sensor can be expensive due to resource constraints, and serves as a motivation for the bandit quickest changepoint detection problem, where sensing actions (or sensors) are sequentially chosen, and only measurements corresponding to chosen actions are observed. We derive an information-theoretic lower bound on the detection delay for a general class of finitely parameterized probability distributions. We then propose a computationally efficient online sensing scheme, which seamlessly balances the need for exploration of different sensing options with exploitation of querying informative actions. We derive expected delay bounds for the proposed scheme and show that these bounds match our information-theoretic lower bounds at low false alarm rates, establishing optimality of the proposed method. We then perform a number of experiments on synthetic and real datasets demonstrating the effectiveness of our proposed method.
△ Less
Submitted 13 June, 2023; v1 submitted 22 July, 2021;
originally announced July 2021.
-
CARLS: Cross-platform Asynchronous Representation Learning System
Authors:
Chun-Ta Lu,
Yun Zeng,
Da-Cheng Juan,
Yicheng Fan,
Zhe Li,
Jan Dlabal,
Yi-Ting Chen,
Arjun Gopalan,
Allan Heydon,
Chun-Sung Ferng,
Reah Miyara,
Ariel Fuxman,
Futang Peng,
Zhen Li,
Tom Duerig,
Andrew Tomkins
Abstract:
In this work, we propose CARLS, a novel framework for augmenting the capacity of existing deep learning frameworks by enabling multiple components -- model trainers, knowledge makers and knowledge banks -- to concertedly work together in an asynchronous fashion across hardware platforms. The proposed CARLS is particularly suitable for learning paradigms where model training benefits from additiona…
▽ More
In this work, we propose CARLS, a novel framework for augmenting the capacity of existing deep learning frameworks by enabling multiple components -- model trainers, knowledge makers and knowledge banks -- to concertedly work together in an asynchronous fashion across hardware platforms. The proposed CARLS is particularly suitable for learning paradigms where model training benefits from additional knowledge inferred or discovered during training, such as node embeddings for graph neural networks or reliable pseudo labels from model predictions. We also describe three learning paradigms -- semi-supervised learning, curriculum learning and multimodal learning -- as examples that can be scaled up efficiently by CARLS. One version of CARLS has been open-sourced and available for download at: https://github.com/tensorflow/neural-structured-learning/tree/master/research/carls
△ Less
Submitted 26 May, 2021;
originally announced May 2021.
-
Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling
Authors:
Mohammani Zaki,
Avi Mohan,
Aditya Gopalan,
Shie Mannor
Abstract:
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay. Modern communication systems are becoming increasingly complex, and are required to handle multiple types of traffic with widely varying characteristics such as arrival rates and service times. This, coupled with the need for rapid network deployment, render a bottom up approach of first…
▽ More
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay. Modern communication systems are becoming increasingly complex, and are required to handle multiple types of traffic with widely varying characteristics such as arrival rates and service times. This, coupled with the need for rapid network deployment, render a bottom up approach of first characterizing the traffic and then devising an appropriate scheduling protocol infeasible.
In contrast, we formulate a top down approach to scheduling where, given an unknown network and a set of scheduling policies, we use a policy gradient based reinforcement learning algorithm that produces a scheduler that performs better than the available atomic policies. We derive convergence results and analyze finite time performance of the algorithm. Simulation results show that the algorithm performs well even when the arrival rates are nonstationary and can stabilize the system even when the constituent policies are unstable.
△ Less
Submitted 1 May, 2021;
originally announced May 2021.
-
Improper Reinforcement Learning with Gradient-based Policy Optimization
Authors:
Mohammadi Zaki,
Avinash Mohan,
Aditya Gopalan,
Shie Mannor
Abstract:
We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a…
▽ More
We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials.
\par We propose a gradient-based approach that operates over a class of improper mixtures of the controllers. We derive convergence rate guarantees for the approach assuming access to a gradient oracle. The value function of the mixture and its gradient may not be available in closed-form; however, we show that we can employ rollouts and simultaneous perturbation stochastic approximation (SPSA) for explicit gradient descent optimization. Numerical results on (i) the standard control theoretic benchmark of stabilizing an inverted pendulum and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable\footnote{Under review. Please do not distribute.}.
△ Less
Submitted 3 July, 2021; v1 submitted 16 February, 2021;
originally announced February 2021.
-
Graph Autoencoders with Deconvolutional Networks
Authors:
Jia Li,
Tomas Yu,
Da-Cheng Juan,
Arjun Gopalan,
Hong Cheng,
Andrew Tomkins
Abstract:
Recent studies have indicated that Graph Convolutional Networks (GCNs) act as a \emph{low pass} filter in spectral domain and encode smoothed node representations. In this paper, we consider their opposite, namely Graph Deconvolutional Networks (GDNs) that reconstruct graph signals from smoothed node representations. We motivate the design of Graph Deconvolutional Networks via a combination of inv…
▽ More
Recent studies have indicated that Graph Convolutional Networks (GCNs) act as a \emph{low pass} filter in spectral domain and encode smoothed node representations. In this paper, we consider their opposite, namely Graph Deconvolutional Networks (GDNs) that reconstruct graph signals from smoothed node representations. We motivate the design of Graph Deconvolutional Networks via a combination of inverse filters in spectral domain and de-noising layers in wavelet domain, as the inverse operation results in a \emph{high pass} filter and may amplify the noise. Based on the proposed GDN, we further propose a graph autoencoder framework that first encodes smoothed graph representations with GCN and then decodes accurate graph signals with GDN. We demonstrate the effectiveness of the proposed method on several tasks including unsupervised graph-level representation , social recommendation and graph generation
△ Less
Submitted 22 December, 2020;
originally announced December 2020.
-
Stochastic Linear Bandits with Protected Subspace
Authors:
Advait Parulekar,
Soumya Basu,
Aditya Gopalan,
Karthikeyan Shanmugam,
Sanjay Shakkottai
Abstract:
We study a variant of the stochastic linear bandit problem wherein we optimize a linear objective function but rewards are accrued only orthogonal to an unknown subspace (which we interpret as a \textit{protected space}) given only zero-order stochastic oracle access to both the objective itself and protected subspace. In particular, at each round, the learner must choose whether to query the obje…
▽ More
We study a variant of the stochastic linear bandit problem wherein we optimize a linear objective function but rewards are accrued only orthogonal to an unknown subspace (which we interpret as a \textit{protected space}) given only zero-order stochastic oracle access to both the objective itself and protected subspace. In particular, at each round, the learner must choose whether to query the objective or the protected subspace alongside choosing an action. Our algorithm, derived from the OFUL principle, uses some of the queries to get an estimate of the protected space, and (in almost all rounds) plays optimistically with respect to a confidence set for this space. We provide a $\tilde{O}(sd\sqrt{T})$ regret upper bound in the case where the action space is the complete unit ball in $\mathbb{R}^d$, $s < d$ is the dimension of the protected subspace, and $T$ is the time horizon. Moreover, we demonstrate that a discrete action space can lead to linear regret with an optimistic algorithm, reinforcing the sub-optimality of optimism in certain settings. We also show that protection constraints imply that for certain settings, no consistent algorithm can have a regret smaller than $Ω(T^{3/4}).$ We finally empirically validate our results with synthetic and real datasets.
△ Less
Submitted 1 March, 2021; v1 submitted 2 November, 2020;
originally announced November 2020.
-
No-regret Algorithms for Multi-task Bayesian Optimization
Authors:
Sayak Ray Chowdhury,
Aditya Gopalan
Abstract:
We consider multi-objective optimization (MOO) of an unknown vector-valued function in the non-parametric Bayesian optimization (BO) setting, with the aim being to learn points on the Pareto front of the objectives. Most existing BO algorithms do not model the fact that the multiple objectives, or equivalently, tasks can share similarities, and even the few that do lack rigorous, finite-time regre…
▽ More
We consider multi-objective optimization (MOO) of an unknown vector-valued function in the non-parametric Bayesian optimization (BO) setting, with the aim being to learn points on the Pareto front of the objectives. Most existing BO algorithms do not model the fact that the multiple objectives, or equivalently, tasks can share similarities, and even the few that do lack rigorous, finite-time regret guarantees that capture explicitly inter-task structure. In this work, we address this problem by modelling inter-task dependencies using a multi-task kernel and develop two novel BO algorithms based on random scalarizations of the objectives. Our algorithms employ vector-valued kernel regression as a stepping stone and belong to the upper confidence bound class of algorithms. Under a smoothness assumption that the unknown vector-valued function is an element of the reproducing kernel Hilbert space associated with the multi-task kernel, we derive worst-case regret bounds for our algorithms that explicitly capture the similarities between tasks. We numerically benchmark our algorithms on both synthetic and real-life MOO problems, and show the advantages offered by learning with multi-task kernels.
△ Less
Submitted 20 August, 2020;
originally announced August 2020.
-
Sequential Multi-hypothesis Testing in Multi-armed Bandit Problems:An Approach for Asymptotic Optimality
Authors:
Gayathri R Prabhu,
Srikrishna Bhashyam,
Aditya Gopalan,
Rajesh Sundaresan
Abstract:
We consider a multi-hypothesis testing problem involving a K-armed bandit. Each arm's signal follows a distribution from a vector exponential family. The actual parameters of the arms are unknown to the decision maker. The decision maker incurs a delay cost for delay until a decision and a switching cost whenever he switches from one arm to another. His goal is to minimise the overall cost until a…
▽ More
We consider a multi-hypothesis testing problem involving a K-armed bandit. Each arm's signal follows a distribution from a vector exponential family. The actual parameters of the arms are unknown to the decision maker. The decision maker incurs a delay cost for delay until a decision and a switching cost whenever he switches from one arm to another. His goal is to minimise the overall cost until a decision is reached on the true hypothesis. Of interest are policies that satisfy a given constraint on the probability of false detection. This is a sequential decision making problem where the decision maker gets only a limited view of the true state of nature at each stage, but can control his view by choosing the arm to observe at each stage. An information-theoretic lower bound on the total cost (expected time for a reliable decision plus total switching cost) is first identified, and a variation on a sequential policy based on the generalised likelihood ratio statistic is then studied. Due to the vector exponential family assumption, the signal processing at each stage is simple; the associated conjugate prior distribution on the unknown model parameters enables easy updates of the posterior distribution. The proposed policy, with a suitable threshold for stopping, is shown to satisfy the given constraint on the probability of false detection. Under a continuous selection assumption, the policy is also shown to be asymptotically optimal in terms of the total cost among all policies that satisfy the constraint on the probability of false detection.
△ Less
Submitted 28 July, 2020; v1 submitted 25 July, 2020;
originally announced July 2020.
-
Explicit Best Arm Identification in Linear Bandits Using No-Regret Learners
Authors:
Mohammadi Zaki,
Avi Mohan,
Aditya Gopalan
Abstract:
We study the problem of best arm identification in linearly parameterised multi-armed bandits. Given a set of feature vectors $\mathcal{X}\subset\mathbb{R}^d,$ a confidence parameter $δ$ and an unknown vector $θ^*,$ the goal is to identify $\arg\max_{x\in\mathcal{X}}x^Tθ^*$, with probability at least $1-δ,$ using noisy measurements of the form $x^Tθ^*.$ For this fixed confidence ($δ$-PAC) setting,…
▽ More
We study the problem of best arm identification in linearly parameterised multi-armed bandits. Given a set of feature vectors $\mathcal{X}\subset\mathbb{R}^d,$ a confidence parameter $δ$ and an unknown vector $θ^*,$ the goal is to identify $\arg\max_{x\in\mathcal{X}}x^Tθ^*$, with probability at least $1-δ,$ using noisy measurements of the form $x^Tθ^*.$ For this fixed confidence ($δ$-PAC) setting, we propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem. Previous approaches rely on access to minimax optimization oracles. The algorithm, which we call the \textit{Phased Elimination Linear Exploration Game} (PELEG), maintains a high-probability confidence ellipsoid containing $θ^*$ in each round and uses it to eliminate suboptimal arms in phases. PELEG achieves fast shrinkage of this confidence ellipsoid along the most confusing (i.e., close to, but not optimal) directions by interpreting the problem as a two player zero-sum game, and sequentially converging to its saddle point using low-regret learners to compute players' strategies in each round. We analyze the sample complexity of PELEG and show that it matches, up to order, an instance-dependent lower bound on sample complexity in the linear bandit setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees.
△ Less
Submitted 13 June, 2020;
originally announced June 2020.
-
How Reliable are Test Numbers for Revealing the COVID-19 Ground Truth and Applying Interventions?
Authors:
Aditya Gopalan,
Himanshu Tyagi
Abstract:
The number of confirmed cases of COVID-19 is often used as a proxy for the actual number of ground truth COVID-19 infected cases in both public discourse and policy making. However, the number of confirmed cases depends on the testing policy, and it is important to understand how the number of positive cases obtained using different testing policies reveals the unknown ground truth. We develop an…
▽ More
The number of confirmed cases of COVID-19 is often used as a proxy for the actual number of ground truth COVID-19 infected cases in both public discourse and policy making. However, the number of confirmed cases depends on the testing policy, and it is important to understand how the number of positive cases obtained using different testing policies reveals the unknown ground truth. We develop an agent-based simulation framework in Python that can simulate various testing policies as well as interventions such as lockdown based on them. The interaction between the agents can take into account various communities and mobility patterns. A distinguishing feature of our framework is the presence of another `flu'-like illness with symptoms similar to COVID-19, that allows us to model the noise in selecting the pool of patients to be tested. We instantiate our model for the city of Bengaluru in India, using census data to distribute agents geographically, and traffic flow mobility data to model long-distance interactions and mixing. We use the simulation framework to compare the performance of three testing policies: Random Symptomatic Testing (RST), Contact Tracing (CT), and a new Location Based Testing policy (LBT). We observe that if a sufficient fraction of symptomatic patients come out for testing, then RST can capture the ground truth quite closely even with very few daily tests. However, CT consistently captures more positive cases. Interestingly, our new LBT, which is operationally less intensive than CT, gives performance that is comparable with CT. In another direction, we compare the efficacy of these three testing policies in enabling lockdown, and observe that CT flattens the ground truth curve maximally, followed closely by LBT, and significantly better than RST.
△ Less
Submitted 24 April, 2020;
originally announced April 2020.
-
Regret Minimization in Stochastic Contextual Dueling Bandits
Authors:
Aadirupa Saha,
Aditya Gopalan
Abstract:
We consider the problem of stochastic $K$-armed dueling bandit in the contextual setting, where at each round the learner is presented with a context set of $K$ items, each represented by a $d$-dimensional feature vector, and the goal of the learner is to identify the best arm of each context sets. However, unlike the classical contextual bandit setup, our framework only allows the learner to rece…
▽ More
We consider the problem of stochastic $K$-armed dueling bandit in the contextual setting, where at each round the learner is presented with a context set of $K$ items, each represented by a $d$-dimensional feature vector, and the goal of the learner is to identify the best arm of each context sets. However, unlike the classical contextual bandit setup, our framework only allows the learner to receive item feedback in terms of their (noisy) pariwise preferences--famously studied as dueling bandits which is practical interests in various online decision making scenarios, e.g. recommender systems, information retrieval, tournament ranking, where it is easier to elicit the relative strength of the items instead of their absolute scores. However, to the best of our knowledge this work is the first to consider the problem of regret minimization of contextual dueling bandits for potentially infinite decision spaces and gives provably optimal algorithms along with a matching lower bound analysis. We present two algorithms for the setup with respective regret guarantees $\tilde O(d\sqrt{T})$ and $\tilde O(\sqrt{dT \log K})$. Subsequently we also show that $Ω(\sqrt {dT})$ is actually the fundamental performance limit for this problem, implying the optimality of our second algorithm. However the analysis of our first algorithm is comparatively simpler, and it is often shown to outperform the former empirically. Finally, we corroborate all the theoretical results with suitable experiments.
△ Less
Submitted 8 May, 2021; v1 submitted 20 February, 2020;
originally announced February 2020.
-
Throughput Optimal Decentralized Scheduling with Single-bit State Feedback for a Class of Queueing Systems
Authors:
Avinash Mohan,
Aditya Gopalan,
Anurag Kumar
Abstract:
Motivated by medium access control for resource-challenged wireless Internet of Things (IoT), we consider the problem of queue scheduling with reduced queue state information. In particular, we consider a time-slotted scheduling model with $N$ sensor nodes, with pair-wise dependence, such that Nodes $i$ and $i + 1,~0 < i < N$ cannot transmit together. We develop new throughput-optimal scheduling p…
▽ More
Motivated by medium access control for resource-challenged wireless Internet of Things (IoT), we consider the problem of queue scheduling with reduced queue state information. In particular, we consider a time-slotted scheduling model with $N$ sensor nodes, with pair-wise dependence, such that Nodes $i$ and $i + 1,~0 < i < N$ cannot transmit together. We develop new throughput-optimal scheduling policies requiring only the empty-nonempty state of each queue that we term Queue Nonemptiness-Based (QNB) policies. We propose a Policy Splicing technique to combine scheduling policies for small networks in order to construct throughput-optimal policies for larger networks, some of which also aim for low delay. For $N = 3,$ there exists a sum-queue length optimal QNB scheduling policy. We show, however, that for $N > 4,$ there is no QNB policy that is sum-queue length optimal over all arrival rate vectors in the capacity region.
We then extend our results to a more general class of interference constraints that we call cluster-of-cliques (CoC) conflict graphs. We consider two types of CoC networks, namely, Linear Arrays of Cliques (LAoC) and Star-of-Cliques (SoC) networks. We develop QNB policies for these classes of networks, study their stability and delay properties, and propose and analyze techniques to reduce the amount of state information to be disseminated across the network for scheduling. In the SoC setting, we propose a throughput-optimal policy that only uses information that nodes in the network can glean by sensing activity (or lack thereof) on the channel. Our throughput-optimality results rely on two new arguments: a Lyapunov drift lemma specially adapted to policies that are queue length-agnostic, and a priority queueing analysis for showing strong stability.
△ Less
Submitted 19 February, 2020;
originally announced February 2020.
-
Best-item Learning in Random Utility Models with Subset Choices
Authors:
Aadirupa Saha,
Aditya Gopalan
Abstract:
We consider the problem of PAC learning the most valuable item from a pool of $n$ items using sequential, adaptively chosen plays of subsets of $k$ items, when, upon playing a subset, the learner receives relative feedback sampled according to a general Random Utility Model (RUM) with independent noise perturbations to the latent item utilities. We identify a new property of such a RUM, termed the…
▽ More
We consider the problem of PAC learning the most valuable item from a pool of $n$ items using sequential, adaptively chosen plays of subsets of $k$ items, when, upon playing a subset, the learner receives relative feedback sampled according to a general Random Utility Model (RUM) with independent noise perturbations to the latent item utilities. We identify a new property of such a RUM, termed the minimum advantage, that helps in characterizing the complexity of separating pairs of items based on their relative win/loss empirical counts, and can be bounded as a function of the noise distribution alone. We give a learning algorithm for general RUMs, based on pairwise relative counts of items and hierarchical elimination, along with a new PAC sample complexity guarantee of $O(\frac{n}{c^2ε^2} \log \frac{k}δ)$ rounds to identify an $ε$-optimal item with confidence $1-δ$, when the worst case pairwise advantage in the RUM has sensitivity at least $c$ to the parameter gaps of items. Fundamental lower bounds on PAC sample complexity show that this is near-optimal in terms of its dependence on $n,k$ and $c$.
△ Less
Submitted 18 February, 2020;
originally announced February 2020.
-
Stability and Scalability of Blockchain Systems
Authors:
Aditya Gopalan,
Abishek Sankararaman,
Anwar Walid,
Sriram Vishwanath
Abstract:
The blockchain paradigm provides a mechanism for content dissemination and distributed consensus on Peer-to-Peer (P2P) networks. While this paradigm has been widely adopted in industry, it has not been carefully analyzed in terms of its network scaling with respect to the number of peers. Applications for blockchain systems, such as cryptocurrencies and IoT, require this form of network scaling.…
▽ More
The blockchain paradigm provides a mechanism for content dissemination and distributed consensus on Peer-to-Peer (P2P) networks. While this paradigm has been widely adopted in industry, it has not been carefully analyzed in terms of its network scaling with respect to the number of peers. Applications for blockchain systems, such as cryptocurrencies and IoT, require this form of network scaling.
In this paper, we propose a new stochastic network model for a blockchain system. We identify a structural property called \emph{one-endedness}, which we show to be desirable in any blockchain system as it is directly related to distributed consensus among the peers. We show that the stochastic stability of the network is sufficient for the one-endedness of a blockchain. We further establish that our model belongs to a class of network models, called monotone separable models. This allows us to establish upper and lower bounds on the stability region. The bounds on stability depend on the connectivity of the P2P network through its conductance and allow us to analyze the scalability of blockchain systems on large P2P networks. We verify our theoretical insights using both synthetic data and real data from the Bitcoin network.
△ Less
Submitted 18 December, 2020; v1 submitted 6 February, 2020;
originally announced February 2020.
-
Sequential Mode Estimation with Oracle Queries
Authors:
Dhruti Shah,
Tuhinangshu Choudhury,
Nikhil Karamchandani,
Aditya Gopalan
Abstract:
We consider the problem of adaptively PAC-learning a probability distribution $\mathcal{P}$'s mode by querying an oracle for information about a sequence of i.i.d. samples $X_1, X_2, \ldots$ generated from $\mathcal{P}$. We consider two different query models: (a) each query is an index $i$ for which the oracle reveals the value of the sample $X_i$, (b) each query is comprised of two indices $i$ a…
▽ More
We consider the problem of adaptively PAC-learning a probability distribution $\mathcal{P}$'s mode by querying an oracle for information about a sequence of i.i.d. samples $X_1, X_2, \ldots$ generated from $\mathcal{P}$. We consider two different query models: (a) each query is an index $i$ for which the oracle reveals the value of the sample $X_i$, (b) each query is comprised of two indices $i$ and $j$ for which the oracle reveals if the samples $X_i$ and $X_j$ are the same or not. For these query models, we give sequential mode-estimation algorithms which, at each time $t$, either make a query to the corresponding oracle based on past observations, or decide to stop and output an estimate for the distribution's mode, required to be correct with a specified confidence. We analyze the query complexity of these algorithms for any underlying distribution $\mathcal{P}$, and derive corresponding lower bounds on the optimal query complexity under the two querying models.
△ Less
Submitted 19 November, 2019;
originally announced November 2019.
-
On Online Learning in Kernelized Markov Decision Processes
Authors:
Sayak Ray Chowdhury,
Aditya Gopalan
Abstract:
We develop algorithms with low regret for learning episodic Markov decision processes based on kernel approximation techniques. The algorithms are based on both the Upper Confidence Bound (UCB) as well as Posterior or Thompson Sampling (PSRL) philosophies, and work in the general setting of continuous state and action spaces when the true unknown transition dynamics are assumed to have smoothness…
▽ More
We develop algorithms with low regret for learning episodic Markov decision processes based on kernel approximation techniques. The algorithms are based on both the Upper Confidence Bound (UCB) as well as Posterior or Thompson Sampling (PSRL) philosophies, and work in the general setting of continuous state and action spaces when the true unknown transition dynamics are assumed to have smoothness induced by an appropriate Reproducing Kernel Hilbert Space (RKHS).
△ Less
Submitted 4 November, 2019;
originally announced November 2019.
-
Towards Optimal and Efficient Best Arm Identification in Linear Bandits
Authors:
Mohammadi Zaki,
Avinash Mohan,
Aditya Gopalan
Abstract:
We give a new algorithm for best arm identification in linearly parameterised bandits in the fixed confidence setting. The algorithm generalises the well-known LUCB algorithm of Kalyanakrishnan et al. (2012) by playing an arm which minimises a suitable notion of geometric overlap of the statistical confidence set for the unknown parameter, and is fully adaptive and computationally efficient as com…
▽ More
We give a new algorithm for best arm identification in linearly parameterised bandits in the fixed confidence setting. The algorithm generalises the well-known LUCB algorithm of Kalyanakrishnan et al. (2012) by playing an arm which minimises a suitable notion of geometric overlap of the statistical confidence set for the unknown parameter, and is fully adaptive and computationally efficient as compared to several state-of-the methods. We theoretically analyse the sample complexity of the algorithm for problems with two and three arms, showing optimality in many cases. Numerical results indicate favourable performance over other algorithms with which we compare.
△ Less
Submitted 7 November, 2019; v1 submitted 5 November, 2019;
originally announced November 2019.
-
On Batch Bayesian Optimization
Authors:
Sayak Ray Chowdhury,
Aditya Gopalan
Abstract:
We present two algorithms for Bayesian optimization in the batch feedback setting, based on Gaussian process upper confidence bound and Thompson sampling approaches, along with frequentist regret guarantees and numerical results.
We present two algorithms for Bayesian optimization in the batch feedback setting, based on Gaussian process upper confidence bound and Thompson sampling approaches, along with frequentist regret guarantees and numerical results.
△ Less
Submitted 4 November, 2019;
originally announced November 2019.
-
On Adaptivity in Information-constrained Online Learning
Authors:
Siddharth Mitra,
Aditya Gopalan
Abstract:
We study how to adapt to smoothly-varying ('easy') environments in well-known online learning problems where acquiring information is expensive. For the problem of label efficient prediction, which is a budgeted version of prediction with expert advice, we present an online algorithm whose regret depends optimally on the number of labels allowed and $Q^*$ (the quadratic variation of the losses of…
▽ More
We study how to adapt to smoothly-varying ('easy') environments in well-known online learning problems where acquiring information is expensive. For the problem of label efficient prediction, which is a budgeted version of prediction with expert advice, we present an online algorithm whose regret depends optimally on the number of labels allowed and $Q^*$ (the quadratic variation of the losses of the best action in hindsight), along with a parameter-free counterpart whose regret depends optimally on $Q$ (the quadratic variation of the losses of all the actions). These quantities can be significantly smaller than $T$ (the total time horizon), yielding an improvement over existing, variation-independent results for the problem. We then extend our analysis to handle label efficient prediction with bandit feedback, i.e., label efficient bandits. Our work builds upon the framework of optimistic online mirror descent, and leverages second order corrections along with a carefully designed hybrid regularizer that encodes the constrained information structure of the problem. We then consider revealing action-partial monitoring games -- a version of label efficient prediction with additive information costs, which in general are known to lie in the \textit{hard} class of games having minimax regret of order $T^{\frac{2}{3}}$. We provide a strategy with an $\mathcal{O}((Q^*T)^{\frac{1}{3}})$ bound for revealing action games, along with one with a $\mathcal{O}((QT)^{\frac{1}{3}})$ bound for the full class of hard partial monitoring games, both being strict improvements over current bounds.
△ Less
Submitted 6 December, 2019; v1 submitted 19 October, 2019;
originally announced October 2019.
-
Bayesian Optimization under Heavy-tailed Payoffs
Authors:
Sayak Ray Chowdhury,
Aditya Gopalan
Abstract:
We consider black box optimization of an unknown function in the nonparametric Gaussian process setting when the noise in the observed function values can be heavy tailed. This is in contrast to existing literature that typically assumes sub-Gaussian noise distributions for queries. Under the assumption that the unknown function belongs to the Reproducing Kernel Hilbert Space (RKHS) induced by a k…
▽ More
We consider black box optimization of an unknown function in the nonparametric Gaussian process setting when the noise in the observed function values can be heavy tailed. This is in contrast to existing literature that typically assumes sub-Gaussian noise distributions for queries. Under the assumption that the unknown function belongs to the Reproducing Kernel Hilbert Space (RKHS) induced by a kernel, we first show that an adaptation of the well-known GP-UCB algorithm with reward truncation enjoys sublinear $\tilde{O}(T^{\frac{2 + α}{2(1+α)}})$ regret even with only the $(1+α)$-th moments, $α\in (0,1]$, of the reward distribution being bounded ($\tilde{O}$ hides logarithmic factors). However, for the common squared exponential (SE) and Matérn kernels, this is seen to be significantly larger than a fundamental $Ω(T^{\frac{1}{1+α}})$ lower bound on regret. We resolve this gap by developing novel Bayesian optimization algorithms, based on kernel approximation techniques, with regret bounds matching the lower bound in order for the SE kernel. We numerically benchmark the algorithms on environments based on both synthetic models and real-world data sets.
△ Less
Submitted 16 September, 2019;
originally announced September 2019.
-
From PAC to Instance-Optimal Sample Complexity in the Plackett-Luce Model
Authors:
Aadirupa Saha,
Aditya Gopalan
Abstract:
We consider PAC-learning a good item from $k$-subsetwise feedback information sampled from a Plackett-Luce probability model, with instance-dependent sample complexity performance. In the setting where subsets of a fixed size can be tested and top-ranked feedback is made available to the learner, we give an algorithm with optimal instance-dependent sample complexity, for PAC best arm identificatio…
▽ More
We consider PAC-learning a good item from $k$-subsetwise feedback information sampled from a Plackett-Luce probability model, with instance-dependent sample complexity performance. In the setting where subsets of a fixed size can be tested and top-ranked feedback is made available to the learner, we give an algorithm with optimal instance-dependent sample complexity, for PAC best arm identification, of $O\bigg(\frac{θ_{[k]}}{k}\sum_{i = 2}^n\max\Big(1,\frac{1}{Δ_i^2}\Big) \ln\frac{k}δ\Big(\ln \frac{1}{Δ_i}\Big)\bigg)$, $Δ_i$ being the Plackett-Luce parameter gap between the best and the $i^{th}$ best item, and $θ_{[k]}$ is the sum of the \pl\, parameters for the top-$k$ items. The algorithm is based on a wrapper around a PAC winner-finding algorithm with weaker performance guarantees to adapt to the hardness of the input instance. The sample complexity is also shown to be multiplicatively better depending on the length of rank-ordered feedback available in each subset-wise play. We show optimality of our algorithms with matching sample complexity lower bounds. We next address the winner-finding problem in Plackett-Luce models in the fixed-budget setting with instance dependent upper and lower bounds on the misidentification probability, of $Ω\left(\exp(-2 \tilde ΔQ) \right)$ for a given budget $Q$, where $\tilde Δ$ is an explicit instance-dependent problem complexity parameter. Numerical performance results are also reported.
△ Less
Submitted 26 February, 2020; v1 submitted 1 March, 2019;
originally announced March 2019.
-
Combinatorial Bandits with Relative Feedback
Authors:
Aadirupa Saha,
Aditya Gopalan
Abstract:
We consider combinatorial online learning with subset choices when only relative feedback information from subsets is available, instead of bandit or semi-bandit feedback which is absolute. Specifically, we study two regret minimisation problems over subsets of a finite ground set $[n]$, with subset-wise relative preference information feedback according to the Multinomial logit choice model. In t…
▽ More
We consider combinatorial online learning with subset choices when only relative feedback information from subsets is available, instead of bandit or semi-bandit feedback which is absolute. Specifically, we study two regret minimisation problems over subsets of a finite ground set $[n]$, with subset-wise relative preference information feedback according to the Multinomial logit choice model. In the first setting, the learner can play subsets of size bounded by a maximum size and receives top-$m$ rank-ordered feedback, while in the second setting the learner can play subsets of a fixed size $k$ with a full subset ranking observed as feedback. For both settings, we devise instance-dependent and order-optimal regret algorithms with regret $O(\frac{n}{m} \ln T)$ and $O(\frac{n}{k} \ln T)$, respectively. We derive fundamental limits on the regret performance of online learning with subset-wise preferences, proving the tightness of our regret guarantees. Our results also show the value of eliciting more general top-$m$ rank-ordered feedback over single winner feedback ($m=1$). Our theoretical results are corroborated with empirical evaluations.
△ Less
Submitted 26 February, 2020; v1 submitted 1 March, 2019;
originally announced March 2019.
-
Active Ranking with Subset-wise Preferences
Authors:
Aadirupa Saha,
Aditya Gopalan
Abstract:
We consider the problem of probably approximately correct (PAC) ranking $n$ items by adaptively eliciting subset-wise preference feedback. At each round, the learner chooses a subset of $k$ items and observes stochastic feedback indicating preference information of the winner (most preferred) item of the chosen subset drawn according to a Plackett-Luce (PL) subset choice model unknown a priori. Th…
▽ More
We consider the problem of probably approximately correct (PAC) ranking $n$ items by adaptively eliciting subset-wise preference feedback. At each round, the learner chooses a subset of $k$ items and observes stochastic feedback indicating preference information of the winner (most preferred) item of the chosen subset drawn according to a Plackett-Luce (PL) subset choice model unknown a priori. The objective is to identify an $ε$-optimal ranking of the $n$ items with probability at least $1 - δ$. When the feedback in each subset round is a single Plackett-Luce-sampled item, we show $(ε, δ)$-PAC algorithms with a sample complexity of $O\left(\frac{n}{ε^2} \ln \frac{n}δ \right)$ rounds, which we establish as being order-optimal by exhibiting a matching sample complexity lower bound of $Ω\left(\frac{n}{ε^2} \ln \frac{n}δ \right)$---this shows that there is essentially no improvement possible from the pairwise comparisons setting ($k = 2$). When, however, it is possible to elicit top-$m$ ($\leq k$) ranking feedback according to the PL model from each adaptively chosen subset of size $k$, we show that an $(ε, δ)$-PAC ranking sample complexity of $O\left(\frac{n}{m ε^2} \ln \frac{n}δ \right)$ is achievable with explicit algorithms, which represents an $m$-wise reduction in sample complexity compared to the pairwise case. This again turns out to be order-wise unimprovable across the class of symmetric ranking algorithms. Our algorithms rely on a novel {pivot trick} to maintain only $n$ itemwise score estimates, unlike $O(n^2)$ pairwise score estimates that has been used in prior work. We report results of numerical experiments that corroborate our findings.
△ Less
Submitted 1 March, 2019; v1 submitted 23 October, 2018;
originally announced October 2018.
-
PAC Battling Bandits in the Plackett-Luce Model
Authors:
Aadirupa Saha,
Aditya Gopalan
Abstract:
We introduce the probably approximately correct (PAC) \emph{Battling-Bandit} problem with the Plackett-Luce (PL) subset choice model--an online learning framework where at each trial the learner chooses a subset of $k$ arms from a fixed set of $n$ arms, and subsequently observes a stochastic feedback indicating preference information of the items in the chosen subset, e.g., the most preferred item…
▽ More
We introduce the probably approximately correct (PAC) \emph{Battling-Bandit} problem with the Plackett-Luce (PL) subset choice model--an online learning framework where at each trial the learner chooses a subset of $k$ arms from a fixed set of $n$ arms, and subsequently observes a stochastic feedback indicating preference information of the items in the chosen subset, e.g., the most preferred item or ranking of the top $m$ most preferred items etc. The objective is to identify a near-best item in the underlying PL model with high confidence. This generalizes the well-studied PAC \emph{Dueling-Bandit} problem over $n$ arms, which aims to recover the \emph{best-arm} from pairwise preference information, and is known to require $O(\frac{n}{ε^2} \ln \frac{1}δ)$ sample complexity \citep{Busa_pl,Busa_top}. We study the sample complexity of this problem under various feedback models: (1) Winner of the subset (WI), and (2) Ranking of top-$m$ items (TR) for $2\le m \le k$. We show, surprisingly, that with winner information (WI) feedback over subsets of size $2 \leq k \leq n$, the best achievable sample complexity is still $O\left( \frac{n}{ε^2} \ln \frac{1}δ\right)$, independent of $k$, and the same as that in the Dueling Bandit setting ($k=2$). For the more general top-$m$ ranking (TR) feedback model, we show a significantly smaller lower bound on sample complexity of $Ω\bigg( \frac{n}{mε^2} \ln \frac{1}δ\bigg)$, which suggests a multiplicative reduction by a factor ${m}$ owing to the additional information revealed from preferences among $m$ items instead of just $1$. We also propose two algorithms for the PAC problem with the TR feedback model with optimal (upto logarithmic factors) sample complexity guarantees, establishing the increase in statistical efficiency from exploiting rank-ordered feedback.
△ Less
Submitted 1 March, 2019; v1 submitted 12 August, 2018;
originally announced August 2018.
-
Online Learning in Kernelized Markov Decision Processes
Authors:
Sayak Ray Chowdhury,
Aditya Gopalan
Abstract:
We consider online learning for minimizing regret in unknown, episodic Markov decision processes (MDPs) with continuous states and actions. We develop variants of the UCRL and posterior sampling algorithms that employ nonparametric Gaussian process priors to generalize across the state and action spaces. When the transition and reward functions of the true MDP are members of the associated Reprodu…
▽ More
We consider online learning for minimizing regret in unknown, episodic Markov decision processes (MDPs) with continuous states and actions. We develop variants of the UCRL and posterior sampling algorithms that employ nonparametric Gaussian process priors to generalize across the state and action spaces. When the transition and reward functions of the true MDP are members of the associated Reproducing Kernel Hilbert Spaces of functions induced by symmetric psd kernels (frequentist setting), we show that the algorithms enjoy sublinear regret bounds. The bounds are in terms of explicit structural parameters of the kernels, namely a novel generalization of the information gain metric from kernelized bandit, and highlight the influence of transition and reward function structure on the learning performance. Our results are applicable to multidimensional state and action spaces with composite kernel structures, and generalize results from the literature on kernelized bandits, and the adaptive control of parametric linear dynamical systems with quadratic costs.
△ Less
Submitted 2 January, 2019; v1 submitted 21 May, 2018;
originally announced May 2018.
-
Learning to detect an oddball target with observations from an exponential family
Authors:
Gayathri R Prabhu,
Srikrishna Bhashyam,
Aditya Gopalan,
Rajesh Sundaresan
Abstract:
The problem of detecting an odd arm from a set of K arms of a multi-armed bandit, with fixed confidence, is studied in a sequential decision-making scenario. Each arm's signal follows a distribution from a vector exponential family. All arms have the same parameters except the odd arm. The actual parameters of the odd and non-odd arms are unknown to the decision maker. Further, the decision maker…
▽ More
The problem of detecting an odd arm from a set of K arms of a multi-armed bandit, with fixed confidence, is studied in a sequential decision-making scenario. Each arm's signal follows a distribution from a vector exponential family. All arms have the same parameters except the odd arm. The actual parameters of the odd and non-odd arms are unknown to the decision maker. Further, the decision maker incurs a cost for switching from one arm to another. This is a sequential decision making problem where the decision maker gets only a limited view of the true state of nature at each stage, but can control his view by choosing the arm to observe at each stage. Of interest are policies that satisfy a given constraint on the probability of false detection. An information-theoretic lower bound on the total cost (expected time for a reliable decision plus total switching cost) is first identified, and a variation on a sequential policy based on the generalised likelihood ratio statistic is then studied. Thanks to the vector exponential family assumption, the signal processing in this policy at each stage turns out to be very simple, in that the associated conjugate prior enables easy updates of the posterior distribution of the model parameters. The policy, with a suitable threshold, is shown to satisfy the given constraint on the probability of false detection. Further, the proposed policy is asymptotically optimal in terms of the total cost among all policies that satisfy the constraint on the probability of false detection.
△ Less
Submitted 12 June, 2022; v1 submitted 11 December, 2017;
originally announced December 2017.
-
Online Learning for Structured Loss Spaces
Authors:
Siddharth Barman,
Aditya Gopalan,
Aadirupa Saha
Abstract:
We consider prediction with expert advice when the loss vectors are assumed to lie in a set described by the sum of atomic norm balls. We derive a regret bound for a general version of the online mirror descent (OMD) algorithm that uses a combination of regularizers, each adapted to the constituent atomic norms. The general result recovers standard OMD regret bounds, and yields regret bounds for n…
▽ More
We consider prediction with expert advice when the loss vectors are assumed to lie in a set described by the sum of atomic norm balls. We derive a regret bound for a general version of the online mirror descent (OMD) algorithm that uses a combination of regularizers, each adapted to the constituent atomic norms. The general result recovers standard OMD regret bounds, and yields regret bounds for new structured settings where the loss vectors are (i) noisy versions of points from a low-rank subspace, (ii) sparse vectors corrupted with noise, and (iii) sparse perturbations of low-rank vectors. For the problem of online learning with structured losses, we also show lower bounds on regret in terms of rank and sparsity of the source set of the loss vectors, which implies lower bounds for the above additive loss settings as well.
△ Less
Submitted 12 November, 2017; v1 submitted 13 June, 2017;
originally announced June 2017.
-
Misspecified Linear Bandits
Authors:
Avishek Ghosh,
Sayak Ray Chowdhury,
Aditya Gopalan
Abstract:
We consider the problem of online learning in misspecified linear stochastic multi-armed bandit problems. Regret guarantees for state-of-the-art linear bandit algorithms such as Optimism in the Face of Uncertainty Linear bandit (OFUL) hold under the assumption that the arms expected rewards are perfectly linear in their features. It is, however, of interest to investigate the impact of potential m…
▽ More
We consider the problem of online learning in misspecified linear stochastic multi-armed bandit problems. Regret guarantees for state-of-the-art linear bandit algorithms such as Optimism in the Face of Uncertainty Linear bandit (OFUL) hold under the assumption that the arms expected rewards are perfectly linear in their features. It is, however, of interest to investigate the impact of potential misspecification in linear bandit models, where the expected rewards are perturbed away from the linear subspace determined by the arms features. Although OFUL has recently been shown to be robust to relatively small deviations from linearity, we show that any linear bandit algorithm that enjoys optimal regret performance in the perfectly linear setting (e.g., OFUL) must suffer linear regret under a sparse additive perturbation of the linear model. In an attempt to overcome this negative result, we define a natural class of bandit models characterized by a non-sparse deviation from linearity. We argue that the OFUL algorithm can fail to achieve sublinear regret even under models that have non-sparse deviation.We finally develop a novel bandit algorithm, comprising a hypothesis test for linearity followed by a decision to use either the OFUL or Upper Confidence Bound (UCB) algorithm. For perfectly linear bandit models, the algorithm provably exhibits OFULs favorable regret performance, while for misspecified models satisfying the non-sparse deviation property, the algorithm avoids the linear regret phenomenon and falls back on UCBs sublinear regret scaling. Numerical experiments on synthetic data, and on recommendation data from the public Yahoo! Learning to Rank Challenge dataset, empirically support our findings.
△ Less
Submitted 23 April, 2017;
originally announced April 2017.
-
On Kernelized Multi-armed Bandits
Authors:
Sayak Ray Chowdhury,
Aditya Gopalan
Abstract:
We consider the stochastic bandit problem with a continuous set of arms, with the expected reward function over the arms assumed to be fixed but unknown. We provide two new Gaussian process-based algorithms for continuous bandit optimization-Improved GP-UCB (IGP-UCB) and GP-Thomson sampling (GP-TS), and derive corresponding regret bounds. Specifically, the bounds hold when the expected reward func…
▽ More
We consider the stochastic bandit problem with a continuous set of arms, with the expected reward function over the arms assumed to be fixed but unknown. We provide two new Gaussian process-based algorithms for continuous bandit optimization-Improved GP-UCB (IGP-UCB) and GP-Thomson sampling (GP-TS), and derive corresponding regret bounds. Specifically, the bounds hold when the expected reward function belongs to the reproducing kernel Hilbert space (RKHS) that naturally corresponds to a Gaussian process kernel used as input by the algorithms. Along the way, we derive a new self-normalized concentration inequality for vector- valued martingales of arbitrary, possibly infinite, dimension. Finally, experimental evaluation and comparisons to existing algorithms on synthetic and real-world environments are carried out that highlight the favorable gains of the proposed strategies in many cases.
△ Less
Submitted 17 May, 2017; v1 submitted 3 April, 2017;
originally announced April 2017.
-
Bandit algorithms to emulate human decision making using probabilistic distortions
Authors:
Ravi Kumar Kolla,
Prashanth L. A.,
Aditya Gopalan,
Krishna Jagannathan,
Michael Fu,
Steve Marcus
Abstract:
Motivated by models of human decision making proposed to explain commonly observed deviations from conventional expected value preferences, we formulate two stochastic multi-armed bandit problems with distorted probabilities on the reward distributions: the classic $K$-armed bandit and the linearly parameterized bandit settings. We consider the aforementioned problems in the regret minimization as…
▽ More
Motivated by models of human decision making proposed to explain commonly observed deviations from conventional expected value preferences, we formulate two stochastic multi-armed bandit problems with distorted probabilities on the reward distributions: the classic $K$-armed bandit and the linearly parameterized bandit settings. We consider the aforementioned problems in the regret minimization as well as best arm identification framework for multi-armed bandits. For the regret minimization setting in $K$-armed as well as linear bandit problems, we propose algorithms that are inspired by Upper Confidence Bound (UCB) algorithms, incorporate reward distortions, and exhibit sublinear regret. For the $K$-armed bandit setting, we derive an upper bound on the expected regret for our proposed algorithm, and then we prove a matching lower bound to establish the order-optimality of our algorithm. For the linearly parameterized setting, our algorithm achieves a regret upper bound that is of the same order as that of regular linear bandit algorithm called Optimism in the Face of Uncertainty Linear (OFUL) bandit algorithm, and unlike OFUL, our algorithm handles distortions and an arm-dependent noise model. For the best arm identification problem in the $K$-armed bandit setting, we propose algorithms, derive guarantees on their performance, and also show that these algorithms are order optimal by proving matching fundamental limits on performance. For best arm identification in linear bandits, we propose an algorithm and establish sample complexity guarantees. Finally, we present simulation experiments which demonstrate the advantages resulting from using distortion-aware learning algorithms in a vehicular traffic routing application.
△ Less
Submitted 31 October, 2023; v1 submitted 30 November, 2016;
originally announced November 2016.
-
Low-rank Bandits with Latent Mixtures
Authors:
Aditya Gopalan,
Odalric-Ambrym Maillard,
Mohammadi Zaki
Abstract:
We study the task of maximizing rewards from recommending items (actions) to users sequentially interacting with a recommender system. Users are modeled as latent mixtures of C many representative user classes, where each class specifies a mean reward profile across actions. Both the user features (mixture distribution over classes) and the item features (mean reward vector per class) are unknown…
▽ More
We study the task of maximizing rewards from recommending items (actions) to users sequentially interacting with a recommender system. Users are modeled as latent mixtures of C many representative user classes, where each class specifies a mean reward profile across actions. Both the user features (mixture distribution over classes) and the item features (mean reward vector per class) are unknown a priori. The user identity is the only contextual information available to the learner while interacting. This induces a low-rank structure on the matrix of expected rewards r a,b from recommending item a to user b. The problem reduces to the well-known linear bandit when either user or item-side features are perfectly known. In the setting where each user, with its stochastically sampled taste profile, interacts only for a small number of sessions, we develop a bandit algorithm for the two-sided uncertainty. It combines the Robust Tensor Power Method of Anandkumar et al. (2014b) with the OFUL linear bandit algorithm of Abbasi-Yadkori et al. (2011). We provide the first rigorous regret analysis of this combination, showing that its regret after T user interactions is $\tilde O(C\sqrt{BT})$, with B the number of users. An ingredient towards this result is a novel robustness property of OFUL, of independent interest.
△ Less
Submitted 6 September, 2016;
originally announced September 2016.
-
Optimal Recommendation to Users that React: Online Learning for a Class of POMDPs
Authors:
Rahul Meshram,
Aditya Gopalan,
D. Manjunath
Abstract:
We describe and study a model for an Automated Online Recommendation System (AORS) in which a user's preferences can be time-dependent and can also depend on the history of past recommendations and play-outs. The three key features of the model that makes it more realistic compared to existing models for recommendation systems are (1) user preference is inherently latent, (2) current recommendatio…
▽ More
We describe and study a model for an Automated Online Recommendation System (AORS) in which a user's preferences can be time-dependent and can also depend on the history of past recommendations and play-outs. The three key features of the model that makes it more realistic compared to existing models for recommendation systems are (1) user preference is inherently latent, (2) current recommendations can affect future preferences, and (3) it allows for the development of learning algorithms with provable performance guarantees. The problem is cast as an average-cost restless multi-armed bandit for a given user, with an independent partially observable Markov decision process (POMDP) for each item of content. We analyze the POMDP for a single arm, describe its structural properties, and characterize its optimal policy. We then develop a Thompson sampling-based online reinforcement learning algorithm to learn the parameters of the model and optimize utility from the binary responses of the users to continuous recommendations. We then analyze the performance of the learning algorithm and characterize the regret. Illustrative numerical results and directions for extension to the restless hidden Markov multi-armed bandit problem are also presented.
△ Less
Submitted 30 March, 2016;
originally announced March 2016.
-
Collaborative Learning of Stochastic Bandits over a Social Network
Authors:
Ravi Kumar Kolla,
Krishna Jagannathan,
Aditya Gopalan
Abstract:
We consider a collaborative online learning paradigm, wherein a group of agents connected through a social network are engaged in playing a stochastic multi-armed bandit game. Each time an agent takes an action, the corresponding reward is instantaneously observed by the agent, as well as its neighbours in the social network. We perform a regret analysis of various policies in this collaborative l…
▽ More
We consider a collaborative online learning paradigm, wherein a group of agents connected through a social network are engaged in playing a stochastic multi-armed bandit game. Each time an agent takes an action, the corresponding reward is instantaneously observed by the agent, as well as its neighbours in the social network. We perform a regret analysis of various policies in this collaborative learning setting. A key finding of this paper is that natural extensions of widely-studied single agent learning policies to the network setting need not perform well in terms of regret. In particular, we identify a class of non-altruistic and individually consistent policies, and argue by deriving regret lower bounds that they are liable to suffer a large regret in the networked setting. We also show that the learning performance can be substantially improved if the agents exploit the structure of the network, and develop a simple learning algorithm based on dominating sets of the network. Specifically, we first consider a star network, which is a common motif in hierarchical social networks, and show analytically that the hub agent can be used as an information sink to expedite learning and improve the overall regret. We also derive networkwide regret bounds for the algorithm applied to general networks. We conduct numerical experiments on a variety of networks to corroborate our analytical results.
△ Less
Submitted 11 July, 2016; v1 submitted 29 February, 2016;
originally announced February 2016.
-
Optimal WiFi Sensing via Dynamic Programming
Authors:
Abhinav Kumar,
Rahul Vaze,
Sibi Raj B Pillai,
Aditya Gopalan
Abstract:
The problem of finding an optimal sensing schedule for a mobile device that encounters an intermittent WiFi access opportunity is considered. At any given time, the WiFi is in any of the two modes, ON or OFF, and the mobile's incentive is to connect to the WiFi in the ON mode as soon as possible, while spending as little sensing energy. We introduce a dynamic programming framework which enables th…
▽ More
The problem of finding an optimal sensing schedule for a mobile device that encounters an intermittent WiFi access opportunity is considered. At any given time, the WiFi is in any of the two modes, ON or OFF, and the mobile's incentive is to connect to the WiFi in the ON mode as soon as possible, while spending as little sensing energy. We introduce a dynamic programming framework which enables the characterization of an explicit solution for several models, particularly when the OFF periods are exponentially distributed. While the problem for non-exponential OFF periods is ill-posed in general, a usual workaround in literature is to make the mobile device aware if one ON period is completely missed. In this restricted setting, using the DP framework, the deterministic nature of the optimal sensing policy is established, and value iterations are shown to converge to the optimal solution. Finally, we address the blind situation where the distributions of ON and OFF periods are unknown. A continuous bandit based learning algorithm that has vanishing regret (loss compared to the optimal strategy with the knowledge of distributions) is presented, and comparisons with the optimal schemes are provided for exponential ON and OFF times.
△ Less
Submitted 28 October, 2014;
originally announced October 2014.
-
Thompson Sampling for Learning Parameterized Markov Decision Processes
Authors:
Aditya Gopalan,
Shie Mannor
Abstract:
We consider reinforcement learning in parameterized Markov Decision Processes (MDPs), where the parameterization may induce correlation across transition probabilities or rewards. Consequently, observing a particular state transition might yield useful information about other, unobserved, parts of the MDP. We present a version of Thompson sampling for parameterized reinforcement learning problems,…
▽ More
We consider reinforcement learning in parameterized Markov Decision Processes (MDPs), where the parameterization may induce correlation across transition probabilities or rewards. Consequently, observing a particular state transition might yield useful information about other, unobserved, parts of the MDP. We present a version of Thompson sampling for parameterized reinforcement learning problems, and derive a frequentist regret bound for priors over general parameter spaces. The result shows that the number of instants where suboptimal actions are chosen scales logarithmically with time, with high probability. It holds for prior distributions that put significant probability near the true model, without any additional, specific closed-form structure such as conjugate or product-form priors. The constant factor in the logarithmic scaling encodes the information complexity of learning the MDP in terms of the Kullback-Leibler geometry of the parameter space.
△ Less
Submitted 31 March, 2015; v1 submitted 29 June, 2014;
originally announced June 2014.