-
Fixed Confidence Best Arm Identification in the Bayesian Setting
Authors:
Kyoungseok Jang,
Junpei Komiyama,
Kazutoshi Yamazaki
Abstract:
We consider the fixed-confidence best arm identification (FC-BAI) problem in the Bayesian setting. This problem aims to find the arm of the largest mean with a fixed confidence level when the bandit model has been sampled from the known prior. Most studies on the FC-BAI problem have been conducted in the frequentist setting, where the bandit model is predetermined before the game starts. We show that the traditional FC-BAI algorithms studied in the frequentist setting, such as track-and-stop and top-two algorithms, result in arbitrarily suboptimal performance in the Bayesian setting. We also obtain a lower bound on the expected number of samples in the Bayesian setting and introduce a variant of successive elimination whose performance matches the lower bound up to a logarithmic factor. Simulations verify the theoretical results.
Submitted 22 June, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
Replicability is Asymptotically Free in Multi-armed Bandits
Authors:
Junpei Komiyama,
Shinji Ito,
Yuichi Yoshida,
Souta Koshino
Abstract:
This work is motivated by the growing demand for reproducible machine learning. We study the stochastic multi-armed bandit problem. In particular, we consider a replicable algorithm that ensures, with high probability, that the algorithm's sequence of actions is not affected by the randomness inherent in the dataset. We observe that existing algorithms require $O(1/ρ^2)$ times more regret than nonreplicable algorithms, where $ρ$ is the level of nonreplication. However, we demonstrate that this additional cost is unnecessary when the time horizon $T$ is sufficiently large for a given $ρ$, provided that the magnitude of the confidence bounds is chosen carefully. We introduce an explore-then-commit algorithm that draws arms uniformly before committing to a single arm. Additionally, we examine a successive elimination algorithm that eliminates suboptimal arms at the end of each phase. To ensure the replicability of these algorithms, we incorporate randomness into their decision-making processes. We extend the use of successive elimination to the linear bandit problem as well. For the analysis of these algorithms, we propose a principled approach to limiting the probability of nonreplication. This approach elucidates the steps that existing research has implicitly followed. Furthermore, we derive the first lower bound for the two-armed replicable bandit problem, which implies the optimality of the proposed algorithms up to a $\log\log T$ factor for the two-armed case.
Submitted 11 February, 2024;
originally announced February 2024.
-
Learning Fair Division from Bandit Feedback
Authors:
Hakuei Yamada,
Junpei Komiyama,
Kenshi Abe,
Atsushi Iwasaki
Abstract:
This work addresses learning online fair division under uncertainty, where a central planner sequentially allocates items without precise knowledge of agents' values or utilities. Departing from conventional online algorithms, the planner here relies on noisy, estimated values obtained after allocating items. We introduce wrapper algorithms utilizing \textit{dual averaging}, enabling gradual learning of both the type distribution of arriving items and agents' values through bandit feedback. This approach enables the algorithms to asymptotically achieve optimal Nash social welfare in linear Fisher markets with agents having additive utilities. We establish regret bounds in Nash social welfare and empirically validate the superior performance of our proposed algorithms across synthetic and empirical datasets.
Submitted 15 November, 2023;
originally announced November 2023.
-
High-dimensional Contextual Bandit Problem without Sparsity
Authors:
Junpei Komiyama,
Masaaki Imaizumi
Abstract:
In this research, we investigate the high-dimensional linear contextual bandit problem where the number of features $p$ is greater than the budget $T$, or it may even be infinite. Differing from the majority of previous works in this field, we do not impose sparsity on the regression coefficients. Instead, we rely on recent findings on overparameterized models, which enable us to analyze the performance of the minimum-norm interpolating estimator when data distributions have small effective ranks. We propose an explore-then-commit (EtC) algorithm to address this problem and examine its performance. Through our analysis, we derive the optimal rate of the EtC algorithm in terms of $T$ and show that this rate can be achieved by balancing exploration and exploitation. Moreover, we introduce an adaptive explore-then-commit (AEtC) algorithm that adaptively finds the optimal balance. We assess the performance of the proposed algorithms through a series of simulations.
Submitted 19 June, 2023;
originally announced June 2023.
-
Strategic Choices of Migrants and Smugglers in the Central Mediterranean Sea
Authors:
Katherine Hoffmann Pham,
Junpei Komiyama
Abstract:
The sea crossing from Libya to Italy is one of the world's most dangerous and politically contentious migration routes, and yet over half a million people have attempted the crossing since 2014. Leveraging data on aggregate migration flows and individual migration incidents, we estimate how migrants and smugglers have reacted to changes in border enforcement, namely the rise in interceptions by the Libyan Coast Guard starting in 2017 and the corresponding decrease in the probability of rescue at sea. We find support for a deterrence effect in which attempted crossings along the Central Mediterranean route declined, and a diversion effect in which some migrants substituted to the Western Mediterranean route. At the same time, smugglers adapted their tactics. Using a strategic model of the smuggler's choice of boat size, we estimate how smugglers trade off between the short-run payoffs to launching overcrowded boats and the long-run costs of making less successful crossing attempts under different levels of enforcement. Taken together, these analyses shed light on how the integration of incident- and flow-level datasets can inform ongoing migration policy debates and identify potential consequences of changing enforcement regimes.
Submitted 10 July, 2022;
originally announced July 2022.
-
Minimax Optimal Algorithms for Fixed-Budget Best Arm Identification
Authors:
Junpei Komiyama,
Taira Tsuchiya,
Junya Honda
Abstract:
We consider the fixed-budget best arm identification problem where the goal is to find the arm of the largest mean with a fixed number of samples. It is known that the probability of misidentifying the best arm is exponentially small in the number of rounds. However, the rate (exponent) of this value has been characterized only to a limited extent. In this paper, we characterize the minimax optimal rate as a result of an optimization over all possible parameters. We introduce two rates, $R^{\mathrm{go}}$ and $R^{\mathrm{go}}_{\infty}$, corresponding to lower bounds on the probability of misidentification, each of which is associated with a proposed algorithm. The rate $R^{\mathrm{go}}$ is associated with $R^{\mathrm{go}}$-tracking, which can be efficiently implemented by a neural network and is shown to outperform existing algorithms. However, this rate requires a nontrivial condition to be achievable. To address this issue, we introduce the second rate $R^{\mathrm{go}}_\infty$. We show that this rate is indeed achievable by introducing a conceptual algorithm called delayed optimal tracking (DOT).
Submitted 26 October, 2022; v1 submitted 9 June, 2022;
originally announced June 2022.
-
Anytime Capacity Expansion in Medical Residency Match by Monte Carlo Tree Search
Authors:
Kenshi Abe,
Junpei Komiyama,
Atsushi Iwasaki
Abstract:
This paper considers the capacity expansion problem in two-sided matchings, where the policymaker is allowed to allocate some extra seats as well as the standard seats. In the medical residency match, each hospital accepts a limited number of doctors. Such capacity constraints are typically given in advance. However, such exogenous constraints can compromise the welfare of the doctors; some popular hospitals inevitably dismiss some of their favorite doctors. Meanwhile, hospitals often also benefit from accepting a few extra doctors. To tackle the problem, we propose an anytime method in which the upper confidence tree searches the space of capacity expansions, each of which has a resident-optimal stable assignment found by the deferred acceptance method. Constructing a good search tree representation significantly boosts the performance of the proposed method. Our simulation shows that the proposed method identifies an almost optimal capacity expansion with a significantly smaller computational budget than exact methods based on mixed-integer programming.
Submitted 22 May, 2022; v1 submitted 14 February, 2022;
originally announced February 2022.
-
Suboptimal Performance of the Bayes Optimal Algorithm in Frequentist Best Arm Identification
Authors:
Junpei Komiyama
Abstract:
We consider the fixed-budget best arm identification problem with rewards following normal distributions. In this problem, the forecaster is given $K$ arms (or treatments) and $T$ time steps. The forecaster attempts to find the arm with the largest mean, via an adaptive experiment conducted using an algorithm. The algorithm's performance is evaluated by simple regret, reflecting the quality of the estimated best arm. While frequentist simple regret can decrease exponentially with respect to $T$, Bayesian simple regret decreases polynomially. This paper demonstrates that the Bayes optimal algorithm, which minimizes the Bayesian simple regret, does not yield an exponential decrease in simple regret under certain parameter settings. This contrasts with the numerous findings that suggest the asymptotic equivalence of Bayesian and frequentist approaches in fixed sampling regimes. Although the Bayes optimal algorithm is formulated as a recursive equation that is virtually impossible to compute exactly, we lay the groundwork for future research by introducing a novel concept termed the expected Bellman improvement.
Submitted 14 April, 2024; v1 submitted 10 February, 2022;
originally announced February 2022.
-
Rate-optimal Bayesian Simple Regret in Best Arm Identification
Authors:
Junpei Komiyama,
Kaito Ariu,
Masahiro Kato,
Chao Qin
Abstract:
We consider best arm identification in the multi-armed bandit problem. Assuming certain continuity conditions of the prior, we characterize the rate of the Bayesian simple regret. Differing from Bayesian regret minimization (Lai, 1987), the leading term in the Bayesian simple regret derives from the region where the gap between optimal and suboptimal arms is smaller than $\sqrt{\frac{\log T}{T}}$. We propose a simple and easy-to-compute algorithm whose leading term matches the lower bound up to a constant factor; simulation results support our theoretical findings.
Submitted 25 July, 2023; v1 submitted 18 November, 2021;
originally announced November 2021.
-
Deviation-Based Learning: Training Recommender Systems Using Informed User Choice
Authors:
Junpei Komiyama,
Shunya Noda
Abstract:
This paper proposes a new approach to training recommender systems called deviation-based learning. The recommender and rational users have different knowledge. The recommender learns user knowledge by observing what action users take upon receiving recommendations. Learning eventually stalls if the recommender always suggests a choice: Before the recommender completes learning, users start following the recommendations blindly, and their choices do not reflect their knowledge. The learning rate and social welfare improve substantially if the recommender abstains from recommending a particular choice when she predicts that multiple alternatives will produce a similar payoff.
Submitted 18 August, 2022; v1 submitted 20 September, 2021;
originally announced September 2021.
-
Policy Choice and Best Arm Identification: Asymptotic Analysis of Exploration Sampling
Authors:
Kaito Ariu,
Masahiro Kato,
Junpei Komiyama,
Kenichiro McAlinn,
Chao Qin
Abstract:
We consider the "policy choice" problem -- otherwise known as best arm identification in the bandit literature -- proposed by Kasy and Sautmann (2021) for adaptive experimental design. Theorem 1 of Kasy and Sautmann (2021) provides three asymptotic results that give theoretical guarantees for exploration sampling developed for this setting. We first show that the proof of Theorem 1 (1) has technical issues, and the proof and statement of Theorem 1 (2) are incorrect. We then show, through a counterexample, that Theorem 1 (3) is false. For the former two, we correct the statements and provide rigorous proofs. For Theorem 1 (3), we propose an alternative objective function, which we call posterior weighted policy regret, and derive the asymptotic optimality of exploration sampling.
Submitted 24 November, 2021; v1 submitted 16 September, 2021;
originally announced September 2021.
-
Finite-time Analysis of Globally Nonstationary Multi-Armed Bandits
Authors:
Junpei Komiyama,
Edouard Fouché,
Junya Honda
Abstract:
We consider nonstationary multi-armed bandit problems where the model parameters of the arms change over time. We introduce the adaptive resetting bandit (ADR-bandit), a bandit algorithm class that leverages adaptive windowing techniques from literature on data streams. We first provide new guarantees on the quality of estimators resulting from adaptive windowing techniques, which are of independent interest. Furthermore, we conduct a finite-time analysis of ADR-bandit in two typical environments: an abrupt environment where changes occur instantaneously and a gradual environment where changes occur progressively. We demonstrate that ADR-bandit has nearly optimal performance when abrupt or gradual changes occur in a coordinated manner that we call global changes. We demonstrate that forced exploration is unnecessary when we assume such global changes. Unlike the existing nonstationary bandit algorithms, ADR-bandit has optimal performance in stationary environments as well as nonstationary environments with global changes. Our experiments show that the proposed algorithms outperform the existing approaches in synthetic and real-world environments.
Submitted 25 October, 2023; v1 submitted 23 July, 2021;
originally announced July 2021.
-
Controlling False Discovery Rates under Cross-Sectional Correlations
Authors:
Junpei Komiyama,
Masaya Abe,
Kei Nakagawa,
Kenichiro McAlinn
Abstract:
We consider controlling the false discovery rate for testing many time series with an unknown cross-sectional correlation structure. Given a large number of hypotheses, false and missing discoveries can plague an analysis. While many procedures have been proposed to control false discovery, most of them either assume independent hypotheses or lack statistical power. A problem of particular interest is in financial asset pricing, where the goal is to determine which "factors" lead to excess returns out of a large number of potential factors. Our contribution is two-fold. First, we show the consistency of Fama and French's prominent method under multiple testing. Second, we propose a novel method for false discovery control using double bootstrapping. We achieve superior statistical power to existing methods and prove that the false discovery rate is controlled. Simulations and a real data application illustrate the efficacy of our method over existing methods.
Submitted 9 June, 2021; v1 submitted 15 February, 2021;
originally announced February 2021.
-
On Statistical Discrimination as a Failure of Social Learning: A Multi-Armed Bandit Approach
Authors:
Junpei Komiyama,
Shunya Noda
Abstract:
We analyze statistical discrimination in hiring markets using a multi-armed bandit model. Myopic firms face workers arriving with heterogeneous observable characteristics. The association between the worker's skill and characteristics is unknown ex ante; thus, firms need to learn it. Laissez-faire causes perpetual underestimation: minority workers are rarely hired, and therefore, the underestimation tends to persist. Even a marginal imbalance in the population ratio frequently results in perpetual underestimation. We propose two policy solutions: a novel subsidy rule (the hybrid mechanism) and the Rooney Rule. Our results indicate that temporary affirmative actions effectively alleviate discrimination stemming from insufficient data.
Submitted 14 July, 2023; v1 submitted 2 October, 2020;
originally announced October 2020.
-
A Robust Transferable Deep Learning Framework for Cross-sectional Investment Strategy
Authors:
Kei Nakagawa,
Masaya Abe,
Junpei Komiyama
Abstract:
Stock return predictability is an important research theme as it reflects our economic and social organization, and significant efforts are made to explain the dynamism therein. Statistics with strong explanatory power, called "factors," have been proposed to summarize the essence of predictive stock returns. Although machine learning methods are increasingly popular in stock return prediction, inference of stock returns remains highly elusive, and most investors still rely, at least partly, on their intuition to make better decisions. The challenge here is to make an investment strategy that is consistent over a reasonably long period, with minimal human decision-making in the entire process. To this end, we propose a new stock return prediction framework that we call Ranked Information Coefficient Neural Network (RIC-NN). RIC-NN is a deep learning approach and includes the following three novel ideas: (1) nonlinear multi-factor approach, (2) stopping criteria with ranked information coefficient (rank IC), and (3) deep transfer learning among multiple regions. Experimental comparison with the stocks in the Morgan Stanley Capital International (MSCI) indices shows that RIC-NN outperforms not only off-the-shelf machine learning methods but also the average return of major equity investment funds in the last fourteen years.
Submitted 2 October, 2019;
originally announced October 2019.
-
A Simple Way to Deal with Cherry-picking
Authors:
Junpei Komiyama,
Takanori Maehara
Abstract:
Statistical hypothesis testing serves as statistical evidence for scientific innovation. However, if the reported results are intentionally biased, hypothesis testing no longer controls the rate of false discovery. In particular, we study such selection bias in machine learning models where the reporter is motivated to promote an algorithmic innovation. When the number of possible configurations (e.g., datasets) is large, we show that the reporter can falsely report an innovation even if there is no improvement at all. We propose a `post-reporting' solution to this issue where the bias of the reported results is verified by another set of results. The theoretical findings are supported by experimental results with synthetic and real-world datasets.
Submitted 11 October, 2018;
originally announced October 2018.
-
Comparing Fairness Criteria Based on Social Outcome
Authors:
Junpei Komiyama,
Hajime Shimao
Abstract:
Fairness in algorithmic decision-making processes is attracting increasing concern. When an algorithm is applied to human-related decision-making, an estimator solely optimizing its predictive power can learn biases in the existing data, which motivates the notion of fairness in machine learning. While several different notions are studied in the literature, few studies have examined how these notions affect individuals. We demonstrate such a comparison between several policies induced by well-known fairness criteria, including the color-blind (CB), the demographic parity (DP), and the equalized odds (EO). We show that the EO is the only criterion among them that removes group-level disparity. Empirical studies on the social welfare and disparity of these policies are conducted.
Submitted 13 June, 2018;
originally announced June 2018.
-
Two-stage Algorithm for Fairness-aware Machine Learning
Authors:
Junpei Komiyama,
Hajime Shimao
Abstract:
Algorithmic decision-making processes now affect many aspects of our lives. Standard tools for machine learning, such as classification and regression, are subject to the bias in data, and thus direct application of such off-the-shelf tools could lead to a specific group being unfairly discriminated against. Removing sensitive attributes of data does not solve this problem because a \textit{disparate impact} can arise when non-sensitive attributes and sensitive attributes are correlated. Here, we study a fair machine learning algorithm that avoids such a disparate impact when making a decision. Inspired by the two-stage least squares method that is widely used in the field of economics, we propose a two-stage algorithm that removes bias in the training data. The proposed algorithm is conceptually simple. Unlike most existing fair algorithms that are designed for classification tasks, the proposed method is able to (i) deal with regression tasks, (ii) combine explanatory attributes to remove reverse discrimination, and (iii) deal with numerical sensitive attributes. The performance and fairness of the proposed algorithm are evaluated in simulations with synthetic and real-world datasets.
Submitted 13 October, 2017;
originally announced October 2017.
-
Copeland Dueling Bandit Problem: Regret Lower Bound, Optimal Algorithm, and Computationally Efficient Algorithm
Authors:
Junpei Komiyama,
Junya Honda,
Hiroshi Nakagawa
Abstract:
We study the K-armed dueling bandit problem, a variation of the standard stochastic bandit problem where the feedback is limited to relative comparisons of a pair of arms. The hardness of recommending Copeland winners, the arms that beat the greatest number of other arms, is characterized by deriving an asymptotic regret bound. We propose Copeland Winners Relative Minimum Empirical Divergence (CW-RMED) and derive an asymptotically optimal regret bound for it. However, it is not known whether the algorithm can be efficiently computed or not. To address this issue, we devise an efficient version (ECW-RMED) and derive its asymptotic regret bound. Experimental comparisons of dueling bandit algorithms show that ECW-RMED significantly outperforms existing ones.
Submitted 24 May, 2016; v1 submitted 5 May, 2016;
originally announced May 2016.
-
Regret Lower Bound and Optimal Algorithm in Finite Stochastic Partial Monitoring
Authors:
Junpei Komiyama,
Junya Honda,
Hiroshi Nakagawa
Abstract:
Partial monitoring is a general model for sequential learning with limited feedback formalized as a game between two players. In this game, the learner chooses an action and at the same time the opponent chooses an outcome, then the learner suffers a loss and receives a feedback signal. The goal of the learner is to minimize the total loss. In this paper, we study partial monitoring with finite actions and stochastic outcomes. We derive a logarithmic distribution-dependent regret lower bound that defines the hardness of the problem. Inspired by the DMED algorithm (Honda and Takemura, 2010) for the multi-armed bandit problem, we propose PM-DMED, an algorithm that minimizes the distribution-dependent regret. PM-DMED significantly outperforms state-of-the-art algorithms in numerical experiments. To show the optimality of PM-DMED with respect to the regret bound, we slightly modify the algorithm by introducing a hinge function (PM-DMED-Hinge). Then, we derive an asymptotically optimal regret upper bound of PM-DMED-Hinge that matches the lower bound.
Submitted 30 September, 2015;
originally announced September 2015.
-
Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem
Authors:
Junpei Komiyama,
Junya Honda,
Hisashi Kashima,
Hiroshi Nakagawa
Abstract:
We study the $K$-armed dueling bandit problem, a variation of the standard stochastic bandit problem where the feedback is limited to relative comparisons of a pair of arms. We introduce a tight asymptotic regret lower bound that is based on the information divergence. An algorithm that is inspired by the Deterministic Minimum Empirical Divergence algorithm (Honda and Takemura, 2010) is proposed, and its regret is analyzed. The proposed algorithm is found to be the first one with a regret upper bound that matches the lower bound. Experimental comparisons of dueling bandit algorithms show that the proposed algorithm significantly outperforms existing ones.
Submitted 29 June, 2015; v1 submitted 8 June, 2015;
originally announced June 2015.
-
Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays
Authors:
Junpei Komiyama,
Junya Honda,
Hiroshi Nakagawa
Abstract:
We discuss a multiple-play multi-armed bandit (MAB) problem in which several arms are selected at each round. Recently, Thompson sampling (TS), a randomized algorithm with a Bayesian spirit, has attracted much attention for its empirically excellent performance, and it has been shown to have an optimal regret bound in the standard single-play MAB problem. In this paper, we propose the multiple-play Thompson sampling (MP-TS) algorithm, an extension of TS to the multiple-play MAB problem, and discuss its regret analysis. We prove that MP-TS for binary rewards has the optimal regret upper bound that matches the regret lower bound provided by Anantharam et al. (1987). Therefore, MP-TS is the first computationally efficient algorithm with optimal regret. A set of computer simulations was also conducted, which compared MP-TS with state-of-the-art algorithms. We also propose a modification of MP-TS, which is shown to have better empirical performance.
Submitted 20 March, 2019; v1 submitted 2 June, 2015;
originally announced June 2015.
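Illustrative sketch for the entry above (not taken from the paper): one common way to realize Thompson sampling with multiple plays for binary rewards is to draw one sample per arm from a Beta posterior and play the arms with the largest samples. The uniform Beta(1, 1) prior, the function names `mp_ts_select` and `run_mp_ts`, and the simulated Bernoulli environment are all assumptions made for this example; the paper's exact procedure and its modified variant may differ.

```python
import random


def mp_ts_select(successes, failures, num_plays, rng):
    """Sample each arm's Beta posterior and return the indices of the top samples.

    Assumes a uniform Beta(1, 1) prior on every arm (an assumption of this sketch).
    """
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    ranked = sorted(range(len(samples)), key=lambda i: samples[i], reverse=True)
    return ranked[:num_plays]


def run_mp_ts(true_means, num_plays, horizon, seed=0):
    """Simulate multiple-play Thompson sampling on hypothetical Bernoulli arms."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    for _ in range(horizon):
        # Play num_plays distinct arms in this round and update their posteriors.
        for arm in mp_ts_select(successes, failures, num_plays, rng):
            if rng.random() < true_means[arm]:
                successes[arm] += 1
            else:
                failures[arm] += 1
    return successes, failures


if __name__ == "__main__":
    wins, losses = run_mp_ts([0.9, 0.8, 0.2, 0.1], num_plays=2, horizon=1000)
    print([w + l for w, l in zip(wins, losses)])  # number of pulls per arm
```

Each round plays exactly `num_plays` distinct arms, so the pull counts sum to `num_plays * horizon`; how those pulls concentrate on the better arms depends on the random seed and the assumed means.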