Showing 1–23 of 23 results for author: Ariu, K

  1. arXiv:2405.17968  [pdf, other]

    cs.LG

    Matroid Semi-Bandits in Sublinear Time

    Authors: Ruo-Chun Tzeng, Naoto Ohsaka, Kaito Ariu

    Abstract: We study the matroid semi-bandits problem, where at each round the learner plays a subset of $K$ arms from a feasible set, and the goal is to maximize the expected cumulative linear rewards. Existing algorithms have per-round time complexity at least $Ω(K)$, which becomes expensive when $K$ is large. To address this computational issue, we propose FasterCUCB whose sampling rule takes time sublinea…

    Submitted 28 May, 2024; originally announced May 2024.

  2. arXiv:2405.14546  [pdf, other]

    cs.GT cs.MA math.OC nlin.CD

    Global Behavior of Learning Dynamics in Zero-Sum Games with Memory Asymmetry

    Authors: Yuma Fujimoto, Kaito Ariu, Kenshi Abe

    Abstract: This study examines the global behavior of dynamics in learning in games between two players, X and Y. We consider the simplest situation for memory asymmetry between two players: X memorizes the other Y's previous action and uses reactive strategies, while Y has no memory. Although this memory complicates the learning dynamics, we discover two novel quantities that characterize the global behavio…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 11 pages, 4 figures (main); 4 pages (appendix)

  3. arXiv:2404.13846  [pdf, other]

    cs.LG cs.AI cs.CL

    Filtered Direct Preference Optimization

    Authors: Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu

    Abstract: Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the RLHF framework, to our knowledge, have been limited. This paper addresses the issue of text quality within the preference dataset by focusing on direct prefere…

    Submitted 4 July, 2024; v1 submitted 21 April, 2024; originally announced April 2024.

  4. arXiv:2404.01054  [pdf, other]

    cs.CL cs.AI

    Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment

    Authors: Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe

    Abstract: Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding. BoN sampling is susceptible to a problem known as reward hacking. Because the reward model is an imperfect proxy for the true objective, over-optimizing its value can compromise its performance on the true objective. A commo…

    Submitted 23 June, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

  5. arXiv:2402.10825  [pdf, other]

    cs.GT cs.MA math.OC nlin.CD

    Nash Equilibrium and Learning Dynamics in Three-Player Matching $m$-Action Games

    Authors: Yuma Fujimoto, Kaito Ariu, Kenshi Abe

    Abstract: Learning in games discusses the processes where multiple players learn their optimal strategies through the repetition of game plays. The dynamics of learning between two players in zero-sum games, such as matching pennies, where their benefits are competitive, have already been well analyzed. However, it is still unexplored and challenging to analyze the dynamics of learning among three players.…

    Submitted 16 February, 2024; originally announced February 2024.

    Comments: 10 pages, 4 figures (main), 9 pages, 1 figure (appendix)

  6. arXiv:2402.03923  [pdf, other]

    cs.LG

    Return-Aligned Decision Transformer

    Authors: Tsunehiko Tanaka, Kenshi Abe, Kaito Ariu, Tetsuro Morimura, Edgar Simo-Serra

    Abstract: Traditional approaches in offline reinforcement learning aim to learn the optimal policy that maximizes the cumulative reward, also known as return. However, as applications broaden, it becomes increasingly crucial to train agents that not only maximize the returns, but align the actual return with a specified target return, giving control over the agent's performance. Decision Transformer (DT) op…

    Submitted 27 May, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

  7. arXiv:2401.02749  [pdf, other]

    cs.AI cs.CL

    Hyperparameter-Free Approach for Faster Minimum Bayes Risk Decoding

    Authors: Yuu Jinnai, Kaito Ariu

    Abstract: Minimum Bayes-Risk (MBR) decoding is shown to be a powerful alternative to beam search decoding for a wide range of text generation tasks. However, MBR requires a huge amount of time for inference to compute the MBR objective, which makes the method infeasible in many situations where response time is critical. Confidence-based pruning (CBP) (Cheng and Vlachos, 2023) has recently been proposed to…

    Submitted 11 June, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

  8. arXiv:2311.05263  [pdf, other]

    cs.AI cs.CL

    Model-Based Minimum Bayes Risk Decoding for Text Generation

    Authors: Yuu Jinnai, Tetsuro Morimura, Ukyo Honda, Kaito Ariu, Kenshi Abe

    Abstract: Minimum Bayes Risk (MBR) decoding has been shown to be a powerful alternative to beam search decoding in a variety of text generation tasks. MBR decoding selects a hypothesis from a pool of hypotheses that has the least expected risk under a probability model according to a given utility function. Since it is impractical to compute the expected risk exactly over all possible hypotheses, two approx…

    Submitted 11 June, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

  9. arXiv:2308.12000  [pdf, other]

    stat.ML cs.LG

    On Universally Optimal Algorithms for A/B Testing

    Authors: Po-An Wang, Kaito Ariu, Alexandre Proutiere

    Abstract: We study the problem of best-arm identification with fixed budget in stochastic multi-armed bandits with Bernoulli rewards. For the problem with two arms, also known as the A/B testing problem, we prove that there is no algorithm that (i) performs as well as the algorithm sampling each arm equally (referred to as the {\it uniform sampling} algorithm) in all instances, and that (ii) strictly outper…

    Submitted 4 June, 2024; v1 submitted 23 August, 2023; originally announced August 2023.

    Comments: Accepted at ICML 2024

  10. arXiv:2306.12968  [pdf, other]

    cs.SI cs.LG stat.ML

    Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model

    Authors: Kaito Ariu, Alexandre Proutiere, Se-Young Yun

    Abstract: We consider the problem of recovering hidden communities in the Labeled Stochastic Block Model (LSBM) with a finite number of clusters, where cluster sizes grow linearly with the total number $n$ of items. In the LSBM, a label is (independently) observed for each pair of items. Our objective is to devise an efficient algorithm that recovers clusters using the observed labels. To this end, we revis…

    Submitted 18 June, 2023; originally announced June 2023.

  11. arXiv:2305.16610  [pdf, other]

    cs.GT cs.LG

    Adaptively Perturbed Mirror Descent for Learning in Games

    Authors: Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Atsushi Iwasaki

    Abstract: This paper proposes a payoff perturbation technique for the Mirror Descent (MD) algorithm in games where the gradient of the payoff functions is monotone in the strategy profile space, potentially containing additive noise. The optimistic family of learning algorithms, exemplified by optimistic MD, successfully achieves {\it last-iterate} convergence in scenarios devoid of noise, leading the dynam…

    Submitted 24 June, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted at ICML 2024

  12. arXiv:2305.13619  [pdf, other]

    cs.GT cs.MA math.OC nlin.CD

    Memory Asymmetry Creates Heteroclinic Orbits to Nash Equilibrium in Learning in Zero-Sum Games

    Authors: Yuma Fujimoto, Kaito Ariu, Kenshi Abe

    Abstract: Learning in games considers how multiple agents maximize their own rewards through repeated games. Memory, an ability that an agent changes his/her action depending on the history of actions in previous games, is often introduced into learning to explore more clever strategies and discuss the decision-making of real agents like humans. However, such games with memory are hard to analyze because th…

    Submitted 16 February, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: 9 pages & 5 figures (main), 5 pages & 2 figures (appendix)

  13. arXiv:2305.01202  [pdf, other]

    cs.IR cs.LG

    Exploration of Unranked Items in Safe Online Learning to Re-Rank

    Authors: Hiroaki Shiino, Kaito Ariu, Kenshi Abe, Togashi Riku

    Abstract: Bandit algorithms for online learning to rank (OLTR) problems often aim to maximize long-term revenue by utilizing user feedback. From a practical point of view, however, such algorithms have a high risk of hurting user experience due to their aggressive exploration. Thus, there has been a rising demand for safe exploration in recent years. One approach to safe exploration is to gradually enhance…

    Submitted 2 May, 2023; originally announced May 2023.

  14. arXiv:2302.01073  [pdf, other]

    cs.GT cs.MA math.OC nlin.CD

    Learning in Multi-Memory Games Triggers Complex Dynamics Diverging from Nash Equilibrium

    Authors: Yuma Fujimoto, Kaito Ariu, Kenshi Abe

    Abstract: Repeated games consider a situation where multiple agents are motivated by their independent rewards throughout learning. In general, the dynamics of their learning become complex. Especially when their rewards compete with each other like zero-sum games, the dynamics often do not converge to their optimum, i.e., the Nash equilibrium. To tackle such complexity, many studies have understood various…

    Submitted 22 May, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

    Comments: 8 pages & 4 figures (main), 6 pages & 1 figure (appendix)

  15. arXiv:2208.09855  [pdf, other]

    cs.GT cs.LG

    Last-Iterate Convergence with Full and Noisy Feedback in Two-Player Zero-Sum Games

    Authors: Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Kentaro Toyoshima, Atsushi Iwasaki

    Abstract: This paper proposes Mutation-Driven Multiplicative Weights Update (M2WU) for learning an equilibrium in two-player zero-sum normal-form games and proves that it exhibits the last-iterate convergence property in both full and noisy feedback settings. In the former, players observe their exact gradient vectors of the utility functions. In the latter, they only observe the noisy gradient vectors. Eve…

    Submitted 26 May, 2023; v1 submitted 21 August, 2022; originally announced August 2022.

    Comments: Accepted in AISTATS 2023

  16. arXiv:2201.04469  [pdf, other]

    stat.ML cs.LG econ.EM math.ST

    Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap

    Authors: Masahiro Kato, Kaito Ariu, Masaaki Imaizumi, Masahiro Nomura, Chao Qin

    Abstract: We consider fixed-budget best-arm identification in two-armed Gaussian bandit problems. One of the longstanding open questions is the existence of an optimal strategy under which the probability of misidentification matches a lower bound. We show that a strategy following the Neyman allocation rule (Neyman, 1934) is asymptotically optimal when the gap between the expected rewards is small. First,…

    Submitted 28 December, 2022; v1 submitted 12 January, 2022; originally announced January 2022.

  17. arXiv:2111.09885  [pdf, other]

    cs.LG stat.ML

    Rate-optimal Bayesian Simple Regret in Best Arm Identification

    Authors: Junpei Komiyama, Kaito Ariu, Masahiro Kato, Chao Qin

    Abstract: We consider best arm identification in the multi-armed bandit problem. Assuming certain continuity conditions of the prior, we characterize the rate of the Bayesian simple regret. Differing from Bayesian regret minimization (Lai, 1987), the leading term in the Bayesian simple regret derives from the region where the gap between optimal and suboptimal arms is smaller than $\sqrt{\frac{\log T}{T}}$.…

    Submitted 25 July, 2023; v1 submitted 18 November, 2021; originally announced November 2021.

    Comments: To appear in Mathematics of Operations Research. Changed the title from the previous version

    MSC Class: Primary: 62L05; secondary: 62C10; 68W27

  18. arXiv:2109.08229  [pdf, ps, other]

    econ.EM cs.LG stat.ME

    Policy Choice and Best Arm Identification: Asymptotic Analysis of Exploration Sampling

    Authors: Kaito Ariu, Masahiro Kato, Junpei Komiyama, Kenichiro McAlinn, Chao Qin

    Abstract: We consider the "policy choice" problem -- otherwise known as best arm identification in the bandit literature -- proposed by Kasy and Sautmann (2021) for adaptive experimental design. Theorem 1 of Kasy and Sautmann (2021) provides three asymptotic results that give theoretical guarantees for exploration sampling developed for this setting. We first show that the proof of Theorem 1 (1) has technic…

    Submitted 24 November, 2021; v1 submitted 16 September, 2021; originally announced September 2021.

    Comments: Submitted to Econometrica

  19. arXiv:2106.14077  [pdf, other]

    cs.LG econ.EM math.ST stat.ME stat.ML

    The Role of Contextual Information in Best Arm Identification

    Authors: Masahiro Kato, Kaito Ariu

    Abstract: We study the best-arm identification problem with fixed confidence when contextual (covariate) information is available in stochastic bandits. Although we can use contextual information in each round, we are interested in the marginalized mean reward over the contextual distribution. Our goal is to identify the best arm with a minimal number of samplings under a given value of the error rate. We s…

    Submitted 26 February, 2024; v1 submitted 26 June, 2021; originally announced June 2021.

  20. arXiv:2010.12470  [pdf, other]

    cs.LG econ.EM stat.ML

    A Practical Guide of Off-Policy Evaluation for Bandit Problems

    Authors: Masahiro Kato, Kenshi Abe, Kaito Ariu, Shota Yasui

    Abstract: Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from samples obtained via different policies. Recently, applying OPE methods for bandit problems has garnered attention. For the theoretical guarantees of an estimator of the policy value, the OPE methods require various conditions on the target policy and policy used for generating the samples. However, existing…

    Submitted 23 October, 2020; originally announced October 2020.

  21. arXiv:2010.12363  [pdf, other]

    stat.ML cs.IR cs.LG

    Regret in Online Recommendation Systems

    Authors: Kaito Ariu, Narae Ryu, Se-Young Yun, Alexandre Proutière

    Abstract: This paper proposes a theoretical analysis of recommendation systems in an online setting, where items are sequentially recommended to users over time. In each round, a user, randomly picked from a population of $m$ users, requests a recommendation. The decision-maker observes the user and selects an item from a catalogue of $n$ items. Importantly, an item cannot be recommended twice to the same u…

    Submitted 23 October, 2020; originally announced October 2020.

    Comments: Advances in Neural Information Processing Systems (NeurIPS 2020)

  22. arXiv:2010.11994  [pdf, other]

    stat.ML cs.LG

    Thresholded Lasso Bandit

    Authors: Kaito Ariu, Kenshi Abe, Alexandre Proutière

    Abstract: In this paper, we revisit the regret minimization problem in sparse stochastic contextual linear bandits, where feature vectors may be of large dimension $d$, but where the reward function depends on a few, say $s_0\ll d$, of these features only. We present Thresholded Lasso bandit, an algorithm that (i) estimates the vector defining the reward function as well as its sparse support, i.e., signifi…

    Submitted 19 June, 2022; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: International Conference on Machine Learning (ICML 2022), Proceedings of Machine Learning Research

  23. arXiv:1910.06002  [pdf, other]

    stat.ML cs.LG

    Optimal Clustering from Noisy Binary Feedback

    Authors: Kaito Ariu, Jungseul Ok, Alexandre Proutiere, Se-Young Yun

    Abstract: We study the problem of clustering a set of items from binary user feedback. Such a problem arises in crowdsourcing platforms solving large-scale labeling tasks with minimal effort put on the users. For example, in some of the recent reCAPTCHA systems, users' clicks (binary answers) can be used to efficiently label images. In our inference problem, items are grouped into initially unknown non-overl…

    Submitted 5 February, 2024; v1 submitted 14 October, 2019; originally announced October 2019.