Showing 1–16 of 16 results for author: Morimura, T

  1. arXiv:2405.01280  [pdf, other]

    cs.CL

    Reinforcement Learning for Edit-Based Non-Autoregressive Neural Machine Translation

    Authors: Hao Wang, Tetsuro Morimura, Ukyo Honda, Daisuke Kawahara

    Abstract: Non-autoregressive (NAR) language models are known for their low latency in neural machine translation (NMT). However, a performance gap exists between NAR and autoregressive models due to the large decoding space and difficulty in capturing dependency between target words accurately. Compounding this, preparing appropriate training data for NAR models is a non-trivial task, often exacerbating exp…

    Submitted 2 July, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

    Comments: NAACL SRW 2024

  2. arXiv:2404.13846  [pdf, other]

    cs.LG cs.AI cs.CL

    Filtered Direct Preference Optimization

    Authors: Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu

    Abstract: Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the RLHF framework, to our knowledge, have been limited. This paper addresses the issue of text quality within the preference dataset by focusing on direct prefere…

    Submitted 4 July, 2024; v1 submitted 21 April, 2024; originally announced April 2024.

  3. arXiv:2404.01054  [pdf, other]

    cs.CL cs.AI

    Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment

    Authors: Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe

    Abstract: Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding. BoN sampling is susceptible to a problem known as reward hacking. Because the reward model is an imperfect proxy for the true objective, over-optimizing its value can compromise its performance on the true objective. A commo…

    Submitted 23 June, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

  4. arXiv:2404.00752  [pdf, other]

    cs.CL cs.AI

    On the True Distribution Approximation of Minimum Bayes-Risk Decoding

    Authors: Atsumoto Ohashi, Ukyo Honda, Tetsuro Morimura, Yuu Jinnai

    Abstract: Minimum Bayes-risk (MBR) decoding has recently gained renewed attention in text generation. MBR decoding considers texts sampled from a model as pseudo-references and selects the text with the highest similarity to the others. Therefore, sampling is one of the key elements of MBR decoding, and previous studies reported that the performance varies by sampling methods. From a theoretical standpoint,…

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: NAACL 2024 (main conference)

  5. arXiv:2402.03923  [pdf, other]

    cs.LG

    Return-Aligned Decision Transformer

    Authors: Tsunehiko Tanaka, Kenshi Abe, Kaito Ariu, Tetsuro Morimura, Edgar Simo-Serra

    Abstract: Traditional approaches in offline reinforcement learning aim to learn the optimal policy that maximizes the cumulative reward, also known as return. However, as applications broaden, it becomes increasingly crucial to train agents that not only maximize the returns, but align the actual return with a specified target return, giving control over the agent's performance. Decision Transformer (DT) op…

    Submitted 27 May, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

  6. arXiv:2401.05054  [pdf, other]

    cs.CL cs.AI

    Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding

    Authors: Yuu Jinnai, Ukyo Honda, Tetsuro Morimura, Peinan Zhang

    Abstract: One of the most important challenges in text generation systems is to produce outputs that are not only correct but also diverse. Recently, Minimum Bayes-Risk (MBR) decoding has gained prominence for generating sentences of the highest quality among the decoding algorithms. However, existing algorithms proposed for generating diverse outputs are predominantly based on beam search or random samplin…

    Submitted 11 June, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

  7. arXiv:2311.05263  [pdf, other]

    cs.AI cs.CL

    Model-Based Minimum Bayes Risk Decoding for Text Generation

    Authors: Yuu Jinnai, Tetsuro Morimura, Ukyo Honda, Kaito Ariu, Kenshi Abe

    Abstract: Minimum Bayes Risk (MBR) decoding has been shown to be a powerful alternative to beam search decoding in a variety of text generation tasks. MBR decoding selects a hypothesis from a pool of hypotheses that has the least expected risk under a probability model according to a given utility function. Since it is impractical to compute the expected risk exactly over all possible hypotheses, two approx…

    Submitted 11 June, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

  8. arXiv:2310.14768  [pdf, other]

    cs.LG cs.AI

    Policy Gradient with Kernel Quadrature

    Authors: Satoshi Hayakawa, Tetsuro Morimura

    Abstract: Reward evaluation of episodes becomes a bottleneck in a broad range of reinforcement learning tasks. Our aim in this paper is to select a small but representative subset of a large batch of episodes, only on which we actually compute rewards for more efficient policy gradient iterations. We build a Gaussian process modeling of discounted returns or rewards to derive a positive definite kernel on t…

    Submitted 5 December, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: 18 pages, 2 figures

  9. arXiv:2308.13696  [pdf, other]

    cs.CL cs.AI

    On the Depth between Beam Search and Exhaustive Search for Text Generation

    Authors: Yuu Jinnai, Tetsuro Morimura, Ukyo Honda

    Abstract: Beam search and exhaustive search are two extreme ends of text decoding algorithms with respect to the search depth. Beam search is limited in both search width and depth, whereas exhaustive search is a global search that has no such limitations. Surprisingly, beam search is not only computationally cheaper but also performs better than exhaustive search despite its higher search error. Plenty of…

    Submitted 25 August, 2023; originally announced August 2023.

  10. arXiv:2307.06721  [pdf, other]

    cs.CL cs.LG

    Why Guided Dialog Policy Learning performs well? Understanding the role of adversarial learning and its alternative

    Authors: Sho Shimoyama, Tetsuro Morimura, Kenshi Abe, Toda Takamichi, Yuta Tomomatsu, Masakazu Sugiyama, Asahi Hentona, Yuuki Azuma, Hirotaka Ninomiya

    Abstract: Dialog policies, which determine a system's action based on the current state at each dialog turn, are crucial to the success of the dialog. In recent years, reinforcement learning (RL) has emerged as a promising option for dialog policy learning (DPL). In RL-based DPL, dialog policies are updated according to rewards. The manual construction of fine-grained rewards, such as state-action-based one…

    Submitted 13 July, 2023; originally announced July 2023.

  11. arXiv:2306.05292  [pdf, other]

    cs.IR cs.LG

    Safe Collaborative Filtering

    Authors: Riku Togashi, Tatsushi Oka, Naoto Ohsaka, Tetsuro Morimura

    Abstract: Excellent tail performance is crucial for modern machine learning tasks, such as algorithmic fairness, class imbalance, and risk-sensitive decision making, as it ensures the effective handling of challenging samples within a dataset. Tail performance is also a vital determinant of success for personalized recommender systems to reduce the risk of losing users with low satisfaction. This study intr…

    Submitted 28 February, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: accepted at ICLR2024

  12. arXiv:2206.01011  [pdf, other]

    cs.LG cs.AI

    Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes

    Authors: Tetsuro Morimura, Kazuhiro Ota, Kenshi Abe, Peinan Zhang

    Abstract: Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model for an expected return using gradient ascent. While PG can work well even in non-Markovian environments, it may encounter plateaus or peakiness issues. As another successful RL approach, algorithms based on Monte Carlo Tree Search (MCTS), which include AlphaZero, have obtained groundbreaking…

    Submitted 4 July, 2024; v1 submitted 2 June, 2022; originally announced June 2022.

    Comments: Accepted to Reinforcement Learning Conference (RLC) 2024

  13. arXiv:2010.01404  [pdf, other]

    cs.LG stat.ML

    Mean-Variance Efficient Reinforcement Learning by Expected Quadratic Utility Maximization

    Authors: Masahiro Kato, Kei Nakagawa, Kenshi Abe, Tetsuro Morimura

    Abstract: Risk management is critical in decision making, and mean-variance (MV) trade-off is one of the most common criteria. However, in reinforcement learning (RL) for sequential decision making under uncertainty, most of the existing methods for MV control suffer from computational difficulties caused by the double sampling problem. In this paper, in contrast to strict MV control, we consider learning M…

    Submitted 5 September, 2021; v1 submitted 3 October, 2020; originally announced October 2020.

  14. arXiv:1907.01221  [pdf]

    cs.AI cs.HC cs.LG

    Visual analytics for team-based invasion sports with significant events and Markov reward process

    Authors: Kun Zhao, Takayuki Osogami, Tetsuro Morimura

    Abstract: In team-based invasion sports such as soccer and basketball, analytics is important for teams to understand their performance and for audiences to understand matches better. The present work focuses on performing visual analytics to evaluate the value of any kind of event occurring in a sports match with a continuous parameter space. Here, the continuous parameter space involves the time, location…

    Submitted 2 July, 2019; originally announced July 2019.

    Comments: 8 pages, 10 figures

  15. arXiv:1906.06663  [pdf, ps, other]

    stat.ML cs.LG

    Sampler for Composition Ratio by Markov Chain Monte Carlo

    Authors: Yachiko Obara, Tetsuro Morimura, Hiroki Yanagisawa

    Abstract: Invention involves combination, or more precisely, ratios of composition. According to Thomas Edison, "Genius is one percent inspiration and 99 percent perspiration" is an example. In many situations, researchers and inventors already have a variety of data and manage to create something new by using it, but the key problem is how to select and combine knowledge. In this paper, we propose a new Ma…

    Submitted 28 June, 2019; v1 submitted 16 June, 2019; originally announced June 2019.

    Comments: 8 pages, 4 figures

  16. arXiv:1203.3497  [pdf]

    cs.LG stat.ML

    Parametric Return Density Estimation for Reinforcement Learning

    Authors: Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, Toshiyuki Tanaka

    Abstract: Most conventional Reinforcement Learning (RL) algorithms aim to optimize decision-making rules in terms of the expected returns. However, especially for risk management purposes, other risk-sensitive criteria such as the value-at-risk or the expected shortfall are sometimes preferred in real applications. Here, we describe a parametric method for estimating density of the returns, which allows us…

    Submitted 15 March, 2012; originally announced March 2012.

    Comments: Appears in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI2010)

    Report number: UAI-P-2010-PG-368-375