Showing 1–3 of 3 results for author: Coste, T

Searching in archive cs.
  1. arXiv:2402.13210  [pdf, other]

    cs.LG

    Bayesian Reward Models for LLM Alignment

    Authors: Adam X. Yang, Maxime Robeyns, Thomas Coste, Zhengyan Shi, Jun Wang, Haitham Bou-Ammar, Laurence Aitchison

    Abstract: To ensure that large language model (LLM) responses are helpful and non-toxic, a reward model trained on human preference data is usually used. LLM responses with high rewards are then selected through best-of-$n$ (BoN) sampling or the LLM is further optimized to produce responses with high rewards through reinforcement learning from human feedback (RLHF). However, these processes are susceptible…

    Submitted 2 July, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

  2. arXiv:2312.14878  [pdf, other]

    cs.AI cs.LG

    Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning

    Authors: Filippos Christianos, Georgios Papoudakis, Matthieu Zimmer, Thomas Coste, Zhihao Wu, Jingxuan Chen, Khyati Khandelwal, James Doran, Xidong Feng, Jiacheng Liu, Zheng Xiong, Yicheng Luo, Jianye Hao, Kun Shao, Haitham Bou-Ammar, Jun Wang

    Abstract: A key method for creating Artificial Intelligence (AI) agents is Reinforcement Learning (RL). However, constructing a standalone RL policy that maps perception to action directly encounters severe problems, chief among them being its lack of generality across multiple tasks and the need for a large amount of training data. The leading cause is that it cannot effectively integrate prior information…

    Submitted 22 December, 2023; originally announced December 2023.

    Comments: paper and appendix, 27 pages

  3. arXiv:2310.02743  [pdf, other]

    cs.LG

    Reward Model Ensembles Help Mitigate Overoptimization

    Authors: Thomas Coste, Usman Anwar, Robert Kirk, David Krueger

    Abstract: Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions. As part of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the "true" reward, these learned reward models are susceptible to overoptimization. Gao et al. (2023) studied this phenomenon…

    Submitted 10 March, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted at ICLR 2024