
Critique-out-Loud Reward Models

Authors: Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, Prithviraj Ammanabrolu

Abstract: Traditionally, reward models used for reinforcement learning from human feedback (RLHF) are trained to directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits the capabilities of reward models as they must reason implicitly about the quality of a response, i.e., preference modeling must be performed in a single forward pass through the model. To enable reward models to reason explicitly about the quality of a response, we introduce Critique-out-Loud (CLoud) reward models. CLoud reward models operate by first generating a natural language critique of the assistant's response that is then used to predict a scalar reward for the quality of the response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models: compared to classic reward models, CLoud reward models improve pairwise preference classification accuracy on RewardBench by 4.65 and 5.84 percentage points for the 8B and 70B base models, respectively. Furthermore, CLoud reward models lead to a Pareto improvement for win rate on ArenaHard when used as the scoring model for Best-of-N. Finally, we explore how to exploit the dynamic inference compute capabilities of CLoud reward models by performing self-consistency decoding for reward prediction.

Submitted 21 August, 2024; originally announced August 2024.
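
The Critique-out-Loud abstract above mentions Best-of-N scoring and self-consistency decoding for reward prediction. The sketch below is one plausible reading, not the authors' implementation: it assumes self-consistency means averaging the scalar rewards obtained from several independently sampled critiques. The names `self_consistent_reward`, `best_of_n`, and `make_toy_scorer` are hypothetical, and the toy scorer merely stands in for a real critique-then-score model.

```python
import random
from statistics import mean


def self_consistent_reward(score_once, prompt, response, num_critiques=8):
    """Average the rewards from several independently sampled critiques.

    `score_once(prompt, response)` is assumed to sample one critique and
    return the scalar reward predicted from it.
    """
    return mean(score_once(prompt, response) for _ in range(num_critiques))


def best_of_n(score_once, prompt, candidates, num_critiques=8):
    """Return the candidate response with the highest averaged reward."""
    return max(
        candidates,
        key=lambda response: self_consistent_reward(
            score_once, prompt, response, num_critiques
        ),
    )


def make_toy_scorer(seed=0):
    """Hypothetical stand-in for 'sample a critique, then read off a reward'."""
    rng = random.Random(seed)

    def score_once(prompt, response):
        # Toy signal: number of distinct words, plus noise that mimics the
        # randomness of sampling a critique.
        return len(set(response.split())) + rng.gauss(0.0, 0.5)

    return score_once


if __name__ == "__main__":
    scorer = make_toy_scorer()
    answers = ["yes", "yes it is", "yes it is because air scatters blue light"]
    print(best_of_n(scorer, "Why is the sky blue?", answers))
```

Raising `num_critiques` spends more inference compute per response in exchange for a less noisy reward estimate, which is the trade-off the abstract alludes to.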

arXiv:2404.16767 [pdf, other]

REBEL: Reinforcement Learning via Regressing Relative Rewards

Authors: Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun

Abstract: While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the workhorse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g., value networks, clipping), and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt in terms of the policy, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and be extended to handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with performance stronger than or similar to that of PPO and DPO, all while being simpler to implement and more computationally efficient than PPO. When fine-tuning Llama-3-8B-Instruct, REBEL achieves strong performance in AlpacaEval 2.0, MT-Bench, and Open LLM Leaderboard.

Submitted 1 September, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

Comments: New experimental results on general chat
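
The REBEL abstract above describes regressing the relative reward between two completions "in terms of the policy". A minimal sketch of such a loss for a single pair of completions follows. The exact form used here (the difference of log-probability ratios between the new and previous policy, scaled by a step-size parameter `eta`) is an assumption based on a reading of that sentence, and `rebel_style_loss` and its argument names are hypothetical.

```python
def rebel_style_loss(
    logp_new_a, logp_old_a, logp_new_b, logp_old_b, reward_a, reward_b, eta=1.0
):
    """Squared error between a policy-implied reward gap and the observed gap.

    logp_new_* : log-probability of completion a / b under the policy being trained
    logp_old_* : log-probability of the same completion under the previous policy
    reward_*   : scalar reward assigned to each completion
    eta        : assumed step-size parameter scaling the log-ratio difference
    """
    # Reward gap implied by how much the policy shifted mass toward a versus b.
    predicted_gap = ((logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)) / eta
    # Reward gap actually observed for the pair.
    observed_gap = reward_a - reward_b
    return (predicted_gap - observed_gap) ** 2


if __name__ == "__main__":
    # The policy has not moved, but completion a earned a higher reward,
    # so the regression target is not yet met and the loss is positive.
    print(rebel_style_loss(-5.0, -5.0, -6.0, -6.0, reward_a=1.0, reward_b=0.0))
```

Under this reading, no value network or clipping appears: the only learned quantity is the policy's own log-probabilities, which is consistent with the abstract's claim of a lightweight implementation.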

arXiv:2404.08513 [pdf, other]

Adversarial Imitation Learning via Boosting

Authors: Jonathan D. Chang, Dhruv Sreenivas, Yingbing Huang, Kianté Brantley, Wen Sun

Abstract: Adversarial imitation learning (AIL) has stood out as a dominant framework across various imitation learning (IL) applications, with Discriminator Actor Critic (DAC) (Kostrikov et al., 2019) demonstrating the effectiveness of off-policy learning algorithms in improving sample efficiency and scalability to higher-dimensional observations. Despite DAC's empirical success, the original AIL objective is on-policy and DAC's ad-hoc application of off-policy training does not guarantee successful imitation (Kostrikov et al., 2019; 2020). Follow-up work such as ValueDICE (Kostrikov et al., 2020) tackles this issue by deriving a fully off-policy AIL objective. In this work, we instead develop a novel and principled AIL algorithm via the framework of boosting. Like boosting, our new algorithm, AILBoost, maintains an ensemble of properly weighted weak learners (i.e., policies) and trains a discriminator that witnesses the maximum discrepancy between the distributions of the ensemble and the expert policy. We maintain a weighted replay buffer to represent the state-action distribution induced by the ensemble, allowing us to train discriminators using the entire data collected so far. In the weighted replay buffer, the contribution of the data from older policies is properly discounted with the weight computed based on the boosting framework. Empirically, we evaluate our algorithm on both controller state-based and pixel-based environments from the DeepMind Control Suite. AILBoost outperforms DAC on both types of environments, demonstrating the benefit of properly weighting replay buffer data for off-policy training. On state-based environments, AILBoost outperforms ValueDICE and IQ-Learn (Garg et al., 2021), achieving competitive performance with as little as one expert trajectory.

Submitted 12 April, 2024; originally announced April 2024.

Comments: 19 pages, 7 figures, 4 tables, 3 algorithms, ICLR 2024

arXiv:2404.08495 [pdf, other]

Dataset Reset Policy Optimization for RLHF

Authors: Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun

Abstract: Reinforcement Learning (RL) from Human Preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful and Harmless (HH) dataset, the generation from DR-PO is better than that from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the metric of GPT4 win-rate. Code for this work can be found at https://github.com/Cornell-RL/drpo.

Submitted 16 April, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

Comments: 28 pages, 6 tables, 3 Figures, 3 Algorithms

arXiv:2404.03673 [pdf, other]

RL for Consistency Models: Faster Reward Guided Text-to-Image Generation

Authors: Owen Oertell, Jonathan D. Chang, Yiyi Zhang, Kianté Brantley, Wen Sun

Abstract: Reinforcement learning (RL) has improved guided image generation with diffusion models by directly optimizing rewards that capture image quality, aesthetics, and instruction following capabilities. However, the resulting generative policies inherit the same iterative sampling process of diffusion models that causes slow generation. To overcome this limitation, consistency models proposed learning a new class of generative models that directly map noise to data, resulting in a model that can generate an image in as few as one sampling iteration. In this work, to optimize text-to-image generative models for task-specific rewards and enable fast training and inference, we propose a framework for fine-tuning consistency models via RL. Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure. Compared to RL fine-tuned diffusion models, RLCM trains significantly faster, improves the quality of the generation measured under the reward objectives, and speeds up the inference procedure by generating high quality images with as few as two inference steps. Experimentally, we show that RLCM can adapt text-to-image consistency models to objectives that are challenging to express with prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Our code is available at https://rlcm.owenoertell.com.

Submitted 22 June, 2024; v1 submitted 25 March, 2024; originally announced April 2024.

Comments: 18 pages, 9 figures, 1 table

arXiv:2310.04407 [pdf, other]

Policy-Gradient Training of Language Models for Ranking

Authors: Ge Gao, Jonathan D. Chang, Claire Cardie, Kianté Brantley, Thorsten Joachims

Abstract: Text retrieval plays a crucial role in incorporating factual knowledge for decision making into language processing pipelines, ranging from chat-based web search to question answering systems. Current state-of-the-art text retrieval models leverage pre-trained large language models (LLMs) to achieve competitive performance, but training LLM-based retrievers via typical contrastive losses requires intricate heuristics, including selecting hard negatives and using additional supervision as learning signals. This reliance on heuristics stems from the fact that the contrastive loss itself is heuristic and does not directly optimize the downstream metrics of decision quality at the end of the processing pipeline. To address this issue, we introduce Neural PG-RANK, a novel training algorithm that learns to rank by instantiating an LLM as a Plackett-Luce ranking policy. Neural PG-RANK provides a principled method for end-to-end training of retrieval models as part of larger decision systems via policy gradient, with little reliance on complex heuristics, and it effectively unifies the training objective with downstream decision-making quality. We conduct extensive experiments on various text retrieval benchmarks. The results demonstrate that when the training objective aligns with the evaluation setup, Neural PG-RANK yields remarkable in-domain performance improvement, with substantial out-of-domain generalization to some critical datasets employed in downstream question answering tasks.

Submitted 6 October, 2023; originally announced October 2023.
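
The Neural PG-RANK abstract above refers to a Plackett-Luce ranking policy trained by policy gradient. As an illustration of the standard Plackett-Luce model only (not the paper's training code), the sketch below samples a ranking by repeatedly drawing an item with probability proportional to the exponential of its score, and computes the log-probability of a given ranking. The function names are hypothetical, and in a real system the scores would come from the LLM-based retriever.

```python
import math
import random


def sample_ranking(scores, rng):
    """Sample a full ranking (list of item indices) under Plackett-Luce."""
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        # Each remaining item is chosen with probability softmax(score).
        weights = [math.exp(scores[i]) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking


def ranking_log_prob(scores, ranking):
    """Log-probability of `ranking` under Plackett-Luce with the given scores."""
    log_prob = 0.0
    remaining = list(ranking)
    for item in ranking:
        # Normalize over the items that have not been placed yet.
        log_norm = math.log(sum(math.exp(scores[i]) for i in remaining))
        log_prob += scores[item] - log_norm
        remaining.remove(item)
    return log_prob


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [1.0, 0.5, -0.3]  # hypothetical relevance scores for three documents
    ranking = sample_ranking(scores, rng)
    print(ranking, ranking_log_prob(scores, ranking))
```

A score-function (REINFORCE-style) policy gradient would weight the gradient of `ranking_log_prob` by the utility of each sampled ranking; whether the paper uses exactly that estimator is not stated in this listing.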

arXiv:2306.11816 [pdf, other]

Learning to Generate Better Than Your LLM

Authors: Jonathan D. Chang, Kianté Brantley, Rajkumar Ramamurthy, Dipendra Misra, Wen Sun

Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning Large Language Models (LLMs) for text generation. In particular, recent LLMs such as ChatGPT and GPT-4 can engage in fluent conversations with users after finetuning with RL. Capitalizing on key properties of text generation, we seek to investigate RL algorithms beyond general purpose algorithms like Proximal Policy Optimization (PPO). In particular, we extend RL algorithms to allow them to interact with a dynamic black-box guide LLM and propose RL with guided feedback (RLGF), a suite of RL algorithms for LLM fine-tuning. We provide two ways for the guide LLM to interact with the LLM to be optimized for maximizing rewards. The guide LLM can generate text which serves as additional starting states for the RL optimization procedure. The guide LLM can also be used to complete the partial sentences generated by the LLM that is being optimized, treating the guide LLM as an expert to imitate and surpass eventually. We experiment on the IMDB positive sentiment, CommonGen, and TL;DR summarization tasks. We show that our RL algorithms achieve higher performance than supervised learning (SL) and the RL baseline PPO, demonstrating the benefit of interaction with the guide LLM. On both CommonGen and TL;DR, we not only outperform our SL baselines but also improve upon PPO across a variety of metrics beyond the one we optimized for. Our code can be found at https://github.com/Cornell-RL/tril.

Submitted 13 November, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

Comments: 23 pages, 5 figures, 7 tables, 4 algorithms

arXiv:2209.06348 [pdf, other]

COVID-19 Regional Waves and Spread Risk Assessment through the Analysis of the Initial Outbreak in Guatemala

Authors: Juan Adolfo Ponciano, Juan Diego Chang, Mariela Abdalah, Kevin Facey, José Miguel Ponciano

Abstract: The initial surge of the COVID-19 pandemic hit Guatemala in March 2020. On a country scale, the epidemic has undergone a fairly well-known and distinguishable initial phase, reaching its peak in mid-July 2020. However, the detailed picture is more involved and reflects inter-regional variations in the epidemic dynamics, presumably grounded in socio-demographic, connectivity, and human mobility factors. Classifying the regional epidemic curves and identifying the major hubs of regional COVID-19 spread can contribute towards defining an evidence-based risk map for future outbreaks of infectious diseases with similar transmissibility properties. In this work, we make a regional wave decomposition of the initial epidemic phase registered in Guatemala, and we use the Richards phenomenological model alongside multivariate ordination techniques of its estimated model parameters to draw a countrywide picture of the first epidemiological wave. By exploring similarities in the model space parameters, we traced routes for the disease spread across the country. We evaluated how well the proposed classification can help to define a regional risk hierarchy comprising early stage focal points, major hubs, and secondary regions of epidemic progression.

Submitted 13 September, 2022; originally announced September 2022.

Comments: 22 pages, 7 figures, 2 tables
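
The Guatemala abstract above relies on the Richards phenomenological model. For orientation, the sketch below evaluates one common parametrization of the Richards (generalized logistic) curve for cumulative cases, C(t) = K / (1 + a·exp(−r(t − t_m)))^(1/a). Which parametrization the paper uses is not stated in this listing, so the formula, the function name, and the parameter values are all assumptions.

```python
import math


def richards_cumulative_cases(t, final_size, growth_rate, shape, turning_point):
    """Cumulative cases at time t under one common form of the Richards curve.

    final_size    : K, the final size of the wave
    growth_rate   : r, the per-day growth rate
    shape         : a > 0, asymmetry exponent (a = 1 gives the logistic curve)
    turning_point : t_m, time shift locating the inflection region
    """
    decay = math.exp(-growth_rate * (t - turning_point))
    return final_size / (1.0 + shape * decay) ** (1.0 / shape)


if __name__ == "__main__":
    # Hypothetical parameters for a single regional wave.
    for day in (0, 30, 60, 90, 120):
        cases = richards_cumulative_cases(
            day, final_size=1000.0, growth_rate=0.1, shape=0.5, turning_point=60.0
        )
        print(day, round(cases, 1))
```

Fitting these four parameters per region yields the kind of parameter vectors that the abstract says are then compared with multivariate ordination techniques.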

arXiv:2207.05837 [pdf, other]

Learning Bellman Complete Representations for Offline Policy Evaluation

Authors: Jonathan D. Chang, Kaiwen Wang, Nathan Kallus, Wen Sun

Abstract: We study representation learning for Offline Reinforcement Learning (RL), focusing on the important task of Offline Policy Evaluation (OPE). Recent work shows that, in contrast to supervised learning, realizability of the Q-function is not enough for learning it. Two sufficient conditions for sample-efficient OPE are Bellman completeness and coverage. Prior work often assumes that representations satisfying these conditions are given, with results being mostly theoretical in nature. In this work, we propose BCRL, which directly learns from data an approximately linear Bellman complete representation with good coverage. With this learned representation, we perform OPE using Least Squares Policy Evaluation (LSPE) with linear functions in our learned representation. We present an end-to-end theoretical analysis, showing that our two-stage algorithm enjoys polynomial sample complexity provided some representation in the rich class considered is linear Bellman complete. Empirically, we extensively evaluate our algorithm on challenging, image-based continuous control tasks from the DeepMind Control Suite. We show our representation enables better OPE compared to previous representation learning methods developed for off-policy RL (e.g., CURL, SPR). BCRL achieves competitive OPE error with the state-of-the-art method Fitted Q-Evaluation (FQE), and beats FQE when evaluating beyond the initial state distribution. Our ablations show that both linear Bellman complete and coverage components of our method are crucial.

Submitted 12 July, 2022; originally announced July 2022.

Comments: Accepted for Long Talk at ICML 2022

Journal ref: Proceedings of the 39th International Conference on Machine Learning, PMLR 162:2938-2971, 2022

arXiv:2203.15228 [pdf, other]

doi 10.1109/SoutheastCon48659.2022.9763890

SHOP: A Deep Learning Based Pipeline for near Real-Time Detection of Small Handheld Objects Present in Blurry Video

Authors: Abhinav Ganguly, Amar C Gandhi, Sylvia E, Jeffrey D Chang, Ian M Hudson

Abstract: While prior works have investigated and developed computational models capable of object detection, models still struggle to reliably interpret images with motion blur and small objects. Moreover, none of these models are specifically designed for handheld object detection. In this work, we present SHOP (Small Handheld Object Pipeline), a pipeline that reliably and efficiently interprets blurry images containing handheld objects. The specific models used in each stage of the pipeline are flexible and can be changed based on performance requirements. First, images are deblurred and then run through a pose detection system where areas-of-interest are proposed around the hands of any people present. Next, object detection is performed on the images by a single-stage object detector. Finally, the proposed areas-of-interest are used to filter out low confidence detections. Testing on a handheld subset of Microsoft Common Objects in Context (MS COCO) demonstrates that this three-stage process results in a 70 percent decrease in false positives while only reducing true positives by 17 percent in its strongest configuration. We also present a subset of MS COCO consisting solely of handheld objects that can be used to continue the development of handheld object detection methods. https://github.com/spider-sense/SHOP

Submitted 29 March, 2022; originally announced March 2022.

Comments: 8 pages, 5 figures. Accepted to IEEE SoutheastCon 2022

arXiv:2106.03207 [pdf, other]

Mitigating Covariate Shift in Imitation Learning via Offline Data Without Great Coverage

Authors: Jonathan D. Chang, Masatoshi Uehara, Dhruv Sreenivas, Rahul Kidambi, Wen Sun

Abstract: This paper studies offline Imitation Learning (IL) where an agent learns to imitate an expert demonstrator without additional online environment interactions. Instead, the learner is presented with a static offline dataset of state-action-next state transition triples from a potentially less proficient behavior policy. We introduce Model-based IL from Offline data (MILO): an algorithmic framework that utilizes the static dataset to solve the offline IL problem efficiently both in theory and in practice. In theory, even if the behavior policy is highly sub-optimal compared to the expert, we show that as long as the data from the behavior policy provides sufficient coverage on the expert state-action traces (and with no necessity for a global coverage over the entire state-action space), MILO can provably combat the covariate shift issue in IL. Complementing our theoretical results, we also demonstrate that a practical implementation of our approach mitigates covariate shift on benchmark MuJoCo continuous control tasks. We demonstrate that with behavior policies whose performance is less than half of that of the expert, MILO still successfully imitates with an extremely low number of expert state-action pairs, while traditional offline IL methods such as behavior cloning (BC) fail completely. Source code is provided at https://github.com/jdchang1/milo.

Submitted 31 January, 2022; v1 submitted 6 June, 2021; originally announced June 2021.

Comments: 42 pages, 5 figures, 7 tables

arXiv:1912.08761 [pdf, ps, other]

Gravitational realization of magnons in a ferromagnetic spin chain

Authors: Juan Diego Chang, Rodrigo de Leon Ardon, Juan Ponciano, Giovanni Ramirez

Abstract: A gravitational model of magnons in thermal equilibrium with a ferromagnetic spin chain is developed in a phenomenological bottom-up approach. A large Schwarzschild-AdS black hole background is used as the thermal reservoir and the magnon dynamics is obtained by scalar fields and branes in the bulk. The key feature of this model is that the coupling of the spin chain is related to the radial position at which the brane is located. We further study a ferromagnetic spin chain with a competing interaction and find that the couplings are related by the difference of positions of the branes. We show how to obtain the model from a weak limit of a dynamical gravitational system. This allows us to embed the model into a holographic system. The couplings can be related to entanglement entropy at finite temperature of the CFT since the turning point of minimal surfaces coincides with the position of the branes. The difference of entropy is used to define a notion of distance between the chain couplings.

Submitted 21 January, 2020; v1 submitted 18 December, 2019; originally announced December 2019.

Comments: 9 pages. Heavy editing, conceptual issues corrected

Showing 1–12 of 12 results for author: Chang, J D