-
Critique-out-Loud Reward Models
Authors:
Zachary Ankner,
Mansheej Paul,
Brandon Cui,
Jonathan D. Chang,
Prithviraj Ammanabrolu
Abstract:
Traditionally, reward models used for reinforcement learning from human feedback (RLHF) are trained to directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits the capabilities of reward models as they must reason implicitly about the quality of a response, i.e., preference modeling must be performed in a single for…
▽ More
Traditionally, reward models used for reinforcement learning from human feedback (RLHF) are trained to directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits the capabilities of reward models as they must reason implicitly about the quality of a response, i.e., preference modeling must be performed in a single forward pass through the model. To enable reward models to reason explicitly about the quality of a response, we introduce Critique-out-Loud (CLoud) reward models. CLoud reward models operate by first generating a natural language critique of the assistant's response that is then used to predict a scalar reward for the quality of the response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models: compared to classic reward models CLoud reward models improve pairwise preference classification accuracy on RewardBench by 4.65 and 5.84 percentage points for the 8B and 70B base models respectively. Furthermore, CLoud reward models lead to a Pareto improvement for win rate on ArenaHard when used as the scoring model for Best-of-N. Finally, we explore how to exploit the dynamic inference compute capabilities of CLoud reward models by performing self-consistency decoding for reward prediction.
△ Less
Submitted 21 August, 2024;
originally announced August 2024.
-
REBEL: Reinforcement Learning via Regressing Relative Rewards
Authors:
Zhaolin Gao,
Jonathan D. Chang,
Wenhao Zhan,
Owen Oertell,
Gokul Swamy,
Kianté Brantley,
Thorsten Joachims,
J. Andrew Bagnell,
Jason D. Lee,
Wen Sun
Abstract:
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping), and is notorious for its sensitivity to the precise impleme…
▽ More
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping), and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt in terms of the policy, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and be extended to handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally efficient than PPO. When fine-tuning Llama-3-8B-Instruct, REBEL achieves strong performance in AlpacaEval 2.0, MT-Bench, and Open LLM Leaderboard.
△ Less
Submitted 1 September, 2024; v1 submitted 25 April, 2024;
originally announced April 2024.
-
Adversarial Imitation Learning via Boosting
Authors:
Jonathan D. Chang,
Dhruv Sreenivas,
Yingbing Huang,
Kianté Brantley,
Wen Sun
Abstract:
Adversarial imitation learning (AIL) has stood out as a dominant framework across various imitation learning (IL) applications, with Discriminator Actor Critic (DAC) (Kostrikov et al.,, 2019) demonstrating the effectiveness of off-policy learning algorithms in improving sample efficiency and scalability to higher-dimensional observations. Despite DAC's empirical success, the original AIL objective…
▽ More
Adversarial imitation learning (AIL) has stood out as a dominant framework across various imitation learning (IL) applications, with Discriminator Actor Critic (DAC) (Kostrikov et al.,, 2019) demonstrating the effectiveness of off-policy learning algorithms in improving sample efficiency and scalability to higher-dimensional observations. Despite DAC's empirical success, the original AIL objective is on-policy and DAC's ad-hoc application of off-policy training does not guarantee successful imitation (Kostrikov et al., 2019; 2020). Follow-up work such as ValueDICE (Kostrikov et al., 2020) tackles this issue by deriving a fully off-policy AIL objective. Instead in this work, we develop a novel and principled AIL algorithm via the framework of boosting. Like boosting, our new algorithm, AILBoost, maintains an ensemble of properly weighted weak learners (i.e., policies) and trains a discriminator that witnesses the maximum discrepancy between the distributions of the ensemble and the expert policy. We maintain a weighted replay buffer to represent the state-action distribution induced by the ensemble, allowing us to train discriminators using the entire data collected so far. In the weighted replay buffer, the contribution of the data from older policies are properly discounted with the weight computed based on the boosting framework. Empirically, we evaluate our algorithm on both controller state-based and pixel-based environments from the DeepMind Control Suite. AILBoost outperforms DAC on both types of environments, demonstrating the benefit of properly weighting replay buffer data for off-policy training. On state-based environments, DAC outperforms ValueDICE and IQ-Learn (Gary et al., 2021), achieving competitive performance with as little as one expert trajectory.
△ Less
Submitted 12 April, 2024;
originally announced April 2024.
-
Dataset Reset Policy Optimization for RLHF
Authors:
Jonathan D. Chang,
Wenhao Zhan,
Owen Oertell,
Kianté Brantley,
Dipendra Misra,
Jason D. Lee,
Wen Sun
Abstract:
Reinforcement Learning (RL) from Human Preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of r…
▽ More
Reinforcement Learning (RL) from Human Preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as good as any policy that is covered by the offline dataset under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful Harmful (HH) dataset, the generation from DR-PO is better than that from Proximal Policy Optimization (PPO) and Direction Preference Optimization (DPO), under the metric of GPT4 win-rate. Code for this work can be found at https://github.com/Cornell-RL/drpo.
△ Less
Submitted 16 April, 2024; v1 submitted 12 April, 2024;
originally announced April 2024.
-
RL for Consistency Models: Faster Reward Guided Text-to-Image Generation
Authors:
Owen Oertell,
Jonathan D. Chang,
Yiyi Zhang,
Kianté Brantley,
Wen Sun
Abstract:
Reinforcement learning (RL) has improved guided image generation with diffusion models by directly optimizing rewards that capture image quality, aesthetics, and instruction following capabilities. However, the resulting generative policies inherit the same iterative sampling process of diffusion models that causes slow generation. To overcome this limitation, consistency models proposed learning…
▽ More
Reinforcement learning (RL) has improved guided image generation with diffusion models by directly optimizing rewards that capture image quality, aesthetics, and instruction following capabilities. However, the resulting generative policies inherit the same iterative sampling process of diffusion models that causes slow generation. To overcome this limitation, consistency models proposed learning a new class of generative models that directly map noise to data, resulting in a model that can generate an image in as few as one sampling iteration. In this work, to optimize text-to-image generative models for task specific rewards and enable fast training and inference, we propose a framework for fine-tuning consistency models via RL. Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure. Comparing to RL finetuned diffusion models, RLCM trains significantly faster, improves the quality of the generation measured under the reward objectives, and speeds up the inference procedure by generating high quality images with as few as two inference steps. Experimentally, we show that RLCM can adapt text-to-image consistency models to objectives that are challenging to express with prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Our code is available at https://rlcm.owenoertell.com.
△ Less
Submitted 22 June, 2024; v1 submitted 25 March, 2024;
originally announced April 2024.
-
Policy-Gradient Training of Language Models for Ranking
Authors:
Ge Gao,
Jonathan D. Chang,
Claire Cardie,
Kianté Brantley,
Thorsten Joachim
Abstract:
Text retrieval plays a crucial role in incorporating factual knowledge for decision making into language processing pipelines, ranging from chat-based web search to question answering systems. Current state-of-the-art text retrieval models leverage pre-trained large language models (LLMs) to achieve competitive performance, but training LLM-based retrievers via typical contrastive losses requires…
▽ More
Text retrieval plays a crucial role in incorporating factual knowledge for decision making into language processing pipelines, ranging from chat-based web search to question answering systems. Current state-of-the-art text retrieval models leverage pre-trained large language models (LLMs) to achieve competitive performance, but training LLM-based retrievers via typical contrastive losses requires intricate heuristics, including selecting hard negatives and using additional supervision as learning signals. This reliance on heuristics stems from the fact that the contrastive loss itself is heuristic and does not directly optimize the downstream metrics of decision quality at the end of the processing pipeline. To address this issue, we introduce Neural PG-RANK, a novel training algorithm that learns to rank by instantiating a LLM as a Plackett-Luce ranking policy. Neural PG-RANK provides a principled method for end-to-end training of retrieval models as part of larger decision systems via policy gradient, with little reliance on complex heuristics, and it effectively unifies the training objective with downstream decision-making quality. We conduct extensive experiments on various text retrieval benchmarks. The results demonstrate that when the training objective aligns with the evaluation setup, Neural PG-RANK yields remarkable in-domain performance improvement, with substantial out-of-domain generalization to some critical datasets employed in downstream question answering tasks.
△ Less
Submitted 6 October, 2023;
originally announced October 2023.
-
Learning to Generate Better Than Your LLM
Authors:
Jonathan D. Chang,
Kiante Brantley,
Rajkumar Ramamurthy,
Dipendra Misra,
Wen Sun
Abstract:
Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning Large Language Models (LLMs) for text generation. In particular, recent LLMs such as ChatGPT and GPT-4 can engage in fluent conversations with users after finetuning with RL. Capitalizing on key properties of text generation, we seek to investigate RL algorithms beyond general purpose algorithms like Proximal Policy Opt…
▽ More
Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning Large Language Models (LLMs) for text generation. In particular, recent LLMs such as ChatGPT and GPT-4 can engage in fluent conversations with users after finetuning with RL. Capitalizing on key properties of text generation, we seek to investigate RL algorithms beyond general purpose algorithms like Proximal Policy Optimization (PPO). In particular, we extend RL algorithms to allow them to interact with a dynamic black-box guide LLM and propose RL with guided feedback (RLGF), a suite of RL algorithms for LLM fine-tuning. We provide two ways for the guide LLM to interact with the LLM to be optimized for maximizing rewards. The guide LLM can generate text which serves as additional starting states for the RL optimization procedure. The guide LLM can also be used to complete the partial sentences generated by the LLM that is being optimized, treating the guide LLM as an expert to imitate and surpass eventually. We experiment on the IMDB positive sentiment, CommonGen, and TL;DR summarization tasks. We show that our RL algorithms achieve higher performance than supervised learning (SL) and the RL baseline PPO, demonstrating the benefit of interaction with the guide LLM. On both CommonGen and TL;DR, we not only outperform our SL baselines but also improve upon PPO across a variety of metrics beyond the one we optimized for. Our code can be found at https://github.com/Cornell-RL/tril.
△ Less
Submitted 13 November, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
COVID-19 Regional Waves and Spread Risk Assessment through the Analysis of the Initial Outbreak in Guatemala
Authors:
Juan Adolfo Ponciano,
Juan Diego Chang,
Mariela Abdalah,
Kevin Facey,
José Miguel Ponciano
Abstract:
The initial surge of the COVID-19 pandemic hit Guatemala on March 2020. On a country scale, the epidemic has undergone a fairly well-known and distinguishable initial phase, reaching its peak on mid July 2020. However, the detailed picture is more involved and reflects inter-regional variations in the epidemic dynamics, presumably grounded on socio-demographic, connectivity, and human mobility fac…
▽ More
The initial surge of the COVID-19 pandemic hit Guatemala on March 2020. On a country scale, the epidemic has undergone a fairly well-known and distinguishable initial phase, reaching its peak on mid July 2020. However, the detailed picture is more involved and reflects inter-regional variations in the epidemic dynamics, presumably grounded on socio-demographic, connectivity, and human mobility factors. Classifying the regional epidemic curves and identifying the major hubs of regional COVID-19 spread can contribute towards defining an evidence-based risk map for future outbreaks of infectious diseases with similar transmissibility properties. In this work, we make a regional wave decomposition of the initial epidemic phase registered in Guatemala, and we use the Richards phenomenological model alongside multivariate ordination techniques of its estimated model parameters to draw a countrywide picture of the first epidemiological wave. By exploring similarities in the model space parameters, we traced routes for the disease spread across the country. We evaluated how well the proposed classification can help to define a regional risk hierarchy comprising early stage focal points, major hubs, and secondary regions of epidemic progression.
△ Less
Submitted 13 September, 2022;
originally announced September 2022.
-
Learning Bellman Complete Representations for Offline Policy Evaluation
Authors:
Jonathan D. Chang,
Kaiwen Wang,
Nathan Kallus,
Wen Sun
Abstract:
We study representation learning for Offline Reinforcement Learning (RL), focusing on the important task of Offline Policy Evaluation (OPE). Recent work shows that, in contrast to supervised learning, realizability of the Q-function is not enough for learning it. Two sufficient conditions for sample-efficient OPE are Bellman completeness and coverage. Prior work often assumes that representations…
▽ More
We study representation learning for Offline Reinforcement Learning (RL), focusing on the important task of Offline Policy Evaluation (OPE). Recent work shows that, in contrast to supervised learning, realizability of the Q-function is not enough for learning it. Two sufficient conditions for sample-efficient OPE are Bellman completeness and coverage. Prior work often assumes that representations satisfying these conditions are given, with results being mostly theoretical in nature. In this work, we propose BCRL, which directly learns from data an approximately linear Bellman complete representation with good coverage. With this learned representation, we perform OPE using Least Square Policy Evaluation (LSPE) with linear functions in our learned representation. We present an end-to-end theoretical analysis, showing that our two-stage algorithm enjoys polynomial sample complexity provided some representation in the rich class considered is linear Bellman complete. Empirically, we extensively evaluate our algorithm on challenging, image-based continuous control tasks from the Deepmind Control Suite. We show our representation enables better OPE compared to previous representation learning methods developed for off-policy RL (e.g., CURL, SPR). BCRL achieve competitive OPE error with the state-of-the-art method Fitted Q-Evaluation (FQE), and beats FQE when evaluating beyond the initial state distribution. Our ablations show that both linear Bellman complete and coverage components of our method are crucial.
△ Less
Submitted 12 July, 2022;
originally announced July 2022.
-
SHOP: A Deep Learning Based Pipeline for near Real-Time Detection of Small Handheld Objects Present in Blurry Video
Authors:
Abhinav Ganguly,
Amar C Gandhi,
Sylvia E,
Jeffrey D Chang,
Ian M Hudson
Abstract:
While prior works have investigated and developed computational models capable of object detection, models still struggle to reliably interpret images with motion blur and small objects. Moreover, none of these models are specifically designed for handheld object detection. In this work, we present SHOP (Small Handheld Object Pipeline), a pipeline that reliably and efficiently interprets blurry im…
▽ More
While prior works have investigated and developed computational models capable of object detection, models still struggle to reliably interpret images with motion blur and small objects. Moreover, none of these models are specifically designed for handheld object detection. In this work, we present SHOP (Small Handheld Object Pipeline), a pipeline that reliably and efficiently interprets blurry images containing handheld objects. The specific models used in each stage of the pipeline are flexible and can be changed based on performance requirements. First, images are deblurred and then run through a pose detection system where areas-of-interest are proposed around the hands of any people present. Next, object detection is performed on the images by a single-stage object detector. Finally, the proposed areas-of-interest are used to filter out low confidence detections. Testing on a handheld subset of Microsoft Common Objects in Context (MS COCO) demonstrates that this 3 stage process results in a 70 percent decrease in false positives while only reducing true positives by 17 percent in its strongest configuration. We also present a subset of MS COCO consisting solely of handheld objects that can be used to continue the development of handheld object detection methods. https://github.com/spider-sense/SHOP
△ Less
Submitted 29 March, 2022;
originally announced March 2022.
-
Mitigating Covariate Shift in Imitation Learning via Offline Data Without Great Coverage
Authors:
Jonathan D. Chang,
Masatoshi Uehara,
Dhruv Sreenivas,
Rahul Kidambi,
Wen Sun
Abstract:
This paper studies offline Imitation Learning (IL) where an agent learns to imitate an expert demonstrator without additional online environment interactions. Instead, the learner is presented with a static offline dataset of state-action-next state transition triples from a potentially less proficient behavior policy. We introduce Model-based IL from Offline data (MILO): an algorithmic framework…
▽ More
This paper studies offline Imitation Learning (IL) where an agent learns to imitate an expert demonstrator without additional online environment interactions. Instead, the learner is presented with a static offline dataset of state-action-next state transition triples from a potentially less proficient behavior policy. We introduce Model-based IL from Offline data (MILO): an algorithmic framework that utilizes the static dataset to solve the offline IL problem efficiently both in theory and in practice. In theory, even if the behavior policy is highly sub-optimal compared to the expert, we show that as long as the data from the behavior policy provides sufficient coverage on the expert state-action traces (and with no necessity for a global coverage over the entire state-action space), MILO can provably combat the covariate shift issue in IL. Complementing our theory results, we also demonstrate that a practical implementation of our approach mitigates covariate shift on benchmark MuJoCo continuous control tasks. We demonstrate that with behavior policies whose performances are less than half of that of the expert, MILO still successfully imitates with an extremely low number of expert state-action pairs while traditional offline IL method such as behavior cloning (BC) fails completely. Source code is provided at https://github.com/jdchang1/milo.
△ Less
Submitted 31 January, 2022; v1 submitted 6 June, 2021;
originally announced June 2021.
-
Gravitational realization of magnons in a ferromagnetic spin chain
Authors:
Juan Diego Chang,
Rodrigo de Leon Ardon,
Juan Ponciano,
Giovanni Ramirez
Abstract:
A gravitational model of magnons in thermal equilibrium with a ferromagnetic spin chain is developed in a phenomenological bottom-up approach. A large Schwarzschild-AdS black hole background is used as the thermal reservoir and the magnon dynamics is obtained by scalar fields and branes in the bulk. The key feature of this model is that the coupling of the spin chain is related with the radial pos…
▽ More
A gravitational model of magnons in thermal equilibrium with a ferromagnetic spin chain is developed in a phenomenological bottom-up approach. A large Schwarzschild-AdS black hole background is used as the thermal reservoir and the magnon dynamics is obtained by scalar fields and branes in the bulk. The key feature of this model is that the coupling of the spin chain is related with the radial position in which the brane is located. We further study a ferromagnetic spin chain with a competing interaction and find that the couplings are related by the difference of positions of the branes. We show how to obtain the model from a weak limit of a dynamical gravitational system. This allows us to embed the model into a holographic system. The couplings can be related to entanglement entropy at finite temperature of the CFT since the turning point of minimal surfaces coincides with the position of the branes. The difference of entropy is used to define a notion of distance between the chain couplings.
△ Less
Submitted 21 January, 2020; v1 submitted 18 December, 2019;
originally announced December 2019.