
ARM: Efficient Guided Decoding with Autoregressive Reward Models

Sergey Troshin
University of Amsterdam

Vlad Niculae
University of Amsterdam

Antske Fokkens
VU University Amsterdam

Abstract

Language models trained on large amounts of data require careful tuning to be safely deployed in the real world. We revisit the guided decoding paradigm, where the goal is to augment the logits of the base language model using the scores from a task-specific reward model. We propose a simple but efficient parameterization of the autoregressive reward model enabling fast and effective guided decoding. On detoxification and sentiment control tasks, we show that our efficient parameterization performs on par with RAD, a strong but less efficient guided decoding approach.

1 Introduction

Generative large language models (LLMs) have gained considerable popularity in recent years and show impressive results in zero-shot and few-shot scenarios on numerous downstream tasks (Touvron et al., 2023; OpenAI, 2024; Jiang et al., 2023). These large-scale models are pretrained on large amounts of data and are known to inherit and memorize the underlying biases (Sheng et al., 2019). Pretrained LLMs are also known to be easily triggered into unsafe behaviour (Wallace et al., 2019; Ganguli et al., 2022), necessitating further tuning for safer deployment and control (Ouyang et al., 2022).

Control over LLMs can be divided into methods that modify the original model (Ouyang et al., 2022; Rafailov et al., 2023) and decoding-time plug-in solutions. In this work, we focus on decoding-time guidance approaches. We assume we have access to the logits of a black-box base language model (see section 3.1 for details). We then train a smaller expert model that shares its tokenizer with the base model. The task of the expert model is to modify or rerank the logits of the base model during decoding in order to satisfy the desired constraint, while preserving the distribution of the base model as much as possible. A line of work (Holtzman et al., 2018; Ghazvininejad et al., 2017; Dathathri et al., 2019; Yang & Klein, 2021) proposes to use free-form external classifiers to guide decoding towards higher classifier scores. This approach has proven effective at conditioning the samples of the model to satisfy the desired constraints without changing the distribution of the base model much (Deng & Raffel, 2023). A disadvantage of this approach is that one needs to consider all possible continuations of a prefix and run the classifier over each of them. In the case of standard autoregressive decoding, this means running an external classifier over each token in the vocabulary concatenated with the currently generated prefix. This limits the usability of the method and forces practitioners to limit the number of evaluated candidates. Liu et al. (2021) and Krause et al. (2021) propose a more efficient approach that uses language models to predict a score for all token candidates, but these methods are less effective at producing fluent generations.

In our work, we aim to combine the strengths of both paradigms: fast inference (thanks to a language model backbone) and high quality of generations (with an attribute discriminator approach). We propose a simple strategy for transforming a pretrained language model into an efficient autoregressive discriminator. Our model inherits the efficiency of language models, predicting the classifier scores for each next token candidate from the distances between output embeddings and predicted hidden states. In the evaluation, we show that guided decoding with our model yields an attribute control/fluency trade-off comparable to that of approaches using a strong but more computationally intensive reward model.

2 Related Work

There are multiple approaches that investigate how to align a model using attribute-conditioned data. Keskar et al. (2019, CTRL) propose to finetune models using control prompts. More recent approaches perform finetuning while regularizing the weights of the model to stay close to the pretrained weights, such as Schulman et al. (2017, PPO); Stiennon et al. (2020, PPO) or Lu et al. (2022, Quark). They propose an iterative finetuning scheme based on retraining the model on attribute-conditioned samples from the model itself. Despite their decoding efficiency, these methods might require more resources for finetuning if the base model is large, or might be unusable if we only have access to the base model via an API.

Alternatively, we focus on external discriminators complementing the decoding from the base model. These techniques can be roughly divided into gradient-based, and gradient-free methods.

Among the gradient-based methods, Dathathri et al. (2019, PPLM) backpropagate the gradients from a discriminator into the prefix activations of the base model to produce desired continuations during decoding. Overall, gradient-based methods are harder to train and use in practice since they require backpropagating through the large base model.

We focus on the gradient-free decoding guidance approach, where we have access to the frozen base model. Holtzman et al. (2018) employ discriminators to modify the decoding to satisfy Grice’s maxims (Grice, 1975). Yang & Klein (2021, FUDGE) propose to concatenate the prefix words with next token candidates to estimate the discriminator scores during decoding. Dekoninck et al. (2023); Sitdikov et al. (2022) further use free-form attribute-tuned classifiers to guide the base model. A disadvantage of free-form classifiers is that they are not designed to be applied during autoregressive decoding. First, when using a free-form classifier, e.g. a finetuned BERT variant (Devlin et al., 2019; Sitdikov et al., 2022), as the discriminator, one needs to recompute the prefix activations at each decoding step. We constrain the discriminator to be an autoregressive model suitable for caching of prefix activations (Deng & Raffel, 2023). Second, classifiers are usually finetuned on complete sentences, and not on partial sequences as encountered in the decoding process. To address this issue, Deng & Raffel (2023) train an autoregressive discriminator to produce the sentence attribute for every position in the sequence (with a weighted objective). In this work, we rely on the same strategy. A related approach (Cao et al., 2022, Rectification) alters the token selection to minimize the risk of future undesired attributes.

Architecturally, our approach is inspired by GeDi (Krause et al., 2021) and DExperts (Liu et al., 2021), which aim to speed up decoding by using the logits of conditional language models as discriminator scores $p(\textit{token} \mid \textit{prefix}, a)$, where $a$ is an attribute class, e.g. positive sentiment. Despite their efficiency, these approaches lag behind discriminators directly finetuned to predict the desired attribute, suggesting that the language model objective interferes with the task of learning a good attribute discriminator (ranking) model.

Our model, while inheriting the speed efficiency of GeDi and DExperts, relies on the finetuning strategy of attribute discriminators, and we empirically show that this allows us to combine the strengths of both paradigms.

3 From Language Models to Reward Models

Figure 1: During decoding, we augment the logits of the base language model with reward scores from ARM. ARM generates reward scores on the fly autoregressively, caching previous activations. ARM utilizes language model output embeddings to efficiently predict rewards for each next token candidate.

3.1 Guided decoding with external experts

In this section, we outline the approach of guiding an autoregressive base language model with external experts. At each step $t$ of decoding, we have a prompt together with an already generated prefix $x_{:t}$. We assume we have access to the logits $z_t \in \mathbb{R}^{|V|}$ of an autoregressive model $p_{LM}(x_t \mid x_{:t}) \propto \exp(z_t(x_t \mid x_{:t}))$, where $|V|$ is the size of the vocabulary, and our goal is to augment these logits with the scores $r_t \in [0, 1]^{|V|}$ of an external reward model. The augmented logits are then mapped to probabilities via Softmax, and the next token is sampled from the resulting multinomial distribution:

$$\tilde{p}(x_t \mid x_{:t}) = \operatorname{Softmax}(z_t + \beta r_t). \tag{1}$$
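As a minimal sketch of Equation (1), the NumPy snippet below adds scaled rewards to the base logits and normalizes with a softmax. The function names `softmax` and `guided_probs` and the toy values are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def guided_probs(z, r, beta):
    """Equation (1): next-token distribution from base logits z and rewards r."""
    return softmax(z + beta * r)

# Toy vocabulary of 4 tokens (values are illustrative only).
z = np.array([2.0, 1.0, 0.5, -1.0])   # base-model logits z_t
r = np.array([0.1, 0.9, 0.5, 0.2])    # reward scores r_t in [0, 1]

p_base = guided_probs(z, r, beta=0.0)    # beta = 0 recovers the base model
p_guided = guided_probs(z, r, beta=5.0)  # larger beta favours high-reward tokens
```

Setting `beta=0` leaves the base distribution unchanged, while larger values shift probability mass towards tokens with higher reward.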

To define expert logits, Liu et al. (2021); Krause et al. (2021) propose to use attribute-conditioned language models, trained on the next token prediction task on class-conditioned data: $r_{LM,a} = z_t(x_t \mid x_{:t}, a)$. The main advantage of this approach is that a given prefix $x_{:t}$ is passed through the discriminator backbone only once, relying on the language model output layers to obtain predictions $r_t$ for each of the next tokens. Deng & Raffel (2023) show that training a less efficient discriminator leads to better quality. Instead of relying on the language model objective, they explicitly train a discriminator to predict the attribute of interest: $r_{RAD}(x_t \mid x_{:t}, a) = p(a \mid [x_{:t}, x_t])$, where $[\cdot, \cdot]$ denotes the concatenation of a prefix and a next token candidate. This approach requires passing each next token candidate as input to the model; thus, obtaining the scores for $k$ next token candidates requires $k$ full forward passes, which can slow down inference significantly and restricts the method to top-k decoding to limit the number of next token candidates.

Both of these approaches finetune an autoregressive GPT-2 (Radford et al., 2019) language model backbone; however, they parameterize the rewards differently, suggesting that we are limited to either a fast or an expressive model.

In our work, we ask whether it is possible to have an autoregressive reward model that is both efficient and expressive, suitable for on-the-fly guidance of the base language model.

For a reward model to be efficiently utilized during autoregressive decoding, we aim to satisfy the following requirements:

  1. quality: guided decoding should produce fluent outputs with a high probability of constraint satisfaction, comparable to Deng & Raffel (2023).

  2. autoregressive decoding efficiency: the complexity of the reward model should depend linearly on the context length for each decoding step (as in Deng & Raffel (2023)).

  3. prediction efficiency: the backbone is called only once for each decoding step, and the number of calls does not depend on the number of next token candidates (as in Liu et al. (2021); Krause et al. (2021)).

We answer affirmatively by proposing ARM, an autoregressive reward model suitable for guided decoding and designed for efficient modelling of reward scores. We demonstrate that guided decoding with our ARM results in high quality of constrained generation on detoxification and sentiment control tasks (see section 4).

3.2 Autoregressive Reward Model

In order to satisfy autoregressive decoding efficiency and, hopefully, high quality of predictions, we employ the same strategy as Deng & Raffel (2023), finetuning an autoregressive language model backbone with classifier heads to predict the reward scores of each next token given the prefix. This formulation allows us to cache the activations of the model during left-to-right decoding, thus having linear complexity w.r.t. the input length for each decoding step.

To ensure prediction efficiency, we aim to preserve the language-modelling style of prediction, in which the score for each next token is modeled as a similarity between a predicted hidden vector $h_t \in \mathbb{R}^{d}$ and the output embeddings $e_i \in \mathbb{R}^{d}$, the rows of the output embedding matrix in $\mathbb{R}^{|V| \times d}$ (Liu et al., 2021; Krause et al., 2021), and develop a suitable parameterization of the next token reward. We assume that reward scores are continuous values and, without loss of generality, that $r_t \in [0, 1]^{|V|}$. We propose the following ARM parameterization of the scores for the next tokens given the prefix:

$$\hat{r}_i(e_i \mid h_t) = \underset{\text{baseline}}{\tilde{r}(h_t)} + \Delta\hat{r}(e_i \mid h_t) \tag{2}$$

In particular, we use a linear parameterization of the head: $\tilde{r}(h_t) = \langle h_t, w \rangle$, $\Delta\hat{r}(e_i \mid h_t) = \langle W h_t, e_i \rangle$. Here, we introduce two attribute-specific parameters: $w \in \mathbb{R}^{d}$ for modeling the baseline score of the prefix, and $W \in \mathbb{R}^{d \times d}$ to model the marginal reward predicted for each next token candidate at position $t$.
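The head in Equation (2) can be sketched in a few lines of NumPy: a single hidden state yields rewards for every vocabulary entry through one matrix-vector product with the output embeddings. The function name `arm_rewards`, the shapes, and the random toy tensors are assumptions made for illustration; no squashing or clipping of the output is applied here, since the text does not specify one.

```python
import numpy as np

def arm_rewards(h_t, E, w, W):
    """Rewards for every next-token candidate from a single hidden state.

    h_t: (d,)     hidden state predicted by the backbone for the current prefix
    E:   (|V|, d) output embedding matrix, row i is e_i
    w:   (d,)     baseline head parameter
    W:   (d, d)   marginal head parameter
    """
    baseline = h_t @ w        # scalar <h_t, w>, shared by all candidates
    delta = E @ (W @ h_t)     # (|V|,) vector of <W h_t, e_i> for all i
    return baseline + delta

# Toy dimensions and random parameters (illustrative only).
rng = np.random.default_rng(0)
d, vocab_size = 8, 50
h_t = rng.normal(size=d)
E = rng.normal(size=(vocab_size, d))
w = rng.normal(size=d)
W = rng.normal(size=(d, d))

r_hat = arm_rewards(h_t, E, w, W)   # one score per vocabulary entry
```

Because `W @ h_t` is computed once and reused for every row of `E`, the cost of scoring all candidates does not require additional backbone passes.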

3.3 Loss function

At the training stage, we assume that we have a dataset of $N$ sentences $x_i$, each labeled with an attribute $r_i \in [0, 1]$, $i \in \{1, \dots, N\}$. We finetune the autoregressive reward model to predict the attribute value $\hat{r}_t$ (prefix reward) for each of the partial prefixes $x_{:t}$. While a naive strategy would be to learn on the full sentences, we adopt a cumulative loss function from Deng & Raffel (2023), encouraging the model to learn the future attribute for an incomplete prefix:

$$\mathcal{L}(\hat{r}, r) = \frac{\sum_{t=1}^{l} w_t \, \ell(\hat{r}_t, r)}{\sum_{t=1}^{l} w_t}. \tag{3}$$

where $w_t = t$, $l$ is the length of the sentence, and $\ell$ defines the distance between the true sentence attribute and the predicted prefix reward. The intuition behind the loss is to put more weight on the prefixes closer to the full sentence, where the ground truth labels are provided, while still encouraging the model to learn rewards for partial prefixes. Similar to Deng & Raffel (2023), we use $\ell(\hat{r}, r) = (\hat{r} - r)^{2}$ throughout this work.
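A small sketch of the cumulative loss in Equation (3), with $w_t = t$ and squared error, may help make the weighting concrete. The function name `cumulative_loss` and the toy predictions are illustrative assumptions.

```python
import numpy as np

def cumulative_loss(r_hat, r):
    """Equation (3) with w_t = t and squared-error distance.

    r_hat: (l,) predicted prefix rewards, one per position t = 1..l
    r:     scalar sentence-level label in [0, 1]
    """
    l = r_hat.shape[0]
    weights = np.arange(1, l + 1, dtype=float)   # w_t = t
    per_position = (r_hat - r) ** 2              # ell(r_hat_t, r)
    return float((weights * per_position).sum() / weights.sum())

# Same error magnitude placed early vs. late in the sequence (toy values).
early_error = cumulative_loss(np.array([0.5, 1.0, 1.0]), r=1.0)
late_error = cumulative_loss(np.array([1.0, 1.0, 0.5]), r=1.0)
```

In this toy case the late mistake is penalized three times as heavily as the early one, since position 3 carries weight 3 and position 1 carries weight 1.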

Additionally, we are interested in whether we can finetune a reward model with our parameterization to match the performance of an already trained, less efficient but potentially more expressive reward model. To answer this question, we use a distillation loss (Hinton et al., 2015) to mimic the outputs of a less efficient high-quality reward model:

$$\mathcal{L}_{\text{dstl}}(\hat{r}, r) = \frac{1}{l} \sum_{t=1}^{l} \ell(\hat{r}_t, r_t) \tag{4}$$
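For comparison with the cumulative loss, Equation (4) weights all positions equally and uses a per-position teacher target. The sketch below assumes squared error for $\ell$ and uses the hypothetical name `distillation_loss`; the student and teacher values are toy numbers.

```python
import numpy as np

def distillation_loss(r_hat, r_teacher):
    """Equation (4): mean squared error between student and teacher rewards.

    r_hat:     (l,) student predictions, one per position
    r_teacher: (l,) teacher rewards r_t for the same positions
    """
    return float(np.mean((r_hat - r_teacher) ** 2))

student = np.array([0.2, 0.4, 0.9])
teacher = np.array([0.1, 0.4, 1.0])
loss = distillation_loss(student, teacher)
```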

3.4 Regularization

A reward model can only observe a limited number of tokens during finetuning. While the loss defined above provides a positive signal for some tokens, it might be beneficial to make the objective more “contrastive” and regularize the prediction for other tokens (including rare ones). In our parameterization (2), it is natural to push the predicted reward towards the baseline for unrelated tokens, regularizing the outputs.

We regularize the prediction of our model to be close on average to the prefix baseline by forcing the marginal prediction $\Delta\hat{r}$ to be close to $0$ for randomly sampled token candidates:

$$\mathcal{L}_{\text{reg}}(h_t) = \mathbb{E}_{e' \sim U[V]} \left[ \Delta\hat{r}(e' \mid h_t) \right]^{2}, \tag{5}$$

where we use one sample of $e'$ for each prefix position, sampling uniformly from the set of all possible output embeddings.
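The one-sample estimate of Equation (5) can be sketched as follows. The name `regularization_loss`, the averaging over prefix positions, and the toy shapes are assumptions; with a single sample per position, squaring inside or outside the expectation gives the same value, so the sketch does not distinguish the two readings.

```python
import numpy as np

def regularization_loss(H, E, W, rng):
    """One-sample estimate of Equation (5), averaged over prefix positions.

    H: (l, d)   hidden states h_t, one per prefix position
    E: (|V|, d) output embeddings
    W: (d, d)   marginal head parameter
    """
    l = H.shape[0]
    idx = rng.integers(0, E.shape[0], size=l)        # e' ~ U[V], one per position
    projected = H @ W.T                              # row t equals W h_t
    delta = np.einsum("ld,ld->l", projected, E[idx]) # <W h_t, e'> per position
    return float(np.mean(delta ** 2))

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
E = rng.normal(size=(10, 4))
W = rng.normal(size=(4, 4))
reg = regularization_loss(H, E, W, rng)
```

Driving this term to zero pulls the predicted reward of randomly chosen tokens towards the prefix baseline, which is the stated purpose of the regularizer.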

4 Experiments

To test our model, we follow previous work (Liu et al., 2021; Deng & Raffel, 2023) and train two types of attribute discriminators: for toxicity reduction and sentiment control.

In our experiments, we guide the decoding from GPT-2-Large using a finetuned discriminator. For our discriminator backbone we use GPT-2-Small. We finetune all parameters of the discriminator except the embeddings $e_i$, which remain frozen. We conduct experiments in two regimes: (i) distilling an already trained, less efficient discriminator from Deng & Raffel (2023) using the $\mathcal{L}_{\text{dstl}}$ loss (4); and (ii) training a discriminator from scratch on labels only using the cumulative loss $\mathcal{L}$ (3). In both settings, we use the additional regularization $\mathcal{L}_{\text{reg}}$, forcing the marginal prediction for unrelated next tokens to stay close to zero.

For evaluation, we decode using top-k sampling from the multinomial distribution defined in (1), where the top-k candidates are selected by taking the k largest logits of the base model at the current decoding step and assigning $-\infty$ to the logits of all other tokens.
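A minimal sketch of this top-k procedure, assuming NumPy arrays for the logits and rewards, is given below. The function name `topk_guided_sample` and the toy inputs are hypothetical; candidates are chosen from the base logits only, as described above, and the reward is added afterwards.

```python
import numpy as np

def topk_guided_sample(z, r, beta, k, rng):
    """Sample one token from Softmax(z + beta * r) restricted to the top-k base logits."""
    topk = np.argpartition(z, -k)[-k:]         # indices of the k largest base logits
    masked = np.full_like(z, -np.inf)          # every other token gets -inf
    masked[topk] = z[topk] + beta * r[topk]    # augment only the kept candidates
    probs = np.exp(masked - masked.max())      # exp(-inf) evaluates to 0
    probs /= probs.sum()
    token = int(rng.choice(len(z), p=probs))
    return token, probs

rng = np.random.default_rng(0)
z = np.array([3.0, 2.0, 1.0, 0.0, -1.0])   # base logits (toy values)
r = np.array([0.0, 0.0, 1.0, 1.0, 1.0])    # rewards (toy values)
token, probs = topk_guided_sample(z, r, beta=10.0, k=2, rng=rng)
```

Note that tokens outside the top-k set receive zero probability regardless of their reward, so the reward model can only rerank candidates the base model already considers likely.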

4.1 Detoxification

Figure 2: ARM shows comparable Toxicity/Fluency trade-offs with the teacher model, where the distilled version closely matches the performance of the teacher model. We rerun the evaluation for RAD with an up-to-date Perspective API version. We include the results with other baselines from (Deng & Raffel, 2023) in figure 7 (see section B.1).

For the detoxification evaluation, we follow previous work (Deng & Raffel, 2023; Liu et al., 2021) and evaluate samples from guided decoding given a 10k subset (Liu et al., 2021) of prompts from the RealToxicityPrompts dataset (Gehman et al., 2020). We follow Deng & Raffel (2023); Liu et al. (2021) and finetune our model on 2M pairs of text and continuous labels between $0$ and $1$ from the Jigsaw Unintended Bias in Toxicity Classification challenge (cjadams et al., 2019). Like previous work, we train our model on $7$ independent labels (‘toxicity’, ‘severe toxicity’, ‘obscene’, ‘identity attack’, ‘insult’, ‘threat’, ‘sexual explicit’) with different head parameters $w_i, W_i$, $i \in \{1, \dots, 7\}$, for each sub-task. During decoding, we only use the predictor of toxicity. For the distillation experiment, we use the same dataset, and the released toxicity discriminator from Deng & Raffel (2023) as a teacher.

During decoding, we sample $25$ continuations, generating at most $20$ new tokens each. To evaluate toxicity, we use an external closed-source toxicity classifier, Perspective API (Lees et al., 2022), and following previous work (Deng & Raffel, 2023; Liu et al., 2021), we rely on the Average Maximal Toxicity metric, which is the maximal toxicity score over $25$ samples for a given prompt, averaged over the set of 10k prompts. We also report Toxic Rate, which is calculated as the probability that at least one out of 25 continuations is toxic according to Perspective API (toxicity score $> 0.5$); and Diversity score, which is the average number of distinct $n$-grams normalized by the length of the text (Li et al., 2018). To evaluate the fluency of model generations, we use the average perplexity of GPT-2-XL, calculated over the sampled continuations. In the experiments, we look at the toxicity/fluency trade-off, varying the weight $\beta$ of the discriminator (see section B.1 and section B.2). We expect to obtain a model with both low toxicity according to Perspective API and high fluency.
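The toxicity and diversity metrics described above can be sketched as simple aggregations over a score matrix with one row per prompt and one column per sampled continuation. The function names, the array layout, and the toy scores are assumptions for illustration; in practice the scores would come from the external classifier, which is not called here.

```python
import numpy as np

def average_max_toxicity(scores):
    """Max toxicity over samples per prompt, averaged over prompts.

    scores: (num_prompts, num_samples) toxicity scores in [0, 1]
    """
    return float(scores.max(axis=1).mean())

def toxic_rate(scores, threshold=0.5):
    """Fraction of prompts with at least one continuation scoring above the threshold."""
    return float((scores > threshold).any(axis=1).mean())

def distinct_n(tokens, n):
    """Number of distinct n-grams divided by the number of tokens in the text."""
    if not tokens:
        return 0.0
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams) / len(tokens)

# Two prompts with three sampled continuations each (toy scores).
scores = np.array([[0.1, 0.7, 0.2],
                   [0.3, 0.2, 0.4]])
amt = average_max_toxicity(scores)
rate = toxic_rate(scores)
div2 = distinct_n("a b a b".split(), n=2)
```

In this toy example the per-prompt maxima are 0.7 and 0.4, only the first prompt has a continuation above the 0.5 threshold, and the text contains two distinct bigrams over four tokens.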

4.2 Sentiment control

Figure 3: For the sentiment control task, ARM trained on labels only lags slightly behind the RAD baseline, while the student ARM outperforms the teacher RAD model. We include the results from Deng & Raffel (2023) for other baselines for reference.

For sentiment control, we follow the previous work of Li et al. (2018); Sudhakar et al. (2019); Liu et al. (2021); Deng & Raffel (2023) to evaluate the samples given a prompt from one of three categories: $2.5K$ negative, $5K$ neutral, and $2.5K$ positive prompts from OpenWebText (Gokaslan & Cohen, 2019). To finetune ARM on labels only, we follow Deng & Raffel (2023) and finetune our model on millions of reviews from the Amazon Polarity (Zhang et al., 2015) and SST-2 (Socher et al., 2013) datasets. To distil the sentiment discriminator of Deng & Raffel (2023), we use text examples from the Amazon Polarity dataset. Additional training details are provided in appendix A.

For evaluation, we follow Deng & Raffel (2023), and use the average Positive Rate metric w.r.t. the finetuned external DistilBERT classifier (Sanh et al., 2019) provided via the HuggingFace text classification pipeline (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).

As in the toxicity task, we use GPT-2-XL to evaluate the fluency of the sampled continuations, and we expect to obtain high Positive Rate and low average perplexity.
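As a reminder of how the fluency number is typically obtained, perplexity is the exponential of the negative mean token log-probability under the evaluating language model. The sketch below assumes the per-token log-probabilities have already been computed by that model, and that perplexity is averaged per continuation; both the helper names and the averaging scheme are assumptions, since the text does not spell them out.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability of the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def average_perplexity(continuations):
    """Mean of per-continuation perplexities.

    continuations: list of lists of token log-probabilities
    """
    return sum(perplexity(lp) for lp in continuations) / len(continuations)

# Toy log-probabilities: every token has probability 0.5 or 0.25.
sample_a = [math.log(0.5)] * 4
sample_b = [math.log(0.25)] * 3
avg_ppl = average_perplexity([sample_a, sample_b])
```

A continuation whose tokens all have probability 0.5 has perplexity 2, and one whose tokens all have probability 0.25 has perplexity 4, so the toy average is 3.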

4.3 Results

Detoxification

For the detoxification task (figure 2), our efficient student (ARM) closely follows the RAD teacher in the attribute control/fluency trade-off. Both the ARM student and the RAD teacher can significantly reduce Average Maximal Toxicity with only a slight drop in fluency. We observe that ARM trained on labels only shows a worse trade-off for the lower values of toxicity.

For completeness, in figure 7, we include the results for other baselines from Deng & Raffel (2023) computed for the older version of Perspective API.

Sentiment control

From the results on the sentiment control task in figure 3, we observe that the ARM student model shows an even slightly better trade-off than the RAD teacher model, closely following approaches that require training using feedback from the evaluation pipeline (Lu et al., 2022, Quark), (Stiennon et al., 2020, PPO). Again, ARM trained on labels only performs less fluently but still competitively compared to other guided decoding baselines.

Our results suggest that we can indeed use our efficient parameterization of the reward model to distill an already trained, effective but less efficient reward model without loss of quality. Alternatively, we can use data labelled with sentence attributes to finetune ARM without a teacher model, obtaining competitive but slightly worse results compared to distilling the RAD teacher.

Figure 4: Ablation for the distillation experiment on the detoxification task with ARM, $k = 20$. We observe that regularization towards the baseline results in lower perplexity of generated samples.
Figure 5: We plot the distribution of the toxicity reward predicted for all next tokens, averaged over $20$ random prompts from RealToxicityPrompts. Here, we sort token ids independently for each model to look at the overall ‘shape’ of the distribution. Models tend to predict extreme (very high or very low) reward values for only a small number of tokens. ARM (distill) with regularization to the baseline tends to predict a flatter distribution of reward for most tokens than ARM (distill) without additional regularization.

4.4 Ablation

Having observed good results for our parameterization in the previous experiments, we investigate the effect of the regularization and baseline components of the head parameterization.

As shown in figure 4, we experiment with distillation and observe that turning off regularization, or further removing the baseline from the head parameterization, results in higher average perplexity, although these simpler models still show an adequate decrease in toxicity. By analyzing the distribution of rewards for all next tokens (see figure 5), we observe that the unregularized ARM tends to predict less flat distributions (higher values in the $\Delta\hat{r}$ component), which may hurt fluency.

Model                          N calls
GeDi (Krause et al., 2021)     1
DExperts (Liu et al., 2021)    2
RAD (Deng & Raffel, 2023)      $k$
ARM (Ours)                     1

Table 1: Number of backbone forward passes for a reward model for a single decoding step with $k$ next token candidates. All included models support caching of prefix activations during decoding to enable faster inference, so the complexity of one forward pass is comparable across methods.
Figure 6: The baseline model (Deng & Raffel, 2023) requires $k$ forward passes through the discriminator backbone, while our model requires only $1$ pass at each decoding step.

4.5 Efficiency

The main advantage of our reward model is that it needs only a single pass through the backbone for each step of decoding. In table 1, we compare the number of backbone passes needed for external expert models. In figure 6, we measure the time per generated token when running decoding for the toxicity task with our model and with the model of Deng & Raffel (2023). When we increase $k$ in top-k decoding with a fixed batch size (equal to the number of requested samples, $25$), the baseline model needs to perform $k$ full forward passes through the intermediate layers. For our model, during top-k decoding, we simply index the relevant next token embeddings to calculate the dot product with the output hidden state.
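The indexing step mentioned above can be sketched as follows: the hidden state is projected once, and only the rows of the embedding matrix that correspond to the top-k candidates are used. The function name `arm_topk_rewards`, the shapes, and the random toy tensors are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def arm_topk_rewards(h_t, E, w, W, topk_ids):
    """Rewards for the k selected candidates only, from one hidden state.

    h_t:      (d,)     output hidden state of the backbone (one forward pass)
    E:        (|V|, d) output embedding matrix
    w, W:     baseline and marginal head parameters
    topk_ids: (k,)     vocabulary indices of the candidates
    """
    projected = W @ h_t                        # computed once per decoding step
    return h_t @ w + E[topk_ids] @ projected   # (k,) dot products with indexed rows

rng = np.random.default_rng(0)
d, vocab_size = 8, 100
h_t = rng.normal(size=d)
E = rng.normal(size=(vocab_size, d))
w = rng.normal(size=d)
W = rng.normal(size=(d, d))
topk_ids = np.array([3, 17, 42])

r_topk = arm_topk_rewards(h_t, E, w, W, topk_ids)
```

Increasing the number of candidates only adds rows to a small matrix-vector product, whereas a discriminator that takes each candidate as input would repeat the backbone pass for every one of them.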

Furthermore, as in Deng & Raffel (2023); Liu et al. (2021), our expert model is autoregressive, which means that we cache the prefix activations during decoding.

5 Conclusion

We present ARM, an efficient parameterization for reward modelling that is suitable for autoregressive decoding, supports caching of prefix activations, and predicts next token scores with a single call to the discriminator backbone. We bridge the gap between two paradigms of training attribute discriminators, demonstrating that we can have both efficient and effective guided decoding with external expert models.

Limitations

Models discussed in this work can only reduce the probability of generating toxic responses, not prevent it. Moreover, evaluation of toxicity is not perfect, and even a very low toxicity score from automatic evaluation such as Perspective API does not necessarily mean that the sample is ‘safe’. Furthermore, we should not exclusively rely on toxicity when evaluating the safety of samples from language models due to the complexity and variability of language. It is also not clear that, by reducing toxicity, we are not introducing other harms.

Acknowledgments

This publication is part of the project VI.Veni.212.228 of the research programme ‘Veni’, which is financed by the Dutch Research Council (NWO); and is part of ‘Hybrid Intelligence: augmenting human intellect’ (https://hybrid-intelligence-centre.nl) with project number 024.004.022 of the research programme ‘Gravitation’ which is (partly) financed by the Dutch Research Council (NWO).

We thank Eugeniia Tokarchuk, Kata Naszadi, Shaomu Tan, Yan Meng, Wafaa Mohammed and other members of the Language Technology Lab for fruitful discussions and feedback. We also thank the Perspective API team for increasing our API quota.

References

  • Cao et al. (2022) Meng Cao, Mehdi Fatemi, Jackie CK Cheung, and Samira Shabanian. Systematic Rectification of Language Models via Dead-end Analysis. In The Eleventh International Conference on Learning Representations, September 2022.
  • cjadams et al. (2019) cjadams, Borkan Daniel, inversion, Sorensen Jeffrem, Dixon Lucas, Vasserman Lucy, and nithum. Jigsaw unintended bias in toxicity classification, 2019. URL https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.
  • Dathathri et al. (2019) Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and Play Language Models: A Simple Approach to Controlled Text Generation. In International Conference on Learning Representations, September 2019.
  • Dekoninck et al. (2023) Jasper Dekoninck, Marc Fischer, Luca Beurer-Kellner, and Martin Vechev. Controlled Text Generation via Language Model Arithmetic. In The Twelfth International Conference on Learning Representations, October 2023.
  • Deng & Raffel (2023) Haikang Deng and Colin Raffel. Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  11781–11791, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.721.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
  • Ganguli et al. (2022) Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, and Jack Clark. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned, November 2022.
  • Gehman et al. (2020) Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.  3356–3369, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.301.
  • Ghazvininejad et al. (2017) Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. Hafez: An Interactive Poetry Generation System. In Proceedings of ACL 2017, System Demonstrations, pp.  43–48, Vancouver, Canada, July 2017. Association for Computational Linguistics.
  • Gokaslan & Cohen (2019) Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019. Accessed: 2024-06-30.
  • Grice (1975) H. Paul Grice. Logic and conversation. In Peter Cole and Jerry L. Morgan (eds.), Syntax and Semantics: Vol. 3. Speech Acts, pp.  41–58. Academic Press, New York, 1975.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. In NIPS 2014 Deep Learning Workshop, March 2015. doi: 10.48550/arXiv.1503.02531.
  • Holtzman et al. (2018) Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. Learning to Write with Cooperative Discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1638–1649, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1152.
  • Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023.
  • Keskar et al. (2019) Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. CTRL: A Conditional Transformer Language Model for Controllable Generation, September 2019.
  • Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Krause et al. (2021) Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. GeDi: Generative discriminator guided sequence generation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, pp.  4929–4952, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.424. URL https://aclanthology.org/2021.findings-emnlp.424.
  • Lees et al. (2022) Alyssa Lees, Vinh Q. Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, and Lucy Vasserman. A New Generation of Perspective API: Efficient Multilingual Character-level Transformers, February 2022.
  • Li et al. (2018) Juncen Li, Robin Jia, He He, and Percy Liang. Delete, Retrieve, Generate: A Simple Approach to Sentiment and Style Transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  1865–1874, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1169.
  • Liu et al. (2021) Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  6691–6706. Association for Computational Linguistics, August 2021. doi: 10.18653/v1/2021.acl-long.522. URL https://aclanthology.org/2021.acl-long.522.
  • Lu et al. (2022) Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. QUARK: Controllable Text Generation with Reinforced Unlearning. Advances in Neural Information Processing Systems, 35:27591–27609, December 2022.
  • OpenAI (2024) OpenAI. GPT-4 technical report, 2024.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pp.  27730–27744. Curran Associates, Inc., 2022.
  • Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  • Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems, 36:53728–53741, December 2023.
  • Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019, 2019.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, August 2017.
  • Sheng et al. (2019) Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  3407–3412, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1339. URL https://aclanthology.org/D19-1339.
  • Sitdikov et al. (2022) Askhat Sitdikov, Nikita Balagansky, Daniil Gavrilov, and Alexander Markov. Classifiers are Better Experts for Controllable Text Generation, November 2022.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp.  1631–1642, Seattle, Washington, USA, 2013. Association for Computational Linguistics. URL https://huggingface.co/datasets/stanfordnlp/sst2.
  • Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, volume 33, pp.  3008–3021. Curran Associates, Inc., 2020.
  • Sudhakar et al. (2019) Akhilesh Sudhakar, Bhargav Upadhyay, and Arjun Maheswaran. “Transforming” Delete, Retrieve, Generate Approach for Controlled Text Style Transfer. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  3269–3279, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1322.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023. URL http://arxiv.org/abs/2302.13971.
  • Wallace et al. (2019) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  2153–2162, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https://aclanthology.org/D19-1221.
  • Yang & Klein (2021) Kevin Yang and Dan Klein. FUDGE: Controlled text generation with future discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  3511–3535, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.276. URL https://aclanthology.org/2021.naacl-main.276.
  • Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015. URL https://huggingface.co/datasets/fancyzhx/amazon_polarity.

Supplementary Material

Warning: this appendix might contain disturbing examples.

Appendix A Training Details

We reuse the hyperparameters from Deng & Raffel (2023). We initialize the parameters of our models with GPT-2-Small weights and freeze the tied input-output embeddings. We finetune our models with the Adam optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.95, ε = 1e-12. We use weight decay 0.02 and batch size 100.

A.1 Detoxification

For the detoxification task, we finetune our models with a constant learning rate of 1e-5 for 5 epochs.

A.2 Sentiment Control

To finetune our model on labels only for the sentiment control task, we first finetune with learning rate 1e-5 on the Amazon Polarity dataset, and then finetune for 5 epochs on the SST-2 dataset with learning rate 2e-6. For the distillation experiment, we finetune our model for 5 epochs with learning rate 1e-5 on the Amazon Polarity dataset.
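For convenience, the settings stated in this appendix can be gathered into one configuration sketch. The class and field names below are hypothetical and do not come from the authors' code. Values that the text does not state (the detoxification dataset name in this appendix, and the number of epochs for the first labels-only sentiment stage) are deliberately left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OptimizerConfig:
    """Shared Adam settings as stated in Appendix A (field names are ours)."""
    beta1: float = 0.9
    beta2: float = 0.95
    eps: float = 1e-12
    weight_decay: float = 0.02
    batch_size: int = 100


@dataclass(frozen=True)
class Stage:
    """One finetuning stage; None marks a value the appendix does not state."""
    dataset: Optional[str]
    learning_rate: float
    epochs: Optional[int]


SHARED = OptimizerConfig()

SCHEDULES = {
    "detoxification": [Stage(None, 1e-5, 5)],
    "sentiment_labels_only": [
        Stage("Amazon Polarity", 1e-5, None),
        Stage("SST-2", 2e-6, 5),
    ],
    "sentiment_distill": [Stage("Amazon Polarity", 1e-5, 5)],
}
```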

Appendix B Results

B.1 Detoxification

Results for the detoxification task are presented in table 2. Additionally, in figure 7, we include results from Deng & Raffel (2023) for other baseline models (obtained with an older version of Perspective API).

Figure 7: Detoxification results reported in Deng & Raffel (2023) for an older version of Perspective API.
                            % Toxicity (↓)                    Fluency (↓) Diversity (↑)
Model              k   β    Average Max Toxicity  Toxic Rate  Perplexity  Dist 2  Dist 3
ARM distill        20  10   0.301                 0.139       11.70       0.81    0.84
ARM distill        20  20   0.270                 0.096       11.73       0.81    0.84
ARM distill        20  30   0.246                 0.071       11.77       0.81    0.84
ARM distill        20  50   0.212                 0.043       11.98       0.81    0.84
ARM distill        20  100  0.160                 0.019       12.67       0.80    0.83
ARM distill        20  200  0.117                 0.005       15.78       0.78    0.81
ARM distill        20  300  0.097                 0.002       24.53       0.75    0.79
ARM distill        40  10   0.304                 0.137       14.68       0.83    0.85
ARM distill        40  20   0.270                 0.092       14.73       0.83    0.85
ARM distill        40  30   0.245                 0.064       14.90       0.83    0.85
ARM distill        40  50   0.210                 0.039       15.14       0.83    0.85
ARM distill        40  100  0.158                 0.013       16.26       0.83    0.84
ARM distill        40  200  0.112                 0.003       21.28       0.81    0.83
ARM distill        40  300  0.095                 0.002       32.27       0.78    0.80
ARM (labels only)  20  10   0.278                 0.097       11.71       0.81    0.84
ARM (labels only)  20  20   0.241                 0.053       11.81       0.81    0.84
ARM (labels only)  20  30   0.218                 0.029       12.02       0.81    0.84
ARM (labels only)  20  50   0.185                 0.014       12.26       0.81    0.84
ARM (labels only)  20  100  0.143                 0.004       14.79       0.80    0.83
ARM (labels only)  20  200  0.113                 0.002       25.31       0.76    0.79
ARM (labels only)  20  300  0.102                 0.002       45.82       0.72    0.75
ARM (labels only)  40  10   0.280                 0.091       14.72       0.83    0.85
ARM (labels only)  40  20   0.242                 0.046       14.92       0.83    0.85
ARM (labels only)  40  30   0.217                 0.028       15.09       0.83    0.85
ARM (labels only)  40  50   0.185                 0.013       15.69       0.83    0.85
ARM (labels only)  40  100  0.142                 0.003       18.84       0.82    0.84
ARM (labels only)  40  200  0.111                 0.002       39.53       0.79    0.80
ARM (labels only)  40  300  0.103                 0.002       83.36       0.74    0.76
RAD                20  10   0.265                 0.077       11.73       0.81    0.84
RAD                20  20   0.231                 0.040       11.81       0.81    0.84
RAD                20  30   0.211                 0.024       11.87       0.81    0.84
RAD                20  50   0.184                 0.014       12.09       0.81    0.84
RAD                20  100  0.149                 0.005       12.64       0.81    0.83
RAD                20  200  0.115                 0.002       14.98       0.79    0.81
RAD                20  300  0.099                 0.001       19.08       0.76    0.78
RAD                40  10   0.267                 0.072       14.86       0.83    0.85
RAD                40  20   0.232                 0.036       14.87       0.83    0.85
RAD                40  30   0.211                 0.021       14.99       0.83    0.85
RAD                40  50   0.185                 0.011       15.26       0.83    0.85
RAD                40  100  0.146                 0.005       16.30       0.83    0.84
RAD                40  200  0.114                 0.002       20.69       0.82    0.83
RAD                40  300  0.098                 0.001       30.36       0.79    0.80
Table 2: Results for detoxification task. Calls to Perspective API were performed in June-July 2024.
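To make the roles of k and β in the table concrete, the sketch below shows one common form of reward-guided top-k reweighting: keep the k most likely tokens under the base model, add β times a per-token reward to their logits, and renormalize. This is an assumption for illustration in the style of reward-augmented decoding, not a reproduction of the exact rule used in the experiments; the function name and the random toy inputs are hypothetical.

```python
import numpy as np


def guided_topk_probs(base_logits, rewards, k, beta):
    """Softmax over the top-k base logits shifted by beta * reward."""
    cand = np.argsort(base_logits)[-k:]           # ids of the k largest logits
    shifted = base_logits[cand] + beta * rewards[cand]
    shifted = shifted - shifted.max()             # numerical stability
    probs = np.exp(shifted)
    return cand, probs / probs.sum()


rng = np.random.default_rng(0)
base_logits = rng.normal(size=50)                 # toy base-model logits
rewards = rng.uniform(size=50)                    # toy rewards, higher = preferred

cand, p_weak = guided_topk_probs(base_logits, rewards, k=20, beta=10.0)
_, p_strong = guided_topk_probs(base_logits, rewards, k=20, beta=100.0)
best = int(np.argmax(rewards[cand]))              # highest-reward candidate
```

In this sketch a larger β concentrates probability on the highest-reward candidate, which matches the trend in the table: toxicity falls as β grows, while perplexity rises because the sampling distribution moves further from the base model.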

B.2 Sentiment Control

Results for the sentiment control task are presented in table 3.

                            % Positive Rate (↑)              Fluency (↓) Diversity (↑)
Model              k   β    Negative Prompt  Neutral Prompt  Perplexity  Dist 2  Dist 3
ARM distill        20  10   12.94            81.08           12.16       0.76    0.78
ARM distill        20  20   24.87            91.00           12.85       0.75    0.78
ARM distill        20  30   35.18            94.87           14.11       0.75    0.78
ARM distill        20  40   43.60            96.60           15.74       0.75    0.78
ARM distill        20  50   49.84            97.38           18.03       0.74    0.78
ARM distill        20  60   55.34            97.87           20.09       0.73    0.77
ARM distill        40  10   13.50            80.97           15.53       0.78    0.79
ARM distill        40  20   26.66            91.45           17.20       0.78    0.79
ARM distill        40  30   39.12            95.32           18.29       0.78    0.80
ARM distill        40  40   48.28            96.98           20.57       0.77    0.79
ARM distill        40  50   55.94            97.80           24.36       0.76    0.79
ARM distill        40  60   61.39            98.21           28.20       0.75    0.78
ARM (labels only)  20  10   12.13            80.02           12.19       0.75    0.78
ARM (labels only)  20  20   21.24            89.06           13.67       0.75    0.78
ARM (labels only)  20  30   29.94            92.66           15.29       0.74    0.78
ARM (labels only)  20  40   37.38            94.62           17.06       0.74    0.78
ARM (labels only)  20  50   43.19            95.65           20.11       0.72    0.77
ARM (labels only)  20  60   47.19            96.20           23.07       0.71    0.76
ARM (labels only)  40  10   12.17            79.49           15.58       0.78    0.79
ARM (labels only)  40  20   22.82            89.40           17.12       0.77    0.79
ARM (labels only)  40  30   32.63            93.22           19.46       0.77    0.79
ARM (labels only)  40  40   41.58            95.15           24.36       0.76    0.79
ARM (labels only)  40  50   47.98            96.10           27.48       0.75    0.79
ARM (labels only)  40  60   53.76            96.58           30.91       0.74    0.78
RAD                20  10   19.94            86.06           12.61       0.75    0.78
RAD                20  20   35.37            92.70           14.87       0.75    0.78
RAD                20  30   43.87            94.82           17.36       0.74    0.78
RAD                20  40   48.51            95.74           20.35       0.73    0.77
RAD                20  50   50.96            96.20           23.78       0.72    0.76
RAD                20  60   52.99            96.62           28.36       0.71    0.75
RAD                40  10   22.03            86.56           16.20       0.78    0.79
RAD                40  20   40.09            93.14           19.90       0.78    0.80
RAD                40  30   50.61            95.16           23.45       0.77    0.79
RAD                40  40   55.77            96.05           27.74       0.76    0.79
RAD                40  50   58.69            96.54           33.55       0.75    0.78
RAD                40  60   60.66            96.81           41.57       0.74    0.77
Table 3: Results for sentiment control task.

B.3 Generated Examples

Examples for detoxification and sentiment control are presented in table 4, table 5 and table 6.

Prompt: X”) received a series of text messages from her attacker, including one that read “If I…
β Continuation Toxicity
10 have to, I will kill myself and then I will take all of you as well and all of 0.86
50 get out of here I’m going to kill you“" “which was translated into a 0.89
100 get in my car I’m going to drive over a curb and break it in two,” 0.47
300 ever have the opportunity to meet with anyone in real life, it would be you. I know I 0.24
Table 4: Most toxic samples (out of 25 generated) from ARM with different β coefficients, and the corresponding toxicity scores from Perspective API.
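The "most toxic sample out of 25" view used above is also what the aggregate toxicity metrics build on. The sketch below assumes the definitions commonly used with RealToxicityPrompts-style evaluation: Average Max Toxicity is the per-prompt maximum score averaged over prompts, and Toxic Rate is the fraction of prompts with at least one sample at or above a threshold. The function names, the 0.5 threshold, and the toy scores are assumptions, not values taken from this paper's evaluation code.

```python
def average_max_toxicity(scores_per_prompt):
    """Mean over prompts of the highest toxicity score among that prompt's samples."""
    return sum(max(scores) for scores in scores_per_prompt) / len(scores_per_prompt)


def toxic_rate(scores_per_prompt, threshold=0.5):
    """Fraction of prompts with at least one sample scoring >= threshold.

    The 0.5 threshold is an assumption following common practice.
    """
    hits = sum(1 for scores in scores_per_prompt if max(scores) >= threshold)
    return hits / len(scores_per_prompt)


# Toy scores for two prompts with three samples each (the experiments use 25).
toy_scores = [[0.10, 0.86, 0.30], [0.05, 0.24, 0.20]]
avg_max = average_max_toxicity(toy_scores)
rate = toxic_rate(toy_scores)
```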
Prompt: No matter who you’ll pick you’ll…
GPT-2 need to find the correct pieces for it and the right combination of characters and weapons to kill the bosses
GeDi need the focus for this one. Personally, I had a good time in jokingly calling out Card
DExperts need a versatile DH player and Dahl will certainly help you as he’s ideally positioned at the back of
RAD still be playing against a well-prepared team. In this day and age of instant gratification we
ARM probably want to make sure you do at least some research on a team before deciding, as the competition
Table 5: Detoxification task: samples from ARM and RAD with k = 20, β = 100. Examples for GeDi and DExperts are from Deng & Raffel (2023).
Prompt: Meanwhile the iron and lead…
GPT-2 in the blood of an innocent child may be of no consequence if the parent dies before he or she
GeDi gathered, our new friends danced, jests were merrily spiced, and plenty of songs fired
DExperts fields may not seem like the perfect areas for reclaiming coal and steel, but technology has brought mining
RAD industries, and also the energy and materials markets in the United States, have grown and matured. The
ARM in my life have a great effect on me. They bring me to life when I think of all
Table 6: Sentiment control task: samples from ARM and RAD with k = 20, β = 30. Examples for GeDi and DExperts are from Deng & Raffel (2023).