
ARM: Efficient Guided Decoding with Autoregressive Reward Models

Sergey Troshin
University of Amsterdam

Vlad Niculae
University of Amsterdam

Antske Fokkens
VU University Amsterdam

Abstract

Language models trained on large amounts of data require careful tuning to be safely deployed in the real world. We revisit the guided decoding paradigm, where the goal is to augment the logits of the base language model using the scores from a task-specific reward model. We propose a simple but efficient parameterization of the autoregressive reward model, enabling fast and effective guided decoding. On detoxification and sentiment control tasks, we show that our efficient parameterization performs on par with RAD, a strong but less efficient guided decoding approach.

1 Introduction

Generative large language models (LLMs) have gained considerable popularity in recent years and show impressive results in zero-shot and few-shot scenarios on numerous downstream tasks (Touvron et al., 2023; OpenAI, 2024; Jiang et al., 2023). These large-scale models are pretrained on large amounts of data and are known to inherit and memorize the underlying biases (Sheng et al., 2019). Pretrained LLMs can also be easily triggered into unsafe behaviour (Wallace et al., 2019; Ganguli et al., 2022), necessitating further tuning for safer deployment and control (Ouyang et al., 2022).

Control over LLMs can be divided into methods that modify the original model (Ouyang et al., 2022; Rafailov et al., 2023) and decoding-time plug-in solutions. In this work, we focus on decoding-time guidance approaches. We assume we have access to the logits of a black-box base language model (see section 3.1 for details). We then train a smaller expert model that shares its tokenizer with the base model. The task of the expert model is to modify or rerank the logits of the base model during decoding in order to satisfy the desired constraint, while preserving the distribution of the base model as much as possible. A line of work (Holtzman et al., 2018; Ghazvininejad et al., 2017; Dathathri et al., 2019; Yang & Klein, 2021) proposes to use free-form external classifiers to guide decoding towards higher classifier scores. This approach has proven effective at conditioning the samples of the model to satisfy the desired constraints without changing the distribution of the base model much (Deng & Raffel, 2023). A disadvantage of this approach is that one needs to consider all possible continuations of a prefix and run the classifier over each of them. In the case of standard autoregressive decoding, this means running an external classifier over each token in the vocabulary concatenated with the currently generated prefix. This limits the usability of the method and forces practitioners to limit the number of evaluated candidates. Liu et al. (2021) and Krause et al. (2021) propose a more efficient approach that uses language models to predict a score for all token candidates, but these methods are less effective at producing fluent generations.

In our work, we aim to combine the strengths of both paradigms: fast inference (thanks to a language model backbone) and high quality of generations (with an attribute discriminator approach). We propose a simple strategy for transforming a pretrained language model into an efficient autoregressive discriminator. Our model inherits the efficiency of language models, predicting the classifier scores for each next token candidate from the distances between output embeddings and predicted hidden states. In the evaluation, we show that guided decoding with our model results in an attribute control/fluency trade-off comparable to that of approaches using a strong and more computationally intensive reward model.

2 Related Work

There are multiple approaches that investigate how to align a model using attribute-conditioned data. Keskar et al. (2019, CTRL) propose to finetune models using control prompts. More recent approaches perform finetuning while regularizing the weights of the model to stay close to the pretrained weights, such as Schulman et al. (2017, PPO); Stiennon et al. (2020, PPO) or Lu et al. (2022, Quark). They propose an iterative finetuning scheme based on retraining the model on attribute-conditioned samples from the model itself. Although decoding is efficient, these methods might require more resources for finetuning if the base model is large, or might be unusable if we only have access to the base model via an API.

Alternatively, we focus on external discriminators complementing the decoding from the base model. These techniques can be roughly divided into gradient-based, and gradient-free methods.

Among the gradient-based methods, Dathathri et al. (2019, PPLM) backpropagate the gradients from a discriminator into the prefix activations of the base model to produce desired continuations during decoding. Overall, gradient-based methods are harder to train and use in practice since they require backpropagating through the large base model.

We focus on the gradient-free decoding guidance approach, where we have access to the frozen base model. Holtzman et al. (2018) employ discriminators to modify the decoding to satisfy Grice’s maxims (Grice, 1975). Yang & Klein (2021, FUDGE) propose to concatenate the prefix words with next token candidates to estimate the discriminator scores during decoding. Dekoninck et al. (2023) and Sitdikov et al. (2022) further use free-form attribute-tuned classifiers to guide the base model. A disadvantage of free-form classifiers is that they are not designed to be applied during autoregressive decoding. First, when using a free-form classifier, e.g., a finetuned BERT variant (Devlin et al., 2019; Sitdikov et al., 2022), as the discriminator, one needs to recompute the prefix activations at each decoding step. We therefore constrain the discriminator to be an autoregressive model suitable for caching prefix activations (Deng & Raffel, 2023). Second, classifiers are usually finetuned on complete sentences, not on partial sequences as encountered during decoding. To address this issue, Deng & Raffel (2023) train an autoregressive discriminator to predict the sentence attribute at every position in the sequence (with a weighted objective). In this work, we rely on the same strategy. A related approach (Cao et al., 2022, Rectification) alters the token selection to minimize the risk of future undesired attributes.

Architecturally, our approach is inspired by GeDi (Krause et al., 2021) and DExperts (Liu et al., 2021), which aim to speed up decoding by using the logits of conditional language models as discriminator scores $p(\textit{token}|\textit{prefix},a)$ , where $a$ is an attribute class, e.g., positive sentiment. Despite their efficiency, these approaches lag behind discriminators directly finetuned to predict the desired attribute, suggesting that the language model objective interferes with the task of learning a good attribute discriminator (ranking) model.

Our model, while inheriting the speed efficiency of GeDi and DExperts, relies on the finetuning strategy of attribute discriminators, and we empirically show that this allows us to incorporate the strengths of both paradigms.

3 From Language Models to Reward Models

Figure 1: During decoding, we augment the logits of the base language model with reward scores from ARM. ARM generates reward scores on the fly, autoregressively, caching previous activations. ARM uses the language model output embeddings to efficiently predict rewards for each next token candidate.

3.1 Guided decoding with external experts

In this section, we outline the approach of guiding an autoregressive base language model with external experts. At each step $t$ of decoding, we have a prompt together with an already generated prefix $x_{:t}$ . We assume we have access to the logits $z_{t}\in\mathbb{R}^{|V|}$ of an autoregressive model $p_{LM}(x_{t}|x_{:t})\propto\exp(z_{t}(x_{t}|x_{:t}))$ , where $|V|$ is the size of the vocabulary, and our goal is to augment these logits with scores $r_{t}\in[0,1]^{|V|}$ from an external reward model. The augmented logits, with the reward scores weighted by a coefficient $\beta$ , are then mapped to probabilities via Softmax, and the next token is sampled from the resulting multinomial distribution:

$$\tilde{p}(x_{t}|x_{:t})=\operatorname{Softmax}(z_{t}+\beta r_{t}). \tag{1}$$
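As an illustration of equation (1), the following minimal Python sketch combines base logits with reward scores and samples a token. The toy vocabulary of four tokens, the numeric values, and the helper names `guided_distribution` and `sample_token` are assumptions made for this example and are not taken from any released implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_distribution(base_logits, rewards, beta):
    """Equation (1): softmax of the base logits shifted by beta * reward."""
    if len(base_logits) != len(rewards):
        raise ValueError("logits and rewards must cover the same vocabulary")
    return softmax([z + beta * r for z, r in zip(base_logits, rewards)])


def sample_token(probs, rng):
    """Draw one token index from the categorical distribution `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy 4-token vocabulary; the values are illustrative only.
base_logits = [2.0, 1.0, 0.5, 0.0]
rewards = [0.1, 0.9, 0.5, 0.2]

p_base = guided_distribution(base_logits, rewards, beta=0.0)
p_guided = guided_distribution(base_logits, rewards, beta=10.0)
token = sample_token(p_guided, random.Random(0))
```

With $\beta=0$ the distribution reduces to that of the base model, while larger values of $\beta$ shift probability mass towards tokens with a higher reward.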

To define expert logits, Liu et al. (2021) and Krause et al. (2021) propose to use attribute-conditioned language models, trained on the next token prediction task on class-conditioned data: $r_{LM,a}=z_{t}(x_{t}|x_{:t},a)$ . The main advantage of this approach is that a given prefix $x_{:t}$ is passed through the discriminator backbone only once, relying on the language model output layers to obtain predictions $r_{t}$ for each of the next tokens. Deng & Raffel (2023) show that training a less efficient discriminator leads to better quality. Instead of relying on the language model objective, they explicitly train a discriminator to predict the attribute of interest: $r_{RAD}(x_{t}|x_{:t},a)=p(a|[x_{:t},x_{t}])$ , where $[\cdot,\cdot]$ denotes the concatenation of a prefix and a next token candidate. This approach requires passing each next token candidate as input to the model; thus, obtaining the scores for $k$ next token candidates requires $k$ full forward passes, which can slow down inference significantly and restricts them to top-k decoding to limit the number of next token candidates.

Both of these approaches finetune an autoregressive GPT-2 (Radford et al., 2019) language model backbone; however, they parameterize the rewards differently, suggesting that we are limited to either a fast or an expressive model.

In our work, we ask whether it is possible to have an autoregressive reward model that is both efficient and expressive, suitable for on-the-fly guidance of the base language model.

For a reward model to be efficiently utilized during autoregressive decoding, we aim to satisfy the following requirements:

1. quality: guided decoding should produce fluent outputs with a high probability of constraint satisfaction, comparable to Deng & Raffel (2023).
2. autoregressive decoding efficiency: the complexity of the reward model should depend linearly on the context length for each decoding step (as in Deng & Raffel (2023)).
3. prediction efficiency: the backbone is called only once for each decoding step, and the number of calls does not depend on the number of next token candidates (as in Liu et al. (2021); Krause et al. (2021)).

We answer affirmatively by proposing ARM, an autoregressive reward model suitable for guided decoding and designed for efficient modelling of reward scores. We demonstrate that guided decoding with ARM results in high-quality constrained generation on detoxification and sentiment control tasks (see section 4).

3.2 Autoregressive Reward Model

In order to satisfy autoregressive decoding efficiency and, hopefully, high quality of predictions, we employ the same strategy as Deng & Raffel (2023), finetuning an autoregressive language model backbone with classifier heads to predict the reward scores of each next token given the prefix. This formulation allows us to cache the activations of the model during left-to-right decoding, thus having linear complexity w.r.t. the input length for each decoding step.

To ensure prediction efficiency, we aim to preserve the language-modelling style of prediction, in which the score for each next token is modeled as a similarity between a predicted hidden vector $h_{t}\in\mathbb{R}^{d}$ and the output embeddings $e_{i}\in\mathbb{R}^{d}$ , $i\in\{1,\dots,|V|\}$ (Liu et al., 2021; Krause et al., 2021), and develop a suitable parameterization of the next token reward. We assume that reward scores are continuous values and, without loss of generality, assume $r_{t}\in[0,1]^{|V|}$ . We propose the following ARM parameterization of the scores for the next tokens given the prefix:

$$\hat{r}_{i}(e_{i}|h_{t})=\underset{\text{baseline}}{\tilde{r}(h_{t})}+\Delta\hat{r}(e_{i}|h_{t}) \tag{2}$$

In particular, we use a linear parameterization of the head: $\tilde{r}(h_{t})=\langle h_{t},w\rangle$ , $\Delta\hat{r}(e_{i}|h_{t})=\langle Wh_{t},e_{i}\rangle$ . Here, we introduce two attribute-specific parameters: $w\in\mathbb{R}^{d}$ for modeling the baseline score of the prefix, and $W\in\mathbb{R}^{d\times d}$ for modeling the marginal reward predicted for each next token candidate at position $t$ .
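The head in equation (2) can be sketched in a few lines of plain Python. This is a minimal sketch under the assumption that the hidden state $h_{t}$, the parameters $w$ and $W$, and the embedding rows $e_{i}$ are given as Python lists; the function name `arm_rewards` and the toy values are hypothetical, and the sketch leaves the scores unconstrained rather than mapping them into $[0,1]$.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    """Matrix-vector product; M is a list of rows."""
    return [dot(row, v) for row in M]


def arm_rewards(h, w, W, E, candidates=None):
    """Score next-token candidates as baseline + marginal, as in equation (2).

    h: hidden state of the prefix, length d
    w: baseline parameter vector, length d
    W: d x d parameter matrix (list of rows)
    E: output embeddings, one length-d row per vocabulary token
    candidates: optional token ids to score (defaults to the whole vocabulary)
    """
    baseline = dot(h, w)        # <h_t, w>, shared by all candidates
    projected = matvec(W, h)    # W h_t, computed once per decoding step
    ids = range(len(E)) if candidates is None else candidates
    return [baseline + dot(projected, E[i]) for i in ids]


# Toy example with d = 2 and a 3-token vocabulary (illustrative values).
h = [1.0, 2.0]
w = [0.1, 0.2]
W = [[1.0, 0.0], [0.0, 1.0]]
E = [[0.0, 0.0], [0.1, 0.0], [0.0, -0.1]]

rewards = arm_rewards(h, w, W, E)  # approximately [0.5, 0.6, 0.3]
```

Because $Wh_{t}$ is computed once per step, scoring one additional candidate costs a single $d$-dimensional dot product rather than another pass through the backbone.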

3.3 Loss function

At the training stage, we assume that we have a dataset of $N$ sentences $x_{i}$ , each labeled with an attribute $r_{i}\in[0,1]$ , $i\in\{1,\dots,N\}$ . We finetune the autoregressive reward model to predict the attribute value $\hat{r}_{t}$ (prefix reward) for each of the partial prefixes $x_{:t}$ . While a naive strategy would be to learn on the full sentences only, we adopt the cumulative loss function of Deng & Raffel (2023), which encourages the model to learn the future attribute for an incomplete prefix:

$$\mathcal{L}(\hat{r},r)=\frac{\sum_{t=1}^{l}w_{t}\ell(\hat{r}_{t},r)}{\sum_{t=1}^{l}w_{t}}, \tag{3}$$

where $l$ is the number of tokens in the sentence, $w_{t}=t$ , and $\ell$ measures the distance between the true sentence attribute and the predicted prefix reward. The intuition behind the loss is to put more weight on the prefixes closer to the full sentence, where the ground truth labels are provided, while still encouraging the model to learn rewards for partial prefixes. Similar to Deng & Raffel (2023), we use $\ell(\hat{r},r)=(\hat{r}-r)^{2}$ throughout this work.
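To make the weighting in equation (3) concrete, here is a small sketch of the cumulative loss for one sentence with $w_{t}=t$ and squared error. The function name `cumulative_loss` and the example predictions are assumptions for illustration.

```python
def cumulative_loss(prefix_preds, label):
    """Equation (3) with w_t = t and squared error.

    prefix_preds[t - 1] is the predicted reward for the prefix of length t;
    label is the attribute of the full sentence.
    """
    if not prefix_preds:
        raise ValueError("need at least one prefix prediction")
    weighted = sum(
        t * (r_hat - label) ** 2
        for t, r_hat in enumerate(prefix_preds, start=1)
    )
    total_weight = sum(range(1, len(prefix_preds) + 1))
    return weighted / total_weight


# Same errors, placed early vs. late in a 3-token sentence with label 1.0.
early_errors = cumulative_loss([0.5, 0.5, 1.0], label=1.0)  # 0.75 / 6
late_errors = cumulative_loss([1.0, 0.5, 0.5], label=1.0)   # 1.25 / 6
```

Errors on longer prefixes are penalized more heavily, so `late_errors` exceeds `early_errors` even though both sequences contain the same two mistakes.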

Additionally, we are interested in whether we can finetune a reward model with our parameterization to match the performance of an already trained, less efficient but potentially more expressive reward model. To answer this question, we use a distillation loss (Hinton et al., 2015) to mimic the outputs of a less efficient high-quality reward model:

$$\mathcal{L}_{\text{dstl}}(\hat{r},r)=\frac{1}{l}\sum_{t=1}^{l}\ell(\hat{r}_{t},r_{t}). \tag{4}$$
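For comparison with equation (3), the distillation loss in equation (4) can be sketched as follows. We assume here that $r_{t}$ is the teacher's score for the prefix of length $t$ and that $\ell$ is the same squared error as above; the function name `distillation_loss` and the example values are hypothetical.

```python
def distillation_loss(student_preds, teacher_preds):
    """Equation (4): mean squared error between per-prefix predictions.

    student_preds[t - 1] and teacher_preds[t - 1] are the student and
    teacher scores for the prefix of length t.
    """
    if len(student_preds) != len(teacher_preds) or not student_preds:
        raise ValueError("need equally long, non-empty prediction lists")
    squared = [(s - t) ** 2 for s, t in zip(student_preds, teacher_preds)]
    return sum(squared) / len(squared)


# Illustrative per-prefix scores for a 3-token sentence.
loss = distillation_loss([0.2, 0.4, 0.9], [0.1, 0.5, 0.9])
```

Unlike the cumulative loss, every position has its own target and all positions are weighted equally.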

3.4 Regularization

A reward model can only observe a limited number of tokens during finetuning. While the loss defined above provides a positive signal for some tokens, it might be beneficial to make the objective more “contrastive” and regularize the prediction for other tokens (including rare ones). In our parameterization (2), it is natural to push the predicted reward towards the baseline for unrelated tokens, regularizing the outputs.

We regularize the prediction of our model to be close on average to the prefix baseline by forcing the marginal prediction $\Delta\hat{r}$ to be close to $0$ for randomly sampled token candidates:

$$\mathcal{L}_{\text{reg}}(h_{t})=\mathbb{E}_{e^{\prime}\sim U[V]}\left[\Delta\hat{r}(e^{\prime}|h_{t})\right]^{2}, \tag{5}$$

where we use one sample of $e^{\prime}$ for each prefix position, sampling uniformly from a set of all possible output embeddings.
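A one-sample estimate of the regularizer in equation (5) might look like the sketch below. It assumes the linear marginal $\Delta\hat{r}(e|h)=\langle Wh,e\rangle$ from section 3.2; averaging over prefix positions, the function name `reg_loss`, and the toy values are assumptions of this example rather than details stated in the text.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def marginal(h, W, e):
    """Linear marginal reward <W h, e> for one candidate embedding e."""
    projected = [dot(row, h) for row in W]
    return dot(projected, e)


def reg_loss(hidden_states, W, E, rng):
    """One-sample estimate of equation (5), averaged over prefix positions.

    For each prefix position, one embedding is drawn uniformly from the
    vocabulary and its squared marginal reward is penalized.
    """
    total = 0.0
    for h in hidden_states:
        e_prime = E[rng.randrange(len(E))]  # uniform over the vocabulary
        total += marginal(h, W, e_prime) ** 2
    return total / len(hidden_states)


# Toy setup: three prefix positions, d = 2, a 4-token vocabulary.
hidden_states = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
E = [[0.1, 0.0], [0.0, 0.1], [-0.2, 0.3], [0.4, -0.1]]

penalty = reg_loss(hidden_states, W, E, random.Random(0))
```

The penalty is zero only when the sampled marginals vanish, which pushes the predicted reward for arbitrary tokens towards the prefix baseline.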

4 Experiments

To test our model, we follow previous work (Liu et al., 2021; Deng & Raffel, 2023) and train two types of attribute discriminators: for toxicity reduction and sentiment control.

In our experiments, we guide the decoding from GPT-2-Large using a finetuned discriminator. For our discriminator backbone, we use GPT-2-Small. We finetune all parameters of the discriminator except the embeddings $e_{i}$ , which remain frozen. We conduct experiments in two regimes: (i) distilling an already trained, less efficient discriminator from Deng & Raffel (2023) using the $\mathcal{L}_{\text{dstl}}$ loss (4); and (ii) training a discriminator from scratch on labels only using the cumulative loss $\mathcal{L}$ (3). In both settings, we use the additional regularization $\mathcal{L}_{\text{reg}}$ , forcing the marginal prediction for unrelated next tokens to stay close to zero.

For evaluation, we decode using top-k sampling from the multinomial distribution defined in (1), where the top-k candidates are selected by taking the $k$ largest logits of the base model at the current decoding step and assigning $-\infty$ to the logits of all other tokens.
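The top-k variant of equation (1) can be sketched as below. This is an illustrative sketch, not the evaluation code: the function name `top_k_guided_probs`, the callable `reward_fn` (which returns rewards for a list of token ids), and the toy numbers are all assumptions. Assigning $-\infty$ to a logit is implemented by giving the token zero probability.

```python
import math


def top_k_guided_probs(base_logits, reward_fn, k, beta):
    """Top-k guided distribution over the vocabulary.

    Keeps the k tokens with the largest base logits, adds beta * reward to
    their logits, and applies a softmax; all other tokens get probability 0.
    """
    order = sorted(
        range(len(base_logits)), key=lambda i: base_logits[i], reverse=True
    )
    top = order[:k]
    rewards = reward_fn(top)  # only the k candidates are scored
    scores = {i: base_logits[i] + beta * r for i, r in zip(top, rewards)}
    m = max(scores.values())
    exps = {i: math.exp(s - m) for i, s in scores.items()}
    total = sum(exps.values())
    probs = [0.0] * len(base_logits)
    for i, e in exps.items():
        probs[i] = e / total
    return probs


# Toy 5-token vocabulary; tokens 3 and 4 have high reward but low base logit.
base_logits = [3.0, 2.5, 2.0, -1.0, -2.0]
reward_table = [0.1, 0.8, 0.3, 1.0, 1.0]

probs = top_k_guided_probs(
    base_logits,
    reward_fn=lambda ids: [reward_table[i] for i in ids],
    k=3,
    beta=5.0,
)
```

Tokens outside the base model's top-k never receive probability mass, regardless of their reward, since the candidate set is chosen from the base logits alone.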

4.1 Detoxification

For the detoxification evaluation, we follow previous work (Deng & Raffel, 2023; Liu et al., 2021) and evaluate samples from guided decoding given a 10k subset (Liu et al., 2021) of prompts from the RealToxicityPrompts dataset (Gehman et al., 2020). We follow Deng & Raffel (2023); Liu et al. (2021) and finetune our model on 2M pairs of text and continuous labels between $0$ and $1$ from the Jigsaw Unintended Bias in Toxicity Classification challenge (cjadams et al., 2019). Like previous work, we train our model on $7$ independent labels (‘toxicity’, ‘severe toxicity’, ‘obscene’, ‘identity attack’, ‘insult’, ‘threat’, ‘sexual explicit’) with different head parameters $w_{i},W_{i},i\in\{1,...,7\}$ for each sub-task. During decoding, we only use the predictor of toxicity. For the distillation experiment, we use the same dataset, and the released toxicity discriminator from Deng & Raffel (2023) as a teacher.

During decoding, we sample $25$ continuations, generating at most $20$ new tokens for each. To evaluate toxicity, we use an external closed-source toxicity classifier, Perspective API (Lees et al., 2022), and, following previous work (Deng & Raffel, 2023; Liu et al., 2021), we rely on the Average Maximal Toxicity metric, which is the maximal toxicity score over $25$ samples for a given prompt, averaged over the set of 10k prompts. We also report Toxic Rate, which is the probability that at least one out of $25$ continuations is toxic according to Perspective API (toxicity score $>0.5$ ), and the Diversity score, which is the average number of distinct $n$ -grams normalized by the length of the text (Li et al., 2018). To evaluate the fluency of model generations, we use the average perplexity of GPT-2-XL, calculated over the sampled continuations. In the experiments, we look at the toxicity/fluency trade-off while varying the weight $\beta$ of the discriminator (see section B.1 and section B.2). We expect to obtain a model with both low toxicity according to Perspective API and high fluency.
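The three metrics described above can be sketched as follows, assuming the toxicity scores have already been obtained from the classifier and are stored as one list of scores per prompt. The function names and the toy inputs are assumptions for illustration; tokenization for the diversity score is simplified to whitespace splitting.

```python
def average_max_toxicity(scores_per_prompt):
    """Mean over prompts of the maximum toxicity among that prompt's samples."""
    return sum(max(s) for s in scores_per_prompt) / len(scores_per_prompt)


def toxic_rate(scores_per_prompt, threshold=0.5):
    """Fraction of prompts with at least one sample scoring above threshold."""
    hits = sum(1 for s in scores_per_prompt if max(s) > threshold)
    return hits / len(scores_per_prompt)


def distinct_n(tokens, n):
    """Number of distinct n-grams divided by the number of tokens."""
    if not tokens:
        return 0.0
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams) / len(tokens)


# Two prompts with three sampled continuations each (illustrative scores).
scores = [[0.1, 0.7, 0.2], [0.05, 0.1, 0.3]]

amt = average_max_toxicity(scores)       # mean of 0.7 and 0.3
rate = toxic_rate(scores)                # one of two prompts exceeds 0.5
div2 = distinct_n("a b a b".split(), 2)  # 2 distinct bigrams over 4 tokens
```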

4.2 Sentiment control

For sentiment control, we follow previous work (Li et al., 2018; Sudhakar et al., 2019; Liu et al., 2021; Deng & Raffel, 2023) and evaluate the samples given a prompt from one of three categories: $2.5K$ negative, $5K$ neutral, and $2.5K$ positive prompts from OpenWebText (Gokaslan & Cohen, 2019). To finetune ARM on labels only, we follow Deng & Raffel (2023) and finetune our model on millions of reviews from the Amazon Polarity (Zhang et al., 2015) and SST-2 (Socher et al., 2013) datasets. To distil the sentiment discriminator of Deng & Raffel (2023), we use text examples from the Amazon Polarity dataset. Additional training details are provided in appendix A.

For evaluation, we follow Deng & Raffel (2023) and use the average Positive Rate metric w.r.t. the finetuned external DistilBERT classifier (Sanh et al., 2019) provided via the HuggingFace text classification pipeline (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).

As in the toxicity task, we use GPT-2-XL to evaluate the fluency of the sampled continuations, and we expect to obtain high Positive Rate and low average perplexity.
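As a rough sketch of the two quantities reported for this task, the code below computes a positive rate from classifier labels and an average perplexity from per-token log-probabilities. It assumes the labels and natural-log token probabilities were produced elsewhere (by the sentiment classifier and the scoring language model, respectively); the label string `"POSITIVE"`, the function names, and the toy inputs are assumptions of this example.

```python
import math


def positive_rate(labels, positive_label="POSITIVE"):
    """Fraction of continuations classified as positive."""
    return sum(1 for y in labels if y == positive_label) / len(labels)


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) of the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def average_perplexity(logprobs_per_continuation):
    """Mean of the per-continuation perplexities."""
    values = [perplexity(lp) for lp in logprobs_per_continuation]
    return sum(values) / len(values)


# Illustrative inputs: four labels and two scored continuations.
labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
logprobs = [[math.log(0.5)] * 4, [math.log(0.25)] * 3]

rate = positive_rate(labels)         # 3 of 4 continuations are positive
ppl = average_perplexity(logprobs)   # mean of roughly 2 and 4
```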

4.3 Results

Detoxification

For the detoxification task (figure 2), our efficient student (ARM) closely follows the RAD teacher in terms of the attribute control/fluency trade-off. Both the ARM student and the RAD teacher can significantly reduce Average Maximal Toxicity with only a slight drop in fluency. We observe that ARM trained on labels only shows a worse trade-off for the lower values of toxicity.

For completeness, in figure 7, we include the results for other baselines from Deng & Raffel (2023) computed for the older version of Perspective API.

Sentiment control

From the results on the sentiment control task in figure 3, we observe that the ARM student model shows an even slightly better trade-off than the RAD teacher model, closely following approaches that require training with feedback from the evaluation pipeline (Lu et al., 2022, Quark; Stiennon et al., 2020, PPO). Again, ARM trained on labels only performs less fluently but remains competitive with other guided decoding baselines.

Our results suggest that we can indeed use our efficient parameterization of the reward model to distill an already trained, effective but less efficient reward model without loss of quality. Alternatively, we can use data labelled with sentence attributes to finetune ARM without a teacher model, obtaining competitive but slightly worse results compared to distilling the RAD teacher.

4.4 Ablation

Observing good results for our parameterization in previous experiments, we then investigate the effect of the regularization component and baseline component for head parameterization.

As shown in figure 4, we experiment with distillation and observe that turning off regularization, or further removing the baseline from the head parameterization, results in higher average perplexity, although these simpler models still show an adequate decrease in toxicity. By analyzing the distribution of rewards for all next tokens (see figure 5), we observe that the unregularized ARM tends to predict less flat distributions (higher values in the $\Delta\hat{r}$ component), which may hurt fluency.

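Table 1: Number of backbone passes needed per decoding step for external expert models; $k$ is the number of next token candidates.
Model	Backbone calls
GeDi (Krause et al., 2021)	1
DExperts (Liu et al., 2021)	2
RAD (Deng & Raffel, 2023)	$k$
ARM (Ours)	1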

4.5 Efficiency

The main advantage of our reward model is that it needs only a single pass through the backbone for each step of decoding. In table 1, we compare the number of backbone passes needed for external expert models. In figure 6, we measure the time per generated token when running the decoding for the toxicity task with our model and the model of Deng & Raffel (2023). When we increase $k$ in top-k decoding with a fixed batch size (equal to the number of requested samples, $25$ ), the baseline model needs to perform $k$ full forward passes through the intermediate layers. Our model, during top-k decoding, simply indexes the relevant next token embeddings to calculate the dot product with the output hidden state.

Furthermore, as in Deng & Raffel (2023); Liu et al. (2021), our expert model is autoregressive, which means that we cache the prefix activations during decoding.
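The difference in backbone calls can be illustrated with a toy counter. This is a sketch under strong simplifications: `CountingBackbone` is a hypothetical stand-in that returns a made-up 2-dimensional hidden state, the scoring head for the concatenation-style model is reduced to a single dot product, and prefix caching is ignored. Only the number of calls is meaningful here, not the scores.

```python
class CountingBackbone:
    """Toy stand-in for a transformer backbone that counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, token_ids):
        self.calls += 1
        # Hypothetical 2-dimensional "hidden state" derived from the input.
        return [len(token_ids) / 10.0, (sum(token_ids) % 7) / 7.0]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def concat_style_scores(backbone, prefix, candidates, w):
    """Score each candidate by running the backbone on prefix + [candidate]."""
    return [dot(backbone(prefix + [c]), w) for c in candidates]


def arm_style_scores(backbone, prefix, candidates, w, W, E):
    """Run the backbone once on the prefix, then index candidate embeddings."""
    h = backbone(prefix)
    baseline = dot(h, w)
    projected = [dot(row, h) for row in W]
    return [baseline + dot(projected, E[c]) for c in candidates]


prefix = [5, 3, 8]
candidates = [0, 2, 4]  # top-k token ids with k = 3
w = [0.5, 0.5]
W = [[1.0, 0.0], [0.0, 1.0]]
E = [[0.1 * i, -0.1 * i] for i in range(6)]

concat_backbone = CountingBackbone()
concat_style_scores(concat_backbone, prefix, candidates, w)

arm_backbone = CountingBackbone()
arm_style_scores(arm_backbone, prefix, candidates, w, W, E)

print(concat_backbone.calls, arm_backbone.calls)  # prints "3 1"
```

In practice the $k$ passes of a concatenation-style model can be batched, so the call count reflects compute rather than wall-clock time directly.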

5 Conclusion

We present ARM, an efficient parameterization of the reward model that is suitable for autoregressive decoding, supports caching of prefix activations, and predicts next token scores with a single call to the discriminator backbone. We bridge the gap between two paradigms of training attribute discriminators, demonstrating that we can have both efficient and effective guided decoding with external expert models.

Limitations

Models discussed in this work can only reduce the probability of generating toxic responses, not prevent it. Moreover, the evaluation of toxicity is not perfect, and even a very low toxicity score from an automatic evaluation such as Perspective API does not necessarily mean that the sample is ‘safe’. Furthermore, we should not rely exclusively on toxicity when evaluating the safety of samples from language models due to the complexity and variability of language. It is also not clear whether, by reducing toxicity, we are introducing other harms.

Acknowledgments

This publication is part of the project VI.Veni.212.228 of the research programme ‘Veni’, which is financed by the Dutch Research Council (NWO); and is part of ‘Hybrid Intelligence: augmenting human intellect’ (https://hybrid-intelligence-centre.nl) with project number 024.004.022 of the research programme ‘Gravitation’ which is (partly) financed by the Dutch Research Council (NWO).

We thank Eugeniia Tokarchuk, Kata Naszadi, Shaomu Tan, Yan Meng, Wafaa Mohammed and other members of the Language Technology Lab for fruitful discussions and feedback. We also thank the Perspective API team for increasing our API quota.

References

Cao et al. (2022) Meng Cao, Mehdi Fatemi, Jackie CK Cheung, and Samira Shabanian. Systematic Rectification of Language Models via Dead-end Analysis. In The Eleventh International Conference on Learning Representations, September 2022.
cjadams et al. (2019) cjadams, Borkan Daniel, inversion, Sorensen Jeffrem, Dixon Lucas, Vasserman Lucy, and nithum. Jigsaw unintended bias in toxicity classification, 2019. URL https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.
Dathathri et al. (2019) Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and Play Language Models: A Simple Approach to Controlled Text Generation. In International Conference on Learning Representations, September 2019.
Dekoninck et al. (2023) Jasper Dekoninck, Marc Fischer, Luca Beurer-Kellner, and Martin Vechev. Controlled Text Generation via Language Model Arithmetic. In The Twelfth International Conference on Learning Representations, October 2023.
Deng & Raffel (2023) Haikang Deng and Colin Raffel. Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 11781–11791, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.721.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
Ganguli et al. (2022) Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, and Jack Clark. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned, November 2022.
Gehman et al. (2020) Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356–3369, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.301.
Ghazvininejad et al. (2017) Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. Hafez: An Interactive Poetry Generation System. In Proceedings of ACL 2017, System Demonstrations, pp. 43–48, Vancouver, Canada, July 2017. Association for Computational Linguistics.
Gokaslan & Cohen (2019) Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019. Accessed: 2024-06-30.
Grice (1975) H. Paul Grice. Logic and conversation. In Peter Cole and Jerry L. Morgan (eds.), Syntax and Semantics: Vol. 3. Speech Acts, pp. 41–58. Academic Press, New York, 1975.
Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. In NIPS 2014 Deep Learning Workshop, March 2015. doi: 10.48550/arXiv.1503.02531.
Holtzman et al. (2018) Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. Learning to Write with Cooperative Discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1638–1649, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1152.
Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023.
Keskar et al. (2019) Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. CTRL: A Conditional Transformer Language Model for Controllable Generation, September 2019.
Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Krause et al. (2021) Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. GeDi: Generative discriminator guided sequence generation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 4929–4952, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.424. URL https://aclanthology.org/2021.findings-emnlp.424.
Lees et al. (2022) Alyssa Lees, Vinh Q. Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, and Lucy Vasserman. A New Generation of Perspective API: Efficient Multilingual Character-level Transformers, February 2022.
Li et al. (2018) Juncen Li, Robin Jia, He He, and Percy Liang. Delete, Retrieve, Generate: A Simple Approach to Sentiment and Style Transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1865–1874, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1169.
Liu et al. (2021) Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6691–6706. Association for Computational Linguistics, August 2021. doi: 10.18653/v1/2021.acl-long.522. URL https://aclanthology.org/2021.acl-long.522.
Lu et al. (2022) Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. QUARK: Controllable Text Generation with Reinforced Unlearning. Advances in Neural Information Processing Systems, 35:27591–27609, December 2022.
OpenAI (2024) OpenAI. GPT-4 technical report, 2024.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pp. 27730–27744. Curran Associates, Inc., 2022.
Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems, 36:53728–53741, December 2023.
Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of BERT: smaller, faster, cheaper and lighter. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019, 2019.
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, August 2017.
Sheng et al. (2019) Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3407–3412, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1339. URL https://aclanthology.org/D19-1339.
Sitdikov et al. (2022) Askhat Sitdikov, Nikita Balagansky, Daniil Gavrilov, and Alexander Markov. Classifiers are Better Experts for Controllable Text Generation, November 2022.
Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattle, Washington, USA, 2013. Association for Computational Linguistics. URL https://huggingface.co/datasets/stanfordnlp/sst2.
Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, volume 33, pp. 3008–3021. Curran Associates, Inc., 2020.
Sudhakar et al. (2019) Akhilesh Sudhakar, Bhargav Upadhyay, and Arjun Maheswaran. “Transforming” Delete, Retrieve, Generate Approach for Controlled Text Style Transfer. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3269–3279, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1322.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023. URL http://arxiv.org/abs/2302.13971.
Wallace et al. (2019) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2153–2162, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https://aclanthology.org/D19-1221.
Yang & Klein (2021) Kevin Yang and Dan Klein. FUDGE: Controlled text generation with future discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3511–3535, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.276. URL https://aclanthology.org/2021.naacl-main.276.
Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015. URL https://huggingface.co/datasets/fancyzhx/amazon_polarity.

Supplementary Material

Warning: this appendix might contain disturbing examples.

Appendix A Training Details

We reuse the hyperparameters from Deng & Raffel (2023). We initialize the parameters of our models with GPT-2-Small weights and freeze the tied input-output embeddings. We finetune our models with the Adam optimizer (Kingma & Ba, 2015) with $\beta_{1}=0.9$, $\beta_{2}=0.95$, $\epsilon=1\mathrm{e}{-12}$. We use weight decay $0.02$ and batch size $100$.

A.1 Detoxification

For the detoxification task, we finetune our models with a constant learning rate of $1\mathrm{e}{-5}$ for $5$ epochs.

A.2 Sentiment Control

To finetune our model on labels only for the sentiment control task, we first finetune with learning rate $1\mathrm{e}{-5}$ on the Amazon Polarity dataset, and then finetune for $5$ epochs on the SST-2 dataset with learning rate $2\mathrm{e}{-6}$. For the distillation experiment, we finetune our model for $5$ epochs with learning rate $1\mathrm{e}{-5}$ on the Amazon Polarity dataset.
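
The hyperparameters listed in this appendix can be collected into a small configuration object. The following Python sketch is illustrative only: the class names, dictionary keys, and dataset labels are hypothetical and do not come from a released codebase. The one epoch count that the text does not state is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OptimizerConfig:
    """Adam settings shared by all runs, as reported in Appendix A."""
    beta1: float = 0.9
    beta2: float = 0.95
    eps: float = 1e-12
    weight_decay: float = 0.02
    batch_size: int = 100


@dataclass(frozen=True)
class FinetuneStage:
    """One finetuning stage; `dataset` is a descriptive label, not a loader name."""
    dataset: str
    learning_rate: float
    epochs: Optional[int]  # None where the appendix does not state a value


# Hypothetical keys; each value lists the stages in the order they are run.
SCHEDULES = {
    "detoxification": [
        FinetuneStage("detoxification training data", 1e-5, 5),
    ],
    "sentiment_labels_only": [
        FinetuneStage("Amazon Polarity", 1e-5, None),  # epochs not stated
        FinetuneStage("SST-2", 2e-6, 5),
    ],
    "sentiment_distill": [
        FinetuneStage("Amazon Polarity", 1e-5, 5),
    ],
}

if __name__ == "__main__":
    print(OptimizerConfig())
    for task, stages in SCHEDULES.items():
        for stage in stages:
            print(task, stage)
```

Keeping the two-stage labels-only schedule as an ordered list makes the difference from the single-stage distillation schedule explicit.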

Appendix B Results

B.1 Detoxification

Results for the detoxification task are presented in table 2. Additionally, in figure 7, we include results from Deng & Raffel (2023) for other baseline models (obtained with an older version of the Perspective API).

			Toxicity ( $\downarrow$ )		Fluency ( $\downarrow$ )	Diversity ( $\uparrow$ )
Model	$k$	$\beta$	Average Max Toxicity	Toxic Rate	Perplexity	Dist 2	Dist 3
ARM distill	k=20	10	0.301	0.139	11.70	0.81	0.84
		20	0.270	0.096	11.73	0.81	0.84
		30	0.246	0.071	11.77	0.81	0.84
		50	0.212	0.043	11.98	0.81	0.84
		100	0.160	0.019	12.67	0.80	0.83
		200	0.117	0.005	15.78	0.78	0.81
		300	0.097	0.002	24.53	0.75	0.79
	k=40	10	0.304	0.137	14.68	0.83	0.85
		20	0.270	0.092	14.73	0.83	0.85
		30	0.245	0.064	14.90	0.83	0.85
		50	0.210	0.039	15.14	0.83	0.85
		100	0.158	0.013	16.26	0.83	0.84
		200	0.112	0.003	21.28	0.81	0.83
		300	0.095	0.002	32.27	0.78	0.80
ARM (labels only)	k=20	10	0.278	0.097	11.71	0.81	0.84
		20	0.241	0.053	11.81	0.81	0.84
		30	0.218	0.029	12.02	0.81	0.84
		50	0.185	0.014	12.26	0.81	0.84
		100	0.143	0.004	14.79	0.80	0.83
		200	0.113	0.002	25.31	0.76	0.79
		300	0.102	0.002	45.82	0.72	0.75
	k=40	10	0.280	0.091	14.72	0.83	0.85
		20	0.242	0.046	14.92	0.83	0.85
		30	0.217	0.028	15.09	0.83	0.85
		50	0.185	0.013	15.69	0.83	0.85
		100	0.142	0.003	18.84	0.82	0.84
		200	0.111	0.002	39.53	0.79	0.80
		300	0.103	0.002	83.36	0.74	0.76
RAD	k=20	10	0.265	0.077	11.73	0.81	0.84
		20	0.231	0.040	11.81	0.81	0.84
		30	0.211	0.024	11.87	0.81	0.84
		50	0.184	0.014	12.09	0.81	0.84
		100	0.149	0.005	12.64	0.81	0.83
		200	0.115	0.002	14.98	0.79	0.81
		300	0.099	0.001	19.08	0.76	0.78
	k=40	10	0.267	0.072	14.86	0.83	0.85
		20	0.232	0.036	14.87	0.83	0.85
		30	0.211	0.021	14.99	0.83	0.85
		50	0.185	0.011	15.26	0.83	0.85
		100	0.146	0.005	16.30	0.83	0.84
		200	0.114	0.002	20.69	0.82	0.83
		300	0.098	0.001	30.36	0.79	0.80

Table 2: Results for the detoxification task. Calls to the Perspective API were performed in June-July 2024.
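
As a reading aid for the two toxicity columns, the sketch below shows how such aggregates are commonly computed from per-sample toxicity scores. It rests on assumptions not stated in this appendix: each prompt has a list of scores in $[0,1]$ (for example, one per generated continuation), Average Max Toxicity is the mean over prompts of the per-prompt maximum, and Toxic Rate is the fraction of prompts whose maximum reaches a threshold, taken here to be $0.5$. The function name and the threshold default are hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple


def toxicity_metrics(
    scores_per_prompt: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed cutoff for calling a sample toxic
) -> Tuple[float, float]:
    """Return (average max toxicity, toxic rate) over a set of prompts.

    scores_per_prompt[i] holds the toxicity scores of all continuations
    sampled for prompt i; every inner sequence must be non-empty.
    """
    # Worst-case (maximum) toxicity among the samples of each prompt.
    max_scores = [max(scores) for scores in scores_per_prompt]
    avg_max = mean(max_scores)
    # Fraction of prompts with at least one sample at or above the threshold.
    toxic_rate = mean(1.0 if m >= threshold else 0.0 for m in max_scores)
    return avg_max, toxic_rate


if __name__ == "__main__":
    toy_scores = [[0.1, 0.6], [0.2, 0.3]]  # two prompts, two samples each
    print(toxicity_metrics(toy_scores))
```

Under this reading, both columns depend only on the most toxic sample per prompt, which is why they tend to decrease together as $\beta$ grows.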

B.2 Sentiment Control

Results for the sentiment control task are presented in table 3.

			% Positive Rate ( $\uparrow$ )		Fluency ( $\downarrow$ )	Diversity ( $\uparrow$ )
Model	$k$	$\beta$	Negative Prompt	Neutral Prompt	Perplexity	Dist 2	Dist 3
ARM distill	k=20	10	12.94	81.08	12.16	0.76	0.78
		20	24.87	91.00	12.85	0.75	0.78
		30	35.18	94.87	14.11	0.75	0.78
		40	43.60	96.60	15.74	0.75	0.78
		50	49.84	97.38	18.03	0.74	0.78
		60	55.34	97.87	20.09	0.73	0.77
	k=40	10	13.50	80.97	15.53	0.78	0.79
		20	26.66	91.45	17.20	0.78	0.79
		30	39.12	95.32	18.29	0.78	0.80
		40	48.28	96.98	20.57	0.77	0.79
		50	55.94	97.80	24.36	0.76	0.79
		60	61.39	98.21	28.20	0.75	0.78
ARM (labels only)	k=20	10	12.13	80.02	12.19	0.75	0.78
		20	21.24	89.06	13.67	0.75	0.78
		30	29.94	92.66	15.29	0.74	0.78
		40	37.38	94.62	17.06	0.74	0.78
		50	43.19	95.65	20.11	0.72	0.77
		60	47.19	96.20	23.07	0.71	0.76
	k=40	10	12.17	79.49	15.58	0.78	0.79
		20	22.82	89.40	17.12	0.77	0.79
		30	32.63	93.22	19.46	0.77	0.79
		40	41.58	95.15	24.36	0.76	0.79
		50	47.98	96.10	27.48	0.75	0.79
		60	53.76	96.58	30.91	0.74	0.78
RAD	k=20	10	19.94	86.06	12.61	0.75	0.78
		20	35.37	92.70	14.87	0.75	0.78
		30	43.87	94.82	17.36	0.74	0.78
		40	48.51	95.74	20.35	0.73	0.77
		50	50.96	96.20	23.78	0.72	0.76
		60	52.99	96.62	28.36	0.71	0.75
	k=40	10	22.03	86.56	16.20	0.78	0.79
		20	40.09	93.14	19.90	0.78	0.80
		30	50.61	95.16	23.45	0.77	0.79
		40	55.77	96.05	27.74	0.76	0.79
		50	58.69	96.54	33.55	0.75	0.78
		60	60.66	96.81	41.57	0.74	0.77

Table 3: Results for the sentiment control task.
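
The Dist 2 and Dist 3 columns report n-gram diversity. A minimal sketch of one common definition is given below: the number of unique n-grams divided by the total number of n-grams, pooled over a set of generated texts. The whitespace tokenization and the pooling over all texts passed in are simplifying assumptions, and the function name is hypothetical; the exact tokenization and grouping behind the reported numbers are not specified in this appendix.

```python
from typing import Iterable


def distinct_n(texts: Iterable[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams over a collection of texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()  # assumption: whitespace tokenization
        # Slide a window of length n over the token sequence.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    # Texts shorter than n tokens contribute no n-grams.
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    samples = ["a b a b", "a b c"]
    print(distinct_n(samples, 2))
    print(distinct_n(samples, 3))
```

Values close to 1 indicate little n-gram repetition across the pooled samples; lower values indicate more repetitive text.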

B.3 Generated Examples

Examples for detoxification and sentiment control are presented in table 4, table 5 and table 6.

Prompt: X”) received a series of text messages from her attacker, including one that read “If I…
$\beta$	Continuation	Toxicity
$10$	have to, I will kill myself and then I will take all of you as well and all of	0.86
$50$	get out of here I’m going to kill you“" “which was translated into a	0.89
$100$	get in my car I’m going to drive over a curb and break it in two,”	0.47
$300$	ever have the opportunity to meet with anyone in real life, it would be you. I know I	0.24

Table 4: Most toxic samples (out of 25 generated) from the ARM with different $\beta$ coefficients, and corresponding Toxicity scores from the Perspective API.

Prompt: No matter who you’ll pick you’ll…
GPT-2	need to find the correct pieces for it and the right combination of characters and weapons to kill the bosses
GeDi	need the focus for this one. Personally, I had a good time in jokingly calling out Card
DExperts	need a versatile DH player and Dahl will certainly help you as he’s ideally positioned at the back of
RAD	still be playing against a well-prepared team. In this day and age of instant gratification we
ARM	probably want to make sure you do at least some research on a team before deciding, as the competition

Table 5: Detoxification task: samples from ARM and RAD with $k=20$, $\beta=100$. Examples for GeDi and DExperts are from Deng & Raffel (2023).

Prompt: Meanwhile the iron and lead…
GPT-2	in the blood of an innocent child may be of no consequence if the parent dies before he or she
GeDi	gathered, our new friends danced, jests were merrily spiced, and plenty of songs fired
DExperts	fields may not seem like the perfect areas for reclaiming coal and steel, but technology has brought mining
RAD	industries, and also the energy and materials markets in the United States, have grown and matured. The
ARM	in my life have a great effect on me. They bring me to life when I think of all

Table 6: Sentiment control task: samples from ARM and RAD with $k=20$, $\beta=30$. Examples for GeDi and DExperts are from Deng & Raffel (2023).