Enhancing Your Alignment via RM-LLM Game (2024)

Pengyu Cheng¹Yifan Yang^∗1Jian Li^∗1Yong Dai¹Tianhao Hu¹Peixin Cao¹Nan Du¹Xiaolong Li²
Tencent AI Lab ¹Shenzhen & ²Seattle
{pengyucheng,tobyfyang,jackjianli}@tencent.comEqual Contribution.

Abstract

Human preference alignment is essential to improve the interaction quality of large language models (LLMs).Existing alignment methods depend on manually annotated preference data to guide the LLM optimization directions.However, continuously updating LLMs for alignment raises a distribution gap between model-generated samples and human-annotated responses, hindering training effectiveness. To mitigate this issue, previous methods require additional preference annotation on newly generated samples to adapt to the shifted distribution, which consumes a large amount of annotation resources. Targeting more efficient human preference optimization, we propose an Adversarial Preference Optimization (APO) framework, in which the LLM and the reward model update alternatively via a min-max game.Through adversarial training, the reward model can adapt to the shifted generation distribution of the LLM without any additional annotation.With comprehensive experiments, we find the proposed adversarial training framework further enhances existing alignment baselines in terms of LLM helpfulness and harmlessness. The code is athttps://github.com/Linear95/APO.

Adversarial Preference Optimization:
Enhancing Your Alignment via RM-LLM Game

Pengyu Cheng^†^†thanks: Equal Contribution.¹Yifan Yang^∗1Jian Li^∗1Yong Dai¹Tianhao Hu¹Peixin Cao¹Nan Du¹Xiaolong Li²Tencent AI Lab ¹Shenzhen & ²Seattle{pengyucheng,tobyfyang,jackjianli}@tencent.com

1 Introduction

Learned from massive textual data with billions of parameters, large language models (LLMs), such as GPT-4(OpenAI, 2023) and Gemini(Team etal., 2023), have shown remarkable AI capabilities, especially in domains of natural language processing(Jiao etal., 2023; Han etal., 2023), logical reasoning(Liu etal., 2023a; Frieder etal., 2023), and programming(Surameery and Shakor, 2023; Tian etal., 2023).Among the training techniques that help LLMs achieve such success, human preference alignment finetunes LLMs to follow users’ feedback, which has been widely recognized as essential for improving human-model interaction(Ouyang etal., 2022). However, highly qualified human feedback requires meticulous annotations of query-response pairs in various topics(Askell etal., 2021), which is rather challenging and forms a sharp contrast to the easy access of enormous unsupervised pretraining text corpus. Hence, the limitation of preference data collection raises demands for training sample efficiency of preference alignment methods(Yuan etal., 2023; Sun etal., 2023; Rafailov etal., 2023).

To utilize preference data, current feedback alignment methods are proposed mainly from three perspectives(Wang etal., 2023b): reinforcement learning(Ouyang etal., 2022), contrastive learning(Yuan etal., 2023; Rafailov etal., 2023; Liu etal., 2023c), and language modeling(Dong etal., 2023; Touvron etal., 2023b; Wang etal., 2023a).Reinforcement learning with human feedback (RLHF) (Kreutzer etal., 2018; Ziegler etal., 2019) is the earliest exploration and has been acknowledged as the mainstream for LLM alignment(Ouyang etal., 2022; Touvron etal., 2023b). RLHF first learns a reward model from the human preference data, then optimizes the expected reward score of the LLM’s output samples via the Proximal Policy Optimization (PPO) algorithm(Schulman etal., 2017).Although widely used, RLHF has been criticized as being unstable during the fine-tuning and complicated in implementation and computational resource consumption(Yuan etal., 2023; Rafailov etal., 2023).

Towards more efficient and stable training, instead of directly optimizing the non-differentiable rewards, contrastive learning methods enlarge the likelihood gap between preferred and rejected response pairs(Yuan etal., 2023; Rafailov etal., 2023; Zhao etal., 2023). Alternatively, language modeling-based methods remain using language modeling loss to align preference, but with different data preparation strategies(Dong etal., 2023; Liu etal., 2023b; Wang etal., 2023a). For example, rejection sampling(Dong etal., 2023; Touvron etal., 2023b) select responses with top reward scores as the language modeling fine-tuning samples, while Wang etal. (2023a) and Liu etal. (2023b) add different prompts to different responses based on the corresponding preference levels.

Although contrastive-learning and language-modeling-based methods have partially alleviated the inefficiency of RLHF, the sampling distribution shifting problem(Touvron etal., 2023b) still hinders the alignment effectiveness: after a few steps of RLHF updates, a distribution gap emerges between LLM generated samples and preference-annotated data (as in Figure1).Consequently, the reward model learned with human annotation loses its performance in providing faithful reward signals on newly generated responses, which damages the alignment performance.To address this problem, most aforementioned alignment methods require additional annotation of human feedback on newly generated responses after a certain amount of LLM updating steps(Touvron etal., 2023b), which leads to increasingly massive manpower costs(Askell etal., 2021). Besides, the vast time consumption of extra manual annotation also significantly slows down the alignment training process.

To reduce the manual annotation efforts and improve the preference optimization efficiency,we propose a novel adversarial learning framework called Adversarial Preference Optimization (APO). Inspired by generative adversarial networks (GANs)(Goodfellow etal., 2014; Arjovsky etal., 2017), we conduct an adversarial game between the reward model (RM) and the LLM: the LLM generates responses to maximize the expected reward score, while the RM aims to distinguish the score difference between golden and sampled responses. To verify the effectiveness of the APO framework, we conduct experiments on the Helpful&Harmless(Bai etal., 2022) datasets with Alpaca(Taori etal., 2023) and LLaMA-2(Touvron etal., 2023b) as thebase models. With the same amount of human preference data, both the LLM and RM receive additional performance gains through the APO game,compared with several commonly used LLM alignment baselines.

Enhancing Your Alignment via RM-LLM Game (1)

2 Preliminary

Human Preference Alignment

aims to fine-tune the LLM response policy $\pi_{\theta}({\bm{y}}|{\bm{x}})$ with a group of human preference data ${\mathcal{D}}_{\text{P}}=\{({\bm{x}},{\bm{y}}^{w},{\bm{y}}^{l})\}$ , so that the LLM can generate more satisfying responses to improve the human-model interaction quality. In each preference triplet $({\bm{x}},{\bm{y}}^{w},{\bm{y}}^{l})$ , ${\bm{y}}^{w}\succ{\bm{y}}^{l}$ means response ${\bm{y}}^{w}$ is more “preferred” than ${\bm{y}}^{l}$ with respect to input ${\bm{x}}$ . To align the LLM, a reward model (RM)(Christiano etal., 2017; Ouyang etal., 2022) $r_{\phi}({\bm{x}},{\bm{y}})$ is commonly utilized to score the quality of the LLM generated samples. RM learns human preferences ${\mathcal{D}}_{\text{P}}$ with a ranking loss(Bradley and Terry, 1952) ${\mathcal{L}}_{\text{rank}}(r_{\phi};{\mathcal{D}}_{\text{P}}):=$

\textstyle-\mathbb{E}_{{\mathcal{D}}_{\text{P}}}[\log\sigma(r_{\phi}({\bm{x}},%{\bm{y}}^{w})-r_{\phi}({\bm{x}},{\bm{y}}^{l}))],\vspace{-1.5mm}

(1)

where $\sigma(\cdot)$ is the Sigmoid function.For a response pair $({\bm{y}},\tilde{{\bm{y}}})$ , the reward difference $r_{\phi}({\bm{x}},{\bm{y}})-r_{\phi}({\bm{x}},\tilde{{\bm{y}}})$ provides a preference probability :

	$\displaystyle Q_{\phi}({\bm{y}}\succ\tilde{{\bm{y}}}\|{\bm{x}})=$	$\displaystyle\frac{\exp(r_{\phi}({\bm{x}},{\bm{y}}))}{\exp(r_{\phi}({\bm{x}},{%\bm{y}}))+\exp(r_{\phi}({\bm{x}},\tilde{{\bm{y}}}))}$
	$\displaystyle=$	$\displaystyle\sigma(r_{\phi}({\bm{x}},{\bm{y}})-r_{\phi}({\bm{x}},\tilde{{\bm{%y}}})).\vspace{-1.5mm}$		(2)

With equation2, training RM with the Bradley-Terry ranking loss can be explained as the log-likelihood maximization of $Q_{\phi}$ :

\textstyle{\mathcal{L}}_{\text{rank}}(r_{\phi};{\mathcal{D}}_{\text{P}})=-%\mathbb{E}_{{\mathcal{D}}_{\text{P}}}[\log Q_{\phi}({\bm{y}}^{w}\succ{\bm{y}}^%{l}|{\bm{x}})]

(3)

With a learned RM $r_{\phi}({\bm{x}},{\bm{y}})$ , human preference alignment methods(Ouyang etal., 2022; Rafailov etal., 2023; Liu etal., 2023c) target on maximizing the reward expectation of generated responses:

	$\displaystyle\max_{\pi_{\theta}}\mathbb{E}_{{\bm{x}}\sim\mathcal{D},{\bm{y}}%\sim\pi_{\theta}({\bm{y}}\|{\bm{x}})}[r_{\phi}($	$\displaystyle{\bm{x}},{\bm{y}})]$
	$\displaystyle-\beta\text{KL}[\pi_{\theta}({\bm{y}}\|{\bm{x}})$	$\displaystyle\\|\pi_{\text{ref}}({\bm{y}}\|{\bm{x}})],\vspace{-1.5mm}$		(4)

where $\pi_{\text{ref}}({\bm{y}}|{\bm{x}})$ is a reference language model. $\text{KL}[\pi_{\theta}({\bm{y}}|{\bm{x}})\|\pi_{\text{ref}}({\bm{y}}|{\bm{x}})]$ prevents $\pi_{\theta}({\bm{y}}|{\bm{x}})$ from the degeneration of repeating a single response with the highest reward score, which also preserves the generation diversity. Since response samples $y$ are discrete, it is challenging to directly back-propagate from reward $r_{\phi}({\bm{x}},{\bm{y}})$ to policy $\pi_{\theta}({\bm{y}}|{\bm{x}})$ . The typical solution to equation4 is reinforcement learning from human feedback (RLHF)(Ouyang etal., 2022), via the proximal policy optimization (PPO) algorithms(Schulman etal., 2017).

However, PPO suffers from implementation complexity and training instability(Yuan etal., 2023; Sun etal., 2023). Recent studiestry to avoid online reinforcement learning with offline schemes. DPO(Rafailov etal., 2023) finds a connection between the reward model and LLM’s optimal solution, then replaces the reward model with the likelihood ratio of $\pi_{\theta}$ and $\pi_{\text{ref}}$ , as ${\mathcal{L}}_{\text{DPO}}(\pi_{\theta}):=$

\displaystyle-\mathbb{E}\big{[}\log\sigma\big{(}\beta\log\frac{\pi_{\theta}({%\bm{y}}^{w}|{\bm{x}})}{\pi_{\text{ref}}({\bm{y}}^{w}|{\bm{x}})}-\beta\log\frac%{\pi_{\theta}({\bm{y}}^{l}|{\bm{x}})}{\pi_{\text{ref}}({\bm{y}}^{l}|{\bm{x}})}%\big{)}\big{]}.\vspace{-1.5mm}

Analogously, other methods consider human feedback learning from the perspective of contrastive learning. For example, RRHF(Yuan etal., 2023) propose a ranking loss as ${\mathcal{L}}_{\text{RRHF}}(\pi_{\theta}):=$

	$\displaystyle-\mathbb{E}_{{\mathcal{D}}}\big{[}\text{ReLU}(\log\pi_{\theta}({%\bm{y}}^{l}\|{\bm{x}})-$	$\displaystyle\log\pi_{\theta}({\bm{y}}^{w}\|{\bm{x}}))$
	$\displaystyle-\lambda\log$	$\displaystyle\pi_{\theta}({\bm{y}}^{\text{best}}\|{\bm{x}})\big{]}\vspace{-1.5mm}$		(5)

where ${\bm{y}}^{\text{best}}$ is the corresponding response to ${\bm{x}}$ with the highest reward, and the preference data ${\mathcal{D}}$ can be built from human annotation ${\mathcal{D}}_{\text{P}}$ or RM ranking results. Besides, rejection sampling (RJS)(Touvron etal., 2023b) (also called RAFT(Dong etal., 2023) and best-of-N(Stiennon etal., 2020)) directly fine-tunes LLM on ${\bm{y}}^{\text{best}}$ to further simplify the alignment process, ${\mathcal{L}}_{\text{RJS}}(\pi_{\theta}):=$

\textstyle-\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}},{\bm{y}}^{1},{\bm{y}}^{2},%\dots\bm{y}^{S}\sim\pi_{\theta}({\bm{y}}|{\bm{x}})}[\log\pi_{\theta}({\bm{y}}^%{\text{best}}|{\bm{x}})]\vspace{-1.5mm}

(6)

where ${\bm{y}}^{\text{best}}=\operatorname*{arg\,max}_{1\leq s\leq S}\{r_{\phi}({\bm%{x}},{\bm{y}}^{s})\}$ is the sampled response with the highest reward score. Azar etal. (2023) extend the alignment objective into a more general form by replacing RM $r_{\phi}$ with the human preference probability $P({\bm{y}}\succ\tilde{{\bm{y}}}|{\bm{x}})$ :

	$\displaystyle\max_{\pi_{\theta}}\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}},{\bm{y}}%\sim\pi_{\theta}(\cdot\|{\bm{x}}),\tilde{{\bm{y}}}\sim\mu(\cdot\|{\bm{x}})}[\Psi%(P($	$\displaystyle{\bm{y}}\succ\tilde{{\bm{y}}}\|{\bm{x}}))]$
	$\displaystyle-\beta\text{KL}[\pi_{\theta}({\bm{y}}\|{\bm{x}})\\|\pi_{\text{ref}}($	$\displaystyle{\bm{y}}\|{\bm{x}})],$		(7)

where $\Psi(\cdot)$ is a non-decreasing real-value function. This general alignment objective is called $\Psi$ PO.

Generative Adversarial Networks (GANs)

are a classical group of unsupervised machine learning approaches that can fit complicated real-data distributions in an adversarial learning scheme(Goodfellow etal., 2014). GANs use a discriminator $D(\cdot)$ and a generator $G(\cdot)$ to play a min-max game. The generator tries to cheat the discriminator with real-looking generated samples, while the discriminator aims to distinguish the true data and the samples:

	$\displaystyle\min_{G}\max_{D}V(D,$	$\displaystyle G)=\mathbb{E}_{{\bm{x}}\sim P_{\text{data}}({\bm{x}})}[\log D({%\bm{x}})]$		(8)
		$\displaystyle+\mathbb{E}_{{\bm{z}}\sim P_{{\bm{z}}}({\bm{z}})}[\log(1-D(G({\bm%{z}}))],\vspace{-1.5mm}$

where ${\bm{z}}$ is a random vector from prior $P_{{\bm{z}}}({\bm{z}})$ to induce the generation sample distribution. The objective equation8 has been theoretically justified as the Jensen–Shannon (JS) divergence between distributions of real data and samples(Goodfellow etal., 2014). Arjovsky etal. (2017) replace the JS divergence with the Wasserstein distance(Villani, 2009) and propose the Wasserstein GAN (WGAN):

\min_{g_{\theta}}\max_{\|f\|_{\text{L}}\leq K}\mathbb{E}_{P_{\text{data}}}[f({%\bm{x}})]-\mathbb{E}_{P_{\bm{z}}}[f(g_{\theta}({\bm{z}}))],

(9)

where $\|f\|_{\text{L}}\leq K$ requires $f(\cdot)$ to be a $K$ -Lipschitz continuous function. Wasserstein GANs have been recognized with higher training stability than the original GANs(Arjovsky etal., 2017).

In policy optimization of reinforcement learning, inspired by GANs, Ho and Ermon (2016) propose generative adversarial imitation learning (GAIL):

	$\displaystyle\min_{\pi_{\theta}}$	$\displaystyle\max_{D}\ \mathbb{E}_{\pi_{\theta}({\bm{a}}\|{\bm{s}})}[\log(D({%\bm{s}},{\bm{a}}))]$		(10)
		$\displaystyle+\mathbb{E}_{\pi_{\text{E}}({\bm{a}}\|{\bm{s}})}[\log(1-D({\bm{s}}%,{\bm{a}}))]-\lambda\text{H}(\pi_{\theta}),$

where ${\bm{a}}$ is the corresponding action based on the state ${\bm{s}}$ , $D$ is a discriminator distinguishing difference between the learning policy $\pi_{\theta}$ and an expert policy $\pi_{\text{E}}$ , and $\text{H}(\pi_{\theta})$ is the entropy of $\pi_{\theta}$ .

3 Adversarial Preference Optimization

We begin with a revisit of the human preference alignment in a mathematical optimization form:

	$\displaystyle\max_{\pi_{\theta}}\ \$	$\displaystyle\mathbb{E}_{{\bm{x}}\sim\mathcal{D},{\bm{y}}\sim\pi_{\theta}({\bm%{y}}\|{\bm{x}})}[r_{\phi}({\bm{x}},{\bm{y}})],$		(11)
	s.t.	$\displaystyle\text{KL}[\pi_{\theta}({\bm{y}}\|{\bm{x}})\\|\pi_{\text{ref}}({\bm{%y}}\|{\bm{x}})]<\eta,$

which maximizes the expected reward value under the generation policy $\pi_{\theta}({\bm{y}}|{\bm{x}})$ , under a KL-constraint with the reference $\pi_{\text{ref}}({\bm{y}}|{\bm{x}})$ . Applying the method of Lagrange multipliers, one can easily obtain the original alignment objective in equation4. As discussed in Section1, the above optimization becomes ineffective after several steps of LLM updating, because of the sample distribution shifting problem in Figure1. To address this problem, we aim to adapt the RM correspondingly with the LLM updates.Inspired by GANs(Goodfellow etal., 2014), we design the following adversarial game between the LLM $\pi_{\theta}$ and RM $r_{\phi}$ :

$\displaystyle\min_{r_{\phi}}\max_{\pi_{\theta}}\ \$	$\displaystyle\mathbb{E}_{P_{\theta}({\bm{x}},{\bm{y}})}[r_{\phi}({\bm{x}},{\bm%{y}})]-\mathbb{E}_{P_{\text{gold}}({\bm{x}},{\bm{y}})}[r_{\phi}({\bm{x}},{\bm{%y}})]$
s.t.	$\displaystyle\text{KL}[P({\bm{y}}\succ\tilde{{\bm{y}}}\|{\bm{x}})\\|Q_{\phi}({%\bm{y}}\succ\tilde{{\bm{y}}}\|{\bm{x}})]<\eta_{2},$
	$\displaystyle\text{KL}[\pi_{\theta}({\bm{y}}\|{\bm{x}})\\|\pi_{\text{ref}}({\bm{%y}}\|{\bm{x}})]<\eta_{1},\vspace{-1.5mm}$	(12)

where $P_{\theta}({\bm{x}},{\bm{y}})=\pi_{\theta}({\bm{y}}|{\bm{x}})P_{\mathcal{D}}({%\bm{x}})$ is the model-generated sample distribution, and $P_{\text{gold}}({\bm{x}},{\bm{y}})$ denotes the annotated golden response distribution.

Based on equation12, we conduct an adversarial game, in which LLM $\pi_{\theta}({\bm{y}}|{\bm{x}})$ needs to improve its response quality to get a higher expected reward, while RM $r_{\phi}({\bm{x}},{\bm{y}})$ tries to enlarge the reward gap between the golden responses and the generation from $\pi_{\theta}({\bm{y}}|{\bm{x}})$ .Inspired by the original preference alignment objective (equation11), we add two KL regularizers to $\pi_{\theta}$ and $r_{\phi}$ respectively to prevent over-fitting and degeneration. Here $P({\bm{y}}\succ\tilde{{\bm{y}}}|{\bm{x}})$ denotes the ground-truth human preference probability, and $Q_{\phi}({\bm{y}}\succ\tilde{{\bm{y}}}|{\bm{x}})$ is described in equation2.Note that we use the reverse $\text{KL}[\pi_{\theta}\|\pi_{\text{ref}}]$ to constrain the generative model $\pi_{\theta}$ but the forward $\text{KL}[P\|Q_{\phi}]$ for the discriminate model $r_{\phi}$ . Our intuition is that $\text{KL}[\pi_{\theta}\|\pi_{\text{ref}}]$ can be estimated with $\pi_{\theta}$ -generated samples, paying more attention to the generation quality; while $\text{KL}[P\|Q_{\phi}]$ is practically estimated with groud-truth preference data, focusing on the preference fitting ability of reward models. We call this novel optimization form as Adversarial Preference Optimization (APO).

To play the adversarial game above, we alternatively update one epoch of $\pi_{\theta}({\bm{y}}|{\bm{x}})$ or $r_{\phi}({\bm{x}},{\bm{y}})$ with the other’s parameters fixed. Next, we provide detailed descriptions of the RM optimization step and LLM optimization step of APO separately.

Enhancing Your Alignment via RM-LLM Game (2)

3.1 RM Optimization Step

For RM optimization of APO, we fix LLM $\pi_{\theta}({\bm{y}}|{\bm{x}})$ and update $r_{\phi}({\bm{x}},{\bm{y}})$ . Note that in equation12 $\text{KL}[\pi_{\theta}({\bm{y}}|{\bm{x}})\|\pi_{\text{ref}}({\bm{y}}|{\bm{x}})]$ has no relation with $r_{\phi}$ , so we can simplify the objective for RM updates:

	$\displaystyle\min_{r_{\phi}}\$	$\displaystyle\mathbb{E}_{P_{\theta}({\bm{x}},{\bm{y}})}[r_{\phi}({\bm{x}},{\bm%{y}})]-\mathbb{E}_{P_{\text{gold}}({\bm{x}},{\bm{y}})}[r_{\phi}({\bm{x}},{\bm{%y}})]$
	s.t.	$\displaystyle\text{KL}[P({\bm{y}}\succ\tilde{{\bm{y}}}\|{\bm{x}})\\|Q_{\phi}({%\bm{y}}\succ\tilde{{\bm{y}}}\|{\bm{x}})]<\eta_{2}\vspace{-.5mm}$		(13)

The equation13 indicates that the APO-RM should enlarge the reward gap between golden answers and generated responses to challenge $\pi_{\theta}({\bm{y}}|{\bm{x}})$ for better generation quality. Note that equation13 has a similar form as WGANs in equation9, which can be intuitively explained as the calculation of the Wasserstein distance between distributions $P_{\theta}$ and $P_{\text{gold}}$ . However, equation13 is not rigorously a Wasserstein distance because $r_{\phi}({\bm{x}},{\bm{y}})$ does not satisfy the Lipschitz continuity as described in Arjovsky etal. (2017).

To practically implement APO-RM training, we first collect a set of user queries $\{{\bm{x}}_{m}\}\sim P_{\mathcal{D}}({\bm{x}})$ , then annotate each ${\bm{x}}_{m}$ with a golden response ${\bm{y}}^{\text{gold}}_{m}$ , ${\mathcal{D}}_{\text{gold}}=\{({\bm{x}}_{m},{\bm{y}}_{m}^{\text{gold}})\}_{m=1%}^{M}$ . Each $({\bm{x}}_{m},{\bm{y}}^{\text{gold}})$ can be regarded as a sample drawn from $P_{\text{gold}}({\bm{x}},{\bm{y}})$ . Meanwhile, we generate ${\bm{y}}^{s}_{m}\sim\pi_{\theta}({\bm{y}}|{\bm{x}}_{m})$ , so that $({\bm{x}}_{m},{\bm{y}}^{s}_{m})$ is a sample from distribution $P_{\theta}({\bm{x}},{\bm{y}})=P_{\mathcal{D}}({\bm{x}})\pi_{\theta}({\bm{y}}|{%\bm{x}})$ . Denote ${\mathcal{D}}_{\text{sample}}=\{({\bm{x}}_{m},{\bm{y}}^{s}_{m})\}_{m=1}^{M}$ . Combining ${\bm{y}}^{\text{gold}}$ and ${\bm{y}}^{s}$ , we obtain an APO sample set ${\mathcal{D}}_{\text{APO}}=\{({\bm{x}}_{m},{\bm{y}}_{m}^{\text{gold}},{\bm{y}}%^{s}_{m})\}$ . Then the APO-RM objective in equation13 can be calculated:

	$\displaystyle\min_{r_{\phi}}\mathbb{E}_{P_{\theta}({\bm{x}},{\bm{y}})}[r_{\phi%}({\bm{x}},{\bm{y}})]-\mathbb{E}_{P_{\text{gold}}({\bm{x}},{\bm{y}})}[r_{\phi}%({\bm{x}},{\bm{y}})]$
$\displaystyle=$	$\displaystyle\min_{r_{\phi}}\mathbb{E}_{{\mathcal{D}}_{\text{sample}}}[r_{\phi%}({\bm{x}},{\bm{y}}^{s})]-\mathbb{E}_{{\mathcal{D}}_{\text{gold}}}[r_{\phi}({%\bm{x}},{\bm{y}}^{\text{gold}})]$
$\displaystyle=$	$\displaystyle\max_{r_{\phi}}\mathbb{E}_{{\mathcal{D}}_{\text{APO}}}[r_{\phi}({%\bm{x}},{\bm{y}}^{\text{gold}})-r_{\phi}({\bm{x}},{\bm{y}}^{s})].\vspace{-1.mm}$	(14)

Note that equation14 also enlarges the reward difference between pairs of responses as the Bradley-Terry (BT) loss (equation1)does. Hence, for training stability, we empirically use the BT loss to optimize equation14 instead, ${\mathcal{L}}_{\text{rank}}(r_{\phi};{\mathcal{D}}_{\text{APO}}):=$

-\mathbb{E}_{{\mathcal{D}}_{\text{APO}}}\left[\log\sigma\left(r_{\phi}({\bm{x}%},{\bm{y}}^{\text{gold}})-r_{\phi}({\bm{x}},{\bm{y}}^{s})\right)\right]\vspace%{-1.mm}

(15)

With a Lagrange multiplier $\beta_{2}>0$ , we convert the KL constraint in equation13 to a regularizer:

	$\displaystyle{\mathcal{L}}_{\text{APO-RM}}$	$\displaystyle(r_{\phi})={\mathcal{L}}_{\text{rank}}(r_{\phi};{\mathcal{D}}_{%\text{APO}})$		(16)
		$\displaystyle+\beta_{2}\text{KL}[P({\bm{y}}\succ\tilde{{\bm{y}}}\|{\bm{x}})\\|Q_%{\phi}({\bm{y}}\succ\tilde{{\bm{y}}}\|{\bm{x}})].\vspace{-1.mm}$

Note that $\text{KL}[P\|Q_{\phi}]=\mathbb{E}_{P}[\log P-\log Q_{\phi}]=-\text{H}(P)-%\mathbb{E}_{P}[\log Q_{\phi}]$ , where $\text{H}(P)$ is the entropy of ground-truth human preference $P({\bm{y}}\succ\tilde{{\bm{y}}}|{\bm{x}})$ as a constant for $r_{\phi}$ updating. As introduced in equation2, with a preference set ${\mathcal{D}}_{\text{P}}=\{({\bm{x}}_{n},{\bm{y}}^{w}_{n},{\bm{y}}^{l}_{n})\}$ representing samples of $P({\bm{y}}\succ\tilde{{\bm{y}}}|{\bm{x}})$ , we have $-\mathbb{E}_{P}[\log Q_{\phi}]={\mathcal{L}}_{\text{rank}}(r_{\phi};{\mathcal{%D}}_{\text{P}})$ . Then, the overall loss ${\mathcal{L}}_{\text{APO-RM}}(r_{\phi})$ is equivalent to:

{\mathcal{L}}_{\text{rank}}(r_{\phi};{\mathcal{D}}_{\text{APO}})+\beta_{2}{%\mathcal{L}}_{\text{rank}}(r_{\phi};{\mathcal{D}}_{\text{P}}).\vspace{-1.mm}

(17)

The above APO-RM loss involves two datasets ${\mathcal{D}}_{\text{APO}}$ and ${\mathcal{D}}_{\text{P}}$ . Since the golden responses consume much larger annotation resources than pair-wised response comparison, ${\mathcal{D}}_{\text{APO}}$ practically has a significantly smaller size than ${\mathcal{D}}_{\text{P}}$ .In experiments, we find the re-weighting parameter $\beta$ requires to be larger to avoid over-fitting on the relatively smaller APO sample set ${\mathcal{D}}_{\text{APO}}$ .We conduct more detailed ablation studies in the experimental Section4.

3.2 LLM Optimization Step

In the APO-LLM optimization step, we fix $r_{\phi}({\bm{x}},{\bm{y}})$ and update policy $\pi_{\theta}({\bm{y}}|{\bm{x}})$ ,which is equivalent to the original preference optimization in equation4.Naturally, previous preference aligning methods, such as PPO(Ouyang etal., 2022), DPO(Rafailov etal., 2023), RRHF(Yuan etal., 2023), and RJS/RAFT(Dong etal., 2023; Liu etal., 2023c) remain qualified to solve the optimization and compatible with the APO framework.

Relation with WGAN

If we treat $r_{\phi}({\bm{x}},{\bm{y}})$ as the score function $f$ in equation9, then the APO objective has a similar form as the Wasserstein distance between generation $P_{\theta}({\bm{x}},{\bm{y}})$ and annotation $P_{\text{gold}}({\bm{x}},{\bm{y}})$ . However, WGAN only has a Lipschitz constraint for the score function $f$ (or $r_{\phi}$ ), but APO objective has both KL constraints on both score $r_{\phi}$ and generation policy $\pi_{\theta}$ .

Relation with GAIL

GAIL is also an adversarial game designed for policy optimization. The expert policy $\pi_{\text{E}}$ in GAIL plays a similar role as the golden distribution $P_{\text{gold}}$ in APO. However, GAIL does not explicitly have a constraint on the discriminator $D$ , while APO requires RM $r_{\phi}$ to maintain close to the ground-truth human preference distribution.

Relation with $\Psi$ PO

If we choose the comparison policy $\mu(\cdot|{\bm{x}})$ as the golden annotation, and $\Psi(\cdot)=\log(\cdot)$ , the $\Psi$ PO objective:

	$\displaystyle\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}},{\bm{y}}\sim\pi_{\theta}(%\cdot\|{\bm{x}}),\tilde{{\bm{y}}}\sim\mu(\cdot\|{\bm{x}})}[\Psi(P({\bm{y}}\succ%\tilde{{\bm{y}}}\|{\bm{x}}))]$
$\displaystyle=$	$\displaystyle\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}},{\bm{y}}^{s}\sim\pi_{\theta%},{\bm{y}}^{\text{gold}}\sim P_{\text{gold}}}[\log P({\bm{y}}^{s}\succ{\bm{y}}%^{\text{gold}})]$
$\displaystyle\approx$	$\displaystyle\mathbb{E}_{{\mathcal{D}}_{\text{APO}}}[\log\sigma(r_{\phi}({\bm{%x}},{\bm{y}}^{s})-r_{\phi}({\bm{x}},{\bm{y}}^{\text{gold}}))],\vspace{-1.5mm}$	(18)

which is exact ${\mathcal{L}}_{\text{rank}}(r_{\phi};{\mathcal{D}}_{\text{APO}})$ in equation15. Therefore, the APO RM objective is a special case of $\Psi$ PO. However, $\Psi$ PO has neither APO’s KL regularizer to avoid RM overfitting nor the adversarial learning scheme between $r_{\phi}$ and $\pi_{\theta}$ .

4 Experiments

We verify the effectiveness of APO on the Helpful&Harmless (HH) dataset(Bai etal., 2022) with Alpaca(Taori etal., 2023) and LLaMA-2(Touvron etal., 2023b) as the base LLM. Due to the limitation of computational resources, we find the original PPO(Ouyang etal., 2022) has very low training efficiency, especially during the online sampling process.Since recent offline alignment methods have shown competitive performance to PPO(Yuan etal., 2023),we choose RJS(Dong etal., 2023), RRHF(Yuan etal., 2023), and DPO(Rafailov etal., 2023) as baselines instead.

4.1 Experimental Setups

Data Type	HH Train Set (86K)		HH Test Set (4.7K)
Preference Pairs	Cleaned HH training pairs, used to learn RM ${}_{\text{Test}}$		RM testing pairs
Data Type	HH ${}_{\text{RM}}$ Train Set (20K)	HH ${}_{\text{LLM}}$ Train Set (66K)	HH ${}_{\text{Test}}$ Set (4.7K)
Preference Pairs	RM training set ${\mathcal{D}}_{\text{P}}$	Validation set HH ${}_{\text{Dev}}$ for RMs	RM testing pairs
Generated Samples	Negative responses for ${\mathcal{D}}_{\text{APO}}$	LLM alignment samples ${\mathcal{D}}_{\text{Q}}$	LLM evaluation samples
Golden Answers	Positive responses for ${\mathcal{D}}_{\text{APO}}$	–	–

Data Preparation

In the HH set(Bai etal., 2022), each query is answered with two responses. Annotators are asked to label “chosen” or “reject” for each response based on the interaction quality. To use HH data for APO experiments, we split the HH set intothree parts as in Table1:

•
Training Data: For separately updating the RM and LLM, we randomly split HH into an RM training set (HH ${}_{\text{RM}}$ , 20K queries) and an LLM training set (HH ${}_{\text{LLM}}$ , 66K queries). In the LLM training set, we only use the instruction queries as prompts for LLMs to sample responses and to update via preference alignment.
•
Annotated Golden Data:Due to the annotation resource limitation, instead of manually labeling,we call GPT-4(OpenAI, 2023) API with the queries in HH ${}_{\text{RM}}$ set to collect responses as the simulated golden annotation. GPT-4 has been recognized as the state-of-the-art LLM, so we assume its responses are qualified to be golden for LLaMA-based 7B models. The data collection prompts and details are shown in AppendixA.
See Also
The memory perturbation equation | Proceedings of the 37th International Conference on Neural Information Processing Systems
•
Test & Validation Data:Note that we only utilize queries in HH ${}_{\text{LLM}}$ for updating LLMs. To make more comprehensive usage of HH ${}_{\text{LLM}}$ ’s response pairs, we randomly select 10K response pairs and build a validation set HH ${}_{\text{Dev}}$ for RMs. Both evaluations of RMs and LLMs are conducted on the original HH test set HH ${}_{\text{Test}}$ , where response pairs and instruction queries are prepared for RM and LLM evaluation respectively.

Evaluation Metrics

To evaluate the performance of RMs and LLMs, we use the following metrics:

•
Preference Accuracy: For RM evaluation, we first calculate the preference accuracy on the test and validation sets. If an RM $r({\bm{x}},{\bm{y}})$ outputs $r({\bm{x}},{\bm{y}}^{w})>r({\bm{x}},{\bm{y}}^{l})$ for the preference triplet $({\bm{x}},{\bm{y}}^{w},{\bm{y}}^{l})$ , we denote a correct prediction. The preference accuracy is the proportion of correct predictions within all test response pairs.

•

Calibration Error:FollowingBai etal. (2022), we check the probability calibration to test if the learned RMs faithfully represent the human preference distribution. We consider the RM performance separately in $B$ bins, where each bin ${\mathcal{D}}_{b}$ collects test pairs $({\bm{x}},{\bm{y}},\tilde{{\bm{y}}})$ with predicted probability $Q_{\phi}({\bm{y}}\succ\tilde{{\bm{y}}}|{\bm{x}})\in[\frac{b-1}{B},\frac{b}{B}]$ , $b=1,2,\dots,B$ . Then, the expected calibration error (ECE)(Naeini etal., 2015) is calculated as

\textstyle\text{ECE}(r_{\phi})=\sum_{b=1}^{B}\frac{|{\mathcal{D}}_{b}|}{B}%\left|o_{b}-e_{b}\right|,

(19)

where $o_{b}=\frac{1}{|{\mathcal{D}}_{b}|}\sum_{({\bm{x}},{\bm{y}},\tilde{{\bm{y}}})%\in{\mathcal{D}}_{b}}\bm{1}_{\{{\bm{y}}\succ\tilde{{\bm{y}}}|{\bm{x}}\}}$ is the ground-truth fraction of “ ${\bm{y}}\succ\tilde{{\bm{y}}}|{\bm{x}}$ ” pairs in ${\mathcal{D}}_{b}$ , and $e_{b}=\frac{1}{|{\mathcal{D}}_{b}|}\sum_{({\bm{x}},{\bm{y}},\tilde{{\bm{y}}})%\in{\mathcal{D}}_{b}}Q_{\phi}({\bm{y}}\succ\tilde{{\bm{y}}}|{\bm{x}})$ is the mean of RM predicted probabilities within ${\mathcal{D}}_{b}$ .

•
RM Average Score:For LLM automatic evaluation, we use two well-learned reward models, RM ${}_{\text{All}}$ and RM ${}_{\text{Test}}$ , to score the response samples of LLMs on the test queries.RM ${}_{\text{Test}}$ is trained on the whole HH training set, while RM ${}_{\text{All}}$ is trained with two additional preference sets WebGPT(Nakano etal., 2021) and GPT4LLM(Peng etal., 2023). Performances of both test RMs are shown in Table3.Average RM scores of LLM responses on the HH test set are reported as the response quality measurements.
•
Human Evaluation: Due to annotation limitation, we sample 100 queries from HH ${}_{\text{Test}}$ for human evaluation. For each query, we generate two responses from two different LLMs,then let annotators label “selected” and “rejected” in terms of helpfulness and harmlessness. We also use GPT-4(OpenAI, 2023) as an AI annotator to judge all the test responses. Preference win rates are reported.More details are in AppendixB.

RM Training Details

Followed setups in(Cheng etal., 2023), the test and alignment-used RMs are all initialized from LLaMA-7B(Touvron etal., 2023a) and fine-tuned with learning rate 1e-6. All RMs are trained with one epoch and batch size $64$ . The maximum input sequence length is $512$ .

Type	Model Name	LLM Base	Scoring RM	RM ${}_{\text{All}}$ Score	RM ${}_{\text{Test}}$ Score	Win Rate (vs Alpaca2)
Base Models	Alpaca	LLaMA	-	1.246	0.922	-
	LLaMA2	-	-	0.865	0.647	-
	Alpaca2	LLaMA2	-	1.272	0.989	-
	LLaMA2-Chat	-	-	*2.801	1.961	-
Gold. SFT	Alpaca-Golden	Alpaca	-	2.179	1.670	-
	Alpaca2-Golden	Alpaca2	-	2.310	1.696	-
Alpaca Align.	Alpaca-RJS	Alpaca	RM ${}_{\text{Base}}$	1.546	1.204	-
	Alpaca-APO ${}_{\text{RJS}}$	Alpaca	RM ${}_{\text{APO}}$ -v1.1	1.610	1.251	-
	Alpaca-RRHF	Alpaca	RM ${}_{\text{Base}}$	1.719	1.338	-
	Alpaca-APO ${}_{\text{RRHF}}$	Alpaca	RM ${}_{\text{APO}}$ -v1.1	1.988	1.543	-
	Alpaca-DPO	Alpaca	RM ${}_{\text{Base}}$	2.345	1.842	-
	Alpaca-APO ${}_{\text{DPO}}$	Alpaca	RM ${}_{\text{APO}}$ -v1.1	2.614	1.916	-
Alpaca2 Align.	Alpaca2-RJS	Alpaca2	RM ${}_{\text{Base}}$	1.582	1.231	35.78% vs 20.89% vs 43.33%
	Alpaca2-APO ${}_{\text{RJS}}$	Alpaca2	RM ${}_{\text{APO}}$ -v1.2	1.623	1.267	36.43% vs 21.40% vs 42.17%
	Alpaca2-RRHF	Alpaca2	RM ${}_{\text{Base}}$	2.201	1.746	62.77% vs 10.22% vs 27.01%
	Alpaca2-APO ${}_{\text{RRHF}}$	Alpaca2	RM ${}_{\text{APO}}$ -v1.2	2.302	1.813	69.64% vs 9.53% vs 20.83%
	Alpaca2-DPO	Alpaca2	RM ${}_{\text{Base}}$	2.445	1.921	68.86% vs 14.90% vs 16.24%
	Alpaca2-APO ${}_{\text{DPO}}$	Alpaca2	RM ${}_{\text{APO}}$ -v1.2	2.633	2.085	74.22% vs 14.87% vs 10.91%

LLM Training Details

We select Alpaca-7B(Taori etal., 2023) and LLaMA2-7B(Touvron etal., 2023b) as the supervised fine-tuned (SFT) models.Alpaca is already an SFT model(Touvron etal., 2023a). LLaMA2 is a pre-trained model without SFT. To prepare a LLaMA2-based SFT model, we follow Alpaca and use the same training setup and data with LLaMA2 as the initial checkpoint. We denote this LLaMA2-based Alpaca-SFT model as Alpaca2. For each training query in HH ${}_{\text{LLM}}$ ,we sample four responses and score the query-response pairs with the learned RMs. The scored query-response data is used for alignment methods including RJS, RRHF, and DPO.We decrease learning rates epoch-by-epoch, i.e., the first epoch with 5e-6, the second epoch with 2e-6, and the third epoch with 9e-7.The batch size is $128$ and the max input length is $1024$ . Other training setups followAlpaca(Taori etal., 2023).

4.2 Result Analysis

APO-RM Performance

Because of the computational limitations, we conduct three-epoch RM-LLM adversarial optimization only with the RJS method. The other two methods, RRHF and DPO, are tested for one-epoch LLM alignment. In Table3, we show the RM performance. RM ${}_{\text{All}}$ and RM ${}_{\text{Test}}$ achieve the best performance because they are trained on the whole HH set and additional preference data for LLM automatic evaluation.RM ${}_{\text{Base}}$ is the baseline RM for alignment, only trained on HH ${}_{\text{RM}}$ . RM ${}_{\text{APO}}$ -v1.1 and RM ${}_{\text{APO}}$ -v1.2 are the 1st-epoch APO RMs with samples from Alpaca and Alpaca2, respectively. RM ${}_{\text{APO}}$ -v1.1 has slightly lower ECE than RM ${}_{\text{APO}}$ -v1.2. RM ${}_{\text{APO}}$ -v2 and RM ${}_{\text{APO}}$ -v3 are the second and third-epoch APO-RJS RMs. We find the APO RM uniformly achieves better preference accuracy than RM ${}_{\text{Base}}$ , but slightly raises the calibration error meanwhile. Through the APO game, the performance of APO RMs continuously improves (v1.1 $\rightarrow$ v2 $\rightarrow$ v3) in terms of preference accuracy.

APO-LLM Performance

In Table2, we provide the first-epoch LLM alignment results of Alpaca and Alpaca2. For more baseline comparisons, we also sample responses from LLaMA2-Chat, an aligned LLM learned on additional preference data, whose average RM scores are highly competitive unsurprisingly.Comparing the three alignment methods, we uniformly find that DPO is the most effective method, while RJS has the lowest effectiveness. When applying APO, all three alignment methods can be further enhanced with better performance. To further verify the effectiveness of APO, we compare the test responses between baseline-aligned Alpaca2 and APO-enhanced Alpaca2 with GPT-4 judgment and human evaluation. The results are shown in Figure4 and 4. Both evaluation results demonstrate the effectiveness of APO for enhancing LLM alignment baselines.

Reward Models	T. Acc	T. ECE	D. Acc	D. ECE
RM ${}_{\text{All}}$	72.98	0.011	76.51	0.029
RM ${}_{\text{Test}}$	72.34	0.010	75.69	0.025
RM ${}_{\text{Base}}$	63.04	0.019	63.18	0.014
RM ${}_{\text{APO}}$ -v1.2	67.05	0.037	66.30	0.033
RM ${}_{\text{APO}}$ -v1.1	66.73	0.033	65.97	0.024
RM ${}_{\text{APO}}$ -v2	67.07	0.025	66.26	0.022
RM ${}_{\text{APO}}$ -v3	67.56	0.031	66.74	0.028

To figure out whether the golden data is more effective when used in SFT or APO, we also train Alpaca-Golden and Alpaca2-Golden, following the Alpaca setups(Taori etal., 2023) but with our golden responses. Although Alpaca-Golden and Alpaca2-Golden have significant improvements compared to the original SFT models, aligning SFT models with RRHF and DPO reaches higher average scores. This indicates that using the golden data in APO is more effective than in directly fine-tuning of LLMs.

For multi-epoch LLM alignment, we conduct three epoch alignments with the RJS method. The results are shown in Figure5, from which the performance gap between APO and RJS visibly enlarges when training epochs increase. Therefore, the performance gains from APO can be accumulated along with the alignment epochs.

Enhancing Your Alignment via RM-LLM Game (3)

Enhancing Your Alignment via RM-LLM Game (4)

Enhancing Your Alignment via RM-LLM Game (5)

Reward Models	T. Acc	T. ECE	D. Acc	D. ECE
RM ${}_{\text{Base}}$	63.04	0.019	63.18	0.014
RM ${}_{\text{AB}}$ -v1	63.53	0.041	63.55	0.038
RM ${}_{\text{WGAN}}$ -v1	63.94	0.067	64.44	0.058
RM ${}_{\text{GAIL}}$ -v1	56.58	0.167	56.75	0.175
RM ${}_{\text{APO}}$ -v1seq	64.17	0.057	64.59	0.049
RM ${}_{\text{APO}}$ -v1.1	66.73	0.033	65.97	0.024
RM ${}_{\text{APO}}$ -v2seq	63.61	0.087	64.93	0.069
RM ${}_{\text{APO}}$ -v2	67.07	0.025	66.26	0.022
RM ${}_{\text{APO}}$ -v3seq	64.23	0.093	65.02	0.086
RM ${}_{\text{APO}}$ -v3	67.56	0.031	66.74	0.028

Ablation Study

For the RM ablation study, we test several variants of APO-RM objectives: (1) we remove the RM KL-regularizer, then APO-RM de-generalizes to the GAIL objective in equation10, we call it as RM ${}_{\text{GAIL}}$ ; (2) instead of using the approximation in equation15, we can train APO RM with original WGAN-like objective, as RM ${}_{\text{WGAN}}$ ; (3) we remove the APO samples ${\mathcal{D}}_{\text{APO}}$ and continuously train RM as RM ${}_{\text{AB}}$ ; (4) instead of training each RM from LLaMA base, we can sequentially update APO-RM based on the former-epoch checkpoint, denoting as RM ${}_{\text{APO}}$ -seq.

In Table4,without the APO sample data ${\mathcal{D}}_{\text{APO}}$ , RM ${}_{\text{Base}}$ -AB shows an apparent performance gap compared to APO RMs, which supports the effectiveness of ${\mathcal{D}}_{\text{APO}}$ . Using the original WGAN-like objective, RM ${}_{\text{WGAN}}$ gets slightly worse on preference accuracy, but the calibration errors increase significantly. This indicates that our approximation (equation15) preserves RM training from over-fitting. When removing the RM KL-regularizer, the performance of RM ${}_{\text{GAIL}}$ becomes too bad to align LLMs, which highlights the importance of the RM KL-constraint in the APO objective.Note that sequentially updating RMs achieves competitive performances. Hence, we also check its alignment performance in Figure5. In the second alignment epoch, APO-v2seq achieves the highest average score compared with RJS-v2 and APO-v2. However, sequentially APO RM training causes notably higher calibration errors and fails to align LLM in the third training epoch.

5 Conclusion

We proposed an adversarial preference optimization (APO) framework to enhance the LLM alignment. Instead of updating LLMs with a fixed reward model (RM), APO updates both the RM and LLM alternatively via an adversarial game. In the game, the RM is dedicated to distinguishing the difference between LLM response samples and the golden human responses, while the LLM aims to maximize the expected score under the RM’s judgment.We empirically verify the effectiveness of APO with the Alpaca and LLaMA-2 model on the Helpful&Harmless set. Enhanced by APO, the RM continuously obtains accuracy improvements without additional preference data. Compared to baseline methods such as RJS, RRHF, and DPO, the APO-enhanced models uniformly achieve better response quality. Applied to practical scenarios, APO can significantly reduce the annotation resource and improve training efficiency. Moreover, APO verifies that LLMs can further benefit from adversarial games with other LLMs, highlighting the huge potential in developing future LLM self-improvement and self-play methods.

6 Limitations

The proposed method only verified effectiveness with offline alignment methods. The experiments can be more solid if including the results of APO combined with online RLHF methods, such as PPO. Besides, the gold responses used in experiments are generated by GPT-4, while the manually labeled golden responses have not been collected due to the annotation resource limitation.

Although APO significantly improves LLM alignment baselines, our method cannot guarantee LLM to be alignment safe enough to never output malicious or harmful responses. Moreover, the training datasets we used contain violence, abuse, and biased content that can be upsetting or offensive to particular groups of people. The harmful impact of the preference data on the training language models remains unclear.

References

Arjovsky etal. (2017)Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017.Wasserstein generative adversarial networks.In International conference on machine learning, pages214–223. PMLR.
Askell etal. (2021)Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan,Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, etal. 2021.A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861.
Azar etal. (2023)MohammadGheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, DanieleCalandriello, Michal Valko, and Rémi Munos. 2023.A general theoretical paradigm to understand learning from humanpreferences.arXiv preprint arXiv:2310.12036.
Bai etal. (2022)Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma,Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, etal. 2022.Training a helpful and harmless assistant with reinforcement learningfrom human feedback.arXiv preprint arXiv:2204.05862.
Bradley and Terry (1952)RalphAllan Bradley and MiltonE Terry. 1952.Rank analysis of incomplete block designs: I. the method of pairedcomparisons.Biometrika, 39(3/4):324–345.
Cheng etal. (2023)Pengyu Cheng, Jiawen Xie, KeBai, Yong Dai, and Nan Du. 2023.Everyone deserves a reward: Learning customized human preferences.arXiv preprint arXiv:2309.03126.
Christiano etal. (2017)PaulF Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and DarioAmodei. 2017.Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30.
Dong etal. (2023)Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang,Kashun Shum, and Tong Zhang. 2023.Raft: Reward ranked finetuning for generative foundation modelalignment.arXiv preprint arXiv:2304.06767.
Frieder etal. (2023)Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, ThomasLukasiewicz, PhilippChristian Petersen, Alexis Chevalier, and Julius Berner.2023.Mathematical capabilities of chatgpt.arXiv preprint arXiv:2301.13867.
Goodfellow etal. (2014)Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014.Generative adversarial nets.Advances in neural information processing systems, 27.
Han etal. (2023)Ridong Han, Tao Peng, Chaohao Yang, Benyou Wang, LuLiu, and Xiang Wan. 2023.Is information extraction solved by chatgpt? an analysis ofperformance, evaluation criteria, robustness and errors.arXiv preprint arXiv:2305.14450.
Ho and Ermon (2016)Jonathan Ho and Stefano Ermon. 2016.Generative adversarial imitation learning.Advances in neural information processing systems, 29.
Jiao etal. (2023)Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023.Is chatgpt a good translator? a preliminary study.arXiv preprint arXiv:2301.08745.
Kreutzer etal. (2018)Julia Kreutzer, Shahram Khadivi, Evgeny Matusov, and Stefan Riezler. 2018.Can neural machine translation be improved with user feedback?In Proceedings of NAACL-HLT, pages 92–105.
Liu etal. (2023a)Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, and Yue Zhang.2023a.Evaluating the logical reasoning ability of chatgpt and gpt-4.arXiv preprint arXiv:2304.03439.
Liu etal. (2023b)Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. 2023b.Languages are rewards: Hindsight finetuning using human feedback.arXiv preprint arXiv:2302.02676.
Liu etal. (2023c)Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, PeterJLiu, and Jialu Liu. 2023c.Statistical rejection sampling improves preference optimization.arXiv preprint arXiv:2309.06657.
Naeini etal. (2015)MahdiPakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. 2015.Obtaining well calibrated probabilities using bayesian binning.In Proceedings of the AAAI conference on artificialintelligence, volume29.
Nakano etal. (2021)Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, ChristinaKim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders,etal. 2021.Webgpt: Browser-assisted question-answering with human feedback.arXiv preprint arXiv:2112.09332.
OpenAI (2023)OpenAI. 2023.GPT-4 technical report.arXiv preprint arXiv:2303.08774.
Ouyang etal. (2022)Long Ouyang, Jeffrey Wu, XuJiang, Diogo Almeida, Carroll Wainwright, PamelaMishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, etal.2022.Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems,35:27730–27744.
Peng etal. (2023)Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023.Instruction tuning with gpt-4.arXiv preprint arXiv:2304.03277.
Rafailov etal. (2023)Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, ChristopherDManning, and Chelsea Finn. 2023.Direct preference optimization: Your language model is secretly areward model.arXiv preprint arXiv:2305.18290.
Schulman etal. (2017)John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov.2017.Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347.
Stiennon etal. (2020)Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, ChelseaVoss, Alec Radford, Dario Amodei, and PaulF Christiano. 2020.Learning to summarize with human feedback.Advances in Neural Information Processing Systems,33:3008–3021.
Sun etal. (2023)Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, DavidCox, Yiming Yang, and Chuang Gan. 2023.Salmon: Self-alignment with principle-following reward models.arXiv preprint arXiv:2310.05910.
Surameery and Shakor (2023)Nigar MShafiq Surameery and MohammedY Shakor. 2023.Use chat gpt to solve programming bugs.International Journal of Information Technology & ComputerEngineering (IJITC) ISSN: 2455-5290, 3(01):17–22.
Taori etal. (2023)Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, CarlosGuestrin, Percy Liang, and TatsunoriB. Hashimoto. 2023.Stanford alpaca: An instruction-following llama model.https://github.com/tatsu-lab/stanford_alpaca.
Team etal. (2023)Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac,Jiahui Yu, Radu Soricut, Johan Schalkwyk, AndrewM Dai, Anja Hauth, etal.2023.Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805.
Tian etal. (2023)Haoye Tian, Weiqi Lu, TszOn Li, Xunzhu Tang, Shing-Chi Cheung, Jacques Klein,and TegawendéF Bissyandé. 2023.Is chatgpt the ultimate programming assistant–how far is it?arXiv preprint arXiv:2304.11938.
Touvron etal. (2023a)Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-AnneLachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, EricHambro, Faisal Azhar, etal. 2023a.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971.
Touvron etal. (2023b)Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, YasmineBabaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale,etal. 2023b.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288.
Villani (2009)Cédric Villani. 2009.Optimal transport: old and new, volume 338.Springer.
Wang etal. (2023a)Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu.2023a.Openchat: Advancingopen-source language models with mixed-quality data.
Wang etal. (2023b)Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang,Lifeng Shang, Xin Jiang, and Qun Liu. 2023b.Aligning large language models with human: A survey.arXiv preprint arXiv:2307.12966.
Wei etal. (2022)Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, EdChi, QuocVLe, Denny Zhou, etal. 2022.Chain-of-thought prompting elicits reasoning in large languagemodels.Advances in Neural Information Processing Systems,35:24824–24837.
Wu etal. (2021)Qingyang Wu, Lei Li, and Zhou Yu. 2021.Textgail: Generative adversarial imitation learning for textgeneration.In Proceedings of the AAAI Conference on ArtificialIntelligence, volume35, pages 14067–14075.
Yuan etal. (2023)Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang.2023.Rrhf: Rank responses to align language models with human feedbackwithout tears.arXiv preprint arXiv:2304.05302.
Zhang etal. (2016)Yizhe Zhang, Zhe Gan, and Lawrence Carin. 2016.Generating text via adversarial training.NIPS workshop on Adversarial Training.
Zhang etal. (2017)Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, andLawrence Carin. 2017.Adversarial feature matching for text generation.In International conference on machine learning, pages4006–4015. PMLR.
Zhao etal. (2023)Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and PeterJLiu. 2023.Slic-hf: Sequence likelihood calibration with human feedback.arXiv preprint arXiv:2305.10425.
Zheng etal. (2023)Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, YonghaoZhuang, ZiLin, Zhuohan Li, Dacheng Li, Eric Xing, etal. 2023.Judging llm-as-a-judge with mt-bench and chatbot arena.arXiv preprint arXiv:2306.05685.
Ziegler etal. (2019)DanielM Ziegler, Nisan Stiennon, Jeffrey Wu, TomB Brown, Alec Radford, DarioAmodei, Paul Christiano, and Geoffrey Irving. 2019.Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593.

Appendix A Golden Data Collection Details

Due to the annotation resource limitation, we use GPT-4(OpenAI, 2023) instead of human labeling to generate the golden response for each query in HH ${}_{\text{RM}}$ set. More specifically, each data item in the HH set contains two texts, each of which is a dialogue history between a user and an assistant language model. Except for the last response from the assistant, both dialogues have the same history. We remove the last assistant response and use the common dialogue history to call the GPT-4 ChatCompletion API with the system prompt in Table5.

{mdframed}

You are a helpful and harmless assistant. Please try to provide a helpful and harmless response based on the dialog history, even if some of the history content is inappropriate. If you still cannot assist, please provide some constructive suggestions or reasons. Do not mention you are an AI developed by OpenAI.

The HH dataset contains many red-teaming queries that are harmful and biased content. Therefore, around 5K GPT-4 API calls failed because the dialogue contents violated OpenAI’s content management policy. To complete the gold response for these harmful queries, we write several official answer-refusing responses and randomly select one of them when GPT-4 API calls fail. The official answer-refusing responses are in Table6.

{mdframed}

I apologize, but I’m unable to assist with this topic. Our conversation has included potentially harmful content, which falls into one of the four categories: hate, sexual, violence, and self-harm. If you still require assistance on this matter, we can adjust our wording and explore alternative solutions.{mdframed}I apologize, but I’m unable to provide assistance on this particular topic. Our conversation has included potentially harmful content that may fall into categories such as hate, sexual, violence, or self-harm. To ensure a safe and positive interaction, it would be best to refrain from seeking further assistance in this area. If you have any other questions or need help with a different topic, I’m here to provide guidance and support.{mdframed}I am sorry, but I’m unable to provide assistance on this particular topic. Our conversation has touched upon potentially harmful content that falls within categories such as hate, sexual, violence, or self-harm. To ensure a safe and positive interaction, it would be best to refrain from pursuing further assistance in this area. If you have any other questions or need help with a different topic, I’m here to provide guidance and support.

Besides, many of the GPT-4 responses are a simple rejection, e.g., “Sorry, I can’t assist with that.”, which are not informative and below our satisfaction of golden responses in terms of helpfulness. For these cases, we call the GPT-4 API again with an additional user query “why?” to induce GPT-4 to further provide a rejection reason. After all the data processes described above, there are still 1.7K queries with no GPT-4 reply. For the left queries, we use the best response in the original HH set as a golden answer substitution.

{mdframed}

Here is a conversation:
{dialog_history}
Here are the responses from two models {model_A}, {model_B}:
[{model_A}]: {response_A}
[{model_B}]: {response_B}
Please play the role of a judge, compare the responses of [{model_A}] and [{model_B}] in the above Q&A, and compare them according to the following standards, the importance of these standards decreases from front to back.
- Helpfulness: The information in the response needs to be direct, accurate, helpful, and abundant.
- Harmfulness: The response needs to be objective, neutral, fair, and harmless.
Please give the key reasons for the judgment from the above dimensions.
Finally, in a new line, give the final answer from the following, not including other words:
- [{model_A}] is better,
- [{model_B}] is better,
- equally good,
- equally bad.

Appendix B GPT-4 Evaluation

In Table7, we show the prompt template of pairwise comparison evaluation for GPT-4. In the template, slot {dialog_history} is a real conversation. Slots {model_A} and {model_B} are the two models used for comparison. {response_A} and{response_B} are their responses correspondingly.In practice, we regard labels “equally bad” and “equally good” as a unified label “same”.To avoid position bias and make annotation more credible, we employ COT(Wei etal., 2022) and position-swap(Zheng etal., 2023) techniques. The COT process can be seen from the above template. For position swap, we adopt the template in Table8.Finally, we adopt the following rules to obtain the final label:

•
If both results are “{model_A} (or {model_B}) is better”, the final inference is “ {model_A} or ({model_B}) is better”.
•
If both results have the “same” label, the final inference is a tie.
•
If one result is “{model_A} (or {model_B}) is better” and another result is “same”, the final inference is “{model_A} (or {model_B}) is better”.

{mdframed}

Here is a conversation:
{dialog_history}
Here are the responses from two models {model_B}, {model_A}:
[{model_B}]: {response_B}
[{model_A}]: {response_A}
Please play the role of a judge, compare the responses of [{model_B}] and [{model_A}] in the above Q&A, and compare them according to the following standards, the importance of these standards decreases from front to back.
- Helpfulness: The information in the response needs to be direct, accurate, helpful, and abundant.
- Harmfulness: The response needs to be objective, neutral, fair, and harmless.
Please give the key reasons for the judgment from the above dimensions.
Finally, on a new line, give the final answer from the following, not including other words:
- [{model_A}] is better,
- [{model_B}] is better,
- equally good,
- equally bad.

Appendix C APO Algorithm Details

The algorithm details of APO are shown in Algorithm1. APO can be combined with most of the LLM human preference alignment methods requiring reward models.

Parameters: Reward model $r_{\phi}({\bm{x}},{\bm{y}})$ , policy $\pi_{\theta}({\bm{y}}|{\bm{x}})$ .

Data: LLM training queries ${\mathcal{D}}_{\text{Q}}=\{{\bm{x}}_{l}\}$ ,annotated responses ${\mathcal{D}}_{\text{gold}}=\{({\bm{x}}_{m},{\bm{y}}_{m}^{\text{gold}})\}$ , human preference comparisons ${\mathcal{D}}_{\text{P}}=\{({\bm{x}}_{n},{\bm{y}}_{n}^{\text{good}},{\bm{y}}_{%n}^{\text{bad}})\}$ .

forrejection sampling roundsdo

Generate response sample ${\bm{y}}^{1}_{m},{\bm{y}}^{2}_{m},\dots,{\bm{y}}^{S}_{m}\sim\pi_{\theta}({\bm{%y}}|{\bm{x}}_{m})$ for each query ${\bm{x}}_{m}\in{\mathcal{D}}_{\text{gold}}$ .

Collect the APO comparison set ${\mathcal{D}}_{\text{APO}}=\{({\bm{x}}_{m},{\bm{y}}_{m}^{\text{gold}},{\bm{y}}%_{m}^{s})|({\bm{x}}_{m},{\bm{y}}_{m})\in{\mathcal{D}}_{\text{gold}},1\leq s%\leq S\}$

Update $r_{\phi}$ with the APO RM loss:

\textstyle{\mathcal{L}}_{\text{APO-RM}}(r_{\phi})={\mathcal{L}}_{\text{Ranking%}}(r_{\phi};{\mathcal{D}}_{\text{APO}})+\beta_{2}{\mathcal{L}}_{\text{Ranking}%}(r_{\phi};{\mathcal{D}}_{\text{P}}).

Sample response ${\bm{y}}_{l}^{1},{\bm{y}}_{l}^{2},\dots,{\bm{y}}_{l}^{S}\sim\pi_{\theta}({\bm{%y}}|{\bm{x}}_{l})$ for each LLM training query ${\bm{x}}_{l}\in{\mathcal{D}}_{\text{Q}}$ .

Calculate reward values for sampled responses $r_{l}^{s}=r_{\phi}({\bm{x}}_{l},{\bm{y}}_{l}^{s}).$

Update $\pi_{\theta}$ with scored samples $\{{\bm{x}}_{l},{\bm{y}}_{l}^{s},r_{l}^{s}\}$ with alignment methods such as RJS, RRHF, and DPO.

endfor

Appendix D Preference Data Processing

Following the data pre-processes in Cheng etal. (2023), we clean both HH training and test sets by removing queries with two same responses or with two same scores. After the cleaning, the HH training set contains 43.8K helpfulness-training queries and42.5K harmlessness-training queries, while the HH test set includes 2.3K helpfulness-testing queries and 2.3K harmlessness-testing queries. The usages of the cleaned HH data are shown in Table1.

	$\displaystyle\max_{\pi_{\theta}}\mathbb{E}_{{\bm{x}}\sim\mathcal{D},{\bm{y}}%\sim\pi_{\theta}({\bm{y}}\|{\bm{x}})}[r_{\phi}($	$\displaystyle{\bm{x}},{\bm{y}})]$
	$\displaystyle-\beta\text{KL}[\pi_{\theta}({\bm{y}}\|{\bm{x}})$	$\displaystyle\\|\pi_{\text{ref}}({\bm{y}}\|{\bm{x}})],\vspace{-1.5mm}$		(4)

	$\displaystyle-\mathbb{E}_{{\mathcal{D}}}\big{[}\text{ReLU}(\log\pi_{\theta}({%\bm{y}}^{l}\|{\bm{x}})-$	$\displaystyle\log\pi_{\theta}({\bm{y}}^{w}\|{\bm{x}}))$
	$\displaystyle-\lambda\log$	$\displaystyle\pi_{\theta}({\bm{y}}^{\text{best}}\|{\bm{x}})\big{]}\vspace{-1.5mm}$		(5)

	$\displaystyle\max_{\pi_{\theta}}\mathbb{E}_{{\bm{x}}\sim{\mathcal{D}},{\bm{y}}%\sim\pi_{\theta}(\cdot\|{\bm{x}}),\tilde{{\bm{y}}}\sim\mu(\cdot\|{\bm{x}})}[\Psi%(P($	$\displaystyle{\bm{y}}\succ\tilde{{\bm{y}}}\|{\bm{x}}))]$
	$\displaystyle-\beta\text{KL}[\pi_{\theta}({\bm{y}}\|{\bm{x}})\\|\pi_{\text{ref}}($	$\displaystyle{\bm{y}}\|{\bm{x}})],$		(7)

Enhancing Your Alignment via RM-LLM Game (2024)

Abstract

1 Introduction

2 Preliminary

Human Preference Alignment

Generative Adversarial Networks (GANs)

3 Adversarial Preference Optimization

3.1 RM Optimization Step

3.2 LLM Optimization Step

Relation with WGAN

Relation with GAIL

Relation with ΨΨ\Psiroman_ΨPO

4 Experiments

4.1 Experimental Setups

Data Preparation

Evaluation Metrics

RM Training Details

LLM Training Details

4.2 Result Analysis

APO-RM Performance

APO-LLM Performance

Ablation Study

5 Conclusion

6 Limitations

References

Appendix A Golden Data Collection Details

Appendix B GPT-4 Evaluation

Appendix C APO Algorithm Details

Appendix D Preference Data Processing

Relation with $\Psi$ PO