## ArXiv / Github

## Abstract

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper, we leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be *optimized exactly* with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which we call *Direct Preference Optimization* (DPO), is stable, performant, and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds RLHF's ability to control sentiment of generations and improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

- Existing RLHF is complex.

- Using the mapping between reward functions and optimal policies, the constrained reward-maximization problem can be recast as a single, one-shot classification problem (the resulting loss is written out after this list).

- The new method is called DPO: Direct Preference Optimization.
- No reward model needs to be trained.
    - PPO-based RLHF trains a reward model.
    - Methods such as RRHF can also involve training a reward model.
- No sampling from the LM is needed during fine-tuning.
    - Both PPO and RRHF sample outputs for each instruction.
    - Though in RRHF it is also acceptable to use only generations from models other than the one being trained.
- No hyperparameter tuning is required.
    - RLHF is notoriously sensitive to hyperparameters such as the learning rate, so not being sensitive to them is a big win for training.
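For reference, the objective DPO optimizes directly is the following binary-classification-style loss over preference pairs (from the paper), where $y_w$ is the preferred response, $y_l$ the dispreferred one, $\pi_{\mathrm{ref}}$ the SFT reference policy, and $\beta$ controls how strongly the policy is kept close to the reference:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$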

## DPO?

In RLHF with PPO (the left-hand side of the figure above), the full procedure is: first fit a reward model to the human preference data via maximum likelihood (MLE), and then use that reward model to further train the LM with reinforcement learning.
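Written out, those two steps are: fit a reward model $r_\phi$ by maximum likelihood under a Bradley–Terry preference model, then optimize the LM against it with a KL penalty toward the reference (SFT) model:

$$
\mathcal{L}_R(r_\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]
$$

$$
\max_{\pi_\theta}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta(y \mid x)\,\big\|\,\pi_{\mathrm{ref}}(y \mid x)\right]
$$

DPO removes the explicit reward model and the RL loop by plugging the closed-form optimal policy of this KL-constrained objective back into the preference loss.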

The DPO training pipeline consists of two main stages:

- Stage 1: train the base LM with SFT → the reference policy $\pi^{\mathrm{SFT}}$ ($= \pi_{\mathrm{ref}}$)

- Stage 2: train on the preference data on top of the SFT model → the final policy $\pi_\theta$

Stage 1 above can be done in exactly the same way as ordinary Alpaca-style SFT.
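As a rough sketch of stage 1, a plain causal-LM SFT step with Hugging Face `transformers` could look like the following; the model name, data, and Alpaca-style prompt format here are placeholder assumptions, not the exact setup from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base model; any causal LM works the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy instruction/response pair in an Alpaca-like prompt format (illustrative only).
prompt = "### Instruction:\nSummarize: The cat sat on the mat.\n\n### Response:\n"
response = "A cat sat on a mat."

# Standard causal-LM SFT step: next-token prediction over prompt + response.
batch = tokenizer(prompt + response, return_tensors="pt")
labels = batch["input_ids"].clone()  # optionally mask prompt tokens with -100

outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```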

### DPO Loss

```python
from typing import Tuple

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.FloatTensor,
             policy_rejected_logps: torch.FloatTensor,
             reference_chosen_logps: torch.FloatTensor,
             reference_rejected_logps: torch.FloatTensor,
             beta: float,
             reference_free: bool = False) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
    """Compute the DPO loss for a batch of policy and reference model log probabilities.

    Args:
        policy_chosen_logps: Log probabilities of the policy model for the chosen responses. Shape: (batch_size,)
        policy_rejected_logps: Log probabilities of the policy model for the rejected responses. Shape: (batch_size,)
        reference_chosen_logps: Log probabilities of the reference model for the chosen responses. Shape: (batch_size,)
        reference_rejected_logps: Log probabilities of the reference model for the rejected responses. Shape: (batch_size,)
        beta: Temperature parameter for the DPO loss, typically something in the range of 0.1 to 0.5.
            We ignore the reference model as beta -> 0.
        reference_free: If True, we ignore the _provided_ reference model and implicitly use a reference model
            that assigns equal probability to all responses.

    Returns:
        A tuple of three tensors: (losses, chosen_rewards, rejected_rewards).
        The losses tensor contains the DPO loss for each example in the batch.
        The chosen_rewards and rejected_rewards tensors contain the rewards for the chosen and rejected responses, respectively.
    """
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps

    if reference_free:
        ref_logratios = 0

    logits = pi_logratios - ref_logratios

    losses = -F.logsigmoid(beta * logits)
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()

    return losses, chosen_rewards, rejected_rewards
```

The DPO loss is computed from four log-probability values: `policy_chosen_logps`, `policy_rejected_logps`, `reference_chosen_logps`, and `reference_rejected_logps`. The trainer obtains them and calls `dpo_loss` as follows:

```python
def get_batch_metrics(self, batch: Dict[str, Union[List, torch.LongTensor]], loss_config: DictConfig, train=True):
    """Compute the SFT or DPO loss and other metrics for the given batch of inputs."""
    metrics = {}
    train_test = 'train' if train else 'eval'

    if loss_config.name == 'dpo':
        policy_chosen_logps, policy_rejected_logps = self.concatenated_forward(self.policy, batch)
        with torch.no_grad():
            reference_chosen_logps, reference_rejected_logps = self.concatenated_forward(self.reference_model, batch)

        losses, chosen_rewards, rejected_rewards = dpo_loss(
            policy_chosen_logps, policy_rejected_logps, reference_chosen_logps, reference_rejected_logps,
            beta=loss_config.beta, reference_free=loss_config.reference_free)
        reward_accuracies = (chosen_rewards > rejected_rewards).float()

        chosen_rewards = all_gather_if_needed(chosen_rewards, self.rank, self.world_size)
        rejected_rewards = all_gather_if_needed(rejected_rewards, self.rank, self.world_size)
        reward_accuracies = all_gather_if_needed(reward_accuracies, self.rank, self.world_size)

        metrics[f'rewards_{train_test}/chosen'] = chosen_rewards.cpu().numpy().tolist()
        metrics[f'rewards_{train_test}/rejected'] = rejected_rewards.cpu().numpy().tolist()
        metrics[f'rewards_{train_test}/accuracies'] = reward_accuracies.cpu().numpy().tolist()
        metrics[f'rewards_{train_test}/margins'] = (chosen_rewards - rejected_rewards).cpu().numpy().tolist()

        policy_rejected_logps = all_gather_if_needed(policy_rejected_logps.detach(), self.rank, self.world_size)
        metrics[f'logps_{train_test}/rejected'] = policy_rejected_logps.cpu().numpy().tolist()
```
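Each of the four values is a sequence-level log probability: the sum of per-token log probs of a response given its prompt under the policy or reference model (in the official repo this is what `concatenated_forward` computes via a helper such as `_get_batch_logps`). A minimal sketch of that computation, assuming prompt and padding positions in `labels` are marked with -100:

```python
import torch
import torch.nn.functional as F


def sequence_logps(logits: torch.FloatTensor, labels: torch.LongTensor) -> torch.FloatTensor:
    """Sum of per-token log probs of `labels` under `logits`, ignoring positions labeled -100.

    logits: (batch_size, seq_len, vocab_size) from a causal LM.
    labels: (batch_size, seq_len); prompt/padding positions set to -100.
    Returns a (batch_size,) tensor of sequence log probabilities.
    """
    # Shift so that the token at position t is predicted from the logits at position t-1.
    labels = labels[:, 1:].clone()
    logits = logits[:, :-1, :]

    mask = labels != -100
    labels[~mask] = 0  # dummy index so gather() is valid; these positions are masked out below

    per_token_logps = torch.gather(
        F.log_softmax(logits, dim=-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)

    return (per_token_logps * mask).sum(-1)
```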