What this is
This guide walks through DPO (Direct Preference Optimization) training using the cookbook. DPO learns from preference pairs (chosen vs. rejected responses) without a separate reward model.
How DPO differs from GRPO
| | DPO | GRPO |
|---|---|---|
| Trainer jobs | 2 (policy + frozen reference) | 2 (policy + frozen reference) |
| Data | Preference pairs (chosen/rejected) | Prompts + reward function |
| Reference logprobs | Cached once at initialization | Computed every step |
| Loss | -log(sigmoid(beta * margin)) | Advantage-weighted policy gradient + KL |
Architecture
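A rough sketch of the two-trainer layout this guide describes (two trainer jobs, with reference logprobs cached once at initialization); the diagram is illustrative, not generated by the recipe:

```
preference pairs (chosen / rejected)
                |
                v
       +------------------+   reference logprobs, cached once at init   +----------------------------+
       |  policy trainer  | <------------------------------------------ |  frozen reference trainer  |
       |   (trainable)    |                                             |       (forward-only)       |
       +------------------+                                             +----------------------------+
                |
                v
       DPO loss: -log(sigmoid(beta * margin))
```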
Using the recipe
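The exact recipe entry point depends on your cookbook version. As a hedged sketch of the knobs this guide refers to, a DPO run specifies both trainer shapes plus the DPO-specific defaults; only the field names mentioned elsewhere in this guide are taken from the cookbook, the schema and values below are illustrative:

```python
# Hedged sketch of a DPO recipe configuration. The dict layout is an assumption;
# the field names are the ones referenced in the Operational guidance section.
dpo_config = {
    "infra": {
        "training_shape_id": "<policy-shape>",         # shape for the policy trainer
        "ref_training_shape_id": "<reference-shape>",  # shape for the frozen reference trainer
    },
    "lora_rank": 0,             # > 0 enables the shared-reference LoRA shortcut
    "beta": 0.1,                # temperature on the preference margin in the DPO loss
    "weight_sync_interval": 0,  # DPO default: no weight sync
    "dcp_save_interval": 0,     # DCP checkpoints disabled by default
}
```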
Dataset format
DPO expects preference pairs. Supported formats:
Format 1 — chosen/rejected messages:
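A hedged sketch of a Format 1 record, written as a Python literal; the exact field names expected by the cookbook's loader may differ:

```python
# Illustrative preference-pair record: an identical prompt prefix, then one
# chosen and one rejected assistant message (field names are assumptions).
example_record = {
    "prompt": [
        {"role": "user", "content": "Summarize the following paragraph ..."},
    ],
    "chosen": [
        {"role": "assistant", "content": "A faithful, concise summary ..."},
    ],
    "rejected": [
        {"role": "assistant", "content": "An off-topic or padded response ..."},
    ],
}
```
Step-by-step (API-level)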
Provision trainers with setup_infra
DPO needs both a policy trainer and a forward-only reference trainer.
training.utils.rl.setup_infra handles shape resolution, parallel
provisioning of both trainers, and the LoRA shared-reference shortcut
(when lora_rank > 0, no separate reference trainer is needed — the
reference comes from the policy session’s base handle).
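A sketch of the provisioning call; only the module path and its responsibilities come from the paragraph above, while the argument names, return attributes, and placeholder values are assumptions:

```python
# Hedged sketch: provision the policy trainer and, unless the LoRA shortcut
# applies, a forward-only reference trainer. Argument names are assumptions.
from training.utils.rl import setup_infra

rl_infra = setup_infra(
    training_shape_id="<policy-shape>",         # infra.training_shape_id
    ref_training_shape_id="<reference-shape>",  # infra.ref_training_shape_id
    lora_rank=0,  # > 0: no separate reference trainer; reference = policy session's base handle
)
policy_trainer = rl_infra.policy_trainer        # trainable policy
reference_trainer = rl_infra.reference_trainer  # frozen, forward-only
```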
Cache reference logprobs
Reference logprobs are computed once at initialization and reused throughout training; the loss sketch in the next section assumes such a cache.
DPO loss function
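A minimal sketch of both steps: building the reference cache once at initialization, then computing the pairwise loss. The `sequence_logprob` method, record fields, and cache layout are illustrative assumptions; only the loss form, -log(sigmoid(beta * margin)), comes from this guide:

```python
import torch
import torch.nn.functional as F

def cache_reference_logprobs(reference_trainer, dataset):
    """Run the frozen reference once over the dataset and keep per-pair logprobs."""
    cache = {}
    for pair in dataset:
        cache[pair.id] = (
            reference_trainer.sequence_logprob(pair.chosen),
            reference_trainer.sequence_logprob(pair.rejected),
        )
    return cache

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # margin: how much more the policy prefers chosen over rejected,
    # measured relative to the frozen reference.
    margin = (policy_chosen_logp - policy_rejected_logp) - (ref_chosen_logp - ref_rejected_logp)
    # -log(sigmoid(beta * margin)), averaged over the batch.
    return -F.logsigmoid(beta * margin).mean()
```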
Training loop
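A hedged outline of the loop, reusing `cache_reference_logprobs` and `dpo_loss` from the sketch above; the dataloader and trainer method names (`pair_loader`, `sequence_logprob`, `backward_and_step`) are assumptions, not the cookbook's API:

```python
import torch

# Illustrative loop only: in the real recipe this is driven by the cookbook's trainer jobs.
ref_cache = cache_reference_logprobs(reference_trainer, train_pairs)  # once, at initialization

for step, batch in enumerate(pair_loader(train_pairs, batch_size=64)):
    # Trainable policy: per-sequence logprobs for both halves of each pair.
    policy_chosen_logp = policy_trainer.sequence_logprob(batch.chosen)
    policy_rejected_logp = policy_trainer.sequence_logprob(batch.rejected)

    # Frozen reference: looked up from the cache built at initialization.
    ref_chosen_logp = torch.stack([ref_cache[i][0] for i in batch.ids])
    ref_rejected_logp = torch.stack([ref_cache[i][1] for i in batch.ids])

    loss = dpo_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp, beta=0.1)
    policy_trainer.backward_and_step(loss)

    # Diagnostic worth logging: the mean margin should increase as the policy
    # learns the preferences (see Operational guidance below).
```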
Operational guidance
- Set `infra.training_shape_id` and `infra.ref_training_shape_id` — DPO launches both a policy trainer and a reference trainer.
- DPO uses 2 RLOR jobs — policy trainer + frozen reference trainer.
- DPO defaults `weight_sync_interval=0` (no weight sync by default), unlike GRPO.
- Keep a versioned reference cache tied to tokenizer + base model revision. If the base model changes, recompute reference logprobs.
- Monitor margin statistics: increasing margins indicate the policy is learning preferences.
- DCP checkpoints are disabled by default (`dcp_save_interval=0`). If you need to resume training from a checkpoint, explicitly set `dcp_save_interval` to a positive value in your `WeightSyncConfig` (see the sketch after this list).
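For example, to keep DPO's no-sync default while enabling resumable checkpoints; only the two field names above are taken from this guide, the import path and any other `WeightSyncConfig` fields are assumed to stay at their defaults:

```python
# Hedged sketch: a positive dcp_save_interval turns DCP checkpointing back on.
weight_sync = WeightSyncConfig(
    weight_sync_interval=0,  # DPO default: no weight sync
    dcp_save_interval=500,   # > 0 enables DCP checkpoints so training can resume
)
```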
Common pitfalls
- Mismatched formatting between chosen/rejected sequences corrupts preference signals — ensure identical prompt prefixes (see the check sketched after this list).
- Stale reference cache: If you warm-start from a different model, cached reference logprobs are invalid.
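A minimal guard against the first pitfall, assuming records shaped like the Format 1 sketch above and a Hugging Face-style tokenizer; the helper name and record fields are illustrative:

```python
def check_prompt_prefix(record, tokenizer):
    """Raise if chosen and rejected do not share an identical tokenized prompt prefix."""
    prompt_ids = tokenizer.apply_chat_template(record["prompt"], tokenize=True)
    chosen_ids = tokenizer.apply_chat_template(record["prompt"] + record["chosen"], tokenize=True)
    rejected_ids = tokenizer.apply_chat_template(record["prompt"] + record["rejected"], tokenize=True)
    n = len(prompt_ids)
    if chosen_ids[:n] != rejected_ids[:n]:
        raise ValueError("chosen/rejected diverge inside the prompt prefix; preference signal will be corrupted")
```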
Related guides
- Cookbook RL (GRPO) — reinforcement learning recipes
- Cookbook Reference — all config classes
- Loss Functions — API-level DPO loss details