What this is
This guide walks through DPO (Direct Preference Optimization) training using the Fireworks Training SDK. DPO learns from preference pairs (chosen vs. rejected responses) without a separate reward model.

How DPO differs from GRPO
| | DPO | GRPO |
|---|---|---|
| Trainer jobs | 2 (policy + frozen reference) | 2 (policy + frozen reference) |
| Data | Preference pairs (chosen/rejected) | Prompts + reward function |
| Reference logprobs | Cached once at initialization from frozen reference | Computed every step via frozen reference trainer |
| Loss | -log(sigmoid(β × margin)) | Advantage-weighted policy gradient + KL |
The reference trainer runs in forward-only mode (--forward-only). Reference logprobs are computed once at initialization and cached for the entire training run.
Architecture
Step 1: Provision trainers
Step 2: Prepare preference data
DPO expects pairs of (chosen, rejected) responses to the same prompt. The starter script supports multiple dataset formats.

Tokenize and build datums
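The prefix detection used when building datums can be sketched with toy token ids (illustrative only; real ids come from the tokenizer, and the loss weights then zero out the shared prompt positions):

```python
def shared_prompt_len(chosen_ids, rejected_ids):
    """Length of the longest common token prefix (the shared prompt)."""
    n = 0
    for a, b in zip(chosen_ids, rejected_ids):
        if a != b:
            break
        n += 1
    return n

# Toy token ids: the first three tokens are the shared prompt; a datum's
# loss weights would mask those positions so only response tokens count.
prompt_len = shared_prompt_len([1, 2, 3, 10, 11], [1, 2, 3, 20])
```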
Detect the shared prompt by finding the longest common token prefix between chosen and rejected.

Step 3: Cache reference logprobs
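A minimal sketch of this warm-up step, where ref_logprob_sum is a hypothetical stand-in for the frozen reference trainer's forward-only call (not the SDK API):

```python
def warm_reference_cache(datums, ref_logprob_sum):
    """Compute and store per-datum reference logprob sums once, before
    training starts; the cache is reused for the entire run."""
    return {i: ref_logprob_sum(d) for i, d in enumerate(datums)}

# Toy usage: a fake reference scorer keyed on sequence length.
cache = warm_reference_cache([[1, 2, 3], [4, 5]], lambda ids: -1.5 * len(ids))
```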
Before training starts, compute reference logprobs from the frozen reference trainer.

Step 4: DPO loss function
The loss function computes response-only log-probability sums using the weights from each datum, then applies the DPO margin.
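The margin form of the loss can be shown in plain Python; this is an illustrative scalar version, not the SDK's batched implementation:

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)), where the margin compares the
    policy-vs-reference logprob gap of chosen against that of rejected."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy matches the reference, the margin is 0 and loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
```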
Step 5: Training loop
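A skeletal sketch of such a loop with gradient accumulation; compute_loss and apply_gradients are hypothetical placeholders for the trainer's real forward/backward and optimizer-step calls:

```python
def train(pairs, epochs=1, grad_accum=4, compute_loss=None, apply_gradients=None):
    """Run the DPO loop: accumulate gradients over grad_accum pairs,
    then take one optimizer step. Returns the number of optimizer steps."""
    steps = 0
    for _ in range(epochs):
        pending = 0
        for pair in pairs:
            compute_loss(pair)          # forward + backward; grads accumulate
            pending += 1
            if pending == grad_accum:   # optimizer step every grad_accum pairs
                apply_gradients()
                steps += 1
                pending = 0
        if pending:                     # flush a partial accumulation window
            apply_gradients()
            steps += 1
    return steps

steps = train(list(range(10)), epochs=1, grad_accum=4,
              compute_loss=lambda p: None, apply_gradients=lambda: None)
```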
Step 6: Save final checkpoint
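One way the final save could be recorded in checkpoints.jsonl (the entry fields here are assumptions, and a real save also writes model weights):

```python
import json
import os
import tempfile

def record_checkpoint(log_path, step, name):
    """Append one JSON object per line so a later resume can find the
    most recent checkpoint. Field names are illustrative."""
    entry = {"step": step, "checkpoint": name}
    with open(os.path.join(log_path, "checkpoints.jsonl"), "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

# Toy usage with a temporary log directory.
with tempfile.TemporaryDirectory() as d:
    entry = record_checkpoint(d, step=500, name="final")
```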
Cookbook recipe entrypoint
Use the current cookbook recipe API (Config + main) for runnable DPO loops:
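A sketch of what such an entrypoint might look like; this Config stand-in mirrors a few fields from the reference table, but the real class and import path live in the SDK:

```python
from dataclasses import dataclass

# Illustrative stand-in for the recipe's Config; field names and defaults
# follow the reference table, but this is not the SDK's actual class.
@dataclass
class Config:
    log_path: str                   # required: checkpoints.jsonl and logs go here
    base_model: str = "accounts/fireworks/models/qwen3-8b"
    dataset: str = ""
    beta: float = 0.1
    learning_rate: float = 1e-5
    epochs: int = 1
    grad_accum: int = 4
    lora_rank: int = 0              # 0 = full-parameter, >0 = LoRA

def main(config: Config) -> Config:
    # A real main() would provision trainers, cache reference logprobs,
    # and run the training loop; here we just echo the config.
    return config

cfg = main(Config(log_path="/tmp/dpo-run", dataset="prefs.jsonl"))
```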
Config fields reference
| Field | Type | Default | Description |
|---|---|---|---|
| log_path | str | — (required) | Directory for checkpoints.jsonl and logs |
| base_model | str | "accounts/fireworks/models/qwen3-8b" | Base model name |
| dataset | str | "" | Path or URL to preference JSONL |
| tokenizer_model | str | "" | HuggingFace model name for client-side tokenization |
| beta | float | 0.1 | DPO beta (controls preference sharpness) |
| learning_rate | float | 1e-5 | Learning rate |
| epochs | int | 1 | Training epochs |
| grad_accum | int | 4 | Gradient accumulation steps |
| max_seq_len | int \| None | None | Max sequence length (auto-populated from training shape if not set) |
| max_pairs | int \| None | None | Max preference pairs to use from dataset |
| lora_rank | int | 0 | 0 for full-parameter, >0 for LoRA |
| ref_cache_concurrency | int | 16 | Max concurrent reference forward passes during cache warm-up |
| init_from_checkpoint | str \| None | None | Load pretrained DCP weights (supports "job_id:checkpoint_name") |
Nested config sections: infra (InfraConfig), deployment (DeployConfig), hotload (HotloadConfig, defaults to hot_load_interval=0), and wandb (WandBConfig).
Note: DPO defaults hotload.hot_load_interval=0 (no hotloading by default), unlike GRPO which enables it by default. If deployment_id is set, the existing deployment is used; otherwise a new one is auto-created by setup_deployment.
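The ref_cache_concurrency bound can be sketched with an asyncio semaphore; ref_logprob here is a hypothetical async stand-in for the reference trainer's forward call, not the SDK API:

```python
import asyncio

async def warm_cache(datums, ref_logprob, concurrency=16):
    """Warm the reference-logprob cache with at most `concurrency`
    forward passes in flight at once."""
    sem = asyncio.Semaphore(concurrency)

    async def one(i, d):
        async with sem:                # bounds in-flight reference calls
            return i, await ref_logprob(d)

    results = await asyncio.gather(*(one(i, d) for i, d in enumerate(datums)))
    return dict(results)

# Toy usage: a fake async reference scorer keyed on sequence length.
async def fake_ref(d):
    return -float(len(d))

cache = asyncio.run(warm_cache(["ab", "abcd"], fake_ref, concurrency=2))
```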
Operational guidance
- DPO uses 2 RLOR jobs: a policy trainer and a frozen reference trainer (run with --forward-only). Reference logprobs are cached at init.
- Service mode supports both full-parameter and LoRA tuning.
- Keep a versioned reference cache tied to tokenizer + base model revision. If the base model changes, recompute reference logprobs.
- Monitor margin statistics: increasing margins indicate the policy is learning the preference signal. Flat or decreasing margins suggest issues.
- Resume is handled by checkpoint_utils.resolve_resume(), which reads checkpoints.jsonl from log_path and restores the last saved state automatically on startup.
- Use the ReconnectableClient behavior in the recipe to tolerate transient trainer preemption.
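The margin-monitoring guidance above can be sketched as a simple batch statistic (the tuple layout is an assumption for illustration):

```python
def batch_margins(batch):
    """Per-pair DPO margins; each tuple is
    (pol_chosen, pol_rejected, ref_chosen, ref_rejected) logprob sums."""
    return [(pc - rc) - (pr - rr) for pc, pr, rc, rr in batch]

def mean(xs):
    return sum(xs) / len(xs)

# Toy batch where the policy prefers chosen more than the reference does,
# so the mean margin is positive (the signal you want to see grow).
batch = [(-9.0, -13.0, -10.0, -12.0), (-8.5, -12.0, -9.0, -11.0)]
avg_margin = mean(batch_margins(batch))
```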
Common pitfalls
- Mismatched formatting between chosen/rejected sequences corrupts preference signals — ensure identical prompt prefixes.
- Stale reference cache: If you warm-start from a different model, the cached reference logprobs are invalid.
- Forgetting to refresh evaluation prompts can lead to overfitting on stale checks.