What this is
This runbook demonstrates DPO with chosen/rejected samples, reference-cache handling, and periodic checkpoint hotload for serving-side checks.
Workflow
- Create a service-mode trainer and connect the training client.
- Load preference pairs and render chosen/rejected tokenized batches.
- Compute DPO margin objective and run optimizer updates.
- Checkpoint and evaluate via deployment.
End-to-end examples
Provision DPO trainer
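A minimal provisioning sketch. `ServiceClient`, `create_trainer`, and `DPOTrainerConfig` are placeholder names standing in for whatever trainer API this runbook targets, not a real library; the only constraint carried over from the guidance below is lora_rank=0.

```python
# Hypothetical provisioning sketch; ServiceClient and its methods are assumed
# names for the training service's client, not a real API.
from dataclasses import dataclass


@dataclass
class DPOTrainerConfig:
    base_model: str               # must match the revision the reference cache was built for
    lora_rank: int = 0            # service-mode jobs are full-parameter only
    learning_rate: float = 5e-7
    beta: float = 0.1             # DPO KL-penalty coefficient


class ServiceClient:
    """Placeholder for the training service's client; swap in the real one."""

    def create_trainer(self, config: DPOTrainerConfig):
        raise NotImplementedError("connect to the actual training service here")


def provision_dpo_trainer(config: DPOTrainerConfig):
    if config.lora_rank != 0:
        raise ValueError("service-mode trainer jobs support full-parameter tuning only")
    return ServiceClient().create_trainer(config)
```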
One DPO batch update
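The margin math itself is standard DPO and can be sketched locally in PyTorch; in service mode the forward/backward runs on the trainer job, but the objective is the same. `policy` (an HF-style causal LM returning `.logits`), the batch keys, and the cached reference log-probs are assumed names.

```python
# A minimal DPO batch update in PyTorch. Reference log-probs come from the
# versioned reference cache rather than a live reference model.
import torch
import torch.nn.functional as F


def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sum log-probabilities of `labels` under `logits`, masking prompt/pad tokens."""
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logp = logp.gather(-1, labels[:, 1:].unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask[:, 1:]).sum(dim=-1)


def dpo_step(policy, optimizer, batch, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Policy log-probs for the chosen and rejected completions.
    pi_chosen = sequence_logprob(policy(batch["chosen_ids"]).logits,
                                 batch["chosen_ids"], batch["chosen_mask"])
    pi_rejected = sequence_logprob(policy(batch["rejected_ids"]).logits,
                                   batch["rejected_ids"], batch["rejected_mask"])

    # DPO margin: (policy - reference) log-ratio gap between chosen and rejected.
    margin = (pi_chosen - ref_logp_chosen) - (pi_rejected - ref_logp_rejected)
    loss = -F.logsigmoid(beta * margin).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), margin.mean().item()  # monitor both loss and margin stats
```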
Checkpoint and serving eval
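A hedged sketch of the checkpoint-then-evaluate loop. `save_checkpoint`, `hotload`, and `generate` are assumed method names standing in for the trainer and deployment APIs.

```python
# Hypothetical checkpoint/hotload loop; method names are illustrative.
def checkpoint_and_eval(trainer, deployment, eval_prompts, step: int):
    # Persist full-parameter weights; the tag convention is illustrative.
    ckpt_path = trainer.save_checkpoint(tag=f"step-{step}")

    # Swap the serving deployment onto the new weights without a restart.
    deployment.hotload(ckpt_path)

    # Serving-side spot check: generate from fresh eval prompts (refresh them
    # periodically; stale prompts overfit the check -- see pitfalls below).
    return [deployment.generate(p, max_tokens=256) for p in eval_prompts]
```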
Operational guidance
- Service-mode trainer jobs currently support full-parameter tuning only; keep lora_rank=0.
- Keep a versioned reference cache tied to the tokenizer and base-model revision (see the key sketch after this list).
- Monitor both margin statistics and downstream quality metrics.
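One way to keep the reference cache versioned, as the guidance above suggests, is to fold the base-model revision and a tokenizer fingerprint into every cache key, so any mismatch becomes a cache miss rather than a silently stale reference log-prob. The helper below is illustrative.

```python
# Versioned reference-cache key: changing the tokenizer or base-model revision
# invalidates cached reference log-probs by construction.
import hashlib


def reference_cache_key(base_model_revision: str, tokenizer_hash: str, pair_id: str) -> str:
    raw = f"{base_model_revision}|{tokenizer_hash}|{pair_id}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```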
Common pitfalls
- Mismatched formatting between chosen/rejected sequences can corrupt preference signals; render both from the same template so they share an identical prompt prefix (sketch below).
- Forgetting to refresh evaluation prompts can overfit to stale checks.
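A sketch of the prefix-safe rendering mentioned above, assuming an HF-style tokenizer: tokenizing the prompt once and concatenating completion tokens guarantees chosen and rejected share an identical prompt prefix, sidestepping BPE boundary drift.

```python
# Guard against the formatting pitfall; the tokenizer call signature assumes
# an HF-style tokenizer and is illustrative.
def render_pair(tokenizer, prompt: str, chosen: str, rejected: str):
    # Tokenize the prompt once and concatenate completions, so both sequences
    # share the same prompt tokens by construction.
    prompt_ids = tokenizer(prompt)["input_ids"]
    chosen_ids = prompt_ids + tokenizer(chosen, add_special_tokens=False)["input_ids"]
    rejected_ids = prompt_ids + tokenizer(rejected, add_special_tokens=False)["input_ids"]
    return prompt_ids, chosen_ids, rejected_ids
```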