What this is

This runbook demonstrates Direct Preference Optimization (DPO) training on chosen/rejected preference pairs, including reference-cache handling and periodic checkpoint hotloading for serving-side checks.

Workflow

  1. Create a service-mode trainer and connect the training client.
  2. Load preference pairs and render chosen/rejected tokenized batches.
  3. Compute the DPO margin objective and run optimizer updates.
  4. Save a checkpoint, hotload it into a deployment, and evaluate.
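
Step 3 refers to the DPO margin objective. As a minimal sketch, assuming the standard DPO formulation and summed per-sequence log-probabilities as inputs, the per-pair loss might be computed as follows. The function name dpo_pair_loss and its arguments are illustrative only; they are not part of the training client API, and the runbook's make_dpo_loss_fn may be implemented differently.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (loss, margin) for a single preference pair.

    Each argument is the summed log-probability of the response tokens
    under the policy or the frozen reference model (an assumed input format).
    """
    # Margin: how much more the policy prefers chosen over rejected
    # than the reference model does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    loss = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return loss, margin


# A pair where the policy matches the reference has margin 0 and loss log(2).
loss, margin = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)
```

A positive margin lowers the loss below log(2), and a negative margin raises it, which is why the margin is a natural quantity to track during training.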

End-to-end examples

Provision DPO trainer

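# `fw` is assumed to be an already-configured Fireworks client.
trainer_job = fw.reinforcement_fine_tuning_steps.create(
    training_config={
        "base_model": "accounts/fireworks/models/qwen3-8b",
        "lora_rank": 0,  # full-parameter tuning; service mode requires 0
        "max_context_length": 4096,
        "learning_rate": 1e-5,
        "gradient_accumulation_steps": 4,
    },
    extra_body={"serviceMode": True, "keepAlive": False},
)
# make_training_client, build_reference_cache, and preference_dataset
# are assumed to be defined elsewhere in the runbook's codebase.
training_client = make_training_client(trainer_job)
# Build the reference cache once up front so every DPO step can reuse it.
reference_cache = build_reference_cache(training_client, preference_dataset)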

One DPO batch update

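import tinker

# load_preference_pairs, build_dpo_batch, and make_dpo_loss_fn are helpers
# assumed to be defined elsewhere.
pairs = load_preference_pairs(batch_size=8)
chosen_batch, rejected_batch = build_dpo_batch(pairs)
loss_fn = make_dpo_loss_fn(beta=0.1, reference_cache=reference_cache)
# Chosen and rejected examples go through one forward/backward pass
# as a single concatenated list.
training_client.forward_backward_custom(chosen_batch + rejected_batch, loss_fn).result()
# Apply the accumulated gradients.
training_client.optim_step(
    tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01)
).result()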
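
After each update it can help to summarize the per-pair margins for logging. The sketch below assumes the margins are available as a plain list of floats; the source does not show what make_dpo_loss_fn returns, so how they are collected is left open, and the name margin_stats is hypothetical.

```python
from statistics import mean


def margin_stats(margins):
    """Summarize per-pair DPO margins for one batch."""
    if not margins:
        raise ValueError("margins must be non-empty")
    return {
        "mean_margin": mean(margins),
        "min_margin": min(margins),
        # Fraction of pairs where chosen is ranked above rejected
        # relative to the reference model.
        "pair_accuracy": sum(m > 0 for m in margins) / len(margins),
    }


stats = margin_stats([1.0, -1.0, 2.0, 0.0])
```

Logging these values per step makes it easier to spot a batch where the mean margin grows while pair accuracy stalls.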

Checkpoint and serving eval

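# `step` is the current training step from the surrounding loop.
# hotload_deployment, sample_with_deployment, and evaluate_dpo_outputs
# are helpers assumed to be defined elsewhere.
checkpoint = training_client.save_weights_for_sampler(f"dpo-step-{step:05d}").result()
hotload_deployment(checkpoint.path)
responses = sample_with_deployment(prompts=dpo_eval_prompts)
print(evaluate_dpo_outputs(responses))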

Operational guidance

  • Service-mode trainer jobs currently support full-parameter tuning only. Keep lora_rank=0.
  • Keep a versioned reference cache tied to the tokenizer and base model revision, and rebuild it when either changes.
  • Monitor both margin statistics and downstream quality metrics.
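
One way to tie a reference cache to the tokenizer and base model revision is to derive the cache version from those values. This is a minimal sketch under assumptions: the function name reference_cache_key, the revision strings, and the choice of fields are all hypothetical, and the runbook's build_reference_cache may version its cache differently.

```python
import hashlib
import json


def reference_cache_key(base_model, model_revision, tokenizer_revision,
                        max_context_length):
    """Derive a short, deterministic cache version from its inputs."""
    payload = json.dumps(
        {
            "base_model": base_model,
            "model_revision": model_revision,
            "tokenizer_revision": tokenizer_revision,
            "max_context_length": max_context_length,
        },
        sort_keys=True,  # stable ordering so the same inputs always hash the same
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Hypothetical revision labels, for illustration only.
key = reference_cache_key(
    "accounts/fireworks/models/qwen3-8b", "rev-a", "tok-a", 4096
)
```

Storing the key next to the cached values lets a training run refuse to load a cache that was built against a different tokenizer or model revision.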

Common pitfalls

  • Mismatched formatting between chosen and rejected sequences (for example, different prompt prefixes or chat templates) can corrupt the preference signal.
  • Reusing the same evaluation prompts for too long can lead to tuning decisions that overfit to stale checks; refresh them periodically.
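
The formatting pitfall can be caught before training with a simple consistency check on each tokenized pair. The sketch below assumes each sequence is a list of token IDs that begins with the shared prompt tokens; the function name check_preference_pair is hypothetical, and the runbook's build_dpo_batch may represent pairs differently.

```python
def check_preference_pair(prompt_tokens, chosen_tokens, rejected_tokens):
    """Return a list of formatting problems found in one preference pair."""
    problems = []
    n = len(prompt_tokens)
    # Both sequences should start with exactly the same prompt tokens.
    if chosen_tokens[:n] != prompt_tokens:
        problems.append("chosen sequence does not start with the prompt tokens")
    if rejected_tokens[:n] != prompt_tokens:
        problems.append("rejected sequence does not start with the prompt tokens")
    # Each sequence needs at least one response token after the prompt.
    if len(chosen_tokens) <= n:
        problems.append("chosen response is empty")
    if len(rejected_tokens) <= n:
        problems.append("rejected response is empty")
    # Identical sequences carry no preference information.
    if chosen_tokens == rejected_tokens:
        problems.append("chosen and rejected sequences are identical")
    return problems


issues = check_preference_pair([1, 2, 3], [1, 2, 3, 4, 5], [1, 2, 3, 6])
```

An empty list means the pair passed these checks; pairs with problems can be dropped or logged before they reach the loss function.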