What this is
This runbook demonstrates an on-policy GRPO loop where sampling uses a deployment that is periodically hotloaded with the latest policy checkpoint.
Why this approach
- On-policy sampling reduces mismatch between policy updates and sampled trajectories.
- Reference-model KL terms stabilize optimization while preserving exploration.
How to use these APIs
- Fireworks.reinforcement_fine_tuning_steps.create: Provision policy and reference trainer services.
- TrainingClient.forward_backward_custom: Apply GRPO objective with reward and KL components.
- TrainingClient.save_weights_for_sampler: Export checkpoints for deployment hotload.
Workflow
- Provision policy trainer (trainable) and reference trainer (frozen).
- Sample completions through deployment.
- Compute rewards and build token-weighted GRPO batches (see the sketch after this list).
- Run custom loss update and optimizer step.
- Checkpoint and hotload deployment on cadence.
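The reward-to-weight step is plain tensor math and does not depend on the SDK. Below is a minimal sketch of group-normalized advantages broadcast onto completion tokens; the function name, tensor shapes, and epsilon value are illustrative choices, not prescribed by this runbook.

```python
# Sketch: turn per-completion rewards into token-level GRPO weights.
# Generic math only; names and shapes are illustrative.
import torch

def grpo_token_weights(rewards: torch.Tensor, completion_mask: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """rewards: [G] scalar rewards for one prompt's group of G completions.
    completion_mask: [G, T] with 1.0 on completion tokens, 0.0 on prompt/pad tokens.
    Returns [G, T] per-token weights (group-normalized advantages)."""
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Broadcast each completion's advantage over its own tokens only.
    return advantages[:, None] * completion_mask

# Example: one prompt, 4 sampled completions, 6 token positions each.
rewards = torch.tensor([1.0, 0.2, 0.8, 0.0])
mask = torch.ones(4, 6)
weights = grpo_token_weights(rewards, mask)
```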
End-to-end examples
Provision policy and reference trainers
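A minimal provisioning sketch, assuming a Python client whose reinforcement_fine_tuning_steps.create call takes the base model and lora_rank, and that the resulting jobs can be wrapped in TrainingClient handles. The import paths, argument names, example model ID, and TrainingClient constructor are assumptions; only the method names come from this runbook.

```python
# Sketch: provision a trainable policy trainer and a frozen reference trainer.
# Import paths, create(...) argument names, and the TrainingClient constructor
# are assumptions; consult the SDK reference for the real signatures.
from fireworks import Fireworks                 # assumed import path
from fireworks.training import TrainingClient   # assumed import path

fw = Fireworks()

BASE_MODEL = "accounts/fireworks/models/llama-v3p1-8b-instruct"  # example model ID

# Policy trainer: trainable, full-parameter (service mode requires lora_rank=0).
policy_job = fw.reinforcement_fine_tuning_steps.create(
    base_model=BASE_MODEL,
    lora_rank=0,
)

# Reference trainer: same base model; kept frozen by never taking optimizer
# steps on it, and used only to supply log-probs for the KL terms.
reference_job = fw.reinforcement_fine_tuning_steps.create(
    base_model=BASE_MODEL,
    lora_rank=0,
)

policy_trainer = TrainingClient(policy_job)        # assumed constructor
reference_trainer = TrainingClient(reference_job)  # assumed constructor
```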
Single GRPO update iteration
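A sketch of one update iteration. The overall flow and the forward_backward_custom call come from this runbook; the sampling call, batch builder, loss selectors, KL coefficient, and optimizer-step name are placeholders for whatever the SDK's custom-loss contract actually expects.

```python
# Sketch of one on-policy GRPO update. Every argument shown on the trainer and
# deployment objects (n, temperature, loss=..., ref_logprobs=..., kl_coef=...,
# optim_step) is a placeholder, not a confirmed API.

def grpo_iteration(policy_trainer, reference_trainer, deployment,
                   prompts, reward_fn, build_batch, kl_coef=0.05):
    # 1. Sample a group of completions per prompt through the live deployment,
    #    which the hotload cadence keeps close to on-policy.
    groups = [deployment.generate(p, n=4, temperature=1.0) for p in prompts]

    # 2. Score completions and build token-weighted GRPO batches
    #    (e.g. with grpo_token_weights from the earlier sketch).
    rewards = [[reward_fn(p, c) for c in g] for p, g in zip(prompts, groups)]
    batch = build_batch(prompts, groups, rewards)

    # 3. Run the frozen reference trainer forward to get per-token log-probs
    #    for the KL penalty; no optimizer step is ever taken on it.
    ref_logprobs = reference_trainer.forward_backward_custom(
        batch, loss="logprobs_only")  # placeholder loss selector

    # 4. Apply the GRPO objective (advantage-weighted policy term plus KL to
    #    the reference) on the policy trainer, then take the optimizer step.
    policy_trainer.forward_backward_custom(
        batch, loss="grpo", ref_logprobs=ref_logprobs, kl_coef=kl_coef)
    policy_trainer.optim_step()  # placeholder name for the optimizer step
```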
Checkpoint and hotload serving
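A sketch of the checkpoint-and-hotload step. save_weights_for_sampler is the export call named by this runbook; its return value and the deployment-side hotload call are assumptions for illustration.

```python
# Sketch: export the latest policy weights on a fixed cadence and hotload them
# into the sampling deployment. The returned checkpoint handle and
# deployment.hotload(...) are assumptions.
HOTLOAD_EVERY = 10  # steps; keep this aligned with the evaluation cadence

def maybe_hotload(step: int, policy_trainer, deployment) -> None:
    if step % HOTLOAD_EVERY != 0:
        return
    checkpoint = policy_trainer.save_weights_for_sampler()  # assumed to return a checkpoint handle
    deployment.hotload(checkpoint)                          # placeholder deployment-side call
```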
Operational guidance
- Service-mode trainer jobs currently support full-parameter tuning only. Keep lora_rank=0 for both policy and reference trainers.
- Track reward distributions and KL terms every step to catch objective drift early (a minimal tracking sketch follows this list).
- Align hotload interval with evaluation cadence to keep metrics meaningful.
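A minimal tracking sketch for the reward and KL guidance above. It assumes per-step lists of scalar rewards and per-sequence KL values, and uses print as a stand-in for whatever metrics backend the run already has.

```python
# Sketch: cheap per-step tracking of reward distribution and KL magnitude.
# Swap the print for your real metrics backend.
import statistics

def log_step_metrics(step: int, rewards: list[float], kl_values: list[float]) -> None:
    print(f"step={step} "
          f"reward_mean={statistics.mean(rewards):.3f} "
          f"reward_std={statistics.pstdev(rewards):.3f} "
          f"kl_mean={statistics.mean(kl_values):.4f} "
          f"kl_max={max(kl_values):.4f}")
```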
Common pitfalls
- Reward normalization bugs can destabilize GRPO updates quickly.
- Reference and policy tokenizer mismatch invalidates KL estimates.
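A quick guard against the tokenizer-mismatch pitfall, assuming the policy and reference base models resolve to Hugging Face-compatible tokenizers; adapt the model identifiers to whatever the trainers were provisioned from.

```python
# Sketch: verify the policy and reference tokenizers agree before trusting KL.
# Assumes Hugging Face-compatible model identifiers.
from transformers import AutoTokenizer

def assert_same_tokenizer(policy_model: str, reference_model: str) -> None:
    policy_tok = AutoTokenizer.from_pretrained(policy_model)
    reference_tok = AutoTokenizer.from_pretrained(reference_model)
    if policy_tok.get_vocab() != reference_tok.get_vocab():
        raise ValueError(
            "Policy and reference tokenizers differ; KL estimates against "
            "the reference model would be invalid.")
```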