dcp_save_interval
Controls how often full training state (weights and optimizer) is checkpointed using DCP (Distributed Checkpoint) format.
| Property | Value |
|---|---|
| Type | integer |
| Default | 0 (disabled) |
| Typical config — SFT / RL / DPO cookbooks | WeightSyncConfig(dcp_save_interval=N) on the recipe Config (see Cookbook: RL and Checkpoints) |
When set to 0 (the default), no periodic DCP checkpoints are written for resume. Only sampler and HuggingFace-format weight snapshots may be produced. These preserve model weights but not optimizer state.
When set to a positive integer N, a full DCP checkpoint is written every N steps.
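The interval rule can be sketched as a small predicate. This is an illustration only: `should_save_dcp` is a hypothetical helper, not a cookbook function. It assumes steps are counted from 1, so the first checkpoint lands on step N; the cookbook's actual step indexing may differ.

```python
def should_save_dcp(step: int, dcp_save_interval: int) -> bool:
    """Return True when a full DCP checkpoint is due at this step.

    Hypothetical helper for illustration. Assumes `step` is 1-based.
    """
    if dcp_save_interval <= 0:
        # 0 (the default) disables periodic DCP checkpoints.
        # Treating negative values the same way is an assumption of this sketch.
        return False
    return step % dcp_save_interval == 0


# With an interval of 3, checkpoints fall on every third step.
due_steps = [s for s in range(1, 11) if should_save_dcp(s, 3)]
print(due_steps)
```

With the default of 0 the predicate never fires, which matches the "disabled" behavior described above.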
Why this matters: If a training job is interrupted, optimizer state is lost unless dcp_save_interval is set to a positive value. Without it, the model resumes from the last weight snapshot, but the optimizer re-initializes from scratch, which can affect training stability and the effective learning rate.
Example (cookbook Config)
Set dcp_save_interval through WeightSyncConfig(dcp_save_interval=N) on the recipe Config; see your recipe’s Config dataclass for the exact attribute path.
Job recovery and preemption
For transient control-plane or worker interruptions, the trainer job manager exposes reconnect_and_wait so your driver can wait for a resumable state and resume cleanly.
load_state_with_optimizer() only restores optimizer state from DCP-format checkpoints. If you point it at an HF or sampler snapshot, optimizer state silently won’t be restored. Always load from the path returned by save_state() when you need full optimizer restore. See Saving and loading.
Metrics reference
ppo_kl vs ref_kld
GRPO training logs two KL divergence metrics that measure different things:
| Metric | What it measures | Expected behavior |
|---|---|---|
| ppo_kl | KL between the current policy and the previous policy (importance-sampling ratio inside the PPO clip objective) | Stays near 0 with one minibatch per rollout; this is correct, not a bug |
| ref_kld | KL between the current policy and the reference (base) model | Starts near 0, increases gradually as the policy diverges from base during training |
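A minimal sketch of why ppo_kl sits at 0 with one minibatch per rollout. It assumes ppo_kl is estimated as the mean of old minus current log-probabilities over the sampled tokens, which is a common choice; the cookbook's exact estimator is not shown in this section. `approx_ppo_kl` and the sample values are hypothetical.

```python
def approx_ppo_kl(old_logprobs, new_logprobs):
    """Mean of (old - new) log-probs over the sampled tokens.

    Assumed estimator for illustration; inputs must be non-empty
    and aligned token by token.
    """
    diffs = [old - new for old, new in zip(old_logprobs, new_logprobs)]
    return sum(diffs) / len(diffs)


# Log-probs recorded when the rollout was sampled.
rollout_logprobs = [-1.2, -0.4, -2.3]

# One minibatch per rollout: the weights have not changed since sampling,
# so the "current" log-probs are identical to the recorded ones.
first_update = approx_ppo_kl(rollout_logprobs, rollout_logprobs)

# Only after an earlier update on the same rollout do the two differ.
later_update = approx_ppo_kl(rollout_logprobs, [-1.0, -0.5, -2.6])
```

Under this assumption, a nonzero ppo_kl only appears when the same rollout is reused for more than one gradient step.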
ref_kld is the metric to watch for policy drift. A sudden large jump in ref_kld may indicate reward hacking or that the KL penalty coefficient needs tuning.
The cookbook does not always surface ref_kld by default. To add it, you can use the k3 unbiased estimator, (r - 1) - log r with r = π_ref / π, evaluated on tokens sampled from the current policy.
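A dependency-free sketch of the k3 estimate for ref_kld follows. The function name and inputs are hypothetical. In a real training loop the same arithmetic would be applied to the per-token log-prob tensors the recipe already computes, and how the value is logged depends on the recipe.

```python
import math


def k3_ref_kld(policy_logprobs, ref_logprobs):
    """k3 estimate of KL(policy || reference) from sampled-token log-probs.

    Both inputs hold log-probs of the same tokens, which were sampled
    from the current policy. Inputs must be non-empty.
    """
    total = 0.0
    for logp, ref_logp in zip(policy_logprobs, ref_logprobs):
        log_ratio = ref_logp - logp  # log r, with r = pi_ref / pi
        # (r - 1) - log r; expm1 keeps precision when log_ratio is near 0.
        total += math.expm1(log_ratio) - log_ratio
    return total / len(policy_logprobs)


# Identical policies give 0; the estimate grows as the policy drifts.
at_start = k3_ref_kld([-1.0, -2.0], [-1.0, -2.0])
after_drift = k3_ref_kld([-1.0], [-1.5])
```

Each per-token term is non-negative up to floating-point rounding, so the logged value does not dip below zero the way a plain log-ratio average can.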