
Use this page for training checkpoint and resume knobs and GRPO metric interpretation that are easy to miss when running reinforcement fine-tuning (RFT) and cookbook-driven training. For sampling and optimization hyperparameters (learning rate, epochs, temperature, KL targets, etc.), see Parameter tuning. The canonical cookbook reference for save, resume, and promote is Checkpoints and Resume. Low-level SDK APIs are documented in Saving and loading.

dcp_save_interval

Controls how often full training state (weights and optimizer) is checkpointed using DCP (Distributed Checkpoint) format.
Type: integer
Default: 0 (disabled)
Typical config (SFT / RL / DPO cookbooks): WeightSyncConfig(dcp_save_interval=N) on the recipe Config (see Cookbook: RL and Checkpoints)
When set to 0 (the default), no periodic DCP checkpoints are written for resume. Only sampler and HuggingFace-format weight snapshots may be produced; these preserve model weights but not optimizer state. When set to a positive integer N, a full DCP checkpoint is written every N steps.

Why this matters: if a training job is interrupted, optimizer state is lost unless dcp_save_interval is set. The model resumes from the last checkpoint, but the optimizer re-initializes from scratch, which can affect training stability and the effective learning rate.

Example (cookbook Config)

from training.recipes.rl_loop import Config, main
from training.utils import WeightSyncConfig

cfg = Config(
    log_path="./grpo_logs",
    base_model="accounts/fireworks/models/qwen3-8b",
    # ... other fields ...
    weight_sync=WeightSyncConfig(dcp_save_interval=50),  # full checkpoint every 50 steps
    # ...
)
main(cfg)
Some internal or forked recipes may expose the same interval on a nested config type (for example a weight sync block). The field name is always dcp_save_interval; see your recipe’s Config dataclass for the exact attribute path.
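To see why losing optimizer state matters, here is a small self-contained sketch (a toy scalar Adam, not the Fireworks SDK): resuming from a weights-only snapshot resets Adam's moment estimates and step counter, so the first post-resume update differs from a full resume that restores optimizer state.

```python
# Toy illustration (not the Fireworks SDK): a weights-only resume resets
# Adam's moments and step counter, changing the first post-resume update.

def adam_step(w, grad, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    m_hat = state["m"] / (1 - b1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - b2 ** state["t"])  # bias-corrected second moment
    return w - lr * m_hat / (v_hat ** 0.5 + eps)

fresh = {"t": 0, "m": 0.0, "v": 0.0}

# "Train" 100 steps with a constant gradient, then checkpoint.
w, state = 1.0, dict(fresh)
for _ in range(100):
    w = adam_step(w, 1.0, state)

# Full resume (DCP-style): weights AND optimizer state survive.
w_full, s_full = w, dict(state)
w_full = adam_step(w_full, 0.1, s_full)  # gradient shifted after resume

# Weights-only resume (HF/sampler snapshot): optimizer re-initializes.
w_only, s_only = w, dict(fresh)
w_only = adam_step(w_only, 0.1, s_only)

# The two resume modes take a visibly different first step.
step_gap = abs(w_full - w_only)
```

The gap comes from the fresh optimizer's warm-up bias correction and empty moment buffers, which is exactly the instability the warning above describes.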

Job recovery and preemption

For transient control-plane or worker interruptions, the trainer job manager exposes reconnect_and_wait so your driver can wait for a resumable state and resume cleanly.
load_state_with_optimizer() only restores optimizer state from DCP-format checkpoints. If you point it at an HF or sampler snapshot, optimizer state silently won’t be restored. Always load from the path returned by save_state() when you need full optimizer restore. See Saving and loading.

Metrics reference

ppo_kl vs ref_kld

GRPO training logs two KL divergence metrics that measure different things:

ppo_kl: KL between the current policy and the previous policy (the importance-sampling ratio inside the PPO clip objective). Expected behavior: stays near 0 with one minibatch per rollout; this is correct, not a bug.

ref_kld: KL between the current policy and the reference (base) model. Expected behavior: starts near 0 and increases gradually as the policy diverges from the base during training.
Which one to monitor: ref_kld is the metric to watch for policy drift. A sudden large jump in ref_kld may indicate reward hacking or that the KL penalty coefficient needs tuning. The cookbook does not always surface ref_kld by default. To add it, you can use the k3 unbiased estimator:
ref_kld = (ref_logp - policy_logp).exp() - (ref_logp - policy_logp) - 1
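A quick self-contained sanity check of this estimator (pure Python with toy categorical distributions; the distributions and names here are illustrative, not cookbook code): the k3 estimate is nonnegative for every sample, and its mean over samples drawn from the policy converges to the exact KL(policy || ref).

```python
import math
import random

random.seed(0)
policy = [0.7, 0.2, 0.1]  # toy current policy over 3 tokens
ref    = [0.5, 0.3, 0.2]  # toy reference (base) model

exact_kl = sum(p * math.log(p / q) for p, q in zip(policy, ref))

def k3(policy_logp, ref_logp):
    # k3 = exp(log_r) - log_r - 1 with log_r = ref_logp - policy_logp;
    # nonnegative per sample and unbiased for KL(policy || ref)
    # when tokens are sampled from the policy.
    d = ref_logp - policy_logp
    return math.exp(d) - d - 1

tokens = random.choices(range(3), weights=policy, k=200_000)
ests = [k3(math.log(policy[i]), math.log(ref[i])) for i in tokens]
mean_est = sum(ests) / len(ests)

print(min(ests) >= 0.0)                 # every per-sample estimate is >= 0
print(abs(mean_est - exact_kl) < 0.01)  # sample mean tracks the exact KL
```

Nonnegativity is why k3 is a convenient logging metric: unlike the naive estimator (policy_logp - ref_logp), individual values never go negative, so the running mean is easier to read on a dashboard.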