## What this is
During training, you save checkpoints for two purposes:

- **Serving** (`save_weights_for_sampler_ext`): export model weights that a deployment can load for inference and evaluation.
- **Resuming** (`save_state`/`load_state_with_optimizer`): persist full training state (weights plus optimizer) so you can continue training from where you left off.
## Sampler checkpoints
### Base vs. delta
| Type | What it saves | Size | When to use |
|---|---|---|---|
| `"base"` | Full model weights | Large (~16 GB for an 8B model) | First checkpoint, or when you need a clean snapshot |
| `"delta"` | XOR diff from the previous base | Small (~10x smaller) | Every subsequent checkpoint after a base |
### Saving checkpoints

### How delta chaining works
1. Save a base checkpoint (full weights); the deployment loads this as its starting point.
2. Save delta checkpoints; each contains only the diff from the base.
3. The deployment applies `current_weights = base XOR delta`.
### Promoting a checkpoint to a model
After saving a sampler checkpoint, you can promote it to a deployable Fireworks model using `promote_checkpoint`:
| Parameter | Type | Description |
|---|---|---|
| `job_id` | `str` | RLOR trainer job ID that produced the checkpoint |
| `checkpoint_id` | `str` | The `snapshot_name` from `save_weights_for_sampler_ext` |
| `output_model_id` | `str` | Desired model ID (1-63 chars, lowercase a-z, 0-9, hyphen only) |
The promoted model lives at `accounts/<your-account>/models/<output_model_id>` and can be deployed like any other Fireworks model.
`output_model_id` must be a valid Fireworks resource ID: lowercase letters, digits, and hyphens only, 1-63 characters, and it cannot start or end with a hyphen.

## Weight sync
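The ID rules above can be checked client-side before submitting. This validator is a sketch of the stated constraints, not the SDK's own validation:

```python
import re

# 1-63 chars, lowercase letters/digits/hyphens, no leading or trailing hyphen.
_MODEL_ID_RE = re.compile(r"^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$")

def is_valid_model_id(model_id: str) -> bool:
    """Return True if model_id satisfies the Fireworks resource ID rules."""
    return bool(_MODEL_ID_RE.fullmatch(model_id))

assert is_valid_model_id("my-finetuned-8b")
assert is_valid_model_id("a")                  # single character is allowed
assert not is_valid_model_id("-bad-start")     # leading hyphen
assert not is_valid_model_id("Uppercase-Name") # uppercase not allowed
assert not is_valid_model_id("a" * 64)         # over 63 characters
```

Failing fast here avoids a round trip to the API just to learn the name is invalid.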
Weight sync pushes a checkpoint onto a running inference deployment without restarting it. See `DeploymentManager` for the direct hotload API and `WeightSyncer` for the recommended lifecycle manager.
## Train-state checkpoints
### Save and restore
Use `save_state` and `load_state_with_optimizer` to persist and restore full training state (weights plus optimizer momentum, learning rate schedules, etc.).
`save_state` accepts an optional `ttl_seconds` parameter for auto-expiring checkpoints.
For the raw `FiretitanTrainingClient`, `save_state()` and `load_state_with_optimizer()` return futures, so call `.result()` to wait for completion. The cookbook's `ReconnectableClient` wrapper already blocks for you.

### When to use train-state checkpoints
- Recovery from interruptions or failures
- Adjusting hyperparameters or data mid-run
- Multi-step training pipelines (e.g. supervised learning followed by reinforcement learning)
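The future-returning save calls mentioned above behave like standard `concurrent.futures` objects: submission returns immediately, and `.result()` blocks until the work finishes. A stand-in demonstration (the `fake_save_state` helper is hypothetical, not the SDK call):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a save_state()-style call that runs asynchronously.
def fake_save_state() -> str:
    time.sleep(0.05)  # simulate I/O while the state is persisted
    return "state-0001"

with ThreadPoolExecutor() as pool:
    future = pool.submit(fake_save_state)  # returns immediately
    snapshot = future.result()             # blocks until the save completes

assert snapshot == "state-0001"
```

Forgetting `.result()` is an easy way to start a long run before the previous save has actually landed.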
## Cross-job checkpoint resolution
Load checkpoints from a previous trainer job.

### List available checkpoints
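When resuming from a previous job, a common pattern is to scan the recorded checkpoints and pick the latest one. The cookbook persists records to `checkpoints.jsonl`; the record schema below is an assumption for illustration, not the cookbook's actual format:

```python
import json
from io import StringIO

# Hypothetical checkpoints.jsonl contents (schema is assumed, not authoritative).
JSONL = StringIO(
    '{"step": 100, "snapshot_name": "ckpt-100"}\n'
    '{"step": 300, "snapshot_name": "ckpt-300"}\n'
    '{"step": 200, "snapshot_name": "ckpt-200"}\n'
)

def latest_checkpoint(fp) -> dict:
    """Pick the record with the highest training step."""
    records = [json.loads(line) for line in fp if line.strip()]
    return max(records, key=lambda r: r["step"])

assert latest_checkpoint(JSONL)["snapshot_name"] == "ckpt-300"
```

Note that line order in a JSONL log is not guaranteed to be sorted, so resolution should key on the step rather than taking the last line.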
## Operational guidance
- First checkpoint must be base if you’re using delta chaining. The deployment needs full weights as a starting point.
- Track your checkpoint identities. Delta weight syncs reference a `previous_snapshot_identity` that must match what the deployment currently has loaded.
- Keep checkpoint intervals predictable so evaluation comparisons are stable across experiments.
- Save train state before long experiments so you can resume from the last good state.
Cookbook users: the cookbook's `checkpoint_utils.save_checkpoint` and `checkpoint_utils.resolve_resume` wrap these SDK methods with structured persistence to `checkpoints.jsonl`. See the Cookbook Reference for details.

## Related guides
- WeightSyncer reference — full weight sync lifecycle
- DeploymentManager reference — direct hotload API
- Training and Sampling — end-to-end workflow