
What this is

During training, you save checkpoints for two purposes:
  1. Serving (save_weights_for_sampler_ext): Export model weights that a deployment can load for inference and evaluation.
  2. Resuming (save_state / load_state_with_optimizer): Persist full training state (weights + optimizer) so you can continue training from where you left off.

Sampler checkpoints

Base vs. delta

| Type | What it saves | Size | When to use |
| --- | --- | --- | --- |
| `"base"` | Full model weights | Large (~16 GB for an 8B model) | First checkpoint, or when you need a clean snapshot |
| `"delta"` | XOR diff from the previous base | Small (~10x smaller) | Every subsequent checkpoint after a base |
Delta checkpoints are much faster to save and transfer, making per-step weight sync practical for on-policy training.

Saving checkpoints

# First checkpoint — must be base (full weights)
result = training_client.save_weights_for_sampler_ext(
    "step-0001",
    checkpoint_type="base",
)
# result.snapshot_name is session-qualified (e.g. "step-0001-a1b2c3d4")

# Subsequent checkpoints — delta is faster
result = training_client.save_weights_for_sampler_ext(
    "step-0010",
    checkpoint_type="delta",
)

# With TTL (auto-delete after N seconds)
result = training_client.save_weights_for_sampler_ext(
    "temp-checkpoint",
    checkpoint_type="delta",
    ttl_seconds=3600,
)

How delta chaining works

  1. Save a base checkpoint (full weights) — the deployment loads this as its starting point.
  2. Save delta checkpoints — each contains only the diff from the base.
  3. The deployment applies: current_weights = base XOR delta.
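The XOR relationship can be illustrated with a toy sketch on raw bytes. This is illustrative only; the SDK's actual on-disk delta format is internal and not documented here:

```python
# Toy illustration of XOR delta chaining on raw weight bytes (not the SDK's
# actual serialization format). XOR is its own inverse, so applying the delta
# to the base reproduces the new weights exactly.
def make_delta(base: bytes, new: bytes) -> bytes:
    return bytes(b ^ n for b, n in zip(base, new))

def apply_delta(base: bytes, delta: bytes) -> bytes:
    return bytes(b ^ d for b, d in zip(base, delta))

base = bytes([0x10, 0x20, 0x30, 0x40])
new = bytes([0x11, 0x20, 0x33, 0x40])

delta = make_delta(base, new)
assert apply_delta(base, delta) == new
# Unchanged bytes XOR to zero, which is why deltas of slowly-drifting
# weights compress so much better than full snapshots.
```

Note that each delta is taken against the base, not the previous delta, so the deployment only ever needs the base plus the latest delta.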

Promoting a checkpoint to a model

After saving a sampler checkpoint, you can promote it to a deployable Fireworks model using promote_checkpoint:
model = rlor_mgr.promote_checkpoint(
    job_id=endpoint.job_id,
    checkpoint_id=result.snapshot_name,
    output_model_id="my-fine-tuned-qwen3-8b",
)
print(f"Model state: {model['state']}, kind: {model['kind']}")
| Parameter | Type | Description |
| --- | --- | --- |
| `job_id` | `str` | RLOR trainer job ID that produced the checkpoint |
| `checkpoint_id` | `str` | The `snapshot_name` returned by `save_weights_for_sampler_ext` |
| `output_model_id` | `str` | Desired model ID (1-63 chars; lowercase a-z, 0-9, and hyphens only) |
The promoted model appears under your account at accounts/<your-account>/models/<output_model_id> and can be deployed like any other Fireworks model.
output_model_id must be a valid Fireworks resource ID: lowercase letters, digits, and hyphens only, 1-63 characters, cannot start or end with a hyphen.
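Before calling promote_checkpoint, you can sanity-check output_model_id locally. A minimal sketch of the stated rules; the regex and helper are illustrative, not part of the SDK:

```python
import re

# Matches the documented constraints: 1-63 chars, lowercase letters, digits,
# and hyphens, with no leading or trailing hyphen. This is a local
# convenience check, not an SDK call.
MODEL_ID_RE = re.compile(r"^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$")

def is_valid_model_id(model_id: str) -> bool:
    return MODEL_ID_RE.fullmatch(model_id) is not None

assert is_valid_model_id("my-fine-tuned-qwen3-8b")
assert not is_valid_model_id("-leading-hyphen")
assert not is_valid_model_id("Uppercase-Not-Allowed")
```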

Weight sync

Weight sync pushes a checkpoint onto a running inference deployment without restarting it. See DeploymentManager for the direct hotload API and WeightSyncer for the recommended lifecycle manager. A quick example with WeightSyncer:
from fireworks.training.sdk import WeightSyncer

tracker = WeightSyncer(
    policy_client=training_client,
    deploy_mgr=deploy_mgr,
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    hotload_timeout=600,
    first_checkpoint_type="base",
)

# Automatically handles base (first) vs delta (subsequent)
tracker.save_and_hotload(f"step-{step:05d}")
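The first-checkpoint rule that save_and_hotload automates can be pictured with a minimal stand-in. This mirrors the documented behavior (first checkpoint is base, the rest are deltas), not WeightSyncer's actual internals:

```python
# Illustrative stand-in for the base-vs-delta decision: the first checkpoint
# must be a full "base" snapshot; every subsequent one can be a "delta".
class CheckpointTypePicker:
    def __init__(self, first_checkpoint_type: str = "base"):
        self._next_type = first_checkpoint_type

    def next_type(self) -> str:
        ctype = self._next_type
        self._next_type = "delta"  # everything after the first is a delta
        return ctype

picker = CheckpointTypePicker()
# picker.next_type() returns "base" once, then "delta" on every later call
```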

Train-state checkpoints

Save and restore

Use save_state and load_state_with_optimizer to persist and restore full training state (weights + optimizer momentum, learning rate schedules, etc.):
# Save full train state for resume
training_client.save_state("train_state_step_100").result()

# Restore train state
training_client.load_state_with_optimizer("train_state_step_100").result()
save_state accepts an optional ttl_seconds parameter for auto-expiring checkpoints.
For the raw FiretitanTrainingClient, save_state() and load_state_with_optimizer() return futures, so call .result() to wait for completion. The cookbook’s ReconnectableClient wrapper already blocks for you.

When to use train-state checkpoints

  • Recovery from interruptions or failures
  • Adjusting hyperparameters or data mid-run
  • Multi-step training pipelines (e.g. supervised learning followed by reinforcement learning)

Cross-job checkpoint resolution

Load checkpoints from a previous trainer job:
checkpoint_ref = training_client.resolve_checkpoint_path(
    "step-4",
    source_job_id="previous-job-id",
)
training_client.load_state_with_optimizer(checkpoint_ref).result()

List available checkpoints

checkpoint_names = training_client.list_checkpoints()
print(checkpoint_names)  # e.g. ["step-2", "step-4"]

Operational guidance

  • First checkpoint must be base if you’re using delta chaining. The deployment needs full weights as a starting point.
  • Track your checkpoint identities. Delta weight syncs reference a previous_snapshot_identity that must match what the deployment currently has loaded.
  • Keep checkpoint intervals predictable so evaluation comparisons are stable across experiments.
  • Save train state before long experiments so you can resume from the last good state.
Cookbook users: The cookbook’s checkpoint_utils.save_checkpoint and checkpoint_utils.resolve_resume wrap these SDK methods with structured persistence to checkpoints.jsonl. See Cookbook Reference for details.