
What this is

There are two types of checkpoints in the Training SDK:
  • Sampler checkpoints (save_weights_for_sampler_ext): Exported for deployment hotloading. These are what the inference endpoint loads.
  • Train-state checkpoints (save_state): Full optimizer + weight state for resuming training where you left off.

Setup: enable checkpoint_type

The checkpoint_type parameter ("base" or "delta") is available on FiretitanTrainingClient.save_weights_for_sampler_ext(). Use FiretitanServiceClient to create the client:
import tinker
from fireworks.training.sdk import FiretitanServiceClient, WeightSyncer, DeploymentManager

Base vs. delta checkpoints

TypeWhat it savesSizeWhen to use
"base"Full model weightsLarge (~16 GB for 8B model)First checkpoint, or when you need a clean snapshot
"delta"XOR diff from previous baseSmall (~10x smaller)Every subsequent checkpoint after a base
Delta checkpoints are much faster to save and transfer, making per-step hotloading practical for on-policy training.

How delta chaining works

  1. Save a base checkpoint (full weights) — the deployment loads this as its starting point.
  2. Save delta checkpoints — each contains only the diff from the base.
  3. The deployment applies: current_weights = base XOR delta.
# Step 1: First checkpoint must be base (full weights)
result = training_client.save_weights_for_sampler_ext(
    "step-0001",
    checkpoint_type="base",
)
# result.snapshot_name is session-qualified (e.g. "step-0001-a1b2c3d4")

# Step 2+: Subsequent checkpoints can be delta (much smaller)
result = training_client.save_weights_for_sampler_ext(
    "step-0002",
    checkpoint_type="delta",
)

Saving sampler checkpoints

# Full checkpoint
result = training_client.save_weights_for_sampler_ext(
    "my-checkpoint-name",
    checkpoint_type="base",
)
print(result.snapshot_name)  # Session-qualified name for hotloading

# Incremental checkpoint (after a base exists)
delta_result = training_client.save_weights_for_sampler_ext(
    "step-0050",
    checkpoint_type="delta",
)

# With TTL (auto-delete after N seconds)
result = training_client.save_weights_for_sampler_ext(
    "temp-checkpoint",
    checkpoint_type="delta",
    ttl_seconds=3600,  # auto-expire after 1 hour
)

save_weights_for_sampler_ext parameters

  • name (str, required): checkpoint name (auto-suffixed with the session ID).
  • checkpoint_type (str | None, default None): "base" for full weights, "delta" for incremental.
  • ttl_seconds (int | None, default None): auto-delete the checkpoint after this many seconds.
Returns a SaveSamplerResult with path (snapshot name from trainer) and snapshot_name (session-qualified name for hotloading).
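The shape of that return value can be sketched as a dataclass (field names taken from this page; the actual SDK class may carry additional fields):

```python
# Hedged sketch of the SaveSamplerResult shape described above.
from dataclasses import dataclass

@dataclass
class SaveSamplerResult:
    path: str           # snapshot name from the trainer
    snapshot_name: str  # session-qualified name used for hotloading

# The session-qualified name extends the trainer-side name with a session ID.
r = SaveSamplerResult(path="step-0001", snapshot_name="step-0001-a1b2c3d4")
assert r.snapshot_name.startswith(r.path)
```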

Hotloading checkpoints onto a deployment

The SDK provides DeploymentManager for hotload operations and WeightSyncer for managing the full save-then-hotload lifecycle with automatic delta chain tracking.

Using DeploymentManager directly

from fireworks.training.sdk import DeploymentManager

deploy_mgr = DeploymentManager(
    api_key="<FIREWORKS_API_KEY>",
    account_id="<ACCOUNT_ID>",
    base_url="https://api.fireworks.ai",
)

# Hotload and wait for completion
deploy_mgr.hotload_and_wait(
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    snapshot_identity=result.snapshot_name,  # From save_weights_for_sampler_ext
    timeout_seconds=400,
)
For delta hotloads, pass incremental_snapshot_metadata:
deploy_mgr.hotload_and_wait(
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    snapshot_identity=delta_result.snapshot_name,
    incremental_snapshot_metadata={
        "previous_snapshot_identity": base_result.snapshot_name,
        "compression_format": "arc_v2",
        "checksum_format": "alder32",
    },
    timeout_seconds=400,
)
WeightSyncer manages the entire checkpoint-then-hotload lifecycle, including:
  • Automatic base/delta chain state tracking
  • Session-scoped snapshot naming (prevents Alluxio cache staleness)
  • Automatic warmup after hotload
  • Deployment state checking before first hotload
  • Separate DCP save for resume checkpoints
from fireworks.training.sdk import WeightSyncer

tracker = WeightSyncer(
    policy_client=training_client,
    deploy_mgr=deploy_mgr,
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    hotload_timeout=600,
    first_checkpoint_type="base",
    warmup_after_hotload=True,
)

WeightSyncer fields

  • policy_client (FiretitanTrainingClient, required): training client for save operations.
  • deploy_mgr (DeploymentManager | None, default None): deployment manager for hotload (None = no hotloading).
  • deployment_id (str | None, default None): target deployment for hotload.
  • base_model (str, default ""): model name for hotload API calls.
  • hotload_timeout (int, default 600): timeout in seconds for hotload_and_wait.
  • first_checkpoint_type (str, default "base"): type for the first checkpoint ("base" or "delta").
  • compression_format (str, default "arc_v2"): delta compression format.
  • warmup_after_hotload (bool, default True): send a warmup request after each successful hotload.
  • warmup_max_retries (int, default 10): max retries for post-hotload warmup.

WeightSyncer methods

  • save_and_hotload(name, checkpoint_type=None): save sampler weights and hotload to the deployment. Returns snapshot_name or raises on failure.
  • save_only(name, checkpoint_type=None): save sampler weights without hotloading. Returns snapshot_name or None.
  • hotload(snapshot_name): hotload a previously saved snapshot. Returns True/False.
  • save_dcp(name): save a DCP checkpoint only (for resume). No sampler export, no hotload. Returns True/False.
  • check_deployment_state(): query the deployment’s current hotload state. Returns current_snapshot_identity or None.
  • wait_for_hotload_ready(timeout_s, poll_interval_s): block until the deployment’s hotload manager is initialized.

Pattern: on-policy hotloading (every step)

For on-policy training (e.g. GRPO), hotload after every optimizer step so the sampling policy matches the training policy. WeightSyncer handles the base/delta chain automatically:
for step in range(total_steps):
    # ... training step ...

    # WeightSyncer auto-selects base (first) or delta (subsequent)
    tracker.save_and_hotload(f"step-{step:05d}")

    # Now sample from the updated deployment
    completions = sampler.sample_with_tokens(messages=input_messages, n=4)
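The "auto-selects base or delta" behavior can be pictured as a small state machine. This is an illustrative sketch of the assumed logic, not the SDK's source:

```python
# Sketch: first save uses first_checkpoint_type; every later save is a delta
# chained off the last recorded base.

class ChainTracker:
    def __init__(self, first_checkpoint_type: str = "base"):
        self.first_checkpoint_type = first_checkpoint_type
        self.last_base = None  # snapshot name of the most recent base

    def next_type(self) -> str:
        return self.first_checkpoint_type if self.last_base is None else "delta"

    def record(self, snapshot_name: str, checkpoint_type: str) -> None:
        if checkpoint_type == "base":
            self.last_base = snapshot_name

tracker = ChainTracker()
types = []
for step in range(3):
    t = tracker.next_type()
    tracker.record(f"step-{step:05d}", t)
    types.append(t)

assert types == ["base", "delta", "delta"]
```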

Pattern: interval hotloading (off-policy)

For off-policy training, hotload every N steps and use importance sampling to correct for the stale policy between hotloads:
hotload_interval = 10

for step in range(total_steps):
    # ... training step ...

    if step % hotload_interval == 0:
        tracker.save_and_hotload(f"step-{step:05d}")
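The importance-sampling correction mentioned above can be sketched in a few lines. This is a generic illustration (log-probabilities and values are made up), not a prescription for any particular RL objective:

```python
# Sketch: reweight samples drawn from the stale deployed policy by the ratio
# pi_current / pi_behavior, computed in log space for numerical stability.
import math

logprobs_behavior = [-1.2, -0.8, -2.0]  # from the stale (deployed) policy
logprobs_current = [-1.0, -0.9, -1.5]   # from the policy being trained

ratios = [math.exp(c - b) for c, b in zip(logprobs_current, logprobs_behavior)]
```

Samples the current policy now favors get up-weighted (ratio > 1), samples it has moved away from get down-weighted, so gradients computed on stale completions remain approximately unbiased between hotloads.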

Pattern: split save and hotload

When you need to separate save from hotload (e.g. do a deployment warmup in between):
snapshot = tracker.save_only("resume-step-0", checkpoint_type="base")
deploy_mgr.warmup("accounts/fireworks/models/qwen3-8b")  # warm the deployment before swapping weights
tracker.hotload(snapshot)

Pattern: DCP checkpoints for resume

DCP (Distributed Checkpoint) saves are independent from sampler/hotload saves. Use save_dcp for resume checkpoints at intervals:
for step in range(total_steps):
    # ... training step ...

    tracker.save_and_hotload(f"step-{step:05d}")

    if step % dcp_interval == 0:
        tracker.save_dcp(f"step-{step:05d}")

Saving and restoring train state

Train-state checkpoints save both model weights and optimizer state, enabling you to resume training exactly where you stopped:
# Save full state
training_client.save_state("train_state_step_100").result()

# Later: restore state (weights + optimizer momentum, etc.)
training_client.load_state_with_optimizer("train_state_step_100").result()
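The reason optimizer state must be saved alongside weights can be shown with a toy example (plain Python, not the SDK): resuming with the saved momentum reproduces the uninterrupted run exactly, while resuming from weights alone would not.

```python
# Toy SGD-with-momentum run, checkpointed mid-training and resumed.

def sgd_momentum_step(w, v, grad, lr=0.1, mu=0.9):
    v = mu * v + grad
    return w - lr * v, v

# Run 1: train 4 steps, checkpointing full state after step 2.
w, v = 1.0, 0.0
grads = [0.5, 0.5, 0.5, 0.5]
for i, g in enumerate(grads):
    w, v = sgd_momentum_step(w, v, g)
    if i == 1:
        state = {"weights": w, "optimizer": {"momentum": v}}  # like save_state
final_full = w

# Run 2: restore both weights and momentum, replay the remaining steps.
w, v = state["weights"], state["optimizer"]["momentum"]  # like load_state_with_optimizer
for g in grads[2:]:
    w, v = sgd_momentum_step(w, v, g)

assert w == final_full  # resumed run matches the uninterrupted run
```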

Operational guidance

  • First checkpoint must be base if you’re using delta chaining. The deployment needs full weights as a starting point.
  • Track your checkpoint identities. Delta hotloads reference a previous_snapshot_identity that must match what the deployment currently has loaded.
  • Keep checkpoint intervals predictable so evaluation comparisons are stable across experiments.
  • Save train state before long experiments so you can resume from the last good state.
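The second bullet is the invariant most worth encoding in your own training loop. A minimal sketch of the check (assumed logic built on check_deployment_state's return value, not an SDK feature):

```python
# Sketch: a delta hotload is only valid if its previous_snapshot_identity
# matches what the deployment currently has loaded.

def can_apply_delta(deployment_current: str, previous_snapshot_identity: str) -> bool:
    """Return True if the delta's declared predecessor matches the deployment."""
    return deployment_current == previous_snapshot_identity

assert can_apply_delta("step-0001-a1b2c3d4", "step-0001-a1b2c3d4")
assert not can_apply_delta("step-0001-a1b2c3d4", "step-0002-e5f6a7b8")
```

If the check fails (e.g. after a deployment restart), fall back to saving and hotloading a fresh base checkpoint to restart the chain.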