
What this is

There are two types of checkpoints in the Training SDK:
  • Sampler checkpoints (save_weights_for_sampler_ext): Exported for deployment hotloading. These are what the inference endpoint loads.
  • Train-state checkpoints (save_checkpoint via checkpoint_utils): Full optimizer + weight state persisted to checkpoints.jsonl for resuming training where you left off.

Setup: enable checkpoint_type

The checkpoint_type parameter ("base" or "delta") is available on FiretitanTrainingClient.save_weights_for_sampler_ext(). Use FiretitanServiceClient to create the client:
import tinker
from fireworks.training.sdk import FiretitanServiceClient, WeightSyncer, DeploymentManager

Base vs. delta checkpoints

| Type | What it saves | Size | When to use |
|---|---|---|---|
| `"base"` | Full model weights | Large (~16 GB for an 8B model) | First checkpoint, or when you need a clean snapshot |
| `"delta"` | XOR diff from the previous base | Small (~10x smaller) | Every subsequent checkpoint after a base |

Delta checkpoints are much faster to save and transfer, making per-step hotloading practical for on-policy training.
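
The XOR relationship can be shown with a toy byte buffer. This is an illustration only; the real `arc_v2` format adds compression and checksumming on top:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Byte-wise XOR of two equal-length weight buffers
    return bytes(x ^ y for x, y in zip(a, b))

base = bytes([1, 2, 3, 4, 5, 6])     # stand-in for full weights
updated = bytes([1, 2, 3, 9, 5, 6])  # one "weight" changed after a step
delta = xor_bytes(base, updated)     # mostly zero bytes -> compresses well

# Applying the delta to the base reproduces the updated weights
assert xor_bytes(base, delta) == updated
```

Because only changed positions are nonzero, a delta taken shortly after a base compresses far better than a full snapshot.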

How delta chaining works

  1. Save a base checkpoint (full weights) — the deployment loads this as its starting point.
  2. Save delta checkpoints — each contains only the diff from the base.
  3. The deployment applies: current_weights = base XOR delta.
# Step 1: First checkpoint must be base (full weights)
result = training_client.save_weights_for_sampler_ext(
    "step-0001",
    checkpoint_type="base",
)
# result.snapshot_name is session-qualified (e.g. "step-0001-a1b2c3d4")

# Step 2+: Subsequent checkpoints can be delta (much smaller)
result = training_client.save_weights_for_sampler_ext(
    "step-0002",
    checkpoint_type="delta",
)

Saving sampler checkpoints

# Full checkpoint
result = training_client.save_weights_for_sampler_ext(
    "my-checkpoint-name",
    checkpoint_type="base",
)
print(result.snapshot_name)  # Session-qualified name for hotloading

# Incremental checkpoint (after a base exists)
delta_result = training_client.save_weights_for_sampler_ext(
    "step-0050",
    checkpoint_type="delta",
)

# With TTL (auto-delete after N seconds)
result = training_client.save_weights_for_sampler_ext(
    "temp-checkpoint",
    checkpoint_type="delta",
    ttl_seconds=3600,  # auto-expire after 1 hour
)

save_weights_for_sampler_ext parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | required | Checkpoint name (auto-suffixed with session ID) |
| `checkpoint_type` | `str \| None` | `None` | `"base"` for full weights, `"delta"` for incremental |
| `ttl_seconds` | `int \| None` | `None` | Auto-delete the checkpoint after this many seconds |

Returns a SaveSamplerResult with path (snapshot name from trainer) and snapshot_name (session-qualified name for hotloading).
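
Snapshot names in the examples above look like `step-0001-a1b2c3d4`. The session-suffixing idea can be sketched as follows; the SDK performs the suffixing itself, and the helper and scheme here are assumptions for illustration:

```python
import uuid

def session_qualified(name: str, session_id: str) -> str:
    # Hypothetical: mimics the shape of "step-0001-a1b2c3d4".
    # The SDK's actual suffixing scheme may differ.
    return f"{name}-{session_id}"

session_id = uuid.uuid4().hex[:8]
snapshot_name = session_qualified("step-0001", session_id)
assert snapshot_name.startswith("step-0001-")
```

Session-unique names are what prevent a deployment's cache from serving a stale snapshot under a reused name.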

Hotloading checkpoints onto a deployment

The SDK provides DeploymentManager for hotload operations and WeightSyncer for managing the full save-then-hotload lifecycle with automatic delta chain tracking.

Using DeploymentManager directly

from fireworks.training.sdk import DeploymentManager

deploy_mgr = DeploymentManager(
    api_key="<FIREWORKS_API_KEY>",
    account_id="<ACCOUNT_ID>",
    base_url="https://api.fireworks.ai",
)

# Hotload and wait for completion
deploy_mgr.hotload_and_wait(
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    snapshot_identity=result.snapshot_name,  # From save_weights_for_sampler_ext
    timeout_seconds=400,
)
For delta hotloads, pass incremental_snapshot_metadata:
deploy_mgr.hotload_and_wait(
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    snapshot_identity=delta_result.snapshot_name,
    incremental_snapshot_metadata={
        "previous_snapshot_identity": base_result.snapshot_name,
        "compression_format": "arc_v2",
        "checksum_format": "alder32",
    },
    timeout_seconds=400,
)

Using WeightSyncer

WeightSyncer manages the entire checkpoint-then-hotload lifecycle, including:
  • Automatic base/delta chain state tracking
  • Session-scoped snapshot naming (prevents Alluxio cache staleness)
  • Automatic warmup after hotload
  • Deployment state checking before first hotload
  • Separate DCP save for resume checkpoints
from fireworks.training.sdk import WeightSyncer

tracker = WeightSyncer(
    policy_client=training_client,
    deploy_mgr=deploy_mgr,
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    hotload_timeout=600,
    first_checkpoint_type="base",
    warmup_after_hotload=True,
)

WeightSyncer fields

| Field | Type | Default | Description |
|---|---|---|---|
| `policy_client` | `FiretitanTrainingClient` | required | Training client for save operations |
| `deploy_mgr` | `DeploymentManager \| None` | `None` | Deployment manager for hotload (`None` = no hotloading) |
| `deployment_id` | `str \| None` | `None` | Target deployment for hotload |
| `base_model` | `str` | `""` | Model name for hotload API calls |
| `hotload_timeout` | `int` | `600` | Timeout in seconds for `hotload_and_wait` |
| `first_checkpoint_type` | `str` | `"base"` | Type of the first checkpoint (`"base"` or `"delta"`) |
| `compression_format` | `str` | `"arc_v2"` | Delta compression format |
| `warmup_after_hotload` | `bool` | `True` | Send a warmup request after each successful hotload |
| `warmup_max_retries` | `int` | `10` | Max retries for the post-hotload warmup |

WeightSyncer methods

| Method | Description |
|---|---|
| `save_and_hotload(name, checkpoint_type=None)` | Save sampler weights and hotload to the deployment. Returns `snapshot_name` or raises on failure. |
| `save_only(name, checkpoint_type=None)` | Save sampler weights without hotloading. Returns `snapshot_name` or `None`. |
| `hotload(snapshot_name)` | Hotload a previously saved snapshot. Returns `True`/`False`. |
| `save_dcp(name)` | Save a DCP checkpoint only (for resume). No sampler export, no hotload. Returns `True`/`False`. |
| `check_deployment_state()` | Query the deployment's current hotload state. Returns `current_snapshot_identity` or `None`. |
| `wait_for_hotload_ready(timeout_s, poll_interval_s)` | Block until the deployment's hotload manager is initialized. |
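
The chain bookkeeping that `save_and_hotload` automates can be modeled in a few lines. This is a toy illustration of the selection logic, not the SDK's implementation:

```python
class ChainState:
    """Toy model of base/delta bookkeeping: the first checkpoint uses
    first_checkpoint_type; once a base exists, subsequent saves are deltas."""

    def __init__(self, first_checkpoint_type: str = "base"):
        self.last_base = None
        self.first_type = first_checkpoint_type

    def next_checkpoint_type(self) -> str:
        # No base recorded yet -> use the configured first type
        return self.first_type if self.last_base is None else "delta"

    def record(self, snapshot_name: str, checkpoint_type: str) -> None:
        # Remember the base so later deltas can reference it
        if checkpoint_type == "base":
            self.last_base = snapshot_name

state = ChainState()
assert state.next_checkpoint_type() == "base"
state.record("step-0001-a1b2c3d4", "base")
assert state.next_checkpoint_type() == "delta"
```

In the real SDK, the recorded base also supplies `previous_snapshot_identity` for each delta hotload's metadata.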

Pattern: on-policy hotloading (every step)

For on-policy training (e.g. GRPO), hotload after every optimizer step so the sampling policy matches the training policy. WeightSyncer handles the base/delta chain automatically:
for step in range(total_steps):
    # ... training step ...

    # WeightSyncer auto-selects base (first) or delta (subsequent)
    tracker.save_and_hotload(f"step-{step:05d}")

    # Now sample from the updated deployment
    completions = sampler.sample_with_tokens(messages=input_messages, n=4)

Pattern: interval hotloading (off-policy)

For off-policy training, hotload every N steps and use importance sampling to correct for the stale policy between hotloads:
hotload_interval = 10

for step in range(total_steps):
    # ... training step ...

    if step % hotload_interval == 0:
        tracker.save_and_hotload(f"step-{step:05d}")
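
The importance-sampling correction mentioned above reweights each sample by the probability ratio between the current policy and the stale behavior policy that generated it. A minimal sketch from per-token log-probs (the function name is illustrative, not part of the SDK):

```python
import math

def importance_weight(logprob_current: float, logprob_behavior: float) -> float:
    # Ratio pi_current(a|s) / pi_behavior(a|s), computed stably from log-probs
    return math.exp(logprob_current - logprob_behavior)

# A sample that is now more likely under the updated policy gets weight > 1
w = importance_weight(-1.0, -1.2)
assert w > 1.0
```

In practice this ratio is usually clipped (as in PPO-style objectives) to bound the variance introduced by stale samples.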

Pattern: split save and hotload

When you need to separate save from hotload (e.g. do a deployment warmup in between):
model = "accounts/fireworks/models/qwen3-8b"  # model served by the deployment

snapshot = tracker.save_only("resume-step-0", checkpoint_type="base")
deploy_mgr.warmup(model)
tracker.hotload(snapshot)

Pattern: DCP checkpoints for resume

DCP (Distributed Checkpoint) saves are independent from sampler/hotload saves. Use save_dcp for resume checkpoints at intervals:
for step in range(total_steps):
    # ... training step ...

    tracker.save_and_hotload(f"step-{step:05d}")

    if step % dcp_interval == 0:
        tracker.save_dcp(f"step-{step}")

Saving and resuming train state

The cookbook uses checkpoint_utils to persist training state (step number, data position, optimizer + weights) in a checkpoints.jsonl file inside log_path. This replaces raw save_state / load_state_with_optimizer calls.

save_checkpoint

Saves a DCP checkpoint (optimizer + weights) and appends a record to checkpoints.jsonl:
from training.utils.checkpoint_utils import save_checkpoint

save_checkpoint(client, f"step-{step}", log_path, {
    "step": step,
    "data_consumed": data_consumed,
    "source_job_id": job_id,
})
| Parameter | Type | Description |
|---|---|---|
| `client` | training client | The training client to save state from |
| `name` | `str` | Checkpoint name |
| `log_path` | `str` | Directory where `checkpoints.jsonl` is written |
| `loop_state` | `dict` | Arbitrary metadata (step, data_consumed, etc.) persisted alongside the checkpoint |
| `kind` | `str` | `"state"` (optimizer + weights), `"sampler"` (serving only), or `"both"` |
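
The append-only mechanics of `checkpoints.jsonl` can be sketched as follows. The record field names here are an assumption for illustration, not the cookbook's exact schema:

```python
import json
import os
import tempfile

# Hypothetical record shape; the cookbook's actual schema may differ
log_path = tempfile.mkdtemp()
record = {
    "name": "step-100",
    "kind": "state",
    "loop_state": {"step": 100, "data_consumed": 6400, "source_job_id": "job-abc"},
}

# Each save appends one JSON line; the file is never rewritten
jsonl_path = os.path.join(log_path, "checkpoints.jsonl")
with open(jsonl_path, "a") as f:
    f.write(json.dumps(record) + "\n")

# On resume, the last line identifies the latest checkpoint
with open(jsonl_path) as f:
    latest = json.loads(f.readlines()[-1])
assert latest["loop_state"]["step"] == 100
```

Append-only JSON Lines keeps the full checkpoint history while making "find the latest" a one-line read.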

resolve_resume

On startup, resolves the resume state by reading checkpoints.jsonl and loading the last checkpoint:
from training.utils.checkpoint_utils import resolve_resume

resume_info = resolve_resume(client, log_path, init_from_checkpoint=None)
step = resume_info.step if resume_info else 0
data_consumed = resume_info.data_consumed if resume_info else 0
Returns None for a fresh start (no checkpoint found). When a checkpoint exists, it loads the DCP weights + optimizer state into the client before returning. ResumeInfo fields:
| Field | Type | Description |
|---|---|---|
| `step` | `int` | Last completed step |
| `data_consumed` | `int` | Number of data examples consumed (for dataset position) |
| `source_job_id` | `str \| None` | Originating trainer job ID |
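
One way `data_consumed` can be used to reposition a dataset iterator on resume (a sketch; the cookbook's dataset handling may differ):

```python
from itertools import islice

def resume_position(dataset, data_consumed: int):
    # Skip the examples the previous run already consumed so the
    # resumed run continues from the same dataset position
    return islice(iter(dataset), data_consumed, None)

dataset = [f"example-{i}" for i in range(10)]
remaining = list(resume_position(dataset, data_consumed=6))
assert remaining == ["example-6", "example-7", "example-8", "example-9"]
```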

init_from_checkpoint

Load pretrained DCP weights on a fresh dataset (step resets to 0). Supports cross-job format "job_id:checkpoint_name":
resume_info = resolve_resume(client, log_path, init_from_checkpoint="prev-job-id:step-100")
# resume_info.step == 0 (fresh start with loaded weights)
All recipe Config dataclasses expose init_from_checkpoint as a field.

log_path requirement

All recipe Config dataclasses require log_path (no default). This directory stores checkpoints.jsonl and any other run artifacts. Callers must provide it explicitly to prevent checkpoint collisions between runs:
cfg = Config(
    log_path="./my_experiment_logs",
    # ...
)
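
One convention for picking a collision-free `log_path` (a convention, not an SDK requirement) is a timestamped per-run directory:

```python
import os
import tempfile
import time

# A timestamped per-run directory keeps checkpoints.jsonl files from
# colliding across experiments (rooted in a temp dir for this sketch)
run_id = time.strftime("%Y%m%d-%H%M%S")
log_path = os.path.join(tempfile.mkdtemp(), run_id)
os.makedirs(log_path, exist_ok=True)
```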

SDK: low-level checkpoint APIs

If you are writing a custom training loop without the cookbook, use these SDK methods directly.

Save and restore train state

# Save full train state (weights + optimizer) for resume
training_client.save_state("train_state_step_100")

# Restore train state for resume
training_client.load_state_with_optimizer("train_state_step_100")
save_state also accepts an optional ttl_seconds parameter for auto-expiring checkpoints.

Cross-job checkpoint resolution

checkpoint_ref = training_client.resolve_checkpoint_path(
    "step-4",
    source_job_id="previous-job-id",
)
training_client.load_state_with_optimizer(checkpoint_ref)

List available checkpoints

checkpoint_names, _ = training_client.list_checkpoints()
print(checkpoint_names)  # e.g. ["step-2", "step-4"]
Cookbook users: The cookbook’s checkpoint_utils.save_checkpoint and checkpoint_utils.resolve_resume wrap these SDK methods with structured persistence to checkpoints.jsonl. Prefer the cookbook helpers when using recipes.

Operational guidance

  • First checkpoint must be base if you’re using delta chaining. The deployment needs full weights as a starting point.
  • Track your checkpoint identities. Delta hotloads reference a previous_snapshot_identity that must match what the deployment currently has loaded.
  • Keep checkpoint intervals predictable so evaluation comparisons are stable across experiments.
  • Use dcp_save_interval (cookbook) or save_dcp (SDK) to save DCP checkpoints at regular intervals for resume.
  • init_from_checkpoint (cookbook) is useful for warm-starting a new experiment from a previous run’s weights without inheriting its dataset position.
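
The identity-tracking bullet above reduces to a simple invariant: a delta only applies if the deployment currently holds the snapshot the delta diffs against. A sketch of that check (function and metadata shape are illustrative):

```python
def can_apply_delta(deployment_snapshot: str, delta_metadata: dict) -> bool:
    # A delta hotload is valid only when the deployment's loaded snapshot
    # matches the previous_snapshot_identity the delta was built against
    return delta_metadata["previous_snapshot_identity"] == deployment_snapshot

meta = {"previous_snapshot_identity": "step-0001-a1b2c3d4"}
assert can_apply_delta("step-0001-a1b2c3d4", meta)
assert not can_apply_delta("step-0002-a1b2c3d4", meta)
```

`WeightSyncer.check_deployment_state()` is how you recover the left side of this comparison after a restart.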