What this is
There are two types of checkpoints in the Training SDK:
- Sampler checkpoints (save_weights_for_sampler_ext): Exported for deployment hotloading. These are what the inference endpoint loads.
- Train-state checkpoints (save_checkpoint via checkpoint_utils): Full optimizer + weight state persisted to checkpoints.jsonl for resuming training where you left off.
Setup: enable checkpoint_type
The checkpoint_type parameter ("base" or "delta") is available on FiretitanTrainingClient.save_weights_for_sampler_ext(). Use FiretitanServiceClient to create the client:
```python
import tinker
from fireworks.training.sdk import FiretitanServiceClient, WeightSyncer, DeploymentManager
```
Base vs. delta checkpoints
| Type | What it saves | Size | When to use |
|---|---|---|---|
| "base" | Full model weights | Large (~16 GB for an 8B model) | First checkpoint, or when you need a clean snapshot |
| "delta" | XOR diff from the previous base | Small (~10x smaller) | Every subsequent checkpoint after a base |
Delta checkpoints are much faster to save and transfer, making per-step hotloading practical for on-policy training.
How delta chaining works
- Save a base checkpoint (full weights) — the deployment loads this as its starting point.
- Save delta checkpoints — each contains only the diff from the base.
- The deployment applies current_weights = base XOR delta to reconstruct the weights.
```python
# Step 1: the first checkpoint must be base (full weights)
result = training_client.save_weights_for_sampler_ext(
    "step-0001",
    checkpoint_type="base",
)
# result.snapshot_name is session-qualified (e.g. "step-0001-a1b2c3d4")

# Step 2+: subsequent checkpoints can be delta (much smaller)
result = training_client.save_weights_for_sampler_ext(
    "step-0002",
    checkpoint_type="delta",
)
```
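The base-XOR-delta relationship can be illustrated on raw byte buffers. This is a conceptual sketch of the arithmetic only; the SDK's actual delta format operates on weight tensors and also compresses the diff (arc_v2):

```python
# Conceptual sketch of XOR-based delta checkpoints (not the SDK's real format).
# It illustrates the reconstruction rule: current = base XOR delta.

def make_delta(base: bytes, current: bytes) -> bytes:
    """Diff two equal-length weight buffers byte-by-byte."""
    assert len(base) == len(current)
    return bytes(b ^ c for b, c in zip(base, current))

def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Reconstruct the current weights from base + delta."""
    return bytes(b ^ d for b, d in zip(base, delta))

base = bytes([10, 20, 30, 40])     # weights at the base checkpoint
current = bytes([10, 21, 30, 44])  # weights after some optimizer steps
delta = make_delta(base, current)
assert apply_delta(base, delta) == current
# Unchanged bytes XOR to zero, which is why deltas compress so well:
assert delta.count(0) == 2
```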
Saving sampler checkpoints
```python
# Full checkpoint
result = training_client.save_weights_for_sampler_ext(
    "my-checkpoint-name",
    checkpoint_type="base",
)
print(result.snapshot_name)  # session-qualified name for hotloading

# Incremental checkpoint (after a base exists)
delta_result = training_client.save_weights_for_sampler_ext(
    "step-0050",
    checkpoint_type="delta",
)

# With a TTL (auto-delete after N seconds)
result = training_client.save_weights_for_sampler_ext(
    "temp-checkpoint",
    checkpoint_type="delta",
    ttl_seconds=3600,  # auto-expire after 1 hour
)
```
save_weights_for_sampler_ext parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | — | Checkpoint name (auto-suffixed with the session ID) |
| checkpoint_type | str \| None | None | "base" for full weights, "delta" for incremental |
| ttl_seconds | int \| None | None | Auto-delete the checkpoint after this many seconds |
Returns a SaveSamplerResult with path (the snapshot name reported by the trainer) and snapshot_name (the session-qualified name used for hotloading).
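As a rough illustration of the naming scheme, here is a hypothetical stand-in for SaveSamplerResult and the session-suffixing described above; the real class and suffix logic live inside the SDK:

```python
from dataclasses import dataclass

# Illustrative stand-in only: per the docs, the real SaveSamplerResult
# carries these two fields.
@dataclass
class SaveSamplerResultSketch:
    path: str           # snapshot name reported by the trainer
    snapshot_name: str  # session-qualified name used for hotloading

def session_qualify(name: str, session_id: str) -> str:
    # Hypothetical helper: the SDK suffixes the checkpoint name with a
    # session ID so snapshots from earlier runs are never confused.
    return f"{name}-{session_id}"

result = SaveSamplerResultSketch(
    path="step-0001",
    snapshot_name=session_qualify("step-0001", "a1b2c3d4"),
)
assert result.snapshot_name == "step-0001-a1b2c3d4"
```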
Hotloading checkpoints onto a deployment
The SDK provides DeploymentManager for hotload operations and WeightSyncer for managing the full save-then-hotload lifecycle with automatic delta chain tracking.
Using DeploymentManager directly
```python
from fireworks.training.sdk import DeploymentManager

deploy_mgr = DeploymentManager(
    api_key="<FIREWORKS_API_KEY>",
    account_id="<ACCOUNT_ID>",
    base_url="https://api.fireworks.ai",
)

# Hotload and wait for completion
deploy_mgr.hotload_and_wait(
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    snapshot_identity=result.snapshot_name,  # from save_weights_for_sampler_ext
    timeout_seconds=400,
)
```
For delta hotloads, pass incremental_snapshot_metadata:
```python
deploy_mgr.hotload_and_wait(
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    snapshot_identity=delta_result.snapshot_name,
    incremental_snapshot_metadata={
        "previous_snapshot_identity": base_result.snapshot_name,
        "compression_format": "arc_v2",
        "checksum_format": "alder32",
    },
    timeout_seconds=400,
)
```
Using WeightSyncer (recommended)
WeightSyncer manages the entire checkpoint-then-hotload lifecycle, including:
- Automatic base/delta chain state tracking
- Session-scoped snapshot naming (prevents Alluxio cache staleness)
- Automatic warmup after hotload
- Deployment state checking before first hotload
- Separate DCP save for resume checkpoints
```python
from fireworks.training.sdk import WeightSyncer

tracker = WeightSyncer(
    policy_client=training_client,
    deploy_mgr=deploy_mgr,
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    hotload_timeout=600,
    first_checkpoint_type="base",
    warmup_after_hotload=True,
)
```
WeightSyncer fields
| Field | Type | Default | Description |
|---|---|---|---|
| policy_client | FiretitanTrainingClient | — | Training client for save operations |
| deploy_mgr | DeploymentManager \| None | None | Deployment manager for hotload (None = no hotloading) |
| deployment_id | str \| None | None | Target deployment for hotload |
| base_model | str | "" | Model name for hotload API calls |
| hotload_timeout | int | 600 | Timeout in seconds for hotload_and_wait |
| first_checkpoint_type | str | "base" | Type for the first checkpoint ("base" or "delta") |
| compression_format | str | "arc_v2" | Delta compression format |
| warmup_after_hotload | bool | True | Send a warmup request after each successful hotload |
| warmup_max_retries | int | 10 | Max retries for the post-hotload warmup |
WeightSyncer methods
| Method | Description |
|---|---|
| save_and_hotload(name, checkpoint_type=None) | Save sampler weights and hotload to the deployment. Returns snapshot_name or raises on failure. |
| save_only(name, checkpoint_type=None) | Save sampler weights without hotloading. Returns snapshot_name or None. |
| hotload(snapshot_name) | Hotload a previously saved snapshot. Returns True/False. |
| save_dcp(name) | Save a DCP checkpoint only (for resume). No sampler save, no hotload. Returns True/False. |
| check_deployment_state() | Query the deployment's current hotload state. Returns current_snapshot_identity or None. |
| wait_for_hotload_ready(timeout_s, poll_interval_s) | Block until the deployment's hotload manager is initialized. |
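wait_for_hotload_ready is essentially a poll-until-ready loop. A generic sketch of that pattern, using a fake readiness check in place of the real deployment query:

```python
import time

# Generic polling sketch of the kind of loop wait_for_hotload_ready runs:
# poll a readiness predicate until it returns True or the timeout elapses.
def wait_until(ready, timeout_s: float, poll_interval_s: float) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if ready():
            return True
        time.sleep(poll_interval_s)
    return False

# Fake readiness check that succeeds on the third poll.
calls = {"n": 0}
def fake_ready() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_until(fake_ready, timeout_s=1.0, poll_interval_s=0.01)
assert calls["n"] == 3
```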
Pattern: on-policy hotloading (every step)
For on-policy training (e.g. GRPO), hotload after every optimizer step so the sampling policy matches the training policy. WeightSyncer handles the base/delta chain automatically:
```python
for step in range(total_steps):
    # ... training step ...

    # WeightSyncer auto-selects base (first) or delta (subsequent)
    tracker.save_and_hotload(f"step-{step:05d}")

    # Now sample from the updated deployment
    completions = sampler.sample_with_tokens(messages=input_messages, n=4)
```
Pattern: interval hotloading (off-policy)
For off-policy training, hotload every N steps and use importance sampling to correct for the stale policy between hotloads:
```python
hotload_interval = 10

for step in range(total_steps):
    # ... training step ...
    if step % hotload_interval == 0:
        tracker.save_and_hotload(f"step-{step:05d}")
```
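The importance-sampling correction mentioned above reweights tokens sampled from the stale policy by the likelihood ratio between the current and behavior policies. A minimal sketch with made-up log-probs:

```python
import math

# Minimal sketch of importance-sampling correction: a token sampled from
# the stale (behavior) policy gets weight pi_current / pi_behavior,
# computed from per-token log-probs. The numbers here are made up.
def importance_weight(logprob_current: float, logprob_behavior: float) -> float:
    return math.exp(logprob_current - logprob_behavior)

# A token the current policy now prefers is up-weighted...
assert importance_weight(-1.0, -2.0) > 1.0
# ...and one it has moved away from is down-weighted.
assert importance_weight(-2.0, -1.0) < 1.0
```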
Pattern: split save and hotload
When you need to separate save from hotload (e.g. do a deployment warmup in between):
```python
snapshot = tracker.save_only("resume-step-0", checkpoint_type="base")
deploy_mgr.warmup(model)
tracker.hotload(snapshot)
```
Pattern: DCP checkpoints for resume
DCP (Distributed Checkpoint) saves are independent from sampler/hotload saves. Use save_dcp for resume checkpoints at intervals:
```python
for step in range(total_steps):
    # ... training step ...
    tracker.save_and_hotload(f"step-{step:05d}")
    if step % dcp_interval == 0:
        tracker.save_dcp(f"step-{step}")
```
Saving and resuming train state
The cookbook uses checkpoint_utils to persist training state (step number, data position, optimizer + weights) in a checkpoints.jsonl file inside log_path. This replaces raw save_state / load_state_with_optimizer calls.
save_checkpoint
Saves a DCP checkpoint (optimizer + weights) and appends a record to checkpoints.jsonl:
```python
from training.utils.checkpoint_utils import save_checkpoint

save_checkpoint(client, f"step-{step}", log_path, {
    "step": step,
    "data_consumed": data_consumed,
    "source_job_id": job_id,
})
```
| Parameter | Type | Description |
|---|---|---|
| client | training client | The training client to save state from |
| name | str | Checkpoint name |
| log_path | str | Directory where checkpoints.jsonl is written |
| loop_state | dict | Arbitrary metadata (step, data_consumed, etc.) persisted alongside the checkpoint |
| kind | str | "state" (optimizer + weights), "sampler" (serving only), or "both" |
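Conceptually, each call appends one JSON record to checkpoints.jsonl. A self-contained sketch of that bookkeeping (the exact record schema is an assumption; only the fields shown above are used):

```python
import json
import os
import tempfile

# Illustrative sketch of the bookkeeping save_checkpoint performs: append
# one JSON record per checkpoint to checkpoints.jsonl under log_path.
# The record schema here is assumed, not the cookbook's actual schema.
def append_checkpoint_record(log_path: str, name: str, loop_state: dict) -> None:
    record = {"name": name, **loop_state}
    with open(os.path.join(log_path, "checkpoints.jsonl"), "a") as f:
        f.write(json.dumps(record) + "\n")

log_path = tempfile.mkdtemp()
append_checkpoint_record(log_path, "step-100", {"step": 100, "data_consumed": 6400})
append_checkpoint_record(log_path, "step-200", {"step": 200, "data_consumed": 12800})

with open(os.path.join(log_path, "checkpoints.jsonl")) as f:
    records = [json.loads(line) for line in f]
assert [r["name"] for r in records] == ["step-100", "step-200"]
```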
resolve_resume
On startup, resolves the resume state by reading checkpoints.jsonl and loading the last checkpoint:
```python
from training.utils.checkpoint_utils import resolve_resume

resume_info = resolve_resume(client, log_path, init_from_checkpoint=None)
step = resume_info.step if resume_info else 0
data_consumed = resume_info.data_consumed if resume_info else 0
```
Returns None for a fresh start (no checkpoint found). When a checkpoint exists, it loads the DCP weights + optimizer state into the client before returning.
ResumeInfo fields:
| Field | Type | Description |
|---|---|---|
| step | int | Last completed step |
| data_consumed | int | Number of data examples consumed (for dataset position) |
| source_job_id | str \| None | Originating trainer job ID |
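A quick sketch of how data_consumed restores dataset position after a resume; resume_info here is a plain dict standing in for ResumeInfo:

```python
# Restore dataset position from data_consumed: skip the examples the
# previous run already trained on. resume_info is a stand-in dict.
dataset = list(range(10))  # pretend dataset of 10 examples
resume_info = {"step": 2, "data_consumed": 6}

start = resume_info["data_consumed"] if resume_info else 0
remaining = dataset[start:]
assert remaining == [6, 7, 8, 9]
```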
init_from_checkpoint
Use init_from_checkpoint to load pretrained DCP weights while starting fresh on a new dataset (step resets to 0). It supports the cross-job format "job_id:checkpoint_name":
```python
resume_info = resolve_resume(client, log_path, init_from_checkpoint="prev-job-id:step-100")
# resume_info.step == 0 (fresh start with loaded weights)
```
All recipe Config dataclasses expose init_from_checkpoint as a field.
log_path requirement
All recipe Config dataclasses require log_path (no default). This directory stores checkpoints.jsonl and any other run artifacts. Callers must provide it explicitly to prevent checkpoint collisions between runs:
```python
cfg = Config(
    log_path="./my_experiment_logs",
    # ...
)
```
SDK: low-level checkpoint APIs
If you are writing a custom training loop without the cookbook, use these SDK methods directly.
Save and restore train state
```python
# Save full train state (weights + optimizer) for resume
training_client.save_state("train_state_step_100")

# Restore train state for resume
training_client.load_state_with_optimizer("train_state_step_100")
```
save_state also accepts an optional ttl_seconds parameter for auto-expiring checkpoints.
Cross-job checkpoint resolution
```python
checkpoint_ref = training_client.resolve_checkpoint_path(
    "step-4",
    source_job_id="previous-job-id",
)
training_client.load_state_with_optimizer(checkpoint_ref)
```
List available checkpoints
```python
checkpoint_names, _ = training_client.list_checkpoints()
print(checkpoint_names)  # e.g. ["step-2", "step-4"]
```
Cookbook users: The cookbook’s checkpoint_utils.save_checkpoint and checkpoint_utils.resolve_resume wrap these SDK methods with structured persistence to checkpoints.jsonl. Prefer the cookbook helpers when using recipes.
Operational guidance
- First checkpoint must be base if you’re using delta chaining. The deployment needs full weights as a starting point.
- Track your checkpoint identities. Delta hotloads reference a previous_snapshot_identity that must match what the deployment currently has loaded.
- Keep checkpoint intervals predictable so evaluation comparisons are stable across experiments.
- Use dcp_save_interval (cookbook) or save_dcp (SDK) to save DCP checkpoints at regular intervals for resume.
- init_from_checkpoint (cookbook) is useful for warm-starting a new experiment from a previous run's weights without inheriting its dataset position.
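The first two points amount to a small piece of state: remember what the deployment currently has loaded, and derive the next checkpoint type from it. A minimal sketch of that bookkeeping (illustrative only; WeightSyncer already does this for you):

```python
# Sketch of delta-chain bookkeeping: track the snapshot the deployment
# has loaded so each delta hotload can name the correct
# previous_snapshot_identity. WeightSyncer handles this internally.
class ChainTracker:
    def __init__(self) -> None:
        self.loaded = None  # snapshot the deployment currently has

    def next_checkpoint_type(self) -> str:
        # The first checkpoint in a chain must be a full base snapshot.
        return "base" if self.loaded is None else "delta"

    def record_hotload(self, snapshot_name: str) -> None:
        self.loaded = snapshot_name

tracker = ChainTracker()
assert tracker.next_checkpoint_type() == "base"
tracker.record_hotload("step-0001-a1b2c3d4")
# From now on, tracker.loaded is the previous_snapshot_identity to pass.
assert tracker.next_checkpoint_type() == "delta"
```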