What this is
There are two types of checkpoints in the Training SDK:
- Sampler checkpoints (save_weights_for_sampler_ext): Exported for deployment hotloading. These are what the inference endpoint loads.
- Train-state checkpoints (save_checkpoint via checkpoint_utils): Full optimizer + weight state persisted to checkpoints.jsonl for resuming training where you left off.
Setup: enable checkpoint_type
The checkpoint_type parameter ("base" or "delta") is available on FiretitanTrainingClient.save_weights_for_sampler_ext(). Use FiretitanServiceClient to create the client:
```python
import tinker
from fireworks.training.sdk import FiretitanServiceClient, WeightSyncer, DeploymentManager
```
Base vs. delta checkpoints
| Type | What it saves | Size | When to use |
|---|---|---|---|
| "base" | Full model weights | Large (~16 GB for an 8B model) | First checkpoint, or when you need a clean snapshot |
| "delta" | XOR diff from the previous base | Small (~10x smaller) | Every subsequent checkpoint after a base |
Delta checkpoints are much faster to save and transfer, making per-step hotloading practical for on-policy training.
How delta chaining works
- Save a base checkpoint (full weights) — the deployment loads this as its starting point.
- Save delta checkpoints — each contains only the diff from the base.
- The deployment applies current_weights = base XOR delta to reconstruct the weights.
```python
# Step 1: the first checkpoint must be base (full weights)
result = training_client.save_weights_for_sampler_ext(
    "step-0001",
    checkpoint_type="base",
)
# result.snapshot_name is session-qualified (e.g. "step-0001-a1b2c3d4")

# Step 2+: subsequent checkpoints can be delta (much smaller)
result = training_client.save_weights_for_sampler_ext(
    "step-0002",
    checkpoint_type="delta",
)
```
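The base-XOR-delta relationship can be illustrated on raw byte buffers. This is a conceptual sketch of the arithmetic only; the SDK's actual delta format operates on weight tensors and also compresses the diff (arc_v2):

```python
# Conceptual sketch of XOR-based delta checkpoints (not the SDK's real format).
# It illustrates the reconstruction rule: current = base XOR delta.

def make_delta(base: bytes, current: bytes) -> bytes:
    """Diff two equal-length weight buffers byte-by-byte."""
    assert len(base) == len(current)
    return bytes(b ^ c for b, c in zip(base, current))

def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Reconstruct the current weights from base + delta."""
    return bytes(b ^ d for b, d in zip(base, delta))

base = bytes([10, 20, 30, 40])     # weights at the base checkpoint
current = bytes([10, 21, 30, 44])  # weights after some optimizer steps
delta = make_delta(base, current)
assert apply_delta(base, delta) == current
# Unchanged bytes XOR to zero, which is why deltas compress so well:
assert delta.count(0) == 2
```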
Saving sampler checkpoints
```python
# Full checkpoint
result = training_client.save_weights_for_sampler_ext(
    "my-checkpoint-name",
    checkpoint_type="base",
)
print(result.snapshot_name)  # session-qualified name for hotloading

# Incremental checkpoint (after a base exists)
delta_result = training_client.save_weights_for_sampler_ext(
    "step-0050",
    checkpoint_type="delta",
)

# With a TTL (auto-delete after N seconds)
result = training_client.save_weights_for_sampler_ext(
    "temp-checkpoint",
    checkpoint_type="delta",
    ttl_seconds=3600,  # auto-expire after 1 hour
)
```
save_weights_for_sampler_ext parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | — | Checkpoint name (auto-suffixed with the session ID) |
| checkpoint_type | str \| None | None | "base" for full weights, "delta" for incremental |
| ttl_seconds | int \| None | None | Auto-delete the checkpoint after this many seconds |
Returns a SaveSamplerResult with path (the snapshot name reported by the trainer) and snapshot_name (the session-qualified name used for hotloading).
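As a rough illustration of the naming scheme, here is a hypothetical stand-in for SaveSamplerResult and the session-suffixing described above; the real class and suffix logic live inside the SDK:

```python
from dataclasses import dataclass

# Illustrative stand-in only: per the docs, the real SaveSamplerResult
# carries these two fields.
@dataclass
class SaveSamplerResultSketch:
    path: str           # snapshot name reported by the trainer
    snapshot_name: str  # session-qualified name used for hotloading

def session_qualify(name: str, session_id: str) -> str:
    # Hypothetical helper: the SDK suffixes the checkpoint name with a
    # session ID so snapshots from earlier runs are never confused.
    return f"{name}-{session_id}"

result = SaveSamplerResultSketch(
    path="step-0001",
    snapshot_name=session_qualify("step-0001", "a1b2c3d4"),
)
assert result.snapshot_name == "step-0001-a1b2c3d4"
```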
Hotloading checkpoints onto a deployment
The SDK provides DeploymentManager for hotload operations and WeightSyncer for managing the full save-then-hotload lifecycle with automatic delta chain tracking.
Using DeploymentManager directly
```python
from fireworks.training.sdk import DeploymentManager

deploy_mgr = DeploymentManager(
    api_key="<FIREWORKS_API_KEY>",
    account_id="<ACCOUNT_ID>",
    base_url="https://api.fireworks.ai",
)

# Hotload and wait for completion
deploy_mgr.hotload_and_wait(
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    snapshot_identity=result.snapshot_name,  # from save_weights_for_sampler_ext
    timeout_seconds=400,
)
```
For delta hotloads, pass incremental_snapshot_metadata:
```python
deploy_mgr.hotload_and_wait(
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    snapshot_identity=delta_result.snapshot_name,
    incremental_snapshot_metadata={
        "previous_snapshot_identity": base_result.snapshot_name,
        "compression_format": "arc_v2",
        "checksum_format": "alder32",
    },
    timeout_seconds=400,
)
```
Using WeightSyncer (recommended)
WeightSyncer manages the entire checkpoint-then-hotload lifecycle, including:
- Automatic base/delta chain state tracking
- Session-scoped snapshot naming (prevents Alluxio cache staleness)
- Automatic warmup after hotload
- Deployment state checking before first hotload
- Separate DCP save for resume checkpoints
```python
from fireworks.training.sdk import WeightSyncer

tracker = WeightSyncer(
    policy_client=training_client,
    deploy_mgr=deploy_mgr,
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    hotload_timeout=600,
    first_checkpoint_type="base",
    warmup_after_hotload=True,
)
```
WeightSyncer fields
| Field | Type | Default | Description |
|---|---|---|---|
| policy_client | FiretitanTrainingClient | — | Training client for save operations |
| deploy_mgr | DeploymentManager \| None | None | Deployment manager for hotload (None = no hotloading) |
| deployment_id | str \| None | None | Target deployment for hotload |
| base_model | str | "" | Model name for hotload API calls |
| hotload_timeout | int | 600 | Timeout in seconds for hotload_and_wait |
| first_checkpoint_type | str | "base" | Type for the first checkpoint ("base" or "delta") |
| compression_format | str | "arc_v2" | Delta compression format |
| warmup_after_hotload | bool | True | Send a warmup request after each successful hotload |
| warmup_max_retries | int | 10 | Max retries for the post-hotload warmup |
WeightSyncer methods
| Method | Description |
|---|---|
| save_and_hotload(name, checkpoint_type=None) | Save sampler weights and hotload to the deployment. Returns snapshot_name or raises on failure. |
| save_only(name, checkpoint_type=None) | Save sampler weights without hotloading. Returns snapshot_name or None. |
| hotload(snapshot_name) | Hotload a previously saved snapshot. Returns True/False. |
| save_dcp(name) | Save a DCP checkpoint only (for resume). No sampler save, no hotload. Returns True/False. |
| check_deployment_state() | Query the deployment's current hotload state. Returns current_snapshot_identity or None. |
| wait_for_hotload_ready(timeout_s, poll_interval_s) | Block until the deployment's hotload manager is initialized. |
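wait_for_hotload_ready is essentially a poll-until-ready loop. A generic sketch of that pattern, using a fake readiness check in place of the real deployment query:

```python
import time

# Generic polling sketch of the kind of loop wait_for_hotload_ready runs:
# poll a readiness predicate until it returns True or the timeout elapses.
def wait_until(ready, timeout_s: float, poll_interval_s: float) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if ready():
            return True
        time.sleep(poll_interval_s)
    return False

# Fake readiness check that succeeds on the third poll.
calls = {"n": 0}
def fake_ready() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_until(fake_ready, timeout_s=1.0, poll_interval_s=0.01)
assert calls["n"] == 3
```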
Pattern: on-policy hotloading (every step)
For on-policy training (e.g. GRPO), hotload after every optimizer step so the sampling policy matches the training policy. WeightSyncer handles the base/delta chain automatically:
```python
for step in range(total_steps):
    # ... training step ...

    # WeightSyncer auto-selects base (first) or delta (subsequent)
    tracker.save_and_hotload(f"step-{step:05d}")

    # Now sample from the updated deployment
    completions = sampler.sample_with_tokens(messages=input_messages, n=4)
```
Pattern: interval hotloading (off-policy)
For off-policy training, hotload every N steps and use importance sampling to correct for the stale policy between hotloads:
```python
hotload_interval = 10

for step in range(total_steps):
    # ... training step ...
    if step % hotload_interval == 0:
        tracker.save_and_hotload(f"step-{step:05d}")
```
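The importance-sampling correction mentioned above reweights tokens sampled from the stale policy by the likelihood ratio between the current and behavior policies. A minimal sketch with made-up log-probs:

```python
import math

# Minimal sketch of importance-sampling correction: a token sampled from
# the stale (behavior) policy gets weight pi_current / pi_behavior,
# computed from per-token log-probs. The numbers here are made up.
def importance_weight(logprob_current: float, logprob_behavior: float) -> float:
    return math.exp(logprob_current - logprob_behavior)

# A token the current policy now prefers is up-weighted...
assert importance_weight(-1.0, -2.0) > 1.0
# ...and one it has moved away from is down-weighted.
assert importance_weight(-2.0, -1.0) < 1.0
```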
Pattern: split save and hotload
When you need to separate save from hotload (e.g. do a deployment warmup in between):
```python
snapshot = tracker.save_only("resume-step-0", checkpoint_type="base")
deploy_mgr.warmup(model)
tracker.hotload(snapshot)
```
Pattern: DCP checkpoints for resume
DCP (Distributed Checkpoint) saves are independent from sampler/hotload saves. Use save_dcp for resume checkpoints at intervals:
```python
for step in range(total_steps):
    # ... training step ...
    tracker.save_and_hotload(f"step-{step:05d}")
    if step % dcp_interval == 0:
        tracker.save_dcp(f"step-{step}")
```
Saving and resuming train state
The cookbook uses checkpoint_utils to persist training state (step number, data position, optimizer + weights) in a checkpoints.jsonl file inside log_path. This replaces raw save_state / load_state_with_optimizer calls.
save_checkpoint
Saves a DCP checkpoint (optimizer + weights) and appends a record to checkpoints.jsonl:
```python
from training.utils.checkpoint_utils import save_checkpoint

save_checkpoint(client, f"step-{step}", log_path, {
    "step": step,
    "data_consumed": data_consumed,
    "source_job_id": job_id,
})
```
| Parameter | Type | Description |
|---|---|---|
| client | training client | The training client to save state from |
| name | str | Checkpoint name |
| log_path | str | Directory where checkpoints.jsonl is written |
| loop_state | dict | Arbitrary metadata (step, data_consumed, etc.) persisted alongside the checkpoint |
| kind | str | "state" (optimizer + weights), "sampler" (serving only), or "both" |
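Conceptually, each call appends one JSON record to checkpoints.jsonl. A self-contained sketch of that bookkeeping (the exact record schema is an assumption; only the fields shown above are used):

```python
import json
import os
import tempfile

# Illustrative sketch of the bookkeeping save_checkpoint performs: append
# one JSON record per checkpoint to checkpoints.jsonl under log_path.
# The record schema here is assumed, not the cookbook's actual schema.
def append_checkpoint_record(log_path: str, name: str, loop_state: dict) -> None:
    record = {"name": name, **loop_state}
    with open(os.path.join(log_path, "checkpoints.jsonl"), "a") as f:
        f.write(json.dumps(record) + "\n")

log_path = tempfile.mkdtemp()
append_checkpoint_record(log_path, "step-100", {"step": 100, "data_consumed": 6400})
append_checkpoint_record(log_path, "step-200", {"step": 200, "data_consumed": 12800})

with open(os.path.join(log_path, "checkpoints.jsonl")) as f:
    records = [json.loads(line) for line in f]
assert [r["name"] for r in records] == ["step-100", "step-200"]
```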
resolve_resume
On startup, resolves the resume state by reading checkpoints.jsonl and loading the last checkpoint:
```python
from training.utils.checkpoint_utils import resolve_resume

resume_info = resolve_resume(client, log_path, init_from_checkpoint=None)
step = resume_info.step if resume_info else 0
data_consumed = resume_info.data_consumed if resume_info else 0
```
Returns None for a fresh start (no checkpoint found). When a checkpoint exists, it loads the DCP weights + optimizer state into the client before returning.
ResumeInfo fields:
| Field | Type | Description |
|---|---|---|
| step | int | Last completed step |
| data_consumed | int | Number of data examples consumed (for dataset position) |
| source_job_id | str \| None | Originating trainer job ID |
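A quick sketch of how data_consumed restores dataset position after a resume; resume_info here is a plain dict standing in for ResumeInfo:

```python
# Restore dataset position from data_consumed: skip the examples the
# previous run already trained on. resume_info is a stand-in dict.
dataset = list(range(10))  # pretend dataset of 10 examples
resume_info = {"step": 2, "data_consumed": 6}

start = resume_info["data_consumed"] if resume_info else 0
remaining = dataset[start:]
assert remaining == [6, 7, 8, 9]
```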
init_from_checkpoint
Use init_from_checkpoint to load pretrained DCP weights while starting fresh on a new dataset (step resets to 0). It supports the cross-job format "job_id:checkpoint_name":
```python
resume_info = resolve_resume(client, log_path, init_from_checkpoint="prev-job-id:step-100")
# resume_info.step == 0 (fresh start with loaded weights)
```
All recipe Config dataclasses expose init_from_checkpoint as a field.
log_path requirement
All recipe Config dataclasses require log_path (no default). This directory stores checkpoints.jsonl and any other run artifacts. Callers must provide it explicitly to prevent checkpoint collisions between runs:
```python
cfg = Config(
    log_path="./my_experiment_logs",
    # ...
)
```
SDK: low-level checkpoint APIs
If you are writing a custom training loop without the cookbook, use these SDK methods directly.
Save and restore train state
```python
# Save full train state (weights + optimizer) for resume
training_client.save_state("train_state_step_100")

# Restore train state for resume
training_client.load_state_with_optimizer("train_state_step_100")
```
save_state also accepts an optional ttl_seconds parameter for auto-expiring checkpoints.
Cross-job checkpoint resolution
```python
checkpoint_ref = training_client.resolve_checkpoint_path(
    "step-4",
    source_job_id="previous-job-id",
)
training_client.load_state_with_optimizer(checkpoint_ref)
```
List available checkpoints
```python
checkpoint_names, _ = training_client.list_checkpoints()
print(checkpoint_names)  # e.g. ["step-2", "step-4"]
```
Cookbook users: The cookbook’s checkpoint_utils.save_checkpoint and checkpoint_utils.resolve_resume wrap these SDK methods with structured persistence to checkpoints.jsonl. Prefer the cookbook helpers when using recipes.
Operational guidance
- First checkpoint must be base if you’re using delta chaining. The deployment needs full weights as a starting point.
- Track your checkpoint identities. Delta hotloads reference a previous_snapshot_identity that must match what the deployment currently has loaded.
- Keep checkpoint intervals predictable so evaluation comparisons are stable across experiments.
- Use dcp_save_interval (cookbook) or save_dcp (SDK) to save DCP checkpoints at regular intervals for resume.
- init_from_checkpoint (cookbook) is useful for warm-starting a new experiment from a previous run's weights without inheriting its dataset position.
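The first two points amount to a small piece of state: remember what the deployment currently has loaded, and derive the next checkpoint type from it. A minimal sketch of that bookkeeping (illustrative only; WeightSyncer already does this for you):

```python
# Sketch of delta-chain bookkeeping: track the snapshot the deployment
# has loaded so each delta hotload can name the correct
# previous_snapshot_identity. WeightSyncer handles this internally.
class ChainTracker:
    def __init__(self) -> None:
        self.loaded = None  # snapshot the deployment currently has

    def next_checkpoint_type(self) -> str:
        # The first checkpoint in a chain must be a full base snapshot.
        return "base" if self.loaded is None else "delta"

    def record_hotload(self, snapshot_name: str) -> None:
        self.loaded = snapshot_name

tracker = ChainTracker()
assert tracker.next_checkpoint_type() == "base"
tracker.record_hotload("step-0001-a1b2c3d4")
# From now on, tracker.loaded is the previous_snapshot_identity to pass.
assert tracker.next_checkpoint_type() == "delta"
```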