What this is
During training, you save checkpoints for three purposes:
- Weight sync (save_weights_for_sampler_ext): Push updated weights to a running inference deployment without restarting it.
- Resuming (save_state / load_state_with_optimizer): Persist full training state (weights + optimizer) so you can continue training from where you left off.
- Promotion (promote_checkpoint): Turn a saved sampler checkpoint into a deployable Fireworks model.
Sampler checkpoints
Sampler checkpoints are weight-only snapshots used for weight sync and promotion. For promotability rules, see Checkpoint kinds — the cookbook page is the source of truth.
The raw SDK exposes two checkpoint_type modes that affect size and weight-sync speed:
| checkpoint_type | What it saves | Size |
|---|---|---|
| "base" | Full model weights | Large (~16 GB for 8B) |
| "delta" | XOR diff from previous base | ~10× smaller |
Delta is much faster for per-step weight sync (current_weights = base XOR delta on the deployment). LoRA sampler checkpoints always contain the full adapter regardless of checkpoint_type.
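To make the XOR relationship concrete, here is a minimal sketch on raw bytes. It is illustrative only: make_delta and apply_delta are hypothetical helper names, and the real checkpoint layout and any compression applied to the delta are not described on this page.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Element-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_delta(base: bytes, current: bytes) -> bytes:
    """Hypothetical helper: delta = base XOR current."""
    return xor_bytes(base, current)


def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Hypothetical helper: current = base XOR delta."""
    return xor_bytes(base, delta)


base = bytes([0x10, 0x20, 0x30, 0x40])
current = bytes([0x10, 0x21, 0x30, 0x40])  # only the second byte changed
delta = make_delta(base, current)          # unchanged bytes XOR to zero
restored = apply_delta(base, delta)        # XOR is its own inverse
```

Because unchanged bytes XOR to zero, a delta between two nearby training steps is mostly zeros, which is one plausible reason it can be stored and transferred far more cheaply than a full base snapshot.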
On full-parameter training, a checkpoint saved with save_weights_for_sampler_ext(checkpoint_type="delta") cannot be promoted; only "base" checkpoints can. Use WeightSyncer (below) for the safe base-then-delta pattern, or the cookbook’s save_checkpoint(kind=SAMPLER|BOTH), which always saves base.
Saving checkpoints
# First checkpoint: must be base (full weights)
result = training_client.save_weights_for_sampler_ext(
    "step-0001",
    checkpoint_type="base",
)
# result.snapshot_name is session-qualified (e.g. "step-0001-a1b2c3d4")

# Subsequent checkpoints: delta is faster
result = training_client.save_weights_for_sampler_ext(
    "step-0010",
    checkpoint_type="delta",
)

# With TTL (auto-delete after N seconds)
result = training_client.save_weights_for_sampler_ext(
    "temp-checkpoint",
    checkpoint_type="delta",
    ttl_seconds=3600,
)
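The base-first rule shown in these calls can be wrapped in a small helper. This is a minimal sketch, assuming a client object that exposes save_weights_for_sampler_ext with the arguments used above; CheckpointSaver and the RecordingClient stub are hypothetical names for illustration and are not part of the SDK.

```python
class CheckpointSaver:
    """Saves the first sampler checkpoint as base and later ones as delta."""

    def __init__(self, client):
        self._client = client
        self._has_base = False

    def save(self, name: str):
        checkpoint_type = "delta" if self._has_base else "base"
        result = self._client.save_weights_for_sampler_ext(
            name, checkpoint_type=checkpoint_type
        )
        # Flip the flag only after the save call returned without raising.
        self._has_base = True
        return result


class RecordingClient:
    """Hypothetical stand-in for the training client; it only records calls."""

    def __init__(self):
        self.calls = []

    def save_weights_for_sampler_ext(self, name, checkpoint_type):
        self.calls.append((name, checkpoint_type))
        return name


stub = RecordingClient()
saver = CheckpointSaver(stub)
for step in (1, 10, 20):
    saver.save(f"step-{step:04d}")
```

With the stub, the recorded calls are one "base" save followed by "delta" saves. In real training code, WeightSyncer (covered in the Weight sync section) is the documented way to get this behavior.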
Promoting a checkpoint
Promote a sampler checkpoint to a deployable Fireworks model with promote_checkpoint, available on both FireworksClient and TrainerJobManager. The trainer job does not need to be running; job_id is used only to locate the stored checkpoint files. See Checkpoint kinds for which checkpoints are promotable.
from fireworks.training.sdk import FireworksClient

# result is the return value of save_weights_for_sampler_ext;
# api_key and endpoint come from your own setup.
client = FireworksClient(api_key=api_key)
model = client.promote_checkpoint(
    job_id=endpoint.job_id,
    checkpoint_id=result.snapshot_name,
    output_model_id="my-fine-tuned-qwen3-8b",
    base_model="accounts/fireworks/models/qwen3-8b",
)
| Parameter | Type | Description |
|---|---|---|
| job_id | str | RLOR trainer job ID that produced the checkpoint |
| checkpoint_id | str | The snapshot_name from save_weights_for_sampler_ext |
| output_model_id | str | Desired model ID (1-63 chars, lowercase a-z, 0-9, hyphen only). Validate with validate_output_model_id before calling; a rejected ID orphans the staged sampler blob. |
| base_model | str | Base model resource name for metadata inheritance (e.g. accounts/fireworks/models/qwen3-8b) |
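The naming rule for output_model_id can also be checked locally before any network call. This is a minimal sketch that encodes only the rule stated in the table (1-63 characters, lowercase a-z, digits, hyphen). is_valid_output_model_id is a hypothetical helper, not the SDK's validate_output_model_id, which may enforce additional constraints.

```python
import re

# Only the rule from the table: 1-63 chars drawn from a-z, 0-9, and hyphen.
_MODEL_ID_RE = re.compile(r"[a-z0-9-]{1,63}")


def is_valid_output_model_id(model_id: str) -> bool:
    """Return True if model_id satisfies the documented character and length rule."""
    return _MODEL_ID_RE.fullmatch(model_id) is not None


candidate = "my-fine-tuned-qwen3-8b"
candidate_ok = is_valid_output_model_id(candidate)
```

Running a check like this before calling promote_checkpoint avoids the orphaned-blob case described in the table for IDs that break the basic rule.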
Listing checkpoints on a trainer
curl "https://api.fireworks.ai/v1/accounts/<account-id>/rlorTrainerJobs/<job-id>/checkpoints?pageSize=200" \
-H "Authorization: Bearer $FIREWORKS_API_KEY"
Each entry includes name, createTime, updateTime, checkpointType, and promotable.
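Once the response is parsed, picking out promotable checkpoints is a simple filter. This is a minimal sketch, assuming each entry is a JSON object with the fields listed above and that createTime values are timestamps in one consistent format, so that string order matches time order. The sample entries are made up, and how the entries are wrapped in the response body is not shown here, so adapt the extraction to the actual payload.

```python
import json


def promotable_names(entries):
    """Return names of promotable entries, newest first by createTime."""
    keep = [e for e in entries if e.get("promotable")]
    # Assumption: createTime strings share one format, so they sort chronologically.
    keep.sort(key=lambda e: e.get("createTime", ""), reverse=True)
    return [e["name"] for e in keep]


# Made-up sample entries for illustration only.
sample = json.loads("""
[
  {"name": "step-0001-a1b2c3d4", "createTime": "2025-01-01T00:00:00Z", "promotable": true},
  {"name": "step-0010-a1b2c3d4", "createTime": "2025-01-01T01:00:00Z", "promotable": false},
  {"name": "step-0020-a1b2c3d4", "createTime": "2025-01-01T02:00:00Z", "promotable": true}
]
""")
names = promotable_names(sample)
```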
Weight sync
Weight sync pushes a checkpoint onto a running inference deployment without restarting it. See WeightSyncer for the recommended lifecycle manager.
from fireworks.training.sdk import WeightSyncer

syncer = WeightSyncer(
    policy_client=training_client,
    deploy_mgr=deploy_mgr,
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    hotload_timeout=600,
    first_checkpoint_type="base",
)

# Automatically handles base (first) vs delta (subsequent)
syncer.save_and_hotload(f"step-{step:05d}")
save_and_hotload saves HF weights to remote storage and hotloads them, but does not write to the cookbook’s checkpoints.jsonl. To create a promotable checkpoint tracked in checkpoints.jsonl, use the cookbook’s save_checkpoint with kind=SAMPLER or kind=BOTH.
Train-state checkpoints
Use save_state to persist full training state, and one of two load methods to restore it:
| Method | Weights | Optimizer state |
|---|---|---|
| load_state_with_optimizer(path) | Restored | Restored |
| load_state(path) | Restored | Reset to zero |
# Save full train state for resume
training_client.save_state("train_state_step_100").result()
# Resume training (weights + optimizer restored)
training_client.load_state_with_optimizer("train_state_step_100").result()
save_state accepts an optional ttl_seconds parameter for auto-expiring checkpoints.
For the raw FiretitanTrainingClient, save_state(), load_state(), and load_state_with_optimizer() return futures — call .result() to block. The cookbook’s ReconnectableClient wrapper blocks for you.
Cross-job checkpoint resolution
checkpoint_ref = training_client.resolve_checkpoint_path(
    "step-4",
    source_job_id="previous-job-id",
)
training_client.load_state_with_optimizer(checkpoint_ref).result()
List available checkpoints
checkpoint_names = training_client.list_checkpoints()
print(checkpoint_names) # e.g. ["step-2", "step-4"]
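When resuming, you often want the most recent checkpoint from that list. This is a minimal sketch, assuming names follow the step-N pattern shown in the example output. latest_step_checkpoint is a hypothetical helper, and names that do not match the pattern are ignored.

```python
import re


def latest_step_checkpoint(names):
    """Return the name with the highest step number, or None if none match."""
    best_name, best_step = None, -1
    for name in names:
        match = re.fullmatch(r"step-(\d+)", name)
        if match and int(match.group(1)) > best_step:
            best_name, best_step = name, int(match.group(1))
    return best_name


latest = latest_step_checkpoint(["step-2", "step-10", "step-4", "final"])
```

Comparing the parsed integers rather than the raw strings matters here: a plain string sort would rank "step-4" above "step-10".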