

Most users don’t need this page. If you’re launching training through a cookbook recipe (rl_loop, sft_loop, etc.), the recipe handles save, resume, and promote for you: set dcp_save_interval and output_model_id on your config and you’re done. See Checkpoints and Resume (cookbook) for the recipe-driven flow.

This page is the SDK-level reference for advanced users who are forking a recipe, calling the SDK directly, or debugging a checkpoint that doesn’t promote.

What this is

During training, you save checkpoints for three purposes:
  1. Weight sync (save_weights_for_sampler_ext): Push updated weights to a running inference deployment without restarting it.
  2. Resuming (save_state / load_state_with_optimizer): Persist full training state (weights + optimizer) so you can continue training from where you left off.
  3. Promotion (promote_checkpoint): Turn a saved sampler checkpoint into a deployable Fireworks model.

Sampler checkpoints

Sampler checkpoints are weight-only snapshots used for weight sync and promotion. For promotability rules, see Checkpoint kinds — the cookbook page is the source of truth. The raw SDK exposes two checkpoint_type modes that affect size and weight-sync speed:
| checkpoint_type | What it saves | Size |
| --- | --- | --- |
| "base" | Full model weights | Large (~16 GB for an 8B model) |
| "delta" | XOR diff from the previous base | ~10× smaller |
Delta is much faster for per-step weight sync (current_weights = base XOR delta on the deployment). LoRA sampler checkpoints always contain the full adapter regardless of checkpoint_type.
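The base XOR delta reconstruction can be illustrated on raw byte buffers. This is a toy sketch of why deltas are small when most bytes are unchanged between steps, not the SDK's actual serialization format:

```python
# Toy illustration of XOR-delta reconstruction. The real checkpoint
# format is internal to Fireworks; this only shows the arithmetic the
# deployment performs (current_weights = base XOR delta).

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

base = bytes([0x10, 0x20, 0x30, 0x40])     # weights at the last base save
current = bytes([0x10, 0x21, 0x30, 0x40])  # weights after a training step

delta = xor_bytes(base, current)           # mostly zero bytes -> compresses well
reconstructed = xor_bytes(base, delta)     # deployment side: base XOR delta

assert reconstructed == current
```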
On full-parameter training, save_weights_for_sampler_ext(checkpoint_type="delta") produces a blob that cannot be promoted — only "base" can. Use WeightSyncer (below) for the safe base-then-delta pattern, or the cookbook’s TrainingCheckpoints.save(promotable=True) which always saves base.

Saving checkpoints

# First checkpoint — must be base (full weights)
result = training_client.save_weights_for_sampler_ext(
    "step-0001",
    checkpoint_type="base",
)
# result.snapshot_name is session-qualified (e.g. "step-0001-a1b2c3d4")

# Subsequent checkpoints — delta is faster
result = training_client.save_weights_for_sampler_ext(
    "step-0010",
    checkpoint_type="delta",
)

# With TTL (auto-delete after N seconds)
result = training_client.save_weights_for_sampler_ext(
    "temp-checkpoint",
    checkpoint_type="delta",
    ttl_seconds=3600,
)

Promoting a checkpoint to a model

Promote a sampler checkpoint to a deployable Fireworks model. Available on both FireworksClient and TrainerJobManager. The trainer job does not need to be running — its row only needs to exist; promotion is a metadata + file-copy operation. See Checkpoint kinds for which checkpoints are promotable.

Preferred: pass the 4-segment name= from list_checkpoints

list_checkpoints returns each checkpoint’s full resource name (accounts/<account>/rlorTrainerJobs/<job>/checkpoints/<id>). Hand that string straight to promote_checkpoint — no manual disassembly into (job_id, checkpoint_id):
from fireworks.training.sdk import FireworksClient

client = FireworksClient(api_key=api_key)

# Pick a row from the trainer's checkpoints — usually newest promotable.
rows = client.list_checkpoints(job_id)
target = next(r for r in rows if r.get("promotable"))

model = client.promote_checkpoint(
    name=target["name"],                          # 4-segment resource path
    output_model_id="my-fine-tuned-qwen3-8b",
    base_model="accounts/fireworks/models/qwen3-8b",
)
| Parameter | Type | Description |
| --- | --- | --- |
| name | str | Full 4-segment checkpoint resource name from list_checkpoints output |
| output_model_id | str | Desired model ID (1-63 chars, lowercase a-z, 0-9, hyphen only). Validate with validate_output_model_id before calling; a rejected ID orphans the staged sampler blob. |
| base_model | str | Base model resource name for metadata inheritance (e.g. accounts/fireworks/models/qwen3-8b) |
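The SDK's validate_output_model_id is the authoritative check; the stated rules (1-63 chars, lowercase a-z, 0-9, hyphen only) can be sketched as a local pre-flight guard. The helper below is hypothetical, not an SDK function:

```python
import re

# Local pre-flight check mirroring the documented ID rules
# (1-63 chars, lowercase a-z, 0-9, hyphen only). The SDK's own
# validate_output_model_id remains the authoritative check.
_MODEL_ID_RE = re.compile(r"^[a-z0-9-]{1,63}$")

def looks_like_valid_model_id(model_id: str) -> bool:
    return bool(_MODEL_ID_RE.fullmatch(model_id))

assert looks_like_valid_model_id("my-fine-tuned-qwen3-8b")
assert not looks_like_valid_model_id("My_Model")  # uppercase / underscore rejected
assert not looks_like_valid_model_id("")          # below 1-char minimum
```

Running this check before promote_checkpoint avoids the orphaned-blob failure mode described above.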

Legacy: positional (job_id, checkpoint_id) form

The old positional form still works for callers that haven’t migrated. Calling it emits a DeprecationWarning:
model = client.promote_checkpoint(
    job_id=endpoint.job_id,
    checkpoint_id=result.snapshot_name,
    output_model_id="my-fine-tuned-qwen3-8b",
    base_model="accounts/fireworks/models/qwen3-8b",
)
# DeprecationWarning: promote_checkpoint(job_id, checkpoint_id, ...)
# positional form is deprecated. Pass the 4-segment resource name instead.
The hot_load_deployment_id parameter is also deprecated (the gateway resolves the bucket URL from the trainer’s stored metadata). Passing it emits a DeprecationWarning and is only needed for deployments that predate the stored-bucket-URL migration.

Listing checkpoints on a trainer

curl "https://api.fireworks.ai/v1/accounts/<account-id>/rlorTrainerJobs/<job-id>/checkpoints?pageSize=200" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY"
Each entry includes name, createTime, updateTime, checkpointType, and promotable.

Weight sync

Weight sync pushes a checkpoint onto a running inference deployment without restarting it. See WeightSyncer for the recommended lifecycle manager.
from fireworks.training.sdk import WeightSyncer

syncer = WeightSyncer(
    policy_client=training_client,
    deploy_mgr=deploy_mgr,
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    hotload_timeout=600,
    first_checkpoint_type="base",
)

# Automatically handles base (first) vs delta (subsequent)
syncer.save_and_hotload(f"step-{step:05d}")
save_and_hotload saves HF weights to remote storage and weight-syncs them onto the running deployment. The resulting row is visible to list_checkpoints and (for LoRA, or for the first base save on full-param) is promotable=True — the cookbook’s TrainingCheckpoints.promote_latest will pick it up automatically. For full-param runs after the first base, you’ll want an explicit TrainingCheckpoints.save(promotable=True) to produce a promotable blob.
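The base-vs-delta switching that WeightSyncer automates comes down to a small piece of state: the first save uses the configured first_checkpoint_type, and every later save uses "delta". A simplified sketch of that decision logic (not WeightSyncer's implementation):

```python
# Simplified sketch of the first-base-then-delta pattern WeightSyncer
# automates. Not the real implementation -- only the decision logic.

class CheckpointTypePicker:
    def __init__(self, first_checkpoint_type: str = "base"):
        self._first = first_checkpoint_type
        self._saved_once = False

    def next_type(self) -> str:
        if not self._saved_once:
            self._saved_once = True
            return self._first  # first save: full weights
        return "delta"          # subsequent saves: XOR diff

picker = CheckpointTypePicker()
assert picker.next_type() == "base"
assert picker.next_type() == "delta"
assert picker.next_type() == "delta"
```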

Train-state checkpoints

Use save_state to persist full training state, and one of two load methods to restore it:
| Method | Weights | Optimizer state |
| --- | --- | --- |
| load_state_with_optimizer(path) | Restored | Restored |
| load_state(path) | Restored | Reset to zero |
# Save full train state for resume
training_client.save_state("train_state_step_100").result()

# Resume training (weights + optimizer restored)
training_client.load_state_with_optimizer("train_state_step_100").result()
save_state accepts an optional ttl_seconds parameter for auto-expiring checkpoints.
For the raw FiretitanTrainingClient, save_state(), load_state(), and load_state_with_optimizer() return futures — call .result() to block. The cookbook’s ReconnectableClient wrapper blocks for you.
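The future-returning shape of these methods follows the standard concurrent.futures pattern. A generic illustration using stdlib futures (not the SDK's classes), showing why the .result() call blocks:

```python
# Generic illustration of the futures pattern: .result() blocks until
# the background work completes, mirroring how the raw client's
# save_state(...).result() call behaves.
from concurrent.futures import ThreadPoolExecutor
import time

def fake_save_state(name: str) -> str:
    time.sleep(0.05)  # stand-in for the remote save
    return name

with ThreadPoolExecutor() as pool:
    future = pool.submit(fake_save_state, "train_state_step_100")
    result = future.result()  # blocks until the save finishes

assert result == "train_state_step_100"
```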

Cross-job checkpoint resolution

# Resolve a checkpoint saved by a different (earlier) job...
checkpoint_ref = training_client.resolve_checkpoint_path(
    "step-4",
    source_job_id="previous-job-id",
)
# ...then resume this job from it, optimizer state included
training_client.load_state_with_optimizer(checkpoint_ref).result()

List available checkpoints

checkpoint_names = training_client.list_checkpoints()
print(checkpoint_names)  # e.g. ["step-2", "step-4"]