Most users don’t need this page. If you’re launching training through a cookbook recipe (rl_loop, sft_loop, etc.), the recipe handles save, resume, and promote for you — set dcp_save_interval and output_model_id on your config and you’re done. See Checkpoints and Resume (cookbook) for the recipe-driven flow.

This page is the SDK-level reference for advanced users who are forking a recipe, calling the SDK directly, or debugging a checkpoint that doesn’t promote.

What this is
During training, you save checkpoints for three purposes:

- Weight sync (save_weights_for_sampler_ext): Push updated weights to a running inference deployment without restarting it.
- Resuming (save_state / load_state_with_optimizer): Persist full training state (weights + optimizer) so you can continue training from where you left off.
- Promotion (promote_checkpoint): Turn a saved sampler checkpoint into a deployable Fireworks model.
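The sections below cover each API in turn; as rough orientation, here is a minimal sketch of how the three fit into one loop. Construction of trainer and client is not shown, and the exact signatures are assumptions based on the method names this page uses.

```python
# Illustrative sketch only: `trainer` and `client` construction is not shown,
# and signatures are assumptions based on the method names on this page.

# 1. Weight sync: push updated weights onto the running deployment.
trainer.save_weights_for_sampler_ext(checkpoint_type="delta")

# 2. Resuming: persist full training state (weights + optimizer) ...
trainer.save_state("my-run/step-1000")
# ... and later continue from where you left off.
trainer.load_state_with_optimizer("my-run/step-1000")

# 3. Promotion: turn a saved sampler checkpoint into a deployable model.
latest = client.list_checkpoints("accounts/my-account/rlorTrainerJobs/my-job")[-1]
client.promote_checkpoint(name=latest.name, output_model_id="my-model-v1")
```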
Sampler checkpoints
Sampler checkpoints are weight-only snapshots used for weight sync and promotion. For promotability rules, see Checkpoint kinds — the cookbook page is the source of truth. The raw SDK exposes two checkpoint_type modes that affect size and weight-sync speed:
| checkpoint_type | What it saves | Size |
|---|---|---|
| "base" | Full model weights | Large (~16 GB for 8B) |
| "delta" | XOR diff from previous base | ~10× smaller |
Delta checkpoints are reconstructed at load time (current_weights = base XOR delta on the deployment). LoRA sampler checkpoints always contain the full adapter regardless of checkpoint_type.
Saving checkpoints
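A minimal sketch, assuming save_weights_for_sampler_ext accepts the checkpoint_type values from the table above (the exact signature is an assumption):

```python
# Sketch; the signature of save_weights_for_sampler_ext is an assumption.
# Save a full base snapshot periodically ...
trainer.save_weights_for_sampler_ext(checkpoint_type="base")
# ... and cheap XOR deltas in between (roughly 10x smaller, faster to sync).
trainer.save_weights_for_sampler_ext(checkpoint_type="delta")
```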
Promoting a checkpoint to a model
Promote a sampler checkpoint to a deployable Fireworks model. Available on both FireworksClient and TrainerJobManager. The trainer job does not need to be running — its row only needs to exist; promotion is a metadata + file-copy operation. See Checkpoint kinds for which checkpoints are promotable.
Preferred: pass the 4-segment name= from list_checkpoints
list_checkpoints returns each checkpoint’s full resource name (accounts/<account>/rlorTrainerJobs/<job>/checkpoints/<id>). Hand that string straight to promote_checkpoint — no manual disassembly into (job_id, checkpoint_id):
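A sketch of that flow (client construction is omitted, and the argument list_checkpoints takes to identify the job is an assumption):

```python
# Sketch: `client` is a FireworksClient built elsewhere; the argument to
# list_checkpoints is assumed to identify the trainer job.
checkpoints = client.list_checkpoints("accounts/my-account/rlorTrainerJobs/my-job")
latest = checkpoints[-1]

# `latest.name` is already the full 4-segment resource name, e.g.
# accounts/my-account/rlorTrainerJobs/my-job/checkpoints/ckpt-0042
client.promote_checkpoint(
    name=latest.name,
    output_model_id="my-rl-model-v1",
    base_model="accounts/fireworks/models/qwen3-8b",
)
```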
| Parameter | Type | Description |
|---|---|---|
| name | str | Full 4-segment checkpoint resource name from list_checkpoints output |
| output_model_id | str | Desired model ID (1-63 chars, lowercase a-z, 0-9, hyphen only). Validate with validate_output_model_id before calling — a rejected ID orphans the staged sampler blob. |
| base_model | str | Base model resource name for metadata inheritance (e.g. accounts/fireworks/models/qwen3-8b) |
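Since a rejected ID orphans the staged sampler blob, it is worth validating first. A minimal sketch, assuming validate_output_model_id raises ValueError on an invalid ID:

```python
# Sketch: assumes validate_output_model_id raises ValueError on a bad ID.
try:
    client.validate_output_model_id("My_Model")  # rejected: uppercase + underscore
except ValueError:
    pass  # fix the ID *before* promote_checkpoint, not after
client.validate_output_model_id("my-model")  # passes: lowercase, hyphen only
```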
Legacy: positional (job_id, checkpoint_id) form
The old positional form still works for callers that haven’t migrated. Calling it emits a DeprecationWarning:
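For example (a sketch; the positional signature is inferred from the parameter names above):

```python
import warnings

# Legacy positional form: still works, but warns on every call.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    client.promote_checkpoint(
        "my-job",     # job_id
        "ckpt-0042",  # checkpoint_id
        output_model_id="my-rl-model-v1",
    )
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```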
The hot_load_deployment_id parameter is also deprecated (the gateway resolves the bucket URL from the trainer’s stored metadata). Passing it emits a DeprecationWarning and is only needed for deployments that predate the stored-bucket-URL migration.
Listing checkpoints on a trainer
Each returned checkpoint row includes name, createTime, updateTime, checkpointType, and promotable.
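A sketch (field access mirrors the names above; the job argument is an assumption):

```python
# Sketch: prints one line per checkpoint row on the trainer.
for ckpt in client.list_checkpoints("accounts/my-account/rlorTrainerJobs/my-job"):
    print(ckpt.name, ckpt.checkpointType, ckpt.promotable)
```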
Weight sync
Weight sync pushes a checkpoint onto a running inference deployment without restarting it. See WeightSyncer for the recommended lifecycle manager.
save_and_hotload saves HF weights to remote storage and weight-syncs them onto the running deployment. The resulting row is visible to list_checkpoints and (for LoRA, or for the first base save on full-param) is promotable=True — the cookbook’s TrainingCheckpoints.promote_latest will pick it up automatically. For full-param runs after the first base, you’ll want an explicit TrainingCheckpoints.save(promotable=True) to produce a promotable blob.
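A sketch of that flow. Only save_and_hotload and TrainingCheckpoints.save(promotable=True) are named on this page; the syncer and checkpoints handles and their construction are assumptions.

```python
# Sketch: `syncer` is a WeightSyncer bound to the running deployment and
# `checkpoints` is the cookbook's TrainingCheckpoints (construction not shown).
syncer.save_and_hotload()  # save HF weights, then weight-sync the deployment

# Full-param runs after the first base save: force a promotable blob.
checkpoints.save(promotable=True)
```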
Train-state checkpoints
Use save_state to persist full training state, and one of two load methods to restore it:
| Method | Weights | Optimizer state |
|---|---|---|
| load_state_with_optimizer(path) | Restored | Restored |
| load_state(path) | Restored | Reset to zero |
save_state accepts an optional ttl_seconds parameter for auto-expiring checkpoints.
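Putting the pieces together (a sketch; the path argument’s format is an assumption):

```python
# Sketch: persist full training state with a 7-day expiry.
trainer.save_state("my-run/step-2000", ttl_seconds=7 * 24 * 3600)

# Resume with optimizer state intact:
trainer.load_state_with_optimizer("my-run/step-2000")
# Or restore weights only (optimizer state reset to zero):
trainer.load_state("my-run/step-2000")
```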
For the raw FiretitanTrainingClient, save_state(), load_state(), and load_state_with_optimizer() return futures — call .result() to block. The cookbook’s ReconnectableClient wrapper blocks for you.
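For example (a sketch; raw_client and reconnectable_client construction is not shown):

```python
# Sketch: the raw FiretitanTrainingClient returns futures ...
future = raw_client.save_state("my-run/step-3000")
future.result()  # block until the save actually completes

# ... while the cookbook's ReconnectableClient blocks for you.
reconnectable_client.save_state("my-run/step-3000")  # returns when done
```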
Cross-job checkpoint resolution
List available checkpoints
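A sketch of resolving a checkpoint from an earlier job. That list_checkpoints accepts another job’s resource name is an assumption, though it is consistent with promotion not requiring a running job:

```python
# Sketch: resolve and promote a checkpoint saved by a *previous* trainer job.
old_job = "accounts/my-account/rlorTrainerJobs/yesterdays-run"
promotable = [c for c in client.list_checkpoints(old_job) if c.promotable]
client.promote_checkpoint(
    name=promotable[-1].name,
    output_model_id="resumed-model-v2",
)
```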
Related guides
- Checkpoints and Resume (cookbook) — recipe-driven save / resume / promote (start here for most users)
- WeightSyncer reference — full weight sync lifecycle
- DeploymentManager reference — direct hotload API