## Quickstart

Set `dcp_save_interval` and `log_path`, then rerun with the same `log_path` to resume:

```python
from training.recipes.sft_loop import Config, main
from training.utils import InfraConfig

config = Config(
    log_path="./my_training",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="data.jsonl",
    tokenizer_model="Qwen/Qwen3-8B",
    dcp_save_interval=10,  # save every 10 steps
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ),
)
main(config)

# If interrupted, just run again with the same config.
# It finds the last checkpoint in log_path and resumes automatically.
main(config)
```
## Checkpoint kinds

Every cookbook checkpoint uses a `CheckpointKind`. This section is the single source of truth for checkpoint promotability — other pages link here.

| Kind | What is saved | Resumable | Promotable |
|---|---|---|---|
| `STATE` | DCP (optimizer + weights) | Yes | No |
| `SAMPLER` | HF weights for inference | No | Yes |
| `BOTH` | DCP + HF weights | Yes | Yes |

- Mid-training saves (`dcp_save_interval`) use `STATE`.
- The final checkpoint always uses `BOTH`.
- To promote a mid-training checkpoint, call `save_checkpoint` explicitly with `kind=SAMPLER` or `kind=BOTH`.
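The table above boils down to two predicates. A minimal sketch, assuming an enum whose names mirror the cookbook's `CheckpointKind` (the real class lives in the cookbook; this is only an illustration of the semantics):

```python
from enum import Enum, auto

class CheckpointKind(Enum):
    # Illustrative stand-in for the cookbook's CheckpointKind.
    STATE = auto()    # DCP only: optimizer + weights
    SAMPLER = auto()  # HF weights only, for inference
    BOTH = auto()     # DCP + HF weights

def is_resumable(kind: CheckpointKind) -> bool:
    # Resuming requires DCP state.
    return kind in (CheckpointKind.STATE, CheckpointKind.BOTH)

def is_promotable(kind: CheckpointKind) -> bool:
    # Promotion requires an HF sampler blob.
    return kind in (CheckpointKind.SAMPLER, CheckpointKind.BOTH)
```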
## Three things called “type”

Three separate layers of the stack each have their own “type”, and confusing them is the usual reason a promotion fails. They are not synonyms:

| Layer | Where | Values | What it controls |
|---|---|---|---|
| Cookbook | `save_checkpoint(kind=...)` | `STATE`, `SAMPLER`, `BOTH` | Which of DCP / sampler blob (or both) gets saved |
| SDK | `save_weights_for_sampler_ext(checkpoint_type=...)` | `"base"`, `"delta"` | Whether the sampler blob is full weights or an XOR diff over the previous base (LoRA ignores this — the full adapter is always saved) |
| Server | inferred from GCS contents | `INFERENCE_BASE`, `INFERENCE_LORA`, `INFERENCE_ARC_V2` | Promotability — the first two promote, the third (delta) is rejected |

When the cookbook saves a `SAMPLER` or `BOTH` checkpoint, it always calls the SDK with `checkpoint_type="base"`, which the server detects as `INFERENCE_BASE` (full-param) or `INFERENCE_LORA` (LoRA). Both are promotable. The non-promotable `INFERENCE_ARC_V2` only happens if you bypass the cookbook and call `save_weights_for_sampler_ext("delta")` on a full-parameter run.
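The mapping between the SDK flag and the server-side classification can be summarized as a small decision function. This is a sketch of the rules stated above, not the server's actual code (which infers the type from GCS contents, not from flags):

```python
def infer_server_type(is_lora: bool, checkpoint_type: str) -> str:
    """Sketch of how the server classifies a sampler blob.

    is_lora: whether the run trains a LoRA adapter.
    checkpoint_type: the SDK-level "base" or "delta" flag.
    """
    if is_lora:
        # LoRA ignores "delta": the full adapter is always stored.
        return "INFERENCE_LORA"
    if checkpoint_type == "base":
        return "INFERENCE_BASE"
    # Full-param "delta" stores an XOR diff, which the server rejects.
    return "INFERENCE_ARC_V2"

def is_promotable(server_type: str) -> bool:
    return server_type in ("INFERENCE_BASE", "INFERENCE_LORA")
```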
“Promotable” means the server will accept the blob. “Callable” means you also have the metadata needed to invoke `promote_checkpoint` — `snapshot_name`, `source_job_id`, and `base_model`. Only the cookbook writes those fields to `checkpoints.jsonl` automatically; every other path leaves you to capture them yourself.

| How it was saved | LoRA promotable | Full-param promotable | Metadata in checkpoints.jsonl? |
|---|---|---|---|
| `save_checkpoint(kind=STATE)` | No (DCP only) | No (DCP only) | `state_path` only |
| `save_checkpoint(kind=SAMPLER\|BOTH)` | Yes | Yes | Yes — `sampler_path` + `source_job_id` + `base_model` |
| `save_weights_for_sampler_ext(checkpoint_type="base")` | Yes | Yes | No — capture from `SaveSamplerResult.snapshot_name` + `client.job_id` |
| `save_weights_for_sampler_ext(checkpoint_type="delta")` | Yes (server always stores full adapter) | No | No |
| `WeightSyncer.save_and_hotload()` — first save | Yes | Yes | No |
| `WeightSyncer.save_and_hotload()` — later saves | Yes | No | No |

If you saved via the raw SDK or `WeightSyncer`, the blob may be promotable, but `promote_checkpoint.py --checkpoints-jsonl ...` won’t work out of the box — there’s no jsonl row. You must either (a) hand-build a jsonl entry with `{"name", "sampler_path", "source_job_id", "base_model"}` and pass it to the script, or (b) call `FireworksClient.promote_checkpoint(job_id, checkpoint_id, output_model_id, base_model)` directly. Prefer the cookbook’s `save_checkpoint(kind=SAMPLER|BOTH)` when you can.
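Option (a) — hand-building a jsonl row — can be sketched as below. The field names follow the `checkpoints.jsonl` format; the values for `snapshot_name` and `job_id` are the ones you captured from `SaveSamplerResult.snapshot_name` and `client.job_id` (the path in the commented-out append is an assumption matching the Quickstart example):

```python
import json

def build_jsonl_entry(name: str, snapshot_name: str,
                      job_id: str, base_model: str) -> str:
    """Build a checkpoints.jsonl row for a blob saved via the raw SDK."""
    entry = {
        "name": name,
        "sampler_path": snapshot_name,     # from SaveSamplerResult.snapshot_name
        "source_job_id": job_id,           # from client.job_id
        "base_model": base_model,
    }
    return json.dumps(entry)

# Append the row so promote_checkpoint.py can find it:
# with open("./my_training/checkpoints.jsonl", "a") as f:
#     f.write(build_jsonl_entry("step-50", "step-50-a1b2c3d4", "job-abc",
#                               "accounts/fireworks/models/qwen3-8b") + "\n")
```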
`dcp_save_interval` defaults to `0` (off). Without setting it, training cannot be resumed from intermediate steps.
## checkpoints.jsonl

Checkpoint metadata is written to `{log_path}/checkpoints.jsonl` — one JSON line per save. The fields present depend on the kind:

```json
{"name": "step-10", "step": 10, "data_consumed": 40, "state_path": "cross_job://job-abc/step-10", "source_job_id": "job-abc", "base_model": "accounts/fireworks/models/qwen3-8b"}
{"name": "step-50", "step": 50, "data_consumed": 200, "state_path": "cross_job://job-abc/step-50", "sampler_path": "step-50-a1b2c3d4", "source_job_id": "job-abc", "base_model": "accounts/fireworks/models/qwen3-8b"}
```

| Field | Present in | Description |
|---|---|---|
| `state_path` | `STATE`, `BOTH` | Remote DCP reference for resume |
| `sampler_path` | `SAMPLER`, `BOTH` | Snapshot name for promotion |
| `source_job_id` | All | Trainer job that created this checkpoint |
| `base_model` | All | Base model (auto-detected by the promote script) |

`WeightSyncer.save_and_hotload()` saves HF weights to the weight-sync bucket but does not write to `checkpoints.jsonl`. Those checkpoints exist remotely but are not tracked here.
## Resume

### Automatic (same log_path)

Just rerun with the same `log_path`. The recipe reads `checkpoints.jsonl`, finds the last entry with a `state_path`, loads the DCP state, and continues from the saved step.
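The scan amounts to taking the last jsonl row that carries a `state_path`. A minimal sketch of that logic (not the recipe's actual code), using rows shaped like the examples in this page:

```python
import json

def find_resume_point(jsonl_text: str):
    """Return (state_path, step) of the last resumable entry, or None."""
    resume = None
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if "state_path" in entry:   # STATE or BOTH saves only
            resume = (entry["state_path"], entry["step"])
    return resume

log = """\
{"name": "step-10", "step": 10, "state_path": "cross_job://job-abc/step-10"}
{"name": "step-50", "step": 50, "state_path": "cross_job://job-abc/step-50", "sampler_path": "step-50-a1b2c3d4"}
"""
```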
### From another job (init_from_checkpoint)

```python
config = Config(
    log_path="./new_run",
    init_from_checkpoint="i44pvd4syzg8hjfk:step-4",  # job_id:checkpoint_name
    ...
)
```

This loads weights from the specified job and resets the step counter to 0. It is mutually exclusive with automatic resume.
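The `init_from_checkpoint` string is simply `job_id:checkpoint_name` joined by a colon. A sketch of splitting and validating it (the recipe's actual parsing may differ):

```python
def parse_checkpoint_ref(ref: str) -> tuple:
    """Split "job_id:checkpoint_name" into its two parts."""
    job_id, sep, checkpoint_name = ref.partition(":")
    if not sep or not job_id or not checkpoint_name:
        raise ValueError(f"expected 'job_id:checkpoint_name', got {ref!r}")
    return job_id, checkpoint_name
```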
## Promoting

Only entries with a `sampler_path` can be promoted (`kind=SAMPLER` or `kind=BOTH`). The final checkpoint is always promotable; mid-training DCP saves are not.

```shell
export FIREWORKS_API_KEY=...

# Promote the latest promotable checkpoint:
python promote_checkpoint.py \
    --checkpoints-jsonl ./my_training/checkpoints.jsonl

# Promote a specific step:
python promote_checkpoint.py \
    --checkpoints-jsonl ./my_training/checkpoints.jsonl \
    --step 50
```
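The script's checkpoint selection reduces to: keep rows that have a `sampler_path`, optionally filter by `--step`, and take the latest match. A sketch of that selection under those assumptions (the real script may differ in details):

```python
import json

def select_promotable(jsonl_text: str, step=None):
    """Return the latest promotable entry (one with sampler_path), or None."""
    chosen = None
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if "sampler_path" not in entry:
            continue  # STATE-only saves cannot be promoted
        if step is not None and entry.get("step") != step:
            continue  # honor an explicit --step filter
        chosen = entry
    return chosen
```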
Without `checkpoints.jsonl`, use the API directly with the `source_job_id` and `sampler_path`:

```python
from fireworks.training.sdk import FireworksClient

client = FireworksClient(api_key=api_key)
client.promote_checkpoint(
    job_id="job-abc",
    checkpoint_id="step-50-a1b2c3d4",
    output_model_id="my-fine-tuned-model",
    base_model="accounts/fireworks/models/qwen3-8b",
)
```
See Saving and Loading — Promoting for full API details.
## Config fields

| Field | Type | Default | Description |
|---|---|---|---|
| `log_path` | `str` | (required) | Directory for `checkpoints.jsonl` and logs |
| `dcp_save_interval` | `int` | `0` | Save a DCP checkpoint every N steps. `0` = off. |
| `init_from_checkpoint` | `str \| None` | `None` | Load DCP state from another job (`"job-id:checkpoint-name"`). Step resets to 0. |