
## Quickstart

Set `dcp_save_interval` and `log_path`, then rerun with the same `log_path` to resume:

```python
from training.recipes.sft_loop import Config, main
from training.utils import InfraConfig

config = Config(
    log_path="./my_training",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="data.jsonl",
    tokenizer_model="Qwen/Qwen3-8B",
    dcp_save_interval=10,  # save every 10 steps
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ),
)
main(config)

# If interrupted, just run again with the same config.
# It finds the last checkpoint in log_path and resumes automatically.
main(config)
```

## Checkpoint kinds

Every cookbook checkpoint uses a CheckpointKind. This section is the single source of truth for checkpoint promotability — other pages link here.
| Kind | What is saved | Resumable | Promotable |
| --- | --- | --- | --- |
| `STATE` | DCP (optimizer + weights) | Yes | No |
| `SAMPLER` | HF weights for inference | No | Yes |
| `BOTH` | DCP + HF weights | Yes | Yes |
- Mid-training saves (`dcp_save_interval`) use `STATE`.
- The final checkpoint always uses `BOTH`.
- To promote a mid-training checkpoint, call `save_checkpoint` explicitly with `kind=SAMPLER` or `kind=BOTH`.
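The kind-to-capability mapping above can be encoded as a small lookup. A minimal sketch; the `CheckpointKind` enum here is illustrative and may not match the cookbook's actual definition:

```python
from enum import Enum


class CheckpointKind(Enum):
    # Illustrative stand-in; the cookbook's real enum may differ.
    STATE = "state"      # DCP only: resumable, not promotable
    SAMPLER = "sampler"  # HF weights only: promotable, not resumable
    BOTH = "both"        # DCP + HF weights: resumable and promotable


RESUMABLE = {CheckpointKind.STATE, CheckpointKind.BOTH}
PROMOTABLE = {CheckpointKind.SAMPLER, CheckpointKind.BOTH}


def capabilities(kind: CheckpointKind) -> tuple[bool, bool]:
    """Return (resumable, promotable) for a checkpoint kind."""
    return kind in RESUMABLE, kind in PROMOTABLE
```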

## Three things called “type”

Three separate layers of the stack each have their own “type”, and confusing them is the usual reason a promotion fails. They are not synonyms:
| Layer | Where | Values | What it controls |
| --- | --- | --- | --- |
| Cookbook | `save_checkpoint(kind=...)` | `STATE`, `SAMPLER`, `BOTH` | Which of DCP / sampler blob (or both) gets saved |
| SDK | `save_weights_for_sampler_ext(checkpoint_type=...)` | `"base"`, `"delta"` | Whether the sampler blob is full weights or an XOR diff over the previous base (LoRA ignores this; the full adapter is always saved) |
| Server | inferred from GCS contents | `INFERENCE_BASE`, `INFERENCE_LORA`, `INFERENCE_ARC_V2` | Promotability: the first two promote, the third (delta) is rejected |
When the cookbook saves a `SAMPLER` or `BOTH` checkpoint, it always calls the SDK with `checkpoint_type="base"`, which the server detects as `INFERENCE_BASE` (full-param) or `INFERENCE_LORA` (LoRA). Both are promotable. The non-promotable `INFERENCE_ARC_V2` only appears if you bypass the cookbook and call `save_weights_for_sampler_ext("delta")` on a full-parameter run.
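The mapping in this paragraph can be sketched as a pure function. This is a simplification for illustration only: the real server infers the type by inspecting the uploaded GCS contents, not from flags, and the helper name is ours:

```python
def inferred_server_type(checkpoint_type: str, is_lora: bool) -> str:
    """Predict which server-side checkpoint type a save will produce.

    Simplified model of the behavior described above; the real server
    inspects the uploaded GCS contents rather than taking flags.
    """
    if is_lora:
        # LoRA ignores checkpoint_type: the full adapter is always saved.
        return "INFERENCE_LORA"
    if checkpoint_type == "base":
        return "INFERENCE_BASE"
    return "INFERENCE_ARC_V2"  # "delta" on a full-parameter run; not promotable


PROMOTABLE_TYPES = {"INFERENCE_BASE", "INFERENCE_LORA"}
```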

## Promotability cheat sheet

“Promotable” means the server will accept the blob. “Callable” means you also have the metadata needed to invoke `promote_checkpoint`: `snapshot_name`, `source_job_id`, and `base_model`. Only the cookbook writes those fields to `checkpoints.jsonl` automatically; every other path leaves you to capture them yourself.
| How it was saved | LoRA promotable | Full-param promotable | Metadata in `checkpoints.jsonl`? |
| --- | --- | --- | --- |
| `save_checkpoint(kind=STATE)` | No (DCP only) | No (DCP only) | `state_path` only |
| `save_checkpoint(kind=SAMPLER\|BOTH)` | Yes | Yes | Yes: `sampler_path` + `source_job_id` + `base_model` |
| `save_weights_for_sampler_ext(checkpoint_type="base")` | Yes | Yes | No: capture from `SaveSamplerResult.snapshot_name` + `client.job_id` |
| `save_weights_for_sampler_ext(checkpoint_type="delta")` | Yes (server always stores full adapter) | No | No |
| `WeightSyncer.save_and_hotload()`, first save | Yes | Yes | No |
| `WeightSyncer.save_and_hotload()`, later saves | Yes | No | No |
If you saved via the raw SDK or `WeightSyncer`, the blob may be promotable, but `promote_checkpoint.py --checkpoints-jsonl ...` won’t work out of the box because there is no jsonl row. You must either (a) hand-build a jsonl entry with `{"name", "sampler_path", "source_job_id", "base_model"}` and pass it to the script, or (b) call `FireworksClient.promote_checkpoint(job_id, checkpoint_id, output_model_id, base_model)` directly. Prefer the cookbook’s `save_checkpoint(kind=SAMPLER|BOTH)` when you can.
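Option (a) amounts to writing one JSON line carrying the four fields the promote script needs. A minimal stdlib sketch; the helper name is ours, not part of the cookbook:

```python
import json


def build_checkpoint_row(name: str, sampler_path: str,
                         source_job_id: str, base_model: str) -> str:
    """Hand-build the checkpoints.jsonl row that promote_checkpoint.py
    expects when the blob was saved outside the cookbook."""
    return json.dumps({
        "name": name,
        "sampler_path": sampler_path,    # from SaveSamplerResult.snapshot_name
        "source_job_id": source_job_id,  # from client.job_id
        "base_model": base_model,
    })
```

Append the returned line to a file and pass that file via `--checkpoints-jsonl`.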
`dcp_save_interval` defaults to `0` (off). Without setting it, training cannot be resumed from intermediate steps.

## checkpoints.jsonl

Checkpoint metadata is written to `{log_path}/checkpoints.jsonl` — one JSON line per save. The fields present depend on the kind:

```jsonl
{"name": "step-10", "step": 10, "data_consumed": 40, "state_path": "cross_job://job-abc/step-10", "source_job_id": "job-abc", "base_model": "accounts/fireworks/models/qwen3-8b"}
{"name": "step-50", "step": 50, "data_consumed": 200, "state_path": "cross_job://job-abc/step-50", "sampler_path": "step-50-a1b2c3d4", "source_job_id": "job-abc", "base_model": "accounts/fireworks/models/qwen3-8b"}
```
| Field | Present in | Description |
| --- | --- | --- |
| `state_path` | `STATE`, `BOTH` | Remote DCP reference for resume |
| `sampler_path` | `SAMPLER`, `BOTH` | Snapshot name for promotion |
| `source_job_id` | All | Trainer job that created this checkpoint |
| `base_model` | All | Base model (auto-detected by the promote script) |
`WeightSyncer.save_and_hotload()` saves HF weights to the weight-sync bucket but does not write to `checkpoints.jsonl`. Those checkpoints exist remotely but are not tracked here.

## Resume

### Automatic (same `log_path`)

Just rerun with the same `log_path`. The recipe reads `checkpoints.jsonl`, finds the last entry with a `state_path`, loads the DCP state, and continues from the saved step.

### From another job (`init_from_checkpoint`)

```python
config = Config(
    log_path="./new_run",
    init_from_checkpoint="i44pvd4syzg8hjfk:step-4",  # job_id:checkpoint_name
    ...
)
```
Loads weights from the specified job and resets the step counter to 0. Mutually exclusive with automatic resume.
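The reference format can be validated with a one-line split. A sketch; the helper is illustrative, not a cookbook function:

```python
def parse_checkpoint_ref(ref: str) -> tuple[str, str]:
    """Split an init_from_checkpoint value into (job_id, checkpoint_name).

    Expected format, per the config docs: "job_id:checkpoint_name".
    """
    job_id, sep, name = ref.partition(":")
    if not sep or not job_id or not name:
        raise ValueError(f"expected 'job_id:checkpoint_name', got {ref!r}")
    return job_id, name
```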

## Promoting a checkpoint

Only entries with `sampler_path` can be promoted (`kind=SAMPLER` or `kind=BOTH`). The final checkpoint is always promotable; mid-training DCP saves are not.
```bash
export FIREWORKS_API_KEY=...

# Promote the latest promotable checkpoint:
python promote_checkpoint.py \
    --checkpoints-jsonl ./my_training/checkpoints.jsonl

# Promote a specific step:
python promote_checkpoint.py \
    --checkpoints-jsonl ./my_training/checkpoints.jsonl \
    --step 50
```
Without `checkpoints.jsonl`, use the API directly with the `source_job_id` and `sampler_path`:

```python
from fireworks.training.sdk import FireworksClient

client = FireworksClient(api_key=api_key)
client.promote_checkpoint(
    job_id="job-abc",
    checkpoint_id="step-50-a1b2c3d4",
    output_model_id="my-fine-tuned-model",
    base_model="accounts/fireworks/models/qwen3-8b",
)
```
See Saving and Loading — Promoting for full API details.

## Config fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `log_path` | `str` | (required) | Directory for `checkpoints.jsonl` and logs |
| `dcp_save_interval` | `int` | `0` | Save a DCP checkpoint every N steps. `0` = off. |
| `init_from_checkpoint` | `str \| None` | `None` | Load DCP state from another job (`"job-id:checkpoint-name"`). Step resets to 0. |