Quickstart

Set dcp_save_interval and log_path, then rerun with the same log_path to resume:
from training.recipes.sft_loop import Config, main
from training.utils import InfraConfig

config = Config(
    log_path="./my_training",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="data.jsonl",
    tokenizer_model="Qwen/Qwen3-8B",
    dcp_save_interval=10,  # save every 10 steps
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ),
)
main(config)

# If interrupted, just run again with the same config.
# It finds the last checkpoint in log_path and resumes automatically.
main(config)

Checkpoint saving

Is it on by default?

No. dcp_save_interval defaults to 0 (off). You must set it to a positive value to enable periodic DCP checkpoints. Without it, training cannot be resumed from intermediate steps. However, a final checkpoint is always saved at the end of training (regardless of dcp_save_interval), so you can always promote the final result.
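The save schedule described above can be illustrated with a small helper. This is a sketch of the documented behavior (save every N steps, plus a guaranteed final save), not the recipe's internal scheduling code:

```python
def checkpoint_steps(total_steps: int, dcp_save_interval: int) -> list[int]:
    """Return the steps at which a checkpoint would be written."""
    if dcp_save_interval > 0:
        # Periodic DCP checkpoints at every multiple of the interval.
        steps = [s for s in range(1, total_steps + 1) if s % dcp_save_interval == 0]
    else:
        # dcp_save_interval == 0: periodic saving is off.
        steps = []
    if total_steps not in steps:
        steps.append(total_steps)  # a final checkpoint is always saved
    return steps
```

For example, `checkpoint_steps(25, 10)` yields `[10, 20, 25]`, while `checkpoint_steps(25, 0)` yields only `[25]` (the final checkpoint).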

Where are checkpoints stored?

Checkpoint metadata is written to checkpoints.jsonl inside log_path:
./my_training/
  checkpoints.jsonl    ← one JSON line per checkpoint
The file path is {log_path}/checkpoints.jsonl — you control it by setting log_path in your Config. The filename checkpoints.jsonl is fixed and cannot be overridden. The actual model weights are stored remotely (on the trainer’s distributed storage). checkpoints.jsonl only records references (state_path) that point to the remote state.
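Because checkpoints.jsonl is plain JSON Lines, you can inspect it with a few lines of standard-library Python. A minimal sketch (the recipe does its own parsing; this is just for inspection):

```python
import json
from pathlib import Path

def read_checkpoints(log_path: str) -> list[dict]:
    """Parse {log_path}/checkpoints.jsonl into a list of checkpoint records."""
    path = Path(log_path) / "checkpoints.jsonl"
    if not path.exists():
        return []  # no checkpoints written yet
    with path.open() as f:
        # One JSON object per non-empty line.
        return [json.loads(line) for line in f if line.strip()]
```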

What each entry contains

{
  "name": "step-10",
  "step": 10,
  "data_consumed": 40,
  "state_path": "cross_job://job-abc/step-10",
  "source_job_id": "job-abc",
  "base_model": "accounts/fireworks/models/qwen3-8b",
  "training_shape": "accounts/fireworks/trainingShapes/qwen3-8b-128k-h200"
}
Field            Description
step             Training step when this checkpoint was saved
data_consumed    Number of dataset examples consumed so far
state_path       Remote reference for load_state_with_optimizer (cross-job format)
source_job_id    Trainer job that created this checkpoint
base_model       Fireworks base model used for training
training_shape   Training shape ID used
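If you need the job id and checkpoint name out of a state_path, the cross-job reference can be split apart. This assumes the "cross_job://<job_id>/<name>" shape shown in the example entry above; it is an illustration, not an official parser:

```python
def parse_state_path(state_path: str) -> tuple[str, str]:
    """Split a cross-job state_path into (source_job_id, checkpoint_name)."""
    prefix = "cross_job://"
    if not state_path.startswith(prefix):
        raise ValueError(f"not a cross-job reference: {state_path!r}")
    # Everything up to the first '/' is the job id; the rest is the name.
    job_id, _, name = state_path[len(prefix):].partition("/")
    return job_id, name
```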

Resuming from a checkpoint

Automatic resume (same log_path)

The simplest path — just rerun with the same log_path:
# Same config as before, same log_path
main(config)
On startup, the recipe reads checkpoints.jsonl, finds the last entry with a state_path, creates a new trainer job, loads the DCP state via load_state_with_optimizer, and continues training from the saved step and data_consumed position.
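The "last entry with a state_path" selection can be sketched as follows. This mirrors the behavior described above; the recipe's actual implementation may differ:

```python
import json
from pathlib import Path

def latest_resumable(log_path: str):
    """Return the last checkpoints.jsonl entry that has a state_path,
    i.e. the record automatic resume would pick, or None if there is none."""
    path = Path(log_path) / "checkpoints.jsonl"
    if not path.exists():
        return None
    last = None
    with path.open() as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            if record.get("state_path"):
                last = record  # keep overwriting: the last match wins
    return last
```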

Resume from a specific job’s checkpoint (init_from_checkpoint)

To load weights from a previous run into a fresh training job with step reset to 0:
config = Config(
    log_path="./new_run",       # different log_path = fresh start
    init_from_checkpoint="i44pvd4syzg8hjfk:step-4",  # job_id:checkpoint_name
    ...
)
The format is "job_id:checkpoint_name". This creates a new trainer, loads the specified DCP state, and starts training from step 0 with a fresh dataloader.
init_from_checkpoint and automatic resume are mutually exclusive. If init_from_checkpoint is set, checkpoints.jsonl in log_path is ignored.
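A quick sketch of splitting the "job_id:checkpoint_name" spec, useful when validating the value before launching a run (illustrative only):

```python
def split_init_checkpoint(spec: str) -> tuple[str, str]:
    """Split an init_from_checkpoint spec into (job_id, checkpoint_name)."""
    job_id, sep, name = spec.partition(":")
    if not sep or not job_id or not name:
        raise ValueError(f"expected 'job_id:checkpoint_name', got {spec!r}")
    return job_id, name
```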

Promoting a checkpoint to a model

After training, promote any checkpoint to a deployable Fireworks model using train_promote_checkpoint.py:
# Auto-detects --model and --shape from checkpoint metadata:
python train_promote_checkpoint.py \
    --checkpoints-jsonl ./my_training/checkpoints.jsonl

# Promote a specific step:
python train_promote_checkpoint.py \
    --checkpoints-jsonl ./my_training/checkpoints.jsonl \
    --step 10

# Override model/shape if needed:
python train_promote_checkpoint.py \
    --checkpoints-jsonl ./my_training/checkpoints.jsonl \
    --model accounts/fireworks/models/qwen3-8b \
    --shape accounts/fireworks/trainingShapes/qwen3-8b-128k-h200
--model and --shape are optional when the checkpoint contains base_model and training_shape metadata (saved automatically by the cookbook recipes).
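The selection the script performs ("last entry by default, or the entry matching --step") can be sketched like this. This is a hypothetical re-implementation of the documented behavior, not the script's source:

```python
import json
from pathlib import Path

def pick_checkpoint(jsonl_path: str, step=None) -> dict:
    """Choose the entry matching `step`, or the last entry when step is None."""
    with Path(jsonl_path).open() as f:
        entries = [json.loads(line) for line in f if line.strip()]
    if not entries:
        raise ValueError(f"no checkpoints in {jsonl_path}")
    if step is not None:
        matches = [e for e in entries if e.get("step") == step]
        if not matches:
            raise ValueError(f"no checkpoint at step {step}")
        return matches[-1]
    return entries[-1]
```

Fields such as base_model and training_shape can then be read off the chosen entry to fill in --model and --shape.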

Config fields

Field                  Type         Default      Description
log_path               str          (required)   Directory for checkpoints.jsonl and logs
dcp_save_interval      int          0            Save a DCP checkpoint every N steps. 0 = off; set a positive value to enable resume. A final checkpoint is always saved regardless of this setting.
init_from_checkpoint   str | None   None         Load DCP state from another job ("job-id:checkpoint-name"). Step resets to 0.

API reference

For the SDK-level checkpoint primitives (save_state, load_state_with_optimizer, save_weights_for_sampler_ext), see Saving and Loading in the Cookbook Reference.