## Quickstart
Set `dcp_save_interval` and `log_path`, then rerun with the same `log_path` to resume:
```python
from training.recipes.sft_loop import Config, main
from training.utils import InfraConfig

config = Config(
    log_path="./my_training",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="data.jsonl",
    tokenizer_model="Qwen/Qwen3-8B",
    dcp_save_interval=10,  # save every 10 steps
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ),
)
main(config)

# If interrupted, just run again with the same config.
# It finds the last checkpoint in log_path and resumes automatically.
main(config)
```
## Checkpoint saving
### Is it on by default?
No. `dcp_save_interval` defaults to `0` (off). You must set it to a positive value to enable periodic DCP checkpoints; without them, training cannot be resumed from intermediate steps.

However, a final checkpoint is always saved at the end of training (regardless of `dcp_save_interval`), so you can always promote the final result.
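The documented rule can be sketched as a small helper (hypothetical — the recipe's internal logic may differ in details):

```python
def should_save_checkpoint(step: int, dcp_save_interval: int, is_final_step: bool) -> bool:
    """Mirror the documented behavior: periodic DCP saves happen only when
    the interval is positive; the final checkpoint is always saved."""
    if is_final_step:
        return True
    return dcp_save_interval > 0 and step % dcp_save_interval == 0
```

With `dcp_save_interval=10`, this saves at steps 10, 20, 30, … plus the final step; with the default `0`, only the final checkpoint is written.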
### Where are checkpoints stored?
Checkpoint metadata is written to `checkpoints.jsonl` inside `log_path`:

```
./my_training/
└── checkpoints.jsonl   ← one JSON line per checkpoint
```
The file path is `{log_path}/checkpoints.jsonl` — you control it by setting `log_path` in your `Config`. The filename `checkpoints.jsonl` is fixed and cannot be overridden.

The actual model weights are stored remotely (on the trainer's distributed storage); `checkpoints.jsonl` only records references (`state_path`) that point to the remote state.
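Because the file is plain JSON Lines, it is easy to inspect with the standard library alone. A minimal reader sketch (the `read_checkpoints` helper is illustrative, not part of the cookbook API):

```python
import json
from pathlib import Path

def read_checkpoints(log_path: str) -> list:
    """Parse {log_path}/checkpoints.jsonl into a list of entry dicts."""
    path = Path(log_path) / "checkpoints.jsonl"
    if not path.exists():
        return []
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

# Example: list every recorded step and where its remote state lives.
for entry in read_checkpoints("./my_training"):
    print(entry["step"], entry["state_path"])
```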
### What each entry contains
```json
{
  "name": "step-10",
  "step": 10,
  "data_consumed": 40,
  "state_path": "cross_job://job-abc/step-10",
  "source_job_id": "job-abc",
  "base_model": "accounts/fireworks/models/qwen3-8b",
  "training_shape": "accounts/fireworks/trainingShapes/qwen3-8b-128k-h200"
}
```
| Field | Description |
|---|---|
| `name` | Checkpoint name (e.g. `step-10`) |
| `step` | Training step when this checkpoint was saved |
| `data_consumed` | Number of dataset examples consumed so far |
| `state_path` | Remote reference for `load_state_with_optimizer` (cross-job format) |
| `source_job_id` | Trainer job that created this checkpoint |
| `base_model` | Fireworks base model used for training |
| `training_shape` | Training shape ID used |
## Resuming from a checkpoint
### Automatic resume (same `log_path`)
The simplest path — just rerun with the same `log_path`:

```python
# Same config as before, same log_path
main(config)
```
On startup, the recipe reads `checkpoints.jsonl`, finds the last entry with a `state_path`, creates a new trainer job, loads the DCP state via `load_state_with_optimizer`, and continues training from the saved `step` and `data_consumed` position.
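The selection step can be sketched as follows (a simplified stand-in for the recipe's internal scan, not its actual code):

```python
from typing import Optional

def latest_resumable(entries: list) -> Optional[dict]:
    """Return the last checkpoints.jsonl entry that carries a state_path,
    or None when there is nothing to resume from."""
    for entry in reversed(entries):
        if entry.get("state_path"):
            return entry
    return None
```

Scanning in reverse means a trailing entry without a `state_path` is skipped in favor of the most recent resumable one.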
### Resume from a specific job’s checkpoint (`init_from_checkpoint`)
To load weights from a previous run into a fresh training job with step reset to 0:
```python
config = Config(
    log_path="./new_run",  # different log_path = fresh start
    init_from_checkpoint="i44pvd4syzg8hjfk:step-4",  # job_id:checkpoint_name
    ...
)
```
The format is `"job_id:checkpoint_name"`. This creates a new trainer, loads the specified DCP state, and starts training from step 0 with a fresh dataloader.

`init_from_checkpoint` and automatic resume are mutually exclusive: if `init_from_checkpoint` is set, `checkpoints.jsonl` in `log_path` is ignored.
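The `"job_id:checkpoint_name"` string can be validated with a small parser (a hypothetical helper mirroring the documented format, not a cookbook function):

```python
def parse_init_from_checkpoint(value: str) -> tuple:
    """Split 'job_id:checkpoint_name' into (job_id, checkpoint_name),
    rejecting strings that do not match the documented format."""
    job_id, sep, name = value.partition(":")
    if not (sep and job_id and name):
        raise ValueError(f"expected 'job_id:checkpoint_name', got {value!r}")
    return job_id, name
```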
## Promoting a checkpoint

After training, promote any checkpoint to a deployable Fireworks model using `train_promote_checkpoint.py`:
```bash
# Auto-detects --model and --shape from checkpoint metadata:
python train_promote_checkpoint.py \
  --checkpoints-jsonl ./my_training/checkpoints.jsonl

# Promote a specific step:
python train_promote_checkpoint.py \
  --checkpoints-jsonl ./my_training/checkpoints.jsonl \
  --step 10

# Override model/shape if needed:
python train_promote_checkpoint.py \
  --checkpoints-jsonl ./my_training/checkpoints.jsonl \
  --model accounts/fireworks/models/qwen3-8b \
  --shape accounts/fireworks/trainingShapes/qwen3-8b-128k-h200
```
`--model` and `--shape` are optional when the checkpoint contains `base_model` and `training_shape` metadata (saved automatically by the cookbook recipes).
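How the script chooses an entry can be sketched in Python — take the requested step if given, otherwise the latest entry (a simplified reading of the documented CLI behavior; `select_entry` is illustrative, not the script's actual code):

```python
from typing import Optional

def select_entry(entries: list, step: Optional[int] = None) -> dict:
    """Choose which checkpoints.jsonl entry to promote."""
    if not entries:
        raise ValueError("no checkpoints recorded")
    if step is None:
        return entries[-1]  # default: the most recent checkpoint
    matches = [e for e in entries if e.get("step") == step]
    if not matches:
        raise ValueError(f"no checkpoint at step {step}")
    return matches[-1]
```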
## Config fields
| Field | Type | Default | Description |
|---|---|---|---|
| `log_path` | `str` | (required) | Directory for `checkpoints.jsonl` and logs |
| `dcp_save_interval` | `int` | `0` | Save a DCP checkpoint every N steps. `0` = off. Set to a positive value to enable resume. A final checkpoint is always saved regardless of this setting. |
| `init_from_checkpoint` | `str \| None` | `None` | Load DCP state from another job (`"job-id:checkpoint-name"`). Step resets to 0. |
## API reference
For lower-level details, see the Cookbook Reference:

- For SDK-level checkpoint primitives (`save_state`, `load_state_with_optimizer`, `save_weights_for_sampler_ext`), see Saving and Loading.