## Quickstart
Set `dcp_save_interval` and `log_path`, then rerun with the same `log_path` to resume:
```python
from training.recipes.sft_loop import Config, main
from training.utils import InfraConfig

config = Config(
    log_path="./my_training",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="data.jsonl",
    tokenizer_model="Qwen/Qwen3-8B",
    dcp_save_interval=10,  # save every 10 steps
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ),
)
main(config)

# If interrupted, just run again with the same config.
# It finds the last checkpoint in log_path and resumes automatically.
main(config)
```
## Checkpoint saving
### Is it on by default?
No. `dcp_save_interval` defaults to `0` (off). You must set it to a positive value to enable periodic DCP checkpoints; without them, training cannot be resumed from intermediate steps.

However, a final checkpoint is always saved at the end of training (regardless of `dcp_save_interval`), so you can always promote the final result.
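The documented rule can be sketched as a small helper (hypothetical — the recipe's internal logic may differ in details):

```python
def should_save_checkpoint(step: int, dcp_save_interval: int, is_final_step: bool) -> bool:
    """Mirror the documented behavior: periodic DCP saves happen only when
    the interval is positive; the final checkpoint is always saved."""
    if is_final_step:
        return True
    return dcp_save_interval > 0 and step % dcp_save_interval == 0
```

With `dcp_save_interval=10`, this saves at steps 10, 20, 30, … plus the final step; with the default `0`, only the final checkpoint is written.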
### Where are checkpoints stored?
Checkpoint metadata is written to `checkpoints.jsonl` inside `log_path`:

```
./my_training/
└── checkpoints.jsonl   ← one JSON line per checkpoint
```
The file path is `{log_path}/checkpoints.jsonl` — you control it by setting `log_path` in your `Config`. The filename `checkpoints.jsonl` is fixed and cannot be overridden.

The actual model weights are stored remotely (on the trainer's distributed storage); `checkpoints.jsonl` only records references (`state_path`) that point to the remote state.
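Because the file is plain JSON Lines, it is easy to inspect with the standard library alone. A minimal reader sketch (the `read_checkpoints` helper is illustrative, not part of the cookbook API):

```python
import json
from pathlib import Path

def read_checkpoints(log_path: str) -> list:
    """Parse {log_path}/checkpoints.jsonl into a list of entry dicts."""
    path = Path(log_path) / "checkpoints.jsonl"
    if not path.exists():
        return []
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

# Example: list every recorded step and where its remote state lives.
for entry in read_checkpoints("./my_training"):
    print(entry["step"], entry["state_path"])
```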
### What each entry contains
```json
{
  "name": "step-10",
  "step": 10,
  "data_consumed": 40,
  "state_path": "cross_job://job-abc/step-10",
  "source_job_id": "job-abc",
  "base_model": "accounts/fireworks/models/qwen3-8b",
  "training_shape": "accounts/fireworks/trainingShapes/qwen3-8b-128k-h200"
}
```
| Field | Description |
|---|---|
| `name` | Checkpoint name (e.g. `step-10`) |
| `step` | Training step when this checkpoint was saved |
| `data_consumed` | Number of dataset examples consumed so far |
| `state_path` | Remote reference for `load_state_with_optimizer` (cross-job format) |
| `source_job_id` | Trainer job that created this checkpoint |
| `base_model` | Fireworks base model used for training |
| `training_shape` | Training shape ID used |
## Resuming from a checkpoint
### Automatic resume (same `log_path`)
The simplest path — just rerun with the same `log_path`:

```python
# Same config as before, same log_path
main(config)
```
On startup, the recipe reads `checkpoints.jsonl`, finds the last entry with a `state_path`, creates a new trainer job, loads the DCP state via `load_state_with_optimizer`, and continues training from the saved `step` and `data_consumed` position.
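The selection step can be sketched as follows (a simplified stand-in for the recipe's internal scan, not its actual code):

```python
from typing import Optional

def latest_resumable(entries: list) -> Optional[dict]:
    """Return the last checkpoints.jsonl entry that carries a state_path,
    or None when there is nothing to resume from."""
    for entry in reversed(entries):
        if entry.get("state_path"):
            return entry
    return None
```

Scanning in reverse means a trailing entry without a `state_path` is skipped in favor of the most recent resumable one.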
### Resume from a specific job’s checkpoint (`init_from_checkpoint`)
To load weights from a previous run into a fresh training job with step reset to 0:
```python
config = Config(
    log_path="./new_run",  # different log_path = fresh start
    init_from_checkpoint="i44pvd4syzg8hjfk:step-4",  # job_id:checkpoint_name
    ...
)
```
The format is `"job_id:checkpoint_name"`. This creates a new trainer, loads the specified DCP state, and starts training from step 0 with a fresh dataloader.

`init_from_checkpoint` and automatic resume are mutually exclusive: if `init_from_checkpoint` is set, `checkpoints.jsonl` in `log_path` is ignored.
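The `"job_id:checkpoint_name"` string can be validated with a small parser (a hypothetical helper mirroring the documented format, not a cookbook function):

```python
def parse_init_from_checkpoint(value: str) -> tuple:
    """Split 'job_id:checkpoint_name' into (job_id, checkpoint_name),
    rejecting strings that do not match the documented format."""
    job_id, sep, name = value.partition(":")
    if not (sep and job_id and name):
        raise ValueError(f"expected 'job_id:checkpoint_name', got {value!r}")
    return job_id, name
```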
## Promoting a checkpoint

After training, promote any checkpoint to a deployable Fireworks model using `train_promote_checkpoint.py`:
```bash
# Auto-detects --model and --shape from checkpoint metadata:
python train_promote_checkpoint.py \
  --checkpoints-jsonl ./my_training/checkpoints.jsonl

# Promote a specific step:
python train_promote_checkpoint.py \
  --checkpoints-jsonl ./my_training/checkpoints.jsonl \
  --step 10

# Override model/shape if needed:
python train_promote_checkpoint.py \
  --checkpoints-jsonl ./my_training/checkpoints.jsonl \
  --model accounts/fireworks/models/qwen3-8b \
  --shape accounts/fireworks/trainingShapes/qwen3-8b-128k-h200
```
`--model` and `--shape` are optional when the checkpoint contains `base_model` and `training_shape` metadata (saved automatically by the cookbook recipes).
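How the script chooses an entry can be sketched in Python — take the requested step if given, otherwise the latest entry (a simplified reading of the documented CLI behavior; `select_entry` is illustrative, not the script's actual code):

```python
from typing import Optional

def select_entry(entries: list, step: Optional[int] = None) -> dict:
    """Choose which checkpoints.jsonl entry to promote."""
    if not entries:
        raise ValueError("no checkpoints recorded")
    if step is None:
        return entries[-1]  # default: the most recent checkpoint
    matches = [e for e in entries if e.get("step") == step]
    if not matches:
        raise ValueError(f"no checkpoint at step {step}")
    return matches[-1]
```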
## Config fields
| Field | Type | Default | Description |
|---|---|---|---|
| `log_path` | `str` | (required) | Directory for `checkpoints.jsonl` and logs |
| `dcp_save_interval` | `int` | `0` | Save a DCP checkpoint every N steps. `0` = off. Set to a positive value to enable resume. A final checkpoint is always saved regardless of this setting. |
| `init_from_checkpoint` | `str \| None` | `None` | Load DCP state from another job (`"job-id:checkpoint-name"`). Step resets to 0. |
## API reference
For lower-level details, see the Cookbook Reference:

- For SDK-level checkpoint primitives (`save_state`, `load_state_with_optimizer`, `save_weights_for_sampler_ext`), see Saving and Loading.