What this is
There are two types of checkpoints in the Training SDK:
- Sampler checkpoints (save_weights_for_sampler_ext): exported for deployment hotloading. These are what the inference endpoint loads.
- Train-state checkpoints (save_state): full optimizer + weight state for resuming training where you left off.
Setup: enable checkpoint_type
The checkpoint_type parameter ("base" or "delta") is available on FiretitanTrainingClient.save_weights_for_sampler_ext(). Use FiretitanServiceClient to create the client:
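The constructor and factory arguments are not documented here, so treat the following as a sketch: the import path and the create_training_client method name are assumptions, not the confirmed API.

```python
# Sketch only -- the import path and factory method name are assumptions.
from firetitan import FiretitanServiceClient  # hypothetical import path

service = FiretitanServiceClient()
client = service.create_training_client(...)  # returns a FiretitanTrainingClient

# checkpoint_type is then available on the training client:
client.save_weights_for_sampler_ext(name="ckpt-step-0", checkpoint_type="base")
```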
Base vs. delta checkpoints
| Type | What it saves | Size | When to use |
|---|---|---|---|
"base" | Full model weights | Large (~16 GB for 8B model) | First checkpoint, or when you need a clean snapshot |
"delta" | XOR diff from previous base | Small (~10x smaller) | Every subsequent checkpoint after a base |
How delta chaining works
- Save a base checkpoint (full weights) — the deployment loads this as its starting point.
- Save delta checkpoints — each contains only the diff from the base.
- The deployment applies: current_weights = base XOR delta.
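The XOR reconstruction can be illustrated with a toy byte-level version. This is illustrative only: real checkpoints are tensors and the SDK applies the diff on the deployment side; the point is that unchanged regions XOR to zero, which is why deltas compress well.

```python
# Toy byte-level version of the base + XOR-delta scheme (illustrative only).

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

base = bytes([10, 20, 30, 40])     # full weights at the base checkpoint
current = bytes([10, 21, 30, 40])  # weights after some training steps

delta = xor_bytes(base, current)   # what a "delta" checkpoint stores
restored = xor_bytes(base, delta)  # what the deployment reconstructs

assert restored == current
print(delta.hex())  # → "00010000" (unchanged bytes XOR to zero)
```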
Saving sampler checkpoints
save_weights_for_sampler_ext parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | — | Checkpoint name (auto-suffixed with session ID) |
| checkpoint_type | str \| None | None | "base" for full weights, "delta" for incremental |
| ttl_seconds | int \| None | None | Auto-delete checkpoint after this many seconds |
Returns a SaveSamplerResult with path (the snapshot name from the trainer) and snapshot_name (the session-qualified name used for hotloading).
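Putting the parameters together, a call might look like the following sketch. The parameter names come from the table above; the result-field access mirrors the return description.

```python
# Sketch: parameter names follow the table above.
result = client.save_weights_for_sampler_ext(
    name="ckpt-step-100",
    checkpoint_type="delta",   # "base" for the first save
    ttl_seconds=24 * 3600,     # optional: auto-delete after a day
)
print(result.path)           # snapshot name from the trainer
print(result.snapshot_name)  # session-qualified name, used for hotloading
```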
Hotloading checkpoints onto a deployment
The SDK provides DeploymentManager for hotload operations and WeightSyncer for managing the full save-then-hotload lifecycle with automatic delta chain tracking.
Using DeploymentManager directly
For delta hotloads, pass incremental_snapshot_metadata referencing the previous_snapshot_identity that the deployment currently has loaded:
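A sketch of what this might look like. The hotload_and_wait method name is inferred from the hotload_timeout description later in this document, and the metadata shape is an assumption; only incremental_snapshot_metadata, previous_snapshot_identity, and "arc_v2" appear in the documentation itself.

```python
# Sketch: method name and metadata shape are assumptions.
deploy_mgr = DeploymentManager(...)

# Base hotload: full weights, no chain metadata needed.
deploy_mgr.hotload_and_wait(deployment_id, snapshot_name=base_snapshot)

# Delta hotload: must reference the snapshot the deployment currently has loaded.
deploy_mgr.hotload_and_wait(
    deployment_id,
    snapshot_name=delta_snapshot,
    incremental_snapshot_metadata={
        "previous_snapshot_identity": base_snapshot,
        "compression_format": "arc_v2",
    },
)
```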
Using WeightSyncer (recommended)
WeightSyncer manages the entire checkpoint-then-hotload lifecycle, including:
- Automatic base/delta chain state tracking
- Session-scoped snapshot naming (prevents Alluxio cache staleness)
- Automatic warmup after hotload
- Deployment state checking before first hotload
- Separate DCP save for resume checkpoints
WeightSyncer fields
| Field | Type | Default | Description |
|---|---|---|---|
| policy_client | FiretitanTrainingClient | — | Training client for save operations |
| deploy_mgr | DeploymentManager \| None | None | Deployment manager for hotload (None = no hotloading) |
| deployment_id | str \| None | None | Target deployment for hotload |
| base_model | str | "" | Model name for hotload API calls |
| hotload_timeout | int | 600 | Timeout in seconds for hotload_and_wait |
| first_checkpoint_type | str | "base" | Type for the first checkpoint ("base" or "delta") |
| compression_format | str | "arc_v2" | Delta compression format |
| warmup_after_hotload | bool | True | Send a warmup request after each successful hotload |
| warmup_max_retries | int | 10 | Max retries for post-hotload warmup |
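Construction might look like the following sketch. The field names come from the table above; the keyword-argument construction style and the placeholder values are assumptions.

```python
# Sketch: field names follow the table above; values are illustrative.
syncer = WeightSyncer(
    policy_client=client,          # FiretitanTrainingClient
    deploy_mgr=deploy_mgr,         # or None to disable hotloading
    deployment_id="my-deployment",
    base_model="my-8b-model",
    hotload_timeout=600,
    first_checkpoint_type="base",
)
```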
WeightSyncer methods
| Method | Description |
|---|---|
| save_and_hotload(name, checkpoint_type=None) | Save sampler weights and hotload to deployment. Returns snapshot_name or raises on failure. |
| save_only(name, checkpoint_type=None) | Save sampler weights without hotloading. Returns snapshot_name or None. |
| hotload(snapshot_name) | Hotload a previously saved snapshot. Returns True/False. |
| save_dcp(name) | Save DCP checkpoint only (for resume). No sampler, no hotload. Returns True/False. |
| check_deployment_state() | Query the deployment's current hotload state. Returns current_snapshot_identity or None. |
| wait_for_hotload_ready(timeout_s, poll_interval_s) | Block until the deployment's hotload manager is initialized. |
Pattern: on-policy hotloading (every step)
For on-policy training (e.g. GRPO), hotload after every optimizer step so the sampling policy matches the training policy. WeightSyncer handles the base/delta chain automatically:
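A sketch of the loop, assuming a hypothetical run_optimizer_step placeholder for your training step; save_and_hotload is the method documented above, and it picks base vs. delta automatically.

```python
# Sketch: run_optimizer_step is a placeholder for your training step.
for step in range(num_steps):
    run_optimizer_step()
    # First call saves a base checkpoint; later calls save deltas.
    syncer.save_and_hotload(name=f"ckpt-step-{step}")
```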
Pattern: interval hotloading (off-policy)
For off-policy training, hotload every N steps and use importance sampling to correct for the stale policy between hotloads.
Pattern: split save and hotload
When you need to separate save from hotload (e.g. to run a deployment warmup in between), call save_only first, then hotload with the returned snapshot name.
Pattern: DCP checkpoints for resume
DCP (Distributed Checkpoint) saves are independent of sampler/hotload saves. Use save_dcp for resume checkpoints at intervals:
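A sketch of interval-based resume checkpoints; save_dcp is the method documented above, while the loop, run_optimizer_step placeholder, and interval are illustrative assumptions.

```python
# Sketch: run_optimizer_step is a placeholder; the interval is illustrative.
for step in range(num_steps):
    run_optimizer_step()
    if step % 100 == 0:
        ok = syncer.save_dcp(name=f"resume-step-{step}")  # returns True/False
```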
Saving and restoring train state
Train-state checkpoints save both model weights and optimizer state, enabling you to resume training exactly where you stopped.
Operational guidance
- First checkpoint must be base if you’re using delta chaining. The deployment needs full weights as a starting point.
- Track your checkpoint identities. Delta hotloads reference a previous_snapshot_identity that must match what the deployment currently has loaded.
- Keep checkpoint intervals predictable so evaluation comparisons are stable across experiments.
- Save train state before long experiments so you can resume from the last good state.