Overview
WeightSyncer coordinates saving sampler checkpoints and syncing them to a deployment, including automatic base/delta chain state tracking, session-scoped snapshot naming, and post-sync warmup. The managed service client now owns this logic internally.
For full-parameter training, only the first checkpoint (saved as base) is promotable; subsequent delta checkpoints are not. LoRA checkpoints are always promotable (delta chain is disabled via lora_rank > 0). See Checkpoint kinds for the full promotability matrix.
Constructor
| Field | Type | Default | Description |
|---|---|---|---|
| policy_client | FiretitanTrainingClient | — | Training client for save operations |
| deploy_mgr | DeploymentManager \| None | None | Deployment manager for weight sync (None = no weight sync) |
| deployment_id | str \| None | None | Target deployment for weight sync |
| base_model | str | "" | Model name for weight sync API calls |
| hotload_timeout | int | 600 | Timeout in seconds for hotload_and_wait |
| first_checkpoint_type | str | "base" | Type for the first checkpoint ("base" or "delta") |
| compression_format | str | "arc_v2" | Delta compression format |
| warmup_after_hotload | bool | True | Send a warmup request after each successful weight sync |
| warmup_max_retries | int | 10 | Max retries for post-weight-sync warmup |
| reset_prompt_cache | bool | True | Reset the deployment’s prompt cache after each weight sync |
| lora_rank | int | 0 | When > 0, forces all checkpoints to base type (no delta chain). LoRA adapter exports are standalone PEFT artifacts that cannot use incremental delta compression. |
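The interaction between first_checkpoint_type, lora_rank, and promotability can be sketched as a small stateless model. This is an illustration of the rules stated on this page, not the library's implementation: PlannedCheckpoint and plan_checkpoints are hypothetical names, and treating a full-parameter checkpoint as promotable only when its kind is "base" is an assumption drawn from the overview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedCheckpoint:
    index: int        # 0-based position within the session
    kind: str         # "base" or "delta"
    promotable: bool  # assumption: only base checkpoints are promotable


def plan_checkpoints(count, first_checkpoint_type="base", lora_rank=0):
    """Hypothetical model of the checkpoint kind chosen for each save."""
    if first_checkpoint_type not in ("base", "delta"):
        raise ValueError("first_checkpoint_type must be 'base' or 'delta'")
    plan = []
    for index in range(count):
        if lora_rank > 0:
            # LoRA: the delta chain is disabled, every checkpoint is base.
            kind = "base"
        elif index == 0:
            kind = first_checkpoint_type
        else:
            # Full-parameter: every checkpoint after the first is a delta.
            kind = "delta"
        plan.append(PlannedCheckpoint(index, kind, promotable=(kind == "base")))
    return plan


full = plan_checkpoints(3)
lora = plan_checkpoints(3, lora_rank=16)
```

With the defaults, only the first full-parameter checkpoint comes out as base and promotable, while every LoRA checkpoint does.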
Methods
save_and_hotload(name, checkpoint_type=None)
Save sampler weights and sync to deployment. Automatically handles base (first) vs delta (subsequent) checkpoint types.
Returns the snapshot_name (str | None) on success, or raises on failure.
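A minimal sketch of calling save_and_hotload from a training loop, assuming only the signature documented above. StubSyncer is a hypothetical stand-in so the snippet runs without a deployment, and the run_loop helper, the sync_every parameter, and the step-based snapshot names are illustrative choices rather than part of the API. A None return is treated here as "nothing was synced", which is an assumption.

```python
class StubSyncer:
    """Hypothetical stand-in mirroring only the documented signature."""

    def __init__(self):
        self.synced = []

    def save_and_hotload(self, name, checkpoint_type=None):
        # A real syncer would save sampler weights and sync the deployment.
        self.synced.append(name)
        return name


def run_loop(syncer, total_steps, sync_every=1):
    """Sync a sampler snapshot after every `sync_every` optimizer steps."""
    snapshots = []
    for step in range(1, total_steps + 1):
        # ... forward/backward pass and optimizer step would go here ...
        if step % sync_every == 0:
            snapshot = syncer.save_and_hotload(f"step-{step:05d}")
            if snapshot is not None:
                snapshots.append(snapshot)
    return snapshots


every_step = run_loop(StubSyncer(), total_steps=3)
every_other = run_loop(StubSyncer(), total_steps=6, sync_every=2)
```

Setting sync_every=1 syncs after each step; larger values trade sampler freshness for throughput. Either way, the loop remains responsible for handling requests that were already in flight when the sync happened.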
save_only(name, checkpoint_type=None)
Save sampler weights without syncing to the deployment.
Returns snapshot_name or None.
hotload(snapshot_name, checkpoint_type)
Sync a previously saved snapshot to the deployment.
Returns True on success, False on failure.
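Splitting the save from the sync might look like the sketch below, which chains save_only and hotload and leaves room for intermediate work between them. RecordingSyncer is a hypothetical stub, the save_then_hotload helper and its between callback are illustrative, and passing the same checkpoint_type to both calls is an assumption about how the two methods are meant to be paired.

```python
class RecordingSyncer:
    """Hypothetical stub exposing the two documented method signatures."""

    def __init__(self):
        self.calls = []

    def save_only(self, name, checkpoint_type=None):
        self.calls.append(("save_only", name, checkpoint_type))
        return name

    def hotload(self, snapshot_name, checkpoint_type):
        self.calls.append(("hotload", snapshot_name, checkpoint_type))
        return True


def save_then_hotload(syncer, name, checkpoint_type, between=None):
    """Save first, optionally run intermediate work, then sync."""
    snapshot = syncer.save_only(name, checkpoint_type=checkpoint_type)
    if snapshot is None:
        return False  # nothing was saved, so there is nothing to sync
    if between is not None:
        between(snapshot)  # e.g. logging or validation before the sync
    return syncer.hotload(snapshot, checkpoint_type)


seen = []
stub = RecordingSyncer()
ok = save_then_hotload(stub, "step-00010", "base", between=seen.append)
```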
check_deployment_state()
Query the deployment’s current weight sync state.
wait_for_hotload_ready(timeout_s=300, poll_interval_s=5)
Block until the deployment’s weight sync manager is initialized.
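The timeout_s and poll_interval_s parameters suggest a poll-until-ready loop. The sketch below shows that general technique under stated assumptions: wait_until_ready and the is_ready callable are hypothetical, and returning False on timeout (rather than raising) is a choice made for this example, not documented behavior of wait_for_hotload_ready.

```python
import time


def wait_until_ready(is_ready, timeout_s=300, poll_interval_s=5,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `is_ready` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False  # assumption: signal timeout with False
        sleep(poll_interval_s)


# Usage with a fake clock so the example finishes instantly.
now = [0.0]
polled_at = []


def fake_clock():
    return now[0]


def fake_sleep(seconds):
    now[0] += seconds


def ready_after_ten_seconds():
    polled_at.append(now[0])
    return now[0] >= 10


became_ready = wait_until_ready(
    ready_after_ten_seconds, timeout_s=30, poll_interval_s=5,
    clock=fake_clock, sleep=fake_sleep,
)
```

Injecting clock and sleep keeps the loop testable; production code would rely on the defaults.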
reset_delta_chain()
Force the next save to be treated as base. Call when the deployment’s bucket or trainer session changes — for example, after attaching an existing deployment to a new trainer job — otherwise the next delta could reference a base checkpoint the deployment never loaded.
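The effect of resetting the chain can be pictured with a toy state model. DeltaChainModel is a hypothetical class written for this illustration; it tracks only whether a base checkpoint has been synced and is not how WeightSyncer stores its state.

```python
class DeltaChainModel:
    """Toy model of base/delta chain state (illustrative only)."""

    def __init__(self):
        self.base_synced = False

    def next_type(self):
        # A delta is only meaningful once the deployment holds a base.
        return "delta" if self.base_synced else "base"

    def record_sync(self, checkpoint_type):
        if checkpoint_type == "base":
            self.base_synced = True

    def reset(self):
        # Mirrors the documented intent: the next save is treated as base.
        self.base_synced = False


chain = DeltaChainModel()
first = chain.next_type()
chain.record_sync(first)
second = chain.next_type()
chain.reset()  # e.g. the deployment was attached to a new trainer job
after_reset = chain.next_type()
```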
Usage patterns
These patterns are for maintaining older integrations. New code should use the service-client sampler refresh pattern documented in Training and Sampling.
Sync weights every step
To minimize sampler staleness in a synchronous loop, sync a new sampler snapshot after every optimizer step before submitting the next rollout batch. This makes new rollout requests target the latest synced checkpoint, but the loop still owns draining or rejecting any stale in-flight requests before training on them.
Interval weight sync
For throughput-oriented loops that tolerate stale sampler weights, sync a new sampler snapshot every N steps. This only controls when new sampler snapshots are saved and synced; it does not prove that already-submitted or in-flight requests were generated by the latest policy.
Split save and sync
Separate save from weight sync when you need intermediate steps (e.g. warmup).
DCP checkpoints for resume
Save DCP checkpoints at intervals using the training client directly.
Related guides
- DeploymentManager — deployment lifecycle and weight-sync API
- Saving and Loading — checkpoint concepts
- Training and Sampling — end-to-end workflow