
What this is

Sampler checkpoints are for deployment and hotloading. Train-state checkpoints also persist optimizer state, so a run can be resumed or continued reliably.

Workflow

  1. Save sampler checkpoints at stable intervals.
  2. Hotload the candidate checkpoint into the deployment.
  3. Save train state, including optimizer state, so the run can be resumed.
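
One way to keep the save cadence in this workflow explicit is a small helper that reports which checkpoints are due at a given step. This is a minimal sketch, not part of the training client: `checkpoint_names`, `sampler_every`, and `state_every` are hypothetical names, and the `step_0100` / `train_state_step_0100` naming pattern is simply copied from the examples on this page.

```python
def checkpoint_names(step: int, sampler_every: int, state_every: int) -> dict[str, str]:
    """Return the checkpoint names due at `step`, keyed by checkpoint kind.

    Hypothetical helper: the intervals and the naming scheme are assumptions
    for illustration, not something the training client requires.
    """
    if step <= 0:
        return {}
    due: dict[str, str] = {}
    tag = f"step_{step:04d}"  # zero-padded so names sort in step order
    if step % sampler_every == 0:
        due["sampler"] = tag
    if step % state_every == 0:
        due["train_state"] = f"train_state_{tag}"
    return due
```

The returned names can then be passed to the save calls at the matching steps. Saving sampler checkpoints more often than train state (or the reverse) is a choice for the run, not something this sketch prescribes.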

End-to-end examples

Checkpoint and resume primitives

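# Save a sampler checkpoint (for deployment/hotload) and wait for it to finish.
sampler_ckpt = training_client.save_weights_for_sampler("step_0100").result()
# Save the train state so the run can be resumed later.
training_client.save_state("train_state_step_0100").result()
# When resuming: restore the train state together with the optimizer state.
training_client.load_state_with_optimizer("train_state_step_0100").result()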
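
To make the difference between the two checkpoint kinds concrete, the three calls above can be exercised against a toy in-memory stand-in. This is only a sketch under stated assumptions: `StubTrainingClient`, `train_step`, and the `_done` helper are hypothetical and are not the real training client. The stub mirrors only the three method names and the `.result()` pattern shown above; what the real calls return is not specified here, so the stub simply returns the name it was given.

```python
from concurrent.futures import Future


def _done(value):
    """Wrap a value in an already-completed Future so .result() works."""
    fut = Future()
    fut.set_result(value)
    return fut


class StubTrainingClient:
    """Hypothetical in-memory stand-in; NOT the real training client."""

    def __init__(self):
        self.weights = 0.0
        self.optimizer_step = 0
        self.sampler_checkpoints = []
        self._saved_states = {}

    def train_step(self):
        # Toy update so that state visibly changes between checkpoints.
        self.optimizer_step += 1
        self.weights += 0.1

    def save_weights_for_sampler(self, name):
        # Assumption: a sampler checkpoint is recorded for serving only.
        self.sampler_checkpoints.append(name)
        return _done(name)

    def save_state(self, name):
        # Assumption: train state includes weights and optimizer state.
        self._saved_states[name] = (self.weights, self.optimizer_step)
        return _done(name)

    def load_state_with_optimizer(self, name):
        self.weights, self.optimizer_step = self._saved_states[name]
        return _done(name)


client = StubTrainingClient()
for _ in range(100):
    client.train_step()

sampler_ckpt = client.save_weights_for_sampler("step_0100").result()
client.save_state("train_state_step_0100").result()

# Train a little further, then roll back to the saved train state.
for _ in range(5):
    client.train_step()
client.load_state_with_optimizer("train_state_step_0100").result()
```

After the final call the stub is back at optimizer step 100, which is the property a resumable run relies on. The sampler checkpoint name is kept separately for the deployment side.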