What this is
Fireworks Training SDK gives teams a flexibility ladder from managed jobs to fully custom training loops. Start managed for standard objectives, then move to Tinker-compatible loops when you need custom losses, full-parameter updates, and tighter experiment control.

| Mode | Best for | Objective control | Infrastructure |
|---|---|---|---|
| Managed jobs (SFT, DPO, RFT) | Standard objectives, fast iteration | Platform-defined | Fully managed |
| Service-mode SDK loops | Custom losses, algorithm research | Full Python control | You drive the loop, platform runs GPUs |
New to custom training loops? Start with Core Concepts for an introduction to the architecture, then follow the Quickstart to run a minimal training loop.
Why this approach
- One platform, two modes: Move from managed baselines to frontier research without rebuilding infrastructure.
- When to use managed jobs: Standard objectives (SFT, DPO, managed GRPO) with minimal custom code.
- When to use Training SDK loops: Custom objectives, algorithm research, or fine-grained control over training beyond built-in methods.
- Production inference in the loop: Hotload checkpoints onto serving deployments for realistic evaluation during training.
System architecture
A control-plane API provisions trainer and deployment resources. Your local Python loop connects to the trainer service, runs custom train steps, and periodically exports checkpoints to a serving deployment for sampling and evaluation.

Installation
SDK (custom loops)
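An install sketch; the distribution name `fireworks-training` is a guess, so substitute the published package name from the official docs:

```shell
# Package name below is hypothetical; use the name from the official docs.
pip install fireworks-training
```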
Standalone cookbook recipes (optional)
The cookbook provides ready-to-run training recipes (GRPO, DPO, SFT, ORPO); install it as a package. Its recipes build on two SDK classes: `FiretitanServiceClient` (extends tinker's `ServiceClient` with `checkpoint_type` and session-scoped snapshot naming) and `DeploymentSampler` (client-side tokenization for training-compatible sampling).
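For example (the package name is hypothetical; check the cookbook README for the actual source):

```shell
# Cookbook package name is hypothetical; see the cookbook README.
pip install fireworks-training-cookbook
```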
Key APIs
Resource setup & teardown (Fireworks Training SDK)
| API | Purpose |
|---|---|
| `TrainerJobManager.create_and_wait(config)` | Create a service-mode trainer and poll until healthy |
| `TrainerJobManager.wait_for_existing(job_id)` | Wait for an already-existing trainer job to reach RUNNING |
| `TrainerJobManager.resume_and_wait(job_id)` | Resume a failed/cancelled/paused job and wait |
| `TrainerJobManager.reconnect_and_wait(job_id)` | Reconnect to a preempted/failed job (handles transitional states) |
| `TrainerJobManager.resolve_training_profile(shape_id)` | Fetch training shape config from the control plane |
| `TrainerJobManager.delete(job_id)` | Delete a trainer job |
| `DeploymentManager.create_or_get(config)` | Create or reuse an inference deployment for sampling/hotload |
| `DeploymentManager.wait_for_ready(deployment_id)` | Poll until deployment is READY |
| `DeploymentManager.scale_to_zero(deployment_id)` | Scale to zero replicas without deleting |
| `DeploymentManager.delete(deployment_id)` | Delete a deployment |
Training loop (Fireworks Training SDK + Tinker)
| API | Purpose |
|---|---|
| `FiretitanServiceClient(base_url, api_key)` | Connect to a trainer endpoint (extends tinker `ServiceClient`) |
| `service.create_training_client(base_model, lora_rank)` | Create a `FiretitanTrainingClient` with checkpoint extensions |
| `client.forward(datums, loss_type)` | Forward pass only (e.g. for reference logprobs) |
| `client.forward_backward_custom(datums, loss_fn)` | Forward + backward with your custom loss |
| `client.optim_step(tinker.AdamParams(...))` | Apply optimizer update |
| `client.save_weights_for_sampler_ext(name, checkpoint_type)` | Export serving-compatible checkpoint with session-scoped naming |
| `client.list_checkpoints()` | List available DCP checkpoints from the trainer |
| `client.resolve_checkpoint_path(name, source_job_id)` | Resolve checkpoint input for cross-job resume |
| `client.save_state(name, ttl_seconds)` | Save full train state (weights + optimizer) for resume |
| `client.load_state_with_optimizer(name)` | Restore train state for resume |
| `DeploymentSampler(inference_url, model, api_key, tokenizer)` | Client-side tokenized sampling from a deployment |
| `WeightSyncer(policy_client, deploy_mgr, ...)` | Manages checkpoint + hotload lifecycle with delta chaining |
Workflow
- Create resources: Provision a trainer job and (optionally) a hotload-enabled deployment.
- Connect a training client: Use `FiretitanServiceClient` to connect to the trainer endpoint.
- Build batches and compute objectives: Construct `tinker.Datum` objects and implement your loss function in Python.
- Iterate: Run `forward_backward_custom` + `optim_step` in a loop.
- Checkpoint and evaluate: Save checkpoints, hotload onto deployment, sample, and evaluate.
End-to-end example
Bootstrap trainer and deployment
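A minimal bootstrap sketch using the manager APIs from the tables above. This is untested: the import path (`fireworks.training`), the manager constructor arguments, and the deployment ID are assumptions chosen for illustration.

```python
# Untested sketch: import path and constructor arguments are assumptions;
# class and field names follow the reference tables in this document.
from fireworks.training import (  # hypothetical module path
    DeploymentConfig, DeploymentManager, TrainerJobConfig, TrainerJobManager,
)

BASE_MODEL = "accounts/fireworks/models/qwen3-8b"

# 1. Hotload-enabled deployment for sampling and evaluation
deploy_mgr = DeploymentManager(base_url="https://api.fireworks.ai", api_key="...")
deployment = deploy_mgr.create_or_get(DeploymentConfig(
    deployment_id="rl-eval-deploy",
    base_model=BASE_MODEL,     # must match the trainer's base model
    min_replica_count=0,       # allow scale-to-zero when idle
))
deploy_mgr.wait_for_ready("rl-eval-deploy")

# 2. Service-mode trainer wired to the deployment for checkpoint hotload
job_mgr = TrainerJobManager(base_url="https://api.fireworks.ai", api_key="...")
trainer = job_mgr.create_and_wait(TrainerJobConfig(
    base_model=BASE_MODEL,
    lora_rank=0,               # full-parameter tuning
    max_context_length=4096,
    learning_rate=1e-5,
    hot_load_deployment_id="rl-eval-deploy",
))
```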
Run a custom update and checkpoint
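A hedged sketch of one update-and-checkpoint cycle against a running trainer. The endpoint attribute (`trainer.endpoint_url`), the helpers `build_batch` and `my_custom_loss`, and the `checkpoint_type` value are hypothetical; the client calls follow the training-loop API table above.

```python
# Untested sketch: endpoint attribute, helper functions, and checkpoint_type
# value are assumptions.
import tinker  # tinker-compatible client types
from fireworks.training import FiretitanServiceClient  # hypothetical module path

service = FiretitanServiceClient(base_url=trainer.endpoint_url, api_key="...")
client = service.create_training_client(base_model=BASE_MODEL, lora_rank=0)

for step in range(1000):
    datums = build_batch()  # your code: returns list[tinker.Datum]
    # Forward + backward with a custom Python loss
    client.forward_backward_custom(datums, loss_fn=my_custom_loss)
    client.optim_step(tinker.AdamParams(learning_rate=1e-5))

    if step % 50 == 0:
        # Export a serving-compatible checkpoint with session-scoped naming;
        # it can then be hotloaded onto the deployment for evaluation sampling.
        client.save_weights_for_sampler_ext(
            name=f"step-{step}",
            checkpoint_type="hotload",  # value assumed
        )
```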
TrainerJobConfig reference
`TrainerJobManager.create_and_wait(...)` accepts a `TrainerJobConfig` with these fields:
| Field | Type | Default | Description |
|---|---|---|---|
| `base_model` | `str` | — | Base model name (e.g. `"accounts/fireworks/models/qwen3-8b"`) |
| `lora_rank` | `int` | `0` | LoRA rank: `0` for full-parameter tuning, or a positive integer (e.g. `16`, `64`) for LoRA |
| `max_context_length` | `int` | `4096` | Maximum sequence length |
| `learning_rate` | `float` | `1e-5` | Learning rate for the optimizer |
| `gradient_accumulation_steps` | `int` | `1` | Number of micro-batches before an optimizer step |
| `node_count` | `int` | `1` | Number of trainer nodes |
| `display_name` | `str \| None` | `None` | Human-readable trainer name |
| `hot_load_deployment_id` | `str \| None` | `None` | Deployment ID used for checkpoint hotload workflows |
| `region` | `str \| None` | `None` | Region for the job (e.g. `"US_VIRGINIA_1"`); auto-resolved when using training shapes |
| `custom_image_tag` | `str \| None` | `None` | Override trainer image tag |
| `extra_args` | `list[str] \| None` | `None` | Extra trainer arguments |
| `accelerator_type` | `str \| None` | `None` | Accelerator type override |
| `accelerator_count` | `int \| None` | `None` | Accelerator count override |
| `skip_validations` | `bool` | `False` | Bypass control-plane validation checks |
| `forward_only` | `bool` | `False` | Create a forward-only trainer (reference-model pattern) |
DeploymentConfig reference
`DeploymentManager.create_or_get(...)` accepts a `DeploymentConfig` with these fields:
| Field | Type | Default | Description |
|---|---|---|---|
| `deployment_id` | `str` | — | Stable deployment identifier |
| `base_model` | `str` | — | Base model name; must match the trainer's base model for hotload compatibility |
| `deployment_shape` | `str \| None` | `None` | Deployment shape resource name (overrides accelerator/region) |
| `region` | `str` | `"US_VIRGINIA_1"` | Region for the deployment |
| `min_replica_count` | `int` | `0` | Minimum replicas (set `0` to scale to zero when idle) |
| `max_replica_count` | `int` | `1` | Maximum replicas for autoscaling |
| `accelerator_type` | `str` | `"NVIDIA_H200_141GB"` | Accelerator type |
| `hot_load_bucket_type` | `str \| None` | `"FW_HOSTED"` | Hotload storage backend |
| `skip_shape_validation` | `bool` | `False` | Bypass deployment shape validation |
| `extra_args` | `list[str] \| None` | `None` | Extra serving arguments |
W&B integration
For SDK/cookbook loops, configure W&B via the cookbook configuration (`WandBConfig`) rather than `TrainerJobConfig`.
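A sketch of what that configuration might look like; the import path and field names are assumptions, so check the cookbook's `WandBConfig` definition:

```python
# Import path and field names are assumptions.
from cookbook.config import WandBConfig  # hypothetical module path

wandb_config = WandBConfig(
    project="fireworks-training",
    run_name="grpo-qwen3-8b-run1",
    # API key is typically read from the WANDB_API_KEY environment variable
)
```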
DeploymentManager constructor
DeploymentManager supports separate URLs for control-plane, inference, and hotload operations; unspecified URLs fall back to `base_url`. Separate URLs are useful when the control-plane and gateway have different endpoints (e.g. personal dev gateways).
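Illustratively, with all three endpoints overridden (only `base_url` appears in the text above; the other parameter names are assumptions):

```python
# Parameter names other than base_url are assumptions.
deploy_mgr = DeploymentManager(
    base_url="https://api.fireworks.ai",              # control-plane operations
    inference_base_url="https://my-gateway.example",  # sampling traffic (assumed name)
    hotload_base_url="https://hotload.example",       # checkpoint hotload (assumed name)
    api_key="...",
)
```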
Operational guidance
- Full-parameter and LoRA: Service mode supports both. Set `lora_rank=0` for full-parameter tuning or a positive integer (e.g. `16`, `64`) for LoRA, and match `create_training_client(lora_rank=...)` accordingly.
- Starter loops: Use `training.recipes.rl_loop` (GRPO/DAPO/GSPO/CISPO), `dpo_loop`, `orpo_loop`, and `sft_loop` from the standalone cookbook repo as the current reference implementations.
- Training shapes: Use `TrainerJobManager.resolve_training_profile(shape_id)` to auto-populate infra config (region, accelerator, image tag, node count) from the control plane instead of setting them manually.
- Preemption handling: Use `reconnect_and_wait(job_id)` to resume preempted trainer jobs; it handles transitional states (CREATING, DELETING) by polling until the job reaches a resumable state.
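As a concrete illustration of what the GRPO-style recipes compute, here is a minimal, framework-free sketch of group-relative advantage normalization (the cookbook's actual implementation may differ in details such as std clamping):

```python
import statistics

def grpo_group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within a group of rollouts for the same prompt:
    subtract the group mean and divide by the group std (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rollouts scored identically: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four rollouts of one prompt, binary rewards
print(grpo_group_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

These per-rollout advantages are what a custom loss passed to `forward_backward_custom` would weight token logprobs by.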
Common pitfalls
- Evaluating against stale deployments can hide regressions — always verify the hotloaded checkpoint identity.
- Under-specified checkpoint metadata makes successful runs hard to reproduce — log step numbers, checkpoint names, and deployment revisions together.
- Mixing managed-job fields (for example `epochs`, `batch_size`) into `TrainerJobConfig`: these are separate APIs, and such fields are ignored by the training SDK manager layer.