What this is
Fireworks Training SDK gives teams a flexibility ladder from managed jobs to fully custom training loops. Start managed for standard objectives, then move to Tinker-compatible loops when you need custom losses, full-parameter updates, and tighter experiment control.

| Mode | Best for | Objective control | Infrastructure |
|---|---|---|---|
| Managed jobs (SFT, DPO, RFT) | Standard objectives, fast iteration | Platform-defined | Fully managed |
| Cookbook recipes | GRPO/DPO/SFT/ORPO with config-driven customization | Fork and modify | You configure, platform runs GPUs |
| SDK loops | Custom losses, algorithm research | Full Python control | You drive the loop, platform runs GPUs |
Who does what (SDK loops)
When using the SDK directly, this is the responsibility split:

| Fireworks handles | You implement |
|---|---|
| GPU provisioning and cluster management | Training loop logic (forward_backward_custom + optim_step) |
| Service-mode trainer lifecycle (create, health-check, reconnect, delete) | Loss function and batch construction (tinker.Datum objects, custom objectives) |
| Checkpoint storage and export (save_weights_for_sampler_ext, DCP snapshots) | Reward signals and evaluation logic (sample from deployment, score responses) |
| Inference deployment and hotloading (checkpoint to live serving) | Hyperparameter tuning (learning rate, grad accum, context length) |
| Preemption recovery and job resume (transparent reconnect) | Data pipeline and dataset preparation |
| Distributed training (multi-node, sharding, FSDP) | Experiment tracking and logging (W&B, custom metrics) |
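The right-hand column above amounts to a short Python loop you own. A minimal sketch, where `client` stands in for the `FiretitanTrainingClient` described under Key APIs (in practice `datums` are `tinker.Datum` objects and `optim_params` is a `tinker.AdamParams`; the checkpoint-naming scheme here is illustrative):

```python
def train_loop(client, batches, loss_fn, optim_params, checkpoint_every=50):
    """Drive the custom training loop you own: forward/backward with a
    custom loss, an optimizer step, and periodic checkpoint export."""
    for step, datums in enumerate(batches):
        # Your loss function and batch construction.
        client.forward_backward_custom(datums, loss_fn)
        # Apply the optimizer update.
        client.optim_step(optim_params)
        if step % checkpoint_every == 0:
            # Export a serving-compatible checkpoint for hotload/eval.
            client.save_weights_for_sampler_ext(f"step-{step}")
```

Reward computation and evaluation would slot in between steps, sampling from the deployment via `DeploymentSampler`.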
System architecture
A control-plane API provisions trainer and deployment resources. Your local Python loop connects to the trainer service, runs custom train steps, and periodically exports checkpoints to a serving deployment for sampling and evaluation.

Key APIs
SDK APIs (from fireworks.training.sdk import ...)
| API | Purpose |
|---|---|
| TrainerJobManager.create_and_wait(config) | Create a service-mode trainer and poll until healthy |
| TrainerJobManager.wait_for_existing(job_id) | Wait for an already-existing trainer job to reach RUNNING |
| TrainerJobManager.resume_and_wait(job_id) | Resume a failed/cancelled/paused job and wait |
| TrainerJobManager.reconnect_and_wait(job_id) | Reconnect to a preempted/failed job (handles transitional states) |
| TrainerJobManager.resolve_training_profile(shape_id) | Fetch training shape config from the control plane |
| TrainerJobManager.delete(job_id) | Delete a trainer job |
| DeploymentManager.create_or_get(config) | Create or reuse an inference deployment for sampling/hotload |
| DeploymentManager.wait_for_ready(deployment_id) | Poll until deployment is READY |
| DeploymentManager.scale_to_zero(deployment_id) | Scale to zero replicas without deleting |
| DeploymentManager.delete(deployment_id) | Delete a deployment |
| FiretitanServiceClient(base_url, api_key) | Connect to a trainer endpoint (extends tinker ServiceClient) |
| service.create_training_client(base_model, lora_rank) | Create a FiretitanTrainingClient with checkpoint extensions |
| client.forward(datums, loss_type) | Forward pass only (e.g. for reference logprobs) |
| client.forward_backward_custom(datums, loss_fn) | Forward + backward with your custom loss |
| client.optim_step(tinker.AdamParams(...)) | Apply optimizer update |
| client.save_weights_for_sampler_ext(name, checkpoint_type) | Export serving-compatible checkpoint with session-scoped naming |
| client.save_state(name, ttl_seconds) | Save full train state (weights + optimizer) for resume |
| client.load_state_with_optimizer(name) | Restore train state for resume |
| client.list_checkpoints() | List available DCP checkpoints from the trainer |
| client.resolve_checkpoint_path(name, source_job_id) | Resolve checkpoint input for cross-job resume |
| DeploymentSampler(inference_url, model, api_key, tokenizer) | Client-side tokenized sampling from a deployment |
| WeightSyncer(policy_client, deploy_mgr, ...) | Manages checkpoint + hotload lifecycle with delta chaining |
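Put together, provisioning follows a simple sequence. A hedged sketch of the wiring, written generically over the two managers (config fields follow the reference tables below; this is our reading of the call order, not a definitive recipe):

```python
def provision(trainer_mgr, deploy_mgr, trainer_cfg, deploy_cfg):
    """Create a service-mode trainer and a companion deployment, then wait
    until both are usable. Mirrors the manager APIs in the table above."""
    job = trainer_mgr.create_and_wait(trainer_cfg)       # poll until healthy
    deployment = deploy_mgr.create_or_get(deploy_cfg)    # reuse if it exists
    deploy_mgr.wait_for_ready(deploy_cfg.deployment_id)  # poll until READY
    return job, deployment
```

With both handles in place, the loop connects via `FiretitanServiceClient` and samples via `DeploymentSampler`.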
Cookbook helpers (from training.utils import ...)
Requires the cookbook to be installed. These wrap the SDK APIs above.
| API | Purpose |
|---|---|
| InfraConfig | GPU, region, and training shape settings (wraps TrainerJobConfig) |
| DeployConfig | Deployment settings (wraps DeploymentConfig) |
| HotloadConfig | Checkpoint and weight-sync intervals |
| WandBConfig | Weights & Biases logging settings |
| create_trainer_job(rlor_mgr, ...) | Create trainer with shape resolution and validation |
| setup_deployment(deploy_mgr, ...) | Create or reuse a deployment with cookbook config |
| ReconnectableClient | Training client wrapper with auto-reconnect on preemption |
| checkpoint_utils.save_checkpoint(client, name, log_path, ...) | Save train state + append to checkpoints.jsonl for resume |
| checkpoint_utils.resolve_resume(client, log_path) | Load last checkpoint from checkpoints.jsonl and restore state |
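For long runs, the two `checkpoint_utils` helpers bracket the loop. A sketch with the helpers passed in as parameters (in practice they come from `training.utils.checkpoint_utils`; the call sites are assumptions based on the signatures above, and the checkpoint names are illustrative):

```python
def train_with_resume(client, batches, loss_fn, optim_params,
                      log_path, save_checkpoint, resolve_resume,
                      save_every=100):
    """Restore prior state from checkpoints.jsonl (if any), then train and
    periodically persist full train state for future resume."""
    resolve_resume(client, log_path)  # no-op on a fresh run
    for step, datums in enumerate(batches):
        client.forward_backward_custom(datums, loss_fn)
        client.optim_step(optim_params)
        if step and step % save_every == 0:
            save_checkpoint(client, f"ckpt-{step}", log_path)
```

Wrapping `client` in `ReconnectableClient` would additionally make the loop survive preemptions.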
TrainerJobConfig reference
TrainerJobManager.create_and_wait(...) accepts a TrainerJobConfig with these fields:
| Field | Type | Default | Description |
|---|---|---|---|
| base_model | str | — | Base model name (e.g. "accounts/fireworks/models/qwen3-8b") |
| lora_rank | int | 0 | LoRA rank. 0 for full-parameter tuning, or a positive integer (e.g. 16, 64) for LoRA |
| max_context_length | int | 4096 | Maximum sequence length |
| learning_rate | float | 1e-5 | Learning rate for the optimizer |
| gradient_accumulation_steps | int | 1 | Number of micro-batches before an optimizer step |
| node_count | int | 1 | Number of trainer nodes |
| display_name | str \| None | None | Human-readable trainer name |
| hot_load_deployment_id | str \| None | None | Deployment ID used for checkpoint hotload workflows |
| region | str \| None | None | Region for the job (e.g. "US_VIRGINIA_1"). Auto-resolved when using training shapes. |
| custom_image_tag | str \| None | None | Override trainer image tag |
| extra_args | list[str] \| None | None | Extra trainer arguments |
| accelerator_type | str \| None | None | Accelerator type override |
| accelerator_count | int \| None | None | Accelerator count override |
| skip_validations | bool | False | Bypass control-plane validation checks |
| forward_only | bool | False | Create a forward-only trainer (reference model pattern) |
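As a worked example, a full-parameter job config might look like the fragment below (the import path is an assumption; values are illustrative, not recommendations):

```python
from fireworks.training.sdk import TrainerJobConfig

config = TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    lora_rank=0,                     # 0 = full-parameter tuning
    max_context_length=8192,
    learning_rate=1e-5,
    gradient_accumulation_steps=4,   # 4 micro-batches per optimizer step
    node_count=1,
)
```

Passing a training shape via `resolve_training_profile` would fill in region, accelerator, and image fields instead of setting them here.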
DeploymentConfig reference
DeploymentManager.create_or_get(...) accepts a DeploymentConfig with these fields:
| Field | Type | Default | Description |
|---|---|---|---|
| deployment_id | str | — | Stable deployment identifier |
| base_model | str | — | Base model name. Must match the trainer’s base model for hotload compatibility. |
| deployment_shape | str \| None | None | Deployment shape resource name (overrides accelerator/region) |
| region | str | "US_VIRGINIA_1" | Region for the deployment |
| min_replica_count | int | 0 | Minimum replicas (set 0 to scale to zero when idle) |
| max_replica_count | int | 1 | Maximum replicas for autoscaling |
| accelerator_type | str | "NVIDIA_H200_141GB" | Accelerator type |
| hot_load_bucket_type | str \| None | "FW_HOSTED" | Hotload storage backend |
| skip_shape_validation | bool | False | Bypass deployment shape validation |
| extra_args | list[str] \| None | None | Extra serving arguments |
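A matching deployment config, again as a hedged fragment (import path assumed, values illustrative):

```python
from fireworks.training.sdk import DeploymentConfig

deploy_cfg = DeploymentConfig(
    deployment_id="exp1-eval",
    base_model="accounts/fireworks/models/qwen3-8b",  # must match the trainer
    min_replica_count=0,   # scale to zero when idle
    max_replica_count=1,
)
```

Keeping `min_replica_count=0` lets evaluation deployments idle cheaply between weight syncs.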
DeploymentManager constructor
DeploymentManager supports separate URLs for control-plane, inference, and hotload operations (base_url plus per-operation overrides). Separate URLs are useful when the control plane and gateway have different endpoints (e.g. personal dev gateways).
Operational guidance
- Service mode supports both full-parameter and LoRA tuning. Set `lora_rank=0` for full-parameter tuning, or a positive integer (e.g. `16`, `64`) for LoRA, and match `create_training_client(lora_rank=...)` accordingly.
- Cookbook recipes: use `training.recipes.rl_loop` (GRPO/DAPO/GSPO/CISPO), `dpo_loop`, `orpo_loop`, and `sft_loop` from the cookbook repo as reference implementations.
- Training shapes: use `TrainerJobManager.resolve_training_profile(shape_id)` to auto-populate infra config (region, accelerator, image tag, node count) from the control plane instead of setting them manually.
- Preemption handling: use `reconnect_and_wait(job_id)` to resume preempted trainer jobs; it handles transitional states (CREATING, DELETING) by polling until the job reaches a resumable state.
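The preemption guidance above can be wrapped in a small helper. A sketch, assuming a reconnect-first, resume-as-fallback policy; that ordering is our reading of the two manager APIs, not documented behavior:

```python
def recover_job(trainer_mgr, job_id):
    """Prefer transparent reconnect after preemption; fall back to an
    explicit resume for failed/cancelled/paused jobs."""
    try:
        return trainer_mgr.reconnect_and_wait(job_id)
    except Exception:
        return trainer_mgr.resume_and_wait(job_id)
```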
Common pitfalls
- Evaluating against stale deployments can hide regressions — always verify the hotloaded checkpoint identity.
- Under-specified checkpoint metadata makes successful runs hard to reproduce — log step numbers, checkpoint names, and deployment revisions together.
- Mixing managed-job fields (for example `epochs`, `batch_size`) into `TrainerJobConfig`: these are separate APIs, and such fields are ignored by the training SDK manager layer.