
What this is

The Fireworks Training SDK gives teams a flexibility ladder from managed jobs to fully custom training loops. Start with managed jobs for standard objectives, then move to Tinker-compatible loops when you need custom losses, full-parameter updates, and tighter experiment control. Full-parameter RFT is currently in private preview and is only supported through Training SDK workflows.

Why this approach

  • Differentiation: Fireworks combines managed training products and custom loop primitives on one platform, so teams can move from baseline to frontier research without rebuilding infrastructure.
  • When to use managed jobs: choose them for standard objectives and faster productionization with less custom training code.
  • When to use Training SDK loops: choose them when you need custom objectives, full-parameter tuning, or algorithm research beyond built-in managed objectives.
  • Production inference in the loop: evaluating checkpoints through the same serving stack you run in production keeps offline progress aligned with real serving behavior.

System architecture

A control-plane API provisions trainer and deployment resources. Your local loop connects to the trainer service, runs custom train steps, and periodically exports checkpoints to serving for realistic sampling and evaluation.

How to use these APIs

  • Fireworks.reinforcement_fine_tuning_steps.create: Create a trainer endpoint for custom loops.
  • tinker.ServiceClient.create_lora_training_client: Attach a trainable client to the trainer service.
  • TrainingClient.forward_backward_custom + TrainingClient.optim_step: Apply custom losses and update model weights.
  • TrainingClient.save_weights_for_sampler + Fireworks.deployments: Publish checkpoints and evaluate via serving.

Workflow

  1. Choose your mode: managed jobs for standard objectives, or service-mode SDK loops for custom objectives.
  2. If you choose a custom loop, create trainer and deployment resources from the Fireworks SDK.
  3. Connect a training client from your Python loop.
  4. Build batches and compute custom objectives in your code.
  5. Run iterative updates with optimizer and metrics logging.
  6. Checkpoint, hotload the deployment, and evaluate sampled behavior (steps 3-6 are combined in the loop sketch below).
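
Put together, a single custom loop looks roughly like the minimal sketch below. It assumes the trainer job from the bootstrap example in the next section already exists, and build_batch, compute_custom_objective, and hotload_deployment stand in for your own code rather than SDK functions.

import tinker

# Minimal loop sketch (steps 3-6). Assumes `trainer_job` was created in service
# mode as in the bootstrap example below; `build_batch`, `compute_custom_objective`,
# and `hotload_deployment` are placeholders for your own code, not SDK calls.
service = tinker.ServiceClient(base_url=trainer_job.direct_route_handle, api_key="<FIREWORKS_API_KEY>")
training_client = service.create_lora_training_client(base_model="accounts/fireworks/models/qwen3-8b", rank=0)

def objective(data, logprobs_list):
    loss = compute_custom_objective(logprobs_list)  # placeholder: your custom loss
    return loss, {"loss": float(loss.item())}

num_steps, checkpoint_every = 100, 10  # illustrative values
for step in range(num_steps):
    batch = build_batch(step)  # placeholder: your data pipeline
    training_client.forward_backward_custom(batch, objective).result()
    training_client.optim_step(tinker.AdamParams(learning_rate=1e-5)).result()
    if (step + 1) % checkpoint_every == 0:
        checkpoint = training_client.save_weights_for_sampler(f"step_{step + 1:04d}").result()
        hotload_deployment(checkpoint.path)  # placeholder: point serving at the new checkpoint, then evaluate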

End-to-end examples

Bootstrap trainer and deployment

from fireworks import Fireworks
import tinker

fw = Fireworks(api_key="<FIREWORKS_API_KEY>", account_id="<ACCOUNT_ID>")
trainer_job = fw.reinforcement_fine_tuning_steps.create(
    training_config={
        "base_model": "accounts/fireworks/models/qwen3-8b",
        "lora_rank": 0,  # 0 selects full-parameter training (required in service mode)
        "max_context_length": 4096,
        "learning_rate": 1e-5,
        "gradient_accumulation_steps": 4,
    },
    extra_body={"serviceMode": True, "keepAlive": False},  # service mode exposes a trainer endpoint for custom loops
)
# Connect the Tinker-compatible training client to the trainer's direct route.
service = tinker.ServiceClient(base_url=trainer_job.direct_route_handle, api_key="<FIREWORKS_API_KEY>")
training_client = service.create_lora_training_client(base_model="accounts/fireworks/models/qwen3-8b", rank=0)  # rank=0 matches lora_rank above

# Hot-loadable serving deployment used to sample from and evaluate exported checkpoints.
deployment = fw.deployments.create(
    deployment_id="research-loop-serving",
    base_model="accounts/fireworks/models/qwen3-8b",
    min_replica_count=0,
    max_replica_count=1,
    enable_hot_load=True,
)

One custom update iteration

def objective(data, logprobs_list):
    # Receives the batch and per-example token logprobs; must return (loss, metrics).
    loss = compute_custom_objective(logprobs_list)  # your custom loss (see the sketch below)
    metrics = {"loss": float(loss.item())}
    return loss, metrics

training_client.forward_backward_custom(batch, objective).result()  # `batch` is built by your data pipeline
training_client.optim_step(  # apply the accumulated gradients
    tinker.AdamParams(
        learning_rate=1e-5,
        beta1=0.9,
        beta2=0.999,
        eps=1e-8,
        weight_decay=0.01,
    )
).result()

checkpoint = training_client.save_weights_for_sampler("step_0010").result()
hotload_deployment(checkpoint.path)  # your helper: point the serving deployment at the new checkpoint
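
The snippet above leaves compute_custom_objective and hotload_deployment to you. As one illustration only, the sketch below assumes each entry of logprobs_list is a per-token logprob tensor for one example and minimizes the mean negative log-likelihood over the batch; any differentiable scalar objective can take its place.

import torch

def compute_custom_objective(logprobs_list):
    # Illustrative assumption: each element is a 1-D torch tensor of per-token
    # logprobs for one example. Computes mean negative log-likelihood across the
    # batch; swap in advantage-weighted, preference, or regularized objectives.
    per_example_nll = torch.stack([-lp.sum() for lp in logprobs_list])
    return per_example_nll.mean()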

Operational guidance

  • Service-mode trainer jobs currently support full-parameter tuning only. When serviceMode=true, set training_config.lora_rank and Tinker client rank to 0.
  • Managed objective coverage today includes SFT managed jobs, DPO managed jobs, and managed reinforcement fine-tuning jobs.
  • Local starter scripts for custom-loop implementations are available for GRPO, off-policy GRPO, and DPO in the Python examples repository.
  • SFT is currently documented as a managed-job path (sft-example) rather than as a custom-loss local-loop starter.
  • Record train-step metrics, checkpoint IDs, and deployment revisions together.
  • Use fixed eval sets to compare policies across checkpoints (a minimal bookkeeping sketch follows this list).
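
A minimal bookkeeping sketch for the last two points: score a fixed eval set against the current deployment and write one record per checkpoint. sample_from_deployment and score are placeholders for your own inference call and task metric.

import json
import time

def evaluate_checkpoint(checkpoint_id, deployment_id, eval_prompts):
    # Sample the same fixed prompts after every checkpoint so scores stay comparable.
    completions = [sample_from_deployment(deployment_id, p) for p in eval_prompts]  # placeholder
    scores = [score(p, c) for p, c in zip(eval_prompts, completions)]               # placeholder
    record = {
        "timestamp": time.time(),
        "checkpoint_id": checkpoint_id,   # e.g. "step_0010"
        "deployment_id": deployment_id,   # e.g. "research-loop-serving"
        "mean_score": sum(scores) / len(scores),
    }
    with open("eval_log.jsonl", "a") as f:  # metrics, checkpoint, and deployment recorded together
        f.write(json.dumps(record) + "\n")
    return record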

Common pitfalls

  • Evaluating against stale deployments can hide regressions.
  • Under-specified checkpoint metadata makes successful runs hard to reproduce.