What this is

Fireworks Training SDK gives teams a flexibility ladder from managed jobs to fully custom training loops. Start managed for standard objectives, then move to Tinker-compatible loops when you need custom losses, full-parameter updates, and tighter experiment control.
| Mode | Best for | Objective control | Infrastructure |
| --- | --- | --- | --- |
| Managed jobs (SFT, DPO, RFT) | Standard objectives, fast iteration | Platform-defined | Fully managed |
| Cookbook recipes | GRPO/DPO/SFT/ORPO with config-driven customization | Fork and modify | You configure, platform runs GPUs |
| SDK loops | Custom losses, algorithm research | Full Python control | You drive the loop, platform runs GPUs |

Who does what (SDK loops)

When using the SDK directly, this is the responsibility split:
| Fireworks handles | You implement |
| --- | --- |
| GPU provisioning and cluster management | Training loop logic (`forward_backward_custom` + `optim_step`) |
| Service-mode trainer lifecycle (create, health-check, reconnect, delete) | Loss function and batch construction (`tinker.Datum` objects, custom objectives) |
| Checkpoint storage and export (`save_weights_for_sampler_ext`, DCP snapshots) | Reward signals and evaluation logic (sample from deployment, score responses) |
| Inference deployment and hotloading (checkpoint to live serving) | Hyperparameter tuning (learning rate, grad accum, context length) |
| Preemption recovery and job resume (transparent reconnect) | Data pipeline and dataset preparation |
| Distributed training (multi-node, sharding, FSDP) | Experiment tracking and logging (W&B, custom metrics) |
With cookbook recipes, the “You implement” column is largely handled for you — the cookbook provides ready-to-run training loops, loss functions, reward scoring, and checkpointing out of the box. You bring your data and config. See the Quickstart to get running in minutes with either path.

System architecture

A control-plane API provisions trainer and deployment resources. Your local Python loop connects to the trainer service, runs custom train steps, and periodically exports checkpoints to a serving deployment for sampling and evaluation.
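That flow can be sketched end to end. This is illustrative only: the `TrainerJobManager` calling convention, the `DeploymentManager` constructor arguments, the attribute names on the returned objects, and the `FIREWORKS_API_KEY` / `FIREWORKS_ACCOUNT_ID` environment variables are all assumptions, not confirmed API.

```python
import os

def provision_and_connect(job_config, deploy_config):
    """Sketch: provision trainer + deployment, then connect the local loop.

    Attribute names on `job` and `deployment` below are assumptions; check
    the SDK source for the real surface.
    """
    # Deferred import so this file parses without the SDK installed.
    from fireworks.training.sdk import (
        DeploymentManager,
        FiretitanServiceClient,
        TrainerJobManager,
    )

    api_key = os.environ["FIREWORKS_API_KEY"]

    # Control plane: create a service-mode trainer and poll until healthy.
    job = TrainerJobManager.create_and_wait(job_config)

    # Control plane: create (or reuse) a deployment for sampling and hotload.
    deploy_mgr = DeploymentManager(
        api_key=api_key,
        account_id=os.environ["FIREWORKS_ACCOUNT_ID"],
        base_url="https://api.fireworks.ai",
    )
    deployment = deploy_mgr.create_or_get(deploy_config)
    deploy_mgr.wait_for_ready(deployment.deployment_id)  # attribute name assumed

    # Data plane: your local Python loop talks to the trainer service directly.
    service = FiretitanServiceClient(job.endpoint_url, api_key)  # attribute name assumed
    client = service.create_training_client(job_config.base_model, job_config.lora_rank)
    return job, deployment, client
```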

Key APIs

SDK APIs (from fireworks.training.sdk import ...)

| API | Purpose |
| --- | --- |
| `TrainerJobManager.create_and_wait(config)` | Create a service-mode trainer and poll until healthy |
| `TrainerJobManager.wait_for_existing(job_id)` | Wait for an already-existing trainer job to reach RUNNING |
| `TrainerJobManager.resume_and_wait(job_id)` | Resume a failed/cancelled/paused job and wait |
| `TrainerJobManager.reconnect_and_wait(job_id)` | Reconnect to a preempted/failed job (handles transitional states) |
| `TrainerJobManager.resolve_training_profile(shape_id)` | Fetch training shape config from the control plane |
| `TrainerJobManager.delete(job_id)` | Delete a trainer job |
| `DeploymentManager.create_or_get(config)` | Create or reuse an inference deployment for sampling/hotload |
| `DeploymentManager.wait_for_ready(deployment_id)` | Poll until deployment is READY |
| `DeploymentManager.scale_to_zero(deployment_id)` | Scale to zero replicas without deleting |
| `DeploymentManager.delete(deployment_id)` | Delete a deployment |
| `FiretitanServiceClient(base_url, api_key)` | Connect to a trainer endpoint (extends tinker `ServiceClient`) |
| `service.create_training_client(base_model, lora_rank)` | Create a `FiretitanTrainingClient` with checkpoint extensions |
| `client.forward(datums, loss_type)` | Forward pass only (e.g. for reference logprobs) |
| `client.forward_backward_custom(datums, loss_fn)` | Forward + backward with your custom loss |
| `client.optim_step(tinker.AdamParams(...))` | Apply optimizer update |
| `client.save_weights_for_sampler_ext(name, checkpoint_type)` | Export serving-compatible checkpoint with session-scoped naming |
| `client.save_state(name, ttl_seconds)` | Save full train state (weights + optimizer) for resume |
| `client.load_state_with_optimizer(name)` | Restore train state for resume |
| `client.list_checkpoints()` | List available DCP checkpoints from the trainer |
| `client.resolve_checkpoint_path(name, source_job_id)` | Resolve checkpoint input for cross-job resume |
| `DeploymentSampler(inference_url, model, api_key, tokenizer)` | Client-side tokenized sampling from a deployment |
| `WeightSyncer(policy_client, deploy_mgr, ...)` | Manages checkpoint + hotload lifecycle with delta chaining |
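Pulling the client calls together, one optimizer step with gradient accumulation could be structured as below. This is a sketch against the surface above: the `ckpt_name` helper and all parameter plumbing are ours, and the exact positional/keyword conventions of the client methods are assumptions.

```python
def ckpt_name(step: int) -> str:
    """Serving-checkpoint name per step (naming scheme is ours, not the SDK's)."""
    return f"policy-step-{step:06d}"

def train_step(client, micro_batches, step, adam_params, loss_fn,
               checkpoint_type, export_every=50):
    """One optimizer step over several micro-batches.

    `client` is assumed to be a FiretitanTrainingClient; `micro_batches` is a
    list of tinker.Datum lists. The loss_fn contract is whatever
    forward_backward_custom expects, which this page does not fully specify.
    """
    for datums in micro_batches:
        # Forward + backward with the custom loss; gradients accumulate on the
        # trainer across micro-batches.
        client.forward_backward_custom(datums, loss_fn)
    # Apply the Adam update once per accumulation window.
    client.optim_step(adam_params)
    if step % export_every == 0:
        # Export a serving-compatible checkpoint for hotload and evaluation.
        client.save_weights_for_sampler_ext(ckpt_name(step), checkpoint_type)
```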

Cookbook helpers (from training.utils import ...)

Requires the cookbook to be installed. These wrap the SDK APIs above.
| API | Purpose |
| --- | --- |
| `InfraConfig` | GPU, region, and training shape settings (wraps `TrainerJobConfig`) |
| `DeployConfig` | Deployment settings (wraps `DeploymentConfig`) |
| `HotloadConfig` | Checkpoint and weight-sync intervals |
| `WandBConfig` | Weights & Biases logging settings |
| `create_trainer_job(rlor_mgr, ...)` | Create trainer with shape resolution and validation |
| `setup_deployment(deploy_mgr, ...)` | Create or reuse a deployment with cookbook config |
| `ReconnectableClient` | Training client wrapper with auto-reconnect on preemption |
| `checkpoint_utils.save_checkpoint(client, name, log_path, ...)` | Save train state + append to checkpoints.jsonl for resume |
| `checkpoint_utils.resolve_resume(client, log_path)` | Load last checkpoint from checkpoints.jsonl and restore state |
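A resume-aware outer loop built on these helpers might look like the sketch below. It requires the cookbook to be installed, and it assumes `resolve_resume` restores trainer state and returns the last completed step (falsy on a fresh run); the actual return contract may differ.

```python
def run_with_resume(client, make_batches, total_steps, adam_params, loss_fn,
                    log_path="run/checkpoints", save_every=100):
    """Outer loop sketch: restore from checkpoints.jsonl, train, save periodically."""
    # Deferred import: requires the cookbook install (training.utils).
    from training.utils import checkpoint_utils

    last = checkpoint_utils.resolve_resume(client, log_path)  # return value assumed
    start = (last + 1) if last else 0

    for step in range(start, total_steps):
        for datums in make_batches(step):
            client.forward_backward_custom(datums, loss_fn)
        client.optim_step(adam_params)
        if step and step % save_every == 0:
            # Full train state (weights + optimizer) plus a checkpoints.jsonl entry.
            checkpoint_utils.save_checkpoint(client, f"state-{step}", log_path)
```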

TrainerJobConfig reference

TrainerJobManager.create_and_wait(...) accepts a TrainerJobConfig with these fields:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `base_model` | str | (required) | Base model name (e.g. `"accounts/fireworks/models/qwen3-8b"`) |
| `lora_rank` | int | `0` | LoRA rank. `0` for full-parameter tuning, or a positive integer (e.g. 16, 64) for LoRA |
| `max_context_length` | int | `4096` | Maximum sequence length |
| `learning_rate` | float | `1e-5` | Learning rate for the optimizer |
| `gradient_accumulation_steps` | int | `1` | Number of micro-batches before an optimizer step |
| `node_count` | int | `1` | Number of trainer nodes |
| `display_name` | str \| None | `None` | Human-readable trainer name |
| `hot_load_deployment_id` | str \| None | `None` | Deployment ID used for checkpoint hotload workflows |
| `region` | str \| None | `None` | Region for the job (e.g. `"US_VIRGINIA_1"`). Auto-resolved when using training shapes. |
| `custom_image_tag` | str \| None | `None` | Override trainer image tag |
| `extra_args` | list[str] \| None | `None` | Extra trainer arguments |
| `accelerator_type` | str \| None | `None` | Accelerator type override |
| `accelerator_count` | int \| None | `None` | Accelerator count override |
| `skip_validations` | bool | `False` | Bypass control-plane validation checks |
| `forward_only` | bool | `False` | Create a forward-only trainer (reference model pattern) |
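As a concrete starting point, a minimal LoRA configuration might look like this, shown as plain keyword arguments you would pass to `TrainerJobConfig(**trainer_kwargs)`. The values are illustrative, not recommendations.

```python
# Illustrative values; omitted fields fall back to the defaults in the table.
trainer_kwargs = dict(
    base_model="accounts/fireworks/models/qwen3-8b",  # required
    lora_rank=16,                   # 0 would mean full-parameter tuning
    max_context_length=8192,
    learning_rate=2e-5,
    gradient_accumulation_steps=4,  # 4 micro-batches per optimizer step
    display_name="sdk-loop-demo",
    # region is left unset: it is auto-resolved when using training shapes
)
```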

DeploymentConfig reference

DeploymentManager.create_or_get(...) accepts a DeploymentConfig with these fields:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `deployment_id` | str | (required) | Stable deployment identifier |
| `base_model` | str | (required) | Base model name. Must match the trainer's base model for hotload compatibility. |
| `deployment_shape` | str \| None | `None` | Deployment shape resource name (overrides accelerator/region) |
| `region` | str | `"US_VIRGINIA_1"` | Region for the deployment |
| `min_replica_count` | int | `0` | Minimum replicas (set 0 to scale to zero when idle) |
| `max_replica_count` | int | `1` | Maximum replicas for autoscaling |
| `accelerator_type` | str | `"NVIDIA_H200_141GB"` | Accelerator type |
| `hot_load_bucket_type` | str \| None | `"FW_HOSTED"` | Hotload storage backend |
| `skip_shape_validation` | bool | `False` | Bypass deployment shape validation |
| `extra_args` | list[str] \| None | `None` | Extra serving arguments |
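A matching deployment configuration, again as plain keyword arguments for `DeploymentConfig(**deploy_kwargs)` with illustrative values:

```python
# Illustrative values; omitted fields fall back to the defaults in the table.
deploy_kwargs = dict(
    deployment_id="sdk-loop-eval",   # stable identifier, required
    base_model="accounts/fireworks/models/qwen3-8b",  # must match the trainer
    min_replica_count=0,             # scale to zero when idle
    max_replica_count=1,
    hot_load_bucket_type="FW_HOSTED",
)
```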

DeploymentManager constructor

DeploymentManager supports separate URLs for control-plane, inference, and hotload operations:
```python
deploy_mgr = DeploymentManager(
    api_key=api_key,
    account_id=account_id,
    base_url="https://api.fireworks.ai",         # Control-plane URL (deployment CRUD)
    inference_url="https://api.fireworks.ai",    # Gateway URL for inference completions (defaults to base_url)
    hotload_api_url="https://api.fireworks.ai",  # Gateway URL for hotload operations (defaults to base_url)
)
```
For most users, all three default to base_url. Separate URLs are useful when the control-plane and gateway have different endpoints (e.g. personal dev gateways).

Operational guidance

  • Service mode supports both full-parameter and LoRA tuning. Set lora_rank=0 for full-parameter or a positive integer (e.g. 16, 64) for LoRA, and match create_training_client(lora_rank=...) accordingly.
  • Cookbook recipes: Use training.recipes.rl_loop (GRPO/DAPO/GSPO/CISPO), dpo_loop, orpo_loop, and sft_loop from the cookbook repo as reference implementations.
  • Training shapes: Use TrainerJobManager.resolve_training_profile(shape_id) to auto-populate infra config (region, accelerator, image tag, node count) from the control plane instead of setting them manually.
  • Preemption handling: Use reconnect_and_wait(job_id) to resume preempted trainer jobs — it handles transitional states (CREATING, DELETING) by polling until the job reaches a resumable state.
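As a sketch of the preemption guidance, the step call can be wrapped in a reconnect-and-retry helper. The exception type raised when a trainer is preempted is not specified on this page, so this catches broadly, and the backoff policy is ours.

```python
import time

def step_with_reconnect(job_id, do_step, max_retries=5):
    """Run one train step, reconnecting to the trainer job on failure."""
    for attempt in range(max_retries):
        try:
            return do_step()
        except Exception:
            # Likely preemption or a dropped connection: wait for the job to
            # leave transitional states (CREATING/DELETING) and reach RUNNING.
            from fireworks.training.sdk import TrainerJobManager
            TrainerJobManager.reconnect_and_wait(job_id)
            time.sleep(2 ** attempt)  # simple backoff before retrying the step
    raise RuntimeError(f"train step failed after {max_retries} reconnect attempts")
```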

Common pitfalls

  • Evaluating against stale deployments can hide regressions — always verify the hotloaded checkpoint identity.
  • Under-specified checkpoint metadata makes successful runs hard to reproduce — log step numbers, checkpoint names, and deployment revisions together.
  • Mixing managed-job fields (for example epochs, batch_size) into TrainerJobConfig — these are separate APIs and are ignored by the training SDK manager layer.