Overview

DeploymentManager manages the lifecycle of inference deployments that serve as sampling and weight sync targets during training. For on-policy training (GRPO), the deployment is hotloaded with the latest policy weights.
```python
from fireworks.training.sdk import DeploymentManager, DeploymentConfig
```

Constructor

DeploymentManager supports separate URLs for control-plane, inference, and hotload traffic:
```python
deploy_mgr = DeploymentManager(
    api_key="<FIREWORKS_API_KEY>",
    account_id="<ACCOUNT_ID>",
    base_url="https://api.fireworks.ai",        # Control-plane URL (deployment CRUD)
    inference_url="https://api.fireworks.ai",   # Gateway URL for inference (defaults to base_url)
    hotload_api_url="https://api.fireworks.ai", # Gateway URL for hotload ops (defaults to base_url)
)
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `api_key` | `str` | — | Fireworks API key |
| `account_id` | `str` | — | Fireworks account ID |
| `base_url` | `str` | `"https://api.fireworks.ai"` | Control-plane URL for deployment CRUD |
| `inference_url` | `str \| None` | `None` | Gateway URL for inference completions (defaults to `base_url`) |
| `hotload_api_url` | `str \| None` | `None` | Gateway URL for hotload operations (defaults to `base_url`) |
| `additional_headers` | `dict \| None` | `None` | Extra HTTP headers |
| `verify_ssl` | `bool \| None` | `None` | SSL verification override |
For most users the defaults are fine: inference_url and hotload_api_url fall back to base_url, so all three endpoints resolve to the same host. Separate URLs are useful when the control plane and gateway have different endpoints (e.g. personal dev gateways).

Methods

create_or_get(config, force_recreate=False)

Create a new deployment or retrieve an existing one. Set force_recreate=True to delete and recreate if it already exists:
```python
deploy_info = deploy_mgr.create_or_get(DeploymentConfig(
    deployment_id="research-loop-serving",
    base_model="accounts/fireworks/models/qwen3-8b",
    min_replica_count=0,
    max_replica_count=1,
))
```
Returns a DeploymentInfo.

wait_for_ready(deployment_id, timeout_s=600, poll_interval_s=15)

Poll until the deployment is ready to serve:
```python
deploy_mgr.wait_for_ready("research-loop-serving")
```
Returns a DeploymentInfo.
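
Under the hood this is a simple poll loop. A minimal sketch of the pattern (the `check` callable standing in for a deployment-state lookup is hypothetical, not part of the SDK):

```python
import time

def poll_until(check, timeout_s=600, poll_interval_s=15):
    """Generic poll loop in the spirit of wait_for_ready: call `check` until
    it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = check()          # e.g. fetch deployment state, test for READY
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout_s}s")
        time.sleep(poll_interval_s)
```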

get(deployment_id)

Inspect deployment status. Returns a DeploymentInfo or None if not found:
```python
current = deploy_mgr.get("research-loop-serving")
print(current.state if current else "MISSING")
```

hotload_and_wait(deployment_id, base_model, snapshot_identity, ...)

Load a checkpoint onto the deployment and wait for completion:
```python
deploy_mgr.hotload_and_wait(
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    snapshot_identity=result.snapshot_name,
    timeout_seconds=400,
)
```
For delta weight syncs, pass incremental_snapshot_metadata:
```python
deploy_mgr.hotload_and_wait(
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    snapshot_identity=delta_result.snapshot_name,
    incremental_snapshot_metadata={
        "previous_snapshot_identity": base_result.snapshot_name,
        "compression_format": "arc_v2",
        "checksum_format": "alder32",
    },
    timeout_seconds=400,
)
```

warmup(model)

Send a warmup request to the deployment after weight sync.
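For example (a sketch, not an SDK helper: `warmup_if_ready` is a hypothetical wrapper around the documented `get` and `warmup` calls):

```python
def warmup_if_ready(mgr, deployment_id):
    """Look up the deployment and, if it advertises an inference model
    string, send a single warmup request so the first sampling call after
    a weight sync does not pay cold-start latency."""
    info = mgr.get(deployment_id)
    if info is not None and info.inference_model:
        mgr.warmup(info.inference_model)
        return info.inference_model
    return None
```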

scale_to_zero(deployment_id)

Release GPU resources without deleting the deployment:
```python
deploy_mgr.scale_to_zero("research-loop-serving")
```
Sets both minReplicaCount and maxReplicaCount to 0.

delete(deployment_id)

Delete a deployment entirely:
```python
deploy_mgr.delete("research-loop-serving")
```

DeploymentConfig

DeploymentManager.create_or_get(...) accepts a DeploymentConfig dataclass. When deployment_shape is set, treat the shape as the source of truth for the deployment's hardware and serving configuration; in normal user-facing flows, do not override shape-owned hardware fields separately.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `deployment_id` | `str` | — | Stable deployment identifier |
| `base_model` | `str` | — | Base model name. Must match the trainer's base model for weight sync compatibility. |
| `deployment_shape` | `str \| None` | `None` | Deployment shape resource name. In normal shape-based flows, this owns the deployment's hardware and serving config. |
| `region` | `str \| None` | `None` | Region for the deployment. Leave unset when the deployment shape already determines placement. |
| `min_replica_count` | `int` | `0` | Minimum replicas (set 0 to scale to zero when idle) |
| `max_replica_count` | `int` | `1` | Maximum replicas for autoscaling |
| `accelerator_type` | `str` | `"NVIDIA_H200_141GB"` | Accelerator type. In normal shape-based flows, leave this unset and let `deployment_shape` own the hardware choice. |
| `hot_load_bucket_type` | `str \| None` | `"FW_HOSTED"` | Weight sync storage backend |
| `disable_speculative_decoding` | `bool` | `False` | Disable speculative decoding |
| `extra_args` | `list[str] \| None` | `None` | Extra serving arguments |
| `extra_values` | `dict \| None` | `None` | Extra deployment values |

DeploymentInfo

Returned by create_or_get, wait_for_ready, and get:
| Field | Type | Description |
| --- | --- | --- |
| `deployment_id` | `str` | Deployment identifier |
| `name` | `str` | Full resource name |
| `state` | `str` | Deployment state (e.g. `"READY"`, `"CREATING"`) |
| `hot_load_bucket_url` | `str \| None` | URL for weight sync storage |
| `inference_model` | `str \| None` | Model string for the completions API (`accounts/{account}/deployments/{id}`) |
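
The inference_model string follows the fixed shape documented above, so it can be reconstructed if needed (a sketch; the SDK already populates this field for you):

```python
def inference_model_name(account_id: str, deployment_id: str) -> str:
    """Build the documented model string for the completions API:
    accounts/{account}/deployments/{id}."""
    return f"accounts/{account_id}/deployments/{deployment_id}"
```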

Deployment shape and training shapes

When using a training shape, the linked deployment shape is determined by the training shape and cannot be changed. The training shape’s deploymentShapeVersion locks the GPU type, node count, and serving engine configuration for the inference deployment. The one thing you can adjust is the replica count. Use min_replica_count and max_replica_count to scale up throughput for sampling during RL loops:
```python
deploy_mgr.create_or_get(DeploymentConfig(
    deployment_id="rl-serving",
    base_model="accounts/fireworks/models/qwen3-8b",
    deployment_shape="accounts/fireworks/deploymentShapes/qwen3-8b-128k-h200",
    min_replica_count=1,
    max_replica_count=4,
))
```

Operational guidance

  • Keep deployment IDs stable per experiment family for easier rollbacks.
  • Use min_replica_count=0 for development to avoid idle GPU costs.
  • Create the deployment before the trainer so the trainer can be linked at creation time via hot_load_deployment_id.
  • Use deployment_shape when the control plane has a pre-validated shape for your model.
  • Do not treat shape-owned hardware as a user-facing override surface — in normal flows, leave accelerator_type and placement decisions to the deployment shape and only tune replica counts.
  • Use scale_to_zero after training as a lighter alternative to delete.
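
Putting the pieces together, one weight-sync cycle might look like the sketch below. `run_sync_cycle` is a hypothetical helper; every call it makes is from the API documented above:

```python
def run_sync_cycle(mgr, config, base_model, snapshot_identity):
    """One create -> wait -> hotload -> warmup cycle. `mgr` is a
    DeploymentManager (or any object exposing the same methods)."""
    info = mgr.create_or_get(config)          # idempotent: reuses the deployment if it exists
    mgr.wait_for_ready(config.deployment_id)  # block until it can serve
    mgr.hotload_and_wait(                     # sync the latest policy weights
        deployment_id=config.deployment_id,
        base_model=base_model,
        snapshot_identity=snapshot_identity,
    )
    if info.inference_model:
        mgr.warmup(info.inference_model)      # absorb cold-start latency now
    return info
```

After the final cycle of an experiment, following up with `mgr.scale_to_zero(config.deployment_id)` rather than `delete` keeps the deployment around for rollbacks, per the guidance above.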