What this is

Training loops commonly pair with dedicated deployments that act as sampling and evaluation endpoints. For on-policy training such as GRPO, the deployment is hotloaded with the latest policy weights so that sampled completions always come from the current model.

Creating a hotload-enabled deployment

import os
from fireworks.training.sdk import DeploymentManager, DeploymentConfig

api_key = os.environ["FIREWORKS_API_KEY"]
account_id = os.environ.get("FIREWORKS_ACCOUNT_ID", "")
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")

deploy_mgr = DeploymentManager(
    api_key=api_key,
    account_id=account_id,
    base_url=base_url,
)

deploy_info = deploy_mgr.create_or_get(DeploymentConfig(
    deployment_id="research-loop-serving",
    base_model="accounts/fireworks/models/qwen3-8b",
    region="US_VIRGINIA_1",
    min_replica_count=0,
    max_replica_count=1,
))
deploy_info = deploy_mgr.wait_for_ready("research-loop-serving")

DeploymentConfig parameters

  • deployment_id (str, required): stable ID per experiment family.
  • base_model (str, required): must match the trainer base model for hotload compatibility.
  • deployment_shape (str | None, default None): deployment shape resource name; overrides accelerator/region.
  • region (str, default "US_VIRGINIA_1"): region for the deployment.
  • min_replica_count (int, default 0): minimum replicas; set 0 to scale to zero when idle.
  • max_replica_count (int, default 1): maximum replicas for autoscaling.
  • accelerator_type (str, default "NVIDIA_H200_141GB"): accelerator type.
  • hot_load_bucket_type (str | None, default "FW_HOSTED"): hotload storage backend.
  • skip_shape_validation (bool, default False): bypass deployment shape validation.
  • extra_args (list[str] | None, default None): extra serving arguments.
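As an illustration of deployment_shape taking precedence over accelerator/region, a config might look like the following sketch. The shape resource name below is a made-up placeholder, not a real shape:

```python
from fireworks.training.sdk import DeploymentConfig

# Hypothetical: with deployment_shape set, accelerator_type and region
# are taken from the shape rather than from the config fields.
shaped_config = DeploymentConfig(
    deployment_id="research-loop-serving",
    base_model="accounts/fireworks/models/qwen3-8b",
    deployment_shape="accounts/fireworks/deploymentShapes/example-shape",  # placeholder
    min_replica_count=0,
    max_replica_count=1,
)
```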

DeploymentManager constructor

DeploymentManager supports separate URLs for control-plane, inference, and hotload traffic:
deploy_mgr = DeploymentManager(
    api_key=api_key,
    account_id=account_id,
    base_url=base_url,            # Control-plane URL for deployment CRUD
    inference_url=base_url,       # Gateway URL for inference completions (defaults to base_url)
    hotload_api_url=base_url,     # Gateway URL for hotload operations (defaults to base_url)
)

Inspecting deployment status

current = deploy_mgr.get("research-loop-serving")
print(current.state if current else "MISSING")
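wait_for_ready already blocks until the deployment is serving, but the state field also lets you build your own readiness poll. A minimal sketch, assuming get() returns an object with a state attribute as in the snippet above (the state name "READY" and the helper itself are our assumptions, not SDK API):

```python
import time

def wait_for_state(get_deployment, target_state="READY", timeout_s=600, poll_s=10):
    """Poll get_deployment() until it reports target_state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        current = get_deployment()
        if current is not None and current.state == target_state:
            return current
        time.sleep(poll_s)
    raise TimeoutError(f"deployment did not reach {target_state} within {timeout_s}s")
```

For the running example, this would be called as wait_for_state(lambda: deploy_mgr.get("research-loop-serving")).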

Linking deployment to RLOR trainer

When creating an RLOR trainer job, set hot_load_deployment_id so the trainer knows where to upload checkpoints:
from fireworks.training.sdk import TrainerJobManager, TrainerJobConfig

rlor_mgr = TrainerJobManager(api_key=api_key, account_id=account_id, base_url=base_url)
endpoint = rlor_mgr.create_and_wait(TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    lora_rank=0,
    hot_load_deployment_id="research-loop-serving",
))
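Because hotload requires the deployment's base_model to match the trainer's, a small guard before job creation can surface mismatches early instead of at checkpoint-upload time. A minimal sketch; the helper is ours, not part of the SDK:

```python
def assert_hotload_compatible(deployment_base_model: str, trainer_base_model: str) -> None:
    """Raise early if the trainer and deployment disagree on base model."""
    if deployment_base_model != trainer_base_model:
        raise ValueError(
            f"hotload mismatch: deployment uses {deployment_base_model!r}, "
            f"trainer uses {trainer_base_model!r}"
        )

# Both sides of the running example use the same base model, so this passes.
assert_hotload_compatible(
    "accounts/fireworks/models/qwen3-8b",
    "accounts/fireworks/models/qwen3-8b",
)
```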

Sampling from the deployment

For training/eval loops that need token IDs and logprobs, use DeploymentSampler:
from transformers import AutoTokenizer
from fireworks.training.sdk import DeploymentSampler

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
sampler = DeploymentSampler(
    inference_url=deploy_mgr.inference_url,
    model=f"accounts/{account_id}/deployments/research-loop-serving",
    api_key=api_key,
    tokenizer=tokenizer,
)
completions = sampler.sample_with_tokens(
    messages=[{"role": "user", "content": "Solve: What is 15 + 27?"}],
    n=1,
    max_tokens=1024,
    temperature=0.7,
)
print(completions[0].text)
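In a GRPO loop, each prompt is typically sampled with n > 1 and rewards are normalized within the group of completions. A minimal sketch of that normalization; the reward scheme shown is illustrative and not part of this SDK:

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: subtract the group mean reward, scale by group std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. four samples of the arithmetic prompt above, rewarded 1.0 when "42" appears
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = grpo_advantages(rewards)
```

The token IDs and logprobs returned by sample_with_tokens pair with these advantages on the training side.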

Scaling to zero

Release GPU resources without deleting the deployment:
deploy_mgr.scale_to_zero("research-loop-serving")
This sets both minReplicaCount and maxReplicaCount to 0, releasing all accelerators while keeping the deployment resource available for future scale-up.
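To scale back up later, one option is to call create_or_get again with nonzero replica counts. This sketch assumes create_or_get applies updated replica counts to an existing deployment, which is worth verifying against your SDK version:

```python
from fireworks.training.sdk import DeploymentConfig

# Hypothetical scale-up: reuse the same deployment_id with nonzero replicas.
deploy_mgr.create_or_get(DeploymentConfig(
    deployment_id="research-loop-serving",
    base_model="accounts/fireworks/models/qwen3-8b",
    min_replica_count=0,
    max_replica_count=1,
))
deploy_mgr.wait_for_ready("research-loop-serving")
```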

Operational guidance

  • Keep deployment IDs stable per experiment family for easier rollbacks and metric comparisons.
  • Use min_replica_count=0 for development to avoid idle GPU costs.
  • Use scale_to_zero after training completes as a lighter-weight alternative to deleting the deployment; it can be scaled back up later without recreation.
  • Create the deployment before the trainer so the trainer can be linked at creation time.
  • Use deployment_shape when the control plane has a pre-validated shape for your model — it auto-configures accelerator type, world size, and serving args.
  • Delete deployments when experiments are done (see Cleanup).