Fireworks allows you to create on-demand, dedicated deployments that are reserved for your own use. This has several advantages over the shared deployments Fireworks uses for its serverless models:

  • Predictable performance unaffected by load caused by other users
  • No hard rate limits - but subject to the maximum load capacity of the deployment
  • Cheaper under high utilization
  • Access to a larger selection of models that are not available via our serverless offering
  • Support for custom base models from Hugging Face files
If you plan on using a significant number of on-demand deployments, consider purchasing reserved capacity for cheaper pricing and higher GPU quotas.

Quickstart

1. Choose a model

See the “All models” list on our Models page for the models pre-uploaded to the Fireworks AI platform. You can also use a custom base model.

2. Create a deployment

To create a new deployment of a model provided by Fireworks, run:

firectl create deployment accounts/fireworks/models/<MODEL_ID> --wait

This command will complete when the deployment is READY. To let it run asynchronously, remove the --wait flag.

The string accounts/fireworks/models/<MODEL_ID> is an example of a <MODEL_NAME>. Read more about model names.

To create a new deployment using a custom base model, follow the Uploading custom models guide to first upload your custom base model to the Fireworks platform. Then run:

firectl create deployment <MODEL_ID>
The deployment ID is the last part of the deployment name: accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>.

3. Verify the deployment is running

You can verify the deployment is complete by running:

firectl get deployment <DEPLOYMENT_ID>

The state field should show READY.

4. Query the deployment

To query a specific deployment, use the model identifier in the format: <MODEL_NAME>#<DEPLOYMENT_NAME>

In most cases, the model identifier follows this pattern:

accounts/<ACCOUNT_ID>/models/<MODEL_ID> + # + accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>

Example:

The model identifier for querying Llama 3.2 3B Instruct (listed as accounts/fireworks/models/llama-v3p2-3b-instruct) for Acme Inc.’s deployment (with deployment ID 12ab34cd56ef) would be:

accounts/fireworks/models/llama-v3p2-3b-instruct#accounts/acmeInc/deployments/12ab34cd56ef

Sample Request:

curl \
  --header 'Authorization: Bearer <FIREWORKS_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "accounts/fireworks/models/<MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    "prompt": "Say this is a test"
  }' \
  --url https://api.fireworks.ai/inference/v1/completions

5. Tear down the deployment

By default, deployments will automatically scale down to zero replicas if unused (i.e. no inference requests) for 1 hour, and will be automatically deleted if unused for one week.

To completely delete the deployment, run:

firectl delete deployment <DEPLOYMENT_ID>

Notes:

  • Make sure you include the #<DEPLOYMENT_NAME> in the model identifier when querying a specific deployment.
  • If you are unsure about the model identifier format, refer to the Model Identifiers section for more details and alternatives.

Deployment options

Replica count (horizontal scaling)

The number of replicas (horizontal scaling) is specified by passing the --min-replica-count and --max-replica-count flags. Increasing the number of replicas will increase the maximum QPS the deployment can support. The deployment will automatically scale based on server load.

Auto-scaling up may fail if there is a GPU stockout. Use reserved capacity to guarantee capacity for your deployments.

The default value for --min-replica-count is 0. Setting --min-replica-count to 0 enables the deployment to auto-scale to 0 if it is unused (i.e. no inference requests) for a specified “scale-to-zero” time window. While the deployment is scaled to 0, you will not pay for any GPU utilization.

The default value for --max-replica-count is 1 if --min-replica-count=0, or the value of --min-replica-count otherwise.
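
For example, a minimal sketch of a deployment that keeps one replica warm and can scale out to three replicas under load (the replica counts shown are illustrative):

firectl create deployment accounts/fireworks/models/<MODEL_ID> \
    --min-replica-count=1 \
    --max-replica-count=3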

Customizing autoscaling behavior

You can customize certain aspects of the deployment’s autoscaling behavior by setting the following flags:

  • --scale-up-window The duration the autoscaler will wait before scaling up a deployment after observing increased load. Default is 30s.

  • --scale-down-window The duration the autoscaler will wait before scaling down a deployment after observing decreased load. Default is 10m.

  • --scale-to-zero-window The duration with no requests after which the deployment will be scaled down to zero replicas. This is ignored if --min-replica-count is greater than 0. Default is 1h. The minimum is 5m.

    There will be a cold-start latency (up to a few minutes) for requests made while the deployment is scaling from 0 to 1 replica.
    A deployment with --min-replica-count set to 0 will be automatically deleted if it receives no traffic for 7 days.

Refer to time.ParseDuration for valid syntax for the duration string.
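
As an illustration, the following sketch tunes the autoscaling windows at deployment creation time (the durations shown are arbitrary examples):

firectl create deployment accounts/fireworks/models/<MODEL_ID> \
    --scale-up-window=1m \
    --scale-down-window=5m \
    --scale-to-zero-window=30m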

Multiple GPUs (vertical scaling)

The number of GPUs used per replica is specified by passing the --accelerator-count flag. Increasing the accelerator count will improve generation speed, time-to-first-token, and maximum QPS for your deployment; however, the scaling is sub-linear. The default value for most models is 1 but may be higher for larger models that require sharding.
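
For example, a sketch of a deployment using two GPUs per replica (the count is illustrative; larger models may need more):

firectl create deployment accounts/fireworks/models/<MODEL_ID> \
    --accelerator-count=2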

Choosing hardware type

By default, a deployment will use NVIDIA A100 80 GB GPUs. You can also deploy using NVIDIA H100 80 GB or AMD MI300X GPUs by passing the --accelerator-type flag. Valid values for --accelerator-type are:

  • NVIDIA_H100_80GB
  • NVIDIA_A100_80GB
  • AMD_MI300X_192GB - Note that MoE-based models like DeepSeek Coder and Mixtral are currently not supported on MI300X
See Regions for a list of accelerator availability by region. Region can be either specified or auto-selected for a deployment upon creation. After creation, the region cannot be changed. If you plan on changing the accelerator type, you may need to re-create the deployment in a new region where it is available.
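
For example, a sketch of creating a deployment on H100 GPUs instead of the default A100s:

firectl create deployment accounts/fireworks/models/<MODEL_ID> \
    --accelerator-type="NVIDIA_H100_80GB"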

For advice on choosing a hardware type, see this FAQ.

Quantization

By default, models on dedicated deployments are served using 16-bit floating-point (FP16) precision. Quantization reduces the number of bits used to serve the model, improving performance and reducing the cost to serve. However, this can change model numerics, which may introduce small changes to the output.

To deploy a base model using quantization, the model must first be prepared. See our Quantization guide for details.

To create a deployment using a quantized model, pass the --precision flag with the desired precision.

firectl create deployment <MODEL_NAME> \
    --accelerator-type="NVIDIA_H100_80GB" \
    --precision="FP8"
Quantized deployments can only be served using H100 GPUs.

Optimizing your deployments for long context

By default, a balanced deployment will be created using the hardware resources you specify. Higher performance can be achieved for long-prompt (>~3,000 token) workloads by passing the --long-prompt flag.

This option roughly doubles the amount of GPU memory required to serve the model and requires a minimum of two GPUs to be effective. If --accelerator-count is not specified, then a deployment using twice the minimum number of GPUs (to serve without --long-prompt) will be created.

To update a deployment to disable this option, pass --long-prompt=false.
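
For example, a minimal sketch of creating a long-prompt-optimized deployment (the accelerator count is illustrative):

firectl create deployment accounts/fireworks/models/<MODEL_ID> \
    --accelerator-count=2 \
    --long-prompt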

Additional optimization options are available through our enterprise plan.

Deploying PEFT addons

By default, PEFT addons are disabled for deployments. To enable addons, pass the --enable-addons flag.
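
For example, a sketch assuming the flag is set when the deployment is created:

firectl create deployment accounts/fireworks/models/<MODEL_ID> --enable-addons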

See Uploading a custom model for instructions on how to upload custom PEFT addons. To deploy a PEFT addon to an on-demand deployment, pass the --deployment-id flag to firectl deploy. For example:

firectl deploy <MODEL_ID> --deployment-id <DEPLOYMENT_ID>
The base model of the deployment must match the base model of the addon.

Pricing

On-demand deployments are billed by GPU-second. Consult our pricing page for details.