On-demand deployments
Deploying on your own GPUs
Fireworks allows you to create on-demand, dedicated deployments that are reserved for your own use. This has several advantages over the shared deployments Fireworks uses for its serverless models:
- Predictable performance unaffected by load caused by other users
- No hard rate limits - but subject to the maximum load capacity of the deployment
- Cheaper under high utilization
- Access to a larger selection of models that are not available serverlessly
- Custom base models from Hugging Face files
Quickstart
Choose a model
See the “All models” list on our Models page for a list of pre-uploaded models on the Fireworks AI platform. You can also use a custom base model.
Create a deployment
To create a new deployment of a model provided by Fireworks, run:
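A minimal sketch, assuming the `firectl create deployment` subcommand and a Fireworks-provided model (substitute the model ID you chose above):

```bash
# Create a dedicated deployment and block until it reaches the READY state.
firectl create deployment accounts/fireworks/models/<MODEL_ID> --wait
```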
This command will complete when the deployment is `READY`. To let it run asynchronously, remove the `--wait` flag.
`accounts/fireworks/models/<MODEL_ID>` is an example of a `<MODEL_NAME>`. Read more about model names.

To create a new deployment using a custom base model, follow the Uploading custom models guide to first upload your custom base model to the Fireworks platform. Then run:
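A minimal sketch, assuming the same `firectl create deployment` subcommand, this time pointing at the custom model uploaded to your own account:

```bash
# Deploy a previously uploaded custom base model.
firectl create deployment accounts/<ACCOUNT_ID>/models/<MODEL_ID> --wait
```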
The new deployment will be assigned a name in the format `accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>`.

Verify the deployment is running
You can verify the deployment is complete by running:
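A minimal sketch, assuming a `firectl get deployment` subcommand (substitute your deployment ID):

```bash
# Inspect the deployment; the output should include its current state.
firectl get deployment <DEPLOYMENT_ID>
```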
The `state` field should show `READY`.
Query the deployment
To query a specific deployment, use the model identifier in the format `<MODEL_NAME>#<DEPLOYMENT_NAME>`.
In most cases, the model identifier follows this pattern: `accounts/<ACCOUNT_ID>/models/<MODEL_ID>` + `#` + `accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>`
Example:
The model identifier for querying Llama 3.2 3B Instruct (listed as `accounts/fireworks/models/llama-v3p2-3b-instruct`) for Acme Inc.'s deployment (deployment ID `12ab34cd56ef`) would be:
`accounts/fireworks/models/llama-v3p2-3b-instruct#accounts/acmeInc/deployments/12ab34cd56ef`
Sample Request:
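A minimal sketch using the OpenAI-compatible chat completions endpoint; the API key environment variable and the prompt are illustrative, and the model identifier is the Acme Inc. example above:

```bash
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -d '{
    "model": "accounts/fireworks/models/llama-v3p2-3b-instruct#accounts/acmeInc/deployments/12ab34cd56ef",
    "messages": [{"role": "user", "content": "Say hello!"}]
  }'
```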
Tear down the deployment
By default, deployments will automatically scale down to zero replicas if unused (i.e. no inference requests) for 1 hour, and automatically delete themselves if unused for one week.
To completely delete the deployment, run:
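A minimal sketch, assuming a `firectl delete deployment` subcommand:

```bash
# Permanently delete the deployment.
firectl delete deployment <DEPLOYMENT_ID>
```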
Notes:
- Make sure you include the `#<DEPLOYMENT_NAME>` suffix in the model identifier when querying a specific deployment.
- If you are unsure about the model identifier format, refer to the Model Identifiers section for more details and alternatives.
Deployment options
Replica count (horizontal scaling)
The number of replicas (horizontal scaling) is specified by passing the `--min-replica-count` and `--max-replica-count` flags. Increasing the number of replicas will increase the maximum QPS the deployment can support. The deployment will automatically scale based on server load.

The default value for `--min-replica-count` is 0. Setting `--min-replica-count` to 0 enables the deployment to auto-scale to 0 if the deployment is unused (i.e. no inference requests) for a specified "scale-to-zero" time window. While the deployment is scaled to 0, you will not pay for any GPU utilization.

The default value for `--max-replica-count` is 1 if `--min-replica-count=0`, or the value of `--min-replica-count` otherwise.
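For example, the following sketch keeps at least one replica warm at all times and allows scaling out to three replicas under load (the values are illustrative):

```bash
firectl create deployment accounts/fireworks/models/<MODEL_ID> \
  --min-replica-count=1 \
  --max-replica-count=3
```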
Customizing autoscaling behavior
You can customize certain aspects of the deployment’s autoscaling behavior by setting the following flags:
- `--scale-up-window`: The duration the autoscaler will wait before scaling up a deployment after observing increased load. Default is `30s`.
- `--scale-down-window`: The duration the autoscaler will wait before scaling down a deployment after observing decreased load. Default is `10m`.
- `--scale-to-zero-window`: The duration of inactivity (no requests) after which the deployment will be scaled down to zero replicas. This is ignored if `--min-replica-count` is greater than 0. Default is `1h`. The minimum is `5m`.

There will be a cold-start latency (up to a few minutes) for requests made while the deployment is scaling from 0 to 1 replicas. A deployment with `--min-replica-count` set to 0 will be automatically deleted if it receives no traffic for 7 days.
Refer to time.ParseDuration for valid syntax for the duration string.
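For example, the following sketch tunes an existing deployment to scale up faster and wait half an hour before scaling to zero (assuming a `firectl update deployment` subcommand; the values are illustrative):

```bash
firectl update deployment <DEPLOYMENT_ID> \
  --scale-up-window=15s \
  --scale-down-window=5m \
  --scale-to-zero-window=30m
```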
Multiple GPUs (vertical scaling)
The number of GPUs used per replica is specified by passing the `--accelerator-count` flag. Increasing the accelerator count will improve generation speed, time-to-first-token, and maximum QPS for your deployment, though the scaling is sub-linear. The default value for most models is 1, but may be higher for larger models that require sharding.
Choosing hardware type
By default, a deployment will use NVIDIA A100 80 GB GPUs. You can also deploy using NVIDIA H100 80 GB or AMD MI300X GPUs by passing the `--accelerator-type` flag. Valid values for `--accelerator-type` are:
- `NVIDIA_H100_80GB`
- `NVIDIA_A100_80GB`
- `AMD_MI300X_192GB` (note that MoE-based models like DeepSeek Coder and Mixtral are currently not supported on MI300X)
For advice on choosing a hardware type, see this FAQ.
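For example, the following sketch creates a deployment with two H100 GPUs per replica (the accelerator count is illustrative):

```bash
firectl create deployment accounts/fireworks/models/<MODEL_ID> \
  --accelerator-type=NVIDIA_H100_80GB \
  --accelerator-count=2
```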
Model based speculative decoding
Model based speculative decoding allows you to speed up output generation in some cases, by using a smaller model to assist the larger model in generation.
We offer the following settings that can be set as flags in firectl, our CLI tool:
- `--draft-model` (string): To use a draft model for speculative decoding, set this flag to the name of the draft model you want to use. See the table below for recommendations on draft models to use for popular model families. Note that draft models can be standalone models (referenced from the Fireworks account or custom models uploaded to your account) or an add-on (e.g. Eagle).
- `--draft-token-count` (int32): When using a draft model, set this flag to the number of tokens to generate per step for speculative decoding. Setting `--draft-token-count=0` turns off draft model speculation for the deployment. As a rough guideline, use `--draft-token-count=3` for Eagle draft models and `--draft-token-count=4` for other draft models.
- `--ngram-speculation-length` (int32): To use N-gram based speculation, set this flag to the length of the previous input sequence to be considered for N-gram speculation.

`--draft-token-count` must be set when `--draft-model` or `--ngram-speculation-length` is used. `--draft-model` and `--ngram-speculation-length` cannot be used together, as they are alternative approaches to speculation; setting both will throw an error.

You can use the following draft models directly:
| Draft model name | Recommended for |
|---|---|
| accounts/fireworks/models/llama-v3p2-1b-instruct | All Llama models > 3B |
| accounts/fireworks/models/qwen2p5-0p5b-instruct | All Qwen models > 3B |
| accounts/fireworks/models/eagle-llama-v3-3b-instruct-v2 | Llama 3.2 3B |
| accounts/fireworks/models/eagle-qwen-v2p5-3b-instruct-v2 | Qwen 2.5 3B |
| accounts/fireworks/models/eagle-llama-v3-8b-instruct-v2 | Llama 3.1 8B, Llama 3.0 8B |
| accounts/fireworks/models/eagle-qwen-v2p5-7b-instruct-v2 | Qwen 2.5 7B |
Here’s an example of deploying Llama 3.3 70B with a draft model:
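(A sketch; the Llama 3.3 70B model ID below is assumed, and the 1B draft model and token count follow the recommendations above.)

```bash
firectl create deployment accounts/fireworks/models/llama-v3p3-70b-instruct \
  --draft-model=accounts/fireworks/models/llama-v3p2-1b-instruct \
  --draft-token-count=4
```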
In most cases, speculative decoding does not change the quality of the output generated (mathematically, outputs are unchanged, but there might be numerical differences, especially at higher temperatures). If speculation is used on the deployment and you want to verify the output is unchanged, you can set `disable_speculation=True` in the inference API call. In this case, the draft model is still called but its output is not used, so performance will be impacted.
Quantization
By default, models on dedicated deployments are served using 16-bit floating-point (FP16) precision. Quantization reduces the number of bits used to serve the model, improving performance and reducing the cost to serve. However, it changes model numerics, which may introduce small changes to the output.
In order to deploy a base model using quantization, it must be prepared first. See our Quantization guide for details.
To create a deployment using a quantized model, pass the `--precision` flag with the desired precision.
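A minimal sketch, assuming the model has been prepared for FP8 and that `FP8` is the accepted value for the flag:

```bash
firectl create deployment accounts/<ACCOUNT_ID>/models/<MODEL_ID> --precision=FP8
```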
Optimizing your deployments for long context
By default, a balanced deployment will be created using the hardware resources you specify. Higher performance can be achieved for long-prompt-length (>~3000 tokens) workloads by passing the `--long-prompt` flag.

If `--accelerator-count` is not specified, then a deployment using twice the minimum number of GPUs (needed to serve without `--long-prompt`) will be created. To update a deployment to disable this option, pass `--long-prompt=false`.
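For example, the following sketch creates a long-prompt-optimized deployment (the model ID is a placeholder):

```bash
firectl create deployment accounts/fireworks/models/<MODEL_ID> --long-prompt
```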
Additional optimization options are available through our enterprise plan.
Deploying LoRA addons
By default, LoRA addons are disabled for deployments. To enable addons, pass the `--enable-addons` flag:
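A minimal sketch, showing the flag at deployment creation time (substitute the base model you are deploying):

```bash
firectl create deployment accounts/fireworks/models/<MODEL_ID> --enable-addons
```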
See Uploading a custom model for instructions on how to upload custom LoRA addons. To deploy a LoRA addon to an on-demand deployment, pass the `--deployment` flag to `firectl deploy`. For example:
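(A sketch; the exact argument shapes for `firectl deploy` are assumed.)

```bash
# Deploy a previously uploaded LoRA addon onto an existing on-demand deployment.
firectl deploy accounts/<ACCOUNT_ID>/models/<LORA_MODEL_ID> \
  --deployment accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>
```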
Pricing
On-demand deployments are billed by GPU-second. Consult our pricing page for details.