On-demand deployments
Deploying on your own GPUs
Fireworks allows you to create on-demand, dedicated deployments that are reserved for your own use. This has several advantages over the shared deployments Fireworks uses for its serverless models:
- Predictable performance unaffected by load caused by other users
- No hard rate limits - but subject to the maximum load capacity of the deployment
- Cheaper under high utilization
- Access to a larger selection of models that are not available serverlessly
- Custom base models from Hugging Face files
Quickstart
Choose a model
See the “All models” list on our Models page for a list of pre-uploaded models on the Fireworks AI platform. You can also use a custom base model.
Create a deployment
To create a new deployment of a model provided by Fireworks, run:
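A minimal sketch, assuming the `firectl create deployment` subcommand and a Fireworks-provided model (substitute the model ID you chose above):

```bash
# Create a dedicated deployment and block until it reaches the READY state.
firectl create deployment accounts/fireworks/models/<MODEL_ID> --wait
```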
This command will complete when the deployment is `READY`. To let it run asynchronously, remove the `--wait` flag.
`accounts/fireworks/models/<MODEL_ID>` is an example of a `<MODEL_NAME>`. Read more about model names.

To create a new deployment using a custom base model, follow the Uploading custom models guide to first upload your custom base model to the Fireworks platform. Then run:
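A minimal sketch, assuming the same `firectl create deployment` subcommand, this time pointing at the custom model uploaded to your own account:

```bash
# Deploy a previously uploaded custom base model.
firectl create deployment accounts/<ACCOUNT_ID>/models/<MODEL_ID> --wait
```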
The new deployment will be assigned a name in the format `accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>`.

Verify the deployment is running
You can verify the deployment is complete by running:
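A minimal sketch, assuming a `firectl get deployment` subcommand (substitute your deployment ID):

```bash
# Inspect the deployment; the output should include its current state.
firectl get deployment <DEPLOYMENT_ID>
```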
The `state` field should show `READY`.
Query the deployment
To query a specific deployment, use the model identifier in the format `<MODEL_NAME>#<DEPLOYMENT_NAME>`.
In most cases, the model identifier follows this pattern: `accounts/<ACCOUNT_ID>/models/<MODEL_ID>` + `#` + `accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>`
Example:
The model identifier for querying Llama 3.2 3B Instruct (listed as `accounts/fireworks/models/llama-v3p2-3b-instruct`) for Acme Inc.'s deployment (deployment ID `12ab34cd56ef`) would be:
`accounts/fireworks/models/llama-v3p2-3b-instruct#accounts/acmeInc/deployments/12ab34cd56ef`
Sample Request:
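A minimal sketch using the OpenAI-compatible chat completions endpoint; the API key environment variable and the prompt are illustrative, and the model identifier is the Acme Inc. example above:

```bash
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -d '{
    "model": "accounts/fireworks/models/llama-v3p2-3b-instruct#accounts/acmeInc/deployments/12ab34cd56ef",
    "messages": [{"role": "user", "content": "Say hello!"}]
  }'
```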
Tear down the deployment
By default, deployments will automatically scale down to zero replicas if unused (i.e. no inference requests) for 1 hour, and automatically delete themselves if unused for one week.
To completely delete the deployment, run:
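A minimal sketch, assuming a `firectl delete deployment` subcommand:

```bash
# Permanently delete the deployment.
firectl delete deployment <DEPLOYMENT_ID>
```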
Notes:
- Make sure you include the `#<DEPLOYMENT_NAME>` suffix in the model identifier when querying a specific deployment.
- If you are unsure about the model identifier format, refer to the Model Identifiers section for more details and alternatives.
Deployment options
Replica count (horizontal scaling)
The number of replicas (horizontal scaling) is specified by passing the `--min-replica-count` and `--max-replica-count` flags. Increasing the number of replicas will increase the maximum QPS the deployment can support. The deployment will automatically scale based on server load.

The default value for `--min-replica-count` is 0. Setting `--min-replica-count` to 0 enables the deployment to auto-scale to 0 if the deployment is unused (i.e. no inference requests) for a specified "scale-to-zero" time window. While the deployment is scaled to 0, you will not pay for any GPU utilization.

The default value for `--max-replica-count` is 1 if `--min-replica-count=0`, or the value of `--min-replica-count` otherwise.
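For example, the following sketch keeps at least one replica warm at all times and allows scaling out to three replicas under load (the values are illustrative):

```bash
firectl create deployment accounts/fireworks/models/<MODEL_ID> \
  --min-replica-count=1 \
  --max-replica-count=3
```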
Customizing autoscaling behavior
You can customize certain aspects of the deployment’s autoscaling behavior by setting the following flags:
- `--scale-up-window`: The duration the autoscaler will wait before scaling up a deployment after observing increased load. Default is `30s`.
- `--scale-down-window`: The duration the autoscaler will wait before scaling down a deployment after observing decreased load. Default is `10m`.
- `--scale-to-zero-window`: The duration of inactivity (no requests) after which the deployment will be scaled down to zero replicas. This is ignored if `--min-replica-count` is greater than 0. Default is `1h`. The minimum is `5m`.

There will be a cold-start latency (up to a few minutes) for requests made while the deployment is scaling from 0 to 1 replicas. A deployment with `--min-replica-count` set to 0 will be automatically deleted if it receives no traffic for 7 days.
Refer to time.ParseDuration for valid syntax for the duration string.
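For example, the following sketch tunes an existing deployment to scale up faster and wait half an hour before scaling to zero (assuming a `firectl update deployment` subcommand; the values are illustrative):

```bash
firectl update deployment <DEPLOYMENT_ID> \
  --scale-up-window=15s \
  --scale-down-window=5m \
  --scale-to-zero-window=30m
```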
Multiple GPUs (vertical scaling)
The number of GPUs used per replica is specified by passing the `--accelerator-count` flag. Increasing the accelerator count will improve generation speed, time-to-first-token, and maximum QPS for your deployment, though the scaling is sub-linear. The default value for most models is 1, but may be higher for larger models that require sharding.
Choosing hardware type
By default, a deployment will use NVIDIA A100 80 GB GPUs. You can also deploy using NVIDIA H100 80 GB or AMD MI300X GPUs by passing the `--accelerator-type` flag. Valid values for `--accelerator-type` are:
- `NVIDIA_H100_80GB`
- `NVIDIA_A100_80GB`
- `AMD_MI300X_192GB` (note that MoE-based models like DeepSeek Coder and Mixtral are currently not supported on MI300X)
For advice on choosing a hardware type, see this FAQ.
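For example, the following sketch creates a deployment with two H100 GPUs per replica (the accelerator count is illustrative):

```bash
firectl create deployment accounts/fireworks/models/<MODEL_ID> \
  --accelerator-type=NVIDIA_H100_80GB \
  --accelerator-count=2
```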
Model based speculative decoding
Model based speculative decoding allows you to speed up output generation in some cases, by using a smaller model to assist the larger model in generation.
We offer the following settings that can be set as flags in firectl, our CLI tool:
- `--draft-model` (string): To use a draft model for speculative decoding, set this flag to the name of the draft model you want to use. See the table below for recommendations on draft models to use for popular model families. Note that draft models can be standalone models (referenced from the Fireworks account or custom models uploaded to your account) or an add-on (e.g. Eagle).
- `--draft-token-count` (int32): When using a draft model, set this flag to the number of tokens to generate per step for speculative decoding. Setting `--draft-token-count=0` turns off draft model speculation for the deployment. As a rough guideline, use `--draft-token-count=3` for Eagle draft models and `--draft-token-count=4` for other draft models.
- `--ngram-speculation-length` (int32): To use N-gram based speculation, set this flag to the length of the previous input sequence to be considered for N-gram speculation.

`--draft-token-count` must be set when `--draft-model` or `--ngram-speculation-length` is used. `--draft-model` and `--ngram-speculation-length` cannot be used together, as they are alternative approaches to speculation; setting both will throw an error.

You can use the following draft models directly:
| Draft model name | Recommended for |
|---|---|
| accounts/fireworks/models/llama-v3p2-1b-instruct | All Llama models > 3B |
| accounts/fireworks/models/qwen2p5-0p5b-instruct | All Qwen models > 3B |
| accounts/fireworks/models/eagle-llama-v3-3b-instruct-v2 | Llama 3.2 3B |
| accounts/fireworks/models/eagle-qwen-v2p5-3b-instruct-v2 | Qwen 2.5 3B |
| accounts/fireworks/models/eagle-llama-v3-8b-instruct-v2 | Llama 3.1 8B, Llama 3.0 8B |
| accounts/fireworks/models/eagle-qwen-v2p5-7b-instruct-v2 | Qwen 2.5 7B |
Here’s an example of deploying Llama 3.3 70B with a draft model:
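(A sketch; the Llama 3.3 70B model ID below is assumed, and the 1B draft model and token count follow the recommendations above.)

```bash
firectl create deployment accounts/fireworks/models/llama-v3p3-70b-instruct \
  --draft-model=accounts/fireworks/models/llama-v3p2-1b-instruct \
  --draft-token-count=4
```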
In most cases, speculative decoding does not change the quality of the output generated (mathematically, outputs are unchanged, but there might be numerical differences, especially at higher temperatures). If speculation is used on the deployment and you want to verify the output is unchanged, you can set `disable_speculation=True` in the inference API call. In this case, the draft model is still called but its output is not used, so performance will be impacted.
Quantization
By default, models on dedicated deployments are served using 16-bit floating-point (FP16) precision. Quantization reduces the number of bits used to serve the model, improving performance and reducing the cost to serve. However, it changes model numerics, which may introduce small changes to the output.
In order to deploy a base model using quantization, it must be prepared first. See our Quantization guide for details.
To create a deployment using a quantized model, pass the `--precision` flag with the desired precision.
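A minimal sketch, assuming the model has been prepared for FP8 and that `FP8` is the accepted value for the flag:

```bash
firectl create deployment accounts/<ACCOUNT_ID>/models/<MODEL_ID> --precision=FP8
```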
Optimizing your deployments for long context
By default, a balanced deployment will be created using the hardware resources you specify. Higher performance can be achieved for long-prompt-length (>~3000 tokens) workloads by passing the `--long-prompt` flag.

If `--accelerator-count` is not specified, then a deployment using twice the minimum number of GPUs (needed to serve without `--long-prompt`) will be created. To update a deployment to disable this option, pass `--long-prompt=false`.
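For example, the following sketch creates a long-prompt-optimized deployment (the model ID is a placeholder):

```bash
firectl create deployment accounts/fireworks/models/<MODEL_ID> --long-prompt
```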
Additional optimization options are available through our enterprise plan.
Deploying LoRA addons
By default, LoRA addons are disabled for deployments. To enable addons, pass the `--enable-addons` flag:
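A minimal sketch, showing the flag at deployment creation time (substitute the base model you are deploying):

```bash
firectl create deployment accounts/fireworks/models/<MODEL_ID> --enable-addons
```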
See Uploading a custom model for instructions on how to upload custom LoRA addons. To deploy a LoRA addon to an on-demand deployment, pass the `--deployment` flag to `firectl deploy`. For example:
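(A sketch; the exact argument shapes for `firectl deploy` are assumed.)

```bash
# Deploy a previously uploaded LoRA addon onto an existing on-demand deployment.
firectl deploy accounts/<ACCOUNT_ID>/models/<LORA_MODEL_ID> \
  --deployment accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>
```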
Pricing
On-demand deployments are billed by GPU-second. Consult our pricing page for details.