On-demand deployments
Deploying on your own GPUs
Fireworks allows you to create on-demand, dedicated deployments that are reserved for your own use. These have several advantages over the shared deployments Fireworks uses for its serverless models:
- Predictable performance unaffected by load caused by other users
- No hard rate limits, subject only to the maximum load capacity of the deployment
- Cheaper under high utilization
- Access to a larger selection of models than is available serverless
- Support for custom base models, e.g. uploaded from Hugging Face
Need extra performance or want your on-demand deployment to be custom-configured? Feel free to schedule time directly with our PM here. Curious about performance comparisons between on-demand and serverless? Check out this blog for performance sweeps and this blog for recent updates.
Interested in trialing the performance of on-demand? You can immediately get started with the docs below or email raythai@fireworks.ai to get credits for a complimentary trial period for on-demand deployments.
Quickstart
Creating a base model
To create an on-demand deployment, you must first have a base model in your account. You can either import an existing base model that Fireworks has already uploaded to the platform or upload your own custom base model (e.g. downloaded from Hugging Face).
See the “all models” list on our models page for the models available for import. To import a model, run:
firectl import model <MODEL_ID>
To upload a custom base model, see the Custom base models guide.
Creating an on-demand deployment
To create a new deployment, run:
firectl create deployment <MODEL_ID> --wait
This command completes when the deployment is READY. To let it run asynchronously, remove the --wait flag.
NOTE: The deployment ID is the last part of accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>.
You can verify the deployment is complete by running:
firectl get deployment <DEPLOYMENT_ID>
# OR
firectl get model <MODEL_ID>
The state field should show READY for the deployment and DEPLOYED for the model, with the deployment ID set.
By default, the deployment will automatically scale down to zero replicas if unused (i.e. no inference requests) for 1 hour, and automatically delete itself if unused for one week. To disable autoscaling to zero, pass a --min-replica-count greater than 0 to create/update.
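For example, to keep at least one replica warm so the deployment never scales to zero (a sketch, assuming an existing deployment ID):

```shell
# Keep at least one replica running at all times
firectl update deployment <DEPLOYMENT_ID> --min-replica-count 1
```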
Querying a model
Querying a model on an on-demand deployment is the same as querying any other model. The model name is the name of the base model or the PEFT addon you deploy. See Querying text models for details.
curl \
--header 'Authorization: Bearer <FIREWORKS_API_KEY>' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/<ACCOUNT_ID>/models/<MODEL_ID>",
"prompt": "Say this is a test"
}' \
--url https://api.fireworks.ai/inference/v1/completions
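The same request can be issued from Python. Below is a minimal sketch using only the standard library; the API key, account ID, and model ID are placeholders you must supply, and build_completion_request is a helper name introduced here for illustration:

```python
import json
import urllib.request

API_URL = "https://api.fireworks.ai/inference/v1/completions"

def build_completion_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Builds a POST request equivalent to the curl example above."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_completion_request(
    "<FIREWORKS_API_KEY>",
    "accounts/<ACCOUNT_ID>/models/<MODEL_ID>",
    "Say this is a test",
)
# To actually send the request (requires a valid API key):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```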
Deleting a deployment
To delete a deployment, run:
firectl delete deployment <DEPLOYMENT_ID>
Deployment options
Replica count (horizontal scaling)
The number of replicas (horizontal scaling) is specified by passing the --min-replica-count and --max-replica-count flags. Increasing the number of replicas increases the maximum QPS the deployment can support. Setting --max-replica-count higher than --min-replica-count enables automatic scaling between the two replica counts based on load (batch occupancy). The default value for --min-replica-count is 0. The default value for --max-replica-count is 1 if --min-replica-count=0, or the value of --min-replica-count otherwise. For example:
firectl create deployment <MODEL_ID> \
--min-replica-count 2 \
--max-replica-count 3
firectl update deployment <DEPLOYMENT_ID> \
--min-replica-count 2 \
--max-replica-count 3
Autoscaling to zero
Setting --min-replica-count=0 (or not setting the flag at all, as the default is 0) scales the deployment down to 0 replicas after it receives no traffic for --scale-to-zero-window (default 1 hour). While the deployment has 0 replicas, any new request will scale it back up to 1 replica. Requests made while the deployment is scaling from 0 to 1 replicas may see an additional 1 to 2 minutes of latency.
Note: A deployment with --min-replica-count set to 0 will be automatically deleted if it receives no traffic for 7 days.
Customizing autoscaling behavior
You can customize certain aspects of the deployment’s autoscaling behavior by setting the following flags:
- --scale-up-window: The duration the autoscaler waits before scaling up a deployment after observing increased load. Default is 30s.
- --scale-down-window: The duration the autoscaler waits before scaling down a deployment after observing decreased load. Default is 10m.
- --scale-to-zero-window: The duration after which a deployment receiving no requests is scaled down to zero replicas. Ignored if --min-replica-count is greater than 0. Default is 1h.
Refer to time.ParseDuration for valid syntax for the duration string.
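For example, to make an existing deployment scale up more aggressively and idle down sooner (a sketch; the specific durations here are illustrative, not recommendations):

```shell
firectl update deployment <DEPLOYMENT_ID> \
  --scale-up-window 1m \
  --scale-down-window 5m \
  --scale-to-zero-window 30m
```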
Multiple GPUs (vertical scaling)
The number of GPUs used per replica is specified by passing the --accelerator-count flag. Increasing the world size improves generation speed, time-to-first-token, and the maximum QPS of your deployment; however, the scaling is sub-linear. The default value for most models is 1, but it may be higher for larger models that require sharding.
firectl create deployment <MODEL_ID> --accelerator-count 2
firectl update deployment <DEPLOYMENT_ID> --accelerator-count 2
Choosing hardware type
By default, a deployment will use NVIDIA A100 80 GB GPUs. You can also deploy using NVIDIA H100 80 GB GPUs by passing the --accelerator-type flag.
A100s are priced more affordably than H100s, but H100s offer lower latency and higher total capacity. Generally, we recommend H100s unless your volume is too low to take advantage of the H100’s higher capacity.
firectl create deployment <MODEL_ID> --accelerator-type="NVIDIA_H100_80GB"
firectl update deployment <DEPLOYMENT_ID> --accelerator-type="NVIDIA_H100_80GB"
Optimizing your deployments
By default, a balanced deployment is created using the hardware resources you specify. Higher performance can be achieved for long-prompt workloads by passing the --long-prompt flag. This option requires a minimum of 2 GPUs to be effective.
firectl create deployment <MODEL_ID> --accelerator-count=2 --long-prompt
firectl update deployment <DEPLOYMENT_ID> --long-prompt
To update a deployment and disable this option, pass --long-prompt=false.
Additional optimization options are available through our enterprise plan.
Deploying PEFT addons
See Deploying fine-tuned models for instructions on how to upload PEFT addons. To deploy a PEFT addon to an on-demand deployment, pass the --deployment-id flag to firectl deploy. For example:
firectl deploy <MODEL_ID> --deployment-id <DEPLOYMENT_ID>
The base model of the deployment must match the base model of the addon.
Pricing
On-demand deployments are billed by GPU-second. Consult our pricing page for details.
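Since billing is per GPU-second, the cost of a deployment scales with GPUs per replica, replica count, and uptime. A minimal sketch of that arithmetic is below; the per-hour rates are hypothetical placeholders, not Fireworks prices, so consult the pricing page for real figures:

```python
# HYPOTHETICAL rates per GPU-hour, used only to illustrate the arithmetic.
# See the Fireworks pricing page for actual rates.
HYPOTHETICAL_RATE_PER_GPU_HOUR = {
    "NVIDIA_A100_80GB": 2.90,
    "NVIDIA_H100_80GB": 5.80,
}

def estimate_cost(accelerator_type: str, accelerator_count: int,
                  replica_count: int, hours: float) -> float:
    """Cost = GPUs per replica x replicas x hours x hourly rate."""
    rate = HYPOTHETICAL_RATE_PER_GPU_HOUR[accelerator_type]
    return accelerator_count * replica_count * hours * rate

# e.g. a 2-GPU, single-replica A100 deployment running for 10 hours
print(estimate_cost("NVIDIA_A100_80GB", 2, 1, 10))
```

Note that with autoscaling enabled, the effective replica count (and therefore cost) varies with load between the min and max replica counts.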