This guide uses firectl, our CLI tool. Install the tool by following the installation guide. Verify that your installation is working and you are logged into the correct account.
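One quick check, assuming your firectl version provides a whoami subcommand:

```bash
# Prints the currently signed-in account; confirm it matches the account you expect
firectl whoami
```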
Choose a model
Create a deployment
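A minimal sketch of creating a deployment, assuming a Fireworks-provided model ID (substitute the model you chose above):

```bash
# Create an on-demand deployment and wait until it is ready
firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct --wait
```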
The command will complete once the deployment reaches the READY state. To let it run asynchronously, remove the --wait flag.

Verify the deployment is running
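You can inspect the deployment by ID; a sketch, assuming the deployment ID reported when it was created:

```bash
# Show the deployment's current state
firectl get deployment <DEPLOYMENT_ID>
```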
Check that the deployment's state is READY. The fully qualified deployment name has the form accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>.

Query the deployment
To query a specific deployment, use a model identifier of the form <ACCOUNT_ID>/<MODEL_ID>#<ACCOUNT_ID>/<DEPLOYMENT_ID>, i.e. include the #<ACCOUNT_ID>/<DEPLOYMENT_ID> suffix in the model identifier.
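For example, an OpenAI-compatible chat completions request against the Fireworks inference API might look like the following sketch (the API key variable and the placeholder IDs are assumptions to fill in):

```bash
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<ACCOUNT_ID>/<MODEL_ID>#<ACCOUNT_ID>/<DEPLOYMENT_ID>",
    "messages": [{"role": "user", "content": "Say hello!"}]
  }'
```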
Tear down the deployment
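When the deployment is no longer needed, it can be deleted; a sketch using the same placeholder ID:

```bash
# Delete the on-demand deployment so it stops incurring GPU costs
firectl delete deployment <DEPLOYMENT_ID>
```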
You can confirm whether a model is deployed by looking for State: DEPLOYED in the Deployed Model Refs section of the model's details.
This works both for models provided by Fireworks and for custom models you have uploaded to your account. For example, you can check whether “Qwen2.5 7B Instruct” is deployed by fetching its model details.
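A sketch, assuming the model ID is qwen2p5-7b-instruct (check the model library for the exact ID):

```bash
firectl get model qwen2p5-7b-instruct
```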
If it is deployed, the output will include a deployed model ref owned by the Fireworks account (accounts/fireworks/...). Try running the above command using a model with the Serverless tag in the model library!

You can also find which deployment is used by default when querying the model.
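A sketch, assuming the same firectl get model command shows this (the model ID is a placeholder):

```bash
firectl get model <MODEL_ID>
```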
Look in the Deployed Model Refs section for the entry marked Default: true.
Copy the Name of the deployed model reference identified above. Then run your query using that name as the model identifier.
The number of replicas a deployment uses is controlled by the --min-replica-count and --max-replica-count flags. Increasing the number of replicas will increase the maximum QPS the deployment can support. The deployment will automatically scale between these bounds based on server load.
The default value for --min-replica-count is 0. Setting --min-replica-count to 0 enables the deployment to auto-scale to zero if it is unused (i.e. receives no inference requests) for a specified “scale-to-zero” time window. While the deployment is scaled to zero, you will not pay for any GPU utilization.
The default value for --max-replica-count is 1 if --min-replica-count=0, or the value of --min-replica-count otherwise.
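For example, a deployment that keeps one replica warm and scales up to three under load might be created like this (the model and values are illustrative):

```bash
firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \
  --min-replica-count=1 \
  --max-replica-count=3
```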
You can tune the autoscaling behavior with the following flags:

--scale-up-window: The duration the autoscaler will wait before scaling up a deployment after observing increased load. Default is 30s.

--scale-down-window: The duration the autoscaler will wait before scaling down a deployment after observing decreased load. Default is 10m.

--scale-to-zero-window: The duration with no requests after which the deployment will be scaled down to zero replicas. This is ignored if --min-replica-count is greater than 0. Default is 1h. The minimum is 5m.
--load-targets <key>=<value>[,<key>=<value>...]: Load target thresholds for scaling the replica count. If not specified, the default is --load-targets default=0.8. If multiple load targets are specified, the maximum replica count across all of them is used. Supported keys:

- default=<Fraction>: A general default load target, expressed as a fraction between 0 and 1. Default is default=0.8.
- tokens_generated_per_second=<Integer>: The desired number of tokens generated per second per replica.

Note that a deployment with --min-replica-count set to 0 will be automatically deleted if it receives no traffic for 7 days.
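As an illustration, the autoscaling windows and load target might be adjusted on an existing deployment roughly like this (the update subcommand and the values shown are assumptions; consult firectl's help for the exact syntax):

```bash
firectl update deployment <DEPLOYMENT_ID> \
  --scale-up-window=1m \
  --scale-down-window=5m \
  --load-targets default=0.7
```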
The number of accelerators (GPUs) used by each replica is controlled by the --accelerator-count flag. Increasing the accelerator count will improve generation speed, time-to-first-token, and the maximum QPS of your deployment; however, the scaling is sub-linear. The default value for most models is 1 but may be higher for larger models that require sharding.
The accelerator type is selected with the --accelerator-type flag. Valid values for --accelerator-type are:
- NVIDIA_A100_80GB
- NVIDIA_H100_80GB
- NVIDIA_H200_141GB
- AMD_MI300X_192GB - Note that MoE-based models like DeepSeek Coder and Mixtral are currently not supported on MI300X
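For instance, a larger model might be deployed across several GPUs like this (the model ID and values are illustrative, not a sizing recommendation):

```bash
firectl create deployment accounts/fireworks/models/llama-v3p1-70b-instruct \
  --accelerator-type=NVIDIA_H100_80GB \
  --accelerator-count=4
```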
Speculative decoding is configured with the following flags in firectl, our CLI tool:
Flag | Type | Description |
---|---|---|
--draft-model | string | To use a draft model for speculative decoding, set this flag to the name of the draft model you want to use. See the table below for recommendations on draft models for popular model families. Note that draft models can be standalone models (either from the Fireworks account or custom models uploaded to your account) or an add-on (e.g. Eagle) |
--draft-token-count | int32 | When using a draft model, set this flag to the number of tokens to generate per step for speculative decoding. Setting --draft-token-count=0 turns off draft model speculation for the deployment. As a rough guideline, use --draft-token-count=3 for eagle draft models and --draft-token-count=4 for other draft models |
--ngram-speculation-length | int32 | To use N-gram based speculation, set this flag to the length of the previous input sequence to be considered for N-gram speculation |
--draft-token-count must be set when --draft-model or --ngram-speculation-length is used. --draft-model and --ngram-speculation-length cannot be used together, as they are alternative approaches to speculation; setting both will throw an error.

Recommended draft models for popular model families:

Draft model name | Recommended for |
---|---|
accounts/fireworks/models/llama-v3p2-1b-instruct | All Llama models > 3B |
accounts/fireworks/models/qwen2p5-0p5b-instruct | All Qwen models > 3B |
accounts/fireworks/models/eagle-llama-v3-3b-instruct-v2 | Llama 3.2 3B |
accounts/fireworks/models/eagle-qwen-v2p5-3b-instruct-v2 | Qwen 2.5 3B |
accounts/fireworks/models/eagle-llama-v3-8b-instruct-v2 | Llama 3.1 8B, Llama 3.0 8B |
accounts/fireworks/models/eagle-qwen-v2p5-7b-instruct-v2 | Qwen 2.5 7B |
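For example, enabling model-based speculation on an existing deployment might look like the following sketch (the deployment ID is a placeholder; the draft model and token count follow the recommendations above):

```bash
firectl update deployment <DEPLOYMENT_ID> \
  --draft-model=accounts/fireworks/models/llama-v3p2-1b-instruct \
  --draft-token-count=4
```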
Speculative decoding can also be disabled for individual requests by setting disable_speculation=True in the inference API call. In this case, the draft model is still called but its output is not used, so performance will be impacted.
The deployment's precision can be set by passing the --precision flag with the desired precision.
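A sketch, assuming FP8 is among the supported precision values for your model (check the deployment documentation for the available values):

```bash
firectl create deployment <MODEL> --precision=FP8
```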
Deployments optimized for long prompts can be created by enabling the --long-prompt flag. If --accelerator-count is not specified, then a deployment using twice the minimum number of GPUs (compared to serving without --long-prompt) will be created. To turn this option off on an existing deployment, update it with --long-prompt=false.
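For example (a sketch; the model and deployment ID are placeholders):

```bash
# Create a deployment tuned for long prompts
firectl create deployment <MODEL> --long-prompt

# Later, turn the option back off on the existing deployment
firectl update deployment <DEPLOYMENT_ID> --long-prompt=false
```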
Additional optimization options are available through our enterprise plan.
A deployed model can be made publicly accessible by publishing it with the --public flag.
Find the Deployed Model ID
Publish a deployed model
Publish the deployed model by setting the --public flag on it.