On-demand deployments run on dedicated GPUs, giving you better performance, no rate limits, fast autoscaling, and a wider selection of models than serverless. This quickstart will help you spin up your first on-demand deployment in minutes.

Step 1: Create and export an API key

Before you begin, create an API key in the Fireworks dashboard. Click Create API key and store it in a safe location. Once you have your API key, export it as an environment variable in your terminal:
On macOS / Linux:
export FIREWORKS_API_KEY="your_api_key_here"
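On Windows (PowerShell), the equivalent for the current session is:
$env:FIREWORKS_API_KEY = "your_api_key_here"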

Step 2: Install the CLI

To create and manage on-demand deployments, you’ll need the firectl CLI tool. Install it with Homebrew:
brew tap fw-ai/firectl
brew install firectl

# If you encounter a failed SHA256 check, run this first and retry:
brew update
Then, sign in:
firectl signin
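To confirm the CLI is authenticated, list the deployments in your account (an empty list is expected if you haven’t created any yet):
firectl list deployments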

Step 3: Create a deployment

This command creates a deployment of GPT OSS 120B optimized for speed. It takes a few minutes to complete, and the resulting deployment will scale between zero and one replica.
firectl create deployment accounts/fireworks/models/gpt-oss-120b \
        --deployment-shape fast \
        --scale-down-window 5m \
        --scale-up-window 30s \
        --min-replica-count 0 \
        --max-replica-count 1 \
        --scale-to-zero-window 5m \
        --wait
fast is a deployment shape: a pre-configured deployment template, created by the Fireworks team, that sets sensible defaults for most deployment options (such as hardware type). You can also pass throughput or cost to --deployment-shape:
  • throughput creates a deployment that trades off latency for lower cost-per-token at scale
  • cost creates a deployment that trades off latency and throughput for lowest cost-per-token at small scale, usually for early experimentation and prototyping
While we recommend using a deployment shape, you can also pass your own configuration via our deployment options, as sketched below.
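For instance, a manual configuration that pins the hardware directly might look like the following. The accelerator type shown is illustrative; check firectl create deployment --help for the flags and values available on your account:
firectl create deployment accounts/fireworks/models/gpt-oss-120b \
        --accelerator-type NVIDIA_H100_80GB \
        --min-replica-count 0 \
        --max-replica-count 1 \
        --wait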
In either case, the response will look like this:
Name: accounts/<YOUR ACCOUNT ID>/deployments/<DEPLOYMENT ID>
Create Time: <CREATION_TIME>
Expire Time: <EXPIRATION_TIME>
Created By: <YOUR EMAIL>
State: CREATING
Status: OK
Min Replica Count: 0
Max Replica Count: 1
Desired Replica Count: 0
Replica Count: 0
Autoscaling Policy:
  Scale Up Window: 30s
  Scale Down Window: 5m0s
  Scale To Zero Window: 5m0s
Base Model: accounts/fireworks/models/gpt-oss-120b
...other fields...
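If you omitted --wait, or want to re-check later, you can poll the deployment until State: reads READY. The deployment ID is the trailing component of the Name: field:
firectl get deployment <DEPLOYMENT ID>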
Take note of the Name: field in the response, as it will be used in the next step to query your deployment.
Learn more about deployment options→
Learn more about autoscaling options→

Step 4: Query your deployment

Now you can query your on-demand deployment using the same API as serverless models, just pointed at your dedicated deployment. Replace <DEPLOYMENT_NAME> in the snippet below with the value from the Name: field in the previous step:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
    messages=[{
        "role": "user",
        "content": "Explain quantum computing in simple terms",
    }],
)

print(response.choices[0].message.content)
The examples from the Serverless quickstart will work with this deployment as well; just replace the model string with the deployment-specific model string from above.
Serverless quickstart→
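For instance, streaming works the same way as on serverless. A minimal sketch reusing the client from above:
stream = client.chat.completions.create(
    model="accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
    messages=[{
        "role": "user",
        "content": "Explain quantum computing in simple terms",
    }],
    stream=True,
)

# Print tokens as they arrive instead of waiting for the full reply
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)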

Common use cases

Autoscale based on requests per second

firectl create deployment accounts/fireworks/models/gpt-oss-120b \
        --deployment-shape fast \
        --scale-down-window 5m \
        --scale-up-window 30s \
        --scale-to-zero-window 5m \
        --min-replica-count 0 \
        --max-replica-count 4 \
        --load-targets requests_per_second=5 \
        --wait
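With this policy, the autoscaler aims to keep each replica at roughly the target load. Assuming standard target-tracking behavior (our illustration, not documented Fireworks internals), the replica math works out like this:
import math

target = 5          # --load-targets requests_per_second=5
observed_rps = 18   # hypothetical steady-state traffic

# Desired replicas under target tracking, clamped to the configured bounds
desired = math.ceil(observed_rps / target)   # -> 4
desired = max(0, min(desired, 4))            # --min/--max-replica-count
print(desired)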

Autoscale based on concurrent requests

firectl create deployment accounts/fireworks/models/gpt-oss-120b \
        --deployment-shape fast \
        --scale-down-window 5m \
        --scale-up-window 30s \
        --scale-to-zero-window 5m \
        --min-replica-count 0 \
        --max-replica-count 4 \
        --load-targets concurrent_requests=5 \
        --wait
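To see this policy in action, you can generate concurrent load yourself. A minimal sketch using the async OpenAI client; the prompt and concurrency level are arbitrary:
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1",
)

async def one_request(i: int) -> None:
    response = await client.chat.completions.create(
        model="accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
        messages=[{"role": "user", "content": f"Reply with the number {i}"}],
    )
    print(i, (response.choices[0].message.content or "")[:40])

async def main() -> None:
    # Ten in-flight requests exceed the concurrent_requests=5 target,
    # so the deployment should scale toward a second replica
    await asyncio.gather(*(one_request(i) for i in range(10)))

asyncio.run(main())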

Next steps

Ready to scale to production, explore other modalities, or customize your models?