On-demand deployments run on dedicated GPUs, giving you better performance, no rate limits, fast autoscaling, and a wider selection of models than serverless. This quickstart will help you spin up your first on-demand deployment in minutes.

Step 1: Create and export an API key

Before you begin, create an API key in the Fireworks dashboard. Click Create API key and store it in a safe location. Once you have your API key, export it as an environment variable in your terminal:
On macOS / Linux:
export FIREWORKS_API_KEY="your_api_key_here"
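On Windows (PowerShell), the equivalent for the current session is:
$env:FIREWORKS_API_KEY = "your_api_key_here"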

Step 2: Install the CLI

To create and manage on-demand deployments, you’ll need the firectl CLI tool. Install it with Homebrew:
brew tap fw-ai/firectl
brew install firectl

# If you encounter a failed SHA256 check, run this first and retry:
brew update
Then, sign in:
firectl signin
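To confirm the CLI is authenticated, list the deployments in your account (an empty list is expected if you haven’t created any yet):
firectl list deployments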

Step 3: Create a deployment

This command creates a deployment of GPT OSS 120B optimized for speed. It takes a few minutes to complete, and the resulting deployment will scale between zero and one replica.
firectl create deployment accounts/fireworks/models/gpt-oss-120b \
        --deployment-shape fast \
        --scale-down-window 5m \
        --scale-up-window 30s \
        --min-replica-count 0 \
        --max-replica-count 1 \
        --scale-to-zero-window 5m \
        --wait
fast is a deployment shape: a pre-configured deployment template, created by the Fireworks team, that sets sensible defaults for most deployment options (such as hardware type). You can also pass throughput or cost to --deployment-shape:
  • throughput creates a deployment that trades off latency for lower cost-per-token at scale
  • cost creates a deployment that trades off latency and throughput for lowest cost-per-token at small scale, usually for early experimentation and prototyping
While we recommend using a deployment shape, you can also pass your own configuration via our deployment options, as sketched below.
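For instance, a manual configuration that pins the hardware directly might look like the following. The accelerator type shown is illustrative; check firectl create deployment --help for the flags and values available on your account:
firectl create deployment accounts/fireworks/models/gpt-oss-120b \
        --accelerator-type NVIDIA_H100_80GB \
        --min-replica-count 0 \
        --max-replica-count 1 \
        --wait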
In either case, the response will look like this:
Name: accounts/<YOUR ACCOUNT ID>/deployments/<DEPLOYMENT ID>
Create Time: <CREATION_TIME>
Expire Time: <EXPIRATION_TIME>
Created By: <YOUR EMAIL>
State: CREATING
Status: OK
Min Replica Count: 0
Max Replica Count: 1
Desired Replica Count: 0
Replica Count: 0
Autoscaling Policy:
  Scale Up Window: 30s
  Scale Down Window: 5m0s
  Scale To Zero Window: 5m0s
Base Model: accounts/fireworks/models/gpt-oss-120b
...other fields...
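If you omitted --wait, or want to re-check later, you can poll the deployment until State: reads READY. The deployment ID is the trailing component of the Name: field:
firectl get deployment <DEPLOYMENT ID>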
Take note of the Name: field in the response, as it will be used in the next step to query your deployment.
Learn more about deployment options→
Learn more about autoscaling options→

Step 4: Query your deployment

Now you can query your on-demand deployment using the same API as serverless models, just pointed at your dedicated deployment. Replace <DEPLOYMENT_NAME> in the snippet below with the value from the Name: field in the previous step:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
    messages=[{
        "role": "user",
        "content": "Explain quantum computing in simple terms",
    }],
)

print(response.choices[0].message.content)
The examples from the Serverless quickstart will work with this deployment as well; just replace the model string with the deployment-specific model string from above.
Serverless quickstart→
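For instance, streaming works the same way as on serverless. A minimal sketch reusing the client from above:
stream = client.chat.completions.create(
    model="accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
    messages=[{
        "role": "user",
        "content": "Explain quantum computing in simple terms",
    }],
    stream=True,
)

# Print tokens as they arrive instead of waiting for the full reply
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)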

Common use cases

Autoscale based on requests per second

firectl create deployment accounts/fireworks/models/gpt-oss-120b \
        --deployment-shape fast \
        --scale-down-window 5m \
        --scale-up-window 30s \
        --scale-to-zero-window 5m \
        --min-replica-count 0 \
        --max-replica-count 4 \
        --load-targets requests_per_second=5 \
        --wait
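With this policy, the autoscaler aims to keep each replica at roughly the target load. Assuming standard target-tracking behavior (our illustration, not documented Fireworks internals), the replica math works out like this:
import math

target = 5          # --load-targets requests_per_second=5
observed_rps = 18   # hypothetical steady-state traffic

# Desired replicas under target tracking, clamped to the configured bounds
desired = math.ceil(observed_rps / target)   # -> 4
desired = max(0, min(desired, 4))            # --min/--max-replica-count
print(desired)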

Autoscale based on concurrent requests

firectl create deployment accounts/fireworks/models/gpt-oss-120b \
        --deployment-shape fast \
        --scale-down-window 5m \
        --scale-up-window 30s \
        --scale-to-zero-window 5m \
        --min-replica-count 0 \
        --max-replica-count 4 \
        --load-targets concurrent_requests=5 \
        --wait
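To see this policy in action, you can generate concurrent load yourself. A minimal sketch using the async OpenAI client; the prompt and concurrency level are arbitrary:
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1",
)

async def one_request(i: int) -> None:
    response = await client.chat.completions.create(
        model="accounts/fireworks/models/gpt-oss-120b#<DEPLOYMENT_NAME>",
        messages=[{"role": "user", "content": f"Reply with the number {i}"}],
    )
    print(i, (response.choices[0].message.content or "")[:40])

async def main() -> None:
    # Ten in-flight requests exceed the concurrent_requests=5 target,
    # so the deployment should scale toward a second replica
    await asyncio.gather(*(one_request(i) for i in range(10)))

asyncio.run(main())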

Next steps

Ready to scale to production, explore other modalities, or customize your models?