Serverless is the fastest way to get started with open models. This quickstart will help you make your first API call in minutes.

Step 1: Create and export an API key

Before you begin, create an API key in the Fireworks dashboard: click Create API key and store the key somewhere safe. Then export it as an environment variable in your terminal:
export FIREWORKS_API_KEY="your_api_key_here"
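
If you prefer not to rely on the environment variable, the SDK also accepts the key directly. A minimal sketch, assuming the client's api_key parameter and its default behavior of reading FIREWORKS_API_KEY from the environment:

import os

from fireworks import Fireworks

# Explicit key; by default the client falls back to the
# FIREWORKS_API_KEY environment variable.
client = Fireworks(api_key=os.environ["FIREWORKS_API_KEY"])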

Step 2: Make your first Serverless API call

Install the Fireworks Python SDK:
The SDK is currently in alpha. Use the --pre flag when installing to get the latest version.
pip install --pre fireworks-ai
Then make your first Serverless API call:
from fireworks import Fireworks

client = Fireworks()

response = client.chat.completions.create(
  model="accounts/fireworks/models/deepseek-v3p1",
  messages=[{
    "role": "user",
    "content": "Say hello in Spanish",
  }],
)

print(response.choices[0].message.content)
You should see a response like: "¡Hola!"

Common use cases

Streaming responses

Stream responses token by token for a better user experience:
from fireworks import Fireworks

client = Fireworks()

stream = client.chat.completions.create(
  model="accounts/fireworks/models/deepseek-v3p1",
  messages=[{"role": "user", "content": "Tell me a short story"}],
  stream=True
)

for chunk in stream:
  if chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="")
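
If you also need the complete text once streaming finishes, collect the deltas as you print them. A minimal sketch that replaces the loop above (a stream can only be consumed once):

full_text = []
for chunk in stream:
  delta = chunk.choices[0].delta.content
  if delta:
    print(delta, end="")
    full_text.append(delta)

story = "".join(full_text)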

Function calling

Connect your models to external tools and APIs:
from fireworks import Fireworks

client = Fireworks()

response = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2-instruct-0905",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City name, e.g. San Francisco",
                        }
                    },
                    "required": ["location"],
                },
            },
        },
    ],
)

print(response.choices[0].message.tool_calls)
Learn more about function calling →
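
The model only describes the call; your code executes the function and sends the result back in a follow-up turn. A minimal sketch of that second turn, assuming a hypothetical get_weather implementation and that, as with OpenAI-compatible clients, the returned message object can be appended to messages directly:

import json

def get_weather(location):
    # Hypothetical implementation; call a real weather service here.
    return {"location": location, "temperature_c": 18, "conditions": "cloudy"}

tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = get_weather(**args)

followup = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2-instruct-0905",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        response.choices[0].message,  # assistant turn containing the tool call
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        },
    ],
)

print(followup.choices[0].message.content)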

Structured outputs (JSON mode)

Get reliable JSON responses that match your schema:
from fireworks import Fireworks

client = Fireworks()

response = client.chat.completions.create(
  model="accounts/fireworks/models/deepseek-v3p1",
  messages=[
    {
      "role": "user",
      "content": "Extract the name and age from: John is 30 years old",
    }
  ],
  response_format={
    "type": "json_schema",
    "json_schema": {
      "name": "person",
      "schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "age": { "type": "number" }
        },
        "required": ["name", "age"],
      },
    },
  },
)

print(response.choices[0].message.content)
Learn more about structured outputs →
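
Because the content is a JSON string matching the schema, you can parse it directly (continuing from the example above):

import json

person = json.loads(response.choices[0].message.content)
print(person["name"], person["age"])  # e.g. John 30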

Vision models

Analyze images with vision-language models:
from fireworks import Fireworks

client = Fireworks()

response = client.chat.completions.create(
  model="accounts/fireworks/models/qwen2p5-vl-32b-instruct",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://storage.googleapis.com/fireworks-public/image_assets/fireworks-ai-wordmark-color-dark.png"
          },
        },
      ],
    }
  ],
)

print(response.choices[0].message.content)
Learn more about vision models →
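
To analyze a local image instead of a hosted URL, a common pattern with OpenAI-compatible vision APIs is to inline the file as a base64 data URL. A minimal sketch under that assumption (photo.png is a placeholder path):

import base64

from fireworks import Fireworks

client = Fireworks()

# Encode the local file as a data URL (placeholder path).
with open("photo.png", "rb") as f:
  b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
  model="accounts/fireworks/models/qwen2p5-vl-32b-instruct",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {
          "type": "image_url",
          "image_url": {"url": f"data:image/png;base64,{b64}"},
        },
      ],
    }
  ],
)

print(response.choices[0].message.content)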

Serverless model lifecycle

Serverless models are managed by the Fireworks team and may be updated or deprecated as new models are released. We provide at least 2 weeks advance notice before removing any model, with longer notice periods for popular models based on usage. For production workloads requiring long-term model stability, we recommend using on-demand deployments, which give you full control over model versions and updates.
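
To check which serverless models are currently available before building against one, you can query the model listing endpoint. A minimal sketch using the requests library; the endpoint path and response shape are assumptions based on Fireworks' OpenAI-compatible API:

import os

import requests

resp = requests.get(
  "https://api.fireworks.ai/inference/v1/models",
  headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"},
)
resp.raise_for_status()

for model in resp.json()["data"]:
  print(model["id"])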
Make sure to add a payment method to access rate limits of up to 6,000 RPM. Without a payment method, you're limited to 10 RPM.
The 6,000 RPM figure is a ceiling enforced by our spike arrest policy. Your actual limit scales dynamically with sustained usage, so short-lived spikes may be throttled below that cap. For predictable throughput, consider on-demand deployments or requesting a rate review.
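
If a burst does get throttled, a simple client-side mitigation is to retry with exponential backoff. A minimal sketch; the bare Exception handler is a placeholder because error classes vary by SDK version, so narrow it to the SDK's rate-limit error in real code:

import random
import time

from fireworks import Fireworks

client = Fireworks()

def create_with_backoff(max_retries=5, **kwargs):
  # Retry transient failures (e.g. 429s from the spike arrest policy),
  # roughly doubling the wait each attempt and adding jitter.
  for attempt in range(max_retries):
    try:
      return client.chat.completions.create(**kwargs)
    except Exception:
      if attempt == max_retries - 1:
        raise
      time.sleep(2 ** attempt + random.random())

response = create_with_backoff(
  model="accounts/fireworks/models/deepseek-v3p1",
  messages=[{"role": "user", "content": "Say hello in Spanish"}],
)
print(response.choices[0].message.content)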

Next steps

Ready to scale to production, explore other modalities, or customize your models?