What is Serverless

Serverless is multi-tenant inference for popular open models running on Fireworks-managed infrastructure. You point your client at api.fireworks.ai, send tokens, and pay only for what you use — no GPUs to size, no autoscaler to tune, no cold starts to wait through. Models eligible for Serverless carry the Serverless tag in the model library. To make your first call, see the Serverless quickstart.
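
A minimal sketch of that first call, using the OpenAI-compatible Python SDK pointed at api.fireworks.ai; the model id below is a placeholder, so substitute any model carrying the Serverless tag.

```python
# Minimal Serverless call through the OpenAI-compatible endpoint.
# Assumes FIREWORKS_API_KEY is set in the environment; the model id is a
# placeholder, use any Serverless-tagged model from the model library.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # placeholder
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```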

Serverless products at a glance

Three traffic tiers run on the same Serverless framework. They share the same rate-limit policy, but route and price differently:
  • Standard — the default tier. No service_tier parameter needed.
  • Priority — higher reliability during peak periods. Opt in by setting service_tier: "priority" on chat completions. Priced at a premium.
  • Fast — high-speed deployments for latency-sensitive workloads. Selected by switching the model id to a Fast variant (for example, accounts/fireworks/routers/kimi-k2p6-turbo).
For usage examples and the full list of supported models, see Serverless Priority and Fast. For per-tier pricing, see Serverless pricing.
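
A sketch of selecting each tier, under the same assumptions as the quickstart call above (OpenAI-compatible SDK, placeholder Standard model id); service_tier is passed through extra_body so it reaches the API even on SDK versions that don't expose the parameter directly.

```python
# How the three tiers are selected per request. Standard needs nothing
# extra; Priority opts in with service_tier; Fast swaps the model id.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)
messages = [{"role": "user", "content": "ping"}]

# Standard: the default tier, no service_tier parameter.
client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # placeholder
    messages=messages,
)

# Priority: opt in per request; extra_body forwards the field verbatim.
client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # placeholder
    messages=messages,
    extra_body={"service_tier": "priority"},
)

# Fast: switch to a Fast variant's model id (example from this page).
client.chat.completions.create(
    model="accounts/fireworks/routers/kimi-k2p6-turbo",
    messages=messages,
)
```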

Billing

Serverless is priced per token. Three dimensions are billed:
  • Input tokens — what you send to the model.
  • Cached input tokens — input tokens served from the prompt cache, billed at a discount (by default 50% of the input rate on text and vision models, unless a model lists a different cached rate).
  • Generated tokens — what the model produces.
Other things to know:
  • The usage object in each response is the source of truth for what was billed (prompt_tokens, completion_tokens, total_tokens); a sketch of reading it follows this list.
  • Batch inference is billed at 50% of standard Serverless rates on both input and output. See Batch inference.
  • Your spend tier influences Serverless capacity caps in addition to your monthly budget — higher spend tiers unlock higher TPM upper bounds. See Account quotas and Serverless rate limits.
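
To make the arithmetic concrete, here is a sketch that turns a response's billed token counts into an estimated cost. The per-million-token rates are placeholders rather than real prices (see Serverless pricing), and the cached count is taken from the fireworks-cached-prompt-tokens response header described below.

```python
# Estimate a request's cost from the billed token counts. Rates below are
# placeholders; look up actual per-token prices on the pricing page.
INPUT_RATE = 0.20               # $ per 1M input tokens (placeholder)
OUTPUT_RATE = 0.20              # $ per 1M generated tokens (placeholder)
CACHED_RATE = INPUT_RATE * 0.5  # default cached discount: 50% of input

def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  cached_prompt_tokens: int = 0) -> float:
    """Token counts come from response.usage; the cached count comes from
    the fireworks-cached-prompt-tokens response header."""
    uncached = prompt_tokens - cached_prompt_tokens
    return (uncached * INPUT_RATE
            + cached_prompt_tokens * CACHED_RATE
            + completion_tokens * OUTPUT_RATE) / 1_000_000

# Example: 10k prompt tokens, 6k of them cached, 500 generated.
print(f"${estimate_cost(10_000, 500, cached_prompt_tokens=6_000):.6f}")
```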

Request and response headers

Headers a Serverless caller will set or read.

Request headers

Header | Notes
Authorization: Bearer $FIREWORKS_API_KEY | Required for all requests.
x-session-affinity | Optional sticky-routing key. Pin repeated requests to the same replica to maximize prompt-cache hit rate. See Prompt caching.
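
A minimal sketch of setting both headers with the OpenAI-compatible SDK: the API key becomes the Authorization header, and default_headers attaches x-session-affinity to every request from this client. The session key shown is an arbitrary example.

```python
# Attach the two request headers from the table above to every call.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],       # Authorization: Bearer ...
    default_headers={"x-session-affinity": "user-1234"},  # stable per user
)
```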

Response headers

Fireworks sets the following on Serverless inference responses:

Header | What it tells you
fireworks-prompt-tokens | Input tokens for the request.
fireworks-cached-prompt-tokens | Cached portion of the input. See Prompt caching.
X-Ratelimit-Limit-Tokens-Prompt | Your current Total Prompt Tokens per Second Limit.
X-Ratelimit-Limit-Tokens-Cache-Adjusted-Prompt | Your current Total Uncached Prompt Tokens per Second Limit.
X-Ratelimit-Limit-Tokens-Generated | Your current Total Generated Tokens per Second Limit.
Streaming responses don’t carry per-request perf headers. To get the same metrics in the streaming response body, set the perf_metrics_in_response parameter on the request. See Querying text models.
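
A sketch of reading those headers from a non-streaming call with Python's requests library; the model id is a placeholder.

```python
# Inspect billing and rate-limit headers on a raw HTTP response.
import os
import requests

resp = requests.post(
    "https://api.fireworks.ai/inference/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"},
    json={
        "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",  # placeholder
        "messages": [{"role": "user", "content": "ping"}],
    },
)
# requests exposes headers through a case-insensitive mapping.
print(resp.headers.get("fireworks-prompt-tokens"))
print(resp.headers.get("fireworks-cached-prompt-tokens"))
print(resp.headers.get("X-Ratelimit-Limit-Tokens-Prompt"))
```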

Prompt caching

Prompt caching is on by default for every Serverless model. Cached input tokens are billed at the discounted rate (default 50% of input). Caching is replica-local, so to maximize hit rate you should route repeated prompts to the same replica — pass a stable identifier in x-session-affinity (or in the OpenAI user field) for each user or session whose prompts share a prefix. For the full guide, including how to structure prompts for cache hits and how to read cache metrics, see Prompt caching.
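
A sketch of that pattern with the OpenAI-compatible SDK: keep the shared prefix byte-identical across calls and send a stable x-session-affinity key per session via extra_headers. The model id, system prompt, and session key are placeholders.

```python
# Route a session's requests to the same replica so the shared system
# prompt prefix can be served from that replica's prompt cache.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

SYSTEM_PROMPT = "You are a support agent for Acme Corp."  # shared prefix

def ask(session_id: str, question: str) -> str:
    resp = client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # placeholder
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        extra_headers={"x-session-affinity": session_id},  # stable per session
    )
    return resp.choices[0].message.content

# Both calls share the prefix and session key; the second can hit cache.
print(ask("session-42", "How do I reset my password?"))
print(ask("session-42", "And how do I change my email address?"))
```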

Serverless model lifecycle

Serverless models are managed by the Fireworks team and may be updated or deprecated as new models are released. We provide at least two weeks' advance notice before removing any model, with longer notice periods for popular models based on usage. For production workloads requiring long-term model stability, we recommend on-demand deployments, which give you full control over model versions and updates.

Serverless vs On-demand

When Serverless fits | When On-demand fits
Pay per token, only for what you use | Pay per GPU-hour for dedicated capacity
You’re using popular base models that Fireworks already hosts | You’re running custom base models or fine-tuned LoRA models (LoRA requires On-demand)
You don’t want to manage scaling, replicas, or hardware sizing | You have custom latency requirements and want control over hardware and replicas
For dedicated infrastructure, see On-demand deployments.

Next steps

Serverless quickstart

Make your first Serverless API call.

Priority and Fast

Higher-reliability and higher-speed serverless tiers.

Pricing

Per-token rates for text, vision, embeddings, and Priority.

Rate limits

Adaptive TPM bounds and how the limit ramps with usage.

Prompt caching

How caching works and how to maximize hit rate.

On-demand deployments

Dedicated GPUs for predictable throughput and custom models.