# What is Serverless
Serverless is multi-tenant inference for popular open models running on Fireworks-managed infrastructure. You point your client at `api.fireworks.ai`, send tokens, and pay only for what you use — no GPUs to size, no autoscaler to tune, no cold starts to wait through. Models eligible for Serverless carry the **Serverless** tag in the model library. To make your first call, see the Serverless quickstart.
## Serverless products at a glance
Three traffic tiers run on the same Serverless framework. They share the same rate-limit policy, but route and price differently:

- **Standard** — the default tier. No `service_tier` parameter needed.
- **Priority** — higher reliability during peak periods. Opt in by setting `service_tier: "priority"` on chat completions. Priced at a premium.
- **Fast** — high-speed deployments for latency-sensitive workloads. Selected by switching the `model` id to a Fast variant (for example, `accounts/fireworks/routers/kimi-k2p6-turbo`).
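The three tiers differ only in the request body you send to the OpenAI-compatible chat completions endpoint. A minimal sketch — the Standard-tier model id here is illustrative, not a recommendation:

```python
import json

# Standard: no tier parameter at all (model id is illustrative).
standard = {
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
}

# Priority: same model id, opt in with service_tier.
priority = {**standard, "service_tier": "priority"}

# Fast: no parameter; switch the model id to a Fast variant instead.
fast = {**standard, "model": "accounts/fireworks/routers/kimi-k2p6-turbo"}

print(json.dumps(priority, indent=2))
```

POST any of these bodies to the chat completions endpoint with your API key to select the corresponding tier.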
## Billing
Serverless is priced per token. Three dimensions are billed:

- **Input tokens** — what you send to the model.
- **Cached input tokens** — input tokens served from the prompt cache, discounted (default 50% of the input rate on text and vision models, unless a model lists a different cached rate).
- **Generated tokens** — what the model produces.

The `usage` object in each response is the source of truth for what was billed (`prompt_tokens`, `completion_tokens`, `total_tokens`). Batch inference is billed at 50% of standard Serverless rates on both input and output; see Batch inference.

Your spend tier influences Serverless capacity caps in addition to your monthly budget — higher spend tiers unlock higher TPM upper bounds. See Account quotas and Serverless rate limits.
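Putting the three dimensions together, a request's cost can be estimated from the `usage` object plus the cached token count. A sketch with illustrative rates (dollars per million tokens — see the Pricing page for real numbers):

```python
def estimate_cost(prompt_tokens: int, cached_prompt_tokens: int,
                  completion_tokens: int, input_rate: float,
                  output_rate: float, cached_discount: float = 0.5) -> float:
    """Rates are $ per 1M tokens; cached input is billed at
    cached_discount * input_rate (default 50%)."""
    uncached = prompt_tokens - cached_prompt_tokens
    billed = (uncached * input_rate
              + cached_prompt_tokens * input_rate * cached_discount
              + completion_tokens * output_rate)
    return billed / 1_000_000

# 1,000 prompt tokens (500 of them cached) and 200 generated tokens
# at $1/M input and $2/M output.
print(estimate_cost(1000, 500, 200, input_rate=1.0, output_rate=2.0))
```

The cached token count comes from the `fireworks-cached-prompt-tokens` response header described below.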
## Request and response headers
Headers a Serverless caller will set or read.

### Request headers
| Header | Notes |
|---|---|
| `Authorization: Bearer $FIREWORKS_API_KEY` | Required for all requests. |
| `x-session-affinity` | Optional sticky-routing key. Pin repeated requests to the same replica to maximize prompt-cache hit rate. See Prompt caching. |
### Response headers
Fireworks sets the following on Serverless inference responses:

| Header | What it tells you |
|---|---|
| `fireworks-prompt-tokens` | Input tokens for the request. |
| `fireworks-cached-prompt-tokens` | Cached portion of the input. See Prompt caching. |
| `X-Ratelimit-Limit-Tokens-Prompt` | Your current Total Prompt Tokens per Second limit. |
| `X-Ratelimit-Limit-Tokens-Cache-Adjusted-Prompt` | Your current Total Uncached Prompt Tokens per Second limit. |
| `X-Ratelimit-Limit-Tokens-Generated` | Your current Total Generated Tokens per Second limit. |
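The two token-count headers let you monitor your cache hit rate per request. A sketch with illustrative values — in a real client you would read these from the HTTP response's header map (HTTP header names are case-insensitive):

```python
# Illustrative header values from a Serverless response.
headers = {
    "fireworks-prompt-tokens": "1024",
    "fireworks-cached-prompt-tokens": "768",
    "x-ratelimit-limit-tokens-prompt": "60000",
}

prompt_tokens = int(headers["fireworks-prompt-tokens"])
cached_tokens = int(headers["fireworks-cached-prompt-tokens"])
hit_rate = cached_tokens / prompt_tokens if prompt_tokens else 0.0
print(f"cache hit rate: {hit_rate:.0%}")
```

A low hit rate on repeated prompts is a signal to set `x-session-affinity` or restructure your prompts so the stable prefix comes first.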
## Prompt caching
Prompt caching is on by default for every Serverless model. Cached input tokens are billed at the discounted rate (default 50% of input). Caching is replica-local, so to maximize hit rate, route repeated prompts to the same replica — pass a stable identifier in `x-session-affinity` (or in the OpenAI `user` field) for each user or session whose prompts share a prefix.
For the full guide, including how to structure prompts for cache hits and how to read cache metrics, see Prompt caching.
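Because the cache matches on prefixes, keep the long, stable part of the prompt first and the variable part last. A sketch with illustrative prompt text:

```python
# Long, identical across requests -> caches well as a shared prefix.
SYSTEM_PROMPT = (
    "You are a support assistant for Acme Co.\n"
    "Policy: answer only from the handbook; escalate billing issues."
)

def build_messages(user_question: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # stable prefix
        {"role": "user", "content": user_question},    # varying suffix
    ]

msgs = build_messages("How do I reset my password?")
print(msgs[1]["content"])
```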
## Serverless model lifecycle
Serverless models are managed by the Fireworks team and may be updated or deprecated as new models are released. We provide at least 2 weeks' advance notice before removing any model, with longer notice periods for popular models based on usage. For production workloads that require long-term model stability, we recommend on-demand deployments, which give you full control over model versions and updates.

## Serverless vs On-demand
| When Serverless fits | When On-demand fits |
|---|---|
| Pay per token, only for what you use | Pay per GPU-hour for dedicated capacity |
| You’re using popular base models that Fireworks already hosts | You’re running custom base models or fine-tuned LoRA models (LoRA requires On-demand) |
| You don’t want to manage scaling, replicas, or hardware sizing | You have custom latency requirements and want control over hardware and replicas |
## Next steps

- **Serverless quickstart**: Make your first Serverless API call.
- **Priority and Fast**: Higher-reliability and higher-speed Serverless tiers.
- **Pricing**: Per-token rates for text, vision, embeddings, and Priority.
- **Rate limits**: Adaptive TPM bounds and how the limit ramps with usage.
- **Prompt caching**: How caching works and how to maximize hit rate.
- **On-demand deployments**: Dedicated GPUs for predictable throughput and custom models.