Rate limits, spend limits and quotas

Rate limits on serverless

Rate limits on Serverless exist to ensure fair usage and reasonable performance for all users. We enforce fixed maximum rate limits together with a spike arrest policy. Please read this section completely to understand how rate limits work.

Fixed limits reflect the maximum usage allowed on Serverless
Usage that spikes quickly may be throttled if a serverless deployment is in the process of scaling

If you need higher rate limits, faster speeds, more consistent latency, or guaranteed reliability with SLAs, contact us to learn more about our Enterprise offerings, or consider using on-demand deployments.

Fixed limits

Limits	Self-Serve
Requests per minute	6,000
Audio minutes per minute, Whisper-v3-large	200
Audio minutes per minute, Whisper-v3-turbo	400
Concurrent connections, streaming speech transcription	10
# LoRAs	100

Spike arrest policy

LLM traffic that spikes quickly may be throttled. Here’s how it works:

Each user has a guaranteed rate limit, which increases with sustained usage near the limit. Typically, you can expect to stay within the limits if your traffic gradually doubles within an hour.
You can see your guaranteed limits in the API response headers (see below).
Exceeding your guaranteed limit means that your requests may be processed with lower priority. Fireworks operates serverless deployments by autoscaling capacity (within limits) as user traffic increases. However, if a deployment is overloaded while autoscaling, requests that fall outside of guaranteed limits may be processed with higher latency or, if limits are significantly exceeded, dropped with HTTP code 429. You can check whether you are exceeding your limits via the API response header x-ratelimit-over-limit: yes.
Exceeding your guaranteed limit does not guarantee that your requests will be throttled. To check whether your requests are actually being throttled, monitor their latencies.
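One way a client might react to HTTP 429 responses is to retry with exponential backoff and jitter. The following is a minimal sketch, not an official Fireworks client: `call_with_backoff`, the `send` callable, and the delay values are all hypothetical. `send` stands in for whatever function issues your request and returns a status code and the response headers.

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send` until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable that performs one request and
    returns (status_code, response_headers).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status, headers
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter: the delay doubles on each
            # retry and is scaled by a random factor between 1 and 2.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
    # Every attempt was throttled; return the last 429 response.
    return status, headers
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting. In production code you would normally leave it at the default.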

Here’s an example of how dynamic rate limits scale up:

Metric	Minimum Guaranteed Limit	10 Minutes	1 Hour	2 Hours
Requests per minute	60	120	720	1,440
Input tokens per minute	60,000	120,000	720,000	1,440,000
Output tokens per minute	6,000	12,000	72,000	144,000
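To compare a planned traffic ramp against the example values, the table can be encoded as a lookup. This is only a sketch under assumptions: it treats the column headings as minutes of sustained usage, holds the limit flat between the listed checkpoints (a conservative reading), and stops at the last column. The actual scaling behaviour between and beyond these points is not specified here, and `example_guaranteed_limits` is a hypothetical helper.

```python
from bisect import bisect_right

# Checkpoints copied from the example table: (minutes, multiplier over
# the minimum guaranteed limit). Values between checkpoints are assumed
# to stay at the previous checkpoint, which is a conservative guess.
CHECKPOINT_MINUTES = [0, 10, 60, 120]
CHECKPOINT_MULTIPLIERS = [1, 2, 12, 24]

MINIMUM_GUARANTEED = {
    "requests_per_minute": 60,
    "input_tokens_per_minute": 60_000,
    "output_tokens_per_minute": 6_000,
}


def example_guaranteed_limits(minutes: float) -> dict:
    """Return the example table's limits for the latest checkpoint reached."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    index = bisect_right(CHECKPOINT_MINUTES, minutes) - 1
    multiplier = CHECKPOINT_MULTIPLIERS[index]
    return {name: value * multiplier for name, value in MINIMUM_GUARANTEED.items()}


print(example_guaranteed_limits(60)["requests_per_minute"])  # 720
```

For live values, rely on the rate limit response headers rather than on this table.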

Rate limit response headers

Header	Description
x-ratelimit-limit-requests, x-ratelimit-limit-tokens-prompt, x-ratelimit-limit-tokens-generated	The maximum number of requests or tokens that are permitted per minute before the limit is exhausted and future requests are de-prioritized. `requests` refers to the number of completions (`n > 1` counts as several requests). `tokens-prompt` and `tokens-generated` refer to the number of input and output tokens respectively.
x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens-prompt, x-ratelimit-remaining-tokens-generated	The remaining number of requests or tokens that are permitted before exhausting the rate limit. Note that the limit is replenished continuously. If your usage stays consistently below the rate limit, this number will hover near its maximum value.
x-ratelimit-over-limit	Contains “yes” or “no”. The value “yes” means that at least one of the limits is exhausted and this request was executed with lower priority.
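The headers above can be collected into a single structure for logging or alerting. The sketch below works on any mapping of header names to values, so it does not depend on a particular HTTP library. `read_rate_limit_headers` is a hypothetical helper, and it assumes the limit and remaining values are plain integers. Anything missing or not parseable as an integer is reported as `None`.

```python
from typing import Mapping, Optional

# Suffixes taken from the header names in the table above.
KINDS = ("requests", "tokens-prompt", "tokens-generated")


def _to_int(value: Optional[str]) -> Optional[int]:
    """Parse an integer header value, returning None if absent or malformed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def read_rate_limit_headers(headers: Mapping[str, str]) -> dict:
    """Summarise the x-ratelimit-* headers of one response."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    over_limit = lowered.get("x-ratelimit-over-limit", "").strip().lower() == "yes"
    status = {"over_limit": over_limit}
    for kind in KINDS:
        status[kind] = {
            "limit": _to_int(lowered.get(f"x-ratelimit-limit-{kind}")),
            "remaining": _to_int(lowered.get(f"x-ratelimit-remaining-{kind}")),
        }
    return status
```

A caller could, for example, log a warning whenever `over_limit` is true, or slow down when a `remaining` value approaches zero.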

Daily token limits

Daily token limits are set at thresholds that provide smooth transitions to enterprise reservations. If you think you may hit daily token limits, please contact us to learn about enterprise packages.

Limits	Self-Serve
Tokens per day, models < 40B	2.5B
Tokens per day, models between 40B and 100B	1.25B
Tokens per day, models > 100B (incl. large MoE like DeepSeek R1)	600M
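As a quick way to find which daily limit applies to a model, the table can be written as a small lookup. This is a sketch: `daily_token_limit` is a hypothetical helper, and placing models of exactly 40B or 100B parameters in the middle band is an assumption, since the table does not say how the boundaries are handled.

```python
def daily_token_limit(model_size_billions: float) -> int:
    """Return the Self-Serve daily token limit for a given model size.

    Boundary handling (exactly 40B or 100B falls in the middle band)
    is an assumption; the table does not state it.
    """
    if model_size_billions <= 0:
        raise ValueError("model size must be positive")
    if model_size_billions < 40:
        return 2_500_000_000  # 2.5B tokens per day
    if model_size_billions <= 100:
        return 1_250_000_000  # 1.25B tokens per day
    return 600_000_000  # 600M tokens per day


print(daily_token_limit(8))  # 2500000000
```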

GPU limits with on-demand deployments

If you need higher limits, contact us to learn more about our Enterprise offerings.

Quota Name	Default Value
# Nvidia A100	8
# Nvidia H100	8
# Nvidia H200	8
# AMD MI300X	8
Total GPU Hours per month	2000
# LoRAs	100
Note that the limit on # LoRAs is a total limit across Serverless and On-Demand.

Spend limits

To prevent fraud, Fireworks imposes a monthly spending limit on your account. Once you hit the spending limit, your account automatically enters a suspended state: API requests are rejected and all Fireworks usage stops, including serverless inference, dedicated deployments, and fine-tuning jobs.

Your spend limit increases over time as you spend more on the platform. You can also raise it at any time by purchasing prepaid credits to meet the historical spend required for a higher tier. For instance, if you are a new Tier 1 user with $0 historical spend, you can purchase $100 in prepaid credits and become a Tier 2 user.

After you prepay for credits, the change may take a few minutes to propagate, and you may still see a “monthly usage exceeded” error during that time.

Tier	Qualification	Spending Limit
Tier 1	Valid payment method added	$50/mo
Tier 2	$50 spent in payments or credits added	$500/mo
Tier 3	$500 spent in payments or credits added	$5,000/mo
Tier 4	$5,000 spent in payments or credits added	$50,000/mo
Unlimited	Contact us at inquiries@fireworks.ai	Unlimited
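The tier table can be read as a threshold lookup on historical spend, counting payments and added credits together. The sketch below assumes a valid payment method is already on file (the Tier 1 requirement) and leaves out the Unlimited tier, which is arranged by contacting Fireworks. `tier_for_spend` is a hypothetical helper, not part of any Fireworks tooling.

```python
from typing import Tuple

# (tier name, qualifying spend in USD, monthly spending limit in USD),
# copied from the table above. Tier 1 is assumed to need only a valid
# payment method, modelled here as a threshold of 0.
TIERS = [
    ("Tier 1", 0, 50),
    ("Tier 2", 50, 500),
    ("Tier 3", 500, 5_000),
    ("Tier 4", 5_000, 50_000),
]


def tier_for_spend(historical_spend_usd: float) -> Tuple[str, int]:
    """Return (tier name, monthly spending limit) for a historical spend."""
    if historical_spend_usd < 0:
        raise ValueError("spend must be non-negative")
    name, _, limit = TIERS[0]
    for tier_name, threshold, tier_limit in TIERS:
        # Keep the highest tier whose threshold has been reached.
        if historical_spend_usd >= threshold:
            name, limit = tier_name, tier_limit
    return name, limit


print(tier_for_spend(100))  # ('Tier 2', 500)
```

The printed case matches the example in the text: a new user who buys $100 of prepaid credits qualifies for Tier 2.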

Reducing spend limits

In some cases, developers want to lower their spend limit, for example to guard against unexpected costs if their app suddenly goes viral. You can lower or raise your spend limit to any value within your tier with the following command:

firectl update quota monthly-spend-usd --value <VALUE>

Viewing quotas

You can view your current quota capacity by running:

firectl list quotas

Account suspension

Account suspension occurs when your spending limit is hit, when no payment method is on file after credits are depleted, or when a past invoice payment fails. If you have a failed payment, go to the Invoices section at https://fireworks.ai/billing and pay all failed invoices. Your account will then be unsuspended automatically. If your account is still suspended after 1 hour, contact the Fireworks team in Discord or via email.
