Rate limits, spend limits and quotas for serverless inference and on-demand deployments
Rate limits on Serverless exist to ensure fair usage and reasonable performance for all users. We enforce fixed, maximum rate limits with a spike arrest policy - please read this section completely to understand how rate limits work.
If you need higher rate limits, faster speeds, more consistent latency, or guaranteed reliability with SLAs, contact us to learn more about our Enterprise offerings, or consider using on-demand deployments.
Limits | Self-Serve |
---|---|
Requests per minute | 6,000 |
Audio min per minute, Whisper-v3-large | 200 |
Audio min per minute, Whisper-v3-turbo | 400 |
Concurrent connections, streaming speech transcription | 10 |
# LoRAs | 100 |
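One way to stay under the requests-per-minute ceiling is to pace requests on the client side. The following is a minimal sketch of a token bucket, assuming the 6,000 requests per minute figure from the table. The class name `TokenBucket`, the burst `capacity` of 10, and the injectable `clock` are illustrative choices and not part of any Fireworks SDK.

```python
import time
from typing import Callable


class TokenBucket:
    """Client-side pacing helper (hypothetical, not a Fireworks API)."""

    def __init__(self, rate_per_minute: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = capacity
        self.tokens = capacity      # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Return True if a request may be sent now, otherwise False."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.rate_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# 6,000 requests per minute is 100 per second on average.
# The burst capacity of 10 is an assumed value, chosen only for illustration.
bucket = TokenBucket(rate_per_minute=6000, capacity=10)
```

A small capacity keeps bursts short, which fits a service that throttles traffic that spikes quickly. Call `bucket.try_acquire()` before each request and wait briefly when it returns `False`.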
LLM traffic that spikes quickly has the potential to be throttled. Here’s how it works:
Requests served while a limit is exhausted are marked with the response header x-ratelimit-over-limit: yes.
Here’s an example of how dynamic rate limits scale up:
Metric | Minimum Guaranteed Limit | 10 Minutes | 1 Hour | 2 Hours |
---|---|---|---|---|
Requests per minute | 60 | 120 | 720 | 1440 |
Input tokens per minute | 60000 | 120000 | 720000 | 1440000 |
Output tokens per minute | 6000 | 12000 | 72000 | 144000 |
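Reading the table as data makes the scaling easier to see. The short sketch below only restates the numbers above and computes each checkpoint as a multiple of the minimum guaranteed limit. It does not model how limits move between checkpoints, which the table alone does not specify. The names `TABLE`, `CHECKPOINTS`, and `multipliers` are illustrative.

```python
# Per-minute limits copied from the table above, one list per metric.
CHECKPOINTS = ["minimum", "10 minutes", "1 hour", "2 hours"]
TABLE = {
    "requests": [60, 120, 720, 1440],
    "input_tokens": [60000, 120000, 720000, 1440000],
    "output_tokens": [6000, 12000, 72000, 144000],
}


def multipliers(values):
    """Express each checkpoint as a multiple of the first (minimum) value."""
    base = values[0]
    return [v // base for v in values]


for metric, values in TABLE.items():
    print(metric, dict(zip(CHECKPOINTS, multipliers(values))))
```

Every metric scales by the same factors in this example: 2x after 10 minutes, 12x after 1 hour, and 24x after 2 hours.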
Header | Description |
---|---|
x-ratelimit-limit-requests, x-ratelimit-limit-tokens-prompt, x-ratelimit-limit-tokens-generated | The maximum number of requests or tokens that are permitted per minute before the limit is exhausted and future requests are de-prioritized. requests refers to the number of completions (n > 1 counts as several requests). tokens-prompt and tokens-generated refer to the number of input and output tokens respectively. |
x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens-prompt, x-ratelimit-remaining-tokens-generated | The remaining number of requests or tokens that are permitted before exhausting the rate limit. Note that the limit is replenished continuously. If your usage stays consistently below the rate limit, this number will hover near its maximum value. |
x-ratelimit-over-limit | Contains “yes” or “no”. The value “yes” means that at least one of the limits is exhausted and this request was executed with lower priority. |
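The headers above can be read from any HTTP response and turned into a simple back-off signal. The following is a minimal sketch that takes a plain mapping of header names to values, so it works with whichever HTTP client you use. The names `RateLimitStatus` and `parse_rate_limit_headers` are hypothetical, and the assumption that numeric headers are plain integers is not confirmed by the table.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateLimitStatus:
    limit_requests: Optional[int]
    remaining_requests: Optional[int]
    remaining_prompt_tokens: Optional[int]
    remaining_generated_tokens: Optional[int]
    over_limit: bool


def _to_int(value: Optional[str]) -> Optional[int]:
    # A header may be absent; the integer format is an assumption.
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def parse_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitStatus:
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    return RateLimitStatus(
        limit_requests=_to_int(h.get("x-ratelimit-limit-requests")),
        remaining_requests=_to_int(h.get("x-ratelimit-remaining-requests")),
        remaining_prompt_tokens=_to_int(
            h.get("x-ratelimit-remaining-tokens-prompt")),
        remaining_generated_tokens=_to_int(
            h.get("x-ratelimit-remaining-tokens-generated")),
        # "yes" means at least one limit is exhausted for this request.
        over_limit=h.get("x-ratelimit-over-limit", "no").strip().lower() == "yes",
    )
```

A caller could slow down whenever `over_limit` is true, since that request was already executed with lower priority.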
Daily token limits are set at thresholds that provide smooth transitions to enterprise reservations. If you think you may hit daily token limits, please contact us to learn about enterprise packages.
Limits | Self-Serve |
---|---|
Tokens per day, models < 40B | 2.5B |
Tokens per day, models between 40B and 100B | 1.25B |
Tokens per day, models > 100B (incl. large MoE like DeepSeek R1) | 600M |
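The table above can be expressed as a small lookup. This is a sketch only: the function name `daily_token_limit` is hypothetical, the parameter count is assumed to be the model's total size in billions, and the handling of models at exactly 40B or 100B is an assumption because the table does not say which bucket those fall into.

```python
def daily_token_limit(model_params_billions: float) -> int:
    """Return the self-serve daily token limit for a given model size."""
    if model_params_billions <= 0:
        raise ValueError("model size must be positive")
    if model_params_billions < 40:
        return 2_500_000_000   # 2.5B tokens per day
    if model_params_billions <= 100:
        # Assumption: 40B and 100B themselves belong to the middle bucket.
        return 1_250_000_000   # 1.25B tokens per day
    return 600_000_000         # 600M tokens per day


def fraction_used(tokens_used_today: int, model_params_billions: float) -> float:
    """Share of the daily limit already consumed, as a value from 0 upward."""
    return tokens_used_today / daily_token_limit(model_params_billions)
```

For example, `fraction_used(300_000_000, 120)` evaluates to `0.5`, meaning half of the daily allowance for a model above 100B has been used.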
If you need higher limits, contact us to learn more about our Enterprise offerings.
Quota Name | Default Value |
---|---|
# Nvidia A100 | 8 |
# Nvidia H100 | 8 |
# Nvidia H200 | 8 |
# AMD MI300X | 8 |
Total GPU Hours per month | 2000 |
# LoRAs | 100 |
Note that the limit on # LoRAs is a total limit across Serverless and On-Demand.
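To get a feel for the monthly GPU-hour quota, it helps to work out how long a deployment can run before using it up. The sketch below assumes that GPU hours are counted as the number of GPUs multiplied by wall-clock hours, which the table does not state explicitly. The function name `hours_until_quota_exhausted` is illustrative.

```python
def hours_until_quota_exhausted(gpu_count: int,
                                monthly_gpu_hours: float = 2000) -> float:
    """Wall-clock hours a deployment of gpu_count GPUs can run on the quota."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    # Assumption: GPU hours = number of GPUs x wall-clock hours.
    return monthly_gpu_hours / gpu_count


hours = hours_until_quota_exhausted(8)
print(f"{hours} hours, about {hours / 24:.1f} days")
```

Under that assumption, running the default maximum of 8 GPUs continuously would consume the 2000-hour quota in 250 hours, or a little over 10 days.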
In order to prevent fraud, Fireworks imposes a monthly spending limit on your account. Once you hit the spending limit, your account will automatically enter a suspended state, API requests will be rejected and all Fireworks usage will be stopped. This includes serverless inference, dedicated deployments, and fine-tuning jobs.
Your spend limit will organically increase over time as you spend more on the platform. You can also increase your spend limit at any time by purchasing prepaid credits to meet the historical spend required for a higher tier. For instance, if you are a new Tier 1 user with $0 historical spend, you can purchase $100 in prepaid credits and become a Tier 2 user.
Tier | Qualification | Spending Limit |
---|---|---|
Tier 1 | Valid payment method added | $50/mo |
Tier 2 | $50 spent in payments or credits added | $500/mo |
Tier 3 | $500 spent in payments or credits added | $5,000/mo |
Tier 4 | $5,000 spent in payments or credits added | $50,000/mo |
Unlimited | Contact us at inquiries@fireworks.ai | Unlimited |
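The tier table can be read as a threshold lookup on historical spend. The following is a minimal sketch under two assumptions: a valid payment method is already on file (the Tier 1 requirement), and payments and purchased credits are summed into a single `historical_spend_usd` figure. The Unlimited tier is left out because it is granted by contacting Fireworks rather than by spend. The names `TIERS` and `spend_tier` are illustrative.

```python
from typing import List, Tuple

# (tier name, qualifying spend in USD, monthly spending limit in USD),
# ordered from highest to lowest so the first match wins.
TIERS: List[Tuple[str, int, int]] = [
    ("Tier 4", 5000, 50000),
    ("Tier 3", 500, 5000),
    ("Tier 2", 50, 500),
    ("Tier 1", 0, 50),
]


def spend_tier(historical_spend_usd: float) -> Tuple[str, int]:
    """Return (tier name, monthly spending limit) for a given historical spend."""
    if historical_spend_usd < 0:
        raise ValueError("historical spend cannot be negative")
    for name, threshold, limit in TIERS:
        if historical_spend_usd >= threshold:
            return name, limit
    # Unreachable: the Tier 1 threshold of 0 always matches.
    raise AssertionError("no tier matched")
```

For example, `spend_tier(100)` returns `("Tier 2", 500)`, matching the case of a new user who buys $100 in prepaid credits.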
In certain cases, developers want to reduce their spend limit, for example to guard against runaway costs if their app suddenly goes viral. You can lower or raise your spend limit to any value within your tier with the following command:
You can view your current quota capacity by running:
Account suspension occurs when your spending limit is hit, when no payment method is on file after credits are depleted, or when a past invoice payment fails. If you have a failed payment, go to the [Invoices] section at https://fireworks.ai/billing and pay all failed invoices; your account will then be automatically unsuspended. If your account is still suspended after 1 hour, contact the Fireworks team in Discord or via email.