Rate limits, spend limits and quotas
Rate limits, spend limits and quotas for serverless inference and on-demand deployments
Dynamic Rate Limits on Serverless
Dynamic rate limits let users reach high rate limits (for example, up to 2000 RPM) while ensuring fair platform usage and reasonable performance for all users. Here’s how it works:
- Each user has a dynamic rate limit, which increases with sustained usage near the current limit. Typically, you can expect to stay within the limits if your traffic gradually doubles within an hour.
- The actual rate of increase depends on model size, traffic load, capacity availability, and other factors. The API response headers (see below) report your current limits, so you know when more capacity is available.
- If you exceed your dynamic rate limit, the requests will still be processed but with lower priority. Those requests may see higher latency. You can monitor this via the API response header `x-ratelimit-over-limit: yes`. If you significantly exceed your dynamic rate limit, the requests will be dropped with HTTP code 429.
- Dynamic rate limits work similarly to “autoscaling” in many infrastructure systems. A gradual increase in traffic volume results in increased available capacity. Abrupt spikes in traffic may cause overload.
- If you need higher rate limits, more consistent latency, or guaranteed reliability with SLAs, we offer a Scale tier with committed spend; contact us. You can also consider using on-demand deployments, where you can scale the size of your deployment to meet your needs.
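Because requests that significantly exceed the dynamic limit are dropped with HTTP code 429, a client usually wants to retry them with a delay rather than immediately. The following is a minimal sketch of exponential backoff with jitter. The `send` callable, its `(status, headers)` return shape, and the delay parameters are hypothetical and chosen for illustration; they are not part of the Fireworks API.

```python
import random
import time
from typing import Callable, Dict, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, Dict[str, str]]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Dict[str, str]]:
    """Call `send` until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable that performs one request and
    returns (http_status, response_headers).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, headers = 429, {}
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status, headers
        if attempt == max_attempts - 1:
            break  # out of attempts: return the last 429 to the caller
        # Exponential backoff with full jitter: wait a random time in
        # [0, min(max_delay, base_delay * 2**attempt)] before retrying.
        delay = min(max_delay, base_delay * 2 ** attempt)
        sleep(random.uniform(0, delay))
    return status, headers
```

Backing off also avoids the abrupt spikes described above, since retries are spread out over time instead of arriving in a burst.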
Here’s an example of how dynamic rate limits scale up:
Metric | Starting Limit | 10 Minutes | 1 Hour | 2 Hours |
---|---|---|---|---|
Requests per minute | 60 | 120 | 720 | 1440 |
Input tokens per minute | 60000 | 120000 | 720000 | 1440000 |
Output tokens per minute | 6000 | 12000 | 72000 | 144000 |
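The guidance that traffic can "gradually double within an hour" can be turned into a client-side ramp plan. The sketch below computes target request rates that double every `doubling_minutes`. The function name and parameters are illustrative assumptions, and actual limit growth depends on the factors listed above. In the example table the limit grows faster than doubling per hour, so this schedule is conservative relative to that example.

```python
from typing import List, Tuple


def ramp_schedule(
    start_rpm: float,
    total_minutes: int,
    step_minutes: int = 10,
    doubling_minutes: float = 60.0,
) -> List[Tuple[int, int]]:
    """Return (minute, target_rpm) pairs for a gradual traffic ramp.

    The target rate doubles every `doubling_minutes`, following
    target = start_rpm * 2 ** (minute / doubling_minutes).
    """
    schedule = []
    for minute in range(0, total_minutes + 1, step_minutes):
        rpm = start_rpm * 2 ** (minute / doubling_minutes)
        schedule.append((minute, int(rpm)))  # round down to stay under target
    return schedule


if __name__ == "__main__":
    for minute, rpm in ramp_schedule(start_rpm=60, total_minutes=120):
        print(f"t={minute:3d} min  target={rpm} RPM")
```

A load generator could consult this schedule to cap its send rate at each step, then check the rate limit response headers to see whether more capacity has become available.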
Rate limit response headers
Header | Description |
---|---|
x-ratelimit-limit-requests, x-ratelimit-limit-tokens-prompt, x-ratelimit-limit-tokens-generated | The maximum number of requests or tokens that are permitted per minute before the limit is exhausted and future requests are de-prioritized. requests refers to the number of completions (n > 1 counts as several requests). tokens-prompt and tokens-generated refer to the number of input and output tokens respectively. |
x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens-prompt, x-ratelimit-remaining-tokens-generated | The remaining number of requests or tokens that are permitted before exhausting the rate limit. Note that the limit is replenished continuously. If your usage is sustainably below the rate limit, this number will hover near its maximum value. |
x-ratelimit-over-limit | Contains “yes” or “no”. The value “yes” means that at least one of the limits is exhausted and this request was executed with lower priority. |
Spend limits
In order to prevent fraud, Fireworks imposes a monthly spending limit on your account. Once you hit the spending limit, your account will automatically enter a suspended state, API requests will be rejected and all Fireworks usage will be stopped. This includes serverless inference, dedicated deployments, and fine-tuning jobs.
Your spend limit will organically increase over time as you spend more on the platform. You can also increase your spend limit at any time by purchasing prepaid credits to meet the historical spend required for a higher tier. For instance, if you are a new Tier 1 user with $0 historical spend, you can purchase $100 in prepaid credits and become a Tier 2 user.
Tier | Spending Limit | Qualification |
---|---|---|
Tier 1 | $50/mo | Valid payment method added |
Tier 2 | $500/mo | Total historical spend of $100+ |
Tier 3 | $5,000/mo | Total historical spend of $1,000+ |
Tier 4 | $50,000/mo | Total historical spend of $10,000+ |
Unlimited | Unlimited | Contact us at inquiries@fireworks.ai |
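The tier table can be expressed as a small lookup, which is handy for estimating how many prepaid credits are needed to reach a given tier. This is a sketch built only from the thresholds in the table. It assumes a valid payment method is already on file (the Tier 1 qualification), and it omits the Unlimited tier because that tier is granted by contacting Fireworks. The function names are hypothetical.

```python
from typing import Tuple

# (tier name, minimum total historical spend in USD, monthly limit in USD),
# ordered from highest to lowest threshold.
TIERS = [
    ("Tier 4", 10_000, 50_000),
    ("Tier 3", 1_000, 5_000),
    ("Tier 2", 100, 500),
    ("Tier 1", 0, 50),
]


def tier_for_spend(historical_spend_usd: float) -> Tuple[str, int]:
    """Return (tier name, monthly spending limit) for a historical spend."""
    if historical_spend_usd < 0:
        raise ValueError("historical spend cannot be negative")
    for name, min_spend, monthly_limit in TIERS:
        if historical_spend_usd >= min_spend:
            return name, monthly_limit
    raise AssertionError("unreachable: Tier 1 has a threshold of 0")


def credits_to_reach(tier_name: str, historical_spend_usd: float) -> float:
    """Return the additional spend (e.g. prepaid credits) needed for a tier."""
    for name, min_spend, _ in TIERS:
        if name == tier_name:
            return max(0.0, min_spend - historical_spend_usd)
    raise KeyError(f"unknown tier: {tier_name}")
```

For the case described above, `credits_to_reach("Tier 2", 0)` evaluates to `100.0`, matching the $100 prepaid credit purchase.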
Reducing Spend Limits
In certain cases, developers want to reduce their spend limit, for example to guard against runaway costs if their app unexpectedly goes viral. Users can lower or raise spend limits to any arbitrary number within their Tier with the following command:
Other quotas
We impose limits on the number of custom models and LoRA adapters you can have in your account, as well as the number of A100 and H100 GPUs you can deploy in your on-demand deployments. Higher quotas are available for enterprise accounts; contact the Fireworks team at inquiries@fireworks.ai.
Quota Name | Default Value | Can be raised? |
---|---|---|
# deployed models | 100 | Yes |
# A100 GPUs | 8 | Yes |
# H100 GPUs | 8 | Yes |
Viewing quotas
You can view your current quota capacity by running:
Account suspension
Account suspension occurs when your spending limit is hit, no payment method is on file after credits are depleted, or past invoice payment fails. If you have a failed payment, go to the [Invoices] section at https://fireworks.ai/billing, pay all failed invoices, and your account will be automatically unsuspended. If your account is still suspended after 1 hour, contact the Fireworks team in Discord or via email.