Autoscaling and costs

Q: How does autoscaling affect my costs?
  • Scaling from 0: No minimum cost when scaled to zero
  • Scaling up: Each new replica adds to your total cost proportionally. For example:
    • Scaling from 1 to 2 replicas doubles your GPU costs
    • If each replica uses multiple GPUs, costs scale accordingly (e.g., scaling from 1 to 2 replicas with 2 GPUs each means paying for 4 GPUs total)
For current pricing details, please visit our pricing page.
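The replica arithmetic above can be sketched as a quick estimate. This is a minimal sketch: the per-GPU hourly rate below is a hypothetical placeholder, not an actual price (see the pricing page for real rates).

```python
# Estimate hourly cost for a deployment: replicas * GPUs per replica * rate.
# RATE is a hypothetical placeholder, not an actual published price.

def hourly_cost(replicas: int, gpus_per_replica: int, rate_per_gpu_hour: float) -> float:
    """Total hourly cost; zero replicas means zero cost."""
    return replicas * gpus_per_replica * rate_per_gpu_hour

RATE = 2.90  # hypothetical $/GPU-hour

print(hourly_cost(0, 2, RATE))  # scaled to zero -> 0.0
print(hourly_cost(1, 2, RATE))  # 1 replica x 2 GPUs
print(hourly_cost(2, 2, RATE))  # 2 replicas x 2 GPUs: paying for 4 GPUs, double the cost
```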

Rate-limits for on-demand deployment

Q: What are the rate limits for on-demand deployments?
Request throughput scales with your GPU allocation. Base allocations include:
  • Up to 8 A100 GPUs
  • Up to 8 H100 GPUs
On-demand deployments offer several advantages:
  • Predictable pricing based on GPU time, not token I/O
  • Consistent latency and performance, unaffected by traffic on the serverless platform
  • Choice of GPUs, including A100s and H100s
Need more GPUs? Contact us to discuss higher allocations for your specific use case.

On-demand billing

Q: How does billing work for on-demand deployments?
On-demand deployments come with automatic cost optimization features:
  • Default autoscaling: Automatically scales to 0 replicas when not in use
  • Pay for what you use: Charged only for GPU time when replicas are active
  • Flexible configuration: Customize autoscaling behavior to match your needs
Best practices for cost management:
  1. Leverage default autoscaling: The system automatically scales down deployments when not in use
  2. Customize carefully: While you can modify autoscaling behavior using our configuration options, note that preventing scale-to-zero will result in continuous GPU charges
  3. Consider your use case: For intermittent or low-frequency usage, serverless deployments might be more cost-effective
For detailed configuration options, see our deployment guide.
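To illustrate the "consider your use case" point above, here is a rough break-even sketch comparing serverless (per-token) and on-demand (per GPU-hour) billing for intermittent traffic. Both rates are hypothetical placeholders, not actual prices.

```python
# Rough break-even sketch: serverless bills per token, on-demand bills per
# GPU-hour while replicas are active. Both rates are hypothetical placeholders.

SERVERLESS_PER_M_TOKENS = 0.50  # hypothetical $/1M tokens
ON_DEMAND_PER_GPU_HOUR = 2.90   # hypothetical $/GPU-hour

def serverless_cost(tokens: int) -> float:
    """Cost of processing `tokens` tokens on the serverless platform."""
    return tokens / 1_000_000 * SERVERLESS_PER_M_TOKENS

def on_demand_cost(active_hours: float, gpus: int = 1) -> float:
    """Cost of keeping `gpus` GPUs active for `active_hours` on on-demand."""
    return active_hours * gpus * ON_DEMAND_PER_GPU_HOUR

# Intermittent usage: 10M tokens spread over 2 active GPU-hours per day.
print(serverless_cost(10_000_000))  # per-token cost for the day
print(on_demand_cost(2.0))          # GPU-time cost for the same day
```

At these placeholder rates, low or bursty daily volume favors serverless, while sustained traffic that keeps replicas busy favors on-demand's flat GPU-time pricing.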

Scaling structure

Q: How does billing and scaling work for on-demand GPU deployments?
On-demand GPU deployments have different billing and scaling characteristics from serverless deployments.
Billing:
  • Charges start when the server begins accepting requests
  • Billed by GPU-second for each active instance
  • Costs accumulate even if there are no active API calls
Scaling options:
  • Supports autoscaling from 0 to multiple GPUs
  • Each additional GPU adds to the billing rate
  • No per-request rate limit; throughput is bounded only by the GPUs' capacity
Management requirements:
  • Not fully serverless; requires some manual management
  • Manually delete deployments when no longer needed
  • Or configure autoscaling to scale down to 0 during inactive periods
Cost control tips:
  • Regularly monitor active deployments
  • Delete unused deployments to avoid unnecessary costs
  • Consider serverless options for intermittent usage
  • Use autoscaling to 0 to optimize costs during low-demand times
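The GPU-second billing model above can be sketched as follows. The rate is a hypothetical placeholder; what matters is the shape of the calculation: cost accrues for every second each replica is active, whether or not requests arrive.

```python
# Sketch of GPU-second billing: sum active seconds across replicas,
# multiply by a per-GPU-second rate. The rate is a hypothetical placeholder.

RATE_PER_GPU_SECOND = 2.90 / 3600  # hypothetical, derived from a $/GPU-hour figure

def billed_cost(active_intervals_s: list[tuple[int, int]], gpus_per_replica: int) -> float:
    """active_intervals_s: (start, end) seconds during which each replica
    was accepting requests. Idle-but-active time is still billed."""
    gpu_seconds = sum((end - start) * gpus_per_replica
                      for start, end in active_intervals_s)
    return gpu_seconds * RATE_PER_GPU_SECOND

# One replica active for an hour, a second replica scaled up for the last 30 minutes:
cost = billed_cost([(0, 3600), (1800, 3600)], gpus_per_replica=1)
print(round(cost, 2))
```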

Additional resources