Billing & scaling
Understanding billing and scaling mechanisms for on-demand deployments.
Autoscaling and costs
Q: How does autoscaling affect my costs?
- Scaling from 0: No minimum cost when scaled to zero
- Scaling up: Each new replica adds to your total cost proportionally. For example:
  - Scaling from 1 to 2 replicas doubles your GPU costs
  - If each replica uses multiple GPUs, costs scale accordingly (e.g., scaling from 1 to 2 replicas with 2 GPUs each means paying for 4 GPUs total)
For current pricing details, please visit our pricing page.
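The proportional scaling described above can be sketched as a small calculation. The hourly rate below is a placeholder for illustration only, not actual pricing:

```python
def hourly_gpu_cost(replicas: int, gpus_per_replica: int, rate_per_gpu_hour: float) -> float:
    """Total hourly cost scales linearly with the number of active GPUs."""
    return replicas * gpus_per_replica * rate_per_gpu_hour

# Hypothetical rate of $3.00 per GPU-hour (illustrative only):
rate = 3.00
print(hourly_gpu_cost(1, 2, rate))  # 1 replica x 2 GPUs  -> 6.0
print(hourly_gpu_cost(2, 2, rate))  # 2 replicas x 2 GPUs -> 12.0 (doubled)
print(hourly_gpu_cost(0, 2, rate))  # scaled to zero      -> 0.0
```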
Rate-limits for on-demand deployment
Q: What are the rate limits for on-demand deployments?
Request throughput scales with your GPU allocation. Base allocations include:
- Up to 8 A100 GPUs
- Up to 8 H100 GPUs
On-demand deployments offer several advantages:
- Predictable pricing based on time units, not token I/O
- Consistent latency and performance, unaffected by traffic on the serverless platform
- Choice of GPUs, including A100s and H100s
Need more GPUs? Contact us to discuss higher allocations for your specific use case.
On-demand billing
Q: How does billing work for on-demand deployments?
On-demand deployments come with automatic cost optimization features:
- Default autoscaling: Automatically scales to 0 replicas when not in use
- Pay for what you use: Charged only for GPU time when replicas are active
- Flexible configuration: Customize autoscaling behavior to match your needs
Best practices for cost management:
- Leverage default autoscaling: The system automatically scales down deployments when not in use
- Customize carefully: While you can modify autoscaling behavior using our configuration options, note that preventing scale-to-zero will result in continuous GPU charges
- Consider your use case: For intermittent or low-frequency usage, serverless deployments might be more cost-effective
For detailed configuration options, see our deployment guide.
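As a rough way to reason about the "consider your use case" point above, you can compare estimated monthly costs under the two models. All rates here are hypothetical placeholders, not real pricing:

```python
def on_demand_monthly_cost(active_hours: float, gpus: int, rate_per_gpu_hour: float) -> float:
    """On-demand: pay for GPU time while replicas are active
    (assumes scale-to-zero during the remaining hours)."""
    return active_hours * gpus * rate_per_gpu_hour

def serverless_monthly_cost(tokens_millions: float, rate_per_million_tokens: float) -> float:
    """Serverless: pay per token processed, with no idle cost."""
    return tokens_millions * rate_per_million_tokens

# Hypothetical rates, for illustration only:
od = on_demand_monthly_cost(active_hours=20, gpus=1, rate_per_gpu_hour=3.00)
sl = serverless_monthly_cost(tokens_millions=50, rate_per_million_tokens=0.50)
print(f"on-demand: ${od:.2f}, serverless: ${sl:.2f}")
```

For intermittent, low-volume traffic the serverless side of this comparison often comes out cheaper; for sustained high throughput, the fixed GPU-time cost of on-demand can win.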
Scaling structure
Q: How does billing and scaling work for on-demand GPU deployments?
On-demand GPU deployments have unique billing and scaling characteristics compared to serverless deployments:
Billing:
- Charges start when the server begins accepting requests
- Billed by GPU-second for each active instance
- Costs accumulate while replicas are running, even if no API calls are being made
Scaling options:
- Supports autoscaling from 0 to multiple replicas
- Each additional replica's GPUs add to the billing rate
- No hard request limits; throughput is bounded only by the allocated GPUs' capacity
Management requirements:
- Not fully serverless; requires some manual management
- Manually delete deployments when no longer needed
- Or configure autoscaling to scale down to 0 during inactive periods
Cost control tips:
- Regularly monitor active deployments
- Delete unused deployments to avoid unnecessary costs
- Consider serverless options for intermittent usage
- Use autoscaling to 0 to optimize costs during low-demand times
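The GPU-second billing described above can be sketched as a simple accumulator over intervals of replica activity. The timeline format here is an illustrative assumption, not an actual API:

```python
def billed_gpu_seconds(replica_timeline, gpus_per_replica: int) -> int:
    """Sum GPU-seconds over a timeline of (duration_seconds, active_replicas) intervals.

    Intervals with zero replicas (scaled to zero) contribute nothing,
    even though the deployment itself still exists.
    """
    return sum(duration * replicas * gpus_per_replica
               for duration, replicas in replica_timeline)

# One hour with 1 replica, one hour scaled to zero, 30 minutes with 2 replicas:
timeline = [(3600, 1), (3600, 0), (1800, 2)]
print(billed_gpu_seconds(timeline, gpus_per_replica=1))  # 3600 + 0 + 3600 = 7200
```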
Additional resources
- Discord Community: discord.gg/fireworks-ai
- Email Support: inquiries@fireworks.ai
- Contact our sales team for custom pricing options