On-demand deployment scaling
Understanding Fireworks.ai system scaling and request handling capabilities.
System scaling
Q: How does the system scale?
Our system is horizontally scalable, meaning it:
- Scales linearly with additional replicas of the deployment
- Automatically allocates resources based on demand
- Distributes incoming load efficiently across replicas
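The linear-scaling point can be sketched as a back-of-envelope calculation. The per-replica figure below is an illustrative placeholder, not a measured number:

```python
def aggregate_throughput(replicas: int, tokens_per_sec_per_replica: float) -> float:
    """Horizontal scaling: total throughput grows linearly with replica count."""
    return replicas * tokens_per_sec_per_replica

# Illustrative only: assume each replica serves ~1000 tokens/sec.
total = aggregate_throughput(4, 1000.0)  # 4 replicas -> 4x the capacity
```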
Auto scaling
Q: Do you support auto scaling?
Yes, our system supports auto scaling with the following features:
- Scale-to-zero during idle periods for resource efficiency
- Controllable scale-up and scale-down velocity
- Custom scaling rules and thresholds to match your specific needs
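The three features above can be illustrated with a minimal autoscaler sketch. The knob names (`min_replicas`, `target_load_per_replica`, `max_step`) are hypothetical and chosen for illustration; they are not the Fireworks API:

```python
import math
from dataclasses import dataclass

@dataclass
class AutoscalePolicy:
    # Hypothetical policy knobs; names are illustrative, not the Fireworks API.
    min_replicas: int = 0                  # 0 enables scale-to-zero
    max_replicas: int = 8
    target_load_per_replica: float = 10.0  # e.g. concurrent requests per replica
    max_step: int = 2                      # scale-up/down velocity per interval

def next_replica_count(policy: AutoscalePolicy, current: int, load: float) -> int:
    """One autoscaler tick: move toward the replica count that meets demand,
    changing by at most `max_step` replicas per interval (velocity control)."""
    desired = math.ceil(load / policy.target_load_per_replica) if load > 0 else 0
    desired = max(policy.min_replicas, min(policy.max_replicas, desired))
    step = max(-policy.max_step, min(policy.max_step, desired - current))
    return current + step
```

With the defaults, a burst of 35 concurrent requests from a cold start scales up by at most 2 replicas per tick, and zero load eventually scales the deployment back to zero.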
Throughput capacity
Q: What’s the supported throughput?
Throughput capacity typically depends on several factors:
- Deployment type (serverless or on-demand)
- Traffic and request patterns
- Hardware configuration
- Model size and complexity
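To see how hardware and model size interact, here is a common back-of-envelope heuristic (not a Fireworks-specific formula): LLM decode is typically memory-bandwidth bound, so single-stream throughput is capped by how fast the weights can be streamed from GPU memory. All numbers below are illustrative assumptions:

```python
def decode_tokens_per_sec(model_bytes: float, hbm_bandwidth_bytes: float,
                          batch: int = 1) -> float:
    """Rough upper bound for memory-bandwidth-bound decode: each generated
    token must read all weights once, so batch-1 throughput is at most
    bandwidth / model size; batching amortizes the same weight reads."""
    return batch * hbm_bandwidth_bytes / model_bytes

# Illustrative: a 7B-parameter model at fp16 (~14 GB of weights)
# on a GPU with ~2 TB/s of memory bandwidth.
est = decode_tokens_per_sec(14e9, 2e12)  # upper bound, tokens/sec per stream
```

Real throughput falls below this bound (attention/KV-cache traffic, scheduling overhead), but the heuristic explains why bigger models and slower GPUs both reduce tokens per second.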
Request handling
Q: What factors affect the number of simultaneous requests that can be handled?
The request handling capacity is influenced by multiple factors:
- Model size and type
- Number of GPUs allocated to the deployment
- GPU type (e.g., A100 vs. H100)
- Prompt size and generation token length
- Deployment type (serverless vs. on-demand)
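One way these factors combine is through KV-cache memory: each in-flight request must hold a key/value cache on the GPU, so concurrency is bounded by how many caches fit in free memory. The sizing formula below is the standard KV-cache calculation; the model and memory figures are illustrative assumptions, not Fireworks limits:

```python
def max_concurrent_requests(free_vram_bytes: int, layers: int, kv_heads: int,
                            head_dim: int, max_seq_len: int,
                            bytes_per_elem: int = 2) -> int:
    """KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
    Concurrency is bounded by how many full-length caches fit in free VRAM."""
    kv_bytes_per_request = 2 * layers * kv_heads * head_dim * bytes_per_elem * max_seq_len
    return int(free_vram_bytes // kv_bytes_per_request)

# Illustrative: a 7B-class model (32 layers, 32 KV heads, head_dim 128),
# fp16 cache, 4096-token context, ~40 GB of VRAM left after the weights.
n = max_concurrent_requests(40_000_000_000, 32, 32, 128, 4096)
```

This is why GPU type (more memory per H100 than A100), model size, and prompt/generation length all move the concurrency ceiling; models with grouped-query attention (fewer KV heads) fit many more requests.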
Additional resources
- Discord Community: discord.gg/fireworks-ai
- Email Support: inquiries@fireworks.ai
- Documentation: Fireworks.ai docs