On-demand deployment scaling
Understanding Fireworks.ai system scaling and request handling capabilities.
System scaling
Q: How does the system scale?
Our system is horizontally scalable, meaning it:
- Scales linearly with additional replicas of the deployment (see the sketch below)
- Automatically allocates resources based on demand
- Manages distributed load handling efficiently
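As a rough illustration, replica bounds can be set when an on-demand deployment is created, so capacity scales horizontally with demand. This is a minimal sketch using firectl; the flag names and model ID are assumptions, so verify them with `firectl create deployment --help` for your CLI version.

```bash
# Minimal sketch: create an on-demand deployment with explicit replica
# bounds so it can scale horizontally between 1 and 4 replicas.
# Flag names and the model ID are assumptions -- verify against
# `firectl create deployment --help`.
firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \
  --min-replica-count 1 \
  --max-replica-count 4
```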
Auto scaling
Q: Do you support Auto Scaling?
Yes, our system supports auto scaling with the following features:
- Scaling down to zero capability for resource efficiency (sketched below)
- Controllable scale-up and scale-down velocity
- Custom scaling rules and thresholds to match your specific needs
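A minimal sketch of an autoscaling configuration follows. All flag names and values here are assumptions modeled on the deployments guide, so confirm them with `firectl create deployment --help` before relying on them.

```bash
# Minimal sketch: scale-to-zero plus scaling-velocity controls.
# --min-replica-count 0 lets the deployment idle down to zero replicas;
# the window flags govern how quickly it scales up and down.
# All flag names are assumptions -- confirm with
# `firectl create deployment --help`.
firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \
  --min-replica-count 0 \
  --max-replica-count 2 \
  --scale-up-window 30s \
  --scale-down-window 10m
```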
Throughput capacity
Q: What’s the supported throughput?
Throughput capacity typically depends on several factors:
- Deployment type (serverless or on-demand)
- Traffic patterns and request patterns
- Hardware configuration
- Model size and complexity
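Because throughput varies with all of these factors, the most reliable number comes from measuring your own workload. Here is a minimal probe against the OpenAI-compatible chat completions endpoint; the model ID is an example, and tokens/sec is roughly completion tokens divided by wall-clock time.

```bash
# Rough throughput probe: time one request and read back token usage.
# Tokens/sec ~= usage.completion_tokens / wall-clock seconds.
# The model ID is an example; substitute your own model or deployment.
time curl -s https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
        "messages": [{"role": "user", "content": "Summarize horizontal scaling in one sentence."}],
        "max_tokens": 128
      }' | jq '.usage'
```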
Request handling
Q: What factors affect the number of simultaneous requests that can be handled?
Request handling capacity is influenced by multiple factors:
- Model size and type
- Number of GPUs allocated to the deployment
- GPU type (e.g., A100 vs. H100)
- Prompt size and generation token length
- Deployment type (serverless vs. on-demand)
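One practical way to find the limit for your own deployment is to ramp concurrency and watch per-request latency. This is a minimal sketch that fires 8 concurrent requests with curl; the model ID is an example, and you would raise the count until latency starts to climb.

```bash
# Fire 8 concurrent requests and print per-request wall-clock time.
# Increase the count to find where latency starts to degrade.
# The model ID is an example; point this at your own deployment.
payload='{"model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
          "messages": [{"role": "user", "content": "ping"}],
          "max_tokens": 16}'
for i in $(seq 8); do
  curl -s -o /dev/null -w "req $i: %{time_total}s\n" \
    https://api.fireworks.ai/inference/v1/chat/completions \
    -H "Authorization: Bearer $FIREWORKS_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$payload" &
done
wait
```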
Additional resources
- Discord Community: discord.gg/fireworks-ai
- Email Support: inquiries@fireworks.ai
- Documentation: Fireworks.ai docs