System scaling

Q: How does the system scale?

Our system is horizontally scalable, meaning it:

  • Scales linearly with additional replicas of the deployment
  • Automatically allocates resources based on demand
  • Distributes incoming load efficiently across replicas
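The linear-scaling property above can be sketched with a toy capacity model. The per-replica figure here is a hypothetical assumption for illustration, not a measured number for any real deployment:

```python
# Illustrative sketch of linear horizontal scaling.
# PER_REPLICA_RPS is an assumed per-replica capacity, not a benchmark.
PER_REPLICA_RPS = 50  # assumed sustained requests/sec per replica

def cluster_capacity(replicas: int) -> int:
    """Total capacity grows linearly with the replica count."""
    return replicas * PER_REPLICA_RPS

for n in (1, 2, 4, 8):
    print(f"{n} replica(s) -> ~{cluster_capacity(n)} req/s")
```

In practice, measure your own per-replica throughput under realistic traffic before sizing a deployment.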

Auto scaling

Q: Do you support auto scaling?

Yes, our system supports auto scaling with the following features:

  • Scale-to-zero capability for resource efficiency
  • Controllable scale-up and scale-down velocity
  • Custom scaling rules and thresholds to match your specific needs
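The three features above can be sketched as a simple policy object. The field names and the evaluation logic here are hypothetical, shown only to make the concepts concrete; the actual configuration surface varies by platform:

```python
from dataclasses import dataclass

@dataclass
class AutoscalePolicy:
    # Hypothetical fields, not a real platform's config schema.
    min_replicas: int = 0            # 0 enables scale-to-zero
    max_replicas: int = 8
    target_utilization: float = 0.7  # custom threshold
    scale_up_step: int = 2           # replicas added per cycle (up velocity)
    scale_down_step: int = 1         # replicas removed per cycle (down velocity)

    def next_replicas(self, current: int, utilization: float) -> int:
        """One evaluation cycle: step up when hot, step down when idle."""
        if utilization > self.target_utilization:
            desired = current + self.scale_up_step
        elif utilization < self.target_utilization / 2:
            desired = current - self.scale_down_step
        else:
            desired = current
        return max(self.min_replicas, min(self.max_replicas, desired))
```

With `min_replicas=0`, an idle deployment steps down to zero replicas; the step sizes control how aggressively capacity is added or removed.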

Throughput capacity

Q: What’s the supported throughput?

Throughput capacity typically depends on several factors:

  • Deployment type (serverless or on-demand)
  • Traffic patterns (request rate, burstiness, and concurrency)
  • Hardware configuration
  • Model size and complexity
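A rough way to see how these factors interact is a relative-throughput estimate. Every constant below is an illustrative assumption (the GPU speed ratios are placeholders, not benchmark results):

```python
# Back-of-envelope: relative throughput as a product of factors.
# All numbers are illustrative assumptions, not measured benchmarks.
GPU_SPEEDUP = {"A100": 1.0, "H100": 2.0}  # assumed relative speed

def relative_throughput(gpu: str, num_gpus: int, model_scale: float) -> float:
    """Throughput rises with GPU count and speed, and falls roughly in
    proportion to model size (model_scale = size relative to a baseline)."""
    return GPU_SPEEDUP[gpu] * num_gpus / model_scale

# e.g. 4x H100 serving a model 2x the baseline size:
print(relative_throughput("H100", 4, 2.0))
```

Real throughput also depends on batching, quantization, and serving-stack efficiency, so treat this only as a framing of the bullet points above.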

Request handling

Q: What factors affect the number of simultaneous requests that can be handled?

The request handling capacity is influenced by multiple factors:

  • Model size and type
  • Number of GPUs allocated to the deployment
  • GPU type (e.g., A100 vs. H100)
  • Prompt size and generation token length
  • Deployment type (serverless vs. on-demand)
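For LLM-style workloads, several of these factors reduce to a GPU-memory budget: weights occupy a fixed share, and each in-flight request consumes KV-cache memory that grows with prompt and generation length. This sketch uses made-up illustrative numbers, not measurements:

```python
# Hedged sketch: GPU memory as a bound on simultaneous requests.
# All constants in the example call are illustrative assumptions.
def max_concurrent_requests(
    gpu_mem_gb: float,            # per-GPU memory (e.g. 80 for an A100/H100)
    num_gpus: int,                # GPUs allocated to the deployment
    model_mem_gb: float,          # weights resident in GPU memory
    kv_cache_gb_per_req: float,   # grows with prompt + generation length
) -> int:
    """Requests that fit after the model weights claim their share."""
    free = gpu_mem_gb * num_gpus - model_mem_gb
    return max(0, int(free / kv_cache_gb_per_req))

# e.g. 2x 80 GB GPUs, 70 GB of weights, ~1.5 GB of KV cache per request:
print(max_concurrent_requests(80, 2, 70, 1.5))
```

Longer prompts or generations raise the per-request KV-cache cost, which is why prompt size and token length appear in the list above alongside GPU count and type.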

Additional resources