Autoscaling and costs

Q: How does autoscaling affect my costs?
  • Scaling from 0: No minimum cost when scaled to zero
  • Scaling up: Each new replica adds to your total cost proportionally. For example:
    • Scaling from 1 to 2 replicas doubles your GPU costs
    • If each replica uses multiple GPUs, costs scale accordingly (e.g., scaling from 1 to 2 replicas with 2 GPUs each means paying for 4 GPUs total)
For current pricing details, please visit our pricing page.
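The replica arithmetic above can be sketched as a quick estimate. This is a minimal sketch: the per-GPU hourly rate below is a hypothetical placeholder, not an actual price (see the pricing page for real rates).

```python
# Estimate hourly cost for a deployment: replicas * GPUs per replica * rate.
# RATE is a hypothetical placeholder, not an actual published price.

def hourly_cost(replicas: int, gpus_per_replica: int, rate_per_gpu_hour: float) -> float:
    """Total hourly cost; zero replicas means zero cost."""
    return replicas * gpus_per_replica * rate_per_gpu_hour

RATE = 2.90  # hypothetical $/GPU-hour

print(hourly_cost(0, 2, RATE))  # scaled to zero -> 0.0
print(hourly_cost(1, 2, RATE))  # 1 replica x 2 GPUs
print(hourly_cost(2, 2, RATE))  # 2 replicas x 2 GPUs: paying for 4 GPUs, double the cost
```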

Rate-limits for on-demand deployment

Q: What are the rate limits for on-demand deployments?
Request throughput scales with your GPU allocation. Base allocations include:
  • Up to 8 A100 GPUs
  • Up to 8 H100 GPUs
On-demand deployments offer several advantages:
  • Predictable pricing based on GPU time, not token I/O
  • Consistent latency and performance, unaffected by traffic on the serverless platform
  • Choice of GPUs, including A100s and H100s
Need more GPUs? Contact us to discuss higher allocations for your specific use case.

On-demand billing

Q: How does billing work for on-demand deployments?
On-demand deployments come with automatic cost optimization features:
  • Default autoscaling: Automatically scales to 0 replicas when not in use
  • Pay for what you use: Charged only for GPU time when replicas are active
  • Flexible configuration: Customize autoscaling behavior to match your needs
Best practices for cost management:
  1. Leverage default autoscaling: The system automatically scales down deployments when not in use
  2. Customize carefully: While you can modify autoscaling behavior using our configuration options, note that preventing scale-to-zero will result in continuous GPU charges
  3. Consider your use case: For intermittent or low-frequency usage, serverless deployments might be more cost-effective
For detailed configuration options, see our deployment guide.
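To illustrate the "consider your use case" point above, here is a rough break-even sketch comparing serverless (per-token) and on-demand (per GPU-hour) billing for intermittent traffic. Both rates are hypothetical placeholders, not actual prices.

```python
# Rough break-even sketch: serverless bills per token, on-demand bills per
# GPU-hour while replicas are active. Both rates are hypothetical placeholders.

SERVERLESS_PER_M_TOKENS = 0.50  # hypothetical $/1M tokens
ON_DEMAND_PER_GPU_HOUR = 2.90   # hypothetical $/GPU-hour

def serverless_cost(tokens: int) -> float:
    """Cost of processing `tokens` tokens on the serverless platform."""
    return tokens / 1_000_000 * SERVERLESS_PER_M_TOKENS

def on_demand_cost(active_hours: float, gpus: int = 1) -> float:
    """Cost of keeping `gpus` GPUs active for `active_hours` on on-demand."""
    return active_hours * gpus * ON_DEMAND_PER_GPU_HOUR

# Intermittent usage: 10M tokens spread over 2 active GPU-hours per day.
print(serverless_cost(10_000_000))  # per-token cost for the day
print(on_demand_cost(2.0))          # GPU-time cost for the same day
```

At these placeholder rates, low or bursty daily volume favors serverless, while sustained traffic that keeps replicas busy favors on-demand's flat GPU-time pricing.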

Scaling structure

Q: How does billing and scaling work for on-demand GPU deployments?
On-demand GPU deployments have different billing and scaling characteristics from serverless deployments.
Billing:
  • Charges start when the server begins accepting requests
  • Billed by GPU-second for each active instance
  • Costs accumulate even if there are no active API calls
Scaling options:
  • Supports autoscaling from 0 to multiple GPUs
  • Each additional GPU adds to the billing rate
  • No per-request rate limit; throughput is bounded only by the GPUs' capacity
Management requirements:
  • Not fully serverless; requires some manual management
  • Manually delete deployments when no longer needed
  • Or configure autoscaling to scale down to 0 during inactive periods
Cost control tips:
  • Regularly monitor active deployments
  • Delete unused deployments to avoid unnecessary costs
  • Consider serverless options for intermittent usage
  • Use autoscaling to 0 to optimize costs during low-demand times
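The GPU-second billing model above can be sketched as follows. The rate is a hypothetical placeholder; what matters is the shape of the calculation: cost accrues for every second each replica is active, whether or not requests arrive.

```python
# Sketch of GPU-second billing: sum active seconds across replicas,
# multiply by a per-GPU-second rate. The rate is a hypothetical placeholder.

RATE_PER_GPU_SECOND = 2.90 / 3600  # hypothetical, derived from a $/GPU-hour figure

def billed_cost(active_intervals_s: list[tuple[int, int]], gpus_per_replica: int) -> float:
    """active_intervals_s: (start, end) seconds during which each replica
    was accepting requests. Idle-but-active time is still billed."""
    gpu_seconds = sum((end - start) * gpus_per_replica
                      for start, end in active_intervals_s)
    return gpu_seconds * RATE_PER_GPU_SECOND

# One replica active for an hour, a second replica scaled up for the last 30 minutes:
cost = billed_cost([(0, 3600), (1800, 3600)], gpus_per_replica=1)
print(round(cost, 2))
```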

Additional resources