Autoscaling and costs

Q: How does autoscaling affect my costs?

  • Scaling from 0: There is no minimum cost; you pay nothing while a deployment is scaled to zero
  • Scaling up: Each new replica adds to your total cost proportionally. For example:
    • Scaling from 1 to 2 replicas doubles your GPU costs
    • If each replica uses multiple GPUs, costs scale accordingly (e.g., scaling from 1 to 2 replicas with 2 GPUs each means paying for 4 GPUs total)
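The multiplication above can be sketched as a small cost model. The hourly rate below is a made-up placeholder, not an actual price; see the pricing page for real rates.

```python
# Hypothetical illustration of how replica count multiplies GPU cost.
GPU_HOURLY_RATE = 3.00  # placeholder $/GPU-hour, NOT a real price


def hourly_cost(replicas: int, gpus_per_replica: int,
                rate: float = GPU_HOURLY_RATE) -> float:
    """Total hourly cost scales linearly with replicas * GPUs per replica."""
    return replicas * gpus_per_replica * rate


print(hourly_cost(1, 2))  # 1 replica x 2 GPUs  -> 6.0
print(hourly_cost(2, 2))  # 2 replicas x 2 GPUs -> 12.0 (paying for 4 GPUs total)
```

Scaling from 1 to 2 replicas with 2 GPUs each doubles the total, exactly as described above.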

For current pricing details, please visit our pricing page.


Rate limits for on-demand deployments

Q: What are the rate limits for on-demand deployments?

Request throughput scales with your GPU allocation. Base allocations include:

  • Up to 8 A100 GPUs
  • Up to 8 H100 GPUs

On-demand deployments offer several advantages:

  • Predictable pricing based on GPU time, not token input/output
  • Protected latency and performance, independent of traffic on the serverless platform
  • Choice of GPUs, including A100s and H100s

Need more GPUs? Contact us to discuss higher allocations for your specific use case.


On-demand billing

Q: How does billing work for on-demand deployments?

On-demand deployments come with automatic cost optimization features:

  • Default autoscaling: Automatically scales to 0 replicas when not in use
  • Pay for what you use: Charged only for GPU time when replicas are active
  • Flexible configuration: Customize autoscaling behavior to match your needs

Best practices for cost management:

  1. Leverage default autoscaling: The system automatically scales down deployments when not in use
  2. Customize carefully: You can modify autoscaling behavior through the configuration options, but disabling scale-to-zero means continuous GPU charges for as long as the deployment exists
  3. Consider your use case: For intermittent or low-frequency usage, serverless deployments might be more cost-effective
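The trade-off in point 3 can be made concrete with a hedged break-even sketch. All rates below are illustrative placeholders, not actual prices: on-demand bills for GPU time, while serverless bills per token.

```python
# Hypothetical comparison of on-demand (GPU-time) vs. serverless (per-token)
# billing for intermittent workloads. Rates are placeholders, NOT real prices.
GPU_HOURLY_RATE = 3.00      # placeholder $/GPU-hour (on-demand)
SERVERLESS_PER_MTOK = 0.50  # placeholder $ per million tokens (serverless)


def on_demand_cost(hours: float, gpus: int) -> float:
    """On-demand: you pay for GPU time, regardless of traffic."""
    return hours * gpus * GPU_HOURLY_RATE


def serverless_cost(tokens: int) -> float:
    """Serverless: you pay only for tokens processed."""
    return tokens / 1_000_000 * SERVERLESS_PER_MTOK


# Low-frequency usage: 5M tokens/day on serverless vs. one GPU running all day.
print(serverless_cost(5_000_000))  # 2.5
print(on_demand_cost(24, 1))       # 72.0
```

Under these assumed rates, light intermittent traffic is far cheaper on serverless, while sustained high-throughput traffic shifts the balance toward on-demand.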

For detailed configuration options, see our deployment guide.


Scaling structure

Q: How does billing and scaling work for on-demand GPU deployments?

On-demand GPU deployments have unique billing and scaling characteristics compared to serverless deployments:

Billing:

  • Charges start when the server begins accepting requests
  • Billed by GPU-second for each active instance
  • Costs accumulate while replicas are running, even when no API calls are being served
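The GPU-second billing above can be sketched as follows; the per-second rate is a placeholder, not an actual price.

```python
# Hypothetical sketch of GPU-second billing: cost depends on how long
# replicas are active, not on how many requests they serve.
RATE_PER_GPU_SECOND = 0.0008  # placeholder $/GPU-second, NOT a real price


def billed_cost(active_seconds: float, gpus: int,
                rate: float = RATE_PER_GPU_SECOND) -> float:
    """Charges accrue per GPU-second while the instance is active."""
    return active_seconds * gpus * rate


# A 2-GPU instance left running for one hour accrues charges
# even if it served zero API calls during that hour:
print(billed_cost(3600, 2))  # 5.76
```

This is why deleting unused deployments, or letting them scale to zero, matters: an idle-but-active instance bills at the same rate as a busy one.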

Scaling options:

  • Supports autoscaling from 0 to multiple GPUs
  • Each additional GPU adds to the billing rate
  • No per-request limit; throughput is bounded only by the allocated GPUs’ capacity

Management requirements:

  • Not fully serverless; requires some manual management
  • Manually delete deployments when no longer needed
  • Or configure autoscaling to scale down to 0 during inactive periods

Cost control tips:

  • Regularly monitor active deployments
  • Delete unused deployments to avoid unnecessary costs
  • Consider serverless options for intermittent usage
  • Use autoscaling to 0 to optimize costs during low-demand times

Additional resources