Q: How does billing work for on-demand deployments?
On-demand deployments come with automatic cost-optimization features:
Default autoscaling: Automatically scales to 0 replicas when not in use
Pay for what you use: Charged only for GPU time when replicas are active
Flexible configuration: Customize autoscaling behavior to match your needs
Best practices for cost management:
Leverage default autoscaling: The system automatically scales down deployments when not in use
Customize carefully: You can modify autoscaling behavior using the configuration options, but disabling scale-to-zero results in continuous GPU charges
Consider your use case: For intermittent or low-frequency usage, serverless deployments might be more cost-effective
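To make the "consider your use case" point concrete, here is a rough cost comparison sketch for intermittent traffic. All rates and the token volume are hypothetical placeholders, not actual pricing; check your provider's price sheet before relying on numbers like these.

```python
# Hypothetical rates -- NOT actual pricing.
ON_DEMAND_GPU_PER_SECOND = 0.0008   # $/GPU-second while a replica is active
SERVERLESS_PER_1K_TOKENS = 0.0002   # $ per 1K tokens processed

def on_demand_cost(active_seconds: float, gpus: int = 1) -> float:
    """Cost of keeping `gpus` replicas active for `active_seconds`."""
    return active_seconds * gpus * ON_DEMAND_GPU_PER_SECOND

def serverless_cost(tokens: int) -> float:
    """Pay-per-token serverless cost (no idle charges)."""
    return (tokens / 1000) * SERVERLESS_PER_1K_TOKENS

# Intermittent usage: ~30 minutes of active serving and ~200K tokens per day.
daily_on_demand = on_demand_cost(30 * 60)   # billed only while the replica is up
daily_serverless = serverless_cost(200_000)
print(f"on-demand: ${daily_on_demand:.2f}/day, serverless: ${daily_serverless:.2f}/day")
```

Even with scale-to-zero, low-frequency workloads like this can come out cheaper on serverless, since serverless bills per token rather than per second of GPU time.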
Q: How do billing and scaling work for on-demand GPU deployments?
On-demand GPU deployments have different billing and scaling characteristics from serverless deployments:
Billing:
Charges start when the server begins accepting requests
Billed by GPU-second for each active instance
Costs accrue for active replicas even when no API calls are being served
Scaling options:
Supports autoscaling from 0 to multiple GPUs
Each additional GPU adds to the billing rate
Can serve any number of requests at no additional per-request cost, limited only by the GPUs' capacity
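The per-GPU-second billing described above can be sketched as simple arithmetic: total cost is GPU-seconds (duration times active replica count, summed over each scaling interval) multiplied by the rate. The rate and the day's scaling timeline below are hypothetical.

```python
# Hypothetical $/GPU-second rate; actual rates depend on GPU type and provider.
RATE = 0.0008

def billed_cost(intervals):
    """Sum GPU-seconds over scaling intervals.

    `intervals` is a list of (duration_seconds, active_replicas) tuples.
    Intervals at 0 replicas contribute nothing -- that's the scale-to-zero saving.
    """
    gpu_seconds = sum(duration * replicas for duration, replicas in intervals)
    return gpu_seconds * RATE

# A day where autoscaling moved through 0 -> 1 -> 3 -> 0 replicas:
day = [
    (8 * 3600, 0),   # overnight: scaled to zero, no charge
    (6 * 3600, 1),   # light traffic: one replica
    (4 * 3600, 3),   # peak: three replicas, so the billing rate triples
    (6 * 3600, 0),   # idle again
]
print(f"${billed_cost(day):.2f}")
```

Note that the peak interval dominates the bill: each additional replica multiplies the rate for as long as it stays active, regardless of how many requests it actually serves.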
Management requirements:
Not fully serverless; requires some manual management
Manually delete deployments when no longer needed
Or configure autoscaling to scale down to 0 during inactive periods
Cost control tips:
Regularly monitor active deployments
Delete unused deployments to avoid unnecessary costs
Consider serverless options for intermittent usage
Use autoscaling to 0 to optimize costs during low-demand times
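The monitoring and cleanup tips above can be sketched as a small audit script. The deployment records here are illustrative stand-ins; a real version would fetch them from the provider's deployment-listing API rather than hard-coding them.

```python
from datetime import datetime, timedelta, timezone

# Illustrative records only -- a real script would fetch these from the
# provider's API instead of defining them inline.
deployments = [
    {"name": "chat-prod",  "replicas": 2,
     "last_request": datetime.now(timezone.utc) - timedelta(minutes=5)},
    {"name": "batch-test", "replicas": 1,
     "last_request": datetime.now(timezone.utc) - timedelta(days=3)},
]

def idle_deployments(deployments, max_idle=timedelta(hours=24)):
    """Flag deployments that still hold GPUs but haven't served requests recently."""
    now = datetime.now(timezone.utc)
    return [
        d["name"] for d in deployments
        if d["replicas"] > 0 and now - d["last_request"] > max_idle
    ]

for name in idle_deployments(deployments):
    print(f"{name}: active replicas but idle >24h -- delete it or scale to 0")
```

Running a check like this on a schedule catches deployments that were never configured to scale to zero and are quietly accruing GPU-second charges.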