- Charges start when the server begins accepting requests
- Billed by GPU-second for each active instance
- Costs accumulate for as long as an instance is running, even when it is serving no API calls
- Supports autoscaling from 0 to multiple GPUs
- Each additional GPU adds to the billing rate
- Can serve as many requests as the GPU’s capacity allows
- Not fully serverless; requires some manual management
- Manually delete deployments when no longer needed
- Alternatively, configure autoscaling to scale down to 0 during inactive periods
- Regularly monitor active deployments
- Delete unused deployments to avoid unnecessary costs
- Consider serverless options for intermittent usage
- Use scale-to-zero autoscaling to reduce costs during low-demand periods
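
To make the GPU-second billing model above concrete, here is a minimal sketch that compares one GPU left running all day with the same GPU scaled to 0 outside its busy hours. The rate of $3.60 per GPU-hour, the 6 busy hours, and the function name `deployment_cost` are illustrative assumptions, not values or APIs from any provider.

```python
def deployment_cost(gpu_count: int, active_seconds: float,
                    rate_per_gpu_second: float) -> float:
    """Cost of a deployment billed per GPU-second while its instances are up.

    Traffic does not appear in the formula: only GPU count and uptime matter.
    """
    return gpu_count * active_seconds * rate_per_gpu_second


# Hypothetical rate (assumption): $3.60 per GPU-hour.
RATE_PER_GPU_SECOND = 3.60 / 3600
SECONDS_PER_HOUR = 3600

# One GPU left running for a full day, whether or not requests arrive.
always_on = deployment_cost(1, 24 * SECONDS_PER_HOUR, RATE_PER_GPU_SECOND)

# The same GPU, scaled to 0 except during 6 busy hours (assumed workload).
scaled_to_zero = deployment_cost(1, 6 * SECONDS_PER_HOUR, RATE_PER_GPU_SECOND)

# Scaling out to 2 GPUs during those 6 hours doubles the billing rate.
two_gpus = deployment_cost(2, 6 * SECONDS_PER_HOUR, RATE_PER_GPU_SECOND)

print(f"always on:      ${always_on:.2f}/day")
print(f"scaled to zero: ${scaled_to_zero:.2f}/day")
print(f"two GPUs, 6 h:  ${two_gpus:.2f}/day")
```

Under these assumed numbers, the idle hours account for most of the always-on bill, which is why deleting unused deployments or scaling to 0 matters for intermittent workloads.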