On-demand GPU deployments have distinct billing and scaling characteristics compared to serverless deployments:

Billing:
- Charges start when the server begins accepting requests
- Billed by GPU-second for each active instance
- Costs accumulate even if there are no active API calls
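The billing behavior above can be sketched as a simple accrual: cost is a function of how long the instance is up, not of request volume. The rate below is hypothetical and only illustrates the GPU-second model; real rates depend on the GPU type you select.

```python
# Hypothetical per-GPU-second rate (illustrative only, not a real price).
RATE_PER_GPU_SECOND = 0.000833  # roughly $3/hour for one GPU

def deployment_cost(active_seconds: float, rate: float = RATE_PER_GPU_SECOND) -> float:
    """Cost accrued by one GPU instance while the server accepts requests,
    regardless of whether any API calls arrive during that time."""
    return active_seconds * rate

# One GPU left running for an idle hour still bills the full hour.
idle_hour_cost = deployment_cost(3600)
```

Note that `deployment_cost(3600)` is the same whether the instance served a million requests or none, which is why scale-to-zero matters for intermittent workloads.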
Scaling:
- Supports autoscaling from 0 to multiple GPUs
- Each additional GPU adds to the billing rate
- Can serve any number of requests within the GPU's capacity at no additional cost
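Because each additional GPU adds the full per-GPU rate while it is up, total cost under autoscaling is the sum of GPU-seconds across replicas. A minimal sketch, using a made-up rate and a made-up traffic profile:

```python
def autoscaled_cost(intervals: list[tuple[float, int]], rate: float) -> float:
    """Sum cost over intervals of (seconds, replica_count).
    Each replica bills the full per-GPU rate for the seconds it is up."""
    return sum(seconds * replicas * rate for seconds, replicas in intervals)

# Illustrative day: scaled to 0 overnight, 1 GPU off-peak, 3 GPUs at peak.
day = [(8 * 3600, 0), (12 * 3600, 1), (4 * 3600, 3)]
cost = autoscaled_cost(day, rate=0.001)  # hypothetical $0.001/GPU-second
```

The overnight interval contributes nothing, which is the whole point of scaling to zero.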
Management:
- Not fully serverless; requires some manual management
- Manually delete deployments when no longer needed
- Or configure autoscaling to scale down to 0 during inactive periods
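The scale-down-to-0 option amounts to an idle-window rule: once no requests have arrived for some window, replicas drop to zero. This is a sketch of that logic, not the platform's autoscaler; the function name and parameters are illustrative.

```python
import time
from typing import Optional

def should_scale_to_zero(last_request_ts: float,
                         idle_window_s: float,
                         now: Optional[float] = None) -> bool:
    """Return True once the deployment has seen no requests for at least
    idle_window_s seconds, i.e. when an autoscaler configured to scale
    down to 0 would remove the last replica."""
    now = time.time() if now is None else now
    return (now - last_request_ts) >= idle_window_s
```

With a 5-minute window, `should_scale_to_zero(last_request_ts=t, idle_window_s=300)` flips to True 300 seconds after the last request, after which no GPU-seconds accrue.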
Best practices:
- Regularly monitor active deployments
- Delete unused deployments to avoid unnecessary costs
- Consider serverless options for intermittent usage
- Use autoscaling to 0 to optimize costs during low-demand times