How does billing and scaling work for on-demand GPU deployments? - Fireworks AI Docs

On-demand GPU deployments are billed and scaled differently from serverless deployments.

Billing:

Charges start when the server begins accepting requests
Billed by GPU-second for each active instance
Costs accumulate even if there are no active API calls
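
To make the GPU-second arithmetic concrete, here is a minimal sketch in Python. The function name `on_demand_cost` and the $3.00 per GPU-hour rate are illustrative assumptions, not Fireworks pricing or a Fireworks API; substitute the rate listed for your accelerator.

```python
SECONDS_PER_HOUR = 3600


def on_demand_cost(gpu_count: int, active_seconds: float, hourly_rate_per_gpu: float) -> float:
    """Estimate the dollar cost of instances billed by GPU-second.

    Time counts from when the server starts accepting requests,
    whether or not any API calls arrive during that time.
    """
    if gpu_count < 0 or active_seconds < 0 or hourly_rate_per_gpu < 0:
        raise ValueError("inputs must be non-negative")
    gpu_seconds = gpu_count * active_seconds
    return gpu_seconds * hourly_rate_per_gpu / SECONDS_PER_HOUR


# Hypothetical rate: $3.00 per GPU-hour. Two GPUs kept up for 90 minutes.
estimate = on_demand_cost(gpu_count=2, active_seconds=90 * 60, hourly_rate_per_gpu=3.00)
print(f"${estimate:.2f}")  # prints $9.00
```

The estimate depends only on how long the instances are up, not on request volume, which is why an idle deployment still costs money.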

Scaling options:

Supports autoscaling from 0 to multiple GPUs
Each additional GPU adds to the billing rate
No cap on the number of requests; throughput is bounded only by the GPU’s capacity
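
Because each additional GPU adds to the billing rate, the cost of an autoscaled deployment is the sum of GPU-seconds across every period of the day. The sketch below assumes a hypothetical timeline of `(duration_seconds, gpu_count)` intervals and a hypothetical rate of $3.00 per GPU-hour; neither comes from Fireworks.

```python
from typing import List, Tuple

HOUR = 3600


def timeline_cost(intervals: List[Tuple[float, int]], hourly_rate_per_gpu: float) -> float:
    """Sum GPU-seconds over (duration_seconds, gpu_count) intervals and price them."""
    gpu_seconds = sum(seconds * gpus for seconds, gpus in intervals)
    return gpu_seconds * hourly_rate_per_gpu / HOUR


# Hypothetical day: scaled to 0 overnight, 1 GPU during the day, 3 GPUs at peak.
day = [
    (8 * HOUR, 0),   # idle, scaled to zero: no GPU-seconds accrue
    (10 * HOUR, 1),  # steady traffic on one GPU
    (2 * HOUR, 3),   # peak traffic on three GPUs
    (4 * HOUR, 0),   # idle again
]

autoscaled = timeline_cost(day, hourly_rate_per_gpu=3.00)
fixed_at_peak = timeline_cost([(24 * HOUR, 3)], hourly_rate_per_gpu=3.00)
print(f"autoscaled: ${autoscaled:.2f}, fixed at 3 GPUs: ${fixed_at_peak:.2f}")
```

Under these assumed numbers the autoscaled day costs $48.00, against $216.00 for holding three GPUs around the clock.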

Management requirements:

Not fully serverless; requires some manual management
Manually delete deployments when no longer needed
Or configure autoscaling to scale down to 0 during inactive periods

Cost control tips:

Regularly monitor active deployments
Delete unused deployments to avoid unnecessary costs
Consider serverless options for intermittent usage
Use autoscaling to 0 to optimize costs during low-demand times
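
One way to act on the monitoring tip is to periodically flag deployments that hold GPUs but have seen no traffic for a while. The sketch below works on plain records that you would have to populate yourself; the `Deployment` fields, the example names, and the one-hour threshold are all hypothetical and do not correspond to any Fireworks API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deployment:
    # Hypothetical record; fill these fields from your own monitoring data.
    name: str
    gpu_count: int
    seconds_since_last_request: float


def idle_deployments(deployments: List[Deployment], idle_threshold_seconds: float) -> List[str]:
    """Return names of deployments that still hold GPUs but have been idle too long."""
    return [
        d.name
        for d in deployments
        if d.gpu_count > 0 and d.seconds_since_last_request >= idle_threshold_seconds
    ]


fleet = [
    Deployment("chat-prod", gpu_count=2, seconds_since_last_request=12),
    Deployment("old-experiment", gpu_count=1, seconds_since_last_request=5 * 3600),
    Deployment("batch-nightly", gpu_count=0, seconds_since_last_request=9 * 3600),
]

# Flag anything that has been idle for at least one hour while still holding GPUs.
candidates = idle_deployments(fleet, idle_threshold_seconds=3600)
print(candidates)  # prints ['old-experiment']
```

A deployment already scaled to 0 is skipped because it accrues no GPU-seconds; flagged deployments are candidates for deletion or for a scale-to-0 autoscaling configuration.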
