Configuration options
Flag | Type | Default | Description |
---|---|---|---|
--min-replica-count | Integer | 0 | Minimum number of replicas. Set to 0 for scale-to-zero |
--max-replica-count | Integer | 1 | Maximum number of replicas |
--scale-up-window | Duration | 30s | Wait time before scaling up |
--scale-down-window | Duration | 10m | Wait time before scaling down |
--scale-to-zero-window | Duration | 1h | Idle time before scaling to zero (min: 5m) |
--load-targets | Key-value | default=0.8 | Scaling thresholds. See options below |
--load-targets <key>=<value>[,<key>=<value>...]
):
default=<Fraction>
- General load target from 0 to 1tokens_generated_per_second=<Integer>
- Desired tokens per second per replicarequests_per_second=<Number>
- Desired requests per second per replicaconcurrent_requests=<Number>
- Desired concurrent requests per replica
Common patterns
- Cost optimization
- Performance-focused
- Predictable traffic
Scale to zero when idle to minimize costs:Best for: Development, testing, or intermittent production workloads.
Cold starts take up to a few minutes when scaling from 0→1. Deployments with min replicas = 0 are auto-deleted after 7 days of no traffic. Reserved capacity guarantees availability during scale-up.