Control how your deployment scales based on traffic and load.

Configuration options

Flag                    Type       Default      Description
--min-replica-count     Integer    0            Minimum number of replicas. Set to 0 for scale-to-zero
--max-replica-count     Integer    1            Maximum number of replicas
--scale-up-window       Duration   30s          Wait time before scaling up
--scale-down-window     Duration   10m          Wait time before scaling down
--scale-to-zero-window  Duration   1h           Idle time before scaling to zero (min: 5m)
--load-targets          Key-value  default=0.8  Scaling thresholds. See options below
Load target options (use as --load-targets <key>=<value>[,<key>=<value>...]):
  • default=<Fraction> - General load target from 0 to 1
  • tokens_generated_per_second=<Integer> - Desired tokens per second per replica
  • requests_per_second=<Number> - Desired requests per second per replica
  • concurrent_requests=<Number> - Desired concurrent requests per replica
When multiple targets are specified, a replica count is computed for each target and the largest one is used.
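The "largest replica count wins" rule can be illustrated with a small Bash sketch. This is not the autoscaler's actual implementation: the helper names replicas_for_target and desired_replicas are hypothetical, and the per-target formula (observed load divided by the per-replica target, rounded up) is an assumption made only for illustration. The text above confirms only that the maximum across targets is used.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption): replicas needed for a single target,
# computed as ceil(observed_load / per_replica_target). Integer inputs only.
replicas_for_target() {
  local observed=$1 target=$2
  echo $(( (observed + target - 1) / target ))
}

# Combine several targets by taking the largest per-target replica count,
# then clamp the result to the configured min/max replica counts.
# Usage: desired_replicas <min> <max> <observed:target>...
desired_replicas() {
  local min=$1 max=$2
  shift 2
  local best=0 pair n
  for pair in "$@"; do
    n=$(replicas_for_target "${pair%%:*}" "${pair##*:}")
    if (( n > best )); then best=$n; fi
  done
  if (( best < min )); then best=$min; fi
  if (( best > max )); then best=$max; fi
  echo "$best"
}

# 42 requests/s against a target of 5 per replica needs 9 replicas;
# 30 concurrent requests against a target of 10 per replica needs 3.
# The larger of the two is used.
desired_replicas 1 12 42:5 30:10   # prints 9
```

Under these assumptions, the deployment in the example would correspond to --load-targets requests_per_second=5,concurrent_requests=10 with --min-replica-count 1 and --max-replica-count 12; the numeric values are illustrative only.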

Common patterns

  • Cost optimization
  • Performance-focused
  • Predictable traffic
Cost optimization: scale to zero when idle to minimize costs:
firectl create deployment <MODEL_NAME> \
  --min-replica-count 0 \
  --max-replica-count 3 \
  --scale-to-zero-window 1h
Best for: Development, testing, or intermittent production workloads.
Cold starts can take up to a few minutes when scaling from 0→1 replicas. Deployments with --min-replica-count 0 are automatically deleted after 7 days without traffic. Reserved capacity guarantees availability during scale-up.
I