Control how your deployment scales based on traffic and load.

Configuration options

Flag                     Type       Default      Description
--min-replica-count      Integer    0            Minimum number of replicas. Set to 0 to enable scale-to-zero.
--max-replica-count      Integer    1            Maximum number of replicas.
--scale-up-window        Duration   30s          Wait time before scaling up.
--scale-down-window      Duration   10m          Wait time before scaling down.
--scale-to-zero-window   Duration   1h           Idle time before scaling to zero (minimum: 5m).
--load-targets           Key-value  default=0.8  Scaling thresholds. See options below.
Load target options (use as --load-targets <key>=<value>[,<key>=<value>...]):
  • default=<Fraction> - General load target from 0 to 1
  • tokens_generated_per_second=<Integer> - Desired tokens per second per replica
  • requests_per_second=<Number> - Desired requests per second per replica
  • concurrent_requests=<Number> - Desired concurrent requests per replica
When multiple targets are specified, the system uses the highest replica count computed across all targets.
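For example, combining two load targets uses the flag syntax above (a sketch; the threshold values are illustrative, and `<MODEL_NAME>` is a placeholder):
```shell
# Scale on whichever target demands more replicas:
# request rate or concurrent requests per replica.
firectl deployment create <MODEL_NAME> \
  --min-replica-count 1 \
  --max-replica-count 5 \
  --load-targets requests_per_second=10,concurrent_requests=25
```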

Common patterns

Scale to zero when idle to minimize costs:
firectl deployment create <MODEL_NAME> \
  --min-replica-count 0 \
  --max-replica-count 3 \
  --scale-to-zero-window 1h
Best for: Development, testing, or intermittent production workloads.

Scaling from zero behavior

When a deployment is scaled to zero and receives a request, the system immediately returns a 503 error with the DEPLOYMENT_SCALING_UP error code while initiating the scale-up process:
{
  "error": {
    "message": "Deployment is currently scaled to zero and is scaling up. Please retry your request in a few minutes.",
    "code": "DEPLOYMENT_SCALING_UP",
    "type": "error"
  }
}
Requests to a scaled-to-zero deployment are not queued. Your application must implement retry logic to handle 503 responses while the deployment scales up.

Handling scale-from-zero responses

Implement retry logic with exponential backoff to gracefully handle scale-up delays:
import time

import requests

def query_deployment_with_retry(url, payload, headers, max_retries=30, initial_delay=5):
    """Query a deployment, retrying while it scales up from zero."""
    delay = initial_delay

    for _ in range(max_retries):
        response = requests.post(url, json=payload, headers=headers)

        # Only retry if the deployment is scaling up from zero
        if response.status_code == 503:
            error_code = response.json().get("error", {}).get("code")
            if error_code == "DEPLOYMENT_SCALING_UP":
                print(f"Deployment scaling up, retrying in {delay}s...")
                time.sleep(delay)
                delay = min(delay * 1.5, 60)  # Exponential backoff, capped at 60 seconds
                continue

        response.raise_for_status()
        return response.json()

    raise TimeoutError("Deployment did not scale up within the retry window")
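With the defaults above (5 s initial delay, 1.5x multiplier, 60 s cap), the retry delays grow like this (a standalone sketch of the schedule, not part of any API):
```python
def backoff_delays(initial=5, factor=1.5, cap=60, n=8):
    """Return the first n retry delays for capped exponential backoff."""
    delays, d = [], initial
    for _ in range(n):
        delays.append(round(d, 2))
        d = min(d * factor, cap)  # grow geometrically, never exceed the cap
    return delays

print(backoff_delays())
```
With 30 retries and these defaults, the total wait before giving up is roughly 30 minutes, which comfortably covers typical cold-start times.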
Cold start times vary with model size; larger models take longer to download and initialize. If you need instant responses without cold starts, set --min-replica-count 1 or higher to keep replicas always running.
Deployments with --min-replica-count 0 are automatically deleted after 7 days without traffic. Reserved capacity guarantees availability during scale-up.