> ## Documentation Index > Fetch the complete documentation index at: https://docs.fireworks.ai/llms.txt > Use this file to discover all available pages before exploring further. # Autoscaling > Configure how your deployment scales based on traffic Control how your deployment scales based on traffic and load. ## Configuration options | Flag | Type | Default | Description | | ------------------------ | --------- | ------------- | ------------------------------------------------------ | | `--min-replica-count` | Integer | 0 | Minimum number of replicas. Set to 0 for scale-to-zero | | `--max-replica-count` | Integer | 1 | Maximum number of replicas | | `--scale-up-window` | Duration | 30s | Wait time before scaling up | | `--scale-down-window` | Duration | 10m | Wait time before scaling down | | `--scale-to-zero-window` | Duration | 1h | Idle time before scaling to zero (min: 5m) | | `--load-targets` | Key-value | `default=0.8` | Scaling thresholds. See options below | **Load target options** (use as `--load-targets =[,=...]`): * `default=` - General load target from 0 to 1 * `tokens_generated_per_second=` - Desired tokens per second per replica * `prompt_tokens_per_second=` - Desired prompt tokens per second per replica * `requests_per_second=` - Desired requests per second per replica * `concurrent_requests=` - Desired concurrent requests per replica When multiple targets are specified, the maximum replica count across all is used. ## Common patterns Scale to zero when idle to minimize costs: ```bash theme={null} firectl deployment create \ --min-replica-count 0 \ --max-replica-count 3 \ --scale-to-zero-window 1h ``` Best for: Development, testing, or intermittent production workloads. Keep replicas running for instant response: ```bash theme={null} firectl deployment create \ --min-replica-count 2 \ --max-replica-count 10 \ --scale-up-window 15s \ --load-targets concurrent_requests=5 ``` Best for: Low-latency requirements, avoiding cold starts, high-traffic applications. Match known traffic patterns: ```bash theme={null} firectl deployment create \ --min-replica-count 3 \ --max-replica-count 5 \ --scale-down-window 30m \ --load-targets tokens_generated_per_second=150 ``` Best for: Steady workloads where you know typical load ranges. ## Scaling from zero behavior When a deployment is scaled to zero and receives a request, the system immediately returns a `503` error with the `DEPLOYMENT_SCALING_UP` error code while initiating the scale-up process: ```json theme={null} { "error": { "message": "Deployment is currently scaled to zero and is scaling up. Please retry your request in a few minutes.", "code": "DEPLOYMENT_SCALING_UP", "type": "error" } } ``` Requests to a scaled-to-zero deployment are **not queued**. Your application must implement retry logic to handle `503` responses while the deployment scales up. ### Handling scale-from-zero responses Implement retry logic with exponential backoff to gracefully handle scale-up delays: ```python theme={null} import time import requests def query_deployment_with_retry(url, payload, max_retries=30, initial_delay=5): """Query a deployment with retry logic for scale-from-zero scenarios.""" delay = initial_delay for attempt in range(max_retries): response = requests.post(url, json=payload, headers=headers) # Only retry if deployment is scaling up if response.status_code == 503: error_code = response.json().get("error", {}).get("code") if error_code == "DEPLOYMENT_SCALING_UP": print(f"Deployment scaling up, retrying in {delay}s...") time.sleep(delay) delay = min(delay * 1.5, 60) # Cap at 60 seconds continue response.raise_for_status() return response.json() raise Exception("Deployment did not scale up in time") ``` ```javascript theme={null} async function queryDeploymentWithRetry(url, payload, maxRetries = 30, initialDelay = 5000) { let delay = initialDelay; for (let attempt = 0; attempt < maxRetries; attempt++) { const response = await fetch(url, { method: 'POST', headers: { 'Content-Type': 'application/json', ...headers }, body: JSON.stringify(payload) }); // Only retry if deployment is scaling up if (response.status === 503) { const body = await response.json(); if (body.error?.code === 'DEPLOYMENT_SCALING_UP') { console.log(`Deployment scaling up, retrying in ${delay/1000}s...`); await new Promise(resolve => setTimeout(resolve, delay)); delay = Math.min(delay * 1.5, 60000); // Cap at 60 seconds continue; } } if (!response.ok) throw new Error(`HTTP ${response.status}`); return response.json(); } throw new Error('Deployment did not scale up in time'); } ``` ```bash theme={null} # Simple retry loop for scale-from-zero MAX_RETRIES=30 RETRY_DELAY=5 for i in $(seq 1 $MAX_RETRIES); do response=$(curl -s -w "\n%{http_code}" \ https://api.fireworks.ai/inference/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $FIREWORKS_API_KEY" \ -d '{"model": "accounts//deployments/", ...}') http_code=$(echo "$response" | tail -n1) body=$(echo "$response" | head -n -1) # Only retry if deployment is scaling up if [ "$http_code" -eq 503 ]; then error_code=$(echo "$body" | jq -r '.error.code // empty') if [ "$error_code" = "DEPLOYMENT_SCALING_UP" ]; then echo "Deployment scaling up, retrying in ${RETRY_DELAY}s..." sleep $RETRY_DELAY RETRY_DELAY=$((RETRY_DELAY * 2)) continue fi echo "$body" exit 1 fi # Check for success (2xx status codes) if [ "$http_code" -ge 200 ] && [ "$http_code" -lt 300 ]; then echo "$body" exit 0 fi echo "$body" exit 1 done echo "Deployment did not scale up in time" exit 1 ``` Cold start times vary depending on model size—larger models may take longer to download and initialize. If you need instant responses without cold starts, set `--min-replica-count 1` or higher to keep replicas always running. Deployments with min replicas = 0 are auto-deleted after 7 days of no traffic. [Reserved capacity](/deployments/reservations) guarantees availability during scale-up.