Control how your deployment scales based on traffic and load.
## Configuration options
| Flag | Type | Default | Description |
|---|---|---|---|
| `--min-replica-count` | Integer | 0 | Minimum number of replicas. Set to 0 for scale-to-zero. |
| `--max-replica-count` | Integer | 1 | Maximum number of replicas. |
| `--scale-up-window` | Duration | 30s | Wait time before scaling up. |
| `--scale-down-window` | Duration | 10m | Wait time before scaling down. |
| `--scale-to-zero-window` | Duration | 1h | Idle time before scaling to zero (minimum: 5m). |
| `--load-targets` | Key-value | `default=0.8` | Scaling thresholds. See options below. |
Load target options (use as `--load-targets <key>=<value>[,<key>=<value>...]`):

- `default=<Fraction>` - General load target from 0 to 1
- `tokens_generated_per_second=<Integer>` - Desired tokens per second per replica
- `requests_per_second=<Number>` - Desired requests per second per replica
- `concurrent_requests=<Number>` - Desired concurrent requests per replica

When multiple targets are specified, the autoscaler computes a replica count for each target and uses the maximum across all of them, as in the sketch below.
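For example, you can scale on both token throughput and concurrency by passing multiple targets in one flag. This is a minimal sketch using the flags documented above; the target values are illustrative, not recommendations:

```bash
# Scale on whichever target demands more replicas: token throughput
# or concurrent requests. Values are illustrative.
firectl deployment create <MODEL_NAME> \
  --min-replica-count 1 \
  --max-replica-count 8 \
  --load-targets tokens_generated_per_second=200,concurrent_requests=10
```

Under these targets, 30 concurrent requests alone would call for 3 replicas (30 / 10), even if token throughput is well below 200 per replica.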
## Common patterns

### Cost optimization

Scale to zero when idle to minimize costs:

```bash
firectl deployment create <MODEL_NAME> \
  --min-replica-count 0 \
  --max-replica-count 3 \
  --scale-to-zero-window 1h
```

Best for: Development, testing, or intermittent production workloads.

### Performance-focused

Keep replicas running for instant response:

```bash
firectl deployment create <MODEL_NAME> \
  --min-replica-count 2 \
  --max-replica-count 10 \
  --scale-up-window 15s \
  --load-targets concurrent_requests=5
```

Best for: Low-latency requirements, avoiding cold starts, high-traffic applications.

### Predictable traffic

Match known traffic patterns:

```bash
firectl deployment create <MODEL_NAME> \
  --min-replica-count 3 \
  --max-replica-count 5 \
  --scale-down-window 30m \
  --load-targets tokens_generated_per_second=150
```

Best for: Steady workloads where you know typical load ranges.
## Scaling from zero behavior
When a deployment is scaled to zero and receives a request, the system immediately returns a 503 error with the DEPLOYMENT_SCALING_UP error code while initiating the scale-up process:
```json
{
  "error": {
    "message": "Deployment is currently scaled to zero and is scaling up. Please retry your request in a few minutes.",
    "code": "DEPLOYMENT_SCALING_UP",
    "type": "error"
  }
}
```
Requests to a scaled-to-zero deployment are not queued. Your application must implement retry logic to handle 503 responses while the deployment scales up.
## Handling scale-from-zero responses
Implement retry logic with exponential backoff to gracefully handle scale-up delays:
```python
import os
import time

import requests

# Shared auth headers; assumes FIREWORKS_API_KEY is set in the environment.
headers = {
    "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
    "Content-Type": "application/json",
}

def query_deployment_with_retry(url, payload, max_retries=30, initial_delay=5):
    """Query a deployment with retry logic for scale-from-zero scenarios."""
    delay = initial_delay
    for attempt in range(max_retries):
        response = requests.post(url, json=payload, headers=headers)
        # Only retry if the deployment is scaling up
        if response.status_code == 503:
            error_code = response.json().get("error", {}).get("code")
            if error_code == "DEPLOYMENT_SCALING_UP":
                print(f"Deployment scaling up, retrying in {delay}s...")
                time.sleep(delay)
                delay = min(delay * 1.5, 60)  # Cap the backoff at 60 seconds
                continue
        response.raise_for_status()
        return response.json()
    raise Exception("Deployment did not scale up in time")
```
```javascript
// Shared auth headers; assumes FIREWORKS_API_KEY is set in the environment.
const headers = { 'Authorization': `Bearer ${process.env.FIREWORKS_API_KEY}` };

async function queryDeploymentWithRetry(url, payload, maxRetries = 30, initialDelay = 5000) {
  let delay = initialDelay;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', ...headers },
      body: JSON.stringify(payload)
    });
    // Only retry if the deployment is scaling up
    if (response.status === 503) {
      const body = await response.json();
      if (body.error?.code === 'DEPLOYMENT_SCALING_UP') {
        console.log(`Deployment scaling up, retrying in ${delay / 1000}s...`);
        await new Promise(resolve => setTimeout(resolve, delay));
        delay = Math.min(delay * 1.5, 60000); // Cap the backoff at 60 seconds
        continue;
      }
    }
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return response.json();
  }
  throw new Error('Deployment did not scale up in time');
}
```
```bash
# Simple retry loop for scale-from-zero
MAX_RETRIES=30
RETRY_DELAY=5

for i in $(seq 1 $MAX_RETRIES); do
  response=$(curl -s -w "\n%{http_code}" \
    https://api.fireworks.ai/inference/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $FIREWORKS_API_KEY" \
    -d '{"model": "accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>", ...}')
  http_code=$(echo "$response" | tail -n1)
  body=$(echo "$response" | sed '$d')  # everything except the status line

  # Only retry if the deployment is scaling up
  if [ "$http_code" -eq 503 ]; then
    error_code=$(echo "$body" | jq -r '.error.code // empty')
    if [ "$error_code" = "DEPLOYMENT_SCALING_UP" ]; then
      echo "Deployment scaling up, retrying in ${RETRY_DELAY}s..."
      sleep "$RETRY_DELAY"
      RETRY_DELAY=$((RETRY_DELAY * 2))
      # Cap the backoff at 60 seconds, matching the examples above
      [ "$RETRY_DELAY" -gt 60 ] && RETRY_DELAY=60
      continue
    fi
    echo "$body"
    exit 1
  fi

  # Check for success (2xx status codes)
  if [ "$http_code" -ge 200 ] && [ "$http_code" -lt 300 ]; then
    echo "$body"
    exit 0
  fi

  echo "$body"
  exit 1
done

echo "Deployment did not scale up in time"
exit 1
```
Cold start times vary with model size; larger models take longer to download and initialize. If you need instant responses without cold starts, set `--min-replica-count` to 1 or higher so replicas stay running, as in the sketch below.
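A minimal always-warm configuration might look like the following; the replica counts are illustrative:

```bash
# Keep at least one replica running so requests never hit a cold start.
firectl deployment create <MODEL_NAME> \
  --min-replica-count 1 \
  --max-replica-count 4
```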
Deployments with `--min-replica-count 0` are auto-deleted after 7 days of no traffic. Reserved capacity guarantees availability during scale-up.