Control how your deployment scales based on traffic and load.
## Configuration options
| Flag | Type | Default | Description |
|---|---|---|---|
| `--min-replica-count` | Integer | 0 | Minimum number of replicas. Set to 0 for scale-to-zero. |
| `--max-replica-count` | Integer | 1 | Maximum number of replicas. |
| `--scale-up-window` | Duration | 30s | Wait time before scaling up. |
| `--scale-down-window` | Duration | 10m | Wait time before scaling down. |
| `--scale-to-zero-window` | Duration | 1h | Idle time before scaling to zero (minimum: 5m). |
| `--load-targets` | Key-value | `default=0.8` | Scaling thresholds. See options below. |
Load target options (use as `--load-targets <key>=<value>[,<key>=<value>...]`):

- `default=<Fraction>` - General load target from 0 to 1
- `tokens_generated_per_second=<Integer>` - Desired tokens per second per replica
- `requests_per_second=<Number>` - Desired requests per second per replica
- `concurrent_requests=<Number>` - Desired concurrent requests per replica

When multiple targets are specified, the autoscaler computes a replica count for each target and uses the maximum across all of them, as in the sketch below.
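For example, you can scale on both token throughput and concurrency by passing multiple targets in one flag. This is a minimal sketch using the flags documented above; the target values are illustrative, not recommendations:

```bash
# Scale on whichever target demands more replicas: token throughput
# or concurrent requests. Values are illustrative.
firectl deployment create <MODEL_NAME> \
  --min-replica-count 1 \
  --max-replica-count 8 \
  --load-targets tokens_generated_per_second=200,concurrent_requests=10
```

Under these targets, 30 concurrent requests alone would call for 3 replicas (30 / 10), even if token throughput is well below 200 per replica.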
## Common patterns

### Cost optimization

Scale to zero when idle to minimize costs:

```bash
firectl deployment create <MODEL_NAME> \
  --min-replica-count 0 \
  --max-replica-count 3 \
  --scale-to-zero-window 1h
```

Best for: Development, testing, or intermittent production workloads.

### Performance-focused

Keep replicas running for instant response:

```bash
firectl deployment create <MODEL_NAME> \
  --min-replica-count 2 \
  --max-replica-count 10 \
  --scale-up-window 15s \
  --load-targets concurrent_requests=5
```

Best for: Low-latency requirements, avoiding cold starts, high-traffic applications.

### Predictable traffic

Match known traffic patterns:

```bash
firectl deployment create <MODEL_NAME> \
  --min-replica-count 3 \
  --max-replica-count 5 \
  --scale-down-window 30m \
  --load-targets tokens_generated_per_second=150
```

Best for: Steady workloads where you know typical load ranges.
## Scaling from zero behavior
When a deployment is scaled to zero and receives a request, the system immediately returns a 503 error with the DEPLOYMENT_SCALING_UP error code while initiating the scale-up process:
```json
{
  "error": {
    "message": "Deployment is currently scaled to zero and is scaling up. Please retry your request in a few minutes.",
    "code": "DEPLOYMENT_SCALING_UP",
    "type": "error"
  }
}
```
Requests to a scaled-to-zero deployment are not queued. Your application must implement retry logic to handle 503 responses while the deployment scales up.
## Handling scale-from-zero responses
Implement retry logic with exponential backoff to gracefully handle scale-up delays:
```python
import os
import time

import requests

# Shared auth headers; assumes FIREWORKS_API_KEY is set in the environment.
headers = {
    "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
    "Content-Type": "application/json",
}

def query_deployment_with_retry(url, payload, max_retries=30, initial_delay=5):
    """Query a deployment with retry logic for scale-from-zero scenarios."""
    delay = initial_delay
    for attempt in range(max_retries):
        response = requests.post(url, json=payload, headers=headers)
        # Only retry if the deployment is scaling up
        if response.status_code == 503:
            error_code = response.json().get("error", {}).get("code")
            if error_code == "DEPLOYMENT_SCALING_UP":
                print(f"Deployment scaling up, retrying in {delay}s...")
                time.sleep(delay)
                delay = min(delay * 1.5, 60)  # Cap the backoff at 60 seconds
                continue
        response.raise_for_status()
        return response.json()
    raise Exception("Deployment did not scale up in time")
```
```javascript
// Shared auth headers; assumes FIREWORKS_API_KEY is set in the environment.
const headers = { 'Authorization': `Bearer ${process.env.FIREWORKS_API_KEY}` };

async function queryDeploymentWithRetry(url, payload, maxRetries = 30, initialDelay = 5000) {
  let delay = initialDelay;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', ...headers },
      body: JSON.stringify(payload)
    });
    // Only retry if the deployment is scaling up
    if (response.status === 503) {
      const body = await response.json();
      if (body.error?.code === 'DEPLOYMENT_SCALING_UP') {
        console.log(`Deployment scaling up, retrying in ${delay / 1000}s...`);
        await new Promise(resolve => setTimeout(resolve, delay));
        delay = Math.min(delay * 1.5, 60000); // Cap the backoff at 60 seconds
        continue;
      }
    }
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return response.json();
  }
  throw new Error('Deployment did not scale up in time');
}
```
```bash
# Simple retry loop for scale-from-zero
MAX_RETRIES=30
RETRY_DELAY=5

for i in $(seq 1 $MAX_RETRIES); do
  response=$(curl -s -w "\n%{http_code}" \
    https://api.fireworks.ai/inference/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $FIREWORKS_API_KEY" \
    -d '{"model": "accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>", ...}')
  http_code=$(echo "$response" | tail -n1)
  body=$(echo "$response" | sed '$d')  # everything except the status line

  # Only retry if the deployment is scaling up
  if [ "$http_code" -eq 503 ]; then
    error_code=$(echo "$body" | jq -r '.error.code // empty')
    if [ "$error_code" = "DEPLOYMENT_SCALING_UP" ]; then
      echo "Deployment scaling up, retrying in ${RETRY_DELAY}s..."
      sleep "$RETRY_DELAY"
      RETRY_DELAY=$((RETRY_DELAY * 2))
      # Cap the backoff at 60 seconds, matching the examples above
      [ "$RETRY_DELAY" -gt 60 ] && RETRY_DELAY=60
      continue
    fi
    echo "$body"
    exit 1
  fi

  # Check for success (2xx status codes)
  if [ "$http_code" -ge 200 ] && [ "$http_code" -lt 300 ]; then
    echo "$body"
    exit 0
  fi

  echo "$body"
  exit 1
done

echo "Deployment did not scale up in time"
exit 1
```
Cold start times vary with model size; larger models take longer to download and initialize. If you need instant responses without cold starts, set `--min-replica-count` to 1 or higher so replicas stay running, as in the sketch below.
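A minimal always-warm configuration might look like the following; the replica counts are illustrative:

```bash
# Keep at least one replica running so requests never hit a cold start.
firectl deployment create <MODEL_NAME> \
  --min-replica-count 1 \
  --max-replica-count 4
```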
Deployments with `--min-replica-count 0` are auto-deleted after 7 days of no traffic. Reserved capacity guarantees availability during scale-up.