

Building reliable applications requires handling network conditions, transient errors, and long-running requests. This guide covers recommended patterns for production use.

Timeout configuration

Set timeouts based on your workload type:
| Workload | Recommended client timeout |
| --- | --- |
| Interactive / chat | 30–60 seconds |
| Agentic (tool calls, multi-step) | 5–30 minutes |
| Large model inference (long context) | 10–30 minutes |
| Batch job submission | 60 seconds (results are async) |

Python SDK

```python
from openai import OpenAI
import httpx

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<your-api-key>",
    timeout=httpx.Timeout(
        connect=10.0,
        read=1800.0,   # 30 min for long generations
        write=30.0,
        pool=10.0,
    ),
)
```

Raw HTTP

```python
import requests

response = requests.post(
    "https://api.fireworks.ai/inference/v1/chat/completions",
    headers={"Authorization": "Bearer <your-api-key>"},
    json={"model": "...", "messages": [...]},
    timeout=(10, 1800),  # (connect, read) in seconds
)
```

Retry logic

Which errors are retryable

| Status | Meaning | Retry? |
| --- | --- | --- |
| 429 | Rate limit | ✅ Yes — with backoff |
| 500 | Internal server error | ✅ Yes — transient |
| 502 | Bad gateway | ✅ Yes — transient |
| 503 | Service unavailable | ✅ Yes — with backoff |
| 504 | Gateway timeout | ✅ Yes — transient |
| 400 | Bad request | ❌ No — fix the request |
| 401 | Unauthorized | ❌ No — check API key |
| 404 | Not found | ❌ No — check model/deployment ID |
| 422 | Unprocessable entity | ❌ No — fix the request body |
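The table above reduces to a small helper you can reuse in any retry loop (a sketch; the function name is ours, not part of any SDK):

```python
# Statuses from the table that warrant a retry; all other 4xx errors
# indicate a problem with the request itself and should not be retried.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

def is_retryable(status_code: int) -> bool:
    """Return True if a failed request with this status is worth retrying."""
    return status_code in RETRYABLE_STATUSES
```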

Exponential backoff with jitter

```python
import random
import time

from openai import APIStatusError, RateLimitError

RETRYABLE_STATUSES = (429, 500, 502, 503, 504)

def call_with_retry(client, max_retries=5, base_delay=1.0, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except (RateLimitError, APIStatusError) as e:
            # Retry 429s and transient 5xx responses; re-raise everything
            # else immediately, and re-raise after the final attempt.
            if e.status_code not in RETRYABLE_STATUSES or attempt == max_retries - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, ... plus up to 1s of jitter
            # so that concurrent clients don't retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```
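The same pattern generalizes beyond the OpenAI SDK. A self-contained sketch with a stand-in "flaky" function (names are ours, for illustration) shows the control flow without needing a live API:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=0.01, retryable=(TimeoutError,)):
    """Call fn(), retrying retryable exceptions with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Stand-in for a transiently failing API call: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

result = with_backoff(flaky)
```

After two simulated failures, the third attempt returns `"ok"`; a fourth call is never made.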

OpenAI SDK built-in retry

```python
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<your-api-key>",
    max_retries=3,  # SDK retries connection errors, 408, 409, 429, and 5xx with backoff
)
```
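If you use raw HTTP via `requests`, a comparable built-in option is urllib3's `Retry` mounted on a session adapter, a sketch under the assumption that retrying POSTs is acceptable for your workload (POST is not retried by default because it is not idempotent):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,
    backoff_factor=1.0,                         # sleeps grow between attempts
    status_forcelist=[429, 500, 502, 503, 504], # the retryable statuses above
    allowed_methods=["POST"],                   # opt in to retrying POST
    respect_retry_after_header=True,            # honor server-provided delays
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))
```

Requests sent through `session.post(...)` then retry transparently on the listed statuses.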

Handling 429 rate limits

- On serverless: limits scale automatically with sustained usage. For immediate capacity, contact support or switch to a dedicated deployment.
- On dedicated deployments: increase concurrency by raising replica counts (for example, with `firectl deployment update` and autoscaling settings). See Autoscaling.
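When a 429 response includes a `Retry-After` header, honoring it usually recovers faster than blind backoff. A sketch of a delay helper that prefers the server's hint and otherwise falls back to jittered exponential backoff (the function name and cap are ours):

```python
import random

def retry_delay(attempt, headers, base_delay=1.0, cap=60.0):
    """Seconds to sleep before the next attempt, capped at `cap`."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff
    return min(base_delay * (2 ** attempt) + random.uniform(0, 1), cap)
```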

Long-running training jobs

For RL / RFT trainer jobs:

- To recover from preemption or transient failures, use `reconnect_and_wait` on the job manager. See Trainer job manager for parameters and examples.
- To preserve optimizer state across interruptions, set `dcp_save_interval` in your training config. See the RFT parameters reference.

The analytics dashboard vs. client-side failures

The Fireworks analytics and usage views count server-acknowledged requests. They do not capture connection errors that occur before a request reaches the server; those appear as failures on the client but show up as zero or reduced traffic in the console. If your client reports failures while the dashboard looks clean, the issue is likely client-side: a timeout before the connection was established, a DNS resolution failure, or a network path problem. For dedicated deployments, see Exporting metrics for per-deployment Prometheus metrics that reflect what Fireworks infrastructure actually observed.
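When triaging, it helps to bucket client-side exceptions by whether the server could have seen the request. A rough classifier using the `requests` exception hierarchy (the function and labels are ours; note that `ConnectTimeout` subclasses `ConnectionError`, so it is caught by the first branch):

```python
import requests

def classify_failure(exc: Exception) -> str:
    """Rough split between failures the dashboard can and cannot see."""
    if isinstance(exc, requests.exceptions.ConnectionError):
        return "client-side"          # never reached the server; invisible to analytics
    if isinstance(exc, requests.exceptions.ReadTimeout):
        return "ambiguous"            # request may have arrived, but no response came back
    if isinstance(exc, requests.exceptions.HTTPError):
        return "server-acknowledged"  # server returned a status; should appear in usage views
    return "unknown"
```

Logging this label alongside each failure makes it easy to tell whether a spike is a network problem on your side or something the per-deployment metrics should also show.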