

Building reliable applications requires handling network conditions, transient errors, and long-running requests. This guide covers recommended patterns for production use.

Timeout configuration

Set timeouts based on your workload type:
| Workload | Recommended client timeout |
| --- | --- |
| Interactive / chat | 30–60 seconds |
| Agentic (tool calls, multi-step) | 5–30 minutes |
| Large model inference (long context) | 10–30 minutes |
| Batch job submission | 60 seconds (results are async) |

Python SDK

```python
from openai import OpenAI
import httpx

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<your-api-key>",
    timeout=httpx.Timeout(
        connect=10.0,
        read=1800.0,   # 30 min for long generations
        write=30.0,
        pool=10.0,
    ),
)
```

Raw HTTP

```python
import requests

response = requests.post(
    "https://api.fireworks.ai/inference/v1/chat/completions",
    headers={"Authorization": "Bearer <your-api-key>"},
    json={"model": "...", "messages": [...]},
    timeout=(10, 1800),  # (connect, read) in seconds
)
```

Retry logic

Which errors are retryable

| Status | Meaning | Retry? |
| --- | --- | --- |
| 429 | Rate limit | ✅ Yes — with backoff |
| 500 | Internal server error | ✅ Yes — transient |
| 502 | Bad gateway | ✅ Yes — transient |
| 503 | Service unavailable | ✅ Yes — with backoff |
| 504 | Gateway timeout | ✅ Yes — transient |
| 400 | Bad request | ❌ No — fix the request |
| 401 | Unauthorized | ❌ No — check API key |
| 404 | Not found | ❌ No — check model/deployment ID |
| 422 | Unprocessable entity | ❌ No — fix the request body |
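The table above reduces to a small helper you can reuse in any retry loop (a sketch; the function name is ours, not part of any SDK):

```python
# Statuses from the table that warrant a retry; all other 4xx errors
# indicate a problem with the request itself and should not be retried.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

def is_retryable(status_code: int) -> bool:
    """Return True if a failed request with this status is worth retrying."""
    return status_code in RETRYABLE_STATUSES
```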

Exponential backoff with jitter

```python
import random
import time

from openai import APIStatusError, RateLimitError

RETRYABLE_STATUSES = (429, 500, 502, 503, 504)

def call_with_retry(client, max_retries=5, base_delay=1.0, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except (RateLimitError, APIStatusError) as e:
            # Retry 429s and transient 5xx responses; re-raise everything
            # else immediately, and re-raise after the final attempt.
            if e.status_code not in RETRYABLE_STATUSES or attempt == max_retries - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, ... plus up to 1s of jitter
            # so that concurrent clients don't retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```
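The same pattern generalizes beyond the OpenAI SDK. A self-contained sketch with a stand-in "flaky" function (names are ours, for illustration) shows the control flow without needing a live API:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=0.01, retryable=(TimeoutError,)):
    """Call fn(), retrying retryable exceptions with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Stand-in for a transiently failing API call: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

result = with_backoff(flaky)
```

After two simulated failures, the third attempt returns `"ok"`; a fourth call is never made.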

OpenAI SDK built-in retry

```python
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<your-api-key>",
    max_retries=3,  # SDK retries connection errors, 408, 409, 429, and 5xx with backoff
)
```
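If you use raw HTTP via `requests`, a comparable built-in option is urllib3's `Retry` mounted on a session adapter, a sketch under the assumption that retrying POSTs is acceptable for your workload (POST is not retried by default because it is not idempotent):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,
    backoff_factor=1.0,                         # sleeps grow between attempts
    status_forcelist=[429, 500, 502, 503, 504], # the retryable statuses above
    allowed_methods=["POST"],                   # opt in to retrying POST
    respect_retry_after_header=True,            # honor server-provided delays
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))
```

Requests sent through `session.post(...)` then retry transparently on the listed statuses.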

Handling 429 rate limits

- On serverless: limits scale automatically with sustained usage. For immediate capacity, contact support or switch to a dedicated deployment.
- On dedicated deployments: increase concurrency by raising replica counts (for example, with `firectl deployment update` and autoscaling settings). See Autoscaling.
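When a 429 response includes a `Retry-After` header, honoring it usually recovers faster than blind backoff. A sketch of a delay helper that prefers the server's hint and otherwise falls back to jittered exponential backoff (the function name and cap are ours):

```python
import random

def retry_delay(attempt, headers, base_delay=1.0, cap=60.0):
    """Seconds to sleep before the next attempt, capped at `cap`."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff
    return min(base_delay * (2 ** attempt) + random.uniform(0, 1), cap)
```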

Long-running training jobs

For RL / RFT trainer jobs:

- To recover from preemption or transient failures, use `reconnect_and_wait` on the job manager. See Trainer job manager for parameters and examples.
- To preserve optimizer state across interruptions, set `dcp_save_interval` in your training config. See the RFT parameters reference.

The analytics dashboard vs. client-side failures

The Fireworks analytics and usage views count server-acknowledged requests. They do not capture connection errors that occur before a request reaches the server; those appear as failures on the client but show up as zero or reduced traffic in the console. If your client reports failures while the dashboard looks clean, the issue is likely client-side: a timeout before the connection was established, a DNS resolution failure, or a network path problem. For dedicated deployments, see Exporting metrics for per-deployment Prometheus metrics that reflect what Fireworks infrastructure actually observed.
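When triaging, it helps to bucket client-side exceptions by whether the server could have seen the request. A rough classifier using the `requests` exception hierarchy (the function and labels are ours; note that `ConnectTimeout` subclasses `ConnectionError`, so it is caught by the first branch):

```python
import requests

def classify_failure(exc: Exception) -> str:
    """Rough split between failures the dashboard can and cannot see."""
    if isinstance(exc, requests.exceptions.ConnectionError):
        return "client-side"          # never reached the server; invisible to analytics
    if isinstance(exc, requests.exceptions.ReadTimeout):
        return "ambiguous"            # request may have arrived, but no response came back
    if isinstance(exc, requests.exceptions.HTTPError):
        return "server-acknowledged"  # server returned a status; should appear in usage views
    return "unknown"
```

Logging this label alongside each failure makes it easy to tell whether a spike is a network problem on your side or something the per-deployment metrics should also show.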