New to Fireworks? Start with the Serverless Quickstart for a step-by-step guide to making your first API call.
Fireworks provides fast, cost-effective access to leading open-source text models through OpenAI-compatible APIs. Query models via serverless inference or dedicated deployments using the chat completions API (recommended), completions API, or responses API.
Browse 100+ available models →
For consistent performance, guaranteed capacity, or higher throughput, you can query on-demand deployments instead of serverless models. Deployments use the same APIs with a deployment-specific model identifier:
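For example, a deployment request looks identical to a serverless one except for the model string (the identifier below is a placeholder; copy the exact model string shown for your deployment in the Fireworks console):

response = client.chat.completions.create(
    # Placeholder identifier: substitute the model string shown for your own deployment
    model="accounts/<your-account>/deployedModels/<your-deployment-id>",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)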
Maintain conversation history by including all previous messages:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's its population?"}
]

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=messages
)
print(response.choices[0].message.content)
The model uses the full conversation history to provide contextually relevant responses.
Stream tokens as they're generated for a real-time, interactive UX. Streaming is covered in detail in the Serverless Quickstart.
stream = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Aborting streams: Close the connection to stop generation early so you aren't billed for tokens you don't need:
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    if some_condition:
        # Closing the stream stops generation on the server
        stream.close()
        break
Every response includes token usage information and performance metrics for debugging and observability. For aggregate metrics over time, see the usage dashboard.
Token usage (prompt, completion, and total tokens) is included in the response body for all requests.
Performance metrics (latency, time-to-first-token, etc.) are included in response headers for non-streaming requests. For streaming requests, use the perf_metrics_in_response parameter to include all metrics in the response body.
Non-streaming

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}]
)

# Token usage (always included)
print(response.usage.prompt_tokens)      # Tokens in your prompt
print(response.usage.completion_tokens)  # Tokens generated
print(response.usage.total_tokens)       # Total tokens billed

# Performance metrics are in response headers:
# fireworks-prompt-tokens, fireworks-server-time-to-first-token, etc.

Streaming (usage only)

stream = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    # Usage is included in the final chunk
    if chunk.usage:
        print(f"\n\nTokens used: {chunk.usage.total_tokens}")
        print(f"Prompt: {chunk.usage.prompt_tokens}, Completion: {chunk.usage.completion_tokens}")

Streaming (with performance metrics)

stream = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello, world!"}],
    stream=True,
    extra_body={"perf_metrics_in_response": True}
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    # Both usage and performance metrics are in the final chunk
    if chunk.choices[0].finish_reason:
        if chunk.usage:
            print(f"\n\nTokens: {chunk.usage.total_tokens}")
        if hasattr(chunk, 'perf_metrics'):
            print(f"Performance: {chunk.perf_metrics}")
Usage information is automatically included in the final chunk of streaming responses (the chunk with finish_reason set). This is a Fireworks extension; the OpenAI SDK doesn't return usage for streaming responses by default.
Control how the model generates text. Fireworks automatically uses the recommended sampling parameters from each model's HuggingFace generation_config.json when you don't specify them explicitly, ensuring good results out of the box. We pull temperature, top_k, top_p, min_p, and typical_p from the model's configuration when they are not explicitly provided.
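You can also override the defaults by setting sampling parameters explicitly on any request; for example (the values below are illustrative):

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Write a haiku about the ocean"}],
    temperature=0.6,  # illustrative values; omit to fall back to the model's defaults
    top_p=0.9,
    max_tokens=256
)
print(response.choices[0].message.content)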
The fireworks-sampling-options header contains the actual default sampling parameters used for the model, including values from the model’s HuggingFace generation_config.json:
Python

response = client.chat.completions.with_raw_response.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}]
)

# Access headers from the raw response
sampling_options = response.headers.get('fireworks-sampling-options')
print(sampling_options)  # e.g., '{"temperature": 0.7, "top_p": 0.9}'

completion = response.parse()  # get the parsed response object
print(completion.choices[0].message.content)

JavaScript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.FIREWORKS_API_KEY,
  baseURL: "https://api.fireworks.ai/inference/v1",
});

// withResponse() exposes the raw HTTP response alongside the parsed completion
const { data: completion, response } = await client.chat.completions
  .create({
    model: "accounts/fireworks/models/deepseek-v3p1",
    messages: [{ role: "user", content: "Hello" }],
  })
  .withResponse();

// Access headers from the raw response
const samplingOptions = response.headers.get("fireworks-sampling-options");
console.log(samplingOptions); // e.g., '{"temperature": 0.7, "top_p": 0.9}'

console.log(completion.choices[0].message.content);
See the API reference for detailed parameter descriptions.
Multiple generations
Generate multiple completions in one request:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    n=3  # Generate 3 different jokes
)

for choice in response.choices:
    print(choice.message.content)
Token probabilities (logprobs)
Inspect token probabilities for debugging or analysis:
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    logprobs=True,
    top_logprobs=5  # Show top 5 alternatives per token
)

for content in response.choices[0].logprobs.content:
    print(f"Token: {content.token}, Logprob: {content.logprob}")
Prompt inspection (echo & raw_output)
Verify how your prompt was formatted.
Echo: Return the prompt along with the generation, as in the sketch below:
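A minimal sketch, assuming echo is passed through the OpenAI SDK's extra_body (see the API reference for the exact parameter placement):

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
    # echo is a Fireworks-specific parameter, so the OpenAI SDK passes it via extra_body
    extra_body={"echo": True}
)

# With echo enabled, the response contains the formatted prompt followed by the generation
print(response.choices[0].message.content)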
Tokens
Language models process text in chunks called tokens. In English, a token can be as short as one character or as long as one word. Different model families use different tokenizers, so the same text may translate to different token counts depending on the model.
Why tokens matter:
Models have maximum context lengths measured in tokens
Pricing is based on token usage (prompt + completion)
Token count affects response time
For Llama models, use this tokenizer tool to estimate token counts. Actual usage is returned in the usage field of every API response.
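If you want a rough local estimate before sending a request, you can run a matching HuggingFace tokenizer yourself; a minimal sketch (the model name is illustrative, and gated repositories such as Llama may require accepting a license and authenticating):

from transformers import AutoTokenizer

# Any tokenizer that matches your target model family gives a reasonable estimate
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

text = "Language models process text in chunks called tokens."
token_count = len(tokenizer.encode(text))
print(f"Approximate token count: {token_count}")

Billing is always based on the token counts reported in the usage field, not on local estimates.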
Fireworks provides an OpenAI-compatible API, making migration from OpenAI straightforward. For detailed information on setup, usage examples, and API compatibility notes, see the OpenAI compatibility guide.
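For reference, the Python examples on this page assume an OpenAI client pointed at the Fireworks endpoint, along the lines of the sketch below (the environment variable name is just a convention):

import os
from openai import OpenAI

# Point the standard OpenAI SDK at the Fireworks OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.environ["FIREWORKS_API_KEY"],
    base_url="https://api.fireworks.ai/inference/v1",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)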