Performance Metrics Overview

The Inference API returns several per-request metrics in the response. They are useful for one-off debugging, or the client can log them to its preferred observability tool. For aggregate metrics, see the usage dashboard.

Non-streaming requests: Performance metrics are always included in response headers (e.g., fireworks-prompt-tokens, fireworks-server-time-to-first-token).

Streaming requests: Only a subset of performance metrics, such as fireworks-server-time-to-first-token, is available in headers, because HTTP headers must be sent before the first token is streamed. Set the perf_metrics_in_response body parameter to include all metrics in the last SSE event of the response body.
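Since the header-based metrics are plain HTTP response headers, any HTTP client can read them. A minimal sketch of collecting them from a non-streaming response's headers (the header values below are illustrative placeholders, not real measurements):

```python
def extract_perf_headers(headers: dict) -> dict:
    """Collect Fireworks performance metrics from response headers.

    Header names are case-insensitive in HTTP, so normalize to lowercase
    and keep everything with the fireworks- prefix.
    """
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower().startswith("fireworks-")
    }

# Headers as they might appear on a non-streaming response:
sample_headers = {
    "Content-Type": "application/json",
    "Fireworks-Prompt-Tokens": "12",
    "Fireworks-Server-Time-To-First-Token": "0.031",
}
print(extract_perf_headers(sample_headers))
```

With a real client such as requests, the same function can be applied to response.headers directly.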

Using perf_metrics_in_response

To get performance metrics for streaming responses, set the perf_metrics_in_response parameter to true in your request. This will include performance data in the response body under the perf_metrics field.

Response Body Location

For streaming responses, performance metrics are included in the response body under the perf_metrics field in the final chunk (the one with finish_reason set). This is because headers may not be accessible during streaming.
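Client code can watch for that final chunk instead of inspecting every event. A minimal sketch, assuming chunks have already been parsed into dicts shaped like the API's streaming events (the perf_metrics and finish_reason field names come from the description above; the metric keys inside are illustrative):

```python
def find_perf_metrics(chunks):
    """Return perf_metrics from the final chunk (finish_reason set), or None."""
    for chunk in chunks:
        choices = chunk.get("choices", [])
        if choices and choices[0].get("finish_reason") is not None:
            return chunk.get("perf_metrics")
    return None

# Illustrative stream: two content chunks, then the final chunk.
stream = [
    {"choices": [{"delta": {"content": "Hello"}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": "!"}, "finish_reason": None}]},
    {
        "choices": [{"delta": {}, "finish_reason": "stop"}],
        "perf_metrics": {"server-time-to-first-token": 0.031},
    },
]
print(find_perf_metrics(stream))
```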

Example with Fireworks Build SDK

Python
from fireworks import LLM
import os

llm = LLM(
    model="llama-v3p1-8b-instruct",
    deployment_type="serverless",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

# Streaming completion with performance metrics
stream = llm.chat.completions.create(
    messages=[{"role": "user", "content": "Hello, world!"}],
    max_tokens=100,
    stream=True,
    perf_metrics_in_response=True,
)

for chunk in stream:
    # perf_metrics is only attached to the final chunk (the one with
    # finish_reason set); use getattr to avoid linter errors for
    # attributes that are not part of the static chunk type.
    perf_metrics = getattr(chunk, "perf_metrics", None)
    if perf_metrics is not None and chunk.choices and chunk.choices[0].finish_reason:
        print("Performance metrics:", perf_metrics)

Example with cURL

curl -X POST "https://api.fireworks.ai/inference/v1/completions" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "prompt": "The quick brown fox",
    "max_tokens": 100,
    "stream": true,
    "perf_metrics_in_response": true
  }'
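The streamed body returned by the cURL call above is a series of server-sent events of the form `data: {json}`, ending with `data: [DONE]`. One way to pull perf_metrics out of that raw stream is to scan the data lines and keep the last one that carries the field. A sketch over a hand-written sample payload (the metric keys are illustrative):

```python
import json

def perf_metrics_from_sse(lines):
    """Scan SSE lines and return perf_metrics from the last data event carrying it."""
    metrics = None
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if "perf_metrics" in event:
            metrics = event["perf_metrics"]
    return metrics

# Illustrative SSE stream, as it might appear in the curl output:
sample = [
    'data: {"choices": [{"text": " jumps", "finish_reason": null}]}',
    'data: {"choices": [{"text": " over", "finish_reason": "length"}], '
    '"perf_metrics": {"server-time-to-first-token": 0.031}}',
    "data: [DONE]",
]
print(perf_metrics_from_sse(sample))
```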

Available Metrics

For detailed information about all available performance metrics, see the API reference documentation.