The `/v1/completions` and `/v1/chat/completions` endpoints expose a few extra features tailored to multi-turn, stateful rollout traffic. You can use them whether or not the underlying deployment is a hot-load deployment.
These features are fully compatible with the OpenAI SDKs — they’re all
attached as either request headers or optional body fields, so no SDK upgrade
is required.
## Session affinity
Multi-turn rollouts typically reuse a long prefix between turns (same system prompt, same trajectory so far). To get KV-cache hits, all turns of a trajectory should land on the same inference replica. Two headers are relevant here:

- `x-multi-turn-session-id`: identifies the agent trajectory. Set this once per trajectory and keep it constant across turns. If both headers are present, Fireworks currently prefers this value when deriving the request's session-affinity key.
- `x-session-affinity`: fallback sticky-routing key used when `x-multi-turn-session-id` is absent. In most RL rollout setups, set it to the same trajectory ID.
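A minimal sketch of setting both headers per trajectory, using only the standard library to build the request (the base URL and model name below are placeholders for your own deployment; any HTTP client works the same way):

```python
import json
import urllib.request
import uuid

trajectory_id = str(uuid.uuid4())  # fixed for the lifetime of one rollout trajectory

# Both headers carry the same trajectory ID, so every turn routes to the
# replica that already holds this trajectory's KV cache.
headers = {
    "Authorization": "Bearer <FIREWORKS_API_KEY>",
    "Content-Type": "application/json",
    "x-multi-turn-session-id": trajectory_id,
    "x-session-affinity": trajectory_id,
}

def build_turn_request(messages, model="<your-deployment-model>"):
    """Build one turn's chat-completions request with sticky-routing headers."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        "https://api.fireworks.ai/inference/v1/chat/completions",
        data=body,
        headers=headers,
        method="POST",
    )

# req = build_turn_request([{"role": "user", "content": "hi"}])
# resp = urllib.request.urlopen(req)  # send each turn with the same headers
```

The one design point that matters: generate `trajectory_id` once per trajectory, not once per turn, and reuse the same `headers` object for every request in that trajectory.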
## Behavior during weight swap
If your rollout traffic hits a hot-load deployment, a new checkpoint can arrive mid-rollout. What happens to your requests depends on the deployment's configured transition mode:

- Async transition (recommended for RL): in-flight requests pause, then resume on the same HTTP connection using the new weights. The active turn keeps its current KV state, so it continues rather than restarting. New requests queue up. You see elevated TTFT but no errors.
- Synchronous transition: in-flight requests finish on the old weights; new requests get HTTP `425 Too Early` until the swap is done. Your client should retry with back-off, ideally keeping the same session-affinity key so it lands on a replica that has already finished the swap.
`reset_prompt_cache` only affects what future requests or session IDs can reuse after the swap. See Checkpoint-swap behavior for the full semantics.
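For the synchronous case, the retry loop can be as simple as the sketch below. `send_fn` stands in for whatever issues your request (with the same session-affinity headers on every attempt) and is assumed to return a response object with a `.status_code`:

```python
import random
import time

def send_with_swap_retry(send_fn, max_retries=6, base_delay=0.5):
    """Retry a request that may hit HTTP 425 during a synchronous weight swap."""
    for attempt in range(max_retries):
        resp = send_fn()
        if resp.status_code != 425:  # 425 Too Early: swap still in progress
            return resp
        # Exponential back-off with jitter. Because the retry reuses the same
        # sticky session key, it lands on a replica that has finished swapping.
        time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    return resp  # give up and surface the last 425 to the caller
```

This is a sketch, not a prescribed policy; tune `max_retries` and `base_delay` to your swap duration.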
## MoE Router Replay
For Mixture-of-Experts models, training-inference divergence often comes from the router picking different top-K experts at the same token position between the trainer and inference. Aligning those choices across rollouts and training is known as Rollout Router Replay (R3). Fireworks inference supports returning the selected MoE experts for every token and every MoE layer: pass `include_routing_matrix: true` together with `logprobs: true` on your request.

For `/v1/chat/completions`, you find the results at `choices[i].logprobs.content[j].routing_matrix`; for `/v1/completions` the structure is analogous. Each value is a flattened, base64-encoded uint8 array of shape `[num_layers_with_moe, num_active_experts]`.
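A sketch of the request body and of pulling the per-token matrices out of a parsed response (the model name is a placeholder, and `routing_matrices` is a hypothetical helper, not part of any SDK):

```python
import json

# Request body: include_routing_matrix rides alongside logprobs.
payload = {
    "model": "accounts/fireworks/models/deepseek-v3",  # placeholder model name
    "messages": [{"role": "user", "content": "Explain MoE routing."}],
    "logprobs": True,                # required for routing_matrix to appear
    "include_routing_matrix": True,  # per-token, per-layer expert selections
}
body = json.dumps(payload)

def routing_matrices(response_json):
    """Collect the base64 routing_matrix string of each generated token."""
    content = response_json["choices"][0]["logprobs"]["content"]
    return [tok["routing_matrix"] for tok in content]
```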
### Example response (DeepSeek V3)
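The full payload is not reproduced here; an abbreviated, illustrative sketch of where `routing_matrix` sits in the response (all field values are placeholders):

```json
{
  "choices": [
    {
      "message": { "role": "assistant", "content": "..." },
      "logprobs": {
        "content": [
          {
            "token": "Hello",
            "logprob": -0.12,
            "routing_matrix": "<base64-encoded 464-byte uint8 buffer>"
          }
        ]
      }
    }
  ]
}
```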
### Decoding the routing matrix
DeepSeek V3 has 58 MoE layers (the first 3 of its 61 total layers are dense) and selects 8 active experts per token, so each decoded buffer is `58 * 8 = 464` bytes.
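A decoding sketch under those DeepSeek V3 dimensions, assuming NumPy is available (the round-trip at the bottom uses synthetic data, not a real response):

```python
import base64

import numpy as np

NUM_MOE_LAYERS = 58      # DeepSeek V3: 61 layers total, first 3 dense
NUM_ACTIVE_EXPERTS = 8   # experts selected per token

def decode_routing_matrix(b64: str) -> np.ndarray:
    """Decode one token's base64 buffer into [num_moe_layers, num_active_experts]."""
    buf = np.frombuffer(base64.b64decode(b64), dtype=np.uint8)
    assert buf.size == NUM_MOE_LAYERS * NUM_ACTIVE_EXPERTS  # 464 bytes
    return buf.reshape(NUM_MOE_LAYERS, NUM_ACTIVE_EXPERTS)

# Synthetic round-trip: 464 arbitrary uint8 values encode and decode cleanly.
fake = np.arange(464, dtype=np.uint8)
encoded = base64.b64encode(fake.tobytes()).decode()
decoded = decode_routing_matrix(encoded)
print(decoded.shape)  # (58, 8)
```

Row `i` of the decoded array is the set of expert indices chosen at the `i`-th MoE layer for that token.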
## Other API modes
- Completions API (`/v1/completions`): same mechanism; `include_routing_matrix` and `logprobs` are top-level body fields.
- Streaming (`stream: true`): `routing_matrix` is included on each streamed token chunk's `logprobs.content` entry.
- Prompt tokens (`echo: true`): returns expert selections for the prompt tokens too. Combine with `echo_last: N` to include expert selections for only the last N prompt tokens.
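A sketch of a completions-mode body combining these fields (the model name is a placeholder, and whether `logprobs` takes a boolean or a top-k integer on this endpoint should be checked against the API reference; `true` mirrors the chat example above):

```python
import json

# /v1/completions body that also asks for expert selections on the
# last 16 prompt tokens.
payload = {
    "model": "accounts/fireworks/models/deepseek-v3",  # placeholder model name
    "prompt": "The capital of France is",
    "logprobs": True,
    "include_routing_matrix": True,
    "echo": True,      # include prompt tokens in the logprobs output
    "echo_last": 16,   # limit prompt-token expert selections to the last 16
    "stream": False,
}
body = json.dumps(payload)
```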
## Related pages

- BYOT integration guide: the full bring-your-own-trainer flow that usually wraps these inference features.
- Ledger & checkpoint swap: detailed semantics of request behavior across weight swaps.
- Prompt caching: session-affinity patterns for general cache-hit optimization.
- Chat completions API: full request/response schema for `/v1/chat/completions`.