Ledger & Debugging for RL Rollouts

Early Access Feature. This page is part of the same private-preview external-bucket hot-load workflow for RL rollouts. Contact Fireworks to enable this path on your account before using non-FW_HOSTED storage.

If you are using Fireworks-managed RLOR trainers with FW_HOSTED, the ledger and checkpoint-swap behavior here still matter, but you can usually ignore the external-bucket setup and manual upload/signaling details from the BYOT integration guide.

A hot-load deployment maintains a ledger of every snapshot it has loaded, along with which replica finished which snapshot at what time. The ledger is the fastest way to answer “what weights is my deployment serving right now?” and to recover from a stuck state.

Inspect snapshot history

Dump the ledger, sorted by most recent snapshot first:

firectl get ledger <deployment_id>

Each row shows the identity you signaled, whether it was a full or delta snapshot, the per-replica readiness transition timestamps, and any load error.
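As a rough illustration of how those fields answer "what is every replica serving now?", the sketch below models a ledger row in Python. The `LedgerRow` type, its field names, the replica names, and the sample values are all assumptions made for this example. They are not the actual output schema of the ledger command.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LedgerRow:
    """Hypothetical model of one ledger row; field names are illustrative only."""
    identity: str                      # identity you signaled
    kind: str                          # "full" or "delta"
    ready_at: dict = field(default_factory=dict)  # replica name -> ready timestamp
    load_error: Optional[str] = None   # load error text, if any


def newest_ready_everywhere(rows: list, replicas: set) -> Optional[str]:
    """Return the newest snapshot identity that every replica finished loading.

    `rows` is assumed to be ordered most recent first, like the ledger dump.
    """
    for row in rows:
        if row.load_error is None and replicas.issubset(row.ready_at):
            return row.identity
    return None


# Made-up sample data: version_003 is still loading on replica-1.
rows = [
    LedgerRow("version_003", "delta", {"replica-0": "2025-01-01T00:10:00Z"}),
    LedgerRow("version_002", "delta", {"replica-0": "2025-01-01T00:05:00Z",
                                       "replica-1": "2025-01-01T00:05:02Z"}),
    LedgerRow("version_001", "full", {}, load_error="example load error"),
]
print(newest_ready_everywhere(rows, {"replica-0", "replica-1"}))  # version_002
```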

Inspect deployment status and failures

If the deployment itself is unhealthy (crashlooping after a bad snapshot, out-of-memory on merge, etc.), the reason is recorded on the deployment resource:

firectl deployment get <deployment_id>

Look at the status, latestStatus.reason, and the most recent ledger entry together to reason about whether the problem is load-side, weights-side, or infra-side.

Snapshot config validation errors

Weight sync validates each snapshot’s config.json against the deployment’s base-model config before serving the snapshot. If validation fails, the snapshot is not loaded: keep serving the previous ready snapshot, or fix the files and fall back to a new full snapshot. Common messages include:

Extra base model config options or Extra snapshot model config options: one config has a top-level field that the other does not.
Config value mismatch for <field>: both configs contain the field, but the values differ.
Types mismatch: the snapshot config resolves to a different HuggingFace config class than the base model.

If the only difference is a known-safe additive metadata field, retry the weight sync request with validation.extra_fields_ignore, for example:

{
  "identity": "version_002",
  "validation": {
    "extra_fields_ignore": ["snapshot_only_option"]
  }
}

Important: Ignoring model-affecting fields can cause load or serving failures; only bypass known-safe metadata fields.
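To catch these mismatches before signaling, a local pre-check along the following lines may help. This is a minimal sketch, not the server's validation logic: the `compare_configs` helper is hypothetical, the message wording only imitates the three message families listed above, and using the `model_type` field as a stand-in for the HuggingFace config class is an assumption.

```python
def compare_configs(base: dict, snapshot: dict, extra_fields_ignore=()) -> list:
    """Compare top-level config.json fields and return human-readable problems."""
    ignore = set(extra_fields_ignore)
    errors = []

    # Assumption: "model_type" approximates which HuggingFace config class is resolved.
    if base.get("model_type") != snapshot.get("model_type"):
        errors.append("Types mismatch")

    # Fields present on one side only, minus anything explicitly ignored.
    extra_base = sorted(set(base) - set(snapshot) - ignore)
    extra_snapshot = sorted(set(snapshot) - set(base) - ignore)
    if extra_base:
        errors.append(f"Extra base model config options: {extra_base}")
    if extra_snapshot:
        errors.append(f"Extra snapshot model config options: {extra_snapshot}")

    # Fields present on both sides whose values differ.
    for name in sorted(set(base) & set(snapshot)):
        if name != "model_type" and base[name] != snapshot[name]:
            errors.append(f"Config value mismatch for {name}")
    return errors


base = {"model_type": "llama", "hidden_size": 4096}
snapshot = {"model_type": "llama", "hidden_size": 4096, "snapshot_only_option": True}
print(compare_configs(base, snapshot))
print(compare_configs(base, snapshot, extra_fields_ignore=["snapshot_only_option"]))
```

In this sketch the ignore list applies only to extra fields, not to value mismatches, which keeps the bypass limited to additive metadata.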

Reset the ledger

If the delta chain is wedged or you want to force the deployment back to the base model, you can clear server-side ledger history. This preserves the deployment itself; it just forgets every hot-loaded snapshot.

curl -X DELETE \
  https://api.fireworks.ai/v1/accounts/<account_id>/deployments/<deployment_id>/ledger \
  -H "Authorization: Bearer <fireworks_api_key>"

After reset, your next signal must be a full snapshot (delta metadata will be rejected because there’s nothing to diff against).
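For rollout tooling written in Python, the same DELETE call can be sketched with the standard library. The helper name `build_ledger_reset_request` and the placeholder IDs and key are hypothetical; the URL and header mirror the curl command above.

```python
import urllib.request


def build_ledger_reset_request(account_id: str, deployment_id: str,
                               api_key: str) -> urllib.request.Request:
    """Build (but do not send) the ledger-reset request shown in the curl example."""
    url = (
        "https://api.fireworks.ai/v1/accounts/"
        f"{account_id}/deployments/{deployment_id}/ledger"
    )
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Bearer {api_key}"},
    )


req = build_ledger_reset_request("my-account", "my-deployment", "FAKE_KEY")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to log or review the exact call before clearing ledger history.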

Checkpoint-swap behavior

When you signal a new snapshot, Fireworks has to eventually swap weights on every replica. What happens to in-flight and new requests during the swap depends on which transition mode the deployment is configured with.

Checkpoint download behaves the same way in both modes: it always starts immediately after the signal, in parallel with ongoing inference. The modes differ in how they handle the actual weight-swap moment. Set the mode at deployment create time with --hot-load-transition-type ASYNC or SYNC (default ASYNC). See Create a hot-load deployment.

Async transition (recommended, default for RL)

This mode is similar in spirit to PipelineRL:

In-flight requests: paused for the duration of the swap, then resumed on the same HTTP connection. The active turn keeps its current KV state, so the request continues streaming instead of restarting.
New requests: queued until the swap finishes. Clients observe this as elevated time-to-first-token (TTFT).
The swap itself does not return a 4xx or 5xx. Clients may set the x-fireworks-hot-load-drain-timeout request header, in seconds (default 90), to receive HTTP 425 Too Early once the timeout expires.
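A rollout client might attach the drain-timeout header with a small helper like the one below. This is only a sketch: `with_drain_timeout` is a hypothetical name, and sending the value as a whole number of seconds is an assumption.

```python
from typing import Optional


def with_drain_timeout(headers: dict, timeout_s: Optional[int] = None) -> dict:
    """Return a copy of `headers`, adding the drain-timeout header when requested."""
    out = dict(headers)
    if timeout_s is not None:
        # Header values are strings; the unit is seconds.
        out["x-fireworks-hot-load-drain-timeout"] = str(timeout_s)
    return out


base_headers = {"Content-Type": "application/json"}
print(with_drain_timeout(base_headers, 30))
```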

Synchronous transition

In-flight requests: the server waits for them to complete on the old weights before swapping.
New requests arriving during the swap are rejected with HTTP 425 Too Early. Your rollout client should back off and retry, ideally using the same session-affinity key so it lands on a replica that has already finished the swap.
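The back-off-and-retry advice can be sketched as a small loop. The function name, attempt count, and delays below are illustrative assumptions rather than recommended values, and `send` stands for whatever callable issues your request and returns the HTTP status code. Reusing the same `send` callable on every attempt keeps the headers, including any session-affinity key, unchanged across retries.

```python
import time
from typing import Callable


def send_with_retry(send: Callable[[], int], max_attempts: int = 5,
                    base_delay_s: float = 0.5, sleep=time.sleep) -> int:
    """Call `send` until it returns a status other than 425 or attempts run out."""
    status = send()
    for attempt in range(1, max_attempts):
        if status != 425:
            break
        # Exponential backoff: 0.5s, 1s, 2s, ... with the default base delay.
        sleep(base_delay_s * 2 ** (attempt - 1))
        status = send()
    return status


# Simulated server: two rejections during the swap, then success.
responses = iter([425, 425, 200])
delays = []
print(send_with_retry(lambda: next(responses), sleep=delays.append), delays)
```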

Prompt cache reset behavior

reset_prompt_cache only affects what can be reused after the swap. It does not interrupt the active turn (the in-flight HTTP stream), but it affects the next turn in the same session and new sessions. Configure per snapshot in POST /hot_load/v1/models/hot_load, for example { "identity": "version_002", "reset_prompt_cache": "new_session" }.

| `reset_prompt_cache` | Existing turn (same HTTP stream) | New turn, same `x-multi-turn-session-id` | New session (new session id) |
| --- | --- | --- | --- |
| `all` (default) | Async: continues with prior KV on the stream. Sync: waits for turn to finish before swap. | Recompute KV | Recompute KV |
| `new_session` | Continues | Reuse KV for that session id | Recompute KV |
| `none` | Continues | Reuse KV | Reuse KV |

Under async transition, the active turn keeps streaming on the same connection; cache reset applies to subsequent requests. Under sync transition, the server drains in-flight work before swapping, so you typically see stricter ordering before new weights apply.
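For client-side reasoning about expected latency after a swap, the table can be encoded as a lookup. This is a sketch that restates the documented behavior for requests arriving after the swap; the names `KV_AFTER_SWAP` and `kv_after_swap` are hypothetical.

```python
KV_AFTER_SWAP = {
    # reset_prompt_cache: (new turn in the same session, new session)
    "all": ("recompute", "recompute"),
    "new_session": ("reuse", "recompute"),
    "none": ("reuse", "reuse"),
}


def kv_after_swap(reset_prompt_cache: str, same_session: bool) -> str:
    """Return "reuse" or "recompute" for a request that arrives after the swap."""
    same, new = KV_AFTER_SWAP[reset_prompt_cache]
    return same if same_session else new


print(kv_after_swap("new_session", same_session=True))   # reuse
print(kv_after_swap("new_session", same_session=False))  # recompute
```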

Need help?

If the ledger stops advancing, a snapshot never becomes ready, or the deployment stays unhealthy after you fall back to a full snapshot, contact Fireworks. Include the account ID, deployment ID, snapshot identity you tried to load, and the latest ledger output.

Related pages

Quickstart (BYOT)

Prerequisites, deployment setup, and the hot-load API.

Incremental snapshots

ARC2 deltas, hints, and incremental signal bodies.

Inference for RL rollouts

Session affinity, policy version in streams, and MoE Router Replay.
