## What this is
This guide walks through GRPO (Group Relative Policy Optimization) training using the cookbook’s rl_loop recipe. GRPO samples multiple completions per prompt, scores them with a reward function, and uses group reward statistics for policy gradient updates.
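The "group relative" part is the advantage computation: each completion's reward is normalized against the other completions sampled for the same prompt. A minimal sketch of one common formulation (illustrative only — see the Loss Functions page for the recipe's actual advantage computation):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """One common GRPO formulation: z-score rewards within a prompt's group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Uniform rewards (all correct or all wrong): no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Four completions for one prompt, two scored correct:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```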
For agentic RL workloads, the cookbook also supports an async RL loop that overlaps rollouts and training. Use it when agent trajectories or tool calls make rollout latency the bottleneck.
## Architecture
The RL recipe always uses a policy trainer plus an inference deployment. Add a reference trainer when your setup needs reference logprobs:
| Component | Role |
|---|---|
| Policy trainer | Trainable model — runs forward_backward_custom + optim_step |
| Reference trainer | Optional frozen copy — provides KL/reference logprobs (--forward-only) when infra.ref_training_shape_id is set |
| Deployment | Samples completions via DeploymentSampler (tokenization happens client-side) |
## Using the recipe
The simplest way to run GRPO is via the cookbook’s Config + main:
```python
from training.recipes.rl_loop import Config, main
from training.utils import DeployConfig, InfraConfig, WeightSyncConfig, WandBConfig

cfg = Config(
    log_path="./grpo_logs",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="/path/to/gsm8k.jsonl",
    max_rows=200,
    epochs=1,
    completions_per_prompt=4,
    max_completion_tokens=1024,
    temperature=1.0,
    max_seq_len=4096,
    policy_loss="grpo",  # or "importance_sampling", "reinforce", "dapo", "dro", "gspo", "cispo"
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
        ref_training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200-forward",
    ),
    deployment=DeployConfig(
        deployment_id="grpo-serving",
        tokenizer_model="Qwen/Qwen3-8B",
    ),
    weight_sync=WeightSyncConfig(weight_sync_interval=1),
    wandb=WandBConfig(entity="my-team", project="grpo-experiment"),
)
main(cfg)
```
The recipe handles resource provisioning, rollout scheduling, reference logprobs, checkpointing, and cleanup automatically.
## Policy loss variants

| policy_loss | Description |
|---|---|
| "grpo" | REINFORCE + KL penalty (default) |
| "importance_sampling" | Off-policy ratio weighting with optional clipping |
| "reinforce" | Vanilla REINFORCE |
| "dapo" | Dynamic advantage with asymmetric PPO clipping |
| "dro" | Distributionally robust off-policy objective |
| "gspo" | Sequence-level clipped PPO |
| "cispo" | Clipped importance-sampling policy optimization |
## Step-by-step (API-level)
For teams that need full control beyond what the recipe provides, here is the API-level flow.
### Provision resources with setup_infra
training.utils.rl.setup_infra is the cookbook’s single entrypoint for shape
resolution, trainer/deployment provisioning, weight sync wiring, and
trainer/deployment re-attach. It requests the policy trainer first, links the
deployment, then waits for readiness in parallel. Recipes pass a config + two booleans
(needs_reference, needs_inference) and get back an Infra bundle of wired
trainer clients. Teams that fork training/recipes/rl_loop.py should reuse
setup_infra rather than re-wiring the lower-level helpers below.
```python
import os
import transformers

from fireworks.training.sdk import (
    TrainerJobManager, DeploymentManager, DeploymentSampler, WeightSyncer,
    AdaptiveConcurrencyController,
)
from training.utils import (
    InfraConfig, DeployConfig, ResourceCleanup, WeightSyncScope,
)
from training.utils.rl import setup_infra

api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")
rlor_mgr = TrainerJobManager(api_key=api_key, base_url=base_url)
deploy_mgr = DeploymentManager(api_key=api_key, base_url=base_url)

base_model = "accounts/fireworks/models/qwen3-8b"
infra_cfg = InfraConfig(
    training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ref_training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200-forward",
)
deploy_cfg = DeployConfig(
    deployment_id="grpo-serving",
    tokenizer_model="Qwen/Qwen3-8B",
    weight_sync_scope=WeightSyncScope.PER_TRAINER,  # default
)

with ResourceCleanup(rlor_mgr, deploy_mgr) as cleanup:
    infra = setup_infra(
        rlor_mgr=rlor_mgr,
        deploy_mgr=deploy_mgr,
        base_model=base_model,
        infra_cfg=infra_cfg,
        deploy_cfg=deploy_cfg,
        lora_rank=0,
        needs_reference=True,   # KL baseline
        needs_inference=True,   # rollouts
        role_prefix="grpo",
        api_key=api_key,
        cleanup=cleanup,        # scope-exit: cancel trainers, scale deployment to 0
    )
    policy = infra.policy          # ReconnectableClient (policy trainer)
    reference = infra.reference    # ReconnectableClient (forward-only) or LoRA shared handle
    inference_model = infra.inference_model

    tokenizer = transformers.AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
    sampler = DeploymentSampler(
        inference_url=deploy_mgr.inference_url,
        model=inference_model,
        api_key=api_key,
        tokenizer=tokenizer,
        concurrency_controller=AdaptiveConcurrencyController(initial_window=16),
    )
```
See Weight sync for WeightSyncScope.PER_TRAINER (default) vs PER_DEPLOYMENT. For the full setup_infra contract, lower-level building blocks, and implementation rationale, see the cookbook’s dev skill: skills/dev/.
### Training loop

```python
import asyncio

import tinker  # provides AdamParams for optim_step

# Continues inside the ResourceCleanup scope from the previous block.
tracker = WeightSyncer(
    policy_client=policy.inner,
    deploy_mgr=deploy_mgr,
    deployment_id="grpo-serving",
    base_model=base_model,
    hotload_timeout=600,
    first_checkpoint_type="base",
)

for step, row in enumerate(dataset):  # dataset: your iterable of chat rows (e.g. gsm8k.jsonl)
    input_messages = [m for m in row["messages"] if m.get("role") != "assistant"]
    completions = asyncio.run(
        sampler.sample_with_tokens(messages=input_messages, n=4, max_tokens=512)
    )
    rewards = [score(c) for c in completions]  # score: user-supplied reward function
    if len(set(rewards)) == 1:
        continue  # uniform rewards carry no learning signal

    datums = build_grpo_datums(completions)
    ref_fwd = reference.forward(datums, "cross_entropy")
    ref_logprobs = [list(x["logprobs"].data) for x in ref_fwd.loss_fn_outputs]

    loss_fn = make_grpo_loss_fn(rewards, ref_logprobs, kl_beta=0.001)
    policy.forward_backward_custom(datums, loss_fn)
    policy.optim_step(
        tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01)
    )
    tracker.save_and_hotload(f"step-{step:05d}")
```
See Loss Functions for make_grpo_loss_fn and build_grpo_datums implementations.
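For orientation, the general shape of such a factory is a closure that binds one group's rewards and reference logprobs for the trainer to call. The sketch below is not the cookbook's make_grpo_loss_fn — the tensor shapes and closure signature are assumptions:

```python
import torch

def make_grpo_loss_fn_sketch(rewards, ref_logprobs, kl_beta=0.001):
    """Sketch only: binds one prompt group's statistics into a trainer-callable closure."""
    r = torch.tensor(rewards)
    advantages = (r - r.mean()) / r.std(unbiased=False).clamp_min(1e-8)

    def loss_fn(logprobs_per_completion):
        # Assumed input: one 1-D tensor of token logprobs per completion.
        losses = []
        for lp, ref, adv in zip(logprobs_per_completion, ref_logprobs, advantages):
            pg = -(lp * adv)               # REINFORCE term
            kl = lp - torch.tensor(ref)    # k1 estimate of KL vs. the frozen reference
            losses.append((pg + kl_beta * kl).mean())
        return torch.stack(losses).mean()

    return loss_fn
```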
## Pipeline overlap
Sampling and training overlap within policy windows controlled by weight_sync_interval. All prompts in a window sample concurrently; results train as they arrive. At window boundaries the pipeline drains, weights sync to the deployment, and the next window samples against the updated weights.
| weight_sync_interval | Behavior |
|---|---|
| 1 (default) | No overlap — sample, train, sync, repeat |
| N > 1 | N-step windows with overlap inside, sync at boundaries |
| 0 | No syncs — the deployment keeps the base weights for the entire run. Useful for debugging or ablations, not standard RL training. |
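A toy simulation of what one window does, with stand-in sample/train functions (illustrative only; the real scheduler lives in the recipe):

```python
import asyncio
import random

async def fake_sample(prompt: str) -> str:
    await asyncio.sleep(random.random() * 0.05)  # stand-in for rollout latency
    return f"completions for {prompt}"

async def run_window(prompts: list[str]) -> None:
    # The whole window samples concurrently...
    tasks = [asyncio.create_task(fake_sample(p)) for p in prompts]
    # ...and results train as they arrive, not in submission order.
    for done in asyncio.as_completed(tasks):
        print("train on", await done)
    # Window boundary: everything drained; weights would sync to the deployment here.

asyncio.run(run_window(["p1", "p2", "p3", "p4"]))
```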
## Operational guidance

- deployment.tokenizer_model is required — the API raises ValueError if it is not set.
- Set infra.training_shape_id — training shapes are the launch path for cookbook trainers.
- Set infra.ref_training_shape_id when you want a reference trainer — if it is unset, the recipe skips reference-model provisioning entirely.
- Skip prompts with uniform rewards (all correct or all wrong) — they provide no learning signal.
- Track reward distributions and KL every step to catch objective drift early.
- When configured, the reference trainer uses --forward-only — never call optim_step on it.
- Sampling is async under the hood: DeploymentSampler.sample_with_tokens() issues n concurrent n=1 requests, so synchronous scripts should wrap it with asyncio.run(...).
- DCP checkpoints are disabled by default (dcp_save_interval=0). If you need to resume training from a checkpoint, explicitly set dcp_save_interval to a positive value in your WeightSyncConfig (see the example below).
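For example, a config that syncs weights every step and also writes a resumable DCP checkpoint every 50 steps (the 50 is arbitrary; pick an interval that matches your failure tolerance):

```python
from training.utils import WeightSyncConfig

weight_sync = WeightSyncConfig(
    weight_sync_interval=1,  # hotload policy weights to the deployment every step
    dcp_save_interval=50,    # also persist a resumable DCP checkpoint every 50 steps
)
```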
## Common pitfalls

- Reward normalization bugs can destabilize GRPO updates quickly — verify advantage computation.
- Reference/policy tokenizer mismatch invalidates KL estimates — always use the same base_model.
- Logprob alignment: the trainer returns N-1 logprobs for N tokens, while inference returns N logprobs where the first is None. Use inference[1:] to align (see the sketch below).
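A minimal illustration of the alignment (values are made up):

```python
# Trainer side: N-1 logprobs for an N-token sequence (no entry for the first token).
trainer_lp = [-0.12, -0.40, -0.07]            # tokens 2..4 of a 4-token sequence

# Inference side: N entries, where the first is None (token 1 has no preceding context).
inference_lp = [None, -0.11, -0.39, -0.08]

# Drop the leading None so both lists line up token-for-token.
aligned = inference_lp[1:]
assert len(aligned) == len(trainer_lp)
```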
This page covers rl_loop (GRPO and its policy_loss variants — dapo, gspo, cispo, dro, importance_sampling, reinforce). One sibling RL recipe ships in the cookbook alongside it:
- IGPO (training.recipes.igpo_loop) — Information Gain Policy Optimization for multi-turn agent trajectories. Adds turn-level IG rewards on top of the GRPO machinery; the same policy_loss variants apply.
For runnable examples and rationale, see the recipe sources directly in the public cookbook repo. Implementation depth (RL internals, weight-sync state machine, hotload triage) lives in the skills/dev/ skill.
## Async RL (experimental)
training.recipes.async_rl_loop overlaps rollout and training so sampling and gradient steps run on separate workers concurrently — the trainer no longer waits for a full batch of rollouts before stepping. You write a single async function:
```python
async def rollout_fn(sample_prompt: dict) -> RolloutSample | None:
    ...
```
The recipe handles everything else: the outer loop, batching, weight sync, off-policy staleness bounds, and the rollout/train scheduler. No backward-compatibility guarantee — config fields and the rollout protocol may change between releases. For the full rollout_fn / RolloutSample contract, scheduler details, and tuning guidance, see the cookbook’s skills/dev/references/rl/async-rl.md skill.
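For orientation only, a rollout_fn might look like the sketch below. The RolloutSample import path and field names are assumptions, and sampler/score are reused from the GRPO loop above — the actual contract lives in the skill referenced above:

```python
from training.recipes.async_rl_loop import RolloutSample  # assumed import path

async def rollout_fn(sample_prompt: dict) -> RolloutSample | None:
    messages = [m for m in sample_prompt["messages"] if m.get("role") != "assistant"]
    # sampler and score as in the GRPO training loop (user-supplied).
    completions = await sampler.sample_with_tokens(messages=messages, n=4, max_tokens=512)
    rewards = [score(c) for c in completions]
    if len(set(rewards)) == 1:
        return None  # uniform rewards: skip this prompt
    # Field names here are hypothetical, not the recipe's actual contract.
    return RolloutSample(completions=completions, rewards=rewards)
```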