# Cookbook: Reinforcement Learning

> GRPO training with policy/reference trainers, reward scoring, and serving weight sync via cookbook recipes.

## What this is

This guide walks through GRPO (Group Relative Policy Optimization) training using the cookbook's `rl_loop` recipe. GRPO samples multiple completions per prompt, scores them with a reward function, and uses group reward statistics for policy gradient updates.
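
To make "group reward statistics" concrete, the sketch below shows one common way to turn a group's rewards into advantages: subtract the group mean and divide by the group standard deviation. The function name `group_relative_advantages` and the `eps` guard are illustrative assumptions, not part of the cookbook's API, and the recipe's actual normalization may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of completions.

    Hypothetical helper for illustration only.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps avoids division by zero when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed rewards: correct completions get positive advantages, wrong ones negative.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform rewards: every advantage is zero, so the prompt yields no gradient.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When every completion in a group receives the same reward, all advantages collapse to zero, which is why such prompts carry no learning signal.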

<Tip>
  For agentic RL workloads, the cookbook also supports an async RL loop that overlaps rollouts and training. Use it when agent trajectories or tool calls make rollout latency the bottleneck.
</Tip>

## Architecture

The RL recipe always uses a **policy trainer** plus an **inference deployment**. Add a **reference trainer** when your setup needs reference logprobs:

| Component             | Role                                                                                                               |
| --------------------- | ------------------------------------------------------------------------------------------------------------------ |
| **Policy trainer**    | Trainable model — runs `forward_backward_custom` + `optim_step`                                                    |
| **Reference trainer** | Optional frozen copy — provides KL/reference logprobs (`--forward-only`) when `infra.ref_training_shape_id` is set |
| **Deployment**        | Samples completions via `DeploymentSampler` (client-side tokenized)                                                |

```mermaid theme={null}
flowchart LR
  loop[Your Python Loop] -->|sample completions| deployment[Inference Deployment]
  deployment -->|completions| loop
  loop -->|forward only when configured| refTrainer[Reference Trainer optional]
  refTrainer -->|ref logprobs| loop
  loop -->|forward_backward_custom + optim_step| policyTrainer[Policy Trainer]
  policyTrainer -->|save_weights_for_sampler_ext| deployment
```

## Using the recipe

The simplest way to run GRPO is via the cookbook's `Config` + `main`:

```python theme={null}
from training.recipes.rl_loop import Config, main
from training.utils import DeployConfig, InfraConfig, WeightSyncConfig, WandBConfig

cfg = Config(
    log_path="./grpo_logs",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="/path/to/gsm8k.jsonl",
    max_rows=200,
    epochs=1,
    completions_per_prompt=4,
    max_completion_tokens=1024,
    temperature=1.0,
    max_seq_len=4096,
    policy_loss="grpo",  # or "importance_sampling", "reinforce", "dapo", "dro", "gspo", "cispo"
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
        ref_training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200-forward",
    ),
    deployment=DeployConfig(
        deployment_id="grpo-serving",
        tokenizer_model="Qwen/Qwen3-8B",
    ),
    weight_sync=WeightSyncConfig(weight_sync_interval=1),
    wandb=WandBConfig(entity="my-team", project="grpo-experiment"),
)

main(cfg)
```

The recipe handles resource provisioning, rollout scheduling, reference logprobs, checkpointing, and cleanup automatically.

### Policy loss variants

| `policy_loss`           | Description                                       |
| ----------------------- | ------------------------------------------------- |
| `"grpo"`                | REINFORCE + KL penalty (default)                  |
| `"importance_sampling"` | Off-policy ratio weighting with optional clipping |
| `"reinforce"`           | Vanilla REINFORCE                                 |
| `"dapo"`                | Dynamic advantage with asymmetric PPO clipping    |
| `"dro"`                 | Distributionally robust off-policy objective      |
| `"gspo"`                | Sequence-level clipped PPO                        |
| `"cispo"`               | Clipped importance sampling policy optimization   |

## Step-by-step (API-level)

For teams that need full control beyond what the recipe provides, here is the API-level flow.

### Provision resources with `setup_infra`

`training.utils.rl.setup_infra` is the cookbook's single entrypoint for shape
resolution, trainer/deployment provisioning, weight sync wiring, and
trainer/deployment re-attach. It requests the policy trainer first, links the
deployment, then waits for readiness in parallel. Recipes pass a config + two booleans
(`needs_reference`, `needs_inference`) and get back an `Infra` bundle of wired
trainer clients. Teams that fork `training/recipes/rl_loop.py` should reuse
`setup_infra` rather than re-wiring the lower-level helpers below.

```python theme={null}
import os
import transformers
from fireworks.training.sdk import (
    TrainerJobManager, DeploymentManager, DeploymentSampler, WeightSyncer,
    AdaptiveConcurrencyController,
)
from training.utils import (
    InfraConfig, DeployConfig, ResourceCleanup, WeightSyncScope,
)
from training.utils.rl import setup_infra

api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")

rlor_mgr = TrainerJobManager(api_key=api_key, base_url=base_url)
deploy_mgr = DeploymentManager(api_key=api_key, base_url=base_url)

base_model = "accounts/fireworks/models/qwen3-8b"
infra_cfg = InfraConfig(
    training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ref_training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200-forward",
)
deploy_cfg = DeployConfig(
    deployment_id="grpo-serving",
    tokenizer_model="Qwen/Qwen3-8B",
    weight_sync_scope=WeightSyncScope.PER_TRAINER,  # default
)

with ResourceCleanup(rlor_mgr, deploy_mgr) as cleanup:
    infra = setup_infra(
        rlor_mgr=rlor_mgr,
        deploy_mgr=deploy_mgr,
        base_model=base_model,
        infra_cfg=infra_cfg,
        deploy_cfg=deploy_cfg,
        lora_rank=0,
        needs_reference=True,   # KL baseline
        needs_inference=True,   # rollouts
        role_prefix="grpo",
        api_key=api_key,
        cleanup=cleanup,        # scope-exit: cancel trainers, scale deployment to 0
    )

    # Keep everything that uses these clients (including the training loop)
    # inside the `with` block: leaving it cancels the trainers and scales the
    # deployment to 0.
    policy = infra.policy          # ReconnectableClient (policy trainer)
    reference = infra.reference    # ReconnectableClient (forward-only) or LoRA shared handle
    inference_model = infra.inference_model

    tokenizer = transformers.AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
    sampler = DeploymentSampler(
        inference_url=deploy_mgr.inference_url,
        model=inference_model,
        api_key=api_key,
        tokenizer=tokenizer,
        concurrency_controller=AdaptiveConcurrencyController(initial_window=16),
    )
```

See [Weight sync](/fine-tuning/training-api/cookbook/weight-sync) for `WeightSyncScope.PER_TRAINER` (default) vs `PER_DEPLOYMENT`. For the full `setup_infra` contract, lower-level building blocks, and implementation rationale, see the cookbook's dev skill: [`skills/dev/`](https://github.com/fw-ai/cookbook/blob/main/skills/dev/SKILL.md).

### Training loop

```python theme={null}
import asyncio

import tinker  # provides AdamParams used in optim_step below

tracker = WeightSyncer(
    policy_client=policy.inner,
    deploy_mgr=deploy_mgr,
    deployment_id="grpo-serving",
    base_model=base_model,
    hotload_timeout=600,
    first_checkpoint_type="base",
)

for step, row in enumerate(dataset):
    # Keep only non-assistant messages as the prompt.
    input_messages = [m for m in row["messages"] if m.get("role") != "assistant"]
    completions = asyncio.run(
        sampler.sample_with_tokens(messages=input_messages, n=4, max_tokens=512)
    )
    rewards = [score(c) for c in completions]
    # Uniform rewards carry no group-relative signal, so skip the prompt.
    if len(set(rewards)) == 1:
        continue

    # Reference logprobs for the KL term (forward pass only).
    datums = build_grpo_datums(completions)
    ref_fwd = reference.forward(datums, "cross_entropy")
    ref_logprobs = [list(x["logprobs"].data) for x in ref_fwd.loss_fn_outputs]

    # Policy update: custom GRPO loss, then one optimizer step.
    loss_fn = make_grpo_loss_fn(rewards, ref_logprobs, kl_beta=0.001)
    policy.forward_backward_custom(datums, loss_fn)
    policy.optim_step(
        tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01)
    )

    # Save a checkpoint and hotload it into the deployment.
    tracker.save_and_hotload(f"step-{step:05d}")
```

See [Loss Functions](/fine-tuning/training-api/loss-functions) for `make_grpo_loss_fn` and `build_grpo_datums` implementations.
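
The loop above calls a user-supplied `score` function that this page does not define. As a rough illustration for a math dataset such as GSM8K, the sketch below returns a binary reward by comparing the last number in the completion text against an expected answer. The name `score_numeric_answer`, its two-argument signature, and the assumption that you can read the completion as a plain string are all hypothetical; adapt it to whatever object `sample_with_tokens` returns in your setup.

```python
import re


def score_numeric_answer(completion_text: str, expected: str) -> float:
    """Return 1.0 if the last number in the completion equals `expected`, else 0.0.

    Hypothetical rule-based reward for illustration only.
    """
    # Strip thousands separators so "1,234" is read as 1234.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion_text.replace(",", ""))
    if not numbers:
        return 0.0
    try:
        return 1.0 if float(numbers[-1]) == float(expected) else 0.0
    except ValueError:
        # `expected` was not a parseable number.
        return 0.0


reward_correct = score_numeric_answer("So the total is 42.", "42")
reward_wrong = score_numeric_answer("So the total is 41.", "42")
```

A binary reward like this produces many uniform groups on very easy or very hard prompts, which the loop skips.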

## Pipeline overlap

Sampling and training overlap within **policy windows** controlled by `weight_sync_interval`. All prompts in a window sample concurrently; results train as they arrive. At window boundaries the pipeline drains, weights sync to the deployment, and the next window samples against the updated weights.

| `weight_sync_interval` | Behavior                                                                                                                          |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `1` (default)          | No overlap — sample, train, sync, repeat                                                                                          |
| `N > 1`                | N-step windows with overlap inside, sync at boundaries                                                                            |
| `0`                    | No syncs — the deployment keeps the base weights for the entire run. Useful for debugging or ablations, not standard RL training. |
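
The table can be read as a simple rule for when a sync happens. The sketch below assumes window boundaries fall after every `weight_sync_interval` completed training steps, counted from 1; `should_sync` is a hypothetical helper for reasoning about the schedule, not the recipe's scheduler.

```python
def should_sync(step: int, weight_sync_interval: int) -> bool:
    """Return True if weights sync after completing 1-indexed training `step`.

    Hypothetical helper; assumes boundaries at multiples of the interval.
    """
    if weight_sync_interval <= 0:
        return False  # interval 0: the deployment keeps the base weights
    return step % weight_sync_interval == 0


# With an interval of 4, steps 1-3 and 5-7 train against stale weights.
sync_points = [s for s in range(1, 9) if should_sync(s, 4)]
```

Larger intervals trade fresher samples for more overlap: completions sampled late in a window were generated by weights that are up to `N - 1` steps old.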

## Operational guidance

* **`deployment.tokenizer_model` is required** — the API raises `ValueError` if not set.
* **Set `infra.training_shape_id`** — training shapes are the launch path for cookbook trainers.
* **Set `infra.ref_training_shape_id` when you want a reference trainer** — if it is unset, the recipe skips reference-model provisioning entirely.
* **Skip prompts with uniform rewards** (all correct or all wrong) — they provide no learning signal.
* **Track reward distributions and KL** every step to catch objective drift early.
* **When configured, the reference trainer uses `--forward-only`** — never call `optim_step` on it.
* **Sampling is async under the hood**: `DeploymentSampler.sample_with_tokens()` issues `n` concurrent `n=1` requests, so synchronous scripts should wrap it with `asyncio.run(...)`.
* **DCP checkpoints are disabled by default** (`dcp_save_interval=0`). If you need to resume training from a checkpoint, explicitly set `dcp_save_interval` to a positive value in your `WeightSyncConfig`.
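
For the tracking advice above, a minimal sketch of per-step metrics follows. It summarizes the reward spread and estimates KL as the mean difference between policy and reference logprobs on the sampled tokens, which is one simple estimator among several. The function `step_metrics` and its flat-list inputs are assumptions for illustration; the cookbook may log different quantities.

```python
from statistics import mean, pstdev


def step_metrics(
    rewards: list[float],
    policy_logprobs: list[float],
    ref_logprobs: list[float],
) -> dict[str, float]:
    """Summarize one training step. Hypothetical helper for illustration only."""
    if not rewards or not policy_logprobs:
        raise ValueError("rewards and logprobs must be non-empty")
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("policy and reference logprobs must have equal length")
    # Mean log-ratio on sampled tokens: a simple estimate of KL(policy || reference).
    kl_estimate = mean([p - r for p, r in zip(policy_logprobs, ref_logprobs)])
    return {
        "reward_mean": mean(rewards),
        "reward_std": pstdev(rewards),
        "kl_estimate": kl_estimate,
    }


metrics = step_metrics(
    rewards=[1.0, 0.0],
    policy_logprobs=[-1.0, -2.0],
    ref_logprobs=[-1.5, -2.5],
)
```

A reward mean that rises while the KL estimate grows steadily is a typical sign that the policy is drifting away from the reference faster than intended.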

## Common pitfalls

* **Reward normalization bugs** can destabilize GRPO updates quickly — verify advantage computation.
* **Reference/policy tokenizer mismatch** invalidates KL estimates — always use the same `base_model`.
* **Logprob alignment**: the trainer returns N-1 logprobs for N tokens, while inference returns N logprobs, the first of which is `None`. Drop that leading entry with `inference[1:]` so both sequences have N-1 values and line up token by token.
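
The alignment rule in the last bullet can be sketched as follows. The helper `align_logprobs` is hypothetical and works on plain Python lists; it assumes the shapes described above (N-1 trainer logprobs, N inference logprobs with a leading `None`).

```python
from typing import Optional, Sequence


def align_logprobs(
    trainer_logprobs: Sequence[float],
    inference_logprobs: Sequence[Optional[float]],
) -> list[tuple[float, Optional[float]]]:
    """Pair trainer and inference logprobs token by token.

    Hypothetical helper for illustration only.
    """
    if not inference_logprobs or inference_logprobs[0] is not None:
        raise ValueError("expected inference logprobs to start with None")
    shifted = inference_logprobs[1:]  # drop the leading None
    if len(shifted) != len(trainer_logprobs):
        raise ValueError("length mismatch after alignment")
    return list(zip(trainer_logprobs, shifted))


# 4 tokens: the trainer yields 3 logprobs, inference yields 4 with a leading None.
pairs = align_logprobs([-0.2, -1.1, -0.7], [None, -0.25, -1.0, -0.6])
```

Failing loudly on a length mismatch is deliberate: a silent off-by-one here skews importance ratios and KL terms without raising any error downstream.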

## Related RL recipes

This page covers `rl_loop` (GRPO and its `policy_loss` variants — `dapo`, `gspo`, `cispo`, `dro`, `importance_sampling`, `reinforce`). One sibling RL recipe ships in the cookbook alongside it:

* **IGPO** (`training.recipes.igpo_loop`) — Information Gain Policy Optimization for multi-turn agent trajectories. Adds turn-level IG rewards on top of the GRPO machinery; same `policy_loss` variants apply.

For runnable examples and rationale, see the recipe sources directly in the public [cookbook repo](https://github.com/fw-ai/cookbook/tree/main/training/recipes). Implementation depth (RL internals, weight-sync state machine, hotload triage) lives in the [`skills/dev/`](https://github.com/fw-ai/cookbook/tree/main/skills/dev) skill.

## Async RL *(experimental)*

`training.recipes.async_rl_loop` overlaps rollout and training so sampling and gradient steps run on separate workers concurrently — the trainer no longer waits for a full batch of rollouts before stepping. You write a single async function:

```python theme={null}
async def rollout_fn(sample_prompt: dict) -> RolloutSample | None:
    ...
```

The recipe handles everything else: the outer loop, batching, weight sync, off-policy staleness bounds, and the rollout/train scheduler. **No backward-compatibility guarantee** — config fields and the rollout protocol may change between releases. For the full `rollout_fn` / `RolloutSample` contract, scheduler details, and tuning guidance, see the cookbook's [`skills/dev/references/rl/async-rl.md`](https://github.com/fw-ai/cookbook/blob/main/skills/dev/references/rl/async-rl.md) skill.

## Related guides

* [Cookbook DPO](/fine-tuning/training-api/cookbook/dpo) — preference optimization
* [Cookbook Reference](/fine-tuning/training-api/cookbook/reference) — all config classes
* [Loss Functions](/fine-tuning/training-api/loss-functions) — API-level loss function details
