What this is

Prompt rendering, labels, and reward functions are part of your algorithm. Treat dataset construction and evaluation logic as versioned experiment code.

Dataset formats

GRPO: prompt + ground truth for reward scoring

{"messages": [{"role": "user", "content": "What is 15 + 27?"}], "ground_truth": "42"}
{"messages": [{"role": "user", "content": "Solve: 3x = 12"}], "ground_truth": "4"}
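
Because GRPO scores sampled completions against ground_truth, this format pairs naturally with a reward function. A minimal sketch, assuming a numeric-answer task like the rows above (exact_match_reward is a hypothetical function name, not a cookbook API):

```python
import re

def exact_match_reward(completion: str, ground_truth: str) -> float:
    """Score 1.0 if the last number in the completion equals the
    ground truth, else 0.0. Hypothetical reward for numeric tasks."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0
```

Keep reward logic this explicit and versioned; it is part of the experiment, per the note above.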

DPO: preference pairs (two accepted row formats)

{"chosen": {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "good"}]},
 "rejected": {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "bad"}]}}
{"input": {"messages": [{"role": "user", "content": "..."}]},
 "preferred_output": [{"role": "assistant", "content": "good"}],
 "non_preferred_output": [{"role": "assistant", "content": "bad"}]}
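
Both encodings carry the same information. A sketch of normalizing either row shape into a single (chosen, rejected) pair of conversations (normalize_pair is a hypothetical helper; load_preference_dataset, shown below, handles this for you):

```python
def normalize_pair(row: dict) -> tuple[list, list]:
    """Return (chosen_messages, rejected_messages) from either DPO
    row format. Hypothetical sketch of the normalization."""
    if "chosen" in row:  # format 1: full chosen/rejected conversations
        return row["chosen"]["messages"], row["rejected"]["messages"]
    # format 2: shared input plus two candidate assistant outputs
    prompt = row["input"]["messages"]
    return prompt + row["preferred_output"], prompt + row["non_preferred_output"]
```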

SFT: instruction/response conversations

{"messages": [{"role": "user", "content": "Translate hello"}, {"role": "assistant", "content": "Bonjour"}]}
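
A quick structural check catches SFT rows that would confuse a chat template. This validator is a hypothetical sketch, not a cookbook API:

```python
def valid_sft_row(row: dict) -> bool:
    """Hypothetical sanity check for an SFT row: known roles, non-empty
    content, and a final assistant turn to learn from."""
    messages = row.get("messages", [])
    if not messages or messages[-1]["role"] != "assistant":
        return False
    return all(
        m.get("role") in ("system", "user", "assistant") and m.get("content")
        for m in messages
    )
```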

Loading datasets in training scripts

Use the cookbook data helpers from training.utils:
from training.utils import (
    load_jsonl_dataset,
    load_preference_dataset,
)

# GRPO/SFT-style JSONL (local path or URL)
rows = load_jsonl_dataset("/path/to/grpo_or_sft.jsonl", max_rows=1000)

# Also works with URLs
rows = load_jsonl_dataset("https://example.com/dataset.jsonl", max_rows=1000)

# DPO preference pairs (supports multiple formats: chosen/rejected, preferred_output/non_preferred_output, samples with scores)
pairs = load_preference_dataset("/path/to/dpo_pairs.jsonl", max_pairs=5000)

Additional data utilities:
from training.utils import (
    compute_advantages,
    find_common_prefix_length,
    extract_text,
)

advantages = compute_advantages([1.0, 0.0, 0.5, 0.8])
prefix_len = find_common_prefix_length(chosen_tokens, rejected_tokens)
text = extract_text({"messages": [{"role": "user", "content": "hello"}]})
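
The semantics of two of these helpers can be sketched in plain Python. These are illustrative reimplementations under assumed semantics, not the cookbook source; compute_advantages in particular may differ in normalization details:

```python
from statistics import mean, pstdev

def compute_advantages_sketch(rewards):
    """Group-relative advantages as commonly used in GRPO:
    mean-centered, std-normalized rewards. Assumed semantics."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mu) / sigma for r in rewards]

def find_common_prefix_length_sketch(a, b):
    """Length of the shared token prefix, e.g. the shared prompt
    tokens in a chosen/rejected pair."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```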

Tokenization

Tokenization is client-side — your local machine converts text to token IDs before sending them to the remote trainer.
The tokenizer must match the base model you’re fine-tuning. If you’re training qwen3-8b, use Qwen/Qwen3-8B. A mismatched tokenizer will produce incorrect token IDs and silently corrupt training. Tokenizers are lightweight (a few MB) even for very large models — only the tokenizer vocabulary is downloaded, not model weights.

GRPO (sampling path)

GRPO uses DeploymentSampler, which tokenizes prompts locally with a HuggingFace tokenizer and sends token IDs to the deployment (token-in/token-out). Set deployment.tokenizer_model in GRPO config.
from transformers import AutoTokenizer
from fireworks.training.sdk import DeploymentSampler

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
sampler = DeploymentSampler(
    inference_url="https://api.fireworks.ai",
    model="accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    api_key="<FIREWORKS_API_KEY>",
    tokenizer=tokenizer,
)

DPO and SFT (dataset preprocessing path)

DPO and SFT recipes also tokenize locally with AutoTokenizer using cfg.tokenizer_model before building training datums.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(cfg.tokenizer_model, trust_remote_code=True)
token_ids = tokenizer.encode(text_or_chat_formatted_text)

Optional trainer tokenizer endpoint

The RLOR trainer exposes a tokenizer endpoint (encode_text) for custom workflows, but it is not the default path used by GRPO/DPO/SFT cookbook loops.
from training.utils import encode_text

tokens = encode_text(base_url=trainer_endpoint, text="Hello, world!")

Building training datums

Use datum_from_tokens_weights from tinker_cookbook to construct properly formatted datums with token weights:
import torch
from tinker_cookbook.supervised.common import datum_from_tokens_weights

# Token IDs for the full sequence (prompt followed by response)
tokens = torch.tensor(prompt_token_ids + response_token_ids, dtype=torch.long)

# Weights: 0 for prompt tokens (no loss), 1 for response tokens
prompt_len = len(prompt_token_ids)
weights = torch.zeros(len(tokens), dtype=torch.float32)
weights[prompt_len:] = 1.0

datum = datum_from_tokens_weights(tokens, weights, max_length=8192)

The datum_from_tokens_weights function handles internal token shifting — you don’t need to manually offset tokens.

Operational guidance

  • Version your datasets alongside your training scripts for reproducibility.
  • Use a fixed evaluation set across experiments to compare model quality.
  • Validate data before training: check for empty texts, overly long sequences, and malformed JSON.
  • Pre-tokenize if possible to avoid repeated tokenizer calls during training.
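
The validation bullet above can be automated with a small pre-flight pass. validate_jsonl and its character threshold are hypothetical; tune the limit to your model's context length:

```python
import json

def validate_jsonl(path: str, max_chars: int = 32768) -> list:
    """Hypothetical pre-flight check: returns (line_number, problem)
    pairs for malformed JSON, empty contents, and overlong rows."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((i, f"malformed JSON: {e}"))
                continue
            for m in row.get("messages", []):
                if not m.get("content"):
                    problems.append((i, "empty message content"))
            if len(line) > max_chars:  # rough proxy for token length
                problems.append((i, "row is suspiciously long"))
    return problems
```

Run it on every dataset before launching a job; failing fast locally is far cheaper than a corrupted training run.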