What this is
Prompt rendering, labels, and reward functions are part of your algorithm. Treat dataset construction and evaluation logic as versioned experiment code.
GRPO: prompt + ground truth for reward scoring
{"messages": [{"role": "user", "content": "What is 15 + 27?"}], "ground_truth": "42"}
{"messages": [{"role": "user", "content": "Solve: 3x = 12"}], "ground_truth": "4"}
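During GRPO training, each sampled completion is scored against the row's ground_truth. The reward function itself is part of your experiment code; a minimal sketch under the assumption of exact-answer matching (the function name and last-token heuristic are illustrative, not a cookbook API):

```python
def exact_match_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the completion's final answer matches ground truth.

    Compares only the last whitespace-separated token, so reasoning text
    before the answer does not break the match.
    """
    stripped = completion.strip()
    answer = stripped.split()[-1] if stripped else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

row = {"messages": [{"role": "user", "content": "What is 15 + 27?"}],
       "ground_truth": "42"}
print(exact_match_reward("15 + 27 = 42", row["ground_truth"]))  # 1.0
print(exact_match_reward("The answer is 41", row["ground_truth"]))  # 0.0
```

Real reward functions are usually more tolerant (numeric parsing, answer extraction), but the contract is the same: completion plus ground_truth in, scalar reward out.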
DPO: preference pairs
{"chosen": {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "good"}]},
"rejected": {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "bad"}]}}
{"input": {"messages": [{"role": "user", "content": "..."}]},
"preferred_output": [{"role": "assistant", "content": "good"}],
"non_preferred_output": [{"role": "assistant", "content": "bad"}]}
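Since both layouts encode the same information, it can help to see how they map onto one another. A hypothetical normalizer that converts either layout into (prompt, chosen, rejected) message lists (this function is illustrative and not part of training.utils; load_preference_dataset handles the format dispatch for you):

```python
def normalize_pair(row: dict) -> tuple[list, list, list]:
    """Return (prompt_messages, chosen_messages, rejected_messages)."""
    if "chosen" in row and "rejected" in row:
        chosen = row["chosen"]["messages"]
        rejected = row["rejected"]["messages"]
        # The prompt is the shared non-assistant prefix; the responses
        # are the assistant turns of each conversation.
        prompt = [m for m in chosen if m["role"] != "assistant"]
        return (prompt,
                [m for m in chosen if m["role"] == "assistant"],
                [m for m in rejected if m["role"] == "assistant"])
    if "preferred_output" in row:
        return (row["input"]["messages"],
                row["preferred_output"],
                row["non_preferred_output"])
    raise ValueError("unrecognized preference row format")
```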
SFT: instruction/response conversations
{"messages": [{"role": "user", "content": "Translate hello"}, {"role": "assistant", "content": "Bonjour"}]}
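All three formats are plain JSONL, one JSON object per line, so files can be generated with json.dumps. A minimal sketch for writing SFT rows (the filename is illustrative):

```python
import json

rows = [
    {"messages": [
        {"role": "user", "content": "Translate hello"},
        {"role": "assistant", "content": "Bonjour"},
    ]},
]

with open("sft.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        # ensure_ascii=False keeps non-ASCII responses readable in the file
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```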
Loading datasets in training scripts
Use the cookbook data helpers from training.utils:
from training.utils import (
load_jsonl_dataset,
load_preference_dataset,
)
# GRPO/SFT-style JSONL from a local path
rows = load_jsonl_dataset("/path/to/grpo_or_sft.jsonl", max_rows=1000)
# Also works with URLs
rows = load_jsonl_dataset("https://example.com/dataset.jsonl", max_rows=1000)
# DPO preference pairs (supports multiple formats: chosen/rejected, preferred_output/non_preferred_output, samples with scores)
pairs = load_preference_dataset("/path/to/dpo_pairs.jsonl", max_pairs=5000)
Additional data utilities:
from training.utils import (
compute_advantages,
find_common_prefix_length,
extract_text,
)
advantages = compute_advantages([1.0, 0.0, 0.5, 0.8])
prefix_len = find_common_prefix_length(chosen_tokens, rejected_tokens)
text = extract_text({"messages": [{"role": "user", "content": "hello"}]})
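For intuition, here are plausible minimal versions of two of these helpers. These are hedged sketches, not the cookbook's actual implementations, which may differ (for example, in whether advantages are also divided by the group's standard deviation):

```python
def compute_advantages_sketch(rewards: list[float]) -> list[float]:
    """Mean-center rewards within a sampling group, GRPO-style.

    Completions above the group mean get positive advantages,
    those below get negative ones.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def find_common_prefix_length_sketch(a: list[int], b: list[int]) -> int:
    """Length of the shared leading run of token IDs in two sequences,
    e.g. the common prompt prefix of chosen and rejected tokenizations."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

print(compute_advantages_sketch([1.0, 0.0, 0.5, 0.8]))
print(find_common_prefix_length_sketch([1, 2, 3, 9], [1, 2, 4, 9]))  # 2
```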
Tokenization
Tokenization is client-side — your local machine converts text to token IDs before sending them to the remote trainer.
The tokenizer must match the base model you’re fine-tuning. If you’re training qwen3-8b, use Qwen/Qwen3-8B. A mismatched tokenizer will produce incorrect token IDs and silently corrupt training. Tokenizers are lightweight (a few MB) even for very large models — only the tokenizer vocabulary is downloaded, not model weights.
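One cheap guard against a silent mismatch is a round-trip check before training starts: encode a probe string and confirm that decoding recovers it. A hedged sketch that works with any HuggingFace-style tokenizer object (the function name is illustrative; this catches gross mismatches, not subtle ones):

```python
def check_tokenizer_roundtrip(tokenizer, probe: str = "Hello, world! 15 + 27 = 42") -> None:
    """Fail fast if encode/decode does not round-trip the probe string.

    Catches gross problems (wrong vocabulary, wrong model family); it is
    not a proof that the tokenizer matches the base model exactly.
    """
    ids = tokenizer.encode(probe)
    decoded = tokenizer.decode(ids)
    # Substring check, since decoded text may carry special tokens around it.
    if probe not in decoded:
        raise ValueError(
            f"tokenizer round-trip failed: {probe!r} -> {ids[:8]}... -> {decoded!r}"
        )
```

Usage would look like check_tokenizer_roundtrip(AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")) right after loading the tokenizer.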
GRPO (sampling path)
GRPO uses DeploymentSampler, which tokenizes prompts locally with a HuggingFace tokenizer and sends token IDs to the deployment (token-in/token-out). Set deployment.tokenizer_model in GRPO config.
from transformers import AutoTokenizer
from fireworks.training.sdk import DeploymentSampler
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
sampler = DeploymentSampler(
inference_url="https://api.fireworks.ai",
model="accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
api_key="<FIREWORKS_API_KEY>",
tokenizer=tokenizer,
)
DPO and SFT (dataset preprocessing path)
DPO and SFT recipes also tokenize locally with AutoTokenizer using cfg.tokenizer_model before building training datums.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(cfg.tokenizer_model, trust_remote_code=True)
token_ids = tokenizer.encode(text_or_chat_formatted_text)
Optional trainer tokenizer endpoint
The RLOR trainer exposes a tokenizer endpoint (encode_text) for custom workflows, but it is not the default path used by GRPO/DPO/SFT cookbook loops.
from training.utils import encode_text
tokens = encode_text(base_url=trainer_endpoint, text="Hello, world!")
Building training datums
Use datum_from_tokens_weights from tinker_cookbook to construct properly formatted datums with token weights:
import torch
from tinker_cookbook.supervised.common import datum_from_tokens_weights
# Token IDs for the full sequence (prompt + response)
tokens = torch.tensor([101, 2054, 2003, ...], dtype=torch.long)
# Weights: 0 for prompt tokens, 1 for response tokens
weights = torch.zeros(len(tokens), dtype=torch.float32)
weights[prompt_len:] = 1.0
datum = datum_from_tokens_weights(tokens, weights, max_length=8192)
The datum_from_tokens_weights function handles internal token shifting — you don’t need to manually offset tokens.
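A common way to obtain prompt_len in the snippet above is to encode the prompt portion separately and take its length. A hedged helper sketch in pure Python (independent of any particular tokenizer; the function name is illustrative):

```python
def response_weights(prompt_ids: list[int], full_ids: list[int]) -> list[float]:
    """Zero weight over the prompt prefix, 1.0 over the response tail.

    Assumes full_ids starts with prompt_ids, i.e. the tokenizer produced
    the same prefix when encoding the prompt alone and prompt + response.
    """
    if full_ids[: len(prompt_ids)] != prompt_ids:
        raise ValueError("full sequence does not start with the prompt tokens")
    return [0.0] * len(prompt_ids) + [1.0] * (len(full_ids) - len(prompt_ids))
```

If the tokenizer does not guarantee prefix stability (some chat templates insert tokens between turns), locating the boundary with find_common_prefix_length from training.utils can be a safer choice.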
Operational guidance
- Version your datasets alongside your training scripts for reproducibility.
- Use a fixed evaluation set across experiments to compare model quality.
- Validate data before training: check for empty texts, overly long sequences, and malformed JSON.
- Pre-tokenize if possible to avoid repeated tokenizer calls during training.