For guidance on when to change these parameters, see the Parameter Tuning guide.
Training parameters
| Flag | Default | Valid range | When to change |
|---|---|---|---|
| --epochs | 1 | 1 – 10 (whole numbers only) | Add 1–2 more passes if the reward is still climbing steadily near the end of training. Too many epochs risk over-fitting. |
| --batch-size | 32k tokens | Hardware-bounded | Lower if you hit OOM; raise only when GPUs have >30% headroom. |
| --learning-rate | 1e-4 | 1e-5 – 5e-4 | Decrease when the reward spikes then collapses; increase when the curve plateaus too early. |
| --lora-rank | 8 | 4 – 128 (powers of 2) | Higher ranks give more capacity but require more GPU memory; stay ≤64 unless you have high-end GPUs. |
| --max-context-length | 8192 tokens | Up to the model limit | Raise only when your prompts truncate; remember that attention compute grows quadratically with sequence length. |
Example usage
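A minimal sketch of a training invocation using the flags from the table above. The command name `rl-train` is hypothetical (the actual entrypoint is not shown in this reference); substitute your trainer's real command, and treat the values as illustrative picks from the stated ranges.

```bash
# Hypothetical entrypoint shown for illustration; flag names match the table above.
# Two passes over the data, a slightly higher LoRA rank, and a conservative learning rate.
rl-train \
  --epochs 2 \
  --learning-rate 5e-5 \
  --lora-rank 16 \
  --max-context-length 8192
```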
Rollout (sampling) parameters
During each training step, the model generates multiple responses with stochastic decoding. These parameters control that generation process.

| Field | CLI flag | Default | Recommended range | Why it matters |
|---|---|---|---|---|
| Maximum tokens | --inference-max-tokens | 2,048 | 16 – 16,384 | Longer responses improve reward on summarisation / story tasks but add cost. |
| Temperature | --inference-temperature | 0.7 | 0.1 – 2.0 (must be > 0) | Values below 0.1 converge towards greedy decoding and kill exploration; 0.5–1.0 is a sweet spot for RLHF. |
| Top-p | --inference-top-p | 1.0 | 0 – 1 | Lower to 0.2–0.5 to clamp long-tail tokens when the reward penalises hallucinations. |
| Top-k | --inference-top-k | 40 | 0 – 100 (0 = off) | Combine with temperature for more creative exploration; keep ≤50 for latency. |
| n (choices) | --inference-n | 4 | 2 – 8 | Policy optimization needs multiple candidates to compute a meaningful KL term; ≥2 is mandatory. |
| Extra body JSON | --inference-extra-body | empty | valid JSON | Pass extra OpenAI-style params (e.g., stop, logit_bias). Invalid JSON is rejected. |
Example usage
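A minimal sketch of a rollout configuration using the sampling flags from the table above. As before, the `rl-train` command name is hypothetical; the flags and value ranges come from this reference.

```bash
# Hypothetical entrypoint; flag names match the table above.
# Four candidates per prompt, moderate temperature, clamped nucleus sampling,
# and an extra OpenAI-style stop sequence passed through as JSON.
rl-train \
  --inference-n 4 \
  --inference-max-tokens 1024 \
  --inference-temperature 0.7 \
  --inference-top-p 0.9 \
  --inference-top-k 40 \
  --inference-extra-body '{"stop": ["###"]}'
```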
Quick reference by goal
| Goal | Parameters to adjust |
|---|---|
| Faster convergence | ↑ epochs; raise learning-rate, keeping it below 2× the default |
| Safer / less toxic | ↓ temperature, top_p, top_k |
| More creative | temperature ≈ 1 – 1.2, top_p 0.9 |
| Cheaper roll-outs | ↓ n, max_tokens, batch size |
| Higher capacity | ↑ lora-rank, but monitor memory usage |
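As one illustration of the table above, a cheaper-rollout configuration might look like the sketch below. The command name is hypothetical, and the assumption that --batch-size takes a token count mirrors the "32k tokens" default listed earlier.

```bash
# Cheaper roll-outs: fewer candidates, shorter completions, smaller batch.
# Command name is hypothetical; --batch-size value assumes the flag is specified in tokens.
rl-train \
  --inference-n 2 \
  --inference-max-tokens 512 \
  --batch-size 16384
```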
Important constraints
Temperature must be > 0
Greedy sampling (temperature 0) is deterministic and collapses exploration, often leading to mode-dropping and repetitive text.
At least 2 rollouts required
Policy optimization needs multiple candidates per prompt to compute a meaningful KL divergence term. Setting --inference-n 1 will fail.
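To make the two constraints concrete, the first invocation below violates both and is expected to be rejected, while the second satisfies them. The command name remains hypothetical.

```bash
# Rejected: temperature must be > 0 and at least 2 rollouts are required.
rl-train --inference-temperature 0 --inference-n 1

# Accepted: positive temperature and multiple candidates per prompt.
rl-train --inference-temperature 0.7 --inference-n 4
```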