Reinforcement fine-tuning (RFT)
Reinforcement fine-tuning is especially effective for:
- Domain reasoning: Applying domain-specific logic to solve problems.
- Function calling: Understanding when and how to use external tools based on conversation history and tool descriptions.
- Math with reasoning: Solving math problems with logical steps.
- Code generation/fixes: Modifying or generating code by interpreting context and requirements.
RFT works best when you can determine whether a model’s output is “good” or “bad,” even if only approximately.
👉 For more background, check out this blog post on RFT.
1. Design Your Evaluation Strategy
Before creating a dataset, define how you’ll evaluate the quality of model outputs.
Example: Math Solver
- You want a model that outputs step-by-step solutions.
- Evaluating each reasoning step is hard, but checking the final answer is easy.
- So, if the final answer is correct, you assume the reasoning is likely acceptable.
This strategy simplifies evaluation:
- Extract the final answer from the output.
- Compare it to the known ground truth.
- If they match → score = 1.0. If not → score = 0.0.
Be creative and iterate to find the best evaluation method for your task.
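For instance, a minimal sketch of this scoring logic in Python (the `Answer:` convention and the regex are assumptions about your output format; adapt them to your task):

```python
import re

def score_final_answer(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the extracted final answer matches the ground truth, else 0.0."""
    # Assumed convention: the solution ends with a line like "Answer: 42".
    match = re.search(r"Answer:\s*([^\n]+)", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```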
2. Prepare your dataset
Your dataset should be in JSONL format, similar to supervised fine-tuning datasets. Each entry must include a `messages` key containing OpenAI-style chat messages.
Example dataset
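A minimal illustration of the expected shape (the prompts are placeholders):

```jsonl
{"messages": [{"role": "user", "content": "What is 17 + 25?"}]}
{"messages": [{"role": "user", "content": "Solve for x: 2x + 6 = 14. Give the final answer as 'Answer: <value>'."}]}
```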
You can also prefill generations from a base model, even if they’re not perfect—this helps with evaluator development.
Optional metadata
You may include additional fields for use in your evaluator. For example, with math problems, include a `ground_truth` field:
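For instance (an illustrative entry with a prefilled, deliberately wrong generation):

```jsonl
{"messages": [{"role": "user", "content": "What is 17 + 25?"}, {"role": "assistant", "content": "17 + 25 = 41. Answer: 41"}], "ground_truth": "42"}
```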
You can name additional fields arbitrarily and they will all be transparently passed through to your evaluation function. Note: the model’s answer here is incorrect; this is just a test case.
3. Build and iterate on the evaluator
Start simple: use the Web IDE for quick iterations. For more complex use cases, use `reward-kit`.
Navigate to the "Evaluations" tab in your Fireworks dashboard and click "Create Evaluator"; you should see the following page.
On the left side, there is a prefilled template where you can code up your evaluator. On the right, there is a dataset preview that lets you run your evaluator code against a dataset of your choice. The interface is meant for quick debugging. Note that you can call `print` inside the evaluator code and view the output in the console panel.
Example evaluator (math task)
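A minimal sketch of such an evaluator, assuming the `Answer:` convention from the dataset above; the exact function name and signature your evaluator template expects may differ, so treat this as a sketch of the requirements listed below:

```python
def evaluate(messages: list[dict], ground_truth: str, **kwargs) -> dict:
    """Score one rollout: 1.0 if the final answer matches ground_truth, else 0.0."""
    # The last message in the conversation is the model's generated solution.
    model_output = messages[-1]["content"]

    # Assumed convention: the solution ends with a line like "Answer: 42".
    if "Answer:" in model_output:
        predicted = model_output.rsplit("Answer:", 1)[-1].strip()
    else:
        predicted = ""
    correct = predicted == str(ground_truth).strip()

    return {
        "score": 1.0 if correct else 0.0,
        "reason": f"Predicted '{predicted}', expected '{ground_truth}'.",
        "is_score_valid": True,
    }
```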
Evaluator function requirements
- Inputs: The function is called once for each dataset row. It receives the `messages` plus any custom fields (such as `ground_truth`) from your dataset.
- Output: A dictionary with:
  - `score`: Float between 0.0 and 1.0
  - `reason`: (Optional) A string for logging
  - `is_score_valid`: (Optional, defaults to `True`) Flag to skip training on invalid outputs
If the evaluator throws an error or returns invalid data, that sample is skipped during training.
You can optionally include a `metrics` field, a mapping from metric name to `MetricResult`, to report auxiliary metrics. A common practice is to track the individual metrics you care about in `metrics`, while the `score` in the `EvaluateResult` is a weighted average of those metrics and is what is actually used for training.
Example
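A sketch of what that might look like, using plain dictionaries in place of the `MetricResult` / `EvaluateResult` objects (the metric names and weights here are arbitrary; if you use `reward-kit`, construct the corresponding objects instead):

```python
def evaluate(messages: list[dict], ground_truth: str, **kwargs) -> dict:
    """Combine two auxiliary metrics into a single training score."""
    model_output = messages[-1]["content"]

    # Correctness metric: does the expected answer appear in the output?
    answer_correct = 1.0 if str(ground_truth) in model_output else 0.0
    # Formatting metric: did the output include an "Answer:" line?
    has_answer_line = 1.0 if "Answer:" in model_output else 0.0

    return {
        # Only this value must stay in [0.0, 1.0]; it is what training uses.
        "score": 0.8 * answer_correct + 0.2 * has_answer_line,
        "reason": "Weighted combination of correctness and formatting.",
        # Auxiliary metrics are tracked for analysis, not used directly for training.
        "metrics": {
            "answer_correct": {"score": answer_correct, "reason": "Exact-match check."},
            "has_answer_line": {"score": has_answer_line, "reason": "Formatting check."},
        },
    }
```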
Note that only the final `score` needs to be within the [0.0, 1.0] range; individual metric values can be in an arbitrary range.
4. Create an RFT job
You can launch an RFT job directly from the UI.
- Go to the “Fine-Tuning” tab.
- Click “Fine-tune a Model”.
- Select “Reinforcement” as the tuning method.
- Follow the wizard to complete the setup.
5. Monitor training progress
After launching the job, the UI will display:
- Training progress
- Evaluation metrics
- Model checkpoints
6. Deploy and use the model
Once training completes, you can deploy the model like any other LoRA model. Refer to deploying a fine-tuned model for more information.
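For example, once deployed, the model can be queried through Fireworks' OpenAI-compatible chat completions endpoint; a minimal sketch (the model resource name and the API-key environment variable are placeholders for your own deployment):

```python
import os
from openai import OpenAI

# Fireworks exposes an OpenAI-compatible inference endpoint.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],  # placeholder: your Fireworks API key
)

response = client.chat.completions.create(
    # Placeholder: replace with your fine-tuned model's full resource name.
    model="accounts/<your-account>/models/<your-fine-tuned-model>",
    messages=[{"role": "user", "content": "What is 17 + 25?"}],
)
print(response.choices[0].message.content)
```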
Access
As of today, Fireworks accounts should have access to Reinforcement Fine-Tuning through the dashboard. Developer accounts have a default quota of 1 GPU, which should be sufficient for running RFT on models under 10B parameters, capacity permitting.
Hyperparameters for reinforcement fine-tuning
Additional RFT job settings
Most experiments converge with the defaults below. Change them only when you have a clear hypothesis — and record every change in your experiment tracker.
Training-time hyperparameters
| Flag | Default | Valid range | When to change |
|---|---|---|---|
| `--epochs` | 1 | 1 – 10 (whole numbers only) | Add 1-2 more passes if the reward still climbs steadily near the end of training. Too many epochs risks over-fitting. |
| `--batch-size` | 32k tokens | Hardware-bounded | Lower if you hit OOM; raise only when GPUs have >30% headroom. |
| `--learning-rate` | 1e-4 | 1e-5 – 5e-4 | Decrease when the reward spikes then collapses; increase when the curve plateaus too early. |
| `--lora-rank` | 8 | 4 – 128 (powers of 2) | Higher ranks give more capacity but cost VRAM; stay ≤64 unless you have >40 GB per GPU. |
| `--max-context-length` | 8192 tokens | Up to the model limit | Raise only when your prompts truncate; remember that longer sequences consume quadratic compute. |
Roll-out (sampling) parameters
During each Policy-Optimization step, the trainer queries the current policy with stochastic decoding. The UI and the `firectl` CLI enforce the ranges shown here; out-of-bound values throw an "Invalid rollout parameters" error.
| Field | CLI flag | Default | Recommended range | Why it matters |
|---|---|---|---|---|
| Maximum tokens | `--inference-max-tokens` | 2048 | 16 – 16384 | Longer responses improve reward on summarisation / story tasks but add cost. (blog.ml.cmu.edu) |
| Temperature | `--inference-temperature` | 0.7 | 0.1 – 2.0 (> 0 only) | Values below 0.1 converge towards greedy decoding and kill exploration; 0.5–1.0 is a sweet spot for RLHF. (arxiv.org, huyenchip.com) |
| Top-p | `--inference-top-p` | 1.0 | 0 – 1 | Lower to 0.2–0.5 to clamp long-tail tokens when the reward penalises hallucinations. (codefinity.com) |
| Top-k | `--inference-top-k` | 40 | 0 – 100 (0 = off) | Combine with temperature for more creative exploration; keep ≤50 for latency. (medium.com) |
| n (choices) | `--inference-n` | 4 | 2 – 8 | Policy-Optimization needs multiple candidates to compute a meaningful KL term; ≥2 is mandatory. (blog.ml.cmu.edu, rlhfbook.com) |
| Extra body JSON | `--inference-extra-body` | empty | Valid JSON | Pass extra OpenAI-style params (e.g., `stop`, `logit_bias`). Invalid JSON is rejected. |
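For example, a minimal `--inference-extra-body` payload might look like this (the stop sequence and token-bias values are placeholders; `logit_bias` keys are model-specific token IDs):

```json
{"stop": ["\n\nUser:"], "logit_bias": {"50256": -100}}
```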
Practical tips
- Keep temperature > 0 – greedy sampling (temperature 0) is deterministic and collapses exploration, often leading to mode-dropping and repetitive text. (reddit.com)
- Use at least two choices – multi-sample roll-outs are standard in Policy-Optimization and rejection-sampling pipelines. (rlhfbook.com, blog.ml.cmu.edu)
- Log everything – Fireworks dashboards export Weights & Biases runs, so tag each sweep and compare reward curves side-by-side.
- Watch VRAM – memory scales roughly linearly with `n`, `max_tokens`, and batch size. Scale horizontally or enable Turbo mode if needed.
- Iterate in small steps – change one hyperparameter at a time; RLHF is sensitive, and unstable grids waste compute. (arxiv.org)
Why these limits?
The ranges match the client-side validation baked into the dashboard (`MIN_*`/`MAX_*` constants). Entering a value outside these windows surfaces an error immediately, saving wasted GPU hours.
Quick reference
| Goal | Turn these knobs |
|---|---|
| Faster convergence | ↑ `epochs`, tune `learning-rate` < 2× default |
| Safer / less toxic | ↓ `temperature`, `top_p`, `top_k` |
| More creative | `temperature` ≈ 1 – 1.2, `top_p` 0.9 |
| Cheaper roll-outs | ↓ `n`, `max_tokens`, batch size |
| Higher capacity | ↑ `lora-rank`, but monitor VRAM |
By keeping temperature above zero, generating multiple candidates per prompt, and sticking to integer epoch counts, you’ll ensure your reinforcement fine-tuning runs stay both exploratory and stable — just what Policy-Optimization needs to find better policies.