Reinforcement fine-tune models
When to Use Reinforcement Fine Tuning
Supervised fine-tuning (SFT) works well for many common scenarios, especially when:
- You have a sizable dataset (~1000+ examples) with high-quality, ground-truth outputs.
- The dataset covers most possible input scenarios.
- Tasks are relatively straightforward, such as:
  - Classification
  - Content extraction
However, SFT may struggle in situations where:
- Your dataset is small.
- You lack ground-truth outputs (a.k.a. “golden generations”).
- The task requires multi-step reasoning.
Here is a simplistic decision guideline: prefer SFT when you have a large, well-covered dataset with ground-truth outputs, and consider RFT when data is scarce or the task requires reasoning but is still *verifiable*. Verifiable refers to whether it is relatively easy to judge the quality of a model's generation.
Example Use Cases for RFT
Reinforcement Fine Tuning is especially effective for:
- Domain reasoning: Applying domain-specific logic to solve problems.
- Function calling: Understanding when and how to use external tools based on conversation history and tool descriptions.
- Math with reasoning: Solving math problems with logical steps.
- Code generation/fixes: Modifying or generating code by interpreting context and requirements.
RFT works best when you can determine whether a model’s output is “good” or “bad,” even if only approximately.
👉 For more background, check out this blog post on RFT.
How to Fine-Tune a Model with RFT
1. Design Your Evaluation Strategy
Before creating a dataset, define how you’ll evaluate the quality of model outputs.
Example: Math Solver
- You want a model that outputs step-by-step solutions.
- Evaluating each reasoning step is hard, but checking the final answer is easy.
- So, if the final answer is correct, you assume the reasoning is likely acceptable.
This strategy simplifies evaluation:
- Extract the final answer from the output.
- Compare it to the known ground truth.
- If they match → score = 1.0. If not → score = 0.0.
Be creative and iterate to find the best evaluation method for your task.
2. Prepare Your Dataset
Your dataset should be in JSONL format, similar to supervised fine-tuning datasets. Each entry must include a `messages` key containing OpenAI-style chat messages.
Example Dataset
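For illustration, a couple of rows for a math task might look like the following (the prompts are made up for this example):

```jsonl
{"messages": [{"role": "user", "content": "What is 15% of 240?"}]}
{"messages": [{"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"}]}
```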
You can also prefill generations from a base model, even if they’re not perfect—this helps with evaluator development.
Optional Metadata
You may include additional fields for use in your evaluator. For example, with math problems, include a `ground_truth` field:
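For example, a row might pair a prefilled (and in this case incorrect) model generation with the known answer; the field values below are illustrative:

```jsonl
{"messages": [{"role": "user", "content": "What is 15% of 240?"}, {"role": "assistant", "content": "15% of 240 is 24."}], "ground_truth": "36"}
```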
You can name additional fields arbitrarily and they will all be transparently passed through to your evaluation function. Note: the model’s answer here is incorrect; this is just a test case.
3. Build and Iterate on the Evaluator
Start simple—use the Web IDE for quick iterations. For complex use cases, use reward-kit
.
Navigate to the Evaluations tab in your Fireworks dashboard and click Create Evaluator; you should see the following page.
On the left side is a prefilled template where you can write your evaluator. On the right is a dataset preview that lets you run your evaluator code against a dataset of your choice. The interface is meant for simple debugging. Note that you can call `print` inside the evaluator code and view the output in the console panel.
Example Evaluator (Math Task)
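Below is a minimal sketch of a math evaluator. The entry-point name `evaluate` and its exact signature are assumptions for illustration; what matters is the input/output contract described under the requirements below.

```python
import re

def evaluate(messages, ground_truth=None, **kwargs):
    # The last message is assumed to be the model's generation.
    completion = messages[-1]["content"]

    # Treat the last number in the completion as the model's final answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers or ground_truth is None:
        # Mark the sample as invalid so it is skipped during training.
        return {"score": 0.0, "reason": "no final answer or ground truth found", "is_score_valid": False}

    correct = abs(float(numbers[-1]) - float(ground_truth)) < 1e-6
    return {
        "score": 1.0 if correct else 0.0,
        "reason": f"model answered {numbers[-1]}, ground truth is {ground_truth}",
    }
```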
Evaluator Function Requirements
- Inputs: The function is called for each dataset row. It receives the `messages` and any custom fields (like `ground_truth`) coming from your dataset.
- Output: A dictionary with:
  - `score`: Float between 0.0 and 1.0
  - `reason`: (Optional) A string for logging
  - `is_score_valid`: (Optional, defaults to `True`) Flag to skip training on invalid outputs

If the evaluator throws an error or returns invalid data, that sample is skipped during training.
You can optionally include a `metrics` field, a mapping from metric name to `MetricResult`, to record auxiliary metrics. A common practice is to track the individual metrics you care about in `metrics`, while the actual `score` in the `EvaluateResult` is a weighted average of those metrics and is what is actually used for training.
Example
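A sketch of what this can look like, again assuming a plain-dictionary return where each auxiliary metric carries its own score and reason (the exact `MetricResult` structure in `reward-kit` may differ):

```python
def evaluate(messages, ground_truth=None, **kwargs):
    completion = messages[-1]["content"]

    # Individual metrics; these values do not need to stay within [0.0, 1.0].
    answer_correct = 1.0 if ground_truth is not None and str(ground_truth) in completion else 0.0
    conciseness = 1.0 - min(len(completion) / 2000.0, 1.0)

    return {
        # The final score used for training is a weighted average of the metrics.
        "score": 0.8 * answer_correct + 0.2 * conciseness,
        "reason": "weighted average of correctness and conciseness",
        "metrics": {
            "answer_correct": {"score": answer_correct, "reason": "ground truth appears in the completion"},
            "conciseness": {"score": conciseness, "reason": "shorter completions score higher"},
        },
    }
```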
Note that only the final `score` needs to be within the `[0.0, 1.0]` range; individual metric values can be in an arbitrary range.
4. Create an RFT Job
You can launch an RFT job directly from the UI.
- Go to the “Fine-Tuning” tab.
- Click “Fine-tune a Model”.
- Select “Reinforcement” as the tuning method.
- Follow the wizard to complete the setup.
5. Monitor Training Progress
After launching the job, the UI will display:
- Training progress
- Evaluation metrics
- Model checkpoints
6. Deploy and Use the Model
Once training completes, deploy the model like any other LoRA model. It’s now ready for inference with improved performance on your custom task.
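As a minimal sketch, you can query the deployed model through Fireworks' OpenAI-compatible chat completions endpoint; the model ID and API key below are placeholders for your own:

```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",  # placeholder
)

response = client.chat.completions.create(
    # Placeholder model ID; use the ID of your deployed fine-tuned model.
    model="accounts/<your-account>/models/<your-fine-tuned-model>",
    messages=[{"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}],
)
print(response.choices[0].message.content)
```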
Access
As of today, Fireworks accounts should have access to Reinforcement Fine-Tuning via the dashboard. We have enabled a default quota of 1 GPU for developer accounts, which should be sufficient for running RFT on models under 10B parameters, capacity permitting.