When to Use Reinforcement Fine Tuning

Supervised fine-tuning (SFT) works well for many common scenarios, especially when:

  • You have a sizable dataset (~1000+ examples) with high-quality, ground-truth outputs.
  • The dataset covers most possible input scenarios.
  • Tasks are relatively straightforward, such as:
    • Classification
    • Content extraction

However, SFT may struggle in situations where:

  • Your dataset is small.
  • You lack ground-truth outputs (a.k.a. “golden generations”).
  • The task requires multi-step reasoning.

Here is a simplified decision guideline:

Here, verifiable refers to whether it is relatively easy to judge the quality of a model's generation.

Example Use Cases for RFT

Reinforcement Fine Tuning is especially effective for:

  • Domain reasoning: Applying domain-specific logic to solve problems.
  • Function calling: Understanding when and how to use external tools based on conversation history and tool descriptions.
  • Math with reasoning: Solving math problems with logical steps.
  • Code generation/fixes: Modifying or generating code by interpreting context and requirements.

RFT works best when you can determine whether a model’s output is “good” or “bad,” even if only approximately.

👉 For more background, check out this blog post on RFT.


How to Fine-Tune a Model with RFT

1. Design Your Evaluation Strategy

Before creating a dataset, define how you’ll evaluate the quality of model outputs.

Example: Math Solver

  • You want a model that outputs step-by-step solutions.
  • Evaluating each reasoning step is hard, but checking the final answer is easy.
  • So, if the final answer is correct, you assume the reasoning is likely acceptable.

This strategy simplifies evaluation:

  • Extract the final answer from the output.
  • Compare it to the known ground truth.
  • If they match → score = 1.0. If not → score = 0.0.

Be creative and iterate to find the best evaluation method for your task.
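
For illustration, here is a minimal sketch of this scoring strategy in plain Python. The helper name and the "Answer: ..." output format are illustrative; the full evaluator interface is covered in step 3.

import re

def score_math_solution(model_output: str, ground_truth: int) -> float:
    """Return 1.0 if the final 'Answer: <n>' line matches the ground truth, else 0.0."""
    last_line = model_output.strip().split("\n")[-1]
    match = re.match(r"Answer:\s*(-?\d+)", last_line)
    if match is None:
        return 0.0  # no parseable final answer
    return 1.0 if int(match.group(1)) == ground_truth else 0.0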


2. Prepare Your Dataset

Your dataset should be in JSONL format, similar to supervised fine-tuning datasets. Each entry must include a messages key containing OpenAI-style chat messages.

Example Dataset

{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is the capital of France?"},
  {"role": "assistant", "content": "Paris."}
]}

You can also prefill generations from a base model, even if they’re not perfect—this helps with evaluator development.
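
As a rough sketch, you could prefill assistant generations through the OpenAI-compatible Fireworks inference endpoint as shown below. The base model name, API key handling, and prompts are placeholders, not a prescribed setup.

import os
from openai import OpenAI  # OpenAI-compatible client pointed at Fireworks

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

def prefill_row(system_prompt: str, user_prompt: str) -> dict:
    """Build one dataset row whose assistant message is generated by a base model."""
    response = client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # example base model
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": response.choices[0].message.content},
    ]}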

Optional Metadata

You may include additional fields for use in your evaluator. For example, with math problems, include a ground_truth field:

{
  "messages": [
    {"role": "system", "content": "Return your solution on the last line, formatted as 'Answer: the answer'"},
    {"role": "user", "content": "What is the result of 3+2?"},
    {"role": "assistant", "content": "2 + 2 = 1 + 1 + 1 + 1 + 1 = 4, I think the answer is 4.\nAnswer: 4"}
  ],
  "ground_truth": 5
}

You can name additional fields arbitrarily and they will all be transparently passed through to your evaluation function. Note: the model’s answer here is incorrect; this is just a test case.
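
On disk, each entry (including any extra fields such as ground_truth) is simply one JSON object per line of the .jsonl file. The snippet below is an illustrative way to write such a file; the file name and row contents are placeholders.

import json

rows = [
    {
        "messages": [
            {"role": "system", "content": "Return your solution on the last line, formatted as 'Answer: the answer'"},
            {"role": "user", "content": "What is the result of 3+2?"},
            {"role": "assistant", "content": "3 + 2 = 5\nAnswer: 5"},
        ],
        "ground_truth": 5,
    },
]

with open("rft_dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")  # one JSON object per line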


3. Build and Iterate on the Evaluator

Start simple—use the Web IDE for quick iterations. For complex use cases, use reward-kit.

Navigate to the Evaluations tab in your Fireworks dashboard and click Create Evaluator; you should see the following page.

On the left side, there is a prefilled template where you can write your evaluator code. On the right, there is a dataset preview that lets you run your evaluator code against a dataset of your choice. The interface is meant for simple debugging. Note that you can call print inside the evaluator code and view the output in the console panel.

Example Evaluator (Math Task)

import re

import numpy as np
import pydantic
import reward_kit
from reward_kit import reward_function
from reward_kit.models import EvaluateResult, MetricResult # https://docs.fireworks.ai/evaluators/developer_guide/core_data_types

@reward_function
def evaluate(messages: list[dict], ground_truth: int, **kwargs) -> EvaluateResult:
    """
    The reward function that evaluates a single entry from the dataset. This function is required in the `main.py` file.
    For more details and examples, please refer to https://docs.fireworks.ai/evaluators/examples/examples_overview
    Args:
        messages: A list of dictionaries representing a single line from the dataset jsonl file. This maps to the `messages` field in the dataset json entry.
        ground_truth: This particular dataset has a ground_truth column, which is the integer answer to the math question.
        kwargs: Additional fields in your dataset besides `messages`. It is highly recommended to keep this, since more keyword arguments may be passed to the function.
    Returns:
        EvaluateResult: Evaluation result that should include score, and optionally is_score_valid, reason, and sub-metrics.
    """
    answer = messages[-1]['content']
    last_line = answer.split('\n')[-1]
    mg = re.match(r"Answer: (\d+)", last_line)
    extracted_answer = None
    score = 0.0
    if mg is not None:
        extracted_answer = int(mg.group(1))
        score = 1.0 if extracted_answer == ground_truth else 0.0
    return EvaluateResult(
        score=score,
        is_score_valid=True,
        reason=f"extracted_answer {extracted_answer}, actual answer {ground_truth}"
    )
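
Before uploading, you may want to sanity-check the evaluator locally on a single hand-written row. The snippet below is a hypothetical smoke test that assumes the decorated function can still be called directly with the same keyword arguments; if your reward-kit version wraps the signature differently, adapt accordingly.

# Hypothetical local smoke test for the evaluator above; assumes the decorated
# `evaluate` can be invoked directly with the same keyword arguments.
sample_messages = [
    {"role": "system", "content": "Return your solution on the last line, formatted as 'Answer: the answer'"},
    {"role": "user", "content": "What is the result of 3+2?"},
    {"role": "assistant", "content": "3 + 2 = 5\nAnswer: 5"},
]
result = evaluate(messages=sample_messages, ground_truth=5)
print(result)  # expect score=1.0 since the extracted answer matches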

Evaluator Function Requirements

  • Inputs: The function is called for each dataset row. It receives the messages and any custom fields like ground_truth coming from your dataset.
  • Output: An EvaluateResult with:
    • score: Float between 0.0 and 1.0
    • reason: (Optional) A string for logging
    • is_score_valid: (Optional, defaults to True) Flag to skip training on invalid outputs

If the evaluator throws an error or returns invalid data, that sample is skipped during training. You can optionally include a metrics field that maps metric names to MetricResult objects to report auxiliary metrics. A common practice is to track the individual metrics you care about in the metrics field, while the actual score in the EvaluateResult, which is what training uses, is a weighted average of those individual metrics.

Example

return EvaluateResult(
    score=0.5 * metric_1_score + 0.5 * metric_2_score,
    reason="This is the eval result for the rollup",
    metrics={
        "metric_1": MetricResult(
            is_score_valid=True,
            score=metric_1_score,
            reason="This is the eval result for metric_1"
        ),
        "metric_2": MetricResult(
            is_score_valid=True,
            score=metric_2_score,
            reason="This is the eval result for metric_2"
        )
    }
)

Note that only the final score needs to be within the [0.0, 1.0] range; individual metric values can take arbitrary values.


4. Create an RFT Job

You can launch an RFT job directly from the UI.

  1. Go to the “Fine-Tuning” tab.
  2. Click “Fine-tune a Model”.
  3. Select “Reinforcement” as the tuning method.
  4. Follow the wizard to complete the setup.


5. Monitor Training Progress

After launching the job, the UI will display:

  • Training progress
  • Evaluation metrics
  • Model checkpoints


6. Deploy and Use the Model

Once training completes, deploy the model like any other LoRA model. It’s now ready for inference with improved performance on your custom task.
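
For instance, once the LoRA is deployed, you can query it through the same OpenAI-compatible chat completions endpoint. The model identifier below is a placeholder for your own account and model ID.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    model="accounts/<your-account>/models/<your-fine-tuned-model>",  # placeholder ID
    messages=[
        {"role": "system", "content": "Return your solution on the last line, formatted as 'Answer: the answer'"},
        {"role": "user", "content": "What is the result of 3+2?"},
    ],
)
print(response.choices[0].message.content)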


Access

As of today, Fireworks accounts should have access to Reinforcement Fine Tuning via the dashboard. We have enabled a default quota of 1 GPU for developer accounts, which should be sufficient for running RFT on models under 10B parameters, capacity permitting.