Reinforcement fine-tuning (RFT)
Reinforcement fine-tuning is especially effective for:
- Domain reasoning: Applying domain-specific logic to solve problems.
- Function calling: Understanding when and how to use external tools based on conversation history and tool descriptions.
- Math with reasoning: Solving math problems with logical steps.
- Code generation/fixes: Modifying or generating code by interpreting context and requirements.
RFT works best when you can determine whether a model’s output is “good” or “bad,” even if only approximately.
👉 For more background, check out this blog post on RFT.
1. Design Your Evaluation Strategy
Before creating a dataset, define how you’ll evaluate the quality of model outputs.
Example: Math Solver
- You want a model that outputs step-by-step solutions.
- Evaluating each reasoning step is hard, but checking the final answer is easy.
- So, if the final answer is correct, you assume the reasoning is likely acceptable.
This strategy simplifies evaluation:
- Extract the final answer from the output.
- Compare it to the known ground truth.
- If they match → score = 1.0. If not → score = 0.0.
Be creative and iterate to find the best evaluation method for your task.
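For instance, a minimal sketch of this scoring logic in Python (the `Answer:` convention and the regex are assumptions about your output format; adapt them to your task):

```python
import re

def score_final_answer(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the extracted final answer matches the ground truth, else 0.0."""
    # Assumed convention: the solution ends with a line like "Answer: 42".
    match = re.search(r"Answer:\s*([^\n]+)", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```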
2. Prepare your dataset
Your dataset should be in JSONL format, similar to supervised fine-tuning datasets. Each entry must include a `messages` key containing OpenAI-style chat messages.
Example dataset
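A minimal illustration of the expected shape (the prompts are placeholders):

```jsonl
{"messages": [{"role": "user", "content": "What is 17 + 25?"}]}
{"messages": [{"role": "user", "content": "Solve for x: 2x + 6 = 14. Give the final answer as 'Answer: <value>'."}]}
```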
You can also prefill generations from a base model, even if they’re not perfect—this helps with evaluator development.
Optional metadata
You may include additional fields for use in your evaluator. For example, with math problems, include a `ground_truth` field:
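For instance (an illustrative entry with a prefilled, deliberately wrong generation):

```jsonl
{"messages": [{"role": "user", "content": "What is 17 + 25?"}, {"role": "assistant", "content": "17 + 25 = 41. Answer: 41"}], "ground_truth": "42"}
```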
You can name additional fields arbitrarily and they will all be transparently passed through to your evaluation function. Note: the model’s answer here is incorrect; this is just a test case.
3. Build and iterate on the evaluator
Start simple: use the Web IDE for quick iterations. For more complex use cases, use `reward-kit`.
Navigate to the "Evaluations" tab in your Fireworks dashboard and click "Create Evaluator"; you should see the following page.
On the left side, there is a prefilled template where you can code up your evaluator. On the right, there is a dataset preview that lets you run your evaluator code against a dataset of your choice. The interface is meant for quick debugging. Note that you can call `print` inside the evaluator code and view the output in the console panel.
Example evaluator (math task)
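A minimal sketch of such an evaluator, assuming the `Answer:` convention from the dataset above; the exact function name and signature your evaluator template expects may differ, so treat this as a sketch of the requirements listed below:

```python
def evaluate(messages: list[dict], ground_truth: str, **kwargs) -> dict:
    """Score one rollout: 1.0 if the final answer matches ground_truth, else 0.0."""
    # The last message in the conversation is the model's generated solution.
    model_output = messages[-1]["content"]

    # Assumed convention: the solution ends with a line like "Answer: 42".
    if "Answer:" in model_output:
        predicted = model_output.rsplit("Answer:", 1)[-1].strip()
    else:
        predicted = ""
    correct = predicted == str(ground_truth).strip()

    return {
        "score": 1.0 if correct else 0.0,
        "reason": f"Predicted '{predicted}', expected '{ground_truth}'.",
        "is_score_valid": True,
    }
```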
Evaluator function requirements
- Inputs: The function is called once for each dataset row. It receives the `messages` plus any custom fields (such as `ground_truth`) from your dataset.
- Output: A dictionary with:
  - `score`: Float between 0.0 and 1.0
  - `reason`: (Optional) A string for logging
  - `is_score_valid`: (Optional, defaults to `True`) Flag to skip training on invalid outputs
If the evaluator throws an error or returns invalid data, that sample is skipped during training.
You can optionally include a `metrics` field, a mapping from metric name to `MetricResult`, to report auxiliary metrics. A common practice is to track the individual metrics you care about in `metrics`, while the `score` in the `EvaluateResult` is a weighted average of those metrics and is what is actually used for training.
Example
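A sketch of what that might look like, using plain dictionaries in place of the `MetricResult` / `EvaluateResult` objects (the metric names and weights here are arbitrary; if you use `reward-kit`, construct the corresponding objects instead):

```python
def evaluate(messages: list[dict], ground_truth: str, **kwargs) -> dict:
    """Combine two auxiliary metrics into a single training score."""
    model_output = messages[-1]["content"]

    # Correctness metric: does the expected answer appear in the output?
    answer_correct = 1.0 if str(ground_truth) in model_output else 0.0
    # Formatting metric: did the output include an "Answer:" line?
    has_answer_line = 1.0 if "Answer:" in model_output else 0.0

    return {
        # Only this value must stay in [0.0, 1.0]; it is what training uses.
        "score": 0.8 * answer_correct + 0.2 * has_answer_line,
        "reason": "Weighted combination of correctness and formatting.",
        # Auxiliary metrics are tracked for analysis, not used directly for training.
        "metrics": {
            "answer_correct": {"score": answer_correct, "reason": "Exact-match check."},
            "has_answer_line": {"score": has_answer_line, "reason": "Formatting check."},
        },
    }
```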
Note that only the final `score` needs to be within the [0.0, 1.0] range; individual metric values can be in an arbitrary range.
4. Create an RFT job
You can launch an RFT job directly from the UI.
- Go to the “Fine-Tuning” tab.
- Click “Fine-tune a Model”.
- Select “Reinforcement” as the tuning method.
- Follow the wizard to complete the setup.
5. Monitor training progress
After launching the job, the UI will display:
- Training progress
- Evaluation metrics
- Model checkpoints
6. Deploy and use the model
Once training completes, you can deploy the model like any other LoRA model. Refer to deploying a fine-tuned model for more information.
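For example, once deployed, the model can be queried through Fireworks' OpenAI-compatible chat completions endpoint; a minimal sketch (the model resource name and the API-key environment variable are placeholders for your own deployment):

```python
import os
from openai import OpenAI

# Fireworks exposes an OpenAI-compatible inference endpoint.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],  # placeholder: your Fireworks API key
)

response = client.chat.completions.create(
    # Placeholder: replace with your fine-tuned model's full resource name.
    model="accounts/<your-account>/models/<your-fine-tuned-model>",
    messages=[{"role": "user", "content": "What is 17 + 25?"}],
)
print(response.choices[0].message.content)
```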
Access
As of today, Fireworks accounts should have access to Reinforcement Fine-Tuning through the dashboard. Developer accounts have a default quota of 1 GPU, which should be sufficient for running RFT on models under 10B parameters, capacity permitting.
Hyperparameters for reinforcement fine-tuning
Additional RFT job settings
Most experiments converge with the defaults below. Change them only when you have a clear hypothesis — and record every change in your experiment tracker.
Training-time hyperparameters
| Flag | Default | Valid range | When to change |
|---|---|---|---|
| `--epochs` | 1 | 1 – 10 (whole numbers only) | Add 1-2 more passes if the reward still climbs steadily near the end of training. Too many epochs risks over-fitting. |
| `--batch-size` | 32k tokens | Hardware-bounded | Lower if you hit OOM; raise only when GPUs have >30% headroom. |
| `--learning-rate` | 1e-4 | 1e-5 – 5e-4 | Decrease when the reward spikes then collapses; increase when the curve plateaus too early. |
| `--lora-rank` | 8 | 4 – 128 (powers of 2) | Higher ranks give more capacity but cost VRAM; stay ≤64 unless you have >40 GB per GPU. |
| `--max-context-length` | 8192 tokens | Up to the model limit | Raise only when your prompts truncate; remember that longer sequences consume quadratic compute. |
Roll-out (sampling) parameters
During each Policy-Optimization step, the trainer queries the current policy with stochastic decoding. The UI and the `firectl` CLI enforce the ranges shown here; out-of-bound values throw an "Invalid rollout parameters" error.
| Field | CLI flag | Default | Recommended range | Why it matters |
|---|---|---|---|---|
| Maximum tokens | `--inference-max-tokens` | 2048 | 16 – 16384 | Longer responses improve reward on summarisation / story tasks but add cost. (blog.ml.cmu.edu) |
| Temperature | `--inference-temperature` | 0.7 | 0.1 – 2.0 (> 0 only) | Values below 0.1 converge towards greedy decoding and kill exploration; 0.5–1.0 is a sweet spot for RLHF. (arxiv.org, huyenchip.com) |
| Top-p | `--inference-top-p` | 1.0 | 0 – 1 | Lower to 0.2–0.5 to clamp long-tail tokens when the reward penalises hallucinations. (codefinity.com) |
| Top-k | `--inference-top-k` | 40 | 0 – 100 (0 = off) | Combine with temperature for more creative exploration; keep ≤50 for latency. (medium.com) |
| n (choices) | `--inference-n` | 4 | 2 – 8 | Policy-Optimization needs multiple candidates to compute a meaningful KL term; ≥2 is mandatory. (blog.ml.cmu.edu, rlhfbook.com) |
| Extra body JSON | `--inference-extra-body` | empty | Valid JSON | Pass extra OpenAI-style params (e.g., `stop`, `logit_bias`). Invalid JSON is rejected. |
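For example, a minimal `--inference-extra-body` payload might look like this (the stop sequence and token-bias values are placeholders; `logit_bias` keys are model-specific token IDs):

```json
{"stop": ["\n\nUser:"], "logit_bias": {"50256": -100}}
```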
Practical tips
- Keep temperature > 0 – greedy sampling (temperature 0) is deterministic and collapses exploration, often leading to mode-dropping and repetitive text. (reddit.com)
- Use at least two choices – multi-sample roll-outs are standard in Policy-Optimization and rejection-sampling pipelines. (rlhfbook.com, blog.ml.cmu.edu)
- Log everything – Fireworks dashboards export Weights & Biases runs, so tag each sweep and compare reward curves side-by-side.
- Watch VRAM – memory scales roughly linearly with `n`, `max_tokens`, and batch size. Scale horizontally or enable Turbo mode if needed.
- Iterate in small steps – change one hyperparameter at a time; RLHF is sensitive, and unstable grids waste compute. (arxiv.org)
Why these limits?
The ranges match the client-side validation baked into the dashboard (`MIN_*`/`MAX_*` constants). Entering a value outside these windows surfaces an error immediately, saving wasted GPU hours.
Quick reference
| Goal | Turn these knobs |
|---|---|
| Faster convergence | ↑ `epochs`, tune `learning-rate` < 2× default |
| Safer / less toxic | ↓ `temperature`, `top_p`, `top_k` |
| More creative | `temperature` ≈ 1 – 1.2, `top_p` 0.9 |
| Cheaper roll-outs | ↓ `n`, `max_tokens`, batch size |
| Higher capacity | ↑ `lora-rank`, but monitor VRAM |
By keeping temperature above zero, generating multiple candidates per prompt, and sticking to integer epoch counts, you’ll ensure your reinforcement fine-tuning runs stay both exploratory and stable — just what Policy-Optimization needs to find better policies.