What is reinforcement fine-tuning?
In traditional supervised fine-tuning, you provide a dataset with labeled examples showing exactly what the model should output. In reinforcement fine-tuning, you instead provide:
- A dataset: Prompts (input examples) for the model to respond to
- An evaluator: Code that scores the model’s outputs from 0.0 (bad) to 1.0 (good), also known as a reward function
- An environment: The system where your agent runs, with access to tools, APIs, and data needed for your task
Use cases
Reinforcement fine-tuning helps you train models to excel at:
- Code generation and analysis - Writing and debugging functions with verifiable execution results or test outcomes
- Structured output generation - JSON formatting, data extraction, classification, and schema compliance with programmatic validation
- Domain-specific reasoning - Legal analysis, financial modeling, or medical triage with verifiable criteria and compliance checks
- Tool-using agents - Multi-step workflows where agents call external APIs with measurable success criteria
How it works
1. Design your evaluator
Define how you’ll score model outputs from 0 to 1. For example, you might score outputs higher when they pass programmatic checks, when your agent called the right tools, or when your LLM-as-judge rates the output highly.
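As a minimal sketch of the idea (the function name, arguments, and scoring rubric below are illustrative assumptions, not a required Fireworks evaluator interface), an evaluator is ordinary code that maps a model response to a reward between 0.0 and 1.0:

```python
import json

def evaluate(response_text: str, expected_tool: str) -> float:
    """Score a model response between 0.0 (bad) and 1.0 (good).

    Illustrative rubric: full credit for valid JSON that calls the
    expected tool, partial credit for valid JSON with the wrong tool.
    """
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns the lowest reward
    if isinstance(payload, dict) and payload.get("tool") == expected_tool:
        return 1.0  # well-formed output calling the right tool
    return 0.5      # well-formed output, but the wrong tool
```

Graded rewards like this (rather than a strict pass/fail) give training a smoother signal to improve against.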
2. Prepare dataset
Create a JSONL file with prompts (system and user messages). These will be used to generate rollouts during training.
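As a rough sketch (the `messages` field and role names below follow the common chat format; check the dataset reference for the exact schema your job type expects), you can build the JSONL file with a few lines of Python:

```python
import json

# Illustrative prompt records; adapt the content to your own task.
prompts = [
    {"messages": [
        {"role": "system", "content": "You are a SQL assistant."},
        {"role": "user", "content": "Write a query returning the ten most recent orders."},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a SQL assistant."},
        {"role": "user", "content": "Count distinct customers per region."},
    ]},
]

# JSONL: one JSON object per line.
with open("rft_dataset.jsonl", "w") as f:
    for record in prompts:
        f.write(json.dumps(record) + "\n")
```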
3. Connect your environment
Train locally, or connect your environment to Fireworks as a remote server using our /init and /status endpoints.
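For a remote environment, a bare-bones server only needs to answer those two routes. The sketch below uses Python's standard library, and the response payloads are placeholders; the actual request/response contract for /init and /status is defined by Fireworks, so treat this only as scaffolding:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class EnvironmentHandler(BaseHTTPRequestHandler):
    """Bare-bones server exposing /init and /status (placeholder payloads)."""

    def _send_json(self, body: dict, status: int = 200) -> None:
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_POST(self):
        if self.path == "/init":
            # Drain the request body, then set up tools, APIs, and data
            # for a new rollout here.
            length = int(self.headers.get("Content-Length", 0))
            _ = self.rfile.read(length) if length else b""
            self._send_json({"ok": True})
        else:
            self._send_json({"error": "not found"}, status=404)

    def do_GET(self):
        if self.path == "/status":
            # Report whether the environment is ready to serve rollouts.
            self._send_json({"status": "ready"})
        else:
            self._send_json({"error": "not found"}, status=404)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), EnvironmentHandler).serve_forever()
```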
4. Launch training
Create an RFT job via the UI or CLI. Fireworks orchestrates rollouts, evaluates them, and trains the model to maximize reward.
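In other words, the training loop pushes the model toward outputs your evaluator scores highly. Conceptually, the objective has the following general form, where $D$ is your prompt dataset, $\pi_\theta$ is the model being fine-tuned, and $R$ is the evaluator's 0-to-1 reward (the specific RL algorithm and any regularization terms are handled by Fireworks and are not specified here):

$$\max_{\theta}\; \mathbb{E}_{x \sim D,\ y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big]$$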
5. Deploy model
Once training completes, deploy your fine-tuned LoRA model to production with an on-demand deployment.
RFT works best when:
- You can determine whether a model’s output is “good” or “bad,” even if only approximately
- You have prompts but lack perfect “golden” completions to learn from
- The task requires multi-step reasoning where evaluating intermediate steps is hard
- You want the model to explore creative solutions beyond your training examples