Why evaluators matter
Unlike supervised fine-tuning where you provide perfect examples, RFT uses evaluators to define what “good” means. This is powerful because:

- No perfect data required - Just prompts and a way to score outputs
- Encourages exploration - Models learn strategies, not just patterns
- Noise tolerant - Even noisy signals can improve model performance
- Encodes domain expertise - Complex rules and logic that are hard to demonstrate with examples
Anatomy of an evaluator
Every evaluator has three core components:

1. Input data - The prompt and any ground truth data needed for evaluation
2. Model output - The assistant’s response to evaluate
3. Scoring logic - Code that compares the output to your criteria
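To make the three components concrete, here is a minimal sketch. The function name, signature, and the `expected_answer` field are illustrative assumptions, not a required interface; the only firm contract is that the scoring logic returns a value between 0.0 and 1.0.

```python
def evaluate(input_data: dict, model_output: str) -> float:
    """Minimal evaluator sketch.

    input_data   - the prompt plus any ground truth needed for scoring
    model_output - the assistant's response to evaluate
    returns      - a score between 0.0 and 1.0
    """
    expected = input_data["expected_answer"]  # hypothetical ground-truth field
    # Scoring logic: exact match after trimming whitespace and casing.
    return 1.0 if model_output.strip().lower() == expected.strip().lower() else 0.0
```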
Types of evaluators
Rule-based evaluators
Check if outputs match specific patterns or rules:

- Exact match - Output exactly equals expected value
- Contains - Output includes required text
- Regex - Output matches a pattern
- Format validation - Output follows required structure (e.g., valid JSON)
Start with rule-based evaluators. They’re simple, fast, and surprisingly effective.
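A single rule-based evaluator can combine several of these checks. The sketch below is illustrative: the assumed output format (JSON with an `answer` field) and the partial-credit weights are assumptions you would replace with your own rules.

```python
import json
import re


def rule_based_score(output: str, expected: str, required_phrase: str) -> float:
    """Score an output with simple rules: format check, exact match, contains, regex."""
    # Format validation: the output must be valid JSON with an "answer" field.
    try:
        answer = str(json.loads(output)["answer"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0.0

    # Exact match earns full credit.
    if answer.strip() == expected.strip():
        return 1.0

    # Contains / regex checks earn partial credit.
    score = 0.0
    if required_phrase.lower() in answer.lower():
        score += 0.3
    if re.search(r"\d+(\.\d+)?", answer):  # answer should include a number
        score += 0.2
    return score
```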
Execution-based evaluators
Run code or commands to verify correctness:

- Code execution - Run generated code and check results
- Test suites - Pass generated code through unit tests
- API calls - Execute commands and verify outcomes
- Simulations - Run agents in environments and measure success
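A common pattern is to run generated code against a small test suite in a subprocess with a timeout. The sketch below assumes the model output is a Python snippet defining a function `solve`; that convention and the sample assertions are illustrative.

```python
import subprocess
import sys


def execution_score(generated_code: str) -> float:
    """Run generated code plus assertions in a subprocess; score by tests passed."""
    tests = [
        "assert solve(2, 3) == 5",
        "assert solve(-1, 1) == 0",
        "assert solve(0, 0) == 0",
    ]
    passed = 0
    for test in tests:
        program = generated_code + "\n" + test
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                capture_output=True,
                timeout=5,  # guard against infinite loops
            )
            if result.returncode == 0:
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # a hanging test counts as a failure
    return passed / len(tests)
```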
LLM-as-judge evaluators
Use another model to evaluate quality:

- Rubric scoring - Judge outputs against criteria
- Comparative ranking - Compare multiple outputs
- Natural language assessment - Evaluate subjective qualities like helpfulness
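One way to implement rubric scoring is to ask a judge model for a numeric grade and normalize it. In the sketch below, `call_judge` stands in for whatever function sends a prompt to your judge model and returns its text reply; the exact client API depends on your provider, so it is left as a parameter.

```python
import re
from typing import Callable


def judge_score(question: str, answer: str, call_judge: Callable[[str], str]) -> float:
    """Ask a judge model to grade an answer against a rubric on a 0-10 scale."""
    rubric = (
        "Grade the answer from 0 to 10 for factual accuracy, completeness, and "
        "clarity. Reply with only the number."
    )
    reply = call_judge(f"{rubric}\n\nQuestion: {question}\nAnswer: {answer}")

    # Parse the first number in the reply; unparseable replies score 0.
    match = re.search(r"\d+(\.\d+)?", reply)
    if not match:
        return 0.0
    return min(float(match.group()), 10.0) / 10.0  # normalize to 0.0-1.0
```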
Scoring guidelines
Your evaluator should return a score between 0.0 and 1.0:

| Score range | Meaning | Example |
|---|---|---|
| 1.0 | Perfect | Exact correct answer |
| 0.7-0.9 | Good | Right approach, minor error |
| 0.4-0.6 | Partial | Some correct elements |
| 0.1-0.3 | Poor | Wrong but attempted |
| 0.0 | Failure | Completely wrong |
Binary scoring (0.0 or 1.0) works well for many tasks. Use graded scoring when you can meaningfully distinguish between partial successes.
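For tasks where partial credit is meaningful, the scoring logic can award intermediate values along the lines of the table above. The thresholds and the `expected_steps` list below are illustrative assumptions.

```python
def graded_score(output: str, expected_answer: str, expected_steps: list[str]) -> float:
    """Award partial credit: full marks for the right answer, less for partial work."""
    if expected_answer in output:
        return 1.0  # correct final answer

    # Otherwise, credit the fraction of expected reasoning steps that appear.
    hits = sum(1 for step in expected_steps if step.lower() in output.lower())
    if hits == 0:
        return 0.0
    return 0.1 + 0.5 * (hits / len(expected_steps))  # lands in the 0.1-0.6 range
```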
Best practices
Start simple, iterate
Begin with basic evaluation logic and refine over time. Start with the simplest scoring approach that captures your core requirements; you can always add sophistication later based on training results.
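As a sketch of what that iteration might look like, a first version can check only the final answer, and a refined version can layer on partial credit once training results show where it helps. Both functions and their signatures are illustrative.

```python
def evaluate_v1(output: str, expected: str) -> float:
    """Version 1: the simplest check that captures the core requirement."""
    return 1.0 if expected in output else 0.0


def evaluate_v2(output: str, expected: str) -> float:
    """Version 2: same core check, refined after inspecting early training runs."""
    normalized_out = " ".join(output.lower().split())
    normalized_exp = " ".join(expected.lower().split())
    if normalized_exp in normalized_out:
        return 1.0
    # Partial credit when the numeric part of the answer is present but the
    # surrounding wording differs.
    digits = "".join(ch for ch in expected if ch.isdigit())
    if digits and digits in output:
        return 0.3
    return 0.0
```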
Make evaluators fast
Training generates many outputs to evaluate, so performance matters:
- Cache expensive computations: Store results of repeated calculations
- Use timeouts for code execution: Prevent hanging on infinite loops
- Batch API calls when possible: Reduce network overhead
- Profile slow evaluators and optimize: Identify and fix bottlenecks
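The sketch below illustrates two of these ideas: `functools.lru_cache` memoizes repeated evaluations of identical outputs, and the subprocess timeout bounds code execution. The 5-second limit and cache size are arbitrary choices for illustration.

```python
import functools
import subprocess
import sys


@functools.lru_cache(maxsize=4096)
def run_snippet(code: str) -> bool:
    """Execute a code snippet once and cache the result for repeated outputs."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=5,  # never hang on an infinite loop
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def fast_evaluate(generated_code: str, test: str) -> float:
    """Fast evaluator: cached, timeout-bounded execution of code plus one test."""
    return 1.0 if run_snippet(generated_code + "\n" + test) else 0.0
```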
Handle edge cases
Models will generate unexpected outputs, so build robust error handling. Anticipate and gracefully handle malformed outputs, syntax errors, timeouts, and edge cases specific to your domain.
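A defensive wrapper is one way to do this. The sketch below catches malformed JSON and any unexpected exception from the inner scoring logic, returning 0.0 with a reason instead of crashing the run; the score-plus-reason return shape is an assumption made for illustration.

```python
import json


def robust_evaluate(output: str, expected: dict) -> tuple[float, str]:
    """Score a JSON output defensively, never letting a bad output raise."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0, "output is not valid JSON"

    if not isinstance(parsed, dict):
        return 0.0, "output is JSON but not an object"

    try:
        score = 1.0 if parsed.get("answer") == expected["answer"] else 0.0
        return score, "compared 'answer' field against ground truth"
    except Exception as exc:  # catch-all so one bad sample can't crash training
        return 0.0, f"unexpected evaluator error: {exc}"
```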
Avoid reward hacking
Models will exploit evaluation weaknesses, so design defensively.

Example: Length exploitation

If you score outputs by length, the model might generate verbose nonsense. Add constraints that cap how much credit length can earn.

Example: Format over substance

If you only check JSON validity, the model might return valid but wrong JSON. Check content too.

Always combine format checks with content validation to prevent models from gaming the system.
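The sketch below combines both defenses: length earns credit only up to a cap, and a JSON output must carry the correct content, not merely parse. The field names and weights are illustrative assumptions.

```python
import json


def hack_resistant_score(output: str, expected_answer: str) -> float:
    """Combine a length cap with content validation to blunt common exploits."""
    # Length exploitation: detail can help, but cap its contribution so verbose
    # nonsense cannot outscore a short correct answer.
    length_bonus = min(len(output.split()), 100) / 100 * 0.2  # at most 0.2

    # Format over substance: valid JSON alone is worth nothing; the parsed
    # content must match the expected answer to earn the main reward.
    try:
        answer = json.loads(output).get("answer")
    except (json.JSONDecodeError, AttributeError):
        return 0.0
    if str(answer).strip() != expected_answer.strip():
        return 0.0

    return 0.8 + length_bonus  # correct content dominates; length is a small bonus
```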
Debugging evaluators
Test your evaluator before training:

- Correct scoring - Good outputs score high, bad outputs score low
- Reasonable runtime - Each evaluation completes in reasonable time
- Clear feedback - Evaluation reasons explain scores
Run your evaluator on manually created good and bad examples first. If it doesn’t score them correctly, fix the evaluator before training.
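A quick harness like the one below, run over a handful of hand-written good and bad examples, is usually enough to catch an evaluator that scores in the wrong direction before training starts. The inline `evaluate` is just a stand-in for whichever evaluator you are testing, and the expected scores are example values.

```python
import time


def evaluate(output: str, expected: str) -> float:
    """Stand-in for the evaluator under test."""
    return 1.0 if expected in output else 0.0


def check_evaluator() -> None:
    cases = [
        ("The answer is 42.", "42", 1.0),        # good output should score high
        ("I am not sure, maybe 7?", "42", 0.0),  # bad output should score low
        ("", "42", 0.0),                         # empty output should not crash
    ]
    for output, expected, want in cases:
        start = time.perf_counter()
        got = evaluate(output, expected)
        elapsed = time.perf_counter() - start
        status = "OK " if abs(got - want) < 1e-6 else "FAIL"
        print(f"{status} score={got:.2f} (want {want:.2f}) in {elapsed * 1000:.1f} ms")


if __name__ == "__main__":
    check_evaluator()
```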