An evaluator (also called a reward function) is code that scores model outputs from 0.0 (worst) to 1.0 (best). During reinforcement fine-tuning, your evaluator guides the model toward better responses by providing feedback on its generated outputs.

Why evaluators matter

Unlike supervised fine-tuning, where you provide perfect examples, RFT uses evaluators to define what “good” means. This is powerful because:
  • No perfect data required - Just prompts and a way to score outputs
  • Encourages exploration - Models learn strategies, not just patterns
  • Noise tolerant - Even noisy signals can improve model performance
  • Encodes domain expertise - Complex rules and logic that are hard to demonstrate with examples

Anatomy of an evaluator

Every evaluator has three core components:

1. Input data

The prompt and any ground truth data needed for evaluation (the ground_truth field is optional supporting data):
{
  "messages": [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is 15 * 23?"}
  ],
  "ground_truth": "345"
}

2. Model output

The assistant’s response to evaluate:
{
  "role": "assistant",
  "content": "Let me calculate that step by step:\n15 * 23 = 345"
}

3. Scoring logic

Code that compares the output to your criteria:
import re

def extract_number(text: str) -> int | None:
    # Pull the last integer that appears in the response
    numbers = re.findall(r"-?\d+", text)
    return int(numbers[-1]) if numbers else None

def evaluate(model_output: str, ground_truth: str) -> float:
    # Extract the answer from the model's response
    predicted = extract_number(model_output)

    # Score it: full credit for an exact match, nothing otherwise
    if predicted == int(ground_truth):
        return 1.0  # Correct
    return 0.0  # Wrong or no number found

Types of evaluators

Rule-based evaluators

Check if outputs match specific patterns or rules (a short sketch follows the list):
  • Exact match - Output exactly equals expected value
  • Contains - Output includes required text
  • Regex - Output matches a pattern
  • Format validation - Output follows required structure (e.g., valid JSON)
Start with rule-based evaluators. They’re simple, fast, and surprisingly effective.
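For example, a minimal rule-based evaluator might combine an exact-match check with a regex fallback. The helper below is an illustrative sketch, not a required interface:

import re

def evaluate_rule_based(model_output: str, expected: str) -> float:
    # Exact match: full credit when the output equals the expected value
    if model_output.strip() == expected.strip():
        return 1.0

    # Contains / regex: partial credit when the expected value appears anywhere
    if re.search(re.escape(expected), model_output):
        return 0.5

    return 0.0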

Execution-based evaluators

Run code or commands to verify correctness (a short sketch follows the list):
  • Code execution - Run generated code and check results
  • Test suites - Pass generated code through unit tests
  • API calls - Execute commands and verify outcomes
  • Simulations - Run agents in environments and measure success
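As a rough sketch, assuming the generated code is plain Python and a 5-second limit is acceptable, an execution-based evaluator might run the code in a subprocess and score the exit status. In practice, untrusted model output should run in a proper sandbox, not a bare subprocess:

import subprocess
import sys

def evaluate_generated_code(code: str) -> float:
    # Run the generated code in a separate process so a crash cannot take down the evaluator
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=5,  # guard against infinite loops
        )
    except subprocess.TimeoutExpired:
        return 0.0  # code ran too long

    # Full credit for a clean exit, nothing otherwise
    return 1.0 if result.returncode == 0 else 0.0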

LLM-as-judge evaluators

Use another model to evaluate quality (a short sketch follows the list):
  • Rubric scoring - Judge outputs against criteria
  • Comparative ranking - Compare multiple outputs
  • Natural language assessment - Evaluate subjective qualities like helpfulness
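A minimal rubric-scoring judge might look like the sketch below, which assumes an OpenAI-compatible client and uses a placeholder judge model name:

from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key are configured

RUBRIC = (
    "Rate the assistant's answer from 0 to 10 for correctness and helpfulness. "
    "Reply with only the number."
)

def evaluate_with_judge(question: str, answer: str) -> float:
    # Ask the judge model to score the answer against the rubric
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
    )
    raw = response.choices[0].message.content.strip()
    try:
        return max(0.0, min(float(raw) / 10.0, 1.0))  # normalize to the 0.0-1.0 range
    except ValueError:
        return 0.0  # judge did not return a number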

Scoring guidelines

Your evaluator should return a score between 0.0 and 1.0:
Score range | Meaning | Example
1.0         | Perfect | Exact correct answer
0.7-0.9     | Good    | Right approach, minor error
0.4-0.6     | Partial | Some correct elements
0.1-0.3     | Poor    | Wrong but attempted
0.0         | Failure | Completely wrong
Binary scoring (0.0 or 1.0) works well for many tasks. Use gradual scoring when you can meaningfully distinguish between partial successes.
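For example, a task with several required parts lends itself to gradual scoring, where credit is proportional to how many parts the model gets right. The helper below is a hypothetical sketch:

def score_multi_part(predicted: list[str], expected: list[str]) -> float:
    # Credit proportional to how many expected parts the model answered correctly
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)  # 1.0 when every part matches, 0.0 when none do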

Best practices

Begin with basic evaluation logic and refine over time:
# Start here
score = 1.0 if predicted == expected else 0.0

# Then refine if needed
score = calculate_similarity(predicted, expected)
Start with the simplest scoring approach that captures your core requirements. You can always add sophistication later based on training results.
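The calculate_similarity helper above is a placeholder; one simple way to implement it is fuzzy string matching from the standard library:

from difflib import SequenceMatcher

def calculate_similarity(predicted: str, expected: str) -> float:
    # Fuzzy string similarity in [0.0, 1.0]; softer than an exact-match check
    return SequenceMatcher(None, predicted.strip(), expected.strip()).ratio()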
Training generates many outputs to evaluate, so performance matters (a caching sketch follows the list):
  • Cache expensive computations: Store results of repeated calculations
  • Use timeouts for code execution: Prevent hanging on infinite loops
  • Batch API calls when possible: Reduce network overhead
  • Profile slow evaluators and optimize: Identify and fix bottlenecks
Aim for evaluations that complete in seconds, not minutes. Slow evaluators directly increase training time and cost.
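As a sketch of the caching and profiling points, repeated work can be memoized and slow evaluations flagged with a small timing wrapper (both helpers are illustrative):

import functools
import time

@functools.lru_cache(maxsize=4096)
def parse_reference(ground_truth: str) -> tuple[str, ...]:
    # Cache repeated work: the same ground truth is scored against many rollouts
    return tuple(ground_truth.lower().split())

def timed(fn):
    # Lightweight profiling: flag evaluations that take longer than a second
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        if elapsed > 1.0:
            print(f"slow evaluation: {fn.__name__} took {elapsed:.2f}s")
        return result
    return wrapper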
Models will generate unexpected outputs, so build robust error handling:
try:
    result = execute_code(model_output)
    score = check_result(result)
except TimeoutError:
    score = 0.0  # Code ran too long
except SyntaxError:
    score = 0.0  # Invalid code
except Exception:
    score = 0.0  # Any other unexpected error
Anticipate and gracefully handle malformed outputs, syntax errors, timeouts, and edge cases specific to your domain.
Models will exploit evaluation weaknesses, so design defensively.

Example: Length exploitation

If you score outputs by length, the model might generate verbose nonsense. Add constraints:
# Bad: Model learns to write long outputs
score = min(len(output) / 1000, 1.0)

# Better: Require correctness AND reasonable length
if is_correct(output):
    score = 1.0 if len(output) < 500 else 0.8
else:
    score = 0.0

Example: Format over substance

If you only check JSON validity, the model might return valid but wrong JSON. Check content too:
# Bad: Only checks format
score = 1.0 if is_valid_json(output) else 0.0

# Better: Check format AND content
if is_valid_json(output):
    data = json.loads(output)
    score = evaluate_content(data)
else:
    score = 0.0
Always combine format checks with content validation to prevent models from gaming the system.

Debugging evaluators

Test your evaluator before training:
# Run evaluator on test examples
eval-protocol test test_evaluator.py --dataset examples.jsonl

# Check individual cases
eval-protocol test test_evaluator.py --dataset examples.jsonl --limit 5
Look for:
  • Correct scoring - Good outputs score high, bad outputs score low
  • Reasonable runtime - Each evaluation completes in seconds, not minutes
  • Clear feedback - Evaluation reasons explain scores
Run your evaluator on manually created good and bad examples first. If it doesn’t score them correctly, fix the evaluator before training.
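For example, using the evaluate function from earlier, a quick sanity check on hand-written examples might look like this (the example strings are made up):

# Hand-made sanity checks: good outputs should score high, bad ones low
good_output = "Let me calculate that step by step:\n15 * 23 = 345"
bad_output = "15 * 23 = 360"

assert evaluate(good_output, "345") == 1.0, "good example should score 1.0"
assert evaluate(bad_output, "345") == 0.0, "bad example should score 0.0"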

Next steps