Reward Functions Overview

This guide provides an overview of all out-of-the-box reward functions available in the Reward Kit library.

Introduction

Reward Kit includes several pre-built reward functions for common evaluation tasks. These functions can be used directly or as building blocks for more complex evaluations.
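
All of these functions operate on a list of chat messages and return a score together with per-metric details. A minimal end-to-end sketch, assuming OpenAI-style role/content message dicts with the last assistant message as the response under evaluation:

    from reward_kit.rewards.accuracy import accuracy_reward

    # The last assistant message is the response being scored
    messages = [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]

    result = accuracy_reward(messages=messages, ground_truth="Paris")
    print(result)  # overall score plus supporting metrics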

Available Reward Functions

Format and Structure Rewards

These reward functions evaluate the format and structure of responses.

  • Format Reward: Evaluate responses against a regex pattern (e.g., <think>...</think><answer>...</answer>)

    import re

    from reward_kit.rewards.format import format_reward
    
    result = format_reward(
        messages=messages,
        pattern=r"^<think>\n.*?</think>\n<answer>\n.*?</answer>$",
        flags=re.DOTALL
    )
    
  • Tag Count Reward: Check for exactly one of each specified tag

    from reward_kit.rewards.tag_count import tag_count_reward
    
    result = tag_count_reward(
        messages=messages,
        tags=["pros", "cons"]
    )
    

Accuracy and Correctness Rewards

These reward functions evaluate the accuracy of responses against expected answers.

  • Accuracy Reward: Compare answers to ground truth

    from reward_kit.rewards.accuracy import accuracy_reward
    
    result = accuracy_reward(
        messages=messages,
        ground_truth="Paris"
    )
    
  • Math Reward: Compare numerical answers with expected values

    from reward_kit.rewards.math import math_reward
    
    result = math_reward(
        messages=messages,
        expected_answer="42"
    )
    

Language and Style Rewards

These reward functions evaluate linguistic aspects of responses.

  • Language Consistency Reward: Ensure response is in the target language

    from reward_kit.rewards.language_consistency import language_consistency_reward
    
    result = language_consistency_reward(
        messages=messages,
        target_language="spanish"
    )
    
  • Reasoning Steps Reward: Encourage step-by-step reasoning

    from reward_kit.rewards.reasoning_steps import reasoning_steps_reward
    
    result = reasoning_steps_reward(
        messages=messages,
        min_steps=3
    )
    

Length and Verbosity Rewards

These reward functions evaluate the length and verbosity of responses.

  • Length Reward: Evaluate response against length targets

    from reward_kit.rewards.length import length_reward
    
    result = length_reward(
        messages=messages,
        target_length=200,  # Target token count
        token_method="whitespace"
    )
    
  • Cosine Length Reward: Scale rewards based on length using a cosine schedule

    from reward_kit.rewards.length import cosine_length_reward
    
    result = cosine_length_reward(
        messages=messages,
        correctness=0.9,  # High correctness score
        max_length=500,
        min_value_correct=0.5,
        max_value_correct=1.0
    )
    
  • Repetition Penalty Reward: Penalize repetitive content

    from reward_kit.rewards.repetition import repetition_penalty_reward
    
    result = repetition_penalty_reward(
        messages=messages,
        max_penalty=0.5,
        ngram_size=3
    )
    

Code Execution Rewards

These reward functions evaluate code by running it and comparing the output to expected results.
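
These functions expect the code to appear in the assistant's response, typically inside a fenced code block from which it is extracted and run. An illustrative (hypothetical) input:

    # The assistant's reply carries the code to execute
    messages = [
        {"role": "user", "content": "Write a Python program that prints the sum of 4 and 5."},
        {"role": "assistant", "content": "```python\nprint(4 + 5)\n```"},
    ]
    # With expected_output="9", a binary reward scores 1.0 on an exact
    # output match and 0.0 otherwise.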

  • Binary Code Reward: Binary pass/fail for code execution

    from reward_kit.rewards.code_execution import binary_code_reward
    
    result = binary_code_reward(
        messages=messages,
        expected_output="expected result",
        language="python"
    )
    
  • Fractional Code Reward: Return the exact test pass rate for code execution

    from reward_kit.rewards.code_execution import fractional_code_reward
    
    result = fractional_code_reward(
        messages=messages,
        test_cases=[
            {"input": "arg1", "expected_output": "result1"},
            {"input": "arg2", "expected_output": "result2"}
        ],
        language="python"
    )
    
  • IOI C/C++ Code Reward: Evaluate C/C++ code using the Piston engine

    from reward_kit.rewards.cpp_code import ioi_cpp_code_reward
    
    result = ioi_cpp_code_reward(
        messages=messages,
        test_cases=[
            {"input": "4\n5", "expected_output": "9"},
            {"input": "10\n20", "expected_output": "30"}
        ],
        language="cpp"  # or "c"
    )
    
  • Binary C/C++ Code Reward: Binary pass/fail for C/C++ code

    from reward_kit.rewards.cpp_code import binary_cpp_code_reward
    
    result = binary_cpp_code_reward(
        messages=messages,
        test_cases=[
            {"input": "4\n5", "expected_output": "9"}
        ],
        language="cpp"
    )
    

Function Calling Rewards

These reward functions evaluate function calls in LLM responses against expected schemas and behaviors.
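
The examples below assume an expected schema describing the function call and, for the LLM-judged variants, a plain-language description of the expected behavior. A hypothetical setup (the get_weather function and its exact field layout are illustrative; the precise schema format these functions accept may differ from this sketch):

    # Expected function call: name plus argument structure
    schema = {
        "name": "get_weather",
        "arguments": {
            "location": {"type": "string"},
            "unit": {"type": "string"},
        },
    }

    behavior_description = (
        "The call should pass the city mentioned by the user as location "
        "and a valid temperature unit."
    )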

  • Schema Jaccard Reward: Compare function calls to expected schema

    from reward_kit.rewards.function_calling import schema_jaccard_reward
    
    result = schema_jaccard_reward(
        messages=messages,
        expected_schema=schema
    )
    
  • LLM Judge Reward: Use an LLM to evaluate function call quality

    from reward_kit.rewards.function_calling import llm_judge_reward
    
    result = llm_judge_reward(
        messages=messages,
        expected_schema=schema,
        expected_behavior=behavior_description
    )
    
  • Composite Function Call Reward: Combine schema validation and LLM judgment

    from reward_kit.rewards.function_calling import composite_function_call_reward
    
    result = composite_function_call_reward(
        messages=messages,
        expected_schema=schema,
        expected_behavior=behavior_description
    )
    

JSON Schema Rewards

These reward functions validate JSON outputs against predefined schemas.
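
The example below passes a json_schema dict; any standard JSON Schema definition works. An illustrative schema for a simple person record:

    json_schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name", "age"],
    }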

  • JSON Schema Reward: Validate JSON against a schema

    from reward_kit.rewards.json_schema import json_schema_reward
    
    result = json_schema_reward(
        messages=messages,
        schema=json_schema
    )
    

Combined Metrics Rewards

These reward functions combine multiple evaluation aspects into a single score.

  • Cosine-Scaled Accuracy + Length Reward: Combine accuracy with length efficiency

    from reward_kit.rewards.accuracy_length import cosine_scaled_accuracy_length_reward
    
    result = cosine_scaled_accuracy_length_reward(
        messages=messages,
        ground_truth="Paris",
        max_length=200,
        correctness_weight=0.7,
        length_weight=0.3
    )
    

Choosing the Right Reward Function

Here’s a guide to help you choose the appropriate reward function for your task:

Task                                        Recommended Reward Function
Evaluating format adherence                 format_reward
Checking tag usage and structure            tag_count_reward
Evaluating factual accuracy                 accuracy_reward
Ensuring consistent language                language_consistency_reward
Encouraging step-by-step reasoning          reasoning_steps_reward
Controlling response length                 length_reward
Optimizing for brevity and correctness      cosine_scaled_accuracy_length_reward
Reducing repetition                         repetition_penalty_reward
Evaluating Python code                      fractional_code_reward or binary_code_reward
Evaluating C/C++ code                       ioi_cpp_code_reward or binary_cpp_code_reward
Validating tool use and function calls      composite_function_call_reward
Checking structured data outputs            json_schema_reward
Evaluating mathematical solutions           math_reward
Evaluating formal proofs in Lean            lean_prover_reward or deepseek_prover_v2_reward

Lean Theorem Prover Rewards

These reward functions evaluate formal proofs written in the Lean theorem prover language.

  • Lean Prover Reward: Basic evaluation of Lean proofs

    from reward_kit.rewards.lean_prover import lean_prover_reward
    
    result = lean_prover_reward(
        response=model_response,
        statement="For all natural numbers n, the sum of the first n natural numbers is n(n+1)/2.",
        lean_version="4",
        check_partial_progress=True
    )
    
  • DeepSeek Prover V2 Reward: Evaluate Lean proofs with focus on subgoal decomposition

    from reward_kit.rewards.lean_prover import deepseek_prover_v2_reward
    
    result = deepseek_prover_v2_reward(
        response=model_response,
        statement="For all natural numbers n, the sum of the first n natural numbers is n(n+1)/2.",
        check_subgoals=True,
        verbose=True
    )
    
  • DeepSeek HuggingFace Prover Benchmark: Evaluate proofs against the DeepSeek-ProverBench dataset

    from reward_kit.rewards.lean_prover import deepseek_huggingface_prover_benchmark
    
    result = deepseek_huggingface_prover_benchmark(
        response=model_response,
        statement="For any positive integers a and b, gcd(a,b) divides any linear combination of a and b",
        dataset_name="deepseek-ai/DeepSeek-ProverBench",
        check_for_answer=True
    )
    

Combining Reward Functions

You can combine multiple reward functions to create comprehensive evaluations:

from reward_kit.rewards.accuracy import accuracy_reward
from reward_kit.rewards.length import length_reward
from reward_kit import reward_function, RewardOutput, MetricRewardOutput

@reward_function
def combined_accuracy_length(messages, ground_truth=None, **kwargs):
    """Combine accuracy and length evaluation."""
    # Check accuracy
    accuracy_result = accuracy_reward(
        messages=messages,
        ground_truth=ground_truth
    )
    
    # Check length
    length_result = length_reward(
        messages=messages,
        target_length=150
    )
    
    # Combine scores with weighting
    # 70% accuracy, 30% length
    combined_score = 0.7 * accuracy_result["score"] + 0.3 * length_result["score"]
    
    # Combine metrics
    metrics = {
        "accuracy": MetricRewardOutput(
            score=accuracy_result["score"],
            reason=accuracy_result["reason"]
        ),
        "length": MetricRewardOutput(
            score=length_result["score"],
            reason=length_result["reason"]
        )
    }
    
    return RewardOutput(score=combined_score, metrics=metrics)
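
The decorated function is then invoked like any built-in reward (the sample messages here are illustrative):

    messages = [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ]

    result = combined_accuracy_length(messages=messages, ground_truth="Paris")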

Pre-Built Combined Metrics

Reward Kit offers pre-built functions that combine multiple metrics:

  • Cosine-Scaled Accuracy + Length: Combines accuracy with length using a cosine schedule

    from reward_kit.rewards.accuracy_length import cosine_scaled_accuracy_length_reward
    
    result = cosine_scaled_accuracy_length_reward(
        messages=messages,
        ground_truth="Paris",
        max_length=200,
        correctness_weight=0.7,  # Weight for accuracy component
        length_weight=0.3        # Weight for length component
    )
    

    This function:

    • Evaluates response accuracy against ground truth
    • Measures response length efficiency using a cosine schedule (a sketch follows this list)
    • Rewards shorter correct answers more than longer ones
    • Maintains a clear separation between correct and incorrect answers
    • Allows customizable weighting between accuracy and length
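
    As a rough illustration (not the library's exact implementation), the length component can be pictured as a cosine interpolation that starts at max_value_correct for very short responses and decays to min_value_correct as the response approaches max_length:

    import math

    def cosine_length_scale(length, max_length, min_value, max_value):
        """Cosine decay from max_value (length 0) to min_value (length == max_length)."""
        progress = min(length / max_length, 1.0)
        cosine = math.cos(progress * math.pi)  # 1.0 at length 0, -1.0 at max_length
        return min_value + 0.5 * (max_value - min_value) * (1.0 + cosine)

    # Shorter correct answers retain more of the reward:
    # cosine_length_scale(50, 200, 0.5, 1.0)   -> ~0.93
    # cosine_length_scale(200, 200, 0.5, 1.0)  -> 0.5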

Next Steps