Math Evaluation
This guide explains how to evaluate mathematical answers in LLM responses, primarily focusing on the `math_reward` function.
For a complete, runnable example of math evaluation using the GSM8K dataset, including the Hydra configuration for `reward-kit run`, please refer to the Math Example README located in the `examples/math_example/` directory.
The content below details the capabilities and programmatic usage of the underlying `math_reward` function, which is used within `examples/math_example`.
`math_reward` Function Overview
The `math_reward` function (found in `reward_kit.rewards.math`) allows you to:
- Extract numerical answers from LLM responses
- Compare them with expected answers or reference solutions
- Handle various formats including fractions, decimals, and scientific notation
- Support LaTeX formatted answers in markdown
Prerequisites for Programmatic Use
To use the `math_reward` function directly in Python as shown below, ensure you have:
- Python 3.8+ installed on your system.
- Reward Kit installed: `pip install reward-kit`. (Note: running the full `examples/math_example` might require `pip install -e ".[dev]"`, as described in its README.)
Basic Programmatic Usage of `math_reward`
The following examples demonstrate direct programmatic use of the `math_reward` function. This can be useful for testing the function's behavior or integrating it into custom scripts. For evaluating a full dataset of math problems, refer to the `examples/math_example/` directory.
Here’s a simple example:
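The snippet below is a minimal sketch of a direct call. The message-list input, the `ground_truth` parameter name, and the `score`/`metrics` attributes on the result are assumptions based on Reward Kit's usual reward-function interface; check the `math_reward` docstring in `reward_kit.rewards.math` if your version differs.

```python
from reward_kit.rewards.math import math_reward

# A short conversation ending with the assistant's numerical answer.
messages = [
    {"role": "user", "content": "What is 25 * 4 + 10?"},
    {"role": "assistant", "content": "25 * 4 = 100, and 100 + 10 = 110. The answer is 110."},
]

# The expected answer is provided as a string; the parameter name `ground_truth`
# is an assumption -- confirm it against the function signature in your version.
result = math_reward(messages=messages, ground_truth="110")

print(result.score)    # 1.0 for a correct answer, 0.0 otherwise
print(result.metrics)  # details about extraction and comparison
```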
How `math_reward` Works
The `math_reward` function:
- Extracts potential answer values from the last assistant message
- Extracts expected answer value from the provided string
- Compares them with tolerance for floating-point values
- Returns a score of 1.0 for correct answers and 0.0 for incorrect answers
- Provides detailed metrics about the extraction and comparison process (see the sketch after this list)
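The sketch below shows what inspecting that result might look like. It assumes the result exposes `score` and a mapping of named `metrics`; the exact metric keys depend on the Reward Kit version, so treat the printed names as illustrative.

```python
from reward_kit.rewards.math import math_reward

messages = [
    {"role": "user", "content": "A train travels 120 km in 2 hours. What is its average speed?"},
    {"role": "assistant", "content": "Average speed = 120 / 2 = 60 km/h. The answer is 60."},
]

result = math_reward(messages=messages, ground_truth="60")

# The overall score is 1.0 (correct) or 0.0 (incorrect).
print("score:", result.score)

# Each metric records part of the extraction/comparison process,
# e.g. which value was extracted and whether it matched (names vary by version).
for name, metric in result.metrics.items():
    print(name, metric)
```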
Supported Answer Formats
The math reward function can extract and compare answers in various formats:
Integer and Decimal Numbers
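An illustrative call with a plain decimal answer (same assumed signature as the basic example above):

```python
from reward_kit.rewards.math import math_reward

messages = [
    {"role": "user", "content": "What is 7.5 + 2.25?"},
    {"role": "assistant", "content": "7.5 + 2.25 = 9.75, so the answer is 9.75."},
]
# A plain decimal in the response compared against a decimal ground truth.
result = math_reward(messages=messages, ground_truth="9.75")
```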
Fractions
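For instance, a fractional answer (illustrative values, same assumed signature):

```python
from reward_kit.rewards.math import math_reward

messages = [
    {"role": "user", "content": "Simplify 6/8."},
    {"role": "assistant", "content": "6/8 reduces to 3/4, so the answer is 3/4."},
]
# A fraction in the response; an equivalent decimal ground truth should also
# match if fractions are normalized as described above.
result = math_reward(messages=messages, ground_truth="3/4")
```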
Scientific Notation
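For instance, an answer written in scientific notation (illustrative values, same assumed signature):

```python
from reward_kit.rewards.math import math_reward

messages = [
    {"role": "user", "content": "Roughly how many atoms are in a mole?"},
    {"role": "assistant", "content": "Avogadro's number is about 6.022e23, so the answer is 6.022e23."},
]
result = math_reward(messages=messages, ground_truth="6.022e23")
```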
LaTeX Formatting
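For instance, a LaTeX-formatted answer in markdown (illustrative values, same assumed signature):

```python
from reward_kit.rewards.math import math_reward

messages = [
    {"role": "user", "content": "What is 1/2 + 1/4?"},
    {"role": "assistant", "content": r"The sum is $\frac{3}{4}$."},
]
# A LaTeX fraction in the response; the ground truth can be a plain value.
result = math_reward(messages=messages, ground_truth="3/4")
```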
Units
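For instance, an answer that includes a unit (illustrative values, same assumed signature):

```python
from reward_kit.rewards.math import math_reward

messages = [
    {"role": "user", "content": "A car travels 150 km in 3 hours. What is its speed?"},
    {"role": "assistant", "content": "Speed = 150 / 3 = 50 km/h. The answer is 50 km/h."},
]
# The numeric value is compared; how units are treated depends on the
# function's configuration (see Best Practices below).
result = math_reward(messages=messages, ground_truth="50")
```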
Advanced Programmatic Usage of `math_reward`
Customizing Extraction
You can customize the extraction process to look for answers in particular formats or locations:
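The available keyword arguments depend on your Reward Kit version, so check the `math_reward` docstring for the supported options. As a version-independent illustration, the sketch below narrows the comparison to an explicit "Final answer:" line before calling the function; the regex and message trimming are illustrative, not part of the library.

```python
import re
from reward_kit.rewards.math import math_reward

messages = [
    {"role": "user", "content": "Solve 3x + 5 = 20 and show your work."},
    {
        "role": "assistant",
        "content": "3x = 15 along the way, so x = 15 / 3 = 5.\nFinal answer: 5",
    },
]

# Keep only the explicit final-answer line so intermediate values
# (like 15 above) are not picked up as candidate answers.
final_line = re.search(r"Final answer:.*", messages[-1]["content"])
trimmed = messages[:-1] + [{"role": "assistant", "content": final_line.group(0)}]

result = math_reward(messages=trimmed, ground_truth="5")
```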
Multiple Valid Answers
Sometimes, multiple forms of the same answer are acceptable. You can evaluate against multiple correct answers:
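One version-independent way to do this is to score against each accepted form and keep the best result. A sketch, assuming the call signature used in the basic example:

```python
from reward_kit.rewards.math import math_reward

messages = [
    {"role": "user", "content": "What is 1/4 as a decimal?"},
    {"role": "assistant", "content": "1/4 is 0.25, so the answer is 0.25."},
]

# Accept either the fractional or the decimal form of the same answer.
accepted_answers = ["1/4", "0.25"]
best = max(
    (math_reward(messages=messages, ground_truth=ans) for ans in accepted_answers),
    key=lambda r: r.score,
)
print(best.score)  # 1.0 if any accepted form matches
```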
Original Messages as Reference
If the correct answer is in the original messages, you can extract it automatically:
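A sketch of that pattern, assuming the reference answer appears in an earlier message; the parsing shown here is illustrative rather than a built-in feature, so check your version of `math_reward` for direct support of original messages.

```python
import re
from reward_kit.rewards.math import math_reward

# The reference solution lives in the original conversation, e.g. in a system prompt.
original_messages = [
    {"role": "system", "content": "Grade the student's work. Reference answer: 42"},
    {"role": "user", "content": "What is 6 * 7?"},
]
response_messages = original_messages + [
    {"role": "assistant", "content": "6 * 7 = 42, so the answer is 42."},
]

# Pull the expected value out of the original messages and use it as ground truth.
reference = re.search(r"Reference answer:\s*(\S+)", original_messages[0]["content"]).group(1)
result = math_reward(messages=response_messages, ground_truth=reference)
```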
Use Cases
Evaluating Math Problem Solving
The math reward function is perfect for evaluating responses to:
- Basic arithmetic problems
- Algebra equations
- Calculus problems
- Physics calculations
- Economics computations
- Statistics problems
Educational Applications
Use the math reward function to:
- Automatically grade math homework
- Provide instant feedback on practice problems
- Evaluate mathematical reasoning in tutoring systems
Best Practices
- Be Explicit About Units: Specify whether units should be considered in the comparison
- Consider Fractions vs. Decimals: Decide if approximate decimal answers are acceptable for fraction problems
- Set Appropriate Tolerance: Use a tolerance appropriate for the problem (e.g., higher for complex calculations); see the sketch after this list
- Look for Final Answers: Set up extraction patterns to focus on the final answer rather than intermediate steps
- Multiple Representations: Consider all valid forms of an answer (fraction, decimal, scientific notation)
- LaTeX Handling: Take advantage of the LaTeX support for nicely formatted answers
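As a concrete illustration of the tolerance point above: the `tolerance` keyword shown here is hypothetical, so confirm the supported options in the `math_reward` docstring before relying on it.

```python
from reward_kit.rewards.math import math_reward

messages = [
    {"role": "user", "content": "Estimate the circumference of a circle with radius 1."},
    {"role": "assistant", "content": "C = 2 * pi, which is approximately 6.28."},
]

# `tolerance` is a hypothetical keyword argument here. If your version of
# math_reward does not accept it, remove it and compare against each
# acceptable rounding instead (as in the Multiple Valid Answers section).
result = math_reward(messages=messages, ground_truth="6.283", tolerance=0.01)
print(result.score)
```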
Limitations
- Cannot evaluate the correctness of the solution method, only the final answer
- May have difficulty with extremely complex LaTeX expressions
- Cannot evaluate mathematical proofs or abstract reasoning
- Works best with numerical answers rather than symbolic expressions
Next Steps
- Learn about Code Execution Evaluation for evaluating code solutions
- See Tool Calling Example for evaluating tool use
- See Creating Custom Reward Functions to build your own specialized math evaluators