Math evaluation
Math Evaluation
This guide demonstrates how to evaluate mathematical answers in LLM responses using the math reward functions.
Overview
The math_reward
function allows you to:
- Extract numerical answers from LLM responses
- Compare them with expected answers or reference solutions
- Handle various formats including fractions, decimals, and scientific notation
- Support LaTeX formatted answers in markdown
Prerequisites
Before using the math evaluation rewards, ensure you have:
- Python 3.8+ installed on your system
- Reward Kit installed:
pip install reward-kit
Basic Usage
Here’s a simple example of how to use the math reward function:
How It Works
The math reward function:
- Extracts potential answer values from the last assistant message
- Extracts expected answer value from the provided string
- Compares them with tolerance for floating-point values
- Returns a score of 1.0 for correct answers and 0.0 for incorrect answers
- Provides detailed metrics about the extraction and comparison process
Supported Answer Formats
The math reward function can extract and compare answers in various formats:
Integer and Decimal Numbers
Fractions
Scientific Notation
LaTeX Formatting
Units
Advanced Usage
Customizing Extraction
You can customize the extraction process to look for answers in particular formats or locations:
Multiple Valid Answers
Sometimes, multiple forms of the same answer are acceptable. You can evaluate against multiple correct answers:
Original Messages as Reference
If the correct answer is in the original messages, you can extract it automatically:
Use Cases
Evaluating Math Problem Solving
The math reward function is perfect for evaluating responses to:
- Basic arithmetic problems
- Algebra equations
- Calculus problems
- Physics calculations
- Economics computations
- Statistics problems
Educational Applications
Use the math reward function to:
- Automatically grade math homework
- Provide instant feedback on practice problems
- Evaluate mathematical reasoning in tutoring systems
Best Practices
- Be Explicit About Units: Specify whether units should be considered in the comparison
- Consider Fractions vs. Decimals: Decide if approximate decimal answers are acceptable for fraction problems
- Set Appropriate Tolerance: Use a tolerance appropriate for the problem (e.g., higher for complex calculations)
- Look for Final Answers: Set up extraction patterns to focus on the final answer rather than intermediate steps
- Multiple Representations: Consider all valid forms of an answer (fraction, decimal, scientific notation)
- LaTeX Handling: Take advantage of the LaTeX support for nicely formatted answers
Limitations
- Cannot evaluate the correctness of the solution method, only the final answer
- May have difficulty with extremely complex LaTeX expressions
- Cannot evaluate mathematical proofs or abstract reasoning
- Works best with numerical answers rather than symbolic expressions
Next Steps
- Learn about Code Execution Evaluation for evaluating code solutions
- Explore Function Calling Evaluation for evaluating tool use
- See Creating Custom Reward Functions to build your own specialized math evaluators