Combined Metrics Rewards
This guide focuses on reward functions that combine multiple evaluation aspects into a single score. These combined metrics provide a more comprehensive assessment of model responses.
Introduction to Combined Metrics
In real-world evaluation scenarios, we often want to consider multiple aspects of quality simultaneously. For example:
- Responses should be both accurate AND concise
- Code solutions should be both correct AND efficient
- Explanations should be both clear AND well-structured
Combined metric rewards allow you to assess multiple dimensions in a single reward function with appropriate weightings.
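For intuition, a combined metric typically scores each aspect on a common 0.0-1.0 scale and then blends the scores with weights. The small helper below is purely illustrative (the name combine_scores and its inputs are not part of the library):

```python
# Purely illustrative: blend per-aspect scores (each 0.0-1.0) using weights.
def combine_scores(scores: dict, weights: dict) -> float:
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(scores[name] * weights.get(name, 0.0) for name in scores) / total

# Accuracy weighted more heavily than brevity: 0.7 * 1.0 + 0.3 * 0.6 = 0.88
combine_scores({"accuracy": 1.0, "brevity": 0.6}, {"accuracy": 0.7, "brevity": 0.3})
```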
Available Combined Metric Rewards
Cosine-Scaled Accuracy + Length Reward
The cosine_scaled_accuracy_length_reward function combines accuracy evaluation with length efficiency into a unified score. Note that this function depends on the accuracy detection mechanisms, which may need customization for different types of content through the extract_fn and compare_fn parameters.
Key Features
- Dual Evaluation: Assesses both factual accuracy and response length
- Cosine Scaling: Uses cosine scheduling to reward brevity in correct responses
- Weighted Components: Allows customizing the importance of accuracy vs. length
- Asymmetric Penalties: Handles correct and incorrect responses differently
  - For correct answers: shorter is better (higher reward)
  - For incorrect answers: longer explanations are penalized less (encouraging showing work)
- Customizable Parameters: Flexible configuration for different use cases
How It Works
- Accuracy Evaluation:
  - Extracts an answer from the response
  - Compares it to the ground truth with semantic matching
  - Produces an accuracy score (0.0-1.0)
- Length Evaluation:
  - Counts tokens in the response
  - Applies cosine scaling based on token count vs. max_length
  - Produces a length score (0.0-1.0)
- Combined Scoring:
  - Takes a weighted average of the accuracy and length scores
  - Maintains a clear separation between correct and incorrect answers
  - The final score prioritizes accuracy while considering length
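To make the scaling concrete, the sketch below shows one way a cosine schedule can map a token count into a reward band, including the asymmetric treatment of correct and incorrect answers. It is a simplified illustration of the behavior described above, not the library's exact formula:

```python
import math

# Simplified sketch of the cosine scaling described above; the library's exact
# formula may differ in detail.
def cosine_scaled_value(token_count: int, max_length: int, is_correct: bool,
                        min_value_wrong: float = 0.0, max_value_wrong: float = 0.3,
                        min_value_correct: float = 0.5, max_value_correct: float = 1.0) -> float:
    progress = min(token_count / max_length, 1.0)
    cosine = math.cos(progress * math.pi)  # 1.0 for very short responses, -1.0 at max_length
    if is_correct:
        # Correct answers: shorter responses earn more.
        lo, hi = min_value_correct, max_value_correct
    else:
        # Incorrect answers: longer responses are penalized less.
        lo, hi = max_value_wrong, min_value_wrong
    return lo + 0.5 * (hi - lo) * (1.0 + cosine)

print(round(cosine_scaled_value(100, 1000, is_correct=True), 3))   # short and correct -> near 1.0
print(round(cosine_scaled_value(900, 1000, is_correct=False), 3))  # long and wrong -> near 0.3
```

In the actual reward, this length-based value is blended with the accuracy score according to correctness_weight and length_weight to produce the final combined score.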
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
messages | List[Dict/Message] | Required | Conversation messages to evaluate |
ground_truth | str | None | Expected correct answer |
extract_fn | Callable | None | Custom function to extract answer from text |
compare_fn | Callable | None | Custom function to compare answers |
max_length | int | 1000 | Maximum token length for scaling |
min_value_wrong | float | 0.0 | Minimum reward for wrong answers |
max_value_wrong | float | 0.3 | Maximum reward for wrong answers |
min_value_correct | float | 0.5 | Minimum reward for correct answers |
max_value_correct | float | 1.0 | Maximum reward for correct answers |
token_method | str | "whitespace" | Method to count tokens |
correctness_weight | float | 0.7 | Weight for accuracy component |
length_weight | float | 0.3 | Weight for length component |
Return Value
An EvaluateResult object with:
- score: Combined weighted score (0.0-1.0)
- reason: Detailed explanation of the evaluation
- metrics:
  - combined_reward: Overall evaluation result
  - accuracy: Accuracy component evaluation
  - length: Length component evaluation
  - token_count: Token count details
Example
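Below is a minimal usage sketch. The import path is an assumption (adjust it to wherever the function lives in your installation), and the messages are assumed to be plain role/content dictionaries; the result fields follow the Return Value section above.

```python
# Import path is an assumption; adjust to match your installation.
from reward_kit.rewards.accuracy_length import cosine_scaled_accuracy_length_reward

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

result = cosine_scaled_accuracy_length_reward(
    messages=messages,
    ground_truth="Paris",
    max_length=200,          # responses near or beyond this many tokens score low on length
    correctness_weight=0.7,  # accuracy dominates the combined score
    length_weight=0.3,
)

print(result.score)           # combined weighted score (0.0-1.0)
print(result.reason)          # explanation of the accuracy and length components
print(result.metrics.keys())  # per-component details (accuracy, length, token_count, ...)
```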
Use Cases
- Factual QA: Reward concise, correct answers over verbose ones
- Mathematical problems: Evaluate correctness while encouraging brevity
- Knowledge retrieval: Balance accuracy with response length
- Instruction following: Ensure responses are both correct and appropriately sized
Advanced Configuration
Fine-tune the behavior with these parameter adjustments:
- Encouraging brevity: Increase length_weight and decrease max_length
- Prioritizing accuracy: Increase correctness_weight and decrease length_weight
- Allowing detailed explanations: Increase max_length while maintaining weighting
- Strict scoring: Increase the gap between max_value_wrong and min_value_correct
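For example, to emphasize brevity while also making the scoring stricter, you might combine several of these adjustments. The values below are illustrative, and messages is the conversation from the earlier example:

```python
# Illustrative configuration: tighter length budget, stricter correct/incorrect gap.
result = cosine_scaled_accuracy_length_reward(
    messages=messages,
    ground_truth="Paris",
    max_length=150,          # tighter length budget encourages brevity
    correctness_weight=0.6,  # accuracy still dominates...
    length_weight=0.4,       # ...but brevity matters more than the default
    max_value_wrong=0.1,     # wrong answers capped lower
    min_value_correct=0.6,   # correct answers start higher
)
```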
Creating Custom Combined Metrics
You can create custom combined metrics by using the @reward_function decorator:
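The sketch below shows one way such a metric might look. The import path and the exact constructor signatures of EvaluateResult and MetricResult are assumptions (check your installation's API), and the scoring logic (keyword coverage plus a word-count budget) is purely illustrative:

```python
# Import path and constructor signatures are assumptions; adjust to your installation.
from reward_kit import reward_function, EvaluateResult, MetricResult

@reward_function
def coverage_and_brevity_reward(messages, ground_truth=None, max_words=150, **kwargs) -> EvaluateResult:
    """Illustrative combined metric: keyword coverage (weight 0.7) + brevity (weight 0.3)."""
    # Assumes messages are role/content dictionaries; the last one is the response.
    response = messages[-1].get("content", "") if messages else ""
    words = response.split()

    # Component 1: crude accuracy proxy via keyword coverage of the ground truth.
    expected = set((ground_truth or "").lower().split())
    found = {w.lower().strip(".,!?") for w in words}
    coverage = len(expected & found) / len(expected) if expected else 0.0

    # Component 2: brevity, linearly penalizing responses beyond the word budget.
    brevity = max(0.0, 1.0 - max(0, len(words) - max_words) / max_words)

    score = 0.7 * coverage + 0.3 * brevity
    return EvaluateResult(
        score=score,
        reason=f"coverage={coverage:.2f}, brevity={brevity:.2f}",
        metrics={
            "coverage": MetricResult(score=coverage, reason="fraction of ground-truth keywords found"),
            "brevity": MetricResult(score=brevity, reason=f"{len(words)} words vs. budget of {max_words}"),
        },
    )
```

As the tips below suggest, keep each component on the same 0.0-1.0 scale and report a reason for each so the combined score stays interpretable.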
Tips for Creating Combined Metrics
- Choose appropriate weights based on the relative importance of each component
- Ensure scale consistency across all component metrics (typically 0.0-1.0)
- Provide detailed reasons for each component and the combined score
- Handle edge cases where one component might fail
- Document parameters clearly for users of your combined metric
Best Practices
When using combined metrics rewards:
- Start simple: Begin with equal weights and adjust based on results
- Test on diverse examples: Ensure your metrics work across different response styles
- Avoid too many components: Two or three aspects are typically optimal
- Balance importance: Set weights to reflect true priorities
- Document clearly: Make sure users understand what each component measures
Next Steps
- Explore other out-of-the-box reward functions
- Learn how to create your own reward functions
- Study best practices for reward functions
- See how to deploy your reward functions