Function Calling Evaluation
This guide demonstrates how to evaluate function calls made by AI models using a combination of schema validation and LLM judgment.
Prerequisites
Before using the function calling evaluation rewards, ensure you have:
- Python 3.8+ installed on your system
- Reward Kit installed: `pip install reward-kit`
- OpenAI Python Client installed (for the LLM judge): `pip install openai`
- OpenAI API Key (for LLM judge evaluation)
Function Calling Reward Components
The Reward Kit provides three approaches to evaluating function calls:
- Schema Jaccard Reward: Compares function call structure to expected schema using Jaccard similarity
- LLM Judge Reward: Uses GPT-4o-mini to evaluate function call quality based on expected behavior
- Composite Reward: Combines schema validation and LLM judgment for comprehensive evaluation
Schema Jaccard Reward
The Schema Jaccard Reward evaluates how well a function call matches the expected schema by calculating the Jaccard similarity between property sets.
Example Usage
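A minimal sketch of calling the schema reward on a conversation. The import path and the `expected_schema` parameter name are assumptions based on the description below; check the Reward Kit API reference for the exact signature.

```python
from reward_kit.rewards.function_calling import schema_jaccard_reward  # assumed import path

# Conversation in which the assistant made a function call
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "content": "",
        "function_call": {
            "name": "get_weather",
            "arguments": '{"location": "Paris", "unit": "celsius"}',
        },
    },
]

# Expected function name and argument schema to compare against
expected_schema = {
    "name": "get_weather",
    "arguments": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string"},
        },
        "required": ["location"],
    },
}

result = schema_jaccard_reward(
    messages=messages,
    expected_schema=expected_schema,
)

# Assuming the result exposes a score and per-metric breakdown
print(result.score)    # weighted name match + Jaccard similarity
print(result.metrics)  # matching, missing, and unexpected properties
```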
How It Works
- Extracts the function call information from the messages, or directly from the provided function_call parameter
- Compares the function name against the expected name (exact match required)
- Compares the argument schema structure using Jaccard similarity, which measures:
- The size of the intersection of properties divided by the size of the union (see the worked example below)
- Generates a comprehensive report of matching, missing, and unexpected properties
- Calculates the final score as a weighted combination of the name match and the schema similarity
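To make the similarity step concrete, here is the Jaccard calculation on two small property sets (plain Python, independent of the library):

```python
# Jaccard similarity = |intersection| / |union| of the property-name sets
expected_props = {"location", "unit", "days"}
actual_props = {"location", "unit", "language"}

intersection = expected_props & actual_props   # {"location", "unit"}
union = expected_props | actual_props          # 4 properties in total
jaccard = len(intersection) / len(union)       # 2 / 4 = 0.5
print(jaccard)
```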
LLM Judge Reward
The LLM Judge Reward uses GPT-4o-mini to evaluate the quality and correctness of function calls based on expected behavior.
Example Usage
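A sketch of invoking the judge. Parameter names such as `expected_behavior` are assumptions based on the description below, and an OpenAI API key must be available in the environment.

```python
import os

from reward_kit.rewards.function_calling import llm_judge_reward  # assumed import path

# The judge call requires an OpenAI API key, e.g. export OPENAI_API_KEY=...
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running"

messages = [
    {"role": "user", "content": "Book a table for two at 7pm tonight."},
    {
        "role": "assistant",
        "content": "",
        "function_call": {
            "name": "book_restaurant",
            "arguments": '{"party_size": 2, "time": "19:00"}',
        },
    },
]

result = llm_judge_reward(
    messages=messages,
    expected_schema={
        "name": "book_restaurant",
        "arguments": {
            "type": "object",
            "properties": {
                "party_size": {"type": "integer"},
                "time": {"type": "string"},
            },
        },
    },
    expected_behavior="Call book_restaurant with the party size and a valid 24-hour time.",
)

print(result.score)    # 0.0 - 1.0 as judged by the model
print(result.metrics)  # should include the judge's explanation
```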
How It Works
- Extracts function call information from the messages
- Formats a prompt with:
- Conversation context
- Function call details
- Expected schema
- Expected behavior description
- Sends the prompt to GPT-4o-mini (or another specified model)
- Parses the response to extract:
- Numeric score between 0.0 and 1.0
- Detailed explanation of strengths and weaknesses
- Returns the LLM’s evaluation as a reward score with explanation
Composite Function Call Reward
The Composite Function Call Reward combines both schema validation and LLM judgment for a comprehensive evaluation.
Example Usage
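A sketch of a combined evaluation, assuming the composite reward is exposed as `composite_function_call_reward` alongside the other two functions (check the package for the exact name and signature):

```python
from reward_kit.rewards.function_calling import composite_function_call_reward  # assumed name

messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "content": "",
        "function_call": {
            "name": "get_weather",
            "arguments": '{"location": "Paris", "unit": "celsius"}',
        },
    },
]

expected_schema = {
    "name": "get_weather",
    "arguments": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string"},
        },
        "required": ["location"],
    },
}

expected_behavior = (
    "The call should use get_weather with the city the user asked about "
    "and a sensible temperature unit."
)

result = composite_function_call_reward(
    messages=messages,
    expected_schema=expected_schema,
    expected_behavior=expected_behavior,
)

print(result.score)  # weighted combination of the schema and LLM scores
for name, metric in result.metrics.items():
    print(name, metric)  # metric names prefixed with schema_ and llm_
```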
How It Works
- Runs both schema_jaccard_reward and llm_judge_reward separately
- Combines the metrics from both evaluations with prefixes: `schema_` for schema validation metrics and `llm_` for LLM judgment metrics
- Calculates a weighted average of both scores based on provided weights
- Returns a comprehensive set of metrics with the weighted final score
Advanced Usage
Custom Weights
You can customize the weights for different components:
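For example, to favor structural correctness over the judge's assessment. This reuses the `messages`, `expected_schema`, and `expected_behavior` from the composite example above, and the `"schema"`/`"llm"` weight keys are an assumption mirroring the metric prefixes:

```python
result = composite_function_call_reward(
    messages=messages,
    expected_schema=expected_schema,
    expected_behavior=expected_behavior,
    weights={"schema": 0.7, "llm": 0.3},  # structure counts for 70% of the final score
)
```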
Custom LLM Model
You can specify a different model for LLM evaluation:
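For instance, assuming the judge accepts a `model` parameter as described above:

```python
result = llm_judge_reward(
    messages=messages,
    expected_schema=expected_schema,
    expected_behavior=expected_behavior,
    model="gpt-4o",  # any OpenAI chat model; GPT-4o-mini is the default described above
)
```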
Direct Function Call Evaluation
You can also evaluate a function call directly, without extracting it from the messages:
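A sketch, assuming the `function_call` argument accepts the call's name and parsed arguments directly:

```python
result = schema_jaccard_reward(
    messages=[],  # no conversation needed when the call is supplied directly
    function_call={
        "name": "get_weather",
        "arguments": {"location": "Paris", "unit": "celsius"},
    },
    expected_schema=expected_schema,
)
print(result.score)
```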
Use Case: Evaluating Tool Use in Models
One common application is evaluating how well different models use tools:
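For example, you can score the same task across several models and rank them. This sketch reuses the composite reward and the `expected_schema` / `expected_behavior` defined earlier; the transcripts are illustrative and would normally be collected from each provider's API.

```python
# Transcripts produced by two different models for the same task
model_outputs = {
    "model-a": [
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": "",
         "function_call": {"name": "get_weather",
                           "arguments": '{"location": "Paris", "unit": "celsius"}'}},
    ],
    "model-b": [
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": "",
         "function_call": {"name": "get_weather",
                           "arguments": '{"city": "Paris"}'}},  # wrong property name
    ],
}

scores = {}
for model_name, transcript in model_outputs.items():
    result = composite_function_call_reward(
        messages=transcript,
        expected_schema=expected_schema,
        expected_behavior=expected_behavior,
    )
    scores[model_name] = result.score

# Rank models by how well they used the tool
for model_name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model_name}: {score:.2f}")
```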
Best Practices
- Clear Expected Schemas: Define schemas with precise types and required properties
- Detailed Expected Behavior: Provide specific guidance for what constitutes correct behavior
- Combined Evaluation: Use the composite reward for the most comprehensive evaluation
- Custom Weights: Adjust weights based on whether structure or behavior is more important
- Testing: Test reward functions with a variety of function calls, including edge cases
- Fallback Options: Always handle API errors gracefully in the LLM judge evaluation (see the sketch below)
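For instance, a simple fallback wrapper (a sketch, not part of the library, reusing the imports from the earlier examples) that degrades to schema-only scoring when the judge call fails:

```python
def safe_function_call_reward(messages, expected_schema, expected_behavior):
    """Prefer the composite reward, but fall back to schema-only scoring on API errors."""
    try:
        return composite_function_call_reward(
            messages=messages,
            expected_schema=expected_schema,
            expected_behavior=expected_behavior,
        )
    except Exception as exc:  # e.g. missing API key, rate limits, network failures
        print(f"LLM judge unavailable ({exc}); falling back to schema-only scoring")
        return schema_jaccard_reward(
            messages=messages,
            expected_schema=expected_schema,
        )
```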
Next Steps
- Learn about Creating Custom Reward Functions
- Explore Advanced Reward Functions for more complex evaluations
- See Best Practices for reward function design