Code Execution Evaluation
This guide demonstrates how to evaluate code solutions using the Reward Kit’s code execution reward functions.
Overview
The code execution reward functions allow you to:
- Extract code blocks from LLM responses
- Execute the code in a secure environment
- Compare the output with expected results
- Get detailed execution metrics and error reports
Prerequisites
Before using the code execution rewards, ensure you have:
- Python 3.8+ installed on your system
- Reward Kit installed: `pip install reward-kit`
- For JavaScript evaluation: Node.js installed on your system
Available Reward Functions
Reward Kit provides two main methods for code execution evaluation:
- Local Code Execution: Executes code securely on your local machine
- E2B Code Execution: Executes code in a cloud sandbox (requires E2B account)
Local Code Execution
Basic Usage
Here’s a simple example of evaluating Python code:
````python
from reward_kit.rewards.code_execution import local_code_execution_reward

messages = [
    {
        "role": "user",
        "content": "Write a Python function that calculates the factorial of 5."
    },
    {
        "role": "assistant",
        "content": """Here's a recursive solution:

```python
def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)

print(factorial(5))
```

This function uses recursion to calculate the factorial. For n = 5, it should output 120."""
    }
]

# Evaluate the code
result = local_code_execution_reward(
    messages=messages,
    expected_output="120",
    language="python",
    timeout=5
)

# Print the results
print(f"Score: {result.score}")
print("Metrics:")
for name, metric in result.metrics.items():
    print(f"  {name}: {metric.score}")
    print(f"  {metric.reason}")
````
JavaScript code can be evaluated in the same way:

````python
messages = [
    {
        "role": "user",
        "content": "Write a JavaScript function that checks whether a string is a palindrome, then test it on 'A man, a plan, a canal: Panama' and 'hello'."
    },
    {
        "role": "assistant",
        "content": """Here's a JavaScript palindrome checker:

```javascript
function isPalindrome(str) {
    const cleaned = str.toLowerCase().replace(/[^a-z0-9]/g, "");
    return cleaned === cleaned.split("").reverse().join("");
}

console.log(isPalindrome("A man, a plan, a canal: Panama"));
console.log(isPalindrome("hello"));
```

This function removes any non-alphanumeric characters and converts the string to lowercase before checking if it reads the same forward and backward."""
    }
]

# Evaluate the JavaScript code
result = local_code_execution_reward(
    messages=messages,
    expected_output="true\nfalse",
    language="javascript",
    timeout=5
)
````
Automatic Expected Output Extraction
If the expected output is mentioned in the conversation, it can be extracted automatically:
````python
messages = [
    {
        "role": "user",
        "content": "Write a Python function that sums the numbers from 1 to 100. The expected output is 5050."
    },
    {
        "role": "assistant",
        "content": """Here's a Python solution:

```python
def sum_to_n(n):
    return sum(range(1, n + 1))

print(sum_to_n(100))
```

This function uses the built-in sum() and range() functions to calculate the sum efficiently."""
    }
]

# Extract expected output from the conversation
result = local_code_execution_reward(
    messages=messages,
    original_messages=messages,  # Provide the original messages for extraction
    language="python"
)
````
Output Comparison
When an expected output is provided (or extracted), the actual output is scored by similarity rather than strict equality, so near matches receive partial credit:

| Expected | Actual | Score | Notes |
| --- | --- | --- | --- |
| `"Hello, world!"` | `"Hello, world!"` | 1.0 | Exact match |
| `"42"` | `"42.001"` | 0.99 | Very close match |
| `"[1, 2, 3]"` | `"[1, 2, 3, 4]"` | 0.75 | Partial match |
| `"Line 1\nLine 2\nLine 3"` | `"Line 1\nLine 2\nLine X"` | 0.89 | Most lines match |
Algorithm Comparison
Compare different algorithms for the same problem:
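A minimal sketch of this pattern, reusing the `local_code_execution_reward` call shown above: score two candidate responses that solve the same problem with different algorithms (here, a recursive and an iterative Fibonacci) against the same expected output, then compare the resulting scores. The problem statement and solutions are illustrative, not part of the Reward Kit API.

````python
# A sketch: evaluate two assistant responses that solve the same problem
# with different algorithms, then compare their scores.
from reward_kit.rewards.code_execution import local_code_execution_reward

user_message = {
    "role": "user",
    "content": "Write a Python program that prints the 10th Fibonacci number."
}

recursive_solution = {
    "role": "assistant",
    "content": """```python
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(10))
```""",
}

iterative_solution = {
    "role": "assistant",
    "content": """```python
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(10))
```""",
}

for label, solution in [("recursive", recursive_solution), ("iterative", iterative_solution)]:
    result = local_code_execution_reward(
        messages=[user_message, solution],
        expected_output="55",
        language="python",
        timeout=5,
    )
    print(f"{label}: {result.score}")
````

Both implementations print 55, so both should receive full credit; a difference in score would point to an execution problem (for example, a timeout for a much slower algorithm) rather than wrong output.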
Multiple Language Support
Evaluate solutions in different programming languages:
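As a sketch, the same task can be checked in both supported languages by switching the `language` parameter; everything except `local_code_execution_reward` and the parameters shown earlier is illustrative.

```python
# A sketch: evaluate equivalent Python and JavaScript solutions by
# switching the `language` parameter.
from reward_kit.rewards.code_execution import local_code_execution_reward

python_messages = [
    {"role": "user", "content": "Print the sum of 2 and 3."},
    {"role": "assistant", "content": "```python\nprint(2 + 3)\n```"},
]

javascript_messages = [
    {"role": "user", "content": "Print the sum of 2 and 3."},
    {"role": "assistant", "content": "```javascript\nconsole.log(2 + 3);\n```"},
]

python_result = local_code_execution_reward(
    messages=python_messages, expected_output="5", language="python", timeout=5
)
javascript_result = local_code_execution_reward(
    messages=javascript_messages, expected_output="5", language="javascript", timeout=5
)

print(f"Python score: {python_result.score}")
print(f"JavaScript score: {javascript_result.score}")
```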
Best Practices
- Security First: Always use the built-in security mechanisms and don’t disable them
- Timeout Setting: Choose reasonable timeouts based on task complexity
- Expected Output: Be specific about expected output format for accurate comparison
- Error Handling: Check execution error metrics even when code runs successfully (see the sketch after this list)
- Resource Limits: Set appropriate memory limits for the complexity of the code
- Test Environment: Ensure required dependencies are available in the execution environment
- Edge Cases: Test the reward function with a variety of inputs, including edge cases
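For example, even when the overall score looks acceptable, it can be worth surfacing any metric that lost points. A minimal sketch, reusing the `messages` and `local_code_execution_reward` call from Basic Usage; specific metric names vary, so the loop avoids assuming them.

```python
# A sketch: surface any metric that reduced the score, whatever its name,
# rather than relying on the overall score alone.
result = local_code_execution_reward(
    messages=messages,
    expected_output="120",
    language="python",
    timeout=5,
)

for name, metric in result.metrics.items():
    if metric.score < 1.0:
        # Report the metric's reason, e.g. an execution error or output mismatch.
        print(f"{name}: {metric.score} - {metric.reason}")
```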
Limitations
- Cannot evaluate non-deterministic code reliably
- Limited to languages supported by the reward function (Python and JavaScript for local execution)
- Cannot evaluate code that requires external resources (databases, APIs, etc.) without mocking
- May have limitations with GUI applications or complex I/O operations
- Security mechanisms may prevent some valid code from executing
Next Steps
- For cloud-based code execution, see Code Execution with E2B
- Learn about Function Calling Evaluation for evaluating tool use
- Explore JSON Schema Validation for structured outputs
- See Creating Custom Reward Functions to build your own evaluators