Code Execution with E2B

This guide demonstrates how to use the E2B code execution reward function to evaluate code by running it in the E2B cloud sandbox.

Overview

The e2b_code_execution_reward function allows you to:

  1. Extract code blocks from LLM responses
  2. Execute the code securely in E2B’s cloud sandbox
  3. Compare the output with expected results
  4. Generate a score and detailed metrics

Prerequisites

To use the E2B code execution reward function, you need:

  1. An E2B API key from E2B Dashboard
  2. The e2b_code_interpreter Python package installed: pip install e2b_code_interpreter

Note: The code will also work with the e2b package, but e2b_code_interpreter is recommended as it provides a more stable interface specifically designed for code execution.

Basic Usage

Here’s a simple example of how to use the reward function:

from reward_kit.rewards.code_execution import e2b_code_execution_reward

# Example conversation with a Python coding task
messages = [
    {
        "role": "user",
        "content": "Write a Python function to calculate the factorial of a number."
    },
    {
        "role": "assistant",
        "content": """Here's a Python function to calculate the factorial of a number:

```python
def factorial(n):
    if n == 0 or n == 1:
        return 1
    else:
        return n * factorial(n - 1)

# Test the function
print(factorial(5))  # Should output 120

This function uses recursion to calculate the factorial. For n = 5, it computes 5 * 4 * 3 * 2 * 1 = 120.""" } ]

Define expected output

expected_output = “120”

Evaluate the code using E2B

result = e2b_code_execution_reward( messages=messages, expected_output=expected_output, language=“python”, api_key=“your_e2b_api_key”, timeout=10 )

Use the results

print(f”Score: ”) for metric_name, metric in result.metrics.items(): print(f”\n: “)


## Supported Languages

The E2B code execution reward function currently supports:

- Python (`language="python"`)
- JavaScript (`language="javascript"` or `language="js"`)

## Advanced Options

### Automatic Output Extraction

You can let the reward function automatically extract the expected output from the prompt:

```python
# Conversation with expected output in the prompt
messages = [
    {
        "role": "user",
        "content": "Write a Python function to find the sum of a list. Expected output: 15 (for [1,2,3,4,5])"
    },
    {
        "role": "assistant",
        "content": """```python
def sum_list(numbers):
    return sum(numbers)

print(sum_list([1, 2, 3, 4, 5]))
```"""
    }
]

# Pass the original messages for expected output extraction
result = e2b_code_execution_reward(
    messages=messages,
    original_messages=messages,
    language="python",
    api_key="your_e2b_api_key"
)

Fallback to Local Execution

You can gracefully fall back to local execution when an E2B API key is not available:

from reward_kit.rewards.code_execution import (
    e2b_code_execution_reward,
    local_code_execution_reward
)

# Try to use E2B if API key is provided
api_key = os.environ.get("E2B_API_KEY")

if api_key:
    result = e2b_code_execution_reward(
        messages=messages,
        expected_output=expected_output,
        language="python",
        api_key=api_key
    )
else:
    # Fall back to local execution
    result = local_code_execution_reward(
        messages=messages,
        expected_output=expected_output,
        language="python"
    )

Parameters

The e2b_code_execution_reward function accepts the following parameters:

ParameterTypeDescription
messagesList[Dict[str, str]]Generated conversation messages (required)
original_messagesList[Dict[str, str]]Original conversation context (optional)
expected_outputstrExpected output from code execution (optional)
languagestrProgramming language of the code (default: “python”)
timeoutintMaximum execution time in seconds (default: 30)
api_keystrE2B API key (default: None, uses E2B_API_KEY environment variable)

Return Value

The reward function returns a RewardOutput object with:

  • score: A float between 0.0 and 1.0 indicating how well the code performed
  • metrics: A dictionary of MetricRewardOutput objects with detailed information about the execution

Key metrics include:

  • extracted_code: The code that was extracted and executed
  • expected_output: The expected output (if provided or extracted)
  • execution_result: Details about the execution (success or failure)
  • output_match: Comparison between actual and expected outputs

Examples

See the examples/ directory for complete examples:

  • e2b_reward_example.py: Basic Python example
  • e2b_javascript_example.py: JavaScript example
  • e2b_auto_extract_example.py: Automatic output extraction example
  • e2b_fallback_example.py: Fallback to local execution example