Agent Evaluation Framework

The Agent Evaluation Framework lets you evaluate agent models that use tool-augmented reasoning. Evaluations are packaged as “Task Bundles”: self-contained directories that include all the components needed for testing and evaluation.

Task Bundle Structure

A task bundle is a self-contained directory with all the components needed to evaluate an agent:

my_task/
├─ reward.py           # Reward function with @reward_function decorator
├─ tools.py            # Tool registry for this specific task
├─ seed.sql            # Initial DB state (optional)
└─ task.jsonl          # Dataset rows with task specifications
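The reward.py module defines how a run is scored. Below is a minimal sketch of what it might contain, assuming the @reward_function decorator and the EvaluateResult type are importable from reward_kit; the argument names, the booking-confirmation check, and the task.jsonl field names further down are illustrative assumptions, not the framework's required schema.

# reward.py — illustrative sketch; the exact signature may differ
from reward_kit import reward_function, EvaluateResult

@reward_function
def evaluate(messages, **kwargs) -> EvaluateResult:
    # Assumes each message is a {"role": ..., "content": ...} dict;
    # adapt if your reward-kit version passes Message objects instead.
    final = (messages[-1].get("content") or "") if messages else ""
    booked = "booking confirmed" in final.lower()
    return EvaluateResult(
        score=1.0 if booked else 0.0,
        reason="Booking confirmed" if booked else "No booking confirmation found",
        metrics={},  # per-metric breakdowns can be attached here
    )

Each line of task.jsonl is one JSON object describing a task. The field names shown here are hypothetical placeholders for illustration only:

# task.jsonl — one JSON object per line (field names are hypothetical)
{"id": "flight.booking.001", "initial_messages": [{"role": "user", "content": "Book me a flight from SFO to JFK tomorrow."}]}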

CLI Usage

The agent evaluation framework is integrated with the Reward Kit CLI through the agent-eval command.

Basic Usage

# Run agent evaluation on a task bundle
reward-kit agent-eval --task-dir ./flight_task

# Alternatively, specify the task.jsonl dataset file directly
reward-kit agent-eval --dataset ./flight_task/task.jsonl

Environment Variables

Models can be specified using environment variables:

# Set model for agent evaluation
export MODEL_AGENT=openai/gpt-4o

# Set model for simulated user (optional)
export MODEL_SIM=openai/gpt-3.5-turbo

# Then run evaluation
reward-kit agent-eval --task-dir ./flight_task

Advanced Options

# Specify the agent model directly (overrides the MODEL_AGENT environment variable)
reward-kit agent-eval --task-dir ./flight_task --model openai/gpt-4o

# Use custom output directory
reward-kit agent-eval --task-dir ./flight_task --output-dir ./my_runs

# Disable simulated user (use static initial messages only)
reward-kit agent-eval --task-dir ./flight_task --no-sim-user

# Use test mode without requiring API keys
reward-kit agent-eval --task-dir ./flight_task --test-mode

# Use mock response in test mode
reward-kit agent-eval --task-dir ./flight_task --test-mode --mock-response

# Run in debug mode with verbose output
reward-kit agent-eval --task-dir ./flight_task --debug

# Limit the number of tasks to evaluate
reward-kit agent-eval --task-dir ./flight_task --max-tasks 2

# Run specific tasks by ID
reward-kit agent-eval --task-dir ./flight_task --task-ids flight.booking.001,flight.booking.002

# Override the bundle's tool registry with a custom one
reward-kit agent-eval --task-dir ./flight_task --registry-override my_custom_tools.flight_tools

# Use multiple tool registries
reward-kit agent-eval --task-dir ./complex_task --registries flight=flight_tools,hotel=hotel_tools

# Specify evaluator
reward-kit agent-eval --task-dir ./flight_task --evaluator flight_reward.success_evaluator

Testing & Debugging

The CLI provides several options for testing and debugging:

# Test mode verifies tool setup without making API calls
reward-kit agent-eval --task-dir ./flight_task --test-mode

# Debug mode shows detailed information about tool execution
reward-kit agent-eval --task-dir ./flight_task --debug

# Export tools as OpenAPI spec for manual testing
reward-kit agent-eval --task-dir ./flight_task --export-tools ./tools_spec

# Validate task bundle structure and requirements
reward-kit agent-eval --task-dir ./flight_task --validate-only

Examples

Basic Flight Task Evaluation

export MODEL_AGENT=openai/gpt-4o
reward-kit agent-eval --task-dir ./examples/flight_task

Testing Without API Keys

reward-kit agent-eval --task-dir ./examples/flight_task --test-mode --mock-response

Complex Task with Multiple Tool Registries

reward-kit agent-eval --task-dir ./examples/travel_task --registries flight=flight_tools,hotel=hotel_tools

Running with Specific Task IDs

reward-kit agent-eval --task-dir ./examples/flight_task --task-ids flight.booking.001,flight.booking.002

Using Debug Mode

reward-kit agent-eval --task-dir ./examples/flight_task --debug