The Agent Evaluation Framework lets you evaluate agent models with tool-augmented reasoning using “Task Bundles”.
A task bundle is a self-contained directory with all the components needed to test and evaluate an agent:
    my_task/
    ├─ reward.py     # Reward function with @reward_function decorator
    ├─ tools.py      # Tool registry for this specific task
    ├─ seed.sql      # Initial DB state (optional)
    └─ task.jsonl    # Dataset rows with task specifications
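Before running an evaluation it can help to confirm that a directory actually has the bundle layout listed above. The following is a minimal sketch using only the Python standard library. The helper `missing_bundle_files` is hypothetical and not part of reward-kit, and treating every file except `seed.sql` as required is an assumption based on the "(optional)" note in the layout.

```python
import tempfile
from pathlib import Path

# File names come from the bundle layout above.
# Assumption: everything except seed.sql is required.
REQUIRED_FILES = ("reward.py", "tools.py", "task.jsonl")
OPTIONAL_FILES = ("seed.sql",)


def missing_bundle_files(task_dir):
    """Return the required bundle files that are absent from task_dir."""
    root = Path(task_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Demonstrate on an empty temporary directory: all required files are missing.
    with tempfile.TemporaryDirectory() as scratch:
        print(missing_bundle_files(scratch))
```

An empty result means the required files are present. The check says nothing about whether their contents are valid.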
    # Run agent evaluation on a task bundle
    reward-kit agent-eval --task-dir ./flight_task

    # You can also specify just the task.jsonl file
    reward-kit agent-eval --dataset ./flight_task/task.jsonl
Models can be specified using environment variables:
    # Set model for agent evaluation
    export MODEL_AGENT=openai/gpt-4o

    # Set model for simulated user (optional)
    export MODEL_SIM=openai/gpt-3.5-turbo

    # Then run evaluation
    reward-kit agent-eval --task-dir ./flight_task
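When evaluations are launched from a script rather than an interactive shell, the same environment variables can be set per run instead of exported globally. The sketch below only builds the command and environment and does not execute anything. The helper `agent_eval_invocation` is hypothetical. The variable names `MODEL_AGENT` and `MODEL_SIM` and the `--task-dir` flag are taken from the commands above.

```python
import os


def agent_eval_invocation(task_dir, agent_model, sim_model=None):
    """Build (argv, env) for one agent-eval run without executing it."""
    env = dict(os.environ)          # copy, so the caller's environment is untouched
    env["MODEL_AGENT"] = agent_model
    if sim_model is not None:       # the simulated-user model is optional
        env["MODEL_SIM"] = sim_model
    argv = ["reward-kit", "agent-eval", "--task-dir", str(task_dir)]
    return argv, env


if __name__ == "__main__":
    argv, env = agent_eval_invocation(
        "./flight_task", "openai/gpt-4o", sim_model="openai/gpt-3.5-turbo"
    )
    print(" ".join(argv))
    # To actually run it (requires reward-kit on PATH):
    #   import subprocess; subprocess.run(argv, env=env, check=True)
```

Passing an explicit `env` keeps each run's model choice local to that run, which is convenient when comparing several models in one script.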
    # Specify model directly (overrides environment variable)
    reward-kit agent-eval --task-dir ./flight_task --model openai/gpt-4o

    # Use custom output directory
    reward-kit agent-eval --task-dir ./flight_task --output-dir ./my_runs

    # Disable simulated user (use static initial messages only)
    reward-kit agent-eval --task-dir ./flight_task --no-sim-user

    # Use test mode without requiring API keys
    reward-kit agent-eval --task-dir ./flight_task --test-mode

    # Use mock response in test mode
    reward-kit agent-eval --task-dir ./flight_task --test-mode --mock-response

    # Run in debug mode with verbose output
    reward-kit agent-eval --task-dir ./flight_task --debug

    # Limit the number of tasks to evaluate
    reward-kit agent-eval --task-dir ./flight_task --max-tasks 2

    # Run specific tasks by ID
    reward-kit agent-eval --task-dir ./flight_task --task-ids flight.booking.001,flight.booking.002

    # Use a specific registry for a task
    reward-kit agent-eval --task-dir ./flight_task --registry-override my_custom_tools.flight_tools

    # Use multiple tool registries
    reward-kit agent-eval --task-dir ./complex_task --registries flight=flight_tools,hotel=hotel_tools

    # Specify evaluator
    reward-kit agent-eval --task-dir ./flight_task --evaluator flight_reward.success_evaluator
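Two of the options above take compound values: `--task-ids` is a comma-separated list and `--registries` is a comma-separated list of `name=module` pairs. A small sketch of how those values might be assembled from Python data structures follows. The helper `selection_args` is hypothetical, and whether these flags may be combined in a single invocation is an assumption, since the examples above show each one on its own.

```python
def selection_args(task_ids=(), registries=None, max_tasks=None):
    """Format --max-tasks, --task-ids and --registries as an argv fragment."""
    args = []
    if max_tasks is not None:
        args += ["--max-tasks", str(max_tasks)]
    if task_ids:
        # Task IDs are joined with commas, with no spaces, as in the examples above.
        args += ["--task-ids", ",".join(task_ids)]
    if registries:
        # Each registry is written as name=module; pairs are comma-separated.
        pairs = ",".join(f"{name}={module}" for name, module in registries.items())
        args += ["--registries", pairs]
    return args


if __name__ == "__main__":
    base = ["reward-kit", "agent-eval", "--task-dir", "./complex_task"]
    extra = selection_args(registries={"flight": "flight_tools", "hotel": "hotel_tools"})
    print(" ".join(base + extra))
```

Building the arguments as a list rather than a single string avoids shell-quoting problems if the command is later passed to `subprocess.run`.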