- `run`: Run a local evaluation pipeline using a Hydra configuration.
- `preview`: Preview evaluation results or re-evaluate generated outputs.
- `deploy`: Deploy a reward function as an evaluator.
- `agent-eval`: Run agent evaluations on task bundles.
- `list`: List existing evaluators (coming soon).
- `delete`: Delete an evaluator (coming soon).

### `reward-kit run`

The `run` command is the primary way to execute local evaluation pipelines. It leverages Hydra for configuration, allowing you to define complex evaluation setups (including dataset loading, model generation, and reward application) in YAML files and easily override parameters from the command line.
Options:

- `--config-path TEXT`: Path to the directory containing your Hydra configuration files. (Required)
- `--config-name TEXT`: Name of the main Hydra configuration file (e.g., `run_my_eval.yaml`). (Required)
- `--multirun` or `-m`: Run multiple jobs (e.g., for sweeping over parameters). Refer to the Hydra documentation for multi-run usage.
- `--help`: Show the help message for the `run` command.
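For example, an invocation might look like the following sketch (the `./conf` directory, the config name, and the `evaluation.limit_samples` override key are illustrative and depend on your own configuration):

```bash
# Run the pipeline defined in ./conf/run_my_eval.yaml
reward-kit run --config-path ./conf --config-name run_my_eval

# Override a configured value from the command line (Hydra override syntax);
# "evaluation.limit_samples" is a hypothetical key shown for illustration.
reward-kit run --config-path ./conf --config-name run_my_eval evaluation.limit_samples=5
```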
The `run` command typically generates a timestamped output directory (e.g., `outputs/YYYY-MM-DD/HH-MM-SS/`) containing:

- `.hydra/`: Contains the full Hydra configuration for the run (for reproducibility).
- `<config_output_name>_results.jsonl` (e.g., `math_example_results.jsonl`): Detailed evaluation results for each sample.
- `preview_input_output_pairs.jsonl`: Generated prompts and responses, suitable for use with `reward-kit preview`.
### `reward-kit preview`

The `preview` command allows you to test reward functions with sample data. A primary use case is to inspect or re-evaluate the `preview_input_output_pairs.jsonl` file generated by the `reward-kit run` command. This allows you to iterate on reward logic using a fixed set of model generations, or to apply different metrics to the same outputs. You can also use it with manually created sample files.
Options:

- `--metrics-folders`: Specify local metric scripts to apply, in the format "name=path/to/metric_script_dir". The directory should contain a `main.py` with a `@reward_function`-decorated function.
- `--samples`: Path to a JSONL file containing sample conversations or prompt/response pairs. This is typically the `preview_input_output_pairs.jsonl` file from a `reward-kit run` output directory.
- `--remote-url`: (Optional) URL of a deployed evaluator to use for scoring, instead of local `--metrics-folders`.
- `--max-samples`: Maximum number of samples to process (optional).
- `--output`: Path to save preview results (optional).
- `--verbose`: Enable verbose output (optional).
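For example, to re-score a previous run's generations with a local metric (the output path and the `word_count` metric name below are illustrative):

```bash
reward-kit preview \
  --samples ./outputs/2025-01-01/12-00-00/preview_input_output_pairs.jsonl \
  --metrics-folders "word_count=./metrics/word_count" \
  --max-samples 10
```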
For sample files generated by `reward-kit run` (`preview_input_output_pairs.jsonl`), each line typically contains a "messages" list (including system, user, and assistant turns) and optionally a "ground_truth" field. If creating manually, a common format is:
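The sketch below shows one such JSONL line; the conversation content and the `ground_truth` value are placeholders:

```jsonl
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2 + 2?"}, {"role": "assistant", "content": "2 + 2 equals 4."}], "ground_truth": "4"}
```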
### `reward-kit deploy`

The `deploy` command deploys a reward function as an evaluator on the Fireworks platform.

Options:

- `--id`: ID for the deployed evaluator (required).
- `--metrics-folders`: Specify metrics to use in the format "name=path" (required).
- `--display-name`: Human-readable name for the evaluator (optional).
- `--description`: Description of the evaluator (optional).
- `--force`: Overwrite if an evaluator with the same ID already exists (optional).
- `--providers`: List of model providers to use (optional).
- `--verbose`: Enable verbose output (optional).
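For example (the evaluator ID and metric path below are placeholders):

```bash
reward-kit deploy \
  --id my-evaluator \
  --metrics-folders "my_metric=./metrics/my_metric" \
  --display-name "My Evaluator" \
  --description "Scores responses with my custom metric" \
  --force
```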
Before deploying, it is usually best to iterate locally with `reward-kit run` first:

1. Prepare your Hydra configuration files (e.g., `conf/dataset/my_data.yaml`, `conf/run_my_eval.yaml`). Define or reference your reward function logic.
2. Run `reward-kit run`. This generates model responses and initial scores.
3. Review the results (`*_results.jsonl`) and the `preview_input_output_pairs.jsonl` from the output directory.
4. Refine your reward logic using `reward-kit preview` with the `preview_input_output_pairs.jsonl` and your updated local metric script.
5. When satisfied, re-run `reward-kit run`. (The `--metrics-folders` for `deploy` should point to the finalized reward function script(s) you intend to deploy as the evaluator.)
### `reward-kit agent-eval`

The `agent-eval` command enables you to run agent evaluations using task bundles.
Options:

- `--task-dir`: Path to task bundle directory containing `reward.py`, `tools.py`, etc.
- `--dataset` or `-d`: Path to JSONL file containing task specifications.
- `--output-dir` or `-o`: Directory to store evaluation runs (default: "./runs").
- `--model`: Override the `MODEL_AGENT` environment variable.
- `--sim-model`: Override the `MODEL_SIM` environment variable for the simulated user.
- `--no-sim-user`: Disable simulated user (use static initial messages only).
- `--test-mode`: Run in test mode without requiring API keys.
- `--mock-response`: Use a mock agent response (works with `--test-mode`).
- `--debug`: Enable detailed debug logging.
- `--validate-only`: Validate task bundle structure without running evaluation.
- `--export-tools`: Export tool specifications to a directory for manual testing.
- `--task-ids`: Comma-separated list of task IDs to run.
- `--max-tasks`: Maximum number of tasks to evaluate.
- `--registries`: Custom tool registries in the format "name=path".
- `--registry-override`: Override all toolset paths with this registry path.
- `--evaluator`: Custom evaluator module path (overrides default).
The examples below use `examples/your_agent_task_bundle/` as a placeholder. You will need to replace this with the actual path to your task bundle directory.
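For example (the model name matches the `MODEL_AGENT` example used later in this document; adjust it and the paths to your setup):

```bash
# Validate the bundle structure first, without running an evaluation
reward-kit agent-eval --task-dir examples/your_agent_task_bundle/ --validate-only

# Run the evaluation with an explicit agent model and output directory
reward-kit agent-eval \
  --task-dir examples/your_agent_task_bundle/ \
  --model "openai/gpt-4o-mini" \
  --output-dir ./runs
```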
A task bundle directory contains:

- `reward.py`: Reward function with the `@reward_function` decorator.
- `tools.py`: Tool registry with tool definitions.
- `task.jsonl`: Dataset rows with task specifications.
- `seed.sql` (optional): Initial database state.
The CLI uses the following environment variables:

- `FIREWORKS_API_KEY`: Your Fireworks API key (required for deployment operations).
- `FIREWORKS_API_BASE`: Base URL for the Fireworks API (defaults to `https://api.fireworks.ai`).
- `FIREWORKS_ACCOUNT_ID`: Your Fireworks account ID (optional, can be configured in `auth.ini`).
- `MODEL_AGENT`: Default agent model to use (e.g., "openai/gpt-4o-mini").
- `MODEL_SIM`: Default simulation model to use (e.g., "openai/gpt-3.5-turbo").
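For local development these are typically exported in your shell; the values below are placeholders:

```bash
export FIREWORKS_API_KEY="your-fireworks-api-key"
export MODEL_AGENT="openai/gpt-4o-mini"
export MODEL_SIM="openai/gpt-3.5-turbo"
```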
If deployment or authentication errors occur, ensure `FIREWORKS_API_KEY` is correctly set. If a metric fails to load, check that its metrics folder contains a `main.py` file. For detailed usage information, use the `--help` flag with any command:
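```bash
reward-kit --help
reward-kit run --help
reward-kit agent-eval --help
```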