This example demonstrates how to run an evaluation with the `reward-kit run` command, focusing on both the accuracy of the numerical answer and adherence to a specific response format (e.g., `<think>...</think><answer>...</answer>`). It uses the GSM8K dataset.
The evaluation uses the `gsm8k` dataset, configured via `gsm8k_math_with_formatting_prompts.yaml`, which adds specific system prompts to guide the model's output format. The reward logic lives in `examples/math_with_formatting/main.py` and is referenced in the run configuration as `examples.math_with_formatting.main.evaluate`.
The formatting instructions themselves come from the prompt dataset configuration (`gsm8k_math_with_formatting_prompts.yaml`), which defines the system prompt that asks for the `<think>`/`<answer>` structure.
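A minimal sketch of what such a prompt configuration might contain is shown below; the key names (`defaults`, `system_prompt`) and the prompt wording are illustrative assumptions, not the actual contents of the file in the repository.

```yaml
# Illustrative sketch only - key names and prompt text are assumptions,
# not the real contents of gsm8k_math_with_formatting_prompts.yaml.
defaults:
  - gsm8k          # assumed: inherit the base gsm8k dataset config
  - _self_

system_prompt: |
  Solve the problem. Reason step by step inside <think>...</think> tags,
  then give only the final numeric answer inside <answer>...</answer> tags.
```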
Before running, ensure you have `reward-kit` and its development dependencies installed.
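A typical editable install from the repository root looks like the following; the `dev` extras name is an assumption, so adjust it to whatever the project's packaging actually defines.

```bash
# Editable install of reward-kit with development extras (extras name assumed).
pip install -e ".[dev]"
```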
The default run configuration (`run_math_with_formatting_eval.yaml`) uses a Fireworks AI model (e.g., `accounts/fireworks/models/qwen3-235b-a22b`), so ensure your `FIREWORKS_API_KEY` is set in your environment or a `.env` file. The main configuration file for this example is `examples/math_with_formatting/conf/run_math_with_formatting_eval.yaml`.
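As a rough orientation, the sketch below ties together only the values mentioned in this example; the key names and nesting are assumptions rather than the real schema, so treat the file in the repository as authoritative.

```yaml
# Illustrative sketch of run_math_with_formatting_eval.yaml - key names and
# nesting are assumptions; only the values come from this example's description.
defaults:
  - dataset: gsm8k_math_with_formatting_prompts

generation:
  model_name: accounts/fireworks/models/qwen3-235b-a22b

evaluation_params:
  limit_samples: 2

reward:
  function_path: examples.math_with_formatting.main.evaluate

output:
  results_file: math_with_formatting_example_results.jsonl
```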
Then execute the `reward-kit run` command from the root of the repository.
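Since `reward-kit run` is Hydra-based, an invocation along the following lines should work; the flag names follow standard Hydra conventions and are an assumption here rather than a quote from the project's own docs.

```bash
# Assumed Hydra-style flags; run from the repository root.
reward-kit run --config-path examples/math_with_formatting/conf --config-name run_math_with_formatting_eval
```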
This command will:

- Process only a small number of samples (`limit_samples: 2`).
- Use the prompts defined in `gsm8k_math_with_formatting_prompts.yaml`.
- Query the configured Fireworks AI model (`qwen3-235b-a22b`).
- Apply the reward function `examples.math_with_formatting.main.evaluate`, which combines accuracy and format checks.
- Save detailed results (`math_with_formatting_example_results.jsonl`) in a timestamped directory under `outputs/` (the exact path is determined by Hydra, typically based on the current date/time).
- Write `preview_input_output_pairs.jsonl` to the same output directory.
- Report an overall `evaluation_score` (the average of accuracy and format) and a per-metric breakdown in `evaluation_metrics` for `accuracy_reward` and `format_reward`.
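As a quick way to look at those scores, you could read the results file with a few lines of Python; the timestamped directory name is a placeholder to fill in from your own run, and the field names simply mirror the ones described above.

```python
import json
from pathlib import Path

# Replace <timestamped-run-dir> with the directory Hydra created for your run.
results_path = Path("outputs") / "<timestamped-run-dir>" / "math_with_formatting_example_results.jsonl"

for line in results_path.read_text().splitlines():
    record = json.loads(line)
    # "evaluation_score" and "evaluation_metrics" are the fields described above.
    print(record.get("evaluation_score"), record.get("evaluation_metrics"))
```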
Key files:

- `examples/math_with_formatting/main.py`: Contains the `evaluate()` function with the core reward logic (see the sketch after this list), including:
  - `accuracy_reward_fn`: Extracts and compares numerical answers.
  - `format_reward_fn`: Checks for the `<think>...</think><answer>...</answer>` structure.
- `gsm8k_math_with_formatting_prompts.yaml`: The dataset configuration used to add specific system prompts to the base `gsm8k` dataset.
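To make the combination of the two checks concrete, here is a minimal sketch of how an `evaluate()`-style function could average an accuracy check and a format check. The signature, helper names, and answer-extraction logic are simplified assumptions and do not reproduce the actual code in `main.py`.

```python
import re

# Expected shape: <think>...</think> followed by <answer>...</answer>.
THINK_ANSWER_PATTERN = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def format_reward_fn(text: str) -> float:
    """Return 1.0 if the response follows the <think>/<answer> structure."""
    return 1.0 if THINK_ANSWER_PATTERN.match(text) else 0.0

def accuracy_reward_fn(text: str, ground_truth: str) -> float:
    """Return 1.0 if the last number in the response matches the ground truth."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return 1.0 if numbers and numbers[-1] == str(ground_truth).strip() else 0.0

def evaluate(response: str, ground_truth: str) -> dict:
    """Average the accuracy and format checks into one score."""
    accuracy = accuracy_reward_fn(response, ground_truth)
    fmt = format_reward_fn(response)
    return {
        "evaluation_score": (accuracy + fmt) / 2,
        "evaluation_metrics": {"accuracy_reward": accuracy, "format_reward": fmt},
    }
```

For a response like `<think>2 + 2 = 4</think><answer>4</answer>` with ground truth `4`, both checks return 1.0 and the averaged score is 1.0.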