This example shows how to use the `reward-kit run` command to evaluate code generation models on a sample of the `codeparrot/apps` dataset. It focuses on checking the parsability of the generated Python code.
The evaluation uses `codeparrot/apps`, a dataset of programming problems and solutions. The specific dataset configuration used is `apps_full_prompts` (defined in `conf/dataset/apps_full_prompts.yaml`), which typically points to a pre-generated JSONL file. The reward function is `reward_kit.rewards.apps_coding_reward.evaluate_apps_solution`.
The reward function checks whether the generated code can be parsed using Python's `ast` module (`ast.parse`). It scores 1.0 if the code is parsable and 0.0 otherwise. The `ground_truth_for_eval` field (derived from the APPS `input_output` field) is available to the reward function but is not used by this initial parsability check.
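As a rough illustration, the core check boils down to an AST parse attempt. The sketch below is not the actual implementation of `evaluate_apps_solution` (which has a different signature and returns a richer result); it only demonstrates the parsability logic described above.

```python
import ast

def parsability_score(code: str) -> float:
    """Return 1.0 if the code parses as valid Python, 0.0 otherwise.

    Minimal sketch of the parsability check described above, not the real
    evaluate_apps_solution reward function.
    """
    try:
        ast.parse(code)
        return 1.0
    except SyntaxError:
        return 0.0

print(parsability_score("def add(a, b):\n    return a + b"))  # 1.0
print(parsability_score("def broken(:\n    pass"))            # 0.0
```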
Make sure you have `reward-kit` and its development dependencies installed. If you haven't already, install them from the root of the repository:
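An editable install with dev extras typically looks like the following; the `[dev]` extras name is an assumption, so check the repository's own setup instructions for the canonical command.

```bash
# Assumed install command; the "[dev]" extra is a convention, not confirmed here.
pip install -e ".[dev]"
```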
This example uses a Fireworks AI model (for example, `accounts/fireworks/models/deepseek-v3-0324`) for code generation. Make sure your `FIREWORKS_API_KEY` is set in your environment or in a `.env` file in the project root.
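Either approach works; the variable name `FIREWORKS_API_KEY` comes from the text above, and the placeholder value is yours to fill in.

```bash
# Option 1: export in your shell
export FIREWORKS_API_KEY=your_api_key_here

# Option 2: add it to a .env file at the project root
echo 'FIREWORKS_API_KEY=your_api_key_here' >> .env
```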
The example evaluates prompts from a sample of the `codeparrot/apps` dataset. The default run configuration (`run_eval.yaml`) references `apps_full_prompts`, which points to `development/CODING_DATASET.jsonl`.
If you wish to regenerate this sample or create a different one (this is for informational purposes, not required to run the example with defaults):
- The script `scripts/convert_apps_to_prompts.py` can convert the raw Hugging Face `codeparrot/apps` dataset into the JSONL format expected by the pipeline (a possible invocation is sketched below).
- The source dataset for that conversion is defined in `conf/dataset/apps_source.yaml`.
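If you do regenerate the prompts, something along these lines is a reasonable starting point. The flag names below are assumptions rather than documented options, so check the script's actual interface first.

```bash
# Check the script's real options first (the flag below is hypothetical).
python scripts/convert_apps_to_prompts.py --help

# Hypothetical invocation: write converted prompts to the JSONL file the example expects.
python scripts/convert_apps_to_prompts.py --output development/CODING_DATASET.jsonl
```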
The run configuration lives in `examples/apps_coding_example/conf/run_eval.yaml`. This is the main configuration file used by Hydra.
To run the evaluation using the `reward-kit run` command:
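A typical invocation from the repository root might look like the following. The Hydra-style `--config-path`/`--config-name` flags are an assumption based on common Hydra conventions, so confirm them against your installed `reward-kit` version.

```bash
# Assumed flags (Hydra conventions); confirm with your reward-kit version's help output.
reward-kit run --config-path examples/apps_coding_example/conf --config-name run_eval
```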
You can also override values from the `run_eval.yaml` configuration directly on the command line. For example:
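Hydra accepts `key=value` overrides appended to the command. The override keys shown below are illustrative assumptions; check `run_eval.yaml` for the actual key names.

```bash
# Hypothetical override keys, shown for illustration only; see run_eval.yaml for the real ones.
reward-kit run --config-path examples/apps_coding_example/conf --config-name run_eval \
  generation.model_name=accounts/fireworks/models/deepseek-v3-0324 \
  evaluation_params.limit_samples=5
```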
If responses have already been generated and cached (default cache location: `outputs/generated_responses_cache_apps/`), you can disable new generation:
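Passing `generation.enabled=false` as a Hydra override (the key is described below) reuses the cached responses; the command flags are assumed as above.

```bash
reward-kit run --config-path examples/apps_coding_example/conf --config-name run_eval \
  generation.enabled=false
```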
The `reward-kit run` command will:

1. Load prompts from the `apps_full_prompts` dataset configuration (typically from `development/CODING_DATASET.jsonl`).
2. If `generation.enabled` is `true` (the default), generate code solutions using the configured model. Responses are cached (default: `outputs/generated_responses_cache_apps/`).
3. Evaluate each generated solution with the `evaluate_apps_solution` reward function (checking for Python AST parsability).
4. Save results to the output directory defined in `run_eval.yaml`, i.e. `./outputs/apps_coding_example/${now:%Y-%m-%d}/${now:%H-%M-%S}`. The results file will be named `apps_coding_example_results.jsonl` within that directory.
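To take a quick look at the scores afterward, you can read the results file with a few lines of Python. The `score` field accessed below is an assumption about the JSONL schema; inspect one record and adjust the field name to whatever the file actually contains.

```python
import json
from pathlib import Path

# Point this at the timestamped directory your run actually produced.
results_path = Path("outputs/apps_coding_example") / "<date>" / "<time>" / "apps_coding_example_results.jsonl"

scores = []
with results_path.open() as f:
    for line in f:
        record = json.loads(line)
        # "score" is an assumed field name; check the schema of your results file.
        scores.append(record.get("score"))

parsable = sum(1 for s in scores if s == 1.0)
print(f"{len(scores)} records, parsable fraction: {parsable / max(len(scores), 1):.2f}")
```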