APPS Coding Example
This guide explains how to use the `reward-kit run` command to evaluate code generation models on a sample of the `codeparrot/apps` dataset. This example focuses on checking the parsability of generated Python code.
Overview
- Dataset: A sample from `codeparrot/apps`, a dataset of programming problems and solutions. The specific dataset configuration used is `apps_full_prompts` (defined in `conf/dataset/apps_full_prompts.yaml`), which typically points to a pre-generated JSONL file.
- Task: Given a problem description (question), the model should generate a Python code solution.
- Reward Function: The evaluation uses `reward_kit.rewards.apps_coding_reward.evaluate_apps_solution`.
  - Functionality: In its current form for this example, this reward function performs a basic check of whether the generated Python code can be parsed with Python's `ast.parse` function. It scores `1.0` if the code is parsable and `0.0` otherwise (a short illustration of this kind of check appears after this list).
  - It does not execute the code or check for functional correctness against test cases in this simplified setup.
  - The `ground_truth_for_eval` field (derived from APPS' `input_output` field) is available to the reward function but is not used by this initial parsability check.
- System Prompt: A default system prompt is provided in the configuration to guide the model; see the example's configuration for the exact text.
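The sketch below illustrates the kind of parsability check described above; it is not the actual implementation of `evaluate_apps_solution`, just the same idea expressed with `ast.parse`:

```bash
# Illustration only (not the reward function's real code): ast.parse accepts
# syntactically valid Python and raises SyntaxError otherwise, which maps to
# scores of 1.0 and 0.0 respectively.
python -c 'import ast; ast.parse("def add(a, b):\n    return a + b"); print("parsable -> 1.0")'
python -c 'import ast; ast.parse("def add(a, b:")' 2>/dev/null || echo "not parsable -> 0.0"
```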
Setup
- Environment: Ensure your Python environment is set up with `reward-kit` and its development dependencies installed. If you haven't already, install them from the root of the repository (see the example command after this list).
- API Key: The default configuration uses a Fireworks AI model (`accounts/fireworks/models/deepseek-v3-0324`) for code generation. Make sure your `FIREWORKS_API_KEY` is set in your environment or in a `.env` file in the project root.
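A typical editable install from the repository root is shown below; the `[dev]` extras name is an assumption (a common convention), so adjust it if the project defines its extras differently:

```bash
# From the repository root: install reward-kit in editable mode with dev dependencies.
# The "[dev]" extras group is an assumption -- check the project's packaging metadata.
pip install -e ".[dev]"
```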
Data Preparation (Informational)
The example typically uses a pre-generated sample of prompts from the `codeparrot/apps` dataset. The default run configuration (`run_eval.yaml`) references `apps_full_prompts`, which points to `development/CODING_DATASET.jsonl`.
If you wish to regenerate this sample or create a different one (informational only; not required to run the example with defaults):
- The script `scripts/convert_apps_to_prompts.py` can convert the raw Hugging Face `codeparrot/apps` dataset into the JSONL format expected by the pipeline.
- The source dataset configuration for raw APPS data is defined in `conf/dataset/apps_source.yaml`.
- An example command to generate 5 samples from the 'test' split is sketched below.
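For illustration, such a command could look like the following; the flag names are assumptions rather than the script's documented interface, so consult `python scripts/convert_apps_to_prompts.py --help` for the actual options:

```bash
# Illustrative invocation only -- flag names are assumed, not taken from the script.
python scripts/convert_apps_to_prompts.py \
  --split test \
  --num-samples 5 \
  --output-file development/CODING_DATASET.jsonl
```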
Running the Evaluation
The evaluation is configured in `examples/apps_coding_example/conf/run_eval.yaml`. This is the main configuration file used by Hydra.
To run the evaluation using the `reward-kit run` command:
- Ensure your virtual environment is activated.
- Execute the run command from the root of the repository (see the example below).
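A typical sequence is sketched below, assuming a `.venv` virtual environment and Hydra-style `--config-path`/`--config-name` flags (both are assumptions; adjust them to your environment and installed reward-kit version):

```bash
# 1. Activate your virtual environment (path depends on how you created it).
source .venv/bin/activate

# 2. Run from the repository root. The Hydra-style flags below are an assumption;
#    check `reward-kit run --help` for the exact interface of your version.
reward-kit run --config-path examples/apps_coding_example/conf --config-name run_eval
```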
Overriding Parameters
You can override parameters from the `run_eval.yaml` configuration directly from the command line. For example:
- Limit the number of samples for a quick test (see the examples after this list).
- Disable code generation to test the reward function with cached responses. If you have previously run the example and responses are cached (default cache dir: `outputs/generated_responses_cache_apps/`), you can disable new generation.
- Change the generation model.
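The sketches below show how such overrides are typically appended to the run command using Hydra's `key=value` syntax. `generation.enabled` is referenced elsewhere in this guide; the other key names, the model ID, and the run command's flags are assumptions, so check `run_eval.yaml` for the exact names:

```bash
# Limit the number of samples for a quick test (key name assumed):
reward-kit run --config-path examples/apps_coding_example/conf --config-name run_eval \
  evaluation.limit_samples=2

# Reuse cached responses instead of generating new ones:
reward-kit run --config-path examples/apps_coding_example/conf --config-name run_eval \
  generation.enabled=false

# Use a different generation model (key name and model ID are illustrative):
reward-kit run --config-path examples/apps_coding_example/conf --config-name run_eval \
  generation.model_name=accounts/fireworks/models/llama-v3p1-8b-instruct
```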
Refer to the Hydra Configuration for Examples guide for more details on Hydra.
Expected Output
The `reward-kit run` command will:
- Load prompts based on the `apps_full_prompts` dataset configuration (typically from `development/CODING_DATASET.jsonl`).
- If `generation.enabled` is `true` (default), generate code solutions using the configured model. Responses are cached (default: `outputs/generated_responses_cache_apps/`).
- Evaluate each generated solution using the `evaluate_apps_solution` reward function (checking for Python AST parsability).
- Print a summary of results to the console.
- Save detailed evaluation results to a JSONL file in a timestamped directory. The default output path is configured in `run_eval.yaml` as `./outputs/apps_coding_example/${now:%Y-%m-%d}/${now:%H-%M-%S}`. The results file will be named `apps_coding_example_results.jsonl` within that directory.
The results file will contain the original prompt, generated response, the parsability score (0 or 1), and other metrics from the reward function.
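To spot-check a run, you can pretty-print the first record of the results file; substitute the timestamped directory created by your run, and inspect the output itself for the exact field names:

```bash
# Pretty-print the first result record; replace YYYY-MM-DD and HH-MM-SS with the
# directory names created by your run.
head -n 1 outputs/apps_coding_example/YYYY-MM-DD/HH-MM-SS/apps_coding_example_results.jsonl \
  | python -m json.tool
```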