Dataset Configuration Guide
This guide explains the structure and fields used in YAML configuration files for datasets within the Reward Kit. These configurations are typically located in `conf/dataset/` or within an example’s `conf/dataset/` directory (e.g., `examples/math_example/conf/dataset/`). They are processed by `reward_kit.datasets.loader.py` using Hydra.
There are two main types of dataset configurations: Base Datasets and Derived Datasets.
1. Base Dataset Configuration
A base dataset configuration defines the connection to a raw data source and performs initial processing such as column mapping.
Example files: `conf/dataset/base_dataset.yaml` (schema) and `examples/math_example/conf/dataset/gsm8k.yaml` (concrete example).
Key Fields:
- `_target_` (Required)
  - Description: Specifies the Python function to instantiate for loading this dataset.
  - Typical Value: `reward_kit.datasets.loader.load_and_process_dataset`
  - Example: `_target_: reward_kit.datasets.loader.load_and_process_dataset`
- `source_type` (Required)
  - Description: Defines the type of the data source.
  - Supported Values:
    - `"huggingface"`: For datasets hosted on the Hugging Face Hub.
    - `"jsonl"`: For local datasets in JSON Lines format.
    - `"fireworks"`: (Not yet implemented) For datasets hosted on Fireworks AI.
  - Example: `source_type: huggingface`
- `path_or_name` (Required)
  - Description: Identifier for the dataset.
    - For `huggingface`: the Hugging Face dataset name (e.g., `"gsm8k"`, `"cais/mmlu"`).
    - For `jsonl`: the path to the `.jsonl` file (e.g., `"data/my_data.jsonl"`).
  - Example: `path_or_name: "gsm8k"`
- `split` (Optional)
  - Description: Specifies the dataset split to load (e.g., `"train"`, `"test"`, `"validation"`). If loading a Hugging Face `DatasetDict` or multiple JSONL files mapped via `data_files`, this selects the split after loading.
  - Default: `"train"`
  - Example: `split: "test"`
- `config_name` (Optional)
  - Description: For Hugging Face datasets with multiple configurations (e.g., `"main"` for `gsm8k`, `"all"` for `cais/mmlu`). Corresponds to the `name` parameter in Hugging Face’s `load_dataset`.
  - Default: `null`
  - Example: `config_name: "main"` (for `gsm8k`)
- `data_files` (Optional)
  - Description: Used for loading local files (such as JSONL or CSV) with Hugging Face’s `datasets.load_dataset`. Can be a single file path, a list, or a dictionary mapping split names to file paths.
  - Example: `data_files: {"train": "path/to/train.jsonl", "test": "path/to/test.jsonl"}`
- `max_samples` (Optional)
  - Description: Maximum number of samples to load from the dataset (or from each split if a `DatasetDict` is loaded). If `null` or `0`, all samples are loaded.
  - Default: `null`
  - Example: `max_samples: 100`
- `column_mapping` (Optional)
  - Description: A dictionary that renames columns from the source dataset to a standard internal format. Keys are the new standard names (e.g., `"query"`, `"ground_truth"`), and values are the original column names in the source dataset. This mapping is applied by `reward_kit.datasets.loader.py`.
  - Default: `{"query": "query", "ground_truth": "ground_truth", "solution": null}`
  - Example (`gsm8k.yaml`): see the sketch after this list.
- `preprocessing_steps` (Optional)
  - Description: A list of strings, where each string is a Python import path to a preprocessing function (e.g., `"reward_kit.datasets.loader.transform_codeparrot_apps_sample"`). These functions are applied to the dataset after loading and before column mapping.
  - Default: `[]`
  - Example: `preprocessing_steps: ["my_module.my_preprocessor_func"]`
- `hf_extra_load_params` (Optional)
  - Description: A dictionary of extra parameters to pass directly to Hugging Face’s `datasets.load_dataset()` (e.g., `trust_remote_code: True`).
  - Default: `{}`
  - Example: `hf_extra_load_params: {trust_remote_code: True}`
- `description` (Optional, Metadata)
  - Description: A brief description of the dataset configuration for documentation purposes.
  - Example: `description: "GSM8K (Grade School Math 8K) dataset."`
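The following sketch shows how these fields fit together in a base dataset configuration in the spirit of `gsm8k.yaml`, including the `column_mapping` example referenced above. The values are illustrative assumptions rather than the verbatim contents of the shipped file; in particular, the source column names `question` and `answer` are assumed from the public GSM8K dataset.

```yaml
# Illustrative base dataset config (in the style of conf/dataset/gsm8k.yaml).
# Values are assumptions; consult the actual file in the repository.
_target_: reward_kit.datasets.loader.load_and_process_dataset
source_type: huggingface
path_or_name: "gsm8k"
config_name: "main"          # gsm8k requires a config name
split: "train"
max_samples: null            # null loads all samples
column_mapping:
  query: "question"          # assumed GSM8K column holding the problem text
  ground_truth: "answer"     # assumed GSM8K column holding the reference answer
preprocessing_steps: []
hf_extra_load_params: {}
description: "GSM8K (Grade School Math 8K) dataset."
```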
 
2. Derived Dataset Configuration
A derived dataset configuration references a base dataset and applies further transformations, such as adding system prompts, changing the output format, or applying different column mappings or sample limits.
Example files: `examples/math_example/conf/dataset/base_derived_dataset.yaml` (schema) and `examples/math_example/conf/dataset/gsm8k_math_prompts.yaml` (concrete example).
Key Fields:
- `_target_` (Required)
  - Description: Specifies the Python function to instantiate for loading this derived dataset.
  - Typical Value: `reward_kit.datasets.loader.load_derived_dataset`
  - Example: `_target_: reward_kit.datasets.loader.load_derived_dataset`
- `base_dataset` (Required)
  - Description: A reference to the base dataset configuration to derive from. This can be the name of another dataset configuration file (e.g., `"gsm8k"`, which would load `conf/dataset/gsm8k.yaml`) or a full inline base dataset configuration object.
  - Example: `base_dataset: "gsm8k"`
- `system_prompt` (Optional)
  - Description: A string used as the system prompt. In the `evaluation_format`, this prompt is added as a `system_prompt` field alongside `user_query`.
  - Default: `null`
  - Example (`gsm8k_math_prompts.yaml`): `"Solve the following math problem. Show your work clearly. Put the final numerical answer between <answer> and </answer> tags."`
- `output_format` (Optional)
  - Description: Specifies the final format for the derived dataset.
  - Supported Values:
    - `"evaluation_format"`: Converts dataset records to include `user_query`, `ground_truth_for_eval`, and optionally `system_prompt` and `id`. This is the standard format for many evaluation scenarios.
    - `"conversation_format"`: (Not yet implemented) Converts records to a list of messages.
    - `"jsonl"`: Keeps records in a format suitable for direct JSONL output (typically implying minimal transformation beyond base loading and initial mapping).
  - Default: `"evaluation_format"`
  - Example: `output_format: "evaluation_format"`
- `transformations` (Optional)
  - Description: A list of additional transformation functions to apply after the base dataset is loaded and initial derived processing (such as system prompt addition) is done. (Currently not fully implemented in `loader.py`.)
  - Default: `[]`
- `derived_column_mapping` (Optional)
  - Description: A dictionary for column mapping applied after the base dataset is loaded and before the `output_format` conversion. This can override or extend the base dataset’s `column_mapping`. Keys are new names, values are column names from the loaded base dataset.
  - Default: `{}`
  - Example (`gsm8k_math_prompts.yaml`): see the sketch after this list. Note: these mapped columns (`query`, `ground_truth`) are then used by `convert_to_evaluation_format` to create `user_query` and `ground_truth_for_eval`.
- `derived_max_samples` (Optional)
  - Description: Maximum number of samples for this derived dataset. If specified, this overrides any `max_samples` from the base dataset configuration for the purposes of this derived dataset.
  - Default: `null`
  - Example: `derived_max_samples: 5`
- `description` (Optional, Metadata)
  - Description: A brief description of this derived dataset configuration.
  - Example: `description: "GSM8K dataset with math-specific system prompt in evaluation format."`
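The sketch below combines the derived fields above in the spirit of `gsm8k_math_prompts.yaml`, including the `derived_column_mapping` example referenced earlier. It is an illustrative assumption rather than the verbatim shipped file; the column names in `derived_column_mapping` in particular must match whatever columns the loaded base dataset actually exposes.

```yaml
# Illustrative derived dataset config (in the style of
# examples/math_example/conf/dataset/gsm8k_math_prompts.yaml).
# Values are assumptions; consult the actual file in the repository.
_target_: reward_kit.datasets.loader.load_derived_dataset
base_dataset: "gsm8k"              # loads conf/dataset/gsm8k.yaml as the base
system_prompt: "Solve the following math problem. Show your work clearly. Put the final numerical answer between <answer> and </answer> tags."
output_format: "evaluation_format"
derived_column_mapping:
  query: "question"                # assumed column names in the loaded base dataset
  ground_truth: "answer"
derived_max_samples: 5
description: "GSM8K dataset with math-specific system prompt in evaluation format."
```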
 
How Configurations are Loaded
The `reward_kit.datasets.loader.py` script uses Hydra to:
- Compose these YAML configurations.
- Instantiate the appropriate loader function (`load_and_process_dataset` or `load_derived_dataset`) with the parameters defined in the YAML.
- The loader functions then use these parameters to fetch data (e.g., from Hugging Face or local files), apply mappings, execute preprocessing steps, and format the data as requested.
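For orientation, here is a hypothetical run configuration showing how Hydra composition can select one of these dataset configs through a `defaults` list. The file name `conf/run_math_eval.yaml` and the script name `run_eval.py` are placeholders, not names taken from this guide; only the dataset config names and the standard Hydra mechanics are assumed.

```yaml
# Hypothetical run config (e.g. conf/run_math_eval.yaml).
# The dataset entry composes one of the dataset YAML files described above
# from the config search path.
defaults:
  - dataset: gsm8k_math_prompts   # selects the derived GSM8K config
  - _self_

# Individual fields can still be overridden from the command line, e.g.:
#   python run_eval.py dataset=gsm8k dataset.max_samples=50
```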