# Dataset Configuration Guide
This guide explains the structure and fields used in YAML configuration files for datasets within the Reward Kit. These configurations are typically located in `conf/dataset/` or within an example's `conf/dataset/` directory (e.g., `examples/math_example/conf/dataset/`). They are processed by `reward_kit.datasets.loader.py` using Hydra.

There are two main types of dataset configurations: Base Datasets and Derived Datasets.
## 1. Base Dataset Configuration
A base dataset configuration defines the connection to a raw data source and performs initial processing like column mapping.

Example files: `conf/dataset/base_dataset.yaml` (schema), `examples/math_example/conf/dataset/gsm8k.yaml` (concrete example)

Key Fields:
- `_target_` (Required)
  - Description: Specifies the Python function to instantiate for loading this dataset.
  - Typical Value: `reward_kit.datasets.loader.load_and_process_dataset`
  - Example: `_target_: reward_kit.datasets.loader.load_and_process_dataset`
- `source_type` (Required)
  - Description: Defines the type of the data source.
  - Supported Values:
    - `"huggingface"`: For datasets hosted on the Hugging Face Hub.
    - `"jsonl"`: For local datasets in JSON Lines format.
    - `"fireworks"`: (Not yet implemented) For datasets hosted on Fireworks AI.
  - Example: `source_type: huggingface`
- `path_or_name` (Required)
  - Description: Identifier for the dataset.
    - For `huggingface`: The Hugging Face dataset name (e.g., `"gsm8k"`, `"cais/mmlu"`).
    - For `jsonl`: Path to the `.jsonl` file (e.g., `"data/my_data.jsonl"`).
  - Example: `path_or_name: "gsm8k"`
- `split` (Optional)
  - Description: Specifies the dataset split to load (e.g., `"train"`, `"test"`, `"validation"`). If loading a Hugging Face `DatasetDict` or multiple JSONL files mapped via `data_files`, this selects the split after loading.
  - Default: `"train"`
  - Example: `split: "test"`
- `config_name` (Optional)
  - Description: For Hugging Face datasets with multiple configurations (e.g., `"main"`, `"all"` for `gsm8k`). Corresponds to the `name` parameter in Hugging Face's `load_dataset`.
  - Default: `null`
  - Example: `config_name: "main"` (for `gsm8k`)
- `data_files` (Optional)
  - Description: Used for loading local files (like JSONL, CSV) with Hugging Face's `datasets.load_dataset`. Can be a single file path, a list, or a dictionary mapping split names to file paths.
  - Example: `data_files: {"train": "path/to/train.jsonl", "test": "path/to/test.jsonl"}`
- `max_samples` (Optional)
  - Description: Maximum number of samples to load from the dataset (or from each split if a `DatasetDict` is loaded). If `null` or `0`, all samples are loaded.
  - Default: `null`
  - Example: `max_samples: 100`
- `column_mapping` (Optional)
  - Description: A dictionary to rename columns from the source dataset to a standard internal format. Keys are the new standard names (e.g., `"query"`, `"ground_truth"`), and values are the original column names in the source dataset. This mapping is applied by `reward_kit.datasets.loader.py`.
  - Default: `{"query": "query", "ground_truth": "ground_truth", "solution": null}`
  - Example (`gsm8k.yaml`): see the base configuration sketch after this list.
- `preprocessing_steps` (Optional)
  - Description: A list of strings, where each string is a Python import path to a preprocessing function (e.g., `"reward_kit.datasets.loader.transform_codeparrot_apps_sample"`). These functions are applied to the dataset after loading and before column mapping.
  - Default: `[]`
  - Example: `preprocessing_steps: ["my_module.my_preprocessor_func"]`
- `hf_extra_load_params` (Optional)
  - Description: A dictionary of extra parameters to pass directly to Hugging Face's `datasets.load_dataset()` (e.g., `trust_remote_code: True`).
  - Default: `{}`
  - Example: `hf_extra_load_params: {trust_remote_code: True}`
- `description` (Optional, Metadata)
  - Description: A brief description of the dataset configuration for documentation purposes.
  - Example: `description: "GSM8K (Grade School Math 8K) dataset."`
## 2. Derived Dataset Configuration
A derived dataset configuration references a base dataset and applies further transformations, such as adding system prompts, changing the output format, or applying different column mappings or sample limits.

Example files: `examples/math_example/conf/dataset/base_derived_dataset.yaml` (schema), `examples/math_example/conf/dataset/gsm8k_math_prompts.yaml` (concrete example)

Key Fields:
- `_target_` (Required)
  - Description: Specifies the Python function to instantiate for loading this derived dataset.
  - Typical Value: `reward_kit.datasets.loader.load_derived_dataset`
  - Example: `_target_: reward_kit.datasets.loader.load_derived_dataset`
- `base_dataset` (Required)
  - Description: A reference to the base dataset configuration to derive from. This can be the name of another dataset configuration file (e.g., `"gsm8k"`, which would load `conf/dataset/gsm8k.yaml`) or a full inline base dataset configuration object.
  - Example: `base_dataset: "gsm8k"`
- `system_prompt` (Optional)
  - Description: A string that will be used as the system prompt. In the `evaluation_format`, this prompt is added as a `system_prompt` field alongside `user_query`.
  - Default: `null`
  - Example (`gsm8k_math_prompts.yaml`): `"Solve the following math problem. Show your work clearly. Put the final numerical answer between <answer> and </answer> tags."`
- `output_format` (Optional)
  - Description: Specifies the final format for the derived dataset.
  - Supported Values:
    - `"evaluation_format"`: Converts dataset records to include `user_query`, `ground_truth_for_eval`, and optionally `system_prompt` and `id`. This is the standard format for many evaluation scenarios.
    - `"conversation_format"`: (Not yet implemented) Converts to a list of messages.
    - `"jsonl"`: Keeps records in a format suitable for direct JSONL output (typically implies minimal transformation beyond base loading and initial mapping).
  - Default: `"evaluation_format"`
  - Example: `output_format: "evaluation_format"`
- `transformations` (Optional)
  - Description: A list of additional transformation functions to apply after the base dataset is loaded and initial derived processing (like system prompt addition) is done. (Currently not fully implemented in `loader.py`.)
  - Default: `[]`
- `derived_column_mapping` (Optional)
  - Description: A dictionary for column mapping applied after the base dataset is loaded and before the `output_format` conversion. This can override or extend the base dataset's `column_mapping`. Keys are new names, values are names from the loaded base dataset.
  - Default: `{}`
  - Example (`gsm8k_math_prompts.yaml`): see the derived configuration sketch after this list. Note: the mapped columns (`query`, `ground_truth`) are then used by `convert_to_evaluation_format` to create `user_query` and `ground_truth_for_eval`.
- `derived_max_samples` (Optional)
  - Description: Maximum number of samples for this derived dataset. If specified, this overrides any `max_samples` from the base dataset configuration for the purpose of this derived dataset.
  - Default: `null`
  - Example: `derived_max_samples: 5`
- `description` (Optional, Metadata)
  - Description: A brief description of this derived dataset configuration.
  - Example: `description: "GSM8K dataset with math-specific system prompt in evaluation format."`
## How Configurations are Loaded

The `reward_kit.datasets.loader.py` script uses Hydra to:

- Compose these YAML configurations.
- Instantiate the appropriate loader function (`load_and_process_dataset` or `load_derived_dataset`) with the parameters defined in the YAML.
- The loader functions then use these parameters to fetch data (e.g., from Hugging Face or local files), apply mappings, execute preprocessing steps, and format the data as requested.
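For illustration, assuming the derived GSM8K sketch above with `output_format: "evaluation_format"`, a single loaded record would expose roughly the following fields (the field names come from the `output_format` description; the values shown here are invented):

```yaml
# Hypothetical shape of one record after evaluation_format conversion.
id: "gsm8k_test_0"                      # optional record identifier
system_prompt: "Solve the following math problem. Show your work clearly. Put the final numerical answer between <answer> and </answer> tags."
user_query: "What is 12 * 7?"
ground_truth_for_eval: "84"
```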