This directory contains examples showing how to use reward-kit reward functions with Hugging Face’s Transformer Reinforcement Learning (TRL) library.
TRL is a library designed for fine-tuning language models using reinforcement learning and related preference-tuning techniques. It includes implementations of several algorithms, including:

- Proximal Policy Optimization (PPO)
- Group Relative Policy Optimization (GRPO)
- Direct Preference Optimization (DPO)
These examples demonstrate how to integrate reward-kit's reward functions with TRL for model fine-tuning. This document serves as the primary guide for the scripts found in the `examples/trl_integration/` directory.

The `examples/trl_integration/` directory contains several Python scripts:
- `grpo_example.py`: Demonstrates using reward functions with the Group Relative Policy Optimization (GRPO) trainer from TRL. This is the key example and is covered in more detail below.
- `ppo_example.py`: Likely demonstrates integration with TRL's Proximal Policy Optimization (PPO) trainer.
- `minimal_deepcoder_grpo_example.py`: A more focused GRPO example, possibly related to the DeepCoder dataset or a simplified setup.
- `working_grpo_example.py`: Another GRPO variant, perhaps a more tested or stable version.
- `convert_dataset_to_jsonl.py`: A utility script for dataset preparation.
- `trl_adapter.py`: Contains adapter logic, likely used by the example scripts.
- `test_trl_integration.py`: Pytest file for testing the integration.

Most examples are run directly as Python scripts. Ensure your virtual environment is active (`source .venv/bin/activate`).
1. GRPO Example (`grpo_example.py`)
This example demonstrates using reward functions with the Group Relative Policy Optimization (GRPO) trainer from TRL; a sketch of the general shape of such a script appears below.
(Check the script itself or accompanying comments for any specific dataset or model requirements.)
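The rough shape of GRPO training with a reward callable, assuming TRL's standard `GRPOTrainer` API, is sketched below. The model, dataset, and reward logic are illustrative placeholders, not what `grpo_example.py` actually uses; in the real examples the reward callable would come from a reward-kit adapter (see the end of this document).

```python
# Illustrative sketch of GRPO training with a reward callable
# (not the actual contents of grpo_example.py).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any dataset with a "prompt" column works; this one is just an example.
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 200 characters.
    return [-abs(200 - len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",           # illustrative small model
    reward_funcs=reward_len,                    # TRL-compatible reward callable
    args=GRPOConfig(output_dir="grpo-output"),
    train_dataset=dataset,
)
trainer.train()
```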
2. Other Examples (e.g., `ppo_example.py`, `minimal_deepcoder_grpo_example.py`)
(Always refer to the comments within each script for specific instructions or dependencies, as they might vary.)
To run these examples you'll need the optional TRL dependencies, most notably the `trl` package itself. The GRPO example may require additional packages; check its imports and comments for specifics.
TRL's reward-function-based trainers (such as `GRPOTrainer`) expect callables that take a batch of prompts and completions and return one scalar score per completion. The examples show how to adapt reward-kit reward functions to this interface.
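To make that shape concrete, here is a minimal sketch of a TRL-compatible reward callable (the function name and scoring logic are illustrative); note that with conversational datasets the completions arrive as lists of message dicts rather than plain strings:

```python
from typing import List

def keyword_reward(completions: List[str], **kwargs) -> List[float]:
    """Return exactly one float score per completion.

    Extra batch data (e.g. the prompts and any other dataset columns) is
    passed through as keyword arguments and can be used in the scoring.
    """
    return [
        1.0 if "because" in completion.lower() else 0.0
        for completion in completions
    ]
```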
For fine-tuning models like DeepSeek-R1, multiple reward functions (for example, separate rewards for output format and answer accuracy) are typically combined into a single score. The examples demonstrate weighting and normalization techniques for doing this; a sketch of the idea follows.
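Below is a minimal sketch of one way to combine weighted rewards, assuming each component follows the TRL-style callable interface above. The helper is illustrative, not part of reward-kit or TRL; recent TRL versions also let you pass a list of reward functions to `GRPOTrainer` directly.

```python
from typing import Callable, Dict, List

RewardFn = Callable[..., List[float]]

def combine_rewards(reward_fns: Dict[str, RewardFn],
                    weights: Dict[str, float]) -> RewardFn:
    """Build a single TRL-style callable returning a weighted sum of scores."""
    def combined(completions: List[str], **kwargs) -> List[float]:
        totals = [0.0] * len(completions)
        for name, fn in reward_fns.items():
            scores = fn(completions=completions, **kwargs)
            for i, score in enumerate(scores):
                totals[i] += weights.get(name, 1.0) * score
        return totals
    return combined
```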
reward-kit's `RewardFunction` class includes a `get_trl_adapter()` method that converts any reward function into the format expected by TRL. This makes it easy to use existing reward-kit reward functions with TRL trainers.
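To illustrate what that conversion amounts to, here is a minimal stand-in for the adapter pattern. It is not reward-kit's actual implementation (see `trl_adapter.py` and the example scripts for the real one); it only shows the general idea of wrapping a per-sample scorer in the batch-oriented callable TRL expects.

```python
from typing import Callable, List

class SimpleRewardFunction:
    """Stand-in for reward-kit's RewardFunction, reduced to the adapter idea."""

    def __init__(self, score_fn: Callable[[str], float]):
        self.score_fn = score_fn

    def get_trl_adapter(self) -> Callable[..., List[float]]:
        """Return a batch-oriented callable in the shape TRL trainers expect."""
        def adapter(completions: List[str], **kwargs) -> List[float]:
            return [float(self.score_fn(completion)) for completion in completions]
        return adapter

# Usage: wrap a per-sample scorer, then hand the adapter to a TRL trainer,
# e.g. GRPOTrainer(..., reward_funcs=reward.get_trl_adapter(), ...).
reward = SimpleRewardFunction(lambda text: 1.0 if text.strip() else 0.0)
trl_reward_fn = reward.get_trl_adapter()
```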