# TRL Integration Examples
This directory contains examples showing how to use reward-kit reward functions with Hugging Face’s Transformer Reinforcement Learning (TRL) library.
## Overview
TRL is a library for fine-tuning language models with reinforcement learning. It provides implementations of several RL algorithms, including:
- GRPO (Group Relative Policy Optimization) - used in DeepSeek-R1
- PPO (Proximal Policy Optimization)
- DPO (Direct Preference Optimization)
- ORPO (Odds Ratio Preference Optimization)
These examples demonstrate how to integrate reward-kit’s reward functions with TRL for model fine-tuning. This document serves as the primary guide for the scripts found in the `examples/trl_integration/` directory.
## Example Scripts in `examples/trl_integration/`
The `examples/trl_integration/` directory contains several Python scripts:

- `grpo_example.py`: Demonstrates using reward functions with the Group Relative Policy Optimization (GRPO) trainer from TRL. This is a key example showing:
  - Creating reward functions compatible with reward-kit.
  - Converting them to the TRL-compatible format.
  - Combining multiple reward functions.
  - Dataset preparation for TRL.
  - Setting up the GRPO trainer.
- `ppo_example.py`: Likely demonstrates integration with TRL’s Proximal Policy Optimization (PPO) trainer.
- `minimal_deepcoder_grpo_example.py`: A more focused GRPO example, possibly related to the DeepCoder dataset or a simplified setup.
- `working_grpo_example.py`: Another GRPO variant, perhaps a more tested or stable version.
- `convert_dataset_to_jsonl.py`: A utility script for dataset preparation.
- `trl_adapter.py`: Contains adapter logic, likely used by the example scripts.
- `test_trl_integration.py`: Pytest file for testing the integration.
## Running the Examples
Most examples are run directly as Python scripts (e.g., `python examples/trl_integration/grpo_example.py`). Ensure your virtual environment is active (`source .venv/bin/activate`).

### 1. GRPO Example (`grpo_example.py`)
This example demonstrates using reward functions with the Group Relative Policy Optimization (GRPO) trainer from TRL. It shows:
- Creating format and accuracy reward functions compatible with reward-kit
- Converting these reward functions to TRL-compatible format
- Combining multiple reward functions with weights
- Preparing a dataset in the format expected by TRL
- Setting up the GRPO trainer with these reward functions
(Check the script itself or accompanying comments for any specific dataset or model requirements.)
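Conceptually, the setup looks something like the sketch below. This is a simplified, self-contained illustration rather than the script itself: the model id, the toy dataset, and the trivial reward function are all placeholders, and the exact `GRPOConfig` options depend on your TRL version.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward: favors completions that contain a complete <think>...</think> block.
def format_reward(completions, **kwargs):
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

# Toy prompt-only dataset; the real scripts load and map a Hugging Face dataset instead.
train_dataset = Dataset.from_list([{"prompt": "What is 2 + 2?"}] * 8)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",        # placeholder model id
    reward_funcs=format_reward,
    args=GRPOConfig(output_dir="grpo-output", num_train_epochs=1),
    train_dataset=train_dataset,
)
trainer.train()
```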
### 2. Other Examples (e.g., `ppo_example.py`, `minimal_deepcoder_grpo_example.py`)
(Always refer to the comments within each script for specific instructions or dependencies, as they might vary.)
## Prerequisites
To run these examples, you’ll need the optional TRL dependencies installed in your environment (most importantly the `trl` package itself). For the GRPO example, additional packages may also be required; check the imports at the top of each script.
## Key Concepts
### Reward Function Format
TRL expects reward functions that:
- Accept batch inputs (either string lists or message arrays)
- Return a list of float reward values
- Handle both completion-only inputs and full conversation histories
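For example, a minimal function meeting these expectations might look like the following sketch (the `<think>` tag check is purely illustrative):

```python
from typing import Dict, List, Union

def think_tag_reward(
    completions: List[Union[str, List[Dict[str, str]]]], **kwargs
) -> List[float]:
    """Score each completion: 1.0 if it contains a complete <think> block, else 0.0."""
    rewards = []
    for completion in completions:
        if isinstance(completion, str):
            text = completion                     # completion-only input
        else:
            text = completion[-1]["content"]      # last message of a conversation history
        rewards.append(1.0 if "<think>" in text and "</think>" in text else 0.0)
    return rewards
```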
### Dataset Preparation
The examples show how to:
- Load datasets from Hugging Face
- Format them for TRL (adding system prompts, mapping fields)
- Structure datasets properly for reward functions
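The sketch below illustrates this kind of preparation. The dataset (GSM8K), field names, and system prompt are assumptions chosen for illustration; the actual scripts may use different datasets and mappings.

```python
from datasets import load_dataset

# GSM8K is used purely as an illustration; the example scripts may use other datasets.
dataset = load_dataset("gsm8k", "main", split="train")

def to_trl_format(example):
    return {
        # Conversational prompt: a system instruction plus the original question.
        "prompt": [
            {"role": "system", "content": "Think step by step, then give the final answer."},
            {"role": "user", "content": example["question"]},
        ],
        # Keep the reference answer so accuracy-style rewards can compare against it.
        "answer": example["answer"],
    }

dataset = dataset.map(to_trl_format, remove_columns=dataset.column_names)
```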
### Reward Combining
For fine-tuning models like DeepSeek-R1, multiple reward functions are typically combined:
- Format rewards (encouraging proper tag usage)
- Content rewards (accuracy, helpfulness, etc.)
The examples demonstrate proper weighting and normalization techniques.
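The following sketch shows the general idea of weighted combination. It is not reward-kit’s internal implementation, and the function names in the commented usage line are hypothetical.

```python
def combine_rewards(reward_fns, weights):
    """Blend several TRL-style reward functions into one, with normalized weights."""
    total = float(sum(weights))
    normalized = [w / total for w in weights]

    def combined(completions, **kwargs):
        per_fn_scores = [fn(completions, **kwargs) for fn in reward_fns]
        return [
            sum(w * scores[i] for w, scores in zip(normalized, per_fn_scores))
            for i in range(len(completions))
        ]

    return combined

# Example: weight content accuracy more heavily than formatting.
# combined = combine_rewards([think_tag_reward, accuracy_reward], weights=[0.3, 0.7])
```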
### Integration with reward-kit’s RewardFunction Class
reward-kit’s `RewardFunction` class includes a `get_trl_adapter()` method that converts any reward function into the format expected by TRL. This makes it easy to use existing reward functions from reward-kit with TRL trainers.
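A rough usage sketch is shown below. The `get_trl_adapter()` method is the one described above, but the import path, constructor arguments, and the wrapped function’s signature are assumptions; consult the reward-kit documentation for the exact API.

```python
from reward_kit import RewardFunction  # import path assumed; adjust to your installation

def length_reward(messages, **kwargs):
    """Illustrative reward function; reward-kit's exact expected signature and
    return type may differ, so treat this as a placeholder."""
    last = messages[-1]["content"] if messages else ""
    return {"score": min(len(last) / 200.0, 1.0)}

# Wrap the function (constructor arguments shown here are assumptions) and convert it
# into a TRL-compatible batched callable.
reward_fn = RewardFunction(func=length_reward)
trl_reward_fn = reward_fn.get_trl_adapter()

# trl_reward_fn can now be passed to a TRL trainer, e.g.
# GRPOTrainer(..., reward_funcs=[trl_reward_fn], ...).
```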