TRL Integration Examples

This directory contains examples showing how to use reward-kit reward functions with Hugging Face’s Transformer Reinforcement Learning (TRL) library.

Overview

TRL is a library for fine-tuning language models with reinforcement learning techniques. It includes implementations of several RL algorithms, including:

  • GRPO (Group Relative Policy Optimization) - used in DeepSeek-R1
  • PPO (Proximal Policy Optimization)
  • DPO (Direct Preference Optimization)
  • ORPO (Odds Ratio Preference Optimization)

These examples demonstrate how to integrate reward-kit’s reward functions with TRL for model fine-tuning. This document serves as the primary guide for the scripts found in the examples/trl_integration/ directory.

Example Scripts in examples/trl_integration/

The examples/trl_integration/ directory contains several Python scripts:

  • grpo_example.py: Demonstrates using reward functions with the Group Relative Policy Optimization (GRPO) trainer from TRL. This is a key example showing:
    • Creating reward functions compatible with reward-kit.
    • Converting them to the TRL-compatible format.
    • Combining multiple reward functions.
    • Dataset preparation for TRL.
    • Setting up the GRPO trainer.
  • ppo_example.py: Demonstrates integration with TRL’s Proximal Policy Optimization (PPO) trainer.
  • minimal_deepcoder_grpo_example.py: A smaller, focused GRPO example built around the DeepCoder dataset.
  • working_grpo_example.py: A simpler GRPO variant intended as a known-working reference setup.
  • convert_dataset_to_jsonl.py: A utility script for dataset preparation.
  • trl_adapter.py: Contains adapter logic, likely used by the example scripts.
  • test_trl_integration.py: Pytest file for testing the integration.

Running the Examples

Most examples are run directly as Python scripts. Ensure your virtual environment is active (source .venv/bin/activate).

1. GRPO Example (grpo_example.py)

This example demonstrates using reward functions with the Group Relative Policy Optimization (GRPO) trainer from TRL. It shows:

  • Creating format and accuracy reward functions compatible with reward-kit
  • Converting these reward functions to TRL-compatible format
  • Combining multiple reward functions with weights
  • Preparing a dataset in the format expected by TRL
  • Setting up the GRPO trainer with these reward functions
# Run the GRPO example
python examples/trl_integration/grpo_example.py

(Check the script itself or accompanying comments for any specific dataset or model requirements.)
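At its core, the script follows the standard TRL GRPO training pattern. The condensed sketch below is illustrative only: the model name, dataset, and toy reward function are placeholders, not what grpo_example.py actually uses.

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def length_reward(prompts, completions, **kwargs):
    # Toy stand-in reward: favor shorter completions.
    return [1.0 / (1.0 + len(str(c))) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt dataset works

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",   # placeholder model name
    reward_funcs=length_reward,
    args=GRPOConfig(output_dir="grpo-output"),
    train_dataset=dataset,
)
trainer.train()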

2. Other Examples (e.g., ppo_example.py, minimal_deepcoder_grpo_example.py)

# Example for PPO
python examples/trl_integration/ppo_example.py

# Example for Minimal DeepCoder GRPO
python examples/trl_integration/minimal_deepcoder_grpo_example.py

(Always refer to the comments within each script for specific instructions or dependencies, as they might vary.)

Prerequisites

To run these examples, you’ll need the optional TRL dependencies:

pip install "reward-kit[trl]"

For the GRPO example, you might also need:

pip install math_verify  # For math verification in the reward function

Key Concepts

Reward Function Format

TRL expects reward functions that:

  1. Accept batch inputs (either string lists or message arrays)
  2. Return a list of float reward values
  3. Handle both completion-only inputs and full conversation histories
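
A minimal reward function meeting this contract might look like the sketch below; the tag-checking logic is just a placeholder for a real scoring rule.

def format_reward(prompts, completions, **kwargs):
    # Score each completion in the batch, returning one float per sample.
    rewards = []
    for completion in completions:
        # Completions may be plain strings or message lists, depending on
        # whether the dataset is standard or conversational.
        text = completion[-1]["content"] if isinstance(completion, list) else completion
        # Placeholder rule: reward completions that wrap reasoning in <think> tags.
        rewards.append(1.0 if "<think>" in text and "</think>" in text else 0.0)
    return rewards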

Dataset Preparation

The examples show how to:

  1. Load datasets from Hugging Face
  2. Format them for TRL (adding system prompts, mapping fields)
  3. Structure datasets properly for reward functions
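
For illustration, a preparation step along these lines might look as follows; the dataset name, field names, and system prompt are assumptions rather than what the example scripts use.

from datasets import load_dataset

SYSTEM_PROMPT = "Reason step by step inside <think> tags, then give the answer."

def to_trl_format(example):
    # Map raw fields onto the conversational "prompt" structure TRL expects,
    # prepending a system message.
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["question"]},
        ],
        "answer": example["answer"],
    }

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(to_trl_format)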

Reward Combining

For fine-tuning models like DeepSeek-R1, multiple reward functions are typically combined:

  1. Format rewards (encouraging proper tag usage)
  2. Content rewards (accuracy, helpfulness, etc.)

The examples demonstrate proper weighting and normalization techniques.
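
One way to express a weighted combination is sketched below; the combinator and the stand-in component rewards are illustrative, not reward-kit’s built-in mechanism.

def combine_rewards(weighted_fns):
    # Build a single TRL-style reward function that mixes several components.
    total_weight = sum(weight for _, weight in weighted_fns)

    def combined(prompts, completions, **kwargs):
        totals = [0.0] * len(completions)
        for fn, weight in weighted_fns:
            scores = fn(prompts, completions, **kwargs)
            # Accumulate each component's contribution, normalized by the
            # total weight so the result stays on the components' scale.
            totals = [t + (weight / total_weight) * s for t, s in zip(totals, scores)]
        return totals

    return combined

# Toy usage with stand-in components:
combined = combine_rewards([
    (lambda p, c, **kw: [1.0] * len(c), 0.3),  # stand-in format reward
    (lambda p, c, **kw: [0.5] * len(c), 0.7),  # stand-in accuracy reward
])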

Integration with reward-kit’s RewardFunction Class

reward-kit’s RewardFunction class includes a get_trl_adapter() method that converts any reward function into the format expected by TRL. This makes it easy to use existing reward functions from reward-kit with TRL trainers.
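
In sketch form, assuming a locally defined reward function (the constructor arguments shown here are assumptions and may differ between reward-kit versions):

from reward_kit import RewardFunction

# Sketch only: `my_reward` stands in for any reward-kit-style reward function,
# and the constructor arguments may vary across reward-kit versions.
reward_fn = RewardFunction(func=my_reward)
trl_reward = reward_fn.get_trl_adapter()  # batch-in, list-of-floats-out callable

# trl_reward can then be passed as reward_funcs= to GRPOTrainer (or another
# TRL trainer) exactly like the hand-written functions above.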
