Reward Kit Examples

This directory contains examples demonstrating how to use the Reward Kit library for evaluating and deploying reward functions for LLM fine-tuning.

Prerequisites

Before running the examples, make sure you have:

  1. A Fireworks AI account and API key
  2. The Reward Kit package installed

Setup

1. Create a Virtual Environment

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
source .venv/bin/activate

2. Install Reward Kit

# Install the package in development mode
pip install -e .

3. Configure API Access

For development, use these environment variables:

# Set environment variables for development
export FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY
export FIREWORKS_API_BASE=https://dev.api.fireworks.ai

For production, use:

# Set environment variables for production
export FIREWORKS_API_KEY=your_api_key
export FIREWORKS_API_BASE=https://api.fireworks.ai
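
To confirm the variables are visible to your Python process before running the examples, a quick check such as the following is enough (it only reads the environment and does not call the Fireworks API):

import os

# Report whether each required variable is set, without printing the key itself.
for var in ("FIREWORKS_API_KEY", "FIREWORKS_API_BASE"):
    print(f"{var}: {'set' if os.environ.get(var) else 'MISSING'}")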

Example Walkthroughs

Combined Accuracy and Length Evaluation

The accuracy_length/cosine_scaled_example.py script demonstrates the cosine_scaled_accuracy_length_reward function, which evaluates responses on both accuracy and length efficiency.

# Run the example
python examples/accuracy_length/cosine_scaled_example.py

This example:

  1. Demonstrates evaluation of different response types (short correct, long correct, short incorrect, long incorrect)
  2. Shows how the combined reward function prioritizes short correct answers
  3. Illustrates customizing the weights between accuracy and length components

See the Accuracy + Length Overview for more details.
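
The built-in function's exact signature is covered in that overview; as a rough illustration of the idea only, the sketch below combines a naive accuracy check with a cosine-shaped length penalty using the same @reward_function pattern shown later in this guide. The expected_answer keyword, the weights, and the 200-word budget are assumptions for the sketch, not the library's API.

import math

from reward_kit import EvaluateResult, MetricResult, reward_function

@reward_function
def accuracy_length_sketch(messages, original_messages=None, expected_answer="",
                           accuracy_weight=0.7, length_weight=0.3, max_words=200, **kwargs):
    """Toy combination of accuracy and length efficiency (not the built-in reward)."""
    content = messages[-1].content or ''

    # Accuracy: 1.0 if the expected answer appears in the response, else 0.0.
    accuracy = 1.0 if expected_answer and expected_answer.lower() in content.lower() else 0.0

    # Length: cosine-shaped decay from 1.0 (very short) to 0.0 (at max_words or beyond).
    word_count = len(content.split())
    length_score = 0.5 * (1 + math.cos(math.pi * min(word_count / max_words, 1.0)))

    score = accuracy_weight * accuracy + length_weight * length_score
    return EvaluateResult(
        score=score,
        reason=f'accuracy={accuracy}, length_score={length_score:.2f}',
        metrics={
            'accuracy': MetricResult(score=accuracy, reason='substring match'),
            'length': MetricResult(score=length_score, reason=f'{word_count} words'),
        }
    )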

Basic Evaluation Example

The evaluation_preview_example.py script demonstrates how to preview and create an evaluation with Reward Kit.

Step 1: Understand the Metric

Examine the example metric in the metrics/word_count directory. This metric evaluates responses based on their word count:

from reward_kit import EvaluateResult, MetricResult, reward_function

@reward_function
def evaluate(messages, original_messages=None, **kwargs):
    # Get the last message (the assistant's response)
    last_message = messages[-1]
    content = last_message.content or ''

    # Count words and calculate the score
    word_count = len(content.split())
    score = min(word_count / 100, 1.0)  # Cap at 1.0

    return EvaluateResult(
        score=score,
        reason=f'Word count: {word_count}',
        metrics={
            'word_count': MetricResult(
                score=score,
                reason=f'Word count: {word_count}'
            )
        }
    )
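
With the decorator applied, the metric can be exercised locally before any preview. The call below assumes the decorated function accepts plain role/content dictionaries and returns the EvaluateResult shown above; if it expects Message objects instead, construct those first.

# Hypothetical local smoke test for the word count metric.
result = evaluate(messages=[
    {"role": "user", "content": "Tell me about AI"},
    {"role": "assistant", "content": "AI refers to systems designed to mimic human intelligence."},
])
print(result.score, result.reason)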

Step 2: Prepare Sample Data

Review the sample conversations in samples/samples.jsonl. Each line contains a JSON object representing a conversation:

{"messages": [{"role": "user", "content": "Tell me about AI"}, {"role": "assistant", "content": "AI refers to systems designed to mimic human intelligence."}]}

Step 3: Run the Preview

Execute the evaluation preview example:

source .venv/bin/activate && FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY FIREWORKS_API_BASE=https://dev.api.fireworks.ai python examples/evaluation_preview_example.py

This will:

  1. Load the word count metric from examples/metrics/word_count
  2. Load sample conversations from examples/samples/samples.jsonl
  3. Preview the evaluator using the Fireworks API
  4. Display the evaluation results for each sample
  5. Create an evaluator named "word-count-eval"
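
Roughly, the example script wires those steps together in Python. The sketch below is illustrative only: the module path, function names (preview_evaluation, create_evaluation), and keyword arguments are assumptions, so check evaluation_preview_example.py for the actual calls.

# Hypothetical sketch of the preview-then-create flow; see
# evaluation_preview_example.py for the real API.
from reward_kit.evaluation import preview_evaluation, create_evaluation  # assumed import path

preview = preview_evaluation(
    metric_folders=["word_count=./examples/metrics/word_count"],  # assumed argument names
    sample_file="./examples/samples/samples.jsonl",
)
print(preview)  # per-sample results returned by the Fireworks API

create_evaluation(
    evaluator_id="word-count-eval",
    metric_folders=["word_count=./examples/metrics/word_count"],
    display_name="Word Count Evaluator",
    description="Scores responses by word count",
)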

Deployment Example

The deploy_example.py script demonstrates how to deploy a reward function to the Fireworks platform.

Step 1: Examine the Reward Function

Review the informativeness reward function in the deploy example, which evaluates responses based on the following signals (a simplified sketch follows the list):

  • Length
  • Specificity markers
  • Content density
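
A much-simplified version of that idea, written against the same decorator pattern used throughout these examples, could look like the sketch below. The marker list, weights, and thresholds are illustrative and are not the heuristics used in deploy_example.py.

from reward_kit import EvaluateResult, MetricResult, reward_function

# Illustrative specificity markers; deploy_example.py defines its own heuristics.
SPECIFICITY_MARKERS = ("for example", "specifically", "in particular", "such as")

@reward_function
def informativeness_sketch(messages, original_messages=None, **kwargs):
    """Toy informativeness score built from length, specificity markers, and content density."""
    content = (messages[-1].content or '').lower()
    words = content.split()

    length_score = min(len(words) / 150, 1.0)
    marker_score = min(sum(content.count(m) for m in SPECIFICITY_MARKERS) / 3, 1.0)
    # Content density: share of distinct words, as a crude proxy.
    density_score = len(set(words)) / len(words) if words else 0.0

    score = (length_score + marker_score + density_score) / 3
    return EvaluateResult(
        score=score,
        reason=f'length={length_score:.2f}, markers={marker_score:.2f}, density={density_score:.2f}',
        metrics={
            'length': MetricResult(score=length_score, reason=f'{len(words)} words'),
            'specificity': MetricResult(score=marker_score, reason='marker count'),
            'density': MetricResult(score=density_score, reason='distinct-word ratio'),
        }
    )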

Step 2: Run the Deployment

Execute the deployment example:

source .venv/bin/activate && FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY FIREWORKS_API_BASE=https://dev.api.fireworks.ai python examples/deploy_example.py

This will:

  1. Test the reward function locally with sample data
  2. Deploy the function to the Fireworks platform
  3. Display the deployed evaluator ID

Using the CLI

The Reward Kit also provides a command-line interface for common operations.

Preview an Evaluator Using CLI

# Activate the virtual environment and set environment variables
source .venv/bin/activate

# Preview an evaluator
FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY FIREWORKS_API_BASE=https://dev.api.fireworks.ai reward-kit preview \
  --metrics-folders "word_count=./examples/metrics/word_count" \
  --samples ./examples/samples/samples.jsonl

Deploy an Evaluator Using CLI

# Activate the virtual environment and set environment variables
source .venv/bin/activate

# Deploy an evaluator
FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY FIREWORKS_API_BASE=https://dev.api.fireworks.ai reward-kit deploy \
  --id my-evaluator \
  --metrics-folders "word_count=./examples/metrics/word_count" \
  --display-name "My Word Count Evaluator" \
  --description "Evaluates responses based on word count" \
  --force

Creating Your Own Evaluators

Follow these steps to create your own custom evaluator:

  1. Create a directory for your metric (e.g., my_metrics/coherence)
  2. Create a main.py file with an evaluate function
  3. Test your evaluator using the preview functionality
  4. Deploy your evaluator when ready

Example Custom Metric

from typing import List, Optional

from reward_kit import EvaluateResult, Message, MetricResult, reward_function

@reward_function
def evaluate(messages: List[Message], original_messages: Optional[List[Message]] = None, **kwargs):
    """Custom evaluation metric."""
    # Your evaluation logic here; this placeholder gives every response the same score.
    your_score = 0.5

    return EvaluateResult(
        score=your_score,
        reason="Explanation of score",
        metrics={
            'your_metric': MetricResult(
                score=your_score,
                reason="Detailed explanation"
            )
        }
    )

Next Steps

After exploring these examples, you can:

  1. Create your own custom metrics
  2. Integrate reward functions into model training workflows
  3. Use deployed evaluators to score model outputs
  4. Combine multiple metrics for comprehensive evaluation