The Build SDK natively integrates reward-kit, making it easy to develop evaluators for reinforcement fine-tuning (RFT) in Python.

Prerequisites

You can install the Fireworks Build SDK using pip:

pip install --upgrade fireworks-ai

Make sure to set the FIREWORKS_API_KEY environment variable to your Fireworks API key:

export FIREWORKS_API_KEY=<API_KEY>

You can create an API key in the Fireworks AI web UI or by installing the firectl CLI tool and running:

firectl signin
firectl create api-key --key-name <Your-Key-Name>
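
If you want to confirm that the key is visible to Python before continuing, a quick optional sanity check (plain Python, no SDK-specific assumptions) is:

import os

# Fails fast if FIREWORKS_API_KEY is not set in the current environment.
assert os.environ.get("FIREWORKS_API_KEY"), "FIREWORKS_API_KEY is not set"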

Your first evaluator

For this tutorial, we’ll create a new project using uv.

% uv init
Initialized project `my-project`
% uv add fireworks-ai

You should now have a project with a pyproject.toml file and a uv.lock file.

% tree
.
├── main.py
├── pyproject.toml
├── README.md
└── uv.lock

1 directory, 4 files

To write your first evaluator, create a new file at my_first_evaluator/main.py:

Evaluators must live in their own directory because the Build SDK recursively packages every sibling and child file of the directory that contains the imported reward function.

mkdir -p my_first_evaluator
touch my_first_evaluator/main.py
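
For example, if you later add helper modules or data files next to main.py, they are packaged along with the reward function. The helpers.py and data/keywords.txt entries below are hypothetical and only illustrate the packaging behavior:

my_first_evaluator
├── main.py          # contains the @reward_function
├── helpers.py       # hypothetical sibling file: packaged automatically
└── data
    └── keywords.txt # hypothetical child file: packaged automatically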

Add the following code to my_first_evaluator/main.py:

my_first_evaluator/main.py
from fireworks import reward_function


@reward_function(id="my-first-evaluator")
def evaluate(messages, **kwargs):
    """
    This is a simple reward function that returns a score of 1.0 if the message contains the word "fireworks" and 0.0 otherwise.
    """
    # Extract the content from the messages structure
    content = messages[0]["content"]
    score = 1.0 if "fireworks" in content else 0.0

    return {"score": score}

To test your evaluator locally, you can simply call the function itself. Replace the contents of main.py with the following code:

main.py
from my_first_evaluator.main import evaluate

print(evaluate(messages=[{"role": "user", "content": "Hello, world!"}]))
print(evaluate(messages=[{"role": "user", "content": "Hello, fireworks!"}]))

Let’s run the script and see what happens:

% uv run python main.py
score=0.0 is_score_valid=True reason=None metrics={} step_outputs=None error=None
score=1.0 is_score_valid=True reason=None metrics={} step_outputs=None error=None

You should see that the first message returns a score of 0.0 and the second message returns a score of 1.0, showing that our evaluator is working as expected.
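
Because the decorated function is still an ordinary Python callable, you can also wrap these checks in a small smoke test before launching anything on Fireworks. The snippet below is a sketch; it only assumes that the returned object exposes the score field shown in the printed output above:

from my_first_evaluator.main import evaluate

# (message, expected score) pairs for a quick local check.
test_cases = [
    ({"role": "user", "content": "Hello, world!"}, 0.0),
    ({"role": "user", "content": "Hello, fireworks!"}, 1.0),
]

for message, expected in test_cases:
    result = evaluate(messages=[message])
    # The result object exposes `score`, as seen in the repr printed above.
    assert result.score == expected, f"unexpected score for {message['content']!r}"

print("All local checks passed.")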

Evaluating on a dataset

Now that we’ve created and tested our first evaluator, we can run it on Fireworks infrastructure against a dataset uploaded to Fireworks.

To do this, we’ll create a Dataset object and call create_evaluation_job. Create a new file called run_first_evaluator.py at the root of your project and add the following code:

run_first_evaluator.py
from my_first_evaluator.main import evaluate
from fireworks import Dataset
import random

dataset = Dataset.from_list(
    data=[
        {"messages": [{"role": "user", "content": "Hello, fireworks!" if random.random() < 0.5 else "Hello, world!"}]}
        for _ in range(100)
    ]
)

job = dataset.create_evaluation_job(evaluate)
print(job.url)
job.wait_for_completion()
print(job.output_dataset.url)

Let’s run the script and see what happens:

% uv run python run_first_evaluator.py
https://app.fireworks.ai/dashboard/evaluation-jobs/wqfvyv90yzfv9q95

When the script starts, it prints a URL for the evaluation job. You can open that URL to follow the job in the Fireworks AI web UI.

Running evaluation job in the UI

After some time, the evaluation job will complete. You can follow its progress and final status in the Fireworks AI web UI.

Completed evaluation job in the UI

Once the job finishes, the script also prints the URL for the output dataset.

% uv run python run_first_evaluator.py 
https://app.fireworks.ai/dashboard/evaluation-jobs/wqfvyv90yzfv9q95
https://app.fireworks.ai/dashboard/datasets/2025-06-24-18-16-54-101727

Open that URL to view the output dataset in the Fireworks AI web UI.

Results in the UI

Creating your second evaluator

Let’s create a more complex evaluator that imports a third-party library to calculate the score. First, add the textblob library to the project:

% uv add textblob

The Build SDK automatically picks up dependencies declared in pyproject.toml or requirements.txt files in your project. Alternatively, you can pass a requirements.txt-style list of strings directly to the @reward_function decorator, as sketched below.
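
As a rough sketch of the inline form, the decorator call might look like the following. Note that the keyword name requirements (and the evaluator id used here) are assumptions made for illustration only; the tutorial only states that a requirements.txt-style list can be passed, so check the SDK reference for the actual parameter name:

from fireworks import reward_function

# NOTE: `requirements=` is an assumed keyword name, shown for illustration only;
# the Build SDK accepts a requirements.txt-style list on the decorator, but the
# exact parameter name may differ.
@reward_function(id="my-inline-deps-evaluator", requirements=["textblob"])
def evaluate(messages, **kwargs):
    ...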

Now, let’s create a new evaluator under my_second_evaluator/main.py:

% mkdir -p my_second_evaluator
% touch my_second_evaluator/main.py

Copy-paste the following code into my_second_evaluator/main.py:

my_second_evaluator/main.py
from fireworks import reward_function
from textblob import TextBlob


@reward_function(id="my-second-evaluator")
def evaluate(messages, **kwargs):
    """
    This is a reward function that demonstrates the use of third-party dependencies.
    It returns a normalized score between 0 and 1.0 based on the sentiment polarity of the message.
    """
    # Extract the content from the messages structure
    content = messages[0]["content"]

    # Use the third-party dependency (TextBlob) for sentiment analysis
    blob = TextBlob(content)
    sentiment_score = blob.sentiment.polarity  # type: ignore

    # Normalize sentiment score from [-1, 1] to [0, 1]
    # sentiment_score ranges from -1 to 1, so we add 1 to get [0, 2], then divide by 2 to get [0, 1]
    normalized_score = (sentiment_score + 1) / 2

    # Ensure the score is clamped between 0 and 1
    normalized_score = max(0.0, min(1.0, normalized_score))

    # Return the format expected by the framework
    return {"score": normalized_score}

Download the random_phrases.jsonl file and save it to the root of your project, so that your project layout looks like this:

% tree -I "__pycache__"
.
├── main.py
├── my_first_evaluator
│   └── main.py
├── my_second_evaluator
│   └── main.py
├── pyproject.toml
├── random_phrases.jsonl
├── README.md
├── run_first_evaluator.py
└── uv.lock
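
The tutorial doesn’t show the file’s contents, but if each line follows the same shape as the rows built with Dataset.from_list earlier (a JSON object with a messages list, which is an assumption here), you can peek at the first few rows like this:

import json

# Print the first three user messages from the downloaded dataset.
# Assumes each line is a JSON object with a "messages" list, matching the
# rows we built with Dataset.from_list above.
with open("random_phrases.jsonl") as f:
    for _, line in zip(range(3), f):
        row = json.loads(line)
        print(row["messages"][0]["content"])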

Create a new file called run_second_evaluator.py and add the following code:

run_second_evaluator.py
from fireworks import Dataset
from my_second_evaluator.main import evaluate

dataset = Dataset.from_file("random_phrases.jsonl")
job = dataset.create_evaluation_job(evaluate)
print(job.url)
job.wait_for_completion()
print(job.output_dataset.url)

Once the script is done running, you can click on the URL for the evaluation job and see the results in the Fireworks AI web UI.

Results of the second evaluator in the UI

🎉 Congratulations! You’ve now created and evaluated your first two evaluators. If you have any questions, please reach out to us on Discord.