What this is

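Prompt rendering and labels are part of your algorithm: changing either one changes what the model sees or how its output is scored. Treat dataset conversion and eval labels as versioned experiment logic, so that every result can be traced back to the rendering and labeling code that produced it.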
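One way to make that versioning concrete is to fingerprint the prompt template together with the labeling rule and store the fingerprint next to each result. The sketch below is only an illustration: `PROMPT_TEMPLATE`, `LABEL_RULE`, `render_prompt`, and `experiment_fingerprint` are hypothetical names and are not part of `shared.dataset`.

```python
import hashlib
import json

# Hypothetical prompt template and label rule, for illustration only.
PROMPT_TEMPLATE = "Question: {question}\nAnswer:"
LABEL_RULE = "exact_match_after_strip"


def experiment_fingerprint(template: str, label_rule: str) -> str:
    """Return a short, stable hash of the prompt template and label rule."""
    payload = json.dumps(
        {"template": template, "label_rule": label_rule}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def render_prompt(question: str) -> str:
    """Render one dataset question with the versioned template."""
    return PROMPT_TEMPLATE.format(question=question)


# Log this value alongside every eval result.
fingerprint = experiment_fingerprint(PROMPT_TEMPLATE, LABEL_RULE)
print(fingerprint)
```

Any edit to the template or the label rule produces a different fingerprint, so two runs with matching fingerprints can be assumed to share the same rendering and labeling logic.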

End-to-end examples

Load and score dataset rows

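from shared.dataset import load_gsm8k_dataset, evaluate_gsm8k_response

# Load at most 1000 rows from a local GSM8K JSONL file.
rows = load_gsm8k_dataset("/path/to/gsm8k.jsonl", max_rows=1000)

# Score one model response against its ground-truth answer.
reward, detail = evaluate_gsm8k_response(response="42", ground_truth="42")
print(reward, detail)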
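To show why the labeling step counts as experiment logic, here is a minimal stand-in scorer with the same `(reward, detail)` return shape as the call above. It is not the implementation of `evaluate_gsm8k_response`; `score_response` and `extract_last_number` are hypothetical helpers, and the rule used here (compare the last number in each string, ignoring thousands separators) is an assumption chosen for illustration.

```python
import re
from typing import Optional, Tuple

# Matches integers and decimals, optionally negative, with optional commas.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_last_number(text: str) -> Optional[str]:
    """Return the last number in text with commas removed, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def score_response(response: str, ground_truth: str) -> Tuple[float, dict]:
    """Hypothetical scorer: reward 1.0 when the last numbers match exactly."""
    predicted = extract_last_number(response)
    expected = extract_last_number(ground_truth)
    correct = predicted is not None and predicted == expected
    detail = {"predicted": predicted, "expected": expected}
    return (1.0 if correct else 0.0), detail


print(score_response("The answer is 1,000.", "#### 1000"))
```

Small choices in a scorer like this change the reported accuracy. For example, this sketch compares strings, so "42.0" and "42" would not match. That is why the labeling rule should be versioned with the rest of the experiment.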