Each dataset row needs a `messages` key containing OpenAI-style chat messages, plus any custom fields your evaluator expects, such as a `ground_truth` field:
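For illustration, a single row might look like the sketch below. The JSONL framing and the concrete values are assumptions; only the `messages` and `ground_truth` keys are prescribed above.

```python
import json

# Illustrative dataset row (assumption: one JSON object per line, i.e. JSONL).
row = {
    # OpenAI-style chat messages: a list of {"role", "content"} dicts.
    "messages": [
        {"role": "system", "content": "You are a concise math tutor."},
        {"role": "user", "content": "What is 17 * 23?"},
    ],
    # Custom field the evaluator can compare the model's answer against.
    "ground_truth": "391",
}

print(json.dumps(row))
```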
Evaluators are written with the `reward-kit` library.
Navigate to the Evaluations tab in your Fireworks dashboard and click Create Evaluator; you should see the following page:
You can `print` inside the evaluator code and view the output in the console panel.
The evaluator receives the `messages` and any custom fields like `ground_truth` coming from your dataset, and returns an `EvaluateResult` with the following fields:

- `score`: float between 0.0 and 1.0
- `reason`: (optional) a string for logging
- `is_score_valid`: (optional, defaults to `True`) flag to skip training on invalid outputs
- `metrics`: a mapping from metric name to `MetricResult` to include auxiliary metrics

A common practice is to include the individual metrics you want to track in the `metrics` field, whereas the actual `score` in the `EvaluateResult` is a weighted average of the individual metrics that will actually be used for training.
**Example**

The final `score` must be in the `[0.0, 1.0]` range, and individual metric values can be in an arbitrary range.
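A minimal evaluator sketch follows. It assumes `reward_function`, `EvaluateResult`, and `MetricResult` can be imported from the top level of `reward_kit` as in that library's public examples, that `messages` arrives as a list of OpenAI-style dicts, and that `MetricResult` accepts `score` and `reason` keyword arguments; the metric names and weights are invented for illustration.

```python
from reward_kit import reward_function, EvaluateResult, MetricResult


@reward_function
def evaluate(messages, ground_truth=None, **kwargs):
    """Score the assistant's last reply against the dataset's ground_truth."""
    answer = messages[-1]["content"] if messages else ""

    # Individual metrics to track. These happen to be 0/1 here, but metric
    # values in general may be in an arbitrary range.
    exact_match = 1.0 if ground_truth and ground_truth in answer else 0.0
    brevity = 1.0 if len(answer) <= 200 else 0.0

    # print() output is visible in the console panel while iterating.
    print(f"exact_match={exact_match}, brevity={brevity}")

    # The final score used for training must stay within [0.0, 1.0];
    # here it is a weighted average of the individual metrics.
    score = 0.8 * exact_match + 0.2 * brevity

    return EvaluateResult(
        score=score,
        reason=f"exact_match={exact_match}, brevity={brevity}",
        metrics={
            # Assumption: MetricResult takes these keyword arguments.
            "exact_match": MetricResult(score=exact_match, reason="ground_truth containment"),
            "brevity": MetricResult(score=brevity, reason="reply under 200 characters"),
        },
    )
```

Because the weights sum to 1, the weighted average keeps the final `score` inside `[0.0, 1.0]`, while the raw metrics are still reported individually for tracking.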
| Flag | Default | Valid range | When to change |
|---|---|---|---|
| `--epochs` | 1 | 1 – 10 (whole numbers only) | Add 1–2 more passes if the reward still climbs steadily near the end of training. Too many epochs risk over-fitting. |
| `--batch-size` | 32k tokens | Hardware-bounded | Lower if you hit OOM; raise only when GPUs have >30% headroom. |
| `--learning-rate` | 1e-4 | 1e-5 – 5e-4 | Decrease when the reward spikes then collapses; increase when the curve plateaus too early. |
| `--lora-rank` | 8 | 4 – 128 (powers of 2) | Higher ranks give more capacity but cost VRAM; stay ≤64 unless you have >40 GB per GPU. |
| `--max-context-length` | 8,192 tokens | Up to model limit | Raise only when your prompts truncate; remember longer sequences consume quadratic compute. |
The `firectl` CLI enforces the ranges shown here; out-of-bound values throw an "Invalid rollout parameters" error.
| Field | CLI flag | Default | Recommended range | Why it matters |
|---|---|---|---|---|
| Maximum tokens | `--inference-max-tokens` | 2,048 | 16 – 16,384 | Longer responses improve reward on summarisation / story tasks but add cost. (blog.ml.cmu.edu) |
| Temperature | `--inference-temperature` | 0.7 | 0.1 – 2.0 (> 0 only) | Values below 0.1 converge towards greedy decoding and kill exploration; 0.5–1.0 is a sweet spot for RLHF. (arxiv.org, huyenchip.com) |
| Top-p | `--inference-top-p` | 1.0 | 0 – 1 | Lower to 0.2–0.5 to clamp long-tail tokens when the reward penalises hallucinations. (codefinity.com) |
| Top-k | `--inference-top-k` | 40 | 0 – 100 (0 = off) | Combine with temperature for more creative exploration; keep ≤50 for latency. (medium.com) |
| n (choices) | `--inference-n` | 4 | 2 – 8 | Policy optimisation needs multiple candidates to compute a meaningful KL term; ≥2 is mandatory. (blog.ml.cmu.edu, rlhfbook.com) |
| Extra body JSON | `--inference-extra-body` | empty | valid JSON | Pass extra OpenAI-style params (e.g., `stop`, `logit_bias`); invalid JSON is rejected. See the sketch below. |
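For instance, an extra-body payload might be assembled as in the sketch below. The `stop` and `logit_bias` parameter names come from the table above; the concrete values, and the use of Python to serialise them, are purely illustrative assumptions.

```python
import json

# Illustrative OpenAI-style extras for --inference-extra-body.
extra_body = {
    "stop": ["\n\n", "###"],          # example stop sequences
    "logit_bias": {"50256": -100},    # token id -> bias, as in the OpenAI API
}

# The flag expects a single valid-JSON string; invalid JSON is rejected.
print(json.dumps(extra_body))
```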
`n`, `max_tokens`, and batch size scale memory roughly linearly; scale horizontally or enable Turbo mode if needed. Each setting is bounded by hard limits (the `MIN_*`/`MAX_*` constants), and entering a value outside these windows surfaces an error immediately, saving wasted GPU hours.
| Goal | Turn these knobs |
|---|---|
| Faster convergence | ↑ `epochs`, tune `learning-rate` < 2× default |
| Safer / less toxic | ↓ `temperature`, `top_p`, `top_k` |
| More creative | `temperature` ≈ 1 – 1.2, `top_p` ≈ 0.9 |
| Cheaper roll-outs | ↓ `n`, `max_tokens`, batch size |
| Higher capacity | ↑ `lora-rank`, but monitor VRAM |