Reward-Driven Summarizer - RFT on Fireworks
Introduction
In this demo, we will show how thoughtful reward-function design can steer a language model toward clear, 50-token summaries that balance brevity with relevance. Using Fireworks' reinforcement fine-tuning workflow, you'll see how adjusting a few well-chosen signals turns raw model outputs into reliable digests for news briefs, chat recaps, and study notes, and why defeating reward hacking is central to building trustworthy summarizers.
Goals
Every summarizer will look different. Let's set up some goals:
- Use llama-v3p1-8b-instruct to balance speed and model intelligence
- Summaries should be under 50 tokens
- Summaries should capture relevant information within a much larger text
Why Reinforcement Fine-Tune?
Reinforcement Fine-Tuning (RFT) augments standard supervised training with a reward signal that scores each model output after it is generated. Instead of optimizing only for next-token likelihood, the model learns from these scores, gradually preferring strategies that maximize the reward and discarding those that do not. Traditional supervised fine-tuning teaches a model to imitate example summaries, but it never checks whether the finished output actually satisfies our broader goals, such as striking the right balance between brevity and substance. RFT adds a feedback step after each summary is generated, letting us reward outputs that hit that balance and discourage ones that don't.
Because we can adjust this feedback on the fly, RFT gives us a practical steering mechanism: tweak the reward, observe how the model adapts, and quickly converge on summaries that are both concise and informative. For this kind of summarization task, that end-to-end feedback loop is essential; imitation alone can't capture the nuanced trade-offs we care about. For more information on RFT on the Fireworks platform and when to use it, take a look at our examples on Knowledge Distillation.
Setup & Utils
If you haven't already, head to https://fireworks.ai/, make an account, and grab an API key - you'll need one for this demo.
Initial Test
Before we touch any fine-tuning or reward functions, we first run the task with an off-the-shelf model and record how its raw summaries perform. This baseline reveals the model's natural tendencies: what it captures well, what it omits, and where it drifts from our goals. Let's define a system prompt:
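A minimal sketch of that baseline call, assuming the OpenAI-compatible Fireworks endpoint; the exact system prompt and helper names here are illustrative and may differ from the notebook's:

```python
import os

from openai import OpenAI  # Fireworks exposes an OpenAI-compatible API

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

# Illustrative system prompt: ask for a summary within the 50-token budget.
SYSTEM_PROMPT = (
    "You are a summarization assistant. Summarize the user's document in "
    "50 tokens or fewer, keeping only the most important facts."
)

def baseline_summary(document: str) -> str:
    """Run the untuned model once so we can inspect its raw behaviour."""
    resp = client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": document},
        ],
        max_tokens=128,  # leave headroom so over-length outputs are visible
    )
    return resp.choices[0].message.content
```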
Part 1: Teach brevity (Length Gate)
Our first reward is a binary, length-only gate: a summary earns full credit if it stays within the token budget and zero otherwise. This simple gate makes it crystal clear to the model that excess verbosity is unacceptable.
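A minimal sketch of the length gate as a plain Python function; the whitespace split is a rough stand-in for counting with the model's own tokenizer, and the names are illustrative:

```python
MAX_SUMMARY_TOKENS = 50  # budget from our goals

def length_gate_reward(summary: str) -> float:
    """Binary reward: 1.0 if the summary fits the budget, 0.0 otherwise."""
    # Whitespace split approximates the token count; the notebook would use
    # the model's tokenizer for an exact number.
    n_tokens = len(summary.split())
    return 1.0 if n_tokens <= MAX_SUMMARY_TOKENS else 0.0
```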
Once the model has learned that shorter is better, we need to remind it that substance still counts. The second evaluator rewards each summary according to how much of the source document’s wording it captures. A quick overlap measure—ROUGE‑L—is enough to push the policy toward mentioning the main ideas instead of trimming indiscriminately.Part 3: Focus on key facts (Bullet Recall)
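One way this could look in code, using the rouge_score package and keeping the Part 1 length gate as a hard filter; the gating choice and names are illustrative rather than the notebook's exact evaluator:

```python
from rouge_score import rouge_scorer

# LCS-based overlap between the summary and the full source document.
_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_reward(summary: str, document: str) -> float:
    """ROUGE-L F-measure against the source, zeroed out if over budget."""
    if length_gate_reward(summary) == 0.0:  # reuses the Part 1 gate
        return 0.0  # over-length summaries still earn nothing
    return _rouge.score(document, summary)["rougeL"].fmeasure
```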
Our third evaluator narrows the comparison window from the entire source document to a curated bullet list of key facts. Pure document-level ROUGE can reward nonsense phrases that merely echo scattered words; by contrast, scoring against a focused checklist forces the model to mention the specific points humans actually care about. The downside is cost: generating high-quality bullet lists requires annotation by humans or by a much larger LLM. For example, a bullet-point list for our new example might look like the following:
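Given such a checklist, the score itself can be simple recall: what fraction of the bullets does the summary cover? A minimal sketch, approximating "covered" with ROUGE-L recall per bullet; the 0.5 threshold and names are illustrative:

```python
def bullet_recall_reward(summary: str, bullets: list[str],
                         threshold: float = 0.5) -> float:
    """Fraction of key-fact bullets the summary appears to mention."""
    if not bullets:
        return 0.0
    covered = 0
    for bullet in bullets:
        # Recall of the bullet's wording within the summary
        # (reuses the ROUGE-L scorer defined in Part 2).
        if _rouge.score(bullet, summary)["rougeL"].recall >= threshold:
            covered += 1
    return covered / len(bullets)
```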
Advanced Reward: Polish style (Fluency)
With essentials and length under control, the last step is polish: we combine the bullet-coverage score with a fluency bonus (low perplexity from a tiny GPT-2 scorer). The reward is a weighted average, so you can dial emphasis toward clarity or content with one line of code through the use of reward-kit.
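A sketch of the blend, using GPT-2 from Hugging Face transformers as the fluency scorer; the 0.7/0.3 weights and the perplexity-to-score mapping are illustrative choices, and bullet_recall_reward is the function sketched above:

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

_tok = GPT2TokenizerFast.from_pretrained("gpt2")
_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def fluency_score(text: str) -> float:
    """Map GPT-2 perplexity into (0, 1]: lower perplexity, higher score."""
    ids = _tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = _lm(ids, labels=ids).loss  # mean negative log-likelihood
    ppl = math.exp(loss.item())
    return 1.0 / (1.0 + math.log(max(ppl, 1.0)))

def final_reward(summary: str, bullets: list[str],
                 w_content: float = 0.7, w_fluency: float = 0.3) -> float:
    """Weighted average of key-fact coverage and fluency."""
    return (w_content * bullet_recall_reward(summary, bullets)
            + w_fluency * fluency_score(summary))
```

Adjusting w_content and w_fluency is the one-line knob: raise w_fluency for smoother prose, raise w_content to protect coverage of the key facts.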
Takeaways
By walking a plain language model through four reward tweaks—length gate, document overlap, key-bullet focus, and a final fluency blend—we steered it into a dependable 50-token summarizer. Each change showed, in minutes, how the model bends to whatever signal we supply, thanks to the lightweight evaluator-swap workflow built into Fireworks' RFT platform.
- A model follows its incentives, not your intentions. Define the right reward and you steer behaviour directly; leave gaps and the model finds them.
- Start simple, then layer complexity. A binary length check exposed verbosity problems instantly; later signals refined relevance and style.
- End‑to‑end feedback beats imitation alone. Rewarding the full output captures goals that token‑level training can’t touch.