Fireworks Agent’s classification workflow is purpose-built for label-prediction tasks. It evaluates one or more base models against your labeled dataset, fine-tunes the strongest candidates, and reports a head-to-head comparison of base vs fine-tuned accuracy on a held-out test split. Use this workflow for classification, content extraction, intent detection, routing, and other tasks with a discrete label set.Documentation Index
Fetch the complete documentation index at: https://docs.fireworks.ai/llms.txt
Use this file to discover all available pages before exploring further.
For the underlying SFT mechanics (job parameters, supported base models, dataset format), see Managed Fine-Tuning → Supervised Fine-Tuning. The classification workflow is built on top of SFT with classification-specific dataset handling and reporting.
What you give Agent
| Input | Required? | Notes |
|---|---|---|
| Dataset ID(s) | Yes | Single dataset (split 80/20 train/test) or two datasets (separate train + eval) |
| Models to evaluate and fine-tune | Yes | Agent does not default to “all models”; pick from the supported list when prompted |
| Candidate labels | No | Agent infers labels from your data if you don’t list them explicitly |
| Imbalance-ratio threshold | No | Defaults to 50.0 (ratio of most-frequent to least-frequent label) |
Dataset requirements
- Each sample must contain
messagesin OpenAI chat-completion format. ground_truthis optional. If absent, Agent extracts the label from the final assistant message using task-specific logic in the plan.ground_truthmay be a single string or a list of strings.
Example session instructions
Single dataset with automatic split, two candidate models:Where classification lives in the 7-phase pipeline: Phase 1 is dataset inspection (with per-label sample counts and imbalance detection), phase 2 is plan + cost approval, phase 3 is the base-model benchmark plus fine-tuning sweep, phase 4 is the full-data run for each candidate, phase 5 is the fine-tuned evaluation with per-label and overall accuracy, phase 6 is deployment of the winner, phase 7 is the base-vs-fine-tuned comparison report. See How Agent runs a training job.
Workflow stages
Dataset inspection
Agent stages your dataset locally once, computes per-label sample counts, detects the imbalance ratio, and inspects message structure. If
ground_truth is missing, Agent decides how to extract the label from the final assistant turn.Label resolution
If you provided candidate labels in your instruction, Agent uses them. Otherwise, Agent infers them from the data and surfaces the inferred set in the plan for you to confirm.
Imbalance handling
Agent computes the ratio of most-frequent to least-frequent label. If it exceeds the threshold (default
50.0), Agent flags the imbalance in the plan and proposes mitigations (rebalancing, weighted training, or a smaller candidate set).Plan + cost approval
Agent writes a plan and presents it with a cost breakdown (base-model evaluation inference + fine-tuning + fine-tuned evaluation inference + total). Approve once to proceed.
Base-model benchmarking
Agent runs inference on the held-out eval split against each selected base model and reports per-label and overall accuracy. This becomes the baseline.
Fine-tuning sweep
Agent fine-tunes the strongest candidates with the default tuning grid (or your explicit grid), batched into waves of up to 6 active jobs.
Fine-tuned evaluation
Agent runs the same eval inference against each fine-tuned model and computes per-label accuracy, overall accuracy, and confusion-matrix style summaries.
Output
When the session reportssucceeded, Agent’s response includes:
- Per-label and overall accuracy for every base model evaluated
- Per-label and overall accuracy for every fine-tuned candidate
- The winning model ID, deployment ID, and inference endpoint
- A
fireworks-aiSDK snippet for label prediction final_report.mdin the session workspace with the full base vs fine-tuned comparison and estimated-vs-actual cost
Customizing the run
- Explicit labels: “Labels are: positive, negative, neutral.”
- Imbalance threshold override: “Use an imbalance threshold of 20.”
- Inference-only mode: “Just benchmark — don’t fine-tune.”
- Single candidate: “Only fine-tune Qwen3 8B, skip the base-vs-base comparison.”
- Custom split: “Use a 70/30 train/test split.”
Agent crib notes
- Required inputs: dataset ID and at least one base model to evaluate. Agent will ask explicitly if either is missing.
- Agent needs to know your labels eventually — either supply them in the instruction or confirm Agent’s inferred set when prompted.
- The default imbalance threshold is
50.0; if your dataset is highly imbalanced, expect Agent to flag it in the plan. - For multi-label classification (a sample with multiple ground-truth labels), pass
ground_truthas a list in your dataset. - Agent creates new resources only; your dataset, models, and deployments are never deleted or modified in place.