Speculative Decoding

Configuration options
Recommended draft models
Examples

Speed up text generation by using a smaller “draft” model to assist the main model, or using n-gram based speculation.

Speculative decoding may slow down output generation if the draft model is not a good speculator, or if token count/speculation length is too high or too low. It may also reduce max throughput. Test different models and speculation lengths for your use case.

Configuration options

Flag	Type	Description
`--draft-model`	string	Draft model name. Can be a Fireworks model or custom model. See recommendations below.
`--draft-token-count`	int32	Tokens to generate per step. Required when using draft model or n-gram. Typically set to 4.
`--ngram-speculation-length`	int32	Alternative to draft model: uses N-gram based speculation from previous input.

--draft-model and --ngram-speculation-length cannot be used together.

Recommended draft models

Draft model	Use with
`accounts/fireworks/models/llama-v3p2-1b-instruct`	All Llama models > 3B
`accounts/fireworks/models/qwen2p5-0p5b-instruct`	All Qwen models > 3B

Examples

Draft model
N-gram speculation

Use a smaller model to speed up generation:

firectl create deployment accounts/fireworks/models/llama-v3p3-70b-instruct \
  --draft-model="accounts/fireworks/models/llama-v3p2-1b-instruct" \
  --draft-token-count=4

Fireworks also supports Predicted Outputs which works in addition to model-based speculative decoding.

Quantization

Regions

⌘I

Get Started

Deployments

Models & Inference

Fine Tuning

Administration

Security & Compliance

Integrations

Configuration options

Recommended draft models

Examples

Get Started

Deployments

Models & Inference

Fine Tuning

Administration

Security & Compliance

Integrations

​Configuration options

​Recommended draft models

​Examples

Configuration options

Recommended draft models

Examples