Speculative decoding can slow down generation if the draft model is a poor speculator for your traffic, or if the draft token count or speculation length is set too high or too low. It can also reduce maximum throughput. Test different draft models and speculation lengths for your use case.
## Configuration options
| Flag | Type | Description |
| --- | --- | --- |
| `--draft-model` | string | Draft model name. Can be a Fireworks model or a custom model. See recommendations below. |
| `--draft-token-count` | int32 | Number of tokens to generate per speculation step. Required when using a draft model or N-gram speculation. Typically set to 4. |
| `--ngram-speculation-length` | int32 | Alternative to a draft model: speculates by matching N-grams from the previous input. |
`--draft-model` and `--ngram-speculation-length` cannot be used together.

## Recommended draft models
| Draft model | Use with |
| --- | --- |
| `accounts/fireworks/models/llama-v3p2-1b-instruct` | All Llama models > 3B |
| `accounts/fireworks/models/qwen2p5-0p5b-instruct` | All Qwen models > 3B |
## Examples

### Draft model

Use a smaller model to speed up generation:
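A minimal sketch, assuming `firectl create deployment` is used to create the deployment; the base model shown is illustrative, so substitute your own:

```bash
# Deploy a Llama model with a 1B draft model speculating 4 tokens per step
firectl create deployment accounts/fireworks/models/llama-v3p3-70b-instruct \
  --draft-model accounts/fireworks/models/llama-v3p2-1b-instruct \
  --draft-token-count 4
```

### N-gram speculation

Use N-gram speculation when no suitable draft model is available: drafts are built by matching N-grams from the previous input instead of running a second model. A minimal sketch under the same assumptions (the speculation length of 4 is illustrative):

```bash
# Deploy with N-gram based speculation instead of a draft model
firectl create deployment accounts/fireworks/models/llama-v3p3-70b-instruct \
  --ngram-speculation-length 4 \
  --draft-token-count 4
```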
Fireworks also supports Predicted Outputs, which works in addition to model-based speculative decoding.
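As a sketch of Predicted Outputs, assuming the OpenAI-compatible chat completions endpoint accepts a `prediction` request field as described in the Predicted Outputs docs (the model name and rewrite task are illustrative):

```bash
# Supply the expected output as a prediction; tokens that match it
# can be accepted quickly instead of being generated one by one
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "accounts/fireworks/models/llama-v3p3-70b-instruct",
    "messages": [
      {"role": "user", "content": "Rename x to count in: def f(x):\n    return x + 1"}
    ],
    "prediction": {"type": "content", "content": "def f(count):\n    return count + 1"}
  }'
```

Predicted Outputs is most useful when much of the response is known in advance, such as code edits that leave most of a file unchanged.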