Speed up text generation by using a smaller "draft" model to assist the main model, or by using n-gram based speculation.
Speculative decoding can slow down generation if the draft model is a poor speculator, or if the draft token count or speculation length is too high or too low. It can also reduce maximum throughput. Test different draft models and speculation lengths for your use case.

Configuration options

| Flag | Type | Description |
|---|---|---|
| `--draft-model` | string | Draft model name. Can be a Fireworks model or a custom model. See recommendations below. |
| `--draft-token-count` | int32 | Tokens to generate per step. Required when using a draft model or n-gram speculation. Typically set to 4. |
| `--ngram-speculation-length` | int32 | Alternative to a draft model: uses n-gram based speculation from previous input. |
`--draft-model` and `--ngram-speculation-length` cannot be used together.
Recommended draft models:

| Draft model | Use with |
|---|---|
| `accounts/fireworks/models/llama-v3p2-1b-instruct` | All Llama models > 3B |
| `accounts/fireworks/models/qwen2p5-0p5b-instruct` | All Qwen models > 3B |
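For example, a Qwen deployment pairs with the Qwen draft model. A minimal sketch (the 72B main model name here is illustrative; substitute the model you are deploying):

firectl create deployment accounts/fireworks/models/qwen2p5-72b-instruct \
  --draft-model="accounts/fireworks/models/qwen2p5-0p5b-instruct" \
  --draft-token-count=4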

Examples

Draft model
Use a smaller model to speed up generation:
firectl create deployment accounts/fireworks/models/llama-v3p3-70b-instruct \
  --draft-model="accounts/fireworks/models/llama-v3p2-1b-instruct" \
  --draft-token-count=4
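N-gram speculation

Alternatively, speculate from n-grams in the previous input rather than a separate draft model. A minimal sketch, assuming a speculation length of 4 along with the required draft token count (tune both values for your workload):

firectl create deployment accounts/fireworks/models/llama-v3p3-70b-instruct \
  --ngram-speculation-length=4 \
  --draft-token-count=4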
Fireworks also supports Predicted Outputs, which can be used alongside model-based speculative decoding.
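For reference, a sketch of a Predicted Outputs request, assuming the OpenAI-compatible `prediction` field on the chat completions API (check the Predicted Outputs docs for the exact schema; the model and predicted content here are illustrative):

curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "accounts/fireworks/models/llama-v3p3-70b-instruct",
    "messages": [{"role": "user", "content": "Rename x to count in add_one."}],
    "prediction": {"type": "content", "content": "def add_one(count):\n    return count + 1"}
  }'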