Single-replica deployments typically run at low concurrency, which makes the following optimizations particularly effective:

Key Optimizations

  • Configure draft tokens (--draft-token-count) when creating models - especially effective at low batch sizes
  • Set a draft model (--draft-model) - test Eagle models or contact the FireOptimizer team for custom speculators
  • Upgrade to H200 GPUs - use --accelerator-type to specify hardware when creating deployments
  • Use FP8 precision - pass --precision FP8 when creating the deployment to reduce computation overhead
  • Optimize for your use case - contact support for deployment-specific optimizations not available through firectl flags
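
As a rough illustration of how the hardware and precision flags above might be combined, the sketch below assembles a deployment command and prints it for review rather than running it. The model path is a hypothetical placeholder, and the accelerator value NVIDIA_H200_141GB and the positional model argument are assumptions; confirm both against the firectl help output before use. The draft-model and draft-token-count flags are left out because their values depend on the model being served.

```shell
#!/bin/sh
# Sketch only: assemble a deployment command from the flags listed above.
set -eu

# Hypothetical placeholder; replace with your own model resource name.
MODEL="accounts/my-account/models/my-model"
# Assumed accelerator identifier for H200; verify with firectl help.
ACCELERATOR="NVIDIA_H200_141GB"

# Echo the command instead of executing it, so it can be checked first.
build_deployment_cmd() {
  echo "firectl create deployment $MODEL --accelerator-type $ACCELERATOR --precision FP8"
}

build_deployment_cmd
```

Once the printed command looks right, it can be pasted into a terminal where firectl is installed and authenticated.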