Deployment & Infrastructure
How can I optimize latency for single replica deployments?
Single replica deployments typically run at low concurrency, making certain optimizations particularly effective:
Key Optimizations
- Configure draft tokens (`--draft-token-count`) when creating models; this is especially effective at low batch sizes.
- Set a draft model (`--draft-model`): test Eagle models, or contact the FireOptimizer team for custom speculators.
- Upgrade to H200 GPUs: use `--accelerator-type` to specify the hardware when creating deployments.
- Use FP8 precision (`--precision FP8`): set it at deployment creation to reduce computation overhead.
- Optimize for your use case: contact support for deployment-specific optimizations that are not available through firectl flags.
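As a rough illustration of how the flags above might be combined, the sketch below assembles a `firectl create deployment` command line and prints it without running it. It assumes all four flags are accepted by that one subcommand, which the list above does not confirm (it mentions draft token settings in the context of creating models), so check `firectl create deployment --help` first. The helper name `build_create_deployment_args`, the draft token count of 4, and the angle-bracket placeholders are hypothetical and used only for illustration.

```python
import shlex
from typing import List, Optional


def build_create_deployment_args(
    model: str,
    draft_token_count: Optional[int] = None,
    draft_model: Optional[str] = None,
    accelerator_type: Optional[str] = None,
    precision: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for `firectl create deployment`.

    Hypothetical helper: only the flag names come from the list above.
    Whether every flag is valid on this subcommand is an assumption.
    """
    args = ["firectl", "create", "deployment", model]
    if draft_token_count is not None:
        args += ["--draft-token-count", str(draft_token_count)]
    if draft_model is not None:
        args += ["--draft-model", draft_model]
    if accelerator_type is not None:
        args += ["--accelerator-type", accelerator_type]
    if precision is not None:
        args += ["--precision", precision]
    return args


if __name__ == "__main__":
    # Placeholder values; replace them with real identifiers from your account.
    args = build_create_deployment_args(
        model="<MODEL_ID>",
        draft_token_count=4,  # illustrative value, not a recommendation
        draft_model="<DRAFT_MODEL_ID>",
        accelerator_type="<H200_ACCELERATOR_TYPE>",
        precision="FP8",
    )
    # Print the shell-quoted command for review instead of executing it.
    print(shlex.join(args))
```

Building the command as a list and printing it keeps the example free of side effects; once the flags are confirmed for your firectl version, the same list can be passed to `subprocess.run`.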