How can I optimize latency for single replica deployments?

Single replica deployments typically run at low concurrency, so optimizations that pay off at small batch sizes are particularly effective:

Key Optimizations

Configure draft tokens (--draft-token-count) when creating models - speculative decoding is especially effective at low batch sizes
Set a draft model (--draft-model) - test Eagle models or contact the FireOptimizer team for custom speculators
Upgrade to H200 GPUs - use --accelerator-type to specify hardware when creating deployments
Use FP8 precision - pass --precision FP8 when creating the deployment to reduce computation overhead at inference time
Optimize for your use case - contact support for deployment-specific optimizations not available through firectl flags
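
Example

As a rough illustration of how the flags above might be combined, here is a minimal Python sketch that assembles a firectl command line and prints it without running anything. Only the four flag names come from the list above. The helper build_firectl_args, the model and draft model IDs, the draft token count of 4, and the accelerator value NVIDIA_H200_141GB are assumptions for illustration. The sketch also places every flag on a single "firectl create deployment" command, which may not match your firectl version, since draft tokens are described above as a model creation setting. Check the firectl help output before using any of it.

```python
import shlex


def build_firectl_args(model_id, accelerator_type=None, precision=None,
                       draft_model=None, draft_token_count=None):
    """Assemble an argument list for a firectl deployment command.

    Hypothetical helper: it only concatenates the flags named in this FAQ
    and does not validate that firectl accepts them together.
    """
    args = ["firectl", "create", "deployment", model_id]
    if accelerator_type is not None:
        args += ["--accelerator-type", accelerator_type]
    if precision is not None:
        args += ["--precision", precision]
    if draft_model is not None:
        args += ["--draft-model", draft_model]
    if draft_token_count is not None:
        args += ["--draft-token-count", str(draft_token_count)]
    return args


# All values below are placeholders; substitute your own.
cmd = build_firectl_args(
    "accounts/my-account/models/my-model",  # hypothetical model ID
    accelerator_type="NVIDIA_H200_141GB",   # assumed identifier for H200
    precision="FP8",                        # value given in this FAQ
    draft_model="my-draft-model",           # hypothetical draft model ID
    draft_token_count=4,                    # arbitrary example value
)

# Print the command for review instead of executing it.
print(shlex.join(cmd))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere. Once the flags are confirmed against your firectl version, the same list can be passed to subprocess.run.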