Model quantization

Q: What quantization format is used for the Llama 3.1 405B model? The Llama 3.1 405B model uses the FP8 quantization format. Note: BF16 precision will be available soon for on-demand deployments.
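
For intuition on why the precision choice matters, here is a back-of-the-envelope estimate of weight memory at each precision (weights only; serving also needs memory for the KV cache and activations):

```python
# Weight memory for a 405B-parameter model, weights only (no KV cache or
# activation overhead). FP8 stores 1 byte per parameter, BF16 stores 2.
params = 405e9
print(f"FP8:  ~{params * 1 / 1e9:.0f} GB")   # ~405 GB
print(f"BF16: ~{params * 2 / 1e9:.0f} GB")   # ~810 GB
```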

API capabilities

Q: Does the API support batching and load balancing? Current capabilities include:
  • Load balancing: Yes, supported out of the box
  • Continuous batching: Yes, supported
  • Batch inference: Not currently supported (on the roadmap)
    • Note: For batch use cases, we recommend sending multiple parallel HTTP requests to the deployment while maintaining a fixed level of concurrency; see the sketch after this list.
  • Streaming: Yes, supported
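
For the batch workaround above, here is a minimal sketch of a fixed-concurrency client in Python. It assumes an OpenAI-style chat completions endpoint; the URL, API key, and model name are placeholders to replace with your deployment's values.

```python
import concurrent.futures

import requests

# All values below are placeholders, not the provider's real endpoint or model id.
API_URL = "https://YOUR_DEPLOYMENT_HOST/v1/chat/completions"
API_KEY = "YOUR_API_KEY"
CONCURRENCY = 8  # fixed number of requests kept in flight

def send_request(prompt: str) -> str:
    """Send one completion request and return the generated text."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "YOUR_MODEL_NAME",  # placeholder
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

prompts = [f"Summarize item {i}" for i in range(100)]

# A thread pool with max_workers=CONCURRENCY keeps at most that many requests
# in flight; a new request starts as soon as a previous one completes, which
# keeps the deployment's continuous batching busy without overloading it.
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(send_request, prompts))
```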

Request handling

Q: What factors affect the number of simultaneous requests that can be handled? Request handling capacity depends on several factors:
  • Model size and type
  • Number of GPUs allocated to the deployment
  • GPU type (e.g., A100, H100)
  • Prompt size
  • Generation token length
  • Deployment type (serverless vs. on-demand)

Additional information

If you experience any issues with the features described above, you can: