Inference performance
Understanding model performance, quantization, and batching capabilities.
Model quantization
Q: What quantization format is used for the Llama 3.1 405B model?
The Llama 3.1 405B model uses the FP8 quantization format, which closely matches Meta's reference implementation. For more detail, see:
- The model description at fireworks.ai/models/fireworks/llama-v3p1-405b-instruct
- Our Quantization blog, which documents the general quantization methodology
Note: BF16 precision will be available soon for on-demand deployments.
API capabilities
Q: Does the API support batching and load balancing?
Current capabilities include:
- Load balancing: Yes, supported out of the box
- Continuous batching: Yes, supported
- Batch inference: Not currently supported (on the roadmap)
- Note: For batch use cases, we recommend sending multiple parallel HTTP requests to the deployment while keeping a fixed level of concurrency (see the sketch after this list)
- Streaming: Yes, supported
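As a minimal sketch of the parallel-request pattern recommended above, the following uses the openai Python client pointed at Fireworks' OpenAI-compatible endpoint with a semaphore to cap in-flight requests. The endpoint URL, model id, and concurrency value are assumptions; adapt them to your deployment.

```python
# Client-side batching sketch: send many requests while keeping a fixed
# number in flight. Endpoint, model id, and CONCURRENCY are illustrative.
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],
)

CONCURRENCY = 8  # fixed number of simultaneous requests
semaphore = asyncio.Semaphore(CONCURRENCY)

async def complete(prompt: str) -> str:
    async with semaphore:  # at most CONCURRENCY requests run at once
        response = await client.chat.completions.create(
            model="accounts/fireworks/models/llama-v3p1-405b-instruct",  # assumed id
            messages=[{"role": "user", "content": prompt}],
            max_tokens=64,
        )
        return response.choices[0].message.content

async def main() -> None:
    prompts = [f"Write a haiku about GPU number {i}." for i in range(32)]
    results = await asyncio.gather(*(complete(p) for p in prompts))
    for result in results:
        print(result)

asyncio.run(main())
```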
Request handling
Q: What factors affect the number of simultaneous requests that can be handled?
Request handling capacity depends on several factors:
- Model size and type
- Number of GPUs allocated to the deployment
- GPU type (e.g., A100, H100)
- Prompt size
- Generation token length
- Deployment type (serverless vs. on-demand)
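Because capacity depends on the combination of these factors, the most reliable way to size a deployment is to measure it directly. The sketch below is a rough, hypothetical load test (the endpoint, model id, prompt, and request counts are assumptions) that sweeps a few concurrency levels and reports throughput for your specific deployment.

```python
# Rough load-test sketch: sweep concurrency levels and record requests/second.
# Actual capacity depends on model size, GPU count and type, prompt size,
# generation length, and deployment type; values below are placeholders.
import asyncio
import os
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],
)

async def one_request() -> None:
    await client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-405b-instruct",  # assumed id
        messages=[{"role": "user", "content": "Reply with the word 'ok'."}],
        max_tokens=4,
    )

async def measure(concurrency: int, total: int = 32) -> float:
    semaphore = asyncio.Semaphore(concurrency)

    async def limited() -> None:
        async with semaphore:
            await one_request()

    start = time.perf_counter()
    await asyncio.gather(*(limited() for _ in range(total)))
    return total / (time.perf_counter() - start)  # requests per second

async def main() -> None:
    for concurrency in (1, 4, 8, 16):
        rps = await measure(concurrency)
        print(f"concurrency={concurrency}: {rps:.2f} req/s")

asyncio.run(main())
```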
Additional information
If you experience any issues, you can:
- Contact support through Discord at discord.gg/fireworks-ai
- Reach out to your account representative (Enterprise customers)
- Email inquiries@fireworks.ai