Model quantization
Q: What quantization format is used for the Llama 3.1 405B model?

The Llama 3.1 405B model uses the FP8 quantization format, which closely matches Meta's reference implementation.
- Further details are available in the model description at fireworks.ai/models/fireworks/llama-v3p1-405b-instruct
- The general quantization methodology is documented in our Quantization blog
API capabilities
Q: Does the API support batching and load balancing?

Current capabilities include:
- Load balancing: Yes, supported out of the box
- Continuous batching: Yes, supported
- Batch inference: Not currently supported (on the roadmap)
- Note: For batch use cases, we recommend sending multiple parallel HTTP requests to the deployment while maintaining a fixed level of concurrency (see the sketch after this list)
- Streaming: Yes, supported
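A minimal Python sketch of that recommended pattern: a semaphore caps how many requests are in flight at once while a batch of prompts is sent in parallel. It assumes an OpenAI-compatible chat completions endpoint and a `FIREWORKS_API_KEY` environment variable; the endpoint URL, model id, and concurrency level shown are illustrative placeholders to adjust for your own deployment.

```python
# Sketch: batch-style inference via parallel HTTP requests with a fixed
# concurrency cap. Endpoint, model id, and CONCURRENCY are assumptions;
# point them at your own deployment.
import asyncio
import os

import httpx

API_URL = "https://api.fireworks.ai/inference/v1/chat/completions"
MODEL = "accounts/fireworks/models/llama-v3p1-405b-instruct"
CONCURRENCY = 8  # fixed number of simultaneous in-flight requests

async def complete(client: httpx.AsyncClient, sem: asyncio.Semaphore, prompt: str) -> str:
    async with sem:  # blocks until a concurrency slot is free
        resp = await client.post(
            API_URL,
            headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"},
            json={
                "model": MODEL,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 256,
            },
            timeout=120.0,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

async def run_batch(prompts: list[str]) -> list[str]:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(complete(client, sem, p) for p in prompts))

if __name__ == "__main__":
    answers = asyncio.run(run_batch([f"Summarize document {i}" for i in range(32)]))
    print(answers[0])
```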
Request handling
Q: What factors affect the number of simultaneous requests that can be handled?

Request handling capacity depends on several factors (the sketch after this list shows one way to measure it empirically):
- Model size and type
- Number of GPUs allocated to the deployment
- GPU type (e.g., A100, H100)
- Prompt size
- Generation token length
- Deployment type (serverless vs. on-demand)
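Because these factors interact, the practical limit of a given deployment is easiest to find empirically. The following sketch (same illustrative endpoint and model id assumptions as above) times a fixed workload at increasing concurrency caps; the point where requests per second stops improving approximates the deployment's effective capacity.

```python
# Sketch: probe a deployment's capacity by timing the same workload at
# increasing concurrency caps. Endpoint and model id are assumptions.
import asyncio
import os
import time

import httpx

API_URL = "https://api.fireworks.ai/inference/v1/chat/completions"
MODEL = "accounts/fireworks/models/llama-v3p1-405b-instruct"

async def one_request(client: httpx.AsyncClient, sem: asyncio.Semaphore) -> None:
    async with sem:
        resp = await client.post(
            API_URL,
            headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"},
            json={
                "model": MODEL,
                "messages": [{"role": "user", "content": "Say hello."}],
                "max_tokens": 64,  # generation length is one of the capacity factors
            },
            timeout=120.0,
        )
        resp.raise_for_status()

async def throughput(total: int, concurrency: int) -> float:
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        await asyncio.gather(*(one_request(client, sem) for _ in range(total)))
    return total / (time.perf_counter() - start)  # requests per second

if __name__ == "__main__":
    for cap in (1, 2, 4, 8, 16):
        rps = asyncio.run(throughput(total=32, concurrency=cap))
        print(f"concurrency={cap}: {rps:.2f} req/s")
```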
Additional information
If you experience any issues during these processes, you can:
- Contact support through Discord at discord.gg/fireworks-ai
- Reach out to your account representative (Enterprise customers)
- Email inquiries@fireworks.ai