Model quantization
Q: What quantization format is used for the Llama 3.1 405B model?
The Llama 3.1 405B model uses the FP8 quantization format.
Note: BF16 precision will be available soon for on-demand deployments.
API capabilities
Q: Does the API support batching and load balancing?
Current capabilities include:
- Load balancing: Yes, supported out of the box
- Continuous batching: Yes, supported
- Batch inference: Not currently supported (on the roadmap)
- Note: For batch use cases, we recommend sending multiple parallel HTTP requests to the deployment while maintaining a fixed level of concurrency (see the sketch after this list).
- Streaming: Yes, supported
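
The parallel-request pattern recommended above can be as simple as a thread pool with a bounded worker count. The sketch below is a minimal illustration, not part of the official API: the endpoint URL, model ID, and API key are placeholders, and it assumes a generic OpenAI-style completions request body.

```python
import concurrent.futures

import requests

# Placeholder endpoint, model ID, and key; substitute your deployment's values.
ENDPOINT = "https://api.example.com/v1/completions"
API_KEY = "YOUR_API_KEY"
MAX_CONCURRENCY = 8  # fixed number of in-flight requests

def send_request(prompt: str) -> dict:
    """Send a single completion request and return the parsed JSON body."""
    response = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "your-model-id", "prompt": prompt, "max_tokens": 128},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()

prompts = [f"Summarize document {i}" for i in range(100)]

# The executor keeps at most MAX_CONCURRENCY requests active at once,
# starting the next prompt as soon as a worker frees up.
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
    results = list(pool.map(send_request, prompts))

print(f"Received {len(results)} responses")
```

Keeping concurrency fixed (rather than firing all requests at once) gives the deployment a steady stream of work for continuous batching without overwhelming it.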
Request handling
Q: What factors affect the number of simultaneous requests that can be handled?
Request handling capacity depends on several factors:
- Model size and type
- Number of GPUs allocated to the deployment
- GPU type (e.g., A100, H100)
- Prompt size
- Generation token length
- Deployment type (serverless vs. on-demand)
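
Because these factors interact, the practical concurrency limit of a given deployment is easiest to find empirically. The following is a hypothetical load-probing sketch, reusing the placeholder endpoint and model ID from the earlier example: it ramps the number of parallel workers and reports median latency and throughput, so you can see where additional concurrency stops paying off.

```python
import concurrent.futures
import statistics
import time

import requests

ENDPOINT = "https://api.example.com/v1/completions"  # placeholder
API_KEY = "YOUR_API_KEY"                             # placeholder

def timed_request(prompt: str) -> float:
    """Return the wall-clock latency of one request, in seconds."""
    start = time.perf_counter()
    response = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "your-model-id", "prompt": prompt, "max_tokens": 64},
        timeout=120,
    )
    response.raise_for_status()
    return time.perf_counter() - start

# Probe a few concurrency levels and watch where latency starts to climb.
for concurrency in (1, 2, 4, 8, 16):
    prompts = ["Hello"] * (concurrency * 4)
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_request, prompts))
    elapsed = time.perf_counter() - start
    print(
        f"concurrency={concurrency:2d}  "
        f"median latency={statistics.median(latencies):.2f}s  "
        f"throughput={len(prompts) / elapsed:.2f} req/s"
    )
```

If throughput plateaus while median latency keeps rising, you have passed the deployment's sweet spot for your prompt and generation lengths.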
If you experience any issues during these processes, you can: