Model quantization

Q: What quantization format is used for the Llama 3.1 405B model?

The Llama 3.1 405B model uses the FP8 quantization format, which stores weights in 8-bit floating point and so halves the weight memory footprint relative to BF16.

Note: BF16 precision will be available soon for on-demand deployments.
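
A rough back-of-the-envelope calculation shows why FP8 matters at this scale (weights only; the KV cache and activations consume additional memory on top of this):

```python
# Approximate weight memory for a 405B-parameter model under each format.
params = 405e9

bytes_per_param = {"FP8": 1, "BF16": 2}
for fmt, nbytes in bytes_per_param.items():
    gib = params * nbytes / 1024**3
    print(f"{fmt}: ~{gib:.0f} GiB of weights")

# Output:
# FP8: ~377 GiB of weights
# BF16: ~754 GiB of weights
```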

API capabilities

Q: Does the API support batching and load balancing?

Current capabilities include:

  • Load balancing: Yes, supported out of the box
  • Continuous batching: Yes, supported
  • Batch inference: Not currently supported (on the roadmap)
    • Note: For batch use cases, we recommend sending multiple parallel HTTP requests to the deployment while maintaining a fixed level of concurrency (see the sketch after this list).
  • Streaming: Yes, supported
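
As a minimal client-side batching sketch in Python: the endpoint URL, model id, and API key below are placeholders, and the request/response shape assumes an OpenAI-compatible chat completions API.

```python
import concurrent.futures

import requests

API_URL = "https://example.com/v1/chat/completions"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                             # placeholder credential
MAX_CONCURRENCY = 8                                  # fixed concurrency level


def complete(prompt: str) -> str:
    """Send one chat completion request and return the generated text."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "llama-3.1-405b",  # placeholder model id
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


prompts = [f"Summarize document {i}" for i in range(100)]

# A thread pool caps in-flight requests at MAX_CONCURRENCY; as each
# request finishes, the pool immediately dispatches the next prompt,
# keeping concurrency steady without overwhelming the deployment.
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
    results = list(pool.map(complete, prompts))
```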

Request handling

Q: What factors affect the number of simultaneous requests that can be handled?

Request handling capacity depends on several factors:

  • Model size and type
  • Number of GPUs allocated to the deployment
  • GPU type (e.g., A100, H100)
  • Prompt size
  • Generation token length
  • Deployment type (serverless vs. on-demand)

Additional information

If you experience issues with any of the above, you can: