Model quantization

Q: What quantization format is used for the Llama 3.1 405B model? The Llama 3.1 405B model uses the FP8 quantization format. Note: BF16 precision will be available soon for on-demand deployments.
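
For intuition on why the precision choice matters, here is a back-of-the-envelope estimate of weight memory at each precision (weights only; serving also needs memory for the KV cache and activations):

```python
# Weight memory for a 405B-parameter model, weights only (no KV cache or
# activation overhead). FP8 stores 1 byte per parameter, BF16 stores 2.
params = 405e9
print(f"FP8:  ~{params * 1 / 1e9:.0f} GB")   # ~405 GB
print(f"BF16: ~{params * 2 / 1e9:.0f} GB")   # ~810 GB
```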

API capabilities

Q: Does the API support batching and load balancing? Current capabilities include:
  • Load balancing: Yes, supported out of the box
  • Continuous batching: Yes, supported
  • Batch inference: Not currently supported (on the roadmap)
    • Note: For batch use cases, we recommend sending multiple parallel HTTP requests to the deployment while maintaining a fixed level of concurrency; see the sketch after this list.
  • Streaming: Yes, supported
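
For the batch workaround above, here is a minimal sketch of a fixed-concurrency client in Python. It assumes an OpenAI-style chat completions endpoint; the URL, API key, and model name are placeholders to replace with your deployment's values.

```python
import concurrent.futures

import requests

# All values below are placeholders, not the provider's real endpoint or model id.
API_URL = "https://YOUR_DEPLOYMENT_HOST/v1/chat/completions"
API_KEY = "YOUR_API_KEY"
CONCURRENCY = 8  # fixed number of requests kept in flight

def send_request(prompt: str) -> str:
    """Send one completion request and return the generated text."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "YOUR_MODEL_NAME",  # placeholder
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

prompts = [f"Summarize item {i}" for i in range(100)]

# A thread pool with max_workers=CONCURRENCY keeps at most that many requests
# in flight; a new request starts as soon as a previous one completes, which
# keeps the deployment's continuous batching busy without overloading it.
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(send_request, prompts))
```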

Request handling

Q: What factors affect the number of simultaneous requests that can be handled? Request handling capacity depends on several factors:
  • Model size and type
  • Number of GPUs allocated to the deployment
  • GPU type (e.g., A100, H100)
  • Prompt size
  • Generation token length
  • Deployment type (serverless vs. on-demand)

Additional information

If you experience any issues with the features described above, you can: