Current capabilities include:

  • Load balancing: Yes, supported out of the box
  • Continuous batching: Yes, supported
  • Batch inference: Not currently supported (on the roadmap)
    • Note: For batch use cases, we recommend sending multiple parallel HTTP requests to the deployment while capping the number of in-flight requests at a fixed concurrency level.
  • Streaming: Yes, supported
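
The workaround noted under batch inference (parallel HTTP requests at a fixed concurrency) could look roughly like the sketch below. It is only an illustration using the Python standard library: the endpoint URL, the JSON payload shape, and the names send_request and run_batch are all assumptions, not part of the product's API, so substitute whatever your deployment actually expects.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical endpoint; replace with your deployment's actual URL.
ENDPOINT_URL = "https://example.com/v1/completions"


def send_request(prompt, url=ENDPOINT_URL, timeout=60):
    """POST one prompt as JSON and return the decoded JSON response.

    The {"prompt": ...} payload shape is an assumption for illustration.
    """
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_batch(prompts, send=send_request, max_concurrency=8):
    """Send every prompt, keeping at most max_concurrency requests in flight.

    Results come back in the same order as the input prompts.
    """
    # The pool size is the concurrency cap: a new request starts only
    # when one of the max_concurrency worker threads becomes free.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(send, prompts))
```

Calling run_batch(prompts, max_concurrency=8) would then process the whole list with at most eight requests outstanding at once. Note that in this sketch the first failed request raises out of run_batch; a real batch job would likely want per-request retries or error capture.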