Models & Inference
Does the API support batching and load balancing?
Current capabilities include:
- Load balancing: Yes, supported out of the box
- Continuous batching: Yes, supported
- Batch inference: Not currently supported (on the roadmap)
- Note: For batch use cases, we recommend sending multiple parallel HTTP requests to the deployment while keeping concurrency at a fixed level (see the sketch after this list).
- Streaming: Yes, supported
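
Below is a minimal sketch of the parallel-request workaround for batch use cases, using a semaphore to hold concurrency at a fixed level. The endpoint URL, payload shape, auth header, and `API_KEY` environment variable are assumptions for illustration; adapt them to your deployment.

```python
# Sketch: batch-style workload via parallel requests with bounded concurrency.
# Endpoint, payload fields, and auth are hypothetical; adjust for your deployment.
import asyncio
import os

import httpx

ENDPOINT = "https://example.com/v1/chat/completions"  # hypothetical deployment URL
API_KEY = os.environ.get("API_KEY", "")                # hypothetical auth setup
MAX_CONCURRENCY = 8                                    # fixed concurrency level


async def run_batch(prompts: list[str]) -> list[dict]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async with httpx.AsyncClient(timeout=60.0) as client:

        async def send_one(prompt: str) -> dict:
            # The semaphore caps in-flight requests at MAX_CONCURRENCY,
            # letting the server's continuous batching handle the rest.
            async with semaphore:
                response = await client.post(
                    ENDPOINT,
                    headers={"Authorization": f"Bearer {API_KEY}"},
                    json={"messages": [{"role": "user", "content": prompt}]},
                )
                response.raise_for_status()
                return response.json()

        return await asyncio.gather(*(send_one(p) for p in prompts))


if __name__ == "__main__":
    results = asyncio.run(run_batch(["Hello!", "Summarize continuous batching."]))
    print(len(results), "responses received")
```

Keeping concurrency fixed (rather than firing all requests at once) avoids overwhelming the deployment while still letting continuous batching keep the servers busy.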