Models & Inference
What factors affect the number of simultaneous requests that can be handled?
Request handling capacity depends on several factors:
- Model size and type
- Number of GPUs allocated to the deployment
- GPU type (e.g., A100, H100)
- Prompt length (number of input tokens)
- Generation length (number of output tokens)
- Deployment type (serverless vs. on-demand)
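Several of these factors interact through GPU memory: once the model weights are loaded, the remaining memory holds the KV cache, and each in-flight request consumes cache proportional to its prompt plus generated tokens. The sketch below is a rough back-of-the-envelope estimate, not a guarantee; the function name and all example values (layer count, KV heads, head dim, memory figures) are illustrative assumptions, and real serving stacks add overhead this ignores.

```python
def max_concurrent_requests(
    num_gpus: int,
    gpu_mem_gib: float,        # per-GPU memory, e.g. 80 for an A100/H100
    weight_mem_gib: float,     # memory used by model weights across all GPUs
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    prompt_tokens: int,
    gen_tokens: int,
    bytes_per_elem: int = 2,   # fp16/bf16 KV cache
) -> int:
    """Rough estimate of concurrent requests from KV-cache memory alone."""
    # KV cache bytes per token: a key and a value vector in every layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    # Each request holds cache for its prompt plus everything it generates.
    per_request = (prompt_tokens + gen_tokens) * per_token
    available = (num_gpus * gpu_mem_gib - weight_mem_gib) * 1024**3
    return max(0, int(available // per_request))

# Illustrative numbers: a 70B-class model (80 layers, 8 KV heads,
# head_dim 128) with ~140 GiB of fp16 weights on 4 x 80 GiB GPUs,
# 2000-token prompts and 500-token generations.
print(max_concurrent_requests(4, 80, 140, 80, 8, 128, 2000, 500))  # → 235
```

Doubling the prompt or generation length roughly halves the estimate, and a larger model both raises `weight_mem_gib` and (usually) the per-token cache cost, which is why each bullet above moves capacity up or down.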