Request handling capacity depends on several factors:

  • Model size and type
  • Number of GPUs allocated to the deployment
  • GPU type (e.g., A100, H100)
  • Prompt length (number of input tokens)
  • Number of generated output tokens
  • Deployment type (serverless vs. on-demand)
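As a rough illustration of how these factors interact, the sketch below estimates how many requests fit in GPU memory at once. It is a hypothetical back-of-envelope model, not a measurement: the function name, parameters, and all numbers are illustrative assumptions, and real capacity also depends on batching, scheduling, and compute limits.

```python
def estimate_concurrent_requests(
    gpu_memory_gib: float,            # per-GPU memory, e.g. 80 for an 80 GB A100/H100
    num_gpus: int,                    # GPUs allocated to the deployment
    model_weights_gib: float,         # memory consumed by model weights
    kv_cache_gib_per_request: float,  # KV-cache footprint for one request's
                                      # prompt plus generated tokens (assumed)
) -> int:
    """Rough count of requests that fit in memory left over after weights.

    Purely illustrative: ignores activation memory, fragmentation,
    and compute-bound throughput limits.
    """
    free_gib = gpu_memory_gib * num_gpus - model_weights_gib
    if free_gib <= 0:
        return 0  # weights alone exceed available memory
    return int(free_gib // kv_cache_gib_per_request)


# Example with made-up numbers: two 80 GB GPUs, 140 GiB of weights,
# 0.5 GiB of KV cache per request leaves room for ~40 concurrent requests.
print(estimate_concurrent_requests(80, 2, 140, 0.5))  # → 40
```

Longer prompts or longer generations raise `kv_cache_gib_per_request`, which directly shrinks the estimate; adding GPUs or choosing a smaller model raises it.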