A deployment's request-handling capacity depends on several factors:

  • Model size and type
  • Number of GPUs allocated to the deployment
  • GPU type (e.g., A100 vs. H100)
  • Prompt size and generation token length
  • Deployment type (serverless vs. on-demand)
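The interaction of these factors can be illustrated with a back-of-the-envelope estimate: total token throughput scales with GPU count and per-GPU speed, while each request consumes prompt plus generation tokens. The sketch below is purely illustrative; the function name and all numeric values are assumptions, not measured figures for any particular model or GPU.

```python
def estimate_requests_per_second(
    tokens_per_sec_per_gpu: float,  # assumed aggregate throughput of one GPU
    num_gpus: int,                  # GPUs allocated to the deployment
    prompt_tokens: int,             # average prompt size per request
    generation_tokens: int,         # average tokens generated per request
) -> float:
    """Rough capacity: total token throughput divided by tokens per request."""
    tokens_per_request = prompt_tokens + generation_tokens
    total_throughput = tokens_per_sec_per_gpu * num_gpus
    return total_throughput / tokens_per_request

# Example: 4 GPUs at an assumed 2,000 tokens/sec each, with 500-token
# prompts and 300 generated tokens per request.
print(estimate_requests_per_second(2000, 4, 500, 300))  # 10.0
```

In practice, real capacity also depends on batching behavior, KV-cache memory limits, and whether the deployment is serverless (shared) or on-demand (dedicated), so load testing with representative traffic is the only reliable way to size a deployment.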