• For most models, the maximum completion token limit is the model's full context window, e.g. 128K for DeepSeek R1.
  • max_tokens defaults to 2K; set it to a higher value if you expect long generations.
  • Llama 3.1 405B has a 4096-token completion limit. Setting a higher max_tokens in API calls will not override this limit.
  • When a response hits a max token limit, it will include "finish_reason": "length", as in the example below.

Example API Response at Limit:

{
    "finish_reason": "length",
    "usage": {
        "completion_tokens": 4096,
        "prompt_tokens": 4206,
        "total_tokens": 8302
    }
}
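
A minimal sketch of setting max_tokens and detecting truncation in code, assuming an OpenAI-compatible chat completions endpoint; the base_url, api_key, and model name below are placeholders, not confirmed values:

from openai import OpenAI

# Placeholder endpoint and credentials; substitute your provider's values.
client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="llama-3.1-405b",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize this long document ..."}],
    max_tokens=4096,  # raise from the 2K default if you expect long generations
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # The completion was cut off at the max token limit.
    print("Output truncated at", response.usage.completion_tokens, "completion tokens")
else:
    print(choice.message.content)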