• For most models, the maximum completion token limit is the model's full context window, e.g. 128K for DeepSeek R1.
  • max_tokens defaults to 2K; set it to a higher value if you expect long generations.
  • Llama 3.1 405B has a 4096-token completion limit. Setting a higher max_tokens in API calls will not override this limit.
  • When a response hits a max token limit, it will include "finish_reason": "length", as in the example below.

Example API Response at Limit:

{
    "finish_reason": "length",
    "usage": {
        "completion_tokens": 4096,
        "prompt_tokens": 4206,
        "total_tokens": 8302
    }
}
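
A minimal sketch of setting max_tokens and detecting truncation in code, assuming an OpenAI-compatible chat completions endpoint; the base_url, api_key, and model name below are placeholders, not confirmed values:

from openai import OpenAI

# Placeholder endpoint and credentials; substitute your provider's values.
client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="llama-3.1-405b",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize this long document ..."}],
    max_tokens=4096,  # raise from the 2K default if you expect long generations
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # The completion was cut off at the max token limit.
    print("Output truncated at", response.usage.completion_tokens, "completion tokens")
else:
    print(choice.message.content)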