For example, a Llama 2-style conversation template renders a single user message as:

<s>[INST] <<SYS>> <</SYS>> user_message_1 [/INST]

Some templates can support multiple chat messages as well. In general, we recommend using the chat completions API whenever possible to avoid common prompt formatting errors. Even small errors, like misplaced whitespace, may result in poor model performance. Here are some examples of calling the chat completions API.
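The sketch below is a minimal example, assuming an OpenAI-compatible Python client; the base_url, API key, and model name are placeholders rather than values from this documentation.

```python
# Minimal chat completions call against an OpenAI-compatible endpoint.
# base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/inference/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="accounts/example/models/example-model",
    messages=[{"role": "user", "content": "Say this is a test."}],
)
print(response.choices[0].message.content)
```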
Many conversation styles come with a default system prompt, for example:

You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

For styles that support a system prompt, you may override this prompt by setting the first message with the role system, as in the sketch below. To omit the system prompt entirely, set its content to the empty string.
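The following sketch shows that override, again assuming an OpenAI-compatible Python client with placeholder names and URLs.

```python
# Override the default system prompt by making the first message use the
# "system" role; an empty content string omits the system prompt entirely.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/inference/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="accounts/example/models/example-model",
    messages=[
        {"role": "system", "content": "You are a pirate."},
        # {"role": "system", "content": ""} would omit the system prompt entirely.
        {"role": "user", "content": "Hello, who are you?"},
    ],
)
print(response.choices[0].message.content)
```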
The process of generating a conversation-formatted prompt depends on the conversation style used. To verify the exact prompt used, turn on echo.

Note that the conversation style typically adds special tokens (such as the beginning-of-sequence token <s>) to your prompt input. You can always double-check by passing raw_output and inspecting the resulting prompt_token_ids. The response also includes a usage field containing the token counts for the prompt and the generated completion.
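A quick way to double-check both is sketched below; it assumes the REST endpoint accepts the raw_output flag described later in this section, and the URL, key, and model name are placeholders.

```python
# Inspect the tokenized prompt (e.g. to confirm the BOS token) and usage counts.
import requests

resp = requests.post(
    "https://example.com/inference/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "accounts/example/models/example-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 1,
        "raw_output": True,
    },
).json()

print(resp["raw_output"]["prompt_token_ids"])  # fully tokenized prompt, including special tokens
print(resp["usage"])                           # prompt/completion token counts
```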
To cancel a streaming request before it finishes, call the .close() method on the returned generator object. HTTP clients in other languages typically offer similar functionality to close the connection.
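A sketch of cancelling a stream early, assuming an OpenAI-compatible Python client whose streaming response exposes a close() method; names and URLs are placeholders.

```python
# Stop a streaming chat completion after a few chunks.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/inference/v1", api_key="YOUR_API_KEY")

stream = client.chat.completions.create(
    model="accounts/example/models/example-model",
    messages=[{"role": "user", "content": "Write a long story about a robot."}],
    stream=True,
)

collected = []
for i, chunk in enumerate(stream):
    if chunk.choices and chunk.choices[0].delta.content:
        collected.append(chunk.choices[0].delta.content)
    if i >= 20:          # decide we have enough output
        stream.close()   # closes the underlying HTTP connection
        break

print("".join(collected))
```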
To generate multiple completions for the same prompt, set the n parameter to the number of desired choices. The returned choices array will contain the result of each generation.
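For example, the sketch below (with placeholder client setup) requests three choices and prints each one.

```python
# Request several independent generations with the n parameter.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/inference/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="accounts/example/models/example-model",
    messages=[{"role": "user", "content": "Suggest a name for a coffee shop."}],
    n=3,  # ask for three choices
)

for choice in response.choices:
    print(choice.index, choice.message.content)
```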
max_tokens or max_completion_tokens defines the maximum number of tokens the model can generate, with a default of 2000. If the combined token count (prompt + output) exceeds the model's context limit, the number of generated tokens is automatically reduced to fit within the allowed context.
top_p restricts sampling to the smallest set of tokens whose cumulative probability reaches the top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
min_p specifies a probability threshold that controls which tokens can be selected during generation. Tokens with probabilities lower than this threshold are excluded, making the model more focused on higher-probability tokens. The default value varies; a lower value allows more variety, while a higher value produces more predictable, focused output.
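The sketch below combines these sampling controls in one request; min_p is passed via extra_body since it is not part of the standard OpenAI client signature, and all names and URLs are placeholders.

```python
# Combine max_tokens, top_p, and min_p in a single request.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/inference/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="accounts/example/models/example-model",
    messages=[{"role": "user", "content": "Write a haiku about autumn."}],
    max_tokens=100,              # cap on generated tokens (default is 2000)
    top_p=0.9,                   # sample only from the top 90% probability mass
    extra_body={"min_p": 0.05},  # exclude tokens below the probability threshold
)
print(response.choices[0].message.content)
```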
frequency_penalty and presence_penalty adjust the logits of tokens that have already been generated:

logits[j] -= c[j] * frequency_penalty + (c[j] > 0 ? 1 : 0) * presence_penalty

where

- logits[j] is the logit of the j-th token
- c[j] is how often that token was sampled before the current position

repetition_penalty modifies the logit (raw model output) for repeated tokens. If a token has already appeared in the prompt or output, the penalty is applied to its probability of being selected again.
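The Python sketch below illustrates how these penalties act on logits. It is not the server's actual implementation; the repetition_penalty rule shown (divide positive logits, multiply negative ones) is the common convention from the CTRL paper and Hugging Face, used here only for illustration.

```python
# Illustrative penalty application; logits are indexed by token id.
from collections import Counter
from typing import List

def apply_penalties(
    logits: List[float],
    previous_tokens: List[int],
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
    repetition_penalty: float = 1.0,
) -> List[float]:
    counts = Counter(previous_tokens)  # c[j]: how often token j was sampled so far
    out = list(logits)
    for j, c in counts.items():
        # logits[j] -= c[j] * frequency_penalty + (c[j] > 0 ? 1 : 0) * presence_penalty
        out[j] -= c * frequency_penalty + (1.0 if c > 0 else 0.0) * presence_penalty
        # Common repetition_penalty convention: values > 1.0 always push
        # repeated tokens toward lower probability.
        if repetition_penalty != 1.0:
            out[j] = out[j] / repetition_penalty if out[j] > 0 else out[j] * repetition_penalty
    return out
```

For instance, apply_penalties([2.0, -1.0, 0.5], [0, 0, 2], frequency_penalty=0.5, presence_penalty=0.5) lowers the logits of tokens 0 and 2 to 0.5 and -0.5 respectively, while token 1 is untouched.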
Key differences to keep in mind:
- frequency_penalty: Works on how often a word has been used, increasing the penalty for more frequently used words. OAI compatible.
- presence_penalty: Penalizes words once they appear, regardless of frequency. OAI compatible.
- repetition_penalty: Adjusts the likelihood of repeated tokens based on previous appearances, providing an exponential scaling effect to control repetition more precisely, including repetition from the prompt.
- mirostat_target: Sets the desired level of unpredictability (perplexity) for the Mirostat algorithm. A higher target results in more diverse output, while a lower target keeps the text more predictable.
- mirostat_lr: Controls how quickly the Mirostat algorithm adjusts token probabilities to reach the target perplexity. A lower learning rate makes the adjustments slower and more gradual, while a higher rate speeds up the corrections.

The logit bias parameter accepts a Dict[int, float] that maps a token_id to a logits bias value between -200.0 and 200.0. For example, applying a strong negative bias to the end-of-sequence token keeps the model generating until it reaches max_tokens, as in the sketch below. Note that the quality of the output may degrade as we override the model's decision to generate the EOS token.
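In the sketch that follows, the field name logit_bias and the EOS token id (2, as in Llama 2 tokenizers) are assumptions, and the URL, key, and model name are placeholders.

```python
# Strongly discourage the EOS token so generation runs until max_tokens.
import requests

resp = requests.post(
    "https://example.com/inference/v1/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "accounts/example/models/example-model",
        "prompt": "Once upon a time",
        "max_tokens": 64,
        "logit_bias": {2: -200.0},  # assumed EOS token id -> strong negative bias
    },
).json()
print(resp["choices"][0]["text"])
```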
The logprobs parameter determines how many token probabilities are returned. If set to N, it will return log (base e) probabilities for N+1 tokens: the chosen token plus the N most likely alternative tokens.
The log probabilities are returned in a LogProbs object for each choice:

- tokens contains each token of the chosen result.
- token_ids contains the integer IDs of each token of the chosen result.
- token_logprobs contains the logprobs of each chosen token.
- top_logprobs is a list whose length is the number of tokens of the output. Each element is a dictionary of size logprobs, mapping the most likely tokens at the given position to their respective log probabilities.

Together with echo, this option can be set to see how the model tokenized your input.
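A sketch of reading these fields from the completions API; the field names come from the LogProbs description above, while the endpoint URL, key, and model name are placeholders.

```python
# Request 3 alternative logprobs per position and walk the LogProbs fields.
import requests

resp = requests.post(
    "https://example.com/inference/v1/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "accounts/example/models/example-model",
        "prompt": "The capital of France is",
        "max_tokens": 5,
        "logprobs": 3,  # chosen token plus the 3 most likely alternatives
    },
).json()

lp = resp["choices"][0]["logprobs"]
for token, token_id, logprob, alternatives in zip(
    lp["tokens"], lp["token_ids"], lp["token_logprobs"], lp["top_logprobs"]
):
    print(token, token_id, logprob, alternatives)
```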
Setting the top_logprobs parameter to an integer value in conjunction with logprobs=True will also return the above information, but in an OpenAI client-compatible format.
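For the OpenAI-compatible form, a sketch with the standard client (placeholder names and URLs):

```python
# logprobs=True with top_logprobs returns OpenAI-style per-token log probabilities.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/inference/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="accounts/example/models/example-model",
    messages=[{"role": "user", "content": "Name a primary color."}],
    logprobs=True,
    top_logprobs=3,
)

for item in response.choices[0].logprobs.content:
    print(item.token, item.logprob, [(t.token, t.logprob) for t in item.top_logprobs])
```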
Setting the echo parameter to true will cause the API to return the prompt along with the generated text. This can be used in conjunction with the chat completions API to verify the prompt template used. It can also be used together with logprobs to see how the model tokenized your input.
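A sketch of verifying the rendered prompt template with echo; it is passed via extra_body since it is not part of the standard OpenAI chat client signature, and all names and URLs are placeholders.

```python
# Echo the formatted prompt back to check the conversation template.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/inference/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="accounts/example/models/example-model",
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Hi!"},
    ],
    max_tokens=16,
    extra_body={"echo": True},
)

# With echo enabled, the returned text includes the fully formatted prompt.
print(response.choices[0].message.content)
```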
Setting the raw_output parameter to true will cause the API to return a raw_output object in the response containing additional debugging information about the raw prompt and completion as seen and produced by the model:
- prompt_fragments - Pieces of the prompt (like individual messages) before truncation and concatenation.
- prompt_token_ids - Fully tokenized prompt as seen by the model.
- completion - Raw completion produced by the model before any tool calls are parsed.
- completion_logprobs - Log probabilities for the completion. Only populated if logprobs is specified in the request.

Token usage information for each request is reported in the usage field of the API response.
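A sketch of inspecting these raw_output fields, using the field names listed above; the endpoint URL, key, and model name are placeholders.

```python
# Debug the exact prompt and raw completion seen by the model.
import requests

resp = requests.post(
    "https://example.com/inference/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "accounts/example/models/example-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 8,
        "raw_output": True,
        "logprobs": 1,  # also populate completion_logprobs
    },
).json()

raw = resp["raw_output"]
print(raw["prompt_fragments"])      # prompt pieces before truncation/concatenation
print(raw["completion"])            # raw completion before tool-call parsing
print(raw["completion_logprobs"])   # present because logprobs was specified
print(resp["usage"])                # token usage for the request
```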