Querying text models

The examples in this section work out of the box for serverless models. For non-serverless models, you must first create a deployment. Refer to deploying models on-demand for guidance on creating and querying models using on-demand deployments.

Using the API

Chat completions API

Models with a conversation config have the chat completions API enabled. These models are typically tuned with specific conversation styles for which they perform best. For example, Llama chat models use the following template:

<s>[INST] <<SYS>> <</SYS>> user_message_1 [/INST]

Some templates can support multiple chat messages as well. In general, we recommend users use the chat completions API whenever possible to avoid common prompt formatting errors. Even small errors like misplaced whitespace may result in poor model performance. Here are some examples of calling the chat completions API:

from fireworks import LLM

llm = LLM(
  model="qwen3-235b-a22b",
  deployment_type="serverless"
)
response = llm.chat.completions.create(
  messages=[{
    "role": "user",
    "content": "Say this is a test",
  }],
)
print(response.choices[0].message.content)
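To illustrate why hand-formatting prompts is error-prone, here is a minimal sketch of what building the single-turn Llama 2 template by hand might look like. The function name `format_llama2_prompt` is hypothetical, and the exact whitespace follows the publicly documented Llama 2 chat format, which is an assumption about what the server renders. The chat completions API does this for you; turn on echo to verify the real prompt.

```python
def format_llama2_prompt(user_message: str, system_prompt: str = "") -> str:
    """Build a single-turn Llama 2 style chat prompt (illustrative only)."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


prompt = format_llama2_prompt("Say this is a test", "You are a pirate.")
print(prompt)
```

A single missing newline or space in a string like this changes the tokens the model sees, which is the kind of error the chat completions API avoids.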

Overriding the system prompt

A conversation style may include a default system prompt. For example, Llama 2 models use the default Llama prompt:

You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

For styles that support a system prompt, you may override this prompt by setting the first message with the role system. For example:

JSON

[
  {
  	"role": "system",
  	"content": "You are a pirate."
  },
  {
  	"role": "user",
  	"content": "Hello, what is your name?"
  }
]

To completely omit the system prompt, you can set content to the empty string. The process of generating a conversation-formatted prompt will depend on the conversation style used. To verify the exact prompt used, turn on echo.
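As a rough sketch of the override rule described above, the hypothetical helper `with_system_prompt` below replaces an existing leading system message or prepends a new one. It only manipulates the messages list; passing the result to the chat completions API is left out.

```python
from typing import Dict, List

Message = Dict[str, str]


def with_system_prompt(messages: List[Message], system_prompt: str) -> List[Message]:
    """Return a copy of messages whose first entry is the given system prompt."""
    # Drop an existing leading system message so it is replaced, not duplicated.
    if messages and messages[0]["role"] == "system":
        rest = messages[1:]
    else:
        rest = list(messages)
    return [{"role": "system", "content": system_prompt}, *rest]


history = [{"role": "user", "content": "Hello, what is your name?"}]
pirate = with_system_prompt(history, "You are a pirate.")
# An empty string omits the default system prompt entirely.
no_system = with_system_prompt(pirate, "")
print(pirate)
print(no_system)
```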

Completions API

Text models generate text based on the provided input prompt. All text models support this basic completions API. Using this API, the model successively generates new tokens until either the maximum number of output tokens has been reached or the model’s special end-of-sequence (EOS) token has been generated.
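The two stopping conditions can be pictured with a toy loop. This is only a sketch: `generate`, `toy_model`, and the token IDs are made up for illustration, and whether the EOS token itself is included in the output is an implementation detail not covered here.

```python
from typing import Callable, List

EOS_ID = 2  # hypothetical end-of-sequence token ID


def generate(next_token: Callable[[List[int]], int], prompt_ids: List[int],
             max_tokens: int, eos_id: int = EOS_ID) -> List[int]:
    """Generate tokens one at a time until max_tokens or EOS is reached."""
    output: List[int] = []
    while len(output) < max_tokens:
        token = next_token(prompt_ids + output)
        if token == eos_id:
            break  # the model decided to stop
        output.append(token)
    return output


def toy_model(context: List[int]) -> int:
    """Stand-in for a model: emits EOS once the context has 5 tokens."""
    return EOS_ID if len(context) >= 5 else len(context) + 10


stopped_by_eos = generate(toy_model, [1, 7], max_tokens=10)
stopped_by_limit = generate(toy_model, [1, 7], max_tokens=2)
print(stopped_by_eos, stopped_by_limit)
```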

Most models will automatically prepend the beginning-of-sequence (BOS) token (e.g. <s>) to your prompt input. You can always double-check by passing raw_output and inspecting the resulting prompt_token_ids.

Here are some examples of calling the completions API:

from fireworks import LLM

llm = LLM(
  model="qwen3-235b-a22b",
  deployment_type="serverless"
)
response = llm.completions.create(
  prompt="Say this is a test",
)

print(response.choices[0].text)

Querying existing dedicated deployments

If you have already created a dedicated deployment through the web app or CLI, you can query it using the same APIs. The key difference is in how you specify the model identifier.

Model identifier for dedicated deployments

When querying a dedicated deployment, you need to use a specific model identifier that includes both the model and deployment information. For detailed information about the different model identifier formats, see the Model identifier section in the on-demand deployments guide.

You can connect to your existing deployment using the Fireworks SDK:

from fireworks import LLM

# Connect to your existing deployment
llm = LLM(
    model="llama-v3p2-3b-instruct",  # The model your deployment is running
    deployment_type="on-demand",
    id="my-deployment-id",  # Your deployment ID
)

# Use OpenAI-compatible chat completions
response = llm.chat.completions.create(
    messages=[{"role": "user", "content": "Say this is a test"}]
)

print(response.choices[0].message.content)

When connecting to an existing deployment with the SDK, you don’t need to call .apply() - the deployment is already running.

Getting usage info

The returned object contains a usage field with:

The number of prompt tokens ingested
The number of completion tokens (i.e. the number of tokens generated)
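As an illustration, the sketch below turns a usage record into a rough cost estimate. The key names `prompt_tokens` and `completion_tokens` assume OpenAI-compatible naming, and the price is a made-up placeholder rather than an actual rate.

```python
def estimate_cost(usage: dict, price_per_million_tokens: float) -> float:
    """Estimate request cost from prompt and completion token counts."""
    total_tokens = usage["prompt_tokens"] + usage["completion_tokens"]
    return total_tokens * price_per_million_tokens / 1_000_000


# Hypothetical usage record and price, for illustration only.
usage = {"prompt_tokens": 12, "completion_tokens": 38}
cost = estimate_cost(usage, price_per_million_tokens=0.20)
print(f"{cost:.8f}")
```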

Advanced options

See the API reference for the completions and chat completions APIs for a detailed description of these options.

Streaming

By default, results are returned to the client once the generation is finished. Another option is to stream the results back, which is useful for chat use cases where the client can incrementally see results as each token is generated. Here is an example with the completions API:

from fireworks import LLM

llm = LLM(
  model="qwen3-235b-a22b",
  deployment_type="serverless"
)
response_generator = llm.completions.create(
  prompt="Say this is a test",
  stream=True,
)

for chunk in response_generator:
    print(chunk.choices[0].text)

and one with the chat completions API:

from fireworks import LLM

llm = LLM(
  model="qwen3-235b-a22b",
  deployment_type="serverless"
)
response_generator = llm.chat.completions.create(
  messages=[{
    "role": "user",
    "content": "Say this is a test",
  }],
  stream=True,
)
for chunk in response_generator:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
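If you also need the full text once streaming finishes, you can accumulate the deltas as they arrive. The sketch below uses stand-in chunk objects built with `SimpleNamespace` so that it runs without a network call; `collect_stream` and `fake_chunk` are hypothetical helpers, and real chunks come from the streaming call above.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def collect_stream(chunks: Iterable) -> str:
    """Concatenate the non-empty delta contents of streamed chat chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks carry no text (e.g. role-only or final chunks)
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text: Optional[str]) -> SimpleNamespace:
    """Build an object shaped like a streamed chat completion chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("This "), fake_chunk(None), fake_chunk("is a test")]
full_text = collect_stream(fake_stream)
print(full_text)
```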

Aborting requests

When using streaming, you can stop the generation process by closing the HTTP connection partway through. This will immediately halt server-side processing. For serverless models, no additional tokens will be generated or billed. For dedicated deployments, it may free up resources for other requests. The prompt is always fully processed and billed, and you cannot cancel or abort a request before the first token is generated. Non-streaming requests cannot be aborted. To abort generation when using the Fireworks SDK or OpenAI client, call the .close() method on the returned generator object. HTTP clients in other languages typically offer similar functionality to close the connection.

import time

# Assumes response_generator comes from a Fireworks SDK or OpenAI streaming
# chat completions call, as in the streaming example above
start_time = time.time()
for chunk in response_generator:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    # abort after 1 second
    if time.time() - start_time > 1:
        response_generator.close()
        break

Async mode

The Fireworks SDK also supports asynchronous mode for both completion and chat completion.

import asyncio
from fireworks import LLM

llm = LLM(
  model="qwen3-235b-a22b",
  deployment_type="serverless"
)

async def main():
    stream = await llm.completions.acreate(
        prompt="Say this is a test",
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].text, end="")

asyncio.run(main())

Predicted Outputs

See Using Predicted Outputs

Sampling options

The API auto-regressively generates text based on choosing the next token using the probability distribution over the space of tokens. For detailed information on how to implement these options, please refer to the Chat Completions or Completions API documentation.

Multiple choices

By default, the API will return a single generation choice per request. You can create multiple generations by setting the n parameter to the number of desired choices. The returned choices array will contain the result of each generation.

Max tokens

max_tokens (or max_completion_tokens) defines the maximum number of tokens the model can generate, with a default of 2000. If the combined token count (prompt + output) exceeds the model’s context limit, the API automatically reduces the number of generated tokens to fit within the allowed context.
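The clamping behavior can be sketched as simple arithmetic. The function name and the context length used here are assumptions for illustration; the actual limit depends on the model.

```python
def effective_max_tokens(prompt_tokens: int, max_tokens: int, context_length: int) -> int:
    """Return how many tokens can actually be generated within the context."""
    remaining = max(context_length - prompt_tokens, 0)
    return min(max_tokens, remaining)


# With a hypothetical 4096-token context, a 3000-token prompt leaves 1096 tokens.
print(effective_max_tokens(prompt_tokens=3000, max_tokens=2000, context_length=4096))
print(effective_max_tokens(prompt_tokens=100, max_tokens=2000, context_length=4096))
```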

Temperature

Temperature allows you to configure how much randomness you want in the generated text. A higher temperature leads to more “creative” results. On the other hand, setting a temperature of 0 will allow you to generate deterministic results which is useful for testing and debugging.
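One common way to picture temperature is as a divisor applied to the logits before the softmax. The sketch below is a generic illustration, not a description of the server's implementation; it treats a temperature of 0 as greedy selection of the most likely token.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature == 0:
        # Greedy decoding: all probability mass goes to the top token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Higher temperatures flatten the distribution, so less likely tokens are sampled more often.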

Top-p

Top-p (also called nucleus sampling) is an alternative to sampling with temperature, where the model considers only the most likely tokens whose cumulative probability reaches top_p. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
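A minimal sketch of nucleus filtering, assuming the usual formulation: sort tokens by probability, keep them until the cumulative mass reaches top_p, then renormalize. The function name is hypothetical.

```python
from typing import Dict, List


def top_p_filter(probs: List[float], top_p: float) -> Dict[int, float]:
    """Keep the smallest set of top tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized probabilities


probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.7))
print(top_p_filter(probs, 0.1))
```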

Top-k

Top-k is another sampling method where only the k most probable tokens are kept and the probability mass is redistributed among them. The value must be between 0 and 100.
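A minimal sketch of top-k filtering is shown below. The function name is hypothetical, and the sketch requires k of at least 1; how the API treats a value of 0 is not covered here.

```python
from typing import Dict, List


def top_k_filter(probs: List[float], k: int) -> Dict[int, float]:
    """Keep the k most probable tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("this sketch requires k >= 1")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}


probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))
```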

Min-p

min_p specifies a probability threshold to control which tokens can be selected during generation. Tokens with probabilities lower than this threshold are excluded, making the model more focused on higher-probability tokens. The default value varies, and setting a lower value ensures more variety, while a higher value produces more predictable, focused outputs.
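The sketch below applies min_p as an absolute probability threshold, which is how the paragraph above describes it. This is an assumption: some implementations instead scale the threshold by the probability of the most likely token. The function name is hypothetical, and the top token is always kept so the result is never empty.

```python
from typing import Dict, List


def min_p_filter(probs: List[float], min_p: float) -> Dict[int, float]:
    """Drop tokens whose probability is below min_p, then renormalize."""
    kept = [i for i, p in enumerate(probs) if p >= min_p]
    if not kept:
        # Always keep the most likely token so sampling remains possible.
        kept = [max(range(len(probs)), key=lambda i: probs[i])]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]
print(min_p_filter(probs, 0.1))  # lower threshold: more candidates survive
print(min_p_filter(probs, 0.4))  # higher threshold: more focused
```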

Repetition penalty

LLMs are sometimes prone to repeating a single character or a sentence. Using a frequency and presence penalty can reduce the likelihood of sampling repetitive sequences of tokens. They work by directly modifying the model’s logits (un-normalized log-probabilities) with an additive contribution:

logits[j] -= c[j] * frequency_penalty + (c[j] > 0 ? 1 : 0) * presence_penalty

where

logits[j] is the logits of the j-th token
c[j] is how often that token was sampled before the current position
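The frequency and presence penalty formula translates directly into code. In this sketch, `apply_penalties` is a hypothetical helper, and the logits are a plain list indexed by token ID.

```python
from collections import Counter
from typing import List


def apply_penalties(logits: List[float], sampled_ids: List[int],
                    frequency_penalty: float, presence_penalty: float) -> List[float]:
    """Subtract frequency and presence penalties from previously sampled tokens."""
    counts = Counter(sampled_ids)  # c[j]: how often token j was sampled so far
    adjusted = list(logits)
    for j, c in counts.items():
        adjusted[j] -= c * frequency_penalty + (1 if c > 0 else 0) * presence_penalty
    return adjusted


# Token 2 was sampled twice and token 0 once; token 1 is untouched.
print(apply_penalties([1.0, 2.0, 3.0], [2, 2, 0],
                      frequency_penalty=0.5, presence_penalty=0.25))
```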

The repetition_penalty modifies the logit (raw model output) for repeated tokens. If a token has already appeared in the prompt or output, the penalty is applied to its probability of being selected again. Key differences to keep in mind:

frequency_penalty: Scales with how often a token has been used, increasing the penalty for more frequent tokens. OpenAI compatible.
presence_penalty: Penalizes tokens once they appear, regardless of frequency. OpenAI compatible.
repetition_penalty: Adjusts the likelihood of repeated tokens based on previous appearances, providing an exponential scaling effect to control repetition more precisely, including from the prompt.

Mirostat (learning rate and target)

The Mirostat algorithm is a sampling method that helps keep the output’s unpredictability, or perplexity, at a set target. It adjusts token probabilities as the text is generated to balance between more diverse or more predictable results. This is useful when you need steady control over how random or focused the text output should be. There are two parameters that can be adjusted:

mirostat_target: Sets the desired level of unpredictability (perplexity) for the Mirostat algorithm. A higher target results in more diverse output, while a lower target keeps the text more predictable.
mirostat_lr: Controls how quickly the Mirostat algorithm adjusts token probabilities to reach the target perplexity. A lower learning rate makes the adjustments slower and more gradual, while a higher rate speeds up the corrections.
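To give a feel for how the two parameters interact, here is a sketch of one sampling step in the style of the published Mirostat 2.0 variant. Treating mirostat_target as the target surprise and mirostat_lr as the learning rate is an assumption, as are the function name and the choice of bits as the unit; the server's actual implementation may differ.

```python
import math
import random
from typing import List, Tuple


def mirostat_v2_step(probs: List[float], mu: float, target: float,
                     lr: float, rng: random.Random) -> Tuple[int, float]:
    """Sample one token and return (token_id, updated surprise cap mu)."""
    # Keep tokens whose surprise (-log2 p) does not exceed the current cap mu.
    kept = [i for i, p in enumerate(probs) if p > 0 and -math.log2(p) <= mu]
    if not kept:  # always keep at least the most likely token
        kept = [max(range(len(probs)), key=lambda i: probs[i])]
    total = sum(probs[i] for i in kept)
    weights = [probs[i] / total for i in kept]
    token_id = rng.choices(kept, weights=weights, k=1)[0]
    # Nudge mu toward the target based on the surprise actually observed.
    observed = -math.log2(probs[token_id] / total)
    return token_id, mu - lr * (observed - target)


rng = random.Random(0)
target = 3.0
mu = 2 * target  # a common starting point: twice the target
for _ in range(5):
    token_id, mu = mirostat_v2_step([0.6, 0.25, 0.1, 0.05], mu, target, lr=0.1, rng=rng)
print(token_id, mu)
```

A higher learning rate makes mu react faster to each sampled token; a lower one smooths the adjustment over more steps.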

Logit bias

Logit bias modifies the likelihood of specified tokens appearing. Pass in a Dict[int, float] that maps a token_id to a logit bias value between -200.0 and 200.0. For example:

python

client.completions.create(
  model="...",
  prompt="...",
  logit_bias={0: 10.0, 2: -50.0}
)
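Conceptually, each bias is added to the corresponding token's logit before sampling. The sketch below illustrates that idea on a plain list of logits; `apply_logit_bias` is a hypothetical helper, and the exact point at which the server applies the bias is an assumption.

```python
from typing import Dict, List


def apply_logit_bias(logits: List[float], logit_bias: Dict[int, float]) -> List[float]:
    """Add each bias to the logit of its token ID, enforcing the allowed range."""
    adjusted = list(logits)
    for token_id, bias in logit_bias.items():
        if not -200.0 <= bias <= 200.0:
            raise ValueError(f"bias for token {token_id} is outside [-200.0, 200.0]")
        adjusted[token_id] += bias
    return adjusted


# Boost token 0 and strongly suppress token 2.
print(apply_logit_bias([1.0, 0.5, 2.0], {0: 10.0, 2: -50.0}))
```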

Debugging options

Ignore EOS

This option controls whether the model stops when it generates the end-of-sequence (EOS) token. It is helpful primarily for performance benchmarking, to reliably generate exactly max_tokens tokens. Note that output quality may degrade because the model’s decision to emit the EOS token is overridden.

Logprobs

The logprobs parameter determines how many token probabilities are returned. If set to N, it will return log (base e) probabilities for N+1 tokens: the chosen token plus the N most likely alternative tokens. The log probabilities will be returned in a LogProbs object for each choice.

tokens contains each token of the chosen result.
token_ids contains the integer IDs of each token of the chosen result.
token_logprobs contains the logprobs of each chosen token.
top_logprobs is a list whose length is the number of tokens in the output. Each element is a dictionary of size logprobs that maps the most likely tokens at the given position to their respective log probabilities.

When used in conjunction with echo, this option can be set to see how the model tokenized your input.
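Because the values are natural-log probabilities, you can recover plain probabilities with exp and summarize a completion with its perplexity. The sketch below works on a hand-made token_logprobs list; `summarize_logprobs` is a hypothetical helper.

```python
import math
from typing import List, Tuple


def summarize_logprobs(token_logprobs: List[float]) -> Tuple[List[float], float]:
    """Return per-token probabilities and the perplexity of the sequence."""
    probabilities = [math.exp(lp) for lp in token_logprobs]
    average_logprob = sum(token_logprobs) / len(token_logprobs)
    perplexity = math.exp(-average_logprob)
    return probabilities, perplexity


# Hand-made values standing in for the token_logprobs of one choice.
token_logprobs = [math.log(0.5), math.log(0.25)]
probabilities, perplexity = summarize_logprobs(token_logprobs)
print(probabilities, perplexity)
```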

Top logprobs

Setting the top_logprobs parameter to an integer value in conjunction with logprobs=True will also return the above information but in an OpenAI client-compatible format.

Echo

Setting the echo parameter to true will cause the API to return the prompt along with the generated text. This can be used in conjunction with the chat completions API to verify the prompt template used. It can also be used in conjunction with logprobs to see how the model tokenized your input.

Raw output

This is an unstable, experimental API. It may change at any time and should not be relied upon for production use cases.

Setting the raw_output parameter to true will cause the API to return a raw_output object in the response containing additional debugging information about the raw prompt and completion as seen and produced by the model:

prompt_fragments - Pieces of the prompt (like individual messages) before truncation and concatenation.
prompt_token_ids - Fully tokenized prompt as seen by the model.
completion - Raw completion produced by the model before any tool calls are parsed.
completion_logprobs - Log probabilities for the completion. Only populated if logprobs is specified in the request.

Appendix

Tokenization

Language models read and write text in chunks called tokens. In English, a token can be as short as one character or as long as one word (e.g., a or apple), and in some languages, tokens can be even shorter than one character or even longer than one word. Different model families use different tokenizers, so the same text might be split into a different number of tokens depending on the model. This means that generation cost may vary per model even if the model size is the same. For the Llama model family, you can use this tool to estimate token counts. The actual number of tokens used in prompt and generation is returned in the usage field of the API response.
