In cases where large parts of the LLM output are known in advance, e.g. when editing or rewriting a document or code snippet, you can improve output generation speed with Predicted Outputs. Predicted Outputs let you provide a strong “guess” of what the output will look like.

To use Predicted Outputs, set the prediction field in the Fireworks API request to the predicted output. For example, you may want to edit a survey to add an option to contact users by text message:

{
  "questions": [
    {
      "question": "Name",
      "type": "text"
    },
    {
      "question": "Age",
      "type": "number"
    },
    {
      "question": "Feedback",
      "type": "text_area"
    },
    {
      "question": "How to Contact",
      "type": "multiple_choice",
      "options": ["Email", "Phone"],
      "optional": true
    }
  ]
}

In this case, we expect most of the code to remain the same, so we set the prediction field to the original survey code. With the prediction supplied, output generation is faster.

Python (Fireworks)
from fireworks.client import Fireworks

# The original survey JSON; most of it should appear unchanged in the edited output.
code = """{
  "questions": [
    {
      "question": "Name",
      "type": "text"
    },
    {
      "question": "Age",
      "type": "number"
    },
    {
      "question": "Feedback",
      "type": "text_area"
    },
    {
      "question": "How to Contact",
      "type": "multiple_choice",
      "options": ["Email", "Phone"],
      "optional": true
    }
  ]
}
"""

client = Fireworks(api_key="<FIREWORKS_API_KEY>")

response = client.chat.completions.create(
  model="accounts/fireworks/models/llama-v3p1-70b-instruct",
  messages=[
    {
      "role": "user",
      "content": "Edit the How to Contact question to add an option called Text Message. Output the full edited code, with no markdown or explanations.",
    },
    {
      "role": "user",
      "content": code
    }
  ],
  temperature=0,
  # Pass the original code as the prediction; most of it will be reproduced verbatim.
  prediction={"type": "content", "content": code}
)

print(response.choices[0].message.content)
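
Because only the options list changes, the prediction matches nearly all of the generated tokens, which is where the speedup comes from. With temperature=0, the only part of the output that should differ from the prediction is the updated question, roughly:

    {
      "question": "How to Contact",
      "type": "multiple_choice",
      "options": ["Email", "Phone", "Text Message"],
      "optional": true
    }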

Additional information on Predicted Outputs:

  • Using Predicted Outputs is currently free.
  • For most intended use cases of Predicted Outputs, we recommend setting temperature=0 for best results. In these cases, using Predicted Outputs does not affect the quality of the generated output.
  • If the prediction is substantially different from the generated output, output generation speed may decrease.
  • The maximum length of the prediction field is governed by max_tokens, which defaults to 2048; increase max_tokens if your input and prediction are longer.
  • If you are using an on-demand deployment, you can set rewrite_speculation=True to potentially get even faster output generation (see the sketch after this list). We are working on rolling this out to Serverless soon.
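
To illustrate the last two points, here is a minimal sketch that raises max_tokens to cover a longer prediction and passes rewrite_speculation=True. It assumes the model is served on an on-demand deployment and that rewrite_speculation is accepted as a request parameter in the same way as prediction; the snippet being edited is a made-up placeholder.

Python (Fireworks)
from fireworks.client import Fireworks

# Hypothetical snippet to edit; in practice this would be a longer document or file.
code = "def add(a, b):\n    return a + b\n"

client = Fireworks(api_key="<FIREWORKS_API_KEY>")

response = client.chat.completions.create(
  model="accounts/fireworks/models/llama-v3p1-70b-instruct",  # assumes this model runs on your on-demand deployment
  messages=[
    {
      "role": "user",
      "content": "Rename the function add to add_numbers. Output only the edited code."
    },
    {
      "role": "user",
      "content": code
    }
  ],
  temperature=0,
  max_tokens=4096,  # raise above the 2048 default when the input and prediction are longer
  prediction={"type": "content", "content": code},
  rewrite_speculation=True  # assumption: sent as a request parameter; on-demand deployments only
)

print(response.choices[0].message.content)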