See Querying text models for a general guide on the API and its options.

Using the API

Both the completions API and the chat completions API are supported. However, we recommend using the chat completions API whenever possible to avoid common prompt formatting errors. Even small errors like misplaced whitespace can result in poor model performance.

You can pass images either as URLs or in base64-encoded format. Code examples for both methods are below.
For Llama 3.2 Vision models, you should pass images before text in the content field, to avoid the model refusing to answer.
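As an illustration of this ordering, the content list for a Llama 3.2 Vision model would place the image part before the text part (the URL below is a placeholder):

```python
# For Llama 3.2 Vision models, put the image part before the text part
# in the "content" list. The URL here is a placeholder.
messages = [{
    "role": "user",
    "content": [
        {
            "type": "image_url",  # image comes first
            "image_url": {"url": "https://example.com/photo.jpg"},
        },
        {
            "type": "text",  # text follows the image
            "text": "Can you describe this image?",
        },
    ],
}]
```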

Chat completions API

All vision-language models have a conversation config and support the chat completions API. These models are typically tuned with specific conversation styles for which they perform best. For example, Phi-3 models use the following template:

SYSTEM: {system message}

USER: <image>
{user message}

ASSISTANT:

The <image> substring is a special token inserted into the prompt to tell the model where the image appears.
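As a rough sketch, a single system/user turn could be flattened into this template as follows. The server performs this formatting for you when you use the chat completions API, and its exact formatting may differ from this illustration:

```python
def render_phi3_prompt(system_message, user_message, num_images=1):
    # Illustrative sketch only: flattens one system/user turn into the
    # Phi-3 template above, inserting one <image> token per image.
    # The real server-side formatting may differ in details.
    image_tokens = "<image>\n" * num_images
    return (
        f"SYSTEM: {system_message}\n\n"
        f"USER: {image_tokens}{user_message}\n\n"
        f"ASSISTANT:"
    )

prompt = render_phi3_prompt("Hello", "tell me about the image")
```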

Here are some examples of calling the chat completions API:

import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

response = fireworks.client.ChatCompletion.create(
  model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
  messages = [{
    "role": "user",
    "content": [{
      "type": "text",
      "text": "Can you describe this image?",
    }, {
      "type": "image_url",
      "image_url": {
        "url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
      },
    }, ],
  }],
)
print(response.choices[0].message.content)

In the above example, images are provided by URL. Alternatively, you can provide the base64 encoding of each image as a string, prefixed with its MIME type. For example:

import fireworks.client
import base64

# Helper function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# The path to your image
image_path = "your_image.jpg"

# The base64 string of the image
image_base64 = encode_image(image_path)

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

response = fireworks.client.ChatCompletion.create(
  model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
  messages = [{
    "role": "user",
    "content": [{
      "type": "text",
      "text": "Can you describe this image?",
    }, {
      "type": "image_url",
      "image_url": {
        "url": f"data:image/jpeg;base64,{image_base64}"
      },
    }, ],
  }],
)
print(response.choices[0].message.content)

Completions API

Advanced users can also query the completions API directly. You will need to manually insert the image token <image> where appropriate and supply the images as an ordered list (this is true for the Phi-3 model, but may change for future vision-language models). For example:

import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

response = fireworks.client.Completion.create(
  model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
  prompt = "SYSTEM: Hello\n\nUSER:<image>\ntell me about the image\n\nASSISTANT:",
  images = ["https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"],
)
print(response.choices[0].text)

Best practices

  1. The chat completions API is not stateful, so you must manage the messages (including images) you pass to the model yourself. We cache image downloads where we can to reduce latency, but images are not persisted beyond the server lifetime and are deleted automatically.
  2. For long-running conversations, we suggest passing images via URLs instead of base64-encoded strings. You can also improve model latency by downsizing your images ahead of time so they are no larger than they need to be.
  3. If you have image metadata that you want the model to understand, provide it as text in the prompt.
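To act on the second point, here is a minimal downsizing sketch using Pillow. The 1024-pixel cap is an arbitrary illustrative choice, not an API requirement:

```python
from PIL import Image  # Pillow: pip install Pillow

def downsize_image(input_path, output_path, max_side=1024):
    """Downscale an image so its longest side is at most max_side pixels,
    preserving aspect ratio. max_side=1024 is an illustrative default."""
    with Image.open(input_path) as img:
        # thumbnail() resizes in place and never upscales
        img.thumbnail((max_side, max_side))
        img.save(output_path)
        return img.size
```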

API limitations

We currently impose the following limits on the completions API and chat completions API:

  1. The total number of images included in a single API request cannot exceed 30, regardless of whether they are provided as base64 strings or URLs.
  2. If images are provided as base64 strings, their total size after encoding must be less than 10MB.
  3. If images are provided as URLs, then each image needs to be smaller than 5MB. If the time taken to download the images is longer than 1.5 seconds, the request will be dropped and you will receive an error.
  4. We currently support .png, .jpg, .jpeg, .gif, .bmp, .tiff and .ppm format images.

Calculating cost

An image is treated as a dynamic number of tokens based on its resolution. For one image, the number of tokens typically ranges from 1K to 2.5K. The pricing is otherwise identical to text models. For more information, please refer to our pricing page.
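As a back-of-the-envelope sketch, the per-image token range above translates into request-size estimates like this. The 1K–2.5K bounds come from this page; actual counts depend on each image's resolution:

```python
TOKENS_PER_IMAGE_MIN = 1_000   # typical lower bound per image (from this page)
TOKENS_PER_IMAGE_MAX = 2_500   # typical upper bound per image (from this page)

def image_token_range(num_images):
    """Return the (min, max) number of tokens the images alone
    may contribute to a request."""
    return (num_images * TOKENS_PER_IMAGE_MIN,
            num_images * TOKENS_PER_IMAGE_MAX)
```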