See Querying text models for a general guide on the API and its options.

Using the API

Both the completions API and the chat completions API are supported. However, we recommend using the chat completions API whenever possible to avoid common prompt formatting errors. Even small errors like misplaced whitespace may result in poor model performance.

For Llama 3.2 Vision models, you should pass images before text in the content field to avoid the model refusing to answer.
You can pass images either via a URL or in base64-encoded format. Code examples for both methods are below.

Chat completions API

All vision-language models should have a conversation config and have chat completions API enabled. These models are typically tuned with specific conversation styles for which they perform best. For example, Phi-3 models use the following template:

SYSTEM: {system message}

USER: <image>
{user message}

ASSISTANT:

The <image> substring is a special token inserted into the prompt to indicate where the image should be placed.

Here is an example of calling the chat completions API.
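The following is a minimal sketch using the Python client; the prompt text is illustrative, and the model name and image URL are reused from the completions example later in this guide:

import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

response = fireworks.client.ChatCompletion.create(
  model="accounts/fireworks/models/phi-3-vision-128k-instruct",
  messages=[{
    "role": "user",
    "content": [{
      # Text part of the message
      "type": "text",
      "text": "Can you describe this image?",
    }, {
      # Image part, provided as a URL
      "type": "image_url",
      "image_url": {
        "url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
      },
    }],
  }],
)
print(response.choices[0].message.content)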

In the above example, we provide images via URLs. Alternatively, you can provide the base64-encoded string representation of the images, prefixed with their MIME type.
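Below is a sketch assuming a local file your_image.jpg; the file name and the encode_image helper are illustrative:

import base64
import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

# Illustrative helper: read a local file and return its base64 string
def encode_image(image_path: str) -> str:
  with open(image_path, "rb") as f:
    return base64.b64encode(f.read()).decode("utf-8")

image_base64 = encode_image("your_image.jpg")

response = fireworks.client.ChatCompletion.create(
  model="accounts/fireworks/models/phi-3-vision-128k-instruct",
  messages=[{
    "role": "user",
    "content": [{
      "type": "text",
      "text": "Can you describe this image?",
    }, {
      "type": "image_url",
      "image_url": {
        # Base64 payloads are passed as a data URL with a MIME-type prefix
        "url": f"data:image/jpeg;base64,{image_base64}",
      },
    }],
  }],
)
print(response.choices[0].message.content)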

Completions API

Advanced users can also query the completions API directly. You will need to manually insert the image token <image> where appropriate and supply the images as an ordered list (this is true for the Phi-3 model, but may be subject to change for future vision-language models). For example:

import fireworks.client

fireworks.client.api_key = "<FIREWORKS_API_KEY>"

response = fireworks.client.Completion.create(
  model="accounts/fireworks/models/phi-3-vision-128k-instruct",
  # The <image> token marks where the image is inserted into the prompt
  prompt="SYSTEM: Hello\n\nUSER:<image>\ntell me about the image\n\nASSISTANT:",
  # Images are supplied as an ordered list, matched to <image> tokens in order
  images=["https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"],
)
print(response.choices[0].text)

API limitations

We currently impose the following limits on the completions API and chat completions API:

  1. A single API request may include at most 30 images in total, whether they are provided as base64 strings or URLs.
  2. Each image must be smaller than 5MB. If downloading the images takes longer than 1.5 seconds, the request is dropped and you will receive an error.

Model limitations

At the moment, we primarily offer Phi-3 vision models for serverless deployment.

Managing images

The chat completions API is not stateful, so you must manage the messages (including images) you pass to the model yourself. However, we cache image downloads where possible to reduce latency.

For long-running conversations, we suggest passing images via URLs rather than base64-encoded strings. You can also improve model latency by downsizing your images ahead of time so they are no larger than needed.
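For example, a quick way to downsize images with Pillow (the 1024px cap is an illustrative choice, not a model requirement):

from PIL import Image  # pip install pillow

# Downsize an image so its longest side is at most 1024px before uploading
def downsize(image_path: str, out_path: str, max_side: int = 1024) -> None:
  img = Image.open(image_path)
  img.thumbnail((max_side, max_side))  # preserves aspect ratio, resizes in place
  img.save(out_path)

downsize("large_photo.jpg", "small_photo.jpg")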

Calculating cost

For the Phi-3 Vision model, an image is billed as a dynamic number of tokens based on its resolution, typically between 1K and 2.5K tokens per image. For example, a prompt containing two images and 500 text tokens would be billed as roughly 2,500 to 5,500 prompt tokens. Pricing is otherwise identical to text models. For more information, please refer to our pricing page here.

FAQ

Can I fine-tune the image capabilities with VLM?

Not right now, but we are working on fine-tuning for the Phi-3 vision model, since it has become a popular choice. If you are interested, please reach out to us via Discord.

Can I use a vision-language model to generate images?

No, but we do support dedicated image generation models for this purpose.

Please give these models a try and let us know how it goes!

What type of files can I upload?

We currently support .png, .jpg, .jpeg, .gif, .bmp, .tiff and .ppm format images.

Is there a limit to the size of the image I can upload?

Currently, our API restricts the entire request to 10MB, so an image sent as a base64 string must be smaller than 10MB after encoding. If you are using URLs, each image must be smaller than 5MB.
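Note that base64 encoding inflates the payload by roughly 33%, so a raw image close to 7.5MB may exceed the cap once encoded. A quick sanity check (the file name is illustrative):

import base64
import os

IMAGE_PATH = "your_image.jpg"  # illustrative path

raw_size = os.path.getsize(IMAGE_PATH)
with open(IMAGE_PATH, "rb") as f:
  encoded_size = len(base64.b64encode(f.read()))

# base64 encodes 3 raw bytes into 4 ASCII characters (~33% inflation)
print(f"raw: {raw_size} bytes, base64: {encoded_size} bytes")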

What is the retention policy for the images I upload?

We do not persist images beyond the server lifetime; they are deleted automatically.

How do rate limits work with VLMs?

VLMs are rate-limited like all of our other LLMs; the limits depend on your rate-limiting tier. For more information, please check out Pricing.

Can VLMs understand image metadata?

No. If you have image metadata that you want the model to understand, provide it through the prompt.
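For example, one way to do this is to extract the metadata yourself and append it to the message text; a sketch using Pillow (the file name is illustrative):

from PIL import Image

img = Image.open("your_image.jpg")
exif = img.getexif()  # mapping of EXIF tag IDs to values; may be empty

# Include whatever metadata matters as plain text in the user message
user_text = f"Describe this image. Metadata: {dict(exif)}"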