Vision-language models (VLMs) can process both text and images in a single request, enabling you to ask questions about visual content or get descriptions of images. Common use-cases include image captioning, visual question answering, document analysis, chart interpretation, OCR, and content moderation.

This guide shows you how to use Fireworks’ VLMs through our API to analyze images alongside text prompts. You can view the available VLMs by applying the vision filter on the Model Library page.

Chat completions API

Here are some examples of calling the chat completions API.

from fireworks import LLM

llm = LLM(model="qwen2p5-vl-32b-instruct", deployment_type="serverless")

response = llm.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [{
            "type": "text",
            "text": "Can you describe this image?",
        }, {
            "type": "image_url",
            "image_url": {
                "url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
            },
        }, ],
    }],
)
print(response.choices[0].message.content)

In the above example, we provide images by specifying their URLs. Alternatively, you can provide the base64-encoded image as a string, prefixed with its MIME type. For example:

from fireworks import LLM
import base64

# Helper function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# The path to your image
image_path = "your_image.jpg"

# The base64 string of the image
image_base64 = encode_image(image_path)

llm = LLM(model="qwen2p5-vl-32b-instruct", deployment_type="serverless")

response = llm.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [{
            "type": "text",
            "text": "Can you describe this image?",
        }, {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{image_base64}"
            },
        }, ],
    }],
)
print(response.choices[0].message.content)

For Llama 3.2 Vision models, pass images before text in the content field to avoid the model refusing to answer.
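
For example, a request to a Llama 3.2 Vision model might look like the sketch below. The model name shown is an assumption for illustration; check the Model Library for the exact identifier.

from fireworks import LLM

# Model name is assumed for illustration; confirm the exact ID in the Model Library.
llm = LLM(model="llama-v3p2-11b-vision-instruct", deployment_type="serverless")

response = llm.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [{
            # Image first for Llama 3.2 Vision models...
            "type": "image_url",
            "image_url": {"url": "https://example.com/photo.jpg"},
        }, {
            # ...then the text prompt.
            "type": "text",
            "text": "Can you describe this image?",
        }],
    }],
)
print(response.choices[0].message.content)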

Calculating cost

An image is treated as a dynamic number of tokens based on its resolution; one image typically counts as 1K to 2.5K tokens. Pricing is otherwise identical to text models. For more information, please refer to our pricing page.
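
As a rough illustration, the arithmetic is simply text tokens plus image tokens billed at the model's input rate. The token counts and per-token price below are hypothetical placeholders, not actual prices; see the pricing page for real rates.

# Hypothetical cost estimate for a request containing one image.
text_tokens = 200          # tokens in the text portion of the prompt
image_tokens = 1_500       # one image typically counts as roughly 1K-2.5K tokens
price_per_million = 0.90   # placeholder $ per 1M input tokens; see the pricing page

total_input_tokens = text_tokens + image_tokens
estimated_cost = total_input_tokens / 1_000_000 * price_per_million
print(f"~{total_input_tokens} input tokens, approx ${estimated_cost:.6f}")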

Prompt caching

Vision-language models support prompt caching to improve performance for requests with repeated content. Both text and image portions of your prompts can benefit from caching to reduce time to first token by up to 80%.

To optimize caching for VLMs, structure your prompts with static content like instructions at the beginning and variable content like user-specific information at the end. This allows the cached prefix to be reused across multiple requests.
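
For example, you might keep fixed instructions in a system message and append the per-request image and question at the end. This is a minimal sketch; the instruction text and helper function are illustrative, not part of the API.

from fireworks import LLM

llm = LLM(model="qwen2p5-vl-32b-instruct", deployment_type="serverless")

# Static instructions go first so the cached prefix can be reused across requests.
STATIC_INSTRUCTIONS = (
    "You are a document-analysis assistant. "
    "Answer questions about the attached image concisely."
)

def ask(image_url: str, question: str) -> str:
    response = llm.chat.completions.create(
        messages=[
            {"role": "system", "content": STATIC_INSTRUCTIONS},
            {
                "role": "user",
                # Variable, request-specific content goes last.
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            },
        ],
    )
    return response.choices[0].message.content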

LoRA support

Many vision-language models support LoRA (Low-Rank Adaptation) fine-tuning and deployment. LoRA allows you to customize VLMs for specific visual tasks while using significantly less computational resources than full fine-tuning.

You can:

  • Fine-tune VLMs with your own image-text datasets
  • Deploy up to 100 LoRA adapters on a single base model deployment
  • Use LoRA adapters on both serverless and dedicated deployments

For more information on working with LoRA models, see our guides on deploying models and understanding LoRA performance.
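
Once a fine-tuned LoRA adapter is deployed, you can address it by its model identifier just like a base model. The sketch below assumes a hypothetical adapter ID under your account; the exact identifier comes from your fine-tuning job and deployment.

from fireworks import LLM

# Hypothetical fine-tuned LoRA model identifier; replace with your own.
llm = LLM(
    model="accounts/your-account/models/your-vlm-lora-adapter",
    deployment_type="serverless",
)

response = llm.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Classify the defect shown in this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/part.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)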

Best practices

  1. The Chat Completions API is not stateful, so you must manage the messages (including images) you pass to the model yourself. We cache image downloads where possible to reduce latency, but images are not persisted beyond the server lifetime and are deleted automatically.
  2. For long-running conversations, we suggest passing images via URLs instead of base64-encoded strings. You can also improve latency by downsizing your images ahead of time so they are no larger than the maximum size the model expects (see the sketch after this list).
  3. If you have image metadata that you want the model to understand, provide it in the prompt text.
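
As referenced in item 2, here is a minimal sketch of downsizing an image with Pillow before base64-encoding it. The 1024px cap is an arbitrary example rather than a documented limit.

import base64
import io

from PIL import Image

def downsize_and_encode(image_path: str, max_side: int = 1024) -> str:
    """Downsize an image so its longest side is at most max_side, then base64-encode it."""
    with Image.open(image_path) as img:
        img.thumbnail((max_side, max_side))  # preserves aspect ratio, only shrinks
        buffer = io.BytesIO()
        img.convert("RGB").save(buffer, format="JPEG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

# Use the result as a data URL in the image_url field:
image_base64 = downsize_and_encode("your_image.jpg")
data_url = f"data:image/jpeg;base64,{image_base64}"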

Completions API

Advanced users can also query the completions API directly. You will need to manually insert the image token <image> where appropriate and supply the images as an ordered list (this is true for the Phi-3 model but may change for future vision-language models). For example:

from fireworks import LLM

llm = LLM(model="qwen2p5-vl-32b-instruct", deployment_type="serverless")

response = llm.completions.create(
  prompt="SYSTEM: Hello\n\nUSER:<image>\ntell me about the image\n\nASSISTANT:",
  images=["https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"],
)
print(response.choices[0].text)

API limitations

We currently impose the following limits on the completions and chat completions APIs; a client-side pre-flight check is sketched after the list:

  1. A single API request can include at most 30 images, whether they are provided as base64 strings or URLs.
  2. If images are provided as base64-encoded strings, their total size after encoding must be less than 10MB.
  3. If images are provided as URLs, each image must be smaller than 5MB. If downloading the images takes longer than 1.5 seconds, the request is dropped and you receive an error.
  4. We currently support .png, .jpg, .jpeg, .gif, .bmp, .tiff, and .ppm images.
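
A minimal client-side pre-flight check against these limits might look like the sketch below; the limits are enforced server-side regardless, and the helper shown is not part of the SDK.

import os

MAX_IMAGES = 30
MAX_BASE64_TOTAL_BYTES = 10 * 1024 * 1024    # 10MB total for base64-encoded images
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".ppm"}

def check_local_images(paths: list[str]) -> None:
    """Validate local images before base64-encoding them for a request."""
    if len(paths) > MAX_IMAGES:
        raise ValueError(f"Too many images: {len(paths)} > {MAX_IMAGES}")
    total = 0
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        if ext not in SUPPORTED_EXTENSIONS:
            raise ValueError(f"Unsupported image format: {path}")
        # Base64 inflates size by roughly 4/3.
        total += os.path.getsize(path) * 4 // 3
    if total > MAX_BASE64_TOTAL_BYTES:
        raise ValueError("Base64-encoded images exceed the 10MB total limit")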