Querying vision-language models
See Querying text models for a general guide on the API and its options.
Using the API
Both the completions API and the chat completions API are supported. However, we recommend using the chat completions API whenever possible to avoid common prompt formatting errors. Even small errors, like misplaced whitespace, may result in poor model performance.
Chat completions API
All vision-language models should have a conversation config and have the chat completions API enabled. These models are typically tuned with specific conversation styles for which they perform best. For example, Phi-3 models use a template along the following lines:
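(The exact template string may vary by model version; the sketch below is illustrative, with {prompt} standing in for the user's message.)

```
<|user|>
<image>
{prompt}<|end|>
<|assistant|>
```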
The <image> substring is a special token that we insert into the prompt to mark where the image should be placed.
Here are some examples of calling the chat completions API:
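Below is a minimal sketch in Python using an OpenAI-compatible client. The base URL, API key, and model name are placeholders; substitute the values for your own deployment.

```python
import openai

# Placeholder endpoint and credentials; replace with your deployment's values.
client = openai.OpenAI(
    base_url="https://example.com/inference/v1",
    api_key="<YOUR_API_KEY>",
)

response = client.chat.completions.create(
    model="<your-vision-model>",  # placeholder name, e.g. a Phi-3 vision deployment
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/images/cat.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```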
In the example above, images are provided by URL. Alternatively, you can provide the base64 encoding of each image as a string, prefixed with its MIME type. For example:
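The sketch below reuses the client from the previous example and encodes a local file as a base64 data URL; the file name is a placeholder.

```python
import base64

# Read a local image and encode it as base64.
with open("cat.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="<your-vision-model>",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    # Prefix the base64 string with its MIME type, as described above.
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```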
Completions API
Advanced users can also query the completions API directly. You will need to manually insert the image token <image> where appropriate and supply the images as an ordered list (this is true for the Phi-3 model, but may be subject to change for future vision-language models). For example:
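The following is a rough sketch using a raw HTTP request. The endpoint, the name of the images field, and the prompt template are assumptions here; consult the API reference for the exact request schema.

```python
import requests

payload = {
    "model": "<your-vision-model>",  # placeholder model name
    # The <image> token is placed manually inside the prompt.
    "prompt": "<|user|>\n<image>\nWhat is in this image?<|end|>\n<|assistant|>\n",
    # Images are supplied as an ordered list; the field name is an assumption.
    "images": ["https://example.com/images/cat.png"],
    "max_tokens": 256,
}

resp = requests.post(
    "https://example.com/inference/v1/completions",  # placeholder endpoint
    headers={"Authorization": "Bearer <YOUR_API_KEY>"},
    json=payload,
)
print(resp.json()["choices"][0]["text"])
```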
API limitations
Right now, we impose certain limits on the completions API and chat completions API as follows:
- The total number of images included in a single API request cannot exceed 30, regardless of whether they are provided as base64 strings or URLs.
- Each image should be smaller than 5MB. If downloading an image takes longer than 1.5 seconds, the request will be dropped and you will receive an error.
Model limitations
At the moment, we primarily offer Phi-3 vision models for serverless deployment.
Managing images
The Chat Completions API is not stateful. That means you have to manage the messages (including images) you pass to the model yourself. However, we cache image downloads as much as we can to reduce latency.
For long-running conversations, we suggest passing images via URLs instead of base64 encoded images. You can also improve latency by downsizing your images ahead of time so they are no larger than the maximum size at which they are expected to be used, as in the sketch below.
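A minimal sketch using Pillow to shrink an image before uploading; the 1024-pixel cap below is an arbitrary example value, not a documented limit.

```python
from PIL import Image

def downsize(path: str, out_path: str, max_side: int = 1024) -> None:
    # Shrink the image so its longest side is at most max_side pixels.
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # preserves aspect ratio; only shrinks
    img.save(out_path)

downsize("large_photo.png", "photo_small.png")  # placeholder file names
```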
Calculating cost
For the Phi-3 Vision model, an image is treated as a dynamic number of tokens based on its resolution; a single image typically ranges from 1K to 2.5K tokens. Pricing is otherwise identical to text models. For more information, please refer to our pricing page.
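As a rough illustration, a request containing two images and about 200 tokens of text would be billed as roughly 2.2K to 5.2K input tokens in total, depending on the images' resolution.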
FAQ
Can I fine-tune the image capabilities with VLM?
Not right now, but we are planning to work on fine-tuning for the Phi-3 vision model since it is now a more popular choice. If you are interested, please reach out to us via Discord.
Can I use a vision-language model to generate images?
No. But we support image generation models for this purpose:
Please give these models a try and let us know how it goes!
What type of files can I upload?
We currently support images in .png, .jpg, .jpeg, .gif, .bmp, .tiff, and .ppm formats.
Is there a limit to the size of the image I can upload?
Currently, our API limits the whole request to 10MB, so any image sent as a base64 string must be smaller than 10MB after encoding. If you are using URLs, each image needs to be smaller than 5MB.
What is the retention policy for the images I upload?
We do not persist images beyond the server lifetime; they are deleted automatically.
How do rate limits work with VLMs?
VLMs are rate-limited like all of our other LLMs; the limits depend on which rate-limiting tier you are on. For more information, please check out Pricing.
Can VLMs understand image metadata?
No. If you have image metadata that you want the model to understand, please provide it explicitly in the prompt.