Learn how to fine-tune vision-language models on Fireworks AI with image and text datasets
Vision-language model (VLM) fine-tuning allows you to adapt pre-trained models that can understand both text and images to your specific use cases.
This is particularly valuable for tasks like document analysis, visual question answering, image captioning, and domain-specific visual understanding.
To see all vision models that support fine-tuning, visit the Model Library for vision models.
{ "messages": [ { "role": "system", "content": "You are a helpful visual assistant that can analyze images and answer questions about them." }, { "role": "user", "content": [ { "type": "text", "text": "What objects do you see in this image?" }, { "type": "image_url", "image_url": { "url": "..." } } ] }, { "role": "assistant", "content": "I can see a red car, a tree, and a blue house in this image." } ]}
Images must be base64 encoded with MIME type prefixes. If your dataset contains image URLs, you’ll need to download and encode them to base64.
❌ Incorrect
{
  "type": "image_url",
  "image_url": {
    // ❌ Raw HTTP/HTTPS URLs are NOT supported
    "url": "https://example.com/image.jpg"
  }
}
✅ Correct
{
  "type": "image_url",
  "image_url": {
    // ✅ Use data URI with base64 encoding
    // Format: data:image/{format};base64,{base64_encoded_data}
    "url": "..."
  }
}
You can use the following script to automatically convert your dataset to the correct format:
Python script to download and encode images to base64
Usage:
# Install required dependency
pip install requests
# Download the script
wget https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/utils/download_images_and_encode_to_b64.py
# Run the script - will output a new dataset <path_to_your_dataset>_base64.jsonl
python download_images_and_encode_to_b64.py --input_file <path_to_your_dataset.jsonl>
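If your images are already on local disk rather than behind URLs, the data URI can also be built by hand. The following is a minimal sketch, not the cookbook script: the function names `encode_image_to_data_uri` and `image_part` are hypothetical, and the MIME type is guessed from the file extension, which assumes your files are named accurately.

```python
import base64
import mimetypes
from pathlib import Path


def encode_image_to_data_uri(image_path: str) -> str:
    """Return a data URI (data:image/{format};base64,{data}) for a local image."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {image_path!r}")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part(image_path: str) -> dict:
    """Build an image_url content part in the dataset format shown above."""
    return {
        "type": "image_url",
        "image_url": {"url": encode_image_to_data_uri(image_path)},
    }
```

The returned dictionary can be placed directly in the `content` list of a user message.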
{ "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Compare these two images and tell me the differences" }, { "type": "image_url", "image_url": { "url": "..." } }, { "type": "image_url", "image_url": { "url": "..." } } ] }, { "role": "assistant", "content": "The first image shows a daytime scene while the second shows the same location at night. The lighting and shadows are completely different." } ]}
{ "messages": [ { "role": "system", "content": "You are a helpful visual assistant that can analyze images and remember details from previous images in our conversation." }, { "role": "user", "content": [ { "type": "text", "text": "Can you describe this kitchen layout for me?" }, { "type": "image_url", "image_url": { "url": "..." } } ] }, { "role": "assistant", "content": "This is a modern open-plan kitchen with white cabinets, granite countertops, and stainless steel appliances. The island is positioned in the center with bar seating for 3-4 people. There's a large window above the sink providing natural light." }, { "role": "user", "content": "Now look at this living room. Do you think the styles would work well together?" }, { "role": "assistant", "content": "I'd be happy to help compare the styles! However, I don't see a living room image in your message. Could you please share the living room photo so I can analyze how well it would coordinate with the modern kitchen style we just discussed?" }, { "role": "user", "content": [ { "type": "text", "text": "Sorry about that! Here's the living room:" }, { "type": "image_url", "image_url": { "url": "..." } } ] }, { "role": "assistant", "content": "Perfect! This living room has a complementary modern aesthetic that would work beautifully with the kitchen. Both spaces feature clean lines and a neutral color palette. The living room's contemporary furniture and the kitchen's sleek design would create a cohesive flow in an open floor plan. The warm wood accents in the living room would add nice contrast to the kitchen's cooler tones." } ]}
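Before uploading, it can help to check each JSONL line against the message format used in the examples above. The sketch below is an illustrative pre-flight check, not an official Fireworks validator: the function names are hypothetical, and the accepted roles and part types are an assumption based only on the examples in this guide.

```python
import json

# Assumption: only the roles and part types seen in the examples are accepted.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one training record (empty if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, message in enumerate(messages):
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        content = message.get("content")
        if isinstance(content, str):
            continue  # plain-text message, nothing more to check
        if not isinstance(content, list):
            problems.append(f"message {i}: content must be a string or a list of parts")
            continue
        for j, part in enumerate(content):
            kind = part.get("type")
            if kind == "text":
                if not isinstance(part.get("text"), str):
                    problems.append(f"message {i}, part {j}: 'text' must be a string")
            elif kind == "image_url":
                url = part.get("image_url", {}).get("url", "")
                # Raw HTTP/HTTPS URLs are not supported; require a base64 data URI.
                if not (url.startswith("data:image/") and ";base64," in url):
                    problems.append(f"message {i}, part {j}: image is not a base64 data URI")
            else:
                problems.append(f"message {i}, part {j}: unknown part type {kind!r}")
    return problems


def validate_jsonl(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; valid lines are omitted."""
    report = {}
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                problems = validate_record(json.loads(line))
            except json.JSONDecodeError as error:
                problems = [f"invalid JSON: {error.msg}"]
            if problems:
                report[line_number] = problems
    return report
```

An empty result from `validate_jsonl("my_dataset.jsonl")` means no line tripped these checks; it does not guarantee that the training service will accept the file.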
To get a feel for how VLM fine-tuning works, you can use an example vision dataset. This is a classification dataset that contains images of food with <think></think> tags for reasoning.
Download with curl
# Download the example dataset
curl -L -o food_reasoning.jsonl https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl
Download with wget
# Download the example dataset
wget https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl
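Because each record embeds full base64 images, printing a raw line floods the terminal. The sketch below previews the first record with image payloads truncated. It assumes the downloaded file follows the same `messages` format as the examples above; the function names and the `keep` parameter are hypothetical.

```python
import copy
import json


def shorten_image_urls(record: dict, keep: int = 40) -> dict:
    """Return a copy of the record with long image URLs truncated for display."""
    preview = copy.deepcopy(record)  # leave the original record untouched
    for message in preview.get("messages", []):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-text message, no image parts
        for part in content:
            if part.get("type") != "image_url":
                continue
            image = part.get("image_url", {})
            url = image.get("url", "")
            if len(url) > keep:
                image["url"] = f"{url[:keep]}... ({len(url)} chars)"
    return preview


def preview_first_record(path: str) -> str:
    """Pretty-print the first JSONL record with image data shortened."""
    with open(path, "r", encoding="utf-8") as handle:
        first_line = handle.readline()
    return json.dumps(shorten_image_urls(json.loads(first_line)), indent=2)
```

Calling `print(preview_first_record("food_reasoning.jsonl"))` then shows the roles, prompts, and responses of one example without the encoded image bytes.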
2. Upload your VLM dataset
Upload your prepared JSONL dataset to Fireworks for training:
For larger datasets (>500MB), use firectl as it handles large uploads more reliably than the web interface. For enhanced data control and security, we also support bring your own bucket (BYOB) configurations. See our Secure Fine Tuning guide for setup details.
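Base64 encoding grows every image by roughly a third, so a folder of raw images can cross the 500MB mark once encoded. The sketch below estimates the encoded size up front. It is a rough planning aid only: the helper names are hypothetical, the threshold is read as decimal megabytes (an assumption), and the estimate ignores the JSON text around each image.

```python
import os

# Assumption: "500MB" is interpreted as 500 * 10^6 bytes.
UPLOAD_THRESHOLD_BYTES = 500 * 1000 * 1000


def base64_length(num_bytes: int) -> int:
    """Length of the padded base64 encoding of num_bytes of raw data."""
    # Every 3 input bytes become 4 output characters, rounded up.
    return 4 * ((num_bytes + 2) // 3)


def estimate_encoded_size(image_paths: list[str]) -> int:
    """Estimated total size in bytes of the base64 payloads for these images."""
    return sum(base64_length(os.path.getsize(path)) for path in image_paths)


def suggest_upload_method(dataset_bytes: int) -> str:
    """Suggest an upload route based on the size guidance above."""
    if dataset_bytes > UPLOAD_THRESHOLD_BYTES:
        return "firectl"
    return "web interface or firectl"
```

For example, 400MB of raw images encodes to about 533MB, which `suggest_upload_method` would route to firectl.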
For additional parameters like learning rates, evaluation datasets, and batch sizes, see Additional SFT job settings.
Navigate to the Fine-tuning tab in the Fireworks console
Click “Create Fine-tuning Job”
Select your VLM base model (Qwen 2.5 VL)
Choose your uploaded dataset
Configure training parameters
Launch the job
VLM fine-tuning jobs typically take longer than text-only fine-tuning jobs because of the additional image processing. Expect training times of several hours, depending on dataset size and model complexity.
Training loss: Should generally decrease over time
Evaluation loss: Monitor for overfitting if using evaluation dataset
Training progress: Epochs completed and estimated time remaining
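One common way to read these metrics together is to flag a run when training loss keeps falling while evaluation loss has stopped improving. The sketch below implements that heuristic on loss values you have recorded yourself, for instance by noting them from the job's charts. It does not call any Fireworks API, and the function names and the `patience` parameter are hypothetical.

```python
def evaluations_since_best(eval_losses: list[float]) -> int:
    """Number of evaluations recorded after the lowest evaluation loss."""
    if not eval_losses:
        return 0
    best_index = min(range(len(eval_losses)), key=eval_losses.__getitem__)
    return len(eval_losses) - 1 - best_index


def looks_overfit(
    train_losses: list[float], eval_losses: list[float], patience: int = 2
) -> bool:
    """Heuristic: training loss is still falling, but evaluation loss has not
    reached a new low for at least `patience` evaluations."""
    if len(train_losses) <= patience or len(eval_losses) <= patience:
        return False  # not enough history to judge
    train_still_falling = train_losses[-1] < train_losses[-1 - patience]
    return train_still_falling and evaluations_since_best(eval_losses) >= patience
```

With `patience=2`, a run whose evaluation loss bottomed out two evaluations ago while training loss continued to drop is flagged. Treat the result as a prompt to inspect the curves, not as a verdict.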
Your VLM fine-tuning job is complete when the status shows COMPLETED and your custom model is ready for deployment.
5. Deploy your fine-tuned VLM
Once training is complete, deploy your custom VLM:
firectl
# Create a deployment for your fine-tuned VLM
firectl create deployment my-custom-vlm
# Check deployment status
firectl get deployment accounts/your-account/deployment/deployment-id
UI
Deploy from the UI using the Deploy dropdown in the fine-tuning job page.
For additional fine-tuning parameters and advanced settings like custom learning rates, batch sizes, and optimization options, see the Additional SFT job settings section in our comprehensive fine-tuning guide.
For a hands-on, step-by-step walkthrough of VLM fine-tuning, we’ve created two fine-tuning cookbooks that demonstrate the complete process, from dataset preparation through model deployment to evaluation.