VLM fine-tuning is currently supported for Qwen 2.5 VL models only.
This guide covers fine-tuning for Vision-Language Models (VLMs) that process both images and text. For fine-tuning text-only models, see our Supervised fine-tuning for text guide.
Vision-language model (VLM) fine-tuning allows you to adapt pre-trained models that can understand both text and images to your specific use cases. This is particularly valuable for tasks like document analysis, visual question answering, image captioning, and domain-specific visual understanding. This guide shows you how to fine-tune VLMs on Fireworks AI using LoRA (Low-Rank Adaptation) with datasets containing both images and text.

Supported Models

Currently, VLM fine-tuning supports:
  • Qwen 2.5 VL models (for example, accounts/fireworks/models/qwen2p5-vl-32b-instruct, which is used in the examples below)

Understanding LoRA for VLMs

LoRA significantly reduces the computational and memory requirements for fine-tuning large vision-language models. Instead of updating billions of parameters directly, LoRA learns small “adapter” layers that capture the changes needed for your specific task.

Key benefits of LoRA for VLMs:
  • Efficiency: Requires significantly less memory and compute than full fine-tuning
  • Speed: Faster training times while maintaining high-quality results
  • Flexibility: Up to 100 LoRA adaptations can run simultaneously on a dedicated deployment
  • Cost-effective: Lower training costs compared to full parameter fine-tuning
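
To see why the adapter approach is so much lighter than full fine-tuning, the sketch below shows the standard low-rank update used by LoRA (W + (alpha / r) · B·A) with example dimensions. The numbers are illustrative only and do not reflect Fireworks' internal training configuration.

# Illustrative only: the low-rank update behind LoRA, with example dimensions.
import numpy as np

d, k, r = 4096, 4096, 16            # weight matrix dimensions and LoRA rank (example values)
W = np.random.randn(d, k)           # frozen pretrained weight (never updated)
A = np.random.randn(r, k) * 0.01    # small trainable adapter
B = np.zeros((d, r))                # small trainable adapter, initialized to zero

alpha = 32                          # LoRA scaling factor (example value)
W_effective = W + (alpha / r) * (B @ A)   # weight actually used at inference

print(f"Full fine-tuning would train {W.size:,} parameters for this matrix")
print(f"LoRA trains only {A.size + B.size:,} parameters (~{100 * (A.size + B.size) / W.size:.1f}%)")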

Fine-tuning a VLM using LoRA

Step 1: Prepare your vision dataset

Vision datasets must be in JSONL format using the OpenAI-compatible chat format, where each line represents a complete training example.

Dataset Requirements:
  • Format: .jsonl file
  • Minimum examples: 3
  • Maximum examples: 3 million per dataset
  • Images: Must be base64 encoded with proper MIME type prefixes
  • Supported image formats: PNG, JPG, JPEG
Message Schema: Each training example must include a messages array where each message has:
  • role: one of system, user, or assistant
  • content: either a plain text string, or an array of text and image_url objects

Basic VLM Dataset Example

{"messages": [{"role": "system", "content": "You are a helpful visual assistant that can analyze images and answer questions about them."}, {"role": "user", "content": [{"type": "text", "text": "What objects do you see in this image?"}, {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."}}]}, {"role": "assistant", "content": "I can see a red car, a tree, and a blue house in this image."}]}

If your dataset contains image URLs

Images must be base64 encoded with MIME type prefixes. If your dataset contains image URLs, you will need to download the images and encode them to base64.
❌ Incorrect Format - This will NOT work:
{"messages": [{"role": "user", "content": [{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}, {"type": "text", "text": "What's in this image?"}]}, {"role": "assistant", "content": "I can see..."}]}
Raw HTTP/HTTPS URLs are not supported. Images must be base64 encoded.
✅ Correct Format - Use this instead:
{"messages": [{"role": "user", "content": [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."}}, {"type": "text", "text": "What's in this image?"}]}, {"role": "assistant", "content": "I can see..."}]}
Notice the data:image/jpeg;base64, prefix followed by the base64 encoded image data.
You can use a small script to convert your dataset to the correct format automatically. The example below is a minimal sketch rather than an official Fireworks utility: it downloads each http(s) image URL, encodes the image to base64, and writes a corrected JSONL file. The input and output file names are placeholders.
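
# Minimal sketch: download each http(s) image URL in a JSONL dataset and
# replace it with a base64 data URI. File names below are placeholders.
import base64
import json
import mimetypes
import urllib.request

def to_data_uri(url: str) -> str:
    # Download the image and wrap it in a data:<mime>;base64,... URI.
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
        mime = mimetypes.guess_type(url)[0] or resp.headers.get_content_type()
    return f"data:{mime};base64," + base64.b64encode(data).decode("utf-8")

with open("vlm_training_data_raw.jsonl") as src, open("vlm_training_data.jsonl", "w") as dst:
    for line in src:
        example = json.loads(line)
        for message in example["messages"]:
            content = message.get("content")
            if not isinstance(content, list):
                continue  # string-only messages need no conversion
            for part in content:
                if part.get("type") == "image_url":
                    url = part["image_url"]["url"]
                    if url.startswith(("http://", "https://")):
                        part["image_url"]["url"] = to_data_uri(url)
        dst.write(json.dumps(example) + "\n")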

Advanced Dataset Examples

Multi-image Conversation

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Compare these two images and tell me the differences"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
          }
        },
        {
          "type": "image_url", 
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The first image shows a daytime scene while the second shows the same location at night. The lighting and shadows are completely different."
    }
  ]
}

Multi-turn Conversation

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful visual assistant that can analyze images and remember details from previous images in our conversation."
    },
    {
      "role": "user", 
      "content": [
        {
          "type": "text",
          "text": "Can you describe this kitchen layout for me?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "This is a modern open-plan kitchen with white cabinets, granite countertops, and stainless steel appliances. The island is positioned in the center with bar seating for 3-4 people. There's a large window above the sink providing natural light."
    },
    {
      "role": "user",
      "content": "Now look at this living room. Do you think the styles would work well together?"
    },
    {
      "role": "assistant", 
      "content": "I'd be happy to help compare the styles! However, I don't see a living room image in your message. Could you please share the living room photo so I can analyze how well it would coordinate with the modern kitchen style we just discussed?"
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Sorry about that! Here's the living room:"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4BBBSkZJRgABAQ..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "Perfect! This living room has a complementary modern aesthetic that would work beautifully with the kitchen. Both spaces feature clean lines and a neutral color palette. The living room's contemporary furniture and the kitchen's sleek design would create a cohesive flow in an open floor plan. The warm wood accents in the living room would add nice contrast to the kitchen's cooler tones."
    }
  ]
}

Step 2: Upload your VLM dataset

Upload your prepared JSONL dataset to Fireworks for training:
firectl create dataset my-vlm-dataset /path/to/vlm_training_data.jsonl
For larger datasets (>500MB), use firectl as it handles large uploads more reliably than the web interface.
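
Before uploading, you can optionally run a quick local sanity check that each line parses and follows the expected message schema. The snippet below is a minimal sketch, not a Fireworks tool, and the file path is a placeholder.

# Quick structural check of a VLM JSONL dataset before upload (illustrative only).
import json

with open("vlm_training_data.jsonl") as f:
    for line_number, line in enumerate(f, start=1):
        example = json.loads(line)  # raises ValueError if the line is not valid JSON
        messages = example.get("messages")
        assert isinstance(messages, list) and messages, f"line {line_number}: missing 'messages' array"
        for message in messages:
            assert message.get("role") in {"system", "user", "assistant"}, \
                f"line {line_number}: unexpected role {message.get('role')!r}"
print("Dataset looks structurally valid.")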

Step 3: Launch VLM fine-tuning job

Create a supervised fine-tuning job for your VLM:
firectl create sftj \
  --base-model accounts/fireworks/models/qwen2p5-vl-32b-instruct \
  --dataset my-vlm-dataset \
  --output-model my-custom-vlm
Optional parameters:
firectl create sftj \
  --base-model accounts/fireworks/models/qwen2p5-vl-32b-instruct \
  --dataset my-vlm-dataset \
  --evaluation-dataset my-vlm-eval-dataset \
  --output-model my-custom-vlm \
  --learning-rate 2e-4 \
  --epochs 3
VLM fine-tuning jobs typically take longer than text-only models due to the additional image processing. Expect training times of several hours depending on dataset size and model complexity.

Step 4: Monitor training progress

Track your VLM fine-tuning job:
# Check job status
firectl get sftj my-custom-vlm

# View training logs
firectl get sftj my-custom-vlm --logs
Monitor key metrics:
  • Training loss: Should generally decrease over time
  • Validation loss: Monitor for overfitting if using evaluation dataset
  • Training progress: Epochs completed and estimated time remaining
Your VLM fine-tuning job is complete when the status shows COMPLETED and your custom model is ready for deployment.

Step 5: Deploy your fine-tuned VLM

Once training is complete, deploy your custom VLM:
# Create a deployment for your fine-tuned VLM
firectl create deployment my-vlm-deployment --model my-custom-vlm

# Check deployment status
firectl get deployment my-vlm-deployment

Advanced Configuration

For additional fine-tuning parameters and advanced settings like custom learning rates, batch sizes, and optimization options, see the Additional SFT job settings section in our comprehensive fine-tuning guide.

Testing Your Fine-tuned VLM

After deployment, test your fine-tuned VLM using the same API patterns as base VLMs:
from fireworks import LLM

# Use your fine-tuned model
llm = LLM(model="accounts/your-account/models/my-custom-vlm")

response = llm.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [{
            "type": "text",
            "text": "Analyze this image using your specialized training",
        }, {
            "type": "image_url",
            "image_url": {
                "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
            },
        }],
    }]
)
print(response.choices[0].message.content)
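
If you are testing with a local image file rather than an already encoded string, you can build the data URI the same way as in the dataset preparation step. Below is a minimal helper; the file name is a placeholder.

import base64

def image_file_to_data_uri(path: str, mime: str = "image/jpeg") -> str:
    # Read a local image and return it as a data:<mime>;base64,... URI.
    with open(path, "rb") as f:
        return f"data:{mime};base64," + base64.b64encode(f.read()).decode("utf-8")

# Pass the result as the image_url value in the request above.
data_uri = image_file_to_data_uri("example_photo.jpg")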