VLM fine-tuning is currently supported for Qwen 2.5 VL models only.
This guide covers fine-tuning for Vision-Language Models (VLMs) that process both images and text. For fine-tuning text-only models, see our Supervised fine-tuning for text guide.
Vision-language model (VLM) fine-tuning allows you to adapt pre-trained models that can understand both text and images to your specific use cases. This is particularly valuable for tasks like document analysis, visual question answering, image captioning, and domain-specific visual understanding. This guide shows you how to fine-tune VLMs on Fireworks AI using LoRA (Low-Rank Adaptation) with datasets containing both images and text.

Supported Models

Currently, VLM fine-tuning supports:
  • Qwen 2.5 VL models (for example, accounts/fireworks/models/qwen2p5-vl-32b-instruct, which is used in the examples below)

Understanding LoRA for VLMs

LoRA significantly reduces the computational and memory requirements for fine-tuning large vision-language models. Instead of updating billions of parameters directly, LoRA learns small “adapter” layers that capture the changes needed for your specific task.

Key benefits of LoRA for VLMs:
  • Efficiency: Requires significantly less memory and compute than full fine-tuning
  • Speed: Faster training times while maintaining high-quality results
  • Flexibility: Up to 100 LoRA adaptations can run simultaneously on a dedicated deployment
  • Cost-effective: Lower training costs compared to full parameter fine-tuning
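
To see why the adapter approach is so much lighter than full fine-tuning, the sketch below shows the standard low-rank update used by LoRA (W + (alpha / r) · B·A) with example dimensions. The numbers are illustrative only and do not reflect Fireworks' internal training configuration.

# Illustrative only: the low-rank update behind LoRA, with example dimensions.
import numpy as np

d, k, r = 4096, 4096, 16            # weight matrix dimensions and LoRA rank (example values)
W = np.random.randn(d, k)           # frozen pretrained weight (never updated)
A = np.random.randn(r, k) * 0.01    # small trainable adapter
B = np.zeros((d, r))                # small trainable adapter, initialized to zero

alpha = 32                          # LoRA scaling factor (example value)
W_effective = W + (alpha / r) * (B @ A)   # weight actually used at inference

print(f"Full fine-tuning would train {W.size:,} parameters for this matrix")
print(f"LoRA trains only {A.size + B.size:,} parameters (~{100 * (A.size + B.size) / W.size:.1f}%)")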

Fine-tuning a VLM using LoRA

Step 1: Prepare your vision dataset

Vision datasets must be in JSONL format using the OpenAI-compatible chat format, where each line represents a complete training example.

Dataset Requirements:
  • Format: .jsonl file
  • Minimum examples: 3
  • Maximum examples: 3 million per dataset
  • Images: Must be base64 encoded with proper MIME type prefixes
  • Supported image formats: PNG, JPG, JPEG
Message Schema: Each training example must include a messages array where each message has:
  • role: one of system, user, or assistant
  • content: either a plain text string, or an array of text and image_url objects

Basic VLM Dataset Example

{"messages": [{"role": "system", "content": "You are a helpful visual assistant that can analyze images and answer questions about them."}, {"role": "user", "content": [{"type": "text", "text": "What objects do you see in this image?"}, {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."}}]}, {"role": "assistant", "content": "I can see a red car, a tree, and a blue house in this image."}]}

If your dataset contains image URLs

Images must be base64 encoded with MIME type prefixes. If your dataset contains image URLs, you will need to download the images and encode them to base64.
❌ Incorrect Format - This will NOT work:
{"messages": [{"role": "user", "content": [{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}, {"type": "text", "text": "What's in this image?"}]}, {"role": "assistant", "content": "I can see..."}]}
Raw HTTP/HTTPS URLs are not supported. Images must be base64 encoded.
✅ Correct Format - Use this instead:
{"messages": [{"role": "user", "content": [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."}}, {"type": "text", "text": "What's in this image?"}]}, {"role": "assistant", "content": "I can see..."}]}
Notice the data:image/jpeg;base64, prefix followed by the base64 encoded image data.
You can use a small script to convert your dataset to the correct format automatically. The example below is a minimal sketch rather than an official Fireworks utility: it downloads each http(s) image URL, encodes the image to base64, and writes a corrected JSONL file. The input and output file names are placeholders.
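
# Minimal sketch: download each http(s) image URL in a JSONL dataset and
# replace it with a base64 data URI. File names below are placeholders.
import base64
import json
import mimetypes
import urllib.request

def to_data_uri(url: str) -> str:
    # Download the image and wrap it in a data:<mime>;base64,... URI.
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
        mime = mimetypes.guess_type(url)[0] or resp.headers.get_content_type()
    return f"data:{mime};base64," + base64.b64encode(data).decode("utf-8")

with open("vlm_training_data_raw.jsonl") as src, open("vlm_training_data.jsonl", "w") as dst:
    for line in src:
        example = json.loads(line)
        for message in example["messages"]:
            content = message.get("content")
            if not isinstance(content, list):
                continue  # string-only messages need no conversion
            for part in content:
                if part.get("type") == "image_url":
                    url = part["image_url"]["url"]
                    if url.startswith(("http://", "https://")):
                        part["image_url"]["url"] = to_data_uri(url)
        dst.write(json.dumps(example) + "\n")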

Advanced Dataset Examples

Multi-image Conversation

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Compare these two images and tell me the differences"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
          }
        },
        {
          "type": "image_url", 
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The first image shows a daytime scene while the second shows the same location at night. The lighting and shadows are completely different."
    }
  ]
}

Multi-turn Conversation

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful visual assistant that can analyze images and remember details from previous images in our conversation."
    },
    {
      "role": "user", 
      "content": [
        {
          "type": "text",
          "text": "Can you describe this kitchen layout for me?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "This is a modern open-plan kitchen with white cabinets, granite countertops, and stainless steel appliances. The island is positioned in the center with bar seating for 3-4 people. There's a large window above the sink providing natural light."
    },
    {
      "role": "user",
      "content": "Now look at this living room. Do you think the styles would work well together?"
    },
    {
      "role": "assistant", 
      "content": "I'd be happy to help compare the styles! However, I don't see a living room image in your message. Could you please share the living room photo so I can analyze how well it would coordinate with the modern kitchen style we just discussed?"
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Sorry about that! Here's the living room:"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4BBBSkZJRgABAQ..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "Perfect! This living room has a complementary modern aesthetic that would work beautifully with the kitchen. Both spaces feature clean lines and a neutral color palette. The living room's contemporary furniture and the kitchen's sleek design would create a cohesive flow in an open floor plan. The warm wood accents in the living room would add nice contrast to the kitchen's cooler tones."
    }
  ]
}

Step 2: Upload your VLM dataset

Upload your prepared JSONL dataset to Fireworks for training:
firectl create dataset my-vlm-dataset /path/to/vlm_training_data.jsonl
For larger datasets (>500MB), use firectl as it handles large uploads more reliably than the web interface.
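
Before uploading, you can optionally run a quick local sanity check that each line parses and follows the expected message schema. The snippet below is a minimal sketch, not a Fireworks tool, and the file path is a placeholder.

# Quick structural check of a VLM JSONL dataset before upload (illustrative only).
import json

with open("vlm_training_data.jsonl") as f:
    for line_number, line in enumerate(f, start=1):
        example = json.loads(line)  # raises ValueError if the line is not valid JSON
        messages = example.get("messages")
        assert isinstance(messages, list) and messages, f"line {line_number}: missing 'messages' array"
        for message in messages:
            assert message.get("role") in {"system", "user", "assistant"}, \
                f"line {line_number}: unexpected role {message.get('role')!r}"
print("Dataset looks structurally valid.")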

Step 3: Launch VLM fine-tuning job

Create a supervised fine-tuning job for your VLM:
firectl create sftj \
  --base-model accounts/fireworks/models/qwen2p5-vl-32b-instruct \
  --dataset my-vlm-dataset \
  --output-model my-custom-vlm
Optional parameters:
firectl create sftj \
  --base-model accounts/fireworks/models/qwen2p5-vl-32b-instruct \
  --dataset my-vlm-dataset \
  --evaluation-dataset my-vlm-eval-dataset \
  --output-model my-custom-vlm \
  --learning-rate 2e-4 \
  --epochs 3
VLM fine-tuning jobs typically take longer than text-only models due to the additional image processing. Expect training times of several hours depending on dataset size and model complexity.

Step 4: Monitor training progress

Track your VLM fine-tuning job:
# Check job status
firectl get sftj my-custom-vlm

# View training logs
firectl get sftj my-custom-vlm --logs
Monitor key metrics:
  • Training loss: Should generally decrease over time
  • Validation loss: Monitor for overfitting if using evaluation dataset
  • Training progress: Epochs completed and estimated time remaining
Your VLM fine-tuning job is complete when the status shows COMPLETED and your custom model is ready for deployment.

Step 5: Deploy your fine-tuned VLM

Once training is complete, deploy your custom VLM:
# Create a deployment for your fine-tuned VLM
firectl create deployment my-vlm-deployment --model my-custom-vlm

# Check deployment status
firectl get deployment my-vlm-deployment

Advanced Configuration

For additional fine-tuning parameters and advanced settings like custom learning rates, batch sizes, and optimization options, see the Additional SFT job settings section in our comprehensive fine-tuning guide.

Testing Your Fine-tuned VLM

After deployment, test your fine-tuned VLM using the same API patterns as base VLMs:
from fireworks import LLM

# Use your fine-tuned model
llm = LLM(model="accounts/your-account/models/my-custom-vlm")

response = llm.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [{
            "type": "text",
            "text": "Analyze this image using your specialized training",
        }, {
            "type": "image_url",
            "image_url": {
                "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
            },
        }],
    }]
)
print(response.choices[0].message.content)
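
If you are testing with a local image file rather than an already encoded string, you can build the data URI the same way as in the dataset preparation step. Below is a minimal helper; the file name is a placeholder.

import base64

def image_file_to_data_uri(path: str, mime: str = "image/jpeg") -> str:
    # Read a local image and return it as a data:<mime>;base64,... URI.
    with open(path, "rb") as f:
        return f"data:{mime};base64," + base64.b64encode(f.read()).decode("utf-8")

# Pass the result as the image_url value in the request above.
data_uri = image_file_to_data_uri("example_photo.jpg")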