The Training SDK supports vision-language model (VLM) fine-tuning, allowing you to train models that understand both images and text. This works across all training modes — SFT, DPO, and RL — using the same SDK primitives and cookbook recipes you already know.
VLM support in the Training SDK requires a VLM-compatible training shape. See Training Shapes for available shapes.

What changes for vision

Compared to text-only training, VLM fine-tuning differs in three ways:
| Aspect | Text-only | Vision |
| --- | --- | --- |
| Training shape | Text model shape (e.g. qwen3-8b-128k) | VLM shape (e.g. qwen3-vl-8b-65k) |
| Tokenizer | Text tokenizer (e.g. Qwen/Qwen3-8B) | VLM processor (e.g. Qwen/Qwen3-VL-8B-Instruct) |
| Message format | content is a string | content is an array of text and image_url objects |
Everything else — loss functions, checkpointing, weight sync, deployment sampling — works identically.

Dataset format

Vision datasets use the standard OpenAI-compatible chat format. The key difference is that content fields can contain an array of content parts mixing text and images:

Single image

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What objects do you see in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "I can see a red car, a tree, and a blue house."
    }
  ]
}

Multiple images

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Compare these two images"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The first image shows a daytime scene while the second shows the same location at night."
    }
  ]
}

Multi-turn with images

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this kitchen."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
      ]
    },
    {
      "role": "assistant",
      "content": "This is a modern open-plan kitchen with white cabinets and granite countertops."
    },
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Now compare it with this living room."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4BBB..."}}
      ]
    },
    {
      "role": "assistant",
      "content": "Both spaces share a modern aesthetic with clean lines and neutral colors."
    }
  ]
}
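
Records like the three above can also be assembled programmatically. The sketch below is a minimal illustration and not part of the Training SDK: user_turn, assistant_turn, and to_jsonl_line are hypothetical helper names, and the data URIs are placeholder strings rather than real image payloads.

```python
import json


def user_turn(text, image_urls=()):
    """Build a user message whose content is an array of text and image_url parts."""
    parts = [{"type": "text", "text": text}]
    for url in image_urls:
        parts.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": parts}


def assistant_turn(text):
    """Assistant replies stay plain strings, as in the examples above."""
    return {"role": "assistant", "content": text}


def to_jsonl_line(messages):
    """Serialize one conversation as a single JSONL line."""
    return json.dumps({"messages": messages})


# Placeholder data URIs; real records carry the full base64 payload.
kitchen = "data:image/jpeg;base64,<kitchen-bytes>"
living_room = "data:image/jpeg;base64,<living-room-bytes>"

record_line = to_jsonl_line([
    user_turn("Describe this kitchen.", [kitchen]),
    assistant_turn("This is a modern open-plan kitchen."),
    user_turn("Now compare it with this living room.", [living_room]),
    assistant_turn("Both spaces share a modern aesthetic."),
])
```

Writing one such line per conversation produces a JSONL file of the kind the dataset path in the recipes below points at.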

Image encoding requirements

Images must be embedded as base64-encoded data URIs that begin with a MIME type prefix such as data:image/jpeg;base64, as in the content part below. Raw HTTP URLs are not supported in training data.
{
  "type": "image_url",
  "image_url": {
    "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
  }
}
Supported image formats: PNG, JPEG/JPG. If your dataset contains image URLs, download and convert them to base64 first. See the conversion script in the managed VLM fine-tuning guide.
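
For images already on local disk, the encoding step can be sketched with the Python standard library alone. This is a minimal illustration, not the conversion script from the managed guide: image_to_data_uri and image_part are hypothetical helper names, and the MIME type is guessed from the file extension rather than from the file contents.

```python
import base64
import mimetypes
from pathlib import Path

# Formats listed as supported above (JPG and JPEG share one MIME type).
SUPPORTED_MIME_TYPES = {"image/png", "image/jpeg"}


def image_to_data_uri(path):
    """Read an image file and return it as a base64 data URI."""
    # The MIME type is inferred from the extension only; the bytes are not inspected.
    mime_type, _ = mimetypes.guess_type(str(path))
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported image type for {path}: {mime_type}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part(path):
    """Wrap a local image as an image_url content part."""
    return {"type": "image_url", "image_url": {"url": image_to_data_uri(path)}}
```

For a dataset that stores remote URLs, each image would first be downloaded to disk and then passed through a helper like this one.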

Cookbook: VLM SFT

The cookbook’s sft_loop recipe works with vision datasets out of the box. Use a VLM training shape and a VLM tokenizer:
from training.recipes.sft_loop import Config, main
from training.utils import InfraConfig

cfg = Config(
    log_path="./vlm_sft_logs",
    base_model="accounts/fireworks/models/qwen3-vl-8b-instruct",
    dataset="/path/to/vision_data.jsonl",
    tokenizer_model="Qwen/Qwen3-VL-8B-Instruct",
    max_seq_len=4096,
    epochs=1,
    batch_size=4,
    learning_rate=1e-5,
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-vl-8b-65k",
    ),
)

main(cfg)
The recipe handles vision-aware tokenization automatically: image tokens are treated as prompt and assigned weight 0.0, so they do not contribute to the loss, while text response tokens are assigned weight 1.0 and are trained on.
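
To make the weighting concrete, the sketch below builds a per-token weight list from a sequence of labelled token spans. It illustrates the masking rule only and is not the recipe's actual implementation; the span labels ("prompt_text", "image", "response"), the token counts, and the build_weights helper are all hypothetical.

```python
def build_weights(spans):
    """Return one weight per token: 1.0 for response tokens, 0.0 otherwise.

    `spans` is a list of (kind, n_tokens) pairs in sequence order.
    """
    weights = []
    for kind, n_tokens in spans:
        weight = 1.0 if kind == "response" else 0.0
        weights.extend([weight] * n_tokens)
    return weights


# Made-up example: 5 prompt text tokens, 3 image tokens, 4 response tokens.
weights = build_weights([("prompt_text", 5), ("image", 3), ("response", 4)])
trained_tokens = int(sum(weights))  # only the response tokens carry weight
```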

SDK-level: VLM training loop

For full control over the training loop, use the SDK directly with a VLM training shape. The workflow is the same as text-only training, but the tokenizer and shape are VLM-specific:

1. Provision a VLM trainer

import os
from fireworks.training.sdk import (
    FiretitanServiceClient,
    TrainerJobManager,
    TrainerJobConfig,
)

api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")

base_model = "accounts/fireworks/models/qwen3-vl-8b-instruct"
shape_id = "accounts/fireworks/trainingShapes/qwen3-vl-8b-65k"

rlor_mgr = TrainerJobManager(api_key=api_key, base_url=base_url)

profile = rlor_mgr.resolve_training_profile(shape_id)

endpoint = rlor_mgr.create_and_wait(TrainerJobConfig(
    base_model=base_model,
    training_shape_ref=profile.training_shape_version,
    lora_rank=0,
    learning_rate=1e-5,
    gradient_accumulation_steps=4,
    display_name="vlm-sft",
))

2. Connect and train

import torch
import tinker
import transformers
from tinker_cookbook.supervised.common import datum_from_tokens_weights

service = FiretitanServiceClient(base_url=endpoint.base_url, api_key=api_key)
training_client = service.create_training_client(
    base_model=base_model, lora_rank=0,
)

processor = transformers.AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct", trust_remote_code=True,
)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/..."}},
        ],
    },
    {
        "role": "assistant",
        "content": "The image shows a sunset over the ocean.",
    },
]

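# Render the full conversation to text, then tokenize it.
text = processor.apply_chat_template(conversation, tokenize=False)
full_tokens = processor.tokenizer.encode(text)

# Render only the user turn to find where the prompt ends.
# Depending on the chat template, the assistant header may fall after
# prompt_len here, in which case it is included in the trained span.
prompt_text = processor.apply_chat_template(conversation[:1], tokenize=False)
prompt_len = len(processor.tokenizer.encode(prompt_text))

# Weight 0.0 masks prompt tokens; weight 1.0 trains on response tokens.
weights = torch.zeros(len(full_tokens), dtype=torch.float32)
weights[prompt_len:] = 1.0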

datum = datum_from_tokens_weights(
    torch.tensor(full_tokens, dtype=torch.long),
    weights,
    max_length=4096,
)

def sft_loss(data, logprobs_list):
    """Weighted negative log-likelihood, normalized by the total weight in the batch."""
    total_loss = torch.tensor(0.0)
    n_tokens = 0
    for i, logprobs in enumerate(logprobs_list):
        w = torch.tensor(data[i].loss_fn_inputs["weights"].data, dtype=torch.float32)
        # Truncate to the shorter length in case logprobs and weights differ.
        min_len = min(len(logprobs), len(w))
        total_loss = total_loss - torch.dot(logprobs[:min_len].float(), w[:min_len])
        n_tokens += w[:min_len].sum().item()
    # max(..., 1) avoids division by zero when every weight is 0.
    loss = total_loss / max(n_tokens, 1)
    return loss, {"sft_loss": loss.item()}

for step in range(100):
    training_client.forward_backward_custom([datum], sft_loss).result()
    training_client.optim_step(
        tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01)
    ).result()
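
The sft_loss function above computes a weighted negative log-likelihood normalized by the total weight. The dependency-free sketch below reproduces that arithmetic for a single sequence, using made-up numbers; the log-probabilities and weights are illustrative values rather than real model output, and weighted_nll is a hypothetical name.

```python
def weighted_nll(logprobs, weights):
    """Negative log-probability summed with per-token weights, divided by total weight."""
    n = min(len(logprobs), len(weights))
    total = -sum(lp * w for lp, w in zip(logprobs[:n], weights[:n]))
    return total / max(sum(weights[:n]), 1)


# One prompt token (weight 0.0) followed by two response tokens (weight 1.0).
logprobs = [-0.5, -1.0, -2.0]
weights = [0.0, 1.0, 1.0]
loss = weighted_nll(logprobs, weights)  # (1.0 + 2.0) / 2 = 1.5
```

The prompt token's log-probability of -0.5 does not affect the result because its weight is 0.0, which is the effect the weights tensor has in the training loop.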

3. Save and promote

Checkpointing and weight sync work identically to text-only training:
result = training_client.save_weights_for_sampler_ext("vlm-final", checkpoint_type="base")

model = rlor_mgr.promote_checkpoint(
    job_id=endpoint.job_id,
    checkpoint_id=result.snapshot_name,
    output_model_id="my-vlm-model",
)

VLM DPO and RL

Vision inputs also work with DPO and RL training. Messages use the same multimodal content arrays as the SFT examples above:

DPO with vision

{
  "chosen": {
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this chart."},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
        ]
      },
      {"role": "assistant", "content": "This bar chart shows quarterly revenue growth of 15% year-over-year."}
    ]
  },
  "rejected": {
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this chart."},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
        ]
      },
      {"role": "assistant", "content": "This is a chart."}
    ]
  }
}
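
In the example above, the chosen and rejected branches repeat the same user prompt and differ only in the assistant reply. Assuming that is the intended shape for every pair, a record can be built from one shared prompt as sketched below. make_dpo_record and prompts_match are hypothetical helpers, not SDK functions, and the data URI is a placeholder.

```python
import json


def make_dpo_record(prompt_messages, chosen_text, rejected_text):
    """Pair one shared prompt with a preferred and a dispreferred assistant reply."""
    def with_reply(text):
        reply = {"role": "assistant", "content": text}
        return {"messages": list(prompt_messages) + [reply]}
    return {"chosen": with_reply(chosen_text), "rejected": with_reply(rejected_text)}


def prompts_match(record):
    """Check that chosen and rejected share every message except the final reply."""
    return record["chosen"]["messages"][:-1] == record["rejected"]["messages"][:-1]


prompt = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this chart."},
        # Placeholder data URI; a real record carries the full base64 payload.
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<chart-bytes>"}},
    ],
}]

record = make_dpo_record(
    prompt,
    "This bar chart shows quarterly revenue growth of 15% year-over-year.",
    "This is a chart.",
)
dpo_line = json.dumps(record)  # one JSONL line per preference pair
```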

RL with vision prompts

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Solve the math problem shown in this image. Show your reasoning."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
      ]
    }
  ]
}
Use the corresponding cookbook recipes (dpo_loop, rl_loop) with a VLM training shape and tokenizer — the multimodal message handling is automatic.

Available VLM training shapes

| Model | Shape ID | Context | GPUs |
| --- | --- | --- | --- |
| Qwen3 VL 8B | accounts/fireworks/trainingShapes/qwen3-vl-8b-65k | 65k | 4 |
See Training Shapes for the full list and details.