Vision Inputs

The Training SDK supports vision-language model (VLM) fine-tuning, allowing you to train models that understand both images and text. This works across all training modes — SFT, DPO, and RL — using the same SDK primitives and cookbook recipes you already know.

VLM support in the Training SDK requires a VLM-compatible training shape. See Training Shapes for available shapes.

What changes for vision

Compared to text-only training, VLM fine-tuning differs in three ways:

Aspect	Text-only	Vision
Training shape	Text model shape (e.g. `qwen3-8b-128k`)	VLM shape (e.g. `qwen3-vl-8b-65k`)
Tokenizer	Text tokenizer (e.g. `Qwen/Qwen3-8B`)	VLM processor (e.g. `Qwen/Qwen3-VL-8B-Instruct`)
Message format	`content` is a string	`content` is an array of text and `image_url` objects

Everything else — loss functions, checkpointing, weight sync, deployment sampling — works identically.

Dataset format

Vision datasets use the standard OpenAI-compatible chat format. The key difference is that content fields can contain an array of content parts mixing text and images:

Single image

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What objects do you see in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "I can see a red car, a tree, and a blue house."
    }
  ]
}

Multiple images

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Compare these two images"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The first image shows a daytime scene while the second shows the same location at night."
    }
  ]
}

Multi-turn with images

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this kitchen."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
      ]
    },
    {
      "role": "assistant",
      "content": "This is a modern open-plan kitchen with white cabinets and granite countertops."
    },
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Now compare it with this living room."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4BBB..."}}
      ]
    },
    {
      "role": "assistant",
      "content": "Both spaces share a modern aesthetic with clean lines and neutral colors."
    }
  ]
}
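The three layouts above differ only in how many parts and turns they contain, so records can be generated programmatically. Below is a minimal sketch that builds a record and serializes it as one JSON object per line, which is what the `.jsonl` dataset path used later in this guide suggests. The helper names `build_example` and `to_jsonl` are hypothetical and not part of the SDK, and the base64 payloads are placeholders.

```python
import json


def build_example(prompt: str, image_data_uris: list[str], response: str) -> dict:
    """Build one record: a user turn with text plus images, then the assistant reply."""
    user_content = [{"type": "text", "text": prompt}]
    # One image_url part per image, in the order given.
    user_content += [{"type": "image_url", "image_url": {"url": uri}} for uri in image_data_uris]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": response},
        ]
    }


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


record = build_example(
    "Compare these two images",
    ["data:image/jpeg;base64,AAAA", "data:image/jpeg;base64,BBBB"],  # placeholder payloads
    "The first image shows a daytime scene while the second shows the same location at night.",
)
jsonl_text = to_jsonl([record])
```

Writing `jsonl_text` to a file produces a dataset with the same shape as the "Multiple images" example.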

Image encoding requirements

Images must be base64-encoded with a MIME type prefix. Raw HTTP URLs are not supported in training data.

Correct

{
  "type": "image_url",
  "image_url": {
    "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
  }
}

Incorrect

{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/photo.jpg"
  }
}
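It can help to catch raw URLs before a job starts. The following is a minimal sketch of a pre-flight check that reports every image part whose `url` does not begin with an image MIME type prefix and a base64 marker. The function name `find_bad_image_urls` is hypothetical, and the check only looks at the prefix; it does not decode the payload or reproduce whatever validation the platform performs.

```python
import re

# A data URI with an image MIME type and a base64 marker, e.g. "data:image/jpeg;base64,...".
DATA_URI_PREFIX = re.compile(r"^data:image/[a-z0-9.+-]+;base64,")


def find_bad_image_urls(record: dict) -> list[tuple[int, int, str]]:
    """Return (message_index, part_index, url) for every image part that is not a data URI."""
    bad = []
    for m_idx, message in enumerate(record.get("messages", [])):
        content = message.get("content")
        if isinstance(content, str):
            continue  # plain-text message, nothing to check
        for p_idx, part in enumerate(content or []):
            if part.get("type") != "image_url":
                continue
            url = part.get("image_url", {}).get("url", "")
            if not DATA_URI_PREFIX.match(url):
                bad.append((m_idx, p_idx, url))
    return bad


record = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What objects do you see in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        },
        {"role": "assistant", "content": "I can see a red car, a tree, and a blue house."},
    ]
}
print(find_bad_image_urls(record))  # [(0, 1, 'https://example.com/photo.jpg')]
```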

Supported image formats: PNG, JPEG/JPG. If your dataset contains image URLs, download and convert them to base64 first. See the conversion script in the managed VLM fine-tuning guide.
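For images you already have as bytes (read from disk or downloaded), the encoding step can be sketched as follows. This is not the conversion script from the managed guide; `to_data_uri` is a hypothetical helper that detects PNG and JPEG by their leading magic bytes and rejects everything else, on the assumption that only the two formats listed above should pass.

```python
import base64

# Magic-byte signatures for the two formats this guide lists as supported.
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}


def to_data_uri(image_bytes: bytes) -> str:
    """Encode raw PNG or JPEG bytes as a base64 data URI; reject anything else."""
    for signature, mime in _SIGNATURES.items():
        if image_bytes.startswith(signature):
            payload = base64.b64encode(image_bytes).decode("ascii")
            return f"data:{mime};base64,{payload}"
    raise ValueError("unsupported image format: expected PNG or JPEG")


# Tiny stand-in for real file contents (starts like a JPEG but is not a decodable image).
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 8
uri = to_data_uri(fake_jpeg)
```

With a real file, the call would look like `to_data_uri(Path("photo.jpg").read_bytes())`. Base64 encoding grows the payload by roughly one third, which is worth keeping in mind when estimating dataset file sizes.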

Cookbook: VLM SFT

The cookbook’s sft_loop recipe works with vision datasets out of the box. Use a VLM training shape and a VLM tokenizer:

from training.recipes.sft_loop import Config, main
from training.utils import InfraConfig

cfg = Config(
    log_path="./vlm_sft_logs",
    base_model="accounts/fireworks/models/qwen3-vl-8b-instruct",
    dataset="/path/to/vision_data.jsonl",
    tokenizer_model="Qwen/Qwen3-VL-8B-Instruct",
    max_seq_len=4096,
    epochs=1,
    batch_size=4,
    learning_rate=1e-5,
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-vl-8b-65k",
    ),
)

main(cfg)

The recipe handles vision-aware tokenization automatically — image tokens are assigned weight 0.0 (prompt) and text response tokens are assigned weight 1.0 (train).
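To make the weighting concrete, here is a toy sketch of the mask described above for a single-turn example: every prompt position (text and image tokens alike) gets 0.0, and every response position gets 1.0. The token counts are made up, `build_weights` is a hypothetical name, and the recipe's internal implementation may differ.

```python
def build_weights(num_tokens: int, prompt_len: int) -> list[float]:
    """Per-token loss weights: 0.0 over the prompt, 1.0 over the response."""
    if not 0 <= prompt_len <= num_tokens:
        raise ValueError("prompt_len must be between 0 and num_tokens")
    return [0.0] * prompt_len + [1.0] * (num_tokens - prompt_len)


# Illustrative numbers: a 10-token sequence whose first 7 tokens are prompt
# (text and image tokens) and whose last 3 are the assistant response.
weights = build_weights(num_tokens=10, prompt_len=7)
trained_tokens = int(sum(weights))
```

Only the positions with weight 1.0 contribute to the loss, so `trained_tokens` is the number of tokens the model is actually trained on for this example.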

SDK-level: VLM training loop

For full control over the training loop, use the SDK directly with a VLM training shape. The workflow is the same as text-only training, but the tokenizer and shape are VLM-specific:

1. Provision a VLM trainer

import os
from fireworks.training.sdk import (
    FiretitanServiceClient,
    TrainerJobManager,
    TrainerJobConfig,
)

api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")

base_model = "accounts/fireworks/models/qwen3-vl-8b-instruct"
shape_id = "accounts/fireworks/trainingShapes/qwen3-vl-8b-65k"

rlor_mgr = TrainerJobManager(api_key=api_key, base_url=base_url)

profile = rlor_mgr.resolve_training_profile(shape_id)

endpoint = rlor_mgr.create_and_wait(TrainerJobConfig(
    base_model=base_model,
    training_shape_ref=profile.training_shape_version,
    lora_rank=0,
    learning_rate=1e-5,
    gradient_accumulation_steps=4,
    display_name="vlm-sft",
))

2. Connect and train

import torch
import tinker
import transformers
from tinker_cookbook.supervised.common import datum_from_tokens_weights

service = FiretitanServiceClient(base_url=endpoint.base_url, api_key=api_key)
training_client = service.create_training_client(
    base_model=base_model, lora_rank=0,
)

processor = transformers.AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct", trust_remote_code=True,
)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/..."}},
        ],
    },
    {
        "role": "assistant",
        "content": "The image shows a sunset over the ocean.",
    },
]

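# Render the full conversation with the chat template, then tokenize it.
text = processor.apply_chat_template(conversation, tokenize=False)
full_tokens = processor.tokenizer.encode(text)

# Render only the user turn to find where the prompt ends.
prompt_text = processor.apply_chat_template(conversation[:1], tokenize=False)
prompt_len = len(processor.tokenizer.encode(prompt_text))

# Mask the prompt: weight 0.0 on prompt tokens, 1.0 on everything after them.
weights = torch.zeros(len(full_tokens), dtype=torch.float32)
weights[prompt_len:] = 1.0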

datum = datum_from_tokens_weights(
    torch.tensor(full_tokens, dtype=torch.long),
    weights,
    max_length=4096,
)

def sft_loss(data, logprobs_list):
    # Weighted negative log-likelihood, averaged over tokens with non-zero weight.
    total_loss = torch.tensor(0.0)
    n_tokens = 0
    for i, logprobs in enumerate(logprobs_list):
        w = torch.tensor(data[i].loss_fn_inputs["weights"].data, dtype=torch.float32)
        # Guard against a length mismatch between logprobs and weights.
        min_len = min(len(logprobs), len(w))
        total_loss = total_loss - torch.dot(logprobs[:min_len].float(), w[:min_len])
        n_tokens += w[:min_len].sum().item()
    mean_loss = total_loss / max(n_tokens, 1)
    return mean_loss, {"sft_loss": mean_loss.item()}

for step in range(100):
    training_client.forward_backward_custom([datum], sft_loss).result()
    training_client.optim_step(
        tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01)
    ).result()
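The `sft_loss` function above is a weighted mean negative log-likelihood: it sums `-logprob * weight` over all positions and divides by the total weight, with a floor of 1 to avoid dividing by zero. The pure-Python restatement below works through one sequence with made-up log-probabilities; it is a sketch for checking the arithmetic, not a replacement for the torch version, which accumulates over every datum in the batch before dividing.

```python
import math


def weighted_nll(logprobs: list[float], weights: list[float]) -> float:
    """Mean negative log-likelihood over positions with non-zero weight."""
    n = min(len(logprobs), len(weights))  # same length guard as sft_loss
    total = -sum(lp * w for lp, w in zip(logprobs[:n], weights[:n]))
    n_tokens = sum(weights[:n])
    return total / max(n_tokens, 1)


# Made-up values: two prompt positions (weight 0.0) and two response positions (weight 1.0).
logprobs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.25)]
weights = [0.0, 0.0, 1.0, 1.0]
loss = weighted_nll(logprobs, weights)
```

Here the prompt positions drop out entirely, and the loss is the average of `-log(0.5)` and `-log(0.25)`, about 1.04.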

3. Save and promote

Checkpointing and weight sync work identically to text-only training:

result = training_client.save_weights_for_sampler_ext("vlm-final", checkpoint_type="base")

model = rlor_mgr.promote_checkpoint(
    job_id=endpoint.job_id,
    checkpoint_id=result.snapshot_name,
    output_model_id="my-vlm-model",
)

VLM DPO and RL

Vision inputs also work with DPO and RL training. The dataset format is the same — use multimodal content arrays in your messages:

DPO with vision

{
  "chosen": {
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this chart."},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
        ]
      },
      {"role": "assistant", "content": "This bar chart shows quarterly revenue growth of 15% year-over-year."}
    ]
  },
  "rejected": {
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this chart."},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
        ]
      },
      {"role": "assistant", "content": "This is a chart."}
    ]
  }
}
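In the example above, the `chosen` and `rejected` branches carry the same user turn, image included, and differ only in the assistant reply. A minimal sketch of building such a record from one shared prompt follows; `build_dpo_record` is a hypothetical helper, not an SDK function, and the base64 payload is a placeholder.

```python
import copy


def build_dpo_record(prompt_messages: list[dict], chosen: str, rejected: str) -> dict:
    """Pair one shared prompt with a preferred and a dispreferred assistant reply."""

    def with_reply(reply: str) -> dict:
        # Deep-copy so the two branches do not share nested content lists.
        messages = copy.deepcopy(prompt_messages)
        messages.append({"role": "assistant", "content": reply})
        return {"messages": messages}

    return {"chosen": with_reply(chosen), "rejected": with_reply(rejected)}


prompt = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this chart."},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},  # placeholder payload
        ],
    }
]
record = build_dpo_record(
    prompt,
    chosen="This bar chart shows quarterly revenue growth of 15% year-over-year.",
    rejected="This is a chart.",
)
```

Building both branches from one prompt object avoids accidental drift between them, at the cost of storing each image twice per record.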

RL with vision prompts

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Solve the math problem shown in this image. Show your reasoning."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
      ]
    }
  ]
}

Use the corresponding cookbook recipes (dpo_loop, rl_loop) with a VLM training shape and tokenizer — the multimodal message handling is automatic.
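The RL example above ends on the user turn and contains no assistant reply. If you already have an SFT vision dataset, one way to derive prompts of that shape is to drop the trailing assistant turns. This is a sketch under that assumption: `to_rl_prompt` is a hypothetical helper, and any extra fields your reward setup needs are outside what this guide covers.

```python
import copy


def to_rl_prompt(sft_record: dict) -> dict:
    """Drop trailing assistant turns so the record ends on a user message."""
    messages = copy.deepcopy(sft_record["messages"])
    while messages and messages[-1]["role"] == "assistant":
        messages.pop()
    if not messages:
        raise ValueError("record has no prompt messages")
    return {"messages": messages}


sft_record = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Solve the math problem shown in this image. Show your reasoning."},
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},  # placeholder payload
            ],
        },
        {"role": "assistant", "content": "The answer is 12."},  # made-up reply
    ]
}
rl_record = to_rl_prompt(sft_record)
```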

Available VLM training shapes

Model	Shape ID	Context	GPUs
Qwen3 VL 8B	`accounts/fireworks/trainingShapes/qwen3-vl-8b-65k`	65k	4

See Training Shapes for the full list and details.

Related guides

Training Shapes: available VLM and text training shapes
Supervised Fine Tuning - Vision (Managed): managed VLM fine-tuning without writing training loops
Querying Vision Language Models: inference with VLMs
Cookbook SFT: SFT recipe details
Loss Functions: custom loss function patterns
