> ## Documentation Index > Fetch the complete documentation index at: https://docs.fireworks.ai/llms.txt > Use this file to discover all available pages before exploring further. # Supervised Fine Tuning - Vision > Learn how to fine-tune vision-language models on Fireworks AI with image and text datasets Vision-language model (VLM) fine-tuning allows you to adapt pre-trained models that can understand both text and images to your specific use cases. This is particularly valuable for tasks like document analysis, visual question answering, image captioning, and domain-specific visual understanding. To see all vision models that support fine-tuning, visit the [Model Library for vision models](https://app.fireworks.ai/models?filter=vision\&tunable=true). ## Fine-tuning a VLM using LoRA vision datasets must be in JSONL format in OpenAI-compatible chat format. Each line represents a complete training example. **Dataset Requirements:** * **Format**: `.jsonl` file * **Minimum examples**: 3 * **Maximum examples**: 3 million per dataset * **Images**: Must be base64 encoded with proper MIME type prefixes * **Supported image formats**: PNG, JPG, JPEG **Message Schema:** Each training example must include a `messages` array where each message has: * `role`: one of `system`, `user`, or `assistant` * `content`: an array containing text and image objects or just text ### Basic VLM Dataset Example ```json theme={null} { "messages": [ { "role": "system", "content": "You are a helpful visual assistant that can analyze images and answer questions about them." }, { "role": "user", "content": [ { "type": "text", "text": "What objects do you see in this image?" }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..." } } ] }, { "role": "assistant", "content": "I can see a red car, a tree, and a blue house in this image." } ] } ``` ### If your dataset contains image urls Images must be base64 encoded with MIME type prefixes. If your dataset contains image URLs, you'll need to download and encode them to base64. ```json theme={null} { "type": "image_url", "image_url": { // ❌ Raw HTTP/HTTPS URLs are NOT supported "url": "https://example.com/image.jpg" } } ``` ```json theme={null} { "type": "image_url", "image_url": { // ✅ Use data URI with base64 encoding // Format: data:image/{format};base64,{base64_encoded_data} "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..." } } ``` You can use the following script to automatically convert your dataset to the correct format: **Usage:** ```bash theme={null} # Install required dependency pip install requests # Download the script wget https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/utils/download_images_and_encode_to_b64.py # Run the script - will output a new dataset _base64.jsonl python download_images_and_encode_to_b64.py --input_file ``` ### Advanced Dataset Examples ```json expandable theme={null} { "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Compare these two images and tell me the differences" }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..." } }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..." } } ] }, { "role": "assistant", "content": "The first image shows a daytime scene while the second shows the same location at night. The lighting and shadows are completely different." } ] } ``` ```json expandable theme={null} { "messages": [ { "role": "system", "content": "You are a helpful visual assistant that can analyze images and remember details from previous images in our conversation." }, { "role": "user", "content": [ { "type": "text", "text": "Can you describe this kitchen layout for me?" }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..." } } ] }, { "role": "assistant", "content": "This is a modern open-plan kitchen with white cabinets, granite countertops, and stainless steel appliances. The island is positioned in the center with bar seating for 3-4 people. There's a large window above the sink providing natural light." }, { "role": "user", "content": "Now look at this living room. Do you think the styles would work well together?" }, { "role": "assistant", "content": "I'd be happy to help compare the styles! However, I don't see a living room image in your message. Could you please share the living room photo so I can analyze how well it would coordinate with the modern kitchen style we just discussed?" }, { "role": "user", "content": [ { "type": "text", "text": "Sorry about that! Here's the living room:" }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4BBBSkZJRgABAQ..." } } ] }, { "role": "assistant", "content": "Perfect! This living room has a complementary modern aesthetic that would work beautifully with the kitchen. Both spaces feature clean lines and a neutral color palette. The living room's contemporary furniture and the kitchen's sleek design would create a cohesive flow in an open floor plan. The warm wood accents in the living room would add nice contrast to the kitchen's cooler tones." }, ] } ``` ### Try with an Example Dataset To get a feel for how VLM fine-tuning works, you can use an example vision dataset. This is a classification dataset that contains images of food with `` tags for reasoning. ```bash theme={null} # Download the example dataset curl -L -o food_reasoning.jsonl https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl ``` ```bash theme={null} # Download the example dataset wget https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl ``` Upload your prepared JSONL dataset to Fireworks for training: ```bash theme={null} firectl dataset create my-vlm-dataset /path/to/vlm_training_data.jsonl ``` Navigate to the Datasets tab in the Fireworks console, click "Create Dataset", and upload your JSONL file through the wizard. Dataset creation interface

```javascript theme={null} // Create dataset entry const createDatasetPayload = { datasetId: "my-vlm-dataset", dataset: { userUploaded: {} } }; const response = await fetch(`${BASE_URL}/datasets`, { method: "POST", headers: { "Authorization": `Bearer ${API_KEY}`, "Content-Type": "application/json" }, body: JSON.stringify(createDatasetPayload) }); // Upload JSONL file const formData = new FormData(); formData.append("file", fileInput.files[0]); const uploadResponse = await fetch(`${BASE_URL}/datasets/my-vlm-dataset:upload`, { method: "POST", headers: { "Authorization": `Bearer ${API_KEY}` }, body: formData }); ``` For larger datasets (>500MB), use `firectl` as it handles large uploads more reliably than the web interface. For enhanced data control and security, we also support bring your own bucket (BYOB) configurations. See our [Secure Fine Tuning](/fine-tuning/secure-fine-tuning#gcs-bucket-integration) guide for setup details. Create a supervised fine-tuning job for your VLM: ```bash theme={null} firectl sftj create \ --base-model accounts/fireworks/models/qwen2p5-vl-32b-instruct \ --dataset my-vlm-dataset \ --output-model my-custom-vlm \ --epochs 3 ``` For additional parameters like learning rates, evaluation datasets, and batch sizes, see [Additional SFT job settings](/fine-tuning/fine-tuning-models#additional-sft-job-settings). 1. Navigate to the Fine-tuning tab in the Fireworks console 2. Click "Create Fine-tuning Job" 3. Select your VLM base model (Qwen 2.5 VL) 4. Choose your uploaded dataset 5. Configure training parameters 6. Launch the job Fine-tuning job creation interface

VLM fine-tuning jobs typically take longer than text-only models due to the additional image processing. Expect training times of several hours depending on dataset size and model complexity. Track your VLM fine-tuning job in the [Fireworks console](https://app.fireworks.ai/dashboard/fine-tuning). VLM fine-tuning job in the Fireworks console

VLM fine-tuning job in the Fireworks console

Monitor key metrics: * **Training loss**: Should generally decrease over time * **Evaluation loss**: Monitor for overfitting if using evaluation dataset * **Training progress**: Epochs completed and estimated time remaining Your VLM fine-tuning job is complete when the status shows `COMPLETED` and your custom model is ready for deployment. Once training is complete, deploy your custom VLM: ```bash theme={null} # Create a deployment for your fine-tuned VLM firectl deployment create my-custom-vlm # Check deployment status firectl deployment get accounts/your-account/deployment/deployment-id ``` Deploy from the UI using the `Deploy` dropdown in the fine-tuning job page. Deploy dropdown in the fine-tuning job page

Deploy dropdown in the fine-tuning job page

## Advanced Configuration For additional fine-tuning parameters and advanced settings like custom learning rates, batch sizes, and optimization options, see the [Additional SFT job settings](/fine-tuning/fine-tuning-models#additional-sft-job-settings) section in our comprehensive fine-tuning guide. Need custom training loops for VLMs? The **Training API** also supports vision-language model fine-tuning with full control over loss functions, training objectives, and evaluation. See [Training API — Vision Inputs](/fine-tuning/training-api/vision-inputs) for details. ## Interactive Tutorials: Fine-tuning VLMs For a hands-on, step-by-step walkthrough of VLM fine-tuning, we've created two fine tuning cookbooks that demonstrates the complete process from dataset preparation, model deployment to evaluation. **Google Colab Notebook: Fine-tune Qwen2.5 VL on Fireworks AI** **Finetuning a VLM to beat SOTA closed source model** The cookbooks above cover the following: * Setting up your environment with Fireworks CLI * Preparing vision datasets in the correct format * Launching and monitoring VLM fine-tuning jobs * Testing your fine-tuned model * Best practices for VLM fine-tuning * Running inference on serverless VLMs * Running evals to show performance gains ## Testing Your Fine-tuned VLM After deployment, test your fine-tuned VLM using the same API patterns as base VLMs: ```python Python (OpenAI SDK) theme={null} import openai client = openai.OpenAI( base_url="https://api.fireworks.ai/inference/v1", api_key="", ) response = client.chat.completions.create( model="accounts/your-account/models/my-custom-vlm", messages=[{ "role": "user", "content": [{ "type": "image_url", "image_url": { "url": "https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/images/icecream.jpeg" }, },{ "type": "text", "text": "What's in this image?", }], }] ) print(response.choices[0].message.content) ``` If you fine-tuned using the example dataset, your model should include `` tags in its response.