Learn how to fine-tune vision-language models on Fireworks AI with image and text datasets
Prepare your vision dataset
.jsonl
filemessages
array where each message has:role
: one of system
, user
, or assistant
content
: an array containing text and image objects or just textdata:image/jpeg;base64,
prefix followed by the base64 encoded image data.Python script to download and encode images to base64
Upload your VLM dataset
firectl
as it handles large uploads more reliably than the web interface.Launch VLM fine-tuning job
Monitor training progress
COMPLETED
and your custom model is ready for deployment.Deploy your fine-tuned VLM