Some omni/multimodal models can process audio and/or video inputs directly, enabling video captioning, scene analysis, content understanding, and multimodal question answering. A good example is Qwen3 Omni (qwen3-omni-30b-a3b-instruct), which accepts video, audio, and text inputs in a single request. For production workloads, deploy these models on dedicated deployments.

Create a deployment

Video and audio models require dedicated deployments. Create one using firectl:
firectl create deployment qwen3-omni-30b-a3b-instruct \
  --account-id <YOUR_ACCOUNT_ID> \
  --min-replica-count 1 \
  --max-replica-count 1 \
  --deployment-shape qwen3-omni-30b-a3b-instruct-minimal
The predefined qwen3-omni-30b-a3b-instruct-minimal deployment shape is required for the deployment to work correctly.
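
A deployment can take several minutes to become ready. As a quick sanity check before sending traffic (a sketch; the exact output fields may vary across firectl versions), inspect the deployment state:
firectl get deployment <DEPLOYMENT_ID> --account-id <YOUR_ACCOUNT_ID>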

Chat Completions API

Provide video and audio as base64-encoded data URLs. The model accepts video_url, audio_url, and text content types.
import os
import base64
import requests

# Load and encode your preprocessed video and audio
with open("processed_video.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode("utf-8")

with open("audio.ogg", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

# API configuration
url = "https://api.fireworks.ai/inference/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
}

# Request payload
payload = {
    "model": "accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    "max_tokens": 1000,
    "temperature": 0.3,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
                {"type": "audio_url", "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"}},
                {"type": "text", "text": "Describe what happens in this video."},
            ],
        },
    ],
}

# Send request
response = requests.post(url, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
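
The endpoint is OpenAI-compatible, so the same request can also be issued through the openai Python SDK pointed at the Fireworks base URL. This is a sketch under that assumption; verify against your SDK version that the non-standard video_url and audio_url content parts are passed through unchanged:
import os
from openai import OpenAI

# Reuse video_b64 and audio_b64 from the encoding step above
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    model="accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    max_tokens=1000,
    temperature=0.3,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
                {"type": "audio_url", "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"}},
                {"type": "text", "text": "Describe what happens in this video."},
            ],
        },
    ],
)
print(response.choices[0].message.content)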

Working with videos

Video models perform best with preprocessed inputs that balance quality and token efficiency. Use ffmpeg to optimize your video and audio before sending requests.

Preprocessing video

Extract frames at 1 FPS and downscale to 360p for efficient processing:
ffmpeg -y -i input_video.mp4 \
  -t 60 \
  -vf "fps=1,scale=-1:360" \
  -c:v libx264 -preset fast \
  -an \
  processed_video.mp4
Parameter      Description
-t 60          Limit to first 60 seconds
fps=1          Extract 1 frame per second
scale=-1:360   Downscale to 360p height, maintain aspect ratio
-an            Remove audio track (extracted separately)
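
To confirm the processed file matches these settings (1 FPS, 360p height, no audio stream), you can inspect it with ffprobe, which ships with ffmpeg:
ffprobe -v error -select_streams v:0 \
  -show_entries stream=width,height,avg_frame_rate \
  -of default=noprint_wrappers=1 \
  processed_video.mp4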

Preprocessing audio

Extract audio as Opus in an Ogg container for optimal compression:
ffmpeg -y -i input_video.mp4 \
  -t 60 \
  -vn \
  -c:a libopus \
  -b:a 24k \
  -ar 16000 \
  -ac 1 \
  audio.ogg
Parameter      Description
-t 60          Limit to first 60 seconds
-vn            Remove video track
-c:a libopus   Use Opus codec
-b:a 24k       24 kbps bitrate
-ar 16000      16 kHz sample rate
-ac 1          Mono audio
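
The same kind of ffprobe check verifies the extracted audio is mono, 16 kHz Opus:
ffprobe -v error -select_streams a:0 \
  -show_entries stream=codec_name,sample_rate,channels \
  -of default=noprint_wrappers=1 \
  audio.ogg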

Complete preprocessing example

import subprocess
import tempfile
import base64
import os

def preprocess_video(video_path: str) -> tuple[str, str]:
    """
    Preprocess video for optimal model input.
    
    Returns:
        Tuple of (video_base64, audio_base64)
    """
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_video:
        processed_video_path = tmp_video.name
    with tempfile.NamedTemporaryFile(suffix=".ogg", delete=False) as tmp_audio:
        audio_path = tmp_audio.name
    
    try:
        # Process video: 1 FPS, 360p, max 60 seconds
        subprocess.run([
            "ffmpeg", "-y", "-i", video_path,
            "-t", "60",
            "-vf", "fps=1,scale=-1:360",
            "-c:v", "libx264", "-preset", "fast",
            "-an",
            processed_video_path
        ], check=True, capture_output=True)
        
        # Extract audio: Opus/Ogg, mono, 16kHz, 24kbps
        subprocess.run([
            "ffmpeg", "-y", "-i", video_path,
            "-t", "60",
            "-vn",
            "-c:a", "libopus",
            "-b:a", "24k",
            "-ar", "16000",
            "-ac", "1",
            audio_path
        ], check=True, capture_output=True)
        
        with open(processed_video_path, "rb") as f:
            video_b64 = base64.b64encode(f.read()).decode("utf-8")
        
        with open(audio_path, "rb") as f:
            audio_b64 = base64.b64encode(f.read()).decode("utf-8")
        
        return video_b64, audio_b64
    
    finally:
        os.unlink(processed_video_path)
        os.unlink(audio_path)
Preprocessing is highly recommended to reduce latency and ensure consistent performance.
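
Putting the pieces together, here is a minimal end-to-end sketch that preprocesses a local file with the function above and sends it through the Chat Completions request shown earlier (the ask_about_video helper name is illustrative, and the model/deployment placeholders must be filled in):
import os
import requests

def ask_about_video(video_path: str, question: str) -> str:
    # Preprocess to 1 FPS / 360p video plus mono 16 kHz Opus audio
    video_b64, audio_b64 = preprocess_video(video_path)
    payload = {
        "model": "accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
        "max_tokens": 1000,
        "temperature": 0.3,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
                    {"type": "audio_url", "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"}},
                    {"type": "text", "text": question},
                ],
            },
        ],
    }
    response = requests.post(
        "https://api.fireworks.ai/inference/v1/chat/completions",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
        },
        json=payload,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

print(ask_about_video("input_video.mp4", "Describe what happens in this video."))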

Performance considerations

Tips for optimal throughput:
  • Preprocess all videos – 1 FPS at 360p provides good quality with minimal tokens
  • Extract audio separately – Opus/Ogg at 24kbps offers excellent compression
  • Limit video duration – Cap at 60 seconds for consistent performance
  • Use dedicated deployments – Scale replicas based on your throughput needs (see the sketch below)
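
For the last point, a sketch of adjusting replica counts on an existing deployment (assuming your firectl version supports in-place updates with the same flags used at creation):
firectl update deployment <DEPLOYMENT_ID> \
  --account-id <YOUR_ACCOUNT_ID> \
  --min-replica-count 1 \
  --max-replica-count 3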

Known limitations

  1. Video duration: Maximum 60 seconds recommended for optimal performance
  2. Supported formats: .mp4 for video, .ogg (Opus) for audio
  3. Base64 size: Total encoded payload should be under 10MB (see the size check below)
  4. Deployment required: Video models are not available on serverless; dedicated deployment required
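
For limitation 3, a cheap guard is to measure the serialized request before sending it; this sketch assumes the payload dict from the Chat Completions example above:
import json

# Size of the JSON-serialized request body, including base64 media
payload_bytes = len(json.dumps(payload).encode("utf-8"))
limit = 10 * 1024 * 1024  # 10MB cap from limitation 3
if payload_bytes > limit:
    raise ValueError(
        f"Payload is {payload_bytes / 1e6:.1f} MB; shorten the video or "
        "lower the resolution/bitrate to stay under 10MB."
    )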