Some Omni (multimodal) models can process audio and video inputs directly, enabling video captioning, scene analysis, content understanding, and multimodal question answering. A good example is Qwen3 Omni (qwen3-omni-30b-a3b-instruct), which supports video, audio, and text inputs in a single request. These models are not available on serverless, so they must run on dedicated deployments.
Create a deployment
Video and audio models require dedicated deployments. Create one using firectl:
firectl create deployment qwen3-omni-30b-a3b-instruct \
  --account-id <YOUR_ACCOUNT_ID> \
  --min-replica-count 1 \
  --max-replica-count 1 \
  --deployment-shape qwen3-omni-30b-a3b-instruct-minimal
Use the predefined qwen3-omni-30b-a3b-instruct-minimal deployment shape; the deployment will not work correctly with other shapes.
Chat Completions API
Provide video and audio as base64-encoded data URLs. The model accepts video_url, audio_url, and text content types.
import os
import base64
import requests

# Load and encode your preprocessed video and audio
with open("processed_video.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode("utf-8")
with open("audio.ogg", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

# API configuration
url = "https://api.fireworks.ai/inference/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
}

# Request payload
payload = {
    "model": "accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    "max_tokens": 1000,
    "temperature": 0.3,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
                {"type": "audio_url", "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"}},
                {"type": "text", "text": "Describe what happens in this video."},
            ],
        },
    ],
}

# Send the request and surface HTTP errors before parsing the body
response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
# Encode your files (run these before the curl command)
# Note: base64 -i <file> is the macOS invocation; on Linux (GNU coreutils),
# use base64 -w 0 <file> to read the file and disable line wrapping.
VIDEO_B64=$(base64 -i processed_video.mp4)
AUDIO_B64=$(base64 -i audio.ogg)

curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -d '{
    "model": "accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    "max_tokens": 1000,
    "temperature": 0.3,
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,'$VIDEO_B64'"}},
          {"type": "audio_url", "audio_url": {"url": "data:audio/ogg;base64,'$AUDIO_B64'"}},
          {"type": "text", "text": "Describe what happens in this video."}
        ]
      }
    ]
  }'
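Base64 encoding inflates files by roughly 33%, and the total encoded payload should stay under 10MB (see Known limitations below). A minimal pre-flight size check, assuming the payload dict built in the Python example above:

import json

# Estimate the serialized request size before sending; the total
# base64 payload should stay under 10MB (see Known limitations).
MAX_PAYLOAD_BYTES = 10 * 1024 * 1024

payload_bytes = len(json.dumps(payload).encode("utf-8"))
if payload_bytes > MAX_PAYLOAD_BYTES:
    raise ValueError(
        f"Payload is {payload_bytes / (1024 * 1024):.1f} MB; re-encode the video/audio "
        "at a lower bitrate or shorter duration before sending."
    )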
Working with videos
Video models perform best with preprocessed inputs that balance quality and token efficiency. Use ffmpeg to optimize your video and audio before sending requests.
Preprocessing video
Extract frames at 1 FPS and downscale to 360p for efficient processing:
ffmpeg -y -i input_video.mp4 \
  -t 60 \
  -vf "fps=1,scale=-1:360" \
  -c:v libx264 -preset fast \
  -an \
  processed_video.mp4
| Parameter | Description |
|---|---|
| -t 60 | Limit to first 60 seconds |
| fps=1 | Extract 1 frame per second |
| scale=-1:360 | Downscale to 360p height, maintain aspect ratio |
| -an | Remove audio track (extracted separately) |
Preprocessing audio
Extract audio as Opus in an Ogg container for optimal compression:
ffmpeg -y -i input_video.mp4 \
  -t 60 \
  -vn \
  -c:a libopus \
  -b:a 24k \
  -ar 16000 \
  -ac 1 \
  audio.ogg
| Parameter | Description |
|---|---|
| -t 60 | Limit to first 60 seconds |
| -vn | Remove video track |
| -c:a libopus | Use Opus codec |
| -b:a 24k | 24 kbps bitrate |
| -ar 16000 | 16 kHz sample rate |
| -ac 1 | Mono audio |
Complete preprocessing example
import subprocess
import tempfile
import base64
import os

def preprocess_video(video_path: str) -> tuple[str, str]:
    """
    Preprocess video for optimal model input.

    Returns:
        Tuple of (video_base64, audio_base64)
    """
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_video:
        processed_video_path = tmp_video.name
    with tempfile.NamedTemporaryFile(suffix=".ogg", delete=False) as tmp_audio:
        audio_path = tmp_audio.name

    try:
        # Process video: 1 FPS, 360p, max 60 seconds
        subprocess.run([
            "ffmpeg", "-y", "-i", video_path,
            "-t", "60",
            "-vf", "fps=1,scale=-1:360",
            "-c:v", "libx264", "-preset", "fast",
            "-an",
            processed_video_path,
        ], check=True, capture_output=True)

        # Extract audio: Opus/Ogg, mono, 16 kHz, 24 kbps
        subprocess.run([
            "ffmpeg", "-y", "-i", video_path,
            "-t", "60",
            "-vn",
            "-c:a", "libopus",
            "-b:a", "24k",
            "-ar", "16000",
            "-ac", "1",
            audio_path,
        ], check=True, capture_output=True)

        with open(processed_video_path, "rb") as f:
            video_b64 = base64.b64encode(f.read()).decode("utf-8")
        with open(audio_path, "rb") as f:
            audio_b64 = base64.b64encode(f.read()).decode("utf-8")

        return video_b64, audio_b64
    finally:
        os.unlink(processed_video_path)
        os.unlink(audio_path)
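A quick usage sketch tying this into the Chat Completions request above; the input file name is a placeholder:

# Hypothetical end-to-end usage: preprocess a raw video, then build the
# multimodal content list in the format the Chat Completions API expects.
video_b64, audio_b64 = preprocess_video("input_video.mp4")

content = [
    {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
    {"type": "audio_url", "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"}},
    {"type": "text", "text": "Describe what happens in this video."},
]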
Preprocessing is highly recommended to reduce latency and ensure consistent performance.
Tips for optimal throughput:
- Preprocess all videos – 1 FPS at 360p provides good quality with minimal tokens
- Extract audio separately – Opus/Ogg at 24kbps offers excellent compression
- Limit video duration – Cap at 60 seconds for consistent performance
- Use dedicated deployments – Scale replicas based on your throughput needs
Known limitations
- Video duration: Maximum 60 seconds recommended for optimal performance (a duration check is sketched below)
- Supported formats: .mp4 for video, .ogg (Opus) for audio
- Base64 size: Total encoded payload should be under 10MB
- Deployment required: Video models are not available on serverless; a dedicated deployment is required
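To enforce the 60-second recommendation up front, you can probe an input's duration with ffprobe (which ships alongside ffmpeg). A minimal sketch; the helper name and hard-coded file name are illustrative:

import subprocess

# Illustrative helper: read a media file's duration via ffprobe.
def media_duration_seconds(path: str) -> float:
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path,
        ],
        check=True, capture_output=True, text=True,
    )
    return float(result.stdout.strip())

if media_duration_seconds("input_video.mp4") > 60:
    print("Input exceeds 60 seconds; the -t 60 flags above will trim it during preprocessing.")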