Step 1: Open a WebSocket

Streaming transcription is performed over a WebSocket. Provide the transcription parameters and establish a WebSocket connection to the endpoint.
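A minimal sketch of assembling the connection URL from the query parameters documented below. The `/v1/audio/transcriptions/streaming` path and the helper name are assumptions for illustration, not confirmed by this page; check the official sample code for the exact path.

```python
import urllib.parse

# Assumed endpoint path -- verify against the official docs.
STREAMING_PATH = "/v1/audio/transcriptions/streaming"

def build_streaming_url(base, model="whisper-v3",
                        response_format="verbose_json", language=None):
    """Build the WebSocket URL from the documented query parameters."""
    params = {"model": model, "response_format": response_format}
    if language:
        params["language"] = language
    return f"{base}{STREAMING_PATH}?{urllib.parse.urlencode(params)}"

url = build_streaming_url("wss://audio-streaming.us-virginia-1.direct.fireworks.ai")
```

The resulting URL is what you pass to your WebSocket client, with your API key supplied in the `Authorization` header.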

Step 2: Stream audio and receive transcriptions

Stream short audio chunks (50-400 ms) as binary frames of 16-bit little-endian PCM at a 16 kHz sample rate, single channel (mono). In parallel, receive transcription messages from the WebSocket.

Headers

Authorization
string
required

Your Fireworks API key, e.g. Authorization=API_KEY.

Query Parameters

model
string
default:
"whisper-v3"

Name of the ASR model to use: one of whisper-v3 or whisper-v3-turbo. Use the following serverless endpoints for evaluation:

  • wss://audio-streaming.us-virginia-1.direct.fireworks.ai (for whisper-v3 compatible models)
  • wss://audio-streaming-turbo.us-virginia-1.direct.fireworks.ai (for whisper-v3-turbo compatible models)

response_format
string
default:
"verbose_json"

The format in which to return the response. Currently only verbose_json is recommended for streaming.

language
string | null

The target language for transcription. The set of supported target languages can be found here.

prompt
string | null

The input prompt that the model uses when generating the transcription. It can be used to specify custom words or the style of the transcription. E.g. Um, here's, uh, what was recorded. encourages the model to include filler words in the transcription.

temperature
float
default:
"0"

Sampling temperature to use when decoding text tokens during transcription.

Streaming Audio

To produce the required stream of 50-400 ms chunks of 16-bit little-endian PCM at 16 kHz mono, you will typically:

  1. Resample your audio to 16 kHz if it is not already.
  2. Convert it to mono.
  3. Send 50ms chunks (16,000 Hz * 0.05s = 800 samples) of audio in 16-bit PCM (signed, little-endian) format.
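The chunking in step 3 can be sketched as follows. This is a minimal illustration (the helper name is ours) and assumes the samples are already mono at 16 kHz:

```python
import struct

def pcm16_chunks(samples, sample_rate=16000, chunk_ms=50):
    """Yield binary frames of 16-bit little-endian PCM from mono int samples.

    At 16 kHz, a 50 ms chunk is 16,000 * 0.05 = 800 samples (1,600 bytes).
    """
    chunk_size = sample_rate * chunk_ms // 1000
    buf = []
    for s in samples:
        buf.append(max(-32768, min(32767, int(s))))  # clamp to int16 range
        if len(buf) == chunk_size:
            yield struct.pack("<%dh" % chunk_size, *buf)
            buf = []
    if buf:  # flush the final partial chunk
        yield struct.pack("<%dh" % len(buf), *buf)
```

Each yielded bytes object can be sent as one binary WebSocket frame.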

Handling Responses

The client maintains a state dictionary, starting empty ({}). The server's first transcription message contains a list of segments, each with an id and text:

# Server initial message
{
    "segments": [
        {"id": "0", "text": "This is the first sentence"},
        {"id": "1", "text": "This is the second sentence"}
    ]
}

# Client initial state
{
    "0": "This is the first sentence",
    "1": "This is the second sentence",
}

When the server sends subsequent updates, the client merges them into the state dictionary by segment id:

# Server continuous message
{
    "segments": [
        {"id": "1", "text": "This is the second sentence modified"},
        {"id": "2", "text": "This is the third sentence"}
    ]
}

# Client continuous update
{
    "0": "This is the first sentence",
    "1": "This is the second sentence modified",   # overwritten
    "2": "This is the third sentence",             # new
}
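The update rule above amounts to a dictionary merge keyed by segment id. A minimal sketch (the function name is ours):

```python
def apply_segments(state, message):
    """Merge a server message into the client state:
    overwrite existing segment ids, add new ones."""
    for seg in message.get("segments", []):
        state[seg["id"]] = seg["text"]
    return state

state = {}
apply_segments(state, {"segments": [
    {"id": "0", "text": "This is the first sentence"},
    {"id": "1", "text": "This is the second sentence"},
]})
apply_segments(state, {"segments": [
    {"id": "1", "text": "This is the second sentence modified"},
    {"id": "2", "text": "This is the third sentence"},
]})

# The full transcription is the segment texts joined in id order.
transcript = " ".join(state[k] for k in sorted(state, key=int))
```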

Example Usage
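A hedged end-to-end sketch tying the steps together. It assumes the third-party websockets package, the `/v1/audio/transcriptions/streaming` path, and raw 16 kHz mono 16-bit PCM input; none of these are confirmed by this page, so adapt to the official sample code:

```python
import asyncio
import json
import os

async def stream_transcribe(pcm_bytes, api_key,
                            url="wss://audio-streaming.us-virginia-1.direct.fireworks.ai"
                                "/v1/audio/transcriptions/streaming?model=whisper-v3"):
    import websockets  # third-party: pip install websockets

    state = {}
    # Note: websockets >= 14 renames extra_headers to additional_headers.
    async with websockets.connect(url, extra_headers={"Authorization": api_key}) as ws:
        async def sender():
            chunk = 1600  # 50 ms of 16 kHz mono 16-bit PCM (800 samples * 2 bytes)
            for i in range(0, len(pcm_bytes), chunk):
                await ws.send(pcm_bytes[i:i + chunk])
                await asyncio.sleep(0.05)  # pace roughly in real time

        async def receiver():
            async for msg in ws:
                for seg in json.loads(msg).get("segments", []):
                    state[seg["id"]] = seg["text"]

        send_task = asyncio.create_task(sender())
        recv_task = asyncio.create_task(receiver())
        await send_task
        # In practice, wait for the final server message before closing.
        recv_task.cancel()
    return state

if __name__ == "__main__":
    audio = open("audio.pcm", "rb").read()  # raw 16-bit 16 kHz mono PCM
    print(asyncio.run(stream_transcribe(audio, os.environ["FIREWORKS_API_KEY"])))
```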

Dedicated endpoint

For fixed throughput and predictable SLAs, you may request a dedicated endpoint for streaming transcription via inquiries@fireworks.ai or Discord.