Step 1: Open a WebSocket

Streaming transcription is performed over a WebSocket. Provide the transcription parameters and establish a WebSocket connection to the endpoint.
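A minimal sketch of assembling the connection URL from the query parameters documented below. The `/v1/audio/transcriptions/streaming` path and the helper name are assumptions for illustration, not confirmed by this page; check the official sample code for the exact path.

```python
import urllib.parse

# Assumed endpoint path -- verify against the official docs.
STREAMING_PATH = "/v1/audio/transcriptions/streaming"

def build_streaming_url(base, model="whisper-v3",
                        response_format="verbose_json", language=None):
    """Build the WebSocket URL from the documented query parameters."""
    params = {"model": model, "response_format": response_format}
    if language:
        params["language"] = language
    return f"{base}{STREAMING_PATH}?{urllib.parse.urlencode(params)}"

url = build_streaming_url("wss://audio-streaming.us-virginia-1.direct.fireworks.ai")
```

The resulting URL is what you pass to your WebSocket client, with your API key supplied in the `Authorization` header.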

Step 2: Stream audio and receive transcriptions

Stream short audio chunks (50-400 ms) as binary frames of 16-bit little-endian PCM at a 16 kHz sample rate, single channel (mono). In parallel, receive transcription messages from the WebSocket.

Headers

Authorization
string
required

Your Fireworks API key, e.g. Authorization=API_KEY.

Query Parameters

model
string
default:
"whisper-v3"

Name of the ASR model to use: one of whisper-v3 or whisper-v3-turbo. Use the following serverless endpoints for evaluation:

  • wss://audio-streaming.us-virginia-1.direct.fireworks.ai (for whisper-v3 compatible models)
  • wss://audio-streaming-turbo.us-virginia-1.direct.fireworks.ai (for whisper-v3-turbo compatible models)

response_format
string
default:
"verbose_json"

The format in which to return the response. Currently only verbose_json is recommended for streaming.

language
string | null

The target language for transcription. The set of supported target languages can be found here.

prompt
string | null

The input prompt that the model uses when generating the transcription. It can be used to specify custom words or the style of the transcription. E.g. Um, here's, uh, what was recorded. encourages the model to include filler words in the transcription.

temperature
float
default:
"0"

Sampling temperature to use when decoding text tokens during transcription.

Streaming Audio

To produce the required stream of 50-400 ms chunks of 16-bit little-endian PCM at 16 kHz mono, you will typically:

  1. Resample your audio to 16 kHz if it is not already.
  2. Convert it to mono.
  3. Send 50ms chunks (16,000 Hz * 0.05s = 800 samples) of audio in 16-bit PCM (signed, little-endian) format.
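The chunking in step 3 can be sketched as follows. This is a minimal illustration (the helper name is ours) and assumes the samples are already mono at 16 kHz:

```python
import struct

def pcm16_chunks(samples, sample_rate=16000, chunk_ms=50):
    """Yield binary frames of 16-bit little-endian PCM from mono int samples.

    At 16 kHz, a 50 ms chunk is 16,000 * 0.05 = 800 samples (1,600 bytes).
    """
    chunk_size = sample_rate * chunk_ms // 1000
    buf = []
    for s in samples:
        buf.append(max(-32768, min(32767, int(s))))  # clamp to int16 range
        if len(buf) == chunk_size:
            yield struct.pack("<%dh" % chunk_size, *buf)
            buf = []
    if buf:  # flush the final partial chunk
        yield struct.pack("<%dh" % len(buf), *buf)
```

Each yielded bytes object can be sent as one binary WebSocket frame.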

Handling Responses

The client maintains a state dictionary, starting empty ({}). The server's first transcription message contains a list of segments, each with an id and text:

# Server initial message
{
    "segments": [
        {"id": "0", "text": "This is the first sentence"},
        {"id": "1", "text": "This is the second sentence"}
    ]
}

# Client initial state
{
    "0": "This is the first sentence",
    "1": "This is the second sentence",
}

When the server sends subsequent updates, the client merges them into the state dictionary by segment id:

# Server continuous message
{
    "segments": [
        {"id": "1", "text": "This is the second sentence modified"},
        {"id": "2", "text": "This is the third sentence"}
    ]
}

# Client continuous update
{
    "0": "This is the first sentence",
    "1": "This is the second sentence modified",   # overwritten
    "2": "This is the third sentence",             # new
}
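The update rule above amounts to a dictionary merge keyed by segment id. A minimal sketch (the function name is ours):

```python
def apply_segments(state, message):
    """Merge a server message into the client state:
    overwrite existing segment ids, add new ones."""
    for seg in message.get("segments", []):
        state[seg["id"]] = seg["text"]
    return state

state = {}
apply_segments(state, {"segments": [
    {"id": "0", "text": "This is the first sentence"},
    {"id": "1", "text": "This is the second sentence"},
]})
apply_segments(state, {"segments": [
    {"id": "1", "text": "This is the second sentence modified"},
    {"id": "2", "text": "This is the third sentence"},
]})

# The full transcription is the segment texts joined in id order.
transcript = " ".join(state[k] for k in sorted(state, key=int))
```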

Example Usage
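A hedged end-to-end sketch tying the steps together. It assumes the third-party websockets package, the `/v1/audio/transcriptions/streaming` path, and raw 16 kHz mono 16-bit PCM input; none of these are confirmed by this page, so adapt to the official sample code:

```python
import asyncio
import json
import os

async def stream_transcribe(pcm_bytes, api_key,
                            url="wss://audio-streaming.us-virginia-1.direct.fireworks.ai"
                                "/v1/audio/transcriptions/streaming?model=whisper-v3"):
    import websockets  # third-party: pip install websockets

    state = {}
    # Note: websockets >= 14 renames extra_headers to additional_headers.
    async with websockets.connect(url, extra_headers={"Authorization": api_key}) as ws:
        async def sender():
            chunk = 1600  # 50 ms of 16 kHz mono 16-bit PCM (800 samples * 2 bytes)
            for i in range(0, len(pcm_bytes), chunk):
                await ws.send(pcm_bytes[i:i + chunk])
                await asyncio.sleep(0.05)  # pace roughly in real time

        async def receiver():
            async for msg in ws:
                for seg in json.loads(msg).get("segments", []):
                    state[seg["id"]] = seg["text"]

        send_task = asyncio.create_task(sender())
        recv_task = asyncio.create_task(receiver())
        await send_task
        # In practice, wait for the final server message before closing.
        recv_task.cancel()
    return state

if __name__ == "__main__":
    audio = open("audio.pcm", "rb").read()  # raw 16-bit 16 kHz mono PCM
    print(asyncio.run(stream_transcribe(audio, os.environ["FIREWORKS_API_KEY"])))
```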

Dedicated endpoint

For fixed throughput and predictable SLAs, you may request a dedicated endpoint for streaming transcription via inquiries@fireworks.ai or Discord.