1. Open a WebSocket

Open a WebSocket connection to the endpoint.
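
A minimal sketch of this step in Python using the third-party websockets package. The endpoint URL and authentication header below are placeholders for illustration, not documented values.

```python
# Sketch only: the URL and auth header are assumptions, not the documented
# endpoint. Requires `pip install websockets`.
import websockets

async def open_connection(api_key: str):
    # websockets v14+ uses `additional_headers`; older releases call the
    # same argument `extra_headers`.
    return await websockets.connect(
        "wss://api.example.com/v1/audio/transcriptions/streaming",
        additional_headers={"Authorization": f"Bearer {api_key}"},
    )
```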

2. Transcribe chunk by chunk

Stream audio data over the WebSocket and receive transcriptions back as they are produced.
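
A sketch of the send/receive loop, assuming the connection from step 1 and the default json response format; the exact shape of the server's messages is an assumption here.

```python
# Sketch: send audio frames and read transcription messages concurrently,
# so text arrives while audio is still being streamed.
import asyncio
import json
from typing import AsyncGenerator

async def transcribe(ws, audio_generator: AsyncGenerator[bytes, None]) -> None:
    async def send_audio():
        async for chunk in audio_generator:
            await ws.send(chunk)  # binary audio chunk

    async def receive_text():
        async for message in ws:
            result = json.loads(message)  # assumes response_format="json"
            print(result["text"])

    await asyncio.gather(send_audio(), receive_text())
```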



Input

audio_generator
AsyncGenerator[bytes, None]
required

An async generator that yields audio data chunks.
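
For illustration, one way to build such a generator from a local file; the 32 KiB chunk size is an arbitrary choice, not a documented requirement.

```python
# Illustrative async generator over a local file.
import asyncio

async def file_chunks(path: str, chunk_size: int = 32_768):
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
            # A real-time source would pace its chunks; a file read can
            # simply yield control back to the event loop.
            await asyncio.sleep(0)
```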

model
string
default: "whisper-v3"

String name of the ASR model to use. Can be one of whisper-v3 or whisper-v3-turbo.

vad_model
string
default: "silero"

String name of the voice activity detection (VAD) model to use. Can be one of silero or whisperx-pyannet.

alignment_model
string
default: "tdnn_ffn"

String name of the alignment model to use. Can be one of tdnn_ffn, mms_fa, or gentle.

language
string | null

The target language for transcription. The set of supported target languages can be found here.

prompt
string | null

The input prompt with which to prime transcription. This can be used, for example, to continue a prior transcription given new audio data.
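
A hypothetical sketch of that continuation pattern; the parameter dict shape is assumed for illustration only.

```python
# Hypothetical: feed the prior transcript back in so decoding continues
# from that context on the next batch of audio.
previous_text = "So far we have discussed..."  # `text` from the earlier response
params = {
    "model": "whisper-v3",
    "prompt": previous_text,
}
```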

temperature
float
default: 0

Sampling temperature to use when decoding text tokens during transcription.

response_format
string
default: "json"

The format in which to return the response. Can be one of json, text, srt, verbose_json, or vtt.

timestamp_granularities
string

The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Can be word, segment, or both. If not present, defaults to segment.
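
A hypothetical parameter combination requesting both granularities; how multiple values are encoded (comma-separated here) is an assumption.

```python
# Hypothetical: request word- and segment-level timestamps together.
# verbose_json is required whenever timestamp_granularities is set.
params = {
    "response_format": "verbose_json",
    "timestamp_granularities": "word,segment",  # assumption: comma-separated for both
}
```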

preprocessing
string

Audio preprocessing mode. Currently supported:

  • none to skip audio preprocessing.
  • dynamic for arbitrary audio content with variable loudness.
  • soft_dynamic for speech-intense recordings such as podcasts and voice-overs.
  • bass_dynamic for boosting lower frequencies.

Output

text
string
required