Transcribe audio (realtime)
Open a WebSocket
Open a WebSocket connection to the endpoint.
Transcribe chunk by chunk
Stream audio data to the WebSocket and receive transcription from the WebSocket.
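The send-while-receiving pattern described above can be sketched as two coroutines that run concurrently. This is only an illustration: `LoopbackSocket` is a hypothetical in-memory stand-in, not the service's client, and the message text it produces is made up. A real connection would come from a WebSocket library and would return the service's own response payloads.

```python
import asyncio


class LoopbackSocket:
    """Hypothetical in-memory stand-in for a WebSocket connection."""

    def __init__(self):
        self._inbox = asyncio.Queue()

    async def send(self, data: bytes) -> None:
        # Pretend the server answers every audio chunk with one message.
        await self._inbox.put(f"received {len(data)} bytes")

    async def close_input(self) -> None:
        # None marks the end of the stream in this sketch.
        await self._inbox.put(None)

    async def recv(self):
        return await self._inbox.get()


async def send_audio(sock, chunks):
    """Push every audio chunk, then signal that no more audio follows."""
    for chunk in chunks:
        await sock.send(chunk)
    await sock.close_input()


async def receive_text(sock):
    """Collect messages until the end-of-stream marker arrives."""
    results = []
    while (message := await sock.recv()) is not None:
        results.append(message)
    return results


async def main():
    sock = LoopbackSocket()
    chunks = [b"\x00" * 320, b"\x00" * 160]
    # Sending and receiving run concurrently on the same connection.
    _, messages = await asyncio.gather(send_audio(sock, chunks), receive_text(sock))
    return messages


messages = asyncio.run(main())
print(messages)
```

Running the sender and receiver with `asyncio.gather` keeps audio flowing while partial results are read, rather than waiting for the whole upload to finish.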
Try basic notebook
Map audio stream to text.
Try verbose output transcription
Detailed output with word-level timestamps.
Input
An async generator that yields audio data chunks.
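A minimal sketch of such an async generator, assuming the audio is already in memory as raw bytes. The function name `audio_chunks` and the default chunk size are illustrative choices, not values taken from this page.

```python
import asyncio
from typing import AsyncIterator


async def audio_chunks(pcm: bytes, chunk_size: int = 3200) -> AsyncIterator[bytes]:
    """Yield successive fixed-size slices of a raw audio buffer."""
    for start in range(0, len(pcm), chunk_size):
        yield pcm[start:start + chunk_size]
        # Hand control back to the event loop between chunks.
        await asyncio.sleep(0)


async def collect(source):
    """Drain an async generator into a list (for demonstration only)."""
    return [chunk async for chunk in source]


chunks = asyncio.run(collect(audio_chunks(b"\x00" * 8000, chunk_size=3200)))
print([len(c) for c in chunks])
```

The last chunk may be shorter than `chunk_size` when the buffer length is not an exact multiple of it.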
String name of the ASR model to use. Can be one of whisper-v3 or whisper-v3-turbo.
String name of the voice activity detection (VAD) model to use. Can be one of silero or whisperx-pyannet.
String name of the alignment model to use. Can be one of tdnn_ffn, mms_fa, or gentle.
The target language for transcription. The set of supported target languages can be found here.
The input prompt with which to prime transcription. This can be used, for example, to continue a prior transcription given new audio data.
Sampling temperature to use when decoding text tokens during transcription.
The format in which to return the response. Can be one of json, text, srt, verbose_json, or vtt.
The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of these options are supported: word and segment. If not present, defaults to segment.
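The constraint between the response format and timestamp granularities can be checked on the client before a request is made. The sketch below is an assumption-laden illustration: `check_options` is a hypothetical helper, and returning an empty list for formats other than verbose_json is a choice made here, not behaviour documented on this page.

```python
ALLOWED_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
ALLOWED_GRANULARITIES = {"word", "segment"}


def check_options(response_format, timestamp_granularities=None):
    """Validate the two options together and return the granularities to send."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")

    granularities = list(timestamp_granularities or [])
    unknown = set(granularities) - ALLOWED_GRANULARITIES
    if unknown:
        raise ValueError(f"unsupported granularities: {sorted(unknown)}")

    # Granularities are only meaningful together with verbose_json.
    if granularities and response_format != "verbose_json":
        raise ValueError("timestamp granularities require response_format='verbose_json'")

    # Mirror the stated default: segment when nothing is given.
    if response_format == "verbose_json" and not granularities:
        return ["segment"]
    return granularities


print(check_options("verbose_json"))
print(check_options("verbose_json", ["word", "segment"]))
```

Failing early on an invalid combination gives a clearer error than a rejected request partway through a stream.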
Audio preprocessing mode. Currently supported:
none to skip audio preprocessing.
dynamic for arbitrary audio content with variable loudness.
soft_dynamic for speech-intensive recordings such as podcasts and voice-overs.
bass_dynamic for boosting lower frequencies.
Output