Welcome to Fireworks’ Voice Agent Platform beta. The testing UI and endpoint hit a free, rate-limited deployment meant to preview the latency and end-to-end experience of voice agents on Fireworks. Fill out this form or schedule time with our PM to request full access and customization options, such as the ability to swap models or change additional parameters. The form can also be used to request a design partnership, in which the Fireworks team optimizes your voice agent stack for performance and quality.
The Fireworks voice agent stack is a co-located transcription -> LLM -> voice output deployment. Settings for each component can be individually configured as described below.
Parameters
General Parameters
- Server URL: Fireworks’ default co-located endpoint is “wss://audio-agent.link.fireworks.ai/v1/audio/agent”. Swap in a different URL only if you’ve been instructed to use a personalized endpoint.
- API Key: Use your Fireworks API key
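For reference, a minimal Python client connection might look like the sketch below. This is illustrative only: it assumes the `websockets` package and that the API key is sent as a Bearer token in the Authorization header, and the config payload is a placeholder rather than a documented schema. Confirm the exact auth scheme and message shapes for your endpoint.

```python
import asyncio
import json
import os

import websockets  # pip install websockets


async def main():
    url = "wss://audio-agent.link.fireworks.ai/v1/audio/agent"
    # Assumption: Bearer-token auth header; check your access instructions.
    headers = {"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"}
    # websockets >= 14 names this parameter `additional_headers`;
    # older releases call it `extra_headers`.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Pipeline stages accept JSON configuration messages; this payload
        # is an illustrative placeholder, not a documented schema.
        await ws.send(json.dumps({"config": {"tts_speed": 1.0}}))
        print(await ws.recv())  # first message streamed back by the server


asyncio.run(main())
```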
LLM Features
The voice agent UI uses an LLM to generate responses. Any LLM on the Fireworks platform (including fine-tuned models) can be used to power voice agents. Fireworks has pre-selected a model for the testing endpoint; use this form to request an endpoint with a different model.
- System prompt: System prompt for the LLM
- Tools:
  - Tools for the LLM, defined in OpenAI’s function-calling JSON format. See the OpenAI or Fireworks docs for instructions on defining tools. Inline comments are not supported.
  - Specify instructions for calling the tool in the LLM system prompt.
  - The voice agent UI assumes that tools always produce a response. Provide the response to the LLM as JSON (see the sketch after this list). For example, a scheduling tool could return `{"scheduling": "succeeded"}` or `{"suggested_time": "3 pm"}`.
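As a client-side sketch of that contract: the exact wire format for tool calls and tool results isn’t documented here, so the `name`, `type`, and `content` fields below are illustrative assumptions, not the platform’s actual message schema.

```python
import json


def handle_tool_call(call: dict) -> dict:
    """Run a requested tool and package its result as JSON for the LLM."""
    if call.get("name") == "schedule_call":
        # Replace with your real scheduling logic.
        result = {"scheduling": "succeeded", "suggested_time": "3 pm"}
    else:
        result = {"error": f"unknown tool: {call.get('name')}"}
    # Hypothetical message shape: the result payload itself is JSON,
    # which is what the voice agent UI expects the LLM to receive.
    return {"type": "tool_result", "content": json.dumps(result)}
```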
Audio Transcription (ASR) Features
- Echo Cancellation (AEC) - Enable echo cancellation if you are not using headphones; disable it otherwise to avoid introducing artifacts.
- High-pass filter and noise suppression - Both help eliminate noise if you anticipate usage in noisy environments; disable them in quiet environments.
- Auto-gain control - Helps stabilize user volume when users may be at a large or changing distance from the mic. Disable it if volume is stable.
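These toggles belong to the ASR stage of the pipeline, which, like the other stages, is configured through JSON messages. A sketch with illustrative key names (not the documented schema):

```json
{
  "audio_processing": {
    "echo_cancellation": true,
    "high_pass_filter": true,
    "noise_suppression": true,
    "auto_gain_control": false
  }
}
```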
End-of-Utterance Features
End-of-utterance detection determines when a user has finished speaking and the voice agent should respond.
- Minimum delay - Minimum time in seconds before the agent can respond. A lower value reduces the fastest possible response time but may lead to users being interrupted between sentences.
- Max interrupt delay - Maximum time in seconds before the agent responds if the user has said anything. A lower value reduces the slowest possible response time but may lead to users being interrupted between sentences.
- Max follow-up delay - Maximum delay in seconds before the agent starts speaking if the user has not said anything. A lower value means the agent follows up more aggressively during periods of silence.
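As a sketch, the three timing knobs might appear in a JSON configuration message like the following; the key names and values are illustrative assumptions:

```json
{
  "end_of_utterance": {
    "min_delay": 0.3,
    "max_interrupt_delay": 2.0,
    "max_follow_up_delay": 5.0
  }
}
```

Tuning is a trade-off: shrinking the delays makes the agent feel snappier but raises the odds of cutting the user off mid-thought.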
TTS Features
- Choosing voices: Voices that begin with “fw” are Fireworks voice models; voices that do not are powered by the open-source Kokoro model. Fireworks voice models can be prompted via IPA for specific pronunciation (see guide), while Kokoro models support non-English languages.
- Changing voice language: Different Kokoro voices correspond to different languages; for example, voices starting with “e” are Spanish (see the full list for specifics). To have text generated in a different language, write your system prompt in that language. You may also need to explicitly instruct the LLM to respond in a particular language.
- TTS Speed: Change how quickly the voice model speaks.
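A corresponding sketch of the TTS block in a JSON configuration message; the voice name and key names here are placeholders, not real identifiers:

```json
{
  "tts": {
    "voice": "fw_example_voice",
    "speed": 1.1
  }
}
```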
Custom Pronunciation via International Phonetic Alphabet
Have words that need to be pronounced a particular way? For example, say you want the British pronunciation of Nike (one syllable) instead of the American one. Fireworks TTS supports vocalizing precise pronunciation via the International Phonetic Alphabet (IPA), a notation that represents the individual sounds of spoken language. For example, the British pronunciation of Nike is represented as <ipa>nˈaɪk</ipa> in IPA.
To use custom pronunciation, prompt your LLM to output IPA in every place the word would have been generated. For example, we use the prompt:
Pronunciation specifics:
Your output will be fed to a text-to-speech model to vocalize. We want specific pronunciation for Nike, Porsche and Volkswagen. Every time you would have said Nike, do not say Nike; instead use the IPA representation of the pronunciation: <ipa>nˈaɪk</ipa>. Same for Porsche and Volkswagen, where you should output <ipa>pˈɔɹʃə</ipa> and <ipa>fˈɔlksvaːɡən</ipa> respectively.
This enables you to override default pronunciations and ensure that each phoneme is rendered exactly as intended. Use the guidelines below to learn the specific IPA syntax we support. Note that we specifically use the “eSpeak” formatting of IPA, which may differ from online IPA generators.
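With a prompt like the one above, the LLM’s raw output might read something like: “You’ll love the new <ipa>nˈaɪk</ipa> lineup.” The TTS stage vocalizes the tagged span from its phonemes and reads the surrounding text normally.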
Generating IPA
To generate IPA, use the Python script below. We’ve also had success prompting ChatGPT to generate IPA when providing it with our syntax reference (see the example prompt).
Syntax Reference
Allowed Symbols
| Category | Symbols |
| --- | --- |
| Stress | ˈ (primary), ˌ (secondary) |
| Consonants | b, d, f, h, j, k, l, m, n, p, s, t, v, w, z, ɡ, ŋ, ɹ, ʃ, ʒ, ð, θ, ɾ |
| Vowels | ə, i, u, ɑ, ɔ, ɛ, ɜ, ɪ, ʊ, ʌ, æ, a, o, ɒ, ᵻ, ɐ, ː |

WARNING: the primary stress mark ˈ is not a normal apostrophe (').
Stress-Mark Rules
- Place ˈ or ˌ immediately before the vowel that carries stress.
- Never put a stress mark after the vowel or before a consonant.
- Use one primary stress for any monosyllable; add secondary stress only when needed in longer words.
<ipa> Tag Syntax
Wrap each transcription in angle-bracket tags: <ipa>…</ipa>.
Examples
| Word / Phrase | IPA options |
| --- | --- |
| GIF | <ipa>dʒˈɪf</ipa> or <ipa>ɡˈɪf</ipa> |
| SQL | <ipa>sˈikwəl</ipa> or <ipa>ɛskjuːɛl</ipa> |
| JSON | <ipa>dʒˈeɪsɑn</ipa> or <ipa>dʒˈeɪsᵊn</ipa> |
| Nike | <ipa>nˈaɪki</ipa> (US) or <ipa>nˈaɪk</ipa> (UK) |
| the quick brown fox | <ipa>ðə kwˈɪk bɹˈaʊn fˈɑks</ipa> |
Python Script for generating IPA
You can generate valid eSpeak IPA strings from English by installing the `misaki` Python package and using the following code snippet:
```python
from misaki.en import G2P


class FireworksG2P(G2P):
    """G2P wrapper that maps misaki's phoneme aliases to eSpeak-style IPA."""

    def __init__(self):
        super().__init__()
        # misaki emits single-character aliases for affricates and diphthongs;
        # expand them into the IPA symbols the Fireworks TTS endpoint expects.
        self._normalization_map = {
            "ʤ": "dʒ",
            "ʧ": "tʃ",
            "A": "eɪ",
            "I": "aɪ",
            "W": "aʊ",
            "Y": "ɔɪ",
            "O": "oʊ",
            "Q": "əʊ",
        }

    def _normalize_phonemes(self, phonemes: str) -> str:
        result = phonemes
        for old, new in self._normalization_map.items():
            result = result.replace(old, new)
        return result

    def __call__(self, text) -> str:
        # The base class returns (phonemes, tokens); we only need the phonemes.
        raw_ipa, _ = super().__call__(text)
        return self._normalize_phonemes(raw_ipa)


g2p = FireworksG2P()
print(f"<ipa>{g2p('gif')}</ipa>")
print(f"<ipa>{g2p('jewel')}</ipa>")
```
Example prompt
Persona: You are Flame, the helpful sales representative from Fireworks AI. You help customers to understand Fireworks' offerings and help them to solve their problems using Fireworks.
Product: You are focusing on doing outbound sales for the Fireworks Voice Agent platform. The Voice Agent Platform is described as follows:
* It is an easy-to-use real-time WebSockets API, which lets them stream both speech and JSON to the server while getting speech and JSON outputs streamed right back. This helps developers to create responsive conversational experiences across verticals like customer service, drive-through, sales, video conferencing, companionship, and more.
* The API is backed by a state-of-the-art real-time stack developed by Fireworks. Fireworks is a deep learning inference provider, and the Voice Agents platform makes use of all of their infrastructure and expertise in AI modeling, GPU optimization, and cloud infrastructure.
* The architecture consists of Automatic Speech Recognition (ASR), End-of-Utterance Detection, Large Language Models (LLMs), and Text-to-speech. It brings together all these pieces to deliver state-of-the-art AI intelligence through a voice interface.
* The components are orchestrated together with a co-located cloud deployment, minimizing network latency within the pipeline, and with special decoding tricks to maximize parallelism of the pipeline stages.
* Intelligent end-of-utterance detection analyzes the semantics of the user's last utterance to determine if they want a response or not. This allows for fast responses when the user wants one or holding off while the user is still speaking or is thinking.
* Due to Fireworks' focus on speed, the system shows an average of 600ms latency from the time the user stops speaking to when the first TTS byte is returned. This allows for smooth conversational experiences.
* All stages of the pipeline can be configured through JSON configuration messages. For example, the system prompt and generation parameters of the LLM can be configured. We have upcoming support for things like custom vocabulary -- if the user asked about this direct them to talk to one of our Engineers.
* We are mindful of out-of-the-box integrations, such as with telephony systems (e.g. through Twilio), drive-thru, and video conferencing platforms. We do not enable these out of the box, but direct the user to talk to an engineer to better understand their requirements.
* A common competing solution is to compose a separate ASR service, a voice activity detection (VAD) model, an LLM like GPT-4o, and a separate TTS service.
Sales flow: To sell the product, we recommend the following flow:
* Greet the customer, introduce yourself, and say why you're calling (e.g. follow-up from their reach-out).
* Scope out their current product needs/goals. For example, are they currently building voice agents? If so, what's their current approach? What do they like about the current approach and what are its pain points?
* Guide the customer through our product offerings, starting from the high-level product definition and digging into the details relevant to what they've identified as pain points.
Next steps: Once the customer is interested in our solution, offer to schedule a call with the lead engineer on the project, James. James can help answer technical questions, scope out solutions to unique customer challenges, and set up a proof-of-concept deployment for the customer to try the product for real. James is usually available Wednesday through Friday afternoons. ASK THE CUSTOMER FOR THEIR DATE AND TIME AVAILABILITY BEFORE SCHEDULING A CALL. Use the `schedule_call` tool to schedule a follow-up call with a Fireworks AI employee. EMIT A FUNCTION CALL AT THE BEGINNING OF A CHAT TURN. DO NOT EMIT IT INLINE. Only call schedule_call once, unless you want to edit the details.
Throughout the conversation, keep a conversational tone. Focus on asking directed questions and giving directed answers. Avoid regurgitating bulk information all at once. Questions should always be at the end of the turn. Do not ask a question and then continue to give other information. DO NOT USE LISTS OR MARKDOWN FORMATTING. DO NOT USE HEADERS (#, ##, ###). DO NOT USE NUMBERED LISTS. Information that could be represented as a list should be represented as different message turns.
If the customer asks to speak to a person, immediately try to arrange a conversation with James.
Be succinct in your responses. Don't respond with more than a few sentences at once.
Information about the specific caller is available below:
Current date: Monday 2025-04-07
Name: Ray
Company: Fennec AI
Context: Ray reached out to us on our "Contact Us" form.
DO NOT USE MARKDOWN. DO NOT USE BULLETED OR NUMBERED LISTS. DO NOT USE ASTERISKS.
Example tools
```json
{
  "type": "function",
  "function": {
    "name": "schedule_call",
    "description": "A function to schedule an appointment with one of our employees",
    "parameters": {
      "type": "object",
      "properties": {
        "username": {
          "type": "string",
          "description": "The username of the employee to schedule a meeting with. For example, 'dmitry'"
        },
        "date": {
          "type": "string",
          "description": "The date on which to schedule the appointment, for example 2025-04-02"
        },
        "time": {
          "type": "string",
          "description": "The time at which to schedule the appointment, in PST. For example, 13:00"
        },
        "notes": {
          "type": "string",
          "description": "Notes to add to the calendar invite, e.g. additional context for the meeting"
        }
      },
      "required": [
        "username",
        "date",
        "time"
      ]
    }
  }
}
```