# Custom SSO
Set up custom Single Sign-On (SSO) authentication for Fireworks AI
Fireworks uses single sign-on (SSO) as the primary mechanism to authenticate with the platform.
By default, Fireworks supports Google SSO.
If you have an enterprise account, Fireworks supports bringing your own identity provider using:
* OpenID Connect (OIDC) provider
* SAML 2.0 provider
Coordinate with your Fireworks AI representative to enable the integration.
## OpenID Connect (OIDC) provider
Create an OIDC client application in your identity provider, e.g. Okta.
Ensure the client is configured for the "authorization code" grant as a "web" application type (i.e. with a `client_secret`).
Set the client's "allowed redirect URL" to the URL provided by Fireworks. It looks like:
```
https://fireworks-.auth.us-west-2.amazoncognito.com/oauth2/idpresponse
```
Note down the `issuer`, `client_id`, and `client_secret` for the newly created client. You will need to provide these to your Fireworks AI representative to complete your account setup.
## SAML 2.0 provider
Create a SAML 2.0 application in your identity provider, e.g. [Okta](https://help.okta.com/en-us/Content/Topics/Apps/Apps_App_Integration_Wizard_SAML.htm).
Set the SSO URL to the URL provided by Fireworks. It looks like:
```
https://fireworks-.auth.us-west-2.amazoncognito.com/saml2/idpresponse
```
Configure the Audience URI (SP Entity ID) as provided by Fireworks. It looks like:
```
urn:amazon:cognito:sp:
```
Create an Attribute Statement with the name:
```
http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress
```
and the value `user.email`
Leave the rest of the settings as defaults.
Note down the "metadata url" for your newly created application. You will need to provide this to your Fireworks AI representative to complete your account set up.
## Troubleshooting
### Invalid samlResponse or relayState from identity provider
This error occurs if you are trying to use identity provider (IdP) initiated login. Fireworks currently only supports
service provider (SP) initiated login.
See [Understanding SAML](https://developer.okta.com/docs/concepts/saml/#understand-sp-initiated-sign-in-flow) for an
in-depth explanation.
### Required String parameter 'RelayState' is not present
See above.
# Managing users
Add and delete additional users in your Fireworks account
See the concepts [page](/getting-started/concepts#account) for definitions of accounts and users. Only admin users can manage other users within the account.
## Adding users
To add a new user to your Fireworks account, run the following command. If the email for the new user is already associated with a Fireworks account, they will have the option to freely switch between your account and their existing account(s). You can also add users in the Fireworks web UI at [https://fireworks.ai/account/users](https://fireworks.ai/account/users).
```bash
firectl create user --email="alice@example.com"
```
To create another admin user, pass the `--role=admin` flag:
```bash
firectl create user --email="alice@example.com" --role=admin
```
## Updating a user's role
To update a user's role, run:
```bash
firectl update user --role="{admin,user}"
```
## Deleting users
You can remove a user from your account by running:
```bash
firectl delete user
```
# Batch Delete Batch Jobs
post /v1/accounts/{account_id}/batchJobs:batchDelete
# Batch Delete Environments
post /v1/accounts/{account_id}/environments:batchDelete
# Batch Delete Node Pools
post /v1/accounts/{account_id}/nodePools:batchDelete
# Cancel Batch Job
post /v1/accounts/{account_id}/batchJobs/{batch_job_id}:cancel
Cancels an existing batch job if it is queued, pending, or running.
# Connect Environment
post /v1/accounts/{account_id}/environments/{environment_id}:connect
Connects the environment to a node pool.
Returns an error if there is an existing pending connection.
# Create Aws Iam Role Binding
post /v1/accounts/{account_id}/awsIamRoleBindings
# Create Batch Job
post /v1/accounts/{account_id}/batchJobs
# Create Cluster
post /v1/accounts/{account_id}/clusters
# Create Environment
post /v1/accounts/{account_id}/environments
# Create Node Pool
post /v1/accounts/{account_id}/nodePools
# Create Node Pool Binding
post /v1/accounts/{account_id}/nodePoolBindings
# Create Snapshot
post /v1/accounts/{account_id}/snapshots
# Delete Aws Iam Role Binding
post /v1/accounts/{account_id}/awsIamRoleBindings:delete
# Delete Batch Job
delete /v1/accounts/{account_id}/batchJobs/{batch_job_id}
# Delete Cluster
delete /v1/accounts/{account_id}/clusters/{cluster_id}
# Delete Environment
delete /v1/accounts/{account_id}/environments/{environment_id}
# Delete Node Pool
delete /v1/accounts/{account_id}/nodePools/{node_pool_id}
# Delete Node Pool Binding
post /v1/accounts/{account_id}/nodePoolBindings:delete
# Delete Snapshot
delete /v1/accounts/{account_id}/snapshots/{snapshot_id}
# Disconnect Environment
post /v1/accounts/{account_id}/environments/{environment_id}:disconnect
Disconnects the environment from the node pool. Returns an error
if the environment is not connected to a node pool.
# Get Batch Job
get /v1/accounts/{account_id}/batchJobs/{batch_job_id}
# Get Batch Job Logs
get /v1/accounts/{account_id}/batchJobs/{batch_job_id}:getLogs
# Get Cluster
get /v1/accounts/{account_id}/clusters/{cluster_id}
# Get Cluster Connection Info
get /v1/accounts/{account_id}/clusters/{cluster_id}:getConnectionInfo
Retrieve connection settings for the cluster to be put in kubeconfig
# Get Environment
get /v1/accounts/{account_id}/environments/{environment_id}
# Get Node Pool
get /v1/accounts/{account_id}/nodePools/{node_pool_id}
# Get Node Pool Stats
get /v1/accounts/{account_id}/nodePools/{node_pool_id}:getStats
# Get Snapshot
get /v1/accounts/{account_id}/snapshots/{snapshot_id}
# List Aws Iam Role Bindings
get /v1/accounts/{account_id}/awsIamRoleBindings
# List Batch Jobs
get /v1/accounts/{account_id}/batchJobs
# List Clusters
get /v1/accounts/{account_id}/clusters
# List Environments
get /v1/accounts/{account_id}/environments
# List Node Pool Bindings
get /v1/accounts/{account_id}/nodePoolBindings
# List Node Pools
get /v1/accounts/{account_id}/nodePools
# List Snapshots
get /v1/accounts/{account_id}/snapshots
# Update Batch Job
patch /v1/accounts/{account_id}/batchJobs/{batch_job_id}
# Update Cluster
patch /v1/accounts/{account_id}/clusters/{cluster_id}
# Update Environment
patch /v1/accounts/{account_id}/environments/{environment_id}
# Update Node Pool
patch /v1/accounts/{account_id}/nodePools/{node_pool_id}
# Align transcription
post /audio/alignments
### Request
##### (multi-part form)
The input audio file to align with text. Common file formats such as mp3, flac, and wav are supported. Note that the audio will be resampled to 16kHz, downmixed to mono, and reformatted to 16-bit signed little-endian format before transcription. Pre-converting the file before sending it to the API can improve runtime performance.
The text to align with the audio.
String name of the voice activity detection (VAD) model to use. Can be one of `silero` or `whisperx-pyannet`.
String name of the alignment model to use. Currently supported:
* `mms_fa` optimal accuracy for multilingual speech.
* `tdnn_ffn` optimal accuracy for English-only speech.
* `gentle` best accuracy for English-only speech (requires a dedicated endpoint, contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)).
The format in which to return the response. Can be one of `srt`, `verbose_json`, or `vtt`.
Audio preprocessing mode. Currently supported:
* `none` to skip audio preprocessing.
* `dynamic` for arbitrary audio content with variable loudness.
* `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs.
* `bass_dynamic` for boosting lower frequencies.
### Response
The task which was performed. Either `transcribe` or `translate`.
The language of the transcribed/translated text.
The duration of the transcribed/translated audio, in seconds.
The transcribed/translated text.
Extracted words and their corresponding timestamps.
The text content of the word.
Start time of the word in seconds.
End time of the word in seconds.
Segments of the transcribed/translated text and their corresponding details.
```python python
!pip install fireworks-ai requests

import time
import requests
from fireworks.client.audio import AudioInference

# Prepare client
audio = requests.get("https://tinyurl.com/3pddjjdc").content
text = "At this turning point of history there manifest themselves, side by side and often mixed and entangled together, a magnificent, manifold, virgin forest-like upgrowth and upstriving, a kind of tropical tempo in the rivalry of growth, and an extraordinary decay and self-destruction owing to the savagely opposing and seemingly exploding egoisms which strive with one another for sun and light, and can no longer assign any limit, restraint, or forbearance for themselves by means of the hitherto existing morality"
client = AudioInference(
    model="whisper-v3-turbo",
    base_url="https://audio-prod.us-virginia-1.direct.fireworks.ai",
    api_key="<...>",
)
# Make request
start = time.time()
r = await client.align_async(audio=audio, text=text)
print(f"Took: {(time.time() - start):.3f}s. Response: '{r}'")
```
```curl curl
# Download audio file
curl -sL -o "30s.flac" "https://tinyurl.com/3pddjjdc"
# Make request
curl -X POST "http://api.fireworks.ai/inference/v1/audio/alignments" \
-H "Authorization: Bearer <...>" \
-F "file=@30s.flac"
-F "text=At this turning point of history there manifest themselves, side by side and often mixed and entangled together, a magnificent, manifold, virgin forest-like upgrowth and upstriving, a kind of tropical tempo in the rivalry of growth, and an extraordinary decay and self-destruction owing to the savagely opposing and seemingly exploding egoisms which strive with one another for sun and light, and can no longer assign any limit, restraint, or forbearance for themselves by means of the hitherto existing morality"
```
# Streaming Transcription
websocket /audio/transcriptions/streaming
Streaming transcription is performed over a WebSocket. Provide the transcription parameters and establish a WebSocket connection to the endpoint.
Stream short audio chunks (50-400ms) in binary frames of PCM 16-bit little-endian at 16kHz sample rate and single channel (mono). In parallel, receive transcription from the WebSocket.
Stream audio to get transcription continuously in real-time.
### URL
Please use the following serverless endpoint:
```
wss://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming
```
### Headers
Your Fireworks API key, e.g. `Authorization=API_KEY`.
### Query Parameters
The format in which to return the response. Currently only `verbose_json` is recommended for streaming.
The target language for transcription. The set of supported target languages can be found [here](https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/tokenizer.py#L10-L128).
The input prompt that the model will use when generating the transcription. Can be used to specify custom words or the style of the transcription. E.g. `Um, here's, uh, what was recorded.` will make the model include filler words in the transcription.
Sampling temperature to use when decoding text tokens during transcription.
### Streaming Audio
Stream short audio chunks (50-400ms) in binary frames of PCM 16-bit little-endian at 16kHz sample rate and single channel (mono). Typically, you will:
1. Resample your audio to 16 kHz if it is not already.
2. Convert it to mono.
3. Send 50ms chunks (16,000 Hz \* 0.05s = 800 samples) of audio in 16-bit PCM (signed, little-endian) format.
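These three steps amount to only a few lines of code. Below is a minimal sketch of the chunking arithmetic, assuming `torchaudio` is installed and a hypothetical local `input.wav` file; the full example further down performs the same preparation:

```python
import torch
import torchaudio

# Load a local audio file; torchaudio returns a (channels, samples) tensor and its sample rate.
waveform, sample_rate = torchaudio.load("input.wav")  # hypothetical input file

# 1. Resample to 16 kHz and 2. downmix to mono.
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000).mean(dim=0)

# 3. Convert to 16-bit signed PCM (little-endian on typical platforms) and split into
#    50 ms chunks: 16,000 Hz * 0.05 s = 800 samples, i.e. 1,600 bytes per chunk.
pcm = (waveform.clamp(-1.0, 1.0) * 32767).to(torch.int16).numpy().tobytes()
bytes_per_chunk = 800 * 2
chunks = [pcm[i:i + bytes_per_chunk] for i in range(0, len(pcm), bytes_per_chunk)]
```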
### Handling Responses
The client maintains a state dictionary, starting with an empty dictionary `{}`. When the server sends the first transcription message, it contains a list of segments. Each segment has an `id` and `text`:
```python
# Server initial message
{
"segments": [
{"id": "0", "text": "This is the first sentence"},
{"id": "1", "text": "This is the second sentence"}
]
}
# Client initial state
{
"0": "This is the first sentence",
"1": "This is the second sentence",
}
```
When the server sends the next updates to the transcription, the client updates the state dictionary based on the segment `id`:
```python
# Server continuous message
{
"segments": [
{"id": "1", "text": "This is the second sentence modified"},
{"id": "2", "text": "This is the third sentence"}
]
}
# Client continuous update
{
"0": "This is the first sentence",
"1": "This is the second sentence modified", # overwritten
"2": "This is the third sentence", # new
}
```
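The merge logic itself is tiny. Here is a minimal sketch; the full example in the next section uses the same approach inside its message handler:

```python
def merge_segments(state: dict, message: dict) -> dict:
    # New segment ids are added; existing ids are overwritten with the latest text.
    for segment in message.get("segments", []):
        state[segment["id"]] = segment["text"]
    return state

state = {}
merge_segments(state, {"segments": [{"id": "0", "text": "This is the first sentence"}]})
merge_segments(state, {"segments": [{"id": "0", "text": "This is the first sentence modified"}]})
print(state)  # {'0': 'This is the first sentence modified'}
```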
### Example Usage
Check out a brief Python example below or example sources:
* [Python notebook](https://colab.research.google.com/github/fw-ai/cookbook/blob/main/learn/audio/audio_streaming_speech_to_text/audio_streaming_speech_to_text.ipynb)
* [Python sources](https://github.com/fw-ai/cookbook/tree/main/learn/audio/audio_streaming_speech_to_text/python)
* [Node.js sources](https://github.com/fw-ai/cookbook/tree/main/learn/audio/audio_streaming_speech_to_text/nodejs)
```python
!pip3 install requests torch torchaudio websocket-client

import io
import json
import threading
import time
import urllib.parse

import requests
import torch
import torchaudio
import websocket

# Prepare audio: download a sample file (the 30-second clip used in the alignment example),
# resample to 16 kHz, downmix to mono, and split into 50 ms chunks of 16-bit PCM.
# Any audio source prepared this way works the same.
with open("30s.flac", "wb") as f:
    f.write(requests.get("https://tinyurl.com/3pddjjdc").content)
waveform, sample_rate = torchaudio.load("30s.flac")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000).mean(dim=0)
pcm = (waveform.clamp(-1.0, 1.0) * 32767).to(torch.int16).numpy().tobytes()

chunk_size_ms = 50
bytes_per_chunk = 16_000 * chunk_size_ms // 1000 * 2  # 800 samples * 2 bytes
audio_chunk_bytes = [pcm[i:i + bytes_per_chunk] for i in range(0, len(pcm), bytes_per_chunk)]

lock = threading.Lock()
segments = {}

def on_open(ws):
    # Send audio chunks, pacing them to simulate real-time capture
    def send_audio_chunks():
        for chunk in audio_chunk_bytes:
            ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
            time.sleep(chunk_size_ms / 1000.0)
        time.sleep(2)
        ws.close()
    threading.Thread(target=send_audio_chunks).start()

def on_message(ws, message):
    # Merge new segments with existing segments
    msg = json.loads(message)
    new_segments = {seg["id"]: seg["text"] for seg in msg.get("segments", [])}
    with lock:
        segments.update(new_segments)
        print(json.dumps(segments, indent=2))

def on_error(ws, error):
    print(f"WebSocket error: {error}")

# Open a connection URL with query params
url = "ws://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming"
params = urllib.parse.urlencode({
    "language": "en",
})
ws = websocket.WebSocketApp(
    f"{url}?{params}",
    header={"Authorization": "<...>"},  # your Fireworks API key
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
)
ws.run_forever()
```
### Dedicated endpoint
For fixed throughput and predictable SLAs, you may request a dedicated endpoint for streaming transcription at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) or on [Discord](https://discord.gg/fireworks-ai).
# Transcribe audio
post /audio/transcriptions
Send a sample audio to get a transcription.
### Request
##### (multi-part form)
The input audio file to transcribe. The maximum audio file size is 1 GB; there is no limit on audio duration. Common file formats such as mp3, flac, and wav are supported. Note that the audio will be resampled to 16kHz, downmixed to mono, and reformatted to 16-bit signed little-endian format before transcription. Pre-converting the file before sending it to the API can improve runtime performance.
String name of the ASR model to use. Can be one of `whisper-v3` or `whisper-v3-turbo`. Please use the following serverless endpoints:
* [https://audio-prod.us-virginia-1.direct.fireworks.ai](https://audio-prod.us-virginia-1.direct.fireworks.ai) (for `whisper-v3`);
* [https://audio-turbo.us-virginia-1.direct.fireworks.ai](https://audio-turbo.us-virginia-1.direct.fireworks.ai) (for `whisper-v3-turbo`);
String name of the voice activity detection (VAD) model to use. Can be one of `silero` or `whisperx-pyannet`.
String name of the alignment model to use. Currently supported:
* `mms_fa` optimal accuracy for multilingual speech.
* `tdnn_ffn` optimal accuracy for English-only speech.
* `gentle` best accuracy for English-only speech (requires a dedicated endpoint, contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)).
The target language for transcription. The set of supported target languages can be found [here](https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/tokenizer.py#L10-L128).
The input prompt that the model will use when generating the transcription. Can be used to specify custom words or the style of the transcription. E.g. `Um, here's, uh, what was recorded.` will make the model include filler words in the transcription.
Sampling temperature to use when decoding text tokens during transcription.
The format in which to return the response. Can be one of `json`, `text`, `srt`, `verbose_json`, or `vtt`.
The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Can be `word`, `segment`, or both. If not present, defaults to `segment`.
Audio preprocessing mode. Currently supported:
* `none` to skip audio preprocessing.
* `dynamic` for arbitrary audio content with variable loudness.
* `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs.
* `bass_dynamic` for boosting lower frequencies.
### Response
The task which was performed. Either `transcribe` or `translate`.
The language of the transcribed/translated text.
The duration of the transcribed/translated audio, in seconds.
The transcribed/translated text.
Extracted words and their corresponding timestamps.
The text content of the word.
Start time of the word in seconds.
End time of the word in seconds.
Segments of the transcribed/translated text and their corresponding details.
```python python
!pip install fireworks-ai requests

import time
import requests
from fireworks.client.audio import AudioInference

# Prepare client
audio = requests.get("https://tinyurl.com/4cb74vas").content
client = AudioInference(
    model="whisper-v3",
    base_url="https://audio-prod.us-virginia-1.direct.fireworks.ai",
    #
    # Or for the turbo version
    # model="whisper-v3-turbo",
    # base_url="https://audio-turbo.us-virginia-1.direct.fireworks.ai",
    api_key="<...>",
)
# Make request
start = time.time()
r = await client.transcribe_async(audio=audio)
print(f"Took: {(time.time() - start):.3f}s. Text: '{r.text}'")
```
```curl curl
# Download audio file
curl -sL -o "1hr.flac" "https://tinyurl.com/4cb74vas"
# Make request
curl -X POST "https://audio-prod.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions" \
-H "Authorization: Bearer <...>" \
-F "file=@1hr.flac"
```
# Translate audio
post /audio/translations
### Request
##### (multi-part form)
The input audio file to translate. The maximum audio file size is 1 GB; there is no limit on audio duration. Common file formats such as mp3, flac, and wav are supported. Note that the audio will be resampled to 16kHz, downmixed to mono, and reformatted to 16-bit signed little-endian format before transcription. Pre-converting the file before sending it to the API can improve runtime performance.
String name of the ASR model to use. Can be one of `whisper-v3` or `whisper-v3-turbo`. Please use the following serverless endpoints:
* [https://audio-prod.us-virginia-1.direct.fireworks.ai](https://audio-prod.us-virginia-1.direct.fireworks.ai) (for `whisper-v3`);
* [https://audio-turbo.us-virginia-1.direct.fireworks.ai](https://audio-turbo.us-virginia-1.direct.fireworks.ai) (for `whisper-v3-turbo`);
String name of the voice activity detection (VAD) model to use. Can be one of `silero` or `whisperx-pyannet`.
String name of the alignment model to use. Currently supported:
* `mms_fa` optimal accuracy for multilingual speech.
* `tdnn_ffn` optimal accuracy for English-only speech.
* `gentle` best accuracy for English-only speech (requires a dedicated endpoint, contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)).
The target language for transcription. The set of supported target languages can be found [here](https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/tokenizer.py#L10-L128).
The input prompt that the model will use when generating the transcription. Can be used to specify custom words or the style of the transcription. E.g. `Um, here's, uh, what was recorded.` will make the model include filler words in the transcription.
Sampling temperature to use when decoding text tokens during transcription.
The format in which to return the response. Can be one of `json`, `text`, `srt`, `verbose_json`, or `vtt`.
The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Can be `word`, `segment`, or both. If not present, defaults to `segment`.
Audio preprocessing mode. Currently supported:
* `none` to skip audio preprocessing.
* `dynamic` for arbitrary audio content with variable loudness.
* `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs.
* `bass_dynamic` for boosting lower frequencies.
### Response
The task which was performed. Either `transcribe` or `translate`.
The language of the transcribed/translated text.
The duration of the transcribed/translated audio, in seconds.
The transcribed/translated text.
Extracted words and their corresponding timestamps.
The text content of the word.
Start time of the word in seconds.
End time of the word in seconds.
Segments of the transcribed/translated text and their corresponding details.
```python python
!pip install fireworks-ai requests

import time
import requests
from fireworks.client.audio import AudioInference

# Prepare client
audio = requests.get("https://tinyurl.com/4cb74vas").content
client = AudioInference(
    model="whisper-v3",
    base_url="https://audio-prod.us-virginia-1.direct.fireworks.ai",
    #
    # Or for the turbo version
    # model="whisper-v3-turbo",
    # base_url="https://audio-turbo.us-virginia-1.direct.fireworks.ai",
    api_key="<...>",
)
# Make request
start = time.time()
r = await client.translate_async(audio=audio)
print(f"Took: {(time.time() - start):.3f}s. Text: '{r.text}'")
```
```curl curl
# Download audio file
curl -sL -o "1hr.flac" "https://tinyurl.com/4cb74vas"
# Make request
curl -X POST "https://audio-prod.us-virginia-1.direct.fireworks.ai/v1/audio/translations" \
-H "Authorization: Bearer <...>" \
-F "file=@1hr.flac"
```
# Create Dataset
post /v1/accounts/{account_id}/datasets
# CRUD APIs for deployed models.
post /v1/accounts/{account_id}/deployedModels
# Create Deployment
post /v1/accounts/{account_id}/deployments
# Create Model
post /v1/accounts/{account_id}/models
# Create User
post /v1/accounts/{account_id}/users
# Create embeddings
post /embeddings
# Delete Dataset
delete /v1/accounts/{account_id}/datasets/{dataset_id}
# Delete Deployed Model
delete /v1/accounts/{account_id}/deployedModels/{deployed_model_id}
# Delete Deployment
delete /v1/accounts/{account_id}/deployments/{deployment_id}
# Delete Model
delete /v1/accounts/{account_id}/models/{model_id}
# Generate an image
The official API reference for image generation workloads can be found on the corresponding model pages by clicking "view code". We support generating images from text prompts, other images, and/or ControlNet:
[https://fireworks.ai/models/fireworks/stable-diffusion-xl-1024-v1-0](https://fireworks.ai/models/fireworks/stable-diffusion-xl-1024-v1-0)
[https://fireworks.ai/models/fireworks/SSD-1B](https://fireworks.ai/models/fireworks/SSD-1B)
[https://fireworks.ai/models/fireworks/playground-v2-1024px-aesthetic](https://fireworks.ai/models/fireworks/playground-v2-1024px-aesthetic)
[https://fireworks.ai/models/fireworks/japanese-stable-diffusion-xl](https://fireworks.ai/models/fireworks/japanese-stable-diffusion-xl)
# Get Account
get /v1/accounts/{account_id}
# Get Dataset
get /v1/accounts/{account_id}/datasets/{dataset_id}
# Get Dataset Upload Endpoint
post /v1/accounts/{account_id}/datasets/{dataset_id}:getUploadEndpoint
# Get Deployment
get /v1/accounts/{account_id}/deployments/{deployment_id}
# Get Model
get /v1/accounts/{account_id}/models/{model_id}
# Get Model Download Endpoint
get /v1/accounts/{account_id}/models/{model_id}:getDownloadEndpoint
# Get Model Upload Endpoint
post /v1/accounts/{account_id}/models/{model_id}:getUploadEndpoint
# Get User
get /v1/accounts/{account_id}/users/{user_id}
# Introduction
The Fireworks AI REST API enables you to interact with various language, image, and embedding models using your API key.
## Authentication
All requests made to the Fireworks AI REST API must include an `Authorization` header.
The header must specify a valid `Bearer` token containing your API key, and request bodies must be JSON-encoded with the `Content-Type: application/json` header.
This ensures that your requests are properly authenticated and formatted for interaction with the Fireworks AI platform.
A sample header to include in a REST API request looks like this:
```
Authorization: Bearer <API_KEY>
```
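For example, using Python's `requests` library to call the chat completions endpoint documented below (the model name is just an example of a serverless model):

```python
import requests

API_KEY = "<FIREWORKS_API_KEY>"  # replace with your API key

response = requests.post(
    "https://api.fireworks.ai/inference/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(response.json())
```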
# List Datasets
get /v1/accounts/{account_id}/datasets
# List Deployments
get /v1/accounts/{account_id}/deployments
# List Models
get /v1/accounts/{account_id}/models
# List Users
get /v1/accounts/{account_id}/users
# Create Chat Completion
post /chat/completions
Creates a model response for the given chat conversation.
# Create Completion
post /completions
Creates a completion for the provided prompt and parameters.
# Update Dataset
patch /v1/accounts/{account_id}/datasets/{dataset_id}
# Update Deployment
patch /v1/accounts/{account_id}/deployments/{deployment_id}
# Update Model
patch /v1/accounts/{account_id}/models/{model_id}
# Update User
patch /v1/accounts/{account_id}/users/{user_id}
# Upload Dataset Files
post /v1/accounts/{account_id}/datasets/{dataset_id}:upload
Provides a streamlined way to upload a dataset file in a single API request. This path can handle file sizes up to 150 MB. For larger file sizes, use [Get Dataset Upload Endpoint](get-dataset-upload-endpoint).
# Validate Dataset Upload
post /v1/accounts/{account_id}/datasets/{dataset_id}:validateUpload
# Validate Model Upload
get /v1/accounts/{account_id}/models/{model_id}:validateUpload
# Start here
The **Fireworks Cookbook** is your hands-on guide to building, deploying, and fine-tuning generative AI and agentic workflows. It offers curated examples, Jupyter Notebooks, apps, and resources tailored to various use cases and skill levels, making it a go-to resource for practical Fireworks implementations.
In this cookbook, you’ll find:
* **Production-ready projects**: Scalable, proven solutions with ongoing support from the Fireworks engineering team.
* **Learning-focused tutorials**: Step-by-step guides for hands-on exploration, ideal for interactive learning of AI techniques.
* **Community-driven showcases**: Creative user-contributed projects that showcase innovative applications of Fireworks in diverse contexts.
***
## Repository structure
To help you easily navigate and find the right resources, the Cookbook organizes examples by purpose:
**Hands-on projects for learning AI** techniques, maintained by the DevRel team.
**Explore user-contributed projects** that push creative boundaries with Fireworks.
***
### Feedback & support
We value your feedback! If you encounter issues, need clarification, or have questions, please contact us at
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
***
**Additional resources:**
* [Fireworks AI Blog](https://fireworks.ai/blog)
* [Fireworks AI YouTube](https://www.youtube.com/channel/UCHCffBTGYa1Ut72h03ldtGA)
* [Fireworks AI Twitter](https://x.com/fireworksai_hq)
# Build with Fireworks
Step-by-step guides for hands-on exploration, ideal for interactive learning of AI techniques.
## Inference
Explore notebooks and projects showcasing how to run generative AI models on Fireworks, demonstrating both third-party integrations and innovative applications with industry-leading speed and flexibility.
### LLMs
Dive into examples that utilize Fireworks for deploying and fine-tuning large language models (LLMs), featuring integrations with popular libraries and cutting-edge use cases.
**Notebooks**
(Python) An interactive Streamlit app for comparing LLMs on Fireworks with parameter tuning and LLM-as-a-Judge functionality.
(Python) Demonstrates structured responses using Llama 3.1, covering Grammar Mode and JSON Mode for consistent output formats.
(Python) Explores generating synthetic data with Llama 3.1 models on Fireworks, including structured outputs for quizzes.
**Apps**
A Next.js app for real-time transcription chat using Fireworks and Vercel integration.
### Visual-language
Discover projects combining vision and language capabilities using Fireworks, integrating external frameworks for seamless multimodal understanding.
### Audio
Explore real-time audio transcription, processing, and generation examples using Fireworks’ advanced audio models and integrations.
**Notebooks**
A notebook demonstrating real-time audio transcription using Fireworks' `whisper-v3-large` compatible model. The project includes streaming audio input and getting transcription messages, making it ideal for tasks requiring accurate and responsive audio processing.
Stream audio to get transcription continuously in real-time.
### Image
Experiment with image-based projects using Fireworks’ models, enhanced with third-party libraries for innovative applications in image creation, manipulation, and recognition.
### Multimodal
Learn from complex multimodal examples that blend text, audio, and image inputs, demonstrating the full potential of Fireworks combined with external tools for interactive AI experiences.
***
## Fine-tuning
Access notebooks that demonstrate efficient model fine-tuning on Fireworks, utilizing both internal capabilities and third-party tools like Axolotl for custom optimization.
### Multi-LoRA
Explore notebooks showcasing the integration and utilization of multiple LoRA adapters in Fireworks. These resources demonstrate advanced techniques for merging, fine-tuning, and deploying multi-LoRA configurations to optimize model performance across diverse tasks.
**Notebooks**
(Python) An interactive guide showcasing the integration of Multi-LoRA adapters on Fireworks, enabling fine-tuned responses for diverse product domains such as beauty, fashion, outdoor gear, and baby products.
***
## Function calling
Explore examples of function-calling workflows using Fireworks, showcasing how to integrate with external APIs and tools for sophisticated, multi-step AI operations.
**Notebooks**
Demonstrates Function-Calling with LangChain integration, including custom tool routing and query handling. (Python)
Explore the integration of Fireworks' function-calling model with LangChain tools. This notebook demonstrates building basic agents using `firefunction-v1` for tasks like answering questions, retrieving stock prices, and generating images with the Fireworks SDXL API (Javascript).
Showcases Function-Calling with LangGraph integration for graph-based agent systems and tool queries. (Python)
Uses Fireworks' Function-Calling for structured QA with OpenAI, featuring multi-turn conversation handling. (Python)
Demonstrates querying financial data using Fireworks' Function-Calling API with integrated tool setup. (Python)
Extracts structured information from web content using Fireworks' Function-Calling API. (Python)
Generates stock charts using Fireworks' Function-Calling API with AutoGen integration. (Python)
**Apps**
A demo app showcasing chat with function-calling capabilities for dynamic service invocation.
***
## RAG
Build retrieval-augmented generation (RAG) systems with Fireworks, featuring projects that connect with vector databases and search tools for enhanced, context-aware AI responses.
**Notebooks**
A basic RAG implementation using ChromaDB with League of Legends data, comparing responses across multiple models. (Python)
An agentic system using RAG for generating catchy research paper titles with embeddings and LLM completions. (Python)
A movie recommendation system using Fireworks' function-calling models and MongoDB Atlas for personalized, real-time suggestions. (Python)
**Apps**
A RAG chatbot using SurrealDB for vector storage and Fireworks for real-time, context-aware responses.
***
### Integration partners
We welcome contributions from integration partners! Follow these steps:
1. **Clone the Repo**: [Fireworks Cookbook repo](https://github.com/fw-ai/cookbook)
2. **Create Folder**: Add your company/tool under `integrations`
3. **Add Examples**: Include code, notebooks, or demos
4. **Use Template**: Fill out the [integration guide](https://github.com/fw-ai/cookbook/blob/main/integrations/template_integration_guide.md)
5. **Submit PR**: Create a pull request
6. **Review**: Fireworks will review and merge
Need help? Contact us or open an issue.
***
### Support
For help or feedback:
* **Discord**: [Join us](https://discord.gg/fireworks-ai)
* **Email**: [Contact us](mailto:inquiries@fireworks.ai)
**Resources**:
* [Blog](https://fireworks.ai/blog)
* [YouTube](https://www.youtube.com/channel/UCHCffBTGYa1Ut72h03ldtGA)
* [Twitter](https://x.com/fireworksai_hq)
# Community showcase
Creative user-contributed projects that showcase innovative applications of Fireworks in diverse contexts.
Convert any PDF into a personalized podcast using open-source LLMs and TTS models. Powered by Fireworks-hosted Llama 3.1, MeloTTS, and Bark, this app generates engaging dialogue and outputs it as an MP3 file via a user-friendly Gradio interface.
High-throughput code generation with Qwen2.5 Coder models, optimized for fast inference on Fireworks. Includes a robust pipeline for data creation, fine-tuning with Unsloth, and real-time application in AI-powered code editors.
Ensure accurate and reliable technical documentation with ProoferX, built using Fireworks’ fast Llama models and Firefunc for structured output. This project addresses a key challenge in developer tools by validating and streamlining documentation with real-time checks.
***
## Community project submissions
We welcome your contributions to the **Fireworks Cookbook**! Share your projects and help expand our collaborative resource.
Here’s how:
1. **Clone the Repo**: [Fireworks Cookbook](https://github.com/fw-ai/cookbook) and go to `showcase`.
2. **Create Folder**: Add a folder named after your project.
3. **Include Code**: Add notebooks, apps, or other resources demonstrating your project.
4. **Complete Template**: Fill out the [Showcase Template](https://github.com/fw-ai/cookbook/blob/main/showcase/template_projectMDX.md) for key project details.
5. **Submit PR**: Submit your project as a pull request.
6. **Review & Feature**: Our team will review your submission; selected projects may be highlighted in docs or social media.
***
### Support
For help or feedback:
* **Discord**: [Join us](https://discord.gg/fireworks-ai)
* **Email**: [Contact us](mailto:inquiries@fireworks.ai)
**Resources**:
* [Blog](https://fireworks.ai/blog)
* [YouTube](https://www.youtube.com/channel/UCHCffBTGYa1Ut72h03ldtGA)
* [Twitter](https://x.com/fireworksai_hq)
# Direct routing
Direct routing enables enterprise users to reduce latency to their deployments.
## Internet direct routing
Internet direct routing bypasses our global API load balancer and directly routes your request to the machines where
your deployment is running. This can save several tens or even hundreds of milliseconds of time-to-first-token (TTFT)
latency.
To create a deployment using Internet direct routing:
```bash
$ firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \
--direct-routing-type INTERNET \
--direct-route-api-keys
Name: accounts/my-account/deployments/abcd1234
...
Direct Route Handle: my-account-abcd1234.us-arizona-1.direct.fireworks.ai
Region: US_ARIZONA_1
```
You will need to specify a comma-separated list of API keys that can access the direct route deployment. These keys can
be any alphanumeric string and are a distinct concept from the API keys provisioned via the Fireworks console. A key
provisioned in the console but not specified in this list will not be allowed when querying the model via direct
routing.
Take note of the `Direct Route Handle` to get the inference endpoint. This is what you will use to access the deployment
instead of the global `https://api.fireworks.ai/inference/` endpoint. For example:
```bash
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3-8b-instruct",
"prompt": "The sky is"
}' \
--url https://my-account-abcd1234.us-arizona-1.direct.fireworks.ai/v1/completions
```
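The same request can be issued from Python. Here is a minimal sketch using the `requests` library, with the handle and key from the example above as placeholders:

```python
import requests

# Values from the `firectl create deployment` output and the --direct-route-api-keys list.
DIRECT_ROUTE_HANDLE = "my-account-abcd1234.us-arizona-1.direct.fireworks.ai"
DIRECT_ROUTE_API_KEY = "<DIRECT_ROUTE_API_KEY>"

response = requests.post(
    f"https://{DIRECT_ROUTE_HANDLE}/v1/completions",
    headers={
        "Authorization": f"Bearer {DIRECT_ROUTE_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
        "prompt": "The sky is",
    },
)
print(response.json())
```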
## Private Service Connect (PSC)
Contact your Fireworks representative to set up [GCP Private Service Connect](https://cloud.google.com/vpc/docs/private-service-connect)
to your deployment.
## AWS PrivateLink
Contact your Fireworks representative to set up [AWS PrivateLink](https://aws.amazon.com/privatelink/) to your
deployment.
# Regions
Fireworks runs a global fleet of hardware on which you can deploy your models.
## Availability
Current region availability:
| **Region** | **Launch status** | **Hardware availability** |
| ---------------- | ------------------- | ------------------------------------- |
| `US_ILLINOIS_2` | Generally Available | `NVIDIA_A100_80GB` |
| `US_VIRGINIA_2` | Generally Available | `NVIDIA_H100_80GB` `AMD_MI300X_192GB` |
| `EU_PARIS_1` | Generally Available | `NVIDIA_H200_141GB` |
| `AP_TOKYO_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `EU_FRANKFURT_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `US_ILLINOIS_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `US_IOWA_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `US_VIRGINIA_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `US_ARIZONA_1` | Enterprise only | `NVIDIA_H100_80GB` |
If you need deployments in a non-GA region, please contact our team at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai).
## Using a region
When creating a deployment, you can pass the `--region` flag:
```
firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \
--region US_IOWA_1
```
## Changing regions
Updating a region for a deployment in-place is currently not supported. To move a deployment between regions, please
create a new deployment in the new region, then delete the old deployment.
## Quotas
Each region has its own separate quota for each hardware type. To view your current quotas, run:
```
firectl list quotas
```
# Reserved capacity
Enterprise accounts can purchase reserved capacity, typically with 1 year commitments. Reserved capacity has the
following advantages over [on-demand deployments](/guides/ondemand-deployments):
* Guaranteed capacity
* Higher quotas
* Lower GPU-hour prices
* Pre-GA access to newer regions
* Pre-GA access to newest hardware
## Purchasing or renewing a reservation
To purchase a reservation or increase the size or duration of an existing reservation, contact your Fireworks account
manager. If you are a new, prospective customer, please reach out to our [sales team](https://fireworks.ai/company/contact-us).
## Viewing your reservations
To view your existing reservations, run:
```
firectl list reservations
```
## Usage and billing
Reservations are automatically "consumed" when you create deployments that meet the reservation parameters. For
example, suppose you have a reservation for 12 H100 GPUs and create two deployments, each using 8 H100 GPUs. While both
deployments are running, 12 H100s will count towards your reservation, while the excess 4 H100s will be metered
and billed at the on-demand rate.
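The split can be expressed as simple arithmetic; a small illustrative sketch of the example above:

```python
def reservation_split(reserved_gpus: int, active_gpus: int) -> tuple[int, int]:
    # Returns (GPUs covered by the reservation, GPUs metered at the on-demand rate).
    covered = min(reserved_gpus, active_gpus)
    return covered, active_gpus - covered

# 12 reserved H100s, two deployments of 8 GPUs each = 16 active GPUs.
print(reservation_split(12, 16))  # (12, 4): 12 covered, 4 billed on-demand
```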
When a reservation approaches its end time, ensure that you either renew the reservation or turn down a corresponding
number of deployments; otherwise you may be billed for your usage at on-demand rates.
Reservations are invoiced separately from your on-demand usage, at a frequency determined by your reservation contract
(e.g. monthly, quarterly, or yearly).
Reserved capacity will always be billed until the reservation ends, regardless of whether the reservation is
actively used.
# About Fireworks developer partners
Learn about the Fireworks Developer Partners Program, including goals, application process, and benefits for tools and platforms in the LLMOps/Gen-Ops ecosystem.
The **Fireworks developer integrations program** supports tools, platforms, and projects in the LLMOps/Gen-Ops ecosystem, enabling seamless collaboration with Fireworks. 🌐 Whether through **native integrations** or **compatible workflows**, developer integrations represent tools and platforms that:
* Offer **native integration** with Fireworks APIs, enabling deep functionality and seamless operation.
* Provide **compatible workflows**, demonstrating interoperability with Fireworks through shared use cases and adaptable processes.
* Add value to the Fireworks ecosystem by enhancing developer workflows, improving scalability, or solving key challenges in LLMOps/Gen-Ops. 🔧
***
# Goals of the developer partners program
1. **Expand the ecosystem**: Build a rich network of tools that extend Fireworks’ capabilities. 🌱
2. **Showcase interoperability**: Demonstrate how Fireworks works with diverse tools to solve real-world challenges. 🌍
3. **Support innovation**: Encourage the creation of impactful generative AI solutions. 💡
4. **Promote collaboration**: Highlight shared contributions through joint marketing, workshops, and developer resources. 🤝
***
## Types of developer partners
1. **Native integrations** 🛠️
* Tools with direct integration into Fireworks APIs or SDKs, offering seamless plug-and-play functionality.
* Examples include official connectors, plugins, and platform integrations.
2. **Compatible workflows**
* Tools or platforms that interoperate with Fireworks through shared APIs, workflows, or third-party bridges.
* Examples include vector stores, fine-tuning tools, and monitoring solutions that work alongside Fireworks.
***
# What does a developer integration look like?
A developer integration can include:
* **Native integrations**: Fully integrated tools or connectors offering seamless user experiences.
* **Workflow compatibility**: Examples and documentation showing how a tool works with Fireworks APIs.
* **Developer resources**: Contributed guides, notebooks, and sample repositories to enable other users.
**Examples**:
* **Native integration**: A plugin for a vector database that directly connects with Fireworks’ RAG workflows.
* **Compatible workflow**: A step-by-step guide for using Fireworks APIs alongside an MLOps monitoring tool.
***
# How to apply
### Step 1: Demonstrate compatibility or build integration 🔍
* **Native integrations**: Develop a connector or integration directly into Fireworks APIs or SDKs.
* **Compatible workflows**: Validate how your tool works with Fireworks workflows and APIs.
* Prepare resources such as GitHub repos, notebooks, or workflow guides.
### Step 2: Submit your application 📤
1. **Create documentation**
* Use the [Fireworks cookbook template](https://github.com/fw-ai/cookbook/blob/main/integrations/template_integration_guide.md) to document your integration or workflow.
2. **Submit your contribution**
* Fork the [Fireworks cookbook](https://github.com/fw-ai/cookbook) and submit a pull request with your materials.
* Include links to your GitHub repo or supporting documentation.
3. **Contact developer relations**\
For guidance, reach out to [DevRel](mailto:devrel@fireworks.ai).
### Step 3: Review and feedback ✅
* Fireworks developer relations will review your submission to ensure technical accuracy and alignment with program goals.
* Once approved, your integration or workflow will be published in Fireworks documentation and promoted through official channels.
***
# Benefits of becoming a Fireworks developer partner 🌟
1. **Ecosystem visibility**
* Be featured in Fireworks documentation and resources as a trusted integration.
* Gain recognition within the growing LLMOps/Gen-Ops developer community.
2. **Technical and marketing support**
* Access Fireworks resources and technical support for building integrations.
* Collaborate on co-marketing campaigns, webinars, and tutorials.
3. **Community collaboration**
* Join a network of ecosystem partners working to push generative AI innovation forward.
* Share insights and learn from other projects in the LLMOps/Gen-Ops space.
***
# Program FAQ ❓
**Q: Who can apply to the Developer Partners program?**\
A: Tools, platforms, and projects that either integrate natively with Fireworks or demonstrate compatibility through workflows are welcome to apply.
**Q: What types of contributions are required?**\
A: Contributions can include technical documentation, integration guides, sample workflows, GitHub repos, and co-marketing materials.
**Q: Is there a cost to participate?**\
A: No, the Developer Partners program is free.
**Q: Can compatible workflows evolve into native integrations?**\
A: Yes! Tools demonstrating strong adoption and compatibility may transition to deeper integrations and partnerships.
***
For more information or to get started, contact us at:
* **Discord**: [Join here](https://discord.gg/fireworks-ai)
* **Email**: [devrel@fireworks.ai](mailto:devrel@fireworks.ai)
# Account setup & management
Solutions for common account access issues and management procedures for Fireworks.ai accounts
## Multiple account access
**Q: What should I do if I can't access my company account after being invited when I already have a personal account?**
This issue can occur when you have multiple accounts associated with the same email address (e.g., a personal account created with Google login and a company account you've been invited to).
To resolve this:
1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) from the email address associated with both accounts
2. Include in your email:
* The account ID you created personally (e.g., username-44ace8)
* The company account ID you need access to (e.g., company-a57b2a)
* Mention that you're having trouble accessing your company account
Note: This is a known scenario that support can resolve once they verify your email ownership.
***
## Account closure
**Q: How do I close my Fireworks.ai account?**
To close your account:
1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
2. Include in your request:
* Your account ID
* A clear request for account deletion
Before closing your account, please ensure:
* All outstanding invoices are paid
* Any active deployments are terminated
* Important data is backed up if needed
***
## Signing in from different Fireworks accounts
**Q: I have multiple Fireworks accounts. When I try to login with Google on Fireworks' web UI, I'm getting signed into the wrong account. How do I fix this?**
If you log in with Google, account management is controlled by Google. You can log in through an incognito window or create separate Chrome/browser profiles to log in with different Google accounts. You can also follow the steps in this [guide](https://support.google.com/accounts/answer/13533235?hl=en#zippy=%2Csign-in-with-google) to disassociate Fireworks.ai from a particular Google account sign-in. If you have more complex issues, please contact us on Discord.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Billing management
Information about Fireworks.ai invoicing and API billing.
## Invoice questions
**Q: Why did I receive an invoice when I only deposited credits?**
Fireworks.ai billing works as follows:
* **Deposited credits** are used first.
* Once credits are exhausted, you **continue to accrue charges** for additional usage.
* **Usage charges** are billed at the end of each month.
* You’ll receive an invoice for any usage that **exceeded your pre-purchased credits**.
This process happens automatically, regardless of subscription status. To prevent additional charges, please monitor your usage or contact support to set up spending restrictions.
**Q: Where's my receipt for purchased credits?**
Receipts for purchased credits are sent via Stripe upon the initial credit purchase. Check your email for receipts from Stripe (not Fireworks). Contact [billing@fireworks.ai](mailto:billing@fireworks.ai) if you are still encountering problems.
***
## API billing
**Q: Are calls to the Models API billable?**
No, calls to the **Models API** endpoint are free. This applies to all **management API calls** for:
* Accounts
* Users
* Models
* Datasets
*Note*: While the API calls themselves are free, charges apply for:
* **Model deployments**
* **Fine-tuning jobs**
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Credit system
Understanding how Fireworks.ai billing, credits, and account suspension work.
## Billing and credit usage
**Q: How does billing and credit usage work?**
Usage and billing operate through a **tiered system**:
* Each **tier** has a monthly usage limit, regardless of available credits.
* Once you reach your tier's limit, **service will be suspended** even if you have remaining credits.
* **Usage limits** reset at the beginning of each month.
* Pre-purchased credits do not prevent additional charges once the limit is exceeded.
***
## Account suspension
**Q: Why might my account be suspended even with remaining credits?**
Your account may be suspended due to several factors:
1. **Monthly usage limits**:
* Each tier includes a monthly usage limit, independent of any credits.
* Once you reach this limit, your service will be suspended, even if you have credits remaining.
* Usage limits automatically reset at the beginning of each month.
2. **Billing structure**:
* Pre-purchased credits do not prevent additional charges.
* You can exceed your pre-purchased credits and will be billed for any usage beyond that limit.
* **Example**: If you have `$20` in pre-purchased credits but incur `$83` in usage, you will be billed for the `$63` difference.
***
## Missing credits
**Q: I bought credits but don’t see them reflected in my account. Did they disappear?**
Fireworks operates with a **postpaid billing** system where:
* **Prepaid credits** are instantly applied to any outstanding balance.
* **Example**: If you had a `$750` outstanding bill and added `$500` in credits, your bill would reduce to `$250`, with \$0 remaining credits available for new usage.
To check your credit balance:
1. Visit your **billing dashboard**.
2. Review the **"Credits"** section.
3. Check your **current outstanding balance**.
*Note*: Credits are always applied to any existing balance before being available for new usage.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Cost structure
Understanding Fireworks.ai pricing and fees for various services.
## Platform costs
**Q: How much does Fireworks cost?**
Fireworks AI operates on a **pay-as-you-go** model for all non-Enterprise usage, and new users automatically receive free credits. You pay based on:
* **Per token** for serverless inference
* **Per GPU usage time** for on-demand deployments
* **Per token of training data** for fine-tuning
For customers needing **enterprise-grade security and reliability**, please reach out to us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) to discuss options.
Find out more about our current pricing on our [Pricing page](https://fireworks.ai/pricing).
***
## Fine-tuning fees
**Q: Are there extra fees for serving fine-tuned models?**
No, deploying fine-tuned models to serverless infrastructure is free. Here’s what you need to know:
**What’s free**:
* Deploying fine-tuned models to serverless infrastructure
* Hosting the models on serverless infrastructure
* Deploying up to 100 fine-tuned models
**What you pay for**:
* **Usage costs** on a per-token basis when the model is actually used
* The **fine-tuning process** itself, if applicable
*Note*: This differs from on-demand deployments, which include hourly hosting costs.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Discounts
Information about bulk usage discounts and special pricing options.
## Bulk usage
**Q: Are there discounts for bulk usage?**
Yes, we offer discounts for **bulk or pre-paid purchases** exclusively for on-demand deployments, not for serverless usage. Please contact [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) if you're interested.
***
## Serverless discounts
**Q: Are there discounts for bulk spend on serverless deployments?**
Our publicly accessible services have **standard rates** for all customers. Currently, we do not offer bulk discounts for serverless deployments.
***
## Additional information
For **enterprise customers** or **high-volume users**:
* Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options**
* Discuss **annual commitment discounts**
* Explore **enterprise-specific features and benefits**
# Billing & scaling
Understanding billing and scaling mechanisms for on-demand deployments.
## Autoscaling and costs
**Q: How does autoscaling affect my costs?**
* **Scaling from 0**: No minimum cost when scaled to zero
* **Scaling up**: Each new replica adds to your total cost proportionally. For example:
* Scaling from 1 to 2 replicas doubles your GPU costs
* If each replica uses multiple GPUs, costs scale accordingly (e.g., scaling from 1 to 2 replicas with 2 GPUs each means paying for 4 GPUs total)
For current pricing details, please visit our [pricing page](https://fireworks.ai/pricing).
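To make the multiplication concrete, here is a back-of-the-envelope sketch; the hourly rate is a purely hypothetical placeholder, so check the pricing page for actual numbers:

```python
def estimated_gpu_cost(replicas: int, gpus_per_replica: int, hours: float, rate_per_gpu_hour: float) -> float:
    # Cost scales linearly with replicas * GPUs per replica * active hours.
    return replicas * gpus_per_replica * hours * rate_per_gpu_hour

RATE = 3.00  # hypothetical $/GPU-hour, for illustration only

print(estimated_gpu_cost(1, 2, 10, RATE))  # 1 replica x 2 GPUs x 10 h -> 60.0
print(estimated_gpu_cost(2, 2, 10, RATE))  # 2 replicas x 2 GPUs x 10 h -> 120.0 (double)
```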
***
## Rate-limits for on-demand deployment
**Q: What are the rate limits for on-demand deployments?**
Request throughput scales with your GPU allocation. Base allocations include:
* Up to 8 A100 GPUs
* Up to 8 H100 GPUs
On-demand deployments offer several advantages:
* **Predictable pricing** based on time units, not token I/O
* **Protected latency and performance**, independent of traffic on the serverless platform
* **Choice of GPUs**, including A100s and H100s
Need more GPUs? Contact us to discuss higher allocations for your specific use case.
***
## On-demand billing
**Q: How does billing work for on-demand deployments?**
On-demand deployments come with automatic cost optimization features:
* **Default autoscaling**: Automatically scales to 0 replicas when not in use
* **Pay for what you use**: Charged only for GPU time when replicas are active
* **Flexible configuration**: Customize autoscaling behavior to match your needs
**Best practices for cost management**:
1. **Leverage default autoscaling**: The system automatically scales down deployments when not in use
2. **Customize carefully**: While you can modify autoscaling behavior using our [configuration options](https://docs.fireworks.ai/guides/ondemand-deployments#customizing-autoscaling-behavior), note that preventing scale-to-zero will result in continuous GPU charges
3. **Consider your use case**: For intermittent or low-frequency usage, serverless deployments might be more cost-effective
For detailed configuration options, see our [deployment guide](https://docs.fireworks.ai/guides/ondemand-deployments#replica-count-horizontal-scaling).
***
## Scaling structure
**Q: How does billing and scaling work for on-demand GPU deployments?**
On-demand GPU deployments have unique billing and scaling characteristics compared to serverless deployments:
**Billing**:
* Charges start when the server begins accepting requests
* **Billed by GPU-second** for each active instance
* Costs accumulate even if there are no active API calls
**Scaling options**:
* Supports **autoscaling** from 0 to multiple GPUs
* Each additional GPU **adds to the billing rate**
* Can handle unlimited requests within the GPU’s capacity
**Management requirements**:
* Not fully serverless; requires some manual management
* **Manually delete deployments** when no longer needed
* Or configure autoscaling to **scale down to 0** during inactive periods
**Cost control tips**:
* Regularly **monitor active deployments**
* **Delete unused deployments** to avoid unnecessary costs
* Consider **serverless options** for intermittent usage
* Use **autoscaling to 0** to optimize costs during low-demand times
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options**
# Deployment issues
Troubleshooting and resolving common issues with on-demand deployments.
## Custom model issues
**Q: What are the common issues when deploying custom models?**
Here are key areas to troubleshoot for custom model deployments:
### 1. Deployment hanging or crashing
**Common causes**:
* **Missing model files**, especially when using Hugging Face models
* **Symlinked files** not uploaded correctly
* **Outdated firectl version**
**Solutions**:
* Download models without symlinks using:
```bash
huggingface-cli download model_name --local-dir=/path --local-dir-use-symlinks=False
```
* Update **firectl** to the latest version
### 2. LoRA adapters vs full models
* **Compatibility**: LoRA adapters work with specific base models.
* **Performance**: May experience slightly lower speed with LoRA, but **quality should remain similar** to the original model.
* **Troubleshooting quality drops**:
* Check **model configuration**
* Review **conversation template**
* Add `echo: true` to debug requests
### 3. Performance optimization factors
Consider adjusting the following for improved performance:
* **Accelerator count** and **accelerator type**
* **Long prompt** settings to handle complex inputs
***
## Autoscaling
**Q: What should I expect for deployment and scaling performance?**
* **Initial deployment**: Should complete within minutes
* **Scaling from zero**: You may experience brief availability delays while the system scales up
* **Troubleshooting**: If deployment takes over 1 hour, this typically indicates a crash and should be investigated
* **Best practice**: Monitor deployment status and contact support if deployment times are unusually long
***
## Performance questions
**Q: I have more specific questions about performance improvements.**
For detailed discussions on performance and optimization options:
* **Schedule a consultation** directly with our PM, Ray Thai ([calendly](https://calendly.com/raythai))
* Discuss your **specific use cases**
* Get **personalized recommendations**
* Review **advanced configuration options**
*Note*: Monitor costs carefully during the deployment and testing phase, as repeated deployments and tests can quickly consume credits.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options**
# Hardware options
Understanding hardware choices for Fireworks.ai on-demand deployments.
## Hardware selection
**Q: Which accelerator/GPU should I use?**
It depends on your specific needs. Fireworks has two groupings of accelerators: smaller (A100) and larger (H100, H200, and MI300X) accelerators. Smaller accelerators are less expensive (see [pricing page](https://fireworks.ai/pricing)), so they’re more cost-effective for low-volume use cases. However, if you have enough volume to fully utilize a larger accelerator, we find that larger accelerators tend to be both faster and more cost-effective per token.
Choosing between larger accelerators depends on the use case.
* MI300X has the highest memory capacity and sometimes enables large models to be deployed with comparatively few GPUs. For example, unquantized Llama 3.1 70B fits on one MI300X and FP8 Llama 405B fits on 4 MI300Xs. Higher memory may also enable better throughput for longer prompts and less-sharded deployments. It’s also more affordably priced than the H100.
* H100 offers blazing-fast inference and often provides the highest throughput, especially for high-volume use cases.
* H200 is recommended for large models like DeepSeek V3 and DeepSeek R1; for example, the minimum configuration for DeepSeek V3 or DeepSeek R1 is 8 H200s.
### Best Practices for Selection
1. **Analyze your workload requirements** to determine which GPU fits your processing needs.
2. Consider your **throughput needs** and the scale of your deployment.
3. Calculate the **cost-performance ratio** for each hardware option.
4. Factor in **future scaling needs** to ensure the selected GPU can support growth.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options**
# On-demand deployment scaling
Understanding Fireworks.ai system scaling and request handling capabilities.
## System scaling
**Q: How does the system scale?**
Our system is **horizontally scalable**, meaning it:
* Scales linearly with additional **replicas** of the deployment
* **Automatically allocates resources** based on demand
* Manages **distributed load handling** efficiently
***
## Auto scaling
**Q: Do you support Auto Scaling?**
Yes, our system supports **auto scaling** with the following features:
* **Scaling down to zero** capability for resource efficiency
* Controllable **scale-up and scale-down velocity**
* **Custom scaling rules and thresholds** to match your specific needs
***
## Throughput capacity
**Q: What’s the supported throughput?**
Throughput capacity typically depends on several factors:
* **Deployment type** (serverless or on-demand)
* **Traffic patterns** and **request patterns**
* **Hardware configuration**
* **Model size and complexity**
***
## Request handling
**Q: What factors affect the number of simultaneous requests that can be handled?**
The request handling capacity is influenced by multiple factors:
* **Model size and type**
* **Number of GPUs** allocated to the deployment
* **GPU type** (e.g., A100 vs. H100)
* **Prompt size** and **generation token length**
* **Deployment type** (serverless vs. on-demand)
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Performance optimization
Guidelines for optimizing performance and benchmarking Fireworks.ai deployments.
## Performance improvement
**Q: What are the techniques to improve performance?**
To optimize model performance, consider the following techniques:
1. **Quantization**
2. **Check model type**: Determine whether the model is **GQA** (Grouped Query Attention) or **MQA** (Multi-Query Attention).
3. **Increase batch size** to improve throughput.
***
## Benchmarking
**Q: How can we benchmark?**
There are multiple ways to benchmark your deployment’s performance:
* Use our [open-source load-testing tool](https://github.com/fw-ai/benchmark)
* Develop custom performance testing scripts (see the sketch after this list)
* Integrate with monitoring tools to track metrics
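For illustration, a custom performance-testing script can be as simple as timing a batch of identical requests and reporting latency percentiles. The sketch below assumes the OpenAI-compatible Python client; the model name, request count, and prompt are placeholders, not recommendations.

```python
import time
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",  # placeholder
)

latencies = []
for _ in range(20):  # number of probe requests; adjust to your traffic pattern
    start = time.perf_counter()
    client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p3-70b-instruct",  # illustrative model
        messages=[{"role": "user", "content": "Say hello."}],
        max_tokens=32,
    )
    latencies.append(time.perf_counter() - start)

latencies.sort()
p50 = latencies[len(latencies) // 2]
p95 = latencies[min(int(len(latencies) * 0.95), len(latencies) - 1)]
print(f"p50={p50:.2f}s  p95={p95:.2f}s")
```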
***
## Model latency
**Q: What’s the latency for small, medium, and large LLM models?**
Model latency and performance depend on various factors:
* **Input/output prompt lengths**
* **Model quantization**
* **Model sharding**
* **Disaggregated prefill processes**
* **Hardware configuration**
* **Multiple layers of caching**
* **Fire optimizations**
* **LoRA adapters** (Low-Rank Adaptation)
Our team specializes in personalizing model performance. We work with you to understand your traffic patterns and create customized deployment templates that maximize performance for your use case.
***
## Performance factors
**Q: What factors affect model latency and performance?**
Key factors that impact latency and performance include:
* **Model architecture and size**
* **Hardware configuration**
* **Network conditions**
* **Request patterns**
* **Batch size settings**
* **Caching implementation**
***
## Best practices
**Q: What are the best practices for optimizing performance?**
For optimal performance, follow these recommendations:
1. **Choose an appropriate model size** for your specific use case.
2. **Implement batching strategies** to improve efficiency.
3. **Use quantization** where applicable to reduce computational load.
4. **Monitor and adjust scaling parameters** to meet demand.
5. **Optimize prompt lengths** to reduce processing time.
6. **Implement caching** to minimize repeated calculations.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Costs & management
Understanding costs and model availability for serverless deployments.
## Deployment costs
**Q: Are there costs associated with deploying fine-tuned models to serverless infrastructure?**
No, deploying fine-tuned models to serverless infrastructure is free.
**What’s free**:
* Deploying fine-tuned models to serverless
* Hosting models on serverless infrastructure
* Deploying up to 100 fine-tuned models
**What you pay for**:
* **Usage costs** on a per-token basis when the model is actually used
* The **fine-tuning process** itself, if applicable
*Note*: This differs from on-demand deployments, which include hourly hosting costs.
***
## Model availability
**Q: Do you provide notice before removing model availability?**
Yes, we provide advance notice before removing models from the serverless infrastructure:
* **Minimum 2 weeks’ notice** before model removal
* Longer notice periods may be provided for **popular models**, depending on usage
* Higher-usage models may have extended deprecation timelines
**Best Practices**:
1. Monitor announcements regularly.
2. Prepare a migration plan in advance.
3. Test alternative models to ensure continuity.
4. Keep your contact information updated for timely notifications.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Performance issues
Troubleshooting timeout errors and performance issues with serverless LLM models.
## Timeout and response times
**Q: Why am I experiencing request timeout errors and slow response times with serverless LLM models?**
Timeout errors and increased response times can occur due to **server load during high-traffic periods**.
With serverless, users are essentially **sharing a pool of GPUs** with models pre-provisioned.
The goal of serverless is to allow users and teams to **seamlessly power their generative applications** with the **latest generative models** in **less than 5 lines of code**.
Deployment barriers should be **minimal** and **pricing is based on usage**.
However, there are trade-offs with this approach: to ensure users have **consistent access** to the most in-demand models, users are also subject to **minor latency and performance variability** during **high-volume periods**.
With **on-demand deployments**, users are reserving GPUs (which are **billed by rented time** instead of usage volume) and don't have to worry about traffic spikes.
This is why we recommend two ways to address timeout and response-time issues:
### Current solution (recommended for production)
* **Use on-demand deployments** for more stable performance
* **Guaranteed response times**
* **Dedicated resources** to ensure availability
We are always investing in ways to improve speed and performance.
### Upcoming improvements
* Enhanced SLAs for uptime
* More consistent generation speeds during peak load times
If you experience persistent issues, please include the following details in your support request:
1. Exact **model name**
2. **Timestamp** of errors (in UTC)
3. **Frequency** of timeouts
4. **Average wait times**
### Performance optimization tips
* Consider **batch processing** for handling bulk requests
* Implement **retry logic with exponential backoff** (see the sketch after this list)
* Monitor **usage patterns** to identify peak traffic times
* Set **appropriate timeout settings** based on model complexity
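As a reference for the retry suggestion above, here is a minimal sketch of exponential backoff around a chat completion call, assuming the OpenAI-compatible Python client; the model name, timeout, and retry counts are illustrative placeholders.

```python
import time
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",  # placeholder
)

def chat_with_retries(messages, max_retries=5, base_delay=1.0):
    """Retry transient failures (timeouts, rate limits, 5xx) with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="accounts/fireworks/models/llama-v3p3-70b-instruct",  # illustrative model
                messages=messages,
                timeout=60,  # client-side timeout in seconds; tune per model complexity
            )
        except (openai.APITimeoutError, openai.APIConnectionError,
                openai.RateLimitError, openai.InternalServerError):
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```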
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Service levels
Understanding SLAs and service guarantees for Fireworks.ai serverless deployments.
## Latency guarantees
**Q: Is latency guaranteed for serverless models?**
Currently, there are **no latency or availability guarantees** for serverless models. However, they are coming soon, and we recommend contacting [sales](https://fireworks.ai/company/contact-us) to discuss any specific needs or requirements you have.
***
## Service level agreements
**Q: Are there any SLAs for serverless models?**
Our **multi-tenant serverless offering** does not currently come with **Service Level Agreements (SLAs)**. However, SLAs are coming, and we'd love to understand your use case so we can ensure you have the best experience possible on the Fireworks platform. Reach out to us via sales or our Discord community.
***
## Quota information
**Q: Are there any quotas for serverless?**
For **serverless deployments**, quotas are as follows:
* **Developer accounts**: 600 requests per minute (RPM)
* **Enterprise accounts**: 600 requests per minute (RPM)
* Quotas apply **across all models** and cannot be exceeded within the serverless infrastructure
**For higher quotas**:
* Consider switching to **on-demand deployments**
* **Contact enterprise sales** for custom solutions
* Evaluate **dedicated infrastructure options** for greater flexibility
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Certifications
Information about Fireworks.ai compliance certifications and HIPAA requirements.
## Security certifications
**Q: What type of certifications do you have?**
We are **SOC 2 Type II** and **HIPAA Certified**. These certifications demonstrate our commitment to:
* **Security**
* **Availability**
* **Processing integrity**
* **Confidentiality**
* **Privacy**
You can view more at [https://trust.fireworks.ai/](https://trust.fireworks.ai/).
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Enterprise sales**: Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for more information
# Enterprise quotas
Understanding quota allocations for Enterprise customers.
## Enterprise limits
**Q: Are there any quotas for Enterprise Tier?**
No, there are **no quotas** for Enterprise Tier. Enterprise customers benefit from:
1. **Resource Allocation**:
* **Unlimited request capacity**
* **Flexible scaling options**
* **Custom resource allocation**
2. **Performance Benefits**:
* **Dedicated infrastructure**
* **Priority processing**
* **Enhanced support**
3. **Custom Solutions**:
* **Tailored deployment options**
* **Specialized configurations**
* **Customized scaling policies**
For specific requirements or custom configurations, contact your **enterprise account representative**.
***
## Additional resources
* **Enterprise sales**: Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for more information
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
# Platform support
Information about Fireworks.ai deployment regions, general support channels, and platform requests.
## General support
**Q: I have another question or issue.**
We have an active [Discord community](https://discord.gg/mMqQxvFD9A) where you can:
* Post questions
* Request features
* Report bugs
* Interact directly with the Fireworks team and community
***
## Feature requests
**Q: How can I request a new model to be added to the platform?**
Head over to our **Discord server** and let us know which models you would like to see deployed. We actively take feature requests for new, popular models.
***
## Product feedback
**Q: I have specific performance questions or want to know about further performance improvement options.**
If you need more tailored performance advice or want to discuss advanced optimization options, here are two ways to get support:
1. **General support**: Reach out via our [support channels](https://fireworks.ai/company/contact-us) or check out the performance optimization practices for tips on maximizing efficiency with on-demand deployments.
2. **Direct consultation**: For in-depth questions, feel free to schedule a consultation directly with our Product Manager, Ray Thai, using [this link to his calendar](https://calendly.com/raythai). Ray can assist with advanced optimization strategies and hardware recommendations based on your specific workload and deployment needs.
***
## Deployment regions
**Q: Do you host your deployments in the EU or Asia?**
We are currently deployed in multiple U.S.-based locations. However, we’re open to hearing more about your specific requirements. You can:
* Join our [Discord community](https://discord.gg/mMqQxvFD9A)
* Write to us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
If you're an Enterprise customer, please contact your dedicated customer support representative to ensure a timely response.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Support structure & access
Information about Fireworks.ai support options, access methods, and communication channels.
## Support options
**Q: What support options exist?**
* Enterprise accounts receive **dedicated support**.
* Developer-tier customers can interact directly with the Fireworks team and community through our **Discord channel**.
***
## Support process
**Q: How does Support work?**
Fireworks provides support for its services with **target response times** based on the **priority level** of the issue. Customers can indicate priority when creating support issues through the **Fireworks support system**.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Enterprise support tiers & SLAs
Detailed information about Fireworks.ai support priority levels and response time commitments.
## Enterprise support contact
**Q: If you're an Enterprise customer, how do you contact support?**
Enterprise customers have access to **dedicated support channels**. Please contact your assigned **customer support representative** for timely assistance.
***
## Communication channels
**Q: Do you have a shared Slack channel?**
For customers who use Slack internally, we create a **shared Slack channel**. This channel is used for:
* **Answering questions** about Fireworks’ platform and features
* **Receiving bug reports** from customers
* **Communicating** around incidents and escalations
* **Announcing new features** and requesting feedback on current offerings
***
## Support priority levels
**Q: What are the support tiers and SLAs for enterprise?**
Support issues are categorized into four priority levels, with specific examples for each:
| Priority Level | Response Time | Description | Examples |
| --------------- | ----------------------- | ------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------- |
| **Urgent (P0)** | Within 1 hour | Reserved for critical cases that break live production workflows | • Production scheduled task/runbook unexpectedly failing • Application inaccessible to end users |
| **High (P1)** | Within 4 business hours | Problems that prevent regular platform usage but not breaking live production | • Development/staging schedule failing • Task deployment failing |
| **Normal (P2)** | Within 8 business hours | Requests for information, enhancements, or documentation clarification with no negative service impact | • Feature requests • Documentation questions |
| **Low (P3)** | Within 2 business days | Any issues that don't fall into P0, P1, or P2 categories | • General inquiries • Non-urgent requests |
*Note: Business hours refer to standard working hours.*
# Platform models
Information about custom and available models on Fireworks.ai.
## Custom models
**Q: Does Fireworks support custom base models?**
Yes, custom base models can be deployed via **firectl**. You can learn more about custom model deployment in our [guide on uploading custom models](https://docs.fireworks.ai/models/uploading-custom-models).
***
## Model availability
**Q: There’s a model I would like to use that isn’t available on Fireworks. Can I request it?**
Fireworks supports a wide array of custom models and actively takes feature requests for new, popular models to add to the platform.
**To request new models**:
1. **Join our [Discord server](https://discord.gg/fireworks-ai)**
2. Let us know which models you’d like to see
3. Provide **use case details**, if possible, to help us prioritize
We regularly evaluate and add new models based on:
* **Community requests**
* **Popular demand**
* **Technical feasibility**
* **Licensing requirements**
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Fine-tuning service
Overview of Fireworks.ai fine-tuning capabilities and supported models.
## Service availability
**Q: Does Fireworks offer a fine-tuning service?**
Yes, Fireworks offers a fine-tuning service. Take a look at our [fine-tuning guide](https://docs.fireworks.ai/fine-tuning/fine-tuning-models), which is also available [via REST API](https://docs.fireworks.ai/fine-tuning/fine-tuning-via-api) for detailed information about our services and capabilities.
***
## Model support
**Q: What models are supported for fine-tuning? Is Llama 3 supported for fine-tuning?**
Yes, **Llama 3** (8B and 70B) is supported for fine-tuning with **LoRA adapters**, which can be deployed via our **serverless** and **on-demand** options for inference.
**Capabilities include**:
* **LoRA adapter training** for flexible model adjustments
* **Serverless deployment support** for scalable, cost-effective usage
* **On-demand deployment options** for high-performance inference
* A variety of **base model options** to suit different use cases
For a complete list of models available for fine-tuning, refer to our [documentation](https://docs.fireworks.ai/fine-tuning/fine-tuning-models).
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Fine-tuning troubleshooting
Solutions for common fine-tuning deployment and access issues.
## Access issues
**Q: Why am I getting "Model not found" errors when trying to access my fine-tuned model?**
If you’re unable to access your fine-tuned model, try these troubleshooting steps:
**First steps**:
* Attempt to access the model through both the **playground** and the **API**.
* Check if the error occurs for **all users** on the account.
* Ensure your **API key** is valid.
**Common causes**:
* User email previously associated with a **deleted account**
* **API key permissions** issues
* **Access conflicts** due to multiple accounts
**Debug process**:
1. Verify the API key’s validity using:
```bash
curl -v -H "Authorization: Bearer $FIREWORKS_API_KEY" https://api.fireworks.ai/verifyApiKey
```
2. Check if the issue persists across different **API keys**.
3. Identify which specific **users/emails** are affected.
**Getting help**:
* Contact support with:
* Your **account ID**
* **API key verification** results
* A list of **affected users/emails**
* Results from both **playground** and **API** tests
*Note*: If you have multiple accounts, ensure that access permissions are checked across all of them.
***
## Troubleshooting firectl deployment
**Q: Why am I getting "invalid id" errors when using firectl commands like create deployment or list deployments?**
This error typically occurs when your **account ID** is not properly configured.
### Common symptoms
* Error message: `invalid id: id must be at least 1 character long`
* Affects multiple commands, including:
* `firectl create deployment`
* `firectl list deployments`
### Steps to resolve
1. Run `firectl whoami` to check which **account id** is being used.
2. Ensure the correct **account ID** is being used. If not, run `firectl signin` to sign in to the correct account.
***
## LoRA deployment issues
**Q: Why can’t I deploy my fine-tuned Llama 3.1 LoRA adapter?**
If you encounter the following error:
```bash
Invalid LoRA weight model.layers.0.self_attn.q_proj.lora_A.weight shape: torch.Size([16, 4096]), expected (16, 8192)
```
This issue is due to the `fireworks.json` file being set to **Llama 3.1 70B Instruct** by default.
**Workaround**:
1. Download the **model weights**.
2. Modify the base model to be `accounts/fireworks/models/llama-v3p1-8b-instruct`.
3. Follow the instructions in the [documentation](https://fireworks.ai/fine-tuning/model-upload) to upload and deploy the model.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# FLUX capabilities
Understanding FLUX image generation features and limitations.
## Multiple images
**Q: Can I generate multiple images in a single API call using FLUX serverless?**
No, FLUX serverless supports only one image per API call. For multiple images, send separate parallel requests—these will be automatically load-balanced across our replicas for optimal performance.
***
## Image-to-image generation
**Q: Does FLUX support image-to-image generation?**
No, image-to-image generation is not currently supported. We are evaluating this feature for future implementation. If you have specific use cases, please share them with our support team to help inform development.
***
## LoRA models
**Q: Can I create custom LoRA models with FLUX?**
Inference on FLUX-LoRA adapters is currently supported. However, managed training with FLUX on Fireworks is not yet supported, although this feature is under development. Updates about our managed LoRA training service will be announced when available.
***
## Size control
**Q: How do I control output image sizes when using SDXL ControlNet?**
When using **SDXL ControlNet** (e.g., canny control), the output image size is determined by the explicit **width** and **height** parameters in your API request.
The input control signal image will be automatically:
* **Resized** to fit your specified dimensions
* **Cropped** to preserve aspect ratio
**Example**: To generate a 768x1344 image, explicitly include these parameters in your request:
```json
{
"width": 768,
"height": 1344
}
```
*Note*: While these parameters may not appear in the web interface examples, they are supported API parameters that can be included in your requests.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Limitations & controls
Understanding model limitations, safety features, and token limits.
## Safety Features
**Q: Can safety filters or content restrictions be disabled on text generation models?**
No, safety features and content restrictions for text generation models (such as Llama, Mistral, etc.) are embedded by the original model creators during training:
* **Safety measures** are integrated directly into the models by the teams that trained and released them.
* These are **core behaviors** of the model, not external filters.
* Different models may have varying levels of built-in safety.
* **Fireworks.ai does not add additional censorship layers** beyond what is inherent in the models.
* Original model behaviors **cannot be modified** via API parameters or configuration.
*Note*: For specific content handling needs, review the documentation of each model to understand its inherent safety features.
## Token Limits
**Q: What are the maximum completion token limits for models, and can they be increased?**
Token limits are model-specific and have technical constraints:
**Current Limitations**:
* Many models, such as **Llama 3.1 405B**, have a **4096 token completion limit**.
* Setting a higher `max_tokens` in API calls **will not override** this limit.
* You will see `"finish_reason": "length"` in responses when hitting this limit.
**Why Limits Exist**:
* **Resource management** for shared infrastructure
* Prevents single requests from monopolizing resources
* Helps maintain **service availability** for all users
**Working with Token Limits**:
* Break longer generations into **multiple requests** (see the sketch after this list).
* *Note*: This may require repeating context or prompts.
* Be mindful that repeated context can **increase total token usage**.
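The sketch below illustrates the multi-request pattern described above, assuming the OpenAI-compatible Python client: it detects `finish_reason == "length"` and continues generation in a follow-up request by carrying the partial output in context. The model name and loop bound are placeholders, and note that each continuation re-sends prior context, which increases total token usage.

```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",  # placeholder
)

messages = [{"role": "user", "content": "Write a detailed report on solar energy."}]
full_text = ""

for _ in range(3):  # cap the number of continuation requests
    response = client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-405b-instruct",  # illustrative model
        messages=messages,
        max_tokens=4096,
    )
    choice = response.choices[0]
    full_text += choice.message.content
    if choice.finish_reason != "length":
        break
    # Keep the partial answer in context and ask the model to continue.
    messages.append({"role": "assistant", "content": choice.message.content})
    messages.append({"role": "user", "content": "Please continue."})
```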
**Example API Response at Limit**:
```json
{
"finish_reason": "length",
"usage": {
"completion_tokens": 4096,
"prompt_tokens": 4206,
"total_tokens": 8302
}
}
```
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Inference performance
Understanding model performance, quantization, and batching capabilities.
## Model quantization
**Q: What quantization format is used for the Llama 3.1 405B model?**
The **Llama 3.1 405B model** uses the **FP8 quantization format**, which:
* Closely matches **Meta's reference implementation**
* Is described in further detail on the model page at [fireworks.ai/models/fireworks/llama-v3p1-405b-instruct](https://fireworks.ai/models/fireworks/llama-v3p1-405b-instruct)
* Follows the general quantization methodology documented in our [Quantization blog](https://fireworks.ai/blog/fireworks-quantization)
*Note*: **BF16 precision** will be available soon for on-demand deployments.
***
## API capabilities
**Q: Does the API support batching and load balancing?**
Current capabilities include:
* **Load balancing**: Yes, supported out of the box
* **Continuous batching**: Yes, supported
* **Batch inference**: Not currently supported (on the roadmap)
* Note: For batch use cases, we recommend sending multiple parallel HTTP requests to the deployment while maintaining some fixed level of concurrency (see the sketch after this list).
* **Streaming**: Yes, supported
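For the batch pattern noted above, here is a minimal sketch of sending parallel requests while holding concurrency at a fixed level, assuming the OpenAI-compatible Python client; the concurrency value, model name, and prompts are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",  # placeholder
)

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p3-70b-instruct",  # illustrative model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

prompts = [f"Summarize item {i}" for i in range(100)]  # placeholder workload

# Fixed concurrency: at most 8 requests are in flight at any time.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(complete, prompts))
```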
***
## Request handling
**Q: What factors affect the number of simultaneous requests that can be handled?**
Request handling capacity depends on several factors:
* **Model size and type**
* **Number of GPUs allocated** to the deployment
* **GPU type** (e.g., A100, H100)
* **Prompt size**
* **Generation token length**
* **Deployment type** (serverless vs. on-demand)
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Data security
Information about Fireworks.ai data encryption and security measures.
## Data at rest
**Q: How is data encrypted at rest?**
All resources stored within Fireworks are **encrypted at rest**, including:
* **Models**
* **Datasets**
* **LoRA Adapters**
* Other stored resources
***
## Data in transit
**Q: How is data encrypted in transit?**
All data passed through Fireworks is encrypted using **industry-standard protocols and methods**.
***
## Encryption options
**Q: Does Fireworks provide client-side encryption or allow customers to bring their own encryption keys?**
Currently, Fireworks does not provide:
* **Client-side encryption**
* **Customer-managed keys** for encrypting data at rest
*Note*: We continuously evaluate additional encryption options based on customer needs and security requirements.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Security documentation
Access to Fireworks.ai security policies and documentation.
## Security policies
**Q: Where can I find more information about your security policies?**
Comprehensive security documentation is available at [trust.fireworks.ai](https://trust.fireworks.ai), including:
* **Security measures**
* **Compliance information**
* **Best practices**
* **Policy updates**
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Model security
Understanding model security and guardrail implementations.
## Model guardrails
**Q: Do you put any guardrails before any LLM models?**
By default, we don’t apply any guardrails to LLM models. Our customers can implement guardrails through various methods:
1. **Using built-in options**:
* Models such as **Llama Guard** provide built-in guardrails.
* Integration with existing **security frameworks**.
2. **Third-party solutions**:
* AI gateways like **Portkey** offer guardrails as a feature.
* Documentation available at: [Portkey Guardrails](https://docs.portkey.ai/docs/product/guardrails)
**Best practices**:
* Implement guardrails appropriate to your **use case**.
* Conduct regular **security audits**.
* Monitor **model outputs** consistently.
* Keep **security policies** updated.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Private access
Understanding private connection options for Fireworks.ai services.
## Private connections
**Q: Do you provide private connections?**
Fireworks provides various forms of **private connections**:
**Cloud provider options**:
* **AWS PrivateLink**
* **GCP Private Service Connect**
**Additional options**:
* **Direct Routing**, which allows you to connect your dedicated API Gateway
**Benefits**:
* **Enhanced security**
* **Reduced latency**
* **Private network communication**
* **Improved reliability**
**Implementation process**:
1. **Contact support** to initiate setup.
2. **Choose connection type** based on your requirements.
3. **Configure network settings** as per the guidelines.
4. **Verify connectivity** to ensure successful integration.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Fine-tuning models
We're introducing an upgraded tuning service with improved speed, usability, and reliability! The new service uses different commands and offers different model coverage, and it is free while in public preview.
See these [docs](https://docs.fireworks.ai/fine-tuning/fine-tuning-legacy) to use our legacy service instead.
## Introduction
Fireworks offers a [LoRA](https://huggingface.co/docs/diffusers/training/lora)-based fine-tuning method designed for usability, reliability, and efficiency. LoRA is used for fine-tuning all models except our 70B models, which use qLoRA (quantized LoRA) to improve training speeds.
The fine-tuning service provides hassle-free quality improvements through intelligent defaults and minimal configuration. Models fine-tuned with our service can be seamlessly deployed for inference on Fireworks or downloaded for local usage.
Fine-tuning a model with a dataset can be useful for several reasons:
1. **Enhanced Precision**: It allows the model to adapt to the unique attributes and trends within the dataset, leading to significantly improved precision and effectiveness.
2. **Domain Adaptation**: While many models are developed with general data, fine-tuning them with specialized, domain-specific datasets ensures they are finely attuned to the specific requirements of that field.
3. **Bias Reduction**: General models may carry inherent biases. Fine-tuning with a well-curated, diverse dataset aids in reducing these biases, fostering fairer and more balanced outcomes.
4. **Contemporary Relevance**: Information evolves rapidly, and fine-tuning with the latest data keeps the model current and relevant.
5. **Customization for Specific Applications**: This process allows for the tailoring of the model to meet unique objectives and needs, an aspect not achievable with standard models.
In essence, fine-tuning a model with a specific dataset is a pivotal step in ensuring its enhanced accuracy, relevance, and suitability for specific applications. Let's hop on a journey of fine-tuning a model!
Fine-tuned model inference on Serverless is slower than base model inference on Serverless. For use cases that need low latency, we recommend using [on-demand deployments](https://docs.fireworks.ai/guides/ondemand-deployments). For on-demand deployments, fine-tuned model inference speeds are significantly closer to base model speeds (but still slightly slower). If you are only using 1 LoRA on-demand, [merging fine-tuned weights](https://huggingface.co/docs/peft/main/en/developer_guides/lora#merge-lora-weights-into-the-base-model) into the base model when using on-demand deployments will provide identical speed to base model inference. If you have an enterprise use case that needs fast fine-tuned models, please [contact us!](https://fireworks.ai/company/contact-us)
## Pricing
Our new tuning service is currently free; in the future it will be billed based on the total number of tokens processed (dataset tokens \* number of epochs). Running inference on fine-tuned models incurs no extra costs beyond base inference fees.
See our [Pricing](https://fireworks.ai/pricing#fine-tuning) page for pricing details on our legacy tuning service.
## Installing firectl
[`firectl`](/tools-sdks/firectl/firectl) is the command-line interface (CLI) used to manage and deploy various resources on the [Fireworks AI Platform](https://fireworks.ai). Use `firectl` to manage fine-tuning jobs and their resulting models.
Please visit the Firectl [Getting Started](/tools-sdks/firectl/firectl) Guide on installing and using `firectl`.
## Preparing your dataset
To fine-tune a model, we need to first upload a dataset. Once uploaded, this dataset can be used to create one or more fine-tuning jobs. A dataset consists of a single JSONL file, where each line is a separate training example.
Limits:
* Minimum number of examples is 3.
* Maximum number of examples is 3,000,000.
Format:
* Each line of the file must be a valid JSON object.
Each dataset must conform to the schema expected by our OpenAI-compatible [Chat Completions API](https://docs.fireworks.ai/guides/querying-text-models#chat-completions-api). Each JSON object of the dataset must contain a single array field called `messages`. Each message is an object containing two fields:
* `role` - one of "system", "user", or "assistant".
* `content` - the content of the message.
A message with the "system" role is optional, but if specified, must be the first message of the conversation. Subsequent messages start with "user" and alternate between "user" and "assistant". See below for example training examples:
```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "blue"}]}
{"messages": [{"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "2"}, {"role": "user", "content": "Now what is 2+2?"}, {"role": "assistant", "content": "4"}]}
```
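If you want to sanity-check a dataset against the rules above before uploading, a short script like the following can help. This is an unofficial sketch; the file path is a placeholder and the checks mirror only the constraints listed in this section.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_dataset(path: str) -> None:
    with open(path) as f:
        lines = [line for line in f if line.strip()]
    assert 3 <= len(lines) <= 3_000_000, "dataset must have between 3 and 3,000,000 examples"
    for i, line in enumerate(lines, start=1):
        example = json.loads(line)  # each line must be a valid JSON object
        roles = [message["role"] for message in example["messages"]]
        assert set(roles) <= ALLOWED_ROLES, f"line {i}: unknown role"
        if roles and roles[0] == "system":  # an optional system message must come first
            roles = roles[1:]
        expected = ["user", "assistant"] * (len(roles) // 2 + 1)
        assert roles == expected[:len(roles)], f"line {i}: messages must alternate user/assistant"

validate_dataset("path/to/dataset.jsonl")  # placeholder path
```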
### Creating your dataset
To create a dataset, run:
```shell
firectl create dataset path/to/dataset.jsonl
```
and you can check the dataset with:
```shell
firectl get dataset
```
## Starting your tuning job
To start a structured fine-tuning job (sftj), run:
```shell
firectl create sftj --base-model <base-model> --dataset <dataset> --output-model <output-model>
```
For example:
```shell
firectl create sftj --base-model llama-v3p1-8b-instruct --dataset my-dataset --output-model my-model
```
firectl will return the fine-tuning job ID.
When creating a fine-tuning job, you can start tuning from a base model, or from a model you tuned earlier (LoRA add-on):
1. **Base model**: Use the `base-model` parameter to start from a pre-trained base model.
2. **Existing LoRA add-on**: Use the `warm-start-from` parameter to start from an existing LoRA add-on model, where the LoRA is specified in the format `accounts/<account-id>/models/<model-id>`
You must specify either `base-model` or `warm-start-from` in your command-line flags.
### Checking the job status
You can monitor the progress of the tuning job by running
```shell
firectl get fine-tuning-job
```
Once the job successfully completes, a model will be created in your account. You can see a list of models by running:
```shell
firectl list models
```
Or if you specified a model ID when creating the fine-tuning job, you can get the model directly:
```shell
firectl get model
```
## Deploying and using a model
Before using your fine-tuned model for inference, you must deploy it. Please refer to our guides on [Deploying a model](/models/deploying#lora-addons) and [Querying text models](/guides/querying-text-models) for detailed instructions.
Some base models may not support serverless addons. To check:
1. Run `firectl -a fireworks get `
2. Look under `Deployed Model Refs` to see if a `fireworks`-owned deployment exists, e.g. `accounts/fireworks/deployments/3c7a68b0`
3. If so, then it is supported
If the base model doesn't support serverless addons, you will need use an [on-demand deployment](/models/deploying#deploying-to-on-demand) to deploy it.
## Additional tuning options
Tuning settings are specified when starting a fine-tuning job. All of the settings below are optional and have reasonable defaults if not specified. For settings that affect tuning quality, like epochs and learning rate, we recommend using the default settings and only changing hyperparameters if results are not as desired. All tuning options must be specified via command-line flags, as shown in the example command below with multiple flags.
```shell
firectl create sftj \
--base-model llama-v3p1-8b-instruct \
--dataset cancerset \
--output-model my-tuned-model \
--job-id my-fine-tuning-job \
--learning-rate 0.0001 \
--epochs 2 \
--early-stop \
--evaluation-dataset my-eval-set
```
### Evaluation
By default, the fine-tuning job evaluates the fine-tuned model against an evaluation set created by automatically carving out a portion of your training set. You can instead explicitly specify a separate evaluation dataset to use, rather than carving out training data.
1. `evaluation_dataset`: The ID of a separate dataset to use for evaluation. Must be pre-uploaded via firectl
```shell
firectl create sftj \
...
--evaluation-dataset my-eval-set \
...
```
### Early stopping
Early stopping stops training early if the validation loss does not improve. It is off by default.
```shell
firectl create sftj \
...
--early-stop \
...
```
### Max Context Length
By default, fine-tuned models support a max context length of 8k. Increase max context length if your use case requires context above 8k. Maximum context length can be increased up to the default context length of your selected model. For models with over 70B parameters, we only support up to 32k max context length.
```shell
firectl create sftj \
...
--max-context-length 16000 \
...
```
### Epochs
Epochs are the number of passes over the training data. Our default value is 1. If the model does not follow the training data as much as expected, increase the number of epochs by 1 or 2. Non-integer values are supported.
**Note: we set a max value of 3 million dataset examples \* epochs**
```shell
firectl create sftj \
...
--epochs 2.0 \
...
```
### Learning rate
Learning rate controls how fast the model updates from data. We generally do not recommend changing learning rate. The default value set is automatically based on your selected model.
```shell
firectl create sftj \
...
--learning-rate 0.0001 \
...
```
### LoRA rank
LoRA rank refers to the number of parameters that will be tuned in your LoRA add-on. Higher LoRA rank increases the amount of information that can be captured while tuning. LoRA rank must be a power of 2 up to 64. Our default value is 8.
```shell
firectl create sftj \
...
--lora-rank 16 \
...
```
### Training progress and monitoring
The fine-tuning service integrates with Weights & Biases to provide observability into the tuning process. To use this feature, you must have a Weights & Biases account and have provisioned an API key.
```shell
firectl create sftj \
...
--wandb-entity my-org \
--wandb-api-key xxx \
--wandb-project "My Project" \
...
```
### Model ID
By default, the fine-tuning job will generate a random unique ID for the model. This ID is used to refer to the model at inference time. You can optionally specify a custom ID, within the [ID constraints](https://docs.fireworks.ai/getting-started/concepts#resource-names-and-ids).
```shell
firectl create sftj \
...
--output-model-id my-model \
...
```
### Job ID
By default, the fine-tuning job will generate a random unique ID for the fine-tuning job. You can optionally choose a custom ID.
```shell
firectl create sftj \
...
--job-id my-fine-tuning-job \
...
```
## Downloading model weights
To download model weights, run:
```shell
firectl download model
```
## Appendix
### Supported base models - tuning
The Fireworks tuning service is limited to select models where we're confident in providing intelligent defaults for a hassle-free experience. Currently, we only support tuning models with the following architectures:
* [Llama 1, 2, 3.x](https://huggingface.co/docs/transformers/en/model_doc/llama2) architectures are supported. Llama vision models and Llama 405B are currently not supported.
* [Qwen2](https://huggingface.co/docs/transformers/en/model_doc/qwen2) architectures are supported.
### Supported base models - LoRAs on dedicated deployment
LoRAs can be deployed for inference on dedicated deployments (on-demand or enterprise reserved) for the following models:
* All models supported for tuning
* accounts/fireworks/models/mixtral-8x7b-instruct-hf
* accounts/fireworks/models/mixtral-8x22b-instruct-hf
* accounts/fireworks/models/mixtral-8x22b-hf
* accounts/fireworks/models/mixtral-8x7b
* accounts/fireworks/models/mistral-7b-instruct-v0p2
* accounts/fireworks/models/mistral-7b
* accounts/fireworks/models/code-qwen-1p5-7b
* accounts/fireworks/models/deepseek-coder-v2-lite-base
* accounts/fireworks/models/deepseek-coder-7b-base
* accounts/fireworks/models/deepseek-coder-1b-base
* accounts/fireworks/models/codegemma-7b
* accounts/fireworks/models/codegemma-2b
* accounts/fireworks/models/starcoder2-15b
* accounts/fireworks/models/starcoder2-7b
* accounts/fireworks/models/starcoder2-3b
* accounts/fireworks/models/stablecode-3b
This means that [up to 100](https://docs.fireworks.ai/guides/quotas_usage/rate-limits#other-quotas) LoRAs can be deployed to a dedicated instance for no extra fees compared to the base deployment costs.
### Supported base models - LoRAs on serverless
The following base models are supported for low-rank adaptation (LoRA) and can be deployed as LoRA add-ons on Fireworks [serverless](/models/deploying#deploying-to-serverless) and [on-demand](/models/deploying#deploying-to-on-demand) deployments, using the default parameters below. Serverless deployment is only available for a subset of fine-tuned models. Run `firectl get model` (see the [models overview](https://docs.fireworks.ai/models/overview#introduction)) or check the [models page](https://fireworks.ai/models) to see if there's an active serverless deployment.
A limited number of models are available for serverless LoRA deployment, meaning that up to 100 LoRAs can be deployed to serverless and are constantly available on a pay-per-token basis.
* accounts/fireworks/models/llama-v3p1-8b-instruct
* accounts/fireworks/models/llama-v3p1-70b-instruct
* accounts/fireworks/models/llama-v3p2-3b-instruct
### Support
We'd love to hear what you think! Please connect with the team, ask questions, and share your feedback in the [#fine-tuning](https://discord.gg/zYDmm4zqmq) Discord channel.
# Using Document Inlining
## Overview
Document Inlining allows any LLM to process images and PDFs through our chat completions API. Simply append `#transform=inline` to your document URL to enable this feature. Document Inlining connects our proprietary Fireworks Parsing Service to any LLM to provide advantages including:
* Improved reasoning (compared to VLMs): LLMs reason better over text than over images, and Document Inlining lets you use specialized and more recently updated text models
* Improved input flexibility: Document Inlining enables PDFs and multiple images to be ingested
* Ultra-simple usage: Use Document Inlining through our OpenAI-compatible chat completions API. Simply add one line to attach your file and turn on Document Inlining
Read our [announcement blog](https://fireworks.ai/blog/document-inlining-launch) for more details.
## Usage
### Basic Example
Note the "#transform=inline" addition to the image URL.
```python Python
import openai
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p3-70b-instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://pdfobject.com/pdf/sample.pdf#transform=inline"
}
},
{
"type": "text",
"text": "What information can you extract from this document?"
}
]
}
]
)
```
```typescript TypeScript
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "",
baseURL: "https://api.fireworks.ai/inference/v1"
});
const response = await client.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p3-70b-instruct",
messages: [
{
role: "user",
content: [
{
type: "image_url",
image_url: {
url: "https://example.com/document.pdf#transform=inline"
}
},
{
type: "text",
text: "What information can you extract from this document?"
}
]
}
]
});
```
```javascript JavaScript
const OpenAI = require("openai");
const client = new OpenAI({
apiKey: "",
baseURL: "https://api.fireworks.ai/inference/v1"
});
const response = await client.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p3-70b-instruct",
messages: [
{
role: "user",
content: [
{
type: "image_url",
image_url: {
url: "https://example.com/document.pdf#transform=inline"
}
},
{
type: "text",
text: "What information can you extract from this document?"
}
]
}
]
});
```
The `image_url.url` field supports both direct URLs and base64-encoded data URLs, compatible with VLM API:
```text
# For PDF files
data:application/pdf;base64,{base64_str_for_pdf}
# For images (png/jpg/gif/tiff supported)
data:image/png;base64,{base64_str_for_image}
data:image/jpeg;base64,{base64_str_for_image}
data:image/gif;base64,{base64_str_for_image}
data:image/tiff;base64,{base64_str_for_image}
```
Similarly, simply append `#transform=inline` to the base64 string to enable document inlining.
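For example, a local PDF can be converted to a base64 data URL with Document Inlining enabled like this (a minimal sketch; the file name is a placeholder):

```python
import base64

# Read a local PDF and build a data URL with Document Inlining enabled.
with open("sample.pdf", "rb") as f:  # placeholder file name
    pdf_b64 = base64.b64encode(f.read()).decode()

inlined_url = f"data:application/pdf;base64,{pdf_b64}#transform=inline"

# Pass `inlined_url` as the image_url.url field in the chat completions request shown above.
```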
### Combining with Structured Output
Document Inlining works seamlessly with structured output formats. Here's how to extract specific fields using [JSON mode](https://docs.fireworks.ai/structured-responses/structured-response-formatting):
```python
from pydantic import BaseModel
class DocumentInfo(BaseModel):
title: str
key_points: list[str]
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p3-70b-instruct",
messages=[...], # Same as above
response_format={"type": "json_object", "schema": DocumentInfo.model_json_schema()}
)
```
## Limitations
Document Inlining is only intended to handle images and documents that contain text. Document Inlining may provide subpar results for highly visual, spatially dependent, or layout-heavy content that does not translate well into structured text.
* Maximum document size: 50 pages or the model's context size (whichever is smaller)
* Maximum document size: \~32 MB if sent as base64 encoded string, \~100 MB if sent as URL
* Supported formats: PDFs and images
## Model Compatibility
Document Inlining works with any LLM on Fireworks, including:
* Serverless models
* On-demand models
* Fine-tuned and custom models
* Vision models
Simply append `#transform=inline` to your document URL to enable the feature with any supported model. Multiple documents are supported. Vision models also support document inlining with images for use cases that require both document processing and non-document vision. Users can control whether to inline a document by selectively appending `#transform=inline` to image\_url.url of each attachment.
## Pricing
During public preview, Document Inlining incurs no added costs compared to our typical text models. For example, let’s say you’re conducting a structured extraction task where you provide:
Input: 10 token Prompt + document with 1,000 tokens worth of text
Output: 100 tokens
You would simply pay for the 1110 tokens worth of input and output token costs but will NOT incur additional costs for document parsing.
Please note that Document Inlining is in Public Preview mode and subject to change. Please contact us on Discord if you have feedback or questions, or at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) for enterprise inquiries.
# Concepts
This document outlines basic Fireworks AI concepts.
## Resources
### Account
Your account is the top-level resource under which other resources are located. Quotas and billing are
enforced at the account level, so usage for all users in an account contribute to the same quotas and bill.
For developer accounts, the account ID is auto-generated from the email address used to sign up.
Enterprise accounts can optionally choose a custom, unique account ID.
### User
A user is an email address associated with an account. Users added to an account have full access to delete, edit, and create resources within the account, such as deployments and models.
### Model
A model is a set of model weights and metadata associated with the model. A model cannot be used
for inference until it is deployed to one or more deployments, creating a "deployed model". There
are two types of models:
* Base models
* Low-rank adaptation (LoRA) addons
See our [Models overview](/models/overview) page for details.
### Deployment
A deployment is a collection of one or more model servers that host one base model and optionally
one or more LoRA addons.
Fireworks provides a set of "serverless" deployments that host common base models. These deployments
may be used for [serverless inference](/models/overview#serverless-inference) as well as hosting [serverless addons](/models/overview#serverless-addons).
### Deployed model
A deployed model is an instance of a base model or LoRA addon that is loaded into a deployment.
### Dataset
A dataset is an immutable set of training examples that can be used to fine-tune a model.
### Fine-tuning job
A fine-tuning job is an offline training job that uses a dataset to train a LoRA addon model.
## Resource names and IDs
A full resource name looks like
```
accounts/my-account/models/my-model
```
The individual segments `my-account` and `my-model` are account and [model IDs](https://docs.fireworks.ai/models/overview), respectively.
Resource IDs must satisfy the following constraints:
* between 1 and 63 characters (inclusive)
* consist of a-z, 0-9, and hyphen (-)
* do not begin or end with a hyphen (-)
Some APIs take the full resource name, while others may take a resource ID if the context is clear.
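As a rough illustration (not an official validator), the ID constraints above correspond approximately to the following pattern:
```python
import re

# 1-63 characters of a-z, 0-9, and hyphens, not beginning or ending with a hyphen.
RESOURCE_ID_RE = re.compile(r"^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$")

def is_valid_resource_id(resource_id: str) -> bool:
    return bool(RESOURCE_ID_RE.fullmatch(resource_id))

print(is_valid_resource_id("my-model"))  # True
print(is_valid_resource_id("-bad-id"))   # False
```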
## Control plane and data plane
The Fireworks API can be split into a control plane and a data plane.
* The **control plane** consists of APIs used for managing the lifecycle of resources. This
includes your account, models, and deployments.
* The **data plane** consists of the APIs used for inference and the backend services that power
them.
## Interfaces
Users can interact with Fireworks through one of many interfaces:
* The **web console** at [https://fireworks.ai](https://fireworks.ai)
* The command-line interface `firectl`
* [Python SDK](/tools-sdks/python-client/installation)
# Introduction
Fireworks AI is a generative AI inference platform to run and customize models with industry-leading speed and production-readiness.
## Welcome to Fireworks AI
{/*
Make an API call to an open-source LLM
Watch to learn more about the Fireworks AI platform
*/}
## What we offer
The Fireworks platform empowers developers to create generative AI systems with the best quality, cost and speed. All publicly available services are pay-as-you-go with developer friendly [pricing](https://fireworks.ai/pricing). See the below list for offerings and docs links. Scroll further for more detailed descriptions and blog links.
* **Inference:** Run generative AI models on Fireworks-hosted infrastructure with our optimized FireAttention inference engine. Multiple inference options ensure there’s always a fit for your use case.
* **Modalities and Models:** Use 100s of models (or bring your own) across modalities of:
* [Text](https://docs.fireworks.ai/guides/querying-text-models)
* [Audio](https://docs.fireworks.ai/api-reference/audio-transcriptions)
* [Image](https://docs.fireworks.ai/api-reference/generate-a-new-image-from-a-text-prompt)
* [Embedding](https://docs.fireworks.ai/guides/querying-embeddings-models)
* [Vision-understanding](https://docs.fireworks.ai/guides/querying-vision-language-models)
* **Adaptation:** [Tune](https://docs.fireworks.ai/fine-tuning/fine-tuning-models) and optimize your model and deployment for the best quality, cost, and speed. [Serve](https://docs.fireworks.ai/models/deploying) and experiment with hundreds of fine-tuned models with our multi-LoRA [capabilities](https://fireworks.ai/blog/multi-lora).
* **Compound AI Development:** Use [JSON mode](https://docs.fireworks.ai/structured-responses/structured-response-formatting), [grammar mode](https://docs.fireworks.ai/structured-responses/structured-output-grammar-based) or [function calling](https://docs.fireworks.ai/guides/function-calling) to build a collaborative system with reliable and performant outputs
## Inference
Fireworks has 3 options for running generative AI models with unparalleled speed and cost efficiency.
* **Serverless**: The easiest way to get started. Use the most popular models on pre-configured GPUs. Pay per token and avoid cold boots.
* **[On-demand](https://fireworks.ai/blog/why-gpus-on-demand)**: The most flexible option for scaling. Use private GPUs to support your specific needs and only pay when you’re using them. GPUs running Fireworks software offer both \~250% improved throughput and 50% improved latency compared to vLLM. It excels for:
* **Production volume** - Per-token costs decrease with more volume and there are no set rate limits
* **Custom needs and reliability** - On-demand GPUs are private to you. This enables complete control to tailor deployments for speed/throughput/reliability or to run more specialized models
* **Enterprise Reserved GPUs** - Use private GPUs with hardware and software set-up personally tailored by the Fireworks team for your use case. Enjoy SLAs, dedicated support, bring-your-own-cloud (BYOC) deployment options, and enterprise-only optimizations.
| Property | **Serverless** | **On-demand** | **Enterprise reserved** |
| -------------------------- | -------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| **Performance** | Industry-leading speed on Fireworks-curated set-up. Performance may vary with others’ usage. | Speed dependent on user-specified GPU configuration and private usage. Per GPU latency should be significantly faster than vLLM. | Tailor-made set-up by Fireworks AI experts for best possible latency |
| **Getting Started** | Self-serve - immediately use serverless with 1 line of code | Self-serve - configure GPUs, then use them with 1 line of code. | Chat with Fireworks |
| **Scaling and management** | Scale up and down freely within rate limits | Option for auto-scaling GPUs with traffic. GPUs scale to zero automatically, so no charge for unused GPUs and for boot-ups. | Chat with Fireworks |
| **Pricing** | Pay fixed price per token | Pay per GPU second with no commitments. Per GPU throughput should be significantly greater than options like vLLM. | Customized price based on reserved GPU capacity |
| **Commitment** | None | None | Arrange plan length with Fireworks |
| **Rate limits** | Yes, see [quotas](https://docs.fireworks.ai/accounts/quotas) | No rate limits. [Quotas](https://docs.fireworks.ai/accounts/quotas) on number of GPUs | None |
| **Model Selection** | Collection of popular models, curated by Fireworks | Use 100s of pre-uploaded models or upload your own custom model within supported [architecture](https://docs.fireworks.ai/models/uploading-custom-models) | Use 100s of pre-uploaded models or upload any model |
## FireOptimizer
**FireOptimizer** - Fireworks optimizes inference for your workload and your use case through FireOptimizer. FireOptimizer includes several optimization techniques. Publicly available features are:
* **[Fine-tuning](https://fireworks.ai/blog/fine-tune-launch)** - Quickly fine-tune models with LoRA for the best quality on your use case
* Upload data and choose your model to start tuning
* Pay per token of training data.
* Serve and evaluate models immediately on Fireworks
* Download model weights to use anywhere
* **[Multi-LoRA serving](https://fireworks.ai/blog/multi-lora)** - Deploy 100s of fine-tuned models at no extra cost.
* Zero extra cost for serving LoRAs. 1 million requests with 50 models is the same price as 1 million requests with 1 model.
* Use models fine-tuned on Fireworks or upload your own fine-tuned adapter
* Host hundreds of models on the same deployment on either serverless or dedicated deployments
## Compound AI
Fireworks makes it easy to use multiple models and modalities together in one compound AI system. Features include:
* **[JSON mode and grammar mode](https://fireworks.ai/blog/why-do-all-LLMs-need-structured-output-modes)** - Provide structure to any LLM on Fireworks with either (a) JSON schema (b) Context-free grammar to guarantee that LLM output follows your desired format. These structured output modes are particularly useful to ensure LLMs can reliably call and pipe outputs to other models, APIs and components.
* **[Function calling](https://fireworks.ai/blog/firefunction-v2-launch-post)** - Fireworks offers function calling support via our proprietary Firefunction models or Llama 3.1 70B
{/*
## Support
Join our community of Generative AI builders
Have more questions? Drop us a note!
*/}
# Onboarding
A quick guide to navigating and building with the Fireworks platform.
# Introduction
Welcome to the **Fireworks onboarding guide**!
This guide is designed to help you quickly and effectively get started with the Fireworks platform, whether you're a developer, researcher, or AI enthusiast. By following this step-by-step resource, you'll learn how to explore and experiment with state-of-the-art AI models, prototype your ideas using Fireworks’ serverless infrastructure, and scale your projects with advanced on-demand deployments.
### Who this guide is for
This guide is designed for new Fireworks users who are exploring the platform for the first time. It provides a hands-on introduction to the core features of Fireworks, including the model library, playgrounds, and on-demand deployments, all accessible through the web app.
For experienced users, this guide serves as a starting point, with future resources planned to dive deeper into advanced tools like `firectl` and other intermediate features to enhance your workflow.
### Objectives of the guide
* **Explore the Fireworks model library**: Navigate and select generative AI models for text, image, and audio tasks.
* **Experiment with the playground**: Test prompts, tweak parameters, and generate outputs in real time.
* **Prototype effortlessly**: Use Fireworks’ serverless infrastructure to deploy and iterate without managing servers.
* **Scale your AI**: Learn how on-demand deployments offer predictable performance and advanced customization.
* **Develop complex systems**: Unlock advanced capabilities like Compound AI, function calling, and retrieval-augmented generation to create production-ready applications.
By the end of this guide, you’ll be equipped with the knowledge and tools to confidently use Fireworks to build, scale, and optimize AI-powered solutions. Let’s get started!
***
# Step 1. Explore our model library
Fireworks provides a range of leading open-source models for tasks like text generation, code generation, and image understanding.
With the Fireworks [model library](https://fireworks.ai/models), you can choose from our wide range of popular LLMs, VLMs, LVMs, and audio models, such as:
* [**LLMs**: Llama 3.3 70B](https://fireworks.ai/models/fireworks/llama-v3p3-70b-instruct), [Deepseek V3](https://fireworks.ai/models/fireworks/deepseek-v3), and [Qwen2.5 Coder 32B Instruct](https://fireworks.ai/models/fireworks/qwen2p5-coder-32b-instruct).
* [**VLMs**: Llama 3.2 90B Vision Instruct](https://fireworks.ai/models/fireworks/llama-v3p2-90b-vision-instruct).
* [**Vision models**: BFL’s FLUX.1 \[dev\] FP8](https://fireworks.ai/models/fireworks/flux-1-dev-fp8) and [Stability.ai’s Stable Diffusion 3.5 Large Turbo](https://fireworks.ai/models/fireworks/stable-diffusion-3p5-large-turbo).
* [**Audio models**: Whisper V3](https://fireworks.ai/models/fireworks/whisper-v3) and [(blazing fast)](https://fireworks.ai/blog/audio-transcription-launch) [Whisper V3 Turbo](https://fireworks.ai/models/fireworks/whisper-v3-turbo).
as well as [**embedding models**](https://docs.fireworks.ai/guides/querying-embeddings-models#list-of-available-models) from Nomic AI.
In this video, we introduce the **Fireworks Model Library**, your gateway to a diverse range of open-source and proprietary models designed for tasks like text generation, image understanding, and audio processing. Whether you’re a developer or a creative, Fireworks makes it easy to find and integrate the right tools for your generative AI needs.
### What you’ll learn:
1️⃣ **Navigating the model library**: Browse popular models, filter by deployment type, and search for specific tools like Llama, Whisper, and Flux.\
2️⃣ **Customizing your experience**: Use filters like "Serverless Models" to find models that fit your specific needs.\
3️⃣ **Seamless integration**: Discover how Fireworks simplifies the process of discovering and managing AI models.
Developers building generative AI applications can interact with Fireworks in multiple ways:
* 🌐 **Via the web app**: Access the Fireworks platform directly in your browser for easy model management.
* 🐍 [**Through our Python SDK**](https://docs.fireworks.ai/tools-sdks/python-client/installation): Programmatically integrate and manage models within your codebase.
* 🔗 [**With external providers**](https://docs.fireworks.ai/tools-sdks/openai-compatibility): Pass your Fireworks API key to third-party tools for seamless workflow integration.
For additional documentation and guides, check out our [Cookbook](https://docs.fireworks.ai/cookbook/learn_with_fireworks/ecosystem_examples), which includes community-contributed notebooks and applications.
### Action items
* 👀 **Browse the model library**: Explore our [open and closed-source models](https://fireworks.ai/models).
* 📚 **Read real-world use cases**: See how customers are building production systems like:
* [Upwork’s Proposal Writer](https://fireworks.ai/blog/story-upwork-proposal)
* [Cresta’s Knowledge Assistant](https://fireworks.ai/blog/story-cresta-knowledge-assist)
* [Cursor’s Fast Apply](https://fireworks.ai/blog/cursor)
* [Sourcegraph’s Cody](https://fireworks.ai/blog/accelerating-code-completion-with-fireworks-fast-llm-inference)
* 👋 **Join our Discord community**: [Connect and share your projects](https://discord.com/invite/fireworks-ai).
***
# Step 2. Experiment using the model playground
The easiest way to get started with Fireworks and test models with minimal setup is through the **Model Playground**. Here, you can experiment with prompts, adjust parameters, and get immediate feedback on results before moving to more advanced steps.
Take a closer look at how the LLM Playground lets you experiment with text-based models.
In this video, we explore the **Fireworks Model Playground**, the easiest way to experiment with LLMs, adjust parameters, and get instant feedback. Whether you’re crafting creative prompts, refining outputs, or testing model performance, the Playground is your go-to tool for seamless experimentation.
### ✨ What you’ll learn:
* 🔍 **Getting started**: Access the Playground from the Model Library by selecting models like Llama 3.3 70B Instruct.
* 📋 **Model details**: Discover key information, including starter code in Python, Typescript, Java, Go, and Shell for Chat and Completion modes.
* 🎭 **Running prompts**: Test creative prompts like “Write a synopsis of the modern 2020 version of the Cats musical” and see instant results.
* 🎛️ **Parameter controls**: Adjust settings like temperature and max tokens to refine outputs to your liking.
* ⚡ **Completion mode**: Explore latency and tokens-per-second metrics with prompts like “Write a synopsis of the modern 2020 Tarzan movie with Brendan Fraser.”
* 💻 **Code integration**: Generate ready-to-use code snippets directly from the Playground for effortless integration into your projects.
Discover how the Image Playground transforms visual AI experimentation into an intuitive process.
In this video, we dive into the **Fireworks Image Playground**, where you can create stunning visuals, refine parameters, and explore the possibilities of AI-driven image generation. Perfect for developers, designers, and creators, the Image Playground is your gateway to experimenting with prompts and parameters for artistic and practical outputs.
### ✨ What you’ll learn:
* ☑️ **Getting started**: Navigate the Model Library to find image models like FLUX.1 schnell FP8 and open them in the Model Playground.
* ☑️ **Crafting prompts**: Use creative prompts like “Movie poster for a film set in a world where gravity doesn’t exist” and watch the model bring your vision to life.
* ☑️ **Adjusting parameters**: Experiment with settings like Guidance Scale, Inference Steps, and Seed to refine and perfect your results.
* ☑️ **Exploring variants**: Test different models, such as FLUX.1 dev FP8, for varied image quality and creative flexibility.
* ☑️ **Integrating code**: Generate and view sample code in Python, Typescript, or Shell, complete with request parameters and response codes for seamless integration.
Experience how the Audio Playground empowers advanced audio transcription and translation tasks.
Welcome to Part 2C of our onboarding series! In this video, we explore the **Fireworks Audio Playground**, showcasing the incredible speed and accuracy of the Whisper Turbo models. Whether you’re transcribing, translating, or analyzing audio, Fireworks makes it easy to experiment and unlock the potential of advanced audio models.
### ✨ What you’ll learn:
* 🎵 **Real-world test case**: Using the song *Do You Hear the People Sing?* from *Les Misérables*, featuring nine distinct languages and various English accents, to demonstrate transcription and translation capabilities.
* 🔍 **Navigating the model library**: Find Whisper v3 Turbo and access its playground.
* 📂 **Uploading audio**: Test the model with screen-recorded audio to ensure unbiased results without metadata influence.
* ⚡ **Fast and accurate transcription**: Observe Whisper Turbo’s ability to transcribe multilingual content at lightning speed and compare its output to the original lyrics.
### 🔑 Key features of the Audio Playground:
* 🌍 **Multilingual capabilities**: Whisper Turbo excels in recognizing and transcribing multiple languages and dialects.
* ⚡ **Incredible speed**: Experience near-instant transcriptions for even complex audio files.
* 🎛️ **Interactive testing**: Upload audio, tweak parameters, and explore transcription and translation features in real time.
Each model in the Playground includes the following features, designed to enhance your experimentation and streamline your workflow:
* 🎛️ **Parameter controls**: Adjust settings like [temperature](https://docs.fireworks.ai/guides/querying-text-models#temperature) and [max tokens](https://docs.fireworks.ai/guides/querying-text-models#max-tokens) for LLMs or image-specific parameters (e.g., Guidance Scale) for image generation models. These controls allow you to fine-tune the behavior and outputs of the models, helping you achieve the desired results for different use cases.
* 🧩 **Code samples**: Copy-paste ready-to-use code in Python, Typescript, or Shell to integrate models directly into your applications. This eliminates the guesswork of API implementation and speeds up development, so you can focus on building impactful solutions.
* 🎨 **Additional UI elements**: Leverage interactive features like file upload buttons for image or audio inputs, making it easy to test multimodal capabilities without any additional setup. This ensures a smooth, hands-on testing experience, even for complex workflows.
* 🔍 **Model ID**: Clearly displayed in the format [`account/fireworks/models/`](https://docs.fireworks.ai/getting-started/concepts#resource-names-and-ids), allowing you to switch between models effortlessly with a single line of code, making experimentation and integration faster and more efficient.
### Action items
* 💻 🖱️ **Sign into your account** and explore various models, including:
* **LLMs and VLMs**: [Llama 3.3 70B](https://fireworks.ai/models/fireworks/llama-v3p3-70b-instruct/playground), [Llama 3.2 90B Vision Instruct](https://fireworks.ai/models/fireworks/llama-v3p2-90b-vision-instruct/playground)
* **Image models**: [FLUX.1 \[dev\] FP8](https://fireworks.ai/models/fireworks/flux-1-dev-fp8/playground)
* **Audio models**: [Whisper V3 Turbo](https://fireworks.ai/models/fireworks/whisper-v3-turbo/playground)
* ❓ **Have questions, comments, or feedback?** Head over to Discord and post in:
* [`#feature-requests`](https://discord.com/channels/1137072072808472616/1137075904938508340)
* [`#questions`](https://discord.com/channels/1137072072808472616/1137075268138324079)
* `#bug-reports`
* 📚 **Check out sampling options**: Review the [sampling options for text models](https://docs.fireworks.ai/guides/querying-text-models#temperature) to see the parameters we currently support.
***
# Step 3. Prototyping with serverless
Fireworks' **serverless infrastructure** lets you quickly prototype AI models without managing servers or committing to long-term contracts. This setup supports fast experimentation and seamless scaling for your projects.
### Why use Fireworks serverless?
* 🚀 **Launch instantly**: Deploy apps with no setup or configuration required.
* 🎯 **Focus on prompt engineering**: [Design](https://docs.fireworks.ai/guides/querying-text-models#using-the-api) and [refine](https://docs.fireworks.ai/structured-responses/structured-output-grammar-based) your prompts without worrying about infrastructure.
* ⚙️ **Adjust parameters easily**: Modify settings like temperature and max tokens to customize model outputs.
* 💰 **Pay-as-you-go**: Only pay for [what you use](https://fireworks.ai/pricing#text), with pricing based on parameter size buckets, making it cost-effective for projects of any size.
To start prototyping, you’ll need to obtain your API key, which allows you to interact with Fireworks' serverless models programmatically.
In this video, we’ll guide you through generating your **Fireworks API key**, the first step to leveraging Fireworks’ serverless infrastructure. Prototype AI models with ease, scale seamlessly, and focus on building without worrying about managing servers.
### ✨ Why use Fireworks serverless?
* 🚀 **Launch instantly**: Deploy apps with no setup or configuration required.
* 🎯 **Focus on prompt engineering**: Refine your prompts without infrastructure headaches.
* ⚙️ **Adjust parameters easily**: Tweak settings like temperature and max tokens to customize outputs.
* 💰 **Pay-as-you-go**: Cost-effective pricing based on usage, perfect for projects of any size.
### 🛠️ How to get your API key:
1️⃣ **Navigate to User Settings**: Log in to your Fireworks account and click the profile icon.\
2️⃣ **Generate your key**: Select ‘API Keys’ and click ‘Create API Key’ to generate your unique key.\
3️⃣ **Copy and secure**: Save your API key securely—it’s essential for authentication.
### Using your API key
Your API key is essential for securely accessing and managing your serverless deployments. Here’s how to use it:
* **Via the API**: Include your API key in the headers of your [RESTful API requests](https://docs.fireworks.ai/api-reference/post-chatcompletions#authorization-authorization) to integrate Fireworks’ models into your applications.
* **Using our SDK**: Configure the Fireworks [Python library](https://docs.fireworks.ai/tools-sdks/firectl/commands/authentication#authenticate-with-api-key) with your API key to manage and deploy models programmatically.
* **Through third-party tools**: Pass your API key to third-party clients (like [LangChain](https://python.langchain.com/docs/integrations/providers/fireworks/#authentication)) to incorporate Fireworks into your existing workflows, enabling you to use serverless models seamlessly.
Additionally, Fireworks is [**OpenAI compatible**](https://docs.fireworks.ai/tools-sdks/openai-compatibility), enabling you to leverage familiar OpenAI tools and integrations within your Fireworks projects.
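For instance, here is a minimal sketch of calling a serverless model with your key via the Fireworks Python client (covered in more detail in the Quickstart); the model name is just an example:
```python
import os

from fireworks.client import Fireworks

# Read the key from the environment instead of hard-coding it.
client = Fireworks(api_key=os.environ["FIREWORKS_API_KEY"])

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Say hello from serverless!"}],
)
print(response.choices[0].message.content)
```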
In this video, we’ll show you how to use your **Fireworks API key** to call serverless LLMs and effortlessly prototype with Fireworks’ serverless infrastructure. Whether you're creating structured datasets or testing model outputs, Fireworks makes scaling your ideas simple—no servers required!
### ✨ What you’ll learn:
* 📖 **Accessing the Cookbook**: Explore Fireworks' GitHub repo and open example notebooks like *"Llama 3.1 Synthetic Data Generation"* in Colab.
* 🔑 **Using your API key**: Learn how to securely generate and use your Fireworks API key for authentication.
* 🤖 **Interacting with models**: Call Llama 3.1 models to generate structured synthetic data and customize outputs.
* 🎯 **Prompt engineering in action**: See how to craft prompts to generate JSON-structured quiz questions with context, responses, and metadata.
### 🌟 Featured example:
Watch as we:
* 📍 **Generate geography quiz questions**: Using Llama 3.1 405B for structured outputs.
* 💾 **Save data**: Store structured data in JSONL format for project use.
* ⚡ **Showcase flexibility**: Highlight how Fireworks supports dataset creation, testing, and more.
### Action items
* 🔑 **Get your API key**: Navigate to your account settings and [generate your API key](https://fireworks.ai/account/api-keys) to authenticate your requests.
* 📓 **Call a serverless model**: See how you can call a serverless model using a [sample notebook](https://colab.research.google.com/drive/1arL7bWuF2P3soS3p19MWJeUDtW0Eu5tk?usp=sharing).
* 🔖 **Read the API usage guide**: Understand the [different endpoints and parameters](https://docs.fireworks.ai/api-reference/introduction) available for use in your projects.
* 📚 **Read the serverless deployment guides**: Access our docs on serverless usage, [pricing](https://docs.fireworks.ai/faq/billing-pricing-usage/billing/credit-system), and [rate limits](https://docs.fireworks.ai/guides/quotas_usage/rate-limits).
* 💻 **Try out additional sample notebooks**: Use your Fireworks API key to explore [more](https://colab.research.google.com/drive/1LvUsItqOAsRUhXjyBexiborSricT_e3H?usp=sharing) [sample](https://colab.research.google.com/drive/1uCm7ZcbsWvWMpRQvJVG9e6E2sD0ZeL8x?usp=sharing) [notebooks](https://colab.research.google.com/drive/1huPsNm9l4OcJvIcu63u0FFWF8X2J7zW3?usp=sharing) in our [cookbook](https://docs.fireworks.ai/cookbook/learn_with_fireworks/ecosystem_examples).
***
# Step 4. Scale out with on-demand deployments
Fireworks’ **on-demand deployments** provide you with dedicated GPU instances, ensuring [predictable performance and advanced customization options](https://fireworks.ai/blog/why-gpus-on-demand) for your AI workloads. These deployments allow you to scale efficiently, optimize costs, and access exclusive models that aren’t available on serverless infrastructure.
### Why choose on-demand deployments?
* 🏎️ **Predictable performance**: Enjoy consistent performance unaffected by other users’ workloads.
* 📈 **Flexible scaling**: Adjust replicas or GPU resources to handle varying workloads efficiently.
* ⚙️ **Customization**: Choose GPU types, enable features like long-context support, and apply quantization to optimize costs.
* 🔓 **Expanded access**: Deploy [larger models](https://fireworks.ai/models/fireworks/llama-v3p1-405b-instruct) or [custom models](https://docs.fireworks.ai/models/uploading-custom-models) from Hugging Face files.
* 💰 **Cost optimization**: Save more with reserved capacity when you have high utilization needs.
### Key features of on-demand deployments
* 🔄 **Replica scaling**: Automatically [adjust replicas](https://docs.fireworks.ai/guides/ondemand-deployments#replica-count-horizontal-scaling) to handle workload changes.
* 🖥️ **Hardware options**: Choose GPUs like [NVIDIA H100, NVIDIA A100, or AMD MI300X](https://docs.fireworks.ai/guides/ondemand-deployments#choosing-hardware-type) to match your performance and budget needs. Check the [Regions Guide](https://docs.fireworks.ai/deployments/regionss) for availability.
* ⚡ **Quantization**: Use FP8 or other precision settings to improve speed and reduce costs while [keeping accuracy high](https://fireworks.ai/blog/fireworks-quantization). See the [Quantization Guide](https://docs.fireworks.ai/models/quantization).
### Action items
* 🔖 **Understand the benefits of on-demand versus serverless**: Learn about the [full range of deployment options](https://fireworks.ai/blog/why-gpus-on-demand) and how to [customize them to your needs](https://docs.fireworks.ai/deployments/reservations).
* 📚 **Explore optimization techniques**: Learn how [caching](https://docs.fireworks.ai/guides/prompt-caching), [quantization](https://docs.fireworks.ai/models/quantization), and [speculative decoding](https://docs.fireworks.ai/guides/predicted-outputs) can improve performance and reduce costs.
* ❓ **Check out our FAQs**: Find answers to common questions about [account management](https://docs.fireworks.ai/faq/account/access/setup-management), [support services](https://docs.fireworks.ai/faq/general/support/platform-support), and [on-demand deployment infrastructure](https://docs.fireworks.ai/faq/deployment/ondemand/ondemand-deployment-scaling).
***
# Step 5. Building Compound AI systems
Expand your AI capabilities by incorporating advanced features like **Compound AI**, **function calling**, or **retrieval-augmented generation (RAG)**. These tools enable you to build sophisticated applications that integrate seamlessly with external systems. For greater control, consider on-prem or BYOC deployments.
### With Fireworks, you can:
* 🛠️ **Leverage advanced features**: Build Compound AI systems with function calling, RAG, and agents (Advanced Features).
* 🔗 **Integrate external tools**: Connect models with APIs, databases, or other services to enhance functionality.
* 🔍 **Optimize workflows**: Use Fireworks’ advanced tools to streamline AI development, enhance system efficiency, and scale complex applications with ease.
### Action items
* 📚 **Learn about Compound AI and Advanced Features**: Explore richer functionality to create more sophisticated applications.
* **Fireworks Compound AI System**: With [f1](https://fireworks.ai/blog/fireworks-compound-ai-system-f1), experience how specialized models work together to deliver groundbreaking performance, efficiency, and advanced reasoning capabilities.
* **Document inlining**: Make any LLM capable of [processing documents](https://fireworks.ai/blog/document-inlining-launch) for seamless retrieval, summarization, and comprehension.
* **Multimodal enterprise**: See how Fireworks integrates text, image, and audio models to power [enterprise-grade multimodal AI](https://fireworks.ai/blog/multimodal-enterprise) solutions.
* **Multi-LoRA fine-tuning**: Learn how Multi-LoRA fine-tuning enables [precise model customization](https://fireworks.ai/blog/multi-lora) across diverse datasets.
* **Audio transcription launch**: Explore Fireworks’ state-of-the-art [audio transcription models](https://fireworks.ai/blog/audio-transcription-launch) for fast and accurate speech-to-text applications.
* 📞 **Contact us for enterprise solutions**: Have complex requirements or need reserved capacity? [Reach out to our team](https://fireworks.ai/company/contact-us?tab=business) to discuss tailored solutions for your organization.
***
### 🌟 Dive deeper into the docs
Ready to learn more? Continue exploring the Fireworks documentation to uncover specific tools, workflows, and advanced features that can help you take your AI systems to the next level.
# Quickstart
Get started in 5 minutes
Fireworks.ai is a lightning-fast inference platform that serves generative AI models. All models are exposed through the `completions` and `chat completions` APIs.
Using the API, you can build on popular open-source models and custom fine-tuned models like FireFunction, Hermes 2 Pro, etc.
Experience all our models in the [model playground!](https://fireworks.ai/models/fireworks/mixtral-8x7b-instruct)
Quickstart helps you to get started in minutes. However, if you want to explore more, please refer to the [guides](/guides/querying-text-models) section or the [API reference](/api-reference/introduction).
In this guide, you will:
* Set up your development environment
* Choose an SDK
* Call the Fireworks API with an API Key
## Account Creation
Create a [Fireworks AI](https://fireworks.ai/login) account. Under Account Settings, click on [API Keys](https://fireworks.ai/api-keys) to generate one.
Please keep the API Key in a secure location.
### Set up developer environment
Before installing, ensure that you have the right version of Python installed. Optionally, you might want to set up a virtual environment too.
```bash
pip install --upgrade fireworks-ai
```
Fireworks Python Client is OpenAI API Compatible.
Step-by-step instructions for setting an environment variable for respective OS platforms:
Depending on your shell, you'll need to edit either `~/.bash_profile` for Bash or `~/.zshrc` for `Zsh`.
You can do this by running the command:
```bash bash
vim ~/.bash_profile
```
```zsh zsh
vim ~/.zshrc
```
Add a new line to the file with the following:
```bash
export FIREWORKS_API_KEY=""
```
After saving the file, apply the changes by either restarting your terminal session or running the appropriate command below, depending on the file you edited.
```bash bash
source ~/.bash_profile
```
```zsh zsh
source ~/.zshrc
```
You can verify that the variable has been set correctly by running `echo $FIREWORKS_API_KEY`
You can open Command Prompt by searching for it in the Windows search bar or by pressing Win + R, typing cmd, and pressing Enter.
```
setx FIREWORKS_API_KEY ""
```
To verify that the variable has been set correctly, you can close and reopen Command Prompt and type:
```
echo %FIREWORKS_API_KEY%
```
You can quickly instantiate with the generated API Key and call the Fireworks API.
```python
from fireworks.client import Fireworks

client = Fireworks(api_key="")
response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{
        "role": "user",
        "content": "Say this is a test",
    }],
)
print(response.choices[0].message.content)
```
Before installing, ensure that you have the right version of Python installed. Optionally, you might want to set up a virtual environment too.
```bash
pip install --upgrade openai
```
The Fireworks AI platform offers a drop-in replacement for the OpenAI Python client.
Step-by-step instructions for setting an environment variable for respective OS platforms:
Depending on your shell, you'll need to edit either `~/.bash_profile` for Bash or `~/.zshrc` for `Zsh`.
You can do this by running the command:
```bash bash
vim ~/.bash_profile
```
```zsh zsh
vim ~/.zshrc
```
Add a new line to the file with the following:
```bash
export OPENAI_API_BASE="https://api.fireworks.ai/inference/v1"
export OPENAI_API_KEY=""
```
After saving the file, apply the changes by either restarting your terminal session or running the appropriate command below, depending on the file you edited.
```bash bash
source ~/.bash_profile
```
```zsh zsh
source ~/.zshrc
```
You can verify that the variable has been set correctly by running
`echo $OPENAI_API_KEY`
You can open Command Prompt by searching for it in the Windows search bar or by pressing Win + R, typing cmd, and pressing Enter.
```
setx OPENAI_API_BASE "https://api.fireworks.ai/inference/v1"
setx OPENAI_API_KEY ""
```
To verify that the variable has been set correctly, you can close and reopen Command Prompt and type:
```
echo %OPENAI_API_KEY%
```
You can quickly instantiate with the generated API Key and call the Fireworks API through OpenAI Python SDK.
```python
import os

from openai import OpenAI

client = OpenAI(
    # Point the client at the Fireworks endpoint configured above.
    base_url=os.environ.get("OPENAI_API_BASE", "https://api.fireworks.ai/inference/v1"),
    api_key=os.environ.get("OPENAI_API_KEY"),
)
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    # notice the change in the model name
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
)
print(response.choices[0].message.content)
```
Before installing, ensure that you have the right version of Node installed, along with `npm` or a package manager of your choice.
```bash
npm install openai
```
The Fireworks AI platform offers a drop-in replacement for the OpenAI JavaScript client.
Step-by-step instructions for setting an environment variable for respective OS platforms:
Depending on your shell, you'll need to edit either `~/.bash_profile` for Bash or `~/.zshrc` for `Zsh`.
You can do this by running the command:
```bash bash
vim ~/.bash_profile
```
```zsh zsh
vim ~/.zshrc
```
Add a new line to the file with the following:
```bash
export OPENAI_API_KEY=""
```
After saving the file, apply the changes by either restarting your terminal session or running the appropriate command below, depending on the file you edited.
```bash bash
source ~/.bash_profile
```
```zsh zsh
source ~/.zshrc
```
You can verify that the variable has been set correctly by running
`echo $OPENAI_API_KEY`
You can open Command Prompt by searching for it in the Windows search bar or by pressing Win + R, typing cmd, and pressing Enter.
```
setx OPENAI_API_KEY ""
```
To verify that the variable has been set correctly, you can close and reopen Command Prompt and type:
```
echo %OPENAI_API_KEY%
```
You can quickly instantiate with the generated API Key and call the Fireworks API through OpenAI JavaScript SDK.
```javascript
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'https://api.fireworks.ai/inference/v1',
  apiKey: process.env['OPENAI_API_KEY'],
});

const completion = await openai.chat.completions.create({
  messages: [{ role: "user", content: "Say this is a test" }],
  model: "accounts/fireworks/models/llama-v3p1-8b-instruct",
});
console.log(completion.choices[0].message.content);
```
cURL is a popular open-source command-line tool for sending HTTP requests. Most operating systems ship with cURL by default.
If you are not sure whether cURL is installed, follow the first two steps of this guide to set it up; otherwise, skip to **Step Three**.
Check if your operating system has cURL installed by running `curl https://api.fireworks.ai`
macOS comes with the cURL tool bundled with the operating system.
If you want to upgrade to the latest version shipped by the cURL project, we recommend installing homebrew:
```bash Homebrew
brew install curl
```
Most Linux distributions offer curl and libcurl packages if they are not installed by default.
```bash apt
apt install curl
```
```bash yum
yum install curl
```
Windows 10 comes with the cURL tool bundled with the operating system since version 1803.
If you have an older Windows version or just want to upgrade to the latest version shipped by the cURL project, download the latest official cURL release for Windows from [curl.se/windows](https://curl.se/windows).
Step-by-step instructions for setting an environment variable for respective OS platforms:
Depending on your shell, you'll need to edit either `~/.bash_profile` for Bash or `~/.zshrc` for `Zsh`.
You can do this by running the command:
```bash bash
vim ~/.bash_profile
```
```zsh zsh
vim ~/.zshrc
```
Add a new line to the file with the following:
```bash
export FIREWORKS_API_KEY=""
```
After saving the file, apply the changes by either restarting your terminal session or running the appropriate command below, depending on the file you edited.
```bash bash
source ~/.bash_profile
```
```zsh zsh
source ~/.zshrc
```
You can verify that the variable has been set correctly by running `echo $FIREWORKS_API_KEY`
You can open Command Prompt by searching for it in the Windows search bar or by pressing Win + R, typing cmd, and pressing Enter.
```
setx FIREWORKS_API_KEY ""
```
To verify that the variable has been set correctly, you can close and reopen Command Prompt and type:
```
echo %FIREWORKS_API_KEY%
```
Make your first API request with cURL. Note the use of `$FIREWORKS_API_KEY`.
```
curl \
--header 'Authorization: Bearer '$FIREWORKS_API_KEY \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3-8b-instruct",
"messages": [{
"role": "user",
"content": "Say this is a test"
}]
}' \
--url https://api.fireworks.ai/inference/v1/chat/completions
```
More details on calling various APIs can be found at our [API Reference](/api-reference)
## Dive in further
* Integrating Fireworks AI using LangChain
* Learn Stable Diffusion 3 API
* Create a unique model
* Deploy on our blazing-fast inference stack
Have fun!
If you have any questions, please reach out to us on [Discord](https://discord.gg/mMqQxvFD9A) or [Twitter](https://twitter.com/thefireworksai).
# Using function-calling
## Introduction
Function calling enables models to intelligently select and utilize tools based on user input. This powerful feature allows you to build dynamic agents that can access real-time information and generate structured outputs. The function calling API doesn't execute functions directly. Instead, it generates [OpenAI](https://platform.openai.com/docs/guides/function-calling)-compatible function call specifications that you can then implement.
## How function calling works
1. **Tools specifications**: You specify a **query** along with the **list of available tools** for the model. The tools are specified using [JSON Schema](https://json-schema.org/learn/getting-started-step-by-step). Each tool includes its name, description, and required parameters.
2. **Intent detection**: The model analyzes user input and determines whether to provide a conversational response or generate function calling specifications.
3. **Function call generation**: When appropriate, the model outputs structured function calls in OpenAI-compatible format, including all necessary parameters based on the context.
4. **Execution and response generation**: You execute the specified function calls and feed results back to the model for continued conversation.
## Supported models
A subset of models hosted on Fireworks supports function calling using the described syntax. These models are listed below. The [supportsTools](https://docs.fireworks.ai/api-reference/get-model#response-supports-tools) field in the model response also indicates whether the model supports function calling.
* [Llama 3.1 405B Instruct](https://fireworks.ai/models/fireworks/llama-v3p1-405b-instruct)
* [Llama 3.1 70B Instruct](https://fireworks.ai/models/fireworks/llama-v3p1-70b-instruct)
* [Qwen 2.5 72B Instruct](https://fireworks.ai/models/fireworks/qwen2p5-72b-instruct)
* [Mixtral MoE 8x22B Instruct](https://fireworks.ai/models/fireworks/mixtral-moe-8x22b-instruct)
* [Firefunction-v2](https://fireworks.ai/models/fireworks/firefunction-v2): Latest and most performant model, optimized for complex function calling scenarios (on-demand only)
* [Firefunction-v1](https://fireworks.ai/models/fireworks/firefunction-v1): Previous generation, Mixtral-based function calling model optimized for fast routing and structured output (on-demand only)
These models can all utilize function calling with the same syntax, shown below.
## Basic example: City population data retrieval with Llama 3.1 405B Instruct
For this example, let’s consider a user looking for population data for a specific city. We will provide the model with a tool that it can invoke to retrieve city population data.
1. To achieve this, we detail the purpose, arguments, and usage of the `get_city_population` function using [JSON Schema](https://json-schema.org/). This information is provided through the `tools` argument. The user query is sent as usual through the `messages` argument.
```python Request
import openai
import json
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key=""
)
# Define the function tool for getting city population
tools = [
{
"type": "function",
"function": {
# The name of the function
"name": "get_city_population",
# A detailed description of what the function does
"description": "Retrieve the current population data for a specified city.",
# Define the JSON schema for the function parameters
"parameters": {
# Always declare a top-level object for parameters
"type": "object",
# Properties define the arguments for the function
"properties": {
"city_name": {
# JSON Schema type
"type": "string",
# A detailed description of the property
"description": "The name of the city for which population data is needed, e.g., 'San Francisco'."
},
},
# Specify which properties are required
"required": ["city_name"],
},
},
}
]
# Define a comprehensive system prompt
prompt = f"""
You have access to the following function:
Function Name: '{tools[0]["function"]["name"]}'
Purpose: '{tools[0]["function"]["description"]}'
Parameters Schema: {json.dumps(tools[0]["function"]["parameters"], indent=4)}
Instructions for Using Functions:
1. Use the function '{tools[0]["function"]["name"]}' to retrieve population data when required.
2. If a function call is necessary, reply ONLY in the following format:
{{"city_name": "example_city"}}
3. Adhere strictly to the parameters schema. Ensure all required fields are provided.
4. Use the function only when you cannot directly answer using general knowledge.
5. If no function is necessary, respond to the query directly without mentioning the function.
Examples:
- For a query like "What is the population of Toronto?" respond with:
{{"city_name": "Toronto"}}
- For "What is the population of the Earth?" respond with general knowledge and do NOT use the function.
"""
# Initial message context
messages = [
{"role": "system", "content": prompt},
{"role": "user", "content": "What is the population of San Francisco?"}
]
# Call the model
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=messages,
tools=tools,
temperature=0.1
)
# Print the model's response
print(chat_completion.choices[0].message.model_dump_json(indent=4))
```
```json Response
{
"content": null,
"refusal": null,
"role": "assistant",
"audio": null,
"function_call": null,
"tool_calls": [
{
"id": "call_tPSbe4guTSXuUWbqtWguSJzu",
"function": {
"arguments": "{\"city_name\": \"San Francisco\"}",
"name": "get_city_population"
},
"type": "function",
"index": 0
}
]
}
```
2. In our case, the model decides to invoke the `get_city_population` tool with a specific argument. **Note** that the model itself does not invoke the tool; it only specifies the arguments. When the model issues a function call, the finish reason is set to `tool_calls`. The API caller is responsible for parsing the function name and arguments supplied by the model and invoking the appropriate tool.
```python Call External API
def get_city_population(city_name: str):
    print(f"{city_name=}")
    if city_name == "San Francisco":
        return {"population": 883305}
    else:
        raise NotImplementedError()

function_call = chat_completion.choices[0].message.tool_calls[0].function
tool_response = locals()[function_call.name](**json.loads(function_call.arguments))
print(tool_response)
```
```json Response
city_name='San Francisco'
{'population': 883305}
```
3. The API caller obtains the result of the tool invocation and passes it back to the model to generate the final response.
```python Request
agent_response = chat_completion.choices[0].message
# Append the response from the agent
messages.append(
{
"role": agent_response.role,
"content": "",
"tool_calls": [
tool_call.model_dump()
for tool_call in chat_completion.choices[0].message.tool_calls
]
}
)
# Append the response from the tool
messages.append(
{
"role": "tool",
"content": json.dumps(tool_response)
}
)
next_chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=messages,
tools=tools,
temperature=0.1
)
print(next_chat_completion.choices[0].message.model_dump_json(indent=4))
```
```json Response
{
"content": "The population of San Francisco is 883305.",
"refusal": null,
"role": "assistant",
"audio": null,
"function_call": null,
"tool_calls": null
}
```
This results in the following response
```
The population of San Francisco is 883305.
```
## Advanced example: Financial data retrieval
**TL;DR** **This example tutorial is available as a Python notebook** \[[code](https://github.com/fw-ai/cookbook/blob/main/learn/function-calling/notebooks_firefunction_openai/fireworks_function_calling_demo.ipynb) | [Colab](https://colab.research.google.com/drive/1m7Bk1360CFI50y24KBVxRAKYuEU3pbPU?usp=sharing)].
For this example, let's consider a user looking for Nike's financial data. We will provide the model with a tool that it can invoke to access the financial information of any company.
1. To achieve our goal, we will provide the model with information about the `get_financial_data` function. We detail its purpose, arguments, etc. in [JSON Schema](https://json-schema.org/). We send this information through the `tools` argument, and the user query as usual through the `messages` argument.
```python Request
import openai
import json
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key = ""
)
messages = [
{"role": "system", "content": f"You are a helpful assistant with access to functions."
"Use them if required."},
{"role": "user", "content": "What are Nike's net income in 2022?"}
]
tools = [
{
"type": "function",
"function": {
# name of the function
"name": "get_financial_data",
# a good, detailed description for what the function is supposed to do
"description": "Get financial data for a company given the metric and year.",
# a well defined json schema: https://json-schema.org/learn/getting-started-step-by-step#define
"parameters": {
# for OpenAI compatibility, we always declare a top level object for the parameters of the function
"type": "object",
# the properties for the object would be any arguments you want to provide to the function
"properties": {
"metric": {
# JSON Schema supports string, number, integer, object, array, boolean and null
# for more information, please check out https://json-schema.org/understanding-json-schema/reference/type
"type": "string",
# You can restrict the space of possible values in an JSON Schema
# you can check out https://json-schema.org/understanding-json-schema/reference/enum for more examples on how enum works
"enum": ["net_income", "revenue", "ebdita"],
},
"financial_year": {
"type": "integer",
# If the model does not understand how it is supposed to fill the field, a good description goes a long way
"description": "Year for which we want to get financial data."
},
"company": {
"type": "string",
"description": "Name of the company for which we want to get financial data."
}
},
# You can specify which of the properties from above are required
# for more info on `required` field, please check https://json-schema.org/understanding-json-schema/reference/object#required
"required": ["metric", "financial_year", "company"],
},
},
}
]
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=messages,
tools=tools,
temperature=0.1
)
print(chat_completion.choices[0].message.model_dump_json(indent=4))
```
```json Response
{
"content": "",
"role": "assistant",
"function_call": null,
"tool_calls": [
{
"id": "call_XstygHYlzKrI8hbERr0ybeOQ",
"function": {
"arguments": "{\"metric\": \"net_income\", \"financial_year\": 2022, \"company\": \"Nike\"}",
"name": "get_financial_data"
},
"type": "function",
"index": 0
}
]
}
```
2. In our case, the model decides to invoke the `get_financial_data` tool with a specific set of arguments. **Note** that the model itself won't invoke the tool; it only specifies the arguments. When the model issues a function call, the finish reason is set to `tool_calls`. The API caller is responsible for parsing the function name and arguments supplied by the model and invoking the appropriate tool.
```python Call External API
def get_financial_data(metric: str, financial_year: int, company: str):
    print(f"{metric=} {financial_year=} {company=}")
    if metric == "net_income" and financial_year == 2022 and company == "Nike":
        return {"net_income": 6_046_000_000}
    else:
        raise NotImplementedError()

function_call = chat_completion.choices[0].message.tool_calls[0].function
tool_response = locals()[function_call.name](**json.loads(function_call.arguments))
print(tool_response)
```
```json Response
metric='net_income' financial_year=2022 company='Nike'
{'net_income': 6046000000}
```
3. The API caller obtains the result of the tool invocation and passes it back to the model to generate the final response.
```python Request
agent_response = chat_completion.choices[0].message
# Append the response from the agent
messages.append(
{
"role": agent_response.role,
"content": "",
"tool_calls": [
tool_call.model_dump()
for tool_call in chat_completion.choices[0].message.tool_calls
]
}
)
# Append the response from the tool
messages.append(
{
"role": "tool",
"content": json.dumps(tool_response)
}
)
next_chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=messages,
tools=tools,
temperature=0.1
)
print(next_chat_completion.choices[0].message.model_dump_json(indent=4))
```
```json Response
{
"content": "Nike's net income for the year 2022 was $6,046,000,000.",
"role": "assistant",
"function_call": null,
"tool_calls": null
}
```
This results in the following response
```
Nike's net income for the year 2022 was $6,046,000,000.
```
## Tools specification
The `tools` field is an array where each component includes the following fields:
1. `type` (`string`) Specifies the type of the tool. Currently, only `function` is supported.
2. `function` (`object`) Specifies the function to be called. It includes the following fields:
* `description` (`string`): A description of what the function does, used by the model to choose when and how to call the function.
* `name` (`string`): The name of the function to be called. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.
* `parameters` (`object`): The parameters the function accepts, described as a JSON Schema object. See the [JSON Schema reference](https://json-schema.org/understanding-json-schema/reference) for documentation about the format.
## Tool choice
The `tool_choice` parameter controls whether the model is allowed to call functions or not. Currently, we support the values `auto`, `none`, `any`, or a specific function name.
* `auto` (default)
The model can dynamically choose between generating a message or calling a function. This is the **default** tool choice when no value is specified for `tool_choice`.
* `none`
Disables the use of any tools, similar to not specifying the `tool_choice` field.
* `any`
Allows the model to call any function. You can also specify:
```
tool_choice = {"type": "function"}
```
This ensures that a function call will always be made, with no restriction on the function's name.
* Specific function name
To force the model to use a particular function, you can explicitly specify the function name in the `tool_choice` field. For example:
```
tool_choice = {"type": "function", "function": {"name": "get_financial_data"}}
```
This ensures that the model will only use the `get_financial_data` function.
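Putting it together, here is a minimal sketch of forcing a specific function, reusing the `client`, `messages`, and `tools` objects from the examples above:
```python
chat_completion = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-405b-instruct",
    messages=messages,
    tools=tools,
    # Force a call to get_financial_data regardless of whether the model
    # would otherwise answer directly.
    tool_choice={"type": "function", "function": {"name": "get_financial_data"}},
    temperature=0.1,
)
print(chat_completion.choices[0].message.tool_calls)
```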
## OpenAI compatibility
Fireworks AI's function calling API is fully compatible with OpenAI's implementation, with a few differences:
* No support for parallel function calling
* No nested function calling
* Simplified tool choice options
## Best practices
* **Number of Functions**: The length of the list of functions specified to the model directly impacts its performance. For best performance, keep the list of functions below 7. It's possible to see some degradation in model quality as the tool list length exceeds 10.
* **Function Description**: The function specification follows [JSON Schema](https://json-schema.org/). For best performance, describe in great detail what the function does under the "description" section. An example of a good function description is "Get financial data for a company given the metric and year". A bad example would be "Get financial data for a company".
* **System Prompt**: In order to ensure optimal performance, we recommend **not** adding any additional system prompt. User-specified system prompts can interfere with the function detection & calling ability of the model. The auto-injected prompt for our function calling model is designed to ensure optimal performance.
* **Temperature**: Set the temperature to 0.0 or some other low value. This helps the model generate only confident predictions and avoid hallucinating parameter values.
* **Function descriptions**: Provide verbose descriptions for functions and their parameters. This is similar to prompt engineering: the more elaborate and accurate the function definition/documentation, the better the model is at deciphering the intent of the function and its parameters.
## Function calling vs JSON mode
When to use function calling vs [JSON mode](/structured-responses/structured-response-formatting)?
Use function calling when:
* Building interactive agents
* Requiring structured API calls
* Implementing multi-step workflows
* Needing dynamic decision making
Use JSON mode when:
* Performing simple data extraction
* Working with static data
* Needing non-JSON output formats
* Processing batch data without interaction
## Example apps
* Official demos
* [Interactive Image and Finance Dashboard](https://functional-chat.vercel.app/)
* [Data Extraction Pipeline](https://colab.research.google.com/drive/1SI6jz66k122vv641e8wDDI0Ujh4cwlUy?usp=sharing)
* Langchain integrations
* [Javascript Function Calling](https://github.com/langchain-ai/langchainjs/blob/main/cookbook/function_calling_fireworks.ipynb)
* [Agent Executor Implementation](https://colab.research.google.com/drive/1huPsNm9l4OcJvIcu63u0FFWF8X2J7zW3?usp=sharing)
* [RAG with Langchain](https://colab.research.google.com/drive/1Vy4tYxP_rlbkAKi4pGpaDRV7hnSQeG2d?usp=sharing)
## Resources
* [Fireworks Blog Post on FireFunction-v2](https://fireworks.ai/blog/firefunction-v2-launch-post)
* [OpenAI Docs on Function Calling](https://platform.openai.com/docs/guides/function-calling)
* [OpenAI Cookbook on Function Calling](https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models)
* [Function Calling Best Practices](#best-practices)
## Data policy
Data from Firefunction is logged and automatically deleted after 30 days to ensure product quality and prevent abuse (e.g., aggregate data on the average number of functions used). This data will never be used to train models. Please contact [raythai@fireworks.ai](mailto:raythai@fireworks.ai) if you have questions, comments, or use cases where data cannot be logged.
# Merging LoRA adapters with base models
A guide for downloading base models, merging them with LoRA adapters, and deploying the result using Fireworks.
**Prerequisites:**
* Fireworks account and `firectl` installed
* Python environment with necessary packages
* Local LoRA adapter or access to HuggingFace
* Python 3.9 or later (\< 3.13)
Follow the steps below to merge and deploy your models.
## 1. Access and download base model
### 1.1 List available models
View all models in your Fireworks account:
```bash
firectl list models
```
Example output:
```
Code Llama 13B (code-llama-13b) 2024-02-29 20:36:24 HF_BASE_MODEL
CodeGemma 7B (codegemma-7b) 2024-06-19 22:57:22 HF_BASE_MODEL
... ... ...
```
Recall the supported base models:
* Gemma
* Phi, Phi-3
* Llama 1, 2, 3, 3.1
* LLaVa
* Mistral & Mixtral
* Qwen2
* StableLM
* Starcoder (GPTBigCode) & Starcoder2
* DeepSeek V1 & V2
* GPT NeoX
### 1.2 Download base model
Download your chosen model to a local directory:
```bash
firectl download model
```
Example:
```bash
firectl download model code-llama-13b ./base_model
```
Available flags:
* `--quiet`: Suppress progress bar
* `-h, --help`: Display help information
## 2. Obtain LoRA adapter
### 2.1 Download LoRA adapter from Fireworks
The easiest way to obtain a LoRA adapter is to download it directly from Fireworks. LoRA adapters are listed alongside models when using `firectl list models` and are denoted with the type `HF_PEFT_ADDON`. Download a LoRA adapter using the same command as downloading a model.
### 2.2 Download from HuggingFace (Optional)
If you need to download a LoRA adapter from HuggingFace, follow these steps:
**Requirements**
Install the required package:
```bash
pip install huggingface_hub
```
**Download code**
```python
from huggingface_hub import snapshot_download
# Configure download parameters
adapter_id = "hf-account/adapter-name" # Your HuggingFace adapter path
output_path = "./path/to/save/adapter" # Local directory to save adapter
# Download the adapter
local_path = snapshot_download(
repo_id=adapter_id,
local_dir=output_path
)
```
Important notes:
* Replace `adapter_id` with your desired LoRA adapter
* Ensure `output_path` is a valid directory path
* The function returns the local path where files are downloaded
## 3. Merging base model with LoRA adapter
### 3.1 Installation requirements
First, ensure you have the necessary libraries installed:
```bash
pip install torch transformers peft
```
### 3.2 Merging script
Create a Python script (`merge_model.py`) with the following code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
def merge_lora_with_base_model(base_model_path: str, lora_path: str, output_path: str):
"""
Merge a LoRA adapter with a base model and save the result.
Args:
base_model_path (str): Path to the base model directory
lora_path (str): Path to the LoRA adapter directory
output_path (str): Directory to save the merged model
"""
# Load base model
print(f"Loading base model from {base_model_path}")
base_model = AutoModelForCausalLM.from_pretrained(
base_model_path,
torch_dtype=torch.float16,
device_map="auto"
)
# Load and apply LoRA adapter
print(f"Loading LoRA adapter from {lora_path}")
model = PeftModel.from_pretrained(
base_model,
lora_path
)
# Merge adapter with base model
print("Merging LoRA adapter with base model...")
merged_model = model.merge_and_unload()
# Save merged model
print(f"Saving merged model to {output_path}")
merged_model.save_pretrained(output_path)
# Save tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_path)
tokenizer.save_pretrained(output_path)
print("Merge completed successfully!")
if __name__ == "__main__":
# Example usage
merge_lora_with_base_model(
base_model_path="./base_model", # Directory containing the base model
lora_path="./lora_adapter", # Directory containing the LoRA adapter
output_path="./merged_model" # Output directory for merged model
)
```
**NOTE:** If you downloaded the base model from Fireworks AI, then you might need to update the `base_model_path` to `./base_model/hf` because required files such as config.json might be within the `hf` directory.
### 3.3 Running the merge
Execute the script after setting your paths:
```bash
python merge_model.py
```
**Important:** After merging, verify that all necessary tokenizer files are present in the output directory. The merging process might skip some essential tokenizer files. You may need to manually copy these files from the base model:
* `tokenizer_config.json`
* `tokenizer.json`
* `special_tokens_map.json`
These files can be found in the original base model directory or the model's HuggingFace repository (e.g., meta-llama/Llama-3.1-70B-Instruct).
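If any of these files are missing, you can copy them over manually. Here is a minimal sketch, assuming the directory layout used earlier in this guide:
```python
import shutil
from pathlib import Path

base_model_dir = Path("./base_model")      # or ./base_model/hf for models downloaded from Fireworks
merged_model_dir = Path("./merged_model")

# Copy tokenizer files that the merge step may have skipped.
for name in ["tokenizer_config.json", "tokenizer.json", "special_tokens_map.json"]:
    src = base_model_dir / name
    if src.exists() and not (merged_model_dir / name).exists():
        shutil.copy2(src, merged_model_dir / name)
        print(f"Copied {name}")
```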
### 3.4 Important Notes
* Ensure sufficient disk and GPU memory for all models
* Check your cache directory (\~/.cache/huggingface/hub) as models may already be downloaded there
* Verify LoRA adapter compatibility with base model
* All paths must exist and have proper permissions
* Memory issues can be resolved by setting `device_map="cpu"`
## 4. Uploading and deploying merged model
### 4.1 Create model in Fireworks
Upload your merged model to Fireworks:
```bash
firectl create model
```
Example:
```bash
firectl create model sql-enhanced-model ./merged_model
```
For additional options:
```bash
firectl create model -h
```
### 4.2 Create deployment
Deploy your uploaded model:
Basic deployment:
```bash
firectl create deployment
```
Using full model path:
```bash
firectl create deployment accounts//models/
```
Example:
```bash
firectl create deployment sql-enhanced-model
# OR
firectl create deployment accounts/myaccount/models/sql-enhanced-model
```
For additional deployment parameters and configuration options:
```bash
firectl create deployment -h
```
### 4.3 Verification
After deployment, you can verify the status using:
```bash
firectl list deployments
```
***
## Complete workflow summary
1. Download base model from Fireworks using `firectl`
2. Download LoRA adapter to local device (e.g. using HuggingFace)
3. Merge models using provided Python script
4. Upload merged model to Fireworks
5. Create deployment
# On-demand deployments
Deploying on your own GPUs
Fireworks allows you to create on-demand, dedicated deployments that are reserved for your own use. This has several advantages over the shared deployments Fireworks uses for its serverless models:
* Predictable performance unaffected by load caused by other users
* No hard rate limits - but subject to the maximum load capacity of the deployment
* Cheaper under high utilization
* Access to a larger selection of models not available via our serverless models
* [Custom base models](/models/uploading-custom-models#custom-base-models) from Hugging Face files
If you plan on using a significant amount of on-demand deployments, consider purchasing [reserved capacity](/deployments/reservations)
for cheaper pricing and higher GPU quotas.
## Quickstart
See the "All models" list on our [Models](https://fireworks.ai/models) page for a list of pre-uploaded models on the
Fireworks AI platform. You can also use a [custom base model](#custom-base-models).
To create a new deployment of a [model provided by Fireworks](https://fireworks.ai/models), run:
```bash
firectl create deployment accounts/fireworks/models/ --wait
```
This command will complete when the deployment is `READY`. To let it run asynchronously, remove the `--wait` flag.
The string `accounts/fireworks/models/` is an example of a ``. [Read more](https://docs.fireworks.ai/models/overview#introduction) about model names.
To create a new deployment using a custom base model, follow the [Uploading custom models](/models/uploading-custom-models#custom-base-models) guide to first upload your custom base model to the Fireworks platform. Then run:
```bash
firectl create deployment
```
The deployment ID is the last part of the deployment name: `accounts//deployments/`.
You can verify the deployment is complete by running:
```bash
firectl get deployment
```
The state field should show `READY`.
To query a specific deployment, use the model identifier in the format: `#`
In most cases, the model identifier follows this pattern:
`accounts//models/` + `#` + `accounts//deployments/`
**Example:**
The model identifier for querying Llama3.2-3B Instruct (listed as `accounts/fireworks/models/llama-v3p2-3b-instruct`) for Acme Inc.'s deployment (deployment ID being `12ab34cd56ef`) would be:
`accounts/fireworks/models/llama-v3p2-3b-instruct#accounts/acmeInc/deployments/12ab34cd56ef`
**Sample Request:**
```bash
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/#accounts//deployments/",
"prompt": "Say this is a test"
}' \
--url https://api.fireworks.ai/inference/v1/completions
```
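The equivalent request with the OpenAI-compatible Python client might look like the following sketch; the model, account, and deployment IDs are placeholders.
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<API_KEY>",
)

response = client.completions.create(
    # The <model>#<deployment> identifier routes the request to your dedicated deployment.
    model="accounts/fireworks/models/llama-v3p2-3b-instruct#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    prompt="Say this is a test",
)
print(response.choices[0].text)
```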
By default, deployments will automatically [scale down to zero](#customizing-autoscaling-behavior) replicas if unused (i.e. no
inference requests) for 1 hour, and are automatically deleted if unused for one week.
To completely delete the deployment, run:
```bash
firectl delete deployment
```
**Notes:**
* Make sure you include the `#` in the model identifier when querying a specific deployment.
* If you are unsure about the model identifier format, refer to the [Model Identifiers](https://docs.fireworks.ai/models/deploying#model-identifier) section for more details and alternatives.
## Deployment options
### Replica count (horizontal scaling)
The number of replicas (horizontal scaling) is specified by passing the `--min-replica-count` and `--max-replica-count`
flags. Increasing the number of replicas will increase the maximum QPS the deployment can support. The deployment will
automatically scale based on server load.
Auto-scaling up may fail if there is a GPU stockout. Use [reserved capacity](/deployments/reservations) to
guarantee capacity for your deployments.
The default value for `--min-replica-count` is 0. Setting `--min-replica-count` to 0 enables the deployment to auto-scale to 0 if a deployment is unused (i.e. no inference requests) for a specified "scale-to-zero" time window. While the deployment is scaled to 0, you will not pay for any GPU utilization.
The default value for `--max-replica-count` is 1 if
`--min-replica-count=0`, or the value of `--min-replica-count` otherwise.
```bash create
firectl create deployment \
--min-replica-count 2 \
--max-replica-count 3
```
```bash update
firectl update deployment \
--min-replica-count 2 \
--max-replica-count 3
```
### Customizing autoscaling behavior
You can customize certain aspects of the deployment's autoscaling behavior by setting the following flags:
* `--scale-up-window` The duration the autoscaler will wait before scaling up a deployment after observing increased
load. Default is `30s`.
* `--scale-down-window` The duration the autoscaler will wait before scaling down a deployment after observing
decreased load. Default is `10m`.
* `--scale-to-zero-window` The duration after which there are no requests that the deployment will be scaled down to
zero replicas. This is ignored if `--min-replica-count` is greater than 0. Default is `1h`. The minimum is `5m`.
There will be a cold-start latency (up to a few minutes) for requests made while the deployment is scaling from 0 to 1 replicas.
A deployment with `--min-replica-count` set to 0 will be automatically deleted if it receives no traffic for 7 days.
Refer to [time.ParseDuration](https://pkg.go.dev/time#ParseDuration) for valid syntax for the duration string.
### Multiple GPUs (vertical scaling)
The number of GPUs used per replica is specified by passing the `--accelerator-count` flag. Increasing the accelerator count will improve generation speed, time-to-first-token, and maximum QPS for your deployment; however, the scaling is sub-linear. The default value for most models is 1 but may be higher for larger models that require sharding.
```bash create
firectl create deployment --accelerator-count 2
```
```bash update
firectl update deployment --accelerator-count 2
```
### Choosing hardware type
By default, a deployment will use NVIDIA A100 80 GB GPUs. You can also deploy using NVIDIA H100 80 GB, NVIDIA H200 141GB or AMD MI300X GPUs by passing the `--accelerator-type` flag. Valid values for `--accelerator-type` are:
* `NVIDIA_A100_80GB`
* `NVIDIA_H100_80GB`
* `NVIDIA_H200_141GB`
* `AMD_MI300X_192GB` - Note that MoE-based models like DeepSeek Coder and Mixtral are currently not supported on MI300X
See [Regions](/deployments/regions) for a list of accelerator availability by region. The region can be either specified or auto-selected for a deployment upon creation. After creation, the region cannot be changed. If you plan on changing the accelerator type, you may need to re-create the deployment in a new region where it is available.
For advice on choosing a hardware type, see this [FAQ](https://docs.fireworks.ai/faq/deployment/ondemand/hardware-options#hardware-selection).
```bash create
firectl create deployment --accelerator-type="NVIDIA_H100_80GB"
```
```bash update
firectl update deployment --accelerator-type="NVIDIA_H100_80GB"
```
### Model based speculative decoding
Model based speculative decoding allows you to speed up output generation in some cases by using a smaller model to assist the larger model in generation.
Fireworks also offers speculative decoding based on a user-provided prediction, which works in addition to model based speculative decoding. Read [Using Predicted Outputs](/guides/predicted-outputs) to learn more.
Speculative decoding may slow down output generation if the smaller model is not a good speculator for the larger model, or if the token count / speculation length is too high or too low. Speculative decoding may also reduce the max throughput you can achieve with your deployment. Test different models and speculation lengths to determine the best settings for your use case.
We offer the following settings that can be set as flags in firectl, our CLI tool:
| Flag | Type | Description |
| ---------------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--draft-model` | string | To use a draft model for speculative decoding, set this flag to the name of the draft model you want to use. See the table below for recommendations on draft models to use for popular model families. Note that draft models can be standalone models (referred from Fireworks account or custom models uploaded to your account) or an add-on (e.g. Eagle) |
| `--draft-token-count` | int32 | When using a draft model, set this flag to the number of tokens to generate per step for speculative decoding. Setting `--draft-token-count=0` turns off draft model speculation for the deployment. As a rough guideline, use `--draft-token-count=3` for Eagle draft models and `--draft-token-count=4` for other draft models |
| `--ngram-speculation-length` | int32 | To use N-gram based speculation, set this flag to the length of the previous input sequence to be considered for N-gram speculation |
`--draft-token-count` must be set when `--draft-model` or `--ngram-speculation-length` is used. `--draft-model` and `--ngram-speculation-length` cannot be used together, as they are alternative approaches to speculation. Setting both will throw an error.
You can use the following draft models directly:
| Draft model name | Recommended for |
| -------------------------------------------------------- | -------------------------- |
| accounts/fireworks/models/llama-v3p2-1b-instruct | All Llama models > 3B |
| accounts/fireworks/models/qwen2p5-0p5b-instruct | All Qwen models > 3B |
| accounts/fireworks/models/eagle-llama-v3-3b-instruct-v2 | Llama 3.2 3B |
| accounts/fireworks/models/eagle-qwen-v2p5-3b-instruct-v2 | Qwen 2.5 3B |
| accounts/fireworks/models/eagle-llama-v3-8b-instruct-v2 | Llama 3.1 8B, Llama 3.0 8B |
| accounts/fireworks/models/eagle-qwen-v2p5-7b-instruct-v2 | Qwen 2.5 7B |
Here's an example of deploying Llama 3.1 8B Instruct with a draft model:
`firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct --accelerator-type NVIDIA_H100_80GB --draft-model accounts/fireworks/models/llama-v3p2-1b-instruct --draft-token-count 4`
In most cases, speculative decoding does not change the quality of the output generated (mathematically, outputs are unchanged, but there might be numerical differences, especially at higher temperatures). If speculation is used on the deployment and you want to verify the output is unchanged, you can set `disable_speculation=True` in the inference API call - in this case, the draft model is still called but its output is not used, so performance will be impacted.
### Quantization
By default, models on dedicated deployments are served using 16-bit floating-point (FP16) precision. Quantization reduces
the number of bits used to serve the model, improving performance and reducing the cost to serve. However, this can change
model numerics, which may introduce small changes to the output.
In order to deploy a base model using quantization, it must be prepared first. See our [Quantization](/models/quantization)
guide for details.
To create a deployment using a quantized model, pass the `--precision` flag with the desired precision.
```bash
firectl create deployment \
--accelerator-type="NVIDIA_H100_80GB" \
--precision="FP8"
```
Quantized deployments can only be served using H100 GPUs.
### Optimizing your deployments for long context
By default, a balanced deployment will be created using the hardware resources you specify. Higher performance can be
achieved for long-prompt length (>\~3000 tokens) workloads by passing the `--long-prompt` flag.
This option roughly doubles the amount of GPU memory required to serve the model and requires a minimum of two
GPUs to be effective. If `--accelerator-count` is not specified, then a deployment using twice the minimum number of
GPUs (to serve without `--long-prompt`) will be created.
```bash create
firectl create deployment --accelerator-count=2 --long-prompt
```
```bash update
firectl update deployment --long-prompt
```
To update a deployment to disable this option, pass `--long-prompt=false`.
Additional optimization options are available through our enterprise plan.
## Deploying LoRA addons
By default, LoRA addons are disabled for deployments. To enable addons, pass the `--enable-addons` flag:
```bash create
firectl create deployment --enable-addons
```
```bash update
firectl update deployment --enable-addons
```
See [Uploading a custom model](/models/uploading-custom-models#custom-lora-addons) for instructions on how to upload custom
LoRA addons. To deploy a LoRA addon to a on-demand deployment, pass the `--deployment` flag to `firectl deploy`. For
example:
```bash
firectl deploy --deployment
```
The base model of the deployment must match the base model of the addon.
# Pricing
On-demand deployments are billed by GPU-second. Consult our [pricing page](https://fireworks.ai/pricing) for details.
# Using Predicted Outputs
Use Predicted Outputs to boost output generation speeds for editing / rewriting use cases
This feature is in beta and we are working on improvements. We welcome your feedback on [Discord](https://discord.gg/fireworks-ai)
In cases where large parts of the LLM output are known in advance, e.g. when editing or rewriting a document or code snippet, you can improve output generation speeds with Predicted Outputs. Predicted Outputs allows you to provide a strong "guess" of what the output may look like.
To use Predicted Outputs, set the `prediction` field in the Fireworks API with the predicted output. For example, you may want to edit a survey and add an option to contact users by text message:
```
{
"questions": [
{
"question": "Name",
"type": "text"
},
{
"question": "Age",
"type": "number"
},
{
"question": "Feedback",
"type": "text_area"
},
{
"question": "How to Contact",
"type": "multiple_choice",
"options": ["Email", "Phone"],
"optional": true
}
]
}
```
In this case, we expect most of the code will remain the same. We set the `prediction` field to the original survey code. Output generation speed increases when predicted outputs are used.
```python Python (Fireworks)
from fireworks.client import Fireworks
code = """{
"questions": [
{
"question": "Name",
"type": "text"
},
{
"question": "Age",
"type": "number"
},
{
"question": "Feedback",
"type": "text_area"
},
{
"question": "How to Contact",
"type": "multiple_choice",
"options": ["Email", "Phone"],
"optional": true
}
]
}
"""
client = Fireworks(api_key="")
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-70b-instruct",
messages=[{
"role": "user",
"content": "Edit the How to Contact question to add an option called Text Message. Output the full edited code, with no markdown or explanations.",
},
{
"role": "user",
"content": code
}
],
temperature=0,
prediction={"type": "content", "content": code}
)
print(response.choices[0].message.content)
```
### Additional information on Predicted Outputs
* Using Predicted Outputs is free at this time
* We recommend setting `temperature=0` for best results in most intended use cases of Predicted Outputs. In these cases, using Predicted Outputs does not impact the quality of the generated output
* If the prediction is substantially different from the generated output, output generation speed may decrease
* The maximum length of the `prediction` field is governed by `max_tokens` (2048 by default) and needs to be increased if you have a longer input and prediction.
* If you are using an on-demand deployment, you can set `rewrite_speculation=True` and potentially get even faster output generation. We are working on rolling this out to Serverless soon.
# Prompt caching
Prompt caching is a performance optimization feature that allows Fireworks to
respond faster to requests with prompts that share common prefixes. In many
situations, it can reduce time to first token (TTFT) by as much as 80%.
Prompt caching is **enabled by default** for all Fireworks models and deployments.
For dedicated deployments, prompt caching frees up resources, leading to higher
throughput on the same hardware. Dedicated deployments on the Enterprise plan allow
additional configuration options to further optimize cache performance.
## Using prompt caching
### Common use cases
Requests to LLMs often share a large portion of their prompt. For example:
* Long system prompts with detailed instructions
* Descriptions of available tools for function calling
* Growing previous conversation history for chat use cases
* Shared per-user context, like a current file for a coding assistant
Prompt caching avoids re-processing the cached prefix of the prompt and
starts output generation much sooner.
### Structuring prompts for caching
Prompt caching works only for exact prefix matches within a prompt. To
realize caching benefits, place static content like instructions and examples at
the beginning of your prompt, and put variable content, such as user-specific
information, at the end.
For function calling models, tools are considered part of the prompt.
For vision-language models, images currently aren't cached (but this might be improved in the future).
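For example, a request structured to benefit from caching keeps the static instructions first and the per-request content last. The sketch below is illustrative; the prompts are placeholders.
```python
# Static content first: identical across requests, so it forms a cacheable prefix.
static_system_prompt = (
    "You are a coding assistant. Follow the project style guide below.\n"
    "<long, unchanging instructions, examples, and tool descriptions>"
)

# Variable content last: differs per request, so it sits after the cached prefix.
messages = [
    {"role": "system", "content": static_system_prompt},
    {"role": "user", "content": "Here is my current file:\n<user-specific code>"},
]
# messages would then be passed to client.chat.completions.create(...) as usual.
```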
### How it works
Fireworks will automatically find the longest prefix of the request that is
present in the cache and reuse it. The remaining portion of the prompt will be
processed as usual.
The entire prompt is stored in the cache for future reuse. Cached prompts
usually stay in the cache for at least several minutes. Depending on the model,
load level, and deployment configuration, it can be up to several hours. The
oldest prompts are evicted from the cache first.
Prompt caching doesn't alter the result generated by the model. The response you
receive will be identical to what you would get if prompt caching was not used.
Each generation is sampled from the model independently on each request and is not
cached for future usage.
## Monitoring
For dedicated deployments, information about prompt caching is returned in the
response headers. The header `fireworks-prompt-tokens` contains the number of tokens
in the prompt, out of which `fireworks-cached-prompt-tokens` are cached.
Aggregated metrics are also available in the [usage dashboard](https://fireworks.ai/account/usage?type=deployments).
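For example, you can inspect these headers by calling the REST API directly. The sketch below is illustrative; the model identifier and API key are placeholders.
```python
import requests

resp = requests.post(
    "https://api.fireworks.ai/inference/v1/completions",
    headers={"Authorization": "Bearer <API_KEY>"},
    json={
        "model": "accounts/<ACCOUNT_ID>/models/<MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
        "prompt": "Say this is a test",
    },
)

print("prompt tokens:       ", resp.headers.get("fireworks-prompt-tokens"))
print("cached prompt tokens:", resp.headers.get("fireworks-cached-prompt-tokens"))
```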
## Data privacy
Serverless deployments maintain separate caches for each Fireworks account to prevent data leakage and timing attacks.
Dedicated deployments by default share a single cache across all requests.
Because prompt caching doesn't change the outputs, privacy is preserved even
if the deployment powers a multi-tenant application. It does open a minor risk
of a timing attack: potentially, an adversary can learn that a particular prompt
is cached by observing the response time. To ensure full isolation, you can pass
the `x-prompt-cache-isolation-key` header or the `prompt_cache_isolation_key`
field in the body of the request. It can contain an arbitrary string that acts
as an additional cache key, i.e., no sharing will occur between requests with
different IDs.
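With the OpenAI-compatible Python client, the header can be attached per request via `extra_headers`. This is a sketch; the isolation key value is an arbitrary placeholder.
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    # Requests with different isolation keys never share cache entries.
    extra_headers={"x-prompt-cache-isolation-key": "tenant-1234"},
)
print(response.choices[0].message.content)
```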
## Limiting or turning off caching
Additionally, you can pass the `prompt_cache_max_len` field in the request body to
limit the maximum prefix of the prompt (in tokens) that is considered for
caching. It's rarely needed in real applications but can come in handy for
benchmarking the performance of dedicated deployments by passing
`"prompt_cache_max_len": 0`.
## Advanced: cache locality for Enterprise deployments
Dedicated deployments on an Enterprise plan allow you to pass an additional hint in the request to improve cache hit rates.
First, the deployment needs to be created or updated with an additional flag:
```bash
firectl create deployment ... --enable-session-affinity
```
Then the client can pass an opaque identifier representing a single user or
session in the `user` field of the body or in the `x-session-affinity` header. Fireworks
will try to route requests with the identifier to the same server, further reducing response times.
It's best to choose an identifier that groups requests with long shared prompt
prefixes. For example, it can be a chat session with the same user or an
assistant working with the same shared context.
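For example, a per-session identifier can be passed in the standard `user` field. The sketch below assumes `--enable-session-affinity` is set on the deployment; the identifiers are placeholders.
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/<ACCOUNT_ID>/models/<MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    messages=[{"role": "user", "content": "Continue our previous discussion..."}],
    # Requests with the same identifier are routed to the same server when possible.
    user="chat-session-5678",
)
print(response.choices[0].message.content)
```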
# Querying embedding models
Fireworks hosts many embedding models, and we will walk through an example of using `nomic-ai/nomic-embed-text-v1.5` today to see how to query Fireworks with embeddings API.
# Embedding documents
Our embedding service is OpenAI compatible. Use OpenAI's embeddings [guide](https://platform.openai.com/docs/guides/embeddings) and OpenAI's [embeddings documentation](https://platform.openai.com/docs/api-reference/embeddings) for more detailed information on our embedding model usage.
The embedding model inputs text and outputs a vector (list) of floating point numbers to use for tasks like similarity comparisons and search.
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.embeddings.create(
model="nomic-ai/nomic-embed-text-v1.5",
input="search_document: Spiderman was a particularly entertaining movie with...",
)
print(response)
```
This code embeds the text "search\_document: Spiderman was a particularly entertaining movie with..." and returns the following
```json Response
CreateEmbeddingResponse(data=[Embedding(embedding=[0.006380197126418352, 0.011841800063848495,...], index=0, object='embedding')], model='nomic-ai/nomic-embed-text-v1.5', object='list', usage=Usage(prompt_tokens=12, total_tokens=12))
```
However, you might have noticed the interesting prefix `search_document: `. What is that supposed to mean?
# Embedding queries and documents
Nomic models have been fine-tuned to take prefixes. For user queries, you will need to use the prefix `search_query: `, and for documents, the prefix `search_document: `. What does that mean exactly?
* Let's say I previously used the embedding model to embed many movie reviews that I stored in a vector database. All of those documents should come with the prefix `search_document: `
* I now want to create a movie recommendation service that takes in a user query and outputs recommendations based on this data. The code below demonstrates how to embed the user query.
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
query = "I love superhero movies, any recommendations?"
task_description="Given a user query for movies, retrieve the relevant movie that can fulfill the query. "
query_emb = client.embeddings.create(
model="nomic-ai/nomic-embed-text-v1.5",
input=f"search_query: {query}"
)
print(query_emb)
```
To view this example end-to-end and see how to use a MongoDB vector store and Fireworks-hosted generation model for RAG, see our full [guide](https://github.com/fw-ai/cookbook/blob/main/examples/rag/mongo_basic.ipynb). For more information on what kind of prefixes are possible with nomic, please check out [this guide from nomic](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#usage).
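To make the query/document split concrete, here is a minimal retrieval sketch that ranks a few documents by cosine similarity, with no vector store. It assumes the embeddings endpoint accepts a batched list input, as in the OpenAI API; the documents are placeholders.
```python
import numpy as np
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<API_KEY>",
)

documents = [
    "search_document: Spiderman was a particularly entertaining movie with...",
    "search_document: A quiet drama about a family farm in the 1980s...",
]
query = "search_query: I love superhero movies, any recommendations?"

# Embed the documents and the query in a single batched request.
resp = client.embeddings.create(
    model="nomic-ai/nomic-embed-text-v1.5",
    input=documents + [query],
)
vectors = np.array([d.embedding for d in resp.data])
doc_vecs, query_vec = vectors[:-1], vectors[-1]

# Rank documents by cosine similarity to the query.
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")
```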
# Variable dimensions
The model also supports variable embedding dimension sizes. In this case, we can provide the `dimensions` parameter in the `embeddings.create` request:
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.embeddings.create(
model="nomic-ai/nomic-embed-text-v1.5",
input="search_document: I like Christmas movies, can you make any recommendations?",
dimensions=128,
)
print(len(response.data[0].embedding))
```
You will see that the returned results are embeddings with dimension 128.
# List of available models
| Model name                                      | Model size |
| :--------------------------------------------- | :--------- |
| `nomic-ai/nomic-embed-text-v1.5` (recommended) | 137M |
| `nomic-ai/nomic-embed-text-v1` | 137M |
| `WhereIsAI/UAE-Large-V1` | 335M |
| `thenlper/gte-large` | 335M |
| `thenlper/gte-base` | 109M |
# Querying text models
Fireworks.ai offers an OpenAI-compatible REST API for querying text models. There are several ways to interact with it:
* The [Fireworks Python client library](/tools-sdks/python-client/installation)
* The [web console](https://fireworks.ai)
* [LangChain](https://python.langchain.com/docs/integrations/providers/fireworks)
* Directly invoking the [REST API](/api-reference/post-completions) using your favorite tools or language
* The [OpenAI Python client](https://github.com/openai/openai-python)
## Using the web console
All Fireworks models can be accessed through the web console at [fireworks.ai](https://fireworks.ai). Clicking on a model will take you to the playground where you can enter a prompt along with additional request parameters.
Non-chat models will use the [completions API](/api-reference/post-completions) which passes your input directly into the model.
Models with a conversation config are considered chat models (also known as instruct models). By default, chat models will use the [chat completions API](/api-reference/post-chatcompletions) which will automatically format your input with the conversation style of the model. Advanced users can revert back to the completions API by unchecking the "Use chat template" option.
## Using the API
### Chat completions API
Models with a conversation config have the [chat completions API](/api-reference/post-chatcompletions) enabled. These models are typically tuned with specific conversation styles for which they perform best. For example, Llama chat models use the following [template](https://gpus.llm-utils.org/llama-2-prompt-template/):
> \[INST] \<\<SYS>>
>
> {system_prompt}
>
> \<\</SYS>>
>
> \{user\_message\_1} \[/INST]
Some templates can support multiple chat messages as well. In general, we recommend users use the chat completions API whenever possible to avoid common prompt formatting errors. Even small errors like misplaced whitespace may result in poor model performance.
Here are some examples of calling the chat completions API:
```python Python (Fireworks)
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response = openai.ChatCompletion.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
)
print(response.choices[0].message.content)
```
```shell cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3-8b-instruct",
"messages": [{
"role": "user",
"content": "Say this is a test"
}]
}' \
--url https://api.fireworks.ai/inference/v1/chat/completions
```
#### Overriding the system prompt
A conversation style may include a default system prompt. For example, Llama 2 models use the default Llama prompt:
> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
For styles that support a system prompt, you may override this prompt by setting the first message with the role `system`. For example:
```json JSON
[
{
"role": "system",
"content": "You are a pirate."
},
{
"role": "user",
"content": "Hello, what is your name?"
}
]
```
To completely omit the system prompt, you can set `content` to the empty string.
The process of generating a conversation-formatted prompt will depend on the conversation style used. To verify the exact prompt used, turn on [`echo`](#echo).
### Completions API
Text models generate text based on the provided input prompt. All text models support this basic [completions API](/api-reference/post-completions). Using this API, the model will successively generate new tokens until either the maximum number of output tokens has been reached or if the model's special end-of-sequence (EOS) token has been generated.
Most models will automatically prepend the beginning-of-sequence (BOS) token (e.g. ``) to your prompt input. You can always double-check by passing [raw\_output](#raw-output) and inspecting the resulting `prompt_token_ids`.
Here are some examples of calling the completions API:
```python Python (Fireworks)
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response = client.completion.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
prompt="Say this is a test",
)
print(response.choices[0].text)
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.completions.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
prompt="Say this is a test",
)
print(response.choices[0].text)
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response = openai.Completion.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
prompt="Say this is a test",
)
print(response.choices[0].text)
```
```shell cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3-8b-instruct",
"prompt": "Say this is a test"
}' \
--url https://api.fireworks.ai/inference/v1/completions
```
## Getting usage info
The returned object will contain a `usage` field containing
* The number of prompt tokens ingested
* The number of completion tokens (i.e. the number of tokens generated)
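For example, with the OpenAI-compatible Python client, a quick sketch of reading the usage info looks like this:
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3-8b-instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
)

# usage is populated on the final response object.
usage = response.usage
print(f"prompt tokens: {usage.prompt_tokens}, completion tokens: {usage.completion_tokens}")
```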
## Advanced options
See the API reference for the [completions](/api-reference/post-completions) and [chat completions](/api-reference/post-chatcompletions) APIs for a detailed description of these options.
### Streaming
By default, results are returned to the client once the generation is finished. Another option is to stream the results back, which is useful for chat use cases where the client can incrementally see results as each token is generated.
Here is an example with the completions API:
```python Python (Fireworks)
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response_generator = client.completion.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
prompt="Say this is a test",
stream=True,
)
for chunk in response_generator:
print(chunk.choices[0].text)
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response_generator = client.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
prompt="Say this is a test",
stream=True,
)
for chunk in response_generator:
print(chunk.choices[0].text)
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response_generator = openai.Completion.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
prompt="Say this is a test",
stream=True,
)
for chunk in response_generator:
print(chunk.choices[0].text, end="")
```
```shell cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
"prompt": "Say this is a test",
"stream": true
}' \
--url https://api.fireworks.ai/inference/v1/completions
```
and one with the chat completions API:
```python Python (Fireworks)
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response_generator = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-8b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
stream=True,
)
for chunk in response_generator:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response_generator = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
stream=True,
)
for chunk in response_generator:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response_generator = openai.ChatCompletion.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
stream=True,
)
for chunk in response_generator:
if "content" in chunk.choices[0].delta:
print(chunk.choices[0].delta.content, end="")
```
### Async mode
The Python client library also supports asynchronous mode for both completion and chat completion.
```python Python (Fireworks)
import asyncio
from fireworks.client import AsyncFireworks
client = AsyncFireworks(api_key="")
async def main():
stream = client.completion.acreate(
model="accounts/fireworks/models/llama-v3p1-8b-instruct",
prompt="Say this is a test",
stream=True,
)
async for chunk in stream:
print(chunk.choices[0].text, end="")
asyncio.run(main())
```
```python Python (OpenAI 1.x)
import asyncio
import openai
client = openai.AsyncOpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
async def main():
stream = await client.completions.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
prompt="Say this is a test",
stream=True,
)
async for chunk in stream:
print(chunk.choices[0].text, end="")
asyncio.run(main())
```
### Predicted Outputs
See [Using Predicted Outputs](/guides/predicted-outputs)
### Sampling options
The API generates text auto-regressively by choosing the next token from the probability distribution over the space of tokens. For detailed information on how to implement these options, please refer to the [Chat Completions](/api-reference/post-chatcompletions) or [Completions](/api-reference/post-completions) API documentation.
#### Multiple choices
By default, the API will return a single generation choice per request. You can create multiple generations by setting the `n` parameter to the number of desired choices. The returned `choices` array will contain the result of each generation.
#### Max tokens
`max_tokens` or `max_completion_tokens` defines the maximum number of tokens the model can generate, with a default of 2000. If the combined token count (prompt + output) exceeds the model’s limit, it automatically reduces the number of generated tokens to fit within the allowed context.
#### Temperature
Temperature allows you to configure how much randomness you want in the generated text. A higher temperature leads to more "creative" results. On the other hand, setting a temperature of 0 will allow you to generate deterministic results which is useful for testing and debugging.
#### Top-p
Top-p (also called nucleus sampling) is an alternative to sampling with temperature, where the model considers the results of the tokens with top\_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
#### Top-k
Top-k is another sampling method where only the k most probable tokens are considered and the probability mass is redistributed among them.
#### Min-p
[`min_p`](https://arxiv.org/abs/2407.01082) specifies a probability threshold to control which tokens can be selected during generation. Tokens with probabilities lower than this threshold are excluded, making the model more focused on higher-probability tokens. The default value varies, and setting a lower value ensures more variety, while a higher value produces more predictable, focused outputs.
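Sampling parameters that are not part of the standard OpenAI client signature, such as `top_k` and `min_p`, can be passed through `extra_body` when using the OpenAI-compatible Python client. This is a sketch; the values shown are arbitrary.
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about the ocean."}],
    temperature=0.7,
    top_p=0.9,
    # Fireworks-specific sampling options go through extra_body.
    extra_body={"top_k": 40, "min_p": 0.05},
)
print(response.choices[0].message.content)
```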
#### Repetition penalty
LLMs are sometimes prone to repeat a single character or a sentence. Using a frequency and presence penalty can reduce the likelihood of sampling repetitive sequences of tokens. They work by directly modifying the model's logits (un-normalized log-probabilities) with an additive contribution.
> logits\[j] -= c\[j] \* frequency\_penalty + (c\[j] > 0 ? 1 : 0) \* presence\_penalty
where
* `logits[j]` is the logits of the j-th token
* `c[j]` is how often that token was sampled before the current position
The [`repetition_penalty`](https://arxiv.org/pdf/1909.05858.pdf) modifies the logit (raw model output) for repeated tokens. If a token has already appeared in the prompt or output, the penalty is applied to its probability of being selected again.
**Key differences to keep in mind:**
* `frequency_penalty`: Works on how often a word has been used, increasing the penalty for more frequent words. OAI compatible.
* `presence_penalty`: Penalizes words once they appear, regardless of frequency. OAI compatible.
* `repetition_penalty`: Adjusts the likelihood of repeated tokens based on previous appearances, providing an exponential scaling effect to control repetition more precisely, including from the prompt.
#### Mirostat (learning rate and target)
The [Mirostat algorithm](https://arxiv.org/abs/2007.14966) is a sampling method that helps keep the output’s unpredictability, or perplexity, at a set target. It adjusts token probabilities as the text is generated to balance between more diverse or more predictable results. This is useful when you need steady control over how random or focused the text output should be.
There are two parameters that can be adjusted:
* `mirostat_target`: Sets the desired level of unpredictability (perplexity) for the Mirostat algorithm. A higher target results in more diverse output, while a lower target keeps the text more predictable.
* `mirostat_lr`: Controls how quickly the Mirostat algorithm adjusts token probabilities to reach the target perplexity. A lower learning rate makes the adjustments slower and more gradual, while a higher rate speeds up the corrections.
#### Logit bias
Parameter that modifies the likelihood of specified tokens appearing. Pass in a Dict\[int, float] that maps a token\_id to a logit bias value between -200.0 and 200.0. For example:
```python
client.completions.create(
model="...",
prompt="...",
logit_bias={0: 10.0, 2: -50.0}
)
```
## Debugging options
### Ignore EOS
This option allows you to control whether the model stops when it generates the End of Sequence (EOS) token. This is helpful primarily for performance benchmarking to reliably generate exactly `max_tokens` tokens. Note that the quality of the output may degrade, as we override the model's decision to generate the EOS token.
### Logprobs
The `logprobs` parameter determines how many token probabilities are returned. If set to N, it will return log (base e)
probabilities for N+1 tokens: the chosen token plus the N most likely alternative tokens.
The log probabilities will be returned in a LogProbs object for each choice.
* `tokens` contains each token of the chosen result.
* `token_ids` contains the integer IDs of each token of the chosen result.
* `token_logprobs` contains the logprobs of each chosen token.
* `top_logprobs` will be a list whose length is the number of tokens of the output. Each element is a dictionary of size `logprobs`, mapping the most likely tokens at the given position to their respective log probabilities.
When used in conjunction with echo, this option can be set to see how the model tokenized your input.
### Top logprobs
Setting the `top_logprobs` parameter to an integer value in conjunction with `logprobs=True` will also return the above information but in an OpenAI client-compatible format.
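For example, requesting the top 3 alternatives per generated token with the chat completions API might look like the following sketch, using the OpenAI-compatible client:
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    logprobs=True,
    top_logprobs=3,
)

# Each generated token comes with its logprob and the top alternative tokens.
for token_info in response.choices[0].logprobs.content:
    alts = {alt.token: round(alt.logprob, 3) for alt in token_info.top_logprobs}
    print(token_info.token, alts)
```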
### Echo
Setting the `echo` parameter to true will cause the API to return the prompt along with the generated text. This can be used in conjunction with the chat completions API to verify the prompt template used. It can also be used in conjunction with logprobs to see how the model tokenized your input.
### Raw output
This is an unstable, experimental API. It may change at any time and should not be relied upon for production use
cases.
Setting the `raw_output` parameter to true will cause the API to return a `raw_output` object in the response containing
additional debugging information about the raw prompt and completion as seen/produced by the model.
* `prompt_fragments` - Pieces of the prompt (like individual messages) before truncation and concatenation.
* `prompt_token_ids` - Fully tokenized prompt as seen by the model.
* `completion` - Raw completion produced by the model before any tool calls are parsed.
* `completion_logprobs` - Log probabilities for the completion. Only populated if `logprobs` is specified in the
request.
## Appendix
### Tokenization
Language models read and write text in chunks called tokens. In English, a **token** can be as short as one character or as long as one word (e.g., a or apple), and in some languages, tokens can be even shorter than one character or even longer than one word.
Different model families use different **tokenizers**. The same text might be translated to different numbers of tokens depending on the model. It means that generation cost may vary per model even if the model size is the same. For the Llama model family, you can use [this tool](https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/) to estimate token counts. The actual number of tokens used in prompt and generation is returned in the `usage` field of the API response.
# Querying vision-language models
See [Querying text models](/guides/querying-text-models) for a general guide on the API and its options.
## Using the API
Both completions API and chat completions API are supported. However, we recommend users use the chat completions API
whenever possible to avoid common prompt formatting errors. Even small errors like misplaced whitespace may result in
poor model performance.
For Llama 3.2 Vision models, you should pass images before text in the content field to avoid the model refusing to answer.
You can pass images via a URL or in base64-encoded format. Code examples for both methods are below.
### Chat completions API
All vision-language models should have a conversation config and have [chat completions API](https://docs.fireworks.ai/api-reference/post-chatcompletions) enabled. These models are typically tuned with specific conversation styles for which they perform best. For example, Phi-3 models use the following template:
```
SYSTEM: {system message}
USER:
{user message}
ASSISTANT:
```
The `` substring is a special token that we insert into the prompt to allow the model to figure out where to put the image.
Here are some examples of calling the chat completions API:
```python Python (Fireworks)
import fireworks.client
fireworks.client.api_key = ""
response = fireworks.client.ChatCompletion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key = "",
)
response = client.chat.completions.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response = openai.ChatCompletion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```bash cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/phi-3-vision-128k-instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Can you describe this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
}
}
]
}
]
}' \
--url https://api.fireworks.ai/inference/v1/chat/completions
```
In the above example, we provide images via URLs. Alternatively, you can provide the base64-encoded string representation of the images, prefixed with the MIME type. For example:
```python Python (Fireworks)
import fireworks.client
import base64
# Helper function to encode the image
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode('utf-8')
# The path to your image
image_path = "your_image.jpg"
# The base64 string of the image
image_base64 = encode_image(image_path)
fireworks.client.api_key = ""
response = fireworks.client.ChatCompletion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 1.x)
import openai
import base64
# Helper function to encode the image
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode('utf-8')
# The path to your image
image_path = "your_image.jpg"
# The base64 string of the image
image_base64 = encode_image(image_path)
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key = "",
)
response = client.chat.completions.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 0.x)
import openai
import base64
# Helper function to encode the image
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode('utf-8')
# The path to your image
image_path = "your_image.jpg"
# The base64 string of the image
image_base64 = encode_image(image_path)
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response = openai.ChatCompletion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```Text cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/phi-3-vision-128k-instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Can you describe this image?"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,"
}
}
]
}
]
}' \
--url https://api.fireworks.ai/inference/v1/chat/completions
```
### Completions API
Advanced users can also query the completions API directly. Users will need to manually insert the image token `` where appropriate and supply the images as an ordered list (this is true for the Phi-3 model, but may be subject to change for future vision-language models). For example:
```python
import fireworks.client
fireworks.client.api_key = ""
response = fireworks.client.Completion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
prompt = "SYSTEM: Hello\n\nUSER:\ntell me about the image\n\nASSISTANT:",
images = ["https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"],
)
print(response.choices[0].text)
```
## API limitations
Right now, we impose certain limits on the completions API and chat completions API as follows:
1. The total number of images included in a single API request cannot exceed 30, regardless of whether they are provided as base64 strings or URLs.
2. Each image must be smaller than 5MB. If downloading the images takes longer than 1.5 seconds, the request will be dropped and you will receive an error.
## Model limitations
At the moment, we primarily offer Phi-3 vision models for serverless deployment.
## Managing images
The Chat Completions API is not stateful. That means you have to manage the messages (including images) you pass to the model yourself. However, we cache downloaded images where possible to avoid re-downloading them and to reduce latency.
For long-running conversations, we suggest passing images via URLs instead of base64-encoded strings. Latency can also be improved by downsizing your images ahead of time so they are smaller than the maximum size they are expected to be.
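As an illustration, here is a minimal sketch (assuming the Pillow library and a local `your_image.jpg`, both placeholders) of downsizing an image before base64-encoding it:
```python
# Sketch: downscale an image with Pillow before base64-encoding it, so each
# request carries a smaller payload. The 1024px cap is an illustrative
# choice, not a Fireworks requirement.
import base64
import io

from PIL import Image

def downscale_and_encode(image_path: str, max_side: int = 1024) -> str:
    image = Image.open(image_path).convert("RGB")
    image.thumbnail((max_side, max_side))  # resizes in place, keeping aspect ratio
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

image_base64 = downscale_and_encode("your_image.jpg")
# Pass f"data:image/jpeg;base64,{image_base64}" as the image_url, as shown above.
```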
## Calculating cost
For the Phi-3 Vision model, an image is treated as a dynamic number of tokens based on image resolution. For one image the number of tokens typically ranges from 1K to 2.5K. The pricing is otherwise identical to text models. For more information, please refer to [our pricing page here.](https://fireworks.ai/pricing)
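As a rough, back-of-the-envelope sketch (the per-token price below is a placeholder; use the actual rate from the pricing page), a request's cost can be estimated like this:
```python
# Hypothetical cost estimate: image tokens are added to text tokens and
# billed at the model's normal per-token rate. PRICE_PER_MILLION_TOKENS is
# a placeholder value, not an actual Fireworks price.
PRICE_PER_MILLION_TOKENS = 0.20  # USD, placeholder

def estimate_request_cost(text_tokens: int, num_images: int, tokens_per_image: int = 1500) -> float:
    # Per the note above, a single image typically maps to ~1K-2.5K tokens.
    total_tokens = text_tokens + num_images * tokens_per_image
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(f"~${estimate_request_cost(text_tokens=500, num_images=2):.4f}")
```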
## FAQ
### Can I fine-tune the image capabilities with VLM?
Not right now, but we are working on Phi-3 vision model fine-tuning since it has become a popular choice. If you are interested, please reach out to us via Discord.
### Can I use a vision-language model to generate images?
No. But we support image generation models for this purpose:
* [Stable Diffusion](https://fireworks.ai/models/fireworks/stable-diffusion-xl-1024-v1-0)
* [SSD-1B](https://fireworks.ai/models/fireworks/SSD-1B)
* [Japanese Stable Diffusion](https://fireworks.ai/models/fireworks/japanese-stable-diffusion-xl)
* [Playground v2](https://fireworks.ai/models/fireworks/playground-v2-1024px-aesthetic)
Please give these models a try and let us know how it goes!
### What type of files can I upload?
We currently support `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff` and `.ppm` format images.
### Is there a limit to the size of the image I can upload?
Currently, our API restricts the whole request to 10MB, so an image sent as a base64-encoded string must be smaller than 10MB after encoding. If you are using URLs, each image needs to be smaller than 5MB.
### What is the retention policy for the images I upload?
We do not persist images beyond the server lifetime; they are deleted automatically.
### How do rate limits work with VLMs?
VLMs are rate-limited like all of our other LLMs; the limits depend on which rate-limiting tier you are at. For more information, please check out [Pricing](https://fireworks.ai/pricing).
### Can VLMs understand image metadata?
No. If you have image metadata that you want the model to understand, provide it through the prompt.
# Rate limits, spend limits and quotas
Rate limits, spend limits and quotas for serverless inference and on-demand deployments
## Rate Limits on Serverless
Rate limits on Serverless exist to ensure fair usage and reasonable performance for all users. We use a combination of maximum rate limits and dynamic rate limits - please read this section completely to understand how rate limits work.
* Fixed limits reflect the maximum usage allowed on Serverless
* Dynamic limits vary based on available capacity and current traffic load
If you need higher rate limits, faster speeds, more consistent latency, or guaranteed reliability with SLAs, [contact us](https://fireworks.ai/company/contact-us) to learn more about our Enterprise offerings, or consider using [on-demand deployments](/guides/ondemand-deployments).
### Fixed Limits
| Limits | Self-Serve |
| ---------------------------------------------------------------------------- | ---------- |
| Requests per minute | 6,000 |
| Tokens per day, models \< 40B | 2.5B |
| Tokens per day, models between 40B - 100B | 1.25B |
| Tokens per day, models > 100B | 600M |
| # [LoRAs](https://docs.fireworks.ai/getting-started/concepts#deployed-model) | 100 |
### Dynamic Limits
Dynamic rate limits vary based on available capacity and current traffic load. Here's how it works:
* Each user has a dynamic rate limit, which increases with sustained usage near the current limit. Typically, you can expect to stay within the limits if your traffic gradually doubles within an hour.
* The actual rate of increase depends on model size, traffic load, capacity availability, and other factors. The API response headers (see below) will let you know what your current limits are, so you know when more capacity is available.
* If you exceed your dynamic rate limit, the requests will still be processed but with lower priority. Those requests may see higher latency. You can monitor it via API response header `x-ratelimit-over-limit: yes`. If you significantly exceed your dynamic rate limit, the requests will be dropped with HTTP code 429.
* Dynamic rate limits work similarly to ["autoscaling"](https://en.wikipedia.org/wiki/Autoscaling) in many infrastructure systems. A gradual increase in traffic volume results in increased available capacity. Abrupt spikes in traffic may cause overload.
Here's an example of how dynamic rate limits scale up:
| Metric | Starting Quota (Minimum Guaranteed) | 10 Minutes | 1 Hour | 2 Hours |
| ------------------------ | ----------------------------------- | ---------- | ------ | ------- |
| Requests per minute      | 60                                  | 120        | 720     | 1,440     |
| Input tokens per minute  | 60,000                              | 120,000    | 720,000 | 1,440,000 |
| Output tokens per minute | 6,000                               | 12,000     | 72,000  | 144,000   |
### Rate limit response headers
| Header | Description |
| ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| x-ratelimit-limit-requests, x-ratelimit-limit-tokens-prompt, x-ratelimit-limit-tokens-generated | The maximum number of requests or tokens that are permitted per minute before the limit is exhausted and future requests are de-prioritized. `requests` refers to the number of completions (`n > 1` counts as several requests). `tokens-prompt` and `tokens-generated` refer to the number of input and output tokens respectively. |
| x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens-prompt, x-ratelimit-remaining-tokens-generated  | The remaining number of requests or tokens that are permitted before exhausting the rate limit. Note that the limit is replenished continuously. If your usage is consistently below the rate limit, this number will hover near its maximum value. |
| x-ratelimit-over-limit | Contains "yes" or "no". The value "yes" means that at least one of the limits is exhausted and this request was executed with lower priority. |
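As a usage sketch (using the `requests` library; `YOUR_API_KEY` and the retry policy are placeholders), you can read these headers and back off when you are over the limit or receive HTTP 429:
```python
# Sketch: read the rate-limit headers documented above and back off when
# requests are de-prioritized or rejected with HTTP 429.
# "YOUR_API_KEY" is a placeholder.
import time

import requests

def query_with_backoff(payload: dict, max_retries: int = 5):
    url = "https://api.fireworks.ai/inference/v1/chat/completions"
    headers = {"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"}
    for attempt in range(max_retries):
        response = requests.post(url, json=payload, headers=headers)
        print("remaining requests:", response.headers.get("x-ratelimit-remaining-requests"))
        if response.status_code == 429:
            time.sleep(2 ** attempt)  # exponential backoff before retrying
            continue
        if response.headers.get("x-ratelimit-over-limit") == "yes":
            print("over dynamic limit: this request was served with lower priority")
        return response.json()
    raise RuntimeError("rate limited after retries")
```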
## GPU Limits with On-Demand Deployments
If you need higher limits, [contact us](https://fireworks.ai/company/contact-us) to learn more about our Enterprise offerings.
| **Quota Name** | **Default Value** |
| -------------------------------------------------------------------------------- | ----------------- |
| # Nvidia A100 | 8 |
| # Nvidia H100 | 8 |
| # Nvidia H200 | 8 |
| # AMD MI300X | 8 |
| Total GPU Hours per month | 2000 |
| # [LoRAs](https://docs.fireworks.ai/getting-started/concepts#deployed-model) | 100 |
Note that the limit on # LoRAs is a total limit across Serverless and On-Demand.
## Spend limits
To prevent fraud, Fireworks imposes a monthly spending limit on your account. Once you hit the spending limit, your account automatically enters a suspended state: API requests are rejected and all Fireworks usage is stopped. This includes serverless inference, dedicated deployments, and fine-tuning jobs.
Your spend limit will organically increase over time as you spend more on the platform. You can also increase your spend limit at any time, by purchasing prepaid credits to meet the historical spend required for a higher tier. For instance, if you are a new Tier 1 user with `$0` historical spend, you can purchase `$100` prepaid credits and become a Tier 2 user.
You can qualify for a higher tier by adding credits into your Fireworks account. There may be a propagation delay of a few minutes after you prepay for credits, so you may still see a "monthly usage exceeded" error shortly after adding credits.
| **Tier** | **Qualification** | **Spending Limit** |
| --------- | --------------------------------------------------------------------- | ------------------ |
| Tier 1 | Valid payment method added | \$50/mo |
| Tier 2 | \$50 spent in payments or credits added | \$500/mo |
| Tier 3 | \$500 spent in payments or credits added | \$5,000/mo |
| Tier 4 | \$5000 spent in payments or credits added | \$50,000/mo |
| Unlimited | Contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) | Unlimited |
### Reducing Spend Limits
In certain cases, developers want to reduce their spend limit, for example to guard against unexpected costs if their app suddenly goes viral. You can lower or raise your spend limit to any value within your tier with the following command:
```bash
firectl update quota monthly-spend-usd --value
```
## Viewing quotas
You can view your current quota capacity by running:
```bash
firectl list quotas
```
## Account suspension
Account suspension occurs when your spending limit is hit, no payment method is on file after credits are depleted, or a past invoice payment fails. If you have a failed payment, go to the Invoices section at [https://fireworks.ai/billing](https://fireworks.ai/billing), pay all failed invoices, and your account will be automatically unsuspended. If your account is still suspended after 1 hour, contact the Fireworks team on Discord or via email.
# Data privacy & security
How we secure and handle your data
# Zero Data Retention
Fireworks has Zero Data Retention by default. Specifically, this means
* Fireworks does not log or store prompt or generation data for any open models without explicit user opt-in.
* More technically: prompt and generation data exist only in volatile memory for the duration of the request. If [prompt caching](https://docs.fireworks.ai/guides/prompt-caching#data-privacy) is active, some prompt data (and associated KV caches) can be stored in volatile memory for several minutes. In either case, prompt and generation data are not logged into any persistent storage.
* Fireworks logs metadata (e.g. number of tokens in a request) as required to deliver the service.
* Users can explicitly opt-in to log prompt and generation data for certain advanced features (e.g. FireOptimizer).
* For proprietary Fireworks models (e.g. f1, FireFunction), prompt and generation data may be logged to enable bulk analytics to improve the model.
* In this case, the model description will contain an explicit message about logging.
# Understanding LoRA performance
Understand the performance impact of LoRA fine-tuning, optimization strategies, and deployment considerations.
# Why is my LoRA slower than base model inference?
This guide explores why LoRA (Low-Rank Adaptation) fine-tuned models can exhibit slower inference times compared to base models, addresses key factors affecting performance, and provides actionable advice on optimizing deployments. It also delves into the implications of concurrent LoRA adapters and when to merge weights versus serving multiple adapters.
## Key concepts and definitions
* **LoRA (Low-Rank Adaptation)**: A fine-tuning technique that updates weights of the large model by approximating the delta as a product of two low-rank ("narrow") matrices.
* **PEFT (Parameter-Efficient Fine-Tuning)**: A set of approaches for model adaptation that change only part of the model and thus can be trained and served more efficiently. LoRA is the most popular technique in the PEFT family.
* **Weight Merging**: Combining LoRA-tuned weights with the base model to create a standalone model.
* **Speculative Decoding (SD)**: An optimization technique using a smaller "draft" model to precompute and speed up text generation.
## Q\&A: Addressing common questions
### Q1: Why is the LoRA fine-tuned model slower than the base model?
Three factors contribute to the observed slowdown:
1. **Hardware configuration**: LoRA models on serverless setups often share resources, leading to lower performance compared to dedicated deployments.
Note that this applies to serverless, not on-demand, deployments.
2. **Unmerged LoRA weights**: Serving LoRA adapters dynamically adds computational overhead during inference. Merging weights removes this overhead.
3. **Speculative decoding**: Base models often use speculative decoding to optimize generation speed. Without SD, LoRA fine-tuned models can lag behind.
### Q2: Does the number of concurrent LoRA adapters affect performance?
1. Increased time to first token (TTFT) - any request against an unmerged LoRA model has some initial overhead for loading model weights (on the order of tens of milliseconds) and sees prompt processing time increase by about 10-30%. Repeated requests might amortize some of these overheads.
2. Generation speed overheads for unmerged LoRA models increase with higher request concurrency. For a deployment serving a few requests per second, the overhead might be minimal, but relative overhead increases with a higher level of load. As a corollary, on-demand deployments with LoRA adapters have lower maximum throughput.
3. Performance is mostly independent of the total number of LoRA adapters deployed at a single deployment.
### Q3: How can performance be improved?
To address latency issues and optimize performance:
1. **Dedicated deployments**:
* Deploy LoRA models on dedicated hardware to avoid shared resource bottlenecks inherent to serverless setups.
2. **Weight merging**:
* Merge LoRA weights into the base model to eliminate runtime overhead.
3. **Speculative decoding**:
* Utilize speculative decoding for fine-tuned models with a custom draft model. This can achieve better-than-base performance.
### Q4: When should I merge weights vs. serve multiple LoRA adapters?
| **Scenario** | **Multi-LoRA (Unmerged)** | **Merged LoRA** |
| ------------------------ | ------------------------------------ | ------------------------------------- |
| **Use case** | Serving multiple fine-tuned variants | Low-latency, single-model deployments |
| **Hardware needs** | Shared or dedicated hardware | Dedicated hardware |
| **Performance impact** | Overhead per adapter | Equivalent to base model |
| **Concurrency handling** | Efficient for experimentation | Limited to one fine-tuned model |
### Q5: What is the performance impact of weight merging?
Merging weights creates a new standalone model indistinguishable from a fully fine-tuned model. Once merged:
* Latency matches the base model.
* Memory usage is reduced since adapters are no longer dynamically applied.
### Q6: What does it take for fine-tuning to match the performance of the base deployment?
To match or exceed base model performance, consider these steps:
1. **Speculative decoding**:
* Train a custom draft model optimized for your fine-tuned setup.
2. **Dedicated hardware**:
* Avoid serverless deployments to ensure consistent performance.
3. **Weight merging**:
* Merge LoRA weights to eliminate inference overhead.
## Implementation guide for optimizing LoRA performance
### Steps to improve performance
1. **Fine-Tune with LoRA**:
* Use LoRA for efficient parameter updates. See our guide on [uploading custom models](https://docs.fireworks.ai/models/uploading-custom-models) for more information.
2. **Download and merge weights**:
* Download adapters and merge weights using the PEFT library, [like in this guide](https://docs.fireworks.ai/guides/lora-model-merge); a minimal merging sketch follows this list.
3. **Deploy on dedicated hardware**:
* Deploy merged models for consistent low-latency performance, [like in this guide](https://docs.fireworks.ai/guides/ondemand-deployments).
4. **Use speculative decoding**:
* Train and deploy a draft model to further reduce latency.
This is currently an enterprise feature; please reach out for more information.
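For step 2 (weight merging), here is a minimal sketch using the Hugging Face `transformers` and `peft` libraries; the base model name and paths are placeholders, and the linked guides cover the full workflow:
```python
# Minimal sketch of merging a LoRA adapter into its base model with the
# Hugging Face transformers + peft libraries. The model name and paths are
# placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base, "path/to/downloaded/lora-adapter")
merged = model.merge_and_unload()  # folds the low-rank deltas into the base weights

merged.save_pretrained("path/to/merged-model")
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").save_pretrained("path/to/merged-model")
# The merged checkpoint can then be uploaded with `firectl create model` and
# deployed on dedicated hardware.
```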
# Deploying models
A model must be deployed before it can be used for inference. Fireworks deploys the most popular base models to
serverless deployments that can be used out of the box (including LoRA addons). See [Querying text models](/guides/querying-text-models).
Less popular base models or custom base
models must be used with an [on-demand deployment](/guides/ondemand-deployments).
## Deploying a model
### LoRA addons
#### Deploying to serverless
Fireworks also supports deploying serverless addons for [supported base models](/fine-tuning/fine-tuning-models#appendix).
To deploy a LoRA addon to serverless, run
`firectl deploy` without passing a deployment ID:
```bash
firectl deploy
```
Serverless addons are charged by input and output tokens for inference. There is no additional charge for deploying
serverless addons.
LoRA addons on serverless have higher latency compared with base model inference. This includes LoRA fine-tunes, which
are one type of LoRA addon. For faster inference speeds with LoRA addons, we recommend deploying to on-demand.
Unused addons may be automatically undeployed after a week.
#### Deploying to on-demand
Addons may also be deployed in an [on-demand deployment](/guides/ondemand-deployments) of [supported base models](/fine-tuning/fine-tuning-models#appendix).
To create an on-demand deployment, run:
```bash
firectl create deployment "accounts/fireworks/models/" --enable-addons
```
On-demand deployments are charged by GPU-hour. See [Pricing](https://fireworks.ai/pricing#ondemand) for
details.
Once the deployment is ready, deploy the addon to the deployment:
```bash
firectl deploy --deployment
```
### Base models
Custom base models may only be used with [on-demand deployments](/guides/ondemand-deployments). To create one, run:
```bash
firectl create deployment
```
On-demand deployments are charged by GPU-hour. See [Pricing](https://fireworks.ai/pricing#ondemand) for
details.
Use the `` specified during [model upload](https://docs.fireworks.ai/models/uploading-custom-models#uploading-the-model-2). Creating the deployment will automatically deploy the base model to the deployment.
## Checking whether a model is deployed
You can check the status of a model deployment by looking at the "Deployed Model Refs" section from:
```
firectl get model
```
If successful, there will be an entry with `State: DEPLOYED`.
Alternatively, you can list all deployed models within your account by running:
```
firectl list deployed-models
```
## Inference
### Model identifier
After your model is successfully deployed, it will be ready for inference. A model can be queried using one of the
following model identifiers:
* The model and deployment names - `accounts//models/#accounts//deployments/`,
e.g.
* `accounts/fireworks/models/mixtral-8x7b#accounts/alice/deployments/12345678`
* `accounts/alice/models/custom-model#accounts/alice/deployments/12345678`
* The model and deployment short-names - `/#/`,
e.g.
* `fireworks/mixtral-8x7b#alice/12345678`
* `alice/custom-model#alice/12345678`
* Deployed model name - Instead of needing to use both the model and deployment name to refer to a deployed model, you can optionally use the unique deployed model name, which is based on a deployed model ID created upon deployment. You can find it by listing your deployed models with `firectl list deployed-models`.
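For instance, here is a minimal sketch (reusing the example identifier above with the OpenAI-compatible SDK; the API key is a placeholder) that queries a specific deployment rather than the default:
```python
# Sketch: target a specific deployment by appending "#<deployment>" to the
# model name, using the example identifier from the list above.
# "YOUR_API_KEY" is a placeholder.
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/mixtral-8x7b#accounts/alice/deployments/12345678",
    messages=[{"role": "user", "content": "Say this is a test"}],
)
print(response.choices[0].message.content)
```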
### Multiple deployments
Since a model may be deployed to multiple deployments, querying by model name will route to the "default" deployed
model. You can see which deployed model entry is marked with `Default: true` by describing the model:
```
firectl get model
...
Deployed Model Refs:
[{
Name: accounts//deployedModels/
Deployment: accounts//deployments/
State: DEPLOYED
Default: true
},
{
Name: accounts//deployedModels/
Deployment: accounts//deployments/
State: DEPLOYED
},
]
```
To update the default deployed model, note the "Name" of the deployed model reference above. Then run:
```
firectl update deployed-model --default
```
Deleting a default deployment:
To delete a default deployment you must delete all other deployments for the same model first,
or designate a different deployed model as the default as described above. This is to ensure that querying by model name
will always route to an unambiguous default deployment as long as deployments for the model exist.
### Querying the model
To test the model using the completions API, run:
```bash
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "",
"prompt": "Say this is a test"
}' \
--url https://api.fireworks.ai/inference/v1/completions
```
See [Querying text models](/guides/querying-text-models) for a more comprehensive guide.
## Publishing a deployed model
By default, models can only be queried by the account that owns them. To make a deployed model public so anyone with a
valid Fireworks API key can query it, update the deployed model with the `--public` flag.
```bash
firectl update deployed-model --public
```
To unpublish it, run:
```bash
firectl update deployed-model --public=false
```
You must use the **deployed model ID**, not the **model ID**. To get a list of deployed models, run `firectl list deployed-models`.
# Overview
## Introduction
A *model* is a foundational concept of the Fireworks platform, representing a set of weights and metadata that can be
deployed on hardware (i.e. a *deployment*) for inference. Each model has a [globally unique name](https://docs.fireworks.ai/getting-started/concepts#resource-names-and-ids) of the
form `accounts//models/`. The model IDs are:
* Pre-populated for models that Fireworks has uploaded. For example, "llama-v3p1-70b-instruct" is the model ID for the Llama 3.1 70B model that Fireworks provides. It can be found on each model's page ([example](https://fireworks.ai/models/fireworks/llama-v3p1-70b-instruct))
* Either auto-generated or user-specified for fine-tuned models [uploaded](https://docs.fireworks.ai/models/uploading-custom-models#uploading-the-model) or [created](https://docs.fireworks.ai/fine-tuning/fine-tuning-models#model-id) by users
* User-specified for [custom models](https://docs.fireworks.ai/models/uploading-custom-models#uploading-the-model) uploaded by users
There are two types of models:
* Base models
* Low-rank adaptation (LoRA) addons
### Base models
A base model consists of the full set of model weights. This may include models pre-trained from scratch as well as
full fine-tunes (i.e. continued pre-training). Fireworks has a library of common base models that can be used for
[serverless inference](#serverless-inference) as well as [dedicated deployments](#dedicated-deployments). Fireworks
also allows you to upload your own custom base models.
### Low-rank adaptation (LoRA) addons
A LoRA addon is a small, fine-tuned model that significantly reduces the amount of memory required to deploy compared to
a fully fine-tuned model. Fireworks supports [training](/fine-tuning/fine-tuning-models),
[uploading](/models/uploading-custom-models#custom-lora-addons), and [serving](/models/deploying) LoRA addons.
LoRA addons must be deployed on a serverless or dedicated deployment for their corresponding base model.
## Using models for inference
A model must be deployed before it can be used for inference. Take a look at the [Querying text models](/guides/querying-text-models)
guide for a comprehensive overview of making LLM inference requests.
### Serverless inference
Fireworks supports serverless inference for popular models like Llama 3.1 405B. These models are pre-deployed by the
Fireworks team for the community to use. Take a look at the [Models](https://fireworks.ai/models) page for the latest
list of serverless models.
Serverless inference is billed on a per-token basis depending on the model size. See our [Pricing](https://fireworks.ai/pricing#text)
page for details.
Since serverless deployments are shared across users, there are no SLA guarantees for up-time or latency. It is
best-effort. The Fireworks team may also deprecate models from serverless with at least 2 weeks notice.
Custom base models are not supported for serverless inference.
### Serverless addons
The most popular base models for fine-tuning will also support serverless LoRA addons. This feature allows users to
quickly experiment and prototype with fine-tuning without having to pay extra for a dedicated deployment. See the
[Deploying to serverless](/models/deploying#deploying-to-serverless) guide for details.
Similar to serverless inference, there are no SLA guarantees for serverless addons.
### Dedicated deployments
Dedicated deployments give users the most flexibility and control over what models can be deployed and performance
guarantees. These deployments are private to you and give you access to a wide array of hardware. Both LoRA addons and
base models can be deployed to dedicated deployments.
Dedicated deployments are billed on a GPU-second basis. See our [Pricing](https://fireworks.ai/pricing#ondemand) page
for details.
Take a look at our [On-demand deployments](/guides/ondemand-deployments) guide for a comprehensive overview.
## Data privacy & security
Your data is your data. No prompt or generated data is logged or stored on Fireworks; only metadata like the number of tokens in a request is logged, as required to deliver the service. There are two exceptions:
* For our proprietary FireFunction model, input/output data is logged for 30 days only to enable bulk analytics to improve the model, such as tracking the number of functions provided to the model.
* For certain advanced features (e.g. FireOptimizer), users can explicitly opt-in to log data.
# Quantization
By default, models on dedicated deployments are served using 16-bit floating-point (FP16) precision. Quantization reduces the number of bits
used to serve the model, improving performance and reducing cost to serve. However, this can change model numerics
which may introduce small changes to the output.
Take a look at our [blog post](https://fireworks.ai/blog/fireworks-quantization) for a detailed treatment of how
quantization affects model quality.
## Quantizing a model
A model can be quantized to 8-bit floating-point (FP8) precision using `firectl prepare-model`:
```bash
firectl prepare-model
```
This is an additive process that enables creating deployments with additional precisions. The original FP16 checkpoint is still available for use.
You can check on the status of preparation by running
```bash
firectl get model
```
and checking whether the state is still `PREPARING`. A successfully prepared model will have the desired precision added
to the `Precisions` list.
## Creating an FP8 deployment
By default, creating a dedicated deployment will use the FP16 checkpoint. To see what precisions are available for a
model, run:
```bash
firectl get model
```
The `Precisions` field will indicate what precisions the model has been prepared for.
To use the quantized FP8 checkpoint, pass the `--precision` flag:
```bash
firectl create deployment --accelerator-type NVIDIA_H100_80GB --precision FP8
```
Quantized deployments can only be served using H100 GPUs.
# Uploading a custom model
In addition to the predefined set of models already available on Fireworks and models you fine-tune on the Fireworks
platform, you can also upload your own custom models. Both custom base models and LoRA addons are supported.
## Custom LoRA addons
### Requirements
Your custom LoRA addon must contain the following files:
* `adapter_config.json` - The Hugging Face adapter configuration file.
* `adapter_model.bin` or `adapter_model.safetensors` - The saved addon file.
The `adapter_config.json` must contain the following fields:
* `r` - The number of LoRA ranks. Must be an integer between 4 and 64, inclusive.
* `target_modules` - A list of target modules. Currently the following target modules are supported:
* `q_proj`
* `k_proj`
* `v_proj`
* `o_proj`
* `up_proj` or `w1`
* `down_proj` or `w2`
* `gate_proj` or `w3`
* `block_sparse_moe.gate`
Additional fields may be specified but are ignored.
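As an illustrative sanity check (a sketch, not an official validator; the path is a placeholder), you can verify these fields before uploading:
```python
# Sketch: sanity-check adapter_config.json against the requirements listed
# above before running `firectl create model`.
import json

SUPPORTED_TARGET_MODULES = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "up_proj", "w1", "down_proj", "w2", "gate_proj", "w3",
    "block_sparse_moe.gate",
}

with open("path/to/files/adapter_config.json") as f:
    config = json.load(f)

assert isinstance(config["r"], int) and 4 <= config["r"] <= 64, "r must be an integer in [4, 64]"
unsupported = set(config["target_modules"]) - SUPPORTED_TARGET_MODULES
assert not unsupported, f"unsupported target modules: {unsupported}"
print("adapter_config.json looks OK")
```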
### Enabling chat completions
To enable the chat completions API for your LoRA addon, add a `fireworks.json` file to the model directory containing:
```json
{
"conversation_config": {
"style": "jinja",
"args": {
"template": ""
}
}
}
```
### Uploading the model
To upload a LoRA addon, run the following command. The MODEL\_ID is an arbitrary [resource ID](https://docs.fireworks.ai/getting-started/concepts#resource-names-and-ids) to refer to the model within Fireworks.
> NOTE: Only some base models support LoRA addons.
```bash
firectl create model /path/to/files/ --base-model "accounts/fireworks/models/"
```
## Custom base models
### Requirements
Fireworks currently supports the following model architectures:
* [Gemma](https://huggingface.co/docs/transformers/en/model_doc/gemma)
* [Phi, Phi-3](https://huggingface.co/docs/transformers/en/model_doc/phi)
* [Llama 1,2,3,3.1](https://huggingface.co/docs/transformers/en/model_doc/llama2)
* [LLaVa](https://huggingface.co/docs/transformers/main/en/model_doc/llava)
* [Mistral](https://huggingface.co/docs/transformers/en/model_doc/mistral) & [Mixtral](https://huggingface.co/docs/transformers/en/model_doc/mixtral)
* [Qwen2](https://huggingface.co/docs/transformers/en/model_doc/qwen2)
* [StableLM](https://huggingface.co/docs/transformers/main/en/model_doc/stablelm)
* [Starcoder(GPTBigCode)](https://huggingface.co/docs/transformers/en/model_doc/gpt_bigcode) & [Starcoder2](https://huggingface.co/docs/transformers/main/en/model_doc/starcoder2)
* [DeepSeek V1 & V2](https://huggingface.co/deepseek-ai)
* [GPT NeoX](https://huggingface.co/docs/transformers/en/model_doc/gpt_neox)
The model files you will need to provide depend on the model architecture. In general, you will need the following files:
* Model configuration: `config.json`.
Fireworks does not support the `quantization_config` option in `config.json`.
* Model weights, in one of the following formats:
* `*.safetensors`
* `*.bin`
* Weights index: `*.index.json`
* Tokenizer file(s), e.g.
* `tokenizer.model`
* `tokenizer.json`
* `tokenizer_config.json`
If the requisite files are not present, model deployment may fail.
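A quick, illustrative pre-upload check (a sketch based on the file layout described above; the directory path is a placeholder):
```python
# Sketch: check that the files listed above are present before uploading
# with `firectl create model`. Not an official validator.
import glob
import os

model_dir = "path/to/files"  # placeholder path

has_config = os.path.exists(os.path.join(model_dir, "config.json"))
has_weights = glob.glob(os.path.join(model_dir, "*.safetensors")) or glob.glob(os.path.join(model_dir, "*.bin"))
has_tokenizer = any(
    os.path.exists(os.path.join(model_dir, name))
    for name in ("tokenizer.model", "tokenizer.json")
)

if not (has_config and has_weights and has_tokenizer):
    print("Missing required files; deployment may fail.")
```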
### Enabling chat completions
To enable the chat completions API for your custom base model, ensure your `tokenizer_config.json` contains a
`chat_template` field. See the Hugging Face guide on [Templates for Chat Models](https://huggingface.co/docs/transformers/main/en/chat_templating)
for details.
### Uploading the model
To upload a custom base model, run the following command.
```bash
firectl create model /path/to/files/
```
## Deploying
A model cannot be used for inference until it is deployed. See the [Deploying models](/models/deploying) guide to deploy
the model.
## Publishing
By default, all models you create are only visible to and deployable by users within your account. To publish a model so
anyone with a Fireworks account can deploy it, you can create it with the `--public` flag. This will allow it to show up
in public model lists.
```bash create
firectl create model /path/to/files --public
```
```bash update
firectl update model --public
```
To unpublish the model, just run
```bash update
firectl update model --public=false
```
# Using grammar mode
## What is grammar-based structured output?
Grammar mode is the ability to specify a forced output schema for any Fireworks model via an extended BNF formal grammar ([GBNF format](https://github.com/ggerganov/llama.cpp/tree/master/grammars)). This method is popularly used to constrain model outputs in [llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md). What is a formal grammar? It's a way to define rules that declare which strings are valid or invalid. See the "Syntax" section below for more info. Similar to our [JSON mode](/structured-responses/structured-response-formatting), you provide the `response_format` field in the request, like `{"type": "grammar", "grammar": }`.
For best results, we still recommend that you do some prompt engineering and describe the desired output to the model to guide decision-making.
## Why grammar-based structured output?
* Relying solely on system prompt engineering is finicky and time-consuming. It can be difficult to coerce the model to do certain things, for example
* Behave like a classifier, only output from a predefined list
* Output only Japanese, Chinese, a specified programming language, or otherwise prevent the model from generating a large set of tokens
* Sometimes JSON is not what you need (e.g. it may be finicky with string escaping) and you need some other structured output
* Small models may have difficulty following instructions
## End-to-end examples
This guide provides a step-by-step example of creating a structured output response with grammar using the Fireworks.ai API. The example uses Python and the OpenAI library to define the schema for the output.
### Prerequisites
Before you begin, ensure you have the following:
* Python installed on your system.
* The `openai` library installed. You can install it using pip:
```bash
pip install openai
```
Next, select the model you want to use. In this example, we use `llama-v3p1-405b-instruct`, but all Fireworks models support this feature. You can find your favorite model and get structured responses out of it!
### Step 1: Configure the Fireworks.ai client
You can use either the Fireworks.ai SDK or the OpenAI SDK with this feature. Using the OpenAI SDK with your API key and the base URL:
```python
import openai
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="Your_API_Key",
)
```
Replace `"Your_API_Key"` with your actual API key.
### Step 2: Define the output grammar
Define a grammar to restrict the specified output. Let's say you have a model that is a classifier and classifies patient requests into a few predefined classes:
```
root ::= diagnosis
diagnosis ::= "arthritis" | "dengue" | "urinary tract infection" | "impetigo" | "cervical spondylosis"
```
Then you can ask the model to only respond within these classes.
### Step 3: Specify your output grammar in your chat completions request
```python Python
from fireworks.client import Fireworks
client = Fireworks(
api_key="Your_API_Key",
)
diagnosis_grammar = """
root ::= diagnosis
diagnosis ::= "arthritis" | "dengue" | "urinary tract infection" | "impetigo" | "cervical spondylosis"
"""
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
response_format={"type": "grammar", "grammar": diagnosis_grammar},
messages=[
{
"role": "system",
"content": "Given the symptoms try to guess the possible diagnosis. Possible choices: arthritis, dengue, urinary tract infection, impetigo, cervical spondylosis. Answer with a single word",
},
{
"role": "user",
"content": "I have been having trouble with my muscles and joints. My neck is really tight and my muscles feel weak. I have swollen joints and it is hard to move around without becoming stiff. It is also really uncomfortable to walk.",
},
],
)
print(chat_completion.choices[0].message.content)
```
For the response, we only get one of the five classes we specified; in this case, the model output is
```
'arthritis'
```
Note that we still did some prompt engineering to instruct the model about the possible diagnoses in free form. Alternatively, we could have used a model fine-tuned for the medical domain.
## Advanced examples
### Japanese and Chinese
Grammar mode can also constrain the character set of the output. For example, the following grammar restricts the model's output to Japanese characters:
```python
from fireworks.client import Fireworks
client = Fireworks(
api_key="Your_API_Key",
)
cjk_grammar = """
root ::= jp-char+ ([ \t\n] jp-char+)*
jp-char ::= hiragana | katakana | punctuation | cjk
hiragana ::= [ぁ-ゟ]
katakana ::= [ァ-ヿ]
punctuation ::= [、-〾]
cjk ::= [一-鿿]
"""
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
response_format={"type": "grammar", "grammar": cjk_grammar},
messages=[
{
"role": "user",
"content": "You are a Japanese tour guide who speaks fluent Japanese. Please tell me what are some good places for me to visit in Kyoto",
},
],
)
print(chat_completion.choices[0].message.content)
```
The model will reply in Japanese
```
こんにちは、私は日本語を母国語として話せるキョトの私が案内する旅行案内者です。京都を旅行にお付き合いいただきありがとうごさいます。京都にはたくさんの楽しいところがありますが、私はあなたの需要に基いて、いくつかのおすすめていきます。\n最初に、古都の一面を体感できる場所として、清水寺をおすすめします。清水寺は世界的に有名な寺院で、美しい金面山だまのホームページと、きれいな庭で知られています。\n次に、京
```
And since the grammar is actually more lenient than Japanese and covers Chinese as well, we can also just prompt the model to be a fluent Chinese speaker.
```
You are a Japanese tour guide who speaks fluent Chinese. Please tell me what are some good places for me to visit in Shanghai?
```
Here we are trying something a little tricky: asking a Japanese tour guide to speak Chinese. With help from the grammar, the model replied in Chinese, with the same grammar specified
```
当您访问上海时、我建议您参观以下几个地方。\\n上海外国语大学。这是一所著名的大学、校园美景优秀、还有各种餐馆可供您选择。\\n上海中山公园。这是一座位于城市中心的公园、风景优美、有许多古老的建设和展览馆。\\n南京路。这是一条繁华的商业大街、有许多品牌商店和餐馆可供您选择。\\n上海南京东路步行街。这是全球最大的步行街之一、有许多商店和餐馆可供您选择。\\n上海世博中心。这是一座展览馆复合体、经常举办各种展览和会议
```
Without help from the grammar, here is the model's reply, in a mix of Chinese and English
```
你好!uming as a Japanese tour guide who speaks fluent Chinese, I would be happy to recommend some places for you to visit in Shanghai!\\n\\n1. The Bund: This is a famous waterfront area in Shanghai that offers stunning views of the city's skyline, including the iconic Oriental Pearl Tower. You can take a leisurely stroll along the promenade and enjoy the beautiful scenery.\\n\\n2. Yuyuan Garden: This is a beautiful classical Chinese garden that dates back to the Ming Dynasty. It features pavilions, halls, rockeries, ponds, and cloisters, and is a great place to experience traditional Chinese architecture and garden design.\\n\\n3. Shanghai Tower: This is the tallest building in China and the second-tallest building in the world. You can take the elevator up to the observation deck on the 128th floor for breatht
```
### C code generation
This grammar is one of the community contributions to llama.cpp. You can hook it up with a Fireworks model and try to come up with a good solution for a coding problem you have.
```python
from fireworks.client import Fireworks
client = Fireworks(
api_key="your_API_Key",
)
c_grammar = """
root ::= (declaration)*
declaration ::= dataType identifier "(" parameter? ")" "{" statement* "}"
dataType ::= "int" ws | "float" ws | "char" ws
identifier ::= [a-zA-Z_] [a-zA-Z_0-9]*
parameter ::= dataType identifier
statement ::=
( dataType identifier ws "=" ws expression ";" ) |
( identifier ws "=" ws expression ";" ) |
( identifier ws "(" argList? ")" ";" ) |
( "return" ws expression ";" ) |
( "while" "(" condition ")" "{" statement* "}" ) |
( "for" "(" forInit ";" ws condition ";" ws forUpdate ")" "{" statement* "}" ) |
( "if" "(" condition ")" "{" statement* "}" ("else" "{" statement* "}")? ) |
( singleLineComment ) |
( multiLineComment )
forInit ::= dataType identifier ws "=" ws expression | identifier ws "=" ws expression
forUpdate ::= identifier ws "=" ws expression
condition ::= expression relationOperator expression
relationOperator ::= ("<=" | "<" | "==" | "!=" | ">=" | ">")
expression ::= term (("+" | "-") term)*
term ::= factor(("*" | "/") factor)*
factor ::= identifier | number | unaryTerm | funcCall | parenExpression
unaryTerm ::= "-" factor
funcCall ::= identifier "(" argList? ")"
parenExpression ::= "(" ws expression ws ")"
argList ::= expression ("," ws expression)*
number ::= [0-9]+
singleLineComment ::= "//" [^\n]* "\n"
multiLineComment ::= "/*" ( [^*] | ("*" [^/]) )* "*/"
ws ::= ([ \t\n]+)"""
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
response_format={"type": "grammar", "grammar": c_grammar},
messages=[
{
"role": "user",
"content": "You are an expert in writing C code. Can you write a program that prints hello world?",
},
],
)
print(chat_completion.choices[0].message.content)
```
In this case, we get a cute little valid C program as the output:
```
char\nc(int a){return 2*a;}
```
## Syntax
### Background
[Backus-Naur Form (BNF)](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) is a notation for describing the syntax of formal languages like programming languages, file formats, and protocols. The Fireworks API uses an extension of BNF with a few modern regex-like features, inspired by [llama.cpp's implementation](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).
### Basics
In BNF, we define *production rules* that specify how a *non-terminal* (rule name) can be replaced with sequences of *terminals* (characters, specifically Unicode [code points](https://en.wikipedia.org/wiki/Code_point)) and other non-terminals. The basic format of a production rule is `nonterminal ::= sequence...`.
Consider an example of a small chess notation grammar:
```
# `root` specifies the pattern for the overall output
root ::= (
# it must start with the characters "1. " followed by a sequence
# of characters that match the `move` rule, followed by a space, followed
# by another move, and then a newline
"1. " move " " move "\n"
# it's followed by one or more subsequent moves, numbered with one or two digits
([1-9] [0-9]? ". " move " " move "\n")+
)
# `move` is an abstract representation, which can be a pawn, nonpawn, or castle.
# The `[+#]?` denotes the possibility of checking or mate signs after moves
move ::= (pawn | nonpawn | castle) [+#]?
pawn ::= ...
nonpawn ::= ...
castle ::= ...
```
### Non-terminals and terminals
Non-terminal symbols (rule names) stand for a pattern of terminals and other non-terminals. They are required to be a dashed lowercase word, like `move`, `castle`, or `check-mate`.
Terminals are actual characters ([code points](https://en.wikipedia.org/wiki/Code_point)). They can be specified as a sequence like `"1"` or `"O-O"` or as ranges like `[1-9]` or `[NBKQR]`.
### Characters and character ranges
Terminals support the full range of Unicode. Unicode characters can be specified directly in the grammar, for example `hiragana ::= [ぁ-ゟ]`, or with escapes: 8-bit (`\xXX`), 16-bit (`\uXXXX`) or 32-bit (`\UXXXXXXXX`).
Character ranges can be negated with `^`:
```
single-line ::= [^\n]+ "\n"
```
The dot `.` symbol matches any character:
```
any-three-symbol-sequence ::= ...
```
### Sequences and alternatives
The order of symbols in a sequence matter. For example, in `"1. " move " " move "\n"`, the `"1. "` must come before the first `move`, etc.
Alternatives, denoted by `|`, give different sequences that are acceptable. For example, in `move ::= pawn | nonpawn | castle`, `move` can be a `pawn` move, a `nonpawn` move, or a `castle`.
Parentheses `()` can be used to group sequences, which allows for embedding alternatives in a larger rule or applying repetition and optional symbols (below) to a sequence.
### Repetition and optional symbols
* `*` after a symbol or sequence means that it can be repeated zero or more times.
* `+` denotes that the symbol or sequence should appear one or more times.
* `?` makes the preceding symbol or sequence optional.
### Comments and newlines
Comments can be specified with `#`:
```
# defines optional whitespace
ws ::= [ \t\n]+
```
Newlines are allowed between rules and between symbols or sequences nested inside parentheses. Additionally, a newline after an alternate marker `|` will continue the current rule, even outside of parentheses.
### The root rule
In a full grammar, the `root` rule always defines the starting point of the grammar. In other words, it specifies what the entire output must match.
```
# a grammar for lists
root ::= ("- " item)+
item ::= [^\n]+ "\n"
```
# Using JSON mode
## What is JSON mode?
JSON mode enables you to force any Fireworks language model to respond in valid JSON, optionally conforming to a JSON schema that you provide.
## Why JSON responses?
1. Clarity and Precision: Responding in JSON ensures that the output from the LLM is clear, precise, and easy to parse. This is particularly beneficial in scenarios where the response needs to be further processed or analyzed by other systems.
2. Ease of Integration: JSON, being a widely-used format, allows for easy integration with various platforms and applications. This interoperability is essential for developers looking to incorporate AI capabilities into their existing systems without extensive modifications.
## End-to-end example
This guide provides a step-by-step example of how to create a structured output response using the Fireworks.ai API. The example uses Python and the `pydantic` library to define the schema for the output.
### Prerequisites
Before you begin, ensure you have the following:
* Python installed on your system.
* `openai` and `pydantic` libraries installed. You can install them using pip:
```bash
pip install openai pydantic
```
Next, select the model you want to use. In this example, we use `mixtral-8x7b-instruct`, but all Fireworks models support this feature. You can find your favorite model and get a JSON response out of it!
### Step 1: Import libraries
Start by importing the required libraries:
```python
import openai
from pydantic import BaseModel, Field
```
### Step 2: Configure the Fireworks.ai client
You can use either the Fireworks.ai SDK or the OpenAI SDK with this feature. Using the OpenAI SDK with your API key and the base URL:
```python
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="Your_API_Key",
)
```
Replace `"Your_API_Key"` with your actual API key.
### Step 3: Define the output schema
Define a Pydantic model to specify the schema of the output. For example:
```python
class Result(BaseModel):
winner: str
```
This model defines a simple schema with a single field `winner`. If you are not familiar with Pydantic, please [check the documentation here](https://docs.pydantic.dev/latest/). Pydantic emits JSON Schema, and you can find more information [about it here](https://json-schema.org/).
### Step 4: Specify your output schema in your chat completions request
Make a request to the Fireworks.ai API to get a JSON response. In your request, specify the output schema you used in step 3. For example, to ask who won the US presidential election in 2012:
```python
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/mixtral-8x7b-instruct",
response_format={"type": "json_object", "schema": Result.model_json_schema()},
messages=[
{
"role": "user",
"content": "Who won the US presidential election in 2012? Reply just in one JSON.",
},
],
)
```
### Step 5: Display the result
Finally, print the result:
```python
print(repr(chat_completion.choices[0].message.content))
```
This will display the response in the format defined by the `Result` schema. We get a single, clean JSON response:
```
'{\n "winner": "Barack Obama"\n}'
```
You can parse that as plain JSON and hook it up with the rest of your system. Currently, we enforce the structure with a grammar-based state machine to make sure the LLM always generates all the fields in the schema. If your provided output schema is not a valid JSON schema, we will fail the request.
## Structured response modes
Fireworks supports the following variants:
* **Arbitrary JSON**. Similar to [OpenAI](https://platform.openai.com/docs/guides/text-generation/json-mode), you can force the model to produce any valid json by providing `{"type": "json_object"}` as `response_format` in the request. This forces the model to output JSON but does not specify what specific JSON schema to use.
* **JSON with the given schema**. To specify a given JSON schema, you can provide the schema according to [JSON schema spec](https://json-schema.org/specification) to be imposed on the model generation. See supported constructs in the next section.
**Important:** when using JSON mode, it's also crucial to instruct the model to produce JSON and describe the desired schema via a system or user message. Without this, the model may generate an unending stream of whitespace until the generation reaches the token limit, resulting in a long-running and seemingly "stuck" request.
To get the best outcome, you need to include the schema in **both the prompt and the `response_format` field.**
Technically, it means that when using "JSON with the given schema" mode, the model doesn't automatically "see" the schema passed in the `response_format` field. Adherence to the schema is forced upon the model during sampling. So for best results, you need to include the desired schema in the prompt in addition to specifying it as `response_format`. You may need to experiment with the best way to describe the schema in the prompt depending on the model: besides JSON schema, describing it in plain English might work well too, e.g. "extract name and address of the person in JSON format".
**Note:** the message content may be partially cut off if `finish_reason="length"`, which indicates the generation exceeded `max_tokens` or the conversation exceeded the max context length. In this case, the return value might not be valid JSON.
Structured response modes work for both Completions and Chat Completions APIs.
If you use [function calling](/docs/function-calling), JSON mode is enabled automatically and function schema is added to the prompt. So none of the comments above apply.
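Putting the notes above together, here is a minimal sketch (placeholder API key) that includes the schema in the prompt as well as in `response_format`, and checks `finish_reason` before parsing:
```python
# Sketch: include the schema in the prompt *and* in response_format, then
# verify finish_reason before parsing. "Your_API_Key" is a placeholder.
import json

import openai
from pydantic import BaseModel

class Result(BaseModel):
    winner: str

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="Your_API_Key",
)

schema = Result.model_json_schema()
chat_completion = client.chat.completions.create(
    model="accounts/fireworks/models/mixtral-8x7b-instruct",
    response_format={"type": "json_object", "schema": schema},
    messages=[
        {"role": "system", "content": f"Reply in JSON matching this schema: {json.dumps(schema)}"},
        {"role": "user", "content": "Who won the US presidential election in 2012?"},
    ],
)

choice = chat_completion.choices[0]
if choice.finish_reason == "length":
    raise ValueError("output was truncated; it may not be valid JSON")
print(json.loads(choice.message.content))
```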
### JSON schema constructs
Fireworks supports a subset of [JSON schema specification](https://json-schema.org/specification).
Supported:
* Nested schemas composition, including `anyOf` and `$ref`
* `type`: `string`, `number`, `integer`, `boolean`, `object`, `array`, `null`
* `properties` and `required` for objects
* `items` for arrays
The Fireworks API doesn't error out on unsupported constructs; they just won't be enforced. Constraints that are not yet supported include:
* Sophisticated composition with `oneOf`
* Length/size constraints for objects and arrays
* Regular expressions via `pattern`
**Note**: JSON specification [allows for arbitrary field names](https://json-schema.org/understanding-json-schema/reference/object#additionalproperties) to appear in an object with the `properties` constraint unless `"additionalProperties": false` or `"unevaluatedProperties": false` is provided. It's a poor default for LLM constrained generation since any hallucination would be accepted. Thus Fireworks treats any schema with `properties` constraint as if it had `"unevaluatedProperties": false`.
An example of `response_format` field with the schema accepting an object with two fields - a required string and an optional integer:
```
{
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"foo": {"type": "string"},
"bar": {"type": "integer"}
},
"required": ["foo"]
}
}
```
## Similar features
Check out our [function calling model](/guides/function-calling) if you're interested in use cases like:
* Multi-turn capabilities: For example, the ability for the model to ask for clarifying information about parameters
* Routing: The ability for the model to route across multiple different options or models. Instead of just having one possible JSON Schema, you have many different JSON schemas to work across.
Check out [grammar mode](/structured-responses/structured-output-grammar-based) if you want structured output specified not through JSON, but rather through an arbitrary grammar (limit output to specific words, character limits, character types, etc).
# Authentication
Authentication for access to your account
### Signing in
Users using Google SSO can run:
```
firectl signin
```
If you are using [custom SSO](/accounts/sso), also specify the account ID:
```
firectl signin my-enterprise-account
```
### Authenticate with API Key
To authenticate without a web browser, append `--api-key` to any firectl command.
```
firectl --api-key API_KEY
```
To persist the API key for all subsequent commands, run:
```
firectl set-api-key API_KEY
```
# Create a Dataset
Create a Dataset on Fireworks AI platform
```
firectl create dataset [flags]
```
### Example
```
firectl create dataset my-dataset /path/to/dataset.jsonl
```
### Flags
```
--display-name string The display name of the dataset.
-h, --help help for dataset
--quiet If true, does not print the upload progress bar.
```
# Create a deployment
Create a Deployment on Fireworks AI platform
Creates a new deployment.
```
firectl create deployment [flags]
```
### Example
```
firectl create deployment falcon-7b
```
### Flags
```
--description string Description of the deployment.
--disable-speculative-decoding If true, speculative decoding is disabled.
--display-name string Human-readable name of the deployment. Must be fewer than 64 characters long.
--max-peft-batch-size int32 Max batching of concurrent peft requests of the server.
--max-replica-count int32 Maximum number of replicas for the deployment. If min-replica-count > 0 defaults to 0, otherwise defaults to 1.
--min-replica-count int32 Minimum number of replicas for the deployment. If min-replica-count < max-replica-count the deployment will automatically scale between the two replica counts based on load.
--model-id string The ID of a model that should be deployed when the deployment is created.
--scale-down-window duration The duration the autoscaler will wait before scaling down a deployment after observing decreased load. Default is 10m.
--scale-to-zero-window duration The duration after which there are no requests that the deployment will be scaled down to zero replicas, if min-replica-count is 0. Default 1h.
--scale-up-window duration The duration the autoscaler will wait before scaling up a deployment after observing increased load. Default is 30s.
--unused-auto-delete-duration duration The duration for which if no requests are received, the deployment will automatically be deleted. If 0, the auto-deletion is disabled. (default 168h0m0s)
--wait Wait until the deployment is ready.
--world-size int32 The number of GPUs the base model is served with.
-h, --help help for deployment
```
### Flags inherited from parent commands
```
--dry-run Print the request proto without running it.
-o, --output Output Set the output format to "text" or "json". (default text)
```
# Create a fine-tuning job
Create a fine-tuning job with a base model
Creates a fine-tuning job on Fireworks AI platform with the provided configuration yaml.
```
firectl create sftj [flags]
```
### Example
```
firectl create sftj \
--base-model llama-v3p1-8b-instruct \
--dataset cancerset \
--output-model my-tuned-model \
--job-id my-fine-tuning-job \
--learning-rate 0.0001 \
--epochs 2 \
--early-stop \
--evaluation-dataset my-eval-set
```
### Flags
```
--base-model string (required) The base model used for fine-tuning. e.g. mistralai/Mixtral-8x7B-Instruct-v0.1
--dataset string (required) The ID of the dataset for the fine tuning.
--display-name string (optional) The display name of the fine-tuning job.
--draft-base-model string (optional) The draft model hf base model field.
--epochs int (optional) The number of epochs to train for.
--evaluation-dataset string (optional) The evaluation dataset for the supervised fine-tuning job.
--job-id string (optional) The ID of the fine-tuning job.
--learning-rate float (optional) The learning rate used for training.
--lora-rank int32 (optional) The LoRA rank used for training.
--early-stop Enable early stopping for the supervised fine-tuning job.
--quiet If set, only errors will be printed.
-h, --help help for deployment
--wandb-api-key string (optional) A Weights & Biases API key associated with the entity.
--wandb-entity string (optional) The Weights & Biases entity where training progress should be reported.
--wandb-project string (optional) The Weights & Biases project where training progress should be reported.
--wandb-run-id string [WANDB_RUN_ID] WandB Run ID. Implies --wandb.
--wandb Enable WandB
```
# Create Model
Create a model on Fireworks AI platform
```
firectl create model [flags]
```
### Example
```
firectl create model my-model /path/to/checkpoint/
```
### Flags
```
--context-length int32 The maximum context length of the model.
--default-draft-model string The default speculative draft model to use when creating a deployment.
--default-draft-token-count int32 The default speculative draft token count when creating a deployment.
--description string The description of the model.
--display-name string The display name of the model.
--github-url string The GitHub URL of the model.
-h, --help help for model
--hugging-face-url string The Hugging Face URL of the model.
--public Whether the model is publicly accessible.
--quiet If true, does not print the upload progress bar.
--supports-image-input Whether the model supports image inputs.
--supports-tools Whether the model supports function calling.
```
### Flags inherited from parent commands
```
-o, --output Output Set the output format to "text" or "json". (default text)
```
# Delete Resources
Deletes resource(s) in a Fireworks AI account
### Delete a model
```
firectl delete model [flags]
```
#### Example
```
firectl delete model my-model
```
### Delete a fine-tuning job
```
firectl delete fine-tuning-job [flags]
```
#### Example
```
firectl delete fine-tuning-job my-fine-tuning-job
```
### Delete a deployment
Deletes a model deployment.
```
firectl delete deployment [flags]
```
#### Example
```
firectl delete deployment my-deployment
```
### Delete a dataset
```
firectl delete dataset [flags]
```
#### Example
```
firectl delete dataset my-dataset
```
### Flags
```
-h, --help help for deleting resources
```
# Deploy Model
Deploy a model on Fireworks AI platform
```
firectl deploy [flags]
```
#### Example
```
firectl deploy my-model
```
### Flags
```
--deployment-id string The ID of the deployment where the model is to be deployed.
-h, --help help for deploy
--wait Wait until the model is deployed.
```
# Download a model
Download a model from third-party locations
```
firectl download model [flags]
```
#### Example
```
firectl download model my-model /path/to/checkpoint/
```
### Flags
```
-h, --help help for download
```
# Get Resources
Retrieves information about resources on the Fireworks AI platform
```
firectl get [flags]
```
#### Example
```
firectl get model my-model
```
### Retrieve user information
Prints information about a user.
```
firectl get user [flags]
```
#### Example
```
firectl get user john-08bb29
```
### Retrieve fine-tuning job information
Prints information about a fine-tuning job.
```
firectl get fine-tuning-job [flags]
```
#### Example
```
firectl get fine-tuning-job my-fine-tuning-job
```
### Get information about a deployment
```
firectl get deployment [flags]
```
#### Example
```
firectl get deployment my-deployment
```
### Get information about a dataset
```
firectl get dataset [flags]
```
#### Example
```
firectl get dataset instr-fine-tuning
```
### Flags inherited from parent commands
```
--dry-run Print the request proto without running it.
-o, --output Output Set the output format to "text" or "json". (default text)
```
# Import Model
Imports a specified model from the Fireworks AI platform
Imports a model from the `fireworks` account.
```
firectl import model [flags]
```
#### Example
```
firectl import model llama-v3p1-8b-instruct
```
### Flags
```
-h, --help help for model
--model-id string The ID of the model to be created.
```
# List Resources
Lists various resources in a Fireworks AI account
```
firectl list [flags]
```
### List models
```
firectl list models
```
### List fine-tuning jobs
Prints all fine-tuning jobs in an account.
```
firectl list fine-tuning-jobs [flags]
```
### List deployments
Prints all deployments in the account.
```
firectl list deployments [flags]
```
### List deployed models
Prints all deployed models in an account.
```
firectl list deployed-models [flags]
```
### List datasets
Prints all datasets uploaded by a user in an account.
```
firectl list datasets [flags]
```
### Flags inherited from parent commands
```
--filter string Only resources satisfying the provided filter will be listed. See https://google.aip.dev/160 for the filter grammar.
-h, --help help for list
--no-paginate List all resources without pagination.
--order-by string A list of fields to order by. To specify a descending order for a field, append a " desc" suffix
--page-size int32 The maximum number of resources to list.
--page-token string The page to list.
```
# Undeploy Model
Undeploy a model on Fireworks AI platform
```
firectl undeploy [flags]
```
#### Example
```
firectl undeploy my-model
```
### Flags
```
-h, --help help for undeploy
--wait Wait until the model is undeployed.
```
# Update Resources
Updates resources on Fireworks AI platform
```
firectl update model [flags]
```
#### Example
```
firectl update model my-model --display-name="New Name"
```
### Flags
```
--context-length int32 The maximum context length of the model.
--default-draft-model string The default speculative draft model to use when creating a deployment.
--default-draft-token-count int32 The default speculative draft token count when creating a deployment.
--description string The description of the model.
--display-name string The display name of the model.
--github-url string The GitHub URL of the model.
-h, --help help for model
--hugging-face-url string The Hugging Face URL of the model.
--public Whether the model is publicly accessible.
--supports-image-input Whether the model supports image inputs.
--supports-tools Whether the model supports function calling.
```
## Update a user
```
firectl update user [flags]
```
#### Example
```
firectl update user my-user --display-name="Alice Cullen"
```
### Flags
```
--display-name string The display name of the user.
-h, --help help for user
--role string The role of the user. Must be one of {user, admin}.
```
## Update a deployment
```
firectl update deployment [flags]
```
#### Example
```
firectl update deployment my-deployment
```
### Flags
```
--description string Description of the deployment. Must be fewer than 1000 characters long.
--display-name string Human-readable name of the deployment. Must be fewer than 64 characters long.
-h, --help help for deployment
--max-peft-batch-size int32 Max batching of concurrent PEFT requests to the server.
--max-replica-count int32 The maximum number of replicas.
--min-replica-count int32 The minimum number of replicas. (default 1)
--scale-down-window duration The duration the autoscaler will wait before scaling down a deployment after observing decreased load. Default is 10m.
--scale-to-zero-window duration The duration after which there are no requests that the deployment will be scaled down to zero replicas, if min-replica-count is 0. Default 1h.
--scale-up-window duration The duration the autoscaler will wait before scaling up a deployment after observing increased load. Default is 30s.
--unused-auto-delete-duration duration The duration for which if no requests are received, the deployment will automatically be deleted. If 0, the auto-deletion is disabled.
--world-size int32 The number of GPUs the base model is served with.
```
## Update a dataset
```
firectl update dataset [flags]
```
#### Example
```
firectl update dataset my-dataset
```
### Flags
```
--display-name string The display name of the dataset.
-h, --help help for dataset
```
# Getting Started
Learn to create, deploy, and manage resources using Firectl
Firectl can be installed in several ways, depending on your platform and preference.
```bash homebrew
brew tap fw-ai/firectl
brew install firectl
# If you encounter a failed SHA256 check, try first running
brew update
```
```bash macOS (Apple Silicon)
curl https://storage.googleapis.com/fireworks-public/firectl/stable/darwin-arm64.gz -o firectl.gz
gzip -d firectl.gz && chmod a+x firectl
sudo mv firectl /usr/local/bin/firectl
sudo chown root: /usr/local/bin/firectl
```
```bash macOS (x86_64)
curl https://storage.googleapis.com/fireworks-public/firectl/stable/darwin-amd64.gz -o firectl.gz
gzip -d firectl.gz && chmod a+x firectl
sudo mv firectl /usr/local/bin/firectl
sudo chown root: /usr/local/bin/firectl
```
```bash Linux (x86_64)
wget -O firectl.gz https://storage.googleapis.com/fireworks-public/firectl/stable/linux-amd64.gz
gunzip firectl.gz
sudo install -o root -g root -m 0755 firectl /usr/local/bin/firectl
```
```Text Windows (64 bit)
wget -L https://storage.googleapis.com/fireworks-public/firectl/stable/firectl.exe
```
### Sign into Fireworks account
To sign into your Fireworks account:
```bash
firectl signin
```
If you have set up [Custom SSO](/accounts/sso) then also pass your account ID:
```bash
# replace <account-id> with your Fireworks account ID
firectl signin <account-id>
```
### Check you have signed in
To show which account you have signed into:
```bash
firectl whoami
```
### Check your installed version
```bash
firectl version
```
### Upgrade to the latest version
```bash
sudo firectl upgrade
```
# OpenAI compatibility
You can use [OpenAI Python client library](https://github.com/openai/openai-python) to interact with Fireworks.
This makes migration of existing applications already using OpenAI particularly easy.
## Specify endpoint and API key
You can override parameters for the entire application using environment variables
```shell Shell
export OPENAI_API_BASE="https://api.fireworks.ai/inference/v1"
export OPENAI_API_KEY=""
```
or by setting these values in Python
```python
import openai
# warning: it has a process-wide effect
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
```
Alternatively, you may specify these parameters for a single request (useful if you mix calls to OpenAI and Fireworks in the same process):
```python
# api_base and api_key can be passed to any of the supported APIs
chat_completion = openai.ChatCompletion.create(
    api_base="https://api.fireworks.ai/inference/v1",
    api_key="",  # your Fireworks API key
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # example model; substitute your own
    messages=[{"role": "user", "content": "Say hello!"}],
)
```
Note that if you're using the OpenAI SDK, the `usage` field won't be listed in the SDK's structure definition, but it can still be accessed directly. For example:
* In the Python SDK, you can access the attribute directly, e.g. `for chunk in openai.ChatCompletion.create(...): print(chunk["usage"])` (see the sketch below).
* In the TypeScript SDK, you need to cast away the typing, e.g. `for await (const chunk of await openai.chat.completions.create(...)) { console.log((chunk as any).usage); }`.
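For illustration, here is a minimal sketch of the Python variant, assuming a pre-1.0 `openai` package (the module-level `openai.ChatCompletion` interface used throughout this page); the model name is a placeholder:
```python
import openai

openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""  # your Fireworks API key

# `usage` is not part of the SDK's type definitions, so read it straight off the chunk dict.
for chunk in openai.ChatCompletion.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Say hello!"}],
    stream=True,
):
    print(chunk.get("usage"))
```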
### Unsupported options
The following options are not yet supported:
* `presence_penalty`
* `frequency_penalty`
* `best_of`: you can use `n` instead
* `logit_bias`
* `functions`: you can use our [LangChain integration](https://python.langchain.com/docs/integrations/providers/fireworks) to achieve similar functionality client-side
Please reach out to us on [Discord](https://discord.gg/fireworks-ai) if you have a use case requiring one of these.
# API Reference
## BaseCompletion Objects
```python
class BaseCompletion()
```
Base class for handling completions. This class provides shared logic for creating completions,\
both synchronously and asynchronously, and both streaming and non-streaming.
**Attributes**:
* `endpoint` *str* - API endpoint for the completion request.
* `response_class` *Type* - Class used for parsing the non-streaming response.
* `stream_response_class` *Type* - Class used for parsing the streaming response.
#### create
```python
@classmethod
def create(cls,
model,
prompt_or_messages=None,
request_timeout=600,
stream=False,
**kwargs)
```
Create a completion or chat completion.
**Arguments**:
* `model` *str* - Model name to use for the completion.
* `prompt_or_messages` *Union\[str, List\[ChatMessage]]* - The prompt for Completion or a list of chat messages for ChatCompletion. If not specified, must specify either `prompt` or `messages` in kwargs.
* `request_timeout` *int, optional* - Request timeout in seconds. Defaults to 600.
* `stream` *bool, optional* - Whether to use streaming or not. Defaults to False.
* `**kwargs` - Additional keyword arguments.
**Returns**:
`Union[CompletionResponse, Generator[CompletionStreamResponse, None, None]]`:\
Depending on the `stream` argument, either returns a CompletionResponse\
or a generator yielding CompletionStreamResponse.
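As a usage sketch (assuming the classes documented here are importable from `fireworks.client`, and using an illustrative model name):
```python
import fireworks.client
from fireworks.client import ChatCompletion

fireworks.client.api_key = ""  # your Fireworks API key

# Non-streaming chat completion; the model name is illustrative.
response = ChatCompletion.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Name three colors."}],
)
print(response.choices[0].message.content)
```
Passing `stream=True` instead returns a generator of streamed response chunks, as described in the Returns section above.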
#### acreate
```python
@classmethod
def acreate(cls, model, *args, request_timeout=600, stream=False, **kwargs)
```
Asynchronously create a completion.
**Arguments**:
* `model` *str* - Model name to use for the completion.
* `request_timeout` *int, optional* - Request timeout in seconds. Defaults to 600.
* `stream` *bool, optional* - Whether to use streaming or not. Defaults to False.
* `**kwargs` - Additional keyword arguments.
**Returns**:
`Union[CompletionResponse, AsyncGenerator[CompletionStreamResponse, None]]`:\
Depending on the `stream` argument, either returns a CompletionResponse or an async generator yielding CompletionStreamResponse.
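A corresponding async sketch, assuming the non-streaming form of `acreate` is awaitable and using the same illustrative model name:
```python
import asyncio

import fireworks.client
from fireworks.client import ChatCompletion

fireworks.client.api_key = ""  # your Fireworks API key

async def main():
    # Non-streaming async chat completion; the model name is illustrative.
    response = await ChatCompletion.acreate(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",
        messages=[{"role": "user", "content": "Write a haiku about the sea."}],
    )
    print(response.choices[0].message.content)

asyncio.run(main())
```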
# completion
## Completion Objects
```python
class Completion(BaseCompletion)
```
Class for handling text completions.
# chat\_completion
## ChatCompletion Objects
```python
class ChatCompletion(BaseCompletion)
```
Class for handling chat completions.
# api
## Choice Objects
```python
class Choice(BaseModel)
```
A completion choice.
**Attributes**:
* `index` *int* - The index of the completion choice.
* `text` *str* - The completion response.
* `logprobs` *float, optional* - The log probabilities of the most likely tokens.
* `finish_reason` *str* - The reason the model stopped generating tokens. This will be "stop" if the model hit a natural stop point or a provided stop sequence, or "length" if the maximum number of tokens specified in the request was reached.
## CompletionResponse Objects
```python
class CompletionResponse(BaseModel)
```
The response message from a /v1/completions call.
**Attributes**:
* `id` *str* - A unique identifier of the response.
* `object` *str* - The object type, which is always "text\_completion".
* `created` *int* - The Unix time in seconds when the response was generated.
* `choices` *List\[Choice]* - The list of generated completion choices.
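To make the field layout concrete, here is a small sketch that issues a text completion via the `Completion` class and reads the `CompletionResponse` and `Choice` fields listed above (the model name and `max_tokens` kwarg are illustrative):
```python
import fireworks.client
from fireworks.client import Completion

fireworks.client.api_key = ""  # your Fireworks API key

# Text completion; the model name and max_tokens are illustrative.
resp = Completion.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    prompt="Once upon a time",
    max_tokens=32,
)
print(resp.id, resp.object, resp.created)  # response metadata
choice = resp.choices[0]
print(choice.index, choice.finish_reason)  # choice metadata
print(choice.text)                         # generated text
```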
## CompletionResponseStreamChoice Objects
```python
class CompletionResponseStreamChoice(BaseModel)
```
A streamed completion choice.
**Attributes**:
* `index` *int* - The index of the completion choice.
* `text` *str* - The completion response.
* `logprobs` *float, optional* - The log probabilities of the most likely tokens.
* `finish_reason` *str* - The reason the model stopped generating tokens. This will be "stop" if the model hit a natural stop point or a provided stop sequence, or "length" if the maximum number of tokens specified in the request was reached.
## CompletionStreamResponse Objects
```python
class CompletionStreamResponse(BaseModel)
```
The streamed response message from a /v1/completions call.
**Attributes**:
* `id` *str* - A unique identifier of the response.
* `object` *str* - The object type, which is always "text\_completion".
* `created` *int* - The Unix time in seconds when the response was generated.
* `model` *str* - The model used for the completion.
* `choices` *List\[CompletionResponseStreamChoice]* - The list of streamed completion choices.
## Model Objects
```python
class Model(BaseModel)
```
A model deployed to the Fireworks platform.
**Attributes**:
* `id` *str* - The model name.
* `object` *str* - The object type, which is always "model".
* `created` *int* - The Unix time in seconds when the model was created.
## ListModelsResponse Objects
```python
class ListModelsResponse(BaseModel)
```
The response message from a /v1/models call.
**Attributes**:
* `object` *str* - The object type, which is always "list".
* `data` *List\[Model]* - The list of models.
## ChatMessage Objects
```python
class ChatMessage(BaseModel)
```
A chat completion message.
**Attributes**:
* `role` *str* - The role of the author of this message.
* `content` *str* - The contents of the message.
## ChatCompletionResponseChoice Objects
```python
class ChatCompletionResponseChoice(BaseModel)
```
A chat completion choice generated by a chat model.
**Attributes**:
* `index` *int* - The index of the chat completion choice.
* `message` *ChatMessage* - The chat completion message.
* `finish_reason` *Optional\[str]* - The reason the model stopped generating tokens. This will be "stop" if the model hit a natural stop point or a provided stop sequence, or "length" if the maximum number of tokens specified in the request was reached.
## UsageInfo Objects
```python
class UsageInfo(BaseModel)
```
Usage statistics.
**Attributes**:
* `prompt_tokens` *int* - The number of tokens in the prompt.
* `total_tokens` *int* - The total number of tokens used in the request (prompt + completion).
* `completion_tokens` *Optional\[int]* - The number of tokens in the generated completion.
## ChatCompletionResponse Objects
```python
class ChatCompletionResponse(BaseModel)
```
The response message from a /v1/chat/completions call.
**Attributes**:
* `id` *str* - A unique identifier of the response.
* `object` *str* - The object type, which is always "chat.completion".
* `created` *int* - The Unix time in seconds when the response was generated.
* `model` *str* - The model used for the chat completion.
* `choices` *List\[ChatCompletionResponseChoice]* - The list of chat completion choices.
* `usage` *UsageInfo* - Usage statistics for the chat completion.
## DeltaMessage Objects
```python
class DeltaMessage(BaseModel)
```
A message delta.
**Attributes**:
* `role` *str* - The role of the author of this message.
* `content` *str* - The contents of the chunk message.
## ChatCompletionResponseStreamChoice Objects
```python
class ChatCompletionResponseStreamChoice(BaseModel)
```
A streamed chat completion choice.
**Attributes**:
* `index` *int* - The index of the chat completion choice.
* `delta` *DeltaMessage* - The message delta.
* `finish_reason` *str* - The reason the model stopped generating tokens. This will be "stop" if the model hit a natural stop point or a provided stop sequence, or "length" if the maximum number of tokens specified in the request was reached.
## ChatCompletionStreamResponse Objects
```python
class ChatCompletionStreamResponse(BaseModel)
```
The streamed response message from a /v1/chat/completions call.
**Attributes**:
* `id` *str* - A unique identifier of the response.
* `object` *str* - The object type, which is always "chat.completion".
* `created` *int* - The Unix time in seconds when the response was generated.
* `model` *str* - The model used for the chat completion.
* `choices` *List\[ChatCompletionResponseStreamChoice]* - The list of streamed chat completion choices.
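A hedged sketch of consuming these streamed chunks with the client classes above (the model name is illustrative; `delta.content` may be empty on some chunks, hence the guard):
```python
import fireworks.client
from fireworks.client import ChatCompletion

fireworks.client.api_key = ""  # your Fireworks API key

# With stream=True, each item is a ChatCompletionStreamResponse.
for chunk in ChatCompletion.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # illustrative
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
):
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```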
# model
## Model Objects
```python
class Model()
```
#### list
```python
@classmethod
def list(cls, request_timeout=60)
```
Returns a list of available models.
**Arguments**:
* `request_timeout` *int, optional* - The request timeout in seconds. Default is 60.
**Returns**:
* `ListModelsResponse` - A list of available models.
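For example (assuming `Model` is importable from `fireworks.client` like the other classes in this reference):
```python
import fireworks.client
from fireworks.client import Model

fireworks.client.api_key = ""  # your Fireworks API key

# List the models visible to your account and print their IDs.
models = Model.list()
for model in models.data:
    print(model.id)
```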
# log
#### set\_console\_log\_level
```python
def set_console_log_level(level: str) -> None
```
Controls console logging.
**Arguments**:
* `level` - the minimum level that prints out to console.\
Supported values: \[CRITICAL, FATAL, ERROR, WARN,\
WARNING, INFO, DEBUG]
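For instance (the import path below is an assumption based on the module name above):
```python
# Import path assumed from the "log" module documented above.
from fireworks.client.log import set_console_log_level

# Only print WARNING and above to the console.
set_console_log_level("WARNING")
```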
# error
## PermissionError Objects
```python
class PermissionError(FireworksError)
```
A permission denied error.
## InvalidRequestError Objects
```python
class InvalidRequestError(FireworksError)
```
An invalid request error.
## AuthenticationError Objects
```python
class AuthenticationError(FireworksError)
```
An authentication error.
## RateLimitError Objects
```python
class RateLimitError(FireworksError)
```
A rate limit error.
## InternalServerError Objects
```python
class InternalServerError(FireworksError)
```
An internal server error.
## ServiceUnavailableError Objects
```python
class ServiceUnavailableError(FireworksError)
```
A service unavailable error.
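A hedged sketch of catching these exceptions around a request; the error-module import path and the model name are assumptions:
```python
import fireworks.client
from fireworks.client import ChatCompletion
# Import path assumed from the "error" module documented above.
from fireworks.client.error import (
    InvalidRequestError,
    RateLimitError,
    ServiceUnavailableError,
)

fireworks.client.api_key = ""  # your Fireworks API key

try:
    response = ChatCompletion.create(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # illustrative
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)
except RateLimitError:
    print("Rate limited; back off and retry later.")
except ServiceUnavailableError:
    print("Service unavailable; retry after a short wait.")
except InvalidRequestError as err:
    print(f"Invalid request: {err}")
```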
# Getting Started
You can install the client library with pip:
```bash pip
pip install --upgrade fireworks-ai
```
### Authentication
You can authenticate with Fireworks by setting the `fireworks.client.api_key` variable:
```python
import fireworks.client

fireworks.client.api_key = ""
```
Or by setting the `FIREWORKS_API_KEY` environment variable:
```
export FIREWORKS_API_KEY=
```
# Inference errors
This page lists common error codes encountered during inference requests using the Fireworks API, their meanings, and potential resolutions.
## Error codes
Below is a table of common status codes and their associated messages for inference-related API requests.
| **Error Code** | **Error Name** | **Possible Issue(s)** | **How to Resolve** |
| -------------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `400` | `Bad Request` | Invalid input or malformed request. | Review the request parameters and ensure they match the expected format. |
| `401` | `Unauthorized` | Invalid API key or insufficient permissions. | Verify your API key and ensure it has the correct permissions. |
| `402` | `Payment Required` | User's account is not on a paid plan or has exceeded usage limits. | Check your billing status and ensure your payment method is up to date. Upgrade your plan if necessary. |
| `403` | `Forbidden` | The model name may be incorrect, or the model does not exist. This error is also returned to avoid leaking information about model availability. | Verify the model name on the Fireworks site and ensure it exists. Double-check the spelling of the model name in your request. |
| `404` | `Not Found` | The API endpoint is incorrect, or the resource path is invalid (e.g., a user tried accessing `/v1/foobar` instead of a valid endpoint). | Verify the URL path in your request and ensure you are using the correct API endpoint as per the documentation. |
| `405` | `Method Not Allowed` | Using an unsupported HTTP method (e.g., using GET instead of POST). | Check the API documentation for the correct HTTP method to use for the request. |
| `408` | `Request Timeout` | The request took too long to complete, possibly due to server overload or network issues. | Retry the request after a brief wait. Consider increasing the timeout value if applicable. |
| `412` | `Precondition Failed` | This error occurs when attempting to invoke a LoRA model that failed to load. The final validation of the model happens during inference, not at upload time. | Check the body of the request for a detailed error message. Ensure the LoRA model was uploaded correctly and is compatible. Contact support if the issue persists. |
| `413` | `Payload Too Large` | Input data exceeds the allowed size limit. | Reduce the size of the input payload (e.g., by trimming large text or image data). |
| `429` | `Over Quota` | The user has reached the API rate limit. | Wait for the quota to reset or upgrade your plan for a higher rate limit. |
| `500` | `Internal Server Error` | This indicates a server-side code bug and is unlikely to resolve on its own. | Contact Fireworks support immediately, as this error typically requires intervention from the engineering team. |
| `502` | `Bad Gateway` | The server received an invalid response from an upstream server. | Wait and retry the request. If the error persists, it may indicate a server outage. |
| `503` | `Service Unavailable` | The service is down for maintenance or experiencing issues. | Retry the request after some time. Check for any maintenance announcements. |
| `504` | `Gateway Timeout` | The server did not receive a response in time from an upstream server. | Wait briefly and retry the request. Consider using a shorter input prompt if applicable. |
| `520` | `Unknown Error` | An unexpected error occurred with no clear explanation. | Retry the request. If the issue persists, contact support for further assistance. |
## Troubleshooting tips
If you encounter an error not listed here, try the following:
* Review the API documentation for the correct usage of endpoints and parameters.
* Check the [Fireworks status page](https://status.fireworks.ai) for any ongoing service disruptions.
* Contact support at [support@fireworks.ai](mailto:support@fireworks.ai) for further assistance.
These steps will usually surface additional detail about the issue you're encountering.
## Need more help?
If you continue to experience issues, please reach out on our [Discord channel](https://discord.gg/fireworks-ai).