# Custom SSO
Source: https://docs.fireworks.ai/accounts/sso
Set up custom Single Sign-On (SSO) authentication for Fireworks AI
Fireworks uses single sign-on (SSO) as the primary mechanism to authenticate with the platform.
By default, Fireworks supports Google SSO.
If you have an enterprise account, Fireworks supports bringing your own identity provider using:
* OpenID Connect (OIDC) provider
* SAML 2.0 provider
Coordinate with your Fireworks AI representative to enable the integration.
## OpenID Connect (OIDC) provider
Create an OIDC client application in your identity provider, e.g. Okta.
Ensure the client is configured for the "authorization code" grant of the "web" type (i.e. with a `client_secret`).
Set the client's "allowed redirect URL" to the URL provided by Fireworks. It looks like:
```
https://fireworks-.auth.us-west-2.amazoncognito.com/oauth2/idpresponse
```
Note down the `issuer`, `client_id`, and `client_secret` for the newly created client. You will need to provide these to your Fireworks AI representative to complete your account setup.
## SAML 2.0 provider
Create a SAML 2.0 application in your identity provider, e.g. [Okta](https://help.okta.com/en-us/Content/Topics/Apps/Apps_App_Integration_Wizard_SAML.htm).
Set the SSO URL to the URL provided by Fireworks. It looks like:
```
https://fireworks-.auth.us-west-2.amazoncognito.com/saml2/idpresponse
```
Configure the Audience URI (SP Entity ID) as provided by Fireworks. It looks like:
```
urn:amazon:cognito:sp:
```
Create an Attribute Statement with the name:
```
http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress
```
and the value `user.email`
Leave the rest of the settings as defaults.
Note down the "metadata url" for your newly created application. You will need to provide this to your Fireworks AI representative to complete your account set up.
## Troubleshooting
### Invalid samlResponse or relayState from identity provider
This error occurs if you are trying to use identity provider (IdP) initiated login. Fireworks currently only supports
service provider (SP) initiated login.
See [Understanding SAML](https://developer.okta.com/docs/concepts/saml/#understand-sp-initiated-sign-in-flow) for an
in-depth explanation.
### Required String parameter 'RelayState' is not present
See above.
# Managing users
Source: https://docs.fireworks.ai/accounts/users
Add and delete additional users in your Fireworks account
See the concepts [page](/getting-started/concepts#account) for definitions of accounts and users. Only admin users can manage other users within the account.
## Adding users
To add a new user to your Fireworks account, run the following command. If the email for the new user is already associated with a Fireworks account, they will have the option to freely switch between your account and their existing account(s). You can also add users in the Fireworks web UI at [https://fireworks.ai/account/users](https://fireworks.ai/account/users).
```bash
firectl create user --email="alice@example.com"
```
To create another admin user, pass the `--role=admin` flag:
```bash
firectl create user --email="alice@example.com" --role=admin
```
## Updating a user's role
To update a user's role, run:
```bash
firectl update user --role="{admin,user}"
```
## Deleting users
You can remove a user from your account by running:
```bash
firectl delete user
```
# Batch Delete Batch Jobs
Source: https://docs.fireworks.ai/api-reference-dlde/batch-delete-batch-jobs
post /v1/accounts/{account_id}/batchJobs:batchDelete
# Batch Delete Environments
Source: https://docs.fireworks.ai/api-reference-dlde/batch-delete-environments
post /v1/accounts/{account_id}/environments:batchDelete
# Batch Delete Node Pools
Source: https://docs.fireworks.ai/api-reference-dlde/batch-delete-node-pools
post /v1/accounts/{account_id}/nodePools:batchDelete
# Cancel Batch Job
Source: https://docs.fireworks.ai/api-reference-dlde/cancel-batch-job
post /v1/accounts/{account_id}/batchJobs/{batch_job_id}:cancel
Cancels an existing batch job if it is queued, pending, or running.
# Connect Environment
Source: https://docs.fireworks.ai/api-reference-dlde/connect-environment
post /v1/accounts/{account_id}/environments/{environment_id}:connect
Connects the environment to a node pool.
Returns an error if there is an existing pending connection.
# Create AWS IAM Role Binding
Source: https://docs.fireworks.ai/api-reference-dlde/create-aws-iam-role-binding
post /v1/accounts/{account_id}/awsIamRoleBindings
# Create Batch Job
Source: https://docs.fireworks.ai/api-reference-dlde/create-batch-job
post /v1/accounts/{account_id}/batchJobs
# Create Cluster
Source: https://docs.fireworks.ai/api-reference-dlde/create-cluster
post /v1/accounts/{account_id}/clusters
# Create Environment
Source: https://docs.fireworks.ai/api-reference-dlde/create-environment
post /v1/accounts/{account_id}/environments
# Create Node Pool
Source: https://docs.fireworks.ai/api-reference-dlde/create-node-pool
post /v1/accounts/{account_id}/nodePools
# Create Node Pool Binding
Source: https://docs.fireworks.ai/api-reference-dlde/create-node-pool-binding
post /v1/accounts/{account_id}/nodePoolBindings
# Create Snapshot
Source: https://docs.fireworks.ai/api-reference-dlde/create-snapshot
post /v1/accounts/{account_id}/snapshots
# Delete AWS IAM Role Binding
Source: https://docs.fireworks.ai/api-reference-dlde/delete-aws-iam-role-binding
post /v1/accounts/{account_id}/awsIamRoleBindings:delete
# Delete Batch Job
Source: https://docs.fireworks.ai/api-reference-dlde/delete-batch-job
delete /v1/accounts/{account_id}/batchJobs/{batch_job_id}
# Delete Cluster
Source: https://docs.fireworks.ai/api-reference-dlde/delete-cluster
delete /v1/accounts/{account_id}/clusters/{cluster_id}
# Delete Environment
Source: https://docs.fireworks.ai/api-reference-dlde/delete-environment
delete /v1/accounts/{account_id}/environments/{environment_id}
# Delete Node Pool
Source: https://docs.fireworks.ai/api-reference-dlde/delete-node-pool
delete /v1/accounts/{account_id}/nodePools/{node_pool_id}
# Delete Node Pool Binding
Source: https://docs.fireworks.ai/api-reference-dlde/delete-node-pool-binding
post /v1/accounts/{account_id}/nodePoolBindings:delete
# Delete Snapshot
Source: https://docs.fireworks.ai/api-reference-dlde/delete-snapshot
delete /v1/accounts/{account_id}/snapshots/{snapshot_id}
# Disconnect Environment
Source: https://docs.fireworks.ai/api-reference-dlde/disconnect-environment
post /v1/accounts/{account_id}/environments/{environment_id}:disconnect
Disconnects the environment from the node pool. Returns an error
if the environment is not connected to a node pool.
# Get Batch Job
Source: https://docs.fireworks.ai/api-reference-dlde/get-batch-job
get /v1/accounts/{account_id}/batchJobs/{batch_job_id}
# Get Batch Job Logs
Source: https://docs.fireworks.ai/api-reference-dlde/get-batch-job-logs
get /v1/accounts/{account_id}/batchJobs/{batch_job_id}:getLogs
# Get Cluster
Source: https://docs.fireworks.ai/api-reference-dlde/get-cluster
get /v1/accounts/{account_id}/clusters/{cluster_id}
# Get Cluster Connection Info
Source: https://docs.fireworks.ai/api-reference-dlde/get-cluster-connection-info
get /v1/accounts/{account_id}/clusters/{cluster_id}:getConnectionInfo
Retrieves connection settings for the cluster to add to a kubeconfig file.
# Get Environment
Source: https://docs.fireworks.ai/api-reference-dlde/get-environment
get /v1/accounts/{account_id}/environments/{environment_id}
# Get Node Pool
Source: https://docs.fireworks.ai/api-reference-dlde/get-node-pool
get /v1/accounts/{account_id}/nodePools/{node_pool_id}
# Get Node Pool Stats
Source: https://docs.fireworks.ai/api-reference-dlde/get-node-pool-stats
get /v1/accounts/{account_id}/nodePools/{node_pool_id}:getStats
# Get Snapshot
Source: https://docs.fireworks.ai/api-reference-dlde/get-snapshot
get /v1/accounts/{account_id}/snapshots/{snapshot_id}
# List AWS IAM Role Bindings
Source: https://docs.fireworks.ai/api-reference-dlde/list-aws-iam-role-bindings
get /v1/accounts/{account_id}/awsIamRoleBindings
# List Batch Jobs
Source: https://docs.fireworks.ai/api-reference-dlde/list-batch-jobs
get /v1/accounts/{account_id}/batchJobs
# List Clusters
Source: https://docs.fireworks.ai/api-reference-dlde/list-clusters
get /v1/accounts/{account_id}/clusters
# List Environments
Source: https://docs.fireworks.ai/api-reference-dlde/list-environments
get /v1/accounts/{account_id}/environments
# List Node Pool Bindings
Source: https://docs.fireworks.ai/api-reference-dlde/list-node-pool-bindings
get /v1/accounts/{account_id}/nodePoolBindings
# List Node Pools
Source: https://docs.fireworks.ai/api-reference-dlde/list-node-pools
get /v1/accounts/{account_id}/nodePools
# List Snapshots
Source: https://docs.fireworks.ai/api-reference-dlde/list-snapshots
get /v1/accounts/{account_id}/snapshots
# Update Batch Job
Source: https://docs.fireworks.ai/api-reference-dlde/update-batch-job
patch /v1/accounts/{account_id}/batchJobs/{batch_job_id}
# Update Cluster
Source: https://docs.fireworks.ai/api-reference-dlde/update-cluster
patch /v1/accounts/{account_id}/clusters/{cluster_id}
# Update Environment
Source: https://docs.fireworks.ai/api-reference-dlde/update-environment
patch /v1/accounts/{account_id}/environments/{environment_id}
# Update Node Pool
Source: https://docs.fireworks.ai/api-reference-dlde/update-node-pool
patch /v1/accounts/{account_id}/nodePools/{node_pool_id}
# Streaming Transcription
Source: https://docs.fireworks.ai/api-reference/audio-streaming-transcriptions
websocket /audio/transcriptions/streaming
Streaming transcription is performed over a WebSocket. Provide the transcription parameters and establish a WebSocket connection to the endpoint.
Stream short audio chunks (50-400ms) in binary frames of PCM 16-bit little-endian at 16kHz sample rate and single channel (mono). In parallel, receive transcription from the WebSocket.
### URL
Please use the following serverless endpoint:
```
wss://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming
```
### Headers
Your Fireworks API key, e.g. `Authorization=API_KEY`. Alternatively, the API key can be provided as a query parameter.
### Query Parameters
* **Response format**: The format in which to return the response. Currently only `verbose_json` is recommended for streaming.
* **Language**: The target language for transcription. The set of supported target languages can be found [here](https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/tokenizer.py#L10-L128).
* **Prompt**: The input prompt that the model will use when generating the transcription. It can be used to specify custom words or the style of the transcription. E.g. `Um, here's, uh, what was recorded.` will make the model include filler words in the transcription.
* **Temperature**: Sampling temperature to use when decoding text tokens during transcription.
### Streaming Audio
Stream short audio chunks (50-400ms) in binary frames of PCM 16-bit little-endian at 16kHz sample rate and single channel (mono). Typically, you will:
1. Resample your audio to 16 kHz if it is not already.
2. Convert it to mono.
3. Send 50ms chunks (16,000 Hz \* 0.05s = 800 samples) of audio in 16-bit PCM (signed, little-endian) format.
### Handling Responses
The client maintains a state dictionary, starting with an empty dictionary `{}`. When the server sends the first transcription message, it contains a list of segments. Each segment has an `id` and `text`:
```python
# Server initial message:
{
"segments": [
{"id": "0", "text": "This is the first sentence"},
{"id": "1", "text": "This is the second sentence"}
]
}
# Client initial state:
{
"0": "This is the first sentence",
"1": "This is the second sentence",
}
```
When the server sends subsequent updates to the transcription, the client updates the state dictionary based on the segment `id`:
```python
# Server continuous message:
{
"segments": [
{"id": "1", "text": "This is the second sentence modified"},
{"id": "2", "text": "This is the third sentence"}
]
}
# Client updated state:
{
"0": "This is the first sentence",
"1": "This is the second sentence modified", # overwritten
"2": "This is the third sentence", # new
}
```
### Example Usage
Check out the brief Python example below, or the full example sources:
* [Python notebook](https://colab.research.google.com/github/fw-ai/cookbook/blob/main/learn/audio/audio_streaming_speech_to_text/audio_streaming_speech_to_text.ipynb)
* [Python sources](https://github.com/fw-ai/cookbook/tree/main/learn/audio/audio_streaming_speech_to_text/python)
* [Node.js sources](https://github.com/fw-ai/cookbook/tree/main/learn/audio/audio_streaming_speech_to_text/nodejs)
```python
!pip3 install requests torch torchaudio websocket-client

import io
import json
import threading
import time
import urllib.parse

import requests
import torch
import torchaudio
import websocket

# Prepare audio: download a sample file and convert it to 16 kHz mono 16-bit PCM.
# (The original snippet assumes `audio_chunk_bytes` and `chunk_size_ms` already exist;
# this preparation is a minimal sketch of one way to build them.)
audio_url = "https://tinyurl.com/4cb74vas"
waveform, sample_rate = torchaudio.load(io.BytesIO(requests.get(audio_url).content))
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
waveform = waveform.mean(dim=0)  # downmix to mono
pcm16 = (waveform.clamp(-1.0, 1.0) * 32767).to(torch.int16).numpy().tobytes()

# Split into 50 ms chunks: 16,000 Hz * 0.05 s = 800 samples = 1,600 bytes.
chunk_size_ms = 50
chunk_size_bytes = int(16000 * chunk_size_ms / 1000) * 2
audio_chunk_bytes = [
    pcm16[i : i + chunk_size_bytes] for i in range(0, len(pcm16), chunk_size_bytes)
]

lock = threading.Lock()
state = {}

def on_open(ws):
    def send_audio_chunks():
        for chunk in audio_chunk_bytes:
            ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
            time.sleep(chunk_size_ms / 1000)
        # Tell the server the stream is finished so it can flush the final result.
        final_checkpoint = json.dumps({"checkpoint_id": "final"})
        ws.send(final_checkpoint, opcode=websocket.ABNF.OPCODE_TEXT)

    threading.Thread(target=send_audio_chunks).start()

def on_message(ws, message):
    message = json.loads(message)
    if message.get("checkpoint_id") == "final":
        ws.close()
        return
    # Merge updated segments into the local state, keyed by segment id.
    update = {s["id"]: s["text"] for s in message["segments"]}
    with lock:
        state.update(update)
        print("\n".join(f" - {k}: {v}" for k, v in state.items()))

def on_error(ws, error):
    print(f"WebSocket error: {error}")

# Open a connection URL with query params
url = "wss://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming"
params = urllib.parse.urlencode({
    "language": "en",
})
ws = websocket.WebSocketApp(
    f"{url}?{params}",
    header={"Authorization": ""},  # your Fireworks API key
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
)
ws.run_forever()
```
### Dedicated endpoint
For fixed throughput and predictable SLAs, you may request a dedicated endpoint for streaming transcription at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) or on [Discord](https://discord.gg/fireworks-ai).
# Transcribe audio
Source: https://docs.fireworks.ai/api-reference/audio-transcriptions
post /audio/transcriptions
Send an audio sample to get a transcription.
### Request
##### (multi-part form)
The input audio file to transcribe, or a URL to a public audio file.
The maximum audio file size is 1 GB; there is no limit on audio duration. Common file formats such as mp3, flac, and wav are supported. Note that the audio will be resampled to 16kHz, downmixed to mono, and reformatted to 16-bit signed little-endian format before transcription. Pre-converting the file before sending it to the API can improve runtime performance.
String name of the ASR model to use. Can be one of `whisper-v3` or `whisper-v3-turbo`. Please use the following serverless endpoints:
* [https://audio-prod.us-virginia-1.direct.fireworks.ai](https://audio-prod.us-virginia-1.direct.fireworks.ai) (for `whisper-v3`);
* [https://audio-turbo.us-virginia-1.direct.fireworks.ai](https://audio-turbo.us-virginia-1.direct.fireworks.ai) (for `whisper-v3-turbo`);
String name of the voice activity detection (VAD) model to use. Can be one of `silero`, or `whisperx-pyannet`.
String name of the alignment model to use. Currently supported:
* `mms_fa` optimal accuracy for multilingual speech.
* `tdnn_ffn` optimal accuracy for English-only speech.
* `gentle` best accuracy for English-only speech (requires a dedicated endpoint, contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)).
The target language for transcription. The set of supported target languages can be found [here](https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/tokenizer.py#L10-L128).
The input prompt that the model will use when generating the transcription. It can be used to specify custom words or the style of the transcription. E.g. `Um, here's, uh, what was recorded.` will make the model include filler words in the transcription.
Sampling temperature to use when decoding text tokens during transcription. Alternatively, fallback decoding can be enabled by passing a list of temperatures like `0.0,0.2,0.4,0.6,0.8,1.0`. This can help to improve performance.
The format in which to return the response. Can be one of `json`, `text`, `srt`, `verbose_json`, or `vtt`.
The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of the options are supported: `word` or `segment`. If not present, defaults to `segment`.
Whether to get speaker diarization for the transcription. Can be one of `true` or `false`. If not present, defaults to `false`.
Enabling diarization also requires other fields to hold specific values (see the example request after the parameter list):
1. `response_format` must be set to `verbose_json`.
2. `timestamp_granularities` must include `word`.
Note: Diarization is in a public preview phase and currently does not incur additional charges.
The minimum number of speakers to detect for diarization. `diarize` must be set to `true` to use `min_speakers`. If not present, defaults to `1`.
The maximum number of speakers to detect for diarization. `diarize` must be set to `true` to use `max_speakers`. If not present, defaults to `inf`.
Audio preprocessing mode. Currently supported:
* `none` to skip audio preprocessing.
* `dynamic` for arbitrary audio content with variable loudness.
* `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs.
* `bass_dynamic` for boosting lower frequencies.
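For example, a diarization request combining the parameters above (`response_format`, `timestamp_granularities`, `diarize`, `min_speakers`, `max_speakers`) might look like the sketch below. It uses the generic `requests` library; the exact multipart field encoding (in particular for `timestamp_granularities`) is an assumption, so adapt it to your client as needed.
```python
import requests

# Hypothetical diarization request: verbose_json output with word-level
# timestamps is required for speaker labels (see the parameters above).
with open("audio.flac", "rb") as f:
    response = requests.post(
        "https://audio-prod.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions",
        headers={"Authorization": ""},  # your Fireworks API key
        files={"file": f},
        data={
            "model": "whisper-v3",
            "response_format": "verbose_json",
            "timestamp_granularities": "word,segment",  # assumption: comma-separated list
            "diarize": "true",
            "min_speakers": 1,
            "max_speakers": 4,
        },
    )
print(response.json())
```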
### Response
The task which was performed. Either `transcribe` or `translate`.
The language of the transcribed/translated text.
The duration of the transcribed/translated audio, in seconds.
The transcribed/translated text.
Extracted words and their corresponding timestamps.
The text content of the word.
Start time of the word in seconds.
End time of the word in seconds.
Speaker label for the word.
Segments of the transcribed/translated text and their corresponding details.
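For reference, a `verbose_json` response has roughly the following shape. This is an illustrative sketch based on the field descriptions above; the keys inside the `words` and `segments` entries (in particular the speaker label key) are assumptions and may differ from the actual schema.
```python
# Illustrative verbose_json response shape (not an exact schema).
{
    "task": "transcribe",
    "language": "en",
    "duration": 4.2,
    "text": "This is the transcribed text.",
    "words": [
        # Per-word timestamps; the speaker label key is an assumption.
        {"word": "This", "start": 0.0, "end": 0.25, "speaker_id": "0"},
    ],
    "segments": [
        {"id": 0, "start": 0.0, "end": 4.2, "text": "This is the transcribed text."},
    ],
}
```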
```curl curl
# Download audio file
curl -L -o "audio.flac" "https://tinyurl.com/4997djsh"
# Make request
curl -X POST "https://audio-prod.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions" \
-H "Authorization: " \
-F "file=@audio.flac"
```
```python python
!pip install fireworks-ai requests

import time
import requests
from fireworks.client.audio import AudioInference

# Prepare client
audio = requests.get("https://tinyurl.com/4cb74vas").content
client = AudioInference(
    model="whisper-v3",
    base_url="https://audio-prod.us-virginia-1.direct.fireworks.ai",
    #
    # Or for the turbo version
    # model="whisper-v3-turbo",
    # base_url="https://audio-turbo.us-virginia-1.direct.fireworks.ai",
    api_key="",
)

# Make request (top-level await works in a notebook; otherwise wrap in asyncio.run)
start = time.time()
r = await client.transcribe_async(audio=audio)
print(f"Took: {(time.time() - start):.3f}s. Text: '{r.text}'")
```
# Translate audio
Source: https://docs.fireworks.ai/api-reference/audio-translations
post /audio/translations
### Request
##### (multi-part form)
The input audio file to translate, or a URL to a public audio file.
The maximum audio file size is 1 GB; there is no limit on audio duration. Common file formats such as mp3, flac, and wav are supported. Note that the audio will be resampled to 16kHz, downmixed to mono, and reformatted to 16-bit signed little-endian format before transcription. Pre-converting the file before sending it to the API can improve runtime performance.
String name of the ASR model to use. Can be one of `whisper-v3` or `whisper-v3-turbo`. Please use the following serverless endpoints:
* [https://audio-prod.us-virginia-1.direct.fireworks.ai](https://audio-prod.us-virginia-1.direct.fireworks.ai) (for `whisper-v3`);
* [https://audio-turbo.us-virginia-1.direct.fireworks.ai](https://audio-turbo.us-virginia-1.direct.fireworks.ai) (for `whisper-v3-turbo`);
String name of the voice activity detection (VAD) model to use. Can be one of `silero`, or `whisperx-pyannet`.
String name of the alignment model to use. Currently supported:
* `mms_fa` optimal accuracy for multilingual speech.
* `tdnn_ffn` optimal accuracy for English-only speech.
* `gentle` best accuracy for English-only speech (requires a dedicated endpoint, contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)).
The target language for transcription. The set of supported target languages can be found [here](https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/tokenizer.py#L10-L128).
The input prompt that the model will use when generating the transcription. It can be used to specify custom words or the style of the transcription. E.g. `Um, here's, uh, what was recorded.` will make the model include filler words in the transcription.
Sampling temperature to use when decoding text tokens during transcription. Alternatively, fallback decoding can be enabled by passing a list of temperatures like `0.0,0.2,0.4,0.6,0.8,1.0`. This can help to improve performance.
The format in which to return the response. Can be one of `json`, `text`, `srt`, `verbose_json`, or `vtt`.
The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of the options are supported: `word` or `segment`. If not present, defaults to `segment`.
Audio preprocessing mode. Currently supported:
* `none` to skip audio preprocessing.
* `dynamic` for arbitrary audio content with variable loudness.
* `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs.
* `bass_dynamic` for boosting lower frequencies.
### Response
The task which was performed. Either `transcribe` or `translate`.
The language of the transcribed/translated text.
The duration of the transcribed/translated audio, in seconds.
The transcribed/translated text.
Extracted words and their corresponding timestamps.
The text content of the word.
Start time of the word in seconds.
End time of the word in seconds.
Segments of the transcribed/translated text and their corresponding details.
```curl curl
# Download audio file
curl -L -o "audio.flac" "https://tinyurl.com/4997djsh"
# Make request
curl -X POST "https://audio-prod.us-virginia-1.direct.fireworks.ai/v1/audio/translations" \
-H "Authorization: " \
-F "file=@audio.flac"
```
```python python
!pip install fireworks-ai requests

import time
import requests
from fireworks.client.audio import AudioInference

# Prepare client
audio = requests.get("https://tinyurl.com/4cb74vas").content
client = AudioInference(
    model="whisper-v3",
    base_url="https://audio-prod.us-virginia-1.direct.fireworks.ai",
    #
    # Or for the turbo version
    # model="whisper-v3-turbo",
    # base_url="https://audio-turbo.us-virginia-1.direct.fireworks.ai",
    api_key="",
)

# Make request (top-level await works in a notebook; otherwise wrap in asyncio.run)
start = time.time()
r = await client.translate_async(audio=audio)
print(f"Took: {(time.time() - start):.3f}s. Text: '{r.text}'")
```
# Create API Key
Source: https://docs.fireworks.ai/api-reference/create-api-key
post /v1/accounts/{account_id}/apiKeys
# Create Dataset
Source: https://docs.fireworks.ai/api-reference/create-dataset
post /v1/accounts/{account_id}/datasets
# Load LoRA
Source: https://docs.fireworks.ai/api-reference/create-deployed-model
post /v1/accounts/{account_id}/deployedModels
# Create Deployment
Source: https://docs.fireworks.ai/api-reference/create-deployment
post /v1/accounts/{account_id}/deployments
# Create Model
Source: https://docs.fireworks.ai/api-reference/create-model
post /v1/accounts/{account_id}/models
# Create Supervised Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/create-supervised-fine-tuning-job
post /v1/accounts/{account_id}/supervisedFineTuningJobs
# Create User
Source: https://docs.fireworks.ai/api-reference/create-user
post /v1/accounts/{account_id}/users
# Create embeddings
Source: https://docs.fireworks.ai/api-reference/creates-an-embedding-vector-representing-the-input-text
post /embeddings
# Delete API Key
Source: https://docs.fireworks.ai/api-reference/delete-api-key
post /v1/accounts/{account_id}/apiKeys:delete
# Delete Dataset
Source: https://docs.fireworks.ai/api-reference/delete-dataset
delete /v1/accounts/{account_id}/datasets/{dataset_id}
# Unload LoRA
Source: https://docs.fireworks.ai/api-reference/delete-deployed-model
delete /v1/accounts/{account_id}/deployedModels/{deployed_model_id}
# Delete Deployment
Source: https://docs.fireworks.ai/api-reference/delete-deployment
delete /v1/accounts/{account_id}/deployments/{deployment_id}
# Delete Model
Source: https://docs.fireworks.ai/api-reference/delete-model
delete /v1/accounts/{account_id}/models/{model_id}
# Delete Supervised Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/delete-supervised-fine-tuning-job
delete /v1/accounts/{account_id}/supervisedFineTuningJobs/{supervised_fine_tuning_job_id}
# Generate an image
Source: https://docs.fireworks.ai/api-reference/generate-a-new-image-from-a-text-prompt
The official API reference for image generation workloads can be found on the corresponding model pages by clicking "View code". We support generating images from text prompts, other images, and/or ControlNet:
[https://fireworks.ai/models/fireworks/stable-diffusion-xl-1024-v1-0](https://fireworks.ai/models/fireworks/stable-diffusion-xl-1024-v1-0)
[https://fireworks.ai/models/fireworks/SSD-1B](https://fireworks.ai/models/fireworks/SSD-1B)
[https://fireworks.ai/models/fireworks/playground-v2-1024px-aesthetic](https://fireworks.ai/models/fireworks/playground-v2-1024px-aesthetic)
[https://fireworks.ai/models/fireworks/japanese-stable-diffusion-xl](https://fireworks.ai/models/fireworks/japanese-stable-diffusion-xl)
# Get Account
Source: https://docs.fireworks.ai/api-reference/get-account
get /v1/accounts/{account_id}
# Get Dataset
Source: https://docs.fireworks.ai/api-reference/get-dataset
get /v1/accounts/{account_id}/datasets/{dataset_id}
# Get Dataset Upload Endpoint
Source: https://docs.fireworks.ai/api-reference/get-dataset-upload-endpoint
post /v1/accounts/{account_id}/datasets/{dataset_id}:getUploadEndpoint
# Get LoRA
Source: https://docs.fireworks.ai/api-reference/get-deployed-model
get /v1/accounts/{account_id}/deployedModels/{deployed_model_id}
# Get Deployment
Source: https://docs.fireworks.ai/api-reference/get-deployment
get /v1/accounts/{account_id}/deployments/{deployment_id}
# Get Model
Source: https://docs.fireworks.ai/api-reference/get-model
get /v1/accounts/{account_id}/models/{model_id}
# Get Model Download Endpoint
Source: https://docs.fireworks.ai/api-reference/get-model-download-endpoint
get /v1/accounts/{account_id}/models/{model_id}:getDownloadEndpoint
# Get Model Upload Endpoint
Source: https://docs.fireworks.ai/api-reference/get-model-upload-endpoint
post /v1/accounts/{account_id}/models/{model_id}:getUploadEndpoint
# Get Supervised Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/get-supervised-fine-tuning-job
get /v1/accounts/{account_id}/supervisedFineTuningJobs/{supervised_fine_tuning_job_id}
# Get User
Source: https://docs.fireworks.ai/api-reference/get-user
get /v1/accounts/{account_id}/users/{user_id}
# Introduction
Source: https://docs.fireworks.ai/api-reference/introduction
The Fireworks AI REST API enables you to interact with various language, image, and embedding models using your API key.
## Authentication
All requests made to the Fireworks AI REST API must include an `Authorization` header.
The header should specify a valid `Bearer` token with your API key, and the request body must be encoded as JSON with the `Content-Type: application/json` header.
This ensures that your requests are properly authenticated and formatted for interaction with the Fireworks AI API.
A sample header to include in a REST API request looks like this:
```json
authorization: Bearer
```
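For example, a minimal authenticated call to the chat completions endpoint using Python's `requests` library might look like the sketch below (the model name is only an example; use any model available to your account):
```python
import requests

FIREWORKS_API_KEY = ""  # your Fireworks API key

url = "https://api.fireworks.ai/inference/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {FIREWORKS_API_KEY}",
    "Content-Type": "application/json",
}
payload = {
    # Example model; use any model available to your account.
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "messages": [{"role": "user", "content": "Say hello!"}],
}
response = requests.post(url, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```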
# List API Keys
Source: https://docs.fireworks.ai/api-reference/list-api-keys
get /v1/accounts/{account_id}/apiKeys
# List Datasets
Source: https://docs.fireworks.ai/api-reference/list-datasets
get /v1/accounts/{account_id}/datasets
# List LoRAs
Source: https://docs.fireworks.ai/api-reference/list-deployed-models
get /v1/accounts/{account_id}/deployedModels
# List Deployments
Source: https://docs.fireworks.ai/api-reference/list-deployments
get /v1/accounts/{account_id}/deployments
# List Models
Source: https://docs.fireworks.ai/api-reference/list-models
get /v1/accounts/{account_id}/models
# List Supervised Fine-tuning Jobs
Source: https://docs.fireworks.ai/api-reference/list-supervised-fine-tuning-jobs
get /v1/accounts/{account_id}/supervisedFineTuningJobs
# List Users
Source: https://docs.fireworks.ai/api-reference/list-users
get /v1/accounts/{account_id}/users
# Create Chat Completion
Source: https://docs.fireworks.ai/api-reference/post-chatcompletions
post /chat/completions
Creates a model response for the given chat conversation.
# Create Completion
Source: https://docs.fireworks.ai/api-reference/post-completions
post /completions
Creates a completion for the provided prompt and parameters.
# Undelete Deployment
Source: https://docs.fireworks.ai/api-reference/undelete-deployment
post /v1/accounts/{account_id}/deployments/{deployment_id}:undelete
# Update Dataset
Source: https://docs.fireworks.ai/api-reference/update-dataset
patch /v1/accounts/{account_id}/datasets/{dataset_id}
# Update LoRA
Source: https://docs.fireworks.ai/api-reference/update-deployed-model
patch /v1/accounts/{account_id}/deployedModels/{deployed_model_id}
# Update Deployment
Source: https://docs.fireworks.ai/api-reference/update-deployment
patch /v1/accounts/{account_id}/deployments/{deployment_id}
# Update Model
Source: https://docs.fireworks.ai/api-reference/update-model
patch /v1/accounts/{account_id}/models/{model_id}
# Update User
Source: https://docs.fireworks.ai/api-reference/update-user
patch /v1/accounts/{account_id}/users/{user_id}
# Upload Dataset Files
Source: https://docs.fireworks.ai/api-reference/upload-dataset-files
post /v1/accounts/{account_id}/datasets/{dataset_id}:upload
Provides a streamlined way to upload a dataset file in a single API request. This path can handle file sizes up to 150 MB. For larger files, use [Get Dataset Upload Endpoint](get-dataset-upload-endpoint).
# Validate Dataset Upload
Source: https://docs.fireworks.ai/api-reference/validate-dataset-upload
post /v1/accounts/{account_id}/datasets/{dataset_id}:validateUpload
# Validate Model Upload
Source: https://docs.fireworks.ai/api-reference/validate-model-upload
get /v1/accounts/{account_id}/models/{model_id}:validateUpload
# Start here
Source: https://docs.fireworks.ai/cookbook/cookbook_landing
The **Fireworks Cookbook** is your hands-on guide to building, deploying, and fine-tuning generative AI and agentic workflows. It offers curated examples, Jupyter Notebooks, apps, and resources tailored to various use cases and skill levels, making it a go-to resource for practical Fireworks implementations.
In this cookbook, you’ll find:
* **Production-ready projects**: Scalable, proven solutions with ongoing support from the Fireworks engineering team.
* **Learning-focused tutorials**: Step-by-step guides for hands-on exploration, ideal for interactive learning of AI techniques.
* **Community-driven showcases**: Creative user-contributed projects that showcase innovative applications of Fireworks in diverse contexts.
***
## Repository structure
To help you easily navigate and find the right resources, the Cookbook organizes examples by purpose:
**Hands-on projects for learning AI** techniques, maintained by the DevRel team.
**Explore user-contributed projects** that push creative boundaries with Fireworks.
***
### Feedback & support
We value your feedback! If you encounter issues, need clarification, or have questions, please contact us via:
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
***
**Additional resources:**
* [Fireworks AI Blog](https://fireworks.ai/blog)
* [Fireworks AI YouTube](https://www.youtube.com/channel/UCHCffBTGYa1Ut72h03ldtGA)
* [Fireworks AI Twitter](https://x.com/fireworksai_hq)
# Build with Fireworks
Source: https://docs.fireworks.ai/cookbook/learn_with_fireworks/ecosystem_examples
Step-by-step guides for hands-on exploration, ideal for interactive learning of AI techniques.
## Inference
Explore notebooks and projects showcasing how to run generative AI models on Fireworks, demonstrating both third-party integrations and innovative applications with industry-leading speed and flexibility.
### LLMs
Dive into examples that utilize Fireworks for deploying and fine-tuning large language models (LLMs), featuring integrations with popular libraries and cutting-edge use cases.
**Notebooks**
(Python) An interactive Streamlit app for comparing LLMs on Fireworks with parameter tuning and LLM-as-a-Judge functionality.
(Python) Demonstrates structured responses using Llama 3.1, covering Grammar Mode and JSON Mode for consistent output formats.
(Python) Explores generating synthetic data with Llama 3.1 models on Fireworks, including structured outputs for quizzes.
(Python) Uses DeepSeek V3 & R1 to generate structured PC specifications while explaining component choices using Reasoning JSON Mode.
(Python) Demonstrates structured patient record generation using Reasoning JSON Mode to explain treatment recommendations.
**Apps**
A Next.js app for real-time transcription chat using Fireworks and Vercel integration.
### Visual-language
Discover projects combining vision and language capabilities using Fireworks, integrating external frameworks for seamless multimodal understanding.
### Audio
Explore real-time audio transcription, processing, and generation examples using Fireworks’ advanced audio models and integrations.
**Notebooks**
A notebook demonstrating real-time audio transcription using Fireworks' `whisper-v3-large` compatible model. The project includes streaming audio input and getting transcription messages, making it ideal for tasks requiring accurate and responsive audio processing.
### Image
Experiment with image-based projects using Fireworks’ models, enhanced with third-party libraries for innovative applications in image creation, manipulation, and recognition.
### Multimodal
Learn from complex multimodal examples that blend text, audio, and image inputs, demonstrating the full potential of Fireworks combined with external tools for interactive AI experiences.
***
## Fine-tuning
Access notebooks that demonstrate efficient model fine-tuning on Fireworks, utilizing both internal capabilities and third-party tools like Axolotl for custom optimization.
### Multi-LoRA
Explore notebooks showcasing the integration and utilization of multiple LoRA adapters in Fireworks. These resources demonstrate advanced techniques for merging, fine-tuning, and deploying multi-LoRA configurations to optimize model performance across diverse tasks.
**Notebooks**
(Python) An interactive guide showcasing the integration of Multi-LoRA adapters on Fireworks, enabling fine-tuned responses for diverse product domains such as beauty, fashion, outdoor gear, and baby products.
***
## Function calling
Explore examples of function-calling workflows using Fireworks, showcasing how to integrate with external APIs and tools for sophisticated, multi-step AI operations.
**Notebooks**
Demonstrates Function-Calling with LangChain integration, including custom tool routing and query handling. (Python)
Explore the integration of Fireworks' function-calling model with LangChain tools. This notebook demonstrates building basic agents using `firefunction-v1` for tasks like answering questions, retrieving stock prices, and generating images with the Fireworks SDXL API (Javascript).
Showcases Function-Calling with LangGraph integration for graph-based agent systems and tool queries. (Python)
Uses Fireworks' Function-Calling for structured QA with OpenAI, featuring multi-turn conversation handling. (Python)
Demonstrates querying financial data using Fireworks' Function-Calling API with integrated tool setup. (Python)
Extracts structured information from web content using Fireworks' Function-Calling API. (Python)
Generates stock charts using Fireworks' Function-Calling API with AutoGen integration. (Python)
**Apps**
A demo app showcasing chat with function-calling capabilities for dynamic service invocation.
***
## RAG
Build retrieval-augmented generation (RAG) systems with Fireworks, featuring projects that connect with vector databases and search tools for enhanced, context-aware AI responses.
**Notebooks**
A basic RAG implementation using ChromaDB with League of Legends data, comparing responses across multiple models. (Python)
An agentic system using RAG for generating catchy research paper titles with embeddings and LLM completions. (Python)
A movie recommendation system using Fireworks' function-calling models and MongoDB Atlas for personalized, real-time suggestions. (Python)
**Apps**
A RAG chatbot using SurrealDB for vector storage and Fireworks for real-time, context-aware responses.
***
### Integration partners
We welcome contributions from integration partners! Follow these steps:
1. **Clone the Repo**: [Fireworks Cookbook repo](https://github.com/fw-ai/cookbook)
2. **Create Folder**: Add your company/tool under `integrations`
3. **Add Examples**: Include code, notebooks, or demos
4. **Use Template**: Fill out the [integration guide](https://github.com/fw-ai/cookbook/blob/main/integrations/template_integration_guide.md)
5. **Submit PR**: Create a pull request
6. **Review**: Fireworks will review and merge
Need help? Contact us or open an issue.
***
### Support
For help or feedback:
* **Discord**: [Join us](https://discord.gg/fireworks-ai)
* **Email**: [Contact us](mailto:inquiries@fireworks.ai)
**Resources**:
* [Blog](https://fireworks.ai/blog)
* [YouTube](https://www.youtube.com/channel/UCHCffBTGYa1Ut72h03ldtGA)
* [Twitter](https://x.com/fireworksai_hq)
# Community showcase
Source: https://docs.fireworks.ai/cookbook/projects_showcase/community_examples
Creative user-contributed projects that showcase innovative applications of Fireworks in diverse contexts.
Convert any PDF into a personalized podcast using open-source LLMs and TTS models. Powered by Fireworks-hosted Llama 3.1, MeloTTS, and Bark, this app generates engaging dialogue and outputs it as an MP3 file via a user-friendly Gradio interface.
High-throughput code generation with Qwen2.5 Coder models, optimized for fast inference on Fireworks. Includes a robust pipeline for data creation, fine-tuning with Unsloth, and real-time application in AI-powered code editors.
Ensure accurate and reliable technical documentation with ProoferX, built using Fireworks’ fast Llama models and Firefunc for structured output. This project addresses a key challenge in developer tools by validating and streamlining documentation with real-time checks.
***
## Community project submissions
We welcome your contributions to the **Fireworks Cookbook**! Share your projects and help expand our collaborative resource.
Here’s how:
1. **Clone the Repo**: [Fireworks Cookbook](https://github.com/fw-ai/cookbook) and go to `showcase`.
2. **Create Folder**: Add a folder named after your project.
3. **Include Code**: Add notebooks, apps, or other resources demonstrating your project.
4. **Complete Template**: Fill out the [Showcase Template](https://github.com/fw-ai/cookbook/blob/main/showcase/template_projectMDX.md) for key project details.
5. **Submit PR**: Submit your project as a pull request.
6. **Review & Feature**: Our team will review your submission; selected projects may be highlighted in docs or social media.
***
### Support
For help or feedback:
* **Discord**: [Join us](https://discord.gg/fireworks-ai)
* **Email**: [Contact us](mailto:inquiries@fireworks.ai)
**Resources**:
* [Blog](https://fireworks.ai/blog)
* [YouTube](https://www.youtube.com/channel/UCHCffBTGYa1Ut72h03ldtGA)
* [Twitter](https://x.com/fireworksai_hq)
# DeepSeek Resources
Source: https://docs.fireworks.ai/deepseek/general-deepseek
Access information, blog posts, FAQs, and detailed documentation for DeepSeek v3 and R1.
## 1. How to Access DeepSeek v3 & R1
DeepSeek models are available on Fireworks AI with flexible deployment options.
You can test DeepSeek v3 and R1 in an interactive environment without any coding.
🔗 [Try DeepSeek v3 on Fireworks Playground](https://fireworks.ai/playground?model=deepseek-v3)\
🔗 [Try DeepSeek R1 on Fireworks Playground](https://fireworks.ai/playground?model=deepseek-r1)
Developers can integrate DeepSeek models into applications using Fireworks' API.
🔗 [Fireworks API Reference](https://docs.fireworks.ai/api-reference/introduction)\
🔗 [Using reasoning JSON mode](https://docs.fireworks.ai/structured-responses/structured-response-formatting#reasoning-model-json-mode)
* **Serverless API** – Instantly access DeepSeek models with pay-as-you-go pricing.
* **Dedicated Deployments** – Higher throughput and lower latency for enterprise use. Contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
***
## 2. General FAQs
Below are common questions about DeepSeek models on Fireworks, organized by category.
#### Model Integrity & Modifications
No, Fireworks hosts the unaltered versions of DeepSeek models.
* ❌ No quantization – Full-precision versions are hosted.
* ❌ No additional censorship – Fireworks does not apply additional content moderation beyond DeepSeek’s built-in policies.
* ❌ No forced system prompt – Users have full control over prompts.
🔹 Fireworks hosts DeepSeek R1 and V3 models on Serverless. Contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) if you need a dedicated deployment.
🔹 Fireworks also offers the six R1-distill models released by DeepSeek via on-demand deployments.
#### Data Privacy & Hosting Locations
Fireworks has zero-data retention by default and does not log or store prompt or generation data.
See [Fireworks Data Handling Policy](https://docs.fireworks.ai/guides/security_compliance/data_handling) for details.
Fireworks hosts DeepSeek models on servers in North America and the EU.
The company DeepSeek does not have access to user API requests or outputs.
#### Pricing & Cost Considerations
Fireworks hosts DeepSeek models on our own infrastructure. We do not proxy requests to DeepSeek API.
We are continuously optimizing the model for speed and throughput. We also offer useful developer features like JSON mode, structured outputs, and dedicated deployment options.
Yes, Fireworks offers dedicated deployments for DeepSeek models.\
Contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) if you need a dedicated deployment.
* 🚀 Lower latency – Dedicated instances have better response times than shared serverless.
* 📈 Higher throughput – More consistent performance for large-scale applications.
* 💰 Pricing: Depends on workload, contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai).
#### Output Control & Limits
Yes! Fireworks supports structured outputs through:
* ✔️ **JSON Mode** – Enforce JSON responses for structured applications.
* ✔️ **Grammar Mode** – Define syntactic constraints for predictable outputs.
Currently, DeepSeek R1 does not support native function calling like OpenAI models.\
However:
* Users can implement function calling logic via prompt engineering or structured output parsing.
* Fireworks is evaluating future support for function calling in DeepSeek models.
Max token length for DeepSeek models is only limited by the context window of the model, which is **128K tokens**.
If responses are cut off, try increasing `max_tokens` in your API call:\
🔗 [Fireworks Max Tokens Documentation](https://docs.fireworks.ai/guides/querying-text-models#max-tokens)
**Reasoning Effort** allows you to control how much computation DeepSeek R1 spends on reasoning:
* ✨ **Key Benefits**:
* 🚀 **Faster responses** for time-sensitive applications
* 💰 **Cost optimization** for budget-conscious deployments
* ⚙️ **Predictable latency** for production systems
* 🎛️ **Control Options**:
* `reasoning_effort = "low"`: Limits Chain-of-Thought (CoT) reasoning to 40% of full length
* Achieves 63% accuracy on AIME 2024 math problems (better than o1-mini\_low at 60%)
* `reasoning_effort = [integer < 20,000]`: Custom effort limit in computational units
* 💻 **Example Usage**:
```python
from fireworks.client import Fireworks

client = Fireworks(api_key="")
response = client.chat.completions.create(
    model="fireworks/deepseek-r1",
    messages=[{
        "role": "user",
        "content": "Solve this math problem: What is 2 + 2?",
    }],
    reasoning_effort="low",  # or an integer like 5000
)
print(response.choices[0].message.content)
```
* 📝 **Technical Notes**:
* Works with the `fireworks/deepseek-r1` and `fireworks/deepseek-r1-basic` models
* Server-side logic handles truncation (no prompt tweaking needed)
* Forces a `</think>` token at the defined effort limit
* Prevents excessive deliberation in responses
#### Parsing & API Response Handling
DeepSeek R1 uses `<think>` tags to denote reasoning before the final structured output.
Fireworks defaults to the simplest approach of returning the `<think>...</think>` block in the response and letting the user parse the reasoning themselves, for example with a regular expression.
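A minimal parsing sketch, assuming the reasoning is wrapped in `<think>...</think>` tags as described above:
```python
import re

def split_reasoning(response_text: str) -> tuple[str, str]:
    """Split a DeepSeek R1 response into (reasoning, final_answer)."""
    match = re.search(r"<think>(.*?)</think>", response_text, flags=re.DOTALL)
    if not match:
        return "", response_text.strip()
    reasoning = match.group(1).strip()
    final_answer = response_text[match.end():].strip()
    return reasoning, final_answer

# Example usage
reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>The answer is 4.")
print(reasoning)  # "2 + 2 = 4"
print(answer)     # "The answer is 4."
```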
#### Roadmap & Feature Requests
Fireworks updates DeepSeek R1 and v3 in alignment with DeepSeek AI’s official releases and Fireworks' own performance optimizations.
Updates include bug fixes, efficiency improvements, and potential model refinements.
Users can track updates through Fireworks documentation and announcements.
🔗 For the latest version information, refer to the [Fireworks API documentation](https://docs.fireworks.ai) or join the [Fireworks community Discord](https://discord.com/invite/fireworks-ai).
#### General Troubleshooting
If you're encountering an error while using DeepSeek v3 on Fireworks, follow these steps:
✅ **Step 1:** Check Fireworks' [Status Page](https://status.fireworks.ai) for any ongoing outages.
✅ **Step 2:** Verify API request formatting. Ensure:
* No missing/invalid API keys
* Proper request format
* No exceeded rate limits or context window
✅ **Step 3:** Reduce request complexity if your request is too long.
✅ **Step 4:** Adjust parameters if experiencing instability:
* Lower **temperature** for more deterministic responses
* Adjust **top\_p** to control randomness
* Increase **max\_tokens** to avoid truncation
✅ **Step 5:** Contact Fireworks support via:
* 🔗 [Fireworks Support](https://docs.fireworks.ai/support)
* 🔗 [Fireworks Discord](https://discord.com/invite/fireworks-ai) for real-time help.
DeepSeek v3 and R1, like other LLMs, have a **fixed maximum context length of 128K tokens**.\
If responses are getting cut off:
🔹 **Possible Causes & Solutions:**
1️⃣ Exceeded `max_tokens` setting → 🔧 Increase `max_tokens`
2️⃣ Requesting too much text in a single prompt → 🔧 Break input into smaller chunks
3️⃣ Model context window limit reached → 🔧 Summarize prior messages before appending new ones
💡 **Fix:**
```python
from fireworks.client import Fireworks

client = Fireworks(api_key="")  # your Fireworks API key
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3",
    messages=[{"role": "user", "content": "Generate a long article summary"}],
    max_tokens=4096,  # Adjust as needed
)
```
📌 **Alternative Fix:** If you need longer responses, re-prompt the model with the last part of the output and ask it to continue.
Intermittent API response issues could be due to:\
🔹 **Common Causes & Fixes:**
1️⃣ **High Server Load** – Fireworks may be experiencing peak traffic.\
**Fix:** Retry the request after a few seconds or try during non-peak hours.
2️⃣ **Rate Limits or Spend Limits Reached** – If you've exceeded the API rate limits, requests may temporarily fail.\
**Fix:** Check your rate limits and spend limits in the API dashboard and adjust your usage accordingly.\
🔗 To increase spend limits, add credits: [Fireworks Spend Limits](https://docs.fireworks.ai/guides/quotas_usage/rate-limits#spend-limits)
3️⃣ **Network Connectivity Issues** – Fireworks API may be unreachable due to network issues.\
**Fix:** Restart your internet connection or use a different network/VPN.
📌 **If problems persist, check Fireworks' [status page](https://status.fireworks.ai) or reach out via our [Discord](https://discord.com/invite/fireworks-ai).** 🚀
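Where transient failures are the cause, a simple client-side retry with exponential backoff usually helps. A minimal sketch, assuming the `fireworks.client` chat API shown earlier on this page:
```python
import time

from fireworks.client import Fireworks

client = Fireworks(api_key="")  # your Fireworks API key

def chat_with_retries(messages, retries=3, base_delay=2.0):
    """Retry a chat completion a few times with exponential backoff."""
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="accounts/fireworks/models/deepseek-v3",
                messages=messages,
            )
        except Exception:  # in real code, catch only API/network errors
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

response = chat_with_retries([{"role": "user", "content": "Hello!"}])
print(response.choices[0].message.content)
```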
***
## 3. Learn about R1 & V3
Stay up to date with the latest advancements and insights into DeepSeek models.
Check out our blog, where experts from Fireworks break down everything you need to know about R1 and V3:
A deep dive into DeepSeek R1's capabilities, architecture, and use cases.
Explore how DeepSeek v3 now supports vision capabilities through document inlining.
Learn how reinforcement learning with verifiable rewards is shaping AI training.
Learn about the distillation process for DeepSeek R1 and how it impacts reasoning capabilities.
Discover how structured output techniques like reasoning mode improve AI responses.
We've also published videos on our [YouTube channel](https://www.youtube.com/channel/UCHCffBTGYa1Ut72h03ldtGA).
# Direct routing
Source: https://docs.fireworks.ai/deployments/direct-routing
Direct routing enables enterprise users to reduce latency to their deployments.
## Internet direct routing
Internet direct routing bypasses our global API load balancer and directly routes your request to the machines where
your deployment is running. This can save several tens or even hundreds of milliseconds of time-to-first-token (TTFT)
latency.
To create a deployment using Internet direct routing:
```bash
$ firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \
--direct-route-type INTERNET \
--direct-route-api-keys
Name: accounts/my-account/deployments/abcd1234
...
Direct Route Handle: my-account-abcd1234.us-arizona-1.direct.fireworks.ai
Region: US_ARIZONA_1
```
If you have multiple API keys, use repeated flags, such as:
`--direct-route-api-keys= --direct-route-api-keys=`. These keys can
be any alphanumeric string and are a distinct concept from the API keys provisioned via the Fireworks console. A key
provisioned in the console but not specified in this list will not be allowed when querying the model via direct
routing.
Take note of the `Direct Route Handle` to get the inference endpoint. This is what you will use to access the deployment
instead of the global `https://api.fireworks.ai/inference/` endpoint. For example:
```bash
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3-8b-instruct",
"prompt": "The sky is"
}' \
--url https://my-account-abcd1234.us-arizona-1.direct.fireworks.ai/v1/completions
```
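The same request in Python using `requests` (a sketch; the host comes from the `Direct Route Handle` in the example above, and the key is one of the values passed via `--direct-route-api-keys`):
```python
import requests

# Host from the Direct Route Handle; key from --direct-route-api-keys.
url = "https://my-account-abcd1234.us-arizona-1.direct.fireworks.ai/v1/completions"
headers = {
    "Authorization": "Bearer ",  # append your direct-route API key
    "Content-Type": "application/json",
}
payload = {
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "prompt": "The sky is",
}
print(requests.post(url, headers=headers, json=payload).json())
```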
## Supported Regions for Direct Routing
Direct routing is currently supported in the following regions:
* `US_IOWA_1`
* `US_NEVADA_1`
* `US_VIRGINIA_1`
* `US_VIRGINIA_2`
* `US_ARIZONA_1`
* `US_ILLINOIS_1`
* `US_ILLINOIS_2`
* `EU_FRANKFURT_1`
* `EU_PARIS_1`
* `EU_HELSINKI_1`
* `AP_TOKYO_1`
## Private Service Connect (PSC)
Contact your Fireworks representative to set up [GCP Private Service Connect](https://cloud.google.com/vpc/docs/private-service-connect)
to your deployment.
## AWS PrivateLink
Contact your Fireworks representative to set up [AWS PrivateLink](https://aws.amazon.com/privatelink/) to your
deployment.
# Regions
Source: https://docs.fireworks.ai/deployments/regions
Fireworks runs a global fleet of hardware on which you can deploy your models.
## Availability
Current region availability:
| **Region** | **Launch status** | **Hardware availability** |
| ---------------- | ------------------- | ------------------------------------- |
| `US_ILLINOIS_2` | Generally Available | `NVIDIA_A100_80GB` |
| `US_VIRGINIA_2` | Generally Available | `NVIDIA_H100_80GB` `AMD_MI300X_192GB` |
| `EU_PARIS_1` | Generally Available | `NVIDIA_H200_141GB` |
| `AP_TOKYO_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `EU_FRANKFURT_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `US_ILLINOIS_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `US_IOWA_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `US_VIRGINIA_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `US_ARIZONA_1` | Enterprise only | `NVIDIA_H100_80GB` |
If you need deployments in a non-GA region, please contact our team at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai).
## Using a region
When creating a deployment, you can pass the `--region` flag:
```
firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \
--region US_IOWA_1
```
## Changing regions
Updating a region for a deployment in-place is currently not supported. To move a deployment between regions, please
create a new deployment in the new region, then delete the old deployment.
## Quotas
Each region has its own separate quota for each hardware type. To view your current quotas, run:
```
firectl list quotas
```
# Reserved capacity
Source: https://docs.fireworks.ai/deployments/reservations
Enterprise accounts can purchase reserved capacity, typically with 1 year commitments. Reserved capacity has the
following advantages over [on-demand deployments](/guides/ondemand-deployments):
* Guaranteed capacity
* Higher quotas
* Lower GPU-hour prices
* Pre-GA access to newer regions
* Pre-GA access to newest hardware
## Purchasing or renewing a reservation
To purchase a reservation or increase the size or duration of an existing reservation, contact your Fireworks account
manager. If you are a new, prospective customer, please reach out to our [sales team](https://fireworks.ai/company/contact-us).
## Viewing your reservations
To view your existing reservations, run:
```
firectl list reservations
```
## Usage and billing
Reservations are automatically "consumed" when you create deployments that meet the reservation parameters. For
example, suppose you have a reservation for 12 H100 GPUs and create two deployments, each using 8 H100 GPUs. While both
deployments are running, 12 H100s will count towards your reservation, while the excess 4 H100s will be metered
and billed at the on-demand rate.
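The reserved/on-demand split follows directly from that arithmetic; an illustrative sketch:
```python
# Illustrative only: reserved vs. on-demand split for the example above.
reserved_gpus = 12
deployed_gpus = 8 + 8  # two deployments of 8 H100s each

covered_by_reservation = min(deployed_gpus, reserved_gpus)  # 12, covered by the reservation
billed_on_demand = max(0, deployed_gpus - reserved_gpus)    # 4, metered at the on-demand rate
print(covered_by_reservation, billed_on_demand)
```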
When a reservation approaches its end time, ensure that you either renew your reservation or turn down a corresponding
number of deployments; otherwise you may be billed for your usage at on-demand rates.
Reservations are invoiced separately from your on-demand usage, at a frequency determined by your reservation contract
(e.g. monthly, quarterly, or yearly).
Reserved capacity will always be billed until the reservation ends, regardless of whether the reservation is
actively used.
# About Fireworks developer partners
Source: https://docs.fireworks.ai/ecosystem/integrations_process
Learn about the Fireworks Developer Partners Program, including goals, application process, and benefits for tools and platforms in the LLMOps/Gen-Ops ecosystem.
The **Fireworks developer integrations program** supports tools, platforms, and projects in the LLMOps/Gen-Ops ecosystem, enabling seamless collaboration with Fireworks. 🌐 Whether through **native integrations** or **compatible workflows**, developer integrations represent tools and platforms that:
* Offer **native integration** with Fireworks APIs, enabling deep functionality and seamless operation.
* Provide **compatible workflows**, demonstrating interoperability with Fireworks through shared use cases and adaptable processes.
* Add value to the Fireworks ecosystem by enhancing developer workflows, improving scalability, or solving key challenges in LLMOps/Gen-Ops. 🔧
***
# Goals of the developer partners program
1. **Expand the ecosystem**: Build a rich network of tools that extend Fireworks’ capabilities. 🌱
2. **Showcase interoperability**: Demonstrate how Fireworks works with diverse tools to solve real-world challenges. 🌍
3. **Support innovation**: Encourage the creation of impactful generative AI solutions. 💡
4. **Promote collaboration**: Highlight shared contributions through joint marketing, workshops, and developer resources. 🤝
***
## Types of developer partners
1. **Native integrations** 🛠️
* Tools with direct integration into Fireworks APIs or SDKs, offering seamless plug-and-play functionality.
* Examples include official connectors, plugins, and platform integrations.
2. **Compatible workflows**
* Tools or platforms that interoperate with Fireworks through shared APIs, workflows, or third-party bridges.
* Examples include vector stores, fine-tuning tools, and monitoring solutions that work alongside Fireworks.
***
# What does a developer integration look like?
A developer integration can include:
* **Native integrations**: Fully integrated tools or connectors offering seamless user experiences.
* **Workflow compatibility**: Examples and documentation showing how a tool works with Fireworks APIs.
* **Developer resources**: Contributed guides, notebooks, and sample repositories to enable other users.
**Examples**:
* **Native integration**: A plugin for a vector database that directly connects with Fireworks’ RAG workflows.
* **Compatible workflow**: A step-by-step guide for using Fireworks APIs alongside an MLOps monitoring tool.
***
# How to apply
### Step 1: Demonstrate compatibility or build integration 🔍
* **Native integrations**: Develop a connector or integration directly into Fireworks APIs or SDKs.
* **Compatible workflows**: Validate how your tool works with Fireworks workflows and APIs.
* Prepare resources such as GitHub repos, notebooks, or workflow guides.
### Step 2: Submit your application 📤
1. **Create documentation**
* Use the [Fireworks cookbook template](https://github.com/fw-ai/cookbook/blob/main/integrations/template_integration_guide.md) to document your integration or workflow.
2. **Submit your contribution**
* Fork the [Fireworks cookbook](https://github.com/fw-ai/cookbook) and submit a pull request with your materials.
* Include links to your GitHub repo or supporting documentation.
3. **Contact developer relations**\
For guidance, reach out to [DevRel](mailto:devrel@fireworks.ai).
### Step 3: Review and feedback ✅
* Fireworks developer relations will review your submission to ensure technical accuracy and alignment with program goals.
* Once approved, your integration or workflow will be published in Fireworks documentation and promoted through official channels.
***
# Benefits of becoming a Fireworks developer partner 🌟
1. **Ecosystem visibility**
* Be featured in Fireworks documentation and resources as a trusted integration.
* Gain recognition within the growing LLMOps/Gen-Ops developer community.
2. **Technical and marketing support**
* Access Fireworks resources and technical support for building integrations.
* Collaborate on co-marketing campaigns, webinars, and tutorials.
3. **Community collaboration**
* Join a network of ecosystem partners working to push generative AI innovation forward.
* Share insights and learn from other projects in the LLMOps/Gen-Ops space.
***
# Program FAQ ❓
**Q: Who can apply to the Developer Partners program?**\
A: Tools, platforms, and projects that either integrate natively with Fireworks or demonstrate compatibility through workflows are welcome to apply.
**Q: What types of contributions are required?**\
A: Contributions can include technical documentation, integration guides, sample workflows, GitHub repos, and co-marketing materials.
**Q: Is there a cost to participate?**\
A: No, the Developer Partners program is free.
**Q: Can compatible workflows evolve into native integrations?**\
A: Yes! Tools demonstrating strong adoption and compatibility may transition to deeper integrations and partnerships.
***
For more information or to get started, contact us at:
* **Discord**: [Join here](https://discord.gg/fireworks-ai)
* **Email**: [devrel@fireworks.ai](mailto:devrel@fireworks.ai)
# Reward Kit API Reference
Source: https://docs.fireworks.ai/evaluators/api_reference/api_overview
This API reference provides detailed documentation for the key classes, functions, and data models in the Reward Kit.
## Core Components
### Classes and Decorators
* [RewardFunction Class](reward_function_class.md): Core class for wrapping and calling reward functions
* [reward\_function Decorator](reward_function_decorator.md): Decorator for creating deployable reward functions
### Data Models
* [Data Models](data_models.md): Documentation for Message, RewardOutput, MetricRewardOutput, and other data models
## Modules
### reward\_function Module
The `reward_function` module contains the core functionality for creating and using reward functions.
```python
from reward_kit.reward_function import RewardFunction, reward_function
```
### evaluation Module
The `evaluation` module provides functions for previewing and creating evaluations.
```python
from reward_kit.evaluation import preview_evaluation, create_evaluation
```
Key functions (see the usage sketch below):
* **`preview_evaluation`**: Previews an evaluation with sample data before deployment
* **`create_evaluation`**: Creates and deploys an evaluator to Fireworks
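For illustration, a minimal sketch of how these two functions are typically combined; the folder paths and evaluator IDs below are placeholders:

```python
from reward_kit.evaluation import preview_evaluation, create_evaluation

# Preview a metric against local sample conversations (paths are placeholders).
preview = preview_evaluation(
    metric_folders=["clarity=./my_metrics/clarity"],
    sample_file="./samples.jsonl",
    max_samples=5,
)
preview.display()

# Deploy the same metric as a hosted evaluator once the preview looks good.
evaluator = create_evaluation(
    evaluator_id="clarity-evaluator",
    metric_folders=["clarity=./my_metrics/clarity"],
    display_name="Clarity Evaluator",
    description="Evaluates response clarity",
    force=True,
)
print(evaluator["name"])
```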
### models Module
The `models` module contains data models used throughout the Reward Kit.
```python
from reward_kit.models import RewardOutput, MetricRewardOutput, Message
```
### rewards Module
The `rewards` module contains specialized reward functions for specific use cases.
```python
from reward_kit.rewards.function_calling import match_function_call
```
### server Module
The `server` module provides functionality for running a reward function as a server.
```python
from reward_kit.server import run_server
```
### auth Module
The `auth` module handles authentication with Fireworks.
```python
from reward_kit.auth import get_authentication
```
## Command Line Interface
The Reward Kit provides a command-line interface for common operations:
```bash
# Show help
reward-kit --help
# Preview an evaluator
reward-kit preview --metrics-folders "metric=./path" --samples ./samples.jsonl
# Deploy an evaluator
reward-kit deploy --id my-evaluator --metrics-folders "metric=./path" --force
```
For detailed CLI documentation, see the [CLI Reference](../cli_reference/cli_overview.mdx).
## Common Patterns
### Creating a Basic Reward Function
```python
from reward_kit import reward_function, RewardOutput, MetricRewardOutput
@reward_function
def my_reward_function(messages, original_messages=None, **kwargs):
# Your evaluation logic here
response = messages[-1].get("content", "")
score = calculate_score(response)
return RewardOutput(
score=score,
metrics={
"my_metric": MetricRewardOutput(
score=score,
reason="Explanation for the score"
)
}
)
```
### Using a Deployed Reward Function
```python
from reward_kit import RewardFunction
# Create a reference to a deployed reward function
reward_fn = RewardFunction(
name="my-deployed-evaluator",
mode="remote"
)
# Call the reward function
result = reward_fn(messages=[
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is..."}
])
print(f"Score: {result.score}")
```
## Next Steps
* Explore the [Examples](../examples/) for practical implementations
* Follow the [Tutorials](../tutorials/) for step-by-step guidance
* Review the [Developer Guide](../developer_guide/) for conceptual understanding
# Data Models Reference
Source: https://docs.fireworks.ai/evaluators/api_reference/data_models
This document describes the core data models used in the Reward Kit for representing messages, evaluation results, and metrics.
## Message Models
### Message
The `Message` class represents a single message in a conversation.
```python
from reward_kit import Message
message = Message(
role="assistant",
content="This is the response content",
name=None, # Optional
tool_call_id=None, # Optional
tool_calls=None, # Optional
function_call=None # Optional
)
```
#### Attributes
* **`role`** (`str`): The role of the message sender. Typically one of:
* `"user"`: Message from the user
* `"assistant"`: Message from the assistant
* `"system"`: System message providing context/instructions
* **`content`** (`str`): The text content of the message.
* **`name`** (`Optional[str]`): Optional name of the sender (for named system messages).
* **`tool_call_id`** (`Optional[str]`): Optional ID for a tool call (used in tool calling).
* **`tool_calls`** (`Optional[List[Dict[str, Any]]]`): Optional list of tool calls in the message.
* **`function_call`** (`Optional[Dict[str, Any]]`): Optional function call information (legacy format).
#### Compatibility
The `Message` class is compatible with OpenAI's `ChatCompletionMessageParam` interface, allowing for easy integration with OpenAI-compatible APIs.
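For example, the same message dictionaries can be passed to an OpenAI-compatible client and then scored with a deployed evaluator. The sketch below assumes the `openai` Python package, Fireworks' OpenAI-compatible base URL (`https://api.fireworks.ai/inference/v1`), and placeholder model and evaluator names:

```python
import os
from openai import OpenAI
from reward_kit import RewardFunction

# OpenAI-compatible client pointed at Fireworks (base URL and model are assumptions here).
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

messages = [{"role": "user", "content": "What is machine learning?"}]
completion = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3-8b-instruct",
    messages=messages,
)

# Reuse the same dict format to score the reply with a deployed evaluator (placeholder name).
messages.append({"role": "assistant", "content": completion.choices[0].message.content})
reward_fn = RewardFunction(name="my-deployed-evaluator", mode="remote")
print(reward_fn(messages=messages).score)
```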
## Reward Output Models
### RewardOutput
The `RewardOutput` class represents the complete result of a reward function evaluation.
```python
from reward_kit import RewardOutput, MetricRewardOutput
result = RewardOutput(
score=0.75,
metrics={
"clarity": MetricRewardOutput(score=0.8, reason="The response clearly explains the concept"),
"accuracy": MetricRewardOutput(score=0.7, reason="Contains one minor factual error")
}
)
```
#### Attributes
* **`score`** (`float`): The overall reward score, typically between 0.0 and 1.0.
* **`metrics`** (`Dict[str, MetricRewardOutput]`): Dictionary of component metrics, with metric names as keys and `MetricRewardOutput` objects as values.
#### Methods
* **`to_dict()`** → `Dict[str, Any]`: Converts the `RewardOutput` to a dictionary representation.
* **`from_dict(data: Dict[str, Any])`** → `RewardOutput`: Class method that creates a `RewardOutput` from a dictionary representation.
* **`__str__()`** → `str`: Returns a JSON string representation of the `RewardOutput`.
### MetricRewardOutput
The `MetricRewardOutput` class represents a single component metric in a reward evaluation.
```python
from reward_kit import MetricRewardOutput
metric = MetricRewardOutput(
score=0.8,
reason="The response provides a clear explanation with appropriate examples"
)
```
#### Attributes
* **`score`** (`float`): The score for this specific metric, typically between 0.0 and 1.0.
* **`reason`** (`Optional[str]`): Optional explanation for why this score was assigned.
## Evaluation Models
### EvaluateResult
The `EvaluateResult` class represents the complete result of an evaluator with multiple metrics.
```python
from reward_kit import EvaluateResult, MetricResult
result = EvaluateResult(
score=0.75,
reason="Overall good response with minor issues",
metrics={
"clarity": MetricResult(score=0.8, reason="Clear and concise"),
"accuracy": MetricResult(score=0.7, reason="Contains a minor factual error")
},
error=None # Optional error message
)
```
#### Attributes
* **`score`** (`float`): The overall evaluation score, typically between 0.0 and 1.0.
* **`reason`** (`Optional[str]`): Optional explanation for the overall score.
* **`metrics`** (`Dict[str, MetricResult]`): Dictionary of component metrics.
* **`error`** (`Optional[str]`): Optional error message if the evaluation encountered a problem.
### MetricResult
The `MetricResult` class represents a single metric in an evaluation.
```python
from reward_kit import MetricResult
metric = MetricResult(
score=0.8,
reason="The response is clear and well-structured",
success=True # Optional success indicator
)
```
#### Attributes
* **`score`** (`float`): The score for this specific metric, typically between 0.0 and 1.0.
* **`reason`** (`str`): Explanation for why this score was assigned.
* **`success`** (`Optional[bool]`): Optional flag indicating whether this metric evaluation was successful.
## Example Usages
### Working with Messages
```python
from reward_kit import Message
# Create a user message
user_message = Message(
role="user",
content="Can you explain how machine learning works?"
)
# Create an assistant message
assistant_message = Message(
role="assistant",
content="Machine learning is a method where computers learn from data without being explicitly programmed."
)
# Create a system message
system_message = Message(
role="system",
content="You are a helpful assistant that provides clear and accurate explanations."
)
# Create a message with tool calls
tool_call_message = Message(
role="assistant",
content=None,
tool_calls=[{
"id": "call_123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": '{"location": "San Francisco", "unit": "celsius"}'
}
}]
)
```
### Working with RewardOutput
```python
from reward_kit import RewardOutput, MetricRewardOutput
# Create a RewardOutput with multiple metrics
result = RewardOutput(
score=0.75,
metrics={
"clarity": MetricRewardOutput(score=0.8, reason="Response is clear and well-structured"),
"accuracy": MetricRewardOutput(score=0.7, reason="Contains minor factual errors"),
"completeness": MetricRewardOutput(score=0.75, reason="Covers most aspects of the topic")
}
)
# Convert to dictionary
result_dict = result.to_dict()
print(result_dict)
# Create from dictionary
new_result = RewardOutput.from_dict(result_dict)
print(new_result.score) # 0.75
# String representation
result_str = str(result)
print(result_str) # JSON string
```
### Working with EvaluateResult
```python
from reward_kit import EvaluateResult, MetricResult
# Create an EvaluateResult
eval_result = EvaluateResult(
score=0.75,
reason="Overall good response with some minor issues",
metrics={
"clarity": MetricResult(score=0.8, reason="Clear and concise explanation"),
"accuracy": MetricResult(score=0.7, reason="Contains one minor factual error"),
"relevance": MetricResult(score=0.75, reason="Mostly relevant to the query")
}
)
# Access metrics
clarity_score = eval_result.metrics["clarity"].score
print(f"Clarity score: {clarity_score}") # Clarity score: 0.8
# Check for errors
if eval_result.error:
print(f"Evaluation error: {eval_result.error}")
else:
print(f"Evaluation successful with score: {eval_result.score}")
```
## Conversion Between Types
Reward Kit provides functions for converting between `RewardOutput` and `EvaluateResult` formats:
```python
from reward_kit.utils import convert_to_reward_output, convert_to_evaluate_result
# Convert EvaluateResult to RewardOutput
evaluate_result = EvaluateResult(score=0.8, metrics={"quality": MetricResult(score=0.8, reason="Good quality")})
reward_output = convert_to_reward_output(evaluate_result)
# Convert RewardOutput to EvaluateResult
reward_output = RewardOutput(score=0.9, metrics={"clarity": MetricRewardOutput(score=0.9, reason="Very clear")})
evaluate_result = convert_to_evaluate_result(reward_output)
```
This conversion functionality is useful when interacting with different parts of the API that might expect different formats.
## Type Compatibility
While the classes provide strong typing for development, the Reward Kit also accepts dictionary representations for flexibility:
```python
# Using dictionaries instead of Message objects
messages = [
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is a method..."}
]
# These are automatically converted to the appropriate types internally
```
This flexibility makes it easier to integrate with different APIs and data formats.
# RewardFunction Class Reference
Source: https://docs.fireworks.ai/evaluators/api_reference/reward_function_class
The `RewardFunction` class is a core component of the Reward Kit, providing a unified interface for calling reward functions locally or remotely.
## Overview
The `RewardFunction` class wraps a reward function (either a local function or a remote endpoint) and provides a consistent interface for evaluation. It supports:
* Local functions (mode="local")
* Remote endpoints (mode="remote")
* Fireworks-hosted models (mode="fireworks\_hosted")
## Import
```python
from reward_kit.reward_function import RewardFunction
```
## Constructor
```python
RewardFunction(
func: Optional[Callable] = None,
func_path: Optional[str] = None,
mode: str = "local",
endpoint: Optional[str] = None,
name: Optional[str] = None,
model_id: Optional[str] = None,
**kwargs
)
```
### Parameters
* **`func`** (`Optional[Callable]`): The local function to use (for mode="local").
* **`func_path`** (`Optional[str]`): A string path to a function (e.g., "module.submodule:function\_name").
* **`mode`** (`str`): The mode of operation. Options:
* `"local"`: Run the function locally
* `"remote"`: Call a remote endpoint
* `"fireworks_hosted"`: Use a Fireworks-hosted model
* **`endpoint`** (`Optional[str]`): The URL of the remote endpoint (for mode="remote").
* **`name`** (`Optional[str]`): The name of the deployed evaluator (for mode="remote").
If provided and endpoint is not, the endpoint will be constructed from the name.
* **`model_id`** (`Optional[str]`): The ID of the Fireworks-hosted model (for mode="fireworks\_hosted").
* **`**kwargs`**: Additional keyword arguments to pass to the function when called.
### Exceptions
* **`ValueError`**: Raised if required parameters for the specified mode are missing or if an invalid mode is provided.
## Methods
### `__call__`
Call the reward function with the provided messages.
```python
__call__(
messages: List[Dict[str, str]],
original_messages: Optional[List[Dict[str, str]]] = None,
**kwargs
) -> RewardOutput
```
#### Parameters
* **`messages`** (`List[Dict[str, str]]`): List of conversation messages, each with 'role' and 'content' keys.
* **`original_messages`** (`Optional[List[Dict[str, str]]]`): Original conversation messages (for context).
Defaults to all messages except the last one if not provided.
* **`**kwargs`**: Additional keyword arguments to pass to the function.
#### Returns
* **`RewardOutput`**: Object with score and metrics.
#### Exceptions
* **`ValueError`**: Raised if no function or endpoint is provided for the selected mode.
* **`TypeError`**: Raised if the function returns an invalid type.
* **`requests.exceptions.RequestException`**: Raised if there is an error calling the remote endpoint.
### `get_trl_adapter`
Create an adapter function for use with the TRL (Transformer Reinforcement Learning) library.
```python
get_trl_adapter() -> Callable
```
#### Returns
* **`Callable`**: A function that takes batch inputs and returns a batch of reward values, compatible with TRL.
#### Adapter Behavior
The returned adapter function:
1. Handles batch inputs (list of message lists or list of strings)
2. Returns a list of reward scores (one for each input)
3. Handles exceptions gracefully, returning 0.0 for any errors
## Examples
### Local Mode
```python
from reward_kit import RewardFunction, RewardOutput, MetricRewardOutput
# Define a reward function
def my_reward_fn(messages, **kwargs):
response = messages[-1].get("content", "")
score = min(len(response) / 100, 1.0) # Simple score based on length
return RewardOutput(
score=score,
metrics={"length": MetricRewardOutput(score=score, reason=f"Length: {len(response)}")}
)
# Create a reward function in local mode
reward_fn = RewardFunction(func=my_reward_fn, mode="local")
# Call the reward function
result = reward_fn(messages=[
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hi there! How can I help you today?"}
])
print(f"Score: {result.score}")
```
### Remote Mode
```python
# Create a reward function in remote mode
remote_reward = RewardFunction(
name="my-deployed-evaluator",
mode="remote"
)
# Call the reward function
result = remote_reward(messages=[
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is a method of data analysis..."}
])
print(f"Score: {result.score}")
```
### Fireworks Hosted Mode
```python
# Create a reward function using a Fireworks-hosted model
hosted_reward = RewardFunction(
model_id="accounts/fireworks/models/llama-v3-8b-instruct",
mode="fireworks_hosted"
)
# Call the reward function
result = hosted_reward(messages=[
{"role": "user", "content": "Explain quantum computing"},
{"role": "assistant", "content": "Quantum computing uses quantum bits or qubits..."}
])
print(f"Score: {result.score}")
```
### Using with TRL
```python
from reward_kit import RewardFunction
# Create a reward function
reward_fn = RewardFunction(name="my-deployed-evaluator", mode="remote")
# Get a TRL-compatible adapter
trl_reward_fn = reward_fn.get_trl_adapter()
# Use in TRL (example)
batch_inputs = [
[{"role": "user", "content": "Question 1"}, {"role": "assistant", "content": "Answer 1"}],
[{"role": "user", "content": "Question 2"}, {"role": "assistant", "content": "Answer 2"}]
]
# Get reward scores for the batch
reward_scores = trl_reward_fn(batch_inputs)
print(reward_scores) # [score1, score2]
```
## Implementation Details
### Mode-Specific Requirements
* **Local Mode**: Requires either `func` or `func_path`.
* **Remote Mode**: Requires either `endpoint` or `name`.
* **Fireworks Hosted Mode**: Requires `model_id`.
### Function Loading
When providing a `func_path`, the path can be specified in two formats (see the sketch below):
* `module.path:function_name` - Module with colon separator (preferred)
* `module.path.function_name` - Module with function as last component
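For illustration, a minimal sketch of loading a local reward function by path; the module and function names are placeholders:

```python
from reward_kit.reward_function import RewardFunction

# Load "my_rewards.py" (placeholder module) and use its "helpfulness_reward" function.
reward_fn = RewardFunction(func_path="my_rewards:helpfulness_reward", mode="local")

result = reward_fn(messages=[
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Go to the login page and click 'Forgot Password'."},
])
print(result.score)
```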
### Authentication
For remote and Fireworks-hosted modes, the authentication token is retrieved from the `FIREWORKS_API_KEY` environment variable.
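A minimal sketch of supplying the key from Python before making a remote call; the evaluator name is a placeholder:

```python
import os
from reward_kit.reward_function import RewardFunction

# The token must be available in the environment before the remote endpoint is called.
os.environ.setdefault("FIREWORKS_API_KEY", "your_api_key")  # placeholder value

reward_fn = RewardFunction(name="my-deployed-evaluator", mode="remote")
```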
# reward\_function Decorator Reference
Source: https://docs.fireworks.ai/evaluators/api_reference/reward_function_decorator
The `@reward_function` decorator transforms a regular Python function into a reward function with standardized inputs/outputs and deployment capabilities.
## Overview
The decorator serves several key purposes:
1. Ensures consistent input and output formats
2. Adds error handling and validation
3. Provides a `.deploy()` method for deploying the function to Fireworks
## Import
```python
from reward_kit import reward_function
```
## Usage
```python
@reward_function
def my_reward_function(
messages: List[Dict[str, str]],
original_messages: Optional[List[Dict[str, str]]] = None,
**kwargs
) -> RewardOutput:
# Your evaluation logic here
score = 0.75 # Example score
return RewardOutput(score=score, metrics={...})
```
## Parameter Requirements
Functions decorated with `@reward_function` should accept the following parameters:
* **`messages`** (`List[Dict[str, str]]`): Required. List of conversation messages, with the last message typically being the one evaluated.
* **`original_messages`** (`Optional[List[Dict[str, str]]]`): Optional. The conversation context, without the message being evaluated.
* **`**kwargs`**: Optional. Additional parameters (like metadata) that can be passed to the function.
## Return Value Requirements
Functions must return a `RewardOutput` object or a compatible tuple format:
```python
# Preferred return format
return RewardOutput(
score=0.75, # Overall score
metrics={
"clarity": MetricRewardOutput(score=0.8, reason="Good clarity"),
"accuracy": MetricRewardOutput(score=0.7, reason="Minor errors")
}
)
# Legacy tuple format (also supported)
return 0.75, {"clarity": 0.8, "accuracy": 0.7}
```
## Added Methods
### `.deploy()`
The decorator adds a `.deploy()` method to the function, allowing it to be deployed to Fireworks.
```python
evaluation_id = my_reward_function.deploy(
name="my-evaluator",
description="Evaluates responses based on clarity and accuracy",
account_id=None, # Optional, defaults to configured account
auth_token=None, # Optional, defaults to configured token
force=False, # Set to True to overwrite if it already exists
providers=None # Optional model providers configuration
)
```
#### Parameters
* **`name`** (`str`): Required. ID for the deployed evaluator.
* **`description`** (`str`): Optional. Human-readable description of the evaluator.
* **`account_id`** (`Optional[str]`): Optional. Fireworks account ID. If not provided, will be read from config or environment.
* **`auth_token`** (`Optional[str]`): Optional. Authentication token. If not provided, will be read from config or environment.
* **`force`** (`bool`): Optional. Whether to overwrite an existing evaluator with the same name. Default is False.
* **`providers`** (`Optional[List[Dict[str, str]]]`): Optional. List of provider configurations. If not provided, uses a default provider.
#### Returns
* **`str`**: The evaluation ID that can be used in RL training.
#### Exceptions
* **`ValueError`**: Raised if authentication fails or required parameters are missing.
* **`requests.exceptions.HTTPError`**: Raised if the API request fails.
## Implementation Details
### Validation Logic
The decorator performs the following validations:
1. Ensures the decorated function has the expected parameters
2. Validates that the return value is a `RewardOutput` or a compatible tuple
3. Handles exceptions that occur during function execution
### Backward Compatibility
For backward compatibility, the decorator supports the legacy tuple return format:
```python
return score, component_scores_dict
```
This gets automatically converted to a `RewardOutput` object.
### Deployment Process
When `.deploy()` is called, the decorator:
1. Extracts the function's source code
2. Creates a wrapper that handles the Fireworks evaluation format
3. Creates a temporary directory with the wrapped function
4. Uploads and registers the function with the Fireworks API
## Examples
### Basic Usage
```python
from reward_kit import reward_function, RewardOutput, MetricRewardOutput
from typing import List, Dict, Optional
@reward_function
def word_count_reward(
messages: List[Dict[str, str]],
original_messages: Optional[List[Dict[str, str]]] = None,
**kwargs
) -> RewardOutput:
"""Evaluate response based on word count."""
response = messages[-1].get("content", "")
word_count = len(response.split())
score = min(word_count / 100, 1.0)
return RewardOutput(
score=score,
metrics={
"word_count": MetricRewardOutput(
score=score,
reason=f"Word count: {word_count}"
)
}
)
```
### Using Metadata
```python
from typing import Any, Dict, List, Optional
from reward_kit import reward_function, RewardOutput, MetricRewardOutput

@reward_function
def configurable_reward(
messages: List[Dict[str, str]],
original_messages: Optional[List[Dict[str, str]]] = None,
    metadata: Optional[Dict[str, Any]] = None,
**kwargs
) -> RewardOutput:
"""Reward function that accepts configuration via metadata."""
metadata = metadata or {}
# Get threshold from metadata or use default
threshold = metadata.get("threshold", 50)
response = messages[-1].get("content", "")
word_count = len(response.split())
score = min(word_count / threshold, 1.0)
return RewardOutput(
score=score,
metrics={
"configured_word_count": MetricRewardOutput(
score=score,
reason=f"Word count: {word_count}, threshold: {threshold}"
)
}
)
```
### Deploying a Reward Function
```python
# Define and decorate the reward function
@reward_function
def clarity_reward(messages, original_messages=None, **kwargs):
# ... evaluation logic ...
return RewardOutput(score=0.8, metrics={...})
# Deploy the function to Fireworks
evaluation_id = clarity_reward.deploy(
name="clarity-evaluator",
description="Evaluates the clarity of responses",
force=True # Overwrite if it already exists
)
print(f"Deployed evaluator with ID: {evaluation_id}")
```
### Using with Custom Providers
```python
# Deploy with a specific model provider
evaluation_id = my_reward_function.deploy(
name="my-evaluator-anthropic",
description="My evaluator using Claude model",
force=True,
providers=[
{
"providerType": "anthropic",
"modelId": "claude-3-sonnet-20240229"
}
]
)
```
# Command Line Interface Reference
Source: https://docs.fireworks.ai/evaluators/cli_reference/cli_overview
The Reward Kit provides a command-line interface (CLI) for common operations like previewing evaluations, deploying reward functions, and running agent evaluations.
## Installation
When you install the Reward Kit, the CLI is automatically installed:
```bash
pip install reward-kit
```
You can verify the installation by running:
```bash
reward-kit --help
```
## Authentication Setup
Before using the CLI, set up your authentication credentials:
```bash
# Set your API key
export FIREWORKS_API_KEY=your_api_key
# Optional: Set the API base URL (for development environments)
export FIREWORKS_API_BASE=https://api.fireworks.ai
```
## Command Overview
The Reward Kit CLI supports the following main commands:
* `preview`: Preview an evaluation with sample data
* `deploy`: Deploy a reward function as an evaluator
* `agent-eval`: Run agent evaluations on task bundles
* `list`: List existing evaluators (coming soon)
* `delete`: Delete an evaluator (coming soon)
## Preview Command
The `preview` command allows you to test an evaluation with sample data before deployment.
### Syntax
```bash
reward-kit preview [options]
```
### Options
* `--metrics-folders`: Specify metrics to use in the format "name=path"
* `--samples`: Path to a JSONL file containing sample conversations
* `--max-samples`: Maximum number of samples to process (optional)
* `--output`: Path to save preview results (optional)
* `--verbose`: Enable verbose output (optional)
### Examples
```bash
# Basic usage
reward-kit preview --metrics-folders "clarity=./my_metrics/clarity" --samples ./samples.jsonl
# Multiple metrics
reward-kit preview --metrics-folders "clarity=./my_metrics/clarity" "accuracy=./my_metrics/accuracy" --samples ./samples.jsonl
# Limit sample count
reward-kit preview --metrics-folders "clarity=./my_metrics/clarity" --samples ./samples.jsonl --max-samples 5
# Save results to file
reward-kit preview --metrics-folders "clarity=./my_metrics/clarity" --samples ./samples.jsonl --output ./results.json
```
### Sample File Format
The samples file should be a JSONL (JSON Lines) file with each line containing a conversation in the following format:
```json
{"messages": [{"role": "user", "content": "What is machine learning?"}, {"role": "assistant", "content": "Machine learning is a method of data analysis..."}]}
```
## Deploy Command
The `deploy` command deploys a reward function as an evaluator on the Fireworks platform.
### Syntax
```bash
reward-kit deploy [options]
```
### Options
* `--id`: ID for the deployed evaluator (required)
* `--metrics-folders`: Specify metrics to use in the format "name=path" (required)
* `--display-name`: Human-readable name for the evaluator (optional)
* `--description`: Description of the evaluator (optional)
* `--force`: Overwrite if an evaluator with the same ID already exists (optional)
* `--providers`: List of model providers to use (optional)
* `--verbose`: Enable verbose output (optional)
### Examples
```bash
# Basic deployment
reward-kit deploy --id my-evaluator --metrics-folders "clarity=./my_metrics/clarity"
# With display name and description
reward-kit deploy --id my-evaluator \
--metrics-folders "clarity=./my_metrics/clarity" \
--display-name "Clarity Evaluator" \
--description "Evaluates responses based on clarity"
# Force overwrite existing evaluator
reward-kit deploy --id my-evaluator \
--metrics-folders "clarity=./my_metrics/clarity" \
--force
# Multiple metrics
reward-kit deploy --id comprehensive-evaluator \
--metrics-folders "clarity=./my_metrics/clarity" "accuracy=./my_metrics/accuracy" \
--display-name "Comprehensive Evaluator"
```
## Common Workflows
### Iterative Development Workflow
A typical development workflow might look like:
1. Create a reward function
2. Preview it with sample data
3. Refine the function based on preview results
4. Deploy when satisfied
```bash
# Step 1: Create a reward function (manually in ./my_metrics/clarity)
# Step 2: Preview with samples
reward-kit preview --metrics-folders "clarity=./my_metrics/clarity" --samples ./samples.jsonl
# Step 3: Refine the function (manually)
# Step 4: Preview again
reward-kit preview --metrics-folders "clarity=./my_metrics/clarity" --samples ./samples.jsonl
# Step 5: Deploy when satisfied
reward-kit deploy --id clarity-evaluator \
--metrics-folders "clarity=./my_metrics/clarity" \
--display-name "Clarity Evaluator" \
--description "Evaluates response clarity" \
--force
```
### Comparing Multiple Metrics
You can preview multiple metrics to compare their performance:
```bash
# Preview with multiple metrics
reward-kit preview \
--metrics-folders \
"metric1=./my_metrics/metric1" \
"metric2=./my_metrics/metric2" \
"metric3=./my_metrics/metric3" \
--samples ./samples.jsonl
```
### Deployment with Custom Providers
You can deploy with specific model providers:
```bash
# Deploy with custom provider
reward-kit deploy --id my-evaluator \
--metrics-folders "clarity=./my_metrics/clarity" \
--providers '[{"providerType":"anthropic","modelId":"claude-3-sonnet-20240229"}]'
```
## Agent-Eval Command
The `agent-eval` command enables you to run agent evaluations using task bundles.
### Syntax
```bash
reward-kit agent-eval [options]
```
### Options
#### Task Specification:
* `--task-dir`: Path to task bundle directory containing reward.py, tools.py, etc.
* `--dataset` or `-d`: Path to JSONL file containing task specifications.
#### Output and Models:
* `--output-dir` or `-o`: Directory to store evaluation runs (default: "./runs").
* `--model`: Override MODEL\_AGENT environment variable.
* `--sim-model`: Override MODEL\_SIM environment variable for simulated user.
#### Testing and Debugging:
* `--no-sim-user`: Disable simulated user (use static initial messages only).
* `--test-mode`: Run in test mode without requiring API keys.
* `--mock-response`: Use a mock agent response (works with --test-mode).
* `--debug`: Enable detailed debug logging.
* `--validate-only`: Validate task bundle structure without running evaluation.
* `--export-tools`: Export tool specifications to directory for manual testing.
#### Advanced Options:
* `--task-ids`: Comma-separated list of task IDs to run.
* `--max-tasks`: Maximum number of tasks to evaluate.
* `--registries`: Custom tool registries in format 'name=path'.
* `--registry-override`: Override all toolset paths with this registry path.
* `--evaluator`: Custom evaluator module path (overrides default).
### Examples
```bash
# Run agent evaluation with default settings
export MODEL_AGENT=openai/gpt-4o-mini
reward-kit agent-eval --task-dir examples/flight_task
# Use a specific dataset file
reward-kit agent-eval --dataset examples/flight_task/task.jsonl
# Run in test mode (no API keys required)
reward-kit agent-eval --task-dir examples/flight_task --test-mode --mock-response
# Validate task bundle structure without running
reward-kit agent-eval --task-dir examples/flight_task --validate-only
# Use a custom model and limit to specific tasks
reward-kit agent-eval --task-dir examples/flight_task \
--model anthropic/claude-3-opus-20240229 \
--task-ids flight.booking.001,flight.booking.002
# Export tool specifications for manual testing
reward-kit agent-eval --task-dir examples/flight_task --export-tools ./tool_specs
```
### Task Bundle Structure
A task bundle is a directory containing the following files:
* `reward.py`: Reward function with @reward\_function decorator
* `tools.py`: Tool registry with tool definitions
* `task.jsonl`: Dataset rows with task specifications
* `seed.sql` (optional): Initial database state
See the [Agent Evaluation](../developer_guide/agent_evaluation.md) guide for more details.
## Environment Variables
The CLI recognizes the following environment variables:
* `FIREWORKS_API_KEY`: Your Fireworks API key (required for deployment operations)
* `FIREWORKS_API_BASE`: Base URL for the Fireworks API (defaults to `https://api.fireworks.ai`)
* `FIREWORKS_ACCOUNT_ID`: Your Fireworks account ID (optional, can be configured in auth.ini)
* `MODEL_AGENT`: Default agent model to use (e.g., "openai/gpt-4o-mini")
* `MODEL_SIM`: Default simulation model to use (e.g., "openai/gpt-3.5-turbo")
## Troubleshooting
### Common Issues
1. **Authentication Errors**:
```
Error: Authentication failed. Check your API key.
```
Solution: Ensure `FIREWORKS_API_KEY` is correctly set.
2. **Metrics Folder Not Found**:
```
Error: Metrics folder not found: ./my_metrics/clarity
```
Solution: Check that the path exists and contains a valid `main.py` file.
3. **Invalid Sample File**:
```
Error: Failed to parse sample file. Ensure it's a valid JSONL file.
```
Solution: Verify the sample file is in the correct JSONL format.
4. **Deployment Permission Issues**:
```
Error: Permission denied. Your API key doesn't have deployment permissions.
```
Solution: Use a production API key with deployment permissions or request additional permissions.
5. **Task Bundle Validation Errors**:
```
Error: Missing required files in task bundle: tools.py, reward.py
```
Solution: Ensure your task bundle has all required files.
6. **Model API Key Not Set**:
```
Warning: MODEL_AGENT environment variable is not set
```
Solution: Set the MODEL\_AGENT environment variable or use the --model parameter.
7. **Import Errors with Task Bundle**:
```
Error: Failed to import tool registry from example.task.tools
```
Solution: Check that the Python path is correct and the module can be imported.
### Getting Help
For additional help, use the `--help` flag with any command:
```bash
reward-kit --help
reward-kit preview --help
reward-kit deploy --help
reward-kit agent-eval --help
```
## Next Steps
* Explore the [Developer Guide](../developer_guide/getting_started.md) for conceptual understanding
* Try the [Creating Your First Reward Function](../tutorials/creating_your_first_reward_function.md) tutorial
* Learn about [Agent Evaluation](../developer_guide/agent_evaluation.md) to create your own task bundles
* See [Examples](../examples/basic_reward_function.md) for practical implementations
# Agent Evaluation Framework
Source: https://docs.fireworks.ai/evaluators/developer_guide/agent_evaluation
The Agent Evaluation Framework allows you to evaluate agent models with tool-augmented reasoning using "Task Bundles" - self-contained directories that include all the necessary components for testing and evaluation.
## Task Bundle Structure
A task bundle is a self-contained directory with all the components needed to evaluate an agent:
```
my_task/
├─ reward.py # Reward function with @reward_function decorator
├─ tools.py # Tool registry for this specific task
├─ seed.sql # Initial DB state (optional)
└─ task.jsonl # Dataset rows with task specifications
```
## CLI Usage
The agent evaluation framework is integrated with the Reward Kit CLI through the `agent-eval` command.
### Basic Usage
```bash
# Run agent evaluation on a task bundle
reward-kit agent-eval --task-dir ./flight_task
# You can also specify just the task.jsonl file
reward-kit agent-eval --dataset ./flight_task/task.jsonl
```
### Environment Variables
Models can be specified using environment variables:
```bash
# Set model for agent evaluation
export MODEL_AGENT=openai/gpt-4o
# Set model for simulated user (optional)
export MODEL_SIM=openai/gpt-3.5-turbo
# Then run evaluation
reward-kit agent-eval --task-dir ./flight_task
```
### Advanced Options
```bash
# Specify model directly (overrides environment variable)
reward-kit agent-eval --task-dir ./flight_task --model openai/gpt-4o
# Use custom output directory
reward-kit agent-eval --task-dir ./flight_task --output-dir ./my_runs
# Disable simulated user (use static initial messages only)
reward-kit agent-eval --task-dir ./flight_task --no-sim-user
# Use test mode without requiring API keys
reward-kit agent-eval --task-dir ./flight_task --test-mode
# Use mock response in test mode
reward-kit agent-eval --task-dir ./flight_task --test-mode --mock-response
# Run in debug mode with verbose output
reward-kit agent-eval --task-dir ./flight_task --debug
# Limit the number of tasks to evaluate
reward-kit agent-eval --task-dir ./flight_task --max-tasks 2
# Run specific tasks by ID
reward-kit agent-eval --task-dir ./flight_task --task-ids flight.booking.001,flight.booking.002
# Use a specific registry for a task
reward-kit agent-eval --task-dir ./flight_task --registry-override my_custom_tools.flight_tools
# Use multiple tool registries
reward-kit agent-eval --task-dir ./complex_task --registries flight=flight_tools,hotel=hotel_tools
# Specify evaluator
reward-kit agent-eval --task-dir ./flight_task --evaluator flight_reward.success_evaluator
```
## Testing & Debugging
The CLI provides several options for testing and debugging:
```bash
# Test mode verifies tool setup without making API calls
reward-kit agent-eval --task-dir ./flight_task --test-mode
# Debug mode shows detailed information about tool execution
reward-kit agent-eval --task-dir ./flight_task --debug
# Export tools as OpenAPI spec for manual testing
reward-kit agent-eval --task-dir ./flight_task --export-tools ./tools_spec
# Validate task bundle structure and requirements
reward-kit agent-eval --task-dir ./flight_task --validate-only
```
## Examples
### Basic Flight Task Evaluation
```bash
export MODEL_AGENT=openai/gpt-4o
reward-kit agent-eval --task-dir ./examples/flight_task
```
### Testing Without API Keys
```bash
reward-kit agent-eval --task-dir ./examples/flight_task --test-mode --mock-response
```
### Complex Task with Multiple Tool Registries
```bash
reward-kit agent-eval --task-dir ./examples/travel_task --registries flight=flight_tools,hotel=hotel_tools
```
### Running with Specific Task IDs
```bash
reward-kit agent-eval --task-dir ./examples/flight_task --task-ids flight.booking.001,flight.booking.002
```
### Using Debug Mode
```bash
reward-kit agent-eval --task-dir ./examples/flight_task --debug
```
# Core Data Types
Source: https://docs.fireworks.ai/evaluators/developer_guide/core_data_types
This guide explains the primary data types used in the Reward Kit, including the input and output structures for reward functions.
## Overview
The Reward Kit uses several core data types to represent:
* Conversation messages
* Evaluation results
* Component metrics
Understanding these types is crucial for creating effective reward functions.
## Message Types
### The `Message` Class
```python
from reward_kit import Message
message = Message(
role="assistant",
content="This is the response content",
name=None, # Optional
tool_call_id=None, # Optional, for tool calling
tool_calls=None, # Optional, for tool calling
function_call=None # Optional, for function calling
)
```
The `Message` class represents a single message in a conversation and is compatible with the OpenAI message format.
### Message Dictionary Format
When working with reward functions, messages are often passed as dictionaries:
```python
message_dict = {
"role": "assistant",
"content": "This is the response content"
}
```
The minimum required fields are:
* `role`: The sender of the message (`"user"`, `"assistant"`, or `"system"`)
* `content`: The text content of the message
Additional fields for function/tool calling may include:
* `name`: Name of the sender (for named system messages)
* `tool_calls`: Tool call information
* `function_call`: Function call information (legacy format)
## Evaluation Output Types
### `EvaluateResult` Class
```python
from reward_kit import EvaluateResult, MetricResult
result = EvaluateResult(
score=0.75, # Overall score between 0.0 and 1.0
reason="The response meets quality requirements", # Optional explanation
metrics={ # Component metrics dictionary
"clarity": MetricResult(
score=0.8,
reason="The response is clear and concise"
),
"accuracy": MetricResult(
score=0.7,
reason="Contains one minor factual error"
)
}
)
```
The `EvaluateResult` class represents the complete result of a reward function evaluation, containing:
* An overall score (typically 0.0 to 1.0)
* An optional reason/explanation for the overall score
* A dictionary of component metrics
* An optional error field for handling evaluation failures
### `MetricResult` Class
```python
from reward_kit import MetricResult
metric = MetricResult(
score=0.8, # Score for this specific metric
reason="Explanation for why this score was assigned", # Description
success=True # Optional success indicator
)
```
The `MetricResult` class represents a single component metric in the evaluation, containing:
* A score value (typically 0.0 to 1.0)
* A reason/explanation for the score
* An optional success flag indicating if this metric evaluation was successful
### Deprecated Output Types
The `RewardOutput` and `MetricRewardOutput` classes are deprecated and will be removed in a future version:
```python
# Deprecated - use EvaluateResult instead
from reward_kit import RewardOutput, MetricRewardOutput
# This will show a deprecation warning
result = RewardOutput(
score=0.75,
metrics={
"clarity": MetricRewardOutput(
score=0.8,
reason="The response is clear and concise"
)
}
)
# Convert to the preferred EvaluateResult format
evaluate_result = result.to_evaluate_result()
```
## Type Conversion
The Reward Kit provides built-in methods for converting between types:
```python
# Convert EvaluateResult to RewardOutput (for backwards compatibility)
evaluate_result = EvaluateResult(
score=0.8,
metrics={"quality": MetricResult(score=0.8, reason="Good quality")}
)
reward_output = evaluate_result.to_reward_output()
# Convert RewardOutput to EvaluateResult
reward_output = RewardOutput(
score=0.9,
metrics={"clarity": MetricRewardOutput(score=0.9, reason="Very clear")}
)
evaluate_result = reward_output.to_evaluate_result()
```
## Using Types in Reward Functions
Here's how to use these types properly in your reward functions:
```python
from reward_kit import reward_function, EvaluateResult, MetricResult, Message
from typing import List, Optional, Dict, Any
@reward_function
def my_reward_function(
messages: List[Dict[str, str]],
original_messages: Optional[List[Dict[str, str]]] = None,
metadata: Optional[Dict[str, Any]] = None,
**kwargs
) -> EvaluateResult:
"""
Example reward function with proper type annotations.
"""
# Default values
metadata = metadata or {}
# Get the assistant's response
response = messages[-1].get("content", "")
# Evaluate the response
clarity_score = evaluate_clarity(response)
# Create metrics
metrics = {
"clarity": MetricResult(
score=clarity_score,
reason=f"Clarity score: {clarity_score:.2f}",
success=clarity_score >= 0.7 # Optional success indicator
)
}
return EvaluateResult(
score=clarity_score,
reason=f"Overall quality assessment: {clarity_score:.2f}",
metrics=metrics
)
```
## Best Practices for Data Types
1. **Use EvaluateResult**: Always return EvaluateResult from your reward functions
2. **Use Type Hints**: Include proper type annotations in your functions
3. **Provide Reasons**: Include clear reason strings for both overall score and individual metrics
4. **Use Success Flags**: Set the success flag in MetricResult to indicate pass/fail conditions
5. **Default Values**: Provide defaults for optional parameters
6. **Validation**: Validate input data before processing
7. **Error Handling**: Handle missing or malformed data gracefully (see the sketch below)
8. **Documentation**: Document the expected format for your inputs and outputs
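The sketch below illustrates practices 6 and 7 by validating the input and reporting failures through the `error` field on `EvaluateResult`; the validation rule itself is only an example:

```python
from typing import Any, Dict, List, Optional
from reward_kit import reward_function, EvaluateResult, MetricResult

@reward_function
def safe_reward(
    messages: List[Dict[str, str]],
    original_messages: Optional[List[Dict[str, str]]] = None,
    **kwargs: Any,
) -> EvaluateResult:
    """Validate the input before scoring; report failures via the error field."""
    if not messages or messages[-1].get("role") != "assistant":
        return EvaluateResult(
            score=0.0,
            reason="No assistant response to evaluate",
            metrics={},
            error="Last message is not an assistant response",
        )

    response = messages[-1].get("content") or ""
    score = 1.0 if response.strip() else 0.0
    return EvaluateResult(
        score=score,
        reason="Non-empty response" if score else "Empty response",
        metrics={
            "non_empty": MetricResult(
                score=score,
                reason="Response present" if score else "Response missing or empty",
                success=bool(score),
            )
        },
    )
```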
## Migration from RewardOutput to EvaluateResult
If you have existing code using RewardOutput, here's how to migrate to EvaluateResult:
```python
# Old code (deprecated)
@reward_function
def my_reward(messages, **kwargs):
# ...
return RewardOutput(
score=0.75,
metrics={
"clarity": MetricRewardOutput(score=0.8, reason="Clear explanation")
}
)
# New code (preferred)
@reward_function
def my_reward(messages, **kwargs):
# ...
return EvaluateResult(
score=0.75,
reason="Overall assessment", # Add an overall reason
metrics={
"clarity": MetricResult(
score=0.8,
reason="Clear explanation",
success=True # Add success flag if applicable
)
}
)
```
## Next Steps
Now that you understand the core data types:
1. Learn about [Evaluation Workflows](evaluation_workflows.md) for testing and deploying your functions
2. Explore [Advanced Reward Functions](../examples/advanced_reward_functions.md) to see these types in action
3. Check the [API Reference](../api_reference/data_models.md) for complete details on all data types
# Evaluation Workflows
Source: https://docs.fireworks.ai/evaluators/developer_guide/evaluation_workflows
This guide explains the complete lifecycle of a reward function, from local development and testing to deployment on the Fireworks platform.
## Development Workflow Overview
The typical workflow for developing and deploying reward functions involves:
1. **Local Development**: Writing and testing reward functions locally
2. **Preview Evaluation**: Testing with sample data to validate performance
3. **Deployment**: Making the reward function available for training workflows
4. **Integration**: Using the deployed evaluator in RLHF training
## 1. Local Development
### Creating a Reward Function
Start by creating a reward function with the `@reward_function` decorator:
```python
from reward_kit import reward_function, RewardOutput, MetricRewardOutput
from typing import List, Dict, Optional
@reward_function
def helpfulness_reward(
messages: List[Dict[str, str]],
original_messages: Optional[List[Dict[str, str]]] = None,
**kwargs
) -> RewardOutput:
"""Evaluate the helpfulness of a response."""
# Get the assistant's response
response = messages[-1].get("content", "").lower()
# Define helpful keywords
helpful_keywords = ["help", "assist", "solve", "solution", "answer", "explain"]
# Count helpful keywords
keyword_count = sum(1 for keyword in helpful_keywords if keyword in response)
# Calculate score based on keyword presence (simple example)
score = min(keyword_count / 3, 1.0) # Cap at 1.0
return RewardOutput(
score=score,
metrics={
"helpfulness": MetricRewardOutput(
score=score,
reason=f"Found {keyword_count} helpful keywords"
)
}
)
```
### Local Testing
Test your reward function with sample messages:
```python
# Sample test messages
test_messages = [
{"role": "user", "content": "How do I reset my password?"},
{"role": "assistant", "content": "I can help you reset your password. First, go to the login page and click on 'Forgot Password'. Then follow the instructions sent to your email."}
]
# Test the reward function
result = helpfulness_reward(messages=test_messages)
# Print the results
print(f"Overall Score: {result.score}")
print("Metrics:")
for name, metric in result.metrics.items():
print(f" {name}: {metric.score} - {metric.reason}")
```
### Creating a Test File
For more comprehensive testing, create a separate test script:
```python
# test_helpfulness.py
import json
from reward_kit import RewardOutput
from my_rewards import helpfulness_reward
def load_test_cases(file_path):
"""Load test cases from a JSONL file."""
with open(file_path, 'r') as f:
return [json.loads(line) for line in f]
def main():
# Load test cases
test_cases = load_test_cases("samples/test_conversations.jsonl")
print(f"Testing helpfulness reward on {len(test_cases)} cases...")
# Test each case
for i, case in enumerate(test_cases):
messages = case.get("messages", [])
result = helpfulness_reward(messages=messages)
print(f"\nCase {i+1}:")
print(f"User: {messages[0].get('content', '')[:50]}...")
print(f"Assistant: {messages[-1].get('content', '')[:50]}...")
print(f"Score: {result.score}")
print(f"Reason: {result.metrics.get('helpfulness', {}).get('reason', 'No reason provided')}")
if __name__ == "__main__":
main()
```
## 2. Preview Evaluation
### Creating Sample Data
Create a JSONL file with sample conversations for evaluation:
```json
{"messages": [{"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "I can help you reset your password. First, go to the login page and click on 'Forgot Password'. Then follow the instructions sent to your email."}]}
{"messages": [{"role": "user", "content": "What's the weather today?"}, {"role": "assistant", "content": "I don't have access to real-time information like weather."}]}
```
### Using the CLI for Preview
Use the Reward Kit CLI to preview your evaluation:
```bash
# Activate virtual environment
source .venv/bin/activate
# Preview with the CLI
reward-kit preview \
--metrics-folders "helpfulness=./path/to/helpfulness_metric" \
--samples ./path/to/samples.jsonl
```
### Programmatic Preview
Alternatively, use the API for programmatic preview:
```python
from reward_kit.evaluation import preview_evaluation
# Preview the evaluation
preview_result = preview_evaluation(
metric_folders=["helpfulness=./path/to/helpfulness_metric"],
sample_file="./path/to/samples.jsonl",
max_samples=5 # Optional: limit number of samples
)
# Display the results
preview_result.display()
```
## 3. Deployment
### Direct Deployment from Function
You can deploy the reward function directly:
```python
# Deploy the function
evaluation_id = helpfulness_reward.deploy(
name="helpfulness-evaluator",
description="Evaluates the helpfulness of responses",
force=True # Overwrite if it already exists
)
print(f"Deployed helpfulness evaluator with ID: {evaluation_id}")
```
### Using the CLI for Deployment
Or use the CLI to deploy the function:
```bash
# Deploy with the CLI
reward-kit deploy \
--id helpfulness-evaluator \
--metrics-folders "helpfulness=./path/to/helpfulness_metric" \
--display-name "Helpfulness Evaluator" \
--description "Evaluates the helpfulness of responses" \
--force
```
### Custom Provider Deployment
Deploy with a specific model provider:
```python
# Deploy with a custom provider
custom_evaluation_id = helpfulness_reward.deploy(
name="helpfulness-evaluator-anthropic",
description="Helpfulness evaluation using Claude model",
force=True,
providers=[
{
"providerType": "anthropic",
"modelId": "claude-3-sonnet-20240229"
}
]
)
print(f"Deployed custom provider evaluator: {custom_evaluation_id}")
```
### Using create\_evaluation Function
You can also use the `create_evaluation` function directly:
```python
from reward_kit.evaluation import create_evaluation
# Create an evaluation
evaluator = create_evaluation(
evaluator_id="helpfulness-evaluator",
metric_folders=["helpfulness=./path/to/helpfulness_metric"],
display_name="Helpfulness Evaluator",
description="Evaluates the helpfulness of responses",
force=True
)
print(f"Created evaluator: {evaluator['name']}")
```
## 4. Integration with Training
### Using in an RL Training Job
Once deployed, use the evaluator in an RL training job:
```bash
# Example of using the evaluator in a Fireworks RL job
firectl create rl-job \
--reward-endpoint "https://api.fireworks.ai/v1/evaluations/helpfulness-evaluator" \
--model-id "accounts/fireworks/models/llama-v3-8b-instruct" \
--dataset-id "my-training-dataset"
```
### Programmatic Integration with TRL
For programmatic integration with the Transformer Reinforcement Learning (TRL) library:
```python
from reward_kit import RewardFunction
# Create a reward function instance
reward_fn = RewardFunction(
name="helpfulness-evaluator",
mode="remote" # Use the deployed evaluator
)
# Get a TRL-compatible adapter
trl_reward_fn = reward_fn.get_trl_adapter()
# Use in your TRL training pipeline
# ...
```
## Best Practices
1. **Iterative Development**: Start simple, test thoroughly, and refine your reward function
2. **Version Control**: Use version control for your reward functions and track changes
3. **Sample Diversity**: Test with a diverse set of samples to ensure robustness
4. **Documentation**: Document the behavior and assumptions of your reward function
5. **Error Handling**: Include robust error handling to prevent evaluation failures
6. **Logging**: Add detailed logging for debugging and monitoring
## Next Steps
Now that you understand the complete workflow:
1. Try creating a [Basic Reward Function](../examples/basic_reward_function.md)
2. Explore [Advanced Reward Functions](../examples/advanced_reward_functions.md) with multiple metrics
3. Learn about [Best Practices](../tutorials/best_practices.md) for designing effective reward functions
# Getting Started with Reward Functions
Source: https://docs.fireworks.ai/evaluators/developer_guide/getting_started
This guide will help you understand the basics of creating, testing, and deploying reward functions using the Reward Kit.
## What is a Reward Function?
A reward function is a mechanism for evaluating the quality of model outputs in reinforcement learning from machine feedback (RLMF) workflows. Reward functions help:
* Evaluate model responses based on specific criteria
* Provide numerical scores that can be used to optimize models
* Offer explanations for why specific scores were assigned
## Getting started on [www.fireworks.ai](http://www.fireworks.ai)
You will start your journey on our evaluators page.

Click on "Create Evaluator" on the upper right corner; you will be taken to the rewards page we have been working on.

You can check out how to define an evaluator in our [tutorials](../tutorials) or in our examples of [out-of-the-box evaluators](../examples). But before authoring any evaluators, let's pick a dataset. Let's take a look at eval-result-job17-epoch1.

It is a tool-calling dataset with `messages` and `tools` fields. Let's update the evaluator to run on it. We will say that if the conversation contains more than 3 messages, we have engaged the user long enough and call it a success (score of 1); otherwise it is a failure (score of 0), as sketched below.
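A minimal sketch of that rule as a reward function; the function name is illustrative and the threshold mirrors the description above:

```python
from typing import Dict, List, Optional
from reward_kit import reward_function, RewardOutput, MetricRewardOutput

@reward_function
def engagement_reward(
    messages: List[Dict[str, str]],
    original_messages: Optional[List[Dict[str, str]]] = None,
    **kwargs
) -> RewardOutput:
    """Success (1.0) if the conversation contains more than 3 messages, otherwise failure (0.0)."""
    score = 1.0 if len(messages) > 3 else 0.0
    return RewardOutput(
        score=score,
        metrics={
            "engagement": MetricRewardOutput(
                score=score,
                reason=f"Conversation length: {len(messages)} messages"
            )
        }
    )
```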

## Installation
To get started with Reward Kit, install it via pip:
```bash
pip install reward-kit
```
For development, you may want to install it in editable mode:
```bash
git clone https://github.com/your-organization/reward-kit.git
cd reward-kit
pip install -e .
```
## Authentication Setup
To use Reward Kit with the Fireworks AI platform, set up your authentication credentials:
```bash
# Set your API key
export FIREWORKS_API_KEY=your_api_key
```
For development environments, you might use:
```bash
# Set environment variables for development
export FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY
export FIREWORKS_API_BASE=https://dev.api.fireworks.ai
```
## Basic Reward Function Structure
Here's a simple reward function that evaluates responses based on word count:
```python
from reward_kit import reward_function, RewardOutput, MetricRewardOutput
from typing import List, Dict, Optional
@reward_function
def word_count_reward(
messages: List[Dict[str, str]],
original_messages: Optional[List[Dict[str, str]]] = None,
**kwargs
) -> RewardOutput:
"""
Evaluate a response based on its word count.
Args:
messages: List of conversation messages
original_messages: Original messages (usually without the response being evaluated)
**kwargs: Additional parameters
Returns:
RewardOutput with score and metrics information
"""
# Get the assistant's response (last message)
if not messages or messages[-1].get("role") != "assistant":
return RewardOutput(
score=0.0,
metrics={"error": MetricRewardOutput(score=0.0, reason="No assistant response found")}
)
response = messages[-1].get("content", "")
# Count words and calculate score
word_count = len(response.split())
score = min(word_count / 100, 1.0) # Cap at 1.0
return RewardOutput(
score=score,
metrics={
"word_count": MetricRewardOutput(
score=score,
reason=f"Word count: {word_count}"
)
}
)
```
## Testing Your Reward Function
You can test your reward function with sample conversations:
```python
# Sample conversation
test_messages = [
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is a method of data analysis that automates analytical model building."}
]
# Test the reward function
result = word_count_reward(messages=test_messages)
print(f"Score: {result.score}")
print(f"Explanation: {result.metrics['word_count'].reason}")
```
## Previewing Your Reward Function
Before deployment, you can preview how your reward function performs on a set of sample data:
```bash
# Using the CLI
reward-kit preview --metrics-folders "word_count=./path/to/metric" --samples ./path/to/samples.jsonl
```
Or programmatically:
```python
from reward_kit.evaluation import preview_evaluation
preview_result = preview_evaluation(
metric_folders=["word_count=./path/to/metric"],
sample_file="./path/to/samples.jsonl"
)
# Display the results
preview_result.display()
```
## Deploying Your Reward Function
When you're ready, deploy your reward function to use in training workflows:
```python
# Deploy programmatically
evaluator_id = word_count_reward.deploy(
name="word-count-evaluator",
description="Evaluates responses based on word count"
)
print(f"Deployed with ID: {evaluator_id}")
```
Or using the CLI:
```bash
reward-kit deploy --id word-count-evaluator --metrics-folders "word_count=./path/to/metric" --force
```
## Next Steps
Now that you understand the basics of reward functions:
1. Learn about [Reward Function Anatomy](reward_function_anatomy.md) for deeper insights
2. Explore [Core Data Types](core_data_types.md) to understand the components
3. Try creating advanced reward functions with [multiple metrics](../examples/advanced_reward_functions.md)
4. Follow our [step-by-step tutorial](../tutorials/creating_your_first_reward_function.md) for a complete walkthrough
Source: https://docs.fireworks.ai/evaluators/developer_guide/reward_function_anatomy
# Reward Function Anatomy
This guide provides a detailed explanation of how reward functions are structured in the Reward Kit, focusing on the `@reward_function` decorator and the components that make up a complete reward function.
## The `@reward_function` Decorator
The `@reward_function` decorator is the core mechanism that transforms a regular Python function into a reward function that can be used for evaluation and deployment.
```python
from reward_kit import reward_function
@reward_function
def my_reward_function(messages, original_messages=None, **kwargs):
# Your evaluation logic here
return RewardOutput(...)
```
### What the Decorator Does
The `@reward_function` decorator performs several important functions:
1. **Input Validation**: Ensures the function receives the expected parameters
2. **Output Standardization**: Ensures the function returns a properly formatted `RewardOutput` object
3. **Deployment Capability**: Adds a `.deploy()` method to the function for easy deployment
4. **Backward Compatibility**: Handles legacy return formats (tuples of score and metrics)
### Under the Hood
Internally, the decorator wraps your function with logic that:
1. Processes the input parameters
2. Calls your function with the standardized inputs
3. Handles any exceptions that occur during execution
4. Formats the output as a `RewardOutput` object
5. Provides deployment capabilities through the `.deploy()` method
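To illustrate the backward-compatibility point, a legacy-style function might return a plain `(score, metrics)` tuple and rely on the decorator to normalize it into a `RewardOutput`. The exact shape of the legacy metrics dictionary shown here is an assumption:
```python
from reward_kit import reward_function


@reward_function
def legacy_style_reward(messages, original_messages=None, **kwargs):
    response = messages[-1].get("content", "") if messages else ""
    score = 1.0 if response else 0.0
    # Legacy return format: a (score, metrics) tuple instead of RewardOutput.
    # The metrics structure below is illustrative only.
    return score, {"non_empty": {"score": score, "reason": "Response is non-empty"}}
```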
## Function Parameters
A standard reward function has these parameters:
```python
def reward_function(
messages: List[Dict[str, str]],
original_messages: Optional[List[Dict[str, str]]] = None,
**kwargs
) -> RewardOutput:
# ...
```
### Required Parameters
* **`messages`**: A list of message dictionaries in the conversation, where each message has at least `"role"` and `"content"` keys. The last message is typically the one being evaluated.
### Optional Parameters
* **`original_messages`**: The conversation context, usually messages before the response being evaluated. If not provided, it defaults to `messages[:-1]`.
* **`**kwargs`**: Additional parameters that can be used to customize the evaluation.
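Putting the parameters together, a call typically passes the full conversation as `messages`, the preceding context as `original_messages`, and any extra settings through keyword arguments. The extra keyword below is purely illustrative, and `my_reward_function` refers to the example defined above:
```python
conversation = [
    {"role": "user", "content": "Summarize photosynthesis in one sentence."},
    {"role": "assistant", "content": "Plants convert light into chemical energy."},
]

result = my_reward_function(
    messages=conversation,
    original_messages=conversation[:-1],  # the default when omitted
    strictness="high",                    # illustrative extra kwarg forwarded via **kwargs
)
print(result.score)
```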
## Return Value
A reward function must return a `RewardOutput` object:
```python
return RewardOutput(
score=0.75, # Overall score between 0.0 and 1.0
metrics={ # Component metrics
"clarity": MetricRewardOutput(
score=0.8,
reason="The response clearly explains the concept"
),
"accuracy": MetricRewardOutput(
score=0.7,
reason="Contains one minor factual error"
)
}
)
```
### RewardOutput Structure
* **`score`**: The final aggregate score (typically between 0.0 and 1.0)
* **`metrics`**: A dictionary of component metrics, each with its own score and explanation
## Multi-Component Reward Functions
Complex reward functions often evaluate multiple aspects of a response:
```python
@reward_function
def comprehensive_evaluation(messages, original_messages=None, **kwargs):
response = messages[-1]["content"]
metrics = {}
# Evaluate clarity
clarity_score = evaluate_clarity(response)
metrics["clarity"] = MetricRewardOutput(
score=clarity_score,
reason=f"Clarity score: {clarity_score:.2f}"
)
# Evaluate accuracy
accuracy_score = evaluate_accuracy(response)
metrics["accuracy"] = MetricRewardOutput(
score=accuracy_score,
reason=f"Accuracy score: {accuracy_score:.2f}"
)
# Combine scores (weighted average)
final_score = clarity_score * 0.4 + accuracy_score * 0.6
return RewardOutput(score=final_score, metrics=metrics)
```
## Deployment Capabilities
The `@reward_function` decorator adds a `.deploy()` method to your function:
```python
# Deploy the function to Fireworks
evaluation_id = my_reward_function.deploy(
name="my-evaluator",
description="Evaluates responses based on custom criteria",
force=True # Overwrite if already exists
)
```
### Deploy Method Parameters
* **`name`**: ID for the deployed evaluator (required)
* **`description`**: Human-readable description (optional)
* **`force`**: Whether to overwrite an existing evaluator with the same name (optional)
* **`providers`**: List of model providers to use for evaluation (optional)
## Error Handling
Robust reward functions include proper error handling:
```python
@reward_function
def safe_evaluation(messages, original_messages=None, **kwargs):
try:
# Ensure we have a valid response to evaluate
if not messages or messages[-1].get("role") != "assistant":
return RewardOutput(
score=0.0,
metrics={"error": MetricRewardOutput(
score=0.0,
reason="No assistant response found"
)}
)
# Your evaluation logic here
# ...
except Exception as e:
# Handle any unexpected errors
return RewardOutput(
score=0.0,
metrics={"error": MetricRewardOutput(
score=0.0,
reason=f"Evaluation error: {str(e)}"
)}
)
```
## Working with Metadata
You can pass additional configuration through the `**kwargs` parameter:
```python
@reward_function
def configurable_evaluation(messages, original_messages=None, metadata=None, **kwargs):
"""Reward function that supports configuration via metadata."""
metadata = metadata or {}
# Get configurable thresholds from metadata
min_length = metadata.get("min_length", 50)
max_score = metadata.get("max_score", 1.0)
weight_factor = metadata.get("weight_factor", 1.0)
# Use these parameters in your evaluation
# ...
# Apply any metadata-based adjustments to the final score
final_score = base_score * weight_factor
return RewardOutput(score=final_score, metrics=metrics)
```
When calling the function, you can pass this metadata:
```python
result = configurable_evaluation(
messages=test_messages,
metadata={"min_length": 100, "weight_factor": 1.2}
)
```
## Next Steps
Now that you understand the structure of reward functions:
1. Learn about the [Core Data Types](core_data_types.md) used in reward functions
2. Explore [Evaluation Workflows](evaluation_workflows.md) for testing and deployment
3. See [Code Examples](../examples/basic_reward_function.md) for practical implementations
Source: https://docs.fireworks.ai/evaluators/documentation_home
# Reward Kit Documentation
Welcome to the Reward Kit documentation. This guide will help you create, test, and deploy reward functions for evaluating and optimizing LLM responses.

## Getting Started
### Developer Guide
* [Getting Started with Reward Functions](developer_guide/getting_started): Learn the basics of reward functions
* [Reward Function Anatomy](developer_guide/reward_function_anatomy): Understand the structure of reward functions
* [Core Data Types](developer_guide/core_data_types): Explore the data models used in reward functions
* [Evaluation Workflows](developer_guide/evaluation_workflows): Learn the complete lifecycle from development to deployment
### Examples and Built-in Reward Functions
* [Reward Functions Overview](examples/reward_functions_overview): Overview of all built-in reward functions
* [Basic Reward Function](examples/basic_reward_function): Simple example evaluating response clarity
* [Advanced Reward Functions](examples/advanced_reward_functions): More complex examples with multiple metrics
#### Built-in Reward Function Documentation
* [Code Execution Evaluation](examples/code_execution_evaluation): Evaluate code by running it locally
* [Code Execution with E2B](examples/code_execution_with_e2b): Evaluate code using E2B cloud sandbox
* [Function Calling Evaluation](examples/function_calling_evaluation): Evaluate function calls made by AI models
* [JSON Schema Validation](examples/json_schema_validation): Validate JSON outputs against schemas
* [Math Evaluation](examples/math_evaluation): Evaluate mathematical answers in responses
### Tutorials
* [Creating Your First Reward Function](tutorials/creating_your_first_reward_function): Step-by-step guide to creating a reward function
## API Reference
* Coming soon
## Command Line Interface
* Coming soon
## Best Practices
* Coming soon
## Community and Support
* GitHub Issues: Report bugs and request features
* Contributing Guide: How to contribute to the Reward Kit project
Source: https://docs.fireworks.ai/evaluators/examples/accuracy_length/accuracy_length_overview
# Accuracy + Length Reward Examples
This directory contains examples demonstrating the use of combined accuracy and length-based reward functions.
## Overview
These examples show how to use the `cosine_scaled_accuracy_length_reward` function to evaluate model responses based on both:
1. Accuracy (correctness of the answer)
2. Length efficiency (brevity of the response)
This combined approach rewards responses that are both accurate and concise, penalizing verbosity in correct answers and providing a clear separation between correct and incorrect responses.
**Note**: Accuracy detection relies on text-extraction heuristics that may need to be customized for different types of content via the `extract_fn` and `compare_fn` parameters; a sketch of such customization is shown below.
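For example, a numeric QA task might plug in custom extraction and comparison callbacks. The exact signatures of `extract_fn` and `compare_fn` are assumptions here (response text in, candidate answer out; extracted answer plus ground truth in, a 0-1 score out), so treat this as a sketch rather than the definitive interface:
```python
import re

from reward_kit.rewards.accuracy_length import cosine_scaled_accuracy_length_reward


def extract_final_number(text: str) -> str:
    """Assumed contract: pull a candidate answer out of the raw response text."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else ""


def compare_numbers(extracted: str, ground_truth: str) -> float:
    """Assumed contract: return a 0-1 similarity between answer and ground truth."""
    try:
        return 1.0 if float(extracted) == float(ground_truth) else 0.0
    except ValueError:
        return 0.0


result = cosine_scaled_accuracy_length_reward(
    messages=messages,
    ground_truth="42",
    extract_fn=extract_final_number,
    compare_fn=compare_numbers,
)
```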
## Examples
### Cosine-Scaled Accuracy + Length Example
The [cosine\_scaled\_example.py](./cosine_scaled_example.py) script demonstrates the reward function's behavior with different types of responses:
* Short correct answers (highest score)
* Long correct answers (moderate score)
* Short incorrect answers (very low score)
* Long incorrect answers (low score, but still penalized for being wrong)
It also shows how to customize the weighting between accuracy and length components.
## Running the Examples
```bash
# Make sure you're in the reward-kit directory
cd /path/to/reward-kit
# Activate the virtual environment
source .venv/bin/activate
# Run the example
python examples/accuracy_length/cosine_scaled_example.py
```
## Expected Output
```
===== Evaluating with Default Parameters =====
Short Correct Answer:
Response (1 words): "Paris..."
Combined Score: 1.00
Accuracy Score: 1.00
Length Score: 1.00
Long Correct Answer:
Response (69 words): "The capital of France is Paris. Paris is located i..."
Combined Score: 0.88
Accuracy Score: 1.00
Length Score: 0.61
Short Incorrect Answer:
Response (1 words): "Lyon..."
Combined Score: 0.00
Accuracy Score: 0.00
Length Score: 0.00
Long Incorrect Answer:
Response (46 words): "I need to identify the capital city of France. Fra..."
Combined Score: 0.04
Accuracy Score: 0.00
Length Score: 0.13
===== Evaluating with Custom Parameters =====
Short Correct Answer (80% accuracy weight, 20% length weight):
Response (1 words): "Paris..."
Combined Score: 1.00
Accuracy Score: 1.00
Length Score: 1.00
```
## Custom Configurations
You can customize the reward function with various parameters:
```python
from reward_kit.rewards.accuracy_length import cosine_scaled_accuracy_length_reward
result = cosine_scaled_accuracy_length_reward(
messages=messages,
ground_truth="Expected answer",
max_length=500, # Maximum ideal length
correctness_weight=0.7, # Weight for accuracy component
length_weight=0.3, # Weight for length component
min_value_correct=0.5, # Minimum score for correct answers
max_value_correct=1.0, # Maximum score for correct answers
min_value_wrong=0.0, # Minimum score for wrong answers
max_value_wrong=0.3, # Maximum score for wrong answers
token_method="whitespace" # Method to count tokens
)
```
## Use Cases
This reward function is particularly useful for:
* Factual QA tasks where concise, correct answers are preferred
* Text summarization evaluation
* Mathematical problem-solving with step-by-step reasoning
* Any task where both accuracy and brevity are important
## Further Reading
For more information, see:
* [Combined Metrics Rewards Documentation](../../docs/examples/combined_metrics_rewards.md)
* [Reward Functions Overview](../../docs/examples/reward_functions_overview.md)
Source: https://docs.fireworks.ai/evaluators/examples/code_execution_evaluation
# Code Execution Evaluation
This guide demonstrates how to evaluate code solutions using the Reward Kit's code execution reward functions.
## Overview
The code execution reward functions allow you to:
1. Extract code blocks from LLM responses
2. Execute the code in a secure environment
3. Compare the output with expected results
4. Get detailed execution metrics and error reports
## Prerequisites
Before using the code execution rewards, ensure you have:
1. **Python 3.8+** installed on your system
2. **Reward Kit** installed: `pip install reward-kit`
3. For JavaScript evaluation: **Node.js** installed on your system
## Available Reward Functions
Reward Kit provides two main methods for code execution evaluation:
1. **Local Code Execution**: Executes code securely on your local machine
2. **E2B Code Execution**: Executes code in a cloud sandbox (requires E2B account)
## Local Code Execution
### Basic Usage
Here's a simple example of evaluating Python code:
````python
from reward_kit.rewards.code_execution import local_code_execution_reward
# Example conversation with a coding task
messages = [
{
"role": "user",
"content": "Write a Python function to calculate the factorial of a number."
},
{
"role": "assistant",
"content": """Here's a Python function to calculate the factorial of a number:
```python
def factorial(n):
if n == 0 or n == 1:
return 1
else:
return n * factorial(n - 1)
# Test the function
print(factorial(5))
```
This function uses recursion to calculate the factorial. For n = 5, it should output 120."""
}
]
# Evaluate the code
result = local_code_execution_reward(
    messages=messages,
    expected_output="120",
    language="python",
    timeout=5
)
# Print the results
print(f"Score: {result.score}")
print("Metrics:")
for name, metric in result.metrics.items():
print(f" {name}: {metric.score}")
print(f" {metric.reason}")
````
### How It Works
The local code execution reward function:
1. Extracts code blocks from the last assistant message
2. Creates a secure, isolated environment for execution
3. Runs the code with timeout and resource limits
4. Captures stdout, stderr, and exit status
5. Compares the output with expected results if provided
6. Returns detailed metrics about execution and output matching
### Security Features
The local code execution uses multiple security layers:
- **Process Isolation**: Code runs in a separate process
- **Resource Limits**: Restricts memory usage and CPU time
- **Filesystem Restrictions**: Disables destructive file operations
- **System Call Restrictions**: Prevents access to sensitive system calls
- **Timeout Enforcement**: Terminates long-running code
- **Safe Libraries**: Disables potentially dangerous library functions
### JavaScript Execution
You can also evaluate JavaScript code:
````python
from reward_kit.rewards.code_execution import local_code_execution_reward
# Example with JavaScript code
messages = [
{
"role": "user",
"content": "Write a JavaScript function to check if a string is a palindrome."
},
{
"role": "assistant",
"content": """Here's a JavaScript function to check if a string is a palindrome:
```javascript
function isPalindrome(str) {
// Remove non-alphanumeric characters and convert to lowercase
const cleanStr = str.toLowerCase().replace(/[^a-z0-9]/g, '');
// Compare with its reverse
const reversedStr = cleanStr.split('').reverse().join('');
return cleanStr === reversedStr;
}
// Test the function
console.log(isPalindrome("A man, a plan, a canal: Panama")); // Should output true
console.log(isPalindrome("hello")); // Should output false
```
This function removes any non-alphanumeric characters and converts the string to lowercase before checking if it reads the same forward and backward."""
}
]
# Evaluate the JavaScript code
result = local_code_execution_reward(
    messages=messages,
    expected_output="true\nfalse",
    language="javascript",
    timeout=5
)
````
### Advanced Options
You can customize the execution with various parameters:
```python
from reward_kit.rewards.code_execution import local_code_execution_reward
# Custom execution parameters
result = local_code_execution_reward(
messages=messages,
expected_output="120",
language="python",
timeout=10, # Longer timeout for complex code
max_memory_mb=200 # Higher memory limit
)
```
### Automatic Expected Output Extraction
If the expected output is mentioned in the conversation, it can be extracted automatically:
````python
from reward_kit.rewards.code_execution import local_code_execution_reward
# Conversation with expected output mentioned in the prompt
messages = [
{
"role": "user",
"content": "Write a function to calculate the sum of numbers from 1 to n. For n=5, the expected output is 15."
},
{
"role": "assistant",
"content": """Here's a function to calculate the sum of numbers from 1 to n:
```python
def sum_to_n(n):
return sum(range(1, n+1))
# Test the function
print(sum_to_n(5))
```
This function uses the built-in sum() and range() functions to calculate the sum efficiently."""
}
]
# Extract expected output from the conversation
result = local_code_execution_reward(
    messages=messages,
    original_messages=messages,  # Provide the original messages for extraction
    language="python"
)
````
## E2B Code Execution
For more information on using E2B for code execution, see the dedicated guide: [Code Execution with E2B](code_execution_with_e2b.md).
## Output Comparison
The code execution reward functions use sophisticated output comparison methods to handle various output formats.
### Exact Matching
For simple outputs, exact matching is used:
```
Expected: "Hello, world!"
Actual: "Hello, world!"
Score: 1.0
```
### Numeric Comparison
For numeric outputs, relative difference is calculated:
```
Expected: "42"
Actual: "42.001"
Score: 0.99 # Very close match
```
### Array/List Comparison
For arrays and lists, both structure and content are compared:
```
Expected: "[1, 2, 3]"
Actual: "[1, 2, 3, 4]"
Score: 0.75 # Partial match
```
### Multiline Text Comparison
For multiline output, line-by-line comparison is used:
```
Expected: "Line 1\nLine 2\nLine 3"
Actual: "Line 1\nLine 2\nLine X"
Score: 0.89 # Most lines match
```
## Use Cases
### Coding Assessment
Evaluate code solutions to programming problems:
```python
from reward_kit.rewards.code_execution import local_code_execution_reward
# Define a coding problem
problem = "Write a function that finds the largest number in a list."
expected_output = "9"
# User's solution
solution = """
def find_largest(numbers):
return max(numbers)
print(find_largest([5, 2, 9, 3, 7]))
"""
# Create message format
messages = [
{"role": "user", "content": problem},
{"role": "assistant", "content": f"```python\n{solution}\n```"}
]
# Evaluate the solution
result = local_code_execution_reward(
messages=messages,
expected_output=expected_output,
language="python"
)
print(f"Score: {result.score}")
```
### Algorithm Comparison
Compare different algorithms for the same problem:
````python
from reward_kit.rewards.code_execution import local_code_execution_reward
import time
# Problem: Find all prime numbers less than 100
# Solution 1: Simple approach
solution1 = """
def is_prime(n):
if n <= 1:
return False
for i in range(2, n):
if n % i == 0:
return False
return True
primes = []
for num in range(2, 100):
if is_prime(num):
primes.append(num)
print(len(primes))
"""
# Solution 2: Optimized approach
solution2 = """
def sieve_of_eratosthenes(limit):
primes = []
sieve = [True] * (limit + 1)
sieve[0] = sieve[1] = False
for num in range(2, limit + 1):
if sieve[num]:
primes.append(num)
for multiple in range(num * num, limit + 1, num):
sieve[multiple] = False
return primes
print(len(sieve_of_eratosthenes(99)))
"""
# Expected output: 25 prime numbers less than 100
expected_output = "25"
# Evaluate solutions
solutions = [solution1, solution2]
results = []
for i, solution in enumerate(solutions, 1):
messages = [
{"role": "user", "content": "Find all prime numbers less than 100 and print the count."},
{"role": "assistant", "content": f"```python\n{solution}\n```"}
]
start_time = time.time()
result = local_code_execution_reward(
messages=messages,
expected_output=expected_output,
language="python",
timeout=10
)
execution_time = time.time() - start_time
results.append({
"solution": i,
"score": result.score,
"execution_time": execution_time
})
# Compare results
for res in results:
print(f"Solution {res['solution']}: Score={res['score']}, Time={res['execution_time']:.4f}s")
````
### Multiple Language Support
Evaluate solutions in different programming languages:
````python
from reward_kit.rewards.code_execution import local_code_execution_reward
# Problem: Check if a number is even
problem = "Write a function to check if a number is even. Test it with the numbers 4 and 7."
expected_output = "true\nfalse"
# Python solution
python_solution = """
def is_even(number):
return number % 2 == 0
print(is_even(4))
print(is_even(7))
"""
# JavaScript solution
js_solution = """
function isEven(number) {
return number % 2 === 0;
}
console.log(isEven(4));
console.log(isEven(7));
"""
# Evaluate both solutions
languages = ["python", "javascript"]
solutions = [python_solution, js_solution]
for lang, solution in zip(languages, solutions):
messages = [
{"role": "user", "content": problem},
{"role": "assistant", "content": f"```{lang}\n{solution}\n```"}
]
result = local_code_execution_reward(
messages=messages,
expected_output=expected_output,
language=lang
)
print(f"{lang.capitalize()} solution score: {result.score}")
````
## Best Practices
1. **Security First**: Always use the built-in security mechanisms and don't disable them
2. **Timeout Setting**: Choose reasonable timeouts based on task complexity
3. **Expected Output**: Be specific about expected output format for accurate comparison
4. **Error Handling**: Check execution error metrics even when code runs successfully
5. **Resource Limits**: Set appropriate memory limits for the complexity of the code
6. **Test Environment**: Ensure required dependencies are available in the execution environment
7. **Edge Cases**: Test the reward function with a variety of inputs, including edge cases
## Limitations
* Cannot evaluate non-deterministic code reliably
* Limited to languages supported by the reward function (Python and JavaScript for local execution)
* Cannot evaluate code that requires external resources (databases, APIs, etc.) without mocking
* May have limitations with GUI applications or complex I/O operations
* Security mechanisms may prevent some valid code from executing
## Next Steps
* For cloud-based code execution, see [Code Execution with E2B](code_execution_with_e2b.md)
* Learn about [Function Calling Evaluation](function_calling_evaluation.md) for evaluating tool use
* Explore [JSON Schema Validation](json_schema_validation.md) for structured outputs
* See [Creating Custom Reward Functions](../tutorials/creating_your_first_reward_function.md) to build your own evaluators
Source: https://docs.fireworks.ai/evaluators/examples/code_execution_with_e2b
# Code Execution with E2B
This guide demonstrates how to use the E2B code execution reward function to evaluate code by running it in the E2B cloud sandbox.
## Overview
The `e2b_code_execution_reward` function allows you to:
1. Extract code blocks from LLM responses
2. Execute the code securely in E2B's cloud sandbox
3. Compare the output with expected results
4. Generate a score and detailed metrics
## Prerequisites
To use the E2B code execution reward function, you need:
1. An E2B API key from [E2B Dashboard](https://e2b.dev/dashboard)
2. The `e2b_code_interpreter` Python package installed: `pip install e2b_code_interpreter`
Note: The code will also work with the `e2b` package, but `e2b_code_interpreter` is recommended as it provides a more stable interface specifically designed for code execution.
## Basic Usage
Here's a simple example of how to use the reward function:
````python
from reward_kit.rewards.code_execution import e2b_code_execution_reward
# Example conversation with a Python coding task
messages = [
{
"role": "user",
"content": "Write a Python function to calculate the factorial of a number."
},
{
"role": "assistant",
"content": """Here's a Python function to calculate the factorial of a number:
```python
def factorial(n):
if n == 0 or n == 1:
return 1
else:
return n * factorial(n - 1)
# Test the function
print(factorial(5)) # Should output 120
```
This function uses recursion to calculate the factorial. For n = 5, it computes 5 * 4 * 3 * 2 * 1 = 120."""
}
]
# Define expected output
expected_output = "120"
# Evaluate the code using E2B
result = e2b_code_execution_reward(
    messages=messages,
    expected_output=expected_output,
    language="python",
    api_key="your_e2b_api_key",
    timeout=10
)
# Use the results
print(f"Score: {result.score}")
for metric_name, metric in result.metrics.items():
    print(f"\n{metric_name}: {metric.reason}")
````
## Supported Languages
The E2B code execution reward function currently supports:
- Python (`language="python"`)
- JavaScript (`language="javascript"` or `language="js"`)
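As a quick sketch, a JavaScript snippet can be evaluated with the same call shape as the Python example above, only changing the `language` argument (all parameters used here are listed in the Parameters section below):
```python
from reward_kit.rewards.code_execution import e2b_code_execution_reward

messages = [
    {"role": "user", "content": "Write a JavaScript snippet that prints 2 + 2."},
    {"role": "assistant", "content": "```javascript\nconsole.log(2 + 2);\n```"},
]

result = e2b_code_execution_reward(
    messages=messages,
    expected_output="4",
    language="javascript",
    api_key="your_e2b_api_key",
)
print(f"Score: {result.score}")
```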
## Advanced Options
### Automatic Output Extraction
You can let the reward function automatically extract the expected output from the prompt:
````python
# Conversation with expected output in the prompt
messages = [
{
"role": "user",
"content": "Write a Python function to find the sum of a list. Expected output: 15 (for [1,2,3,4,5])"
},
{
"role": "assistant",
"content": """```python
def sum_list(numbers):
return sum(numbers)
print(sum_list([1, 2, 3, 4, 5]))
```"""
}
]
# Pass the original messages for expected output extraction
result = e2b_code_execution_reward(
messages=messages,
original_messages=messages,
language="python",
api_key="your_e2b_api_key"
)
````
### Fallback to Local Execution
You can gracefully fall back to local execution when an E2B API key is not available:
```python
import os

from reward_kit.rewards.code_execution import (
e2b_code_execution_reward,
local_code_execution_reward
)
# Try to use E2B if API key is provided
api_key = os.environ.get("E2B_API_KEY")
if api_key:
result = e2b_code_execution_reward(
messages=messages,
expected_output=expected_output,
language="python",
api_key=api_key
)
else:
# Fall back to local execution
result = local_code_execution_reward(
messages=messages,
expected_output=expected_output,
language="python"
)
```
## Parameters
The `e2b_code_execution_reward` function accepts the following parameters:
| Parameter | Type | Description |
| ------------------- | ---------------------- | -------------------------------------------------------------------- |
| `messages` | List\[Dict\[str, str]] | Generated conversation messages (required) |
| `original_messages` | List\[Dict\[str, str]] | Original conversation context (optional) |
| `expected_output` | str | Expected output from code execution (optional) |
| `language` | str | Programming language of the code (default: "python") |
| `timeout` | int | Maximum execution time in seconds (default: 30) |
| `api_key` | str | E2B API key (default: None, uses E2B\_API\_KEY environment variable) |
## Return Value
The reward function returns a `RewardOutput` object with:
* `score`: A float between 0.0 and 1.0 indicating how well the code performed
* `metrics`: A dictionary of `MetricRewardOutput` objects with detailed information about the execution
Key metrics include:
* `extracted_code`: The code that was extracted and executed
* `expected_output`: The expected output (if provided or extracted)
* `execution_result`: Details about the execution (success or failure)
* `output_match`: Comparison between actual and expected outputs
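Assuming `result` comes from a call like the one in the Basic Usage example, the documented metric entries can be inspected individually:
```python
for key in ("extracted_code", "expected_output", "execution_result", "output_match"):
    metric = result.metrics.get(key)
    if metric is not None:
        print(f"{key}: score={metric.score}")
        print(f"  {metric.reason}")
```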
## Examples
See the `examples/` directory for complete examples:
* `e2b_reward_example.py`: Basic Python example
* `e2b_javascript_example.py`: JavaScript example
* `e2b_auto_extract_example.py`: Automatic output extraction example
* `e2b_fallback_example.py`: Fallback to local execution example
Source: https://docs.fireworks.ai/evaluators/examples/combined_metrics_rewards
# Combined Metrics Rewards
This guide focuses on reward functions that combine multiple evaluation aspects into a single score. These combined metrics provide a more comprehensive assessment of model responses.
## Introduction to Combined Metrics
In real-world evaluation scenarios, we often want to consider multiple aspects of quality simultaneously. For example:
* Responses should be both accurate AND concise
* Code solutions should be both correct AND efficient
* Explanations should be both clear AND well-structured
Combined metric rewards allow you to assess multiple dimensions in a single reward function with appropriate weightings.
## Available Combined Metric Rewards
### Cosine-Scaled Accuracy + Length Reward
The `cosine_scaled_accuracy_length_reward` function combines accuracy evaluation with length efficiency into a unified score. Note that this function depends on the accuracy detection mechanisms, which may need customization for different types of content through the `extract_fn` and `compare_fn` parameters.
```python
from reward_kit.rewards.accuracy_length import cosine_scaled_accuracy_length_reward
result = cosine_scaled_accuracy_length_reward(
messages=messages,
ground_truth="Paris",
max_length=200,
correctness_weight=0.7, # Weight for accuracy component
length_weight=0.3 # Weight for length component
)
```
#### Key Features
* **Dual Evaluation**: Assesses both factual accuracy and response length
* **Cosine Scaling**: Uses cosine scheduling to reward brevity in correct responses
* **Weighted Components**: Allows customizing the importance of accuracy vs. length
* **Asymmetric Penalties**: Handles correct and incorrect responses differently
* For correct answers: shorter is better (higher reward)
* For incorrect answers: longer explanations are penalized less (encouraging showing work)
* **Customizable Parameters**: Flexible configuration for different use cases
#### How It Works
1. **Accuracy Evaluation**:
* Extracts an answer from the response
* Compares to ground truth with semantic matching
* Produces an accuracy score (0.0-1.0)
2. **Length Evaluation**:
* Counts tokens in the response
* Applies cosine scaling based on token count vs. max\_length
* Produces a length score (0.0-1.0)
3. **Combined Scoring**:
* Weighted average of accuracy and length scores
* Clear separation between correct and incorrect answers
* Final score prioritizes accuracy while considering length
#### Parameters
| Parameter | Type | Default | Description |
| -------------------- | ------------------- | ------------ | ------------------------------------------- |
| `messages` | List\[Dict/Message] | Required | Conversation messages to evaluate |
| `ground_truth` | str | None | Expected correct answer |
| `extract_fn` | Callable | None | Custom function to extract answer from text |
| `compare_fn` | Callable | None | Custom function to compare answers |
| `max_length` | int | 1000 | Maximum token length for scaling |
| `min_value_wrong` | float | 0.0 | Minimum reward for wrong answers |
| `max_value_wrong` | float | 0.3 | Maximum reward for wrong answers |
| `min_value_correct` | float | 0.5 | Minimum reward for correct answers |
| `max_value_correct` | float | 1.0 | Maximum reward for correct answers |
| `token_method` | str | "whitespace" | Method to count tokens |
| `correctness_weight` | float | 0.7 | Weight for accuracy component |
| `length_weight` | float | 0.3 | Weight for length component |
#### Return Value
An `EvaluateResult` object with:
* **score**: Combined weighted score (0.0-1.0)
* **reason**: Detailed explanation of the evaluation
* **metrics**:
* **combined\_reward**: Overall evaluation result
* **accuracy**: Accuracy component evaluation
* **length**: Length component evaluation
* **token\_count**: Token count details
#### Example
```python
from reward_kit.rewards.accuracy_length import cosine_scaled_accuracy_length_reward
# Define test messages
messages = [
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."}
]
# Evaluate with cosine-scaled accuracy + length reward
result = cosine_scaled_accuracy_length_reward(
messages=messages,
ground_truth="Paris",
max_length=200,
correctness_weight=0.7,
length_weight=0.3
)
# Print results
print(f"Combined score: {result['score']}")
print(f"Accuracy score: {result['metrics']['accuracy']['score']}")
print(f"Length score: {result['metrics']['length']['score']}")
print(f"Reason: {result['reason']}")
```
#### Use Cases
* **Factual QA**: Reward concise, correct answers over verbose ones
* **Mathematical problems**: Evaluate correctness while encouraging brevity
* **Knowledge retrieval**: Balance accuracy with response length
* **Instruction following**: Ensure responses are both correct and appropriately sized
#### Advanced Configuration
Fine-tune the behavior with these parameter adjustments:
* **Encouraging brevity**: Increase `length_weight` and decrease `max_length`
* **Prioritizing accuracy**: Increase `correctness_weight` and decrease `length_weight`
* **Allowing detailed explanations**: Increase `max_length` while maintaining weighting
* **Strict scoring**: Increase gap between `max_value_wrong` and `min_value_correct`
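For instance, a configuration that leans toward strict, brevity-focused scoring might look like the sketch below, using only the parameters documented above (the specific values are illustrative):
```python
from reward_kit.rewards.accuracy_length import cosine_scaled_accuracy_length_reward

result = cosine_scaled_accuracy_length_reward(
    messages=messages,
    ground_truth="Paris",
    max_length=100,          # tighter length budget
    correctness_weight=0.6,
    length_weight=0.4,       # reward brevity more heavily
    max_value_wrong=0.1,     # cap wrong answers low
    min_value_correct=0.7,   # start correct answers high
)
```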
## Creating Custom Combined Metrics
You can create custom combined metrics by using the `@reward_function` decorator:
```python
from reward_kit.rewards.accuracy import accuracy_reward
from reward_kit.rewards.reasoning_steps import reasoning_steps_reward
from reward_kit import reward_function, RewardOutput, MetricRewardOutput
@reward_function
def accuracy_reasoning_reward(
messages,
ground_truth=None,
accuracy_weight=0.6,
reasoning_weight=0.4,
**kwargs
):
"""Combine accuracy and reasoning steps evaluation."""
# Evaluate accuracy
accuracy_result = accuracy_reward(
messages=messages,
ground_truth=ground_truth
)
# Evaluate reasoning steps
reasoning_result = reasoning_steps_reward(
messages=messages,
min_steps=3
)
# Calculate combined score
combined_score = (
accuracy_weight * accuracy_result["score"] +
reasoning_weight * reasoning_result["score"]
)
# Combine metrics
metrics = {
"accuracy": MetricRewardOutput(
score=accuracy_result["score"],
reason=accuracy_result["reason"]
),
"reasoning": MetricRewardOutput(
score=reasoning_result["score"],
reason=reasoning_result["reason"]
)
}
return RewardOutput(score=combined_score, metrics=metrics)
```
### Tips for Creating Combined Metrics
1. **Choose appropriate weights** based on the relative importance of each component
2. **Ensure scale consistency** across all component metrics (typically 0.0-1.0)
3. **Provide detailed reasons** for each component and the combined score
4. **Handle edge cases** where one component might fail
5. **Document parameters** clearly for users of your combined metric
## Best Practices
When using combined metrics rewards:
1. **Start simple**: Begin with equal weights and adjust based on results
2. **Test on diverse examples**: Ensure your metrics work across different response styles
3. **Avoid too many components**: Two or three aspects is typically optimal
4. **Balance importance**: Set weights to reflect true priorities
5. **Document clearly**: Make sure users understand what each component measures
## Next Steps
* Explore other [out-of-the-box reward functions](reward_functions_overview.md)
* Learn how to [create your own reward functions](../tutorials/creating_your_first_reward_function.md)
* Study [best practices for reward functions](../tutorials/best_practices.md)
* See how to [deploy your reward functions](../developer_guide/evaluation_workflows.md)
Source: https://docs.fireworks.ai/evaluators/examples/deepseek_prover_v2
# DeepSeek-Prover-V2 Reward Functions
This document describes the reward functions for evaluating Lean theorem proofs based on the [DeepSeek-Prover-V2](https://github.com/deepseek-ai/DeepSeek-Prover-V2) research paper from DeepSeek.
## Overview
DeepSeek-Prover-V2 is an advanced large language model specifically designed for formal theorem proving in Lean 4. It incorporates both informal mathematical reasoning and formal theorem proving capabilities, with a particular focus on subgoal decomposition for tackling complex mathematical proofs.
The reward functions in this module evaluate how well model responses can formulate valid Lean proofs, with special attention to the recursive subgoal decomposition technique central to DeepSeek-Prover-V2's approach.
## Reward Functions
Three reward functions are provided:
1. `lean_prover_reward`: A basic reward function for evaluating Lean proofs
2. `deepseek_prover_v2_reward`: An enhanced reward function that also evaluates quality of subgoal decomposition
3. `deepseek_huggingface_prover_benchmark`: A reward function for evaluating against the DeepSeek-ProverBench dataset from Hugging Face
## Installation
Basic functionality requires only the standard reward-kit installation:
```bash
pip install reward-kit
```
For using the HuggingFace dataset integration with `deepseek_huggingface_prover_benchmark`, install the dependencies:
```bash
pip install "reward-kit[deepseek]"
```
## Usage
### Basic Lean Prover Reward
```python
from reward_kit.rewards.lean_prover import lean_prover_reward
statement = "For all natural numbers n, the sum of the first n natural numbers is n(n+1)/2."
response = """
theorem sum_naturals (n : ℕ) : ∑ i in range n, i = n * (n + 1) / 2 :=
begin
induction n with d hd,
{ simp, },
{ sorry }
end
"""
result = lean_prover_reward(
response=response,
statement=statement,
lean_version="4",
check_partial_progress=True,
verbose=True
)
print(result.score) # Example output: 0.25
```
### DeepSeek Prover V2 Reward
```python
from reward_kit.rewards.lean_prover import deepseek_prover_v2_reward
statement = "For all natural numbers n, the sum of the first n natural numbers is n(n+1)/2."
response = """
theorem sum_naturals (n : ℕ) : ∑ i in range n, i = n * (n + 1) / 2 :=
begin
-- We'll prove this by induction on n
induction n with d hd,
-- Base case: n = 0
{ simp, },
-- Inductive step: assume true for n = d, prove for n = d + 1
{
have step1 : ∑ i in range (d + 1), i = (∑ i in range d, i) + d,
by simp [sum_range_succ],
have step2 : (∑ i in range d, i) + d = d * (d + 1) / 2 + d,
by rw [hd],
calc
∑ i in range (d + 1), i = (∑ i in range d, i) + d : by simp [sum_range_succ]
... = d * (d + 1) / 2 + d : by rw [hd]
... = (d * (d + 1) + 2 * d) / 2 : by ring
... = (d + 1) * ((d + 1) + 1) / 2 : by ring,
}
end
"""
result = deepseek_prover_v2_reward(
response=response,
statement=statement,
check_subgoals=True,
verbose=True
)
print(result.score) # Example output: 0.85
```
### HuggingFace ProverBench Evaluation
```python
from reward_kit.rewards.lean_prover import deepseek_huggingface_prover_benchmark
# This function will attempt to find a matching problem in the DeepSeek-ProverBench dataset
# Requires installation of the Hugging Face datasets package: pip install datasets
result = deepseek_huggingface_prover_benchmark(
response=response,
statement="For any positive integers a and b, gcd(a,b) divides any linear combination of a and b",
dataset_name="deepseek-ai/DeepSeek-ProverBench",
check_for_answer=True,
verbose=True
)
print(result.score)
```
## Scoring Methodology
The scoring uses the following criteria:
### Basic Lean Proofs (lean\_prover\_reward)
* **0.0**: No valid Lean proof attempt
* **0.1-0.4**: Has definition but incomplete proof (with "sorry" or "admitted")
* Score scales based on number of tactics used
* **0.5-0.9**: Complete proof with varying complexity
* Base score of 0.5 for complete proofs
* Up to 0.4 additional points based on tactics complexity
* **1.0**: Perfect match with expected answer if provided
### DeepSeek-Prover-V2 Specific (deepseek\_prover\_v2\_reward)
Builds on the basic scoring and adds:
* **Subgoal Quality**: Up to +0.3 based on effective use of subgoals
* **Hierarchical Structure**: Up to +0.2 based on proof's hierarchical depth
## Example Output
Verbose output with `verbose=True` includes detailed breakdown of the evaluation:
```python
# Example of accessing metrics in verbose output
if result.metrics:
print(f"Score: {result.score}")
# Basic syntax metrics
syntax_score = result.metrics["syntax"].score
print(f"Syntax score: {syntax_score}")
# Completeness metrics
completeness = result.metrics["completeness"].score
print(f"Completeness: {completeness}")
# For DeepSeek-specific metrics
if "subgoal_decomposition" in result.metrics:
subgoal_score = result.metrics["subgoal_decomposition"].score
print(f"Subgoal quality: {subgoal_score}")
if "hierarchical_structure" in result.metrics:
hierarchy_score = result.metrics["hierarchical_structure"].score
print(f"Hierarchical structure: {hierarchy_score}")
```
The metrics include:
* `syntax`: Score for valid theorem definition
* `completeness`: Whether the proof is complete (not using "sorry" or "admitted")
* `tactics`: Evaluation of the tactics used
* `subgoal_decomposition`: Quality of subgoal usage (DeepSeek-specific)
* `hierarchical_structure`: Quality of hierarchical organization (DeepSeek-specific)
### Using with the HuggingFace Dataset
For more extensive evaluation with a full dataset:
```python
from datasets import load_dataset
from reward_kit.rewards.lean_prover import deepseek_huggingface_prover_benchmark
# Load the dataset
dataset = load_dataset("deepseek-ai/DeepSeek-ProverBench")
# Sample a few statements to evaluate
statements = dataset["train"]["statement"][:5]
# Evaluate each proof
for statement in statements:
# Your model generates a proof for the statement
lean_proof = generate_proof_with_your_model(statement)
# Evaluate against the dataset
result = deepseek_huggingface_prover_benchmark(
response=lean_proof,
statement=statement,
dataset_name="deepseek-ai/DeepSeek-ProverBench",
verbose=True
)
print(f"Statement: {statement[:50]}...")
print(f"Score: {result.score}")
# Check if the dataset matched successfully
if "dataset_match" in result.metrics:
match_score = result.metrics["dataset_match"].score
match_reason = result.metrics["dataset_match"].reason
print(f"Dataset match: {match_score} - {match_reason}")
```
## References
* [DeepSeek-Prover-V2 GitHub Repository](https://github.com/deepseek-ai/DeepSeek-Prover-V2)
* [DeepSeek-ProverBench Dataset](https://huggingface.co/datasets/deepseek-ai/DeepSeek-ProverBench)
Source: https://docs.fireworks.ai/evaluators/examples/examples_overview
# Reward Kit Examples
This directory contains examples and guides for using Reward Kit's built-in reward functions and creating custom reward functions.
## Out-of-the-Box Reward Functions
Reward Kit provides several pre-built reward functions for common evaluation tasks:
* [**Reward Functions Overview**](reward_functions_overview.md) - Overview of all available reward functions
* [**Combined Metrics Rewards**](combined_metrics_rewards.md) - Evaluate responses using multiple metrics combined
* [**Code Execution Evaluation**](code_execution_evaluation.md) - Evaluate code by running it locally
* [**Code Execution with E2B**](code_execution_with_e2b.md) - Evaluate code using E2B cloud sandbox
* [**Function Calling Evaluation**](function_calling_evaluation.md) - Evaluate function calls made by AI models
* [**JSON Schema Validation**](json_schema_validation.md) - Validate JSON outputs against schemas
* [**Math Evaluation**](math_evaluation.md) - Evaluate mathematical answers in responses
* [**DeepSeek-Prover-V2**](deepseek_prover_v2.md) - Evaluate formal proofs in Lean theorem prover
## Creating Your Own Reward Functions
Learn how to create custom reward functions:
* [**Basic Reward Function**](basic_reward_function.md) - Simple example of a custom reward function
* [**Advanced Reward Functions**](advanced_reward_functions.md) - More complex reward function examples
## Next Steps
* See the [Developer Guide](../developer_guide/getting_started.md) for comprehensive information
* Check the [Tutorials](../tutorials/creating_your_first_reward_function.md) for step-by-step guides
* Refer to the [API Reference](../api_reference/api_overview.mdx) for detailed documentation of all available functions
Source: https://docs.fireworks.ai/evaluators/examples/function_calling_evaluation
# Function Calling Evaluation
This guide demonstrates how to evaluate function calls made by AI models using a combination of schema validation and LLM judgment.
## Prerequisites
Before using the function calling evaluation rewards, ensure you have:
1. **Python 3.8+** installed on your system
2. **Reward Kit** installed: `pip install reward-kit`
3. **OpenAI Python Client** installed (for LLM judge): `pip install openai`
4. **OpenAI API Key** (for LLM judge evaluation)
## Function Calling Reward Components
The Reward Kit provides three approaches to evaluating function calls:
1. **Schema Jaccard Reward**: Compares function call structure to expected schema using Jaccard similarity
2. **LLM Judge Reward**: Uses GPT-4o-mini to evaluate function call quality based on expected behavior
3. **Composite Reward**: Combines schema validation and LLM judgment for comprehensive evaluation
## Schema Jaccard Reward
The Schema Jaccard Reward evaluates how well a function call matches the expected schema by calculating the Jaccard similarity between property sets.
### Example Usage
```python
from reward_kit.rewards.function_calling import schema_jaccard_reward
import json
# Define expected schema
expected_schema = {
"name": "get_weather",
"arguments": {
"location": {"type": "string"},
"unit": {"type": "string"}
}
}
# Messages with function call
messages = [
{"role": "user", "content": "What's the weather in New York?"},
{
"role": "assistant",
"content": "I'll check the weather for you.",
"function_call": {
"name": "get_weather",
"arguments": json.dumps({
"location": "New York",
"unit": "celsius"
})
}
}
]
# Evaluate the function call
result = schema_jaccard_reward(
messages=messages,
expected_schema=expected_schema
)
# Print the results
print(f"Overall Score: {result.score}")
print("Component Metrics:")
for name, metric in result.metrics.items():
print(f" {name}: {metric.score}")
print(f" Reason: {metric.reason}")
```
### How It Works
1. Extracts function call information from the messages or directly from provided function\_call parameter
2. Compares the function name against the expected name (exact match required)
3. Compares argument schema structure using Jaccard similarity, which measures:
* The intersection of properties divided by the union of properties
4. Generates a comprehensive report of matching, missing, and unexpected properties
5. Calculates final score as weighted combination of name match and schema similarity
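As a small illustration of step 3, Jaccard similarity over argument property names works like this (this mirrors the idea, not the library's exact internals):
```python
expected_props = {"location", "unit"}
actual_props = {"location", "units"}  # typo: "units" instead of "unit"

intersection = expected_props & actual_props  # {"location"}
union = expected_props | actual_props         # {"location", "unit", "units"}

similarity = len(intersection) / len(union)
print(similarity)  # 0.33..., so an imperfect schema match lowers the score
```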
## LLM Judge Reward
The LLM Judge Reward uses GPT-4o-mini to evaluate the quality and correctness of function calls based on expected behavior.
### Example Usage
```python
from reward_kit.rewards.function_calling import llm_judge_reward
import json
import os
# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
# Define expected schema and behavior
expected_schema = {
"name": "get_weather",
"arguments": {
"location": {"type": "string"},
"unit": {"type": "string"}
}
}
expected_behavior = """
This function should retrieve weather information for the specified location.
- The location should be a valid city or place name
- The unit parameter should be either 'celsius' or 'fahrenheit'
- The function should be called when the user explicitly asks about weather
"""
# Messages with function call
messages = [
{"role": "user", "content": "What's the weather in New York?"},
{
"role": "assistant",
"content": "I'll check the weather for you.",
"function_call": {
"name": "get_weather",
"arguments": json.dumps({
"location": "New York",
"unit": "celsius"
})
}
}
]
# Evaluate the function call
result = llm_judge_reward(
messages=messages,
expected_schema=expected_schema,
expected_behavior=expected_behavior
)
# Print the results
print(f"Overall Score: {result.score}")
print("Component Metrics:")
for name, metric in result.metrics.items():
print(f" {name}: {metric.score}")
print(f" Reason: {metric.reason}")
```
### How It Works
1. Extracts function call information from the messages
2. Formats a prompt with:
* Conversation context
* Function call details
* Expected schema
* Expected behavior description
3. Sends the prompt to GPT-4o-mini (or another specified model)
4. Parses the response to extract:
* Numeric score between 0.0 and 1.0
* Detailed explanation of strengths and weaknesses
5. Returns the LLM's evaluation as a reward score with explanation
## Composite Function Call Reward
The Composite Function Call Reward combines both schema validation and LLM judgment for a comprehensive evaluation.
### Example Usage
```python
from reward_kit.rewards.function_calling import composite_function_call_reward
import json
import os
# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
# Define expected schema and behavior
expected_schema = {
"name": "get_weather",
"arguments": {
"location": {"type": "string"},
"unit": {"type": "string"}
}
}
expected_behavior = """
This function should retrieve weather information for the specified location.
- The location should be a valid city or place name
- The unit parameter should be either 'celsius' or 'fahrenheit'
- The function should be called when the user explicitly asks about weather
"""
# Messages with function call
messages = [
{"role": "user", "content": "What's the weather in New York?"},
{
"role": "assistant",
"content": "I'll check the weather for you.",
"function_call": {
"name": "get_weather",
"arguments": json.dumps({
"location": "New York",
"unit": "celsius"
})
}
}
]
# Evaluate the function call with custom weights
result = composite_function_call_reward(
messages=messages,
expected_schema=expected_schema,
expected_behavior=expected_behavior,
weights={"schema": 0.6, "llm": 0.4} # Emphasize schema validation
)
# Print the results
print(f"Overall Score: {result.score}")
print("Component Metrics:")
for name, metric in result.metrics.items():
print(f" {name}: {metric.score}")
print(f" Reason: {metric.reason}")
```
### How It Works
1. Runs both schema\_jaccard\_reward and llm\_judge\_reward separately
2. Combines the metrics from both evaluations with prefixes:
* `schema_` for schema validation metrics
* `llm_` for LLM judgment metrics
3. Calculates a weighted average of both scores based on provided weights
4. Returns a comprehensive set of metrics with the weighted final score
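Step 4 boils down to a plain weighted average; with the weights from the example above it works out like this (illustrative numbers, not library internals):
```python
weights = {"schema": 0.6, "llm": 0.4}
component_scores = {"schema": 0.9, "llm": 0.7}

final_score = sum(weights[k] * component_scores[k] for k in weights)
print(round(final_score, 2))  # 0.6 * 0.9 + 0.4 * 0.7 = 0.82
```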
## Advanced Usage
### Custom Weights
You can customize the weights for different components:
```python
# Emphasize LLM judgment over schema validation
result = composite_function_call_reward(
messages=messages,
expected_schema=expected_schema,
expected_behavior=expected_behavior,
weights={"schema": 0.3, "llm": 0.7} # Higher weight for LLM judgment
)
```
### Custom LLM Model
You can specify a different model for LLM evaluation:
```python
# Use a different model
result = llm_judge_reward(
messages=messages,
expected_schema=expected_schema,
expected_behavior=expected_behavior,
model="gpt-4-turbo", # Use a more powerful model
temperature=0.2 # Add some randomness
)
```
### Direct Function Call Evaluation
You can also evaluate a function call directly without extracting from messages:
```python
function_call = {
"name": "get_weather",
"arguments": json.dumps({
"location": "New York",
"unit": "celsius"
})
}
result = schema_jaccard_reward(
messages=[], # Can be empty as function_call is provided directly
function_call=function_call,
expected_schema=expected_schema
)
```
## Use Case: Evaluating Tool Use in Models
One common application is evaluating how well different models use tools:
```python
import json
from reward_kit.rewards.function_calling import composite_function_call_reward
# Define expected schema for search function
search_schema = {
"name": "search",
"arguments": {
"query": {"type": "string"}
}
}
expected_behavior = """
The search function should be called:
1. When the user is asking for factual information
2. With a clear, specific query that captures what the user is looking for
3. Without including instructions, formatting requests, or explanations in the query
"""
# Test different model responses to the same query
models = ["llama-3-8b", "claude-3-sonnet", "gpt-4o"]
model_responses = {
"llama-3-8b": {
"name": "search",
"arguments": json.dumps({
"query": "latest developments in quantum computing"
})
},
"claude-3-sonnet": {
"name": "search",
"arguments": json.dumps({
"query": "quantum computing recent advances 2023-2024"
})
},
"gpt-4o": {
"name": "search",
"arguments": json.dumps({
"query": "recent breakthroughs in quantum computing please search for detailed technical information"
})
}
}
# User query
user_query = "What are the latest developments in quantum computing?"
# Evaluate each model's function call
results = {}
for model in models:
messages = [
{"role": "user", "content": user_query},
{"role": "assistant", "function_call": model_responses[model]}
]
result = composite_function_call_reward(
messages=messages,
expected_schema=search_schema,
expected_behavior=expected_behavior
)
results[model] = result.score
# Print the results
print("Model Function Call Evaluation Scores:")
for model, score in results.items():
print(f"{model}: {score:.2f}")
```
## Best Practices
1. **Clear Expected Schemas**: Define schemas with precise types and required properties
2. **Detailed Expected Behavior**: Provide specific guidance for what constitutes correct behavior
3. **Combined Evaluation**: Use the composite reward for the most comprehensive evaluation
4. **Custom Weights**: Adjust weights based on whether structure or behavior is more important
5. **Testing**: Test reward functions with a variety of function calls, including edge cases
6. **Fallback Options**: Always handle API errors gracefully in the LLM judge evaluation
## Next Steps
* Learn about [Creating Custom Reward Functions](../tutorials/creating_your_first_reward_function.md)
* Explore [Advanced Reward Functions](advanced_reward_functions.md) for more complex evaluations
* See [Best Practices](../tutorials/best_practices.md) for reward function design
# Reward Kit Examples
Source: https://docs.fireworks.ai/evaluators/examples/general_usage
This directory contains examples demonstrating how to use the Reward Kit library for evaluating and deploying reward functions for LLM fine-tuning.
## Prerequisites
Before running the examples, make sure you have:
1. A Fireworks AI account and API key
2. The Reward Kit package installed
## Setup
### 1. Create a Virtual Environment
```bash
# Create a virtual environment
python -m venv .venv
# Activate the virtual environment
source .venv/bin/activate
```
### 2. Install Reward Kit
```bash
# Install the package in development mode
pip install -e .
```
### 3. Configure API Access
For development, use these environment variables:
```bash
# Set environment variables for development
export FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY
export FIREWORKS_API_BASE=https://dev.api.fireworks.ai
```
For production, use:
```bash
# Set environment variables for production
export FIREWORKS_API_KEY=your_api_key
export FIREWORKS_API_BASE=https://api.fireworks.ai
```
## Example Walkthroughs
### Combined Accuracy and Length Evaluation
The `accuracy_length/cosine_scaled_example.py` demonstrates the `cosine_scaled_accuracy_length_reward` function which evaluates responses based on both accuracy and length efficiency.
```bash
# Run the example
python examples/accuracy_length/cosine_scaled_example.py
```
This example:
1. Demonstrates evaluation of different response types (short correct, long correct, short incorrect, long incorrect)
2. Shows how the combined reward function prioritizes short correct answers
3. Illustrates customizing the weights between accuracy and length components
See the [Accuracy + Length Overview](accuracy_length/accuracy_length_overview.mdx) for more details.
### Basic Evaluation Example
The `evaluation_preview_example.py` demonstrates how to preview and create an evaluation using the Reward Kit.
#### Step 1: Understand the Metric
Examine the example metric in the `metrics/word_count` directory. This metric evaluates responses based on their word count:
```python
from reward_kit import EvaluateResult, MetricResult, reward_function

@reward_function
def evaluate(messages, original_messages=None, **kwargs):
# Get the last message (assistant's response)
last_message = messages[-1]
content = last_message.content or ''
# Count words and calculate score
word_count = len(content.split())
score = min(word_count / 100, 1.0) # Cap at 1.0
return EvaluateResult(
score=score,
reason=f'Word count: {word_count}',
metrics={
'word_count': MetricResult(
score=score,
reason=f'Word count: {word_count}'
)
}
)
```
#### Step 2: Prepare Sample Data
Review the sample conversations in `samples/samples.jsonl`. Each line contains a JSON object representing a conversation:
```json
{"messages": [{"role": "user", "content": "Tell me about AI"}, {"role": "assistant", "content": "AI refers to systems designed to mimic human intelligence."}]}
```
#### Step 3: Run the Preview
Execute the evaluation preview example:
```bash
source .venv/bin/activate && FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY FIREWORKS_API_BASE=https://dev.api.fireworks.ai python examples/evaluation_preview_example.py
```
This will:
1. Load the word count metric from `examples/metrics/word_count`
2. Load sample conversations from `examples/samples/samples.jsonl`
3. Preview the evaluator using the Fireworks API
4. Display the evaluation results for each sample
5. Create an evaluator named "word-count-eval"
### Deployment Example
The `deploy_example.py` demonstrates how to deploy a reward function to the Fireworks platform.
#### Step 1: Examine the Reward Function
Review the informativeness reward function in the deploy example, which evaluates responses based on:
* Length
* Specificity markers
* Content density
#### Step 2: Run the Deployment
Execute the deployment example:
```bash
source .venv/bin/activate && FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY FIREWORKS_API_BASE=https://dev.api.fireworks.ai python examples/deploy_example.py
```
This will:
1. Test the reward function locally with sample data
2. Deploy the function to the Fireworks platform
3. Display the deployed evaluator ID
### Using the CLI
The Reward Kit also provides a command-line interface for common operations.
#### Preview an Evaluator Using CLI
```bash
# Activate the virtual environment and set environment variables
source .venv/bin/activate
# Preview an evaluator
FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY FIREWORKS_API_BASE=https://dev.api.fireworks.ai reward-kit preview \
--metrics-folders "word_count=./examples/metrics/word_count" \
--samples ./examples/samples/samples.jsonl
```
#### Deploy an Evaluator Using CLI
```bash
# Activate the virtual environment and set environment variables
source .venv/bin/activate
# Deploy an evaluator
FIREWORKS_API_KEY=$DEV_FIREWORKS_API_KEY FIREWORKS_API_BASE=https://dev.api.fireworks.ai reward-kit deploy \
--id my-evaluator \
--metrics-folders "word_count=./examples/metrics/word_count" \
--display-name "My Word Count Evaluator" \
--description "Evaluates responses based on word count" \
--force
```
## Creating Your Own Evaluators
Follow these steps to create your own custom evaluator:
1. Create a directory for your metric (e.g., `my_metrics/coherence`)
2. Create a `main.py` file with an `evaluate` function
3. Test your evaluator using the preview functionality
4. Deploy your evaluator when ready
### Example Custom Metric
```python
from reward_kit import EvaluateResult, MetricResult, reward_function, Message
from typing import List
@reward_function
def evaluate(messages: List[Message], original_messages: List[Message] = list(), **kwargs):
"""Custom evaluation metric."""
# Your evaluation logic here
# ...
return EvaluateResult(
score=your_score,
reason="Explanation of score",
metrics={
'your_metric': MetricResult(
score=your_score,
reason="Detailed explanation"
)
}
)
```
## Next Steps
After exploring these examples, you can:
1. Create your own custom metrics
2. Integrate reward functions into model training workflows
3. Use deployed evaluators to score model outputs
4. Combine multiple metrics for comprehensive evaluation
# JSON Schema Validation
Source: https://docs.fireworks.ai/evaluators/examples/json_schema_validation
This guide demonstrates how to validate JSON outputs from LLM responses against a defined schema.
## Overview
The JSON Schema reward functions allow you to:
1. Extract JSON data from LLM responses
2. Validate it against a predefined JSON Schema
3. Get detailed validation metrics and error reports
4. Score models based on schema adherence
## Prerequisites
Before using the JSON Schema validation rewards, ensure you have:
1. **Python 3.8+** installed on your system
2. **Reward Kit** installed: `pip install reward-kit`
3. **jsonschema** package (installed automatically with Reward Kit)
## Basic Usage
Here's a simple example of how to use the JSON Schema validation:
````python
from reward_kit.rewards.json_schema import json_schema_reward
import json
# Define a JSON schema for a person object
person_schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer", "minimum": 0},
"email": {"type": "string", "format": "email"},
"interests": {"type": "array", "items": {"type": "string"}}
},
"required": ["name", "age", "email"]
}
# Example conversation with an LLM
messages = [
{
"role": "user",
"content": "Create a JSON object for a person with name, age, email, and interests."
},
{
"role": "assistant",
"content": """Here's a JSON object for a person:
```json
{
"name": "Jane Smith",
"age": 32,
"email": "jane.smith@example.com",
"interests": ["hiking", "photography", "machine learning"]
}
```
This includes all the requested fields with sample data."""
}
]
# Validate the JSON against the schema
result = json_schema_reward(
    messages=messages,
    schema=person_schema
)
# Print the results
print(f"Overall score: {result.score}")
print("Metrics:")
for name, metric in result.metrics.items():
print(f" {name}: {metric.score}")
print(f" {metric.reason}")
````
## How It Works
The JSON Schema validation reward function:
1. Extracts JSON data from the last assistant message using regex
2. Parses the extracted JSON string into a Python object
3. Validates the object against the provided JSON Schema
4. Returns detailed metrics about validation success or errors
5. Provides an overall score based on validation results (a simplified sketch of these steps follows)
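The following is a simplified sketch of those steps using the `jsonschema` package directly, assuming the `messages` and `person_schema` from the Basic Usage example above; the library's own extraction logic is more robust:
````python
import json
import re

from jsonschema import ValidationError, validate

# 1. Extract the JSON payload from a fenced ```json block in the last message
assistant_content = messages[-1]["content"]
match = re.search(r"```json\s*([\s\S]*?)\s*```", assistant_content)

if match:
    # 2. Parse the extracted string into a Python object
    data = json.loads(match.group(1))
    try:
        # 3. Validate the object against the schema
        validate(instance=data, schema=person_schema)
        print("Valid JSON: score 1.0")
    except ValidationError as err:
        print(f"Schema violation: {err.message}")
else:
    print("No JSON block found in the response")
````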
## Advanced Usage
### Custom JSON Extraction
You can provide a custom function to extract JSON from messages:
````python
from reward_kit.rewards.json_schema import json_schema_reward
import json
import re
# Custom extractor that looks for JSON in specific format
def my_json_extractor(message):
pattern = r"USER DATA:\s*```json\s*([\s\S]*?)\s*```"
match = re.search(pattern, message)
if match:
return match.group(1)
return None
# Messages with custom JSON format
messages = [
{"role": "user", "content": "Format user data for Jane Smith."},
{"role": "assistant", "content": """Here is the formatted user data:
USER DATA: ```json
{
"name": "Jane Smith",
"age": 32,
"email": "jane.smith@example.com"
}
```
The data has been formatted according to the requirements."""}
]
# Use custom extractor
result = json_schema_reward(
    messages=messages,
    schema=person_schema,
    json_extractor=my_json_extractor
)
````
### Handling Multiple JSON Objects
If multiple JSON objects are present, you can specify which one to validate:
````python
from reward_kit.rewards.json_schema import json_schema_reward
# Message with multiple JSON objects
messages = [
{"role": "user", "content": "Generate JSON for a person and their pet."},
{"role": "assistant", "content": """Here are the requested JSON objects:
Person:
```json
{
"name": "John Doe",
"age": 35,
"email": "john.doe@example.com"
}
```
Pet:
```json
{
"name": "Buddy",
"species": "dog",
"age": 5
}
```
Both objects follow the standard format."""}
]
# Validate only the first JSON object
result = json_schema_reward(
    messages=messages,
    schema=person_schema,
    json_index=0  # 0-based index for which JSON object to validate
)
````
### Direct JSON String Validation
You can also validate a JSON string directly:
```python
from reward_kit.rewards.json_schema import validate_json_string
# JSON string to validate
json_str = """
{
"name": "Alice Johnson",
"age": 28,
"email": "alice@example.com",
"interests": ["coding", "chess"]
}
"""
# Validate directly
result = validate_json_string(
json_str=json_str,
schema=person_schema
)
print(f"Valid: {result['valid']}")
if not result['valid']:
print(f"Errors: {result['errors']}")
```
### Multiple Schema Requirements
For more complex requirements, you can specify an array of valid schemas:
````python
from reward_kit.rewards.json_schema import json_schema_reward
# Define two valid schemas
schemas = [
# Schema for regular users
{
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer", "minimum": 18},
"email": {"type": "string", "format": "email"},
"role": {"type": "string", "enum": ["user"]}
},
"required": ["name", "email", "role"]
},
# Schema for admin users
{
"type": "object",
"properties": {
"name": {"type": "string"},
"email": {"type": "string", "format": "email"},
"role": {"type": "string", "enum": ["admin"]},
"permissions": {"type": "array", "items": {"type": "string"}}
},
"required": ["name", "email", "role", "permissions"]
}
]
# Message with JSON that should conform to one of the schemas
messages = [
{"role": "user", "content": "Create JSON for an admin user."},
{"role": "assistant", "content": """Here's the admin user JSON:
```json
{
"name": "Admin User",
"email": "admin@example.com",
"role": "admin",
"permissions": ["read", "write", "delete"]
}
```
This follows the admin schema with the required permissions."""}
]
# Validate against multiple schemas (passes if valid against any schema)
result = json_schema_reward(
    messages=messages,
    schema=schemas,
    require_all_valid=False  # Only need to be valid against one schema
)
````
## Use Cases
### Data Formatting Validation
Use JSON Schema validation to ensure LLMs generate data in the correct format for:
- API request/response bodies
- Configuration files
- Data interchange formats
- Application settings
### Structured Output Generation
Validate structured outputs for:
- Database records
- User profiles
- Product catalogs
- Event descriptions
- Log entries
### Response Normalization
Ensure various models produce outputs in a standardized format:
````python
from reward_kit.rewards.json_schema import json_schema_reward
import json
# Define schema for product data
product_schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"description": {"type": "string"},
"price": {"type": "number", "minimum": 0},
"category": {"type": "string"},
"available": {"type": "boolean"}
},
"required": ["name", "price", "category", "available"]
}
# Test different models with the same prompt
models = ["model-a", "model-b", "model-c"]
model_scores = {}
for model in models:
# Get response from model (example responses)
if model == "model-a":
response = """```json
{
"name": "Wireless Headphones",
"price": 79.99,
"category": "Electronics",
"available": true,
"description": "Noise-cancelling wireless headphones"
}
```"""
elif model == "model-b":
response = """```json
{
"name": "Wireless Headphones",
"price": "79.99", // Incorrect type - string instead of number
"category": "Electronics",
"available": true
}
```"""
else:
response = """```json
{
"productName": "Wireless Headphones", // Wrong field name
"price": 79.99,
"category": "Electronics"
// Missing required field: available
}
```"""
# Create message
messages = [
{"role": "user", "content": "Generate JSON for wireless headphones product"},
{"role": "assistant", "content": response}
]
# Validate against schema
result = json_schema_reward(messages=messages, schema=product_schema)
model_scores[model] = result.score
# Compare model performance
print("JSON Schema Validation Scores:")
for model, score in model_scores.items():
print(f"{model}: {score:.2f}")
````
## Best Practices
1. **Clear Schemas**: Define schemas with precise types and constraints
2. **Required Fields**: Explicitly specify which fields are required
3. **Helpful Error Messages**: Include good descriptions in schema for better error messages
4. **Nested Validation**: Use nested schemas for complex data structures
5. **Alternative Schemas**: Consider using anyOf/oneOf for flexible validation (see the sketch below)
6. **Test with Examples**: Validate schema against known good and bad examples
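As a sketch of the alternative-schemas idea, a single schema can also express flexibility with `anyOf`; the field names below are illustrative, and `messages` is assumed to be defined as in the earlier examples:
```python
from reward_kit.rewards.json_schema import json_schema_reward

# Accept either an email contact or a phone contact
flexible_schema = {
    "anyOf": [
        {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string", "format": "email"},
            },
            "required": ["name", "email"],
        },
        {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "phone": {"type": "string"},
            },
            "required": ["name", "phone"],
        },
    ]
}

result = json_schema_reward(messages=messages, schema=flexible_schema)
```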
## Limitations
* Cannot evaluate the quality or usefulness of the content, only its structure
* Requires properly formatted JSON to validate
* Some aspects of data quality (like whether values are reasonable) may require custom checks
## Next Steps
* Learn about [Function Calling Evaluation](function_calling_evaluation.md) for validating function calls
* Explore [Code Execution Evaluation](code_execution_with_e2b.md) for evaluating code solutions
* See [Creating Custom Reward Functions](../tutorials/creating_your_first_reward_function.md) to build custom validation logic
# Math Evaluation
Source: https://docs.fireworks.ai/evaluators/examples/math_evaluation
This guide demonstrates how to evaluate mathematical answers in LLM responses using the math reward functions.
## Overview
The `math_reward` function allows you to:
1. Extract numerical answers from LLM responses
2. Compare them with expected answers or reference solutions
3. Handle various formats including fractions, decimals, and scientific notation
4. Support LaTeX formatted answers in markdown
## Prerequisites
Before using the math evaluation rewards, ensure you have:
1. **Python 3.8+** installed on your system
2. **Reward Kit** installed: `pip install reward-kit`
## Basic Usage
Here's a simple example of how to use the math reward function:
```python
from reward_kit.rewards.math import math_reward
# Example conversation with a math problem
messages = [
{
"role": "user",
"content": "Calculate 15% of 80."
},
{
"role": "assistant",
"content": "To calculate 15% of 80, I'll multiply 80 by 0.15:\n\n80 × 0.15 = 12\n\nTherefore, 15% of 80 is 12."
}
]
# Expected answer
expected_answer = "12"
# Evaluate the response
result = math_reward(
messages=messages,
expected_answer=expected_answer
)
# Print the results
print(f"Score: {result.score}")
print("Metrics:")
for name, metric in result.metrics.items():
print(f" {name}: {metric.score}")
print(f" {metric.reason}")
```
## How It Works
The math reward function:
1. Extracts potential answer values from the last assistant message
2. Extracts expected answer value from the provided string
3. Compares them with tolerance for floating-point values
4. Returns a score of 1.0 for correct answers and 0.0 for incorrect answers
5. Provides detailed metrics about the extraction and comparison process (see the sketch below)
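Here is a simplified sketch of the tolerance-based comparison in step 3; the library's own extraction and normalization handle many more formats:
```python
# Illustrative comparison with a relative tolerance (not the library's exact logic)
def numbers_match(extracted: float, expected: float, tolerance: float = 0.001) -> bool:
    """Return True if the two values agree within the given relative tolerance."""
    if expected == 0:
        return abs(extracted) <= tolerance
    return abs(extracted - expected) / abs(expected) <= tolerance

print(numbers_match(12.0, 12.0))      # True  -> score 1.0
print(numbers_match(0.4167, 5 / 12))  # True  -> within tolerance
print(numbers_match(13.0, 12.0))      # False -> score 0.0
```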
## Supported Answer Formats
The math reward function can extract and compare answers in various formats:
### Integer and Decimal Numbers
```
42
-27
3.14159
0.5
```
### Fractions
```
3/4
-5/8
1 2/3 (mixed fractions)
```
### Scientific Notation
```
1.23e4
6.022 × 10^23
5.67 × 10⁻⁸
```
### LaTeX Formatting
```
\boxed{42}
\frac{3}{4}
\frac{22}{7} \approx 3.14
\pi \approx 3.14159
2.998 \times 10^8 \text{ m/s}
```
### Units
```
42 kg
3.14 m/s²
5 \text{ meters}
```
## Advanced Usage
### Customizing Extraction
You can customize the extraction process to look for answers in particular formats or locations:
```python
from reward_kit.rewards.math import math_reward
# Messages with LaTeX formatted answer
messages = [
{
"role": "user",
"content": "What is the area of a circle with radius 3 cm?"
},
{
"role": "assistant",
"content": "To find the area of a circle, I'll use the formula:\n\nArea = πr²\n\nSubstituting r = 3 cm:\n\nArea = π × 3² = 9π cm²\n\nCalculating with π ≈ 3.14159:\n\nArea ≈ 28.27 cm²\n\nTherefore, the area of a circle with radius 3 cm is \n\n$$\\boxed{28.27 \\text{ cm}^2}$$"
}
]
# Evaluate with custom extraction patterns
result = math_reward(
messages=messages,
expected_answer="28.27 cm^2",
extract_boxed_only=True, # Only look for answers in \boxed{} environments
ignore_units=False, # Consider units in the comparison
tolerance=0.01 # Allow for slight differences in rounding
)
```
### Multiple Valid Answers
Sometimes, multiple forms of the same answer are acceptable. You can evaluate against multiple correct answers:
```python
from reward_kit.rewards.math import math_reward
# Message with fraction answer
messages = [
{
"role": "user",
"content": "What is 1/4 + 1/6?"
},
{
"role": "assistant",
"content": "To add fractions with different denominators, I need to find a common denominator.\n\n1/4 + 1/6\n\nLCD = 12\n\n1/4 = 3/12\n1/6 = 2/12\n\n3/12 + 2/12 = 5/12\n\nTherefore, 1/4 + 1/6 = 5/12"
}
]
# Accept either fraction or decimal form
result = math_reward(
messages=messages,
expected_answer=["5/12", "0.41666"], # Accept either form
tolerance=0.001 # Small tolerance for decimal approximation
)
```
### Original Messages as Reference
If the correct answer is in the original messages, you can extract it automatically:
```python
from reward_kit.rewards.math import math_reward
# Original conversation with correct answer
original_messages = [
{
"role": "user",
"content": "Solve the equation 2x + 5 = 15. The answer is x = 5."
}
]
# Generated response to evaluate
generated_messages = [
{
"role": "user",
"content": "Solve the equation 2x + 5 = 15."
},
{
"role": "assistant",
"content": "To solve the equation 2x + 5 = 15, I'll isolate the variable x.\n\n2x + 5 = 15\n2x = 15 - 5\n2x = 10\nx = 10/2\nx = 5\n\nTherefore, the solution is x = 5."
}
]
# Extract expected answer from original messages
result = math_reward(
messages=generated_messages,
original_messages=original_messages,
extract_answer_from_original=True # Extract answer from original messages
)
```
## Use Cases
### Evaluating Math Problem Solving
The math reward function is perfect for evaluating responses to:
* Basic arithmetic problems
* Algebra equations
* Calculus problems
* Physics calculations
* Economics computations
* Statistics problems
### Educational Applications
Use the math reward function to:
* Automatically grade math homework
* Provide instant feedback on practice problems
* Evaluate mathematical reasoning in tutoring systems
## Best Practices
1. **Be Explicit About Units**: Specify whether units should be considered in the comparison
2. **Consider Fractions vs. Decimals**: Decide if approximate decimal answers are acceptable for fraction problems
3. **Set Appropriate Tolerance**: Use a tolerance appropriate for the problem (e.g., higher for complex calculations)
4. **Look for Final Answers**: Set up extraction patterns to focus on the final answer rather than intermediate steps
5. **Multiple Representations**: Consider all valid forms of an answer (fraction, decimal, scientific notation)
6. **LaTeX Handling**: Take advantage of the LaTeX support for nicely formatted answers
## Limitations
* Cannot evaluate the correctness of the solution method, only the final answer
* May have difficulty with extremely complex LaTeX expressions
* Cannot evaluate mathematical proofs or abstract reasoning
* Works best with numerical answers rather than symbolic expressions
## Next Steps
* Learn about [Code Execution Evaluation](code_execution_with_e2b.md) for evaluating code solutions
* Explore [Function Calling Evaluation](function_calling_evaluation.md) for evaluating tool use
* See [Creating Custom Reward Functions](../tutorials/creating_your_first_reward_function.md) to build your own specialized math evaluators
# Reward Functions Overview
Source: https://docs.fireworks.ai/evaluators/examples/reward_functions_overview
This guide provides an overview of all out-of-the-box reward functions available in the Reward Kit library.
## Introduction
Reward Kit includes several pre-built reward functions for common evaluation tasks. These functions can be used directly or as building blocks for more complex evaluations.
## Available Reward Functions
### Format and Structure Rewards
These reward functions evaluate the format and structure of responses.
* **Format Reward**: Evaluate responses against a regex pattern
```python
import re

from reward_kit.rewards.format import format_reward
result = format_reward(
messages=messages,
pattern=r"^\n.*?\n\n.*?$",
flags=re.DOTALL
)
```
* **Tag Count Reward**: Check for exactly one of each specified tag
```python
from reward_kit.rewards.tag_count import tag_count_reward
result = tag_count_reward(
messages=messages,
tags=["pros", "cons"]
)
```
### Accuracy and Correctness Rewards
These reward functions evaluate the accuracy of responses against expected answers.
* **Accuracy Reward**: Compare answers to ground truth
```python
from reward_kit.rewards.accuracy import accuracy_reward
result = accuracy_reward(
messages=messages,
ground_truth="Paris"
)
```
* **Math Reward**: Compare numerical answers with expected values
```python
from reward_kit.rewards.math import math_reward
result = math_reward(
messages=messages,
expected_answer="42"
)
```
### Language and Style Rewards
These reward functions evaluate linguistic aspects of responses.
* **Language Consistency Reward**: Ensure response is in the target language
```python
from reward_kit.rewards.language_consistency import language_consistency_reward
result = language_consistency_reward(
messages=messages,
target_language="spanish"
)
```
* **Reasoning Steps Reward**: Encourage step-by-step reasoning
```python
from reward_kit.rewards.reasoning_steps import reasoning_steps_reward
result = reasoning_steps_reward(
messages=messages,
min_steps=3
)
```
### Length and Verbosity Rewards
These reward functions evaluate the length and verbosity of responses.
* **Length Reward**: Evaluate response against length targets
```python
from reward_kit.rewards.length import length_reward
result = length_reward(
messages=messages,
target_length=200, # Target token count
token_method="whitespace"
)
```
* **Cosine Length Reward**: Scale rewards based on length using cosine schedule
```python
from reward_kit.rewards.length import cosine_length_reward
result = cosine_length_reward(
messages=messages,
correctness=0.9, # High correctness score
max_length=500,
min_value_correct=0.5,
max_value_correct=1.0
)
```
* **Repetition Penalty Reward**: Penalize repetitive content
```python
from reward_kit.rewards.repetition import repetition_penalty_reward
result = repetition_penalty_reward(
messages=messages,
max_penalty=0.5,
ngram_size=3
)
```
### Code Execution Rewards
These reward functions evaluate code by running it and comparing the output to expected results.
* **Binary Code Reward**: Binary pass/fail for code execution
```python
from reward_kit.rewards.code_execution import binary_code_reward
result = binary_code_reward(
messages=messages,
expected_output="expected result",
language="python"
)
```
* **Fractional Code Reward**: Return exact pass rate for code execution
```python
from reward_kit.rewards.code_execution import fractional_code_reward
result = fractional_code_reward(
messages=messages,
test_cases=[
{"input": "arg1", "expected_output": "result1"},
{"input": "arg2", "expected_output": "result2"}
],
language="python"
)
```
* **IOI C/C++ Code Reward**: Evaluate C/C++ code using Piston engine
```python
from reward_kit.rewards.cpp_code import ioi_cpp_code_reward
result = ioi_cpp_code_reward(
messages=messages,
test_cases=[
{"input": "4\n5", "expected_output": "9"},
{"input": "10\n20", "expected_output": "30"}
],
language="cpp" # or "c"
)
```
* **Binary C/C++ Code Reward**: Binary pass/fail for C/C++ code
```python
from reward_kit.rewards.cpp_code import binary_cpp_code_reward
result = binary_cpp_code_reward(
messages=messages,
test_cases=[
{"input": "4\n5", "expected_output": "9"}
],
language="cpp"
)
```
### Function Calling Rewards
These reward functions evaluate function calls in LLM responses against expected schemas and behaviors.
* **Schema Jaccard Reward**: Compare function calls to expected schema
```python
from reward_kit.rewards.function_calling import schema_jaccard_reward
result = schema_jaccard_reward(
messages=messages,
expected_schema=schema
)
```
* **LLM Judge Reward**: Use an LLM to evaluate function call quality
```python
from reward_kit.rewards.function_calling import llm_judge_reward
result = llm_judge_reward(
messages=messages,
expected_schema=schema,
expected_behavior=behavior_description
)
```
* **Composite Function Call Reward**: Combine schema validation and LLM judgment
```python
from reward_kit.rewards.function_calling import composite_function_call_reward
result = composite_function_call_reward(
messages=messages,
expected_schema=schema,
expected_behavior=behavior_description
)
```
### JSON Schema Rewards
These reward functions validate JSON outputs against predefined schemas.
* **JSON Schema Reward**: Validate JSON against a schema
```python
from reward_kit.rewards.json_schema import json_schema_reward
result = json_schema_reward(
messages=messages,
schema=json_schema
)
```
### Combined Metrics Rewards
These reward functions combine multiple evaluation aspects into a single score.
* **Cosine-Scaled Accuracy + Length Reward**: Combine accuracy with length efficiency
```python
from reward_kit.rewards.accuracy_length import cosine_scaled_accuracy_length_reward
result = cosine_scaled_accuracy_length_reward(
messages=messages,
ground_truth="Paris",
max_length=200,
correctness_weight=0.7,
length_weight=0.3
)
```
## Choosing the Right Reward Function
Here's a guide to help you choose the appropriate reward function for your task:
| Task | Recommended Reward Function |
| -------------------------------------- | ------------------------------------------------- |
| Evaluating format adherence | `format_reward` |
| Checking tag usage and structure | `tag_count_reward` |
| Evaluating factual accuracy | `accuracy_reward` |
| Ensuring consistent language | `language_consistency_reward` |
| Encouraging step-by-step reasoning | `reasoning_steps_reward` |
| Controlling response length | `length_reward` |
| Optimizing for brevity and correctness | `cosine_scaled_accuracy_length_reward` |
| Reducing repetition | `repetition_penalty_reward` |
| Evaluating Python code | `fractional_code_reward` or `binary_code_reward` |
| Evaluating C/C++ code | `ioi_cpp_code_reward` or `binary_cpp_code_reward` |
| Validating tool use and function calls | `composite_function_call_reward` |
| Checking structured data outputs | `json_schema_reward` |
| Evaluating mathematical solutions | `math_reward` |
| Evaluating formal proofs in Lean | `lean_prover_reward`, `deepseek_prover_v2_reward` |
### Lean Theorem Prover Rewards
These reward functions evaluate formal proofs written in the Lean theorem prover language.
* **Lean Prover Reward**: Basic evaluation of Lean proofs
```python
from reward_kit.rewards.lean_prover import lean_prover_reward
result = lean_prover_reward(
response=model_response,
statement="For all natural numbers n, the sum of the first n natural numbers is n(n+1)/2.",
lean_version="4",
check_partial_progress=True
)
```
* **DeepSeek Prover V2 Reward**: Evaluate Lean proofs with focus on subgoal decomposition
```python
from reward_kit.rewards.lean_prover import deepseek_prover_v2_reward
result = deepseek_prover_v2_reward(
response=model_response,
statement="For all natural numbers n, the sum of the first n natural numbers is n(n+1)/2.",
check_subgoals=True,
verbose=True
)
```
* **DeepSeek HuggingFace Prover Benchmark**: Evaluate proofs against the DeepSeek-ProverBench dataset
```python
from reward_kit.rewards.lean_prover import deepseek_huggingface_prover_benchmark
result = deepseek_huggingface_prover_benchmark(
response=model_response,
statement="For any positive integers a and b, gcd(a,b) divides any linear combination of a and b",
dataset_name="deepseek-ai/DeepSeek-ProverBench",
check_for_answer=True
)
```
## Combining Reward Functions
You can combine multiple reward functions to create comprehensive evaluations:
```python
from reward_kit.rewards.accuracy import accuracy_reward
from reward_kit.rewards.length import length_reward
from reward_kit import reward_function, RewardOutput, MetricRewardOutput
@reward_function
def combined_accuracy_length(messages, ground_truth=None, **kwargs):
"""Combine accuracy and length evaluation."""
# Check accuracy
accuracy_result = accuracy_reward(
messages=messages,
ground_truth=ground_truth
)
# Check length
length_result = length_reward(
messages=messages,
target_length=150
)
# Combine scores with weighting
# 70% accuracy, 30% length
combined_score = 0.7 * accuracy_result["score"] + 0.3 * length_result["score"]
# Combine metrics
metrics = {
"accuracy": MetricRewardOutput(
score=accuracy_result["score"],
reason=accuracy_result["reason"]
),
"length": MetricRewardOutput(
score=length_result["score"],
reason=length_result["reason"]
)
}
return RewardOutput(score=combined_score, metrics=metrics)
```
## Pre-Built Combined Metrics
Reward Kit offers pre-built functions that combine multiple metrics:
* **Cosine-Scaled Accuracy + Length**: Combines accuracy with length using a cosine schedule
```python
from reward_kit.rewards.accuracy_length import cosine_scaled_accuracy_length_reward
result = cosine_scaled_accuracy_length_reward(
messages=messages,
ground_truth="Paris",
max_length=200,
correctness_weight=0.7, # Weight for accuracy component
length_weight=0.3 # Weight for length component
)
```
This function:
* Evaluates response accuracy against ground truth
* Measures response length efficiency using a cosine schedule (illustrated in the sketch below)
* Rewards shorter correct answers more than longer ones
* Maintains a clear separation between correct and incorrect answers
* Allows customizable weighting between accuracy and length
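For intuition, here is an illustrative sketch of cosine length scaling; the exact formula and constants used by `cosine_scaled_accuracy_length_reward` may differ:
```python
import math

# Illustrative cosine schedule: shorter responses score near max_value,
# responses approaching max_length score near min_value.
def cosine_scaled_value(length: int, max_length: int,
                        min_value: float = 0.5, max_value: float = 1.0) -> float:
    progress = min(length / max_length, 1.0)
    return min_value + 0.5 * (max_value - min_value) * (1.0 + math.cos(math.pi * progress))

print(cosine_scaled_value(20, 200))   # ~0.99 for a short correct answer
print(cosine_scaled_value(180, 200))  # ~0.51 for a long correct answer
```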
## Next Steps
* Explore individual reward function documentation:
* [Format and Structure Rewards](../api_reference/reward_functions/format.md)
* [Accuracy and Correctness Rewards](../api_reference/reward_functions/accuracy.md)
* [Language and Style Rewards](../api_reference/reward_functions/language.md)
* [Length and Verbosity Rewards](../api_reference/reward_functions/length.md)
* [Code Execution Rewards](code_execution_evaluation.md)
* [Function Calling Rewards](function_calling_evaluation.md)
* [JSON Schema Validation](json_schema_validation.md)
* [Math Evaluation](math_evaluation.md)
* [DeepSeek-Prover-V2](deepseek_prover_v2.md)
* [Combined Metrics Rewards](../api_reference/reward_functions/combined.md)
* Learn how to [create your own reward functions](../tutorials/creating_your_first_reward_function.md)
* Read [best practices](../tutorials/best_practices.md) for effective evaluations
* See [examples](../developer_guide/evaluation_workflows.md) of common evaluation workflows
# Account setup & management
Source: https://docs.fireworks.ai/faq/account/access/setup-management
Solutions for common account access issues and management procedures for Fireworks.ai accounts
## Multiple account access
**Q: What should I do if I can't access my company account after being invited when I already have a personal account?**
This issue can occur when you have multiple accounts associated with the same email address (e.g., a personal account created with Google login and a company account you've been invited to).
To resolve this:
1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) from the email address associated with both accounts
2. Include in your email:
* The account ID you created personally (e.g., username-44ace8)
* The company account ID you need access to (e.g., company-a57b2a)
* Mention that you're having trouble accessing your company account
Note: This is a known scenario that support can resolve once they verify your email ownership.
***
## Account closure
**Q: How do I close my Fireworks.ai account?**
To close your account:
1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
2. Include in your request:
* Your account ID
* A clear request for account deletion
Before closing your account, please ensure:
* All outstanding invoices are paid
* Any active deployments are terminated
* Important data is backed up if needed
***
## Signing in from different Fireworks accounts
**Q: I have multiple Fireworks accounts. When I try to login with Google on Fireworks' web UI, I'm getting signed into the wrong account. How do I fix this?**
If you log in with Google, account management is controlled by Google. You can log in through incognito mode or create separate Chrome/browser profiles to log in with different Google accounts. You could also follow the steps in this [guide](https://support.google.com/accounts/answer/13533235?hl=en#zippy=%2Csign-in-with-google) to disassociate Fireworks.ai from a particular Google account sign-in. If you have more complex issues, please contact us on Discord.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Billing management
Source: https://docs.fireworks.ai/faq/billing-pricing-usage/billing/billing-management
Information about Fireworks.ai invoicing and API billing.
## Invoice questions
**Q: Why did I receive an invoice when I only deposited credits?**
Fireworks.ai billing works as follows:
* **Deposited credits** are used first.
* Once credits are exhausted, you **continue to accrue charges** for additional usage.
* **Usage charges** are billed at the end of each month.
* You’ll receive an invoice for any usage that **exceeded your pre-purchased credits**.
This process happens automatically, regardless of subscription status. To prevent additional charges, please monitor your usage or contact support to set up spending restrictions.
**Q: Where's my receipt for purchased credits?**
Receipts for purchased credits are sent via Stripe upon initial credit purchase. Check your email for receipts from Stripe (not Fireworks). Contact [billing@fireworks.ai](mailto:billing@fireworks.ai) if you are still encountering problems.
***
## API billing
**Q: Are calls to the Models API billable?**
No, calls to the **Models API** endpoint are free. This applies to all **management API calls** for:
* Accounts
* Users
* Models
* Datasets
*Note*: While the API calls themselves are free, charges apply for:
* **Model deployments**
* **Fine-tuning jobs**
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Credit system
Source: https://docs.fireworks.ai/faq/billing-pricing-usage/billing/credit-system
Understanding how Fireworks.ai billing, credits, and account suspension work.
## Billing and credit usage
**Q: How does billing and credit usage work?**
Usage and billing operate through a **tiered system**:
* Each **tier** has a monthly usage limit, regardless of available credits.
* Once you reach your tier's limit, **service will be suspended** even if you have remaining credits.
* **Usage limits** reset at the beginning of each month.
* Pre-purchased credits do not prevent additional charges once the limit is exceeded.
***
## Account suspension
**Q: Why might my account be suspended even with remaining credits?**
Your account may be suspended due to several factors:
1. **Monthly usage limits**:
* Each tier includes a monthly usage limit, independent of any credits.
* Once you reach this limit, your service will be suspended, even if you have credits remaining.
* Usage limits automatically reset at the beginning of each month.
2. **Billing structure**:
* Pre-purchased credits do not prevent additional charges.
* You can exceed your pre-purchased credits and will be billed for any usage beyond that limit.
* **Example**: If you have `$20` in pre-purchased credits but incur `$83` in usage, you will be billed for the `$63` difference.
***
## Missing credits
**Q: I bought credits but don’t see them reflected in my account. Did they disappear?**
Fireworks operates with a **postpaid billing** system where:
* **Prepaid credits** are instantly applied to any outstanding balance.
* **Example**: If you had a `$750` outstanding bill and added `$500` in credits, your bill would reduce to `$250`, with `$0` remaining credits available for new usage.
To check your credit balance:
1. Visit your **billing dashboard**.
2. Review the **"Credits"** section.
3. Check your **current outstanding balance**.
*Note*: Credits are always applied to any existing balance before being available for new usage.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Cost structure
Source: https://docs.fireworks.ai/faq/billing-pricing-usage/pricing/cost-structure
Understanding Fireworks.ai pricing and fees for various services.
## Platform costs
**Q: How much does Fireworks cost?**
Fireworks AI operates on a **pay-as-you-go** model for all non-Enterprise usage, and new users automatically receive free credits. You pay based on:
* **Per token** for serverless inference
* **Per GPU usage time** for on-demand deployments
* **Per token of training data** for fine-tuning
For customers needing **enterprise-grade security and reliability**, please reach out to us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) to discuss options.
Find out more about our current pricing on our [Pricing page](https://fireworks.ai/pricing).
***
## Fine-tuning fees
**Q: Are there extra fees for serving fine-tuned models?**
No, deploying fine-tuned models to serverless infrastructure is free. Here’s what you need to know:
**What’s free**:
* Deploying fine-tuned models to serverless infrastructure
* Hosting the models on serverless infrastructure
* Deploying up to 100 fine-tuned models
**What you pay for**:
* **Usage costs** on a per-token basis when the model is actually used
* The **fine-tuning process** itself, if applicable
*Note*: This differs from on-demand deployments, which include hourly hosting costs.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Discounts
Source: https://docs.fireworks.ai/faq/billing-pricing-usage/pricing/discounts
Information about bulk usage discounts and special pricing options.
## Bulk usage
**Q: Are there discounts for bulk usage?**
Yes, we offer discounts for **bulk or pre-paid purchases**, exclusively for on-demand deployments (not for serverless GPUs). Please contact [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) if you're interested.
***
## Serverless discounts
**Q: Are there discounts for bulk spend on serverless deployments?**
Our publicly accessible services have **standard rates** for all customers. Currently, we do not offer bulk discounts for serverless deployments.
***
## Additional information
For **enterprise customers** or **high-volume users**:
* Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options**
* Discuss **annual commitment discounts**
* Explore **enterprise-specific features and benefits**
# Billing & scaling
Source: https://docs.fireworks.ai/faq/deployment/ondemand/billing-scaling
Understanding billing and scaling mechanisms for on-demand deployments.
## Autoscaling and costs
**Q: How does autoscaling affect my costs?**
* **Scaling from 0**: No minimum cost when scaled to zero
* **Scaling up**: Each new replica adds to your total cost proportionally. For example:
* Scaling from 1 to 2 replicas doubles your GPU costs
* If each replica uses multiple GPUs, costs scale accordingly (e.g., scaling from 1 to 2 replicas with 2 GPUs each means paying for 4 GPUs total)
For current pricing details, please visit our [pricing page](https://fireworks.ai/pricing).
***
## Rate-limits for on-demand deployment
**Q: What are the rate limits for on-demand deployments?**
Request throughput scales with your GPU allocation. Base allocations include:
* Up to 8 A100 GPUs
* Up to 8 H100 GPUs
On-demand deployments offer several advantages:
* **Predictable pricing** based on time units, not token I/O
* **Protected latency and performance**, independent of traffic on the serverless platform
* **Choice of GPUs**, including A100s and H100s
Need more GPUs? Contact us to discuss higher allocations for your specific use case.
***
## On-demand billing
**Q: How does billing work for on-demand deployments?**
On-demand deployments come with automatic cost optimization features:
* **Default autoscaling**: Automatically scales to 0 replicas when not in use
* **Pay for what you use**: Charged only for GPU time when replicas are active
* **Flexible configuration**: Customize autoscaling behavior to match your needs
**Best practices for cost management**:
1. **Leverage default autoscaling**: The system automatically scales down deployments when not in use
2. **Customize carefully**: While you can modify autoscaling behavior using our [configuration options](https://docs.fireworks.ai/guides/ondemand-deployments#customizing-autoscaling-behavior), note that preventing scale-to-zero will result in continuous GPU charges
3. **Consider your use case**: For intermittent or low-frequency usage, serverless deployments might be more cost-effective
For detailed configuration options, see our [deployment guide](https://docs.fireworks.ai/guides/ondemand-deployments#replica-count-horizontal-scaling).
***
## Scaling structure
**Q: How does billing and scaling work for on-demand GPU deployments?**
On-demand GPU deployments have unique billing and scaling characteristics compared to serverless deployments:
**Billing**:
* Charges start when the server begins accepting requests
* **Billed by GPU-second** for each active instance
* Costs accumulate even if there are no active API calls
**Scaling options**:
* Supports **autoscaling** from 0 to multiple GPUs
* Each additional GPU **adds to the billing rate**
* Can handle unlimited requests within the GPU’s capacity
**Management requirements**:
* Not fully serverless; requires some manual management
* **Manually delete deployments** when no longer needed
* Or configure autoscaling to **scale down to 0** during inactive periods
**Cost control tips**:
* Regularly **monitor active deployments**
* **Delete unused deployments** to avoid unnecessary costs
* Consider **serverless options** for intermittent usage
* Use **autoscaling to 0** to optimize costs during low-demand times
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options**
# Deployment issues
Source: https://docs.fireworks.ai/faq/deployment/ondemand/deployment-issues
Troubleshooting and resolving common issues with on-demand deployments.
## Custom model issues
**Q: What are the common issues when deploying custom models?**
Here are key areas to troubleshoot for custom model deployments:
### 1. Deployment hanging or crashing
**Common causes**:
* **Missing model files**, especially when using Hugging Face models
* **Symlinked files** not uploaded correctly
* **Outdated firectl version**
**Solutions**:
* Download models without symlinks using:
```bash
huggingface-cli download model_name --local-dir=/path --local-dir-use-symlinks=False
```
* Update **firectl** to the latest version
### 2. LoRA adapters vs full models
* **Compatibility**: LoRA adapters work with specific base models.
* **Performance**: May experience slightly lower speed with LoRA, but **quality should remain similar** to the original model.
* **Troubleshooting quality drops**:
* Check **model configuration**
* Review **conversation template**
* Add `echo: true` to debug requests
### 3. Performance optimization factors
Consider adjusting the following for improved performance:
* **Accelerator count** and **accelerator type**
* **Long prompt** settings to handle complex inputs
***
## Autoscaling
**Q: What should I expect for deployment and scaling performance?**
* **Initial deployment**: Should complete within minutes
* **Scaling from zero**: You may experience brief availability delays while the system scales up
* **Troubleshooting**: If deployment takes over 1 hour, this typically indicates a crash and should be investigated
* **Best practice**: Monitor deployment status and contact support if deployment times are unusually long
***
## Performance questions
**Q: I have more specific performance questions about improvements**
For detailed discussions on performance and optimization options:
* **Schedule a consultation** directly with our PM, Ray Thai ([calendly](https://calendly.com/raythai))
* Discuss your **specific use cases**
* Get **personalized recommendations**
* Review **advanced configuration options**
*Note*: Monitor costs carefully during the deployment and testing phase, as repeated deployments and tests can quickly consume credits.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options**
# Hardware options
Source: https://docs.fireworks.ai/faq/deployment/ondemand/hardware-options
Understanding hardware choices for Fireworks.ai on-demand deployments.
## Hardware selection
**Q: Which accelerator/GPU should I use?**
It depends on your specific needs. Fireworks offers two groupings of accelerators: smaller (A100) and larger (H100, H200, and MI300X) accelerators. Smaller accelerators are less expensive (see the [pricing page](https://fireworks.ai/pricing)), so they’re more cost-effective for low-volume use cases. However, if you have enough volume to fully utilize a larger accelerator, we find that larger accelerators tend to be both faster and more cost-effective per token.
Choosing between the larger accelerators depends on the use case:
* MI300X has the highest memory capacity and sometimes enables large models to be deployed with comparatively few GPUs. For example, unquantized Llama 3.1 70B fits on one MI300X, and FP8 Llama 405B fits on four MI300Xs. Higher memory may also enable better throughput for longer prompts and less-sharded deployments. It’s also more affordably priced than the H100.
* H100 offers blazing-fast inference and often provides the highest throughput, especially for high-volume use cases.
* H200 is recommended for large models like DeepSeek V3 and DeepSeek R1; for example, the minimum configuration for DeepSeek V3 or DeepSeek R1 is 8 H200s.
### Best Practices for Selection
1. **Analyze your workload requirements** to determine which GPU fits your processing needs.
2. Consider your **throughput needs** and the scale of your deployment.
3. Calculate the **cost-performance ratio** for each hardware option.
4. Factor in **future scaling needs** to ensure the selected GPU can support growth.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options**
# On-demand deployment scaling
Source: https://docs.fireworks.ai/faq/deployment/ondemand/ondemand-deployment-scaling
Understanding Fireworks.ai system scaling and request handling capabilities.
## System scaling
**Q: How does the system scale?**
Our system is **horizontally scalable**, meaning it:
* Scales linearly with additional **replicas** of the deployment
* **Automatically allocates resources** based on demand
* Manages **distributed load handling** efficiently
***
## Auto scaling
**Q: Do you support Auto Scaling?**
Yes, our system supports **auto scaling** with the following features:
* **Scaling down to zero** capability for resource efficiency
* Controllable **scale-up and scale-down velocity**
* **Custom scaling rules and thresholds** to match your specific needs
***
## Throughput capacity
**Q: What’s the supported throughput?**
Throughput capacity typically depends on several factors:
* **Deployment type** (serverless or on-demand)
* **Traffic patterns** and **request patterns**
* **Hardware configuration**
* **Model size and complexity**
***
## Request handling
**Q: What factors affect the number of simultaneous requests that can be handled?**
The request handling capacity is influenced by multiple factors:
* **Model size and type**
* **Number of GPUs** allocated to the deployment
* **GPU type** (e.g., A100 vs. H100)
* **Prompt size** and **generation token length**
* **Deployment type** (serverless vs. on-demand)
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Performance optimization
Source: https://docs.fireworks.ai/faq/deployment/performance/optimization
Guidelines for optimizing performance and benchmarking Fireworks.ai deployments.
## Performance improvement
**Q: What are the techniques to improve performance?**
To optimize model performance, consider the following techniques:
1. **Quantization**
2. **Check model type**: Determine whether the model is **GQA** (Grouped Query Attention) or **MQA** (Multi-Query Attention).
3. **Increase batch size** to improve throughput.
***
## Benchmarking
**Q: How can we benchmark?**
There are multiple ways to benchmark your deployment’s performance:
* Use our [open-source load-testing tool](https://github.com/fw-ai/benchmark)
* Develop custom performance testing scripts
* Integrate with monitoring tools to track metrics
***
## Model latency
**Q: What’s the latency for small, medium, and large LLM models?**
Model latency and performance depend on various factors:
* **Input/output prompt lengths**
* **Model quantization**
* **Model sharding**
* **Disaggregated prefill processes**
* **Hardware configuration**
* **Multiple layers of caching**
* **Fire optimizations**
* **LoRA adapters** (Low-Rank Adaptation)
Our team specializes in personalizing model performance. We work with you to understand your traffic patterns and create customized deployment templates that maximize performance for your use case.
***
## Performance factors
**Q: What factors affect model latency and performance?**
Key factors that impact latency and performance include:
* **Model architecture and size**
* **Hardware configuration**
* **Network conditions**
* **Request patterns**
* **Batch size settings**
* **Caching implementation**
***
## Best practices
**Q: What are the best practices for optimizing performance?**
For optimal performance, follow these recommendations:
1. **Choose an appropriate model size** for your specific use case.
2. **Implement batching strategies** to improve efficiency.
3. **Use quantization** where applicable to reduce computational load.
4. **Monitor and adjust scaling parameters** to meet demand.
5. **Optimize prompt lengths** to reduce processing time.
6. **Implement caching** to minimize repeated calculations (see the sketch below).
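A minimal sketch of client-side caching for repeated prompts follows; `generate` is a placeholder for your own inference call, and real applications should also account for cache size, TTLs, and sampling temperature:
```python
from functools import lru_cache

def generate(prompt: str) -> str:
    # Placeholder: replace with your actual model call
    return f"response to: {prompt}"

@lru_cache(maxsize=1024)
def cached_generate(prompt: str) -> str:
    # Identical prompts are served from the local cache instead of a new request
    return generate(prompt)

print(cached_generate("Summarize the latest deployment logs."))
print(cached_generate("Summarize the latest deployment logs."))  # cache hit
```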
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Costs & management
Source: https://docs.fireworks.ai/faq/deployment/serverless/costs-management
Understanding costs and model availability for serverless deployments.
## Deployment costs
**Q: Are there costs associated with deploying fine-tuned models to serverless infrastructure?**
No, deploying fine-tuned models to serverless infrastructure is free.
**What’s free**:
* Deploying fine-tuned models to serverless
* Hosting models on serverless infrastructure
* Deploying up to 100 fine-tuned models
**What you pay for**:
* **Usage costs** on a per-token basis when the model is actually used
* The **fine-tuning process** itself, if applicable
*Note*: This differs from on-demand deployments, which include hourly hosting costs.
***
## Model availability
**Q: Do you provide notice before removing model availability?**
Yes, we provide advance notice before removing models from the serverless infrastructure:
* **Minimum 2 weeks’ notice** before model removal
* Longer notice periods may be provided for **popular models**, depending on usage
* Higher-usage models may have extended deprecation timelines
**Best Practices**:
1. Monitor announcements regularly.
2. Prepare a migration plan in advance.
3. Test alternative models to ensure continuity.
4. Keep your contact information updated for timely notifications.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Performance issues
Source: https://docs.fireworks.ai/faq/deployment/serverless/performance-issues
Troubleshooting timeout errors and performance issues with serverless LLM models.
## Timeout and response times
**Q: Why am I experiencing request timeout errors and slow response times with serverless LLM models?**
Timeout errors and increased response times can occur due to **server load during high-traffic periods**.
With serverless, users are essentially **sharing a pool of GPUs** with models pre-provisioned.
The goal of serverless is to allow users and teams to **seamlessly power their generative applications** with the **latest generative models** in **less than 5 lines of code**, with **minimal deployment barriers** and **usage-based pricing**.
The trade-off of this approach is that, to ensure users have **consistent access** to the most in-demand models, they may also see **minor latency and performance variability** during **high-volume periods**.
With **on-demand deployments**, users reserve GPUs (which are **billed by rented time** instead of usage volume) and don't have to worry about traffic spikes.
That is why we recommend two ways to address timeout and response-time issues:
### Current solution (recommended for production)
* **Use on-demand deployments** for more stable performance
* **Guaranteed response times**
* **Dedicated resources** to ensure availability
We are always investing in ways to improve speed and performance.
### Upcoming improvements
* Enhanced SLAs for uptime
* More consistent generation speeds during peak load times
If you experience persistent issues, please include the following details in your support request:
1. Exact **model name**
2. **Timestamp** of errors (in UTC)
3. **Frequency** of timeouts
4. **Average wait times**
### Performance optimization tips
* Consider **batch processing** for handling bulk requests
* Implement **retry logic with exponential backoff** (see the sketch after this list)
* Monitor **usage patterns** to identify peak traffic times
* Set **appropriate timeout settings** based on model complexity
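As a rough illustration of the retry recommendation above, here is a minimal sketch of a chat completions call wrapped in exponential backoff with jitter. The endpoint and model follow examples used elsewhere in these docs; the retry counts, delays, and timeout are placeholder values to tune for your workload.

```python
import random
import time

import requests

API_URL = "https://api.fireworks.ai/inference/v1/chat/completions"
API_KEY = "<FIREWORKS_API_KEY>"


def chat_with_retries(payload, max_retries=5, base_delay=1.0, timeout=60):
    """Retry transient failures (timeouts, 429s, 5xx) with exponential backoff."""
    last_error = None
    for attempt in range(max_retries):
        try:
            resp = requests.post(
                API_URL,
                headers={"Authorization": f"Bearer {API_KEY}"},
                json=payload,
                timeout=timeout,
            )
            if resp.status_code != 429 and resp.status_code < 500:
                resp.raise_for_status()  # surface non-retryable client errors
                return resp.json()
            last_error = f"HTTP {resp.status_code}"
        except requests.exceptions.Timeout as exc:
            last_error = exc  # request timed out; retry
        # Exponential backoff with jitter: ~1s, 2s, 4s, ... plus up to 1s of noise.
        time.sleep(base_delay * (2 ** attempt) + random.random())
    raise RuntimeError(f"Request failed after {max_retries} attempts: {last_error}")


result = chat_with_retries({
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256,
})
```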
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Service levels
Source: https://docs.fireworks.ai/faq/deployment/serverless/service-levels
Understanding SLAs and service guarantees for Fireworks.ai serverless deployments.
## Latency guarantees
**Q: Is latency guaranteed for serverless models?**
Currently, there are **no latency or availability guarantees** for serverless models. However, they are coming soon; we recommend contacting [sales](https://fireworks.ai/company/contact-us) to discuss any specific needs or requirements you have.
***
## Service level agreements
**Q: Are there any SLAs for serverless models?**
Our **multi-tenant serverless offering** does not currently come with **Service Level Agreements (SLAs)**. However, they are coming, and we'd love to understand your use case to ensure you have the best experience possible on the Fireworks platform. Reach out to us via sales or our Discord community.
***
## Quota information
**Q: Are there any quotas for serverless?**
For **serverless deployments**, quotas are as follows:
* **Developer accounts**: 600 requests per minute (RPM)
* **Enterprise accounts**: 600 requests per minute (RPM)
* Quotas apply **across all models** and cannot be exceeded within the serverless infrastructure
**For higher quotas**:
* Consider switching to **on-demand deployments**
* **Contact enterprise sales** for custom solutions
* Evaluate **dedicated infrastructure options** for greater flexibility
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Certifications
Source: https://docs.fireworks.ai/faq/enterprise/compliance/certifications
Information about Fireworks.ai compliance certifications and HIPAA requirements.
## Security certifications
**Q: What type of certifications do you have?**
We are **SOC 2 Type II** and **HIPAA Certified**. These certifications demonstrate our commitment to:
* **Security**
* **Availability**
* **Processing integrity**
* **Confidentiality**
* **Privacy**
You can view more at [https://trust.fireworks.ai/](https://trust.fireworks.ai/).
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Enterprise sales**: Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for more information
# Enterprise quotas
Source: https://docs.fireworks.ai/faq/enterprise/service/quotas
Understanding quota allocations for Enterprise customers.
## Enterprise limits
**Q: Are there any quotas for Enterprise Tier?**
No, there are **no quotas** for Enterprise Tier. Enterprise customers benefit from:
1. **Resource Allocation**:
* **Unlimited request capacity**
* **Flexible scaling options**
* **Custom resource allocation**
2. **Performance Benefits**:
* **Dedicated infrastructure**
* **Priority processing**
* **Enhanced support**
3. **Custom Solutions**:
* **Tailored deployment options**
* **Specialized configurations**
* **Customized scaling policies**
For specific requirements or custom configurations, contact your **enterprise account representative**.
***
## Additional resources
* **Enterprise sales**: Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for more information
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
# Platform support
Source: https://docs.fireworks.ai/faq/general/support/platform-support
Information about Fireworks.ai deployment regions, general support channels, and platform requests.
## General support
**Q: I have another question or issue.**
We have an active [Discord community](https://discord.gg/mMqQxvFD9A) where you can:
* Post questions
* Request features
* Report bugs
* Interact directly with the Fireworks team and community
***
## Feature requests
**Q: How can I request a new model to be added to the platform?**
Head over to our **Discord server** and let us know which models you would like to see deployed. We actively take feature requests for new, popular models.
***
## Product feedback
**Q: I have specific performance questions or want to know about further performance improvement options.**
If you need more tailored performance advice or want to discuss advanced optimization options, here are two ways to get support:
1. **General support**: Reach out via our [support channels](https://fireworks.ai/company/contact-us) or check out the performance optimization practices for tips on maximizing efficiency with on-demand deployments.
2. **Direct consultation**: For in-depth questions, feel free to schedule a consultation directly with our Product Manager, Ray Thai, using [this link to his calendar](https://calendly.com/raythai). Ray can assist with advanced optimization strategies and hardware recommendations based on your specific workload and deployment needs.
***
## Deployment regions
**Q: Do you host your deployments in the EU or Asia?**
We are currently deployed in multiple U.S.-based locations. However, we’re open to hearing more about your specific requirements. You can:
* Join our [Discord community](https://discord.gg/mMqQxvFD9A)
* Write to us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
If you're an Enterprise customer, please contact your dedicated customer support representative to ensure a timely response.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Support structure & access
Source: https://docs.fireworks.ai/faq/general/support/structure-access
Information about Fireworks.ai support options, access methods, and communication channels.
## Support options
**Q: What support options exist?**
* Enterprise accounts receive **dedicated support**.
* Developer-tier customers can interact directly with the Fireworks team and community through our **Discord channel**.
***
## Support process
**Q: How does Support work?**
Fireworks provides support for its services with **target response times** based on the **priority level** of the issue. Customers can indicate priority when creating support issues through the **Fireworks support system**.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Enterprise support tiers & SLAs
Source: https://docs.fireworks.ai/faq/general/support/tiers-slas
Detailed information about Fireworks.ai support priority levels and response time commitments.
## Enterprise support contact
**Q: If you're an Enterprise customer, how do you contact support?**
Enterprise customers have access to **dedicated support channels**. Please contact your assigned **customer support representative** for timely assistance.
***
## Communication channels
**Q: Do you have a shared Slack channel?**
For customers who use Slack internally, we create a **shared Slack channel**. This channel is used for:
* **Answering questions** about Fireworks’ platform and features
* **Receiving bug reports** from customers
* **Communicating** around incidents and escalations
* **Announcing new features** and requesting feedback on current offerings
***
## Support priority levels
**Q: What are the support tiers and SLAs for enterprise?**
Support issues are categorized into four priority levels, with specific examples for each:
| Priority Level | Response Time | Description | Examples |
| --------------- | ----------------------- | ------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------- |
| **Urgent (P0)** | Within 1 hour | Reserved for critical cases that break live production workflows | • Production scheduled task/runbook unexpectedly failing • Application inaccessible to end users |
| **High (P1)** | Within 4 business hours | Problems that prevent regular platform usage but not breaking live production | • Development/staging schedule failing • Task deployment failing |
| **Normal (P2)** | Within 8 business hours | Requests for information, enhancements, or documentation clarification with no negative service impact | • Feature requests • Documentation questions |
| **Low (P3)** | Within 2 business days | Any issues that don't fall into P0, P1, or P2 categories | • General inquiries • Non-urgent requests |
*Note: Business hours refer to standard working hours.*
# Platform models
Source: https://docs.fireworks.ai/faq/models/availability/platform-models
Information about custom and available models on Fireworks.ai.
## Custom models
**Q: Does Fireworks support custom base models?**
Yes, custom base models can be deployed via **firectl**. You can learn more about custom model deployment in our [guide on uploading custom models](https://docs.fireworks.ai/models/uploading-custom-models).
***
## Model availability
**Q: There’s a model I would like to use that isn’t available on Fireworks. Can I request it?**
Fireworks supports a wide array of custom models and actively takes feature requests for new, popular models to add to the platform.
**To request new models**:
1. **Join our [Discord server](https://discord.gg/fireworks-ai)**
2. Let us know which models you’d like to see
3. Provide **use case details**, if possible, to help us prioritize
We regularly evaluate and add new models based on:
* **Community requests**
* **Popular demand**
* **Technical feasibility**
* **Licensing requirements**
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Fine-tuning service
Source: https://docs.fireworks.ai/faq/models/fine-tuning/service-overview
Overview of Fireworks.ai fine-tuning capabilities and supported models.
## Service availability
**Q: Does Fireworks offer a fine-tuning service?**
Yes, Fireworks offers a fine-tuning service. Take a look at our [fine-tuning guide](https://docs.fireworks.ai/fine-tuning/fine-tuning-models), which is also available [via REST API](https://docs.fireworks.ai/fine-tuning/fine-tuning-via-api) for detailed information about our services and capabilities.
***
## Model support
**Q: What models are supported for fine-tuning? Is Llama 3 supported for fine-tuning?**
Yes, **Llama 3** (8B and 70B) is supported for fine-tuning with **LoRA adapters**, which can be deployed via our **serverless** and **on-demand** options for inference.
**Capabilities include**:
* **LoRA adapter training** for flexible model adjustments
* **Serverless deployment support** for scalable, cost-effective usage
* **On-demand deployment options** for high-performance inference
* A variety of **base model options** to suit different use cases
For a complete list of models available for fine-tuning, refer to our [documentation](https://docs.fireworks.ai/fine-tuning/fine-tuning-models).
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Fine-tuning troubleshooting
Source: https://docs.fireworks.ai/faq/models/fine-tuning/troubleshooting
Solutions for common fine-tuning deployment and access issues.
## Access issues
**Q: Why am I getting "Model not found" errors when trying to access my fine-tuned model?**
If you’re unable to access your fine-tuned model, try these troubleshooting steps:
**First steps**:
* Attempt to access the model through both the **playground** and the **API**.
* Check if the error occurs for **all users** on the account.
* Ensure your **API key** is valid.
**Common causes**:
* User email previously associated with a **deleted account**
* **API key permissions** issues
* **Access conflicts** due to multiple accounts
**Debug process**:
1. Verify the API key’s validity using:
```bash
curl -v -H "Authorization: Bearer $FIREWORKS_API_KEY" https://api.fireworks.ai/verifyApiKey
```
2. Check if the issue persists across different **API keys**.
3. Identify which specific **users/emails** are affected.
**Getting help**:
* Contact support with:
* Your **account ID**
* **API key verification** results
* A list of **affected users/emails**
* Results from both **playground** and **API** tests
*Note*: If you have multiple accounts, ensure that access permissions are checked across all of them.
***
## Troubleshooting firectl deployment
**Q: Why am I getting "invalid id" errors when using firectl commands like create deployment or list deployments?**
This error typically occurs when your **account ID** is not properly configured.
### Common symptoms
* Error message: `invalid id: id must be at least 1 character long`
* Affects multiple commands, including:
* `firectl create deployment`
* `firectl list deployments`
### Steps to resolve
1. Run `firectl whoami` to check which **account id** is being used.
2. Ensure the correct **account ID** is being used. If not, run `firectl signin` to sign-in to the right account.
***
## LoRA deployment issues
**Q: Why can’t I deploy my fine-tuned Llama 3.1 LoRA adapter?**
If you encounter the following error:
```bash
Invalid LoRA weight model.layers.0.self_attn.q_proj.lora_A.weight shape: torch.Size([16, 4096]), expected (16, 8192)
```
This issue is due to the `fireworks.json` file being set to **Llama 3.1 70b instruct** by default.
**Workaround**:
1. Download the **model weights**.
2. Modify the base model to be `accounts/fireworks/models/llama-v3p1-8b-instruct`.
3. Follow the instructions in the [documentation](https://fireworks.ai/fine-tuning/model-upload) to upload and deploy the model.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# FLUX capabilities
Source: https://docs.fireworks.ai/faq/models/image-generation/flux
Understanding FLUX image generation features and limitations.
## Multiple images
**Q: Can I generate multiple images in a single API call using FLUX serverless?**
No, FLUX serverless supports only one image per API call. For multiple images, send separate parallel requests—these will be automatically load-balanced across our replicas for optimal performance.
***
## Image-to-image generation
**Q: Does FLUX support image-to-image generation?**
No, image-to-image generation is not currently supported. We are evaluating this feature for future implementation. If you have specific use cases, please share them with our support team to help inform development.
***
## LoRA models
**Q: Can I create custom LoRA models with FLUX?**
Inference on FLUX-LoRA adapters is currently supported. However, managed training with FLUX on Fireworks is not, although this feature is under development. Updates about our managed LoRA training service will be announced when available.
***
## Size control
**Q: How do I control output image sizes when using SDXL ControlNet?**
When using **SDXL ControlNet** (e.g., canny control), the output image size is determined by the explicit **width** and **height** parameters in your API request.
The input control signal image will be automatically:
* **Resized** to fit your specified dimensions
* **Cropped** to preserve aspect ratio
**Example**: To generate a 768x1344 image, explicitly include these parameters in your request:
```json
{
"width": 768,
"height": 1344
}
```
*Note*: While these parameters may not appear in the web interface examples, they are supported API parameters that can be included in your requests.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Limitations & controls
Source: https://docs.fireworks.ai/faq/models/inference/limitations-controls
Understanding model limitations, safety features, and token limits.
## Safety Features
**Q: Can safety filters or content restrictions be disabled on text generation models?**
No, safety features and content restrictions for text generation models (such as Llama, Mistral, etc.) are embedded by the original model creators during training:
* **Safety measures** are integrated directly into the models by the teams that trained and released them.
* These are **core behaviors** of the model, not external filters.
* Different models may have varying levels of built-in safety.
* **Fireworks.ai does not add additional censorship layers** beyond what is inherent in the models.
* Original model behaviors **cannot be modified** via API parameters or configuration.
*Note*: For specific content handling needs, review the documentation of each model to understand its inherent safety features.
## Token Limits
**Q: What are the maximum completion token limits for models, and can they be increased?**
* For most models, the max completion token limit is the full context window of the model, e.g. 128K for DeepSeek R1
* `max_tokens` is set to 2K by default; set it to a higher value if you plan to have long generations (see the request sketch below).
* **Llama 3.1 405B** has a **4,096-token completion limit**. Setting a higher `max_tokens` in API calls **will not override** this limit.
* You will see `"finish_reason": "length"` in responses when hitting a max token limit.
**Example API Response at Limit**:
```json
{
"finish_reason": "length",
"usage": {
"completion_tokens": 4096,
"prompt_tokens": 4206,
"total_tokens": 8302
}
}
```
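As a quick sketch of raising the default, the request below uses the OpenAI-compatible client (as shown in other examples in these docs) and explicitly sets `max_tokens`; the model name and value are illustrative.

```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Write a detailed report on renewable energy."}],
    max_tokens=8192,  # raise from the 2K default for long generations
)

# "length" means the generation stopped because it hit the max token limit.
print(response.choices[0].finish_reason)
```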
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Inference performance
Source: https://docs.fireworks.ai/faq/models/inference/performance
Understanding model performance, quantization, and batching capabilities.
## Model quantization
**Q: What quantization format is used for the Llama 3.1 405B model?**
The **Llama 3.1 405B model** uses the **FP8 quantization format**, which:
* Closely matches **Meta's reference implementation**
* Is described in further detail on the model page at [fireworks.ai/models/fireworks/llama-v3p1-405b-instruct](https://fireworks.ai/models/fireworks/llama-v3p1-405b-instruct)
* Follows the general quantization methodology documented in our [Quantization blog](https://fireworks.ai/blog/fireworks-quantization)
*Note*: **BF16 precision** will be available soon for on-demand deployments.
***
## API capabilities
**Q: Does the API support batching and load balancing?**
Current capabilities include:
* **Load balancing**: Yes, supported out of the box
* **Continuous batching**: Yes, supported
* **Batch inference**: Not currently supported (on the roadmap)
* Note: For batch use cases, we recommend sending multiple parallel HTTP requests to the deployment while maintaining some fixed level of concurrency, as sketched after this list.
* **Streaming**: Yes, supported
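For example, a simple way to approximate batch inference today is a thread pool that keeps a fixed number of requests in flight. This sketch assumes the OpenAI-compatible client shown elsewhere in these docs; the model, prompts, and worker count are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

prompts = [f"Summarize document {i}" for i in range(100)]  # your batch of inputs


def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return response.choices[0].message.content


# A fixed worker count keeps concurrency constant while the pool drains the batch.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(complete, prompts))
```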
***
## Request handling
**Q: What factors affect the number of simultaneous requests that can be handled?**
Request handling capacity depends on several factors:
* **Model size and type**
* **Number of GPUs allocated** to the deployment
* **GPU type** (e.g., A100, H100)
* **Prompt size**
* **Generation token length**
* **Deployment type** (serverless vs. on-demand)
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Data security
Source: https://docs.fireworks.ai/faq/security/infrastructure/data-security
Information about Fireworks.ai data encryption and security measures.
## Data at rest
**Q: How is data encrypted at rest?**
All resources stored within Fireworks are **encrypted at rest**, including:
* **Models**
* **Datasets**
* **LoRA Adapters**
* Other stored resources
***
## Data in transit
**Q: How is data encrypted in transit?**
All data passed through Fireworks is encrypted using **industry-standard protocols and methods**.
***
## Encryption options
**Q: Does Fireworks provide client-side encryption or allow customers to bring their own encryption keys?**
Currently, Fireworks does not provide:
* **Client-side encryption**
* **Customer-managed keys** for encrypting data at rest
*Note*: We continuously evaluate additional encryption options based on customer needs and security requirements.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Security documentation
Source: https://docs.fireworks.ai/faq/security/infrastructure/documentation
Access to Fireworks.ai security policies and documentation.
## Security policies
**Q: Where can I find more information about your security policies?**
Comprehensive security documentation is available at [trust.fireworks.ai](https://trust.fireworks.ai), including:
* **Security measures**
* **Compliance information**
* **Best practices**
* **Policy updates**
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Model security
Source: https://docs.fireworks.ai/faq/security/infrastructure/model-security
Understanding model security and guardrail implementations.
## Model guardrails
**Q: Do you put any guardrails before any LLM models?**
By default, we don’t apply any guardrails to LLM models. Our customers can implement guardrails through various methods:
1. **Using built-in options**:
* Models such as **Llama Guard** provide built-in guardrails.
* Integration with existing **security frameworks**.
2. **Third-party solutions**:
* AI gateways like **Portkey** offer guardrails as a feature.
* Documentation available at: [Portkey Guardrails](https://docs.portkey.ai/docs/product/guardrails)
**Best practices**:
* Implement guardrails appropriate to your **use case**.
* Conduct regular **security audits**.
* Monitor **model outputs** consistently.
* Keep **security policies** updated.
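As one possible pattern for the built-in option above, the sketch below runs a user message through a Llama Guard style model before generating a reply. The guard model ID, prompt handling, and "safe"/"unsafe" parsing are assumptions to adapt to whichever guard model you deploy; this is not a Fireworks-provided guardrail feature.

```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

# Illustrative model IDs -- substitute the guard and generation models available to you.
GUARD_MODEL = "accounts/fireworks/models/llama-guard-3-8b"
CHAT_MODEL = "accounts/fireworks/models/llama-v3p1-8b-instruct"


def is_safe(user_message: str) -> bool:
    """Ask the guard model to classify the message before generating a reply."""
    verdict = client.chat.completions.create(
        model=GUARD_MODEL,
        messages=[{"role": "user", "content": user_message}],
        max_tokens=16,
    )
    # Llama Guard style models typically answer "safe" or "unsafe" plus category codes.
    return verdict.choices[0].message.content.strip().lower().startswith("safe")


user_message = "How do I bake sourdough bread?"
if is_safe(user_message):
    reply = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": user_message}],
    )
    print(reply.choices[0].message.content)
else:
    print("Request blocked by guardrail.")
```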
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Private access
Source: https://docs.fireworks.ai/faq/security/network/private-access
Understanding private connection options for Fireworks.ai services.
## Private connections
**Q: Do you provide private connections?**
Fireworks provides various forms of **private connections**:
**Cloud provider options**:
* **AWS PrivateLink**
* **GCP Private Service Connect**
**Additional options**:
* **Direct Routing**, which allows you to connect your dedicated API Gateway
**Benefits**:
* **Enhanced security**
* **Reduced latency**
* **Private network communication**
* **Improved reliability**
**Implementation process**:
1. **Contact support** to initiate setup.
2. **Choose connection type** based on your requirements.
3. **Configure network settings** as per the guidelines.
4. **Verify connectivity** to ensure successful integration.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Fine-tuning models
Source: https://docs.fireworks.ai/fine-tuning/fine-tuning-models
Supervised Fine-Tuning (SFT) adapts general-purpose models to domain-specific tasks, significantly improving performance in real-world applications. Fireworks' fine-tuning service is easy to use, and supports continued training from another fine-tuned model. Fine-tuned models can be seamlessly deployed for inference and multi-LoRA serving allows multiple fine-tuned models to run simultaneously on a single deployment. You can run your supervised fine-tuning job via CLI, API or [UI](https://youtu.be/xTYLEtkF4AI).
We're introducing an upgraded tuning service with improved speed, usability, and reliability! The new service uses different commands and has different model coverage, and it is offered for free while in public preview.
## Benefits of fine-tuning:
* Higher Accuracy: Fine-tuning helps the model better match the dataset, boosting precision and performance.
* Better Fit for Specific Domains: Adapting general models with domain-specific data makes them more effective for specialized tasks.
* Less Bias: Using diverse, curated datasets during fine-tuning reduces built-in biases for fairer results.
* Up-to-Date Knowledge: Fine-tuning with new data keeps the model aligned with the latest information.
Fireworks uses [LoRA](https://huggingface.co/docs/diffusers/training/lora)-based fine-tuning to reduce the computational cost of fine-tuning large models by updating only a small subset of parameters in a low-rank structure. For models with 70B or more parameters, qLoRA (quantized LoRA) is used to improve training speeds.
### Impact on inference speed
For fast inference speeds, the fine-tuned LoRA should be [merged](https://huggingface.co/docs/peft/main/en/developer_guides/lora#merge-lora-weights-into-the-base-model) into the base model. Note that fine-tuned model inference on Serverless is slower than base model inference on Serverless.
## Fine-tuning a model
1. **Enhanced Precision:** The model can adapt to the unique attributes and trends within a dataset, leading to significantly improved precision and effectiveness.
2. **Domain Adaptation:** While many models are developed with general data, fine-tuning them with specialized, domain-specific datasets ensures they are finely attuned to the specific requirements of that field.
3. **Bias Reduction:** General models may carry inherent biases. Fine-tuning with a well-curated, diverse dataset aids in reducing these biases, fostering fairer and more balanced outcomes.
4. **Contemporary Relevance:** Information evolves rapidly, and fine-tuning with the latest data keeps the model current and relevant.
5. **Customization for Specific Applications:** The model can be tailored to meet unique objectives and needs not achievable with standard models.
In essence, fine-tuning a model with a specific dataset is a pivotal step in ensuring its enhanced accuracy, relevance, and suitability for specific applications. Let's try fine-tuning a model!
Fine-tuned model inference on Serverless is slower than base model inference on Serverless. For use cases that need low latency, we recommend using [on-demand deployments](https://docs.fireworks.ai/guides/ondemand-deployments). For on-demand deployments, fine-tuned model inference speeds are significantly closer to base model speeds (but still slightly slower). If you are only using 1 LoRA on-demand, [merging fine-tuned weights](https://huggingface.co/docs/peft/main/en/developer_guides/lora#merge-lora-weights-into-the-base-model) into the base model when using on-demand deployments will provide identical speed to base model inference. If you have an enterprise use case that needs fast fine-tuned models, please [contact us!](https://fireworks.ai/company/contact-us)
### Step 1: Check Available Models for Fine-Tuning
In the model library page, select the [Tunable](https://fireworks.ai/models?tunable=true) filter. Alternatively, check whether the "Fine Tuning" field is set to "supported" on the model's details page.
Our new tuning service is currently free but will eventually be billed based on the total number of tokens processed (`dataset_tokens * num_epochs`). Running inference on fine-tuned models incurs no extra costs outside of base inference fees.
### Step 2: Prepare the Dataset
Datasets must be in JSONL format, where each line represents a complete JSON-formatted training example.
* Minimum examples needed: 3
* Maximum examples: Up to 3 million examples per dataset
* File format: JSONL (each line is a valid JSON object)
* Message Schema: Each training sample must include a `messages` array, where each message is an object with two fields:
* `role`: one of `system`, `user`, or `assistant`. A message with the "system" role is optional, but if specified, it must be the first message of the conversation
* `content`: a string representing the message content
This format conforms to OpenAI's [Chat Completions API](https://docs.fireworks.ai/guides/querying-text-models#chat-completions-api)
Here is an example conversation dataset:
```json
{"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "Paris."}
]},
{"messages": [
{"role": "user", "content": "What is 1+1?"},
{"role": "assistant", "content": "2"},
{"role": "user", "content": "Now what is 2+2?"},
{"role": "assistant", "content": "4"}
]}
```
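If it helps, here is a small sketch that writes conversations in this format to a JSONL file and sanity-checks each line before upload; the examples and file name are placeholders.

```python
import json

conversations = [
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ]},
    {"messages": [
        {"role": "user", "content": "What is 1+1?"},
        {"role": "assistant", "content": "2"},
    ]},
]

# Write one JSON object per line (JSONL).
with open("training_dataset.jsonl", "w") as f:
    for example in conversations:
        f.write(json.dumps(example) + "\n")

# Sanity-check: every line parses and contains a non-empty "messages" array.
with open("training_dataset.jsonl") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)
        assert record.get("messages"), f"line {line_number} is missing messages"
```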
### Step 3: Create and Upload the Dataset
Create and check the dataset via the CLI or API:
```bash via CLI
firectl create dataset <DATASET_ID> path/to/training_dataset.jsonl
firectl get dataset <DATASET_ID>
```
```python via Python API
import requests
# ========================
# Fireworks API configurations
# ========================
ACCOUNT_ID = ""
API_TOKEN = ""
BASE_URL = f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}"
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}
HEADERS_WITH_CONTENT_TYPE = {
"Authorization": f"Bearer {API_TOKEN}",
"Content-Type": "application/json"
}
# Create dataset payload
DATASET_ID = "trader-poe-sample-data"
create_dataset_payload = {
    "datasetId": DATASET_ID,
    "dataset": {
        "userUploaded": {}
    }
}
url_create_dataset = f"{BASE_URL}/datasets"
response = requests.post(url_create_dataset, headers=HEADERS_WITH_CONTENT_TYPE, json=create_dataset_payload)
# Upload the JSONL file to the dataset
local_filename = "path/to/training_dataset.jsonl"
url_upload = f"{BASE_URL}/datasets/{DATASET_ID}:upload"
with open(local_filename, "rb") as f:
files = {"file": f}
response = requests.post(url_upload, headers=HEADERS, files=files)
```
### Step 4: Creating a Fine-Tuning Job
Using the CLI: To start a supervised fine-tuning job (SFTJ), run the command below. This will also return the fine-tuning job ID.
```bash
firectl create sftj --base-model <BASE_MODEL_ID> --dataset <DATASET_ID> --output-model <OUTPUT_MODEL_ID>
```
For example, to start a supervised fine-tuning job, run the following CLI command or Python code:
```bash via CLI
firectl create sftj --base-model llama-v3p1-8b-instruct --dataset my_dataset --output-model my_model
```
```python via Python API
url_ft = f"{BASE_URL}/supervisedFineTuningJobs"
ft_payload = {
"displayName": "Trader Poe's Fine Tuning",
"dataset": f"accounts/{ACCOUNT_ID}/datasets/{DATASET_ID}",
"outputModel": "accounts/fireworks/models/deepseek-r1-distill-llama-70b-finetuned-04-15",
"baseModel": "accounts/fireworks/models/deepseek-r1-distill-llama-70b",
"earlyStop": False,
"epochs": 1,
"learningRate": 0.0001,
"maxContextLength": 2048,
"loraRank": 8,
"wandbConfig": {
"enabled": False,
"apiKey": "",
"project": "",
"entity": "",
"runId": ""
},
"isTurbo": True
}
response = requests.post(url_ft, headers=HEADERS_WITH_CONTENT_TYPE, json=ft_payload)
```
*`firectl` will return the fine-tuning job ID.*
### Step 5: Checking the job status
You can monitor the progress of the tuning job by running:
```bash via CLI
firectl get sftj <JOB_ID>
```
```python via Python API
url_list_jobs = f"{BASE_URL}/supervisedFineTuningJobs"
response = requests.get(url_list_jobs, headers=HEADERS)
```
Once the job successfully completes, a model will be created in your account. You can list your models and inspect a specific one with the commands below, or query the fine-tuning job by its ID via the API:
```bash via CLI
firectl list models
firectl get model <MODEL_ID>
```
```python via Python API
url = "https://api.fireworks.ai/v1/accounts/{account_id}/supervisedFineTuningJobs/{supervised_fine_tuning_job_id}"
headers = {"Authorization": "Bearer "}
response = requests.request("GET", url, headers=headers)
print(response.text)
```
### Continue training from a fine-tuned model
When creating a fine-tuning job, you can start tuning from a base model, or from a fine-tuned model you tuned earlier:
1. **Base model**: Use the `base-model` parameter in CLI (or `baseModel` in API) to start from a pre-trained base model.
2. **Existing LoRA add-on**: Use the `warm-start-from` parameter in CLI (or `warmStartFrom` parameter in API) to start from an existing LoRA addon model, where the LoRA is specified with the format `accounts/<ACCOUNT_ID>/models/<MODEL_ID>`
You must specify either the `base-model` or the `warm-start-from` parameter.
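For example, a warm-start job continuing from a previously tuned LoRA might look like this (the IDs below are placeholders, not real resources):

```shell
firectl create sftj \
  --warm-start-from accounts/<ACCOUNT_ID>/models/<EXISTING_LORA_ID> \
  --dataset my_dataset \
  --output-model my_model_v2
```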
## Deploying and using a model
Before using your fine-tuned model for inference, you must deploy it. Please refer to our guides on [Deploying a model](/models/deploying#lora-addons) and [Querying text models](/guides/querying-text-models) for detailed instructions.
Some base models may not support serverless addons. To check:
1. Run `firectl -a fireworks get <MODEL_ID>`
2. Look under `Deployed Model Refs` to see if a Fireworks-owned deployment exists, e.g. `accounts/fireworks/deployments/3c7a68b0`
3. If so, then it is supported
If the base model doesn't support serverless addons, you will need to use an [on-demand deployment](/models/deploying#deploying-to-on-demand) to deploy it.
## Additional tuning options
Tuning settings are specified when starting a fine-tuning job. All of the below settings are optional and will have reasonable defaults if not specified. For settings that affect tuning quality like `epochs` and `learning rate`, we recommend using default settings and only changing hyperparameters if results are not as desired. All tuning options must be specified via command line flags as shown in the below example:
```shell
firectl create sftj \
--base-model llama-v3p1-8b-instruct \
--dataset cancerset \
--output-model my-tuned-model \
--job-id my-fine-tuning-job \
--learning-rate 0.0001 \
--epochs 2 \
--early-stop \
--evaluation-dataset my-eval-set
```
### Evaluation
By default, the fine-tuning job will run evaluation by running the fine-tuned model against an evaluation set that's created by automatically carving out a portion of your training set. You have the option to explicitly specify a separate evaluation dataset to use instead of carving out training data.
`evaluation_dataset`: The ID of a separate dataset to use for evaluation. Must be pre-uploaded via firectl
```shell
firectl create sftj \
...
--evaluation-dataset my-eval-set \
...
```
### Early stopping
Early stopping stops training early if the validation loss does not improve. It is off by default.
```shell
firectl create sftj \
...
--early-stop \
...
```
### Max Context Length
By default, fine-tuned models support a max context length of 8k. Increase max context length if your use case requires context above 8k. Maximum context length can be increased up to the default context length of your selected model. For models with over 70B parameters, we only support up to 65536 max context length.
```shell
firectl create sftj \
...
--max-context-length 65536
...
```
### Epochs
Epochs are the number of passes over the training data. Our default value is 1. If the model does not follow the training data as much as expected, increase the number of epochs by 1 or 2. Non-integer values are supported.
**Note: we set a max value of 3 million dataset examples \* epochs**
```shell
firectl create sftj \
...
--epochs 2.0 \
...
```
### Learning rate
Learning rate controls how fast the model updates from data. We generally do not recommend changing learning rate. The default value set is automatically based on your selected model.
```shell
firectl create sftj \
...
--learning-rate 0.0001 \
...
```
### LoRA rank
LoRA rank refers to the number of parameters that will be tuned in your LoRA add-on. Higher LoRA rank increases the amount of information that can be captured while tuning. LoRA rank must be a power of 2 up to 64. Our default value is 8.
```shell
firectl create sftj \
...
--lora-rank 16 \
...
```
### Training progress and monitoring
The fine-tuning service integrates with Weights & Biases to provide observability into the tuning process. To use this feature, you must have a Weights & Biases account and have provisioned an API key.
```shell
firectl create sftj \
...
--wandb-entity my-org \
--wandb-api-key xxx \
--wandb-project "My Project" \
...
```
### Model ID
By default, the fine-tuning job will generate a random unique ID for the model. This ID is used to refer to the model at inference time. You can optionally specify a custom ID, within [ID constraints](https://docs.fireworks.ai/getting-started/concepts#resource-names-and-ids).
```shell
firectl create sftj \
...
--output-model-id my-model \
...
```
### Job ID
By default, the fine-tuning job will generate a random unique ID for the fine-tuning job. You can optionally choose a custom ID.
```shell
firectl create sftj \
...
--job-id my-fine-tuning-job \
...
```
### Turbo Mode
By default, the fine-tuning job will use a single GPU. You can optionally enable turbo mode to accelerate training with multiple GPUs (only available for non-DeepSeek models).
```shell
firectl create sftj \
...
--turbo \
...
```
## Downloading model weights
To download model weights, run:
```shell
firectl download model <MODEL_ID>
```
## Appendix
### Supported base models - tuning
[Using UI](https://fireworks.ai/models?tunable=true): In the model library page, select the Tunable filter. In the model page, check whether the "fine-tuning" field is set to "supported" in the model's details page.
All models available for tuning also support LoRAs on their [dedicated](/models/deploying#deploying-to-on-demand) deployments, meaning that [up to 100 LoRAs](https://docs.fireworks.ai/guides/quotas_usage/rate-limits#other-quotas) can be deployed to a dedicated instance for no extra fees compared to the base deployment costs. Some models support LoRAs on dedicated deployments even though Fireworks does not support tuning for these models. This means that users can tune a LoRA on a separate platform but upload this LoRA to Fireworks for inference.
### Supported base models - LoRAs on serverless
Some [serverless](/models/deploying#deploying-to-serverless) models support LoRA deployment, allowing up to 100 LoRAs to be deployed for inference that's constantly available on a pay-per-token basis. The field for "Serverless LoRA Deployment" will be set to "supported" for these models in their model details page.
# Using Document Inlining
Source: https://docs.fireworks.ai/firesearch/inline-multimodal
## Overview
Document Inlining allows any LLM to process images and PDFs through our chat completions API. Simply append `#transform=inline` to your document URL to enable this feature. Document Inlining connects our proprietary Fireworks Parsing Service to any LLM to provide advantages including:
* **Improved reasoning (compared to VLMs):** LLMs reason better over text than over images, and Document Inlining lets you use specialized and more recently updated text models
* **Improved input flexibility:** Document Inlining enables PDFs and multiple images to be ingested
* **Ultra-simple usage:** Use Document Inlining through our OpenAI-compatible chat completions API. Simply add one line to attach your file and turn on Document Inlining
Read our [announcement blog](https://fireworks.ai/blog/document-inlining-launch) for more details.
## Usage
### Basic Example
Note the `#transform=inline` suffix on the image URL.
```python Python
import openai
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p3-70b-instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://pdfobject.com/pdf/sample.pdf#transform=inline"
}
},
{
"type": "text",
"text": "What information can you extract from this document?"
}
]
}
]
)
```
```typescript TypeScript
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "",
baseURL: "https://api.fireworks.ai/inference/v1"
});
const response = await client.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p3-70b-instruct",
messages: [
{
role: "user",
content: [
{
type: "image_url",
image_url: {
url: "https://example.com/document.pdf#transform=inline"
}
},
{
type: "text",
text: "What information can you extract from this document?"
}
]
}
]
});
```
```javascript JavaScript
const OpenAI = require("openai");
const client = new OpenAI({
apiKey: "",
baseURL: "https://api.fireworks.ai/inference/v1"
});
const response = await client.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p3-70b-instruct",
messages: [
{
role: "user",
content: [
{
type: "image_url",
image_url: {
url: "https://example.com/document.pdf#transform=inline"
}
},
{
type: "text",
text: "What information can you extract from this document?"
}
]
}
]
});
```
The `image_url.url` field supports both direct URLs and base64-encoded data URLs, compatible with VLM API:
```text
# For PDF files
data:application/pdf;base64,{base64_str_for_pdf}
# For images (png/jpg/gif/tiff supported)
data:image/png;base64,{base64_str_for_image}
data:image/jpeg;base64,{base64_str_for_image}
data:image/gif;base64,{base64_str_for_image}
data:image/tiff;base64,{base64_str_for_image}
```
Similarly, append `#transform=inline` to the base64 string to enable document inlining for base64 image inputs.
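For instance, a local PDF can be base64-encoded into a data URL with the inline transform appended; this is a minimal sketch, and the file path is a placeholder.

```python
import base64

with open("sample.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

# Append #transform=inline to the data URL to enable Document Inlining.
document_url = f"data:application/pdf;base64,{encoded}#transform=inline"

# Use this as the image_url.url value in a chat completions request,
# alongside a text part with your question (see the examples above).
content = [
    {"type": "image_url", "image_url": {"url": document_url}},
    {"type": "text", "text": "What information can you extract from this document?"},
]
```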
### Combining with Structured Output
Document Inlining works seamlessly with structured output formats. Here's how to extract specific fields using [JSON mode](https://docs.fireworks.ai/structured-responses/structured-response-formatting):
```python
from pydantic import BaseModel
class DocumentInfo(BaseModel):
title: str
key_points: list[str]
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p3-70b-instruct",
messages=[...], # Same as above
response_format={"type": "json_object", "schema": DocumentInfo.model_json_schema()}
)
```
## Limitations
Document Inlining is only intended to handle images and documents that contain text. Document Inlining may provide subpar results for highly visual, spatially dependent, or layout-heavy content that does not translate well into structured text.
* **Maximum document size:** 50 pages or the model's context size (whichever is smaller)
* **Maximum image size:** ~32 MB if sent as a base64-encoded string, ~100 MB if sent as a URL
* **Supported formats:** PDFs and images
## Model Compatibility
Document Inlining works with any LLM on Fireworks, including:
* Serverless models
* On-demand models
* Fine-tuned and custom models
* Vision models
Simply append `#transform=inline` to your document URL to enable the feature with any supported model. Multiple documents are supported. Vision models also support document inlining with images for use cases that require both document processing and non-document vision. Users can control whether to inline a document by selectively appending `#transform=inline` to image\_url.url of each attachment.
## Pricing
During public preview, Document Inlining incurs no added costs compared to our typical text models. For example, let’s say you’re conducting a structured extraction task where you provide:
* **Input:** 10 token prompt and a document with 1,000 tokens worth of text
* **Output:** 100 tokens
You would simply pay for the 1,110 tokens worth of input and output token costs but will NOT incur additional costs for document parsing.
Please note that Document Inlining is in Public Preview mode and subject to changes. Please contact us on Discord if you have feedback or questions, or at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) for enterprise inquiries.
# Concepts
Source: https://docs.fireworks.ai/getting-started/concepts
This document outlines basic Fireworks AI concepts.
## Resources
### Account
Your account is the top-level resource under which other resources are located. Quotas and billing are enforced at the account level, so usage for all users in an account contributes to the same quotas and bill.
* For developer accounts, the account ID is auto-generated from the email address used to sign up.
* Enterprise accounts can optionally choose a custom, unique account ID.
### User
A user is an email address associated with an account. Users added to an account have full access to delete, edit, and create resources within the account, such as deployments and models.
### Models and model types
A model is a set of model weights and metadata associated with the model. Each model has a [**globally unique name**](https://docs.fireworks.ai/getting-started/concepts#resource-names-and-ids) of the form `accounts/<account_id>/models/<model_id>`. There are two types of models:
**Base models:** A base model consists of the full set of model weights, including models pre-trained from scratch and full fine-tunes.
* Fireworks has a library of common base models that can be used for [**serverless inference**](https://docs.fireworks.ai/models/overview#serverless-inference) as well as [**dedicated deployments**](https://docs.fireworks.ai/models/overview#dedicated-deployments). Model IDs for these models are pre-populated. For example, `llama-v3p1-70b-instruct` is the model ID for the Llama 3.1 70B model that Fireworks provides. The ID for each model can be found on its page ([**example**](https://fireworks.ai/models/fireworks/llama-v3p1-70b-instruct))
* Users can also [upload their own](https://docs.fireworks.ai/models/uploading-custom-models#uploading-the-model) custom base models and specify model IDs.
**LoRA (low-rank adaptation) addons:** A LoRA addon is a small, fine-tuned model that significantly reduces the amount of memory required to deploy compared to a fully fine-tuned model. Fireworks supports [**training**](https://docs.fireworks.ai/fine-tuning/fine-tuning-models), [**uploading**](https://docs.fireworks.ai/models/uploading-custom-models#custom-lora-addons), and [**serving**](https://docs.fireworks.ai/models/deploying) LoRA addons. LoRA addons must be deployed on a serverless or dedicated deployment for their corresponding base model. Model IDs for LoRAs can be either auto-generated or user-specified.
### Deployments and deployment types
A model must be deployed before it can be used for inference. A deployment is a collection of one or more model servers that host one base model and, optionally, one or more LoRA addons.
Fireworks supports two types of deployments:
* **Serverless deployments:** Fireworks hosts popular base models on shared "serverless" deployments. Users pay-per-token to query these models and do not need to configure GPUs. The most popular serverless deployments also support serverless LoRA addons. See the [**Deploying to serverless**](https://docs.fireworks.ai/models/deploying#deploying-to-serverless) guide for details.
* **Dedicated deployments:** Dedicated deployments enable users to configure private deployments with a wide array of hardware (see [on-demand deployments guide](https://docs.fireworks.ai/guides/ondemand-deployments)). Dedicated deployments give users performance guarantees and the most flexibility and control over what models can be deployed. Both LoRA addons and base models can be deployed to dedicated deployments. Dedicated deployments are billed by a GPU-second basis (see [**pricing**](https://fireworks.ai/pricing#ondemand) page).
See the [**Querying text models guide**](https://docs.fireworks.ai/guides/querying-text-models) for a comprehensive overview of making LLM inference.
### Deployed model
Users can specify a model to query for inference using the model name and deployment name. Alternatively, users can refer to a "deployed model" name that refers to a unique instance of a base model or LoRA addon that is loaded into a deployment. See [deploying models guide](https://docs.fireworks.ai/models/deploying#inference) for more.
### Dataset
A dataset is an immutable set of training examples that can be used to fine-tune a model.
### Fine-tuning job
A fine-tuning job is an offline training job that uses a dataset to train a LoRA addon model.
## Resource names and IDs
Resource IDs must satisfy the following constraints:
* Between 1 and 63 characters (inclusive)
* Consists of a-z, 0-9, and hyphen (-)
* Does not begin or end with a hyphen (-)
* Does not begin with a digit
A full resource name looks like
```
accounts/<account_id>/models/<model_id>
```
Some APIs take the full resource name, while others may take a resource ID if the context is clear.
## Control plane and data plane
The Fireworks API can be split into a control plane and a data plane.
* The **control plane** consists of APIs used for managing the lifecycle of resources. This
includes your account, models, and deployments.
* The **data plane** consists of the APIs used for inference and the backend services that power
them.
## Interfaces
Users can interact with Fireworks through one of many interfaces:
* The **web console** at [https://fireworks.ai](https://fireworks.ai)
* The command-line interface `firectl`
* [Python SDK](/tools-sdks/python-client/installation)
# Fireworks AI Developer Platform
Source: https://docs.fireworks.ai/getting-started/introduction
Start building with open source AI models
Fireworks AI is the best platform for building AI product experiences with open source AI models. You can run and customize AI models with just a few lines of code!
### Start building
Make your first API call with Fireworks Serverless Inference
View 100s of supported models across text, vision, audio, image and more
Get the best speed, reliability, & scalability
Customize a model for your specific use case
Query vision language models
Convert speech to text async or in realtime
Get responses in your specified JSON schema
Customize and deploy a model on Fireworks
### Resources
Get support and discuss with other developers
Code examples, tutorials and guides
Technical analysis, features and customer stories
Check status of Fireworks AI services
Security and compliance resources
Contact Sales or reach out to our team
### What we offer
The Fireworks platform empowers developers to create generative AI systems with the best quality, cost and speed. All publicly available services are pay-as-you-go with developer friendly [pricing](https://fireworks.ai/pricing). See the below list for offerings and docs links. Scroll further for more detailed descriptions and blog links.
* **Inference:** Run generative AI models on Fireworks-hosted infrastructure with our optimized FireAttention inference engine. Multiple inference options ensure there’s always a fit for your use case.
* **Modalities and Models:** Use 100s of models (or bring your own) across modalities of:
* [Text](https://docs.fireworks.ai/guides/querying-text-models)
* [Audio](https://docs.fireworks.ai/api-reference/audio-transcriptions)
* [Image](https://docs.fireworks.ai/api-reference/generate-a-new-image-from-a-text-prompt)
* [Embedding](https://docs.fireworks.ai/guides/querying-embeddings-models)
* [Vision-understanding](https://docs.fireworks.ai/guides/querying-vision-language-models)
* **Adaptation:** [Tune](https://docs.fireworks.ai/fine-tuning/fine-tuning-models) and optimize your model and deployment for the best quality, speed, and cost. [Serve](https://docs.fireworks.ai/models/deploying) and experiment with hundreds of fine-tuned models with our multi-LoRA [capabilities](https://fireworks.ai/blog/multi-lora).
* **Compound AI Development:** Use [JSON mode](https://docs.fireworks.ai/structured-responses/structured-response-formatting), [grammar mode](https://docs.fireworks.ai/structured-responses/structured-output-grammar-based) or [function calling](https://docs.fireworks.ai/guides/function-calling) to build a collaborative system with reliable and performant outputs
## Inference
Fireworks has 3 options for running generative AI models with unparalleled speed and costs.
* **Serverless:** The easiest way to get started. Use the most popular models on pre-configured GPUs. Pay per token and avoid cold boots.
* [On-demand](https://fireworks.ai/blog/why-gpus-on-demand)**:** The most flexible option for scaling. Use private GPUs to support your specific needs and only pay when you’re using it. GPUs running Fireworks software offer both ~250% improved throughput and 50% improved latency compared to vLLM. Excels for:
* **Production volume** - Per-token costs decrease with more volume and there are no set rate limits
* **Custom needs and reliability** - On-demand GPUs are private to you. This enables complete control to tailor deployments for speed/throughput/reliability or to run more specialized models
* **Enterprise Reserved GPUs:** Use private GPUs with hardware and software set-up personally tailored by the Fireworks team for your use case. Enjoy SLAs, dedicated support, bring-your-own-cloud (BYOC) deployment options, and enterprise-only optimizations.
| Property | **Serverless** | **On-demand** | **Enterprise reserved** |
| -------------------------- | -------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| **Performance** | Industry-leading speed on Fireworks-curated set-up. Performance may vary with others’ usage. | Speed dependent on user-specified GPU configuration and private usage. Per GPU latency should be significantly faster than vLLM. | Tailor-made set-up by Fireworks AI experts for best possible latency |
| **Getting Started** | Self-serve - immediately use serverless with 1 line of code | Self-serve - configure GPUs, then use them with 1 line of code. | [Chat with Fireworks](https://fireworks.ai/company/contact-us) |
| **Scaling and management** | Scale up and down freely within rate limits | Option for auto-scaling GPUs with traffic. GPUs scale to zero automatically, so no charge for unused GPUs and for boot-ups. | [Chat with Fireworks](https://fireworks.ai/company/contact-us) |
| **Pricing** | Pay fixed price per token | Pay per GPU second with no commitments. Per GPU throughput should be significantly greater than options like vLLM. | Customized price based on reserved GPU capacity |
| **Commitment** | None | None | Arrange plan length with Fireworks |
| **Rate limits** | Yes, see [quotas](https://docs.fireworks.ai/accounts/quotas) | No rate limits. [Quotas](https://docs.fireworks.ai/accounts/quotas) on number of GPUs | None |
| **Model Selection** | Collection of popular models, curated by Fireworks | Use 100s of pre-uploaded models or upload your own custom model within supported [architecture](https://docs.fireworks.ai/models/uploading-custom-models) | Use 100s of pre-uploaded models or upload any model |
## FireOptimizer
**FireOptimizer** is how Fireworks optimizes inference for your workload and use case, including fine-tuning. FireOptimizer bundles several optimization techniques. Publicly available features are:
* [Fine-tuning](https://fireworks.ai/blog/fine-tune-launch)**:** Quickly fine-tune models with LoRA for the best quality on your use case
* Upload data and choose your model to start tuning
* Pay per token of training data.
* Serve and evaluate models immediately on Fireworks
* Download model weights to use anywhere
* [Multi-LoRA serving](https://fireworks.ai/blog/multi-lora)**:** Deploy 100s of fine-tuned models at no extra cost.
* Zero extra cost for serving LoRAs: 1 million requests with 50 models costs the same as 1 million requests with 1 model.
* Use models fine-tuned on Fireworks or upload your own fine-tuned adapter
* Host hundreds of models on the same deployment on either serverless or dedicated deployments
## Compound AI
Fireworks makes it easy to use multiple models and modalities together in one compound AI system. Features include:
* [JSON mode and grammar mode](https://fireworks.ai/blog/why-do-all-LLMs-need-structured-output-modes)**:** Provide structure to any LLM on Fireworks with either (a) a JSON schema or (b) a context-free grammar to guarantee that LLM output follows your desired format. These structured output modes are particularly useful for ensuring LLMs can reliably call and pipe outputs to other models, APIs, and components.
* [Function calling](https://fireworks.ai/blog/firefunction-v2-launch-post)**:** Fireworks offers function calling support via our proprietary Firefunction models or Llama 3.1 70B
# Quickstart
Source: https://docs.fireworks.ai/getting-started/quickstart
Get started in minutes with an OpenAI-compatible endpoint
Fireworks AI is the best platform for building AI product experiences with open source AI models. You can run and customize AI models with just a few lines of code!
Using the API, you can access popular [open-source models](https://fireworks.ai/models/fireworks/deepseek-r1/playground) like Llama, DeepSeek, etc. The example below generates [text output](/guides/querying-text-models) through an OpenAI-compatible `chat completions` API endpoint.
In this guide, you will get an API key, set up your development environment, and call the Fireworks API with an API Key.
### Get an API key
[Sign up or log in](https://fireworks.ai/login) to your Fireworks account. Generate an API key by navigating to the [API Keys](https://fireworks.ai/api-keys) page and clicking 'Create API key'. Store the API key in a safe location.
### Set up your developer environment & call the Fireworks API
Before installing, ensure that you have the right version of Python installed. Optionally, you may also want to set up a virtual environment.
```bash
pip install --upgrade fireworks-ai
```
The Fireworks Python client is OpenAI API-compatible.
Step-by-step instructions for setting an environment variable for respective OS platforms:
Depending on your shell, you'll need to edit either `~/.bash_profile` for Bash or `~/.zshrc` for `Zsh`.
You can do this by running the command:
```bash bash
vim ~/.bash_profile
```
```zsh zsh
vim ~/.zshrc
```
Add a new line to the file with the following:
```bash
export FIREWORKS_API_KEY=""
```
After saving the file, apply the changes by restarting your terminal session or by running the appropriate `source` command below, depending on the file you edited.
```bash bash
source ~/.bash_profile
```
```zsh zsh
source ~/.zshrc
```
You can verify that the variable has been set correctly by running `echo $FIREWORKS_API_KEY`
You can open Command Prompt by searching for it in the Windows search bar or by pressing Win + R, typing cmd, and pressing Enter.
```
setx FIREWORKS_API_KEY ""
```
To verify that the variable has been set correctly, you can close and reopen Command Prompt and type:
```
echo %FIREWORKS_API_KEY%
```
You can now instantiate a client with the generated API key and call the Fireworks API.
```python
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p3-70b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
)
print(response.choices[0].message.content)
```
Before installing, ensure that you have the right version of Python installed. Optionally, you may also want to set up a virtual environment.
```bash
pip install --upgrade openai
```
The Fireworks AI platform offers a drop-in replacement for the OpenAI Python client.
Step-by-step instructions for setting an environment variable for respective OS platforms:
Depending on your shell, you'll need to edit either `~/.bash_profile` for Bash or `~/.zshrc` for `Zsh`.
You can do this by running the command:
```bash bash
vim ~/.bash_profile
```
```zsh zsh
vim ~/.zshrc
```
Add a new line to the file with the following:
```bash
export OPENAI_API_BASE="https://api.fireworks.ai/inference/v1"
export OPENAI_API_KEY=""
```
After saving the file, apply the changes by restarting your terminal session or by running the appropriate `source` command below, depending on the file you edited.
```bash bash
source ~/.bash_profile
```
```zsh zsh
source ~/.zshrc
```
You can verify that the variable has been set correctly by running
`echo $OPENAI_API_KEY`
You can open Command Prompt by searching for it in the Windows search bar or by pressing Win + R, typing cmd, and pressing Enter.
```
setx OPENAI_API_BASE "https://api.fireworks.ai/inference/v1"
setx OPENAI_API_KEY ""
```
To verify that the variable has been set correctly, you can close and reopen Command Prompt and type:
```
echo %OPENAI_API_KEY%
```
You can now instantiate a client with the generated API key and call the Fireworks API through the OpenAI Python SDK.
```python
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("OPENAI_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1",
)
response = client.chat.completions.create(
messages=[
{
"role": "user",
"content": "Say this is a test",
}
],
# notice the change in the model name
model="accounts/fireworks/models/llama-v3p3-70b-instruct",
)
print(response.choices[0].message.content)
```
```bash
bun add @fireworksai/sdk
```
The Fireworks JavaScript client is OpenAI API-compatible.
Step-by-step instructions for setting an environment variable for respective OS platforms:
Depending on your shell, you'll need to edit either `~/.bash_profile` for Bash or `~/.zshrc` for `Zsh`.
You can do this by running the command:
```bash bash
vim ~/.bash_profile
```
```zsh zsh
vim ~/.zshrc
```
Add a new line to the file with the following:
```bash
export FIREWORKS_API_KEY=""
```
After saving the file, apply the changes by restarting your terminal session or by running the appropriate `source` command below, depending on the file you edited.
```bash bash
source ~/.bash_profile
```
```zsh zsh
source ~/.zshrc
```
You can verify that the variable has been set correctly by running
`echo $FIREWORKS_API_KEY`
You can open Command Prompt by searching for it in the Windows search bar or by pressing Win + R, typing cmd, and pressing Enter.
```
setx FIREWORKS_API_KEY ""
```
To verify that the variable has been set correctly, you can close and reopen Command Prompt and type:
```
echo %FIREWORKS_API_KEY%
```
```javascript
import { FireworksAI } from "@fireworksai/sdk";
const fireworksAI = new FireworksAI({
apiKey: process.env.FIREWORKS_API_KEY,
});
const completion = await fireworksAI.chat.completions.create({
messages: [{ role: "user", content: "Say this is a test" }],
model: "accounts/fireworks/models/llama-v3p3-70b-instruct",
});
console.log(completion.choices[0].message.content);
```
Before installing, ensure that you have the right version of Node installed, along with `npm` or another package manager of your choice.
```bash
npm install openai
```
The Fireworks AI platform offers a drop-in replacement for the OpenAI JavaScript client.
Step-by-step instructions for setting an environment variable for respective OS platforms:
Depending on your shell, you'll need to edit either `~/.bash_profile` for Bash or `~/.zshrc` for `Zsh`.
You can do this by running the command:
```bash bash
vim ~/.bash_profile
```
```zsh zsh
vim ~/.zshrc
```
Add a new line to the file with the following:
```bash
export FIREWORKS_API_KEY=""
```
After saving the file, apply the changes by restarting your terminal session or by running the appropriate `source` command below, depending on the file you edited.
```bash bash
source ~/.bash_profile
```
```zsh zsh
source ~/.zshrc
```
You can verify that the variable has been set correctly by running
`echo $FIREWORKS_API_KEY`
You can open Command Prompt by searching for it in the Windows search bar or by pressing Win + R, typing cmd, and pressing Enter.
```
setx FIREWORKS_API_KEY ""
```
To verify that the variable has been set correctly, you can close and reopen Command Prompt and type:
```
echo %FIREWORKS_API_KEY%
```
You can now instantiate a client with the generated API key and call the Fireworks API through the OpenAI JavaScript SDK.
```javascript
import OpenAI from 'openai';
const openai = new OpenAI({
baseURL: 'https://api.fireworks.ai/inference/v1',
apiKey: process.env['FIREWORKS_API_KEY']
});
const completion = await openai.chat.completions.create({
messages: [{ role: "user", content: "Say this is a test" }],
model: "accounts/fireworks/models/llama-v3p3-70b-instruct",
});
console.log(completion.choices[0].message.content);
```
cURL is a popular open-source command-line tool for sending HTTP requests, and most operating systems ship it by default.
If you are not sure whether you have cURL installed, follow the first two steps of this guide to set it up. Otherwise, we recommend skipping to **Step Three**.
Check if your operating system has cURL installed by running `curl https://api.fireworks.ai`
macOS comes with the cURL tool bundled with the operating system.
If you want to upgrade to the latest version shipped by the cURL project, we recommend installing Homebrew:
```bash Homebrew
brew install curl
```
Most Linux distributions offer curl and libcurl as installable packages if they are not installed by default.
```bash apt
apt install curl
```
```bash yum
yum install curl
```
Windows 10 has shipped with the cURL tool bundled with the operating system since version 1804.
If you have an older Windows version or just want to upgrade to the latest version shipped by the cURL project, download the latest official cURL release for Windows from [curl.se/windows](https://curl.se/windows).
Step-by-step instructions for setting an environment variable for respective OS platforms:
Depending on your shell, you'll need to edit either `~/.bash_profile` for Bash or `~/.zshrc` for `Zsh`.
You can do this by running the command:
```bash bash
vim ~/.bash_profile
```
```zsh zsh
vim ~/.zshrc
```
Add a new line to the file with the following:
```bash
export FIREWORKS_API_KEY=""
```
After saving the file, apply the changes by restarting your terminal session or by running the appropriate `source` command below, depending on the file you edited.
```bash bash
source ~/.bash_profile
```
```zsh zsh
source ~/.zshrc
```
You can verify that the variable has been set correctly by running `echo $FIREWORKS_API_KEY`
You can open Command Prompt by searching for it in the Windows search bar or by pressing Win + R, typing cmd, and pressing Enter.
```
setx FIREWORKS_API_KEY ""
```
To verify that the variable has been set correctly, you can close and reopen Command Prompt and type:
```
echo %FIREWORKS_API_KEY%
```
Make your first API request with cURL. Notice the use of `$FIREWORKS_API_KEY`.
```
curl \
--header 'Authorization: Bearer '$FIREWORKS_API_KEY \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3p3-70b-instruct",
"messages": [{
"role": "user",
"content": "Say this is a test"
}]
}' \
--url https://api.fireworks.ai/inference/v1/chat/completions
```
More details on calling various APIs can be found in our [API Reference](/api-reference).
## Explore further
* Learn more about prompting text models
* View the full API reference
* Customize a model for your specific use case
* Get the best speed, reliability, & scalability
* Query vision language models
* Convert speech to text async or in realtime
# Using function-calling
Source: https://docs.fireworks.ai/guides/function-calling
## Introduction
Function calling enables models to intelligently select and utilize tools based on user input. This powerful feature allows you to build dynamic agents that can access real-time information and generate structured outputs. The function calling API doesn't execute functions directly. Instead, it generates [OpenAI](https://platform.openai.com/docs/guides/function-calling)-compatible function call specifications that you then implement.
## How function calling works
1. **Tools specifications:** You specify a query along with the list of available tools for the model. The tools are specified using [JSON Schema](https://json-schema.org/learn/getting-started-step-by-step). Each tool includes its name, description, and required parameters.
2. **Intent detection:** The model analyzes user input and determines whether to provide a conversational response or generate function calling specifications.
3. **Function call generation:** When appropriate, the model outputs structured function calls in OpenAI-compatible format, including all necessary parameters based on the context.
4. **Execution and response generation:** You execute the specified function calls and feed results back to the model for continued conversation.
## Supported models
A subset of models hosted on Fireworks supports function calling using the described syntax. These models are listed below. The [supportsTools](https://docs.fireworks.ai/api-reference/get-model#response-supports-tools) field in the model response also indicates whether the model supports function calling.
* [Llama 3.1 405B Instruct](https://fireworks.ai/models/fireworks/llama-v3p1-405b-instruct)
* [Llama 3.1 70B Instruct](https://fireworks.ai/models/fireworks/llama-v3p1-70b-instruct)
* [Qwen 2.5 72B Instruct](https://fireworks.ai/models/fireworks/qwen2p5-72b-instruct)
* [Mixtral MoE 8x22B Instruct](https://fireworks.ai/models/fireworks/mixtral-moe-8x22b-instruct)
* [Firefunction-v2](https://fireworks.ai/models/fireworks/firefunction-v2): Latest and most performant model, optimized for complex function calling scenarios (on-demand only)
* [Firefunction-v1](https://fireworks.ai/models/fireworks/firefunction-v1): Previous generation, Mixtral-based function calling model optimized for fast routing and structured output (on-demand only)
These models can all utilize function calling with the same syntax, shown below.
## Basic example: City population data retrieval with Llama 3.1 405B Instruct
For this example, let’s consider a user looking for population data for a specific city. We will provide the model with a tool that it can invoke to retrieve city population data.
1. To achieve this, we detail the purpose, arguments, and usage of the `get_city_population` function using [JSON Schema](https://json-schema.org/). This information is provided through the `tools` argument. The user query is sent as usual through the `messages` argument.
```python Request
import openai
import json
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key=""
)
# Define the function tool for getting city population
tools = [
{
"type": "function",
"function": {
# The name of the function
"name": "get_city_population",
# A detailed description of what the function does
"description": "Retrieve the current population data for a specified city.",
# Define the JSON schema for the function parameters
"parameters": {
# Always declare a top-level object for parameters
"type": "object",
# Properties define the arguments for the function
"properties": {
"city_name": {
# JSON Schema type
"type": "string",
# A detailed description of the property
"description": "The name of the city for which population data is needed, e.g., 'San Francisco'."
},
},
# Specify which properties are required
"required": ["city_name"],
},
},
}
]
# Define a comprehensive system prompt
prompt = f"""
You have access to the following function:
Function Name: '{tools[0]["function"]["name"]}'
Purpose: '{tools[0]["function"]["description"]}'
Parameters Schema: {json.dumps(tools[0]["function"]["parameters"], indent=4)}
Instructions for Using Functions:
1. Use the function '{tools[0]["function"]["name"]}' to retrieve population data when required.
2. If a function call is necessary, reply ONLY in the following format:
{{"city_name": "example_city"}}
3. Adhere strictly to the parameters schema. Ensure all required fields are provided.
4. Use the function only when you cannot directly answer using general knowledge.
5. If no function is necessary, respond to the query directly without mentioning the function.
Examples:
- For a query like "What is the population of Toronto?" respond with:
{{"city_name": "Toronto"}}
- For "What is the population of the Earth?" respond with general knowledge and do NOT use the function.
"""
# Initial message context
messages = [
{"role": "system", "content": prompt},
{"role": "user", "content": "What is the population of San Francisco?"}
]
# Call the model
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=messages,
tools=tools,
temperature=0.1
)
# Print the model's response
print(chat_completion.choices[0].message.model_dump_json(indent=4))
```
```json Response
{
"content": null,
"refusal": null,
"role": "assistant",
"audio": null,
"function_call": null,
"tool_calls": [
{
"id": "call_tPSbe4guTSXuUWbqtWguSJzu",
"function": {
"arguments": "{\"city_name\": \"San Francisco\"}",
"name": "get_city_population"
},
"type": "function",
"index": 0
}
]
}
```
2. In our case, the model decides to invoke the `get_city_population` tool with a specific argument. Note that the model itself does not invoke the tool. It just specifies the argument. When the model issues a function call, the completion reason will be set to `tool_calls`. The API caller is responsible for parsing the function name and arguments supplied by the model and invoking the appropriate tool.
```python Call External API
def get_city_population(city_name: str):
    print(f"{city_name=}")
    if city_name == "San Francisco":
        return {"population": 883305}
    else:
        raise NotImplementedError()
function_call = chat_completion.choices[0].message.tool_calls[0].function
tool_response = locals()[function_call.name](**json.loads(function_call.arguments))
print(tool_response)
```
```json Response
city_name='San Francisco'
{'population': 883305}
```
3. The API caller obtains the result of the tool invocation and passes it back to the model to generate the final response.
```python Request
agent_response = chat_completion.choices[0].message
# Append the response from the agent
messages.append(
{
"role": agent_response.role,
"content": "",
"tool_calls": [
tool_call.model_dump()
for tool_call in chat_completion.choices[0].message.tool_calls
]
}
)
# Append the response from the tool
messages.append(
{
"role": "tool",
"content": json.dumps(tool_response)
}
)
next_chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=messages,
tools=tools,
temperature=0.1
)
print(next_chat_completion.choices[0].message.model_dump_json(indent=4))
```
```json Response
{
"content": "The population of San Francisco is 883305.",
"refusal": null,
"role": "assistant",
"audio": null,
"function_call": null,
"tool_calls": null
}
```
This results in the following response:
```
The population of San Francisco is 883305.
```
## Advanced example: Financial data retrieval
**TL;DR** **This example tutorial is available as a Python notebook** \[[code](https://github.com/fw-ai/cookbook/blob/main/learn/function-calling/notebooks_firefunction_openai/fireworks_function_calling_demo.ipynb) | [Colab](https://colab.research.google.com/drive/1m7Bk1360CFI50y24KBVxRAKYuEU3pbPU?usp=sharing)].
For this example, let's consider a user looking for Nike's financial data. We will provide the model with a tool that the model is allowed to invoke to get access to the financial information of any company.
1. To achieve our goal, we will provide the model with information about the `get_financial_data` function. We detail its purpose, arguments, etc. in [JSON Schema](https://json-schema.org/). We send this information through the `tools` argument and the user query as usual through the `messages` argument.
```python Request
import openai
import json
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key = ""
)
messages = [
{"role": "system", "content": "You are a helpful assistant with access to functions. "
"Use them if required."},
{"role": "user", "content": "What are Nike's net income in 2022?"}
]
tools = [
{
"type": "function",
"function": {
# name of the function
"name": "get_financial_data",
# a good, detailed description for what the function is supposed to do
"description": "Get financial data for a company given the metric and year.",
# a well defined json schema: https://json-schema.org/learn/getting-started-step-by-step#define
"parameters": {
# for OpenAI compatibility, we always declare a top level object for the parameters of the function
"type": "object",
# the properties for the object would be any arguments you want to provide to the function
"properties": {
"metric": {
# JSON Schema supports string, number, integer, object, array, boolean and null
# for more information, please check out https://json-schema.org/understanding-json-schema/reference/type
"type": "string",
# You can restrict the space of possible values in a JSON Schema
# you can check out https://json-schema.org/understanding-json-schema/reference/enum for more examples on how enum works
"enum": ["net_income", "revenue", "ebitda"],
},
"financial_year": {
"type": "integer",
# If the model does not understand how it is supposed to fill the field, a good description goes a long way
"description": "Year for which we want to get financial data."
},
"company": {
"type": "string",
"description": "Name of the company for which we want to get financial data."
}
},
# You can specify which of the properties from above are required
# for more info on `required` field, please check https://json-schema.org/understanding-json-schema/reference/object#required
"required": ["metric", "financial_year", "company"],
},
},
}
]
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=messages,
tools=tools,
temperature=0.1
)
print(chat_completion.choices[0].message.model_dump_json(indent=4))
```
```json Response
{
"content": "",
"role": "assistant",
"function_call": null,
"tool_calls": [
{
"id": "call_XstygHYlzKrI8hbERr0ybeOQ",
"function": {
"arguments": "{\"metric\": \"net_income\", \"financial_year\": 2022, \"company\": \"Nike\"}",
"name": "get_financial_data"
},
"type": "function",
"index": 0
}
]
}
```
2. In our case, the model decides to invoke the tool `get_financial_data` with a specific set of arguments. Again, note that the model itself won't invoke the tool -- it just specifies the arguments. When the model issues a function call, the completion reason will be set to `tool_calls`. The API caller is responsible for parsing the function name and arguments supplied by the model and invoking the appropriate tool.
```python Call External API
def get_financial_data(metric: str, financial_year: int, company: str):
    print(f"{metric=} {financial_year=} {company=}")
    if metric == "net_income" and financial_year == 2022 and company == "Nike":
        return {"net_income": 6_046_000_000}
    else:
        raise NotImplementedError()
function_call = chat_completion.choices[0].message.tool_calls[0].function
tool_response = locals()[function_call.name](**json.loads(function_call.arguments))
print(tool_response)
```
```json Response
metric='net_income' financial_year=2022 company='Nike'
{'net_income': 6046000000}
```
3. The API caller obtains the result of the tool invocation and passes it back to the model to generate the final response.
```python Request
agent_response = chat_completion.choices[0].message
# Append the response from the agent
messages.append(
{
"role": agent_response.role,
"content": "",
"tool_calls": [
tool_call.model_dump()
for tool_call in chat_completion.choices[0].message.tool_calls
]
}
)
# Append the response from the tool
messages.append(
{
"role": "tool",
"content": json.dumps(tool_response)
}
)
next_chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=messages,
tools=tools,
temperature=0.1
)
print(next_chat_completion.choices[0].message.model_dump_json(indent=4))
```
```json Response
{
"content": "Nike's net income for the year 2022 was $6,046,000,000.",
"role": "assistant",
"function_call": null,
"tool_calls": null
}
```
This results in the following response:
```
Nike's net income for the year 2022 was $6,046,000,000.
```
## Tools specification
The `tools` field is an array where each component includes the following fields:
1. `type` (`string`) Specifies the type of the tool. Currently, only `function` is supported.
2. `function` (`object`) Specifies the function to be called. It includes the following fields:
* `description` (`string`): A description of what the function does, used by the model to choose when and how to call the function.
* `name` (`string`): The name of the function to be called. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.
* `parameters` (`object`): The parameters the functions accepts, described as a JSON Schema object. See the [JSON Schema reference](https://json-schema.org/understanding-json-schema/reference) for documentation about the format.
## Tool choice
The `tool_choice` parameter controls whether the model is allowed to call functions or not. Currently, we support `auto`, `none`, `any`, or a specific function name.
* `auto` (default)
The model can dynamically choose between generating a message or calling a function. This is the default tool choice when no value is specified for `tool_choice`.
* `none`
Disables the use of any tools, similar to not specifying the `tool_choice` field.
* `any`
Allows the model to call any function. You can also specify:
```
tool_choice = {"type": "function"}
```
This ensures that a function call will always be made, with no restriction on the function's name.
* Specific function name
To force the model to use a particular function, you can explicitly specify the function name in the `tool_choice` field. For example:
```
tool_choice = {"type": "function", "function": {"name": "get_financial_data"}}
```
This ensures that the model will only use the `get_financial_data` function.
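Putting this together, a request that forces a call to `get_financial_data` could look like the sketch below. It reuses the `client`, `messages`, and `tools` objects from the financial data example above; nothing new is assumed beyond the `tool_choice` value shown.
```python
chat_completion = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-405b-instruct",
    messages=messages,
    tools=tools,
    # Force the model to call get_financial_data rather than answer directly
    tool_choice={"type": "function", "function": {"name": "get_financial_data"}},
    temperature=0.1,
)
print(chat_completion.choices[0].message.tool_calls)
```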
## OpenAI compatibility
Fireworks AI's function calling API is fully compatible with OpenAI's implementation, with a few differences:
* No support for parallel function calling
* No nested function calling
* Simplified tool choice options
## Best practices
* **Number of Functions:** The length of the list of functions specified to the model directly impacts its performance. For best performance, keep the list of functions below 7. You may see some degradation in response quality as the tool list length exceeds 10.
* **Function Description:** The function specification follows [JSON Schema](https://json-schema.org/). For best performance, describe in great detail what the function does under the "description" section. An example of a good function description is "Get financial data for a company given the metric and year". A bad example would be "Get financial data for a company".
* **System Prompt:** In order to ensure optimal performance, we recommend not adding any additional system prompt. User-specified system prompts can interfere with the function detection and calling ability of the model. The auto-injected prompt for our function calling model is designed to ensure optimal performance.
* **Temperature:** Set the temperature to 0.0 or some low value. This helps the model to only generate confident predictions and avoid hallucinating parameter values.
* **Function descriptions:** Provide verbose descriptions for functions and their parameters. This is similar to prompt engineering: the more elaborate and accurate the function definition/documentation, the better the model is at deciphering the intent of the function and its parameters.
## Function calling vs JSON mode
When should you use function calling vs [JSON mode](/structured-responses/structured-response-formatting)?
Use function calling when:
* Building interactive agents
* Requiring structured API calls
* Implementing multi-step workflows
* Needing dynamic decision making
Use JSON mode when:
* Performing simple data extraction
* Working with static data
* Needing non-JSON output formats
* Processing batch data without interaction
## Example apps
* Official demos
* [Interactive Image and Finance Dashboard](https://functional-chat.vercel.app/)
* [Data Extraction Pipeline](https://colab.research.google.com/drive/1SI6jz66k122vv641e8wDDI0Ujh4cwlUy?usp=sharing)
* Langchain integrations
* [Javascript Function Calling](https://github.com/langchain-ai/langchainjs/blob/main/cookbook/function_calling_fireworks.ipynb)
* [Agent Executor Implementation](https://colab.research.google.com/drive/1huPsNm9l4OcJvIcu63u0FFWF8X2J7zW3?usp=sharing)
* [RAG with Langchain](https://colab.research.google.com/drive/1Vy4tYxP_rlbkAKi4pGpaDRV7hnSQeG2d?usp=sharing)
## Resources
* [Fireworks Blog Post on FireFunction-v2](https://fireworks.ai/blog/firefunction-v2-launch-post)
* [OpenAI Docs on Function Calling](https://platform.openai.com/docs/guides/function-calling)
* [OpenAI Cookbook on Function Calling](https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models)
* [Function Calling Best Practices](#best-practices)
## Data policy
Data from Firefunction is logged and automatically deleted after 30 days to ensure product quality and to prevent abuse (bulk data on the average number of functions used, etc.). This data will never be used to train models. Please contact [raythai@fireworks.ai](mailto:raythai@fireworks.ai) if you have questions, comments, or use cases where data cannot be logged.
# Merging LoRA adapters with base models
Source: https://docs.fireworks.ai/guides/lora-model-merge
A guide for downloading base models, merging them with LoRA adapters, and deploying the result using Fireworks.
**Prerequisites:**
* Fireworks account and `firectl` installed
* Python environment with necessary packages
* Local LoRA adapter or access to HuggingFace
* Python 3.9 or later (\< 3.13)
Follow the steps below to merge and deploy your models.
## 1. Access and download base model
### 1.1 List available models
View all models in your Fireworks account:
```bash
firectl list models
```
Example output:
```
Code Llama 13B (code-llama-13b) 2024-02-29 20:36:24 HF_BASE_MODEL
CodeGemma 7B (codegemma-7b) 2024-06-19 22:57:22 HF_BASE_MODEL
... ... ...
```
Recall the supported base models:
* Gemma
* Phi, Phi-3
* Llama 1, 2, 3, 3.1
* LLaVa
* Mistral & Mixtral
* Qwen2
* StableLM
* Starcoder (GPTBigCode) & Starcoder2
* DeepSeek V1 & V2
* GPT NeoX
### 1.2 Download base model
Download your chosen model to a local directory:
```bash
firectl download model <MODEL_ID> <OUTPUT_DIR>
```
Example:
```bash
firectl download model code-llama-13b ./base_model
```
Available flags:
* `--quiet`: Suppress progress bar
* `-h, --help`: Display help information
## 2. Obtain LoRA adapter
### 2.1 Download LoRA adapter from Fireworks
The easiest way to obtain a LoRA adapter is to download it directly from Fireworks. LoRA adapters are listed alongside models when using `firectl list models` and are denoted with the type `HF_PEFT_ADDON`. Download a LoRA adapter using the same command as downloading a model.
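For example, assuming a hypothetical adapter listed in your account as `my-lora-adapter`, the download might look like:
```bash
firectl download model my-lora-adapter ./lora_adapter
```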
### 2.2 Download from HuggingFace (Optional)
If you need to download a LoRA adapter from HuggingFace, follow these steps:
**Requirements**
Install the required package:
```bash
pip install huggingface_hub
```
**Download code**
```python
from huggingface_hub import snapshot_download
# Configure download parameters
adapter_id = "hf-account/adapter-name" # Your HuggingFace adapter path
output_path = "./path/to/save/adapter" # Local directory to save adapter
# Download the adapter
local_path = snapshot_download(
repo_id=adapter_id,
local_dir=output_path
)
```
Important notes:
* Replace `adapter_id` with your desired LoRA adapter
* Ensure `output_path` is a valid directory path
* The function returns the local path where files are downloaded
## 3. Merging base model with LoRA adapter
### 3.1 Installation requirements
First, ensure you have the necessary libraries installed:
```bash
pip install torch transformers peft
```
### 3.2 Merging script
Create a Python script (`merge_model.py`) with the following code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
def merge_lora_with_base_model(base_model_path: str, lora_path: str, output_path: str):
    """
    Merge a LoRA adapter with a base model and save the result.

    Args:
        base_model_path (str): Path to the base model directory
        lora_path (str): Path to the LoRA adapter directory
        output_path (str): Directory to save the merged model
    """
    # Load base model
    print(f"Loading base model from {base_model_path}")
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_path,
        torch_dtype=torch.float16,
        device_map="auto"
    )

    # Load and apply LoRA adapter
    print(f"Loading LoRA adapter from {lora_path}")
    model = PeftModel.from_pretrained(
        base_model,
        lora_path
    )

    # Merge adapter with base model
    print("Merging LoRA adapter with base model...")
    merged_model = model.merge_and_unload()

    # Save merged model
    print(f"Saving merged model to {output_path}")
    merged_model.save_pretrained(output_path)

    # Save tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    tokenizer.save_pretrained(output_path)

    print("Merge completed successfully!")

if __name__ == "__main__":
    # Example usage
    merge_lora_with_base_model(
        base_model_path="./base_model",  # Directory containing the base model
        lora_path="./lora_adapter",      # Directory containing the LoRA adapter
        output_path="./merged_model"     # Output directory for merged model
    )
```
If you downloaded the base model from Fireworks AI, you might need to set `base_model_path` to `./base_model/hf`, because required files such as `config.json` might be inside the `hf` directory.
### 3.3 Running the merge
Execute the script after setting your paths:
```bash
python merge_model.py
```
**Important:** After merging, verify that all necessary tokenizer files are present in the output directory. The merging process might skip some essential tokenizer files. You may need to manually copy these files from the base model:
* `tokenizer_config.json`
* `tokenizer.json`
* `special_tokens_map.json`
These files can be found in the original base model directory or the model's HuggingFace repository (e.g., meta-llama/Llama-3.1-70B-Instruct).
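As a minimal sketch, assuming the directory layout from the example script above (base model downloaded from Fireworks into `./base_model`, merged output in `./merged_model`), copying the missing files might look like:
```bash
# Copy tokenizer files the merge may have skipped; adjust the source path if your
# base model files live directly in ./base_model rather than ./base_model/hf
cp ./base_model/hf/tokenizer_config.json \
   ./base_model/hf/tokenizer.json \
   ./base_model/hf/special_tokens_map.json \
   ./merged_model/
```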
### 3.4 Important Notes
* Ensure sufficient disk and GPU memory for all models
* Check your cache directory (\~/.cache/huggingface/hub) as models may already be downloaded there
* Verify LoRA adapter compatibility with base model
* All paths must exist and have proper permissions
* Memory issues can be resolved by setting `device_map="cpu"`
## 4. Uploading and deploying merged model
### 4.1 Create model in Fireworks
Upload your merged model to Fireworks:
```bash
firectl create model <MODEL_ID> <MODEL_DIR>
```
Example:
```bash
firectl create model sql-enhanced-model ./merged_model
```
For additional options:
```bash
firectl create model -h
```
### 4.2 Create deployment
Deploy your uploaded model:
Basic deployment:
```bash
firectl create deployment <MODEL_ID>
```
Using full model path:
```bash
firectl create deployment accounts/<ACCOUNT_ID>/models/<MODEL_ID>
```
Example:
```bash
firectl create deployment sql-enhanced-model
# OR
firectl create deployment accounts/myaccount/models/sql-enhanced-model
```
For additional deployment parameters and configuration options, run:
```bash
firectl create deployment -h
```
### 4.3 Verification
After deployment, you can verify the status using:
```bash
firectl list deployments
```
***
## Complete workflow summary
1. Download base model from Fireworks using `firectl`
2. Download LoRA adapter to local device (e.g. using HuggingFace)
3. Merge models using provided Python script
4. Upload merged model to Fireworks
5. Create deployment
# On-demand deployments
Source: https://docs.fireworks.ai/guides/ondemand-deployments
Deploying on your own GPUs
Fireworks allows you to create on-demand, dedicated deployments that are reserved for your own use. This has several advantages over the shared deployments Fireworks uses for its serverless models:
* Predictable performance unaffected by load caused by other users
* No hard rate limits, but subject to the maximum load capacity of the deployment
* Cheaper under high utilization
* Access to a larger selection of models not available via our serverless models
* [Custom base models](/models/uploading-custom-models#custom-base-models) from Hugging Face files
If you plan on using a significant amount of on-demand deployments, consider purchasing [reserved capacity](/deployments/reservations) for cheaper pricing and higher GPU quotas.
## Quickstart
See the "All models" list on our [Models](https://fireworks.ai/models) page for a list of pre-uploaded models on the
Fireworks AI platform. You can also use a [custom base model](#custom-base-models).
To create a new deployment of a [model provided by Fireworks](https://fireworks.ai/models), run:
```bash
firectl create deployment accounts/fireworks/models/<MODEL_ID> --wait
```
This command will complete when the deployment is `READY`. To let it run asynchronously, remove the `--wait` flag.
The string `accounts/fireworks/models/<MODEL_ID>` is an example of a model name. [Read more](https://docs.fireworks.ai/models/overview#introduction) about model names.
To create a new deployment using a custom base model, follow the [Uploading custom models](/models/uploading-custom-models#custom-base-models) guide to first upload your custom base model to the Fireworks platform. Then run:
```bash
firectl create deployment <MODEL_ID>
```
The deployment ID is the last part of the deployment name: `accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>`.
You can verify the deployment is complete by running:
```bash
firectl get deployment <DEPLOYMENT_ID>
```
The state field should show `READY`.
To query a specific deployment, use a model identifier in the format `<model>#<deployment>`.
In most cases, the model identifier follows this pattern:
`accounts/<ACCOUNT_ID>/models/<MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>`
**Example:**
The model identifier for querying Llama3.2-3B Instruct (listed as `accounts/fireworks/models/llama-v3p2-3b-instruct`) for Acme Inc.'s deployment (deployment ID being `12ab34cd56ef`) would be:
`accounts/fireworks/models/llama-v3p2-3b-instruct#accounts/acmeInc/deployments/12ab34cd56ef`
**Sample Request:**
```bash
curl \
--header 'Authorization: Bearer <FIREWORKS_API_KEY>' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/<MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
"prompt": "Say this is a test"
}' \
--url https://api.fireworks.ai/inference/v1/completions
```
By default, deployments will automatically [scale down to zero](#customizing-autoscaling-behavior) replicas if unused (i.e. no inference requests) for 1 hour, and will be automatically deleted if unused for one week.
To completely delete the deployment, run:
```bash
firectl delete deployment <DEPLOYMENT_ID>
```
Notes:
* Make sure you include the `#` in the model identifier when querying a specific deployment.
* If you are unsure about the model identifier format, refer to the [Model Identifiers](https://docs.fireworks.ai/models/deploying#model-identifier) section for more details and alternatives.
## Deployment options
### Replica count (horizontal scaling)
The number of replicas (horizontal scaling) is specified by passing the `--min-replica-count` and `--max-replica-count` flags. Increasing the number of replicas will increase the maximum QPS the deployment can support. The deployment will automatically scale based on server load.
Auto-scaling up may fail if there is a GPU stockout. Use [reserved capacity](/deployments/reservations) to guarantee capacity for your deployments.
The default value for `--min-replica-count` is 0. Setting `--min-replica-count` to 0 enables the deployment to auto-scale to 0 if a deployment is unused (i.e. no inference requests) for a specified "scale-to-zero" time window. While the deployment is scaled to 0, you will not pay for any GPU utilization.
The default value for `--max-replica-count` is 1 if `--min-replica-count=0`, or the value of
`--min-replica-count` otherwise.
```bash create
firectl create deployment \
--min-replica-count 2 \
--max-replica-count 3
```
```bash update
firectl update deployment \
--min-replica-count 2 \
--max-replica-count 3
```
### Customizing autoscaling behavior
You can customize certain aspects of the deployment's autoscaling behavior by setting the following flags:
* `--scale-up-window` The duration the autoscaler will wait before scaling up a deployment after observing increased load. Default is `30s`.
* `--scale-down-window` The duration the autoscaler will wait before scaling down a deployment after observing decreased load. Default is `10m`.
* `--scale-to-zero-window` The duration after which there are no requests that the deployment will be scaled down to zero replicas. This is ignored if `--min-replica-count` is greater than 0. Default is `1h`. The minimum is `5m`.
* `--load-targets <TARGET>=<VALUE>[,<TARGET>=<VALUE>...]` Load target thresholds for scaling the replica count. If not specified, this defaults to `--load-targets default=0.8`. If multiple load targets are specified, the maximum replica count across all of them is used.
* `default=<value>` - A general load target with a value between 0 and 1. The default is `default=0.8`.
* `tokens_generated_per_second=<value>` - The desired number of tokens generated per second per replica.
There will be a cold-start latency (up to a few minutes) for requests made while the deployment is scaling from 0 to 1 replicas.
A deployment with `--min-replica-count` set to 0 will be automatically deleted if it receives no traffic for 7 days.
Refer to [time.ParseDuration](https://pkg.go.dev/time#ParseDuration) for valid syntax for the duration string.
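For example, a hypothetical update that tightens the scaling windows and scales on generation throughput might look like the sketch below; the deployment ID, durations, and target value are placeholders to adapt to your workload.
```bash
firectl update deployment <DEPLOYMENT_ID> \
  --scale-up-window 1m \
  --scale-down-window 5m \
  --scale-to-zero-window 30m \
  --load-targets tokens_generated_per_second=5000
```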
### Multiple GPUs (vertical scaling)
The number of GPUs used per replica is specified by passing the `--accelerator-count` flag. Increasing the accelerator count will improve generation speed, time-to-first-token, and maximum QPS for your deployment; however, the scaling is sub-linear. The default value for most models is 1 but may be higher for larger models that require sharding.
```bash create
firectl create deployment --accelerator-count 2
```
```bash update
firectl update deployment --accelerator-count 2
```
### Choosing hardware type
By default, a deployment will use NVIDIA A100 80 GB GPUs. You can also deploy using NVIDIA H100 80 GB, NVIDIA H200 141GB or AMD MI300X GPUs by passing the `--accelerator-type` flag. Valid values for `--accelerator-type` are:
* `NVIDIA_A100_80GB`
* `NVIDIA_H100_80GB`
* `NVIDIA_H200_141GB`
* `AMD_MI300X_192GB` - Note that MoE-based models like DeepSeek Coder and Mixtral are currently not supported on MI300X
See [Regions](/deployments/regions) for a list of accelerator availability by region. Region can be either specified or auto-selected for a deployment upon creation. After creation, the region cannot be changed. If you plan on changing the accelerator type, you may need to re-create the deployment in a new region where it is available.
For advice on choosing a hardware type, see this [FAQ](https://docs.fireworks.ai/faq/deployment/ondemand/hardware-options#hardware-selection)
```bash create
firectl create deployment --accelerator-type="NVIDIA_H100_80GB"
```
```bash update
firectl update deployment --accelerator-type="NVIDIA_H100_80GB"
```
### Model based speculative decoding
Model based speculative decoding allows you to speed up output generation in some cases, by using a smaller model to assist the larger model in generation.
Fireworks also offers speculative decoding based on a user-provided prediction, which works in addition to model based speculative decoding. Read [Using Predicted Outputs](/guides/predicted-outputs) to learn more.
Speculative decoding may slow down output generation if the smaller model is not a good speculator for the larger model, or token count / speculation length is too high or too low. Speculative decoding may also reduce the max throughput you can achieve with your deployment. Test different models and speculation lengths to determine the best settings for your use case.
We offer the following settings that can be set as flags in `firectl`, our CLI tool:
| Flag | Type | Description |
| ---------------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--draft-model` | string | To use a draft model for speculative decoding, set this flag to the name of the draft model you want to use. See the table below for recommendations on draft models to use for popular model families. Note that draft models can be standalone models (referred from Fireworks account or custom models uploaded to your account) or an add-on (e.g. Eagle) |
| `--draft-token-count` | int32 | When using a draft model, set this flag to the number of tokens to generate per step for speculative decoding. Setting `--draft-token-count=0` turns off draft model speculation for the deployment. As a rough guideline, use `--draft-token-count=3` for Eagle draft models and `--draft-token-count=4` for other draft models |
| `--ngram-speculation-length` | int32 | To use N-gram based speculation, set this flag to the length of the previous input sequence to be considered for N-gram speculation |
`--draft-token-count` must be set when `--draft-model` or `--ngram-speculation-length` is used.
`--draft-model` and `--ngram-speculation-length` cannot be used together as they are alternative approaches to model-based speculation. Setting both will throw an error.
You can use the following draft models directly:
| Draft model name | Recommended for |
| -------------------------------------------------------- | -------------------------- |
| accounts/fireworks/models/llama-v3p2-1b-instruct | All Llama models > 3B |
| accounts/fireworks/models/qwen2p5-0p5b-instruct | All Qwen models > 3B |
| accounts/fireworks/models/eagle-llama-v3-3b-instruct-v2 | Llama 3.2 3B |
| accounts/fireworks/models/eagle-qwen-v2p5-3b-instruct-v2 | Qwen 2.5 3B |
| accounts/fireworks/models/eagle-llama-v3-8b-instruct-v2 | Llama 3.1 8B, Llama 3.0 8B |
| accounts/fireworks/models/eagle-qwen-v2p5-7b-instruct-v2 | Qwen 2.5 7B |
Here's an example of deploying Llama 3.1 8B Instruct with a draft model:
```bash
firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \
--accelerator-type="NVIDIA_H100_80GB" \
--draft-model="accounts/fireworks/models/llama-v3p2-1b-instruct" \
--draft-token-count=4
```
In most cases, speculative decoding does not change the quality of the output generated (mathematically, outputs are unchanged, but there might be numerical differences, especially at higher temperatures). If speculation is used on the deployment and you want to verify the output is unchanged, you can set `disable_speculation=True` in the inference API call - in this case, the draft model is still called but its output is not used, so performance will be impacted.
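As a minimal sketch, assuming `disable_speculation` is passed as a field of the JSON request body, a verification request could look like the following; the account and deployment identifiers are placeholders.
```bash
curl \
  --header 'Authorization: Bearer '$FIREWORKS_API_KEY \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    "messages": [{"role": "user", "content": "Say this is a test"}],
    "disable_speculation": true
  }' \
  --url https://api.fireworks.ai/inference/v1/chat/completions
```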
### Quantization
By default, models on dedicated deployments are served using 16-bit floating-point (FP16) precision. Quantization reduces the number of bits used to serve the model, improving performance and reducing the cost to serve. However, this can change model numerics, which may introduce small changes to the output.
In order to deploy a base model using quantization, it must be prepared first. See our [Quantization](/models/quantization)
guide for details.
To create a deployment using a quantized model, pass the `--precision` flag with the desired precision.
```bash
firectl create deployment \
--accelerator-type="NVIDIA_H100_80GB" \
--precision="FP8"
```
Quantized deployments can only be served using H100 GPUs.
### Optimizing your deployments for long context
By default, a balanced deployment will be created using the hardware resources you specify. Higher performance can be achieved for long-prompt length (>\~3000 tokens) workloads by passing the `--long-prompt` flag.
This option roughly doubles the amount of GPU memory required to serve the model and requires a minimum of two GPUs to be effective. If `--accelerator-count` is not specified, then a deployment using twice the minimum number of GPUs (to serve without `--long-prompt`) will be created.
```bash create
firectl create deployment --accelerator-count=2 --long-prompt
```
```bash update
firectl update deployment --long-prompt
```
To update a deployment to disable this option, pass `--long-prompt=false`.
Additional optimization options are available through our enterprise plan.
## Deploying LoRA addons
By default, LoRA addons are disabled for deployments. To enable addons, pass the `--enable-addons` flag:
```bash create
firectl create deployment --enable-addons
```
```bash update
firectl update deployment --enable-addons
```
See [Uploading a custom model](/models/uploading-custom-models#custom-lora-addons) for instructions on how to upload custom LoRA addons. To deploy a LoRA addon to an on-demand deployment, pass the `--deployment` flag to `firectl load-lora`. For example:
```bash
firectl load-lora <LORA_MODEL_ID> --deployment <DEPLOYMENT_ID>
```
The base model of the deployment must match the base model of the addon.
# Pricing
On-demand deployments are billed by GPU-second. Consult our [pricing page](https://fireworks.ai/pricing) for details.
# Using Predicted Outputs
Source: https://docs.fireworks.ai/guides/predicted-outputs
Use Predicted Outputs to boost output generation speeds for editing / rewriting use cases
This feature is in beta and we are working on improvements. We welcome your feedback on [Discord](https://discord.gg/fireworks-ai)
In cases where large parts of the LLM output are known in advance, e.g. editing or rewriting a document or code snippet, you can improve output generation speeds with Predicted Outputs. Predicted Outputs lets you provide strong "guesses" of what the output may look like.
To use Predicted Outputs, set the `prediction` field in the Fireworks API with the predicted output. For example, you may want to edit a survey and add an option to contact users by text message:
```
{
"questions": [
{
"question": "Name",
"type": "text"
},
{
"question": "Age",
"type": "number"
},
{
"question": "Feedback",
"type": "text_area"
},
{
"question": "How to Contact",
"type": "multiple_choice",
"options": ["Email", "Phone"],
"optional": true
}
]
}
```
In this case, we expect most of the code to remain the same, so we set the `prediction` field to the original survey code. Output generation speed increases as a result.
```python Python (Fireworks)
from fireworks.client import Fireworks
code = """{
"questions": [
{
"question": "Name",
"type": "text"
},
{
"question": "Age",
"type": "number"
},
{
"question": "Feedback",
"type": "text_area"
},
{
"question": "How to Contact",
"type": "multiple_choice",
"options": ["Email", "Phone"],
"optional": true
}
]
}
"""
client = Fireworks(api_key="")
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-70b-instruct",
messages=[{
"role": "user",
"content": "Edit the How to Contact question to add an option called Text Message. Output the full edited code, with no markdown or explanations.",
},
{
"role": "user",
"content": code
}
],
temperature=0,
prediction={"type": "content", "content": code}
)
print(response.choices[0].message.content)
```
### Additional information on Predicted Outputs:
* Using Predicted Outputs is free at this time
* We recommend setting `temperature=0` for best results for most intended use cases of Predicted Outputs. In these cases, using Predicted Outputs does not impact the quality of outputs generated
* If the prediction is substantially different from the generated output, output generation speed may decrease
* The maximum length of the `prediction` field is governed by `max_tokens` (2048 by default), so you may need to increase `max_tokens` if you have a longer input and prediction.
* If you are using an on-demand deployment, you can set `rewrite_speculation=True` and potentially get even faster output generation. We are working on rolling this out to Serverless soon.
# Prompt caching
Source: https://docs.fireworks.ai/guides/prompt-caching
Prompt caching is a performance optimization feature that allows Fireworks to
respond faster to requests with prompts that share common prefixes. In many
situations, it can reduce time to first token (TTFT) by as much as 80%.
Prompt caching is enabled by default for all Fireworks models and deployments.
For dedicated deployments, prompt caching frees up resources, leading to higher
throughput on the same hardware. Dedicated deployments on the Enterprise plan allow
additional configuration options to further optimize cache performance.
## Using prompt caching
### Common use cases
Requests to LLMs often share a large portion of their prompt. For example:
* Long system prompts with detailed instructions
* Descriptions of available tools for function calling
* Growing previous conversation history for chat use cases
* Shared per-user context, like a current file for a coding assistant
Prompt caching avoids re-processing the cached prefix of the prompt and starts output generation much sooner.
### Structuring prompts for caching
Prompt caching works only for exact prefix matches within a prompt. To
realize caching benefits, place static content like instructions and examples at
the beginning of your prompt, and put variable content, such as user-specific
information, at the end.
For function calling models, tools are considered part of the prompt.
For vision-language models, images currently aren't cached (but this might be improved in the future).
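For example, a chat request might keep the long, static system instructions first and append only the user-specific content at the end, so every request shares the same cacheable prefix. This is a sketch with hypothetical prompt content; `build_messages` is just an illustrative helper.
```python
SYSTEM_PROMPT = "You are a support assistant for Acme Corp. ..."  # long, identical across requests

def build_messages(user_question: str, user_context: str) -> list[dict]:
    return [
        # Static content first: identical across requests, so it can be served from the cache
        {"role": "system", "content": SYSTEM_PROMPT},
        # Variable, user-specific content last: only this suffix is processed from scratch
        {"role": "user", "content": f"Context:\n{user_context}\n\nQuestion: {user_question}"},
    ]
```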
### How it works
Fireworks will automatically find the longest prefix of the request that is
present in the cache and reuse it. The remaining portion of the prompt will be
processed as usual.
The entire prompt is stored in the cache for future reuse. Cached prompts
usually stay in the cache for at least several minutes. Depending on the model,
load level, and deployment configuration, it can be up to several hours. The
oldest prompts are evicted from the cache first.
Prompt caching doesn't alter the result generated by the model. The response you
receive will be identical to what you would get if prompt caching was not used.
Each generation is sampled from the model independently on each request and is not
cached for future usage.
## Monitoring
For dedicated deployments, information about prompt caching is returned in the
response headers. The header `fireworks-prompt-tokens` contains the number of tokens
in the prompt, out of which `fireworks-cached-prompt-tokens` are cached.
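For example, a raw HTTP request against a dedicated deployment lets you read these headers directly. This is a sketch; the model identifier is a placeholder for your deployment's identifier.
```python
import os
import requests

resp = requests.post(
    "https://api.fireworks.ai/inference/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "accounts/<ACCOUNT_ID>/models/<MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
        "messages": [{"role": "user", "content": "Say this is a test"}],
    },
)
# Total prompt tokens vs. how many of them were served from the cache
print(resp.headers.get("fireworks-prompt-tokens"), resp.headers.get("fireworks-cached-prompt-tokens"))
```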
Aggregated metrics are also available in the [usage dashboard](https://fireworks.ai/account/usage?type=deployments).
## Data privacy
Serverless deployments maintain separate caches for each Fireworks account to prevent data leakage and timing attacks.
Dedicated deployments by default share a single cache across all requests.
Because prompt caching doesn't change the outputs, privacy is preserved even
if the deployment powers a multi-tenant application. It does open a minor risk
of a timing attack: potentially, an adversary can learn that a particular prompt
is cached by observing the response time. To ensure full isolation, you can pass
the `x-prompt-cache-isolation-key` header or the `prompt_cache_isolation_key`
field in the body of the request. It can contain an arbitrary string that acts
as an additional cache key, i.e., no sharing will occur between requests with
different IDs.
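For example, a minimal sketch of passing the isolation key with the OpenAI client; `tenant-42` is an arbitrary illustrative value, and either the header or the body field is sufficient on its own:
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    # Isolate the cache for this tenant (header variant shown; body variant commented out).
    extra_headers={"x-prompt-cache-isolation-key": "tenant-42"},
    # extra_body={"prompt_cache_isolation_key": "tenant-42"},
)
```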
## Limiting or turning off caching
You can pass the `prompt_cache_max_len` field in the request body to
limit the maximum prefix of the prompt (in tokens) that is considered for
caching. It's rarely needed in real applications but can come in handy for
benchmarking the performance of dedicated deployments by passing
`"prompt_cache_max_len": 0`.
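For example, a minimal sketch of disabling caching for a benchmarking run with the OpenAI client; since this is not a standard OpenAI parameter, it is passed via `extra_body` here, and the model identifier is a placeholder:
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

# Disable prompt caching for this request to measure uncached prefill speed.
response = client.completions.create(
    model="<your deployed model identifier>",
    prompt="Say this is a test",
    extra_body={"prompt_cache_max_len": 0},
)
```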
## Advanced: cache locality for Enterprise deployments
Dedicated deployments on an Enterprise plan allow you to pass an additional hint in the request to improve cache hit rates.
First, the deployment needs to be created or updated with an additional flag:
```bash
firectl create deployment ... --enable-session-affinity
```
Then the client can pass an opaque identifier representing a single user or
session in the `user` field of the body or in the `x-session-affinity` header. Fireworks
will try to route requests with the identifier to the same server, further reducing response times.
It's best to choose an identifier that groups requests with long shared prompt
prefixes. For example, it can be a chat session with the same user or an
assistant working with the same shared context.
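For example, a minimal sketch of passing a session identifier in the `user` field with the OpenAI client (the session ID is an arbitrary illustrative value):
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="<your deployed model identifier>",
    messages=[{"role": "user", "content": "Continue our conversation."}],
    # Requests sharing this identifier are routed to the same server when the
    # deployment was created with --enable-session-affinity.
    user="chat-session-1234",
)
```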
### Migration and Traffic Management
When migrating between deployments that use prompt caching, it's crucial to implement proper traffic routing to maintain optimal cache hit rates. When gradually routing traffic to a new deployment, use consistent user/session-based sampling rather than random sampling.
Here's the recommended implementation for traffic routing:
```python
import hashlib
# Configure traffic fraction (e.g., 20% to new deployment)
fireworks_traffic_fraction = 0.2
user_id = "session-id-123"
# Generate deterministic hash from user_id
hashed_user_id = int(hashlib.md5(user_id.encode()).hexdigest(), 16) # MD5 hash on user-id and convert to integer
MAX_HASH = 2**128 - 1 # MD5 hash maximum value
# Compute ratio for consistent routing
ratio = hashed_user_id / MAX_HASH # Returns 0.0 to 1.0
if ratio < fireworks_traffic_fraction:
    send_to_new_deployment(user=hashed_user_id)  # Pass user ID for caching
else:
    send_elsewhere()  # Route to old deployment or serverless
```
Avoid random sampling for traffic routing as it can negatively impact cache hit rates:
```python
# Don't do this:
if random() < fireworks_traffic_fraction:  # ❌ Reduces cache effectiveness
    send_to_new_deployment(user=hashed_user_id)
```
Using consistent user-based routing ensures complete user sessions are maintained on the same deployment, optimizing prompt cache performance regardless of the traffic fraction.
# Querying embedding models
Source: https://docs.fireworks.ai/guides/querying-embeddings-models
Fireworks hosts many embedding models. Let's walk through an example of using `nomic-ai/nomic-embed-text-v1.5` to see how to query Fireworks with the embeddings API.
# Embedding documents
The embedding model takes text as input and outputs a vector (list) of floating-point numbers to use for tasks like similarity comparison and search. Our embedding service is OpenAI compatible. Refer to OpenAI's embeddings [guide](https://platform.openai.com/docs/guides/embeddings) and OpenAI's [embeddings documentation](https://platform.openai.com/docs/api-reference/embeddings) for more information on using these models.
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.embeddings.create(
model="nomic-ai/nomic-embed-text-v1.5",
input="search_document: Spiderman was a particularly entertaining movie with...",
)
print(response)
```
This code embeds the text `search_document: Spiderman was a particularly entertaining movie with...` and returns the following response:
```json Response
CreateEmbeddingResponse(data=[Embedding(embedding=[0.006380197126418352, 0.011841800063848495,...], index=0, object='embedding')], model='nomic-ai/nomic-embed-text-v1.5', object='list', usage=Usage(prompt_tokens=12, total_tokens=12))
```
# Embedding queries and documents
In the previous example, you might have noticed the `search_document: ` prefix. Nomic models have been fine-tuned to take prefixes, so you will need to add the `search_query: ` prefix to user queries and the `search_document: ` prefix to documents.
Here's a quick example:
* Let's say we previously used the embedding model to embed many movie reviews that we stored in a vector database. All of the documents should have been prefixed with `search_document: `.
* We now want to build a movie recommendation service that takes in a user query and outputs recommendations based on this data. The code below demonstrates how to embed the user query.
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
query = "I love superhero movies, any recommendations?"
task_description="Given a user query for movies, retrieve the relevant movie that can fulfill the query. "
query_emb = client.embeddings.create(
model="nomic-ai/nomic-embed-text-v1.5",
input=f"search_query: {query}"
)
print(query_emb)
```
To view this example end-to-end and see how to use a MongoDB vector store and Fireworks-hosted generation model for RAG, see our full [guide](https://github.com/fw-ai/cookbook/blob/main/examples/rag/mongo_basic.ipynb). For more information on what kind of prefixes are possible with nomic, please check out [this guide from nomic](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#usage).
# Variable dimensions
The model also supports variable embedding dimensions. In this case, we can pass the `dimensions` parameter in the `embeddings.create()` request:
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.embeddings.create(
model="nomic-ai/nomic-embed-text-v1.5",
input="search_document: I like Christmas movies, can you make any recommendations?",
dimensions=128,
)
print(len(response.data[0].embedding))
```
You will see that the returned results are embeddings with dimension 128.
# List of available models
| Model name | Model size |
| :--------------------------------------------- | :--------- |
| `nomic-ai/nomic-embed-text-v1.5` (recommended) | 137M |
| `nomic-ai/nomic-embed-text-v1` | 137M |
| `WhereIsAI/UAE-Large-V1` | 335M |
| `thenlper/gte-large` | 335M |
| `thenlper/gte-base` | 109M |
# Querying text models
Source: https://docs.fireworks.ai/guides/querying-text-models
Fireworks.ai offers an OpenAI-compatible REST API for querying text models. There are several ways to interact with it:
* The [Fireworks Python client library](/tools-sdks/python-client/installation)
* The [web console](https://fireworks.ai)
* [LangChain](https://python.langchain.com/docs/integrations/providers/fireworks)
* Directly invoking the [REST API](/api-reference/post-completions) using your favorite tools or language
* The [OpenAI Python client](https://github.com/openai/openai-python)
## Using the web console
All Fireworks models can be accessed through the web console at [fireworks.ai](https://fireworks.ai). Clicking on a model will take you to the playground where you can enter a prompt along with additional request parameters.
Non-chat models will use the [completions API](/api-reference/post-completions) which passes your input directly into the model.
Models with a conversation config are considered chat models (also known as instruct models). By default, chat models will use the [chat completions API](/api-reference/post-chatcompletions) which will automatically format your input with the conversation style of the model. Advanced users can revert back to the completions API by unchecking the "Use chat template" option.
## Using the API
### Chat completions API
Models with a conversation config have the [chat completions API](/api-reference/post-chatcompletions) enabled. These models are typically tuned with specific conversation styles for which they perform best. For example, Llama chat models use the following [template](https://gpus.llm-utils.org/llama-2-prompt-template/):
> \[INST] \<\<SYS\>\>
>
> {system\_prompt}
>
> \<\</SYS\>\>
>
> user\_message\_1 \[/INST]
Some templates can support multiple chat messages as well. In general, we recommend users use the chat completions API whenever possible to avoid common prompt formatting errors. Even small errors like misplaced whitespace may result in poor model performance.
Here are some examples of calling the chat completions API:
```python Python (Fireworks)
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response = openai.ChatCompletion.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
)
print(response.choices[0].message.content)
```
```shell cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3-8b-instruct",
"messages": [{
"role": "user",
"content": "Say this is a test"
}]
}' \
--url https://api.fireworks.ai/inference/v1/chat/completions
```
#### Overriding the system prompt
A conversation style may include a default system prompt. For example, Llama 2 models use the default Llama prompt:
> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
For styles that support a system prompt, you may override this prompt by setting the first message with the role `system`. For example:
```json JSON
[
{
"role": "system",
"content": "You are a pirate."
},
{
"role": "user",
"content": "Hello, what is your name?"
}
]
```
To completely omit the system prompt, you can set `content` to the empty string.
The process of generating a conversation-formatted prompt will depend on the conversation style used. To verify the exact prompt used, turn on [`echo`](#echo).
### Completions API
Text models generate text based on the provided input prompt. All text models support this basic [completions API](/api-reference/post-completions). Using this API, the model will successively generate new tokens until either the maximum number of output tokens has been reached or the model's special end-of-sequence (EOS) token has been generated.
Most models will automatically prepend the beginning-of-sequence (BOS) token (e.g. ``) to your prompt input. You can always double-check by passing [raw\_output](#raw-output) and inspecting the resulting `prompt_token_ids`.
Here are some examples of calling the completions API:
```python Python (Fireworks)
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response = client.completion.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
prompt="Say this is a test",
)
print(response.choices[0].text)
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.completions.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
prompt="Say this is a test",
)
print(response.choices[0].text)
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response = openai.Completion.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
prompt="Say this is a test",
)
print(response.choices[0].text)
```
```shell cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3-8b-instruct",
"prompt": "Say this is a test"
}' \
--url https://api.fireworks.ai/inference/v1/completions
```
## Getting usage info
The returned object will contain a `usage` field with:
* The number of prompt tokens ingested
* The number of completion tokens (i.e. the number of tokens generated)
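For example, a minimal sketch of reading these counters from a non-streaming response (the API key is a placeholder):
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3-8b-instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
)
print(response.usage.prompt_tokens)      # prompt tokens ingested
print(response.usage.completion_tokens)  # tokens generated
print(response.usage.total_tokens)       # sum of the two
```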
## Advanced options
See the API reference for the [completions](/api-reference/post-completions) and [chat completions](/api-reference/post-chatcompletions) APIs for a detailed description of these options.
### Streaming
By default, results are returned to the client once the generation is finished. Another option is to stream the results back, which is useful for chat use cases where the client can incrementally see results as each token is generated.
Here is an example with the completions API:
```python Python (Fireworks)
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response_generator = client.completion.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
prompt="Say this is a test",
stream=True,
)
for chunk in response_generator:
    print(chunk.choices[0].text)
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response_generator = client.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
prompt="Say this is a test",
stream=True,
)
for chunk in response_generator:
    print(chunk.choices[0].text)
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response_generator = openai.Completion.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
prompt="Say this is a test",
stream=True,
)
for chunk in response_generator:
    print(chunk.choices[0].text, end="")
```
```shell cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
"prompt": "Say this is a test",
"stream": true
}' \
--url https://api.fireworks.ai/inference/v1/completions
```
and one with the chat completions API:
```python Python (Fireworks)
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response_generator = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-8b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
stream=True,
)
for chunk in response_generator:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response_generator = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
stream=True,
)
for chunk in response_generator:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response_generator = openai.ChatCompletion.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
stream=True,
)
for chunk in response_generator:
    if "content" in chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end="")
```
### Aborting requests
When using streaming, you can stop the generation process by closing the HTTP connection partway through. This will immediately halt server-side processing. For serverless models, no additional tokens will be generated or billed. For dedicated deployments, it may free up resources for other requests.
The prompt is always fully processed and billed, and you cannot cancel or abort a request before the first token is generated. Non-streaming requests cannot be aborted.
To abort generation when using the Fireworks or OpenAI client, call the `.close()` method on the returned generator object. HTTP clients in other languages typically offer similar functionality to close the connection.
```python
import time

# ... create a streaming response_generator using one of the Fireworks or OpenAI examples above ...

start_time = time.time()
for chunk in response_generator:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    # abort after 1 second
    if time.time() - start_time > 1:
        response_generator.close()
        break
```
### Async mode
The Python client library also supports asynchronous mode for both completion and chat completion.
```python Python (Fireworks)
import asyncio
from fireworks.client import AsyncFireworks
client = AsyncFireworks(api_key="")
async def main():
    stream = client.completion.acreate(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",
        prompt="Say this is a test",
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].text, end="")
asyncio.run(main())
```
```python Python (OpenAI 1.x)
import asyncio
import openai
client = openai.AsyncOpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
async def main():
    stream = await client.completions.create(
        model="accounts/fireworks/models/llama-v3-8b-instruct",
        prompt="Say this is a test",
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].text, end="")
asyncio.run(main())
```
### Predicted Outputs
See [Using Predicted Outputs](/guides/predicted-outputs)
### Sampling options
The API generates text auto-regressively, choosing each next token by sampling from the probability distribution over the space of tokens. For detailed information on how to implement these options, please refer to the [Chat Completions](/api-reference/post-chatcompletions) or [Completions](/api-reference/post-completions) API documentation.
#### Multiple choices
By default, the API will return a single generation choice per request. You can create multiple generations by setting the `n` parameter to the number of desired choices. The returned `choices` array will contain the result of each generation.
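For example, a minimal sketch requesting three independent completions in one call (the API key is a placeholder):
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.completions.create(
    model="accounts/fireworks/models/llama-v3-8b-instruct",
    prompt="Write a one-line tagline for a coffee shop:",
    n=3,            # three independent generations
    max_tokens=30,
)
for choice in response.choices:
    print(choice.index, choice.text)
```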
#### Max tokens
`max_tokens` or `max_completion_tokens` defines the maximum number of tokens the model can generate, with a default of 2000. If the combined token count (prompt + output) exceeds the model's limit, it automatically reduces the number of generated tokens to fit within the allowed context.
#### Temperature
Temperature allows you to configure how much randomness you want in the generated text. A higher temperature leads to more "creative" results. On the other hand, setting a temperature of 0 will allow you to generate deterministic results, which is useful for testing and debugging.
#### Top-p
Top-p (also called [nucleus sampling](https://en.wikipedia.org/wiki/Top-p_sampling)) is an alternative to sampling with temperature, where the model considers the results of the tokens with `top_p` probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
#### Top-k
Top-k is another sampling method where only the `k` most probable tokens are kept and the probability mass is redistributed among them.
#### Min-p
[`min_p`](https://arxiv.org/abs/2407.01082) specifies a probability threshold to control which tokens can be selected during generation. Tokens with probabilities lower than this threshold are excluded, making the model more focused on higher-probability tokens. The default value varies, and setting a lower value ensures more variety, while a higher value produces more predictable, focused outputs.
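Here is a minimal sketch combining these sampling options in one request. `temperature` and `top_p` are standard OpenAI parameters; `top_k` and `min_p` are Fireworks extensions, so with the OpenAI client they are passed through `extra_body` here. The specific values are only illustrative.
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3-8b-instruct",
    messages=[{"role": "user", "content": "Suggest a name for a pet robot."}],
    max_tokens=50,
    temperature=0.7,  # moderate randomness
    top_p=0.9,        # nucleus sampling over the top 90% probability mass
    extra_body={"top_k": 40, "min_p": 0.05},  # Fireworks-specific options
)
print(response.choices[0].message.content)
```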
#### Repetition penalty
LLMs are sometimes prone to repeat a single character or a sentence. Using a frequency and presence penalty can reduce the likelihood of sampling repetitive sequences of tokens. They work by directly modifying the model's logits (un-normalized log-probabilities) with an additive contribution.
`logits[j] -= c[j] * frequency_penalty + (c[j] > 0 ? 1 : 0) * presence_penalty`
where
* `logits[j]` is the logits of the j-th token
* `c[j]` is how often that token was sampled before the current position
The [`repetition_penalty`](https://arxiv.org/pdf/1909.05858.pdf) modifies the logit (raw model output) for repeated tokens. If a token has already appeared in the prompt or output, the penalty is applied to its probability of being selected again.
**Key differences to keep in mind:**
* `frequency_penalty`: Works on how often a word has been used, increasing the penalty for more frequent words. OAI compatible.
* `presence_penalty`: Penalizes words once they appear, regardless of frequency. OAI compatible.
* `repetition_penalty`: Adjusts the likelihood of repeated tokens based on previous appearances, providing an exponential scaling effect to control repetition more precisely, including from the prompt.
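A minimal sketch of passing these penalties in a request: `frequency_penalty` and `presence_penalty` are standard OpenAI parameters, while `repetition_penalty` is passed via `extra_body` here; the values are illustrative only.
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3-8b-instruct",
    messages=[{"role": "user", "content": "List some ideas for a birthday party."}],
    frequency_penalty=0.5,  # grows with how often a token has been sampled
    presence_penalty=0.3,   # flat penalty once a token has appeared at all
    extra_body={"repetition_penalty": 1.1},  # also penalizes tokens from the prompt
)
print(response.choices[0].message.content)
```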
#### Mirostat (learning rate and target)
The [Mirostat algorithm](https://arxiv.org/abs/2007.14966) is a sampling method that helps keep the output's unpredictability, or perplexity, at a set target. It adjusts token probabilities as the text is generated to balance between more diverse or more predictable results. This is useful when you need steady control over how random or focused the text output should be.
There are two parameters that can be adjusted:
* `mirostat_target`: Sets the desired level of unpredictability (perplexity) for the Mirostat algorithm. A higher target results in more diverse output, while a lower target keeps the text more predictable.
* `mirostat_lr`: Controls how quickly the Mirostat algorithm adjusts token probabilities to reach the target perplexity. A lower learning rate makes the adjustments slower and more gradual, while a higher rate speeds up the corrections.
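A minimal sketch of passing the Mirostat parameters: since they are Fireworks-specific, they go through `extra_body` with the OpenAI client, and the values below are purely illustrative.
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.completions.create(
    model="accounts/fireworks/models/llama-v3-8b-instruct",
    prompt="Tell me a short story about a lighthouse keeper.",
    max_tokens=200,
    extra_body={"mirostat_target": 3.0, "mirostat_lr": 0.1},  # illustrative values
)
print(response.choices[0].text)
```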
#### Logit bias
A parameter that modifies the likelihood of specified tokens appearing. Pass in a `Dict[int, float]` that maps a `token_id` to a logit bias value between -200.0 and 200.0. For example:
```python Python
client.completions.create(
    model="...",
    prompt="...",
    logit_bias={0: 10.0, 2: -50.0}
)
```
## Debugging options
### Ignore EOS
This option allows you to control whether the model stops when it generates the end-of-sequence (EOS) token. It is primarily helpful for performance benchmarking to reliably generate exactly `max_tokens`. Note that output quality may degrade, since this overrides the model's decision to generate the EOS token.
### Logprobs
The `logprobs` parameter determines how many token probabilities are returned. If set to N, it will return log (base e)
probabilities for N+1 tokens: the chosen token plus the N most likely alternative tokens.
The log probabilities will be returned in a LogProbs object for each choice.
* `tokens` contains each token of the chosen result.
* `token_ids` contains the integer IDs of each token of the chosen result.
* `token_logprobs` contains the logprobs of each chosen token.
* `top_logprobs` will be a list whose length is the number of tokens of the output. Each element is a dictionary of size `logprobs`, mapping the most likely tokens at the given position to their respective log probabilities.
When used in conjunction with `echo`, this option can be set to see how the model tokenized your input.
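For example, a minimal sketch using the Fireworks client to request log probabilities together with `echo`, then reading the fields described above (the attribute names follow the description in this section, and the API key is a placeholder):
```python
from fireworks.client import Fireworks

client = Fireworks(api_key="<FIREWORKS_API_KEY>")

response = client.completion.create(
    model="accounts/fireworks/models/llama-v3-8b-instruct",
    prompt="Say this is a test",
    max_tokens=1,
    logprobs=3,  # chosen token plus the 3 most likely alternatives per position
    echo=True,   # include the prompt so you can see how it was tokenized
)
logprobs = response.choices[0].logprobs
print(logprobs.tokens)          # tokens of the prompt and completion
print(logprobs.token_logprobs)  # log probability of each chosen token
print(logprobs.top_logprobs)    # per-position mapping of likely tokens to logprobs
```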
### Top logprobs
Setting the `top_logprobs` parameter to an integer value in conjunction with `logprobs=True` will also return the above information but in an OpenAI client-compatible format.
### Echo
Setting the `echo` parameter to true will cause the API to return the prompt along with the generated text. This can be used in conjunction with the chat completions API to verify the prompt template used. It can also be used in conjunction with logprobs to see how the model tokenized your input.
### Raw output
This is an unstable, experimental API. It may change at any time and should not be relied upon for production use cases.
Setting the `raw_output` parameter to true will cause the API to return a `raw_output` object in the response containing additional debugging information about the raw prompt and completion as seen/produced by the model.
* `prompt_fragments` - Pieces of the prompt (like individual messages) before truncation and concatenation.
* `prompt_token_ids` - Fully tokenized prompt as seen by the model.
* `completion` - Raw completion produced by the model before any tool calls are parsed.
* `completion_logprobs` - Log probabilities for the completion. Only populated if `logprobs` is specified in the
request.
## Appendix
### Tokenization
Language models read and write text in chunks called tokens. In English, a **token** can be as short as one character or as long as one word (e.g., a or apple), and in some languages, tokens can be even shorter than one character or even longer than one word.
Different model families use different **tokenizers**. The same text might be translated to different numbers of tokens depending on the model. This means that generation cost may vary per model even if the model size is the same. For the Llama model family, you can use [this tool](https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/) to estimate token counts. The actual number of tokens used in prompt and generation is returned in the `usage` field of the API response.
# Querying vision-language models
Source: https://docs.fireworks.ai/guides/querying-vision-language-models
See [Querying text models](/guides/querying-text-models) for a general guide on the API and its options.
## Using the API
Both completions API and chat completions API are supported. However, we recommend users use the chat completions API whenever possible to avoid common prompt formatting errors. Even small errors like misplaced whitespace may result in poor model performance.
You can pass images via a URL link or base64 encoded format. Code examples for both methods are below.
For Llama 3.2 Vision models, you should pass images before text in the content field to avoid the model refusing to answer.
### Chat completions API
All vision-language models should have a conversation config and have [chat completions API](https://docs.fireworks.ai/api-reference/post-chatcompletions) enabled. These models are typically tuned with specific conversation styles for which they perform best. For example, Phi-3 models use the following template:
```
SYSTEM: {system message}
USER:
{user message}
ASSISTANT:
```
The `` substring is a special token that we insert into the prompt to allow the model to figure out where to put the image.
Here are some examples of calling the chat completions API:
```python Python (Fireworks)
import fireworks.client
fireworks.client.api_key = ""
response = fireworks.client.ChatCompletion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key = "",
)
response = client.chat.completions.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response = openai.ChatCompletion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```bash cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/phi-3-vision-128k-instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Can you describe this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
}
}
]
}
]
}' \
--url https://api.fireworks.ai/inference/v1/chat/completions
```
In the above example, we provided images via URLs. Alternatively, you can provide the base64-encoded string representation of the images, prefixed with the MIME type. For example:
```python Python (Fireworks)
import fireworks.client
import base64
# Helper function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
# The path to your image
image_path = "your_image.jpg"
# The base64 string of the image
image_base64 = encode_image(image_path)
fireworks.client.api_key = ""
response = fireworks.client.ChatCompletion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 1.x)
import openai
import base64
# Helper function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
# The path to your image
image_path = "your_image.jpg"
# The base64 string of the image
image_base64 = encode_image(image_path)
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key = "",
)
response = client.chat.completions.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 0.x)
import openai
import base64
# Helper function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
# The path to your image
image_path = "your_image.jpg"
# The base64 string of the image
image_base64 = encode_image(image_path)
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response = openai.ChatCompletion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```Text cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/phi-3-vision-128k-instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Can you describe this image?"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,"
}
}
]
}
]
}' \
--url https://api.fireworks.ai/inference/v1/chat/completions
```
### Completions API
Advanced users can also query the completions API directly. Users will need to manually insert the image token `` where appropriate and supply the images as an ordered list (this is true for the Phi-3 model, but may be subject to change for future vision-language models). For example:
```python
import fireworks.client
fireworks.client.api_key = ""
response = fireworks.client.Completion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
prompt = "SYSTEM: Hello\n\nUSER:\ntell me about the image\n\nASSISTANT:",
images = ["https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"],
)
print(response.choices[0].text)
```
## Best practices
1. The Chat Completions API is not stateful, which means you have to manage the messages (including images) you pass to the model yourself. However, we try to cache image downloads as much as we can to reduce latency. Images are not persisted beyond the server lifetime and are deleted automatically.
2. For long-running conversations, we suggest passing images via URLs instead of base64-encoded images. Model latency can also be improved by downsizing your images ahead of time so they are no larger than the maximum size the model expects (see the sketch after this list).
3. If you have image metadata that you want the model to understand, please provide it through the prompt.
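For example, here is a minimal sketch of pre-downsizing an image before base64-encoding it for the chat completions request. It assumes Pillow is installed, and the 1024-pixel limit is purely illustrative; pick a size appropriate for your model and use case.
```python
import base64
import io

from PIL import Image

def downscale_and_encode(image_path: str, max_side: int = 1024) -> str:
    """Shrink the image so its longest side is at most max_side, then return a base64 JPEG string."""
    image = Image.open(image_path)
    image.thumbnail((max_side, max_side))  # preserves aspect ratio, only shrinks
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=85)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

# Use the result as a data URL in the "image_url" field of the request:
image_base64 = downscale_and_encode("your_image.jpg")
data_url = f"data:image/jpeg;base64,{image_base64}"
```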
## API limitations
Right now, we impose certain limits on the completions API and chat completions API as follows:
1. The total number of images included in a single API request cannot exceed 30, regardless of whether they are provided as base64 strings or URLs.
2. If images are provided in base64 encoding, they must be less than 10MB in total (when converted to base64 encoding).
3. If images are provided as URLs, then each image needs to be smaller than 5MB. If the time taken to download the images is longer than 1.5 seconds, the request will be dropped and you will receive an error.
4. We currently support `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff` and `.ppm` format images.
## Calculating cost
An image is treated as a dynamic number of tokens based on image resolution. For one image, the number of tokens typically ranges from 1K to 2.5K. The pricing is otherwise identical to text models. For more information, please refer to [our pricing page here.](https://fireworks.ai/pricing)
# Rate limits, spend limits and quotas
Source: https://docs.fireworks.ai/guides/quotas_usage/rate-limits
Rate limits, spend limits and quotas for serverless inference and on-demand deployments
## Rate Limits on Serverless
Rate limits on Serverless exist to ensure fair usage and reasonable performance for all users. We enforce fixed, maximum rate limits with a spike arrest policy - please read this section completely to understand how rate limits work.
* Fixed limits reflect the maximum usage allowed on Serverless
* Usage that spikes quickly may be throttled if a serverless deployment is in the process of scaling
If you need higher rate limits, faster speeds, more consistent latency, or guaranteed reliability with SLAs, [contact us](https://fireworks.ai/company/contact-us) to learn more about our Enterprise offerings, or consider using [on-demand deployments](https://github.com/fw-ai/docs/blob/main/guides/ondemand-deployments.mdx).
### Fixed Limits
| Limits | Self-Serve |
| ---------------------------------------------------------------------------- | ---------- |
| Requests per minute | 6,000 |
| Audio min per minute, Whisper-v3-large | 200 |
| Audio min per minute, Whisper-v3-turbo | 400 |
| Concurrent connections, streaming speech transcription | 10 |
| # [LoRAs](https://docs.fireworks.ai/getting-started/concepts#deployed-model) | 100 |
### Spike Arrest Policy
LLM traffic that spikes quickly has the potential to be throttled. Here's how it works:
* Each user has a guaranteed rate limit, which increases with sustained usage near the limit. Typically, you can expect to stay within the limits if your traffic gradually doubles within an hour.
* You can see your guaranteed limits using API response headers (see below)
* Exceeding your guaranteed limit means that there's the potential for your requests to be processed with lower priority. Fireworks operates serverless deployments by [autoscaling](https://en.wikipedia.org/wiki/Autoscaling) capacity (within limits) as user traffic increases. However, if a deployment is overloaded while auto-scaling, requests that fall outside of guaranteed limits may be processed with lower priority (higher latency) or dropped with HTTP code 429 (if limits are significantly exceeded). You can monitor whether you exceed limits via the API response header `x-ratelimit-over-limit: yes`.
* Exceeding your guaranteed limit does not guarantee that your requests will be throttled. You can monitor if your requests are actually being throttled by monitoring latencies.
Here's an example of how dynamic rate limits scale up:
| Metric | Minimum Guaranteed Limit | 10 Minutes | 1 Hour | 2 Hours |
| ------------------------ | ------------------------ | ---------- | ------ | ------- |
| Requests per minute | 60 | 120 | 720 | 1440 |
| Input tokens per minute | 60000 | 120000 | 720000 | 1440000 |
| Output tokens per minute | 6000 | 12000 | 72000 | 144000 |
### Rate limit response headers
| Header | Description |
| ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| x-ratelimit-limit-requests, x-ratelimit-limit-tokens-prompt, x-ratelimit-limit-tokens-generated | The maximum number of requests or tokens that are permitted per minute before the limit is exhausted and future requests are de-prioritized. `requests` refers to the number of completions (`n > 1` counts as several requests). `tokens-prompt` and `tokens-generated` refer to the number of input and output tokens respectively. |
| x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens-prompt, x-ratelimit-remaining-tokens-generated | The remaining number of requests or tokens that are permitted before exhausting the rate limit. Note that the limit is replenished continuously. If your usage is sustainably below the rate limit, this number will hover near its maximum value. |
| x-ratelimit-over-limit | Contains "yes" or "no". The value "yes" means that at least one of the limits is exhausted and this request was executed with lower priority. |
### Daily Token Limits
Daily token limits are set at thresholds that provide smooth transitions to enterprise reservations. If you think you may hit daily token limits, please [contact us](https://fireworks.ai/company/contact-us) to learn about enterprise packages.
| Limits | Self-Serve |
| ---------------------------------------------------------------- | ---------- |
| Tokens per day, models \< 40B | 2.5B |
| Tokens per day, models between 40B - 100B | 1.25B |
| Tokens per day, models > 100B (incl. large MoE like Deepseek R1) | 600M |
## GPU Limits with On-Demand Deployments
If you need higher limits, [contact us](https://fireworks.ai/company/contact-us) to learn more about our Enterprise offerings.
| **Quota Name** | **Default Value** |
| -------------------------------------------------------------------------------- | ----------------- |
| # Nvidia A100 | 8 |
| # Nvidia H100 | 8 |
| # Nvidia H200 | 8 |
| # AMD MI300X | 8 |
| Total GPU Hours per month | 2000 |
| # [LoRAs](https://docs.fireworks.ai/getting-started/concepts#deployed-model) | 100 |
Note that the limit on # LoRAs is a total limit across Serverless and On-Demand.
## Spend limits
In order to prevent fraud, Fireworks imposes a monthly spending limit on your account. Once you hit the spending limit, your account will automatically enter a suspended state, API requests will be rejected and all Fireworks usage will be stopped. This includes serverless inference, dedicated deployments, and fine-tuning jobs.
Your spend limit will organically increase over time as you spend more on the platform. You can also increase your spend limit at any time, by purchasing prepaid credits to meet the historical spend required for a higher tier. For instance, if you are a new Tier 1 user with `$0` historical spend, you can purchase `$100` prepaid credits and become a Tier 2 user.
You can qualify for a higher tier by adding credits to your Fireworks account. Note that there may be a propagation delay of a few minutes after you prepay for credits, during which you may still see a "monthly usage exceeded" error.
| **Tier** | **Qualification** | **Spending Limit** |
| --------- | --------------------------------------------------------------------- | ------------------ |
| Tier 1 | Valid payment method added | \$50/mo |
| Tier 2 | \$50 spent in payments or credits added | \$500/mo |
| Tier 3 | \$500 spent in payments or credits added | \$5,000/mo |
| Tier 4 | \$5000 spent in payments or credits added | \$50,000/mo |
| Unlimited | Contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) | Unlimited |
### Reducing Spend Limits
In certain cases, developers want to reduce their spend limit, for example to guard against unexpected costs from an app suddenly going viral. Users can lower or raise spend limits to any arbitrary number within their Tier with the following command:
```bash
firectl update quota monthly-spend-usd --value
```
## Viewing quotas
You can view your current quota capacity by running:
```bash
firectl list quotas
```
## Account suspension
Account suspension occurs when your spending limit is hit, no payment method is on file after credits are depleted, or a past invoice payment fails. If you have a failed payment, go to the Invoices section at [https://fireworks.ai/billing](https://fireworks.ai/billing), pay all failed invoices, and your account will be automatically unsuspended. If your account is still suspended after 1 hour, contact the Fireworks team in Discord or via email.
# Hitchhikers guide to open models
Source: https://docs.fireworks.ai/guides/recommended-models
A list of recommended open models for common use cases
## Which Open Models Should I Use?
There's no single right answer! Here’s a curated list based on **Fireworks internal testing**, **community feedback**, and **external benchmarks**. We recommend you use it as a starting point, and we will update it regularly as new models emerge.
Note:
* Model sizes are marked as *Small*, *Medium* or *Large*
* For best latency, use small or medium models. For best quality, use large models, or fine-tune medium or small models.
* You can explore all models in the [Fireworks Model Library](https://fireworks.ai/models)
| **Use Case** | **Recommended Models** |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| **Code generation & reasoning** | DeepSeek R1, V3-0324 *(Large)* Qwen2.5-32B-Coder *(Medium)* |
| **Code completion & bug fixing** | Qwen2.5-32B-Coder *(Medium)* DeepSeek V2.5 *(Medium)* Qwen2.5 0.5–14B *(Small)* |
| **General reasoning & planning** | DeepSeek R1, V3-0324 *(Large)* Qwen2.5-72B-Instruct *(Medium)* Llama 3.3 70B *(Medium)* |
| **Function calling & tool use** | Qwen2.5-72B-Instruct *(Medium)* |
| **Long context & summarization** | Llama 4 Maverick & Scout *(Medium/Large)* |
| **Vision & document understanding** | Qwen2.5-32B-VL, 72B-VL *(Medium)* Llama 4 Maverick & Scout *(Medium/Large)* Qwen2.5 3–7B *(Small)* |
| **Low-latency NLU & extraction** | Llama 3.1 8B, 3.2 3B/1B *(Small)* Qwen2.5 0.5–7B *(Small)* |
*Last updated: April 28, 2025*
# Data privacy & security
Source: https://docs.fireworks.ai/guides/security_compliance/data_handling
How we secure and handle your data
# Zero Data Retention
Fireworks has Zero Data Retention by default. Specifically, this means
* Fireworks does not log or store prompt or generation data for any open models, without explicit user opt-in.
* More technically: prompt and generation data exist only in volatile memory for the duration of the request. If [prompt caching](https://docs.fireworks.ai/guides/prompt-caching#data-privacy) is active, some prompt data (and associated KV caches) can be stored in volatile memory for several minutes. In either case, prompt and generation data are not logged into any persistent storage.
* Fireworks logs metadata (e.g. number of tokens in a request) as required to deliver the service.
* Users can explicitly opt-in to log prompt and generation data for certain advanced features (e.g. FireOptimizer).
* For proprietary Fireworks models (e.g. f1, FireFunction), prompt and generation data may be logged to enable bulk analytics to improve the model.
* In this case, the model description will contain an explicit message about logging.
# Deploying models
Source: https://docs.fireworks.ai/models/deploying
A model must be deployed before it can be used for inference. Fireworks deploys the most popular base models to
serverless deployments that can be used out of the box (including LoRA addons). See [Querying text models](/guides/querying-text-models).
Less popular base models or custom base
models must be used with an [on-demand deployment](/guides/ondemand-deployments).
## Deploying a model
### LoRA addons
#### Loading to serverless
Fireworks also supports loading serverless addons for [supported base models](/fine-tuning/fine-tuning-models#appendix).
To load a LoRA addon to serverless, run `firectl load-lora` without passing a deployment ID:
```bash
firectl load-lora
```
Serverless addons are charged by input and output tokens for inference. There is no additional charge for loading
serverless addons.
LoRA addons on serverless have higher latency compared with base model inference. This includes LoRA fine-tunes, which
are one type of LoRA addon. For faster inference speeds with LoRA addons, we recommend deploying to on-demand.
Unused addons may be automatically unloaded after a week.
#### Deploying to on-demand
Addons may also be deployed in an [on-demand deployment](/guides/ondemand-deployments) of [supported base models](/fine-tuning/fine-tuning-models#appendix).
To create an on-demand deployment, run:
```bash
firectl create deployment "accounts/fireworks/models/" --enable-addons
```
On-demand deployments are charged by GPU-hour. See
[Pricing](https://fireworks.ai/pricing#ondemand) for details.
Once the deployment is ready, deploy the addon to the deployment:
```bash
firectl load-lora --deployment
```
### Base models
Custom base models may only be used with [on-demand deployments](/guides/ondemand-deployments). To create one, run:
```bash
firectl create deployment
```
On-demand deployments are charged by GPU-hour. See
[Pricing](https://fireworks.ai/pricing#ondemand) for details.
Use the `` specified during [model upload](https://docs.fireworks.ai/models/uploading-custom-models#uploading-the-model-2). Creating the deployment will automatically deploy the base model to the deployment.
## Checking whether a model is deployed
You can check the status of a model deployment by looking at the "Deployed Model Refs" section from:
```
firectl get model
```
If successful, there will be an entry with `State: DEPLOYED`.
Alternatively, you can list all deployed models within your account by running:
```
firectl list deployed-models
```
## Inference
### Model identifier
After your model is successfully deployed, it will be ready for inference. A model can be queried using one of the
following model identifiers:
* The model and deployment names - `accounts//models/#accounts//deployments/`,
e.g.
* `accounts/fireworks/models/mixtral-8x7b#accounts/alice/deployments/12345678`
* `accounts/alice/models/custom-model#accounts/alice/deployments/12345678`
* The model and deployment short-names - `/#/`,
e.g.
* `fireworks/mixtral-8x7b#alice/12345678`
* `alice/custom-model#alice/12345678`
* Deployed model name - Instead of needing to use both the model and deployment name to refer to a deployed model, you can optionally just use a unique deployed model name. This name utilizes a unique deployed model ID that is created upon deployment. The deployed model ID takes the form \-\/`
* `/#/`
### Multiple deployments
Since a model may be deployed to multiple deployments, querying by model name will route to the "default" deployed
model. You can see which deployed model entry is marked with `Default: true` by describing the model:
```
firectl get model
...
Deployed Model Refs:
[{
Name: accounts//deployedModels/
Deployment: accounts//deployments/
State: DEPLOYED
Default: true
},
{
Name: accounts//deployedModels/
Deployment: accounts//deployments/
State: DEPLOYED
},
]
```
To update the default deployed model, note the "Name" of the deployed model reference above. Then run:
```
firectl update deployed-model --default
```
Deleting a default deployment:
To delete a default deployment you must delete all other deployments for the same model first,
or designate a different deployed model as the default as described above. This is to ensure that querying by model name
will always route to an unambiguous default deployment as long as deployments for the model exist.
### Querying the model
To test the model using the completions API, run:
```bash
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "",
"prompt": "Say this is a test"
}' \
--url https://api.fireworks.ai/inference/v1/completions
```
See [Querying text models](/guides/querying-text-models) for a more comprehensive guide.
## Publishing a deployed model
By default, models can only be queried by the account that owns them. To make a deployed model public so anyone with a
valid Fireworks API key can query it, update the deployed model with the `--public` flag.
```bash
firectl update deployed-model --public
```
To unpublish it, run:
```bash
firectl update deployed-model --public=false
```
You must use the **deployed model ID**, not the **model ID**. To get a list of
deployed models, run `firectl list deployed-models`.
# null
Source: https://docs.fireworks.ai/models/quantization
By default, models on dedicated deployments are served using 16-bit floating-point (FP16) precision. Quantization reduces the number of bits
used to serve the model, improving performance and reducing cost to serve. However, this can change model numerics
which may introduce small changes to the output.
Take a look at our [blog post](https://fireworks.ai/blog/fireworks-quantization) for a detailed treatment of how
quantization affects model quality.
## Quantizing a model
A model can be quantized to 8-bit floating-point (FP8) precision using `firectl prepare-model`:
```bash
firectl prepare-model
```
This is an additive process that enables creating deployments with additional precisions. The original FP16 checkpoint is still available for use.
You can check on the status of preparation by running
```bash
firectl get model
```
and checking if the state is still in `PREPARING`. A successfully prepared model will have the desired precision added
to the `Precisions` list.
## Creating an FP8 deployment
By default, creating a dedicated deployment will use the FP16 checkpoint. To see what precisions are available for a
model, run:
```bash
firectl get model
```
The `Precisions` field will indicate what precisions the model has been prepared for.
To use the quantized FP8 checkpoint, pass the `--precision` flag:
```bash
firectl create deployment --accelerator-type NVIDIA_H100_80GB --precision FP8
```
Quantized deployments can only be served using H100 GPUs.
# Uploading a custom model
Source: https://docs.fireworks.ai/models/uploading-custom-models
In addition to the predefined set of models already available on Fireworks and models you fine-tune on the Fireworks platform, you can also upload your own custom models. Both custom base models and LoRA addons are supported.
## Custom LoRA addons
### Requirements
Your custom LoRA addon must contain the following files:
* `adapter_config.json` - The Hugging Face adapter configuration file.
* `adapter_model.bin` or `adapter_model.safetensors` - The saved addon file.
The `adapter_config.json` must contain the following fields:
* `r` - The number of LoRA ranks. Must be an integer between 4 and 64, inclusive.
* `target_modules` - A list of target modules. Currently the following target modules are supported:
* `q_proj`
* `k_proj`
* `v_proj`
* `o_proj`
* `up_proj` or `w1`
* `down_proj` or `w2`
* `gate_proj` or `w3`
* `block_sparse_moe.gate`
Additional fields may be specified but are ignored.
### Enabling chat completions
To enable the chat completions API for your LoRA addon, add a `fireworks.json` file to your model directory containing:
```json
{
"conversation_config": {
"style": "jinja",
"args": {
"template": ""
}
}
}
```
### Uploading the model
To upload a LoRA addon, run the following command. The MODEL\_ID is an arbitrary [resource ID](https://docs.fireworks.ai/getting-started/concepts#resource-names-and-ids) to refer to the model within Fireworks.
Only some base models support LoRA addons.
```bash
firectl create model /path/to/files/ --base-model "accounts/fireworks/models/"
```
## Custom base models
### Requirements
Fireworks currently supports the following model architectures:
* [Gemma](https://huggingface.co/docs/transformers/en/model_doc/gemma)
* [Phi, Phi-3](https://huggingface.co/docs/transformers/en/model_doc/phi)
* [Llama 1,2,3,3.1](https://huggingface.co/docs/transformers/en/model_doc/llama2)
* [LLaVa](https://huggingface.co/docs/transformers/main/en/model_doc/llava)
* [Mistral](https://huggingface.co/docs/transformers/en/model_doc/mistral) & [Mixtral](https://huggingface.co/docs/transformers/en/model_doc/mixtral)
* [Qwen2](https://huggingface.co/docs/transformers/en/model_doc/qwen2)
* [StableLM](https://huggingface.co/docs/transformers/main/en/model_doc/stablelm)
* [Starcoder(GPTBigCode)](https://huggingface.co/docs/transformers/en/model_doc/gpt_bigcode) & [Starcoder2](https://huggingface.co/docs/transformers/main/en/model_doc/starcoder2)
* [DeepSeek V1 & V2](https://huggingface.co/deepseek-ai)
* [GPT NeoX](https://huggingface.co/docs/transformers/en/model_doc/gpt_neox)
The model files you will need to provide depend on the model architecture. In general, you will need the following files:
* Model configuration: `config.json`.
Fireworks does not support the `quantization_config` option in `config.json`.
* Model weights, in one of the following formats:
* `*.safetensors`
* `*.bin`
* Weights index: `*.index.json`
* Tokenizer file(s), e.g.
* `tokenizer.model`
* `tokenizer.json`
* `tokenizer_config.json`
If the requisite files are not present, model deployment may fail.
### Enabling chat completions
To enable the chat completions API for your custom base model, ensure your `tokenizer_config.json` contains a `chat_template` field. See the Hugging Face guide on [Templates for Chat Models](https://huggingface.co/docs/transformers/main/en/chat_templating) for details.
### Uploading the model
To upload a custom base model, run the following command.
```bash
firectl create model /path/to/files/
```
### Uploading models from S3 buckets
For larger models, you can upload directly from an Amazon S3 bucket, which provides a faster transfer process than uploading from local files.
To upload a model directly from an S3 bucket, run the following command.
```bash
firectl create model s3://// --aws-access-key-id --aws-secret-access-key
```
See the [AWS documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id-credentials-access-keys-update.html) for how to generate an access key ID and secret access key pair.
Ensure the IAM user has read access to the S3 bucket containing the model.
## Deploying
A model cannot be used for inference until it is deployed. See the [Deploying models](/models/deploying) guide to deploy the model.
## Publishing
By default, all models you create are only visible to and deployable by users within your account. To publish a model so anyone with a Fireworks account can deploy it, you can create it with the `--public` flag. This will allow it to show up in public model lists.
```bash create
firectl create model /path/to/files --public
```
```bash update
firectl update model --public
```
To unpublish the model, just run
```bash update
firectl update model --public=false
```
# Using grammar mode
Source: https://docs.fireworks.ai/structured-responses/structured-output-grammar-based
## What is grammar-based structured output?
Grammar mode is the ability to specify a forced output schema for any Fireworks model via an extended BNF formal grammar ([GBNF format](https://github.com/ggerganov/llama.cpp/tree/master/grammars)). This method is popularly used to constrain model outputs in [llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).
What is a formal grammar? It's a way to define rules that declare strings to be valid or invalid. See the "Syntax" section below for more info. Similar to our [JSON mode](/structured-responses/structured-response-formatting), you provide the `response_format` field in the request, e.g. `{"type": "grammar", "grammar": }`.
For best results, we still recommend that you do some prompt engineering and describe the desired output to the model to guide decision-making.
## Why grammar-based structured output?
* Relying solely on system prompt engineering is finicky and time-consuming. It can be difficult to coerce the model to do certain things, for example
* Behave like a classifier, only output from a predefined list
* Output only Japanese, Chinese, a specified programming language, or otherwise prevent the model from generating a large set of tokens
* Sometimes JSON is not what you need (e.g. it may be finicky with string escaping) and you need some other structured output
* Small models may have difficulty following instructions
## End-to-end examples
This guide provides a step-by-step example of creating a structured output response with grammar using the Fireworks API. The example uses Python and the OpenAI library to define the schema for the output.
### Prerequisites
Before you begin, ensure you have the following:
* Python installed on your system.
* `openai` libraries installed. You can install them using pip:
```bash
pip install openai
```
Next, select the model you want to use. In this example, we use `llama-v3p1-405b-instruct`, but all Fireworks models support this feature.
### Step 1: Configure the Fireworks.ai client
You can use either the Fireworks or OpenAI SDK with this feature. Using OpenAI SDK with your API key and the base URL:
```python
import openai
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="Your_API_Key",
)
```
### Step 2: Define the output grammar
Let's say you have a classifier model that sorts patient requests into a few predefined classes. Then, you can ask the model to only respond within these classes.
```
root ::= diagnosis
diagnosis ::= "arthritis" | "dengue" | "urinary tract infection" | "impetigo" | "cervical spondylosis"
```
### Step 3: Specify your output grammar in your chat completions request
```python Python
from fireworks.client import Fireworks
client = Fireworks(
api_key="Your_API_Key",
)
diagnosis_grammar = """
root ::= diagnosis
diagnosis ::= "arthritis" | "dengue" | "urinary tract infection" | "impetigo" | "cervical spondylosis"
"""
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
response_format={"type": "grammar", "grammar": diagnosis_grammar},
messages=[
{
"role": "system",
"content": "Given the symptoms try to guess the possible diagnosis. Possible choices: arthritis, dengue, urinary tract infection, impetigo, cervical spondylosis. Answer with a single word",
},
{
"role": "user",
"content": "I have been having trouble with my muscles and joints. My neck is really tight and my muscles feel weak. I have swollen joints and it is hard to move around without becoming stiff. It is also really uncomfortable to walk.",
},
],
)
print(chat_completion.choices[0].message.content)
```
and for the response, we will only get one of the 5 classes we specified. In this case, the model output is
```
'arthritis'
```
Note that we have done some prompt engineering to instruct the model about possible diagnoses in free form. Alternatively, we may have used one of the fine-tuned models for the medical domain.
## Advanced examples
### Japanese and Chinese
Given the below configuration
```python
from fireworks.client import Fireworks
client = Fireworks(
api_key="Your_API_Key",
)
cjk_grammar = """
root ::= jp-char+ ([ \t\n] jp-char+)*
jp-char ::= hiragana | katakana | punctuation | cjk
hiragana ::= [ぁ-ゟ]
katakana ::= [ァ-ヿ]
punctuation ::= [、-〾]
cjk ::= [一-鿿]
"""
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
response_format={"type": "grammar", "grammar": cjk_grammar},
messages=[
{
"role": "user",
"content": "You are a Japanese tour guide who speaks fluent Japanese. Please tell me what are some good places for me to visit in Kyoto",
},
],
)
print(chat_completion.choices[0].message.content)
```
The model will reply in Japanese
```
こんにちは、私は日本語を母国語として話せるキョトの私が案内する旅行案内者です。京都を旅行にお付き合いいただきありがとうごさいます。京都にはたくさんの楽しいところがありますが、私はあなたの需要に基いて、いくつかのおすすめていきます。\n最初に、古都の一面を体感できる場所として、清水寺をおすすめします。清水寺は世界的に有名な寺院で、美しい金面山だまのホームページと、きれいな庭で知られています。\n次に、京
```
And since the grammar is actually more lenient than Japanese and covers Chinese as well, we can also just prompt the model to be a fluent Chinese speaker.
```
You are a Japanese tour guide who speaks fluent Chinese. Please tell me what are some good places for me to visit in Shanghai?
```
We can try something even more difficult: asking a Japanese tour guide to speak Chinese. With help from the grammar, the model replies in Chinese, with the same grammar specified:
```
当您访问上海时、我建议您参观以下几个地方。\\n上海外国语大学。这是一所著名的大学、校园美景优秀、还有各种餐馆可供您选择。\\n上海中山公园。这是一座位于城市中心的公园、风景优美、有许多古老的建设和展览馆。\\n南京路。这是一条繁华的商业大街、有许多品牌商店和餐馆可供您选择。\\n上海南京东路步行街。这是全球最大的步行街之一、有许多商店和餐馆可供您选择。\\n上海世博中心。这是一座展览馆复合体、经常举办各种展览和会议
```
Without the grammar, the model replies in a mixture of Chinese and English
```
你好!uming as a Japanese tour guide who speaks fluent Chinese, I would be happy to recommend some places for you to visit in Shanghai!\\n\\n1. The Bund: This is a famous waterfront area in Shanghai that offers stunning views of the city's skyline, including the iconic Oriental Pearl Tower. You can take a leisurely stroll along the promenade and enjoy the beautiful scenery.\\n\\n2. Yuyuan Garden: This is a beautiful classical Chinese garden that dates back to the Ming Dynasty. It features pavilions, halls, rockeries, ponds, and cloisters, and is a great place to experience traditional Chinese architecture and garden design.\\n\\n3. Shanghai Tower: This is the tallest building in China and the second-tallest building in the world. You can take the elevator up to the observation deck on the 128th floor for breatht
```
### C code generation
Programming languages like C can also be expressed as a grammar.
```python
from fireworks.client import Fireworks
client = Fireworks(
api_key="your_API_Key",
)
c_grammar = """
root ::= (declaration)*
declaration ::= dataType identifier "(" parameter? ")" "{" statement* "}"
dataType ::= "int" ws | "float" ws | "char" ws
identifier ::= [a-zA-Z_] [a-zA-Z_0-9]*
parameter ::= dataType identifier
statement ::=
( dataType identifier ws "=" ws expression ";" ) |
( identifier ws "=" ws expression ";" ) |
( identifier ws "(" argList? ")" ";" ) |
( "return" ws expression ";" ) |
( "while" "(" condition ")" "{" statement* "}" ) |
( "for" "(" forInit ";" ws condition ";" ws forUpdate ")" "{" statement* "}" ) |
( "if" "(" condition ")" "{" statement* "}" ("else" "{" statement* "}")? ) |
( singleLineComment ) |
( multiLineComment )
forInit ::= dataType identifier ws "=" ws expression | identifier ws "=" ws expression
forUpdate ::= identifier ws "=" ws expression
condition ::= expression relationOperator expression
relationOperator ::= ("<=" | "<" | "==" | "!=" | ">=" | ">")
expression ::= term (("+" | "-") term)*
term ::= factor (("*" | "/") factor)*
factor ::= identifier | number | unaryTerm | funcCall | parenExpression
unaryTerm ::= "-" factor
funcCall ::= identifier "(" argList? ")"
parenExpression ::= "(" ws expression ws ")"
argList ::= expression ("," ws expression)*
number ::= [0-9]+
singleLineComment ::= "//" [^\n]* "\n"
multiLineComment ::= "/*" ( [^*] | ("*" [^/]) )* "*/"
ws ::= ([ \t\n]+)"""
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
response_format={"type": "grammar", "grammar": c_grammar},
messages=[
{
"role": "user",
"content": "You are an expert in writing C code. Can you write a program that prints hello world?",
},
],
)
print(chat_completion.choices[0].message.content)
```
In this case, we get a cute little C program as the output:
```
char\nc(int a){return 2*a;}
```
## Syntax
### Background
[Backus-Naur Form (BNF)](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) is a notation for describing the syntax of formal languages like programming languages, file formats, and protocols. Fireworks API uses an extension of BNF with a few modern regex-like features, inspired by [Llama.cpp's implementation](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).
### Basics
In BNF, we define *production rules* that specify how a *non-terminal* (rule name) can be replaced with sequences of *terminals* (characters, specifically Unicode [code points](https://en.wikipedia.org/wiki/Code_point)) and other non-terminals. The basic format of a production rule is `nonterminal ::= sequence...`.
Consider an example of a small chess notation grammar:
```
# `root` specifies the pattern for the overall output
root ::= (
# it must start with the characters "1. " followed by a sequence
# of characters that match the `move` rule, followed by a space, followed
# by another move, and then a newline
"1. " move " " move "\n"
# it's followed by one or more subsequent moves, numbered with one or two digits
([1-9] [0-9]? ". " move " " move "\n")+
)
# `move` is an abstract representation, which can be a pawn, nonpawn, or castle.
# The `[+#]?` denotes the possibility of checking or mate signs after moves
move ::= (pawn | nonpawn | castle) [+#]?
pawn ::= ...
nonpawn ::= ...
castle ::= ...
```
### Non-terminals and terminals
Non-terminal symbols (rule names) stand for a pattern of terminals and other non-terminals. They are required to be a dashed lowercase word, like `move`, `castle`, or `check-mate`.
Terminals are actual characters ([code points](https://en.wikipedia.org/wiki/Code_point)). They can be specified as a sequence like `"1"` or `"O-O"` or as ranges like `[1-9]` or `[NBKQR]`.
### Characters and character ranges
Terminals support the full range of Unicode. Unicode characters can be specified directly in the grammar, for example `hiragana ::= [ぁ-ゟ]`, or with escapes: 8-bit (`\xXX`), 16-bit (`\uXXXX`) or 32-bit (`\UXXXXXXXX`).
Character ranges can be negated with `^`:
```
single-line ::= [^\n]+ "\n"
```
Dot `.` symbol matches any character:
```
any-three-symbol-sequence ::= ...
```
### Sequences and alternatives
The order of symbols in a sequence matters. For example, in `"1. " move " " move "\n"`, the `"1. "` must come before the first `move`, etc.
Alternatives, denoted by `|`, give different sequences that are acceptable. For example, in `move ::= pawn | nonpawn | castle`, `move` can be a `pawn` move, a `nonpawn` move, or a `castle`.
Parentheses `()` can be used to group sequences, which allows for embedding alternatives in a larger rule or applying repetition and optional symbols (below) to a sequence.
### Repetition and optional symbols
* `*` after a symbol or sequence means that it can be repeated zero or more times.
* `+` denotes that the symbol or sequence should appear one or more times.
* `?` makes the preceding symbol or sequence optional.
### Comments and newlines
Comments can be specified with `#`:
```
# defines whitespace (one or more spaces, tabs, or newlines)
ws ::= [ \t\n]+
```
Newlines are allowed between rules and between symbols or sequences nested inside parentheses. Additionally, a newline after an alternate marker `|` will continue the current rule, even outside of parentheses.
### The root rule
In a full grammar, the `root` rule always defines the starting point of the grammar. In other words, it specifies what the entire output must match.
```
# a grammar for lists
root ::= ("- " item)+
item ::= [^\n]+ "\n"
```
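As a usage sketch, the list grammar above can be passed as `response_format` in the same way as the earlier examples (the model name and API key are placeholders):

```python
from fireworks.client import Fireworks

client = Fireworks(api_key="Your_API_Key")

# The list grammar from above; a raw string keeps "\n" literal inside the grammar.
list_grammar = r"""
root ::= ("- " item)+
item ::= [^\n]+ "\n"
"""

chat_completion = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-405b-instruct",
    response_format={"type": "grammar", "grammar": list_grammar},
    messages=[
        {"role": "user", "content": "List three famous bridges."},
    ],
)
print(chat_completion.choices[0].message.content)
```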
# Using JSON mode
Source: https://docs.fireworks.ai/structured-responses/structured-response-formatting
## What is JSON mode?
JSON mode allows you to force the output of any Fireworks language model to conform to a provided [JSON schema](https://json-schema.org/).
## Why JSON responses?
1. **Clarity and Precision:** Responding in JSON ensures that the output from the LLM is clear, precise, and easy to parse. This is particularly beneficial in scenarios where the response needs to be further processed or analyzed by other systems.
2. **Ease of Integration:** JSON, being a widely-used format, allows for easy integration with various platforms and applications. This interoperability is essential for developers looking to incorporate AI capabilities into their existing systems without extensive modifications.
## End-to-end example
This guide provides a step-by-step example of how to create a structured output response using the Fireworks API. The example uses Python and the `pydantic` library to define the schema for the output. You can find more information about Pydantic [here](https://docs.pydantic.dev/latest/).
### Prerequisites
Before you begin, ensure you have the following:
* Python installed on your system.
* `openai` and `pydantic` libraries installed. You can install them using pip:
```bash
pip install openai pydantic
```
Next, select the model you want to use. In this example, we use `mixtral-8x7b-instruct`, but all Fireworks models support this feature.
### Step 1: Import libraries
Start by importing the required libraries:
```python
import openai
from pydantic import BaseModel, Field
```
### Step 2: Configure the Fireworks client
You can use either Fireworks or OpenAI SDK with this feature. Using OpenAI SDK with your API key and the base URL:
```python
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="Your_API_Key",
)
```
### Step 3: Define the output schema
Define a Pydantic model to specify the schema of the output. For example, this model defines a simple schema with a single field `winner`.
```python
class Result(BaseModel):
winner: str
```
### Step 4: Specify your output schema in your chat completions request
Make a request to the Fireworks API to get a JSON response. In your request, specify the output schema you used in step 3. For example, to ask who won the US presidential election in 2012:
```python
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/mixtral-8x7b-instruct",
response_format={"type": "json_object", "schema": Result.model_json_schema()},
messages=[
{
"role": "user",
"content": "Who won the US presidential election in 2012? Reply just in one JSON.",
},
],
)
```
### Step 5: Display the result
Finally, print the result:
```python
print(repr(chat_completion.choices[0].message.content))
```
This will display the response in the format defined by the `Result` schema. We get a nice JSON response that can be parsed and integrated with the rest of your application.
```
'{\n "winner": "Barack Obama"\n}'
```
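Because the output conforms to the schema, it can be validated straight back into the Pydantic model defined in step 3. A small sketch using Pydantic v2's `model_validate_json`, continuing the example above:

```python
# Parse the JSON string back into the schema class defined earlier.
result = Result.model_validate_json(chat_completion.choices[0].message.content)
print(result.winner)  # Barack Obama
```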
We use a grammar-based state machine to ensure that the LLM always generates all the fields in the schema. If the provided output schema is not a valid JSON schema, the request will fail.
## Structured response modes
Fireworks supports the following variants:
* **Arbitrary JSON**. Similar to [OpenAI](https://platform.openai.com/docs/guides/text-generation/json-mode), you can force the model to produce any valid JSON by providing `{"type": "json_object"}` as `response_format` in the request. This forces the model to output JSON but does not constrain it to a specific schema (see the sketch after this list).
* **JSON with the given schema**. To specify a given JSON schema, you can provide the schema according to [JSON schema spec](https://json-schema.org/specification) to be imposed on the model generation. See supported constructs in the next section.
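A minimal sketch of the arbitrary JSON variant, reusing the client configured in step 2 (no schema is enforced, only valid JSON):

```python
chat_completion = client.chat.completions.create(
    model="accounts/fireworks/models/mixtral-8x7b-instruct",
    # No "schema" key: the model is only constrained to emit valid JSON.
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "user",
            "content": "Who won the US presidential election in 2012? Reply just in one JSON.",
        },
    ],
)
print(chat_completion.choices[0].message.content)
```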
When using JSON mode, you MUST instruct the model to produce JSON and describe the desired schema via a system or user message. Without this, the model may generate an unending stream of whitespace until the generation reaches the token limit, resulting in a long-running and seemingly "stuck" request.
To get the best outcome, you need to include the schema in **both the prompt and the `response_format` field.**
Technically, it means that when using "JSON with the given schema" mode, the model doesn't automatically "see" the schema passed in the `response_format` field. Adherence to the schema is forced upon the model during sampling. So for best results, you need to include the desired schema in the prompt in addition to specifying it as `response_format`. You may need to experiment with the best way to describe the schema in the prompt depending on the model: besides JSON schema, describing it in plain English might work well too, e.g. "extract name and address of the person in JSON format".
**Note:** the message content may be partially cut off if `finish_reason="length"`, which indicates the generation exceeded `max_tokens` or the conversation exceeded the max context length. In this case, the return value might not be valid JSON.
Structured response modes work for both Completions and Chat Completions APIs.
If you use [function calling](/docs/function-calling), JSON mode is enabled automatically and the function schema is added to the prompt, so the caveats above do not apply.
### JSON schema constructs
Fireworks supports a subset of [JSON schema specification](https://json-schema.org/specification).
Supported:
* Nested schemas composition, including `anyOf` and `$ref`
* `type`: `string`, `number`, `integer`, `boolean`, `object`, `array`, `null`
* `properties` and `required` for objects
* `items` for arrays
The Fireworks API doesn't error out on unsupported constructs; they just won't be enforced. Constraints that are not yet supported include:
* Sophisticated composition with `oneOf`
* Length/size constraints for objects and arrays
* Regular expressions via `pattern`
**Note**: JSON specification [allows for arbitrary field names](https://json-schema.org/understanding-json-schema/reference/object#additionalproperties) to appear in an object with the `properties` constraint unless `"additionalProperties": false` or `"unevaluatedProperties": false` is provided. It's a poor default for LLM constrained generation since any hallucination would be accepted. Thus Fireworks treats any schema with `properties` constraint as if it had `"unevaluatedProperties": false`.
An example of `response_format` field with the schema accepting an object with two fields - a required string and an optional integer:
```
{
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"foo": {"type": "string"},
"bar": {"type": "integer"}
},
"required": ["foo"]
}
}
```
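For reference, a roughly equivalent schema can be generated with Pydantic (a sketch; the optional field comes out as an `anyOf` with `null`, which is also supported):

```python
from typing import Optional

from pydantic import BaseModel

class FooBar(BaseModel):
    foo: str                   # required string
    bar: Optional[int] = None  # optional integer

response_format = {
    "type": "json_object",
    "schema": FooBar.model_json_schema(),
}
print(response_format)
```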
### Reasoning Model JSON Mode
In addition to standard JSON responses, Fireworks JSON mode now supports generating an output that includes the model’s internal reasoning. In this mode, the response contains a reasoning section wrapped in `<think>...</think>` tags, followed by the JSON object that adheres to your specified schema.
#### How It Works
When using Reasoning Model JSON Mode, the model first outputs its reasoning process enclosed in `<think>...</think>` tags. After the reasoning section, it outputs the JSON data. This allows you to capture both the rationale behind the model’s answer and the structured data for downstream processing.
#### Example Usage with Pydantic
Below is an example illustrating how to parse the response directly into a Pydantic model. In this example, the response contains both a reasoning part and a JSON part. The JSON part is then parsed into the `QAResult` Pydantic model using Pydantic’s `.parse_raw()` method.
```python
import json
import re
from pydantic import BaseModel
from openai import OpenAI
import os
# Initialize the Fireworks client
client = OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key=os.getenv("FIREWORKS_API_KEY"),
)
# Define the output schema using Pydantic
class QAResult(BaseModel):
question: str
answer: str
# Prepare the user input
user_input = "Who wrote 'Pride and Prejudice'?"
# Construct the messages payload
messages = [{"role": "user", "content": user_input}]
# Make the API call to the model
response = client.chat.completions.create(
model="accounts/fireworks/models/deepseek-r1",
messages=messages,
response_format={"type": "json_object", "schema": QAResult.model_json_schema()},
max_tokens=1000,
)
# Extract the content of the response
response_content = response.choices[0].message.content
print("show response content", response_content)
# Extract the reasoning part enclosed in <think>...</think> tags
reasoning_match = re.search(r"<think>(.*?)</think>", response_content, re.DOTALL)
reasoning = reasoning_match.group(1).strip() if reasoning_match else "No reasoning provided."
# Extract the JSON part that follows after the reasoning
json_match = re.search(r"\s*(\{.*\})", response_content, re.DOTALL)
json_str = json_match.group(1).strip() if json_match else "{}"
# Parse the JSON string directly into a Pydantic model
qa_result = QAResult.parse_raw(json_str)
# Output the reasoning and the parsed JSON data
print("Reasoning:")
print(reasoning)
print("\nQAResult (JSON Data):")
print(qa_result.json(indent=4))
```
#### Additional notebooks and examples
Explore how Reasoning JSON Mode is used in different contexts:
* Generate **structured PC specifications** while capturing the model’s thought process behind component choices.
* Structure **patient healthcare records** with AI-generated reasoning, ensuring interpretability and compliance.
#### Key Points
* **Dual Output:** The model outputs both a reasoning explanation and a structured JSON object.
* **Extraction:** Use regular expressions to split the output: one pattern captures the reasoning within `<think>...</think>` tags and another captures the JSON.
* **Direct Parsing:** Parse the JSON part into your Pydantic model with `QAResult.parse_raw()`, leveraging Pydantic’s validation and serialization capabilities.
This new mode is ideal for scenarios where understanding the model’s thought process is as important as obtaining the final answer. It is especially useful during debugging, auditing, or whenever transparency in the decision-making process is required.
## Similar features
Check out our [function calling model](/guides/function-calling) if you're interested in use cases like:
* **Multi-turn capabilities:** For example, the ability for the model to ask for clarifying information about parameters
* **Routing:** The ability for the model to route across multiple different options or models. Instead of just having one possible JSON Schema, you have many different JSON schemas to work across.
Check out [grammar mode](/structured-responses/structured-output-grammar-based) if you want structured output specified not through JSON, but rather through an arbitrary grammar (limit output to specific words, character limits, character types, etc).
# Authentication
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/authentication
Authentication for access to your account
### Signing in
Users using Google SSO can run:
```
firectl signin
```
If you are using [custom SSO](/accounts/sso), also specify the account ID:
```
firectl signin my-enterprise-account
```
### Authenticate with API Key
To authenticate without a web browser, append `--api-key` to any firectl command.
```
firectl --api-key API_KEY
```
To persist the API key for all subsequent commands, run:
```
firectl set-api-key API_KEY
```
# Create a Dataset
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/create-dataset
Create a Dataset on Fireworks AI platform
```
firectl create dataset [flags]
```
### Example
```
firectl create dataset my-dataset /path/to/dataset.jsonl
```
### Flags
```
--display-name string The display name of the dataset.
-h, --help help for dataset
--quiet If true, does not print the upload progress bar.
```
# Create a deployment
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/create-deployment
Create a Deployment on Fireworks AI platform
Creates a new deployment.
```
firectl create deployment [flags]
```
### Example
```
firectl create deployment falcon-7b
```
### Flags
```
--description string Description of the deployment.
--disable-speculative-decoding If true, speculative decoding is disabled.
--display-name string Human-readable name of the deployment. Must be fewer than 64 characters long.
--max-peft-batch-size int32 Max batching of concurrent peft requests of the server.
--max-replica-count int32 Maximum number of replicas for the deployment. If min-replica-count > 0 defaults to 0, otherwise defaults to 1.
--min-replica-count int32 Minimum number of replicas for the deployment. If min-replica-count < max-replica-count the deployment will automatically scale between the two replica counts based on load.
--model-id string The ID of a model that should be deployed when the deployment is created.
--scale-down-window duration The duration the autoscaler will wait before scaling down a deployment after observing decreased load. Default is 10m.
--scale-to-zero-window duration The duration after which there are no requests that the deployment will be scaled down to zero replicas, if min-replica-count is 0. Default 1h.
--scale-up-window duration The duration the autoscaler will wait before scaling up a deployment after observing increased load. Default is 30s.
--unused-auto-delete-duration duration The duration for which if no requests are received, the deployment will automatically be deleted. If 0, the auto-deletion is disabled. (default 168h0m0s)
--wait Wait until the deployment is ready.
--world-size int32 The number of GPUs the base model is served with.
-h, --help help for deployment
```
### Flags inherited from parent commands
```
--dry-run Print the request proto without running it.
-o, --output Output Set the output format to "text" or "json". (default text)
```
# Create a fine-tuning job
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/create-finetune-job
Create a fine-tuning job with a base model
Creates a fine-tuning job on Fireworks AI platform with the provided configuration yaml.
```
firectl create sftj [flags]
```
### Example
```
firectl create sftj \
--base-model llama-v3p1-8b-instruct \
--dataset cancerset \
--output-model my-tuned-model \
--job-id my-fine-tuning-job \
--learning-rate 0.0001 \
--epochs 2 \
--early-stop \
--evaluation-dataset my-eval-set
```
### Flags
```
--base-model string (required) The base model used for fine-tuning. e.g. mistralai/Mixtral-8x7B-Instruct-v0.1
--dataset string (required) The ID of the dataset for the fine tuning.
--display-name string (optional) The display name of the fine-tuning job.
--draft-base-model string (optional) The draft model hf base model field.
--epochs int (optional) The number of epochs to train for.
--evaluation-dataset string (optional) The evaluation dataset for the supervised fine-tuning job.
--job-id string (optional) The ID of the fine-tuning job.
--learning-rate float (optional) The learning rate used for training.
--lora-rank int32 (optional) The LoRA rank used for training.
--early-stop Enable early stopping for the supervised fine-tuning job.
--quiet If set, only errors will be printed.
-h, --help help for deployment
--wandb-api-key string (optional) A Weights & Biases API key associated with the entity.
--wandb-entity string (optional) The Weights & Biases entity where training progress should be reported.
--wandb-project string (optional) The Weights & Biases project where training progress should be reported.
--wandb-run-id string [WANDB_RUN_ID] WandB Run ID. Implies --wandb.
--wandb Enable WandB
```
# Create Model
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/create-model
Create a model on Fireworks AI platform
```
firectl create model [flags]
```
### Example
```
firectl create model my-model /path/to/checkpoint/
```
### Flags
```
--context-length int32 The maximum context length of the model.
--default-draft-model string The default speculative draft model to use when creating a deployment.
--default-draft-token-count int32 The default speculative draft token count when creating a deployment.
--description string The description of the model.
--display-name string The display name of the model.
--github-url string The GitHub URL of the model.
-h, --help help for model
--hugging-face-url string The Hugging Face URL of the model.
--public Whether the model is publicly accessible.
--quiet If true, does not print the upload progress bar.
--supports-image-input Whether the model supports image inputs.
--supports-tools Whether the model supports function calling.
```
### Flags inherited from parent commands
```
-o, --output Output Set the output format to "text" or "json". (default text)
```
# Delete Resources
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/delete-model
Deletes resource(s) in a Fireworks AI account
### Delete a model
```
firectl delete model [flags]
```
#### Example
```
firectl delete model my-model
```
### Delete a fine-tuning job
```
firectl delete fine-tuning-job [flags]
```
#### Example
```
firectl delete fine-tuning-job my-fine-tuning-job
```
### Delete a deployment
Deletes a model deployment.
```
firectl delete deployment [flags]
```
#### Example
```
firectl delete deployment my-deployment
```
### Delete a dataset
```
firectl delete dataset [flags]
```
#### Example
```
firectl delete dataset my-dataset
```
### Delete an API key
```
firectl delete api-key
```
If you are an admin, you can delete API keys of other users in your account using the same command.
### Flags
```
-h, --help help for deleting resources
```
# Download a model
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/download-model
Download a model from third-party locations
```
firectl download model [flags]
```
#### Example
```
firectl download model my-model /path/to/checkpoint/
```
### Flags
```
-h, --help help for download
```
# Get Resources
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/get-model
Retrieves model information from Fireworks AI platform
```
firectl get [flags]
```
#### Example
```
firectl get model [flags]
```
### Retrieve user information
Prints information about a user.
```
firectl get user [flags]
```
#### Example
```
firectl get user john-08bb29
```
### Retrieve fine-tuning job information
Prints information about a fine-tuning job.
```
firectl get fine-tuning-job [flags]
```
#### Example
```
firectl get fine-tuning-job my-fine-tuning-job
```
### Get information about a deployment
```
firectl get deployment [flags]
```
#### Example
```
firectl get deployment my-deployment
```
### Get information about a dataset
```
firectl get dataset [flags]
```
#### Example
```
firectl get dataset instr-fine-tuning
```
### Flags
```
--dry-run Print the request proto without running it.
-o, --output Output Set the output format to "text" or "json". (default text)
```
### Flags inherited from parent commands
```
-o, --output Output Set the output format to "text" or "json". (default text)
```
# Import Model
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/import-model
Imports specified model from Fireworks AI Platform
Imports a model from the fireworks account.
```
firectl import model [flags]
```
#### Example
```
firectl import model llama-v3p1-8b-instruct
```
### Flags
```
-h, --help help for model
--model-id string The ID of the model to be created.
```
# List Resources
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/list-models
List various resources in a Fireworks AI account
```
firectl list [flags]
```
### List models
```
firectl list models
```
### List fine-tuning jobs
Prints all fine-tuning jobs in an account.
```
firectl list fine-tuning-jobs [flags]
```
### List deployments
Prints all deployments in the account.
```
firectl list deployments [flags]
```
### List deployed models
Prints all deployed models in an account.
```
firectl list deployed-models [flags]
```
### List datasets
Prints all datasets uploaded by a user in an account.
```
firectl list datasets [flags]
```
### List API keys
Prints all API keys for the current user.
```
firectl list api-key
```
If you are an admin for an account, you can view API keys for all users with the `--all-users` flag:
```
firectl list api-key --all-users
```
### Flags inherited from parent commands
```
--filter string Only resources satisfying the provided filter will be listed. See https://google.aip.dev/160 for the filter grammar.
-h, --help help for list
--no-paginate List all resources without pagination.
--order-by string A list of fields to order by. To specify a descending order for a field, append a " desc" suffix
--page-size int32 The maximum number of resources to list.
--page-token string The page to list.
```
# Load LoRA
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/load-lora
Load a LoRA model to a deployment.
```
firectl load-lora [flags]
```
#### Example
```
firectl load-lora my-model
firectl load-lora my-model --deployment="abcd1234"
```
### Flags
```
--deployment string The resource ID of the deployment where the LoRA model is to be loaded.
-h, --help help for load-lora
--public If true, the LoRA model will be publicly available for inference.
--wait Wait until the model is deployed.
```
# Undelete Resources
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/undelete
Undelete Resources on Fireworks AI platform
## Undelete a deployment
```
firectl undelete deployment [flags]
```
You can only undelete a deployment with `DELETED` or `FAILED` status.
#### Example
```
firectl undelete deployment my-deployment
```
### Flags
```
--wait Wait until the deployment is ready.
-h, --help help for dataset
```
# Unload LoRA
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/unload-lora
Unload a LoRA model from a deployment.
```
firectl unload-lora [flags]
```
#### Example
```
firectl unload-lora my-model
firectl unload-lora my-model --deployment="abcd1234"
```
### Flags
```
--deployment string The resource ID of the deployment where the model is to be undeployed.
-h, --help help for unload-lora
--wait Wait until the model is deployed.
```
# Update Resources
Source: https://docs.fireworks.ai/tools-sdks/firectl/commands/update
Updates Resources on Fireworks AI platform
```
firectl update model [flags]
```
#### Example
```
firectl update model my-model --display-name="New Name"
```
### Flags
```
--context-length int32 The maximum context length of the model.
--default-draft-model string The default speculative draft model to use when creating a deployment.
--default-draft-token-count int32 The default speculative draft token count when creating a deployment.
--description string The description of the model.
--display-name string The display name of the model.
--github-url string The GitHub URL of the model.
-h, --help help for model
--hugging-face-url string The Hugging Face URL of the model.
--public Whether the model is publicly accessible.
--supports-image-input Whether the model supports image inputs.
--supports-tools Whether the model supports function calling.
```
## Update a user
```
firectl update user [flags]
```
#### Example
```
firectl update user my-user --display-name="Alice Cullen"
```
### Flags
```
--display-name string The display name of the user.
-h, --help help for user
--role string                 The role of the user. Must be one of {user, admin}.
```
## Update a deployment
```
firectl update deployment [flags]
```
#### Example
```
firectl update deployment my-deployment
```
### Flags
```
--description string Description of the deployment. Must be fewer than 1000 characters long.
--display-name string Human-readable name of the deployment. Must be fewer than 64 characters long.
-h, --help help for deployment
--max-peft-batch-size int32 Max batching of concurrent PEFT requests to the server.
--max-replica-count int32 The maximum number of replicas.
--min-replica-count int32 The minimum number of replicas. (default 1)
--scale-down-window duration The duration the autoscaler will wait before scaling down a deployment after observing decreased load. Default is 10m.
--scale-to-zero-window duration The duration after which there are no requests that the deployment will be scaled down to zero replicas, if min-replica-count is 0. Default 1h.
--scale-up-window duration The duration the autoscaler will wait before scaling up a deployment after observing increased load. Default is 30s.
--unused-auto-delete-duration duration The duration for which if no requests are received, the deployment will automatically be deleted. If 0, the auto-deletion is disabled.
--world-size int32 The number of GPUs the base model is served with.
```
## Update a dataset
```
firectl update dataset [flags]
```
#### Example
```
firectl update dataset my-dataset
```
### Flags
```
--display-name string The display name of the model.
-h, --help help for dataset
```
# Getting Started
Source: https://docs.fireworks.ai/tools-sdks/firectl/firectl
Learn to create, deploy, and manage resources using Firectl
Firectl can be installed in several ways depending on your platform.
```bash homebrew
brew tap fw-ai/firectl
brew install firectl
# If you encounter a failed SHA256 check, try first running
brew update
```
```bash macOS (Apple Silicon)
curl https://storage.googleapis.com/fireworks-public/firectl/stable/darwin-arm64.gz -o firectl.gz
gzip -d firectl.gz && chmod a+x firectl
sudo mv firectl /usr/local/bin/firectl
sudo chown root: /usr/local/bin/firectl
```
```bash macOS (x86_64)
curl https://storage.googleapis.com/fireworks-public/firectl/stable/darwin-amd64.gz -o firectl.gz
gzip -d firectl.gz && chmod a+x firectl
sudo mv firectl /usr/local/bin/firectl
sudo chown root: /usr/local/bin/firectl
```
```bash Linux (x86_64)
wget -O firectl.gz https://storage.googleapis.com/fireworks-public/firectl/stable/linux-amd64.gz
gunzip firectl.gz
sudo install -o root -g root -m 0755 firectl /usr/local/bin/firectl
```
```Text Windows (64 bit)
wget -L https://storage.googleapis.com/fireworks-public/firectl/stable/firectl.exe
```
### Sign into Fireworks account
To sign into your Fireworks account:
```bash
firectl signin
```
If you have set up [Custom SSO](/accounts/sso) then also pass your account ID:
```bash
firectl signin
```
### Check you have signed in
To show which account you have signed into:
```bash
firectl whoami
```
### Check your installed version
```bash
firectl version
```
### Upgrade to the latest version
```bash
sudo firectl upgrade
```
# OpenAI compatibility
Source: https://docs.fireworks.ai/tools-sdks/openai-compatibility
You can use the [OpenAI Python client library](https://github.com/openai/openai-python) to interact with Fireworks. This makes migration of existing applications already using OpenAI particularly easy.
## Specify endpoint and API key
### Using the OpenAI client
You can use the OpenAI client by initializing it with your Fireworks configuration:
```python
from openai import OpenAI
# Initialize with Fireworks parameters
client = OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="",
)
```
You can also use environment variables with the client:
```python
import os
from openai import OpenAI
# Initialize using environment variables
client = OpenAI(
base_url=os.environ.get("OPENAI_API_BASE", "https://api.fireworks.ai/inference/v1"),
api_key=os.environ.get("OPENAI_API_KEY"), # Set to your Fireworks API key
)
```
### Using environment variables
```shell
export OPENAI_API_BASE="https://api.fireworks.ai/inference/v1"
export OPENAI_API_KEY=""
```
### Alternative approach
```python
import openai
# warning: it has a process-wide effect
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
```
## Usage
Use OpenAI's SDK as you normally would. Just ensure that the `model` parameter refers to one of the [Fireworks models](https://fireworks.ai/models).
### Completion
Simple completion API that doesn't modify the provided prompt in any way:
```python
from openai import OpenAI
client = OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="",
)
completion = client.completions.create(
model="accounts/fireworks/models/llama-v3p1-8b-instruct",
prompt="The quick brown fox",
)
print(completion.choices[0].text)
```
### Chat Completion
Works best for models fine-tuned for conversation (e.g. llama\*-chat variants):
```python
from openai import OpenAI
client = OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="",
)
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-8b-instruct",
messages=[
{
"role": "system",
"content": "You are a helpful assistant.",
},
{
"role": "user",
"content": "Say this is a test",
},
],
)
print(chat_completion.choices[0].message.content)
```
## API compatibility
### Differences
The following options have minor differences:
* `stop`: the returned string includes the stop word for Fireworks, while it's omitted for OpenAI (it can be easily truncated on the client side; see the sketch after this list)
* `max_tokens`: behaves differently if the model context length is exceeded. If the length of `prompt` or `messages` plus `max_tokens` is higher than the model's context window, `max_tokens` will be adjusted lower accordingly. OpenAI returns invalid request error in this situation. This behavior can be adjusted by `context_length_exceeded_behavior` parameter.
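If you need OpenAI-style behavior for `stop`, the returned stop sequence can be trimmed on the client side. A small sketch:

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Drop the stop sequence (and anything after it) to mimic OpenAI's behavior."""
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            return text[:index]
    return text

# Example: the completion ended on the stop word "END".
print(truncate_at_stop("The quick brown fox jumps. END", ["END"]))
```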
### Token usage for streaming responses
OpenAI API returns usage stats (number of tokens in prompt and completion) for non-streaming responses but doesn't for the streaming ones (see [forum post](https://community.openai.com/t/chat-completion-stream-api-token-usage/352964)).
Fireworks.ai returns usage stats in both cases. For streaming responses, the `usage` field is returned in the very last chunk of the response (i.e. the one having `finish_reason` set). For example:
```bash cURL
curl --request POST \
--url https://api.fireworks.ai/inference/v1/completions \
--header "accept: application/json" \
--header "authorization: Bearer $API_KEY" \
--header "content-type: application/json" \
--data '{"model": "accounts/fireworks/models/starcoder-16b-w8a16", "prompt": "def say_hello_world():", "max_tokens": 100, "stream": true}'
```
```
data: {..., "choices":[{"text":"\n print('Hello,","index":0,"finish_reason":null,"logprobs":null}],"usage":null}
data: {..., "choices":[{"text":" World!')\n\n\n","index":0,"finish_reason":null,"logprobs":null}],"usage":null}
data: {..., "choices":[{"text":"say_hello_","index":0,"finish_reason":null,"logprobs":null}],"usage":null}
data: {..., "choices":[{"text":"world()\n","index":0,"finish_reason":"stop","logprobs":null}],"usage":{"prompt_tokens":7,"total_tokens":24,"completion_tokens":17}}
data: [DONE]
```
Note that if you're using the OpenAI SDK, the `usage` field won't be listed in the SDK's structure definition, but it can be accessed directly. For example:
* In the Python SDK, you can access the attribute directly, e.g. `for chunk in openai.ChatCompletion.create(...): print(chunk["usage"])` (see the sketch below for the current v1 client).
* In TypeScript SDK, you need to cast away the typing, e.g. `for await (const chunk of await openai.chat.completions.create(...)) { console.log((chunk as any).usage); }`.
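With the current (v1) OpenAI Python SDK, a streaming sketch that reads the usage from the final chunk might look like this (`usage` is only populated on the last chunk returned by Fireworks; the model name and API key are placeholders):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="Your_API_Key",
)

stream = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    # Fireworks attaches usage stats to the final chunk of the stream.
    if getattr(chunk, "usage", None):
        print("\n", chunk.usage)
```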
### Not supported options
The following options are not yet supported:
* `presence_penalty`
* `frequency_penalty`
* `best_of`: you can use `n` instead
* `logit_bias`
* `functions`: you can use our [LangChain integration](https://python.langchain.com/docs/integrations/providers/fireworks) to achieve similar functionality client-side
Please reach out to us on [Discord](https://discord.gg/fireworks-ai) if you have a use case requiring one of these.
# API Reference
Source: https://docs.fireworks.ai/tools-sdks/python-client/api-reference
## BaseCompletion Objects
```python
class BaseCompletion()
```
Base class for handling completions. This class provides shared logic for creating completions,\
both synchronously and asynchronously, and both streaming and non-streaming.
**Attributes**:
* `endpoint` *str* - API endpoint for the completion request.
* `response_class` *Type* - Class used for parsing the non-streaming response.
* `stream_response_class` *Type* - Class used for parsing the streaming response.
#### create
```python
@classmethod
def create(cls,
model,
prompt_or_messages=None,
request_timeout=600,
stream=False,
**kwargs)
```
Create a completion or chat completion.
**Arguments**:
* `model` *str* - Model name to use for the completion.
* `prompt_or_messages` *Union\[str, List\[ChatMessage]]* - The prompt for Completion or a list of chat messages for ChatCompletion. If not specified, must specify either `prompt` or `messages` in kwargs.
* `request_timeout` *int, optional* - Request timeout in seconds. Defaults to 600.
* `stream` *bool, optional* - Whether to use streaming or not. Defaults to False.
* `**kwargs` - Additional keyword arguments.
**Returns**:
`Union[CompletionResponse, Generator[CompletionStreamResponse, None, None]]`:\
Depending on the `stream` argument, either returns a CompletionResponse\
or a generator yielding CompletionStreamResponse.
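A hedged usage sketch based on the signature above, using the `Completion` subclass documented below (assumes the `fireworks-ai` package is installed and your API key is set; the model name is an example):

```python
import fireworks.client

fireworks.client.api_key = "Your_API_Key"

# Non-streaming: returns a CompletionResponse.
response = fireworks.client.Completion.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    prompt="The quick brown fox",
    max_tokens=16,
)
print(response.choices[0].text)

# Streaming: returns a generator of CompletionStreamResponse chunks.
for chunk in fireworks.client.Completion.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    prompt="The quick brown fox",
    max_tokens=16,
    stream=True,
):
    print(chunk.choices[0].text, end="")
```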
#### acreate
```python
@classmethod
def acreate(cls, model, *args, request_timeout=600, stream=False, **kwargs)
```
Asynchronously create a completion.
**Arguments**:
* `model` *str* - Model name to use for the completion.
* `request_timeout` *int, optional* - Request timeout in seconds. Defaults to 600.
* `stream` *bool, optional* - Whether to use streaming or not. Defaults to False.
* `**kwargs` - Additional keyword arguments.
**Returns**:
`Union[CompletionResponse, AsyncGenerator[CompletionStreamResponse, None]]`:\
Depending on the `stream` argument, either returns a CompletionResponse or an async generator yielding CompletionStreamResponse.
# completion
## Completion Objects
```python
class Completion(BaseCompletion)
```
Class for handling text completions.
# chat\_completion
## ChatCompletion Objects
```python
class ChatCompletion(BaseCompletion)
```
Class for handling chat completions.
# api
## Choice Objects
```python
class Choice(BaseModel)
```
A completion choice.
**Attributes**:
* `index` *int* - The index of the completion choice.
* `text` *str* - The completion response.
* `logprobs` *float, optional* - The log probabilities of the most likely tokens.
* `finish_reason` *str* - The reason the model stopped generating tokens. This will be "stop" if the model hit a natural stop point or a provided stop sequence, or "length" if the maximum number of tokens specified in the request was reached.
## CompletionResponse Objects
```python
class CompletionResponse(BaseModel)
```
The response message from a /v1/completions call.
**Attributes**:
* `id` *str* - A unique identifier of the response.
* `object` *str* - The object type, which is always "text\_completion".
* `created` *int* - The Unix time in seconds when the response was generated.
* `choices` *List\[Choice]* - The list of generated completion choices.
## CompletionResponseStreamChoice Objects
```python
class CompletionResponseStreamChoice(BaseModel)
```
A streamed completion choice.
**Attributes**:
* `index` *int* - The index of the completion choice.
* `text` *str* - The completion response.
* `logprobs` *float, optional* - The log probabilities of the most likely tokens.
* `finish_reason` *str* - The reason the model stopped generating tokens. This will be "stop" if the model hit a natural stop point or a provided stop sequence, or "length" if the maximum number of tokens specified in the request was reached.
## CompletionStreamResponse Objects
```python
class CompletionStreamResponse(BaseModel)
```
The streamed response message from a /v1/completions call.
**Attributes**:
* `id` *str* - A unique identifier of the response.
* `object` *str* - The object type, which is always "text\_completion".
* `created` *int* - The Unix time in seconds when the response was generated.
* `model` *str* - The model used for the chat completion.
* `choices` *List\[CompletionResponseStreamChoice]* - The list of streamed completion choices.
## Model Objects
```python
class Model(BaseModel)
```
A model deployed to the Fireworks platform.
**Attributes**:
* `id` *str* - The model name.
* `object` *str* - The object type, which is always "model".
* `created` *int* - The Unix time in seconds when the model was generated.
## ListModelsResponse Objects
```python
class ListModelsResponse(BaseModel)
```
The response message from a /v1/models call.
**Attributes**:
* `object` *str* - The object type, which is always "list".
* `data` *List\[Model]* - The list of models.
## ChatMessage Objects
```python
class ChatMessage(BaseModel)
```
A chat completion message.
**Attributes**:
* `role` *str* - The role of the author of this message.
* `content` *str* - The contents of the message.
## ChatCompletionResponseChoice Objects
```python
class ChatCompletionResponseChoice(BaseModel)
```
A chat completion choice generated by a chat model.
**Attributes**:
* `index` *int* - The index of the chat completion choice.
* `message` *ChatMessage* - The chat completion message.
* `finish_reason` *Optional\[str]* - The reason the model stopped generating tokens. This will be "stop" if the model hit a natural stop point or a provided stop sequence, or "length" if the maximum number of tokens specified in the request was reached.
## UsageInfo Objects
```python
class UsageInfo(BaseModel)
```
Usage statistics.
**Attributes**:
* `prompt_tokens` *int* - The number of tokens in the prompt.
* `total_tokens` *int* - The total number of tokens used in the request (prompt + completion).
* `completion_tokens` *Optional\[int]* - The number of tokens in the generated completion.
## ChatCompletionResponse Objects
```python
class ChatCompletionResponse(BaseModel)
```
The response message from a /v1/chat/completions call.
**Attributes**:
* `id` *str* - A unique identifier of the response.
* `object` *str* - The object type, which is always "chat.completion".
* `created` *int* - The Unix time in seconds when the response was generated.
* `model` *str* - The model used for the chat completion.
* `choices` *List\[ChatCompletionResponseChoice]* - The list of chat completion choices.
* `usage` *UsageInfo* - Usage statistics for the chat completion.
## DeltaMessage Objects
```python
class DeltaMessage(BaseModel)
```
A message delta.
**Attributes**:
* `role` *str* - The role of the author of this message.
* `content` *str* - The contents of the chunk message.
## ChatCompletionResponseStreamChoice Objects
```python
class ChatCompletionResponseStreamChoice(BaseModel)
```
A streamed chat completion choice.
**Attributes**:
* `index` *int* - The index of the chat completion choice.
* `delta` *DeltaMessage* - The message delta.
* `finish_reason` *str* - The reason the model stopped generating tokens. This will be "stop" if the model hit a natural stop point or a provided stop sequence, or "length" if the maximum number of tokens specified in the request was reached.
## ChatCompletionStreamResponse Objects
```python
class ChatCompletionStreamResponse(BaseModel)
```
The streamed response message from a /v1/chat/completions call.
**Attributes**:
* `id` *str* - A unique identifier of the response.
* `object` *str* - The object type, which is always "chat.completion".
* `created` *int* - The Unix time in seconds when the response was generated.
* `model` *str* - The model used for the chat completion.
* `choices` *List\[ChatCompletionResponseStreamChoice]* - The list of streamed chat completion choices.
# model
## Model Objects
```python
class Model()
```
#### list
```python
@classmethod
def list(cls, request_timeout=60)
```
Returns a list of available models.
**Arguments**:
* `request_timeout` *int, optional* - The request timeout in seconds. Default is 60.
**Returns**:
* `ListModelsResponse` - A list of available models.
# log
#### set\_console\_log\_level
```python
def set_console_log_level(level: str) -> None
```
Controls console logging.
**Arguments**:
* `level` - the minimum level that prints out to console.\
Supported values: \[CRITICAL, FATAL, ERROR, WARN,\
WARNING, INFO, DEBUG]
# error
## PermissionError Objects
```python
class PermissionError(FireworksError)
```
A permission denied error.
## InvalidRequestError Objects
```python
class InvalidRequestError(FireworksError)
```
An invalid request error.
## AuthenticationError Objects
```python
class AuthenticationError(FireworksError)
```
An authentication error.
## RateLimitError Objects
```python
class RateLimitError(FireworksError)
```
A rate limit error.
## InternalServerError Objects
```python
class InternalServerError(FireworksError)
```
An internal server error.
## ServiceUnavailableError Objects
```python
class ServiceUnavailableError(FireworksError)
```
A service unavailable error.
# Getting Started
Source: https://docs.fireworks.ai/tools-sdks/python-client/installation
You can install the client library with pip:
```bash pip
pip install --upgrade fireworks-ai
```
### Authentication
You can authenticate with Fireworks by setting the `fireworks.client.api_key` variable:
```python
fireworks.client.api_key = ""
```
Or by setting the `FIREWORKS_API_KEY` environment variable:
```
export FIREWORKS_API_KEY=
```
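Putting the two steps together, a minimal first request with the client might look like the sketch below (the model name is an example):

```python
import fireworks.client

fireworks.client.api_key = "Your_API_Key"

completion = fireworks.client.ChatCompletion.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
)
print(completion.choices[0].message.content)
```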
# Inference errors
Source: https://docs.fireworks.ai/troubleshooting/status_error_codes/inference_error_code
This page lists common error codes encountered during inference requests using the Fireworks API, their meanings, and potential resolutions.
## Error codes
Below is a table of common status codes and their associated messages for inference-related API requests.
| **Error Code** | **Error Name** | **Possible Issue(s)** | **How to Resolve** |
| -------------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `400` | `Bad Request` | Invalid input or malformed request. | Review the request parameters and ensure they match the expected format. |
| `401` | `Unauthorized` | Invalid API key or insufficient permissions. | Verify your API key and ensure it has the correct permissions. |
| `402` | `Payment Required` | User's account is not on a paid plan or has exceeded usage limits. | Check your billing status and ensure your payment method is up to date. Upgrade your plan if necessary. |
| `403` | `Forbidden` | The model name may be incorrect, or the model does not exist. This error is also returned to avoid leaking information about model availability. | Verify the model name on the Fireworks site and ensure it exists. Double-check the spelling of the model name in your request. |
| `404` | `Not Found` | The API endpoint is incorrect, or the resource path is invalid (e.g., a user tried accessing `/v1/foobar` instead of a valid endpoint). | Verify the URL path in your request and ensure you are using the correct API endpoint as per the documentation. |
| `405` | `Method Not Allowed` | Using an unsupported HTTP method (e.g., using GET instead of POST). | Check the API documentation for the correct HTTP method to use for the request. |
| `408` | `Request Timeout` | The request took too long to complete, possibly due to server overload or network issues. | Retry the request after a brief wait. Consider increasing the timeout value if applicable. |
| `412` | `Precondition Failed` | This error occurs when attempting to invoke a LoRA model that failed to load. The final validation of the model happens during inference, not at upload time. | Check the body of the request for a detailed error message. Ensure the LoRA model was uploaded correctly and is compatible. Contact support if the issue persists. |
| `413` | `Payload Too Large` | Input data exceeds the allowed size limit. | Reduce the size of the input payload (e.g., by trimming large text or image data). |
| `429` | `Over Quota` | The user has reached the API rate limit. | Wait for the quota to reset or upgrade your plan for a higher rate limit. |
| `500` | `Internal Server Error` | This indicates a server-side code bug and is unlikely to resolve on its own. | Contact Fireworks support immediately, as this error typically requires intervention from the engineering team. |
| `502` | `Bad Gateway` | The server received an invalid response from an upstream server. | Wait and retry the request. If the error persists, it may indicate a server outage. |
| `503` | `Service Unavailable` | The service is down for maintenance or experiencing issues. | Retry the request after some time. Check for any maintenance announcements. |
| `504` | `Gateway Timeout` | The server did not receive a response in time from an upstream server. | Wait briefly and retry the request. Consider using a shorter input prompt if applicable. |
| `520` | `Unknown Error` | An unexpected error occurred with no clear explanation. | Retry the request. If the issue persists, contact support for further assistance. |
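For transient errors such as `408`, `429`, `502`, `503`, `504`, and `520`, the "wait and retry" guidance above can be automated with exponential backoff. A sketch using the OpenAI-compatible client (status codes, timings, and the model name are illustrative, not official recommendations):

```python
import time

from openai import OpenAI, APIConnectionError, APIStatusError

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="Your_API_Key",
)

RETRYABLE_STATUS = {408, 429, 502, 503, 504, 520}

def create_with_retries(messages, max_attempts=5):
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(
                model="accounts/fireworks/models/llama-v3p1-8b-instruct",
                messages=messages,
            )
        except APIStatusError as e:
            # Give up on non-retryable errors (e.g. 400/401/403) or when out of attempts.
            if e.status_code not in RETRYABLE_STATUS or attempt == max_attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff
        except APIConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2

result = create_with_retries([{"role": "user", "content": "Say this is a test"}])
print(result.choices[0].message.content)
```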
## Troubleshooting tips
If you encounter an error not listed here, try the following:
* Review the API documentation for the correct usage of endpoints and parameters.
* Check the [Fireworks status page](https://status.fireworks.ai) for any ongoing service disruptions.
* Contact support at [support@fireworks.ai](mailto:support@fireworks.ai) for further assistance.
Including the full error message and request details when you reach out will help provide additional insight into any issues encountered.
## Need more help?
If you continue to experience issues, please reach out on our [Discord channel](https://discord.gg/fireworks-ai).