# Custom SSO
Set up custom Single Sign-On (SSO) authentication for Fireworks AI
Fireworks uses single sign-on (SSO) as the primary mechanism to authenticate with the platform.
By default, Fireworks supports Google SSO.
If you have an enterprise account, Fireworks supports bringing your own identity provider using:
* OpenID Connect (OIDC) provider
* SAML 2.0 provider
Coordinate with your Fireworks AI representative to enable the integration.
## OpenID Connect (OIDC) provider
Create an OIDC client application in your identity provider, e.g. Okta.
Ensure the client is configured for the "authorization code" grant as a "web" application type (i.e. with a `client_secret`).
Set the client's "allowed redirect URL" to the URL provided by Fireworks. It looks like:
```
https://fireworks-.auth.us-west-2.amazoncognito.com/oauth2/idpresponse
```
Note down the `issuer`, `client_id`, and `client_secret` for the newly created client. You will need to provide these to your Fireworks AI representative to complete your account setup.
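For illustration only, the values you collect and hand off might look like the following (your identity provider will show the actual values; these are placeholders):
```
issuer: https://your-company.okta.com
client_id: 0oabc123exampleclientid
client_secret: <client secret from your identity provider>
```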
## SAML 2.0 provider
Create a SAML 2.0 application in your identity provider, e.g. [Okta](https://help.okta.com/en-us/Content/Topics/Apps/Apps_App_Integration_Wizard_SAML.htm).
Set the SSO URL to the URL provided by Fireworks. It looks like:
```
https://fireworks-.auth.us-west-2.amazoncognito.com/saml2/idpresponse
```
Configure the Audience URI (SP Entity ID) as provided by Fireworks. It looks like:
```
urn:amazon:cognito:sp:
```
Create an Attribute Statement with the name:
```
http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress
```
and the value `user.email`.
Leave the rest of the settings as defaults.
Note down the "metadata url" for your newly created application. You will need to provide this to your Fireworks AI representative to complete your account set up.
## Troubleshooting
### Invalid samlResponse or relayState from identity provider
This error occurs if you are trying to use identity provider (IdP) initiated login. Fireworks currently only supports
service provider (SP) initiated login.
See [Understanding SAML](https://developer.okta.com/docs/concepts/saml/#understand-sp-initiated-sign-in-flow) for an
in-depth explanation.
### Required String parameter 'RelayState' is not present
See above.
# Managing users
Add and delete additional users in your Fireworks account
Only admin users can manage other users within the account.
## Adding users
To add a new user to your Fireworks account, run the following command:
```bash
firectl create user --email="alice@example.com"
```
To create another admin user, pass the `--role=admin` flag:
```bash
firectl create user --email="alice@example.com" --role=admin
```
## Updating a user's role
To update a user's role, run:
```bash
firectl update user --role="{admin,user}"
```
## Deleting users
You can remove a user from your account by running:
```bash
firectl delete user
```
# Align transcription
post /audio/alignments
The default **api.fireworks.ai** endpoint is for evaluation use only. To unlock the best performance, get a dedicated endpoint by contacting us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai).
### Request
##### (multi-part form)
The input audio file to align with text. Common file formats such as mp3, flac, and wav are supported. Note that the audio will be resampled to 16 kHz, downmixed to mono, and reformatted to 16-bit signed little-endian format before transcription. Pre-converting the file before sending it to the API can improve runtime performance.
The text to align with the audio.
String name of the voice activity detection (VAD) model to use. Can be one of `silero` or `whisperx-pyannet`.
String name of the alignment model to use. Can be one of `tdnn_ffn`, `mms_fa`, or `gentle`.
The format in which to return the response. Can be one of `srt`, `verbose_json`, or `vtt`.
Audio preprocessing mode. Currently supported:
* `none` to skip audio preprocessing.
* `dynamic` for arbitrary audio content with variable loudness.
* `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs.
* `bass_dynamic` for boosting lower frequencies.
### Response
The task which was performed. Either `transcribe` or `translate`.
The language of the transcribed/translated text.
The duration of the transcribed/translated audio, in seconds.
The transcribed/translated text.
Extracted words and their corresponding timestamps.
The text content of the word.
Start time of the word in seconds.
End time of the word in seconds.
Segments of the transcribed/translated text and their corresponding details.
```python python
!pip install fireworks-ai requests
import time
import requests
from fireworks.client.audio import AudioInference
# Prepare client
audio = requests.get("https://tinyurl.com/3pddjjdc").content
text = "At this turning point of history there manifest themselves, side by side and often mixed and entangled together, a magnificent, manifold, virgin forest-like upgrowth and upstriving, a kind of tropical tempo in the rivalry of growth, and an extraordinary decay and self-destruction owing to the savagely opposing and seemingly exploding egoisms which strive with one another for sun and light, and can no longer assign any limit, restraint, or forbearance for themselves by means of the hitherto existing morality"
client = AudioInference(
model="whisper-v3-turbo",
base_url="https://audio-prod.us-virginia-1.direct.fireworks.ai",
api_key="<...>",
)
# Make request
start = time.time()
r = await client.align_async(audio=audio, text=text)
print(f"Took: {(time.time() - start):.3f}s. Response: '{r}'")
```
```curl curl
# Download audio file
curl -sL -o "30s.flac" "https://tinyurl.com/3pddjjdc"
# Make request
curl -X POST "http://api.fireworks.ai/inference/v1/audio/alignments" \
-H "Authorization: Bearer <...>" \
-F "file=@30s.flac"
-F "text=At this turning point of history there manifest themselves, side by side and often mixed and entangled together, a magnificent, manifold, virgin forest-like upgrowth and upstriving, a kind of tropical tempo in the rivalry of growth, and an extraordinary decay and self-destruction owing to the savagely opposing and seemingly exploding egoisms which strive with one another for sun and light, and can no longer assign any limit, restraint, or forbearance for themselves by means of the hitherto existing morality"
```
# Transcribe audio (realtime)
post /audio/transcriptions
This realtime API requires a dedicated endpoint. To get a dedicated endpoint, please contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai).
Open a WebSocket connection to the endpoint.
Stream audio data to the WebSocket and receive transcription from the WebSocket.
```mermaid
sequenceDiagram
Client->>Server: Open WebSocket connection
loop
Client->>Server: Audio chunk
Server->>Client: Transcription chunk
end
```
Map audio stream to text.
Detailed output with word-level timestamps.
### Input
An async generator that yields audio data chunks.
String name of the ASR model to use. Can be one of `whisper-v3` or `whisper-v3-turbo`.
String name of the voice activity detection (VAD) model to use. Can be one of `silero` or `whisperx-pyannet`.
String name of the alignment model to use. Can be one of `tdnn_ffn`, `mms_fa`, or `gentle`.
The target language for transcription. The set of supported target languages can be found [here](https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/tokenizer.py#L10-L128).
The input prompt with which to prime transcription. This can be used, for example, to continue a prior transcription given new audio data.
Sampling temperature to use when decoding text tokens during transcription.
The format in which to return the response. Can be one of `json`, `text`, `srt`, `verbose_json`, or `vtt`.
The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of `word` and `segment` are supported. If not present, defaults to `segment`.
Audio preprocessing mode. Currently supported:
* `none` to skip audio preprocessing.
* `dynamic` for arbitrary audio content with variable loudness.
* `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs.
* `bass_dynamic` for boosting lower frequencies.
### Output
The task which was performed. Either `transcribe` or `translate`.
The language of the transcribed/translated text.
The duration of the transcribed/translated audio, in seconds.
The transcribed/translated text.
Extracted words and their corresponding timestamps.
The text content of the word.
Start time of the word in seconds.
End time of the word in seconds.
Segments of the transcribed/translated text and their corresponding details.
```python basic
!pip install fireworks-ai
import time
from fireworks.client.audio import AudioInference
client = AudioInference(
model="whisper-v3-turbo",
base_url="https://api.fireworks.ai",
api_key="<...>",
)
start = time.time()
# stream_audios() is a user-supplied async generator of audio chunks (see the sketch below)
audio_stream = stream_audios()
async for r in client.transcribe_stream_async(audio_stream):
took = (time.time() - start)
print(f"Took: {took:.3f}s. Text: '{r.text}'")
start = time.time()
```
```python verbose
!pip install fireworks-ai
import time
from fireworks.client.audio import AudioInference
client = AudioInference(
model="whisper-v3-turbo",
base_url="https://api.fireworks.ai",
api_key="<...>",
)
start = time.time()
audio_stream = stream_audios()
async for r in client.transcribe_stream_async(
audio_stream,
response_format="verbose_json",
timestamp_granularities=["word"],
):
took = (time.time() - start)
print(f"Took: {took:.3f}s. Words: '{r.words}'")
start = time.time()
```
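The examples above assume a user-supplied `stream_audios()` helper that yields audio data chunks. A minimal sketch of such an async generator, assuming a local 16 kHz, 16-bit mono PCM file and an arbitrary ~100 ms chunk size (both are illustrative, not requirements of the API), might look like:
```python
import asyncio

async def stream_audios(path: str = "audio.pcm", chunk_bytes: int = 3200):
    """Yield raw audio chunks (~100 ms at 16 kHz, 16-bit mono) from a local file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            yield chunk
            await asyncio.sleep(0.1)  # pace chunks roughly like a live audio source
```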
# Transcribe audio
post /audio/transcriptions
The default **api.fireworks.ai** endpoint is for evaluation use only. To unlock the best performance, get a dedicated endpoint by contacting us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai).
Send a sample audio to get a transcription.
### Request
##### (multi-part form)
The input audio file to transcribe. Common file formats such as mp3, flac, and wav are supported. Note that the audio will be resampled to 16 kHz, downmixed to mono, and reformatted to 16-bit signed little-endian format before transcription. Pre-converting the file before sending it to the API can improve runtime performance.
String name of the ASR model to use. Can be one of `whisper-v3` or `whisper-v3-turbo`.
String name of the voice activity detection (VAD) model to use. Can be one of `silero` or `whisperx-pyannet`.
String name of the alignment model to use. Can be one of `tdnn_ffn`, `mms_fa`, or `gentle`.
The target language for transcription. The set of supported target languages can be found [here](https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/tokenizer.py#L10-L128).
The input prompt with which to prime transcription. This can be used, for example, to continue a prior transcription given new audio data.
Sampling temperature to use when decoding text tokens during transcription.
The format in which to return the response. Can be one of `json`, `text`, `srt`, `verbose_json`, or `vtt`.
The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of `word` and `segment` are supported. If not present, defaults to `segment`.
Audio preprocessing mode. Currently supported:
* `none` to skip audio preprocessing.
* `dynamic` for arbitrary audio content with variable loudness.
* `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs.
* `bass_dynamic` for boosting lower frequencies.
### Response
The task which was performed. Either `transcribe` or `translate`.
The language of the transcribed/translated text.
The duration of the transcribed/translated audio, in seconds.
The transcribed/translated text.
Extracted words and their corresponding timestamps.
The text content of the word.
Start time of the word in seconds.
End time of the word in seconds.
Segments of the transcribed/translated text and their corresponding details.
```python python
!pip install fireworks-ai requests
import time
import requests
from fireworks.client.audio import AudioInference
# Prepare client
audio = requests.get("https://tinyurl.com/4cb74vas").content
client = AudioInference(
model="whisper-v3",
base_url="https://audio-prod.us-virginia-1.direct.fireworks.ai",
api_key="<...>",
)
# Make request
start = time.time()
r = await client.transcribe_async(audio=audio)
print(f"Took: {(time.time() - start):.3f}s. Text: '{r.text}'")
```
```curl curl
# Download audio file
curl -sL -o "1hr.flac" "https://tinyurl.com/4cb74vas"
# Make request
curl -X POST "https://audio-prod.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions" \
-H "Authorization: Bearer <...>" \
-F "file=@1hr.flac"
```
# Translate audio
post /audio/translations
The default **api.fireworks.ai** endpoint is for evaluation use only. To unlock the best performance, get a dedicated endpoint by contacting us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai).
### Request
##### (multi-part form)
The input audio file to translate. Common file formats such as mp3, flac, and wav are supported. Note that the audio will be resampled to 16 kHz, downmixed to mono, and reformatted to 16-bit signed little-endian format before transcription. Pre-converting the file before sending it to the API can improve runtime performance.
String name of the ASR model to use. Can be one of `whisper-v3` or `whisper-v3-turbo`.
String name of the voice activity detection (VAD) model to use. Can be one of `silero` or `whisperx-pyannet`.
String name of the alignment model to use. Can be one of `tdnn_ffn`, `mms_fa`, or `gentle`.
The target language for transcription. The set of supported target languages can be found [here](https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/tokenizer.py#L10-L128).
The input prompt with which to prime transcription. This can be used, for example, to continue a prior transcription given new audio data.
Sampling temperature to use when decoding text tokens during transcription.
The format in which to return the response. Can be one of `json`, `text`, `srt`, `verbose_json`, or `vtt`.
The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of `word` and `segment` are supported. If not present, defaults to `segment`.
Audio preprocessing mode. Currently supported:
* `none` to skip audio preprocessing.
* `dynamic` for arbitrary audio content with variable loudness.
* `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs.
* `bass_dynamic` for boosting lower frequencies.
### Response
The task which was performed. Either `transcribe` or `translate`.
The language of the transcribed/translated text.
The duration of the transcribed/translated audio, in seconds.
The transcribed/translated text.
Extracted words and their corresponding timestamps.
The text content of the word.
Start time of the word in seconds.
End time of the word in seconds.
Segments of the transcribed/translated text and their corresponding details.
```python python
!pip install fireworks-ai requests
import time
import requests
from fireworks.client.audio import AudioInference
# Prepare client
audio = requests.get("https://tinyurl.com/4cb74vas").content
client = AudioInference(
model="whisper-v3",
base_url="https://audio-prod.us-virginia-1.direct.fireworks.ai",
api_key="<...>",
)
# Make request
start = time.time()
r = await client.translate_async(audio=audio)
print(f"Took: {(time.time() - start):.3f}s. Text: '{r.text}'")
```
```curl curl
# Download audio file
curl -sL -o "1hr.flac" "https://tinyurl.com/4cb74vas"
# Make request
curl -X POST "https://audio-prod.us-virginia-1.direct.fireworks.ai/v1/audio/translations" \
-H "Authorization: Bearer <...>" \
-F "file=@1hr.flac"
```
# Create Dataset
post /v1/accounts/{account_id}/datasets
# Create Deployed Model
post /v1/accounts/{account_id}/deployedModels
# Create Deployment
post /v1/accounts/{account_id}/deployments
# Create Fine-tuning Job
post /v1/accounts/{account_id}/fineTuningJobs
# Create Model
post /v1/accounts/{account_id}/models
# Create User
post /v1/accounts/{account_id}/users
# Create embeddings
post /embeddings
# Delete Dataset
delete /v1/accounts/{account_id}/datasets/{dataset_id}
# Delete Deployed Model
delete /v1/accounts/{account_id}/deployedModels/{deployed_model_id}
# Delete Deployment
delete /v1/accounts/{account_id}/deployments/{deployment_id}
# Delete Fine-tuning Job
delete /v1/accounts/{account_id}/fineTuningJobs/{fine_tuning_job_id}
# Delete Model
delete /v1/accounts/{account_id}/models/{model_id}
# Generate an image
The official API reference for image generation workloads can be found on the corresponding model pages by clicking "view code". We support generating images from text prompts, other images, and/or ControlNet.
[https://fireworks.ai/models/fireworks/stable-diffusion-xl-1024-v1-0](https://fireworks.ai/models/fireworks/stable-diffusion-xl-1024-v1-0)
[https://fireworks.ai/models/fireworks/SSD-1B](https://fireworks.ai/models/fireworks/SSD-1B)
[https://fireworks.ai/models/fireworks/playground-v2-1024px-aesthetic](https://fireworks.ai/models/fireworks/playground-v2-1024px-aesthetic)
[https://fireworks.ai/models/fireworks/japanese-stable-diffusion-xl](https://fireworks.ai/models/fireworks/japanese-stable-diffusion-xl)
# Get Account
get /v1/accounts/{account_id}
# Get Dataset
get /v1/accounts/{account_id}/datasets/{dataset_id}
# Get Dataset Upload Endpoint
post /v1/accounts/{account_id}/datasets/{dataset_id}:getUploadEndpoint
# Get Deployment
get /v1/accounts/{account_id}/deployments/{deployment_id}
# Get Fine-tuning Job
get /v1/accounts/{account_id}/fineTuningJobs/{fine_tuning_job_id}
# Get Model
get /v1/accounts/{account_id}/models/{model_id}
# Get Model Download Endpoint
get /v1/accounts/{account_id}/models/{model_id}:getDownloadEndpoint
# Get Model Upload Endpoint
post /v1/accounts/{account_id}/models/{model_id}:getUploadEndpoint
# Get User
get /v1/accounts/{account_id}/users/{user_id}
# Introduction
The Fireworks AI REST API enables you to interact with various language, image, and embedding models using your API key.
## Authentication
All requests made to the Fireworks AI REST API must include an `Authorization` header.
The header should specify a valid `Bearer` token containing your API key, and the request body must be JSON with the `Content-Type: application/json` header.
This ensures that your requests are properly authenticated and formatted for interaction with the Fireworks AI platform.
A sample header to include in a REST API request looks like:
```json
authorization: Bearer <API_KEY>
```
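For example, an authenticated request to the chat completions endpoint looks like the following (the model name is illustrative):
```bash
curl -X POST "https://api.fireworks.ai/inference/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -d '{
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```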
# List Datasets
get /v1/accounts/{account_id}/datasets
# List Deployments
get /v1/accounts/{account_id}/deployments
# List Fine-tuning Jobs
get /v1/accounts/{account_id}/fineTuningJobs
# List Models
get /v1/accounts/{account_id}/models
# List Users
get /v1/accounts/{account_id}/users
# Create chat completion
post /chat/completions
# Create completion
post /completions
# Update Dataset
patch /v1/accounts/{account_id}/datasets/{dataset_id}
# Update Deployment
patch /v1/accounts/{account_id}/deployments/{deployment_id}
# Update Fine-tuning Job
patch /v1/accounts/{account_id}/fineTuningJobs/{fine_tuning_job_id}
# Update Model
patch /v1/accounts/{account_id}/models/{model_id}
# Update User
patch /v1/accounts/{account_id}/users/{user_id}
# Upload Dataset Files
post /v1/accounts/{account_id}/datasets/{dataset_id}:upload
Provides a streamlined way to upload a dataset file in a single API request. This path can handle file sizes up to 150 MB. For larger files, use [Get Dataset Upload Endpoint](get-dataset-upload-endpoint).
# Validate Dataset Upload
post /v1/accounts/{account_id}/datasets/{dataset_id}:validateUpload
# Validate Model Upload
get /v1/accounts/{account_id}/models/{model_id}:validateUpload
# Start here
The **Fireworks Cookbook** is your hands-on guide to building, deploying, and fine-tuning generative AI and agentic workflows. It offers curated examples, Jupyter Notebooks, apps, and resources tailored to various use cases and skill levels, making it a go-to resource for practical Fireworks implementations.
In this cookbook, you’ll find:
* **Production-ready projects**: Scalable, proven solutions with ongoing support from the Fireworks engineering team.
* **Learning-focused tutorials**: Step-by-step guides for hands-on exploration, ideal for interactive learning of AI techniques.
* **Community-driven showcases**: Creative user-contributed projects that showcase innovative applications of Fireworks in diverse contexts.
***
## Repository structure
To help you easily navigate and find the right resources, the Cookbook organizes examples by purpose:
**Hands-on projects for learning AI** techniques, maintained by the DevRel team.
**Explore user-contributed projects** that push creative boundaries with Fireworks.
***
### Feedback & support
We value your feedback! If you encounter issues, need clarification, or have questions, please contact us at
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
***
**Additional resources:**
* [Fireworks AI Blog](https://fireworks.ai/blog)
* [Fireworks AI YouTube](https://www.youtube.com/channel/UCHCffBTGYa1Ut72h03ldtGA)
* [Fireworks AI Twitter](https://x.com/fireworksai_hq)
# Build with Fireworks
Step-by-step guides for hands-on exploration, ideal for interactive learning of AI techniques.
## Inference
Explore notebooks and projects showcasing how to run generative AI models on Fireworks, demonstrating both third-party integrations and innovative applications with industry-leading speed and flexibility.
### LLMs
Dive into examples that utilize Fireworks for deploying and fine-tuning large language models (LLMs), featuring integrations with popular libraries and cutting-edge use cases.
**Notebooks**
(Python) An interactive Streamlit app for comparing LLMs on Fireworks with parameter tuning and LLM-as-a-Judge functionality.
(Python) Demonstrates structured responses using Llama 3.1, covering Grammar Mode and JSON Mode for consistent output formats.
(Python) Explores generating synthetic data with Llama 3.1 models on Fireworks, including structured outputs for quizzes.
**Apps**
A Next.js app for real-time transcription chat using Fireworks and Vercel integration.
### Visual-language
Discover projects combining vision and language capabilities using Fireworks, integrating external frameworks for seamless multimodal understanding.
### Audio
Explore real-time audio transcription, processing, and generation examples using Fireworks’ advanced audio models and integrations.
**Notebooks**
A notebook demonstrating real-time audio transcription using Fireworks' Whisper-v3-turbo model. The project includes streaming audio input, transcribing speech, and aligning timestamps, making it ideal for tasks requiring accurate and responsive audio processing.
Learn how to perform real-time audio transcription and timestamp alignment using Fireworks' Whisper-v3-turbo model. This notebook demonstrates streaming audio input, transcription, and precise word-level alignment, ideal for applications needing accurate speech processing.
### Image
Experiment with image-based projects using Fireworks’ models, enhanced with third-party libraries for innovative applications in image creation, manipulation, and recognition.
### Multimodal
Learn from complex multimodal examples that blend text, audio, and image inputs, demonstrating the full potential of Fireworks combined with external tools for interactive AI experiences.
***
## Fine-tuning
Access notebooks that demonstrate efficient model fine-tuning on Fireworks, utilizing both internal capabilities and third-party tools like Axolotl for custom optimization.
***
## Function calling
Explore examples of function-calling workflows using Fireworks, showcasing how to integrate with external APIs and tools for sophisticated, multi-step AI operations.
**Notebooks**
Demonstrates Function-Calling with LangChain integration, including custom tool routing and query handling. (Python)
Explore the integration of Fireworks' function-calling model with LangChain tools. This notebook demonstrates building basic agents using `firefunction-v1` for tasks like answering questions, retrieving stock prices, and generating images with the Fireworks SDXL API (Javascript).
Showcases Function-Calling with LangGraph integration for graph-based agent systems and tool queries. (Python)
Uses Fireworks' Function-Calling for structured QA with OpenAI, featuring multi-turn conversation handling. (Python)
Demonstrates querying financial data using Fireworks' Function-Calling API with integrated tool setup. (Python)
Extracts structured information from web content using Fireworks' Function-Calling API. (Python)
Generates stock charts using Fireworks' Function-Calling API with AutoGen integration. (Python)
**Apps**
A demo app showcasing chat with function-calling capabilities for dynamic service invocation.
***
## RAG
Build retrieval-augmented generation (RAG) systems with Fireworks, featuring projects that connect with vector databases and search tools for enhanced, context-aware AI responses.
**Notebooks**
A basic RAG implementation using ChromaDB with League of Legends data, comparing responses across multiple models. (Python)
An agentic system using RAG for generating catchy research paper titles with embeddings and LLM completions. (Python)
A movie recommendation system using Fireworks' function-calling models and MongoDB Atlas for personalized, real-time suggestions. (Python)
**Apps**
A RAG chatbot using SurrealDB for vector storage and Fireworks for real-time, context-aware responses.
***
### Integration partners
We welcome contributions from integration partners! Follow these steps:
1. **Clone the Repo**: [Fireworks Cookbook repo](https://github.com/fw-ai/cookbook)
2. **Create Folder**: Add your company/tool under `integrations`
3. **Add Examples**: Include code, notebooks, or demos
4. **Use Template**: Fill out the [integration guide](https://github.com/fw-ai/cookbook/blob/main/integrations/template_integration_guide.md)
5. **Submit PR**: Create a pull request
6. **Review**: Fireworks will review and merge
Need help? Contact us or open an issue.
***
### Support
For help or feedback:
* **Discord**: [Join us](https://discord.gg/fireworks-ai)
* **Email**: [Contact us](mailto:inquiries@fireworks.ai)
**Resources**:
* [Blog](https://fireworks.ai/blog)
* [YouTube](https://www.youtube.com/channel/UCHCffBTGYa1Ut72h03ldtGA)
* [Twitter](https://x.com/fireworksai_hq)
# Community showcase
Creative user-contributed projects that showcase innovative applications of Fireworks in diverse contexts.
Convert any PDF into a personalized podcast using open-source LLMs and TTS models. Powered by Fireworks-hosted Llama 3.1, MeloTTS, and Bark, this app generates engaging dialogue and outputs it as an MP3 file via a user-friendly Gradio interface.
High-throughput code generation with Qwen2.5 Coder models, optimized for fast inference on Fireworks. Includes a robust pipeline for data creation, fine-tuning with Unsloth, and real-time application in AI-powered code editors.
Ensure accurate and reliable technical documentation with ProoferX, built using Fireworks’ fast Llama models and Firefunc for structured output. This project addresses a key challenge in developer tools by validating and streamlining documentation with real-time checks.
***
## Community project submissions
We welcome your contributions to the **Fireworks Cookbook**! Share your projects and help expand our collaborative resource.
Here’s how:
1. **Clone the Repo**: [Fireworks Cookbook](https://github.com/fw-ai/cookbook) and go to `showcase`.
2. **Create Folder**: Add a folder named after your project.
3. **Include Code**: Add notebooks, apps, or other resources demonstrating your project.
4. **Complete Template**: Fill out the [Showcase Template](https://github.com/fw-ai/cookbook/blob/main/showcase/template_projectMDX.md) for key project details.
5. **Submit PR**: Submit your project as a pull request.
6. **Review & Feature**: Our team will review your submission; selected projects may be highlighted in docs or social media.
***
### Support
For help or feedback:
* **Discord**: [Join us](https://discord.gg/fireworks-ai)
* **Email**: [Contact us](mailto:inquiries@fireworks.ai)
**Resources**:
* [Blog](https://fireworks.ai/blog)
* [YouTube](https://www.youtube.com/channel/UCHCffBTGYa1Ut72h03ldtGA)
* [Twitter](https://x.com/fireworksai_hq)
# Regions
Fireworks runs a global fleet of hardware on which you can deploy your models.
## Availability
Current region availability:
| **Region** | **Launch status** | **Hardware availability** |
| ---------------- | ------------------- | ------------------------- |
| `US_ILLINOIS_2` | Generally Available | `NVIDIA_A100_80GB` |
| `US_TEXAS_1` | Generally Available | `NVIDIA_H100_80GB` |
| `US_VIRGINIA_2` | Generally Available | `AMD_MI300X_192GB` |
| `AP_TOKYO_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `EU_FRANKFURT_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `EU_LONDON_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `US_ILLINOIS_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `US_IOWA_1` | Enterprise only | `NVIDIA_H100_80GB` |
| `US_VIRGINIA_1` | Enterprise only | `NVIDIA_H100_80GB` |
If you need deployments in a non-GA region, please contact our team at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai).
## Using a region
When creating a deployment, you can pass the `--region` flag:
```
firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \
--region US_IOWA_1
```
## Changing regions
Updating a region for a deployment in-place is currently not supported. To move a deployment between regions, please
create a new deployment in the new region, then delete the old deployment.
## Quotas
Each region has its own separate quota for each hardware type. To view your current quotas, run:
```
firectl list quotas
```
# Reserved capacity
Enterprise accounts can purchase reserved capacity, typically with 1-year commitments. Reserved capacity has the
following advantages over [on-demand deployments](/guides/ondemand-deployments):
* Guaranteed capacity
* Higher quotas
* Lower GPU-hour prices
* Pre-GA access to newer regions
* Pre-GA access to newest hardware
## Purchasing or renewing a reservation
To purchase a reservation or increase the size or duration of an existing reservation, contact your Fireworks account
manager. If you are a new, prospective customer, please reach out to our [sales team](https://fireworks.ai/company/contact-us).
## Viewing your reservations
To view your existing reservations, run:
```
firectl list reservations
```
## Usage and billing
Reservations are automatically "consumed" when you create deployments that meet the reservation parameters. For
example, suppose you have a reservation for 12 H100 GPUs and create two deployments, each using 8 H100 GPUs. While both
deployments are running, 12 H100s will count towards using your reservation, while the excess 4 H100s will be metered
and billed at the on-demand rate.
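A back-of-the-envelope sketch of that accounting (illustrative only; the numbers mirror the example above):
```python
reserved_gpus = 12
deployment_gpus = [8, 8]                     # two deployments, 8 H100s each
total_in_use = sum(deployment_gpus)          # 16 GPUs running
covered = min(total_in_use, reserved_gpus)   # 12 GPUs count against the reservation
on_demand = total_in_use - covered           # 4 GPUs metered at the on-demand rate
print(covered, on_demand)                    # 12 4
```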
When a reservation approaches its end time, ensure that you either renew your reservation or turn down a corresponding
number of deployments, otherwise you may be billed for your usage at on-demand rates.
Reservations are invoiced separately from your on-demand usage, at a frequency determined by your reservation contract
(e.g. monthly, quarterly, or yearly).
Reserved capacity will always be billed until the reservation ends, regardless of whether the reservation is
actively used.
# Account setup & management
Solutions for common account access issues and management procedures for Fireworks.ai accounts
## Multiple account access
**Q: What should I do if I can't access my company account after being invited when I already have a personal account?**
This issue can occur when you have multiple accounts associated with the same email address (e.g., a personal account created with Google login and a company account you've been invited to).
To resolve this:
1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) from the email address associated with both accounts
2. Include in your email:
* The account ID you created personally (e.g., username-44ace8)
* The company account ID you need access to (e.g., company-a57b2a)
* Mention that you're having trouble accessing your company account
Note: This is a known scenario that support can resolve once they verify your email ownership.
***
## Account closure
**Q: How do I close my Fireworks.ai account?**
To close your account:
1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
2. Include in your request:
* Your account ID
* A clear request for account deletion
Before closing your account, please ensure:
* All outstanding invoices are paid
* Any active deployments are terminated
* Important data is backed up if needed
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Billing management
Information about Fireworks.ai invoicing and API billing.
## Invoice questions
**Q: Why did I receive an invoice when I only deposited credits?**
Fireworks.ai billing works as follows:
* **Deposited credits** are used first.
* Once credits are exhausted, you **continue to accrue charges** for additional usage.
* **Usage charges** are billed at the end of each month.
* You’ll receive an invoice for any usage that **exceeded your pre-purchased credits**.
This process happens automatically, regardless of subscription status. To prevent additional charges, please monitor your usage or contact support to set up spending restrictions.
***
## API billing
**Q: Are calls to the Models API billable?**
No, calls to the **Models API** endpoint are free. This applies to all **management API calls** for:
* Accounts
* Users
* Models
* Datasets
*Note*: While the API calls themselves are free, charges apply for:
* **Model deployments**
* **Fine-tuning jobs**
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Credit system
Understanding how Fireworks.ai billing, credits, and account suspension work.
## Billing and credit usage
**Q: How does billing and credit usage work?**
Usage and billing operate through a **tiered system**:
* Each **tier** has a monthly usage limit, regardless of available credits.
* Once you reach your tier's limit, **service will be suspended** even if you have remaining credits.
* **Usage limits** reset at the beginning of each month.
* Pre-purchased credits do not prevent additional charges once the limit is exceeded.
***
## Account suspension
**Q: Why might my account be suspended even with remaining credits?**
Your account may be suspended due to several factors:
1. **Monthly usage limits**:
* Each tier includes a monthly usage limit, independent of any credits.
* Once you reach this limit, your service will be suspended, even if you have credits remaining.
* Usage limits automatically reset at the beginning of each month.
2. **Billing structure**:
* Pre-purchased credits do not prevent additional charges.
* You can exceed your pre-purchased credits and will be billed for any usage beyond that limit.
* **Example**: If you have `$20` in pre-purchased credits but incur `$83` in usage, you will be billed for the `$63` difference.
***
## Missing credits
**Q: I bought credits but don’t see them reflected in my account. Did they disappear?**
Fireworks operates with a **postpaid billing** system where:
* **Prepaid credits** are instantly applied to any outstanding balance.
* **Example**: If you had a `$750` outstanding bill and added `$500` in credits, your bill would reduce to `$250`, with \$0 remaining credits available for new usage.
To check your credit balance:
1. Visit your **billing dashboard**.
2. Review the **"Credits"** section.
3. Check your **current outstanding balance**.
*Note*: Credits are always applied to any existing balance before being available for new usage.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Cost structure
Understanding Fireworks.ai pricing and fees for various services.
## Platform costs
**Q: How much does Fireworks cost?**
Fireworks AI operates on a **pay-as-you-go** model for all non-Enterprise usage, and new users automatically receive free credits. You pay based on:
* **Per token** for serverless inference
* **Per GPU usage time** for on-demand deployments
* **Per token of training data** for fine-tuning
For customers needing **enterprise-grade security and reliability**, please reach out to us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) to discuss options.
Find out more about our current pricing on our [Pricing page](https://fireworks.ai/pricing).
***
## Fine-tuning fees
**Q: Are there extra fees for serving fine-tuned models?**
No, deploying fine-tuned models to serverless infrastructure is free. Here’s what you need to know:
**What’s free**:
* Deploying fine-tuned models to serverless infrastructure
* Hosting the models on serverless infrastructure
* Deploying up to 100 fine-tuned models
**What you pay for**:
* **Usage costs** on a per-token basis when the model is actually used
* The **fine-tuning process** itself, if applicable
*Note*: This differs from on-demand deployments, which include hourly hosting costs.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Discounts
Information about bulk usage discounts and special pricing options.
## Bulk usage
**Q: Are there discounts for bulk usage?**
Yes, we offer discounts for **bulk or pre-paid purchases** exclusively for on-demand deployments—not for serverless GPUs. Please contact [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) if you're interested.
***
## Serverless discounts
**Q: Are there discounts for bulk spend on serverless deployments?**
Our publicly accessible services have **standard rates** for all customers. Currently, we do not offer bulk discounts for serverless deployments.
***
## Additional information
For **enterprise customers** or **high-volume users**:
* Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options**
* Discuss **annual commitment discounts**
* Explore **enterprise-specific features and benefits**
# Billing & scaling
Understanding billing and scaling mechanisms for on-demand deployments.
## Autoscaling and costs
**Q: How does autoscaling affect my costs?**
* **Scaling from 0**: No minimum cost when scaled to zero
* **Scaling up**: Each new replica adds to your total cost proportionally. For example:
* Scaling from 1 to 2 replicas doubles your GPU costs
* If each replica uses multiple GPUs, costs scale accordingly (e.g., scaling from 1 to 2 replicas with 2 GPUs each means paying for 4 GPUs total)
For current pricing details, please visit our [pricing page](https://fireworks.ai/pricing).
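As a rough sketch of how replica count translates into cost (the hourly rate below is a placeholder, not an actual Fireworks price; see the pricing page for current rates):
```python
gpu_hourly_rate = 3.00        # placeholder $/GPU-hour, not an actual Fireworks price
gpus_per_replica = 2
replicas = 2
hours_active = 4
total_gpus = gpus_per_replica * replicas
cost = gpu_hourly_rate * total_gpus * hours_active
print(f"{total_gpus} GPUs for {hours_active}h -> ${cost:.2f}")
```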
***
## Rate-limits for on-demand deployment
**Q: What are the rate limits for on-demand deployments?**
Request throughput scales with your GPU allocation. Base allocations include:
* Up to 8 A100 GPUs
* Up to 8 H100 GPUs
On-demand deployments offer several advantages:
* **Predictable pricing** based on time units, not token I/O
* **Protected latency and performance**, independent of traffic on the serverless platform
* **Choice of GPUs**, including A100s and H100s
Need more GPUs? Contact us to discuss higher allocations for your specific use case.
***
## On-demand billing
**Q: How does billing work for on-demand deployments?**
On-demand deployments come with automatic cost optimization features:
* **Default autoscaling**: Automatically scales to 0 replicas when not in use
* **Pay for what you use**: Charged only for GPU time when replicas are active
* **Flexible configuration**: Customize autoscaling behavior to match your needs
**Best practices for cost management**:
1. **Leverage default autoscaling**: The system automatically scales down deployments when not in use
2. **Customize carefully**: While you can modify autoscaling behavior using our [configuration options](https://docs.fireworks.ai/guides/ondemand-deployments#customizing-autoscaling-behavior), note that preventing scale-to-zero will result in continuous GPU charges
3. **Consider your use case**: For intermittent or low-frequency usage, serverless deployments might be more cost-effective
For detailed configuration options, see our [deployment guide](https://docs.fireworks.ai/guides/ondemand-deployments#replica-count-horizontal-scaling).
***
## Scaling structure
**Q: How does billing and scaling work for on-demand GPU deployments?**
On-demand GPU deployments have unique billing and scaling characteristics compared to serverless deployments:
**Billing**:
* Charges start when the server begins accepting requests
* **Billed by GPU-second** for each active instance
* Costs accumulate even if there are no active API calls
**Scaling options**:
* Supports **autoscaling** from 0 to multiple GPUs
* Each additional GPU **adds to the billing rate**
* Can handle unlimited requests within the GPU’s capacity
**Management requirements**:
* Not fully serverless; requires some manual management
* **Manually delete deployments** when no longer needed
* Or configure autoscaling to **scale down to 0** during inactive periods
**Cost control tips**:
* Regularly **monitor active deployments**
* **Delete unused deployments** to avoid unnecessary costs
* Consider **serverless options** for intermittent usage
* Use **autoscaling to 0** to optimize costs during low-demand times
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options**
# Deployment issues
Troubleshooting and resolving common issues with on-demand deployments.
## Custom model issues
**Q: What are the common issues when deploying custom models?**
Here are key areas to troubleshoot for custom model deployments:
### 1. Deployment hanging or crashing
**Common causes**:
* **Missing model files**, especially when using Hugging Face models
* **Symlinked files** not uploaded correctly
* **Outdated firectl version**
**Solutions**:
* Download models without symlinks using:
```bash
huggingface-cli download model_name --local-dir=/path --local-dir-use-symlinks=False
```
* Update **firectl** to the latest version
### 2. LoRA adapters vs full models
* **Compatibility**: LoRA adapters work with specific base models.
* **Performance**: May experience slightly lower speed with LoRA, but **quality should remain similar** to the original model.
* **Troubleshooting quality drops**:
* Check **model configuration**
* Review **conversation template**
* Add `echo: true` to debug requests
### 3. Performance optimization factors
Consider adjusting the following for improved performance:
* **Accelerator count** and **accelerator type**
* **Long prompt** settings to handle complex inputs
***
## Autoscaling
**Q: What should I expect for deployment and scaling performance?**
* **Initial deployment**: Should complete within minutes
* **Scaling from zero**: You may experience brief availability delays while the system scales up
* **Troubleshooting**: If deployment takes over 1 hour, this typically indicates a crash and should be investigated
* **Best practice**: Monitor deployment status and contact support if deployment times are unusually long
***
## Performance questions
**Q: I have more specific performance questions about improvements**
For detailed discussions on performance and optimization options:
* **Schedule a consultation** directly with our PM, Ray Thai ([calendly](https://calendly.com/raythai))
* Discuss your **specific use cases**
* Get **personalized recommendations**
* Review **advanced configuration options**
*Note*: Monitor costs carefully during the deployment and testing phase, as repeated deployments and tests can quickly consume credits.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options**
# Hardware options
Understanding hardware choices for Fireworks.ai on-demand deployments.
## Hardware selection
**Q: Which accelerator/GPU should I use?**
It depends on your specific needs. Fireworks has two groupings of accelerators: smaller (A100) and larger (H100 and MI300X) accelerators. Smaller accelerators are less expensive (see the [pricing page](https://fireworks.ai/pricing)), so they're more cost-effective for low-volume use cases. However, if you have enough volume to fully utilize a larger accelerator, we find that larger accelerators tend to be both faster and more cost-effective per token.
Choosing between larger accelerators depends on the use case.
* The MI300X has the highest memory capacity and sometimes enables large models to be deployed with comparatively few GPUs. For example, unquantized Llama 3.1 70B fits on one MI300X and FP8 Llama 405B fits on 4 MI300Xs. Higher memory may also enable better throughput for longer prompts and less-sharded deployments. It's also more affordably priced than the H100.
* The H100 offers blazing-fast inference and often provides the highest throughput, especially for high-volume use cases.
### Best Practices for Selection
1. **Analyze your workload requirements** to determine which GPU fits your processing needs.
2. Consider your **throughput needs** and the scale of your deployment.
3. Calculate the **cost-performance ratio** for each hardware option.
4. Factor in **future scaling needs** to ensure the selected GPU can support growth.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options**
# On-demand deployment scaling
Understanding Fireworks.ai system scaling and request handling capabilities.
## System scaling
**Q: How does the system scale?**
Our system is **horizontally scalable**, meaning it:
* Scales linearly with additional **replicas** of the deployment
* **Automatically allocates resources** based on demand
* Manages **distributed load handling** efficiently
***
## Auto scaling
**Q: Do you support Auto Scaling?**
Yes, our system supports **auto scaling** with the following features:
* **Scaling down to zero** capability for resource efficiency
* Controllable **scale-up and scale-down velocity**
* **Custom scaling rules and thresholds** to match your specific needs
***
## Throughput capacity
**Q: What’s the supported throughput?**
Throughput capacity typically depends on several factors:
* **Deployment type** (serverless or on-demand)
* **Traffic patterns** and **request patterns**
* **Hardware configuration**
* **Model size and complexity**
***
## Request handling
**Q: What factors affect the number of simultaneous requests that can be handled?**
The request handling capacity is influenced by multiple factors:
* **Model size and type**
* **Number of GPUs** allocated to the deployment
* **GPU type** (e.g., A100 vs. H100)
* **Prompt size** and **generation token length**
* **Deployment type** (serverless vs. on-demand)
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Performance optimization
Guidelines for optimizing performance and benchmarking Fireworks.ai deployments.
## Performance improvement
**Q: What are the techniques to improve performance?**
To optimize model performance, consider the following techniques:
1. **Quantization**
2. **Check model type**: Determine whether the model is **GQA** (Grouped Query Attention) or **MQA** (Multi-Query Attention).
3. **Increase batch size** to improve throughput.
***
## Benchmarking
**Q: How can we benchmark?**
There are multiple ways to benchmark your deployment’s performance:
* Use our [open-source load-testing tool](https://github.com/fw-ai/benchmark)
* Develop custom performance testing scripts
* Integrate with monitoring tools to track metrics
***
## Model latency
**Q: What’s the latency for small, medium, and large LLM models?**
Model latency and performance depend on various factors:
* **Input/output prompt lengths**
* **Model quantization**
* **Model sharding**
* **Disaggregated prefill processes**
* **Hardware configuration**
* **Multiple layers of caching**
* **Fire optimizations**
* **LoRA adapters** (Low-Rank Adaptation)
Our team specializes in personalizing model performance. We work with you to understand your traffic patterns and create customized deployment templates that maximize performance for your use case.
***
## Performance factors
**Q: What factors affect model latency and performance?**
Key factors that impact latency and performance include:
* **Model architecture and size**
* **Hardware configuration**
* **Network conditions**
* **Request patterns**
* **Batch size settings**
* **Caching implementation**
***
## Best practices
**Q: What are the best practices for optimizing performance?**
For optimal performance, follow these recommendations:
1. **Choose an appropriate model size** for your specific use case.
2. **Implement batching strategies** to improve efficiency.
3. **Use quantization** where applicable to reduce computational load.
4. **Monitor and adjust scaling parameters** to meet demand.
5. **Optimize prompt lengths** to reduce processing time.
6. **Implement caching** to minimize repeated calculations.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Costs & management
Understanding costs and model availability for serverless deployments.
## Deployment costs
**Q: Are there costs associated with deploying fine-tuned models to serverless infrastructure?**
No, deploying fine-tuned models to serverless infrastructure is free.
**What’s free**:
* Deploying fine-tuned models to serverless
* Hosting models on serverless infrastructure
* Deploying up to 100 fine-tuned models
**What you pay for**:
* **Usage costs** on a per-token basis when the model is actually used
* The **fine-tuning process** itself, if applicable
*Note*: This differs from on-demand deployments, which include hourly hosting costs.
***
## Model availability
**Q: Do you provide notice before removing model availability?**
Yes, we provide advance notice before removing models from the serverless infrastructure:
* **Minimum 2 weeks’ notice** before model removal
* Longer notice periods may be provided for **popular models**, depending on usage
* Higher-usage models may have extended deprecation timelines
**Best Practices**:
1. Monitor announcements regularly.
2. Prepare a migration plan in advance.
3. Test alternative models to ensure continuity.
4. Keep your contact information updated for timely notifications.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Performance issues
Troubleshooting timeout errors and performance issues with serverless LLM models.
## Timeout and response times
**Q: Why am I experiencing request timeout errors and slow response times with serverless LLM models?**
Timeout errors and increased response times can occur due to **server load during high-traffic periods**.
With serverless, users are essentially **sharing a pool of GPUs** with models pre-provisioned.
The goal of serverless is to allow users and teams to **seamlessly power their generative applications** with the **latest generative models** in **less than 5 lines of code**.
Deployment barriers should be **minimal** and **pricing is based on usage**.
However, there are trade-offs with this approach: to ensure users have **consistent access** to the most in-demand models, they are also subject to **minor latency and performance variability** during **high-volume periods**.
With **on-demand deployments**, users are reserving GPUs (which are **billed by rented time** instead of usage volume) and don't have to worry about traffic spikes.
For these reasons, our two recommended ways to address timeout and response-time issues are:
### Current solution (recommended for production)
* **Use on-demand deployments** for more stable performance
* **Guaranteed response times**
* **Dedicated resources** to ensure availability
We are always investing in ways to improve speed and performance.
### Upcoming improvements
* Enhanced SLAs for uptime
* More consistent generation speeds during peak load times
If you experience persistent issues, please include the following details in your support request:
1. Exact **model name**
2. **Timestamp** of errors (in UTC)
3. **Frequency** of timeouts
4. **Average wait times**
### Performance optimization tips
* Consider **batch processing** for handling bulk requests
* Implement **retry logic with exponential backoff** (see the sketch below)
* Monitor **usage patterns** to identify peak traffic times
* Set **appropriate timeout settings** based on model complexity
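A minimal sketch of the retry-with-backoff pattern mentioned above, assuming the OpenAI-compatible chat completions endpoint and an illustrative model payload:
```python
import time
import requests

def chat_with_retries(payload: dict, api_key: str, max_retries: int = 5) -> dict:
    """POST a chat completion request, retrying on timeouts, 429s, and 5xx errors."""
    url = "https://api.fireworks.ai/inference/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    for attempt in range(max_retries):
        try:
            resp = requests.post(url, json=payload, headers=headers, timeout=60)
            if resp.status_code == 429 or resp.status_code >= 500:
                raise RuntimeError(f"retryable status {resp.status_code}")
            resp.raise_for_status()
            return resp.json()
        except (requests.Timeout, RuntimeError):
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
```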
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Service levels
Understanding SLAs and service guarantees for Fireworks.ai serverless deployments.
## Latency guarantees
**Q: Is latency guaranteed for serverless models?**
Currently, there are **no latency or availability guarantees** for serverless models. However, they are coming soon, and we recommend contacting [sales](https://fireworks.ai/company/contact-us) to discuss any specific needs or requirements you have.
***
## Service level agreements
**Q: Are there any SLAs for serverless models?**
Our **multi-tenant serverless offering** does not currently come with **Service Level Agreements (SLAs)**. However, SLAs are coming, and we'd love to understand your use case to ensure you have the best experience possible on the Fireworks platform. Reach out to us via sales or our Discord community.
***
## Quota information
**Q: Are there any quotas for serverless?**
For **serverless deployments**, quotas are as follows:
* **Developer accounts**: 600 requests per minute (RPM)
* **Enterprise accounts**: 600 requests per minute (RPM)
* Quotas apply **across all models** and cannot be exceeded within the serverless infrastructure
**For higher quotas**:
* Consider switching to **on-demand deployments**
* **Contact enterprise sales** for custom solutions
* Evaluate **dedicated infrastructure options** for greater flexibility
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Certifications
Information about Fireworks.ai compliance certifications and HIPAA requirements.
## Security certifications
**Q: What type of certifications do you have?**
We are **SOC 2 Type II** and **HIPAA Certified**. These certifications demonstrate our commitment to:
* **Security**
* **Availability**
* **Processing integrity**
* **Confidentiality**
* **Privacy**
You can view more at [https://trust.fireworks.ai/](https://trust.fireworks.ai/).
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Enterprise sales**: Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for more information
# Enterprise quotas
Understanding quota allocations for Enterprise customers.
## Enterprise limits
**Q: Are there any quotas for Enterprise Tier?**
No, there are **no quotas** for Enterprise Tier. Enterprise customers benefit from:
1. **Resource Allocation**:
* **Unlimited request capacity**
* **Flexible scaling options**
* **Custom resource allocation**
2. **Performance Benefits**:
* **Dedicated infrastructure**
* **Priority processing**
* **Enhanced support**
3. **Custom Solutions**:
* **Tailored deployment options**
* **Specialized configurations**
* **Customized scaling policies**
For specific requirements or custom configurations, contact your **enterprise account representative**.
***
## Additional resources
* **Enterprise sales**: Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for more information
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
# Platform support
Information about Fireworks.ai deployment regions, general support channels, and platform requests.
## General support
**Q: I have another question or issue.**
We have an active [Discord community](https://discord.gg/mMqQxvFD9A) where you can:
* Post questions
* Request features
* Report bugs
* Interact directly with the Fireworks team and community
***
## Feature requests
**Q: How can I request a new model to be added to the platform?**
Head over to our **Discord server** and let us know which models you would like to see deployed. We actively take feature requests for new, popular models.
***
## Product feedback
**Q: I have specific performance questions or want to know about further performance improvement options.**
If you need more tailored performance advice or want to discuss advanced optimization options, here are two ways to get support:
1. **General support**: Reach out via our [support channels](https://fireworks.ai/company/contact-us) or check out the performance optimization practices for tips on maximizing efficiency with on-demand deployments.
2. **Direct consultation**: For in-depth questions, feel free to schedule a consultation directly with our Product Manager, Ray Thai, using [this link to his calendar](https://calendly.com/raythai). Ray can assist with advanced optimization strategies and hardware recommendations based on your specific workload and deployment needs.
***
## Deployment regions
**Q: Do you host your deployments in the EU or Asia?**
We are currently deployed in multiple U.S.-based locations. However, we’re open to hearing more about your specific requirements. You can:
* Join our [Discord community](https://discord.gg/mMqQxvFD9A)
* Write to us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
If you're an Enterprise customer, please contact your dedicated customer support representative to ensure a timely response.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Support structure & access
Information about Fireworks.ai support options, access methods, and communication channels.
## Support options
**Q: What support options exist?**
* Enterprise accounts receive **dedicated support**.
* Developer-tier customers can interact directly with the Fireworks team and community through our **Discord channel**.
***
## Support process
**Q: How does Support work?**
Fireworks provides support for its services with **target response times** based on the **priority level** of the issue. Customers can indicate priority when creating support issues through the **Fireworks support system**.
***
## Additional resources
* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)
# Enterprise support tiers & SLAs
Detailed information about Fireworks.ai support priority levels and response time commitments.
## Enterprise support contact
**Q: If you're an Enterprise customer, how do you contact support?**
Enterprise customers have access to **dedicated support channels**. Please contact your assigned **customer support representative** for timely assistance.
***
## Communication channels
**Q: Do you have a shared Slack channel?**
For customers who use Slack internally, we create a **shared Slack channel**. This channel is used for:
* **Answering questions** about Fireworks’ platform and features
* **Receiving bug reports** from customers
* **Communicating** around incidents and escalations
* **Announcing new features** and requesting feedback on current offerings
***
## Support priority levels
**Q: What are the support tiers and SLAs for enterprise?**
Support issues are categorized into four priority levels, with specific examples for each:
| Priority Level | Response Time | Description | Examples |
| --------------- | ----------------------- | ------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------- |
| **Urgent (P0)** | Within 1 hour | Reserved for critical cases that break live production workflows | • Production scheduled task/runbook unexpectedly failing<br />• Application inaccessible to end users |
| **High (P1)** | Within 4 business hours | Problems that prevent regular platform usage but do not break live production | • Development/staging schedule failing<br />• Task deployment failing |
| **Normal (P2)** | Within 8 business hours | Requests for information, enhancements, or documentation clarification with no negative service impact | • Feature requests<br />• Documentation questions |
| **Low (P3)** | Within 2 business days | Any issues that don't fall into P0, P1, or P2 categories | • General inquiries<br />• Non-urgent requests |
*Note: Business hours refer to standard working hours.*
# Platform models
Information about custom and available models on Fireworks.ai.
## Custom models
**Q: Does Fireworks support custom base models?**
Yes, custom base models can be deployed via **firectl**. You can learn more about custom model deployment in our [guide on uploading custom models](https://docs.fireworks.ai/models/uploading-custom-models).
***
## Model availability
**Q: There’s a model I would like to use that isn’t available on Fireworks. Can I request it?**
Fireworks supports a wide array of custom models and actively takes feature requests for new, popular models to add to the platform.
**To request new models**:
1. **Join our [Discord server](https://discord.gg/fireworks-ai)**
2. Let us know which models you’d like to see
3. Provide **use case details**, if possible, to help us prioritize
We regularly evaluate and add new models based on:
* **Community requests**
* **Popular demand**
* **Technical feasibility**
* **Licensing requirements**
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Fine-tuning service
Overview of Fireworks.ai fine-tuning capabilities and supported models.
## Service availability
**Q: Does Fireworks offer a fine-tuning service?**
Yes, Fireworks offers a fine-tuning service. Take a look at our [fine-tuning guide](https://docs.fireworks.ai/fine-tuning/fine-tuning-models), which is also available [via REST API](https://docs.fireworks.ai/fine-tuning/fine-tuning-via-api) for detailed information about our services and capabilities.
***
## Model support
**Q: What models are supported for fine-tuning? Is Llama 3 supported for fine-tuning?**
Yes, **Llama 3** (8B and 70B) is supported for fine-tuning with **LoRA adapters**, which can be deployed via our **serverless** and **on-demand** options for inference.
**Capabilities include**:
* **LoRA adapter training** for flexible model adjustments
* **Serverless deployment support** for scalable, cost-effective usage
* **On-demand deployment options** for high-performance inference
* A variety of **base model options** to suit different use cases
For a complete list of models available for fine-tuning, refer to our [documentation](https://docs.fireworks.ai/fine-tuning/fine-tuning-models).
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Fine-tuning troubleshooting
Solutions for common fine-tuning deployment and access issues.
## Access issues
**Q: Why am I getting "Model not found" errors when trying to access my fine-tuned model?**
If you’re unable to access your fine-tuned model, try these troubleshooting steps:
**First steps**:
* Attempt to access the model through both the **playground** and the **API**.
* Check if the error occurs for **all users** on the account.
* Ensure your **API key** is valid.
**Common causes**:
* User email previously associated with a **deleted account**
* **API key permissions** issues
* **Access conflicts** due to multiple accounts
**Debug process**:
1. Verify the API key’s validity using:
```bash
curl -v -H "Authorization: Bearer $FIREWORKS_API_KEY" https://api.fireworks.ai/verifyApiKey
```
2. Check if the issue persists across different **API keys**.
3. Identify which specific **users/emails** are affected.
**Getting help**:
* Contact support with:
* Your **account ID**
* **API key verification** results
* A list of **affected users/emails**
* Results from both **playground** and **API** tests
*Note*: If you have multiple accounts, ensure that access permissions are checked across all of them.
***
## Troubleshooting firectl deployment
**Q: Why am I getting "invalid id" errors when using firectl commands like create deployment or list deployments?**
This error typically occurs when your **account ID** is not properly configured.
### Common symptoms
* Error message: `invalid id: id must be at least 1 character long`
* Affects multiple commands, including:
* `firectl create deployment`
* `firectl list deployments`
### Steps to resolve
1. Run `firectl whoami` to check which **account id** is being used.
2. Ensure the correct **account ID** is being used. If not, run `firectl signin` to sign in to the correct account.
***
## LoRA deployment issues
**Q: Why can’t I deploy my fine-tuned Llama 3.1 LoRA adapter?**
If you encounter the following error:
```bash
Invalid LoRA weight model.layers.0.self_attn.q_proj.lora_A.weight shape: torch.Size([16, 4096]), expected (16, 8192)
```
This issue is due to the `fireworks.json` file being set to **Llama 3.1 70b instruct** by default.
**Workaround**:
1. Download the **model weights**.
2. Modify the base model to be `accounts/fireworks/models/llama-v3p1-8b-instruct`.
3. Follow the instructions in the [documentation](https://fireworks.ai/fine-tuning/model-upload) to upload and deploy the model.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# FLUX capabilities
Understanding FLUX image generation features and limitations.
## Multiple images
**Q: Can I generate multiple images in a single API call using FLUX serverless?**
No, FLUX serverless supports only one image per API call. For multiple images, send separate parallel requests—these will be automatically load-balanced across our replicas for optimal performance.
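A minimal sketch of the parallel-request pattern in Python, assuming a FLUX text-to-image endpoint that accepts a JSON body with a `prompt` and returns raw image bytes; the endpoint path, headers, and request fields below are placeholders, so check the image generation API reference for the exact values:
```python
import os
from concurrent.futures import ThreadPoolExecutor
import requests

# Placeholder: substitute the image-generation endpoint for your chosen FLUX model.
URL = "https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/<flux-model>/text_to_image"
HEADERS = {
    "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
    "Accept": "image/jpeg",  # assumed: ask for raw image bytes back
}

def generate(prompt: str) -> bytes:
    resp = requests.post(URL, json={"prompt": prompt}, headers=HEADERS, timeout=120)
    resp.raise_for_status()
    return resp.content  # raw image bytes

# One request per image, sent in parallel and load-balanced across replicas.
prompts = [f"a watercolor fox, variation {i}" for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    images = list(pool.map(generate, prompts))

for i, img in enumerate(images):
    with open(f"fox_{i}.jpg", "wb") as f:
        f.write(img)
```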
***
## Image-to-image generation
**Q: Does FLUX support image-to-image generation?**
No, image-to-image generation is not currently supported. We are evaluating this feature for future implementation. If you have specific use cases, please share them with our support team to help inform development.
***
## LoRA models
**Q: Can I create custom LoRA models with FLUX?**
Inference on FLUX LoRA adapters is currently supported. However, managed FLUX LoRA training on Fireworks is not yet available, although this feature is under development. Updates about our managed LoRA training service will be announced when available.
***
## Size control
**Q: How do I control output image sizes when using SDXL ControlNet?**
When using **SDXL ControlNet** (e.g., canny control), the output image size is determined by the explicit **width** and **height** parameters in your API request.
The input control signal image will be automatically:
* **Resized** to fit your specified dimensions
* **Cropped** to preserve aspect ratio
**Example**: To generate a 768x1344 image, explicitly include these parameters in your request:
```json
{
"width": 768,
"height": 1344
}
```
*Note*: While these parameters may not appear in the web interface examples, they are supported API parameters that can be included in your requests.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Limitations & controls
Understanding model limitations, safety features, and token limits.
## Safety Features
**Q: Can safety filters or content restrictions be disabled on text generation models?**
No, safety features and content restrictions for text generation models (such as Llama, Mistral, etc.) are embedded by the original model creators during training:
* **Safety measures** are integrated directly into the models by the teams that trained and released them.
* These are **core behaviors** of the model, not external filters.
* Different models may have varying levels of built-in safety.
* **Fireworks.ai does not add additional censorship layers** beyond what is inherent in the models.
* Original model behaviors **cannot be modified** via API parameters or configuration.
*Note*: For specific content handling needs, review the documentation of each model to understand its inherent safety features.
## Token Limits
**Q: What are the maximum completion token limits for models, and can they be increased?**
Token limits are model-specific and have technical constraints:
**Current Limitations**:
* Many models, such as **Llama 3.1 405B**, have a **4096 token completion limit**.
* Setting a higher `max_tokens` in API calls **will not override** this limit.
* You will see `"finish_reason": "length"` in responses when hitting this limit.
**Why Limits Exist**:
* **Resource management** for shared infrastructure
* Prevents single requests from monopolizing resources
* Helps maintain **service availability** for all users
**Working with Token Limits**:
* Break longer generations into **multiple requests**.
* *Note*: This may require repeating context or prompts.
* Be mindful that repeated context can **increase total token usage**.
**Example API Response at Limit**:
```json
{
"finish_reason": "length",
"usage": {
"completion_tokens": 4096,
"prompt_tokens": 4206,
"total_tokens": 8302
}
}
```
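If you hit this limit, one pattern is to continue generation in a follow-up request. A minimal, hedged sketch using the OpenAI-compatible chat completions endpoint (the continuation prompt is illustrative, and repeated context counts toward total token usage):
```python
import os
import requests

URL = "https://api.fireworks.ai/inference/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"}

def generate_long(messages, model, max_rounds=3):
    """Keep requesting continuations while the model stops due to the completion token limit."""
    parts = []
    for _ in range(max_rounds):
        resp = requests.post(URL, json={"model": model, "messages": messages}, headers=HEADERS, timeout=120)
        resp.raise_for_status()
        choice = resp.json()["choices"][0]
        parts.append(choice["message"]["content"])
        if choice["finish_reason"] != "length":
            break  # the model finished on its own
        # Feed the partial answer back and ask the model to continue.
        messages = messages + [
            {"role": "assistant", "content": choice["message"]["content"]},
            {"role": "user", "content": "Please continue from where you stopped."},
        ]
    return "".join(parts)
```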
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Inference performance
Understanding model performance, quantization, and batching capabilities.
## Model quantization
**Q: What quantization format is used for the Llama 3.1 405B model?**
The **Llama 3.1 405B model** uses the **FP8 quantization format**, which:
* Closely matches **Meta's reference implementation**
* Provides further details in the model description at [fireworks.ai/models/fireworks/llama-v3p1-405b-instruct](https://fireworks.ai/models/fireworks/llama-v3p1-405b-instruct)
* Has a general quantization methodology documented in our [Quantization blog](https://fireworks.ai/blog/fireworks-quantization)
*Note*: **BF16 precision** will be available soon for on-demand deployments.
***
## API capabilities
**Q: Does the API support batching and load balancing?**
Current capabilities include:
* **Load balancing**: Yes, supported out of the box
* **Continuous batching**: Yes, supported
* **Batch inference**: Not currently supported (on the roadmap)
* Note: For batch use cases, we recommend sending multiple parallel HTTP requests to the deployment while maintaining a fixed level of concurrency (see the sketch below).
* **Streaming**: Yes, supported
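A minimal sketch of this fixed-concurrency pattern using `asyncio` and `aiohttp`, assuming the OpenAI-compatible chat completions endpoint; the concurrency level and model name are illustrative:
```python
import asyncio
import os

import aiohttp

URL = "https://api.fireworks.ai/inference/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"}
CONCURRENCY = 8  # fixed number of requests in flight at any time

async def run_batch(prompts, model):
    sem = asyncio.Semaphore(CONCURRENCY)

    async def one(session, prompt):
        async with sem:  # never more than CONCURRENCY requests in flight
            payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
            async with session.post(URL, json=payload, headers=HEADERS) as resp:
                resp.raise_for_status()
                data = await resp.json()
                return data["choices"][0]["message"]["content"]

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(one(session, p) for p in prompts))

# Example usage (model name is a placeholder):
# results = asyncio.run(run_batch(["Summarize ...", "Translate ..."], "accounts/<account-id>/models/<model-id>"))
```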
***
## Request handling
**Q: What factors affect the number of simultaneous requests that can be handled?**
Request handling capacity depends on several factors:
* **Model size and type**
* **Number of GPUs allocated** to the deployment
* **GPU type** (e.g., A100, H100)
* **Prompt size**
* **Generation token length**
* **Deployment type** (serverless vs. on-demand)
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Data security
Information about Fireworks.ai data encryption and security measures.
## Data at rest
**Q: How is data encrypted at rest?**
All resources stored within Fireworks are **encrypted at rest**, including:
* **Models**
* **Datasets**
* **LoRA Adapters**
* Other stored resources
***
## Data in transit
**Q: How is data encrypted in transit?**
All data passed through Fireworks is encrypted using **industry-standard protocols and methods**.
***
## Encryption options
**Q: Does Fireworks provide client-side encryption or allow customers to bring their own encryption keys?**
Currently, Fireworks does not provide:
* **Client-side encryption**
* **Customer-managed keys** for encrypting data at rest
*Note*: We continuously evaluate additional encryption options based on customer needs and security requirements.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Security documentation
Access to Fireworks.ai security policies and documentation.
## Security policies
**Q: Where can I find more information about your security policies?**
Comprehensive security documentation is available at [trust.fireworks.ai](https://trust.fireworks.ai), including:
* **Security measures**
* **Compliance information**
* **Best practices**
* **Policy updates**
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Model security
Understanding model security and guardrail implementations.
## Model guardrails
**Q: Do you put any guardrails before any LLM models?**
By default, we don’t apply any guardrails to LLM models. Our customers can implement guardrails through various methods:
1. **Using built-in options**:
* Models such as **Llama Guard** provide built-in guardrails (see the sketch below).
* Integration with existing **security frameworks**.
2. **Third-party solutions**:
* AI gateways like **Portkey** offer guardrails as a feature.
* Documentation available at: [Portkey Guardrails](https://docs.portkey.ai/docs/product/guardrails)
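As one hedged illustration of option 1, the sketch below screens a model reply with a Llama Guard-style moderation model before returning it. The moderation model name is a placeholder, and the exact verdict format (`safe` / `unsafe ...`) should be confirmed against the model card you use:
```python
import os
import requests

URL = "https://api.fireworks.ai/inference/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"}
GUARD_MODEL = "accounts/fireworks/models/<llama-guard-model>"  # placeholder model name

def is_safe(user_msg: str, assistant_msg: str) -> bool:
    """Ask the guard model to classify the exchange; Llama Guard-style models reply 'safe' or 'unsafe ...'."""
    resp = requests.post(URL, headers=HEADERS, timeout=60, json={
        "model": GUARD_MODEL,
        "messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg},
        ],
    })
    resp.raise_for_status()
    verdict = resp.json()["choices"][0]["message"]["content"].strip().lower()
    return verdict.startswith("safe")
```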
**Best practices**:
* Implement guardrails appropriate to your **use case**.
* Conduct regular **security audits**.
* Monitor **model outputs** consistently.
* Keep **security policies** updated.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Private access
Understanding private connection options for Fireworks.ai services.
## Private connections
**Q: Do you provide private connections?**
Fireworks provides various forms of **private connections**:
**Cloud provider options**:
* **AWS PrivateLink**
* **GCP Private Service Connect**
**Additional options**:
* **Direct Routing**, which allows you to connect your dedicated API Gateway
**Benefits**:
* **Enhanced security**
* **Reduced latency**
* **Private network communication**
* **Improved reliability**
**Implementation process**:
1. **Contact support** to initiate setup.
2. **Choose connection type** based on your requirements.
3. **Configure network settings** as per the guidelines.
4. **Verify connectivity** to ensure successful integration.
***
## Additional information
If you experience any issues during these processes, you can:
* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
# Fine-tuning models
Llama 3.2 1B Instruct, Llama 3.2 3B Instruct, Llama 3.1 8B Instruct and Llama
3.1 70B Instruct are now supported!
We utilize [LoRA (Low-Rank Adaptation)](https://huggingface.co/docs/diffusers/training/lora)
for efficient and effective fine-tuning of large language models. LoRA is used for
fine-tuning all models except our 70B models, which use qLoRA (quantized LoRA) to improve
training speeds. Take advantage of this opportunity to enhance your models with our
cutting-edge technology!
## Introduction
Fine-tuning a model with a dataset can be useful for several reasons:
1. **Enhanced Precision**: It allows the model to adapt to the unique attributes and trends within the dataset, leading to significantly improved precision and effectiveness.
2. **Domain Adaptation**: While many models are developed with general data, fine-tuning them with specialized, domain-specific datasets ensures they are finely attuned to the specific requirements of that field.
3. **Bias Reduction**: General models may carry inherent biases. Fine-tuning with a well-curated, diverse dataset aids in reducing these biases, fostering fairer and more balanced outcomes.
4. **Contemporary Relevance**: Information evolves rapidly, and fine-tuning with the latest data keeps the model current and relevant.
5. **Customization for Specific Applications**: This process allows for the tailoring of the model to meet unique objectives and needs, an aspect not achievable with standard models.
In essence, fine-tuning a model with a specific dataset is a pivotal step in ensuring its enhanced accuracy, relevance, and suitability for specific applications. Let's walk through fine-tuning a model!
Fine-tuned model inference on Serverless is slower than base model inference on Serverless. For use cases that need low latency, we recommend using [on-demand deployments](https://docs.fireworks.ai/guides/ondemand-deployments). For on-demand deployments, fine-tuned model inference speeds are significantly closer to base model speeds (but still slightly slower). If you are only using one LoRA on-demand, [merging fine-tuned weights](https://huggingface.co/docs/peft/main/en/developer_guides/lora#merge-lora-weights-into-the-base-model) into the base model when using on-demand deployments will provide identical speed to base model inference. If you have an enterprise use case that needs fast fine-tuned models, please [contact us!](https://fireworks.ai/company/contact-us)
## Installing firectl
[`firectl`](/tools-sdks/firectl/firectl) is the command-line interface (CLI) used to manage and deploy resources on the [Fireworks AI Platform](https://fireworks.ai). Use `firectl` to manage fine-tuning jobs and their resulting models.
Please visit the Firectl [Getting Started](/tools-sdks/firectl/firectl) Guide on installing and using `firectl`.
## Preparing your dataset
To fine-tune a model, we need to first upload a dataset. Once uploaded, this dataset can be used to create one or more fine-tuning jobs. A dataset consists of a single JSONL file, where each line is a separate training example.
Limits:
* Minimum number of examples is 1.
* Maximum number of examples is 3,000,000.
Format:
* Each line of the file must be a valid JSON object.
For the rest of this tutorial, we will use the [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset as an example. Each record in this dataset consists of a `category`, `instruction`, an optional `context`, and the expected `response`. Here are a few sample records:
```json
{"instruction": "When did Virgin Australia start operating?", "context": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.", "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.", "category": "closed_qa"}
{"instruction": "Which is a species of fish? Tope or Rope", "context": "", "response": "Tope", "category": "classification"}
{"instruction": "Why can camels survive for long without water?", "context": "", "response": "Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.", "category": "open_qa"}
{"instruction": "Alice's parents have three daughters: Amy, Jessy, and what\u2019s the name of the third daughter?", "context": "", "response": "The name of the third daughter is Alice", "category": "open_qa"}
{"instruction": "When was Tomoaki Komorida born?", "context": "Komorida was born in Kumamoto Prefecture on July 10, 1981. After graduating from high school, he joined the J1 League club Avispa Fukuoka in 2000. Although he debuted as a midfielder in 2001, he did not play much and the club was relegated to the J2 League at the end of the 2001 season. In 2002, he moved to the J2 club Oita Trinita. He became a regular player as a defensive midfielder and the club won the championship in 2002 and was promoted in 2003. He played many matches until 2005. In September 2005, he moved to the J2 club Montedio Yamagata. In 2006, he moved to the J2 club Vissel Kobe. Although he became a regular player as a defensive midfielder, his gradually was played less during the summer. In 2007, he moved to the Japan Football League club Rosso Kumamoto (later Roasso Kumamoto) based in his local region. He played as a regular player and the club was promoted to J2 in 2008. Although he did not play as much, he still played in many matches. In 2010, he moved to Indonesia and joined Persela Lamongan. In July 2010, he returned to Japan and joined the J2 club Giravanz Kitakyushu. He played often as a defensive midfielder and center back until 2012 when he retired.", "response": "Tomoaki Komorida was born on July 10,1981.", "category": "closed_qa"}
```
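Before uploading, it can help to confirm that every line parses as valid JSON. A minimal sketch (the file path is illustrative):
```python
import json

path = "path/to/dataset.jsonl"
with open(path) as f:
    for lineno, line in enumerate(f, start=1):
        try:
            json.loads(line)  # each line must be a valid JSON object
        except json.JSONDecodeError as e:
            print(f"line {lineno} is not valid JSON: {e}")
```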
To create a dataset, run:
```shell
firectl create dataset my-dataset path/to/dataset.jsonl
```
and you can check the dataset with:
```shell
firectl get dataset my-dataset
```
To use an existing Hugging Face dataset, please refer to the [script below](#hugging-face-dataset-to-jsonl) for conversion. Datasets are private and cannot be viewed by other accounts.
## Starting your tuning job
Fireworks supports three types of fine-tuning depending on the modeling objective:
* **Text completion** - used to train a text generation model
* **Text classification** - used to train a text classification model
* **Conversation** - used to train a chat/conversation model
There are two ways to specify settings for your tuning job. You can create a settings YAML file and/or specify them using command-line flags. If a setting is present in both, the command-line flag takes precedence.
To start a job, run:
```shell
firectl create fine-tuning-job --settings-file path/to/settings.yaml --display-name "My Job"
```
firectl will return the fine-tuning job ID.
### Starting from a base model or a PEFT addon model
When creating a fine-tuning job, you can start tuning from a base model, or from a model you tuned earlier (PEFT addon):
1. **Base model**: Use the `base_model` parameter to start from a pre-trained base model.
2. **PEFT addon model**: Use the `warm_start_from` parameter to start from an existing PEFT addon model.
You must specify either `base_model` or `warm_start_from` in your settings file or command-line flags.
The following sections provide examples of a settings file for the given tasks.
### Text completion
To train a text completion model, you need to define an input template and output template from your JSON fields. To directly use a field as inputs or outputs, simply set the input and output templates as the field names.
You can also add additional text to the input and output templates. For example, this example demonstrates training on `context`, `instruction` and the `response` fields with added text around the fields. We won't use the `category` field at all.
```yaml YAML(Base Model)
# The ID of the dataset you created above.
dataset: my-dataset
text_completion:
# How the fields of the JSON dataset should be formatted into the input text.
input_template: "### GIVEN THE CONTEXT: {context} ### INSTRUCTION: {instruction} ### RESPONSE IS: "
# How the fields of the JSON dataset should be formatted into the output text.
output_template: "ANSWER: {response}"
# The Fireworks model name of the base model.
base_model: accounts/fireworks/models/llama-v3p1-8b-instruct
```
```yaml YAML(Warm Start)
# The ID of the dataset you created above.
dataset: my-dataset
text_completion:
# How the fields of the JSON dataset should be formatted into the input text.
input_template: "### GIVEN THE CONTEXT: {context} ### INSTRUCTION: {instruction} ### RESPONSE IS: "
# How the fields of the JSON dataset should be formatted into the output text.
output_template: "ANSWER: {response}"
# The Fireworks model name of the peft addon model.
warm_start_from: accounts/<account-id>/models/<peft-addon-model-id>
```
### Conversation
To train a conversation model, the dataset must conform to the schema expected by the [Chat Completions API](querying-text-models#chat-completions-api). Each JSON object of the dataset must contain a single array field called `messages`. Each message is an object containing two fields:
* `role` - one of "system", "user", or "assistant".
* `content` - the content of the message.
A message with the "system" role is optional, but if specified, must be the first message of the conversation. Subsequent messages start with "user" and alternate between "user" and "assistant". For example:
```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "blue"}]}
{"messages": [{"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "2"}, {"role": "user", "content": "Now what is 2+2?"}, {"role": "assistant", "content": "4"}]}
```
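If your raw data is in instruction/response form, like the Dolly records shown earlier, a small script can reshape it into this `messages` schema. A minimal sketch (field names follow the Dolly example; file names are illustrative):
```python
import json

with open("dolly.jsonl") as src, open("conversations.jsonl", "w") as dst:
    for line in src:
        rec = json.loads(line)
        # Prepend the optional context to the user turn when present.
        user = rec["instruction"]
        if rec.get("context"):
            user = f"{rec['context']}\n\n{rec['instruction']}"
        example = {"messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": rec["response"]},
        ]}
        dst.write(json.dumps(example) + "\n")
```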
The settings file for tuning a conversation model looks like:
```yaml YAML(Base Model)
# The ID of the dataset you created above.
dataset: my-dataset
conversation: {}
# The Fireworks model name of the base model.
base_model: accounts/fireworks/models/llama-v3p1-8b-instruct
```
```yaml YAML(Warm Start)
# The ID of the dataset you created above.
dataset: my-dataset
conversation: {}
# The Fireworks model name of the peft addon model.
warm_start_from: accounts/<account-id>/models/<peft-addon-model-id>
```
Or, you can optionally pass in a [Jinja](https://jinja.palletsprojects.com/en/3.1.x/) template to format the messages. The settings file then looks like:
```yaml YAML(Base Model)
# The ID of the dataset you created above.
dataset: my-dataset
conversation:
jinja_template:
# The Fireworks model name of the base model.
base_model: accounts/fireworks/models/llama-v3p1-8b-instruct
```
```yaml YAML(Warm Start)
# The ID of the dataset you created above.
dataset: my-dataset
conversation:
jinja_template:
# The Fireworks model name of the peft addon model.
warm_start_from: accounts/<account-id>/models/<peft-addon-model-id>
```
An example template string looks like:
```
{%- set _mode = mode | default('generate', true) -%}
{%- set stop_token = '<|eot_id|>' -%}
{%- set message_roles = ['USER', 'ASSISTANT'] -%}
{%- set ns = namespace(initial_system_message_handled=false, last_assistant_index_for_eos=-1, messages=messages) -%}
{%- for message in ns.messages -%}
{%- if loop.last and message['role'] | upper == 'ASSISTANT' -%}
{%- set ns.last_assistant_index_for_eos = loop.index0 -%}
{%- endif -%}
{%- endfor -%}
{%- if _mode == 'generate' -%}
{{ bos_token }}
{%- endif -%}
{%- for message in ns.messages -%}
{%- if message['role'] | upper == 'SYSTEM' and not ns.initial_system_message_handled -%}
{%- set ns.initial_system_message_handled = true -%}
{{ '<|start_header_id|>system<|end_header_id|>\n\n' + message['content'] + stop_token }}
{%- elif message['role'] | upper != 'SYSTEM' -%}
{%- if (message['role'] | upper == 'USER') != ((loop.index0 - (1 if ns.initial_system_message_handled else 0)) % 2 == 0) -%}
{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
{%- endif -%}
{%- if message['role'] | upper == 'USER' -%}
{{ '<|start_header_id|>user<|end_header_id|>\n\n' + message['content'] + stop_token }}
{%- elif message['role'] | upper == 'ASSISTANT' -%}
{%- if _mode == 'train' -%}
{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' + unk_token + message['content'] + stop_token + unk_token }}
{%- else -%}
{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' + message['content'] + (stop_token if loop.index0 != ns.last_assistant_index_for_eos else '') }}
{%- endif -%}
{%- endif -%}
{%- endif -%}
{%- endfor -%}
{%- if _mode == 'generate' and ns.last_assistant_index_for_eos == -1 -%}
{{ '<|start_header_id|>assistant<|end_header_id|>' }}
{%- endif -%}
```
Notice: when using conversation settings, polished default Jinja templates are provided for models recommended for chat tuning to guarantee quality; see the [conversation recommended](/fine-tuning/fine-tuning-models#supported-base-models) column in the model spec section. For other models, a generic default template is used if no template is provided to override it, but the tuned model quality might not be optimal.
### Text classification
In this example, we'll only be training on the `instruction` and `category` fields. We won't use the `context` and `response` fields at all.
```yaml
# The ID of the dataset you created above.
dataset: my-dataset
text_classification:
# The JSON field containing the input text to be classified.
text: instruction
# The JSON field containing the classification label.
label: category
# The Fireworks model name of the base model.
base_model: accounts/fireworks/models/llama-v3p1-8b-instruct
```
## Checking the job status
You can monitor the progress of the tuning job by running
```shell
firectl get fine-tuning-job <fine-tuning-job-id>
```
Once the job successfully completes, a model will be created in your account. You can see a list of models by running:
```shell
firectl list models
```
Or if you specified a model ID when creating the fine-tuning job, you can get the model directly:
```shell
firectl get model <model-id>
```
## Deploying and using a model
Before using your fine-tuned model for inference, you must deploy it. Please refer to our guides on [Deploying a model](/models/deploying#peft-addons) and [Querying text models](/guides/querying-text-models) for detailed instructions.
Some base models may not support serverless addons. To check:
1. Run `firectl -a fireworks get <base-model-id>`
2. Look under `Deployed Model Refs` to see if a `fireworks`-owned deployment exists, e.g. `accounts/fireworks/deployments/3c7a68b0`
3. If one exists, serverless addons are supported for that base model
If the base model doesn't support serverless addons, you will need to use an [on-demand deployment](/models/deploying#deploying-to-on-demand) to deploy it.
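Once deployed, the model can be queried like any other Fireworks model. A minimal sketch using the OpenAI-compatible Python client, where the account and model IDs are placeholders and the `openai` package is assumed to be installed:
```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

resp = client.chat.completions.create(
    model="accounts/<account-id>/models/<model-id>",  # your fine-tuned model
    messages=[{"role": "user", "content": "Which is a species of fish? Tope or Rope"}],
)
print(resp.choices[0].message.content)
```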
## Additional tuning options
### Evaluation
By default, the fine-tuning job will not run any post-training evaluation. You can enable model evaluation and also configure the amount of training data used for eval. The default is 15%.
For classification tasks, we measure the number of examples that match the expected label.
For conversation and text completion tasks, we use [perplexity](https://en.wikipedia.org/wiki/Perplexity) to measure how well the model generates responses.
Sample usage:
```yaml
# ...
evaluation: True
evaluation_split: 0.2
```
```shell
firectl create fine-tuning-job \
--evaluation-split 0.2 \
...
```
### Epochs
Epochs is the number of passes over the training data the job should train for. Non-integer values are supported. If not specified, a reasonable default will be chosen for you.
**Note: the maximum allowed value of (dataset examples \* epochs) is 3 million.**
```yaml
# ...
epochs: 2.0
```
```shell
firectl create fine-tuning-job \
--epochs 2.0 \
...
```
### Learning rate
The learning rate used in training can be configured. If not specified, a reasonable default value will be chosen.
```yaml
# ...
learning_rate: 0.0001
```
```shell
firectl create fine-tuning-job \
--learning-rate 0.0001 \
...
```
### Batch size
The training batch size can be configured as a positive integer that is a power of 2 and less than 1024. If not specified, a reasonable default value will be chosen.
```yaml
# ...
batch_size: 32
```
```shell
firectl create fine-tuning-job \
--batch-size 32 \
...
```
### LoRA rank
LoRA rank refers to the dimensionality of the trainable matrices in Low-Rank Adaptation fine-tuning, balancing model adaptability and computational efficiency when fine-tuning large language models. The LoRA rank used in training can be configured as a positive integer with a maximum value of 32. If not specified, a reasonable default value will be chosen.
```yaml
# ...
lora_rank: 16
```
```shell
firectl create fine-tuning-job \
--lora-rank 16 \
...
```
### Training progress and monitoring
The fine-tuning service integrates with Weights & Biases to provide observability into the tuning process. To use this feature, you must have a Weights & Biases account and have provisioned an API key.
```yaml
wandb_entity: my-org
wandb_api_key: xxx
wandb_project: My Project
```
```shell
firectl create fine-tuning-job \
--wandb-entity my-org \
--wandb-api-key xxx \
--wandb-project "My Project" \
...
```
### Model ID
By default, the fine-tuning job will generate a random unique ID for the model. This ID is used to refer to the model at inference time. You can optionally specify a custom ID, subject to the [ID constraints](https://docs.fireworks.ai/getting-started/concepts#resource-names-and-ids).
```yaml
model_id: my-model
```
```shell
firectl create fine-tuning-job \
--model-id my-model \
...
```
### Job ID
By default, the fine-tuning job will generate a random unique ID for the fine-tuning job. You can optionally choose a custom ID.
```yaml
job_id: my-fine-tuning-job
```
```shell
firectl create fine-tuning-job \
--job-id my-fine-tuning-job \
...
```
## Downloading model weights
Model weights download is now open to everyone. Simply run the command below:
```shell
firectl download model <model-id>
```
## Appendix
### Supported base models
The following base models are supported for parameter-efficient fine-tuning (PEFT) and can be deployed as PEFT add-ons on Fireworks [serverless](/models/deploying#deploying-to-serverless) and [on-demand](/models/deploying#deploying-to-on-demand) deployments, using the default parameters below. Serverless deployment is only available for a subset of fine-tuned models - run `firectl get <base-model-id>` (see the [models overview](https://docs.fireworks.ai/models/overview#introduction)) or check the [models page](https://fireworks.ai/models) to see if there's an active serverless deployment.
The cut-off length is the maximum limit on the sum of input tokens and
generated output tokens.
| Model | Batch Size | LoRA Rank | Epochs | Learning Rate | Cut-off Length | Conversation Recommended |
| ----------------------------------------------------------------------------------------------------------------------- | ---------- | --------- | ------ | ------------- | -------------- | ------------------------ |
| [accounts/fireworks/models/llama-v3p2-1b-instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) | 32 | 4 | 1 | 1.00E-04 | 16384 | Yes |
| [accounts/fireworks/models/llama-v3p2-3b-instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | 32 | 4 | 1 | 1.00E-04 | 16384 | Yes |
| [accounts/fireworks/models/llama-v3p1-70b-instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | 8 | 4 | 1 | 2.00E-05 | 8192 | Yes |
| [accounts/fireworks/models/llama-v3p1-8b-instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) | 16 | 8 | 1 | 1.00E-04 | 8192 | Yes |
| [accounts/fireworks/models/llama-v3-70b-instruct-hf](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | 8 | 4 | 1 | 2.00E-05 | 8192 | Yes |
| [accounts/fireworks/models/llama-v3-8b-instruct-hf](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 16 | 8 | 1 | 1.00E-04 | 8192 | Yes |
| [accounts/fireworks/models/mixtral-8x7b-instruct-hf](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | 16 | 8 | 1 | 1.00E-04 | 32768 | Yes |
| [accounts/fireworks/models/mixtral-8x22b-instruct-hf](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) | 16 | 8 | 1 | 1.00E-04 | 8192 | Yes |
| [accounts/fireworks/models/mixtral-8x22b-hf](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1) | 8 | 8 | 1 | 1.00E-04 | 8192 | No |
| [accounts/fireworks/models/mixtral-8x7b](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) | 16 | 8 | 1 | 1.00E-04 | 8192 | No |
| [accounts/fireworks/models/mistral-7b-instruct-v0p2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) | 16 | 8 | 1 | 1.00E-04 | 4096 | Yes |
| [accounts/fireworks/models/mistral-7b](https://huggingface.co/mistralai/Mistral-7B-v0.1) | 16 | 8 | 1 | 1.00E-04 | 4096 | Yes |
| [accounts/fireworks/models/code-qwen-1p5-7b](https://huggingface.co/Qwen/CodeQwen1.5-7B) | 16 | 8 | 1 | 3.00E-04 | 65536 | No |
| [accounts/fireworks/models/deepseek-coder-v2-lite-base](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Base) | 16 | 8 | 1 | 3.00E-04 | 16384 | No |
| [accounts/fireworks/models/deepseek-coder-7b-base](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base) | 16 | 8 | 1 | 3.00E-04 | 16384 | No |
| [accounts/fireworks/models/deepseek-coder-1b-base](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base) | 16 | 8 | 1 | 3.00E-04 | 16384 | No |
| [accounts/fireworks/models/codegemma-7b](https://huggingface.co/google/codegemma-7b) | 16 | 8 | 1 | 3.00E-04 | 8192 | No |
| [accounts/fireworks/models/codegemma-2b](https://huggingface.co/google/codegemma-2b) | 16 | 8 | 1 | 3.00E-04 | 8192 | No |
| [accounts/fireworks/models/starcoder2-15b](https://huggingface.co/bigcode/starcoder2-15b) | 16 | 8 | 1 | 1.00E-04 | 16384 | No |
| [accounts/fireworks/models/starcoder2-7b](https://huggingface.co/bigcode/starcoder2-7b) | 16 | 8 | 1 | 1.00E-04 | 16384 | No |
| [accounts/fireworks/models/starcoder2-3b](https://huggingface.co/bigcode/starcoder2-3b) | 16 | 8 | 1 | 1.00E-04 | 16384 | No |
| [accounts/fireworks/models/stablecode-3b](https://huggingface.co/stabilityai/stable-code-3b) | 16 | 8 | 1 | 3.00E-04 | 16384 | No |
| [accounts/fireworks/models/qwen2-72b-instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | 8 | 8 | 1 | 3.00E-04 | 16384 | Yes |
### Hugging Face dataset to JSONL
To convert a Hugging Face dataset to the JSONL format supported by our fine-tuning service, you can use the following Python script:
```python
import json
from datasets import load_dataset

# Replace <dataset-name> with the Hugging Face dataset you want to export.
dataset = load_dataset("<dataset-name>")
# Replace <split> with the appropriate split you want to export, e.g., 'train', 'test', etc.
split_data = dataset["<split>"]

counter = 0
with open("<output-file>.jsonl", "w") as f:
    for item in split_data:
        json.dump(item, f)
        counter += f.write("\n")  # write() returns 1, so this counts the lines written

print(f"{counter} lines converted")
```
## Support
We'd love to hear what you think! Please connect with the team, ask questions, and share your feedback in the [#fine-tuning](https://discord.gg/zYDmm4zqmq) Discord channel.
## Pricing
We charge based on the total number of tokens processed (dataset tokens \* number of epochs). For example, a dataset containing 1 million tokens trained for 2 epochs is billed as 2 million tokens. Please see our [Pricing](https://fireworks.ai/pricing#fine-tuning) page for details.
# Fine-tuning models via API
## Introduction
This guide walks you through the process of fine-tuning a model using the Fireworks REST API.
For an overview of fine-tuning see [https://docs.fireworks.ai/fine-tuning/fine-tuning-models](https://docs.fireworks.ai/fine-tuning/fine-tuning-models)
## Prepare the dataset
Create your dataset file in JSONL format. Each line should be a valid JSON object containing your training examples.
## Create a dataset record
* **Endpoint:** POST /v1/accounts/\{account\_id}/datasets
* **Request Body:**
```json
{
"datasetId": "your-dataset-id",
"dataset": {
"userUploaded": {}
}
}
```
* **Curl Example:**
```bash
curl -X POST "https://api.fireworks.ai/v1/accounts/{account_id}/datasets" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"datasetId": "your-dataset-id",
"dataset": {
"userUploaded": {}
}
}'
```
This will create a record with a dataset id you can use to upload the file.
## Upload your dataset file
### Option 1 - Upload your dataset file directly (recommended for files \< 150MB)
A streamlined file upload API is available for file sizes less than 150 MB.
* **Endpoint:** POST /v1/accounts/\{account\_id}/datasets/\{dataset\_id}:upload
* **Curl Example:**
```bash
curl -X POST "https://api.fireworks.ai/v1/accounts/{account_id}/datasets/{dataset_id}:upload" \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@/path/to/your-dataset-file.jsonl"
```
Post the file using the multipart/form-data format.
After this step you can [skip to Check the dataset state](#check-the-dataset-state)
### Option 2 - Upload your file using the signed upload URL (recommended for files > 150MB)
Alternatively, for larger files, get the signed URL for uploading your file directly to cloud storage.
* **Endpoint:** POST /v1/accounts/\{account\_id}/datasets/\{dataset\_id}:getUploadEndpoint
* **Request Body:**
```json
{
"filenameToSize": {
"your-dataset-file.jsonl": file_size_in_bytes
}
}
```
* **Curl Example:**
You can use jq to directly extract the signed URL:
```bash
SIGNED_UPLOAD_URL=$(
curl -s -X POST "https://api.fireworks.ai/v1/accounts/{account_id}/datasets/{dataset_id}:getUploadEndpoint" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"filenameToSize": {
"your-dataset-file.jsonl": file_size_in_bytes
}
}' | jq -r '.filenameToSignedUrls | to_entries | .[0].value'
)
```
The response will contain a signed URL that looks similar to this:
```
https://storage.googleapis.com/fireworks-models-...
```
### Upload your dataset (Option 2 continued)
Use the following curl command to upload your file:
```bash
curl -X PUT \
-H "Content-Type: application/octet-stream" \
-H "x-goog-content-length-range: FILE_SIZE_IN_BYTES,FILE_SIZE_IN_BYTES" \
--data-binary "@/path/to/your-dataset-file.jsonl" \
"$SIGNED_UPLOAD_URL"
```
### Validate the dataset upload (Option 2 continued)
* **Endpoint:** POST /v1/accounts/\{account\_id}/datasets/\{dataset\_id}:validateUpload
* **Curl Example:**
```bash
curl -X POST "https://api.fireworks.ai/v1/accounts/{account_id}/datasets/{dataset_id}:validateUpload" \
-H "Authorization: Bearer YOUR_API_KEY"
```
Calling validate will finalize the dataset upload.
## Check the dataset state
* **Endpoint:** GET /v1/accounts/\{account\_id}/datasets/\{dataset\_id}
* **Curl Example:**
```bash
curl -X GET "https://api.fireworks.ai/v1/accounts/{account_id}/datasets/{dataset_id}" \
-H "Authorization: Bearer YOUR_API_KEY"
```
After successful creation and upload, the API will set the dataset state to `READY`.
## Create a fine-tuning job
After uploading your dataset, create a fine-tuning job using the following command:
* **Endpoint:** POST /v1/accounts/\{account\_id}/fineTuningJobs
* **Request Body:**
```json
{
"model_id": "optional-model-id",
"dataset": "accounts/{account_id}/datasets/{dataset_id}",
"base_model": "accounts/fireworks/models/{base_model_id}",
"text_completion": {
"input_template": "### GIVEN THE CONTEXT: {context} ### INSTRUCTION: {instruction} ### RESPONSE IS: ",
"output_template": "ANSWER: {response}"
}
}
```
* **Curl Example:**
```bash
curl -X POST "https://api.fireworks.ai/v1/accounts/{account_id}/fineTuningJobs" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model_id": "optional-model-id",
"dataset": "accounts/{account_id}/datasets/{dataset_id}",
"base_model": "accounts/fireworks/models/{base_model_id}",
"text_completion": {
"input_template": "### GIVEN THE CONTEXT: {context} ### INSTRUCTION: {instruction} ### RESPONSE IS: ",
"output_template": "ANSWER: {response}"
}
}'
```
Adjust the `input_template` and `output_template` fields as needed to match your dataset format.
By default, the fine-tuning job will generate a random unique ID for the model. This ID is used to refer to the model at inference time. You can optionally choose a custom `model_id`.
## Get the job status
* **Endpoint:** GET /v1/accounts/\{account\_id}/fineTuningJobs/\{fine\_tuning\_job\_id}
* **Curl Example:**
```bash
curl -X POST "https://api.fireworks.ai/v1/accounts/{account_id}/fineTuningJobs/{fine_tuning_job_id}" \
-H "Authorization: Bearer YOUR_API_KEY" \
```
The job should now have state PENDING or RUNNING.
## Deleting a job
* **Endpoint:** DELETE /v1/accounts/\{account\_id}/fineTuningJobs/\{fine\_tuning\_job\_id}
* **Curl Example:**
```bash
curl -X DELETE "https://api.fireworks.ai/v1/accounts/{account_id}/fineTuningJobs/{fine_tuning_job_id}" \
-H "Authorization: Bearer YOUR_API_KEY"
```
## Downloading model weights
After your fine-tuning job is complete, a model will be created in your account. You can download the model weights using the following steps.
If you specified a model\_id when creating the fine-tuning job, you can get the model weights directly.
You can see a list of models in your account using the list models endpoint:
* **Endpoint:** GET /v1/accounts/\{account\_id}/models/
Use the model\_id to get the signed URLs for downloading the model files.
* **Endpoint:** GET /v1/accounts/\{account\_id}/models/\{model\_id}:getDownloadEndpoint
* **Curl Example:**
```bash
curl -X GET "https://api.fireworks.ai/v1/accounts/{account_id}/models/{model_id}:getDownloadEndpoint" \
-H "Authorization: Bearer YOUR_API_KEY"
```
The response will contain a map of filenames to signed URLs for downloading each file.
For each file in the response, use the provided signed URL to download the file.
* **Curl Example:**
```bash
curl -o "/path/to/save/model_file.json" "$SIGNED_DOWNLOAD_URL"
```
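A hedged sketch of this download loop in Python; the `filenameToSignedUrls` field name mirrors the upload endpoint above and should be confirmed against the actual response, and the account and model IDs are placeholders:
```python
import os
import requests

account_id = "<account-id>"
model_id = "<model-id>"

resp = requests.get(
    f"https://api.fireworks.ai/v1/accounts/{account_id}/models/{model_id}:getDownloadEndpoint",
    headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"},
)
resp.raise_for_status()

# Assumed response shape: a map of filename -> signed URL.
for filename, url in resp.json().get("filenameToSignedUrls", {}).items():
    with requests.get(url, stream=True) as dl:
        dl.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in dl.iter_content(chunk_size=1 << 20):
                f.write(chunk)
```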
# Concepts
This document outlines basic Fireworks AI concepts.
## Resources
### Account
Your account is the top-level resource under which other resources are located. Quotas are also
enforced at the account level.
For developer accounts, the account ID is auto-generated from the email address used to sign up.
Enterprise accounts can optionally choose a custom, unique account ID.
### User
A user is an email address associated with an account. Only users added to an account have access
to private resources within the account.
### Model
A model is a set of model weights and metadata associated with the model. A model cannot be used
for inference until it is deployed to one or more deployments, creating a "deployed model". There
are two types of models:
* Base models
* Parameter-efficient fine-tuned (PEFT) addons
See our [Models overview](/models/overview) page for details.
### Deployment
A deployment is a collection of one or more model servers that host one base model and optionally
one or more PEFT addons (also known as LoRA adapters).
Fireworks provides a set of "serverless" deployments that host common base models. These deployments
may be used for [serverless inference](/models/overview#serverless-inference) as well as hosting [serverless addons](/models/overview#serverless-addons).
### Deployed model
A deployed model is an instance of a base model or PEFT addon that is loaded into a deployment.
### Dataset
A dataset is an immutable set of training examples that can be used to fine-tune a model.
### Fine-tuning job
A fine-tuning job is an offline training job that uses a dataset to train a PEFT addon model.
## Resource names and IDs
A full resource name looks like
```
accounts/my-account/models/my-model
```
The individual segments `my-account` and `my-model` are account and [model IDs](https://docs.fireworks.ai/models/overview), respectively.
Resource IDs must satisfy the following constraints:
* must be between 1 and 63 characters (inclusive)
* may contain only a-z, 0-9, and hyphens (-)
* must not begin or end with a hyphen (-)
Some APIs take the full resource name, while others may take a resource ID if the context is clear.
## Control plane and data plane
The Fireworks API can be split into a control plane and a data plane.
* The **control plane** consists of APIs used for managing the lifecycle of resources. This
includes your account, models, and deployments.
* The **data plane** consists of the APIs used for inference and the backend services that power
them.
## Interfaces
Users can interact with Fireworks through one of many interfaces:
* The **web console** at [https://fireworks.ai](https://fireworks.ai)
* The command-line interface `firectl`
* [Python SDK](/tools-sdks/python-client/installation)
# Introduction
Fireworks AI is a generative AI inference platform to run and customize models with industry-leading speed and production-readiness.
## Welcome to Fireworks AI
## What we offer
The Fireworks platform empowers developers to create generative AI systems with the best quality, cost and speed. All publicly available services are pay-as-you-go with developer friendly [pricing](https://fireworks.ai/pricing). See the below list for offerings and docs links. Scroll further for more detailed descriptions and blog links.
* **Inference:** Run generative AI models on Fireworks-hosted infrastructure with our optimized FireAttention inference engine. Multiple inference options ensure there’s always a fit for your use case.
* **Modalities and Models:** Use 100s of models (or bring your own) across these modalities:
* [Text](https://docs.fireworks.ai/guides/querying-text-models)
* [Audio](https://docs.fireworks.ai/api-reference/audio-transcriptions)
* [Image](https://docs.fireworks.ai/api-reference/generate-a-new-image-from-a-text-prompt)
* [Embedding](https://docs.fireworks.ai/guides/querying-embeddings-models)
* [Vision-understanding](https://docs.fireworks.ai/guides/querying-vision-language-models)
* **Adaptation:** [Tune](https://docs.fireworks.ai/fine-tuning/fine-tuning-models) and optimize your model and deployment for the best quality, speed, and cost. [Serve](https://docs.fireworks.ai/models/deploying) and experiment with hundreds of fine-tuned models with our multi-LoRA [capabilities](https://fireworks.ai/blog/multi-lora).
* **Compound AI Development Framework:** Use [JSON mode](https://docs.fireworks.ai/structured-responses/structured-response-formatting), [grammar mode](https://docs.fireworks.ai/structured-responses/structured-output-grammar-based), [function calling](https://docs.fireworks.ai/guides/function-calling) or our Flumina framework to build a collaborative system with reliable and performant outputs.
## Inference
Fireworks offers three options for running generative AI models with unparalleled speed and cost.
* **Serverless**: The easiest way to get started. Use the most popular models on pre-configured GPUs. Pay per token and avoid cold boots.
* **[On-demand](https://fireworks.ai/blog/why-gpus-on-demand)** - The most flexible option for scaling. Use private GPUs to support your specific needs and only pay while you're using them. GPUs running Fireworks software offer both \~250% improved throughput and 50% improved latency compared to vLLM. Excels for:
* **Production volume** - Per-token costs decrease with more volume and there are no set rate limits
* **Custom needs and reliability** - On-demand GPUs are private to you. This enables complete control to tailor deployments for speed/throughput/reliability or to run more specialized models
* **Enterprise Reserved GPUs** - Use private GPUs with hardware and software set-up personally tailored by the Fireworks team for your use case. Enjoy SLAs, dedicated support, bring-your-own-cloud (BYOC) deployment options, and enterprise-only optimizations.
| Property | **Serverless** | **On-demand** | **Enterprise reserved** |
| -------------------------- | -------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| **Performance** | Industry-leading speed on Fireworks-curated set-up. Performance may vary with others’ usage. | Speed dependent on user-specified GPU configuration and private usage. Per GPU latency should be significantly faster than vLLM. | Tailor-made set-up by Fireworks AI experts for best possible latency |
| **Getting Started** | Self-serve - immediately use serverless with 1 line of code | Self-serve - configure GPUs, then use them with 1 line of code. | Chat with Fireworks |
| **Scaling and management** | Scale up and down freely within rate limits | Option for auto-scaling GPUs with traffic. GPUs scale to zero automatically, so no charge for unused GPUs and for boot-ups. | Chat with Fireworks |
| **Pricing** | Pay fixed price per token | Pay per GPU second with no commitments. Per GPU throughput should be significantly greater than options like vLLM. | Customized price based on reserved GPU capacity |
| **Commitment** | None | None | Arrange plan length with Fireworks |
| **Rate limits** | Yes, see [quotas](https://docs.fireworks.ai/accounts/quotas) | No rate limits. [Quotas](https://docs.fireworks.ai/accounts/quotas) on number of GPUs | None |
| **Model Selection** | Collection of popular models, curated by Fireworks | Use 100s of pre-uploaded models or upload your own custom model within supported [architecture](https://docs.fireworks.ai/models/uploading-custom-models) | Use 100s of pre-uploaded models or upload any model |
## FireOptimizer
**FireOptimizer** - Fireworks optimizes inference for your workload and your use case through FireOptimizer. FireOptimizer includes several optimization techniques. Publicly available features are:
* **[Fine-tuning](https://fireworks.ai/blog/fine-tune-launch)** - Quickly fine-tune models with LoRA for the best quality on your use case
* Upload data and choose your model to start tuning
* Pay per token of training data.
* Serve and evaluate models immediately on Fireworks
* Download model weights to use anywhere
* **[Multi-LoRA serving](https://fireworks.ai/blog/multi-lora)** - Deploy 100s of fine-tuned models at no extra cost.
* Zero extra cost for serving LoRAs. 1 million requests with 50 models is the same price as 1 million requests with 1 model.
* Use models fine-tuned on Fireworks or upload your own fine-tuned adapter
* Host hundreds of models on the same deployment on either serverless or dedicated deployments
## Compound AI
Fireworks makes it easy to use multiple models and modalities together in one compound AI system. Features include:
* **[JSON mode and grammar mode](https://fireworks.ai/blog/why-do-all-LLMs-need-structured-output-modes)** - Provide structure to any LLM on Fireworks with either (a) JSON schema (b) Context-free grammar to guarantee that LLM output follows your desired format. These structured output modes are particularly useful to ensure LLMs can reliably call and pipe outputs to other models, APIs and components.
* **[Function calling](https://fireworks.ai/blog/firefunction-v2-launch-post)** - Fireworks offers function calling support via our proprietary Firefunction models or Llama 3.1 70B
* **Flumina** - Fireworks enables multimedia apps to be easily packaged together and deployed and scaled with low-latency through the Flumina server apps framework. Contact us to get Flumina access
# Quickstart
Get started in 5 minutes
Fireworks.ai is a lightning-fast inference platform that serves generative AI models. All models are exposed via the `completions` and `chat completions` APIs.
Using the API, you can build on popular open-source models and custom fine-tuned models like FireFunction, Hermes 2 Pro, etc.
Experience all our models in the [model playground!](https://fireworks.ai/models/fireworks/mixtral-8x7b-instruct)
This quickstart helps you get started in minutes. If you want to explore further, please refer to the [guides](/guides/querying-text-models) section or the [API reference](/api-reference/introduction).
In this guide, you will:
* Set up your development environment
* Choose an SDK
* Call the Fireworks API with an API Key
## Account Creation
Create a [Fireworks AI](https://fireworks.ai/login) account. Under Account Settings, click on [API Keys](https://fireworks.ai/api-keys) to generate one.
Please keep the API Key in a secure location.
### Set up developer environment
Before installing, ensure that you have the right version of Python installed. Optionally, you may want to set up a virtual environment as well.
```bash
pip install --upgrade fireworks-ai
```
The Fireworks Python client is OpenAI API-compatible.
Step-by-step instructions for setting an environment variable for respective OS platforms:
Depending on your shell, you'll need to edit either `~/.bash_profile` for Bash or `~/.zshrc` for `Zsh`.
You can do this by running the command:
```bash bash
vim ~/.bash_profile
```
```zsh zsh
vim ~/.zshrc
```
Add a new line to the file with the following:
```bash
export FIREWORKS_API_KEY=""
```
After saving the file, apply the changes by restarting your terminal session or by sourcing the file you edited:
```bash bash
source ~/.bash_profile
```
```zsh zsh
source ~/.zshrc
```
You can verify that the variable has been set correctly by running `echo $FIREWORKS_API_KEY`
You can open Command Prompt by searching for it in the Windows search bar or by pressing Win + R, typing cmd, and pressing Enter.
```
setx FIREWORKS_API_KEY ""
```
To verify that the variable has been set correctly, you can close and reopen Command Prompt and type:
```
echo %FIREWORKS_API_KEY%
```
You can instantiate the client with the generated API key and call the Fireworks API.
```python
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-8b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
)
print(response.choices[0].message.content)
```
Before installing, ensure that you have the right version of Python installed. Optionally, you may want to set up a virtual environment as well.
```bash
pip install --upgrade openai
```
The Fireworks AI platform offers a drop-in replacement for the OpenAI Python client.
Step-by-step instructions for setting an environment variable for respective OS platforms:
Depending on your shell, you'll need to edit either `~/.bash_profile` for Bash or `~/.zshrc` for `Zsh`.
You can do this by running the command:
```bash bash
vim ~/.bash_profile
```
```zsh zsh
vim ~/.zshrc
```
Add a new line to the file with the following:
```bash
export OPENAI_API_BASE="https://api.fireworks.ai/inference/v1"
export OPENAI_API_KEY=""
```
After saving the file, apply the changes by restarting your terminal session or by sourcing the file you edited:
```bash bash
source ~/.bash_profile
```
```zsh zsh
source ~/.zshrc
```
You can verify that the variable has been set correctly by running
`echo $OPENAI_API_KEY`
You can open Command Prompt by searching for it in the Windows search bar or by pressing Win + R, typing cmd, and pressing Enter.
```
setx OPENAI_API_BASE "https://api.fireworks.ai/inference/v1"
setx OPENAI_API_KEY ""
```
To verify that the variable has been set correctly, you can close and reopen Command Prompt and type:
```
echo %OPENAI_API_KEY%
```
You can instantiate the client with the generated API key and call the Fireworks API through the OpenAI Python SDK.
```python
from openai import OpenAI
import os

client = OpenAI(
    # pass the Fireworks base URL set above; the API key is read from the environment
    base_url=os.environ.get("OPENAI_API_BASE"),
    api_key=os.environ.get("OPENAI_API_KEY"),
)
response = client.chat.completions.create(
messages=[
{
"role": "user",
"content": "Say this is a test",
}
],
# notice the change in the model name
model="accounts/fireworks/models/llama-v3p1-8b-instruct",
)
print(response.choices[0].message.content)
```
Before installing, ensure that you have the right version of Node installed. Make sure you have `npm` installed, or use a package manager of your choice.
```bash
npm install openai
```
The Fireworks AI platform offers a drop-in replacement for the OpenAI JavaScript client.
Step-by-step instructions for setting an environment variable for respective OS platforms:
Depending on your shell, you'll need to edit either `~/.bash_profile` for Bash or `~/.zshrc` for `Zsh`.
You can do this by running the command:
```bash bash
vim ~/.bash_profile
```
```zsh zsh
vim ~/.zshrc
```
Add a new line to the file with the following:
```bash
export OPENAI_API_KEY=""
```
After saving the file, apply the changes by restarting your terminal session or by sourcing the file you edited:
```bash bash
source ~/.bash_profile
```
```zsh zsh
source ~/.zshrc
```
You can verify that the variable has been set correctly by running
`echo $OPENAI_API_KEY`
You can open Command Prompt by searching for it in the Windows search bar or by pressing Win + R, typing cmd, and pressing Enter.
```
setx OPENAI_API_KEY ""
```
To verify that the variable has been set correctly, you can close and reopen Command Prompt and type:
```
echo %OPENAI_API_KEY%
```
You can instantiate the client with the generated API key and call the Fireworks API through the OpenAI JavaScript SDK.
```javascript
import OpenAI from 'openai';
const openai = new OpenAI({
  baseURL: 'https://api.fireworks.ai/inference/v1',
apiKey: process.env['OPENAI_API_KEY']
});
const completion = await openai.chat.completions.create({
messages: [{ role: "user", content: "Say this is a test" }],
model: "accounts/fireworks/models/llama-v3p1-8b-instruct",
});
console.log(completion.choices[0].message.content);
```
cURL is a popular open-source command-line tool for sending HTTP requests. Most operating systems ship cURL by default.
If you are not sure whether you have it, follow the first two steps of this guide to set up cURL. Otherwise, we recommend skipping to **Step Three**.
Check if your operating system has cURL installed by running `curl https://api.fireworks.ai`
macOS comes with the cURL tool bundled with the operating system.
If you want to upgrade to the latest version shipped by the cURL project, we recommend installing homebrew:
```bash Homebrew
brew install curl
```
Most Linux distributions offer curl and libcurl to be installed if they are not installed by default.
```bash apt
apt install curl
```
```bash yum
yum install curl
```
Windows 10 comes with the cURL tool bundled with the operating system since version 1804.
If you have an older Windows version or just want to upgrade to the latest version shipped by the cURL project, download the latest official cURL release for Windows from [curl.se/windows](https://curl.se/windows).
Step-by-step instructions for setting an environment variable for respective OS platforms:
Depending on your shell, you'll need to edit either `~/.bash_profile` for Bash or `~/.zshrc` for `Zsh`.
You can do this by running the command:
```bash bash
vim ~/.bash_profile
```
```zsh zsh
vim ~/.zshrc
```
Add a new line to the file with the following:
```bash
export FIREWORKS_API_KEY=""
```
After saving the file, apply the changes by restarting your terminal session or by sourcing the file you edited:
```bash bash
source ~/.bash_profile
```
```zsh zsh
source ~/.zshrc
```
You can verify that the variable has been set correctly by running `echo $FIREWORKS_API_KEY`
You can open Command Prompt by searching for it in the Windows search bar or by pressing Win + R, typing cmd, and pressing Enter.
```
setx FIREWORKS_API_KEY ""
```
To verify that the variable has been set correctly, you can close and reopen Command Prompt and type:
```
echo %FIREWORKS_API_KEY%
```
Make your first API request with cURL. Notice the use of `$FIREWORKS_API_KEY`.
```
curl \
--header 'Authorization: Bearer '$FIREWORKS_API_KEY \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3-8b-instruct",
"messages": [{
"role": "user",
"content": "Say this is a test"
}]
}' \
--url https://api.fireworks.ai/inference/v1/chat/completions
```
More details on calling various APIs can be found at our [API Reference](/api-reference)
## Dive in further
Integrating Fireworks AI using LangChain
Learn Stable Diffusion 3 API
Create a unique model
Deploy on our blazing-fast inference stack
Have fun!
If you have any questions, please reach out to us on [Discord](https://discord.gg/mMqQxvFD9A) or [Twitter](https://twitter.com/thefireworksai).
# Using function-calling
The function calling API allows a user to describe the set of tools/functions available to the model and have the model intelligently choose the right set of function calls to invoke given the context. This functionality allows users to build dynamic agents that can get access to real-time information & produce structured outputs. The function calling API doesn't invoke any function calls. Instead, it generates the tool calls to make in [OpenAI](https://platform.openai.com/docs/guides/function-calling)-compatible format.
At a high level, function calling works by
1. The user specifies a **query** along with the **list of available tools** for the model. The tools are specified using [JSON Schema](https://json-schema.org/learn/getting-started-step-by-step).
2. The model intelligently detects intent and based on intent the model outputs either a normal conversation reply or a list of tools/functions to invoke for the user. Based on the specified schema, the model populates the correct set of arguments to invoke a function call.
3. The user receives a reply from the model. If the reply contains a function call specification - the user can execute that method and return its output to the model along with further queries/conversation.
## Resources
1. [Fireworks Blog Post on FireFunction-v2](https://fireworks.ai/blog/firefunction-v2-launch-post)
2. [Open AI Docs on Function Calling](https://platform.openai.com/docs/guides/function-calling)
3. [Open AI Cookbook on Function Calling](https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models)
4. [Function Calling Best Practices](#best-practices)
## Supported models
* [Firefunction-v2](https://fireworks.ai/models/fireworks/firefunction-v2) - Latest and most performant model
* [Firefunction-v1](https://fireworks.ai/models/fireworks/firefunction-v1) - Previous generation, Mixtral-based function calling model. Fast and excels at routing and structured output
## Example usage
**TL;DR** **This example tutorial is available as a Python notebook** \[[code](https://github.com/fw-ai/cookbook/blob/main/learn/function-calling/notebooks_firefunction_openai/fireworks_function_calling_demo.ipynb) | [Colab](https://colab.research.google.com/drive/1m7Bk1360CFI50y24KBVxRAKYuEU3pbPU?usp=sharing)].
For this example, let's consider a user looking for Nike's financial data. We will provide the model with a tool that the model is allowed to invoke & get access to the financial information of any company.
1. To achieve our goal, we will provide the model with information about the `get_financial_data` function. We detail its purpose, arguments, etc. in [JSON Schema](https://json-schema.org/). We send this information through the `tools` argument and the user query, as usual, through the `messages` argument.
```python Request
import openai
import json
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key = ""
)
messages = [
{"role": "system", "content": f"You are a helpful assistant with access to functions."
"Use them if required."},
{"role": "user", "content": "What are Nike's net income in 2022?"}
]
tools = [
{
"type": "function",
"function": {
# name of the function
"name": "get_financial_data",
# a good, detailed description for what the function is supposed to do
"description": "Get financial data for a company given the metric and year.",
# a well defined json schema: https://json-schema.org/learn/getting-started-step-by-step#define
"parameters": {
# for OpenAI compatibility, we always declare a top level object for the parameters of the function
"type": "object",
# the properties for the object would be any arguments you want to provide to the function
"properties": {
"metric": {
# JSON Schema supports string, number, integer, object, array, boolean and null
# for more information, please check out https://json-schema.org/understanding-json-schema/reference/type
"type": "string",
# You can restrict the space of possible values in an JSON Schema
# you can check out https://json-schema.org/understanding-json-schema/reference/enum for more examples on how enum works
"enum": ["net_income", "revenue", "ebdita"],
},
"financial_year": {
"type": "integer",
# If the model does not understand how it is supposed to fill the field, a good description goes a long way
"description": "Year for which we want to get financial data."
},
"company": {
"type": "string",
"description": "Name of the company for which we want to get financial data."
}
},
# You can specify which of the properties from above are required
# for more info on `required` field, please check https://json-schema.org/understanding-json-schema/reference/object#required
"required": ["metric", "financial_year", "company"],
},
},
}
]
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/firefunction-v2",
messages=messages,
tools=tools,
temperature=0.1
)
print(chat_completion.choices[0].message.model_dump_json(indent=4))
```
```json Response
{
"content": "",
"role": "assistant",
"function_call": null,
"tool_calls": [
{
"id": "call_XstygHYlzKrI8hbERr0ybeOQ",
"function": {
"arguments": "{\"metric\": \"net_income\", \"financial_year\": 2022, \"company\": \"Nike\"}",
"name": "get_financial_data"
},
"type": "function",
"index": 0
}
]
}
```
2. In our case, the model decides to invoke the tool `get_financial_data` with a specific set of arguments. **Note**: The model itself won't invoke the tool. It only specifies the arguments. When the model issues a function call, the finish reason is set to `tool_calls`. The API caller is responsible for parsing the function name and arguments supplied by the model and invoking the appropriate tool.
```python Call External API
def get_financial_data(metric: str, financial_year: int, company: str):
print(f"{metric=} {financial_year=} {company=}")
if metric == "net_income" and financial_year == 2022 and company == "Nike":
return {"net_income": 6_046_000_000}
else:
raise NotImplementedError()
function_call = chat_completion.choices[0].message.tool_calls[0].function
tool_response = locals()[function_call.name](**json.loads(function_call.arguments))
print(tool_response)
```
```json Response
metric='net_income' financial_year=2022 company='Nike'
{'net_income': 6046000000}
```
3. The API caller obtains the response from the tool invocation & passes its response back to the model for generating a response.
```python Request
agent_response = chat_completion.choices[0].message
# Append the response from the agent
messages.append(
{
"role": agent_response.role,
"content": "",
"tool_calls": [
tool_call.model_dump()
for tool_call in chat_completion.choices[0].message.tool_calls
]
}
)
# Append the response from the tool
messages.append(
{
"role": "tool",
"content": json.dumps(tool_response)
}
)
next_chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/firefunction-v2",
messages=messages,
tools=tools,
temperature=0.1
)
print(next_chat_completion.choices[0].message.content)
```
```json Response
{
"content": "Nike's net income for the year 2022 was $6,046,000,000.",
"role": "assistant",
"function_call": null,
"tool_calls": null
}
```
This results in the following response
```
Nike's net income for the year 2022 was $6,046,000,000.
```
## Tools specification
The `tools` field is an array; each element contains the following two fields:
* `type`: `string` The type of the tool. Currently, only function is supported.
* `function`: `object`
* `description`: `string` A description of what the function does, used by the model to choose when and how to call the function.
* `name`: `string` The name of the function to be called. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.
* `parameters`: `object` The parameters the functions accepts, described as a JSON Schema object. See the [JSON Schema reference](https://json-schema.org/understanding-json-schema/reference) for documentation about the format.
## Tool choice
The `tool_choice` parameter controls whether the model is allowed to call functions. Currently, we support the values `auto`, `none`, and `any`, or forcing a specific function by name (see the example after this list).
* `auto` mode implies the model can pick between generating a message or calling a function. This is the **default** tool choice when the field is not specified.
* `none` mode is akin to no tool specification being passed to the model.
* To force a specific function call, you can set `tool_choice = {"type": "function", "function": {"name": "get_financial_data"}}`. In this scenario, the model is forced to use this specific function.
* To require that some function call is always made (without forcing a specific one), use the `any` mode. Equivalently, you can leave the function name empty when setting the tool choice type to "function", e.g. `tool_choice = {"type": "function"}`.
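For example, reusing the `client`, `messages`, and `tools` objects from the example above, forcing the model to call `get_financial_data` might look like the following sketch:
```python
# Force the model to call get_financial_data regardless of its own judgment.
forced = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",
    messages=messages,
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_financial_data"}},
    temperature=0.1,
)

# Let the model decide between replying normally or calling a tool (the default behavior).
auto = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    temperature=0.1,
)
```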
## OpenAI compatibility
Our function calling API is compatible with OpenAI's, including in streaming scenarios. Unlike the OpenAI API, we don't support parallel or nested function calling.
## Best practices
1. **Number of Functions** - The length of the list of functions specified to the model directly impacts its performance. For best performance, keep the list of functions below 7. It's possible to see some degradation in the quality of the model as the tool list length exceeds 10.
2. **Function Description** - The function specification follows [JSON Schema](https://json-schema.org/). For best performance, describe in great detail what the function does under the "description" section. An example of a good function description is "Get financial data for a company given the metric and year". A bad example would be "Get financial data for a company".
3. **System Prompt** - In order to ensure optimal performance, we recommend **not** adding any additional system prompt. User-specified system prompts can interfere with the function detection & calling ability of the model. The auto-injected prompt for our function calling model is designed to ensure optimal performance.
4. **Temperature** - Set the temperature to 0.0 or some other low value. This helps the model generate only confident predictions and avoid hallucinating parameter values.
5. **Function descriptions** - Provide verbose descriptions for functions and their parameters. This is similar to prompt engineering: the more elaborate and accurate the function definition/documentation, the better the model is at deciphering the intent of the function and its parameters.
## Function calling vs JSON mode
When to use function calling vs [JSON mode](/structured-responses/structured-response-formatting)? Use function calling if the use case involves decision-making or interactivity - which function to call, what parameters to fill, or how to follow up with the user to fill in missing function arguments in a chat form. JSON/Grammar mode is a preferred choice for non-interactive structured data extraction and allows you to explore non-JSON formats too.
## Example apps
* Fireworks-created demos:
* UI [demo](https://functional-chat.vercel.app/) for image generation and stock price retrieval
* [Notebook](https://colab.research.google.com/drive/1SI6jz66k122vv641e8wDDI0Ujh4cwlUy?usp=sharing) for information extraction
* Langchain integration notebooks:
* [Function Calling with LangChain JS](https://github.com/langchain-ai/langchainjs/blob/main/cookbook/function_calling_fireworks.ipynb)
* [AgentExecutor notebook ](https://colab.research.google.com/drive/1huPsNm9l4OcJvIcu63u0FFWF8X2J7zW3?usp=sharing)
* [RAG + Langchain ](https://colab.research.google.com/drive/1Vy4tYxP_rlbkAKi4pGpaDRV7hnSQeG2d?usp=sharing)
## Data policy
Data from Firefunction is logged and automatically deleted after 30 days to ensure product quality and to prevent abuse (e.g., bulk data on the average number of functions used). This data will never be used to train models. Please contact [raythai@fireworks.ai](mailto:raythai@fireworks.ai) if you have questions, comments, or use cases where data cannot be logged.
# On-demand deployments
Deploying on your own GPUs
Fireworks allows you to create on-demand, dedicated deployments that are reserved for your own use. This has several advantages over the shared deployments Fireworks uses for its serverless models:
* Predictable performance unaffected by load caused by other users
* No hard rate limits - but subject to the maximum load capacity of the deployment
* Cheaper under high utilization
* Access to a larger selection of models that are not available via our serverless offering
* [Custom base models](/models/uploading-custom-models#custom-base-models) from Hugging Face files
If you plan on using a significant amount of on-demand deployments, consider purchasing [reserved capacity](/deployments/reservations)
for cheaper pricing and higher GPU quotas.
## Quickstart
See the "All models" list on our [Models](https://fireworks.ai/models) page for a list of pre-uploaded models on the
Fireworks AI platform. You can also use a [custom base model](#custom-base-models).
To create a new deployment of a [model provided by Fireworks](https://fireworks.ai/models), run:
```bash
firectl create deployment accounts/fireworks/models/<MODEL_ID> --wait
```
This command will complete when the deployment is `READY`. To let it run asynchronously, remove the `--wait` flag.
See the model [overview](https://docs.fireworks.ai/models/overview#introduction) for info on model IDs. The deployment ID is the last part of the deployment name.
To create a new deployment using a custom base model, follow the [Uploading custom models](/models/uploading-custom-models#custom-base-models) guide to first upload your custom base model to the Fireworks platform. Then run:
```bash
firectl create deployment <MODEL_NAME>
```
The deployment ID is the last part of the deployment name: `accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>`.
You can verify the deployment is complete by running:
```bash
firectl get deployment <DEPLOYMENT_ID>
```
The state field should show `READY`.
To query a specific deployment, use the model identifier in the format: `<model>#<deployment>`
In most cases, the model identifier follows this pattern:
`accounts/<ACCOUNT_ID>/models/<MODEL_ID>` + `#` + `accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>`
**Example:**
The model identifier for querying Llama3.2-3B Instruct (listed as `accounts/fireworks/models/llama-v3p2-3b-instruct`) for Acme Inc.'s deployment (with deployment ID `12ab34cd56ef`) would be:
`accounts/fireworks/models/llama-v3p2-3b-instruct#accounts/acmeInc/deployments/12ab34cd56ef`
**Sample Request:**
```bash
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
  "model": "accounts/fireworks/models/<MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
"prompt": "Say this is a test"
}' \
--url https://api.fireworks.ai/inference/v1/completions
```
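The same request can be made from Python using the OpenAI-compatible client shown elsewhere in these docs. The sketch below reuses the example identifiers from above; the API key is a placeholder:
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

# The model identifier combines the model name and the deployment name with '#'.
response = client.completions.create(
    model="accounts/fireworks/models/llama-v3p2-3b-instruct#accounts/acmeInc/deployments/12ab34cd56ef",
    prompt="Say this is a test",
)
print(response.choices[0].text)
```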
By default, deployments will automatically [scale down to zero](#customizing-autoscaling-behavior) replicas if unused (i.e. no
inference requests) for 1 hour, and will automatically be deleted if unused for one week.
To completely delete the deployment, run:
```bash
firectl delete deployment <DEPLOYMENT_ID>
```
**Notes:**
* Make sure you include the `#` in the model identifier when querying a specific deployment.
* If you are unsure about the model identifier format, refer to the [Model Identifiers](https://docs.fireworks.ai/models/deploying#model-identifier) section for more details and alternatives.
## Deployment options
### Replica count (horizontal scaling)
The number of replicas (horizontal scaling) is specified by passing the `--min-replica-count` and `--max-replica-count`
flags. Increasing the number of replicas will increase the maximum QPS the deployment can support. The deployment will
automatically scale based on server load.
Auto-scaling up may fail if there is a GPU stockout. Use [reserved capacity](/deployments/reservations) to
guarantee capacity for your deployments.
The default value for `--min-replica-count` is 0. Setting `--min-replica-count` to 0 enables the deployment to auto-scale to 0 if a deployment is unused (i.e. no inference requests) for a specified "scale-to-zero" time window. While the deployment is scaled to 0, you will not pay for any GPU utilization.
The default value for `--max-replica-count` is 1 if
`--min-replica-count=0`, or the value of `--min-replica-count` otherwise.
```bash create
firectl create deployment \
--min-replica-count 2 \
--max-replica-count 3
```
```bash update
firectl update deployment \
--min-replica-count 2 \
--max-replica-count 3
```
### Customizing autoscaling behavior
You can customize certain aspects of the deployment's autoscaling behavior by setting the following flags:
* `--scale-up-window` The duration the autoscaler will wait before scaling up a deployment after observing increased
load. Default is `30s`.
* `--scale-down-window` The duration the autoscaler will wait before scaling down a deployment after observing
decreased load. Default is `10m`.
* `--scale-to-zero-window` The duration with no requests after which the deployment will be scaled down to
  zero replicas. This is ignored if `--min-replica-count` is greater than 0. Default is `1h`. The minimum is `5m`.
There will be a cold-start latency (up to a few minutes) for requests made while the deployment is scaling
from 0 to 1 replicas.
A deployment with `--min-replica-count` set to 0 will be automatically deleted if it receives no traffic for 7
days.
Refer to [time.ParseDuration](https://pkg.go.dev/time#ParseDuration) for valid syntax for the duration string.
### Multiple GPUs (vertical scaling)
The number of GPUs used per replica is specified by passing the `--accelerator-count` flag. Increasing the accelerator count will improve the generation speed, time-to-first-token, and maximum QPS for your deployment; however, the scaling is sub-linear. The default value for most models is 1 but may be higher for larger models that require sharding.
```bash create
firectl create deployment --accelerator-count 2
```
```bash update
firectl update deployment --accelerator-count 2
```
### Choosing hardware type
By default, a deployment will use NVIDIA A100 80 GB GPUs. You can also deploy using NVIDIA H100 80 GB GPUs by passing the `--accelerator-type` flag.
For advice on choosing a hardware type, see this [FAQ](https://docs.fireworks.ai/faq/deployment/ondemand/hardware-options#hardware-selection)
```bash create
firectl create deployment --accelerator-type="NVIDIA_H100_80GB"
```
```bash update
firectl update deployment --accelerator-type="NVIDIA_H100_80GB"
```
### Optimizing your deployments
By default, a balanced deployment will be created using the hardware resources you specify. Higher performance can be
achieved for long-prompt length (>\~3000 tokens) workloads by passing the `--long-prompt` flag.
This option roughly doubles the amount of GPU memory required to serve the model and requires a minimum of two
GPUs to be effective. If `--accelerator-count` is not specified, then a deployment using twice the minimum number of
GPUs (to serve without `--long-prompt`) will be created.
```bash create
firectl create deployment --accelerator-count=2 --long-prompt
```
```bash update
firectl update deployment --long-prompt
```
To update a deployment to disable this option, pass `--long-prompt=false`.
Additional optimization options are available through our enterprise plan.
## Deploying PEFT addons
By default, PEFT addons are disabled for deployments. To enable addons, pass the `--enable-addons` flag:
```bash create
firectl create deployment --enable-addons
```
```bash update
firectl update deployment --enable-addons
```
See [Uploading a custom model](/models/uploading-custom-models#peft-addons) for instructions on how to upload custom
PEFT addons. To deploy a PEFT addon to an on-demand deployment, pass the `--deployment-id` flag to `firectl deploy`. For
example:
```bash
firectl deploy <MODEL_NAME> --deployment-id <DEPLOYMENT_ID>
```
The base model of the deployment must match the base model of the addon.
## Pricing
On-demand deployments are billed by GPU-second. Consult our [pricing page](https://fireworks.ai/pricing) for details.
# Prompt caching
Prompt caching is a performance optimization feature that allows Fireworks to
respond faster to requests with prompts that share common prefixes. In many
situations, it can reduce time to first token (TTFT) by as much as 80%.
Prompt caching is **enabled by default** for all Fireworks models and deployments.
For dedicated deployments, prompt caching frees up resources, leading to higher
throughput on the same hardware. Dedicated deployments on the Enterprise plan allow
additional configuration options to further optimize cache performance.
## Using prompt caching
### Common use cases
Requests to LLMs often share a large portion of their prompt. For example:
* Long system prompts with detailed instructions
* Descriptions of available tools for function calling
* Growing previous conversation history for chat use cases
* Shared per-user context, like a current file for a coding assistant
Prompt caching avoids re-processing the cached prefix of the prompt and
starts output generation much sooner.
### Structuring prompts for caching
Prompt caching works only for exact prefix matches within a prompt. To
realize caching benefits, place static content like instructions and examples at
the beginning of your prompt, and put variable content, such as user-specific
information, at the end.
For function calling models, tools are considered part of the prompt.
For vision-language models, images currently aren't cached (but this might be improved in the future).
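As a minimal sketch (the prompt text and helper below are illustrative, not part of the API), a cache-friendly chat request keeps the long, static instructions first and the variable, user-specific content last:
```python
# Static instructions are identical across requests, so repeated calls share a cached prefix.
STATIC_SYSTEM_PROMPT = (
    "You are a support assistant for Acme.\n"
    "<long, detailed instructions, tool descriptions, and examples go here>"
)

def build_messages(history, user_message):
    """Static content first, then the growing history, then the new user turn."""
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        *history,                                   # earlier turns remain part of the shared prefix
        {"role": "user", "content": user_message},  # variable content goes last
    ]

messages = build_messages(
    history=[
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello! How can I help?"},
    ],
    user_message="Where is my order?",
)
```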
### How it works
Fireworks will automatically find the longest prefix of the request that is
present in the cache and reuse it. The remaining portion of the prompt will be
processed as usual.
The entire prompt is stored in the cache for future reuse. Cached prompts
usually stay in the cache for at least several minutes. Depending on the model,
load level, and deployment configuration, it can be up to several hours. The
oldest prompts are evicted from the cache first.
Prompt caching doesn't alter the result generated by the model. The response you
receive will be identical to what you would get if prompt caching was not used.
Each generation is sampled from the model independently on each request and is not
cached for future usage.
## Monitoring
For dedicated deployments, information about prompt caching is returned in the
response headers. The header `fireworks-prompt-tokens` contains the number of tokens
in the prompt, out of which `fireworks-cached-prompt-tokens` are cached.
Aggregated metrics are also available in the [usage dashboard](https://fireworks.ai/account/usage?type=deployments).
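As a sketch, a raw HTTP request can read these headers directly. The model identifier and API key below are placeholders for your own dedicated deployment:
```python
import requests

response = requests.post(
    "https://api.fireworks.ai/inference/v1/chat/completions",
    headers={
        "Authorization": "Bearer <FIREWORKS_API_KEY>",
        "Content-Type": "application/json",
    },
    json={
        "model": "accounts/<ACCOUNT_ID>/models/<MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
        "messages": [{"role": "user", "content": "Say this is a test"}],
    },
)

# Both headers are returned for dedicated deployments.
print("prompt tokens:", response.headers.get("fireworks-prompt-tokens"))
print("cached prompt tokens:", response.headers.get("fireworks-cached-prompt-tokens"))
```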
## Data privacy
Serverless deployments maintain separate caches for each Fireworks account to prevent data leakage and timing attacks.
Dedicated deployments by default share a single cache across all requests.
Because prompt caching doesn't change the outputs, privacy is preserved even
if the deployment powers a multi-tenant application. It does open a minor risk
of a timing attack: potentially, an adversary can learn that a particular prompt
is cached by observing the response time. To ensure full isolation, you can pass
the `x-prompt-cache-isolation-key` header or the `prompt_cache_isolation_key`
field in the body of the request. It can contain an arbitrary string that acts
as an additional cache key, i.e., no sharing will occur between requests with
different IDs.
## Limiting or turning off caching
You can pass the `prompt_cache_max_len` field in the request body to
limit the maximum prefix of the prompt (in tokens) that is considered for
caching. It's rarely needed in real applications but can come in handy for
benchmarking the performance of dedicated deployments by passing
`"prompt_cache_max_len": 0`.
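For example, with the OpenAI-compatible Python client, fields that are not part of the OpenAI schema can be passed via `extra_body`. The sketch below disables caching for a single benchmarking request; the API key is a placeholder:
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    extra_body={
        "prompt_cache_max_len": 0,                   # disable prompt caching for this request
        # "prompt_cache_isolation_key": "user-123",  # optionally isolate the cache fully per user
    },
)
print(response.choices[0].message.content)
```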
## Advanced: cache locality for Enterprise deployments
Dedicated deployments on an Enterprise plan allow you to pass an additional hint in the request to improve cache hit rates.
First, the deployment needs to be created or updated with an additional flag:
```bash
firectl create deployment ... --enable-session-affinity
```
Then the client can pass an opaque identifier representing a single user or
session in the `user` field of the body or in the `x-session-affinity` header. Fireworks
will try to route requests with the identifier to the same server, further reducing response times.
It's best to choose an identifier that groups requests with long shared prompt
prefixes. For example, it can be a chat session with the same user or an
assistant working with the same shared context.
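For instance, here is a sketch of passing a per-session identifier in the `user` field with the OpenAI-compatible client (the model identifier, API key, and session ID are placeholders):
```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    # A dedicated deployment created or updated with --enable-session-affinity.
    model="accounts/<ACCOUNT_ID>/models/<MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    messages=[{"role": "user", "content": "Continue summarizing the shared document."}],
    # Requests with the same identifier are routed to the same server when possible.
    user="chat-session-42",
)
print(response.choices[0].message.content)
```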
# Querying embedding models
Fireworks hosts many embedding models. In this guide, we will walk through an example of using `nomic-ai/nomic-embed-text-v1.5` to query the Fireworks embeddings API.
## Embedding documents
Our embedding service is OpenAI compatible. Use OpenAI's embeddings [guide](https://platform.openai.com/docs/guides/embeddings) and OpenAI's [embeddings documentation](https://platform.openai.com/docs/api-reference/embeddings) for more detailed information on our embedding model usage.
The embedding model inputs text and outputs a vector (list) of floating point numbers to use for tasks like similarity comparisons and search.
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.embeddings.create(
model="nomic-ai/nomic-embed-text-v1.5",
input="search_document: Spiderman was a particularly entertaining movie with...",
)
print(response)
```
This code embeds the text "search\_document: Spiderman was a particularly entertaining movie with..." and returns the following
```json Response
CreateEmbeddingResponse(data=[Embedding(embedding=[0.006380197126418352, 0.011841800063848495,...], index=0, object='embedding')], model='nomic-ai/nomic-embed-text-v1.5', object='list', usage=Usage(prompt_tokens=12, total_tokens=12))
```
However, you might have noticed the interesting `search_document: ` prefix. What is that supposed to mean?
## Embedding queries and documents
Nomic models have been fine-tuned to take prefixes. User queries need to be prefixed with `search_query: `, and documents need to be prefixed with `search_document: `. What does that mean in practice?
* Let's say I previously used the embedding model to embed many movie reviews and stored them in a vector database. All of those documents were embedded with the `search_document: ` prefix.
* I now want to build a movie recommendation system that takes in a user query and outputs recommendations based on this data. The code below demonstrates how to embed the user query.
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
query = "I love superhero movies, any recommendations?"
task_description="Given a user query for movies, retrieve the relevant movie that can fulfill the query. "
query_emb = client.embeddings.create(
model="nomic-ai/nomic-embed-text-v1.5",
input=f"search_query: {query}"
)
print(query_emb)
```
To view this example end-to-end and see how to use a MongoDB vector store and Fireworks-hosted generation model for RAG, see our full [guide](https://github.com/fw-ai/cookbook/blob/main/examples/rag/mongo_basic.ipynb). For more information on what kind of prefixes are possible with nomic, please check out [this guide from nomic](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#usage).
## Variable dimensions
The model also supports variable embedding dimensions. In this case, we can provide the `dimensions` parameter in the `embeddings.create` request:
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.embeddings.create(
model="nomic-ai/nomic-embed-text-v1.5",
input="search_document: I like Christmas movies, can you make any recommendations?",
dimensions=128,
)
print(len(response.data[0].embedding))
```
You will see that the returned results are embeddings with dimension 128.
## List of available models
| Model name | model size |
| :--------------------------------------------- | :--------- |
| `nomic-ai/nomic-embed-text-v1.5` (recommended) | 137M |
| `nomic-ai/nomic-embed-text-v1` | 137M |
| `WhereIsAI/UAE-Large-V1` | 335M |
| `thenlper/gte-large` | 335M |
| `thenlper/gte-base` | 109M |
# Querying text models
Fireworks.ai offers an OpenAI-compatible REST API for querying text models. There are several ways to interact with it:
* The [Fireworks Python client library](/tools-sdks/python-client/installation)
* The [web console](https://fireworks.ai)
* [LangChain](https://python.langchain.com/docs/integrations/providers/fireworks)
* Directly invoking the [REST API](/api-reference/post-completions) using your favorite tools or language
* The [OpenAI Python client](https://github.com/openai/openai-python)
## Using the web console
All Fireworks models can be accessed through the web console at [fireworks.ai](https://fireworks.ai). Clicking on a model will take you to the playground where you can enter a prompt along with additional request parameters.
Non-chat models will use the [completions API](/api-reference/post-completions) which passes your input directly into the model.
Models with a conversation config are considered chat models (also known as instruct models). By default, chat models will use the [chat completions API](/api-reference/post-chatcompletions) which will automatically format your input with the conversation style of the model. Advanced users can revert back to the completions API by unchecking the "Use chat template" option.
## Using the API
### Chat completions API
Models with a conversation config have the [chat completions API](/api-reference/post-chatcompletions) enabled. These models are typically tuned with specific conversation styles for which they perform best. For example, Llama chat models use the following [template](https://gpus.llm-utils.org/llama-2-prompt-template/):
> \[INST] \<\<SYS>>
>
> \{system\_prompt}
>
> \<\</SYS>>
>
> \{user\_message\_1} \[/INST]
Some templates like `llama-chat` can support multiple chat messages as well. In general, we recommend users use the chat completions API whenever possible to avoid common prompt formatting errors. Even small errors like misplaced whitespace may result in poor model performance.
Here are some examples of calling the chat completions API:
```python Python (Fireworks)
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response = openai.ChatCompletion.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
)
print(response.choices[0].message.content)
```
```shell cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3-8b-instruct",
"messages": [{
"role": "user",
"content": "Say this is a test"
}]
}' \
--url https://api.fireworks.ai/inference/v1/chat/completions
```
#### Overriding the system prompt
A conversation style may include a default system prompt. For example, the `llama-chat` style uses the default Llama prompt:
> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
For styles that support a system prompt, you may override this prompt by setting the first message with the role `system`. For example:
```json JSON
[
{
"role": "system",
"content": "You are a pirate."
},
{
"role": "user",
"content": "Hello, what is your name?"
}
]
```
To completely omit the system prompt, you can set `content` to the empty string.
The process of generating a conversation-formatted prompt will depend on the conversation style used. To verify the exact prompt used, turn on [`echo`](#echo).
### Completions API
Text models generate text based on the provided input prompt. All text models support this basic [completions API](/api-reference/post-completions). Using this API, the model will successively generate new tokens until either the maximum number of output tokens has been reached or if the model's special end-of-sequence (EOS) token has been generated.
> NOTE: Llama-family models will automatically prepend the beginning-of-sequence (BOS) token (`<s>`) to your prompt input. This is to be consistent with the [original implementation](https://github.com/facebookresearch/llama/blob/ef351e9cd9496c579bf9f2bb036ef11bdc5ca3d2/llama/generation.py#L264).
Here are some examples of calling the completions API:
```python Python (Fireworks)
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response = client.completion.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
prompt="Say this is a test",
)
print(response.choices[0].text)
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response = client.completions.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
prompt="Say this is a test",
)
print(response.choices[0].text)
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response = openai.Completion.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
prompt="Say this is a test",
)
print(response.choices[0].text)
```
```shell cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3-8b-instruct",
"prompt": "Say this is a test"
}' \
--url https://api.fireworks.ai/inference/v1/completions
```
## Getting usage info
The returned object will contain a `usage` field containing
* The number of prompt tokens ingested
* The number of completion tokens (i.e. the number of tokens generated)
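For example, with the Fireworks Python client, the `usage` field can be read directly from the response object (a minimal sketch; the API key is a placeholder):
```python
from fireworks.client import Fireworks

client = Fireworks(api_key="<FIREWORKS_API_KEY>")
response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3-8b-instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
)

usage = response.usage
print("prompt tokens:", usage.prompt_tokens)          # tokens ingested from the prompt
print("completion tokens:", usage.completion_tokens)  # tokens generated by the model
```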
## Advanced options
See the API reference for the [completions](/api-reference/post-completions) and [chat completions](/api-reference/post-chatcompletions) APIs for a detailed description of these options.
### Streaming
By default, results are returned to the client once the generation is finished. Another option is to stream the results back, which is useful for chat use cases where the client can incrementally see results as each token is generated.
Here is an example with the completions API:
```python Python (Fireworks)
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response_generator = client.completion.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
prompt="Say this is a test",
stream=True,
)
for chunk in response_generator:
print(chunk.choices[0].text)
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response_generator = client.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
prompt="Say this is a test",
stream=True,
)
for chunk in response_generator:
print(chunk.choices[0].text)
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response_generator = openai.Completion.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
prompt="Say this is a test",
stream=True,
)
for chunk in response_generator:
print(chunk.choices[0].text, end="")
```
```shell cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
"prompt": "Say this is a test",
"stream": true
}' \
--url https://api.fireworks.ai/inference/v1/completions
```
and one with the chat completions API:
```python Python (Fireworks)
from fireworks.client import Fireworks
client = Fireworks(api_key="")
response_generator = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-8b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
stream=True,
)
for chunk in response_generator:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
response_generator = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
stream=True,
)
for chunk in response_generator:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response_generator = openai.ChatCompletion.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
messages=[{
"role": "user",
"content": "Say this is a test",
}],
stream=True,
)
for chunk in response_generator:
if "content" in chunk.choices[0].delta:
print(chunk.choices[0].delta.content, end="")
```
### Async mode
The Python client library also supports asynchronous mode for both completion and chat completion.
```python Python (Fireworks)
import asyncio
from fireworks.client import AsyncFireworks
client = AsyncFireworks(api_key="")
async def main():
stream = client.completion.acreate(
model="accounts/fireworks/models/llama-v3p1-8b-instruct",
prompt="Say this is a test",
stream=True,
)
async for chunk in stream:
print(chunk.choices[0].text, end="")
asyncio.run(main())
```
```python Python (OpenAI 1.x)
import asyncio
import openai
client = openai.AsyncOpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key="",
)
async def main():
stream = await client.completions.create(
model="accounts/fireworks/models/llama-v3-8b-instruct",
prompt="Say this is a test",
stream=True,
)
async for chunk in stream:
print(chunk.choices[0].text, end="")
asyncio.run(main())
```
### Predicted Outputs
In cases where large parts of the LLM output are known in advance (e.g., code rewriting or making specific edits within a longer document), you can improve output generation speed with predicted outputs. Predicted outputs allow you to provide strong "guesses" of what the output may look like.
To use predicted outputs, set the `prediction` field in the Fireworks API with the predicted output. For example, you may want to edit a survey and add an option to contact users by text message:
```
{
"questions": [
{
"question": "Name",
"type": "text"
},
{
"question": "Age",
"type": "number"
},
{
"question": "Feedback",
"type": "text_area"
},
{
"question": "How to Contact",
"type": "multiple_choice",
"options": ["Email", "Phone"],
"optional": true
}
]
}
```
In this case, we expect most of the code to remain the same, so we set the `prediction` field to the original survey code. Output generation speed increases when predicted outputs are used. See additional information about predicted outputs below the code block.
```python Python (Fireworks)
from fireworks.client import Fireworks
code = """{
"questions": [
{
"question": "Name",
"type": "text"
},
{
"question": "Age",
"type": "number"
},
{
"question": "Feedback",
"type": "text_area"
},
{
"question": "How to Contact",
"type": "multiple_choice",
"options": ["Email", "Phone"],
"optional": true
}
]
}
"""
client = Fireworks(api_key="")
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-70b-instruct",
messages=[{
"role": "user",
"content": "Edit the How to Contact question to add an option called Text Message. Output the full edited code, with no markdown or explanations.",
},
{
"role": "user",
"content": code
}
],
temperature=0,
prediction={"type": "content", "content": code}
)
print(response.choices[0].message.content)
```
Additional information on predicted outputs:
* Using predicted outputs is free at this time
* Using predicted outputs does not impact the quality of outputs generated
* We recommend setting temperature=0 for best results for most intended use cases of predicted outputs
* If the prediction is substantially different from the generated output, output generation speed may decrease
* The max length of the `prediction` field is set by `max_tokens` and is 2048 by default. You need to update it if you have a longer input and prediction.
* **Important Gotcha:** Ensure that the model output follows exactly the same format as specified in the prediction field. Common pitfalls include:
* Additional verbosity (e.g., "Sure, I updated this and that aspect ...")
* Markdown formatting (e.g., wrapping code in "\`\`\`")
* Extra new lines at the beginning of the output
You may need to adjust the prompt to avoid these issues.
* We are actively developing the feature and [welcome feedback on Discord!](https://discord.com/invite/mMqQxvFD9A)
* Read our [blog post](https://fireworks.ai/blog/cursor) on how Cursor used predicted outputs (which leverages speculative decoding under the hood) in production with Fireworks
### Sampling options
The API generates text auto-regressively, choosing each next token from the probability distribution over the space of tokens. For detailed information on how to use these options, please refer to the [Chat Completions](/api-reference/post-chatcompletions) or [Completions](/api-reference/post-completions) API documentation. A combined example using several of these options appears after the Mirostat section below.
#### Multiple choices
By default, the API will return a single generation choice per request. You can create multiple generations by setting the `n` parameter to the number of desired choices. The returned `choices` array will contain the result of each generation.
#### Max tokens
`max_tokens` or `max_completion_tokens` defines the maximum number of tokens the model can generate, with a default of 2000. If the combined token count (prompt + output) exceeds the model’s limit, it automatically reduces the number of generated tokens to fit within the allowed context.
#### Temperature
Temperature allows you to configure how much randomness you want in the generated text. A higher temperature leads to more "creative" results. On the other hand, setting a temperature of 0 will allow you to generate deterministic results which is useful for testing and debugging.
#### Top-p
Top-p (also called nucleus sampling) is an alternative to sampling with temperature, where the model considers the results of the tokens with top\_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
#### Top-k
Top-k is another sampling method where only the k most probable tokens are considered and the probability mass is redistributed among them.
#### Min P
[`min_p`](https://arxiv.org/abs/2407.01082) specifies a probability threshold to control which tokens can be selected during generation. Tokens with probabilities lower than this threshold are excluded, making the model more focused on higher-probability tokens. The default value varies, and setting a lower value ensures more variety, while a higher value produces more predictable, focused outputs.
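As a rough illustration, the sketch below combines the sampling options described above. The specific values are illustrative, and it assumes these parameters can be passed directly as keyword arguments to the Fireworks Python client:

```python
from fireworks.client import Fireworks

client = Fireworks(api_key="")
response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[{"role": "user", "content": "Write a two-sentence product description for a hiking backpack."}],
    max_tokens=200,   # cap the length of the generation
    temperature=0.7,  # moderate randomness
    top_p=0.9,        # nucleus sampling over the top 90% probability mass
    top_k=40,         # consider only the 40 most probable tokens
    min_p=0.05,       # exclude tokens below the probability threshold
)
print(response.choices[0].message.content)
```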
#### Repetition penalty
LLMs are sometimes prone to repeat a single character or a sentence. Using a frequency and presence penalty can reduce the likelihood of sampling repetitive sequences of tokens. They work by directly modifying the model's logits (un-normalized log-probabilities) with an additive contribution.
> logits\[j] -= c\[j] \* frequency\_penalty + (c\[j] > 0 ? 1 : 0) \* presence\_penalty
where
* `logits[j]` is the logits of the j-th token
* `c[j]` is how often that token was sampled before the current position
The [`repetition_penalty`](https://arxiv.org/pdf/1909.05858.pdf) modifies the logit (raw model output) for repeated tokens. If a token has already appeared in the prompt or output, the penalty is applied to its probability of being selected again.
**Key differences to keep in mind:**
* `frequency_penalty`: Works on how often a word has been used, increasing the penalty for more frequent words. OAI compatible.
* `presence_penalty`: Penalizes words once they appear, regardless of frequency. OAI compatible.
* `repetition_penalty`: Adjusts the likelihood of repeated tokens based on previous appearances, providing an exponential scaling effect to control repetition more precisely, including from the prompt.
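For illustration, here is a sketch passing the penalty parameters described above through the Fireworks Python client. The values are arbitrary, and `repetition_penalty` is shown commented out as the alternative, multiplicative-style penalty:

```python
from fireworks.client import Fireworks

client = Fireworks(api_key="")
response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[{"role": "user", "content": "List ten different animals without repeating yourself."}],
    temperature=0.5,
    frequency_penalty=0.5,  # penalize tokens in proportion to how often they already appeared
    presence_penalty=0.3,   # flat penalty once a token has appeared at all
    # repetition_penalty=1.1,  # alternative penalty that also covers tokens from the prompt
)
print(response.choices[0].message.content)
```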
#### Mirostat (learning rate and target)
The [Mirostat algorithm](https://arxiv.org/abs/2007.14966) is a sampling method that helps keep the output’s unpredictability, or perplexity, at a set target. It adjusts token probabilities as the text is generated to balance between more diverse or more predictable results. This is useful when you need steady control over how random or focused the text output should be.
There are two parameters that can be adjusted:
* `mirostat_target`: Sets the desired level of unpredictability (perplexity) for the Mirostat algorithm. A higher target results in more diverse output, while a lower target keeps the text more predictable.
* `mirostat_lr`: Controls how quickly the Mirostat algorithm adjusts token probabilities to reach the target perplexity. A lower learning rate makes the adjustments slower and more gradual, while a higher rate speeds up the corrections.
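A minimal sketch using the Mirostat parameters above; the target and learning rate values here are illustrative, not recommendations:

```python
from fireworks.client import Fireworks

client = Fireworks(api_key="")
response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[{"role": "user", "content": "Tell me a short story about a lighthouse keeper."}],
    mirostat_target=3.0,  # desired output perplexity
    mirostat_lr=0.1,      # how quickly sampling is corrected toward the target
)
print(response.choices[0].message.content)
```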
#### Logit bias
Parameter that modifies the likelihood of specified tokens appearing. Pass in a Dict\[int, float] that maps a token\_id to a logit bias value between -200.0 and 200.0. For example:
```python
from fireworks.client import Fireworks

client = Fireworks(api_key="")
client.completions.create(
    model="...",
    prompt="...",
    logit_bias={0: 10.0, 2: -50.0}
)
```
## Debugging options
#### Ignore EOS
This option allows you to control whether the model stops when it generates the End of Sequence (EOS) token. It is helpful primarily for performance benchmarking, where you want to reliably generate exactly `max_tokens` tokens. Note that output quality may degrade, since we override the model's decision to emit the EOS token.
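A rough benchmarking sketch is shown below; note the `ignore_eos` parameter name is an assumption based on the option name above, and the prompt and token count are illustrative:

```python
from fireworks.client import Fireworks

client = Fireworks(api_key="")
response = client.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    prompt="Once upon a time",
    max_tokens=256,
    ignore_eos=True,  # keep generating even if the model emits an EOS token
)
print(response.choices[0].text)
```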
### Logprobs
The `logprobs` parameter determines how many token probabilities are returned. If set to N, it will return log (base e)
probabilities for N+1 tokens: the chosen token plus the N most likely alternative tokens.
The log probabilities will be returned in a LogProbs object for each choice.
* `tokens` contains each token of the chosen result.
* `token_ids` contains the integer IDs of each token of the chosen result.
* `token_logprobs` contains the logprobs of each chosen token.
* `top_logprobs` will be a list whose length is the number of tokens of the output. Each element is a dictionary of size `logprobs`, from the most likely tokens at the given position to their respective log probabilities.
When used in conjunction with echo, this option can be set to see how the model tokenized your input.
### top\_logprobs
Setting the `top_logprobs` parameter to an integer value in conjunction with `logprobs=True` will also return the above information but in an OpenAI client-compatible format.
### Echo
Setting the `echo` option to true will cause the API to return the prompt along with the generated text. This can be used in conjunction with the chat completions API to verify the prompt template used. It can also be used in conjunction with logprobs to see how the model tokenized your input.
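For example, here is a minimal sketch that uses `echo` together with `logprobs` on the completions API to inspect how the prompt was tokenized; the field names follow the LogProbs description above:

```python
from fireworks.client import Fireworks

client = Fireworks(api_key="")
response = client.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    prompt="Hello, world",
    max_tokens=1,
    echo=True,   # return the prompt along with the generated text
    logprobs=3,  # logprobs for the chosen token plus the 3 most likely alternatives
)
logprobs = response.choices[0].logprobs
# `tokens` and `token_logprobs` line up index-by-index.
for token, lp in zip(logprobs.tokens, logprobs.token_logprobs):
    print(repr(token), lp)
```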
## Appendix
### Tokenization
Language models read and write text in chunks called tokens. In English, a **token** can be as short as one character or as long as one word (e.g., a or apple), and in some languages, tokens can be even shorter than one character or even longer than one word.
Different model families use different **tokenizers**. The same text might be translated to different numbers of tokens depending on the model, which means generation cost may vary per model even if the model size is the same. For the Llama model family, you can use [this tool](https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/) to estimate token counts. The actual number of tokens used in the prompt and generation is returned in the `usage` field of the API response.
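For example, a small sketch that reads the `usage` field from a chat completions response to see the actual token counts (the model and prompt are illustrative):

```python
from fireworks.client import Fireworks

client = Fireworks(api_key="")
response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[{"role": "user", "content": "Summarize the benefits of unit testing in one sentence."}],
)
# `usage` reports how many tokens the prompt and completion actually consumed.
print(response.usage.prompt_tokens, response.usage.completion_tokens, response.usage.total_tokens)
```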
# Querying vision-language models
See [Querying text models](/guides/querying-text-models) for a general guide on the API and its options.
## Using the API
Both completions API and chat completions API are supported. However, we recommend users use the chat completions API
whenever possible to avoid common prompt formatting errors. Even small errors like misplaced whitespace may result in
poor model performance.
For Llama 3.2 Vision models, you should pass images before text in the content field to avoid the model refusing to answer.
You can pass images via a URL or in base64-encoded format. Code examples for both methods are below.
### Chat completions API
All vision-language models should have a conversation config and have [chat completions API](https://docs.fireworks.ai/api-reference/post-chatcompletions) enabled. These models are typically tuned with specific conversation styles for which they perform best. For example, Phi-3 models use the following template:
```
SYSTEM: {system message}
USER:
{user message}
ASSISTANT:
```
The `` substring is a special token that we insert into the prompt to allow the model to figure out where to put the image.
Here are some examples of calling the chat completions API:
```python Python (Fireworks)
import fireworks.client
fireworks.client.api_key = ""
response = fireworks.client.ChatCompletion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 1.x)
import openai
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key = "",
)
response = client.chat.completions.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 0.x)
import openai
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response = openai.ChatCompletion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```bash cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/phi-3-vision-128k-instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Can you describe this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
}
}
]
}
]
}' \
--url https://api.fireworks.ai/inference/v1/chat/completions
```
In the above example, we provide images by passing their URLs. Alternatively, you can provide the base64-encoded string representation of the images, prefixed with the MIME type. For example:
```python Python (Fireworks)
import fireworks.client
import base64
# Helper function to encode the image
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode('utf-8')
# The path to your image
image_path = "your_image.jpg"
# The base64 string of the image
image_base64 = encode_image(image_path)
fireworks.client.api_key = ""
response = fireworks.client.ChatCompletion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 1.x)
import openai
import base64
# Helper function to encode the image
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode('utf-8')
# The path to your image
image_path = "your_image.jpg"
# The base64 string of the image
image_base64 = encode_image(image_path)
client = openai.OpenAI(
base_url = "https://api.fireworks.ai/inference/v1",
api_key = "",
)
response = client.chat.completions.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```python Python (OpenAI 0.x)
import openai
import base64
# Helper function to encode the image
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode('utf-8')
# The path to your image
image_path = "your_image.jpg"
# The base64 string of the image
image_base64 = encode_image(image_path)
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
response = openai.ChatCompletion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
messages = [{
"role": "user",
"content": [{
"type": "text",
"text": "Can you describe this image?",
}, {
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}"
},
}, ],
}],
)
print(response.choices[0].message.content)
```
```bash cURL
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "accounts/fireworks/models/phi-3-vision-128k-instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Can you describe this image?"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,"
}
}
]
}
]
}' \
--url https://api.fireworks.ai/inference/v1/chat/completions
```
### Completions API
Advanced users can also query the completions API directly. Users will need to manually insert the image token `` where appropriate and supply the list of images as an ordered list (this is true for the Phi-3 model, but may be subject to change for future vision-language models). For example:
```python
import fireworks.client
fireworks.client.api_key = ""
response = fireworks.client.Completion.create(
model = "accounts/fireworks/models/phi-3-vision-128k-instruct",
prompt = "SYSTEM: Hello\n\nUSER:\ntell me about the image\n\nASSISTANT:",
images = ["https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"],
)
print(response.choices[0].text)
```
## API limitations
Right now, we impose certain limits on the completions API and chat completions API as follows:
1. The total number of images included in a single API request cannot exceed 30, regardless of whether they are provided as base64 strings or URLs.
2. Each image must be smaller than 5MB in size. If downloading the images takes longer than 1.5 seconds, the request will be dropped and you will receive an error.
## Model limitations
At the moment, we primarily offer Phi-3 vision models for serverless deployment.
## Managing images
The Chat Completions API is not stateful. That means you have to manage the messages (including images) you pass to the model yourself. However, we cache image downloads as much as we can to reduce latency.
For long-running conversations, we suggest passing images via URLs instead of base64-encoded images. You can also improve model latency by downsizing your images ahead of time so they are smaller than the maximum expected size.
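As a rough sketch of downsizing ahead of time, the helper below uses Pillow (an assumed third-party dependency, not part of the Fireworks SDK) to shrink an image before base64-encoding it for the chat completions API:

```python
import base64
import io

from PIL import Image  # Pillow is assumed here; install with `pip install pillow`

def downsized_image_base64(image_path: str, max_side: int = 1024) -> str:
    """Resize the image so its longest side is at most `max_side`, then base64-encode it."""
    image = Image.open(image_path).convert("RGB")
    image.thumbnail((max_side, max_side))  # preserves aspect ratio, only shrinks
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

image_base64 = downsized_image_base64("your_image.jpg")
# Pass f"data:image/jpeg;base64,{image_base64}" as the image_url, as in the examples above.
```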
## Calculating cost
For the Phi-3 Vision model, an image is treated as a dynamic number of tokens based on image resolution. For one image the number of tokens typically ranges from 1K to 2.5K. The pricing is otherwise identical to text models. For more information, please refer to [our pricing page here.](https://fireworks.ai/pricing)
## FAQ
### Can I fine-tune the image capabilities with VLM?
Not right now, but we are working on Phi-3 vision model fine-tuning since it is now a more popular choice. If you are interested, please reach out to us via Discord.
### Can I use a vision-language model to generate images?
No. But we support image generation models for this purpose:
* [Stable Diffusion](https://fireworks.ai/models/fireworks/stable-diffusion-xl-1024-v1-0)
* [SSD-1B](https://fireworks.ai/models/fireworks/SSD-1B)
* [Japanese Stable Diffusion](https://fireworks.ai/models/fireworks/japanese-stable-diffusion-xl)
* [Playground v2](https://fireworks.ai/models/fireworks/playground-v2-1024px-aesthetic)
Please give these models a try and let us know how it goes!
### What type of files can I upload?
We currently support `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff` and `.ppm` format images.
### Is there a limit to the size of the image I can upload?
Currently, our API is restricted to 10MB for the whole request, so an image sent in the request needs to be smaller than 10MB when converted to base64 encoding. If you are using URLs, then each image needs to be smaller than 5MB.
### What is the retention policy for the images I upload?
We do not persist images beyond the server lifetime; they will be deleted automatically.
### How do rate limits work with VLMs?
VLMs are rate-limited like all of our other LLMs; the limits depend on which rate-limiting tier you are in. For more information, please check out [Pricing](https://fireworks.ai/pricing).
### Can VLMs understand image metadata?
No. If you have image metadata that you want the model to understand, please provide it through the prompt.
# Deploying models
A model must be deployed before it can be used for inference. Fireworks deploys the most popular base models to
serverless deployments that can be used out of the box (including PEFT addons). See [Querying text models](/guides/querying-text-models).
Less popular base models or custom base
models must be used with an [on-demand deployment](/guides/ondemand-deployments).
## Deploying a model
### PEFT addons
#### Deploying to serverless
Fireworks also supports deploying serverless addons for [supported base models](/fine-tuning/fine-tuning-models#appendix).
To deploy a PEFT addon to serverless, run
`firectl deploy` without passing a deployment ID:
```bash
firectl deploy
```
Serverless addons are charged by input and output tokens for inference. There is no additional charge for deploying
serverless addons.
PEFT addons on serverless have higher latency compared with base model inference. This includes LoRA fine-tunes, which
are one type of PEFT addon. For faster inference speeds with PEFT addons, we recommend deploying to on-demand.
Unused addons may be automatically undeployed after a week.
#### Deploying to on-demand
Addons may also be deployed in an [on-demand deployment](/guides/ondemand-deployments) of [supported base models](/fine-tuning/fine-tuning-models#appendix).
To create an on-demand deployment, run:
```bash
firectl create deployment "accounts/fireworks/models/" --enable-addons
```
On-demand deployments are charged by GPU-hour. See [Pricing](https://fireworks.ai/pricing#ondemand) for
details.
Once the deployment is ready, deploy the addon to the deployment:
```bash
firectl deploy --deployment
```
### Base models
Custom base models may only be used with [on-demand deployments](/guides/ondemand-deployments). To create one, run:
```bash
firectl create deployment
```
On-demand deployments are charged by GPU-hour. See [Pricing](https://fireworks.ai/pricing#ondemand) for
details.
Use the `` specified during [model upload](https://docs.fireworks.ai/models/uploading-custom-models#uploading-the-model-2). Creating the deployment will automatically deploy the base model to the deployment.
## Checking whether a model is deployed
You can check the status of a model deployment by looking at the "Deployed Model Refs" section from:
```
firectl get model
```
If successful, there will be an entry with `State: DEPLOYED`.
Alternatively, you can list all deployed models within your account by running:
```
firectl list deployed-models
```
## Inference
### Model identifier
After your model is successfully deployed, it will be ready for inference. A model can be queried using one of the
following model identifiers:
* The model and deployment names - `accounts//models/#accounts//deployments/`,
e.g.
* `accounts/fireworks/models/mixtral-8x7b#accounts/alice/deployments/12345678`
* `accounts/alice/models/custom-model#accounts/alice/deployments/12345678`
* The model and deployment short-names - `/#/`,
e.g.
* `fireworks/mixtral-8x7b#alice/12345678`
* `alice/custom-model#alice/12345678`
* Deployed model name - Instead of needing to use both the model and deployment name to refer to a deployed model, you can optionally just use a unique deployed model name. This name uses a unique deployed model ID that is created upon deployment, e.g.
  * `accounts//deployedModels/`
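For example, here is a minimal sketch that queries a specific deployment using the model-and-deployment identifier form above (the account, model, and deployment IDs are the illustrative ones from the list, not real resources):

```python
from fireworks.client import Fireworks

client = Fireworks(api_key="")
# The "#" syntax targets a specific deployment of the model.
response = client.chat.completions.create(
    model="accounts/fireworks/models/mixtral-8x7b#accounts/alice/deployments/12345678",
    messages=[{"role": "user", "content": "Say this is a test"}],
)
print(response.choices[0].message.content)
```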
### Multiple deployments
Since a model may be deployed to multiple deployments, querying by model name will route to the "default" deployed
model. You can see which deployed model entry is marked with `Default: true` by describing the model:
```
firectl get model
...
Deployed Model Refs:
[{
Name: accounts//deployedModels/
Deployment: accounts//deployments/
State: DEPLOYED
Default: true
},
{
Name: accounts//deployedModels/
Deployment: accounts//deployments/
State: DEPLOYED
},
]
```
To update the default deployed model, note the "Name" of the deployed model reference above. Then run:
```
firectl update deployed-model --default
```
Deleting a default deployment:
To delete a default deployment you must delete all other deployments for the same model first,
or designate a different deployed model as the default as described above. This is to ensure that querying by model name
will always route to an unambiguous default deployment as long as deployments for the model exist.
### Querying the model
To test the model using the completions API, run:
```bash
curl \
--header 'Authorization: Bearer ' \
--header 'Content-Type: application/json' \
--data '{
"model": "",
"prompt": "Say this is a test"
}' \
--url https://api.fireworks.ai/inference/v1/completions
```
See [Querying text models](/guides/querying-text-models) for a more comprehensive guide.
## Publishing a model
By default, models can only be queried by the account that owns them. To make a model public, pass the `--public` flag
when creating or updating it.
```bash
firectl update model --public
```
To unpublish it, run:
```bash
firectl update model --public=false
```
# Overview
## Introduction
A *model* is a foundational concept of the Fireworks platform, representing a set of weights and metadata that can be
deployed on hardware (i.e. a *deployment*) for inference. Each model has a [globally unique name](https://docs.fireworks.ai/getting-started/concepts#resource-names-and-ids) of the
form `accounts//models/`. The model IDs are:
* Pre-populated for models that Fireworks has uploaded. For example, "llama-v3p1-70b-instruct" is the model ID for the Llama 3.1 70B model that Fireworks provides. It can be found on each model's page ([example](https://fireworks.ai/models/fireworks/llama-v3p1-70b-instruct))
* Either auto-generated or user-specified for fine-tuned models [uploaded](https://docs.fireworks.ai/models/uploading-custom-models#uploading-the-model) or [created](https://docs.fireworks.ai/fine-tuning/fine-tuning-models#model-id) by users
* User-specified for [custom models](https://docs.fireworks.ai/models/uploading-custom-models#uploading-the-model) uploaded by users
There are two types of models:
* Base models
* Parameter-efficient fine-tuned (PEFT) addons
### Base models
A base model consists of the full set of model weights. This may include models pre-trained from scratch as well as
full fine-tunes (i.e. continued pre-training). Fireworks has a library of common base models that can be used for
[serverless inference](#serverless-inference) as well as [dedicated deployments](#dedicated-deployments). Fireworks
also allows you to upload your own custom base models.
### Parameter-efficient fine-tuned (PEFT) addons
A PEFT addon is a small, fine-tuned model that significantly reduces the amount of memory required to deploy compared to
a fully fine-tuned model. A common technique for training PEFT addons is low-rank adaptation (LoRA). Fireworks
supports both [training](/fine-tuning/fine-tuning-models), [uploading](/models/uploading-custom-models#peft-addons),
and [serving](/models/deploying) PEFT addons.
PEFT addons must be deployed on a serverless or dedicated deployment for their corresponding base model.
## Using models for inference
A model must be deployed before it can be used for inference. Take a look at the [Querying text models](/guides/querying-text-models)
guide for a comprehensive overview of LLM inference.
### Serverless inference
Fireworks supports serverless inference for popular models like Llama 3.1 405B. These models are pre-deployed by the
Fireworks team for the community to use. Take a look at the [Models](https://fireworks.ai/models) page for the latest
list of serverless models.
Serverless inference is billed on a per-token basis depending on the model size. See our [Pricing](https://fireworks.ai/pricing#text)
page for details.
Since serverless deployments are shared across users, there are no SLA guarantees for up-time or latency. It is
best-effort. The Fireworks team may also deprecate models from serverless with at least 2 weeks' notice.
Custom base models are not supported for serverless inference.
### Serverless addons
The most popular base models for fine-tuning will also support serverless PEFT addons. This feature allows users to
quickly experiment and prototype with fine-tuning without having to pay extra for a dedicated deployment. See the
[Deploying to serverless](/models/deploying#deploying-to-serverless) guide for details.
Similar to serverless inference, there are no SLA guarantees for serverless addons.
### Dedicated deployments
Dedicated deployments give users the most flexibility and control over what models can be deployed and performance
guarantees. These deployments are private to you and give you access to a wide array of hardware. Both PEFT addons and
base models can be deployed to dedicated deployments.
Dedicated deployments are billed on a per GPU-second basis. See our [Pricing](https://fireworks.ai/pricing#ondemand) page
for details.
Take a look at our [On-demand deployments](/guides/ondemand-deployments) guide for a comprehensive overview.
## Data privacy & security
Your data is your data. No prompt or generated data is logged or stored on Fireworks; only meta-data like the number of tokens in a request is logged, as required to deliver the service. There are two exceptions:
* For our proprietary FireFunction model, input/output data is logged for 30 days only to enable bulk analytics to improve the model, such as tracking the number of functions provided to the model.
* For certain advanced features (e.g. FireOptimizer), users can explicitly opt-in to log data.
# Quantization
By default, models are served using 16-bit floating-point (FP16) precision. Quantization reduces the number of bits
required to serve the model, improving performance and reducing the cost to serve. However, this can change model numerics,
which may introduce small changes to the output.
Take a look at our [blog post](https://fireworks.ai/blog/fireworks-quantization) for a detailed treatment of how
quantization affects model quality.
## Quantizing a model
A model can be quantized to 8-bit floating-point (FP8) precision using `firectl prepare-model`:
```bash
firectl prepare-model --precision FP8
```
This is an additive process that adds a new FP8 checkpoint for your model. The original FP16 checkpoint is still
available for use.
You can check on the status of preparation by running
```bash
firectl get model
```
and checking if the state is still in `PREPARING`. A successfully prepared model will have the desired precision added
to the `Precisions` list.
## Creating an FP8 deployment
By default, creating a dedicated deployment will use the FP16 checkpoint. To see what precisions are available for a
model, run:
```bash
firectl get model
```
The `Precisions` field will indicate what precisions the model has been prepared for.
To use the quantized FP8 checkpoint, pass the `--precision` flag:
```bash
firectl create deployment --precision FP8
```
# Uploading a custom model
In addition to the predefined set of models already available on Fireworks and models you fine-tune on the Fireworks
platform, you can also upload your own custom models. Both custom base models and PEFT addons are supported.
## PEFT addons
### Requirements
Your PEFT addon must contain the following files:
* `adapter_config.json` - The Hugging Face adapter configuration file.
* `adapter_model.bin` or `adapter_model.safetensors` - The saved addon file.
The `adapter_config.json` must contain the following fields:
* `r` - The LoRA rank. Must be an integer between 4 and 64, inclusive.
* `target_modules` - A list of target modules. Currently the following target modules are supported:
* `q_proj`
* `k_proj`
* `v_proj`
* `o_proj`
* `up_proj` or `w1`
* `down_proj` or `w2`
* `gate_proj` or `w3`
* `block_sparse_moe.gate`
Additional fields may be specified but are ignored.
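As a sanity check before uploading, the hypothetical helper below (not part of firectl, just a local script) verifies the two requirements above against an `adapter_config.json` file:

```python
import json

# Supported target modules, per the list above.
SUPPORTED_TARGET_MODULES = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "up_proj", "w1", "down_proj", "w2",
    "gate_proj", "w3", "block_sparse_moe.gate",
}

def check_adapter_config(path: str) -> None:
    with open(path) as f:
        config = json.load(f)
    r = config["r"]
    assert isinstance(r, int) and 4 <= r <= 64, f"LoRA rank {r} must be an integer between 4 and 64"
    unsupported = set(config["target_modules"]) - SUPPORTED_TARGET_MODULES
    assert not unsupported, f"Unsupported target modules: {unsupported}"

check_adapter_config("/path/to/files/adapter_config.json")
```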
### Enabling chat completions
To enable the chat completions API for your PEFT addon, add a `fireworks.json` file to the model directory containing:
```json
{
"conversation_config": {
"style": "jinja",
"args": {
"template": ""
}
}
}
```
### Uploading the model
To upload a PEFT addon, run the following command. The MODEL\_ID is an arbitrary [resource ID](https://docs.fireworks.ai/getting-started/concepts#resource-names-and-ids) to refer to the model within Fireworks.
> NOTE: Only some base models support PEFT addons.
```bash
firectl create model /path/to/files/ --base-model "accounts/fireworks/models/"
```
## Custom base models
### Requirements
Fireworks currently supports the following model architectures:
* [Gemma](https://huggingface.co/docs/transformers/en/model_doc/gemma)
* [Phi, Phi-3](https://huggingface.co/docs/transformers/en/model_doc/phi)
* [Llama 1,2,3,3.1](https://huggingface.co/docs/transformers/en/model_doc/llama2)
* [LLaVa](https://huggingface.co/docs/transformers/main/en/model_doc/llava)
* [Mistral](https://huggingface.co/docs/transformers/en/model_doc/mistral) & [Mixtral](https://huggingface.co/docs/transformers/en/model_doc/mixtral)
* [Qwen2](https://huggingface.co/docs/transformers/en/model_doc/qwen2)
* [StableLM](https://huggingface.co/docs/transformers/main/en/model_doc/stablelm)
* [Starcoder(GPTBigCode)](https://huggingface.co/docs/transformers/en/model_doc/gpt_bigcode) & [Starcoder2](https://huggingface.co/docs/transformers/main/en/model_doc/starcoder2)
* [DeepSeek V1 & V2](https://huggingface.co/deepseek-ai)
* [GPT NeoX](https://huggingface.co/docs/transformers/en/model_doc/gpt_neox)
The model files you will need to provide depend on the model architecture. In general, you will need the following files:
* Model configuration: `config.json`.
Fireworks does not support the `quantization_config` option in `config.json`.
* Model weights, in one of the following formats:
* `*.safetensors`
* `*.bin`
* Weights index: `*.index.json`
* Tokenizer file(s), e.g.
* `tokenizer.model`
* `tokenizer.json`
* `tokenizer_config.json`
If the requisite files are not present, model deployment may fail.
### Enabling chat completions
To enable the chat completions API for your custom base model, ensure your `tokenizer_config.json` contains a
`chat_template` field. See the Hugging Face guide on [Templates for Chat Models](https://huggingface.co/docs/transformers/main/en/chat_templating)
for details.
### Uploading the model
To upload a custom base model, run the following command.
```bash
firectl create model /path/to/files/
```
## Deploying
A model cannot be used for inference until it is deployed. See the [Deploying models](/models/deploying) guide to deploy
the model.
# Using grammar mode
## What is grammar-based structured output?
Grammar mode is the ability to specify a forced output schema for any Fireworks model via an extended BNF formal grammar ([GBNF format](https://github.com/ggerganov/llama.cpp/tree/master/grammars)). This method is popularly used to constrain model outputs in [llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md). What is a formal grammar? It's a way to define rules declaring which strings are valid and which are invalid. See the "Syntax" section below for more info. Similar to our [JSON mode](/structured-responses/structured-response-formatting), you provide the `response_format` field in the request, like `{"type": "grammar", "grammar": }`.
For best results, we still recommend that you do some prompt engineering and describe the desired output to the model to guide decision-making.
## Why grammar-based structured output?
* Relying solely on system prompt engineering is finicky and time-consuming. It can be difficult to coerce the model to do certain things, for example
* Behave like a classifier, only output from a predefined list
* Output only Japanese, Chinese, a specified programming language, or otherwise prevent the model from generating a large set of tokens
* Sometimes JSON is not what you need (e.g. it may be finicky with string escaping) and you need some other structured output
* Small models may have difficulty following instructions
## End-to-end examples
This guide provides a step-by-step example of creating a structured output response with grammar using the Fireworks.ai API. The example uses Python and the OpenAI library to define the schema for the output.
### Prerequisites
Before you begin, ensure you have the following:
* Python installed on your system.
* `openai` libraries installed. You can install them using pip:
```bash
pip install openai
```
Next, select the model you want to use. In this example, we use `mixtral-8x7b-instruct`, but all Fireworks models support this feature. You can find your favorite model and get structured responses out of it!
### Step 1: Configure the Fireworks.ai client
You can use either Fireworks.ai or OpenAI SDK with this feature. Using OpenAI SDK with your API key and the base URL:
```python
import openai
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="Your_API_Key",
)
```
Replace `"Your_API_Key"` with your actual API key.
### Step 2: Define the output grammar
Define a grammar to restrict the specified output. Let's say you have a model that is a classifier and classifies patient requests into a few predefined classes:
```
root ::= diagnosis
diagnosis ::= "arthritis" | "dengue" | "urinary tract infection" | "impetigo" | "cervical spondylosis"
```
Then you can ask the model to only respond within these classes.
### Step 3: Specify your output grammar in your chat completions request
```python Python
from fireworks.client import Fireworks
client = Fireworks(
api_key="Your_API_Key",
)
diagnosis_grammar = """
root ::= diagnosis
diagnosis ::= "arthritis" | "dengue" | "urinary tract infection" | "impetigo" | "cervical spondylosis"
"""
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
response_format={"type": "grammar", "grammar": diagnosis_grammar},
messages=[
{
"role": "system",
"content": "Given the symptoms try to guess the possible diagnosis. Possible choices: arthritis, dengue, urinary tract infection, impetigo, cervical spondylosis. Answer with a single word",
},
{
"role": "user",
"content": "I have been having trouble with my muscles and joints. My neck is really tight and my muscles feel weak. I have swollen joints and it is hard to move around without becoming stiff. It is also really uncomfortable to walk.",
},
],
)
print(chat_completion.choices[0].message.content)
```
and for the response, we will only get one of the 5 classes we specified. In this case, the model output is
```
'arthritis'
```
Note that we have still done some prompt engineering to instruct the model about the possible diagnoses in free form. Alternatively, we could have used one of the fine-tuned models for the medical domain.
## Advanced examples
### Japanese and Chinese
You can also use a grammar to restrict the model's output to a particular character set. For example, the following grammar only allows Japanese characters (hiragana, katakana, punctuation, and CJK ideographs):
```python
from fireworks.client import Fireworks
client = Fireworks(
api_key="Your_API_Key",
)
cjk_grammar = """
root ::= jp-char+ ([ \t\n] jp-char+)*
jp-char ::= hiragana | katakana | punctuation | cjk
hiragana ::= [ぁ-ゟ]
katakana ::= [ァ-ヿ]
punctuation ::= [、-〾]
cjk ::= [一-鿿]
"""
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
response_format={"type": "grammar", "grammar": cjk_grammar},
messages=[
{
"role": "user",
"content": "You are a Japanese tour guide who speaks fluent Japanese. Please tell me what are some good places for me to visit in Kyoto",
},
],
)
print(chat_completion.choices[0].message.content)
```
The model will reply in Japanese
```
こんにちは、私は日本語を母国語として話せるキョトの私が案内する旅行案内者です。京都を旅行にお付き合いいただきありがとうごさいます。京都にはたくさんの楽しいところがありますが、私はあなたの需要に基いて、いくつかのおすすめていきます。\n最初に、古都の一面を体感できる場所として、清水寺をおすすめします。清水寺は世界的に有名な寺院で、美しい金面山だまのホームページと、きれいな庭で知られています。\n次に、京
```
Since the grammar is actually more lenient than Japanese-only and covers Chinese characters as well, we can also just prompt the model to be a fluent Chinese speaker:
```
You are a Japanese tour guide who speaks fluent Chinese. Please tell me what are some good places for me to visit in Shanghai?
```
You can see here that we are trying something a little difficult: asking a Japanese tour guide to speak Chinese. But with the help of the grammar, the model replied in Chinese, with the same grammar specified:
```
当您访问上海时、我建议您参观以下几个地方。\\n上海外国语大学。这是一所著名的大学、校园美景优秀、还有各种餐馆可供您选择。\\n上海中山公园。这是一座位于城市中心的公园、风景优美、有许多古老的建设和展览馆。\\n南京路。这是一条繁华的商业大街、有许多品牌商店和餐馆可供您选择。\\n上海南京东路步行街。这是全球最大的步行街之一、有许多商店和餐馆可供您选择。\\n上海世博中心。这是一座展览馆复合体、经常举办各种展览和会议
```
Without the help from the grammar, here is the model reply in a mix of Chinese and English
```
你好!uming as a Japanese tour guide who speaks fluent Chinese, I would be happy to recommend some places for you to visit in Shanghai!\\n\\n1. The Bund: This is a famous waterfront area in Shanghai that offers stunning views of the city's skyline, including the iconic Oriental Pearl Tower. You can take a leisurely stroll along the promenade and enjoy the beautiful scenery.\\n\\n2. Yuyuan Garden: This is a beautiful classical Chinese garden that dates back to the Ming Dynasty. It features pavilions, halls, rockeries, ponds, and cloisters, and is a great place to experience traditional Chinese architecture and garden design.\\n\\n3. Shanghai Tower: This is the tallest building in China and the second-tallest building in the world. You can take the elevator up to the observation deck on the 128th floor for breatht
```
### C code generation
This is one of the community-contributed grammars from llama.cpp. You can hook it up with a Fireworks model and try to come up with a good solution for a coding problem you have.
```python
from fireworks.client import Fireworks
client = Fireworks(
api_key="your_API_Key",
)
c_grammar = """
root ::= (declaration)*
declaration ::= dataType identifier "(" parameter? ")" "{" statement* "}"
dataType ::= "int" ws | "float" ws | "char" ws
identifier ::= [a-zA-Z_] [a-zA-Z_0-9]*
parameter ::= dataType identifier
statement ::=
( dataType identifier ws "=" ws expression ";" ) |
( identifier ws "=" ws expression ";" ) |
( identifier ws "(" argList? ")" ";" ) |
( "return" ws expression ";" ) |
( "while" "(" condition ")" "{" statement* "}" ) |
( "for" "(" forInit ";" ws condition ";" ws forUpdate ")" "{" statement* "}" ) |
( "if" "(" condition ")" "{" statement* "}" ("else" "{" statement* "}")? ) |
( singleLineComment ) |
( multiLineComment )
forInit ::= dataType identifier ws "=" ws expression | identifier ws "=" ws expression
forUpdate ::= identifier ws "=" ws expression
condition ::= expression relationOperator expression
relationOperator ::= ("<=" | "<" | "==" | "!=" | ">=" | ">")
expression ::= term (("+" | "-") term)*
term ::= factor(("*" | "/") factor)*
factor ::= identifier | number | unaryTerm | funcCall | parenExpression
unaryTerm ::= "-" factor
funcCall ::= identifier "(" argList? ")"
parenExpression ::= "(" ws expression ws ")"
argList ::= expression ("," ws expression)*
number ::= [0-9]+
singleLineComment ::= "//" [^\n]* "\n"
multiLineComment ::= "/*" ( [^*] | ("*" [^/]) )* "*/"
ws ::= ([ \t\n]+)"""
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
response_format={"type": "grammar", "grammar": c_grammar},
messages=[
{
"role": "user",
"content": "You are an expert in writing C code. Can you write a program that prints hello world?",
},
],
)
print(chat_completion.choices[0].message.content)
```
In this case, we get a cute little valid C program as the output:
```
char\nc(int a){return 2*a;}
```
## Syntax
### Background
[Backus-Naur Form (BNF)](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) is a notation for describing the syntax of formal languages like programming languages, file formats, and protocols. The Fireworks API uses an extension of BNF with a few modern regex-like features, inspired by [llama.cpp's implementation](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).
### Basics
In BNF, we define *production rules* that specify how a *non-terminal* (rule name) can be replaced with sequences of *terminals* (characters, specifically Unicode [code points](https://en.wikipedia.org/wiki/Code_point)) and other non-terminals. The basic format of a production rule is `nonterminal ::= sequence...`.
Consider an example of a small chess notation grammar:
```
# `root` specifies the pattern for the overall output
root ::= (
# it must start with the characters "1. " followed by a sequence
# of characters that match the `move` rule, followed by a space, followed
# by another move, and then a newline
"1. " move " " move "\n"
# it's followed by one or more subsequent moves, numbered with one or two digits
([1-9] [0-9]? ". " move " " move "\n")+
)
# `move` is an abstract representation, which can be a pawn, nonpawn, or castle.
# The `[+#]?` denotes the possibility of checking or mate signs after moves
move ::= (pawn | nonpawn | castle) [+#]?
pawn ::= ...
nonpawn ::= ...
castle ::= ...
```
### Non-terminals and terminals
Non-terminal symbols (rule names) stand for a pattern of terminals and other non-terminals. They are required to be a dashed lowercase word, like `move`, `castle`, or `check-mate`.
Terminals are actual characters ([code points](https://en.wikipedia.org/wiki/Code_point)). They can be specified as a sequence like `"1"` or `"O-O"` or as ranges like `[1-9]` or `[NBKQR]`.
### Characters and character ranges
Terminals support the full range of Unicode. Unicode characters can be specified directly in the grammar, for example `hiragana ::= [ぁ-ゟ]`, or with escapes: 8-bit (`\xXX`), 16-bit (`\uXXXX`) or 32-bit (`\UXXXXXXXX`).
Character ranges can be negated with `^`:
```
single-line ::= [^\n]+ "\n"
```
The dot `.` symbol matches any character:
```
any-three-symbol-sequence ::= ...
```
### Sequences and alternatives
The order of symbols in a sequence matters. For example, in `"1. " move " " move "\n"`, the `"1. "` must come before the first `move`, etc.
Alternatives, denoted by `|`, give different sequences that are acceptable. For example, in `move ::= pawn | nonpawn | castle`, `move` can be a `pawn` move, a `nonpawn` move, or a `castle`.
Parentheses `()` can be used to group sequences, which allows for embedding alternatives in a larger rule or applying repetition and optional symbols (below) to a sequence.
### Repetition and optional symbols
* `*` after a symbol or sequence means that it can be repeated zero or more times.
* `+` denotes that the symbol or sequence should appear one or more times.
* `?` makes the preceding symbol or sequence optional.
### Comments and newlines
Comments can be specified with `#`:
```
# defines optional whitespace
ws ::= [ \t\n]+
```
Newlines are allowed between rules and between symbols or sequences nested inside parentheses. Additionally, a newline after an alternate marker `|` will continue the current rule, even outside of parentheses.
### The root rule
In a full grammar, the `root` rule always defines the starting point of the grammar. In other words, it specifies what the entire output must match.
```
# a grammar for lists
root ::= ("- " item)+
item ::= [^\n]+ "\n"
```
# Using JSON mode
## What is JSON mode?
JSON mode enables you to force any Fireworks language model to respond in valid JSON, optionally conforming to a JSON schema you provide.
## Why JSON responses?
1. Clarity and Precision: Responding in JSON ensures that the output from the LLM is clear, precise, and easy to parse. This is particularly beneficial in scenarios where the response needs to be further processed or analyzed by other systems.
2. Ease of Integration: JSON, being a widely-used format, allows for easy integration with various platforms and applications. This interoperability is essential for developers looking to incorporate AI capabilities into their existing systems without extensive modifications.
## End-to-end example
This guide provides a step-by-step example of how to create a structured output response using the Fireworks.ai API. The example uses Python and the `pydantic` library to define the schema for the output.
### Prerequisites
Before you begin, ensure you have the following:
* Python installed on your system.
* `openai` and `pydantic` libraries installed. You can install them using pip:
```bash
pip install openai pydantic
```
Next, select the model you want to use. In this example, we use `mixtral-8x7b-instruct`, but all Fireworks models support this feature. You can find your favorite model and get a JSON response out of it!
### Step 1: Import libraries
Start by importing the required libraries:
```python
import openai
from pydantic import BaseModel, Field
```
### Step 2: Configure the Fireworks.ai client
You can use either Fireworks.ai or OpenAI SDK with this feature. Using OpenAI SDK with your API key and the base URL:
```python
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="Your_API_Key",
)
```
Replace `"Your_API_Key"` with your actual API key.
### Step 3: Define the output schema
Define a Pydantic model to specify the schema of the output. For example:
```python
class Result(BaseModel):
winner: str
```
This model defines a simple schema with a single field `winner`. If you are not familiar with Pydantic, please [check the documentation here](https://docs.pydantic.dev/latest/). Pydantic emits JSON Schema, and you can find more information [about it here](https://json-schema.org/).
### Step 4: Specify your output schema in your chat completions request
Make a request to the Fireworks.ai API to get a JSON response. In your request, specify the output schema you used in step 3. For example, to ask who won the US presidential election in 2012:
```python
chat_completion = client.chat.completions.create(
model="accounts/fireworks/models/mixtral-8x7b-instruct",
response_format={"type": "json_object", "schema": Result.model_json_schema()},
messages=[
{
"role": "user",
"content": "Who won the US presidential election in 2012? Reply just in one JSON.",
},
],
)
```
### Step 5: Display the result
Finally, print the result:
```python
print(repr(chat_completion.choices[0].message.content))
```
This will display the response in the format defined by the `Result` schema. We get just one nice json response:
```
'{\n "winner": "Barack Obama"\n}'
```
You can parse that as plain JSON and hook it up with the rest of your system. Currently, we enforce the structure with a grammar-based state machine to make sure the LLM always generates all the fields in the schema. If your provided output schema is not a valid JSON schema, the request will fail.
## Structured response modes
Fireworks supports the following variants:
* **Arbitrary JSON**. Similar to [OpenAI](https://platform.openai.com/docs/guides/text-generation/json-mode), you can force the model to produce any valid json by providing `{"type": "json_object"}` as `response_format` in the request. This forces the model to output JSON but does not specify what specific JSON schema to use.
* **JSON with the given schema**. To specify a given JSON schema, you can provide the schema according to [JSON schema spec](https://json-schema.org/specification) to be imposed on the model generation. See supported constructs in the next section.
**Important:** when using JSON mode, it's also crucial to instruct the model to produce JSON and describe the desired schema via a system or user message. Without this, the model may generate an unending stream of whitespace until the generation reaches the token limit, resulting in a long-running and seemingly "stuck" request.
To get the best outcome, you need to include the schema in **both the prompt and the `response_format` field.**
Technically, it means that when using "JSON with the given schema" mode, the model doesn't automatically "see" the schema passed in the `response_format` field. Adherence to the schema is forced upon the model during sampling. So for best results, you need to include the desired schema in the prompt in addition to specifying it as `response_format`. You may need to experiment with the best way to describe the schema in the prompt depending on the model: besides JSON schema, describing it in plain English might work well too, e.g. "extract name and address of the person in JSON format".
**Note:** the message content may be partially cut off if `finish_reason="length"`, which indicates the generation exceeded `max_tokens` or the conversation exceeded the max context length. In this case, the return value might not be valid JSON.
Structured response modes work for both Completions and Chat Completions APIs.
If you use [function calling](/docs/function-calling), JSON mode is enabled automatically and the function schema is added to the prompt, so none of the comments above apply.
### JSON schema constructs
Fireworks supports a subset of [JSON schema specification](https://json-schema.org/specification).
Supported:
* Nested schemas composition, including `anyOf` and `$ref`
* `type`: `string`, `number`, `integer`, `boolean`, `object`, `array`, `null`
* `properties` and `required` for objects
* `items` for arrays
The Fireworks API doesn't error out on unsupported constructs; they just won't be enforced. Constructs that are not yet supported include:
* Sophisticated composition with `oneOf`
* Length/size constraints for objects and arrays
* Regular expressions via `pattern`
**Note**: JSON specification [allows for arbitrary field names](https://json-schema.org/understanding-json-schema/reference/object#additionalproperties) to appear in an object with the `properties` constraint unless `"additionalProperties": false` or `"unevaluatedProperties": false` is provided. It's a poor default for LLM constrained generation since any hallucination would be accepted. Thus Fireworks treats any schema with `properties` constraint as if it had `"unevaluatedProperties": false`.
An example of `response_format` field with the schema accepting an object with two fields - a required string and an optional integer:
```
{
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"foo": {"type": "string"},
"bar": {"type": "integer"}
},
"required": ["foo"]
}
}
```
## Similar features
Check out our [function calling model](/guides/function-calling) if you're interested in use cases like:
* Multi-turn capabilities: For example, the ability for the model to ask for clarifying information about parameters
* Routing: The ability for the model to route across multiple different options or models. Instead of just having one possible JSON Schema, you have many different JSON schemas to work across.
Check out [grammar mode](/structured-responses/structured-output-grammar-based) if you want structured output specified not through JSON, but rather through an arbitrary grammar (limit output to specific words, character limits, character types, etc).
# Authentication
Authentication for access to your account
### Signing in
Users using Google SSO can run:
```
firectl signin
```
If you are using [custom SSO](/accounts/sso), also specify the account ID:
```
firectl signin my-enterprise-account
```
### Authenticate with API Key
To authenticate without a web browser, append `--api-key` to any firectl command.
```
firectl --api-key API_KEY
```
To persist the API key for all subsequent commands, run:
```
firectl set-api-key API_KEY
```
# Create a Dataset
Create a Dataset on Fireworks AI platform
```
firectl create dataset [flags]
```
### Example
```
firectl create dataset my-dataset /path/to/dataset.jsonl
```
### Flags
```
--display-name string The display name of the dataset.
-h, --help help for dataset
--quiet If true, does not print the upload progress bar.
```
# Create a deployment
Create a Deployment on Fireworks AI platform
Creates a new deployment.
```
firectl create deployment [flags]
```
### Example
```
firectl create deployment falcon-7b
```
### Flags
```
--description string Description of the deployment.
--disable-speculative-decoding If true, speculative decoding is disabled.
--display-name string Human-readable name of the deployment. Must be fewer than 64 characters long.
--max-peft-batch-size int32 Max batching of concurrent peft requests of the server.
--max-replica-count int32 Maximum number of replicas for the deployment. If min-replica-count > 0 defaults to 0, otherwise defaults to 1.
--min-replica-count int32 Minimum number of replicas for the deployment. If min-replica-count < max-replica-count the deployment will automatically scale between the two replica counts based on load.
--model-id string The ID of a model that should be deployed when the deployment is created.
--scale-down-window duration The duration the autoscaler will wait before scaling down a deployment after observing decreased load. Default is 10m.
--scale-to-zero-window duration The duration after which there are no requests that the deployment will be scaled down to zero replicas, if min-replica-count is 0. Default 1h.
--scale-up-window duration The duration the autoscaler will wait before scaling up a deployment after observing increased load. Default is 30s.
--unused-auto-delete-duration duration The duration for which if no requests are received, the deployment will automatically be deleted. If 0, the auto-deletion is disabled. (default 168h0m0s)
--wait Wait until the deployment is ready.
--world-size int32 The number of GPUs the base model is served with.
-h, --help help for deployment
```
### Flags inherited from parent commands
```
--dry-run Print the request proto without running it.
-o, --output Output Set the output format to "text" or "json". (default text)
```
# Create a fine-tuning job
Create a fine-tuning job with a base model
Creates a fine-tuning job on Fireworks AI platform with the provided configuration yaml.
```
firectl create fine-tuning-job [flags]
```
### Example
```
firectl create fine-tuning-job --settings-file settings.yaml
```
### Flags
```
--base-model string (required) The base model used for fine-tuning. e.g. mistralai/Mixtral-8x7B-Instruct-v0.1
--batch-size int32 (optional) The batch size of dataset used for training.
--conversation-template string (optional) The conversation jinja template field.
--dataset string (required) The ID of the dataset for the fine tuning.
--display-name string (optional) The display name of the fine-tuning job.
--draft-base-model string (optional) The draft model hf base model field.
--epochs float (optional) The number of epochs to train for.
--input-template string The input template. Required if kind is text_completion.
--job-id string (optional) The ID of the fine-tuning job.
--kind string (required) The kind of fine-tuning job to run. Must be "text_completion", "text_classification", or "conversation".
--label string The label field. Required if kind is text_classification.
--learning-rate float (optional) The learning rate used for training.
--lora-rank int32 (optional) The LoRA rank used for training.
--model-id string (optional) The ID of the uploaded model.
--output-template string The output template. Required if kind is text_completion.
--settings-file string If specified, the YAML file from which settings should be read.
--text string The text field. Required if kind is text_classification.
-w, --wait Block until the job is complete
-h, --help help for deployment
--wandb-api-key string (optional) A Weights & Biases API key associated with the entity.
--wandb-entity string (optional) The Weights & Biases entity where training progress should be reported.
--wandb-project string (optional) The Weights & Biases project where training progress should be reported.
```
# Create Model
Create a model on Fireworks AI platform
```
firectl create model [flags]
```
### Example
```
firectl create model my-model /path/to/checkpoint/
```
### Flags
```
--context-length int32 The maximum context length of the model.
--default-draft-model string The default speculative draft model to use when creating a deployment.
--default-draft-token-count int32 The default speculative draft token count when creating a deployment.
--description string The description of the model.
--display-name string The display name of the model.
--github-url string The GitHub URL of the model.
-h, --help help for model
--hugging-face-url string The Hugging Face URL of the model.
--public Whether the model is publicly accessible.
--quiet If true, does not print the upload progress bar.
--supports-image-input Whether the model supports image inputs.
--supports-tools Whether the model supports function calling.
```
### Flags inherited from parent commands
```
-o, --output Output Set the output format to "text" or "json". (default text)
```
# Delete Resources
Deletes resource(s) in a Fireworks AI account
### Delete a model
```
firectl delete model [flags]
```
#### Example
```
firectl delete model my-model
```
### Delete a fine-tuning job
```
firectl delete fine-tuning-job [flags]
```
#### Example
```
firectl delete fine-tuning-job my-fine-tuning-job
```
### Delete a deployment
Deletes a model deployment.
```
firectl delete deployment [flags]
```
#### Example
```
firectl delete deployment my-deployment
```
### Delete a dataset
```
firectl delete dataset [flags]
```
#### Example
```
firectl delete dataset my-dataset
```
### Flags
```
-h, --help help for deleting resources
```
# Deploy Model
Deploy a model on Fireworks AI platform
```
firectl deploy [flags]
```
#### Example
```
firectl deploy my-model
```
### Flags
```
--deployment-id string The ID of the deployment where the model is to be deployed.
-h, --help help for deploy
--wait Wait until the model is deployed.
```
# Download a model
Download a model from third-party locations
```
firectl download model [flags]
```
#### Example
```
firectl download model my-model /path/to/checkpoint/
```
### Flags
```
-h, --help help for download
```
# Get Resources
Retrieves information about resources on the Fireworks AI platform
```
firectl get [flags]
```
#### Example
```
firectl get model [flags]
```
### Retrieve user information
Prints information about a user.
```
firectl get user [flags]
```
#### Example
```
firectl get user john-08bb29
```
### Retrieve fine-tuning job information
Prints information about a fine-tuning job.
```
firectl get fine-tuning-job [flags]
```
#### Example
```
firectl get fine-tuning-job my-fine-tuning-job
```
### Get information about a deployment
```
firectl get deployment [flags]
```
#### Example
```
firectl get deployment my-deployment
```
### Get information about a dataset
```
firectl get dataset [flags]
```
#### Example
```
firectl get dataset instr-fine-tuning
```
### Flags
```
--dry-run Print the request proto without running it.
-o, --output Output Set the output format to "text" or "json". (default text)
```
### Flags inherited from parent commands
```
-o, --output Output Set the output format to "text" or "json". (default text)
```
# Import Model
Imports a specified model from the Fireworks AI platform
Imports a model from the `fireworks` account into your account.
```
firectl import model [flags]
```
#### Example
```
firectl import model llama-v3p1-8b-instruct
```
### Flags
```
-h, --help help for model
--model-id string The ID of the model to be created.
```
# List Resources
List various resources in a Fireworks AI account
```
firectl list [flags]
```
### List models
```
firectl list models
```
### List fine-tuning jobs
Prints all fine-tuning jobs in an account.
```
firectl list fine-tuning-jobs [flags]
```
### List deployments
Prints all deployments in the account.
```
firectl list deployments [flags]
```
### List deployed models
Prints all deployed models in an account.
```
firectl list deployed-models [flags]
```
### List datasets
Prints all datasets uploaded by a user in an account.
```
firectl list datasets [flags]
```
### Flags inherited from parent commands
```
--filter string Only resources satisfying the provided filter will be listed. See https://google.aip.dev/160 for the filter grammar.
-h, --help help for list
--no-paginate List all resources without pagination.
--order-by string A list of fields to order by. To specify a descending order for a field, append a " desc" suffix.
--page-size int32 The maximum number of resources to list.
--page-token string The page to list.
```
# Undeploy Model
Undeploy a model on Fireworks AI platform
```
firectl undeploy [flags]
```
#### Example
```
firectl undeploy my-model
```
### Flags
```
-h, --help help for undeploy
--wait Wait until the model is undeployed.
```
# Update Resources
Updates resources on the Fireworks AI platform
```
firectl update model [flags]
```
#### Example
```
firectl update model my-model --display-name="New Name"
```
### Flags
```
--context-length int32 The maximum context length of the model.
--default-draft-model string The default speculative draft model to use when creating a deployment.
--default-draft-token-count int32 The default speculative draft token count when creating a deployment.
--description string The description of the model.
--display-name string The display name of the model.
--github-url string The GitHub URL of the model.
-h, --help help for model
--hugging-face-url string The Hugging Face URL of the model.
--public Whether the model is publicly accessible.
--supports-image-input Whether the model supports image inputs.
--supports-tools Whether the model supports function calling.
```
## Update a user
```
firectl update user [flags]
```
#### Example
```
firectl update user my-user --display-name="Alice Cullen"
```
### Flags
```
--display-name string The display name of the user.
-h, --help help for user
--role string The role of the user. Must be one of {user, admin}.
```
## Update a deployment
```
firectl update deployment [flags]
```
#### Example
```
firectl update deployment my-deployment
```
### Flags
```
--description string Description of the deployment. Must be fewer than 1000 characters long.
--display-name string Human-readable name of the deployment. Must be fewer than 64 characters long.
-h, --help help for deployment
--max-peft-batch-size int32 Max batching of concurrent PEFT requests to the server.
--max-replica-count int32 The maximum number of replicas.
--min-replica-count int32 The minimum number of replicas. (default 1)
--scale-down-window duration The duration the autoscaler will wait before scaling down a deployment after observing decreased load. Default is 10m.
--scale-to-zero-window duration The duration of no requests after which the deployment is scaled down to zero replicas, if min-replica-count is 0. Default 1h.
--scale-up-window duration The duration the autoscaler will wait before scaling up a deployment after observing increased load. Default is 30s.
--unused-auto-delete-duration duration The duration of no requests after which the deployment is automatically deleted. If 0, auto-deletion is disabled.
--world-size int32 The number of GPUs the base model is served with.
```
## Update a dataset
```
firectl update dataset [flags]
```
#### Example
```
firectl update dataset my-dataset
```
### Flags
```
--display-name string The display name of the dataset.
-h, --help help for dataset
```
# Getting Started
Learn to create, deploy, and manage resources using Firectl
Firectl can be installed in several ways, depending on your platform and preference.
```bash homebrew
brew tap fw-ai/firectl
brew install firectl
# If you encounter a failed SHA256 check, try first running
brew update
```
```bash macOS (Apple Silicon)
curl https://storage.googleapis.com/fireworks-public/firectl/stable/darwin-arm64.gz -o firectl.gz
gzip -d firectl.gz && chmod a+x firectl
sudo mv firectl /usr/local/bin/firectl
sudo chown root: /usr/local/bin/firectl
```
```bash macOS (x86_64)
curl https://storage.googleapis.com/fireworks-public/firectl/stable/darwin-amd64.gz -o firectl.gz
gzip -d firectl.gz && chmod a+x firectl
sudo mv firectl /usr/local/bin/firectl
sudo chown root: /usr/local/bin/firectl
```
```bash Linux (x86_64)
wget -O firectl.gz https://storage.googleapis.com/fireworks-public/firectl/stable/linux-amd64.gz
gunzip firectl.gz
sudo install -o root -g root -m 0755 firectl /usr/local/bin/firectl
```
```Text Windows (64 bit)
wget -L https://storage.googleapis.com/fireworks-public/firectl/stable/firectl.exe
```
### Sign into Fireworks account
To sign into your Fireworks account:
```bash
firectl signin
```
If you have set up [Custom SSO](/accounts/sso) then also pass your account ID:
```bash
firectl signin <account-id>
```
### Check you have signed in
To show which account you have signed into:
```bash
firectl whoami
```
### Check your installed version
```bash
firectl version
```
### Upgrade to the latest version
```bash
sudo firectl upgrade
```
# OpenAI compatibility
You can use the [OpenAI Python client library](https://github.com/openai/openai-python) to interact with Fireworks, which makes it particularly easy to migrate existing applications that already use OpenAI.
## Specify endpoint and API key
You can override these parameters for the entire application using environment variables:
```shell Shell
export OPENAI_API_BASE="https://api.fireworks.ai/inference/v1"
export OPENAI_API_KEY=""
```
or by setting these values in Python:
```python
import openai
# warning: it has a process-wide effect
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""
```
Alternatively, you may specify these parameters for a single request (useful if you mix calls to OpenAI and Fireworks in the same process):
```python
# api_base and api_key can be passed to any of the supported APIs
chat_completion = openai.ChatCompletion.create(
    api_base="https://api.fireworks.ai/inference/v1",
    api_key="",
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # example model
    messages=[{"role": "user", "content": "Hello"}],
)
```
Note that if you're using the OpenAI SDK, the `usage` field won't be listed in the SDK's structure definition, but it can still be accessed directly (a fuller Python sketch follows this list). For example:
* In Python SDK, you can access the attribute directly, e.g. `for chunk in openai.ChatCompletion.create(...): print(chunk["usage"])`.
* In TypeScript SDK, you need to cast away the typing, e.g. `for await (const chunk of await openai.chat.completions.create(...)) { console.log((chunk as any).usage); }`.
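Putting the Python bullet above into a runnable form, here is a minimal sketch. It assumes the pre-1.0 `openai` package used elsewhere on this page, your own API key, and an example model path; adjust these to your setup:
```python
import openai

# Point the OpenAI client at Fireworks (process-wide, as noted above).
openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = ""  # your Fireworks API key

# Stream a chat completion and read the usage field off each raw chunk.
for chunk in openai.ChatCompletion.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # example model
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
):
    # `usage` is not part of the SDK's typed schema, so access it directly.
    print(chunk["usage"])
```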
### Unsupported options
The following options are not yet supported:
* `presence_penalty`
* `frequency_penalty`
* `best_of`: you can use `n` instead
* `logit_bias`
* `functions`: you can use our [LangChain integration](https://python.langchain.com/docs/integrations/providers/fireworks) to achieve similar functionality client-side
Please reach out to us on [Discord](https://discord.gg/fireworks-ai) if you have a use case requiring one of these.
# API Reference
## BaseCompletion Objects
```python
class BaseCompletion()
```
Base class for handling completions. This class provides shared logic for creating completions,\
both synchronously and asynchronously, and both streaming and non-streaming.
**Attributes**:
* `endpoint` *str* - API endpoint for the completion request.
* `response_class` *Type* - Class used for parsing the non-streaming response.
* `stream_response_class` *Type* - Class used for parsing the streaming response.
#### create
```python
@classmethod
def create(cls,
model,
prompt_or_messages=None,
request_timeout=600,
stream=False,
**kwargs)
```
Create a completion or chat completion.
**Arguments**:
* `model` *str* - Model name to use for the completion.
* `prompt_or_messages` *Union\[str, List\[ChatMessage]]* - The prompt for Completion or a list of chat messages for ChatCompletion. If not specified, must specify either `prompt` or `messages` in kwargs.
* `request_timeout` *int, optional* - Request timeout in seconds. Defaults to 600.
* `stream` *bool, optional* - Whether to use streaming or not. Defaults to False.
* `**kwargs` - Additional keyword arguments.
**Returns**:
`Union[CompletionResponse, Generator[CompletionStreamResponse, None, None]]`:\
Depending on the `stream` argument, either returns a CompletionResponse\
or a generator yielding CompletionStreamResponse.
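For illustration, a minimal synchronous sketch of `create` via the `Completion` subclass documented below, assuming it is importable from `fireworks.client`, using an example model path and your API key:
```python
import fireworks.client

fireworks.client.api_key = ""  # your Fireworks API key

# Non-streaming: returns a CompletionResponse.
response = fireworks.client.Completion.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # example model
    prompt="Say this is a test:",
    max_tokens=16,
)
print(response.choices[0].text)

# Streaming: returns a generator of CompletionStreamResponse objects.
for chunk in fireworks.client.Completion.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # example model
    prompt="Say this is a test:",
    max_tokens=16,
    stream=True,
):
    print(chunk.choices[0].text, end="")
```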
#### acreate
```python
@classmethod
def acreate(cls, model, *args, request_timeout=600, stream=False, **kwargs)
```
Asynchronously create a completion.
**Arguments**:
* `model` *str* - Model name to use for the completion.
* `request_timeout` *int, optional* - Request timeout in seconds. Defaults to 600.
* `stream` *bool, optional* - Whether to use streaming or not. Defaults to False.
* `**kwargs` - Additional keyword arguments.
**Returns**:
`Union[CompletionResponse, AsyncGenerator[CompletionStreamResponse, None]]`:\
Depending on the `stream` argument, either returns a CompletionResponse or an async generator yielding CompletionStreamResponse.
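A corresponding asynchronous sketch under the same assumptions (non-streaming shown for brevity):
```python
import asyncio

import fireworks.client

fireworks.client.api_key = ""  # your Fireworks API key


async def main():
    # Awaiting acreate with stream=False returns a CompletionResponse.
    response = await fireworks.client.Completion.acreate(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # example model
        prompt="Say this is a test:",
        max_tokens=16,
    )
    print(response.choices[0].text)


asyncio.run(main())
```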
# completion
## Completion Objects
```python
class Completion(BaseCompletion)
```
Class for handling text completions.
# chat\_completion
## ChatCompletion Objects
```python
class ChatCompletion(BaseCompletion)
```
Class for handling chat completions.
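For example, a minimal chat completion sketch under the same assumptions as the completion examples above, passing plain dicts in place of `ChatMessage` objects:
```python
import fireworks.client

fireworks.client.api_key = ""  # your Fireworks API key

# ChatCompletion takes a list of chat messages instead of a plain prompt.
response = fireworks.client.ChatCompletion.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # example model
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```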
# api
## Choice Objects
```python
class Choice(BaseModel)
```
A completion choice.
**Attributes**:
* `index` *int* - The index of the completion choice.
* `text` *str* - The completion response.
* `logprobs` *float, optional* - The log probabilities of the most likely tokens.
* `finish_reason` *str* - The reason the model stopped generating tokens. This will be "stop" if the model hit a natural stop point or a provided stop sequence, or "length" if the maximum number of tokens specified in the request was reached.
## CompletionResponse Objects
```python
class CompletionResponse(BaseModel)
```
The response message from a /v1/completions call.
**Attributes**:
* `id` *str* - A unique identifier of the response.
* `object` *str* - The object type, which is always "text\_completion".
* `created` *int* - The Unix time in seconds when the response was generated.
* `choices` *List\[Choice]* - The list of generated completion choices.
## CompletionResponseStreamChoice Objects
```python
class CompletionResponseStreamChoice(BaseModel)
```
A streamed completion choice.
**Attributes**:
* `index` *int* - The index of the completion choice.
* `text` *str* - The completion response.
* `logprobs` *float, optional* - The log probabilities of the most likely tokens.
* `finish_reason` *str* - The reason the model stopped generating tokens. This will be "stop" if the model hit a natural stop point or a provided stop sequence, or "length" if the maximum number of tokens specified in the request was reached.
## CompletionStreamResponse Objects
```python
class CompletionStreamResponse(BaseModel)
```
The streamed response message from a /v1/completions call.
**Attributes**:
* `id` *str* - A unique identifier of the response.
* `object` *str* - The object type, which is always "text\_completion".
* `created` *int* - The Unix time in seconds when the response was generated.
* `model` *str* - The model used for the completion.
* `choices` *List\[CompletionResponseStreamChoice]* - The list of streamed completion choices.
## Model Objects
```python
class Model(BaseModel)
```
A model deployed to the Fireworks platform.
**Attributes**:
* `id` *str* - The model name.
* `object` *str* - The object type, which is always "model".
* `created` *int* - The Unix time in seconds when the model was created.
## ListModelsResponse Objects
```python
class ListModelsResponse(BaseModel)
```
The response message from a /v1/models call.
**Attributes**:
* `object` *str* - The object type, which is always "list".
* `data` *List\[Model]* - The list of models.
## ChatMessage Objects
```python
class ChatMessage(BaseModel)
```
A chat completion message.
**Attributes**:
* `role` *str* - The role of the author of this message.
* `content` *str* - The contents of the message.
## ChatCompletionResponseChoice Objects
```python
class ChatCompletionResponseChoice(BaseModel)
```
A chat completion choice generated by a chat model.
**Attributes**:
* `index` *int* - The index of the chat completion choice.
* `message` *ChatMessage* - The chat completion message.
* `finish_reason` *Optional\[str]* - The reason the model stopped generating tokens. This will be "stop" if the model hit a natural stop point or a provided stop sequence, or "length" if the maximum number of tokens specified in the request was reached.
## UsageInfo Objects
```python
class UsageInfo(BaseModel)
```
Usage statistics.
**Attributes**:
* `prompt_tokens` *int* - The number of tokens in the prompt.
* `total_tokens` *int* - The total number of tokens used in the request (prompt + completion).
* `completion_tokens` *Optional\[int]* - The number of tokens in the generated completion.
## ChatCompletionResponse Objects
```python
class ChatCompletionResponse(BaseModel)
```
The response message from a /v1/chat/completions call.
**Attributes**:
* `id` *str* - A unique identifier of the response.
* `object` *str* - The object type, which is always "chat.completion".
* `created` *int* - The Unix time in seconds when the response was generated.
* `model` *str* - The model used for the chat completion.
* `choices` *List\[ChatCompletionResponseChoice]* - The list of chat completion choices.
* `usage` *UsageInfo* - Usage statistics for the chat completion.
## DeltaMessage Objects
```python
class DeltaMessage(BaseModel)
```
A message delta.
**Attributes**:
* `role` *str* - The role of the author of this message.
* `content` *str* - The contents of the chunk message.
## ChatCompletionResponseStreamChoice Objects
```python
class ChatCompletionResponseStreamChoice(BaseModel)
```
A streamed chat completion choice.
**Attributes**:
* `index` *int* - The index of the chat completion choice.
* `delta` *DeltaMessage* - The message delta.
* `finish_reason` *str* - The reason the model stopped generating tokens. This will be "stop" if the model hit a natural stop point or a provided stop sequence, or "length" if the maximum number of tokens specified in the request was reached.
## ChatCompletionStreamResponse Objects
```python
class ChatCompletionStreamResponse(BaseModel)
```
The streamed response message from a /v1/chat/completions call.
**Attributes**:
* `id` *str* - A unique identifier of the response.
* `object` *str* - The object type, which is always "chat.completion".
* `created` *int* - The Unix time in seconds when the response was generated.
* `model` *str* - The model used for the chat completion.
* `choices` *List\[ChatCompletionResponseStreamChoice]* - The list of streamed chat completion choices.
# model
## Model Objects
```python
class Model()
```
#### list
```python
@classmethod
def list(cls, request_timeout=60)
```
Returns a list of available models.
**Arguments**:
* `request_timeout` *int, optional* - The request timeout in seconds. Default is 60.
**Returns**:
* `ListModelsResponse` - A list of available models.
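A minimal sketch, assuming `Model` is importable from `fireworks.client` like the completion classes above:
```python
import fireworks.client

fireworks.client.api_key = ""  # your Fireworks API key

# List the models available to your account.
models = fireworks.client.Model.list()
for model in models.data:
    print(model.id)
```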
# log
#### set\_console\_log\_level
```python
def set_console_log_level(level: str) -> None
```
Controls console logging.
**Arguments**:
* `level` - the minimum level that prints out to console.\
Supported values: \[CRITICAL, FATAL, ERROR, WARN,\
WARNING, INFO, DEBUG]
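A minimal sketch, assuming the function is importable from the `fireworks.client.log` module documented here:
```python
from fireworks.client.log import set_console_log_level

# Only print WARNING and above to the console.
set_console_log_level("WARNING")
```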
# error
## PermissionError Objects
```python
class PermissionError(FireworksError)
```
A permission denied error.
## InvalidRequestError Objects
```python
class InvalidRequestError(FireworksError)
```
An invalid request error.
## AuthenticationError Objects
```python
class AuthenticationError(FireworksError)
```
An authentication error.
## RateLimitError Objects
```python
class RateLimitError(FireworksError)
```
A rate limit error.
## InternalServerError Objects
```python
class InternalServerError(FireworksError)
```
An internal server error.
## ServiceUnavailableError Objects
```python
class ServiceUnavailableError(FireworksError)
```
A service unavailable error.
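As an illustration, a minimal sketch of handling these errors around a completion call, assuming the classes are importable from a `fireworks.client.error` module and reusing the example model path from above:
```python
import fireworks.client
from fireworks.client.error import RateLimitError, ServiceUnavailableError

fireworks.client.api_key = ""  # your Fireworks API key

try:
    response = fireworks.client.Completion.create(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # example model
        prompt="Say this is a test:",
        max_tokens=16,
    )
    print(response.choices[0].text)
except RateLimitError:
    print("Rate limited; wait for the quota to reset and retry.")
except ServiceUnavailableError:
    print("Service temporarily unavailable; retry later.")
```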
# Getting Started
You can install the client library with pip:
```bash pip
pip install --upgrade fireworks-ai
```
### Authentication
You can authenticate with Fireworks by setting the `fireworks.client.api_key` variable:
```python
import fireworks.client

fireworks.client.api_key = ""
```
Or by setting the `FIREWORKS_API_KEY` environment variable:
```bash
export FIREWORKS_API_KEY=
```
# Inference errors
This page lists common error codes encountered during inference requests using the Fireworks API, their meanings, and potential resolutions.
## Error codes
Below is a table of common status codes and their associated messages for inference-related API requests.
| **Error Code** | **Error Name** | **Possible Issue(s)** | **How to Resolve** |
| -------------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `400` | `Bad Request` | Invalid input or malformed request. | Review the request parameters and ensure they match the expected format. |
| `401` | `Unauthorized` | Invalid API key or insufficient permissions. | Verify your API key and ensure it has the correct permissions. |
| `402` | `Payment Required` | User's account is not on a paid plan or has exceeded usage limits. | Check your billing status and ensure your payment method is up to date. Upgrade your plan if necessary. |
| `403` | `Forbidden` | The model name may be incorrect, or the model does not exist. This error is also returned to avoid leaking information about model availability. | Verify the model name on the Fireworks site and ensure it exists. Double-check the spelling of the model name in your request. |
| `404` | `Not Found` | The API endpoint is incorrect, or the resource path is invalid (e.g., a user tried accessing `/v1/foobar` instead of a valid endpoint). | Verify the URL path in your request and ensure you are using the correct API endpoint as per the documentation. |
| `405` | `Method Not Allowed` | Using an unsupported HTTP method (e.g., using GET instead of POST). | Check the API documentation for the correct HTTP method to use for the request. |
| `408` | `Request Timeout` | The request took too long to complete, possibly due to server overload or network issues. | Retry the request after a brief wait. Consider increasing the timeout value if applicable. |
| `412` | `Precondition Failed` | This error occurs when attempting to invoke a LoRA model that failed to load. The final validation of the model happens during inference, not at upload time. | Check the body of the request for a detailed error message. Ensure the LoRA model was uploaded correctly and is compatible. Contact support if the issue persists. |
| `413` | `Payload Too Large` | Input data exceeds the allowed size limit. | Reduce the size of the input payload (e.g., by trimming large text or image data). |
| `429` | `Over Quota` | The user has reached the API rate limit. | Wait for the quota to reset or upgrade your plan for a higher rate limit. |
| `500` | `Internal Server Error` | This indicates a server-side code bug and is unlikely to resolve on its own. | Contact Fireworks support immediately, as this error typically requires intervention from the engineering team. |
| `502` | `Bad Gateway` | The server received an invalid response from an upstream server. | Wait and retry the request. If the error persists, it may indicate a server outage. |
| `503` | `Service Unavailable` | The service is down for maintenance or experiencing issues. | Retry the request after some time. Check for any maintenance announcements. |
| `504` | `Gateway Timeout` | The server did not receive a response in time from an upstream server. | Wait briefly and retry the request. Consider using a shorter input prompt if applicable. |
| `520` | `Unknown Error` | An unexpected error occurred with no clear explanation. | Retry the request. If the issue persists, contact support for further assistance. |
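Several of the codes above (408, 429, 502, 503, 504) are transient and often succeed on retry. The following is a minimal retry sketch with exponential backoff using the `requests` library; the API key and model path are placeholders to adjust for your setup:
```python
import time

import requests

API_KEY = ""  # your Fireworks API key
URL = "https://api.fireworks.ai/inference/v1/chat/completions"
RETRYABLE = {408, 429, 502, 503, 504}

payload = {
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",  # example model
    "messages": [{"role": "user", "content": "Hello"}],
}

response = None
for attempt in range(5):
    response = requests.post(
        URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=60,
    )
    if response.status_code not in RETRYABLE:
        break
    # Exponential backoff: 1s, 2s, 4s, 8s between attempts.
    time.sleep(2 ** attempt)

response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```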
## Troubleshooting tips
If you encounter an error not listed here, try the following:
* Review the API documentation for the correct usage of endpoints and parameters.
* Check the [Fireworks status page](https://status.fireworks.ai) for any ongoing service disruptions.
* Contact support at [support@fireworks.ai](mailto:support@fireworks.ai) for further assistance.
These steps usually provide additional insight into the issue you are encountering.
## Need more help?
If you continue to experience issues, please reach out on our [Discord channel](https://discord.gg/fireworks-ai).