# Custom SSO Set up custom Single Sign-On (SSO) authentication for Fireworks AI. Fireworks uses single sign-on (SSO) as the primary mechanism to authenticate with the platform. By default, Fireworks supports Google SSO. If you have an enterprise account, Fireworks supports bringing your own identity provider using: * OpenID Connect (OIDC) provider * SAML 2.0 provider Coordinate with your Fireworks AI representative to enable the integration. ## OpenID Connect (OIDC) provider Create an OIDC client application in your identity provider, e.g. Okta. Ensure the client is configured for "code authorization" of the "web" type (i.e. with a client\_secret). Set the client's "allowed redirect URL" to the URL provided by Fireworks. It looks like: ``` https://fireworks-.auth.us-west-2.amazoncognito.com/oauth2/idpresponse ``` Note down the `issuer`, `client_id`, and `client_secret` for the newly created client. You will need to provide these to your Fireworks AI representative to complete your account setup. ## SAML 2.0 provider Create a SAML 2.0 application in your identity provider, e.g. [Okta](https://help.okta.com/en-us/Content/Topics/Apps/Apps_App_Integration_Wizard_SAML.htm). Set the SSO URL to the URL provided by Fireworks. It looks like: ``` https://fireworks-.auth.us-west-2.amazoncognito.com/saml2/idpresponse ``` Configure the Audience URI (SP Entity ID) as provided by Fireworks. It looks like: ``` urn:amazon:cognito:sp: ``` Create an Attribute Statement with the name: ``` http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress ``` and the value `user.email`. Leave the rest of the settings as defaults. Note down the "metadata url" for your newly created application. You will need to provide this to your Fireworks AI representative to complete your account setup. ## Troubleshooting ### Invalid samlResponse or relayState from identity provider This error occurs if you are trying to use identity provider (IdP) initiated login. Fireworks currently only supports service provider (SP) initiated login. See [Understanding SAML](https://developer.okta.com/docs/concepts/saml/#understand-sp-initiated-sign-in-flow) for an in-depth explanation. ### Required String parameter 'RelayState' is not present See above. # Managing users Add and delete additional users in your Fireworks account. See the concepts [page](/getting-started/concepts#account) for definitions of accounts and users. Only admin users can manage other users within the account. ## Adding users To add a new user to your Fireworks account, run the following command. If the email for the new user is already associated with a Fireworks account, they will have the option to freely switch between your account and their existing account(s). You can also add users in the Fireworks web UI at [https://fireworks.ai/account/users](https://fireworks.ai/account/users).
```bash firectl create user --email="alice@example.com" ``` To create another admin user, pass the `--role=admin` flag: ```bash firectl create user --email="alice@example.com" --role=admin ``` ## Updating a user's role To update a user's role, run ```bash firectl update user --role="{admin,user}" ``` ## Deleting users You can remove a user from your account by running: ```bash firectl delete user ``` # Batch Delete Batch Jobs post /v1/accounts/{account_id}/batchJobs:batchDelete # Batch Delete Environments post /v1/accounts/{account_id}/environments:batchDelete # Batch Delete Node Pools post /v1/accounts/{account_id}/nodePools:batchDelete # Cancel Batch Job post /v1/accounts/{account_id}/batchJobs/{batch_job_id}:cancel Cancels an existing batch job if it is queued, pending, or running. # Connect Environment post /v1/accounts/{account_id}/environments/{environment_id}:connect Connects the environment to a node pool. Returns an error if there is an existing pending connection. # Create Aws Iam Role Binding post /v1/accounts/{account_id}/awsIamRoleBindings # Create Batch Job post /v1/accounts/{account_id}/batchJobs # Create Cluster post /v1/accounts/{account_id}/clusters # Create Environment post /v1/accounts/{account_id}/environments # Create Node Pool post /v1/accounts/{account_id}/nodePools # Create Node Pool Binding post /v1/accounts/{account_id}/nodePoolBindings # Create Snapshot post /v1/accounts/{account_id}/snapshots # Delete Aws Iam Role Binding post /v1/accounts/{account_id}/awsIamRoleBindings:delete # Delete Batch Job delete /v1/accounts/{account_id}/batchJobs/{batch_job_id} # Delete Cluster delete /v1/accounts/{account_id}/clusters/{cluster_id} # Delete Environment delete /v1/accounts/{account_id}/environments/{environment_id} # Delete Node Pool delete /v1/accounts/{account_id}/nodePools/{node_pool_id} # Delete Node Pool Binding post /v1/accounts/{account_id}/nodePoolBindings:delete # Delete Snapshot delete /v1/accounts/{account_id}/snapshots/{snapshot_id} # Disconnect Environment post /v1/accounts/{account_id}/environments/{environment_id}:disconnect Disconnects the environment from the node pool. Returns an error if the environment is not connected to a node pool. 
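The management endpoints listed throughout this reference are all called the same way: an HTTPS request to the path shown, authenticated with a `Bearer` API key (see the Introduction section later in this reference). As a minimal sketch, assuming the standard `https://api.fireworks.ai` base URL and placeholder account ID and key:

```python
# Hypothetical example: list batch jobs for an account via the management REST API.
# The account ID and API key below are placeholders; substitute your own values.
import requests

API_KEY = "<...>"          # Fireworks API key
ACCOUNT_ID = "my-account"  # your account ID

resp = requests.get(
    f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/batchJobs",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
print(resp.json())
```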
# Get Batch Job get /v1/accounts/{account_id}/batchJobs/{batch_job_id} # Get Batch Job Logs get /v1/accounts/{account_id}/batchJobs/{batch_job_id}:getLogs # Get Cluster get /v1/accounts/{account_id}/clusters/{cluster_id} # Get Cluster Connection Info get /v1/accounts/{account_id}/clusters/{cluster_id}:getConnectionInfo Retrieve connection settings for the cluster to be put in kubeconfig # Get Environment get /v1/accounts/{account_id}/environments/{environment_id} # Get Node Pool get /v1/accounts/{account_id}/nodePools/{node_pool_id} # Get Node Pool Stats get /v1/accounts/{account_id}/nodePools/{node_pool_id}:getStats # Get Snapshot get /v1/accounts/{account_id}/snapshots/{snapshot_id} # List Aws Iam Role Bindings get /v1/accounts/{account_id}/awsIamRoleBindings # List Batch Jobs get /v1/accounts/{account_id}/batchJobs # List Clusters get /v1/accounts/{account_id}/clusters # List Environments get /v1/accounts/{account_id}/environments # List Node Pool Bindings get /v1/accounts/{account_id}/nodePoolBindings # List Node Pools get /v1/accounts/{account_id}/nodePools # List Snapshots get /v1/accounts/{account_id}/snapshots # Update Batch Job patch /v1/accounts/{account_id}/batchJobs/{batch_job_id} # Update Cluster patch /v1/accounts/{account_id}/clusters/{cluster_id} # Update Environment patch /v1/accounts/{account_id}/environments/{environment_id} # Update Node Pool patch /v1/accounts/{account_id}/nodePools/{node_pool_id} # Align transcription post /audio/alignments ### Request ##### (multi-part form) The input audio file to align with text. Common file formats such as mp3, flac, and wav are supported. Note that the audio will be resampled to 16kHz, downmixed to mono, and reformatted to 16-bit signed little-endian format before transcription. Pre-converting the file before sending it to the API can improve runtime performance. The text to align with the audio. String name of the voice activity detection (VAD) model to use. Can be one of `silero` or `whisperx-pyannet`. String name of the alignment model to use. Currently supported: * `mms_fa` optimal accuracy for multilingual speech. * `tdnn_ffn` optimal accuracy for English-only speech. * `gentle` best accuracy for English-only speech (requires a dedicated endpoint, contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)). The format in which to return the response. Can be one of `srt`, `verbose_json`, or `vtt`. Audio preprocessing mode. Currently supported: * `none` to skip audio preprocessing. * `dynamic` for arbitrary audio content with variable loudness. * `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs. * `bass_dynamic` for boosting lower frequencies. ### Response The task which was performed. Either `transcribe` or `translate`. The language of the transcribed/translated text. The duration of the transcribed/translated audio, in seconds. The transcribed/translated text. Extracted words and their corresponding timestamps. The text content of the word. Start time of the word in seconds. End time of the word in seconds. Segments of the transcribed/translated text and their corresponding details.
```python python
!pip install fireworks-ai requests

import time

import requests
from fireworks.client.audio import AudioInference

# Prepare client
audio = requests.get("https://tinyurl.com/3pddjjdc").content
text = "At this turning point of history there manifest themselves, side by side and often mixed and entangled together, a magnificent, manifold, virgin forest-like upgrowth and upstriving, a kind of tropical tempo in the rivalry of growth, and an extraordinary decay and self-destruction owing to the savagely opposing and seemingly exploding egoisms which strive with one another for sun and light, and can no longer assign any limit, restraint, or forbearance for themselves by means of the hitherto existing morality"
client = AudioInference(
    model="whisper-v3-turbo",
    base_url="https://audio-prod.us-virginia-1.direct.fireworks.ai",
    api_key="<...>",
)

# Make request
start = time.time()
r = await client.align_async(audio=audio, text=text)
print(f"Took: {(time.time() - start):.3f}s. Response: '{r}'")
```

```curl curl
# Download audio file
curl -sL -o "30s.flac" "https://tinyurl.com/3pddjjdc"

# Make request
curl -X POST "https://api.fireworks.ai/inference/v1/audio/alignments" \
  -H "Authorization: Bearer <...>" \
  -F "file=@30s.flac" \
  -F "text=At this turning point of history there manifest themselves, side by side and often mixed and entangled together, a magnificent, manifold, virgin forest-like upgrowth and upstriving, a kind of tropical tempo in the rivalry of growth, and an extraordinary decay and self-destruction owing to the savagely opposing and seemingly exploding egoisms which strive with one another for sun and light, and can no longer assign any limit, restraint, or forbearance for themselves by means of the hitherto existing morality"
```

# Streaming Transcription websocket /audio/transcriptions/streaming Streaming transcription is performed over a WebSocket. Provide the transcription parameters and establish a WebSocket connection to the endpoint. Stream short audio chunks (50-400ms) in binary frames of PCM 16-bit little-endian at 16kHz sample rate and single channel (mono). In parallel, receive transcription from the WebSocket. Stream audio to get transcription continuously in real-time. ### URL Please use the following serverless endpoint:

```
wss://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming
```

### Headers Your Fireworks API key, e.g. `Authorization=API_KEY`. ### Query Parameters The format in which to return the response. Currently only `verbose_json` is recommended for streaming. The target language for transcription. The set of supported target languages can be found [here](https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/tokenizer.py#L10-L128). The input prompt that the model will use when generating the transcription. Can be used to specify custom words or specify the style of the transcription. E.g. `Um, here's, uh, what was recorded.` will make the model include filler words in the transcription. Sampling temperature to use when decoding text tokens during transcription. ### Streaming Audio Stream short audio chunks (50-400ms) in binary frames of PCM 16-bit little-endian at 16kHz sample rate and single channel (mono). Typically, you will: 1. Resample your audio to 16 kHz if it is not already. 2. Convert it to mono. 3. Send 50ms chunks (16,000 Hz \* 0.05s = 800 samples) of audio in 16-bit PCM (signed, little-endian) format.
### Handling Responses The client maintains a state dictionary, starting with an empty dictionary `{}`. When the server sends the first transcription message, it contains a list of segments. Each segment has an `id` and `text`:

```python
# Server initial message
{
    "segments": [
        {"id": "0", "text": "This is the first sentence"},
        {"id": "1", "text": "This is the second sentence"}
    ]
}

# Client initial state
{
    "0": "This is the first sentence",
    "1": "This is the second sentence",
}
```

When the server sends the next updates to the transcription, the client updates the state dictionary based on the segment `id`:

```python
# Server continuous message
{
    "segments": [
        {"id": "1", "text": "This is the second sentence modified"},
        {"id": "2", "text": "This is the third sentence"}
    ]
}

# Client continuous update
{
    "0": "This is the first sentence",
    "1": "This is the second sentence modified",  # overwritten
    "2": "This is the third sentence",  # new
}
```

### Example Usage Check out a brief Python example below or example sources: * [Python notebook](https://colab.research.google.com/github/fw-ai/cookbook/blob/main/learn/audio/audio_streaming_speech_to_text/audio_streaming_speech_to_text.ipynb) * [Python sources](https://github.com/fw-ai/cookbook/tree/main/learn/audio/audio_streaming_speech_to_text/python) * [Node.js sources](https://github.com/fw-ai/cookbook/tree/main/learn/audio/audio_streaming_speech_to_text/nodejs)

```python
!pip3 install requests torch torchaudio websocket-client

import io
import time
import json
import torch
import requests
import torchaudio
import threading
import websocket
import urllib.parse

lock = threading.Lock()
segments = {}

# Prepare audio: download a sample clip, resample to 16 kHz mono,
# and split it into 50ms chunks of 16-bit little-endian PCM.
chunk_size_ms = 50
audio_bytes = requests.get("https://tinyurl.com/3pddjjdc").content
waveform, sample_rate = torchaudio.load(io.BytesIO(audio_bytes))
waveform = torchaudio.functional.resample(waveform.mean(dim=0), sample_rate, 16000)
pcm16 = (waveform.clamp(-1, 1) * 32767).to(torch.int16).numpy().tobytes()
bytes_per_chunk = 2 * 16000 * chunk_size_ms // 1000  # 2 bytes per 16-bit sample
audio_chunk_bytes = [pcm16[i:i + bytes_per_chunk] for i in range(0, len(pcm16), bytes_per_chunk)]

def on_open(ws):
    # Send audio chunks
    def send_audio_chunks():
        for chunk in audio_chunk_bytes:
            ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
            time.sleep(chunk_size_ms / 1000.0)
        time.sleep(2)
        ws.close()

    threading.Thread(target=send_audio_chunks).start()

def on_message(ws, message):
    # Merge new segments with existing segments
    msg = json.loads(message)
    new_segments = {seg["id"]: seg["text"] for seg in msg.get("segments", [])}
    with lock:
        segments.update(new_segments)
        print(json.dumps(segments, indent=2))

def on_error(ws, error):
    print(f"WebSocket error: {error}")

# Open a connection URL with query params
url = "wss://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming"
params = urllib.parse.urlencode({
    "language": "en",
})
ws = websocket.WebSocketApp(
    f"{url}?{params}",
    header={"Authorization": "<API_KEY>"},
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
)
ws.run_forever()
```

### Dedicated endpoint For fixed throughput and predictable SLAs, you may request a dedicated endpoint for streaming transcription at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) or [Discord](https://discord.gg/fireworks-ai). # Transcribe audio post /audio/transcriptions Send an audio sample to get a transcription. ### Request ##### (multi-part form) The input audio file to transcribe. Max audio file size is 1 GB; there is no limit on audio duration. Common file formats such as mp3, flac, and wav are supported. Note that the audio will be resampled to 16kHz, downmixed to mono, and reformatted to 16-bit signed little-endian format before transcription. Pre-converting the file before sending it to the API can improve runtime performance. String name of the ASR model to use.
Can be one of `whisper-v3` or `whisper-v3-turbo`. Please use the following serverless endpoints: * [https://audio-prod.us-virginia-1.direct.fireworks.ai](https://audio-prod.us-virginia-1.direct.fireworks.ai) (for `whisper-v3`); * [https://audio-turbo.us-virginia-1.direct.fireworks.ai](https://audio-turbo.us-virginia-1.direct.fireworks.ai) (for `whisper-v3-turbo`); String name of the voice activity detection (VAD) model to use. Can be one of `silero` or `whisperx-pyannet`. String name of the alignment model to use. Currently supported: * `mms_fa` optimal accuracy for multilingual speech. * `tdnn_ffn` optimal accuracy for English-only speech. * `gentle` best accuracy for English-only speech (requires a dedicated endpoint, contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)). The target language for transcription. The set of supported target languages can be found [here](https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/tokenizer.py#L10-L128). The input prompt that the model will use when generating the transcription. Can be used to specify custom words or specify the style of the transcription. E.g. `Um, here's, uh, what was recorded.` will make the model include filler words in the transcription. Sampling temperature to use when decoding text tokens during transcription. The format in which to return the response. Can be one of `json`, `text`, `srt`, `verbose_json`, or `vtt`. The timestamp granularities to populate for this transcription. response\_format must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported. Can be one of `word` or `segment`. If not present, defaults to `segment`. Audio preprocessing mode. Currently supported: * `none` to skip audio preprocessing. * `dynamic` for arbitrary audio content with variable loudness. * `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs. * `bass_dynamic` for boosting lower frequencies. ### Response The task which was performed. Either `transcribe` or `translate`. The language of the transcribed/translated text. The duration of the transcribed/translated audio, in seconds. The transcribed/translated text. Extracted words and their corresponding timestamps. The text content of the word. Start time of the word in seconds. End time of the word in seconds. Segments of the transcribed/translated text and their corresponding details.

```python python
!pip install fireworks-ai requests

import time

import requests
from fireworks.client.audio import AudioInference

# Prepare client
audio = requests.get("https://tinyurl.com/4cb74vas").content
client = AudioInference(
    model="whisper-v3",
    base_url="https://audio-prod.us-virginia-1.direct.fireworks.ai",
    #
    # Or for the turbo version
    # model="whisper-v3-turbo",
    # base_url="https://audio-turbo.us-virginia-1.direct.fireworks.ai",
    api_key="<...>",
)

# Make request
start = time.time()
r = await client.transcribe_async(audio=audio)
print(f"Took: {(time.time() - start):.3f}s. Text: '{r.text}'")
```

```curl curl
# Download audio file
curl -sL -o "1hr.flac" "https://tinyurl.com/4cb74vas"

# Make request
curl -X POST "https://audio-prod.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions" \
  -H "Authorization: Bearer <...>" \
  -F "file=@1hr.flac"
```

# Translate audio post /audio/translations ### Request ##### (multi-part form) The input audio file to translate. Max audio file size is 1 GB; there is no limit on audio duration. Common file formats such as mp3, flac, and wav are supported.
Note that the audio will be resampled to 16kHz, downmixed to mono, and reformatted to 16-bit signed little-endian format before transcription. Pre-converting the file before sending it to the API can improve runtime performance. String name of the ASR model to use. Can be one of `whisper-v3` or `whisper-v3-turbo`. Please use the following serverless endpoints: * [https://audio-prod.us-virginia-1.direct.fireworks.ai](https://audio-prod.us-virginia-1.direct.fireworks.ai) (for `whisper-v3`); * [https://audio-turbo.us-virginia-1.direct.fireworks.ai](https://audio-turbo.us-virginia-1.direct.fireworks.ai) (for `whisper-v3-turbo`); String name of the voice activity detection (VAD) model to use. Can be one of `silero` or `whisperx-pyannet`. String name of the alignment model to use. Currently supported: * `mms_fa` optimal accuracy for multilingual speech. * `tdnn_ffn` optimal accuracy for English-only speech. * `gentle` best accuracy for English-only speech (requires a dedicated endpoint, contact us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)). The target language for transcription. The set of supported target languages can be found [here](https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/tokenizer.py#L10-L128). The input prompt that the model will use when generating the transcription. Can be used to specify custom words or specify the style of the transcription. E.g. `Um, here's, uh, what was recorded.` will make the model include filler words in the transcription. Sampling temperature to use when decoding text tokens during transcription. The format in which to return the response. Can be one of `json`, `text`, `srt`, `verbose_json`, or `vtt`. The timestamp granularities to populate for this transcription. response\_format must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported. Can be one of `word` or `segment`. If not present, defaults to `segment`. Audio preprocessing mode. Currently supported: * `none` to skip audio preprocessing. * `dynamic` for arbitrary audio content with variable loudness. * `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs. * `bass_dynamic` for boosting lower frequencies. ### Response The task which was performed. Either `transcribe` or `translate`. The language of the transcribed/translated text. The duration of the transcribed/translated audio, in seconds. The transcribed/translated text. Extracted words and their corresponding timestamps. The text content of the word. Start time of the word in seconds. End time of the word in seconds. Segments of the transcribed/translated text and their corresponding details.

```python python
!pip install fireworks-ai requests

import time

import requests
from fireworks.client.audio import AudioInference

# Prepare client
audio = requests.get("https://tinyurl.com/4cb74vas").content
client = AudioInference(
    model="whisper-v3",
    base_url="https://audio-prod.us-virginia-1.direct.fireworks.ai",
    #
    # Or for the turbo version
    # model="whisper-v3-turbo",
    # base_url="https://audio-turbo.us-virginia-1.direct.fireworks.ai",
    api_key="<...>",
)

# Make request
start = time.time()
r = await client.translate_async(audio=audio)
print(f"Took: {(time.time() - start):.3f}s. Text: '{r.text}'")
```
Text: '{r.text}'") ``` ```curl curl # Download audio file curl -sL -o "1hr.flac" "https://tinyurl.com/4cb74vas" # Make request curl -X POST "https://audio-prod.us-virginia-1.direct.fireworks.ai/v1/audio/translations" \ -H "Authorization: Bearer <...>" \ -F "file=@1hr.flac" ``` # Create Dataset post /v1/accounts/{account_id}/datasets # CRUD APIs for deployed models. post /v1/accounts/{account_id}/deployedModels # Create Deployment post /v1/accounts/{account_id}/deployments # Create Model post /v1/accounts/{account_id}/models # Create User post /v1/accounts/{account_id}/users # Create embeddings post /embeddings # Delete Dataset delete /v1/accounts/{account_id}/datasets/{dataset_id} # null delete /v1/accounts/{account_id}/deployedModels/{deployed_model_id} # Delete Deployment delete /v1/accounts/{account_id}/deployments/{deployment_id} # Delete Model delete /v1/accounts/{account_id}/models/{model_id} # Generate an image Official API reference for image generation workloads can be found on the corresponding models pages, upon clicking "view code". We support generating images from text prompts, other images, and/or ControlNet [https://fireworks.ai/models/fireworks/stable-diffusion-xl-1024-v1-0](https://fireworks.ai/models/fireworks/stable-diffusion-xl-1024-v1-0) [https://fireworks.ai/models/fireworks/SSD-1B](https://fireworks.ai/models/fireworks/SSD-1B) [https://fireworks.ai/models/fireworks/playground-v2-1024px-aesthetic](https://fireworks.ai/models/fireworks/playground-v2-1024px-aesthetic) [https://fireworks.ai/models/fireworks/japanese-stable-diffusion-xl](https://fireworks.ai/models/fireworks/japanese-stable-diffusion-xl) # Get Account get /v1/accounts/{account_id} # Get Dataset get /v1/accounts/{account_id}/datasets/{dataset_id} # Get Dataset Upload Endpoint post /v1/accounts/{account_id}/datasets/{dataset_id}:getUploadEndpoint # Get Deployment get /v1/accounts/{account_id}/deployments/{deployment_id} # Get Model get /v1/accounts/{account_id}/models/{model_id} # Get Model Download Endpoint get /v1/accounts/{account_id}/models/{model_id}:getDownloadEndpoint # Get Model Upload Endpoint post /v1/accounts/{account_id}/models/{model_id}:getUploadEndpoint # Get User get /v1/accounts/{account_id}/users/{user_id} # Introduction Fireworks AI REST API enables you to interact with various Language, Image and Embedding Models using the API Key. ## Authentication All requests made to the Fireworks AI via REST API must include an `Authorization` header. Header should specify a valid `Bearer` Token with API key and must be encoded as JSON with the "Content-Type: application/json" header. This ensures that your requests are properly authenticated and formatted for interaction with the Fireworks AI. A Sample header to be included in the REST API request should look like below: ```json authorization: Bearer ``` # List Datasets get /v1/accounts/{account_id}/datasets # List Deployments get /v1/accounts/{account_id}/deployments # List Models get /v1/accounts/{account_id}/models # List Users get /v1/accounts/{account_id}/users # Create Chat Completion post /chat/completions Creates a model response for the given chat conversation. # Create Completion post /completions Creates a completion for the provided prompt and parameters. 
# Update Dataset patch /v1/accounts/{account_id}/datasets/{dataset_id} # Update Deployment patch /v1/accounts/{account_id}/deployments/{deployment_id} # Update Model patch /v1/accounts/{account_id}/models/{model_id} # Update User patch /v1/accounts/{account_id}/users/{user_id} # Upload Dataset Files post /v1/accounts/{account_id}/datasets/{dataset_id}:upload Provides a streamlined way to upload a dataset file in a single API request. This path can handle file sizes up to 150Mb. For larger file sizes use [Get Dataset Upload Endpoint](get-dataset-upload-endpoint). # Validate Dataset Upload post /v1/accounts/{account_id}/datasets/{dataset_id}:validateUpload # Validate Model Upload get /v1/accounts/{account_id}/models/{model_id}:validateUpload # Start here The **Fireworks Cookbook** is your hands-on guide to building, deploying, and fine-tuning generative AI and agentic workflows. It offers curated examples, Jupyter Notebooks, apps, and resources tailored to various use cases and skill levels, making it a go-to resource for practical Fireworks implementations. In this cookbook, you’ll find: * **Production-ready projects**: Scalable, proven solutions with ongoing support from the Fireworks engineering team. * **Learning-focused tutorials**: Step-by-step guides for hands-on exploration, ideal for interactive learning of AI techniques. * **Community-driven showcases**: Creative user-contributed projects that showcase innovative applications of Fireworks in diverse contexts. *** ## Repository structure To help you easily navigate and find the right resources, the Cookbook organizes examples by purpose:
**Hands-on projects for learning AI** techniques, maintained by the DevRel team.

**Explore user-contributed projects** that push creative boundaries with Fireworks.
*** ### Feedback & support We value your feedback! If you encounter issues, need clarification, or have questions, please contact us at * **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) *** **Additional resources:** * [Fireworks AI Blog](https://fireworks.ai/blog) * [Fireworks AI YouTube](https://www.youtube.com/channel/UCHCffBTGYa1Ut72h03ldtGA) * [Fireworks AI Twitter](https://x.com/fireworksai_hq) # Build with Fireworks Step-by-step guides for hands-on exploration, ideal for interactive learning of AI techniques. ## Inference Explore notebooks and projects showcasing how to run generative AI models on Fireworks, demonstrating both third-party integrations and innovative applications with industry-leading speed and flexibility. ### LLMs Dive into examples that utilize Fireworks for deploying and fine-tuning large language models (LLMs), featuring integrations with popular libraries and cutting-edge use cases. **Notebooks** (Python) An interactive Streamlit app for comparing LLMs on Fireworks with parameter tuning and LLM-as-a-Judge functionality. (Python) Demonstrates structured responses using Llama 3.1, covering Grammar Mode and JSON Mode for consistent output formats. (Python) Explores generating synthetic data with Llama 3.1 models on Fireworks, including structured outputs for quizzes. **Apps** A Next.js app for real-time transcription chat using Fireworks and Vercel integration. ### Visual-language Discover projects combining vision and language capabilities using Fireworks, integrating external frameworks for seamless multimodal understanding. ### Audio Explore real-time audio transcription, processing, and generation examples using Fireworks’ advanced audio models and integrations. **Notebooks** A notebook demonstrating real-time audio transcription using Fireworks' `whisper-v3-large` compatible model. The project includes streaming audio input and getting transcription messages, making it ideal for tasks requiring accurate and responsive audio processing. Stream audio to get transcription continuously in real-time. Stream audio to get transcription continuously in real-time. ### Image Experiment with image-based projects using Fireworks’ models, enhanced with third-party libraries for innovative applications in image creation, manipulation, and recognition. ### Multimodal Learn from complex multimodal examples that blend text, audio, and image inputs, demonstrating the full potential of Fireworks combined with external tools for interactive AI experiences. *** ## Fine-tuning Access notebooks that demonstrate efficient model fine-tuning on Fireworks, utilizing both internal capabilities and third-party tools like Axolotl for custom optimization. ### Multi-LoRA Explore notebooks showcasing the integration and utilization of multiple LoRA adapters in Fireworks. These resources demonstrate advanced techniques for merging, fine-tuning, and deploying multi-LoRA configurations to optimize model performance across diverse tasks. **Notebooks** (Python) An interactive guide showcasing the integration of Multi-LoRA adapters on Fireworks, enabling fine-tuned responses for diverse product domains such as beauty, fashion, outdoor gear, and baby products. *** ## Function calling Explore examples of function-calling workflows using Fireworks, showcasing how to integrate with external APIs and tools for sophisticated, multi-step AI operations. 
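Before diving into the notebooks below, here is a minimal sketch of what a tool-calling request looks like over the OpenAI-compatible chat completions API; the tool schema, arguments, and model choice are illustrative assumptions rather than excerpts from the examples:

```python
# Hedged sketch: ask a function-calling model to pick a tool; the schema is illustrative.
import json
import requests

API_KEY = "<...>"

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",  # hypothetical tool
        "description": "Get the latest price for a stock ticker",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

resp = requests.post(
    "https://api.fireworks.ai/inference/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={
        "model": "accounts/fireworks/models/firefunction-v1",
        "messages": [{"role": "user", "content": "What is NVDA trading at?"}],
        "tools": tools,
    },
)
message = resp.json()["choices"][0]["message"]
# If the model decided to call a tool, the arguments arrive as a JSON string.
for call in message.get("tool_calls", []):
    print(call["function"]["name"], json.loads(call["function"]["arguments"]))
```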
**Notebooks** Demonstrates Function-Calling with LangChain integration, including custom tool routing and query handling. (Python) Explore the integration of Fireworks' function-calling model with LangChain tools. This notebook demonstrates building basic agents using `firefunction-v1` for tasks like answering questions, retrieving stock prices, and generating images with the Fireworks SDXL API (Javascript). Showcases Function-Calling with LangGraph integration for graph-based agent systems and tool queries. (Python) Uses Fireworks' Function-Calling for structured QA with OpenAI, featuring multi-turn conversation handling. (Python) Demonstrates querying financial data using Fireworks' Function-Calling API with integrated tool setup. (Python) Extracts structured information from web content using Fireworks' Function-Calling API. (Python) Generates stock charts using Fireworks' Function-Calling API with AutoGen integration. (Python) **Apps** A demo app showcasing chat with function-calling capabilities for dynamic service invocation. *** ## RAG Build retrieval-augmented generation (RAG) systems with Fireworks, featuring projects that connect with vector databases and search tools for enhanced, context-aware AI responses. **Notebooks** A basic RAG implementation using ChromaDB with League of Legends data, comparing responses across multiple models. (Python) An agentic system using RAG for generating catchy research paper titles with embeddings and LLM completions. (Python) A movie recommendation system using Fireworks' function-calling models and MongoDB Atlas for personalized, real-time suggestions. (Python) **Apps** A RAG chatbot using SurrealDB for vector storage and Fireworks for real-time, context-aware responses. *** ### Integration partners We welcome contributions from integration partners! Follow these steps: 1. **Clone the Repo**: [Fireworks Cookbook repo](https://github.com/fw-ai/cookbook) 2. **Create Folder**: Add your company/tool under `integrations` 3. **Add Examples**: Include code, notebooks, or demos 4. **Use Template**: Fill out the [integration guide](https://github.com/fw-ai/cookbook/blob/main/integrations/template_integration_guide.md) 5. **Submit PR**: Create a pull request 6. **Review**: Fireworks will review and merge Need help? Contact us or open an issue. *** ### Support For help or feedback: * **Discord**: [Join us](https://discord.gg/fireworks-ai) * **Email**: [Contact us](mailto:inquiries@fireworks.ai) **Resources**: * [Blog](https://fireworks.ai/blog) * [YouTube](https://www.youtube.com/channel/UCHCffBTGYa1Ut72h03ldtGA) * [Twitter](https://x.com/fireworksai_hq) # Community showcase Creative user-contributed projects that showcase innovative applications of Fireworks in diverse contexts. Convert any PDF into a personalized podcast using open-source LLMs and TTS models. Powered by Fireworks-hosted Llama 3.1, MeloTTS, and Bark, this app generates engaging dialogue and outputs it as an MP3 file via a user-friendly Gradio interface. High-throughput code generation with Qwen2.5 Coder models, optimized for fast inference on Fireworks. Includes a robust pipeline for data creation, fine-tuning with Unsloth, and real-time application in AI-powered code editors. Ensure accurate and reliable technical documentation with ProoferX, built using Fireworks’ fast Llama models and Firefunc for structured output. This project addresses a key challenge in developer tools by validating and streamlining documentation with real-time checks. 
*** ## Community project submissions We welcome your contributions to the **Fireworks Cookbook**! Share your projects and help expand our collaborative resource. Here’s how: 1. **Clone the Repo**: [Fireworks Cookbook](https://github.com/fw-ai/cookbook) and go to `showcase`. 2. **Create Folder**: Add a folder named after your project. 3. **Include Code**: Add notebooks, apps, or other resources demonstrating your project. 4. **Complete Template**: Fill out the [Showcase Template](https://github.com/fw-ai/cookbook/blob/main/showcase/template_projectMDX.md) for key project details. 5. **Submit PR**: Submit your project as a pull request. 6. **Review & Feature**: Our team will review your submission; selected projects may be highlighted in docs or social media. *** ### Support For help or feedback: * **Discord**: [Join us](https://discord.gg/fireworks-ai) * **Email**: [Contact us](mailto:inquiries@fireworks.ai) **Resources**: * [Blog](https://fireworks.ai/blog) * [YouTube](https://www.youtube.com/channel/UCHCffBTGYa1Ut72h03ldtGA) * [Twitter](https://x.com/fireworksai_hq) # Direct routing Direct routing enables enterprise users to reduce latency to their deployments. ## Internet direct routing Internet direct routing bypasses our global API load balancer and directly routes your request to the machines where your deployment is running. This can save several tens or even hundreds of milliseconds of time-to-first-token (TTFT) latency. To create a deployment using Internet direct routing:

```bash
$ firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \
    --direct-routing-type INTERNET \
    --direct-route-api-keys 
Name: accounts/my-account/deployments/abcd1234
...
Direct Route Handle: my-account-abcd1234.us-arizona-1.direct.fireworks.ai
Region: US_ARIZONA_1
```

You will need to specify a comma-separated list of API keys that can access the direct route deployment. These keys can be any alpha-numeric string and are a distinct concept from the API keys provisioned via the Fireworks console. A key provisioned in the console but not specified in the list here will not be allowed when querying the model via direct routing. Take note of the `Direct Route Handle` to get the inference endpoint. This is what you will use to access the deployment instead of the global `https://api.fireworks.ai/inference/` endpoint. For example:

```bash
curl \
  --header 'Authorization: Bearer ' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "prompt": "The sky is"
  }' \
  --url https://my-account-abcd1234.us-arizona-1.direct.fireworks.ai/v1/completions
```

## Private Service Connect (PSC) Contact your Fireworks representative to set up [GCP Private Service Connect](https://cloud.google.com/vpc/docs/private-service-connect) to your deployment. ## AWS PrivateLink Contact your Fireworks representative to set up [AWS PrivateLink](https://aws.amazon.com/privatelink/) to your deployment. # Regions Fireworks runs a global fleet of hardware on which you can deploy your models.
## Availability Current region availability:

| **Region**       | **Launch status**   | **Hardware availability**             |
| ---------------- | ------------------- | ------------------------------------- |
| `US_ILLINOIS_2`  | Generally Available | `NVIDIA_A100_80GB`                    |
| `US_VIRGINIA_2`  | Generally Available | `NVIDIA_H100_80GB` `AMD_MI300X_192GB` |
| `EU_PARIS_1`     | Generally Available | `NVIDIA_H200_141GB`                   |
| `AP_TOKYO_1`     | Enterprise only     | `NVIDIA_H100_80GB`                    |
| `EU_FRANKFURT_1` | Enterprise only     | `NVIDIA_H100_80GB`                    |
| `US_ILLINOIS_1`  | Enterprise only     | `NVIDIA_H100_80GB`                    |
| `US_IOWA_1`      | Enterprise only     | `NVIDIA_H100_80GB`                    |
| `US_VIRGINIA_1`  | Enterprise only     | `NVIDIA_H100_80GB`                    |
| `US_ARIZONA_1`   | Enterprise only     | `NVIDIA_H100_80GB`                    |

If you need deployments in a non-GA region, please contact our team at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai). ## Using a region When creating a deployment, you can pass the `--region` flag:

```
firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \
  --region US_IOWA_1
```

## Changing regions Updating a region for a deployment in-place is currently not supported. To move a deployment between regions, please create a new deployment in the new region, then delete the old deployment. ## Quotas Each region has its own separate quota for each hardware type. To view your current quotas, run:

```
firectl list quotas
```

# Reserved capacity Enterprise accounts can purchase reserved capacity, typically with 1-year commitments. Reserved capacity has the following advantages over [on-demand deployments](/guides/ondemand-deployments): * Guaranteed capacity * Higher quotas * Lower GPU-hour prices * Pre-GA access to newer regions * Pre-GA access to newest hardware ## Purchasing or renewing a reservation To purchase a reservation or increase the size or duration of an existing reservation, contact your Fireworks account manager. If you are a new, prospective customer, please reach out to our [sales team](https://fireworks.ai/company/contact-us). ## Viewing your reservations To view your existing reservations, run:

```
firectl list reservations
```

## Usage and billing Reservations are automatically "consumed" when you create deployments that meet the reservation parameters. For example, suppose you have a reservation for 12 H100 GPUs and create two deployments, each using 8 H100 GPUs. While both deployments are running, 12 H100s will count towards using your reservation, while the excess 4 H100s will be metered and billed at the on-demand rate. When a reservation approaches its end time, ensure that you either renew your reservation or turn down a corresponding number of deployments, otherwise you may be billed for your usage at on-demand rates. Reservations are invoiced separately from your on-demand usage, at a frequency determined by your reservation contract (e.g. monthly, quarterly, or yearly). Reserved capacity will always be billed until the reservation ends, regardless of whether the reservation is actively used. # About Fireworks developer partners Learn about the Fireworks Developer Partners Program, including goals, application process, and benefits for tools and platforms in the LLMOps/Gen-Ops ecosystem. The **Fireworks developer integrations program** supports tools, platforms, and projects in the LLMOps/Gen-Ops ecosystem, enabling seamless collaboration with Fireworks.
🌐 Whether through **native integrations** or **compatible workflows**, developer integrations represent tools and platforms that: * Offer **native integration** with Fireworks APIs, enabling deep functionality and seamless operation. * Provide **compatible workflows**, demonstrating interoperability with Fireworks through shared use cases and adaptable processes. * Add value to the Fireworks ecosystem by enhancing developer workflows, improving scalability, or solving key challenges in LLMOps/Gen-Ops. 🔧 *** # Goals of the developer partners program 1. **Expand the ecosystem**: Build a rich network of tools that extend Fireworks’ capabilities. 🌱 2. **Showcase interoperability**: Demonstrate how Fireworks works with diverse tools to solve real-world challenges. 🌍 3. **Support innovation**: Encourage the creation of impactful generative AI solutions. 💡 4. **Promote collaboration**: Highlight shared contributions through joint marketing, workshops, and developer resources. 🤝 *** ## Types of developer partners 1. **Native integrations** 🛠️ * Tools with direct integration into Fireworks APIs or SDKs, offering seamless plug-and-play functionality. * Examples include official connectors, plugins, and platform integrations. 2. **Compatible workflows** * Tools or platforms that interoperate with Fireworks through shared APIs, workflows, or third-party bridges. * Examples include vector stores, fine-tuning tools, and monitoring solutions that work alongside Fireworks. *** # What does a developer integration look like? A developer integration can include: * **Native integrations**: Fully integrated tools or connectors offering seamless user experiences. * **Workflow compatibility**: Examples and documentation showing how a tool works with Fireworks APIs. * **Developer resources**: Contributed guides, notebooks, and sample repositories to enable other users. **Examples**: * **Native integration**: A plugin for a vector database that directly connects with Fireworks’ RAG workflows. * **Compatible workflow**: A step-by-step guide for using Fireworks APIs alongside an MLOps monitoring tool. *** # How to apply ### Step 1: Demonstrate compatibility or build integration 🔍 * **Native integrations**: Develop a connector or integration directly into Fireworks APIs or SDKs. * **Compatible workflows**: Validate how your tool works with Fireworks workflows and APIs. * Prepare resources such as GitHub repos, notebooks, or workflow guides. ### Step 2: Submit your application 📤 1. **Create documentation** * Use the [Fireworks cookbook template](https://github.com/fw-ai/cookbook/blob/main/integrations/template_integration_guide.md) to document your integration or workflow. 2. **Submit your contribution** * Fork the [Fireworks cookbook](https://github.com/fw-ai/cookbook) and submit a pull request with your materials. * Include links to your GitHub repo or supporting documentation. 3. **Contact developer relations**\ For guidance, reach out to [DevRel](mailto:devrel@fireworks.ai). ### Step 3: Review and feedback ✅ * Fireworks developer relations will review your submission to ensure technical accuracy and alignment with program goals. * Once approved, your integration or workflow will be published in Fireworks documentation and promoted through official channels. *** # Benefits of becoming a Fireworks developer partner 🌟 1. **Ecosystem visibility** * Be featured in Fireworks documentation and resources as a trusted integration. * Gain recognition within the growing LLMOps/Gen-Ops developer community. 2. 
**Technical and marketing support** * Access Fireworks resources and technical support for building integrations. * Collaborate on co-marketing campaigns, webinars, and tutorials. 3. **Community collaboration** * Join a network of ecosystem partners working to push generative AI innovation forward. * Share insights and learn from other projects in the LLMOps/Gen-Ops space. *** # Program FAQ ❓ **Q: Who can apply to the Developer Partners program?**\ A: Tools, platforms, and projects that either integrate natively with Fireworks or demonstrate compatibility through workflows are welcome to apply. **Q: What types of contributions are required?**\ A: Contributions can include technical documentation, integration guides, sample workflows, GitHub repos, and co-marketing materials. **Q: Is there a cost to participate?**\ A: No, the Developer Partners program is free. **Q: Can compatible workflows evolve into native integrations?**\ A: Yes! Tools demonstrating strong adoption and compatibility may transition to deeper integrations and partnerships. *** For more information or to get started, contact us at: * **Discord**: [Join here](https://discord.gg/fireworks-ai) * **Email**: [devrel@fireworks.ai](mailto:devrel@fireworks.ai) # Account setup & management Solutions for common account access issues and management procedures for Fireworks.ai accounts ## Multiple account access **Q: What should I do if I can't access my company account after being invited when I already have a personal account?** This issue can occur when you have multiple accounts associated with the same email address (e.g., a personal account created with Google login and a company account you've been invited to). To resolve this: 1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) from the email address associated with both accounts 2. Include in your email: * The account ID you created personally (e.g., username-44ace8) * The company account ID you need access to (e.g., company-a57b2a) * Mention that you're having trouble accessing your company account Note: This is a known scenario that support can resolve once they verify your email ownership. *** ## Account closure **Q: How do I close my Fireworks.ai account?** To close your account: 1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) 2. Include in your request: * Your account ID * A clear request for account deletion Before closing your account, please ensure: * All outstanding invoices are paid * Any active deployments are terminated * Important data is backed up if needed *** ## Signing in from different Fireworks accounts **Q: I have multiple Fireworks accounts. When I try to login with Google on Fireworks' web UI, I'm getting signed into the wrong account. How do I fix this?** If you log in with Google, account management is controlled by Google. You can log in through an incognito mode or create separate Chrome/browser profiles to log in with different Google accounts. You could also follow the steps in this [guide](https://support.google.com/accounts/answer/13533235?hl=en#zippy=%2Csign-in-with-google) to disassociate Fireworks.ai with a particular Google account sign-in. If you have more complex issues please contact us on Discord. 
*** ## Additional information If you experience any issues during these processes, you can: * Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * Reach out to your account representative (Enterprise customers) * Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) # Billing management Information about Fireworks.ai invoicing and API billing. ## Invoice questions **Q: Why did I receive an invoice when I only deposited credits?** Fireworks.ai billing works as follows: * **Deposited credits** are used first. * Once credits are exhausted, you **continue to accrue charges** for additional usage. * **Usage charges** are billed at the end of each month. * You’ll receive an invoice for any usage that **exceeded your pre-purchased credits**. This process happens automatically, regardless of subscription status. To prevent additional charges, please monitor your usage or contact support to set up spending restrictions. **Q: Where's my receipt for purchased credits?** Receipts for purchased credits are sent via Stripe upon initial credit purchase. Check your email for receipts from Stripe (not Fireworks). Contact [billing@fireworks.ai](mailto:billing@fireworks.ai) if you still are encountering problems. *** ## API billing **Q: Are calls to the Models API billable?** No, calls to the **Models API** endpoint are free. This applies to all **management API calls** for: * Accounts * Users * Models * Datasets *Note*: While the API calls themselves are free, charges apply for: * **Model deployments** * **Fine-tuning jobs** *** ## Additional resources * **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) * **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs) # Credit system Understanding how Fireworks.ai billing, credits, and account suspension work. ## Billing and credit usage **Q: How does billing and credit usage work?** Usage and billing operate through a **tiered system**: * Each **tier** has a monthly usage limit, regardless of available credits. * Once you reach your tier's limit, **service will be suspended** even if you have remaining credits. * **Usage limits** reset at the beginning of each month. * Pre-purchased credits do not prevent additional charges once the limit is exceeded. *** ## Account suspension **Q: Why might my account be suspended even with remaining credits?** Your account may be suspended due to several factors: 1. **Monthly usage limits**: * Each tier includes a monthly usage limit, independent of any credits. * Once you reach this limit, your service will be suspended, even if you have credits remaining. * Usage limits automatically reset at the beginning of each month. 2. **Billing structure**: * Pre-purchased credits do not prevent additional charges. * You can exceed your pre-purchased credits and will be billed for any usage beyond that limit. * **Example**: If you have `$20` in pre-purchased credits but incur `$83` in usage, you will be billed for the `$63` difference. *** ## Missing credits **Q: I bought credits but don’t see them reflected in my account. Did they disappear?** Fireworks operates with a **postpaid billing** system where: * **Prepaid credits** are instantly applied to any outstanding balance. * **Example**: If you had a `$750` outstanding bill and added `$500` in credits, your bill would reduce to `$250`, with \$0 remaining credits available for new usage. To check your credit balance: 1. 
Visit your **billing dashboard**. 2. Review the **"Credits"** section. 3. Check your **current outstanding balance**. *Note*: Credits are always applied to any existing balance before being available for new usage. *** ## Additional resources * **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) * **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs) # Cost structure Understanding Fireworks.ai pricing and fees for various services. ## Platform costs **Q: How much does Fireworks cost?** Fireworks AI operates on a **pay-as-you-go** model for all non-Enterprise usage, and new users automatically receive free credits. You pay based on: * **Per token** for serverless inference * **Per GPU usage time** for on-demand deployments * **Per token of training data** for fine-tuning For customers needing **enterprise-grade security and reliability**, please reach out to us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) to discuss options. Find out more about our current pricing on our [Pricing page](https://fireworks.ai/pricing). *** ## Fine-tuning fees **Q: Are there extra fees for serving fine-tuned models?** No, deploying fine-tuned models to serverless infrastructure is free. Here’s what you need to know: **What’s free**: * Deploying fine-tuned models to serverless infrastructure * Hosting the models on serverless infrastructure * Deploying up to 100 fine-tuned models **What you pay for**: * **Usage costs** on a per-token basis when the model is actually used * The **fine-tuning process** itself, if applicable *Note*: This differs from on-demand deployments, which include hourly hosting costs. *** ## Additional resources * **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) * **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs) # Discounts Information about bulk usage discounts and special pricing options. ## Bulk usage **Q: Are there discounts for bulk usage?** Yes, we offer discounts for **bulk or pre-paid purchases** exclusively for on-demand deployments—not for serverless GPUs. Please contact [inquiries@firework.ai](mailto:inquiries@fireworks.ai) if you're interested. *** ## Serverless discounts **Q: Are there discounts for bulk spend on serverless deployments?** Our publicly accessible services have **standard rates** for all customers. Currently, we do not offer bulk discounts for serverless deployments. *** ## Additional information For **enterprise customers** or **high-volume users**: * Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options** * Discuss **annual commitment discounts** * Explore **enterprise-specific features and benefits** # Billing & scaling Understanding billing and scaling mechanisms for on-demand deployments. ## Autoscaling and costs **Q: How does autoscaling affect my costs?** * **Scaling from 0**: No minimum cost when scaled to zero * **Scaling up**: Each new replica adds to your total cost proportionally. For example: * Scaling from 1 to 2 replicas doubles your GPU costs * If each replica uses multiple GPUs, costs scale accordingly (e.g., scaling from 1 to 2 replicas with 2 GPUs each means paying for 4 GPUs total) For current pricing details, please visit our [pricing page](https://fireworks.ai/pricing). 
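To make the scaling arithmetic concrete, a small back-of-the-envelope sketch follows; the hourly rate is a placeholder, not a quoted price (see the pricing page for actual rates):

```python
# Back-of-the-envelope on-demand cost estimate; the hourly rate is a placeholder.
def estimate_gpu_cost(replicas: int, gpus_per_replica: int, hours_active: float,
                      hourly_rate_per_gpu: float) -> float:
    """Total cost = GPUs in service * hours they are active * per-GPU hourly rate."""
    return replicas * gpus_per_replica * hours_active * hourly_rate_per_gpu

# Scaling from 1 to 2 replicas with 2 GPUs each doubles the bill (2 -> 4 GPUs).
print(estimate_gpu_cost(replicas=1, gpus_per_replica=2, hours_active=10, hourly_rate_per_gpu=3.0))  # 60.0
print(estimate_gpu_cost(replicas=2, gpus_per_replica=2, hours_active=10, hourly_rate_per_gpu=3.0))  # 120.0
```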
*** ## Rate-limits for on-demand deployment **Q: What are the rate limits for on-demand deployments?** Request throughput scales with your GPU allocation. Base allocations include: * Up to 8 A100 GPUs * Up to 8 H100 GPUs On-demand deployments offer several advantages: * **Predictable pricing** based on time units, not token I/O * **Protected latency and performance**, independent of traffic on the serverless platform * **Choice of GPUs**, including A100s and H100s Need more GPUs? Contact us to discuss higher allocations for your specific use case. *** ## On-demand billing **Q: How does billing work for on-demand deployments?** On-demand deployments come with automatic cost optimization features: * **Default autoscaling**: Automatically scales to 0 replicas when not in use * **Pay for what you use**: Charged only for GPU time when replicas are active * **Flexible configuration**: Customize autoscaling behavior to match your needs **Best practices for cost management**: 1. **Leverage default autoscaling**: The system automatically scales down deployments when not in use 2. **Customize carefully**: While you can modify autoscaling behavior using our [configuration options](https://docs.fireworks.ai/guides/ondemand-deployments#customizing-autoscaling-behavior), note that preventing scale-to-zero will result in continuous GPU charges 3. **Consider your use case**: For intermittent or low-frequency usage, serverless deployments might be more cost-effective For detailed configuration options, see our [deployment guide](https://docs.fireworks.ai/guides/ondemand-deployments#replica-count-horizontal-scaling). *** ## Scaling structure **Q: How does billing and scaling work for on-demand GPU deployments?** On-demand GPU deployments have unique billing and scaling characteristics compared to serverless deployments: **Billing**: * Charges start when the server begins accepting requests * **Billed by GPU-second** for each active instance * Costs accumulate even if there are no active API calls **Scaling options**: * Supports **autoscaling** from 0 to multiple GPUs * Each additional GPU **adds to the billing rate** * Can handle unlimited requests within the GPU’s capacity **Management requirements**: * Not fully serverless; requires some manual management * **Manually delete deployments** when no longer needed * Or configure autoscaling to **scale down to 0** during inactive periods **Cost control tips**: * Regularly **monitor active deployments** * **Delete unused deployments** to avoid unnecessary costs * Consider **serverless options** for intermittent usage * Use **autoscaling to 0** to optimize costs during low-demand times *** ## Additional resources * **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) * Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options** # Deployment issues Troubleshooting and resolving common issues with on-demand deployments. ## Custom model issues **Q: What are the common issues when deploying custom models?** Here are key areas to troubleshoot for custom model deployments: ### 1. 
Deployment hanging or crashing

**Common causes**:

* **Missing model files**, especially when using Hugging Face models
* **Symlinked files** not uploaded correctly
* **Outdated firectl version**

**Solutions**:

* Download models without symlinks using:

```bash
huggingface-cli download model_name --local-dir=/path --local-dir-use-symlinks=False
```

* Update **firectl** to the latest version

### 2. LoRA adapters vs full models

* **Compatibility**: LoRA adapters work with specific base models.
* **Performance**: May experience slightly lower speed with LoRA, but **quality should remain similar** to the original model.
* **Troubleshooting quality drops**:
  * Check **model configuration**
  * Review **conversation template**
  * Add `echo: true` to debug requests

### 3. Performance optimization factors

Consider adjusting the following for improved performance:

* **Accelerator count** and **accelerator type**
* **Long prompt** settings to handle complex inputs

***

## Autoscaling

**Q: What should I expect for deployment and scaling performance?**

* **Initial deployment**: Should complete within minutes
* **Scaling from zero**: You may experience brief availability delays while the system scales up
* **Troubleshooting**: If deployment takes over 1 hour, this typically indicates a crash and should be investigated
* **Best practice**: Monitor deployment status and contact support if deployment times are unusually long

***

## Performance questions

**Q: I have more specific performance questions about improvements**

For detailed discussions on performance and optimization options:

* **Schedule a consultation** directly with our PM, Ray Thai ([calendly](https://calendly.com/raythai))
* Discuss your **specific use cases**
* Get **personalized recommendations**
* Review **advanced configuration options**

*Note*: Monitor costs carefully during the deployment and testing phase, as repeated deployments and tests can quickly consume credits.

***

## Additional resources

* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options**

# Hardware options

Understanding hardware choices for Fireworks.ai on-demand deployments.

## Hardware selection

**Q: Which accelerator/GPU should I use?**

It depends on your specific needs. Fireworks has two groupings of accelerators: smaller (A100) and larger (H100, H200, and MI300X) accelerators. Smaller accelerators are less expensive (see the [pricing page](https://fireworks.ai/pricing)), so they're more cost-effective for low-volume use cases. However, if you have enough volume to fully utilize a larger accelerator, we find that larger accelerators tend to be both faster and more cost-effective per token.

Choosing between the larger accelerators depends on the use case:

* MI300X has the highest memory capacity and sometimes enables large models to be deployed on comparatively few GPUs. For example, unquantized Llama 3.1 70B fits on one MI300X, and FP8 Llama 405B fits on 4 MI300Xs. Higher memory may also enable better throughput for longer prompts and less-sharded deployments. It's also more affordably priced than the H100.
* H100 offers blazing-fast inference and often provides the highest throughput, especially for high-volume use cases.
* H200 is recommended for large models like DeepSeek V3 and DeepSeek R1; for example, the minimum configuration for either model is 8 H200s.
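As a rough sanity check on those memory figures, here is a back-of-the-envelope sketch (weights only; KV cache and runtime overhead add to this, and 192 GB is the MI300X's advertised HBM capacity):

```python
# Weight memory ~= parameter count x bytes per parameter (ignores KV cache and overhead).
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # billions of params x bytes/param = GB

print(weight_gb(70, 2))   # Llama 3.1 70B in BF16 -> ~140 GB, fits on a single 192 GB MI300X
print(weight_gb(405, 1))  # Llama 3.1 405B in FP8 -> ~405 GB, needs multiple GPUs (4 x MI300X = 768 GB)
```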
### Best Practices for Selection 1. **Analyze your workload requirements** to determine which GPU fits your processing needs. 2. Consider your **throughput needs** and the scale of your deployment. 3. Calculate the **cost-performance ratio** for each hardware option. 4. Factor in **future scaling needs** to ensure the selected GPU can support growth. *** ## Additional resources * **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) * Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for **custom pricing options** # On-demand deployment scaling Understanding Fireworks.ai system scaling and request handling capabilities. ## System scaling **Q: How does the system scale?** Our system is **horizontally scalable**, meaning it: * Scales linearly with additional **replicas** of the deployment * **Automatically allocates resources** based on demand * Manages **distributed load handling** efficiently *** ## Auto scaling **Q: Do you support Auto Scaling?** Yes, our system supports **auto scaling** with the following features: * **Scaling down to zero** capability for resource efficiency * Controllable **scale-up and scale-down velocity** * **Custom scaling rules and thresholds** to match your specific needs *** ## Throughput capacity **Q: What’s the supported throughput?** Throughput capacity typically depends on several factors: * **Deployment type** (serverless or on-demand) * **Traffic patterns** and **request patterns** * **Hardware configuration** * **Model size and complexity** *** ## Request handling **Q: What factors affect the number of simultaneous requests that can be handled?** The request handling capacity is influenced by multiple factors: * **Model size and type** * **Number of GPUs** allocated to the deployment * **GPU type** (e.g., A100 vs. H100) * **Prompt size** and **generation token length** * **Deployment type** (serverless vs. on-demand) *** ## Additional resources * **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) * **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs) # Performance optimization Guidelines for optimizing performance and benchmarking Fireworks.ai deployments. ## Performance improvement **Q: What are the techniques to improve performance?** To optimize model performance, consider the following techniques: 1. **Quantization** 2. **Check model type**: Determine whether the model is **GQA** (Grouped Query Attention) or **MQA** (Multi-Query Attention). 3. **Increase batch size** to improve throughput. *** ## Benchmarking **Q: How can we benchmark?** There are multiple ways to benchmark your deployment’s performance: * Use our [open-source load-testing tool](https://github.com/fw-ai/benchmark) * Develop custom performance testing scripts * Integrate with monitoring tools to track metrics *** ## Model latency **Q: What’s the latency for small, medium, and large LLM models?** Model latency and performance depend on various factors: * **Input/output prompt lengths** * **Model quantization** * **Model sharding** * **Disaggregated prefill processes** * **Hardware configuration** * **Multiple layers of caching** * **Fire optimizations** * **LoRA adapters** (Low-Rank Adaptation) Our team specializes in personalizing model performance. 
We work with you to understand your traffic patterns and create customized deployment templates that maximize performance for your use case. *** ## Performance factors **Q: What factors affect model latency and performance?** Key factors that impact latency and performance include: * **Model architecture and size** * **Hardware configuration** * **Network conditions** * **Request patterns** * **Batch size settings** * **Caching implementation** *** ## Best practices **Q: What are the best practices for optimizing performance?** For optimal performance, follow these recommendations: 1. **Choose an appropriate model size** for your specific use case. 2. **Implement batching strategies** to improve efficiency. 3. **Use quantization** where applicable to reduce computational load. 4. **Monitor and adjust scaling parameters** to meet demand. 5. **Optimize prompt lengths** to reduce processing time. 6. **Implement caching** to minimize repeated calculations. *** ## Additional resources * **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) * **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs) # Costs & management Understanding costs and model availability for serverless deployments. ## Deployment costs **Q: Are there costs associated with deploying fine-tuned models to serverless infrastructure?** No, deploying fine-tuned models to serverless infrastructure is free. **What’s free**: * Deploying fine-tuned models to serverless * Hosting models on serverless infrastructure * Deploying up to 100 fine-tuned models **What you pay for**: * **Usage costs** on a per-token basis when the model is actually used * The **fine-tuning process** itself, if applicable *Note*: This differs from on-demand deployments, which include hourly hosting costs. *** ## Model availability **Q: Do you provide notice before removing model availability?** Yes, we provide advance notice before removing models from the serverless infrastructure: * **Minimum 2 weeks’ notice** before model removal * Longer notice periods may be provided for **popular models**, depending on usage * Higher-usage models may have extended deprecation timelines **Best Practices**: 1. Monitor announcements regularly. 2. Prepare a migration plan in advance. 3. Test alternative models to ensure continuity. 4. Keep your contact information updated for timely notifications. *** ## Additional resources * **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) * **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs) # Performance issues Troubleshooting timeout errors and performance issues with serverless LLM models. ## Timeout and response times **Q: Why am I experiencing request timeout errors and slow response times with serverless LLM models?** Timeout errors and increased response times can occur due to **server load during high-traffic periods**. With serverless, users are essentially **sharing a pool of GPUs** with models pre-provisioned. The goal of serverless is to allow users and teams to **seamlessly power their generative applications** with the **latest generative models** in **less than 5 lines of code**. Deployment barriers should be **minimal** and **pricing is based on usage**. 
However, there are trade-offs with this approach: to ensure users have **consistent access** to the most in-demand models, users are also subject to **minor latency and performance variability** during **high-volume periods**. With **on-demand deployments**, users reserve GPUs (which are **billed by rented time** instead of usage volume) and don't have to worry about traffic spikes. This is why we recommend two ways to address timeout and response-time issues:

### Current solution (recommended for production)

* **Use on-demand deployments** for more stable performance
* **Guaranteed response times**
* **Dedicated resources** to ensure availability

We are always investing in ways to improve speed and performance.

### Upcoming improvements

* Enhanced SLAs for uptime
* More consistent generation speeds during peak load times

If you experience persistent issues, please include the following details in your support request:

1. Exact **model name**
2. **Timestamp** of errors (in UTC)
3. **Frequency** of timeouts
4. **Average wait times**

### Performance optimization tips

* Consider **batch processing** for handling bulk requests
* Implement **retry logic with exponential backoff**
* Monitor **usage patterns** to identify peak traffic times
* Set **appropriate timeout settings** based on model complexity

***

## Additional resources

* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)

# Service levels

Understanding SLAs and service guarantees for Fireworks.ai serverless deployments.

## Latency guarantees

**Q: Is latency guaranteed for serverless models?**

Currently there are **no latency or availability guarantees** for serverless models; however, they are coming soon, and we recommend contacting [sales](https://fireworks.ai/company/contact-us) to discuss any specific needs or requirements you have.

***

## Service level agreements

**Q: Are there any SLAs for serverless models?**

Our **multi-tenant serverless offering** does not currently come with **Service Level Agreements (SLAs)**. However, they are coming, and we'd love to understand your use case to ensure you have the best experience possible on the Fireworks platform. Reach out to us via sales or our Discord community.

***

## Quota information

**Q: Are there any quotas for serverless?**

For **serverless deployments**, quotas are as follows:

* **Developer accounts**: 600 requests per minute (RPM)
* **Enterprise accounts**: 600 requests per minute (RPM)
* Quotas apply **across all models** and cannot be exceeded within the serverless infrastructure

**For higher quotas**:

* Consider switching to **on-demand deployments**
* **Contact enterprise sales** for custom solutions
* Evaluate **dedicated infrastructure options** for greater flexibility

***

## Additional resources

* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)

# Certifications

Information about Fireworks.ai compliance certifications and HIPAA requirements.

## Security certifications

**Q: What type of certifications do you have?**

We are **SOC 2 Type II** and **HIPAA Certified**.
These certifications demonstrate our commitment to: * **Security** * **Availability** * **Processing integrity** * **Confidentiality** * **Privacy** You can view more at [https://trust.fireworks.ai/](https://trust.fireworks.ai/). *** ## Additional resources * **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) * **Enterprise sales**: Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for more information # Enterprise quotas Understanding quota allocations for Enterprise customers. ## Enterprise limits **Q: Are there any quotas for Enterprise Tier?** No, there are **no quotas** for Enterprise Tier. Enterprise customers benefit from: 1. **Resource Allocation**: * **Unlimited request capacity** * **Flexible scaling options** * **Custom resource allocation** 2. **Performance Benefits**: * **Dedicated infrastructure** * **Priority processing** * **Enhanced support** 3. **Custom Solutions**: * **Tailored deployment options** * **Specialized configurations** * **Customized scaling policies** For specific requirements or custom configurations, contact your **enterprise account representative**. *** ## Additional resources * **Enterprise sales**: Contact our [sales team](https://fireworks.ai/company/contact-us?tab=business) for more information * **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) * **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) # Platform support Information about Fireworks.ai deployment regions, general support channels, and platform requests. ## General support **Q: I have another question or issue.** We have an active [Discord community](https://discord.gg/mMqQxvFD9A) where you can: * Post questions * Request features * Report bugs * Interact directly with the Fireworks team and community *** ## Feature requests **Q: How can I request a new model to be added to the platform?** Head over to our **Discord server** and let us know which models you would like to see deployed. We actively take feature requests for new, popular models. *** ## Product feedback **Q: I have specific performance questions or want to know about further performance improvement options.** If you need more tailored performance advice or want to discuss advanced optimization options, here are two ways to get support: 1. **General support**: Reach out via our [support channels](https://fireworks.ai/company/contact-us) or check out the performance optimization practices for tips on maximizing efficiency with on-demand deployments. 2. **Direct consultation**: For in-depth questions, feel free to schedule a consultation directly with our Product Manager, Ray Thai, using [this link to his calendar](https://calendly.com/raythai). Ray can assist with advanced optimization strategies and hardware recommendations based on your specific workload and deployment needs. *** ## Deployment regions **Q: Do you host your deployments in the EU or Asia?** We are currently deployed in multiple U.S.-based locations. However, we’re open to hearing more about your specific requirements. You can: * Join our [Discord community](https://discord.gg/mMqQxvFD9A) * Write to us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) If you're an Enterprise customer, please contact your dedicated customer support representative to ensure a timely response. 
***

## Additional resources

* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)

# Support structure & access

Information about Fireworks.ai support options, access methods, and communication channels.

## Support options

**Q: What support options exist?**

* Enterprise accounts receive **dedicated support**.
* Developer-tier customers can interact directly with the Fireworks team and community through our **Discord channel**.

***

## Support process

**Q: How does Support work?**

Fireworks provides support for its services with **target response times** based on the **priority level** of the issue. Customers can indicate priority when creating support issues through the **Fireworks support system**.

***

## Additional resources

* **Discord Community**: [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* **Email Support**: [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
* **Documentation**: [Fireworks.ai docs](https://fireworks.ai/docs)

# Enterprise support tiers & SLAs

Detailed information about Fireworks.ai support priority levels and response time commitments.

## Enterprise support contact

**Q: If you're an Enterprise customer, how do you contact support?**

Enterprise customers have access to **dedicated support channels**. Please contact your assigned **customer support representative** for timely assistance.

***

## Communication channels

**Q: Do you have a shared Slack channel?**

For customers who use Slack internally, we create a **shared Slack channel**. This channel is used for:

* **Answering questions** about Fireworks’ platform and features
* **Receiving bug reports** from customers
* **Communicating** around incidents and escalations
* **Announcing new features** and requesting feedback on current offerings

***

## Support priority levels

**Q: What are the support tiers and SLAs for enterprise?**

Support issues are categorized into four priority levels, with specific examples for each:

| Priority Level  | Response Time           | Description                                                                                             | Examples                                                                                              |
| --------------- | ----------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| **Urgent (P0)** | Within 1 hour           | Reserved for critical cases that break live production workflows                                       | • Production scheduled task/runbook unexpectedly failing<br />• Application inaccessible to end users |
| **High (P1)**   | Within 4 business hours | Problems that prevent regular platform usage but not breaking live production                          | • Development/staging schedule failing<br />• Task deployment failing                                 |
| **Normal (P2)** | Within 8 business hours | Requests for information, enhancements, or documentation clarification with no negative service impact | • Feature requests<br />• Documentation questions                                                     |
| **Low (P3)**    | Within 2 business days  | Any issues that don't fall into P0, P1, or P2 categories                                                | • General inquiries<br />• Non-urgent requests                                                        |

*Note: Business hours refer to standard working hours.*

# Platform models

Information about custom and available models on Fireworks.ai.

## Custom models

**Q: Does Fireworks support custom base models?**

Yes, custom base models can be deployed via **firectl**. You can learn more about custom model deployment in our [guide on uploading custom models](https://docs.fireworks.ai/models/uploading-custom-models).

***

## Model availability

**Q: There’s a model I would like to use that isn’t available on Fireworks. Can I request it?**

Fireworks supports a wide array of custom models and actively takes feature requests for new, popular models to add to the platform.

**To request new models**:

1. **Join our [Discord server](https://discord.gg/fireworks-ai)**
2. Let us know which models you’d like to see
3. Provide **use case details**, if possible, to help us prioritize

We regularly evaluate and add new models based on:

* **Community requests**
* **Popular demand**
* **Technical feasibility**
* **Licensing requirements**

***

## Additional information

If you experience any issues during these processes, you can:

* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)

# Fine-tuning service

Overview of Fireworks.ai fine-tuning capabilities and supported models.

## Service availability

**Q: Does Fireworks offer a fine-tuning service?**

Yes, Fireworks offers a fine-tuning service. Take a look at our [fine-tuning guide](https://docs.fireworks.ai/fine-tuning/fine-tuning-models), which is also available [via REST API](https://docs.fireworks.ai/fine-tuning/fine-tuning-via-api) for detailed information about our services and capabilities.

***

## Model support

**Q: What models are supported for fine-tuning? Is Llama 3 supported for fine-tuning?**

Yes, **Llama 3** (8B and 70B) is supported for fine-tuning with **LoRA adapters**, which can be deployed via our **serverless** and **on-demand** options for inference.

**Capabilities include**:

* **LoRA adapter training** for flexible model adjustments
* **Serverless deployment support** for scalable, cost-effective usage
* **On-demand deployment options** for high-performance inference
* A variety of **base model options** to suit different use cases

For a complete list of models available for fine-tuning, refer to our [documentation](https://docs.fireworks.ai/fine-tuning/fine-tuning-models).

***

## Additional information

If you experience any issues during these processes, you can:

* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)

# Fine-tuning troubleshooting

Solutions for common fine-tuning deployment and access issues.

## Access issues

**Q: Why am I getting "Model not found" errors when trying to access my fine-tuned model?**

If you’re unable to access your fine-tuned model, try these troubleshooting steps:

**First steps**:

* Attempt to access the model through both the **playground** and the **API**.
* Check if the error occurs for **all users** on the account.
* Ensure your **API key** is valid.

**Common causes**:

* User email previously associated with a **deleted account**
* **API key permissions** issues
* **Access conflicts** due to multiple accounts

**Debug process**: 1.
Verify the API key’s validity using: ```bash curl -v -H "Authorization: Bearer $FIREWORKS_API_KEY" https://api.fireworks.ai/verifyApiKey ``` 2. Check if the issue persists across different **API keys**. 3. Identify which specific **users/emails** are affected. **Getting help**: * Contact support with: * Your **account ID** * **API key verification** results * A list of **affected users/emails** * Results from both **playground** and **API** tests *Note*: If you have multiple accounts, ensure that access permissions are checked across all of them. *** ## Troubleshooting firectl deployment **Q: Why am I getting "invalid id" errors when using firectl commands like create deployment or list deployments?** This error typically occurs when your **account ID** is not properly configured. ### Common symptoms * Error message: `invalid id: id must be at least 1 character long` * Affects multiple commands, including: * `firectl create deployment` * `firectl list deployments` To resolve: ### Steps to resolve 1. Run `firectl whoami` to check which **account id** is being used. 2. Ensure the correct **account ID** is being used. If not, run `firectl signin` to sign-in to the right account. *** ## LoRA deployment issues **Q: Why can’t I deploy my fine-tuned Llama 3.1 LoRA adapter?** If you encounter the following error: ```bash Invalid LoRA weight model.layers.0.self_attn.q_proj.lora_A.weight shape: torch.Size([16, 4096]), expected (16, 8192) ``` This issue is due to the `fireworks.json` file being set to **Llama 3.1 70b instruct** by default. **Workaround**: 1. Download the **model weights**. 2. Modify the base model to be `accounts/fireworks/models/llama-v3p1-8b-instruct`. 3. Follow the instructions in the [documentation](https://fireworks.ai/fine-tuning/model-upload) to upload and deploy the model. *** ## Additional information If you experience any issues during these processes, you can: * Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * Reach out to your account representative (Enterprise customers) * Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) # FLUX capabilities Understanding FLUX image generation features and limitations. ## Multiple images **Q: Can I generate multiple images in a single API call using FLUX serverless?** No, FLUX serverless supports only one image per API call. For multiple images, send separate parallel requests—these will be automatically load-balanced across our replicas for optimal performance. *** ## Image-to-image generation **Q: Does FLUX support image-to-image generation?** No, image-to-image generation is not currently supported. We are evaluating this feature for future implementation. If you have specific use cases, please share them with our support team to help inform development. *** ## LoRA models **Q: Can I create custom LoRA models with FLUX?** Inference on FLUX-LoRA adapters is currently supported. However managed training on Fireworks with FLUX is not, although this feature is under development. Updates about our managed LoRA training service will be announced when available. 
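As noted in the "Multiple images" answer above, each serverless FLUX call returns a single image, so concurrency has to come from the client. Below is a minimal sketch using a thread pool; the endpoint path and payload fields are illustrative placeholders, so check the image generation API reference for the exact request format:

```python
import concurrent.futures
import requests

API_KEY = "<FIREWORKS_API_KEY>"
# Placeholder endpoint: see the image generation API reference for the exact path and payload.
URL = "https://api.fireworks.ai/inference/v1/<image-generation-endpoint-for-your-FLUX-model>"

def generate(prompt: str, seed: int) -> bytes:
    # Each request returns one image; vary the seed to get distinct samples.
    resp = requests.post(
        URL,
        headers={"Authorization": f"Bearer {API_KEY}", "Accept": "image/jpeg"},
        json={"prompt": prompt, "seed": seed},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.content

# Fire four requests in parallel; they are load-balanced across replicas server-side.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    images = list(pool.map(lambda s: generate("a watercolor fox", s), range(4)))
```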
*** ## Size control **Q: How do I control output image sizes when using SDXL ControlNet?** When using **SDXL ControlNet** (e.g., canny control), the output image size is determined by the explicit **width** and **height** parameters in your API request: The input control signal image will be automatically: * **Resized** to fit your specified dimensions * **Cropped** to preserve aspect ratio **Example**: To generate a 768x1344 image, explicitly include these parameters in your request: ```json { "width": 768, "height": 1344 } ``` *Note*: While these parameters may not appear in the web interface examples, they are supported API parameters that can be included in your requests. *** ## Additional information If you experience any issues during these processes, you can: * Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * Reach out to your account representative (Enterprise customers) * Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) # Limitations & controls Understanding model limitations, safety features, and token limits. ## Safety Features **Q: Can safety filters or content restrictions be disabled on text generation models?** No, safety features and content restrictions for text generation models (such as Llama, Mistral, etc.) are embedded by the original model creators during training: * **Safety measures** are integrated directly into the models by the teams that trained and released them. * These are **core behaviors** of the model, not external filters. * Different models may have varying levels of built-in safety. * **Fireworks.ai does not add additional censorship layers** beyond what is inherent in the models. * Original model behaviors **cannot be modified** via API parameters or configuration. *Note*: For specific content handling needs, review the documentation of each model to understand its inherent safety features. ## Token Limits **Q: What are the maximum completion token limits for models, and can they be increased?** Token limits are model-specific and have technical constraints: **Current Limitations**: * Many models, such as **Llama 3.1 405B**, have a **4096 token completion limit**. * Setting a higher `max_tokens` in API calls **will not override** this limit. * You will see `"finish_reason": "length"` in responses when hitting this limit. **Why Limits Exist**: * **Resource management** for shared infrastructure * Prevents single requests from monopolizing resources * Helps maintain **service availability** for all users **Working with Token Limits**: * Break longer generations into **multiple requests**. * *Note*: This may require repeating context or prompts. * Be mindful that repeated context can **increase total token usage**. **Example API Response at Limit**: ```json { "finish_reason": "length", "usage": { "completion_tokens": 4096, "prompt_tokens": 4206, "total_tokens": 8302 } } ``` *** ## Additional information If you experience any issues during these processes, you can: * Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * Reach out to your account representative (Enterprise customers) * Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) # Inference performance Understanding model performance, quantization, and batching capabilities. 
## Model quantization **Q: What quantization format is used for the Llama 3.1 405B model?** The **Llama 3.1 405B model** uses the **FP8 quantization format**, which: * Closely matches **Meta's reference implementation** * Provides further details in the model description at [fireworks.ai/models/fireworks/llama-v3p1-405b-instruct](https://fireworks.ai/models/fireworks/llama-v3p1-405b-instruct) * Has a general quantization methodology documented in our [Quantization blog](https://fireworks.ai/blog/fireworks-quantization) *Note*: **BF16 precision** will be available soon for on-demand deployments. *** ## API capabilities **Q: Does the API support batching and load balancing?** Current capabilities include: * **Load balancing**: Yes, supported out of the box * **Continuous batching**: Yes, supported * **Batch inference**: Not currently supported (on the roadmap) * Note: For batch use cases, we recommend sending multiple parallel HTTP requests to the deployment while maintaining some fixed level of concurrency. * **Streaming**: Yes, supported *** ## Request handling **Q: What factors affect the number of simultaneous requests that can be handled?** Request handling capacity depends on several factors: * **Model size and type** * **Number of GPUs allocated** to the deployment * **GPU type** (e.g., A100, H100) * **Prompt size** * **Generation token length** * **Deployment type** (serverless vs. on-demand) *** ## Additional information If you experience any issues during these processes, you can: * Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * Reach out to your account representative (Enterprise customers) * Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) # Data security Information about Fireworks.ai data encryption and security measures. ## Data at rest **Q: How is data encrypted at rest?** All resources stored within Fireworks are **encrypted at rest**, including: * **Models** * **Datasets** * **LoRA Adapters** * Other stored resources *** ## Data in transit **Q: How is data encrypted in transit?** All data passed through Fireworks is encrypted using **industry-standard protocols and methods**. *** ## Encryption options **Q: Does Fireworks provide client-side encryption or allow customers to bring their own encryption keys?** Currently, Fireworks does not provide: * **Client-side encryption** * **Customer-managed keys** for encrypting data at rest *Note*: We continuously evaluate additional encryption options based on customer needs and security requirements. *** ## Additional information If you experience any issues during these processes, you can: * Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai) * Reach out to your account representative (Enterprise customers) * Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) # Security documentation Access to Fireworks.ai security policies and documentation. 
## Security policies

**Q: Where can I find more information about your security policies?**

Comprehensive security documentation is available at [trust.fireworks.ai](https://trust.fireworks.ai), including:

* **Security measures**
* **Compliance information**
* **Best practices**
* **Policy updates**

***

## Additional information

If you experience any issues during these processes, you can:

* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)

# Model security

Understanding model security and guardrail implementations.

## Model guardrails

**Q: Do you put any guardrails in front of LLM models?**

By default, we don’t apply any guardrails to LLM models. Our customers can implement guardrails through various methods:

1. **Using built-in options**:
   * Models such as **Llama Guard** provide built-in guardrails.
   * Integration with existing **security frameworks**.
2. **Third-party solutions**:
   * AI gateways like **Portkey** offer guardrails as a feature.
   * Documentation available at: [Portkey Guardrails](https://docs.portkey.ai/docs/product/guardrails)

**Best practices**:

* Implement guardrails appropriate to your **use case**.
* Conduct regular **security audits**.
* Monitor **model outputs** consistently.
* Keep **security policies** updated.

***

## Additional information

If you experience any issues during these processes, you can:

* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)

# Private access

Understanding private connection options for Fireworks.ai services.

## Private connections

**Q: Do you provide private connections?**

Fireworks provides various forms of **private connections**:

**Cloud provider options**:

* **AWS PrivateLink**
* **GCP Private Service Connect**

**Additional options**:

* **Direct Routing**, which allows you to connect your dedicated API Gateway

**Benefits**:

* **Enhanced security**
* **Reduced latency**
* **Private network communication**
* **Improved reliability**

**Implementation process**:

1. **Contact support** to initiate setup.
2. **Choose connection type** based on your requirements.
3. **Configure network settings** as per the guidelines.
4. **Verify connectivity** to ensure successful integration.

***

## Additional information

If you experience any issues during these processes, you can:

* Contact support through Discord at [discord.gg/fireworks-ai](https://discord.gg/fireworks-ai)
* Reach out to your account representative (Enterprise customers)
* Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)

# Fine-tuning models

We're introducing an upgraded tuning service with improved speed, usability, and reliability! The new service uses different commands and has different model coverage, and it is offered for free while we're in public preview. See these [docs](https://docs.fireworks.ai/fine-tuning/fine-tuning-legacy) to use our legacy service instead.

## Introduction

Fireworks offers a [LoRA](https://huggingface.co/docs/diffusers/training/lora)-based fine-tuning method designed for usability, reliability, and efficiency. LoRA is used for fine-tuning all models except our 70B models, which use qLoRA (quantized LoRA) to improve training speeds.
The fine-tuning service provides hassle-free quality improvements through intelligent defaults and minimal configuration. Models fine-tuned with our service can be seamlessly deployed for inference on Fireworks or downloaded for local usage.

Fine-tuning a model with a dataset can be useful for several reasons:

1. **Enhanced Precision**: It allows the model to adapt to the unique attributes and trends within the dataset, leading to significantly improved precision and effectiveness.
2. **Domain Adaptation**: While many models are developed with general data, fine-tuning them with specialized, domain-specific datasets ensures they are finely attuned to the specific requirements of that field.
3. **Bias Reduction**: General models may carry inherent biases. Fine-tuning with a well-curated, diverse dataset aids in reducing these biases, fostering fairer and more balanced outcomes.
4. **Contemporary Relevance**: Information evolves rapidly, and fine-tuning with the latest data keeps the model current and relevant.
5. **Customization for Specific Applications**: This process allows for the tailoring of the model to meet unique objectives and needs, an aspect not achievable with standard models.

In essence, fine-tuning a model with a specific dataset is a pivotal step in ensuring its enhanced accuracy, relevance, and suitability for specific applications. Let's hop on a journey of fine-tuning a model!

Fine-tuned model inference on Serverless is slower than base model inference on Serverless. For use cases that need low latency, we recommend using [on-demand deployments](https://docs.fireworks.ai/guides/ondemand-deployments). For on-demand deployments, fine-tuned model inference speeds are significantly closer to base model speeds (but still slightly slower). If you are only using one LoRA on-demand, [merging fine-tuned weights](https://huggingface.co/docs/peft/main/en/developer_guides/lora#merge-lora-weights-into-the-base-model) into the base model when using on-demand deployments will provide identical speed to base model inference. If you have an enterprise use case that needs fast fine-tuned models, please [contact us!](https://fireworks.ai/company/contact-us)

## Pricing

Our new tuning service is currently free, but it will eventually be charged based on the total number of tokens processed (dataset tokens \* number of epochs). Running inference on fine-tuned models incurs no extra costs outside of base inference fees. See our [Pricing](https://fireworks.ai/pricing#fine-tuning) page for pricing details on our legacy tuning service.

## Installing firectl

[`firectl`](/tools-sdks/firectl/firectl) is the command-line interface (CLI) used to manage and deploy various resources on the [Fireworks AI Platform](https://fireworks.ai). Use `firectl` to manage fine-tuning jobs and their resulting models. Please visit the firectl [Getting Started](/tools-sdks/firectl/firectl) guide for instructions on installing and using `firectl`.

## Preparing your dataset

To fine-tune a model, you first need to upload a dataset. Once uploaded, this dataset can be used to create one or more fine-tuning jobs. A dataset consists of a single JSONL file, where each line is a separate training example.

Limits:

* Minimum number of examples is 3.
* Maximum number of examples is 3,000,000.

Format:

* Each line of the file must be a valid JSON object.

Each dataset must conform to the schema expected by our OpenAI-compatible [Chat Completions API](https://docs.fireworks.ai/guides/querying-text-models#chat-completions-api).
Each JSON object of the dataset must contain a single array field called `messages`. Each message is an object containing two fields:

* `role` - one of "system", "user", or "assistant".
* `content` - the content of the message.

A message with the "system" role is optional, but if specified, it must be the first message of the conversation. Subsequent messages start with "user" and alternate between "user" and "assistant". See below for some example training records:

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "blue"}]}
{"messages": [{"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "2"}, {"role": "user", "content": "Now what is 2+2?"}, {"role": "assistant", "content": "4"}]}
```

### Creating your dataset

To create a dataset, run:

```shell
firectl create dataset path/to/dataset.jsonl
```

and you can check the dataset with:

```shell
firectl get dataset
```

## Starting your tuning job

To start a supervised fine-tuning job (sftj), run:

```shell
firectl create sftj --base-model --dataset --output-model
```

For example:

```shell
firectl create sftj --base-model llama-v3p1-8b-instruct --dataset my_dataset --output-model my_model
```

firectl will return the fine-tuning job ID.

When creating a fine-tuning job, you can start tuning from a base model or from a model you tuned earlier (a LoRA add-on):

1. **Base model**: Use the `base-model` parameter to start from a pre-trained base model.
2. **Existing LoRA add-on**: Use the `warm-start-from` parameter to start from an existing LoRA add-on model, where the LoRA is specified in the format `accounts/<account-id>/models/<model-id>`.

You must specify either `base-model` or `warm-start-from` in your command-line flags.

### Checking the job status

You can monitor the progress of the tuning job by running:

```shell
firectl get fine-tuning-job
```

Once the job successfully completes, a model will be created in your account. You can see a list of models by running:

```shell
firectl list models
```

Or if you specified a model ID when creating the fine-tuning job, you can get the model directly:

```shell
firectl get model
```

## Deploying and using a model

Before using your fine-tuned model for inference, you must deploy it. Please refer to our guides on [Deploying a model](/models/deploying#lora-addons) and [Querying text models](/guides/querying-text-models) for detailed instructions.

Some base models may not support serverless addons. To check:

1. Run `firectl -a fireworks get `
2. Look under `Deployed Model Refs` to see if a `fireworks`-owned deployment exists, e.g. `accounts/fireworks/deployments/3c7a68b0`
3. If so, then it is supported

If the base model doesn't support serverless addons, you will need to use an [on-demand deployment](/models/deploying#deploying-to-on-demand) to deploy it.

## Additional tuning options

Tuning settings are specified when starting a fine-tuning job. All of the settings below are optional and have reasonable defaults if not specified. For settings that affect tuning quality, like epochs and learning rate, we recommend using the default settings and only changing hyperparameters if results are not as desired. All tuning options must be specified via command-line flags, as shown in the example command below with multiple flags.
```shell
firectl create sftj \
  --base-model llama-v3p1-8b-instruct \
  --dataset cancerset \
  --output-model my-tuned-model \
  --job-id my-fine-tuning-job \
  --learning-rate 0.0001 \
  --epochs 2 \
  --early-stop \
  --evaluation-dataset my-eval-set
```

### Evaluation

By default, the fine-tuning job will run evaluation by running the fine-tuned model against an evaluation set that's created by automatically carving out a portion of your training set. You have the option to explicitly specify a separate evaluation dataset to use instead of carving out training data.

1. `evaluation_dataset`: The ID of a separate dataset to use for evaluation. Must be pre-uploaded via `firectl`.

```shell
firectl create sftj \
  ...
  --evaluation-dataset my-eval-set \
  ...
```

### Early stopping

Early stopping stops training early if the validation loss does not improve. It is off by default.

```shell
firectl create sftj \
  ...
  --early-stop \
  ...
```

### Max context length

By default, fine-tuned models support a max context length of 8k. Increase max context length if your use case requires context above 8k. Maximum context length can be increased up to the default context length of your selected model. For models with over 70B parameters, we only support up to 32k max context length.

```shell
firectl create sftj \
  ...
  --max-context-length 16000 \
  ...
```

### Epochs

Epochs are the number of passes over the training data. Our default value is 1. If the model does not follow the training data as much as expected, increase the number of epochs by 1 or 2. Non-integer values are supported.

**Note: we set a max value of 3 million dataset examples \* epochs.**

```shell
firectl create sftj \
  ...
  --epochs 2.0 \
  ...
```

### Learning rate

Learning rate controls how fast the model updates from data. We generally do not recommend changing the learning rate. The default value is set automatically based on your selected model.

```shell
firectl create sftj \
  ...
  --learning-rate 0.0001 \
  ...
```

### LoRA rank

LoRA rank refers to the number of parameters that will be tuned in your LoRA add-on. A higher LoRA rank increases the amount of information that can be captured while tuning. LoRA rank must be a power of 2, up to 64. Our default value is 8.

```shell
firectl create sftj \
  ...
  --lora-rank 16 \
  ...
```

### Training progress and monitoring

The fine-tuning service integrates with Weights & Biases to provide observability into the tuning process. To use this feature, you must have a Weights & Biases account and have provisioned an API key.

```shell
firectl create sftj \
  ...
  --wandb-entity my-org \
  --wandb-api-key xxx \
  --wandb-project "My Project" \
  ...
```

### Model ID

By default, the fine-tuning job will generate a random unique ID for the model. This ID is used to refer to the model at inference time. You can optionally specify a custom ID, within the [ID constraints](https://docs.fireworks.ai/getting-started/concepts#resource-names-and-ids).

```shell
firectl create sftj \
  ...
  --output-model-id my-model \
  ...
```

### Job ID

By default, the fine-tuning job will generate a random unique ID for the fine-tuning job. You can optionally choose a custom ID.

```shell
firectl create sftj \
  ...
  --job-id my-fine-tuning-job \
  ...
```
## Downloading model weights

To download model weights, run:

```shell
firectl download model
```

## Appendix

### Supported base models - tuning

The Fireworks tuning service is limited to select models where we're confident in providing intelligent defaults for a hassle-free experience. Currently, we only support tuning models with the following architectures:

* [Llama 1, 2, 3.x](https://huggingface.co/docs/transformers/en/model_doc/llama2) architectures are supported. Llama vision models and Llama 405B are currently not supported.
* [Qwen2](https://huggingface.co/docs/transformers/en/model_doc/qwen2) architectures are supported.

### Supported base models - LoRAs on dedicated deployment

LoRAs can be deployed for inference on dedicated deployments (on-demand or enterprise reserved) for the following models:

* All models supported for tuning
* accounts/fireworks/models/mixtral-8x7b-instruct-hf
* accounts/fireworks/models/mixtral-8x22b-instruct-hf
* accounts/fireworks/models/mixtral-8x22b-hf
* accounts/fireworks/models/mixtral-8x7b
* accounts/fireworks/models/mistral-7b-instruct-v0p2
* accounts/fireworks/models/mistral-7b
* accounts/fireworks/models/code-qwen-1p5-7b
* accounts/fireworks/models/deepseek-coder-v2-lite-base
* accounts/fireworks/models/deepseek-coder-7b-base
* accounts/fireworks/models/deepseek-coder-1b-base
* accounts/fireworks/models/codegemma-7b
* accounts/fireworks/models/codegemma-2b
* accounts/fireworks/models/starcoder2-15b
* accounts/fireworks/models/starcoder2-7b
* accounts/fireworks/models/starcoder2-3b
* accounts/fireworks/models/stablecode-3b

This means that [up to 100](https://docs.fireworks.ai/guides/quotas_usage/rate-limits#other-quotas) LoRAs can be deployed to a dedicated instance for no extra fees compared to the base deployment costs.

### Supported base models - LoRAs on serverless

The following base models are supported for low-rank adaptation (LoRA) and can be deployed as LoRA add-ons on Fireworks [serverless](/models/deploying#deploying-to-serverless) and [on-demand](/models/deploying#deploying-to-on-demand) deployments, using the default parameters below. Serverless deployment is only available for a subset of fine-tuned models: run `firectl get model` (see the [models overview](https://docs.fireworks.ai/models/overview#introduction)) or check the [models page](https://fireworks.ai/models) to see if there's an active serverless deployment. A limited number of models are available for serverless LoRA deployment, meaning that up to 100 LoRAs can be deployed to serverless and are constantly available on a pay-per-token basis.

* accounts/fireworks/models/llama-v3p1-8b-instruct
* accounts/fireworks/models/llama-v3p1-70b-instruct
* accounts/fireworks/models/llama-v3p2-3b-instruct

### Support

We'd love to hear what you think! Please connect with the team, ask questions, and share your feedback in the [#fine-tuning](https://discord.gg/zYDmm4zqmq) Discord channel.

# Using Document Inlining

## Overview

Document Inlining allows any LLM to process images and PDFs through our chat completions API. Simply append `#transform=inline` to your document URL to enable this feature.
Document Inlining connects our proprietary Fireworks Parsing Service to any LLM to provide advantages including:

* Improved reasoning (compared to VLMs): LLMs reason better over text than over images, and Document Inlining lets you use specialized and more recently updated text models
* Improved input flexibility: Document Inlining enables PDFs and multiple images to be ingested
* Ultra-simple usage: Use Document Inlining through our OpenAI-compatible Chat Completions API. Simply add one line to attach your file and turn on Document Inlining

Read our [announcement blog](https://fireworks.ai/blog/document-inlining-launch) for more details.

## Usage

### Basic Example

Note the `#transform=inline` addition to the image URL.

```python Python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://pdfobject.com/pdf/sample.pdf#transform=inline"
                    }
                },
                {
                    "type": "text",
                    "text": "What information can you extract from this document?"
                }
            ]
        }
    ]
)
```

```typescript TypeScript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "",
  baseURL: "https://api.fireworks.ai/inference/v1"
});

const response = await client.chat.completions.create({
  model: "accounts/fireworks/models/llama-v3p3-70b-instruct",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/document.pdf#transform=inline"
          }
        },
        {
          type: "text",
          text: "What information can you extract from this document?"
        }
      ]
    }
  ]
});
```

```javascript JavaScript
const OpenAI = require("openai");

const client = new OpenAI({
  apiKey: "",
  baseURL: "https://api.fireworks.ai/inference/v1"
});

const response = await client.chat.completions.create({
  model: "accounts/fireworks/models/llama-v3p3-70b-instruct",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/document.pdf#transform=inline"
          }
        },
        {
          type: "text",
          text: "What information can you extract from this document?"
        }
      ]
    }
  ]
});
```

The `image_url.url` field supports both direct URLs and base64-encoded data URLs, compatible with the VLM API:

```text
# For PDF files
data:application/pdf;base64,{base64_str_for_pdf}

# For images (png/jpg/gif/tiff supported)
data:image/png;base64,{base64_str_for_image}
data:image/jpeg;base64,{base64_str_for_image}
data:image/gif;base64,{base64_str_for_image}
data:image/tiff;base64,{base64_str_for_image}
```

Similarly, simply append `#transform=inline` to the base64 string to enable Document Inlining.

### Combining with Structured Output

Document Inlining works seamlessly with structured output formats. Here's how to extract specific fields using [JSON mode](https://docs.fireworks.ai/structured-responses/structured-response-formatting):

```python
from pydantic import BaseModel

class DocumentInfo(BaseModel):
    title: str
    key_points: list[str]

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[...],  # Same as above
    response_format={"type": "json_object", "schema": DocumentInfo.model_json_schema()}
)
```

## Limitations

Document Inlining is only intended to handle images and documents that contain text. Document Inlining may provide subpar results for highly visual, spatially dependent, or layout-heavy content that does not translate well into structured text.
* Maximum document size: 50 pages or the model's context size (whichever is smaller) * Maximum document size: \~32 MB if sent as base64 encoded string, \~100 MB if sent as URL * Supported formats: PDFs and images ## Model Compatibility Document Inlining works with any LLM on Fireworks, including: * Serverless models * On-demand models * Fine-tuned and custom models * Vision models Simply append `#transform=inline` to your document URL to enable the feature with any supported model. Multiple documents are supported. Vision models also support document inlining with images for use cases that require both document processing and non-document vision. Users can control whether to inline a document by selectively appending `#transform=inline` to image\_url.url of each attachment. ## Pricing During public preview, Document Inlining incurs no added costs compared to our typical text models. For example, let’s say you’re conducting a structured extraction task where you provide: Input: 10 token Prompt + document with 1,000 tokens worth of text Output: 100 tokens You would simply pay for the 1110 tokens worth of input and output token costs but will NOT incur additional costs for document parsing. Please note that Document Inlining is in Public Preview mode and subject to changes. Please contact us on Discord if you have feedback or questions or at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) for enterprise inquries. # Concepts This document outlines basic Fireworks AI concepts. ## Resources ### Account Your account is the top-level resource under which other resources are located. Quotas and billing are enforced at the account level, so usage for all users in an account contribute to the same quotas and bill. For developer accounts, the account ID is auto-generated from the email address used to sign up. Enterprise accounts can optionally choose a custom, unique account ID. ### User A user is an email address associated with an account. Users added to an account have full access to delete, edit, and create resources within the account, such as deployments and models. ### Model A model is a set of model weights and metadata associated with the model. A model cannot be used for inference until it is deployed to one or more deployments, creating a "deployed model". There are two types of models: * Base models * Low-rank adaptation (LoRA) addons See our [Models overview](/models/overview) page for details. ### Deployment A deployment is a collection (one or more) model servers that host one base model and optionally one or more LoRA addons. Fireworks provides a set of "serverless" deployments that host common base models. These deployments may be used for [serverless inference](/models/overview#serverless-inference) as well as hosting [serverless addons](/models/overview#serverless-addons). ### Deployed model A deployed model is an instance of a base model or LoRA addon that is loaded into a deployment. ### Dataset A dataset is an immutable set of training examples that can be used to fine-tune a model. ### Fine-tuning job A fine-tuning job is an offline training job that uses a dataset to train a LoRA addon model. ## Resource names and IDs A full resource name looks like ``` accounts/my-account/models/my-model ``` The individual segments `my-account` and `my-model` are account and [model IDs](https://docs.fireworks.ai/models/overview), respectively. 
Some APIs take the full resource name, while others may take a resource ID if the context is clear.

## Control plane and data plane

The Fireworks API can be split into a control plane and a data plane.

* The **control plane** consists of APIs used for managing the lifecycle of resources. This includes your account, models, and deployments.
* The **data plane** consists of the APIs used for inference and the backend services that power them.

## Interfaces

Users can interact with Fireworks through several interfaces:

* The **web console** at [https://fireworks.ai](https://fireworks.ai)
* The command-line interface `firectl`
* The [Python SDK](/tools-sdks/python-client/installation)

# Introduction

Fireworks AI is a generative AI inference platform for running and customizing models with industry-leading speed and production-readiness.

## Welcome to Fireworks AI

{/* Make an API call to an open-source LLM Watch to learn more about the Fireworks AI platform */}

## What we offer

The Fireworks platform empowers developers to create generative AI systems with the best quality, cost, and speed. All publicly available services are pay-as-you-go with developer-friendly [pricing](https://fireworks.ai/pricing). See the list below for offerings and docs links, and scroll further for more detailed descriptions and blog links.

* **Inference:** Run generative AI models on Fireworks-hosted infrastructure with our optimized FireAttention inference engine. Multiple inference options ensure there’s always a fit for your use case.
* **Modalities and Models:** Use hundreds of models (or bring your own) across modalities of:
  * [Text](https://docs.fireworks.ai/guides/querying-text-models)
  * [Audio](https://docs.fireworks.ai/api-reference/audio-transcriptions)
  * [Image](https://docs.fireworks.ai/api-reference/generate-a-new-image-from-a-text-prompt)
  * [Embedding](https://docs.fireworks.ai/guides/querying-embeddings-models)
  * [Vision-understanding](https://docs.fireworks.ai/guides/querying-vision-language-models)
* **Adaptation:** [Tune](https://docs.fireworks.ai/fine-tuning/fine-tuning-models) and optimize your model and deployment for the best quality, cost, and speed. [Serve](https://docs.fireworks.ai/models/deploying) and experiment with hundreds of fine-tuned models with our multi-LoRA [capabilities](https://fireworks.ai/blog/multi-lora).
* **Compound AI Development:** Use [JSON mode](https://docs.fireworks.ai/structured-responses/structured-response-formatting), [grammar mode](https://docs.fireworks.ai/structured-responses/structured-output-grammar-based), or [function calling](https://docs.fireworks.ai/guides/function-calling) to build compound AI systems with reliable and performant outputs.

## Inference

Fireworks offers three options for running generative AI models with unparalleled speed and cost-efficiency.

* **Serverless**: The easiest way to get started. Use the most popular models on pre-configured GPUs. Pay per token and avoid cold boots.
* **[On-demand](https://fireworks.ai/blog/why-gpus-on-demand)** - The most flexible option for scaling. Use private GPUs to support your specific needs and only pay while you’re using them. GPUs running Fireworks software offer both \~250% improved throughput and 50% improved latency compared to vLLM.
  Excels for:

  * **Production volume** - Per-token costs decrease with more volume, and there are no set rate limits
  * **Custom needs and reliability** - On-demand GPUs are private to you. This enables complete control to tailor deployments for speed, throughput, and reliability, or to run more specialized models

* **Enterprise Reserved GPUs** - Use private GPUs with hardware and software set-up personally tailored by the Fireworks team for your use case. Enjoy SLAs, dedicated support, bring-your-own-cloud (BYOC) deployment options, and enterprise-only optimizations.

| Property                   | **Serverless**                                                                                | **On-demand**                                                                                                                                              | **Enterprise reserved**                                               |
| -------------------------- | --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------| --------------------------------------------------------------------- |
| **Performance**            | Industry-leading speed on Fireworks-curated set-up. Performance may vary with others’ usage.  | Speed dependent on user-specified GPU configuration and private usage. Per-GPU latency should be significantly faster than vLLM.                          | Tailor-made set-up by Fireworks AI experts for best possible latency  |
| **Getting started**        | Self-serve - immediately use serverless with 1 line of code                                   | Self-serve - configure GPUs, then use them with 1 line of code                                                                                             | Chat with Fireworks                                                   |
| **Scaling and management** | Scale up and down freely within rate limits                                                   | Option for auto-scaling GPUs with traffic. GPUs scale to zero automatically, so there is no charge for unused GPUs or for boot-ups.                        | Chat with Fireworks                                                   |
| **Pricing**                | Pay a fixed price per token                                                                   | Pay per GPU-second with no commitments. Per-GPU throughput should be significantly greater than options like vLLM.                                        | Customized price based on reserved GPU capacity                       |
| **Commitment**             | None                                                                                          | None                                                                                                                                                       | Arrange plan length with Fireworks                                    |
| **Rate limits**            | Yes, see [quotas](https://docs.fireworks.ai/accounts/quotas)                                  | No rate limits. [Quotas](https://docs.fireworks.ai/accounts/quotas) on the number of GPUs                                                                  | None                                                                  |
| **Model selection**        | Collection of popular models, curated by Fireworks                                            | Use hundreds of pre-uploaded models or upload your own custom model within supported [architectures](https://docs.fireworks.ai/models/uploading-custom-models) | Use hundreds of pre-uploaded models or upload any model          |

## FireOptimizer

With **FireOptimizer**, Fireworks optimizes inference for your workload and your use case. FireOptimizer includes several optimization techniques. Publicly available features are:

* **[Fine-tuning](https://fireworks.ai/blog/fine-tune-launch)** - Quickly fine-tune models with LoRA for the best quality on your use case
  * Upload data and choose your model to start tuning
  * Pay per token of training data
  * Serve and evaluate models immediately on Fireworks
  * Download model weights to use anywhere
* **[Multi-LoRA serving](https://fireworks.ai/blog/multi-lora)** - Deploy hundreds of fine-tuned models at no extra cost
  * Zero extra cost for serving LoRAs: 1 million requests with 50 models costs the same as 1 million requests with 1 model
  * Use models fine-tuned on Fireworks or upload your own fine-tuned adapter
  * Host hundreds of models on the same deployment, on either serverless or dedicated deployments

## Compound AI

Fireworks makes it easy to use multiple models and modalities together in one compound AI system.
Features include:

* **[JSON mode and grammar mode](https://fireworks.ai/blog/why-do-all-LLMs-need-structured-output-modes)** - Provide structure to any LLM on Fireworks with either (a) a JSON schema or (b) a context-free grammar to guarantee that LLM output follows your desired format. These structured output modes are particularly useful for ensuring LLMs can reliably call and pipe outputs to other models, APIs, and components.
* **[Function calling](https://fireworks.ai/blog/firefunction-v2-launch-post)** - Fireworks offers function calling support via our proprietary Firefunction models or Llama 3.1 70B.

{/* ## Support Join our community of Generative AI builders Have more questions? Drop us a note! */}

# Onboarding

A quick guide to navigating and building with the Fireworks platform.

# Introduction

Welcome to the **Fireworks onboarding guide**! This guide is designed to help you quickly and effectively get started with the Fireworks platform, whether you're a developer, researcher, or AI enthusiast. By following this step-by-step resource, you'll learn how to explore and experiment with state-of-the-art AI models, prototype your ideas using Fireworks’ serverless infrastructure, and scale your projects with advanced on-demand deployments.

### Who this guide is for

This guide is designed for new Fireworks users who are exploring the platform for the first time. It provides a hands-on introduction to the core features of Fireworks, including the model library, playgrounds, and on-demand deployments, all accessible through the web app. For experienced users, this guide serves as a starting point, with future resources planned to dive deeper into advanced tools like `firectl` and other intermediate features to enhance your workflow.

### Objectives of the guide

* **Explore the Fireworks model library**: Navigate and select generative AI models for text, image, and audio tasks.
* **Experiment with the playground**: Test prompts, tweak parameters, and generate outputs in real time.
* **Prototype effortlessly**: Use Fireworks’ serverless infrastructure to deploy and iterate without managing servers.
* **Scale your AI**: Learn how on-demand deployments offer predictable performance and advanced customization.
* **Develop complex systems**: Unlock advanced capabilities like Compound AI, function calling, and retrieval-augmented generation to create production-ready applications.

By the end of this guide, you’ll be equipped with the knowledge and tools to confidently use Fireworks to build, scale, and optimize AI-powered solutions. Let’s get started!

***

# Step 1. Explore our model library

Fireworks provides a range of leading open-source models for tasks like text generation, code generation, and image understanding. With the Fireworks [model library](https://fireworks.ai/models), you can choose from our wide range of popular LLMs, VLMs, LVMs, and audio models, such as:

* [**LLMs**: Llama 3.3 70B](https://fireworks.ai/models/fireworks/llama-v3p3-70b-instruct), [Deepseek V3](https://fireworks.ai/models/fireworks/deepseek-v3), and [Qwen2.5 Coder 32B Instruct](https://fireworks.ai/models/fireworks/qwen2p5-coder-32b-instruct).
* [**VLMs**: Llama 3.2 90B Vision Instruct](https://fireworks.ai/models/fireworks/llama-v3p2-90b-vision-instruct).
* [**Image generation models**: BFL’s FLUX.1 \[dev\] FP8](https://fireworks.ai/models/fireworks/flux-1-dev-fp8) and [Stability.ai’s Stable Diffusion 3.5 Large Turbo](https://fireworks.ai/models/fireworks/stable-diffusion-3p5-large-turbo).
* [**Audio models**: Whisper V3](https://fireworks.ai/models/fireworks/whisper-v3) and the [(blazing fast)](https://fireworks.ai/blog/audio-transcription-launch) [Whisper V3 Turbo](https://fireworks.ai/models/fireworks/whisper-v3-turbo).

Fireworks also offers [**embedding models**](https://docs.fireworks.ai/guides/querying-embeddings-models#list-of-available-models) from Nomic AI.

In this video, we introduce the **Fireworks Model Library**, your gateway to a diverse range of open-source and proprietary models designed for tasks like text generation, image understanding, and audio processing. Whether you’re a developer or a creative, Fireworks makes it easy to find and integrate the right tools for your generative AI needs.

### What you’ll learn:

1️⃣ **Navigating the model library**: Browse popular models, filter by deployment type, and search for specific tools like Llama, Whisper, and Flux.\
2️⃣ **Customizing your experience**: Use filters like "Serverless Models" to find models that fit your specific needs.\
3️⃣ **Seamless integration**: Learn how Fireworks simplifies discovering and managing AI models.