# Exporting Billing Metrics Source: https://docs.fireworks.ai/accounts/exporting-billing-metrics Export billing and usage metrics for all Fireworks services ## Overview Fireworks provides a CLI tool to export comprehensive billing metrics for all usage types including serverless inference, on-demand deployments, and fine-tuning jobs. The exported data can be used for cost analysis, internal billing, and usage tracking. ## Exporting billing metrics Use the Fireworks CLI to export a billing CSV that includes all usage: ```bash theme={null} # Authenticate (once) firectl auth login # Export billing metrics to CSV firectl export billing-metrics ``` ## Examples Export all billing metrics for an account: ```bash theme={null} firectl export billing-metrics ``` Export metrics for a specific date range and filename: ```bash theme={null} firectl export billing-metrics \ --start-time "2025-01-01" \ --end-time "2025-01-31" \ --filename january_metrics.csv ``` ## Output format The exported CSV includes the following columns: * **email**: Account email * **start\_time**: Request start timestamp * **end\_time**: Request end timestamp * **usage\_type**: Type of usage (e.g., TEXT\_COMPLETION\_INFERENCE\_USAGE) * **accelerator\_type**: GPU/hardware type used * **accelerator\_seconds**: Compute time in seconds * **base\_model\_name**: The model used * **model\_bucket**: Model category * **parameter\_count**: Model size * **prompt\_tokens**: Input tokens * **completion\_tokens**: Output tokens ### Sample row ```csv theme={null} email,start_time,end_time,usage_type,accelerator_type,accelerator_seconds,base_model_name,model_bucket,parameter_count,prompt_tokens,completion_tokens user@example.com,2025-10-20 17:16:48 UTC,2025-10-20 17:16:48 UTC,TEXT_COMPLETION_INFERENCE_USAGE,,,accounts/fireworks/models/llama4-maverick-instruct-basic,Llama 4 Maverick Basic,401583781376,803,109 ``` ## Automation You can automate exports in cron jobs and load the CSV into your internal systems: ```bash theme={null} # Example: Daily export with dated filename firectl export billing-metrics \ --start-time "$(date -v-1d '+%Y-%m-%d')" \ --end-time "$(date '+%Y-%m-%d')" \ --filename "billing_$(date '+%Y%m%d').csv" ``` Run `firectl export billing-metrics --help` to see all available flags and options. ## Coverage This export includes: * **Serverless inference**: All serverless API usage * **On-demand deployments**: Deployment usage (see also [Exporting deployment metrics](/deployments/exporting-metrics) for real-time Prometheus metrics) * **Fine-tuning jobs**: Fine-tuning compute usage * **Other services**: All billable Fireworks services For real-time monitoring of on-demand deployment performance metrics (latency, throughput, etc.), use the [Prometheus metrics endpoint](/deployments/exporting-metrics) instead. ## See also * [firectl CLI overview](/tools-sdks/firectl/firectl) * [Exporting deployment metrics](/deployments/exporting-metrics) - Real-time Prometheus metrics for on-demand deployments * [Rate Limits & Quotas](/guides/quotas_usage/rate-limits) - Understanding spend limits and quotas # Service Accounts Source: https://docs.fireworks.ai/accounts/service-accounts How to manage and use service accounts in Fireworks Service accounts in Fireworks allow applications, scripts, and automated systems to authenticate and perform actions securely—without relying on human credentials. They are ideal for CI/CD pipelines, backend services, and automated workflows. 
Service accounts let you avoid shared credentials and make it easy to distinguish, in audit logs, between actions taken by automated systems and actions taken by humans.

Service accounts can take actions using an API key, such as creating deployments, running models, or creating datasets (see the [API reference](https://fireworks.ai/docs/api-reference/introduction)). Service accounts cannot log in through the web interface or use OIDC tokens.

## Creating a Service Account

You can create a service account using firectl:

```bash theme={null}
firectl create user --user-id "my-service-account" --service-account
```

## Creating an API Key for a Service Account

You can create an API key on behalf of a service account using firectl:

```bash theme={null}
firectl create api-key --service-account "my-service-account"
```

## Billing

* Service accounts count toward the same quotas and limits assigned to the account.
* Usage is tracked at the account level, not per individual user or service account.

## Auditing

In audit logs, users are referenced by their email addresses. Service accounts are referenced as `my-service-account@my-account.sa.fireworks.ai`.

# Custom SSO
Source: https://docs.fireworks.ai/accounts/sso

Set up custom Single Sign-On (SSO) authentication for Fireworks AI

Fireworks uses single sign-on (SSO) as the primary mechanism to authenticate with the platform. By default, Fireworks supports Google SSO. If you have an enterprise account, Fireworks supports bringing your own identity provider using:

* OpenID Connect (OIDC) provider
* SAML 2.0 provider

Coordinate with your Fireworks AI representative to enable the integration.

## OpenID Connect (OIDC) provider

Create an OIDC client application in your identity provider, e.g. Okta. Ensure the client is configured for "code authorization" of the "web" type (i.e. with a client\_secret).

Set the client's "allowed redirect URL" to the URL provided by Fireworks. It looks like:

```
https://fireworks-.auth.us-west-2.amazoncognito.com/oauth2/idpresponse
```

Note down the `issuer`, `client_id`, and `client_secret` for the newly created client. You will need to provide these to your Fireworks AI representative to complete your account setup.

## SAML 2.0 provider

Create a SAML 2.0 application in your identity provider, e.g. [Okta](https://help.okta.com/en-us/Content/Topics/Apps/Apps_App_Integration_Wizard_SAML.htm).

Set the SSO URL to the URL provided by Fireworks. It looks like:

```
https://fireworks-.auth.us-west-2.amazoncognito.com/saml2/idpresponse
```

Configure the Audience URI (SP Entity ID) as provided by Fireworks. It looks like:

```
urn:amazon:cognito:sp:
```

Create an Attribute Statement with the name:

```
http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress
```

and the value `user.email`. Leave the rest of the settings at their defaults.

Note down the "metadata url" for your newly created application. You will need to provide this to your Fireworks AI representative to complete your account setup.

## Troubleshooting

### Invalid samlResponse or relayState from identity provider

This error occurs if you are trying to use identity provider (IdP) initiated login. Fireworks currently only supports service provider (SP) initiated login. See [Understanding SAML](https://developer.okta.com/docs/concepts/saml/#understand-sp-initiated-sign-in-flow) for an in-depth explanation.

### Required String parameter 'RelayState' is not present

See above.
# Managing users
Source: https://docs.fireworks.ai/accounts/users

Add and delete additional users in your Fireworks account

See the concepts [page](/getting-started/concepts#account) for definitions of accounts and users. Only admin users can manage other users within the account.

## Adding users

To add a new user to your Fireworks account, run the following command. If the email for the new user is already associated with a Fireworks account, they will have the option to freely switch between your account and their existing account(s). You can also add users in the Fireworks web UI at [https://app.fireworks.ai/account/users](https://app.fireworks.ai/account/users).

```bash theme={null}
firectl create user --email="alice@example.com"
```

To create another admin user, pass the `--role=admin` flag:

```bash theme={null}
firectl create user --email="alice@example.com" --role=admin
```

## Updating a user's role

To update a user's role, run:

```bash theme={null}
firectl update user --role="{admin,user}"
```

## Deleting users

You can remove a user from your account by running:

```bash theme={null}
firectl delete user
```

# Streaming Transcription
Source: https://docs.fireworks.ai/api-reference/audio-streaming-transcriptions
websocket /audio/transcriptions/streaming

Streaming transcription is performed over a WebSocket. Provide the transcription parameters and establish a WebSocket connection to the endpoint. Stream short audio chunks (50-400ms) in binary frames of PCM 16-bit little-endian at 16kHz sample rate and single channel (mono). In parallel, receive transcription from the WebSocket.

Stream audio to get transcription continuously in real-time.

### URLs

Fireworks provides serverless, real-time ASR via WebSocket endpoints. Please select the appropriate version:

#### Streaming ASR v1 (default)

Production-ready and generally recommended for all use cases.

```
wss://audio-streaming.api.fireworks.ai/v1/audio/transcriptions/streaming
```

#### Streaming ASR v2 (preview)

An early-access version of our next-generation streaming transcription service. V2 is a good fit for use cases that require lower latency and higher accuracy in noisy conditions.

```
wss://audio-streaming-v2.api.fireworks.ai/v1/audio/transcriptions/streaming
```

### Headers

Your Fireworks API key, e.g. `Authorization=API_KEY`. Alternatively, can be provided as a query param.

### Query Parameters

Your Fireworks API key. Required when headers cannot be set (e.g., browser WebSocket connections). Can alternatively be provided via the Authorization header.

The format in which to return the response. Currently only `verbose_json` is recommended for streaming.

The target language for transcription. See the [Supported Languages](#supported-languages) section below for a complete list of available languages.

The input prompt that the model will use when generating the transcription. Can be used to specify custom words or specify the style of the transcription. E.g. `Um, here's, uh, what was recorded.` will make the model include the filler words in the transcription.

Sampling temperature to use when decoding text tokens during transcription.

The timestamp granularities to populate for this streaming transcription. Defaults to null. Set to `word,segment` to enable timestamp granularities. Use a list for timestamp\_granularities in all client libraries. A comma-separated string like `word,segment` only works when manually included in the URL (e.g. in curl), as sketched below.
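For illustration, here is a minimal sketch of building the connection URL by hand with the comma-separated form. The endpoint URL and parameter names are those documented above, and the `urllib.parse` usage mirrors the full example later on this page:

```python theme={null}
import urllib.parse

base_url = "wss://audio-streaming.api.fireworks.ai/v1/audio/transcriptions/streaming"

# When the URL is built manually (as with curl), timestamp_granularities
# is passed as a single comma-separated string.
params = urllib.parse.urlencode({
    "language": "en",
    "response_format": "verbose_json",
    "timestamp_granularities": "word,segment",
}, safe=",")

url = f"{base_url}?{params}"
# wss://audio-streaming.api.fireworks.ai/v1/audio/transcriptions/streaming?language=en&response_format=verbose_json&timestamp_granularities=word,segment
```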
### Client messages

Used by the client to send audio chunks to the server. Stream short audio chunks (50-400ms) in binary frames of PCM 16-bit little-endian at 16kHz sample rate and single channel (mono).

A client event that initiates context cleanup:

* A unique identifier for the event.
* A constant string that identifies the type of event as "stt.state.clear".
* The ID of the context or session to be cleared.

A client event that initiates tracing:

* A unique identifier for the event.
* A constant string indicating the event type is "stt.input.trace".
* The ID used to correlate this trace event across systems.

### Server messages

The task that was performed, either `transcribe` or `translate`.

The language of the transcribed/translated text.

The transcribed/translated text.

Extracted words and their corresponding timestamps:

* The text content of the word.
* The language of the word.
* The probability of the word.
* The hallucination score of the word.
* Start time of the word in seconds. Appears only when timestamp\_granularities is set to `word,segment`.
* End time of the word in seconds. Appears only when timestamp\_granularities is set to `word,segment`.
* Indicates whether this word has been finalized.

Segments of the transcribed/translated text and their corresponding details:

* The ID of the segment.
* The text content of the segment.
* Extracted words in the segment.
* Start time of the segment in seconds. Appears only when timestamp\_granularities is set to `word,segment`.
* End time of the segment in seconds. Appears only when timestamp\_granularities is set to `word,segment`.

A server event confirming that the context was successfully cleared:

* A unique identifier for the event.
* A constant string indicating the event type is "stt.state.cleared".
* The ID of the context or session that has been successfully cleared.

A server event that completes tracing:

* A unique identifier for the event.
* A constant string indicating the event type is "stt.output.trace".
* The ID used to correlate this output trace with the corresponding input trace.

### Streaming Audio

Stream short audio chunks (50-400ms) in binary frames of PCM 16-bit little-endian at 16kHz sample rate and single channel (mono). Typically, you will:

1. Resample your audio to 16 kHz if it is not already.
2. Convert it to mono.
3. Send 50ms chunks (16,000 Hz \* 0.05s = 800 samples) of audio in 16-bit PCM (signed, little-endian) format.

A minimal sketch of this preparation follows.
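The sketch below uses `torchaudio` (already a dependency of the example later on this page). The input file name is a placeholder, and native little-endian byte order (the default on x86/ARM) is assumed:

```python theme={null}
import torch
import torchaudio

TARGET_RATE = 16000
CHUNK_MS = 50

# Load a local audio file (placeholder path).
waveform, sample_rate = torchaudio.load("input.wav")

# Steps 1 and 2: downmix to mono and resample to 16 kHz.
waveform = torchaudio.functional.resample(
    waveform.mean(dim=0, keepdim=True), sample_rate, TARGET_RATE
)

# Convert float samples in [-1.0, 1.0] to 16-bit signed PCM bytes.
pcm16 = (waveform.squeeze(0) * 32767.0).clamp(-32768, 32767).to(torch.int16)
audio_bytes = pcm16.numpy().tobytes()

# Step 3: split into 50 ms chunks: 16,000 Hz * 0.05 s = 800 samples = 1,600 bytes.
chunk_size_bytes = TARGET_RATE * CHUNK_MS // 1000 * 2
audio_chunk_bytes = [
    audio_bytes[i : i + chunk_size_bytes]
    for i in range(0, len(audio_bytes), chunk_size_bytes)
]
```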
### Handling Responses

The client maintains a state dictionary, starting with an empty dictionary `{}`. When the server sends the first transcription message, it contains a list of segments. Each segment has an `id` and `text`:

```python theme={null}
# Server initial message:
{
    "segments": [
        {"id": "0", "text": "This is the first sentence"},
        {"id": "1", "text": "This is the second sentence"}
    ]
}

# Client initial state:
{
    "0": "This is the first sentence",
    "1": "This is the second sentence",
}
```

When the server sends subsequent updates to the transcription, the client updates the state dictionary based on the segment `id`:

```python theme={null}
# Server continuous message:
{
    "segments": [
        {"id": "1", "text": "This is the second sentence modified"},
        {"id": "2", "text": "This is the third sentence"}
    ]
}

# Client updated state:
{
    "0": "This is the first sentence",
    "1": "This is the second sentence modified",  # overwritten
    "2": "This is the third sentence",  # new
}
```

### Handling Connection Interruptions & Timeouts

Real-time streaming transcription over WebSockets can run for a long time, and the longer a session runs, the more likely it is to experience interruptions, from network glitches to service hiccups. Build your client to recover gracefully so the stream keeps going without user impact. The following sections outline recommended practices for handling connection interruptions and timeouts.

#### When a connection drops

Although Fireworks is designed to keep streams running smoothly, occasional interruptions can still occur. If the WebSocket is disrupted (e.g., by bandwidth limitations or network failures), your application must initialize a new WebSocket connection, start a fresh streaming session, and begin sending audio as soon as the server confirms the connection is open.

#### Avoid losing audio during reconnects

While you're reconnecting, audio is often still being produced, and any audio not delivered to our API during this period is lost. To minimize the risk of dropping audio during a reconnect, store the audio data in a buffer until the client re-establishes the connection to our API, then send the buffered data for transcription.

### Keep timestamps continuous across sessions

When timestamps are enabled, the result includes the start and end time of each segment in seconds, and each new WebSocket session resets the timestamps to start from 00:00:00. To keep a continuous timeline, we recommend maintaining a running “stream start offset” in your app and adding that offset to timestamps from each new session so they align with the overall audio timeline. A minimal sketch of this reconnect-and-offset pattern is shown below.
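Here is a minimal, illustrative sketch using the `websocket-client` package (the same library as the example below). The class and its methods are hypothetical helpers, not part of any Fireworks SDK:

```python theme={null}
import collections
import websocket  # websocket-client, as in the example below

URL = "wss://audio-streaming.api.fireworks.ai/v1/audio/transcriptions/streaming"

class ReconnectingTranscriber:
    """Buffers audio across reconnects and keeps timestamps continuous."""

    def __init__(self, api_key):
        self.api_key = api_key
        self.buffer = collections.deque()  # audio chunks awaiting delivery
        self.stream_start_offset = 0.0     # seconds of audio sent in prior sessions
        self.session_audio_sent = 0.0      # seconds of audio sent in this session
        self.ws = None

    def connect(self):
        # A new session restarts timestamps at 00:00:00, so fold the audio
        # already sent into the running offset before (re)connecting.
        self.stream_start_offset += self.session_audio_sent
        self.session_audio_sent = 0.0
        self.ws = websocket.create_connection(
            URL, header={"Authorization": self.api_key}
        )

    def feed(self, chunk, chunk_ms):
        # Always enqueue first; chunks are only dropped once delivered.
        self.buffer.append((chunk, chunk_ms))
        while self.buffer:
            chunk, chunk_ms = self.buffer[0]
            try:
                self.ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
            except (websocket.WebSocketException, OSError):
                self.connect()  # reconnect; undelivered chunks stay buffered
                break
            self.buffer.popleft()
            self.session_audio_sent += chunk_ms / 1000.0

    def absolute_time(self, session_timestamp):
        # Map a per-session timestamp onto the overall audio timeline.
        return self.stream_start_offset + session_timestamp
```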
### Example Usage

Check out a brief Python example below or example sources:

* [Python notebook](https://colab.research.google.com/github/fw-ai/cookbook/blob/main/learn/audio/audio_streaming_speech_to_text/audio_streaming_speech_to_text.ipynb)
* [Python sources](https://github.com/fw-ai/cookbook/tree/main/learn/audio/audio_streaming_speech_to_text/python)
* [Node.js sources](https://github.com/fw-ai/cookbook/tree/main/learn/audio/audio_streaming_speech_to_text/nodejs)

```python theme={null}
!pip3 install requests torch torchaudio websocket-client

import io
import time
import json
import torch
import requests
import torchaudio
import threading
import websocket
import urllib.parse

# Download a sample audio file and prepare 16 kHz mono 16-bit PCM chunks
# (see the "Streaming Audio" section above).
audio_tensor, sample_rate = torchaudio.load(
    io.BytesIO(requests.get("https://tinyurl.com/4cb74vas").content)
)
audio_tensor = torchaudio.functional.resample(
    audio_tensor.mean(dim=0, keepdim=True), sample_rate, 16000
)
pcm16 = (audio_tensor.squeeze(0) * 32767.0).clamp(-32768, 32767).to(torch.int16)
audio_bytes = pcm16.numpy().tobytes()

chunk_size_ms = 50
chunk_size_bytes = 16000 * chunk_size_ms // 1000 * 2  # 800 samples * 2 bytes each
audio_chunk_bytes = [
    audio_bytes[i : i + chunk_size_bytes]
    for i in range(0, len(audio_bytes), chunk_size_bytes)
]

lock = threading.Lock()
state = {}

def on_open(ws):
    def send_audio_chunks():
        for chunk in audio_chunk_bytes:
            ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
            time.sleep(chunk_size_ms / 1000)
        final_checkpoint = json.dumps({"checkpoint_id": "final"})
        ws.send(final_checkpoint, opcode=websocket.ABNF.OPCODE_TEXT)

    threading.Thread(target=send_audio_chunks).start()

def on_message(ws, message):
    message = json.loads(message)
    if message.get("checkpoint_id") == "final":
        ws.close()
        return
    update = {s["id"]: s["text"] for s in message["segments"]}
    with lock:
        state.update(update)
        print("\n".join(f" - {k}: {v}" for k, v in state.items()))

def on_error(ws, error):
    print(f"WebSocket error: {error}")

# Open a connection URL with query params
url = "wss://audio-streaming.api.fireworks.ai/v1/audio/transcriptions/streaming"
params = urllib.parse.urlencode({
    "language": "en",
})

ws = websocket.WebSocketApp(
    f"{url}?{params}",
    header={"Authorization": "<API_KEY>"},
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
)
ws.run_forever()
```

### Dedicated endpoint

For fixed throughput and predictable SLAs, you may request a dedicated endpoint for streaming transcription at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) or on [Discord](https://discord.gg/fireworks-ai).
### Supported Languages The following languages are supported for transcription: | Language Code | Language Name | | ------------- | ------------------- | | en | English | | zh | Chinese | | de | German | | es | Spanish | | ru | Russian | | ko | Korean | | fr | French | | ja | Japanese | | pt | Portuguese | | tr | Turkish | | pl | Polish | | ca | Catalan | | nl | Dutch | | ar | Arabic | | sv | Swedish | | it | Italian | | id | Indonesian | | hi | Hindi | | fi | Finnish | | vi | Vietnamese | | he | Hebrew | | uk | Ukrainian | | el | Greek | | ms | Malay | | cs | Czech | | ro | Romanian | | da | Danish | | hu | Hungarian | | ta | Tamil | | no | Norwegian | | th | Thai | | ur | Urdu | | hr | Croatian | | bg | Bulgarian | | lt | Lithuanian | | la | Latin | | mi | Maori | | ml | Malayalam | | cy | Welsh | | sk | Slovak | | te | Telugu | | fa | Persian | | lv | Latvian | | bn | Bengali | | sr | Serbian | | az | Azerbaijani | | sl | Slovenian | | kn | Kannada | | et | Estonian | | mk | Macedonian | | br | Breton | | eu | Basque | | is | Icelandic | | hy | Armenian | | ne | Nepali | | mn | Mongolian | | bs | Bosnian | | kk | Kazakh | | sq | Albanian | | sw | Swahili | | gl | Galician | | mr | Marathi | | pa | Punjabi | | si | Sinhala | | km | Khmer | | sn | Shona | | yo | Yoruba | | so | Somali | | af | Afrikaans | | oc | Occitan | | ka | Georgian | | be | Belarusian | | tg | Tajik | | sd | Sindhi | | gu | Gujarati | | am | Amharic | | yi | Yiddish | | lo | Lao | | uz | Uzbek | | fo | Faroese | | ht | Haitian Creole | | ps | Pashto | | tk | Turkmen | | nn | Nynorsk | | mt | Maltese | | sa | Sanskrit | | lb | Luxembourgish | | my | Myanmar | | bo | Tibetan | | tl | Tagalog | | mg | Malagasy | | as | Assamese | | tt | Tatar | | haw | Hawaiian | | ln | Lingala | | ha | Hausa | | ba | Bashkir | | jw | Javanese | | su | Sundanese | | yue | Cantonese | | zh-hant | Traditional Chinese | | zh-hans | Simplified Chinese | # Transcribe audio Source: https://docs.fireworks.ai/api-reference/audio-transcriptions post /audio/transcriptions Send a sample audio to get a transcription. ### Headers Your Fireworks API key, e.g. `Authorization=API_KEY`. ### Request ##### (multi-part form) The input audio file to transcribe or an URL to the public audio file. Max audio file size is 1 GB, there is no limit for audio duration. Common file formats such as mp3, flac, and wav are supported. Note that the audio will be resampled to 16kHz, downmixed to mono, and reformatted to 16-bit signed little-endian format before transcription. Pre-converting the file before sending it to the API can improve runtime performance. String name of the ASR model to use. Can be one of `whisper-v3` or `whisper-v3-turbo`. Please use the following serverless endpoints: * [https://audio-prod.api.fireworks.ai](https://audio-prod.api.fireworks.ai) (for `whisper-v3`); * [https://audio-turbo.api.fireworks.ai](https://audio-turbo.api.fireworks.ai) (for `whisper-v3-turbo`); String name of the voice activity detection (VAD) model to use. Can be one of `silero`, or `whisperx-pyannet`. String name of the alignment model to use. Currently supported: * `mms_fa` optimal accuracy for multilingual speech. * `tdnn_ffn` optimal accuracy for English-only speech. The target language for transcription. See the [Supported Languages](#supported-languages) section below for a complete list of available languages. The input prompt that the model will use when generating the transcription. Can be used to specify custom words or specify the style of the transcription. 
E.g. `Um, here's, uh, what was recorded.` will make the model include the filler words in the transcription.

Sampling temperature to use when decoding text tokens during transcription. Alternatively, fallback decoding can be enabled by passing a list of temperatures like `0.0,0.2,0.4,0.6,0.8,1.0`. This can help to improve performance.

The format in which to return the response. Can be one of `json`, `text`, `srt`, `verbose_json`, or `vtt`.

The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported. Can be one of `word`, `segment`, or `word,segment`. If not present, defaults to `segment`.

Whether to get speaker diarization for the transcription. Can be `true` or `false`. If not present, defaults to `false`. Enabling diarization also requires other fields to hold specific values:

1. `response_format` must be set to `verbose_json`.
2. `timestamp_granularities` must include `word` to use diarization.

The minimum number of speakers to detect for diarization. `diarize` must be set to `true` to use `min_speakers`. If not present, defaults to `1`.

The maximum number of speakers to detect for diarization. `diarize` must be set to `true` to use `max_speakers`. If not present, defaults to `inf`.

Audio preprocessing mode. Currently supported:

* `none` to skip audio preprocessing.
* `dynamic` for arbitrary audio content with variable loudness.
* `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs.
* `bass_dynamic` for boosting lower frequencies.

### Response

The task which was performed. Either `transcribe` or `translate`.
The language of the transcribed/translated text.
The duration of the transcribed/translated audio, in seconds.
The transcribed/translated text.
Extracted words and their corresponding timestamps.
The text content of the word.
The language of the word.
The probability of the word.
The hallucination score of the word.
Start time of the word in seconds.
End time of the word in seconds.
Speaker label for the word.
Segments of the transcribed/translated text and their corresponding details.
The ID of the segment.
The text content of the segment.
Start time of the segment in seconds.
End time of the segment in seconds.
Speaker label for the segment.
Extracted words in the segment.

```curl curl theme={null}
# Download audio file
curl -L -o "audio.flac" "https://tinyurl.com/4997djsh"

# Make request
curl -X POST "https://audio-prod.api.fireworks.ai/v1/audio/transcriptions" \
  -H "Authorization: <API_KEY>" \
  -F "file=@audio.flac"
```

```python fireworks sdk theme={null}
!pip install fireworks-ai requests python-dotenv

from fireworks.client.audio import AudioInference
import requests
import os
from dotenv import load_dotenv
import time

# Create a .env file with your API key
load_dotenv()

# Download audio sample
audio = requests.get("https://tinyurl.com/4cb74vas").content

# Prepare client
client = AudioInference(
    model="whisper-v3",
    base_url="https://audio-prod.api.fireworks.ai",
    # Or for the turbo version
    # model="whisper-v3-turbo",
    # base_url="https://audio-turbo.api.fireworks.ai",
    api_key=os.getenv("FIREWORKS_API_KEY"),
)

# Make request
start = time.time()
r = await client.transcribe_async(audio=audio)
print(f"Took: {(time.time() - start):.3f}s. Text: '{r.text}'")
```

```python Python (openai sdk) theme={null}
!pip install openai requests python-dotenv

from openai import OpenAI
import os
import requests
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    base_url="https://audio-prod.api.fireworks.ai/v1",
    api_key=os.getenv("FIREWORKS_API_KEY")
)

audio_file = requests.get("https://tinyurl.com/4cb74vas").content

transcription = client.audio.transcriptions.create(
    model="whisper-v3",
    file=audio_file
)

print(transcription.text)
```

### Supported Languages

The following languages are supported for transcription:

| Language Code | Language Name |
| ------------- | ------------------- |
| en | English |
| zh | Chinese |
| de | German |
| es | Spanish |
| ru | Russian |
| ko | Korean |
| fr | French |
| ja | Japanese |
| pt | Portuguese |
| tr | Turkish |
| pl | Polish |
| ca | Catalan |
| nl | Dutch |
| ar | Arabic |
| sv | Swedish |
| it | Italian |
| id | Indonesian |
| hi | Hindi |
| fi | Finnish |
| vi | Vietnamese |
| he | Hebrew |
| uk | Ukrainian |
| el | Greek |
| ms | Malay |
| cs | Czech |
| ro | Romanian |
| da | Danish |
| hu | Hungarian |
| ta | Tamil |
| no | Norwegian |
| th | Thai |
| ur | Urdu |
| hr | Croatian |
| bg | Bulgarian |
| lt | Lithuanian |
| la | Latin |
| mi | Maori |
| ml | Malayalam |
| cy | Welsh |
| sk | Slovak |
| te | Telugu |
| fa | Persian |
| lv | Latvian |
| bn | Bengali |
| sr | Serbian |
| az | Azerbaijani |
| sl | Slovenian |
| kn | Kannada |
| et | Estonian |
| mk | Macedonian |
| br | Breton |
| eu | Basque |
| is | Icelandic |
| hy | Armenian |
| ne | Nepali |
| mn | Mongolian |
| bs | Bosnian |
| kk | Kazakh |
| sq | Albanian |
| sw | Swahili |
| gl | Galician |
| mr | Marathi |
| pa | Punjabi |
| si | Sinhala |
| km | Khmer |
| sn | Shona |
| yo | Yoruba |
| so | Somali |
| af | Afrikaans |
| oc | Occitan |
| ka | Georgian |
| be | Belarusian |
| tg | Tajik |
| sd | Sindhi |
| gu | Gujarati |
| am | Amharic |
| yi | Yiddish |
| lo | Lao |
| uz | Uzbek |
| fo | Faroese |
| ht | Haitian Creole |
| ps | Pashto |
| tk | Turkmen |
| nn | Nynorsk |
| mt | Maltese |
| sa | Sanskrit |
| lb | Luxembourgish |
| my | Myanmar |
| bo | Tibetan |
| tl | Tagalog |
| mg | Malagasy |
| as | Assamese |
| tt | Tatar |
| haw | Hawaiian |
| ln | Lingala |
| ha | Hausa |
| ba | Bashkir |
| jw | Javanese |
| su | Sundanese |
| yue | Cantonese |
| zh-hant | Traditional Chinese |
| zh-hans | Simplified Chinese |
# Translate audio
Source: https://docs.fireworks.ai/api-reference/audio-translations
post /audio/translations

### Headers

Your Fireworks API key, e.g. `Authorization=API_KEY`.

### Request

##### (multi-part form)

The input audio file to translate or a URL to the public audio file. Max audio file size is 1 GB; there is no limit on audio duration. Common file formats such as mp3, flac, and wav are supported. Note that the audio will be resampled to 16kHz, downmixed to mono, and reformatted to 16-bit signed little-endian format before transcription. Pre-converting the file before sending it to the API can improve runtime performance.

String name of the ASR model to use. Can be one of `whisper-v3` or `whisper-v3-turbo`. Please use the following serverless endpoints:

* [https://audio-prod.api.fireworks.ai](https://audio-prod.api.fireworks.ai) (for `whisper-v3`)
* [https://audio-turbo.api.fireworks.ai](https://audio-turbo.api.fireworks.ai) (for `whisper-v3-turbo`)

String name of the voice activity detection (VAD) model to use. Can be one of `silero` or `whisperx-pyannet`.

String name of the alignment model to use. Currently supported:

* `mms_fa` optimal accuracy for multilingual speech.
* `tdnn_ffn` optimal accuracy for English-only speech.

The source language for transcription. See the [Supported Languages](#supported-languages) section below for a complete list of available languages.

The input prompt that the model will use when generating the transcription. Can be used to specify custom words or specify the style of the transcription. E.g. `Um, here's, uh, what was recorded.` will make the model include the filler words in the transcription.

Sampling temperature to use when decoding text tokens during transcription. Alternatively, fallback decoding can be enabled by passing a list of temperatures like `0.0,0.2,0.4,0.6,0.8,1.0`. This can help to improve performance.

The format in which to return the response. Can be one of `json`, `text`, `srt`, `verbose_json`, or `vtt`.

The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported. Can be one of `word`, `segment`, or `word,segment`. If not present, defaults to `segment`.

Audio preprocessing mode. Currently supported:

* `none` to skip audio preprocessing.
* `dynamic` for arbitrary audio content with variable loudness.
* `soft_dynamic` for speech-intense recordings such as podcasts and voice-overs.
* `bass_dynamic` for boosting lower frequencies.

### Response

The task which was performed. Either `transcribe` or `translate`.
The language of the transcribed/translated text.
The duration of the transcribed/translated audio, in seconds.
The transcribed/translated text.
Extracted words and their corresponding timestamps.
The text content of the word.
Start time of the word in seconds.
End time of the word in seconds.
Segments of the transcribed/translated text and their corresponding details.

```curl curl theme={null}
# Download audio file
curl -L -o "audio.flac" "https://tinyurl.com/4997djsh"

# Make request
curl -X POST "https://audio-prod.api.fireworks.ai/v1/audio/translations" \
  -H "Authorization: <API_KEY>" \
  -F "file=@audio.flac"
```

```python Python (fireworks sdk) theme={null}
!pip install fireworks-ai requests

from fireworks.client.audio import AudioInference
import requests
import time
from dotenv import load_dotenv
import os

load_dotenv()

# Download audio sample
audio = requests.get("https://tinyurl.com/3cy7x44v").content

# Prepare client
client = AudioInference(
    model="whisper-v3",
    base_url="https://audio-prod.api.fireworks.ai",
    # Or for the turbo version
    # model="whisper-v3-turbo",
    # base_url="https://audio-turbo.api.fireworks.ai",
    api_key=os.getenv("FIREWORKS_API_KEY")
)

# Make request
start = time.time()
r = await client.translate_async(audio=audio)
print(f"Took: {(time.time() - start):.3f}s. Text: '{r.text}'")
```

```python Python (openai sdk) theme={null}
!pip install openai requests

from openai import OpenAI
import requests
from dotenv import load_dotenv
import os

load_dotenv()

client = OpenAI(
    base_url="https://audio-prod.api.fireworks.ai/v1",
    api_key=os.getenv("FIREWORKS_API_KEY"),
)

audio_file = requests.get("https://tinyurl.com/3cy7x44v").content

translation = client.audio.translations.create(
    model="whisper-v3",
    file=audio_file,
)

print(translation.text)
```

### Supported Languages

Translation is from one of the supported languages into English. The following languages are supported for translation:

| Language Code | Language Name |
| ------------- | -------------- |
| en | English |
| zh | Chinese |
| de | German |
| es | Spanish |
| ru | Russian |
| ko | Korean |
| fr | French |
| ja | Japanese |
| pt | Portuguese |
| tr | Turkish |
| pl | Polish |
| ca | Catalan |
| nl | Dutch |
| ar | Arabic |
| sv | Swedish |
| it | Italian |
| id | Indonesian |
| hi | Hindi |
| fi | Finnish |
| vi | Vietnamese |
| he | Hebrew |
| uk | Ukrainian |
| el | Greek |
| ms | Malay |
| cs | Czech |
| ro | Romanian |
| da | Danish |
| hu | Hungarian |
| ta | Tamil |
| no | Norwegian |
| th | Thai |
| ur | Urdu |
| hr | Croatian |
| bg | Bulgarian |
| lt | Lithuanian |
| la | Latin |
| mi | Maori |
| ml | Malayalam |
| cy | Welsh |
| sk | Slovak |
| te | Telugu |
| fa | Persian |
| lv | Latvian |
| bn | Bengali |
| sr | Serbian |
| az | Azerbaijani |
| sl | Slovenian |
| kn | Kannada |
| et | Estonian |
| mk | Macedonian |
| br | Breton |
| eu | Basque |
| is | Icelandic |
| hy | Armenian |
| ne | Nepali |
| mn | Mongolian |
| bs | Bosnian |
| kk | Kazakh |
| sq | Albanian |
| sw | Swahili |
| gl | Galician |
| mr | Marathi |
| pa | Punjabi |
| si | Sinhala |
| km | Khmer |
| sn | Shona |
| yo | Yoruba |
| so | Somali |
| af | Afrikaans |
| oc | Occitan |
| ka | Georgian |
| be | Belarusian |
| tg | Tajik |
| sd | Sindhi |
| gu | Gujarati |
| am | Amharic |
| yi | Yiddish |
| lo | Lao |
| uz | Uzbek |
| fo | Faroese |
| ht | Haitian Creole |
| ps | Pashto |
| tk | Turkmen |
| nn | Nynorsk |
| mt | Maltese |
| sa | Sanskrit |
| lb | Luxembourgish |
| my | Myanmar |
| bo | Tibetan |
| tl | Tagalog |
| mg | Malagasy |
| as | Assamese |
| tt | Tatar |
| haw | Hawaiian |
| ln | Lingala |
| ha | Hausa |
| ba | Bashkir |
| jw | Javanese |
| su | Sundanese |
| yue | Cantonese |

# Cancel Reinforcement Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/cancel-reinforcement-fine-tuning-job
post /v1/accounts/{account_id}/reinforcementFineTuningJobs/{reinforcement_fine_tuning_job_id}:cancel

# Create API Key
Source: https://docs.fireworks.ai/api-reference/create-api-key
post /v1/accounts/{account_id}/users/{user_id}/apiKeys

# Create Batch Inference Job
Source: https://docs.fireworks.ai/api-reference/create-batch-inference-job
post /v1/accounts/{account_id}/batchInferenceJobs

# Create Batch Request
Source: https://docs.fireworks.ai/api-reference/create-batch-request
post /{path}?endpoint_id={endpoint_id}

Create a batch request for our audio transcription service

### Headers

Your Fireworks API key, e.g. `Authorization=FIREWORKS_API_KEY`. Alternatively, can be provided as a query param.

### Path Parameters

The relative route of the target API operation (e.g. `"v1/audio/transcriptions"`, `"v1/audio/translations"`). This should correspond to a valid route supported by the backend service.

### Query Parameters

Identifies the target backend service or model to handle the request.
Currently supported: * `audio-prod`: [https://audio-prod.api.fireworks.ai](https://audio-prod.api.fireworks.ai) * `audio-turbo`: [https://audio-turbo.api.fireworks.ai](https://audio-turbo.api.fireworks.ai) ### Body Request body fields vary depending on the selected `endpoint_id` and `path`. The request body must conform to the schema defined by the corresponding synchronous API.\ For example, transcription requests typically accept fields such as `model`, `diarize`, and `response_format`.\ Refer to the relevant synchronous API for required fields: * [Transcribe audio](https://docs.fireworks.ai/api-reference/audio-transcriptions) * [Translate audio](https://docs.fireworks.ai/api-reference/audio-translations) ### Response The status of the batch request submission.\ A value of `"submitted"` indicates the batch request was accepted and queued for processing. A unique identifier assigned to the batch job. This ID can be used to check job status or retrieve results later. The unique identifier of the account associated with the batch job. The backend service selected to process the request.\ This typically matches the `endpoint_id` used during submission. A human-readable message describing the result of the submission.\ Typically `"Request submitted successfully"` if accepted. ```curl curl theme={null} # Download audio file curl -L -o "audio.flac" "https://tinyurl.com/4997djsh" # Make request curl -X POST "https://audio-batch.api.fireworks.ai/v1/audio/transcriptions?endpoint_id=audio-prod" \ -H "Authorization: " \ -F "file=@audio.flac" ``` ```python python theme={null} !pip install requests import os import requests # input API key and download audio api_key = "" audio = requests.get("https://tinyurl.com/4cb74vas").content # Prepare request data url = "https://audio-batch.api.fireworks.ai/v1/audio/transcriptions?endpoint_id=audio-prod" headers = {"Authorization": api_key} payload = { "model": "whisper-v3", "response_format": "json" } files = {"file": ("audio.flac", audio, "audio/flac")} # Send request response = requests.post(url, headers=headers, data=payload, files=files) print(response.text) ``` To check the status of your batch request, use the [Check Batch Status](https://docs.fireworks.ai/api-reference/get-batch-status) endpoint with the returned `batch_id`. # Create Dataset Source: https://docs.fireworks.ai/api-reference/create-dataset post /v1/accounts/{account_id}/datasets # Load LoRA Source: https://docs.fireworks.ai/api-reference/create-deployed-model post /v1/accounts/{account_id}/deployedModels # Create Deployment Source: https://docs.fireworks.ai/api-reference/create-deployment post /v1/accounts/{account_id}/deployments # null Source: https://docs.fireworks.ai/api-reference/create-dpo-job post /v1/accounts/{account_id}/dpoJobs # Create Evaluation Job Source: https://docs.fireworks.ai/api-reference/create-evaluation-job post /v1/accounts/{account_id}/evaluationJobs # Create Evaluator Source: https://docs.fireworks.ai/api-reference/create-evaluator post /v1/accounts/{account_id}/evaluatorsV2 Creates a custom evaluator for scoring model outputs. Evaluators use the [Eval Protocol](https://evalprotocol.io) to define test cases, run model inference, and score responses. They are used with evaluation jobs and Reinforcement Fine-Tuning (RFT). 
## Source Code Requirements Your project should contain: - `requirements.txt` - Python dependencies for your evaluator - `test_*.py` - Pytest test file(s) with [`@evaluation_test`](https://evalprotocol.io/reference/evaluation-test) decorated functions - Any additional code/modules your evaluator needs ## Workflow **Recommended:** Use the [`ep upload`](https://evalprotocol.io/reference/cli#ep-upload) CLI command to handle all these steps automatically. If using the API directly: 1. Call this endpoint to create the evaluator resource 2. Package your source directory as a `.tar.gz` (respecting `.gitignore`) 3. Call [Get Evaluator Upload Endpoint](/api-reference/get-evaluator-upload-endpoint) to get a signed upload URL 4. `PUT` the tar.gz file to the signed URL 5. Call [Validate Evaluator Upload](/api-reference/validate-evaluator-upload) to trigger server-side validation 6. Poll [Get Evaluator](/api-reference/get-evaluator) until ready Once active, reference the evaluator in [Create Evaluation Job](/api-reference/create-evaluation-job) or [Create Reinforcement Fine-tuning Job](/api-reference/create-reinforcement-fine-tuning-job). # Create Model Source: https://docs.fireworks.ai/api-reference/create-model post /v1/accounts/{account_id}/models # Create Reinforcement Fine-tuning Job Source: https://docs.fireworks.ai/api-reference/create-reinforcement-fine-tuning-job post /v1/accounts/{account_id}/reinforcementFineTuningJobs # Create Reinforcement Fine-tuning Step Source: https://docs.fireworks.ai/api-reference/create-reinforcement-fine-tuning-step post /v1/accounts/{account_id}/rlorTrainerJobs # null Source: https://docs.fireworks.ai/api-reference/create-secret post /v1/accounts/{account_id}/secrets # Create Supervised Fine-tuning Job Source: https://docs.fireworks.ai/api-reference/create-supervised-fine-tuning-job post /v1/accounts/{account_id}/supervisedFineTuningJobs # Create User Source: https://docs.fireworks.ai/api-reference/create-user post /v1/accounts/{account_id}/users # Create embeddings Source: https://docs.fireworks.ai/api-reference/creates-an-embedding-vector-representing-the-input-text post /embeddings # Delete API Key Source: https://docs.fireworks.ai/api-reference/delete-api-key post /v1/accounts/{account_id}/users/{user_id}/apiKeys:delete # Delete Batch Inference Job Source: https://docs.fireworks.ai/api-reference/delete-batch-inference-job delete /v1/accounts/{account_id}/batchInferenceJobs/{batch_inference_job_id} # Delete Dataset Source: https://docs.fireworks.ai/api-reference/delete-dataset delete /v1/accounts/{account_id}/datasets/{dataset_id} # Unload LoRA Source: https://docs.fireworks.ai/api-reference/delete-deployed-model delete /v1/accounts/{account_id}/deployedModels/{deployed_model_id} # Delete Deployment Source: https://docs.fireworks.ai/api-reference/delete-deployment delete /v1/accounts/{account_id}/deployments/{deployment_id} # null Source: https://docs.fireworks.ai/api-reference/delete-dpo-job delete /v1/accounts/{account_id}/dpoJobs/{dpo_job_id} # Delete Evaluation Job Source: https://docs.fireworks.ai/api-reference/delete-evaluation-job delete /v1/accounts/{account_id}/evaluationJobs/{evaluation_job_id} # Delete Evaluator Source: https://docs.fireworks.ai/api-reference/delete-evaluator delete /v1/accounts/{account_id}/evaluators/{evaluator_id} Deletes an evaluator and its associated versions and build artifacts. 
# Delete Model Source: https://docs.fireworks.ai/api-reference/delete-model delete /v1/accounts/{account_id}/models/{model_id} # Delete Reinforcement Fine-tuning Job Source: https://docs.fireworks.ai/api-reference/delete-reinforcement-fine-tuning-job delete /v1/accounts/{account_id}/reinforcementFineTuningJobs/{reinforcement_fine_tuning_job_id} # Delete Reinforcement Fine-tuning Step Source: https://docs.fireworks.ai/api-reference/delete-reinforcement-fine-tuning-step delete /v1/accounts/{account_id}/rlorTrainerJobs/{rlor_trainer_job_id} # Delete Response Source: https://docs.fireworks.ai/api-reference/delete-response delete /v1/responses/{response_id} Deletes a model response by its ID. Once deleted, the response data will be gone immediately and permanently. The response cannot be recovered and any conversations that reference this response ID will no longer be able to access it. # null Source: https://docs.fireworks.ai/api-reference/delete-secret delete /v1/accounts/{account_id}/secrets/{secret_id} # Delete Supervised Fine-tuning Job Source: https://docs.fireworks.ai/api-reference/delete-supervised-fine-tuning-job delete /v1/accounts/{account_id}/supervisedFineTuningJobs/{supervised_fine_tuning_job_id} # Execute one training step for keep-alive Reinforcement Fine-tuning Step Source: https://docs.fireworks.ai/api-reference/execute-reinforcement-fine-tuning-step post /v1/accounts/{account_id}/rlorTrainerJobs/{rlor_trainer_job_id}:executeTrainStep # Generate an image with FLUX.1 [schnell] FP8 Source: https://docs.fireworks.ai/api-reference/generate-a-new-image-from-a-text-prompt POST https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/flux-1-schnell-fp8/text_to_image [FLUX.1 \[schnell\]](https://huggingface.co/fireworks-ai/FLUX.1-schnell-fp8-flumina) is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. The FP8 version uses reduced precision numerics for 2x faster inference. See our [Playground](https://app.fireworks.ai/playground?model=accounts/fireworks/models/flux-1-schnell-fp8) to quickly try it out in your browser. ## Headers Specifies which format to return the response in. With `image/png` and `image/jpeg`, the server will populate the response body with a binary image of the specified format. The media type of the request body. The Bearer with Fireworks API Key. ## Request Body Prompt to use for the image generation process. Aspect ratio of the generated image. **Options:** `1:1`, `21:9`, `16:9`, `3:2`, `5:4`, `4:5`, `2:3`, `9:16`, `9:21`, `4:3`, `3:4` Classifier-free guidance scale for the image diffusion process. Default value is 3.5. Number of denoising steps for the image generation process. Default value is 4. Random seed to use for the image generation process. If 0, we will use a totally random seed. 
```python Python theme={null} import requests url = "https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/flux-1-schnell-fp8/text_to_image" headers = { "Content-Type": "application/json", "Accept": "image/jpeg", "Authorization": "Bearer $API_KEY", } data = { "prompt": "A beautiful sunset over the ocean" } response = requests.post(url, headers=headers, json=data) if response.status_code == 200: with open("a.jpg", "wb") as f: f.write(response.content) print("Image saved as a.jpg") else: print("Error:", response.status_code, response.text) ``` ```typescript TypeScript theme={null} import fs from "fs"; import fetch from "node-fetch"; (async () => { const response = await fetch("https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/flux-1-schnell-fp8/text_to_image", { method: "POST", headers: { "Content-Type": "application/json", "Accept": "image/jpeg", "Authorization": "Bearer $API_KEY" }, body: JSON.stringify({ prompt: "A beautiful sunset over the ocean" }), }); // To process the response and get the image: const buffer = await response.arrayBuffer(); fs.writeFile('a.jpg', Buffer.from(buffer), () => console.log('Finished downloading!')); })().catch(console.error); ``` ```shell curl theme={null} curl --request POST \ -S --fail-with-body \ --url https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/flux-1-schnell-fp8/text_to_image \ -H 'Content-Type: application/json' \ -H 'Accept: image/jpeg' \ -H "Authorization: Bearer $API_KEY" \ --data ' { "prompt": "A beautiful sunset over the ocean" }' -o a.jpg ``` ```json Accept: application/json theme={null} { "id": "1234567890", "base64": ["data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...", "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA..."], "finishReason": "SUCCESS", "seed": 1234567890 } ``` ```txt Accept: image/jpeg theme={null} /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBQYFBAYGBQYHBwYIChAKCgkJChQODwwQFxQYGBcUFhYaHSUfGhsjHBYWICwgIyYnKSopGR8tMC0oMCUoKSj/2wBDAQcHBwoIChMKChMoGhYaKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCj/wAARCAABAAEDASIAAhEBAxEB/8QAFQABAQAAAAAAAAAAAAAAAAAAAAv/xAAUEAEAAAAAAAAAAAAAAAAAAAAA/8QAFQEBAQAAAAAAAAAAAAAAAAAAAAX/xAAUEQEAAAAAAAAAAAAAAAAAAAAA/9oADAMBAAIRAxEAPwCdABmX/9k= ``` ```txt Accept: image/png theme={null} iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNkYPhfDwAChwGA60e6kgAAAABJRU5ErkJggg== ``` ## Response The unique identifier for the image generation request. Includes a base64-encoded string containing an image in PNG format. To retrieve the image, base64-decode the string into binary data, then load that binary data as a PNG file. Can be `SUCCESS` or `CONTENT_FILTERED`. Specifies the outcome of the image generation process. It could be `SUCCESS` indicating that the image was successfully generated, or `CONTENT_FILTERED` if the image was filtered due to the safety\_check=true parameter being set. The seed used for the image generation process. When the Accept type is `image/jpeg`, the response body will contain a binary image. Additionally, the response will include headers such as: **Content-Length:** Represents the length of the binary image content. **Seed:** The random seed used to generate the image. **Finish-Reason:** Indicates the outcome of the image generation, such as `CONTENT_FILTERED` or `SUCCESS`. When the Accept type is `image/png`, the response body will contain a binary image. Additionally, the response will include headers such as: **Content-Length:** Represents the length of the binary image content. 
**Seed:** The random seed used to generate the image. **Finish-Reason:** Indicates the outcome of the image generation, such as `CONTENT_FILTERED` or `SUCCESS`. # Generate or edit an image with FLUX.1 Kontext Source: https://docs.fireworks.ai/api-reference/generate-or-edit-image-using-flux-kontext POST https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model} 💡 Note that this API is async and will return the **request\_id** instead of the image. Call the [get\_result](/api-reference/get-generated-image-from-flux-kontex) API to obtain the generated image. FLUX Kontext Pro is a specialized model for generating contextually-aware images from text descriptions. Designed for professional use cases requiring high-quality, consistent image generation. Use our [Playground](https://app.fireworks.ai/playground?model=accounts/fireworks/models/flux-kontext-pro) to quickly try it out in your browser. FLUX Kontext Max is the most advanced model in the Kontext series, offering maximum quality and context understanding. Ideal for enterprise applications requiring the highest level of image generation performance. Use our [Playground](https://app.fireworks.ai/playground?model=accounts/fireworks/models/flux-kontext-max) to quickly try it out in your browser. ## Path The model to use for image generation. Use **flux-kontext-pro** or **flux-kontext-max** as the model name in the API. ## Headers The media type of the request body. Your Fireworks API key. ## Request Body Prompt to use for the image generation process. Base64 encoded image or URL to use with Kontext. Optional seed for reproducibility. Aspect ratio of the image between 21:9 and 9:21. Output format for the generated image. Can be 'jpeg' or 'png'. **Options:** `jpeg`, `png` URL to receive webhook notifications. **Length:** 1-2083 characters Optional secret for webhook signature verification. Whether to perform upsampling on the prompt. If active, automatically modifies the prompt for more creative generation. Tolerance level for input and output moderation. Between 0 and 6, 0 being most strict, 6 being least strict. Limit of 2 for Image to Image. 
**Range:** 0-6 ```python Python theme={null} import requests url = "https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}" headers = { "Content-Type": "application/json", "Authorization": "Bearer $API_KEY", } data = { "prompt": "A beautiful sunset over the ocean", "input_image": "", "seed": 42, "aspect_ratio": "", "output_format": "jpeg", "webhook_url": "", "webhook_secret": "", "prompt_upsampling": False, "safety_tolerance": 2 } response = requests.post(url, headers=headers, json=data) ``` ```typescript TypeScript theme={null} import fs from "fs"; import fetch from "node-fetch"; (async () => { const response = await fetch("https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}", { method: "POST", headers: { "Content-Type": "application/json", "Authorization": "Bearer $API_KEY" }, body: JSON.stringify({ prompt: "A beautiful sunset over the ocean" }), }); })().catch(console.error); ``` ```shell curl theme={null} curl --request POST \ -S --fail-with-body \ --url https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model} \ -H 'Content-Type: application/json' \ -H "Authorization: Bearer $API_KEY" \ --data ' { "prompt": "A beautiful sunset over the ocean" }' ``` ## Response Successful Response request id Unsuccessful Response error message # Get Account Source: https://docs.fireworks.ai/api-reference/get-account get /v1/accounts/{account_id} # Get Batch Inference Job Source: https://docs.fireworks.ai/api-reference/get-batch-inference-job get /v1/accounts/{account_id}/batchInferenceJobs/{batch_inference_job_id} # Check Batch Status Source: https://docs.fireworks.ai/api-reference/get-batch-status get /v1/accounts/{account_id}/batch_job/{batch_id} This endpoint allows you to check the current status of a previously submitted batch request, and retrieve the final result if available. Check status of your batch request ### Headers Your Fireworks API key. e.g. `Authorization=FIREWORKS_API_KEY`. Alternatively, can be provided as a query param. ### Path Parameters The identifier of your Fireworks account. Must match the account used when the batch request was submitted. The unique identifier of the batch job to check.\ This should match the `batch_id` returned when the batch request was originally submitted. ### Response The response includes the status of the batch job and, if completed, the final result. The status of the batch job at the time of the request.\ Possible values include `"completed"` and `"processing"`. The unique identifier of the batch job whose status is being retrieved.\ This ID matches the one provided in the original request. A human-readable message describing the current state of the batch job.\ This field is typically `null` when the job has completed successfully. The original content type of the response body.\ This value can be used to determine how to parse the string in the `body` field. The serialized result of the batch job, this field is only present when `status` is `"completed"`.\ The format of this string depends on the `content_type` field and may vary across endpoints.\ Clients should use `content_type` to determine how to parse or interpret the value. 
```curl curl theme={null} # Make request curl -X GET "https://audio-batch.api.fireworks.ai/v1/accounts/{account_id}/batch_job/{batch_id}" \ -H "Authorization: " ``` ```python python theme={null} !pip install requests import os import requests # Input api key and path parameters api_key = "" account_id = "" batch_id = "" # Send request url = f"https://audio-batch.api.fireworks.ai/v1/accounts/{account_id}/batch_job/{batch_id}" headers = {"Authorization": api_key} response = requests.get(url, headers=headers) print(response.text) ``` # Get Dataset Source: https://docs.fireworks.ai/api-reference/get-dataset get /v1/accounts/{account_id}/datasets/{dataset_id} # Get Dataset Download Endpoint Source: https://docs.fireworks.ai/api-reference/get-dataset-download-endpoint get /v1/accounts/{account_id}/datasets/{dataset_id}:getDownloadEndpoint # Get Dataset Upload Endpoint Source: https://docs.fireworks.ai/api-reference/get-dataset-upload-endpoint post /v1/accounts/{account_id}/datasets/{dataset_id}:getUploadEndpoint # Get LoRA Source: https://docs.fireworks.ai/api-reference/get-deployed-model get /v1/accounts/{account_id}/deployedModels/{deployed_model_id} # Get Deployment Source: https://docs.fireworks.ai/api-reference/get-deployment get /v1/accounts/{account_id}/deployments/{deployment_id} # Get Deployment Shape Source: https://docs.fireworks.ai/api-reference/get-deployment-shape get /v1/accounts/{account_id}/deploymentShapes/{deployment_shape_id} # Get Deployment Shape Version Source: https://docs.fireworks.ai/api-reference/get-deployment-shape-version get /v1/accounts/{account_id}/deploymentShapes/{deployment_shape_id}/versions/{version_id} # null Source: https://docs.fireworks.ai/api-reference/get-dpo-job get /v1/accounts/{account_id}/dpoJobs/{dpo_job_id} # null Source: https://docs.fireworks.ai/api-reference/get-dpo-job-metrics-file-endpoint get /v1/accounts/{account_id}/dpoJobs/{dpo_job_id}:getMetricsFileEndpoint # Get Evaluation Job Source: https://docs.fireworks.ai/api-reference/get-evaluation-job get /v1/accounts/{account_id}/evaluationJobs/{evaluation_job_id} # Get Evaluation Job execution logs (stream log endpoint + tracing IDs). Source: https://docs.fireworks.ai/api-reference/get-evaluation-job-log-endpoint get /v1/accounts/{account_id}/evaluationJobs/{evaluation_job_id}:getExecutionLogEndpoint # Get Evaluator Source: https://docs.fireworks.ai/api-reference/get-evaluator get /v1/accounts/{account_id}/evaluators/{evaluator_id} Retrieves an evaluator by name. Use this to monitor build progress after creation (**step 6** in the [Create Evaluator](/api-reference/create-evaluator) workflow). Possible states: - `BUILDING` - Environment is being prepared - `ACTIVE` - Evaluator is ready to use - `BUILD_FAILED` - Check build logs via [Get Evaluator Build Log Endpoint](/api-reference/get-evaluator-build-log-endpoint) # Get Evaluator Build Log Endpoint Source: https://docs.fireworks.ai/api-reference/get-evaluator-build-log-endpoint get /v1/accounts/{account_id}/evaluators/{evaluator_id}:getBuildLogEndpoint Returns a signed URL to download the evaluator's build logs. Useful for debugging `BUILD_FAILED` state. # Get Evaluator Source Code Endpoint Source: https://docs.fireworks.ai/api-reference/get-evaluator-source-code-endpoint get /v1/accounts/{account_id}/evaluators/{evaluator_id}:getSourceCodeSignedUrl Returns a signed URL to download the evaluator's source code archive. Useful for debugging or reviewing the uploaded code. 
# Get Evaluator Upload Endpoint
Source: https://docs.fireworks.ai/api-reference/get-evaluator-upload-endpoint
post /v1/accounts/{account_id}/evaluators/{evaluator_id}:getUploadEndpoint

Returns signed URLs for uploading evaluator source code (**step 3** in the [Create Evaluator](/api-reference/create-evaluator) workflow). After receiving the signed URL, upload your `.tar.gz` archive using HTTP `PUT` with the `Content-Type: application/octet-stream` header.

# Get generated image from FLUX.1 Kontext
Source: https://docs.fireworks.ai/api-reference/get-generated-image-from-flux-kontex
GET https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}/get_result

Replace **model** with **flux-kontext-pro** in the API to get the result.
Replace **model** with **flux-kontext-max** in the API to get the result.

## Path

The model to use for image generation. Use **flux-kontext-pro** or **flux-kontext-max** as the model name in the API.

## Headers

The media type of the request body.
Your Fireworks API key.

## Request Body

Request id generated from the create/edit image request.

```python Python theme={null}
import requests

url = "https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}/get_result"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer $API_KEY",
}
data = {
    "id": "request_id"
}

response = requests.post(url, headers=headers, json=data)
print(response.text)
```

```typescript TypeScript theme={null}
import fs from "fs";
import fetch from "node-fetch";

(async () => {
  const response = await fetch("https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}/get_result", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer $API_KEY"
    },
    body: JSON.stringify({ id: "request_id" }),
  });
})().catch(console.error);
```

```shell curl theme={null}
curl --request POST \
  -S --fail-with-body \
  --url https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}/get_result \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $API_KEY" \
  --data '{
    "id": "request_id"
  }'
```

## Response

Task id for retrieving result
Available options: Task not found, Pending, Request Moderated, Content Moderated, Ready, Error

# Get Model
Source: https://docs.fireworks.ai/api-reference/get-model
get /v1/accounts/{account_id}/models/{model_id}

# Get Model Download Endpoint
Source: https://docs.fireworks.ai/api-reference/get-model-download-endpoint
get /v1/accounts/{account_id}/models/{model_id}:getDownloadEndpoint

# Get Model Upload Endpoint
Source: https://docs.fireworks.ai/api-reference/get-model-upload-endpoint
post /v1/accounts/{account_id}/models/{model_id}:getUploadEndpoint

# Get Reinforcement Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/get-reinforcement-fine-tuning-job
get /v1/accounts/{account_id}/reinforcementFineTuningJobs/{reinforcement_fine_tuning_job_id}

# Get Reinforcement Fine-tuning Step
Source: https://docs.fireworks.ai/api-reference/get-reinforcement-fine-tuning-step
get /v1/accounts/{account_id}/rlorTrainerJobs/{rlor_trainer_job_id}

# Get Response
Source: https://docs.fireworks.ai/api-reference/get-response
get /v1/responses/{response_id}

# Get Secret
Source: https://docs.fireworks.ai/api-reference/get-secret
get /v1/accounts/{account_id}/secrets/{secret_id}

Retrieves a secret by name. Note that the `value` field is not returned in the response for security reasons. Only the `name` and `key_name` fields are included.
# Get Supervised Fine-tuning Job Source: https://docs.fireworks.ai/api-reference/get-supervised-fine-tuning-job get /v1/accounts/{account_id}/supervisedFineTuningJobs/{supervised_fine_tuning_job_id} # Get User Source: https://docs.fireworks.ai/api-reference/get-user get /v1/accounts/{account_id}/users/{user_id} # Introduction Source: https://docs.fireworks.ai/api-reference/introduction The Fireworks AI REST API enables you to interact with various language, image, and embedding models using an API key. It also lets you automate management of models, deployments, datasets, and more. ## Authentication All requests made to the Fireworks AI REST API must include an `Authorization` header with a valid `Bearer` token using your API key, along with the `Content-Type: application/json` header. ### Getting your API key You can obtain an API key by: * Using the [`firectl create api-key`](/tools-sdks/firectl/commands/create-api-key) command * Generating one through the [Fireworks AI dashboard](https://app.fireworks.ai/settings/users/api-keys) ### Request headers Include the following headers in your REST API requests:

```json theme={null}
authorization: Bearer <API_KEY>
content-type: application/json
```

# List Accounts Source: https://docs.fireworks.ai/api-reference/list-accounts get /v1/accounts # List API Keys Source: https://docs.fireworks.ai/api-reference/list-api-keys get /v1/accounts/{account_id}/users/{user_id}/apiKeys # List Batch Inference Jobs Source: https://docs.fireworks.ai/api-reference/list-batch-inference-jobs get /v1/accounts/{account_id}/batchInferenceJobs # List Datasets Source: https://docs.fireworks.ai/api-reference/list-datasets get /v1/accounts/{account_id}/datasets # List LoRAs Source: https://docs.fireworks.ai/api-reference/list-deployed-models get /v1/accounts/{account_id}/deployedModels # List Deployment Shape Versions Source: https://docs.fireworks.ai/api-reference/list-deployment-shape-versions get /v1/accounts/{account_id}/deploymentShapes/{deployment_shape_id}/versions # List Deployment Shapes Source: https://docs.fireworks.ai/api-reference/list-deployment-shapes get /v1/accounts/{account_id}/deploymentShapes # List Deployments Source: https://docs.fireworks.ai/api-reference/list-deployments get /v1/accounts/{account_id}/deployments # List DPO Jobs Source: https://docs.fireworks.ai/api-reference/list-dpo-jobs get /v1/accounts/{account_id}/dpoJobs # List Evaluation Jobs Source: https://docs.fireworks.ai/api-reference/list-evaluation-jobs get /v1/accounts/{account_id}/evaluationJobs # List Evaluators Source: https://docs.fireworks.ai/api-reference/list-evaluators get /v1/accounts/{account_id}/evaluators Lists all evaluators for an account with pagination support. # List Models Source: https://docs.fireworks.ai/api-reference/list-models get /v1/accounts/{account_id}/models # List Reinforcement Fine-tuning Jobs Source: https://docs.fireworks.ai/api-reference/list-reinforcement-fine-tuning-jobs get /v1/accounts/{account_id}/reinforcementFineTuningJobs # List Reinforcement Fine-tuning Steps Source: https://docs.fireworks.ai/api-reference/list-reinforcement-fine-tuning-steps get /v1/accounts/{account_id}/rlorTrainerJobs # List Responses Source: https://docs.fireworks.ai/api-reference/list-responses get /v1/responses Get a list of all responses for the authenticated account.
Query parameters: * `limit`: Maximum number of responses to return (default: 20, max: 100) * `after`: Cursor for pagination - return responses after this ID * `before`: Cursor for pagination - return responses before this ID # List Secrets Source: https://docs.fireworks.ai/api-reference/list-secrets get /v1/accounts/{account_id}/secrets Lists all secrets for an account. Note that the `value` field is not returned in the response for security reasons. Only the `name` and `key_name` fields are included for each secret. # List Supervised Fine-tuning Jobs Source: https://docs.fireworks.ai/api-reference/list-supervised-fine-tuning-jobs get /v1/accounts/{account_id}/supervisedFineTuningJobs # List Users Source: https://docs.fireworks.ai/api-reference/list-users get /v1/accounts/{account_id}/users # Create Chat Completion Source: https://docs.fireworks.ai/api-reference/post-chatcompletions post /v1/chat/completions Create a chat completion for the provided messages and parameters. # Create Completion Source: https://docs.fireworks.ai/api-reference/post-completions post /v1/completions Create a completion for the provided prompt and parameters. # Create Response Source: https://docs.fireworks.ai/api-reference/post-responses post /v1/responses Creates a model response, optionally interacting with custom tools via the Model Context Protocol (MCP). This endpoint supports conversational continuation and streaming. Explore our cookbooks for detailed examples: - [Basic MCP Usage](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/fireworks_mcp_examples.ipynb) - [Streaming with MCP](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/fireworks_mcp_with_streaming.ipynb) - [Conversational History with `previous_response_id`](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/fireworks_previous_response_cookbook.ipynb) - [Basic Streaming](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/fireworks_streaming_example.ipynb) - [Controlling Response Storage](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/mcp_server_with_store_false_argument.ipynb) # Prepare Model for different precisions Source: https://docs.fireworks.ai/api-reference/prepare-model post /v1/accounts/{account_id}/models/{model_id}:prepare # Rerank documents Source: https://docs.fireworks.ai/api-reference/rerank-documents post /rerank Rerank documents for a query using relevance scoring # Resume DPO Job Source: https://docs.fireworks.ai/api-reference/resume-dpo-job post /v1/accounts/{account_id}/dpoJobs/{dpo_job_id}:resume # Resume Reinforcement Fine-tuning Job Source: https://docs.fireworks.ai/api-reference/resume-reinforcement-fine-tuning-job post /v1/accounts/{account_id}/reinforcementFineTuningJobs/{reinforcement_fine_tuning_job_id}:resume # Resume RLOR Trainer Job Source: https://docs.fireworks.ai/api-reference/resume-reinforcement-fine-tuning-step post /v1/accounts/{account_id}/rlorTrainerJobs/{rlor_trainer_job_id}:resume # Resume Supervised Fine-tuning Job Source: https://docs.fireworks.ai/api-reference/resume-supervised-fine-tuning-job post /v1/accounts/{account_id}/supervisedFineTuningJobs/{supervised_fine_tuning_job_id}:resume # Scale Deployment to a specific number of replicas or to zero Source: https://docs.fireworks.ai/api-reference/scale-deployment patch /v1/accounts/{account_id}/deployments/{deployment_id}:scale # Undelete Deployment Source: https://docs.fireworks.ai/api-reference/undelete-deployment post /v1/accounts/{account_id}/deployments/{deployment_id}:undelete # Update Dataset Source:
https://docs.fireworks.ai/api-reference/update-dataset patch /v1/accounts/{account_id}/datasets/{dataset_id} # Update LoRA Source: https://docs.fireworks.ai/api-reference/update-deployed-model patch /v1/accounts/{account_id}/deployedModels/{deployed_model_id} # Update Deployment Source: https://docs.fireworks.ai/api-reference/update-deployment patch /v1/accounts/{account_id}/deployments/{deployment_id} # Update Evaluator Source: https://docs.fireworks.ai/api-reference/update-evaluator patch /v1/accounts/{account_id}/evaluators/{evaluator_id} Updates evaluator metadata (display_name, description, default_dataset). Changing `requirements` or `entry_point` triggers a rebuild. To upload new source code, set `prepare_code_upload: true` then follow the upload flow. # Update Model Source: https://docs.fireworks.ai/api-reference/update-model patch /v1/accounts/{account_id}/models/{model_id} # Update Secret Source: https://docs.fireworks.ai/api-reference/update-secret patch /v1/accounts/{account_id}/secrets/{secret_id} # Update User Source: https://docs.fireworks.ai/api-reference/update-user patch /v1/accounts/{account_id}/users/{user_id} # Upload Dataset Files Source: https://docs.fireworks.ai/api-reference/upload-dataset-files post /v1/accounts/{account_id}/datasets/{dataset_id}:upload Provides a streamlined way to upload a dataset file in a single API request. This path can handle file sizes up to 150 MB. For larger files, use [Get Dataset Upload Endpoint](get-dataset-upload-endpoint). # Validate Dataset Upload Source: https://docs.fireworks.ai/api-reference/validate-dataset-upload post /v1/accounts/{account_id}/datasets/{dataset_id}:validateUpload # Validate Evaluator Upload Source: https://docs.fireworks.ai/api-reference/validate-evaluator-upload post /v1/accounts/{account_id}/evaluators/{evaluator_id}:validateUpload Triggers server-side validation of the uploaded source code (**step 5** in the [Create Evaluator](/api-reference/create-evaluator) workflow). The server extracts and processes the archive, then builds the evaluator environment. Poll [Get Evaluator](/api-reference/get-evaluator) to monitor progress. # Validate Model Upload Source: https://docs.fireworks.ai/api-reference/validate-model-upload get /v1/accounts/{account_id}/models/{model_id}:validateUpload # Autoscaling Source: https://docs.fireworks.ai/deployments/autoscaling Configure how your deployment scales based on traffic Control how your deployment scales based on traffic and load. ## Configuration options | Flag | Type | Default | Description | | ------------------------ | --------- | ------------- | ------------------------------------------------------ | | `--min-replica-count` | Integer | 0 | Minimum number of replicas. Set to 0 for scale-to-zero | | `--max-replica-count` | Integer | 1 | Maximum number of replicas | | `--scale-up-window` | Duration | 30s | Wait time before scaling up | | `--scale-down-window` | Duration | 10m | Wait time before scaling down | | `--scale-to-zero-window` | Duration | 1h | Idle time before scaling to zero (min: 5m) | | `--load-targets` | Key-value | `default=0.8` | Scaling thresholds. See options below | **Load target options** (use as `--load-targets <key>=<value>[,<key>=<value>...]`): * `default=<value>` - General load target from 0 to 1 * `tokens_generated_per_second=<value>` - Desired tokens per second per replica * `requests_per_second=<value>` - Desired requests per second per replica * `concurrent_requests=<value>` - Desired concurrent requests per replica When multiple targets are specified, the maximum replica count across all targets is used.
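As a minimal sketch of combining two targets (flags and syntax as documented above; the model name and values are illustrative):

```bash theme={null}
# Scale on whichever target demands more replicas:
# ~200 generated tokens/sec or ~8 concurrent requests per replica
firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \
  --min-replica-count 1 \
  --max-replica-count 5 \
  --load-targets "tokens_generated_per_second=200,concurrent_requests=8"
```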
## Common patterns Scale to zero when idle to minimize costs: ```bash theme={null} firectl create deployment \ --min-replica-count 0 \ --max-replica-count 3 \ --scale-to-zero-window 1h ``` Best for: Development, testing, or intermittent production workloads. Keep replicas running for instant response: ```bash theme={null} firectl create deployment \ --min-replica-count 2 \ --max-replica-count 10 \ --scale-up-window 15s \ --load-targets concurrent_requests=5 ``` Best for: Low-latency requirements, avoiding cold starts, high-traffic applications. Match known traffic patterns: ```bash theme={null} firectl create deployment \ --min-replica-count 3 \ --max-replica-count 5 \ --scale-down-window 30m \ --load-targets tokens_generated_per_second=150 ``` Best for: Steady workloads where you know typical load ranges. Cold starts take up to a few minutes when scaling from 0→1. Deployments with min replicas = 0 are auto-deleted after 7 days of no traffic. [Reserved capacity](/deployments/reservations) guarantees availability during scale-up. # Performance benchmarking Source: https://docs.fireworks.ai/deployments/benchmarking Measure and optimize your deployment's performance with load testing Understanding your deployment's performance under various load conditions is essential for production readiness. Fireworks provides tools and best practices for benchmarking throughput and latency and for identifying bottlenecks. ## Fireworks Benchmark Tool Use our open-source benchmarking tool to measure and optimize your deployment's performance: **[Fireworks Benchmark Tool](https://github.com/fw-ai/benchmark)** This tool allows you to: * Test throughput and latency under various load conditions * Simulate production traffic patterns * Identify performance bottlenecks * Compare different deployment configurations ### Installation ```bash theme={null} git clone https://github.com/fw-ai/benchmark.git cd benchmark pip install -r requirements.txt ``` ### Basic usage Run a basic benchmark test: ```bash theme={null} python benchmark.py \ --model "accounts/fireworks/models/llama-v3p1-8b-instruct" \ --deployment "your-deployment-id" \ --num-requests 1000 \ --concurrency 10 ``` ### Key metrics to monitor When benchmarking your deployment, focus on these key metrics: * **Throughput**: Requests per second (RPS) your deployment can handle * **Latency**: Time to first token (TTFT) and end-to-end response time * **Token generation rate**: Tokens per second during generation * **Error rate**: Failed requests under load ## Custom benchmarking You can also develop custom performance testing scripts or integrate with monitoring tools to track metrics over time. Consider: * Using production-like request patterns and payloads * Testing with various concurrency levels * Monitoring resource utilization (GPU, memory, network) * Testing autoscaling behavior under load ## Best practices 1. **Warm up your deployment**: Run a few requests before benchmarking to ensure models are loaded 2. **Test realistic scenarios**: Use request patterns and payloads similar to your production workload 3. **Gradually increase load**: Start with low concurrency and gradually increase to find your deployment's limits 4. **Monitor for errors**: Track error rates and response codes to identify issues under load 5.
**Compare configurations**: Test different deployment shapes, quantization levels, and hardware to optimize cost and performance ## Next steps Configure autoscaling to handle variable load Optimize your client code for maximum throughput # Client-side performance optimization Source: https://docs.fireworks.ai/deployments/client-side-performance-optimization Optimize your client code for maximum performance with dedicated deployments When using a dedicated deployment, it is important to optimize the client-side HTTP connection pooling for maximum performance. We recommend using our [Python SDK](/tools-sdks/python-sdk) as it has good defaults for connection pooling and utilizes [httpx](https://www.python-httpx.org/) for optimal performance with Python's `asyncio` library. It also includes retry logic for handling `429` errors that Fireworks returns when the server is overloaded. ## General optimization recommendations Based on our benchmarks, we recommend the following: 1. Use a client library optimized for high concurrency, such as [httpx](https://www.python-httpx.org/) in Python or [http.Agent](https://nodejs.org/api/http.html#class-httpagent) in Node.js. 2. Use the `AsyncFireworks` client for high-concurrency workloads. 3. Increase concurrency until performance stops improving or you observe too many `429` errors. 4. Use [direct routing](/deployments/direct-routing) to avoid the global API load balancer and route requests directly to your deployment. ## Code example: Optimal concurrent requests (Python) Install the [Fireworks Python SDK](/tools-sdks/python-sdk): The SDK is currently in alpha. Use the `--pre` flag when installing to get the latest version. ```bash pip theme={null} pip install --pre fireworks-ai ``` ```bash poetry theme={null} poetry add --pre fireworks-ai ``` ```bash uv theme={null} uv add --pre fireworks-ai ``` Here's how to implement optimal concurrent requests using `asyncio` and the `AsyncFireworks` client:

```python main.py theme={null}
import asyncio
import time
import statistics

from fireworks import AsyncFireworks


async def make_concurrent_requests(
    messages: list[str],
    model: str,
    max_workers: int = 1000,
):
    """Make concurrent requests with optimized connection pooling"""
    client = AsyncFireworks(
        base_url="https://my-account-abcd1234.eu-iceland-2.direct.fireworks.ai",
        api_key="MY_DIRECT_ROUTE_API_KEY",
        max_retries=5,
    )

    # Semaphore to limit concurrent requests
    semaphore = asyncio.Semaphore(max_workers)
    latencies = []

    async def single_request(message: str):
        """Make a single request with semaphore control"""
        async with semaphore:
            start_time = time.perf_counter()
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": message}],
                max_tokens=100,
            )
            latency = time.perf_counter() - start_time
            latencies.append(latency)
            return response.choices[0].message.content

    # Create all request tasks
    tasks = [single_request(message) for message in messages]

    # Execute all requests concurrently
    results = await asyncio.gather(*tasks)
    return results, latencies


# Usage example
async def main():
    messages = ["Hello!"] * 1000  # 1000 requests
    model = "accounts/fireworks/models/qwen3-0p6b"

    start_time = time.perf_counter()
    results, latencies = await make_concurrent_requests(
        messages=messages,
        model=model,
    )
    total_time = time.perf_counter() - start_time

    # Calculate performance metrics
    num_requests = len(results)
    requests_per_second = num_requests / total_time

    # Latency statistics (in milliseconds)
    latencies_ms = [lat * 1000 for lat in latencies]
    avg_latency = statistics.mean(latencies_ms)
    min_latency = min(latencies_ms)
    max_latency = max(latencies_ms)
    p50_latency = statistics.median(latencies_ms)
    p95_latency = statistics.quantiles(latencies_ms, n=20)[18]  # 95th percentile
    p99_latency = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile

    print("\n" + "=" * 50)
    print("Performance Results")
    print("=" * 50)
    print(f"Total requests: {num_requests}")
    print(f"Total time: {total_time:.2f} seconds")
    print(f"Throughput: {requests_per_second:.2f} requests/second")
    print("\nLatency Statistics (ms):")
    print(f"  Min: {min_latency:.2f}")
    print(f"  Max: {max_latency:.2f}")
    print(f"  Avg: {avg_latency:.2f}")
    print(f"  P50 (median): {p50_latency:.2f}")
    print(f"  P95: {p95_latency:.2f}")
    print(f"  P99: {p99_latency:.2f}")
    print("=" * 50)


if __name__ == "__main__":
    asyncio.run(main())
```

This implementation: * Uses `AsyncFireworks` for non-blocking async requests with optimized connection pooling * Uses `asyncio.Semaphore` to control concurrency to avoid overwhelming the server * Targets a dedicated deployment with [direct routing](/deployments/direct-routing) # Direct routing Source: https://docs.fireworks.ai/deployments/direct-routing Direct routing enables enterprise users to reduce latency to their deployments. ## Internet direct routing Internet direct routing bypasses our global API load balancer and directly routes your request to the machines where your deployment is running. This can save several tens or even hundreds of milliseconds of time-to-first-token (TTFT) latency. To create a deployment using Internet direct routing: When creating a deployment with direct routing, the `--region` parameter is required to specify the deployment region. ```bash theme={null} $ firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \ --direct-route-type INTERNET \ --direct-route-api-keys <API_KEY> \ --region <REGION> Name: accounts/my-account/deployments/abcd1234 ... Direct Route Handle: my-account-abcd1234.us-arizona-1.direct.fireworks.ai Region: US_ARIZONA_1 ``` If you have multiple API keys, use repeated fields, such as: `--direct-route-api-keys=<KEY1> --direct-route-api-keys=<KEY2>`. These keys can be any alpha-numeric string and are a distinct concept from the API keys provisioned via the Fireworks console. A key provisioned in the console but not specified in the list here will not be allowed when querying the model via direct routing. Take note of the `Direct Route Handle` to get the inference endpoint. This is what you will use to access the deployment instead of the global `https://api.fireworks.ai/inference/` endpoint. For example: ```bash theme={null} curl \ --header 'Authorization: Bearer <DIRECT_ROUTE_API_KEY>' \ --header 'Content-Type: application/json' \ --data '{ "model": "accounts/fireworks/models/llama-v3-8b-instruct", "prompt": "The sky is" }' \ --url https://my-account-abcd1234.us-arizona-1.direct.fireworks.ai/v1/completions ``` ### Use Python SDKs with direct routing Set the direct route handle as the `base_url` when you initialize the SDK so your calls go straight to the regional deployment endpoint.
**Important:** The `base_url` format differs between SDKs: * **OpenAI SDK:** Include the `/v1` suffix (e.g., `https://...direct.fireworks.ai/v1`) * **Fireworks SDK:** Omit the `/v1` suffix (e.g., `https://...direct.fireworks.ai`) ```python OpenAI SDK theme={null} from openai import OpenAI client = OpenAI( # Note: Include /v1 suffix for OpenAI SDK base_url="https://my-account-abcd1234.us-arizona-1.direct.fireworks.ai/v1", api_key="<DIRECT_ROUTE_API_KEY>" ) response = client.chat.completions.create( model="accounts/fireworks/models/llama-v3-8b-instruct", messages=[{"role": "user", "content": "Hello!"}] ) ``` ```python Fireworks SDK theme={null} from fireworks import Fireworks client = Fireworks( # Note: No /v1 suffix for Fireworks SDK base_url="https://my-account-abcd1234.us-arizona-1.direct.fireworks.ai", api_key="<DIRECT_ROUTE_API_KEY>" ) response = client.chat.completions.create( model="accounts/fireworks/models/llama-v3-8b-instruct", messages=[{"role": "user", "content": "Hello!"}] ) ``` The direct route handle replaces the standard `https://api.fireworks.ai/inference/v1` endpoint, bypassing the global load balancer to reduce latency. For a complete code-only example that demonstrates creating a direct route deployment and querying it, see the [Python SDK direct route deployment example](https://github.com/fw-ai-external/python-sdk/blob/main/examples/direct_route_deployment.py). ## Supported Regions for Direct Routing Direct routing is currently supported in the following regions: * `US_IOWA_1` * `US_VIRGINIA_1` * `US_ARIZONA_1` * `US_ILLINOIS_1` * `US_TEXAS_1` * `US_ILLINOIS_2` * `EU_FRANKFURT_1` * `US_WASHINGTON_3` * `US_WASHINGTON_1` * `AP_TOKYO_1` ## Private Service Connect (PSC) Contact your Fireworks representative to set up [GCP Private Service Connect](https://cloud.google.com/vpc/docs/private-service-connect) to your deployment. ## AWS PrivateLink Contact your Fireworks representative to set up [AWS PrivateLink](https://aws.amazon.com/privatelink/) to your deployment. # Exporting Metrics Source: https://docs.fireworks.ai/deployments/exporting-metrics Export metrics from your dedicated deployments to your observability stack ## Overview Fireworks provides a metrics endpoint in Prometheus format, enabling integration with popular observability tools like Prometheus, OpenTelemetry (OTel) Collector, Datadog Agent, and Vector. This page covers real-time performance metrics (latency, throughput, etc.) for on-demand deployments. For billing and usage data across all Fireworks services, see [Exporting Billing Metrics](/accounts/exporting-billing-metrics). ## Setting Up Metrics Collection ### Endpoint The metrics endpoint is as follows. This URL and authorization header can be directly used by services like Grafana Cloud to ingest Fireworks metrics. ``` https://api.fireworks.ai/v1/accounts/<account_id>/metrics ``` ### Authentication Use the `Authorization` header with your Fireworks API key: ```json theme={null} { "Authorization": "Bearer YOUR_API_KEY" } ``` ### Scrape Interval We recommend using a 1-minute scrape interval as metrics are updated every 30s.
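Before wiring up a collector, you can verify access with a one-off scrape (a sketch using the endpoint and header documented above; substitute your account ID and API key):

```bash theme={null}
# One-off scrape; output is Prometheus text exposition format
curl -s "https://api.fireworks.ai/v1/accounts/<account_id>/metrics" \
  -H "Authorization: Bearer <API_KEY>"
```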
### Rate Limits To ensure service stability and fair usage: * Maximum of 6 requests per minute per account * Exceeding this limit results in HTTP 429 (Too Many Requests) responses * Use a 1-minute scrape interval to stay within limits ## Integration Options Fireworks metrics can be integrated with various observability platforms through multiple approaches: ### OpenTelemetry Collector Integration The Fireworks metrics endpoint can be integrated with OpenTelemetry Collector by configuring a Prometheus receiver that scrapes the endpoint. This allows Fireworks metrics to be pushed to a variety of popular exporters—see the [OpenTelemetry registry](https://opentelemetry.io/ecosystem/registry/) for a full list. ### Direct Prometheus Integration To integrate directly with Prometheus, specify the Fireworks metrics endpoint in your scrape config:

```yaml theme={null}
global:
  scrape_interval: 60s

scrape_configs:
  - job_name: 'fireworks'
    metrics_path: '/v1/accounts/<account_id>/metrics'
    authorization:
      type: "Bearer"
      credentials: "YOUR_API_KEY"
    static_configs:
      - targets: ['api.fireworks.ai']
    scheme: https
```

For more details on Prometheus configuration, refer to the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/). ### Supported Platforms Fireworks metrics can be exported to various observability platforms including: * Prometheus * Datadog * Grafana * New Relic ## Available Metrics ### Common Labels All metrics include the following common labels: * `base_model`: The base model identifier (e.g., "accounts/fireworks/models/deepseek-v3") * `deployment`: Full deployment path (e.g., "accounts/account-name/deployments/deployment-id") * `deployment_account`: The account name * `deployment_id`: The deployment identifier ### Rate Metrics (per second) These metrics show activity rates calculated using 1-minute windows: #### Request Rate * `request_counter_total:sum_by_deployment`: Request rate per deployment #### Error Rate * `requests_error_total:sum_by_deployment`: Error rate per deployment, broken down by HTTP status code (includes additional `http_code` label) #### Token Processing Rates * `tokens_cached_prompt_total:sum_by_deployment`: Rate of cached prompt tokens per deployment * `tokens_prompt_total:sum_by_deployment`: Rate of total prompt tokens processed per deployment ### Latency Histogram Metrics These metrics provide latency distribution data with histogram buckets, calculated using 1-minute windows: #### Generation Latency * `latency_generation_per_token_ms_bucket:sum_by_deployment`: Per-token generation time distribution * `latency_generation_queue_ms_bucket:sum_by_deployment`: Time spent waiting in generation queue #### Request Latency * `latency_overall_ms_bucket:sum_by_deployment`: End-to-end request latency distribution * `latency_to_first_token_ms_bucket:sum_by_deployment`: Time to first token distribution #### Prefill Latency * `latency_prefill_ms_bucket:sum_by_deployment`: Prefill processing time distribution * `latency_prefill_queue_ms_bucket:sum_by_deployment`: Time spent waiting in prefill queue ### Token Distribution Metrics These histogram metrics show token count distributions per request, calculated using 1-minute windows: * `tokens_generated_per_request_bucket:sum_by_deployment`: Distribution of generated tokens per request * `tokens_prompt_per_request_bucket:sum_by_deployment`: Distribution of prompt tokens per request ### Resource Utilization Metrics These gauge metrics show average resource usage: *
`generator_kv_blocks_fraction:avg_by_deployment`: Average fraction of KV cache blocks in use * `generator_kv_slots_fraction:avg_by_deployment`: Average fraction of KV cache slots in use * `generator_model_forward_time:avg_by_deployment`: Average time spent in model forward pass * `requests_coordinator_concurrent_count:avg_by_deployment`: Average number of concurrent requests * `prefiller_prompt_cache_ttl:avg_by_deployment`: Average prompt cache time-to-live # Regions Source: https://docs.fireworks.ai/deployments/regions Fireworks runs a global fleet of hardware on which you can deploy your models. ## Availability Current region availability: | **Region** | **Quota availability** | **Hardware availability** | | -------------------- | ---------------------- | -------------------------------------- | | `US_IOWA_1` | Available by default | `NVIDIA_H100_80GB` | | `US_TEXAS_2` | Available by default | `NVIDIA_H100_80GB` | | `REGION_UNSPECIFIED` | Available by default | `ANY OF THE ABOVE/BELOW` | | `US_ARIZONA_1` | Must be requested | `NVIDIA_H100_80GB` | | `US_CALIFORNIA_1` | Must be requested | `NVIDIA_H200_141GB` | | `US_GEORGIA_2` | Must be requested | `NVIDIA_B200_180GB` | | `US_ILLINOIS_1` | Must be requested | `NVIDIA_H100_80GB` | | `US_ILLINOIS_2` | Must be requested | `NVIDIA_A100_80GB` | | `US_UTAH_1` | Must be requested | `NVIDIA_B200_180GB` | | `US_VIRGINIA_1` | Must be requested | `NVIDIA_H100_80GB` `NVIDIA_H200_141GB` | | `US_WASHINGTON_1` | Must be requested | `NVIDIA_H100_80GB` | | `US_WASHINGTON_2` | Must be requested | `NVIDIA_H100_80GB` | | `US_WASHINGTON_3` | Must be requested | `NVIDIA_B200_180GB` | | `EU_FRANKFURT_1` | Must be requested | `NVIDIA_H100_80GB` | | `EU_ICELAND_1` | Must be requested | `NVIDIA_H200_141GB` | | `EU_ICELAND_2` | Must be requested | `NVIDIA_H200_141GB` | | `AP_TOKYO_1` | Must be requested | `NVIDIA_H100_80GB` | | `AP_TOKYO_2` | Must be requested | `NVIDIA_H200_141GB` | If you hit a quota limit when requesting a specific region, try launching the deployment without specifying a region. This taps into your global account quota, which is more flexible. Deployments may still be placed in `Must be requested` regions when region is not specified, but region-level quota must be enabled to explicitly specify that region when creating a deployment. ## Using a region When creating a deployment, you can pass the `--region` flag: ``` firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \ --region US_IOWA_1 ``` ## Changing regions Updating a region for a deployment in-place is currently not supported. To move a deployment between regions, please create a new deployment in the new region, then delete the old deployment. ## Quotas Each region has its own separate quota for each hardware type. To view your current quotas, run ``` firectl list quotas ``` If you need deployments in a non-GA region, please contact our team at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai). # Reserved capacity Source: https://docs.fireworks.ai/deployments/reservations Enterprise accounts can purchase reserved capacity, typically with 1 year commitments. Reserved capacity has the following advantages over ordinary [on-demand deployments](/guides/ondemand-deployments): * Guaranteed capacity * Higher quotas * Lower GPU-hour prices * Pre-GA access to newer regions * Pre-GA access to newest hardware ## Usage and billing Consuming a reservation is done by creating a deployment that meets the reservation parameters.
For example, suppose you have a reservation for 12 H100 GPUs and create two deployments, each using 8 H100 GPUs. While both deployments are running, 12 of the H100s will count towards using your reservation, while the excess 4 H100s will be metered and billed at the on-demand rate. Follow [deploying models on-demand](/guides/ondemand-deployments) to create a deployment. When a reservation approaches its end time, ensure that you either renew your reservation or turn down a corresponding number of deployments, otherwise you may be billed for your usage at on-demand rates. Reservations are invoiced separately from your on-demand usage, at a frequency determined by your reservation contract (e.g. monthly, quarterly, or yearly). Reserved capacity will always be billed until the reservation ends, regardless of whether the reservation is actively used. ## Purchasing or renewing a reservation To purchase a reservation or increase the size or duration of an existing reservation, contact your Fireworks account manager. If you are a new, prospective customer, please reach out to our [sales team](https://fireworks.ai/company/contact-us). ## Viewing your reservations To view your existing reservations, run: ``` firectl list reservations ``` # Speculative Decoding Source: https://docs.fireworks.ai/deployments/speculative-decoding Speed up generation with draft models and n-gram speculation Speed up text generation by using a smaller "draft" model to assist the main model, or using n-gram based speculation. Speculative decoding may slow down output generation if the draft model is not a good speculator, or if token count/speculation length is too high or too low. It may also reduce max throughput. Test different models and speculation lengths for your use case. ## Configuration options | Flag | Type | Description | | ---------------------------- | ------ | ------------------------------------------------------------------------------------------- | | `--draft-model` | string | Draft model name. Can be a Fireworks model or custom model. See recommendations below. | | `--draft-token-count` | int32 | Tokens to generate per step. Required when using draft model or n-gram. Typically set to 4. | | `--ngram-speculation-length` | int32 | Alternative to draft model: uses N-gram based speculation from previous input. | `--draft-model` and `--ngram-speculation-length` cannot be used together. ## Recommended draft models | Draft model | Use with | | -------------------------------------------------- | --------------------- | | `accounts/fireworks/models/llama-v3p2-1b-instruct` | All Llama models > 3B | | `accounts/fireworks/models/qwen2p5-0p5b-instruct` | All Qwen models > 3B | ## Examples Use a smaller model to speed up generation: ```bash theme={null} firectl create deployment accounts/fireworks/models/llama-v3p3-70b-instruct \ --draft-model="accounts/fireworks/models/llama-v3p2-1b-instruct" \ --draft-token-count=4 ``` Use input history for speculation (no draft model needed): ```bash theme={null} firectl create deployment accounts/fireworks/models/llama-v3p3-70b-instruct \ --ngram-speculation-length=3 \ --draft-token-count=4 ``` Fireworks also supports [Predicted Outputs](/guides/predicted-outputs), which works in addition to model-based speculative decoding.
# Cloud Integrations Source: https://docs.fireworks.ai/ecosystem/integrations Cloud Integrations ## Cloud Deployments Deploy Fireworks models on AWS SageMaker Run Fireworks on Amazon Elastic Kubernetes Service Deploy using Amazon Elastic Container Service Build and deploy AI agents with AgentCore ## Need Help? For assistance with cloud deployments or custom integrations, [contact our team](https://fireworks.ai/contact). # Agent Frameworks Source: https://docs.fireworks.ai/ecosystem/integrations/agent-frameworks Build production-ready AI agents with Fireworks and leading open-source frameworks Fireworks AI seamlessly integrates with the best open-source agent frameworks, enabling you to build magical, production-ready applications powered by state-of-the-art language models. ## Supported Frameworks Build LLM applications with powerful orchestration and tool integration Efficient data retrieval and document indexing for LLM-based agents Orchestrate collaborative multi-agent systems for complex tasks Type-safe AI agent development with Pydantic validation Modern agent orchestration with seamless OpenAI-compatible integration ## Need Help? For assistance with agent framework integrations, [contact our team](https://fireworks.ai/contact) or join our [Discord community](https://discord.gg/fireworks-ai). # MLOps & Observability Source: https://docs.fireworks.ai/ecosystem/integrations/mlops-observability Track and monitor your Fireworks AI deployments with leading MLOps and observability platforms Fireworks AI integrates with industry-leading MLOps and observability platforms to help you monitor, track, and optimize your AI applications in production. ## Supported Platforms Track fine-tuning experiments and visualize training metrics with W\&B MLflow Tracing to track prompts, outputs, latency, etc. as you build AI applications with Fireworks AI ## Need Help? For assistance with MLOps and observability integrations, [contact our team](https://fireworks.ai/contact) or join our [Discord community](https://discord.gg/fireworks-ai). # Cookbooks Source: https://docs.fireworks.ai/examples/cookbooks Interactive Jupyter notebooks demonstrating advanced use cases and best practices with Fireworks AI Explore our collection of notebooks that showcase real-world applications, best practices, and advanced techniques for building with Fireworks AI. ## Fine-Tuning & Training Transfer large model capabilities to efficient models using a two-stage SFT + RFT approach. **Techniques:** Supervised Fine-Tuning (SFT) + Reinforcement Fine-Tuning (RFT) **Results:** 52% → 70% accuracy on GSM8K mathematical reasoning Beat frontier closed-source models for product catalog cleansing with vision-language model fine-tuning. **Techniques:** Supervised Fine-Tuning (SFT) **Results:** 48% increase in quality from base model ## Multimodal AI Extract structured data from invoices, forms, and financial documents using state-of-the-art OCR and document understanding. **Use Cases:** Forms, invoices, financial documents, product catalogs **Results:** 90.8% accuracy on invoice extraction (100% on invoice numbers and dates) Real-time audio transcription with streaming support and low latency. **Features:** Streaming support, low-latency transcription, production-ready ## API Features Leverage Model Context Protocol (MCP) for GitHub repository analysis, code search, and documentation Q\&A.
**Features:** Repository analysis, code search, documentation Q\&A, GitMCP integration **Models:** Qwen 3 235B with external tool support # Courses Source: https://docs.fireworks.ai/examples/introduction Standalone end-to-end examples showing how to use Fireworks to solve real-world use cases Learn how to use Fireworks to fine-tune a model to convert natural language to SQL queries. Learn how to build reinforcement learning systems that avoid reward hacking. Learn to distill the knowledge of large AI models into efficient, deployable alternatives. # How do I close my Fireworks.ai account? Source: https://docs.fireworks.ai/faq-new/account-access/how-do-i-close-my-fireworksai-account To close your account: 1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) 2. Include in your request: * Your account ID * A clear request for account deletion Before closing your account, please ensure: * All outstanding invoices are paid * Any active deployments are terminated * Important data is backed up if needed # I have multiple Fireworks accounts. When I try to login with Google on Fireworks' web UI, I'm getting signed into the wrong account. How do I fix this? Source: https://docs.fireworks.ai/faq-new/account-access/i-have-multiple-fireworks-accounts-when-i-try-to-login-with-google-on-fireworks If you log in with Google, account management is controlled by Google. You can log in using an incognito window or create separate Chrome/browser profiles to log in with different Google accounts. You could also follow the steps in this [guide](https://support.google.com/accounts/answer/13533235?hl=en#zippy=%2Csign-in-with-google) to disassociate Fireworks.ai from a particular Google account sign-in. If you have more complex issues, please contact us on Discord. # What email does GitHub authentication use? Source: https://docs.fireworks.ai/faq-new/account-access/what-email-does-github-authentication-use When you authenticate with Fireworks using GitHub, we use the **primary email address** associated with your GitHub account for identification and account management. ## How it works Fireworks automatically retrieves your primary email address from your GitHub profile during the authentication process. This email address becomes your Fireworks account identifier. ## Managing your primary email To change your primary email address on GitHub: 1. Go to your [GitHub email settings](https://github.com/settings/emails) 2. Select the email address you want to set as primary in the "Primary email address" section You can also follow the [GitHub documentation](https://docs.github.com/en/enterprise-cloud@latest/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-email-preferences/changing-your-primary-email-address) for detailed instructions on managing email preferences. ## Switching between accounts You can easily switch which Fireworks account your GitHub authentication logs into by changing your primary email address on GitHub before logging in. This allows you to: * Log into different Fireworks accounts using the same GitHub account * Switch between personal and work accounts by updating your GitHub primary email * Maintain separate billing and usage tracking for different email addresses The authentication will use whatever email is set as primary at the time of login, so you can switch accounts by simply updating your GitHub primary email before authenticating. # What email does LinkedIn authentication use?
Source: https://docs.fireworks.ai/faq-new/account-access/what-email-does-linkedin-authentication-use When you authenticate with Fireworks using LinkedIn, we use the **primary email address** associated with your LinkedIn account for identification and account management. ## How it works Fireworks automatically retrieves your primary email address from your LinkedIn profile during the authentication process. This email address becomes your Fireworks account identifier. ## Managing your primary email To change your primary email address on LinkedIn: 1. Go to your [LinkedIn email settings](https://www.linkedin.com/mypreferences/d/manage-email-addresses) 2. From there, you can add new email addresses or change your primary email 3. Click **Add email address** to add a new email or select an existing one to make primary You can also follow the [LinkedIn documentation](https://www.linkedin.com/help/linkedin/answer/a519904) for detailed instructions on managing email preferences. ## Switching between accounts You can easily switch which Fireworks account your LinkedIn authentication logs into by changing your primary email address on LinkedIn before logging in. This allows you to: * Log into different Fireworks accounts using the same LinkedIn account * Switch between personal and work accounts by updating your LinkedIn primary email * Maintain separate billing and usage tracking for different email addresses The authentication will use whatever email is set as primary at the time of login, so you can switch accounts by simply updating your LinkedIn primary email before authenticating. # What should I do if I can't access my company account after being invited when I already have a personal account? Source: https://docs.fireworks.ai/faq-new/account-access/what-should-i-do-if-i-cant-access-my-company-account-after-being-invited-when-i This issue can occur when you have multiple accounts associated with the same email address (e.g., a personal account created with Google login and a company account you've been invited to). To resolve this: 1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) from the email address associated with both accounts 2. Include in your email: * The account ID you created personally (e.g., username-44ace8) * The company account ID you need access to (e.g., company-a57b2a) * Mention that you're having trouble accessing your company account Note: This is a known scenario that support can resolve once they verify your email ownership. # Are there discounts for bulk usage? Source: https://docs.fireworks.ai/faq-new/billing-pricing/are-there-discounts-for-bulk-usage We offer discounts for bulk or pre-paid purchases. Contact [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) to discuss volume pricing. # Are there extra fees for serving fine-tuned models? Source: https://docs.fireworks.ai/faq-new/billing-pricing/are-there-extra-fees-for-serving-fine-tuned-models Fine-tuned (LoRA) models require a dedicated deployment to serve. Here's what you need to know: **What you pay for**: * **Deployment costs** on a per-GPU-second basis for hosting the model * **The fine-tuning process** itself, if applicable **Deployment options**: * **Live-merge deployment**: Deploy your LoRA model with weights merged into the base model for optimal performance * **Multi-LoRA deployment**: Deploy up to 100 LoRA models as addons on a single base model deployment For more details on deploying fine-tuned models, see the [Deploying Fine Tuned Models guide](/fine-tuning/deploying-loras). 
# How does billing and credit usage work? Source: https://docs.fireworks.ai/faq-new/billing-pricing/how-does-billing-and-credit-usage-work Usage and billing operate through a **tiered system**: * Each **tier** has a monthly usage limit, regardless of available credits. * Once you reach your tier's limit, **service will be suspended** even if you have remaining credits. * **Usage limits** reset at the beginning of each month. * Pre-purchased credits do not prevent additional charges once the limit is exceeded. For detailed information about spend limits, tiers, and how to manage them, see our [Rate Limits & Quotas guide](/guides/quotas_usage/rate-limits#spend-limits). # How many tokens per image? Source: https://docs.fireworks.ai/faq-new/billing-pricing/how-many-tokens-per-image Learn how to calculate token usage for images in vision models and understand pricing implications Image token consumption varies by model and resolution, typically ranging from 1,000 to 2,500 tokens per image for most common resolutions. ## Common resolution token counts The following table shows the token counts for a single image for Qwen2.5 VL at different image resolutions: | Resolution | Token Count | | ---------- | ----------- | | 336×336 | 144 | | 672×672 | 576 | | 1024×1024 | 1,369 | | 1280×720 | 1,196 | | 1920×1080 | 2,769 | | 2560×1440 | 4,641 | | 3840×2160 | 10,549 | ## Calculating exact token count for your images You can determine exact token usage by processing your images through the model's tokenizer. For instance, for Qwen2.5 VL, you can use the following code:

```bash theme={null}
pip install torch torchvision transformers pillow
```

```python Tokenizing your image theme={null}
import requests
from PIL import Image
from transformers import AutoProcessor
import os

# Your image source - can be URL or local path
IMAGE_URL_OR_PATH = "https://images.unsplash.com/photo-1519125323398-675f0ddb6308"


def load_image(source):
    """Load image from URL or local file path"""
    if source.startswith(('http://', 'https://')):
        print(f"Downloading image from URL: {source}")
        # Stream the download once and open it directly (avoids fetching the image twice)
        response = requests.get(source, stream=True)
        response.raise_for_status()
        return Image.open(response.raw)
    else:
        print(f"Loading image from path: {source}")
        if not os.path.exists(source):
            raise FileNotFoundError(f"Image file not found: {source}")
        return Image.open(source)


def count_image_tokens(image):
    """Count how many tokens an image takes using Qwen 2.5 VL processor"""
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "What's in this image?"},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=text, images=[image], return_tensors="pt")
    input_ids = inputs["input_ids"][0]

    # Count the image pad tokens (151655 is Qwen2.5 VL's image token ID)
    image_tokens = (input_ids == 151655).sum().item()
    return image_tokens, input_ids


def main():
    import sys

    image_source = sys.argv[1] if len(sys.argv) > 1 else IMAGE_URL_OR_PATH
    print(f"Processing image: {image_source}")
    image = load_image(image_source)
    print(f"Image size: {image.size}")
    print(f"Image mode: {image.mode}")
    print("\nCalculating tokens...")
    image_tokens, input_ids = count_image_tokens(image)
    print(f"Total tokens: {len(input_ids)}")
    print(f"Image tokens: {image_tokens}")
    print(f"Text tokens: {len(input_ids) - image_tokens}")


if __name__ == "__main__":
    main()
```

```bash Usage theme={null} # Calculate
tokens for an image URL python token_calculator.py "https://example.com/image.jpg" # Calculate tokens for a local image python token_calculator.py "path/to/your/image.png" ``` # How much does Fireworks cost? Source: https://docs.fireworks.ai/faq-new/billing-pricing/how-much-does-fireworks-cost Fireworks AI operates on a **pay-as-you-go** model for all non-Enterprise usage, and new users automatically receive free credits. You pay based on: * **Per token** for serverless inference * **Per GPU usage time** for on-demand deployments * **Per token of training data** for fine-tuning For customers needing **enterprise-grade security and reliability**, please reach out to us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) to discuss options. Find out more about our current pricing on our [Pricing page](https://fireworks.ai/pricing). # Is prompt caching billed differently for serverless models? Source: https://docs.fireworks.ai/faq-new/billing-pricing/is-prompt-caching-billed-differently No, **prompt caching does not affect billing for serverless models**. You are charged the same amount regardless of whether your request benefits from prompt caching or not. # How do credits work? Source: https://docs.fireworks.ai/faq-new/billing-pricing/what-happens-when-i-finish-my-1-dollar-credit ## How credits are applied Fireworks operates with a **postpaid billing** system: * **Prepaid credits are used first** for all usage * Once credits are exhausted, you **continue to accrue charges** for additional usage * **Usage charges** are billed at the end of each month * **Prepaid credits are instantly applied** to any outstanding balance **Example**: If you had a `$750` outstanding bill and added `$500` in credits, your bill would reduce to `$250`, with \$0 remaining credits available for new usage. ## Missing credits after purchase? If you don't see your credits reflected immediately: 1. Visit your **billing dashboard** 2. Review the **"Credits"** section 3. Check your **current outstanding balance** **Important**: Credits are always applied to any existing balance before being available for new usage. If you had an outstanding balance, your credits were automatically applied to reduce it. ## Why did I receive an invoice after depositing credits? You'll receive an invoice for any usage that **exceeded your pre-purchased credits**. This process happens automatically, regardless of subscription status. **Example**: If you deposited `$20` in credits but incurred `$83` in usage, you'll be billed for the `$63` difference at month-end. ## What happens when I finish my \$1 credit? When you finish your \$1 credit, the following occurs: ## Account Status * **Without payment method**: Your account will be **suspended** until you add a payment method. Additionally, accounts without a payment method are subject to a **provisional rate limit of 10 requests per minute (RPM)**. To access full rate limits (up to 6,000 RPM), add a payment method in your [billing settings](https://fireworks.ai/billing). * **With payment method**: You can continue using the service with full rate limits and usage-based billing **Payment Method Requirements:** * Adding a payment method is required to continue service after credit depletion * You're billed at the end of the month for actual usage * You can incur bills up to your configured spend limit (default: \$50/month for new accounts) * As you spend more with Fireworks, your allowed usage limits increase ## Where's my receipt for purchased credits? 
Receipts for purchased credits are sent via Stripe upon purchase. Check your email for receipts from Stripe (not Fireworks). If you can't find your receipt, contact [billing@fireworks.ai](mailto:billing@fireworks.ai). For detailed information about spend limits, tiers, and quotas, see our [Rate Limits & Quotas guide](/guides/quotas_usage/rate-limits). # Why might my account be suspended even with remaining credits? Source: https://docs.fireworks.ai/faq-new/billing-pricing/why-might-my-account-be-suspended-even-with-remaining-credits Your account may be suspended due to several factors: 1. **Monthly usage limits**: * Each tier includes a monthly usage limit, independent of any credits. * Once you reach this limit, your service will be suspended, even if you have credits remaining. * Usage limits automatically reset at the beginning of each month. 2. **Billing structure**: * Pre-purchased credits do not prevent additional charges. * You can exceed your pre-purchased credits and will be billed for any usage beyond that limit. * **Example**: If you have `$20` in pre-purchased credits but incur `$83` in usage, you will be billed for the `$63` difference. If you're experiencing account suspension issues or need assistance with your usage limits, please contact [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai). # Are there any quotas for serverless? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/are-there-any-quotas-for-serverless Yes, serverless deployments have rate limits and quotas. For detailed information about serverless quotas, rate limits, and daily token limits, see our [Rate Limits & Quotas guide](/guides/quotas_usage/rate-limits#rate-limits-on-serverless). # Do you provide notice before removing model availability? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/do-you-provide-notice-before-removing-model-availability Yes, we provide advance notice before removing models from the serverless infrastructure: * **Minimum 2 weeks’ notice** before model removal * Longer notice periods may be provided for **popular models**, depending on usage * Higher-usage models may have extended deprecation timelines **Best Practices**: 1. Monitor announcements regularly. 2. Prepare a migration plan in advance. 3. Test alternative models to ensure continuity. 4. Keep your contact information updated for timely notifications. # Do you support Auto Scaling? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/do-you-support-auto-scaling Yes, our system supports **auto scaling** with the following features: * **Scaling down to zero** capability for resource efficiency * Controllable **scale-up and scale-down velocity** * **Custom scaling rules and thresholds** to match your specific needs # How does autoscaling affect my costs? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/how-does-autoscaling-affect-my-costs * **Scaling from 0**: No minimum cost when scaled to zero * **Scaling up**: Each new replica adds to your total cost proportionally. For example: * Scaling from 1 to 2 replicas doubles your GPU costs * If each replica uses multiple GPUs, costs scale accordingly (e.g., scaling from 1 to 2 replicas with 2 GPUs each means paying for 4 GPUs total) For current pricing details, please visit our [pricing page](https://fireworks.ai/pricing). # How does billing and scaling work for on-demand GPU deployments? 
Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/how-does-billing-and-scaling-work-for-on-demand-gpu-deployments On-demand GPU deployments have unique billing and scaling characteristics compared to serverless deployments: **Billing**: * Charges start when the server begins accepting requests * **Billed by GPU-second** for each active instance * Costs accumulate even if there are no active API calls **Scaling options**: * Supports **autoscaling** from 0 to multiple GPUs * Each additional GPU **adds to the billing rate** * Can handle unlimited requests within the GPU’s capacity **Management requirements**: * Not fully serverless; requires some manual management * **Manually delete deployments** when no longer needed * Or configure autoscaling to **scale down to 0** during inactive periods **Cost control tips**: * Regularly **monitor active deployments** * **Delete unused deployments** to avoid unnecessary costs * Consider **serverless options** for intermittent usage * Use **autoscaling to 0** to optimize costs during low-demand times # How does billing work for on-demand deployments? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/how-does-billing-work-for-on-demand-deployments On-demand deployments come with automatic cost optimization features: * **Default autoscaling**: Automatically scales to 0 replicas when not in use * **Pay for what you use**: Charged only for GPU time when replicas are active * **Flexible configuration**: Customize autoscaling behavior to match your needs **Best practices for cost management**: 1. **Leverage default autoscaling**: The system automatically scales down deployments when not in use 2. **Customize carefully**: While you can modify autoscaling behavior using our [configuration options](https://docs.fireworks.ai/guides/ondemand-deployments#customizing-autoscaling-behavior), note that preventing scale-to-zero will result in continuous GPU charges 3. **Consider your use case**: For intermittent or low-frequency usage, serverless deployments might be more cost-effective For detailed configuration options, see our [deployment guide](https://docs.fireworks.ai/guides/ondemand-deployments#replica-count-horizontal-scaling). # How does the system scale? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/how-does-the-system-scale Our system is **horizontally scalable**, meaning it: * Scales linearly with additional **replicas** of the deployment * **Automatically allocates resources** based on demand * Manages **distributed load handling** efficiently # Are there SLAs for serverless? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/is-latency-guaranteed-for-serverless-models Our multi-tenant serverless offering does not currently come with Service Level Agreements (SLAs) for latency or availability. If you have specific performance or availability requirements, we recommend: * **On-demand deployments**: Provides dedicated resources with predictable performance * **Contact sales**: [Reach out to discuss](https://fireworks.ai/company/contact-us) custom solutions and enterprise options # What are the rate limits for on-demand deployments? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/what-are-the-rate-limits-for-on-demand-deployments On-demand deployments have GPU quotas that determine your maximum allocation. 
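To see your current allocation, you can list your per-region hardware quotas with the firectl command documented in the Regions section:

```bash theme={null}
# List your account's hardware quotas per region
firectl list quotas
```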
For detailed information about on-demand deployment quotas and GPU limits, see our [Rate Limits & Quotas guide](/guides/quotas_usage/rate-limits#gpu-limits-with-on-demand-deployments). Need higher GPU allocations? [Contact us](https://fireworks.ai/company/contact-us) to discuss custom solutions for your use case.

# What factors affect the number of simultaneous requests that can be handled?

Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/what-factors-affect-the-number-of-simultaneous-requests-that-can-be-handled

The request handling capacity is influenced by multiple factors:

* **Model size and type**
* **Number of GPUs** allocated to the deployment
* **GPU type** (e.g., A100 vs. H100)
* **Prompt size** and **generation token length**
* **Deployment type** (serverless vs. on-demand)

# What’s the supported throughput?

Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/whats-the-supported-throughput

Throughput capacity typically depends on several factors:

* **Deployment type** (serverless or on-demand)
* **Traffic patterns** and **request patterns**
* **Hardware configuration**
* **Model size and complexity**

# Why am I experiencing request timeout errors and slow response times with serverless LLM models?

Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/why-am-i-experiencing-request-timeout-errors-and-slow-response-times-with-server

Timeout errors and increased response times can occur due to **server load during high-traffic periods**. With serverless, users are essentially **sharing a pool of GPUs** with models pre-provisioned. The goal of serverless is to let users and teams **seamlessly power their generative applications** with the **latest generative models** in **less than 5 lines of code**: deployment barriers are **minimal**, and **pricing is based on usage**. However, there are trade-offs with this approach: to ensure **consistent access** to the most in-demand models, users are also subject to **minor latency and performance variability** during **high-volume periods**. With **on-demand deployments**, users reserve GPUs (which are **billed by rented time** instead of usage volume) and don't have to worry about traffic spikes. That is why our recommended way to address timeout and response-time issues is:

### Current solution (recommended for production)

* **Use on-demand deployments** for more stable performance
* **Guaranteed response times**
* **Dedicated resources** to ensure availability

We are always investing in ways to improve speed and performance.

### Upcoming improvements

* Enhanced SLAs for uptime
* More consistent generation speeds during peak load times

If you experience persistent issues, please include the following details in your support request:

1. Exact **model name**
2. **Timestamp** of errors (in UTC)
3. **Frequency** of timeouts
4. **Average wait times**

### Performance optimization tips

* Consider **batch processing** for handling bulk requests
* Implement **retry logic with exponential backoff**
* Monitor **usage patterns** to identify peak traffic times
* Set **appropriate timeout settings** based on model complexity

# Does Fireworks support custom base models?

Source: https://docs.fireworks.ai/faq-new/models-inference/does-fireworks-support-custom-base-models

Yes, custom base models can be deployed via **firectl**.
You can learn more about custom model deployment in our [guide on uploading custom models](https://docs.fireworks.ai/models/uploading-custom-models).

# Does the API support batching and load balancing?

Source: https://docs.fireworks.ai/faq-new/models-inference/does-the-api-support-batching-and-load-balancing

Current capabilities include:

* **Load balancing**: Yes, supported out of the box
* **Continuous batching**: Yes, supported
* **Batch inference**: Yes, supported via the [Batch API](/guides/batch-inference)
* **Streaming**: Yes, supported

For asynchronous batch processing of large volumes of requests, see our [Batch API documentation](/guides/batch-inference).

# FLUX image generation

Source: https://docs.fireworks.ai/faq-new/models-inference/flux-image-generation

## Can I generate multiple images in a single API call?

No, FLUX serverless supports only one image per API call. For multiple images, send separate parallel requests—these will be automatically load-balanced across our replicas for optimal performance.

## Does FLUX support image-to-image generation?

No, image-to-image generation is not currently supported. We are evaluating this feature for future implementation. If you have specific use cases, please share them with our support team to help inform development.

## Can I create custom LoRA models with FLUX?

Inference on FLUX-LoRA adapters is currently supported. However, managed training on Fireworks with FLUX is not, although this feature is under development. Updates about our managed LoRA training service will be announced when available.

# How do I control output image sizes when using SDXL ControlNet?

Source: https://docs.fireworks.ai/faq-new/models-inference/how-do-i-control-output-image-sizes-when-using-sdxl-controlnet

When using **SDXL ControlNet** (e.g., canny control), the output image size is determined by the explicit **width** and **height** parameters in your API request. The input control signal image will be automatically:

* **Resized** to fit your specified dimensions
* **Cropped** to preserve aspect ratio

**Example**: To generate a 768x1344 image, explicitly include these parameters in your request:

```json theme={null}
{
  "width": 768,
  "height": 1344
}
```

*Note*: While these parameters may not appear in the web interface examples, they are supported API parameters that can be included in your requests.

# How to check if a model is available on serverless?

Source: https://docs.fireworks.ai/faq-new/models-inference/how-to-check-if-a-model-is-available-on-serverless

## Web UI

Go to [https://app.fireworks.ai/models?filter=LLM\&serverless=true](https://app.fireworks.ai/models?filter=LLM\&serverless=true)

# There’s a model I would like to use that isn’t available on Fireworks. Can I request it?

Source: https://docs.fireworks.ai/faq-new/models-inference/theres-a-model-i-would-like-to-use-that-isnt-available-on-fireworks-can-i-reques

Fireworks supports a wide array of custom models and actively takes feature requests for new, popular models to add to the platform.

**To request new models**:

1. **Join our [Discord server](https://discord.gg/fireworks-ai)**
2. Let us know which models you’d like to see
3. Provide **use case details**, if possible, to help us prioritize

We regularly evaluate and add new models based on:

* **Community requests**
* **Popular demand**
* **Technical feasibility**
* **Licensing requirements**

# What factors affect the number of simultaneous requests that can be handled?
Source: https://docs.fireworks.ai/faq-new/models-inference/what-factors-affect-the-number-of-simultaneous-requests-that-can-be-handled

Request handling capacity depends on several factors:

* **Model size and type**
* **Number of GPUs allocated** to the deployment
* **GPU type** (e.g., A100, H100)
* **Prompt size**
* **Generation token length**
* **Deployment type** (serverless vs. on-demand)

# Training Overview

Source: https://docs.fireworks.ai/fine-tuning/cli-reference

Launch RFT jobs using the eval-protocol CLI

The Eval Protocol CLI provides the fastest, most reproducible way to launch RFT jobs. This page covers everything you need to know about using `eval-protocol create rft`. Before launching, review [Training Prerequisites & Validation](/fine-tuning/training-prerequisites) for requirements, validation checks, and common errors. Already familiar with [firectl](/fine-tuning/cli-reference#using-firectl-cli-alternative)? Use it as an alternative to eval-protocol.

## Installation and setup

The following guide will help you:

* Upload your evaluator to Fireworks. If you don't have one yet, see [Concepts > Evaluators](/fine-tuning/evaluators)
* Upload your dataset to Fireworks
* Create and launch the RFT job

```bash theme={null}
pip install eval-protocol
```

Verify installation:

```bash theme={null}
eval-protocol --version
```

Configure your Fireworks API key:

```bash theme={null}
export FIREWORKS_API_KEY="fw_your_api_key_here"
```

Or create a `.env` file:

```bash theme={null}
FIREWORKS_API_KEY=fw_your_api_key_here
```

Before training, verify your evaluator works. This command discovers and runs your `@evaluation_test` with pytest. If a Dockerfile is present, it builds an image and runs the test in Docker; otherwise it runs on your host.

```bash theme={null}
cd evaluator_directory
ep local-test
```

From the directory where your evaluator and dataset (`dataset.jsonl`) are located, run:

```bash theme={null}
eval-protocol create rft \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --output-model my-model-name
```

The CLI will:

* Upload evaluator code (if changed)
* Upload dataset (if changed)
* Create the RFT job
* Display dashboard links for monitoring

Expected output:

```
Created Reinforcement Fine-tuning Job
name: accounts/your-account/reinforcementFineTuningJobs/abc123

Dashboard Links:
Evaluator: https://app.fireworks.ai/dashboard/evaluators/your-evaluator
Dataset: https://app.fireworks.ai/dashboard/datasets/your-dataset
RFT Job: https://app.fireworks.ai/dashboard/fine-tuning/reinforcement/abc123
```

Click the RFT Job link to watch training progress in real-time. See [Monitor Training](/fine-tuning/monitor-training) for details.
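If the dataset upload fails, malformed lines in `dataset.jsonl` are a common culprit. Here is a minimal sanity-check sketch you can run before launching (the filename and checks are illustrative, not part of the CLI):

```python theme={null}
import json

# Verify every non-empty line of the dataset parses as a JSON object
with open("dataset.jsonl") as f:
    for i, line in enumerate(f, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as e:
            print(f"Line {i}: invalid JSON ({e})")
            continue
        if not isinstance(row, dict):
            print(f"Line {i}: expected a JSON object, got {type(row).__name__}")
```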
## Common CLI options Customize your RFT job with these flags: **Model and output**: ```bash theme={null} --base-model accounts/fireworks/models/llama-v3p1-8b-instruct # Base model to fine-tune --output-model my-custom-name # Name for fine-tuned model ``` **Training parameters**: ```bash theme={null} --epochs 2 # Number of training epochs (default: 1) --learning-rate 5e-5 # Learning rate (default: 1e-4) --lora-rank 16 # LoRA rank (default: 8) --batch-size 65536 # Batch size in tokens (default: 32768) ``` **Rollout (sampling) parameters**: ```bash theme={null} --temperature 0.8 # Sampling temperature (default: 0.7) --n 8 # Number of rollouts per prompt (default: 4) --max-tokens 4096 # Max tokens per response (default: 32768) --top-p 0.95 # Top-p sampling (default: 1.0) --top-k 50 # Top-k sampling (default: 40) ``` **Remote environments**: ```bash theme={null} --remote-server-url https://your-evaluator.example.com # For remote rollout processing ``` **Force re-upload**: ```bash theme={null} --force # Re-upload evaluator even if unchanged ``` See all options: ```bash theme={null} eval-protocol create rft --help ``` ## Advanced options Track training metrics in W\&B for deeper analysis: ```bash theme={null} eval-protocol create rft \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --wandb-project my-rft-experiments \ --wandb-entity my-org ``` Set `WANDB_API_KEY` in your environment first. Save intermediate checkpoints during training: ```bash theme={null} firectl create rftj \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --checkpoint-frequency 500 # Save every 500 steps ... ``` Available in `firectl` only. Speed up training with multiple GPUs: ```bash theme={null} firectl create rftj \ --base-model accounts/fireworks/models/llama-v3p1-70b-instruct \ --accelerator-count 4 # Use 4 GPUs ... ``` Recommended for large models (70B+). For evaluators that need more time: ```bash theme={null} firectl create rftj \ --rollout-timeout 300 # 5 minutes per rollout ... ``` Default is 60 seconds. Increase for complex evaluations. ## Examples **Fast experimentation** (small model, 1 epoch): ```bash theme={null} eval-protocol create rft \ --base-model accounts/fireworks/models/qwen3-0p6b \ --output-model quick-test ``` **High-quality training** (more rollouts, higher temperature): ```bash theme={null} eval-protocol create rft \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --output-model high-quality-model \ --n 8 \ --temperature 1.0 ``` **Remote environment** (for multi-turn agents): ```bash theme={null} eval-protocol create rft \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --remote-server-url https://your-agent.example.com \ --output-model remote-agent ``` **Multiple epochs with custom learning rate**: ```bash theme={null} eval-protocol create rft \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --epochs 3 \ --learning-rate 5e-5 \ --output-model multi-epoch-model ``` ## Using `firectl` CLI (Alternative) For users already familiar with Fireworks `firectl`, you can create RFT jobs directly: ```bash theme={null} firectl create rftj \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --dataset accounts/your-account/datasets/my-dataset \ --evaluator accounts/your-account/evaluators/my-evaluator \ --output-model my-finetuned-model ``` **Differences from `eval-protocol`**: * Requires fully qualified resource names (accounts/...) 
* Must manually upload evaluators and datasets first
* More verbose but offers finer control
* Same underlying API as `eval-protocol`

See [firectl documentation](/tools-sdks/firectl/commands/create-reinforcement-fine-tuning-job) for all options.

## Next steps

Review requirements, validation, and common errors. Track job progress, inspect rollouts, and debug issues. Learn how to adjust parameters for better results.

# Remote Environment Setup

Source: https://docs.fireworks.ai/fine-tuning/connect-environments

Implement the /init endpoint to run evaluations in your infrastructure

If you already have an agent running in your product, or need to run rollouts on your own infrastructure, you can integrate it with RFT using the `RemoteRolloutProcessor`. This delegates rollout execution to an HTTP service you control.

Remote agents are ideal for:

* Multi-turn agentic workflows with tool use
* Access to private databases, APIs, or internal services
* Integration with existing agent codebases
* Complex simulations that require your infrastructure

New to RFT? Start with [local agents](/fine-tuning/quickstart-math) instead. They're simpler and cover most use cases. Only use remote agent environments when you need access to private infrastructure or have an existing agent to integrate.

## How remote rollouts work

*Remote rollout processor flow diagram showing the interaction between Eval Protocol, your remote server, and Fireworks Tracing.*

During training, Fireworks calls your service's `POST /init` endpoint with the dataset row and correlation metadata. Your agent executes the task (e.g., multi-turn conversation, tool calls, simulation steps), logging progress via Fireworks tracing. Your service sends structured logs tagged with rollout metadata to Fireworks so the system can track completion. Once Fireworks detects completion, it pulls the full trace and evaluates it using your scoring logic. Everything except implementing your remote server is handled automatically by Eval Protocol. You only need to implement the `/init` endpoint and add Fireworks tracing.

## Implementing the /init endpoint

Your remote service must implement a single `/init` endpoint that accepts rollout requests.

### Request schema

* `completion_params`: Model configuration including model name and inference parameters like `temperature`, `max_tokens`, etc.
* `messages`: Array of conversation messages to send to the model
* `tools`: Array of available tools for the model (for function calling)
* `model_base_url`: Base URL for making LLM calls through Fireworks tracing (includes correlation metadata)
* `metadata`: Rollout execution metadata for correlation (`rollout_id`, `run_id`, `row_id`, etc.)
* `api_key`: Fireworks API key to use for model calls

### Example request

```json theme={null}
{ "completion_params": { "model": "accounts/fireworks/models/llama-v3p1-8b-instruct", "temperature": 0.7, "max_tokens": 2048 }, "messages": [ { "role": "user", "content": "What is the weather in San Francisco?"
} ], "tools": [ { "type": "function", "function": { "name": "get_weather", "description": "Get the weather for a city", "parameters": { "type": "object", "properties": { "city": { "type": "string" } } } } } ], "model_base_url": "https://tracing.fireworks.ai/rollout_id/brave-night-42/invocation_id/wise-ocean-15/experiment_id/calm-forest-28/run_id/quick-river-07/row_id/bright-star-91", "metadata": { "invocation_id": "wise-ocean-15", "experiment_id": "calm-forest-28", "rollout_id": "brave-night-42", "run_id": "quick-river-07", "row_id": "bright-star-91" }, "api_key": "fw_your_api_key" } ``` ## Metadata correlation The `metadata` object contains correlation IDs that you must include when logging to Fireworks tracing. This allows Eval Protocol to match logs and traces back to specific evaluation rows. Required metadata fields: * `invocation_id` - Identifies the evaluation invocation * `experiment_id` - Groups related experiments * `rollout_id` - Unique ID for this specific rollout (most important) * `run_id` - Identifies the evaluation run * `row_id` - Links to the dataset row `RemoteRolloutProcessor` automatically generates these IDs and sends them to your server. You don't need to create them yourself—just pass them through to your logging. ## Fireworks tracing integration Your remote server must use Fireworks tracing to report rollout status. Eval Protocol polls these logs to detect when rollouts complete. ### Basic setup ```python theme={null} import logging from eval_protocol import Status, InitRequest, FireworksTracingHttpHandler, RolloutIdFilter # Configure Fireworks tracing handler globally fireworks_handler = FireworksTracingHttpHandler() logging.getLogger().addHandler(fireworks_handler) @app.post("/init") def init(request: InitRequest): # Create rollout-specific logger with filter rollout_logger = logging.getLogger(f"eval_server.{request.metadata.rollout_id}") rollout_logger.addFilter(RolloutIdFilter(request.metadata.rollout_id)) try: # Execute your agent logic here result = execute_agent(request) # Log successful completion with structured status rollout_logger.info( f"Rollout {request.metadata.rollout_id} completed", extra={"status": Status.rollout_finished()} ) return {"status": "success"} except Exception as e: # Log errors with structured status rollout_logger.error( f"Rollout {request.metadata.rollout_id} failed: {e}", extra={"status": Status.rollout_error(str(e))} ) raise ``` ### Key components 1. **FireworksTracingHttpHandler**: Sends logs to Fireworks tracing service 2. **RolloutIdFilter**: Tags logs with the rollout ID for correlation 3. **Status objects**: Structured status reporting that Eval Protocol can parse * `Status.rollout_finished()` - Signals successful completion * `Status.rollout_error(message)` - Signals failure with error details ### Alternative: Environment variable approach For simpler setups, you can use the `EP_ROLLOUT_ID` environment variable instead of manual filters. 
If your server processes one rollout at a time (e.g., serverless functions, container per request):

```python theme={null}
import os
import logging

from eval_protocol import Status, InitRequest, FireworksTracingHttpHandler

logger = logging.getLogger(__name__)

@app.post("/init")
def init(request: InitRequest):
    # Set the rollout ID in the environment before configuring the handler
    os.environ["EP_ROLLOUT_ID"] = request.metadata.rollout_id

    # Configure handler (automatically picks up EP_ROLLOUT_ID)
    logging.getLogger().addHandler(FireworksTracingHttpHandler())

    # Logs are automatically tagged with rollout_id
    logger.info("Processing rollout...")
    # ... execute agent logic ...
```

If your `/init` handler spawns separate Python processes for each rollout:

```python theme={null}
import os
import logging
import multiprocessing

from eval_protocol import FireworksTracingHttpHandler, InitRequest

def execute_rollout_step_sync(request):
    # Set EP_ROLLOUT_ID in the child process
    os.environ["EP_ROLLOUT_ID"] = request.metadata.rollout_id
    logging.getLogger().addHandler(FireworksTracingHttpHandler())
    # Execute your rollout logic here
    # Logs are automatically tagged

@app.post("/init")
async def init(request: InitRequest):
    # Do NOT set EP_ROLLOUT_ID in the parent process
    p = multiprocessing.Process(
        target=execute_rollout_step_sync,
        args=(request,)
    )
    p.start()
    return {"status": "started"}
```

### How Eval Protocol uses tracing

1. **Your server logs completion**: Uses `Status.rollout_finished()` or `Status.rollout_error()`
2. **Eval Protocol polls**: Searches Fireworks logs by `rollout_id` tag until the completion signal is found
3. **Status extraction**: Reads structured status fields (`code`, `message`, `details`) to determine the outcome
4. **Trace retrieval**: Fetches the full trace of model calls and tool use for evaluation

## Complete example

Here's a minimal but complete remote server implementation:

```python theme={null}
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from eval_protocol import InitRequest, FireworksTracingHttpHandler, RolloutIdFilter, Status
import logging

app = FastAPI()

# Setup Fireworks tracing
fireworks_handler = FireworksTracingHttpHandler()
logging.getLogger().addHandler(fireworks_handler)

@app.post("/init")
async def init(request: InitRequest):
    # Create rollout-specific logger
    rollout_logger = logging.getLogger(f"eval_server.{request.metadata.rollout_id}")
    rollout_logger.addFilter(RolloutIdFilter(request.metadata.rollout_id))

    rollout_logger.info(f"Starting rollout {request.metadata.rollout_id}")

    try:
        # Your agent logic here
        # 1. Make model calls using request.model_base_url
        # 2. Call tools, interact with environment
        # 3. Collect results
        result = run_your_agent(
            messages=request.messages,
            tools=request.tools,
            model_config=request.completion_params,
            api_key=request.api_key
        )

        # Signal completion
        rollout_logger.info(
            f"Rollout {request.metadata.rollout_id} completed successfully",
            extra={"status": Status.rollout_finished()}
        )
        return {"status": "success", "result": result}

    except Exception as e:
        # Signal error
        rollout_logger.error(
            f"Rollout {request.metadata.rollout_id} failed: {str(e)}",
            extra={"status": Status.rollout_error(str(e))}
        )
        return JSONResponse(
            status_code=500,
            content={"status": "error", "message": str(e)}
        )

def run_your_agent(messages, tools, model_config, api_key):
    # Implement your agent logic here
    # Make model calls, use tools, etc.
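    # To have these calls recorded as traces, point your OpenAI-compatible
    # client at the model_base_url from the /init request (see "Fireworks
    # tracing integration" above); calls made directly to the public API
    # may not be correlated with this rollout.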
    pass
```

## Testing locally

Before deploying, test your remote server locally:

```bash theme={null}
uvicorn main:app --reload --port 8080
```

In your evaluator test, point to your local server:

```python theme={null}
from eval_protocol.pytest import RemoteRolloutProcessor

rollout_processor = RemoteRolloutProcessor(
    remote_base_url="http://localhost:8080"
)
```

```bash theme={null}
pytest my-evaluator-name.py -vs
```

This sends test rollouts to your local server and verifies the integration works.

## Deploying your service

Once tested locally, deploy to production:

* ✅ Service is publicly accessible (or accessible via VPN/private network)
* ✅ HTTPS endpoint with valid SSL certificate (recommended)
* ✅ Authentication/authorization configured
* ✅ Monitoring and logging set up
* ✅ Auto-scaling configured for concurrent rollouts
* ✅ Error handling and retry logic implemented
* ✅ Service availability SLA meets training requirements

**Vercel/Serverless**:

* One rollout per function invocation
* Use environment variable approach
* Configure timeout for long-running evaluations

**AWS ECS/Kubernetes**:

* Handle concurrent requests with proper worker configuration
* Use RolloutIdFilter approach
* Set up load balancing

**On-premise**:

* Ensure network connectivity from Fireworks
* Configure firewall rules
* Set up VPN if needed for security

## Connecting to RFT

Once your remote server is deployed, create an RFT job that uses it:

```bash theme={null}
eval-protocol create rft \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --remote-server-url https://your-evaluator.example.com \
  --dataset my-dataset
```

The RFT job will send all rollouts to your remote server for evaluation during training.

## Troubleshooting

**Symptoms**: Rollouts show as timed out or never complete

**Solutions**:

* Check that your service is logging `Status.rollout_finished()` correctly
* Verify Fireworks tracing handler is configured
* Ensure `rollout_id` is included in log tags
* Check for exceptions being swallowed without logging

**Symptoms**: Eval Protocol can't match logs to rollouts

**Solutions**:

* Verify you're using the exact `rollout_id` from request metadata
* Check that RolloutIdFilter or EP\_ROLLOUT\_ID is set correctly
* Ensure logs are being sent to Fireworks (check tracing dashboard)

**Symptoms**: Training is slow, high rollout latency

**Solutions**:

* Scale your service to handle concurrent requests
* Optimize your agent logic (caching, async operations)
* Add more workers or instances
* Profile your code to find bottlenecks

**Symptoms**: Model calls fail, API errors

**Solutions**:

* Verify the API key is passed correctly from the request
* Check that your service has network access to Fireworks
* Ensure `model_base_url` is used for traced calls

## Example implementations

Learn by example: Complete walkthrough using a Vercel TypeScript server for SVG generation. Minimal Python implementation showing the basics.

## Next steps

Launch your RFT job using the CLI. Track rollout progress and debug issues. Full Remote Rollout Processor tutorial. Design effective reward functions.

# Deploying Fine Tuned Models

Source: https://docs.fireworks.ai/fine-tuning/deploying-loras

Deploy one or multiple LoRA models fine tuned on Fireworks

After fine-tuning your model on Fireworks, deploy it to make it available for inference. You can also upload and deploy LoRA models fine-tuned outside of Fireworks. See [importing fine-tuned models](/models/uploading-custom-models#importing-fine-tuned-models) for details.
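Once a deployment is ready (created with the commands in the sections below), you can query your fine-tuned model through the OpenAI-compatible Chat Completions API. A minimal sketch, assuming the `openai` Python package; the account and model IDs are placeholders:

```python theme={null}
from openai import OpenAI

# The Fireworks inference endpoint is OpenAI-compatible
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="fw_your_api_key_here",
)

response = client.chat.completions.create(
    model="accounts/<ACCOUNT_ID>/models/<MODEL_ID>",  # your fine-tuned model
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```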
## Single-LoRA deployment

Deploy your LoRA fine-tuned model with a single command that delivers performance matching the base model. This streamlined approach, called live merge, eliminates the previous two-step process and provides better performance compared to multi-LoRA deployments.

### Quick deployment

Deploy your LoRA fine-tuned model with one simple command:

```bash theme={null}
firectl create deployment "accounts/fireworks/models/<MODEL_ID>"
```

Your deployment will be ready to use once it completes, with performance that matches the base model.

## Multi-LoRA deployment

If you have multiple fine-tuned versions of the same base model (e.g., you've fine-tuned the same model for different use cases, applications, or prototyping), you can share a single base model deployment across these LoRA models to achieve higher utilization. Multi-LoRA deployment comes with performance tradeoffs. We recommend using it only if you need to serve multiple fine-tunes of the same base model and are willing to trade performance for higher deployment utilization.

### Deploy with CLI

Deploy the base model with addons enabled:

```bash theme={null}
firectl create deployment "accounts/fireworks/models/<BASE_MODEL_ID>" --enable-addons
```

Once the deployment is ready, load your LoRA models onto the deployment:

```bash theme={null}
firectl load-lora <LORA_MODEL_ID> --deployment <DEPLOYMENT_ID>
```

You can load multiple LoRA models onto the same deployment by repeating this command with different model IDs.

### When to use multi-LoRA deployment

Use multi-LoRA deployment when you:

* Need to serve multiple fine-tuned models based on the same base model
* Want to maximize deployment utilization
* Can accept some performance tradeoff compared to single-LoRA deployment
* Are managing multiple variants or experiments of the same model

## Next steps

Learn about deployment configuration and optimization. Upload LoRA models fine-tuned outside of Fireworks.

# Direct Preference Optimization

Source: https://docs.fireworks.ai/fine-tuning/dpo-fine-tuning

Direct Preference Optimization (DPO) fine-tunes models by training them on pairs of preferred and non-preferred responses to the same prompt. This teaches the model to generate more desirable outputs while reducing unwanted behaviors.

**Use DPO when:**

* Aligning model outputs with brand voice, tone, or style guidelines
* Reducing hallucinations or incorrect reasoning patterns
* Improving response quality where there's no single "correct" answer
* Teaching models to follow specific formatting or structural preferences

## Fine-tuning with DPO

Datasets must adhere strictly to the JSONL format, where each line represents a complete JSON-formatted training example.

**Minimum Requirements:**

* **Minimum examples needed:** 3
* **Maximum examples:** Up to 3 million examples per dataset
* **File format:** JSONL (each line is a valid JSON object)
* **Dataset Schema:** Each training sample must include the following fields:
  * An `input` field containing a `messages` array, where each message is an object with two fields:
    * `role`: one of `system`, `user`, or `assistant`
    * `content`: a string representing the message content
  * A `preferred_output` field containing an assistant message with an ideal response
  * A `non_preferred_output` field containing an assistant message with a suboptimal response

Here’s an example conversation dataset (one training example):

```json einstein_dpo.jsonl theme={null}
{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "What is Einstein famous for?"
} ], "tools": [] }, "preferred_output": [ { "role": "assistant", "content": "Einstein is renowned for his theory of relativity, especially the equation E=mc²." } ], "non_preferred_output": [ { "role": "assistant", "content": "He was a famous scientist." } ] } ``` We currently only support one-turn conversations for each example, where the preferred and non-preferred messages need to be the last assistant message. Save this dataset as jsonl file locally, for example `einstein_dpo.jsonl`. There are a couple ways to upload the dataset to Fireworks platform for fine tuning: `firectl`, `Restful API` , `builder SDK` or `UI`. * You can simply navigate to the dataset tab, click `Create Dataset` and follow the wizard. Dataset Pn * Upload dataset using `firectl` ```bash theme={null} firectl create dataset /path/to/file.jsonl ``` You need to make two separate HTTP requests. One for creating the dataset entry and one for uploading the dataset. Full reference here: [Create dataset](/api-reference/create-dataset). Note that the `exampleCount` parameter needs to be provided by the client. ```jsx theme={null} // Create Dataset Entry const createDatasetPayload = { datasetId: "trader-poe-sample-data", dataset: { userUploaded: {} } // Additional params such as exampleCount }; const urlCreateDataset = `${BASE_URL}/datasets`; const response = await fetch(urlCreateDataset, { method: "POST", headers: HEADERS_WITH_CONTENT_TYPE, body: JSON.stringify(createDatasetPayload) }); ``` ```jsx theme={null} // Upload JSONL file const urlUpload = `${BASE_URL}/datasets/${DATASET_ID}:upload`; const files = new FormData(); files.append("file", localFileInput.files[0]); const uploadResponse = await fetch(urlUpload, { method: "POST", headers: HEADERS, body: files }); ``` While all of the above approaches should work, `UI` is more suitable for smaller datasets `< 500MB` while `firectl` might work better for bigger datasets. Ensure the dataset ID conforms to the [resource id restrictions](/getting-started/concepts#resource-names-and-ids). Simple use `firectl` to create a new DPO job: ```bash theme={null} firectl create dpoj \ --base-model accounts/account-id/models/base-model-id \ --dataset accounts/my-account-id/datasets/my-dataset-id \ --output-model new-model-id ``` for our example, we might run the following command: ```bash theme={null} firectl create dpoj \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --dataset accounts/pyroworks/datasets/einstein-dpo \ --output-model einstein-dpo-model ``` to fine-tune a [Llama 3.1 8b Instruct](https://fireworks.ai/models/fireworks/llama-v3p1-8b-instruct) model with our Einstein dataset. Use `firectl` to monitor progress updates for the DPO fine-tuning job. ```bash theme={null} firectl get dpoj dpo-job-id ``` Once the job is complete, the `STATE` will be set to `JOB_STATE_COMPLETED`, and the fine-tuned model can be deployed. Once training completes, you can create a deployment to interact with the fine-tuned model. Refer to [deploying a fine-tuned model](/fine-tuning/fine-tuning-models#deploying-a-fine-tuned-model) for more details. ## Next Steps Explore other fine-tuning methods to improve model output for different use cases. Train models on input-output examples to improve task-specific performance. Optimize models using AI feedback for complex reasoning and decision-making. Fine-tune vision-language models to understand both images and text. 
# Agent Tracing Source: https://docs.fireworks.ai/fine-tuning/environments Understand where your agent runs and how tracing enables reinforcement fine-tuning ## Why agent tracing is critical to doing RL Reinforcement learning for agents depends on the entire chain of actions, tool calls, state transitions, and intermediate decisions—not just the final answer. Tracing captures this full trajectory so you can compute reliable rewards, reproduce behavior, and iterate quickly. **Why it matters** * **Credit assignment**: You need a complete record of each step to attribute reward to the decisions that caused success or failure. * **Reproducibility**: Deterministic replays require the exact prompts, model parameters, tool I/O, and environment state. * **Debuggability**: You can pinpoint where an episode fails (model output, tool error, data mismatch, timeout). Use Fireworks Tracing to drive the RL loop: emit structured logs with `FireworksTracingHttpHandler`, tag them with rollout correlation metadata, and signal completion using `Status.rollout_finished()` or `Status.rollout_error()`. When you make model calls, use the `model_base_url` issued by the trainer (it points to `https://tracing.fireworks.ai`) so chat completions are recorded as traces via an OpenAI-compatible endpoint. ## How Fireworks tracing works for RFT * **Traced completions**: The trainer provides a `model_base_url` on `https://tracing.fireworks.ai` that encodes correlation metadata. Your agent uses this OpenAI-compatible URL for LLM calls; tracing.fireworks.ai records the calls as traces automatically. * **Structured logging sink**: Your agent logs to Fireworks via `FireworksTracingHttpHandler`, including a structured `Status` when a rollout finishes or errors. * **Join traces and logs**: The trainer polls the logging sink by `rollout_id` to detect completion, then loads the full trace. Logs and traces are deterministically joined using the same correlation tags. ### Correlation metadata * **Correlate every log and trace** with these metadata fields provided in `/init`: `invocation_id`, `experiment_id`, `rollout_id`, `run_id`, `row_id`. * **Emit structured completion** from your server logs: * Add `FireworksTracingHttpHandler` and `RolloutIdFilter` to attach the `rollout_id` * Log `Status.rollout_finished()` on success, or `Status.rollout_error(message)` on failure * **Alternative**: If you run one rollout per process, set `EP_ROLLOUT_ID` in the child process instead of adding a filter. * **Record model calls as traces** by using the `model_base_url` from the trainer. It encodes the correlation IDs so your completions are automatically captured. ### tracing.fireworks.ai base URL * **Purpose-built for RL**: tracing.fireworks.ai is the Fireworks gateway used during RFT to capture traces and correlate them with rollout status. * **OpenAI-compatible**: It exposes Chat Completions-compatible endpoints, so you set it as your client's `base_url`. * **Correlation-aware**: The trainer embeds `rollout_id`, `run_id`, and related IDs into the `model_base_url` path so your completions are automatically tagged and joinable with logs. * **Drop-in usage**: Always use the `model_base_url` provided in `/init`—do not override it—so traces and logs are correctly linked. ## End-to-end tracing setup with tracing.fireworks.ai Your server implements `/init` and receives `metadata` and `model_base_url`. Attach `RolloutIdFilter` or set `EP_ROLLOUT_ID` for the current rollout. 
Call the model using `model_base_url` so chat completions are persisted as traces with correlation tags. Attach `FireworksTracingHttpHandler` to your logger and log `Status.rollout_finished()` or `Status.rollout_error()` when the rollout concludes. The trainer polls Fireworks logs by `rollout_id`, then loads the full traces; logs and traces share the same tags and are joined to finalize results and compute rewards. ### Remote server minimal example ```python remote_server.py theme={null} import logging import os from eval_protocol import InitRequest, Status, FireworksTracingHttpHandler, RolloutIdFilter # Configure Fireworks logging sink once at startup logging.getLogger().addHandler(FireworksTracingHttpHandler()) @app.post("/init") def init(request: InitRequest): # Option A: add filter that injects rollout_id on every log record logger = logging.getLogger(f"eval.{request.metadata.rollout_id}") logger.addFilter(RolloutIdFilter(request.metadata.rollout_id)) # Option B: per-process correlation (use when spawning one rollout per process) # os.environ["EP_ROLLOUT_ID"] = request.metadata.rollout_id # Make model calls via the correlated base URL so completions are traced # client = YourLLMClient(base_url=request.model_base_url, api_key=request.api_key) try: # ... execute rollout steps, tool calls, etc. ... logger.info("rollout finished", extra={"status": Status.rollout_finished()}) except Exception as e: logger.error("rollout error", extra={"status": Status.rollout_error(str(e))}) ``` Under the hood, the trainer polls the logging sink for `Status` and then loads the full trace for scoring. Because both logs and traces share the same correlation tags, Fireworks can deterministically join them to finalize results and compute rewards. ### What to capture in a trace * **Inputs and context**: Task ID, dataset split, initial state, seeds, and any retrieval results provided to the agent. * **Model calls**: System/user messages, tool messages, model/version, parameters (e.g., temperature, top\_p, seed), token counts, and optional logprobs. * **Tool and API calls**: Request/response summaries, status codes, durations, retries, and sanitized payload snippets. * **Environment state transitions**: Key state before/after each action that affects reward or next-step choices. * **Rewards**: Per-step shaping rewards, terminal reward, and component breakdowns with weights and units. * **Errors and timeouts**: Exceptions, stack traces, and where they occurred in the trajectory. * **Artifacts**: Files, code, unit test results, or other outputs needed to verify correctness. Never record secrets or raw sensitive data in traces. Redact tokens, credentials, and PII. Store references (IDs, hashes) instead of full payloads whenever possible. ### How tracing powers the training loop 1. **Rollout begins**: Trainer creates a rollout and sends it to your environment (local or remote) with a unique identifier. 2. **Agent executes**: Your agent emits spans for model calls, tool calls, and state changes; your evaluator computes step and terminal rewards. 3. **Rewards aggregate**: The trainer consumes your rewards and updates the policy; traces are stored for replay and analysis. 4. **Analyze and iterate**: You filter traces by reward, failure type, latency, or cost to refine prompts, tools, or reward shaping. ### How RemoteRolloutProcessor uses Fireworks Tracing 1. **Remote server logs completion** with structured status: `Status.rollout_finished()` or `Status.rollout_error()`. 2. 
**Trainer polls Fireworks Tracing** by `rollout_id` until completion status is found. 3. **Status extracted** from structured fields (`code`, `message`, `details`) to finalize the rollout result. ### Best practices * **Make it deterministic**: Record seeds, versions, and any non-deterministic knobs; prefer idempotent tool calls or cached fixtures in test runs. * **Keep signals bounded**: Normalize rewards to a consistent range (e.g., \[0, 1]) and document your components and weights. * **Summarize, don’t dump**: Log compact summaries and references for large payloads to keep traces fast and cheap. * **Emit heartbeats**: Send periodic status updates so long-running rollouts are observable; always finalize with success or failure. * **Use consistent schemas**: Keep field names and structures stable to enable dashboards, filters, and automated diagnostics. ## Next steps Implement `/init`, tracing, and structured status for remote agents Build and deploy a local evaluator in under 10 minutes Launch your RFT job Design effective reward functions for your task # Evaluators Source: https://docs.fireworks.ai/fine-tuning/evaluators Understand the fundamentals of evaluators and reward functions in reinforcement fine-tuning An evaluator (also called a reward function) is code that scores model outputs from 0.0 (worst) to 1.0 (best). During reinforcement fine-tuning, your evaluator guides the model toward better responses by providing feedback on its generated outputs. ## Why evaluators matter Unlike supervised fine-tuning where you provide perfect examples, RFT uses evaluators to define what "good" means. This is powerful because: * **No perfect data required** - Just prompts and a way to score outputs * **Encourages exploration** - Models learn strategies, not just patterns * **Noise tolerant** - Even noisy signals can improve model performance * **Encodes domain expertise** - Complex rules and logic that are hard to demonstrate with examples ## Anatomy of an evaluator Every evaluator has three core components: ### 1. Input data The prompt and any ground truth data needed for evaluation: ```python theme={null} { "messages": [ {"role": "system", "content": "You are a math tutor."}, {"role": "user", "content": "What is 15 * 23?"} ], "ground_truth": "345" # Optional additional data } ``` ### 2. Model output The assistant's response to evaluate: ```python theme={null} { "role": "assistant", "content": "Let me calculate that step by step:\n15 * 23 = 345" } ``` ### 3. Scoring logic Code that compares the output to your criteria: ```python theme={null} def evaluate(model_output: str, ground_truth: str) -> float: # Extract answer from model's response predicted = extract_number(model_output) # Score it if predicted == int(ground_truth): return 1.0 # Perfect else: return 0.0 # Wrong ``` ## Types of evaluators ### Rule-based evaluators Check if outputs match specific patterns or rules: * **Exact match** - Output exactly equals expected value * **Contains** - Output includes required text * **Regex** - Output matches a pattern * **Format validation** - Output follows required structure (e.g., valid JSON) Start with rule-based evaluators. They're simple, fast, and surprisingly effective. 
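As a concrete illustration, here is a minimal sketch of these checks as standalone scoring functions (the function names are illustrative, not part of any SDK):

```python theme={null}
import json
import re

def score_exact_match(output: str, expected: str) -> float:
    # Exact match: output exactly equals the expected value
    return 1.0 if output.strip() == expected.strip() else 0.0

def score_contains(output: str, required: str) -> float:
    # Contains: output includes the required text
    return 1.0 if required in output else 0.0

def score_regex(output: str, pattern: str) -> float:
    # Regex: output matches a required pattern
    return 1.0 if re.search(pattern, output) else 0.0

def score_json_format(output: str) -> float:
    # Format validation: output must be valid JSON
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0
```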
### Execution-based evaluators Run code or commands to verify correctness: * **Code execution** - Run generated code and check results * **Test suites** - Pass generated code through unit tests * **API calls** - Execute commands and verify outcomes * **Simulations** - Run agents in environments and measure success ### LLM-as-judge evaluators Use another model to evaluate quality: * **Rubric scoring** - Judge outputs against criteria * **Comparative ranking** - Compare multiple outputs * **Natural language assessment** - Evaluate subjective qualities like helpfulness ## Scoring guidelines Your evaluator should return a score between 0.0 and 1.0: | Score range | Meaning | Example | | ----------- | ------- | --------------------------- | | 1.0 | Perfect | Exact correct answer | | 0.7-0.9 | Good | Right approach, minor error | | 0.4-0.6 | Partial | Some correct elements | | 0.1-0.3 | Poor | Wrong but attempted | | 0.0 | Failure | Completely wrong | Binary scoring (0.0 or 1.0) works well for many tasks. Use gradual scoring when you can meaningfully distinguish between partial successes. ## Best practices Begin with basic evaluation logic and refine over time: ```python theme={null} # Start here score = 1.0 if predicted == expected else 0.0 # Then refine if needed score = calculate_similarity(predicted, expected) ``` Start with the simplest scoring approach that captures your core requirements. You can always add sophistication later based on training results. Training generates many outputs to evaluate, so performance matters: * **Cache expensive computations**: Store results of repeated calculations * **Use timeouts for code execution**: Prevent hanging on infinite loops * **Batch API calls when possible**: Reduce network overhead * **Profile slow evaluators and optimize**: Identify and fix bottlenecks Aim for evaluations that complete in seconds, not minutes. Slow evaluators directly increase training time and cost. Models will generate unexpected outputs, so build robust error handling: ```python theme={null} try: result = execute_code(model_output) score = check_result(result) except TimeoutError: score = 0.0 # Code ran too long except SyntaxError: score = 0.0 # Invalid code except Exception as e: score = 0.0 # Any other error ``` Anticipate and gracefully handle malformed outputs, syntax errors, timeouts, and edge cases specific to your domain. Models will exploit evaluation weaknesses, so design defensively: **Example: Length exploitation** If you score outputs by length, the model might generate verbose nonsense. Add constraints: ```python theme={null} # Bad: Model learns to write long outputs score = min(len(output) / 1000, 1.0) # Better: Require correctness AND reasonable length if is_correct(output): score = 1.0 if len(output) < 500 else 0.8 else: score = 0.0 ``` **Example: Format over substance** If you only check JSON validity, the model might return valid but wrong JSON. Check content too: ```python theme={null} # Bad: Only checks format score = 1.0 if is_valid_json(output) else 0.0 # Better: Check format AND content if is_valid_json(output): data = json.loads(output) score = evaluate_content(data) else: score = 0.0 ``` Always combine format checks with content validation to prevent models from gaming the system. ## Debugging evaluators Test your evaluator before training. 
Look for:

* **Correct scoring** - Good outputs score high, bad outputs score low
* **Reasonable runtime** - Each evaluation completes in reasonable time
* **Clear feedback** - Evaluation reasons explain scores

Run your evaluator on manually created good and bad examples first. If it doesn't score them correctly, fix the evaluator before training.

## Next steps

Connect to your environment for single and multi-turn agents. Follow a complete example building and using an evaluator.

# Supervised Fine Tuning - Text

Source: https://docs.fireworks.ai/fine-tuning/fine-tuning-models

This guide will focus on using supervised fine-tuning to fine-tune and deploy a model with on-demand hosting.

## Fine-tuning a model using SFT

You can confirm that a base model is available to fine-tune by looking for the `Tunable` tag in the model library or by using:

```bash theme={null}
firectl get model <MODEL_ID> -a fireworks
```

and looking for `Tunable: true`. Some base models cannot be tuned on Fireworks (`Tunable: false`) but still list support for LoRA (`Supports Lora: true`). This means that users can tune a LoRA for this base model on a separate platform and upload it to Fireworks for inference. Consult [importing fine-tuned models](/models/uploading-custom-models#importing-fine-tuned-models) for more information.

Datasets must be in JSONL format, where each line represents a complete JSON-formatted training example. Make sure your data conforms to the following restrictions:

* **Minimum examples:** 3
* **Maximum examples:** 3 million per dataset
* **File format:** `.jsonl`
* **Message schema:** Each training sample must include a `messages` array, where each message is an object with the following fields:
  * `role`: one of `system`, `user`, or `assistant`. A message with the `system` role is optional, but if specified, it must be the first message of the conversation
  * `content`: a string representing the message content
  * `weight`: optional key whose value must be either 0 or 1; the message is skipped during training if the value is set to 0
* **Sample weight:** Optional key `weight` at the root of the JSON object. It can be any floating point number (positive, negative, or 0) and is used as a loss multiplier for tokens in that sample. If used, this field must be present in all samples in the dataset.

Here is an example conversation dataset:

```json theme={null}
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."}
  ]
}
{
  "messages": [
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2", "weight": 0},
    {"role": "user", "content": "Now what is 2+2?"},
    {"role": "assistant", "content": "4"}
  ]
}
```

Here is an example conversation dataset with sample weights:

```json theme={null}
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."}
  ],
  "weight": 0.5
}
{
  "messages": [
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2", "weight": 0},
    {"role": "user", "content": "Now what is 2+2?"},
    {"role": "assistant", "content": "4"}
  ],
  "weight": 1.0
}
```

We also support function calling datasets with a list of tools.
An example would look like:

```json theme={null}
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_car_specs",
        "description": "Fetches detailed specifications for a car based on the given trim ID.",
        "parameters": {
          "trimid": {
            "description": "The trim ID of the car for which to retrieve specifications.",
            "type": "int",
            "default": ""
          }
        }
      }
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "What is the specs of the car with trim 121?"
    },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "get_car_specs",
            "arguments": "{\"trimid\": 121}"
          }
        }
      ]
    }
  ]
}
```

For the subset of models that support thinking (e.g., DeepSeek R1, GPT OSS models, and Qwen3 thinking models), we also support fine-tuning with thinking traces. If you wish to fine-tune with thinking traces, the dataset can also include thinking traces for assistant turns. Though optional, each assistant turn should ideally include a thinking trace. For example:

```json theme={null}
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris.", "reasoning_content": "The user is asking about the capital city of France, it should be Paris."}
  ]
}
{
  "messages": [
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2", "weight": 0, "reasoning_content": "The user is asking about the result of 1+1, the answer is 2."},
    {"role": "user", "content": "Now what is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": "The user is asking about the result of 2+2, the answer should be 4."}
  ]
}
```

Note that when fine-tuning with intermediate thinking traces, the number of total tuned tokens could exceed the number of total tokens in the dataset. This is because we perform preprocessing and expand the dataset to ensure train-inference consistency.

There are a couple of ways to upload the dataset to the Fireworks platform for fine-tuning: `firectl`, `Restful API`, `builder SDK`, or `UI`.

* You can simply navigate to the dataset tab, click `Create Dataset`, and follow the wizard.

```bash theme={null}
firectl create dataset /path/to/jsonl/file
```

You need to make two separate HTTP requests: one for creating the dataset entry and one for uploading the dataset. Full reference here: [Create dataset](/api-reference/create-dataset). Note that the `exampleCount` parameter needs to be provided by the client.

```jsx theme={null}
// Create Dataset Entry
const createDatasetPayload = {
  datasetId: "trader-poe-sample-data",
  dataset: { userUploaded: {} }
  // Additional params such as exampleCount
};
const urlCreateDataset = `${BASE_URL}/datasets`;
const response = await fetch(urlCreateDataset, {
  method: "POST",
  headers: HEADERS_WITH_CONTENT_TYPE,
  body: JSON.stringify(createDatasetPayload)
});
```

```jsx theme={null}
// Upload JSONL file
const urlUpload = `${BASE_URL}/datasets/${DATASET_ID}:upload`;
const files = new FormData();
files.append("file", localFileInput.files[0]);
const uploadResponse = await fetch(urlUpload, {
  method: "POST",
  headers: HEADERS,
  body: files
});
```

While all of the above approaches should work, `UI` is more suitable for smaller datasets `< 500MB`, while `firectl` might work better for bigger datasets. Ensure the dataset ID conforms to the [resource id restrictions](/getting-started/concepts#resource-names-and-ids).

There are also a couple of ways to launch fine-tuning jobs. We highly recommend creating supervised fine-tuning jobs via the `UI`.
Simply navigate to the `Fine-Tuning` tab, click `Fine-Tune a Model`, and follow the wizard from there. You can even pick a LoRA model to start the fine-tuning for continued training. Ensure the fine-tuned model ID conforms to the [resource id restrictions](/getting-started/concepts#resource-names-and-ids).

This will return a fine-tuning job ID. For a full explanation of the settings available to control the fine-tuning process, including learning rate and epochs, consult [additional SFT job settings](#additional-sft-job-settings).

```bash theme={null}
firectl create sftj \
  --base-model <BASE_MODEL> \
  --dataset <DATASET> \
  --output-model <OUTPUT_MODEL>
```

Similar to the UI, instead of tuning a base model, you can also start tuning from a previous LoRA model using:

```bash theme={null}
firectl create sftj \
  --warm-start-from <EXISTING_LORA_MODEL> \
  --dataset <DATASET> \
  --output-model <OUTPUT_MODEL>
```

Notice that we use `--warm-start-from` instead of `--base-model` when creating this job.

With the `UI`, once the job is created, it will show in the list of jobs. Click to view the job details and monitor the job's progress.

With `firectl`, you can monitor the progress of the tuning job by running:

```bash theme={null}
firectl get sftj <JOB_ID>
```

Once the job successfully completes, you will see the new LoRA model in your model list:

```bash theme={null}
firectl list models
```

For a complete Python SDK example that demonstrates the full workflow (creating datasets, uploading files, and launching a supervised fine-tuning job), see the [Python SDK workflow example](https://github.com/fw-ai-external/python-sdk/blob/main/examples/sftj_workflow.py).

## Deploying a fine-tuned model

After fine-tuning completes, deploy your model to make it available for inference:

```bash theme={null}
firectl create deployment <MODEL_ID>
```

This creates a dedicated deployment with performance matching the base model. For more details on deploying fine-tuned models, including multi-LoRA deployments, see the [Deploying Fine Tuned Models guide](/fine-tuning/deploying-loras).

## Additional SFT job settings

Additional tuning settings are available when starting a fine-tuning job. All of the below settings are optional and will have reasonable defaults if not specified. For settings that affect tuning quality like `epochs` and `learning rate`, we recommend using default settings and only changing hyperparameters if results are not as desired.

By default, the fine-tuning job will run evaluation by running the fine-tuned model against an evaluation set that's created by automatically carving out a portion of your training set. You have the option to explicitly specify a separate evaluation dataset to use instead of carving out training data. `evaluation_dataset`: The ID of a separate dataset to use for evaluation. Must be pre-uploaded via firectl.

```shell theme={null}
firectl create sftj \
  --evaluation-dataset my-eval-set \
  --base-model MY_BASE_MODEL \
  --dataset cancerset \
  --output-model my-tuned-model
```

Depending on the size of the model, the default context size will be different. For most models, the default context size is >= 32768. Training examples will be cut off at 32768 tokens. Usually you do not need to set the max context length unless an out-of-memory error is encountered with a higher LoRA rank and a large max context length.

```shell theme={null}
firectl create sftj \
  --max-context-length 65536 \
  --base-model MY_BASE_MODEL \
  --dataset cancerset \
  --output-model my-tuned-model
```

Batch size is the number of tokens packed into one forward step during training.
One batch could consist of multiple training samples. We do sequence packing on the training samples, and batch size controls how many total tokens will be packed into each batch.

```shell theme={null}
firectl create sftj \
  --batch-size 65536 \
  --base-model MY_BASE_MODEL \
  --dataset cancerset \
  --output-model my-tuned-model
```

Epochs are the number of passes over the training data. Our default value is 1. If the model does not follow the training data as much as expected, increase the number of epochs by 1 or 2. Non-integer values are supported. **Note: we set a max value of 3 million dataset examples × epochs**

```shell theme={null}
firectl create sftj \
  --epochs 2.0 \
  --base-model MY_BASE_MODEL \
  --dataset cancerset \
  --output-model my-tuned-model
```

Learning rate controls how fast the model updates from data. We generally do not recommend changing the learning rate. The default value is set automatically based on your selected model.

```shell theme={null}
firectl create sftj \
  --learning-rate 0.0001 \
  --base-model MY_BASE_MODEL \
  --dataset cancerset \
  --output-model my-tuned-model
```

Learning rate warmup steps control the number of training steps during which the learning rate will be linearly ramped up to the set learning rate.

```shell theme={null}
firectl create sftj \
  --learning-rate 0.0001 \
  --learning-rate-warmup-steps 200 \
  --base-model MY_BASE_MODEL \
  --dataset cancerset \
  --output-model my-tuned-model
```

Gradient accumulation steps control the number of forward and backward steps to take (accumulating gradients) before `optimizer.step()` is called. Gradient accumulation steps > 1 increases the effective batch size.

```shell theme={null}
firectl create sftj \
  --gradient-accumulation-steps 4 \
  --base-model MY_BASE_MODEL \
  --dataset cancerset \
  --output-model my-tuned-model
```

LoRA rank refers to the number of parameters that will be tuned in your LoRA add-on. Higher LoRA rank increases the amount of information that can be captured while tuning. LoRA rank must be a power of 2 up to 64. Our default value is 8.

```shell theme={null}
firectl create sftj \
  --lora-rank 16 \
  --base-model MY_BASE_MODEL \
  --dataset cancerset \
  --output-model my-tuned-model
```

The fine-tuning service integrates with Weights & Biases to provide observability into the tuning process. To use this feature, you must have a Weights & Biases account and have provisioned an API key.

```shell theme={null}
firectl create sftj \
  --wandb-entity my-org \
  --wandb-api-key xxx \
  --wandb-project "My Project" \
  --base-model MY_BASE_MODEL \
  --dataset cancerset \
  --output-model my-tuned-model
```

By default, the fine-tuning job will generate a random unique ID for the model. This ID is used to refer to the model at inference time. You can optionally specify a custom ID, within [ID constraints](/getting-started/concepts#resource-names-and-ids).

```shell theme={null}
firectl create sftj \
  --output-model my-model \
  --base-model MY_BASE_MODEL \
  --dataset cancerset
```

By default, the fine-tuning job will generate a random unique ID for the fine-tuning job. You can optionally choose a custom ID.
```shell theme={null} firectl create sftj \ --job-id my-fine-tuning-job \ --base-model MY_BASE_MODEL \ --dataset cancerset \ --output-model my-tuned-model ``` ## Appendix * `Python SDK` [references](/tools-sdks/python-sdk) * `Restful API` [references](/api-reference/introduction) * `firectl` [references](/tools-sdks/firectl/firectl) * [Complete Python SDK workflow example](https://github.com/fw-ai-external/python-sdk/blob/main/examples/sftj_workflow.py) for a code-only implementation # Supervised Fine Tuning - Vision Source: https://docs.fireworks.ai/fine-tuning/fine-tuning-vlm Learn how to fine-tune vision-language models on Fireworks AI with image and text datasets Vision-language model (VLM) fine-tuning allows you to adapt pre-trained models that can understand both text and images to your specific use cases. This is particularly valuable for tasks like document analysis, visual question answering, image captioning, and domain-specific visual understanding. To see all vision models that support fine-tuning, visit the [Model Library for vision models](https://app.fireworks.ai/models?filter=vision\&tunable=true). ## Fine-tuning a VLM using LoRA vision datasets must be in JSONL format in OpenAI-compatible chat format. Each line represents a complete training example. **Dataset Requirements:** * **Format**: `.jsonl` file * **Minimum examples**: 3 * **Maximum examples**: 3 million per dataset * **Images**: Must be base64 encoded with proper MIME type prefixes * **Supported image formats**: PNG, JPG, JPEG **Message Schema:** Each training example must include a `messages` array where each message has: * `role`: one of `system`, `user`, or `assistant` * `content`: an array containing text and image objects or just text ### Basic VLM Dataset Example ```json theme={null} { "messages": [ { "role": "system", "content": "You are a helpful visual assistant that can analyze images and answer questions about them." }, { "role": "user", "content": [ { "type": "text", "text": "What objects do you see in this image?" }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..." } } ] }, { "role": "assistant", "content": "I can see a red car, a tree, and a blue house in this image." } ] } ``` ### If your dataset contains image urls Images must be base64 encoded with MIME type prefixes. If your dataset contains image URLs, you'll need to download and encode them to base64. ```json theme={null} { "type": "image_url", "image_url": { // ❌ Raw HTTP/HTTPS URLs are NOT supported "url": "https://example.com/image.jpg" } } ``` ```json theme={null} { "type": "image_url", "image_url": { // ✅ Use data URI with base64 encoding // Format: data:image/{format};base64,{base64_encoded_data} "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..." } } ``` You can use the following script to automatically convert your dataset to the correct format: **Usage:** ```bash theme={null} # Install required dependency pip install requests # Download the script wget https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/utils/download_images_and_encode_to_b64.py # Run the script - will output a new dataset _base64.jsonl python download_images_and_encode_to_b64.py --input_file ``` ### Advanced Dataset Examples ```json expandable theme={null} { "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Compare these two images and tell me the differences" }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..." 
} }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..." } } ] }, { "role": "assistant", "content": "The first image shows a daytime scene while the second shows the same location at night. The lighting and shadows are completely different." } ] } ``` ```json expandable theme={null} { "messages": [ { "role": "system", "content": "You are a helpful visual assistant that can analyze images and remember details from previous images in our conversation." }, { "role": "user", "content": [ { "type": "text", "text": "Can you describe this kitchen layout for me?" }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..." } } ] }, { "role": "assistant", "content": "This is a modern open-plan kitchen with white cabinets, granite countertops, and stainless steel appliances. The island is positioned in the center with bar seating for 3-4 people. There's a large window above the sink providing natural light." }, { "role": "user", "content": "Now look at this living room. Do you think the styles would work well together?" }, { "role": "assistant", "content": "I'd be happy to help compare the styles! However, I don't see a living room image in your message. Could you please share the living room photo so I can analyze how well it would coordinate with the modern kitchen style we just discussed?" }, { "role": "user", "content": [ { "type": "text", "text": "Sorry about that! Here's the living room:" }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4BBBSkZJRgABAQ..." } } ] }, { "role": "assistant", "content": "Perfect! This living room has a complementary modern aesthetic that would work beautifully with the kitchen. Both spaces feature clean lines and a neutral color palette. The living room's contemporary furniture and the kitchen's sleek design would create a cohesive flow in an open floor plan. The warm wood accents in the living room would add nice contrast to the kitchen's cooler tones." }, ] } ``` ### Try with an Example Dataset To get a feel for how VLM fine-tuning works, you can use an example vision dataset. This is a classification dataset that contains images of food with `` tags for reasoning. ```bash theme={null} # Download the example dataset curl -L -o food_reasoning.jsonl https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl ``` ```bash theme={null} # Download the example dataset wget https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl ``` Upload your prepared JSONL dataset to Fireworks for training: ```bash theme={null} firectl create dataset my-vlm-dataset /path/to/vlm_training_data.jsonl ``` Navigate to the Datasets tab in the Fireworks console, click "Create Dataset", and upload your JSONL file through the wizard. 
Dataset creation interface ```javascript theme={null} // Create dataset entry const createDatasetPayload = { datasetId: "my-vlm-dataset", dataset: { userUploaded: {} } }; const response = await fetch(`${BASE_URL}/datasets`, { method: "POST", headers: { "Authorization": `Bearer ${API_KEY}`, "Content-Type": "application/json" }, body: JSON.stringify(createDatasetPayload) }); // Upload JSONL file const formData = new FormData(); formData.append("file", fileInput.files[0]); const uploadResponse = await fetch(`${BASE_URL}/datasets/my-vlm-dataset:upload`, { method: "POST", headers: { "Authorization": `Bearer ${API_KEY}` }, body: formData }); ``` For larger datasets (>500MB), use `firectl` as it handles large uploads more reliably than the web interface. For enhanced data control and security, we also support bring your own bucket (BYOB) configurations. See our [Secure Fine Tuning](/fine-tuning/secure-fine-tuning#gcs-bucket-integration) guide for setup details. Create a supervised fine-tuning job for your VLM: ```bash theme={null} firectl create sftj \ --base-model accounts/fireworks/models/qwen2p5-vl-32b-instruct \ --dataset my-vlm-dataset \ --output-model my-custom-vlm \ --epochs 3 ``` For additional parameters like learning rates, evaluation datasets, and batch sizes, see [Additional SFT job settings](/fine-tuning/fine-tuning-models#additional-sft-job-settings). 1. Navigate to the Fine-tuning tab in the Fireworks console 2. Click "Create Fine-tuning Job" 3. Select your VLM base model (Qwen 2.5 VL) 4. Choose your uploaded dataset 5. Configure training parameters 6. Launch the job Fine-tuning job creation interface VLM fine-tuning jobs typically take longer than text-only models due to the additional image processing. Expect training times of several hours depending on dataset size and model complexity. Track your VLM fine-tuning job in the [Fireworks console](https://app.fireworks.ai/dashboard/fine-tuning). VLM fine-tuning job in the Fireworks console Monitor key metrics: * **Training loss**: Should generally decrease over time * **Evaluation loss**: Monitor for overfitting if using evaluation dataset * **Training progress**: Epochs completed and estimated time remaining Your VLM fine-tuning job is complete when the status shows `COMPLETED` and your custom model is ready for deployment. Once training is complete, deploy your custom VLM: ```bash theme={null} # Create a deployment for your fine-tuned VLM firectl create deployment my-custom-vlm # Check deployment status firectl get deployment accounts/your-account/deployment/deployment-id ``` Deploy from the UI using the `Deploy` dropdown in the fine-tuning job page. Deploy dropdown in the fine-tuning job page ## Advanced Configuration For additional fine-tuning parameters and advanced settings like custom learning rates, batch sizes, and optimization options, see the [Additional SFT job settings](/fine-tuning/fine-tuning-models#additional-sft-job-settings) section in our comprehensive fine-tuning guide. ## Interactive Tutorials: Fine-tuning VLMs For a hands-on, step-by-step walkthrough of VLM fine-tuning, we've created two fine tuning cookbooks that demonstrates the complete process from dataset preparation, model deployment to evaluation. 
**Google Colab Notebook: Fine-tune Qwen2.5 VL on Fireworks AI** **Finetuning a VLM to beat SOTA closed source model** The cookbooks above cover the following: * Setting up your environment with Fireworks CLI * Preparing vision datasets in the correct format * Launching and monitoring VLM fine-tuning jobs * Testing your fine-tuned model * Best practices for VLM fine-tuning * Running inference on serverless VLMs * Running evals to show performance gains ## Testing Your Fine-tuned VLM After deployment, test your fine-tuned VLM using the same API patterns as base VLMs: ```python Python (OpenAI SDK) theme={null} import openai client = openai.OpenAI( base_url="https://api.fireworks.ai/inference/v1", api_key="", ) response = client.chat.completions.create( model="accounts/your-account/models/my-custom-vlm", messages=[{ "role": "user", "content": [{ "type": "image_url", "image_url": { "url": "https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/images/icecream.jpeg" }, },{ "type": "text", "text": "What's in this image?", }], }] ) print(response.choices[0].message.content) ``` If you fine-tuned using the example dataset, your model should include `` tags in its response. # Fine Tuning Overview Source: https://docs.fireworks.ai/fine-tuning/finetuning-intro Fireworks helps you fine-tune models to improve quality and performance for your product use cases, without the burden of building & maintaining your own training infrastructure. ## Fine-tuning methods Train models using custom reward functions for complex reasoning tasks Train text models with labeled examples of desired outputs Train vision-language models with image and text pairs Align models with human preferences using pairwise comparisons ## Supported models Fireworks supports fine-tuning for most major open source models, including DeepSeek, Qwen, Kimi, and Llama model families, and supports fine-tuning large state-of-the-art models like Kimi K2 0905 and DeepSeek V3.1. To see all models that support fine-tuning, visit the [Model Library for text models](https://app.fireworks.ai/models?filter=LLM\&tunable=true) or [vision models](https://app.fireworks.ai/models?filter=vision\&tunable=true). ## Fireworks uses LoRA Fireworks uses **[Low-Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685)** to fine-tune models efficiently. The fine-tuning process generates a LoRA addon—a small adapter that modifies the base model's behavior without retraining all its weights. This approach is: * **Faster and cheaper** - Train models in hours, not days * **Easy to deploy** - Deploy LoRA addons instantly on Fireworks * **Flexible** - Run [multiple LoRAs](/fine-tuning/deploying-loras#multi-lora-deployment) on a single base model deployment ## When to use Supervised Fine-Tuning (SFT) vs. Reinforcement Fine-Tuning (RFT) In supervised fine-tuning, you provide a dataset with labeled examples of “good” outputs. In reinforcement fine-tuning, you provide a grader function that can be used to score the model's outputs. The model is iteratively trained to produce outputs that maximize this score. To learn more about the differences between SFT and RFT, see [when to use Supervised Fine-Tuning (SFT) vs. Reinforcement Fine-Tuning (RFT)](./finetuning-intro#when-to-use-supervised-fine-tuning-sft-vs-reinforcement-fine-tuning-models-rft). Supervised fine-tuning (SFT) works well for many common scenarios, especially when: * You have a sizable dataset (\~1000+ examples) with high-quality, ground-truth lables. 
* The dataset covers most possible input scenarios. * Tasks are relatively straightforward, such as: * Classification * Content extraction However, SFT may struggle in situations where: * Your dataset is small. * You lack ground-truth outputs (a.k.a. “golden generations”). * The task requires multi-step reasoning. Here is a simple decision tree: ```mermaid theme={null} flowchart TD B{"Do you have labeled ground truth data?"} B --"Yes"--> C{"How much?"} C --"more than 1000 examples"--> D["SFT"] C --"100-1000 examples"-->F{"Does reasoning help?"} C --"~100s examples"--> E["RFT"] F --"No"-->D F -- "Yes" -->E B --"No"--> G{"Is this a verifiable task (see below)?"} G -- "Yes" -->E G -- "No"-->H["RLHF / LLM as judge"] ``` `Verifiable` refers to whether it is relatively easy to make a judgement on the quality of the model generation. # Basics Source: https://docs.fireworks.ai/fine-tuning/how-rft-works Understand the reinforcement learning fundamentals behind RFT ## What is reinforcement fine-tuning? In traditional supervised fine-tuning, you provide a dataset with labeled examples showing exactly what the model should output. In reinforcement fine-tuning, you instead provide: 1. **A dataset**: Prompts, with input examples for the model to respond to 2. **An evaluator**: Code that scores the model's outputs from 0.0 (bad) to 1.0 (good), also known as a reward function 3. **An agent**: An LLM application, with access to tools, APIs, and data needed for your task During training, the model generates responses to each prompt, receives scores from your reward function, and produces outputs that maximize the reward. ## Use cases Reinforcement fine-tuning helps you train models to excel at: * **Code generation and analysis** - Writing and debugging functions with verifiable execution results or test outcomes * **Structured output generation** - JSON formatting, data extraction, classification, and schema compliance with programmatic validation * **Domain-specific reasoning** - Legal analysis, financial modeling, or medical triage with verifiable criteria and compliance checks * **Tool-using agents** - Multi-step workflows where agents call external APIs with measurable success criteria ## How it works Define how you'll score model outputs from 0 to 1. For example, scoring outputs higherchecking if your agent called the right tools, or if your LLM-as-judge rates the output highly. Create a JSONL file with prompts (system and user messages). These will be used to generate rollouts during training. Train locally, or connect your agent as a remote server to Fireworks with our /init and /status endpoints. Create an RFT job via the UI or CLI. Fireworks orchestrates rollouts, evaluates them, and trains the model to maximize reward. Once training completes, deploy your fine-tuned LoRA model to production with an on-demand deployment. ### RFT works best when: 1. You can determine whether a model's output is "good" or "bad," even if only approximately 2. You have prompts but lack perfect "golden" completions to learn from 3. The task requires multi-step reasoning where evaluating intermediate steps is hard 4. 
You want the model to explore creative solutions beyond your training examples ## Next steps Learn how to design effective reward functions Learn how to launch and configure RFT jobs # Monitor Training Source: https://docs.fireworks.ai/fine-tuning/monitor-training Track RFT job progress and diagnose issues in real-time Once your RFT job is running, the Fireworks dashboard provides comprehensive monitoring tools to track progress, inspect individual rollouts, and debug issues as they arise. ## Accessing the monitoring dashboard After creating your RFT job, you'll receive a dashboard link in the CLI output: ``` Dashboard Links: RFT Job: https://app.fireworks.ai/dashboard/fine-tuning/reinforcement/abc123 ``` Click this link or navigate manually: 1. Go to [Fireworks Dashboard](https://app.fireworks.ai) 2. Click **Fine-Tuning** in the sidebar 3. Select your job from the list ## Understanding the overview The main dashboard shows your job's current state and key metrics. ### Job status Your job is queued waiting for GPU resources. Queue time depends on current demand and your account priority. **Action**: None needed. Job will start automatically when resources become available. Fireworks is validating your dataset to ensure it meets format requirements and quality standards. **Duration**: Typically 1-2 minutes **Action**: None needed. If validation fails, you'll receive specific error messages about issues in your dataset. Training is actively in progress. Rollouts are being generated, evaluated, and the model is learning. **Action**: Monitor metrics and rollout quality. This is when you'll watch reward curves improve. Training finished successfully. Your fine-tuned model is ready for deployment. **Action**: Review final metrics, then [deploy your model](/fine-tuning/deploying-loras). Training encountered an unrecoverable error and stopped. **Action**: Check error logs and troubleshooting section below. Common causes include evaluator errors, resource limits, or dataset issues. You or another user manually stopped the job. **Action**: Review partial results if needed. Create a new job to continue training. Training stopped automatically because the full epoch showed no improvement. All rollouts received the same scores, indicating no training progress. **Action**: This typically indicates an issue with your evaluator or training setup. Check that: * Your evaluator is returning varied scores (not all 0s or all 1s) * The reward function can distinguish between good and bad outputs * The model is actually generating different responses Review the troubleshooting section below for common causes. ### Key metrics at a glance The overview panel displays: * **Elapsed time**: How long the job has been running * **Progress**: Current epoch and step counts * **Reward**: Latest mean reward from rollouts * **Model**: Base model and output model names ## Training metrics ### Reward curves The most important metric in RFT is the reward curve, which shows how well your model is performing over time. **What to look for**: * **Upward trend** - Model is learning and improving * **Plateauing** - Model may have converged; consider stopping or adjusting parameters * **Decline** - Potential issue with evaluator or training instability * **Spikes** - Could indicate noisy rewards or outliers in evaluation Reward curve showing upward trend over training epochs Healthy training shows steady reward improvement. Don't worry about minor fluctuations—focus on the overall trend. 
### Training loss Loss measures how well the model is fitting the training data: * **Decreasing loss** - Normal learning behavior * **Increasing loss** - Learning rate may be too high * **Flat loss** - Model may not be learning; check evaluator rewards ### Evaluation metrics If you provided an evaluation dataset, you'll see validation metrics: * **Eval reward**: Model performance on held-out data * **Generalization gap**: Difference between training and eval rewards Large gaps between training and eval rewards suggest overfitting. Consider reducing epochs or adding more diverse training data. ## Inspecting rollouts Understanding individual rollouts helps you verify your evaluator is working correctly and identify quality issues. ### Rollout overview table Click any **Epoch** in the training timeline, then click the **table icon** to view all rollouts for that step. Table showing rollout IDs, prompts, responses, and rewards The table shows: * **Row ID**: Unique identifier for each dataset row used in this rollout * **Prompt**: The input prompt sent to the model * **Messages**: The model's generated response messages * **Valid**: Whether the rollout completed successfully without errors * **Reason**: Explanation if the rollout failed or was marked invalid * **Score**: Reward score assigned by your evaluator (0.0 to 1.0) **What to check**: * Most rollouts succeeding (status: complete) * Reward distribution makes sense (high for good outputs, low for bad) * Many failures indicate evaluator issues * All rewards identical may indicate evaluator is broken ### Individual rollout details Click any row in the rollout table to see full details: Detailed view of a single rollout showing full prompt, response, and evaluation You'll see: 1. **Full prompt**: Exact messages sent to the model 2. **Model response**: Complete generated output 3. **Evaluation result**: Reward score and reasoning (if provided) 4. **Metadata**: Token counts, timing, temperature settings 5. **Tool calls**: For agentic rollouts with function calling Copy and paste model outputs to test them manually. For example, if you're training a code generator, try running the generated code yourself to verify your evaluator is scoring correctly. ### Quality spot checks Regularly inspect rollouts at different stages of training: **Early training (first epoch)**: * Verify evaluator is working correctly * Check that high-reward rollouts are actually good * Ensure low-reward rollouts are actually bad **Mid-training**: * Confirm model quality is improving * Look for new strategies or behaviors emerging * Check that evaluator isn't being gamed **Late training**: * Verify final model quality meets your standards * Check for signs of overfitting (memorizing training data) * Ensure diversity in responses (not all identical) ## Live logs Real-time logs show what's happening inside your training job. ### Accessing logs Click the **Logs icon** next to the table icon to view real-time logs for your training job. Live log streaming showing rollout processing and evaluation ### Using logs for debugging When things go wrong, logs are your first stop: 1. **Filter by error level**: Focus on `[ERROR]` and `[WARNING]` messages 2. **Search for rollout IDs**: Track specific rollouts through their lifecycle 3. **Look for patterns**: Repeated errors indicate systematic issues 4. 
**Check timestamps**: Correlate errors with metric changes ## Common issues and solutions **Symptoms**: Reward curve flat or very low throughout training **Possible causes**: * Evaluator always returning 0 or very low scores * Model outputs not matching expected format * Task too difficult for base model **Solutions**: 1. Inspect rollouts to verify evaluator is working: * Check that some rollouts get high rewards * Verify reward logic makes sense 2. Test evaluator locally on known good/bad outputs 3. Simplify the task or provide more examples 4. Try a stronger base model **Symptoms**: Reward increases then crashes and stays low **Possible causes**: * Learning rate too high causing training instability * Model found an exploit in the evaluator (reward hacking) * Catastrophic forgetting **Solutions**: 1. Stop training and use the last good checkpoint 2. Restart with lower learning rate (e.g., `--learning-rate 5e-5`) 3. Review recent rollouts for reward hacking behavior 4. Improve evaluator to be more robust **Symptoms**: Rollout table shows lots of errors or timeouts **Possible causes**: * Evaluator code errors * Timeout too short for evaluation * External API failures (for remote evaluators) * Resource exhaustion **Solutions**: 1. Check error logs for specific error messages 2. Test evaluator locally to reproduce errors 3. Increase `--rollout-timeout` if evaluations need more time 4. Add better error handling in evaluator code 5. For remote evaluators: check server health and logs **Symptoms**: Loss goes up instead of down **Possible causes**: * Learning rate too high * Conflicting reward signals * Numerical instability **Solutions**: 1. Reduce learning rate by 2-5x 2. Check that rewards are consistent (same prompt gets similar rewards) 3. Verify rewards are in valid range \[0, 1] 4. Consider reducing batch size **Symptoms**: Model generates the same response for every prompt **Possible causes**: * Temperature too low (near 0) * Model found one high-reward response and overfit to it * Evaluator only rewards one specific output **Solutions**: 1. Increase `--temperature` to 0.8-1.0 2. Make evaluator more flexible to accept diverse good answers 3. Use more diverse prompts in training data 4. Reduce epochs to prevent overfitting **Symptoms**: Many rollouts timing out with remote environment **Possible causes**: * Remote server slow or overloaded * Network latency issues * Evaluator not logging completion correctly **Solutions**: 1. Check remote server logs for errors 2. Verify server is logging `Status.rollout_finished()` 3. Increase `--rollout-timeout` to allow more time 4. Scale remote server to handle concurrent requests 5. Optimize evaluator code for performance ## Performance optimization ### Speeding up training If training is slower than expected: **Slow evaluators directly increase training time**: * Profile your evaluator code to find bottlenecks * Cache expensive computations * Use batch processing for API calls * Add timeouts to prevent hanging **For remote evaluators**: * Add more worker instances to handle concurrent rollouts * Use faster machines (more CPU, memory) * Optimize network connectivity to Fireworks Target: Evaluations should complete in 1-5 seconds per rollout. **Reduce compute while maintaining quality**: * Decrease `--n` (e.g., from 8 to 4 rollouts per prompt) * Reduce `--max-tokens` if responses don't need to be long * Lower temperature slightly to speed up sampling Caution: Too few rollouts (n \< 4) may hurt training quality. 
### Cost optimization Reduce costs without sacrificing too much quality: 1. **Start small**: Experiment with `qwen3-0p6b` before scaling to larger models 2. **Reduce rollouts**: Use `--n 4` instead of 8 3. **Shorter responses**: Lower `--max-tokens` to minimum needed 4. **Fewer epochs**: Start with 1 epoch, only add more if needed 5. **Efficient evaluators**: Minimize API calls and computation ## Stopping and resuming jobs ### Stopping a running job If you need to stop training: 1. Click **Cancel Job** in the dashboard 2. Or via CLI: ```bash theme={null} firectl delete rftj ``` The model state at the last checkpoint is saved and can be deployed. Cancelled jobs cannot be resumed. If you want to continue training, create a new job starting from the last checkpoint. ### Using checkpoints Checkpoints are automatically saved during training. To continue from a checkpoint: ```bash theme={null} eval-protocol create rft \ --warm-start-from accounts/your-account/models/previous-checkpoint \ --output-model continued-training ``` This is useful for: * Extending training after early stopping * Trying different hyperparameters on a trained model * Building on previous successful training runs ## Comparing multiple jobs Running multiple experiments? Compare them side-by-side: 1. Navigate to **Fine-Tuning** dashboard 2. Select multiple jobs using checkboxes 3. Click **Compare** This shows: * Reward curves overlaid on same graph * Parameter differences highlighted * Final metrics comparison * Training time and cost comparison Use consistent naming for experiments (e.g., `math-lr-1e4`, `math-lr-5e5`) to make comparisons easier. ## Exporting metrics For deeper analysis or paper writing: ### Via dashboard 1. Click **Export** button in job view 2. Choose format: CSV, JSON 3. Select metrics to export (rewards, loss, rollout data) ### Via API ```python theme={null} import requests response = requests.get( f"https://api.fireworks.ai/v1/accounts/{account}/reinforcementFineTuningJobs/{job_id}/metrics", headers={"Authorization": f"Bearer {api_key}"} ) metrics = response.json() ``` ### Weights & Biases integration If you enabled W\&B when creating the job: ```bash theme={null} eval-protocol create rft \ --wandb-project my-experiments \ --wandb-entity my-org \ ... ``` All metrics automatically sync to W\&B for advanced analysis, comparison, and sharing. ## Best practices Check your job within the first 15-30 minutes of training: * Verify evaluator is working correctly * Confirm rewards are in expected range * Catch configuration errors early Don't wait until training completes to discover issues. Every few epochs, inspect 5-10 random rollouts: * Manually verify high-reward outputs are actually good * Check low-reward outputs are actually bad * Look for unexpected model behaviors This catches evaluator bugs and reward hacking. When you find good hyperparameters, save the command: ```bash theme={null} # Save to file for reproducibility echo "eval-protocol create rft --base-model ... --learning-rate 5e-5 ..." > best_config.sh ``` Makes it easy to reproduce results or share with team. Name jobs descriptively: * Good: `math-solver-llama8b-temp08-n8` * Bad: `test1`, `try2`, `final-final` Future you will thank you when comparing experiments. Keep notes on what worked and what didn't: * Hypothesis for each experiment * Parameters changed * Results and insights * Next steps Build institutional knowledge for your team. 
## Next steps Once training completes, deploy your fine-tuned model for inference Learn how to adjust parameters for better results Improve your reward functions based on training insights Start a new experiment using the CLI # Parameter Tuning Source: https://docs.fireworks.ai/fine-tuning/parameter-tuning Learn how training parameters affect model behavior and outcomes ## Overview Reinforcement fine-tuning uses two categories of parameters to control model training: **training parameters** that govern how the model learns, and **rollout (sampling) parameters** that control how the model generates responses during training. Most experiments converge well with the default values. Adjust parameters only when you have a clear hypothesis based on your training metrics and reward curves. ## Training Parameters Core parameters that control how your model learns during the training process. **What it does**: Controls how aggressively the model updates its weights during each training step. Think of it as the "step size" when descending the loss landscape. **Default**: `1e-4` (0.0001)\ **Valid range**: `1e-5` to `5e-4` **How it affects outcome**: * **Too high** → Unstable training where reward spikes briefly then collapses as the model overshoots optimal weights. * **Too low** → Painfully slow convergence. The reward curve plateaus too early before reaching optimal performance. * **Just right** → Steady, consistent reward improvement throughout training. **When to adjust**: * **Decrease** when you see reward spikes followed by crashes in your training metrics * **Increase** when the reward curve plateaus too early and stops improving * Keep changes within 2× of the default value **What it does**: The number of complete passes through your training dataset. Each epoch processes every example once. **Default**: `1`\ **Valid range**: `1` to `10` (whole numbers only) **How it affects outcome**: * **Too few** → The model hasn't had enough exposure to learn patterns from your data * **Too many** → Overfitting risk where the model memorizes the training set instead of generalizing * **Just right** → Reward curve shows steady improvement and plateaus near the end of training **When to adjust**: * **Add 1-2 more epochs** if the reward is still climbing steadily at the end of training * **Keep at 1** for most tasks—the default works well * Watch your reward curves to detect when adding more epochs stops helping **What it does**: Controls the number of trainable parameters in your LoRA adapter. LoRA (Low-Rank Adaptation) adds small adapter layers to the base model rather than training all weights. Higher rank means more capacity to learn new behaviors. **Default**: `8`\ **Valid range**: `4` to `32` (must be powers of 2: 4, 8, 16, 32) **How it affects outcome**: * **Lower rank (4-8)** → Faster training, but may lack capacity for complex tasks * **Just right (8-16)** → Balances capacity and efficiency for most tasks * **Higher rank (32)** → More learning capacity, but requires significantly more GPUs and risks overfitting **When to adjust**: * **Increase** for complex reasoning tasks or when the model struggles to learn desired behaviors * Consider task complexity: simple style changes need lower rank, complex reasoning needs higher **What it does**: The amount of data (measured in tokens) processed in each training step before updating model weights. Unlike traditional batch sizes that count sequences (e.g., 32 or 64 sequences), Fireworks RFT uses **token-based batch sizing**. 
For example, with an 8k max sequence length, a 64k batch size allows up to 8 sequences per batch (64k tokens ÷ 8k tokens/sequence = 8 sequences). **Default**: `32k tokens` **How it affects outcome**: * **Smaller batches** → Noisier gradient updates that may help exploration, but slower training throughput * **Larger batches** → Smoother, more stable updates and faster training throughput **When to adjust**: * Most users should stick with the default. Modify if you want a smaller/larger amount of tokens per train step ## Rollout (Sampling) Parameters Parameters that control how the model generates responses during training rollouts. **What it does**: Controls the randomness of the model's token selection during generation. Higher temperature = more random/creative, lower = more deterministic/focused. **Default**: `0.7`\ **Valid range**: `0.1` to `2.0` (must be >0) **How it affects outcome**: * **0.0-0.1 (near-greedy)** → Deterministic outputs with no exploration. Leads to mode collapse and repetitive text. **Avoid in RFT.** * **0.5-1.0 (sweet spot)** → Good balance of exploration and coherence. Ideal for most RLHF applications. * **>1.2 (high randomness)** → Very creative but potentially incoherent outputs **When to adjust**: * **Lower (0.3-0.5)** for tasks requiring precision, factual accuracy, or safety (less toxic outputs) * **Raise (1.0-1.2)** for creative tasks like story generation or when you need more diverse rollout exploration * **Never use 0.0**—greedy sampling breaks RFT by eliminating exploration **What it does**: Dynamically limits token sampling to the smallest set of tokens whose cumulative probability exceeds threshold p. Only considers the most probable tokens that together make up the top p% of probability mass. **Default**: `1.0` (considers all tokens)\ **Valid range**: `0` to `1` **How it affects outcome**: * Lower values (0.2-0.5) filter out long-tail, low-probability tokens that often cause hallucinations * Higher values (0.9-1.0) allow more diversity in outputs * Prevents the model from selecting very unlikely tokens that may be nonsensical **When to adjust**: * **Lower to 0.2-0.5** when your reward function penalizes hallucinations or factual errors * **Keep at 0.9-1.0** for creative tasks that benefit from diverse vocabulary * Works well in combination with temperature for fine-grained control **What it does**: Limits sampling to only the K most probable tokens at each step. A fixed-size cutoff (unlike top-p which is dynamic). **Default**: `40`\ **Valid range**: `0` to `100` (0 = disabled) **How it affects outcome**: * Similar to top-p but uses a fixed number of candidates instead of a probability threshold * Lower k = more focused, less diverse outputs * Higher k = more exploration and creativity **When to adjust**: * **Combine with temperature** (e.g., temp 0.8 + top-k 40) for balanced creative exploration * **Keep ≤50** to maintain reasonable inference latency * Consider using top-p instead for most use cases—it adapts better to varying probability distributions **What it does**: How many different responses the model generates for each prompt during training. The policy optimization algorithm compares these candidates to compute the KL divergence term and learn which responses are better. **Default**: `4`\ **Valid range**: `2` to `8` (minimum 2 required) **How it affects outcome**: * **n=1** → **Not allowed.** Policy optimization requires multiple candidates to learn from comparisons * **n=2-4** → Minimal viable exploration. 
Faster and cheaper but less signal for learning * **n=4-8** → Good balance of learning signal and cost for most tasks * **n>8** → Diminishing returns. Significantly slower and more expensive with marginal quality gains **When to adjust**: * **Increase to 6-8** when you need higher quality and cost isn't a concern * **Keep at 4** for most experiments—it's the sweet spot * **Never set to 1**—this will cause training to fail * Consider the tradeoff: more rollouts = better signal but linearly higher cost **What it does**: The maximum number of tokens the model can generate in a single response during rollouts. **Default**: `2048`\ **Valid range**: `16` to `16384` **How it affects outcome**: * Directly affects task completion: too short and the model can't finish complex tasks * Longer responses improve reward on summarization, story generation, and reasoning tasks * Linearly increases training cost—every token generated costs compute **When to adjust**: * **Increase** when your tasks require longer reasoning chains, detailed summaries, or complex multi-step solutions * **Decrease** to reduce costs for tasks with naturally short outputs (classification, short-form Q\&A) * Monitor your reward curves: if the model is cutting off mid-response, increase max tokens ## Parameter Interactions Parameters don't work in isolation—they interact in important ways. These three work together to control sampling behavior. Using all three gives you fine-grained control: * **Temperature** sets the overall randomness * **Top-p** dynamically filters by probability mass * **Top-k** sets a hard limit on candidate tokens Example: `temperature=0.8, top_p=0.9, top_k=40` gives creative but controlled outputs. Larger batch sizes provide more stable gradients, which may allow for slightly higher learning rates. However, the default learning rate is tuned for the default batch size—only adjust if you have evidence from your training curves. Larger base models (70B+) may need higher LoRA ranks to capture complex behaviors, but they also require more resources. For smaller models (\<13B), rank 8-16 is usually sufficient. ## Tuning Strategies Best practices for adjusting parameters to achieve your training goals. The default parameters are carefully tuned to work well for most RFT tasks. Don't change them unless you have a clear hypothesis based on your training metrics. Run at least one baseline experiment with defaults before making any adjustments. This gives you: * A performance benchmark to compare against * Understanding of whether parameter tuning is actually needed * Evidence about which metrics need improvement Many successful RFT jobs use all default parameters. When you do adjust parameters, change only one at a time and measure the impact on your reward curves and evaluation metrics. **Good workflow:** 1. Run baseline with defaults 2. Identify specific issue (e.g., reward crashes, slow convergence) 3. Change ONE parameter that should address that issue 4. Compare results 5. Repeat **Avoid:** Changing multiple parameters simultaneously—you won't know which change caused the improvement or regression. Use Weights & Biases integration to: * Compare training curves across experiments * Track reward progression over time * Log all hyperparameters automatically This makes it easy to identify which parameter changes actually helped and which hurt performance. 
Quick reference for goal-directed parameter tuning: * **Faster convergence** → ↑ epochs (add 1-2), tune learning rate (stay \<2× default) * **Better quality** → ↑ temperature (1.0-1.2), ↑ rollouts (6-8), ↑ max tokens * **Safer/less toxic** → ↓ temperature (0.3-0.5), ↓ top-p (0.5), ↓ top-k * **More creative** → ↑ temperature (1.0-1.2), top-p = 0.9 * **Lower cost** → ↓ rollouts, ↓ max tokens, ↓ batch size * **Higher capacity** → ↑ LoRA rank (16-32), but monitor memory usage * **Prevent overfitting** → Keep epochs = 1, consider lower LoRA rank ## Next Steps Complete guide to CLI parameters and options Launch your RFT job Hands-on tutorial showing parameter tuning in practice Learn about the RFT training process and workflow # Single-Turn Training Quickstart Source: https://docs.fireworks.ai/fine-tuning/quickstart-math Train a model to be an expert at answering GSM8K math questions **Following the [RFT Overview](/fine-tuning/reinforcement-fine-tuning-models)?** This is the **Single-Turn Training** path—the fastest way to get started with RFT. In this quickstart, you'll train a small language model—`Qwen3 0.6B`—to solve mathematical reasoning problems from the GSM8K dataset. ## What you'll learn * How to set up and test an evaluator locally, using the Eval Protocol SDK * How to take that evaluator and use it in an RFT job, from the command line * How to monitor training progress and evaluate accuracy improvements Prefer a notebook experience? You can also [run this tutorial in Google Colab](https://colab.research.google.com/drive/16xrb9rx6AoAEOtrDXumzo71HjhunaoPi#scrollTo=CP18QX4tgi-0). Note that Colab requires billing enabled on your Google account. ## Prerequisites * Python 3.10+ * A Fireworks API key (stored in your shell or .env) * Command-line access (terminal or shell) ## 1. Install dependencies and set up files Clone the quickstart-gsm8k repository and install dependencies: ```bash theme={null} git clone https://github.com/eval-protocol/quickstart-gsm8k.git cd quickstart-gsm8k pip install -r requirements.txt ``` Create the `gsm8k_artifacts/` folder structure and copy files: ```bash theme={null} mkdir -p gsm8k_artifacts/{tests/pytest/gsm8k,development} cp evaluation.py gsm8k_artifacts/tests/pytest/gsm8k/test_pytest_math_example.py cp gsm8k_sample.jsonl gsm8k_artifacts/development/gsm8k_sample.jsonl ``` The repository includes: * **Evaluator** (`evaluation.py`): Defines how to evaluate math answers * **Dataset** (`gsm8k_sample.jsonl`): Contains example math problems to test on Install the latest `eval-protocol` SDK, `pytest`, and `requests`: ```bash theme={null} python -m pip install --upgrade pip python -m pip install pytest requests git+https://github.com/eval-protocol/python-sdk.git ``` Download the evaluator and dataset files: Run this Python script to download two files from the Eval Protocol repository into a folder on your machine called `gsm8k_artifacts/`. 
* **Test script** (`test_pytest_math_example.py`): Defines how to evaluate math answers * **Sample dataset** (`gsm8k_sample.jsonl`): Contains example math problems to test on ```python tutorial/download_gsm8k_assets.py theme={null} from pathlib import Path import requests ARTIFACT_ROOT = Path("gsm8k_artifacts") TEST_PATH = ARTIFACT_ROOT / "tests" / "pytest" / "gsm8k" / "test_pytest_math_example.py" DATASET_PATH = ARTIFACT_ROOT / "development" / "gsm8k_sample.jsonl" files_to_download = { TEST_PATH: "https://raw.githubusercontent.com/eval-protocol/python-sdk/main/tests/pytest/gsm8k/test_pytest_math_example.py", DATASET_PATH: "https://raw.githubusercontent.com/eval-protocol/python-sdk/main/development/gsm8k_sample.jsonl", } for local_path, url in files_to_download.items(): local_path.parent.mkdir(parents=True, exist_ok=True) response = requests.get(url, timeout=30) response.raise_for_status() local_path.write_bytes(response.content) print(f"Saved {url} -> {local_path}") ``` Expected output: ``` Saved https://raw.githubusercontent.com/.../test_pytest_math_example.py -> gsm8k_artifacts/tests/pytest/gsm8k/test_pytest_math_example.py Saved https://raw.githubusercontent.com/.../gsm8k_sample.jsonl -> gsm8k_artifacts/development/gsm8k_sample.jsonl ``` ## 2. Test your evaluator locally In this step, we will test your evaluator by examining the output locally. Feel free to iterate on the evaluator you downloaded in the last step until it gives the output you want. Open a terminal and run: ```bash theme={null} ep logs ``` This will start a local server, navigate to `http://localhost:8000`. Keep this terminal running. In a **new terminal**, call the test script to run the evaluator on your dataset of sample math problems. ```bash theme={null} cd gsm8k_artifacts ep local-test ``` This command discovers and runs your `@evaluation_test` with pytest. As the test runs, you'll see evaluation scores appear in the browser, with detailed logs for each problem the model attempts. `pytest` will also register your evaluator and dataset with Fireworks automatically, so you can use them in the next step for RFT. GSM8K evaluation UI showing model scores and trajectories ## 3. Start training First, set your Fireworks API key so the Fireworks CLI can authenticate you: ```bash theme={null} export FIREWORKS_API_KEY="" ``` Next, we'll launch the RFT job using the evaluator and dataset you just registered. We're using a small base model (`qwen3-0p6b`) to keep training fast and inexpensive. Because your evaluator and dataset were already registered with Fireworks in the last step, we don't need to specify them again here. ```bash theme={null} cd .. eval-protocol create rft --base-model accounts/fireworks/models/qwen3-0p6b ``` The CLI will output dashboard links where you can monitor your training job in real-time. GSM8K evaluation score showing upward trajectory You can also store your API key in a `.env` file instead of exporting it each session. ## Monitor your training progress Your RFT job is now running. You can monitor progress in the dashboard links provided by the CLI output. Re-run the pytest evaluation command to measure your model's performance on new checkpoints: ```bash theme={null} cd gsm8k_artifacts pytest -q tests/pytest/gsm8k/test_pytest_math_example.py::test_math_dataset -s ``` This helps you see how your model's accuracy improves over time and decide when to stop training. 
You can adjust the evaluation logic to better fit your needs: * **Modify reward shaping**: Edit the scoring logic in `test_pytest_math_example.py` to match your answer format expectations * **Use your own data**: Replace the sample dataset by either editing the JSONL file locally or passing `--dataset-jsonl` when creating the RFT job ### What's happening behind the scenes Understanding the training workflow: 1. **Evaluation registration**: The pytest script evaluates a small GSM8K subset using numeric answer checking, then automatically registers both your evaluator and dataset with Fireworks 2. **RFT job creation**: The `create rft` command connects your registered evaluator and dataset to a Reinforcement Fine-Tuning job for your chosen base model 3. **Continuous improvement**: As training progresses, evaluation scores on the held-out set reflect improved accuracy, allowing you to iterate quickly before scaling to larger experiments ## Next steps Learn all CLI options to customize your training parameters Train agents that run in your production infrastructure Understand how reinforcement fine-tuning works # Remote Agent Quickstart Source: https://docs.fireworks.ai/fine-tuning/quickstart-svg-agent Train an SVG drawing agent running in a remote environment **Following the [RFT Overview](/fine-tuning/reinforcement-fine-tuning-models)?** This is the **Remote Agent Training** path—for training agents that run in your production infrastructure. In this quickstart, you'll train an agent to generate SVG drawings. Your agent runs in a remote server (Vercel), which means rollouts happen remotely while Fireworks handles the training. This approach lets you train agents that already live in your production environment. Here's a quick walkthrough: