# Exporting Billing Metrics

Source: https://docs.fireworks.ai/accounts/exporting-billing-metrics

Export billing and usage metrics for all Fireworks services

## Overview

Fireworks provides a CLI tool to export comprehensive billing metrics for all usage types, including serverless inference, on-demand deployments, and fine-tuning jobs. The exported data can be used for cost analysis, internal billing, and usage tracking.

## Exporting billing metrics

Use the Fireworks CLI to export a billing CSV that includes all usage:

```bash
# Authenticate (once)
firectl login

# Export billing metrics to CSV
firectl billing export-metrics
```

## Examples

Export all billing metrics for an account:

```bash
firectl billing export-metrics
```

Export metrics for a specific date range and filename:

```bash
firectl billing export-metrics \
  --start-time "2025-01-01" \
  --end-time "2025-01-31" \
  --filename january_metrics.csv
```

## Output format

The exported CSV includes the following columns:

* **email**: Account email
* **start\_time**: Request start timestamp
* **end\_time**: Request end timestamp
* **usage\_type**: Type of usage (e.g., TEXT\_COMPLETION\_INFERENCE\_USAGE)
* **accelerator\_type**: GPU/hardware type used
* **accelerator\_seconds**: Compute time in seconds
* **base\_model\_name**: The model used
* **model\_bucket**: Model category
* **parameter\_count**: Model size
* **prompt\_tokens**: Input tokens
* **completion\_tokens**: Output tokens

### Sample row

```csv
email,start_time,end_time,usage_type,accelerator_type,accelerator_seconds,base_model_name,model_bucket,parameter_count,prompt_tokens,completion_tokens
user@example.com,2025-10-20 17:16:48 UTC,2025-10-20 17:16:48 UTC,TEXT_COMPLETION_INFERENCE_USAGE,,,accounts/fireworks/models/llama4-maverick-instruct-basic,Llama 4 Maverick Basic,401583781376,803,109
```

## Automation

Each `firectl billing export-metrics` call supports a maximum 31-day time range. To export longer historical ranges, run the command in multiple 31-day chunks and combine the CSV files in your downstream pipeline.

You can automate exports in cron jobs and load the CSV into your internal systems:

```bash
# Example: Daily export with dated filename
# Note: `date -v-1d` is BSD/macOS syntax; with GNU date use `date -d yesterday '+%Y-%m-%d'`.
firectl billing export-metrics \
  --start-time "$(date -v-1d '+%Y-%m-%d')" \
  --end-time "$(date '+%Y-%m-%d')" \
  --filename "billing_$(date '+%Y%m%d').csv"
```

```bash
# Example: Backfill 6 months in 31-day chunks
# Note: `date -j -f` and `-v+31d` are BSD/macOS flags; GNU date uses `date -d` instead.
start_date="2025-01-01"
end_date="2025-07-01"
current_start="$start_date"

while [ "$(date -j -f "%Y-%m-%d" "$current_start" "+%s")" -lt "$(date -j -f "%Y-%m-%d" "$end_date" "+%s")" ]; do
  current_end="$(date -j -v+31d -f "%Y-%m-%d" "$current_start" "+%Y-%m-%d")"

  # Clamp the chunk end to the requested end_date
  if [ "$(date -j -f "%Y-%m-%d" "$current_end" "+%s")" -gt "$(date -j -f "%Y-%m-%d" "$end_date" "+%s")" ]; then
    current_end="$end_date"
  fi

  firectl billing export-metrics \
    --start-time "$current_start" \
    --end-time "$current_end" \
    --filename "billing_${current_start}_to_${current_end}.csv"

  current_start="$current_end"
done
```

Run `firectl billing export-metrics --help` to see all available flags and options.

## Coverage

This export includes:

* **Serverless inference**: All serverless API usage
* **On-demand deployments**: Deployment usage (see also [Exporting deployment metrics](/deployments/exporting-metrics) for real-time Prometheus metrics)
* **Fine-tuning jobs**: Fine-tuning compute usage
* **Other services**: All billable Fireworks services

For real-time monitoring of on-demand deployment performance metrics (latency, throughput, etc.), use the [Prometheus metrics endpoint](/deployments/exporting-metrics) instead.
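
The 31-day chunking described under Automation can also be computed outside the shell, which sidesteps the BSD-specific `date` flags used in the backfill script. The following is a minimal Python sketch under stated assumptions, not part of the Fireworks tooling: `chunk_ranges` and `export_commands` are hypothetical helper names, and the only CLI flags used are the `--start-time`, `--end-time`, and `--filename` flags shown in the examples.

```python
from datetime import date, timedelta


def chunk_ranges(start: date, end: date, max_days: int = 31):
    """Split [start, end) into consecutive windows of at most max_days days."""
    windows = []
    current = start
    while current < end:
        # Clamp the final window to the requested end date.
        window_end = min(current + timedelta(days=max_days), end)
        windows.append((current, window_end))
        current = window_end
    return windows


def export_commands(start: date, end: date):
    """Build one `firectl billing export-metrics` argument list per window."""
    return [
        [
            "firectl", "billing", "export-metrics",
            "--start-time", s.isoformat(),
            "--end-time", e.isoformat(),
            "--filename", f"billing_{s.isoformat()}_to_{e.isoformat()}.csv",
        ]
        for s, e in chunk_ranges(start, end)
    ]


if __name__ == "__main__":
    # Print the commands rather than running them, so the sketch has no side effects.
    for argv in export_commands(date(2025, 1, 1), date(2025, 7, 1)):
        print(" ".join(argv))
```

To run the exports instead of printing them, each argument list could be passed to `subprocess.run(argv, check=True)`, assuming `firectl` is installed and already authenticated.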
## See also

* [firectl CLI overview](/tools-sdks/firectl/firectl)
* [Exporting deployment metrics](/deployments/exporting-metrics) - Real-time Prometheus metrics for on-demand deployments
* [Account quotas](/guides/quotas_usage/account-quotas) - Spending tiers, budget controls, and account-wide request limits
* [Serverless rate limits](/serverless/rate-limits) - Adaptive serverless TPM bounds

# Usage & Cost Breakdown

Source: https://docs.fireworks.ai/accounts/exporting-usage-and-costs

Break down usage and rated costs by deployment, model, API key, or custom tags, via firectl or the billingUsage API

## Overview

Fireworks exposes the same usage-and-cost data through two equivalent surfaces:

* **CLI**: [`firectl billing get-usage`](/tools-sdks/firectl/commands/billing-get-usage), best for ad-hoc queries, shell scripting, and one-off cost reviews.
* **HTTP API**: [`GET /v1/accounts/{account_id}/billingUsage`](/api-reference/get-billing-usage), best for cron jobs, dashboards, and downstream cost-attribution pipelines.

Both return the same response shape and accept the same dimensions. The examples below show the CLI form and, where one exists, the equivalent cURL. Pick whichever fits your workflow.

The output has two parts:

* **Account costs**: rated dollar totals for the range (CLI: prints by default; API: companion `GetBillingSummary` endpoint).
* **Usage**: metered quantities (tokens, accelerator-seconds, audio input seconds) grouped by your chosen dimensions.

This page complements [Exporting Billing Metrics](/accounts/exporting-billing-metrics): use `export-metrics` for a raw per-event CSV dump, and the workflows on this page for grouped, rated views.

CLI examples require `firectl` 1.7.21 or later. Run `firectl version`, then `firectl upgrade` if needed.

## Authentication

For the API, send your Fireworks API key as a bearer token. Any key on the target account works.

```bash
export ACCOUNT_ID=""  # set to your Fireworks account ID
export FIREWORKS_API_KEY="fw_..."
```

For the CLI, run `firectl login` once; `firectl` then reads credentials from `~/.fireworks/auth.ini`.

## Basic usage

Get a one-month account-wide breakdown (defaults to all usage types, grouped by model for serverless and by deployment + accelerator for dedicated):

```bash
firectl billing get-usage \
  --start-time 2026-05-01 \
  --end-time 2026-06-01
```

Add `-o json` for machine-readable output.

```bash
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
  -H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
  --data-urlencode "startTime=2026-05-01T00:00:00Z" \
  --data-urlencode "endTime=2026-06-01T00:00:00Z"
```

## Examples

### Serverless usage by model

```bash
firectl billing get-usage \
  --start-time 2026-05-01 --end-time 2026-06-01 \
  --usage-type serverless \
  --group-by model_name
```

```bash
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
  -H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
  --data-urlencode "startTime=2026-05-01T00:00:00Z" \
  --data-urlencode "endTime=2026-06-01T00:00:00Z" \
  --data-urlencode "usageType=SERVERLESS" \
  --data-urlencode "groupBy=model_name"
```

### Serverless usage by API key

Breaks out serverless token consumption per API key. Pass both `api_key_id` (stable internal ID) and `api_key_name` (human-readable label from the console / `firectl api-key create --name`) so the response carries both.

```bash
firectl billing get-usage \
  --start-time 2026-05-01 --end-time 2026-06-01 \
  --usage-type serverless \
  --group-by api_key_id \
  --group-by api_key_name \
  --group-by model_name
```

```bash
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
  -H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
  --data-urlencode "startTime=2026-05-01T00:00:00Z" \
  --data-urlencode "endTime=2026-06-01T00:00:00Z" \
  --data-urlencode "usageType=SERVERLESS" \
  --data-urlencode "groupBy=api_key_id" \
  --data-urlencode "groupBy=api_key_name" \
  --data-urlencode "groupBy=model_name"
```

Sample row from the API response:

```json
{
  "startTime": "2026-05-28T00:00:00Z",
  "endTime": "2026-05-29T00:00:00Z",
  "promptTokens": "1842301",
  "completionTokens": "412980",
  "audioInputSeconds": 0,
  "usageType": "TEXT_COMPLETION_INFERENCE_USAGE",
  "group": {
    "api_key_id": "key_4nMFyHCSZP4CRKqa",
    "api_key_name": "prod-eng",
    "model_name": "accounts/fireworks/models/kimi-k2.6"
  }
}
```

Token counts come back as JSON **strings** (int64 over JSON). Cast them with `tonumber` in `jq`, or the equivalent in your client, before doing arithmetic.

The deprecated top-level `apiKeyId` field is only populated when `groupBy=api_key_id` is requested, so always read API-key values from the `group` map.

### Filter to a specific API key

Repeat `--filter` (CLI) or `filter[<dimension>][values]=` (API) to OR multiple values for the same dimension.

```bash
firectl billing get-usage \
  --start-time 2026-05-01 --end-time 2026-06-01 \
  --usage-type serverless \
  --group-by model_name \
  --filter api_key_name=prod-eng
```

```bash
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
  -H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
  --data-urlencode "startTime=2026-05-01T00:00:00Z" \
  --data-urlencode "endTime=2026-06-01T00:00:00Z" \
  --data-urlencode "usageType=SERVERLESS" \
  --data-urlencode "groupBy=model_name" \
  --data-urlencode 'filter[api_key_name][values]=prod-eng'
```

### Dedicated deployment usage by deployment and GPU type

```bash
firectl billing get-usage \
  --start-time 2026-05-01 --end-time 2026-06-01 \
  --usage-type dedicated-deployment \
  --group-by deployment_name \
  --group-by accelerator_type
```

```bash
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
  -H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
  --data-urlencode "startTime=2026-05-01T00:00:00Z" \
  --data-urlencode "endTime=2026-06-01T00:00:00Z" \
  --data-urlencode "usageType=DEDICATED_DEPLOYMENT" \
  --data-urlencode "groupBy=deployment_name" \
  --data-urlencode "groupBy=accelerator_type"
```

### Filter to a single deployment

```bash
firectl billing get-usage \
  --start-time 2026-05-01 --end-time 2026-06-01 \
  --filter deployment_name=accounts/my-account/deployments/my-deployment
```

```bash
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
  -H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
  --data-urlencode "startTime=2026-05-01T00:00:00Z" \
  --data-urlencode "endTime=2026-06-01T00:00:00Z" \
  --data-urlencode 'filter[deployment_name][values]=accounts/my-account/deployments/my-deployment'
```

### Account-level cost totals only

```bash
firectl billing get-usage \
  --start-time 2026-05-01 --end-time 2026-06-01 \
  --account-costs-only
```

Rated dollar totals come from a companion endpoint, `GetBillingSummary`. Use the CLI for this view today; we'll surface the same data through the API in a future release.

## Reference

### CLI flags

| Flag                   | Description                                                                        |
| ---------------------- | ---------------------------------------------------------------------------------- |
| `--start-time`         | Start time (inclusive), as `YYYY-MM-DD` or `'YYYY-MM-DD hh:mm:ss'`.                |
| `--end-time`           | End time (exclusive), same formats.                                                |
| `--usage-type`         | `all`, `serverless`, or `dedicated-deployment`. Defaults to `all`.                 |
| `--group-by`           | Dimension to group by. Repeatable.                                                 |
| `--filter`             | `key=value` filter. Repeatable; repeated values for the same key are OR'ed.        |
| `--timezone`           | IANA timezone for daily aggregation (e.g. `America/Los_Angeles`). Defaults to UTC. |
| `--account-costs-only` | Print only account-level cumulative costs for the range.                           |
| `-o, --output`         | `text` (default) or `json`.                                                        |

Run `firectl billing get-usage --help` for the full list.

### API parameters

The same dimensions are passed as `groupBy=<dimension>` (repeat for multiple) and `filter[<dimension>][values]=<value>` (repeat for OR). `usageType` takes `SERVERLESS` or `DEDICATED_DEPLOYMENT`; omit it for all usage types. `timezone` and `startTime`/`endTime` mirror the CLI flags. See [the full API reference](/api-reference/get-billing-usage) for parameter schemas and response types.

### Grouping dimensions

Valid `--group-by` / `groupBy` and `--filter` / `filter` dimensions depend on the usage type:

* **Serverless**: `model_name`, `api_key_id`, `api_key_name`, `annotations.team`, `annotations.project`, `annotations.environment`
* **Dedicated deployment**: `deployment_name`, `accelerator_type`, `annotations.team`, `annotations.project`, `annotations.environment`

Dedicated-deployment rows also include the deployment's region (`placement`, e.g. `US`, `EUROPE`, `GLOBAL`) and metered `accelerator_seconds`.
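
For clients that do not shell out to cURL, the repeated `groupBy` and `filter` parameters described above can be assembled as a list of key/value pairs. The following is a minimal Python sketch using only the standard library; `build_usage_url` is a hypothetical helper name, and the parameter names are taken from the cURL examples on this page rather than from a confirmed client SDK.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.fireworks.ai/v1/accounts"


def build_usage_url(account_id, start_time, end_time,
                    usage_type=None, group_by=(), filters=None):
    """Return a billingUsage URL with repeated groupBy and filter parameters."""
    pairs = [("startTime", start_time), ("endTime", end_time)]
    if usage_type is not None:  # omit usageType to request all usage types
        pairs.append(("usageType", usage_type))
    # Each grouping dimension becomes its own groupBy=... pair.
    pairs.extend(("groupBy", dim) for dim in group_by)
    # Repeating a filter key ORs its values, so emit one pair per value.
    for dim, values in (filters or {}).items():
        pairs.extend((f"filter[{dim}][values]", value) for value in values)
    # safe="[]" keeps the brackets in parameter names literal, as curl does.
    query = urlencode(pairs, safe="[]")
    return f"{BASE_URL}/{account_id}/billingUsage?{query}"


if __name__ == "__main__":
    print(build_usage_url(
        "my-account",
        "2026-05-01T00:00:00Z",
        "2026-06-01T00:00:00Z",
        usage_type="SERVERLESS",
        group_by=["model_name"],
        filters={"api_key_name": ["prod-eng"]},
    ))
```

Passing the pairs as a list rather than a dict is what allows the same key to appear more than once; the resulting URL still needs the `Authorization: Bearer` header when it is requested.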
## Custom tags (team / project / environment)

Group by `annotations.team`, `annotations.project`, or `annotations.environment` to split usage by your own labels. The tag source depends on usage type:

* **Dedicated deployments**: set an `annotations` map on the deployment, e.g. `{"team": "search", "project": "x", "environment": "prod"}`.
* **Serverless**: send a per-request header on inference calls:

```http
POST /inference/v1/chat/completions HTTP/1.1
Host: api.fireworks.ai
Authorization: Bearer fw_...
Fireworks-Annotations: team=search,project=ranker,environment=prod
Content-Type: application/json
```

Annotation values are validated server-side; unrecognized keys are dropped silently.

## Cookbook: per-API-key reporting recipes

These recipes target the HTTP API, where downstream aggregation in `jq` (or any client) is easiest.

### Aggregate per key, across models

Sums prompt and completion tokens for each API key across every model it called, sorted by prompt volume.

```bash
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
  -H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
  --data-urlencode "startTime=2026-05-01T00:00:00Z" \
  --data-urlencode "endTime=2026-06-01T00:00:00Z" \
  --data-urlencode "usageType=SERVERLESS" \
  --data-urlencode "groupBy=api_key_id" \
  --data-urlencode "groupBy=api_key_name" \
  --data-urlencode "groupBy=model_name" \
  | jq '.serverlessCosts
      | group_by(.group.api_key_id)
      | map({
          api_key_id: .[0].group.api_key_id,
          api_key_name: .[0].group.api_key_name,
          models: (map(.group.model_name) | unique),
          prompt_tokens: ([.[].promptTokens | tonumber] | add),
          completion_tokens: ([.[].completionTokens | tonumber] | add)
        })
      | sort_by(-.prompt_tokens)'
```

### Group by model, then by key (cost-by-tool view)

If reporting starts from "how much did each model cost me, and which keys drove that", flip the nesting:

```bash
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
  -H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
  --data-urlencode "startTime=2026-05-01T00:00:00Z" \
  --data-urlencode "endTime=2026-06-01T00:00:00Z" \
  --data-urlencode "usageType=SERVERLESS" \
  --data-urlencode "groupBy=api_key_id" \
  --data-urlencode "groupBy=api_key_name" \
  --data-urlencode "groupBy=model_name" \
  | jq '.serverlessCosts
      | group_by(.group.model_name)
      | map({
          model: .[0].group.model_name,
          api_keys: (
            group_by(.group.api_key_id)
            | map({
                api_key_id: .[0].group.api_key_id,
                api_key_name: .[0].group.api_key_name,
                prompt_tokens: ([.[].promptTokens | tonumber] | add),
                completion_tokens: ([.[].completionTokens | tonumber] | add)
              })
            | sort_by(-.prompt_tokens)
          )
        })
      | sort_by(.model)'
```

Multiply the token totals by the published [serverless prices](/serverless/pricing) to convert to dollars for chargeback.

### Backfill more than 31 days

The endpoint caps each request at a 31-day window. To pull a longer history, loop over the range in 30-day chunks:

```bash
# Note: `date -d` is GNU syntax; BSD/macOS date uses `-j -f` and `-v` instead.
start_date="2026-01-01"
end_date="2026-06-01"
current="$start_date"

while [ "$(date -u -d "$current" '+%s')" -lt "$(date -u -d "$end_date" '+%s')" ]; do
  next="$(date -u -d "$current +30 days" '+%Y-%m-%d')"

  # Clamp the chunk end to the requested end_date
  if [ "$(date -u -d "$next" '+%s')" -gt "$(date -u -d "$end_date" '+%s')" ]; then
    next="$end_date"
  fi

  curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
    -H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
    --data-urlencode "startTime=${current}T00:00:00Z" \
    --data-urlencode "endTime=${next}T00:00:00Z" \
    --data-urlencode "usageType=SERVERLESS" \
    --data-urlencode "groupBy=api_key_id" \
    --data-urlencode "groupBy=api_key_name" \
    > "usage_${current}_to_${next}.json"

  current="$next"
done
```

## Granularity and freshness

* Usage is aggregated into **daily** buckets (`--timezone` / `timezone=` sets the day boundary). There are no sub-daily buckets.
* Responses are cached for several minutes, which is fine for cron jobs and dashboards but not for real-time monitoring.
## Coverage caveats

* **Tokens, not dollars.** The endpoint returns metered quantities (`promptTokens`, `completionTokens`, `accelerator_seconds`, `audioInputSeconds`). Multiply by the [serverless prices](/serverless/pricing) for cost, or use `--account-costs-only` for account-level dollar totals.
* **Inference types covered today**: text completion / chat completion and audio inference. Embeddings and image generation aren't yet reflected in `billingUsage` responses; coverage will expand in subsequent releases.
* **Dedicated deployments** are attributed at the deployment level, not by API key. Use `usageType=DEDICATED_DEPLOYMENT` with `groupBy=deployment_name` for that breakdown.

Run `firectl billing get-usage --help` to see all available CLI flags and options.

## See also

* [`firectl billing get-usage`](/tools-sdks/firectl/commands/billing-get-usage) - CLI command reference
* [`GET /v1/accounts/{account_id}/billingUsage`](/api-reference/get-billing-usage) - HTTP API reference
* [Exporting Billing Metrics](/accounts/exporting-billing-metrics) - Raw per-event billing CSV export
* [Account quotas](/guides/quotas_usage/account-quotas) - Spending tiers and budget controls

# Service Accounts

Source: https://docs.fireworks.ai/accounts/service-accounts

How to manage and use service accounts in Fireworks

Service accounts in Fireworks allow applications, scripts, and automated systems to authenticate and perform actions securely, without relying on human credentials. They are ideal for CI/CD pipelines, backend services, and automated workflows.

Service accounts let you avoid shared credentials and make it easy to tell in audit logs which actions were taken by automated systems and which by humans.

Service accounts can take actions using an API key, such as creating deployments, running models, or creating datasets (see [API reference](https://fireworks.ai/docs/api-reference/introduction)). Service accounts cannot log in through the web interface or use OIDC tokens.

To manage service accounts via the Fireworks web UI, visit [app.fireworks.ai/account/users](https://app.fireworks.ai/account/users).

## Creating a Service Account

Use firectl to create a service account:

```bash
firectl user create --user-id "my-service-account" --service-account
```

## Creating an API Key for a Service Account

Use firectl to create an API key on behalf of a service account:

```bash
firectl api-key create --service-account "my-service-account"
```

## Roles

You can assign a role when creating a service account using the `--role` flag:

```bash
firectl user create --user-id "my-service-account" --service-account --role=contributor
```

If not specified, the default service account role is `user`.

To change the role of an existing service account, use the update command:

```bash
firectl user update my-service-account --role=inference-user
```

See [Managing users](/accounts/users) for available roles.

## Listing Service Accounts

To list all service accounts in your account:

```bash
firectl user list --filter 'service_account=true'
```

## Billing

* Service accounts count toward the same account quotas and limits assigned to the account
* Usage is tracked at the account level; it is not split between human users and service accounts

## Auditing

In audit logs, users are referenced by their email addresses. Service accounts are referenced by an identifier of the form `my-service-account@my-account.sa.fireworks.ai`.

# Custom SSO

Source: https://docs.fireworks.ai/accounts/sso

Set up custom Single Sign-On (SSO) authentication for Fireworks AI

Fireworks uses single sign-on (SSO) as the primary mechanism to authenticate with the platform. By default, Fireworks supports Google SSO.

If you have an enterprise account, Fireworks supports bringing your own identity provider using:

* OpenID Connect (OIDC) provider
* SAML 2.0 provider

Coordinate with your Fireworks AI representative to enable the integration.

## OpenID Connect (OIDC) provider

Create an OIDC client application in your identity provider, e.g. Okta.

Ensure the client is configured for "code authorization" of the "web" type (i.e. with a client\_secret).

Set the client's "allowed redirect URL" to the URL provided by Fireworks. It looks like:

```
https://fireworks-.auth.us-west-2.amazoncognito.com/oauth2/idpresponse
```

Note down the `issuer`, `client_id`, and `client_secret` for the newly created client. You will need to provide these to your Fireworks AI representative to complete your account setup.

## SAML 2.0 provider

Create a SAML 2.0 application in your identity provider, e.g. [Okta](https://help.okta.com/en-us/Content/Topics/Apps/Apps_App_Integration_Wizard_SAML.htm).

Set the SSO URL to the URL provided by Fireworks. It looks like:

```
https://fireworks-.auth.us-west-2.amazoncognito.com/saml2/idpresponse
```

Configure the Audience URI (SP Entity ID) as provided by Fireworks. It looks like:

```
urn:amazon:cognito:sp:
```

Create an Attribute Statement with the name:

```
http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress
```

and the value `user.email`.

**Okta:** After saving the app, open **Sign On** → **Attribute Statements (SAML)** → expand **Show legacy configuration** → add the attribute statement there. Okta no longer configures this during app creation.

Leave the rest of the settings as defaults.

Note down the "metadata url" for your newly created application. You will need to provide this to your Fireworks AI representative to complete your account setup.

## Just-In-Time (JIT) user provisioning

JIT user provisioning automatically creates user accounts when they sign in through SSO for the first time. When enabled, users who authenticate through your identity provider are automatically added to your Fireworks account without requiring manual user creation.

To enable JIT user provisioning, use the [`--enable-jit-user-provisioning`](/tools-sdks/firectl/commands/identity-provider-create) flag when creating your identity provider with firectl.

## Enforce SSO

When SSO enforcement is enabled, account access is restricted to users with approved tenant domains only. Users with matching domains must authenticate via the identity provider, and users with other domains are blocked.

To enforce SSO, use the [`--enforce-sso`](/tools-sdks/firectl/commands/identity-provider-create) flag when creating your identity provider with firectl, or toggle "Enforce SSO for all users" in the Fireworks console.

## Troubleshooting

### Invalid samlResponse or relayState from identity provider

This error occurs if you are trying to use identity provider (IdP) initiated login. Fireworks currently only supports service provider (SP) initiated login.

See [Understanding SAML](https://developer.okta.com/docs/concepts/saml/#understand-sp-initiated-sign-in-flow) for an in-depth explanation.

### Required String parameter 'RelayState' is not present

See above.

# Managing users

Source: https://docs.fireworks.ai/accounts/users

Add, delete, and manage roles for users in your Fireworks account

See the concepts [page](/getting-started/concepts#account) for definitions of accounts and users.

## User roles

Each user in an account is assigned a role that determines their level of access:

| Role               | Description                                                                                                             |
| :----------------- | :---------------------------------------------------------------------------------------------------------------------- |
| **Admin**          | Full administrative control over resources, users, and access. Can manage all account settings and add or remove users. |
| **User** (default) | Can manage all resources, including those owned by others, but cannot manage users or access settings.                  |
| **Contributor**    | Can run inference on any resource and create and manage their own resources. Cannot modify resources owned by others.   |
| **Inference User** | Can view all resources and run inference, but cannot create or modify resources.                                        |

The `contributor` and `inference-user` roles are newer roles that provide more granular access control. Contact Fireworks support if you need these roles enabled for your account.

#### Resource management

| Permission                                                             | Inference User | Contributor | User | Admin |
| :--------------------------------------------------------------------- | :------------: | :---------: | :--: | :---: |
| Execute inference on any deployment                                    |       ✅        |      ✅      |  ✅   |   ✅   |
| View all resources (deployments, models, fine tuning jobs, datasets)   |       ✅        |      ✅      |  ✅   |   ✅   |
| Create new resources (deployments, models, fine tuning jobs, datasets) |       ❌        |      ✅      |  ✅   |   ✅   |
| Manage their own resources (edit/delete)                               |       ❌        |      ✅      |  ✅   |   ✅   |
| Manage resources owned by others (edit/delete)                         |       ❌        |      ❌      |  ✅   |   ✅   |

#### API key & account management

| Permission                                         | Inference User | Contributor | User | Admin |
| :------------------------------------------------- | :------------: | :---------: | :--: | :---: |
| Manage self-owned API keys (create/delete)         |       ✅        |      ✅      |  ✅   |   ✅   |
| View all users and service accounts                |       ✅        |      ✅      |  ✅   |   ✅   |
| Create service account API keys                    |       ❌        |      ❌      |  ❌   |   ✅   |
| Delete other users' and service accounts' API keys |       ❌        |      ❌      |  ❌   |   ✅   |
| Add/modify/delete users and their access           |       ❌        |      ❌      |  ❌   |   ✅   |

## Adding users

To add a new user to your Fireworks account, run the following command. If the email for the new user is already associated with a Fireworks account, they will have the option to freely switch between your account and their existing account(s).

You can also add users in the Fireworks web UI at [https://app.fireworks.ai/account/users](https://app.fireworks.ai/account/users).

```bash
firectl user create --email="alice@example.com"
```

To create another admin user, pass the `--role=admin` flag:

```bash
firectl user create --email="alice@example.com" --role=admin
```

## Updating a user's role

To update a user's role, run:

```bash
firectl user update <USER_ID> --role=<ROLE>
```

Where `<ROLE>` is one of: `admin`, `user`, `contributor`, or `inference-user`.

## Deleting users

You can remove a user from your account by running:

```bash
firectl user delete <USER_ID>
```

# Create a Message

Source: https://docs.fireworks.ai/api-reference/anthropic-messages

post /v1/messages

**Anthropic-compatible endpoint.** Send a structured list of input messages with text and/or image content, and the model will generate the next message in the conversation.

The Messages API can be used for either single queries or stateless multi-turn conversations.

**Fireworks Quickstarts:**

- [Serverless Quickstart](/getting-started/quickstart)
- [Deployments Quickstart](/getting-started/ondemand-quickstart)

This endpoint provides an Anthropic-compatible Messages API surface on Fireworks. For setup, supported features, and known differences, see [Anthropic compatibility](/tools-sdks/anthropic-compatibility).

# Cancel Reinforcement Fine-tuning Job

Source: https://docs.fireworks.ai/api-reference/cancel-reinforcement-fine-tuning-job

post /v1/accounts/{account_id}/reinforcementFineTuningJobs/{reinforcement_fine_tuning_job_id}:cancel

# Create API Key

Source: https://docs.fireworks.ai/api-reference/create-api-key

post /v1/accounts/{account_id}/users/{user_id}/apiKeys

# Create Batch Inference Job

Source: https://docs.fireworks.ai/api-reference/create-batch-inference-job

post /v1/accounts/{account_id}/batchInferenceJobs

# Create Dataset

Source: https://docs.fireworks.ai/api-reference/create-dataset

post /v1/accounts/{account_id}/datasets

# Load LoRA

Source: https://docs.fireworks.ai/api-reference/create-deployed-model

post /v1/accounts/{account_id}/deployedModels

# Create Deployment

Source: https://docs.fireworks.ai/api-reference/create-deployment

post /v1/accounts/{account_id}/deployments

## Creating a deployment with a deployment shape

[Deployment shapes](/guides/ondemand-deployments#deployment-shapes) are pre-configured templates optimized for speed, cost, or efficiency. To create a deployment with a specific shape, pass the `deploymentShape` field in the request body along with `baseModel`.

Use the [List Deployment Shape Versions](/api-reference/list-deployment-shape-versions) endpoint to find available shapes for your model.

```bash
curl -X POST "https://api.fireworks.ai/v1/accounts/YOUR_ACCOUNT_ID/deployments" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "baseModel": "accounts/fireworks/models/gpt-oss-120b",
    "deploymentShape": "accounts/fireworks/deploymentShapes/gpt-oss-120b-minimal",
    "minReplicaCount": 0,
    "maxReplicaCount": 1
  }'
```

When using a deployment shape, you do not need to specify `activeModelVersion` or `targetModelVersion`; the shape provides the necessary configuration.

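
The same request can be assembled programmatically. The following is a minimal Python sketch using only the standard library, assuming the URL, headers, and body fields shown in the cURL example; `build_create_deployment_request` is a hypothetical helper name, and the function only builds the request without sending it.

```python
import json
import urllib.request


def build_create_deployment_request(account_id, api_key, base_model,
                                    deployment_shape, min_replicas=0,
                                    max_replicas=1):
    """Build (but do not send) the POST request for a shaped deployment."""
    body = {
        "baseModel": base_model,
        "deploymentShape": deployment_shape,
        "minReplicaCount": min_replicas,
        "maxReplicaCount": max_replicas,
    }
    return urllib.request.Request(
        url=f"https://api.fireworks.ai/v1/accounts/{account_id}/deployments",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_create_deployment_request(
        "YOUR_ACCOUNT_ID",
        "fw_...",
        "accounts/fireworks/models/gpt-oss-120b",
        "accounts/fireworks/deploymentShapes/gpt-oss-120b-minimal",
    )
    # Inspect the request body; nothing is sent over the network here.
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the body easy to inspect or unit-test before a deployment is actually created.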
# Create dpo job

Source: https://docs.fireworks.ai/api-reference/create-dpo-job

post /v1/accounts/{account_id}/dpoJobs

# Create Evaluation Job

Source: https://docs.fireworks.ai/api-reference/create-evaluation-job

post /v1/accounts/{account_id}/evaluationJobs

# Create Evaluator

Source: https://docs.fireworks.ai/api-reference/create-evaluator

post /v1/accounts/{account_id}/evaluatorsV2

Creates a custom evaluator for scoring model outputs. Evaluators use the [Eval Protocol](https://evalprotocol.io) to define test cases, run model inference, and score responses. They are used with evaluation jobs and Reinforcement Fine-Tuning (RFT).

## Source Code Requirements

Your project should contain:

- `requirements.txt` - Python dependencies for your evaluator
- `test_*.py` - Pytest test file(s) with [`@evaluation_test`](https://evalprotocol.io/reference/evaluation-test) decorated functions
- Any additional code/modules your evaluator needs

## Workflow

**Recommended:** Use the [`ep upload`](https://evalprotocol.io/reference/cli#ep-upload) CLI command to handle all these steps automatically.

If using the API directly:

1. Call this endpoint to create the evaluator resource
2. Package your source directory as a `.tar.gz` (respecting `.gitignore`)
3. Call [Get Evaluator Upload Endpoint](/api-reference/get-evaluator-upload-endpoint) to get a signed upload URL
4. `PUT` the tar.gz file to the signed URL
5. Call [Validate Evaluator Upload](/api-reference/validate-evaluator-upload) to trigger server-side validation
6. Poll [Get Evaluator](/api-reference/get-evaluator) until ready

Once active, reference the evaluator in [Create Evaluation Job](/api-reference/create-evaluation-job) or [Create Reinforcement Fine-tuning Job](/api-reference/create-reinforcement-fine-tuning-job).

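
Step 2 of the workflow above (packaging the source directory) can be sketched in Python with the standard `tarfile` module. This is an illustrative simplification rather than what `ep upload` does: it skips a fixed set of directory names instead of parsing `.gitignore`, and `EXCLUDED_NAMES`, `package_evaluator`, and `archive_members` are hypothetical names.

```python
import io
import tarfile
from pathlib import Path

# Assumption: a fixed exclusion list stands in for real .gitignore handling.
EXCLUDED_NAMES = {".git", "__pycache__", ".venv"}


def package_evaluator(source_dir: str) -> bytes:
    """Return a gzip-compressed tar archive of source_dir as bytes."""
    root = Path(source_dir)

    def skip_excluded(info: tarfile.TarInfo):
        # Returning None drops the entry (and, for directories, its contents).
        return None if Path(info.name).name in EXCLUDED_NAMES else info

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for entry in sorted(root.iterdir()):
            archive.add(str(entry), arcname=entry.name, filter=skip_excluded)
    return buffer.getvalue()


def archive_members(data: bytes) -> list[str]:
    """List the member names of an archive produced by package_evaluator."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        return sorted(archive.getnames())
```

The resulting bytes are what step 4 would `PUT` to the signed URL. For a project with a non-trivial `.gitignore`, building the file list from `git ls-files` would be a closer match to the documented behavior.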
# Create Model

Source: https://docs.fireworks.ai/api-reference/create-model

post /v1/accounts/{account_id}/models

# Create Reinforcement Fine-tuning Job

Source: https://docs.fireworks.ai/api-reference/create-reinforcement-fine-tuning-job

post /v1/accounts/{account_id}/reinforcementFineTuningJobs

# Create Reinforcement Fine-tuning Step

Source: https://docs.fireworks.ai/api-reference/create-reinforcement-fine-tuning-step

post /v1/accounts/{account_id}/rlorTrainerJobs

# Create Router

Source: https://docs.fireworks.ai/api-reference/create-router

post /v1/accounts/{account_id}/routers

# Create secret

Source: https://docs.fireworks.ai/api-reference/create-secret

post /v1/accounts/{account_id}/secrets

# Create Supervised Fine-tuning Job

Source: https://docs.fireworks.ai/api-reference/create-supervised-fine-tuning-job

post /v1/accounts/{account_id}/supervisedFineTuningJobs

# Create User

Source: https://docs.fireworks.ai/api-reference/create-user

post /v1/accounts/{account_id}/users

# Create embeddings

Source: https://docs.fireworks.ai/api-reference/creates-an-embedding-vector-representing-the-input-text

post /embeddings

# Delete API Key

Source: https://docs.fireworks.ai/api-reference/delete-api-key

post /v1/accounts/{account_id}/users/{user_id}/apiKeys:delete

# Delete Batch Inference Job

Source: https://docs.fireworks.ai/api-reference/delete-batch-inference-job

delete /v1/accounts/{account_id}/batchInferenceJobs/{batch_inference_job_id}

# Delete Dataset

Source: https://docs.fireworks.ai/api-reference/delete-dataset

delete /v1/accounts/{account_id}/datasets/{dataset_id}

# Unload LoRA

Source: https://docs.fireworks.ai/api-reference/delete-deployed-model

delete /v1/accounts/{account_id}/deployedModels/{deployed_model_id}

# Delete Deployment

Source: https://docs.fireworks.ai/api-reference/delete-deployment

delete /v1/accounts/{account_id}/deployments/{deployment_id}

# Delete dpo job

Source: https://docs.fireworks.ai/api-reference/delete-dpo-job

delete /v1/accounts/{account_id}/dpoJobs/{dpo_job_id}

# Delete Evaluation Job

Source: https://docs.fireworks.ai/api-reference/delete-evaluation-job

delete /v1/accounts/{account_id}/evaluationJobs/{evaluation_job_id}

# Delete Evaluator

Source: https://docs.fireworks.ai/api-reference/delete-evaluator

delete /v1/accounts/{account_id}/evaluators/{evaluator_id}

Deletes an evaluator and its associated versions and build artifacts.

# Delete Model

Source: https://docs.fireworks.ai/api-reference/delete-model

delete /v1/accounts/{account_id}/models/{model_id}

# Delete Reinforcement Fine-tuning Job

Source: https://docs.fireworks.ai/api-reference/delete-reinforcement-fine-tuning-job

delete /v1/accounts/{account_id}/reinforcementFineTuningJobs/{reinforcement_fine_tuning_job_id}

# Delete Reinforcement Fine-tuning Step

Source: https://docs.fireworks.ai/api-reference/delete-reinforcement-fine-tuning-step

delete /v1/accounts/{account_id}/rlorTrainerJobs/{rlor_trainer_job_id}

# Delete Response

Source: https://docs.fireworks.ai/api-reference/delete-response

delete /v1/responses/{response_id}

Deletes a model response by its ID. Once deleted, the response data is gone immediately and permanently. The response cannot be recovered, and any conversations that reference this response ID will no longer be able to access it.

# Delete Router Source: https://docs.fireworks.ai/api-reference/delete-router delete /v1/accounts/{account_id}/routers/{router_id} # Delete secret Source: https://docs.fireworks.ai/api-reference/delete-secret delete /v1/accounts/{account_id}/secrets/{secret_id} # Delete Supervised Fine-tuning Job Source: https://docs.fireworks.ai/api-reference/delete-supervised-fine-tuning-job delete /v1/accounts/{account_id}/supervisedFineTuningJobs/{supervised_fine_tuning_job_id} # Execute one training step for keep-alive Reinforcement Fine-tuning Step Source: https://docs.fireworks.ai/api-reference/execute-reinforcement-fine-tuning-step post /v1/accounts/{account_id}/rlorTrainerJobs/{rlor_trainer_job_id}:executeTrainStep # Generate an image with FLUX.1 [schnell] FP8 Source: https://docs.fireworks.ai/api-reference/generate-a-new-image-from-a-text-prompt POST https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/flux-1-schnell-fp8/text_to_image [FLUX.1 \[schnell\]](https://huggingface.co/fireworks-ai/FLUX.1-schnell-fp8-flumina) is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. The FP8 version uses reduced precision numerics for 2x faster inference. See our [Playground](https://app.fireworks.ai/playground?model=accounts/fireworks/models/flux-1-schnell-fp8) to quickly try it out in your browser. ## Headers Specifies which format to return the response in. With `image/png` and `image/jpeg`, the server will populate the response body with a binary image of the specified format. The media type of the request body. The Bearer with Fireworks API Key. ## Request Body Prompt to use for the image generation process. Aspect ratio of the generated image. **Options:** `1:1`, `21:9`, `16:9`, `3:2`, `5:4`, `4:5`, `2:3`, `9:16`, `9:21`, `4:3`, `3:4` Classifier-free guidance scale for the image diffusion process. Default value is 3.5. Number of denoising steps for the image generation process. Default value is 4. 
Random seed to use for the image generation process. If 0, we will use a totally random seed. ```python Python theme={null} import requests url = "https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/flux-1-schnell-fp8/text_to_image" headers = { "Content-Type": "application/json", "Accept": "image/jpeg", "Authorization": "Bearer $API_KEY", } data = { "prompt": "A beautiful sunset over the ocean" } response = requests.post(url, headers=headers, json=data) if response.status_code == 200: with open("a.jpg", "wb") as f: f.write(response.content) print("Image saved as a.jpg") else: print("Error:", response.status_code, response.text) ``` ```typescript TypeScript theme={null} import fs from "fs"; import fetch from "node-fetch"; (async () => { const response = await fetch("https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/flux-1-schnell-fp8/text_to_image", { method: "POST", headers: { "Content-Type": "application/json", "Accept": "image/jpeg", "Authorization": "Bearer $API_KEY" }, body: JSON.stringify({ prompt: "A beautiful sunset over the ocean" }), }); // To process the response and get the image: const buffer = await response.arrayBuffer(); fs.writeFile('a.jpg', Buffer.from(buffer), () => console.log('Finished downloading!')); })().catch(console.error); ``` ```shell curl theme={null} curl --request POST \ -S --fail-with-body \ --url https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/flux-1-schnell-fp8/text_to_image \ -H 'Content-Type: application/json' \ -H 'Accept: image/jpeg' \ -H "Authorization: Bearer $API_KEY" \ --data ' { "prompt": "A beautiful sunset over the ocean" }' -o a.jpg ``` ```json Accept: application/json theme={null} { "id": "1234567890", "base64": ["data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...", "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA..."], "finishReason": "SUCCESS", "seed": 1234567890 } ``` ```txt Accept: image/jpeg theme={null} 
/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBQYFBAYGBQYHBwYIChAKCgkJChQODwwQFxQYGBcUFhYaHSUfGhsjHBYWICwgIyYnKSopGR8tMC0oMCUoKSj/2wBDAQcHBwoIChMKChMoGhYaKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCj/wAARCAABAAEDASIAAhEBAxEB/8QAFQABAQAAAAAAAAAAAAAAAAAAAAv/xAAUEAEAAAAAAAAAAAAAAAAAAAAA/8QAFQEBAQAAAAAAAAAAAAAAAAAAAAX/xAAUEQEAAAAAAAAAAAAAAAAAAAAA/9oADAMBAAIRAxEAPwCdABmX/9k= ``` ```txt Accept: image/png theme={null} iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNkYPhfDwAChwGA60e6kgAAAABJRU5ErkJggg== ``` ## Response The unique identifier for the image generation request. Includes a base64-encoded string containing an image in PNG format. To retrieve the image, base64-decode the string into binary data, then load that binary data as a PNG file. Can be `SUCCESS` or `CONTENT_FILTERED`. Specifies the outcome of the image generation process. It could be `SUCCESS` indicating that the image was successfully generated, or `CONTENT_FILTERED` if the image was filtered due to the safety\_check=true parameter being set. The seed used for the image generation process. When the Accept type is `image/jpeg`, the response body will contain a binary image. Additionally, the response will include headers such as: **Content-Length:** Represents the length of the binary image content. **Seed:** The random seed used to generate the image. **Finish-Reason:** Indicates the outcome of the image generation, such as `CONTENT_FILTERED` or `SUCCESS`. When the Accept type is `image/png`, the response body will contain a binary image. Additionally, the response will include headers such as: **Content-Length:** Represents the length of the binary image content. **Seed:** The random seed used to generate the image. **Finish-Reason:** Indicates the outcome of the image generation, such as `CONTENT_FILTERED` or `SUCCESS`. 
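When the request uses `Accept: application/json`, the images arrive as base64 strings rather than binary data. The decoding step described above can be sketched as follows. This is a minimal sketch that assumes the response shape shown in the sample JSON (a `base64` list whose entries may carry a `data:image/png;base64,` prefix); the helper name `decode_images` is hypothetical.

```python
import base64


def decode_images(payload):
    """Return raw image bytes for each entry in the response's `base64` list.

    Assumes the JSON shape from the sample response; `payload` is the
    already-parsed response body (a dict).
    """
    images = []
    for entry in payload["base64"]:
        # Sample entries are data URIs; keep only the part after the last
        # comma. Plain base64 strings contain no comma and pass through.
        _, _, encoded = entry.rpartition(",")
        images.append(base64.b64decode(encoded))
    return images
```

Each returned item is the binary content of a PNG and can be written to disk with `open(path, "wb")`.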
# Generate or edit an image with FLUX.1 Kontext Source: https://docs.fireworks.ai/api-reference/generate-or-edit-image-using-flux-kontext POST https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model} 💡 Note that this API is async and will return the **request\_id** instead of the image. Call the [get\_result](/api-reference/get-generated-image-from-flux-kontex) API to obtain the generated image. FLUX Kontext Pro is a specialized model for generating contextually-aware images from text descriptions. Designed for professional use cases requiring high-quality, consistent image generation. Use our [Playground](https://app.fireworks.ai/playground?model=accounts/fireworks/models/flux-kontext-pro) to quickly try it out in your browser. FLUX Kontext Max is the most advanced model in the Kontext series, offering maximum quality and context understanding. Ideal for enterprise applications requiring the highest level of image generation performance. Use our [Playground](https://app.fireworks.ai/playground?model=accounts/fireworks/models/flux-kontext-max) to quickly try it out in your browser. ## Path The model to use for image generation. Use **flux-kontext-pro** or **flux-kontext-max** as the model name in the API. ## Headers The media type of the request body. Your Fireworks API key. ## Request Body Prompt to use for the image generation process. Base64 encoded image or URL to use with Kontext. Optional seed for reproducibility. Aspect ratio of the image between 21:9 and 9:21. Output format for the generated image. Can be 'jpeg' or 'png'. **Options:** `jpeg`, `png` URL to receive webhook notifications. **Length:** 1-2083 characters Optional secret for webhook signature verification. Whether to perform upsampling on the prompt. If active, automatically modifies the prompt for more creative generation. Tolerance level for input and output moderation. Between 0 and 6, 0 being most strict, 6 being least strict. Limit of 2 for Image to Image. 
**Range:** 0-6 ```python Python theme={null} import requests url = "https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}" headers = { "Content-Type": "application/json", "Authorization": "Bearer $API_KEY", } data = { "prompt": "A beautiful sunset over the ocean", "input_image": "", "seed": 42, "aspect_ratio": "", "output_format": "jpeg", "webhook_url": "", "webhook_secret": "", "prompt_upsampling": False, "safety_tolerance": 2 } response = requests.post(url, headers=headers, json=data) ``` ```typescript TypeScript theme={null} import fs from "fs"; import fetch from "node-fetch"; (async () => { const response = await fetch("https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}", { method: "POST", headers: { "Content-Type": "application/json", "Authorization": "Bearer $API_KEY" }, body: JSON.stringify({ prompt: "A beautiful sunset over the ocean" }), }); })().catch(console.error); ``` ```shell curl theme={null} curl --request POST \ -S --fail-with-body \ --url https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model} \ -H 'Content-Type: application/json' \ -H "Authorization: Bearer $API_KEY" \ --data ' { "prompt": "A beautiful sunset over the ocean" }' ``` ## Response Successful Response request id Unsuccessful Response error message # Get Account Source: https://docs.fireworks.ai/api-reference/get-account get /v1/accounts/{account_id} # Get Batch Inference Job Source: https://docs.fireworks.ai/api-reference/get-batch-inference-job get /v1/accounts/{account_id}/batchInferenceJobs/{batch_inference_job_id} # Get Account Usage Source: https://docs.fireworks.ai/api-reference/get-billing-usage get /v1/accounts/{account_id}/billingUsage # Get Dataset Source: https://docs.fireworks.ai/api-reference/get-dataset get /v1/accounts/{account_id}/datasets/{dataset_id} # Get Dataset Download Endpoint Source: https://docs.fireworks.ai/api-reference/get-dataset-download-endpoint get 
/v1/accounts/{account_id}/datasets/{dataset_id}:getDownloadEndpoint # Get Dataset Upload Endpoint Source: https://docs.fireworks.ai/api-reference/get-dataset-upload-endpoint post /v1/accounts/{account_id}/datasets/{dataset_id}:getUploadEndpoint # Get LoRA Source: https://docs.fireworks.ai/api-reference/get-deployed-model get /v1/accounts/{account_id}/deployedModels/{deployed_model_id} # Get Deployment Source: https://docs.fireworks.ai/api-reference/get-deployment get /v1/accounts/{account_id}/deployments/{deployment_id} # Get Deployment Shape Source: https://docs.fireworks.ai/api-reference/get-deployment-shape get /v1/accounts/{account_id}/deploymentShapes/{deployment_shape_id} # Get Deployment Shape Version Source: https://docs.fireworks.ai/api-reference/get-deployment-shape-version get /v1/accounts/{account_id}/deploymentShapes/{deployment_shape_id}/versions/{version_id} # Get dpo job Source: https://docs.fireworks.ai/api-reference/get-dpo-job get /v1/accounts/{account_id}/dpoJobs/{dpo_job_id} # Get dpo job metrics file endpoint Source: https://docs.fireworks.ai/api-reference/get-dpo-job-metrics-file-endpoint get /v1/accounts/{account_id}/dpoJobs/{dpo_job_id}:getMetricsFileEndpoint # Get Evaluation Job Source: https://docs.fireworks.ai/api-reference/get-evaluation-job get /v1/accounts/{account_id}/evaluationJobs/{evaluation_job_id} # Get Evaluation Job execution logs (stream log endpoint + tracing IDs). Source: https://docs.fireworks.ai/api-reference/get-evaluation-job-log-endpoint get /v1/accounts/{account_id}/evaluationJobs/{evaluation_job_id}:getExecutionLogEndpoint # Get Evaluator Source: https://docs.fireworks.ai/api-reference/get-evaluator get /v1/accounts/{account_id}/evaluators/{evaluator_id} Retrieves an evaluator by name. Use this to monitor build progress after creation (**step 6** in the [Create Evaluator](/api-reference/create-evaluator) workflow). 
Possible states: - `BUILDING` - Environment is being prepared - `ACTIVE` - Evaluator is ready to use - `BUILD_FAILED` - Check build logs via [Get Evaluator Build Log Endpoint](/api-reference/get-evaluator-build-log-endpoint) # Get Evaluator Build Log Endpoint Source: https://docs.fireworks.ai/api-reference/get-evaluator-build-log-endpoint get /v1/accounts/{account_id}/evaluators/{evaluator_id}:getBuildLogEndpoint Returns a signed URL to download the evaluator's build logs. Useful for debugging `BUILD_FAILED` state. # Get Evaluator Source Code Endpoint Source: https://docs.fireworks.ai/api-reference/get-evaluator-source-code-endpoint get /v1/accounts/{account_id}/evaluators/{evaluator_id}:getSourceCodeSignedUrl Returns a signed URL to download the evaluator's source code archive. Useful for debugging or reviewing the uploaded code. # Get Evaluator Upload Endpoint Source: https://docs.fireworks.ai/api-reference/get-evaluator-upload-endpoint post /v1/accounts/{account_id}/evaluators/{evaluator_id}:getUploadEndpoint Returns signed URLs for uploading evaluator source code (**step 3** in the [Create Evaluator](/api-reference/create-evaluator) workflow). After receiving the signed URL, upload your `.tar.gz` archive using HTTP `PUT` with `Content-Type: application/octet-stream` header. # Get generated image from FLUX.1 Kontext Source: https://docs.fireworks.ai/api-reference/get-generated-image-from-flux-kontex GET https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}/get_result Replace **model** with **flux-kontext-pro** in the API to get the result. Replace **model** with **flux-kontext-max** in the API to get the result. ## Path The model to use for image generation. Use **flux-kontext-pro** or **flux-kontext-max** as the model name in the API. ## Headers The media type of the request body. Your Fireworks API key. ## Request Body Request id generated from create/edit image request. 
```python Python theme={null}
import requests

url = "https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}/get_result"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer $API_KEY",
}
# The key must be the string "id"; a bare `id` would refer to Python's built-in function.
data = {"id": "request_id"}

response = requests.post(url, headers=headers, json=data)
print(response.text)
```

```typescript TypeScript theme={null}
import fetch from "node-fetch";

(async () => {
  const response = await fetch("https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}/get_result", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer $API_KEY"
    },
    body: JSON.stringify({
      id: "request_id"
    }),
  });
  // Print the raw response body, mirroring the Python example
  console.log(await response.text());
})().catch(console.error);
```

```shell curl theme={null}
curl --request POST \
  -S --fail-with-body \
  --url https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}/get_result \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $API_KEY" \
  --data '
{
  "id": "request_id"
}'
```

## Response

Task id for retrieving result

Available options: Task not found, Pending, Request Moderated, Content Moderated, Ready, Error

# Get Model

Source: https://docs.fireworks.ai/api-reference/get-model

get /v1/accounts/{account_id}/models/{model_id}

# Get Model Download Endpoint

Source: https://docs.fireworks.ai/api-reference/get-model-download-endpoint

get /v1/accounts/{account_id}/models/{model_id}:getDownloadEndpoint

# Get Model Upload Endpoint

Source: https://docs.fireworks.ai/api-reference/get-model-upload-endpoint

post /v1/accounts/{account_id}/models/{model_id}:getUploadEndpoint

# Get Quota

Source: https://docs.fireworks.ai/api-reference/get-quota

get /v1/accounts/{account_id}/quotas/{quota_id}

Gets a single quota by resource name.
# Get Reinforcement Fine-tuning Job Source: https://docs.fireworks.ai/api-reference/get-reinforcement-fine-tuning-job get /v1/accounts/{account_id}/reinforcementFineTuningJobs/{reinforcement_fine_tuning_job_id} # Get Reinforcement Fine-tuning Step Source: https://docs.fireworks.ai/api-reference/get-reinforcement-fine-tuning-step get /v1/accounts/{account_id}/rlorTrainerJobs/{rlor_trainer_job_id} # Get Response Source: https://docs.fireworks.ai/api-reference/get-response get /v1/responses/{response_id} # Get Router Source: https://docs.fireworks.ai/api-reference/get-router get /v1/accounts/{account_id}/routers/{router_id} # Get Secret Source: https://docs.fireworks.ai/api-reference/get-secret get /v1/accounts/{account_id}/secrets/{secret_id} Retrieves a secret by name. Note that the `value` field is not returned in the response for security reasons. Only the `name` and `key_name` fields are included. # Get Supervised Fine-tuning Job Source: https://docs.fireworks.ai/api-reference/get-supervised-fine-tuning-job get /v1/accounts/{account_id}/supervisedFineTuningJobs/{supervised_fine_tuning_job_id} # Get User Source: https://docs.fireworks.ai/api-reference/get-user get /v1/accounts/{account_id}/users/{user_id} # Introduction Source: https://docs.fireworks.ai/api-reference/introduction Fireworks AI REST API enables you to interact with various language, image and embedding models using an API Key. It also lets you automate management of models, deployments, datasets, and more. ## Authentication All requests made to the Fireworks AI REST API must include an `Authorization` header with a valid `Bearer` token using your API key, along with the `Content-Type: application/json` header. 
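The two required headers can be built once and reused across requests. The sketch below is illustrative only: the helper name `fireworks_headers` is hypothetical, and reading the key from a `FIREWORKS_API_KEY` environment variable is an assumption borrowed from the curl examples elsewhere in these docs.

```python
import os
from typing import Dict, Optional


def fireworks_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Build the Authorization and Content-Type headers for REST calls.

    Falls back to the FIREWORKS_API_KEY environment variable (an assumed
    convention) when no key is passed explicitly.
    """
    key = api_key or os.environ.get("FIREWORKS_API_KEY")
    if not key:
        raise ValueError("No API key provided and FIREWORKS_API_KEY is not set")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Failing early on a missing key avoids sending a request that would only be rejected for authentication.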
### Getting your API key You can obtain an API key by: * Using the [`firectl api-key create`](/tools-sdks/firectl/commands/api-key-create) command * Generating one through the [Fireworks AI dashboard](https://app.fireworks.ai/settings/users/api-keys) ### Request headers Include the following headers in your REST API requests: ```json theme={null} authorization: Bearer content-type: application/json ``` ## Account management APIs In addition to inference and deployment APIs, Fireworks exposes account-scoped quota endpoints. * [List Quotas](/api-reference/list-quotas) * [Get Quota](/api-reference/get-quota) * [Update Quota](/api-reference/update-quota) # List Accounts Source: https://docs.fireworks.ai/api-reference/list-accounts get /v1/accounts # List API Keys Source: https://docs.fireworks.ai/api-reference/list-api-keys get /v1/accounts/{account_id}/users/{user_id}/apiKeys # List Batch Inference Jobs Source: https://docs.fireworks.ai/api-reference/list-batch-inference-jobs get /v1/accounts/{account_id}/batchInferenceJobs # List Datasets Source: https://docs.fireworks.ai/api-reference/list-datasets get /v1/accounts/{account_id}/datasets # List LoRAs Source: https://docs.fireworks.ai/api-reference/list-deployed-models get /v1/accounts/{account_id}/deployedModels # List Deployment Shapes Versions Source: https://docs.fireworks.ai/api-reference/list-deployment-shape-versions get /v1/accounts/{account_id}/deploymentShapes/{deployment_shape_id}/versions Use this endpoint to query available deployment shape versions for a given model. Use `-` as a wildcard for both `account_id` and `deployment_shape_id` to search across all accounts and shapes. 
## Example: List shapes for a model

To list validated deployment shapes for a specific model, use the `filter` parameter with `snapshot.base_model` and `latest_validated=true`:

```bash theme={null}
curl -s "https://api.fireworks.ai/v1/accounts/-/deploymentShapes/-/versions?filter=snapshot.base_model%3D%22accounts%2Ffireworks%2Fmodels%2Fgpt-oss-120b%22%20AND%20latest_validated%3Dtrue&order_by=create_time%20desc" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" | jq .
```

### Filter syntax

The `filter` parameter uses [AIP-160 filtering](https://google.aip.dev/160). Common patterns:

| Filter                                                       | Description                                            |
| ------------------------------------------------------------ | ------------------------------------------------------ |
| `snapshot.base_model="accounts/fireworks/models/MODEL_NAME"` | Filter by base model                                   |
| `latest_validated=true`                                      | Only return the latest validated version of each shape |

Combine multiple conditions with `AND`:

```
snapshot.base_model="accounts/fireworks/models/MODEL_NAME" AND latest_validated=true
```

Remember to URL-encode the filter value when using curl directly. `=` becomes `%3D`, `"` becomes `%22`, `/` becomes `%2F`, and a space becomes `%20`.

# List Deployments

Source: https://docs.fireworks.ai/api-reference/list-deployments

get /v1/accounts/{account_id}/deployments

# List dpo jobs

Source: https://docs.fireworks.ai/api-reference/list-dpo-jobs

get /v1/accounts/{account_id}/dpoJobs

# List Evaluation Jobs

Source: https://docs.fireworks.ai/api-reference/list-evaluation-jobs

get /v1/accounts/{account_id}/evaluationJobs

# List Evaluators

Source: https://docs.fireworks.ai/api-reference/list-evaluators

get /v1/accounts/{account_id}/evaluators

Lists all evaluators for an account with pagination support.

# List Models

Source: https://docs.fireworks.ai/api-reference/list-models

get /v1/accounts/{account_id}/models

# List Quotas

Source: https://docs.fireworks.ai/api-reference/list-quotas

get /v1/accounts/{account_id}/quotas

Lists all quotas for an account.
# List Reinforcement Fine-tuning Jobs Source: https://docs.fireworks.ai/api-reference/list-reinforcement-fine-tuning-jobs get /v1/accounts/{account_id}/reinforcementFineTuningJobs # List Reinforcement Fine-tuning Steps Source: https://docs.fireworks.ai/api-reference/list-reinforcement-fine-tuning-steps get /v1/accounts/{account_id}/rlorTrainerJobs # List Responses Source: https://docs.fireworks.ai/api-reference/list-responses get /v1/responses Get a list of all responses for the authenticated account. Args: limit: Maximum number of responses to return (default: 20, max: 100) after: Cursor for pagination - return responses after this ID before: Cursor for pagination - return responses before this ID # List Routers Source: https://docs.fireworks.ai/api-reference/list-routers get /v1/accounts/{account_id}/routers # List Secrets Source: https://docs.fireworks.ai/api-reference/list-secrets get /v1/accounts/{account_id}/secrets Lists all secrets for an account. Note that the `value` field is not returned in the response for security reasons. Only the `name` and `key_name` fields are included for each secret. # List Supervised Fine-tuning Jobs Source: https://docs.fireworks.ai/api-reference/list-supervised-fine-tuning-jobs get /v1/accounts/{account_id}/supervisedFineTuningJobs # List Users Source: https://docs.fireworks.ai/api-reference/list-users get /v1/accounts/{account_id}/users # Create Chat Completion Source: https://docs.fireworks.ai/api-reference/post-chatcompletions post /v1/chat/completions Create a completion for the provided prompt and parameters. For RL / agent rollouts, Fireworks inference exposes additional rollout-specific features: [`x-session-affinity` and `x-multi-turn-session-id`](https://docs.fireworks.ai/guides/rollout-inference#session-affinity) for multi-turn trajectories, and [MoE Router Replay (R3)](https://docs.fireworks.ai/guides/rollout-inference#moe-router-replay) for MoE expert tracing during rollouts. 
# Create Completion Source: https://docs.fireworks.ai/api-reference/post-completions post /v1/completions Create a completion for the provided prompt and parameters. For RL / agent rollouts, Fireworks inference exposes additional rollout-specific features: [`x-session-affinity` and `x-multi-turn-session-id`](https://docs.fireworks.ai/guides/rollout-inference#session-affinity) for multi-turn trajectories, and [MoE Router Replay (R3)](https://docs.fireworks.ai/guides/rollout-inference#moe-router-replay) for MoE expert tracing during rollouts. # Create Response Source: https://docs.fireworks.ai/api-reference/post-responses post /v1/responses Creates a model response, optionally interacting with custom tools via the Model Context Protocol (MCP). This endpoint supports conversational continuation and streaming. Explore our cookbooks for detailed examples: - [Basic MCP Usage](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/fireworks_mcp_examples.ipynb) - [Streaming with MCP](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/fireworks_mcp_with_streaming.ipynb) - [Conversational History with `previous_response_id`](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/fireworks_previous_response_cookbook.ipynb) - [Basic Streaming](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/fireworks_streaming_example.ipynb) - [Controlling Response Storage](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/mcp_server_with_store_false_argument.ipynb) # Prepare Model for different precisions Source: https://docs.fireworks.ai/api-reference/prepare-model post /v1/accounts/{account_id}/models/{model_id}:prepare # Rerank documents Source: https://docs.fireworks.ai/api-reference/rerank-documents post /rerank Rerank documents for a query using relevance scoring # Resume Dpo Job Source: https://docs.fireworks.ai/api-reference/resume-dpo-job post /v1/accounts/{account_id}/dpoJobs/{dpo_job_id}:resume # Resume Reinforcement 
Fine-tuning Job Source: https://docs.fireworks.ai/api-reference/resume-reinforcement-fine-tuning-job post /v1/accounts/{account_id}/reinforcementFineTuningJobs/{reinforcement_fine_tuning_job_id}:resume # Resume Rlor Trainer Job Source: https://docs.fireworks.ai/api-reference/resume-reinforcement-fine-tuning-step post /v1/accounts/{account_id}/rlorTrainerJobs/{rlor_trainer_job_id}:resume # Resume Supervised Fine-tuning Job Source: https://docs.fireworks.ai/api-reference/resume-supervised-fine-tuning-job post /v1/accounts/{account_id}/supervisedFineTuningJobs/{supervised_fine_tuning_job_id}:resume # Scale Deployment to a specific number of replicas or to zero Source: https://docs.fireworks.ai/api-reference/scale-deployment patch /v1/accounts/{account_id}/deployments/{deployment_id}:scale # Undelete Deployment Source: https://docs.fireworks.ai/api-reference/undelete-deployment post /v1/accounts/{account_id}/deployments/{deployment_id}:undelete # Update Dataset Source: https://docs.fireworks.ai/api-reference/update-dataset patch /v1/accounts/{account_id}/datasets/{dataset_id} # Update LoRA Source: https://docs.fireworks.ai/api-reference/update-deployed-model patch /v1/accounts/{account_id}/deployedModels/{deployed_model_id} # Update Deployment Source: https://docs.fireworks.ai/api-reference/update-deployment patch /v1/accounts/{account_id}/deployments/{deployment_id} # Update Evaluator Source: https://docs.fireworks.ai/api-reference/update-evaluator patch /v1/accounts/{account_id}/evaluators/{evaluator_id} Updates evaluator metadata (display_name, description, default_dataset). Changing `requirements` or `entry_point` triggers a rebuild. To upload new source code, set `prepare_code_upload: true` then follow the upload flow. 
# Update Model Source: https://docs.fireworks.ai/api-reference/update-model patch /v1/accounts/{account_id}/models/{model_id} # Update Quota Source: https://docs.fireworks.ai/api-reference/update-quota patch /v1/accounts/{account_id}/quotas/{quota_id} Updates a quota. # Update Router Source: https://docs.fireworks.ai/api-reference/update-router patch /v1/accounts/{account_id}/routers/{router_id} # Update secret Source: https://docs.fireworks.ai/api-reference/update-secret patch /v1/accounts/{account_id}/secrets/{secret_id} # Update User Source: https://docs.fireworks.ai/api-reference/update-user patch /v1/accounts/{account_id}/users/{user_id} # Upload Dataset Files Source: https://docs.fireworks.ai/api-reference/upload-dataset-files post /v1/accounts/{account_id}/datasets/{dataset_id}:upload Provides a streamlined way to upload a dataset file in a single API request. This path can handle file sizes up to 150Mb. For larger file sizes use [Get Dataset Upload Endpoint](get-dataset-upload-endpoint). # Validate Dataset Upload Source: https://docs.fireworks.ai/api-reference/validate-dataset-upload post /v1/accounts/{account_id}/datasets/{dataset_id}:validateUpload # Validate Evaluator Upload Source: https://docs.fireworks.ai/api-reference/validate-evaluator-upload post /v1/accounts/{account_id}/evaluators/{evaluator_id}:validateUpload Triggers server-side validation of the uploaded source code (**step 5** in the [Create Evaluator](/api-reference/create-evaluator) workflow). The server extracts and processes the archive, then builds the evaluator environment. Poll [Get Evaluator](/api-reference/get-evaluator) to monitor progress. # Validate Model Upload Source: https://docs.fireworks.ai/api-reference/validate-model-upload get /v1/accounts/{account_id}/models/{model_id}:validateUpload # Autoscaling Source: https://docs.fireworks.ai/deployments/autoscaling Configure how your deployment scales based on traffic Control how your deployment scales based on traffic and load. 
## Configuration options

| Flag                     | Type      | Default       | Description                                            |
| ------------------------ | --------- | ------------- | ------------------------------------------------------ |
| `--min-replica-count`    | Integer   | 0             | Minimum number of replicas. Set to 0 for scale-to-zero |
| `--max-replica-count`    | Integer   | 1             | Maximum number of replicas                             |
| `--scale-up-window`      | Duration  | 30s           | Wait time before scaling up                            |
| `--scale-down-window`    | Duration  | 10m           | Wait time before scaling down                          |
| `--scale-to-zero-window` | Duration  | 1h            | Idle time before scaling to zero (min: 5m)             |
| `--load-targets`         | Key-value | `default=0.8` | Scaling thresholds. See options below                  |

**Load target options** (use as `--load-targets <key>=<value>[,<key>=<value>...]`):

* `default=<value>` - General load target from 0 to 1
* `tokens_generated_per_second=<value>` - Desired tokens per second per replica
* `prompt_tokens_per_second=<value>` - Desired prompt tokens per second per replica
* `requests_per_second=<value>` - Desired requests per second per replica
* `concurrent_requests=<value>` - Desired concurrent requests per replica

When multiple targets are specified, each target implies its own replica count, and the largest of those counts is used.

## Common patterns

Scale to zero when idle to minimize costs:

```bash theme={null}
firectl deployment create \
  --min-replica-count 0 \
  --max-replica-count 3 \
  --scale-to-zero-window 1h
```

Best for: Development, testing, or intermittent production workloads.

Keep replicas running for instant response:

```bash theme={null}
firectl deployment create \
  --min-replica-count 2 \
  --max-replica-count 10 \
  --scale-up-window 15s \
  --load-targets concurrent_requests=5
```

Best for: Low-latency requirements, avoiding cold starts, high-traffic applications.

Match known traffic patterns:

```bash theme={null}
firectl deployment create \
  --min-replica-count 3 \
  --max-replica-count 5 \
  --scale-down-window 30m \
  --load-targets tokens_generated_per_second=150
```

Best for: Steady workloads where you know typical load ranges.
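The rule that the maximum replica count across all load targets is used can be illustrated with a small worked example. This is a sketch under stated assumptions, not the actual Fireworks autoscaler: it assumes each per-replica rate target implies `ceil(observed_load / target)` replicas, ignores the fractional `default` target and the scaling windows, and uses a hypothetical function name.

```python
import math


def desired_replicas(observed, targets, min_replicas, max_replicas):
    """Illustrative only: replicas implied by each target, then the maximum.

    `observed` maps a load-target name to the current total load;
    `targets` maps the same names to the desired per-replica value.
    """
    per_target = {
        name: math.ceil(observed[name] / target)
        for name, target in targets.items()
        if name in observed
    }
    # With no usable measurements, fall back to the configured minimum.
    wanted = max(per_target.values(), default=min_replicas)
    # Clamp to the --min-replica-count / --max-replica-count bounds.
    return max(min_replicas, min(max_replicas, wanted))


observed = {"requests_per_second": 12.0, "concurrent_requests": 40.0}
targets = {"requests_per_second": 5.0, "concurrent_requests": 5.0}
# requests_per_second implies 3 replicas, concurrent_requests implies 8.
print(desired_replicas(observed, targets, min_replicas=2, max_replicas=10))  # prints 8
```

Taking the maximum means the most demanding target drives scaling, so no individual target is exceeded as long as the result stays within the replica bounds.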
## Scaling from zero behavior

When a deployment is scaled to zero and receives a request, the system immediately returns a `503` error with the `DEPLOYMENT_SCALING_UP` error code while initiating the scale-up process:

```json theme={null}
{
  "error": {
    "message": "Deployment is currently scaled to zero and is scaling up. Please retry your request in a few minutes.",
    "code": "DEPLOYMENT_SCALING_UP",
    "type": "error"
  }
}
```

Requests to a scaled-to-zero deployment are **not queued**. Your application must implement retry logic to handle `503` responses while the deployment scales up.

### Handling scale-from-zero responses

Implement retry logic with exponential backoff to gracefully handle scale-up delays:

```python theme={null}
import time
import requests

def query_deployment_with_retry(url, payload, headers, max_retries=30, initial_delay=5):
    """Query a deployment with retry logic for scale-from-zero scenarios.

    `headers` should carry the Authorization and Content-Type headers.
    """
    delay = initial_delay

    for attempt in range(max_retries):
        response = requests.post(url, json=payload, headers=headers)

        # Only retry if deployment is scaling up
        if response.status_code == 503:
            error_code = response.json().get("error", {}).get("code")
            if error_code == "DEPLOYMENT_SCALING_UP":
                print(f"Deployment scaling up, retrying in {delay}s...")
                time.sleep(delay)
                delay = min(delay * 1.5, 60)  # Cap at 60 seconds
                continue

        # Any other error status raises; a success returns the parsed body
        response.raise_for_status()
        return response.json()

    raise Exception("Deployment did not scale up in time")
```

```javascript theme={null}
// `headers` should carry the Authorization header
async function queryDeploymentWithRetry(url, payload, headers = {}, maxRetries = 30, initialDelay = 5000) {
  let delay = initialDelay;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', ...headers },
      body: JSON.stringify(payload)
    });

    // Only retry if deployment is scaling up
    if (response.status === 503) {
      const body = await response.json();
      if (body.error?.code === 'DEPLOYMENT_SCALING_UP') {
        console.log(`Deployment scaling up, retrying in ${delay/1000}s...`);
        await new Promise(resolve => setTimeout(resolve, delay));
        delay = Math.min(delay * 1.5, 60000); // Cap at 60 seconds
        continue;
      }
    }

    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return response.json();
  }

  throw new Error('Deployment did not scale up in time');
}
```

```bash theme={null}
# Simple retry loop for scale-from-zero
MAX_RETRIES=30
RETRY_DELAY=5

for i in $(seq 1 $MAX_RETRIES); do
  response=$(curl -s -w "\n%{http_code}" \
    https://api.fireworks.ai/inference/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $FIREWORKS_API_KEY" \
    -d '{"model": "accounts//deployments/", ...}')

  # The last line is the HTTP status code; everything before it is the body
  http_code=$(echo "$response" | tail -n1)
  body=$(echo "$response" | sed '$d')  # portable across GNU and BSD tools

  # Only retry if deployment is scaling up
  if [ "$http_code" -eq 503 ]; then
    error_code=$(echo "$body" | jq -r '.error.code // empty')
    if [ "$error_code" = "DEPLOYMENT_SCALING_UP" ]; then
      echo "Deployment scaling up, retrying in ${RETRY_DELAY}s..."
      sleep $RETRY_DELAY
      RETRY_DELAY=$((RETRY_DELAY * 2))
      continue
    fi
    echo "$body"
    exit 1
  fi

  # Check for success (2xx status codes)
  if [ "$http_code" -ge 200 ] && [ "$http_code" -lt 300 ]; then
    echo "$body"
    exit 0
  fi

  echo "$body"
  exit 1
done

echo "Deployment did not scale up in time"
exit 1
```

Cold start times vary with model size; larger models may take longer to download and initialize.

If you need instant responses without cold starts, set `--min-replica-count 1` or higher to keep replicas always running.

Deployments with min replicas = 0 are auto-deleted after 7 days of no traffic. [Reserved capacity](/deployments/reservations) guarantees availability during scale-up.

# Performance benchmarking

Source: https://docs.fireworks.ai/deployments/benchmarking

Measure and optimize your deployment's performance with load testing

Understanding your deployment's performance under various load conditions is essential for production readiness.
Fireworks provides tools and best practices for benchmarking throughput, latency, and identifying bottlenecks. ## Fireworks Benchmark Tool Use our open-source benchmarking tool to measure and optimize your deployment's performance: **[Fireworks Benchmark Tool](https://github.com/fw-ai/benchmark)** This tool allows you to: * Test throughput and latency under various load conditions * Simulate production traffic patterns * Identify performance bottlenecks * Compare different deployment configurations ### Installation ```bash theme={null} git clone https://github.com/fw-ai/benchmark.git cd benchmark pip install -r requirements.txt ``` ### Basic usage Run a basic benchmark test: ```bash theme={null} python benchmark.py \ --model "accounts/fireworks/models/llama-v3p1-8b-instruct" \ --deployment "your-deployment-id" \ --num-requests 1000 \ --concurrency 10 ``` ### Key metrics to monitor When benchmarking your deployment, focus on these key metrics: * **Throughput**: Requests per second (RPS) your deployment can handle * **Latency**: Time to first token (TTFT) and end-to-end response time * **Token generation rate**: Tokens per second during generation * **Error rate**: Failed requests under load ## Custom benchmarking You can also develop custom performance testing scripts or integrate with monitoring tools to track metrics over time. Consider: * Using production-like request patterns and payloads * Testing with various concurrency levels * Monitoring resource utilization (GPU, memory, network) * Testing autoscaling behavior under load ## Best practices 1. **Warm up your deployment**: Run a few requests before benchmarking to ensure models are loaded 2. **Test realistic scenarios**: Use request patterns and payloads similar to your production workload 3. **Gradually increase load**: Start with low concurrency and gradually increase to find your deployment's limits 4. **Monitor for errors**: Track error rates and response codes to identify issues under load 5. 
**Compare configurations**: Test different deployment shapes, quantization levels, and hardware to optimize cost and performance ## Next steps Configure autoscaling to handle variable load Optimize your client code for maximum throughput # Client-side performance optimization Source: https://docs.fireworks.ai/deployments/client-side-performance-optimization Optimize your client code for maximum performance with dedicated deployments When using a dedicated deployment, it is important to optimize the client-side HTTP connection pooling for maximum performance. We recommend using our [Python SDK](/tools-sdks/python-sdk) as it has good defaults for connection pooling and utilizes [httpx](https://www.python-httpx.org/) for optimal performance with Python's `asyncio` library. It also includes retry logic for handling `429` errors that Fireworks returns when the server is overloaded. ## General optimization recommendations Based on our benchmarks, we recommend the following: 1. Use a client library optimized for high concurrency, such as [httpx](https://www.python-httpx.org/) in Python or [http.Agent](https://nodejs.org/api/http.html#class-httpagent) in Node.js. 2. Use the `AsyncFireworks` client for high-concurrency workloads. 3. Increase concurrency until performance stops improving or you observe too many `429` errors. ## Code example: Optimal concurrent requests (Python) Install the [Fireworks Python SDK](/tools-sdks/python-sdk): The SDK is currently in alpha. Use the `--pre` flag when installing to get the latest version. 
```bash pip theme={null} pip install --pre fireworks-ai ``` ```bash poetry theme={null} poetry add --pre fireworks-ai ``` ```bash uv theme={null} uv add --pre fireworks-ai ``` Here's how to implement optimal concurrent requests using `asyncio` and the `AsyncFireworks` client: ```python main.py theme={null} import asyncio import time import statistics from fireworks import AsyncFireworks async def make_concurrent_requests( messages: list[str], model: str, max_workers: int = 1000, ): """Make concurrent requests with optimized connection pooling""" client = AsyncFireworks( max_retries=5, ) # Semaphore to limit concurrent requests semaphore = asyncio.Semaphore(max_workers) latencies = [] async def single_request(message: str): """Make a single request with semaphore control""" async with semaphore: start_time = time.perf_counter() response = await client.chat.completions.create( model=model, messages=[{"role": "user", "content": message}], max_tokens=100, ) latency = time.perf_counter() - start_time latencies.append(latency) return response.choices[0].message.content # Create all request tasks tasks = [single_request(message) for message in messages] # Execute all requests concurrently results = await asyncio.gather(*tasks) return results, latencies # Usage example async def main(): messages = ["Hello!"] * 1000 # 1000 requests model = "accounts/fireworks/models/qwen3-0p6b" start_time = time.perf_counter() results, latencies = await make_concurrent_requests( messages=messages, model=model, ) total_time = time.perf_counter() - start_time # Calculate performance metrics num_requests = len(results) requests_per_second = num_requests / total_time # Latency statistics (in milliseconds) latencies_ms = [lat * 1000 for lat in latencies] avg_latency = statistics.mean(latencies_ms) min_latency = min(latencies_ms) max_latency = max(latencies_ms) p50_latency = statistics.median(latencies_ms) p95_latency = statistics.quantiles(latencies_ms, n=20)[18] # 95th percentile p99_latency = 
statistics.quantiles(latencies_ms, n=100)[98] # 99th percentile print("\n" + "=" * 50) print("Performance Results") print("=" * 50) print(f"Total requests: {num_requests}") print(f"Total time: {total_time:.2f} seconds") print(f"Throughput: {requests_per_second:.2f} requests/second") print("\nLatency Statistics (ms):") print(f" Min: {min_latency:.2f}") print(f" Max: {max_latency:.2f}") print(f" Avg: {avg_latency:.2f}") print(f" P50 (median): {p50_latency:.2f}") print(f" P95: {p95_latency:.2f}") print(f" P99: {p99_latency:.2f}") print("=" * 50) if __name__ == "__main__": asyncio.run(main()) ``` This implementation: * Uses `AsyncFireworks` for non-blocking async requests with optimized connection pooling * Uses `asyncio.Semaphore` to control concurrency to avoid overwhelming the server # Exporting Metrics Source: https://docs.fireworks.ai/deployments/exporting-metrics Export metrics from your dedicated deployments to your observability stack ## Overview Fireworks provides a metrics endpoint in Prometheus format, enabling integration with popular observability tools like Prometheus, OpenTelemetry (OTel) Collector, Datadog Agent, and Vector. This page covers real-time performance metrics (latency, throughput, etc.) for on-demand deployments. For billing and usage data across all Fireworks services, see [Exporting Billing Metrics](/accounts/exporting-billing-metrics). ## Setting Up Metrics Collection ### Endpoint The metrics endpoint is as follows. This URL and authorization header can be directly used by services like Grafana Cloud to ingest Fireworks metrics. ``` https://api.fireworks.ai/v1/accounts//metrics ``` ### Authentication Use the Authorization header with your Fireworks API key: ```json theme={null} { "Authorization": "Bearer YOUR_API_KEY" } ``` ### Scrape Interval We recommend using a 1-minute scrape interval as metrics are updated every 30s. 

### Rate Limits

To ensure service stability and fair usage:

* Maximum of 6 requests per minute per account
* Exceeding this limit results in HTTP 429 (Too Many Requests) responses
* Use a 1-minute scrape interval to stay within limits

## Integration Options

Fireworks metrics can be integrated with various observability platforms through multiple approaches:

### OpenTelemetry Collector Integration

The Fireworks metrics endpoint can be integrated with OpenTelemetry Collector by configuring a Prometheus receiver that scrapes the endpoint. This allows Fireworks metrics to be pushed to a variety of popular exporters; see the [OpenTelemetry registry](https://opentelemetry.io/ecosystem/registry/) for a full list.

### Direct Prometheus Integration

To integrate directly with Prometheus, specify the Fireworks metrics endpoint in your scrape config:

```yaml theme={null}
global:
  scrape_interval: 60s
scrape_configs:
  - job_name: 'fireworks'
    metrics_path: 'v1/accounts//metrics'
    authorization:
      type: "Bearer"
      credentials: "YOUR_API_KEY"
    static_configs:
      - targets: ['api.fireworks.ai']
    scheme: https
```

For more details on Prometheus configuration, refer to the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/).
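
If you want to inspect the endpoint without running Prometheus, a small script can fetch and parse the text format directly. The sketch below is a minimal illustration, not an official client: the helper names `fetch_metrics` and `parse_metrics` are hypothetical, the account ID is assumed to be the path segment between `accounts/` and `/metrics`, and the sample payload is made up. The parser handles the common `name{labels} value` line shape only.

```python
import re
import urllib.request

# metric name, optional {label="value",...} block, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = LINE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(LABEL_RE.findall(raw_labels or ""))
            samples.append((name, labels, float(value)))
    return samples

def fetch_metrics(account_id, api_key):
    """Fetch one scrape; call at most once per minute to respect the rate limit."""
    url = f"https://api.fireworks.ai/v1/accounts/{account_id}/metrics"
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")

# Made-up sample payload, for illustration only (no network call is made here)
sample = '''# TYPE request_counter_total:sum_by_deployment gauge
request_counter_total:sum_by_deployment{deployment_id="my-deployment"} 2.5
'''
for name, labels, value in parse_metrics(sample):
    print(name, labels["deployment_id"], value)
```

For anything beyond ad-hoc inspection, a real scraper (Prometheus, OTel Collector) is the better choice, since it handles histograms, timestamps, and retries for you.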
### Supported Platforms Fireworks metrics can be exported to various observability platforms including: * Prometheus * Datadog * Grafana * New Relic ## Available Metrics ### Common Labels All metrics include the following common labels: * `base_model`: The base model identifier (e.g., "accounts/fireworks/models/deepseek-v3") * `deployment`: Full deployment path (e.g., "accounts/account-name/deployments/deployment-id") * `deployment_account`: The account name * `deployment_id`: The deployment identifier ### Rate Metrics (per second) These metrics show activity rates calculated using 1-minute windows: #### Request Rate * `request_counter_total:sum_by_deployment`: Request rate per deployment #### Error Rate * `requests_error_total:sum_by_deployment`: Error rate per deployment, broken down by HTTP status code (includes additional `http_code` label) #### Token Processing Rates * `tokens_cached_prompt_total:sum_by_deployment`: Rate of cached prompt tokens per deployment * `tokens_prompt_total:sum_by_deployment`: Rate of total prompt tokens processed per deployment ### Latency Histogram Metrics These metrics provide latency distribution data with histogram buckets, calculated using 1-minute windows: #### Generation Latency * `latency_generation_per_token_ms_bucket:sum_by_deployment`: Per-token generation time distribution * `latency_generation_queue_ms_bucket:sum_by_deployment`: Time spent waiting in generation queue #### Request Latency * `latency_overall_ms_bucket:sum_by_deployment`: End-to-end request latency distribution * `latency_to_first_token_ms_bucket:sum_by_deployment`: Time to first token distribution #### Prefill Latency * `latency_prefill_ms_bucket:sum_by_deployment`: Prefill processing time distribution * `latency_prefill_queue_ms_bucket:sum_by_deployment`: Time spent waiting in prefill queue ### Token Distribution Metrics These histogram metrics show token count distributions per request, calculated using 1-minute windows: * 
`tokens_generated_per_request_bucket:sum_by_deployment`: Distribution of generated tokens per request * `tokens_prompt_per_request_bucket:sum_by_deployment`: Distribution of prompt tokens per request ### Resource Utilization Metrics These gauge metrics show average resource usage: * `generator_kv_blocks_fraction:avg_by_deployment`: Average fraction of KV cache blocks in use * `generator_kv_slots_fraction:avg_by_deployment`: Average fraction of KV cache slots in use * `generator_model_forward_time:avg_by_deployment`: Average time spent in model forward pass * `requests_coordinator_concurrent_count:avg_by_deployment`: Average number of concurrent requests * `prefiller_prompt_cache_ttl:avg_by_deployment`: Average prompt cache time-to-live # Regions Source: https://docs.fireworks.ai/deployments/regions Fireworks runs a global fleet of hardware on which you can deploy your models. Fireworks runs a global fleet so you can deploy models close to users, meet data-residency needs, and scale across clouds. This page covers **multi-region** (default behavior and quota groupings), **single-region** availability and hardware, how to **use and change** regions, and **quotas**. ## Multi-region (recommended) By default, deployments are multi-region: Fireworks can move and spread them across regions as needed. Multi-regions (**GLOBAL**, **US**, **EUROPE**, **APAC**) are high-level groupings of single regions. Your deployment may run in any single region(s) within that multi-region. Utilizing multiple clouds and locations maximizes the odds that there's capacity to scale. Multi-region deployments enable resilience to localized outages, maintaining application availability as workloads scale across regions. ### Supported multi-regions Supported multi-regions: `GLOBAL`, `US`, `EUROPE`, `APAC`. ## Single region availability Single regions are concrete locations (e.g. `US_IOWA_1`, `EU_FRANKFURT_1`) where your deployment can run. 
The single regions listed below are available, but we recommend multi-region for most users because it scales more elastically and is more reliable. If you have a specific need for a single region, contact [Fireworks](mailto:inquiries@fireworks.ai) to request it.

The table below shows which single regions are available and what hardware is offered in each.

| **Region**        | **Accelerator Type(s)**                  |
| ----------------- | ---------------------------------------- |
| `US_ARIZONA_1`    | `NVIDIA_H100_80GB`                       |
| `US_CALIFORNIA_1` | `NVIDIA_H200_141GB`                      |
| `US_GEORGIA_2`    | `NVIDIA_B200_180GB`                      |
| `US_GEORGIA_3`    | `NVIDIA_H200_141GB`                      |
| `US_ILLINOIS_1`   | `NVIDIA_H100_80GB`                       |
| `US_ILLINOIS_2`   | `NVIDIA_A100_80GB`                       |
| `US_IOWA_1`       | `NVIDIA_H100_80GB`                       |
| `US_OHIO_1`       | `NVIDIA_B200_180GB`                      |
| `US_TEXAS_2`      | `NVIDIA_H100_80GB`                       |
| `US_UTAH_1`       | `NVIDIA_B200_180GB`                      |
| `US_VIRGINIA_1`   | `NVIDIA_H100_80GB`, `NVIDIA_H200_141GB`  |
| `US_WASHINGTON_2` | `NVIDIA_H100_80GB`                       |
| `US_WASHINGTON_3` | `NVIDIA_B200_180GB`                      |
| `US_WASHINGTON_4` | `NVIDIA_B200_180GB`                      |
| `EU_FRANKFURT_1`  | `NVIDIA_H100_80GB`                       |
| `EU_ICELAND_1`    | `NVIDIA_H200_141GB`                      |
| `EU_ICELAND_2`    | `NVIDIA_B200_180GB`, `NVIDIA_H200_141GB` |
| `AP_TOKYO_1`      | `NVIDIA_H100_80GB`                       |
| `AP_TOKYO_2`      | `NVIDIA_H200_141GB`                      |

## Using a region

When creating a deployment, pass the `--region` flag to choose where it runs. The value can be a multi-region such as `GLOBAL` (shown here) or, to pin the deployment to one location, a single region from the table above:

```
firectl deployment create accounts/fireworks/models/llama-v3p1-8b-instruct \
  --region GLOBAL
```

## Changing regions

Updating the single region for a deployment in-place is not supported. To move a deployment to a different single region, create a new deployment in the desired region, then delete the old deployment.

## Quotas

Quota is granted at the **multi-region** level for new users. By default, all users receive quota for **GLOBAL** multi-region. For specific single region quota, please contact Fireworks.
To view your current quotas, run:

```
firectl quota list
```

To use single regions that are not generally available (see the table above), or to request additional multi-region quota, contact [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai).

# Reserved capacity

Source: https://docs.fireworks.ai/deployments/reservations

Enterprise accounts can purchase reserved capacity, typically with 1-year commitments.

Reserved capacity has the following advantages over ordinary [on-demand deployments](/guides/ondemand-deployments):

* Guaranteed capacity
* Higher quotas
* Lower GPU-hour prices
* Pre-GA access to newer regions
* Pre-GA access to newest hardware

## Usage and billing

You consume a reservation by creating a deployment that meets the reservation parameters. For example, suppose you have a reservation for 12 H100 GPUs and create two deployments, each using 8 H100 GPUs. While both deployments are running, 12 of the H100s count toward your reservation, while the excess 4 H100s are metered and billed at the on-demand rate. Follow [deploying models on-demand](/guides/ondemand-deployments) to create a deployment.

When a reservation approaches its end time, either renew the reservation or turn down a corresponding number of deployments; otherwise you may be billed for your usage at on-demand rates.

Reservations are invoiced separately from your on-demand usage, at a frequency determined by your reservation contract (e.g. monthly, quarterly, or yearly).

Reserved capacity will always be billed until the reservation ends, regardless of whether the reservation is actively used.

## Purchasing or renewing a reservation

To purchase a reservation or increase the size or duration of an existing reservation, contact your Fireworks account manager. If you are a new, prospective customer, please reach out to our [sales team](https://fireworks.ai/company/contact-us).
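
The reservation example under "Usage and billing" is simple arithmetic, and can be written out as a quick check. The function name `split_reserved_usage` is hypothetical and the sketch only models GPU counts of one accelerator type; it says nothing about actual invoicing.

```python
def split_reserved_usage(reserved_gpus, deployment_gpus):
    """Return (GPUs covered by the reservation, GPUs billed on demand)."""
    total = sum(deployment_gpus)
    covered = min(reserved_gpus, total)
    return covered, total - covered

# Reservation for 12 H100s, two running deployments of 8 H100s each
covered, on_demand = split_reserved_usage(12, [8, 8])
print(covered, on_demand)  # 12 4
```

Note that the reservation itself is billed in full even when `covered` is less than the reserved amount.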
## Viewing your reservations To view your existing reservations, run: ``` firectl reservation list ``` # Routers Source: https://docs.fireworks.ai/deployments/routers Distribute traffic across multiple deployments for A/B testing, traffic migration, and load distribution. A **Router** is a resource that controls how inference traffic is routed to one or more deployments. Instead of sending all requests to a single deployment, a router lets you split traffic across multiple deployments — useful for A/B testing model variants, gradually migrating traffic to a new deployment, or distributing load. Traffic is split proportionally based on the number of replicas in each deployment. For example, if a router covers two deployments — one with 3 replicas and another with 2 — the first receives 60% of traffic and the second receives 40%. Routers only work with multi-region deployments. ## When to use a router ### Stable alias for deployment replacement If you plan to replace a deployment later (e.g., changing to a new model later), give your application the **router name** instead of the deployment name. You can then swap the underlying deployment without your application changing anything. ``` Your app calls: accounts//routers/my-router └── Initially routes to: accounts//deployments/v1 └── Later updated to: accounts//deployments/v2 ``` ### A/B testing between deployments Place multiple deployments under a single router. Traffic is automatically split by replica count, so you can control the ratio by adjusting replicas on each deployment. ```bash theme={null} firectl router create \ --router-id=ab-test \ --deployments=model-a,model-b ``` ### Gradual traffic migration Shift traffic from an old deployment to a new one with zero downtime by scaling replicas up on the new deployment and down on the old. See the [worked example](#example-traffic-migration) below. ## How traffic routing works Traffic is distributed based on **replica count**. 
Each replica across all deployments in the router receives an equal share of traffic. | Deployment | Replicas | Traffic share | | -------------- | -------- | ------------- | | `deployment-a` | 3 | 60% | | `deployment-b` | 2 | 40% | | **Total** | **5** | **100%** | To shift traffic, scale the replica counts on the underlying deployments. The router automatically adjusts the distribution. ### Sending traffic to a router Use the router's name in the `model` field of your API request, just like you would use a deployment name: ```bash theme={null} curl -s -X POST https://api.fireworks.ai/inference/v1/chat/completions \ -H "Authorization: Bearer $FIREWORKS_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "model": "accounts//routers/", "messages": [{"role": "user", "content": "Hello"}] }' ``` ### Routing strategy Traffic is routed using **weighted replica** selection: each request is randomly assigned to a deployment, weighted by its replica count. A deployment with more replicas receives proportionally more traffic. ## Managing routers ### Creating a router A router requires at least one deployment. ```bash theme={null} firectl router create \ --deployments=, ``` Optional flags: | Flag | Description | | ---------------- | -------------------------------------------------------------- | | `--router-id` | Set a specific router ID. If omitted, a random ID is generated | | `--display-name` | Human-readable name for the router | | `--model` | The model to route traffic to | | `--strategy` | Routing strategy. 
Default: `weighted-random` | | `--public` | Make the router accessible to other accounts | ### Listing routers ```bash theme={null} firectl router list ``` ### Getting router details ```bash theme={null} firectl router get ``` You can also use the full resource name: ```bash theme={null} firectl router get accounts//routers/ ``` ### Updating a router Update the deployments, strategy, or other properties of an existing router: ```bash theme={null} firectl router update \ --deployments=,, ``` ### Deleting a router ```bash theme={null} firectl router delete ``` Deleting a router takes effect immediately. Any traffic sent to the router's alias will fail. Make sure all clients have switched to a different route before deleting. ## Example: traffic migration This example walks through migrating traffic from an existing deployment to a new one with zero downtime. **Step 1** — Create a router for your existing deployment and point your application at the router alias: ```bash theme={null} firectl router create \ --router-id=my-router \ --deployments=current-deployment ``` Your application sends traffic to `accounts//routers/my-router`. All traffic goes to `current-deployment`. **Step 2** — Create the new deployment and add it to the router: ```bash theme={null} firectl deployment create accounts//models/ \ --deployment-id=new-deployment ``` ```bash theme={null} firectl router update my-router \ --deployments=current-deployment,new-deployment ``` A new deployment starts with 1 replica by default, so if `current-deployment` has 4 replicas, the split is immediately 80%/20%. **Step 3** — Shift more traffic by increasing replicas on the new deployment and decreasing the old: ```bash theme={null} firectl deployment update new-deployment \ --min-replica-count=4 \ --max-replica-count=4 firectl deployment update current-deployment \ --min-replica-count=1 \ --max-replica-count=1 ``` Traffic split is now 20% old / 80% new. 
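
The percentages quoted in steps 2 and 3 follow directly from replica counts, so it can help to compute the expected split before changing anything. This is a minimal sketch of that calculation; `traffic_shares` is a hypothetical helper, not part of `firectl` or any Fireworks SDK.

```python
def traffic_shares(replicas):
    """Map each deployment to its expected fraction of router traffic."""
    total = sum(replicas.values())
    if total == 0:  # every deployment is scaled to zero
        return {name: 0.0 for name in replicas}
    return {name: count / total for name, count in replicas.items()}

# Step 2: new deployment joins with 1 replica next to 4 existing ones
print(traffic_shares({"current-deployment": 4, "new-deployment": 1}))
# {'current-deployment': 0.8, 'new-deployment': 0.2}

# Step 3: replica counts are swapped
print(traffic_shares({"current-deployment": 1, "new-deployment": 4}))
# {'current-deployment': 0.2, 'new-deployment': 0.8}
```

These are expected long-run proportions; because each request is assigned randomly, short windows of traffic can deviate from them.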
**Step 4** — Complete the migration by scaling the old deployment to zero: ```bash theme={null} firectl deployment update current-deployment \ --min-replica-count=0 \ --max-replica-count=0 ``` All traffic now flows to `new-deployment`. Clean up by removing the old deployment from the router: ```bash theme={null} firectl router update my-router --deployments=new-deployment ``` Monitor your new deployment's latency and error rates at each step before shifting more traffic. This lets you catch issues early and roll back by increasing replicas on the old deployment. # Speculative Decoding Source: https://docs.fireworks.ai/deployments/speculative-decoding Speed up generation with draft models and n-gram speculation Speed up text generation by using a smaller "draft" model to assist the main model, or using n-gram based speculation. Speculative decoding may slow down output generation if the draft model is not a good speculator, or if token count/speculation length is too high or too low. It may also reduce max throughput. Test different models and speculation lengths for your use case. ## Configuration options | Flag | Type | Description | | ---------------------------- | ------ | ------------------------------------------------------------------------------------------- | | `--draft-model` | string | Draft model name. Can be a Fireworks model or custom model. See recommendations below. | | `--draft-token-count` | int32 | Tokens to generate per step. Required when using draft model or n-gram. Typically set to 4. | | `--ngram-speculation-length` | int32 | Alternative to draft model: uses N-gram based speculation from previous input. | `--draft-model` and `--ngram-speculation-length` cannot be used together. 
## Recommended draft models | Draft model | Use with | | -------------------------------------------------- | --------------------- | | `accounts/fireworks/models/llama-v3p2-1b-instruct` | All Llama models > 3B | | `accounts/fireworks/models/qwen2p5-0p5b-instruct` | All Qwen models > 3B | ## Examples Use a smaller model to speed up generation: ```bash theme={null} firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \ --draft-model="accounts/fireworks/models/llama-v3p2-1b-instruct" \ --draft-token-count=4 ``` Use input history for speculation (no draft model needed): ```bash theme={null} firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \ --ngram-speculation-length=3 \ --draft-token-count=4 ``` Fireworks also supports [Predicted Outputs](/guides/predicted-outputs) which works in addition to model-based speculative decoding. # Cloud Integrations Source: https://docs.fireworks.ai/ecosystem/integrations Cloud Integrations ## Agentic Coding Harnesses Use Fireworks models in Claude Code via the FireConnect CLI Use Fireworks models in OpenCode via the FireConnect CLI ## Cloud Deployments Access frontier open models through Azure, billed to your Azure account Deploy Fireworks models on AWS SageMaker Run Fireworks on Amazon Elastic Kubernetes Service Deploy using Amazon Elastic Container Service Build and deploy AI agents with AgentCore ## Need Help? For assistance with cloud deployments or custom integrations, [contact our team](https://fireworks.ai/contact). # Agent Frameworks Source: https://docs.fireworks.ai/ecosystem/integrations/agent-frameworks Build production-ready AI agents with Fireworks and leading open-source frameworks Fireworks AI seamlessly integrates with the best open-source agent frameworks, enabling you to build magical, production-ready applications powered by state-of-the-art language models. 
## Supported Frameworks Build LLM applications with powerful orchestration and tool integration Efficient data retrieval and document indexing for LLM-based agents Orchestrate collaborative multi-agent systems for complex tasks Type-safe AI agent development with Pydantic validation Modern agent orchestration with seamless OpenAI-compatible integration Use Claude Code with Fireworks models for AI-powered coding Add Fireworks models to Copilot Chat via a custom endpoint ## Need Help? For assistance with agent framework integrations, [contact our team](https://fireworks.ai/contact) or join our [Discord community](https://discord.gg/fireworks-ai). # Microsoft Foundry Source: https://docs.fireworks.ai/ecosystem/integrations/azure-foundry Deploy frontier open models inside your Azure subscription, billed through Azure. Fireworks AI is a first-party inference provider inside Microsoft Foundry. You can access frontier open models through your existing Azure account, with usage billed through Azure and counting toward your Microsoft Azure Consumption Commitment (MACC). This page covers the Fireworks side of the integration. For Azure portal setup steps, see the [Microsoft Learn guide](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/enable-fireworks-models). **New to Fireworks?** Foundry users get the same OpenAI-compatible API and model catalog as direct Fireworks customers. Start with the [PayGo quickstart](#paygo-quickstart) below — you can be making requests in about 10 minutes. ## Prerequisites * An active Azure subscription * The Fireworks integration enabled at the subscription level (see below) * A Microsoft Foundry project with the **Azure AI Developer** role assigned ### Opt-in Fireworks on Foundry requires a one-time opt-in per Azure subscription before you can create deployments. Follow the steps in the [Microsoft Learn guide](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/enable-fireworks-models#enable-fireworks-on-foundry). 
## Deployment modes Fireworks on Foundry supports three deployment modes. | Mode | Also called | Pricing | Regions | Right for | | ----------------- | ------------------------------ | --------------------------------- | -------------------- | -------------------------------------------- | | **PayGo** | Serverless, Data Zone Standard | Per token, MACC-eligible | US Data Zone only | Prototyping, low-volume workloads | | **PTU** | Provisioned Throughput | Per PTU-hour, ACD + MACC eligible | Global | Production workloads with consistent traffic | | **Custom Models** | Bring Your Own Model | PTU pricing | Global (PTU regions) | Fine-tuned model deployment | PTU deployments can be created directly in the Azure portal. For help with PTU sizing on Fireworks models, contact [sales@fireworks.ai](mailto:sales@fireworks.ai). ## Available models All models use the OpenAI-compatible chat completions API and are added to the catalog on a rolling basis. For the current list of available models, see the [Microsoft Learn catalog](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/enable-fireworks-models#available-catalog-models). Chat completions only. Embeddings, image generation, and audio modalities are not available through Foundry. ## PayGo quickstart PayGo (Data Zone Standard) is available in: East US, East US 2, Central US, North Central US, West US, West US 3. The throughput limit for PayGo deployments is **250,000 tokens per minute (TPM)**. ### Make your first request Foundry deployments use an OpenAI-compatible endpoint. Use your Foundry project endpoint and Azure API key. ```python theme={null} from openai import OpenAI client = OpenAI( base_url="https://.services.ai.azure.com/models", api_key="", ) response = client.chat.completions.create( model="fireworks-ai/FW-GLM-5.1", messages=[{"role": "user", "content": "Hello"}], ) print(response.choices[0].message.content) ``` Find your project endpoint in the Microsoft Foundry portal under **Project settings**. 
## PTU (Provisioned Throughput) PTU deployments provide dedicated GPU capacity reserved for your workload, with consistent throughput and global region availability. * Dedicated capacity, not shared with other tenants * Available globally, not limited to US Data Zone * ACD-eligible and MACC-eligible You can create a PTU deployment directly in the Azure portal. For more on provisioned throughput, see the [Microsoft Learn guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/provisioned-throughput). For help with PTU sizing on Fireworks models, contact [sales@fireworks.ai](mailto:sales@fireworks.ai). ## Custom Models Fine-tune on Fireworks and deploy on Foundry, or bring your own weights from wherever you post-train to deploy on Foundry. Your model is served on Fireworks infrastructure within Azure, billed through your Azure account. ### Supported base architectures For the list of supported custom model architectures, see the [Microsoft Learn guide](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/enable-fireworks-models#supported-model-architectures). ### Deployment To import and deploy a custom model, follow the [Import custom models into Foundry guide](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/import-custom-models?tabs=rest-api). ## Billing All Fireworks on Foundry usage is billed through Azure. You do not need a separate Fireworks billing account or contract. 
* PayGo and PTU usage is MACC-eligible * PTU deployments are ACD-eligible and qualify for quota retirement * Direct Fireworks usage at [fireworks.ai](https://fireworks.ai) is billed separately and does not count toward MACC ## Troubleshooting | Issue | Resolution | | ------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | Quota exceeded error | Request a limit increase at [aka.ms/fireworks-quota](https://aka.ms/fireworks-quota) | | Access denied on deployment | Verify you have the **Azure AI Developer** role on the project | | Opt-in not propagating | Allow up to 30 minutes after registering `Fireworks.EnableDeploy` | | Custom Model deployment failing | Confirm weights are full-weight (not LoRA adapters) and the architecture is in the [supported list](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/enable-fireworks-models#supported-model-architectures) | | PTU provisioning questions | Contact [sales@fireworks.ai](mailto:sales@fireworks.ai) | ## Additional resources * [Enable Fireworks on Foundry (Microsoft Learn)](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/enable-fireworks-models) * [Microsoft Foundry portal](https://ai.azure.com/) * [Fireworks fine-tuning docs](/fine-tuning/finetuning-intro) * [Fireworks Trust Center](https://fireworks.ai/trust) * [sales@fireworks.ai](mailto:sales@fireworks.ai) for PTU provisioning and Custom Model support # Claude Code Source: https://docs.fireworks.ai/ecosystem/integrations/claude-code Use Fireworks AI models in Claude Code with the FireConnect CLI [FireConnect](https://github.com/fw-ai/fireconnect) routes [Claude Code](https://claude.ai/code) through Fireworks AI models. Install it once, then use `fireconnect on` and `fireconnect off` to switch providers without editing config files by hand. 
## Prerequisites * [Claude Code](https://claude.ai/code) installed * A [Fireworks API key](https://app.fireworks.ai/settings/users/api-keys) (`fw_...`) or a [Fire Pass](/firepass) key (`fpk_...`) * Node.js (the installer can install it via Homebrew or apt if it is missing) ## Install ```bash theme={null} curl -fsSL https://raw.githubusercontent.com/fw-ai/fireconnect/main/install.sh | bash ``` For non-interactive setup (CI or scripted installs): ```bash theme={null} curl -fsSL https://raw.githubusercontent.com/fw-ai/fireconnect/main/install.sh | FIREWORKS_API_KEY="fw_..." bash ``` The installer: * Uses Node.js to update Claude Code settings (it does not install or update npm packages) * Prompts for your Fireworks API key, or reads it from `FIREWORKS_API_KEY` * Runs `fireconnect on` to apply the default model mapping and write `~/.claude/settings.json` * Clones the FireConnect CLI to `~/.fireconnect/cli` and installs a `fireconnect` launcher to `~/.local/bin` * Adds `~/.local/bin` to your shell `PATH` Restart Claude Code after installation, then test with: ```text theme={null} Is this thing on? ``` ## Using Fire Pass If you have a [Fire Pass](/firepass) subscription, use your `fpk_...` key instead: ```bash theme={null} curl -fsSL https://raw.githubusercontent.com/fw-ai/fireconnect/main/install.sh | FIREWORKS_API_KEY="fpk_..." bash ``` FireConnect automatically detects Fire Pass keys and routes all model aliases to `kimi-k2p6-turbo` — the only model covered by the Fire Pass subscription. 
## Default model mapping | Alias | Standard key (`fw_...`) | Fire Pass key (`fpk_...`) | | -------- | ----------------------- | ------------------------- | | main | `kimi-k2p6-turbo` | `kimi-k2p6-turbo` | | opus | `kimi-k2p6-turbo` | `kimi-k2p6-turbo` | | sonnet | `glm-5p1` | `kimi-k2p6-turbo` | | haiku | `minimax-m2p5` | `kimi-k2p6-turbo` | | subagent | `minimax-m2p5` | `kimi-k2p6-turbo` | Short model IDs like `kimi-k2p6-turbo` are expanded to full Fireworks paths (for example, `accounts/fireworks/routers/kimi-k2p6-turbo`). ## What gets written FireConnect writes these settings to `~/.claude/settings.json`: ```json theme={null} { "env": { "ANTHROPIC_BASE_URL": "https://api.fireworks.ai/inference", "ANTHROPIC_API_KEY": "fw_YOUR_FIREWORKS_API_KEY", "ANTHROPIC_AUTH_TOKEN": "fw_YOUR_FIREWORKS_API_KEY", "ANTHROPIC_MODEL": "accounts/fireworks/routers/kimi-k2p6-turbo", "ANTHROPIC_DEFAULT_OPUS_MODEL": "accounts/fireworks/routers/kimi-k2p6-turbo", "ANTHROPIC_DEFAULT_SONNET_MODEL": "accounts/fireworks/models/glm-5p1", "ANTHROPIC_DEFAULT_HAIKU_MODEL": "accounts/fireworks/models/minimax-m2p5", "CLAUDE_CODE_SUBAGENT_MODEL": "accounts/fireworks/models/minimax-m2p5" } } ``` FireConnect writes both `ANTHROPIC_API_KEY` (preferred) and `ANTHROPIC_AUTH_TOKEN` (compatibility alias) with the same Fireworks key. It also saves a backup of your previous provider settings to `~/.fireconnect/claude/` so `fireconnect off` can restore them. ## CLI reference ```bash theme={null} fireconnect on # Route Claude Code through Fireworks fireconnect off # Restore your previous provider fireconnect status # Show the current provider and model fireconnect list # Show the default and current model mapping fireconnect set # Change model aliases without touching credentials fireconnect reset # Reset model aliases to defaults fireconnect uninstall # Remove FireConnect from this machine ``` Run `fireconnect help ` for all options. 
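The default mapping table and the settings shown under "What gets written" can be read as a lookup from key type to environment variables. The sketch below reproduces that lookup for illustration only. It is not FireConnect's implementation, the function name `default_model_env` is hypothetical, and the assumption that key type is decided purely by the `fw_` / `fpk_` prefix comes from the key formats listed on this page, not from the CLI source.

```python
# Full model paths as they appear in the settings example above.
ROUTER = "accounts/fireworks/routers/kimi-k2p6-turbo"
GLM = "accounts/fireworks/models/glm-5p1"
MINIMAX = "accounts/fireworks/models/minimax-m2p5"

# Alias -> settings key, taken from the JSON under "What gets written".
ENV_KEYS = {
    "main": "ANTHROPIC_MODEL",
    "opus": "ANTHROPIC_DEFAULT_OPUS_MODEL",
    "sonnet": "ANTHROPIC_DEFAULT_SONNET_MODEL",
    "haiku": "ANTHROPIC_DEFAULT_HAIKU_MODEL",
    "subagent": "CLAUDE_CODE_SUBAGENT_MODEL",
}

# Defaults for a standard (fw_...) key, from the mapping table.
STANDARD = {"main": ROUTER, "opus": ROUTER, "sonnet": GLM,
            "haiku": MINIMAX, "subagent": MINIMAX}


def default_model_env(api_key: str) -> dict[str, str]:
    """Return the model-related env entries implied by the default mapping."""
    if api_key.startswith("fpk_"):
        # Fire Pass: every alias routes to the single covered model.
        models = {alias: ROUTER for alias in ENV_KEYS}
    elif api_key.startswith("fw_"):
        models = STANDARD
    else:
        raise ValueError("expected a key starting with fw_ or fpk_")
    return {ENV_KEYS[alias]: models[alias] for alias in ENV_KEYS}


print(default_model_env("fw_example")["ANTHROPIC_DEFAULT_SONNET_MODEL"])
print(default_model_env("fpk_example")["ANTHROPIC_DEFAULT_SONNET_MODEL"])
```

Comparing this output with the `env` block in your own `~/.claude/settings.json` is a quick way to see whether aliases were changed with `fireconnect set`.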
### Manual setup If you already have a Fireworks API key, you can skip the installer and enable routing directly: ```bash theme={null} fireconnect on --api-key fw_... ``` Restart Claude Code after this completes. ### Switch models Short model IDs work everywhere: ```bash theme={null} fireconnect set --main kimi-k2p6-turbo --sonnet glm-5p1 --haiku minimax-m2p5 --subagent minimax-m2p5 ``` ### Turn off Fireworks routing ```bash theme={null} fireconnect off ``` This restores your previous `~/.claude/settings.json` from the backup saved in `~/.fireconnect/claude/`. ### Enable with a specific API key ```bash theme={null} fireconnect on --api-key fw_... ``` ## Uninstall ```bash theme={null} fireconnect uninstall ``` This disables Fireworks routing for Claude Code, removes `~/.fireconnect/claude/`, and deletes the `fireconnect` CLI launcher from `~/.local/bin`. ## Source FireConnect is open source: [github.com/fw-ai/fireconnect](https://github.com/fw-ai/fireconnect) # Development Setup with Fireworks Docs MCP Source: https://docs.fireworks.ai/ecosystem/integrations/development-setup Configure the Fireworks AI Docs MCP server for Claude Code and Cursor ## Claude Code Add the MCP server via the CLI: ```bash theme={null} claude mcp add --transport http fireworks-docs https://docs.fireworks.ai/mcp ``` Or add it to your project's `mcp.json`: ```json theme={null} { "mcpServers": { "fireworks-docs": { "url": "https://docs.fireworks.ai/mcp" } } } ``` ## Cursor One-click install: [Install Fireworks Docs MCP](https://cursor.com/en/install-mcp?name=fireworks-docs\&config=eyJ1cmwiOiJodHRwczovL2RvY3MuZmlyZXdvcmtzLmFpL21jcCJ9) Or manually add to your workspace's `mcp.json`: ```json theme={null} { "mcpServers": { "fireworks-docs": { "url": "https://docs.fireworks.ai/mcp" } } } ``` ## Using the MCP Server Once configured, your AI coding agent can search the full Fireworks AI documentation. Example queries: * "How do I configure autoscaling for deployments?" 
* "What parameters does the chat completions endpoint accept?"
* "Show me examples of function calling with Fireworks models"
* "Find the API reference for batch inference"

# GitHub Copilot

Source: https://docs.fireworks.ai/ecosystem/integrations/github-copilot

Use Fireworks AI models in GitHub Copilot Chat via a custom endpoint

Use [Fireworks AI](https://fireworks.ai) models in **GitHub Copilot Chat** by adding a **Custom Endpoint** in VS Code (or other hosts that support Copilot custom models).

Fireworks offers **200+ models**. Copy the model `id` and token limits from the [Model Library](https://app.fireworks.ai/models), and use the endpoint URL `https://api.fireworks.ai/inference/v1`.

## Prerequisites

* A Fireworks [API key](https://app.fireworks.ai/settings/users/api-keys)
* GitHub Copilot with access to **Other Models** and **Custom Endpoint** (availability depends on your Copilot plan)

## Setup

1. In Copilot Chat, click the active model name at the bottom (often **Auto**). In the menu, click the gear icon next to **Other Models**.
2. In **Language Models**, click **+ Add Models...** in the top right, then choose **Custom Endpoint**.
3. Enter **Fireworks AI** as the group name and press Enter.
4. Paste your Fireworks API key (hidden by default) and press Enter to confirm.
5. When asked for the default request/response format, select **Responses API**.
6. A configuration file opens. **Do not change** the auto-generated header at the top. Only fill in the model template below it.
7. Fill in your model fields, then save (**Ctrl+S** on Windows/Linux, **Cmd+S** on macOS) and close the settings modal.

   Example for [DeepSeek V4 Pro](https://app.fireworks.ai/models/fireworks/deepseek-v4-pro):

   | Field               | Value                                       |
   | ------------------- | ------------------------------------------- |
   | **id**              | `accounts/fireworks/models/deepseek-v4-pro` |
   | **name**            | `DeepSeek V4 Pro`                           |
   | **url**             | `https://api.fireworks.ai/inference/v1`     |
   | **toolCalling**     | `true`                                      |
   | **vision**          | `false`                                     |
   | **maxInputTokens**  | `1000000`                                   |
   | **maxOutputTokens** | `384000`                                    |

   Use the exact model `id` and token limits from the model page in the [Model Library](https://app.fireworks.ai/models). Values differ per model.
8. Return to Copilot Chat, open the model picker (**Auto**), expand **Other Models**, and choose your model under **Fireworks AI**.

## Related

* [Claude Code](/ecosystem/integrations/claude-code): use Fireworks models with Claude Code
* [Development Setup with Fireworks Docs MCP](/ecosystem/integrations/development-setup): add Fireworks docs to your coding agent

# MLOps & Observability

Source: https://docs.fireworks.ai/ecosystem/integrations/mlops-observability

Track and monitor your Fireworks AI deployments with leading MLOps and observability platforms

Fireworks AI integrates with industry-leading MLOps and observability platforms to help you monitor, track, and optimize your AI applications in production.

## Supported Platforms

* Track fine-tuning experiments and visualize training metrics with W\&B.
* Use MLflow Tracing to track prompts, outputs, latency, and more as you build AI applications with Fireworks AI.

## Need Help?

For assistance with MLOps and observability integrations, [contact our team](https://fireworks.ai/contact) or join our [Discord community](https://discord.gg/fireworks-ai).
# OpenCode Source: https://docs.fireworks.ai/ecosystem/integrations/opencode Use Fireworks AI models in OpenCode with the FireConnect CLI [FireConnect](https://github.com/fw-ai/fireconnect) routes [OpenCode](https://opencode.ai) through Fireworks AI models. Install the CLI once, then use `fireconnect on --harness opencode` to switch providers without editing config files by hand. ## Prerequisites * [OpenCode](https://opencode.ai) installed * A [Fireworks API key](https://app.fireworks.ai/settings/users/api-keys) (`fw_...`) or a [Fire Pass](/firepass) key (`fpk_...`) * The FireConnect CLI (see [Install the CLI](#install-the-cli) below) ## Install the CLI If you do not already have `fireconnect` on your `PATH`, install it with: ```bash theme={null} curl -fsSL https://raw.githubusercontent.com/fw-ai/fireconnect/main/install.sh | bash ``` The installer also configures Claude Code by default. If you only use OpenCode, run `fireconnect off` after install to restore your Claude Code settings, then follow the steps below. For non-interactive setup: ```bash theme={null} curl -fsSL https://raw.githubusercontent.com/fw-ai/fireconnect/main/install.sh | FIREWORKS_API_KEY="fw_..." bash ``` The installer uses Node.js to update settings (it does not install or update npm packages), clones the CLI to `~/.fireconnect/cli`, and adds a `fireconnect` launcher to `~/.local/bin`. ## Enable Fireworks routing ```bash theme={null} export FIREWORKS_API_KEY=fw_... fireconnect on --harness opencode ``` Restart OpenCode after enabling, then confirm routing: ```bash theme={null} fireconnect status --harness opencode ``` ## Using Fire Pass Use your `fpk_...` key instead of a standard `fw_...` key: ```bash theme={null} export FIREWORKS_API_KEY=fpk_... fireconnect on --harness opencode --api-key fpk_... ``` FireConnect detects Fire Pass keys and defaults OpenCode to `kimi-k2p6-turbo` — the only model covered by the Fire Pass subscription. 
## Default model OpenCode routes a single default model (no opus/sonnet/haiku alias slots). The default is `kimi-k2p6-turbo`, written to config as `fireworks/accounts/fireworks/routers/kimi-k2p6-turbo`. Short model IDs like `glm-5p1` are expanded to full Fireworks paths (for example, `accounts/fireworks/models/glm-5p1`). ## What gets written FireConnect merges a `provider.fireworks` block into `~/.config/opencode/opencode.json`: * An OpenAI-compatible adapter pointed at `https://api.fireworks.ai/inference/v1` * A default `model` set to `fireworks/` * Your other providers are left untouched FireConnect snapshots your original `opencode.json` before the first change. The snapshot lives in `~/.fireconnect/opencode/`. Running `fireconnect off --harness opencode` restores the file byte-for-byte. ### API key handling * If the key comes from `FIREWORKS_API_KEY`, it is written as `{env:FIREWORKS_API_KEY}` so the secret stays out of the config file. * Passing `--api-key` writes the literal key instead. * OpenCode's `auth.json` is never touched. ## CLI reference All commands use `--harness opencode`: ```bash theme={null} fireconnect on --harness opencode # Enable Fireworks routing fireconnect off --harness opencode # Restore original config fireconnect status --harness opencode # Check current provider fireconnect list --harness opencode # Show the current model fireconnect set --harness opencode --main glm-5p1 # Switch model fireconnect reset --harness opencode # Reset model to default ``` Run `fireconnect help ` for all options. ### Switch models ```bash theme={null} fireconnect set --harness opencode --main glm-5p1 ``` ### Turn off Fireworks routing ```bash theme={null} fireconnect off --harness opencode ``` This restores your previous `opencode.json` from the backup in `~/.fireconnect/opencode/`. 
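As described under "API key handling", a key supplied through `FIREWORKS_API_KEY` is written as the `{env:FIREWORKS_API_KEY}` reference, while `--api-key` writes the literal key. Before sharing or committing an `opencode.json`, you may want to check which of the two ended up in the file. The sketch below is one way to do that, under stated assumptions: it scans any parsed JSON for string values that start with `fw_` or `fpk_`, the function name `find_literal_keys` is hypothetical, and the field names in the sample fragments (`options`, `apiKey`) are illustrative only, not a documented schema.

```python
import json


def find_literal_keys(node, path="$"):
    """Return JSON paths of string values that look like literal Fireworks keys."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_literal_keys(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_literal_keys(value, f"{path}[{index}]")
    elif isinstance(node, str) and node.startswith(("fw_", "fpk_")):
        hits.append(path)
    return hits


# Hypothetical config fragments; the field names are illustrative only.
env_config = json.loads(
    '{"provider": {"fireworks": {"options": {"apiKey": "{env:FIREWORKS_API_KEY}"}}}}'
)
literal_config = json.loads(
    '{"provider": {"fireworks": {"options": {"apiKey": "fw_example"}}}}'
)

print(find_literal_keys(env_config))      # []
print(find_literal_keys(literal_config))  # ['$.provider.fireworks.options.apiKey']
```

To run it against a real file, load it with `json.load(open(path))` and pass the result to `find_literal_keys`. Because the scan does not depend on the layout of the provider block, it keeps working if the config structure differs from the sample.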
### Use a non-default config file ```bash theme={null} fireconnect on --harness opencode --config-path /path/to/opencode.json ``` ## Built-in provider connection OpenCode also supports connecting to Fireworks directly without FireConnect: 1. Type `/connect` in OpenCode and search for **fireworks.ai** 2. Paste your Fireworks API key and press Enter 3. Type `/models` and select a model (for Fire Pass, choose **Kimi K2.6 Turbo**) ## Source FireConnect is open source: [github.com/fw-ai/fireconnect](https://github.com/fw-ai/fireconnect) # Cookbooks Source: https://docs.fireworks.ai/examples/cookbooks Interactive Jupyter notebooks demonstrating advanced use cases and best practices with Fireworks AI Explore our collection of notebooks that showcase real-world applications, best practices, and advanced techniques for building with Fireworks AI. ## Fine-Tuning & Training Transfer large model capabilities to efficient models using a two-stage SFT + RFT approach. **Techniques:** Supervised Fine-Tuning (SFT) + Reinforcement Fine-Tuning (RFT) **Results:** 52% → 70% accuracy on GSM8K mathematical reasoning Beat frontier closed-source models for product catalog cleansing with vision-language model fine-tuning. **Techniques:** Supervised Fine-Tuning (SFT) **Results:** 48% increase in quality from base model ## Multimodal AI Extract structured data from invoices, forms, and financial documents using state-of-the-art OCR and document understanding. **Use Cases:** Forms, invoices, financial documents, product catalogs **Results:** 90.8% accuracy on invoice extraction (100% on invoice numbers and dates) Real-time audio transcription with streaming support and low latency. **Features:** Streaming support, low-latency transcription, production-ready Analyze video and audio content with Qwen3 Omni, a multimodal model supporting video, audio, and text inputs. 
**Features:** Video captioning, scene analysis, content understanding, multimodal Q\&A

## API Features

Leverage Model Context Protocol (MCP) for GitHub repository analysis, code search, and documentation Q\&A.

**Features:** Repository analysis, code search, documentation Q\&A, GitMCP integration

**Models:** Qwen 3 235B with external tool support

# Courses

Source: https://docs.fireworks.ai/examples/introduction

Standalone end-to-end examples showing how to use Fireworks to solve real-world use cases

* Learn how to use Fireworks to fine-tune a model to convert natural language to SQL queries.
* Learn how to build reinforcement learning systems that avoid reward hacking.
* Learn to distill the knowledge of large AI models into efficient, deployable alternatives.

# How do I close my Fireworks.ai account?

Source: https://docs.fireworks.ai/faq-new/account-access/how-do-i-close-my-fireworksai-account

To close your account:

1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
2. Include in your request:
   * Your account ID
   * A clear request for account deletion

Before closing your account, please ensure:

* All outstanding invoices are paid, and any payment issues on prepaid accounts are resolved
* Any active deployments are terminated
* Important data is backed up if needed

# I have multiple Fireworks accounts. When I try to login with Google on Fireworks' web UI, I'm getting signed into the wrong account. How do I fix this?

Source: https://docs.fireworks.ai/faq-new/account-access/i-have-multiple-fireworks-accounts-when-i-try-to-login-with-google-on-fireworks

If you log in with Google, account selection is controlled by Google. To sign in with a different Google account, log in from an incognito window or create separate Chrome/browser profiles, one per Google account.

You can also follow the steps in this [guide](https://support.google.com/accounts/answer/13533235?hl=en#zippy=%2Csign-in-with-google) to disassociate Fireworks.ai from a particular Google account sign-in.
If you have more complex issues please contact us on Discord. # What email does GitHub authentication use? Source: https://docs.fireworks.ai/faq-new/account-access/what-email-does-github-authentication-use When you authenticate with Fireworks using GitHub, we use the **primary email address** associated with your GitHub account for identification and account management. ## How it works Fireworks automatically retrieves your primary email address from your GitHub profile during the authentication process. This email address becomes your Fireworks account identifier. ## Managing your primary email To change your primary email address on GitHub: 1. Go to your [GitHub email settings](https://github.com/settings/emails) 2. Select the email address you want to set as primary in the "Primary email address" section You can also follow the [GitHub documentation](https://docs.github.com/en/enterprise-cloud@latest/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-email-preferences/changing-your-primary-email-address) for detailed instructions on managing email preferences. ## Switching between accounts You can easily switch which Fireworks account your GitHub authentication logs into by changing your primary email address on GitHub before logging in. This allows you to: * Log into different Fireworks accounts using the same GitHub account * Switch between personal and work accounts by updating your GitHub primary email * Maintain separate billing and usage tracking for different email addresses The authentication will use whatever email is set as primary at the time of login, so you can switch accounts by simply updating your GitHub primary email before authenticating. # What email does LinkedIn authentication use? 
Source: https://docs.fireworks.ai/faq-new/account-access/what-email-does-linkedin-authentication-use When you authenticate with Fireworks using LinkedIn, we use the **primary email address** associated with your LinkedIn account for identification and account management. ## How it works Fireworks automatically retrieves your primary email address from your LinkedIn profile during the authentication process. This email address becomes your Fireworks account identifier. ## Managing your primary email To change your primary email address on LinkedIn: 1. Go to your [LinkedIn email settings](https://www.linkedin.com/mypreferences/d/manage-email-addresses) 2. From there, you can add new email addresses or change your primary email 3. Click **Add email address** to add a new email or select an existing one to make primary You can also follow the [LinkedIn documentation](https://www.linkedin.com/help/linkedin/answer/a519904) for detailed instructions on managing email preferences. ## Switching between accounts You can easily switch which Fireworks account your LinkedIn authentication logs into by changing your primary email address on LinkedIn before logging in. This allows you to: * Log into different Fireworks accounts using the same LinkedIn account * Switch between personal and work accounts by updating your LinkedIn primary email * Maintain separate billing and usage tracking for different email addresses The authentication will use whatever email is set as primary at the time of login, so you can switch accounts by simply updating your LinkedIn primary email before authenticating. # What should I do if I can't access my company account after being invited when I already have a personal account? 
Source: https://docs.fireworks.ai/faq-new/account-access/what-should-i-do-if-i-cant-access-my-company-account-after-being-invited-when-i This issue can occur when you have multiple accounts associated with the same email address (e.g., a personal account created with Google login and a company account you've been invited to). To resolve this: 1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) from the email address associated with both accounts 2. Include in your email: * The account ID you created personally (e.g., username-44ace8) * The company account ID you need access to (e.g., company-a57b2a) * Mention that you're having trouble accessing your company account Note: This is a known scenario that support can resolve once they verify your email ownership. # Are there discounts for bulk usage? Source: https://docs.fireworks.ai/faq-new/billing-pricing/are-there-discounts-for-bulk-usage We offer discounts for bulk or pre-paid purchases. Contact [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) to discuss volume pricing. # Are there extra fees for serving fine-tuned models? Source: https://docs.fireworks.ai/faq-new/billing-pricing/are-there-extra-fees-for-serving-fine-tuned-models Fine-tuned (LoRA) models require a dedicated deployment to serve. Here's what you need to know: **What you pay for**: * **Deployment costs** on a per-GPU-second basis for hosting the model * **The fine-tuning process** itself, if applicable **Deployment options**: * **Live-merge deployment**: Deploy your LoRA model with weights merged into the base model for optimal performance * **Multi-LoRA deployment**: Deploy up to 100 LoRA models as addons on a single base model deployment For more details on deploying fine-tuned models, see the [Deploying Fine Tuned Models guide](/fine-tuning/deploying-loras). # How does billing and credit usage work? 
Source: https://docs.fireworks.ai/faq-new/billing-pricing/how-does-billing-and-credit-usage-work Fireworks uses a **pre-paid credits** model for new self-serve accounts: * Add a valid payment method and billing address, then purchase credits to use the platform. * Usage across serverless, on-demand deployments, and fine-tuning deducts from your credit balance. * If your balance reaches zero and auto top-up is not enabled, usage pauses until you add credits. * You can configure auto top-up and a monthly budget cap in the billing dashboard. Legacy and enterprise exceptions: * Accounts created before **June 1** remain on their existing postpaid terms unless migrated. * Enterprise accounts can be configured for postpaid billing on request. For grandfathered and postpaid enterprise accounts, usage and billing operate through a **tiered system**: * Each **tier** has a monthly usage limit, regardless of available credits. * Once you reach your tier's limit, **service will be suspended** even if you have remaining credits. * **Usage limits** reset at the beginning of each month. * Pre-purchased credits do not prevent additional charges once the limit is exceeded. Enterprise accounts do not have the same self-serve limits. See [Enterprise quotas](/faq/enterprise/service/quotas) for more information. For details on spend limits, budget caps, and quota controls, see our [Account quotas guide](/guides/quotas_usage/account-quotas#view-and-adjust-your-spend-limit). # How many tokens per image? Source: https://docs.fireworks.ai/faq-new/billing-pricing/how-many-tokens-per-image Learn how to calculate token usage for images in vision models and understand pricing implications Image token consumption varies by model and resolution, typically ranging from 1,000 to 2,500 tokens per image for most common resolutions. 
## Common resolution token counts

The following table shows the token counts for a single image for Qwen2.5 VL at different image resolutions:

| Resolution | Token Count |
| ---------- | ----------- |
| 336×336    | 144         |
| 672×672    | 576         |
| 1024×1024  | 1,369       |
| 1280×720   | 1,196       |
| 1920×1080  | 2,769       |
| 2560×1440  | 4,641       |
| 3840×2160  | 10,549      |

## Calculating exact token count for your images

You can determine exact token usage by processing your images through the model's tokenizer. For instance, for Qwen2.5 VL, you can use the following code:

```bash theme={null}
pip install torch torchvision transformers pillow requests
```

```python Tokenizing your image theme={null}
import os
import sys
from io import BytesIO

import requests
from PIL import Image
from transformers import AutoProcessor

# Your image source - can be URL or local path
IMAGE_URL_OR_PATH = "https://images.unsplash.com/photo-1519125323398-675f0ddb6308"

# Token ID of Qwen2.5 VL's image pad token
IMAGE_PAD_TOKEN_ID = 151655


def load_image(source):
    """Load image from URL or local file path"""
    if source.startswith(('http://', 'https://')):
        print(f"Downloading image from URL: {source}")
        response = requests.get(source, timeout=30)
        response.raise_for_status()
        # Decode the bytes already downloaded instead of fetching the URL twice
        return Image.open(BytesIO(response.content))
    else:
        print(f"Loading image from path: {source}")
        if not os.path.exists(source):
            raise FileNotFoundError(f"Image file not found: {source}")
        return Image.open(source)


def count_image_tokens(image):
    """Count how many tokens an image takes using Qwen 2.5 VL processor"""
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "What's in this image?"},
            ],
        }
    ]

    # Render the chat template to text, then tokenize text and image together
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=text, images=[image], return_tensors="pt")

    input_ids = inputs["input_ids"][0]

    # Each image pad token in the sequence corresponds to one image token
    image_tokens = (input_ids == IMAGE_PAD_TOKEN_ID).sum().item()

    return image_tokens, input_ids


def main():
    # Use the first command-line argument if given, else the default image
    image_source = sys.argv[1] if len(sys.argv) > 1 else IMAGE_URL_OR_PATH

    print(f"Processing image: {image_source}")
    image = load_image(image_source)
    print(f"Image size: {image.size}")
    print(f"Image mode: {image.mode}")

    print("\nCalculating tokens...")
    image_tokens, input_ids = count_image_tokens(image)

    print(f"Total tokens: {len(input_ids)}")
    print(f"Image tokens: {image_tokens}")
    print(f"Text tokens: {len(input_ids) - image_tokens}")


if __name__ == "__main__":
    main()
```

```bash Usage theme={null}
# Calculate tokens for an image URL
python token_calculator.py "https://example.com/image.jpg"

# Calculate tokens for a local image
python token_calculator.py "path/to/your/image.png"
```

# How much does Fireworks cost?

Source: https://docs.fireworks.ai/faq-new/billing-pricing/how-much-does-fireworks-cost

Fireworks AI uses a **usage-based pre-paid** model for new self-serve accounts. You purchase credits, then usage is deducted based on:

* **Per token** for serverless inference
* **Per GPU usage time** for on-demand deployments
* **Per token of training data** for fine-tuning

Billing model by account type:

* Accounts created before **June 1** keep their existing postpaid terms (grandfathered).
* Enterprise accounts can be configured for postpaid billing on request.

For customers needing **enterprise-grade security and reliability**, please reach out to us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) to discuss options.

Find out more about our current pricing on our [Pricing page](https://fireworks.ai/pricing).

# Is prompt caching billed differently for serverless models?

Source: https://docs.fireworks.ai/faq-new/billing-pricing/is-prompt-caching-billed-differently

Yes, **cached prompt tokens are discounted compared to uncached tokens for serverless models**. The default discount is 50%, but the exact discount varies by model. Check the [Model Library](https://fireworks.ai/models) for model-specific cached and uncached input token pricing.
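To see what the cached-token discount means for a bill, the arithmetic can be sketched as below. This is an illustration under stated assumptions, not Fireworks' billing logic: the price of 0.90 per million input tokens is a made-up number, the function name `input_cost` is hypothetical, and the 50% figure is only the default discount mentioned above, so check the Model Library for the real per-model values.

```python
def input_cost(uncached_tokens: int, cached_tokens: int,
               price_per_million: float, cached_discount: float = 0.5) -> float:
    """Estimate input-token cost when part of the prompt is served from cache.

    cached_discount is the fraction taken off the uncached price
    (0.5 is the default discount stated above; it varies by model).
    """
    uncached = uncached_tokens * price_per_million / 1_000_000
    cached = cached_tokens * price_per_million * (1 - cached_discount) / 1_000_000
    return uncached + cached


# Hypothetical price: 0.90 per 1M input tokens; 800k of 1M prompt tokens cached
print(f"{input_cost(200_000, 800_000, 0.90):.2f}")  # 0.54
# The same 1M tokens with no cache hits
print(f"{input_cost(1_000_000, 0, 0.90):.2f}")      # 0.90
```

In this example, an 80% cache hit rate at the default discount lowers the input cost from 0.90 to 0.54, a 40% reduction.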
# How do credits work? Source: https://docs.fireworks.ai/faq-new/billing-pricing/what-happens-when-i-finish-my-1-dollar-credit ## How credits are applied Fireworks uses a **pre-paid credits** model for new self-serve accounts: * Credits are used first for all usage. * If credits are exhausted and auto top-up is disabled, usage pauses until you add credits. * If auto top-up is enabled, credits are purchased automatically when your balance reaches your configured minimum. * You can set a monthly budget cap to limit total spend. Accounts created before **June 1** remain on existing postpaid terms (grandfathered). Enterprise accounts can also be configured for postpaid billing on request. ## Missing credits after purchase? If you don't see your credits reflected immediately: 1. Visit your **billing dashboard** 2. Review the **"Credits"** section 3. Check your **credit balance** and **auto top-up settings** **Important**: In the pre-paid model, usage consumes available credits. If your balance is low, enable auto top-up to avoid interruptions. ## Why did I receive an invoice after depositing credits? Most self-serve accounts on pre-paid billing should not see month-end overage invoices. If you received an invoice, your account is likely on a postpaid contract (for example, grandfathered or enterprise postpaid terms). If this seems unexpected, contact [community\_billing@fireworks.ai](mailto:community_billing@fireworks.ai) so we can confirm your billing configuration. ## What happens when I finish my \$1 credit? When you finish your \$1 credit, the following occurs: ## Account Status * **Without payment method**: Your account will be **suspended** until you add a payment method. For request-rate behavior, see [Account quotas](/guides/quotas_usage/account-quotas#account-wide-request-limits); for serverless TPM upper bounds, see [Serverless rate limits](/serverless/rate-limits). * **With payment method**: Add credits to continue usage. 
[Account-wide request limits](/guides/quotas_usage/account-quotas#account-wide-request-limits) increase, and [serverless TPM upper bounds](/serverless/rate-limits) grow as your account spend tier rises. **Payment Method Requirements:** * Adding a payment method is required to continue service after credit depletion * Add credits (or enable auto top-up) to continue service after credit depletion * New self-serve accounts use pre-paid billing by default * Grandfathered and enterprise postpaid accounts can still receive invoices based on their contract terms * As you spend more with Fireworks, your adaptive usage limits and serverless TPM upper bounds can increase ## Where's my receipt for purchased credits? Receipts for purchased credits are sent via Stripe upon purchase. Check your email for receipts from Stripe (not Fireworks). If you can't find your receipt, contact [community\_billing@fireworks.ai](mailto:community_billing@fireworks.ai). For spend limits, tiers, and account-wide request limits, see [Account quotas](/guides/quotas_usage/account-quotas). For adaptive serverless TPM upper bounds, see [Serverless rate limits](/serverless/rate-limits). # Why might my account be suspended even with remaining credits? Source: https://docs.fireworks.ai/faq-new/billing-pricing/why-might-my-account-be-suspended-even-with-remaining-credits Your account may be suspended due to several factors: 1. **Budget cap reached**: * Your monthly budget cap can pause usage even if you still have credit balance. * Increase your budget cap in Billing to resume usage. 2. **Payment or risk checks**: * Accounts may be temporarily paused if payment verification fails. * In some cases, manual review can temporarily limit usage. 3. **Billing model mismatch**: * Accounts created before **June 1** may still be on grandfathered postpaid terms. * Enterprise accounts may use custom postpaid billing if requested. 
If you're experiencing account suspension issues or need assistance with your budget and billing limits, please contact [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai). # Are there any quotas for serverless? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/are-there-any-quotas-for-serverless Yes. Standard serverless, Priority tier, and Fast all have serverless rate limits and quotas. For the detailed serverless policy, see our [Serverless rate limits guide](/serverless/rate-limits). # Do you provide notice before removing model availability? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/do-you-provide-notice-before-removing-model-availability Yes, we provide advance notice before removing models from the serverless infrastructure: * **Minimum 2 weeks’ notice** before model removal * Longer notice periods may be provided for **popular models**, depending on usage * Higher-usage models may have extended deprecation timelines **Best Practices**: 1. Monitor announcements regularly. 2. Prepare a migration plan in advance. 3. Test alternative models to ensure continuity. 4. Keep your contact information updated for timely notifications. # Do you support Auto Scaling? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/do-you-support-auto-scaling Yes, our system supports **auto scaling** with the following features: * **Scaling down to zero** capability for resource efficiency * Controllable **scale-up and scale-down velocity** * **Custom scaling rules and thresholds** to match your specific needs # How does autoscaling affect my costs? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/how-does-autoscaling-affect-my-costs * **Scaling from 0**: No minimum cost when scaled to zero * **Scaling up**: Each new replica adds to your total cost proportionally. 
For example: * Scaling from 1 to 2 replicas doubles your GPU costs * If each replica uses multiple GPUs, costs scale accordingly (e.g., scaling from 1 to 2 replicas with 2 GPUs each means paying for 4 GPUs total) For current pricing details, please visit our [pricing page](https://fireworks.ai/pricing). # How does billing and scaling work for on-demand GPU deployments? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/how-does-billing-and-scaling-work-for-on-demand-gpu-deployments On-demand GPU deployments have unique billing and scaling characteristics compared to serverless deployments: **Billing**: * Charges start when the server begins accepting requests * **Billed by GPU-second** for each active instance * Costs accumulate even if there are no active API calls **Scaling options**: * Supports **autoscaling** from 0 to multiple GPUs * Each additional GPU **adds to the billing rate** * Can handle unlimited requests within the GPU’s capacity **Management requirements**: * Not fully serverless; requires some manual management * **Manually delete deployments** when no longer needed * Or configure autoscaling to **scale down to 0** during inactive periods **Cost control tips**: * Regularly **monitor active deployments** * **Delete unused deployments** to avoid unnecessary costs * Consider **serverless options** for intermittent usage * Use **autoscaling to 0** to optimize costs during low-demand times # How does billing work for on-demand deployments? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/how-does-billing-work-for-on-demand-deployments On-demand deployments come with automatic cost optimization features: * **Default autoscaling**: Automatically scales to 0 replicas when not in use * **Pay for what you use**: Charged only for GPU time when replicas are active * **Flexible configuration**: Customize autoscaling behavior to match your needs **Best practices for cost management**: 1. 
**Leverage default autoscaling**: The system automatically scales down deployments when not in use 2. **Customize carefully**: While you can modify autoscaling behavior using our [configuration options](https://docs.fireworks.ai/guides/ondemand-deployments#customizing-autoscaling-behavior), note that preventing scale-to-zero will result in continuous GPU charges 3. **Consider your use case**: For intermittent or low-frequency usage, serverless deployments might be more cost-effective For detailed configuration options, see our [deployment guide](https://docs.fireworks.ai/guides/ondemand-deployments#replica-count-horizontal-scaling). # How does the system scale? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/how-does-the-system-scale Our system is **horizontally scalable**, meaning it: * Scales linearly with additional **replicas** of the deployment * **Automatically allocates resources** based on demand * Manages **distributed load handling** efficiently # Are there SLAs for serverless? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/is-latency-guaranteed-for-serverless-models Our multi-tenant serverless offering does not currently come with Service Level Agreements (SLAs) for latency or availability. If you have specific performance or availability requirements, we recommend: * **On-demand deployments**: Provides dedicated resources with predictable performance * **Contact sales**: [Reach out to discuss](https://fireworks.ai/company/contact-us) custom solutions and enterprise options # What are the rate limits for on-demand deployments? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/what-are-the-rate-limits-for-on-demand-deployments On-demand deployments have GPU quotas that determine your maximum allocation. For detailed information about on-demand deployment quotas and GPU limits, see our [Account quotas guide](/guides/quotas_usage/account-quotas#on-demand-deployment-quotas). 
Need higher GPU allocations? [Contact us](https://fireworks.ai/company/contact-us) to discuss custom solutions for your use case. # What factors affect the number of simultaneous requests that can be handled? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/what-factors-affect-the-number-of-simultaneous-requests-that-can-be-handled The request handling capacity is influenced by multiple factors: * **Model size and type** * **Number of GPUs** allocated to the deployment * **GPU type** (e.g., A100 vs. H100) * **Prompt size** and **generation token length** * **Deployment type** (serverless vs. on-demand) # What’s the supported throughput? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/whats-the-supported-throughput Throughput capacity typically depends on several factors: * **Deployment type** (serverless or on-demand) * **Traffic patterns** and **request patterns** * **Hardware configuration** * **Model size and complexity** # Why am I experiencing request timeout errors and slow response times with serverless LLM models? Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/why-am-i-experiencing-request-timeout-errors-and-slow-response-times-with-server Timeout errors and increased response times can occur due to **server load during high-traffic periods**. With serverless, users are essentially **sharing a pool of GPUs** with models pre-provisioned. The goal of serverless is to allow users and teams to **seamlessly power their generative applications** with the **latest generative models** in **fewer than 5 lines of code**. Deployment barriers should be **minimal**, and **pricing is based on usage**. However, this approach has a trade-off: to ensure that users have **consistent access** to the most in-demand models, they are also subject to **minor latency and performance variability** during **high-volume periods**.
With **on-demand deployments**, users reserve GPUs (which are **billed by rented time** instead of usage volume) and don't have to worry about traffic spikes. This is why we recommend the following two ways to address timeout and response-time issues: ### Current solution (recommended for production) * **Use on-demand deployments** for more stable performance * **Guaranteed response times** * **Dedicated resources** to ensure availability We are always investing in ways to improve speed and performance. ### Upcoming improvements * Enhanced SLAs for uptime * More consistent generation speeds during peak load times If you experience persistent issues, please include the following details in your support request: 1. Exact **model name** 2. **Timestamp** of errors (in UTC) 3. **Frequency** of timeouts 4. **Average wait times** ### Performance optimization tips * Consider **batch processing** for handling bulk requests * Implement **retry logic with exponential backoff** * Monitor **usage patterns** to identify peak traffic times * Set **appropriate timeout settings** based on model complexity # Does Fireworks support custom base models? Source: https://docs.fireworks.ai/faq-new/models-inference/does-fireworks-support-custom-base-models Yes, custom base models can be deployed via **firectl**. You can learn more about custom model deployment in our [guide on uploading custom models](https://docs.fireworks.ai/models/uploading-custom-models). # Does the API support batching and load balancing? Source: https://docs.fireworks.ai/faq-new/models-inference/does-the-api-support-batching-and-load-balancing Current capabilities include: * **Load balancing**: Yes, supported out of the box * **Continuous batching**: Yes, supported * **Batch inference**: Yes, supported via the [Batch API](/guides/batch-inference) * **Streaming**: Yes, supported For asynchronous batch processing of large volumes of requests, see our [Batch API documentation](/guides/batch-inference).
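The timeout FAQ above lists "retry logic with exponential backoff" among its performance optimization tips. The following is a minimal sketch of that pattern, not a Fireworks SDK feature: `TransientError` and `call_with_backoff` are hypothetical names, and it assumes your client code converts timeouts or overload responses into `TransientError`.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as timeouts."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, delay))
```

Wrap the request in a function that raises `TransientError` on a timeout and pass it as `fn`. The `sleep` parameter exists only so the wait can be replaced in tests.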
# FLUX image generation Source: https://docs.fireworks.ai/faq-new/models-inference/flux-image-generation ## Can I generate multiple images in a single API call? No, FLUX serverless supports only one image per API call. For multiple images, send separate parallel requests; these are automatically load-balanced across our replicas for optimal performance. ## Does FLUX support image-to-image generation? No, image-to-image generation is not currently supported. We are evaluating this feature for future implementation. If you have specific use cases, please share them with our support team to help inform development. ## Can I create custom LoRA models with FLUX? Inference on FLUX-LoRA adapters is currently supported. However, managed training on Fireworks with FLUX is not, although this feature is under development. Updates about our managed LoRA training service will be announced when available. # How do I control output image sizes when using SDXL ControlNet? Source: https://docs.fireworks.ai/faq-new/models-inference/how-do-i-control-output-image-sizes-when-using-sdxl-controlnet When using **SDXL ControlNet** (e.g., canny control), the output image size is determined by the explicit **width** and **height** parameters in your API request. The input control signal image will be automatically: * **Resized** to fit your specified dimensions * **Cropped** to preserve aspect ratio **Example**: To generate a 768x1344 image, explicitly include these parameters in your request: ```json theme={null} { "width": 768, "height": 1344 } ``` *Note*: While these parameters may not appear in the web interface examples, they are supported API parameters that can be included in your requests. # How to check if a model is available on serverless?
Source: https://docs.fireworks.ai/faq-new/models-inference/how-to-check-if-a-model-is-available-on-serverless ## Web UI Go to [https://app.fireworks.ai/models?filter=LLM\&serverless=true](https://app.fireworks.ai/models?filter=LLM\&serverless=true) ## API You can programmatically retrieve all serverless models using the [List Models API](/api-reference/list-models) with the `supports_serverless=true` filter. ```python theme={null} from fireworks import Fireworks client = Fireworks() # List all serverless models models = client.models.list(filter="supports_serverless=true") for model in models: print(model.name) ``` You can also combine filters and customize the response: ```python theme={null} # List serverless models with pagination models = client.models.list( filter="supports_serverless=true", page_size=50, ) for model in models: print(f"{model.name}: {model.display_name}") ``` ```bash theme={null} curl "https://api.fireworks.ai/v1/accounts/fireworks/models?filter=supports_serverless%3Dtrue" \ -H "Authorization: Bearer $FIREWORKS_API_KEY" ``` With pagination: ```bash theme={null} curl "https://api.fireworks.ai/v1/accounts/fireworks/models?filter=supports_serverless%3Dtrue&pageSize=50" \ -H "Authorization: Bearer $FIREWORKS_API_KEY" ``` The filter parameter uses the [AIP-160 filter syntax](https://google.aip.dev/160). The `supports_serverless` field indicates whether a model is available on serverless infrastructure. See the [List Models API reference](/api-reference/list-models) for all available parameters including `order_by`, `page_size`, and `read_mask`. # There’s a model I would like to use that isn’t available on Fireworks. Can I request it? Source: https://docs.fireworks.ai/faq-new/models-inference/theres-a-model-i-would-like-to-use-that-isnt-available-on-fireworks-can-i-reques Fireworks supports a wide array of custom models and actively takes feature requests for new, popular models to add to the platform. **To request new models**: 1. 
**Join our [Discord server](https://discord.gg/fireworks-ai)** 2. Let us know which models you’d like to see 3. Provide **use case details**, if possible, to help us prioritize We regularly evaluate and add new models based on: * **Community requests** * **Popular demand** * **Technical feasibility** * **Licensing requirements** # What factors affect the number of simultaneous requests that can be handled? Source: https://docs.fireworks.ai/faq-new/models-inference/what-factors-affect-the-number-of-simultaneous-requests-that-can-be-handled Request handling capacity depends on several factors: * **Model size and type** * **Number of GPUs allocated** to the deployment * **GPU type** (e.g., A100, H100) * **Prompt size** * **Generation token length** * **Deployment type** (serverless vs. on-demand) # Fireworks Agent: Classification Source: https://docs.fireworks.ai/fine-tuning/agent/classification Benchmark base models, fine-tune on labeled data, and pick the best classifier — automatically. Fireworks Agent's classification workflow is purpose-built for label-prediction tasks. It evaluates one or more base models against your labeled dataset, fine-tunes the strongest candidates, and reports a head-to-head comparison of base vs fine-tuned accuracy on a held-out test split. Use this workflow for classification, content extraction, intent detection, routing, and other tasks with a discrete label set. For the underlying SFT mechanics (job parameters, supported base models, dataset format), see [Managed Fine-Tuning → Supervised Fine-Tuning](/fine-tuning/fine-tuning-models). The classification workflow is built on top of SFT with classification-specific dataset handling and reporting. ## What you give Agent | Input | Required? 
| Notes | | -------------------------------- | --------- | -------------------------------------------------------------------------------------- | | Dataset ID(s) | **Yes** | Single dataset (split 80/20 train/test) or two datasets (separate train + eval) | | Models to evaluate and fine-tune | **Yes** | Agent does **not** default to "all models"; pick from the supported list when prompted | | Candidate labels | No | Agent infers labels from your data if you don't list them explicitly | | Imbalance-ratio threshold | No | Defaults to `50.0` (ratio of most-frequent to least-frequent label) | ### Dataset requirements * Each sample must contain `messages` in OpenAI chat-completion format. * `ground_truth` is optional. If absent, Agent extracts the label from the final assistant message using task-specific logic in the plan. * `ground_truth` may be a single string or a list of strings. ## Example session instructions Single dataset with automatic split, two candidate models: ```bash theme={null} source .env && firectl session create \ --api-key $FIREWORKS_AGENT_API_KEY \ --instruction "Benchmark and fine-tune classification on accounts/myacct/datasets/intent-labels. Compare Qwen3 8B and Qwen3 32B. Labels are: billing, technical, account, sales." ``` Separate train and eval datasets, model already chosen: ```bash theme={null} source .env && firectl session create \ --api-key $FIREWORKS_AGENT_API_KEY \ --instruction "Run classification fine-tuning on accounts/myacct/datasets/train, eval on accounts/myacct/datasets/test, using Qwen3 8B." 
``` **Where classification lives in the 7-phase pipeline:** Phase 1 is dataset inspection (with per-label sample counts and imbalance detection), phase 2 is plan + cost approval, **phase 3 is the base-model benchmark plus fine-tuning sweep**, phase 4 is the full-data run for each candidate, **phase 5 is the fine-tuned evaluation** with per-label and overall accuracy, phase 6 is deployment of the winner, phase 7 is the base-vs-fine-tuned comparison report. See [How Agent runs a training job](/fine-tuning/agent/introduction#how-agent-runs-a-training-job). ## Workflow stages Agent stages your dataset locally once, computes per-label sample counts, detects the imbalance ratio, and inspects message structure. If `ground_truth` is missing, Agent decides how to extract the label from the final assistant turn. If you provided candidate labels in your instruction, Agent uses them. Otherwise, Agent infers them from the data and surfaces the inferred set in the plan for you to confirm. Agent computes the ratio of most-frequent to least-frequent label. If it exceeds the threshold (default `50.0`), Agent flags the imbalance in the plan and proposes mitigations (rebalancing, weighted training, or a smaller candidate set). Agent writes a plan and presents it with a cost breakdown (base-model evaluation inference + fine-tuning + fine-tuned evaluation inference + total). Approve once to proceed. Agent runs inference on the held-out eval split against each selected base model and reports per-label and overall accuracy. This becomes the baseline. Agent fine-tunes the strongest candidates with the default tuning grid (or your explicit grid), batched into waves of up to 6 active jobs. Agent runs the same eval inference against each fine-tuned model and computes per-label accuracy, overall accuracy, and confusion-matrix style summaries. 
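The imbalance ratio and the accuracy figures described in these stages are simple to reproduce locally. The following is a rough sketch under stated assumptions, not Agent's actual implementation: the function names are hypothetical, the labels are made-up sample data, and each sample is assumed to have exactly one ground-truth label.

```python
from collections import Counter, defaultdict


def imbalance_ratio(labels):
    """Ratio of the most-frequent label count to the least-frequent label count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def accuracy_report(ground_truth, predictions):
    """Return (overall accuracy, per-label accuracy) for single-label predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(ground_truth, predictions):
        total[truth] += 1
        if pred == truth:
            correct[truth] += 1
    per_label = {label: correct[label] / total[label] for label in total}
    overall = sum(correct.values()) / len(ground_truth)
    return overall, per_label


# Made-up sample data for illustration only.
truth = ["billing", "billing", "billing", "technical", "sales", "sales"]
preds = ["billing", "billing", "sales", "technical", "sales", "billing"]

ratio = imbalance_ratio(truth)  # 3 "billing" samples vs 1 "technical" sample
overall, per_label = accuracy_report(truth, preds)
print("imbalance ratio:", ratio, "exceeds default threshold:", ratio > 50.0)
print("overall accuracy:", overall)
print("per-label accuracy:", per_label)
```

Running the same report on base-model predictions and on fine-tuned predictions over the same held-out split gives the kind of side-by-side comparison the final report describes.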
Agent picks the winner, deploys it (phase 6), and writes the final comparison report (phase 7) showing base vs fine-tuned accuracy per candidate plus a `fireworks-ai` SDK snippet for inference. ## Output When the session reports `succeeded`, Agent's response includes: * Per-label and overall accuracy for every base model evaluated * Per-label and overall accuracy for every fine-tuned candidate * The winning model ID, deployment ID, and inference endpoint * A `fireworks-ai` SDK snippet for label prediction * `final_report.md` in the session workspace with the full base vs fine-tuned comparison and estimated-vs-actual cost ## Customizing the run * **Explicit labels:** *"Labels are: positive, negative, neutral."* * **Imbalance threshold override:** *"Use an imbalance threshold of 20."* * **Inference-only mode:** *"Just benchmark — don't fine-tune."* * **Single candidate:** *"Only fine-tune Qwen3 8B, skip the base-vs-base comparison."* * **Custom split:** *"Use a 70/30 train/test split."* **Agent crib notes** * Required inputs: dataset ID and at least one base model to evaluate. Agent will ask explicitly if either is missing. * Agent needs to know your labels eventually — either supply them in the instruction or confirm Agent's inferred set when prompted. * The default imbalance threshold is `50.0`; if your dataset is highly imbalanced, expect Agent to flag it in the plan. * For multi-label classification (a sample with multiple ground-truth labels), pass `ground_truth` as a list in your dataset. * Agent creates new resources only; your dataset, models, and deployments are never deleted or modified in place. # Fireworks Agent: Preference Learning (DPO/ORPO) Source: https://docs.fireworks.ai/fine-tuning/agent/dpo Run preference fine-tuning end-to-end with optional base-model sweep, automatic pair generation, and pairwise evaluation. 
Fireworks Agent's preference-learning workflow runs DPO or ORPO fine-tuning against pre-paired preference data, or generates pairs for you from a prompts-only dataset using delta learning. It can sweep multiple base models when you don't know which to pick, evaluates winners pairwise (or with your evaluator), and produces a final comparison report. For the underlying DPO mechanics and dataset format details, see [Managed Fine-Tuning → DPO Fine-Tuning](/fine-tuning/dpo-fine-tuning). This page documents the Fireworks Agent workflow built on top of it. ## What you give Agent | Input | Required? | Notes | | ------------------ | --------- | ----------------------------------------------------------------------------------------------------------------------- | | Dataset ID(s) | **Yes** | A single dataset (split 80/20 train/test automatically) or two datasets (separate train + test) | | Base model | No | If omitted, Agent runs a **model sweep** across supported base models to pick the best automatically | | Evaluator | No | Evaluator ID, custom rubric text, or none (Agent builds a data-grounded pairwise judge rubric if you don't provide one) | | Performance target | No | Optional goal score, for example *"win rate above 70%"* | ## Example session instructions Pre-paired preference data with a specific base model: ```bash theme={null} source .env && firectl session create \ --api-key $FIREWORKS_AGENT_API_KEY \ --instruction "Run DPO on accounts/myacct/datasets/customer-prefs using Qwen3 32B." ``` Prompts-only dataset with automatic pair generation and a base-model sweep: ```bash theme={null} source .env && firectl session create \ --api-key $FIREWORKS_AGENT_API_KEY \ --instruction "Run preference learning on accounts/myacct/datasets/prompts-only. Generate preference pairs automatically and sweep base models to find the best one." 
``` **Where DPO lives in the 7-phase pipeline:** Phase 1 is dataset inspection, phase 2 is plan + cost approval, **phase 3 is the preference sweep** (replacing the SFT HP sweep — includes pair generation up-front for Format B), phase 5 is the pairwise evaluation, phase 6 is deployment of the winner, phase 7 is the final report. DPO does **not** run a separate phase 4 full-data retrain — the sweep itself is the training run on the chosen base model + config. See [How Agent runs a training job](/fine-tuning/agent/introduction#how-agent-runs-a-training-job). ## Dataset formats Agent accepts two formats: ### Format A — DPO format (pre-paired preferences) Each sample has `input`, `preferred_output`, and `non_preferred_output` fields. `input.messages` holds the conversation; `preferred_output` and `non_preferred_output` hold candidate assistant responses. When this format is detected, Agent skips pair generation and goes straight to training. ### Format B — prompts-only Each sample has a `messages` field with user messages only (no assistant completions). Agent generates preference pairs automatically using **delta learning**: it samples completions from a strong and a weak model, then constructs preferred/non-preferred pairs for training. ## Workflow stages Agent stages the dataset locally exactly once per session, computes token statistics, and decides between Format A (skip pair generation) and Format B (generate pairs). Agent resolves your evaluator choice (evaluator ID / custom rubric / auto rubric) and asks for anything missing — usually the dataset and, if you omitted both base model and grid, confirmation that a base-model sweep is OK. Agent presents a plan plus a cost breakdown (training + any pair-generation inference + evaluator inference + total) and asks for a single approval covering both. Agent generates preference pairs via delta learning and uploads the resulting dataset to Fireworks under a new, timestamped name. Your original dataset is left untouched. 
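To make the two dataset formats concrete, here is a small sketch that turns one prompts-only sample plus a strong and a weak completion into a pre-paired record. It is illustrative only: `build_preference_record` is a hypothetical helper, the completions are placeholder strings rather than real model output, and representing each output as a one-item list of assistant messages is an assumption; check the DPO dataset format documentation for the exact schema.

```python
import json


def build_preference_record(messages, strong_completion, weak_completion):
    """Build one pre-paired record: strong output preferred, weak output non-preferred."""
    return {
        "input": {"messages": messages},
        # Assumed shape: each output is a list holding one assistant message.
        "preferred_output": [{"role": "assistant", "content": strong_completion}],
        "non_preferred_output": [{"role": "assistant", "content": weak_completion}],
    }


# Placeholder prompt and completions; in delta learning these would be sampled
# from a strong model and a weak model respectively.
prompt = [{"role": "user", "content": "Summarize our refund policy in one sentence."}]
record = build_preference_record(
    prompt,
    strong_completion="Refunds are available within 30 days of purchase.",
    weak_completion="We have a policy.",
)
print(json.dumps(record))  # one line of a JSONL dataset
```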
If no base model was specified, Agent runs DPO/ORPO across a curated set of supported base models. If a base model was specified, Agent runs an HP sweep against that single base model. Training jobs are batched (default cap of 6 active at once). For each trained model, Agent generates completions on the held-out test split and scores them. With your own evaluator, scores are reported independently. Without one, Agent uses a pairwise judge rubric grounded in actual training samples. Agent deploys the winning fine-tuned model and writes a final report comparing base and fine-tuned models, with the deployment endpoint and (if you supplied a performance target) whether the target was met. ## Evaluator handling Agent supports three evaluator paths, in priority order: 1. **Evaluator ID** — for example `accounts/myacct/evaluators/my-eval`. Agent fetches the evaluator code, installs dependencies, and runs it to score each model's completions independently. Agent reports average scores for the base model and every fine-tuned candidate. 2. **Custom rubric text** — provide a pairwise LLM judge rubric in your instruction. Agent uses it to compare two completions head-to-head. 3. **Neither** — Agent inspects training samples and writes a data-grounded pairwise judge rubric automatically. ## Output When the session reports `succeeded`, Agent returns: * The winning fine-tuned model ID and its deployment endpoint * Base vs fine-tuned comparison: scores or win rate from the chosen evaluator * A copy-paste `fireworks-ai` SDK snippet for the deployed model * `final_report.md` in the session workspace with per-model scores, pair-generation provenance (if Format B), and estimated-vs-actual cost ## Supported base models The model sweep selects from the supported preference-learning base models. For the canonical list, see [Managed Fine-Tuning Overview → Supported base models](/fine-tuning/managed-finetuning-intro#supported-base-models). 
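The win rate mentioned under Output, and in performance targets such as "win rate above 70%", can be computed from pairwise judgments roughly as follows. This is a sketch with assumptions: the `"win"`/`"loss"`/`"tie"` labels, the `win_rate` helper, and counting a tie as half a win are illustrative choices, not a description of how Agent scores ties.

```python
def win_rate(judgments, count_ties_as=0.5):
    """Share of pairwise comparisons won by the fine-tuned model.

    judgments: iterable of "win", "loss", or "tie", seen from the fine-tuned model's side.
    """
    scores = {"win": 1.0, "loss": 0.0, "tie": count_ties_as}
    values = [scores[j] for j in judgments]
    if not values:
        raise ValueError("no judgments to score")
    return sum(values) / len(values)


# Made-up judgments: 7 wins, 2 losses, 1 tie.
judgments = ["win", "win", "loss", "tie", "win", "win", "loss", "win", "win", "win"]
rate = win_rate(judgments)
target = 0.70
print(f"win rate {rate:.2f}, target met: {rate > target}")
```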
## Customizing the run * **Pin a base model:** *"Use Qwen3 32B."* — skips the model sweep. * **Explicit grid:** *"Sweep Qwen3 32B and Qwen3-30B-A3B with beta 0.1 and 0.3."* * **Bring your own evaluator:** *"Use evaluator accounts/myacct/evaluators/my-rubric."* * **Auto-generate pairs:** *"Generate preference pairs automatically."* * **Set a target:** *"Stop early once we reach 75% win rate against the base."* **Agent crib notes** * Required input: dataset ID. Everything else is optional. * Agent will pause for one approval (plan + cost) and again at the comparison report. The promotion gate appears only when a clear winner needs confirmation. * If the dataset is prompts-only, Agent will generate pairs by sampling strong and weak models — expect inference cost on top of training cost. * Agent always creates new datasets with timestamped names; your original dataset is never overwritten. * For deeper customization of the loss (custom beta schedules, hybrid objectives), use the [Training API](/fine-tuning/training-api/introduction) instead. # Fireworks Agent: Evaluator Authoring Source: https://docs.fireworks.ai/fine-tuning/agent/evaluators Have Fireworks Agent generate a reusable evaluator from your dataset — for scoring candidates in an SFT sweep, or for use with Managed RFT. Fireworks Agent can write a task-specific evaluator from your dataset alone. Two flavors: * **SFT evaluators** — a Python evaluator (`evaluator.py`) plus a spec (`eval_spec.md`) that Agent uses to score candidates during a subsequent SFT sweep in the same session. * **RFT evaluators** — an Eval Protocol `@evaluation_test` evaluator ready to drive a Reinforcement Fine-Tuning job. Use evaluator authoring when you have a dataset and a clear notion of what "correct" looks like, but no evaluator script yet. 
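For orientation, the scoring logic inside an evaluator can be as small as the sketch below. This is not the `evaluator.py` that Agent generates and it does not use the Eval Protocol API; `normalize`, `score_sample`, and `evaluate` are hypothetical names, and lenient exact match is just one possible definition of "correct".

```python
import re


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace for lenient matching."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def score_sample(model_output, ground_truth):
    """Return 1.0 if the normalized output matches any accepted answer, else 0.0."""
    # Accept either a single string or a list of acceptable strings.
    accepted = [ground_truth] if isinstance(ground_truth, str) else list(ground_truth)
    return 1.0 if normalize(model_output) in {normalize(a) for a in accepted} else 0.0


def evaluate(outputs, ground_truths):
    """Average per-sample score over a dataset."""
    scores = [score_sample(o, g) for o, g in zip(outputs, ground_truths)]
    return sum(scores) / len(scores)
```

Writing down the contract first (what counts as a match, whether partial credit exists, how edge cases are handled) makes a function like `score_sample` easy to review, which is the role the spec file plays alongside the evaluator.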
## SFT evaluators ### What you get Agent generates two artifacts in the session workspace: * `outputs/eval_spec.md` — a human-readable spec describing what the evaluator checks (the contract: what counts as correct, how partial credit works, edge cases). * `outputs/evaluator.py` — a Python evaluator that takes a model's outputs and the dataset's ground truth and returns scores. After the artifacts are written, Agent surfaces the full `eval_spec.md` and `evaluator.py` contents in chat so you can review them before they're used downstream. ### Example session instructions Author an evaluator only: ```bash theme={null} source .env && firectl session create \ --api-key $FIREWORKS_AGENT_API_KEY \ --instruction "Generate an evaluator for accounts/myacct/datasets/customer-support. Outputs are short text answers; check whether the final assistant reply matches ground truth on key facts." ``` Author an evaluator and continue straight into SFT in the same session — Agent reuses the freshly-written evaluator without re-authoring: ```bash theme={null} source .env && firectl session create \ --api-key $FIREWORKS_AGENT_API_KEY \ --instruction "Generate an evaluator for accounts/myacct/datasets/customer-support, then run SFT on Qwen3 8B and use that evaluator to pick the winning candidate." ``` **Where evaluator authoring lives in the 7-phase pipeline:** When evaluator authoring runs as a standalone session, phases 3–7 of the standard pipeline don't apply; the session writes `outputs/evaluator.py` + `outputs/eval_spec.md` and stops. When you chain authoring into SFT in the same session, those artifacts feed **phase 5 (Evaluation)** of the follow-on training pipeline — used to score candidates during phase 3 and again for direct evaluation of the final model. (RFT evaluators are saved to your Fireworks account and then used by [Managed Fine-Tuning's RFT path](/fine-tuning/reinforcement-fine-tuning-models), not by Agent.) 
See [How Agent runs a training job](/fine-tuning/agent/introduction#how-agent-runs-a-training-job). When you ask for both in one instruction, Agent writes the evaluator first, then automatically continues into SFT with **same-session evaluator reuse**: the SFT workflow picks up `outputs/evaluator.py` and `outputs/eval_spec.md` without re-authoring them, and reuses the staged dataset paths so the dataset is downloaded only once. ### Multi-turn handoff If you want fine-grained control of the handoff, structure your two instructions like this: ```bash theme={null} source .env && firectl session create \ --api-key $FIREWORKS_AGENT_API_KEY \ --instruction "Generate an evaluator for accounts/myacct/datasets/mydata." # Wait for evaluator artifacts to be written and presented in chat. ``` Then continue in the **same session**: ```bash theme={null} source .env && firectl session update \ --api-key $FIREWORKS_AGENT_API_KEY \ --instruction "Now run SFT on Qwen3 32B using the evaluator we just authored. Reuse outputs/evaluator.py and outputs/eval_spec.md — do not regenerate them." ``` Agent will inherit the staged dataset and the evaluator artifacts without re-downloading or rewriting them. ## RFT evaluators **Agent authors RFT evaluators but does not run RFT training.** This workflow produces and validates the Eval Protocol evaluator file, then registers it with your Fireworks account. The actual RFT training job runs through [Managed Fine-Tuning's RFT path](/fine-tuning/reinforcement-fine-tuning-models) — not from an Agent session. ### What you get An Eval Protocol `@evaluation_test` evaluator file, validated end-to-end, ready to drop into a Reinforcement Fine-Tuning job. The plan includes the concrete evaluator code, validation commands, and the command to save the evaluator to Fireworks. 
This is purpose-built for tasks where you can score model outputs against reference data — math problems, code generation, structured-output extraction, agentic workflows with verifiable side effects. ### Example session instruction ```bash theme={null} source .env && firectl session create \ --api-key $FIREWORKS_AGENT_API_KEY \ --instruction "Build an RFT evaluator for accounts/myacct/datasets/math-problems. Score whether the final numeric answer matches ground truth." ``` Agent inspects samples, writes the evaluator, validates it on a few records, and presents the plan with the save command. You approve once and Agent executes the plan, registering the evaluator with your Fireworks account. ### Handing off to RFT training Once the evaluator is saved, run the RFT job through Managed Fine-Tuning — see the [Reinforcement Fine-Tuning Overview](/fine-tuning/reinforcement-fine-tuning-models) and [Evaluators concepts](/fine-tuning/evaluators). For example: ```bash theme={null} firectl rftj create \ --base-model accounts/fireworks/models/qwen3-8b \ --evaluator accounts/myacct/evaluators/ \ --dataset accounts/myacct/datasets/math-problems ``` Or use the [Web UI](/fine-tuning/web-ui-guide) to launch the RFT job interactively. ## Workflow summary Agent stages the dataset locally, samples records, and infers the evaluator contract from data plus your scoring intent. Agent will not finalize an evaluator without successfully staging readable data. For SFT, Agent writes both `eval_spec.md` (the contract) and `evaluator.py` (the implementation) and self-checks that both are non-empty before finishing. For RFT, Agent writes a single Eval Protocol `@evaluation_test` file and self-checks that it's non-empty and that validation succeeds. Agent surfaces the artifacts inline in chat. For RFT, Agent also presents a plan with validation and save commands and asks for one approval. 
If your instruction asks for downstream SFT, Agent continues into the SFT workflow in the same session and reuses the just-authored evaluator — no re-downloading, no re-authoring. RFT training itself runs through [Managed Fine-Tuning](/fine-tuning/reinforcement-fine-tuning-models), not from an Agent session. ## When to use which | Use case | Workflow | | ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | You want an evaluator Agent can use to score candidates during an SFT sweep, with optional auto-continue into SFT | **SFT evaluator authoring** (run end-to-end by Agent) | | You want an Eval Protocol evaluator to drive an RFT job | **RFT evaluator authoring** (Agent writes and saves the evaluator; RFT training runs through [Managed Fine-Tuning](/fine-tuning/reinforcement-fine-tuning-models)) | | You don't have a clear notion of "correct" yet | Start with **validation-loss-only SFT** on [Agent SFT](/fine-tuning/agent/sft) and add an evaluator later | **Agent crib notes** * Required input: dataset ID. Agent also wants your scoring intent in plain English — "check whether the answer matches ground truth", "verify the JSON has the right schema", etc. * For SFT evaluators, ask for both authoring and SFT in the same instruction to get same-session evaluator reuse for free. * For RFT evaluators, expect a plan + cost approval before the evaluator is saved to your Fireworks account. **The Agent session ends after the evaluator is saved.** Hand off to [Managed Fine-Tuning's RFT path](/fine-tuning/reinforcement-fine-tuning-models) to run the actual RFT training job. * Agent surfaces the generated `eval_spec.md` and `evaluator.py` inline in chat after authoring — relay them to the user. 
* All evaluator artifacts live under `outputs/` in the session workspace and can be inspected via `firectl session get ` if needed. # Fireworks Agent Overview Source: https://docs.fireworks.ai/fine-tuning/agent/introduction Describe what you want, approve the plan and cost, get a deployed fine-tuned model. Fireworks Agent is a hosted Fireworks assistant that owns the full fine-tuning loop. You describe what you want — *"fine-tune a model that classifies our support tickets"*, *"improve Llama 3.1 70B on our function-calling data"*, *"train a smaller model that matches GPT-4 on our routing task"* — and Agent picks the base model, prepares the dataset, runs a hyperparameter sweep, submits training, evaluates the result, and deploys the fine-tuned model. You stay in the loop for approvals and final calls; everything else is handled. Agent is the easiest of the three Fireworks fine-tuning paths, sitting alongside [Managed Fine-Tuning](/fine-tuning/managed-finetuning-intro) and the [Training API](/fine-tuning/training-api/introduction). It's the right starting point when you want a working fine-tuned model without writing config files or Python training loops. **Naming.** This documentation refers to the product as **Fireworks Agent** (or just **Agent**). You may also see it called `pilot` in internal source code, in CLI permission presets (`--permission-preset=pilot`), in the embedded manifest file (`pilot.yaml`), and in some legacy support contexts — those are all the same product. Use *"Fireworks Agent"* or *"Agent"* in your own prompts and communication. ## What Agent does for you Agent recommends a base model and tuning method (SFT, DPO, or classification) from your task description and a peek at your data. Inspects your dataset, proposes hyperparameters, estimates cost, and presents a single plan for approval before any spend. Submits the job, streams progress, evaluates checkpoints, and ships a deployed model at the end. 
Concretely, Agent can: * Run **SFT, DPO, and classification** jobs from a natural-language prompt * Inspect your dataset and call out format issues before training starts * Recommend a base model from a curated panel based on your task shape * Run a short **hyperparameter sweep** before committing to full training * Stream a live progress feed with eval loss, cost-so-far, and ETA * Evaluate the trained model against a held-out set and surface the best checkpoint * Deploy the fine-tuned model so you can call it from `chat/completions` immediately * Author task-specific [evaluators](/fine-tuning/agent/evaluators) for use in SFT sweeps, or Eval Protocol evaluators you can then run through [Managed Fine-Tuning's RFT path](/fine-tuning/reinforcement-fine-tuning-models) * Answer questions about your account, deployments, jobs, and Fireworks models along the way Agent does **not** run RFT training itself — for that, author the evaluator with Agent and then submit the RFT job through [Managed Fine-Tuning](/fine-tuning/reinforcement-fine-tuning-models). Agent also cannot run an arbitrary Python training loop, use a custom loss function, or sample mid-training from your own evaluator — for those, use the [Training API](/fine-tuning/training-api/introduction) directly. ## Architecture ```mermaid theme={null} flowchart LR Client["Client
(user via web app,
user via firectl / REST API,
or coding agent)"] -->|"create session"| AgentAPI["Fireworks Agent API"] AgentAPI -->|dispatch| Runner["Session Runner"] Runner -->|"plan + cost estimate"| AgentAPI AgentAPI -->|"events stream"| Client Client -->|"approve / answer"| AgentAPI AgentAPI -->|"session update"| Runner Runner -->|"firectl + Fireworks API"| Platform["Fireworks Platform"] Platform -->|results| Runner Runner -->|"final report + deployed model"| Client ``` The runner is an ephemeral, sandboxed environment with its own filesystem. It executes Agent's plan against your Fireworks account using your API key. Sessions can pause for hours or days waiting on user input without consuming compute. ## Two ways to use Agent The default — and recommended — surface for most users. Open **Agent** in the left nav of [app.fireworks.ai](https://app.fireworks.ai) for a chat interface that streams Agent's plan, progress, and final report. Best for: * Most fine-tuning workflows, end to end * Teams that want a visual plan, cost, and approval UX * Watching a long training run with a live progress feed * Skipping `firectl` installation and service-account setup ### Dashboard quickstart Click **Agent** in the left navigation at [app.fireworks.ai](https://app.fireworks.ai). A good first prompt is specific about *what* you're training for, *what data* to use, and *what success looks like*: ```text theme={null} Fine-tune a model on accounts/your-account/datasets/support-tickets. Classify each ticket into one of 12 categories. Target: better than GPT-4 mini on accuracy. Budget: under $5. ``` Agent will inspect the dataset, propose a plan, and stop for your approval. Agent presents one structured plan with a cost estimate. Approve, request a change (*"use Qwen3 32B instead"*, *"skip HP tuning"*), or cancel. No spend happens before this gate. Agent streams phase-anchored updates every few minutes through the final report, which includes the deployed model ID and inference endpoint. 
The advanced path, for power users and anyone already living in a coding-agent harness. Use it two ways: * **Drive Agent directly from `firectl session`** — script it, run it from CI, or call the REST API. * **Let Claude Code, Cursor, Codex, Aider, Goose, or another coding agent drive it for you** by installing the [Fireworks Agent skill file](/fine-tuning/agent/use-with-coding-agents). The coding agent shells out to `firectl session` using a scoped service-account key. Best for: * Fine-tuning as a step in a larger coding workflow * Reproducing a training run with code-checked-in instructions * Power users who already orchestrate everything from their coding agent or terminal * Scripting and automation against the `firectl session` / REST API ### CLI quickstart Create a service account scoped to Agent's capabilities (the `pilot` permission preset — see the [security section below](#security-service-accounts-and-the-agent-manifest) for the rationale) and mint an API key: ```bash theme={null} firectl -a user create \ --service-account \ --user-id=fireworks-agent \ --permission-preset=pilot firectl -a api-key create --service-account=fireworks-agent ``` Save the returned key in a `.env` file in your project root: ```bash .env theme={null} FIREWORKS_AGENT_API_KEY=fw-... ``` The Fireworks Agent skill sources `.env` automatically. See [Service Accounts](/accounts/service-accounts) for the full setup. ```bash theme={null} source .env && firectl session create \ --api-key $FIREWORKS_AGENT_API_KEY \ --instruction "Run SFT on Qwen3 32B using accounts/myacct/datasets/mydata" ``` The command returns a session ID, for example `abc123`. ```bash theme={null} source .env && firectl session events abc123 --api-key $FIREWORKS_AGENT_API_KEY --wait ``` The `--wait` flag keeps streaming until the session reaches `waiting`, `succeeded`, `failed`, or `cancelled`. Without it, the command dumps existing events and exits. 
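A driver script can branch on the four states at which `--wait` stops streaming. The helper below is a minimal sketch in plain shell, not part of `firectl`: the function name `next_action` and the action labels are hypothetical, and the status is passed in as an argument because the exact output format of `firectl session events` is not shown here.

```bash
# Hypothetical helper: map a session status to what a driver should do next.
# Status names come from the text above; the action labels are illustrative.
next_action() {
  case "$1" in
    waiting)          echo "ask-user" ;;       # surface Agent's question, then send the reply
    succeeded)        echo "report" ;;         # surface the final report
    failed|cancelled) echo "triage" ;;         # surface the error and ask how to proceed
    *)                echo "keep-streaming" ;; # any other status: keep waiting on events
  esac
}

next_action waiting
```

Keeping the mapping in one function means the "ask the user before replying" rule lives in a single place rather than being repeated at every call site.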
When the stream stops at `waiting`, read Agent's question, then send your answer back to the same session:

```bash theme={null}
source .env && firectl session update abc123 \
  --api-key $FIREWORKS_AGENT_API_KEY \
  --instruction "Approved, proceed."
```

Re-run `firectl session events abc123 --wait` to resume. Repeat until the session reports `succeeded`.

## How Agent runs a training job

Every Agent session moves through the same seven phases. Coding agents should expect this sequence; humans can use it as a mental model for what to expect next.

| # | Phase | What happens |
| - | ----------------------- | ------------ |
| 1 | **Data inspection** | Agent reads your dataset, reports format, sample count, token count, and any issues. |
| 2 | **Planning & approval** | Agent proposes base model, tuning method, hyperparameters, eval path, and a cost estimate. You approve, edit, or cancel. |
| 3 | **HP tuning** | A short parallel sweep (typically 3 configs) over LoRA rank and learning rate, capped at 6 active jobs by default. |
| 4 | **Full training** | The best config from phase 3 runs to completion on the full dataset, with per-epoch eval loss. |
| 5 | **Evaluation** | The trained model is evaluated against a held-out set using one of three strategies you pick in phase 2: validation loss only (default), an evaluator you provide, or an evaluator Agent generates for you. |
| 6 | **Deployment** | The model is deployed and a `fireworks-ai` SDK snippet is ready for inference. |
| 7 | **Final report** | Deployed model ID, key metrics, total cost, and per-phase summary in one message. |

DPO uses the same shape with phase 3 replaced by a preference sweep (or pair generation followed by a preference sweep when the dataset is prompts-only).
Classification uses the same shape with phase 3 expanded into a base-model benchmark plus a fine-tuning sweep, and phase 5 reports per-label and overall accuracy. The promotion gate between phase 3 and phase 4 is one of the two user-facing pauses (the other is plan approval in phase 2). ### The approval and cost contract Agent never spends without an explicit approval. This is structural, not a setting. At the end of **Phase 2 (Planning)** — and again before any new spend-incurring step — Agent surfaces a structured cost preview and waits for approval. In the dashboard this is a yes/no prompt. From a coding agent, the skill holds the session in a `waiting` state, surfaces Agent's exact question, and only proceeds after you respond via `firectl session update`. Reject and the session ends with no charges. The preview always includes: * Total estimated cost (in USD, with a confidence range) * Estimated wall time * Per-phase cost breakdown (HP tuning / full training / evaluation / deployment) * Cost-so-far in the session (for re-approvals on long runs) ### Out-of-coverage behavior If you ask Agent to use a model or method outside its supported set, it refuses rather than silently approximating. For example, asking for full-parameter tuning on a model with no Agent recipe returns a clear *"not supported in Agent — use Managed Fine-Tuning or the Training API"* message with a pointer to the right surface. See [When not to use Agent](#when-not-to-use-agent). ## What Agent can do today End-to-end SFT with dataset inspection, hyperparameter sweep, evaluator-guided model selection, and a deployed winner. Run DPO or ORPO on pre-paired preferences or generate pairs automatically with delta learning, with an optional base-model sweep. Benchmark base models, fine-tune on labeled data, and compare base vs fine-tuned classification accuracy on a held-out split. 
Generate a reusable Python evaluator Agent uses to score candidates during an SFT sweep, or an Eval Protocol evaluator you can take to a Managed RFT job — directly from your dataset. Copy-paste skill files for Claude Code, Cursor, Codex, Aider, and Goose so they can drive Agent for you. ## Agent vs Managed Fine-Tuning vs Training API All three sit on the same training infrastructure, GPU shapes, and tuning methods. The difference is how much you drive. | | **Fireworks Agent** | **Managed Fine-Tuning** | **Training API** | | ------------------------------- | ------------------------------------------------------------------------- | --------------------------------- | ---------------------------------- | | **Interface** | Natural language (dashboard chat, `firectl session`, or via coding agent) | UI, `firectl`, REST | Python script | | **Who picks the model** | Agent recommends | You | You | | **Who tunes hyperparameters** | Agent runs a sweep | You set them | You set them | | **Cost approval** | Built-in gate | None — you submit jobs directly | None | | **Custom loss / training loop** | Not supported | Not supported | Supported | | **Inference-in-the-loop eval** | Not supported | Not supported | Supported (hotload) | | **Best for** | Getting a working fine-tuned model fast, without ML expertise | Production runs with known config | Research, custom RL, hybrid losses | ### When not to use Agent Reach for a more direct surface when: * You need a **custom loss function** or hybrid objective → [Training API](/fine-tuning/training-api/introduction) * You need to **hotload checkpoints** for mid-training inference evaluation → [Training API](/fine-tuning/training-api/introduction) * You already know your config and just want to **submit a job** → [Managed Fine-Tuning](/fine-tuning/managed-finetuning-intro) * You need **full-parameter tuning** on a model Agent doesn't cover → [Managed Fine-Tuning](/fine-tuning/managed-finetuning-intro) * You're training in a **fully 
automated CI pipeline** with no human approval → Agent's approval gate is interactive by design; [Managed Fine-Tuning](/fine-tuning/managed-finetuning-intro) is the better fit today ## Security: service accounts and the Agent manifest When a coding agent drives Fireworks Agent on your behalf, it should authenticate as a **service account** with the `pilot` permission preset, not your personal user key. This enforces a layered permissions model: > Effective permissions = User role ∩ Agent capability manifest ### The manifest is a real artifact The Agent capability manifest is a versioned YAML file (`pilot.yaml`, kept under its original internal name) embedded into the Fireworks control-plane binary at build time. It enumerates the exact set of RPC methods the `pilot` preset is allowed to call — roughly 80 methods grouped by capability surface: * **Account & billing** — `GetAccountUsage`, `GetQuota`, `ListQuotas`, `ListCosts` * **Models** — `GetModel`, `ListModels`, `CreateModelVersion`, `PrepareModel`, `ValidateModelUpload` * **Deployments** — `GetDeployment`, `CreateDeployment`, `DeployModelVersion`, `GetDeploymentMetrics` * **Datasets** — `CreateDataset`, `GetDataset`, `ListDatasets`, `PreviewDataset`, `SplitDataset` * **Evaluators and evaluations** — `CreateEvaluator`, `GetEvaluator`, `CreateEvaluation`, `TestEvaluation` * **Fine-tuning jobs** — `CreateSupervisedFineTuningJob`, `CreateDpoJob`, `CreateReinforcementFineTuningJob`, `CreateRlorTrainerJob` *(the RFT and RLOR-trainer RPCs are granted by the manifest but Agent's current workflows don't use them — see [What Agent does for you](#what-agent-does-for-you))* * **Training shapes** — `GetTrainingShape`, `ListTrainingShapes` * **Batch inference and inference logs** — `CreateBatchInferenceJob`, `ListInferenceLogs` The control plane enforces the manifest as a **hard ceiling** before checking the underlying user's role: even if the user has broader permissions, the preset cannot exceed what the manifest allows. 
Any RPC outside the manifest returns `PERMISSION_DENIED` at the API gateway, regardless of how the request was constructed.

### Non-destructive guarantee, structurally enforced

Agent's promise to never delete, cancel, or destroy your existing resources is enforced by the manifest itself, not by skill-level politeness. The manifest **does not include any `Delete*`, `Cancel*`, or other destructive RPC methods**. Even a malicious or hallucinated tool call targeting `DeleteModel`, `CancelReinforcementFineTuningJob`, or `DeleteDeployment` is rejected at the control plane before it reaches the resource layer.

### Cross-account reads, never cross-account writes

The `pilot` preset is granted **read-only** access across accounts. This is what lets Agent reach Fireworks-owned public resources (base models at `accounts/fireworks/models/...`, public deployment shapes, public datasets) using only your account's API key. Agent cannot write into any other account; mutating operations are scoped to your account.

### Auto-update on control-plane releases

Because the manifest is compiled into the control-plane binary, expanded Agent capabilities ship automatically with every control-plane deploy. Your service account stores only the preset *name* (`pilot`), not the list of allowed methods, so new capabilities are picked up without rotating keys or re-provisioning the service account. See [Service Accounts](/accounts/service-accounts) for setup details.
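The layered model in this section (effective permissions = user role ∩ capability manifest) is a plain set intersection. The sketch below only illustrates that idea: the real check runs inside the Fireworks control plane, the function name `effective_permissions` is hypothetical, and the two method lists are made-up samples, not actual role or manifest contents.

```bash
# Illustration only: print each method from list $1 (user role) that also
# appears, as a whole line, in list $2 (manifest). Lists are newline-separated.
effective_permissions() {
  printf '%s\n' "$1" | while IFS= read -r method; do
    if printf '%s\n' "$2" | grep -Fxq -- "$method"; then
      printf '%s\n' "$method"
    fi
  done
}

# Made-up sample lists (assumptions for the example).
role='GetModel
DeleteModel
CreateDataset'
manifest='GetModel
CreateDataset
ListModels'

# Prints GetModel and CreateDataset; DeleteModel is dropped because the
# sample manifest does not list it, even though the sample role does.
effective_permissions "$role" "$manifest"
```

The same shape explains the hard ceiling: widening the role list never adds a method that the manifest list lacks.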
## Session lifecycle reference

| Command | What it does | Confirmation required |
| ------- | ------------ | --------------------- |
| `firectl session create --instruction ""` | Start a new session | No |
| `firectl session events --wait` | Stream events until terminal or waiting state | No |
| `firectl session get` | Get current status and details | No |
| `firectl session list` | List sessions for your account | No |
| `firectl session update --instruction ""` | Send a response to a waiting session | **Yes** (confirm with the user) |
| `firectl session cancel` | Stop a running session (keeps the record) | **Yes** (confirm with the user) |
| `firectl session delete` | Remove the session record (irreversible) | **Yes** (confirm with the user) |

All commands accept `--api-key $FIREWORKS_AGENT_API_KEY` for non-interactive auth and `--scope optimize` (the default scope).

## Troubleshooting

Agent shares the on-demand pool with the Training API. If GPU capacity is tight, jobs queue. If you need guaranteed capacity, [request a reservation](https://fireworks.ai/contact).

Agent only runs methods it has curated recipes for. For anything outside that set, use [Managed Fine-Tuning](/fine-tuning/managed-finetuning-intro) or the [Training API](/fine-tuning/training-api/introduction).

You're missing the `--wait` flag. Without it, `firectl session events` prints existing events and returns. The Fireworks Agent skill always passes `--wait`, which keeps the stream open until the session reaches `waiting`, `succeeded`, `failed`, or `cancelled`. If you're driving `firectl` directly, add `-w / --wait`.

Agent's preview includes HP tuning, full training, evaluation, and the first hour of deployment. Reject the plan and ask Agent to skip HP tuning or use a smaller base model; the next preview will reflect the lower scope.

## Next steps

Open Agent in the left nav at app.fireworks.ai.
Install the skill file in Claude Code, Cursor, Codex, Aider, or Goose. Drive the same training infra directly when you know your config. Write your own Python training loop on Fireworks GPUs. **Agent crib notes** * Auth: set `FIREWORKS_AGENT_API_KEY` in a project-local `.env` (the key is from a service account with the `pilot` permission preset). Source it via `source .env && ...` and pass on every command as `--api-key $FIREWORKS_AGENT_API_KEY`. * Use the **same session ID** for follow-ups. Never create a new session to continue an existing conversation. * Always pass `--wait` to `session events`, or the command exits immediately after dumping history. * `create`, `get`, `events`, and `list` are safe to run without user confirmation. **Always confirm with the user before `update`, `cancel`, or `delete`.** * On `waiting`, surface Agent's exact question to the user verbatim; do not paraphrase. * See [Use with coding agents](/fine-tuning/agent/use-with-coding-agents) for a complete copy-paste skill for Claude Code, Cursor, Codex, Aider, and Goose. # Fireworks Agent: Supervised Fine-Tuning Source: https://docs.fireworks.ai/fine-tuning/agent/sft Run end-to-end SFT with Fireworks Agent — dataset inspection, hyperparameter sweep, evaluator-guided selection, and a deployed winner. Fireworks Agent's SFT workflow takes a dataset and (optionally) a base model, runs a hyperparameter sweep with held-out evaluation, picks the winner, retrains on the full data, and deploys the result. You approve a single plan with a cost estimate up front; Agent handles everything from there and pauses only at meaningful decision points. For the underlying SFT mechanics (job parameters, supported base models, dataset format), see [Managed Fine-Tuning → Supervised Fine-Tuning](/fine-tuning/fine-tuning-models). This page documents the Fireworks Agent workflow built on top of it. ## What you give Agent Agent needs enough to build an executable plan. 
The required inputs: * **Dataset ID** — an existing Fireworks dataset in `READY` state, in OpenAI-compatible chat format. Optionally a separate evaluation dataset. * **Base model(s)** — one or more base models. If you omit this, Agent will ask you to choose from the supported list. * **Evaluation approach** — one of three strategies (see below). Default is validation loss only. Everything else (epochs, LoRA rank, learning rate, batching) is resolved by Agent from defaults or your explicit overrides. ## Example session instruction ```bash theme={null} source .env && firectl session create \ --api-key $FIREWORKS_AGENT_API_KEY \ --instruction "Run supervised fine-tuning on accounts/myacct/datasets/customer-support-conv. Use Qwen3 32B as the base model. Use validation loss for evaluation." ``` For explicit candidates instead of the default tuning grid: ```bash theme={null} source .env && firectl session create \ --api-key $FIREWORKS_AGENT_API_KEY \ --instruction "Run SFT on accounts/myacct/datasets/mydata across qwen3-8b and qwen3-32b with learning rates 1e-4 and 5e-5, LoRA ranks 16 and 32, and 3 epochs." ``` **Where SFT lives in the 7-phase pipeline:** Phase 1 is dataset inspection, phase 2 is plan + cost approval, **phase 3 is the candidate sweep** described below, phase 4 is the full-data final run, **phase 5 is held-out evaluation** (using the strategy you picked in phase 2), phase 6 is deployment, phase 7 is the final report. See [How Agent runs a training job](/fine-tuning/agent/introduction#how-agent-runs-a-training-job). ## Workflow stages Agent stages your dataset locally exactly once per session (`firectl dataset download ...`), inspects format and sample structure, estimates token counts for cost, and decides whether any conversion is needed (for example, mapping `ground_truth` fields onto an assistant message or rewriting `tool` roles). Agent picks an evaluation strategy (see [Evaluation paths](#evaluation-paths) below) and resolves your candidate grid. 
The default tuning grid is three HP configurations with the LoRA rank and learning rate shown below; epochs default to `min(5, ceil(2500 / total_samples))` unless you override them. | HP config | LoRA rank | Learning rate | | --------- | --------- | ------------- | | 1 | 8 | 1.5e-4 | | 2 | 16 | 1.0e-4 | | 3 | 32 | 5.0e-5 | For HP tuning on datasets larger than 1,000 samples, Agent subsamples to 1,000 (seed `42`) to keep candidate-search costs bounded. Agent writes a plan to the session workspace and presents it to you with a cost breakdown (Training + Inference + Total). A single approval covers both the plan and the estimate. Reply with `Approved, proceed.` or ask for revisions and Agent will re-cost and re-present. Agent launches the candidate training runs, capped at **6 active jobs at a time** by default. Each candidate trains on the (sub-sampled) train split and is evaluated against the held-out test split using the evaluation strategy you chose. Before the full-data final run, Agent pauses at a promotion gate. It surfaces the candidate scoreboard (validation loss and any evaluator metrics) and asks you to confirm the winner. Reply with `Proceed with the winning config.` Agent trains the winning configuration on the full dataset (epochs default to `min(5, ceil(2500 / total_samples))` for the final run). Agent then evaluates the final model directly and writes `final_report.md`. Agent deploys the final model and reports the deployed model ID, deployment ID, inference endpoint, and a copy-paste `fireworks-ai` SDK snippet you can use immediately. ## Evaluation paths Agent supports three evaluation strategies. You can specify one in your instruction, or Agent will ask which to use in plain English (it does **not** say "Path A" / "Path B" / "Path C" to you — the labels below are docs shorthand for the three options). ### Path A — validation loss only The default. 
Agent creates a held-out test split, trains each candidate, and picks the winner purely on validation loss. No task-level evaluator is run. Choose this when: * You don't have an evaluator script for the task * The dataset is small or evaluator design is not yet settled * You want the fastest, lowest-cost sweep Trigger phrase: *"Use validation loss for evaluation."* or simply *"validation loss is fine"* if Agent asks. ### Path B — bring your own evaluator You provide a Python evaluator (uploaded to Fireworks, or generated in the same session via [evaluator authoring](/fine-tuning/agent/evaluators)). Agent runs the evaluator on each candidate's outputs and on the final model. Trigger phrase: *"Use evaluator accounts/myacct/evaluators/my-eval."* or *"Use my own evaluator"* if Agent asks. ### Path C — Agent-generated evaluator Agent inspects your data and writes a Python evaluator for structured or objectively checkable outputs (for example: numeric answers, JSON schemas, exact-match labels). It then uses that evaluator to score candidates and the final model. Trigger phrase: *"Generate an evaluator for me."* or *"agent-generated evaluator"* if Agent asks. ## Output When the session reports `succeeded`, Agent's final message includes: * The deployed **model ID** and **deployment ID** * The inference endpoint and a ready-to-run `fireworks-ai` SDK snippet * Final training loss and evaluation loss (or evaluator score) for the winning model * Provenance for any rollout/evaluation evidence carried forward from candidate search * A link to `final_report.md` in the session workspace with the full plan, costs (estimated vs actual), and per-candidate metrics ## Supported base models Agent's SFT workflow supports the same base models as Managed Fine-Tuning. For the canonical list and maximum context lengths, see [Managed Fine-Tuning Overview → Supported base models](/fine-tuning/managed-finetuning-intro#supported-base-models). 
You can ask Agent for the current list inside any session: *"Which base models do you support for SFT?"* ## Customizing the run Things you can put in your instruction: * **Candidate grid:** *"Use LoRA ranks 8, 16, 32 with learning rates 1e-4 and 5e-5."* * **Fixed epochs:** *"Train each candidate for 3 epochs."* * **Subsampling override:** *"Use 500 samples for HP tuning."* * **Batch limit:** *"Run up to 10 training jobs in parallel."* * **Skip final retrain:** *"Skip the full-data final run."* (Agent will deploy the winning candidate directly.) * **Eval set:** *"Use accounts/myacct/datasets/holdout as the eval dataset."* (Agent sets `evaluationDataset` and disables eval carveout.) If anything in your instruction conflicts with Agent's defaults, your instruction wins. **Agent crib notes** * Required inputs for an SFT session: dataset ID. Optional: base model, evaluation strategy, candidate grid, epochs. * Default tuning grid is 3 LoRA configs × selected base models. Default epochs = `min(5, ceil(2500 / total_samples))`. * Agent will pause twice for user input: plan + cost approval, and the promotion gate. Common responses: `Approved, proceed.` and `Proceed with the winning config.` * For validation-loss-only evaluation, answer `validation loss is fine` when Agent asks how to evaluate. * All training jobs and the deployed model are created in **your** account using the `FIREWORKS_AGENT_API_KEY` you authenticated with. Agent never destroys or modifies your existing resources. # Use Fireworks Agent with Claude Code, Cursor, Codex, and other coding agents Source: https://docs.fireworks.ai/fine-tuning/agent/use-with-coding-agents Install the Fireworks Agent skill file once and drive end-to-end fine-tuning from your coding agent. 
Fireworks Agent is a great fit for coding-assistant workflows: long-running training jobs that benefit from a conversational driver, multi-turn approvals that benefit from a human-in-the-loop, and natural-language instructions that benefit from a model that already knows your project context. This page gives you everything you need to plug Agent into Claude Code, Cursor, Codex, Aider, Goose, or any other coding agent — a single canonical skill file you can `curl`, the right install path for each runtime, and the agent-side patterns that handle Agent's plan-and-cost approval and waiting-state Q\&A loops correctly. ## TL;DR for agents ```text theme={null} Fireworks Agent (firectl session) is a hosted Fireworks fine-tuning agent. Use it whenever the user asks to fine-tune, train, or improve a model on Fireworks (SFT, DPO/ORPO, or classification workflows). Authenticate with FIREWORKS_AGENT_API_KEY in a project-local .env, sourced via `source .env`. The key is a Fireworks API key from a service account with the `pilot` permission preset (internal name kept for historical reasons — it's the manifest that scopes Fireworks Agent's capabilities). Lifecycle: - firectl session create --instruction "" - firectl session events --wait # stream until terminal or waiting - firectl session update --instruction "" # respond to a waiting state - firectl session get # status - firectl session list # browse sessions Safety: - Always confirm with the user before `update`, `cancel`, or `delete`. - `create`, `get`, `events`, `list` are safe to run autonomously. - Never create a new session for a follow-up — always reuse the same session id. - Always pass `--wait` to `events`; without it the command exits immediately. ``` ## Prerequisites See the [firectl reference](/tools-sdks/firectl/firectl) for installation. 
On Linux: ```bash theme={null} curl -sL -o /tmp/firectl.gz https://storage.googleapis.com/fireworks-public/firectl/stable/linux-amd64.gz gunzip -f /tmp/firectl.gz sudo install -m 0755 /tmp/firectl /usr/local/bin/firectl ``` Create a service account scoped to Agent's capabilities. The CLI preset is named `pilot` for historical reasons — it's the [Agent capability manifest](/fine-tuning/agent/introduction#security-service-accounts-and-the-agent-manifest): ```bash theme={null} firectl -a user create \ --service-account \ --user-id=fireworks-agent \ --permission-preset=pilot firectl -a api-key create --service-account=fireworks-agent ``` Drop the returned key into your project root: ```bash .env theme={null} FIREWORKS_AGENT_API_KEY=fw-... ``` The skill sources `.env` automatically. See [Service Accounts](/accounts/service-accounts) for the full setup. ## Install the skill file The Fireworks Agent skill is a single Markdown document that teaches your coding agent how to drive Agent: when to invoke it, how to authenticate, how to handle waiting states and approval gates, which `firectl session` flags are confirmed working, and the common questions Agent asks mid-session. It auto-attaches based on the `description` frontmatter — no slash commands required. Canonical source in the public [`fw-ai/cookbook`](https://github.com/fw-ai/cookbook) repo. `curl` the raw URL into your coding agent (see below); re-fetch at the start of a fine-tuning session to pick up the latest confirmed flags. 
```bash Claude Code theme={null} # Project-scoped (recommended) mkdir -p .claude/skills/fireworks-agent curl -L https://raw.githubusercontent.com/fw-ai/cookbook/main/skills/fireworks-agent/SKILL.md \ -o .claude/skills/fireworks-agent/SKILL.md # Or user-scoped (available in every project) mkdir -p ~/.claude/skills/fireworks-agent curl -L https://raw.githubusercontent.com/fw-ai/cookbook/main/skills/fireworks-agent/SKILL.md \ -o ~/.claude/skills/fireworks-agent/SKILL.md ``` ```bash Cursor theme={null} mkdir -p .cursor/skills/fireworks-agent curl -L https://raw.githubusercontent.com/fw-ai/cookbook/main/skills/fireworks-agent/SKILL.md \ -o .cursor/skills/fireworks-agent/SKILL.md ``` ```bash Codex / Aider / Goose theme={null} # These runtimes read AGENTS.md as ambient context at session start. curl -L https://raw.githubusercontent.com/fw-ai/cookbook/main/skills/fireworks-agent/SKILL.md >> AGENTS.md ``` Once the skill is installed, prompts like *"Fine-tune Qwen3 32B on my customer-support dataset"* will trigger Fireworks Agent automatically. The coding agent will create a session, stream events, surface Agent's questions to you, and wait for your approval before sending responses back. ## How the agent should drive Fireworks Agent Every Agent workflow pauses at least once (plan + cost approval) and may pause again at the promotion gate or whenever the planner needs missing information. Your coding agent must handle the loop correctly. ```mermaid theme={null} flowchart TD Create["firectl session create"] --> Stream["firectl session events --wait"] Stream -->|"status: waiting"| Capture["Capture LAST_TS"] Capture --> Extract["Extract status_info question"] Extract --> Ask["Surface question to user verbatim"] Ask --> Confirm["Get user response
+ confirmation"]
  Confirm --> Update["firectl session update --instruction '...'"]
  Update --> Resume["Resume events, filter older than LAST_TS"]
  Resume --> Stream
  Stream -->|"status: succeeded"| Report["Surface deployed model + final report"]
  Stream -->|"status: failed"| Triage["Surface error, ask user how to proceed"]
```

Key invariants:

1. **`--wait` is required.** `session events` without `--wait` exits immediately after dumping history.
2. **Use the same session ID for follow-ups.** Never create a new session to continue a conversation. The runner reads state from the previous session's workspace.
3. **Pause for confirmation on `update`, `cancel`, `delete`.** The remaining commands (`create`, `get`, `events`, `list`) do not modify or remove an existing session and are safe to run autonomously.
4. **Surface Agent's questions verbatim.** Agent's exact question contains the information the user needs to answer correctly. Don't paraphrase.
5. **Filter history on resume.** After a `session update`, the next `events --wait` re-dumps history. Filter on the timestamp captured before the update so the user only sees new traces.

### Fallback when the stream drops

If the events stream drops unexpectedly (network error, client timeout), fall back to polling `session get` until the status is terminal or waiting:

```bash theme={null}
# Poll every 10 seconds until the status output mentions a waiting or terminal state
source .env && until firectl session get --api-key "$FIREWORKS_AGENT_API_KEY" 2>/dev/null \
  | grep -E "waiting|succeeded|failed|cancelled"; do sleep 10; done \
  && firectl session get --api-key "$FIREWORKS_AGENT_API_KEY"
```

Then resume `events --wait` from the captured timestamp once you know the session is alive.
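Invariant 5 (filter history on resume) can be sketched in a few lines of Python. This is a minimal sketch, not Agent's actual event schema: the `timestamp` and `message` keys, and the assumption that timestamps are ISO-8601 strings, are hypothetical and should be adapted to whatever `firectl session events` prints in your environment.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, treating a trailing 'Z' as UTC."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    # Assume UTC when the timestamp carries no offset.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def events_after(events: list[dict], last_ts: str) -> list[dict]:
    """Keep only events strictly newer than the captured LAST_TS.

    Each event is a dict with a hypothetical "timestamp" key.
    """
    cutoff = parse_ts(last_ts)
    return [event for event in events if parse_ts(event["timestamp"]) > cutoff]
```

Capture `LAST_TS` from the last event shown to the user before sending `session update`, then pass the re-dumped history through `events_after` so only new traces are surfaced.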
### Common waiting-state prompts and good responses | Agent asks about | Reasonable default response | | -------------------------------- | -------------------------------------------------------------------------------------------- | | Which evaluation strategy to use | `"validation loss is fine"` (no task-level evaluator) | | Plan and cost approval | `"Approved, proceed."` | | Promotion gate / winning config | `"Proceed with the winning config."` | | Missing base model | `"Use Qwen3 32B."` (or whatever model the user picked) | | Missing dataset | (Agent won't reach `waiting` without a dataset — surface this back to the user and ask them) | | Plan revisions | Forward the user's revision request verbatim | ## Pitfalls * **Forgetting `--wait`.** The most common failure mode. Always pass it on `events`. * **Creating a new session for a follow-up.** Agent loses all prior context. Use `session update` on the existing ID. * **Running `update` / `cancel` / `delete` autonomously.** These are user-confirmation gates. Always ask first. * **Treating Agent's safety refusals as failures.** Agent won't delete, cancel, or destroy your existing resources. If your instruction contains a destructive intent, rephrase it as a non-destructive action (list, inspect, create, monitor). * **Streaming through a TTY-truncating wrapper.** Piping `firectl session events` through `tail` or `head` can hide the `[done] session status:` footer and break the loop. Stream directly. 
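The confirmation rule from the pitfalls above can be enforced mechanically in a wrapper around `firectl session`. The sketch below is illustrative only; `needs_confirmation` is a hypothetical helper name, and the subcommand list is taken from the safety rules on this page. Unknown subcommands default to asking first.

```python
# Subcommands this page describes as safe to run without asking the user.
SAFE_SUBCOMMANDS = {"create", "get", "events", "list"}


def needs_confirmation(subcommand: str) -> bool:
    """Return True when the user must approve before the command runs.

    `update`, `cancel`, and `delete` are gated; anything not explicitly
    listed as safe is treated as gated too.
    """
    return subcommand not in SAFE_SUBCOMMANDS
```

Defaulting to "ask first" means a subcommand added to the CLI later is gated until someone deliberately marks it safe.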
## Reference: session commands | Command | Description | Confirmation required | | ----------------------------------------------------------------------------------------- | -------------------------- | --------------------- | | `firectl session create --api-key $FIREWORKS_AGENT_API_KEY --instruction ""` | Start a session | No | | `firectl session events --api-key $FIREWORKS_AGENT_API_KEY --wait` | Stream events | No | | `firectl session get --api-key $FIREWORKS_AGENT_API_KEY` | Get status | No | | `firectl session list --api-key $FIREWORKS_AGENT_API_KEY` | List sessions | No | | `firectl session update --api-key $FIREWORKS_AGENT_API_KEY --instruction ""` | Reply to a waiting session | **Yes** | | `firectl session cancel --api-key $FIREWORKS_AGENT_API_KEY` | Cancel a running session | **Yes** | | `firectl session delete --api-key $FIREWORKS_AGENT_API_KEY` | Delete the session record | **Yes** | `session create` and `session update` accept the long-form `--instruction` flag (short form: `-n`). All session commands accept `--scope optimize` (the default scope). **Agent crib notes** * `curl` the canonical [`fireworks-agent SKILL.md`](https://github.com/fw-ai/cookbook/blob/main/skills/fireworks-agent/SKILL.md) from the public `fw-ai/cookbook` repo into `.claude/skills/fireworks-agent/SKILL.md`, `.cursor/skills/fireworks-agent/SKILL.md`, or append it to `AGENTS.md` for Codex/Aider/Goose. * Authenticate via `FIREWORKS_AGENT_API_KEY` in a project-local `.env`, sourced with `source .env`. The key is a Fireworks service-account API key with the `pilot` permission preset (the underlying capability manifest is kept under that internal name). * Always reuse the same session ID for follow-ups. Always pass `--wait` to `events`. Always confirm before `update / cancel / delete`. # Training Overview Source: https://docs.fireworks.ai/fine-tuning/cli-reference Launch RFT jobs using the eval-protocol CLI **Reinforcement Fine-Tuning (RFT)** is free for models under 16B parameters. 
When creating an RFT job in the UI, filter for free tuning models in the model selection area on the [fine-tuning creation page](https://app.fireworks.ai/dashboard/fine-tuning/create). If kicking off jobs from the terminal, you can find the model ID from the [Model Library](https://app.fireworks.ai/models?filter=LLM\&tunable=true). Note: SFT and DPO jobs are billed per training token for all model sizes—see the [pricing page](https://fireworks.ai/pricing) for details. The Eval Protocol CLI provides the fastest, most reproducible way to launch RFT jobs. This page covers everything you need to know about using `eval-protocol create rft`. Before launching, review [Training Prerequisites & Validation](/fine-tuning/training-prerequisites) for requirements, validation checks, and common errors. Already familiar with [firectl](/fine-tuning/cli-reference#using-firectl-cli-alternative)? Use it as an alternative to eval-protocol. ## Installation and setup The following guide will help you: * Upload your evaluator to Fireworks. If you don't have one yet, see [Concepts > Evaluators](/fine-tuning/evaluators) * Upload your dataset to Fireworks * Create and launch the RFT job ```bash theme={null} pip install eval-protocol ``` Verify installation: ```bash theme={null} eval-protocol --version ``` Configure your Fireworks API key: ```bash theme={null} export FIREWORKS_API_KEY="fw_your_api_key_here" ``` Or create a `.env` file: ```bash theme={null} FIREWORKS_API_KEY=fw_your_api_key_here ``` Before training, verify your evaluator works. This command discovers and runs your `@evaluation_test` with pytest. If a Dockerfile is present, it builds an image and runs the test in Docker; otherwise it runs on your host. 
```bash theme={null} cd evaluator_directory ep local-test ``` If using a Dockerfile, it must use a Debian-based image (no Alpine or CentOS), be single-stage (no multi-stage builds), and only use supported instructions: `FROM`, `RUN`, `COPY`, `ADD`, `WORKDIR`, `USER`, `ENV`, `CMD`, `ENTRYPOINT`, `ARG`. Instructions like `EXPOSE` and `VOLUME` are ignored. See the [RFT quickstart guide](/fine-tuning/quickstart-svg-agent) for details. From the directory where your evaluator and dataset (dataset.jsonl) are located, ```bash theme={null} eval-protocol create rft \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --output-model my-model-name ``` The CLI will: * Upload evaluator code (if changed) * Upload dataset (if changed) * Create the RFT job * Display dashboard links for monitoring Expected output: ``` Created Reinforcement Fine-tuning Job name: accounts/your-account/reinforcementFineTuningJobs/abc123 Dashboard Links: Evaluator: https://app.fireworks.ai/dashboard/evaluators/your-evaluator Dataset: https://app.fireworks.ai/dashboard/datasets/your-dataset RFT Job: https://app.fireworks.ai/dashboard/fine-tuning/reinforcement/abc123 ``` Click the RFT Job link to watch training progress in real-time. See [Monitor Training](/fine-tuning/monitor-training) for details. ## Common CLI options Customize your RFT job with these flags: **Model and output**: ```bash theme={null} --base-model accounts/fireworks/models/llama-v3p1-8b-instruct # Base model to fine-tune --output-model my-custom-name # Name for fine-tuned model ``` **Training parameters**: ```bash theme={null} --epochs 2 # Number of training epochs (default: 1) --learning-rate 5e-5 # Learning rate (default: 1e-4) --lora-rank 16 # LoRA rank (default: 8) --batch-size 65536 # Batch size in tokens (default: 32768) --chunk-size 200 # Prompts rolled out per GRPO training step (default: 200). -1 disables chunking. 
``` **Loss method**: ```bash theme={null} --rl-loss-method dapo # RL loss method: grpo (default), dapo, gspo-token --rl-kl-beta 0.001 # KL beta override (only for grpo; rejected for dapo/gspo-token) ``` **Rollout (sampling) parameters**: ```bash theme={null} --temperature 0.8 # Sampling temperature (default: 0.7) --n 8 # Number of rollouts per prompt (default: 4) --response-candidates-count 8 # Alias for --n in firectl (default: 8, minimum: 2) --max-tokens 4096 # Max tokens per response (default: 32768) --top-p 0.95 # Top-p sampling (default: 1.0) --top-k 50 # Top-k sampling (default: 40) --max-concurrent-rollouts 64 # Max in-flight rollouts per job (default: 96, or the value set in @evaluation_test). Throughput only; no training effect. ``` **Remote environments**: ```bash theme={null} --remote-server-url https://your-evaluator.example.com # For remote rollout processing ``` **Force re-upload**: ```bash theme={null} --force # Re-upload evaluator even if unchanged ``` See all options: ```bash theme={null} eval-protocol create rft --help ``` ## Advanced options Track training metrics in W\&B for deeper analysis: ```bash theme={null} eval-protocol create rft \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --wandb-project my-rft-experiments \ --wandb-entity my-org ``` Set `WANDB_API_KEY` in your environment first. Save intermediate checkpoints during training: ```bash theme={null} firectl rftj create \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --checkpoint-frequency 500 # Save every 500 steps ... ``` Available in `firectl` only. For evaluators that need more time: ```bash theme={null} firectl rftj create \ --rollout-timeout 300 # 5 minutes per rollout ... ``` Default is 60 seconds. Increase for complex evaluations. For other tuning parameters — rollout concurrency, chunk size, loss method, and more — see [Parameter Tuning](/fine-tuning/parameter-tuning). 
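When jobs are launched from a script rather than typed by hand, the flags listed under Common CLI options can be assembled programmatically. The following is a rough sketch: `build_rft_command` is a hypothetical helper, it assumes every option takes a single value, and it only converts keyword names to the flag spellings shown on this page without validating them against the CLI.

```python
import shlex


def build_rft_command(base_model: str, output_model: str, **options) -> list[str]:
    """Assemble an `eval-protocol create rft` argument list from keyword options."""
    argv = [
        "eval-protocol", "create", "rft",
        "--base-model", base_model,
        "--output-model", output_model,
    ]
    for name, value in options.items():
        # epochs -> --epochs, learning_rate -> --learning-rate
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


command = build_rft_command(
    "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "my-model-name",
    epochs=2,
    learning_rate="5e-5",  # passed as a string to keep the exact spelling
    lora_rank=16,
)
print(shlex.join(command))
```

Building a list instead of a shell string avoids quoting problems if the command is later run through `subprocess.run`.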
## Examples **Fast experimentation** (small model, 1 epoch): ```bash theme={null} eval-protocol create rft \ --base-model accounts/fireworks/models/qwen3-0p6b \ --output-model quick-test ``` **High-quality training** (more rollouts, higher temperature): ```bash theme={null} eval-protocol create rft \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --output-model high-quality-model \ --n 8 \ --temperature 1.0 ``` **Remote environment** (for multi-turn agents): ```bash theme={null} eval-protocol create rft \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --remote-server-url https://your-agent.example.com \ --output-model remote-agent ``` **Multiple epochs with custom learning rate**: ```bash theme={null} eval-protocol create rft \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --epochs 3 \ --learning-rate 5e-5 \ --output-model multi-epoch-model ``` ## Using `firectl` CLI (Alternative) For users already familiar with Fireworks `firectl`, you can create RFT jobs directly: ```bash theme={null} firectl rftj create \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --dataset accounts/your-account/datasets/my-dataset \ --evaluator accounts/your-account/evaluators/my-evaluator \ --output-model my-finetuned-model ``` **Differences from `eval-protocol`**: * Requires fully qualified resource names (accounts/...) * Must manually upload evaluators and datasets first * More verbose but offers finer control * Same underlying API as `eval-protocol` See [firectl documentation](/tools-sdks/firectl/commands/reinforcement-fine-tuning-job-create) for all options. 
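Because `firectl rftj create` expects fully qualified resource names, a small helper can keep scripts from mixing short IDs and full names. This is a sketch under the assumption that names follow the `accounts/<account>/<collection>/<id>` pattern shown in the example above; `resource_name` is a hypothetical helper, not part of any Fireworks SDK.

```python
def resource_name(account_id: str, collection: str, resource_id: str) -> str:
    """Return a fully qualified name such as accounts/<account>/datasets/<id>.

    IDs that are already fully qualified are returned unchanged.
    """
    if resource_id.startswith("accounts/"):
        return resource_id
    return f"accounts/{account_id}/{collection}/{resource_id}"


dataset = resource_name("your-account", "datasets", "my-dataset")
evaluator = resource_name("your-account", "evaluators", "my-evaluator")
```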
## Next steps

* Review requirements, validation, and common errors
* Track job progress, inspect rollouts, and debug issues
* Learn how to adjust parameters for better results

# Remote Environment Setup

Source: https://docs.fireworks.ai/fine-tuning/connect-environments

Implement the /init endpoint to run evaluations in your infrastructure

If you already have an agent running in your product, or need to run rollouts on your own infrastructure, you can integrate it with RFT using the `RemoteRolloutProcessor`. This delegates rollout execution to an HTTP service you control.

Remote agents are ideal for:

* Multi-turn agentic workflows with tool use
* Access to private databases, APIs, or internal services
* Integration with existing agent codebases
* Complex simulations that require your infrastructure

New to RFT? Start with [local agents](/fine-tuning/quickstart-math) instead. They're simpler and cover most use cases. Only use remote agent environments when you need access to private infrastructure or have an existing agent to integrate.

## How remote rollouts work

*Figure: Remote rollout processor flow diagram showing the interaction between Eval Protocol, your remote server, and Fireworks Tracing.*

1. During training, Fireworks calls your service's `POST /init` endpoint with the dataset row and correlation metadata.
2. Your agent executes the task (e.g., multi-turn conversation, tool calls, simulation steps), logging progress via Fireworks tracing.
3. Your service sends structured logs tagged with rollout metadata to Fireworks so the system can track completion.
4. Once Fireworks detects completion, it pulls the full trace and evaluates it using your scoring logic.

Everything except implementing your remote server is handled automatically by Eval Protocol. You only need to implement the `/init` endpoint and add Fireworks tracing.

## Implementing the /init endpoint

Your remote service must implement a single `/init` endpoint that accepts rollout requests.
### Request schema

* `completion_params`: Model configuration including model name and inference parameters like temperature, max\_tokens, etc.
* `messages`: Array of conversation messages to send to the model
* `tools`: Array of available tools for the model (for function calling)
* `model_base_url`: Base URL for making LLM calls through Fireworks tracing (includes correlation metadata)
* `metadata`: Rollout execution metadata for correlation (rollout\_id, run\_id, row\_id, etc.)
* `api_key`: Fireworks API key to use for model calls

### Example request

```json theme={null}
{
  "completion_params": {
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "temperature": 0.7,
    "max_tokens": 2048
  },
  "messages": [
    { "role": "user", "content": "What is the weather in San Francisco?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": { "type": "string" }
          }
        }
      }
    }
  ],
  "model_base_url": "https://tracing.fireworks.ai/rollout_id/brave-night-42/invocation_id/wise-ocean-15/experiment_id/calm-forest-28/run_id/quick-river-07/row_id/bright-star-91",
  "metadata": {
    "invocation_id": "wise-ocean-15",
    "experiment_id": "calm-forest-28",
    "rollout_id": "brave-night-42",
    "run_id": "quick-river-07",
    "row_id": "bright-star-91"
  },
  "api_key": "fw_your_api_key"
}
```

## Metadata correlation

The `metadata` object contains correlation IDs that you must include when logging to Fireworks tracing. This allows Eval Protocol to match logs and traces back to specific evaluation rows.

Required metadata fields:

* `invocation_id` - Identifies the evaluation invocation
* `experiment_id` - Groups related experiments
* `rollout_id` - Unique ID for this specific rollout (most important)
* `run_id` - Identifies the evaluation run
* `row_id` - Links to the dataset row

`RemoteRolloutProcessor` automatically generates these IDs and sends them to your server. You don't need to create them yourself; just pass them through to your logging.
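A server can sanity-check the correlation IDs before starting work. The sketch below assumes the `model_base_url` path is laid out as alternating `/key/value` segments, which is inferred from the single example request above and may not be a stable contract; `ids_from_base_url` and `check_metadata` are hypothetical helper names.

```python
from urllib.parse import urlparse

# The five correlation IDs listed as required above.
REQUIRED_IDS = ("invocation_id", "experiment_id", "rollout_id", "run_id", "row_id")


def ids_from_base_url(model_base_url: str) -> dict[str, str]:
    """Read key/value pairs from a path shaped like /key1/value1/key2/value2."""
    parts = [part for part in urlparse(model_base_url).path.split("/") if part]
    return dict(zip(parts[0::2], parts[1::2]))


def check_metadata(metadata: dict[str, str], model_base_url: str) -> list[str]:
    """Return the IDs that are missing from metadata or disagree with the URL."""
    from_url = ids_from_base_url(model_base_url)
    problems = []
    for key in REQUIRED_IDS:
        value = metadata.get(key)
        # An ID absent from the URL is not treated as a mismatch.
        if not value or from_url.get(key, value) != value:
            problems.append(key)
    return problems
```

An empty list means all five IDs are present and agree with the URL; logging the returned names on failure makes correlation problems easier to trace.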
## Fireworks tracing integration Your remote server must use Fireworks tracing to report rollout status. Eval Protocol polls these logs to detect when rollouts complete. ### Basic setup ```python theme={null} import logging from eval_protocol import Status, InitRequest, FireworksTracingHttpHandler, RolloutIdFilter # Configure Fireworks tracing handler globally fireworks_handler = FireworksTracingHttpHandler() logging.getLogger().addHandler(fireworks_handler) @app.post("/init") def init(request: InitRequest): # Create rollout-specific logger with filter rollout_logger = logging.getLogger(f"eval_server.{request.metadata.rollout_id}") rollout_logger.addFilter(RolloutIdFilter(request.metadata.rollout_id)) try: # Execute your agent logic here result = execute_agent(request) # Log successful completion with structured status rollout_logger.info( f"Rollout {request.metadata.rollout_id} completed", extra={"status": Status.rollout_finished()} ) return {"status": "success"} except Exception as e: # Log errors with structured status rollout_logger.error( f"Rollout {request.metadata.rollout_id} failed: {e}", extra={"status": Status.rollout_error(str(e))} ) raise ``` ### Key components 1. **FireworksTracingHttpHandler**: Sends logs to Fireworks tracing service 2. **RolloutIdFilter**: Tags logs with the rollout ID for correlation 3. **Status objects**: Structured status reporting that Eval Protocol can parse * `Status.rollout_finished()` - Signals successful completion * `Status.rollout_error(message)` - Signals failure with error details ### Alternative: Environment variable approach For simpler setups, you can use the `EP_ROLLOUT_ID` environment variable instead of manual filters. 
If your server processes one rollout at a time (e.g., serverless functions, container per request):

```python theme={null}
import os
import logging

from fastapi import FastAPI
from eval_protocol import Status, InitRequest, FireworksTracingHttpHandler

app = FastAPI()

# Configure handler (automatically picks up EP_ROLLOUT_ID)
fireworks_handler = FireworksTracingHttpHandler()
logging.getLogger().addHandler(fireworks_handler)

logger = logging.getLogger(__name__)

@app.post("/init")
def init(request: InitRequest):
    # Set rollout ID in environment; `request` only exists inside the handler,
    # so this cannot run at module import time
    os.environ["EP_ROLLOUT_ID"] = request.metadata.rollout_id

    # Logs are automatically tagged with rollout_id
    logger.info("Processing rollout...")
    # ... execute agent logic ...
```

If your `/init` handler spawns separate Python processes for each rollout:

```python theme={null}
import os
import logging
import multiprocessing

from fastapi import FastAPI
from eval_protocol import FireworksTracingHttpHandler, InitRequest

app = FastAPI()

def execute_rollout_step_sync(request):
    # Set EP_ROLLOUT_ID in the child process
    os.environ["EP_ROLLOUT_ID"] = request.metadata.rollout_id
    logging.getLogger().addHandler(FireworksTracingHttpHandler())

    # Execute your rollout logic here
    # Logs are automatically tagged

@app.post("/init")
async def init(request: InitRequest):
    # Do NOT set EP_ROLLOUT_ID in parent process
    p = multiprocessing.Process(
        target=execute_rollout_step_sync,
        args=(request,)
    )
    p.start()
    return {"status": "started"}
```

### How Eval Protocol uses tracing

1. **Your server logs completion**: Uses `Status.rollout_finished()` or `Status.rollout_error()`
2. **Eval Protocol polls**: Searches Fireworks logs by `rollout_id` tag until completion signal found
3. **Status extraction**: Reads structured status fields (`code`, `message`, `details`) to determine outcome
4.
**Trace retrieval**: Fetches full trace of model calls and tool use for evaluation ## Complete example Here's a minimal but complete remote server implementation: ```python theme={null} from fastapi import FastAPI from fastapi.responses import JSONResponse from eval_protocol import InitRequest, FireworksTracingHttpHandler, RolloutIdFilter, Status import logging app = FastAPI() # Setup Fireworks tracing fireworks_handler = FireworksTracingHttpHandler() logging.getLogger().addHandler(fireworks_handler) @app.post("/init") async def init(request: InitRequest): # Create rollout-specific logger rollout_logger = logging.getLogger(f"eval_server.{request.metadata.rollout_id}") rollout_logger.addFilter(RolloutIdFilter(request.metadata.rollout_id)) rollout_logger.info(f"Starting rollout {request.metadata.rollout_id}") try: # Your agent logic here # 1. Make model calls using request.model_base_url # 2. Call tools, interact with environment # 3. Collect results result = run_your_agent( messages=request.messages, tools=request.tools, model_config=request.completion_params, api_key=request.api_key ) # Signal completion rollout_logger.info( f"Rollout {request.metadata.rollout_id} completed successfully", extra={"status": Status.rollout_finished()} ) return {"status": "success", "result": result} except Exception as e: # Signal error rollout_logger.error( f"Rollout {request.metadata.rollout_id} failed: {str(e)}", extra={"status": Status.rollout_error(str(e))} ) return JSONResponse( status_code=500, content={"status": "error", "message": str(e)} ) def run_your_agent(messages, tools, model_config, api_key): # Implement your agent logic here # Make model calls, use tools, etc. 
pass ``` ## Testing locally Before deploying, test your remote server locally: ```bash theme={null} uvicorn main:app --reload --port 8080 ``` In your evaluator test, point to your local server: ```python theme={null} from eval_protocol.pytest import RemoteRolloutProcessor rollout_processor = RemoteRolloutProcessor( remote_base_url="http://localhost:8080" ) ``` ```bash theme={null} pytest my-evaluator-name.py -vs ``` This sends test rollouts to your local server and verifies the integration works. ## Deploying your service Once tested locally, deploy to production: * ✅ Service is publicly accessible (or accessible via VPN/private network) * ✅ HTTPS endpoint with valid SSL certificate (recommended) * ✅ Authentication/authorization configured * ✅ Monitoring and logging set up * ✅ Auto-scaling configured for concurrent rollouts * ✅ Error handling and retry logic implemented * ✅ Service availability SLA meets training requirements **Vercel/Serverless**: * One rollout per function invocation * Use environment variable approach * Configure timeout for long-running evaluations **AWS ECS/Kubernetes**: * Handle concurrent requests with proper worker configuration * Use RolloutIdFilter approach * Set up load balancing **On-premise**: * Ensure network connectivity from Fireworks * Configure firewall rules * Set up VPN if needed for security ## Connecting to RFT Once your remote server is deployed, create an RFT job that uses it: ```bash theme={null} eval-protocol create rft \ --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \ --remote-server-url https://your-evaluator.example.com \ --dataset my-dataset ``` The RFT job will send all rollouts to your remote server for evaluation during training. 
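Before launching a job against the deployed URL, a direct request to `/init` can confirm the endpoint is reachable and accepts the documented request shape. This is a rough sketch using only the standard library: the payload mirrors the top-level fields of the example request, the `"ping"` message is a placeholder, and `build_init_payload` and `post_init` are hypothetical helper names. It checks reachability only and does not exercise tracing or scoring.

```python
import json
import urllib.request


def build_init_payload(model: str, model_base_url: str, metadata: dict, api_key: str) -> dict:
    """Build a body with the same top-level fields as the example request."""
    return {
        "completion_params": {"model": model, "temperature": 0.7, "max_tokens": 2048},
        "messages": [{"role": "user", "content": "ping"}],
        "tools": [],
        "model_base_url": model_base_url,
        "metadata": metadata,
        "api_key": api_key,
    }


def post_init(server_url: str, payload: dict, timeout: float = 30.0) -> int:
    """POST the payload to <server_url>/init and return the HTTP status code.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = urllib.request.Request(
        server_url.rstrip("/") + "/init",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, `post_init("http://localhost:8080", payload)` targets the local server started with `uvicorn` in the testing section.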
## Troubleshooting **Symptoms**: Rollouts show as timed out or never complete **Solutions**: * Check that your service is logging `Status.rollout_finished()` correctly * Verify Fireworks tracing handler is configured * Ensure rollout\_id is included in log tags * Check for exceptions being swallowed without logging **Symptoms**: Eval Protocol can't match logs to rollouts **Solutions**: * Verify you're using the exact `rollout_id` from request metadata * Check that RolloutIdFilter or EP\_ROLLOUT\_ID is set correctly * Ensure logs are being sent to Fireworks (check tracing dashboard) **Symptoms**: Training is slow, high rollout latency **Solutions**: * Scale your service to handle concurrent requests * Optimize your agent logic (caching, async operations) * Add more workers or instances * Profile your code to find bottlenecks **Symptoms**: Model calls fail, API errors **Solutions**: * Verify API key is passed correctly from request * Check that your service has network access to Fireworks * Ensure model\_base\_url is used for traced calls ## Example implementations Learn by example: Complete walkthrough using a Vercel TypeScript server for SVG generation Minimal Python implementation showing the basics ## Next steps Launch your RFT job using the CLI Track rollout progress and debug issues Full Remote Rollout Processor tutorial Design effective reward functions # Debug SFT tokenization Source: https://docs.fireworks.ai/fine-tuning/debug-sft-tokenization Download rendered token IDs and loss masks for supervised fine-tuning jobs. When supervised fine-tuning quality looks wrong, first check what the trainer actually saw. Fireworks can attach a **Render Samples** download to supervised fine-tuning job details. The file is a JSONL sample of records after Fireworks applies the model's chat template, tokenizer, and training mask. Use render samples to answer questions such as: * Did `system`, `user`, `assistant`, and tool messages render with the expected special tokens? 
* Are only the intended assistant tokens included in the loss? * Did a message-level `weight: 0` or sample-level `weight` remove the tokens you expected? * Does Fireworks' tokenizer output match the tokenizer behavior you tested locally? The render samples file is a diagnostic sample, not a full dataset export. New supervised fine-tuning jobs capture up to 20 rendered records by default. Older jobs, jobs that fail before rendering, or jobs without captured samples may not show the download. ## Download render samples Go to the [Fireworks dashboard](https://app.fireworks.ai/dashboard/fine-tuning), then open the supervised fine-tuning job you want to inspect. In the job details sidebar, look for **Render Samples**. Click **Download**. Each line in the downloaded file is one rendered training record. Render samples can contain text from your training dataset in `decoded_tokens`. Treat the downloaded file like training data and do not share it publicly. ## Understand the JSONL fields A render sample record looks like this: ```json theme={null} { "source_jsonl_row_index": 4, "source_jsonl_line_number": 5, "split_index": 0, "worker_id": 2, "renderer": "qwen3", "train_on_what": "all_assistant_messages", "token_ids": [10, 11, 12], "decoded_tokens": ["<|im_start|>", "assistant", "Hello"], "token_weights": [0.0, 0.0, 1.0], "training_target_token_ids": [11, 12], "training_loss_weights": [0.0, 1.0] } ``` | Field | Meaning | | --------------------------- | ----------------------------------------------------------------------------------------------------------- | | `source_jsonl_row_index` | Zero-based index of the source dataset row. | | `source_jsonl_line_number` | One-based source line number, useful for opening the row in an editor. | | `split_index` | Index of the rendered record produced from that source row. Most rows produce one record. | | `renderer` | Chat template renderer selected for the base model. 
| | `train_on_what` | Which message content is configured to contribute to training loss. | | `token_ids` | Full rendered token sequence before the next-token shift. | | `decoded_tokens` | One-token decode for each token ID. Tokenizers may show whitespace markers or byte fallback pieces. | | `token_weights` | Per-token training weight in rendered order. `0` means context only; a positive value contributes to loss. | | `training_target_token_ids` | Shifted next-token targets passed to the trainer. This array is usually one shorter than `token_ids`. | | `training_loss_weights` | Loss weights aligned with `training_target_token_ids`. A positive value means that target token is trained. | For quick inspection, `token_ids`, `decoded_tokens`, and `token_weights` are the easiest fields to scan. For exact trainer behavior, use `training_target_token_ids` and `training_loss_weights`; those are shifted for next-token prediction. ## Inspect a downloaded file Use this local script to print each rendered token with its training status: ```python theme={null} import json from pathlib import Path for line in Path("render_samples.jsonl").read_text().splitlines(): record = json.loads(line) print( f"\nsource line {record['source_jsonl_line_number']} " f"split {record['split_index']} " f"renderer={record['renderer']} " f"train_on={record['train_on_what']}" ) for index, (token_id, text, weight) in enumerate( zip(record["token_ids"], record["decoded_tokens"], record["token_weights"]) ): status = "TRAIN" if float(weight) > 0 else "ctx" print(f"{index:04d} {int(token_id):8d} {float(weight):g} {status:5s} {text!r}") ``` Then compare the reported `source_jsonl_line_number` with the original dataset row: ```bash theme={null} sed -n '5p' train.jsonl ``` Replace `5` with the line number from the render sample. 
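The relationship between the rendered fields and the shifted training fields can also be checked programmatically. The sketch below assumes the shift is exactly one position, as in the sample record above; the field table only says the target array is "usually" one shorter, so a `False` result is a prompt to look closer rather than proof of a problem. `check_shift` and `trained_fraction` are hypothetical helper names.

```python
def check_shift(record: dict) -> bool:
    """True when targets and loss weights equal the rendered arrays shifted by one."""
    return (
        record["training_target_token_ids"] == record["token_ids"][1:]
        and record["training_loss_weights"] == record["token_weights"][1:]
    )


def trained_fraction(record: dict) -> float:
    """Share of target tokens that carry a positive loss weight."""
    weights = record["training_loss_weights"]
    if not weights:
        return 0.0
    return sum(1 for weight in weights if float(weight) > 0) / len(weights)
```

A trained fraction near zero on rows that contain long assistant answers is a quick signal that weights or roles in the source row deserve a second look.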
## Common findings | What you see | Likely cause | What to do | | --------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | | Assistant answer tokens have `token_weights` of `0` | The assistant message has `weight: 0`, the sample has zero weight, or the job is configured to train on different content. | Check the original JSONL row and remove unintended weights. | | User or system tokens have positive `token_weights` | The row schema or training configuration is not representing roles as intended. | Verify every message has the correct `role`, and avoid putting assistant text in a `user` message. | | Expected text is missing from `decoded_tokens` | The source row may have been split, truncated, or rendered differently by the model chat template. | Check `split_index`, source line number, and the job's max context length. | | Extra special tokens appear around messages | The selected model renderer is adding chat template markers. | This is often expected. If the markers are wrong for your use case, check that the base model and dataset format match. | | Token boundaries look surprising | Many tokenizers encode whitespace, Unicode, and byte fallback pieces in non-obvious ways. | Compare with the same Hugging Face tokenizer using `skip_special_tokens=False`. | | The Render Samples row is missing | The job may predate this feature, may have failed before rendering, or may not have captured samples. | Create a new supervised fine-tuning job, or contact support with the job ID if the job should have rendered samples. 
| ## Compare with a local tokenizer If you have access to the matching Hugging Face tokenizer, compare Fireworks' rendered tokens with local tokenizer output: ```python theme={null} import json from pathlib import Path from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("", trust_remote_code=True) record = json.loads(Path("render_samples.jsonl").read_text().splitlines()[0]) print(tokenizer.decode(record["token_ids"], skip_special_tokens=False)) ``` The local decode should help explain token boundaries and special tokens. If local tokenization differs, confirm that you are using the same tokenizer family and revision as the base model selected for fine-tuning. # Deploying Fine Tuned Models Source: https://docs.fireworks.ai/fine-tuning/deploying-loras Deploy one or multiple LoRA models fine tuned on Fireworks using live merge or multi-LoRA After fine-tuning your model on Fireworks, deploy it to make it available for inference. Fireworks supports two deployment methods for LoRA fine-tuned models: **live merge** and **multi-LoRA**. Each method has different tradeoffs around performance, cost, and flexibility. Fine-tuned LoRA models, whether created on the Fireworks platform or imported, can **only** be deployed to **on-demand (dedicated) deployments**. Serverless deployment is not supported for LoRA models. You can also upload and deploy LoRA models fine-tuned outside of Fireworks. See [importing fine-tuned models](/models/uploading-custom-models#importing-fine-tuned-models) for details. ## Choosing a deployment method Fireworks offers two ways to deploy LoRA fine-tuned models. The right choice depends on how many fine-tuned variants you need to serve and your performance requirements. 
| | **Live merge** | **Multi-LoRA** | | ------------------------- | ---------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | | **How it works** | LoRA weights are merged into the base model at deployment time, creating a single merged model | Base model is deployed with addon support; LoRA adapters are loaded dynamically at request time | | **Number of LoRAs** | One per deployment | Multiple per deployment | | **Inference performance** | Matches the base model (no overhead) | Some overhead per request due to dynamic adapter application | | **Throughput** | Same as base model | Lower maximum throughput under high concurrency | | **Cost efficiency** | One deployment per fine-tune | Share a single deployment across many fine-tunes | | **Best for** | Production workloads requiring maximum performance | Experimentation, A/B testing, or serving many variants of the same base model | If you only need to serve a single fine-tuned model, **live merge is the recommended approach**. It delivers the best performance with the simplest setup. ## Live merge deployment Live merge is the simplest way to deploy a fine-tuned model. Fireworks automatically merges the LoRA weights into the base model at deployment time, producing a model that performs identically to a natively fine-tuned model with no inference overhead. ### How it works When you deploy a LoRA model directly, Fireworks: 1. Takes your LoRA adapter weights and the base model 2. Merges them into a single set of weights at deployment time 3. Serves the merged model as a standalone deployment The result is a deployment that is indistinguishable from a fully fine-tuned model in terms of latency, throughput, and memory usage. 
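Why a merged model has no adapter overhead can be seen from the standard LoRA formulation, where the adapted weight is `W + scale * B @ A`. The sketch below uses tiny hand-written matrices in plain Python; it illustrates the general technique only, and the shapes, values, and helper names are assumptions rather than a description of how Fireworks performs the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def add(a, b, scale=1.0):
    """Return a + scale * b elementwise."""
    return [
        [x + scale * y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


# Toy shapes: W is 2x2, B is 2x1, A is 1x2 (rank r = 1).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, -0.5]]
scale = 2.0  # alpha / r in the usual LoRA notation

x = [[3.0], [1.0]]  # one input column vector

# Dynamic application: keep W and the adapter separate for every request.
dynamic = add(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged: fold the adapter into W once, then run a single plain matmul.
W_merged = add(W, matmul(B, A), scale)
merged = matmul(W_merged, x)

print(dynamic, merged)
```

Both paths produce the same output, but the merged path does the extra work once at deployment time instead of on every request.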
### Deploy with live merge Deploy your LoRA fine-tuned model with a single command: ```bash theme={null} firectl deployment create "accounts//models/" ``` Your deployment will be ready to use once it completes, with performance that matches the base model. ### Sending requests Send inference requests to your live-merge deployment by referencing the deployment directly: ```python theme={null} from fireworks import Fireworks client = Fireworks() response = client.chat.completions.create( model="accounts//models/", messages=[{"role": "user", "content": "Hello!"}] ) print(response.choices[0].message.content) ``` ```bash theme={null} curl https://api.fireworks.ai/inference/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $FIREWORKS_API_KEY" \ -d '{ "model": "accounts//models/", "messages": [ { "role": "user", "content": "Hello!" } ] }' ``` ### When to use live merge * You need maximum inference performance (latency and throughput matching the base model) * You are serving a single fine-tuned model in production * You want the simplest possible deployment workflow ## Multi-LoRA deployment Multi-LoRA lets you load multiple LoRA adapters onto a single base model deployment. This is useful when you have several fine-tuned variants of the same base model and want to share GPU resources across them rather than creating a separate deployment for each. ### How it works With multi-LoRA: 1. You deploy the base model with addon support enabled 2. You load one or more LoRA adapters onto the running deployment 3. At inference time, the correct adapter is selected and applied dynamically based on the model specified in the request Because adapters are applied dynamically rather than merged, there is some performance overhead compared to live merge. This overhead increases with higher request concurrency. ### LoRA addon shape compatibility Not all deployment shapes support LoRA addons. 
**FP8 and FP4 quantized shapes do not support `--enable-addons`.** | Precision | `--enable-addons` supported? | | --------- | ---------------------------- | | BF16 | ✅ Yes | | FP8 | ❌ No | | FP4 | ❌ No | Many base models default to FP8 or FP4 shapes. If you need LoRA addon inference on one of these models, you have two options: **Option 1 — Use a BF16 deployment shape** ```bash theme={null} # List available shapes for your model firectl deployment-shape-version list --base-model accounts/fireworks/models/ # Create deployment with a BF16 shape and addons enabled firectl deployment create "accounts/fireworks/models/" \ --deployment-shape \ --enable-addons ``` **Option 2 — Merge the adapter into a standalone model** If no BF16 addon-compatible shape is available, use [live merge](#live-merge-deployment) (recommended for a single adapter) or merge the LoRA into a standalone Fireworks model, then deploy that merged model without `--enable-addons`. See [Uploading custom models](/models/uploading-custom-models#importing-fine-tuned-models) and [`firectl model create`](/tools-sdks/firectl/commands/model-create). `"addons cannot be enabled with quantized precisions (FP8/FP4)"` — your model's default shape is quantized; use Option 1 or 2 above. `"the deployment shape version does not exist or you do not have access to it"` — the shape you requested is not available on your account; contact support. ### Deploy with multi-LoRA Deploy the base model with addons enabled: ```bash theme={null} firectl deployment create "accounts/fireworks/models/" --enable-addons ``` Once the deployment is ready, load your LoRA models onto the deployment: ```bash theme={null} firectl load-lora --deployment ``` Repeat this command for each LoRA adapter you want to load. ### Sending requests To route inference requests to a specific LoRA adapter on a multi-LoRA deployment, set the `model` field to `#`. 
The `#` separator tells Fireworks to route the request to the specified adapter on the given deployment. **Deprecation notice:** The `deployedModel` request key for routing to LoRA addons is deprecated and will not be supported for any new deployments. Use the `model` field with the `#` format shown below. ```python theme={null} from fireworks import Fireworks client = Fireworks() response = client.chat.completions.create( model="accounts//models/#accounts//deployments/", messages=[{"role": "user", "content": "Hello!"}] ) print(response.choices[0].message.content) ``` ```python theme={null} import os from openai import OpenAI client = OpenAI( api_key=os.environ.get("FIREWORKS_API_KEY"), base_url="https://api.fireworks.ai/inference/v1" ) response = client.chat.completions.create( model="accounts//models/#accounts//deployments/", messages=[{"role": "user", "content": "Hello!"}] ) print(response.choices[0].message.content) ``` ```javascript theme={null} import OpenAI from "openai"; const client = new OpenAI({ apiKey: process.env.FIREWORKS_API_KEY, baseURL: "https://api.fireworks.ai/inference/v1", }); const response = await client.chat.completions.create({ model: "accounts//models/#accounts//deployments/", messages: [ { role: "user", content: "Hello!", }, ], }); console.log(response.choices[0].message.content); ``` ```bash theme={null} curl https://api.fireworks.ai/inference/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $FIREWORKS_API_KEY" \ -d '{ "model": "accounts//models/#accounts//deployments/", "messages": [ { "role": "user", "content": "Hello!" 
} ] }' ``` ### When to use multi-LoRA * You need to serve multiple fine-tuned models based on the same base model * You want to maximize GPU utilization by sharing a single deployment * You are running experiments or A/B tests across multiple fine-tuned variants * You can accept some performance overhead compared to live merge ## Performance considerations Live merge eliminates all LoRA-related inference overhead because the adapter weights are baked into the model at deployment time. The resulting deployment behaves exactly like a natively fine-tuned base model. Multi-LoRA deployments incur overhead because adapters are applied dynamically: * **Time to first token (TTFT):** Increases by roughly 10–30% due to adapter loading and prompt processing overhead * **Generation speed:** Overhead grows with higher request concurrency * **Maximum throughput:** Lower than a live-merge deployment under sustained load For a deeper dive into LoRA performance characteristics and optimization strategies, see [Understanding LoRA Performance](/guides/understanding_lora_performance). ## Next steps Learn about deployment configuration and optimization Upload LoRA models fine-tuned outside of Fireworks Understand performance tradeoffs and optimization strategies # Direct Preference Optimization Source: https://docs.fireworks.ai/fine-tuning/dpo-fine-tuning Direct Preference Optimization (DPO) fine-tunes models by training them on pairs of preferred and non-preferred responses to the same prompt. This teaches the model to generate more desirable outputs while reducing unwanted behaviors. 
**Use DPO when:**

* Aligning model outputs with brand voice, tone, or style guidelines
* Reducing hallucinations or incorrect reasoning patterns
* Improving response quality where there's no single "correct" answer
* Teaching models to follow specific formatting or structural preferences

## Fine-tuning with DPO

Datasets must adhere strictly to the JSONL format, where each line represents a complete JSON-formatted training example.

**Minimum Requirements:**

* **Minimum examples needed:** 3
* **Maximum examples:** Up to 3 million examples per dataset
* **File format:** JSONL (each line is a valid JSON object)
* **Dataset Schema:** Each training sample must include the following fields:
  * An `input` field containing a `messages` array, where each message is an object with two fields:
    * `role`: one of `system`, `user`, or `assistant`
    * `content`: a string representing the message content
  * A `preferred_output` field containing an assistant message with an ideal response
  * A `non_preferred_output` field containing an assistant message with a suboptimal response

Here’s an example conversation dataset (one training example, pretty-printed for readability; in the file it must occupy a single line):

```json einstein_dpo.jsonl theme={null}
{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "What is Einstein famous for?"
      }
    ],
    "tools": []
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "Einstein is renowned for his theory of relativity, especially the equation E=mc²."
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "He was a famous scientist."
    }
  ]
}
```

We currently support only one-turn conversations per example, where the preferred and non-preferred messages must be the last assistant message.

Save this dataset locally as a JSONL file, for example `einstein_dpo.jsonl`.

There are several ways to upload the dataset to the Fireworks platform for fine-tuning: `firectl`, the `RESTful API`, the `builder SDK`, or the `UI`.

* In the UI, navigate to the dataset tab, click `Create Dataset`, and follow the wizard.
* Upload the dataset using `firectl`:

```bash theme={null}
firectl dataset create /path/to/file.jsonl
```

* With the RESTful API, make two separate HTTP requests: one to create the dataset entry and one to upload the dataset. Full reference here: [Create dataset](/api-reference/create-dataset). Note that the `exampleCount` parameter needs to be provided by the client.

```jsx theme={null}
// Create Dataset Entry
const createDatasetPayload = {
  datasetId: "trader-poe-sample-data",
  dataset: { userUploaded: {} }
  // Additional params such as exampleCount
};
const urlCreateDataset = `${BASE_URL}/datasets`;
const response = await fetch(urlCreateDataset, {
  method: "POST",
  headers: HEADERS_WITH_CONTENT_TYPE,
  body: JSON.stringify(createDatasetPayload)
});
```

```jsx theme={null}
// Upload JSONL file
const urlUpload = `${BASE_URL}/datasets/${DATASET_ID}:upload`;
const files = new FormData();
files.append("file", localFileInput.files[0]);
const uploadResponse = await fetch(urlUpload, {
  method: "POST",
  headers: HEADERS,
  body: files
});
```

While all of the above approaches should work, the `UI` is better suited to smaller datasets (`< 500MB`), while `firectl` might work better for larger datasets.

Ensure the dataset ID conforms to the [resource id restrictions](/getting-started/concepts#resource-names-and-ids).

Create the DPO fine-tuning job:

```bash theme={null}
firectl dpoj create \
  --base-model accounts/account-id/models/base-model-id \
  --dataset accounts/my-account-id/datasets/my-dataset-id \
  --output-model new-model-id
```

For our example, we might run the following command:

```bash theme={null}
firectl dpoj create \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --dataset accounts/pyroworks/datasets/einstein-dpo \
  --output-model einstein-dpo-model
```

to fine-tune a [Llama 3.1 8b Instruct](https://fireworks.ai/models/fireworks/llama-v3p1-8b-instruct) model with our Einstein dataset.
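Before uploading a dataset and creating a job, it can help to check each line locally against the DPO schema described above. The following is a minimal sketch, not an official validator: `validate_dpo_example` is a hypothetical helper, and it assumes, as in the Einstein example, that `preferred_output` and `non_preferred_output` are arrays of assistant messages.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_dpo_example(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(example, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    input_field = example.get("input")
    messages = input_field.get("messages") if isinstance(input_field, dict) else None
    if not isinstance(messages, list) or not messages:
        problems.append("input.messages must be a non-empty array")
    else:
        for i, message in enumerate(messages):
            if not isinstance(message, dict) or message.get("role") not in ALLOWED_ROLES:
                problems.append(f"input.messages[{i}] has an invalid role")
            elif not isinstance(message.get("content"), str):
                problems.append(f"input.messages[{i}].content must be a string")

    for field in ("preferred_output", "non_preferred_output"):
        output = example.get(field)
        ok = (
            isinstance(output, list)
            and len(output) > 0
            and all(isinstance(m, dict) and m.get("role") == "assistant" for m in output)
        )
        if not ok:
            problems.append(f"{field} must contain assistant message(s)")
    return problems


good = json.dumps({
    "input": {"messages": [{"role": "user", "content": "What is Einstein famous for?"}]},
    "preferred_output": [{"role": "assistant", "content": "The theory of relativity."}],
    "non_preferred_output": [{"role": "assistant", "content": "He was a famous scientist."}],
})
bad = json.dumps({
    "input": {"messages": [{"role": "user", "content": "Hi"}]},
    "preferred_output": [{"role": "assistant", "content": "Hello!"}],
})

print(validate_dpo_example(good))
print(validate_dpo_example(bad))
```

To check a whole file, call the function on every line of the JSONL file and report the line numbers that return a non-empty list.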
```bash theme={null} firectl dpoj get dpo-job-id ``` Once the job is complete, the `STATE` will be set to `JOB_STATE_COMPLETED`, and the fine-tuned model can be deployed. Once training completes, you can create a deployment to interact with the fine-tuned model. Refer to [deploying a fine-tuned model](/fine-tuning/fine-tuning-models#deploying-a-fine-tuned-model) for more details. ## Next Steps Explore other fine-tuning methods to improve model output for different use cases. Train models on input-output examples to improve task-specific performance. Optimize models using AI feedback for complex reasoning and decision-making. Fine-tune vision-language models to understand both images and text. # Agent Tracing Source: https://docs.fireworks.ai/fine-tuning/environments Understand where your agent runs and how tracing enables reinforcement fine-tuning ## Why agent tracing is critical to doing RL Reinforcement learning for agents depends on the entire chain of actions, tool calls, state transitions, and intermediate decisions—not just the final answer. Tracing captures this full trajectory so you can compute reliable rewards, reproduce behavior, and iterate quickly. **Why it matters** * **Credit assignment**: You need a complete record of each step to attribute reward to the decisions that caused success or failure. * **Reproducibility**: Deterministic replays require the exact prompts, model parameters, tool I/O, and environment state. * **Debuggability**: You can pinpoint where an episode fails (model output, tool error, data mismatch, timeout). Use Fireworks Tracing to drive the RL loop: emit structured logs with `FireworksTracingHttpHandler`, tag them with rollout correlation metadata, and signal completion using `Status.rollout_finished()` or `Status.rollout_error()`. When you make model calls, use the `model_base_url` issued by the trainer (it points to `https://tracing.fireworks.ai`) so chat completions are recorded as traces via an OpenAI-compatible endpoint. 
## How Fireworks tracing works for RFT * **Traced completions**: The trainer provides a `model_base_url` on `https://tracing.fireworks.ai` that encodes correlation metadata. Your agent uses this OpenAI-compatible URL for LLM calls; tracing.fireworks.ai records the calls as traces automatically. * **Structured logging sink**: Your agent logs to Fireworks via `FireworksTracingHttpHandler`, including a structured `Status` when a rollout finishes or errors. * **Join traces and logs**: The trainer polls the logging sink by `rollout_id` to detect completion, then loads the full trace. Logs and traces are deterministically joined using the same correlation tags. ### Correlation metadata * **Correlate every log and trace** with these metadata fields provided in `/init`: `invocation_id`, `experiment_id`, `rollout_id`, `run_id`, `row_id`. * **Emit structured completion** from your server logs: * Add `FireworksTracingHttpHandler` and `RolloutIdFilter` to attach the `rollout_id` * Log `Status.rollout_finished()` on success, or `Status.rollout_error(message)` on failure * **Alternative**: If you run one rollout per process, set `EP_ROLLOUT_ID` in the child process instead of adding a filter. * **Record model calls as traces** by using the `model_base_url` from the trainer. It encodes the correlation IDs so your completions are automatically captured. ### tracing.fireworks.ai base URL * **Purpose-built for RL**: tracing.fireworks.ai is the Fireworks gateway used during RFT to capture traces and correlate them with rollout status. * **OpenAI-compatible**: It exposes Chat Completions-compatible endpoints, so you set it as your client's `base_url`. * **Correlation-aware**: The trainer embeds `rollout_id`, `run_id`, and related IDs into the `model_base_url` path so your completions are automatically tagged and joinable with logs. * **Drop-in usage**: Always use the `model_base_url` provided in `/init`—do not override it—so traces and logs are correctly linked. 
## End-to-end tracing setup with tracing.fireworks.ai Your server implements `/init` and receives `metadata` and `model_base_url`. Attach `RolloutIdFilter` or set `EP_ROLLOUT_ID` for the current rollout. Call the model using `model_base_url` so chat completions are persisted as traces with correlation tags. Attach `FireworksTracingHttpHandler` to your logger and log `Status.rollout_finished()` or `Status.rollout_error()` when the rollout concludes. The trainer polls Fireworks logs by `rollout_id`, then loads the full traces; logs and traces share the same tags and are joined to finalize results and compute rewards. ### Remote server minimal example ```python remote_server.py theme={null} import logging import os from eval_protocol import InitRequest, Status, FireworksTracingHttpHandler, RolloutIdFilter # Configure Fireworks logging sink once at startup logging.getLogger().addHandler(FireworksTracingHttpHandler()) @app.post("/init") def init(request: InitRequest): # Option A: add filter that injects rollout_id on every log record logger = logging.getLogger(f"eval.{request.metadata.rollout_id}") logger.addFilter(RolloutIdFilter(request.metadata.rollout_id)) # Option B: per-process correlation (use when spawning one rollout per process) # os.environ["EP_ROLLOUT_ID"] = request.metadata.rollout_id # Make model calls via the correlated base URL so completions are traced # client = YourLLMClient(base_url=request.model_base_url, api_key=request.api_key) try: # ... execute rollout steps, tool calls, etc. ... logger.info("rollout finished", extra={"status": Status.rollout_finished()}) except Exception as e: logger.error("rollout error", extra={"status": Status.rollout_error(str(e))}) ``` Under the hood, the trainer polls the logging sink for `Status` and then loads the full trace for scoring. Because both logs and traces share the same correlation tags, Fireworks can deterministically join them to finalize results and compute rewards. 
### What to capture in a trace * **Inputs and context**: Task ID, dataset split, initial state, seeds, and any retrieval results provided to the agent. * **Model calls**: System/user messages, tool messages, model/version, parameters (e.g., temperature, top\_p, seed), token counts, and optional logprobs. * **Tool and API calls**: Request/response summaries, status codes, durations, retries, and sanitized payload snippets. * **Environment state transitions**: Key state before/after each action that affects reward or next-step choices. * **Rewards**: Per-step shaping rewards, terminal reward, and component breakdowns with weights and units. * **Errors and timeouts**: Exceptions, stack traces, and where they occurred in the trajectory. * **Artifacts**: Files, code, unit test results, or other outputs needed to verify correctness. Never record secrets or raw sensitive data in traces. Redact tokens, credentials, and PII. Store references (IDs, hashes) instead of full payloads whenever possible. ### How tracing powers the training loop 1. **Rollout begins**: Trainer creates a rollout and sends it to your environment (local or remote) with a unique identifier. 2. **Agent executes**: Your agent emits spans for model calls, tool calls, and state changes; your evaluator computes step and terminal rewards. 3. **Rewards aggregate**: The trainer consumes your rewards and updates the policy; traces are stored for replay and analysis. 4. **Analyze and iterate**: You filter traces by reward, failure type, latency, or cost to refine prompts, tools, or reward shaping. ### How RemoteRolloutProcessor uses Fireworks Tracing 1. **Remote server logs completion** with structured status: `Status.rollout_finished()` or `Status.rollout_error()`. 2. **Trainer polls Fireworks Tracing** by `rollout_id` until completion status is found. 3. **Status extracted** from structured fields (`code`, `message`, `details`) to finalize the rollout result. 
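The polling flow above can be pictured with a small sketch. Everything here is an assumption made for illustration: `fetch_logs` is a hypothetical callable standing in for whatever client queries the logging sink, the log entry shape is invented, and the `"FINISHED"` code is a placeholder rather than a documented value. Only the `rollout_id` key and the `code`, `message`, and `details` fields come from the text above.

```python
import time


def wait_for_rollout_status(fetch_logs, rollout_id, poll_interval=0.0, max_polls=10):
    """Poll until a log entry for rollout_id carries a structured status.

    fetch_logs is a hypothetical callable that returns a list of dict log
    entries for the given rollout_id.
    """
    for _ in range(max_polls):
        for entry in fetch_logs(rollout_id):
            status = entry.get("status")
            if entry.get("rollout_id") == rollout_id and status:
                return {
                    "code": status.get("code"),
                    "message": status.get("message"),
                    "details": status.get("details", []),
                }
        time.sleep(poll_interval)
    return None  # no completion status observed within the polling budget


# In-memory stand-in for the logging sink: the status appears on the third poll.
calls = {"count": 0}


def fake_fetch_logs(rollout_id):
    calls["count"] += 1
    if calls["count"] < 3:
        return [{"rollout_id": rollout_id, "message": "step"}]
    return [{
        "rollout_id": rollout_id,
        "status": {"code": "FINISHED", "message": "rollout finished", "details": []},
    }]


result = wait_for_rollout_status(fake_fetch_logs, "rollout-123")
print(result)
```

Bounding the number of polls mirrors the advice to always finalize a rollout: a rollout that never logs a status should eventually be treated as a failure rather than waited on forever.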
### Best practices * **Make it deterministic**: Record seeds, versions, and any non-deterministic knobs; prefer idempotent tool calls or cached fixtures in test runs. * **Keep signals bounded**: Normalize rewards to a consistent range (e.g., \[0, 1]) and document your components and weights. * **Summarize, don’t dump**: Log compact summaries and references for large payloads to keep traces fast and cheap. * **Emit heartbeats**: Send periodic status updates so long-running rollouts are observable; always finalize with success or failure. * **Use consistent schemas**: Keep field names and structures stable to enable dashboards, filters, and automated diagnostics. ## Next steps Implement `/init`, tracing, and structured status for remote agents Build and deploy a local evaluator in under 10 minutes Launch your RFT job Design effective reward functions for your task # Evaluators Source: https://docs.fireworks.ai/fine-tuning/evaluators Understand the fundamentals of evaluators and reward functions in reinforcement fine-tuning An evaluator (also called a reward function) is code that scores model outputs from 0.0 (worst) to 1.0 (best). During reinforcement fine-tuning, your evaluator guides the model toward better responses by providing feedback on its generated outputs. ## Why evaluators matter Unlike supervised fine-tuning where you provide perfect examples, RFT uses evaluators to define what "good" means. This is powerful because: * **No perfect data required** - Just prompts and a way to score outputs * **Encourages exploration** - Models learn strategies, not just patterns * **Noise tolerant** - Even noisy signals can improve model performance * **Encodes domain expertise** - Complex rules and logic that are hard to demonstrate with examples ## Anatomy of an evaluator Every evaluator has three core components: ### 1. 
Input data The prompt and any ground truth data needed for evaluation: ```python theme={null} { "messages": [ {"role": "system", "content": "You are a math tutor."}, {"role": "user", "content": "What is 15 * 23?"} ], "ground_truth": "345" # Optional additional data } ``` ### 2. Model output The assistant's response to evaluate: ```python theme={null} { "role": "assistant", "content": "Let me calculate that step by step:\n15 * 23 = 345" } ``` ### 3. Scoring logic Code that compares the output to your criteria: ```python theme={null} def evaluate(model_output: str, ground_truth: str) -> float: # Extract answer from model's response predicted = extract_number(model_output) # Score it if predicted == int(ground_truth): return 1.0 # Perfect else: return 0.0 # Wrong ``` ## Types of evaluators ### Rule-based evaluators Check if outputs match specific patterns or rules: * **Exact match** - Output exactly equals expected value * **Contains** - Output includes required text * **Regex** - Output matches a pattern * **Format validation** - Output follows required structure (e.g., valid JSON) Start with rule-based evaluators. They're simple, fast, and surprisingly effective. 
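The rule-based checks listed above can each be written in a few lines. The sketch below is illustrative only: the function names are hypothetical, and `extract_number` is one possible implementation of the helper used in the scoring example (here it takes the last integer in the response, which is an assumption about where the answer appears).

```python
import json
import re


def extract_number(text):
    """Return the last integer in the text, or None if there is none."""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None


def exact_match(output, expected):
    """1.0 when the output equals the expected value, ignoring outer whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def contains(output, required):
    """1.0 when the required text appears anywhere in the output."""
    return 1.0 if required in output else 0.0


def matches_regex(output, pattern):
    """1.0 when the pattern matches somewhere in the output."""
    return 1.0 if re.search(pattern, output) else 0.0


def is_valid_json(output):
    """1.0 when the output parses as JSON (format validation only)."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


def numeric_answer(output, ground_truth):
    """1.0 when the last number in the output equals the ground truth."""
    return 1.0 if extract_number(output) == int(ground_truth) else 0.0


response = "Let me calculate that step by step:\n15 * 23 = 345"
print(numeric_answer(response, "345"))
```

Each function returns a float in the 0.0 to 1.0 range, so they can be used directly as scores or combined, for example by requiring valid JSON before scoring the content.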
### Execution-based evaluators Run code or commands to verify correctness: * **Code execution** - Run generated code and check results * **Test suites** - Pass generated code through unit tests * **API calls** - Execute commands and verify outcomes * **Simulations** - Run agents in environments and measure success ### LLM-as-judge evaluators Use another model to evaluate quality: * **Rubric scoring** - Judge outputs against criteria * **Comparative ranking** - Compare multiple outputs * **Natural language assessment** - Evaluate subjective qualities like helpfulness ## Scoring guidelines Your evaluator should return a score between 0.0 and 1.0: | Score range | Meaning | Example | | ----------- | ------- | --------------------------- | | 1.0 | Perfect | Exact correct answer | | 0.7-0.9 | Good | Right approach, minor error | | 0.4-0.6 | Partial | Some correct elements | | 0.1-0.3 | Poor | Wrong but attempted | | 0.0 | Failure | Completely wrong | Binary scoring (0.0 or 1.0) works well for many tasks. Use gradual scoring when you can meaningfully distinguish between partial successes. ## Best practices Begin with basic evaluation logic and refine over time: ```python theme={null} # Start here score = 1.0 if predicted == expected else 0.0 # Then refine if needed score = calculate_similarity(predicted, expected) ``` Start with the simplest scoring approach that captures your core requirements. You can always add sophistication later based on training results. Training generates many outputs to evaluate, so performance matters: * **Cache expensive computations**: Store results of repeated calculations * **Use timeouts for code execution**: Prevent hanging on infinite loops * **Batch API calls when possible**: Reduce network overhead * **Profile slow evaluators and optimize**: Identify and fix bottlenecks Aim for evaluations that complete in seconds, not minutes. Slow evaluators directly increase training time and cost. 
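Two of the performance tips above, caching and timeouts, can be combined in an execution-based evaluator. This is a minimal sketch under stated assumptions: the generated code is Python, success is judged by comparing stdout, and `run_with_timeout` and `score_code` are hypothetical helper names. Running code in a subprocess is not a sandbox; untrusted model output should be executed in an isolated environment.

```python
import functools
import subprocess
import sys


@functools.lru_cache(maxsize=1024)
def run_with_timeout(code, timeout_seconds=5.0):
    """Run Python code in a subprocess; return its stdout, or None on failure.

    Results are cached, so identical generated programs are executed once.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return None  # treat hangs (for example infinite loops) as failures
    if completed.returncode != 0:
        return None  # syntax errors and uncaught exceptions also fail
    return completed.stdout.strip()


def score_code(code, expected_stdout, timeout_seconds=5.0):
    """1.0 when the program prints the expected output in time, else 0.0."""
    return 1.0 if run_with_timeout(code, timeout_seconds) == expected_stdout else 0.0


print(score_code("print(6 * 7)", "42"))
print(score_code("while True: pass", "42", timeout_seconds=1.0))  # stopped by the timeout
```

The cache key is the exact code string and timeout, so it only helps when the model produces identical programs, which is common when many rollouts share a prompt.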
Models will generate unexpected outputs, so build robust error handling: ```python theme={null} try: result = execute_code(model_output) score = check_result(result) except TimeoutError: score = 0.0 # Code ran too long except SyntaxError: score = 0.0 # Invalid code except Exception as e: score = 0.0 # Any other error ``` Anticipate and gracefully handle malformed outputs, syntax errors, timeouts, and edge cases specific to your domain. Models will exploit evaluation weaknesses, so design defensively: **Example: Length exploitation** If you score outputs by length, the model might generate verbose nonsense. Add constraints: ```python theme={null} # Bad: Model learns to write long outputs score = min(len(output) / 1000, 1.0) # Better: Require correctness AND reasonable length if is_correct(output): score = 1.0 if len(output) < 500 else 0.8 else: score = 0.0 ``` **Example: Format over substance** If you only check JSON validity, the model might return valid but wrong JSON. Check content too: ```python theme={null} # Bad: Only checks format score = 1.0 if is_valid_json(output) else 0.0 # Better: Check format AND content if is_valid_json(output): data = json.loads(output) score = evaluate_content(data) else: score = 0.0 ``` Always combine format checks with content validation to prevent models from gaming the system. ## Debugging evaluators Test your evaluator before training. Look for: * **Correct scoring** - Good outputs score high, bad outputs score low * **Reasonable runtime** - Each evaluation completes in reasonable time * **Clear feedback** - Evaluation reasons explain scores Run your evaluator on manually created good and bad examples first. If it doesn't score them correctly, fix the evaluator before training. 
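The manual check described above can be turned into a tiny harness that runs an evaluator over hand-written cases. This is a sketch only: `check_evaluator` and the toy `evaluate` function are hypothetical, and the `(model_output, ground_truth, expected_score)` case format is an assumption chosen for the example.

```python
def check_evaluator(evaluate, cases, tolerance=1e-9):
    """Run evaluate on hand-written cases and return the ones that fail.

    Each case is a tuple of (model_output, ground_truth, expected_score).
    """
    failures = []
    for model_output, ground_truth, expected in cases:
        try:
            score = evaluate(model_output, ground_truth)
        except Exception as exc:  # an evaluator should never crash mid-training
            failures.append((model_output, f"raised {type(exc).__name__}"))
            continue
        if not 0.0 <= score <= 1.0:
            failures.append((model_output, f"score {score} outside [0, 1]"))
        elif abs(score - expected) > tolerance:
            failures.append((model_output, f"got {score}, expected {expected}"))
    return failures


def evaluate(model_output, ground_truth):
    """Toy evaluator: 1.0 if the ground truth appears in the output."""
    return 1.0 if ground_truth in model_output else 0.0


cases = [
    ("15 * 23 = 345", "345", 1.0),  # good output
    ("15 * 23 = 335", "345", 0.0),  # wrong answer
    ("", "345", 0.0),               # empty output
]
print(check_evaluator(evaluate, cases))
```

An empty list means every case scored as expected. Adding the malformed outputs seen in early training runs to `cases` keeps the evaluator honest as it evolves.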
## Next steps

* Connect to your environment for single and multi-turn agents
* Follow a complete example building and using an evaluator

# Supervised Fine Tuning - Text

Source: https://docs.fireworks.ai/fine-tuning/fine-tuning-models

This guide covers using supervised fine-tuning (SFT) to fine-tune a model and deploy it to an on-demand (dedicated) deployment, which is the only supported method for serving fine-tuned models.

For the full list of base models supported by managed fine-tuning (SFT, DPO, and RFT) and their max context lengths, see [Managed Fine-Tuning Overview → Supported base models](/fine-tuning/managed-finetuning-intro#supported-base-models).

## Fine-tuning a model using SFT

You can confirm that a base model is available to fine-tune by looking for the `Tunable` tag in the model library or by running:

```bash theme={null}
firectl model get -a fireworks
```

and looking for `Tunable: true` in the output.

Some base models cannot be tuned on Fireworks (`Tunable: false`) but still list support for LoRA (`Supports Lora: true`). This means that you can tune a LoRA for this base model on a separate platform and upload it to Fireworks for inference. Consult [importing fine-tuned models](/models/uploading-custom-models#importing-fine-tuned-models) for more information.

Fireworks uses the **OpenAI-compatible chat completion format** for SFT training data. If you already have datasets formatted for OpenAI fine-tuning, they work on Fireworks with no changes needed.

Datasets must be in JSONL format, where each line represents a complete JSON-formatted training example. Make sure your data conforms to the following restrictions:

* **Minimum examples:** 3
* **Maximum examples:** 3 million per dataset
* **File format:** `.jsonl`
* **Message schema:** Each training sample must include a `messages` array, where each message is an object with the following fields:
  * `role`: one of `system`, `user`, or `assistant`. A message with the `system` role is optional, but if specified, it must be the first message of the conversation
  * `content`: the message content. This can be either a plain string **or** a list of content parts in the OpenAI chat completions style, e.g. `[{"type": "text", "text": "..."}]`. Both forms are accepted, and you can mix them freely across messages and even within the same dataset
  * `weight`: optional key whose value must be either `0` or `1`. A message with `weight` set to `0` is skipped (not trained on)
* **Sample weight:** Optional key `weight` at the root of the JSON object. It can be any floating point number (positive, negative, or 0) and is used as a loss multiplier for tokens in that sample. If used, this field must be present in all samples in the dataset.

Here is an example conversation dataset:

```json theme={null}
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."}
  ]
}
{
  "messages": [
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2", "weight": 0},
    {"role": "user", "content": "Now what is 2+2?"},
    {"role": "assistant", "content": "4"}
  ]
}
```

#### OpenAI-style structured content

In addition to plain strings, `content` may also be a list of content parts following the OpenAI chat completions format. For text fine-tuning, use `{"type": "text", "text": "..."}` parts. This is convenient if you already produce data in the OpenAI chat completions shape, or if you generate datasets with the OpenAI SDK.
The string form and the list form are equivalent for text models, and you can mix them within the same file (and even within the same conversation): ```json theme={null} {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": [{"type": "text", "text": "What is the capital of France?"}]}, {"role": "assistant", "content": [{"type": "text", "text": "Paris."}]}]} {"messages": [{"role": "user", "content": [{"type": "text", "text": "What is 1+1?"}]}, {"role": "assistant", "content": [{"type": "text", "text": "2"}], "weight": 0}, {"role": "user", "content": "Now what is 2+2?"}, {"role": "assistant", "content": "4"}]} {"messages": [{"role": "user", "content": [{"type": "text", "text": "Say hello "}, {"type": "text", "text": "in French."}]}, {"role": "assistant", "content": "Bonjour."}]} ``` All keys you can use with the string form — including the per-message `weight` and `reasoning_content` — work the same way with the list form. When a single message contains multiple text parts (as in the third example above), the parts are concatenated when the chat template is applied. For text-only fine-tuning, only `{"type": "text", ...}` parts are used; image parts are reserved for [vision fine-tuning](/fine-tuning/fine-tuning-vlm). Here is an example conversation dataset with sample weights: ```json theme={null} { "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris."} ], "weight": 0.5 } { "messages": [ {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "2", "weight": 0}, {"role": "user", "content": "Now what is 2+2?"}, {"role": "assistant", "content": "4"} ], "weight": 1.0 } ``` We also support function calling dataset with a list of tools. 
An example would look like:

```json theme={null}
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_car_specs",
        "description": "Fetches detailed specifications for a car based on the given trim ID.",
        "parameters": {
          "trimid": {
            "description": "The trim ID of the car for which to retrieve specifications.",
            "type": "int",
            "default": ""
          }
        }
      }
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "What are the specs of the car with trim 121?"
    },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "get_car_specs",
            "arguments": "{\"trimid\": 121}"
          }
        }
      ]
    }
  ]
}
```

For models that support thinking (e.g. DeepSeek R1, GPT OSS models, and Qwen3 thinking models), we also support fine-tuning with thinking traces. To do so, include a `reasoning_content` field on assistant turns. The field is optional, but ideally every assistant turn includes a thinking trace. For example:

```json theme={null}
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris.", "reasoning_content": "The user is asking about the capital city of France, it should be Paris."}
  ]
}
{
  "messages": [
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2", "weight": 0, "reasoning_content": "The user is asking about the result of 1+1, the answer is 2."},
    {"role": "user", "content": "Now what is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": "The user is asking about the result of 2+2, the answer should be 4."}
  ]
}
```

Note that when fine-tuning with intermediate thinking traces, the total number of tuned tokens can exceed the total number of tokens in the dataset. This is because we unroll multi-turn conversations into multiple training examples to ensure train-inference consistency.
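As a rough illustration of that unrolling, the sketch below turns one conversation into one training example per assistant turn and drops `reasoning_content` from earlier assistant turns in the history, since earlier thinking traces are not part of the history at inference time. This is only a sketch of the idea, not Fireworks' actual preprocessing code; the function name `unroll_thinking_conversation` is hypothetical.

```python
def unroll_thinking_conversation(messages):
    """Expand one conversation into one example per assistant turn.

    Hypothetical helper for illustration only; it mirrors the behavior
    described in this guide, not the platform's real implementation.
    """
    examples = []
    for index, message in enumerate(messages):
        if message["role"] != "assistant":
            continue
        history = []
        for earlier in messages[:index]:
            earlier = dict(earlier)  # shallow copy so the input is not mutated
            if earlier["role"] == "assistant":
                # Earlier thinking traces are dropped from the history.
                earlier.pop("reasoning_content", None)
            history.append(earlier)
        # The assistant turn being tuned keeps its own thinking trace.
        examples.append({"messages": history + [dict(message)]})
    return examples


conversation = [
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2", "reasoning_content": "Simple arithmetic: 1+1=2."},
    {"role": "user", "content": "Now what is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": "Following up: 2+2=4."},
]
unrolled = unroll_thinking_conversation(conversation)
```

Because each expanded example repeats the earlier turns, the conversation history is counted more than once, which is where the extra tuned tokens come from.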
During inference, a model's thinking traces from previous turns are **not** visible in the conversation history — only the final `content` is retained. To match this behavior during training, we expand each multi-turn conversation into several single-turn training examples, where each example only tunes on one assistant turn and presents the conversation history exactly as it would appear at inference time (i.e., without previous thinking traces). For example, consider this two-turn dataset entry: ```json theme={null} { "messages": [ {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "2", "reasoning_content": "Simple arithmetic: 1+1=2."}, {"role": "user", "content": "Now what is 2+2?"}, {"role": "assistant", "content": "4", "reasoning_content": "Following up: 2+2=4."} ] } ``` This gets expanded into two training examples: **Example 1** — tunes on the first assistant turn: ```json theme={null} { "messages": [ {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "2", "reasoning_content": "Simple arithmetic: 1+1=2."} ] } ``` **Example 2** — tunes on the second assistant turn, with the first turn's thinking trace stripped to match inference behavior: ```json theme={null} { "messages": [ {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "2"}, {"role": "user", "content": "Now what is 2+2?"}, {"role": "assistant", "content": "4", "reasoning_content": "Following up: 2+2=4."} ] } ``` Because the conversation context is duplicated across these expanded examples, the total tuned token count will be larger than the raw dataset token count. The expansion grows with the number of assistant turns in each conversation: a conversation with *N* assistant turns produces *N* separate training examples. There are a couple ways to upload the dataset to Fireworks platform for fine tuning: `firectl`, `Restful API` , `builder SDK` or `UI`. 
* You can simply navigate to the dataset tab, click `Create Dataset`, and follow the wizard.

```bash theme={null}
firectl dataset create /path/to/jsonl/file
```

You need to make two separate HTTP requests: one to create the dataset entry and one to upload the dataset file. Full reference here: [Create dataset](/api-reference/create-dataset). Note that the `exampleCount` parameter needs to be provided by the client.

```jsx theme={null}
// Create Dataset Entry
const createDatasetPayload = {
  datasetId: "trader-poe-sample-data",
  dataset: { userUploaded: {} }
  // Additional params such as exampleCount
};

const urlCreateDataset = `${BASE_URL}/datasets`;
const response = await fetch(urlCreateDataset, {
  method: "POST",
  headers: HEADERS_WITH_CONTENT_TYPE,
  body: JSON.stringify(createDatasetPayload)
});
```

```jsx theme={null}
// Upload JSONL file
const urlUpload = `${BASE_URL}/datasets/${DATASET_ID}:upload`;
const files = new FormData();
files.append("file", localFileInput.files[0]);

const uploadResponse = await fetch(urlUpload, {
  method: "POST",
  headers: HEADERS,
  body: files
});
```

While all of the above approaches should work, the `UI` is better suited to smaller datasets (`< 500MB`), while `firectl` might work better for larger ones.

Ensure the dataset ID conforms to the [resource id restrictions](/getting-started/concepts#resource-names-and-ids).

There are also several ways to launch a fine-tuning job. We highly recommend creating supervised fine-tuning jobs via the `UI`. Simply navigate to the `Fine-Tuning` tab, click `Fine-Tune a Model`, and follow the wizard from there. You can even pick a LoRA model as the starting point for continued training.

Ensure the fine-tuned model ID conforms to the [resource id restrictions](/getting-started/concepts#resource-names-and-ids).

This will return a fine-tuning job ID.
For a full explanation of the settings available to control the fine-tuning process, including learning rate and epochs, consult [additional SFT job settings](#additional-sft-job-settings).

```bash theme={null}
firectl sftj create --base-model --dataset --output-model
```

As in the UI, instead of tuning a base model, you can also start tuning from a previous LoRA model using

```bash theme={null}
firectl sftj create --warm-start-from --dataset --output-model
```

Notice that we use `--warm-start-from` instead of `--base-model` when creating this job.

With the `UI`, once the job is created, it appears in the list of jobs. Click the job to view its details and monitor its progress.

If the fine-tuned model appears to learn the wrong text or ignore the expected assistant response, use **Render Samples** on the job details page to inspect the rendered token IDs and loss masks. See [Debug SFT tokenization](/fine-tuning/debug-sft-tokenization).

With `firectl`, you can monitor the progress of the tuning job by running

```bash theme={null}
firectl sftj get
```

Once the job completes successfully, you will see the new LoRA model in your model list:

```bash theme={null}
firectl model list
```

For a complete Python SDK example that demonstrates the full workflow (creating datasets, uploading files, and launching a supervised fine-tuning job), see the [Python SDK workflow example](https://github.com/fw-ai-external/python-sdk/blob/main/examples/sftj_workflow.py).

## Deploying a fine-tuned model

After fine-tuning completes, deploy your model to make it available for inference:

```bash theme={null}
firectl deployment create
```

This creates a dedicated deployment with performance matching the base model. For more details on deploying fine-tuned models, including multi-LoRA deployments, see the [Deploying Fine Tuned Models guide](/fine-tuning/deploying-loras).

## Additional SFT job settings

Additional tuning settings are available when starting a fine-tuning job.
All of the below settings are optional and will have reasonable defaults if not specified. For settings that affect tuning quality like `epochs` and `learning rate`, we recommend using default settings and only changing hyperparameters if results are not as desired. By default, the fine-tuning job will run evaluation by running the fine-tuned model against an evaluation set that's created by automatically carving out a portion of your training set. You have the option to explicitly specify a separate evaluation dataset to use instead of carving out training data. `evaluation_dataset`: The ID of a separate dataset to use for evaluation. Must be pre-uploaded via firectl ```shell theme={null} firectl sftj create \ --evaluation-dataset my-eval-set \ --base-model MY_BASE_MODEL \ --dataset cancerset \ --output-model my-tuned-model ``` Depending on the size of the model, the default context size will be different. For most models, the default context size is >= 32768. Training examples will be cut-off at 32768 tokens. Usually you do not need to set the max context length unless out of memory error is encountered with higher lora rank and large max context length. ```shell theme={null} firectl sftj create \ --max-context-length 65536 \ --base-model MY_BASE_MODEL \ --dataset cancerset \ --output-model my-tuned-model ``` Batch size is the number of tokens packed into one forward step during training. One batch could consist of multiple training samples. We do sequence packing on the training samples, and batch size controls how many total tokens will be packed into each batch. ```shell theme={null} firectl sftj create \ --batch-size 65536 \ --base-model MY_BASE_MODEL \ --dataset cancerset \ --output-model my-tuned-model ``` Epochs are the number of passes over the training data. Our default value is 1. If the model does not follow the training data as much as expected, increase the number of epochs by 1 or 2. Non-integer values are supported. 
**Note: we set a max value of 3 million dataset examples × epochs** ```shell theme={null} firectl sftj create \ --epochs 2.0 \ --base-model MY_BASE_MODEL \ --dataset cancerset \ --output-model my-tuned-model ``` Learning rate controls how fast the model updates from data. We generally do not recommend changing learning rate. The default value is automatically based on your selected model. ```shell theme={null} firectl sftj create \ --learning-rate 0.0001 \ --base-model MY_BASE_MODEL \ --dataset cancerset \ --output-model my-tuned-model ``` Learning rate warmup steps controls the number of training steps during which the learning rate will be linearly ramped up to the set learning rate. ```shell theme={null} firectl sftj create \ --learning-rate 0.0001 \ --learning-rate-warmup-steps 200 \ --base-model MY_BASE_MODEL \ --dataset cancerset \ --output-model my-tuned-model ``` Gradient accumulation steps controls the number of forward steps and backward steps to take (gradients are accumulated) before optimizer.step() is taken. Gradient accumulation steps > 1 increases effective batch size. ```shell theme={null} firectl sftj create \ --gradient-accumulation-steps 4 \ --base-model MY_BASE_MODEL \ --dataset cancerset \ --output-model my-tuned-model ``` LoRA rank refers to the number of parameters that will be tuned in your LoRA add-on. Higher LoRA rank increases the amount of information that can be captured while tuning. LoRA rank must be a power of 2 up to 32. Our default value is 8. ```shell theme={null} firectl sftj create \ --lora-rank 16 \ --base-model MY_BASE_MODEL \ --dataset cancerset \ --output-model my-tuned-model ``` The fine-tuning service integrates with Weights & Biases to provide observability into the tuning process. To use this feature, you must have a Weights & Biases account and have provisioned an API key. 
```shell theme={null} firectl sftj create \ --wandb-entity my-org \ --wandb-api-key xxx \ --wandb-project "My Project" \ --base-model MY_BASE_MODEL \ --dataset cancerset \ --output-model my-tuned-model ``` By default, the fine-tuning job will generate a random unique ID for the model. This ID is used to refer to the model at inference time. You can optionally specify a custom ID, within [ID constraints](/getting-started/concepts#resource-names-and-ids). ```shell theme={null} firectl sftj create \ --output-model my-model \ --base-model MY_BASE_MODEL \ --dataset cancerset ``` By default, the fine-tuning job will generate a random unique ID for the fine-tuning job. You can optionally choose a custom ID. ```shell theme={null} firectl sftj create \ --job-id my-fine-tuning-job \ --base-model MY_BASE_MODEL \ --dataset cancerset \ --output-model my-tuned-model ``` ## Appendix * `Python SDK` [references](/tools-sdks/python-sdk) * `Restful API` [references](/api-reference/introduction) * `firectl` [references](/tools-sdks/firectl/firectl) * [Complete Python SDK workflow example](https://github.com/fw-ai-external/python-sdk/blob/main/examples/sftj_workflow.py) for a code-only implementation # Supervised Fine Tuning - Vision Source: https://docs.fireworks.ai/fine-tuning/fine-tuning-vlm Learn how to fine-tune vision-language models on Fireworks AI with image and text datasets Vision-language model (VLM) fine-tuning allows you to adapt pre-trained models that can understand both text and images to your specific use cases. This is particularly valuable for tasks like document analysis, visual question answering, image captioning, and domain-specific visual understanding. To see all vision models that support fine-tuning, visit the [Model Library for vision models](https://app.fireworks.ai/models?filter=vision\&tunable=true). ## Fine-tuning a VLM using LoRA vision datasets must be in JSONL format in OpenAI-compatible chat format. Each line represents a complete training example. 
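As a sketch of how one such line could be built programmatically, the snippet below encodes a local image as a base64 data URI with a MIME type prefix (the form this guide's dataset requirements call for) and serializes a single question/answer pair as one JSONL line. The helper names `image_to_data_uri` and `make_training_line` are hypothetical and not part of any Fireworks SDK.

```python
import base64
import json
import mimetypes


def image_to_data_uri(path):
    """Return a base64 data URI for a PNG or JPEG file (hypothetical helper)."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type not in ("image/png", "image/jpeg"):
        raise ValueError(f"unsupported image type for {path!r}: {mime_type}")
    with open(path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def make_training_line(image_path, question, answer):
    """Serialize one image question/answer pair as a single JSONL line."""
    example = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_to_data_uri(image_path)}},
                ],
            },
            {"role": "assistant", "content": answer},
        ]
    }
    # json.dumps emits no raw newlines, so each example stays on one line.
    return json.dumps(example)
```

Writing one `make_training_line(...)` result per line to a `.jsonl` file yields a dataset in the chat format described here.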
**Dataset Requirements:** * **Format**: `.jsonl` file * **Minimum examples**: 3 * **Maximum examples**: 3 million per dataset * **Images**: Must be base64 encoded with proper MIME type prefixes * **Supported image formats**: PNG, JPG, JPEG **Message Schema:** Each training example must include a `messages` array where each message has: * `role`: one of `system`, `user`, or `assistant` * `content`: an array containing text and image objects or just text ### Basic VLM Dataset Example ```json theme={null} { "messages": [ { "role": "system", "content": "You are a helpful visual assistant that can analyze images and answer questions about them." }, { "role": "user", "content": [ { "type": "text", "text": "What objects do you see in this image?" }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..." } } ] }, { "role": "assistant", "content": "I can see a red car, a tree, and a blue house in this image." } ] } ``` ### If your dataset contains image urls Images must be base64 encoded with MIME type prefixes. If your dataset contains image URLs, you'll need to download and encode them to base64. ```json theme={null} { "type": "image_url", "image_url": { // ❌ Raw HTTP/HTTPS URLs are NOT supported "url": "https://example.com/image.jpg" } } ``` ```json theme={null} { "type": "image_url", "image_url": { // ✅ Use data URI with base64 encoding // Format: data:image/{format};base64,{base64_encoded_data} "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..." 
} } ``` You can use the following script to automatically convert your dataset to the correct format: **Usage:** ```bash theme={null} # Install required dependency pip install requests # Download the script wget https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/utils/download_images_and_encode_to_b64.py # Run the script - will output a new dataset _base64.jsonl python download_images_and_encode_to_b64.py --input_file ``` ### Advanced Dataset Examples ```json expandable theme={null} { "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Compare these two images and tell me the differences" }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..." } }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..." } } ] }, { "role": "assistant", "content": "The first image shows a daytime scene while the second shows the same location at night. The lighting and shadows are completely different." } ] } ``` ```json expandable theme={null} { "messages": [ { "role": "system", "content": "You are a helpful visual assistant that can analyze images and remember details from previous images in our conversation." }, { "role": "user", "content": [ { "type": "text", "text": "Can you describe this kitchen layout for me?" }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..." } } ] }, { "role": "assistant", "content": "This is a modern open-plan kitchen with white cabinets, granite countertops, and stainless steel appliances. The island is positioned in the center with bar seating for 3-4 people. There's a large window above the sink providing natural light." }, { "role": "user", "content": "Now look at this living room. Do you think the styles would work well together?" }, { "role": "assistant", "content": "I'd be happy to help compare the styles! However, I don't see a living room image in your message. 
Could you please share the living room photo so I can analyze how well it would coordinate with the modern kitchen style we just discussed?" }, { "role": "user", "content": [ { "type": "text", "text": "Sorry about that! Here's the living room:" }, { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,/9j/4BBBSkZJRgABAQ..." } } ] }, { "role": "assistant", "content": "Perfect! This living room has a complementary modern aesthetic that would work beautifully with the kitchen. Both spaces feature clean lines and a neutral color palette. The living room's contemporary furniture and the kitchen's sleek design would create a cohesive flow in an open floor plan. The warm wood accents in the living room would add nice contrast to the kitchen's cooler tones." } ] }
```

### Try with an Example Dataset

To get a feel for how VLM fine-tuning works, you can use an example vision dataset. This is a classification dataset that contains images of food with `` tags for reasoning.

```bash theme={null}
# Download the example dataset
curl -L -o food_reasoning.jsonl https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl
```

```bash theme={null}
# Download the example dataset
wget https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl
```

Upload your prepared JSONL dataset to Fireworks for training:

```bash theme={null}
firectl dataset create my-vlm-dataset /path/to/vlm_training_data.jsonl
```

Navigate to the Datasets tab in the Fireworks console, click "Create Dataset", and upload your JSONL file through the wizard.
```javascript theme={null}
// Create dataset entry
const createDatasetPayload = {
  datasetId: "my-vlm-dataset",
  dataset: { userUploaded: {} }
};

const response = await fetch(`${BASE_URL}/datasets`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify(createDatasetPayload)
});

// Upload JSONL file
const formData = new FormData();
formData.append("file", fileInput.files[0]);

const uploadResponse = await fetch(`${BASE_URL}/datasets/my-vlm-dataset:upload`, {
  method: "POST",
  headers: { "Authorization": `Bearer ${API_KEY}` },
  body: formData
});
```

For larger datasets (>500MB), use `firectl` as it handles large uploads more reliably than the web interface.

For enhanced data control and security, we also support bring your own bucket (BYOB) configurations. See our [Secure Fine Tuning](/fine-tuning/secure-fine-tuning#gcs-bucket-integration) guide for setup details.

Create a supervised fine-tuning job for your VLM:

```bash theme={null}
firectl sftj create \
  --base-model accounts/fireworks/models/qwen2p5-vl-32b-instruct \
  --dataset my-vlm-dataset \
  --output-model my-custom-vlm \
  --epochs 3
```

For additional parameters like learning rates, evaluation datasets, and batch sizes, see [Additional SFT job settings](/fine-tuning/fine-tuning-models#additional-sft-job-settings).

1. Navigate to the Fine-tuning tab in the Fireworks console
2. Click "Create Fine-tuning Job"
3. Select your VLM base model (Qwen 2.5 VL)
4. Choose your uploaded dataset
5. Configure training parameters
6. Launch the job

VLM fine-tuning jobs typically take longer than text-only models due to the additional image processing. Expect training times of several hours depending on dataset size and model complexity.

Track your VLM fine-tuning job in the [Fireworks console](https://app.fireworks.ai/dashboard/fine-tuning).
Monitor key metrics:

* **Training loss**: Should generally decrease over time
* **Evaluation loss**: Monitor for overfitting if using evaluation dataset
* **Training progress**: Epochs completed and estimated time remaining

Your VLM fine-tuning job is complete when the status shows `COMPLETED` and your custom model is ready for deployment.

Once training is complete, deploy your custom VLM:

```bash theme={null}
# Create a deployment for your fine-tuned VLM
firectl deployment create my-custom-vlm

# Check deployment status
firectl deployment get accounts/your-account/deployment/deployment-id
```

Deploy from the UI using the `Deploy` dropdown on the fine-tuning job page.

## Advanced Configuration

For additional fine-tuning parameters and advanced settings like custom learning rates, batch sizes, and optimization options, see the [Additional SFT job settings](/fine-tuning/fine-tuning-models#additional-sft-job-settings) section in our comprehensive fine-tuning guide.

Need custom training loops for VLMs? The **Training API** also supports vision-language model fine-tuning with full control over loss functions, training objectives, and evaluation. See [Training API: Vision Inputs](/fine-tuning/training-api/vision-inputs) for details.

## Interactive Tutorials: Fine-tuning VLMs

For a hands-on, step-by-step walkthrough of VLM fine-tuning, we've created two fine-tuning cookbooks that demonstrate the complete process, from dataset preparation and model deployment to evaluation.
**Google Colab Notebook: Fine-tune Qwen2.5 VL on Fireworks AI** **Finetuning a VLM to beat SOTA closed source model** The cookbooks above cover the following: * Setting up your environment with Fireworks CLI * Preparing vision datasets in the correct format * Launching and monitoring VLM fine-tuning jobs * Testing your fine-tuned model * Best practices for VLM fine-tuning * Running inference on serverless VLMs * Running evals to show performance gains ## Testing Your Fine-tuned VLM After deployment, test your fine-tuned VLM using the same API patterns as base VLMs: ```python Python (OpenAI SDK) theme={null} import openai client = openai.OpenAI( base_url="https://api.fireworks.ai/inference/v1", api_key="", ) response = client.chat.completions.create( model="accounts/your-account/models/my-custom-vlm", messages=[{ "role": "user", "content": [{ "type": "image_url", "image_url": { "url": "https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/images/icecream.jpeg" }, },{ "type": "text", "text": "What's in this image?", }], }] ) print(response.choices[0].message.content) ``` If you fine-tuned using the example dataset, your model should include `` tags in its response. # Training Overview Source: https://docs.fireworks.ai/fine-tuning/finetuning-intro Fireworks helps you fine-tune models to improve quality and performance for your product use cases, without the burden of building & maintaining your own training infrastructure. **Coming from OpenAI?** Fireworks uses the same [OpenAI-compatible chat completion format](/fine-tuning/fine-tuning-models#prepare-a-dataset) for training data — the same `messages` array with `role`, `content`, `tool_calls`, and `weight` fields. You can use your existing SFT datasets with no conversion required. See our [OpenAI compatibility guide](/tools-sdks/openai-compatibility) for more details. ## Three ways to fine-tune Fireworks offers three approaches to fine-tuning, from fully autonomous to fully custom. 
Pick the one that fits how much control you want:

**Describe what you want in plain English.** The agent picks the base model, prepares the data, sweeps hyperparameters, evaluates, trains, and deploys. You approve a single plan and cost up front. Best for the fastest path from dataset to deployed fine-tuned model, whether from the Fireworks dashboard or from inside Claude Code, Cursor, Codex, Aider, or Goose.

**Give Fireworks your data and configuration.** The platform handles scheduling, training, checkpointing, and model output. No custom code required. Best for teams that want managed SFT, DPO, or RFT with LoRA or full-parameter tuning.

**Write custom Python training loops.** You control the loss function, optimizer step, checkpointing, and weight sync. Fireworks handles the distributed GPU infrastructure. Best for research teams needing custom loops, custom rollout orchestration, or inference-in-the-loop evaluation.

|                                 | **Fireworks Agent**                                                       | **Managed Fine-Tuning**                      | **Training API**                   |
| ------------------------------- | ------------------------------------------------------------------------- | -------------------------------------------- | ---------------------------------- |
| **Interface**                   | Natural language (dashboard chat, `firectl session`, or via coding agent) | UI, `firectl`, REST API                      | Python script                      |
| **Who picks the model**         | Agent recommends                                                          | You                                          | You                                |
| **Who tunes hyperparameters**   | Agent runs a sweep                                                        | You set them                                 | You set them                       |
| **Cost approval**               | Built-in gate before any spend                                            | None; you submit jobs directly               | None                               |
| **Tuning method**               | Full-parameter or LoRA                                                    | Full-parameter or LoRA                       | Full-parameter or LoRA             |
| **Custom loss / training loop** | Not supported                                                             | Not supported                                | Supported                          |
| **Inference-in-the-loop eval**  | Not supported                                                             | Not supported                                | Supported (hotload)                |
| **Best for**                    | Getting a working fine-tuned model fast, without ML expertise             | Production fine-tuning with standard methods | Research, custom RL, hybrid losses |

## When to use SFT vs. RFT

In supervised fine-tuning, you provide a dataset with labeled examples of "good" outputs. In reinforcement fine-tuning, you provide a grader function that can be used to score the model's outputs. The model is iteratively trained to produce outputs that maximize this score.

Supervised fine-tuning (SFT) works well for many common scenarios, especially when:

* You have a sizable dataset (\~1000+ examples) with high-quality, ground-truth labels.
* The dataset covers most possible input scenarios.
* Tasks are relatively straightforward, such as:
  * Classification
  * Content extraction

However, SFT may struggle in situations where:

* Your dataset is small.
* You lack ground-truth outputs (a.k.a. "golden generations").
* The task requires multi-step reasoning.

Here is a simple decision tree:

```mermaid theme={null}
flowchart TD
    B{"Do you have labeled ground truth data?"}
    B --"Yes"--> C{"How much?"}
    C --"more than 1000 examples"--> D["SFT"]
    C --"100-1000 examples"-->F{"Does reasoning help?"}
    C --"~100s examples"--> E["RFT"]
    F --"No"-->D
    F -- "Yes" -->E
    B --"No"--> G{"Is this a verifiable task (see below)?"}
    G -- "Yes" -->E
    G -- "No"-->H["RLHF / LLM as judge"]
```

`Verifiable` refers to whether it is relatively easy to judge the quality of the model's output.
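The decision tree can also be read as a small function. The sketch below is one possible encoding, not an official rule: the function name `recommend_method` is hypothetical, and because the chart's `100-1000` and `~100s` branches overlap, it assumes the smallest bucket means fewer than 100 labeled examples.

```python
def recommend_method(num_labeled_examples, reasoning_helps=False, verifiable=False):
    """Suggest a fine-tuning method following the decision tree above.

    Assumption (not stated in the chart): the smallest bucket is treated
    as "fewer than 100 labeled examples".
    """
    if num_labeled_examples > 0:
        if num_labeled_examples > 1000:
            return "SFT"
        if num_labeled_examples >= 100:
            # Mid-sized datasets: prefer RFT only when reasoning helps.
            return "RFT" if reasoning_helps else "SFT"
        return "RFT"
    # No labeled ground truth: RFT needs a task whose outputs can be scored.
    return "RFT" if verifiable else "RLHF / LLM as judge"


for args in [(5000,), (500, True), (500, False), (50,), (0, False, True), (0,)]:
    print(args, "->", recommend_method(*args))
```

Treat the thresholds as rough guidance; the chart itself gives approximate ranges rather than hard cutoffs.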
## When to use the Training API instead

Move from managed fine-tuning to the [Training API](/fine-tuning/training-api/introduction) when you need:

* **Custom training logic**: hybrid objectives, custom reward shaping, or a non-standard algorithm beyond managed settings
* **Inference-in-the-loop evaluation**: hotload checkpoints onto a serving deployment and sample mid-training
* **Per-step control**: custom gradient accumulation, dynamic learning rate schedules, or algorithm research

### Detailed capability comparison

| Capability              | Managed RFT                                                  | Training API                                                 |
| ----------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Launch training         | CLI or UI                                                    | Python script                                                |
| Loss functions          | `grpo`, `dapo`, `gspo-token` (built-in)                      | Any custom loss via `forward_backward_custom`                |
| Tuning modes            | Full-parameter or LoRA                                       | Full-parameter or LoRA                                       |
| Context length          | Full context length supported by the selected training shape | Full context length supported by the selected training shape |
| Training loop           | Fully managed                                                | You write the loop                                           |
| Per-step diagnostics    | Dashboard (reward, loss, rollouts)                           | Full Python access to all metrics                            |
| Zero-variance filtering | Automatic                                                    | You implement                                                |
| Checkpoint management   | Automatic                                                    | You control via `save_weights_for_sampler_ext`               |

### Migrating from managed flow to Training API

If you've been using managed RFT and want more control (custom loss functions, richer diagnostics, or algorithm experimentation), the Training API lets you implement your own training loop while keeping the same GPU infrastructure. Managed jobs and cookbook recipes now use the same core tuning capabilities, including LoRA or full-parameter tuning and the full context length supported by the selected training shape.
### MoE models and Routing Replay For Mixture-of-Experts (MoE) models like Kimi K2 (384 experts), training stability benefits from **Routing Replay** — caching the expert routing assignments from the reference policy's forward pass and replaying them during the training forward pass. This ensures that the same experts process the same tokens in both the reference and policy models, reducing gradient noise from routing changes. Routing Replay is available in the Training API via the `loss_fn_inputs` mechanism — you can pass routing matrices from the reference forward pass into the training datum. Use the Training API when you need to inspect or customize those forward-pass inputs directly. # Basics Source: https://docs.fireworks.ai/fine-tuning/how-rft-works Understand the reinforcement learning fundamentals behind RFT ## What is reinforcement fine-tuning? In traditional supervised fine-tuning, you provide a dataset with labeled examples showing exactly what the model should output. In reinforcement fine-tuning, you instead provide: 1. **A dataset**: Prompts, with input examples for the model to respond to 2. **An evaluator**: Code that scores the model's outputs from 0.0 (bad) to 1.0 (good), also known as a reward function 3. **An agent**: An LLM application, with access to tools, APIs, and data needed for your task During training, the model generates responses to each prompt, receives scores from your reward function, and produces outputs that maximize the reward. 
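To make the evaluator idea concrete, here is a minimal sketch of a reward function that scores a model output between 0.0 and 1.0. It is a plain Python function written for illustration, not the Fireworks evaluator interface; the name `score_json_output` and the example keys `name` and `price` are assumptions.

```python
import json


def score_json_output(model_output, required_keys=("name", "price")):
    """Return a reward in [0.0, 1.0] for a structured-output task.

    Illustrative only: 0.0 for output that is not a JSON object,
    otherwise the fraction of required keys that are present.
    """
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    if not required_keys:
        return 1.0  # nothing further to check
    present = sum(1 for key in required_keys if key in parsed)
    return present / len(required_keys)


print(score_json_output('{"name": "widget", "price": 9.5}'))
print(score_json_output('{"name": "widget"}'))
print(score_json_output("not json at all"))
```

Partial credit (here, one of two keys) gives training a graded signal rather than a bare pass/fail, which is one reason scores span a range instead of being binary.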
## Use cases Reinforcement fine-tuning helps you train models to excel at: * **Code generation and analysis** - Writing and debugging functions with verifiable execution results or test outcomes * **Structured output generation** - JSON formatting, data extraction, classification, and schema compliance with programmatic validation * **Domain-specific reasoning** - Legal analysis, financial modeling, or medical triage with verifiable criteria and compliance checks * **Tool-using agents** - Multi-step workflows where agents call external APIs with measurable success criteria ## How it works Define how you'll score model outputs from 0 to 1. For example, scoring outputs higher by checking if your agent called the right tools, or if your LLM-as-judge rates the output highly. Create a JSONL file with prompts (system and user messages). These will be used to generate rollouts during training. Train locally, or connect your agent as a remote server to Fireworks with our /init and /status endpoints. Create an RFT job via the UI or CLI. Fireworks orchestrates rollouts, evaluates them, and trains the model to maximize reward. Once training completes, deploy your fine-tuned LoRA model to production with an on-demand deployment. ### RFT works best when: 1. You can determine whether a model's output is "good" or "bad," even if only approximately 2. You have prompts but lack perfect "golden" completions to learn from 3. The task requires multi-step reasoning where evaluating intermediate steps is hard 4. You want the model to explore creative solutions beyond your training examples ## Next steps Learn how to design effective reward functions Learn how to launch and configure RFT jobs # Managed Fine-Tuning Overview Source: https://docs.fireworks.ai/fine-tuning/managed-finetuning-intro Fine-tune models with Fireworks-managed infrastructure — no custom code required. Give Fireworks your data and configuration. 
The platform handles scheduling, training, checkpointing, and model output. Training data uses the **OpenAI-compatible chat completion format**, so existing OpenAI SFT datasets work with no conversion required.

## Methods

* Train text models with labeled examples of desired outputs
* Train vision-language models with image and text pairs
* Align models with human preferences using pairwise comparisons
* Train models using custom reward functions for complex reasoning tasks

## Free Reinforcement Fine-Tuning

**Reinforcement Fine-Tuning (RFT)** is free for models under 16B parameters. When creating an RFT job in the UI, use the "Free tuning" filter in the model selection area on the [fine-tuning creation page](https://app.fireworks.ai/dashboard/fine-tuning/create) to list eligible models. If kicking off jobs from the terminal, you can find the model ID in the [Model Library](https://app.fireworks.ai/models?filter=LLM\&tunable=true).

Note: SFT and DPO jobs are billed per training token for all model sizes; see the [pricing page](https://fireworks.ai/pricing) for details.

## Supported base models

Fireworks supports fine-tuning for most major open source models, including DeepSeek, Qwen, Kimi, Gemma, GLM, and Llama families. The same set of base models is available for SFT, DPO, and RFT: once a base model is supported, every managed fine-tuning method works against it.

The table below is generated from the live training shape registry. The "Max supported context length" is the largest `max_supported_context_length` across all training shapes registered for that base model. Use it as the upper bound when you set a per-job context length on `firectl sftj create`, `firectl dpoj create`, or RFT job creation.
| Base model                      | Max supported context length |
| ------------------------------- | ---------------------------- |
| `gemma-4-26b-a4b-it`            | 256K (262,144 tokens)        |
| `gemma-4-31b-it`                | 256K (262,144 tokens)        |
| `glm-5p1`                       | 200K (200,000 tokens)        |
| `kimi-k2p5`                     | 256K (262,144 tokens)        |
| `kimi-k2p6`                     | 256K (262,144 tokens)        |
| `llama-v3p3-70b-instruct`       | 128K (131,072 tokens)        |
| `minimax-m2p5`                  | 192K (196,608 tokens)        |
| `nemotron-nano-3-30b-a3b`       | 256K (262,144 tokens)        |
| `qwen3-235b-a22b-instruct-2507` | 128K (128,000 tokens)        |
| `qwen3-30b-a3b`                 | 128K (131,072 tokens)        |
| `qwen3-30b-a3b-instruct-2507`   | 128K (128,000 tokens)        |
| `qwen3-32b`                     | 128K (131,072 tokens)        |
| `qwen3-4b`                      | 64K (65,536 tokens)          |
| `qwen3-8b`                      | 256K (256,000 tokens)        |
| `qwen3-vl-8b-instruct`          | 256K (262,144 tokens)        |
| `qwen3p5-27b`                   | 256K (262,144 tokens)        |
| `qwen3p5-35b-a3b`               | 256K (262,144 tokens)        |
| `qwen3p5-397b-a17b`             | 256K (262,144 tokens)        |
| `qwen3p5-9b`                    | 256K (262,144 tokens)        |
| `qwen3p6-27b`                   | 256K (262,144 tokens)        |

To browse the broader catalog (including non-tunable inference models), visit the [Model Library for text models](https://app.fireworks.ai/models?filter=LLM\&tunable=true) or [vision models](https://app.fireworks.ai/models?filter=vision\&tunable=true).

## Tuning modes and context length

Managed fine-tuning supports both **[Low-Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685)** and full-parameter tuning, depending on the model, method, and selected training shape. It also supports the full context lengths exposed by the available training shapes, matching the same long-context capabilities used by cookbook recipes.

Choose LoRA when you want efficient adapter training and flexible deployment, including [multiple LoRAs](/fine-tuning/deploying-loras#multi-lora-deployment) on a single base model deployment. Choose full-parameter tuning when you need to update all model weights for difficult reasoning, alignment, or domain adaptation tasks.
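To give a rough sense of why LoRA counts as the efficient option, the sketch below compares trainable parameter counts for a single weight matrix. It assumes the standard LoRA formulation, in which a `d_out × d_in` weight is adapted by two low-rank factors of shapes `d_out × rank` and `rank × d_in`. The 4096 dimensions are illustrative and do not describe any specific Fireworks model.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one adapted weight matrix."""
    # Two low-rank factors: (d_out x rank) and (rank x d_in).
    return rank * (d_in + d_out)


def full_trainable_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


# Illustrative numbers only: a square 4096 x 4096 projection at rank 8.
lora = lora_trainable_params(4096, 4096, rank=8)
full = full_trainable_params(4096, 4096)
fraction = lora / full
```

Under these assumptions, the adapter trains 65,536 parameters against 16,777,216 for the full matrix, or about 0.4% of the weights in that layer.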
**Deprecation notice:** The `deployedModel` request key for routing to LoRA addons is deprecated and will not be supported for any new deployments. Please migrate to the `model` field with the `#` format described in [Routing requests to LoRA addons](/fine-tuning/deploying-loras#routing-requests-to-lora-addons). # Monitor Training Source: https://docs.fireworks.ai/fine-tuning/monitor-training Track RFT job progress and diagnose issues in real-time Once your RFT job is running, the Fireworks dashboard provides comprehensive monitoring tools to track progress, inspect individual rollouts, and debug issues as they arise. ## Accessing the monitoring dashboard After creating your RFT job, you'll receive a dashboard link in the CLI output: ``` Dashboard Links: RFT Job: https://app.fireworks.ai/dashboard/fine-tuning/reinforcement/abc123 ``` Click this link or navigate manually: 1. Go to [Fireworks Dashboard](https://app.fireworks.ai) 2. Click **Fine-Tuning** in the sidebar 3. Select your job from the list ## Understanding the overview The main dashboard shows your job's current state and key metrics. ### Job status Your job is queued waiting for GPU resources. Queue time depends on current demand and your account priority. **Action**: None needed. Job will start automatically when resources become available. Fireworks is validating your dataset to ensure it meets format requirements and quality standards. **Duration**: Typically 1-2 minutes **Action**: None needed. If validation fails, you'll receive specific error messages about issues in your dataset. Training is actively in progress. Rollouts are being generated, evaluated, and the model is learning. **Action**: Monitor metrics and rollout quality. This is when you'll watch reward curves improve. Training finished successfully. Your fine-tuned model is ready for deployment. **Action**: Review final metrics, then [deploy your model](/fine-tuning/deploying-loras). Training encountered an unrecoverable error and stopped. 
**Action**: Check error logs and troubleshooting section below. Common causes include evaluator errors, resource limits, or dataset issues. You or another user manually stopped the job. **Action**: Review partial results if needed. Create a new job to continue training. Training stopped automatically because the full epoch showed no improvement. All rollouts received the same scores, indicating no training progress. **Action**: This typically indicates an issue with your evaluator or training setup. Check that: * Your evaluator is returning varied scores (not all 0s or all 1s) * The reward function can distinguish between good and bad outputs * The model is actually generating different responses Review the troubleshooting section below for common causes. ### Key metrics at a glance The overview panel displays: * **Elapsed time**: How long the job has been running * **Progress**: Current epoch and step counts * **Reward**: Latest mean reward from rollouts * **Model**: Base model and output model names ## Training metrics ### Reward curves The most important metric in RFT is the reward curve, which shows how well your model is performing over time. **What to look for**: * **Upward trend** - Model is learning and improving * **Plateauing** - Model may have converged; consider stopping or adjusting parameters * **Decline** - Potential issue with evaluator or training instability * **Spikes** - Could indicate noisy rewards or outliers in evaluation Reward curve showing upward trend over training epochs Healthy training shows steady reward improvement. Don't worry about minor fluctuations—focus on the overall trend. 
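If you track rewards outside the dashboard, a simple smoothing pass can help separate the overall trend from step-to-step noise. The sketch below is one possible approach. The window size, the tolerance, and the function names are assumptions made for illustration, not dashboard behavior.

```python
def moving_average(values, window=5):
    """Trailing moving average; early points average over fewer values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the trailing window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def describe_trend(rewards, window=5, tolerance=0.02):
    """Label a reward series by comparing its smoothed end to its start.

    The tolerance is an arbitrary threshold chosen for illustration.
    """
    smoothed = moving_average(rewards, window)
    if len(smoothed) < 2:
        return "not enough data"
    change = smoothed[-1] - smoothed[0]
    if change > tolerance:
        return "upward"
    if change < -tolerance:
        return "decline"
    return "plateau"
```

For example, `describe_trend([0.2, 0.25, 0.22, 0.3, 0.35, 0.33, 0.4, 0.45])` labels the series as upward even though individual steps dip.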
### Training loss Loss measures how well the model is fitting the training data: * **Decreasing loss** - Normal learning behavior * **Increasing loss** - Learning rate may be too high * **Flat loss** - Model may not be learning; check evaluator rewards ### Evaluation metrics If you provided an evaluation dataset, you'll see validation metrics: * **Eval reward**: Model performance on held-out data * **Generalization gap**: Difference between training and eval rewards Large gaps between training and eval rewards suggest overfitting. Consider reducing epochs or adding more diverse training data. ## Inspecting rollouts Understanding individual rollouts helps you verify your evaluator is working correctly and identify quality issues. ### Rollout overview table Click any **Epoch** in the training timeline, then click the **table icon** to view all rollouts for that step. Table showing rollout IDs, prompts, responses, and rewards The table shows: * **Row ID**: Unique identifier for each dataset row used in this rollout * **Prompt**: The input prompt sent to the model * **Messages**: The model's generated response messages * **Valid**: Whether the rollout completed successfully without errors * **Reason**: Explanation if the rollout failed or was marked invalid * **Score**: Reward score assigned by your evaluator (0.0 to 1.0) **What to check**: * Most rollouts succeeding (status: complete) * Reward distribution makes sense (high for good outputs, low for bad) * Many failures indicate evaluator issues * All rewards identical may indicate evaluator is broken ### Individual rollout details Click any row in the rollout table to see full details: Detailed view of a single rollout showing full prompt, response, and evaluation You'll see: 1. **Full prompt**: Exact messages sent to the model 2. **Model response**: Complete generated output 3. **Evaluation result**: Reward score and reasoning (if provided) 4. **Metadata**: Token counts, timing, temperature settings 5. 
**Tool calls**: For agentic rollouts with function calling Copy and paste model outputs to test them manually. For example, if you're training a code generator, try running the generated code yourself to verify your evaluator is scoring correctly. ### Quality spot checks Regularly inspect rollouts at different stages of training: **Early training (first epoch)**: * Verify evaluator is working correctly * Check that high-reward rollouts are actually good * Ensure low-reward rollouts are actually bad **Mid-training**: * Confirm model quality is improving * Look for new strategies or behaviors emerging * Check that evaluator isn't being gamed **Late training**: * Verify final model quality meets your standards * Check for signs of overfitting (memorizing training data) * Ensure diversity in responses (not all identical) ## Live logs Real-time logs show what's happening inside your training job. ### Accessing logs Click the **Logs icon** next to the table icon to view real-time logs for your training job. Live log streaming showing rollout processing and evaluation ### Using logs for debugging When things go wrong, logs are your first stop: 1. **Filter by error level**: Focus on `[ERROR]` and `[WARNING]` messages 2. **Search for rollout IDs**: Track specific rollouts through their lifecycle 3. **Look for patterns**: Repeated errors indicate systematic issues 4. **Check timestamps**: Correlate errors with metric changes ## Training diagnostics ### Available in the managed flow The managed RFT dashboard provides: * **Reward curves:** Mean reward over training steps * **Training loss:** Policy loss over time * **Rollout inspection:** Individual rollouts with scores, messages, and metadata ### Traces page The **Traces** page in the Fireworks dashboard provides per-rollout execution traces, including timing, token counts, and evaluation results. Trace data can be downloaded for offline analysis using the download button on the Traces page. 
### Metrics not directly surfaced The following diagnostics are not directly surfaced in the managed RFT dashboard today: * **Filtering rates:** How many zero-variance groups were dropped per iteration * **Effective batch size:** Actual number of training groups after filtering * **Advantage magnitude and distribution:** Per-step advantage statistics * **KL divergence:** Distance between the current policy and the reference model * **Per-token importance sampling ratios:** Clipping frequency and magnitude These metrics can be partially inferred from trace data and rollout inspection. For richer per-step diagnostics, consider using the [Training API](/fine-tuning/training-api/introduction), which gives you full Python control over the training loop and allows you to log any metric you need. ## Common issues and solutions **Symptoms**: Reward curve flat or very low throughout training **Possible causes**: * Evaluator always returning 0 or very low scores * Model outputs not matching expected format * Task too difficult for base model **Solutions**: 1. Inspect rollouts to verify evaluator is working: * Check that some rollouts get high rewards * Verify reward logic makes sense 2. Test evaluator locally on known good/bad outputs 3. Simplify the task or provide more examples 4. Try a stronger base model **Symptoms**: Reward increases then crashes and stays low **Possible causes**: * Learning rate too high causing training instability * Model found an exploit in the evaluator (reward hacking) * Catastrophic forgetting **Solutions**: 1. Stop training and use the last good checkpoint 2. Restart with lower learning rate (e.g., `--learning-rate 5e-5`) 3. Review recent rollouts for reward hacking behavior 4. Improve evaluator to be more robust **Symptoms**: Rollout table shows lots of errors or timeouts **Possible causes**: * Evaluator code errors * Timeout too short for evaluation * External API failures (for remote evaluators) * Resource exhaustion **Solutions**: 1. 
Check error logs for specific error messages 2. Test evaluator locally to reproduce errors 3. Increase `--rollout-timeout` if evaluations need more time 4. Add better error handling in evaluator code 5. For remote evaluators: check server health and logs **Symptoms**: Loss goes up instead of down **Possible causes**: * Learning rate too high * Conflicting reward signals * Numerical instability **Solutions**: 1. Reduce learning rate by 2-5x 2. Check that rewards are consistent (same prompt gets similar rewards) 3. Verify rewards are in valid range \[0, 1] 4. Consider reducing batch size **Symptoms**: Model generates the same response for every prompt **Possible causes**: * Temperature too low (near 0) * Model found one high-reward response and overfit to it * Evaluator only rewards one specific output **Solutions**: 1. Increase `--temperature` to 0.8-1.0 2. Make evaluator more flexible to accept diverse good answers 3. Use more diverse prompts in training data 4. Reduce epochs to prevent overfitting **Symptoms**: Many rollouts timing out with remote environment **Possible causes**: * Remote server slow or overloaded * Network latency issues * Evaluator not logging completion correctly **Solutions**: 1. Check remote server logs for errors 2. Verify server is logging `Status.rollout_finished()` 3. Increase `--rollout-timeout` to allow more time 4. Scale remote server to handle concurrent requests 5. Optimize evaluator code for performance ## Performance optimization ### Speeding up training If training is slower than expected: **Slow evaluators directly increase training time**: * Profile your evaluator code to find bottlenecks * Cache expensive computations * Use batch processing for API calls * Add timeouts to prevent hanging **For remote evaluators**: * Add more worker instances to handle concurrent rollouts * Use faster machines (more CPU, memory) * Optimize network connectivity to Fireworks Target: Evaluations should complete in 1-5 seconds per rollout. 
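Two of the suggestions above, caching expensive computations and checking evaluation time against a target, can be sketched as follows. The function names and the placeholder scoring rule are hypothetical. `expensive_check` stands in for whatever slow step your evaluator performs, such as an API call or a test run.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=4096)
def expensive_check(output: str) -> float:
    """Hypothetical stand-in for a slow scoring step.

    Identical outputs are scored once and then served from the cache.
    """
    return 1.0 if output.strip().endswith(".") else 0.0


def timed_evaluate(output: str, budget_seconds: float = 5.0) -> tuple[float, float, bool]:
    """Score one output and report whether scoring met the time budget.

    Returns (score, elapsed_seconds, within_budget).
    """
    start = time.perf_counter()
    score = expensive_check(output)
    elapsed = time.perf_counter() - start
    return score, elapsed, elapsed <= budget_seconds
```

Logging the elapsed time per rollout makes it easier to spot which prompts push evaluation past the 1-5 second target. Caching only helps when the same output is scored more than once.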
**Reduce compute while maintaining quality**: * Decrease `--n` (e.g., from 8 to 4 rollouts per prompt) * Reduce `--max-tokens` if responses don't need to be long * Lower temperature slightly to speed up sampling Caution: Too few rollouts (n \< 4) may hurt training quality. ### Cost optimization Reduce costs without sacrificing too much quality: 1. **Start small**: Experiment with `qwen3-0p6b` before scaling to larger models 2. **Reduce rollouts**: Use `--n 4` instead of 8 3. **Shorter responses**: Lower `--max-tokens` to minimum needed 4. **Fewer epochs**: Start with 1 epoch, only add more if needed 5. **Efficient evaluators**: Minimize API calls and computation ## Stopping and resuming jobs ### Stopping a running job If you need to stop training: 1. Click **Cancel Job** in the dashboard 2. Or via CLI: ```bash theme={null} firectl rftj delete ``` The model state at the last checkpoint is saved and can be deployed. Cancelled jobs cannot be resumed. If you want to continue training, create a new job starting from the last checkpoint. ### Using checkpoints Checkpoints are automatically saved during training. To continue from a checkpoint: ```bash theme={null} eval-protocol create rft \ --warm-start-from accounts/your-account/models/previous-checkpoint \ --output-model continued-training ``` This is useful for: * Extending training after early stopping * Trying different hyperparameters on a trained model * Building on previous successful training runs ## Comparing multiple jobs Running multiple experiments? Compare them side-by-side: 1. Navigate to **Fine-Tuning** dashboard 2. Select multiple jobs using checkboxes 3. Click **Compare** This shows: * Reward curves overlaid on same graph * Parameter differences highlighted * Final metrics comparison * Training time and cost comparison Use consistent naming for experiments (e.g., `math-lr-1e4`, `math-lr-5e5`) to make comparisons easier. ## Exporting metrics For deeper analysis or paper writing: ### Via dashboard 1. 
Click **Export** button in job view 2. Choose format: CSV, JSON 3. Select metrics to export (rewards, loss, rollout data) ### Via API ```python theme={null} import requests response = requests.get( f"https://api.fireworks.ai/v1/accounts/{account}/reinforcementFineTuningJobs/{job_id}/metrics", headers={"Authorization": f"Bearer {api_key}"} ) metrics = response.json() ``` ### Weights & Biases integration If you enabled W\&B when creating the job: ```bash theme={null} eval-protocol create rft \ --wandb-project my-experiments \ --wandb-entity my-org \ ... ``` All metrics automatically sync to W\&B for advanced analysis, comparison, and sharing. ## Best practices Check your job within the first 15-30 minutes of training: * Verify evaluator is working correctly * Confirm rewards are in expected range * Catch configuration errors early Don't wait until training completes to discover issues. Every few epochs, inspect 5-10 random rollouts: * Manually verify high-reward outputs are actually good * Check low-reward outputs are actually bad * Look for unexpected model behaviors This catches evaluator bugs and reward hacking. When you find good hyperparameters, save the command: ```bash theme={null} # Save to file for reproducibility echo "eval-protocol create rft --base-model ... --learning-rate 5e-5 ..." > best_config.sh ``` Makes it easy to reproduce results or share with team. Name jobs descriptively: * Good: `math-solver-llama8b-temp08-n8` * Bad: `test1`, `try2`, `final-final` Future you will thank you when comparing experiments. Keep notes on what worked and what didn't: * Hypothesis for each experiment * Parameters changed * Results and insights * Next steps Build institutional knowledge for your team. 
## Next steps Once training completes, deploy your fine-tuned model for inference Learn how to adjust parameters for better results Improve your reward functions based on training insights Start a new experiment using the CLI # Price comparison vs Tinker Source: https://docs.fireworks.ai/fine-tuning/multi-turn-cost-comparison Estimate the cost of multi-turn agentic RL rollouts on Fireworks compared to Tinker's per-token pricing If you're running RL or agentic post-training on a long-context model and your provider bills you per token with **no cross-turn prefix cache**, the prefill cost grows quadratically with the number of turns — every turn re-prefills the full conversation history. On Fireworks Dedicated, session-affinity routing keeps an episode pinned to one replica so the KV cache is reused across turns, and cached prompt tokens contribute essentially zero extra compute. The calculator below makes that difference concrete. Set your episode shape (turns, context growth, generation length) and compare: * **Tinker** — flat per-token billing, no cross-turn cache (re-prefill every turn) * **Fireworks Dedicated** — on-demand GPU-hour billing; the cache savings show up as more work per hour, not as a discounted token rate ## Performance and benchmarking notes ### Dedicated trainer vs pooled/serverless resourcing Tinker runs training jobs on a **pooled/serverless** GPU fleet, which lets a single job burst onto many more GPUs than you would dedicate to a replica on Fireworks. That burst is what makes individual Tinker steps feel fast — but it also **caps the maximum training speed you can buy**: you cannot pay to scale beyond the pool's per-job allocation, and you cannot reserve isolated capacity. Fireworks dedicated trainers take the opposite trade-off: predictable, isolated execution with no shared-pool queueing or noisy-neighbor variance, and the ability to scale **wall-clock time and cost independently** by adjusting replica count. 
If you want faster steps on dedicated, increase replica count and parallelize work. For **large model training or longer rollouts**, we have consistently found that a dedicated setup like ours is **cheaper overall** and, depending on the customer's resourcing needs, **can also be faster**.

### Context-length benchmarking caveat

Benchmark comparisons are only apples-to-apples when truncation policy and effective context length are matched. If one system truncates samples longer than 32k tokens and another does not, the non-truncating run is doing more work and will appear slower.

### Replica count is a speed/cost knob

Users can trade cost and wall-clock time by scaling replicas. A quick back-of-envelope estimate:

$$
\text{\$ / 1M tokens} \approx \frac{\text{GPU count} \cdot \text{\$ / GPU-hour}}{\text{tokens/sec(cluster)} \cdot 3600} \cdot 10^6
$$

## How the numbers come together

### Tinker (the cost customers describe)

Each turn re-prefills the full accumulated context:

$$
\text{Prefill tokens (Tinker)} = \sum_{t=1}^{T} P_t = T \cdot P_1 + \Delta \cdot \frac{T(T-1)}{2}
$$

Here $P_t = P_1 + (t - 1)\Delta$ is the prompt length at turn $t$, $P_1$ is the initial prompt (system + tools + task), $\Delta$ is the context added per turn (model response + tool result), and $T$ is the turn count. This is **quadratic in $T$**.

$$
\text{Cost (Tinker)} = \frac{\text{Prefill tokens}}{10^6} \cdot r_{\text{prefill}} + \frac{\text{Decode tokens}}{10^6} \cdot r_{\text{sample}}
$$

### Fireworks Dedicated — GPU-hour billing

Dedicated deployments are billed per GPU-second, so the prefix cache shows up as **higher effective throughput** rather than a discount on per-token rates. Across one episode, each unique token is prefilled at most once; the rest of the prompt is served from the prefix cache and contributes essentially no GPU work.
The uncached portion that actually hits prefill is:

$$
\text{Uncached prompt} = P_T = P_1 + (T - 1) \Delta
$$

On a saturated cluster:

$$
\text{Cluster-hours} = \frac{\text{Uncached prompt} / \text{prefill TPS}}{3600}
$$

$$
\text{Cost} = \text{Cluster-hours} \cdot N_{\text{GPU}} \cdot r_{\text{GPU/hr}}
$$

Because cached tokens contribute essentially nothing to wall-clock work, the cluster's effective \$/M token rate falls as utilization rises. For continuous RL training, where rollouts run at sustained pace, dedicated is typically the cheapest path at scale.

The calculator's dedicated path uses *saturated* throughput estimates as defaults. A small, lightly-loaded test deployment will look more expensive per token than these numbers because the cluster is paid for whether it's busy or idle. Tune the throughput inputs in the **Advanced** panel to match your actual rollout pace.

## What's covered

The calculator currently includes the four models for which Tinker publishes per-token rates:

| Model                    | Tinker prefill / sample (per 1M) |
| ------------------------ | -------------------------------- |
| Kimi K2.6 (128K)         | \$5.15 / \$12.81                 |
| Kimi K2.5 (128K)         | \$5.15 / \$12.81                 |
| Qwen3.5-397B-A17B (256K) | \$4.00 / \$10.00                 |
| GPT-OSS-120B (128K)      | \$0.63 / \$1.54                  |

All Fireworks-side rates are taken from the public pages linked below, and the constants live in `snippets/multi-turn-cost-calculator.jsx`; update them there if either side's pricing changes.

## FAQ

### What is the fastest way to reduce wall-clock time?

Increase replicas and overlap sampling/training where your workflow allows it. Those are usually the most direct levers for shortening end-to-end cycle time.

### How should I compare costs between providers?

Use matched assumptions for context length, truncation policy, and effective resource allocation.
The calculator at the top of this page handles the math once you plug in your episode shape, but be sure to also align truncation policy and effective context window between providers before drawing conclusions.

## Sources

* Tinker pricing: [thinkingmachines.ai/tinker](https://thinkingmachines.ai/tinker)
* Fireworks GPU-hour pricing: [fireworks.ai/pricing](https://fireworks.ai/pricing)
* Related: [RFT Cost Estimator](/fine-tuning/rft-cost-estimator), which applies the same idea to the training-side bill (Fireworks GPU-hour, no comparison column).

This is an estimator, not a quote. Real costs depend on your exact workload, cache hit rate, hardware utilization, and rate-card terms at run time.

# Parameter Tuning

Source: https://docs.fireworks.ai/fine-tuning/parameter-tuning

Learn how training parameters affect model behavior and outcomes

## Overview

Reinforcement fine-tuning uses two categories of parameters to control model training: **training parameters** that govern how the model learns, and **rollout (sampling) parameters** that control how the model generates responses during training.

Most experiments converge well with the default values. Adjust parameters only when you have a clear hypothesis based on your training metrics and reward curves.

## Training Parameters

Core parameters that control how your model learns during the training process.

**What it does**: Controls how aggressively the model updates its weights during each training step. Think of it as the "step size" when descending the loss landscape.

**Default**: `1e-4` (0.0001)\
**Valid range**: `1e-5` to `5e-4`

**How it affects outcome**:

* **Too high** → Unstable training where reward spikes briefly then collapses as the model overshoots optimal weights.
* **Too low** → Painfully slow convergence. The reward curve plateaus too early before reaching optimal performance.
* **Just right** → Steady, consistent reward improvement throughout training.
**When to adjust**: * **Decrease** when you see reward spikes followed by crashes in your training metrics * **Increase** when the reward curve plateaus too early and stops improving * Keep changes within 2× of the default value **What it does**: The number of complete passes through your training dataset. Each epoch processes every example once. **Default**: `1`\ **Valid range**: `1` to `10` (whole numbers only) **How it affects outcome**: * **Too few** → The model hasn't had enough exposure to learn patterns from your data * **Too many** → Overfitting risk where the model memorizes the training set instead of generalizing * **Just right** → Reward curve shows steady improvement and plateaus near the end of training **When to adjust**: * **Add 1-2 more epochs** if the reward is still climbing steadily at the end of training * **Keep at 1** for most tasks—the default works well * Watch your reward curves to detect when adding more epochs stops helping **What it does**: Controls the number of trainable parameters in your LoRA adapter. LoRA (Low-Rank Adaptation) adds small adapter layers to the base model rather than training all weights. Higher rank means more capacity to learn new behaviors. **Default**: `8`\ **Valid range**: `4` to `32` (must be powers of 2: 4, 8, 16, 32) **How it affects outcome**: * **Lower rank (4-8)** → Faster training, but may lack capacity for complex tasks * **Just right (8-16)** → Balances capacity and efficiency for most tasks * **Higher rank (32)** → More learning capacity, but requires significantly more GPUs and risks overfitting **When to adjust**: * **Increase** for complex reasoning tasks or when the model struggles to learn desired behaviors * Consider task complexity: simple style changes need lower rank, complex reasoning needs higher **What it does**: The amount of data (measured in tokens) processed in each training step before updating model weights. 
Unlike traditional batch sizes that count sequences (e.g., 32 or 64 sequences), Fireworks RFT uses **token-based batch sizing**. For example, with an 8k max sequence length, a 64k batch size allows up to 8 sequences per batch (64k tokens ÷ 8k tokens/sequence = 8 sequences). **Default**: `32k tokens` **How it affects outcome**: * **Smaller batches** → Noisier gradient updates that may help exploration, but slower training throughput * **Larger batches** → Smoother, more stable updates and faster training throughput **When to adjust**: * Most users should stick with the default. Modify if you want a smaller/larger amount of tokens per train step **What it does**: Sets the minimum number of prompts rolled out before each GRPO training step. Controls how on-policy the training is by determining how often the model is updated relative to rollout generation — a chunk is a slice of the dataset that the trainer fully rolls out *before* taking a training step, after which the next chunk's rollouts are generated from the updated policy. **Default**: `200` (auto-applied only when the dataset has at least `2 × chunk_size` examples; datasets with fewer examples run without chunking) **Valid values**: `-1` to disable chunking, any positive integer to set an explicit size. Setting `0` (or leaving unset) uses the default behavior above. **On-policy spectrum**: * **Small chunk size** → more frequent training steps, rollouts stay close to the policy being trained (more on-policy), but more forward/backward passes per epoch and slower wall-clock time. * **Large chunk size** (or `chunk_size = dataset_size`) → fewer training steps, rollouts become stale relative to the updated policy (more off-policy), faster wall-clock but potentially lower sample efficiency. * **Fully online RL**: `chunk_size=1` (generate one prompt's rollouts → train → repeat). Not typically recommended in practice. 
* **Fully offline RL**: `chunk_size = dataset_size` (generate all rollouts first, then train, equivalent to 1 epoch with no mid-epoch updates).

**Epoch/chunk interaction**

An epoch is still a full pass through the entire dataset. `chunk_size` controls how frequently the model gets a GRPO training step *within* each epoch. For example, with `chunk_size=200`, `dataset_size=1000`, `epochs=2`, and `response_candidates_count=8`:

```
epoch 0 chunk 0 (prompts 1-200)    × 8 rollouts → train
epoch 0 chunk 1 (prompts 201-400)  × 8 rollouts → train
epoch 0 chunk 2 (prompts 401-600)  × 8 rollouts → train
epoch 0 chunk 3 (prompts 601-800)  × 8 rollouts → train
epoch 0 chunk 4 (prompts 801-1000) × 8 rollouts → train
epoch 1 chunk 0 (prompts 1-200)    × 8 rollouts → train
...
```

That is, 5 chunks × 2 epochs = 10 GRPO training steps total, each preceded by 200 × 8 = 1600 rollouts.

**Relationship with `gradient_accumulation_steps`**

These two are orthogonal:

* `chunk_size` controls how many prompts are rolled out **before each GRPO training step**, i.e., how on-policy the training is.
* `gradient_accumulation_steps` controls how many forward/backward passes accumulate **within a single chunk's training step** before each optimizer update.

`--chunk-size` is only exposed via the `firectl` / `eval-protocol` CLI. It is not configurable from the Web UI.

## Loss Method

Parameters that control the policy optimization algorithm used during training.

**What it does**: Controls the policy optimization algorithm used during training. Different methods trade off exploration aggressiveness, stability, and KL regularization.

**Default**: `grpo`

**Valid values**: `grpo`, `dapo`, `gspo-token`

**GRPO** (default): Group Relative Policy Optimization ([arXiv:2402.03300](https://arxiv.org/abs/2402.03300)). The conservative baseline used by most RFT jobs.

* **Symmetric clipping:** Clips the policy ratio to `[0.8, 1.2]`, limiting how much the policy can change in a single step in either direction.
* **KL penalty:** Includes a small KL divergence penalty (`kl_loss_coef=0.001`) that keeps the trained policy close to the reference model. This prevents mode collapse but limits how far the model can deviate from its starting behavior.
* **Token-level loss aggregation:** Loss is summed over valid tokens and divided by total valid token count (`token-mean`).

Best for: Most tasks. Start here unless you have a specific reason to use another method.

**DAPO**: Decoupled Clip and Dynamic Sampling Policy Optimization ([arXiv:2503.14476](https://arxiv.org/abs/2503.14476)). A more aggressive variant that removes KL regularization and uses asymmetric clipping.

* **Asymmetric clipping:** Clips the policy ratio to `[0.8, 1.28]`. The upper margin (0.28) is wider than the lower margin (0.2), allowing the policy to take larger steps in the "improve" direction while being more conservative about degradation.
* **No KL penalty:** `kl_loss_coef` is set to 0. The trained policy is not penalized for diverging from the reference model.
* **Token-level loss aggregation:** Same `token-mean` mode as GRPO.

Best for: Tasks where the base model is far from optimal and you want to allow larger policy updates. Useful when GRPO converges too slowly or plateaus early.

`--rl-kl-beta` is incompatible with DAPO. Setting a non-zero `--rl-kl-beta` when `--rl-loss-method dapo` is specified will cause job creation to fail with: `loss_config.kl_beta must be 0/unset when method is DAPO`.

**What DAPO does NOT include from the original paper:**

* **Overlong reward shaping** is not implemented. The separate `--length-norm` flag exists but is not DAPO-specific.
* **Dynamic sampling (overgeneration)** is not implemented. Zero-variance groups are filtered out (see [Zero-Variance Group Filtering](#zero-variance-group-filtering) below), but filtered prompts are dropped from the batch, not replaced with new prompts.
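
The difference between the GRPO and DAPO clip ranges quoted above can be illustrated with a minimal sketch of the standard PPO-style clipped surrogate for a single token. This is not the managed trainer's implementation; the function name `clipped_objective` and the sample ratio and advantage values are hypothetical and chosen only for illustration.

```python
def clipped_objective(ratio, advantage, low, high):
    """PPO-style clipped surrogate for one token (illustrative sketch only)."""
    clipped_ratio = max(low, min(ratio, high))
    # Use the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


GRPO_CLIP = (0.8, 1.2)   # symmetric range quoted above
DAPO_CLIP = (0.8, 1.28)  # asymmetric range quoted above

# Hypothetical token whose probability rose by 25% and had a positive advantage.
ratio, advantage = 1.25, 1.0
grpo_value = clipped_objective(ratio, advantage, *GRPO_CLIP)
dapo_value = clipped_objective(ratio, advantage, *DAPO_CLIP)
print(f"GRPO: {grpo_value}, DAPO: {dapo_value}")

# With a negative advantage both methods share the same lower bound.
grpo_down = clipped_objective(0.7, -1.0, *GRPO_CLIP)
dapo_down = clipped_objective(0.7, -1.0, *DAPO_CLIP)
```

In this example the GRPO range caps the contribution at a ratio of 1.2, while the DAPO range lets the full 1.25 ratio through; for downward moves the two behave identically because the lower bound is the same.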
**GSPO-token**: Group Sequence Policy Optimization, token-level variant ([arXiv:2507.18071](https://arxiv.org/abs/2507.18071)). Uses sequence-level importance sampling with very tight clipping for conservative, stable updates.

* **Sequence-level importance sampling:** Computes a sequence-level KL proxy and broadcasts it to token-level ratios, rather than computing ratios independently per token. This better captures how entire responses differ from the reference policy.
* **Very tight clipping:** Clips the policy ratio to `[1 - 0.0003, 1 + 0.0004]`, much tighter than GRPO or DAPO, making each training step very conservative.
* **No KL penalty:** `kl_loss_coef` is set to 0.
* **Sequence-mean-token-mean aggregation:** Loss is first averaged per-sequence, then averaged across sequences. This prevents longer responses from dominating the loss.

Best for: Stability-sensitive training or when working with long-form outputs where per-sequence normalization matters. The very small clip range means you may need more training steps to converge.

`--rl-kl-beta` is incompatible with GSPO-token. Setting a non-zero `--rl-kl-beta` when `--rl-loss-method gspo-token` is specified will cause job creation to fail with: `loss_config.kl_beta must be 0/unset when method is GSPO_TOKEN`.

**When to use each method:**

| Goal                                            | Recommended method             |
| ----------------------------------------------- | ------------------------------ |
| Safe default for most tasks                     | `grpo`                         |
| Faster convergence, more aggressive exploration | `dapo`                         |
| Maximum stability, long-form outputs            | `gspo-token`                   |
| Keep policy close to reference model            | `grpo` with `--rl-kl-beta > 0` |

**What it does**: Overrides the KL divergence penalty coefficient for GRPO. Higher values keep the policy closer to the reference model; lower values allow more divergence.

**Default**: `0` (uses the loss method's built-in default: `0.001` for GRPO)

**Valid range**: `>= 0`

`--rl-kl-beta` only applies to `--rl-loss-method grpo`.
It is rejected for `dapo` and `gspo-token`, which are designed to operate without KL penalties.

**When to adjust**:

* **Increase** if the model diverges too far from the base model's capabilities (catastrophic forgetting)
* **Decrease or set to 0** if you want the model to explore more freely
* Leave at default for most tasks

## Rollout (Sampling) Parameters

Parameters that control how the model generates responses during training rollouts.

**What it does**: Controls the randomness of the model's token selection during generation. Higher temperature = more random/creative, lower = more deterministic/focused.

**Default**: `0.7`\
**Valid range**: `0.1` to `2.0` (must be >0)

**How it affects outcome**:

* **0.0-0.1 (near-greedy)** → Deterministic outputs with no exploration. Leads to mode collapse and repetitive text. **Avoid in RFT.**
* **0.5-1.0 (sweet spot)** → Good balance of exploration and coherence. Ideal for most RLHF applications.
* **>1.2 (high randomness)** → Very creative but potentially incoherent outputs

**When to adjust**:

* **Lower (0.3-0.5)** for tasks requiring precision, factual accuracy, or safety (less toxic outputs)
* **Raise (1.0-1.2)** for creative tasks like story generation or when you need more diverse rollout exploration
* **Never use 0.0**: greedy sampling breaks RFT by eliminating exploration

**What it does**: Dynamically limits token sampling to the smallest set of tokens whose cumulative probability exceeds threshold p. Only the most probable tokens that together account for a fraction p of the probability mass are considered.

**Default**: `1.0` (considers all tokens)\
**Valid range**: `0` to `1`

`top_p` and `top_k` are both optional and not mutually exclusive. If both are set, `top_k` filters first, then `top_p` narrows by cumulative probability.
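
The ordering described above (top-k first, then top-p) can be sketched in plain Python. This is a minimal illustration and not the sampler Fireworks runs; the function name `filter_top_k_top_p`, the toy distribution, and the choice to renormalize the remaining mass after the top-k cut are assumptions.

```python
def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Apply top-k, then top-p, to a {token: probability} mapping.

    Returns the surviving tokens with probabilities renormalized to sum to 1.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # Step 1: top-k keeps a fixed number of candidates (0 disables it).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Step 2: top-p keeps the smallest prefix whose cumulative share of the
    # remaining mass reaches the threshold.
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break

    kept_total = sum(p for _, p in kept)
    return {token: p / kept_total for token, p in kept}


# Hypothetical next-token distribution.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = filter_top_k_top_p(probs, top_k=3, top_p=0.8)
print(sorted(filtered))
```

Here top-k first drops `d`, and top-p then stops after `a` and `b`, because together they already cover more than 80% of the remaining mass.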
**How it affects outcome**:

* Lower values (0.2-0.5) filter out long-tail, low-probability tokens that often cause hallucinations
* Higher values (0.9-1.0) allow more diversity in outputs
* Prevents the model from selecting very unlikely tokens that may be nonsensical

**When to adjust**:

* **Lower to 0.2-0.5** when your reward function penalizes hallucinations or factual errors
* **Keep at 0.9-1.0** for creative tasks that benefit from diverse vocabulary
* Works well in combination with temperature for fine-grained control

**What it does**: Limits sampling to only the K most probable tokens at each step. A fixed-size cutoff (unlike top-p which is dynamic).

**Default**: `40`\
**Valid range**: `0` to `100` (0 = disabled)

`top_p` and `top_k` are both optional and not mutually exclusive. If both are set, `top_k` filters first, then `top_p` narrows by cumulative probability.

**How it affects outcome**:

* Similar to top-p but uses a fixed number of candidates instead of a probability threshold
* Lower k = more focused, less diverse outputs
* Higher k = more exploration and creativity

**When to adjust**:

* **Combine with temperature** (e.g., temp 0.8 + top-k 40) for balanced creative exploration
* **Keep ≤50** to maintain reasonable inference latency
* Consider using top-p instead for most use cases; it adapts better to varying probability distributions

**What it does**: How many different responses the model generates for each prompt during training. In GRPO terminology, this is the **group size**: the set of completions per prompt used to compute group-relative advantages. The policy optimization algorithm compares these candidates to compute advantages and learn which responses are better. Exposed as `--response-candidates-count` in both `firectl` and the `eval-protocol` CLI.
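
To make the group-relative comparison concrete, here is a minimal sketch of how advantages could be computed within a single group, following the mean and standard-deviation normalization described in the GRPO paper. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration; the managed trainer's exact formula may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical evaluator scores for one prompt with a group size of 4.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every candidate scored the same carries no signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed, uniform)
```

Candidates that score above the group mean receive positive advantages and those below receive negative ones, while a group with identical scores yields all-zero advantages. This is also why a single candidate per prompt gives the algorithm nothing to compare.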
**Default**: `8` (server-side default applied when the field is unset by any client)

**Valid range**: Minimum `2`, no hard upper bound

**How it affects outcome**:

* **n=1** → **Not allowed.** Policy optimization requires multiple candidates to learn from comparisons
* **n=2-4** → Minimal viable exploration. Faster and cheaper but less signal for learning
* **n=8** → Recommended default. Good balance of learning signal and cost for most tasks
* **n=16** → Higher quality signal at higher cost. Consider for complex tasks with nuanced evaluators
* **n>16** → Diminishing returns in most cases. Linearly increases cost and rollout time

**When to adjust**:

* **Increase to 8-16** when you need higher quality learning signal and cost is acceptable
* **Keep at 8** for most experiments; it's the recommended starting point
* **Never set to 1**; this will cause job creation to fail
* Consider the cost tradeoff: each chunk produces `chunk_size × response_candidates_count` rollouts before a training step (e.g., `chunk_size=200` with `n=8` → 1600 rollouts), so higher values linearly increase wall-clock time. See [Chunk Size](#chunk-size) for how chunks and epochs interact.

Higher values of n increase per-prompt memory usage in both the rollout phase and the training step. While there is no enforced maximum, very high values (e.g., >32) may encounter memory pressure depending on model size and sequence length. Values of 8 and 16 are well-tested.

**What it does**: The maximum number of tokens the model can generate in a single response during rollouts.
**Default**: `2048`\
**Valid range**: `16` to `16384`

**How it affects outcome**:

* Directly affects task completion: too short and the model can't finish complex tasks
* Longer responses improve reward on summarization, story generation, and reasoning tasks
* Linearly increases training cost: every token generated costs compute

**When to adjust**:

* **Increase** when your tasks require longer reasoning chains, detailed summaries, or complex multi-step solutions
* **Decrease** to reduce costs for tasks with naturally short outputs (classification, short-form Q\&A)
* Monitor your reward curves: if the model is cutting off mid-response, increase max tokens

**What it does**: Controls how many rollout completions run in parallel during the rollout phase of training. This is a **throughput parameter only**; it does not affect training dynamics, gradient computation, or model quality.

**Default**: Inherited from the evaluator's `@evaluation_test` decorator if not set on the CLI. If the decorator also doesn't set it, the SDK default of `96` applies.

**How it affects outcome**:

* **Higher values** → Faster rollout phase (more completions generated simultaneously)
* **Lower values** → Slower rollout phase but less API load on the inference endpoint
* **No effect** on training loss, advantages, or gradient updates

**When to adjust**:

* **Increase** to speed up the rollout phase if your inference endpoint can handle higher concurrency
* **Decrease** if you're hitting rate limits or timeouts on the inference endpoint
* **Leave unset** to use the evaluator's default, which is tuned for typical workloads

This parameter only controls parallelism during the rollout (sampling) phase. It has no effect on training dynamics: batch composition, advantage normalization, loss computation, and gradient updates are all unaffected.
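
As a rough mental model of what a concurrency cap does, the sketch below bounds the number of in-flight completions with an `asyncio.Semaphore`. It is not how the SDK is implemented; `run_rollouts`, `fake_generate`, and `max_concurrent` are hypothetical names, and the sleep stands in for a call to an inference endpoint.

```python
import asyncio


async def run_rollouts(prompts, generate, max_concurrent):
    """Run generate(prompt) for every prompt, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(prompt):
        async with semaphore:
            return await generate(prompt)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(p) for p in prompts))


# Track how many fake requests are in flight to observe the cap.
stats = {"in_flight": 0, "peak": 0}


async def fake_generate(prompt):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # stand-in for an inference request
    stats["in_flight"] -= 1
    return f"completion for {prompt}"


prompts = [f"p{i}" for i in range(10)]
results = asyncio.run(run_rollouts(prompts, fake_generate, max_concurrent=3))
print(len(results), stats["peak"])
```

Raising the cap shortens the rollout phase only as long as the endpoint can keep up; the set of completions produced is the same either way, which is why training dynamics are unaffected.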
## Zero-Variance Group Filtering

During each training iteration, the model generates K response candidates per prompt (controlled by `--response-candidates-count` or `--n`). Your evaluator scores each candidate. If **all K candidates for a prompt receive the same score**, that group provides no learning signal, because the model cannot distinguish better from worse responses.

**Managed RFT automatically filters out these zero-variance groups.** This applies to all loss methods (GRPO, DAPO, and GSPO-token), not just DAPO.

Important behaviors:

* Filtered prompts are **dropped from the batch**, not replaced with new prompts. This means your effective batch size may be smaller than expected when many groups are homogeneous.
* Filtering happens at both the full-group level (all K candidates same score) and at the chunk level within groups.
* If your evaluator returns the same score for all rollouts across most prompts, training will make limited progress and may trigger early stopping.

**To reduce zero-variance groups:**

* Increase `--temperature` (e.g., 0.8–1.0) to produce more diverse responses
* Increase `--response-candidates-count` to generate more candidates
* Ensure your evaluator returns a range of scores, not just 0 and 1

## Parameter Interactions

Parameters don't work in isolation; they interact in important ways.

These three work together to control sampling behavior. Using all three gives you fine-grained control:

* **Temperature** sets the overall randomness
* **Top-p** dynamically filters by probability mass
* **Top-k** sets a hard limit on candidate tokens

Example: `temperature=0.8, top_p=0.9, top_k=40` gives creative but controlled outputs.

Larger batch sizes provide more stable gradients, which may allow for slightly higher learning rates. However, the default learning rate is tuned for the default batch size; only adjust if you have evidence from your training curves.
Larger base models (70B+) may need higher LoRA ranks to capture complex behaviors, but they also require more resources. For smaller models (\<13B), rank 8-16 is usually sufficient.

## Tuning Strategies

Best practices for adjusting parameters to achieve your training goals.

The default parameters are carefully tuned to work well for most RFT tasks. Don't change them unless you have a clear hypothesis based on your training metrics.

Run at least one baseline experiment with defaults before making any adjustments. This gives you:

* A performance benchmark to compare against
* Understanding of whether parameter tuning is actually needed
* Evidence about which metrics need improvement

Many successful RFT jobs use all default parameters.

When you do adjust parameters, change only one at a time and measure the impact on your reward curves and evaluation metrics.

**Good workflow:**

1. Run baseline with defaults
2. Identify specific issue (e.g., reward crashes, slow convergence)
3. Change ONE parameter that should address that issue
4. Compare results
5. Repeat

**Avoid:** Changing multiple parameters simultaneously. You won't know which change caused the improvement or regression.

Use Weights & Biases integration to:

* Compare training curves across experiments
* Track reward progression over time
* Log all hyperparameters automatically

This makes it easy to identify which parameter changes actually helped and which hurt performance.
Quick reference for goal-directed parameter tuning:

* **Faster convergence** → ↑ epochs (add 1-2), tune learning rate (stay \<2× default)
* **Better quality** → ↑ temperature (1.0-1.2), ↑ rollouts (8-16), ↑ max tokens
* **Safer/less toxic** → ↓ temperature (0.3-0.5), ↓ top-p (0.5), ↓ top-k
* **More creative** → ↑ temperature (1.0-1.2), top-p = 0.9
* **Lower cost** → ↓ rollouts, ↓ max tokens, ↓ batch size
* **Higher capacity** → ↑ LoRA rank (16-32), but monitor memory usage
* **Prevent overfitting** → Keep epochs = 1, consider lower LoRA rank

## Next Steps

* Complete guide to CLI parameters and options
* Launch your RFT job
* Hands-on tutorial showing parameter tuning in practice
* Learn about the RFT training process and workflow

# Single-Turn Training Quickstart

Source: https://docs.fireworks.ai/fine-tuning/quickstart-math

Train a model to be an expert at answering GSM8K math questions

**Reinforcement Fine-Tuning (RFT)** is free for models under 16B parameters. When creating an RFT job in the UI, filter for free tuning models in the model selection area on the [fine-tuning creation page](https://app.fireworks.ai/dashboard/fine-tuning/create). If kicking off jobs from the terminal, you can find the model ID from the [Model Library](https://app.fireworks.ai/models?filter=LLM\&tunable=true). Note: SFT and DPO jobs are billed per training token for all model sizes; see the [pricing page](https://fireworks.ai/pricing) for details.

**Following the [RFT Overview](/fine-tuning/reinforcement-fine-tuning-models)?** This is the **Single-Turn Training** path, the fastest way to get started with RFT.

In this quickstart, you'll train a small language model (`Qwen3 0.6B`) to solve mathematical reasoning problems from the GSM8K dataset.
## What you'll learn

* How to set up and test an evaluator locally, using the Eval Protocol SDK
* How to take that evaluator and use it in an RFT job, from the command line
* How to monitor training progress and evaluate accuracy improvements

Prefer a notebook experience? You can also [run this tutorial in Google Colab](https://colab.research.google.com/drive/16xrb9rx6AoAEOtrDXumzo71HjhunaoPi#scrollTo=CP18QX4tgi-0). Note that Colab requires billing enabled on your Google account.

## Prerequisites

* Python 3.10+
* A Fireworks API key (stored in your shell or .env)
* Command-line access (terminal or shell)

## 1. Install dependencies and set up files

Clone the quickstart-gsm8k repository and install dependencies:

```bash theme={null}
git clone https://github.com/eval-protocol/quickstart-gsm8k.git
cd quickstart-gsm8k
pip install -r requirements.txt
```

Create the `gsm8k_artifacts/` folder structure and copy files:

```bash theme={null}
mkdir -p gsm8k_artifacts/{tests/pytest/gsm8k,development}
cp evaluation.py gsm8k_artifacts/tests/pytest/gsm8k/test_pytest_math_example.py
cp gsm8k_sample.jsonl gsm8k_artifacts/development/gsm8k_sample.jsonl
```

The repository includes:

* **Evaluator** (`evaluation.py`): Defines how to evaluate math answers
* **Dataset** (`gsm8k_sample.jsonl`): Contains example math problems to test on

Install the latest `eval-protocol` SDK, `pytest`, and `requests`:

```bash theme={null}
python -m pip install --upgrade pip
python -m pip install pytest requests git+https://github.com/eval-protocol/python-sdk.git
```

Download the evaluator and dataset files:

Run this Python script to download two files from the Eval Protocol repository into a folder on your machine called `gsm8k_artifacts/`.
* **Test script** (`test_pytest_math_example.py`): Defines how to evaluate math answers
* **Sample dataset** (`gsm8k_sample.jsonl`): Contains example math problems to test on

```python tutorial/download_gsm8k_assets.py theme={null}
from pathlib import Path

import requests

ARTIFACT_ROOT = Path("gsm8k_artifacts")
TEST_PATH = ARTIFACT_ROOT / "tests" / "pytest" / "gsm8k" / "test_pytest_math_example.py"
DATASET_PATH = ARTIFACT_ROOT / "development" / "gsm8k_sample.jsonl"

files_to_download = {
    TEST_PATH: "https://raw.githubusercontent.com/eval-protocol/python-sdk/main/tests/pytest/gsm8k/test_pytest_math_example.py",
    DATASET_PATH: "https://raw.githubusercontent.com/eval-protocol/python-sdk/main/development/gsm8k_sample.jsonl",
}

for local_path, url in files_to_download.items():
    # Create the target folder, then fetch and save each file.
    local_path.parent.mkdir(parents=True, exist_ok=True)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    local_path.write_bytes(response.content)
    print(f"Saved {url} -> {local_path}")
```

Expected output:

```
Saved https://raw.githubusercontent.com/.../test_pytest_math_example.py -> gsm8k_artifacts/tests/pytest/gsm8k/test_pytest_math_example.py
Saved https://raw.githubusercontent.com/.../gsm8k_sample.jsonl -> gsm8k_artifacts/development/gsm8k_sample.jsonl
```

## 2. Test your evaluator locally

In this step, we will test your evaluator by examining the output locally. Feel free to iterate on the evaluator you downloaded in the last step until it gives the output you want.

Open a terminal and run:

```bash theme={null}
ep logs
```

This starts a local server. Open `http://localhost:8000` in your browser and keep this terminal running.

In a **new terminal**, call the test script to run the evaluator on your dataset of sample math problems.

```bash theme={null}
cd gsm8k_artifacts
ep local-test
```

This command discovers and runs your `@evaluation_test` with pytest. As the test runs, you'll see evaluation scores appear in the browser, with detailed logs for each problem the model attempts.
`pytest` will also register your evaluator and dataset with Fireworks automatically, so you can use them in the next step for RFT.

[Image: GSM8K evaluation UI showing model scores and trajectories]

## 3. Start training

First, set your Fireworks API key so the Fireworks CLI can authenticate you:

```bash theme={null}
export FIREWORKS_API_KEY=""
```

Next, we'll launch the RFT job using the evaluator and dataset you just registered. We're using a small base model (`qwen3-0p6b`) to keep training fast and inexpensive. Because your evaluator and dataset were already registered with Fireworks in the last step, we don't need to specify them again here.

```bash theme={null}
cd ..
eval-protocol create rft --base-model accounts/fireworks/models/qwen3-0p6b
```

The CLI will output dashboard links where you can monitor your training job in real time.

[Image: GSM8K evaluation score showing upward trajectory]

You can also store your API key in a `.env` file instead of exporting it each session.

## Monitor your training progress

Your RFT job is now running. You can monitor progress in the dashboard links provided by the CLI output.

Re-run the pytest evaluation command to measure your model's performance on new checkpoints:

```bash theme={null}
cd gsm8k_artifacts
pytest -q tests/pytest/gsm8k/test_pytest_math_example.py::test_math_dataset -s
```

This helps you see how your model's accuracy improves over time and decide when to stop training.

You can adjust the evaluation logic to better fit your needs:

* **Modify reward shaping**: Edit the scoring logic in `test_pytest_math_example.py` to match your answer format expectations
* **Use your own data**: Replace the sample dataset by either editing the JSONL file locally or passing `--dataset-jsonl` when creating the RFT job

### What's happening behind the scenes

Understanding the training workflow:

1.
**Evaluation registration**: The pytest script evaluates a small GSM8K subset using numeric answer checking, then automatically registers both your evaluator and dataset with Fireworks
2. **RFT job creation**: The `create rft` command connects your registered evaluator and dataset to a Reinforcement Fine-Tuning job for your chosen base model
3. **Continuous improvement**: As training progresses, evaluation scores on the held-out set reflect improved accuracy, allowing you to iterate quickly before scaling to larger experiments

## Next steps

* Learn all CLI options to customize your training parameters
* Train agents that run in your production infrastructure
* Understand how reinforcement fine-tuning works

# Remote Agent Quickstart

Source: https://docs.fireworks.ai/fine-tuning/quickstart-svg-agent

Train an SVG drawing agent running in a remote environment

**Reinforcement Fine-Tuning (RFT)** is free for models under 16B parameters. When creating an RFT job in the UI, filter for free tuning models in the model selection area on the [fine-tuning creation page](https://app.fireworks.ai/dashboard/fine-tuning/create). If kicking off jobs from the terminal, you can find the model ID from the [Model Library](https://app.fireworks.ai/models?filter=LLM\&tunable=true). Note: SFT and DPO jobs are billed per training token for all model sizes; see the [pricing page](https://fireworks.ai/pricing) for details.

**Following the [RFT Overview](/fine-tuning/reinforcement-fine-tuning-models)?** This is the **Remote Agent Training** path, for training agents that run in your production infrastructure.

In this quickstart, you'll train an agent to generate SVG drawings. Your agent runs on a remote server (Vercel), which means rollouts happen remotely while Fireworks handles the training. This approach lets you train agents that already live in your production environment.

Here's a quick walkthrough: