# Exporting Billing Metrics
Source: https://docs.fireworks.ai/accounts/exporting-billing-metrics
Export billing and usage metrics for all Fireworks services
## Overview
Fireworks provides a CLI tool to export comprehensive billing metrics for all usage types including serverless inference, on-demand deployments, and fine-tuning jobs. The exported data can be used for cost analysis, internal billing, and usage tracking.
## Exporting billing metrics
Use the Fireworks CLI to export a billing CSV that includes all usage:
```bash theme={null}
# Authenticate (once)
firectl login
# Export billing metrics to CSV
firectl billing export-metrics
```
## Examples
Export all billing metrics for an account:
```bash theme={null}
firectl billing export-metrics
```
Export metrics for a specific date range and filename:
```bash theme={null}
firectl billing export-metrics \
--start-time "2025-01-01" \
--end-time "2025-01-31" \
--filename january_metrics.csv
```
## Output format
The exported CSV includes the following columns:
* **email**: Account email
* **start\_time**: Request start timestamp
* **end\_time**: Request end timestamp
* **usage\_type**: Type of usage (e.g., TEXT\_COMPLETION\_INFERENCE\_USAGE)
* **accelerator\_type**: GPU/hardware type used
* **accelerator\_seconds**: Compute time in seconds
* **base\_model\_name**: The model used
* **model\_bucket**: Model category
* **parameter\_count**: Model size, as a parameter count
* **prompt\_tokens**: Input tokens
* **completion\_tokens**: Output tokens
### Sample row
```csv theme={null}
email,start_time,end_time,usage_type,accelerator_type,accelerator_seconds,base_model_name,model_bucket,parameter_count,prompt_tokens,completion_tokens
user@example.com,2025-10-20 17:16:48 UTC,2025-10-20 17:16:48 UTC,TEXT_COMPLETION_INFERENCE_USAGE,,,accounts/fireworks/models/llama4-maverick-instruct-basic,Llama 4 Maverick Basic,401583781376,803,109
```
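As a rough sketch of downstream cost analysis, the export can be summarized with standard shell tools. The example below totals prompt and completion tokens per `base_model_name` with `awk`. The file name `sample_metrics.csv` and the second data row are made up for illustration, and the plain comma split assumes no field contains a quoted comma; use a real CSV parser if yours do.

```bash
# Write a tiny sample export (hypothetical data in the documented column order).
cat > sample_metrics.csv <<'EOF'
email,start_time,end_time,usage_type,accelerator_type,accelerator_seconds,base_model_name,model_bucket,parameter_count,prompt_tokens,completion_tokens
user@example.com,2025-10-20 17:16:48 UTC,2025-10-20 17:16:48 UTC,TEXT_COMPLETION_INFERENCE_USAGE,,,accounts/fireworks/models/llama4-maverick-instruct-basic,Llama 4 Maverick Basic,401583781376,803,109
user@example.com,2025-10-20 17:20:02 UTC,2025-10-20 17:20:03 UTC,TEXT_COMPLETION_INFERENCE_USAGE,,,accounts/fireworks/models/llama4-maverick-instruct-basic,Llama 4 Maverick Basic,401583781376,197,91
EOF

# summarize_tokens FILE: look up columns by header name, then total tokens per model.
summarize_tokens() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    {
      m = $(col["base_model_name"])
      prompt[m] += $(col["prompt_tokens"])
      completion[m] += $(col["completion_tokens"])
    }
    END { for (m in prompt) printf "%s,%d,%d\n", m, prompt[m], completion[m] }
  ' "$1"
}

summarize_tokens sample_metrics.csv
# prints: accounts/fireworks/models/llama4-maverick-instruct-basic,1000,200
```

Looking columns up by header name rather than position keeps the script working if the column order changes.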
## Automation
Each `firectl billing export-metrics` call supports a maximum 31-day time range.
To export longer historical ranges, run the command in multiple 31-day chunks and
combine the CSV files in your downstream pipeline.
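One way to combine the per-chunk files, shown here as a minimal sketch, is to keep the header row from the first file and skip it in the rest. The file names and the two-column contents below are placeholders for illustration, not real export output.

```bash
# Two stand-in chunk files (hypothetical contents).
printf 'email,prompt_tokens\na@example.com,10\n' > billing_2025-01-01_to_2025-02-01.csv
printf 'email,prompt_tokens\nb@example.com,20\n' > billing_2025-02-01_to_2025-03-04.csv

# merge_csv OUT FILE...: copy the header once, then every data row.
merge_csv() {
  out="$1"; shift
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@" > "$out"
}

merge_csv combined.csv billing_2025-*.csv
cat combined.csv
# prints:
# email,prompt_tokens
# a@example.com,10
# b@example.com,20
```

Give the combined file a name that the input glob does not match, otherwise a second run would read its own output.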
You can automate exports in cron jobs and load the CSV into your internal systems:
```bash theme={null}
# Example: Daily export with dated filename
# (uses BSD/macOS `date -v-1d`; on GNU/Linux use `date -d yesterday` instead)
firectl billing export-metrics \
--start-time "$(date -v-1d '+%Y-%m-%d')" \
--end-time "$(date '+%Y-%m-%d')" \
--filename "billing_$(date '+%Y%m%d').csv"
```
```bash theme={null}
# Example: Backfill 6 months in 31-day chunks (BSD/macOS `date` syntax)
start_date="2025-01-01"
end_date="2025-07-01"
current_start="$start_date"
while [ "$(date -j -f "%Y-%m-%d" "$current_start" "+%s")" -lt "$(date -j -f "%Y-%m-%d" "$end_date" "+%s")" ]; do
current_end="$(date -j -v+31d -f "%Y-%m-%d" "$current_start" "+%Y-%m-%d")"
# Clamp the chunk end to the requested end_date
if [ "$(date -j -f "%Y-%m-%d" "$current_end" "+%s")" -gt "$(date -j -f "%Y-%m-%d" "$end_date" "+%s")" ]; then
current_end="$end_date"
fi
firectl billing export-metrics \
--start-time "$current_start" \
--end-time "$current_end" \
--filename "billing_${current_start}_to_${current_end}.csv"
current_start="$current_end"
done
```
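The loop above relies on BSD/macOS `date` flags (`-j`, `-v`). As a sketch for mixed environments, the helpers below pick GNU or BSD syntax at runtime and only print the windows, so you can check the ranges before running any export. `add_days`, `ymd`, and `print_windows` are illustrative names, not part of `firectl`.

```bash
# add_days DATE N: print DATE plus N days as YYYY-MM-DD (GNU or BSD date).
add_days() {
  if date -u -d '1970-01-01' '+%s' >/dev/null 2>&1; then
    date -u -d "$1 +$2 days" '+%Y-%m-%d'                  # GNU coreutils
  else
    date -u -j -v+"$2"d -f '%Y-%m-%d' "$1" '+%Y-%m-%d'    # BSD/macOS
  fi
}

# ymd DATE: strip dashes so dates compare as plain integers (20250101).
ymd() { printf '%s' "$1" | tr -d '-'; }

# print_windows START END: one "start end" pair per line, each at most 31 days.
print_windows() {
  cur="$1"; end="$2"
  while [ "$(ymd "$cur")" -lt "$(ymd "$end")" ]; do
    nxt="$(add_days "$cur" 31)"
    # Clamp the last window to the requested end date.
    if [ "$(ymd "$nxt")" -gt "$(ymd "$end")" ]; then
      nxt="$end"
    fi
    echo "$cur $nxt"
    cur="$nxt"
  done
}

print_windows 2025-01-01 2025-03-15
# prints:
# 2025-01-01 2025-02-01
# 2025-02-01 2025-03-04
# 2025-03-04 2025-03-15
```

Each printed pair can be passed to `--start-time` and `--end-time` in place of the inline date arithmetic.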
Run `firectl billing export-metrics --help` to see all available flags and
options.
## Coverage
This export includes:
* **Serverless inference**: All serverless API usage
* **On-demand deployments**: Deployment usage (see also [Exporting deployment metrics](/deployments/exporting-metrics) for real-time Prometheus metrics)
* **Fine-tuning jobs**: Fine-tuning compute usage
* **Other services**: All billable Fireworks services
For real-time monitoring of on-demand deployment performance metrics (latency,
throughput, etc.), use the [Prometheus metrics
endpoint](/deployments/exporting-metrics) instead.
## See also
* [firectl CLI overview](/tools-sdks/firectl/firectl)
* [Exporting deployment metrics](/deployments/exporting-metrics) - Real-time Prometheus metrics for on-demand deployments
* [Account quotas](/guides/quotas_usage/account-quotas) - Spending tiers, budget controls, and account-wide request limits
* [Serverless rate limits](/serverless/rate-limits) - Adaptive serverless TPM bounds
# Usage & Cost Breakdown
Source: https://docs.fireworks.ai/accounts/exporting-usage-and-costs
Break down usage and rated costs by deployment, model, API key, or custom tags — via firectl or the billingUsage API
## Overview
Fireworks exposes the same usage-and-cost data through two equivalent surfaces:
* **CLI** — [`firectl billing get-usage`](/tools-sdks/firectl/commands/billing-get-usage), best for ad-hoc queries, shell scripting, and one-off cost reviews.
* **HTTP API** — [`GET /v1/accounts/{account_id}/billingUsage`](/api-reference/get-billing-usage), best for cron jobs, dashboards, and downstream cost-attribution pipelines.
Both return the same response shape and accept the same dimensions. Every example below shows the CLI form and the equivalent cURL side-by-side. Pick whichever fits your workflow.
The output has two parts:
* **Account costs** — rated dollar totals for the range (CLI: prints by default; API: companion `GetBillingSummary` endpoint).
* **Usage** — metered quantities (tokens, accelerator-seconds, audio input seconds) grouped by your chosen dimensions.
This page complements [Exporting Billing Metrics](/accounts/exporting-billing-metrics): use `export-metrics` for a raw per-event CSV dump, and the workflows on this page for grouped, rated views.
CLI examples require `firectl` 1.7.21 or later. Run `firectl version`, then `firectl upgrade` if needed.
## Authentication
For the API, send your Fireworks API key as a bearer token. Any key on the target account works.
```bash theme={null}
export ACCOUNT_ID=""
export FIREWORKS_API_KEY="fw_..."
```
For the CLI, run `firectl login` once; `firectl` then reads credentials from `~/.fireworks/auth.ini`.
## Basic usage
Get an account-wide breakdown for May 2026 (defaults to all usage types, grouped by model for serverless and by deployment + accelerator for dedicated):
```bash theme={null}
firectl billing get-usage \
--start-time 2026-05-01 \
--end-time 2026-06-01
```
Add `-o json` for machine-readable output.
```bash theme={null}
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
-H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
--data-urlencode "startTime=2026-05-01T00:00:00Z" \
--data-urlencode "endTime=2026-06-01T00:00:00Z"
```
## Examples
### Serverless usage by model
```bash theme={null}
firectl billing get-usage \
--start-time 2026-05-01 --end-time 2026-06-01 \
--usage-type serverless \
--group-by model_name
```
```bash theme={null}
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
-H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
--data-urlencode "startTime=2026-05-01T00:00:00Z" \
--data-urlencode "endTime=2026-06-01T00:00:00Z" \
--data-urlencode "usageType=SERVERLESS" \
--data-urlencode "groupBy=model_name"
```
### Serverless usage by API key
Breaks out serverless token consumption per API key. Pass both `api_key_id` (stable internal ID) and `api_key_name` (human-readable label from the console / `firectl api-key create --name`) so the response carries both.
```bash theme={null}
firectl billing get-usage \
--start-time 2026-05-01 --end-time 2026-06-01 \
--usage-type serverless \
--group-by api_key_id \
--group-by api_key_name \
--group-by model_name
```
```bash theme={null}
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
-H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
--data-urlencode "startTime=2026-05-01T00:00:00Z" \
--data-urlencode "endTime=2026-06-01T00:00:00Z" \
--data-urlencode "usageType=SERVERLESS" \
--data-urlencode "groupBy=api_key_id" \
--data-urlencode "groupBy=api_key_name" \
--data-urlencode "groupBy=model_name"
```
Sample row from the API response:
```json theme={null}
{
"startTime": "2026-05-28T00:00:00Z",
"endTime": "2026-05-29T00:00:00Z",
"promptTokens": "1842301",
"completionTokens": "412980",
"audioInputSeconds": 0,
"usageType": "TEXT_COMPLETION_INFERENCE_USAGE",
"group": {
"api_key_id": "key_4nMFyHCSZP4CRKqa",
"api_key_name": "prod-eng",
"model_name": "accounts/fireworks/models/kimi-k2.6"
}
}
```
Token counts come back as JSON **strings** (int64 over JSON). Cast them with `tonumber` in `jq` or the equivalent in your client before doing arithmetic. The deprecated top-level `apiKeyId` field is only populated when `groupBy=api_key_id` is requested — always read API-key values from the `group` map.
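To illustrate the cast, here is a minimal `jq` sketch that totals the tokens in one row. The JSON is a trimmed copy of the sample row above saved to a hypothetical `row.json`, and the example assumes `jq` is installed.

```bash
cat > row.json <<'EOF'
{
  "promptTokens": "1842301",
  "completionTokens": "412980",
  "group": { "api_key_id": "key_4nMFyHCSZP4CRKqa", "api_key_name": "prod-eng" }
}
EOF

# total_tokens FILE: add prompt and completion tokens as numbers.
# Without tonumber, "+" would concatenate the two strings instead of adding them.
total_tokens() {
  jq '(.promptTokens | tonumber) + (.completionTokens | tonumber)' "$1"
}

total_tokens row.json
# prints: 2255281
```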
### Filter to a specific API key
Repeat `--filter` (CLI) or `filter[DIMENSION][values]=` (API), where `DIMENSION` is the dimension name such as `api_key_name`, to OR multiple values for the same dimension.
```bash theme={null}
firectl billing get-usage \
--start-time 2026-05-01 --end-time 2026-06-01 \
--usage-type serverless \
--group-by model_name \
--filter api_key_name=prod-eng
```
```bash theme={null}
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
-H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
--data-urlencode "startTime=2026-05-01T00:00:00Z" \
--data-urlencode "endTime=2026-06-01T00:00:00Z" \
--data-urlencode "usageType=SERVERLESS" \
--data-urlencode "groupBy=model_name" \
--data-urlencode 'filter[api_key_name][values]=prod-eng'
```
### Dedicated deployment usage by deployment and GPU type
```bash theme={null}
firectl billing get-usage \
--start-time 2026-05-01 --end-time 2026-06-01 \
--usage-type dedicated-deployment \
--group-by deployment_name \
--group-by accelerator_type
```
```bash theme={null}
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
-H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
--data-urlencode "startTime=2026-05-01T00:00:00Z" \
--data-urlencode "endTime=2026-06-01T00:00:00Z" \
--data-urlencode "usageType=DEDICATED_DEPLOYMENT" \
--data-urlencode "groupBy=deployment_name" \
--data-urlencode "groupBy=accelerator_type"
```
### Filter to a single deployment
```bash theme={null}
firectl billing get-usage \
--start-time 2026-05-01 --end-time 2026-06-01 \
--filter deployment_name=accounts/my-account/deployments/my-deployment
```
```bash theme={null}
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
-H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
--data-urlencode "startTime=2026-05-01T00:00:00Z" \
--data-urlencode "endTime=2026-06-01T00:00:00Z" \
--data-urlencode 'filter[deployment_name][values]=accounts/my-account/deployments/my-deployment'
```
### Account-level cost totals only
```bash theme={null}
firectl billing get-usage \
--start-time 2026-05-01 --end-time 2026-06-01 \
--account-costs-only
```
Rated dollar totals come from a companion endpoint, `GetBillingSummary`. Use the CLI for this view today; we'll surface the same data through the API in a future release.
## Reference
### CLI flags
| Flag | Description |
| ---------------------- | ---------------------------------------------------------------------------------- |
| `--start-time` | Start time (inclusive), as `YYYY-MM-DD` or `'YYYY-MM-DD hh:mm:ss'`. |
| `--end-time` | End time (exclusive), same formats. |
| `--usage-type` | `all`, `serverless`, or `dedicated-deployment`. Defaults to all. |
| `--group-by` | Dimension to group by. Repeatable. |
| `--filter` | `key=value` filter. Repeatable; repeated values for the same key are OR'ed. |
| `--timezone` | IANA timezone for daily aggregation (e.g. `America/Los_Angeles`). Defaults to UTC. |
| `--account-costs-only` | Print only account-level cumulative costs for the range. |
| `-o, --output` | `text` (default) or `json`. |
Run `firectl billing get-usage --help` for the full list.
### API parameters
The same dimensions are passed as `groupBy=` (repeat for multiple) and `filter[DIMENSION][values]=` (repeat for OR), where `DIMENSION` is a dimension name such as `api_key_name`. `usageType` takes `SERVERLESS` or `DEDICATED_DEPLOYMENT`; omit it to include all usage types. `timezone` and `startTime`/`endTime` mirror the CLI flags. See [the full API reference](/api-reference/get-billing-usage) for parameter schemas and response types.
### Grouping dimensions
Valid `--group-by` / `groupBy` and `--filter` / `filter` dimensions depend on the usage type:
* **Serverless**: `model_name`, `api_key_id`, `api_key_name`, `annotations.team`, `annotations.project`, `annotations.environment`
* **Dedicated deployment**: `deployment_name`, `accelerator_type`, `annotations.team`, `annotations.project`, `annotations.environment`
Dedicated-deployment rows also include the deployment's region (`placement`, e.g. `US`, `EUROPE`, `GLOBAL`) and metered `accelerator_seconds`.
## Custom tags (team / project / environment)
Group by `annotations.team`, `annotations.project`, or `annotations.environment` to split usage by your own labels. The tag source depends on usage type:
* **Dedicated deployments**: set an `annotations` map on the deployment, e.g. `{"team": "search", "project": "x", "environment": "prod"}`.
* **Serverless**: send a per-request header on inference calls:
```http theme={null}
POST /inference/v1/chat/completions HTTP/1.1
Host: api.fireworks.ai
Authorization: Bearer fw_...
Fireworks-Annotations: team=search,project=ranker,environment=prod
Content-Type: application/json
```
Annotation values are validated server-side; unrecognized keys are dropped silently.
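As a sketch of how a client might attach these tags, the helper below assembles the header value in the `key=value,key=value` form shown above. `build_annotations` is an illustrative name, and the commented `curl` call uses a placeholder request body; adapt both to your setup. The helper does no escaping, so it assumes the values contain no commas or equals signs.

```bash
# build_annotations TEAM PROJECT ENVIRONMENT: print the Fireworks-Annotations value.
build_annotations() {
  printf 'team=%s,project=%s,environment=%s\n' "$1" "$2" "$3"
}

annotations="$(build_annotations search ranker prod)"
echo "Fireworks-Annotations: ${annotations}"
# prints: Fireworks-Annotations: team=search,project=ranker,environment=prod

# Hypothetical request (not executed here); the body is a placeholder:
# curl -s "https://api.fireworks.ai/inference/v1/chat/completions" \
#   -H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
#   -H "Fireworks-Annotations: ${annotations}" \
#   -H "Content-Type: application/json" \
#   -d '{"model": "...", "messages": [{"role": "user", "content": "hi"}]}'
```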
## Cookbook: per-API-key reporting recipes
These recipes target the HTTP API, where downstream aggregation in `jq` (or any client) is easiest.
### Aggregate per key, across models
Sums prompt and completion tokens for each API key across every model it called, sorted by prompt volume.
```bash theme={null}
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
-H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
--data-urlencode "startTime=2026-05-01T00:00:00Z" \
--data-urlencode "endTime=2026-06-01T00:00:00Z" \
--data-urlencode "usageType=SERVERLESS" \
--data-urlencode "groupBy=api_key_id" \
--data-urlencode "groupBy=api_key_name" \
--data-urlencode "groupBy=model_name" \
| jq '.serverlessCosts
| group_by(.group.api_key_id)
| map({
api_key_id: .[0].group.api_key_id,
api_key_name: .[0].group.api_key_name,
models: (map(.group.model_name) | unique),
prompt_tokens: ([.[].promptTokens | tonumber] | add),
completion_tokens: ([.[].completionTokens | tonumber] | add)
})
| sort_by(-.prompt_tokens)'
```
### Group by model, then by key (cost-by-tool view)
If reporting starts from "how much did each model cost me, and which keys drove that", flip the nesting:
```bash theme={null}
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
-H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
--data-urlencode "startTime=2026-05-01T00:00:00Z" \
--data-urlencode "endTime=2026-06-01T00:00:00Z" \
--data-urlencode "usageType=SERVERLESS" \
--data-urlencode "groupBy=api_key_id" \
--data-urlencode "groupBy=api_key_name" \
--data-urlencode "groupBy=model_name" \
| jq '.serverlessCosts
| group_by(.group.model_name)
| map({
model: .[0].group.model_name,
api_keys: (
group_by(.group.api_key_id)
| map({
api_key_id: .[0].group.api_key_id,
api_key_name: .[0].group.api_key_name,
prompt_tokens: ([.[].promptTokens | tonumber] | add),
completion_tokens: ([.[].completionTokens | tonumber] | add)
})
| sort_by(-.prompt_tokens)
)
})
| sort_by(.model)'
```
Multiply the token totals by the published [serverless prices](/serverless/pricing) to convert to dollars for chargeback.
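The arithmetic is simple once you have per-model token totals. The sketch below uses made-up totals and made-up per-million-token prices (the `1.00` and `3.00` figures are placeholders, not Fireworks prices); substitute the published rates for each model. `token_cost` is an illustrative helper name.

```bash
# token_cost PROMPT_TOKENS COMPLETION_TOKENS PROMPT_PRICE COMPLETION_PRICE
# Prices are in dollars per 1M tokens; prints the cost in dollars.
token_cost() {
  awk -v p="$1" -v c="$2" -v pp="$3" -v cp="$4" \
    'BEGIN { printf "%.2f\n", (p * pp + c * cp) / 1000000 }'
}

# Hypothetical totals and prices, for illustration only.
token_cost 2000000 500000 1.00 3.00
# prints: 3.50
```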
### Backfill more than 31 days
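The endpoint caps each request at a 31-day window. To pull a longer history, loop over the range in chunks that stay under the cap. The example below steps 30 days at a time and uses GNU `date` (`-d`) syntax: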
```bash theme={null}
start_date="2026-01-01"
end_date="2026-06-01"
current="$start_date"
while [ "$(date -u -d "$current" '+%s')" -lt "$(date -u -d "$end_date" '+%s')" ]; do
next="$(date -u -d "$current +30 days" '+%Y-%m-%d')"
if [ "$(date -u -d "$next" '+%s')" -gt "$(date -u -d "$end_date" '+%s')" ]; then
next="$end_date"
fi
curl -sG "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/billingUsage" \
-H "Authorization: Bearer ${FIREWORKS_API_KEY}" \
--data-urlencode "startTime=${current}T00:00:00Z" \
--data-urlencode "endTime=${next}T00:00:00Z" \
--data-urlencode "usageType=SERVERLESS" \
--data-urlencode "groupBy=api_key_id" \
--data-urlencode "groupBy=api_key_name" \
> "usage_${current}_to_${next}.json"
current="$next"
done
```
## Granularity and freshness
* Usage is aggregated into **daily** buckets (`--timezone` / `timezone=` sets the day boundary). There are no sub-daily buckets.
* Responses are cached for several minutes — fine for cron jobs and dashboards, not for real-time.
## Coverage caveats
* **Tokens, not dollars.** The endpoint returns metered quantities (`promptTokens`, `completionTokens`, `accelerator_seconds`, `audioInputSeconds`). Multiply by the [serverless prices](/serverless/pricing) for cost, or use `--account-costs-only` for account-level dollar totals.
* **Inference types covered today**: text completion / chat completion and audio inference. Embeddings and image generation aren't yet reflected in `billingUsage` responses; coverage will expand in subsequent releases.
* **Dedicated deployments** are attributed at the deployment level, not by API key. Use `usageType=DEDICATED_DEPLOYMENT` with `groupBy=deployment_name` for that breakdown.
Run `firectl billing get-usage --help` to see all available CLI flags and options.
## See also
* [`firectl billing get-usage`](/tools-sdks/firectl/commands/billing-get-usage) - CLI command reference
* [`GET /v1/accounts/{account_id}/billingUsage`](/api-reference/get-billing-usage) - HTTP API reference
* [Exporting Billing Metrics](/accounts/exporting-billing-metrics) - Raw per-event billing CSV export
* [Account quotas](/guides/quotas_usage/account-quotas) - Spending tiers and budget controls
# Service Accounts
Source: https://docs.fireworks.ai/accounts/service-accounts
How to manage and use service accounts in Fireworks
Service accounts in Fireworks allow applications, scripts, and automated systems to authenticate and perform actions securely, without relying on human credentials. They are ideal for CI/CD pipelines, backend services, and automated workflows. Service accounts let you avoid shared credentials and make it easy to tell automated actions apart from human ones in audit logs.
Service accounts can take actions using an API key, such as creating deployments, running models, or creating datasets (see [API reference](https://fireworks.ai/docs/api-reference/introduction)). Service accounts cannot log in through the web interface or use OIDC tokens.
To manage service accounts via the Fireworks web UI visit [app.fireworks.ai/account/users](https://app.fireworks.ai/account/users).
## Creating a Service Account
Use firectl to create a service account:
```bash theme={null}
firectl user create --user-id "my-service-account" --service-account
```
## Creating an API Key for a Service Account
Using firectl you can create an API key on behalf of a service account:
```bash theme={null}
firectl api-key create --service-account "my-service-account"
```
## Roles
You can assign a role when creating a service account using the `--role` flag:
```bash theme={null}
firectl user create --user-id "my-service-account" --service-account --role=contributor
```
If not specified, the default service account role is `user`.
To change the role of an existing service account, use the update command:
```bash theme={null}
firectl user update my-service-account --role=inference-user
```
See [Managing users](/accounts/users) for available roles.
## Listing Service Accounts
To list all service accounts in your account:
```bash theme={null}
firectl user list --filter 'service_account=true'
```
## Billing
* Service accounts count toward the same quotas and limits as the rest of the account
* Usage is tracked per account, not separately for individual users and service accounts
## Auditing
In audit logs, users are referenced by their email addresses. Service accounts are referenced by an address such as `my-service-account@my-account.sa.fireworks.ai`.
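If you filter audit logs in scripts, it can help to build the service-account identity programmatically. This sketch assumes the address follows the pattern suggested by the example above: the service account ID, then `@`, then the account ID, then `.sa.fireworks.ai`. That pattern is an inference, so confirm it against your own audit logs before relying on it. `sa_audit_identity` is an illustrative name.

```bash
# sa_audit_identity SERVICE_ACCOUNT_ID ACCOUNT_ID
# Assumed pattern, inferred from the documented example address.
sa_audit_identity() {
  printf '%s@%s.sa.fireworks.ai\n' "$1" "$2"
}

sa_audit_identity my-service-account my-account
# prints: my-service-account@my-account.sa.fireworks.ai
```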
# Custom SSO
Source: https://docs.fireworks.ai/accounts/sso
Set up custom Single Sign-On (SSO) authentication for Fireworks AI
Fireworks uses single sign-on (SSO) as the primary mechanism to authenticate with the platform.
By default, Fireworks supports Google SSO.
If you have an enterprise account, Fireworks supports bringing your own identity provider using:
* OpenID Connect (OIDC) provider
* SAML 2.0 provider
Coordinate with your Fireworks AI representative to enable the integration.
## OpenID Connect (OIDC) provider
Create an OIDC client application in your identity provider, e.g. Okta.
Ensure the client is configured for "code authorization" of the "web" type (i.e. with a client\_secret).
Set the client's "allowed redirect URL" to the URL provided by Fireworks. It looks like:
```
https://fireworks-.auth.us-west-2.amazoncognito.com/oauth2/idpresponse
```
Note down the `issuer`, `client_id`, and `client_secret` for the newly created client. You will need to provide these to your Fireworks AI representative to complete your account setup.
## SAML 2.0 provider
Create a SAML 2.0 application in your identity provider, e.g. [Okta](https://help.okta.com/en-us/Content/Topics/Apps/Apps_App_Integration_Wizard_SAML.htm).
Set the SSO URL to the URL provided by Fireworks. It looks like:
```
https://fireworks-.auth.us-west-2.amazoncognito.com/saml2/idpresponse
```
Configure the Audience URI (SP Entity ID) as provided by Fireworks. It looks like:
```
urn:amazon:cognito:sp:
```
Create an Attribute Statement with the name:
```
http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress
```
and the value `user.email`
**Okta:** After saving the app, open **Sign On** → **Attribute Statements (SAML)** → expand **Show legacy configuration** → add the attribute statement there. Okta no longer configures this during app creation.
Leave the rest of the settings as defaults.
Note down the "metadata URL" for your newly created application. You will need to provide this to your Fireworks AI representative to complete your account setup.
## Just-In-Time (JIT) user provisioning
JIT user provisioning automatically creates user accounts when they sign in through SSO for the first time. When enabled, users who authenticate through your identity provider are automatically added to your Fireworks account without requiring manual user creation.
To enable JIT user provisioning, use the [`--enable-jit-user-provisioning`](/tools-sdks/firectl/commands/identity-provider-create) flag when creating your identity provider with firectl.
## Enforce SSO
When SSO enforcement is enabled, account access is restricted to users with approved tenant domains only. Users with matching domains must authenticate via the identity provider, and users with other domains are blocked.
To enforce SSO, use the [`--enforce-sso`](/tools-sdks/firectl/commands/identity-provider-create) flag when creating your identity provider with firectl, or toggle "Enforce SSO for all users" in the Fireworks console.
## Troubleshooting
### Invalid samlResponse or relayState from identity provider
This error occurs if you are trying to use identity provider (IdP) initiated login. Fireworks currently only supports
service provider (SP) initiated login.
See [Understanding SAML](https://developer.okta.com/docs/concepts/saml/#understand-sp-initiated-sign-in-flow) for an
in-depth explanation.
### Required String parameter 'RelayState' is not present
See above.
# Managing users
Source: https://docs.fireworks.ai/accounts/users
Add, delete, and manage roles for users in your Fireworks account
See the concepts [page](/getting-started/concepts#account) for definitions of accounts and users.
## User roles
Each user in an account is assigned a role that determines their level of access:
| Role | Description |
| :----------------- | :---------------------------------------------------------------------------------------------------------------------- |
| **Admin** | Full administrative control over resources, users, and access. Can manage all account settings and add or remove users. |
| **User** (default) | Can manage all resources, including those owned by others, but cannot manage users or access settings. |
| **Contributor** | Can run inference on any resource and create and manage their own resources. Cannot modify resources owned by others. |
| **Inference User** | Can view all resources and run inference, but cannot create or modify resources. |
The `contributor` and `inference-user` roles are newer roles that provide more granular access control. Contact Fireworks support if you need these roles enabled for your account.
#### Resource management
| Permission | Inference User | Contributor | User | Admin |
| :--------------------------------------------------------------------- | :------------: | :---------: | :--: | :---: |
| Execute inference on any deployment | ✅ | ✅ | ✅ | ✅ |
| View all resources (deployments, models, fine tuning jobs, datasets) | ✅ | ✅ | ✅ | ✅ |
| Create new resources (deployments, models, fine tuning jobs, datasets) | ❌ | ✅ | ✅ | ✅ |
| Manage their own resources (edit/delete) | ❌ | ✅ | ✅ | ✅ |
| Manage resources owned by others (edit/delete) | ❌ | ❌ | ✅ | ✅ |
#### API key & account management
| Permission | Inference User | Contributor | User | Admin |
| :----------------------------------------------- | :------------: | :---------: | :--: | :---: |
| Manage self-owned API keys (create/delete) | ✅ | ✅ | ✅ | ✅ |
| View all users and service accounts | ✅ | ✅ | ✅ | ✅ |
| Create service account API keys | ❌ | ❌ | ❌ | ✅ |
| Delete other users' and service accounts' API keys |       ❌       |      ❌     |  ❌  |   ✅  |
| Add/modify/delete users and their access | ❌ | ❌ | ❌ | ✅ |
## Adding users
To add a new user to your Fireworks account, run the following command. If the email for the new user is already associated with a Fireworks account, they will have the option to freely switch between your account and their existing account(s). You can also add users in the Fireworks web UI at [https://app.fireworks.ai/account/users](https://app.fireworks.ai/account/users).
```bash theme={null}
firectl user create --email="alice@example.com"
```
To create another admin user, pass the `--role=admin` flag:
```bash theme={null}
firectl user create --email="alice@example.com" --role=admin
```
## Updating a user's role
To update a user's role, run:
```bash theme={null}
firectl user update USER_ID --role=ROLE
```
Where `USER_ID` is the ID of the user to update and `ROLE` is one of: `admin`, `user`, `contributor`, or `inference-user`.
## Deleting users
You can remove a user from your account by running the following command, where `USER_ID` is the ID of the user to remove:
```bash theme={null}
firectl user delete USER_ID
```
# Create a Message
Source: https://docs.fireworks.ai/api-reference/anthropic-messages
post /v1/messages
**Anthropic-compatible endpoint.**
Send a structured list of input messages with text and/or image content, and the model will generate the next message in the conversation.
The Messages API can be used for either single queries or stateless multi-turn conversations.
**Fireworks Quickstarts:**
- [Serverless Quickstart](/getting-started/quickstart)
- [Deployments Quickstart](/getting-started/ondemand-quickstart)
This endpoint provides an Anthropic-compatible Messages API surface on Fireworks.
For setup, supported features, and known differences, see [Anthropic compatibility](/tools-sdks/anthropic-compatibility).
# Cancel Reinforcement Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/cancel-reinforcement-fine-tuning-job
post /v1/accounts/{account_id}/reinforcementFineTuningJobs/{reinforcement_fine_tuning_job_id}:cancel
# Create API Key
Source: https://docs.fireworks.ai/api-reference/create-api-key
post /v1/accounts/{account_id}/users/{user_id}/apiKeys
# Create Batch Inference Job
Source: https://docs.fireworks.ai/api-reference/create-batch-inference-job
post /v1/accounts/{account_id}/batchInferenceJobs
# Create Dataset
Source: https://docs.fireworks.ai/api-reference/create-dataset
post /v1/accounts/{account_id}/datasets
# Load LoRA
Source: https://docs.fireworks.ai/api-reference/create-deployed-model
post /v1/accounts/{account_id}/deployedModels
# Create Deployment
Source: https://docs.fireworks.ai/api-reference/create-deployment
post /v1/accounts/{account_id}/deployments
## Creating a deployment with a deployment shape
[Deployment shapes](/guides/ondemand-deployments#deployment-shapes) are pre-configured templates optimized for speed, cost, or efficiency. To create a deployment with a specific shape, pass the `deploymentShape` field in the request body along with `baseModel`.
Use the [List Deployment Shape Versions](/api-reference/list-deployment-shape-versions) endpoint to find available shapes for your model.
```bash theme={null}
curl -X POST "https://api.fireworks.ai/v1/accounts/YOUR_ACCOUNT_ID/deployments" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"baseModel": "accounts/fireworks/models/gpt-oss-120b",
"deploymentShape": "accounts/fireworks/deploymentShapes/gpt-oss-120b-minimal",
"minReplicaCount": 0,
"maxReplicaCount": 1
}'
```
When using a deployment shape, you do not need to specify `activeModelVersion` or `targetModelVersion` — the shape provides the necessary configuration.
# Create dpo job
Source: https://docs.fireworks.ai/api-reference/create-dpo-job
post /v1/accounts/{account_id}/dpoJobs
# Create Evaluation Job
Source: https://docs.fireworks.ai/api-reference/create-evaluation-job
post /v1/accounts/{account_id}/evaluationJobs
# Create Evaluator
Source: https://docs.fireworks.ai/api-reference/create-evaluator
post /v1/accounts/{account_id}/evaluatorsV2
Creates a custom evaluator for scoring model outputs. Evaluators use the
[Eval Protocol](https://evalprotocol.io) to define test cases, run model
inference, and score responses. They are used with evaluation jobs and
Reinforcement Fine-Tuning (RFT).
## Source Code Requirements
Your project should contain:
- `requirements.txt` - Python dependencies for your evaluator
- `test_*.py` - Pytest test file(s) with
[`@evaluation_test`](https://evalprotocol.io/reference/evaluation-test)
decorated functions
- Any additional code/modules your evaluator needs
## Workflow
**Recommended:** Use the [`ep upload`](https://evalprotocol.io/reference/cli#ep-upload)
CLI command to handle all these steps automatically.
If using the API directly:
1. Call this endpoint to create the evaluator resource
2. Package your source directory as a `.tar.gz` (respecting `.gitignore`)
3. Call [Get Evaluator Upload Endpoint](/api-reference/get-evaluator-upload-endpoint) to get a signed upload URL
4. `PUT` the tar.gz file to the signed URL
5. Call [Validate Evaluator Upload](/api-reference/validate-evaluator-upload) to trigger server-side validation
6. Poll [Get Evaluator](/api-reference/get-evaluator) until ready
Once active, reference the evaluator in [Create Evaluation Job](/api-reference/create-evaluation-job) or [Create Reinforcement Fine-tuning Job](/api-reference/create-reinforcement-fine-tuning-job).
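Steps 2 and 4 can be sketched in shell as follows. This is an illustration under assumptions: `package_evaluator` is a hypothetical helper, `SIGNED_URL` stands for the URL returned in step 3, and the upload may need extra headers depending on how the URL was signed. Inside a Git work tree the helper archives tracked files plus untracked files that are not ignored; elsewhere it archives the whole directory.

```bash
# package_evaluator SRC_DIR OUT_TARBALL (OUT_TARBALL must be an absolute path)
package_evaluator() {
  src_dir="$1"; out="$2"
  if git -C "$src_dir" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    # List files Git would not ignore, NUL-separated, and feed them to tar.
    (cd "$src_dir" && git ls-files -co --exclude-standard -z \
      | tar -czf "$out" --null -T -)
  else
    # Not a Git work tree: archive everything in the directory.
    tar -czf "$out" -C "$src_dir" .
  fi
}

# Demo with a throwaway project directory (file contents are placeholders).
work="$(mktemp -d)"
mkdir "$work/src"
printf 'pytest\n' > "$work/src/requirements.txt"
printf '# evaluator tests go here\n' > "$work/src/test_example.py"
package_evaluator "$work/src" "$work/evaluator.tar.gz"
tar -tzf "$work/evaluator.tar.gz"

# Step 4 (not executed here): upload the archive to the signed URL from step 3.
# curl -sS --upload-file "$work/evaluator.tar.gz" "$SIGNED_URL"
```

`curl --upload-file` sends the file with a `PUT` request, which matches step 4.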
# Create Model
Source: https://docs.fireworks.ai/api-reference/create-model
post /v1/accounts/{account_id}/models
# Create Reinforcement Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/create-reinforcement-fine-tuning-job
post /v1/accounts/{account_id}/reinforcementFineTuningJobs
# Create Reinforcement Fine-tuning Step
Source: https://docs.fireworks.ai/api-reference/create-reinforcement-fine-tuning-step
post /v1/accounts/{account_id}/rlorTrainerJobs
# Create Router
Source: https://docs.fireworks.ai/api-reference/create-router
post /v1/accounts/{account_id}/routers
# Create secret
Source: https://docs.fireworks.ai/api-reference/create-secret
post /v1/accounts/{account_id}/secrets
# Create Supervised Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/create-supervised-fine-tuning-job
post /v1/accounts/{account_id}/supervisedFineTuningJobs
# Create User
Source: https://docs.fireworks.ai/api-reference/create-user
post /v1/accounts/{account_id}/users
# Create embeddings
Source: https://docs.fireworks.ai/api-reference/creates-an-embedding-vector-representing-the-input-text
post /embeddings
# Delete API Key
Source: https://docs.fireworks.ai/api-reference/delete-api-key
post /v1/accounts/{account_id}/users/{user_id}/apiKeys:delete
# Delete Batch Inference Job
Source: https://docs.fireworks.ai/api-reference/delete-batch-inference-job
delete /v1/accounts/{account_id}/batchInferenceJobs/{batch_inference_job_id}
# Delete Dataset
Source: https://docs.fireworks.ai/api-reference/delete-dataset
delete /v1/accounts/{account_id}/datasets/{dataset_id}
# Unload LoRA
Source: https://docs.fireworks.ai/api-reference/delete-deployed-model
delete /v1/accounts/{account_id}/deployedModels/{deployed_model_id}
# Delete Deployment
Source: https://docs.fireworks.ai/api-reference/delete-deployment
delete /v1/accounts/{account_id}/deployments/{deployment_id}
# Delete dpo job
Source: https://docs.fireworks.ai/api-reference/delete-dpo-job
delete /v1/accounts/{account_id}/dpoJobs/{dpo_job_id}
# Delete Evaluation Job
Source: https://docs.fireworks.ai/api-reference/delete-evaluation-job
delete /v1/accounts/{account_id}/evaluationJobs/{evaluation_job_id}
# Delete Evaluator
Source: https://docs.fireworks.ai/api-reference/delete-evaluator
delete /v1/accounts/{account_id}/evaluators/{evaluator_id}
Deletes an evaluator and its associated versions and build artifacts.
# Delete Model
Source: https://docs.fireworks.ai/api-reference/delete-model
delete /v1/accounts/{account_id}/models/{model_id}
# Delete Reinforcement Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/delete-reinforcement-fine-tuning-job
delete /v1/accounts/{account_id}/reinforcementFineTuningJobs/{reinforcement_fine_tuning_job_id}
# Delete Reinforcement Fine-tuning Step
Source: https://docs.fireworks.ai/api-reference/delete-reinforcement-fine-tuning-step
delete /v1/accounts/{account_id}/rlorTrainerJobs/{rlor_trainer_job_id}
# Delete Response
Source: https://docs.fireworks.ai/api-reference/delete-response
delete /v1/responses/{response_id}
Deletes a model response by its ID. Once deleted, the response data will be gone immediately and permanently.
The response cannot be recovered and any conversations that reference this response ID will no longer be able to access it.
# Delete Router
Source: https://docs.fireworks.ai/api-reference/delete-router
delete /v1/accounts/{account_id}/routers/{router_id}
# Delete secret
Source: https://docs.fireworks.ai/api-reference/delete-secret
delete /v1/accounts/{account_id}/secrets/{secret_id}
# Delete Supervised Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/delete-supervised-fine-tuning-job
delete /v1/accounts/{account_id}/supervisedFineTuningJobs/{supervised_fine_tuning_job_id}
# Execute one training step for keep-alive Reinforcement Fine-tuning Step
Source: https://docs.fireworks.ai/api-reference/execute-reinforcement-fine-tuning-step
post /v1/accounts/{account_id}/rlorTrainerJobs/{rlor_trainer_job_id}:executeTrainStep
# Generate an image with FLUX.1 [schnell] FP8
Source: https://docs.fireworks.ai/api-reference/generate-a-new-image-from-a-text-prompt
POST https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/flux-1-schnell-fp8/text_to_image
[FLUX.1
\[schnell\]](https://huggingface.co/fireworks-ai/FLUX.1-schnell-fp8-flumina) is a
12 billion parameter rectified flow transformer capable of generating images
from text descriptions. The FP8 version uses reduced precision numerics for 2x
faster inference.
See our
[Playground](https://app.fireworks.ai/playground?model=accounts/fireworks/models/flux-1-schnell-fp8)
to quickly try it out in your browser.
## Headers
* **Accept**: Specifies which format to return the response in. With `image/png` and `image/jpeg`, the server populates the response body with a binary image of the specified format.
* **Content-Type**: The media type of the request body.
* **Authorization**: Bearer token containing your Fireworks API key.
## Request Body
* **prompt**: Prompt to use for the image generation process.
* **aspect\_ratio**: Aspect ratio of the generated image. Options: `1:1`, `21:9`, `16:9`, `3:2`, `5:4`, `4:5`, `2:3`, `9:16`, `9:21`, `4:3`, `3:4`.
* **guidance\_scale**: Classifier-free guidance scale for the image diffusion process. Default value is 3.5.
* **num\_inference\_steps**: Number of denoising steps for the image generation process. Default value is 4.
* **seed**: Random seed to use for the image generation process. If 0, a random seed is used.
```python Python theme={null}
import requests

url = "https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/flux-1-schnell-fp8/text_to_image"
headers = {
    "Content-Type": "application/json",
    "Accept": "image/jpeg",
    "Authorization": "Bearer $API_KEY",
}
data = {
    "prompt": "A beautiful sunset over the ocean"
}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    with open("a.jpg", "wb") as f:
        f.write(response.content)
    print("Image saved as a.jpg")
else:
    print("Error:", response.status_code, response.text)
```
```typescript TypeScript theme={null}
import fs from "fs";
import fetch from "node-fetch";

(async () => {
  const response = await fetch("https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/flux-1-schnell-fp8/text_to_image", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Accept": "image/jpeg",
      "Authorization": "Bearer $API_KEY"
    },
    body: JSON.stringify({
      prompt: "A beautiful sunset over the ocean"
    }),
  });

  // To process the response and get the image:
  const buffer = await response.arrayBuffer();
  fs.writeFile('a.jpg', Buffer.from(buffer), () => console.log('Finished downloading!'));
})().catch(console.error);
```
```shell curl theme={null}
curl --request POST \
-S --fail-with-body \
--url https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/flux-1-schnell-fp8/text_to_image \
-H 'Content-Type: application/json' \
-H 'Accept: image/jpeg' \
-H "Authorization: Bearer $API_KEY" \
--data '
{
"prompt": "A beautiful sunset over the ocean"
}' -o a.jpg
```
```json Accept: application/json theme={null}
{
  "id": "1234567890",
  "base64": ["data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...", "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA..."],
  "finishReason": "SUCCESS",
  "seed": 1234567890
}
```
```txt Accept: image/jpeg theme={null}
/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBQYFBAYGBQYHBwYIChAKCgkJChQODwwQFxQYGBcUFhYaHSUfGhsjHBYWICwgIyYnKSopGR8tMC0oMCUoKSj/2wBDAQcHBwoIChMKChMoGhYaKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCj/wAARCAABAAEDASIAAhEBAxEB/8QAFQABAQAAAAAAAAAAAAAAAAAAAAv/xAAUEAEAAAAAAAAAAAAAAAAAAAAA/8QAFQEBAQAAAAAAAAAAAAAAAAAAAAX/xAAUEQEAAAAAAAAAAAAAAAAAAAAA/9oADAMBAAIRAxEAPwCdABmX/9k=
```
```txt Accept: image/png theme={null}
iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNkYPhfDwAChwGA60e6kgAAAABJRU5ErkJggg==
```
## Response
* **id**: The unique identifier for the image generation request.
* **base64**: Includes a base64-encoded string containing an image in PNG format. To retrieve the image, base64-decode the string into binary data, then load that binary data as a PNG file.
* **finishReason**: Specifies the outcome of the image generation process. Can be `SUCCESS`, indicating that the image was successfully generated, or `CONTENT_FILTERED`, if the image was filtered due to the safety\_check=true parameter being set.
* **seed**: The seed used for the image generation process.
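The decoding step for the `application/json` response can be sketched as follows. This is an illustrative sketch: the helper name `decode_image_entry` is hypothetical, and the handling of the `data:image/png;base64,` prefix is based only on the sample response shown above.

```python
import base64

def decode_image_entry(entry):
    """Return raw image bytes from a base64 string, with or without a data-URI prefix."""
    # The sample response shows entries like "data:image/png;base64,<payload>".
    if entry.startswith("data:") and "," in entry:
        entry = entry.split(",", 1)[1]
    return base64.b64decode(entry)
```

The returned bytes can be written to a file opened in binary mode (`"wb"`) and then loaded as a PNG.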
When the Accept type is `image/jpeg` or `image/png`, the response body will contain a binary image in that format. Additionally, the response will include headers such as:
* **Content-Length**: Represents the length of the binary image content.
* **Seed**: The random seed used to generate the image.
* **Finish-Reason**: Indicates the outcome of the image generation, such as `CONTENT_FILTERED` or `SUCCESS`.
# Generate or edit an image with FLUX.1 Kontext
Source: https://docs.fireworks.ai/api-reference/generate-or-edit-image-using-flux-kontext
POST https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}
💡 Note that this API is async and will return the **request\_id** instead of the image. Call the [get\_result](/api-reference/get-generated-image-from-flux-kontex) API to obtain the generated image.
FLUX Kontext Pro is a specialized model for generating contextually-aware images from text descriptions. Designed for professional use cases requiring high-quality, consistent image generation.
Use our [Playground](https://app.fireworks.ai/playground?model=accounts/fireworks/models/flux-kontext-pro) to quickly try it out in your browser.
FLUX Kontext Max is the most advanced model in the Kontext series, offering maximum quality and context understanding. Ideal for enterprise applications requiring the highest level of image generation performance.
Use our [Playground](https://app.fireworks.ai/playground?model=accounts/fireworks/models/flux-kontext-max) to quickly try it out in your browser.
## Path
The model to use for image generation. Use **flux-kontext-pro** or **flux-kontext-max** as the model name in the API.
## Headers
* **Content-Type**: The media type of the request body.
* **Authorization**: Your Fireworks API key.
## Request Body
* **prompt**: Prompt to use for the image generation process.
* **input\_image**: Base64 encoded image or URL to use with Kontext.
* **seed**: Optional seed for reproducibility.
* **aspect\_ratio**: Aspect ratio of the image, between 21:9 and 9:21.
* **output\_format**: Output format for the generated image. Options: `jpeg`, `png`.
* **webhook\_url**: URL to receive webhook notifications. Length: 1-2083 characters.
* **webhook\_secret**: Optional secret for webhook signature verification.
* **prompt\_upsampling**: Whether to perform upsampling on the prompt. If active, automatically modifies the prompt for more creative generation.
* **safety\_tolerance**: Tolerance level for input and output moderation. Range: 0-6, with 0 being most strict and 6 being least strict. Limit of 2 for Image to Image.
```python Python theme={null}
import requests

url = "https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer $API_KEY",
}
data = {
    "prompt": "A beautiful sunset over the ocean",
    "input_image": "",
    "seed": 42,
    "aspect_ratio": "",
    "output_format": "jpeg",
    "webhook_url": "",
    "webhook_secret": "",
    "prompt_upsampling": False,
    "safety_tolerance": 2
}

response = requests.post(url, headers=headers, json=data)
```
```typescript TypeScript theme={null}
import fs from "fs";
import fetch from "node-fetch";

(async () => {
  const response = await fetch("https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer $API_KEY"
    },
    body: JSON.stringify({
      prompt: "A beautiful sunset over the ocean"
    }),
  });
})().catch(console.error);
```
```shell curl theme={null}
curl --request POST \
-S --fail-with-body \
--url https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model} \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $API_KEY" \
--data '
{
"prompt": "A beautiful sunset over the ocean"
}'
```
## Response
Successful Response
request id
Unsuccessful Response
error message
# Get Account
Source: https://docs.fireworks.ai/api-reference/get-account
get /v1/accounts/{account_id}
# Get Batch Inference Job
Source: https://docs.fireworks.ai/api-reference/get-batch-inference-job
get /v1/accounts/{account_id}/batchInferenceJobs/{batch_inference_job_id}
# Get Account Usage
Source: https://docs.fireworks.ai/api-reference/get-billing-usage
get /v1/accounts/{account_id}/billingUsage
# Get Dataset
Source: https://docs.fireworks.ai/api-reference/get-dataset
get /v1/accounts/{account_id}/datasets/{dataset_id}
# Get Dataset Download Endpoint
Source: https://docs.fireworks.ai/api-reference/get-dataset-download-endpoint
get /v1/accounts/{account_id}/datasets/{dataset_id}:getDownloadEndpoint
# Get Dataset Upload Endpoint
Source: https://docs.fireworks.ai/api-reference/get-dataset-upload-endpoint
post /v1/accounts/{account_id}/datasets/{dataset_id}:getUploadEndpoint
# Get LoRA
Source: https://docs.fireworks.ai/api-reference/get-deployed-model
get /v1/accounts/{account_id}/deployedModels/{deployed_model_id}
# Get Deployment
Source: https://docs.fireworks.ai/api-reference/get-deployment
get /v1/accounts/{account_id}/deployments/{deployment_id}
# Get Deployment Shape
Source: https://docs.fireworks.ai/api-reference/get-deployment-shape
get /v1/accounts/{account_id}/deploymentShapes/{deployment_shape_id}
# Get Deployment Shape Version
Source: https://docs.fireworks.ai/api-reference/get-deployment-shape-version
get /v1/accounts/{account_id}/deploymentShapes/{deployment_shape_id}/versions/{version_id}
# Get dpo job
Source: https://docs.fireworks.ai/api-reference/get-dpo-job
get /v1/accounts/{account_id}/dpoJobs/{dpo_job_id}
# Get dpo job metrics file endpoint
Source: https://docs.fireworks.ai/api-reference/get-dpo-job-metrics-file-endpoint
get /v1/accounts/{account_id}/dpoJobs/{dpo_job_id}:getMetricsFileEndpoint
# Get Evaluation Job
Source: https://docs.fireworks.ai/api-reference/get-evaluation-job
get /v1/accounts/{account_id}/evaluationJobs/{evaluation_job_id}
# Get Evaluation Job execution logs (stream log endpoint + tracing IDs).
Source: https://docs.fireworks.ai/api-reference/get-evaluation-job-log-endpoint
get /v1/accounts/{account_id}/evaluationJobs/{evaluation_job_id}:getExecutionLogEndpoint
# Get Evaluator
Source: https://docs.fireworks.ai/api-reference/get-evaluator
get /v1/accounts/{account_id}/evaluators/{evaluator_id}
Retrieves an evaluator by name. Use this to monitor build progress after
creation (**step 6** in the [Create Evaluator](/api-reference/create-evaluator) workflow).
Possible states:
- `BUILDING` - Environment is being prepared
- `ACTIVE` - Evaluator is ready to use
- `BUILD_FAILED` - Check build logs via [Get Evaluator Build Log Endpoint](/api-reference/get-evaluator-build-log-endpoint)
# Get Evaluator Build Log Endpoint
Source: https://docs.fireworks.ai/api-reference/get-evaluator-build-log-endpoint
get /v1/accounts/{account_id}/evaluators/{evaluator_id}:getBuildLogEndpoint
Returns a signed URL to download the evaluator's build logs. Useful for
debugging `BUILD_FAILED` state.
# Get Evaluator Source Code Endpoint
Source: https://docs.fireworks.ai/api-reference/get-evaluator-source-code-endpoint
get /v1/accounts/{account_id}/evaluators/{evaluator_id}:getSourceCodeSignedUrl
Returns a signed URL to download the evaluator's source code archive.
Useful for debugging or reviewing the uploaded code.
# Get Evaluator Upload Endpoint
Source: https://docs.fireworks.ai/api-reference/get-evaluator-upload-endpoint
post /v1/accounts/{account_id}/evaluators/{evaluator_id}:getUploadEndpoint
Returns signed URLs for uploading evaluator source code (**step 3** in the
[Create Evaluator](/api-reference/create-evaluator) workflow). After receiving
the signed URL, upload your `.tar.gz` archive using HTTP `PUT` with
`Content-Type: application/octet-stream` header.
# Get generated image from FLUX.1 Kontext
Source: https://docs.fireworks.ai/api-reference/get-generated-image-from-flux-kontex
GET https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}/get_result
Replace **model** with **flux-kontext-pro** in the API to get the result.
Replace **model** with **flux-kontext-max** in the API to get the result.
## Path
The model to use for image generation. Use **flux-kontext-pro** or **flux-kontext-max** as the model name in the API.
## Headers
* **Content-Type**: The media type of the request body.
* **Authorization**: Your Fireworks API key.
## Request Body
* **id**: Request id generated from create/edit image request.
```python Python theme={null}
import requests

url = "https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}/get_result"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer $API_KEY",
}
data = {
    # The key must be a quoted string; a bare `id` refers to Python's builtin function.
    "id": "request_id"
}

response = requests.post(url, headers=headers, json=data)
print(response.text)
```
```typescript TypeScript theme={null}
import fs from "fs";
import fetch from "node-fetch";

(async () => {
  const response = await fetch("https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}/get_result", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer $API_KEY"
    },
    body: JSON.stringify({
      id: "request_id"
    }),
  });
})().catch(console.error);
```
```shell curl theme={null}
curl --request POST \
-S --fail-with-body \
--url https://api.fireworks.ai/inference/v1/workflows/accounts/fireworks/models/{model}/get_result \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $API_KEY" \
--data '
{
"id": "request_id"
}'
```
## Response
Task id for retrieving result
Available options: Task not found, Pending, Request Moderated, Content Moderated, Ready, Error
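Because image generation is asynchronous, a client typically polls this endpoint until the status leaves the pending state. The loop below is a minimal sketch under stated assumptions: `fetch_status` is a hypothetical callable that performs the `get_result` request and returns one of the status strings listed above, and treating `Task not found` as a retryable state is an assumption, not documented behavior.

```python
import time

# Assumption: these two statuses mean "keep waiting"; all others are final.
PENDING_STATUSES = {"Pending", "Task not found"}

def wait_for_result(fetch_status, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch_status() until it returns a non-pending status or attempts run out."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if status not in PENDING_STATUSES:
            return status
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"No final status after {max_attempts} attempts")
```

The caller should still check the returned status: only `Ready` indicates that an image is available, while `Error`, `Request Moderated`, and `Content Moderated` end the wait without one.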
# Get Model
Source: https://docs.fireworks.ai/api-reference/get-model
get /v1/accounts/{account_id}/models/{model_id}
# Get Model Download Endpoint
Source: https://docs.fireworks.ai/api-reference/get-model-download-endpoint
get /v1/accounts/{account_id}/models/{model_id}:getDownloadEndpoint
# Get Model Upload Endpoint
Source: https://docs.fireworks.ai/api-reference/get-model-upload-endpoint
post /v1/accounts/{account_id}/models/{model_id}:getUploadEndpoint
# Get Quota
Source: https://docs.fireworks.ai/api-reference/get-quota
get /v1/accounts/{account_id}/quotas/{quota_id}
Gets a single quota by resource name.
# Get Reinforcement Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/get-reinforcement-fine-tuning-job
get /v1/accounts/{account_id}/reinforcementFineTuningJobs/{reinforcement_fine_tuning_job_id}
# Get Reinforcement Fine-tuning Step
Source: https://docs.fireworks.ai/api-reference/get-reinforcement-fine-tuning-step
get /v1/accounts/{account_id}/rlorTrainerJobs/{rlor_trainer_job_id}
# Get Response
Source: https://docs.fireworks.ai/api-reference/get-response
get /v1/responses/{response_id}
# Get Router
Source: https://docs.fireworks.ai/api-reference/get-router
get /v1/accounts/{account_id}/routers/{router_id}
# Get Secret
Source: https://docs.fireworks.ai/api-reference/get-secret
get /v1/accounts/{account_id}/secrets/{secret_id}
Retrieves a secret by name. Note that the `value` field is not returned in the response for security reasons. Only the `name` and `key_name` fields are included.
# Get Supervised Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/get-supervised-fine-tuning-job
get /v1/accounts/{account_id}/supervisedFineTuningJobs/{supervised_fine_tuning_job_id}
# Get User
Source: https://docs.fireworks.ai/api-reference/get-user
get /v1/accounts/{account_id}/users/{user_id}
# Introduction
Source: https://docs.fireworks.ai/api-reference/introduction
Fireworks AI REST API enables you to interact with various language, image and embedding models using an API Key. It also lets you automate management of models, deployments, datasets, and more.
## Authentication
All requests made to the Fireworks AI REST API must include an `Authorization` header with a valid `Bearer` token using your API key, along with the `Content-Type: application/json` header.
### Getting your API key
You can obtain an API key by:
* Using the [`firectl api-key create`](/tools-sdks/firectl/commands/api-key-create) command
* Generating one through the [Fireworks AI dashboard](https://app.fireworks.ai/settings/users/api-keys)
### Request headers
Include the following headers in your REST API requests:
```json theme={null}
authorization: Bearer <API_KEY>
content-type: application/json
```
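In Python, the same two headers can be assembled as a dictionary and passed to an HTTP client. This is a small sketch: the helper name `build_headers` is hypothetical, and reading the key from a `FIREWORKS_API_KEY` environment variable is an assumed convention rather than a requirement of the API.

```python
import os

def build_headers(api_key=None):
    """Return the Authorization and Content-Type headers for a JSON request."""
    # Assumption: the key lives in FIREWORKS_API_KEY when not passed explicitly.
    key = api_key or os.environ.get("FIREWORKS_API_KEY")
    if not key:
        raise ValueError("No API key provided")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source code and version control.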
## Account management APIs
In addition to inference and deployment APIs, Fireworks exposes account-scoped
quota endpoints.
* [List Quotas](/api-reference/list-quotas)
* [Get Quota](/api-reference/get-quota)
* [Update Quota](/api-reference/update-quota)
# List Accounts
Source: https://docs.fireworks.ai/api-reference/list-accounts
get /v1/accounts
# List API Keys
Source: https://docs.fireworks.ai/api-reference/list-api-keys
get /v1/accounts/{account_id}/users/{user_id}/apiKeys
# List Batch Inference Jobs
Source: https://docs.fireworks.ai/api-reference/list-batch-inference-jobs
get /v1/accounts/{account_id}/batchInferenceJobs
# List Datasets
Source: https://docs.fireworks.ai/api-reference/list-datasets
get /v1/accounts/{account_id}/datasets
# List LoRAs
Source: https://docs.fireworks.ai/api-reference/list-deployed-models
get /v1/accounts/{account_id}/deployedModels
# List Deployment Shapes Versions
Source: https://docs.fireworks.ai/api-reference/list-deployment-shape-versions
get /v1/accounts/{account_id}/deploymentShapes/{deployment_shape_id}/versions
Use this endpoint to query available deployment shape versions for a given model. Use `-` as a wildcard for both `account_id` and `deployment_shape_id` to search across all accounts and shapes.
## Example: List shapes for a model
To list validated deployment shapes for a specific model, use the `filter` parameter with `snapshot.base_model` and `latest_validated=true`:
```bash theme={null}
curl -s "https://api.fireworks.ai/v1/accounts/-/deploymentShapes/-/versions?filter=snapshot.base_model%3D%22accounts%2Ffireworks%2Fmodels%2Fgpt-oss-120b%22%20AND%20latest_validated%3Dtrue&order_by=create_time%20desc" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" | jq .
```
### Filter syntax
The `filter` parameter uses [AIP-160 filtering](https://google.aip.dev/160). Common patterns:
| Filter | Description |
| ------------------------------------------------------------ | ------------------------------------------------------ |
| `snapshot.base_model="accounts/fireworks/models/MODEL_NAME"` | Filter by base model |
| `latest_validated=true` | Only return the latest validated version of each shape |
Combine multiple conditions with `AND`:
```
snapshot.base_model="accounts/fireworks/models/MODEL_NAME" AND latest_validated=true
```
Remember to URL-encode the filter value when using curl directly. `=` becomes `%3D`, `"` becomes `%22`, `/` becomes `%2F`, and a space becomes `%20`.
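The encoding can also be done programmatically. The sketch below uses `urllib.parse.quote` from the Python standard library to build the same URL as the curl example; the function name `build_versions_url` is hypothetical.

```python
from urllib.parse import quote

BASE_URL = "https://api.fireworks.ai/v1/accounts/-/deploymentShapes/-/versions"

def build_versions_url(base_model, order_by="create_time desc"):
    """Build the list-versions URL with a percent-encoded filter and order_by."""
    filter_expr = f'snapshot.base_model="{base_model}" AND latest_validated=true'
    # safe="" also encodes "/", which quote() leaves untouched by default.
    encoded_filter = quote(filter_expr, safe="")
    encoded_order = quote(order_by, safe="")
    return f"{BASE_URL}?filter={encoded_filter}&order_by={encoded_order}"
```

Calling `build_versions_url("accounts/fireworks/models/gpt-oss-120b")` reproduces the URL used in the curl command above.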
# List Deployments
Source: https://docs.fireworks.ai/api-reference/list-deployments
get /v1/accounts/{account_id}/deployments
# List dpo jobs
Source: https://docs.fireworks.ai/api-reference/list-dpo-jobs
get /v1/accounts/{account_id}/dpoJobs
# List Evaluation Jobs
Source: https://docs.fireworks.ai/api-reference/list-evaluation-jobs
get /v1/accounts/{account_id}/evaluationJobs
# List Evaluators
Source: https://docs.fireworks.ai/api-reference/list-evaluators
get /v1/accounts/{account_id}/evaluators
Lists all evaluators for an account with pagination support.
# List Models
Source: https://docs.fireworks.ai/api-reference/list-models
get /v1/accounts/{account_id}/models
# List Quotas
Source: https://docs.fireworks.ai/api-reference/list-quotas
get /v1/accounts/{account_id}/quotas
Lists all quotas for an account.
# List Reinforcement Fine-tuning Jobs
Source: https://docs.fireworks.ai/api-reference/list-reinforcement-fine-tuning-jobs
get /v1/accounts/{account_id}/reinforcementFineTuningJobs
# List Reinforcement Fine-tuning Steps
Source: https://docs.fireworks.ai/api-reference/list-reinforcement-fine-tuning-steps
get /v1/accounts/{account_id}/rlorTrainerJobs
# List Responses
Source: https://docs.fireworks.ai/api-reference/list-responses
get /v1/responses
Get a list of all responses for the authenticated account.
Query parameters:
* `limit`: Maximum number of responses to return (default: 20, max: 100)
* `after`: Cursor for pagination; returns responses after this ID
* `before`: Cursor for pagination; returns responses before this ID
# List Routers
Source: https://docs.fireworks.ai/api-reference/list-routers
get /v1/accounts/{account_id}/routers
# List Secrets
Source: https://docs.fireworks.ai/api-reference/list-secrets
get /v1/accounts/{account_id}/secrets
Lists all secrets for an account. Note that the `value` field is not returned in the response for security reasons. Only the `name` and `key_name` fields are included for each secret.
# List Supervised Fine-tuning Jobs
Source: https://docs.fireworks.ai/api-reference/list-supervised-fine-tuning-jobs
get /v1/accounts/{account_id}/supervisedFineTuningJobs
# List Users
Source: https://docs.fireworks.ai/api-reference/list-users
get /v1/accounts/{account_id}/users
# Create Chat Completion
Source: https://docs.fireworks.ai/api-reference/post-chatcompletions
post /v1/chat/completions
Create a completion for the provided prompt and parameters.
For RL / agent rollouts, Fireworks inference exposes additional
rollout-specific features:
[`x-session-affinity` and `x-multi-turn-session-id`](https://docs.fireworks.ai/guides/rollout-inference#session-affinity)
for multi-turn trajectories, and
[MoE Router Replay (R3)](https://docs.fireworks.ai/guides/rollout-inference#moe-router-replay)
for MoE expert tracing during rollouts.
# Create Completion
Source: https://docs.fireworks.ai/api-reference/post-completions
post /v1/completions
Create a completion for the provided prompt and parameters.
For RL / agent rollouts, Fireworks inference exposes additional
rollout-specific features:
[`x-session-affinity` and `x-multi-turn-session-id`](https://docs.fireworks.ai/guides/rollout-inference#session-affinity)
for multi-turn trajectories, and
[MoE Router Replay (R3)](https://docs.fireworks.ai/guides/rollout-inference#moe-router-replay)
for MoE expert tracing during rollouts.
# Create Response
Source: https://docs.fireworks.ai/api-reference/post-responses
post /v1/responses
Creates a model response, optionally interacting with custom tools via the Model Context Protocol (MCP). This endpoint supports conversational continuation and streaming.
Explore our cookbooks for detailed examples:
- [Basic MCP Usage](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/fireworks_mcp_examples.ipynb)
- [Streaming with MCP](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/fireworks_mcp_with_streaming.ipynb)
- [Conversational History with `previous_response_id`](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/fireworks_previous_response_cookbook.ipynb)
- [Basic Streaming](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/fireworks_streaming_example.ipynb)
- [Controlling Response Storage](https://github.com/fw-ai/cookbook/blob/main/learn/response-api/mcp_server_with_store_false_argument.ipynb)
# Prepare Model for different precisions
Source: https://docs.fireworks.ai/api-reference/prepare-model
post /v1/accounts/{account_id}/models/{model_id}:prepare
# Rerank documents
Source: https://docs.fireworks.ai/api-reference/rerank-documents
post /rerank
Rerank documents for a query using relevance scoring
# Resume Dpo Job
Source: https://docs.fireworks.ai/api-reference/resume-dpo-job
post /v1/accounts/{account_id}/dpoJobs/{dpo_job_id}:resume
# Resume Reinforcement Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/resume-reinforcement-fine-tuning-job
post /v1/accounts/{account_id}/reinforcementFineTuningJobs/{reinforcement_fine_tuning_job_id}:resume
# Resume Rlor Trainer Job
Source: https://docs.fireworks.ai/api-reference/resume-reinforcement-fine-tuning-step
post /v1/accounts/{account_id}/rlorTrainerJobs/{rlor_trainer_job_id}:resume
# Resume Supervised Fine-tuning Job
Source: https://docs.fireworks.ai/api-reference/resume-supervised-fine-tuning-job
post /v1/accounts/{account_id}/supervisedFineTuningJobs/{supervised_fine_tuning_job_id}:resume
# Scale Deployment to a specific number of replicas or to zero
Source: https://docs.fireworks.ai/api-reference/scale-deployment
patch /v1/accounts/{account_id}/deployments/{deployment_id}:scale
# Undelete Deployment
Source: https://docs.fireworks.ai/api-reference/undelete-deployment
post /v1/accounts/{account_id}/deployments/{deployment_id}:undelete
# Update Dataset
Source: https://docs.fireworks.ai/api-reference/update-dataset
patch /v1/accounts/{account_id}/datasets/{dataset_id}
# Update LoRA
Source: https://docs.fireworks.ai/api-reference/update-deployed-model
patch /v1/accounts/{account_id}/deployedModels/{deployed_model_id}
# Update Deployment
Source: https://docs.fireworks.ai/api-reference/update-deployment
patch /v1/accounts/{account_id}/deployments/{deployment_id}
# Update Evaluator
Source: https://docs.fireworks.ai/api-reference/update-evaluator
patch /v1/accounts/{account_id}/evaluators/{evaluator_id}
Updates evaluator metadata (display_name, description, default_dataset).
Changing `requirements` or `entry_point` triggers a rebuild. To upload new
source code, set `prepare_code_upload: true` then follow the upload flow.
# Update Model
Source: https://docs.fireworks.ai/api-reference/update-model
patch /v1/accounts/{account_id}/models/{model_id}
# Update Quota
Source: https://docs.fireworks.ai/api-reference/update-quota
patch /v1/accounts/{account_id}/quotas/{quota_id}
Updates a quota.
# Update Router
Source: https://docs.fireworks.ai/api-reference/update-router
patch /v1/accounts/{account_id}/routers/{router_id}
# Update secret
Source: https://docs.fireworks.ai/api-reference/update-secret
patch /v1/accounts/{account_id}/secrets/{secret_id}
# Update User
Source: https://docs.fireworks.ai/api-reference/update-user
patch /v1/accounts/{account_id}/users/{user_id}
# Upload Dataset Files
Source: https://docs.fireworks.ai/api-reference/upload-dataset-files
post /v1/accounts/{account_id}/datasets/{dataset_id}:upload
Provides a streamlined way to upload a dataset file in a single API request. This endpoint handles files up to 150 MB. For larger files, use [Get Dataset Upload Endpoint](/api-reference/get-dataset-upload-endpoint).
# Validate Dataset Upload
Source: https://docs.fireworks.ai/api-reference/validate-dataset-upload
post /v1/accounts/{account_id}/datasets/{dataset_id}:validateUpload
# Validate Evaluator Upload
Source: https://docs.fireworks.ai/api-reference/validate-evaluator-upload
post /v1/accounts/{account_id}/evaluators/{evaluator_id}:validateUpload
Triggers server-side validation of the uploaded source code (**step 5** in
the [Create Evaluator](/api-reference/create-evaluator) workflow). The server
extracts and processes the archive, then builds the evaluator environment.
Poll [Get Evaluator](/api-reference/get-evaluator) to monitor progress.
# Validate Model Upload
Source: https://docs.fireworks.ai/api-reference/validate-model-upload
get /v1/accounts/{account_id}/models/{model_id}:validateUpload
# Autoscaling
Source: https://docs.fireworks.ai/deployments/autoscaling
Configure how your deployment scales based on traffic
Control how your deployment scales based on traffic and load.
## Configuration options
| Flag | Type | Default | Description |
| ------------------------ | --------- | ------------- | ------------------------------------------------------ |
| `--min-replica-count` | Integer | 0 | Minimum number of replicas. Set to 0 for scale-to-zero |
| `--max-replica-count` | Integer | 1 | Maximum number of replicas |
| `--scale-up-window` | Duration | 30s | Wait time before scaling up |
| `--scale-down-window` | Duration | 10m | Wait time before scaling down |
| `--scale-to-zero-window` | Duration | 1h | Idle time before scaling to zero (min: 5m) |
| `--load-targets` | Key-value | `default=0.8` | Scaling thresholds. See options below |
**Load target options** (use as `--load-targets <key>=<value>[,<key>=<value>...]`):
* `default=` - General load target from 0 to 1
* `tokens_generated_per_second=` - Desired tokens per second per replica
* `prompt_tokens_per_second=` - Desired prompt tokens per second per replica
* `requests_per_second=` - Desired requests per second per replica
* `concurrent_requests=` - Desired concurrent requests per replica
When multiple targets are specified, each target yields its own replica count, and the largest of those counts is used.
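The "maximum across targets" rule can be illustrated with a short calculation. This is a sketch under a loud assumption: the per-target formula `ceil(observed_load / per_replica_target)` is hypothetical and is not taken from the Fireworks autoscaler; only the final `max` step reflects the rule described above. The function name and its parameters are also hypothetical.

```python
import math

def desired_replicas(observed, targets, min_replicas=0, max_replicas=1):
    """Pick the largest per-target replica requirement, clamped to the configured bounds.

    Assumption: each target needs ceil(observed_load / per_replica_target) replicas.
    """
    needed = [
        math.ceil(observed[name] / per_replica)
        for name, per_replica in targets.items()
        if name in observed
    ]
    # Use the maximum across all targets.
    wanted = max(needed, default=min_replicas)
    return max(min_replicas, min(wanted, max_replicas))
```

For example, with 12 requests per second against a target of 5 and 40 concurrent requests against a target of 10, the two targets call for 3 and 4 replicas respectively, so 4 is chosen (subject to `--max-replica-count`).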
## Common patterns
Scale to zero when idle to minimize costs:
```bash theme={null}
firectl deployment create \
--min-replica-count 0 \
--max-replica-count 3 \
--scale-to-zero-window 1h
```
Best for: Development, testing, or intermittent production workloads.
Keep replicas running for instant response:
```bash theme={null}
firectl deployment create \
--min-replica-count 2 \
--max-replica-count 10 \
--scale-up-window 15s \
--load-targets concurrent_requests=5
```
Best for: Low-latency requirements, avoiding cold starts, high-traffic applications.
Match known traffic patterns:
```bash theme={null}
firectl deployment create \
--min-replica-count 3 \
--max-replica-count 5 \
--scale-down-window 30m \
--load-targets tokens_generated_per_second=150
```
Best for: Steady workloads where you know typical load ranges.
## Scaling from zero behavior
When a deployment is scaled to zero and receives a request, the system immediately returns a `503` error with the `DEPLOYMENT_SCALING_UP` error code while initiating the scale-up process:
```json theme={null}
{
"error": {
"message": "Deployment is currently scaled to zero and is scaling up. Please retry your request in a few minutes.",
"code": "DEPLOYMENT_SCALING_UP",
"type": "error"
}
}
```
Requests to a scaled-to-zero deployment are **not queued**. Your application must implement retry logic to handle `503` responses while the deployment scales up.
### Handling scale-from-zero responses
Implement retry logic with exponential backoff to gracefully handle scale-up delays:
```python theme={null}
import time
import requests

def query_deployment_with_retry(url, payload, headers, max_retries=30, initial_delay=5):
    """Query a deployment with retry logic for scale-from-zero scenarios."""
    delay = initial_delay
    for attempt in range(max_retries):
        response = requests.post(url, json=payload, headers=headers)
        # Only retry if deployment is scaling up
        if response.status_code == 503:
            error_code = response.json().get("error", {}).get("code")
            if error_code == "DEPLOYMENT_SCALING_UP":
                print(f"Deployment scaling up, retrying in {delay}s...")
                time.sleep(delay)
                delay = min(delay * 1.5, 60)  # Cap at 60 seconds
                continue
        # Any other error status is raised immediately
        response.raise_for_status()
        return response.json()
    raise Exception("Deployment did not scale up in time")
```
```javascript theme={null}
async function queryDeploymentWithRetry(url, payload, headers = {}, maxRetries = 30, initialDelay = 5000) {
  let delay = initialDelay;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', ...headers },
      body: JSON.stringify(payload)
    });
    // Only retry if deployment is scaling up
    if (response.status === 503) {
      const body = await response.json();
      if (body.error?.code === 'DEPLOYMENT_SCALING_UP') {
        console.log(`Deployment scaling up, retrying in ${delay / 1000}s...`);
        await new Promise(resolve => setTimeout(resolve, delay));
        delay = Math.min(delay * 1.5, 60000); // Cap at 60 seconds
        continue;
      }
    }
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return response.json();
  }
  throw new Error('Deployment did not scale up in time');
}
```
```bash theme={null}
# Simple retry loop for scale-from-zero
MAX_RETRIES=30
RETRY_DELAY=5
for i in $(seq 1 $MAX_RETRIES); do
  response=$(curl -s -w "\n%{http_code}" \
    https://api.fireworks.ai/inference/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $FIREWORKS_API_KEY" \
    -d '{"model": "accounts//deployments/", ...}')
  http_code=$(echo "$response" | tail -n1)
  # Drop the trailing status-code line (sed works with both GNU and BSD tools)
  body=$(echo "$response" | sed '$d')
  # Only retry if deployment is scaling up
  if [ "$http_code" -eq 503 ]; then
    error_code=$(echo "$body" | jq -r '.error.code // empty')
    if [ "$error_code" = "DEPLOYMENT_SCALING_UP" ]; then
      echo "Deployment scaling up, retrying in ${RETRY_DELAY}s..."
      sleep $RETRY_DELAY
      RETRY_DELAY=$((RETRY_DELAY * 2))
      # Cap the delay at 60 seconds
      if [ "$RETRY_DELAY" -gt 60 ]; then
        RETRY_DELAY=60
      fi
      continue
    fi
    echo "$body"
    exit 1
  fi
  # Check for success (2xx status codes)
  if [ "$http_code" -ge 200 ] && [ "$http_code" -lt 300 ]; then
    echo "$body"
    exit 0
  fi
  echo "$body"
  exit 1
done
echo "Deployment did not scale up in time"
exit 1
```
Cold start times vary depending on model size—larger models may take longer to download and initialize. If you need instant responses without cold starts, set `--min-replica-count 1` or higher to keep replicas always running.
Deployments with min replicas = 0 are auto-deleted after 7 days of no traffic. [Reserved capacity](/deployments/reservations) guarantees availability during scale-up.
# Performance benchmarking
Source: https://docs.fireworks.ai/deployments/benchmarking
Measure and optimize your deployment's performance with load testing
Understanding your deployment's performance under various load conditions is essential for production readiness. Fireworks provides tools and best practices for benchmarking throughput, latency, and identifying bottlenecks.
## Fireworks Benchmark Tool
Use our open-source benchmarking tool to measure and optimize your deployment's performance:
**[Fireworks Benchmark Tool](https://github.com/fw-ai/benchmark)**
This tool allows you to:
* Test throughput and latency under various load conditions
* Simulate production traffic patterns
* Identify performance bottlenecks
* Compare different deployment configurations
### Installation
```bash theme={null}
git clone https://github.com/fw-ai/benchmark.git
cd benchmark
pip install -r requirements.txt
```
### Basic usage
Run a basic benchmark test:
```bash theme={null}
python benchmark.py \
--model "accounts/fireworks/models/llama-v3p1-8b-instruct" \
--deployment "your-deployment-id" \
--num-requests 1000 \
--concurrency 10
```
### Key metrics to monitor
When benchmarking your deployment, focus on these key metrics:
* **Throughput**: Requests per second (RPS) your deployment can handle
* **Latency**: Time to first token (TTFT) and end-to-end response time
* **Token generation rate**: Tokens per second during generation
* **Error rate**: Failed requests under load
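The four metrics above can be derived from per-request timing records. The sketch below is one way to do that, assuming you record a success flag, time to first token, total latency, and generated token count for each request. The `RequestRecord` shape and the sample values are hypothetical and are not the output format of the benchmark tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    """One benchmark request (hypothetical record shape)."""
    ok: bool                # False if the request failed
    ttft_s: float           # time to first token, in seconds
    total_s: float          # end-to-end latency, in seconds
    completion_tokens: int  # number of generated tokens


def summarize(records: list[RequestRecord], wall_clock_s: float) -> dict[str, float]:
    """Compute throughput, latency, token generation rate, and error rate."""
    succeeded = [r for r in records if r.ok]
    return {
        "throughput_rps": len(succeeded) / wall_clock_s,
        "mean_ttft_s": mean(r.ttft_s for r in succeeded),
        "mean_latency_s": mean(r.total_s for r in succeeded),
        # Tokens per second during generation, i.e. after the first token.
        "tokens_per_s": mean(
            r.completion_tokens / (r.total_s - r.ttft_s) for r in succeeded
        ),
        "error_rate": 1 - len(succeeded) / len(records),
    }


# Made-up sample: three successful requests and one failure over 3 seconds.
records = [
    RequestRecord(ok=True, ttft_s=0.2, total_s=1.2, completion_tokens=100),
    RequestRecord(ok=True, ttft_s=0.4, total_s=2.4, completion_tokens=100),
    RequestRecord(ok=False, ttft_s=0.0, total_s=0.0, completion_tokens=0),
    RequestRecord(ok=True, ttft_s=0.3, total_s=1.3, completion_tokens=100),
]
summary = summarize(records, wall_clock_s=3.0)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Means are used here for brevity; percentiles (P95, P99) are usually more informative for latency under load.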
## Custom benchmarking
You can also develop custom performance testing scripts or integrate with monitoring tools to track metrics over time. Consider:
* Using production-like request patterns and payloads
* Testing with various concurrency levels
* Monitoring resource utilization (GPU, memory, network)
* Testing autoscaling behavior under load
## Best practices
1. **Warm up your deployment**: Run a few requests before benchmarking to ensure models are loaded
2. **Test realistic scenarios**: Use request patterns and payloads similar to your production workload
3. **Gradually increase load**: Start with low concurrency and gradually increase to find your deployment's limits
4. **Monitor for errors**: Track error rates and response codes to identify issues under load
5. **Compare configurations**: Test different deployment shapes, quantization levels, and hardware to optimize cost and performance
## Next steps
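* [Autoscaling](/deployments/autoscaling): Configure autoscaling to handle variable load
* [Client-side performance optimization](/deployments/client-side-performance-optimization): Optimize your client code for maximum throughput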
# Client-side performance optimization
Source: https://docs.fireworks.ai/deployments/client-side-performance-optimization
Optimize your client code for maximum performance with dedicated deployments
When using a dedicated deployment, it is important to optimize the client-side HTTP connection pooling for maximum performance. We recommend using our [Python SDK](/tools-sdks/python-sdk) as it has good defaults for connection pooling and utilizes [httpx](https://www.python-httpx.org/) for optimal performance with Python's `asyncio` library. It also includes retry logic for handling `429` errors that Fireworks returns when the server is overloaded.
## General optimization recommendations
Based on our benchmarks, we recommend the following:
1. Use a client library optimized for high concurrency, such as [httpx](https://www.python-httpx.org/) in Python or [http.Agent](https://nodejs.org/api/http.html#class-httpagent) in Node.js.
2. Use the `AsyncFireworks` client for high-concurrency workloads.
3. Increase concurrency until performance stops improving or you observe too many `429` errors.
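The third recommendation can be turned into a simple stopping rule over a concurrency sweep. The sketch below is illustrative only: the function name, the 5% minimum-gain threshold, the 1% limit on `429` responses, and the sweep numbers are all assumptions, not values recommended by Fireworks.

```python
def pick_concurrency(results: dict[int, tuple[float, float]],
                     min_gain: float = 0.05,
                     max_429_rate: float = 0.01) -> int:
    """Pick the highest concurrency level that still pays off.

    results maps concurrency -> (requests_per_second, fraction_of_429s).
    The search stops when throughput improves by less than min_gain
    (relative) or the 429 rate exceeds max_429_rate.
    """
    levels = sorted(results)
    best = levels[0]
    for level in levels[1:]:
        rps, rate_429 = results[level]
        best_rps, _ = results[best]
        if rate_429 > max_429_rate or rps < best_rps * (1 + min_gain):
            break
        best = level
    return best


# Hypothetical measurements from a concurrency sweep.
sweep = {
    8: (40.0, 0.0),
    16: (78.0, 0.0),
    32: (140.0, 0.0),
    64: (143.0, 0.002),  # throughput has flattened out
    128: (150.0, 0.08),  # many 429 responses
}
print(pick_concurrency(sweep))  # → 32
```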
## Code example: Optimal concurrent requests (Python)
Install the [Fireworks Python SDK](/tools-sdks/python-sdk):
The SDK is currently in alpha. Use the `--pre` flag when installing to get the latest version.
```bash pip theme={null}
pip install --pre fireworks-ai
```
```bash poetry theme={null}
poetry add --pre fireworks-ai
```
```bash uv theme={null}
uv add --pre fireworks-ai
```
Here's how to implement optimal concurrent requests using `asyncio` and the `AsyncFireworks` client:
```python main.py theme={null}
import asyncio
import time
import statistics
from fireworks import AsyncFireworks


async def make_concurrent_requests(
    messages: list[str],
    model: str,
    max_workers: int = 1000,
):
    """Make concurrent requests with optimized connection pooling"""
    client = AsyncFireworks(
        max_retries=5,
    )
    # Semaphore to limit concurrent requests
    semaphore = asyncio.Semaphore(max_workers)
    latencies = []

    async def single_request(message: str):
        """Make a single request with semaphore control"""
        async with semaphore:
            start_time = time.perf_counter()
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": message}],
                max_tokens=100,
            )
            latency = time.perf_counter() - start_time
            latencies.append(latency)
            return response.choices[0].message.content

    # Create all request tasks
    tasks = [single_request(message) for message in messages]
    # Execute all requests concurrently
    results = await asyncio.gather(*tasks)
    return results, latencies


# Usage example
async def main():
    messages = ["Hello!"] * 1000  # 1000 requests
    model = "accounts/fireworks/models/qwen3-0p6b"
    start_time = time.perf_counter()
    results, latencies = await make_concurrent_requests(
        messages=messages,
        model=model,
    )
    total_time = time.perf_counter() - start_time
    # Calculate performance metrics
    num_requests = len(results)
    requests_per_second = num_requests / total_time
    # Latency statistics (in milliseconds)
    latencies_ms = [lat * 1000 for lat in latencies]
    avg_latency = statistics.mean(latencies_ms)
    min_latency = min(latencies_ms)
    max_latency = max(latencies_ms)
    p50_latency = statistics.median(latencies_ms)
    p95_latency = statistics.quantiles(latencies_ms, n=20)[18]  # 95th percentile
    p99_latency = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile
    print("\n" + "=" * 50)
    print("Performance Results")
    print("=" * 50)
    print(f"Total requests: {num_requests}")
    print(f"Total time: {total_time:.2f} seconds")
    print(f"Throughput: {requests_per_second:.2f} requests/second")
    print("\nLatency Statistics (ms):")
    print(f"  Min: {min_latency:.2f}")
    print(f"  Max: {max_latency:.2f}")
    print(f"  Avg: {avg_latency:.2f}")
    print(f"  P50 (median): {p50_latency:.2f}")
    print(f"  P95: {p95_latency:.2f}")
    print(f"  P99: {p99_latency:.2f}")
    print("=" * 50)


if __name__ == "__main__":
    asyncio.run(main())
```
This implementation:
* Uses `AsyncFireworks` for non-blocking async requests with optimized connection pooling
* Uses `asyncio.Semaphore` to cap concurrency and avoid overwhelming the server
# Exporting Metrics
Source: https://docs.fireworks.ai/deployments/exporting-metrics
Export metrics from your dedicated deployments to your observability stack
## Overview
Fireworks provides a metrics endpoint in Prometheus format, enabling integration with popular observability tools like Prometheus, OpenTelemetry (OTel) Collector, Datadog Agent, and Vector.
This page covers real-time performance metrics (latency, throughput, etc.) for on-demand deployments. For billing and usage data across all Fireworks services, see [Exporting Billing Metrics](/accounts/exporting-billing-metrics).
## Setting Up Metrics Collection
### Endpoint
The metrics endpoint is as follows. This URL and authorization header can be directly used by services like Grafana Cloud to ingest Fireworks metrics.
```
https://api.fireworks.ai/v1/accounts//metrics
```
### Authentication
Use the Authorization header with your Fireworks API key:
```json theme={null}
{
"Authorization": "Bearer YOUR_API_KEY"
}
```
### Scrape Interval
We recommend using a 1-minute scrape interval as metrics are updated every 30s.
### Rate Limits
To ensure service stability and fair usage:
* Maximum of 6 requests per minute per account
* Exceeding this limit results in HTTP 429 (Too Many Requests) responses
* Use a 1-minute scrape interval to stay within limits
## Integration Options
Fireworks metrics can be integrated with various observability platforms through multiple approaches:
### OpenTelemetry Collector Integration
The Fireworks metrics endpoint can be integrated with OpenTelemetry Collector by configuring a Prometheus receiver that scrapes the endpoint. This allows Fireworks metrics to be pushed to a variety of popular exporters—see the [OpenTelemetry registry](https://opentelemetry.io/ecosystem/registry/) for a full list.
### Direct Prometheus Integration
To integrate directly with Prometheus, specify the Fireworks metrics endpoint in your scrape config:
```yaml theme={null}
global:
  scrape_interval: 60s
scrape_configs:
  - job_name: 'fireworks'
    metrics_path: 'v1/accounts//metrics'
    authorization:
      type: "Bearer"
      credentials: "YOUR_API_KEY"
    static_configs:
      - targets: ['api.fireworks.ai']
    scheme: https
```
For more details on Prometheus configuration, refer to the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/).
### Supported Platforms
Fireworks metrics can be exported to various observability platforms including:
* Prometheus
* Datadog
* Grafana
* New Relic
## Available Metrics
### Common Labels
All metrics include the following common labels:
* `base_model`: The base model identifier (e.g., "accounts/fireworks/models/deepseek-v3")
* `deployment`: Full deployment path (e.g., "accounts/account-name/deployments/deployment-id")
* `deployment_account`: The account name
* `deployment_id`: The deployment identifier
### Rate Metrics (per second)
These metrics show activity rates calculated using 1-minute windows:
#### Request Rate
* `request_counter_total:sum_by_deployment`: Request rate per deployment
#### Error Rate
* `requests_error_total:sum_by_deployment`: Error rate per deployment, broken down by HTTP status code (includes additional `http_code` label)
#### Token Processing Rates
* `tokens_cached_prompt_total:sum_by_deployment`: Rate of cached prompt tokens per deployment
* `tokens_prompt_total:sum_by_deployment`: Rate of total prompt tokens processed per deployment
### Latency Histogram Metrics
These metrics provide latency distribution data with histogram buckets, calculated using 1-minute windows:
#### Generation Latency
* `latency_generation_per_token_ms_bucket:sum_by_deployment`: Per-token generation time distribution
* `latency_generation_queue_ms_bucket:sum_by_deployment`: Time spent waiting in generation queue
#### Request Latency
* `latency_overall_ms_bucket:sum_by_deployment`: End-to-end request latency distribution
* `latency_to_first_token_ms_bucket:sum_by_deployment`: Time to first token distribution
#### Prefill Latency
* `latency_prefill_ms_bucket:sum_by_deployment`: Prefill processing time distribution
* `latency_prefill_queue_ms_bucket:sum_by_deployment`: Time spent waiting in prefill queue
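To turn bucket data into a percentile such as P95, you can interpolate within the bucket that contains the target rank, which is the same idea as PromQL's `histogram_quantile`. The sketch below assumes the buckets follow the usual Prometheus convention (cumulative counts keyed by an upper bound, ending in a `+Inf` bucket); the bucket boundaries and counts are invented for illustration and are not real Fireworks output.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes Prometheus-style buckets: counts are cumulative and the last
    bucket has upper bound +Inf. Interpolates linearly inside a bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the open-ended bucket: report the last finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented time-to-first-token buckets: (upper bound in ms, cumulative count).
ttft_buckets = [(50, 10), (100, 60), (250, 90), (500, 99), (math.inf, 100)]
print(f"P50 ~ {histogram_quantile(0.50, ttft_buckets):.1f} ms")
print(f"P95 ~ {histogram_quantile(0.95, ttft_buckets):.1f} ms")
```

The result is an estimate: its precision depends on how wide the bucket containing the quantile is.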
### Token Distribution Metrics
These histogram metrics show token count distributions per request, calculated using 1-minute windows:
* `tokens_generated_per_request_bucket:sum_by_deployment`: Distribution of generated tokens per request
* `tokens_prompt_per_request_bucket:sum_by_deployment`: Distribution of prompt tokens per request
### Resource Utilization Metrics
These gauge metrics show average resource usage:
* `generator_kv_blocks_fraction:avg_by_deployment`: Average fraction of KV cache blocks in use
* `generator_kv_slots_fraction:avg_by_deployment`: Average fraction of KV cache slots in use
* `generator_model_forward_time:avg_by_deployment`: Average time spent in model forward pass
* `requests_coordinator_concurrent_count:avg_by_deployment`: Average number of concurrent requests
* `prefiller_prompt_cache_ttl:avg_by_deployment`: Average prompt cache time-to-live
# Regions
Source: https://docs.fireworks.ai/deployments/regions
Fireworks runs a global fleet of hardware on which you can deploy your models.
Fireworks runs a global fleet so you can deploy models close to users, meet data-residency needs, and scale across clouds. This page covers **multi-region** (default behavior and quota groupings), **single-region** availability and hardware, how to **use and change** regions, and **quotas**.
## Multi-region (recommended)
By default, deployments are multi-region: Fireworks can move and spread them across regions as needed. Multi-regions (**GLOBAL**, **US**, **EUROPE**, **APAC**) are high-level groupings of single regions. Your deployment may run in any single region(s) within that multi-region.
Utilizing multiple clouds and locations maximizes the odds that there's capacity to scale.
Multi-region deployments enable resilience to localized outages, maintaining application availability as workloads scale across regions.
### Supported multi-regions
Supported multi-regions: `GLOBAL`, `US`, `EUROPE`, `APAC`.
## Single region availability
Single regions are concrete locations (e.g. `US_IOWA_1`, `EU_FRANKFURT_1`) where your deployment can run. We recommend multi-region for most users because of its advantages (elastic scaling, higher reliability). If you have a specific need for a single region, contact [Fireworks](mailto:inquiries@fireworks.ai) to request it. The table below shows which single regions are available and what hardware is offered in each.
| **Region**        | **Accelerator Type(s)**                  |
| ----------------- | ---------------------------------------- |
| `US_ARIZONA_1`    | `NVIDIA_H100_80GB`                       |
| `US_CALIFORNIA_1` | `NVIDIA_H200_141GB`                      |
| `US_GEORGIA_2`    | `NVIDIA_B200_180GB`                      |
| `US_GEORGIA_3`    | `NVIDIA_H200_141GB`                      |
| `US_ILLINOIS_1`   | `NVIDIA_H100_80GB`                       |
| `US_ILLINOIS_2`   | `NVIDIA_A100_80GB`                       |
| `US_IOWA_1`       | `NVIDIA_H100_80GB`                       |
| `US_OHIO_1`       | `NVIDIA_B200_180GB`                      |
| `US_TEXAS_2`      | `NVIDIA_H100_80GB`                       |
| `US_UTAH_1`       | `NVIDIA_B200_180GB`                      |
| `US_VIRGINIA_1`   | `NVIDIA_H100_80GB`, `NVIDIA_H200_141GB`  |
| `US_WASHINGTON_2` | `NVIDIA_H100_80GB`                       |
| `US_WASHINGTON_3` | `NVIDIA_B200_180GB`                      |
| `US_WASHINGTON_4` | `NVIDIA_B200_180GB`                      |
| `EU_FRANKFURT_1`  | `NVIDIA_H100_80GB`                       |
| `EU_ICELAND_1`    | `NVIDIA_H200_141GB`                      |
| `EU_ICELAND_2`    | `NVIDIA_B200_180GB`, `NVIDIA_H200_141GB` |
| `AP_TOKYO_1`      | `NVIDIA_H100_80GB`                       |
| `AP_TOKYO_2`      | `NVIDIA_H200_141GB`                      |
## Using a region
When creating a deployment, use the `--region` flag to select a multi-region such as `GLOBAL`, or to pin the deployment to a single region:
```
firectl deployment create accounts/fireworks/models/llama-v3p1-8b-instruct \
--region GLOBAL
```
## Changing regions
Updating the single region for a deployment in-place is not supported. To move a deployment to a different single region, create a new deployment in the desired region, then delete the old deployment.
## Quotas
Quota is granted at the **multi-region** level for new users. By default, all users receive quota for the **GLOBAL** multi-region. For quota in a specific single region, please contact Fireworks. To view your current quotas, run:
```
firectl quota list
```
To use single regions that are not generally available (see the table above), or to request additional multi-region quota, contact [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai).
# Reserved capacity
Source: https://docs.fireworks.ai/deployments/reservations
Enterprise accounts can purchase reserved capacity, typically with 1-year commitments. Reserved capacity has the following advantages over ordinary [on-demand deployments](/guides/ondemand-deployments):
* Guaranteed capacity
* Higher quotas
* Lower GPU-hour prices
* Pre-GA access to newer regions
* Pre-GA access to newest hardware
## Usage and billing
Consuming a reservation is done by creating a deployment that meets the reservation parameters. For example, suppose you have a reservation for 12 H100 GPUs and create two deployments, each using 8 H100 GPUs. While both deployments are running, 12 of the H100s will count towards using your reservation, while the excess 4 H100s will be metered and billed at the on-demand rate. Follow [deploying models on-demand](/guides/ondemand-deployments) to create a deployment.
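The arithmetic in this example can be written out as a short sketch. The function name is hypothetical, and the sketch only models the split between reserved and on-demand GPUs while deployments are running, not actual invoicing.

```python
def split_usage(reserved_gpus: int, deployment_gpus: list[int]) -> tuple[int, int]:
    """Return (GPUs covered by the reservation, GPUs billed on-demand)."""
    in_use = sum(deployment_gpus)
    covered = min(in_use, reserved_gpus)
    return covered, in_use - covered


# The example above: a 12x H100 reservation and two 8x H100 deployments.
covered, on_demand = split_usage(12, [8, 8])
print(covered, on_demand)  # → 12 4
```

If the deployments use fewer GPUs than the reservation covers, the on-demand count is zero.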
When a reservation approaches its end time, either renew it or turn down a corresponding number of deployments; otherwise, you may be billed for your usage at on-demand rates.
Reservations are invoiced separately from your on-demand usage, at a frequency determined by your reservation contract
(e.g. monthly, quarterly, or yearly).
Reserved capacity will always be billed until the reservation ends, regardless of whether the reservation is
actively used.
## Purchasing or renewing a reservation
To purchase a reservation or increase the size or duration of an existing reservation, contact your Fireworks account
manager. If you are a new, prospective customer, please reach out to our [sales team](https://fireworks.ai/company/contact-us).
## Viewing your reservations
To view your existing reservations, run:
```
firectl reservation list
```
# Routers
Source: https://docs.fireworks.ai/deployments/routers
Distribute traffic across multiple deployments for A/B testing, traffic migration, and load distribution.
A **Router** is a resource that controls how inference traffic is routed to one or more deployments. Instead of sending all requests to a single deployment, a router lets you split traffic across multiple deployments — useful for A/B testing model variants, gradually migrating traffic to a new deployment, or distributing load.
Traffic is split proportionally based on the number of replicas in each deployment. For example, if a router covers two deployments — one with 3 replicas and another with 2 — the first receives 60% of traffic and the second receives 40%.
Routers only work with multi-region deployments.
## When to use a router
### Stable alias for deployment replacement
If you plan to replace a deployment later (e.g., to switch to a new model), give your application the **router name** instead of the deployment name. You can then swap the underlying deployment without changing your application.
```
Your app calls: accounts//routers/my-router
└── Initially routes to: accounts//deployments/v1
└── Later updated to: accounts//deployments/v2
```
### A/B testing between deployments
Place multiple deployments under a single router. Traffic is automatically split by replica count, so you can control the ratio by adjusting replicas on each deployment.
```bash theme={null}
firectl router create \
--router-id=ab-test \
--deployments=model-a,model-b
```
### Gradual traffic migration
Shift traffic from an old deployment to a new one with zero downtime by scaling replicas up on the new deployment and down on the old. See the [worked example](#example-traffic-migration) below.
## How traffic routing works
Traffic is distributed based on **replica count**. Each replica across all deployments in the router receives an equal share of traffic.
| Deployment | Replicas | Traffic share |
| -------------- | -------- | ------------- |
| `deployment-a` | 3 | 60% |
| `deployment-b` | 2 | 40% |
| **Total** | **5** | **100%** |
To shift traffic, scale the replica counts on the underlying deployments. The router automatically adjusts the distribution.
### Sending traffic to a router
Use the router's name in the `model` field of your API request, just like you would use a deployment name:
```bash theme={null}
curl -s -X POST https://api.fireworks.ai/inference/v1/chat/completions \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "accounts//routers/",
"messages": [{"role": "user", "content": "Hello"}]
}'
```
### Routing strategy
Traffic is routed using **weighted replica** selection: each request is randomly assigned to a deployment, weighted by its replica count. A deployment with more replicas receives proportionally more traffic.
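Weighted replica selection can be simulated in a few lines. The sketch below uses Python's `random.choices` as a stand-in for the router's selection step; it illustrates the idea only and is not the router's actual implementation. The deployment names and replica counts match the table above.

```python
import random
from collections import Counter


def pick_deployment(replicas: dict[str, int], rng: random.Random) -> str:
    """Pick a deployment at random, weighted by its replica count."""
    names = list(replicas)
    weights = [replicas[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


replicas = {"deployment-a": 3, "deployment-b": 2}
rng = random.Random(0)  # seeded so the run is repeatable
counts = Counter(pick_deployment(replicas, rng) for _ in range(10_000))
for name in replicas:
    print(f"{name}: {counts[name] / 10_000:.1%}")
```

Over many requests the observed shares approach 60% and 40%, although any short window of traffic can deviate from those ratios.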
## Managing routers
### Creating a router
A router requires at least one deployment.
```bash theme={null}
firectl router create \
--deployments=,
```
Optional flags:
| Flag | Description |
| ---------------- | -------------------------------------------------------------- |
| `--router-id` | Set a specific router ID. If omitted, a random ID is generated |
| `--display-name` | Human-readable name for the router |
| `--model` | The model to route traffic to |
| `--strategy` | Routing strategy. Default: `weighted-random` |
| `--public` | Make the router accessible to other accounts |
### Listing routers
```bash theme={null}
firectl router list
```
### Getting router details
```bash theme={null}
firectl router get
```
You can also use the full resource name:
```bash theme={null}
firectl router get accounts//routers/
```
### Updating a router
Update the deployments, strategy, or other properties of an existing router:
```bash theme={null}
firectl router update \
--deployments=,,
```
### Deleting a router
```bash theme={null}
firectl router delete
```
Deleting a router takes effect immediately. Any traffic sent to the router's alias will fail. Make sure all clients have switched to a different route before deleting.
## Example: traffic migration
This example walks through migrating traffic from an existing deployment to a new one with zero downtime.
**Step 1** — Create a router for your existing deployment and point your application at the router alias:
```bash theme={null}
firectl router create \
--router-id=my-router \
--deployments=current-deployment
```
Your application sends traffic to `accounts//routers/my-router`. All traffic goes to `current-deployment`.
**Step 2** — Create the new deployment and add it to the router:
```bash theme={null}
firectl deployment create accounts//models/ \
--deployment-id=new-deployment
```
```bash theme={null}
firectl router update my-router \
--deployments=current-deployment,new-deployment
```
A new deployment starts with 1 replica by default, so if `current-deployment` has 4 replicas, the split is immediately 80%/20%.
**Step 3** — Shift more traffic by increasing replicas on the new deployment and decreasing the old:
```bash theme={null}
firectl deployment update new-deployment \
--min-replica-count=4 \
--max-replica-count=4
firectl deployment update current-deployment \
--min-replica-count=1 \
--max-replica-count=1
```
Traffic split is now 20% old / 80% new.
**Step 4** — Complete the migration by scaling the old deployment to zero:
```bash theme={null}
firectl deployment update current-deployment \
--min-replica-count=0 \
--max-replica-count=0
```
All traffic now flows to `new-deployment`. Clean up by removing the old deployment from the router:
```bash theme={null}
firectl router update my-router --deployments=new-deployment
```
Monitor your new deployment's latency and error rates at each step before shifting more traffic. This lets you catch issues early and roll back by increasing replicas on the old deployment.
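To sanity-check the split at each step before running the commands, you can compute the expected shares from the replica counts. This is a small sketch that assumes `current-deployment` starts with 4 replicas, as in the walkthrough; the helper name is hypothetical.

```python
def traffic_split(replicas: dict[str, int]) -> dict[str, float]:
    """Return each deployment's share of router traffic, by replica count."""
    total = sum(replicas.values())
    return {name: count / total for name, count in replicas.items()}


# Replica counts after each step of the migration.
steps = {
    "Step 2": {"current-deployment": 4, "new-deployment": 1},
    "Step 3": {"current-deployment": 1, "new-deployment": 4},
    "Step 4": {"current-deployment": 0, "new-deployment": 4},
}
for step, replicas in steps.items():
    shares = traffic_split(replicas)
    print(step, {name: f"{share:.0%}" for name, share in shares.items()})
```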
# Speculative Decoding
Source: https://docs.fireworks.ai/deployments/speculative-decoding
Speed up generation with draft models and n-gram speculation
Speed up text generation by using a smaller "draft" model to assist the main model, or using n-gram based speculation.
Speculative decoding may slow down output generation if the draft model is not a good speculator, or if token count/speculation length is too high or too low. It may also reduce max throughput. Test different models and speculation lengths for your use case.
## Configuration options
| Flag | Type | Description |
| ---------------------------- | ------ | ------------------------------------------------------------------------------------------- |
| `--draft-model` | string | Draft model name. Can be a Fireworks model or custom model. See recommendations below. |
| `--draft-token-count` | int32 | Tokens to generate per step. Required when using draft model or n-gram. Typically set to 4. |
| `--ngram-speculation-length` | int32 | Alternative to draft model: uses N-gram based speculation from previous input. |
`--draft-model` and `--ngram-speculation-length` cannot be used together.
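If you generate deployment commands from scripts, it can help to check these constraints before calling the CLI. The sketch below encodes the two rules stated above (the flags are mutually exclusive, and `--draft-token-count` is required with either one). The function is a hypothetical client-side helper, not part of `firectl`.

```python
from typing import Optional


def validate_speculation_flags(draft_model: Optional[str] = None,
                               draft_token_count: Optional[int] = None,
                               ngram_speculation_length: Optional[int] = None) -> None:
    """Raise ValueError if the flag combination breaks the documented rules."""
    if draft_model and ngram_speculation_length is not None:
        raise ValueError(
            "--draft-model and --ngram-speculation-length cannot be used together"
        )
    uses_speculation = bool(draft_model) or ngram_speculation_length is not None
    if uses_speculation and draft_token_count is None:
        raise ValueError(
            "--draft-token-count is required with a draft model or n-gram speculation"
        )


# A draft-model configuration and an n-gram configuration both pass.
validate_speculation_flags(
    draft_model="accounts/fireworks/models/llama-v3p2-1b-instruct",
    draft_token_count=4,
)
validate_speculation_flags(ngram_speculation_length=3, draft_token_count=4)
print("ok")
```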
## Recommended draft models
| Draft model | Use with |
| -------------------------------------------------- | --------------------- |
| `accounts/fireworks/models/llama-v3p2-1b-instruct` | All Llama models > 3B |
| `accounts/fireworks/models/qwen2p5-0p5b-instruct` | All Qwen models > 3B |
## Examples
Use a smaller model to speed up generation:
```bash theme={null}
firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \
--draft-model="accounts/fireworks/models/llama-v3p2-1b-instruct" \
--draft-token-count=4
```
Use input history for speculation (no draft model needed):
```bash theme={null}
firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \
--ngram-speculation-length=3 \
--draft-token-count=4
```
Fireworks also supports [Predicted Outputs](/guides/predicted-outputs), which works in addition to model-based speculative decoding.
# Cloud Integrations
Source: https://docs.fireworks.ai/ecosystem/integrations
## Agentic Coding Harnesses
* Use Fireworks models in Claude Code via the FireConnect CLI
* Use Fireworks models in OpenCode via the FireConnect CLI
## Cloud Deployments
* Access frontier open models through Azure, billed to your Azure account
* Deploy Fireworks models on AWS SageMaker
* Run Fireworks on Amazon Elastic Kubernetes Service
* Deploy using Amazon Elastic Container Service
* Build and deploy AI agents with AgentCore
## Need Help?
For assistance with cloud deployments or custom integrations, [contact our team](https://fireworks.ai/contact).
# Agent Frameworks
Source: https://docs.fireworks.ai/ecosystem/integrations/agent-frameworks
Build production-ready AI agents with Fireworks and leading open-source frameworks
Fireworks AI seamlessly integrates with the best open-source agent frameworks, enabling you to build magical, production-ready applications powered by state-of-the-art language models.
## Supported Frameworks
* Build LLM applications with powerful orchestration and tool integration
* Efficient data retrieval and document indexing for LLM-based agents
* Orchestrate collaborative multi-agent systems for complex tasks
* Type-safe AI agent development with Pydantic validation
* Modern agent orchestration with seamless OpenAI-compatible integration
* Use Claude Code with Fireworks models for AI-powered coding
* Add Fireworks models to Copilot Chat via a custom endpoint
## Need Help?
For assistance with agent framework integrations, [contact our team](https://fireworks.ai/contact) or join our [Discord community](https://discord.gg/fireworks-ai).
# Microsoft Foundry
Source: https://docs.fireworks.ai/ecosystem/integrations/azure-foundry
Deploy frontier open models inside your Azure subscription, billed through Azure.
Fireworks AI is a first-party inference provider inside Microsoft Foundry. You can access frontier open models through your existing Azure account, with usage billed through Azure and counting toward your Microsoft Azure Consumption Commitment (MACC).
This page covers the Fireworks side of the integration. For Azure portal setup steps, see the [Microsoft Learn guide](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/enable-fireworks-models).
**New to Fireworks?** Foundry users get the same OpenAI-compatible API and model catalog as direct Fireworks customers. Start with the [PayGo quickstart](#paygo-quickstart) below — you can be making requests in about 10 minutes.
## Prerequisites
* An active Azure subscription
* The Fireworks integration enabled at the subscription level (see below)
* A Microsoft Foundry project with the **Azure AI Developer** role assigned
### Opt-in
Fireworks on Foundry requires a one-time opt-in per Azure subscription before you can create deployments. Follow the steps in the [Microsoft Learn guide](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/enable-fireworks-models#enable-fireworks-on-foundry).
## Deployment modes
Fireworks on Foundry supports three deployment modes.
| Mode | Also called | Pricing | Regions | Right for |
| ----------------- | ------------------------------ | --------------------------------- | -------------------- | -------------------------------------------- |
| **PayGo** | Serverless, Data Zone Standard | Per token, MACC-eligible | US Data Zone only | Prototyping, low-volume workloads |
| **PTU** | Provisioned Throughput | Per PTU-hour, ACD + MACC eligible | Global | Production workloads with consistent traffic |
| **Custom Models** | Bring Your Own Model | PTU pricing | Global (PTU regions) | Fine-tuned model deployment |
PTU deployments can be created directly in the Azure portal. For help with PTU sizing on Fireworks models, contact [sales@fireworks.ai](mailto:sales@fireworks.ai).
## Available models
All models use the OpenAI-compatible chat completions API and are added to the catalog on a rolling basis. For the current list of available models, see the [Microsoft Learn catalog](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/enable-fireworks-models#available-catalog-models).
Chat completions only. Embeddings, image generation, and audio modalities are not available through Foundry.
## PayGo quickstart
PayGo (Data Zone Standard) is available in: East US, East US 2, Central US, North Central US, West US, West US 3.
The throughput limit for PayGo deployments is **250,000 tokens per minute (TPM)**.
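A client can pace itself against this limit before sending requests. The sketch below is a minimal client-side approximation, not Azure's own accounting: the `TokenBudget` class, its 60-second sliding window, and the token counts you pass in are all assumptions made for illustration.

```python
import time
from collections import deque


class TokenBudget:
    """Sliding-window token counter for client-side pacing (illustrative only)."""

    def __init__(self, limit_tpm=250_000, window_s=60.0, clock=time.monotonic):
        self.limit = limit_tpm
        self.window = window_s
        self.clock = clock          # injectable so the class can be tested
        self.events = deque()       # (timestamp, tokens) pairs inside the window
        self.used = 0

    def _expire(self, now):
        # Drop spends that are older than the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def try_spend(self, tokens):
        """Record the spend and return True if it fits, otherwise return False."""
        now = self.clock()
        self._expire(now)
        if self.used + tokens > self.limit:
            return False
        self.events.append((now, tokens))
        self.used += tokens
        return True
```

Call `try_spend(estimated_tokens)` before each request and wait or queue the request when it returns `False`.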
### Make your first request
Foundry deployments use an OpenAI-compatible endpoint. Use your Foundry project endpoint and Azure API key.
```python theme={null}
from openai import OpenAI

# Replace the placeholders with your Foundry project endpoint and Azure API key
client = OpenAI(
    base_url="https://<your-resource>.services.ai.azure.com/models",
    api_key="<your-azure-api-key>",
)

response = client.chat.completions.create(
    model="fireworks-ai/FW-GLM-5.1",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```
Find your project endpoint in the Microsoft Foundry portal under **Project settings**.
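To keep the endpoint and key out of source code, one option is to read both from environment variables. This is a sketch under assumptions: the variable names `FOUNDRY_ENDPOINT` and `FOUNDRY_API_KEY` and the helper `foundry_client_kwargs` are hypothetical, not names defined by Azure or Fireworks.

```python
import os


def foundry_client_kwargs(env=os.environ):
    """Build keyword arguments for openai.OpenAI from environment variables.

    FOUNDRY_ENDPOINT and FOUNDRY_API_KEY are hypothetical variable names.
    """
    endpoint = env.get("FOUNDRY_ENDPOINT", "").rstrip("/")
    api_key = env.get("FOUNDRY_API_KEY", "")
    if not endpoint.startswith("https://"):
        raise ValueError("FOUNDRY_ENDPOINT must be an https:// URL")
    if not api_key:
        raise ValueError("FOUNDRY_API_KEY is not set")
    return {"base_url": endpoint, "api_key": api_key}
```

With the `openai` package installed, the client can then be created with `OpenAI(**foundry_client_kwargs())`.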
## PTU (Provisioned Throughput)
PTU deployments provide dedicated GPU capacity reserved for your workload, with consistent throughput and global region availability.
* Dedicated capacity, not shared with other tenants
* Available globally, not limited to US Data Zone
* ACD-eligible and MACC-eligible
You can create a PTU deployment directly in the Azure portal. For more on provisioned throughput, see the [Microsoft Learn guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/provisioned-throughput).
For help with PTU sizing on Fireworks models, contact [sales@fireworks.ai](mailto:sales@fireworks.ai).
## Custom Models
Fine-tune on Fireworks and deploy on Foundry, or bring your own weights from wherever you post-train to deploy on Foundry. Your model is served on Fireworks infrastructure within Azure, billed through your Azure account.
### Supported base architectures
For the list of supported custom model architectures, see the [Microsoft Learn guide](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/enable-fireworks-models#supported-model-architectures).
### Deployment
To import and deploy a custom model, follow the [Import custom models into Foundry guide](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/import-custom-models?tabs=rest-api).
## Billing
All Fireworks on Foundry usage is billed through Azure. You do not need a separate Fireworks billing account or contract.
* PayGo and PTU usage is MACC-eligible
* PTU deployments are ACD-eligible and qualify for quota retirement
* Direct Fireworks usage at [fireworks.ai](https://fireworks.ai) is billed separately and does not count toward MACC
## Troubleshooting
| Issue | Resolution |
| ------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Quota exceeded error | Request a limit increase at [aka.ms/fireworks-quota](https://aka.ms/fireworks-quota) |
| Access denied on deployment | Verify you have the **Azure AI Developer** role on the project |
| Opt-in not propagating | Allow up to 30 minutes after registering `Fireworks.EnableDeploy` |
| Custom Model deployment failing | Confirm weights are full-weight (not LoRA adapters) and the architecture is in the [supported list](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/enable-fireworks-models#supported-model-architectures) |
| PTU provisioning questions | Contact [sales@fireworks.ai](mailto:sales@fireworks.ai) |
## Additional resources
* [Enable Fireworks on Foundry (Microsoft Learn)](https://learn.microsoft.com/en-us/azure/foundry/how-to/fireworks/enable-fireworks-models)
* [Microsoft Foundry portal](https://ai.azure.com/)
* [Fireworks fine-tuning docs](/fine-tuning/finetuning-intro)
* [Fireworks Trust Center](https://fireworks.ai/trust)
* [sales@fireworks.ai](mailto:sales@fireworks.ai) for PTU provisioning and Custom Model support
# Claude Code
Source: https://docs.fireworks.ai/ecosystem/integrations/claude-code
Use Fireworks AI models in Claude Code with the FireConnect CLI
[FireConnect](https://github.com/fw-ai/fireconnect) routes [Claude Code](https://claude.ai/code) through Fireworks AI models. Install it once, then use `fireconnect on` and `fireconnect off` to switch providers without editing config files by hand.
## Prerequisites
* [Claude Code](https://claude.ai/code) installed
* A [Fireworks API key](https://app.fireworks.ai/settings/users/api-keys) (`fw_...`) or a [Fire Pass](/firepass) key (`fpk_...`)
* Node.js (the installer can install it via Homebrew or apt if it is missing)
## Install
```bash theme={null}
curl -fsSL https://raw.githubusercontent.com/fw-ai/fireconnect/main/install.sh | bash
```
For non-interactive setup (CI or scripted installs):
```bash theme={null}
curl -fsSL https://raw.githubusercontent.com/fw-ai/fireconnect/main/install.sh | FIREWORKS_API_KEY="fw_..." bash
```
The installer:
* Uses Node.js to update Claude Code settings (it does not install or update npm packages)
* Prompts for your Fireworks API key, or reads it from `FIREWORKS_API_KEY`
* Runs `fireconnect on` to apply the default model mapping and write `~/.claude/settings.json`
* Clones the FireConnect CLI to `~/.fireconnect/cli` and installs a `fireconnect` launcher to `~/.local/bin`
* Adds `~/.local/bin` to your shell `PATH`
Restart Claude Code after installation, then test with:
```text theme={null}
Is this thing on?
```
## Using Fire Pass
If you have a [Fire Pass](/firepass) subscription, use your `fpk_...` key instead:
```bash theme={null}
curl -fsSL https://raw.githubusercontent.com/fw-ai/fireconnect/main/install.sh | FIREWORKS_API_KEY="fpk_..." bash
```
FireConnect automatically detects Fire Pass keys and routes all model aliases to `kimi-k2p6-turbo` — the only model covered by the Fire Pass subscription.
## Default model mapping
| Alias | Standard key (`fw_...`) | Fire Pass key (`fpk_...`) |
| -------- | ----------------------- | ------------------------- |
| main | `kimi-k2p6-turbo` | `kimi-k2p6-turbo` |
| opus | `kimi-k2p6-turbo` | `kimi-k2p6-turbo` |
| sonnet | `glm-5p1` | `kimi-k2p6-turbo` |
| haiku | `minimax-m2p5` | `kimi-k2p6-turbo` |
| subagent | `minimax-m2p5` | `kimi-k2p6-turbo` |
Short model IDs like `kimi-k2p6-turbo` are expanded to full Fireworks paths (for example, `accounts/fireworks/routers/kimi-k2p6-turbo`).
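The mapping and expansion rules above can be sketched in a few lines of Python. This is an illustration of the documented behavior, not FireConnect's actual implementation; the names `ROUTER_IDS`, `expand`, and `default_mapping` are hypothetical, and the set of IDs served under `routers/` is assumed from the examples on this page.

```python
# Assumption: only this ID lives under routers/; everything else is under models/.
ROUTER_IDS = {"kimi-k2p6-turbo"}

# Defaults for a standard fw_... key, taken from the table above.
DEFAULTS = {
    "main": "kimi-k2p6-turbo",
    "opus": "kimi-k2p6-turbo",
    "sonnet": "glm-5p1",
    "haiku": "minimax-m2p5",
    "subagent": "minimax-m2p5",
}
FIRE_PASS_MODEL = "kimi-k2p6-turbo"


def expand(model_id):
    """Expand a short model ID to a full Fireworks path."""
    if model_id.startswith("accounts/"):
        return model_id  # already a full path
    kind = "routers" if model_id in ROUTER_IDS else "models"
    return f"accounts/fireworks/{kind}/{model_id}"


def default_mapping(api_key):
    """Return alias -> full model path for the given key type."""
    fire_pass = api_key.startswith("fpk_")
    return {
        alias: expand(FIRE_PASS_MODEL if fire_pass else model)
        for alias, model in DEFAULTS.items()
    }
```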
## What gets written
FireConnect writes these settings to `~/.claude/settings.json`:
```json theme={null}
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.fireworks.ai/inference",
    "ANTHROPIC_API_KEY": "fw_YOUR_FIREWORKS_API_KEY",
    "ANTHROPIC_AUTH_TOKEN": "fw_YOUR_FIREWORKS_API_KEY",
    "ANTHROPIC_MODEL": "accounts/fireworks/routers/kimi-k2p6-turbo",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "accounts/fireworks/routers/kimi-k2p6-turbo",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "accounts/fireworks/models/glm-5p1",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "accounts/fireworks/models/minimax-m2p5",
    "CLAUDE_CODE_SUBAGENT_MODEL": "accounts/fireworks/models/minimax-m2p5"
  }
}
```
FireConnect writes both `ANTHROPIC_API_KEY` (preferred) and `ANTHROPIC_AUTH_TOKEN` (compatibility alias) with the same Fireworks key. It also saves a backup of your previous provider settings to `~/.fireconnect/claude/` so `fireconnect off` can restore them.
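If you want to confirm what was written without opening the file by hand, a small script can check the `env` block. This is a sketch only: `routing_report` and `load_settings` are hypothetical helpers, and the checks assume the keys and base URL listed above.

```python
import json
from pathlib import Path

FIREWORKS_BASE_URL = "https://api.fireworks.ai/inference"
MODEL_KEYS = [
    "ANTHROPIC_MODEL",
    "ANTHROPIC_DEFAULT_OPUS_MODEL",
    "ANTHROPIC_DEFAULT_SONNET_MODEL",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL",
    "CLAUDE_CODE_SUBAGENT_MODEL",
]


def routing_report(settings):
    """Return a list of problems found in a parsed settings.json (empty if none)."""
    env = settings.get("env", {})
    problems = []
    if env.get("ANTHROPIC_BASE_URL") != FIREWORKS_BASE_URL:
        problems.append("ANTHROPIC_BASE_URL does not point at Fireworks")
    if env.get("ANTHROPIC_API_KEY") != env.get("ANTHROPIC_AUTH_TOKEN"):
        problems.append("ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN differ")
    for key in MODEL_KEYS:
        if not env.get(key, "").startswith("accounts/"):
            problems.append(f"{key} is not a full Fireworks model path")
    return problems


def load_settings(path=None):
    """Read Claude Code settings; defaults to ~/.claude/settings.json."""
    path = Path(path) if path else Path.home() / ".claude" / "settings.json"
    return json.loads(path.read_text())
```

Running `print(routing_report(load_settings()))` after `fireconnect on` should print an empty list if the file matches the layout shown above.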
## CLI reference
```bash theme={null}
fireconnect on # Route Claude Code through Fireworks
fireconnect off # Restore your previous provider
fireconnect status # Show the current provider and model
fireconnect list # Show the default and current model mapping
fireconnect set # Change model aliases without touching credentials
fireconnect reset # Reset model aliases to defaults
fireconnect uninstall # Remove FireConnect from this machine
```
Run `fireconnect help <command>` for all options.
### Manual setup
If you already have a Fireworks API key, you can skip the installer and enable routing directly:
```bash theme={null}
fireconnect on --api-key fw_...
```
Restart Claude Code after this completes.
### Switch models
Short model IDs work everywhere:
```bash theme={null}
fireconnect set --main kimi-k2p6-turbo --sonnet glm-5p1 --haiku minimax-m2p5 --subagent minimax-m2p5
```
### Turn off Fireworks routing
```bash theme={null}
fireconnect off
```
This restores your previous `~/.claude/settings.json` from the backup saved in `~/.fireconnect/claude/`.
### Enable with a specific API key
```bash theme={null}
fireconnect on --api-key fw_...
```
## Uninstall
```bash theme={null}
fireconnect uninstall
```
This disables Fireworks routing for Claude Code, removes `~/.fireconnect/claude/`, and deletes the `fireconnect` CLI launcher from `~/.local/bin`.
## Source
FireConnect is open source: [github.com/fw-ai/fireconnect](https://github.com/fw-ai/fireconnect)
# Development Setup with Fireworks Docs MCP
Source: https://docs.fireworks.ai/ecosystem/integrations/development-setup
Configure the Fireworks AI Docs MCP server for Claude Code and Cursor
## Claude Code
Add the MCP server via the CLI:
```bash theme={null}
claude mcp add --transport http fireworks-docs https://docs.fireworks.ai/mcp
```
Or add it to your project's `mcp.json`:
```json theme={null}
{
  "mcpServers": {
    "fireworks-docs": {
      "url": "https://docs.fireworks.ai/mcp"
    }
  }
}
```
## Cursor
One-click install:
[Install Fireworks Docs MCP](https://cursor.com/en/install-mcp?name=fireworks-docs\&config=eyJ1cmwiOiJodHRwczovL2RvY3MuZmlyZXdvcmtzLmFpL21jcCJ9)
Or manually add to your workspace's `mcp.json`:
```json theme={null}
{
  "mcpServers": {
    "fireworks-docs": {
      "url": "https://docs.fireworks.ai/mcp"
    }
  }
}
```
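If your `mcp.json` already lists other servers, the entry should be merged in rather than pasted over the file. The following is a minimal sketch of that merge; `add_fireworks_docs` and `update_file` are hypothetical helper names, and the script assumes the file uses the `mcpServers` layout shown above.

```python
import json
from pathlib import Path


def add_fireworks_docs(config):
    """Return a copy of an mcp.json dict with the fireworks-docs server added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["fireworks-docs"] = {"url": "https://docs.fireworks.ai/mcp"}
    merged["mcpServers"] = servers
    return merged


def update_file(path):
    """Merge the entry into the file at `path`, creating the file if needed."""
    path = Path(path)
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(add_fireworks_docs(config), indent=2) + "\n")
```

Existing servers are left untouched, and running the script twice produces the same file.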
## Using the MCP Server
Once configured, your AI coding agent can search the full Fireworks AI documentation. Example queries:
* "How do I configure autoscaling for deployments?"
* "What parameters does the chat completions endpoint accept?"
* "Show me examples of function calling with Fireworks models"
* "Find the API reference for batch inference"
# GitHub Copilot
Source: https://docs.fireworks.ai/ecosystem/integrations/github-copilot
Use Fireworks AI models in GitHub Copilot Chat via a custom endpoint
Use [Fireworks AI](https://fireworks.ai) models in **GitHub Copilot Chat** by adding a **Custom Endpoint** in VS Code (or other hosts that support Copilot custom models).
Fireworks offers **200+ models**—copy the model `id` and token limits from the [Model Library](https://app.fireworks.ai/models). Use endpoint URL `https://api.fireworks.ai/inference/v1`.
## Prerequisites
* A Fireworks [API key](https://app.fireworks.ai/settings/users/api-keys)
* GitHub Copilot with access to **Other Models** and **Custom Endpoint** (availability depends on your Copilot plan)
## Setup
In Copilot Chat, click the active model name at the bottom (often **Auto**). In the menu, click the gear icon next to **Other Models**.
In **Language Models**, click **+ Add Models...** in the top right, then choose **Custom Endpoint**.
Enter **Fireworks AI** as the group name and press Enter.
Paste your Fireworks API key (hidden by default) and press Enter to confirm.
When asked for the default request/response format, select **Responses API**.
A configuration file opens. **Do not change** the auto-generated header at the top—only fill in the model template below it.
Fill in your model fields, then save (**Ctrl+S** on Windows/Linux, **Cmd+S** on macOS) and close the settings modal.
Example for [DeepSeek V4 Pro](https://app.fireworks.ai/models/fireworks/deepseek-v4-pro):
| Field | Value |
| ------------------- | ------------------------------------------- |
| **id** | `accounts/fireworks/models/deepseek-v4-pro` |
| **name** | `DeepSeek V4 Pro` |
| **url** | `https://api.fireworks.ai/inference/v1` |
| **toolCalling** | `true` |
| **vision** | `false` |
| **maxInputTokens** | `1000000` |
| **maxOutputTokens** | `384000` |
Use the exact model `id` and token limits from the model page in the [Model Library](https://app.fireworks.ai/models). Values differ per model.
Return to Copilot Chat, open the model picker (**Auto**), expand **Other Models**, and choose your model under **Fireworks AI**.
## Related
* [Claude Code](/ecosystem/integrations/claude-code) — use Fireworks models with Claude Code
* [Development Setup with Fireworks Docs MCP](/ecosystem/integrations/development-setup) — add Fireworks docs to your coding agent
# MLOps & Observability
Source: https://docs.fireworks.ai/ecosystem/integrations/mlops-observability
Track and monitor your Fireworks AI deployments with leading MLOps and observability platforms
Fireworks AI integrates with industry-leading MLOps and observability platforms to help you monitor, track, and optimize your AI applications in production.
## Supported Platforms
Track fine-tuning experiments and visualize training metrics with W\&B
Use MLflow Tracing to track prompts, outputs, latency, and more as you build AI applications with Fireworks AI
## Need Help?
For assistance with MLOps and observability integrations, [contact our team](https://fireworks.ai/contact) or join our [Discord community](https://discord.gg/fireworks-ai).
# OpenCode
Source: https://docs.fireworks.ai/ecosystem/integrations/opencode
Use Fireworks AI models in OpenCode with the FireConnect CLI
[FireConnect](https://github.com/fw-ai/fireconnect) routes [OpenCode](https://opencode.ai) through Fireworks AI models. Install the CLI once, then use `fireconnect on --harness opencode` to switch providers without editing config files by hand.
## Prerequisites
* [OpenCode](https://opencode.ai) installed
* A [Fireworks API key](https://app.fireworks.ai/settings/users/api-keys) (`fw_...`) or a [Fire Pass](/firepass) key (`fpk_...`)
* The FireConnect CLI (see [Install the CLI](#install-the-cli) below)
## Install the CLI
If you do not already have `fireconnect` on your `PATH`, install it with:
```bash theme={null}
curl -fsSL https://raw.githubusercontent.com/fw-ai/fireconnect/main/install.sh | bash
```
The installer also configures Claude Code by default. If you only use OpenCode, run `fireconnect off` after install to restore your Claude Code settings, then follow the steps below.
For non-interactive setup:
```bash theme={null}
curl -fsSL https://raw.githubusercontent.com/fw-ai/fireconnect/main/install.sh | FIREWORKS_API_KEY="fw_..." bash
```
The installer uses Node.js to update settings (it does not install or update npm packages), clones the CLI to `~/.fireconnect/cli`, and adds a `fireconnect` launcher to `~/.local/bin`.
## Enable Fireworks routing
```bash theme={null}
export FIREWORKS_API_KEY=fw_...
fireconnect on --harness opencode
```
Restart OpenCode after enabling, then confirm routing:
```bash theme={null}
fireconnect status --harness opencode
```
## Using Fire Pass
Use your `fpk_...` key instead of a standard `fw_...` key:
```bash theme={null}
export FIREWORKS_API_KEY=fpk_...
fireconnect on --harness opencode --api-key fpk_...
```
FireConnect detects Fire Pass keys and defaults OpenCode to `kimi-k2p6-turbo` — the only model covered by the Fire Pass subscription.
## Default model
OpenCode routes a single default model (no opus/sonnet/haiku alias slots). The default is `kimi-k2p6-turbo`, written to config as `fireworks/accounts/fireworks/routers/kimi-k2p6-turbo`.
Short model IDs like `glm-5p1` are expanded to full Fireworks paths (for example, `accounts/fireworks/models/glm-5p1`).
## What gets written
FireConnect merges a `provider.fireworks` block into `~/.config/opencode/opencode.json`:
* An OpenAI-compatible adapter pointed at `https://api.fireworks.ai/inference/v1`
* A default `model` set to `fireworks/<full-model-path>` (for example, `fireworks/accounts/fireworks/routers/kimi-k2p6-turbo`)
* Your other providers are left untouched
FireConnect snapshots your original `opencode.json` before the first change. The snapshot lives in `~/.fireconnect/opencode/`. Running `fireconnect off --harness opencode` restores the file byte-for-byte.
### API key handling
* If the key comes from `FIREWORKS_API_KEY`, it is written as `{env:FIREWORKS_API_KEY}` so the secret stays out of the config file.
* Passing `--api-key` writes the literal key instead.
* OpenCode's `auth.json` is never touched.
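The key-handling rules above can be expressed as a short Python sketch, together with a check you can run against your own config to see whether a literal key ended up in it. This is illustrative, not FireConnect's code: the function names are hypothetical, `--api-key` taking precedence over the environment variable is an assumption, and the regular expression assumes keys look like `fw_` or `fpk_` followed by letters and digits.

```python
import re

# Assumption: Fireworks keys are fw_/fpk_ followed by alphanumeric characters.
LITERAL_KEY = re.compile(r"\b(?:fw|fpk)_[A-Za-z0-9]+")


def api_key_value(cli_key=None, env=None):
    """Return the string that would be stored in the config for the API key."""
    env = {} if env is None else env
    if cli_key:
        return cli_key  # --api-key: the literal key is written
    if env.get("FIREWORKS_API_KEY"):
        return "{env:FIREWORKS_API_KEY}"  # reference only, secret stays in env
    raise ValueError("no Fireworks API key provided")


def has_literal_key(config_text):
    """Return True if the config text appears to contain a literal key."""
    return bool(LITERAL_KEY.search(config_text))
```

For example, `has_literal_key(open(path).read())` on your `opencode.json` tells you whether the file is safe to commit or share.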
## CLI reference
All commands use `--harness opencode`:
```bash theme={null}
fireconnect on --harness opencode # Enable Fireworks routing
fireconnect off --harness opencode # Restore original config
fireconnect status --harness opencode # Check current provider
fireconnect list --harness opencode # Show the current model
fireconnect set --harness opencode --main glm-5p1 # Switch model
fireconnect reset --harness opencode # Reset model to default
```
Run `fireconnect help <command>` for all options.
### Switch models
```bash theme={null}
fireconnect set --harness opencode --main glm-5p1
```
### Turn off Fireworks routing
```bash theme={null}
fireconnect off --harness opencode
```
This restores your previous `opencode.json` from the backup in `~/.fireconnect/opencode/`.
### Use a non-default config file
```bash theme={null}
fireconnect on --harness opencode --config-path /path/to/opencode.json
```
## Built-in provider connection
OpenCode also supports connecting to Fireworks directly without FireConnect:
1. Type `/connect` in OpenCode and search for **fireworks.ai**
2. Paste your Fireworks API key and press Enter
3. Type `/models` and select a model (for Fire Pass, choose **Kimi K2.6 Turbo**)
## Source
FireConnect is open source: [github.com/fw-ai/fireconnect](https://github.com/fw-ai/fireconnect)
# Cookbooks
Source: https://docs.fireworks.ai/examples/cookbooks
Interactive Jupyter notebooks demonstrating advanced use cases and best practices with Fireworks AI
Explore our collection of notebooks that showcase real-world applications, best practices, and advanced techniques for building with Fireworks AI.
## Fine-Tuning & Training
Transfer large model capabilities to efficient models using a two-stage SFT + RFT approach.
**Techniques:** Supervised Fine-Tuning (SFT) + Reinforcement Fine-Tuning (RFT)
**Results:** 52% → 70% accuracy on GSM8K mathematical reasoning
Beat frontier closed-source models for product catalog cleansing with vision-language model fine-tuning.
**Techniques:** Supervised Fine-Tuning (SFT)
**Results:** 48% increase in quality from base model
## Multimodal AI
Extract structured data from invoices, forms, and financial documents using state-of-the-art OCR and document understanding.
**Use Cases:** Forms, invoices, financial documents, product catalogs
**Results:** 90.8% accuracy on invoice extraction (100% on invoice numbers and dates)
Real-time audio transcription with streaming support and low latency.
**Features:** Streaming support, low-latency transcription, production-ready
Analyze video and audio content with Qwen3 Omni, a multimodal model supporting video, audio, and text inputs.
**Features:** Video captioning, scene analysis, content understanding, multimodal Q\&A
## API Features
Leverage Model Context Protocol (MCP) for GitHub repository analysis, code search, and documentation Q\&A.
**Features:** Repository analysis, code search, documentation Q\&A, GitMCP integration
**Models:** Qwen 3 235B with external tool support
# Courses
Source: https://docs.fireworks.ai/examples/introduction
Standalone end-to-end examples showing how to use Fireworks to solve real-world use cases
Learn how to use Fireworks to fine-tune a model to convert natural language to SQL queries.
Learn how to build reinforcement learning systems that avoid reward hacking.
Learn to distill the knowledge of large AI models into efficient, deployable alternatives.
# How do I close my Fireworks.ai account?
Source: https://docs.fireworks.ai/faq-new/account-access/how-do-i-close-my-fireworksai-account
To close your account:
1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai)
2. Include in your request:
* Your account ID
* A clear request for account deletion
Before closing your account, please ensure:
* All outstanding invoices are paid, and any payment issues on prepaid accounts are resolved
* Any active deployments are terminated
* Important data is backed up if needed
# I have multiple Fireworks accounts. When I try to login with Google on Fireworks' web UI, I'm getting signed into the wrong account. How do I fix this?
Source: https://docs.fireworks.ai/faq-new/account-access/i-have-multiple-fireworks-accounts-when-i-try-to-login-with-google-on-fireworks
If you log in with Google, account selection is controlled by Google. You can log in from an incognito window or create separate Chrome/browser profiles to log in with different Google accounts. You can also follow the steps in this [guide](https://support.google.com/accounts/answer/13533235?hl=en#zippy=%2Csign-in-with-google) to disassociate Fireworks.ai from a particular Google account sign-in. If you have more complex issues, please contact us on Discord.
# What email does GitHub authentication use?
Source: https://docs.fireworks.ai/faq-new/account-access/what-email-does-github-authentication-use
When you authenticate with Fireworks using GitHub, we use the **primary email address** associated with your GitHub account for identification and account management.
## How it works
Fireworks automatically retrieves your primary email address from your GitHub profile during the authentication process. This email address becomes your Fireworks account identifier.
## Managing your primary email
To change your primary email address on GitHub:
1. Go to your [GitHub email settings](https://github.com/settings/emails)
2. Select the email address you want to set as primary in the "Primary email address" section
You can also follow the [GitHub documentation](https://docs.github.com/en/enterprise-cloud@latest/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-email-preferences/changing-your-primary-email-address) for detailed instructions on managing email preferences.
## Switching between accounts
You can easily switch which Fireworks account your GitHub authentication logs into by changing your primary email address on GitHub before logging in. This allows you to:
* Log into different Fireworks accounts using the same GitHub account
* Switch between personal and work accounts by updating your GitHub primary email
* Maintain separate billing and usage tracking for different email addresses
The authentication will use whatever email is set as primary at the time of login, so you can switch accounts by simply updating your GitHub primary email before authenticating.
# What email does LinkedIn authentication use?
Source: https://docs.fireworks.ai/faq-new/account-access/what-email-does-linkedin-authentication-use
When you authenticate with Fireworks using LinkedIn, we use the **primary email address** associated with your LinkedIn account for identification and account management.
## How it works
Fireworks automatically retrieves your primary email address from your LinkedIn profile during the authentication process. This email address becomes your Fireworks account identifier.
## Managing your primary email
To change your primary email address on LinkedIn:
1. Go to your [LinkedIn email settings](https://www.linkedin.com/mypreferences/d/manage-email-addresses)
2. From there, you can add new email addresses or change your primary email
3. Click **Add email address** to add a new email or select an existing one to make primary
You can also follow the [LinkedIn documentation](https://www.linkedin.com/help/linkedin/answer/a519904) for detailed instructions on managing email preferences.
## Switching between accounts
You can easily switch which Fireworks account your LinkedIn authentication logs into by changing your primary email address on LinkedIn before logging in. This allows you to:
* Log into different Fireworks accounts using the same LinkedIn account
* Switch between personal and work accounts by updating your LinkedIn primary email
* Maintain separate billing and usage tracking for different email addresses
The authentication will use whatever email is set as primary at the time of login, so you can switch accounts by simply updating your LinkedIn primary email before authenticating.
# What should I do if I can't access my company account after being invited when I already have a personal account?
Source: https://docs.fireworks.ai/faq-new/account-access/what-should-i-do-if-i-cant-access-my-company-account-after-being-invited-when-i
This issue can occur when you have multiple accounts associated with the same email address (e.g., a personal account created with Google login and a company account you've been invited to).
To resolve this:
1. Email [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) from the email address associated with both accounts
2. Include in your email:
* The account ID you created personally (e.g., username-44ace8)
* The company account ID you need access to (e.g., company-a57b2a)
* Mention that you're having trouble accessing your company account
Note: This is a known scenario that support can resolve once they verify your email ownership.
# Are there discounts for bulk usage?
Source: https://docs.fireworks.ai/faq-new/billing-pricing/are-there-discounts-for-bulk-usage
We offer discounts for bulk or pre-paid purchases. Contact [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) to discuss volume pricing.
# Are there extra fees for serving fine-tuned models?
Source: https://docs.fireworks.ai/faq-new/billing-pricing/are-there-extra-fees-for-serving-fine-tuned-models
Fine-tuned (LoRA) models require a dedicated deployment to serve. Here's what you need to know:
**What you pay for**:
* **Deployment costs** on a per-GPU-second basis for hosting the model
* **The fine-tuning process** itself, if applicable
**Deployment options**:
* **Live-merge deployment**: Deploy your LoRA model with weights merged into the base model for optimal performance
* **Multi-LoRA deployment**: Deploy up to 100 LoRA models as addons on a single base model deployment
For more details on deploying fine-tuned models, see the [Deploying Fine Tuned Models guide](/fine-tuning/deploying-loras).
# How does billing and credit usage work?
Source: https://docs.fireworks.ai/faq-new/billing-pricing/how-does-billing-and-credit-usage-work
Fireworks uses a **pre-paid credits** model for new self-serve accounts:
* Add a valid payment method and billing address, then purchase credits to use the platform.
* Usage across serverless, on-demand deployments, and fine-tuning deducts from your credit balance.
* If your balance reaches zero and auto top-up is not enabled, usage pauses until you add credits.
* You can configure auto top-up and a monthly budget cap in the billing dashboard.
Legacy and enterprise exceptions:
* Accounts created before **June 1** remain on their existing postpaid terms unless migrated.
* Enterprise accounts can be configured for postpaid billing on request.
For grandfathered and postpaid enterprise accounts, usage and billing operate through a **tiered system**:
* Each **tier** has a monthly usage limit, regardless of available credits.
* Once you reach your tier's limit, **service will be suspended** even if you have remaining credits.
* **Usage limits** reset at the beginning of each month.
* Pre-purchased credits do not prevent additional charges once the limit is exceeded.
Enterprise accounts do not have the same self-serve limits. See [Enterprise quotas](/faq/enterprise/service/quotas) for more information.
For details on spend limits, budget caps, and quota controls, see our [Account quotas guide](/guides/quotas_usage/account-quotas#view-and-adjust-your-spend-limit).
# How many tokens per image?
Source: https://docs.fireworks.ai/faq-new/billing-pricing/how-many-tokens-per-image
Learn how to calculate token usage for images in vision models and understand pricing implications
Image token consumption varies by model and resolution, typically ranging from 1,000 to 2,500 tokens per image for most common resolutions.
## Common resolution token counts
The following table shows the token counts for a single image for Qwen2.5 VL at different image resolutions:
| Resolution | Token Count |
| ---------- | ----------- |
| 336×336 | 144 |
| 672×672 | 576 |
| 1024×1024 | 1,369 |
| 1280×720 | 1,196 |
| 1920×1080 | 2,769 |
| 2560×1440 | 4,641 |
| 3840×2160 | 10,549 |
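For a quick estimate without loading the processor, you can approximate the count from the image dimensions. The sketch below assumes Qwen2.5 VL spends one token per 28×28-pixel block (14-pixel patches merged 2×2) and rounds each side to the nearest multiple of 28; it ignores the processor's minimum and maximum pixel limits. The function name `estimate_image_tokens` is hypothetical, and the result may not match every row of the table, so treat the processor-based count as authoritative.

```python
def estimate_image_tokens(width, height, factor=28):
    """Approximate image tokens as (columns x rows) of factor-sized blocks.

    Assumes one token per 28x28 block and no min/max pixel rescaling.
    """
    cols = max(1, round(width / factor))
    rows = max(1, round(height / factor))
    return cols * rows


for size in [(336, 336), (1024, 1024), (1280, 720)]:
    print(size, estimate_image_tokens(*size))
```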
## Calculating exact token count for your images
You can determine exact token usage by processing your images through the model's tokenizer.
For instance, for Qwen2.5 VL, you can use the following code:
```bash theme={null}
pip install torch torchvision transformers pillow
```
```python Tokenizing your image theme={null}
import os
import sys
from io import BytesIO

import requests
from PIL import Image
from transformers import AutoProcessor

# Your image source - can be URL or local path
IMAGE_URL_OR_PATH = "https://images.unsplash.com/photo-1519125323398-675f0ddb6308"


def load_image(source):
    """Load image from URL or local file path"""
    if source.startswith(('http://', 'https://')):
        print(f"Downloading image from URL: {source}")
        response = requests.get(source, timeout=30)
        response.raise_for_status()
        # Decode the bytes already downloaded instead of fetching the URL twice
        return Image.open(BytesIO(response.content))
    else:
        print(f"Loading image from path: {source}")
        if not os.path.exists(source):
            raise FileNotFoundError(f"Image file not found: {source}")
        return Image.open(source)


def count_image_tokens(image):
    """Count how many tokens an image takes using Qwen 2.5 VL processor"""
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "What's in this image?"},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=text, images=[image], return_tensors="pt")
    input_ids = inputs["input_ids"][0]
    # Count the image pad tokens (151655 is Qwen2.5 VL's image token ID)
    image_tokens = (input_ids == 151655).sum().item()
    return image_tokens, input_ids


def main():
    image_source = sys.argv[1] if len(sys.argv) > 1 else IMAGE_URL_OR_PATH
    print(f"Processing image: {image_source}")
    image = load_image(image_source)
    print(f"Image size: {image.size}")
    print(f"Image mode: {image.mode}")
    print("\nCalculating tokens...")
    image_tokens, input_ids = count_image_tokens(image)
    print(f"Total tokens: {len(input_ids)}")
    print(f"Image tokens: {image_tokens}")
    print(f"Text tokens: {len(input_ids) - image_tokens}")


if __name__ == "__main__":
    main()
```
```bash Usage theme={null}
# Calculate tokens for an image URL
python token_calculator.py "https://example.com/image.jpg"
# Calculate tokens for a local image
python token_calculator.py "path/to/your/image.png"
```
# How much does Fireworks cost?
Source: https://docs.fireworks.ai/faq-new/billing-pricing/how-much-does-fireworks-cost
Fireworks AI uses a **usage-based pre-paid** model for new self-serve accounts. You purchase credits, then usage is deducted based on:
* **Per token** for serverless inference
* **Per GPU usage time** for on-demand deployments
* **Per token of training data** for fine-tuning
Billing model by account type:
* Accounts created before **June 1** keep their existing postpaid terms (grandfathered).
* Enterprise accounts can be configured for postpaid billing on request.
For customers needing **enterprise-grade security and reliability**, please reach out to us at [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai) to discuss options.
Find out more about our current pricing on our [Pricing page](https://fireworks.ai/pricing).
# Is prompt caching billed differently for serverless models?
Source: https://docs.fireworks.ai/faq-new/billing-pricing/is-prompt-caching-billed-differently
Yes, **cached prompt tokens are discounted compared to uncached tokens for serverless models**. The default discount is 50%, but the exact discount varies by model. Check the [Model Library](https://fireworks.ai/models) for model-specific cached and uncached input token pricing.
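As a rough illustration of how the discount changes an input-token bill, the sketch below applies a cached-token discount to a per-token price. The price and token counts are made-up placeholders, and `input_token_cost` is a hypothetical helper; only the 50% default discount comes from the answer above.

```python
def input_token_cost(uncached_tokens, cached_tokens, price_per_million, cached_discount=0.5):
    """Estimate input-token cost in dollars, with cached tokens billed at a discount."""
    full_price = price_per_million / 1_000_000       # dollars per uncached token
    cached_price = full_price * (1 - cached_discount)  # dollars per cached token
    return uncached_tokens * full_price + cached_tokens * cached_price

# Hypothetical rate of $0.20 per 1M input tokens (not a real Fireworks price)
cost = input_token_cost(1_000_000, 1_000_000, price_per_million=0.20)
print(f"${cost:.2f}")
```

Because the discount varies by model, a real estimate would replace `cached_discount` and `price_per_million` with the values listed for that model.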
# How do credits work?
Source: https://docs.fireworks.ai/faq-new/billing-pricing/what-happens-when-i-finish-my-1-dollar-credit
## How credits are applied
Fireworks uses a **pre-paid credits** model for new self-serve accounts:
* Credits are used first for all usage.
* If credits are exhausted and auto top-up is disabled, usage pauses until you add credits.
* If auto top-up is enabled, credits are purchased automatically when your balance reaches your configured minimum.
* You can set a monthly budget cap to limit total spend.
Accounts created before **June 1** remain on existing postpaid terms (grandfathered). Enterprise accounts can also be configured for postpaid billing on request.
## Missing credits after purchase?
If you don't see your credits reflected immediately:
1. Visit your **billing dashboard**
2. Review the **"Credits"** section
3. Check your **credit balance** and **auto top-up settings**
**Important**: In the pre-paid model, usage consumes available credits. If your balance is low, enable auto top-up to avoid interruptions.
## Why did I receive an invoice after depositing credits?
Most self-serve accounts on pre-paid billing should not see month-end overage invoices. If you received an invoice, your account is likely on a postpaid contract (for example, grandfathered or enterprise postpaid terms).
If this seems unexpected, contact [community\_billing@fireworks.ai](mailto:community_billing@fireworks.ai) so we can confirm your billing configuration.
## What happens when I finish my \$1 credit?
When you finish your \$1 credit, the following occurs:
### Account Status
* **Without payment method**: Your account will be **suspended** until you add a payment method. For request-rate behavior, see [Account quotas](/guides/quotas_usage/account-quotas#account-wide-request-limits); for serverless TPM upper bounds, see [Serverless rate limits](/serverless/rate-limits).
* **With payment method**: Add credits to continue usage. [Account-wide request limits](/guides/quotas_usage/account-quotas#account-wide-request-limits) increase, and [serverless TPM upper bounds](/serverless/rate-limits) grow as your account spend tier rises.
**Payment Method Requirements:**
* A payment method is required to continue service after credit depletion
* Once a payment method is on file, add credits (or enable auto top-up) to resume usage
* New self-serve accounts use pre-paid billing by default
* Grandfathered and enterprise postpaid accounts can still receive invoices based on their contract terms
* As you spend more with Fireworks, your adaptive usage limits and serverless TPM upper bounds can increase
## Where's my receipt for purchased credits?
Receipts for purchased credits are sent via Stripe upon purchase. Check your email for receipts from Stripe (not Fireworks). If you can't find your receipt, contact [community\_billing@fireworks.ai](mailto:community_billing@fireworks.ai).
For spend limits, tiers, and account-wide request limits, see [Account quotas](/guides/quotas_usage/account-quotas). For adaptive serverless TPM upper bounds, see [Serverless rate limits](/serverless/rate-limits).
# Why might my account be suspended even with remaining credits?
Source: https://docs.fireworks.ai/faq-new/billing-pricing/why-might-my-account-be-suspended-even-with-remaining-credits
Your account may be suspended due to several factors:
1. **Budget cap reached**:
   * Your monthly budget cap can pause usage even if you still have credit balance.
   * Increase your budget cap in Billing to resume usage.
2. **Payment or risk checks**:
   * Accounts may be temporarily paused if payment verification fails.
   * In some cases, manual review can temporarily limit usage.
3. **Billing model mismatch**:
   * Accounts created before **June 1** may still be on grandfathered postpaid terms.
   * Enterprise accounts may use custom postpaid billing if requested.
If you're experiencing account suspension issues or need assistance with your budget and billing limits, please contact [inquiries@fireworks.ai](mailto:inquiries@fireworks.ai).
# Are there any quotas for serverless?
Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/are-there-any-quotas-for-serverless
Yes. Standard serverless, Priority tier, and Fast all have serverless rate limits and quotas.
For the detailed serverless policy, see our [Serverless rate limits guide](/serverless/rate-limits).
# Do you provide notice before removing model availability?
Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/do-you-provide-notice-before-removing-model-availability
Yes, we provide advance notice before removing models from the serverless infrastructure:
* **Minimum 2 weeks’ notice** before model removal
* Longer notice periods and extended deprecation timelines may be provided for **popular, higher-usage models**
**Best Practices**:
1. Monitor announcements regularly.
2. Prepare a migration plan in advance.
3. Test alternative models to ensure continuity.
4. Keep your contact information updated for timely notifications.
# Do you support Auto Scaling?
Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/do-you-support-auto-scaling
Yes, our system supports **auto scaling** with the following features:
* **Scaling down to zero** capability for resource efficiency
* Controllable **scale-up and scale-down velocity**
* **Custom scaling rules and thresholds** to match your specific needs
# How does autoscaling affect my costs?
Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/how-does-autoscaling-affect-my-costs
* **Scaling from 0**: No minimum cost when scaled to zero
* **Scaling up**: Each new replica adds to your total cost proportionally. For example:
  * Scaling from 1 to 2 replicas doubles your GPU costs
  * If each replica uses multiple GPUs, costs scale accordingly (e.g., scaling from 1 to 2 replicas with 2 GPUs each means paying for 4 GPUs total)
For current pricing details, please visit our [pricing page](https://fireworks.ai/pricing).
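The proportional scaling described above amounts to simple multiplication. The sketch below is illustrative only: `hourly_gpu_cost` is a hypothetical helper and the `RATE` value is a placeholder, not a real Fireworks price.

```python
def hourly_gpu_cost(replicas, gpus_per_replica, price_per_gpu_hour):
    """Total hourly cost: every GPU on every active replica is billed."""
    return replicas * gpus_per_replica * price_per_gpu_hour

RATE = 4.00  # hypothetical dollars per GPU-hour, for illustration only

for replicas in (0, 1, 2):
    total_gpus = replicas * 2  # 2 GPUs per replica, as in the example above
    cost = hourly_gpu_cost(replicas, gpus_per_replica=2, price_per_gpu_hour=RATE)
    print(f"{replicas} replicas -> {total_gpus} GPUs -> ${cost:.2f}/hour")
```

With zero replicas the cost is zero, and going from 1 to 2 replicas doubles it, which matches the bullets above.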
# How does billing and scaling work for on-demand GPU deployments?
Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/how-does-billing-and-scaling-work-for-on-demand-gpu-deployments
On-demand GPU deployments have unique billing and scaling characteristics compared to serverless deployments:
**Billing**:
* Charges start when the server begins accepting requests
* **Billed by GPU-second** for each active instance
* Costs accumulate even if there are no active API calls
**Scaling options**:
* Supports **autoscaling** from 0 to multiple GPUs
* Each additional GPU **adds to the billing rate**
* Can handle unlimited requests within the GPU’s capacity
**Management requirements**:
* Not fully serverless; requires some manual management
* **Manually delete deployments** when no longer needed
* Or configure autoscaling to **scale down to 0** during inactive periods
**Cost control tips**:
* Regularly **monitor active deployments**
* **Delete unused deployments** to avoid unnecessary costs
* Consider **serverless options** for intermittent usage
* Use **autoscaling to 0** to optimize costs during low-demand times
# How does billing work for on-demand deployments?
Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/how-does-billing-work-for-on-demand-deployments
On-demand deployments come with automatic cost optimization features:
* **Default autoscaling**: Automatically scales to 0 replicas when not in use
* **Pay for what you use**: Charged only for GPU time when replicas are active
* **Flexible configuration**: Customize autoscaling behavior to match your needs
**Best practices for cost management**:
1. **Leverage default autoscaling**: The system automatically scales down deployments when not in use
2. **Customize carefully**: While you can modify autoscaling behavior using our [configuration options](https://docs.fireworks.ai/guides/ondemand-deployments#customizing-autoscaling-behavior), note that preventing scale-to-zero will result in continuous GPU charges
3. **Consider your use case**: For intermittent or low-frequency usage, serverless deployments might be more cost-effective
For detailed configuration options, see our [deployment guide](https://docs.fireworks.ai/guides/ondemand-deployments#replica-count-horizontal-scaling).
# How does the system scale?
Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/how-does-the-system-scale
Our system is **horizontally scalable**, meaning it:
* Scales linearly with additional **replicas** of the deployment
* **Automatically allocates resources** based on demand
* Manages **distributed load handling** efficiently
# Are there SLAs for serverless?
Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/is-latency-guaranteed-for-serverless-models
Our multi-tenant serverless offering does not currently come with Service Level Agreements (SLAs) for latency or availability.
If you have specific performance or availability requirements, we recommend:
* **On-demand deployments**: Provides dedicated resources with predictable performance
* **Contact sales**: [Reach out to discuss](https://fireworks.ai/company/contact-us) custom solutions and enterprise options
# What are the rate limits for on-demand deployments?
Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/what-are-the-rate-limits-for-on-demand-deployments
On-demand deployments have GPU quotas that determine your maximum allocation.
For detailed information about on-demand deployment quotas and GPU limits, see our [Account quotas guide](/guides/quotas_usage/account-quotas#on-demand-deployment-quotas).
Need higher GPU allocations? [Contact us](https://fireworks.ai/company/contact-us) to discuss custom solutions for your use case.
# What factors affect the number of simultaneous requests that can be handled?
Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/what-factors-affect-the-number-of-simultaneous-requests-that-can-be-handled
The request handling capacity is influenced by multiple factors:
* **Model size and type**
* **Number of GPUs** allocated to the deployment
* **GPU type** (e.g., A100 vs. H100)
* **Prompt size** and **generation token length**
* **Deployment type** (serverless vs. on-demand)
# What’s the supported throughput?
Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/whats-the-supported-throughput
Throughput capacity typically depends on several factors:
* **Deployment type** (serverless or on-demand)
* **Traffic patterns** and **request patterns**
* **Hardware configuration**
* **Model size and complexity**
# Why am I experiencing request timeout errors and slow response times with serverless LLM models?
Source: https://docs.fireworks.ai/faq-new/deployment-infrastructure/why-am-i-experiencing-request-timeout-errors-and-slow-response-times-with-server
Timeout errors and increased response times can occur due to **server load during high-traffic periods**.
With serverless, users are essentially **sharing a pool of GPUs** with models pre-provisioned.
The goal of serverless is to allow users and teams to **seamlessly power their generative applications** with the **latest generative models** in **less than 5 lines of code**.
Deployment barriers should be **minimal** and **pricing is based on usage**.
However, this approach has trade-offs: to ensure users have **consistent access** to the most in-demand models, users are also subject to **minor latency and performance variability** during **high-volume periods**.
With **on-demand deployments**, users reserve GPUs (which are **billed by rented time** instead of usage volume) and don't have to worry about traffic spikes.
For this reason, we recommend the following to address timeout and response-time issues:
### Current solution (recommended for production)
* **Use on-demand deployments** for more stable performance
* **Guaranteed response times**
* **Dedicated resources** to ensure availability
We are always investing in ways to improve speed and performance.
### Upcoming improvements
* Enhanced SLAs for uptime
* More consistent generation speeds during peak load times
If you experience persistent issues, please include the following details in your support request:
1. Exact **model name**
2. **Timestamp** of errors (in UTC)
3. **Frequency** of timeouts
4. **Average wait times**
### Performance optimization tips
* Consider **batch processing** for handling bulk requests
* Implement **retry logic with exponential backoff**
* Monitor **usage patterns** to identify peak traffic times
* Set **appropriate timeout settings** based on model complexity
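The retry tip above can be sketched with the standard library alone. This is a minimal sketch, not an official client feature: `call_with_backoff` and `flaky_request` are hypothetical names, and retrying on `TimeoutError` is an assumption; in practice you would pass whatever timeout or rate-limit exceptions your HTTP client raises.

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(TimeoutError,), sleep=time.sleep):
    """Call request_fn, retrying on the given exceptions with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt (1s, 2s, 4s, ...), capped at max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads retries out so clients do not retry in lockstep
            sleep(random.uniform(0, delay))

# Demo with a stand-in request that times out twice, then succeeds
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

# The demo passes a no-op sleep so it finishes instantly
result = call_with_backoff(flaky_request, sleep=lambda seconds: None)
print(result, "after", attempts["count"], "attempts")
```

Capping the delay and the number of attempts keeps a persistent outage from turning into an unbounded wait.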
# Does Fireworks support custom base models?
Source: https://docs.fireworks.ai/faq-new/models-inference/does-fireworks-support-custom-base-models
Yes, custom base models can be deployed via **firectl**. You can learn more about custom model deployment in our [guide on uploading custom models](https://docs.fireworks.ai/models/uploading-custom-models).
# Does the API support batching and load balancing?
Source: https://docs.fireworks.ai/faq-new/models-inference/does-the-api-support-batching-and-load-balancing
Current capabilities include:
* **Load balancing**: Yes, supported out of the box
* **Continuous batching**: Yes, supported
* **Batch inference**: Yes, supported via the [Batch API](/guides/batch-inference)
* **Streaming**: Yes, supported
For asynchronous batch processing of large volumes of requests, see our [Batch API documentation](/guides/batch-inference).
# FLUX image generation
Source: https://docs.fireworks.ai/faq-new/models-inference/flux-image-generation
## Can I generate multiple images in a single API call?
No, FLUX serverless supports only one image per API call. For multiple images, send separate parallel requests—these will be automatically load-balanced across our replicas for optimal performance.
## Does FLUX support image-to-image generation?
No, image-to-image generation is not currently supported. We are evaluating this feature for future implementation. If you have specific use cases, please share them with our support team to help inform development.
## Can I create custom LoRA models with FLUX?
Inference on FLUX-LoRA adapters is currently supported. However, managed training on Fireworks with FLUX is not, although this feature is under development. Updates about our managed LoRA training service will be announced when available.
# How do I control output image sizes when using SDXL ControlNet?
Source: https://docs.fireworks.ai/faq-new/models-inference/how-do-i-control-output-image-sizes-when-using-sdxl-controlnet
When using **SDXL ControlNet** (e.g., canny control), the output image size is determined by the explicit **width** and **height** parameters in your API request.
The input control signal image will be automatically:
* **Resized** to fit your specified dimensions
* **Cropped** to preserve aspect ratio
**Example**: To generate a 768x1344 image, explicitly include these parameters in your request:
```json theme={null}
{
  "width": 768,
  "height": 1344
}
```
*Note*: While these parameters may not appear in the web interface examples, they are supported API parameters that can be included in your requests.
# How to check if a model is available on serverless?
Source: https://docs.fireworks.ai/faq-new/models-inference/how-to-check-if-a-model-is-available-on-serverless
## Web UI
Go to [https://app.fireworks.ai/models?filter=LLM\&serverless=true](https://app.fireworks.ai/models?filter=LLM\&serverless=true)
## API
You can programmatically retrieve all serverless models using the [List Models API](/api-reference/list-models) with the `supports_serverless=true` filter.
```python theme={null}
from fireworks import Fireworks
client = Fireworks()
# List all serverless models
models = client.models.list(filter="supports_serverless=true")
for model in models:
    print(model.name)
```
You can also combine filters and customize the response:
```python theme={null}
# List serverless models with pagination
models = client.models.list(
    filter="supports_serverless=true",
    page_size=50,
)
for model in models:
    print(f"{model.name}: {model.display_name}")
```
```bash theme={null}
curl "https://api.fireworks.ai/v1/accounts/fireworks/models?filter=supports_serverless%3Dtrue" \
-H "Authorization: Bearer $FIREWORKS_API_KEY"
```
With pagination:
```bash theme={null}
curl "https://api.fireworks.ai/v1/accounts/fireworks/models?filter=supports_serverless%3Dtrue&pageSize=50" \
-H "Authorization: Bearer $FIREWORKS_API_KEY"
```
The filter parameter uses the [AIP-160 filter syntax](https://google.aip.dev/160). The `supports_serverless` field indicates whether a model is available on serverless infrastructure.
See the [List Models API reference](/api-reference/list-models) for all available parameters including `order_by`, `page_size`, and `read_mask`.
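If you build the request URL yourself, the `=` inside the filter expression has to be percent-encoded, which is why the curl examples use `%3D`. The sketch below uses only the standard library to show the encoding; `list_models_url` is a hypothetical helper, and the endpoint and `pageSize` parameter are taken from the curl examples above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.fireworks.ai/v1/accounts/fireworks/models"

def list_models_url(filter_expr, page_size=None):
    """Build a List Models URL with a percent-encoded filter expression."""
    params = {"filter": filter_expr}
    if page_size is not None:
        params["pageSize"] = page_size
    # urlencode escapes '=' inside the filter value as %3D
    return f"{BASE_URL}?{urlencode(params)}"

print(list_models_url("supports_serverless=true", page_size=50))
```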
# There’s a model I would like to use that isn’t available on Fireworks. Can I request it?
Source: https://docs.fireworks.ai/faq-new/models-inference/theres-a-model-i-would-like-to-use-that-isnt-available-on-fireworks-can-i-reques
Fireworks supports a wide array of custom models and actively takes feature requests for new, popular models to add to the platform.
**To request new models**:
1. **Join our [Discord server](https://discord.gg/fireworks-ai)**
2. Let us know which models you’d like to see
3. Provide **use case details**, if possible, to help us prioritize
We regularly evaluate and add new models based on:
* **Community requests**
* **Popular demand**
* **Technical feasibility**
* **Licensing requirements**
# What factors affect the number of simultaneous requests that can be handled?
Source: https://docs.fireworks.ai/faq-new/models-inference/what-factors-affect-the-number-of-simultaneous-requests-that-can-be-handled
Request handling capacity depends on several factors:
* **Model size and type**
* **Number of GPUs allocated** to the deployment
* **GPU type** (e.g., A100, H100)
* **Prompt size**
* **Generation token length**
* **Deployment type** (serverless vs. on-demand)
# Fireworks Agent: Classification
Source: https://docs.fireworks.ai/fine-tuning/agent/classification
Benchmark base models, fine-tune on labeled data, and pick the best classifier — automatically.
Fireworks Agent's classification workflow is purpose-built for label-prediction tasks. It evaluates one or more base models against your labeled dataset, fine-tunes the strongest candidates, and reports a head-to-head comparison of base vs fine-tuned accuracy on a held-out test split.
Use this workflow for classification, content extraction, intent detection, routing, and other tasks with a discrete label set.
For the underlying SFT mechanics (job parameters, supported base models, dataset format), see [Managed Fine-Tuning → Supervised Fine-Tuning](/fine-tuning/fine-tuning-models). The classification workflow is built on top of SFT with classification-specific dataset handling and reporting.
## What you give Agent
| Input | Required? | Notes |
| -------------------------------- | --------- | -------------------------------------------------------------------------------------- |
| Dataset ID(s) | **Yes** | Single dataset (split 80/20 train/test) or two datasets (separate train + eval) |
| Models to evaluate and fine-tune | **Yes** | Agent does **not** default to "all models"; pick from the supported list when prompted |
| Candidate labels | No | Agent infers labels from your data if you don't list them explicitly |
| Imbalance-ratio threshold | No | Defaults to `50.0` (ratio of most-frequent to least-frequent label) |
### Dataset requirements
* Each sample must contain `messages` in OpenAI chat-completion format.
* `ground_truth` is optional. If absent, Agent extracts the label from the final assistant message using task-specific logic in the plan.
* `ground_truth` may be a single string or a list of strings.
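A quick local check against these requirements might look like the following. It is a minimal sketch, not Agent's own validation logic: `validate_sample` is a hypothetical helper, and the sample record is invented for illustration.

```python
import json

def validate_sample(sample):
    """Return a list of problems found in one dataset record (empty if it looks OK)."""
    problems = []
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("missing or empty 'messages'")
    else:
        for i, message in enumerate(messages):
            if not isinstance(message, dict) or "role" not in message:
                problems.append(f"message {i} has no 'role'")
    if "ground_truth" in sample:  # optional field
        gt = sample["ground_truth"]
        is_str = isinstance(gt, str)
        is_str_list = isinstance(gt, list) and all(isinstance(x, str) for x in gt)
        if not (is_str or is_str_list):
            problems.append("'ground_truth' must be a string or a list of strings")
    return problems

# Invented example record, written as one JSON line
record = {
    "messages": [
        {"role": "user", "content": "I was charged twice this month."},
        {"role": "assistant", "content": "billing"},
    ],
    "ground_truth": "billing",
}
print(json.dumps(record))
print(validate_sample(record))
```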
## Example session instructions
Single dataset with automatic split, two candidate models:
```bash theme={null}
source .env && firectl session create \
--api-key $FIREWORKS_AGENT_API_KEY \
--instruction "Benchmark and fine-tune classification on accounts/myacct/datasets/intent-labels. Compare Qwen3 8B and Qwen3 32B. Labels are: billing, technical, account, sales."
```
Separate train and eval datasets, model already chosen:
```bash theme={null}
source .env && firectl session create \
--api-key $FIREWORKS_AGENT_API_KEY \
--instruction "Run classification fine-tuning on accounts/myacct/datasets/train, eval on accounts/myacct/datasets/test, using Qwen3 8B."
```
**Where classification lives in the 7-phase pipeline:** Phase 1 is dataset inspection (with per-label sample counts and imbalance detection), phase 2 is plan + cost approval, **phase 3 is the base-model benchmark plus fine-tuning sweep**, phase 4 is the full-data run for each candidate, **phase 5 is the fine-tuned evaluation** with per-label and overall accuracy, phase 6 is deployment of the winner, phase 7 is the base-vs-fine-tuned comparison report. See [How Agent runs a training job](/fine-tuning/agent/introduction#how-agent-runs-a-training-job).
## Workflow stages
Agent stages your dataset locally once, computes per-label sample counts, detects the imbalance ratio, and inspects message structure. If `ground_truth` is missing, Agent decides how to extract the label from the final assistant turn.
If you provided candidate labels in your instruction, Agent uses them. Otherwise, Agent infers them from the data and surfaces the inferred set in the plan for you to confirm.
Agent computes the ratio of most-frequent to least-frequent label. If it exceeds the threshold (default `50.0`), Agent flags the imbalance in the plan and proposes mitigations (rebalancing, weighted training, or a smaller candidate set).
Agent writes a plan and presents it with a cost breakdown (base-model evaluation inference + fine-tuning + fine-tuned evaluation inference + total). Approve once to proceed.
Agent runs inference on the held-out eval split against each selected base model and reports per-label and overall accuracy. This becomes the baseline.
Agent fine-tunes the strongest candidates with the default tuning grid (or your explicit grid), batched into waves of up to 6 active jobs.
Agent runs the same eval inference against each fine-tuned model and computes per-label accuracy, overall accuracy, and confusion-matrix style summaries.
Agent picks the winner, deploys it (phase 6), and writes the final comparison report (phase 7) showing base vs fine-tuned accuracy per candidate plus a `fireworks-ai` SDK snippet for inference.
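The imbalance check in the stages above is the count of the most frequent label divided by the count of the least frequent one. The sketch below reproduces that arithmetic locally so you can anticipate whether a dataset will be flagged; `imbalance_ratio` is a hypothetical helper, the label counts are invented, and it assumes single-label samples.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the most frequent label count to the least frequent label count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

THRESHOLD = 50.0  # default threshold stated in the docs

# Invented label distribution: 120 billing, 60 technical, 2 sales
labels = ["billing"] * 120 + ["technical"] * 60 + ["sales"] * 2
ratio = imbalance_ratio(labels)
print(f"ratio={ratio}, flagged={ratio > THRESHOLD}")
```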
## Output
When the session reports `succeeded`, Agent's response includes:
* Per-label and overall accuracy for every base model evaluated
* Per-label and overall accuracy for every fine-tuned candidate
* The winning model ID, deployment ID, and inference endpoint
* A `fireworks-ai` SDK snippet for label prediction
* `final_report.md` in the session workspace with the full base vs fine-tuned comparison and estimated-vs-actual cost
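For reference, per-label and overall accuracy can be computed as in the sketch below. This illustrates the metrics only and is not Agent's reporting code: `accuracy_report` is a hypothetical helper that assumes single-label samples and a non-empty evaluation set.

```python
from collections import defaultdict

def accuracy_report(y_true, y_pred):
    """Return (per-label accuracy dict, overall accuracy) for single-label predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, predicted in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == predicted)
    per_label = {label: correct[label] / total[label] for label in total}
    overall = sum(correct.values()) / len(y_true)
    return per_label, overall

# Invented example: one "billing" sample is misclassified as "sales"
y_true = ["billing", "billing", "sales", "sales"]
y_pred = ["billing", "sales", "sales", "sales"]
per_label, overall = accuracy_report(y_true, y_pred)
print(per_label, overall)
```

Per-label accuracy is worth reading alongside the overall number, since a rare label can score poorly without moving the overall figure much.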
## Customizing the run
* **Explicit labels:** *"Labels are: positive, negative, neutral."*
* **Imbalance threshold override:** *"Use an imbalance threshold of 20."*
* **Inference-only mode:** *"Just benchmark — don't fine-tune."*
* **Single candidate:** *"Only fine-tune Qwen3 8B, skip the base-vs-base comparison."*
* **Custom split:** *"Use a 70/30 train/test split."*
**Agent crib notes**
* Required inputs: dataset ID and at least one base model to evaluate. Agent will ask explicitly if either is missing.
* Agent needs to know your labels eventually — either supply them in the instruction or confirm Agent's inferred set when prompted.
* The default imbalance threshold is `50.0`; if your dataset is highly imbalanced, expect Agent to flag it in the plan.
* For multi-label classification (a sample with multiple ground-truth labels), pass `ground_truth` as a list in your dataset.
* Agent creates new resources only; your dataset, models, and deployments are never deleted or modified in place.
# Fireworks Agent: Preference Learning (DPO/ORPO)
Source: https://docs.fireworks.ai/fine-tuning/agent/dpo
Run preference fine-tuning end-to-end with optional base-model sweep, automatic pair generation, and pairwise evaluation.
Fireworks Agent's preference-learning workflow runs DPO or ORPO fine-tuning against pre-paired preference data, or generates pairs for you from a prompts-only dataset using delta learning. It can sweep multiple base models when you don't know which to pick, evaluates winners pairwise (or with your evaluator), and produces a final comparison report.
For the underlying DPO mechanics and dataset format details, see [Managed Fine-Tuning → DPO Fine-Tuning](/fine-tuning/dpo-fine-tuning). This page documents the Fireworks Agent workflow built on top of it.
## What you give Agent
| Input | Required? | Notes |
| ------------------ | --------- | ----------------------------------------------------------------------------------------------------------------------- |
| Dataset ID(s) | **Yes** | A single dataset (split 80/20 train/test automatically) or two datasets (separate train + test) |
| Base model | No | If omitted, Agent runs a **model sweep** across supported base models to pick the best automatically |
| Evaluator | No | Evaluator ID, custom rubric text, or none (Agent builds a data-grounded pairwise judge rubric if you don't provide one) |
| Performance target | No | Optional goal score, for example *"win rate above 70%"* |
## Example session instructions
Pre-paired preference data with a specific base model:
```bash theme={null}
source .env && firectl session create \
--api-key $FIREWORKS_AGENT_API_KEY \
--instruction "Run DPO on accounts/myacct/datasets/customer-prefs using Qwen3 32B."
```
Prompts-only dataset with automatic pair generation and a base-model sweep:
```bash theme={null}
source .env && firectl session create \
--api-key $FIREWORKS_AGENT_API_KEY \
--instruction "Run preference learning on accounts/myacct/datasets/prompts-only. Generate preference pairs automatically and sweep base models to find the best one."
```
**Where DPO lives in the 7-phase pipeline:** Phase 1 is dataset inspection, phase 2 is plan + cost approval, **phase 3 is the preference sweep** (replacing the SFT HP sweep — includes pair generation up-front for Format B), phase 5 is the pairwise evaluation, phase 6 is deployment of the winner, phase 7 is the final report. DPO does **not** run a separate phase 4 full-data retrain — the sweep itself is the training run on the chosen base model + config. See [How Agent runs a training job](/fine-tuning/agent/introduction#how-agent-runs-a-training-job).
## Dataset formats
Agent accepts two formats:
### Format A — DPO format (pre-paired preferences)
Each sample has `input`, `preferred_output`, and `non_preferred_output` fields. `input.messages` holds the conversation; `preferred_output` and `non_preferred_output` hold candidate assistant responses.
When this format is detected, Agent skips pair generation and goes straight to training.
### Format B — prompts-only
Each sample has a `messages` field with user messages only (no assistant completions). Agent generates preference pairs automatically using **delta learning**: it samples completions from a strong and a weak model, then constructs preferred/non-preferred pairs for training.
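The two formats can be told apart by their top-level fields. The sketch below is a rough local approximation, not Agent's detection logic: `detect_format` is a hypothetical helper, the records are invented, and representing `preferred_output` as a list of assistant messages is an assumption (see the DPO fine-tuning guide for the exact schema).

```python
def detect_format(sample):
    """Classify one record as Format 'A' (pre-paired), 'B' (prompts-only), or 'unknown'."""
    dpo_keys = {"input", "preferred_output", "non_preferred_output"}
    if dpo_keys.issubset(sample):
        return "A"
    messages = sample.get("messages")
    # Prompts-only: messages exist and none of them is an assistant completion
    if messages and all(m.get("role") != "assistant" for m in messages):
        return "B"
    return "unknown"

# Invented Format A record (the output field shape is an assumption)
format_a = {
    "input": {"messages": [{"role": "user", "content": "Summarize our refund policy."}]},
    "preferred_output": [{"role": "assistant", "content": "Refunds are available within 30 days."}],
    "non_preferred_output": [{"role": "assistant", "content": "No idea."}],
}
# Invented Format B record: a prompt with no assistant completion
format_b = {"messages": [{"role": "user", "content": "Summarize our refund policy."}]}

print(detect_format(format_a), detect_format(format_b))
```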
## Workflow stages
Agent stages the dataset locally exactly once per session, computes token statistics, and decides between Format A (skip pair generation) and Format B (generate pairs).
Agent resolves your evaluator choice (evaluator ID / custom rubric / auto rubric) and asks for anything missing — usually the dataset and, if you omitted both base model and grid, confirmation that a base-model sweep is OK.
Agent presents a plan plus a cost breakdown (training + any pair-generation inference + evaluator inference + total) and asks for a single approval covering both.
For Format B (prompts-only) datasets, Agent generates preference pairs via delta learning and uploads the resulting dataset to Fireworks under a new, timestamped name. Your original dataset is left untouched.
If no base model was specified, Agent runs DPO/ORPO across a curated set of supported base models. If a base model was specified, Agent runs an HP sweep against that single base model. Training jobs are batched (default cap of 6 active at once).
For each trained model, Agent generates completions on the held-out test split and scores them. With your own evaluator, scores are reported independently. Without one, Agent uses a pairwise judge rubric grounded in actual training samples.
Agent deploys the winning fine-tuned model and writes a final report comparing base and fine-tuned models, with the deployment endpoint and (if you supplied a performance target) whether the target was met.
## Evaluator handling
Agent supports three evaluator paths, in priority order:
1. **Evaluator ID** — for example `accounts/myacct/evaluators/my-eval`. Agent fetches the evaluator code, installs dependencies, and runs it to score each model's completions independently. Agent reports average scores for the base model and every fine-tuned candidate.
2. **Custom rubric text** — provide a pairwise LLM judge rubric in your instruction. Agent uses it to compare two completions head-to-head.
3. **Neither** — Agent inspects training samples and writes a data-grounded pairwise judge rubric automatically.
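When a pairwise judge is used, the head-to-head results are typically summarized as a win rate. The sketch below shows one common way to compute it. The judgment labels and the choice to count a tie as half a win are assumptions for illustration, not documented Agent behavior, and `win_rate` is a hypothetical helper.

```python
def win_rate(judgments):
    """Win rate of the fine-tuned model; each judgment is 'win', 'loss', or 'tie'."""
    judgments = list(judgments)
    wins = sum(1 for j in judgments if j == "win")
    ties = sum(1 for j in judgments if j == "tie")
    # Assumption: a tie counts as half a win
    return (wins + 0.5 * ties) / len(judgments)

# Invented judgments from the fine-tuned model's perspective
judgments = ["win", "win", "tie", "loss"]
rate = win_rate(judgments)
print(f"win rate: {rate:.1%}, target met: {rate > 0.70}")
```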
## Output
When the session reports `succeeded`, Agent returns:
* The winning fine-tuned model ID and its deployment endpoint
* Base vs fine-tuned comparison: scores or win rate from the chosen evaluator
* A copy-paste `fireworks-ai` SDK snippet for the deployed model
* `final_report.md` in the session workspace with per-model scores, pair-generation provenance (if Format B), and estimated-vs-actual cost
## Supported base models
The model sweep selects from the supported preference-learning base models. For the canonical list, see [Managed Fine-Tuning Overview → Supported base models](/fine-tuning/managed-finetuning-intro#supported-base-models).
## Customizing the run
* **Pin a base model:** *"Use Qwen3 32B."* — skips the model sweep.
* **Explicit grid:** *"Sweep Qwen3 32B and Qwen3-30B-A3B with beta 0.1 and 0.3."*
* **Bring your own evaluator:** *"Use evaluator accounts/myacct/evaluators/my-rubric."*
* **Auto-generate pairs:** *"Generate preference pairs automatically."*
* **Set a target:** *"Stop early once we reach 75% win rate against the base."*
**Agent crib notes**
* Required input: dataset ID. Everything else is optional.
* Agent will pause for one approval (plan + cost) and again at the comparison report. The promotion gate appears only when a clear winner needs confirmation.
* If the dataset is prompts-only, Agent will generate pairs by sampling strong and weak models — expect inference cost on top of training cost.
* Agent always creates new datasets with timestamped names; your original dataset is never overwritten.
* For deeper customization of the loss (custom beta schedules, hybrid objectives), use the [Training API](/fine-tuning/training-api/introduction) instead.
# Fireworks Agent: Evaluator Authoring
Source: https://docs.fireworks.ai/fine-tuning/agent/evaluators
Have Fireworks Agent generate a reusable evaluator from your dataset — for scoring candidates in an SFT sweep, or for use with Managed RFT.
Fireworks Agent can write a task-specific evaluator from your dataset alone. Two flavors:
* **SFT evaluators** — a Python evaluator (`evaluator.py`) plus a spec (`eval_spec.md`) that Agent uses to score candidates during a subsequent SFT sweep in the same session.
* **RFT evaluators** — an Eval Protocol `@evaluation_test` evaluator ready to drive a Reinforcement Fine-Tuning job.
Use evaluator authoring when you have a dataset and a clear notion of what "correct" looks like, but no evaluator script yet.
## SFT evaluators
### What you get
Agent generates two artifacts in the session workspace:
* `outputs/eval_spec.md` — a human-readable spec describing what the evaluator checks (the contract: what counts as correct, how partial credit works, edge cases).
* `outputs/evaluator.py` — a Python evaluator that takes a model's outputs and the dataset's ground truth and returns scores.
After the artifacts are written, Agent surfaces the full `eval_spec.md` and `evaluator.py` contents in chat so you can review them before they're used downstream.
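For orientation, an evaluator of this kind might look like the sketch below. It is an illustration, not the file Agent generates: the function names, the `key_facts` field, and the partial-credit rule are assumptions chosen to show how an evaluator turns model outputs and ground truth into scores.

```python
from typing import Dict, List


def score_answer(model_output: str, key_facts: List[str]) -> float:
    """Return the fraction of ground-truth key facts found in the output.

    Matching is case-insensitive substring search, a deliberately simple
    stand-in for whatever contract eval_spec.md describes.
    """
    if not key_facts:
        return 0.0
    text = model_output.lower()
    hits = sum(1 for fact in key_facts if fact.lower() in text)
    return hits / len(key_facts)


def evaluate(records: List[Dict]) -> Dict[str, float]:
    """Score every record and report the mean score across the dataset."""
    if not records:
        raise ValueError("no records to evaluate")
    scores = [score_answer(r["output"], r["key_facts"]) for r in records]
    return {"mean_score": sum(scores) / len(scores), "n": float(len(scores))}


# Hypothetical records: the first covers both facts, the second only one.
records = [
    {"output": "Your refund was issued on March 3.", "key_facts": ["refund", "March 3"]},
    {"output": "Please restart the router.", "key_facts": ["restart", "firmware"]},
]
print(evaluate(records))
```

Partial credit here is simply the share of facts matched; a real spec may weight facts differently or require exact matches.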
### Example session instructions
Author an evaluator only:
```bash theme={null}
source .env && firectl session create \
--api-key $FIREWORKS_AGENT_API_KEY \
--instruction "Generate an evaluator for accounts/myacct/datasets/customer-support. Outputs are short text answers; check whether the final assistant reply matches ground truth on key facts."
```
Author an evaluator and continue straight into SFT in the same session — Agent reuses the freshly-written evaluator without re-authoring:
```bash theme={null}
source .env && firectl session create \
--api-key $FIREWORKS_AGENT_API_KEY \
--instruction "Generate an evaluator for accounts/myacct/datasets/customer-support, then run SFT on Qwen3 8B and use that evaluator to pick the winning candidate."
```
**Where evaluator authoring lives in the 7-phase pipeline:** When evaluator authoring runs as a standalone session, phases 3–7 of the standard pipeline don't apply; the session writes `outputs/evaluator.py` + `outputs/eval_spec.md` and stops. When you chain authoring into SFT in the same session, those artifacts feed **phase 5 (Evaluation)** of the follow-on training pipeline — used to score candidates during phase 3 and again for direct evaluation of the final model. (RFT evaluators are saved to your Fireworks account and then used by [Managed Fine-Tuning's RFT path](/fine-tuning/reinforcement-fine-tuning-models), not by Agent.) See [How Agent runs a training job](/fine-tuning/agent/introduction#how-agent-runs-a-training-job).
When you ask for both in one instruction, Agent writes the evaluator first, then automatically continues into SFT with **same-session evaluator reuse**: the SFT workflow picks up `outputs/evaluator.py` and `outputs/eval_spec.md` without re-authoring them, and reuses the staged dataset paths so the dataset is downloaded only once.
### Multi-turn handoff
If you want fine-grained control of the handoff, structure your two instructions like this:
```bash theme={null}
source .env && firectl session create \
--api-key $FIREWORKS_AGENT_API_KEY \
--instruction "Generate an evaluator for accounts/myacct/datasets/mydata."
# Wait for evaluator artifacts to be written and presented in chat.
```
Then continue in the **same session**:
```bash theme={null}
source .env && firectl session update <session-id> \
--api-key $FIREWORKS_AGENT_API_KEY \
--instruction "Now run SFT on Qwen3 32B using the evaluator we just authored. Reuse outputs/evaluator.py and outputs/eval_spec.md — do not regenerate them."
```
Agent will inherit the staged dataset and the evaluator artifacts without re-downloading or rewriting them.
## RFT evaluators
**Agent authors RFT evaluators but does not run RFT training.** This workflow produces and validates the Eval Protocol evaluator file, then registers it with your Fireworks account. The actual RFT training job runs through [Managed Fine-Tuning's RFT path](/fine-tuning/reinforcement-fine-tuning-models) — not from an Agent session.
### What you get
An Eval Protocol `@evaluation_test` evaluator file, validated end-to-end, ready to drop into a Reinforcement Fine-Tuning job. The plan includes the concrete evaluator code, validation commands, and the command to save the evaluator to Fireworks.
This is purpose-built for tasks where you can score model outputs against reference data — math problems, code generation, structured-output extraction, agentic workflows with verifiable side effects.
### Example session instruction
```bash theme={null}
source .env && firectl session create \
--api-key $FIREWORKS_AGENT_API_KEY \
--instruction "Build an RFT evaluator for accounts/myacct/datasets/math-problems. Score whether the final numeric answer matches ground truth."
```
Agent inspects samples, writes the evaluator, validates it on a few records, and presents the plan with the save command. You approve once and Agent executes the plan, registering the evaluator with your Fireworks account.
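The scoring intent in the instruction above (the final numeric answer matches ground truth) reduces to logic like the following. The sketch shows only the scoring function in plain Python; it does not use the Eval Protocol `@evaluation_test` wrapper that Agent generates, and the regular expression and tolerance are assumptions.

```python
import math
import re
from typing import Optional

# Matches integers and simple decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def final_number(text: str) -> Optional[float]:
    """Return the last number in the text, or None if there is none."""
    matches = _NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


def numeric_match(model_output: str, ground_truth: float, tol: float = 1e-6) -> float:
    """Score 1.0 when the final number equals ground truth within tol, else 0.0."""
    value = final_number(model_output)
    if value is None:
        return 0.0
    return 1.0 if math.isclose(value, ground_truth, abs_tol=tol) else 0.0


print(numeric_match("So 12 * 3 = 36. The answer is 36.", 36))
print(numeric_match("I am not sure.", 36))
```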
### Handing off to RFT training
Once the evaluator is saved, run the RFT job through Managed Fine-Tuning — see the [Reinforcement Fine-Tuning Overview](/fine-tuning/reinforcement-fine-tuning-models) and [Evaluators concepts](/fine-tuning/evaluators). For example:
```bash theme={null}
firectl rftj create \
--base-model accounts/fireworks/models/qwen3-8b \
--evaluator accounts/myacct/evaluators/ \
--dataset accounts/myacct/datasets/math-problems
```
Or use the [Web UI](/fine-tuning/web-ui-guide) to launch the RFT job interactively.
## Workflow summary
Agent stages the dataset locally, samples records, and infers the evaluator contract from data plus your scoring intent. Agent will not finalize an evaluator without successfully staging readable data.
For SFT, Agent writes both `eval_spec.md` (the contract) and `evaluator.py` (the implementation) and self-checks that both are non-empty before finishing. For RFT, Agent writes a single Eval Protocol `@evaluation_test` file and self-checks that it's non-empty and that validation succeeds.
Agent surfaces the artifacts inline in chat. For RFT, Agent also presents a plan with validation and save commands and asks for one approval.
If your instruction asks for downstream SFT, Agent continues into the SFT workflow in the same session and reuses the just-authored evaluator — no re-downloading, no re-authoring. RFT training itself runs through [Managed Fine-Tuning](/fine-tuning/reinforcement-fine-tuning-models), not from an Agent session.
## When to use which
| Use case | Workflow |
| ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| You want an evaluator Agent can use to score candidates during an SFT sweep, with optional auto-continue into SFT | **SFT evaluator authoring** (run end-to-end by Agent) |
| You want an Eval Protocol evaluator to drive an RFT job | **RFT evaluator authoring** (Agent writes and saves the evaluator; RFT training runs through [Managed Fine-Tuning](/fine-tuning/reinforcement-fine-tuning-models)) |
| You don't have a clear notion of "correct" yet | Start with **validation-loss-only SFT** on [Agent SFT](/fine-tuning/agent/sft) and add an evaluator later |
**Agent crib notes**
* Required input: dataset ID. Agent also wants your scoring intent in plain English — "check whether the answer matches ground truth", "verify the JSON has the right schema", etc.
* For SFT evaluators, ask for both authoring and SFT in the same instruction to get same-session evaluator reuse for free.
* For RFT evaluators, expect a plan + cost approval before the evaluator is saved to your Fireworks account. **The Agent session ends after the evaluator is saved.** Hand off to [Managed Fine-Tuning's RFT path](/fine-tuning/reinforcement-fine-tuning-models) to run the actual RFT training job.
* Agent surfaces the generated `eval_spec.md` and `evaluator.py` inline in chat after authoring — relay them to the user.
* All evaluator artifacts live under `outputs/` in the session workspace and can be inspected via `firectl session get <session-id>` if needed.
# Fireworks Agent Overview
Source: https://docs.fireworks.ai/fine-tuning/agent/introduction
Describe what you want, approve the plan and cost, get a deployed fine-tuned model.
Fireworks Agent is a hosted Fireworks assistant that owns the full fine-tuning loop. You describe what you want — *"fine-tune a model that classifies our support tickets"*, *"improve Llama 3.1 70B on our function-calling data"*, *"train a smaller model that matches GPT-4 on our routing task"* — and Agent picks the base model, prepares the dataset, runs a hyperparameter sweep, submits training, evaluates the result, and deploys the fine-tuned model. You stay in the loop for approvals and final calls; everything else is handled.
Agent is the easiest of the three Fireworks fine-tuning paths, sitting alongside [Managed Fine-Tuning](/fine-tuning/managed-finetuning-intro) and the [Training API](/fine-tuning/training-api/introduction). It's the right starting point when you want a working fine-tuned model without writing config files or Python training loops.
**Naming.** This documentation refers to the product as **Fireworks Agent** (or just **Agent**). You may also see it called `pilot` in internal source code, in CLI permission presets (`--permission-preset=pilot`), in the embedded manifest file (`pilot.yaml`), and in some legacy support contexts — those are all the same product. Use *"Fireworks Agent"* or *"Agent"* in your own prompts and communication.
## What Agent does for you
Agent recommends a base model and tuning method (SFT, DPO, or classification) from your task description and a peek at your data.
It inspects your dataset, proposes hyperparameters, estimates cost, and presents a single plan for approval before any spend.
After you approve, it submits the job, streams progress, evaluates checkpoints, and ships a deployed model at the end.
Concretely, Agent can:
* Run **SFT, DPO, and classification** jobs from a natural-language prompt
* Inspect your dataset and call out format issues before training starts
* Recommend a base model from a curated panel based on your task shape
* Run a short **hyperparameter sweep** before committing to full training
* Stream a live progress feed with eval loss, cost-so-far, and ETA
* Evaluate the trained model against a held-out set and surface the best checkpoint
* Deploy the fine-tuned model so you can call it from `chat/completions` immediately
* Author task-specific [evaluators](/fine-tuning/agent/evaluators) for use in SFT sweeps, or Eval Protocol evaluators you can then run through [Managed Fine-Tuning's RFT path](/fine-tuning/reinforcement-fine-tuning-models)
* Answer questions about your account, deployments, jobs, and Fireworks models along the way
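Once a model is deployed, calling it through `chat/completions` is an ordinary HTTPS request. The sketch below builds such a request with the standard library only. The model ID is a placeholder, the `FIREWORKS_API_KEY` variable name is an assumption, and the request is constructed but not sent unless you uncomment the last lines.

```python
import json
import os
import urllib.request

# Fireworks' public OpenAI-compatible inference endpoint.
URL = "https://api.fireworks.ai/inference/v1/chat/completions"


def build_request(model_id: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat/completions request for a deployed model."""
    payload = {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request(
    "accounts/myacct/models/my-finetuned-model",  # placeholder model ID
    "Classify this ticket: my invoice is wrong.",
    os.environ.get("FIREWORKS_API_KEY", "fw-..."),  # assumed variable name
)
# To send it:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```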
Agent does **not** run RFT training itself — for that, author the evaluator with Agent and then submit the RFT job through [Managed Fine-Tuning](/fine-tuning/reinforcement-fine-tuning-models). Agent also cannot run an arbitrary Python training loop, use a custom loss function, or sample mid-training from your own evaluator — for those, use the [Training API](/fine-tuning/training-api/introduction) directly.
## Architecture
```mermaid theme={null}
flowchart LR
Client["Client
(user via web app,
user via firectl / REST API,
or coding agent)"] -->|"create session"| AgentAPI["Fireworks Agent API"]
AgentAPI -->|dispatch| Runner["Session Runner"]
Runner -->|"plan + cost estimate"| AgentAPI
AgentAPI -->|"events stream"| Client
Client -->|"approve / answer"| AgentAPI
AgentAPI -->|"session update"| Runner
Runner -->|"firectl + Fireworks API"| Platform["Fireworks Platform"]
Platform -->|results| Runner
Runner -->|"final report + deployed model"| Client
```
The runner is an ephemeral, sandboxed environment with its own filesystem. It executes Agent's plan against your Fireworks account using your API key. Sessions can pause for hours or days waiting on user input without consuming compute.
## Two ways to use Agent
The dashboard is the default and recommended surface for most users. Open **Agent** in the left nav of [app.fireworks.ai](https://app.fireworks.ai) for a chat interface that streams Agent's plan, progress, and final report. Best for:
* Most fine-tuning workflows, end to end
* Teams that want a visual plan, cost, and approval UX
* Watching a long training run with a live progress feed
* Skipping `firectl` installation and service-account setup
### Dashboard quickstart
Click **Agent** in the left navigation at [app.fireworks.ai](https://app.fireworks.ai).
A good first prompt is specific about *what* you're training for, *what data* to use, and *what success looks like*:
```text theme={null}
Fine-tune a model on accounts/your-account/datasets/support-tickets.
Classify each ticket into one of 12 categories.
Target: better than GPT-4 mini on accuracy. Budget: under $5.
```
Agent will inspect the dataset, propose a plan, and stop for your approval.
Agent presents one structured plan with a cost estimate. Approve, request a change (*"use Qwen3 32B instead"*, *"skip HP tuning"*), or cancel. No spend happens before this gate.
Agent streams phase-anchored updates every few minutes through the final report, which includes the deployed model ID and inference endpoint.
The CLI path is the advanced option, for power users and anyone already living in a coding-agent harness. Use it two ways:
* **Drive Agent directly from `firectl session`** — script it, run it from CI, or call the REST API.
* **Let Claude Code, Cursor, Codex, Aider, Goose, or another coding agent drive it for you** by installing the [Fireworks Agent skill file](/fine-tuning/agent/use-with-coding-agents). The coding agent shells out to `firectl session` using a scoped service-account key.
Best for:
* Fine-tuning as a step in a larger coding workflow
* Reproducing a training run with code-checked-in instructions
* Power users who already orchestrate everything from their coding agent or terminal
* Scripting and automation against the `firectl session` / REST API
### CLI quickstart
Create a service account scoped to Agent's capabilities (the `pilot` permission preset — see the [security section below](#security-service-accounts-and-the-agent-manifest) for the rationale) and mint an API key:
```bash theme={null}
firectl -a user create \
--service-account \
--user-id=fireworks-agent \
--permission-preset=pilot
firectl -a api-key create --service-account=fireworks-agent
```
Save the returned key in a `.env` file in your project root:
```bash .env theme={null}
FIREWORKS_AGENT_API_KEY=fw-...
```
The Fireworks Agent skill sources `.env` automatically. See [Service Accounts](/accounts/service-accounts) for the full setup.
```bash theme={null}
source .env && firectl session create \
--api-key $FIREWORKS_AGENT_API_KEY \
--instruction "Run SFT on Qwen3 32B using accounts/myacct/datasets/mydata"
```
The command returns a session ID, for example `abc123`.
```bash theme={null}
source .env && firectl session events abc123 --api-key $FIREWORKS_AGENT_API_KEY --wait
```
The `--wait` flag keeps streaming until the session reaches `waiting`, `succeeded`, `failed`, or `cancelled`. Without it, the command dumps existing events and exits.
When the stream stops at `waiting`, read Agent's question, then send your answer back to the same session:
```bash theme={null}
source .env && firectl session update abc123 \
--api-key $FIREWORKS_AGENT_API_KEY \
--instruction "Approved, proceed."
```
Re-run `firectl session events abc123 --wait` to resume. Repeat until the session reports `succeeded`.
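The stream, answer, and repeat loop above can be sketched as a small driver. The three callables are hypothetical hooks that you would implement yourself, for example by shelling out to `firectl session events --wait` and `firectl session update`. How to read the session state from `firectl` output is not specified here, so the sketch leaves that to `wait_for_stop`.

```python
from typing import Callable

# States at which `firectl session events --wait` stops streaming.
STOP_STATES = {"waiting", "succeeded", "failed", "cancelled"}
TERMINAL_STATES = STOP_STATES - {"waiting"}


def drive_session(
    wait_for_stop: Callable[[], str],
    ask_user: Callable[[], str],
    send_reply: Callable[[str], None],
    max_rounds: int = 20,
) -> str:
    """Loop until the session reaches a terminal state and return that state.

    wait_for_stop blocks until the session stops streaming and returns its state.
    ask_user shows Agent's question to a human and returns their answer.
    send_reply forwards the answer to the same session.
    """
    for _ in range(max_rounds):
        state = wait_for_stop()
        if state in TERMINAL_STATES:
            return state
        if state != "waiting":
            raise ValueError(f"unexpected state: {state}")
        send_reply(ask_user())  # the answer always comes from the user
    raise RuntimeError("session still waiting after max_rounds replies")


# Dry run with canned states instead of real firectl calls.
states = iter(["waiting", "waiting", "succeeded"])
sent = []
result = drive_session(lambda: next(states), lambda: "Approved, proceed.", sent.append)
print(result, len(sent))
```

Keeping the user prompt as an injected callable makes it hard to skip the approval step by accident, which matches the confirmation rule for `session update`.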
## How Agent runs a training job
Every Agent session moves through the same seven phases. Coding agents should expect this sequence; humans can use it as a mental model for what to expect next.
| # | Phase | What happens |
| - | ----------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1 | **Data inspection** | Agent reads your dataset, reports format, sample count, token count, and any issues. |
| 2 | **Planning & approval** | Agent proposes base model, tuning method, hyperparameters, eval path, and a cost estimate. You approve, edit, or cancel. |
| 3 | **HP tuning** | A short parallel sweep (typically 3 configs) over LoRA rank and learning rate, capped at 6 active jobs by default. |
| 4 | **Full training** | The best config from phase 3 runs to completion on the full dataset, with per-epoch eval loss. |
| 5 | **Evaluation** | The trained model is evaluated against a held-out set using one of three strategies you pick in phase 2: validation loss only (default), an evaluator you provide, or an evaluator Agent generates for you. |
| 6 | **Deployment** | The model is deployed and a `fireworks-ai` SDK snippet is ready for inference. |
| 7 | **Final report** | Deployed model ID, key metrics, total cost, and per-phase summary in one message. |
DPO uses the same shape with phase 3 replaced by a preference sweep (or pair generation followed by a preference sweep when the dataset is prompts-only). Classification uses the same shape with phase 3 expanded into a base-model benchmark plus a fine-tuning sweep, and phase 5 reports per-label and overall accuracy. The promotion gate between phase 3 and phase 4 is one of the two user-facing pauses (the other is plan approval in phase 2).
### The approval and cost contract
Agent never spends without an explicit approval. This is structural, not a setting.
At the end of **Phase 2 (Planning)** — and again before any new spend-incurring step — Agent surfaces a structured cost preview and waits for approval. In the dashboard this is a yes/no prompt. From a coding agent, the skill holds the session in a `waiting` state, surfaces Agent's exact question, and only proceeds after you respond via `firectl session update`. Reject and the session ends with no charges.
The preview always includes:
* Total estimated cost (in USD, with a confidence range)
* Estimated wall time
* Per-phase cost breakdown (HP tuning / full training / evaluation / deployment)
* Cost-so-far in the session (for re-approvals on long runs)
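The preview fields listed above can be modeled as a small record, which lets a wrapper script compare an estimate with a budget before asking a human to approve. The field names and the rule of comparing the upper end of the confidence range are assumptions for illustration; Agent's actual preview format is not specified here, and the numbers are made up.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CostPreview:
    """Hypothetical container for the fields a cost preview always includes."""
    estimate_usd: float              # total estimated cost
    low_usd: float                   # lower end of the confidence range
    high_usd: float                  # upper end of the confidence range
    wall_time_minutes: float         # estimated wall time
    per_phase_usd: Dict[str, float]  # HP tuning / full training / evaluation / deployment
    spent_so_far_usd: float = 0.0    # cost-so-far in the session


def fits_budget(preview: CostPreview, budget_usd: float) -> bool:
    """Conservative check: money already spent plus the high estimate must fit."""
    return preview.spent_so_far_usd + preview.high_usd <= budget_usd


# Illustrative numbers, not real prices.
preview = CostPreview(
    estimate_usd=3.8, low_usd=3.0, high_usd=4.6, wall_time_minutes=90,
    per_phase_usd={"hp_tuning": 1.0, "full_training": 2.0, "evaluation": 0.3, "deployment": 0.5},
)
print(fits_budget(preview, 5.0), fits_budget(preview, 4.0))
```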
### Out-of-coverage behavior
If you ask Agent to use a model or method outside its supported set, it refuses rather than silently approximating. For example, asking for full-parameter tuning on a model with no Agent recipe returns a clear *"not supported in Agent — use Managed Fine-Tuning or the Training API"* message with a pointer to the right surface. See [When not to use Agent](#when-not-to-use-agent).
## What Agent can do today
* End-to-end SFT with dataset inspection, hyperparameter sweep, evaluator-guided model selection, and a deployed winner.
* Run DPO or ORPO on pre-paired preferences, or generate pairs automatically with delta learning, with an optional base-model sweep.
* Benchmark base models, fine-tune on labeled data, and compare base vs fine-tuned classification accuracy on a held-out split.
* Generate a reusable Python evaluator, directly from your dataset, that Agent uses to score candidates during an SFT sweep, or an Eval Protocol evaluator you can take to a Managed RFT job.
* Copy-paste skill files for Claude Code, Cursor, Codex, Aider, and Goose so they can drive Agent for you.
## Agent vs Managed Fine-Tuning vs Training API
All three share the same training infrastructure, GPU shapes, and tuning methods. The difference is how much you drive.
| | **Fireworks Agent** | **Managed Fine-Tuning** | **Training API** |
| ------------------------------- | ------------------------------------------------------------------------- | --------------------------------- | ---------------------------------- |
| **Interface** | Natural language (dashboard chat, `firectl session`, or via coding agent) | UI, `firectl`, REST | Python script |
| **Who picks the model** | Agent recommends | You | You |
| **Who tunes hyperparameters** | Agent runs a sweep | You set them | You set them |
| **Cost approval** | Built-in gate | None — you submit jobs directly | None |
| **Custom loss / training loop** | Not supported | Not supported | Supported |
| **Inference-in-the-loop eval** | Not supported | Not supported | Supported (hotload) |
| **Best for** | Getting a working fine-tuned model fast, without ML expertise | Production runs with known config | Research, custom RL, hybrid losses |
### When not to use Agent
Reach for a more direct surface when:
* You need a **custom loss function** or hybrid objective → [Training API](/fine-tuning/training-api/introduction)
* You need to **hotload checkpoints** for mid-training inference evaluation → [Training API](/fine-tuning/training-api/introduction)
* You already know your config and just want to **submit a job** → [Managed Fine-Tuning](/fine-tuning/managed-finetuning-intro)
* You need **full-parameter tuning** on a model Agent doesn't cover → [Managed Fine-Tuning](/fine-tuning/managed-finetuning-intro)
* You're training in a **fully automated CI pipeline** with no human approval → Agent's approval gate is interactive by design; [Managed Fine-Tuning](/fine-tuning/managed-finetuning-intro) is the better fit today
## Security: service accounts and the Agent manifest
When a coding agent drives Fireworks Agent on your behalf, it should authenticate as a **service account** with the `pilot` permission preset, not your personal user key. This enforces a layered permissions model:
> Effective permissions = User role ∩ Agent capability manifest
### The manifest is a real artifact
The Agent capability manifest is a versioned YAML file (`pilot.yaml`, kept under its original internal name) embedded into the Fireworks control-plane binary at build time. It enumerates the exact set of RPC methods the `pilot` preset is allowed to call — roughly 80 methods grouped by capability surface:
* **Account & billing** — `GetAccountUsage`, `GetQuota`, `ListQuotas`, `ListCosts`
* **Models** — `GetModel`, `ListModels`, `CreateModelVersion`, `PrepareModel`, `ValidateModelUpload`
* **Deployments** — `GetDeployment`, `CreateDeployment`, `DeployModelVersion`, `GetDeploymentMetrics`
* **Datasets** — `CreateDataset`, `GetDataset`, `ListDatasets`, `PreviewDataset`, `SplitDataset`
* **Evaluators and evaluations** — `CreateEvaluator`, `GetEvaluator`, `CreateEvaluation`, `TestEvaluation`
* **Fine-tuning jobs** — `CreateSupervisedFineTuningJob`, `CreateDpoJob`, `CreateReinforcementFineTuningJob`, `CreateRlorTrainerJob` *(the RFT and RLOR-trainer RPCs are granted by the manifest but Agent's current workflows don't use them — see [What Agent does for you](#what-agent-does-for-you))*
* **Training shapes** — `GetTrainingShape`, `ListTrainingShapes`
* **Batch inference and inference logs** — `CreateBatchInferenceJob`, `ListInferenceLogs`
The control plane enforces the manifest as a **hard ceiling** before checking the underlying user's role: even if the user has broader permissions, the preset cannot exceed what the manifest allows. Any RPC outside the manifest returns `PERMISSION_DENIED` at the API gateway, regardless of how the request was constructed.
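The intersection rule can be illustrated with plain sets. This is a toy model of the behavior described above, not the control plane's implementation: the manifest below holds only a handful of the method names listed, the user role is hypothetical, and denying methods that are in the manifest but not in the role is an assumption that follows from reading the rule as an intersection.

```python
from typing import FrozenSet

# A few method names taken from the manifest groups listed above (not the full list).
MANIFEST: FrozenSet[str] = frozenset({
    "GetModel", "ListModels", "CreateDataset", "GetDataset",
    "CreateDeployment", "CreateSupervisedFineTuningJob",
})


def effective_permissions(user_role: FrozenSet[str], manifest: FrozenSet[str] = MANIFEST) -> FrozenSet[str]:
    """Effective permissions = user role ∩ manifest."""
    return user_role & manifest


def authorize(method: str, user_role: FrozenSet[str]) -> str:
    """Return "OK" or "PERMISSION_DENIED" for a single RPC method."""
    return "OK" if method in effective_permissions(user_role) else "PERMISSION_DENIED"


# Hypothetical user role that is broader than the manifest in one place
# (DeleteModel) and narrower in another (no CreateDeployment).
role = frozenset({"GetModel", "ListModels", "CreateDataset", "DeleteModel"})
print(authorize("GetModel", role))          # in both sets
print(authorize("DeleteModel", role))       # in the role, not in the manifest
print(authorize("CreateDeployment", role))  # in the manifest, not in the role
```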
### Non-destructive guarantee, structurally enforced
Agent's promise to never delete, cancel, or destroy your existing resources is enforced by the manifest itself, not by skill-level politeness. The manifest **does not include any `Delete*`, `Cancel*`, or destructive RPC methods**. Even a malicious or hallucinated tool call targeting `DeleteModel`, `CancelReinforcementFineTuningJob`, or `DeleteDeployment` is rejected at the control plane before it reaches the resource layer.
### Cross-account reads, never cross-account writes
The `pilot` preset is granted **read-only** access across accounts. This is what lets Agent reach Fireworks-owned public resources — base models at `accounts/fireworks/models/...`, public deployment shapes, public datasets — using only your account's API key. Agent cannot write into any other account; mutating operations are scoped to your account.
### Auto-update on control-plane releases
Because the manifest is compiled into the control-plane binary, expanded Agent capabilities ship automatically with every control-plane deploy. Your service account stores only the preset *name* (`pilot`), not the list of allowed methods — so new capabilities are picked up without rotating keys or re-provisioning the service account. See [Service Accounts](/accounts/service-accounts) for setup details.
## Session lifecycle reference
| Command                                                             | What it does                                  | Confirmation required          |
| ------------------------------------------------------------------- | --------------------------------------------- | ------------------------------ |
| `firectl session create --instruction "<instruction>"`              | Start a new session                           | No                             |
| `firectl session events <session-id> --wait`                        | Stream events until terminal or waiting state | No                             |
| `firectl session get <session-id>`                                  | Get current status and details                | No                             |
| `firectl session list`                                              | List sessions for your account                | No                             |
| `firectl session update <session-id> --instruction "<instruction>"` | Send a response to a waiting session          | **Yes**, confirm with the user |
| `firectl session cancel <session-id>`                               | Stop a running session (keeps the record)     | **Yes**, confirm with the user |
| `firectl session delete <session-id>`                               | Remove the session record (irreversible)      | **Yes**, confirm with the user |
All commands accept `--api-key $FIREWORKS_AGENT_API_KEY` for non-interactive auth and `--scope optimize` (the default scope).
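A coding agent or wrapper script can encode the confirmation column as a simple guard. The sketch below is one way to do that; the function name and the idea of passing a `user_confirmed` flag are assumptions, while the split between safe and confirm-first subcommands comes from the table.

```python
# Subcommands that may run without asking the user first.
SAFE_SUBCOMMANDS = {"create", "get", "events", "list"}
# Subcommands that need explicit user confirmation.
CONFIRM_SUBCOMMANDS = {"update", "cancel", "delete"}


def may_run(subcommand: str, user_confirmed: bool = False) -> bool:
    """Decide whether a `firectl session <subcommand>` call may run now."""
    if subcommand in SAFE_SUBCOMMANDS:
        return True
    if subcommand in CONFIRM_SUBCOMMANDS:
        return user_confirmed
    raise ValueError(f"unknown session subcommand: {subcommand}")


print(may_run("events"))
print(may_run("delete"))
print(may_run("delete", user_confirmed=True))
```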
## Troubleshooting
* **Jobs sit in the queue.** Agent shares the on-demand pool with the Training API, so jobs queue when GPU capacity is tight. If you need guaranteed capacity, [request a reservation](https://fireworks.ai/contact).
* **Agent refuses a model or method.** Agent only runs methods it has curated recipes for. For anything outside that set, use [Managed Fine-Tuning](/fine-tuning/managed-finetuning-intro) or the [Training API](/fine-tuning/training-api/introduction).
* **`firectl session events` exits immediately.** You're missing the `--wait` flag. Without it, `firectl session events` prints existing events and returns. The Fireworks Agent skill always passes `--wait`, which keeps the stream open until the session reaches `waiting`, `succeeded`, `failed`, or `cancelled`. If you're driving `firectl` directly, add `-w` or `--wait`.
* **The cost estimate is higher than expected.** Agent's preview includes HP tuning, full training, evaluation, and the first hour of deployment. Reject the plan and ask Agent to skip HP tuning or use a smaller base model; the next preview will reflect the reduced scope.
## Next steps
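* Open Agent in the left nav at [app.fireworks.ai](https://app.fireworks.ai).
* Install the [skill file](/fine-tuning/agent/use-with-coding-agents) in Claude Code, Cursor, Codex, Aider, or Goose.
* Drive the same training infra directly with [Managed Fine-Tuning](/fine-tuning/managed-finetuning-intro) when you know your config.
* Write your own Python training loop on Fireworks GPUs with the [Training API](/fine-tuning/training-api/introduction).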
**Agent crib notes**
* Auth: set `FIREWORKS_AGENT_API_KEY` in a project-local `.env` (the key is from a service account with the `pilot` permission preset). Source it via `source .env && ...` and pass on every command as `--api-key $FIREWORKS_AGENT_API_KEY`.
* Use the **same session ID** for follow-ups. Never create a new session to continue an existing conversation.
* Always pass `--wait` to `session events`, or the command exits immediately after dumping history.
* `create`, `get`, `events`, and `list` are safe to run without user confirmation. **Always confirm with the user before `update`, `cancel`, or `delete`.**
* On `waiting`, surface Agent's exact question to the user verbatim; do not paraphrase.
* See [Use with coding agents](/fine-tuning/agent/use-with-coding-agents) for a complete copy-paste skill for Claude Code, Cursor, Codex, Aider, and Goose.
# Fireworks Agent: Supervised Fine-Tuning
Source: https://docs.fireworks.ai/fine-tuning/agent/sft
Run end-to-end SFT with Fireworks Agent — dataset inspection, hyperparameter sweep, evaluator-guided selection, and a deployed winner.
Fireworks Agent's SFT workflow takes a dataset and (optionally) a base model, runs a hyperparameter sweep with held-out evaluation, picks the winner, retrains on the full data, and deploys the result. You approve a single plan with a cost estimate up front; Agent handles everything from there and pauses only at meaningful decision points.
For the underlying SFT mechanics (job parameters, supported base models, dataset format), see [Managed Fine-Tuning → Supervised Fine-Tuning](/fine-tuning/fine-tuning-models). This page documents the Fireworks Agent workflow built on top of it.
## What you give Agent
Agent needs enough to build an executable plan. The required inputs:
* **Dataset ID** — an existing Fireworks dataset in `READY` state, in OpenAI-compatible chat format. Optionally a separate evaluation dataset.
* **Base model(s)** — one or more base models. If you omit this, Agent will ask you to choose from the supported list.
* **Evaluation approach** — one of three strategies (see below). Default is validation loss only.
Everything else (epochs, LoRA rank, learning rate, batching) is resolved by Agent from defaults or your explicit overrides.
## Example session instruction
```bash theme={null}
source .env && firectl session create \
--api-key $FIREWORKS_AGENT_API_KEY \
--instruction "Run supervised fine-tuning on accounts/myacct/datasets/customer-support-conv. Use Qwen3 32B as the base model. Use validation loss for evaluation."
```
For explicit candidates instead of the default tuning grid:
```bash theme={null}
source .env && firectl session create \
--api-key $FIREWORKS_AGENT_API_KEY \
--instruction "Run SFT on accounts/myacct/datasets/mydata across qwen3-8b and qwen3-32b with learning rates 1e-4 and 5e-5, LoRA ranks 16 and 32, and 3 epochs."
```
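Read as a full Cartesian product, the explicit grid in the second instruction yields eight candidates, which is more than the default cap of 6 active jobs. The sketch below counts the candidates and splits them into batches. Whether Agent takes the full product, and how it schedules jobs beyond the cap, are assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List

MAX_ACTIVE_JOBS = 6  # default cap on concurrently active jobs


def expand_grid(models: List[str], learning_rates: List[float],
                lora_ranks: List[int], epochs: int) -> List[Dict]:
    """Return one candidate config per (model, learning rate, LoRA rank) combination."""
    return [
        {"base_model": m, "learning_rate": lr, "lora_rank": r, "epochs": epochs}
        for m, lr, r in product(models, learning_rates, lora_ranks)
    ]


def waves(candidates: List[Dict], cap: int = MAX_ACTIVE_JOBS) -> List[List[Dict]]:
    """Split candidates into batches of at most cap jobs each."""
    return [candidates[i:i + cap] for i in range(0, len(candidates), cap)]


grid = expand_grid(["qwen3-8b", "qwen3-32b"], [1e-4, 5e-5], [16, 32], epochs=3)
print(len(grid), [len(w) for w in waves(grid)])
```

Fixed batches are the simplest reading of the cap; a real scheduler could just as well start a new job whenever one finishes.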
**Where SFT lives in the 7-phase pipeline:** Phase 1 is dataset inspection, phase 2 is plan + cost approval, **phase 3 is the candidate sweep** described below, phase 4 is the full-data final run, **phase 5 is held-out evaluation** (using the strategy you picked in phase 2), phase 6 is deployment, phase 7 is the final report. See [How Agent runs a training job](/fine-tuning/agent/introduction#how-agent-runs-a-training-job).
## Workflow stages
Agent stages your dataset locally exactly once per session (`firectl dataset download ...`), inspects format and sample structure, estimates token counts for cost, and decides whether any conversion is needed (for example, mapping `ground_truth` fields onto an assistant message or rewriting `tool` roles).
Agent picks an evaluation strategy (see [Evaluation paths](#evaluation-paths) below) and resolves your candidate grid. The default tuning grid is three HP configurations with the LoRA rank and learning rate shown below; epochs default to `min(5, ceil(2500 / total_samples))` unless you override them.
| HP config | LoRA rank | Learning rate |
| --------- | --------- | ------------- |
| 1 | 8 | 1.5e-4 |
| 2 | 16 | 1.0e-4 |
| 3 | 32 | 5.0e-5 |
For HP tuning on datasets larger than 1,000 samples, Agent subsamples to 1,000 (seed `42`) to keep candidate-search costs bounded.
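The epoch default and the subsampling rule above are easy to check by hand. The sketch below implements the stated formula and a reproducible subsample. The formula, the 1,000-row limit, and the seed come from the text; the use of `random.Random(seed).sample` is an assumption about how such a subsample could be drawn, not a description of Agent's sampler.

```python
import math
import random
from typing import List, Sequence


def default_epochs(total_samples: int) -> int:
    """min(5, ceil(2500 / total_samples)), as stated above."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    return min(5, math.ceil(2500 / total_samples))


def subsample_for_hp_tuning(rows: Sequence, limit: int = 1000, seed: int = 42) -> List:
    """Return at most limit rows, sampled reproducibly with the given seed."""
    if len(rows) <= limit:
        return list(rows)
    return random.Random(seed).sample(list(rows), limit)


for n in (200, 500, 1000, 2500, 10000):
    print(n, default_epochs(n))
```

Small datasets hit the cap of 5 epochs, and anything with 2,500 or more samples trains for a single epoch by default.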
Agent writes a plan to the session workspace and presents it to you with a cost breakdown (Training + Inference + Total). A single approval covers both the plan and the estimate. Reply with `Approved, proceed.` or ask for revisions and Agent will re-cost and re-present.
Agent launches the candidate training runs, capped at **6 active jobs at a time** by default. Each candidate trains on the (sub-sampled) train split and is evaluated against the held-out test split using the evaluation strategy you chose.
Before the full-data final run, Agent pauses at a promotion gate. It surfaces the candidate scoreboard (validation loss and any evaluator metrics) and asks you to confirm the winner. Reply with `Proceed with the winning config.`
Agent trains the winning configuration on the full dataset (epochs default to `min(5, ceil(2500 / total_samples))` for the final run). Agent then evaluates the final model directly and writes `final_report.md`.
Agent deploys the final model and reports the deployed model ID, deployment ID, inference endpoint, and a copy-paste `fireworks-ai` SDK snippet you can use immediately.
## Evaluation paths
Agent supports three evaluation strategies. You can specify one in your instruction, or Agent will ask which to use in plain English (it does **not** say "Path A" / "Path B" / "Path C" to you — the labels below are docs shorthand for the three options).
### Path A — validation loss only
The default. Agent creates a held-out test split, trains each candidate, and picks the winner purely on validation loss. No task-level evaluator is run. Choose this when:
* You don't have an evaluator script for the task
* The dataset is small or evaluator design is not yet settled
* You want the fastest, lowest-cost sweep
Trigger phrase: *"Use validation loss for evaluation."* or simply *"validation loss is fine"* if Agent asks.
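Selecting a winner purely on validation loss amounts to taking the minimum over the candidate scoreboard. The following is a minimal sketch of that idea; the `Candidate` class is hypothetical and the loss values are placeholders for illustration, not real measurements or Agent's internal data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    lora_rank: int
    learning_rate: float
    validation_loss: float  # placeholder value, not a real measurement


def pick_winner(candidates: list[Candidate]) -> Candidate:
    """Return the candidate with the lowest validation loss."""
    if not candidates:
        raise ValueError("no candidates to compare")
    return min(candidates, key=lambda c: c.validation_loss)


# The ranks and learning rates mirror the default grid; the losses are made up.
scoreboard = [
    Candidate(lora_rank=8, learning_rate=1.5e-4, validation_loss=0.92),
    Candidate(lora_rank=16, learning_rate=1.0e-4, validation_loss=0.87),
    Candidate(lora_rank=32, learning_rate=5.0e-5, validation_loss=0.90),
]
winner = pick_winner(scoreboard)
print(winner.lora_rank, winner.learning_rate)
```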
### Path B — bring your own evaluator
You provide a Python evaluator (uploaded to Fireworks, or generated in the same session via [evaluator authoring](/fine-tuning/agent/evaluators)). Agent runs the evaluator on each candidate's outputs and on the final model.
Trigger phrase: *"Use evaluator accounts/myacct/evaluators/my-eval."* or *"Use my own evaluator"* if Agent asks.
### Path C — Agent-generated evaluator
Agent inspects your data and writes a Python evaluator for structured or objectively checkable outputs (for example: numeric answers, JSON schemas, exact-match labels). It then uses that evaluator to score candidates and the final model.
Trigger phrase: *"Generate an evaluator for me."* or *"agent-generated evaluator"* if Agent asks.
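To give a feel for what "objectively checkable" means here, the sketch below shows three plain Python scoring functions for exact-match labels, numeric answers, and JSON outputs. These are illustrative only: the function names and the 0.0 to 1.0 scoring convention are assumptions, and this is not the evaluator interface that Agent generates or that Fireworks expects.

```python
import json


def score_exact_match(prediction: str, expected: str) -> float:
    """1.0 when the label matches after trimming whitespace, else 0.0."""
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def score_numeric(prediction: str, expected: float, tolerance: float = 1e-6) -> float:
    """1.0 when the prediction parses as a number within `tolerance` of the target."""
    try:
        value = float(prediction.strip())
    except ValueError:
        return 0.0
    return 1.0 if abs(value - expected) <= tolerance else 0.0


def score_json_keys(prediction: str, required_keys: set[str]) -> float:
    """1.0 when the prediction is a JSON object containing every required key."""
    try:
        parsed = json.loads(prediction)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if required_keys.issubset(parsed.keys()) else 0.0


print(score_exact_match(" positive\n", "positive"))
print(score_numeric("3.50", 3.5))
print(score_json_keys('{"city": "Paris", "temp": 21}', {"city", "temp"}))
```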
## Output
When the session reports `succeeded`, Agent's final message includes:
* The deployed **model ID** and **deployment ID**
* The inference endpoint and a ready-to-run `fireworks-ai` SDK snippet
* Final training loss and evaluation loss (or evaluator score) for the winning model
* Provenance for any rollout/evaluation evidence carried forward from candidate search
* A link to `final_report.md` in the session workspace with the full plan, costs (estimated vs actual), and per-candidate metrics
## Supported base models
Agent's SFT workflow supports the same base models as Managed Fine-Tuning. For the canonical list and maximum context lengths, see [Managed Fine-Tuning Overview → Supported base models](/fine-tuning/managed-finetuning-intro#supported-base-models).
You can ask Agent for the current list inside any session: *"Which base models do you support for SFT?"*
## Customizing the run
Things you can put in your instruction:
* **Candidate grid:** *"Use LoRA ranks 8, 16, 32 with learning rates 1e-4 and 5e-5."*
* **Fixed epochs:** *"Train each candidate for 3 epochs."*
* **Subsampling override:** *"Use 500 samples for HP tuning."*
* **Batch limit:** *"Run up to 10 training jobs in parallel."*
* **Skip final retrain:** *"Skip the full-data final run."* (Agent will deploy the winning candidate directly.)
* **Eval set:** *"Use accounts/myacct/datasets/holdout as the eval dataset."* (Agent sets `evaluationDataset` and disables eval carveout.)
If anything in your instruction conflicts with Agent's defaults, your instruction wins.
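A candidate-grid instruction such as the one above can be read as a cross product of ranks and learning rates. The sketch below enumerates that grid under the assumption that every rank is paired with every learning rate; the text does not state how Agent expands such an instruction, so treat the pairing rule and the dictionary keys as illustrative.

```python
from itertools import product

lora_ranks = [8, 16, 32]
learning_rates = [1e-4, 5e-5]

# Assumption: every LoRA rank is paired with every learning rate.
candidate_grid = [
    {"lora_rank": rank, "learning_rate": lr}
    for rank, lr in product(lora_ranks, learning_rates)
]

for candidate in candidate_grid:
    print(candidate)
print(len(candidate_grid), "candidates")
```

Under that assumption the grid has six candidates, which happens to fit within the default cap of 6 active jobs.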
**Agent crib notes**
* Required inputs for an SFT session: dataset ID. Optional: base model, evaluation strategy, candidate grid, epochs.
* Default tuning grid is 3 LoRA configs × selected base models. Default epochs = `min(5, ceil(2500 / total_samples))`.
* Agent will pause twice for user input: plan + cost approval, and the promotion gate. Common responses: `Approved, proceed.` and `Proceed with the winning config.`
* For validation-loss-only evaluation, answer `validation loss is fine` when Agent asks how to evaluate.
* All training jobs and the deployed model are created in **your** account using the `FIREWORKS_AGENT_API_KEY` you authenticated with. Agent never destroys or modifies your existing resources.
# Use Fireworks Agent with Claude Code, Cursor, Codex, and other coding agents
Source: https://docs.fireworks.ai/fine-tuning/agent/use-with-coding-agents
Install the Fireworks Agent skill file once and drive end-to-end fine-tuning from your coding agent.
Fireworks Agent is a great fit for coding-assistant workflows: long-running training jobs that benefit from a conversational driver, multi-turn approvals that benefit from a human-in-the-loop, and natural-language instructions that benefit from a model that already knows your project context.
This page gives you everything you need to plug Agent into Claude Code, Cursor, Codex, Aider, Goose, or any other coding agent — a single canonical skill file you can `curl`, the right install path for each runtime, and the agent-side patterns that handle Agent's plan-and-cost approval and waiting-state Q\&A loops correctly.
## TL;DR for agents
```text theme={null}
Fireworks Agent (firectl session) is a hosted Fireworks fine-tuning agent.
Use it whenever the user asks to fine-tune, train, or improve a model on
Fireworks (SFT, DPO/ORPO, or classification workflows).
Authenticate with FIREWORKS_AGENT_API_KEY in a project-local .env, sourced via
`source .env`. The key is a Fireworks API key from a service account with
the `pilot` permission preset (internal name kept for historical reasons —
it's the manifest that scopes Fireworks Agent's capabilities).
Lifecycle:
- firectl session create --instruction ""
- firectl session events --wait # stream until terminal or waiting
- firectl session update --instruction "" # respond to a waiting state
- firectl session get # status
- firectl session list # browse sessions
Safety:
- Always confirm with the user before `update`, `cancel`, or `delete`.
- `create`, `get`, `events`, `list` are safe to run autonomously.
- Never create a new session for a follow-up — always reuse the same session id.
- Always pass `--wait` to `events`; without it the command exits immediately.
```
## Prerequisites
See the [firectl reference](/tools-sdks/firectl/firectl) for installation. On Linux:
```bash theme={null}
curl -sL -o /tmp/firectl.gz https://storage.googleapis.com/fireworks-public/firectl/stable/linux-amd64.gz
gunzip -f /tmp/firectl.gz
sudo install -m 0755 /tmp/firectl /usr/local/bin/firectl
```
Create a service account scoped to Agent's capabilities. The CLI preset is named `pilot` for historical reasons — it's the [Agent capability manifest](/fine-tuning/agent/introduction#security-service-accounts-and-the-agent-manifest):
```bash theme={null}
firectl -a user create \
--service-account \
--user-id=fireworks-agent \
--permission-preset=pilot
firectl -a api-key create --service-account=fireworks-agent
```
Drop the returned key into your project root:
```bash .env theme={null}
FIREWORKS_AGENT_API_KEY=fw-...
```
The skill sources `.env` automatically. See [Service Accounts](/accounts/service-accounts) for the full setup.
## Install the skill file
The Fireworks Agent skill is a single Markdown document that teaches your coding agent how to drive Agent: when to invoke it, how to authenticate, how to handle waiting states and approval gates, which `firectl session` flags are confirmed working, and the common questions Agent asks mid-session. It auto-attaches based on the `description` frontmatter — no slash commands required.
The canonical source lives in the public [`fw-ai/cookbook`](https://github.com/fw-ai/cookbook) repo. `curl` the raw URL into your coding agent's skill location (see below), and re-fetch it at the start of a fine-tuning session to pick up the latest confirmed flags.
```bash Claude Code theme={null}
# Project-scoped (recommended)
mkdir -p .claude/skills/fireworks-agent
curl -L https://raw.githubusercontent.com/fw-ai/cookbook/main/skills/fireworks-agent/SKILL.md \
-o .claude/skills/fireworks-agent/SKILL.md
# Or user-scoped (available in every project)
mkdir -p ~/.claude/skills/fireworks-agent
curl -L https://raw.githubusercontent.com/fw-ai/cookbook/main/skills/fireworks-agent/SKILL.md \
-o ~/.claude/skills/fireworks-agent/SKILL.md
```
```bash Cursor theme={null}
mkdir -p .cursor/skills/fireworks-agent
curl -L https://raw.githubusercontent.com/fw-ai/cookbook/main/skills/fireworks-agent/SKILL.md \
-o .cursor/skills/fireworks-agent/SKILL.md
```
```bash Codex / Aider / Goose theme={null}
# These runtimes read AGENTS.md as ambient context at session start.
curl -L https://raw.githubusercontent.com/fw-ai/cookbook/main/skills/fireworks-agent/SKILL.md >> AGENTS.md
```
Once the skill is installed, prompts like *"Fine-tune Qwen3 32B on my customer-support dataset"* will trigger Fireworks Agent automatically. The coding agent will create a session, stream events, surface Agent's questions to you, and wait for your approval before sending responses back.
## How the agent should drive Fireworks Agent
Every Agent workflow pauses at least once (plan + cost approval) and may pause again at the promotion gate or whenever the planner needs missing information. Your coding agent must handle the loop correctly.
```mermaid theme={null}
flowchart TD
Create["firectl session create"] --> Stream["firectl session events --wait"]
Stream -->|"status: waiting"| Capture["Capture LAST_TS"]
Capture --> Extract["Extract status_info question"]
Extract --> Ask["Surface question to user verbatim"]
Ask --> Confirm["Get user response + confirmation"]
Confirm --> Update["firectl session update --instruction '...'"]
Update --> Resume["Resume events, filter older than LAST_TS"]
Resume --> Stream
Stream -->|"status: succeeded"| Report["Surface deployed model + final report"]
Stream -->|"status: failed"| Triage["Surface error, ask user how to proceed"]
```
Key invariants:
1. **`--wait` is required.** `session events` without `--wait` exits immediately after dumping history.
2. **Use the same session ID for follow-ups.** Never create a new session to continue a conversation. The runner reads state from the previous session's workspace.
3. **Pause for confirmation on `update`, `cancel`, `delete`.** The remaining commands (`create`, `get`, `events`, `list`) do not change or remove an existing session and are safe to run autonomously.
4. **Surface Agent's questions verbatim.** Agent's exact question contains the information the user needs to answer correctly. Don't paraphrase.
5. **Filter history on resume.** After a `session update`, the next `events --wait` re-dumps history. Filter on the timestamp captured before the update so the user only sees new traces.
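The history filter in the last invariant can be sketched as a simple timestamp comparison. The event shape used here (dictionaries with an ISO-8601 `timestamp` and a `message`) is an assumption for illustration; the actual output format of `firectl session events` is not described in this page, so adapt the parsing to what you observe.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def new_events(events: list[dict], last_ts: str) -> list[dict]:
    """Keep only events strictly newer than the timestamp captured before `session update`."""
    cutoff = parse_ts(last_ts)
    return [event for event in events if parse_ts(event["timestamp"]) > cutoff]


# Hypothetical replayed history after resuming the stream.
history = [
    {"timestamp": "2025-01-01T10:00:00Z", "message": "plan drafted"},
    {"timestamp": "2025-01-01T10:05:00Z", "message": "waiting for approval"},
    {"timestamp": "2025-01-01T10:07:30Z", "message": "approval received"},
]
LAST_TS = "2025-01-01T10:05:00Z"

for event in new_events(history, LAST_TS):
    print(event["timestamp"], event["message"])
```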
### Fallback when the stream drops
If the events stream drops unexpectedly (network error, client timeout), fall back to polling `session get` until the status is terminal or waiting:
```bash theme={null}
source .env && until firectl session get --api-key $FIREWORKS_AGENT_API_KEY 2>/dev/null \
| grep -E "waiting|succeeded|failed|cancelled"; do sleep 10; done \
&& firectl session get --api-key $FIREWORKS_AGENT_API_KEY
```
Then resume `events --wait` from the captured timestamp once you know the session is alive.
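The same fallback can be expressed as a small polling helper. This is a sketch under stated assumptions: `get_status` is a hypothetical callable that you would implement yourself (for example by running `firectl session get` and extracting the status), and the `running` status in the demo is a made-up non-terminal value. Only the four stop statuses come from the shell loop above.

```python
import time
from typing import Callable

# Statuses at which the shell loop above stops polling.
STOP_STATUSES = {"waiting", "succeeded", "failed", "cancelled"}


def poll_until_stopped(
    get_status: Callable[[], str],
    interval_seconds: float = 10.0,
    max_polls: int = 360,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `get_status` until it reports a stop status or the poll budget runs out."""
    for _ in range(max_polls):
        status = get_status()
        if status in STOP_STATUSES:
            return status
        sleep(interval_seconds)
    raise TimeoutError(f"session did not reach a stop status after {max_polls} polls")


# Demo with a fake status source and a no-op sleep so it finishes instantly.
fake_statuses = iter(["running", "running", "waiting"])
print(poll_until_stopped(lambda: next(fake_statuses), sleep=lambda _seconds: None))
```

Bounding the number of polls avoids an endless loop if the session record disappears or the CLI keeps failing.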
### Common waiting-state prompts and good responses
| Agent asks about | Reasonable default response |
| -------------------------------- | -------------------------------------------------------------------------------------------- |
| Which evaluation strategy to use | `"validation loss is fine"` (no task-level evaluator) |
| Plan and cost approval | `"Approved, proceed."` |
| Promotion gate / winning config | `"Proceed with the winning config."` |
| Missing base model | `"Use Qwen3 32B."` (or whatever model the user picked) |
| Missing dataset | (Agent won't reach `waiting` without a dataset — surface this back to the user and ask them) |
| Plan revisions | Forward the user's revision request verbatim |
## Pitfalls
* **Forgetting `--wait`.** The most common failure mode. Always pass it on `events`.
* **Creating a new session for a follow-up.** Agent loses all prior context. Use `session update` on the existing ID.
* **Running `update` / `cancel` / `delete` autonomously.** These are user-confirmation gates. Always ask first.
* **Treating Agent's safety refusals as failures.** Agent won't delete, cancel, or destroy your existing resources. If your instruction contains a destructive intent, rephrase it as a non-destructive action (list, inspect, create, monitor).
* **Streaming through a TTY-truncating wrapper.** Piping `firectl session events` through `tail` or `head` can hide the `[done] session status:` footer and break the loop. Stream directly.
## Reference: session commands
| Command | Description | Confirmation required |
| ----------------------------------------------------------------------------------------- | -------------------------- | --------------------- |
| `firectl session create --api-key $FIREWORKS_AGENT_API_KEY --instruction ""` | Start a session | No |
| `firectl session events --api-key $FIREWORKS_AGENT_API_KEY --wait` | Stream events | No |
| `firectl session get --api-key $FIREWORKS_AGENT_API_KEY` | Get status | No |
| `firectl session list --api-key $FIREWORKS_AGENT_API_KEY` | List sessions | No |
| `firectl session update --api-key $FIREWORKS_AGENT_API_KEY --instruction ""` | Reply to a waiting session | **Yes** |
| `firectl session cancel --api-key $FIREWORKS_AGENT_API_KEY` | Cancel a running session | **Yes** |
| `firectl session delete --api-key $FIREWORKS_AGENT_API_KEY` | Delete the session record | **Yes** |
`session create` and `session update` accept the long-form `--instruction` flag (short form: `-n`). All session commands accept `--scope optimize` (the default scope).
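A coding agent can encode the confirmation column of the table above as a small guard before it shells out. The sketch below is one way to do that; the function name is hypothetical, and treating unknown subcommands as requiring confirmation is a conservative assumption rather than documented behavior.

```python
# Subcommands taken from the reference table above.
NEEDS_CONFIRMATION = {"update", "cancel", "delete"}
SAFE_AUTONOMOUS = {"create", "get", "events", "list"}


def requires_user_confirmation(subcommand: str) -> bool:
    """Return True when the user must approve before `firectl session <subcommand>` runs."""
    if subcommand in SAFE_AUTONOMOUS:
        return False
    # Known gated commands and anything unrecognized both require approval (assumed default).
    return True


for name in ("create", "events", "update", "delete"):
    print(name, requires_user_confirmation(name))
```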
**Agent crib notes**
* `curl` the canonical [`fireworks-agent SKILL.md`](https://github.com/fw-ai/cookbook/blob/main/skills/fireworks-agent/SKILL.md) from the public `fw-ai/cookbook` repo into `.claude/skills/fireworks-agent/SKILL.md`, `.cursor/skills/fireworks-agent/SKILL.md`, or append it to `AGENTS.md` for Codex/Aider/Goose.
* Authenticate via `FIREWORKS_AGENT_API_KEY` in a project-local `.env`, sourced with `source .env`. The key is a Fireworks service-account API key with the `pilot` permission preset (the underlying capability manifest is kept under that internal name).
* Always reuse the same session ID for follow-ups. Always pass `--wait` to `events`. Always confirm before `update / cancel / delete`.
# Training Overview
Source: https://docs.fireworks.ai/fine-tuning/cli-reference
Launch RFT jobs using the eval-protocol CLI
**Reinforcement Fine-Tuning (RFT)** is free for models under 16B parameters. When creating an RFT job in the UI, filter for free tuning models in the model selection area on the [fine-tuning creation page](https://app.fireworks.ai/dashboard/fine-tuning/create). If kicking off jobs from the terminal, you can find the model ID from the [Model Library](https://app.fireworks.ai/models?filter=LLM\&tunable=true). Note: SFT and DPO jobs are billed per training token for all model sizes—see the [pricing page](https://fireworks.ai/pricing) for details.
The Eval Protocol CLI provides the fastest, most reproducible way to launch RFT jobs. This page covers everything you need to know about using `eval-protocol create rft`.
Before launching, review [Training Prerequisites & Validation](/fine-tuning/training-prerequisites) for requirements, validation checks, and common errors.
Already familiar with [firectl](/fine-tuning/cli-reference#using-firectl-cli-alternative)? Use it as an alternative to eval-protocol.
## Installation and setup
The following guide will help you:
* Upload your evaluator to Fireworks. If you don't have one yet, see [Concepts > Evaluators](/fine-tuning/evaluators)
* Upload your dataset to Fireworks
* Create and launch the RFT job
```bash theme={null}
pip install eval-protocol
```
Verify installation:
```bash theme={null}
eval-protocol --version
```
Configure your Fireworks API key:
```bash theme={null}
export FIREWORKS_API_KEY="fw_your_api_key_here"
```
Or create a `.env` file:
```bash theme={null}
FIREWORKS_API_KEY=fw_your_api_key_here
```
Before training, verify your evaluator works. This command discovers and runs your `@evaluation_test` with pytest. If a Dockerfile is present, it builds an image and runs the test in Docker; otherwise it runs on your host.
```bash theme={null}
cd evaluator_directory
ep local-test
```
If using a Dockerfile, it must use a Debian-based image (no Alpine or CentOS), be single-stage (no multi-stage builds), and only use supported instructions: `FROM`, `RUN`, `COPY`, `ADD`, `WORKDIR`, `USER`, `ENV`, `CMD`, `ENTRYPOINT`, `ARG`. Instructions like `EXPOSE` and `VOLUME` are ignored. See the [RFT quickstart guide](/fine-tuning/quickstart-svg-agent) for details.
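The Dockerfile constraints above lend themselves to a quick pre-flight check. The following is an unofficial sketch, not a Fireworks tool: `lint_dockerfile` is a hypothetical helper, the Alpine/CentOS detection is a name-based heuristic that cannot prove an image is Debian-based, and the parsing handles only simple line continuations.

```python
SUPPORTED = {"FROM", "RUN", "COPY", "ADD", "WORKDIR", "USER", "ENV", "CMD", "ENTRYPOINT", "ARG"}
IGNORED = {"EXPOSE", "VOLUME"}
NON_DEBIAN_HINTS = ("alpine", "centos")  # heuristic only


def lint_dockerfile(text: str) -> list[str]:
    """Return a list of human-readable problems found in a Dockerfile."""
    problems: list[str] = []
    from_count = 0
    continued = False  # True while inside a backslash-continued instruction
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if continued:
            continued = line.endswith("\\")
            continue
        continued = line.endswith("\\")
        parts = line.split(maxsplit=1)
        instruction = parts[0].upper()
        rest = parts[1] if len(parts) > 1 else ""
        if instruction == "FROM":
            from_count += 1
            if any(hint in rest.lower() for hint in NON_DEBIAN_HINTS):
                problems.append(f"base image '{rest}' does not look Debian-based")
        elif instruction in IGNORED:
            problems.append(f"{instruction} is ignored")
        elif instruction not in SUPPORTED:
            problems.append(f"{instruction} is not in the supported list")
    if from_count == 0:
        problems.append("no FROM instruction found")
    elif from_count > 1:
        problems.append("multi-stage builds are not supported")
    return problems


sample = "FROM python:3.11-slim\nRUN pip install \\\n    eval-protocol\nCOPY . /app\nEXPOSE 8080\n"
print(lint_dockerfile(sample))
```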
From the directory where your evaluator and dataset (`dataset.jsonl`) are located, run:
```bash theme={null}
eval-protocol create rft \
--base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
--output-model my-model-name
```
The CLI will:
* Upload evaluator code (if changed)
* Upload dataset (if changed)
* Create the RFT job
* Display dashboard links for monitoring
Expected output:
```
Created Reinforcement Fine-tuning Job
name: accounts/your-account/reinforcementFineTuningJobs/abc123
Dashboard Links:
Evaluator: https://app.fireworks.ai/dashboard/evaluators/your-evaluator
Dataset: https://app.fireworks.ai/dashboard/datasets/your-dataset
RFT Job: https://app.fireworks.ai/dashboard/fine-tuning/reinforcement/abc123
```
Click the RFT Job link to watch training progress in real time. See [Monitor Training](/fine-tuning/monitor-training) for details.
## Common CLI options
Customize your RFT job with these flags:
**Model and output**:
```bash theme={null}
--base-model accounts/fireworks/models/llama-v3p1-8b-instruct # Base model to fine-tune
--output-model my-custom-name # Name for fine-tuned model
```
**Training parameters**:
```bash theme={null}
--epochs 2 # Number of training epochs (default: 1)
--learning-rate 5e-5 # Learning rate (default: 1e-4)
--lora-rank 16 # LoRA rank (default: 8)
--batch-size 65536 # Batch size in tokens (default: 32768)
--chunk-size 200 # Prompts rolled out per GRPO training step (default: 200). -1 disables chunking.
```
**Loss method**:
```bash theme={null}
--rl-loss-method dapo # RL loss method: grpo (default), dapo, gspo-token
--rl-kl-beta 0.001 # KL beta override (only for grpo; rejected for dapo/gspo-token)
```
**Rollout (sampling) parameters**:
```bash theme={null}
--temperature 0.8 # Sampling temperature (default: 0.7)
--n 8 # Number of rollouts per prompt (default: 4)
--response-candidates-count 8 # Alias for --n in firectl (default: 8, minimum: 2)
--max-tokens 4096 # Max tokens per response (default: 32768)
--top-p 0.95 # Top-p sampling (default: 1.0)
--top-k 50 # Top-k sampling (default: 40)
--max-concurrent-rollouts 64 # Max in-flight rollouts per job (default: 96, or the value set in @evaluation_test). Throughput only; no training effect.
```
**Remote environments**:
```bash theme={null}
--remote-server-url https://your-evaluator.example.com # For remote rollout processing
```
**Force re-upload**:
```bash theme={null}
--force # Re-upload evaluator even if unchanged
```
See all options:
```bash theme={null}
eval-protocol create rft --help
```
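If you launch jobs from a script, it can help to assemble the command line programmatically and catch invalid flag combinations before calling the CLI. The sketch below builds an argument list using only flags listed above and enforces the documented rule that `--rl-kl-beta` applies only to `grpo`. The `build_rft_command` helper is hypothetical, and the snippet only prints the command rather than running it.

```python
import shlex
from typing import Optional

LOSS_METHODS = {"grpo", "dapo", "gspo-token"}


def build_rft_command(
    base_model: str,
    output_model: str,
    *,
    epochs: Optional[int] = None,
    learning_rate: Optional[float] = None,
    lora_rank: Optional[int] = None,
    rl_loss_method: Optional[str] = None,
    rl_kl_beta: Optional[float] = None,
) -> list[str]:
    """Assemble an `eval-protocol create rft` argument list from documented flags."""
    effective_method = rl_loss_method or "grpo"  # grpo is the documented default
    if effective_method not in LOSS_METHODS:
        raise ValueError(f"unknown loss method: {effective_method}")
    if rl_kl_beta is not None and effective_method != "grpo":
        raise ValueError("--rl-kl-beta is only accepted with the grpo loss method")

    command = [
        "eval-protocol", "create", "rft",
        "--base-model", base_model,
        "--output-model", output_model,
    ]
    optional_flags = {
        "--epochs": epochs,
        "--learning-rate": learning_rate,
        "--lora-rank": lora_rank,
        "--rl-loss-method": rl_loss_method,
        "--rl-kl-beta": rl_kl_beta,
    }
    for flag, value in optional_flags.items():
        if value is not None:
            command += [flag, str(value)]
    return command


cmd = build_rft_command(
    "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "my-model-name",
    epochs=2,
    rl_loss_method="dapo",
)
print(shlex.join(cmd))
```

Pass the resulting list to `subprocess.run` if you want to execute it; keeping it as a list avoids shell-quoting issues.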
## Advanced options
Track training metrics in W\&B for deeper analysis:
```bash theme={null}
eval-protocol create rft \
--base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
--wandb-project my-rft-experiments \
--wandb-entity my-org
```
Set `WANDB_API_KEY` in your environment first.
Save intermediate checkpoints during training:
```bash theme={null}
firectl rftj create \
--base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
--checkpoint-frequency 500 # Save every 500 steps
...
```
Available in `firectl` only.
For evaluators that need more time:
```bash theme={null}
firectl rftj create \
--rollout-timeout 300 # 5 minutes per rollout
...
```
Default is 60 seconds. Increase for complex evaluations.
For other tuning parameters — rollout concurrency, chunk size, loss method, and more — see [Parameter Tuning](/fine-tuning/parameter-tuning).
## Examples
**Fast experimentation** (small model, 1 epoch):
```bash theme={null}
eval-protocol create rft \
--base-model accounts/fireworks/models/qwen3-0p6b \
--output-model quick-test
```
**High-quality training** (more rollouts, higher temperature):
```bash theme={null}
eval-protocol create rft \
--base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
--output-model high-quality-model \
--n 8 \
--temperature 1.0
```
**Remote environment** (for multi-turn agents):
```bash theme={null}
eval-protocol create rft \
--base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
--remote-server-url https://your-agent.example.com \
--output-model remote-agent
```
**Multiple epochs with custom learning rate**:
```bash theme={null}
eval-protocol create rft \
--base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
--epochs 3 \
--learning-rate 5e-5 \
--output-model multi-epoch-model
```
## Using `firectl` CLI (Alternative)
If you are already familiar with Fireworks `firectl`, you can create RFT jobs directly:
```bash theme={null}
firectl rftj create \
--base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
--dataset accounts/your-account/datasets/my-dataset \
--evaluator accounts/your-account/evaluators/my-evaluator \
--output-model my-finetuned-model
```
**Differences from `eval-protocol`**:
* Requires fully qualified resource names (accounts/...)
* Must manually upload evaluators and datasets first
* More verbose but offers finer control
* Same underlying API as `eval-protocol`
See [firectl documentation](/tools-sdks/firectl/commands/reinforcement-fine-tuning-job-create) for all options.
## Next steps
* Review requirements, validation, and common errors
* Track job progress, inspect rollouts, and debug issues
* Learn how to adjust parameters for better results
# Remote Environment Setup
Source: https://docs.fireworks.ai/fine-tuning/connect-environments
Implement the /init endpoint to run evaluations in your infrastructure
If you already have an agent running in your product, or need to run rollouts on your own infrastructure, you can integrate it with RFT using the `RemoteRolloutProcessor`. This delegates rollout execution to an HTTP service you control.
Remote agents are ideal for:
* Multi-turn agentic workflows with tool use
* Access to private databases, APIs, or internal services
* Integration with existing agent codebases
* Complex simulations that require your infrastructure
New to RFT? Start with a [local agent](/fine-tuning/quickstart-math) instead. Local agents are simpler and cover most use cases. Only use remote agent environments when you need access to private infrastructure or have an existing agent to integrate.
## How remote rollouts work
1. During training, Fireworks calls your service's `POST /init` endpoint with the dataset row and correlation metadata.
2. Your agent executes the task (e.g., multi-turn conversation, tool calls, simulation steps), logging progress via Fireworks tracing.
3. Your service sends structured logs tagged with rollout metadata to Fireworks so the system can track completion.
4. Once Fireworks detects completion, it pulls the full trace and evaluates it using your scoring logic.
Everything except implementing your remote server is handled automatically by Eval Protocol. You only need to implement the `/init` endpoint and add Fireworks tracing.
## Implementing the /init endpoint
Your remote service must implement a single `/init` endpoint that accepts rollout requests.
### Request schema
* `completion_params`: Model configuration, including the model name and inference parameters such as `temperature` and `max_tokens`.
* `messages`: Array of conversation messages to send to the model.
* `tools`: Array of available tools for the model (for function calling).
* `model_base_url`: Base URL for making LLM calls through Fireworks tracing (includes correlation metadata).
* `metadata`: Rollout execution metadata for correlation (`rollout_id`, `run_id`, `row_id`, etc.).
* `api_key`: Fireworks API key to use for model calls.
### Example request
```json theme={null}
{
  "completion_params": {
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "temperature": 0.7,
    "max_tokens": 2048
  },
  "messages": [
    { "role": "user", "content": "What is the weather in San Francisco?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": { "type": "string" }
          }
        }
      }
    }
  ],
  "model_base_url": "https://tracing.fireworks.ai/rollout_id/brave-night-42/invocation_id/wise-ocean-15/experiment_id/calm-forest-28/run_id/quick-river-07/row_id/bright-star-91",
  "metadata": {
    "invocation_id": "wise-ocean-15",
    "experiment_id": "calm-forest-28",
    "rollout_id": "brave-night-42",
    "run_id": "quick-river-07",
    "row_id": "bright-star-91"
  },
  "api_key": "fw_your_api_key"
}
```
## Metadata correlation
The `metadata` object contains correlation IDs that you must include when logging to Fireworks tracing. This allows Eval Protocol to match logs and traces back to specific evaluation rows.
Required metadata fields:
* `invocation_id` - Identifies the evaluation invocation
* `experiment_id` - Groups related experiments
* `rollout_id` - Unique ID for this specific rollout (most important)
* `run_id` - Identifies the evaluation run
* `row_id` - Links to the dataset row
`RemoteRolloutProcessor` automatically generates these IDs and sends them to your server. You don't need to create them yourself—just pass them through to your logging.
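In the example request, the path of `model_base_url` appears to carry the same correlation IDs as the `metadata` object, as alternating key/value segments. The sketch below parses that path so a handler can sanity-check that the two agree. The URL layout is inferred from the single example above and is an assumption, not a documented contract; `correlation_ids_from_url` is a hypothetical helper.

```python
from urllib.parse import urlparse


def correlation_ids_from_url(model_base_url: str) -> dict[str, str]:
    """Read alternating /key/value path segments into a dict (layout assumed from the example)."""
    segments = [segment for segment in urlparse(model_base_url).path.split("/") if segment]
    if len(segments) % 2 != 0:
        raise ValueError("expected an even number of path segments")
    return dict(zip(segments[0::2], segments[1::2]))


# Values copied from the example request above.
url = (
    "https://tracing.fireworks.ai/rollout_id/brave-night-42/invocation_id/wise-ocean-15"
    "/experiment_id/calm-forest-28/run_id/quick-river-07/row_id/bright-star-91"
)
metadata = {
    "invocation_id": "wise-ocean-15",
    "experiment_id": "calm-forest-28",
    "rollout_id": "brave-night-42",
    "run_id": "quick-river-07",
    "row_id": "bright-star-91",
}

print(correlation_ids_from_url(url) == metadata)
```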
## Fireworks tracing integration
Your remote server must use Fireworks tracing to report rollout status. Eval Protocol polls these logs to detect when rollouts complete.
### Basic setup
```python theme={null}
import logging

from fastapi import FastAPI
from eval_protocol import Status, InitRequest, FireworksTracingHttpHandler, RolloutIdFilter

app = FastAPI()

# Configure Fireworks tracing handler globally
fireworks_handler = FireworksTracingHttpHandler()
logging.getLogger().addHandler(fireworks_handler)


@app.post("/init")
def init(request: InitRequest):
    # Create rollout-specific logger with filter
    rollout_logger = logging.getLogger(f"eval_server.{request.metadata.rollout_id}")
    rollout_logger.addFilter(RolloutIdFilter(request.metadata.rollout_id))

    try:
        # Execute your agent logic here (execute_agent is a placeholder for your own function)
        result = execute_agent(request)

        # Log successful completion with structured status
        rollout_logger.info(
            f"Rollout {request.metadata.rollout_id} completed",
            extra={"status": Status.rollout_finished()},
        )
        return {"status": "success"}
    except Exception as e:
        # Log errors with structured status
        rollout_logger.error(
            f"Rollout {request.metadata.rollout_id} failed: {e}",
            extra={"status": Status.rollout_error(str(e))},
        )
        raise
```
### Key components
1. **FireworksTracingHttpHandler**: Sends logs to Fireworks tracing service
2. **RolloutIdFilter**: Tags logs with the rollout ID for correlation
3. **Status objects**: Structured status reporting that Eval Protocol can parse
* `Status.rollout_finished()` - Signals successful completion
* `Status.rollout_error(message)` - Signals failure with error details
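To make the tagging idea concrete, here is a standard-library sketch of how a logging filter can stamp every record with a rollout ID before handlers see it. `TaggingFilter` and `ListHandler` are hypothetical stand-ins written for this example; they are not the implementation of `RolloutIdFilter` or `FireworksTracingHttpHandler`, and the string status is a placeholder for the structured `Status` object.

```python
import logging


class TaggingFilter(logging.Filter):
    """Hypothetical stand-in for RolloutIdFilter: annotates records with a rollout ID."""

    def __init__(self, rollout_id: str) -> None:
        super().__init__()
        self.rollout_id = rollout_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.rollout_id = self.rollout_id
        return True  # never drop records, only annotate them


class ListHandler(logging.Handler):
    """Collects records in memory so the tagging can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


handler = ListHandler()
logger = logging.getLogger("eval_server.brave-night-42")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger
logger.addHandler(handler)
logger.addFilter(TaggingFilter("brave-night-42"))

# `extra` attaches additional attributes to the record, as in the handler code above.
logger.info("Rollout completed", extra={"status": "finished"})
print(handler.records[0].rollout_id, handler.records[0].status)
```

Because the filter is attached to a per-rollout logger, concurrent rollouts in the same process each get their own tag.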
### Alternative: Environment variable approach
For simpler setups, you can use the `EP_ROLLOUT_ID` environment variable instead of manual filters.
If your server processes one rollout at a time (e.g., serverless functions, container per request):
```python theme={null}
import os
import logging

from fastapi import FastAPI
from eval_protocol import Status, InitRequest, FireworksTracingHttpHandler

app = FastAPI()
logger = logging.getLogger(__name__)


@app.post("/init")
def init(request: InitRequest):
    # Set rollout ID in environment (the request is only available inside the handler).
    # Assumes one rollout per process, as described above.
    os.environ["EP_ROLLOUT_ID"] = request.metadata.rollout_id

    # Configure handler (automatically picks up EP_ROLLOUT_ID)
    fireworks_handler = FireworksTracingHttpHandler()
    logging.getLogger().addHandler(fireworks_handler)

    # Logs are automatically tagged with rollout_id
    logger.info("Processing rollout...")
    # ... execute agent logic ...
```
If your `/init` handler spawns separate Python processes for each rollout:
```python theme={null}
import os
import logging
import multiprocessing

from fastapi import FastAPI
from eval_protocol import FireworksTracingHttpHandler, InitRequest

app = FastAPI()


def execute_rollout_step_sync(request):
    # Set EP_ROLLOUT_ID in the child process
    os.environ["EP_ROLLOUT_ID"] = request.metadata.rollout_id
    logging.getLogger().addHandler(FireworksTracingHttpHandler())

    # Execute your rollout logic here
    # Logs are automatically tagged


@app.post("/init")
async def init(request: InitRequest):
    # Do NOT set EP_ROLLOUT_ID in parent process
    p = multiprocessing.Process(
        target=execute_rollout_step_sync,
        args=(request,),
    )
    p.start()
    return {"status": "started"}
```
### How Eval Protocol uses tracing
1. **Your server logs completion**: Uses `Status.rollout_finished()` or `Status.rollout_error()`
2. **Eval Protocol polls**: Searches Fireworks logs by `rollout_id` tag until completion signal found
3. **Status extraction**: Reads structured status fields (`code`, `message`, `details`) to determine outcome
4. **Trace retrieval**: Fetches full trace of model calls and tool use for evaluation
## Complete example
Here's a minimal but complete remote server implementation:
```python theme={null}
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from eval_protocol import InitRequest, FireworksTracingHttpHandler, RolloutIdFilter, Status
import logging

app = FastAPI()

# Setup Fireworks tracing
fireworks_handler = FireworksTracingHttpHandler()
logging.getLogger().addHandler(fireworks_handler)


@app.post("/init")
async def init(request: InitRequest):
    # Create rollout-specific logger
    rollout_logger = logging.getLogger(f"eval_server.{request.metadata.rollout_id}")
    rollout_logger.addFilter(RolloutIdFilter(request.metadata.rollout_id))
    rollout_logger.info(f"Starting rollout {request.metadata.rollout_id}")

    try:
        # Your agent logic here
        # 1. Make model calls using request.model_base_url
        # 2. Call tools, interact with environment
        # 3. Collect results
        result = run_your_agent(
            messages=request.messages,
            tools=request.tools,
            model_config=request.completion_params,
            api_key=request.api_key,
        )

        # Signal completion
        rollout_logger.info(
            f"Rollout {request.metadata.rollout_id} completed successfully",
            extra={"status": Status.rollout_finished()},
        )
        return {"status": "success", "result": result}
    except Exception as e:
        # Signal error
        rollout_logger.error(
            f"Rollout {request.metadata.rollout_id} failed: {str(e)}",
            extra={"status": Status.rollout_error(str(e))},
        )
        return JSONResponse(
            status_code=500,
            content={"status": "error", "message": str(e)},
        )


def run_your_agent(messages, tools, model_config, api_key):
    # Implement your agent logic here
    # Make model calls, use tools, etc.
    pass
```
## Testing locally
Before deploying, test your remote server locally:
```bash theme={null}
uvicorn main:app --reload --port 8080
```
In your evaluator test, point to your local server:
```python theme={null}
from eval_protocol.pytest import RemoteRolloutProcessor
rollout_processor = RemoteRolloutProcessor(
    remote_base_url="http://localhost:8080"
)
```
```bash theme={null}
pytest my-evaluator-name.py -vs
```
This sends test rollouts to your local server and verifies the integration works.
## Deploying your service
Once tested locally, deploy to production:
* ✅ Service is publicly accessible (or accessible via VPN/private network)
* ✅ HTTPS endpoint with valid SSL certificate (recommended)
* ✅ Authentication/authorization configured
* ✅ Monitoring and logging set up
* ✅ Auto-scaling configured for concurrent rollouts
* ✅ Error handling and retry logic implemented
* ✅ Service availability SLA meets training requirements
**Vercel/Serverless**:
* One rollout per function invocation
* Use environment variable approach
* Configure timeout for long-running evaluations
**AWS ECS/Kubernetes**:
* Handle concurrent requests with proper worker configuration
* Use RolloutIdFilter approach
* Set up load balancing
**On-premise**:
* Ensure network connectivity from Fireworks
* Configure firewall rules
* Set up VPN if needed for security
## Connecting to RFT
Once your remote server is deployed, create an RFT job that uses it:
```bash theme={null}
eval-protocol create rft \
--base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
--remote-server-url https://your-evaluator.example.com \
--dataset my-dataset
```
The RFT job will send all rollouts to your remote server for evaluation during training.
## Troubleshooting
**Symptoms**: Rollouts show as timed out or never complete
**Solutions**:
* Check that your service is logging `Status.rollout_finished()` correctly
* Verify Fireworks tracing handler is configured
* Ensure rollout\_id is included in log tags
* Check for exceptions being swallowed without logging
**Symptoms**: Eval Protocol can't match logs to rollouts
**Solutions**:
* Verify you're using the exact `rollout_id` from request metadata
* Check that RolloutIdFilter or EP\_ROLLOUT\_ID is set correctly
* Ensure logs are being sent to Fireworks (check tracing dashboard)
**Symptoms**: Training is slow, high rollout latency
**Solutions**:
* Scale your service to handle concurrent requests
* Optimize your agent logic (caching, async operations)
* Add more workers or instances
* Profile your code to find bottlenecks
**Symptoms**: Model calls fail, API errors
**Solutions**:
* Verify API key is passed correctly from request
* Check that your service has network access to Fireworks
* Ensure model\_base\_url is used for traced calls
## Example implementations
Learn by example:
* Complete walkthrough using a Vercel TypeScript server for SVG generation
* Minimal Python implementation showing the basics
## Next steps
* Launch your RFT job using the CLI
* Track rollout progress and debug issues
* Full Remote Rollout Processor tutorial
* Design effective reward functions
# Debug SFT tokenization
Source: https://docs.fireworks.ai/fine-tuning/debug-sft-tokenization
Download rendered token IDs and loss masks for supervised fine-tuning jobs.
When supervised fine-tuning quality looks wrong, first check what the trainer actually saw. Fireworks can attach a **Render Samples** download to supervised fine-tuning job details. The file is a JSONL sample of records after Fireworks applies the model's chat template, tokenizer, and training mask.
Use render samples to answer questions such as:
* Did `system`, `user`, `assistant`, and tool messages render with the expected special tokens?
* Are only the intended assistant tokens included in the loss?
* Did a message-level `weight: 0` or sample-level `weight` remove the tokens you expected?
* Does Fireworks' tokenizer output match the tokenizer behavior you tested locally?
The render samples file is a diagnostic sample, not a full dataset export. New supervised fine-tuning jobs capture up to 20 rendered records by default. Older jobs, jobs that fail before rendering, or jobs without captured samples may not show the download.
## Download render samples
1. Go to the [Fireworks dashboard](https://app.fireworks.ai/dashboard/fine-tuning), then open the supervised fine-tuning job you want to inspect.
2. In the job details sidebar, look for **Render Samples**.
3. Click **Download**. Each line in the downloaded file is one rendered training record.
Render samples can contain text from your training dataset in `decoded_tokens`. Treat the downloaded file like training data and do not share it publicly.
## Understand the JSONL fields
A render sample record looks like this:
```json theme={null}
{
  "source_jsonl_row_index": 4,
  "source_jsonl_line_number": 5,
  "split_index": 0,
  "worker_id": 2,
  "renderer": "qwen3",
  "train_on_what": "all_assistant_messages",
  "token_ids": [10, 11, 12],
  "decoded_tokens": ["<|im_start|>", "assistant", "Hello"],
  "token_weights": [0.0, 0.0, 1.0],
  "training_target_token_ids": [11, 12],
  "training_loss_weights": [0.0, 1.0]
}
```
| Field | Meaning |
| --------------------------- | ----------------------------------------------------------------------------------------------------------- |
| `source_jsonl_row_index` | Zero-based index of the source dataset row. |
| `source_jsonl_line_number` | One-based source line number, useful for opening the row in an editor. |
| `split_index` | Index of the rendered record produced from that source row. Most rows produce one record. |
| `renderer` | Chat template renderer selected for the base model. |
| `train_on_what` | Which message content is configured to contribute to training loss. |
| `token_ids` | Full rendered token sequence before the next-token shift. |
| `decoded_tokens` | One-token decode for each token ID. Tokenizers may show whitespace markers or byte fallback pieces. |
| `token_weights` | Per-token training weight in rendered order. `0` means context only; a positive value contributes to loss. |
| `training_target_token_ids` | Shifted next-token targets passed to the trainer. This array is usually one shorter than `token_ids`. |
| `training_loss_weights` | Loss weights aligned with `training_target_token_ids`. A positive value means that target token is trained. |
For quick inspection, `token_ids`, `decoded_tokens`, and `token_weights` are the easiest fields to scan. For exact trainer behavior, use `training_target_token_ids` and `training_loss_weights`; those are shifted for next-token prediction.
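The relationship between the rendered fields and the shifted fields can be checked programmatically. The sketch below assumes that the shifted arrays are the rendered arrays with the first position dropped. That matches the sample record above, but it is an assumption rather than a documented guarantee. `check_shift` is a hypothetical helper name, not part of any Fireworks SDK.

```python
import json

# Sample record copied from the example above (only the fields needed here).
record = json.loads("""
{
  "token_ids": [10, 11, 12],
  "token_weights": [0.0, 0.0, 1.0],
  "training_target_token_ids": [11, 12],
  "training_loss_weights": [0.0, 1.0]
}
""")


def check_shift(rec: dict) -> bool:
    """Return True if the shifted fields equal the rendered fields minus position 0.

    Assumption: next-token targets are token_ids[1:] and loss weights are
    token_weights[1:]. A record that fails this check may have been truncated
    or split, and is worth inspecting by hand.
    """
    return (
        rec["training_target_token_ids"] == rec["token_ids"][1:]
        and rec["training_loss_weights"] == rec["token_weights"][1:]
    )


print(check_shift(record))
```

If the check fails for a record, compare the two weight arrays position by position before drawing conclusions about what was trained.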
## Inspect a downloaded file
Use this local script to print each rendered token with its training status:
```python theme={null}
import json
from pathlib import Path

for line in Path("render_samples.jsonl").read_text().splitlines():
    record = json.loads(line)
    print(
        f"\nsource line {record['source_jsonl_line_number']} "
        f"split {record['split_index']} "
        f"renderer={record['renderer']} "
        f"train_on={record['train_on_what']}"
    )
    for index, (token_id, text, weight) in enumerate(
        zip(record["token_ids"], record["decoded_tokens"], record["token_weights"])
    ):
        status = "TRAIN" if float(weight) > 0 else "ctx"
        print(f"{index:04d} {int(token_id):8d} {float(weight):g} {status:5s} {text!r}")
```
Then compare the reported `source_jsonl_line_number` with the original dataset row:
```bash theme={null}
sed -n '5p' train.jsonl
```
Replace `5` with the line number from the render sample.
## Common findings
| What you see | Likely cause | What to do |
| --------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| Assistant answer tokens have `token_weights` of `0` | The assistant message has `weight: 0`, the sample has zero weight, or the job is configured to train on different content. | Check the original JSONL row and remove unintended weights. |
| User or system tokens have positive `token_weights` | The row schema or training configuration is not representing roles as intended. | Verify every message has the correct `role`, and avoid putting assistant text in a `user` message. |
| Expected text is missing from `decoded_tokens` | The source row may have been split, truncated, or rendered differently by the model chat template. | Check `split_index`, source line number, and the job's max context length. |
| Extra special tokens appear around messages | The selected model renderer is adding chat template markers. | This is often expected. If the markers are wrong for your use case, check that the base model and dataset format match. |
| Token boundaries look surprising | Many tokenizers encode whitespace, Unicode, and byte fallback pieces in non-obvious ways. | Compare with the same Hugging Face tokenizer using `skip_special_tokens=False`. |
| The Render Samples row is missing | The job may predate this feature, may have failed before rendering, or may not have captured samples. | Create a new supervised fine-tuning job, or contact support with the job ID if the job should have rendered samples. |
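Several of the findings above can be spotted without reading every token. As a minimal sketch, assuming the field names from the render sample record shown earlier, the hypothetical helpers below count how many tokens in each record contribute to loss and list the source lines where nothing is trained.

```python
from typing import Dict, List


def summarize_record(rec: Dict) -> Dict:
    """Count rendered tokens and tokens with a positive training weight."""
    weights = [float(w) for w in rec["token_weights"]]
    trained = sum(1 for w in weights if w > 0)
    return {
        "line": rec["source_jsonl_line_number"],
        "tokens": len(weights),
        "trained": trained,
    }


def untrained_lines(records: List[Dict]) -> List[int]:
    """Return source line numbers of records where no token contributes to loss."""
    return [s["line"] for s in map(summarize_record, records) if s["trained"] == 0]


# Illustrative records; in practice, load them from the downloaded JSONL file.
records = [
    {"source_jsonl_line_number": 5, "token_weights": [0.0, 0.0, 1.0]},
    {"source_jsonl_line_number": 9, "token_weights": [0.0, 0.0, 0.0]},
]
for rec in records:
    print(summarize_record(rec))
print("fully masked source lines:", untrained_lines(records))
```

A fully masked record is not always a bug, but it is a good first place to look for an unintended `weight: 0`.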
## Compare with a local tokenizer
If you have access to the matching Hugging Face tokenizer, compare Fireworks' rendered tokens with local tokenizer output:
```python theme={null}
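import json
from pathlib import Path

from transformers import AutoTokenizer

# Replace the placeholder with the Hugging Face repo ID of the tokenizer
# that matches your base model.
tokenizer = AutoTokenizer.from_pretrained("<hf-tokenizer-id>", trust_remote_code=True)

# Decode the first rendered record, keeping special tokens visible
record = json.loads(Path("render_samples.jsonl").read_text().splitlines()[0])
print(tokenizer.decode(record["token_ids"], skip_special_tokens=False))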
```
The local decode should help explain token boundaries and special tokens. If local tokenization differs, confirm that you are using the same tokenizer family and revision as the base model selected for fine-tuning.
# Deploying Fine Tuned Models
Source: https://docs.fireworks.ai/fine-tuning/deploying-loras
Deploy one or more LoRA models fine-tuned on Fireworks using live merge or multi-LoRA
After fine-tuning your model on Fireworks, deploy it to make it available for inference. Fireworks supports two deployment methods for LoRA fine-tuned models: **live merge** and **multi-LoRA**. Each method has different tradeoffs around performance, cost, and flexibility.
Fine-tuned LoRA models, whether created on the Fireworks platform or imported, can **only** be deployed to **on-demand (dedicated) deployments**. Serverless deployment is not supported for LoRA models.
You can also upload and deploy LoRA models fine-tuned outside of Fireworks. See [importing fine-tuned models](/models/uploading-custom-models#importing-fine-tuned-models) for details.
## Choosing a deployment method
Fireworks offers two ways to deploy LoRA fine-tuned models. The right choice depends on how many fine-tuned variants you need to serve and your performance requirements.
| | **Live merge** | **Multi-LoRA** |
| ------------------------- | ---------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
| **How it works** | LoRA weights are merged into the base model at deployment time, creating a single merged model | Base model is deployed with addon support; LoRA adapters are loaded dynamically at request time |
| **Number of LoRAs** | One per deployment | Multiple per deployment |
| **Inference performance** | Matches the base model (no overhead) | Some overhead per request due to dynamic adapter application |
| **Throughput** | Same as base model | Lower maximum throughput under high concurrency |
| **Cost efficiency** | One deployment per fine-tune | Share a single deployment across many fine-tunes |
| **Best for** | Production workloads requiring maximum performance | Experimentation, A/B testing, or serving many variants of the same base model |
If you only need to serve a single fine-tuned model, **live merge is the recommended approach**. It delivers the best performance with the simplest setup.
## Live merge deployment
Live merge is the simplest way to deploy a fine-tuned model. Fireworks automatically merges the LoRA weights into the base model at deployment time, producing a model that performs identically to a natively fine-tuned model with no inference overhead.
### How it works
When you deploy a LoRA model directly, Fireworks:
1. Takes your LoRA adapter weights and the base model
2. Merges them into a single set of weights at deployment time
3. Serves the merged model as a standalone deployment
The result is a deployment that is indistinguishable from a fully fine-tuned model in terms of latency, throughput, and memory usage.
### Deploy with live merge
Deploy your LoRA fine-tuned model with a single command:
```bash theme={null}
firectl deployment create "accounts/<ACCOUNT_ID>/models/<MODEL_ID>"
```
Your deployment will be ready to use once it completes, with performance that matches the base model.
### Sending requests
Send inference requests to your live-merge deployment by referencing the deployment directly:
```python theme={null}
from fireworks import Fireworks

client = Fireworks()

response = client.chat.completions.create(
    model="accounts/<ACCOUNT_ID>/models/<MODEL_ID>",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
```bash theme={null}
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -d '{
    "model": "accounts/<ACCOUNT_ID>/models/<MODEL_ID>",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```
### When to use live merge
* You need maximum inference performance (latency and throughput matching the base model)
* You are serving a single fine-tuned model in production
* You want the simplest possible deployment workflow
## Multi-LoRA deployment
Multi-LoRA lets you load multiple LoRA adapters onto a single base model deployment. This is useful when you have several fine-tuned variants of the same base model and want to share GPU resources across them rather than creating a separate deployment for each.
### How it works
With multi-LoRA:
1. You deploy the base model with addon support enabled
2. You load one or more LoRA adapters onto the running deployment
3. At inference time, the correct adapter is selected and applied dynamically based on the model specified in the request
Because adapters are applied dynamically rather than merged, there is some performance overhead compared to live merge. This overhead increases with higher request concurrency.
### LoRA addon shape compatibility
Not all deployment shapes support LoRA addons. **FP8 and FP4 quantized shapes do not support `--enable-addons`.**
| Precision | `--enable-addons` supported? |
| --------- | ---------------------------- |
| BF16 | ✅ Yes |
| FP8 | ❌ No |
| FP4 | ❌ No |
Many base models default to FP8 or FP4 shapes. If you need LoRA addon inference on one of these models, you have two options:
**Option 1 — Use a BF16 deployment shape**
```bash theme={null}
# List available shapes for your model
firectl deployment-shape-version list --base-model accounts/fireworks/models/<MODEL_ID>

# Create deployment with a BF16 shape and addons enabled
firectl deployment create "accounts/fireworks/models/<MODEL_ID>" \
  --deployment-shape <SHAPE_NAME> \
  --enable-addons
```
**Option 2 — Merge the adapter into a standalone model**
If no BF16 addon-compatible shape is available, use [live merge](#live-merge-deployment) (recommended for a single adapter) or merge the LoRA into a standalone Fireworks model, then deploy that merged model without `--enable-addons`. See [Uploading custom models](/models/uploading-custom-models#importing-fine-tuned-models) and [`firectl model create`](/tools-sdks/firectl/commands/model-create).
`"addons cannot be enabled with quantized precisions (FP8/FP4)"` — your model's default shape is quantized; use Option 1 or 2 above.
`"the deployment shape version does not exist or you do not have access to it"` — the shape you requested is not available on your account; contact support.
### Deploy with multi-LoRA
Deploy the base model with addons enabled:
```bash theme={null}
firectl deployment create "accounts/fireworks/models/<MODEL_ID>" --enable-addons
```
Once the deployment is ready, load your LoRA models onto the deployment:
```bash theme={null}
firectl load-lora <FINE_TUNED_MODEL_ID> --deployment <DEPLOYMENT_ID>
```
Repeat this command for each LoRA adapter you want to load.
### Sending requests
To route inference requests to a specific LoRA adapter on a multi-LoRA deployment, set the `model` field to `<model_name>#<deployment_name>`. The `#` separator tells Fireworks to route the request to the specified adapter on the given deployment.
**Deprecation notice:** The `deployedModel` request key for routing to LoRA addons is deprecated and will not be supported for any new deployments. Use the `model` field with the `#` format shown below.
```python theme={null}
from fireworks import Fireworks

client = Fireworks()

response = client.chat.completions.create(
    model="accounts/<ACCOUNT_ID>/models/<MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
```python theme={null}
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

response = client.chat.completions.create(
    model="accounts/<ACCOUNT_ID>/models/<MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
```javascript theme={null}
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.FIREWORKS_API_KEY,
  baseURL: "https://api.fireworks.ai/inference/v1",
});

const response = await client.chat.completions.create({
  model: "accounts/<ACCOUNT_ID>/models/<MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
  messages: [
    {
      role: "user",
      content: "Hello!",
    },
  ],
});

console.log(response.choices[0].message.content);
```
```bash theme={null}
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -d '{
    "model": "accounts/<ACCOUNT_ID>/models/<MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```
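When several adapters share one deployment, it can help to build the routing string in one place. The following is a minimal sketch: `lora_route` is a hypothetical helper, and the account, model, and deployment names in the example are placeholders chosen for illustration.

```python
def lora_route(model_name: str, deployment_name: str) -> str:
    """Join an adapter's model name and a deployment name with the '#' separator."""
    for part in (model_name, deployment_name):
        if not part or "#" in part:
            raise ValueError(f"invalid resource name: {part!r}")
    return f"{model_name}#{deployment_name}"


# Placeholder names, for illustration only.
route = lora_route(
    "accounts/my-account/models/my-lora",
    "accounts/my-account/deployments/my-deployment",
)
print(route)
```

The returned string is what goes into the `model` field of the request.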
### When to use multi-LoRA
* You need to serve multiple fine-tuned models based on the same base model
* You want to maximize GPU utilization by sharing a single deployment
* You are running experiments or A/B tests across multiple fine-tuned variants
* You can accept some performance overhead compared to live merge
## Performance considerations
Live merge eliminates all LoRA-related inference overhead because the adapter weights are baked into the model at deployment time. The resulting deployment behaves exactly like a natively fine-tuned base model.
Multi-LoRA deployments incur overhead because adapters are applied dynamically:
* **Time to first token (TTFT):** Increases by roughly 10–30% due to adapter loading and prompt processing overhead
* **Generation speed:** Overhead grows with higher request concurrency
* **Maximum throughput:** Lower than a live-merge deployment under sustained load
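To get a feel for the TTFT figure, the rough arithmetic can be written out. This is an illustrative estimate only: the 200 ms baseline is an invented number, and `ttft_range_ms` is a hypothetical helper that applies the 10–30% range quoted above.

```python
from typing import Tuple


def ttft_range_ms(base_ttft_ms: float, low: float = 0.10, high: float = 0.30) -> Tuple[float, float]:
    """Estimate the multi-LoRA TTFT range from a live-merge (base model) TTFT."""
    return base_ttft_ms * (1 + low), base_ttft_ms * (1 + high)


# Illustrative baseline; measure your own deployment for real numbers.
low_ms, high_ms = ttft_range_ms(200.0)
print(f"estimated TTFT: {low_ms:.0f} to {high_ms:.0f} ms")
```

Actual overhead depends on the model, adapter, and concurrency, so treat the result as a planning estimate rather than a prediction.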
For a deeper dive into LoRA performance characteristics and optimization strategies, see [Understanding LoRA Performance](/guides/understanding_lora_performance).
## Next steps
* Learn about deployment configuration and optimization
* Upload LoRA models fine-tuned outside of Fireworks
* Understand performance tradeoffs and optimization strategies
# Direct Preference Optimization
Source: https://docs.fireworks.ai/fine-tuning/dpo-fine-tuning
Direct Preference Optimization (DPO) fine-tunes models by training them on pairs of preferred and non-preferred responses to the same prompt. This teaches the model to generate more desirable outputs while reducing unwanted behaviors.
**Use DPO when:**
* Aligning model outputs with brand voice, tone, or style guidelines
* Reducing hallucinations or incorrect reasoning patterns
* Improving response quality where there's no single "correct" answer
* Teaching models to follow specific formatting or structural preferences
## Fine-tuning with DPO
Datasets must adhere strictly to the JSONL format, where each line represents a complete JSON-formatted training example.
**Minimum Requirements:**
* **Minimum examples needed:** 3
* **Maximum examples:** Up to 3 million examples per dataset
* **File format:** JSONL (each line is a valid JSON object)
* **Dataset Schema:** Each training sample must include the following fields:
* An `input` field containing a `messages` array, where each message is an object with two fields:
* `role`: one of `system`, `user`, or `assistant`
* `content`: a string representing the message content
* A `preferred_output` field containing an assistant message with an ideal response
* A `non_preferred_output` field containing an assistant message with a suboptimal response
Here’s an example conversation dataset (one training example):
```json einstein_dpo.jsonl theme={null}
{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "What is Einstein famous for?"
      }
    ],
    "tools": []
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "Einstein is renowned for his theory of relativity, especially the equation E=mc²."
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "He was a famous scientist."
    }
  ]
}
```
Each example currently supports a single preference turn: the preferred and non-preferred messages must each be the final assistant message of the conversation.
Save this dataset locally as a JSONL file, for example `einstein_dpo.jsonl`.
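Because JSONL requires each example on a single line, the pretty-printed example above needs to be serialized compactly before saving. The sketch below checks one example against the schema described earlier and produces its JSONL line. `validate_dpo_example` is a hypothetical helper, and treating the output fields as lists of assistant messages is an assumption based on the example, not a full statement of the server-side validation.

```python
import json
from typing import Dict, List

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_dpo_example(example: Dict) -> List[str]:
    """Return a list of schema problems; an empty list means the example looks valid."""
    errors = []
    inp = example.get("input")
    messages = inp.get("messages") if isinstance(inp, dict) else None
    if not isinstance(messages, list) or not messages:
        errors.append("input.messages must be a non-empty list")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
                errors.append(f"input.messages[{i}] has an invalid role")
            elif not isinstance(msg.get("content"), str):
                errors.append(f"input.messages[{i}].content must be a string")
    for key in ("preferred_output", "non_preferred_output"):
        output = example.get(key)
        ok = (
            isinstance(output, list)
            and len(output) > 0
            and all(isinstance(m, dict) and m.get("role") == "assistant" for m in output)
        )
        if not ok:
            errors.append(f"{key} must be a non-empty list of assistant messages")
    return errors


example = {
    "input": {
        "messages": [{"role": "user", "content": "What is Einstein famous for?"}],
        "tools": [],
    },
    "preferred_output": [
        {"role": "assistant", "content": "Einstein is renowned for his theory of relativity."}
    ],
    "non_preferred_output": [
        {"role": "assistant", "content": "He was a famous scientist."}
    ],
}

errors = validate_dpo_example(example)
jsonl_line = json.dumps(example, ensure_ascii=False)  # one example per line
print(errors)
print(jsonl_line)
```

Writing `jsonl_line` plus a newline for each example yields a file in the expected format.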
There are several ways to upload the dataset to the Fireworks platform for fine-tuning: the UI, `firectl`, the REST API, or the builder SDK.
* **UI**: Navigate to the dataset tab, click `Create Dataset`, and follow the wizard.
* **`firectl`**: Upload the dataset with the following command:
```bash theme={null}
firectl dataset create /path/to/file.jsonl
```
With the REST API, you need to make two separate HTTP requests: one to create the dataset entry and one to upload the dataset file. See the full reference at [Create dataset](/api-reference/create-dataset). Note that the client must provide the `exampleCount` parameter.
```jsx theme={null}
// Create Dataset Entry
const createDatasetPayload = {
  datasetId: "trader-poe-sample-data",
  dataset: { userUploaded: {} }
  // Additional params such as exampleCount
};
const urlCreateDataset = `${BASE_URL}/datasets`;
const response = await fetch(urlCreateDataset, {
  method: "POST",
  headers: HEADERS_WITH_CONTENT_TYPE,
  body: JSON.stringify(createDatasetPayload)
});
```
```jsx theme={null}
// Upload JSONL file
const urlUpload = `${BASE_URL}/datasets/${DATASET_ID}:upload`;
const files = new FormData();
files.append("file", localFileInput.files[0]);
const uploadResponse = await fetch(urlUpload, {
  method: "POST",
  headers: HEADERS,
  body: files
});
```
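Since the client has to supply `exampleCount`, it is convenient to compute it from the file before making the create request. As a minimal sketch, assuming one JSON object per non-blank line, the hypothetical `count_examples` helper below counts examples and fails early on a malformed line. The temporary file exists only to make the example self-contained.

```python
import json
import os
import tempfile


def count_examples(path: str) -> int:
    """Count non-blank lines in a JSONL file, checking that each one parses as JSON."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
            count += 1
    return count


# Demo on a temporary file with three examples and one blank line.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"a": 1}\n{"a": 2}\n\n{"a": 3}\n')
    tmp_path = tmp.name

example_count = count_examples(tmp_path)
os.remove(tmp_path)
print(example_count)
```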
All of the above approaches work, but the UI is better suited to smaller datasets (under 500 MB), while `firectl` tends to work better for larger ones.
Ensure the dataset ID conforms to the [resource id restrictions](/getting-started/concepts#resource-names-and-ids).
```bash theme={null}
firectl dpoj create \
--base-model accounts/account-id/models/base-model-id \
--dataset accounts/my-account-id/datasets/my-dataset-id \
--output-model new-model-id
```
For our example, we might run the following command:
```bash theme={null}
firectl dpoj create \
--base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
--dataset accounts/pyroworks/datasets/einstein-dpo \
--output-model einstein-dpo-model
```
This fine-tunes a [Llama 3.1 8b Instruct](https://fireworks.ai/models/fireworks/llama-v3p1-8b-instruct) model with our Einstein dataset.
```bash theme={null}
firectl dpoj get dpo-job-id
```
Once the job is complete, the `STATE` will be set to `JOB_STATE_COMPLETED`, and the fine-tuned model can be deployed.
Once training completes, you can create a deployment to interact with the fine-tuned model. Refer to [deploying a fine-tuned model](/fine-tuning/fine-tuning-models#deploying-a-fine-tuned-model) for more details.
## Next Steps
Explore other fine-tuning methods to improve model output for different use cases.
* Train models on input-output examples to improve task-specific performance.
* Optimize models using AI feedback for complex reasoning and decision-making.
* Fine-tune vision-language models to understand both images and text.
# Agent Tracing
Source: https://docs.fireworks.ai/fine-tuning/environments
Understand where your agent runs and how tracing enables reinforcement fine-tuning
## Why agent tracing is critical to doing RL
Reinforcement learning for agents depends on the entire chain of actions, tool calls, state transitions, and intermediate decisions—not just the final answer. Tracing captures this full trajectory so you can compute reliable rewards, reproduce behavior, and iterate quickly.
**Why it matters**
* **Credit assignment**: You need a complete record of each step to attribute reward to the decisions that caused success or failure.
* **Reproducibility**: Deterministic replays require the exact prompts, model parameters, tool I/O, and environment state.
* **Debuggability**: You can pinpoint where an episode fails (model output, tool error, data mismatch, timeout).
Use Fireworks Tracing to drive the RL loop: emit structured logs with `FireworksTracingHttpHandler`, tag them with rollout correlation metadata, and signal completion using `Status.rollout_finished()` or `Status.rollout_error()`. When you make model calls, use the `model_base_url` issued by the trainer (it points to `https://tracing.fireworks.ai`) so chat completions are recorded as traces via an OpenAI-compatible endpoint.
## How Fireworks tracing works for RFT
* **Traced completions**: The trainer provides a `model_base_url` on `https://tracing.fireworks.ai` that encodes correlation metadata. Your agent uses this OpenAI-compatible URL for LLM calls; tracing.fireworks.ai records the calls as traces automatically.
* **Structured logging sink**: Your agent logs to Fireworks via `FireworksTracingHttpHandler`, including a structured `Status` when a rollout finishes or errors.
* **Join traces and logs**: The trainer polls the logging sink by `rollout_id` to detect completion, then loads the full trace. Logs and traces are deterministically joined using the same correlation tags.
### Correlation metadata
* **Correlate every log and trace** with these metadata fields provided in `/init`: `invocation_id`, `experiment_id`, `rollout_id`, `run_id`, `row_id`.
* **Emit structured completion** from your server logs:
* Add `FireworksTracingHttpHandler` and `RolloutIdFilter` to attach the `rollout_id`
* Log `Status.rollout_finished()` on success, or `Status.rollout_error(message)` on failure
* **Alternative**: If you run one rollout per process, set `EP_ROLLOUT_ID` in the child process instead of adding a filter.
* **Record model calls as traces** by using the `model_base_url` from the trainer. It encodes the correlation IDs so your completions are automatically captured.
### tracing.fireworks.ai base URL
* **Purpose-built for RL**: tracing.fireworks.ai is the Fireworks gateway used during RFT to capture traces and correlate them with rollout status.
* **OpenAI-compatible**: It exposes Chat Completions-compatible endpoints, so you set it as your client's `base_url`.
* **Correlation-aware**: The trainer embeds `rollout_id`, `run_id`, and related IDs into the `model_base_url` path so your completions are automatically tagged and joinable with logs.
* **Drop-in usage**: Always use the `model_base_url` provided in `/init`—do not override it—so traces and logs are correctly linked.
## End-to-end tracing setup with tracing.fireworks.ai
Your server implements `/init` and receives `metadata` and `model_base_url`. Attach `RolloutIdFilter` or set `EP_ROLLOUT_ID` for the current rollout.
Call the model using `model_base_url` so chat completions are persisted as traces with correlation tags.
Attach `FireworksTracingHttpHandler` to your logger and log `Status.rollout_finished()` or `Status.rollout_error()` when the rollout concludes.
The trainer polls Fireworks logs by `rollout_id`, then loads the full traces; logs and traces share the same tags and are joined to finalize results and compute rewards.
### Remote server minimal example
```python remote_server.py theme={null}
import logging
import os

from fastapi import FastAPI  # assumed web framework; any ASGI app with an /init route works
from eval_protocol import InitRequest, Status, FireworksTracingHttpHandler, RolloutIdFilter

app = FastAPI()

# Configure Fireworks logging sink once at startup
logging.getLogger().addHandler(FireworksTracingHttpHandler())


@app.post("/init")
def init(request: InitRequest):
    # Option A: add filter that injects rollout_id on every log record
    logger = logging.getLogger(f"eval.{request.metadata.rollout_id}")
    logger.addFilter(RolloutIdFilter(request.metadata.rollout_id))

    # Option B: per-process correlation (use when spawning one rollout per process)
    # os.environ["EP_ROLLOUT_ID"] = request.metadata.rollout_id

    # Make model calls via the correlated base URL so completions are traced
    # client = YourLLMClient(base_url=request.model_base_url, api_key=request.api_key)

    try:
        # ... execute rollout steps, tool calls, etc. ...
        logger.info("rollout finished", extra={"status": Status.rollout_finished()})
    except Exception as e:
        # Always log a terminal status so the trainer stops polling
        logger.error("rollout error", extra={"status": Status.rollout_error(str(e))})
```
Under the hood, the trainer polls the logging sink for `Status` and then loads the full trace for scoring. Because both logs and traces share the same correlation tags, Fireworks can deterministically join them to finalize results and compute rewards.
### What to capture in a trace
* **Inputs and context**: Task ID, dataset split, initial state, seeds, and any retrieval results provided to the agent.
* **Model calls**: System/user messages, tool messages, model/version, parameters (e.g., temperature, top\_p, seed), token counts, and optional logprobs.
* **Tool and API calls**: Request/response summaries, status codes, durations, retries, and sanitized payload snippets.
* **Environment state transitions**: Key state before/after each action that affects reward or next-step choices.
* **Rewards**: Per-step shaping rewards, terminal reward, and component breakdowns with weights and units.
* **Errors and timeouts**: Exceptions, stack traces, and where they occurred in the trajectory.
* **Artifacts**: Files, code, unit test results, or other outputs needed to verify correctness.
Never record secrets or raw sensitive data in traces. Redact tokens, credentials, and PII. Store references (IDs, hashes) instead of full payloads whenever possible.
### How tracing powers the training loop
1. **Rollout begins**: Trainer creates a rollout and sends it to your environment (local or remote) with a unique identifier.
2. **Agent executes**: Your agent emits spans for model calls, tool calls, and state changes; your evaluator computes step and terminal rewards.
3. **Rewards aggregate**: The trainer consumes your rewards and updates the policy; traces are stored for replay and analysis.
4. **Analyze and iterate**: You filter traces by reward, failure type, latency, or cost to refine prompts, tools, or reward shaping.
### How RemoteRolloutProcessor uses Fireworks Tracing
1. **Remote server logs completion** with structured status: `Status.rollout_finished()` or `Status.rollout_error()`.
2. **Trainer polls Fireworks Tracing** by `rollout_id` until completion status is found.
3. **Status extracted** from structured fields (`code`, `message`, `details`) to finalize the rollout result.
### Best practices
* **Make it deterministic**: Record seeds, versions, and any non-deterministic knobs; prefer idempotent tool calls or cached fixtures in test runs.
* **Keep signals bounded**: Normalize rewards to a consistent range (e.g., \[0, 1]) and document your components and weights.
* **Summarize, don’t dump**: Log compact summaries and references for large payloads to keep traces fast and cheap.
* **Emit heartbeats**: Send periodic status updates so long-running rollouts are observable; always finalize with success or failure.
* **Use consistent schemas**: Keep field names and structures stable to enable dashboards, filters, and automated diagnostics.
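The "keep signals bounded" practice can be sketched as a weighted average of clamped components. This is one possible approach, not the method the trainer uses: `combine_rewards`, the component names, and the weights are all hypothetical.

```python
from typing import Dict


def clamp01(value: float) -> float:
    """Clamp a value into the [0, 1] range."""
    return max(0.0, min(1.0, value))


def combine_rewards(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of clamped reward components, always within [0, 1]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * clamp01(components[name]) for name in weights)
    return clamp01(weighted / total_weight)


# Hypothetical components; "latency" is out of range and gets clamped to 1.0.
components = {"correctness": 1.0, "format": 0.5, "latency": 1.7}
weights = {"correctness": 0.6, "format": 0.3, "latency": 0.1}
reward = combine_rewards(components, weights)
print(round(reward, 2))
```

Logging the individual components alongside the combined reward keeps the breakdown available when filtering traces later.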
## Next steps
* Implement `/init`, tracing, and structured status for remote agents
* Build and deploy a local evaluator in under 10 minutes
* Launch your RFT job
* Design effective reward functions for your task
# Evaluators
Source: https://docs.fireworks.ai/fine-tuning/evaluators
Understand the fundamentals of evaluators and reward functions in reinforcement fine-tuning
An evaluator (also called a reward function) is code that scores model outputs from 0.0 (worst) to 1.0 (best). During reinforcement fine-tuning, your evaluator guides the model toward better responses by providing feedback on its generated outputs.
## Why evaluators matter
Unlike supervised fine-tuning where you provide perfect examples, RFT uses evaluators to define what "good" means. This is powerful because:
* **No perfect data required** - Just prompts and a way to score outputs
* **Encourages exploration** - Models learn strategies, not just patterns
* **Noise tolerant** - Even noisy signals can improve model performance
* **Encodes domain expertise** - Complex rules and logic that are hard to demonstrate with examples
## Anatomy of an evaluator
Every evaluator has three core components:
### 1. Input data
The prompt and any ground truth data needed for evaluation:
```python theme={null}
{
    "messages": [
        {"role": "system", "content": "You are a math tutor."},
        {"role": "user", "content": "What is 15 * 23?"}
    ],
    "ground_truth": "345"  # Optional additional data
}
```
### 2. Model output
The assistant's response to evaluate:
```python theme={null}
{
"role": "assistant",
"content": "Let me calculate that step by step:\n15 * 23 = 345"
}
```
### 3. Scoring logic
Code that compares the output to your criteria:
```python theme={null}
def evaluate(model_output: str, ground_truth: str) -> float:
    # Extract answer from model's response
    predicted = extract_number(model_output)
    # Score it
    if predicted == int(ground_truth):
        return 1.0  # Perfect
    else:
        return 0.0  # Wrong
```
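The scoring snippet above calls `extract_number` without defining it. A minimal sketch of such a helper follows, assuming the final answer is the last integer in the response. The helper's behavior is an assumption made for illustration, not something this guide specifies.

```python
import re
from typing import Optional


def extract_number(text: str) -> Optional[int]:
    """Return the last integer in text, or None if there is none.

    Hypothetical helper: treating the last number as the final answer is
    an assumption that suits step-by-step responses, not a fixed rule.
    """
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None


def evaluate(model_output: str, ground_truth: str) -> float:
    # A response with no number yields None, which never equals the truth.
    predicted = extract_number(model_output)
    return 1.0 if predicted == int(ground_truth) else 0.0


print(evaluate("Let me calculate that step by step:\n15 * 23 = 345", "345"))
```

This version only handles integers; decimal or formatted answers (such as `1,000`) would need a different pattern.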
## Types of evaluators
### Rule-based evaluators
Check if outputs match specific patterns or rules:
* **Exact match** - Output exactly equals expected value
* **Contains** - Output includes required text
* **Regex** - Output matches a pattern
* **Format validation** - Output follows required structure (e.g., valid JSON)
Start with rule-based evaluators. They're simple, fast, and surprisingly effective.
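The four rule types listed above can each be written in a few lines. This is a sketch using only the standard library; the function names are illustrative, and stripping whitespace before the exact-match comparison is an assumption you may not want for your task.

```python
import json
import re


def exact_match(output: str, expected: str) -> float:
    # Assumption: surrounding whitespace is not significant.
    return 1.0 if output.strip() == expected.strip() else 0.0


def contains(output: str, required: str) -> float:
    return 1.0 if required in output else 0.0


def regex_match(output: str, pattern: str) -> float:
    # re.search matches anywhere in the output, not only at the start.
    return 1.0 if re.search(pattern, output) else 0.0


def valid_json(output: str) -> float:
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0
```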
### Execution-based evaluators
Run code or commands to verify correctness:
* **Code execution** - Run generated code and check results
* **Test suites** - Pass generated code through unit tests
* **API calls** - Execute commands and verify outcomes
* **Simulations** - Run agents in environments and measure success
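
As a rough sketch of the code-execution case, the function below runs generated Python in a separate process with a timeout and compares its standard output to an expected value. The function name and the stdout-comparison rule are assumptions for illustration. A subprocess is not a security sandbox, so untrusted code should run in an isolated environment.

```python
import subprocess
import sys


def run_and_score(code: str, expected_stdout: str, timeout_s: float = 5.0) -> float:
    """Run code in a fresh Python process and score its stdout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # for example, an infinite loop
    if result.returncode != 0:
        return 0.0  # syntax error or uncaught exception
    return 1.0 if result.stdout.strip() == expected_stdout.strip() else 0.0
```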
### LLM-as-judge evaluators
Use another model to evaluate quality:
* **Rubric scoring** - Judge outputs against criteria
* **Comparative ranking** - Compare multiple outputs
* **Natural language assessment** - Evaluate subjective qualities like helpfulness
## Scoring guidelines
Your evaluator should return a score between 0.0 and 1.0:
| Score range | Meaning | Example |
| ----------- | ------- | --------------------------- |
| 1.0 | Perfect | Exact correct answer |
| 0.7-0.9 | Good | Right approach, minor error |
| 0.4-0.6 | Partial | Some correct elements |
| 0.1-0.3 | Poor | Wrong but attempted |
| 0.0 | Failure | Completely wrong |
Binary scoring (0.0 or 1.0) works well for many tasks. Use gradual scoring when you can meaningfully distinguish between partial successes.
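One way to produce gradual scores is to combine several component scores with documented weights and clamp the result to the 0.0 to 1.0 range. The component names and weights below are made up for illustration.

```python
from typing import Dict


def weighted_score(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of component scores, clamped to [0.0, 1.0]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # A component missing from `components` counts as 0.0.
    raw = sum(weights[name] * components.get(name, 0.0) for name in weights)
    return max(0.0, min(1.0, raw / total_weight))


# Correct answer but wrong format: most of the credit, not all of it.
score = weighted_score(
    {"correct": 1.0, "format": 0.0},
    {"correct": 0.8, "format": 0.2},
)
```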
## Best practices
Begin with basic evaluation logic and refine over time:
```python theme={null}
# Start here
score = 1.0 if predicted == expected else 0.0
# Then refine if needed
score = calculate_similarity(predicted, expected)
```
Start with the simplest scoring approach that captures your core requirements. You can always add sophistication later based on training results.
Training generates many outputs to evaluate, so performance matters:
* **Cache expensive computations**: Store results of repeated calculations
* **Use timeouts for code execution**: Prevent hanging on infinite loops
* **Batch API calls when possible**: Reduce network overhead
* **Profile slow evaluators and optimize**: Identify and fix bottlenecks
Aim for evaluations that complete in seconds, not minutes. Slow evaluators directly increase training time and cost.
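For the caching suggestion above, `functools.lru_cache` is often enough when the evaluator is a pure function of its string inputs. The similarity function here is a cheap stand-in for an expensive computation and is purely hypothetical.

```python
from functools import lru_cache


def slow_similarity(predicted: str, expected: str) -> float:
    """Stand-in for an expensive comparison: word-level Jaccard overlap."""
    a = set(predicted.lower().split())
    b = set(expected.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


@lru_cache(maxsize=10_000)
def cached_similarity(predicted: str, expected: str) -> float:
    # Identical (predicted, expected) pairs are computed only once.
    return slow_similarity(predicted, expected)
```

Caching only helps when the same output appears more than once, and it is unsafe for evaluators whose result depends on external state.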
Models will generate unexpected outputs, so build robust error handling:
```python theme={null}
try:
    result = execute_code(model_output)
    score = check_result(result)
except TimeoutError:
    score = 0.0  # Code ran too long
except SyntaxError:
    score = 0.0  # Invalid code
except Exception as e:
    score = 0.0  # Any other error
```
Anticipate and gracefully handle malformed outputs, syntax errors, timeouts, and edge cases specific to your domain.
Models will exploit evaluation weaknesses, so design defensively:
**Example: Length exploitation**
If you score outputs by length, the model might generate verbose nonsense. Add constraints:
```python theme={null}
# Bad: Model learns to write long outputs
score = min(len(output) / 1000, 1.0)

# Better: Require correctness AND reasonable length
if is_correct(output):
    score = 1.0 if len(output) < 500 else 0.8
else:
    score = 0.0
```
**Example: Format over substance**
If you only check JSON validity, the model might return valid but wrong JSON. Check content too:
```python theme={null}
# Bad: Only checks format
score = 1.0 if is_valid_json(output) else 0.0

# Better: Check format AND content
if is_valid_json(output):
    data = json.loads(output)
    score = evaluate_content(data)
else:
    score = 0.0
```
Always combine format checks with content validation to prevent models from gaming the system.
## Debugging evaluators
Test your evaluator before training. Look for:
* **Correct scoring** - Good outputs score high, bad outputs score low
* **Reasonable runtime** - Each evaluation completes in reasonable time
* **Clear feedback** - Evaluation reasons explain scores
Run your evaluator on manually created good and bad examples first. If it doesn't score them correctly, fix the evaluator before training.
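A small harness makes this check repeatable. The sketch below assumes an evaluator with the signature `evaluate(output, ground_truth) -> float` and treats 0.5 as the pass threshold; both are assumptions, and `toy_evaluate` is only a placeholder for your own evaluator.

```python
def find_misscored(evaluate, cases, threshold=0.5):
    """Return (output, score) for cases scored on the wrong side of threshold."""
    failures = []
    for output, ground_truth, should_pass in cases:
        score = evaluate(output, ground_truth)
        if (score >= threshold) != should_pass:
            failures.append((output, score))
    return failures


def toy_evaluate(output: str, ground_truth: str) -> float:
    # Placeholder evaluator: full credit if the truth appears in the output.
    return 1.0 if ground_truth in output else 0.0


cases = [
    ("15 * 23 = 345", "345", True),       # known-good output
    ("The answer is 354", "345", False),  # known-bad output
    ("", "345", False),                   # empty output
]
print(find_misscored(toy_evaluate, cases))  # prints []
```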
## Next steps
Connect to your environment for single and multi-turn agents
Follow a complete example building and using an evaluator
# Supervised Fine Tuning - Text
Source: https://docs.fireworks.ai/fine-tuning/fine-tuning-models
This guide will focus on using supervised fine-tuning to fine-tune a model and deploy it to an on-demand (dedicated) deployment, which is the only supported method for serving fine-tuned models.
For the full list of base models supported by managed fine-tuning (SFT, DPO, and RFT) and their max context lengths, see [Managed Fine-Tuning Overview → Supported base models](/fine-tuning/managed-finetuning-intro#supported-base-models).
## Fine-tuning a model using SFT
You can confirm that a base model is available to fine-tune by looking for the `Tunable` tag in the model library or by using:
```bash theme={null}
firectl model get -a fireworks
```
And looking for `Tunable: true`.
Some base models cannot be tuned on Fireworks (`Tunable: false`) but still list support for LoRA (`Supports Lora: true`). This means that users can tune a LoRA for this base model on a separate platform and upload it to Fireworks for inference. Consult [importing fine-tuned models](/models/uploading-custom-models#importing-fine-tuned-models) for more information.
Fireworks uses the **OpenAI-compatible chat completion format** for SFT training data. If you already have datasets formatted for OpenAI fine-tuning, they work on Fireworks with no changes needed.
Datasets must be in JSONL format, where each line represents a complete JSON-formatted training example. Make sure your data conforms to the following restrictions:
* **Minimum examples:** 3
* **Maximum examples:** 3 million per dataset
* **File format:** `.jsonl`
* **Message schema:** Each training sample must include a `messages` array, where each message is an object with the following fields:
  * `role`: one of `system`, `user`, or `assistant`. A message with the `system` role is optional, but if specified, it must be the first message of the conversation
  * `content`: the message content. This can be either a plain string **or** a list of content parts in the OpenAI chat completions style, e.g. `[{"type": "text", "text": "..."}]`. Both forms are accepted, and you can mix them freely across messages and even within the same dataset
  * `weight`: optional; must be either 0 or 1. A message is skipped if its weight is set to 0
* **Sample weight:** Optional key `weight` at the root of the JSON object. It can be any floating point number (positive, negative, or 0) and is used as a loss multiplier for tokens in that sample. If used, this field must be present in all samples in the dataset.
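
Checking a file against these restrictions before uploading can save a failed job. The validator below is a sketch that covers only the rules listed above (roles, system-message position, per-message `weight`, and the type of `content` when present). It is not the validation Fireworks runs, and the function name is made up for illustration.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    messages = example.get("messages") if isinstance(example, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message must come first")
        if "weight" in msg and msg["weight"] not in (0, 1):
            problems.append(f"message {i}: weight must be 0 or 1")
        # content may be absent (e.g. tool-call turns); check type if present
        content = msg.get("content")
        if content is not None and not isinstance(content, (str, list)):
            problems.append(f"message {i}: content must be a string or a list")
    return problems
```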
Here is an example conversation dataset:
```json theme={null}
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "Paris."}
]
}
{
"messages": [
{"role": "user", "content": "What is 1+1?"},
{"role": "assistant", "content": "2", "weight": 0},
{"role": "user", "content": "Now what is 2+2?"},
{"role": "assistant", "content": "4"}
]
}
```
#### OpenAI-style structured content
In addition to plain strings, `content` may also be a list of content parts following the OpenAI chat completions format. For text fine-tuning, use `{"type": "text", "text": "..."}` parts. This is convenient if you already produce data in the OpenAI chat completions shape, or if you generate datasets with the OpenAI SDK. The string form and the list form are equivalent for text models, and you can mix them within the same file (and even within the same conversation):
```json theme={null}
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": [{"type": "text", "text": "What is the capital of France?"}]}, {"role": "assistant", "content": [{"type": "text", "text": "Paris."}]}]}
{"messages": [{"role": "user", "content": [{"type": "text", "text": "What is 1+1?"}]}, {"role": "assistant", "content": [{"type": "text", "text": "2"}], "weight": 0}, {"role": "user", "content": "Now what is 2+2?"}, {"role": "assistant", "content": "4"}]}
{"messages": [{"role": "user", "content": [{"type": "text", "text": "Say hello "}, {"type": "text", "text": "in French."}]}, {"role": "assistant", "content": "Bonjour."}]}
```
All keys you can use with the string form — including the per-message `weight` and `reasoning_content` — work the same way with the list form. When a single message contains multiple text parts (as in the third example above), the parts are concatenated when the chat template is applied. For text-only fine-tuning, only `{"type": "text", ...}` parts are used; image parts are reserved for [vision fine-tuning](/fine-tuning/fine-tuning-vlm).
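To compare the two forms locally, you can flatten list-form content into a string. This sketch assumes text parts are joined with no separator, which is consistent with the trailing space in the third example above but is not a documented guarantee; the function name is illustrative.

```python
def content_to_text(content) -> str:
    """Flatten string or list-of-parts content into a single string."""
    if isinstance(content, str):
        return content
    # Keep only text parts; other part types (such as images) are ignored.
    return "".join(
        part["text"] for part in content if part.get("type") == "text"
    )


parts = [
    {"type": "text", "text": "Say hello "},
    {"type": "text", "text": "in French."},
]
print(content_to_text(parts))  # prints Say hello in French.
```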
Here is an example conversation dataset with sample weights:
```json theme={null}
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "Paris."}
],
"weight": 0.5
}
{
"messages": [
{"role": "user", "content": "What is 1+1?"},
{"role": "assistant", "content": "2", "weight": 0},
{"role": "user", "content": "Now what is 2+2?"},
{"role": "assistant", "content": "4"}
],
"weight": 1.0
}
```
We also support function-calling datasets that include a list of tools. An example looks like this:
```json theme={null}
{
"tools": [
{
"type": "function",
"function": {
"name": "get_car_specs",
"description": "Fetches detailed specifications for a car based on the given trim ID.",
"parameters": {
"trimid": {
"description": "The trim ID of the car for which to retrieve specifications.",
"type": "int",
"default": ""
}
}
}
}
],
"messages": [
{
"role": "user",
"content": "What is the specs of the car with trim 121?"
},
{
"role": "assistant",
"tool_calls": [
{
"type": "function",
"function": {
"name": "get_car_specs",
"arguments": "{\"trimid\": 121}"
}
}
]
}
]
}
```
For models that support thinking (e.g. DeepSeek R1, GPT OSS models, and Qwen3 thinking models), we also support fine-tuning with thinking traces. To do so, include a `reasoning_content` field on assistant turns. Thinking traces are optional, but ideally every assistant turn includes one. For example:
```json theme={null}
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "Paris.", "reasoning_content": "The user is asking about the capital city of France, it should be Paris."}
]
}
{
"messages": [
{"role": "user", "content": "What is 1+1?"},
{"role": "assistant", "content": "2", "weight": 0, "reasoning_content": "The user is asking about the result of 1+1, the answer is 2."},
{"role": "user", "content": "Now what is 2+2?"},
{"role": "assistant", "content": "4", "reasoning_content": "The user is asking about the result of 2+2, the answer should be 4."}
]
}
```
Note that when fine tuning with intermediate thinking traces, the number of total tuned tokens could exceed the number of total tokens in the dataset. This is because we unroll multi-turn conversations into multiple training examples to ensure train-inference consistency.
During inference, a model's thinking traces from previous turns are **not** visible in the conversation history — only the final `content` is retained. To match this behavior during training, we expand each multi-turn conversation into several single-turn training examples, where each example only tunes on one assistant turn and presents the conversation history exactly as it would appear at inference time (i.e., without previous thinking traces).
For example, consider this two-turn dataset entry:
```json theme={null}
{
"messages": [
{"role": "user", "content": "What is 1+1?"},
{"role": "assistant", "content": "2", "reasoning_content": "Simple arithmetic: 1+1=2."},
{"role": "user", "content": "Now what is 2+2?"},
{"role": "assistant", "content": "4", "reasoning_content": "Following up: 2+2=4."}
]
}
```
This gets expanded into two training examples:
**Example 1** — tunes on the first assistant turn:
```json theme={null}
{
"messages": [
{"role": "user", "content": "What is 1+1?"},
{"role": "assistant", "content": "2", "reasoning_content": "Simple arithmetic: 1+1=2."}
]
}
```
**Example 2** — tunes on the second assistant turn, with the first turn's thinking trace stripped to match inference behavior:
```json theme={null}
{
"messages": [
{"role": "user", "content": "What is 1+1?"},
{"role": "assistant", "content": "2"},
{"role": "user", "content": "Now what is 2+2?"},
{"role": "assistant", "content": "4", "reasoning_content": "Following up: 2+2=4."}
]
}
```
Because the conversation context is duplicated across these expanded examples, the total tuned token count will be larger than the raw dataset token count. The expansion grows with the number of assistant turns in each conversation: a conversation with *N* assistant turns produces *N* separate training examples.
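The expansion described above can be sketched as follows. This mirrors the worked example (one training example per assistant turn, with `reasoning_content` removed from earlier turns) and is useful for estimating how much a dataset grows. It is not Fireworks' actual implementation, and it does not account for per-message `weight`.

```python
def expand_conversation(messages):
    """Expand one conversation into one training example per assistant turn."""
    examples = []
    for i, msg in enumerate(messages):
        if msg.get("role") != "assistant":
            continue
        # History is shown as at inference time: no earlier thinking traces.
        history = [
            {k: v for k, v in m.items() if k != "reasoning_content"}
            for m in messages[:i]
        ]
        examples.append({"messages": history + [dict(msg)]})
    return examples


conversation = [
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2", "reasoning_content": "Simple arithmetic: 1+1=2."},
    {"role": "user", "content": "Now what is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": "Following up: 2+2=4."},
]
expanded = expand_conversation(conversation)
print(len(expanded))  # prints 2
```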
There are several ways to upload a dataset to the Fireworks platform for fine-tuning: `firectl`, the `Restful API`, the `builder SDK`, or the `UI`.
* You can simply navigate to the dataset tab, click `Create Dataset` and follow the wizard.
```bash theme={null}
firectl dataset create /path/to/jsonl/file
```
You need to make two separate HTTP requests. One for creating the dataset entry and one for uploading the dataset. Full reference here: [Create dataset](/api-reference/create-dataset). Note that the `exampleCount` parameter needs to be provided by the client.
```jsx theme={null}
// Create Dataset Entry
const createDatasetPayload = {
datasetId: "trader-poe-sample-data",
dataset: { userUploaded: {} }
// Additional params such as exampleCount
};
const urlCreateDataset = `${BASE_URL}/datasets`;
const response = await fetch(urlCreateDataset, {
method: "POST",
headers: HEADERS_WITH_CONTENT_TYPE,
body: JSON.stringify(createDatasetPayload)
});
```
```jsx theme={null}
// Upload JSONL file
const urlUpload = `${BASE_URL}/datasets/${DATASET_ID}:upload`;
const files = new FormData();
files.append("file", localFileInput.files[0]);
const uploadResponse = await fetch(urlUpload, {
method: "POST",
headers: HEADERS,
body: files
});
```
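Because the client must supply `exampleCount`, you need to count the examples yourself before creating the dataset entry. A minimal sketch, assuming one example per non-blank line; the function name is illustrative and the exact type the API expects for `exampleCount` is not covered here.

```python
def count_examples(path: str) -> int:
    """Count non-blank lines in a JSONL file (one example per line)."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```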
All of the above approaches work. The `UI` is better suited to smaller datasets (`< 500MB`), while `firectl` tends to work better for larger ones.
Ensure the dataset ID conforms to the [resource id restrictions](/getting-started/concepts#resource-names-and-ids).
There are also several ways to launch fine-tuning jobs. We highly recommend creating supervised fine-tuning jobs via the `UI`.
Simply navigate to the `Fine-Tuning` tab, click `Fine-Tune a Model` and follow the wizard from there. You can even pick a LoRA model to start the fine-tuning for continued training.
Ensure the fine tuned model ID conforms to the [resource id restrictions](/getting-started/concepts#resource-names-and-ids). This will return a fine-tuning job ID. For a full explanation of the settings available to control the fine-tuning process, including learning rate and epochs, consult [additional SFT job settings](#additional-sft-job-settings).
```bash theme={null}
firectl sftj create --base-model --dataset --output-model
```
As in the UI, you can start tuning from a previously trained LoRA model instead of a base model:
```bash theme={null}
firectl sftj create --warm-start-from --dataset --output-model
```
Notice that we use `--warm-start-from` instead of `--base-model` when creating this job.
With the `UI`, once the job is created, it appears in the list of jobs. Click the job to view its details and monitor its progress.
If the fine-tuned model appears to learn the wrong text or ignore the expected assistant response, use **Render Samples** on the job details page to inspect the rendered token IDs and loss masks. See [Debug SFT tokenization](/fine-tuning/debug-sft-tokenization).
With `firectl`, you can monitor the progress of the tuning job by running
```bash theme={null}
firectl sftj get
```
Once the job successfully completes, you will see the new LoRA model in your model list
```bash theme={null}
firectl model list
```
For a complete Python SDK example that demonstrates the full workflow (creating datasets, uploading files, and launching a supervised fine-tuning job), see the [Python SDK workflow example](https://github.com/fw-ai-external/python-sdk/blob/main/examples/sftj_workflow.py).
## Deploying a fine-tuned model
After fine-tuning completes, deploy your model to make it available for inference:
```bash theme={null}
firectl deployment create
```
This creates a dedicated deployment with performance matching the base model.
For more details on deploying fine-tuned models, including multi-LoRA deployments, see the [Deploying Fine Tuned Models guide](/fine-tuning/deploying-loras).
## Additional SFT job settings
Additional tuning settings are available when starting a fine-tuning job. All of the below settings are optional and will have reasonable defaults if not specified. For settings that affect tuning quality like `epochs` and `learning rate`, we recommend using default settings and only changing hyperparameters if results are not as desired.
By default, the fine-tuning job will run evaluation by running the fine-tuned model against an evaluation set that's created by automatically carving out a portion of your training set. You have the option to explicitly specify a separate evaluation dataset to use instead of carving out training data.
`evaluation_dataset`: The ID of a separate dataset to use for evaluation. Must be pre-uploaded via firectl
```shell theme={null}
firectl sftj create \
--evaluation-dataset my-eval-set \
--base-model MY_BASE_MODEL \
--dataset cancerset \
--output-model my-tuned-model
```
The default context length depends on the size of the model. For most models, the default is at least 32768 tokens, and training examples are truncated at 32768 tokens. You usually do not need to set the max context length unless you hit an out-of-memory error with a higher LoRA rank and a large max context length.
```shell theme={null}
firectl sftj create \
--max-context-length 65536 \
--base-model MY_BASE_MODEL \
--dataset cancerset \
--output-model my-tuned-model
```
Batch size is the number of tokens packed into one forward step during training. One batch could consist of multiple training samples. We do sequence packing on the training samples, and batch size controls how many total tokens will be packed into each batch.
```shell theme={null}
firectl sftj create \
--batch-size 65536 \
--base-model MY_BASE_MODEL \
--dataset cancerset \
--output-model my-tuned-model
```
Epochs are the number of passes over the training data. Our default value is 1. If the model does not follow the training data as much as expected, increase the number of epochs by 1 or 2. Non-integer values are supported.
**Note: we set a max value of 3 million dataset examples × epochs**
```shell theme={null}
firectl sftj create \
--epochs 2.0 \
--base-model MY_BASE_MODEL \
--dataset cancerset \
--output-model my-tuned-model
```
Learning rate controls how fast the model updates from data. We generally do not recommend changing the learning rate. The default value is set automatically based on your selected model.
```shell theme={null}
firectl sftj create \
--learning-rate 0.0001 \
--base-model MY_BASE_MODEL \
--dataset cancerset \
--output-model my-tuned-model
```
Learning rate warmup steps controls the number of training steps during which the learning rate will be linearly ramped up to the set learning rate.
```shell theme={null}
firectl sftj create \
--learning-rate 0.0001 \
--learning-rate-warmup-steps 200 \
--base-model MY_BASE_MODEL \
--dataset cancerset \
--output-model my-tuned-model
```
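A linear warmup can be written as a function of the step number. This sketch assumes steps are counted from 0 and that the rate stays at the target after warmup; the exact schedule Fireworks applies (including any decay after warmup) is not specified here.

```python
def warmup_lr(step: int, target_lr: float, warmup_steps: int) -> float:
    """Learning rate at a given step under linear warmup."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    # Ramp linearly so the last warmup step reaches the target rate.
    return target_lr * (step + 1) / warmup_steps


# With --learning-rate 0.0001 and --learning-rate-warmup-steps 200:
for step in (0, 99, 199, 500):
    print(step, warmup_lr(step, 0.0001, 200))
```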
Gradient accumulation steps controls the number of forward steps and backward steps to take (gradients are accumulated) before optimizer.step() is taken. Gradient accumulation steps > 1 increases effective batch size.
```shell theme={null}
firectl sftj create \
--gradient-accumulation-steps 4 \
--base-model MY_BASE_MODEL \
--dataset cancerset \
--output-model my-tuned-model
```
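Since batch size is measured in tokens per forward step, the effective batch per optimizer update is roughly the product of the two settings. The sketch below gives that upper bound, assuming every packed batch is full; the function name is illustrative.

```python
def tokens_per_optimizer_step(batch_size_tokens: int, grad_accum_steps: int) -> int:
    """Upper bound on tokens contributing to one optimizer update."""
    return batch_size_tokens * grad_accum_steps


# --batch-size 65536 with --gradient-accumulation-steps 4
print(tokens_per_optimizer_step(65536, 4))  # prints 262144
```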
LoRA rank sets the inner dimension of the low-rank adapter matrices, and therefore how many parameters are tuned in your LoRA add-on. Higher LoRA rank increases the amount of information that can be captured while tuning. LoRA rank must be a power of 2 up to 32. Our default value is 8.
```shell theme={null}
firectl sftj create \
--lora-rank 16 \
--base-model MY_BASE_MODEL \
--dataset cancerset \
--output-model my-tuned-model
```
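To see how rank affects adapter size, recall that LoRA adds two matrices to each adapted linear layer: one of shape `rank x d_in` and one of shape `d_out x rank`. The sketch below counts the added parameters for a single layer; the 4096 by 4096 layer is an illustrative size, and which layers Fireworks adapts is not stated in this guide.

```python
def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one linear layer of shape d_out x d_in."""
    # A has shape (rank, d_in); B has shape (d_out, rank).
    return rank * (d_in + d_out)


for rank in (8, 16, 32):
    print(rank, lora_added_params(4096, 4096, rank))
```

Doubling the rank doubles the number of trainable parameters per adapted layer.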
The fine-tuning service integrates with Weights & Biases to provide observability into the tuning process. To use this feature, you must have a Weights & Biases account and have provisioned an API key.
```shell theme={null}
firectl sftj create \
--wandb-entity my-org \
--wandb-api-key xxx \
--wandb-project "My Project" \
--base-model MY_BASE_MODEL \
--dataset cancerset \
--output-model my-tuned-model
```
By default, the fine-tuning job will generate a random unique ID for the model. This ID is used to refer to the model at inference time. You can optionally specify a custom ID, within [ID constraints](/getting-started/concepts#resource-names-and-ids).
```shell theme={null}
firectl sftj create \
--output-model my-model \
--base-model MY_BASE_MODEL \
--dataset cancerset
```
By default, the fine-tuning job will generate a random unique ID for the fine-tuning job. You can optionally choose a custom ID.
```shell theme={null}
firectl sftj create \
--job-id my-fine-tuning-job \
--base-model MY_BASE_MODEL \
--dataset cancerset \
--output-model my-tuned-model
```
## Appendix
* `Python SDK` [references](/tools-sdks/python-sdk)
* `Restful API` [references](/api-reference/introduction)
* `firectl` [references](/tools-sdks/firectl/firectl)
* [Complete Python SDK workflow example](https://github.com/fw-ai-external/python-sdk/blob/main/examples/sftj_workflow.py) for a code-only implementation
# Supervised Fine Tuning - Vision
Source: https://docs.fireworks.ai/fine-tuning/fine-tuning-vlm
Learn how to fine-tune vision-language models on Fireworks AI with image and text datasets
Vision-language model (VLM) fine-tuning allows you to adapt pre-trained models that can understand both text and images to your specific use cases.
This is particularly valuable for tasks like document analysis, visual question answering, image captioning, and domain-specific visual understanding.
To see all vision models that support fine-tuning, visit the [Model Library for vision models](https://app.fireworks.ai/models?filter=vision\&tunable=true).
## Fine-tuning a VLM using LoRA
Vision datasets must be JSONL files in the OpenAI-compatible chat format.
Each line represents a complete training example.
**Dataset Requirements:**
* **Format**: `.jsonl` file
* **Minimum examples**: 3
* **Maximum examples**: 3 million per dataset
* **Images**: Must be base64 encoded with proper MIME type prefixes
* **Supported image formats**: PNG, JPG, JPEG
**Message Schema:**
Each training example must include a `messages` array where each message has:
* `role`: one of `system`, `user`, or `assistant`
* `content`: an array containing text and image objects or just text
### Basic VLM Dataset Example
```json theme={null}
{
"messages": [
{
"role": "system",
"content": "You are a helpful visual assistant that can analyze images and answer questions about them."
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "What objects do you see in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."
}
}
]
},
{
"role": "assistant",
"content": "I can see a red car, a tree, and a blue house in this image."
}
]
}
```
### If your dataset contains image URLs
Images must be base64 encoded with MIME type prefixes. If your dataset contains image URLs, you'll need to download and encode them to base64.
```json theme={null}
{
"type": "image_url",
"image_url": {
// ❌ Raw HTTP/HTTPS URLs are NOT supported
"url": "https://example.com/image.jpg"
}
}
```
```json theme={null}
{
"type": "image_url",
"image_url": {
// ✅ Use data URI with base64 encoding
// Format: data:image/{format};base64,{base64_encoded_data}
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."
}
}
```
You can use the following script to automatically convert your dataset to the correct format:
**Usage:**
```bash theme={null}
# Install required dependency
pip install requests
# Download the script
wget https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/utils/download_images_and_encode_to_b64.py
# Run the script - will output a new dataset _base64.jsonl
python download_images_and_encode_to_b64.py --input_file
```
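If your images are local files rather than URLs, you can build the data URI yourself. This sketch uses only the standard library and infers the MIME type from the file extension; the function name is illustrative, and the PNG/JPEG restriction follows the supported formats listed earlier.

```python
import base64
import mimetypes


def image_to_data_uri(path: str) -> str:
    """Encode a local PNG or JPEG file as a base64 data URI."""
    mime, _ = mimetypes.guess_type(path)
    if mime not in {"image/png", "image/jpeg"}:
        raise ValueError(f"unsupported image type for {path!r}: {mime}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string goes in the `url` field of an `image_url` content part.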
### Advanced Dataset Examples
```json expandable theme={null}
{
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Compare these two images and tell me the differences"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
}
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..."
}
}
]
},
{
"role": "assistant",
"content": "The first image shows a daytime scene while the second shows the same location at night. The lighting and shadows are completely different."
}
]
}
```
```json expandable theme={null}
{
"messages": [
{
"role": "system",
"content": "You are a helpful visual assistant that can analyze images and remember details from previous images in our conversation."
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Can you describe this kitchen layout for me?"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
}
}
]
},
{
"role": "assistant",
"content": "This is a modern open-plan kitchen with white cabinets, granite countertops, and stainless steel appliances. The island is positioned in the center with bar seating for 3-4 people. There's a large window above the sink providing natural light."
},
{
"role": "user",
"content": "Now look at this living room. Do you think the styles would work well together?"
},
{
"role": "assistant",
"content": "I'd be happy to help compare the styles! However, I don't see a living room image in your message. Could you please share the living room photo so I can analyze how well it would coordinate with the modern kitchen style we just discussed?"
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Sorry about that! Here's the living room:"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4BBBSkZJRgABAQ..."
}
}
]
},
{
"role": "assistant",
"content": "Perfect! This living room has a complementary modern aesthetic that would work beautifully with the kitchen. Both spaces feature clean lines and a neutral color palette. The living room's contemporary furniture and the kitchen's sleek design would create a cohesive flow in an open floor plan. The warm wood accents in the living room would add nice contrast to the kitchen's cooler tones."
}
]
}
```
### Try with an Example Dataset
To get a feel for how VLM fine-tuning works, you can use an example vision dataset. This is a classification dataset that contains images of food with `` tags for reasoning.
```bash theme={null}
# Download the example dataset
curl -L -o food_reasoning.jsonl https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl
```
```bash theme={null}
# Download the example dataset
wget https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl
```
Upload your prepared JSONL dataset to Fireworks for training:
```bash theme={null}
firectl dataset create my-vlm-dataset /path/to/vlm_training_data.jsonl
```
Navigate to the Datasets tab in the Fireworks console, click "Create Dataset", and upload your JSONL file through the wizard.
```javascript theme={null}
// Create dataset entry
const createDatasetPayload = {
datasetId: "my-vlm-dataset",
dataset: { userUploaded: {} }
};
const response = await fetch(`${BASE_URL}/datasets`, {
method: "POST",
headers: {
"Authorization": `Bearer ${API_KEY}`,
"Content-Type": "application/json"
},
body: JSON.stringify(createDatasetPayload)
});
// Upload JSONL file
const formData = new FormData();
formData.append("file", fileInput.files[0]);
const uploadResponse = await fetch(`${BASE_URL}/datasets/my-vlm-dataset:upload`, {
method: "POST",
headers: { "Authorization": `Bearer ${API_KEY}` },
body: formData
});
```
For larger datasets (>500MB), use `firectl` as it handles large uploads more reliably than the web interface. For enhanced data control and security, we also support bring your own bucket (BYOB) configurations. See our [Secure Fine Tuning](/fine-tuning/secure-fine-tuning#gcs-bucket-integration) guide for setup details.
Create a supervised fine-tuning job for your VLM:
```bash theme={null}
firectl sftj create \
--base-model accounts/fireworks/models/qwen2p5-vl-32b-instruct \
--dataset my-vlm-dataset \
--output-model my-custom-vlm \
--epochs 3
```
For additional parameters like learning rates, evaluation datasets, and batch sizes, see [Additional SFT job settings](/fine-tuning/fine-tuning-models#additional-sft-job-settings).
1. Navigate to the Fine-tuning tab in the Fireworks console
2. Click "Create Fine-tuning Job"
3. Select your VLM base model (Qwen 2.5 VL)
4. Choose your uploaded dataset
5. Configure training parameters
6. Launch the job
VLM fine-tuning jobs typically take longer than text-only models due to the additional image processing. Expect training times of several hours depending on dataset size and model complexity.
Track your VLM fine-tuning job in the [Fireworks console](https://app.fireworks.ai/dashboard/fine-tuning).
Monitor key metrics:
* **Training loss**: Should generally decrease over time
* **Evaluation loss**: Monitor for overfitting if using evaluation dataset
* **Training progress**: Epochs completed and estimated time remaining
Your VLM fine-tuning job is complete when the status shows `COMPLETED` and your custom model is ready for deployment.
Once training is complete, deploy your custom VLM:
```bash theme={null}
# Create a deployment for your fine-tuned VLM
firectl deployment create my-custom-vlm
# Check deployment status
firectl deployment get accounts/your-account/deployment/deployment-id
```
Deploy from the UI using the `Deploy` dropdown in the fine-tuning job page.
## Advanced Configuration
For additional fine-tuning parameters and advanced settings like custom learning rates, batch sizes, and optimization options, see the [Additional SFT job settings](/fine-tuning/fine-tuning-models#additional-sft-job-settings) section in our comprehensive fine-tuning guide.
Need custom training loops for VLMs? The **Training API** also supports vision-language model fine-tuning with full control over loss functions, training objectives, and evaluation. See [Training API — Vision Inputs](/fine-tuning/training-api/vision-inputs) for details.
## Interactive Tutorials: Fine-tuning VLMs
For a hands-on, step-by-step walkthrough of VLM fine-tuning, we've created two fine-tuning cookbooks that demonstrate the complete process, from dataset preparation through model deployment to evaluation.
**Google Colab Notebook: Fine-tune Qwen2.5 VL on Fireworks AI**
**Finetuning a VLM to beat SOTA closed source model**
The cookbooks above cover the following:
* Setting up your environment with Fireworks CLI
* Preparing vision datasets in the correct format
* Launching and monitoring VLM fine-tuning jobs
* Testing your fine-tuned model
* Best practices for VLM fine-tuning
* Running inference on serverless VLMs
* Running evals to show performance gains
## Testing Your Fine-tuned VLM
After deployment, test your fine-tuned VLM using the same API patterns as base VLMs:
```python Python (OpenAI SDK) theme={null}
import os
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    # Read the API key from the environment instead of hard-coding it
    api_key=os.environ["FIREWORKS_API_KEY"],
)
response = client.chat.completions.create(
    model="accounts/your-account/models/my-custom-vlm",
    messages=[{
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {
                "url": "https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/images/icecream.jpeg"
            },
        },{
            "type": "text",
            "text": "What's in this image?",
        }],
    }]
)
print(response.choices[0].message.content)
```
If you fine-tuned using the example dataset, your model should include `` tags in its response.
# Training Overview
Source: https://docs.fireworks.ai/fine-tuning/finetuning-intro
Fireworks helps you fine-tune models to improve quality and performance for your product use cases, without the burden of building & maintaining your own training infrastructure.
**Coming from OpenAI?** Fireworks uses the same [OpenAI-compatible chat completion format](/fine-tuning/fine-tuning-models#prepare-a-dataset) for training data — the same `messages` array with `role`, `content`, `tool_calls`, and `weight` fields. You can use your existing SFT datasets with no conversion required. See our [OpenAI compatibility guide](/tools-sdks/openai-compatibility) for more details.
## Three ways to fine-tune
Fireworks offers three approaches to fine-tuning, from fully autonomous to fully custom. Pick the one that fits how much control you want:
**Describe what you want in plain English.** Agent picks the base model, prepares the data, sweeps hyperparameters, evaluates, trains, and deploys. You approve a single plan and cost up front.
Best for the fastest path from dataset to deployed fine-tuned model — from the Fireworks dashboard or from inside Claude Code, Cursor, Codex, Aider, or Goose.
**Give Fireworks your data and configuration.** The platform handles scheduling, training, checkpointing, and model output. No custom code required.
Best for teams that want managed SFT, DPO, or RFT with LoRA or full-parameter tuning.
**Write custom Python training loops.** You control the loss function, optimizer step, checkpointing, and weight sync. Fireworks handles the distributed GPU infrastructure.
Best for research teams needing custom loops, custom rollout orchestration, or inference-in-the-loop evaluation.
| | **Fireworks Agent** | **Managed Fine-Tuning** | **Training API** |
| ------------------------------- | ------------------------------------------------------------------------- | -------------------------------------------- | ---------------------------------- |
| **Interface** | Natural language (dashboard chat, `firectl session`, or via coding agent) | UI, `firectl`, REST API | Python script |
| **Who picks the model** | Agent recommends | You | You |
| **Who tunes hyperparameters** | Agent runs a sweep | You set them | You set them |
| **Cost approval** | Built-in gate before any spend | None — you submit jobs directly | None |
| **Tuning method** | Full-parameter or LoRA | Full-parameter or LoRA | Full-parameter or LoRA |
| **Custom loss / training loop** | Not supported | Not supported | Supported |
| **Inference-in-the-loop eval** | Not supported | Not supported | Supported (hotload) |
| **Best for** | Getting a working fine-tuned model fast, without ML expertise | Production fine-tuning with standard methods | Research, custom RL, hybrid losses |
## When to use SFT vs. RFT
In supervised fine-tuning, you provide a dataset with labeled examples of "good" outputs. In reinforcement fine-tuning, you provide a grader function that can be used to score the model's outputs. The model is iteratively trained to produce outputs that maximize this score.
Supervised fine-tuning (SFT) works well for many common scenarios, especially when:
* You have a sizable dataset (\~1000+ examples) with high-quality, ground-truth labels.
* The dataset covers most possible input scenarios.
* Tasks are relatively straightforward, such as:
* Classification
* Content extraction
However, SFT may struggle in situations where:
* Your dataset is small.
* You lack ground-truth outputs (a.k.a. "golden generations").
* The task requires multi-step reasoning.
Here is a simple decision tree:
```mermaid theme={null}
flowchart TD
B{"Do you have labeled ground truth data?"}
B --"Yes"--> C{"How much?"}
C --"more than 1000 examples"--> D["SFT"]
C --"100-1000 examples"-->F{"Does reasoning help?"}
C --"~100s examples"--> E["RFT"]
F --"No"-->D
F -- "Yes" -->E
B --"No"--> G{"Is this a verifiable task (see below)?"}
G -- "Yes" -->E
G -- "No"-->H["RLHF / LLM as judge"]
```
`Verifiable` refers to whether it is relatively easy to make a judgement on the quality of the model generation.
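The flowchart above can be sketched as a small Python function. This is only an illustration of the decision logic: the function name and arguments are hypothetical, and the `~100s examples` branch is read here as "fewer examples than the 100-1000 band", which is an interpretation rather than something the chart states precisely.

```python
def choose_method(num_labeled: int, reasoning_helps: bool = False,
                  verifiable: bool = False) -> str:
    """Suggest a fine-tuning method, following the decision tree above."""
    if num_labeled == 0:
        # No labeled ground truth: the choice depends on whether outputs
        # can be graded easily (a "verifiable" task).
        return "RFT" if verifiable else "RLHF / LLM as judge"
    if num_labeled > 1000:
        return "SFT"
    if num_labeled >= 100:
        # Middle band: the chart asks whether reasoning helps.
        return "RFT" if reasoning_helps else "SFT"
    # Assumption: small labeled sets fall under the "~100s examples" branch.
    return "RFT"
```

For example, `choose_method(500, reasoning_helps=True)` follows the "100-1000 examples" branch and lands on RFT.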
## When to use the Training API instead
Move from managed fine-tuning to the [Training API](/fine-tuning/training-api/introduction) when you need:
* **Custom training logic** — hybrid objectives, custom reward shaping, or a non-standard algorithm beyond managed settings
* **Inference-in-the-loop evaluation** — hotload checkpoints onto a serving deployment and sample mid-training
* **Per-step control** — custom gradient accumulation, dynamic learning rate schedules, or algorithm research
### Detailed capability comparison
| Capability | Managed RFT | Training API |
| ----------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Launch training | CLI or UI | Python script |
| Loss functions | `grpo`, `dapo`, `gspo-token` (built-in) | Any custom loss via `forward_backward_custom` |
| Tuning modes | Full-parameter or LoRA | Full-parameter or LoRA |
| Context length | Full context length supported by the selected training shape | Full context length supported by the selected training shape |
| Training loop | Fully managed | You write the loop |
| Per-step diagnostics | Dashboard (reward, loss, rollouts) | Full Python access to all metrics |
| Zero-variance filtering | Automatic | You implement |
| Checkpoint management | Automatic | You control via `save_weights_for_sampler_ext` |
### Migrating from managed flow to Training API
If you've been using managed RFT and want more control — custom loss functions, richer diagnostics, or algorithm experimentation — the Training API lets you implement your own training loop while keeping the same GPU infrastructure. Managed jobs and cookbook recipes now use the same core tuning capabilities, including LoRA or full-parameter tuning and the full context length supported by the selected training shape.
### MoE models and Routing Replay
For Mixture-of-Experts (MoE) models like Kimi K2 (384 experts), training stability benefits from **Routing Replay** — caching the expert routing assignments from the reference policy's forward pass and replaying them during the training forward pass. This ensures that the same experts process the same tokens in both the reference and policy models, reducing gradient noise from routing changes.
Routing Replay is available in the Training API via the `loss_fn_inputs` mechanism — you can pass routing matrices from the reference forward pass into the training datum. Use the Training API when you need to inspect or customize those forward-pass inputs directly.
# Basics
Source: https://docs.fireworks.ai/fine-tuning/how-rft-works
Understand the reinforcement learning fundamentals behind RFT
## What is reinforcement fine-tuning?
In traditional supervised fine-tuning, you provide a dataset with labeled examples showing exactly what the model should output. In reinforcement fine-tuning, you instead provide:
1. **A dataset**: Prompts, with input examples for the model to respond to
2. **An evaluator**: Code that scores the model's outputs from 0.0 (bad) to 1.0 (good), also known as a reward function
3. **An agent**: An LLM application, with access to tools, APIs, and data needed for your task
During training, the model generates responses to each prompt and receives scores from your reward function. It is then updated to produce outputs that maximize the reward.
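To make the evaluator idea concrete, here is a minimal sketch of a reward function for a structured-output task. It is plain Python and does not use any Fireworks SDK; the function name, the `required_keys` argument, and the partial-credit rule are all assumptions chosen for illustration.

```python
import json


def score_json_output(output: str, required_keys: set[str]) -> float:
    """Return a reward in [0.0, 1.0] for a model output expected to be a JSON object."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # Not valid JSON at all
    if not isinstance(parsed, dict):
        return 0.0  # Valid JSON, but not an object
    if not required_keys:
        return 1.0  # Nothing further to check
    # Partial credit: fraction of the required keys that are present.
    present = sum(1 for key in required_keys if key in parsed)
    return present / len(required_keys)
```

Returning partial credit instead of a strict 0 or 1 gives rollouts for the same prompt a chance to receive different scores, which is what the training signal depends on.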
## Use cases
Reinforcement fine-tuning helps you train models to excel at:
* **Code generation and analysis** - Writing and debugging functions with verifiable execution results or test outcomes
* **Structured output generation** - JSON formatting, data extraction, classification, and schema compliance with programmatic validation
* **Domain-specific reasoning** - Legal analysis, financial modeling, or medical triage with verifiable criteria and compliance checks
* **Tool-using agents** - Multi-step workflows where agents call external APIs with measurable success criteria
## How it works
Define how you'll score model outputs from 0 to 1. For example, score outputs higher when your agent calls the right tools, or when your LLM-as-judge rates the output highly.
Create a JSONL file with prompts (system and user messages). These will be used to generate rollouts during training.
Train locally, or connect your agent as a remote server to Fireworks with our /init and /status endpoints.
Create an RFT job via the UI or CLI. Fireworks orchestrates rollouts, evaluates them, and trains the model to maximize reward.
Once training completes, deploy your fine-tuned LoRA model to production with an on-demand deployment.
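The dataset step above can be sketched as follows. This assumes each JSONL line is an object with a `messages` array of `role`/`content` entries, matching the OpenAI-compatible chat format these docs describe; the helper name and the example prompts are hypothetical.

```python
import json


def build_prompt_jsonl(rows) -> str:
    """Serialize (system_prompt, user_prompt) pairs as JSONL text, one record per line."""
    lines = []
    for system_prompt, user_prompt in rows:
        record = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ]
        }
        lines.append(json.dumps(record))
    # One JSON object per line, each terminated by a newline.
    return "".join(line + "\n" for line in lines)
```

Writing the result to disk is then a single call such as `pathlib.Path("prompts.jsonl").write_text(build_prompt_jsonl(rows), encoding="utf-8")`.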
### RFT works best when:
1. You can determine whether a model's output is "good" or "bad," even if only approximately
2. You have prompts but lack perfect "golden" completions to learn from
3. The task requires multi-step reasoning where evaluating intermediate steps is hard
4. You want the model to explore creative solutions beyond your training examples
## Next steps
Learn how to design effective reward functions
Learn how to launch and configure RFT jobs
# Managed Fine-Tuning Overview
Source: https://docs.fireworks.ai/fine-tuning/managed-finetuning-intro
Fine-tune models with Fireworks-managed infrastructure — no custom code required.
Give Fireworks your data and configuration. The platform handles scheduling, training, checkpointing, and model output. Training data uses the **OpenAI-compatible chat completion format**, so existing OpenAI SFT datasets work with no conversion required.
## Methods
Train text models with labeled examples of desired outputs
Train vision-language models with image and text pairs
Align models with human preferences using pairwise comparisons
Train models using custom reward functions for complex reasoning tasks
## Free Reinforcement Fine-Tuning
**Reinforcement Fine-Tuning (RFT)** is free for models under 16B parameters. When creating an RFT job in the UI, filter for free tuning models in the model selection area on the [fine-tuning creation page](https://app.fireworks.ai/dashboard/fine-tuning/create). If kicking off jobs from the terminal, you can find the model ID from the [Model Library](https://app.fireworks.ai/models?filter=LLM\&tunable=true). Note: SFT and DPO jobs are billed per training token for all model sizes—see the [pricing page](https://fireworks.ai/pricing) for details.
When creating a **Reinforcement Fine-Tuning** job in the UI, look for the "Free tuning" filter in the model selection area.
For SFT and DPO pricing, see the [pricing page](https://fireworks.ai/pricing).
## Supported base models
Fireworks supports fine-tuning for most major open source models, including DeepSeek, Qwen, Kimi, Gemma, GLM, and Llama families. The same set of base models is available for SFT, DPO, and RFT — once a base model is supported, every managed fine-tuning method works against it.
The table below is generated from the live training shape registry. The "Max supported context length" is the largest `max_supported_context_length` across all training shapes registered for that base model — use it as the upper bound when you set a per-job context length on `firectl sftj create`, `firectl dpoj create`, or RFT job creation.
| Base model | Max supported context length |
| ------------------------------- | ---------------------------- |
| `gemma-4-26b-a4b-it` | 256K (262,144 tokens) |
| `gemma-4-31b-it` | 256K (262,144 tokens) |
| `glm-5p1` | 200K (200,000 tokens) |
| `kimi-k2p5` | 256K (262,144 tokens) |
| `kimi-k2p6` | 256K (262,144 tokens) |
| `llama-v3p3-70b-instruct` | 128K (131,072 tokens) |
| `minimax-m2p5` | 192K (196,608 tokens) |
| `nemotron-nano-3-30b-a3b` | 256K (262,144 tokens) |
| `qwen3-235b-a22b-instruct-2507` | 128K (128,000 tokens) |
| `qwen3-30b-a3b` | 128K (131,072 tokens) |
| `qwen3-30b-a3b-instruct-2507` | 128K (128,000 tokens) |
| `qwen3-32b` | 128K (131,072 tokens) |
| `qwen3-4b` | 64K (65,536 tokens) |
| `qwen3-8b` | 256K (256,000 tokens) |
| `qwen3-vl-8b-instruct` | 256K (262,144 tokens) |
| `qwen3p5-27b` | 256K (262,144 tokens) |
| `qwen3p5-35b-a3b` | 256K (262,144 tokens) |
| `qwen3p5-397b-a17b` | 256K (262,144 tokens) |
| `qwen3p5-9b` | 256K (262,144 tokens) |
| `qwen3p6-27b` | 256K (262,144 tokens) |
To browse the broader catalog (including non-tunable inference models), visit the [Model Library for text models](https://app.fireworks.ai/models?filter=LLM\&tunable=true) or [vision models](https://app.fireworks.ai/models?filter=vision\&tunable=true).
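Since the table gives an upper bound for the per-job context length, a pre-flight check before submitting a job might look like the sketch below. The dictionary copies a few rows from the table above; the helper itself is hypothetical and not part of `firectl` or any Fireworks SDK.

```python
# A few limits copied from the table above (tokens); extend as needed.
MAX_CONTEXT_TOKENS = {
    "llama-v3p3-70b-instruct": 131_072,
    "qwen3-4b": 65_536,
    "qwen3-32b": 131_072,
    "kimi-k2p5": 262_144,
}


def validate_context_length(base_model: str, requested_tokens: int) -> int:
    """Return requested_tokens if it fits the model's listed limit, else raise."""
    limit = MAX_CONTEXT_TOKENS.get(base_model)
    if limit is None:
        raise KeyError(f"No context limit recorded for {base_model!r}")
    if not 0 < requested_tokens <= limit:
        raise ValueError(
            f"{requested_tokens} tokens is outside the range 1..{limit} for {base_model}"
        )
    return requested_tokens
```

Because the table is generated from a live registry, hard-coded limits like these can go stale and should be refreshed from the current documentation.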
## Tuning modes and context length
Managed fine-tuning supports both **[Low-Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685)** and full-parameter tuning, depending on the model, method, and selected training shape. It also supports the full context lengths exposed by the available training shapes, matching the same long-context capabilities used by cookbook recipes.
Choose LoRA when you want efficient adapter training and flexible deployment, including [multiple LoRAs](/fine-tuning/deploying-loras#multi-lora-deployment) on a single base model deployment. Choose full-parameter tuning when you need to update all model weights for difficult reasoning, alignment, or domain adaptation tasks.
**Deprecation notice:** The `deployedModel` request key for routing to LoRA addons is deprecated and will not be supported for any new deployments. Please migrate to the `model` field with the `#` format described in [Routing requests to LoRA addons](/fine-tuning/deploying-loras#routing-requests-to-lora-addons).
# Monitor Training
Source: https://docs.fireworks.ai/fine-tuning/monitor-training
Track RFT job progress and diagnose issues in real-time
Once your RFT job is running, the Fireworks dashboard provides comprehensive monitoring tools to track progress, inspect individual rollouts, and debug issues as they arise.
## Accessing the monitoring dashboard
After creating your RFT job, you'll receive a dashboard link in the CLI output:
```
Dashboard Links:
RFT Job: https://app.fireworks.ai/dashboard/fine-tuning/reinforcement/abc123
```
Click this link or navigate manually:
1. Go to [Fireworks Dashboard](https://app.fireworks.ai)
2. Click **Fine-Tuning** in the sidebar
3. Select your job from the list
## Understanding the overview
The main dashboard shows your job's current state and key metrics.
### Job status
Your job is queued waiting for GPU resources. Queue time depends on current demand and your account priority.
**Action**: None needed. Job will start automatically when resources become available.
Fireworks is validating your dataset to ensure it meets format requirements and quality standards.
**Duration**: Typically 1-2 minutes
**Action**: None needed. If validation fails, you'll receive specific error messages about issues in your dataset.
Training is actively in progress. Rollouts are being generated, evaluated, and the model is learning.
**Action**: Monitor metrics and rollout quality. This is when you'll watch reward curves improve.
Training finished successfully. Your fine-tuned model is ready for deployment.
**Action**: Review final metrics, then [deploy your model](/fine-tuning/deploying-loras).
Training encountered an unrecoverable error and stopped.
**Action**: Check error logs and troubleshooting section below. Common causes include evaluator errors, resource limits, or dataset issues.
You or another user manually stopped the job.
**Action**: Review partial results if needed. Create a new job to continue training.
Training stopped automatically because the full epoch showed no improvement. All rollouts received the same scores, indicating no training progress.
**Action**: This typically indicates an issue with your evaluator or training setup. Check that:
* Your evaluator is returning varied scores (not all 0s or all 1s)
* The reward function can distinguish between good and bad outputs
* The model is actually generating different responses
Review the troubleshooting section below for common causes.
### Key metrics at a glance
The overview panel displays:
* **Elapsed time**: How long the job has been running
* **Progress**: Current epoch and step counts
* **Reward**: Latest mean reward from rollouts
* **Model**: Base model and output model names
## Training metrics
### Reward curves
The most important metric in RFT is the reward curve, which shows how well your model is performing over time.
**What to look for**:
* **Upward trend** - Model is learning and improving
* **Plateauing** - Model may have converged; consider stopping or adjusting parameters
* **Decline** - Potential issue with evaluator or training instability
* **Spikes** - Could indicate noisy rewards or outliers in evaluation
Healthy training shows steady reward improvement. Don't worry about minor fluctuations—focus on the overall trend.
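If you track mean reward per step yourself, the "overall trend" can be approximated by comparing two adjacent moving windows. This is a rough heuristic sketch, not how the dashboard computes anything; the window size and tolerance are arbitrary assumptions you would tune.

```python
from statistics import mean


def classify_reward_trend(rewards, window: int = 5, tolerance: float = 0.01) -> str:
    """Compare the latest window of mean rewards against the window before it."""
    if len(rewards) < 2 * window:
        return "insufficient data"
    earlier = mean(rewards[-2 * window:-window])
    recent = mean(rewards[-window:])
    delta = recent - earlier
    if delta > tolerance:
        return "improving"
    if delta < -tolerance:
        return "declining"
    return "plateau"
```

Averaging over a window smooths out the minor step-to-step fluctuations mentioned above, so a single spike does not flip the label.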
### Training loss
Loss measures how well the model is fitting the training data:
* **Decreasing loss** - Normal learning behavior
* **Increasing loss** - Learning rate may be too high
* **Flat loss** - Model may not be learning; check evaluator rewards
### Evaluation metrics
If you provided an evaluation dataset, you'll see validation metrics:
* **Eval reward**: Model performance on held-out data
* **Generalization gap**: Difference between training and eval rewards
Large gaps between training and eval rewards suggest overfitting. Consider reducing epochs or adding more diverse training data.
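The generalization gap is simple arithmetic, sketched below. The 0.1 threshold is an illustrative assumption, not a documented Fireworks value; what counts as a "large" gap depends on your task and reward scale.

```python
def generalization_gap(train_reward: float, eval_reward: float) -> float:
    """Difference between training reward and held-out eval reward."""
    return train_reward - eval_reward


def looks_overfit(train_reward: float, eval_reward: float, max_gap: float = 0.1) -> bool:
    """Flag runs whose training reward exceeds eval reward by more than max_gap."""
    return generalization_gap(train_reward, eval_reward) > max_gap
```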
## Inspecting rollouts
Understanding individual rollouts helps you verify your evaluator is working correctly and identify quality issues.
### Rollout overview table
Click any **Epoch** in the training timeline, then click the **table icon** to view all rollouts for that step.
The table shows:
* **Row ID**: Unique identifier for each dataset row used in this rollout
* **Prompt**: The input prompt sent to the model
* **Messages**: The model's generated response messages
* **Valid**: Whether the rollout completed successfully without errors
* **Reason**: Explanation if the rollout failed or was marked invalid
* **Score**: Reward score assigned by your evaluator (0.0 to 1.0)
**What to check**:
* Most rollouts should succeed (status: complete)
* The reward distribution should make sense (high for good outputs, low for bad)
* Many failures indicate evaluator issues
* Identical rewards across all rollouts may indicate that the evaluator is broken
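The same checks can be run programmatically over rollout data you have copied or downloaded. The sketch below assumes each rollout is a dict with `valid` and `score` fields mirroring the **Valid** and **Score** columns; that record shape is an assumption, not a documented export format.

```python
def summarize_rollouts(rollouts) -> dict:
    """Summarize validity and score diversity for a list of rollout records."""
    total = len(rollouts)
    if total == 0:
        return {"total": 0, "valid_fraction": 0.0, "all_scores_identical": False}
    valid = [r for r in rollouts if r.get("valid")]
    scores = [r["score"] for r in valid if r.get("score") is not None]
    return {
        "total": total,
        "valid_fraction": len(valid) / total,
        # Identical scores across several rollouts carry no training signal.
        "all_scores_identical": len(scores) > 1 and len(set(scores)) == 1,
    }
```

A low `valid_fraction` points toward evaluator or timeout problems, while `all_scores_identical` being true matches the "evaluator may be broken" warning above.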
### Individual rollout details
Click any row in the rollout table to see full details:
You'll see:
1. **Full prompt**: Exact messages sent to the model
2. **Model response**: Complete generated output
3. **Evaluation result**: Reward score and reasoning (if provided)
4. **Metadata**: Token counts, timing, temperature settings
5. **Tool calls**: For agentic rollouts with function calling
Copy and paste model outputs to test them manually. For example, if you're training a code generator, try running the generated code yourself to verify your evaluator is scoring correctly.
### Quality spot checks
Regularly inspect rollouts at different stages of training:
**Early training (first epoch)**:
* Verify evaluator is working correctly
* Check that high-reward rollouts are actually good
* Ensure low-reward rollouts are actually bad
**Mid-training**:
* Confirm model quality is improving
* Look for new strategies or behaviors emerging
* Check that evaluator isn't being gamed
**Late training**:
* Verify final model quality meets your standards
* Check for signs of overfitting (memorizing training data)
* Ensure diversity in responses (not all identical)
## Live logs
Real-time logs show what's happening inside your training job.
### Accessing logs
Click the **Logs icon** next to the table icon to view real-time logs for your training job.
### Using logs for debugging
When things go wrong, logs are your first stop:
1. **Filter by error level**: Focus on `[ERROR]` and `[WARNING]` messages
2. **Search for rollout IDs**: Track specific rollouts through their lifecycle
3. **Look for patterns**: Repeated errors indicate systematic issues
4. **Check timestamps**: Correlate errors with metric changes
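If you copy log text out of the dashboard, the first two steps above can be sketched as a small filter. Only the `[ERROR]` and `[WARNING]` markers come from this page; the assumption that a rollout ID appears verbatim somewhere in the line is hypothetical, as is the helper itself.

```python
def filter_log_lines(lines, levels=("[ERROR]", "[WARNING]"), rollout_id=None):
    """Keep lines containing one of the level markers and, optionally, a rollout ID."""
    selected = []
    for line in lines:
        if not any(level in line for level in levels):
            continue  # Skip lines at other log levels
        if rollout_id is not None and rollout_id not in line:
            continue  # Skip lines about other rollouts
        selected.append(line)
    return selected
```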
## Training diagnostics
### Available in the managed flow
The managed RFT dashboard provides:
* **Reward curves:** Mean reward over training steps
* **Training loss:** Policy loss over time
* **Rollout inspection:** Individual rollouts with scores, messages, and metadata
### Traces page
The **Traces** page in the Fireworks dashboard provides per-rollout execution traces, including timing, token counts, and evaluation results. Trace data can be downloaded for offline analysis using the download button on the Traces page.
### Metrics not directly surfaced
The following diagnostics are not directly surfaced in the managed RFT dashboard today:
* **Filtering rates:** How many zero-variance groups were dropped per iteration
* **Effective batch size:** Actual number of training groups after filtering
* **Advantage magnitude and distribution:** Per-step advantage statistics
* **KL divergence:** Distance between the current policy and the reference model
* **Per-token importance sampling ratios:** Clipping frequency and magnitude
These metrics can be partially inferred from trace data and rollout inspection. For richer per-step diagnostics, consider using the [Training API](/fine-tuning/training-api/introduction), which gives you full Python control over the training loop and allows you to log any metric you need.
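As a rough illustration of inferring some of these diagnostics offline, the sketch below takes rewards grouped by prompt and estimates the filtering rate, the effective batch size, and the advantage magnitude. It assumes a group-relative normalization, (reward - group mean) / group standard deviation, as commonly described for GRPO-style methods; whether the managed flow computes advantages exactly this way is not confirmed here, and the input layout is hypothetical.

```python
from statistics import mean, pstdev


def group_diagnostics(groups, eps: float = 1e-6) -> dict:
    """Estimate filtering rate, effective batch size, and mean |advantage|.

    `groups` is a list of reward lists, one list per prompt.
    """
    # A group with (near) zero variance carries no learning signal.
    kept = [g for g in groups if len(g) > 1 and pstdev(g) > eps]
    dropped = len(groups) - len(kept)
    advantages = []
    for g in kept:
        mu, sigma = mean(g), pstdev(g)
        advantages.extend((r - mu) / (sigma + eps) for r in g)
    return {
        "filtering_rate": dropped / len(groups) if groups else 0.0,
        "effective_batch_size": len(kept),
        "mean_abs_advantage": mean([abs(a) for a in advantages]) if advantages else 0.0,
    }
```

A filtering rate close to 1.0 means almost every prompt produced identical scores, which lines up with the early-stopping condition described earlier on this page.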
## Common issues and solutions
**Symptoms**: Reward curve flat or very low throughout training
**Possible causes**:
* Evaluator always returning 0 or very low scores
* Model outputs not matching expected format
* Task too difficult for base model
**Solutions**:
1. Inspect rollouts to verify evaluator is working:
* Check that some rollouts get high rewards
* Verify reward logic makes sense
2. Test evaluator locally on known good/bad outputs
3. Simplify the task or provide more examples
4. Try a stronger base model
**Symptoms**: Reward increases then crashes and stays low
**Possible causes**:
* Learning rate too high causing training instability
* Model found an exploit in the evaluator (reward hacking)
* Catastrophic forgetting
**Solutions**:
1. Stop training and use the last good checkpoint
2. Restart with lower learning rate (e.g., `--learning-rate 5e-5`)
3. Review recent rollouts for reward hacking behavior
4. Improve evaluator to be more robust
**Symptoms**: Rollout table shows lots of errors or timeouts
**Possible causes**:
* Evaluator code errors
* Timeout too short for evaluation
* External API failures (for remote evaluators)
* Resource exhaustion
**Solutions**:
1. Check error logs for specific error messages
2. Test evaluator locally to reproduce errors
3. Increase `--rollout-timeout` if evaluations need more time
4. Add better error handling in evaluator code
5. For remote evaluators: check server health and logs
**Symptoms**: Loss goes up instead of down
**Possible causes**:
* Learning rate too high
* Conflicting reward signals
* Numerical instability
**Solutions**:
1. Reduce learning rate by 2-5x
2. Check that rewards are consistent (same prompt gets similar rewards)
3. Verify rewards are in valid range \[0, 1]
4. Consider reducing batch size
**Symptoms**: Model generates the same response for every prompt
**Possible causes**:
* Temperature too low (near 0)
* Model found one high-reward response and overfit to it
* Evaluator only rewards one specific output
**Solutions**:
1. Increase `--temperature` to 0.8-1.0
2. Make evaluator more flexible to accept diverse good answers
3. Use more diverse prompts in training data
4. Reduce epochs to prevent overfitting
**Symptoms**: Many rollouts timing out with remote environment
**Possible causes**:
* Remote server slow or overloaded
* Network latency issues
* Evaluator not logging completion correctly
**Solutions**:
1. Check remote server logs for errors
2. Verify server is logging `Status.rollout_finished()`
3. Increase `--rollout-timeout` to allow more time
4. Scale remote server to handle concurrent requests
5. Optimize evaluator code for performance
## Performance optimization
### Speeding up training
If training is slower than expected:
**Slow evaluators directly increase training time**:
* Profile your evaluator code to find bottlenecks
* Cache expensive computations
* Use batch processing for API calls
* Add timeouts to prevent hanging
**For remote evaluators**:
* Add more worker instances to handle concurrent rollouts
* Use faster machines (more CPU, memory)
* Optimize network connectivity to Fireworks
Target: Evaluations should complete in 1-5 seconds per rollout.
**Reduce compute while maintaining quality**:
* Decrease `--n` (e.g., from 8 to 4 rollouts per prompt)
* Reduce `--max-tokens` if responses don't need to be long
* Lower temperature slightly to speed up sampling
Caution: Too few rollouts (n \< 4) may hurt training quality.
### Cost optimization
Reduce costs without sacrificing too much quality:
1. **Start small**: Experiment with `qwen3-0p6b` before scaling to larger models
2. **Reduce rollouts**: Use `--n 4` instead of 8
3. **Shorter responses**: Lower `--max-tokens` to minimum needed
4. **Fewer epochs**: Start with 1 epoch, only add more if needed
5. **Efficient evaluators**: Minimize API calls and computation
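To see how these knobs interact, a back-of-envelope estimate of rollout volume can help. The sketch assumes every epoch generates `n` rollouts for each prompt and that `max_tokens` caps each response, so the token figure is an upper bound rather than a bill; the function name is hypothetical.

```python
def rollout_budget(num_prompts: int, n: int, max_tokens: int, epochs: int = 1) -> dict:
    """Upper-bound estimate of rollouts and generated tokens for a job."""
    rollouts = num_prompts * n * epochs
    return {
        "rollouts": rollouts,
        # Each rollout generates at most max_tokens tokens.
        "max_generated_tokens": rollouts * max_tokens,
    }
```

With 1,000 prompts and `max_tokens` of 2,048, dropping `n` from 8 to 4 halves the bound from 16,384,000 to 8,192,000 generated tokens per epoch.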
## Stopping and resuming jobs
### Stopping a running job
If you need to stop training:
1. Click **Cancel Job** in the dashboard
2. Or via CLI:
```bash theme={null}
firectl rftj delete
```
The model state at the last checkpoint is saved and can be deployed.
Cancelled jobs cannot be resumed. If you want to continue training, create a new job starting from the last checkpoint.
### Using checkpoints
Checkpoints are automatically saved during training. To continue from a checkpoint:
```bash theme={null}
eval-protocol create rft \
--warm-start-from accounts/your-account/models/previous-checkpoint \
--output-model continued-training
```
This is useful for:
* Extending training after early stopping
* Trying different hyperparameters on a trained model
* Building on previous successful training runs
## Comparing multiple jobs
Running multiple experiments? Compare them side-by-side:
1. Navigate to **Fine-Tuning** dashboard
2. Select multiple jobs using checkboxes
3. Click **Compare**
This shows:
* Reward curves overlaid on same graph
* Parameter differences highlighted
* Final metrics comparison
* Training time and cost comparison
Use consistent naming for experiments (e.g., `math-lr-1e4`, `math-lr-5e5`) to make comparisons easier.
## Exporting metrics
For deeper analysis or paper writing:
### Via dashboard
1. Click **Export** button in job view
2. Choose a format: CSV or JSON
3. Select metrics to export (rewards, loss, rollout data)
### Via API
```python theme={null}
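import os
import requests
# Placeholders: replace with your own account ID and RFT job ID
account = "your-account"
job_id = "your-job-id"
api_key = os.environ["FIREWORKS_API_KEY"]
response = requests.get(
    f"https://api.fireworks.ai/v1/accounts/{account}/reinforcementFineTuningJobs/{job_id}/metrics",
    headers={"Authorization": f"Bearer {api_key}"},
)
response.raise_for_status()  # Surface HTTP errors instead of parsing an error body
metrics = response.json()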
```
### Weights & Biases integration
If you enabled W\&B when creating the job:
```bash theme={null}
eval-protocol create rft \
--wandb-project my-experiments \
--wandb-entity my-org \
...
```
All metrics automatically sync to W\&B for advanced analysis, comparison, and sharing.
## Best practices
Check your job within the first 15-30 minutes of training:
* Verify evaluator is working correctly
* Confirm rewards are in expected range
* Catch configuration errors early
Don't wait until training completes to discover issues.
Every few epochs, inspect 5-10 random rollouts:
* Manually verify high-reward outputs are actually good
* Check low-reward outputs are actually bad
* Look for unexpected model behaviors
This catches evaluator bugs and reward hacking.
When you find good hyperparameters, save the command:
```bash theme={null}
# Save to file for reproducibility
echo "eval-protocol create rft --base-model ... --learning-rate 5e-5 ..." > best_config.sh
```
This makes it easy to reproduce results or share them with your team.
Name jobs descriptively:
* Good: `math-solver-llama8b-temp08-n8`
* Bad: `test1`, `try2`, `final-final`
Future you will thank you when comparing experiments.
Keep notes on what worked and what didn't:
* Hypothesis for each experiment
* Parameters changed
* Results and insights
* Next steps
Build institutional knowledge for your team.
## Next steps
Once training completes, deploy your fine-tuned model for inference
Learn how to adjust parameters for better results
Improve your reward functions based on training insights
Start a new experiment using the CLI
# Price comparison vs Tinker
Source: https://docs.fireworks.ai/fine-tuning/multi-turn-cost-comparison
Estimate the cost of multi-turn agentic RL rollouts on Fireworks compared to Tinker's per-token pricing
If you're running RL or agentic post-training on a long-context model and your
provider bills you per token with **no cross-turn prefix cache**, the prefill
cost grows quadratically with the number of turns — every turn re-prefills the
full conversation history. On Fireworks Dedicated, session-affinity routing
keeps an episode pinned to one replica so the KV cache is reused across turns,
and cached prompt tokens contribute essentially zero extra compute.
The calculator below makes that difference concrete. Set your episode shape
(turns, context growth, generation length) and compare:
* **Tinker** — flat per-token billing, no cross-turn cache (re-prefill every turn)
* **Fireworks Dedicated** — on-demand GPU-hour billing; the cache savings show up as more work per hour, not as a discounted token rate
## Performance and benchmarking notes
### Dedicated trainer vs pooled/serverless resourcing
Tinker runs training jobs on a **pooled/serverless** GPU fleet, which lets a
single job burst onto many more GPUs than you would dedicate to a replica on
Fireworks. That burst is what makes individual Tinker steps feel fast — but it
also **caps the maximum training speed you can buy**: you cannot pay to scale
beyond the pool's per-job allocation, and you cannot reserve isolated capacity.
Fireworks dedicated trainers take the opposite trade-off: predictable,
isolated execution with no shared-pool queueing or noisy-neighbor variance,
and the ability to scale **wall-clock time and cost independently** by
adjusting replica count. If you want faster steps on dedicated, increase
replica count and parallelize work.
For **large model training or longer rollouts**, we have consistently found
that a dedicated setup like ours is **cheaper overall and can also be faster**,
depending on the customer's resourcing needs.
### Context-length benchmarking caveat
Benchmark comparisons are only apples-to-apples when truncation policy and
effective context length are matched. If one system truncates `>32k` samples
and another does not, the non-truncating run is doing more work and will
appear slower.
### Replica count is a speed/cost knob
Users can trade cost and wall-clock time by scaling replicas. A quick
back-of-envelope estimate:
$$
\text{\$ / 1M tokens} \approx \frac{\text{GPU count} \cdot \text{\$ / GPU-hour}}{\text{tokens/sec(cluster)} \cdot 3600} \cdot 10^6
$$
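As a rough illustration, the estimate above can be written as a small Python helper. The function name and the example inputs (8 GPUs, 4.00 USD per GPU-hour, 20,000 tokens/sec) are placeholders chosen for this sketch, not Fireworks pricing or measured throughput.

```python
def dollars_per_million_tokens(gpu_count: int, usd_per_gpu_hour: float,
                               cluster_tokens_per_sec: float) -> float:
    """Back-of-envelope USD per 1M tokens for a cluster billed per GPU-hour."""
    cluster_usd_per_hour = gpu_count * usd_per_gpu_hour
    tokens_per_hour = cluster_tokens_per_sec * 3600
    return cluster_usd_per_hour / tokens_per_hour * 1e6


# Hypothetical inputs, for illustration only.
print(round(dollars_per_million_tokens(8, 4.00, 20_000), 4))
```

Doubling cluster throughput at a fixed GPU count halves the per-token cost, which is why utilization matters so much for GPU-hour billing.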
## How the numbers come together
### Tinker (the cost customers describe)
Each turn re-prefills the full accumulated context:
$$
\text{Prefill tokens (Tinker)} = \sum_{t=1}^{T} P_t = T \cdot P_1 + \Delta \cdot \frac{T(T-1)}{2}
$$
…where $P_1$ is the initial prompt (system + tools + task), $\Delta$ is the
context added per turn (model response + tool result), and $T$ is the turn
count. This is **quadratic in $T$**.
$$
\text{Cost (Tinker)} = \frac{\text{Prefill tokens}}{10^6} \cdot r_{\text{prefill}} + \frac{\text{Decode tokens}}{10^6} \cdot r_{\text{sample}}
$$
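A minimal sketch of the two formulas above in Python. The function names are hypothetical, and the decode-token count is assumed to be `turns * decode_per_turn`, which this page does not define explicitly. The example episode shape (4,000-token initial prompt, 1,500 tokens added per turn, 20 turns, 500 decode tokens per turn) is made up; the rates are the GPT-OSS-120B figures from the table on this page.

```python
def tinker_prefill_tokens(p1: int, delta: int, turns: int) -> int:
    """Total prefill tokens when every turn re-prefills the full history."""
    # Closed form of sum_{t=1..T} (P_1 + (t - 1) * delta).
    return turns * p1 + delta * turns * (turns - 1) // 2


def tinker_cost(p1: int, delta: int, turns: int, decode_per_turn: int,
                r_prefill: float, r_sample: float) -> float:
    """Episode cost in USD; both rates are USD per 1M tokens."""
    prefill = tinker_prefill_tokens(p1, delta, turns)
    decode = turns * decode_per_turn  # assumption: fixed decode length per turn
    return prefill / 1e6 * r_prefill + decode / 1e6 * r_sample


# Hypothetical episode shape, for illustration only.
print(tinker_prefill_tokens(4_000, 1_500, 20))
print(tinker_cost(4_000, 1_500, 20, 500, r_prefill=0.63, r_sample=1.54))
```

Because of the `turns * (turns - 1)` term, doubling the turn count roughly quadruples the context-growth part of the prefill bill.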
### Fireworks Dedicated — GPU-hour billing
Dedicated deployments are billed per GPU-second, so the prefix cache shows up
as **higher effective throughput** rather than a discount on per-token rates.
Across one episode, each unique token is prefilled at most once — the rest of
the prompt is served from the prefix cache and contributes essentially no GPU
work. The uncached portion that actually hits prefill is:
$$
\text{Uncached prompt} = P_T = P_1 + (T - 1) \Delta
$$
On a saturated cluster:
$$
\text{Cluster-hours} = \frac{\text{Uncached prompt} / \text{prefill TPS}}{3600}
$$
$$
\text{Cost} = \text{Cluster-hours} \cdot N_{\text{GPU}} \cdot r_{\text{GPU/hr}}
$$
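The three equations above can be sketched in Python as follows. Function names and example inputs (50,000 prefill tokens/sec, 8 GPUs, 4.00 USD per GPU-hour) are assumptions for illustration. Like the formulas on this page, the sketch counts only prefill time on a saturated cluster and ignores decode time.

```python
def uncached_prompt_tokens(p1: int, delta: int, turns: int) -> int:
    """Tokens that actually hit prefill when the prefix cache is reused."""
    return p1 + (turns - 1) * delta


def dedicated_cost(p1: int, delta: int, turns: int, prefill_tps: float,
                   n_gpu: int, usd_per_gpu_hour: float) -> float:
    """Episode prefill cost in USD on a saturated dedicated cluster."""
    cluster_hours = (uncached_prompt_tokens(p1, delta, turns) / prefill_tps) / 3600
    return cluster_hours * n_gpu * usd_per_gpu_hour


# Hypothetical inputs, for illustration only.
print(uncached_prompt_tokens(4_000, 1_500, 20))
print(dedicated_cost(4_000, 1_500, 20, prefill_tps=50_000, n_gpu=8, usd_per_gpu_hour=4.00))
```

The uncached prompt grows linearly with the turn count, in contrast to the quadratic growth when every turn re-prefills the full history.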
Because cached tokens contribute essentially nothing to wall-clock work, the
cluster's effective \$/M token rate falls as utilization rises. For continuous
RL training, where rollouts run at sustained pace, dedicated is typically the
cheapest path at scale.
The calculator's dedicated path uses *saturated* throughput estimates as
defaults. A small, lightly-loaded test deployment will look more expensive
per token than these numbers because the cluster is paid for whether it's
busy or idle. Tune the throughput inputs in the **Advanced** panel to match
your actual rollout pace.
## What's covered
The calculator currently includes the four models for which Tinker publishes
per-token rates:
| Model | Tinker prefill / sample (per 1M) |
| ------------------------ | -------------------------------- |
| Kimi K2.6 (128K) | $5.15 / $12.81 |
| Kimi K2.5 (128K) | $5.15 / $12.81 |
| Qwen3.5-397B-A17B (256K) | $4.00 / $10.00 |
| GPT-OSS-120B (128K) | $0.63 / $1.54 |
All Fireworks-side rates are taken from the public pages linked below and the
constants live in `snippets/multi-turn-cost-calculator.jsx` — update there if
either side's pricing changes.
## FAQ
### What is the fastest way to reduce wall-clock time?
Increase replicas and overlap sampling/training where your workflow allows it.
Those are usually the most direct levers for shortening end-to-end cycle time.
### How should I compare costs between providers?
Use matched assumptions for context length, truncation policy, and effective
resource allocation. The calculator at the top of this page handles the math
once you plug in your episode shape — be sure to also align truncation policy
and effective context window between providers before drawing conclusions.
## Sources
* Tinker pricing: [thinkingmachines.ai/tinker](https://thinkingmachines.ai/tinker)
* Fireworks GPU-hour pricing: [fireworks.ai/pricing](https://fireworks.ai/pricing)
* Related: [RFT Cost Estimator](/fine-tuning/rft-cost-estimator) — same idea, but
for the training-side bill (Fireworks GPU-hour, no comparison column).
This is an estimator, not a quote. Real costs depend on your exact workload,
cache hit rate, hardware utilization, and rate-card terms at run time.
# Parameter Tuning
Source: https://docs.fireworks.ai/fine-tuning/parameter-tuning
Learn how training parameters affect model behavior and outcomes
## Overview
Reinforcement fine-tuning uses two categories of parameters to control model training: **training parameters** that govern how the model learns, and **rollout (sampling) parameters** that control how the model generates responses during training.
Most experiments converge well with the default values. Adjust parameters only when you have a clear hypothesis based on your training metrics and reward curves.
## Training Parameters
Core parameters that control how your model learns during the training process.
### Learning Rate
**What it does**: Controls how aggressively the model updates its weights during each training step. Think of it as the "step size" when descending the loss landscape.
**Default**: `1e-4` (0.0001)\
**Valid range**: `1e-5` to `5e-4`
**How it affects outcome**:
* **Too high** → Unstable training where reward spikes briefly then collapses as the model overshoots optimal weights.
* **Too low** → Painfully slow convergence. The reward curve plateaus too early before reaching optimal performance.
* **Just right** → Steady, consistent reward improvement throughout training.
**When to adjust**:
* **Decrease** when you see reward spikes followed by crashes in your training metrics
* **Increase** when the reward curve plateaus too early and stops improving
* Keep changes within 2× of the default value
### Epochs
**What it does**: The number of complete passes through your training dataset. Each epoch processes every example once.
**Default**: `1`\
**Valid range**: `1` to `10` (whole numbers only)
**How it affects outcome**:
* **Too few** → The model hasn't had enough exposure to learn patterns from your data
* **Too many** → Overfitting risk where the model memorizes the training set instead of generalizing
* **Just right** → Reward curve shows steady improvement and plateaus near the end of training
**When to adjust**:
* **Add 1-2 more epochs** if the reward is still climbing steadily at the end of training
* **Keep at 1** for most tasks—the default works well
* Watch your reward curves to detect when adding more epochs stops helping
### LoRA Rank
**What it does**: Controls the number of trainable parameters in your LoRA adapter. LoRA (Low-Rank Adaptation) adds small adapter layers to the base model rather than training all weights. Higher rank means more capacity to learn new behaviors.
**Default**: `8`\
**Valid range**: `4` to `32` (must be powers of 2: 4, 8, 16, 32)
**How it affects outcome**:
* **Lower rank (4-8)** → Faster training, but may lack capacity for complex tasks
* **Just right (8-16)** → Balances capacity and efficiency for most tasks
* **Higher rank (32)** → More learning capacity, but requires significantly more GPUs and risks overfitting
**When to adjust**:
* **Increase** for complex reasoning tasks or when the model struggles to learn desired behaviors
* Consider task complexity: simple style changes need lower rank, complex reasoning needs higher
### Batch Size
**What it does**: The amount of data (measured in tokens) processed in each training step before updating model weights.
Unlike traditional batch sizes that count sequences (e.g., 32 or 64 sequences), Fireworks RFT uses **token-based batch sizing**. For example, with an 8k max sequence length, a 64k batch size allows up to 8 sequences per batch (64k tokens ÷ 8k tokens/sequence = 8 sequences).
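The arithmetic in that example can be reproduced with a one-line helper. This is only a sketch of the division shown above; the function name is hypothetical, and "k" is taken as 1,000 tokens here.

```python
def max_sequences_per_batch(batch_tokens: int, max_seq_len: int) -> int:
    """Number of max-length sequences that fit in a token-based batch."""
    return batch_tokens // max_seq_len


print(max_sequences_per_batch(64_000, 8_000))  # the 64k / 8k example
print(max_sequences_per_batch(32_000, 8_000))  # a 32k batch with the same 8k limit
```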
**Default**: `32k tokens`
**How it affects outcome**:
* **Smaller batches** → Noisier gradient updates that may help exploration, but slower training throughput
* **Larger batches** → Smoother, more stable updates and faster training throughput
**When to adjust**:
* Most users should stick with the default. Change it only if you want fewer or more tokens per training step.
### Chunk Size
**What it does**: Sets the minimum number of prompts rolled out before each GRPO training step. Controls how on-policy the training is by determining how often the model is updated relative to rollout generation — a chunk is a slice of the dataset that the trainer fully rolls out *before* taking a training step, after which the next chunk's rollouts are generated from the updated policy.
**Default**: `200` (auto-applied only when the dataset has at least `2 × chunk_size` examples; datasets with fewer examples run without chunking)
**Valid values**: `-1` to disable chunking, any positive integer to set an explicit size. Setting `0` (or leaving unset) uses the default behavior above.
**On-policy spectrum**:
* **Small chunk size** → more frequent training steps, rollouts stay close to the policy being trained (more on-policy), but more forward/backward passes per epoch and slower wall-clock time.
* **Large chunk size** (or `chunk_size = dataset_size`) → fewer training steps, rollouts become stale relative to the updated policy (more off-policy), faster wall-clock but potentially lower sample efficiency.
* **Fully online RL**: `chunk_size=1` (generate one prompt's rollouts → train → repeat). Not typically recommended in practice.
* **Fully offline RL**: `chunk_size = dataset_size` (generate all rollouts first, then train — equivalent to 1 epoch with no mid-epoch updates).
**Epoch/chunk interaction**
An epoch is still a full pass through the entire dataset. `chunk_size` controls how frequently the model gets a GRPO training step *within* each epoch. For example, with `chunk_size=200`, `dataset_size=1000`, `epochs=2`, and `response_candidates_count=8`:
```
epoch 0 chunk 0 (prompts 1-200) × 8 rollouts → train
epoch 0 chunk 1 (prompts 201-400) × 8 rollouts → train
epoch 0 chunk 2 (prompts 401-600) × 8 rollouts → train
epoch 0 chunk 3 (prompts 601-800) × 8 rollouts → train
epoch 0 chunk 4 (prompts 801-1000) × 8 rollouts → train
epoch 1 chunk 0 (prompts 1-200) × 8 rollouts → train
...
```
That is, 5 chunks × 2 epochs = 10 GRPO training steps total, each preceded by 200 × 8 = 1600 rollouts.
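The same count can be checked with a short Python sketch. The function name is hypothetical, and rounding up for a dataset that is not an exact multiple of `chunk_size` is an assumption of this sketch rather than documented behavior.

```python
import math


def grpo_schedule(dataset_size: int, chunk_size: int, epochs: int,
                  response_candidates_count: int) -> tuple[int, int]:
    """Return (total training steps, rollouts generated before each step)."""
    # Assumption: a trailing partial chunk still counts as its own step.
    chunks_per_epoch = math.ceil(dataset_size / chunk_size)
    total_steps = chunks_per_epoch * epochs
    rollouts_per_step = chunk_size * response_candidates_count
    return total_steps, rollouts_per_step


steps, rollouts = grpo_schedule(dataset_size=1000, chunk_size=200,
                                epochs=2, response_candidates_count=8)
print(steps, rollouts)
```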
**Relationship with `gradient_accumulation_steps`**
These two are orthogonal:
* `chunk_size` controls how many prompts are rolled out **before each GRPO training step** — i.e., how on-policy the training is.
* `gradient_accumulation_steps` controls how many forward/backward passes accumulate **within a single chunk's training step** before each optimizer update.
`--chunk-size` is only exposed via the `firectl` / `eval-protocol` CLI. It is not configurable from the Web UI.
## Loss Method
Parameters that control the policy optimization algorithm used during training.
**What it does**: Controls the policy optimization algorithm used during training. Different methods trade off exploration aggressiveness, stability, and KL regularization.
**Default**: `grpo`
**Valid values**: `grpo`, `dapo`, `gspo-token`
**GRPO** (default) — Group Relative Policy Optimization ([arXiv:2402.03300](https://arxiv.org/abs/2402.03300)). The conservative baseline used by most RFT jobs.
* **Symmetric clipping:** Clips the policy ratio to `[0.8, 1.2]`, limiting how much the policy can change in a single step in either direction.
* **KL penalty:** Includes a small KL divergence penalty (`kl_loss_coef=0.001`) that keeps the trained policy close to the reference model. This prevents mode collapse but limits how far the model can deviate from its starting behavior.
* **Token-level loss aggregation:** Loss is summed over valid tokens and divided by total valid token count (`token-mean`).
Best for: Most tasks. Start here unless you have a specific reason to use another method.
**DAPO** — Decoupled Clip and Dynamic Sampling Policy Optimization ([arXiv:2503.14476](https://arxiv.org/abs/2503.14476)). A more aggressive variant that removes KL regularization and uses asymmetric clipping.
* **Asymmetric clipping:** Clips the policy ratio to `[0.8, 1.28]`. The upper margin (0.28 above 1) is wider than the lower margin (0.2 below 1), so the policy can take larger steps in the "improve" direction while staying more conservative about degradation.
* **No KL penalty:** `kl_loss_coef` is set to 0. The trained policy is not penalized for diverging from the reference model.
* **Token-level loss aggregation:** Same `token-mean` mode as GRPO.
Best for: Tasks where the base model is far from optimal and you want to allow larger policy updates. Useful when GRPO converges too slowly or plateaus early.
`--rl-kl-beta` is incompatible with DAPO. Setting a non-zero `--rl-kl-beta` when `--rl-loss-method dapo` is specified will cause job creation to fail with: `loss_config.kl_beta must be 0/unset when method is DAPO`.
**What DAPO does NOT include from the original paper:**
* **Overlong reward shaping** is not implemented. The separate `--length-norm` flag exists but is not DAPO-specific.
* **Dynamic sampling (overgeneration)** is not implemented. Zero-variance groups are filtered out (see [Zero-Variance Group Filtering](#zero-variance-group-filtering) below), but filtered prompts are dropped from the batch, not replaced with new prompts.
**GSPO-token** — Group Sequence Policy Optimization, token-level variant ([arXiv:2507.18071](https://arxiv.org/abs/2507.18071)). Uses sequence-level importance sampling with very tight clipping for conservative, stable updates.
* **Sequence-level importance sampling:** Computes a sequence-level KL proxy and broadcasts it to token-level ratios, rather than computing ratios independently per token. This better captures how entire responses differ from the reference policy.
* **Very tight clipping:** Clips the policy ratio to `[1 - 0.0003, 1 + 0.0004]` — much tighter than GRPO or DAPO, making each training step very conservative.
* **No KL penalty:** `kl_loss_coef` is set to 0.
* **Sequence-mean-token-mean aggregation:** Loss is first averaged per-sequence, then averaged across sequences. This prevents longer responses from dominating the loss.
Best for: Stability-sensitive training or when working with long-form outputs where per-sequence normalization matters. The very small clip range means you may need more training steps to converge.
`--rl-kl-beta` is incompatible with GSPO-token. Setting a non-zero `--rl-kl-beta` when `--rl-loss-method gspo-token` is specified will cause job creation to fail with: `loss_config.kl_beta must be 0/unset when method is GSPO_TOKEN`.
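To make the clip ranges concrete, here is a minimal sketch of a PPO-style clipped surrogate term using the ranges listed above. It is not the Fireworks trainer's implementation: the function and dictionary names are hypothetical, the KL term and loss aggregation are omitted, and for GSPO-token the ratio passed in would be the sequence-level ratio broadcast to tokens.

```python
# Clip ranges as stated in the method descriptions on this page.
CLIP_RANGES = {
    "grpo": (0.8, 1.2),
    "dapo": (0.8, 1.28),
    "gspo-token": (1 - 0.0003, 1 + 0.0004),
}


def clipped_surrogate(ratio: float, advantage: float, method: str) -> float:
    """Clipped objective term for one token (a quantity to maximize)."""
    low, high = CLIP_RANGES[method]
    clipped_ratio = min(max(ratio, low), high)
    # Taking the minimum removes any incentive to push the ratio past the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


for method in CLIP_RANGES:
    print(method, clipped_surrogate(ratio=1.25, advantage=1.0, method=method))
```

With a policy ratio of 1.25 and a positive advantage, GRPO caps the term at its upper bound, DAPO leaves it unclipped, and GSPO-token holds it very close to 1.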
**When to use each method:**
| Goal | Recommended method |
| ----------------------------------------------- | ------------------------------ |
| Safe default for most tasks | `grpo` |
| Faster convergence, more aggressive exploration | `dapo` |
| Maximum stability, long-form outputs | `gspo-token` |
| Keep policy close to reference model | `grpo` with `--rl-kl-beta > 0` |
### KL Beta
**What it does**: Overrides the KL divergence penalty coefficient for GRPO. Higher values keep the policy closer to the reference model; lower values allow more divergence.
**Default**: `0` (uses the loss method's built-in default: `0.001` for GRPO)
**Valid range**: `>= 0`
`--rl-kl-beta` only applies to `--rl-loss-method grpo`. It is rejected for `dapo` and `gspo-token`, which are designed to operate without KL penalties.
**When to adjust**:
* **Increase** if the model diverges too far from the base model's capabilities (catastrophic forgetting)
* **Decrease or set to 0** if you want the model to explore more freely
* Leave at default for most tasks
## Rollout (Sampling) Parameters
Parameters that control how the model generates responses during training rollouts.
### Temperature
**What it does**: Controls the randomness of the model's token selection during generation. Higher temperature = more random/creative, lower = more deterministic/focused.
**Default**: `0.7`\
**Valid range**: `0.1` to `2.0` (must be >0)
**How it affects outcome**:
* **0.0-0.1 (near-greedy)** → Deterministic outputs with no exploration. Leads to mode collapse and repetitive text. **Avoid in RFT.**
* **0.5-1.0 (sweet spot)** → Good balance of exploration and coherence. Ideal for most RLHF applications.
* **>1.2 (high randomness)** → Very creative but potentially incoherent outputs
**When to adjust**:
* **Lower (0.3-0.5)** for tasks requiring precision, factual accuracy, or safety (less toxic outputs)
* **Raise (1.0-1.2)** for creative tasks like story generation or when you need more diverse rollout exploration
* **Never use 0.0**—greedy sampling breaks RFT by eliminating exploration
### Top-p
**What it does**: Dynamically limits token sampling to the smallest set of tokens whose cumulative probability exceeds threshold p. Only considers the most probable tokens that together make up the top p% of probability mass.
**Default**: `1.0` (considers all tokens)\
**Valid range**: `0` to `1`
`top_p` and `top_k` are both optional and not mutually exclusive. If both are set, `top_k` filters first, then `top_p` narrows by cumulative probability.
**How it affects outcome**:
* Lower values (0.2-0.5) filter out long-tail, low-probability tokens that often cause hallucinations
* Higher values (0.9-1.0) allow more diversity in outputs
* Prevents the model from selecting very unlikely tokens that may be nonsensical
**When to adjust**:
* **Lower to 0.2-0.5** when your reward function penalizes hallucinations or factual errors
* **Keep at 0.9-1.0** for creative tasks that benefit from diverse vocabulary
* Works well in combination with temperature for fine-grained control
### Top-k
**What it does**: Limits sampling to only the K most probable tokens at each step. A fixed-size cutoff (unlike top-p which is dynamic).
**Default**: `40`\
**Valid range**: `0` to `100` (0 = disabled)
`top_p` and `top_k` are both optional and not mutually exclusive. If both are set, `top_k` filters first, then `top_p` narrows by cumulative probability.
**How it affects outcome**:
* Similar to top-p but uses a fixed number of candidates instead of a probability threshold
* Lower k = more focused, less diverse outputs
* Higher k = more exploration and creativity
**When to adjust**:
* **Combine with temperature** (e.g., temp 0.8 + top-k 40) for balanced creative exploration
* **Keep ≤50** to maintain reasonable inference latency
* Consider using top-p instead for most use cases—it adapts better to varying probability distributions
### Response Candidates Count
**What it does**: How many different responses the model generates for each prompt during training. In GRPO terminology, this is the **group size** — the set of completions per prompt used to compute group-relative advantages. The policy optimization algorithm compares these candidates to compute advantages and learn which responses are better. Exposed as `--response-candidates-count` in both `firectl` and the `eval-protocol` CLI.
**Default**: `8` (server-side default applied when the field is unset by any client)
**Valid range**: Minimum `2`, no hard upper bound
**How it affects outcome**:
* **n=1** → **Not allowed.** Policy optimization requires multiple candidates to learn from comparisons
* **n=2-4** → Minimal viable exploration. Faster and cheaper but less signal for learning
* **n=8** → Recommended default. Good balance of learning signal and cost for most tasks
* **n=16** → Higher quality signal at higher cost. Consider for complex tasks with nuanced evaluators
* **n>16** → Diminishing returns in most cases. Linearly increases cost and rollout time
**When to adjust**:
* **Increase to 16** when you need higher quality learning signal and cost is acceptable
* **Keep at 8** for most experiments—it's the recommended starting point
* **Never set to 1**—this will cause job creation to fail
* Consider the cost tradeoff: each chunk produces `chunk_size × response_candidates_count` rollouts before a training step (e.g., `chunk_size=200` with `n=8` → 1600 rollouts), so higher values linearly increase wall-clock time. See [Chunk Size](#chunk-size) for how chunks and epochs interact.
Higher values of n increase per-prompt memory usage in both the rollout phase and the training step. While there is no enforced maximum, very high values (e.g., >32) may encounter memory pressure depending on model size and sequence length. Values of 8 and 16 are well-tested.
### Max Tokens
**What it does**: The maximum number of tokens the model can generate in a single response during rollouts.
**Default**: `2048`\
**Valid range**: `16` to `16384`
**How it affects outcome**:
* Directly affects task completion: too short and the model can't finish complex tasks
* Longer responses improve reward on summarization, story generation, and reasoning tasks
* Linearly increases training cost—every token generated costs compute
**When to adjust**:
* **Increase** when your tasks require longer reasoning chains, detailed summaries, or complex multi-step solutions
* **Decrease** to reduce costs for tasks with naturally short outputs (classification, short-form Q\&A)
* Monitor your reward curves: if the model is cutting off mid-response, increase max tokens
### Rollout Concurrency
**What it does**: Controls how many rollout completions run in parallel during the rollout phase of training. This is a **throughput parameter only** — it does not affect training dynamics, gradient computation, or model quality.
**Default**: Inherited from the evaluator's `@evaluation_test` decorator if not set on the CLI. If the decorator also doesn't set it, the SDK default of `96` applies.
**How it affects outcome**:
* **Higher values** → Faster rollout phase (more completions generated simultaneously)
* **Lower values** → Slower rollout phase but less API load on the inference endpoint
* **No effect** on training loss, advantages, or gradient updates
**When to adjust**:
* **Increase** to speed up the rollout phase if your inference endpoint can handle higher concurrency
* **Decrease** if you're hitting rate limits or timeouts on the inference endpoint
* **Leave unset** to use the evaluator's default, which is tuned for typical workloads
This parameter only controls parallelism during the rollout (sampling) phase. It has no effect on training dynamics — batch composition, advantage normalization, loss computation, and gradient updates are all unaffected.
## Zero-Variance Group Filtering
During each training iteration, the model generates K response candidates per prompt (controlled by `--response-candidates-count` or `--n`). Your evaluator scores each candidate. If **all K candidates for a prompt receive the same score**, that group provides no learning signal — the model cannot distinguish better from worse responses.
**Managed RFT automatically filters out these zero-variance groups.** This applies to all loss methods (GRPO, DAPO, and GSPO-token), not just DAPO.
Important behaviors:
* Filtered prompts are **dropped from the batch**, not replaced with new prompts. This means your effective batch size may be smaller than expected when many groups are homogeneous.
* Filtering happens at both the full-group level (all K candidates same score) and at the chunk level within groups.
* If your evaluator returns the same score for all rollouts across most prompts, training will make limited progress and may trigger early stopping.
**To reduce zero-variance groups:**
* Increase `--temperature` (e.g., 0.8–1.0) to produce more diverse responses
* Increase `--response-candidates-count` to generate more candidates
* Ensure your evaluator returns a range of scores, not just 0 and 1
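The filtering behavior described above can be illustrated with a small sketch. It assumes the common GRPO formulation in which each score is normalized by its group's mean and standard deviation; the function name is hypothetical and this is not the managed trainer's code.

```python
from statistics import mean, pstdev


def group_advantages(scores_by_prompt: dict[str, list[float]]) -> dict[str, list[float]]:
    """Drop zero-variance groups and normalize scores within each remaining group."""
    advantages = {}
    for prompt_id, scores in scores_by_prompt.items():
        if max(scores) == min(scores):
            # All K candidates scored the same: no learning signal.
            # The group is dropped, not replaced with a new prompt.
            continue
        mu, sigma = mean(scores), pstdev(scores)
        advantages[prompt_id] = [(s - mu) / sigma for s in scores]
    return advantages


scores = {"prompt-a": [1.0, 1.0, 1.0, 1.0], "prompt-b": [0.0, 1.0, 0.0, 1.0]}
print(group_advantages(scores))
```

Here `prompt-a` contributes nothing to the step, so the effective batch shrinks from two prompts to one.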
## Parameter Interactions
Parameters don't work in isolation—they interact in important ways.
Temperature, top-p, and top-k work together to control sampling behavior. Using all three gives you fine-grained control:
* **Temperature** sets the overall randomness
* **Top-p** dynamically filters by probability mass
* **Top-k** sets a hard limit on candidate tokens
Example: `temperature=0.8, top_p=0.9, top_k=40` gives creative but controlled outputs.
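As a rough sketch of how the three settings compose, the following pure-Python function applies temperature scaling, then top-k, then top-p to a list of logits. It is illustrative only: the function name is hypothetical, and renormalizing after the top-k cut before applying top-p is an assumption about ordering details, not a description of the Fireworks sampler.

```python
import math


def filtered_distribution(logits, temperature=0.8, top_k=40, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, then top-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k > 0:  # 0 disables top-k
        probs = probs[:top_k]
        z = sum(p for p, _ in probs)
        probs = [(p / z, i) for p, i in probs]
    kept, cumulative = [], 0.0
    for p, i in probs:  # keep the smallest prefix reaching top_p
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


print(filtered_distribution([2.0, 1.0, 0.0, -1.0], temperature=0.8, top_k=40, top_p=0.9))
```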
Larger batch sizes provide more stable gradients, which may allow for slightly higher learning rates. However, the default learning rate is tuned for the default batch size—only adjust if you have evidence from your training curves.
Larger base models (70B+) may need higher LoRA ranks to capture complex behaviors, but they also require more resources. For smaller models (\<13B), rank 8-16 is usually sufficient.
## Tuning Strategies
Best practices for adjusting parameters to achieve your training goals.
The default parameters are carefully tuned to work well for most RFT tasks. Don't change them unless you have a clear hypothesis based on your training metrics.
Run at least one baseline experiment with defaults before making any adjustments. This gives you:
* A performance benchmark to compare against
* Understanding of whether parameter tuning is actually needed
* Evidence about which metrics need improvement
Many successful RFT jobs use all default parameters.
When you do adjust parameters, change only one at a time and measure the impact on your reward curves and evaluation metrics.
**Good workflow:**
1. Run baseline with defaults
2. Identify specific issue (e.g., reward crashes, slow convergence)
3. Change ONE parameter that should address that issue
4. Compare results
5. Repeat
**Avoid:** Changing multiple parameters simultaneously—you won't know which change caused the improvement or regression.
Use Weights & Biases integration to:
* Compare training curves across experiments
* Track reward progression over time
* Log all hyperparameters automatically
This makes it easy to identify which parameter changes actually helped and which hurt performance.
Quick reference for goal-directed parameter tuning:
* **Faster convergence** → ↑ epochs (add 1-2), tune learning rate (stay \<2× default)
* **Better quality** → ↑ temperature (1.0-1.2), ↑ rollouts (8-16), ↑ max tokens
* **Safer/less toxic** → ↓ temperature (0.3-0.5), ↓ top-p (0.5), ↓ top-k
* **More creative** → ↑ temperature (1.0-1.2), top-p = 0.9
* **Lower cost** → ↓ rollouts, ↓ max tokens, ↓ batch size
* **Higher capacity** → ↑ LoRA rank (16-32), but monitor memory usage
* **Prevent overfitting** → Keep epochs = 1, consider lower LoRA rank
## Next Steps
Complete guide to CLI parameters and options
Launch your RFT job
Hands-on tutorial showing parameter tuning in practice
Learn about the RFT training process and workflow
# Single-Turn Training Quickstart
Source: https://docs.fireworks.ai/fine-tuning/quickstart-math
Train a model to be an expert at answering GSM8K math questions
**Reinforcement Fine-Tuning (RFT)** is free for models under 16B parameters. When creating an RFT job in the UI, filter for free tuning models in the model selection area on the [fine-tuning creation page](https://app.fireworks.ai/dashboard/fine-tuning/create). If kicking off jobs from the terminal, you can find the model ID from the [Model Library](https://app.fireworks.ai/models?filter=LLM\&tunable=true). Note: SFT and DPO jobs are billed per training token for all model sizes—see the [pricing page](https://fireworks.ai/pricing) for details.
**Following the [RFT Overview](/fine-tuning/reinforcement-fine-tuning-models)?** This is the **Single-Turn Training** path—the fastest way to get started with RFT.
In this quickstart, you'll train a small language model—`Qwen3 0.6B`—to solve mathematical reasoning problems from the GSM8K dataset.
## What you'll learn
* How to set up and test an evaluator locally, using the Eval Protocol SDK
* How to take that evaluator and use it in an RFT job, from the command line
* How to monitor training progress and evaluate accuracy improvements
Prefer a notebook experience? You can also [run this tutorial in Google Colab](https://colab.research.google.com/drive/16xrb9rx6AoAEOtrDXumzo71HjhunaoPi#scrollTo=CP18QX4tgi-0). Note that Colab requires billing enabled on your Google account.
## Prerequisites
* Python 3.10+
* A Fireworks API key (stored in your shell or .env)
* Command-line access (terminal or shell)
## 1. Install dependencies and set up files
Clone the quickstart-gsm8k repository and install dependencies:
```bash theme={null}
git clone https://github.com/eval-protocol/quickstart-gsm8k.git
cd quickstart-gsm8k
pip install -r requirements.txt
```
Create the `gsm8k_artifacts/` folder structure and copy files:
```bash theme={null}
mkdir -p gsm8k_artifacts/{tests/pytest/gsm8k,development}
cp evaluation.py gsm8k_artifacts/tests/pytest/gsm8k/test_pytest_math_example.py
cp gsm8k_sample.jsonl gsm8k_artifacts/development/gsm8k_sample.jsonl
```
The repository includes:
* **Evaluator** (`evaluation.py`): Defines how to evaluate math answers
* **Dataset** (`gsm8k_sample.jsonl`): Contains example math problems to test on
Install the latest `eval-protocol` SDK, `pytest`, and `requests`:
```bash theme={null}
python -m pip install --upgrade pip
python -m pip install pytest requests git+https://github.com/eval-protocol/python-sdk.git
```
Download the evaluator and dataset files:
Run this Python script to download two files from the Eval Protocol repository into a folder on your machine called `gsm8k_artifacts/`.
* **Test script** (`test_pytest_math_example.py`): Defines how to evaluate math answers
* **Sample dataset** (`gsm8k_sample.jsonl`): Contains example math problems to test on
```python tutorial/download_gsm8k_assets.py theme={null}
from pathlib import Path

import requests

# Local destinations for the evaluator test script and the sample dataset
ARTIFACT_ROOT = Path("gsm8k_artifacts")
TEST_PATH = ARTIFACT_ROOT / "tests" / "pytest" / "gsm8k" / "test_pytest_math_example.py"
DATASET_PATH = ARTIFACT_ROOT / "development" / "gsm8k_sample.jsonl"

files_to_download = {
    TEST_PATH: "https://raw.githubusercontent.com/eval-protocol/python-sdk/main/tests/pytest/gsm8k/test_pytest_math_example.py",
    DATASET_PATH: "https://raw.githubusercontent.com/eval-protocol/python-sdk/main/development/gsm8k_sample.jsonl",
}

for local_path, url in files_to_download.items():
    # Create parent folders, then fetch and save each file
    local_path.parent.mkdir(parents=True, exist_ok=True)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    local_path.write_bytes(response.content)
    print(f"Saved {url} -> {local_path}")
```
Expected output:
```
Saved https://raw.githubusercontent.com/.../test_pytest_math_example.py -> gsm8k_artifacts/tests/pytest/gsm8k/test_pytest_math_example.py
Saved https://raw.githubusercontent.com/.../gsm8k_sample.jsonl -> gsm8k_artifacts/development/gsm8k_sample.jsonl
```
## 2. Test your evaluator locally
In this step, we will test your evaluator by examining the output locally. Feel free to iterate on the evaluator you downloaded in the last step until it gives the output you want.
Open a terminal and run:
```bash theme={null}
ep logs
```
This starts a local server. Open `http://localhost:8000` in your browser and keep this terminal running.
In a **new terminal**, call the test script to run the evaluator on your dataset of sample math problems.
```bash theme={null}
cd gsm8k_artifacts
ep local-test
```
This command discovers and runs your `@evaluation_test` with pytest.
As the test runs, you'll see evaluation scores appear in the browser, with detailed logs for each problem the model attempts. `pytest` will also register your evaluator and dataset with Fireworks automatically, so you can use them in the next step for RFT.
## 3. Start training
First, set your Fireworks API key so the Fireworks CLI can authenticate you:
```bash theme={null}
export FIREWORKS_API_KEY=""
```
Next, we'll launch the RFT job using the evaluator and dataset you just registered. We're using a small base model (`qwen3-0p6b`) to keep training fast and inexpensive. Because your evaluator and dataset were already registered with Fireworks in the last step, we don't need to specify them again here.
```bash theme={null}
cd ..
eval-protocol create rft \
  --base-model accounts/fireworks/models/qwen3-0p6b
```
The CLI will output dashboard links where you can monitor your training job in real-time.
You can also store your API key in a `.env` file instead of exporting it each session.
## Monitor your training progress
Your RFT job is now running. You can monitor progress in the dashboard links provided by the CLI output.
Re-run the pytest evaluation command to measure your model's performance on new checkpoints:
```bash theme={null}
cd gsm8k_artifacts
pytest -q tests/pytest/gsm8k/test_pytest_math_example.py::test_math_dataset -s
```
This helps you see how your model's accuracy improves over time and decide when to stop training.
You can adjust the evaluation logic to better fit your needs:
* **Modify reward shaping**: Edit the scoring logic in `test_pytest_math_example.py` to match your answer format expectations
* **Use your own data**: Replace the sample dataset by either editing the JSONL file locally or passing `--dataset-jsonl` when creating the RFT job
### What's happening behind the scenes
Understanding the training workflow:
1. **Evaluation registration**: The pytest script evaluates a small GSM8K subset using numeric answer checking, then automatically registers both your evaluator and dataset with Fireworks
2. **RFT job creation**: The `create rft` command connects your registered evaluator and dataset to a Reinforcement Fine-Tuning job for your chosen base model
3. **Continuous improvement**: As training progresses, evaluation scores on the held-out set reflect improved accuracy, allowing you to iterate quickly before scaling to larger experiments
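To make "numeric answer checking" in step 1 concrete, here is a minimal sketch of a scorer that compares the last number in a model response against the reference answer. The helper names (`extract_final_number`, `numeric_reward`) and the regular expression are illustrative assumptions, not the code shipped in `test_pytest_math_example.py`.

```python
import re
from typing import Optional

# Hypothetical pattern: an optional sign, digits with optional thousands
# separators, and an optional decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number that appears in the text, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def numeric_reward(response: str, ground_truth: str, tol: float = 1e-6) -> float:
    """Score 1.0 when the final numbers agree within tol, otherwise 0.0."""
    predicted = extract_final_number(response)
    expected = extract_final_number(ground_truth)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if abs(predicted - expected) <= tol else 0.0
```

A binary score like this is the simplest form of reward; the "Modify reward shaping" option above is where you would relax or tighten it for your own answer format.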
## Next steps
Learn all CLI options to customize your training parameters
Train agents that run in your production infrastructure
Understand how reinforcement fine-tuning works
# Remote Agent Quickstart
Source: https://docs.fireworks.ai/fine-tuning/quickstart-svg-agent
Train an SVG drawing agent running in a remote environment
**Reinforcement Fine-Tuning (RFT)** is free for models under 16B parameters. When creating an RFT job in the UI, filter for free tuning models in the model selection area on the [fine-tuning creation page](https://app.fireworks.ai/dashboard/fine-tuning/create). If kicking off jobs from the terminal, you can find the model ID from the [Model Library](https://app.fireworks.ai/models?filter=LLM\&tunable=true). Note: SFT and DPO jobs are billed per training token for all model sizes—see the [pricing page](https://fireworks.ai/pricing) for details.
**Following the [RFT Overview](/fine-tuning/reinforcement-fine-tuning-models)?** This is the **Remote Agent Training** path—for training agents that run in your production infrastructure.
In this quickstart, you'll train an agent to generate SVG drawings. Your agent runs in a remote server (Vercel), which means rollouts happen remotely while Fireworks handles the training. This approach lets you train agents that already live in your production environment.
## What You'll Learn
* **Apply RFT to production agents** — Train models that work with remote servers and existing infrastructure
* **Remote rollout processing** — Connect your production environment to Fireworks RFT using Eval Protocol
* **Monitor and debug training** — Track progress, inspect rollouts, and debug issues with live logs
## 1. Installation
1. **Clone the quickstart repo**: [https://github.com/eval-protocol/quickstart](https://github.com/eval-protocol/quickstart)
```bash theme={null}
git clone git@github.com:eval-protocol/quickstart.git
cd quickstart
```
2. **Install Eval Protocol**:
```bash theme={null}
pip install "eval-protocol[svgbench]"
```
3. **Environment Setup**:
The `env.example` file is located in the `evaluator/` directory. Make a copy of it in the same directory, name it `.env`, and fill in your API keys:
```bash theme={null}
cp evaluator/env.example evaluator/.env
```
Then edit `evaluator/.env` with your API keys:
```
FIREWORKS_API_KEY=your-fireworks-key-here
OPENAI_API_KEY=your-openai-key-here
```
The create process below automatically reads and uploads these secrets to Fireworks.
For more details on Fireworks Secret Management usage, please refer to [using secret in evaluator](/fine-tuning/using-secret-in-evaluator).
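Before running the create command, it can help to confirm that `evaluator/.env` actually contains both keys. The following is a minimal sketch of such a check; `parse_env` and `missing_keys` are hypothetical helpers written for this illustration, and they are not how Eval Protocol itself reads the file.

```python
from typing import Dict, Iterable, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(
    env: Dict[str, str],
    required: Iterable[str] = ("FIREWORKS_API_KEY", "OPENAI_API_KEY"),
) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]
```

To use it, read the file with `open("evaluator/.env").read()`, pass the text to `parse_env`, and fail early if `missing_keys` returns anything.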
## 2. Test your evaluator locally
Test your evaluator locally before launching training, to verify everything works with your rollout processor.
**Terminal 1** - Start the local UI server to view results:
```bash theme={null}
ep logs
```
**Terminal 2** - Kick off the test:
```bash theme={null}
cd evaluator
ep local-test
```
This command discovers and runs your `@evaluation_test` with pytest. In this case, it builds an image and runs the test in Docker, because a `Dockerfile` is present.
The test automatically uses our Vercel remote server:
```
rollout_processor=RemoteRolloutProcessor(
    remote_base_url="https://vercel-svg-server-ts.vercel.app",
)
```
If you want to use a local development Vercel server instead, see [Local Development Server](#local-development-server).
**Note:**
* If your evaluation setup has custom system dependencies (e.g., Chromium), add a `Dockerfile`. When you run `ep local-test`, it will build an image and run `pytest` inside Docker.
* If you don't need Docker, `ep local-test` will run `pytest` on your host machine by default.
* You can ignore the `Dockerfile` and force host execution with: `ep local-test --ignore-docker`.
RFT evaluators run in sandboxed environments. Your Dockerfile must follow these constraints:
**Base image:**
* Only Debian-based images are supported (e.g., Debian, Ubuntu, or `python:3.x-slim`)
* Alpine, CentOS, and other non-Debian distros are not supported
* If no Dockerfile is provided, the system uses a default Python environment with common packages pre-installed
**Supported instructions:**
* `FROM`: Base image (required, only one allowed)
* `RUN`: Execute commands
* `COPY` / `ADD`: Copy files into the image
* `WORKDIR`: Set working directory
* `USER`: Set the user
* `ENV`: Set environment variables
* `CMD` / `ENTRYPOINT`: Set the start command
* `ARG`: Build-time variables
**Unsupported features:**
| Feature | Status |
| ---------------------- | ----------------------------------------- |
| Non-Debian base images | ❌ Not supported (no Alpine, CentOS, etc.) |
| Multi-stage builds | ❌ Not supported (only one `FROM` allowed) |
| `EXPOSE` | ⚠️ Ignored |
| `VOLUME` | ⚠️ Ignored |
**Example Dockerfile:**
```dockerfile theme={null}
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    chromium \
    && rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy evaluator code
COPY . .
CMD ["pytest", "-vs"]
```
Multi-stage Dockerfiles will fail during the evaluator build. Use a single `FROM` instruction and install all dependencies in one stage.
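The constraints above can be checked locally before you upload. This is a rough sketch of a pre-flight lint, assuming a simple line-based scan; `lint_dockerfile` is a hypothetical helper, and its base-image check only recognizes the distro names mentioned above (Alpine, CentOS). It does not reproduce the validation Fireworks actually runs.

```python
from typing import List

# Assumption: only the non-Debian examples named in the constraints are flagged.
UNSUPPORTED_BASES = ("alpine", "centos")
IGNORED_INSTRUCTIONS = {"EXPOSE", "VOLUME"}


def lint_dockerfile(text: str) -> List[str]:
    """Return warnings and errors for a Dockerfile, based on the rules listed above."""
    problems: List[str] = []
    from_lines: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Continuation lines of a RUN command are scanned too, but they do not
        # normally start with an instruction keyword, so they fall through.
        instruction = line.split(maxsplit=1)[0].upper()
        if instruction == "FROM":
            from_lines.append(line)
        elif instruction in IGNORED_INSTRUCTIONS:
            problems.append(f"warning: {instruction} is ignored")
    if not from_lines:
        problems.append("error: FROM is required")
    elif len(from_lines) > 1:
        problems.append("error: multi-stage builds are not supported (only one FROM allowed)")
    for line in from_lines:
        # Skip flags such as --platform to find the image name.
        tokens = [tok for tok in line.split()[1:] if not tok.startswith("--")]
        image = tokens[0].lower() if tokens else ""
        if any(name in image for name in UNSUPPORTED_BASES):
            problems.append(f"error: non-Debian base image: {image}")
    return problems
```

An empty result means none of the listed rules were tripped; it does not guarantee that the evaluator build will succeed.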
### Expected Test Output
Navigate to [http://localhost:8000](http://localhost:8000) to see the Eval Protocol UI.
```
INFO:eval_protocol.pytest.remote_rollout_processor:Found status log for rollout democratic-way-12: Rollout democratic-way-12 completed
INFO:eval_protocol.pytest.remote_rollout_processor:Found Fireworks log for rollout democratic-way-12 with status code 100.0
INFO:eval_protocol.adapters.fireworks_tracing:Successfully converted 1 traces to evaluation rows | 3/8 [00:19<00:22, 4.52s/rollout]
...
Runs (Parallel): 100%|████████████████████████████████████████████| 1/1 [00:31<00:00, 31.07s/run]
PASSED
```
If you're interested in understanding how Remote Rollout Processing works and how it communicates with the remote server, see [How Remote Rollout Processing Works](#how-remote-rollout-processing-works).
## 3. Start training with a single command
To kick off training, run:
```bash theme={null}
eval-protocol create rft \
--base-model accounts/fireworks/models/qwen3-0p6b \
--chunk-size 10
```
This command:
1. Uploads secrets — reads your `.env` and uploads API keys as Fireworks secrets
2. Uploads evaluator — packages and uploads your evaluation code
3. Waits for build — polls evaluator status until ACTIVE (timeout: 10 minutes)
4. Creates dataset — uploads your `svgbench_dataset.jsonl`
5. Launches RFT job — starts reinforcement fine-tuning with your evaluator
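Step 3 is a standard poll-until-ready loop with a deadline. The sketch below shows the general pattern under stated assumptions: `get_status` is a hypothetical callable that you supply, the 5-second poll interval is arbitrary, and only the `ACTIVE` state and the 10-minute timeout come from the description above.

```python
import time
from typing import Callable


def wait_until_active(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,       # 10 minutes, matching the timeout above
    poll_interval_s: float = 5.0,   # assumption: not a documented value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status() until it returns "ACTIVE" or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if get_status() == "ACTIVE":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting. A `False` return corresponds to the timeout case covered under Configuration & Troubleshooting.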
### Configuration & Troubleshooting
**Training Parameters**: We use Eval Protocol's default values for training parameters (batch size, epochs, learning rate, LoRA rank, accelerator count, etc.). For a complete list of available RFT flags you can customize, see [Fireworks RFT Command Documentation](/tools-sdks/firectl/commands/reinforcement-fine-tuning-job-create).
**Changing Evaluators**: If you've made changes to your evaluator code and want to upload a new version:
```bash theme={null}
eval-protocol create rft \
--base-model accounts/fireworks/models/qwen3-0p6b \
--chunk-size 10 \
--force
```
**Evaluator Upload Timing Out**: If your evaluator takes longer than 10 minutes to build, you'll see:
```
⏰ Timeout after 10.0m - evaluator is not yet ACTIVE
❌ Evaluator is not ready within the timeout period.
📊 Please check the evaluator status at: https://app.fireworks.ai/dashboard/evaluators/test-svgagent-test-svg-generation-evaluation
Wait for it to become ACTIVE, then run 'eval-protocol create rft' again.
```
In this case, monitor the evaluator build at the link above and run the command again once the evaluator is ACTIVE.
## 4. Monitor Training Progress
After successful job creation, you'll see:
```
✅ Created Reinforcement Fine-tuning Job
name: accounts/pyroworks/reinforcementFineTuningJobs/sdnld4yn
📊 Dashboard Links:
Evaluator: https://app.fireworks.ai/dashboard/evaluators/test-svgagent-test-svg-generation-evaluation
Dataset: https://app.fireworks.ai/dashboard/datasets/svgbench-dataset
RFT Job: https://app.fireworks.ai/dashboard/fine-tuning/reinforcement/sdnld4yn
```
Click on the **RFT Job** link to view real-time training progress, epoch counts, and rollout data.
### Training Results
After successful training, you should see performance improvements reflected in the training metrics:
### SVG Quality Improvement
You can inspect individual rollouts to see the dramatic improvement in SVG generation quality. Below is a comparison between the first epoch and the final 8th epoch:
**Before (1st Epoch):**
**After (8th Epoch):**
The reinforcement fine-tuning process significantly improves the model's ability to generate accurate, detailed SVG graphics that better match the input descriptions.
## Debugging Tips
When your training is running, you have several powerful tools to debug and monitor your rollouts:
### Rollout Overview
Clicking on any **Epoch** or **Step** in the training dashboard, then clicking the **table icon** to the right, will show you a comprehensive table of all rollouts. It's a good high-level overview to see if any rollouts failed and for what reason.
### Individual Rollout Details
If you click on a specific row in the rollout table, you can see exactly what the prompt was and how the model responded. You can also copy the generated SVG code and render it yourself to see what the model produced. This is how we got the before and after comparison above.
### Live Log Streaming
Clicking on **View Logs** takes you to a page of live-streamed logs. Here, you can see exactly which errors are occurring in your rollouts, which is useful for debugging and fixing them.
## Next steps
Learn all CLI options to customize your training parameters
Train models with Python evaluators for simpler tasks
Understand how reinforcement fine-tuning works
## Additional resources
* [Discord Server](https://discord.gg/mMqQxvFD9A) - Come talk to us in the #eval-protocol channel!
* [Eval Protocol Documentation](https://evalprotocol.io/introduction)
* [Remote Rollout Processor Tutorial](https://evalprotocol.io/tutorial/remote-rollout-processor)
* [SVGBench Dataset](https://github.com/johnbean393/SVGBench) - The original benchmark this project is based on
## Appendix
### How Remote Rollout Processing Works
Eval Protocol enables **reinforcement learning that meets you where you are**. Instead of forcing you to rewrite your agent in a specific framework, you can implement a lightweight remote server wherever your codebase and infrastructure already live.
Your remote server is only responsible for:
* **Executing rollouts** - Run your agent logic (in this case, SVG generation from text prompts)
* **Logging to tracing** - Send structured logs to `tracing.fireworks.ai` for evaluation (see the docs linked below for more information)
In this example, we showcase a **Vercel TypeScript server** that executes single-turn SVG code generation.
> **📖 Learn More**: For a complete deep-dive into Remote Rollout Processing, see the [Remote Rollout Processor Tutorial](https://evalprotocol.io/tutorial/remote-rollout-processor).
### Local Development Server
```bash theme={null}
cd vercel_svg_server_ts
vercel dev
```
Then swap out the `remote_base_url` to point to the local server you just started:
```
rollout_processor=RemoteRolloutProcessor(
    remote_base_url="http://localhost:3000",
)
```
And in a third terminal, run the evaluation:
```bash theme={null}
ep local-test
```
> See [Vercel CLI documentation](https://vercel.com/docs/cli/dev) for more information on local development.
# Overview
Source: https://docs.fireworks.ai/fine-tuning/reinforcement-fine-tuning-models
Train models using reinforcement learning in minutes
**Reinforcement Fine-Tuning (RFT)** is free for models under 16B parameters. When creating an RFT job in the UI, filter for free tuning models in the model selection area on the [fine-tuning creation page](https://app.fireworks.ai/dashboard/fine-tuning/create). If kicking off jobs from the terminal, you can find the model ID from the [Model Library](https://app.fireworks.ai/models?filter=LLM\&tunable=true). Note: SFT and DPO jobs are billed per training token for all model sizes—see the [pricing page](https://fireworks.ai/pricing) for details.
Fireworks RFT helps you train frontier models like DeepSeek V3 and Kimi K2 to **outperform closed models for your product use case, using reinforcement learning.** Fireworks RFT is powerful and easy to use for developers and enterprises:
* **No infrastructure:** Train frontier models without managing GPUs or RL infra
* **Production-ready:** Built-in tracing, monitoring, security & one-click deploy
* **Fast iteration:** From evaluator setup to deployed model in hours, not weeks
See how [Genspark](https://fireworks.ai/blog/genspark) and [Vercel](https://fireworks.ai/blog/vercel) used Fireworks RFT to train open models for agentic use cases, outperforming leading closed models.
## Quickstart: Pick Your Training Approach
**⏱️ 15 minutes**
**Best for:** Testing locally, simple task training
**How it works:** Iterate on your evaluator and use it to train a small model on Fireworks.
**⏱️ 1-2 hours**
**Best for:** Agents, multi-turn workflows, existing services
**How it works:** Rollouts happen in your environment. Connect via HTTP with tracing.
**⏱️ 2-4 hours**
**Best for:** Sensitive data, compliance, enterprise
**How it works:** Training data never leaves your GCS/S3 bucket. Full data isolation.
## Launch Training
Requirements, validation checks, and common errors before launching
Fast, scriptable, reproducible. Perfect for automation and iteration
Visual, guided, beginner-friendly. Great for exploring options
Already familiar with [firectl](/fine-tuning/cli-reference#using-firectl-cli-alternative)? You can create RFT jobs directly.
## RFT Concepts
The RL training loop explained
How reward functions guide training
Local vs remote evaluation environments
Optimize your training configuration
Estimate and optimize your training costs
# Cost Estimator
Source: https://docs.fireworks.ai/fine-tuning/rft-cost-estimator
Estimate and optimize the cost of your RFT training jobs
**Reinforcement Fine-Tuning (RFT)** is free for models under 16B parameters. When creating an RFT job in the UI, filter for free tuning models in the model selection area on the [fine-tuning creation page](https://app.fireworks.ai/dashboard/fine-tuning/create). If kicking off jobs from the terminal, you can find the model ID from the [Model Library](https://app.fireworks.ai/models?filter=LLM\&tunable=true). Note: SFT and DPO jobs are billed per training token for all model sizes—see the [pricing page](https://fireworks.ai/pricing) for details.
## Interactive cost calculator
Select your model and training configuration to get an instant cost estimate. The calculator uses the following formulas:
1. **Total tokens**: Prompts × Epochs × Response candidates × (Max tokens × 0.6)
2. **GPU hours**: (Total tokens ÷ 1M) × (GPU hours per million tokens range, varies by model size)
3. **Cost**: GPU hours × GPU rate per hour
You can derive wall-clock training time from the estimate as: **Training time = GPU hours ÷ Number of GPUs**.
The GPU hours per million tokens range varies by model size and accounts for variability in model efficiency, system overhead, and actual response lengths. Ranges are based on actual RFT job data.
**Order-of-magnitude estimates only.** This calculator provides estimates and is not intended for real forecasting or budgeting. Actual costs may vary significantly.
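The three calculator formulas, plus the training-time relation, can be sketched in a few lines. This is an illustration only: `estimate_rft_cost` is a hypothetical helper, and the GPU-hours-per-million-tokens range and hourly rate are inputs you must supply yourself (no real Fireworks values are assumed here).

```python
from typing import Dict, Tuple


def estimate_rft_cost(
    prompts: int,
    epochs: int,
    candidates: int,
    max_tokens: int,
    gpu_hours_per_million_tokens: Tuple[float, float],  # (low, high), caller-supplied
    gpu_rate_per_hour: float,                           # caller-supplied
    num_gpus: int,
) -> Dict[str, object]:
    """Apply the calculator formulas and return low/high estimates."""
    # 1. Total tokens: Prompts x Epochs x Response candidates x (Max tokens x 0.6)
    total_tokens = prompts * epochs * candidates * (max_tokens * 0.6)
    # 2. GPU hours: (Total tokens / 1M) x GPU hours per million tokens
    millions = total_tokens / 1_000_000
    gpu_hours = tuple(millions * r for r in gpu_hours_per_million_tokens)
    # 3. Cost: GPU hours x GPU rate per hour
    cost = tuple(h * gpu_rate_per_hour for h in gpu_hours)
    # Training time = GPU hours / Number of GPUs
    training_hours = tuple(h / num_gpus for h in gpu_hours)
    return {
        "total_tokens": total_tokens,
        "gpu_hours": gpu_hours,
        "cost": cost,
        "training_hours": training_hours,
    }
```

For example, 500 prompts, 1 epoch, 4 candidates, and 2048 max tokens give about 2.46M total tokens; everything after that depends on the range and rate you plug in.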
## How RFT pricing works
Reinforcement fine-tuning jobs are billed based on **GPU-seconds** consumed during training. The total cost depends on three main factors:
1. **Model size** — Determines how many GPUs are needed and the per-GPU-hour rate
2. **Training dataset** — How much data is processed (dataset size × epochs × rollouts)
3. **Rollout generation** — Token generation during training (max tokens × rollouts per prompt)
## Cost formula
The approximate cost of an RFT job can be estimated as:
$$
\text{Cost} = \text{GPU-hours} \times \text{Price per GPU-hour}
$$
Where GPU-hours depend on:
$$
\text{GPU-hours} \approx \text{Num GPUs} \times \left(\frac{\text{Prompts} \times \text{Epochs} \times \text{Rollouts (n)} \times \text{Avg tokens per rollout}}{\text{Throughput (tokens/sec)}}\right) \div 3600
$$
The key variables are:
| Variable | Description | How to control |
| --------------------------- | ------------------------------ | ----------------------------------- |
| **Num GPUs** | GPUs required for the model | Determined by model size |
| **Prompts** | Number of rows in your dataset | Your dataset size |
| **Epochs** | Passes through the dataset | `--epochs` flag (default: 1) |
| **Rollouts (n)**            | Responses generated per prompt | `--n` flag (default: 4)             |
| **Avg tokens per rollout** | Average response length | `--max-tokens` flag (default: 2048) |
| **Throughput** | Tokens generated per second | Determined by model + hardware |
Training time directly translates to cost: **Cost = Training time × Num GPUs × GPU-hour rate**. Check the [pricing page](https://fireworks.ai/pricing) for current GPU-hour rates.
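As a worked sketch of the GPU-hours formula, the function below plugs numbers into the expression exactly as written. `estimate_gpu_hours` is a hypothetical helper, and the throughput and average-token values in the usage note are made up for illustration rather than measured.

```python
def estimate_gpu_hours(
    num_gpus: int,
    prompts: int,
    epochs: int,
    rollouts: int,
    avg_tokens_per_rollout: float,
    throughput_tokens_per_sec: float,
) -> float:
    """GPU-hours = Num GPUs x (total generated tokens / throughput) / 3600."""
    total_tokens = prompts * epochs * rollouts * avg_tokens_per_rollout
    generation_seconds = total_tokens / throughput_tokens_per_sec
    return num_gpus * generation_seconds / 3600
```

With 8 GPUs, 500 prompts, 1 epoch, 4 rollouts, an assumed 1000 tokens per rollout, and an assumed 2000 tokens/sec, the total is 2,000,000 tokens, or 1000 seconds of generation, which comes to roughly 2.2 GPU-hours. Because every dataset and rollout term is a plain multiplier, doubling prompts, epochs, or rollouts doubles the estimate.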
### How parameters affect cost
See how each parameter change impacts your total cost relative to a baseline configuration (500 prompts, 1 epoch, n=4, 2048 max tokens):
| Change | Cost impact | Explanation |
| ---------------------------------- | -------------- | --------------------------------------- |
| Double dataset size (1000 prompts) | **\~2×** | Linear scaling with dataset size |
| Double rollouts (n=8) | **\~2×** | Linear scaling with rollout count |
| Double max tokens (4096) | **\~1.5–2×** | More tokens per rollout |
| Add epoch (epochs=2) | **\~2×** | Full additional pass through data |
| Double LoRA rank (16 → 32) | **\~1.2–1.5×** | More trainable parameters |
| Halve max tokens (1024) | **\~0.5–0.7×** | Fewer tokens generated |
| Halve rollouts (n=2) | **\~0.5×** | Fewer rollouts but less learning signal |
## Cost optimization tips
Use models under 16B parameters for initial experimentation. Iterate on your evaluator and dataset with `qwen3-0p6b` or `llama-v3p1-8b-instruct` before moving to larger models.
This lets you:
* Validate your evaluator logic at zero cost
* Test dataset quality and format
* Tune rollout parameters
* Establish baseline reward curves
Set `--max-tokens` to the minimum needed for your task:
* **Short outputs** (classification, short answers): 256–512 tokens
* **Medium outputs** (code generation, summaries): 1024–2048 tokens
* **Long outputs** (detailed analysis, multi-step reasoning): 4096+ tokens
Every token generated during rollouts costs compute. Don't use 16384 max tokens if your task only needs 512.
Start with 1 epoch (default). Most RFT jobs converge well within a single pass through the data. Add more epochs only if the reward curve is still climbing at the end of training.
Slow evaluators increase wall-clock training time and therefore cost:
* Keep evaluations under 5 seconds per rollout
* Cache expensive computations
* For remote evaluators, ensure your server can handle concurrent requests
* Avoid unnecessary API calls in your evaluation logic
**Evaluator complexity impact**: Simple evaluators (self-contained) have minimal overhead. Evaluators with calls to external services, such as LLM-as-judge use cases or company-specific endpoints, may have variable training time due to rate limits by model providers or other services.
A smaller, high-quality dataset often outperforms a larger, noisy one:
* Remove duplicate or near-duplicate prompts
* Ensure prompts are diverse and representative
* Start with 200–500 well-chosen prompts
* Quality over quantity reduces cost while maintaining performance
## Example cost scenarios
**Goal**: Test an evaluator on a small model
| Parameter | Value |
| ------------------ | --------------- |
| Model | Qwen3 0.6B |
| Dataset | 100 prompts |
| Epochs | 1 |
| Rollouts (n) | 4 |
| Max tokens | 2048 |
| **Estimated cost** | **Free** |
| **Estimated time** | \~15–30 minutes |
Best for: Initial evaluator development and testing.
**Goal**: Train a capable model for production use
| Parameter | Value |
| ------------------ | --------------------- |
| Model | Llama 3.1 8B Instruct |
| Dataset | 500 prompts |
| Epochs | 1 |
| Rollouts (n) | 4 |
| Max tokens | 2048 |
| **Estimated cost** | **Free** |
| **Estimated time** | \~1–2 hours |
Best for: Production workloads that can use an 8B model.
**Goal**: Train a large model for maximum quality
| Parameter | Value |
| ------------------ | ------------------------------ |
| Model | Llama 3.3 70B Instruct |
| Dataset | 500 prompts |
| Epochs | 1 |
| Rollouts (n) | 4 |
| Max tokens | 2048 |
| **Estimated cost** | Training hours × 8 GPUs × rate |
| **Estimated time** | \~1–2 hours |
Check the [Fireworks Pricing page](https://fireworks.ai/pricing) for the current GPU-hour rate. For a 2-hour job on 8 GPUs, multiply: 2 × 8 × (rate per GPU-hour).
**Goal**: Maximum quality with large model and more rollouts
| Parameter | Value |
| ------------------ | ------------------------------ |
| Model | DeepSeek V3 |
| Dataset | 1000 prompts |
| Epochs | 2 |
| Rollouts (n) | 8 |
| Max tokens | 4096 |
| **Estimated cost** | Training hours × 8 GPUs × rate |
| **Estimated time** | \~8–16 hours |
This is a larger job. The cost scales with training time: more prompts, epochs, rollouts, and tokens all increase total GPU-hours.
## Monitoring costs during training
Cost information is only available after your job completes:
1. **Dashboard**: The [Fireworks Dashboard](https://app.fireworks.ai) displays the final cost on the RFT job page once training finishes
2. **Training progress**: While the job is running, you can monitor elapsed time and estimated completion in the job overview
3. **Early stopping**: You can cancel a job early if needed—the model checkpoint from the last completed step is still usable. The final cost will be calculated based on GPU-seconds consumed up to the cancellation point.
If a job is running longer than expected, check your evaluator performance. Slow evaluators are the most common cause of unexpectedly long (and expensive) training runs.
## Next steps
View current GPU-hour rates and pricing tiers
Learn how each parameter affects training quality and cost
Create your first RFT job
# RFT parameters reference
Source: https://docs.fireworks.ai/fine-tuning/rft-parameters-reference
Checkpoint, resume, and GRPO metrics fields for reinforcement fine-tuning recipes.
Use this page for training **checkpoint and resume** knobs and **GRPO metric interpretation** that are easy to miss when running reinforcement fine-tuning (RFT) and cookbook-driven training. For sampling and optimization hyperparameters (learning rate, epochs, temperature, KL targets, etc.), see [Parameter tuning](/fine-tuning/parameter-tuning).
The canonical cookbook reference for save, resume, and promote is [Checkpoints and Resume](/fine-tuning/training-api/cookbook/checkpoints). Low-level SDK APIs are documented in [Saving and loading](/fine-tuning/training-api/saving-and-loading).
## `dcp_save_interval`
Controls how often full training state (weights **and** optimizer) is checkpointed using DCP (Distributed Checkpoint) format.
| Property | Value |
| --------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Type** | `integer` |
| **Default** | `0` (disabled) |
| **Typical config — SFT / RL / DPO cookbooks** | `WeightSyncConfig(dcp_save_interval=N)` on the recipe `Config` (see [Cookbook: RL](/fine-tuning/training-api/cookbook/rl) and [Checkpoints](/fine-tuning/training-api/cookbook/checkpoints)) |
When set to `0` (the default), no periodic DCP checkpoints are written for resume. Only sampler and HuggingFace-format weight snapshots may be produced — these preserve model weights but **not optimizer state**.
When set to a positive integer `N`, a full DCP checkpoint is written every `N` steps.
**Why this matters:** If a training job is interrupted, optimizer state is lost unless `dcp_save_interval` is set to a positive value. The model resumes from the last checkpoint, but the optimizer re-initializes from scratch, which can affect training stability and effective learning rate.
### Example (cookbook `Config`)
```python theme={null}
from training.recipes.rl_loop import Config, main
from training.utils import WeightSyncConfig
cfg = Config(
    log_path="./grpo_logs",
    base_model="accounts/fireworks/models/qwen3-8b",
    # ... other fields ...
    weight_sync=WeightSyncConfig(dcp_save_interval=50),  # full checkpoint every 50 steps
    # ...
)
main(cfg)
```
Some internal or forked recipes may expose the same interval on a nested config type (for example a weight sync block). The field name is always `dcp_save_interval`; see your recipe’s `Config` dataclass for the exact attribute path.
### Job recovery and preemption
For transient control-plane or worker interruptions, the trainer job manager exposes [`reconnect_and_wait`](/fine-tuning/training-api/reference/trainer-job-manager) so your driver can wait for a resumable state and resume cleanly.
`load_state_with_optimizer()` only restores optimizer state from DCP-format checkpoints. If you point it at an HF or sampler snapshot, optimizer state silently won't be restored. Always load from the path returned by `save_state()` when you need full optimizer restore. See [Saving and loading](/fine-tuning/training-api/saving-and-loading#sampler-checkpoints).
***
## Metrics reference
### `ppo_kl` vs `ref_kld`
GRPO training logs two KL divergence metrics that measure different things:
| Metric | What it measures | Expected behavior |
| --------- | ----------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| `ppo_kl` | KL between the **current policy** and the **previous policy** (importance-sampling ratio inside the PPO clip objective) | Stays near `0` with one minibatch per rollout — this is correct, not a bug |
| `ref_kld` | KL between the **current policy** and the **reference (base) model** | Starts near `0`, increases gradually as the policy diverges from base during training |
**Which one to monitor:** `ref_kld` is the metric to watch for policy drift. A sudden large jump in `ref_kld` may indicate reward hacking or that the KL penalty coefficient needs tuning.
The cookbook does not always surface `ref_kld` by default. To add it, you can use the `k3` unbiased estimator:
```python theme={null}
ref_kld = (ref_logp - policy_logp).exp() - (ref_logp - policy_logp) - 1
```
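For readers without a tensor library at hand, the same expression can be written in plain Python. This is a sketch under stated assumptions: `k3_ref_kld` is a hypothetical name, the inputs are per-token log-probabilities of the sampled tokens, and averaging over tokens is an assumed reduction that the expression above does not specify.

```python
import math
from typing import Sequence


def k3_ref_kld(ref_logp: Sequence[float], policy_logp: Sequence[float]) -> float:
    """Mean k3 estimate of KL(policy || reference) over sampled tokens."""
    if len(ref_logp) != len(policy_logp) or not ref_logp:
        raise ValueError("inputs must be non-empty and the same length")
    total = 0.0
    for ref, pol in zip(ref_logp, policy_logp):
        diff = ref - pol  # log of the ratio ref_prob / policy_prob
        # exp(diff) - diff - 1 is zero when diff == 0 and positive otherwise.
        total += math.exp(diff) - diff - 1.0
    return total / len(ref_logp)
```

Each term is non-negative, so the estimate is exactly `0` when the policy and reference assign identical log-probabilities and grows as they diverge, which matches the expected behavior of `ref_kld` in the table above.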
# Ledger & Debugging for RL Rollouts
Source: https://docs.fireworks.ai/fine-tuning/rl-rollout-debugging
Inspect snapshot history, reset the ledger, and understand how in-flight requests behave during a weight swap.
**Early Access Feature.** This page is part of the same private-preview
external-bucket hot-load workflow for RL rollouts. Contact Fireworks to enable
this path on your account before using non-`FW_HOSTED` storage.
If you are using Fireworks-managed RLOR trainers with `FW_HOSTED`, the ledger
and checkpoint-swap behavior here still matter, but you can usually ignore the
external-bucket setup and manual upload/signaling details from the BYOT
integration guide.
A hot-load deployment maintains a **ledger** of every snapshot it has loaded, along with which replica finished which snapshot at what time. The ledger is the fastest way to answer "what weights is my deployment serving right now?" and to recover from a stuck state.
## Inspect snapshot history
Dump the ledger, sorted by most recent snapshot first:
```bash theme={null}
firectl get ledger
```
Each row shows the `identity` you signaled, whether it was a full or delta snapshot, the per-replica `readiness` transition timestamps, and any load error.
## Inspect deployment status and failures
If the deployment itself is unhealthy (crashlooping after a bad snapshot, out-of-memory on merge, etc.), the reason is on the deployment resource itself:
```bash theme={null}
firectl deployment get
```
Look at the `status`, `latestStatus.reason`, and the most recent ledger entry together to reason about whether the problem is load-side, weights-side, or infra-side.
### Snapshot config validation errors
Weight sync validates each snapshot's `config.json` against the deployment's base-model config before serving the snapshot. If validation fails, the snapshot is not loaded. Keep serving the previous ready snapshot, or fix the files and fall back to a new full snapshot.
Common messages include:
* `Extra base model config options` or `Extra snapshot model config options`: one config has a top-level field that the other does not.
* `Config value mismatch for `: both configs contain the field, but the values differ.
* `Types mismatch`: the snapshot config resolves to a different HuggingFace config class than the base model.
If the only difference is a known-safe additive metadata field, retry the weight sync request with `validation.extra_fields_ignore`, for example:
```json theme={null}
{
"identity": "version_002",
"validation": {
"extra_fields_ignore": ["snapshot_only_option"]
}
}
```
Important: Ignoring model-affecting fields can cause load or serving failures; only bypass known-safe metadata fields.
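To reason about which fields would trip validation before you signal a snapshot, you can diff the two configs yourself. The sketch below is a local approximation only: `compare_configs` is a hypothetical helper, it compares top-level keys and values, it reuses the message prefixes listed above, and it assumes that `extra_fields_ignore` applies to extra fields on either side. It does not cover the `Types mismatch` case and is not the server's actual validator.

```python
from typing import Any, Dict, Iterable, List


def compare_configs(
    base: Dict[str, Any],
    snapshot: Dict[str, Any],
    extra_fields_ignore: Iterable[str] = (),
) -> List[str]:
    """Return validation-style messages for top-level config differences."""
    ignore = set(extra_fields_ignore)
    errors: List[str] = []
    extra_base = sorted(k for k in base if k not in snapshot and k not in ignore)
    extra_snapshot = sorted(k for k in snapshot if k not in base and k not in ignore)
    if extra_base:
        errors.append(f"Extra base model config options: {extra_base}")
    if extra_snapshot:
        errors.append(f"Extra snapshot model config options: {extra_snapshot}")
    # Fields present in both configs must agree; the ignore list does not apply here.
    for key in sorted(set(base) & set(snapshot)):
        if base[key] != snapshot[key]:
            errors.append(f"Config value mismatch for {key}")
    return errors
```

Load both `config.json` files with `json.load` and pass the resulting dictionaries in. An empty list suggests the configs agree at the top level.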
## Reset the ledger
If the delta chain is wedged or you want to force the deployment back to the base model, you can clear server-side ledger history. This preserves the deployment itself; it just forgets every hot-loaded snapshot.
```bash theme={null}
curl -X DELETE \
https://api.fireworks.ai/v1/accounts//deployments//ledger \
-H "Authorization: Bearer "
```
After reset, your next signal must be a **full** snapshot (delta metadata will be rejected because there's nothing to diff against).
## Checkpoint-swap behavior
When you signal a new snapshot, Fireworks has to eventually swap weights on every replica. What happens to **in-flight** and **new** requests during the swap depends on which transition mode the deployment is configured with.
Both modes behave the same way for checkpoint download — it always starts immediately after the signal, in parallel with ongoing inference. The modes differ in how they handle the actual weight-swap moment.
Set the mode at deployment create time with `--hot-load-transition-type ASYNC` or `SYNC` (default `ASYNC`). See [Create a hot-load deployment](/fine-tuning/rl-rollout-integration#1-create-a-hot-load-deployment).
### Async transition (recommended, default for RL)
This mode is similar in spirit to [PipelineRL](https://arxiv.org/pdf/2509.19128):
* **In-flight requests**: paused for the duration of the swap, then resumed on the same HTTP connection. The active turn keeps its current KV state, so the request continues streaming instead of restarting.
* **New requests**: queued until the swap finishes. Clients observe this as elevated time-to-first-token (TTFT).
* **No 4xx or 5xx** is returned for the swap itself. Clients can set the `x-fireworks-hot-load-drain-timeout` request header, in seconds (default `90`), to receive HTTP `425 Too Early` once that timeout expires.
### Synchronous transition
* **In-flight requests**: the server waits for them to complete on the *old* weights before swapping.
* **New requests** arriving during the swap are rejected with HTTP `425 Too Early`. Your rollout client should back off and retry, ideally using the same session-affinity key so it lands on a replica that has already finished the swap.
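A rollout client can treat `425 Too Early` as a retryable status in either mode. The following is a minimal sketch, not a Fireworks client: `send` is a hypothetical callable that issues one request (reusing the same session-affinity key each time) and returns the HTTP status code, and the backoff parameters are arbitrary defaults.

```python
import random
import time
from typing import Callable

TOO_EARLY = 425


def send_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 6,
    base_delay_s: float = 0.5,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a status other than 425 or attempts run out."""
    status = TOO_EARLY
    for attempt in range(max_attempts):
        status = send()
        if status != TOO_EARLY:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter, capped at max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2**attempt)
            sleep(delay * random.uniform(0.5, 1.0))
    return status
```

The `sleep` parameter exists only so the helper can be exercised without real waiting; production code can leave it at the default.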
### Prompt cache reset behavior
`reset_prompt_cache` only affects what can be reused **after** the swap. It does not interrupt the **active turn** (the in-flight HTTP stream), but it affects the next turn in the same session and new sessions.
Configure per snapshot in `POST /hot_load/v1/models/hot_load`, for example `{ "identity": "version_002", "reset_prompt_cache": "new_session" }`.
For the full RL rollout mental model across active streams, session IDs, and reset options, see [KV cache behavior for RL rollouts](/guides/rollout-inference#kv-cache-behavior-for-rl-rollouts).
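As a small illustration, the signal body can be assembled and validated before it is posted. This sketch assumes the three option values listed in the hot-load documentation (`all`, `none`, `new_session`); the helper name is hypothetical.

```python
from typing import Optional

RESET_PROMPT_CACHE_VALUES = {"all", "none", "new_session"}


def build_signal_body(identity: str, reset_prompt_cache: Optional[str] = None) -> dict:
    """Return the JSON body for a full-snapshot signal."""
    if not identity or "/" in identity:
        raise ValueError("identity must be a single, non-empty path segment")
    body = {"identity": identity}
    if reset_prompt_cache is not None:
        if reset_prompt_cache not in RESET_PROMPT_CACHE_VALUES:
            raise ValueError(f"unsupported reset_prompt_cache: {reset_prompt_cache!r}")
        body["reset_prompt_cache"] = reset_prompt_cache
    return body
```

Omitting `reset_prompt_cache` leaves the field out of the body so the server-side default applies.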
## Need help?
If the ledger stops advancing, a snapshot never becomes ready, or the deployment stays unhealthy after you fall back to a full snapshot, contact Fireworks. Include the account ID, deployment ID, snapshot identity you tried to load, and the latest ledger output.
## Related pages
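* [RL Rollouts with Your Own Trainer](/fine-tuning/rl-rollout-integration): Prerequisites, deployment setup, and the hot-load API.
* [Incremental snapshots](/fine-tuning/rl-rollout-delta-checkpoints): ARC2 deltas, hints, and incremental signal bodies.
* [Inference for RL rollouts](/guides/rollout-inference): Session affinity, policy version in streams, and MoE Router Replay.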
# Incremental Snapshots (ARC2)
Source: https://docs.fireworks.ai/fine-tuning/rl-rollout-delta-checkpoints
Build ARC2 incremental checkpoints, use per-file hints, and signal delta hot-loads for BYOT RL rollout integrations.
**Early Access Feature.** This page is part of the same private-preview
external-bucket hot-load workflow for RL rollouts. Contact Fireworks to enable
this path on your account before using non-`FW_HOSTED` storage.
Start with the linear workflow in [RL Rollouts with Your Own
Trainer](/fine-tuning/rl-rollout-integration) if you have not completed a first
full snapshot and rollout yet.
Use **incremental snapshots** between full snapshots to reduce upload size and weight-update time. Each incremental snapshot is a compressed delta against a **previous snapshot identity** already loaded on the deployment.
Fireworks supports the public **ARC2** format (`compression_format: "arc_v2"`) with **Adler32** checksums (`checksum_format: "alder32"`).
## Snapshot cadence
| When | Snapshot type | Notes |
| -------------------- | --------------- | --------------------------------------------------------------------- |
| First training step | **Full** | HuggingFace layout under a new `identity` |
| Every 20th–30th step | **Full** | Resets the chain; faster recovery if a delta is corrupt |
| All other steps | **Incremental** | `previous_snapshot_identity` must match the snapshot currently served |
If an incremental hot-load fails or the chain is wedged, publish a new **full** snapshot and see [Ledger & debugging](/fine-tuning/rl-rollout-debugging).
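The cadence in the table can be expressed as a small decision function. This is a sketch under stated assumptions: steps are counted from 0, `full_every=25` is an arbitrary pick inside the 20 to 30 range, and `chain_broken` is a hypothetical flag your trainer sets after a failed incremental load.

```python
def snapshot_type(step: int, full_every: int = 25, chain_broken: bool = False) -> str:
    """Return "full" or "incremental" for a 0-based training step."""
    if full_every < 1:
        raise ValueError("full_every must be at least 1")
    # The first step, every full_every-th step, and any recovery step are full.
    if step == 0 or chain_broken or step % full_every == 0:
        return "full"
    return "incremental"
```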
## Why incremental?
* **Smaller uploads** — Typical compression ratios exceed 20× versus re-uploading full weights.
* **Faster loads** — Less data over the network; merge applies on replicas that already hold the previous snapshot.
* **Chain dependency** — Each incremental snapshot must reference the correct `previous_snapshot_identity` (the last successfully loaded snapshot).
## Create ARC2 deltas
Start from a **pair of consecutive full checkpoints** on disk (or as tensors in memory), then produce **diff safetensors** for the new step.
### Compression library
Use the Fireworks delta compression utilities. A reference implementation is available in this [GitHub gist](https://gist.github.com/ericwuatfirworks/b17ed8086cfe1b42caac556c3d364958) (`delta_compress_files_to_file`, `arc_v2`, `alder32`).
**Per-file example** (previous full snapshot `version_001`, new full snapshot `version_002_full`, upload diff as `version_002`):
```python theme={null}
from delta import delta_compress_files_to_file # from the gist / your vendored copy
delta_compress_files_to_file(
src="version_001/model-00000.safetensors",
dst="version_002_full/model-00000.safetensors",
diff_file="version_002/model-00000.safetensors",
compression_format="arc_v2",
)
```
Repeat for each safetensors shard (same filenames as the base layout). Copy non-weight files (for example `config.json`, tokenizer) from the new full tree into `version_002/` as needed.
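One way to script the per-shard loop is sketched below. The `compress` argument is expected to accept the same keyword arguments as the `delta_compress_files_to_file` call shown above; it is passed in rather than imported so the sketch does not depend on how you vendor the gist. The function name and directory handling are illustrative assumptions.

```python
import shutil
from pathlib import Path
from typing import Callable


def build_incremental_dir(
    prev_full: Path, new_full: Path, out_dir: Path, compress: Callable[..., None]
) -> list[str]:
    """Diff every safetensors shard and copy non-weight files; return shard names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    shards = []
    for new_file in sorted(new_full.iterdir()):
        if not new_file.is_file():
            continue
        if new_file.suffix == ".safetensors":
            # Same filename in the previous snapshot, the new one, and the diff.
            compress(
                src=str(prev_full / new_file.name),
                dst=str(new_file),
                diff_file=str(out_dir / new_file.name),
                compression_format="arc_v2",
            )
            shards.append(new_file.name)
        else:
            # config.json, tokenizer files, and similar are copied unchanged.
            shutil.copy2(new_file, out_dir / new_file.name)
    return shards
```

A call such as `build_incremental_dir(Path("version_001"), Path("version_002_full"), Path("version_002"), delta_compress_files_to_file)` would mirror the per-file example.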
If the previous checkpoint is already in trainer CPU memory, the gist also exposes
tensor-level helpers (`delta_compress_dicts`, etc.) so you can avoid writing full
intermediates to disk.
Upload only the **incremental directory** for the new `identity` (for example `s3://.../version_002/`). Do not re-upload the entire full checkpoint every step.
## Upload workflow
1. Build diffs with `arc_v2` for each `.safetensors` file.
2. Upload all files under the new `identity` prefix (same bucket parent as [snapshot layout](/fine-tuning/rl-rollout-integration#snapshot-layout)).
3. Optionally call [per-file hints](#per-file-hints-optional) as each file completes.
4. [Signal incremental ready](#signal-incremental-snapshot-ready) via `POST /hot_load`.
5. Poll `GET /hot_load` until all replicas are ready (same criteria as the [integration guide](/fine-tuning/rl-rollout-integration#poll-load-status)).
## Per-file hints (optional)
Hints let Fireworks start fetching and staging files before you signal the full snapshot. They are optional but recommended for large models.
**Endpoint:** `POST https://api.fireworks.ai/hot_load/v1/models/hot_load/hint`
**Headers:** Same as [hot-load API](/fine-tuning/rl-rollout-integration#hot-load-api) (`Authorization`, `fireworks-model`, `fireworks-deployment`).
**Full snapshot hint:**
```json theme={null}
{
"snapshot": { "identity": "version_001" },
"filename": "model-00000.safetensors"
}
```
**Incremental snapshot hint:**
```json theme={null}
{
"snapshot": {
"identity": "version_002",
"incremental_snapshot_metadata": {
"previous_snapshot_identity": "version_001",
"compression_format": "arc_v2",
"checksum_format": "alder32"
}
},
"filename": "model-00000.safetensors"
}
```
```bash theme={null}
curl -X POST https://api.fireworks.ai/hot_load/v1/models/hot_load/hint \
-H "Authorization: Bearer " \
-H "fireworks-model: accounts//models/" \
-H "fireworks-deployment: accounts//deployments/" \
-H "Content-Type: application/json" \
-d '{
"snapshot": {
"identity": "version_002",
"incremental_snapshot_metadata": {
"previous_snapshot_identity": "version_001",
"compression_format": "arc_v2",
"checksum_format": "alder32"
}
},
"filename": "model-00000.safetensors"
}'
```
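When hints are sent from an upload loop, it can help to build both body shapes from one helper. A minimal sketch; the function name is hypothetical and the field values are copied from the JSON examples above.

```python
from typing import Optional


def build_hint_body(
    identity: str, filename: str, previous_identity: Optional[str] = None
) -> dict:
    """Return a per-file hint body; pass previous_identity for incremental snapshots."""
    snapshot: dict = {"identity": identity}
    if previous_identity is not None:
        snapshot["incremental_snapshot_metadata"] = {
            "previous_snapshot_identity": previous_identity,
            "compression_format": "arc_v2",
            # Spelled exactly as in the request examples on this page.
            "checksum_format": "alder32",
        }
    return {"snapshot": snapshot, "filename": filename}
```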
## Signal incremental snapshot ready
After all files are uploaded, signal the deployment to load the incremental snapshot:
```bash theme={null}
curl -X POST https://api.fireworks.ai/hot_load/v1/models/hot_load \
-H "Authorization: Bearer " \
-H "fireworks-model: accounts//models/" \
-H "fireworks-deployment: accounts//deployments/" \
-H "Content-Type: application/json" \
-d '{
"identity": "version_002",
"incremental_snapshot_metadata": {
"previous_snapshot_identity": "version_001",
"compression_format": "arc_v2",
"checksum_format": "alder32"
},
"reset_prompt_cache": "all"
}'
```
* `previous_snapshot_identity`: The `identity` of the snapshot already loaded on the deployment (must exist in the ledger).
* `compression_format`: Use `"arc_v2"` for BYOT integrations.
* `checksum_format`: Use `"alder32"`.
* `reset_prompt_cache`: `all` (default), `none`, or `new_session`. See [KV cache behavior for RL rollouts](/guides/rollout-inference#kv-cache-behavior-for-rl-rollouts) for active stream, session ID, and reset-option semantics.
Poll until every replica has `readiness: true` and `current_snapshot_identity == "version_002"`.
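A polling loop for this condition might look like the sketch below. It assumes the `GET` response carries a `replicas` list whose entries expose `readiness` and `current_snapshot_identity`, as in the integration guide's polling example; `fetch_status` is a hypothetical callable that performs the `GET` and returns the parsed JSON, and the timeout and interval are arbitrary.

```python
import time
from typing import Callable


def all_replicas_ready(status: dict, identity: str) -> bool:
    """True when at least one replica exists and all serve the given identity."""
    replicas = status.get("replicas", [])
    return bool(replicas) and all(
        r.get("readiness") and r.get("current_snapshot_identity") == identity
        for r in replicas
    )


def wait_until_ready(
    fetch_status: Callable[[], dict],
    identity: str,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status() until every replica is ready or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if all_replicas_ready(status, identity):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"snapshot {identity!r} not ready after {timeout_s}s")
        sleep(interval_s)
```

On timeout, a reasonable fallback is to publish a full snapshot rather than keep retrying the same delta.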
## Reference
* Every snapshot needs a new `identity` (single directory name, no `/`).
* Point `previous_snapshot_identity` at the snapshot the deployment is serving before this load.
* Upload incremental **diff** safetensors under the new `identity`; keep periodic **full** snapshots for recovery.
## Related pages
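* [RL Rollouts with Your Own Trainer](/fine-tuning/rl-rollout-integration): Prerequisites, deployment setup, first full snapshot, and rollouts.
* [Ledger & debugging](/fine-tuning/rl-rollout-debugging): Inspect snapshot history and recover from a broken chain.
* [Inference for RL rollouts](/guides/rollout-inference): Session affinity, policy version, and MoE Router Replay.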
# RL Rollouts with Your Own Trainer
Source: https://docs.fireworks.ai/fine-tuning/rl-rollout-integration
Integrate an external RL trainer with Fireworks inference: hot-load new checkpoints from your bucket and run rollouts via the OpenAI-compatible API.
**Early Access Feature.** External-bucket hot-load for RL rollouts is a
private preview. [Contact Fireworks](https://fireworks.ai/contact) to enable
this path on your account before you use `S3`, `MINIO`, `NEBIUS`, or similar
non-`FW_HOSTED` storage.
**Using a code agent?** Follow sections in order: [Prerequisites](#prerequisites)
→ [Quickstart checklist](#quickstart-checklist) → [Hot-load API](#hot-load-api).
Required env: `FIREWORKS_API_KEY`. After your first full snapshot is serving,
read [Incremental snapshots](/fine-tuning/rl-rollout-delta-checkpoints) before
production training loops. For active stream, session ID, and `reset_prompt_cache`
semantics, see [KV cache behavior for RL rollouts](/guides/rollout-inference#kv-cache-behavior-for-rl-rollouts).
For ledger and hot-load status debugging, see [Ledger & debugging](/fine-tuning/rl-rollout-debugging).
This guide is for teams that already run their own RL trainer (PyTorch FSDP, Megatron, a custom Ray cluster, etc.) and want Fireworks for large-scale inference during rollouts.
## Is this the right guide?
| Path | You own | Fireworks owns |
| ------------------------------------------------------------ | -------------------------------------------------------- | --------------------------------------------------------------------------------- |
| **This guide (BYOT rollouts)** | Trainer, rewards, environment, checkpoint upload cadence | Hot-load deployment, distributed weight swap, inference, KV cache across rollouts |
| [Training API](/fine-tuning/training-api/introduction) | Training logic (recipes or SDK) | GPUs, trainer lifecycle, often `FW_HOSTED` bucket |
| [Managed RFT](/fine-tuning/reinforcement-fine-tuning-models) | Dataset and evaluator | End-to-end hosted RL |
**Why BYOT rollout inference?**
* **Disaggregated:** Your trainer and rollout cluster can run in different regions or clouds; deployments can span multiple regions to pool capacity.
* **Full-parameter scale:** Full (non-LoRA) tuning for large models is supported on Fireworks inference shapes.
* **Fast checkpoint transfer:** Lossless compressed incremental snapshots (`arc_v2`, typically 20×+ compression) over standard object storage—no special RDMA networking between trainer and inference.
* **Async / off-policy friendly:** Background download during rollouts; configurable swap semantics similar in spirit to [PipelineRL](https://arxiv.org/pdf/2509.19128)—see [checkpoint-swap behavior](/fine-tuning/rl-rollout-debugging#checkpoint-swap-behavior).
For **Online RL** (live user traffic as rollouts with rolling per-replica updates), the same hot-load infrastructure applies; contact Fireworks for production Online RL setup.
## Placeholders
Reuse these values in every command below:
| Placeholder | Example |
| -------------------------------------- | ----------------------------------------------------------------- |
| `` | `my-team` |
| `` | `qwen3-30b-a3b` |
| `` | `rl-rollout-prod` |
| `` | From [API keys](https://app.fireworks.ai/settings/users/api-keys) |
| `` / `` | Parent prefix configured on the deployment (no trailing slash) |
| `` | Snapshot directory name, e.g. `version_001` (no slashes) |
## Prerequisites
Complete this checklist before creating a deployment:
1. **Fireworks account** and **API key** — [create a key](https://app.fireworks.ai/settings/users/api-keys) and set `export FIREWORKS_API_KEY=""`.
2. **Account ID** — In the [dashboard](https://app.fireworks.ai/), open your account settings or any resource URL; the account slug is the segment after `/accounts/` (for example `accounts//...`).
3. **Feature enablement** — Request **external-bucket hot-load for RL rollouts** on account ``, including your bucket provider (`S3`, `GCS`/`gs://`, or `NEBIUS`).
4. **Object storage read access for Fireworks** — Fireworks needs read-only access to the bucket prefix you will pass as `--hot-load-bucket-url`. At enablement, Fireworks shares the IAM principal to grant access. Typical setup:
* **Amazon S3:** Grant the Fireworks principal `s3:GetObject` (and `s3:ListBucket` on the prefix) on `s3:////*`.
* **Google Cloud Storage:** Grant `roles/storage.objectViewer` on the bucket or prefix to the Fireworks service account provided at onboarding.
* **Nebius / MinIO:** Equivalent read-only credentials or access key scoped to the upload prefix.
5. **`firectl` installed** — See [firectl](/tools-sdks/firectl/firectl).
6. **Base model and deployment shape** — An RL-capable shape for your model (GPU count, precision). If you omit `--deployment-shape`, `firectl` prompts you to pick one interactively.
## Architecture
```mermaid theme={null}
flowchart LR
trainer["Your RL Trainer"] -->|"1. Upload checkpoint"| bucket[("External bucket")]
trainer -->|"2. Signal snapshot ready"| api["Fireworks Hot-Load API"]
api -->|"3. Load weights"| deployment["Inference Deployment"]
trainer -->|"4. Rollout via /v1/completions"| deployment
deployment -->|"Tokens + optional routing_matrix"| trainer
```
**You own:** trainer, reward shaping, checkpoint cadence, rollout orchestration.
**Fireworks owns:** hot-load logistics, distributed weight swap, inference serving, KV cache across rollouts.
## End-to-end loop
1. Create a hot-load deployment.
2. Upload and hot-load an initial **full** snapshot.
3. Run rollouts against that snapshot.
4. For each training step: upload and hot-load the next **incremental** snapshot (see [Incremental snapshots](/fine-tuning/rl-rollout-delta-checkpoints)).
5. Run rollouts again.
6. Every 20th or 30th step, publish a **full** snapshot instead of an incremental one. If the incremental chain fails, fall back to a full snapshot.
## Quickstart checklist
Use this table for your **first** rollout end-to-end:
| Step | Action | Done when |
| ---- | -------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- |
| 1 | [Create hot-load deployment](#1-create-a-hot-load-deployment) | `firectl deployment get ` shows a healthy deployment |
| 2 | [Upload full HF snapshot](#2-upload-and-hot-load-an-initial-full-snapshot) | All files exist under `...//` in object storage |
| 3 | `POST` [signal snapshot](#hot-load-api) | HTTP 200 |
| 4 | `GET` [poll status](#hot-load-api) | Every replica has `readiness: true` and `current_snapshot_identity` matches your `identity` |
| 5 | [Run rollouts](#3-run-rollouts) | Chat/completions returns tokens |
## 1. Create a hot-load deployment
Create the deployment that will serve rollouts. During preview, `--enable-hot-load` and the other `--hot-load-*` flags may be hidden from CLI help but can still be passed explicitly.
```bash theme={null}
firectl create deployment accounts//models/ \
--deployment-shape \
--deployment-id \
--enable-hot-load \
--hot-load-bucket-type S3 \
--hot-load-bucket-url s3:/// \
--hot-load-transition-type ASYNC \
--region US_OHIO_1
```
**Flags**
* `--deployment-shape` — Optional. If omitted, `firectl` prompts you to pick one.
* `--hot-load-bucket-type` — `MINIO`, `S3`, `NEBIUS`, or `FW_HOSTED`. This guide focuses on external buckets (`S3`, `gs://`, etc.). `FW_HOSTED` is for Fireworks-managed trainers.
* `--hot-load-bucket-url` — Required when `--enable-hot-load` is set. Examples: `s3://mybucket/path`, `gs://mybucket/path`. **No trailing slash.** This is the **parent prefix**; each snapshot is a subdirectory named by `identity` (see [snapshot layout](#snapshot-layout)).
* `--hot-load-transition-type` — `ASYNC` (recommended for RL) or `SYNC`. Defaults to `ASYNC` when hot load is enabled. See [checkpoint-swap behavior](/fine-tuning/rl-rollout-debugging#checkpoint-swap-behavior).
* `--region` — Where the deployment runs (for example `US_OHIO_1`, `US_VIRGINIA_1`). Keep the trainer upload path geographically close to the bucket and deployment.
Save the **account ID**, **deployment ID**, and **model ID** from the output for hot-load and rollout calls.
If you do not set a shape, the CLI shows an interactive shape picker.
## 2. Upload and hot-load an initial full snapshot
Upload a full HuggingFace-format checkpoint, then signal Fireworks to load it.
### Snapshot layout
Place each snapshot under its own subdirectory. The `identity` you signal in the API must match the directory name (a single path segment—no slashes):
```
s3://///
├── config.json
├── tokenizer.json
├── tokenizer_config.json
├── model-00000.safetensors
├── model-00001.safetensors
└── ...
```
Example with the recommended path pattern:
```
s3:////-/version_001/
```
* **`identity` / ``** — Any opaque string (for example `version_001` or `step_00100`).
* **Format** — Same layout as the base model on HuggingFace: `config.json`, tokenizer files, and safetensors weights. **No tensor-parallel sharding** in uploaded files.
* **File size** — Split weights into multiple `.safetensors` files, each under about 5 GB. Group weights by layer when possible; putting one layer per file minimizes load time.
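The file-size guidance can be turned into a shard plan before any weights are written. The sketch below is illustrative only: it assumes HuggingFace-style tensor names containing `.layers.<n>.`, treats "about 5 GB" as 5 × 10⁹ bytes, and only plans which tensor goes in which file (it does not write safetensors).

```python
import re
from collections import defaultdict

MAX_SHARD_BYTES = 5 * 10**9  # assumption: "about 5 GB" read as 5e9 bytes
_LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def plan_shards(
    tensor_bytes: dict[str, int], max_bytes: int = MAX_SHARD_BYTES
) -> dict[str, list[str]]:
    """Map shard filename -> tensor names, one layer per file where it fits."""
    groups: dict[int, list[str]] = defaultdict(list)
    for name in tensor_bytes:
        match = _LAYER_RE.search(name)
        # Non-layer tensors (embeddings, final norm, lm_head) share group -1.
        groups[int(match.group(1)) if match else -1].append(name)

    shards: list[list[str]] = []
    for key in sorted(groups):
        current: list[str] = []
        current_bytes = 0
        for name in groups[key]:
            size = tensor_bytes[name]
            # Start a new file when adding this tensor would exceed the limit.
            if current and current_bytes + size > max_bytes:
                shards.append(current)
                current, current_bytes = [], 0
            current.append(name)
            current_bytes += size
        shards.append(current)
    return {f"model-{i:05d}.safetensors": names for i, names in enumerate(shards)}
```

A single tensor larger than the limit still gets its own file; the plan never splits a tensor.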
Optional: call the [per-file hint API](/fine-tuning/rl-rollout-delta-checkpoints#per-file-hints-optional) as each file lands to speed up loading on large models.
### Signal and poll
Use the [Hot-load API](#hot-load-api) below with `{ "identity": "" }` and poll until all replicas are ready.
## Hot-load API
All hot-load requests use these headers:
| Header | Value |
| ---------------------- | --------------------------------------------------- |
| `Authorization` | `Bearer ` |
| `fireworks-model` | `accounts//models/` |
| `fireworks-deployment` | `accounts//deployments/` |
| `Content-Type` | `application/json` |
| Operation | Method | URL |
| ------------------------ | ------ | ----------------------------------------------------------- |
| Signal snapshot ready | `POST` | `https://api.fireworks.ai/hot_load/v1/models/hot_load` |
| Poll load status | `GET` | `https://api.fireworks.ai/hot_load/v1/models/hot_load` |
| Per-file hint (optional) | `POST` | `https://api.fireworks.ai/hot_load/v1/models/hot_load/hint` |
### Signal snapshot ready
**Full snapshot** body:
```json theme={null}
{ "identity": "version_001" }
```
**Incremental snapshot** bodies, compression, hints, and `checksum_format` are documented in [Incremental snapshots](/fine-tuning/rl-rollout-delta-checkpoints).
* `identity`: Snapshot directory name under the configured bucket prefix. Must not contain `/`.
* `incremental_snapshot_metadata`: Required for incremental snapshots. Includes `previous_snapshot_identity`, `compression_format` (`arc_v2`), and `checksum_format` (`alder32`). See the incremental snapshots guide.
* `reset_prompt_cache`: Prompt-cache policy after the swap: `all` (default), `none`, or `new_session`. See [KV cache behavior for RL rollouts](/guides/rollout-inference#kv-cache-behavior-for-rl-rollouts) for active stream, session ID, and reset-option semantics.
* Ignored config fields (optional): Top-level `config.json` fields to ignore during snapshot validation. Only use for known-safe metadata fields.
```bash theme={null}
curl -X POST https://api.fireworks.ai/hot_load/v1/models/hot_load \
-H "Authorization: Bearer " \
-H "fireworks-model: accounts//models/" \
-H "fireworks-deployment: accounts//deployments/" \
-H "Content-Type: application/json" \
-d '{ "identity": "version_001" }'
```
```python theme={null}
import os
import requests
API_KEY = os.environ["FIREWORKS_API_KEY"]
ACCOUNT = ""
MODEL = f"accounts/{ACCOUNT}/models/"
DEPLOYMENT = f"accounts/{ACCOUNT}/deployments/"
HEADERS = {
"Authorization": f"Bearer {API_KEY}",
"fireworks-model": MODEL,
"fireworks-deployment": DEPLOYMENT,
"Content-Type": "application/json",
}
resp = requests.post(
"https://api.fireworks.ai/hot_load/v1/models/hot_load",
headers=HEADERS,
json={"identity": "version_001"},
timeout=60,
)
resp.raise_for_status()
```
### Poll load status
Poll until **every** replica has `readiness: true` and `current_snapshot_identity` equals the `identity` you signaled.
```bash theme={null}
curl https://api.fireworks.ai/hot_load/v1/models/hot_load \
-H "Authorization: Bearer " \
-H "fireworks-model: accounts//models/" \
-H "fireworks-deployment: accounts//deployments/"
```
```python theme={null}
status = requests.get(
"https://api.fireworks.ai/hot_load/v1/models/hot_load",
headers=HEADERS,
timeout=30,
).json()
replicas = status.get("replicas", [])
# Ready only when at least one replica exists and every replica serves the new snapshot.
ready = bool(replicas) and all(
    r.get("readiness") and r.get("current_snapshot_identity") == "version_001"
    for r in replicas
)
```
### When to start rollouts
* **Default (on-policy):** Wait until all replicas report readiness on the new `identity`.
* **Off-policy / higher utilization:** You may start sending rollouts when a **subset** of replicas is ready—inspect each entry in `replicas` in the `GET` response. Stale-policy rollouts are expected; use async transition mode and monitor policy version in streaming responses (see [Policy version in responses](/guides/rollout-inference#policy-version-in-responses)).
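Both policies reduce to a threshold on the fraction of ready replicas. A minimal sketch, assuming the `replicas` response shape used in the polling example above; the function names and the threshold parameter are hypothetical.

```python
def ready_fraction(status: dict, identity: str) -> float:
    """Fraction of replicas that report readiness on the given snapshot identity."""
    replicas = status.get("replicas", [])
    if not replicas:
        return 0.0
    ready = sum(
        1
        for r in replicas
        if r.get("readiness") and r.get("current_snapshot_identity") == identity
    )
    return ready / len(replicas)


def may_start_rollouts(status: dict, identity: str, min_ready_fraction: float = 1.0) -> bool:
    """On-policy uses 1.0; off-policy setups can pass a lower threshold."""
    if not 0.0 < min_ready_fraction <= 1.0:
        raise ValueError("min_ready_fraction must be in (0, 1]")
    return ready_fraction(status, identity) >= min_ready_fraction
```

With a threshold below `1.0`, some rollouts will come from the previous policy, so record the policy version returned with each response.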
Per-file hints are optional but recommended for large checkpoints—see [Incremental snapshots](/fine-tuning/rl-rollout-delta-checkpoints#per-file-hints-optional).
## 3. Run rollouts
Call the OpenAI-compatible inference API. For multi-turn RL, set session headers so KV cache stays on one replica:
```bash theme={null}
curl https://api.fireworks.ai/inference/v1/chat/completions \
-H "Authorization: Bearer " \
-H "fireworks-model: accounts//models/" \
-H "fireworks-deployment: accounts//deployments/" \
-H "x-multi-turn-session-id: " \
-H "x-session-affinity: " \
-H "Content-Type: application/json" \
-d '{
"model": "accounts//models/",
"messages": [{"role": "user", "content": "..."}]
}'
```
See [Inference for RL rollouts](/guides/rollout-inference) for session affinity, weight-swap behavior, MoE Router Replay (R3), and policy-version fields.
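The same request can be assembled in Python. This sketch only builds the URL, headers, and body; it assumes one session ID per episode is reused for both session headers (the header values are not specified above), and the function name is hypothetical.

```python
import json

CHAT_COMPLETIONS_URL = "https://api.fireworks.ai/inference/v1/chat/completions"


def build_rollout_request(
    account_id: str,
    model_id: str,
    deployment_id: str,
    api_key: str,
    session_id: str,
    messages: list,
) -> tuple:
    """Return (url, headers, body_bytes) for one multi-turn rollout request."""
    model = f"accounts/{account_id}/models/{model_id}"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "fireworks-model": model,
        "fireworks-deployment": f"accounts/{account_id}/deployments/{deployment_id}",
        # Assumption: the same per-episode ID is used for both session headers.
        "x-multi-turn-session-id": session_id,
        "x-session-affinity": session_id,
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return CHAT_COMPLETIONS_URL, headers, body
```

Keeping the session ID stable across every turn of an episode is what lets the KV cache stay on one replica.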
## Steady-state training loop
After the first full snapshot:
1. **Intermediate steps** — Build and upload an [incremental snapshot](/fine-tuning/rl-rollout-delta-checkpoints) (`arc_v2`), signal with `incremental_snapshot_metadata`, poll until ready, then run rollouts.
2. **Every 20th or 30th step** — Publish a new **full** snapshot for faster recovery and chain reset.
3. **On failure** — Fall back to a full snapshot; see [Ledger & debugging](/fine-tuning/rl-rollout-debugging).
Brief incremental signal example (full details on the incremental page):
```bash theme={null}
curl -X POST https://api.fireworks.ai/hot_load/v1/models/hot_load \
-H "Authorization: Bearer " \
-H "fireworks-model: accounts//models/" \
-H "fireworks-deployment: accounts//deployments/" \
-H "Content-Type: application/json" \
-d '{
"identity": "version_002",
"incremental_snapshot_metadata": {
"previous_snapshot_identity": "version_001",
"compression_format": "arc_v2",
"checksum_format": "alder32"
}
}'
```
## Numerics alignment
For best training–inference alignment:
* Match **quantization / precision** between trainer checkpoints and the deployment shape (work with Fireworks if you need a custom shape).
* Measure **logprob divergence** between trainer forward passes and rollout inference on the same tokens.
* For MoE models, use **Router Replay (R3)** during rollouts—see [MoE Router Replay](/guides/rollout-inference#moe-router-replay).
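Logprob divergence can be tracked with a few summary statistics over per-token logprobs of the same sampled tokens. The metric names below are illustrative rather than a Fireworks convention; for tokens sampled by the inference engine, the negated `mean_log_ratio` is a Monte Carlo estimate of KL(inference ‖ trainer).

```python
from typing import Sequence


def logprob_divergence(
    trainer_logprobs: Sequence[float], inference_logprobs: Sequence[float]
) -> dict:
    """Summarize per-token differences between trainer and inference logprobs."""
    if not trainer_logprobs or len(trainer_logprobs) != len(inference_logprobs):
        raise ValueError("need two non-empty, equal-length logprob sequences")
    # Per-token log ratio: log p_trainer(token) - log p_inference(token).
    diffs = [t - i for t, i in zip(trainer_logprobs, inference_logprobs)]
    n = len(diffs)
    return {
        "mean_abs_diff": sum(abs(d) for d in diffs) / n,
        "max_abs_diff": max(abs(d) for d in diffs),
        "mean_log_ratio": sum(diffs) / n,
    }
```

A sudden jump in `max_abs_diff` after a hot-load is a useful signal to compare precision settings between the checkpoint and the deployment shape.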
## Next steps
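* [Incremental snapshots](/fine-tuning/rl-rollout-delta-checkpoints): Build ARC2 deltas, per-file hints, and incremental signal bodies.
* [Ledger & debugging](/fine-tuning/rl-rollout-debugging): Inspect snapshot history, reset the ledger, and reason about request behavior during weight swaps.
* [Inference for RL rollouts](/guides/rollout-inference): Session affinity headers, policy version in streams, weight-swap behavior, and MoE Router Replay (R3).
* [Training API](/fine-tuning/training-api/introduction): The alternative path where Fireworks runs the trainer through the Training API.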
# Secure Training (BYOB)
Source: https://docs.fireworks.ai/fine-tuning/secure-fine-tuning
Fine-tune models while keeping sensitive data and components under your control
Fireworks enables secure model fine-tuning while maintaining customer control over sensitive components and data. Use your own cloud storage, keep reward functions proprietary, and ensure training data never persists on our platform beyond active workflows.
## Dataset Storage (BYOB)
Point Fireworks to your own cloud storage for training datasets. This applies to both Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) jobs.
Grant least-privilege IAM to only the bucket/path prefixes needed for training. Use server-side encryption and your KMS policies where required.
### GCS Bucket Integration
Use external Google Cloud Storage (GCS) buckets for fine-tuning while keeping your data private. Fireworks creates proxy datasets that reference your external buckets—data is only accessed during fine-tuning within a secure, isolated cluster.
Your data never leaves your GCS bucket except during fine-tuning, ensuring maximum privacy and security.
#### Required Permissions
You need to grant access to three service accounts:
**Fireworks Control Plane**
* **Account**: `fireworks-control-plane@fw-ai-cp-prod.iam.gserviceaccount.com`
* **Required role**: Custom role with `storage.buckets.getIamPolicy` permission
```bash theme={null}
gcloud storage buckets add-iam-policy-binding \
--member=serviceAccount:fireworks-control-plane@fw-ai-cp-prod.iam.gserviceaccount.com \
--role=projects//roles/
```
**Inference Service Account**
* **Account**: `inference@fw-ai-cp-prod.iam.gserviceaccount.com`
* **Required role**: Storage Object Viewer (`roles/storage.objectViewer`)
```bash theme={null}
gcloud storage buckets add-iam-policy-binding \
--member=serviceAccount:inference@fw-ai-cp-prod.iam.gserviceaccount.com \
--role=roles/storage.objectViewer
```
**Your Company's Fireworks Service Account**
* **Account**: Your company's Fireworks account email (get it with `firectl account get`)
* **Required role**: Storage Object Viewer (`roles/storage.objectViewer`)
```bash theme={null}
gcloud storage buckets add-iam-policy-binding \
--member=serviceAccount: \
--role=roles/storage.objectViewer
```
#### Usage
```bash theme={null}
# Create dataset referencing your GCS bucket
firectl dataset create {DATASET_NAME} --external-url gs://bucket-name/path/to/data.jsonl
# Use in fine-tuning job
firectl sftj create \
--dataset "accounts/{ACCOUNT}/datasets/{DATASET_NAME}" \
--base-model "accounts/fireworks/models/{MODEL}" \
--output-model {TRAINED_MODEL_NAME}
```
### AWS S3 Bucket Integration
Use external AWS S3 buckets for fine-tuning while keeping your data private. Fireworks accesses your S3 data using GCP-to-AWS OIDC federation—no long-lived credentials are stored.
S3 bucket integration is currently supported for **training datasets only** (SFT and RFT jobs). Evaluation datasets are not yet supported.
#### IAM Role Setup
Create an IAM role with a trust policy that allows Fireworks to assume it via web identity federation:
* **Federated Principal:** `accounts.google.com`
* **Action:** `sts:AssumeRoleWithWebIdentity`
* **Condition:** `accounts.google.com:aud` equals `117388763667264115668`
Then attach a policy granting `s3:GetObject` and `s3:ListBucket` on your bucket.
See the [AWS documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp_oidc.html) for detailed steps on creating roles for OIDC federation.
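For reference, the trust policy and read policy described above can be generated as JSON documents. This is a sketch using standard IAM policy syntax: it maps "equals" to a `StringEquals` condition and scopes access to the whole bucket, which you may want to narrow to a prefix. The function names are hypothetical.

```python
FIREWORKS_AUDIENCE = "117388763667264115668"


def build_trust_policy(audience: str = FIREWORKS_AUDIENCE) -> dict:
    """Trust policy allowing web identity federation from accounts.google.com."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": "accounts.google.com"},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {"StringEquals": {"accounts.google.com:aud": audience}},
            }
        ],
    }


def build_read_policy(bucket: str) -> dict:
    """Read-only permissions: GetObject on objects, ListBucket on the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "s3:GetObject", "Resource": f"arn:aws:s3:::{bucket}/*"},
            {"Effect": "Allow", "Action": "s3:ListBucket", "Resource": f"arn:aws:s3:::{bucket}"},
        ],
    }
```

Serialize either result with `json.dumps(policy, indent=2)` and pass it to your usual IAM tooling.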
#### Usage
```bash theme={null}
# Create dataset referencing your S3 bucket
firectl dataset create {DATASET_NAME} --external-url s3://bucket-name/path/to/data.jsonl
# Use in fine-tuning job with IAM role
firectl sftj create \
--dataset "accounts/{ACCOUNT}/datasets/{DATASET_NAME}" \
--base-model "accounts/fireworks/models/{MODEL}" \
--output-model {TRAINED_MODEL_NAME} \
--aws-iam-role "arn:aws:iam::{AWS_ACCOUNT_ID}:role/{ROLE_NAME}"
```
For RFT jobs, use `firectl rftj create` with the same `--aws-iam-role` flag.
#### Alternative: Credentials Secret
Instead of IAM role federation, you can use static AWS access keys stored in a Fireworks secret:
```bash theme={null}
# Create secret
firectl secret create --name aws-creds \
--aws-access-key-id "AKIA..." \
--aws-secret-access-key "..."
# Use in fine-tuning job
firectl sftj create \
--dataset "accounts/{ACCOUNT}/datasets/{DATASET_NAME}" \
--base-model "accounts/fireworks/models/{MODEL}" \
--output-model {TRAINED_MODEL_NAME} \
--aws-credentials-secret "accounts/{ACCOUNT}/secrets/aws-creds"
```
IAM role federation is recommended for production. If using credentials, rotate them regularly.
### Azure Blob Storage Integration
Use external Azure Blob Storage containers for fine-tuning while keeping your data private. Fireworks accesses your Azure data using GCP-to-Azure Workload Identity Federation—no long-lived credentials are stored.
Azure Blob Storage integration is currently supported for **training datasets only** (SFT and RFT jobs). Evaluation datasets are not yet supported.
#### Federated Identity Setup
Create an App Registration (or user-assigned Managed Identity) in your Azure AD tenant with a federated credential that trusts the Fireworks GCP service account:
* **Issuer:** `https://accounts.google.com`
* **Subject identifier:** `117388763667264115668`
This is the Fireworks GCP service account subject; use it as shown when configuring the federated credential.
* **Audience:** `api://AzureADTokenExchange`
Then assign the **Storage Blob Data Reader** role on your storage account or container to the app registration.
See the [Azure documentation](https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation-create-trust) for detailed steps on configuring workload identity federation.
#### Usage
```bash theme={null}
# Create dataset referencing your Azure Blob container
firectl dataset create {DATASET_NAME} \
--external-url https://{STORAGE_ACCOUNT}.blob.core.windows.net/{CONTAINER}/path/to/data.jsonl
# Use in fine-tuning job with managed identity federation
firectl sftj create \
--dataset "accounts/{ACCOUNT}/datasets/{DATASET_NAME}" \
--base-model "accounts/fireworks/models/{MODEL}" \
--output-model {TRAINED_MODEL_NAME} \
--azure-managed-identity-client-id "{MANAGED_IDENTITY_CLIENT_ID}" \
--azure-tenant-id "{AZURE_TENANT_ID}"
```
For RFT jobs, use `firectl rftj create` with the same `--azure-managed-identity-client-id` and `--azure-tenant-id` flags.
#### Alternative: Credentials Secret
Instead of workload identity federation, you can store Azure credentials in a Fireworks secret. The secret value must be a JSON object containing one of: `connection_string`, `sas_token`, or `account_key`.
```bash theme={null}
# Create secret with Azure credentials
firectl secret create --name azure-creds \
--value '{"sas_token": "sv=2023-01-03&ss=b&srt=o&sp=rl&se=..."}'
# Use in fine-tuning job
firectl sftj create \
--dataset "accounts/{ACCOUNT}/datasets/{DATASET_NAME}" \
--base-model "accounts/fireworks/models/{MODEL}" \
--output-model {TRAINED_MODEL_NAME} \
--azure-credentials-secret "accounts/{ACCOUNT}/secrets/azure-creds"
```
Workload Identity Federation is recommended for production. If using credentials, rotate them regularly.
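If you script secret creation, the JSON value can be generated and checked first. This sketch reads "one of" as exactly one credential key per secret, which is an assumption; the helper name is hypothetical.

```python
import json

AZURE_SECRET_KEYS = ("connection_string", "sas_token", "account_key")


def build_azure_secret_value(kind: str, value: str) -> str:
    """Return the JSON string to pass as the secret value."""
    if kind not in AZURE_SECRET_KEYS:
        raise ValueError(f"kind must be one of {AZURE_SECRET_KEYS}, got {kind!r}")
    if not value:
        raise ValueError("credential value must be non-empty")
    # Assumption: the secret holds exactly one credential key.
    return json.dumps({kind: value})
```

The returned string is what you would pass to `--value` in the `firectl secret create` command above.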
## Secure Reinforcement Fine-Tuning (RFT)
Use reinforcement fine-tuning while keeping sensitive components and data under your control. Follow these steps to run secure RFT end to end using your own storage and reward pipeline.
Set up your dataset storage using [GCS](#gcs-bucket-integration), [AWS S3](#aws-s3-bucket-integration), or [Azure Blob Storage](#azure-blob-storage-integration) as described above.
For models, you can optionally use [External AWS S3 Bucket Integration](/models/uploading-custom-models#uploading-your-model).
Keep your reward functions, rollout servers, and training metrics under your control. Generate rewards from your environment and write them to examples in your dataset (or export a dataset that contains per-example rewards).
* Reward functions and reward models remain proprietary and never need to be shared
* Rollouts and evaluation infrastructure run in your environment
* Model checkpoints can be registered to your storage registry if desired
Create or point a `Dataset` at your BYOB storage. Ensure each example contains the information required by your reward pipeline (for example, prompts, outputs/trajectories, and numeric rewards).
You can reuse existing supervised data by attaching reward signals produced by your pipeline, or export a fresh dataset into your bucket for consumption by RFT.
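As a rough illustration of attaching reward signals to existing supervised examples, the sketch below adds a numeric reward to each example and serializes the result as JSONL. The field names (`prompt`, `output`, `expected_answer`), the helper names, and the exact-match reward are all hypothetical; `score` is used only because it matches the `reward_weights=["score"]` example on this page. Your reward pipeline and dataset schema may differ.

```python
import json
from typing import Callable


def attach_rewards(
    examples: list[dict],
    reward_fn: Callable[[dict], float],
    reward_field: str = "score",
) -> list[dict]:
    """Return copies of the examples with a numeric reward under `reward_field`."""
    return [{**ex, reward_field: float(reward_fn(ex))} for ex in examples]


def exact_match_reward(example: dict) -> float:
    """Hypothetical reward: 1.0 when the output contains the expected answer."""
    return 1.0 if example["expected_answer"] in example["output"] else 0.0


examples = [
    {"prompt": "What is 6 * 7?", "output": "6 * 7 = 42", "expected_answer": "42"},
    {"prompt": "What is 18 + 24?", "output": "41", "expected_answer": "42"},
]
rows = attach_rewards(examples, exact_match_reward)

# One JSON object per line, ready to upload to your own bucket.
jsonl = "\n".join(json.dumps(row) for row in rows) + "\n"
print(jsonl, end="")
```

The reward function runs entirely in your environment; only the resulting numeric field needs to appear in the dataset.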
Use the Python SDK to create a reinforcement fine-tuning step that reads from your BYOB dataset and produces a new checkpoint.
```python theme={null}
import time

from fireworks import Fireworks

client = Fireworks()

# Create a reinforcement fine-tuning step
step = client.reinforcement_fine_tuning_steps.create(
    rlor_trainer_job_id="my-rft-job-001",
    display_name="Secure RFT Training Step",
    training_config={
        "base_model": "accounts/fireworks/models/{BASE_MODEL}",
        "learning_rate": 1e-5,
        "lora_rank": 8,
        "max_context_length": 4096,
        "batch_size": 32768,
    },
    dataset="accounts/{ACCOUNT}/datasets/{DATASET_NAME}",  # Your BYOB dataset with rewards
    output_model="accounts/{ACCOUNT}/models/my-improved-model-v1",
    reward_weights=["score"],  # Field name for rewards in your dataset
)

# Poll for completion
timeout = 3600  # 1 hour timeout
start_time = time.time()
while True:
    if time.time() - start_time > timeout:
        raise TimeoutError(f"Job polling timed out after {timeout} seconds")
    job = client.reinforcement_fine_tuning_steps.get(
        rlor_trainer_job_id="my-rft-job-001"
    )
    if job.state == "JOB_STATE_COMPLETED":
        print("Training complete!")
        break
    elif job.state in ("JOB_STATE_FAILED", "JOB_STATE_CANCELLED"):
        raise RuntimeError(f"Training failed: {job.state}")
    time.sleep(10)
```
See the [Create Reinforcement Fine-tuning Step API reference](/api-reference/create-reinforcement-fine-tuning-step) for full parameters and options.
For a complete iterative RL workflow example using the [Python SDK](/tools-sdks/python-sdk), including rollout generation, reward computation, and hot-reloading LoRA adapters, see the [iterative RL workflow example on GitHub](https://github.com/fw-ai-external/python-sdk/tree/main/examples/iterative_rl_workflow).
When continuing from a LoRA checkpoint, training parameters such as `lora_rank`, `learning_rate`, `max_context_length`, and `batch_size` must match the original LoRA training.
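A quick pre-flight comparison can catch a mismatch before the job is submitted. This is only a sketch: `mismatched_params` is a hypothetical helper, and the key list covers just the four parameters named above, while "such as" suggests there may be others.

```python
# Parameters the text lists as needing to match; the full list may be longer.
MUST_MATCH = ("lora_rank", "learning_rate", "max_context_length", "batch_size")


def mismatched_params(original: dict, continued: dict) -> dict:
    """Map each mismatched parameter name to its (original, continued) values."""
    return {
        key: (original.get(key), continued.get(key))
        for key in MUST_MATCH
        if original.get(key) != continued.get(key)
    }


original = {"lora_rank": 8, "learning_rate": 1e-5, "max_context_length": 4096, "batch_size": 32768}
continued = {**original, "lora_rank": 16}
print(mismatched_params(original, continued))
```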
* Validate that the new checkpoint functions as expected in your environment
* If exporting models to your storage, apply your registry policies and access reviews
* Review audit logs and rotate any temporary credentials used for the run
Do not store long-lived credentials in code. Use short-lived tokens, workload identity, or scoped service accounts when granting Fireworks access to your buckets.
You now have an end-to-end secure RFT workflow with BYOB datasets, proprietary reward pipelines, and isolated training jobs that generate new checkpoints.
## Related Resources
* Learn about our comprehensive security measures
* Full guide to reinforcement fine-tuning
# Checkpoints and Resume
Source: https://docs.fireworks.ai/fine-tuning/training-api/cookbook/checkpoints
Save training progress, resume from failures, and promote checkpoints to deployable models — driven by the recipe.
## TL;DR
If you launch training through a cookbook recipe (`rl_loop`, `sft_loop`, `dpo_loop`, `orpo_loop`, `igpo_loop`), you don't have to call any checkpoint APIs yourself. Set two config fields and the recipe handles save, resume, and promote:
* `dcp_save_interval=N` (top-level `Config` field on every recipe) — save resumable checkpoints every N steps
* `output_model_id="my-model"` — promote the final checkpoint to a deployable Fireworks model
Rerunning with the same `log_path` resumes from the last saved checkpoint automatically.
```python theme={null}
from training.recipes.sft_loop import Config, main
from training.utils import TrainerConfig

cfg = Config(
    log_path="./my_training",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="data.jsonl",
    tokenizer_model="Qwen/Qwen3-8B",
    output_model_id="qwen3-8b-finetuned",
    dcp_save_interval=10,
    trainer=TrainerConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ),
)
main(cfg)

# Interrupted? Run again with the same config and it resumes automatically.
main(cfg)
```
That's the full surface most users need. The rest of this page covers config knobs, manual promotion via the CLI, and (under [Advanced internals](#advanced-internals)) what the recipe is doing under the hood.
`dcp_save_interval` defaults to `0` (off). Without setting it to a positive value, training cannot be resumed from intermediate steps.
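To get a feel for how the interval trades checkpoint frequency against lost work, here is a small sketch. It assumes, without confirmation from the recipe source, that resumable saves land on steps that are exact multiples of `dcp_save_interval`; both helper names are hypothetical.

```python
def resumable_save_steps(total_steps: int, dcp_save_interval: int) -> list[int]:
    """Steps with a resumable save, assuming saves land on multiples of the interval."""
    if dcp_save_interval <= 0:
        return []  # 0 means periodic resumable saves are off
    return list(range(dcp_save_interval, total_steps + 1, dcp_save_interval))


def steps_lost_on_crash(crash_step: int, dcp_save_interval: int) -> int:
    """Steps that would need to be redone after a crash at `crash_step`."""
    saves = resumable_save_steps(crash_step, dcp_save_interval)
    last_saved = saves[-1] if saves else 0
    return crash_step - last_saved


print(resumable_save_steps(35, 10))
print(steps_lost_on_crash(35, 10))
print(steps_lost_on_crash(35, 0))
```

With the interval left at `0`, a crash at any step loses all progress under this model, which is why setting a positive value matters for long runs.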
## Config fields
| Field | Applies to | Type | Default | Description |
| ---------------------- | ----------- | ------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `log_path` | All recipes | `str` | (required) | Directory for the recipe's local bookkeeping (`dataloader.json`) and logs |
| `dcp_save_interval` | All recipes | `int` | `0` | Save a resumable (DCP) checkpoint every N steps. `0` = off. Top-level `Config` field on every recipe (RL, IGPO, async RL, SFT, DPO, ORPO). |
| `output_model_id` | All recipes | `str \| None` | `None` | If set, promote the final checkpoint to this Fireworks model ID at the end of training |
| `init_from_checkpoint` | All recipes | `str \| None` | `None` | Load weights from another job (`"job-id:checkpoint-name"`). Step counter resets to 0. |
## Resume
### Automatic (same log\_path)
Just rerun with the same `log_path` and the recipe resumes. It queries the control plane for the newest resumable checkpoint on the trainer job and reloads weights and optimizer state. The step counter and the cookbook's `data_consumed` counter are restored from `dataloader.json` in `log_path`.
### From another job
```python theme={null}
config = Config(
    log_path="./new_run",
    init_from_checkpoint="i44pvd4syzg8hjfk:step-4",  # job_id:checkpoint_name
    ...
)
```
This loads weights from the specified job and resets the step counter to 0. It is mutually exclusive with automatic resume.
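The `init_from_checkpoint` value follows the `"job-id:checkpoint-name"` shape shown in the config table. A minimal sketch for splitting and sanity-checking such a string before launching a run might look like this; `parse_checkpoint_ref` is a hypothetical helper, not a cookbook function.

```python
def parse_checkpoint_ref(ref: str) -> tuple[str, str]:
    """Split a "job-id:checkpoint-name" string into its two parts."""
    job_id, sep, checkpoint_name = ref.partition(":")
    if not sep or not job_id or not checkpoint_name:
        raise ValueError(f"expected 'job-id:checkpoint-name', got {ref!r}")
    return job_id, checkpoint_name


print(parse_checkpoint_ref("i44pvd4syzg8hjfk:step-4"))
```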
## Promoting a checkpoint manually
If you want to promote an arbitrary checkpoint after training (not just the final one), use the cookbook's promote script:
```bash theme={null}
export FIREWORKS_API_KEY=...
python promote_checkpoint.py \
--job-id <job-id> \
--output-model-id my-fine-tuned-model \
--base-model accounts/fireworks/models/qwen3-8b
```
By default the script promotes the newest promotable checkpoint on the job. Pass `--checkpoint-name <name>` to promote a specific one.
You can also call the API directly — see [Saving and Loading — Promoting](/fine-tuning/training-api/saving-and-loading#promoting-a-checkpoint-to-a-model).
## Advanced internals
Most users can stop reading here. The sections below cover what the recipe does internally — useful only if you're forking a recipe, calling the SDK directly, or debugging a checkpoint that doesn't promote. The full SDK-level reference lives in [Saving and Loading](/fine-tuning/training-api/saving-and-loading).
### What gets saved, where
The recipe interacts with two surfaces:
| Surface | Owns | Source of truth for |
| ---------------------------------------------------------- | --------------------------------------------- | ------------------------------------------------------------------------------------------ |
| Control plane (`FireworksClient.list_checkpoints(job_id)`) | All remote checkpoint blobs (DCP and sampler) | What checkpoints exist, their type, and whether each is promotable |
| `{log_path}/dataloader.json` | Local file | The cookbook's `data_consumed` counter per checkpoint name (no server-side representation) |
There is no `checkpoints.jsonl` registry — the control plane is queried at resume / promote time.
### Two axes: resumable and promotable
When the recipe saves a checkpoint, it picks two independent capabilities:
| Axis | What it writes | Resumes? | Promotes to a model? |
| ----------------- | --------------------------- | -------- | -------------------- |
| `resumable=True` | DCP (weights + optimizer) | Yes | No |
| `promotable=True` | Sampler weights (HF format) | No | Yes |
| Both | DCP + sampler | Yes | Yes |
Periodic saves use `resumable=True` only. The final save uses both. RL weight sync saves sampler checkpoints and syncs their snapshot identities separately from DCP resume saves.
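The table above can be read as a simple mapping from the two flags to what gets written. The sketch below restates it in code for illustration only; `artifacts_written` and the string labels are hypothetical and do not correspond to cookbook identifiers.

```python
def artifacts_written(resumable: bool, promotable: bool) -> list[str]:
    """List the blobs a save would write for a given flag combination."""
    artifacts = []
    if resumable:
        artifacts.append("dcp")  # weights + optimizer state, used to resume
    if promotable:
        artifacts.append("sampler")  # HF-format weights, used to promote
    return artifacts


print(artifacts_written(resumable=True, promotable=False))  # periodic save
print(artifacts_written(resumable=True, promotable=True))   # final save
```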
### Forking a recipe
If you fork `rl_loop.py` (or another ported recipe) and need to drive checkpointing yourself, instantiate `TrainingCheckpoints`:
```python theme={null}
from training.utils.checkpoints import TrainingCheckpoints

ckpt = TrainingCheckpoints(
    policy,   # ReconnectableClient
    service,  # FiretitanServiceClient (control-plane checkpoint client)
    trainer_id=service.trainer_job_id,
    log_path=cfg.log_path,
    lora_rank=cfg.lora_rank,
)

# Resume on startup
resume_info = ckpt.resume(
    init_from_checkpoint=cfg.init_from_checkpoint,
    warm_start_from_adapter=cfg.warm_start_from_adapter,
)
step_offset = resume_info.step if resume_info else 0

# Periodic save
ckpt.save(f"step-{step}", resumable=True, promotable=False, data_consumed=count)

# Final save + promote
ckpt.save(f"step-{step}", resumable=True, promotable=True, data_consumed=count)
if cfg.output_model_id:
    ckpt.promote_latest(cfg.output_model_id, cfg.base_model)
```
The class is intentionally thin — it forwards `save_state` / `save_weights_for_sampler_ext` / `promote_checkpoint` to the SDK and uses the control plane as the source of truth for resume and promotion. Recipes pass the SDK-managed service client as the control-plane checkpoint client. The full API surface those calls expose is documented in [Saving and Loading](/fine-tuning/training-api/saving-and-loading).
### Checkpoint kinds
This subsection is the canonical reference for checkpoint kinds and promotability across the stack — other pages link here.
Three separate layers of the stack each have their own "type", and confusing them is the usual reason a promotion fails. They are not synonyms:
| Layer | Where | Values | What it controls |
| ------------ | --------------------------------------------------- | ----------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Cookbook** | `TrainingCheckpoints.save(resumable=, promotable=)` | two booleans | Which of DCP / sampler blob (or both) gets saved |
| **SDK** | `save_weights_for_sampler_ext(checkpoint_type=...)` | `"base"`, `"delta"` | Whether the sampler blob is full weights or an `arc_v2` delta over the previous base (LoRA ignores this — full adapter is always saved) |
| **Server** | `checkpointType` on each control-plane row | `TRAINING`, `TRAINING_LORA`, `INFERENCE_BASE`, `INFERENCE_LORA`, `INFERENCE_ARC_V2` | Detected from blob contents. The first two are resumable; `INFERENCE_BASE` and `INFERENCE_LORA` are promotable; `INFERENCE_ARC_V2` (delta on full-param) is not. |
When the cookbook saves with `promotable=True`, it always calls the SDK with `checkpoint_type="base"`, which the server detects as `INFERENCE_BASE` (full-param) or `INFERENCE_LORA` (LoRA). Both are promotable. The non-promotable `INFERENCE_ARC_V2` only happens if you bypass the cookbook and call `save_weights_for_sampler_ext("delta")` on a full-parameter run.
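When debugging a failed promotion, it can be useful to classify the server-side `checkpointType` values by what the table above says they allow. The following sketch encodes only that table; the helper name and the returned dictionary shape are assumptions, not an SDK API.

```python
# Groupings taken from the "Server" row of the table above.
RESUMABLE_TYPES = frozenset({"TRAINING", "TRAINING_LORA"})
PROMOTABLE_TYPES = frozenset({"INFERENCE_BASE", "INFERENCE_LORA"})


def describe_checkpoint_type(checkpoint_type: str) -> dict:
    """Report whether a server checkpointType is resumable and/or promotable."""
    return {
        "resumable": checkpoint_type in RESUMABLE_TYPES,
        "promotable": checkpoint_type in PROMOTABLE_TYPES,
    }


for kind in ("TRAINING_LORA", "INFERENCE_BASE", "INFERENCE_ARC_V2"):
    print(kind, describe_checkpoint_type(kind))
```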
#### Promotability cheat sheet
"Promotable" means the server will accept the blob for promotion — i.e. the checkpoint shows `promotable=True` in `list_checkpoints`. To actually promote, you need the checkpoint name plus `source_job_id` and `base_model`.
| How it was saved | LoRA promotable | Full-param promotable |
| ------------------------------------------------------------ | --------------------------------------- | --------------------- |
| `TrainingCheckpoints.save(resumable=True, promotable=False)` | No (DCP only) | No (DCP only) |
| `TrainingCheckpoints.save(promotable=True)` | Yes | Yes |
| `save_weights_for_sampler_ext(checkpoint_type="base")` | Yes | Yes |
| `save_weights_for_sampler_ext(checkpoint_type="delta")` | Yes (server always stores full adapter) | No |
| Recipe weight sync — first save | Yes | Yes |
| Recipe weight sync — later saves | Yes | No |
For SDK-level details on each row (full method signatures, base-vs-delta semantics, weight-sync lifecycle), see [Saving and Loading](/fine-tuning/training-api/saving-and-loading).
## Related guides
* [Saving and Loading](/fine-tuning/training-api/saving-and-loading) — SDK-level reference for save / load / promote
* [Training and Sampling](/fine-tuning/training-api/training-and-sampling) — SDK-managed sampler refresh lifecycle
* [Cookbook RL](/fine-tuning/training-api/cookbook/rl) — full GRPO walkthrough
# Cookbook: Distillation
Source: https://docs.fireworks.ai/fine-tuning/training-api/cookbook/distillation
Single-teacher OPD and routed multi-teacher policy distillation with cookbook recipes.
## What this is
The cookbook's `training.recipes.distillation_loop` trains one student on its own rollouts while one or more frozen teachers score those exact sampled tokens. The dense training signal is the per-token logprob gap between the selected teacher and the sampling student:
```text theme={null}
teacher_logprob - sampling_logprob
```
The recipe feeds that signal into the Training API's built-in `importance_sampling` loss. This is useful when you want on-policy distillation with token-level feedback instead of offline SFT traces or final-answer-only rewards.
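As a small worked example of the per-token signal, the sketch below computes the gap for three sampled tokens. The helper name and the numbers are made up for illustration; the recipe computes this internally.

```python
def logprob_gaps(teacher_logprobs: list[float], sampling_logprobs: list[float]) -> list[float]:
    """Per-token teacher_logprob - sampling_logprob for the same sampled tokens."""
    if len(teacher_logprobs) != len(sampling_logprobs):
        raise ValueError("both sequences must score the same sampled tokens")
    return [t - s for t, s in zip(teacher_logprobs, sampling_logprobs)]


teacher = [-0.25, -1.5, -0.5]
student = [-0.75, -1.0, -0.5]
print(logprob_gaps(teacher, student))  # [0.5, -0.5, 0.0]
```

A positive gap means the teacher assigns the sampled token a higher probability than the student did when sampling it; a negative gap means the teacher finds that token less likely.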
## Single-teacher distillation
Use `teacher_model` when every prompt should be scored by the same teacher:
```python theme={null}
from training.recipes.distillation_loop import Config, main
from training.utils import DeployConfig, TrainerConfig

cfg = Config(
    log_path="./distillation_logs",
    base_model="accounts/fireworks/models/qwen3-8b",
    teacher_model="accounts/fireworks/models/qwen3-32b",
    dataset="/path/to/prompts.jsonl",
    trainer=TrainerConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ),
    deployment=DeployConfig(tokenizer_model="Qwen/Qwen3-8B"),
    max_rows=100,
    epochs=1,
)
main(cfg)
```
If `teacher_model` is a base model, the recipe creates a frozen teacher deployment for scoring. If it is already an inference model or deployment resource, the recipe uses it directly.
## Routed multi-teacher distillation
Use `multi_teacher` when different prompts should be scored by different teachers. This is routed MOPD: each prompt is scored by exactly one teacher, selected by a string value in the dataset row.
```python theme={null}
from training.recipes.distillation_loop import Config, main
from training.utils import DeployConfig, TrainerConfig
from training.utils.distillation import MultiTeacherConfig, TeacherConfig

cfg = Config(
    log_path="./mopd_logs",
    base_model="accounts/fireworks/models/qwen3p5-35b-a3b",
    teacher_model="",
    dataset="/path/to/routed_prompts.jsonl",
    multi_teacher=MultiTeacherConfig(
        route_key="teacher",
        teachers=[
            TeacherConfig(
                model="accounts/fireworks/models/qwen3p5-35b-a3b",
                route_value="math-teacher",
                tokenizer_model="Qwen/Qwen3.5-35B-A3B",
            ),
            TeacherConfig(
                model="accounts/fireworks/models/qwen3p5-35b-a3b",
                route_value="arithmetic-teacher",
                tokenizer_model="Qwen/Qwen3.5-35B-A3B",
            ),
        ],
    ),
    trainer=TrainerConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3p5-35b-a3b-256k-lora",
    ),
    deployment=DeployConfig(tokenizer_model="Qwen/Qwen3.5-35B-A3B"),
    lora_rank=8,
    prompt_groups_per_step=2,
    completions_per_prompt=1,
)
main(cfg)
```
Current routed MOPD is not teacher blending. The recipe does not average teacher probabilities, average teacher logits, or run multiple teachers for the same prompt. It routes each row to one configured teacher.
## Dataset format
The distillation recipe reads JSONL rows. For routed MOPD, the dataset must include the route key you configure on `MultiTeacherConfig.route_key`. The default route key is `teacher`.
Required fields:
| Field | Type | Description |
| ---------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `messages` | `list[dict]` | Student-visible OpenAI-style chat messages. The student samples from this prompt. |
| `teacher` | `str` | Default route key for routed MOPD. The value must exactly match one configured `TeacherConfig.route_value`. If `route_value` is unset, the value must match that teacher's `model`. |
Optional fields:
| Field | Type | Description |
| ------------------ | ------------ | ------------------------------------------------------------------------------------------------------------------- |
| `teacher_messages` | `list[dict]` | Teacher-side prompt used for scoring. If omitted, the selected teacher scores the student rollout under `messages`. |
| `expected_answer` | `str` | Optional answer metadata for eval callbacks and smoke checks. |
| `extra_info` | `dict` | Optional user metadata. The recipe does not require a specific shape. |
Example single-teacher row:
```json theme={null}
{
"messages": [
{"role": "user", "content": "Solve 6 * 7. End with exactly one line: Final: ."}
],
"expected_answer": "42"
}
```
Example routed MOPD rows:
```json theme={null}
{"messages":[{"role":"user","content":"Solve 6 * 7. End with Final: ."}],"teacher":"math-teacher","expected_answer":"42"}
{"messages":[{"role":"user","content":"Solve 18 + 24. End with Final: ."}],"teacher":"arithmetic-teacher","expected_answer":"42"}
```
Example with a privileged teacher prompt:
```json theme={null}
{
"messages": [
{"role": "user", "content": "Solve 6 * 7. End with exactly one line: Final: ."}
],
"teacher": "math-teacher",
"teacher_messages": [
{"role": "user", "content": "Solve 6 * 7. The correct answer is 42. Explain briefly, then end with Final: 42."}
],
"expected_answer": "42"
}
```
If a teacher uses a custom `TeacherConfig.teacher_messages_key`, rows routed to that teacher should provide that key instead of `teacher_messages`.
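Because a route value that matches no configured teacher is easy to introduce by typo, a local check over the JSONL file before launching can save a failed run. This is a sketch under assumptions: `find_row_problems` is a hypothetical helper, and it checks only the `messages` field and the route key described above, not everything the recipe may validate.

```python
import json


def find_row_problems(jsonl_text: str, route_values: set[str], route_key: str = "teacher") -> list[tuple[int, str]]:
    """Return (line_number, problem) pairs; an empty list means every row passed."""
    problems = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        messages = row.get("messages")
        if not isinstance(messages, list) or not messages:
            problems.append((line_number, "missing or empty 'messages'"))
        route = row.get(route_key)
        if route not in route_values:
            problems.append((line_number, f"unknown route {route!r}"))
    return problems


sample = "\n".join([
    json.dumps({"messages": [{"role": "user", "content": "Solve 6 * 7."}], "teacher": "math-teacher"}),
    json.dumps({"messages": [{"role": "user", "content": "Solve 18 + 24."}], "teacher": "algebra-teacher"}),
])
print(find_row_problems(sample, {"math-teacher", "arithmetic-teacher"}))
```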
## Tokenizer compatibility
Sampled-token distillation scores the student's sampled token IDs under the teacher. The student and teacher must therefore share a compatible tokenizer and vocabulary. Prefer teachers from the same model family as the student, and set `TeacherConfig.tokenizer_model` when you want the recipe to validate the teacher tokenizer against `DeployConfig.tokenizer_model`.
## Example scripts
The cookbook includes two distillation examples:
| Example | Path | Description |
| ---------------------- | ---------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
| Privileged-context OPD | `training/examples/distillation/gsm8k_privileged` | Student sees the problem; teacher can see privileged solution context. |
| Routed MOPD smoke | `training/examples/distillation/routed_mopd/train_two_teacher_lora.py` | Tiny generated dataset with two route labels and a Qwen3.5 35B-A3B LoRA student. |
Run the routed smoke example from the cookbook repository:
```bash theme={null}
cd training
FIREWORKS_API_KEY=... \
python examples/distillation/routed_mopd/train_two_teacher_lora.py
```
The smoke example writes a small JSONL dataset into the run log directory. It is intended to show the required row shape; production runs should provide their own JSONL dataset with the same route-key contract.
## Operational notes
* `teacher_replica_count` controls replicas for auto-created frozen teacher deployments.
* `teacher_deployment_shape` sets the default teacher deployment shape. Individual `TeacherConfig.deployment_shape` values can override it.
* Per-teacher metrics such as `teacher_route/<route>/scored` and `teacher_route/<route>/inflight` are logged so route skew and idle teachers are visible.
* The adaptive concurrency controller watches the student deployment. If one route dominates the dataset, some teacher deployments may be underused.
* `DISTILLATION_TEACHERS` and `DISTILLATION_TEACHER_ROUTE_KEY` can configure routed teachers for the recipe's `__main__` entrypoint. Legacy `OPD_TEACHERS` and `OPD_TEACHER_ROUTE_KEY` names are accepted as fallbacks.
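Route skew can also be estimated from the dataset before any deployment is created. The sketch below counts how the rows are distributed across route values; `route_shares` is a hypothetical helper and assumes every row carries the route key.

```python
from collections import Counter


def route_shares(rows: list[dict], route_key: str = "teacher") -> dict[str, float]:
    """Fraction of rows routed to each teacher."""
    counts = Counter(row[route_key] for row in rows)
    total = sum(counts.values())
    return {route: count / total for route, count in counts.items()}


rows = [
    {"teacher": "math-teacher"},
    {"teacher": "math-teacher"},
    {"teacher": "math-teacher"},
    {"teacher": "arithmetic-teacher"},
]
print(route_shares(rows))
```

A heavily lopsided result suggests the minority teacher's deployment will sit idle for much of the run.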
## Next steps
* [Cookbook Reference](/fine-tuning/training-api/cookbook/reference) - config classes and common recipe fields
* [Loss Functions](/fine-tuning/training-api/loss-functions) - built-in and custom Training API losses
* [Weight sync](/fine-tuning/training-api/cookbook/weight-sync) - how updated weights reach serving deployments
# Cookbook: DPO
Source: https://docs.fireworks.ai/fine-tuning/training-api/cookbook/dpo
Direct Preference Optimization with pairwise data using the cookbook recipe.
## What this is
This guide walks through DPO (Direct Preference Optimization) training using the cookbook. DPO learns from preference pairs (chosen vs. rejected responses) without a separate reward model.
## How DPO differs from GRPO
| | DPO | GRPO |
| ---------------------- | ------------------------------------------------------------ | ----------------------------------------------------------------------- |
| **Trainer jobs** | 1 for LoRA, 2 for full-parameter (policy + frozen reference) | 1-2 trainers plus an inference deployment, depending on reference needs |
| **Data** | Preference pairs (chosen/rejected) | Prompts + reward function |
| **Reference logprobs** | Cached once at initialization | Computed every step |
| **Loss** | `-log(sigmoid(beta * margin))` | Advantage-weighted policy gradient + KL |
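To make the DPO loss in the table concrete, here is a small numeric sketch. It evaluates `-log(sigmoid(beta * margin))` for a few margins, where the margin is the chosen-minus-rejected difference of policy-vs-reference log-probabilities, as in the loss function later on this page. The helper name is hypothetical, and the softplus rewrite is just a numerically stable way to compute the same quantity.

```python
import math


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """-log(sigmoid(beta * margin)), computed as softplus(-beta * margin)."""
    x = beta * margin
    # softplus(-x) = max(-x, 0) + log1p(exp(-|x|)) avoids overflow for large |x|.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


for margin in (-10.0, 0.0, 10.0):
    print(f"margin={margin:+.1f} loss={dpo_loss(margin):.4f}")
```

At zero margin the loss equals `log(2)`, and it decreases as the policy favors the chosen response more strongly than the reference does.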
## Architecture
```mermaid theme={null}
flowchart LR
loop[Your Python Loop] -->|forward chosen+rejected| reference[Reference source frozen]
reference -->|ref logprobs cached at init| loop
loop -->|forward_backward_custom + optim_step| policyTrainer[Policy Trainer]
```
## Using the recipe
```python theme={null}
from training.recipes.dpo_loop import Config, main
from training.utils import TrainerConfig, WandBConfig

cfg = Config(
    log_path="./dpo_logs",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="/path/to/preference_data.jsonl",
    tokenizer_model="Qwen/Qwen3-8B",
    beta=0.1,
    epochs=1,
    batch_size=4,
    max_seq_len=4096,
    trainer=TrainerConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
        reference_training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200-forward",
    ),
    wandb=WandBConfig(entity="my-team", project="dpo-experiment"),
)
main(cfg)
```
## Dataset format
DPO expects preference pairs. Supported formats:
**Format 1 — chosen/rejected messages:**
```json theme={null}
{
"chosen": {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "good response"}]},
"rejected": {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "bad response"}]}
}
```
**Format 2 — input/output split:**
```json theme={null}
{
"input": {"messages": [{"role": "user", "content": "..."}]},
"preferred_output": [{"role": "assistant", "content": "good"}],
"non_preferred_output": [{"role": "assistant", "content": "bad"}]
}
```
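If your data mixes both formats, normalizing to one shape up front keeps downstream checks simple. The sketch below converts a Format 2 row into Format 1 by appending each output to the shared input messages. It assumes the two formats are equivalent in exactly that way, which this page does not state explicitly, and `to_chosen_rejected` is a hypothetical helper.

```python
def to_chosen_rejected(row: dict) -> dict:
    """Normalize a preference row to the chosen/rejected messages shape."""
    if "chosen" in row and "rejected" in row:
        return row  # already Format 1
    prompt = row["input"]["messages"]
    return {
        "chosen": {"messages": prompt + row["preferred_output"]},
        "rejected": {"messages": prompt + row["non_preferred_output"]},
    }


format_2_row = {
    "input": {"messages": [{"role": "user", "content": "Say hi"}]},
    "preferred_output": [{"role": "assistant", "content": "good"}],
    "non_preferred_output": [{"role": "assistant", "content": "bad"}],
}
print(to_chosen_rejected(format_2_row))
```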
## Step-by-step (API-level)
### Provision trainers with `build_service_client`
DPO always needs reference logprobs. Full-parameter DPO uses a policy trainer and a forward-only reference trainer; LoRA DPO uses one policy trainer and the policy session's shared base reference. Provisioning is owned by the SDK-managed service client — `build_service_client` resolves shapes, attaches or creates the trainer(s), and decides the reference strategy for you:
* **LoRA** (`lora_rank > 0`) with no `reference_training_shape_id` → `create_reference_client` reuses the policy session (no second trainer).
* **Full-parameter**, or an explicit `reference_training_shape_id` → a separate forward-only reference trainer is provisioned and its lifecycle is owned by the service client.
```python theme={null}
import os

from training.utils import TrainerConfig, build_service_client

api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")
base_model = "accounts/fireworks/models/qwen3-8b"

service = build_service_client(
    api_key=api_key,
    base_url=base_url,
    additional_headers=None,
    base_model=base_model,
    tokenizer_model="Qwen/Qwen3-8B",
    lora_rank=0,
    max_context_length=None,
    learning_rate=1e-5,
    trainer=TrainerConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
        reference_training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200-forward",
    ),
    # deployment=None → trainer-only provisioning (DPO has no rollouts)
    cleanup_trainer_on_close=True,  # delete SDK-managed trainers on service.close()
)

policy_client = service.create_training_client(base_model, lora_rank=0)
reference_client = service.create_reference_client(base_model, lora_rank=0)

# ... training loop ...

# service.close()  # tears down the trainers it created
```
The cookbook recipes wrap these clients in `ReconnectableClient.from_training_client(...)` for blocking semantics; for a raw API-level loop you can call `policy_client` / `reference_client` directly.
### Cache reference logprobs
Reference logprobs are computed once at initialization and reused throughout training:
```python theme={null}
ref_cache = {}
for i, (chosen_tokens, rejected_tokens, prompt_len) in enumerate(dataset):
    chosen_datum, rejected_datum = build_dpo_datums(
        chosen_tokens, rejected_tokens, prompt_len, max_seq_len=4096,
    )
    # Forward-only pass on the frozen reference: no gradients, just logprobs
    fwd = reference_client.forward([chosen_datum, rejected_datum], "cross_entropy")
    ref_cache[i] = {
        "ref_chosen": fwd.loss_fn_outputs[0]["logprobs"].data,
        "ref_rejected": fwd.loss_fn_outputs[1]["logprobs"].data,
        "chosen_tokens": chosen_tokens,
        "rejected_tokens": rejected_tokens,
        "prompt_len": prompt_len,
    }
```
### DPO loss function
```python theme={null}
import torch
import torch.nn.functional as F


def make_dpo_loss_fn(ref_chosen_logprobs, ref_rejected_logprobs, beta=0.1):
    ref_chosen_t = torch.tensor(ref_chosen_logprobs, dtype=torch.float32)
    ref_rejected_t = torch.tensor(ref_rejected_logprobs, dtype=torch.float32)

    def loss_fn(data, logprobs_list):
        pi_chosen, pi_rejected = logprobs_list[0], logprobs_list[1]
        chosen_weights = torch.tensor(data[0].loss_fn_inputs["weights"].data, dtype=torch.float32)
        rejected_weights = torch.tensor(data[1].loss_fn_inputs["weights"].data, dtype=torch.float32)

        # Weighted sums of per-token logprobs for policy and reference
        pi_chosen_sum = torch.dot(pi_chosen.float(), chosen_weights)
        pi_rejected_sum = torch.dot(pi_rejected.float(), rejected_weights)
        ref_chosen_sum = torch.dot(ref_chosen_t.float(), chosen_weights)
        ref_rejected_sum = torch.dot(ref_rejected_t.float(), rejected_weights)

        margin = (pi_chosen_sum - ref_chosen_sum) - (pi_rejected_sum - ref_rejected_sum)
        dpo_loss = -F.logsigmoid(beta * margin)

        with torch.no_grad():
            accuracy = 1.0 if margin.item() > 0 else 0.0

        return dpo_loss, {"dpo_loss": dpo_loss.item(), "margin": margin.item(), "accuracy": accuracy}

    return loss_fn
```
### Training loop
```python theme={null}
import tinker

step = 0
accum_count = 0
grad_accum = 4

for idx in ref_cache:
    cached = ref_cache[idx]
    chosen_datum, rejected_datum = build_dpo_datums(
        cached["chosen_tokens"], cached["rejected_tokens"],
        cached["prompt_len"], max_seq_len=4096,
    )
    loss_fn = make_dpo_loss_fn(
        ref_chosen_logprobs=cached["ref_chosen"],
        ref_rejected_logprobs=cached["ref_rejected"],
        beta=0.1,
    )
    result = policy_client.forward_backward_custom([chosen_datum, rejected_datum], loss_fn)
    accum_count += 1

    # Apply the accumulated gradients once every `grad_accum` pairs
    if accum_count >= grad_accum:
        policy_client.optim_step(
            tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01)
        )
        step += 1
        accum_count = 0
        print(f"Step {step}: {result.metrics}")
```
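One consequence of the accumulation logic in this loop: `optim_step` only fires when `accum_count` reaches `grad_accum`, so any pairs left over at the end of the dataset contribute gradients that are never applied. The sketch below just does that arithmetic; `optimizer_steps` is a hypothetical helper, and whether the cookbook recipe flushes the remainder is not covered here.

```python
def optimizer_steps(num_pairs: int, grad_accum: int) -> tuple[int, int]:
    """Return (optimizer steps taken, trailing pairs whose gradients are never applied)."""
    return divmod(num_pairs, grad_accum)


steps, leftover = optimizer_steps(num_pairs=10, grad_accum=4)
print(f"{steps} optimizer steps, {leftover} pairs left unapplied")
```

If the leftover matters for your run, one option is to call `optim_step` once more after the loop when `accum_count` is nonzero, or to size the dataset as a multiple of `grad_accum`.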
## Operational guidance
* **Set `trainer.training_shape_id` when you need an explicit policy shape** — otherwise supported recipes auto-select a validated policy shape.
* **Leave `trainer.reference_training_shape_id` unset unless you need a specific reference shape** — full-parameter DPO auto-selects a forward-only reference shape; LoRA DPO uses a shared-session reference by default.
* **DPO does not provision a deployment** — there are no rollout samples or deployment weight syncs in the recipe.
* **Keep a versioned reference cache** tied to tokenizer + base model revision. If the base model changes, recompute reference logprobs.
* **Monitor margin statistics**: increasing margins indicate the policy is learning preferences.
* **DCP checkpoints are disabled by default** (`dcp_save_interval=0`). If you need to resume training from a checkpoint, set `dcp_save_interval` directly on `dpo_loop.Config`.
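For the versioned reference cache mentioned above, one possible approach is to derive the cache key from everything that affects the reference logprobs. The sketch below hashes a few such inputs; the function name, the choice of fields, and the example revision strings are all assumptions, so include whatever else changes tokenization or truncation in your setup.

```python
import hashlib
import json


def reference_cache_key(base_model: str, tokenizer_model: str, revision: str, max_seq_len: int) -> str:
    """Stable short key that changes whenever any input that affects logprobs changes."""
    payload = json.dumps(
        {
            "base_model": base_model,
            "tokenizer_model": tokenizer_model,
            "revision": revision,  # hypothetical: however you track model revisions
            "max_seq_len": max_seq_len,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


key_a = reference_cache_key("accounts/fireworks/models/qwen3-8b", "Qwen/Qwen3-8B", "rev-1", 4096)
key_b = reference_cache_key("accounts/fireworks/models/qwen3-8b", "Qwen/Qwen3-8B", "rev-2", 4096)
print(key_a != key_b)
```

Storing cached logprobs under a path that includes this key means a changed base model or tokenizer naturally misses the cache and triggers recomputation.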
## Common pitfalls
* **Mismatched formatting** between chosen/rejected sequences corrupts preference signals — ensure identical prompt prefixes.
* **Stale reference cache**: If you warm-start from a different model, cached reference logprobs are invalid.
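The first pitfall can be screened for before training. The sketch below checks that the chosen and rejected sides of a Format 1 row share the same prompt messages. It assumes the final message on each side is the assistant response being compared, and `has_shared_prompt` is a hypothetical helper rather than a cookbook function.

```python
def has_shared_prompt(row: dict) -> bool:
    """True when chosen and rejected differ only in their final (response) message."""
    chosen = row["chosen"]["messages"]
    rejected = row["rejected"]["messages"]
    # Everything before the last message is treated as the shared prompt prefix.
    return chosen[:-1] == rejected[:-1]


good_row = {
    "chosen": {"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]},
    "rejected": {"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Go away."}]},
}
bad_row = {
    "chosen": {"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]},
    "rejected": {"messages": [{"role": "user", "content": "Hi "}, {"role": "assistant", "content": "Go away."}]},
}
print(has_shared_prompt(good_row), has_shared_prompt(bad_row))
```

Note that this compares messages before chat templating; token-level prefix differences introduced by the template would need a separate check on the tokenized sequences.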
## Related preference methods
* **ORPO** (`training.recipes.orpo_loop`) — Odds Ratio Preference Optimization. Combines an SFT-style negative-log-likelihood term on the chosen response with a margin term on the odds ratio between chosen and rejected. Unlike DPO, ORPO does **not** require a reference trainer (no cached reference logprobs), so the recipe runs with a single trainer + dataset of preference pairs. See `training.recipes.orpo_loop` in the public [cookbook repo](https://github.com/fw-ai/cookbook/tree/main/training/recipes/orpo_loop.py) for the full configuration.
## Related guides
* [Cookbook RL (GRPO)](/fine-tuning/training-api/cookbook/rl) — reinforcement learning recipes
* [Cookbook Reference](/fine-tuning/training-api/cookbook/reference) — all config classes
* [Loss Functions](/fine-tuning/training-api/loss-functions) — API-level DPO loss details
# The Cookbook
Source: https://docs.fireworks.ai/fine-tuning/training-api/cookbook/overview
Ready-to-run training recipes for GRPO, DPO, SFT, and distillation built on top of the Training API.
## What is the Cookbook?
The [Fireworks Cookbook](https://github.com/fw-ai/cookbook/tree/main/training) is a collection of training recipes and utilities built on top of the [Training API](/fine-tuning/training-api/introduction). It provides config-driven training loops that handle trainer provisioning, data loading, tokenization, gradient accumulation, checkpointing, and cleanup automatically.
The cookbook is **optional** — everything it does can be done with the API directly. Use the cookbook when you want a working training loop quickly; use the API when you need full control.
## Installation
```bash theme={null}
git clone https://github.com/fw-ai/cookbook.git
cd cookbook/training && pip install -e .
```
Set your credentials:
```bash theme={null}
export FIREWORKS_API_KEY="your-api-key"
```
## Available recipes
| Recipe | Module | Use case |
| -------------------------------- | ------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **RL** *(primary, experimental)* | `training.recipes.async_rl_loop` | Reinforcement learning — you write a rollout function, the recipe owns the loop. Async rollout/training overlap by default; fully synchronous on-policy with `synchronous_training=True`. GRPO, importance sampling, DAPO, DRO, GSPO, CISPO. See [Cookbook RL](/fine-tuning/training-api/cookbook/rl). **No backward-compatibility guarantee.** |
| **RL** *(simpler, synchronous)* | `training.recipes.rl_loop` | Synchronous on-policy GRPO scaffold — reach for it when you want the server-side fast loss path or don't need rollout/train overlap |
| **IGPO** | `training.recipes.igpo_loop` | Information Gain-based Policy Optimization — turn-level IG rewards for multi-turn agent trajectories (extends GRPO) |
| **DPO** | `training.recipes.dpo_loop` | Direct preference optimization from chosen/rejected pairs |
| **SFT** | `training.recipes.sft_loop` | Supervised fine-tuning with cross-entropy loss |
| **Distillation** | `training.recipes.distillation_loop` | On-policy sampled-token distillation with one teacher or routed multi-teacher MOPD |
| **ORPO** | `training.recipes.orpo_loop` | Odds ratio preference optimization |
Each recipe follows the same pattern: import `Config` and `main`, set your config, and call `main(cfg)`. Trainer and deployment provisioning is handled internally by the recipe — you describe *what* you want with `TrainerConfig` / `DeployConfig`, and the SDK attaches or creates the resources.
All launch examples below use `trainer=TrainerConfig(training_shape_id=...)` for explicit shape selection. Cookbook recipes can also auto-select validated shapes when `training_shape_id` is unset. The main run-level trainer knob you can set alongside a shape is `replica_count`, which requests a replicated HSDP launch. You can usually leave reference shapes unset: the cookbook either auto-selects one or uses a shared-session reference, as appropriate.
If you want field-level details about what a training shape controls and what stays configurable, see [Training Shapes](/fine-tuning/training-api/training-shapes) and the [Cookbook Reference](/fine-tuning/training-api/cookbook/reference).
`InfraConfig` and the standalone `setup_infra` / `ResourceCleanup` helpers are **deprecated and removed from the recipe surface**. Recipes now take `trainer=TrainerConfig(...)` (and `deployment=DeployConfig(...)` for RL). See [Migrating from the deprecated managed infra](/fine-tuning/training-api/cookbook/reference#deprecated-managed-infra-infraconfig).
## Quick example: SFT
```python theme={null}
from training.recipes.sft_loop import Config, main
from training.utils import TrainerConfig

cfg = Config(
    log_path="./sft_quickstart",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="/path/to/training_data.jsonl",
    tokenizer_model="Qwen/Qwen3-8B",
    max_seq_len=4096,
    epochs=1,
    batch_size=4,
    trainer=TrainerConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ),
)
main(cfg)
```
## Quick example: GRPO
```python theme={null}
from training.recipes.rl_loop import Config, main
from training.utils import DeployConfig, TrainerConfig

cfg = Config(
    log_path="./grpo_quickstart",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="/path/to/prompts.jsonl",
    max_rows=100,
    trainer=TrainerConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ),
    deployment=DeployConfig(
        deployment_id="grpo-serving",
        tokenizer_model="Qwen/Qwen3-8B",
    ),
    weight_sync_interval=1,
)
main(cfg)
```
## W\&B logging
All cookbook recipes accept a `WandBConfig` to stream metrics to [Weights & Biases](https://wandb.ai):
```python theme={null}
from training.utils import WandBConfig

cfg = Config(
    # ... same config as above ...
    wandb=WandBConfig(
        entity="my-team",
        project="grpo-experiment",
        run_name="qwen3-8b-sft-v1",  # optional, auto-generated if omitted
    ),
)
main(cfg)
```
## Vision-language model support
All cookbook recipes support VLM fine-tuning. Use a VLM training shape and tokenizer, and provide multimodal datasets with `image_url` content. See [Vision Inputs](/fine-tuning/training-api/vision-inputs) for dataset format and examples.
## Next steps
* [Cookbook SFT](/fine-tuning/training-api/cookbook/sft) — supervised fine-tuning
* [Cookbook DPO](/fine-tuning/training-api/cookbook/dpo) — preference optimization with pairwise data
* [Cookbook RL (GRPO)](/fine-tuning/training-api/cookbook/rl) — full GRPO walkthrough with reward functions
* [Cookbook Distillation](/fine-tuning/training-api/cookbook/distillation) — OPD and routed MOPD dataset format
* [Vision Inputs](/fine-tuning/training-api/vision-inputs) — fine-tune VLMs with image and text data
* [Cookbook Reference](/fine-tuning/training-api/cookbook/reference) — all config classes and parameters
# Cookbook Reference
Source: https://docs.fireworks.ai/fine-tuning/training-api/cookbook/reference
Configuration classes, checkpoint utilities, and gradient accumulation normalization for cookbook recipes.
## TrainerConfig
Training-client launch settings: which training shape to use, the optional reference trainer, region, and run-level knobs. Recipes take it as `Config.trainer`:
```python theme={null}
from training.utils import TrainerConfig

trainer = TrainerConfig(
    training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    reference_training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200-forward",
)
```
Use `training_shape_id` for explicit shape selection — this is the primary shape-specific value you set. Pass the full shared path `accounts/fireworks/trainingShapes/` (the `fireworks` account is the public shared shape catalog). If you leave it unset, supported recipes auto-select a validated shape from the control plane based on `base_model`, `lora_rank`, and `max_seq_len`.
| Field | Type | Default | Description |
| ----------------------------- | ------------------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `training_shape_id` | `str \| None` | `None` | Optional full training-shape ID for the policy trainer, typically `accounts/fireworks/trainingShapes/`. When unset, supported recipes auto-select a validated shape. |
| `reference_training_shape_id` | `str \| None` | `None` | Optional full training-shape ID for a separate reference trainer. For full-parameter runs that need a reference, leave unset to auto-select a validated forward-only shape; for LoRA runs, leave unset to use the shared-session reference on the policy trainer. |
| `job_id` | `str \| None` | `None` | Attach to an existing trainer job (resume / reattach) instead of creating a new one. |
| `reference_job_id` | `str \| None` | `None` | Attach to an existing forward-only reference trainer job. |
| `cleanup_reference_on_close` | `bool` | `True` | Delete the SDK-managed reference trainer when the service closes. |
| `region` | `str \| None` | `None` | Region override (drives trainer + deployment colocation). |
| `timeout_s` | `float` | `3600` | Timeout for trainer provisioning / readiness waits. |
| `extra_args` | `list[str] \| None` | `None` | Extra trainer arguments. |
| `replica_count` | `int \| None` | `None` | Data-parallel HSDP replica count for policy trainer launches. This is a run-level knob, not part of the validated training shape; reference trainers are launched without it. |
| `skip_validations` | `bool` | `False` | Skip server-side shape validation. Requires elevated permissions. |
| `purpose` | `str \| None` | `None` | Optional platform purpose enum name, such as `"PURPOSE_PILOT"`. |
To request replicated HSDP for a run:
```python theme={null}
trainer = TrainerConfig(
    training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    replica_count=2,
)
```
On the shape path (`training_shape_id` set or auto-selected), `accelerator_type`, `accelerator_count`, `node_count`, and `custom_image_tag` are derived from the training shape. `TrainerConfig` still exposes those fields for the advanced manual path (`training_shape_id=None`), where they are sent directly and shape validation is skipped.
Migrating from `InfraConfig`? See [Deprecated managed infra (InfraConfig)](#deprecated-managed-infra-infraconfig) for the field-rename table.
## DeployConfig
Deployment settings for sampling and weight sync. Wraps `DeploymentConfig` fields:
```python theme={null}
from training.utils import DeployConfig

deploy_cfg = DeployConfig(
    deployment_id="grpo-serving",
    tokenizer_model="Qwen/Qwen3-8B",
)
```
When `deployment_shape` is set (the recommended path), the shape owns deployment hardware and serving configuration.
| Field | Type | Default | Description |
| ------------------------------ | ------------------------ | ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `weight_sync_scope` | `WeightSyncScope` | `WeightSyncScope.PER_TRAINER` | Controls whether the trainer bucket or deployment bucket owns weight sync state. See [Weight sync](/fine-tuning/training-api/cookbook/weight-sync). |
| `deployment_id` | `str \| None` | `None` | Deployment identifier. If unset, the cookbook auto-derives one from the base model name. |
| `tokenizer_model` | `str \| None` | `None` | HuggingFace model name for client-side tokenization. Required for RL sampling. |
| `tokenizer_revision` | `str \| None` | `None` | Optional HuggingFace tokenizer revision. |
| `deployment_shape` | `str \| None` | `None` | Deployment shape resource name. When set, the shape owns GPU type and serving config. |
| `deployment_region` | `str \| None` | `None` | Region override for the deployment |
| `hot_load_bucket_type` | `str` | `"FW_HOSTED"` | Weight-sync storage backend |
| `hot_load_trainer_job` | `str \| None` | `None` | Trainer job name whose weight-sync bucket this deployment should use. Format: `accounts/{account}/rlorTrainerJobs/{job_id}`. |
| `deployment_timeout_s` | `float` | `5400` | Timeout for deployment provisioning / readiness waits |
| `reattach_settle_timeout_s` | `int` | `600` | Timeout for the serving pod to settle after re-attaching a deployment to a new trainer bucket. |
| `deployment_extra_args` | `list[str] \| None` | `None` | Extra serving arguments |
| `sample_timeout` | `int` | `600` | HTTP read timeout for sampling completions |
| `disable_speculative_decoding` | `bool` | `True` | Disable speculative decoding for weight-sync compatibility |
| `extra_values` | `dict[str, str] \| None` | `None` | Extra deployment Helm values |
| `replica_count` | `int \| None` | `None` | If set, pin the deployment to a fixed replica count (sets both min and max). |
| `deployment_accelerator_type` | `str \| None` | `None` | Manual-path deployment GPU type used only when no `deployment_shape` is set. |
Use `deployment_accelerator_type` only for advanced manual deployments that do not set a `deployment_shape`.
## ConcurrencyConfig
Rollout sampling concurrency settings used by RL-family recipes:
| Field | Type | Default | Description |
| ---------------------- | ------------- | ------------ | --------------------------------------------------------------------------------------------- |
| `mode` | `str \| None` | `"adaptive"` | Concurrency mode. RL recipes currently use adaptive concurrency. |
| `initial_window` | `int \| None` | `None` | Starting adaptive concurrency window. When unset, recipes derive it from deployment capacity. |
| `min_window` | `int` | `1` | Minimum adaptive concurrency window. |
| `max_window` | `int` | `256` | Maximum adaptive concurrency window. |
| `prefill_queue_target` | `float` | `0.5` | Target prefill queue duration in seconds for AIMD adjustment. |
| `max_concurrency` | `int \| None` | `None` | Deprecated fixed-concurrency compatibility field. |
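AIMD (additive increase, multiplicative decrease) is the general control scheme that the `prefill_queue_target` field refers to. The following is a minimal sketch of such a controller, not the cookbook's implementation: the class name `AimdWindow`, the starting window of 8, the `+1` increase, and the halving on overload are all illustrative assumptions. Only the field names and the `1` / `256` / `0.5` defaults come from the table above.

```python
from dataclasses import dataclass


@dataclass
class AimdWindow:
    """Hypothetical AIMD controller; step sizes are assumptions, not cookbook values."""

    window: int = 8                    # assumed starting window
    min_window: int = 1                # documented default
    max_window: int = 256              # documented default
    prefill_queue_target: float = 0.5  # documented default, in seconds

    def update(self, observed_prefill_queue_s: float) -> int:
        if observed_prefill_queue_s <= self.prefill_queue_target:
            # Queue is at or under target: grow the window by one (additive increase).
            self.window = min(self.max_window, self.window + 1)
        else:
            # Queue is over target: halve the window (multiplicative decrease).
            self.window = max(self.min_window, self.window // 2)
        return self.window


controller = AimdWindow()
history = [controller.update(t) for t in (0.1, 0.2, 0.9, 0.1)]
print(history)
```

The window grows slowly while the deployment keeps up and backs off quickly once prefill requests start to queue, always staying inside `[min_window, max_window]`.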
## Checkpoint & weight-sync fields
Weight-sync and checkpoint cadence are **top-level fields on the recipe `Config`** (no nested config object). `rl_loop` and `igpo_loop` expose the full weight-sync cadence knobs; `async_rl_loop` pins sampler sync to every optimizer step and exposes only pre-training sync and timeout. Every recipe exposes `dcp_save_interval`:
```python theme={null}
cfg = Config(
    # ... base_model, dataset, trainer, deployment ...
    weight_sync_interval=1,             # rl_loop/igpo_loop: sync weights every N steps
    weight_sync_before_training=False,  # RL: sync a base checkpoint before step 1
    weight_sync_timeout=600,            # RL: per weight-sync timeout (seconds)
    dcp_save_interval=10,               # all recipes: save resumable DCP checkpoints every N steps
)
```
`dcp_save_interval` defaults to `0` (off). Unless you set it to a positive value, **no DCP checkpoints are saved and training cannot be resumed**. If you need checkpoint-based resume, set `dcp_save_interval` explicitly (e.g. `dcp_save_interval=50`).
| Field | Recipes | Type | Default | Description |
| ----------------------------- | ---------------------- | ------ | ------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `dcp_save_interval` | All | `int` | `0` | Save resumable DCP checkpoints every N steps. `0` disables DCP saves. **Set to a positive value to enable resume.** |
| `weight_sync_interval` | `rl_loop`, `igpo_loop` | `int` | `1` | Save + sync weights to the deployment every N optimizer steps. `0` disables weight sync. `async_rl_loop` pins this internally to `1`. |
| `weight_sync_before_training` | RL family | `bool` | `False` | Save a base checkpoint and sync it to the deployment before the first training step. |
| `weight_sync_timeout` | RL family | `int` | `600` | Timeout for each weight sync (seconds). |
The old nested `WeightSyncConfig` recipe field is gone. Recipe `Config` objects set the fields above directly, and the SDK-managed service owns the underlying save and weight-sync state.
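To make the "every N steps" cadence concrete, here is a small sketch that lists which steps would trigger each event. It assumes steps are counted from 1 and that an event fires when the step number is a multiple of its interval; the recipes' exact step indexing is not confirmed here, and `fires_on_step` and `cadence_plan` are hypothetical helper names.

```python
def fires_on_step(step: int, interval: int) -> bool:
    """Return True if an every-N-steps event fires on this 1-indexed step."""
    # An interval of 0 disables the event, matching the documented meaning of 0.
    return interval > 0 and step % interval == 0


def cadence_plan(total_steps: int, weight_sync_interval: int, dcp_save_interval: int) -> dict:
    """List the steps on which weight sync and DCP saves would happen."""
    steps = range(1, total_steps + 1)
    return {
        "weight_sync": [s for s in steps if fires_on_step(s, weight_sync_interval)],
        "dcp_save": [s for s in steps if fires_on_step(s, dcp_save_interval)],
    }


plan = cadence_plan(total_steps=6, weight_sync_interval=2, dcp_save_interval=0)
print(plan)
```

With `dcp_save_interval=0` the `dcp_save` list is empty, which is the case where a crashed run has nothing to resume from.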
## WandBConfig
Weights & Biases logging settings:
```python theme={null}
from training.utils import WandBConfig

wandb = WandBConfig(
    entity="my-team",
    project="grpo-experiment",
    run_name="qwen3-8b-v1",
)
```
| Field | Type | Default | Description |
| ---------- | ------------- | ------- | ------------------------------------ |
| `entity` | `str \| None` | `None` | W\&B team or user name |
| `project` | `str \| None` | `None` | W\&B project name |
| `run_name` | `str \| None` | `None` | Run name (auto-generated if omitted) |
## ReconnectableClient
Blocking convenience wrapper around `FiretitanTrainingClient`. All cookbook recipes use this as their training client — it dispatches each call and blocks until the result is ready or the timeout expires. Failures propagate to the caller so the training loop can crash cleanly and resume from the last DCP checkpoint.
This is a recipe-internal wrapper. User code should not construct it with trainer managers. Recipes build it from the `FiretitanTrainingClient` returned by the SDK-managed service client.
```python theme={null}
from training.utils import ReconnectableClient

# training_client and service come from the SDK-managed service client.
client = ReconnectableClient.from_training_client(
    training_client,
    base_model="accounts/fireworks/models/qwen3-8b",
    lora_rank=0,
    job_id=service.trainer_job_id,
    service=service,
)

result = client.forward_backward_custom(datums, loss_fn)
client.optim_step(tinker.AdamParams(...))
```
| Parameter | Type | Default | Description |
| ----------------- | -------------------------------- | ------- | ----------------------------------------------------------------- |
| `client` | `FiretitanTrainingClient` | — | Training client returned by `service.create_training_client(...)` |
| `job_id` | `str` | — | RLOR trainer job ID |
| `base_model` | `str` | — | Base model name |
| `lora_rank` | `int` | `0` | LoRA rank (`0` for full-parameter) |
| `service` | `FiretitanServiceClient \| None` | `None` | Managed service that owns the trainer lifecycle |
| `default_timeout` | `int` | `3600` | Timeout in seconds for forward/backward/optim calls |
**Properties:**
| Property | Type | Description |
| -------- | ----- | ------------------ |
| `job_id` | `str` | The trainer job ID |
**Methods:**
| Method | Description |
| -------------------------------------------------------------- | -------------------------------------------- |
| `forward(data, loss_fn)` | Forward pass, blocks until complete |
| `forward_backward(data, loss_fn, loss_fn_config)` | Forward + backward pass |
| `forward_backward_custom(data, loss_fn)` | Forward + backward with custom loss function |
| `optim_step(params, grad_accumulation_normalization)` | Optimizer step |
| `save_state(name, timeout)` | Save DCP checkpoint (default timeout: 2700s) |
| `load_state_with_optimizer(path, timeout)` | Load DCP checkpoint (default timeout: 2700s) |
| `save_weights_for_sampler_ext(name, checkpoint_type, timeout)` | Save sampler checkpoint for promotion |
| `resolve_checkpoint_path(name, source_job_id)` | Resolve cross-job checkpoint path |
| `list_checkpoints()` | List available DCP checkpoints |
## Checkpoint utilities
For checkpointing, resuming, and promotion, see the dedicated [Checkpoints and Resume](/fine-tuning/training-api/cookbook/checkpoints) page.
## Gradient accumulation normalization
Recipe configs expose `grad_accumulation_normalization`, which is passed to `optim_step(...)`:
```python theme={null}
from fireworks.training.sdk import GradAccNormalization

client.optim_step(
    adam_params,
    grad_accumulation_normalization=GradAccNormalization.NUM_LOSS_TOKENS,
)
```
See [Loss Functions](/fine-tuning/training-api/loss-functions#gradient-accumulation-normalization) for how to choose the mode and avoid double-normalization.
### Recipe defaults
| Recipe | Default | Rationale |
| --------- | -------------------------------------- | ------------------------------------------------------------- |
| SFT | `None` | The SFT loss is already normalized client-side. |
| GRPO / RL | `GradAccNormalization.NUM_LOSS_TOKENS` | RL losses use server-side per-token normalization by default. |
| DPO | `None` | The DPO loss is already normalized client-side. |
| ORPO | `None` | The ORPO loss is already normalized client-side. |
The cookbook reference documents the config surface and defaults. The conceptual guidance for loss reduction vs. server-side normalization now lives in [Loss Functions](/fine-tuning/training-api/loss-functions#gradient-accumulation-normalization).
## Deprecated managed infra (InfraConfig)
Earlier cookbook releases provisioned trainers and deployments from the recipe layer using `InfraConfig`, `WeightSyncConfig`, and the standalone helpers `setup_infra` / `ResourceCleanup` / `make_reference_client` / `create_base_reference`. Provisioning now lives entirely behind the **SDK-managed service client** (`build_service_client(...)` → `service.create_*`), and recipes take `trainer=TrainerConfig(...)` plus `deployment=DeployConfig(...)`.
This is a **breaking change to the recipe-facing interface**. The recipe `Config` no longer accepts `infra=` or `weight_sync=`, and `setup_infra` / `ResourceCleanup` have been removed. If you are **not ready to migrate, simply do not upgrade the SDK + cookbook** — pin your current versions and existing code keeps working. **Upgrading is recommended** (cleaner config, one provisioning path, SDK-owned lifecycle), but it is opt-in: the old and new surfaces do not coexist in one install.
### What to change
| Before (deprecated) | After (current) |
| -------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Config(infra=InfraConfig(...))` | `Config(trainer=TrainerConfig(...))` |
| `InfraConfig.ref_training_shape_id` | `TrainerConfig.reference_training_shape_id` |
| `InfraConfig.trainer_timeout_s` | `TrainerConfig.timeout_s` |
| `InfraConfig.trainer_replica_count` | `TrainerConfig.replica_count` |
| `Config(weight_sync=WeightSyncConfig(weight_sync_interval=N))` | `Config(weight_sync_interval=N)` (top-level, `rl_loop` / `igpo_loop`; `async_rl_loop` pins this to `1`) |
| `weight_sync.dcp_save_interval=N` | `Config(dcp_save_interval=N)` (top-level, all recipes) |
| top-level `policy_job_id=...` | `TrainerConfig(job_id=...)` |
| `setup_infra(rlor_mgr, deploy_mgr, ...)` | `build_service_client(...)` (see the [DPO API-level example](/fine-tuning/training-api/cookbook/dpo#step-by-step-api-level)) |
| `create_base_reference()` / `make_reference_client()` | `service.create_reference_client(...)` |
| `with ResourceCleanup(...)` | `cleanup_trainer_on_close=True` + `service.close()` (see [Cleanup](/fine-tuning/training-api/reference/cleanup#automatic-cleanup-via-the-sdk-managed-service)) |
The `InfraConfig` dataclass is still importable for backward compatibility and now emits a `DeprecationWarning` when constructed; it is no longer accepted by recipe `Config` objects.
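The three field renames in the table can be applied mechanically. The sketch below works on plain dictionaries of keyword arguments so it runs without the cookbook installed; `translate_infra_kwargs` is a hypothetical helper, and passing unlisted keys through unchanged is an assumption that those fields kept their names, which you should check against the `TrainerConfig` table.

```python
# Renames taken from the migration table above.
INFRA_TO_TRAINER_FIELDS = {
    "ref_training_shape_id": "reference_training_shape_id",
    "trainer_timeout_s": "timeout_s",
    "trainer_replica_count": "replica_count",
}


def translate_infra_kwargs(infra_kwargs: dict) -> dict:
    """Rename deprecated InfraConfig keys to their TrainerConfig names.

    Keys without a listed rename are passed through unchanged (an assumption).
    """
    return {INFRA_TO_TRAINER_FIELDS.get(key, key): value for key, value in infra_kwargs.items()}


old_kwargs = {
    "training_shape_id": "accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    "trainer_timeout_s": 1800,
    "trainer_replica_count": 2,
}
new_kwargs = translate_infra_kwargs(old_kwargs)
print(new_kwargs)
```

The resulting dictionary is intended to be splatted into `TrainerConfig(**new_kwargs)` after you have reviewed it; the weight-sync and `policy_job_id` rows of the table move to different objects and are not covered by this helper.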
### Get help migrating
The cookbook ships a **debug-and-migrate skill** at [`skills/dev/`](https://github.com/fw-ai/cookbook/tree/main/skills/dev) that walks an agent through porting old `InfraConfig` / `setup_infra` scripts to the new `TrainerConfig` + `build_service_client` surface (in addition to its day-to-day debugging guidance for weight sync and checkpoint promotion). Point your coding agent at that skill to automate the migration.
# Cookbook: Reinforcement Learning
Source: https://docs.fireworks.ai/fine-tuning/training-api/cookbook/rl
Async RL on Fireworks — write a rollout function, the recipe owns the loop (gate, advantage, weight sync, KL/TIS, PPO, checkpoints). Runs async or fully synchronous.
## What this is
The cookbook's primary RL recipe is **`async_rl_loop`**. It runs rollout sampling and training as concurrent tasks, so the trainer doesn't sit idle waiting for a full batch of rollouts. **The only thing you write is a rollout function** — the recipe owns everything else: the off-policy gate, advantage computation, reference-model forwards, weight sync, KL/TIS metrics, the PPO inner loop, and checkpointing.
It is a strict superset of synchronous, on-policy GRPO: set one flag and it drains rollouts before every step (see [Sync vs. async](#sync-vs-async)). Start here for new RL work.
`async_rl_loop` is **experimental** and under active development. Config fields and the rollout protocol may change without backward-compatibility shims; the recipe emits a runtime warning at startup. Pin to a specific cookbook commit if you depend on the current shape.
## Core design: two files
You write two small files; the recipe is the third moving part you configure but don't edit.
| File | What it holds |
| ----------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `rollout.py` | The **rollout function** — one trajectory per call: sample from the deployment, (optionally) score it, return a `RolloutSample`. Exposes `make_rollout_fn(setup) -> rollout_fn`. |
| `train.py` | **Config + wiring** — base model, training/deployment shapes, the policy loss variant, reward function, and the call to `main(cfg, rollout_fn_factory=..., rows=...)`. |
| `async_rl_loop.main` (recipe) | Everything else: fan-out, off-policy gate, advantage, reference forwards, weight sync, KL/TIS, PPO inner loop, checkpoints, promotion. |
```mermaid theme={null}
flowchart LR
rows[Dataset rows] --> recipe[async_rl_loop.main]
recipe -->|sample_prompt| rollout[your rollout_fn]
rollout -->|sample completions| deployment[Inference Deployment]
deployment --> rollout
rollout -->|RolloutSample| recipe
recipe -->|forward_backward + optim_step + weight sync| trainer[Policy Trainer]
```
### `rollout.py` — the rollout function
The recipe hands your factory a `RolloutSetup` (sampler dependencies, tokenizer, sampling kwargs, custom `extras`) once at startup. Your `rollout_fn` is then invoked once per sample and returns a `RolloutSample` (or `None` to drop it):
```python theme={null}
from training.examples.rl.vanilla_sampler import build_deployment_sampler
from training.utils.rl.rollout import RolloutSample


def make_rollout_fn(setup):
    # Built once at startup from the RolloutSetup the recipe passes in.
    sampler = build_deployment_sampler(setup)
    sample_kwargs = dict(setup.sample_kwargs)

    async def rollout_fn(sample_prompt: dict) -> RolloutSample | None:
        completions = await sampler.sample_with_prompt_tokens(
            sample_prompt["prompt_token_ids"], n=1, **sample_kwargs,
        )
        if not completions:
            return None  # drop this sample
        c = completions[0]
        output = list(c.full_tokens)[c.prompt_len:]
        return RolloutSample(
            tokens=list(c.full_tokens),
            # Prompt positions get 0.0 logprobs and a 0 loss mask.
            logprobs=[0.0] * c.prompt_len + list(c.inference_logprobs),
            loss_mask=[0] * c.prompt_len + [1] * len(output),
            reward=score(c),  # score() is your own reward function; define it in this file
            finish_reason=c.finish_reason,
            text=c.text,
        )

    return rollout_fn
```
`RolloutSample` is three parallel per-token lists plus a scalar reward and two optional metadata fields (`finish_reason`, `text`):
```python theme={null}
from dataclasses import dataclass


@dataclass
class RolloutSample:
    tokens: list[int]
    logprobs: list[float]  # 0.0 on non-generated positions
    loss_mask: list[int]   # 1 on assistant tokens, 0 elsewhere
    reward: float
    finish_reason: str = "stop"
    text: str = ""
```
Multi-turn rollouts flatten into the same shape — turn boundaries are implicit in `loss_mask` transitions (0 on prompt/user/tool, 1 on assistant). The per-token mask alignment is the contract the trainer relies on.
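As an illustration of that flattening, the sketch below concatenates a list of turns into the three parallel lists. It uses a local stand-in dataclass with the same fields so it runs without the cookbook installed; `flatten_turns` and the `(role, token_ids, logprobs)` turn tuple are hypothetical, not cookbook APIs.

```python
from dataclasses import dataclass


@dataclass
class RolloutSample:  # local stand-in mirroring the fields shown above
    tokens: list[int]
    logprobs: list[float]
    loss_mask: list[int]
    reward: float
    finish_reason: str = "stop"
    text: str = ""


def flatten_turns(turns, reward: float) -> RolloutSample:
    """Flatten (role, token_ids, logprobs_or_None) turns into one sample."""
    tokens, logprobs, loss_mask = [], [], []
    for role, ids, lps in turns:
        trainable = role == "assistant"
        if trainable and len(lps) != len(ids):
            raise ValueError("assistant turns need one logprob per token")
        tokens.extend(ids)
        # Non-assistant positions carry 0.0 logprobs and a 0 mask.
        logprobs.extend(lps if trainable else [0.0] * len(ids))
        loss_mask.extend([1 if trainable else 0] * len(ids))
    return RolloutSample(tokens=tokens, logprobs=logprobs, loss_mask=loss_mask, reward=reward)


turns = [
    ("user", [1, 2, 3], None),
    ("assistant", [4, 5], [-0.1, -0.2]),
    ("tool", [6], None),
    ("assistant", [7, 8, 9], [-0.3, -0.4, -0.5]),
]
sample = flatten_turns(turns, reward=1.0)
print(sample.loss_mask)
```

Each `0 -> 1` transition in the mask marks the start of an assistant turn, and all three lists stay the same length.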
### `train.py` — config, reward, and loss
`train.py` builds the `Config`, picks the policy loss, wires the reward (computed inside the rollout), and starts the loop:
```python theme={null}
from training.recipes.async_rl_loop import Config, main
from training.utils import DeployConfig, TrainerConfig, WandBConfig

from rollout import make_rollout_fn  # your rollout.py

cfg = Config(
    log_path="./gsm8k_logs",
    base_model="accounts/fireworks/models/qwen3-8b",
    learning_rate=1.7e-5,
    completions_per_prompt=8,
    prompt_groups_per_step=8,
    policy_loss="grpo",             # the "custom loss" knob
    max_head_offpolicy_versions=4,  # off-policy staleness budget (0 = on-policy)
    trainer=TrainerConfig(training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200"),
    deployment=DeployConfig(tokenizer_model="Qwen/Qwen3-8B"),
    wandb=WandBConfig(entity="my-team", project="gsm8k-rl"),
)

rows = [...]  # dataset rows; each becomes a sample_prompt
main(cfg, rollout_fn_factory=make_rollout_fn, rows=rows)
```
Provisioning (policy trainer, reference trainer when `kl_beta > 0`, and the inference deployment) is handled internally from `trainer` / `deployment` — you never construct managers yourself.
## Sync vs. async
The same recipe covers the full spectrum from strict on-policy to overlapped off-policy:
| Setting | Behavior |
| ----------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `synchronous_training=True` | **Fully synchronous** — drains all in-flight rollouts before each train step. No overlap; useful as an on-policy baseline or to measure async savings. |
| `max_head_offpolicy_versions=0` (default) | **Strict on-policy** — samples that would arrive after the next weight sync are held until the sync. No drift; rollouts and training serialize at batch boundaries. |
| `max_head_offpolicy_versions=O` (`O > 0`) | **Off-policy with bounded staleness** — samples may land up to `O` weight-sync versions past their submit version, letting sampling overlap with training. |
Raising `O` later is a single-knob change. For the off-policy gate math, GPU split, and the `perf/*` tuning metrics, see the cookbook skill: [`skills/dev/references/rl/async-rl.md`](https://github.com/fw-ai/cookbook/blob/main/skills/dev/references/rl/async-rl.md).
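The staleness budget can be read as a simple inequality on weight-sync versions. The sketch below is an illustration of that reading, not the recipe's gate: `within_staleness_budget` is a hypothetical helper, and it does not model what the recipe does with an out-of-budget sample (for example, holding it until the next sync).

```python
def within_staleness_budget(
    submit_version: int,
    current_version: int,
    max_head_offpolicy_versions: int,
) -> bool:
    """Return True if a sample's version lag is inside the off-policy budget."""
    # Lag is how many weight syncs happened since the sample was submitted.
    lag = current_version - submit_version
    return 0 <= lag <= max_head_offpolicy_versions


# Strict on-policy: any weight sync since submission puts the sample out of budget.
print(within_staleness_budget(submit_version=3, current_version=4, max_head_offpolicy_versions=0))
# Bounded staleness: a lag of 2 is allowed when the budget is 2.
print(within_staleness_budget(submit_version=3, current_version=5, max_head_offpolicy_versions=2))
```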
### Policy loss variants
Set `policy_loss` on the `Config`:
| `policy_loss` | Description |
| ----------------------- | ------------------------------------------------- |
| `"grpo"` | REINFORCE + KL penalty (default) |
| `"importance_sampling"` | Off-policy ratio weighting with optional clipping |
| `"reinforce"` | Vanilla REINFORCE |
| `"dapo"` | Dynamic advantage with asymmetric PPO clipping |
| `"dro"` | Distributionally robust off-policy objective |
| `"gspo"` | Sequence-level clipped PPO |
| `"cispo"` | Clipped importance sampling policy optimization |
## Examples
Two minimal runnable examples ship under [`training/examples/rl/`](https://github.com/fw-ai/cookbook/tree/main/training/examples/rl), each as a `rollout.py` + `train.py` pair:
* **`single_turn_token_in/`** — pre-tokenized rows; the rollout makes one `/v1/completions` token-in/token-out call per invocation.
* **`multi_turn_message_in/`** — OpenAI-style messages; the rollout runs a retry loop (ports AReaL's multi-turn math example), with the reward in a separate `reward.py`.
### Black-box multi-turn agents
The ProRL SWE-Gym-style coding-agent path uses the same `async_rl_loop` contract
without modifying the agent. Run the agent in its sandbox, point its
Anthropic-compatible model endpoint at a local shim, and let the shim translate
each model call into a Fireworks deployment request while recording token ids and
logprobs. This mirrors the public slime
[`examples/coding_agent_rl`](https://github.com/THUDM/slime/tree/main/examples/coding_agent_rl)
example, which turns one agent run into `subagent`, `wipe`, and `final`
training segments.
The important part is to keep one stable trajectory session id for the whole
episode. Forward that id on every turn with the `user` request field or, for RL
rollout traffic, with `x-multi-turn-session-id` and `x-session-affinity`.
See [KV cache behavior for RL rollouts](/guides/rollout-inference#kv-cache-behavior-for-rl-rollouts)
for how that session id interacts with sticky routing, prompt-prefix KV reuse,
active streams, and `reset_prompt_cache` during weight sync.
For the training datum, separate turn routing from token stitching. The shim uses
`training.utils.rl.rollout.turn_matching` to classify each incoming request as
`NEW`, `APPEND`, or `WIPE`. The default strategy matches structured message
hashes, which is useful for black-box agents that re-render the full
conversation each turn; a stricter token-prefix strategy is also available. An
`APPEND` continues the active chain, while a `WIPE` freezes the current chain as
its own segment and starts a fresh one. That is how compaction or sub-agent
excursions become multiple training segments without losing the rest of the
run.
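A rough sketch of the token-prefix variant of this classification follows. It is not the `turn_matching` module: the function names are hypothetical, treating `NEW` as "no active chain yet" is an assumption, and the default message-hash strategy is not shown.

```python
def classify_turn(active_chain, incoming_prompt):
    """Classify a request against the active chain using token-prefix matching."""
    if active_chain is None:
        return "NEW"  # assumption: the first request of a session
    if incoming_prompt[: len(active_chain)] == active_chain:
        return "APPEND"  # the prompt extends everything seen so far
    return "WIPE"  # the prompt diverged, e.g. after compaction


def split_into_segments(requests):
    """Group (prompt_ids, output_ids) requests into segments, cutting on WIPE."""
    segments, chain, current = [], None, []
    for prompt_ids, output_ids in requests:
        kind = classify_turn(chain, list(prompt_ids))
        if kind == "WIPE":
            # Freeze the current chain as its own segment and start a fresh one.
            segments.append(current)
            current = []
        current.append((kind, list(prompt_ids), list(output_ids)))
        chain = list(prompt_ids) + list(output_ids)
    if current:
        segments.append(current)
    return segments


requests = [([1, 2], [3]), ([1, 2, 3, 4], [5]), ([9], [10])]
segments = split_into_segments(requests)
print([[kind for kind, _, _ in segment] for segment in segments])
```

The third request does not start with the tokens seen so far, so it closes the first segment and opens a second one.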
Token stitching then happens in the coding-agent trajectory merge. Each recorded
turn stores the exact prompt token ids seen by the deployment, output token ids,
and per-token output logprobs. The first prompt becomes the segment prompt.
Later prompts are matched against the segment's `prompt_ids + response_ids`; any
new prompt suffix is non-trainable context, and generated output tokens are the
trainable span.
```python theme={null}
sample = RolloutSample(
    tokens=segment.prompt_ids + segment.response_ids,
    logprobs=[0.0] * len(segment.prompt_ids) + segment.rollout_log_probs,
    loss_mask=[0] * len(segment.prompt_ids) + segment.loss_mask,
    reward=run_reward,
)
```
Inside `segment.loss_mask`, prompt suffixes from user/tool/rendering turns stay
`0`, assistant output tokens stay `1`, and non-trainable logprobs are zeroed.
If a later rendered prompt no longer token-matches part of a previous model
output, the merge masks or drops the unstitched tail instead of training on
shifted masks. The rollout returns all surviving segments in one `RolloutRun`
with the final sandbox/grader reward, and the recipe handles advantage
computation, PPO/GRPO loss, and weight sync.
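The stitching rule described in this section can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the cookbook's trajectory merge: `Segment` and `stitch_turns` are hypothetical names, each turn is a `(prompt_ids, output_ids, output_logprobs)` tuple, and a prompt that fails to match raises an error here, whereas the real merge masks or drops the unstitched tail.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:  # hypothetical stand-in for one stitched training segment
    prompt_ids: list[int] = field(default_factory=list)
    response_ids: list[int] = field(default_factory=list)
    loss_mask: list[int] = field(default_factory=list)
    rollout_log_probs: list[float] = field(default_factory=list)


def stitch_turns(turns) -> Segment:
    """Merge recorded (prompt_ids, output_ids, output_logprobs) turns into one segment."""
    segment = Segment()
    for index, (prompt_ids, output_ids, output_logprobs) in enumerate(turns):
        if index == 0:
            # The first prompt becomes the segment prompt.
            segment.prompt_ids = list(prompt_ids)
        else:
            seen = segment.prompt_ids + segment.response_ids
            if list(prompt_ids[: len(seen)]) != seen:
                raise ValueError(f"turn {index} does not extend the stitched tokens")
            # Any new prompt suffix is non-trainable context.
            suffix = list(prompt_ids[len(seen):])
            segment.response_ids += suffix
            segment.loss_mask += [0] * len(suffix)
            segment.rollout_log_probs += [0.0] * len(suffix)
        # Generated output tokens are the trainable span.
        segment.response_ids += list(output_ids)
        segment.loss_mask += [1] * len(output_ids)
        segment.rollout_log_probs += list(output_logprobs)
    return segment


turns = [
    ([1, 2], [3, 4], [-0.1, -0.2]),
    ([1, 2, 3, 4, 5, 6], [7], [-0.3]),
]
segment = stitch_turns(turns)
full_mask = [0] * len(segment.prompt_ids) + segment.loss_mask
print(segment.prompt_ids + segment.response_ids, full_mask)
```

In the second turn, tokens `5` and `6` arrive as prompt suffix (for example a tool result), so they are kept as context with mask `0` while token `7` is trainable.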
## Operational guidance
* **`deployment.tokenizer_model` is required** — the recipe tokenizes client-side.
* **Set `trainer.training_shape_id`** for an explicit shape; otherwise the recipe auto-selects a validated one.
* **Reward lives in the rollout** — set `RolloutSample.reward`; return `None` to drop a sample.
* **Skip uniform-reward groups** with `dynamic_filter_fn=lambda pg: len(set(pg.rewards)) > 1` — GRPO advantage is zero when all rewards in a group match.
* **DCP checkpoints are off by default** (`dcp_save_interval=0`); set a positive value to enable resume, and `output_model_id` to promote the final checkpoint.
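To see why uniform-reward groups carry no signal, consider the commonly published group-relative advantage, which subtracts the group mean and divides by the group standard deviation. The sketch below uses that textbook form as an assumption; the recipe's exact advantage computation may differ, and `group_advantages` and `has_reward_spread` are hypothetical helpers.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps: float = 1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_reward_spread(rewards) -> bool:
    """Same test as the dynamic filter above: keep groups with more than one distinct reward."""
    return len(set(rewards)) > 1


uniform = [1.0, 1.0, 1.0, 1.0]
mixed = [0.0, 1.0, 1.0, 0.0]
print(has_reward_spread(uniform), group_advantages(uniform))
print(has_reward_spread(mixed), group_advantages(mixed))
```

Every advantage in the uniform group is zero, so training on it would spend a step without moving the policy; the filter skips such groups before they reach the trainer.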
## The simpler `rl_loop` recipe
If you don't need rollout/train overlap, the cookbook also ships **`rl_loop`** — a synchronous, strictly on-policy GRPO scaffold. It samples a batch, scores it, takes a step, syncs weights, and repeats. Configure it the same way (`trainer=TrainerConfig(...)`, `deployment=DeployConfig(...)`, `weight_sync_interval`, `policy_loss`) and call `main(cfg)`:
```python theme={null}
from training.recipes.rl_loop import Config, main
from training.utils import DeployConfig, TrainerConfig
cfg = Config(
log_path="./grpo_logs",
base_model="accounts/fireworks/models/qwen3-8b",
dataset="/path/to/gsm8k.jsonl",
max_rows=200,
completions_per_prompt=4,
policy_loss="grpo",
trainer=TrainerConfig(training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200"),
deployment=DeployConfig(deployment_id="grpo-serving", tokenizer_model="Qwen/Qwen3-8B"),
weight_sync_interval=1,
)
main(cfg)
```
`async_rl_loop` with `max_head_offpolicy_versions=0` is equivalent to `rl_loop`, so prefer the async recipe for new work and reach for `rl_loop` only when you specifically want the server-side fast loss path (which forbids `kl_beta>0` and pipeline parallelism). The reward function and `build_grpo_datums` / `make_grpo_loss_fn` internals are documented in [Loss Functions](/fine-tuning/training-api/loss-functions).
## Related guides
* [`skills/dev/references/rl/async-rl.md`](https://github.com/fw-ai/cookbook/blob/main/skills/dev/references/rl/async-rl.md) — full async contract: off-policy gate, `perf/*` metrics, GPU split tuning
* [Weight sync](/fine-tuning/training-api/cookbook/weight-sync) — how updated weights reach the deployment
* [Cookbook Reference](/fine-tuning/training-api/cookbook/reference) — all config classes
* [Loss Functions](/fine-tuning/training-api/loss-functions) — policy-loss and datum internals
# Cookbook: SFT
Source: https://docs.fireworks.ai/fine-tuning/training-api/cookbook/sft
Supervised fine-tuning via the cookbook's sft_loop recipe.
## What this is
Supervised Fine-Tuning (SFT) trains the model to produce desired outputs by minimizing cross-entropy loss on (prompt, response) pairs. The cookbook's `sft_loop` recipe handles data loading, tokenization, batching, gradient accumulation, and checkpointing automatically.
## Using the recipe
```python theme={null}
from training.recipes.sft_loop import Config, main
from training.utils import TrainerConfig
cfg = Config(
log_path="./sft_logs",
base_model="accounts/fireworks/models/qwen3-8b",
dataset="/path/to/training_data.jsonl",
tokenizer_model="Qwen/Qwen3-8B",
max_seq_len=4096,
epochs=1,
batch_size=4,
learning_rate=1e-5,
trainer=TrainerConfig(
training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
),
)
main(cfg)
```
**`batch_size_samples` is not supported in the V2 SFT `CookbookTrainingConfig`.**
Passing `batch_size_samples` to the V2 config has no effect — the parameter is accepted without error but silently ignored, which can lead to unexpected step counts.
**How batching works in V2:** Steps are calculated as:
```
steps = (num_samples × num_epochs) / batch_size
```
where `batch_size` is set by the training shape and the recipe’s `batch_size` field — not by `batch_size_samples`.
**Example:** 10 samples × 5 epochs ÷ batch size of 10 = **5 steps**, not 50.
To control training length, adjust `epochs` (and related recipe fields). Contact support for custom batch size configurations.
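The step formula can be checked with a small helper before launching a job. This is a sketch only: `estimate_steps` is a hypothetical name, and rounding down when the samples do not divide evenly is an assumption, since the text above does not say how a partial final batch is handled.

```python
def estimate_steps(num_samples: int, num_epochs: int, batch_size: int) -> int:
    """steps = (num_samples * num_epochs) / batch_size, rounded down (assumed)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return (num_samples * num_epochs) // batch_size


# The example from the text: 10 samples, 5 epochs, batch size 10.
print(estimate_steps(num_samples=10, num_epochs=5, batch_size=10))  # 5
```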
## Dataset format
SFT datasets use the standard messages format (JSONL with one example per line):
```json theme={null}
{"messages": [
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."}
]}
```
Multi-turn conversations are supported:
```json theme={null}
{"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hi! How can I help?"},
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "2+2 = 4"}
]}
```
The recipe automatically tokenizes conversations using the chat template, setting token weights to `0.0` for prompt tokens and `1.0` for response tokens.
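The weighting rule can be pictured with a toy example. The sketch below uses a whitespace tokenizer and skips the chat template entirely, so it is not what the recipe does internally; `toy_tokenize` and `build_weights` are hypothetical helpers that only show how a role maps to a per-token weight.

```python
def toy_tokenize(text: str) -> list[str]:
    """Whitespace split; a real run uses the model tokenizer and chat template."""
    return text.split()


def build_weights(messages: list[dict]) -> tuple[list[str], list[float]]:
    """Assign weight 1.0 to assistant tokens and 0.0 to everything else."""
    tokens: list[str] = []
    weights: list[float] = []
    for message in messages:
        piece = toy_tokenize(message["content"])
        tokens.extend(piece)
        weight = 1.0 if message["role"] == "assistant" else 0.0
        weights.extend([weight] * len(piece))
    return tokens, weights


example = {"messages": [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 = 4"},
]}
tokens, weights = build_weights(example["messages"])
print(tokens)   # ['What', 'is', '2+2?', '2+2', '=', '4']
print(weights)  # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```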
### Vision datasets
The SFT recipe also supports vision-language model fine-tuning. Use multimodal `content` arrays with `image_url` objects in your JSONL, and specify a VLM training shape and tokenizer. See [Vision Inputs](/fine-tuning/training-api/vision-inputs) for dataset format details and a full walkthrough.
## Checkpointing and resume
The current `sft_loop` recipe manages the trainer-side loop only. It does **not** create a deployment or run weight sync during training, but it does expose DCP checkpointing and resume controls:
```python theme={null}
from training.recipes.sft_loop import Config, main
from training.utils import TrainerConfig, WandBConfig
cfg = Config(
log_path="./sft_logs",
base_model="accounts/fireworks/models/qwen3-8b",
dataset="/path/to/training_data.jsonl",
tokenizer_model="Qwen/Qwen3-8B",
max_seq_len=4096,
epochs=1,
batch_size=4,
trainer=TrainerConfig(
training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
),
dcp_save_interval=50,
init_from_checkpoint="previous-job-id:step-100", # optional
wandb=WandBConfig(entity="my-team", project="sft-experiment"),
)
main(cfg)
```
## Operational guidance
* **Set `trainer.training_shape_id`** — cookbook trainer launches use training shapes.
* **Only one trainer job needed** — SFT does not require a reference trainer.
* **The current recipe does not provision a deployment** — use the API directly if you want deployment-side evaluation or weight sync during SFT.
* **Use `batch_size`** to control the number of examples per optimizer step.
* **Gradient accumulation normalization defaults to `None`** — the SFT loss is already normalized client-side, so adding server-side normalization would double-normalize gradients.
* **Resume**: The recipe uses `checkpoint_utils.resolve_resume()` to automatically restore from the last saved state on restart.
* **DCP checkpoints are disabled by default** (`dcp_save_interval=0`). If you need to resume training from a checkpoint, you must explicitly set `dcp_save_interval` to a positive value (e.g., `dcp_save_interval=50`).
## Related guides
* [Vision Inputs](/fine-tuning/training-api/vision-inputs) — VLM fine-tuning with image and text data
* [Cookbook DPO](/fine-tuning/training-api/cookbook/dpo) — preference optimization
* [Cookbook RL (GRPO)](/fine-tuning/training-api/cookbook/rl) — reinforcement learning recipes
* [Cookbook Reference](/fine-tuning/training-api/cookbook/reference) — all config classes and parameters
* [Loss Functions](/fine-tuning/training-api/loss-functions) — API-level SFT loss details
# Weight sync
Source: https://docs.fireworks.ai/fine-tuning/training-api/cookbook/weight-sync
How a trainer's updated weights reach the serving deployment during RL training.
During RL training the policy updates step by step, and the inference deployment needs those updated weights to generate the next batch of rollouts. The cookbook wires this as a **shared GCS bucket**:
* The **trainer** writes a fresh checkpoint to the bucket after each optimizer step (or on a configurable cadence).
* The **deployment** watches the same bucket and swaps in new weights without a pod restart.
**Terminology.** The internal Fireworks name for this mechanism is *hotload*. You'll see that name in SDK field names (`hot_load_trainer_job`, `hot_load_deployment_id`, `hot_load_bucket_url`) and server error messages. "Weight sync" and "hotload" refer to the same thing.
## Normal flow
The RL recipe provisions the trainer and deployment for you — set `deployment=DeployConfig(...)` on the recipe `Config` and the SDK-managed service client wires the bucket correctly. With the default `DeployConfig(weight_sync_scope=WeightSyncScope.PER_TRAINER)`, the trainer is requested first and the deployment is linked to the trainer-owned bucket. `WeightSyncScope.PER_DEPLOYMENT` reverses that order: the deployment is created first, then trainers write to the deployment-owned bucket. If you misconfigure the pairing, the server rejects the `CreateDeployment` or `CreateRlorTrainerJob` call up front with an error that links back here.
## `WeightSyncScope`: who owns the bucket
`DeployConfig.weight_sync_scope` controls which resource must be created first:
| Scope | Bucket owner | Use when |
| --------------------------- | ---------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| **`PER_TRAINER`** (default) | Trainer — one bucket per run | Single run, or one trainer feeding multiple deployments (sampler + held-out eval) |
| **`PER_DEPLOYMENT`** | Deployment — stable bucket across trainer runs | Long-lived deployment, many sequential trainers, can't tolerate deployment restarts between runs |
The recipe dispatches on this single field and wires the rest correctly. The two scopes are mutually exclusive for the same trainer ↔ deployment pair — don't mix them.
## Diagnosing errors
The control plane catches scope-mix mistakes at create time and returns an error that names both resources and suggests the fix. For the full list of server error strings and per-error recovery steps, see the cookbook's dev skill: **[`skills/dev/references/rl/hotload.md`](https://github.com/fw-ai/cookbook/blob/main/skills/dev/references/rl/hotload.md)**. It also covers trainer retention, the unified promote API, and runtime bucket-mismatch warnings.
## See also
* [RL cookbook](/fine-tuning/training-api/cookbook/rl) — end-to-end RL flow, including weight-sync cadence knobs
* [Checkpoints](/fine-tuning/training-api/cookbook/checkpoints) — base/delta, promote
# Introduction
Source: https://docs.fireworks.ai/fine-tuning/training-api/introduction
Fireworks Training API — custom training loops with full Python control over objectives, while Fireworks handles distributed GPU infrastructure.
The Training API is currently in **private preview**. [Request early access](https://fireworks.ai/contact-training) to get started.
**Using a code agent?** Clone [fw-ai/cookbook](https://github.com/fw-ai/cookbook). The cookbook includes the [`skills/dev/`](https://github.com/fw-ai/cookbook/tree/main/skills/dev) skill, which gives agents repo-specific guidance for setup, debugging, weight sync, RL recipe internals, and checkpoint promotion.
## What is the Training API?
Fireworks Training API lets you write training logic in plain Python on your local machine while model computation runs on remote GPUs managed by Fireworks.
Most users should start from [cookbook recipes](/fine-tuning/training-api/cookbook/overview), the recommended entry point for standard SFT, DPO, GRPO-style training, and async RL loops for agentic RL. Fork a recipe when you want to adapt an existing loop with your own loss, reward, rollout function, data loading, or checkpointing behavior.
Use the Direct Training SDK when you need full control over training behavior.
| Mode | Best for | Infrastructure |
| ----------------------- | --------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| **Cookbook recipes** | Recommended entry point for adapting existing SFT/DPO/GRPO-style loops, including async RL for agentic RL | You configure and implement simple loss, reward, or rollout functions; platform runs GPUs |
| **Direct Training SDK** | Full control over training behavior | You drive the training flow; platform runs GPUs |
## Who does what
| Fireworks handles | Cookbook recipes handle | Direct Training SDK users implement |
| ------------------------------------------------------------------------ | -------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
| GPU provisioning and cluster management | Training loop structure for supported recipes | Training loop logic (`forward_backward_custom` + `optim_step`) |
| Service-mode trainer lifecycle (create, health-check, reconnect, delete) | Resource setup, health checks, reconnect, and cleanup | Managed service setup with `FiretitanServiceClient.from_firetitan_config(...)` |
| Distributed forward pass, backward pass, optimizer execution | Common losses and reward/evaluation plumbing | Loss function and batch construction |
| Checkpoint storage and export | Checkpoint save, resume, promotion, and sampler refresh | Checkpoint calls (`save_weights_for_sampler`, DCP snapshots) |
| Inference deployments and weight sync | Deployment sampling and serving-integrated evaluation for RL recipes | Custom rollout, sampling, and evaluation logic through the managed service |
| Preemption recovery and job resume | Resume logic for supported recipe checkpoints | Resume policy and state restoration calls |
| Distributed training (multi-node, sharding, FSDP) | Config surfaces for learning rate, grad accumulation, context length, W\&B | Hyperparameter schedules, data pipeline, and experiment tracking |
## System architecture
```mermaid theme={null}
flowchart LR
local["Your Python Code
(loss function, data loading, metrics)"] <-->|HTTP API| gpu["Fireworks GPUs
(forward pass, backward pass, optimizer)"]
```
## How service-mode training works
**Most common gotchas**
* Every API call returns a future. Always call `.result()` or failures can be missed.
* `token_weights=0` means prompt/no-loss tokens, `token_weights=1` means response/learned tokens.
* `forward_backward_custom` computes gradients only; you still need `optim_step` to apply updates.
### Minimal training step lifecycle
1. Create an SDK-managed service and connect a training client.
2. Send tokenized datums (with loss weights).
3. Run `forward_backward_custom(...).result()`.
4. Run `optim_step(...).result()`.
5. Save sampler weights and refresh the SDK-managed sampler.
### Datums
A **Datum** is the unit of training data sent to the remote GPU. It wraps tokenized input and per-token weights that your loss function needs.
Token weights tell the loss function which tokens to train on:
* **`0.0`** = prompt token (don't train on this)
* **`1.0`** = response token (train on this)
```python theme={null}
import tinker
import torch
from tinker_cookbook.supervised.common import datum_from_model_input_weights
tokens = tokenizer.encode("What is 2+2? The answer is 4.")
prompt_len = len(tokenizer.encode("What is 2+2? "))
weights = torch.zeros(len(tokens), dtype=torch.float32)
weights[prompt_len:] = 1.0 # Train on response tokens only
datum = datum_from_model_input_weights(
tinker.ModelInput.from_ints(tokens),
weights,
max_length=4096,
)
```
### Logprobs and forward\_backward\_custom
When you call `forward_backward_custom`, the GPU runs a forward pass and returns **per-token log-probabilities** as PyTorch tensors with `requires_grad=True`. Your loss function computes a scalar loss, the API calls `loss.backward()`, and gradients are sent back to the GPU for the model backward pass.
```python theme={null}
def my_loss_fn(data, logprobs_list):
    loss = compute_something(logprobs_list)
    return loss, {"loss": loss.item()}

result = training_client.forward_backward_custom(datums, my_loss_fn).result()
```
After accumulating gradients, call `optim_step` to apply the optimizer update:
```python theme={null}
import tinker
training_client.optim_step(
tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01)
).result()
```
### Futures
All training client API calls return **futures**. Call `.result()` to block until completion. Without `.result()`, errors are silently swallowed.
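The same behavior is easy to reproduce with the standard library. The sketch below uses `concurrent.futures` as an analogy only; the SDK's future type may differ in detail, and `failing_step` is a made-up function standing in for a remote call that fails.

```python
from concurrent.futures import ThreadPoolExecutor


def failing_step() -> None:
    """Stand-in for a remote call that fails."""
    raise RuntimeError("remote step failed")


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(failing_step)  # no exception is raised here

# The failure is stored on the future and only re-raised by .result().
try:
    future.result()
    surfaced = None
except RuntimeError as exc:
    surfaced = str(exc)

print(surfaced)  # remote step failed
```

If the `future.result()` call were omitted, the script would finish without any sign that the step had failed.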
### Checkpointing and weight sync
After training, you export checkpoints for serving:
* **Base checkpoint**: Full model weights. Use for the first checkpoint.
* **Delta checkpoint**: Only the diff from the previous base (\~10x smaller). Use for subsequent checkpoints.
**Weight sync** pushes a checkpoint onto a running inference deployment without restarting it, enabling evaluation under serving conditions during training. In normal SDK and cookbook code, this is expressed as `training_client.save_weights_for_sampler(...).result()` followed by `service.create_sampling_client(model_path=saved.path)` or `service.create_deployment_sampler(model_path=saved.path)`.
For RL rollouts that continue across weight sync, see [KV cache behavior for RL rollouts](/guides/rollout-inference#kv-cache-behavior-for-rl-rollouts) for how active request streams, session IDs, and `reset_prompt_cache` interact.
```mermaid theme={null}
flowchart LR
train["Train step"] --> save["save_weights_for_sampler"]
save --> sample_client["create_sampling_client(model_path=...)"]
sample_client --> sample["Sample via deployment"]
sample --> eval["Evaluate quality"]
eval --> train
```
## Key APIs
| API | Purpose |
| ------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| [`FiretitanServiceClient`](/fine-tuning/training-api/reference/service-client) | Recommended direct SDK entry point. Creates or reattaches trainers/deployments and returns training, reference, and sampling clients. |
| [`FiretitanTrainingClient`](/fine-tuning/training-api/reference/service-client) | Tinker-compatible training client: `forward_backward_custom`, `optim_step`, `save_weights_for_sampler`, `save_state`, and load methods. |
| [`DeploymentSampler`](/fine-tuning/training-api/reference/deployment-sampler) | FireTitan-native sampler for tokenized rollout/evaluation from SDK-managed deployments. |
| [`FireworksClient`](/fine-tuning/training-api/reference/fireworks-client) | Standalone checkpoint operations such as listing checkpoints or promoting a model without a live training instance. |
| [`TrainerJobManager`](/fine-tuning/training-api/reference/trainer-job-manager) | Legacy/compatibility lifecycle manager. Documented for existing SDK users and advanced debugging; not the recommended user-facing path. |
| [`DeploymentManager`](/fine-tuning/training-api/reference/deployment-manager) | Legacy/compatibility deployment manager. Documented for existing SDK users and advanced debugging; normal code uses `FiretitanServiceClient`. |
## Renderers
Chat-template formatting, stop-token handling, and loss-weight masking for SFT/DPO datasets are handled by **renderers** — pluggable per-model classes that turn raw conversations into the trainer's `Datum` shape. Most users never touch a renderer directly; cookbook recipes pick the right one for the `base_model` you set. If you need to author a new one or debug parity against HuggingFace, the implementation depth lives in the cookbook's [`skills/renderer/`](https://github.com/fw-ai/cookbook/tree/main/skills/renderer) skill.
## Comparing Training API pricing vs DIY bare metal
When comparing a managed training platform with a self-managed bare-metal stack,
optimize for **cost per successful iteration**, not just headline `$ / GPU-hour`.
### What to compare
* **Time to first deployed model**: include environment setup, training orchestration, checkpoint handoff, and serving integration.
* **Iteration cycle time** (`train -> eval -> deploy -> repeat`): include all retrain/redeploy plumbing, not just GPU runtime.
* **Infra engineering overhead**: include one-time setup and recurring maintenance for containers, runtimes, deployment workflows, and compatibility fixes.
* **Effective `$ / GPU-hour` at real utilization**: include idle capacity, reservation constraints, and burst/overflow behavior.
* **Train/serve parity risk**: account for potential quality drift when training and inference runtimes diverge.
* **Parallel experiment capacity**: compare fixed-reservation throughput against elastic capacity for sweeps and multi-seed runs.
### Useful formulas
```text theme={null}
iterations_per_month = available_working_days / cycle_time_days
effective_cost_per_gpu_hour = total_monthly_spend / gpu_hours_consumed
multi_turn_success ~= (single_turn_success)^turn_count
```
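A short worked example of the three formulas follows. Every input number is invented for illustration and should be replaced with your own measurements; the function names simply mirror the formula names above.

```python
def iterations_per_month(available_working_days: float, cycle_time_days: float) -> float:
    return available_working_days / cycle_time_days


def effective_cost_per_gpu_hour(total_monthly_spend: float, gpu_hours_consumed: float) -> float:
    return total_monthly_spend / gpu_hours_consumed


def multi_turn_success(single_turn_success: float, turn_count: int) -> float:
    # Assumes turns succeed independently with the same probability.
    return single_turn_success ** turn_count


# Illustrative inputs only.
print(iterations_per_month(20, 2.5))                # 8.0
print(effective_cost_per_gpu_hour(30_000, 6_000))   # 5.0
print(round(multi_turn_success(0.95, 10), 3))       # 0.599
```

The last line shows why turn count matters: a 95% single-turn success rate compounds to roughly 60% over ten turns.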
### Keep assumptions explicit
Document assumptions so readers can adjust them for their own workload:
* team size and fully-loaded engineering cost
* average cycle duration in each setup
* expected utilization and burst profile
* average turn count for production agent workflows
* required concurrent experiment count
## FAQ
### Why is my training run "doing nothing" even though code executed?
Usually because `.result()` was not called on futures, so failures were never surfaced.
### What's the difference between base and delta checkpoints, and when should I use each?
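A base checkpoint contains the full model weights; a delta checkpoint contains only the diff from the previous base (about 10x smaller). Use a base checkpoint for your first checkpoint and delta checkpoints for subsequent ones to speed up sync and reduce storage.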
### Do I need to manage distributed training infra?
No. You implement training logic while Fireworks manages GPU provisioning and distributed infrastructure.
### Should I start with Cookbook or Direct SDK?
Start with Cookbook for most SFT/DPO/GRPO adaptations. Use the Direct SDK when you need custom loop semantics and full control.
### Can I evaluate serving behavior during training?
Yes. Save a checkpoint, sync it onto a running deployment, and evaluate under serving conditions.
### How should I compare Training API pricing vs a DIY bare-metal setup?
Use the framework in [Comparing Training API pricing vs DIY bare metal](#comparing-training-api-pricing-vs-diy-bare-metal). Focus on total iteration economics (cycle time, engineering overhead, utilization-adjusted cost, and quality-parity risk), then plug in your own assumptions.
### How can I compare rollout cost vs other providers?
See the [Price comparison vs Tinker](/fine-tuning/multi-turn-cost-comparison) calculator to estimate scenario-based costs on Fireworks Dedicated against Tinker's per-token pricing.
## Next steps
* [Quickstart](/fine-tuning/training-api/quickstart) — get a custom training loop running in minutes
* [Training and Sampling](/fine-tuning/training-api/training-and-sampling) — end-to-end API walkthrough
* [Loss Functions](/fine-tuning/training-api/loss-functions) — built-in and custom loss functions
* [Vision Inputs](/fine-tuning/training-api/vision-inputs) — fine-tune vision-language models with image and text data
* [The Cookbook](/fine-tuning/training-api/cookbook/overview) — ready-to-run recipes for SFT, DPO, ORPO, GRPO/IGPO, and async RL (experimental)
# Loss Functions
Source: https://docs.fireworks.ai/fine-tuning/training-api/loss-functions
Built-in loss functions and custom objectives via forward_backward_custom.
## What this is
The Training API supports two ways to compute loss:
1. **Built-in losses** via `forward_backward` with a string identifier (e.g. `"cross_entropy"`) — fastest, no extra forward pass needed.
2. **Custom losses** via `forward_backward_custom` with an arbitrary Python function — flexible, supports any differentiable objective at the cost of an additional forward pass.
## Built-in loss: cross\_entropy
For supervised fine-tuning, use the built-in `cross_entropy` loss via `forward_backward`:
```python theme={null}
result = training_client.forward_backward(datums, "cross_entropy").result()
```
This computes standard next-token prediction loss on the server side — no extra forward pass or local loss computation needed.
For built-in `cross_entropy`, the SDK backfills `result.metrics["response_tokens"]` so you can compute a mean loss from sum-style metrics when needed.
Built-in `cross_entropy` requires datums with `target_tokens` in `loss_fn_inputs`. Weight-based datums built with `datum_from_model_input_weights` fail with `"missing required field 'target_tokens'"`. For built-in `cross_entropy`, build datums in the target-token `tinker.Datum` format shown under "Using tinker.Datum directly (target-token-based)" below. To keep weight-based datums, use `forward_backward_custom` instead, with the weight-based format in [Building datums](#building-datums) and the custom-loss pattern in [Example: simple cross-entropy](#example-simple-cross-entropy).
For a **forward-only pass** (e.g. to compute reference logprobs without updating weights):
```python theme={null}
result = training_client.forward(datums, "cross_entropy").result()
ref_logprobs = [result.loss_fn_outputs[i]["logprobs"].data for i in range(len(datums))]
```
## Custom losses: forward\_backward\_custom
`forward_backward_custom` lets you implement any objective function in Python. You provide the loss computation; the API handles the forward pass on remote GPUs, passes logprobs back to your function, then sends the computed gradients back for the backward pass.
### How it works
1. You call `training_client.forward_backward_custom(datums, loss_fn)`.
2. The trainer runs a forward pass on the GPU and returns per-token logprobs.
3. The logprobs are converted to PyTorch tensors with `requires_grad=True`.
4. Your `loss_fn` is called with the datums and logprobs.
5. The API calls `loss.backward()` to compute `d_loss/d_logprob` gradients.
6. Gradients are sent back to the trainer GPU for the model backward pass.
Your loss function runs **locally** (on your machine), while the forward and backward passes run on **remote GPUs**.
`forward_backward_custom` does an extra forward pass compared to `forward_backward`, requiring \~1.5x FLOPs and up to \~3x wall time per step.
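Step 5 above produces one gradient value per token, and that per-token vector is what travels back to the trainer. The dependency-free sketch below deliberately avoids PyTorch so it runs anywhere: it estimates `d_loss/d_logprob` by finite differences for a weighted negative log-likelihood and shows that the result is simply the negated weight. The function names and numbers are made up for illustration and are not part of the SDK.

```python
def weighted_nll(logprobs: list[float], weights: list[float]) -> float:
    """Loss for one sequence: negative weighted sum of per-token logprobs."""
    return -sum(lp * w for lp, w in zip(logprobs, weights))


def numeric_grad(logprobs: list[float], weights: list[float], h: float = 1e-6) -> list[float]:
    """Central finite-difference estimate of d_loss/d_logprob for each token."""
    grads = []
    for i in range(len(logprobs)):
        up = list(logprobs)
        down = list(logprobs)
        up[i] += h
        down[i] -= h
        grads.append((weighted_nll(up, weights) - weighted_nll(down, weights)) / (2 * h))
    return grads


logprobs = [-1.2, -0.3, -2.0]  # made-up per-token logprobs
weights = [0.0, 1.0, 1.0]      # first token is prompt, the rest are response

grads = numeric_grad(logprobs, weights)
print([round(g, 4) for g in grads])  # approximately the negated weights
```

Prompt tokens (weight `0`) receive zero gradient, so they do not influence the model backward pass.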
### Embedding-space custom losses
For objectives that operate on pooled hidden states instead of logprobs, pass `output="embedding"` and `pooling="mean"` or `"last"`:
```python theme={null}
def embedding_loss(data, embeddings):
    loss = compute_embedding_objective(embeddings)
    return loss, {"embedding_loss": float(loss.item())}

result = training_client.forward_backward_custom(
datums,
embedding_loss,
output="embedding",
pooling="mean",
).result()
```
### Loss function signature
```python theme={null}
import tinker
import torch

def loss_fn(
    data: list[tinker.Datum],
    logprobs_list: list[torch.Tensor],
) -> tuple[torch.Tensor, dict[str, float]]:
    """
    Args:
        data: The same datums you passed to forward_backward_custom.
            Access token weights via data[i].loss_fn_inputs["weights"].data
        logprobs_list: Per-token log-probabilities from the forward pass.
            Each tensor has requires_grad=True. Shape: (seq_len,) per sequence.

    Returns:
        loss: A scalar tensor. Must be differentiable w.r.t. logprobs_list entries.
        metrics: A dict of float values for logging (not used for training).
    """
```
### Key rules
* **`logprobs_list[i]`** has `requires_grad=True` — your loss must be differentiable through it.
* **Use `torch.dot()`** to compute weighted sums — this correctly propagates gradients through the logprobs.
* **Return a scalar tensor** as the loss, and a `dict[str, float]` as metrics.
* **Access token weights** via `data[i].loss_fn_inputs["weights"].data` — these are `0` for prompt tokens and `1` for response tokens.
## Building datums
### Using tinker\_cookbook (weight-based)
`datum_from_model_input_weights` constructs datums with explicit token weights:
```python theme={null}
import tinker
import torch
from tinker_cookbook.supervised.common import datum_from_model_input_weights
tokens = [101, 2054, 2003, ...]
weights = torch.zeros(len(tokens), dtype=torch.float32)
weights[prompt_len:] = 1.0 # Only train on response tokens
datum = datum_from_model_input_weights(tinker.ModelInput.from_ints(tokens), weights, max_length=8192)
```
### Using tinker.Datum directly (target-token-based)
For RL-style objectives where you need per-completion control (e.g. routing matrices, custom `loss_fn_inputs`), construct datums directly:
```python theme={null}
import tinker
model_input_len = len(tokens) - 1
datum = tinker.Datum(
model_input=tinker.ModelInput.from_ints(tokens[:-1]),
loss_fn_inputs={
"target_tokens": tinker.TensorData(
data=tokens[1:], dtype="int64", shape=[model_input_len],
),
},
)
```
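The `tokens[:-1]` / `tokens[1:]` slicing above is the usual next-token shift: position `i` of the input is paired with token `i + 1` as its target. A tiny standalone illustration with made-up token ids:

```python
tokens = [101, 2054, 2003, 102]  # made-up token ids

model_input = tokens[:-1]   # everything except the last token
target_tokens = tokens[1:]  # everything except the first token

# Each input position is trained to predict the token that follows it.
pairs = list(zip(model_input, target_tokens))
print(pairs)  # [(101, 2054), (2054, 2003), (2003, 102)]
```

This is why `shape` is `len(tokens) - 1` in the datum: the final token has no successor to predict.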
## Example: simple cross-entropy
```python theme={null}
import torch

def cross_entropy_loss(data, logprobs_list):
    total_loss = torch.tensor(0.0)
    for i, logprobs in enumerate(logprobs_list):
        weights = torch.tensor(data[i].loss_fn_inputs["weights"].data, dtype=torch.float32)
        # Guard against a length mismatch between logprobs and weights.
        min_len = min(len(logprobs), len(weights))
        weighted_sum = torch.dot(logprobs[:min_len].float(), weights[:min_len])
        total_loss = total_loss - weighted_sum  # Negative log-likelihood
    loss = total_loss / len(logprobs_list)  # Mean over sequences
    return loss, {"cross_entropy": loss.item()}

result = training_client.forward_backward_custom(datums, cross_entropy_loss).result()
```
## Example: GRPO with KL penalty
```python theme={null}
import torch

def make_grpo_loss(rewards, ref_logprobs, kl_beta=0.001):
    # compute_advantages is a user-supplied helper that maps rewards to per-sample advantages.
    advantages = compute_advantages(rewards)
    ref_tensors = [torch.tensor(lp, dtype=torch.float32) for lp in ref_logprobs]

    def loss_fn(data, logprobs_list):
        total_loss = torch.tensor(0.0)
        for i in range(len(logprobs_list)):
            weights = torch.tensor(data[i].loss_fn_inputs["weights"].data, dtype=torch.float32)
            # Truncate policy and reference logprobs to the weight length.
            pi = logprobs_list[i][:len(weights)]
            ref = ref_tensors[i][:len(weights)]
            pg_loss = -advantages[i] * torch.dot(pi.float(), weights)  # policy-gradient term
            kl_term = torch.dot((pi - ref).float(), weights)  # masked logprob gap to the reference
            total_loss = total_loss + pg_loss + kl_beta * kl_term
        loss = total_loss / len(logprobs_list)
        return loss, {"loss": loss.item()}

    return loss_fn
```
## Example: DPO margin loss
```python theme={null}
import torch
import torch.nn.functional as F

def make_dpo_loss(ref_chosen, ref_rejected, beta=0.1):
    ref_c = torch.tensor(ref_chosen, dtype=torch.float32)
    ref_r = torch.tensor(ref_rejected, dtype=torch.float32)

    def loss_fn(data, logprobs_list):
        # Index 0 holds the chosen response, index 1 the rejected one.
        pi_c, pi_r = logprobs_list[0], logprobs_list[1]
        w_c = torch.tensor(data[0].loss_fn_inputs["weights"].data, dtype=torch.float32)
        w_r = torch.tensor(data[1].loss_fn_inputs["weights"].data, dtype=torch.float32)
        margin = (torch.dot(pi_c.float(), w_c) - torch.dot(ref_c, w_c)) - \
                 (torch.dot(pi_r.float(), w_r) - torch.dot(ref_r, w_r))
        return -F.logsigmoid(beta * margin), {"margin": margin.item()}

    return loss_fn
```
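To get a feel for the numbers, the margin loss can be evaluated by hand on summed response logprobs. The sketch below is plain Python with invented values; `dpo_loss` is an illustrative helper, not an SDK function, and it takes already-summed logprobs rather than tensors.

```python
import math


def dpo_loss(pi_chosen: float, ref_chosen: float,
             pi_rejected: float, ref_rejected: float,
             beta: float = 0.1) -> tuple[float, float]:
    """Return (-log(sigmoid(beta * margin)), margin) for summed response logprobs."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    loss = max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
    return loss, margin


# Policy identical to the reference: margin 0, loss log(2).
print(dpo_loss(-5.0, -5.0, -7.0, -7.0))

# Policy prefers the chosen response more than the reference does: lower loss.
print(dpo_loss(-4.0, -5.0, -8.0, -7.0))
```

A policy that matches the reference starts at `log(2)`, and the loss falls as the policy widens the chosen-versus-rejected gap relative to the reference.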
## Built-in loss methods: GRPO vs DAPO vs GSPO-token
When using the managed RFT flow or the cookbook's RL recipe, three built-in loss methods are available via `--rl-loss-method`:
| Method | Clipping | KL penalty | Loss aggregation | Importance sampling |
| ---------------- | ----------------------------- | ------------- | ------------------- | ------------------- |
| `grpo` (default) | Symmetric `[0.8, 1.2]` | Yes (`0.001`) | Token-mean | Token-level |
| `dapo` | Asymmetric `[0.8, 1.28]` | No | Token-mean | Token-level |
| `gspo-token` | Very tight `[1-3e-4, 1+4e-4]` | No | Seq-mean-token-mean | Sequence-level |
**GRPO** ([arXiv:2402.03300](https://arxiv.org/abs/2402.03300)) is the safe default with KL regularization.
**DAPO** ([arXiv:2503.14476](https://arxiv.org/abs/2503.14476)) removes KL and uses asymmetric clipping to allow more aggressive exploration in the improve direction.
**GSPO-token** ([arXiv:2507.18071](https://arxiv.org/abs/2507.18071)) uses sequence-level importance ratios and extremely tight clipping. The `seq-mean-token-mean` aggregation normalizes per-sequence before averaging, reducing bias toward longer responses.
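The difference between the two aggregation modes in the table can be shown with a toy batch. This is a sketch of the general idea only; the function names follow the table labels, and the built-in implementation may differ in detail.

```python
def token_mean(per_token_losses: list[list[float]]) -> float:
    """Average over every token in the batch, so long sequences weigh more."""
    flat = [x for seq in per_token_losses for x in seq]
    return sum(flat) / len(flat)


def seq_mean_token_mean(per_token_losses: list[list[float]]) -> float:
    """Average within each sequence first, then across sequences."""
    per_seq = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_seq) / len(per_seq)


# One short response with low loss, one long response with high loss (made-up values).
batch = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0, 4.0, 4.0]]

print(token_mean(batch))           # 3.25, dominated by the long response
print(seq_mean_token_mean(batch))  # 2.5, each response counts equally
```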
For Training API users implementing custom loss functions via `forward_backward_custom`, these methods serve as reference implementations. You can replicate or modify their behavior in your custom loss function. See [Parameter Tuning](/fine-tuning/parameter-tuning#loss-method) for detailed guidance on when to choose each method.
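As a starting point for replicating the clipping behavior, here is the generic PPO-style clipped surrogate evaluated with the GRPO and DAPO ranges from the table. It is a scalar sketch under the assumption that the built-in methods apply their ranges in this standard way; `clipped_surrogate` is a hypothetical helper, not an SDK function.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      clip_low: float, clip_high: float) -> float:
    """PPO-style objective: min(r * A, clip(r, low, high) * A)."""
    clipped = min(max(ratio, clip_low), clip_high)
    return min(ratio * advantage, clipped * advantage)


GRPO_CLIP = (0.8, 1.2)    # symmetric range from the table
DAPO_CLIP = (0.8, 1.28)   # asymmetric range from the table

ratio, advantage = 1.25, 1.0  # made-up importance ratio and positive advantage

print(clipped_surrogate(ratio, advantage, *GRPO_CLIP))  # 1.2, clipped at the upper bound
print(clipped_surrogate(ratio, advantage, *DAPO_CLIP))  # 1.25, still inside the wider bound
```

With a positive advantage, the wider upper bound lets the update keep growing where the symmetric range has already flattened it, which is the "more aggressive exploration in the improve direction" described above.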
## Applying the optimizer step
After `forward_backward_custom`, call `optim_step` to update weights:
```python theme={null}
training_client.forward_backward_custom(datums, loss_fn).result()
training_client.optim_step(
tinker.AdamParams(
learning_rate=1e-5,
beta1=0.9,
beta2=0.999,
eps=1e-8,
weight_decay=0.01,
)
).result()
```
For gradient accumulation, call `forward_backward_custom` multiple times before calling `optim_step`:
```python theme={null}
for micro_batch in micro_batches:
    training_client.forward_backward_custom(micro_batch, loss_fn).result()
# One optimizer step after accumulating gradients
training_client.optim_step(tinker.AdamParams(learning_rate=1e-5, ...)).result()
```
## Gradient accumulation normalization
When you accumulate multiple micro-batches before `optim_step`, you have two places where normalization can happen:
1. Inside your loss function
2. Server-side inside `optim_step(..., grad_accumulation_normalization=...)`
Use only one normalization path. If your loss already returns a mean, leave server-side normalization unset. If your loss returns a raw sum, choose the matching server-side normalization mode:
```python theme={null}
from fireworks.training.sdk import GradAccNormalization
training_client.forward_backward_custom(datums, loss_fn).result()
training_client.optim_step(
    tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01),
    grad_accumulation_normalization=GradAccNormalization.NUM_LOSS_TOKENS,
).result()
```
| Mode | Divides by | Best for |
| -------------------------------------- | ----------------------------------------------------------- | -------------------------------------------------------------------------------------- |
| `GradAccNormalization.NUM_LOSS_TOKENS` | Total non-zero-grad tokens across accumulated micro-batches | Raw-sum token-level losses, such as RL / GRPO-style objectives |
| `GradAccNormalization.NUM_SEQUENCES` | Total sequences with at least one non-zero-grad token | Raw-sum sequence-level objectives |
| `None` | Nothing | Losses that already return per-token or per-sequence means, such as SFT, DPO, and ORPO |
### Choosing the right mode
* If your loss function returns a **raw sum over tokens**, use `GradAccNormalization.NUM_LOSS_TOKENS`.
* If your loss function returns a **raw sum over sequences**, use `GradAccNormalization.NUM_SEQUENCES`.
* If your loss function already returns a **mean**, leave `grad_accumulation_normalization` unset.
Do not normalize in both places. If your loss function already divides by tokens or sequences, adding server-side normalization will double-normalize the gradients.
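The effect of double normalization can be seen with plain arithmetic. The sketch below uses scalar per-token losses as stand-ins for gradients (gradients scale the same way) and assumes accumulated contributions are summed across micro-batches; the numbers are made up for illustration.

```python
# Per-token losses for two micro-batches of different sizes.
micro_batches = [[0.5, 1.5], [1.0, 2.0, 3.0, 4.0]]
total_tokens = sum(len(mb) for mb in micro_batches)

# Correct path: the loss returns a raw sum, and one division by the
# total token count happens afterwards.
raw_sums = [sum(mb) for mb in micro_batches]
normalized_once = sum(raw_sums) / total_tokens

# Mistake: the loss already returns a per-micro-batch mean, and the
# accumulated value is divided by the token count a second time.
means = [sum(mb) / len(mb) for mb in micro_batches]
double_normalized = sum(means) / total_tokens

print(normalized_once, double_normalized)
```

The doubly normalized value is several times smaller than the intended one, which acts like an unintended reduction of the learning rate.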
### Recipe defaults
| Recipe | Default | Rationale |
| --------- | -------------------------------------- | ------------------------------------------------------------- |
| SFT | `None` | The SFT loss is already normalized client-side. |
| GRPO / RL | `GradAccNormalization.NUM_LOSS_TOKENS` | RL losses use server-side per-token normalization by default. |
| DPO | `None` | The DPO loss is already normalized client-side. |
| ORPO | `None` | The ORPO loss is already normalized client-side. |
## Common pitfalls
* **Token-weight misalignment** can silently break objective semantics — always verify that `len(logprobs)` and `len(weights)` are compatible (truncate to `min_len`).
* **Ignoring per-step diagnostics** makes instability hard to attribute — log metrics from every train step.
* **Forgetting `.result()`** — all Tinker API calls return futures. Without `.result()`, errors are silently swallowed.
* **Non-differentiable loss**: If your loss doesn't depend on `logprobs_list` entries through differentiable ops, gradients will be zero.
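The `.result()` pitfall can be illustrated with standard-library futures. This is only an analogy: it uses `concurrent.futures`, not the Tinker client, and the future type returned by the Training API may differ in detail.

```python
from concurrent.futures import ThreadPoolExecutor


def failing_call():
    """Stand-in for a remote call that fails."""
    raise ValueError("bad batch")


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(failing_call)
    # Nothing is raised by submit(); the error stays inside the future.
    try:
        future.result()  # the exception only surfaces here
        surfaced = None
    except ValueError as exc:
        surfaced = str(exc)

print(surfaced)
```

If `future.result()` were never called, the `ValueError` would go unnoticed by the calling code.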
## Related guides
* [Training and Sampling](/fine-tuning/training-api/training-and-sampling) — end-to-end workflow
* [Saving and Loading](/fine-tuning/training-api/saving-and-loading) — checkpoint and weight sync
* [Cookbook RL recipe](/fine-tuning/training-api/cookbook/rl) — GRPO with full reward pipeline
* [Cookbook DPO recipe](/fine-tuning/training-api/cookbook/dpo) — DPO with preference data
# Quickstart
Source: https://docs.fireworks.ai/fine-tuning/training-api/quickstart
Get a custom training loop running in minutes with the Fireworks Training API.
## Installation
Install the Fireworks Python package with training extensions:
```bash theme={null}
pip install --pre "fireworks-ai[training]"
```
Set your credentials:
```bash theme={null}
export FIREWORKS_API_KEY="your-api-key"
```
If you want ready-to-run recipes instead of writing a loop from scratch, see [The Cookbook](/fine-tuning/training-api/cookbook/overview) for config-driven GRPO, DPO, and SFT training.
## Your first training loop
This quickstart walks through a minimal SFT loop from scratch using only the API.
For trainer launch, the only shape-specific input you provide is the training shape ID. In most cases, use the full shared path `accounts/fireworks/trainingShapes/`. The `fireworks` account is the public shared shape catalog. The SDK-managed service client resolves the pinned version, creates or reattaches the trainer, and returns a Tinker-compatible training client.
### Step 1: Create the managed service
```python theme={null}
import os
from fireworks.training.sdk import FiretitanServiceClient
api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")
base_model = "accounts/fireworks/models/qwen3-8b"
shape_id = "accounts/fireworks/trainingShapes/qwen3-8b-128k-h200"
service = FiretitanServiceClient.from_firetitan_config(
    api_key=api_key,
    base_url=base_url,
    base_model=base_model,
    tokenizer_model="Qwen/Qwen3-8B",
    lora_rank=0,
    training_shape_id=shape_id,
    learning_rate=1e-5,
    create_deployment=False,
    cleanup_trainer_on_close=True,
)
```
### Step 2: Create the training client
```python theme={null}
training_client = service.create_training_client(
    base_model=base_model,
    lora_rank=0,
)
print(f"Trainer job: {service.trainer_job_id}")
```
### Step 3: Build training data
Each training example is a **Datum** — a tokenized sequence with per-token weights indicating which tokens to train on.
```python theme={null}
import tinker
import torch
import transformers
from tinker_cookbook.supervised.common import datum_from_model_input_weights
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-8B", trust_remote_code=True,
)
conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
full_text = tokenizer.apply_chat_template(conversation, tokenize=False)
full_tokens = tokenizer.encode(full_text)
prompt_only = tokenizer.apply_chat_template(conversation[:1], tokenize=False)
prompt_len = len(tokenizer.encode(prompt_only))
weights = torch.zeros(len(full_tokens), dtype=torch.float32)
weights[prompt_len:] = 1.0
datum = datum_from_model_input_weights(
    tinker.ModelInput.from_ints(full_tokens),
    weights,
    max_length=4096,
)
```
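The mask above relies on the prompt-only tokens being an exact prefix of the full token sequence. A small check can catch cases where a chat template tokenizes the two differently. The sketch below uses plain lists instead of tensors, and `build_completion_weights` is a hypothetical helper name, not part of the SDK.

```python
from typing import List, Sequence


def build_completion_weights(full_tokens: Sequence[int], prompt_tokens: Sequence[int]) -> List[float]:
    """Return 0.0 for prompt positions and 1.0 for completion positions.

    Raises if the prompt is not an exact prefix of the full sequence,
    since the mask would then cover the wrong tokens.
    """
    prompt_len = len(prompt_tokens)
    if list(full_tokens[:prompt_len]) != list(prompt_tokens):
        raise ValueError("prompt tokens are not a prefix of the full sequence")
    return [0.0] * prompt_len + [1.0] * (len(full_tokens) - prompt_len)


# Toy token IDs for illustration.
print(build_completion_weights([11, 12, 13, 14, 15], [11, 12]))
```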
### Step 4: Write a loss function and train
```python theme={null}
import tinker
def sft_loss(data, logprobs_list):
    total_loss = torch.tensor(0.0)
    n_tokens = 0
    for i, logprobs in enumerate(logprobs_list):
        weights = torch.tensor(
            data[i].loss_fn_inputs["weights"].data, dtype=torch.float32,
        )
        min_len = min(len(logprobs), len(weights))
        total_loss = total_loss - torch.dot(
            logprobs[:min_len].float(), weights[:min_len],
        )
        n_tokens += weights[:min_len].sum().item()
    loss = total_loss / max(n_tokens, 1)
    return loss, {"sft_loss": loss.item(), "n_tokens": n_tokens}
batch = [datum]
for step in range(10):
    result = training_client.forward_backward_custom(batch, sft_loss).result()
    training_client.optim_step(
        tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01)
    ).result()
    print(f"Step {step}: {result.metrics}")
```
### Step 5: Save and promote
```python theme={null}
saved = training_client.save_weights_for_sampler(
    "sft-final",
    checkpoint_type="base",
).result()
print(f"Checkpoint saved: {saved.path}")
# Promote the checkpoint to a deployable Fireworks model. `list_checkpoints`
# returns the full 4-segment checkpoint resource name that promotion expects.
entry = next(
    row for row in service.list_checkpoints(service.trainer_job_id)
    if row["name"].endswith(f"/checkpoints/{saved.path}")
)
model = service.promote_checkpoint(
    name=entry["name"],
    output_model_id="my-sft-model",
    base_model=base_model,
)
service.close()
```
For production scripts, wrap `service.close()` in `try/finally` so SDK-managed trainers are cleaned up on exit — including on exceptions. See [Cleanup and Teardown](/fine-tuning/training-api/reference/cleanup).
## Next steps
* [Training and Sampling](/fine-tuning/training-api/training-and-sampling) — full end-to-end lifecycle with deployment evaluation
* [Loss Functions](/fine-tuning/training-api/loss-functions) — GRPO, DPO, and custom loss function patterns
* [Vision Inputs](/fine-tuning/training-api/vision-inputs) — fine-tune vision-language models with image and text data
* [Saving and Loading](/fine-tuning/training-api/saving-and-loading) — checkpointing and weight sync details
* [The Cookbook](/fine-tuning/training-api/cookbook/overview) — ready-to-run recipes for GRPO, DPO, and SFT
# Cleanup and Teardown
Source: https://docs.fireworks.ai/fine-tuning/training-api/reference/cleanup
Delete trainer jobs and deployments after experiments to avoid leaked resources.
## What this is
RLOR trainer jobs and weight-sync-enabled deployments hold GPU resources. Always clean up after experiments — especially if jobs terminate unexpectedly. In new SDK and cookbook code, cleanup is owned by the SDK-managed service client.
## Automatic cleanup via the SDK-managed service
Create the service with cleanup options, then close it in `finally`:
```python theme={null}
from fireworks.training.sdk import FiretitanServiceClient
service = FiretitanServiceClient.from_firetitan_config(
    api_key=api_key,
    base_url=base_url,
    base_model="accounts/fireworks/models/qwen3-8b",
    tokenizer_model="Qwen/Qwen3-8B",
    lora_rank=0,
    training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    deployment_id="research-serving",
    cleanup_trainer_on_close=True,
    cleanup_deployment_on_close="scale_to_zero",
)
try:
    run_training_loop()
finally:
    service.close()
```
`cleanup_trainer_on_close=True` deletes SDK-managed trainers. Separate reference trainers are governed by `cleanup_reference_trainer_on_close` (default `True`). `cleanup_deployment_on_close="scale_to_zero"` releases deployment GPUs while keeping the deployment resource around; use `"delete"` only when you want to remove the deployment entirely.
Cookbook recipes use the same service-client lifecycle internally and close the service through an `ExitStack`.
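As a rough picture of the `ExitStack` pattern, the sketch below registers a `close()` callback that runs even when the body raises. `StubService` is a stand-in defined here for illustration; it is not the SDK class, and the cookbook's actual wiring may differ.

```python
from contextlib import ExitStack


class StubService:
    """Stand-in for a service client; only close() matters here."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


service = StubService()
try:
    with ExitStack() as stack:
        # Registered callbacks run when the stack exits, including on errors.
        stack.callback(service.close)
        raise RuntimeError("training loop failed")
except RuntimeError:
    pass

print(service.closed)
```

An `ExitStack` is convenient when several resources are created conditionally, because each one can register its own teardown as it is created.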
The standalone `ResourceCleanup` context manager and `setup_infra` helper have been **removed** from the cookbook. Provisioning and teardown now live behind the SDK-managed service client. See [Migrating from the deprecated managed infra](/fine-tuning/training-api/cookbook/reference#deprecated-managed-infra-infraconfig).
## Trainer inactivity cleanup
Long-running RLOR trainer jobs are automatically stopped after 60 minutes with no tracked activity. The trainer reports activity to the control plane; tracked activity includes trainer API operations and active-session heartbeats.
When creating a trainer through the REST API (`POST /v1/accounts/{account_id}/rlorTrainerJobs`), set `inactivityTimeout` to a positive protobuf JSON duration to choose a different timeout:
```json theme={null}
{
"inactivityTimeout": "1800s"
}
```
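If the timeout is held as a `timedelta` in Python, it can be rendered into the protobuf JSON duration form (seconds followed by `s`) before building the request body. This is a sketch; `to_protobuf_json_duration` is a hypothetical helper, not an SDK function.

```python
from datetime import timedelta


def to_protobuf_json_duration(delta: timedelta) -> str:
    """Render a positive timedelta as a protobuf JSON duration string."""
    if delta <= timedelta(0):
        raise ValueError("inactivity timeout must be positive")
    whole_seconds = delta.days * 86_400 + delta.seconds
    if delta.microseconds:
        # Keep sub-second precision as six fractional digits.
        return f"{whole_seconds}.{delta.microseconds:06d}s"
    return f"{whole_seconds}s"


payload_fragment = {"inactivityTimeout": to_protobuf_json_duration(timedelta(minutes=30))}
print(payload_fragment)
```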
When creating a trainer through the legacy manager API, set `TrainerJobConfig.inactivity_timeout` and pass the config to `TrainerJobManager.create(...)` or `TrainerJobManager.create_and_wait(...)`:
```python theme={null}
from datetime import timedelta
from fireworks.training.sdk import TrainerJobConfig
config = TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    training_shape_ref="accounts/fireworks/trainingShapes//versions/",
    inactivity_timeout=timedelta(minutes=30),
)
```
With `firectl`, use `--inactivity-timeout 30m` or `--inactivity-timeout 2h`. When the value is omitted or set to `0`, Fireworks uses the 60-minute default.
To disable automatic inactivity cleanup, set `disableInactivityCleanup` in the REST API, set `TrainerJobConfig.disable_inactivity_cleanup=True` in the Training SDK, or pass `--disable-inactivity-cleanup` in `firectl`. The trainer will not be stopped due to inactivity, and GPU usage continues to accrue while the trainer is running, so delete the trainer when you no longer need it.
## Manual compatibility cleanup
If you provisioned resources yourself with `TrainerJobManager` / `DeploymentManager` instead of the managed service, delete them directly.
### Cleaning up RLOR trainer jobs
```python theme={null}
import os
from fireworks.training.sdk import TrainerJobManager, DeploymentManager
api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")
rlor_mgr = TrainerJobManager(api_key=api_key, base_url=base_url)
deploy_mgr = DeploymentManager(api_key=api_key, base_url=base_url)
# Delete known trainer jobs from this run
for job_id in ["", ""]:
    rlor_mgr.delete(job_id=job_id)
```
### Cleaning up deployments
```python theme={null}
deploy_mgr.delete(deployment_id="")
```
If you want to keep the deployment resource but release GPUs (lighter alternative to delete):
```python theme={null}
deploy_mgr.scale_to_zero(deployment_id="")
```
This sets both `minReplicaCount` and `maxReplicaCount` to `0`, releasing all accelerators while keeping the deployment available for future scale-up.
### Manual cleanup with try/finally
```python theme={null}
policy_job_id = ""
reference_job_id = ""
deployment_id = "research-loop-serving"
try:
    run_training_loop()
finally:
    rlor_mgr.delete(policy_job_id)
    rlor_mgr.delete(reference_job_id)
    deploy_mgr.delete(deployment_id)
```
## Checking for leaked resources
Track the IDs you create (trainer job IDs + deployment ID) and clean those explicitly. For broad account-wide discovery, use the Fireworks console or the managed `fw.*.list()` APIs.
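One possible way to track IDs is a small ledger that records each resource together with the callable that deletes it, then attempts every deletion even if one fails. `ResourceLedger` is a hypothetical helper, and the lambda below stands in for real delete calls such as a manager's `delete` method.

```python
from typing import Callable, List, Tuple


class ResourceLedger:
    """Records created resource IDs and deletes them in reverse order."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, Callable[[str], None]]] = []

    def track(self, resource_id: str, delete_fn: Callable[[str], None]) -> None:
        self._entries.append((resource_id, delete_fn))

    def cleanup(self) -> List[Tuple[str, Exception]]:
        """Try every deletion; return the ones that failed."""
        failures = []
        for resource_id, delete_fn in reversed(self._entries):
            try:
                delete_fn(resource_id)
            except Exception as exc:  # keep going so later resources are not leaked
                failures.append((resource_id, exc))
        self._entries.clear()
        return failures


deleted = []
ledger = ResourceLedger()
ledger.track("policy-job", deleted.append)
ledger.track("serving-deployment", deleted.append)
failures = ledger.cleanup()
print(deleted, failures)
```

Collecting failures instead of raising on the first one avoids the case where an early delete error skips the remaining deletions.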
## Operational guidance
* **Delete both policy and reference trainers** when running GRPO (which uses 2 RLOR jobs).
* **Close the managed service** in `finally` so trainer/reference/deployment cleanup runs on Ctrl+C or exceptions.
* **Don't delete a trainer** while a `save_weights_for_sampler` operation is in progress — wait for it to complete first.
## Related guides
* [FiretitanServiceClient](/fine-tuning/training-api/reference/service-client)
* [Training and Sampling](/fine-tuning/training-api/training-and-sampling)
# DeploymentManager (Compatibility)
Source: https://docs.fireworks.ai/fine-tuning/training-api/reference/deployment-manager
Legacy SDK reference for direct deployment lifecycle and weight-sync management.
## Overview
`DeploymentManager` is a low-level compatibility API. New user code should not wire deployments or weight-sync buckets manually; use [`FiretitanServiceClient.from_firetitan_config(...)`](/fine-tuning/training-api/reference/service-client#from_firetitan_config), then `service.create_sampling_client(model_path=...)` or `service.create_deployment_sampler(model_path=...)`. This page remains for existing integrations, migration support, and advanced deployment debugging.
`DeploymentManager` manages the lifecycle of inference deployments that serve as sampling and weight-sync targets during training. For on-policy training (GRPO), the deployment is synced with the latest policy weights.
```python theme={null}
from fireworks.training.sdk import DeploymentManager, DeploymentConfig
```
## Constructor
`DeploymentManager` supports separate URLs for control-plane, inference, and weight-sync traffic:
```python theme={null}
deploy_mgr = DeploymentManager(
    api_key="",
    base_url="https://api.fireworks.ai",         # Control-plane URL (deployment CRUD)
    inference_url="https://api.fireworks.ai",    # Gateway URL for inference (defaults to base_url)
    hotload_api_url="https://api.fireworks.ai",  # Gateway URL for weight-sync ops (defaults to base_url)
)
```
| Parameter | Type | Default | Description |
| -------------------- | -------------- | ---------------------------- | --------------------------------------------------------------- |
| `api_key` | `str` | — | Fireworks API key |
| `base_url` | `str` | `"https://api.fireworks.ai"` | Control-plane URL for deployment CRUD |
| `inference_url` | `str \| None` | `None` | Gateway URL for inference completions (defaults to `base_url`) |
| `hotload_api_url` | `str \| None` | `None` | Gateway URL for weight-sync operations (defaults to `base_url`) |
| `additional_headers` | `dict \| None` | `None` | Extra HTTP headers |
| `verify_ssl` | `bool \| None` | `None` | SSL verification override |
For most users, all three URLs default to `base_url`. Separate URLs are useful when the control-plane and gateway have different endpoints (e.g. personal dev gateways).
## Methods
### `create_or_get(config, force_recreate=False)`
Create a new deployment or retrieve an existing one. Set `force_recreate=True` to delete and recreate if it already exists:
```python theme={null}
deploy_info = deploy_mgr.create_or_get(DeploymentConfig(
    deployment_id="research-loop-serving",
    base_model="accounts/fireworks/models/qwen3-8b",
    min_replica_count=0,
    max_replica_count=1,
))
```
Returns a `DeploymentInfo`.
### `wait_for_ready(deployment_id, timeout_s=600, poll_interval_s=15)`
Poll until the deployment is ready to serve:
```python theme={null}
deploy_mgr.wait_for_ready("research-loop-serving")
```
Returns a `DeploymentInfo`.
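The general shape of a poll-with-timeout loop looks roughly like the sketch below. It is a generic illustration that reuses the `timeout_s` and `poll_interval_s` parameter names from the signature above; `poll_until` is a hypothetical function and not how the SDK is necessarily implemented.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(check: Callable[[], Optional[T]], timeout_s: float = 600, poll_interval_s: float = 15) -> T:
    """Call `check` until it returns a non-None value or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        time.sleep(poll_interval_s)


# Simulated status checks: not ready twice, then ready.
states = iter([None, None, "READY"])
final_state = poll_until(lambda: next(states), timeout_s=5, poll_interval_s=0)
print(final_state)
```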
### `get(deployment_id)`
Inspect deployment status. Returns a `DeploymentInfo` or `None` if not found:
```python theme={null}
current = deploy_mgr.get("research-loop-serving")
print(current.state if current else "MISSING")
```
### `hotload_and_wait(deployment_id, base_model, snapshot_identity, ...)`
Load a checkpoint onto the deployment and wait for completion:
```python theme={null}
deploy_mgr.hotload_and_wait(
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    snapshot_identity=result.snapshot_name,
    timeout_seconds=400,
)
```
For delta weight syncs, pass `incremental_snapshot_metadata`:
```python theme={null}
deploy_mgr.hotload_and_wait(
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    snapshot_identity=delta_result.snapshot_name,
    incremental_snapshot_metadata={
        "previous_snapshot_identity": base_result.snapshot_name,
        "compression_format": "arc_v2",
        "checksum_format": "alder32",
    },
    timeout_seconds=400,
)
```
### `hotload_check_status(deployment_id, base_model, timeout=30)`
Current weight-sync status per replica — `current_snapshot_identity`, `readiness`, `loading_state.stage`. Use for ad-hoc inspection or to decide whether a weight sync is needed.
### `wait_for_hotload(deployment_id, base_model, expected_identity, timeout_seconds=400, poll_interval=5)`
Poll until every replica reports `readiness=true` and `current_snapshot_identity == expected_identity`. The "wait half" of `hotload_and_wait` — call directly when you started a sync via `hotload()` and want to block separately.
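The completion condition described above can be written as a predicate over per-replica status entries. The sketch assumes each replica is represented as a dict with `readiness` and `current_snapshot_identity` keys; the actual response shape is an assumption, and `all_replicas_synced` is a hypothetical helper.

```python
from typing import Any, Dict, List


def all_replicas_synced(replicas: List[Dict[str, Any]], expected_identity: str) -> bool:
    """True only if there is at least one replica and every one is ready on the expected snapshot."""
    if not replicas:
        return False  # no replicas reported yet, so nothing is confirmed
    return all(
        replica.get("readiness") is True
        and replica.get("current_snapshot_identity") == expected_identity
        for replica in replicas
    )


# Made-up status entries for illustration.
status = [
    {"readiness": True, "current_snapshot_identity": "step-100"},
    {"readiness": True, "current_snapshot_identity": "step-90"},
]
print(all_replicas_synced(status, "step-100"))
```

Here one replica is still on an older snapshot, so the check reports that the sync is not complete.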
### `update(deployment_id, body, update_mask)`
Partial PATCH. `update_mask` is **required** (snake-case field paths); without it the server replaces all mutable fields, silently zeroing anything not in `body`. Returns `DeploymentInfo`.
```python theme={null}
deploy_mgr.update(
    "my-deployment",
    body={"minReplicaCount": 2, "maxReplicaCount": 8},
    update_mask="min_replica_count,max_replica_count",
)
```
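To reduce the risk of the mask and the body drifting apart, the mask can be derived from the body's keys. The sketch below converts top-level camelCase keys to snake-case paths; `update_mask_for` is a hypothetical helper, and nested field paths are not handled.

```python
import re
from typing import Any, Mapping


def update_mask_for(body: Mapping[str, Any]) -> str:
    """Build a comma-separated snake-case mask from top-level camelCase keys."""
    return ",".join(
        # Insert an underscore before each interior capital, then lowercase.
        re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()
        for key in body
    )


body = {"minReplicaCount": 2, "maxReplicaCount": 8}
mask = update_mask_for(body)
print(mask)
```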
### `warmup(model, max_retries=30, retry_interval_s=10.0)`
Send a warmup request to the deployment after weight sync. Retries until the deployment responds or the retry limit is reached. Returns `True` on success, `False` if all retries are exhausted.
### `scale_to_zero(deployment_id)`
Release GPU resources without deleting the deployment:
```python theme={null}
deploy_mgr.scale_to_zero("research-loop-serving")
```
Sets both `minReplicaCount` and `maxReplicaCount` to `0`.
### `delete(deployment_id)`
Delete a deployment entirely:
```python theme={null}
deploy_mgr.delete("research-loop-serving")
```
## DeploymentConfig
`DeploymentManager.create_or_get(...)` accepts a `DeploymentConfig` dataclass.
When `deployment_shape` is set (the recommended path), the shape owns the deployment's hardware and serving configuration. The fields below are what you set as a user:
| Field | Type | Default | Description |
| ------------------------------ | ------------------------ | --------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `deployment_id` | `str` | — | Stable deployment identifier |
| `base_model` | `str` | — | Base model name. Must match the trainer's base model for weight sync compatibility. |
| `deployment_shape` | `str \| None` | `None` | Deployment shape resource name. When set, the shape owns GPU type, node count, and serving engine config. |
| `region` | `str \| None` | `None` | Region for the deployment |
| `min_replica_count` | `int` | `0` | Minimum replicas (set `0` to scale to zero when idle) |
| `max_replica_count` | `int` | `1` | Maximum replicas for autoscaling |
| `accelerator_type` | `str` | `"NVIDIA_H200_141GB"` | Manual-path deployment GPU type. Do not set when `deployment_shape` is set. |
| `hot_load_bucket_type` | `str \| None` | `"FW_HOSTED"` | Weight sync storage backend |
| `hot_load_trainer_job` | `str \| None` | `None` | Trainer job name whose weight-sync bucket this deployment should use. Format: `accounts/{account}/rlorTrainerJobs/{job_id}`. When set, the deployment shares the trainer's bucket for weight sync. |
| `disable_speculative_decoding` | `bool` | `False` | Disable speculative decoding |
| `extra_args` | `list[str] \| None` | `None` | Extra serving arguments |
| `extra_values` | `dict[str, str] \| None` | `None` | Extra deployment Helm values |
| `annotations` | `dict[str, str] \| None` | `None` | Deployment annotations |
On the recommended shape path, `deployment_shape` owns the deployment hardware and serving configuration, so do not override `accelerator_type`. Advanced manual deployments can omit `deployment_shape` and set `accelerator_type` directly. `skip_shape_validation` is for internal development and requires elevated permissions.
## DeploymentInfo
Returned by `create_or_get`, `wait_for_ready`, and `get`:
| Field | Type | Description |
| --------------------- | ------------- | ------------------------------------------------------------------------ |
| `deployment_id` | `str` | Deployment identifier |
| `name` | `str` | Full resource name |
| `state` | `str` | Deployment state (e.g. `"READY"`, `"CREATING"`) |
| `hot_load_bucket_url` | `str \| None` | URL for weight sync storage |
| `inference_model` | `str \| None` | Model string for completions API (`accounts/{account}/deployments/{id}`) |
## Deployment shape and training shapes
When using a training shape, the linked **deployment shape is determined by the training shape and cannot be changed**. The training shape's `deploymentShapeVersion` locks the GPU type, node count, and serving engine configuration for the inference deployment.
The one thing you **can** adjust is the **replica count**. Use `min_replica_count` and `max_replica_count` to scale up throughput for sampling during RL loops:
```python theme={null}
deploy_mgr.create_or_get(DeploymentConfig(
    deployment_id="rl-serving",
    base_model="accounts/fireworks/models/qwen3-8b",
    deployment_shape="accounts/fireworks/deploymentShapes/qwen3-8b-128k-h200",
    min_replica_count=1,
    max_replica_count=4,
))
```
## Operational guidance
* **Prefer `FiretitanServiceClient`** for normal trainer/deployment provisioning and sampler refresh.
* **Keep deployment IDs stable** per experiment family for easier rollbacks.
* **Use `min_replica_count=0`** for development to avoid idle GPU costs.
* **Create the trainer before the deployment** and link the deployment to the trainer's weight-sync bucket via `hot_load_trainer_job`.
* **Use `deployment_shape`** when the control plane has a pre-validated shape for your model.
* **Do not treat shape-owned hardware as a user-facing override surface** — in normal flows, leave `accelerator_type` and placement decisions to the deployment shape and only tune replica counts.
* **Use `scale_to_zero`** after training as a lighter alternative to `delete`.
## Related guides
* [DeploymentSampler](/fine-tuning/training-api/reference/deployment-sampler) — sample from the deployment
* [FiretitanServiceClient](/fine-tuning/training-api/reference/service-client) — recommended managed service path
* [Cleanup](/fine-tuning/training-api/reference/cleanup) — resource cleanup
# DeploymentSampler
Source: https://docs.fireworks.ai/fine-tuning/training-api/reference/deployment-sampler
Client-side tokenized sampling from inference deployments for training and evaluation.
## Overview
`DeploymentSampler` handles client-side tokenization via a HuggingFace tokenizer and returns structured `SampledCompletion` objects with token IDs, logprobs, and completion metadata. Use it in training scripts that need token-level outputs (e.g. GRPO, DPO).
```python theme={null}
from fireworks.training.sdk import DeploymentSampler
```
## Constructor
```python theme={null}
from transformers import AutoTokenizer
from fireworks.training.sdk import DeploymentSampler, AdaptiveConcurrencyController
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
# Adaptive concurrency (recommended) — auto-tunes based on server load
sampler = DeploymentSampler(
    inference_url="https://api.fireworks.ai",
    model="accounts//deployments/",
    api_key="",
    tokenizer=tokenizer,
    concurrency_controller=AdaptiveConcurrencyController(initial_window=16),
)
```
| Parameter | Type | Description |
| ------------------------ | --------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| `inference_url` | `str` | Gateway URL for inference completions |
| `model` | `str` | Deployment model path (`accounts//deployments/`) |
| `api_key` | `str` | Fireworks API key |
| `tokenizer` | `PreTrainedTokenizerBase` | HuggingFace tokenizer matching the base model |
| `concurrency_controller` | `AdaptiveConcurrencyController \| FixedConcurrencyController \| None` | Controls how many concurrent HTTP requests are in-flight. `None` (default) means no limit. See [Concurrency Control](#concurrency-control) below. |
## Concurrency Control
`sample_with_tokens(n=K)` fans out into K individual streaming requests. Without concurrency control, all requests fire simultaneously, which can overload the server. Two controllers are available:
### AdaptiveConcurrencyController (recommended)
Auto-tunes the concurrency window using AIMD (Additive Increase / Multiplicative Decrease) based on the server's `prefill_queue_duration`:
```python theme={null}
from fireworks.training.sdk import AdaptiveConcurrencyController
ctrl = AdaptiveConcurrencyController(
    initial_window=16,         # starting concurrency
    min_window=1,              # minimum window
    max_window=256,            # maximum window
    prefill_queue_target=0.5,  # target prefill queue latency (seconds)
)
sampler = DeploymentSampler(..., concurrency_controller=ctrl)
# Between training steps, call step_completed() to trigger window adjustment
summary = ctrl.step_completed()
print(summary) # {"window": 20, "avg_pq": 0.08, "cache_hit_rate": 0.95, ...}
```
The controller reads `prefill_queue_duration` from server response metrics. When the queue is below target, the window grows proportionally. When above, it halves (multiplicative decrease).
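For intuition, a classic AIMD update can be sketched as below. This is not the SDK's algorithm: the fixed `+1` increase step is an assumption taken from textbook AIMD (the controller's real growth rule may differ), and `next_window` is a hypothetical function that only borrows the parameter names shown above.

```python
def next_window(
    window: int,
    avg_prefill_queue_s: float,
    prefill_queue_target: float = 0.5,
    min_window: int = 1,
    max_window: int = 256,
) -> int:
    """One textbook AIMD step, clamped to [min_window, max_window]."""
    if avg_prefill_queue_s > prefill_queue_target:
        proposed = window // 2  # multiplicative decrease when the queue is too long
    else:
        proposed = window + 1   # additive increase (assumed step size)
    return max(min_window, min(max_window, proposed))


print(next_window(16, 0.1))  # queue under target
print(next_window(16, 0.9))  # queue over target
```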
### FixedConcurrencyController
Static semaphore — use when you know the right concurrency for your deployment:
```python theme={null}
from fireworks.training.sdk import FixedConcurrencyController
sampler = DeploymentSampler(
    ...,
    concurrency_controller=FixedConcurrencyController(32),
)
```
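The idea of a static semaphore can be illustrated with `asyncio.Semaphore` from the standard library. This sketch is an analogy only and does not use or reproduce `FixedConcurrencyController`; `run_limited` and the simulated request are hypothetical.

```python
import asyncio


async def run_limited(n_requests: int, limit: int):
    """Run n simulated requests with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def one_request(index: int) -> int:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for the network call
            in_flight -= 1
            return index

    results = await asyncio.gather(*(one_request(i) for i in range(n_requests)))
    return results, peak


results, peak = asyncio.run(run_limited(10, 3))
print(len(results), peak)
```

The recorded peak never exceeds the limit, no matter how many requests are queued.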
## `sample_with_tokens(...)`
Sample completions and return structured results with token IDs. This method is `async`, so call it with `await` or wrap it with `asyncio.run(...)` from synchronous code:
```python theme={null}
import asyncio
async def main():
    completions = await sampler.sample_with_tokens(
        messages=[{"role": "user", "content": "Solve: 2+2="}],
        n=4,
        max_tokens=1024,
        temperature=0.7,
    )
    for c in completions:
        print(c.full_tokens)     # prompt + completion token IDs
        print(c.prompt_len)      # number of prompt tokens
        print(c.completion_len)  # number of completion tokens
        print(c.text)            # decoded completion text
        print(c.finish_reason)   # "stop", "length", etc.
asyncio.run(main())
```
### Retrieving inference logprobs
For GRPO importance sampling, pass `logprobs=True`:
```python theme={null}
import asyncio
async def main():
    completions = await sampler.sample_with_tokens(
        messages=[{"role": "user", "content": "Solve: 2+2="}],
        n=4,
        logprobs=True,
        top_logprobs=1,
    )
    for c in completions:
        print(c.inference_logprobs)  # List[float] or None
asyncio.run(main())
```
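Inference logprobs are typically combined with the trainer's logprobs to form per-token importance ratios. The sketch below shows the standard formula `exp(training - inference)` on plain lists; it assumes the two lists are already aligned token by token, and `importance_ratios` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def importance_ratios(training_logprobs: Sequence[float], inference_logprobs: Sequence[float]) -> List[float]:
    """Per-token ratio between the training policy and the sampling policy."""
    if len(training_logprobs) != len(inference_logprobs):
        raise ValueError("logprob sequences must be aligned and equal in length")
    return [math.exp(t - s) for t, s in zip(training_logprobs, inference_logprobs)]


# Made-up logprobs for two completion tokens.
ratios = importance_ratios([-1.0, -2.0], [-1.0, -2.5])
print(ratios)
```

A ratio of `1.0` means the trainer and the deployment assign the token the same probability; values far from `1.0` indicate the sampled data has drifted off-policy.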
### Sequence length filtering
`sample_with_tokens` supports `max_seq_len` for automatic filtering:
```python theme={null}
import asyncio
completions = asyncio.run(
    sampler.sample_with_tokens(
        messages=input_messages,
        n=4,
        max_tokens=1024,
        max_seq_len=8192,  # filter out sequences exceeding this length
    )
)
```
Two levels of filtering are applied:
1. **Prompt pre-filter**: If the tokenized prompt already meets or exceeds `max_seq_len`, the method returns an empty list immediately — no inference call is made.
2. **Completion post-filter**: After sampling, any completion whose full token sequence (prompt + completion) exceeds `max_seq_len` is silently dropped.
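The two filtering levels can be expressed as a small function over token counts. This sketch mirrors the rules as described above and works on lengths only; `filter_completion_lengths` is a hypothetical name, not an SDK function.

```python
from typing import List


def filter_completion_lengths(prompt_len: int, completion_lens: List[int], max_seq_len: int) -> List[int]:
    """Return the completion lengths that survive both filters."""
    # Pre-filter: a prompt that meets or exceeds the limit yields nothing.
    if prompt_len >= max_seq_len:
        return []
    # Post-filter: drop completions whose full sequence exceeds the limit.
    return [n for n in completion_lens if prompt_len + n <= max_seq_len]


print(filter_completion_lengths(100, [50, 9000, 8092], 8192))
print(filter_completion_lengths(8192, [10], 8192))
```

Note the asymmetry: a prompt exactly at the limit is rejected, while a full sequence exactly at the limit is kept.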
## SampledCompletion
Each completion returned by `sample_with_tokens`:
| Field | Type | Description |
| -------------------- | --------------------- | -------------------------------------------------------------------------------- |
| `text` | `str` | Decoded completion text |
| `full_tokens` | `List[int]` | Prompt + completion token IDs |
| `prompt_len` | `int` | Number of prompt tokens |
| `finish_reason` | `str` | `"stop"`, `"length"`, etc. |
| `completion_len` | `int` | Number of completion tokens |
| `inference_logprobs` | `List[float] \| None` | Per-token logprobs (when `logprobs=True` is passed) |
| `logprobs_echoed` | `bool` | `True` when `echo=True` was used — logprobs are training-aligned (P+C-1 entries) |
| `routing_matrices` | `List[str] \| None` | Base64-encoded per-token routing matrices for MoE Router Replay (R3) |
## Related guides
* [FiretitanServiceClient](/fine-tuning/training-api/reference/service-client) — create SDK-managed deployment samplers
* [Training and Sampling](/fine-tuning/training-api/training-and-sampling) — end-to-end workflow
* [Cookbook RL recipe](/fine-tuning/training-api/cookbook/rl) — GRPO with sampling pipeline
# FireworksClient
Source: https://docs.fireworks.ai/fine-tuning/training-api/reference/fireworks-client
Account-level operations that don't require a running trainer job.
## Overview
`FireworksClient` provides Fireworks platform operations that are independent of any running trainer job: checkpoint promotion, training shape resolution, and model validation. It is also the base class for the legacy [`TrainerJobManager`](/fine-tuning/training-api/reference/trainer-job-manager), which adds direct trainer job lifecycle methods.
Use `FireworksClient` directly when you don't need to create or manage trainer jobs -- for example, promoting a checkpoint after the trainer has already been deleted, or resolving training shape configuration before deciding whether to launch a job.
```python theme={null}
from fireworks.training.sdk import FireworksClient
```
## Constructor
```python theme={null}
client = FireworksClient(
    api_key="",
    base_url="https://api.fireworks.ai",  # optional
)
```
| Parameter | Type | Default | Description |
| -------------------- | -------------- | ---------------------------- | ------------------------- |
| `api_key` | `str` | -- | Fireworks API key |
| `base_url` | `str` | `"https://api.fireworks.ai"` | Control-plane URL |
| `additional_headers` | `dict \| None` | `None` | Extra HTTP headers |
| `verify_ssl` | `bool \| None` | `None` | SSL verification override |
## Methods
### `promote_checkpoint(*, name, output_model_id, base_model)`
Promote a sampler checkpoint to a deployable Fireworks model. The trainer job does not need to be running -- the checkpoint resource name is enough to resolve the GCS bucket where the files reside.
```python theme={null}
entry = client.list_checkpoints("")[0]
model = client.promote_checkpoint(
    name=entry["name"], # accounts//rlorTrainerJobs//checkpoints/
    output_model_id="my-fine-tuned-model",
    base_model="accounts/fireworks/models/qwen3-8b",
)
print(f"Model state: {model['state']}, kind: {model['kind']}")
```
| Parameter | Type | Description |
| ----------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------- |
| `name` | `str` | Full 4-segment checkpoint resource name (`accounts//rlorTrainerJobs//checkpoints/`), as returned by `list_checkpoints` |
| `output_model_id` | `str` | Desired model ID (1-63 chars, lowercase a-z, 0-9, hyphen only) |
| `base_model` | `str` | Base model resource name for metadata inheritance (e.g. `accounts/fireworks/models/qwen3-8b`) |
Returns the model dict from the API (includes `state`, `kind`, `peftDetails`). See [Saving and Loading](/fine-tuning/training-api/saving-and-loading#promoting-a-checkpoint-to-a-model) for details, and [Checkpoint kinds](/fine-tuning/training-api/cookbook/checkpoints#checkpoint-kinds) for which checkpoints are promotable.
The trainer job can be in any state (running, failed, cancelled, or deleted) as long as the checkpoint files still exist in GCS. Promotion is a file copy -- it does not interact with the trainer process.
Validate `output_model_id` with [`validate_output_model_id`](#validate-output-model-id-output-model-id) before calling — a rejected ID (>63 chars or bad charset) orphans the staged sampler blob.
### `list_checkpoints(job_id, *, page_size=200)`
Lists a trainer's checkpoints server-side (sampler and DCP, with promotability metadata). It works for a trainer in any state, including deleted, as long as the database record and GCS blobs still exist, and it paginates automatically. This is distinct from [`FiretitanTrainingClient.list_checkpoints()`](/fine-tuning/training-api/reference/service-client), which queries the live pod and returns DCP names only.
```python theme={null}
rows = client.list_checkpoints(job_id)
latest = max((r for r in rows if r["promotable"]), key=lambda r: r["createTime"])
```
Each row has `name`, `createTime` / `updateTime` (RFC3339), `checkpointType` (an opaque server enum; filter on `promotable` rather than matching its values), and `promotable` (bool, authoritative). The server returns rows **oldest-first**, so re-sort client-side if you need newest-first. Requires `fireworks-ai[training] >= 1.0.0a62`.
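The client-side re-sort mentioned above could look like the following sketch. The rows here are made-up sample data that mimic the documented fields, and `parse_rfc3339` and `newest_promotable` are hypothetical helper names, not SDK functions. Timestamps with unusual fractional-second precision may need a more forgiving parser.

```python
from datetime import datetime


def parse_rfc3339(ts: str) -> datetime:
    # Older Python versions reject a trailing "Z" in fromisoformat(),
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def newest_promotable(rows):
    """Return promotable rows sorted newest-first by createTime."""
    promotable = [r for r in rows if r["promotable"]]
    promotable.sort(key=lambda r: parse_rfc3339(r["createTime"]), reverse=True)
    return promotable


# Hypothetical rows shaped like the documented fields (oldest-first).
rows = [
    {"name": "ckpt-a", "createTime": "2025-01-01T10:00:00Z", "promotable": True},
    {"name": "ckpt-b", "createTime": "2025-01-01T11:00:00Z", "promotable": False},
    {"name": "ckpt-c", "createTime": "2025-01-01T12:00:00Z", "promotable": True},
]
print([r["name"] for r in newest_promotable(rows)])  # ['ckpt-c', 'ckpt-a']
```

Returning a list rather than a single row avoids the `ValueError` that `max(...)` raises when no row is promotable; callers can check for an empty result instead.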
### `resolve_training_profile(shape_id)`
Resolve a training shape ID into a full configuration profile:
```python theme={null}
shape_id = "accounts/fireworks/trainingShapes/ts-qwen3-8b-policy"
profile = client.resolve_training_profile(shape_id)
print(profile.accelerator_type) # e.g. "NVIDIA_B200_192GB"
print(profile.trainer_image_tag) # e.g. "0.0.0-dev-..."
print(profile.node_count) # e.g. 1
print(profile.pipeline_parallelism) # e.g. 1
```
See [Training Shapes](/fine-tuning/training-api/training-shapes) for the user-facing shape workflow.
### `validate_output_model_id(output_model_id)`
Client-side validation helper for `promote_checkpoint(..., output_model_id=...)`:
```python theme={null}
from fireworks.training.sdk import validate_output_model_id
errors = validate_output_model_id("my-fine-tuned-model")
if errors:
    raise ValueError("\n".join(errors))
```
Returns a list of formatted error strings. An empty list means the model ID is valid.
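To make the documented constraint concrete, here is a minimal stand-alone sketch of a validator with the same return convention (a list of error strings, empty when valid). It encodes only the rule stated in the `promote_checkpoint` parameter table (1-63 characters; lowercase a-z, 0-9, and hyphen). The function name `sketch_validate_model_id` is hypothetical, and the real SDK helper may check more than this.

```python
import re

# Assumption: only the documented rule is enforced here.
_ALLOWED_CHARS = re.compile(r"[a-z0-9-]+")


def sketch_validate_model_id(model_id: str) -> list[str]:
    """Return a list of problems; an empty list means the ID passes."""
    errors = []
    if not 1 <= len(model_id) <= 63:
        errors.append(f"length must be 1-63 characters, got {len(model_id)}")
    if model_id and not _ALLOWED_CHARS.fullmatch(model_id):
        errors.append("only lowercase a-z, 0-9, and hyphen are allowed")
    return errors


print(sketch_validate_model_id("my-fine-tuned-model"))  # []
print(sketch_validate_model_id("My_Model"))
```

Collecting every problem into one list, rather than raising on the first, lets a caller report all issues with an ID in a single pass.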
## Relationship to managed service clients
Normal training code should use [`FiretitanServiceClient.from_firetitan_config(...)`](/fine-tuning/training-api/reference/service-client#from_firetitan_config), which creates the trainer/deployment and delegates checkpoint listing/promotion through its managed control-plane client.
Use `FireworksClient` directly when you only need platform-level operations outside a live training service, such as promoting a checkpoint from a completed experiment. Use `TrainerJobManager` only for legacy integrations or advanced lifecycle debugging.
```python theme={null}
from fireworks.training.sdk import FireworksClient, TrainerJobManager
# Trainer-free: promote a checkpoint from a completed experiment
client = FireworksClient(api_key=api_key)
entry = client.list_checkpoints(job_id)[0]
client.promote_checkpoint(name=entry["name"], output_model_id="my-model", base_model=base_model)
# Compatibility lifecycle: create trainer manually, train, promote
mgr = TrainerJobManager(api_key=api_key)
endpoint = mgr.create_and_wait(config)
# ... train ...
entry = mgr.list_checkpoints(endpoint.job_id)[0]
mgr.promote_checkpoint(name=entry["name"], output_model_id="my-model", base_model=base_model)
mgr.delete(endpoint.job_id)
```
## TrainingShapeProfile
Returned by `resolve_training_profile`:
| Field | Type | Description |
| ------------------------------ | ------------- | ------------------------------------------------------------------------------------------------------------ |
| `training_shape_version` | `str` | Resolved shape version |
| `trainer_image_tag` | `str` | Docker image tag for the trainer |
| `max_supported_context_length` | `int` | Maximum supported context length |
| `node_count` | `int` | Number of trainer nodes |
| `deployment_shape_version` | `str` | Linked deployment shape |
| `deployment_image_tag` | `str` | Docker image tag for the linked deployment |
| `accelerator_type` | `str` | GPU type |
| `accelerator_count` | `int` | Number of GPUs per node |
| `base_model_weight_precision` | `str` | Model weight precision |
| `pipeline_parallelism` | `int` | Pipeline parallelism degree |
| `trainer_mode` | `str` | Shape mode, such as `POLICY_TRAINER`, `FORWARD_ONLY`, or `LORA_TRAINER` |
| `training_shape` | `str` | Training shape name (without `/versions/...` suffix) |
| `deployment_shape` | `str \| None` | Full versioned deployment shape resource name; pass as-is to `DeploymentConfig.deployment_shape` for pinning |
| `supports_lora` | `bool` | Whether the shape is LoRA-capable (`trainer_mode == "LORA_TRAINER"`) |
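Two of the fields above combine in simple ways: `accelerator_count` is per node, so the total GPU count is the product with `node_count`, and `supports_lora` is described as equivalent to a `trainer_mode` check. The sketch below uses a hypothetical stand-in dataclass (`ProfileSketch`), not the SDK's `TrainingShapeProfile`, purely to show that arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ProfileSketch:
    """Hypothetical stand-in holding a few TrainingShapeProfile fields."""

    node_count: int
    accelerator_count: int  # GPUs per node, as described in the table
    trainer_mode: str

    @property
    def total_gpus(self) -> int:
        # Per-node GPU count times the number of trainer nodes.
        return self.node_count * self.accelerator_count

    @property
    def supports_lora(self) -> bool:
        # Mirrors the documented rule: trainer_mode == "LORA_TRAINER".
        return self.trainer_mode == "LORA_TRAINER"


profile = ProfileSketch(node_count=2, accelerator_count=8, trainer_mode="POLICY_TRAINER")
print(profile.total_gpus, profile.supports_lora)  # 16 False
```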
## Related guides
* [TrainerJobManager](/fine-tuning/training-api/reference/trainer-job-manager) -- trainer job lifecycle (extends FireworksClient)
* [Saving and Loading](/fine-tuning/training-api/saving-and-loading) -- checkpoint save, load, and promote workflows
* [Training Shapes](/fine-tuning/training-api/training-shapes) -- available shapes and deployment linkage
# FiretitanServiceClient & TrainingClient
Source: https://docs.fireworks.ai/fine-tuning/training-api/reference/service-client
Connect to a trainer endpoint and use the training client for forward/backward passes, optimizer steps, and checkpointing.
## Overview
`FiretitanServiceClient` is the recommended direct SDK entry point. In the managed path, it creates or reattaches the FireTitan trainer, optional reference trainer, and optional inference deployment, then returns Tinker-compatible training and sampling clients.
For most direct SDK code, create it with `FiretitanServiceClient.from_firetitan_config(...)`. The bare constructor is still useful when you already have a trainer endpoint URL, but that is a lower-level compatibility path.
```python theme={null}
from fireworks.training.sdk import FiretitanServiceClient, GradAccNormalization
```
## FiretitanServiceClient
### `from_firetitan_config(...)`
Create a lazy SDK-managed service. The trainer and deployment are provisioned on the first client call, usually `create_training_client(...)`:
```python theme={null}
service = FiretitanServiceClient.from_firetitan_config(
    api_key="",
    base_url="https://api.fireworks.ai",
    base_model="accounts/fireworks/models/qwen3-8b",
    tokenizer_model="Qwen/Qwen3-8B",
    lora_rank=0,
    training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    deployment_id="research-serving",  # set create_deployment=False for trainer-only flows
    learning_rate=1e-5,
    replica_count=1,  # deployment replicas
    cleanup_trainer_on_close=True,
    cleanup_deployment_on_close="scale_to_zero",
)
training_client = service.create_training_client(
    base_model="accounts/fireworks/models/qwen3-8b",
    lora_rank=0,
)
```
Core managed config fields:
| Field | Type | Default | Description |
| ------------------------------------ | ------------------------------------- | -------------------------- | -------------------------------------------------------------------------------------------------------- |
| `api_key` | `str \| None` | `FIREWORKS_API_KEY` | Fireworks API key. |
| `base_url` | `str \| None` | `https://api.fireworks.ai` | Control-plane URL. |
| `inference_url` | `str \| None` | `None` | Optional inference gateway URL. |
| `base_model` | `str` | — | Fireworks base model resource name. |
| `tokenizer_model` | `str \| None` | `None` | HuggingFace tokenizer name used by `get_tokenizer()` and sampler setup. |
| `lora_rank` | `int` | `0` | `0` for full-parameter training; positive value for LoRA. |
| `training_shape_id` | `str \| None` | `None` | User-facing training shape ID. The SDK resolves the pinned version. |
| `reference_training_shape_id` | `str \| None` | `None` | Optional separate forward-only reference trainer shape. |
| `trainer_job_id` | `str \| None` | `None` | Reattach to an existing trainer instead of creating one. |
| `reference_trainer_job_id` | `str \| None` | `None` | Reattach to an existing reference trainer. |
| `create_deployment` | `bool` | `True` | Whether to create or reattach an inference deployment. Set `False` for trainer-only SFT/DPO-style loops. |
| `deployment_id` | `str \| None` | `None` | Create or reattach an inference deployment for sampling and weight sync. |
| `deployment_shape` | `str \| None` | Linked shape | Optional deployment shape override. Usually inherited from the training shape. |
| `trainer_replica_count` | `int \| None` | `None` | Data-parallel HSDP replicas for the trainer. |
| `replica_count` | `int` | `1` | Inference deployment replicas. |
| `cleanup_trainer_on_close` | `bool` | `False` | Delete the SDK-managed policy trainer when `service.close()` runs. |
| `cleanup_reference_trainer_on_close` | `bool` | `True` | Delete SDK-managed separate reference trainers when released/closed. |
| `cleanup_deployment_on_close` | `"scale_to_zero" \| "delete" \| None` | `None` | Optional deployment cleanup action on close. |
The managed service exposes resolved metadata after provisioning:
```python theme={null}
print(service.trainer_job_id)
print(service.deployment_id)
print(service.max_context_length)
print(service.reference_trainer_job_id) # None when the reference is shared
```
### Bare constructor
```python theme={null}
service = FiretitanServiceClient(
    base_url=endpoint.base_url,  # From TrainerJobManager.create_and_wait(...)
    api_key="",
)
```
`base_url` is the trainer endpoint URL from `TrainerServiceEndpoint.base_url`. Use this only when you intentionally manage trainer lifecycle yourself. New user code should use `from_firetitan_config(...)`.
### `create_training_client(base_model, lora_rank, user_metadata)`
Creates a `FiretitanTrainingClient` for training operations:
```python theme={null}
training_client = service.create_training_client(
    base_model="accounts/fireworks/models/qwen3-8b",
    lora_rank=0,  # Must match lora_rank from job creation
)
```
| Parameter | Type | Default | Description |
| --------------- | ------------------------ | ------- | ----------------------------------------------------------- |
| `base_model` | `str` | — | Must match the trainer job's `base_model` |
| `lora_rank` | `int` | `0` | Must match trainer creation config (`0` for full-parameter) |
| `user_metadata` | `dict[str, str] \| None` | `None` | Optional run metadata |
A `ValueError` is raised if you attempt to create a second training client with the same `(base_model, lora_rank)` on the same `FiretitanServiceClient` instance. Create a new `FiretitanServiceClient` for a separate trainer.
### Connecting to an existing trainer
If you already have a running trainer (e.g. from a previous session), connect directly by URL:
```python theme={null}
service = FiretitanServiceClient(
    base_url="https://",
    api_key="",
)
training_client = service.create_training_client(
    base_model="accounts/fireworks/models/qwen3-8b",
    lora_rank=0,
)
```
### `create_base_training_client(base_model, user_metadata=None)`
Creates a base-only client on the same trainer session. Use this as a frozen reference for LoRA KL/reference logprobs without launching a separate forward-only trainer:
```python theme={null}
reference_client = service.create_base_training_client(base_model=base_model)
ref = reference_client.forward(datums, "cross_entropy").result()
```
Do not call `forward_backward`, `forward_backward_custom`, or `optim_step` on this client; it is for reference forward passes only.
### `create_reference_client(base_model, lora_rank=0, user_metadata=None)`
Create a frozen reference client for KL/DPO baseline logprobs:
```python theme={null}
reference_client = service.create_reference_client(base_model, lora_rank=0)
ref = reference_client.forward(datums, "cross_entropy").result()
```
The SDK chooses the backing automatically. LoRA policies without an explicit reference shape reuse the policy trainer with the adapter disabled. Full-parameter policies, explicit `reference_training_shape_id`, or explicit `reference_trainer_job_id` use a separate forward-only reference trainer owned by the service.
### `create_sampling_client(model_path=None, ...)`
Return a Tinker-shaped sampling client backed by the SDK-managed deployment. When `model_path` is provided, the SDK first syncs that sampler snapshot to the deployment:
```python theme={null}
saved = training_client.save_weights_for_sampler("step-100").result()
sampler = service.create_sampling_client(model_path=saved.path)
```
This is the replacement for calling a standalone weight-sync helper in user code. The SDK tracks the base/delta chain and builds the weight-sync metadata internally.
### `create_deployment_sampler(model_path=None, tokenizer=None, concurrency_controller=None)`
Return the FireTitan-native `DeploymentSampler` directly. Use this when you need tokenized completions, inference logprobs, routing matrices, or adaptive concurrency:
```python theme={null}
sampler = service.create_deployment_sampler(
    model_path=saved.path,
    tokenizer=tokenizer,
    concurrency_controller=controller,
)
```
### `hotload_sampler_snapshot(model_path)`
Low-level method for syncing a previously saved sampler snapshot into the SDK-managed deployment without constructing a sampler:
```python theme={null}
service.hotload_sampler_snapshot(saved.path)
```
## FiretitanTrainingClient
The training client returned by `create_training_client()`. Core training RPCs like `forward(...)`, `forward_backward_custom(...)`, `optim_step(...)`, `save_state(...)`, and `load_state_with_optimizer(...)` return **futures**. Fireworks convenience helpers like `save_weights_for_sampler_ext(...)`, `list_checkpoints()`, and `resolve_checkpoint_path(...)` return concrete values directly.
### `forward(datums, loss_type)`
Forward-only pass (no gradient computation). Useful for computing reference logprobs in GRPO/DPO:
```python theme={null}
result = training_client.forward(datums, "cross_entropy").result()
logprobs = result.loss_fn_outputs[0]["logprobs"].data
```
Built-in loss types like `"cross_entropy"` require datums with `target_tokens` in `loss_fn_inputs`. Datums built with `datum_from_model_input_weights` will fail with these built-in losses. Use the target-token `tinker.Datum` example in [Loss Functions](/fine-tuning/training-api/loss-functions#using-tinkerdatum-directly-target-token-based) for built-in losses, or use `forward_backward_custom` with the weight-based format in [Building datums](/fine-tuning/training-api/loss-functions#building-datums) and the custom-loss pattern in [Example: simple cross-entropy](/fine-tuning/training-api/loss-functions#example-simple-cross-entropy).
### `forward_backward_custom(datums, loss_fn)`
Forward + backward with your custom loss function. See [Loss Functions](/fine-tuning/training-api/loss-functions) for details:
```python theme={null}
def my_loss(data, logprobs_list):
    # compute_loss is a placeholder for your own loss computation
    loss = compute_loss(data, logprobs_list)
    return loss, {"loss": float(loss.item())}

result = training_client.forward_backward_custom(datums, my_loss).result()
print(result.metrics) # {"loss": 0.42}
```
For embedding-space objectives, pass `output="embedding"` and choose `pooling="mean"` or `"last"`; your loss function then receives pooled embedding tensors instead of logprobs:
```python theme={null}
result = training_client.forward_backward_custom(
    datums,
    embedding_loss,
    output="embedding",
    pooling="mean",
).result()
```
### `optim_step(adam_params, grad_accumulation_normalization=None)`
Apply the optimizer update after accumulating gradients:
```python theme={null}
import tinker
training_client.optim_step(
    tinker.AdamParams(
        learning_rate=1e-5,
        beta1=0.9,
        beta2=0.999,
        eps=1e-8,
        weight_decay=0.01,
    )
).result()
```
Supports `grad_accumulation_normalization` for controlling how accumulated gradients are normalized. Pass `GradAccNormalization.NUM_LOSS_TOKENS`, `GradAccNormalization.NUM_SEQUENCES`, or `GradAccNormalization.NONE` rather than raw strings. See [Loss Functions](/fine-tuning/training-api/loss-functions#gradient-accumulation-normalization) for when to use each mode.
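As a rough worked example of how the three modes differ, the sketch below sums per-micro-batch gradients and divides by the count each mode names. It assumes normalization is a plain division of the summed gradient, which is how the mode descriptions read; the trainer's actual implementation is not shown in this reference. The micro-batch numbers and the `normalize_accumulated` helper are invented for illustration.

```python
def normalize_accumulated(grad_sum, mode, *, num_loss_tokens, num_sequences):
    """Divide a summed gradient by the count selected by `mode`."""
    divisors = {
        "num_loss_tokens": num_loss_tokens,
        "num_sequences": num_sequences,
        "none": 1,  # raw gradient sum
    }
    if mode not in divisors:
        raise ValueError(f"unknown mode: {mode!r}")
    return [g / divisors[mode] for g in grad_sum]


# Two hypothetical micro-batches accumulated before one optimizer step.
micro_batches = [
    {"grad": [6.0, 12.0], "loss_tokens": 24, "sequences": 3},
    {"grad": [2.0, 4.0], "loss_tokens": 8, "sequences": 1},
]
grad_sum = [sum(parts) for parts in zip(*(mb["grad"] for mb in micro_batches))]
tokens = sum(mb["loss_tokens"] for mb in micro_batches)
seqs = sum(mb["sequences"] for mb in micro_batches)

for mode in ("num_loss_tokens", "num_sequences", "none"):
    print(mode, normalize_accumulated(grad_sum, mode, num_loss_tokens=tokens, num_sequences=seqs))
```

With these numbers the summed gradient is `[8.0, 16.0]`, so token normalization (32 tokens) yields `[0.25, 0.5]`, sequence normalization (4 sequences) yields `[2.0, 4.0]`, and `none` leaves the sum unchanged.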
### `save_weights_for_sampler(name, ttl_seconds=None, checkpoint_type=None)`
Save serving-compatible sampler weights and return a future. This is the normal Tinker-shaped API:
```python theme={null}
saved = training_client.save_weights_for_sampler(
    "step-100",
    checkpoint_type="base",  # optional: "base" or "delta"
).result()
print(saved.path) # Snapshot identity for create_sampling_client(model_path=...)
```
Full-parameter training saves a base checkpoint first and deltas after that by default. LoRA training always saves base checkpoints. The returned `path` is a public snapshot identity, not a raw storage URI.
### `save_weights_for_sampler_ext(name, checkpoint_type, ttl_seconds)`
Fireworks-specific extension that returns a concrete `SaveSamplerResult` instead of a future:
```python theme={null}
result = training_client.save_weights_for_sampler_ext(
    "step-100",
    checkpoint_type="base",  # "base" for full weights, "delta" for incremental
)
print(result.snapshot_name) # Session-qualified name for weight sync
```
| Parameter | Type | Default | Description |
| ----------------- | ------------- | ------- | ---------------------------------------------------- |
| `name` | `str` | — | Checkpoint name (auto-suffixed with session ID) |
| `checkpoint_type` | `str \| None` | `None` | `"base"` for full weights, `"delta"` for incremental |
| `ttl_seconds` | `int \| None` | `None` | Auto-delete checkpoint after this many seconds |
On full-parameter training, only `checkpoint_type="base"` produces a promotable blob; `"delta"` cannot be promoted. LoRA is always promotable. See [Checkpoint kinds](/fine-tuning/training-api/cookbook/checkpoints#checkpoint-kinds) for the full promotability matrix.
`save_weights_for_sampler_ext` saves the snapshot only; it does not mutate a deployment. To serve the snapshot, pass `result.snapshot_name` to the managed service weight-sync path, or use `create_sampling_client(model_path=...)` / `create_deployment_sampler(model_path=...)`, which sync and return a sampler.
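The default base/delta choice described for `save_weights_for_sampler` and the promotability rule above can be summarized as two small decision functions. This is a sketch of the documented behavior only; both function names are hypothetical, and the linked promotability matrix remains the authoritative source.

```python
def default_checkpoint_type(lora_rank: int, sampler_saves_so_far: int) -> str:
    """Checkpoint type used when the caller does not pass checkpoint_type."""
    if lora_rank > 0:
        return "base"  # LoRA training always saves base checkpoints
    # Full-parameter: first save is a base, later saves are deltas.
    return "base" if sampler_saves_so_far == 0 else "delta"


def is_promotable(lora_rank: int, checkpoint_type: str) -> bool:
    """Whether a sampler checkpoint can be promoted, per the rule above."""
    if lora_rank > 0:
        return True  # LoRA is always promotable
    return checkpoint_type == "base"


for step in range(3):
    kind = default_checkpoint_type(lora_rank=0, sampler_saves_so_far=step)
    print(step, kind, is_promotable(0, kind))
```

One practical consequence for full-parameter runs: if you intend to promote a later step, request `checkpoint_type="base"` explicitly for that save instead of relying on the default.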
### `save_state(name, ttl_seconds=None, timeout=None)`
Save full train state (weights + optimizer) for resume:
```python theme={null}
training_client.save_state("train_state_step_100").result()
```
| Parameter | Type | Default | Description |
| ------------- | --------------- | ------- | ------------------------------------------------------------- |
| `name` | `str` | — | Checkpoint name |
| `ttl_seconds` | `int \| None` | `None` | Auto-delete checkpoint after this many seconds |
| `timeout` | `float \| None` | `None` | If set, block until the save completes or the timeout expires |
### `load_state_with_optimizer(name)`
Restore full train state (weights + optimizer) from a checkpoint:
```python theme={null}
training_client.load_state_with_optimizer("train_state_step_100").result()
```
### `load_state(name)`
Load model weights from a checkpoint without restoring optimizer state. The optimizer is reset so the next `optim_step` starts fresh:
```python theme={null}
training_client.load_state("train_state_step_100").result()
```
### `load_adapter(adapter_path)`
Load Hugging Face PEFT adapter weights into the current LoRA session. This is a weights-only warm start; it does not restore optimizer state, scheduler state, or data cursor.
```python theme={null}
training_client.load_adapter("gs://my-bucket/adapters/run-42").result()
```
### `list_checkpoints()`
List available DCP checkpoints from the trainer. Returns a `list[str]`:
```python theme={null}
checkpoint_names = training_client.list_checkpoints()
print(checkpoint_names) # e.g. ["step-2", "step-4"]
```
### `resolve_checkpoint_path(checkpoint_name, source_job_id)`
Resolve a checkpoint path for cross-job resume:
```python theme={null}
checkpoint_ref = training_client.resolve_checkpoint_path(
    "step-4",
    source_job_id="previous-job-id",
)
training_client.load_state_with_optimizer(checkpoint_ref).result()
```
## SaveSamplerResult
Returned by `save_weights_for_sampler_ext`:
| Field | Type | Description |
| --------------- | ----- | ------------------------------------------------- |
| `path` | `str` | Snapshot name from trainer |
| `snapshot_name` | `str` | Session-qualified name for weight sync operations |
## GradAccNormalization
Enum for `optim_step`'s `grad_accumulation_normalization` parameter:
| Enum | Wire value | Description |
| -------------------------------------- | ------------------- | --------------------------------------------------------------- |
| `GradAccNormalization.NUM_LOSS_TOKENS` | `"num_loss_tokens"` | Normalize by total loss tokens across accumulated micro-batches |
| `GradAccNormalization.NUM_SEQUENCES` | `"num_sequences"` | Normalize by total sequences across accumulated micro-batches |
| `GradAccNormalization.NONE` | `"none"` | No normalization (raw gradient sum) |
## Related guides
* [Training and Sampling](/fine-tuning/training-api/training-and-sampling) — managed service training + sampler refresh walkthrough
* [Loss Functions](/fine-tuning/training-api/loss-functions) — built-in and custom loss functions
* [Saving and Loading](/fine-tuning/training-api/saving-and-loading) — checkpoint details
# TrainerJobManager (Compatibility)
Source: https://docs.fireworks.ai/fine-tuning/training-api/reference/trainer-job-manager
Legacy SDK reference for service-mode trainer job lifecycle management.
## Overview
`TrainerJobManager` is a low-level compatibility API. New user code should not create trainer managers directly; use [`FiretitanServiceClient.from_firetitan_config(...)`](/fine-tuning/training-api/reference/service-client#from_firetitan_config) or cookbook recipes instead. This page remains for existing integrations, migration support, and advanced lifecycle debugging.
`TrainerJobManager` manages the lifecycle of service-mode trainer jobs — GPU-backed trainer endpoints that your Python loop connects to with a training client.
`TrainerJobManager` extends [`FireworksClient`](/fine-tuning/training-api/reference/fireworks-client), so all trainer-free operations (checkpoint promotion, training shape resolution) are also available here.
```python theme={null}
from fireworks.training.sdk import TrainerJobManager, TrainerJobConfig
```
## Constructor
```python theme={null}
rlor_mgr = TrainerJobManager(
    api_key="",
    base_url="https://api.fireworks.ai",  # optional, defaults to https://api.fireworks.ai
)
```
| Parameter | Type | Default | Description |
| -------------------- | -------------- | ---------------------------- | ------------------------- |
| `api_key` | `str` | — | Fireworks API key |
| `base_url` | `str` | `"https://api.fireworks.ai"` | Control-plane URL |
| `additional_headers` | `dict \| None` | `None` | Extra HTTP headers |
| `verify_ssl` | `bool \| None` | `None` | SSL verification override |
## Methods
### `create(config)`
Create a service-mode trainer job and return immediately (without waiting). Returns a `CreatedTrainerJob`:
```python theme={null}
created = rlor_mgr.create(TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    training_shape_ref="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    lora_rank=0,
    learning_rate=1e-5,
))
print(created.job_id) #
print(created.job_name) # accounts//rlorTrainerJobs/
```
### `wait_for_ready(job_id, job_name=None, poll_interval_s=5.0, timeout_s=900)`
Poll until a trainer job reaches `RUNNING` state and is healthy. Returns a `TrainerServiceEndpoint`:
```python theme={null}
endpoint = rlor_mgr.wait_for_ready(created.job_id)
```
### `create_and_wait(config, poll_interval_s=5.0, timeout_s=900)`
Create a service-mode trainer and poll until the endpoint is healthy. Combines `create()` + `wait_for_ready()`. Returns a `TrainerServiceEndpoint`.
```python theme={null}
endpoint = rlor_mgr.create_and_wait(TrainerJobConfig(
    base_model="accounts/fireworks/models/qwen3-8b",
    training_shape_ref="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    lora_rank=0,
    learning_rate=1e-5,
    display_name="grpo-policy-trainer",
))
print(endpoint.base_url) # https://
print(endpoint.job_id) #
print(endpoint.job_name) # accounts//rlorTrainerJobs/
```
### `wait_for_existing(job_id, poll_interval_s=5.0, timeout_s=900)`
Wait for an already-existing trainer job to reach `RUNNING` state:
```python theme={null}
existing = rlor_mgr.wait_for_existing(job_id="")
print(existing.base_url)
```
### `resume_and_wait(job_id, poll_interval_s=5.0, timeout_s=900)`
Resume a failed/cancelled/paused job and wait until healthy:
```python theme={null}
endpoint = rlor_mgr.resume_and_wait(job_id="")
```
### `reconnect_and_wait(job_id, ...)`
Handle pod preemption and transient failures. Waits for the job to reach a resumable state, resumes it, then polls until healthy:
```python theme={null}
endpoint = rlor_mgr.reconnect_and_wait(
    job_id="",
    timeout_s=600,
    max_wait_for_resumable_s=120,
)
```
More robust than `resume_and_wait()` — retries when the job is in a transitional state (e.g. the control plane is still processing a pod death).
| Parameter | Type | Default | Description |
| -------------------------- | ------- | ------- | --------------------------------------------- |
| `job_id` | `str` | — | The RLOR job ID to reconnect |
| `poll_interval_s` | `float` | `5.0` | Seconds between health checks after resume |
| `timeout_s` | `float` | `600` | Overall timeout for the job to become RUNNING |
| `max_wait_for_resumable_s` | `float` | `120` | Max seconds to wait for a resumable state |
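The two-phase behavior described above (wait for a resumable state, resume, then poll until running) can be pictured with the following stand-alone sketch. It is not the SDK's implementation: `wait_for_state`, `reconnect_sketch`, and `FakeJob` are hypothetical, and treating `JOB_STATE_FAILED` and `JOB_STATE_CANCELLED` as the resumable states is an assumption made for this example.

```python
import time

# Assumption for this sketch: which states count as resumable.
RESUMABLE = {"JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def wait_for_state(get_state, wanted, timeout_s, poll_interval_s, sleep=time.sleep):
    """Poll get_state() until it returns a wanted state or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in wanted:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {state!r} after {timeout_s}s")
        sleep(poll_interval_s)


def reconnect_sketch(get_state, resume, *, poll_interval_s=5.0, timeout_s=600,
                     max_wait_for_resumable_s=120, sleep=time.sleep):
    # Phase 1: the control plane may still report a transitional state.
    wait_for_state(get_state, RESUMABLE, max_wait_for_resumable_s, poll_interval_s, sleep)
    resume()
    # Phase 2: poll until the trainer is running again.
    return wait_for_state(get_state, {"JOB_STATE_RUNNING"}, timeout_s, poll_interval_s, sleep)


class FakeJob:
    """In-memory job that replays a fixed sequence of states."""

    def __init__(self):
        self._states = iter([
            "JOB_STATE_RUNNING", "JOB_STATE_FAILED",
            "JOB_STATE_PENDING", "JOB_STATE_RUNNING",
        ])
        self.resumed = False

    def get_state(self):
        return next(self._states)

    def resume(self):
        self.resumed = True


job = FakeJob()
final = reconnect_sketch(job.get_state, job.resume, sleep=lambda _s: None)
print(final, job.resumed)  # JOB_STATE_RUNNING True
```

Keeping a separate, shorter budget for the first phase means a job that never becomes resumable fails fast instead of consuming the whole `timeout_s`.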
### `get(job_id)`
Inspect job status:
```python theme={null}
status = rlor_mgr.get(job_id=endpoint.job_id)
print(status["state"]) # JOB_STATE_RUNNING
```
### `delete(job_id)`
Delete a trainer job and release GPU resources:
```python theme={null}
rlor_mgr.delete(job_id="")
```
### `promote_checkpoint(*, name, output_model_id, base_model)`
*Inherited from [`FireworksClient`](/fine-tuning/training-api/reference/fireworks-client).* Promote a sampler checkpoint to a deployable Fireworks model. The trainer job does not need to be running -- the checkpoint resource name resolves the storage location.
```python theme={null}
entry = rlor_mgr.list_checkpoints(endpoint.job_id)[0]
model = rlor_mgr.promote_checkpoint(
    name=entry["name"],
    output_model_id="my-fine-tuned-model",
    base_model="accounts/fireworks/models/qwen3-8b",
)
```
See [`FireworksClient.promote_checkpoint`](/fine-tuning/training-api/reference/fireworks-client#promote_checkpoint-name-output_model_id-base_model) for full parameter docs.
### `resolve_training_profile(shape_id)`
*Inherited from [`FireworksClient`](/fine-tuning/training-api/reference/fireworks-client).* Resolve a training shape ID into a full configuration profile.
```python theme={null}
shape_id = "accounts/fireworks/trainingShapes/ts-qwen3-8b-policy"
profile = rlor_mgr.resolve_training_profile(shape_id)
```
See [`FireworksClient.resolve_training_profile`](/fine-tuning/training-api/reference/fireworks-client#resolve_training_profileshape_id) for full parameter docs.
## TrainerJobConfig
`TrainerJobManager.create(...)` and `TrainerJobManager.create_and_wait(...)` accept a `TrainerJobConfig` dataclass.
Launching through a training shape is the recommended path. In normal user code, you should not hand-author `training_shape_ref`; pass a training shape ID to `resolve_training_profile(...)` and use the returned versioned ref. Advanced manual launches can omit `training_shape_ref` and provide infra fields directly.
When `training_shape_ref` is set (the recommended **shape path**), the training shape owns the trainer's hardware and image configuration. The fields below are what you set as a user:
| Field | Type | Default | Description |
| ---------------------------- | ----------------------------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `base_model` | `str` | — | Base model name (e.g. `"accounts/fireworks/models/qwen3-8b"`) |
| `training_shape_ref` | `str \| None` | `None` | Full training-shape resource name (e.g. `accounts/fireworks/trainingShapes/` or `.../versions/`). Use `mgr.resolve_training_profile(...)` to get the pinned versioned ref. See [Training Shapes](/fine-tuning/training-api/training-shapes). |
| `lora_rank` | `int` | `0` | LoRA rank. `0` for full-parameter tuning, or a positive integer (e.g. `16`, `64`) for LoRA |
| `max_context_length` | `int \| None` | `None` | Maximum sequence length. Usually inherited from the training shape on the shape path. |
| `learning_rate` | `float` | `1e-5` | Learning rate for the optimizer |
| `display_name` | `str \| None` | `None` | Human-readable trainer name |
| `region` | `str \| None` | `None` | Region for the job |
| `extra_args` | `list[str] \| None` | `None` | Extra trainer arguments |
| `forward_only` | `bool` | `False` | Create a forward-only trainer (reference model pattern) |
| `inactivity_timeout` | `datetime.timedelta \| str \| None` | `None` | Trainer inactivity timeout. The trainer reports tracked activity, including trainer API operations and active-session heartbeats. If no tracked activity is observed for this duration, the trainer is automatically stopped. When unset or `0`, Fireworks uses the 60-minute default. String values must use protobuf JSON duration format, such as `"1800s"`. |
| `disable_inactivity_cleanup` | `bool` | `False` | Disable trainer inactivity cleanup. GPU usage continues to accrue while the trainer is running. |
`gradient_accumulation_steps` is deprecated in `TrainerJobConfig`. Do not use it to request server-side accumulation. Accumulate gradients in client code by calling `forward_backward...` multiple times before one `optim_step(...)`; see [Loss Functions](/fine-tuning/training-api/loss-functions#applying-the-optimizer-step).
On the recommended shape path, `accelerator_type`, `accelerator_count`, `node_count`, and `custom_image_tag` are automatically configured by the training shape and cannot be overridden. Advanced manual launches can omit `training_shape_ref` and set those fields directly.
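For `inactivity_timeout`, the table above notes that string values must use the protobuf JSON duration format, such as `"1800s"`. The table also lists `datetime.timedelta` as an accepted type, so conversion is optional. If you do need the string form (for logging or config files, say), a small helper along these lines would produce it. `to_proto_duration` is a hypothetical name, not part of the SDK.

```python
from datetime import timedelta


def to_proto_duration(value: timedelta) -> str:
    """Format a timedelta as a protobuf JSON duration string, e.g. '1800s'."""
    total = value.total_seconds()
    if total < 0:
        raise ValueError("inactivity timeout must not be negative")
    if total == int(total):
        return f"{int(total)}s"
    # Protobuf JSON durations allow up to 9 fractional digits.
    return f"{total:.9f}".rstrip("0") + "s"


print(to_proto_duration(timedelta(minutes=30)))   # 1800s
print(to_proto_duration(timedelta(seconds=1.5)))  # 1.5s
```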
## CreatedTrainerJob
Returned by `create()`:
| Field | Type | Description |
| ---------- | ----- | --------------------------------------------------------- |
| `job_name` | `str` | Full resource name (`accounts//rlorTrainerJobs/`) |
| `job_id` | `str` | RLOR trainer job ID |
## TrainerServiceEndpoint
Returned by `create_and_wait`, `wait_for_ready`, `wait_for_existing`, `resume_and_wait`, and `reconnect_and_wait`:
| Field | Type | Description |
| ---------- | ----- | --------------------------------------------------------- |
| `base_url` | `str` | Trainer endpoint URL for connecting a training client |
| `job_id` | `str` | RLOR trainer job ID |
| `job_name` | `str` | Full resource name (`accounts//rlorTrainerJobs/`) |
## TrainingShapeProfile
See [`FireworksClient` > TrainingShapeProfile](/fine-tuning/training-api/reference/fireworks-client#trainingshapeprofile) for the full field reference.
## Job states
| State | Meaning |
| --------------------- | ---------------------------------------------------- |
| `JOB_STATE_CREATING` | Resources being provisioned |
| `JOB_STATE_PENDING` | Queued, waiting for GPU availability |
| `JOB_STATE_RUNNING` | Trainer is ready — you can connect a training client |
| `JOB_STATE_IDLE` | Service-mode job is idle |
| `JOB_STATE_COMPLETED` | Job finished successfully |
| `JOB_STATE_FAILED` | Job failed |
| `JOB_STATE_CANCELLED` | Job was cancelled |
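A polling loop usually only needs to know whether to keep waiting, connect, or stop. The sketch below groups the states from the table by that decision. The grouping is an assumption inferred from the table's meanings (for example, that `COMPLETED`, `FAILED`, and `CANCELLED` are terminal), and the helper name and return labels are hypothetical.

```python
# Assumed grouping, inferred from the "Meaning" column above.
WAITING_STATES = frozenset({"JOB_STATE_CREATING", "JOB_STATE_PENDING"})
TERMINAL_STATES = frozenset({
    "JOB_STATE_COMPLETED",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
})


def next_action(state: str) -> str:
    """Map a job state string to a coarse client action (hypothetical helper)."""
    if state == "JOB_STATE_RUNNING":
        return "connect"  # trainer is ready for a training client
    if state in WAITING_STATES:
        return "wait"     # still provisioning or queued for GPUs
    if state in TERMINAL_STATES:
        return "stop"     # no further progress expected
    return "inspect"      # JOB_STATE_IDLE or an unrecognized state


print(next_action("JOB_STATE_PENDING"))  # wait
```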
## Related guides
* [FiretitanServiceClient](/fine-tuning/training-api/reference/service-client) — create a `FiretitanTrainingClient` for a live trainer endpoint
* [Training Shapes](/fine-tuning/training-api/training-shapes) — available shapes and deployment linkage
* [Cleanup](/fine-tuning/training-api/reference/cleanup) — resource cleanup
# WeightSyncer (Legacy)
Source: https://docs.fireworks.ai/fine-tuning/training-api/reference/weight-syncer
Backward-compatibility reference for the old standalone checkpoint-then-sync helper.
## Overview
`WeightSyncer` is a legacy low-level helper kept in the SDK API reference only for backward compatibility. Do not use it in new cookbook recipes or direct user loops. Use the SDK-managed service flow instead: `training_client.save_weights_for_sampler(...).result()` followed by `service.create_sampling_client(model_path=saved.path)` or `service.create_deployment_sampler(model_path=saved.path)`.
`WeightSyncer` coordinates saving sampler checkpoints and syncing them to a deployment, including automatic base/delta chain state tracking, session-scoped snapshot naming, and post-sync warmup. The managed service client now owns this logic internally.
```python theme={null}
from fireworks.training.sdk import WeightSyncer
```
For full-parameter training, only the first checkpoint (saved as `base`) is promotable; subsequent `delta` checkpoints are not. LoRA checkpoints are always promotable (delta chain is disabled via `lora_rank > 0`). See [Checkpoint kinds](/fine-tuning/training-api/cookbook/checkpoints#checkpoint-kinds) for the full promotability matrix.
## Constructor
```python theme={null}
tracker = WeightSyncer(
    policy_client=training_client,
    deploy_mgr=deploy_mgr,
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    hotload_timeout=600,
    first_checkpoint_type="base",
    warmup_after_hotload=True,
    reset_prompt_cache=True,
    lora_rank=0,  # >0 for LoRA adapters (disables delta chain)
)
```
| Field | Type | Default | Description |
| ----------------------- | --------------------------- | ---------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `policy_client` | `FiretitanTrainingClient` | — | Training client for save operations |
| `deploy_mgr` | `DeploymentManager \| None` | `None` | Deployment manager for weight sync (`None` = no weight sync) |
| `deployment_id` | `str \| None` | `None` | Target deployment for weight sync |
| `base_model` | `str` | `""` | Model name for weight sync API calls |
| `hotload_timeout` | `int` | `600` | Timeout in seconds for `hotload_and_wait` |
| `first_checkpoint_type` | `str` | `"base"` | Type for the first checkpoint (`"base"` or `"delta"`) |
| `compression_format` | `str` | `"arc_v2"` | Delta compression format |
| `warmup_after_hotload` | `bool` | `True` | Send a warmup request after each successful weight sync |
| `warmup_max_retries` | `int` | `10` | Max retries for post-weight-sync warmup |
| `reset_prompt_cache` | `bool` | `True` | Reset the deployment's prompt cache after each weight sync. See [KV cache behavior for RL rollouts](/guides/rollout-inference#kv-cache-behavior-for-rl-rollouts) for active stream, session ID, and reset-option semantics. |
| `lora_rank` | `int` | `0` | When > 0, forces all checkpoints to `base` type (no delta chain). LoRA adapter exports are standalone PEFT artifacts that cannot use incremental delta compression. |
## Methods
### `save_and_hotload(name, checkpoint_type=None)`
Save sampler weights and sync to deployment. Automatically handles base (first) vs delta (subsequent) checkpoint types.
Returns the `snapshot_name` (`str | None`) on success or raises on failure:
```python theme={null}
tracker.save_and_hotload(f"step-{step:05d}")
```
### `save_only(name, checkpoint_type=None)`
Save sampler weights without syncing to deployment:
```python theme={null}
snapshot = tracker.save_only("checkpoint-name", checkpoint_type="base")
```
Returns `snapshot_name` or `None`.
### `hotload(snapshot_name, checkpoint_type)`
Sync a previously saved snapshot to the deployment:
```python theme={null}
tracker.hotload(snapshot, checkpoint_type="base")
```
Returns `True` on success, `False` on failure.
### `check_deployment_state()`
Query the deployment's current weight sync state:
```python theme={null}
current = tracker.check_deployment_state()
print(current) # current_snapshot_identity or None
```
### `wait_for_hotload_ready(timeout_s=300, poll_interval_s=5)`
Block until the deployment's weight sync manager is initialized.
### `reset_delta_chain()`
Force the next save to be treated as `base`. Call when the deployment's bucket or trainer session changes — for example, after attaching an existing deployment to a new trainer job — otherwise the next `delta` could reference a base checkpoint the deployment never loaded.
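The base/delta bookkeeping described in this section can be pictured as a small state machine. The sketch below is only an illustrative model of the documented behavior (first save uses `first_checkpoint_type`, later saves use `delta`, `lora_rank > 0` forces `base`, and `reset_delta_chain()` forces the next save to `base`). The class name `DeltaChainSketch` is hypothetical, and this is not the SDK's implementation.

```python
from dataclasses import dataclass


@dataclass
class DeltaChainSketch:
    """Illustrative model of base/delta selection; not the SDK's code."""

    first_checkpoint_type: str = "base"
    lora_rank: int = 0

    def __post_init__(self) -> None:
        # Type that the next save will use.
        self._next_type = self.first_checkpoint_type

    def next_checkpoint_type(self) -> str:
        if self.lora_rank > 0:
            return "base"  # LoRA exports are standalone, so no delta chain
        kind = self._next_type
        self._next_type = "delta"  # later saves build on the anchored base
        return kind

    def reset_delta_chain(self) -> None:
        # The deployment may not hold the old base, so start a new chain.
        self._next_type = "base"


chain = DeltaChainSketch()
kinds = [chain.next_checkpoint_type() for _ in range(3)]
chain.reset_delta_chain()
kinds.append(chain.next_checkpoint_type())
print(kinds)  # ['base', 'delta', 'delta', 'base']
```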
## Usage patterns
These patterns are for maintaining older integrations. New code should use the service-client sampler refresh pattern documented in [Training and Sampling](/fine-tuning/training-api/training-and-sampling).
### Sync weights every step
To minimize sampler staleness in a synchronous loop, sync a new sampler snapshot after every optimizer step before submitting the next rollout batch. This makes new rollout requests target the latest synced checkpoint, but the loop still owns draining or rejecting any stale in-flight requests before training on them:
```python theme={null}
import asyncio

for step in range(total_steps):
    # ... training step ...
    tracker.save_and_hotload(f"step-{step:05d}")
    completions = asyncio.run(
        sampler.sample_with_tokens(messages=input_messages, n=4)
    )
```
### Interval weight sync
For throughput-oriented loops that tolerate stale sampler weights, sync a new sampler snapshot every N steps. This only controls when new sampler snapshots are saved and synced; it does not prove that already-submitted or in-flight requests were generated by the latest policy:
```python theme={null}
for step in range(total_steps):
    # ... training step ...
    if step % weight_sync_interval == 0:
        tracker.save_and_hotload(f"step-{step:05d}")
```
### Split save and sync
Separate save from weight sync when you need intermediate steps (e.g. warmup):
```python theme={null}
snapshot = tracker.save_only("resume-step-0", checkpoint_type="base")
deploy_mgr.warmup(model)
tracker.hotload(snapshot, checkpoint_type="base")
```
### DCP checkpoints for resume
Save DCP checkpoints at intervals using the training client directly:
```python theme={null}
for step in range(total_steps):
    # ... training step ...
    tracker.save_and_hotload(f"step-{step:05d}")
    if step % dcp_interval == 0:
        training_client.save_state(f"step-{step}")
```
## Related guides
* [DeploymentManager](/fine-tuning/training-api/reference/deployment-manager) — deployment lifecycle and weight-sync API
* [Saving and Loading](/fine-tuning/training-api/saving-and-loading) — checkpoint concepts
* [Training and Sampling](/fine-tuning/training-api/training-and-sampling) — end-to-end workflow
# Saving and Loading
Source: https://docs.fireworks.ai/fine-tuning/training-api/saving-and-loading
SDK-level reference for checkpoint save, load, weight sync, and promotion.
**Most users don't need this page.** If you're launching training through a cookbook recipe (`rl_loop`, `sft_loop`, etc.), the recipe handles save, resume, and promote for you — set `dcp_save_interval` and `output_model_id` on your config and you're done. See [Checkpoints and Resume (cookbook)](/fine-tuning/training-api/cookbook/checkpoints) for the recipe-driven flow.
This page is the SDK-level reference for advanced users who are forking a recipe, calling the SDK directly, or debugging a checkpoint that doesn't promote.
## What this is
During training, you save checkpoints for three purposes:
1. **Sampler refresh / weight sync** (`save_weights_for_sampler` + `create_sampling_client(model_path=...)`): Save updated sampler weights, then sync the returned snapshot identity onto a running inference deployment without restarting it.
2. **Resuming** (`save_state` / `load_state_with_optimizer`): Persist full training state (weights + optimizer) so you can continue training from where you left off.
3. **Promotion** (`promote_checkpoint`): Turn a saved sampler checkpoint into a deployable Fireworks model.
## Sampler checkpoints
Sampler checkpoints are weight-only snapshots used for weight sync and promotion. For promotability rules, see [Checkpoint kinds](/fine-tuning/training-api/cookbook/checkpoints#checkpoint-kinds) — the cookbook page is the source of truth.
The raw SDK exposes two `checkpoint_type` modes that affect size and weight-sync speed:
| `checkpoint_type` | What it saves | Size |
| ----------------- | --------------------------- | ---------------------- |
| `"base"` | Full model weights | Large (\~16 GB for 8B) |
| `"delta"` | XOR diff from previous base | \~10× smaller |
Delta is much faster for per-step weight sync (`current_weights = base XOR delta` on the deployment). LoRA sampler checkpoints always contain the full adapter regardless of `checkpoint_type`.
On full-parameter training, `checkpoint_type="delta"` produces a blob that cannot be promoted — only `"base"` can. Use the SDK-managed service path (`save_weights_for_sampler` -> `create_sampling_client(model_path=...)`) or the cookbook recipe weight-sync path for the safe base-then-delta pattern. The cookbook's `TrainingCheckpoints.save(promotable=True)` always saves `base`.
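To make the `current_weights = base XOR delta` relationship concrete, here is a toy sketch on raw bytes. It only demonstrates the XOR round trip; the real checkpoint layout and compression format are not shown, and the helper name `xor_bytes` is hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Element-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


base = bytes([0x10, 0x20, 0x30, 0x40])     # weights stored in the base checkpoint
current = bytes([0x10, 0x21, 0x30, 0x40])  # weights after a small update

delta = xor_bytes(base, current)   # unchanged bytes XOR to zero
restored = xor_bytes(base, delta)  # what a deployment holding `base` reconstructs

assert restored == current
print(delta.hex())  # 00010000
```

The round trip also shows why a delta on its own is not promotable: without the matching base, the delta bytes do not determine the weights.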
### Saving checkpoints
```python theme={null}
# First checkpoint: must be base (full weights)
saved = training_client.save_weights_for_sampler(
    "step-0001",
    checkpoint_type="base",
).result()
# saved.path is the sampler snapshot identity (e.g. "step-0001-a1b2c3d4")

# Subsequent checkpoints: delta is faster
saved = training_client.save_weights_for_sampler(
    "step-0010",
    checkpoint_type="delta",
).result()

# With TTL (auto-delete after N seconds)
saved = training_client.save_weights_for_sampler(
    "temp-checkpoint",
    checkpoint_type="delta",
    ttl_seconds=3600,
).result()
```
`save_weights_for_sampler_ext(...)` is the Fireworks-specific low-level variant that returns `SaveSamplerResult` directly. Use it when you need a concrete return value immediately; use `save_weights_for_sampler(...).result()` for the Tinker-shaped API.
## Promoting a checkpoint to a model
Promote a sampler checkpoint to a deployable Fireworks model. Available on [`FireworksClient`](/fine-tuning/training-api/reference/fireworks-client) and on the SDK-managed [`FiretitanServiceClient`](/fine-tuning/training-api/reference/service-client) after provisioning. The trainer job does not need to be running — its row only needs to exist; promotion is a metadata + file-copy operation. See [Checkpoint kinds](/fine-tuning/training-api/cookbook/checkpoints#checkpoint-kinds) for which checkpoints are promotable.
### Preferred: pass the 4-segment `name=` from `list_checkpoints`
`list_checkpoints` returns each checkpoint's full resource name (`accounts//rlorTrainerJobs//checkpoints/`). Hand that string straight to `promote_checkpoint` — no manual disassembly into `(job_id, checkpoint_id)`:
```python theme={null}
from fireworks.training.sdk import FireworksClient
client = FireworksClient(api_key=api_key)
# Pick a row from the trainer's checkpoints — usually newest promotable.
rows = client.list_checkpoints(job_id)
target = next(r for r in rows if r.get("promotable"))
model = client.promote_checkpoint(
    name=target["name"],  # 4-segment resource path
    output_model_id="my-fine-tuned-qwen3-8b",
    base_model="accounts/fireworks/models/qwen3-8b",
)
```
| Parameter | Type | Description |
| ----------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `name` | `str` | Full 4-segment checkpoint resource name from `list_checkpoints` output |
| `output_model_id` | `str` | Desired model ID (1-63 chars, lowercase a-z, 0-9, hyphen only). Validate with `validate_output_model_id` before calling — a rejected ID orphans the staged sampler blob. |
| `base_model` | `str` | Base model resource name for metadata inheritance (e.g. `accounts/fireworks/models/qwen3-8b`) |
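Because a rejected ID orphans the staged sampler blob, it can help to sanity-check the ID before calling anything. The sketch below encodes only the rule stated in the table (1-63 characters, lowercase a-z, 0-9, hyphen). It is a local illustration, not the SDK's `validate_output_model_id`, which may enforce additional rules; the function name is hypothetical.

```python
import re

# Encodes only the documented character and length rule (an assumption
# about completeness; the SDK validator may be stricter).
_MODEL_ID_RE = re.compile(r"[a-z0-9-]{1,63}")


def looks_like_valid_model_id(model_id: str) -> bool:
    """Return True if model_id matches the documented character/length rule."""
    return _MODEL_ID_RE.fullmatch(model_id) is not None


print(looks_like_valid_model_id("my-fine-tuned-qwen3-8b"))  # True
print(looks_like_valid_model_id("My_Model"))                # False
```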
### Legacy: positional `(job_id, checkpoint_id)` form
The previous `(job_id, checkpoint_id)` shape still works for callers that haven't migrated. It fires a `DeprecationWarning` whenever `name=` is omitted, regardless of whether `job_id` and `checkpoint_id` are passed positionally or as keywords:
```python theme={null}
model = client.promote_checkpoint(
    job_id=endpoint.job_id,
    checkpoint_id=result.snapshot_name,
    output_model_id="my-fine-tuned-qwen3-8b",
    base_model="accounts/fireworks/models/qwen3-8b",
)
# DeprecationWarning: promote_checkpoint(job_id, checkpoint_id, ...) positional
# form is deprecated. Pass the 4-segment resource name instead:
# promote_checkpoint(name=entry['name'], output_model_id=..., base_model=...).
# The 'name' field comes straight from list_checkpoints output.
```
To migrate, look the row up via `list_checkpoints` and pass its `name` field straight through:
```python theme={null}
entry = client.list_checkpoints(endpoint.job_id)[0]
model = client.promote_checkpoint(
    name=entry["name"],
    output_model_id="my-fine-tuned-qwen3-8b",
    base_model="accounts/fireworks/models/qwen3-8b",
)
```
The `hot_load_deployment_id` parameter has its own `DeprecationWarning` and is only needed for deployments that predate the stored-bucket-URL migration:
```
DeprecationWarning: promote_checkpoint(hot_load_deployment_id=...) is
deprecated. The gateway resolves the bucket URL from the trainer's
stored metadata for any run on cookbook >= 0.3.0 (both PER_TRAINER
and PER_DEPLOYMENT bucket scopes). Omit this argument unless you are
promoting a checkpoint from a deployment that predates the
stored-bucket-URL migration.
```
For modern runs (cookbook ≥ 0.3.0, either bucket scope), omit the argument.
### Listing checkpoints on a trainer
```bash theme={null}
curl "https://api.fireworks.ai/v1/accounts//rlorTrainerJobs//checkpoints?pageSize=200" \
-H "Authorization: Bearer $FIREWORKS_API_KEY"
```
Each entry includes `name`, `createTime`, `updateTime`, `checkpointType`, and `promotable`.
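When several entries are promotable, a common choice is the most recently created one. The sketch below shows one way to pick it from a list of entries shaped like the response described above. It assumes `createTime` is an RFC 3339 string with whole-second or microsecond precision; the sample resource names and the helper names are hypothetical.

```python
from datetime import datetime


def parse_rfc3339(ts: str) -> datetime:
    # Assumes whole-second or microsecond precision with a "Z" or offset suffix.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def newest_promotable(entries):
    """Return the most recently created promotable entry, or None."""
    candidates = [e for e in entries if e.get("promotable")]
    if not candidates:
        return None
    return max(candidates, key=lambda e: parse_rfc3339(e["createTime"]))


# Hypothetical sample data for illustration only.
entries = [
    {"name": "accounts/example/rlorTrainerJobs/job-1/checkpoints/step-0001",
     "createTime": "2025-01-01T00:00:00Z", "promotable": True},
    {"name": "accounts/example/rlorTrainerJobs/job-1/checkpoints/step-0002",
     "createTime": "2025-01-01T01:00:00Z", "promotable": True},
    {"name": "accounts/example/rlorTrainerJobs/job-1/checkpoints/step-0003",
     "createTime": "2025-01-01T02:00:00Z", "promotable": False},
]

print(newest_promotable(entries)["name"])
```

The returned entry's `name` is the 4-segment resource name that `promote_checkpoint(name=...)` expects.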
## Sampler refresh / weight sync
Weight sync pushes a checkpoint onto a running inference deployment without restarting it. With the SDK-managed service client, you do this by saving sampler weights and then creating a sampler for that snapshot:
```python theme={null}
saved = training_client.save_weights_for_sampler(f"step-{step:05d}").result()
# Tinker-shaped sampler wrapper.
sampler = service.create_sampling_client(model_path=saved.path)
# Or, for tokenized rollout/eval features:
deployment_sampler = service.create_deployment_sampler(
    model_path=saved.path,
    tokenizer=tokenizer,
    concurrency_controller=controller,
)
```
The service client owns the base/delta chain, incremental weight-sync metadata, deployment weight-sync call, and sampler construction. Existing low-level code that manually uses `DeploymentManager` or `WeightSyncer` should be treated as compatibility code; new user loops should use the service-client pattern above.
## Train-state checkpoints
Use `save_state` to persist full training state, and one of two load methods to restore it:
| Method | Weights | Optimizer state |
| --------------------------------- | -------- | --------------- |
| `load_state_with_optimizer(path)` | Restored | Restored |
| `load_state(path)` | Restored | Reset to zero |
```python theme={null}
# Save full train state for resume
training_client.save_state("train_state_step_100").result()
# Resume training (weights + optimizer restored)
training_client.load_state_with_optimizer("train_state_step_100").result()
```
`save_state` accepts optional `ttl_seconds` and `timeout` parameters. When `timeout` is set, the SDK blocks until the save completes or the timeout expires.
For the raw `FiretitanTrainingClient`, `save_state()`, `load_state()`, and `load_state_with_optimizer()` return futures — call `.result()` to block. The cookbook's `ReconnectableClient` wrapper blocks for you.
### Cross-job checkpoint resolution
```python theme={null}
checkpoint_ref = training_client.resolve_checkpoint_path(
    "step-4",
    source_job_id="previous-job-id",
)
training_client.load_state_with_optimizer(checkpoint_ref).result()
```
### List available checkpoints
```python theme={null}
checkpoint_names = training_client.list_checkpoints()
print(checkpoint_names) # e.g. ["step-2", "step-4"]
```
## Related guides
* [Checkpoints and Resume (cookbook)](/fine-tuning/training-api/cookbook/checkpoints) — recipe-driven save / resume / promote (start here for most users)
* [FiretitanServiceClient reference](/fine-tuning/training-api/reference/service-client) — managed trainer/deployment clients and sampler refresh
* [DeploymentManager reference](/fine-tuning/training-api/reference/deployment-manager) — compatibility weight-sync API for existing low-level integrations
# Training and Sampling
Source: https://docs.fireworks.ai/fine-tuning/training-api/training-and-sampling
End-to-end SDK walkthrough: bootstrap resources, train, checkpoint, and sample through a serving deployment.
## What this is
This is the default lifecycle for research loops that need serving-quality evaluation during training: create an SDK-managed trainer and deployment, run iterative updates, save sampler weights, sync those weights to the deployment, then sample through the deployment.
For production RL, prefer the [cookbook recipes](/fine-tuning/training-api/cookbook/overview). They wrap this same SDK-managed service path and handle batching, reference clients, checkpoints, reconnect, and cleanup.
## Workflow
1. **Create the managed service** with `FiretitanServiceClient.from_firetitan_config(...)`.
2. **Create a training client** with `service.create_training_client(...)`.
3. **Create a deployment sampler** with `service.create_deployment_sampler(...)`.
4. **Run train steps**: `forward_backward_custom(...)` + `optim_step(...)`.
5. **Save sampler weights** with `training_client.save_weights_for_sampler(...).result()`.
6. **Refresh the sampler** with `service.create_deployment_sampler(model_path=saved.path, ...)`.
7. **Sample and evaluate** through the deployment endpoint.
The SDK owns trainer provisioning, deployment provisioning, bucket wiring, base-vs-delta sampler checkpoint selection, weight sync, and teardown. You do not construct `TrainerJobManager`, `DeploymentManager`, or `WeightSyncer` for the normal SDK flow.
## End-to-end example
The only training-shape input you choose below is the shape ID. The SDK resolves the versioned trainer shape and linked deployment shape before launch.
### 1. Bootstrap trainer and deployment
```python theme={null}
import os

import tinker
from transformers import AutoTokenizer

from fireworks.training.sdk import (
    AdaptiveConcurrencyController,
    FiretitanServiceClient,
)

api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")
base_model = "accounts/fireworks/models/qwen3-8b"
tokenizer_model = "Qwen/Qwen3-8B"
shape_id = "accounts/fireworks/trainingShapes/qwen3-8b-128k-h200"

service = FiretitanServiceClient.from_firetitan_config(
    api_key=api_key,
    base_url=base_url,
    base_model=base_model,
    tokenizer_model=tokenizer_model,
    lora_rank=0,
    training_shape_id=shape_id,
    deployment_id="research-serving",
    learning_rate=1e-5,
    replica_count=1,  # deployment replicas for rollout/eval throughput
    cleanup_trainer_on_close=True,
    cleanup_deployment_on_close="scale_to_zero",
)
training_client = service.create_training_client(base_model=base_model, lora_rank=0)

tokenizer = AutoTokenizer.from_pretrained(tokenizer_model, trust_remote_code=True)
concurrency = AdaptiveConcurrencyController(initial_window=16)
sampler = service.create_deployment_sampler(
    tokenizer=tokenizer,
    concurrency_controller=concurrency,
)
print({"trainer_job_id": service.trainer_job_id, "deployment_id": service.deployment_id})
```
### 2. Train step with custom objective
```python theme={null}
def objective(data, logprobs_list):
    loss = compute_objective(data=data, logprobs_list=logprobs_list)
    return loss, {"loss": float(loss.item())}


for step in range(total_steps):
    # Accumulate gradients client-side: run N forward/backward calls, then one optim_step.
    micro_batches = build_micro_batches(step)
    for micro_batch in micro_batches:
        training_client.forward_backward_custom(micro_batch, objective).result()
    training_client.optim_step(
        tinker.AdamParams(
            learning_rate=1e-5,
            beta1=0.9,
            beta2=0.999,
            eps=1e-8,
            weight_decay=0.01,
        )
    ).result()
```
### 3. Save, sync, sample, evaluate
```python theme={null}
import asyncio

if step % eval_interval == 0:
    saved = training_client.save_weights_for_sampler(f"step_{step:05d}").result()
    # Passing model_path syncs the saved snapshot into the SDK-managed
    # deployment and returns a sampler backed by that deployment.
    sampler = service.create_deployment_sampler(
        model_path=saved.path,
        tokenizer=tokenizer,
        concurrency_controller=concurrency,
    )
    completions = asyncio.run(
        sampler.sample_with_tokens(
            messages=eval_prompts,
            n=1,
            max_tokens=512,
        )
    )
    score = evaluate_responses(completions)
    print({"step": step, "checkpoint": saved.path, "eval_score": score})
```
`save_weights_for_sampler(...)` returns a future whose `.result().path` is a public sampler snapshot identity, not a raw storage URI. `create_deployment_sampler(model_path=...)` consumes that identity, syncs it to the deployment, and returns the FireTitan-native deployment sampler. Use `service.create_sampling_client(model_path=...)` instead if you need the Tinker-shaped sampling client wrapper.
## Concurrency control
`sample_with_tokens(n=K)` fans out K concurrent requests. A concurrency controller prevents overloading the deployment:
* **`AdaptiveConcurrencyController`** (recommended) — automatically adjusts the concurrency window based on the server's prefill queue latency. Starts at `initial_window` and grows or shrinks between steps using AIMD.
* **`FixedConcurrencyController`** — a static semaphore with a fixed maximum. Use when you already know the right concurrency for your deployment.
See [DeploymentSampler — Concurrency Control](/fine-tuning/training-api/reference/deployment-sampler#concurrency-control) for full details and configuration options.
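AIMD stands for additive increase, multiplicative decrease: the window grows by a small constant while the server keeps up and is cut by a factor when it falls behind. The sketch below illustrates that rule in isolation. It is not the SDK's `AdaptiveConcurrencyController`; the class name, the latency threshold, and every parameter other than `initial_window` are assumptions made for the example.

```python
class AimdWindow:
    """Toy AIMD window; illustrative only, not the SDK controller."""

    def __init__(self, initial_window=16, min_window=1, max_window=256,
                 increase=1, decrease_factor=0.5, latency_target_s=1.0):
        self.window = initial_window
        self.min_window = min_window
        self.max_window = max_window
        self.increase = increase
        self.decrease_factor = decrease_factor
        self.latency_target_s = latency_target_s

    def update(self, observed_latency_s: float) -> int:
        """Adjust the window after one step and return the new value."""
        if observed_latency_s > self.latency_target_s:
            # Multiplicative decrease: back off quickly under load.
            shrunk = int(self.window * self.decrease_factor)
            self.window = max(self.min_window, shrunk)
        else:
            # Additive increase: probe for more capacity slowly.
            self.window = min(self.max_window, self.window + self.increase)
        return self.window


w = AimdWindow(initial_window=16)
print(w.update(0.2))  # 17
print(w.update(3.0))  # 8
```

The asymmetry is the point of the design: slow growth avoids overshooting the deployment's capacity, and the sharp cut clears a queue quickly once latency rises.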
## Reference clients
For DPO, GRPO with KL, or any objective that needs frozen-reference logprobs, ask the service for a reference client:
```python theme={null}
reference_client = service.create_reference_client(base_model, lora_rank=0)
ref = reference_client.forward(datums, "cross_entropy").result()
```
The SDK chooses the backing automatically:
* LoRA policy with no explicit `reference_training_shape_id` reuses the policy trainer session with adapters disabled.
* Full-parameter policy, or any explicit `reference_training_shape_id`, uses a separate forward-only reference trainer owned by the service.
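The two rules above amount to a single condition. As a sketch of that decision only (the SDK makes this choice internally; the function name and return labels here are hypothetical):

```python
from typing import Optional


def reference_backing(lora_rank: int,
                      reference_training_shape_id: Optional[str] = None) -> str:
    """Describe which backing the documented rules would select."""
    if lora_rank > 0 and reference_training_shape_id is None:
        # LoRA policy: reuse the policy trainer with adapters disabled.
        return "shared_policy_session"
    # Full-parameter policy, or an explicit reference shape was requested.
    return "separate_forward_only_trainer"


print(reference_backing(lora_rank=16))  # shared_policy_session
print(reference_backing(lora_rank=0))   # separate_forward_only_trainer
```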
## Reconnecting to a running trainer
If your client disconnects, re-create the service with the existing trainer job ID. The SDK waits for the trainer, reconnects the training client, and can reuse or reattach the deployment:
```python theme={null}
service = FiretitanServiceClient.from_firetitan_config(
    api_key=api_key,
    base_url=base_url,
    base_model=base_model,
    tokenizer_model=tokenizer_model,
    lora_rank=0,
    training_shape_id=shape_id,
    trainer_job_id="",  # set this to the existing trainer job ID
    deployment_id="research-serving",
)
training_client = service.create_training_client(base_model=base_model, lora_rank=0)
```
For DCP train-state resume, load a saved state after creating the client:
```python theme={null}
training_client.load_state_with_optimizer("step-100").result()
```
## Cleanup
Close the service when the loop exits:
```python theme={null}
try:
    run_training_loop()
finally:
    service.close()
```
`cleanup_trainer_on_close=True` deletes SDK-managed trainers. `cleanup_deployment_on_close="scale_to_zero"` releases deployment GPUs while keeping the deployment resource around for later reuse; use `"delete"` only when you want to remove the deployment entirely.
## Operational guidance
* **Start from cookbook recipes** for SFT, DPO, ORPO, GRPO, IGPO, and async RL; fork them when you need custom loop behavior.
* **Use the managed service as the provisioning boundary** in direct SDK code. Manager classes are documented only for compatibility and advanced lifecycle debugging.
* **Service mode supports both full-parameter and LoRA tuning.** Set `lora_rank=0` for full-parameter or a positive integer for LoRA.
* **Use `save_weights_for_sampler(...)` for normal sampler refresh.** The SDK tracks the base/delta chain and performs weight sync through `create_sampling_client(model_path=...)` or `create_deployment_sampler(model_path=...)`.
* **Use `save_state(...)` for DCP resume checkpoints.** Sampler checkpoints are for serving/evaluation and promotion; DCP checkpoints restore training state.
* **Store the exact prompt set and sampler snapshot path** for every evaluation sweep.
## Related guides
* [Loss Functions](/fine-tuning/training-api/loss-functions) — built-in and custom loss function patterns
* [Vision Inputs](/fine-tuning/training-api/vision-inputs) — fine-tune VLMs with image and text data
* [Saving and Loading](/fine-tuning/training-api/saving-and-loading) — checkpoint types and weight sync details
* [DeploymentSampler reference](/fine-tuning/training-api/reference/deployment-sampler) — sampling API details
* [Cleanup and Teardown](/fine-tuning/training-api/reference/cleanup) — managed service cleanup
# Training Shapes
Source: https://docs.fireworks.ai/fine-tuning/training-api/training-shapes
Pre-configured GPU and model training profiles that simplify distributed training setup.
In practice, a training shape is the user-facing launch input for trainer jobs. Most users only need to choose a training shape ID such as `accounts/fireworks/trainingShapes/qwen3p5-9b-256k` and pass it to the API.
**A training shape is the recommended launch path for normal trainer jobs.** In most cases, pass the full shared shape path as `training_shape_id`, and the SDK resolves the pinned version for you. Advanced compatibility launches can still use manager-level shape refs and direct infra fields, but use that path only when you know the exact hardware and image configuration.
The `fireworks` account is the shared public shape catalog. Shapes published under it can be referenced by all users.
You do not need to know the versioned shape reference, image tag, GPU layout, or linked deployment shape ahead of time. The API resolves those details internally.
## What You Need To Know
For most users, the workflow is:
1. Pick a training shape ID from the available shapes list below. In most cases this should be the full shared path, for example `accounts/fireworks/trainingShapes/qwen3p5-9b-256k`.
2. Pass it as `training_shape_id` to a cookbook recipe's `TrainerConfig`, or to `FiretitanServiceClient.from_firetitan_config(...)`.
3. Let the SDK resolve the pinned shape version and linked deployment shape.
That is the only shape-specific value you choose yourself.
## What A Training Shape Controls
When you specify a training shape, it provides the trainer with:
* GPU and node layout: `acceleratorType`, `acceleratorCount`, `nodeCount`
* Model limits: `maxSupportedContextLength`
* Trainer runtime: `trainerImageTag`
* Linked serving setup: `deploymentShapeVersion`
## What You Can And Can't Change
You can still configure normal training-loop fields such as:
* `base_model`
* `lora_rank`
* `learning_rate`
* `display_name`
* Trainer replica count (`TrainerConfig.replica_count` or `trainer_replica_count`)
* Deployment replica count (`DeployConfig.replica_count` or `replica_count`)
Shape-owned infra is locked. Do not try to override `accelerator_type`, `accelerator_count`, `node_count`, `custom_image_tag`, or the linked deployment shape.
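If launch settings are assembled from several sources (CLI flags, config files), a quick pre-flight check can catch accidental overrides of shape-owned fields before the request is sent. The sketch below is a client-side illustration only; the helper name is hypothetical, and the field set is taken from the sentence above.

```python
# Fields the text above lists as owned by the training shape.
SHAPE_OWNED_FIELDS = frozenset({
    "accelerator_type",
    "accelerator_count",
    "node_count",
    "custom_image_tag",
})


def find_locked_overrides(overrides: dict) -> list:
    """Return the shape-owned field names that `overrides` tries to set."""
    return sorted(
        key for key, value in overrides.items()
        if key in SHAPE_OWNED_FIELDS and value is not None
    )


requested = {"learning_rate": 1e-5, "node_count": 2, "accelerator_type": None}
print(find_locked_overrides(requested))  # ['node_count']
```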
Gradient accumulation is not a trainer-launch setting. To accumulate gradients, call `forward_backward...` multiple times from your client loop before a single `optim_step(...)`; see [Loss Functions](/fine-tuning/training-api/loss-functions#applying-the-optimizer-step).
For field-level behavior and dataclass details, see the [`FiretitanServiceClient`](/fine-tuning/training-api/reference/service-client) reference and the [Cookbook Reference](/fine-tuning/training-api/cookbook/reference).
## Using a Training Shape
The only shape-specific input you provide is the shape ID:
1. **You provide the shape ID** (e.g. `accounts/fireworks/trainingShapes/qwen3p5-9b-256k`) — no version needed.
2. **The SDK resolves the latest validated version** during managed service provisioning.
3. **The SDK applies the linked deployment shape** when you request a sampler deployment.
Pass the shape ID to the managed service:
```python theme={null}
from fireworks.training.sdk import FiretitanServiceClient
shape_id = "accounts/fireworks/trainingShapes/qwen3p5-9b-256k"
service = FiretitanServiceClient.from_firetitan_config(
    api_key=api_key,
    base_model="accounts/fireworks/models/qwen3p5-9b",
    training_shape_id=shape_id,
    lora_rank=0,
    create_deployment=False,
)
training_client = service.create_training_client(
    base_model="accounts/fireworks/models/qwen3p5-9b",
    lora_rank=0,
)
```
Use the full training shape ID including the account prefix (for example `accounts/fireworks/trainingShapes/qwen3p5-9b-256k`). The `fireworks` account is the shared public account for training shapes, and you do **not** need to hand-write a versioned `training_shape_ref` yourself.
## Available Training Shapes
Below is a searchable catalog of customer-ready training shapes per model. During Reinforcement Fine-Tuning (RFT), two types of models are often deployed: a **policy trainer** (which updates its weights) and a **reference model** (which is forward-only).
* **Policy trainer shapes** are used for standard Supervised Fine-Tuning (SFT) or as the active policy model during Reinforcement Learning (RL).
* **LoRA trainer shapes** are used for parameter-efficient fine-tuning.
* **Forward-only / reference shapes** are used for reference models in RL pipelines. They do not require optimizer states or backward passes, and thus often require fewer resources.
The **Surfaces** column shows whether each shape supports direct **API** training jobs, customer-facing **Managed** training (SFT/DPO/RFT webapp), or both — based on model tunability, shape readiness, and linked deployment configuration.
Select a model from the dropdown to view the **Training method support** matrix (SFT / DPO / RFT × LoRA / Full-Param) with per-method surfaces and total GPU requirements, plus the backing training shapes for that model.
# Vision Inputs
Source: https://docs.fireworks.ai/fine-tuning/training-api/vision-inputs
Fine-tune vision-language models (VLMs) with the Training API using multimodal chat data containing images and text.
The Training API supports vision-language model (VLM) fine-tuning, allowing you to train models that understand both images and text. This works across all training modes — SFT, DPO, and RL — using the same API primitives and cookbook recipes you already know.
VLM support in the Training API requires a VLM-compatible training shape. See [Training Shapes](/fine-tuning/training-api/training-shapes#qwen3-vl) for available shapes.
## What changes for vision
Compared to text-only training, VLM fine-tuning differs in three ways:
| Aspect | Text-only | Vision |
| ------------------ | --------------------------------------- | ----------------------------------------------------- |
| **Training shape** | Text model shape (e.g. `qwen3-8b-128k`) | VLM shape (e.g. `qwen3-vl-8b-65k`) |
| **Tokenizer** | Text tokenizer (e.g. `Qwen/Qwen3-8B`) | VLM processor (e.g. `Qwen/Qwen3-VL-8B-Instruct`) |
| **Message format** | `content` is a string | `content` is an array of text and `image_url` objects |
Everything else — loss functions, checkpointing, weight sync, deployment sampling — works identically.
## Dataset format
Vision datasets use the standard OpenAI-compatible chat format. The key difference is that `content` fields can contain an array of content parts mixing text and images:
### Single image
```json theme={null}
{
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What objects do you see in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
}
}
]
},
{
"role": "assistant",
"content": "I can see a red car, a tree, and a blue house."
}
]
}
```
### Multiple images
```json theme={null}
{
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Compare these two images"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
}
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..."
}
}
]
},
{
"role": "assistant",
"content": "The first image shows a daytime scene while the second shows the same location at night."
}
]
}
```
### Multi-turn with images
```json theme={null}
{
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this kitchen."},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
]
},
{
"role": "assistant",
"content": "This is a modern open-plan kitchen with white cabinets and granite countertops."
},
{
"role": "user",
"content": [
{"type": "text", "text": "Now compare it with this living room."},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4BBB..."}}
]
},
{
"role": "assistant",
"content": "Both spaces share a modern aesthetic with clean lines and neutral colors."
}
]
}
```
### Image encoding requirements
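Images must be base64-encoded data URIs with a MIME type prefix. Raw HTTP URLs are **not** supported in training data. Of the two examples below, the first (a base64 data URI) is accepted and the second (a raw HTTP URL) is not.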
```json theme={null}
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
}
}
```
```json theme={null}
{
"type": "image_url",
"image_url": {
"url": "https://example.com/photo.jpg"
}
}
```
Supported image formats: **PNG**, **JPEG/JPG**.
If your dataset contains image URLs, download and convert them to base64 first. See the [conversion script in the managed VLM fine-tuning guide](/fine-tuning/fine-tuning-vlm#if-your-dataset-contains-image-urls).
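For images already on disk, the encoding step can be sketched in a few lines of Python. This is a minimal sketch, not the script from the linked guide: `to_data_uri` and `image_part` are hypothetical helper names, and the MIME type is inferred from the file extension.

```python
import base64
from pathlib import Path

# Only the formats listed above as supported.
MIME_BY_SUFFIX = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg"}


def to_data_uri(image_bytes: bytes, suffix: str) -> str:
    """Return a base64 data URI with a MIME type prefix (hypothetical helper)."""
    mime = MIME_BY_SUFFIX.get(suffix.lower())
    if mime is None:
        raise ValueError(f"unsupported image format: {suffix!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(path: str) -> dict:
    """Build an `image_url` content part from a local PNG or JPEG file."""
    p = Path(path)
    return {"type": "image_url", "image_url": {"url": to_data_uri(p.read_bytes(), p.suffix)}}
```

The dictionary returned by `image_part` can be placed directly in a message's `content` array next to a `text` part.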
## Cookbook: VLM SFT
The cookbook's `sft_loop` recipe works with vision datasets out of the box. Use a VLM training shape and a VLM tokenizer:
```python theme={null}
from training.recipes.sft_loop import Config, main
from training.utils import TrainerConfig
cfg = Config(
    log_path="./vlm_sft_logs",
    base_model="accounts/fireworks/models/qwen3-vl-8b-instruct",
    dataset="/path/to/vision_data.jsonl",
    tokenizer_model="Qwen/Qwen3-VL-8B-Instruct",
    max_seq_len=4096,
    epochs=1,
    batch_size=4,
    learning_rate=1e-5,
    trainer=TrainerConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-vl-8b-65k",
    ),
)
main(cfg)
```
The recipe handles vision-aware tokenization automatically — image tokens are assigned weight `0.0` (prompt) and text response tokens are assigned weight `1.0` (train).
## API-level: VLM training loop
For full control over the training loop, use the API directly with a VLM training shape. The workflow is the same as text-only training, but the tokenizer and shape are VLM-specific:
### 1. Create the managed VLM service
```python theme={null}
import os
from fireworks.training.sdk import FiretitanServiceClient
api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")
base_model = "accounts/fireworks/models/qwen3-vl-8b-instruct"
tokenizer_model = "Qwen/Qwen3-VL-8B-Instruct"
shape_id = "accounts/fireworks/trainingShapes/qwen3-vl-8b-65k"
service = FiretitanServiceClient.from_firetitan_config(
    api_key=api_key,
    base_url=base_url,
    base_model=base_model,
    tokenizer_model=tokenizer_model,
    lora_rank=0,
    training_shape_id=shape_id,
    learning_rate=1e-5,
    create_deployment=False,
    cleanup_trainer_on_close=True,
)
```
### 2. Connect and train
```python theme={null}
import torch
import tinker
import transformers
from tinker_cookbook.supervised.common import datum_from_model_input_weights
training_client = service.create_training_client(
    base_model=base_model, lora_rank=0,
)
processor = transformers.AutoProcessor.from_pretrained(
    tokenizer_model, trust_remote_code=True,
)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/..."}},
        ],
    },
    {
        "role": "assistant",
        "content": "The image shows a sunset over the ocean.",
    },
]
text = processor.apply_chat_template(conversation, tokenize=False)
full_tokens = processor.tokenizer.encode(text)
prompt_text = processor.apply_chat_template(conversation[:1], tokenize=False)
prompt_len = len(processor.tokenizer.encode(prompt_text))
weights = torch.zeros(len(full_tokens), dtype=torch.float32)
weights[prompt_len:] = 1.0
datum = datum_from_model_input_weights(
    tinker.ModelInput.from_ints(full_tokens),
    weights,
    max_length=4096,
)

def sft_loss(data, logprobs_list):
    total_loss = torch.tensor(0.0)
    n_tokens = 0
    for i, logprobs in enumerate(logprobs_list):
        w = torch.tensor(data[i].loss_fn_inputs["weights"].data, dtype=torch.float32)
        min_len = min(len(logprobs), len(w))
        total_loss = total_loss - torch.dot(logprobs[:min_len].float(), w[:min_len])
        n_tokens += w[:min_len].sum().item()
    return total_loss / max(n_tokens, 1), {"sft_loss": (total_loss / max(n_tokens, 1)).item()}

for step in range(100):
    training_client.forward_backward_custom([datum], sft_loss).result()
    training_client.optim_step(
        tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01)
    ).result()
```
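The `sft_loss` function above computes a weight-normalized negative log-likelihood: it sums `-logprob * weight` over tokens and divides by the total weight. The same arithmetic can be sketched without any dependencies on plain lists. The name `weighted_nll` and the sample numbers are illustrative only.

```python
def weighted_nll(logprobs: list[float], weights: list[float]) -> float:
    """Weight-normalized negative log-likelihood over one sequence."""
    n = min(len(logprobs), len(weights))  # mirror the min_len truncation above
    total = -sum(lp * w for lp, w in zip(logprobs[:n], weights[:n]))
    return total / max(sum(weights[:n]), 1)


# Two prompt tokens (weight 0.0) followed by two response tokens (weight 1.0).
loss = weighted_nll([-0.1, -0.2, -0.5, -1.5], [0.0, 0.0, 1.0, 1.0])
```

Because the prompt tokens carry weight `0.0`, only the two response tokens contribute to `loss`, which is why prompt and image tokens do not affect the update.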
### 3. Save and promote
Checkpointing and weight sync work identically to text-only training:
```python theme={null}
saved = training_client.save_weights_for_sampler(
    "vlm-final",
    checkpoint_type="base",
).result()
entry = next(
    row for row in service.list_checkpoints(service.trainer_job_id)
    if row["name"].endswith(f"/checkpoints/{saved.path}")
)
model = service.promote_checkpoint(
    name=entry["name"],
    output_model_id="my-vlm-model",
    base_model="accounts/fireworks/models/qwen3-vl-8b-instruct",
)
service.close()
```
## VLM DPO and RL
Vision inputs also work with DPO and RL training. The dataset format is the same — use multimodal `content` arrays in your messages:
### DPO with vision
```json theme={null}
{
"chosen": {
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this chart."},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
]
},
{"role": "assistant", "content": "This bar chart shows quarterly revenue growth of 15% year-over-year."}
]
},
"rejected": {
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this chart."},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
]
},
{"role": "assistant", "content": "This is a chart."}
]
}
}
```
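Because the chosen and rejected conversations share the same multimodal prompt, a preference record can be assembled from one prompt and two replies. The following is a minimal sketch that assumes the record layout shown in the DPO example above; `dpo_record` is a hypothetical helper, not part of the cookbook.

```python
import copy
import json


def dpo_record(prompt_messages: list[dict], chosen: str, rejected: str) -> dict:
    """Pair one multimodal prompt with a preferred and a rejected reply."""
    def conversation(reply: str) -> dict:
        # Deep-copy so the two branches never share mutable content lists.
        messages = copy.deepcopy(prompt_messages)
        messages.append({"role": "assistant", "content": reply})
        return {"messages": messages}

    return {"chosen": conversation(chosen), "rejected": conversation(rejected)}


prompt = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this chart."},
        # Truncated placeholder payload, as in the examples above.
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}},
    ],
}]
line = json.dumps(dpo_record(prompt, "A detailed answer.", "This is a chart."))
```

Each `line` is one JSONL row; writing one row per preference pair yields a dataset in the format above.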
### RL with vision prompts
```json theme={null}
{
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Solve the math problem shown in this image. Show your reasoning."},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
]
}
]
}
```
Use the corresponding cookbook recipes (`dpo_loop`, `rl_loop`) with a VLM training shape and tokenizer — the multimodal message handling is automatic.
## Available VLM training shapes
| Model | Shape ID | Context | GPUs |
| ----------- | --------------------------------------------------- | ------- | ---- |
| Qwen3 VL 8B | `accounts/fireworks/trainingShapes/qwen3-vl-8b-65k` | 65k | 4 |
See [Training Shapes](/fine-tuning/training-api/training-shapes#qwen3-vl) for the full list and details.
## Related guides
* [Training Shapes](/fine-tuning/training-api/training-shapes) — available VLM and text training shapes
* [Supervised Fine Tuning - Vision (Managed)](/fine-tuning/fine-tuning-vlm) — managed VLM fine-tuning without writing training loops
* [Querying Vision Language Models](/guides/querying-vision-language-models) — inference with VLMs
* [Cookbook SFT](/fine-tuning/training-api/cookbook/sft) — SFT recipe details
* [Loss Functions](/fine-tuning/training-api/loss-functions) — custom loss function patterns
# Training Prerequisites & Validation
Source: https://docs.fireworks.ai/fine-tuning/training-prerequisites
Requirements, validation checks, and common issues when launching RFT jobs
Before launching an RFT job using the [CLI](/fine-tuning/cli-reference) or [Web UI](/fine-tuning/web-ui-guide), ensure you meet these prerequisites and understand the validation process.
## Prerequisites
Before launching an RFT job, ensure you have the following set up. Our quickstart guides will walk you through this.
Your dataset must be in JSONL format with prompts (system and user messages). Each line represents one training example.
Upload via CLI:
```bash theme={null}
eval-protocol create dataset my-dataset --file dataset.jsonl
```
Or via the [Fireworks dashboard](https://app.fireworks.ai/dashboard/datasets).
Your reward function must be tested and uploaded. For local evaluators, upload via pytest:
```bash theme={null}
cd evaluator_directory
pytest my-evaluator-name.py -vs
```
The test automatically registers your evaluator with Fireworks. For remote environment testing, deploy your HTTP service first.
Set your API key as an environment variable:
```bash theme={null}
export FIREWORKS_API_KEY="fw_your_api_key_here"
```
Or store it in a `.env` file in your project directory.
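In Python tooling, the key can be read from the environment and checked up front rather than failing mid-run. This is a minimal sketch: `require_api_key` is a hypothetical helper, and it reads only the process environment (loading a `.env` file would need a separate loader).

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return FIREWORKS_API_KEY or raise with an actionable message."""
    key = env.get("FIREWORKS_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FIREWORKS_API_KEY is not set; export it or add it to .env")
    return key
```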
Choose a base model that supports fine-tuning. Popular options:
* `accounts/fireworks/models/llama-v3p1-8b-instruct` - Good balance of quality and speed
* `accounts/fireworks/models/qwen3-0p6b` - Fast training for experimentation
* `accounts/fireworks/models/llama-v3p1-70b-instruct` - Best quality, slower training
Check available models at [fireworks.ai/models](https://fireworks.ai/models).
## Job validation
Before starting training, Fireworks validates your configuration:
**Dataset**:
* ✅ Valid JSONL format
* ✅ Each line has `messages` array
* ✅ Messages have `role` and `content` fields
* ✅ File size within limits
* ❌ Missing fields → error with specific line numbers
* ❌ Invalid JSON → syntax error details
**Evaluator**:
* ✅ Evaluator code syntax is valid
* ✅ Required dependencies are available
* ✅ Entry point function exists
* ✅ Test runs completed successfully
* ❌ Import errors → missing dependencies
* ❌ Syntax errors → code issues
**Resources and account**:
* ✅ Sufficient GPU quota
* ✅ Base model supports fine-tuning
* ✅ Account has RFT permissions
* ❌ Insufficient quota → request increase
* ❌ Invalid model → choose different base model
**Training parameters**:
* ✅ Parameters within valid ranges
* ✅ Compatible parameter combinations
* ❌ Invalid ranges → error with allowed values
* ❌ Conflicting options → resolution guidance
If validation fails, you'll receive specific error messages with instructions to fix the issues.
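The dataset format checks can be approximated locally before uploading. This is a rough sketch of the format rules listed above, not the validator Fireworks runs; `validate_jsonl_lines` is a hypothetical name and the error wording is illustrative.

```python
import json


def validate_jsonl_lines(lines: list[str]) -> list[str]:
    """Return human-readable problems, one per offending line (1-indexed)."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        messages = row.get("messages") if isinstance(row, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {lineno}: missing 'messages' array")
            continue
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
                problems.append(f"line {lineno}: message {i} needs 'role' and 'content'")
    return problems
```

For a file on disk, pass `open("dataset.jsonl").read().splitlines()`; an empty result means the format rules sketched here were satisfied.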
## Common errors and fixes
**Error**: `Dataset validation failed: invalid JSON on line 42`
**Fix**:
1. Open your JSONL file
2. Check line 42 for JSON syntax errors
3. Common issues: missing quotes, trailing commas, unescaped characters
4. Validate JSON at jsonlint.com
**Error**: `Missing required field 'messages'`
**Fix**: Each dataset row must have a `messages` array:
```json theme={null}
{"messages": [{"role": "user", "content": "..."}]}
```
**Error**: `Evaluator 'my-evaluator' not found in account`
**Fix**:
1. Upload your evaluator first:
```bash theme={null}
cd evaluator_directory
pytest my-evaluator-name.py -vs
```
2. Or specify evaluator ID if using UI:
* Check [Evaluators dashboard](https://app.fireworks.ai/dashboard/evaluators)
* Copy exact evaluator ID
**Error**: `Insufficient GPU quota for this job`
**Fix**:
1. Check your current quota at [Account Settings](https://app.fireworks.ai/account/settings)
2. Request a quota increase through the dashboard
3. Or choose a smaller base model to reduce GPU requirements
**Error**: `Learning rate 1e-2 outside valid range [1e-5, 5e-4]`
**Fix**: Adjust the parameter to be within the allowed range:
```bash theme={null}
--learning-rate 1e-4 # Use default value
```
See [CLI Reference](tools-sdks/firectl/commands/reinforcement-fine-tuning-job-create) for all available parameters.
**Error**: `Evaluator build timed out after 10 minutes`
**Fix**:
1. Check build logs in [Evaluators dashboard](https://app.fireworks.ai/dashboard/evaluators)
2. Common issues:
* Large dependencies taking too long to install
* Network issues downloading packages
* Syntax errors in requirements.txt
3. Wait for build to complete, then retry launching your job
4. Consider splitting large dependencies or using lighter alternatives
## What happens after launching
Once your job is created, here's what happens:
Your job enters the queue and waits for available GPU resources. Queue time depends on current demand.
**Status**: `PENDING`
Fireworks validates your dataset to ensure it meets format requirements and quality standards. This typically takes 1-2 minutes.
**Status**: `VALIDATING`
The system begins generating rollouts, evaluating them, and updating model weights. You'll see:
* Rollout generation and evaluation
* Reward curves updating in real-time
* Training loss decreasing
**Status**: `RUNNING`
Track training via the dashboard. See [Monitor Training](/fine-tuning/monitor-training) for details on interpreting metrics and debugging issues.
**Status**: `RUNNING` → `COMPLETED`
When training finishes, your fine-tuned model is ready for deployment.
**Status**: `COMPLETED`
Next: [Deploy your model](/fine-tuning/deploying-loras) for inference.
## Next steps
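* [CLI](/fine-tuning/cli-reference): use the eval-protocol CLI for fast, scriptable launches
* [Web UI](/fine-tuning/web-ui-guide): use the dashboard for visual, guided job creation
* [Monitor Training](/fine-tuning/monitor-training): track job progress, inspect rollouts, and debug issues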
# Using Secrets
Source: https://docs.fireworks.ai/fine-tuning/using-secret-in-evaluator
Learn how to create secrets that can be utilized within your reward function.
# Creating Secrets
All secrets created here will be injected as environment variables for your Evaluator to access.
And that's it! If you want to learn more about creating evaluators, see:
1. Learn about [Evaluation](/fine-tuning/evaluators) and [Eval Protocol](https://evalprotocol.io/introduction) for evaluator authoring
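Inside the evaluator, an injected secret is read like any other environment variable. The following is a minimal sketch; the secret name `MY_SERVICE_API_KEY` and the helper `get_secret` are hypothetical, so substitute the name you gave your secret.

```python
import os


def get_secret(name: str) -> str:
    """Fetch an injected secret, failing loudly if it is absent."""
    value = os.environ.get(name)
    if value is None:
        raise KeyError(f"secret {name!r} is not set in the evaluator environment")
    return value


# Hypothetical usage inside a reward function:
# api_key = get_secret("MY_SERVICE_API_KEY")
```

Failing early with a clear message makes a missing secret easier to diagnose than an authentication error from a downstream service.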
# Warm Start from Fine-Tuned Models
Source: https://docs.fireworks.ai/fine-tuning/warm-start
Continue training from a previously fine-tuned model with RFT
Fireworks supports RFT on models that have already been fine-tuned. Upload the model to Fireworks and use the warm start option to continue training with RFT (for example, from an SFT LoRA) rather than starting from scratch with a base model.
## When to use warm start
Use the `--warm-start-from` flag when you want to:
* Start RFT from an SFT model you've trained with Fireworks
* Continue training from an existing fine-tuned LoRA adapter you've uploaded to Fireworks
## Basic usage
```bash theme={null}
eval-protocol create rft \
--warm-start-from accounts/your-account/models/ \
--output-model
```
When using `--warm-start-from`, do NOT include `--base-model`. The base model is automatically determined from the LoRA adapter.
```bash theme={null}
# Wrong, includes --base-model
eval-protocol create rft \
--base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
--warm-start-from accounts/your-account/models/
```
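If you script job launches, this mutual exclusivity can be enforced before the CLI is invoked. The sketch below only assembles an argument list from the flags shown above and does not run anything. `rft_args` is a hypothetical helper, and requiring exactly one starting point is an assumption for the example.

```python
from typing import Optional


def rft_args(output_model: str,
             base_model: Optional[str] = None,
             warm_start_from: Optional[str] = None) -> list[str]:
    """Build `eval-protocol create rft` arguments with exactly one starting point."""
    if (base_model is None) == (warm_start_from is None):
        raise ValueError("pass exactly one of base_model or warm_start_from")
    args = ["eval-protocol", "create", "rft", "--output-model", output_model]
    if warm_start_from is not None:
        args += ["--warm-start-from", warm_start_from]
    else:
        args += ["--base-model", base_model]
    return args
```

The resulting list can be handed to `subprocess.run` once the model identifiers have been filled in.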
## SFT to RFT workflow
Get started with supervised fine-tuning on Fireworks:
```bash theme={null}
firectl sftj create \
--base-model accounts/fireworks/models/ \
--dataset accounts/your-account/datasets/ \
--output-model
```
Or if you already have a LoRA adapter, upload it to Fireworks:
```bash theme={null}
firectl model create /path/to/files/ \
--base-model "accounts/fireworks/models/"
```
Learn more about uploading custom LoRA adapters in the [Custom Models guide](/models/uploading-custom-models#importing-fine-tuned-models).
Use an existing model as the starting point and combine it with standard RFT parameters.
```bash theme={null}
eval-protocol create rft \
--warm-start-from accounts/your-account/models/ \
--output-model \
--epochs 2 \
--learning-rate 5e-5 \
--temperature 0.8
```
## Troubleshooting
This means you specified both `--base-model` and `--warm-start-from`. Remove the `--base-model` flag.
Verify the model exists in your account:
```bash theme={null}
firectl model list --account accounts/your-account
```
# Training Guide: UI
Source: https://docs.fireworks.ai/fine-tuning/web-ui-guide
Launch RFT jobs using the Fireworks dashboard
**Reinforcement Fine-Tuning (RFT)** is free for models under 16B parameters. When creating an RFT job in the UI, filter for free tuning models in the model selection area on the [fine-tuning creation page](https://app.fireworks.ai/dashboard/fine-tuning/create). If kicking off jobs from the terminal, you can find the model ID from the [Model Library](https://app.fireworks.ai/models?filter=LLM\&tunable=true). Note: SFT and DPO jobs are billed per training token for all model sizes—see the [pricing page](https://fireworks.ai/pricing) for details.
The Fireworks RFT UI provides a visual interface for creating RFT jobs, with guided parameter selection. Results for all jobs can also be found in the UI.
## When to use Web UI
Start with the UI to learn the options, then switch to [CLI](/fine-tuning/cli-reference) for faster iteration and automation. Remember, your results will always live in the UI.
| Feature | CLI (eval-protocol) | Web UI |
| ----------------------- | ----------------------------- | ----------------------------- |
| **Best for** | Experienced users, automation | First-time users, exploration |
| **Parameter discovery** | Need to know flag names | Guided with tooltips |
| **Speed** | Fast - single command | Slower - multiple steps |
| **Automation** | Easy to script and reproduce | Manual process |
| **Batch operations** | Easy to launch multiple jobs | One at a time |
| **Reproducibility** | Excellent - save commands | Manual tracking needed |
## Launch training via Web UI
1. Go to [Fireworks Dashboard](https://app.fireworks.ai)
2. Click **Fine-Tuning** in the left sidebar
3. Click **Fine-tune a Model**
1. Choose **Reinforcement** as the tuning method
2. Select your base model from the dropdown
The UI shows only models that support fine-tuning. Popular choices appear at the top.
Not sure which model to choose? Start with `llama-v3p1-8b-instruct` for a good balance of quality and speed.
1. **Upload new dataset** or **select existing** from your account
2. Preview dataset entries to verify format
3. The UI validates your JSONL format automatically
Each dataset row should have a `messages` array:
```json theme={null}
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 25 * 4?"}
]
}
```
1. Choose from your uploaded evaluators
2. Preview evaluator code and test results
3. View recent evaluation metrics
If you haven't uploaded an evaluator yet, you'll need to do that first via CLI:
```bash theme={null}
pytest my-evaluator-name.py -vs
```
For remote evaluators, you'll enter your server URL in the environment configuration section.
Configure how the model learns:
**Core parameters**:
* **Output model name**: Custom name for your fine-tuned model
* **Epochs**: Number of passes through the dataset (start with 1)
* **Learning rate**: How fast the model updates (use default 1e-4)
* **LoRA rank**: Model capacity (8-16 for most tasks)
* **Batch size**: Training throughput (use default 32k tokens)
The UI shows helpful tooltips for each parameter. See [Parameter Tuning](/fine-tuning/parameter-tuning) for detailed guidance.
Control how the model generates responses during training:
* **Temperature**: Sampling randomness (0.7 for balanced exploration)
* **Top-p**: Probability mass cutoff (0.9-1.0)
* **Top-k**: Token candidate limit (40 is standard)
* **Number of rollouts (n)**: Responses per prompt (4-8 recommended)
* **Max tokens**: Maximum response length (2048 default)
Higher temperature and more rollouts increase exploration but also cost.
1. Review all settings in the summary panel
2. See estimated training time and cost
3. Click **Start Fine-Tuning** to launch
The dashboard will redirect you to the job monitoring page where you can track progress in real-time.
## Next steps
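* [Training Prerequisites](/fine-tuning/training-prerequisites): review requirements, validation, and common errors
* [Monitor Training](/fine-tuning/monitor-training): track job progress, inspect rollouts, and debug issues
* [Parameter Tuning](/fine-tuning/parameter-tuning): learn how to adjust parameters for better results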
# Weighted Training
Source: https://docs.fireworks.ai/fine-tuning/weighted-training
Control which samples have greater influence during RFT training
Weighted training lets you assign different importance levels to samples in your dataset, giving you control over how your model learns. This is useful when some examples are higher quality, more representative, or more important to your use case than others.
## How it works
During training, each sample's loss is multiplied by its weight before being used to update model parameters. Higher weights mean stronger learning signals—the model pays more attention to these examples. Lower weights (including negative) reduce or reverse a sample's influence.
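As a worked illustration of this scaling, the sketch below multiplies per-sample losses by their weights. The loss values are made up, and `scale_losses` is a hypothetical name; how the scaled losses are then aggregated across a batch is not specified here.

```python
def scale_losses(losses: list[float], weights: list[float]) -> list[float]:
    """Scale each per-sample loss by its sample weight."""
    if len(losses) != len(weights):
        raise ValueError("need exactly one weight per sample")
    return [loss * w for loss, w in zip(losses, weights)]


# Four samples with the same raw loss but different weights.
contributions = scale_losses([1.0, 1.0, 1.0, 1.0], [2.0, 1.0, 0.0, -1.0])
# contributions == [2.0, 1.0, 0.0, -1.0]: doubled, unchanged, ignored, reversed
```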
## Dataset format
Add a `weight` field at the root level of each JSON object in your JSONL dataset:
```json theme={null}
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2 + 2?"},
{"role": "assistant", "content": "4"}
],
"weight": 2.0
}
```
### Weight values
The `weight` field accepts any floating-point number:
| Weight | Effect |
| ----------- | ------------------------------------------------------- |
| `> 1.0` | Increased importance—model learns more from this sample |
| `1.0` | Default behavior (same as omitting weight) |
| `0.0 - 1.0` | Reduced importance—sample has less influence |
| `0.0` | Sample is effectively ignored during training |
| `< 0.0` | Negative weight—reverses the learning signal |
If you use weights, the `weight` field must be present in **all samples** in your dataset. Mixing weighted and unweighted samples is not supported.
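The all-or-nothing rule is easy to check before uploading. This is a minimal sketch, not the platform's validator: `check_weights` is a hypothetical helper, and rejecting non-finite or non-numeric values is an extra assumption beyond what is stated above.

```python
import math


def check_weights(rows: list[dict]) -> str:
    """Return "weighted" or "unweighted"; raise if the dataset mixes both."""
    missing = [i for i, row in enumerate(rows, start=1) if "weight" not in row]
    if len(missing) == len(rows):
        return "unweighted"
    if missing:
        raise ValueError(f"rows missing 'weight': {missing}")
    for i, row in enumerate(rows, start=1):
        w = row["weight"]
        # Assumption: reject booleans, non-numbers, NaN, and infinities.
        if isinstance(w, bool) or not isinstance(w, (int, float)) or not math.isfinite(w):
            raise ValueError(f"row {i}: weight must be a finite number, got {w!r}")
    return "weighted"
```

Row numbers in the error message are 1-indexed so they match line numbers in the JSONL file.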
## Use cases
### Upweight high-quality examples
When you have samples of varying quality, give more weight to your best examples:
```json theme={null}
{"messages": [...], "weight": 2.0}
{"messages": [...], "weight": 1.0}
{"messages": [...], "weight": 0.5}
```
### Balance dataset distribution
If certain prompt types are underrepresented, upweight them to ensure the model learns them well:
```json theme={null}
{"messages": [...], "weight": 3.0}
{"messages": [...], "weight": 1.0}
```
### De-emphasize noisy samples
If you have samples that may contain noise but can't easily be filtered, reduce their weight:
```json theme={null}
{"messages": [...], "weight": 0.3}
```
## Message filtering
For multi-turn conversations, you can also control which assistant messages to include in training by adding a `weight` field to individual messages. This uses a binary format following the [OpenAI fine-tuning specification](https://platform.openai.com/docs/api-reference/fine-tuning/chat-input#fine_tuning-chat_input-messages-assistant_message-weight).
```json theme={null}
{
"messages": [
{"role": "user", "content": "What's the capital of France?"},
{"role": "assistant", "content": "Paris.", "weight": 1},
{"role": "user", "content": "What about Germany?"},
{"role": "assistant", "content": "Berlin.", "weight": 0}
]
}
```
Message-level weights only accept `0` or `1`:
* `1`: Include this assistant message in training (default)
* `0`: Exclude this assistant message from training
Message-level weights are for **filtering** which turns to train on, not for adjusting training influence. Use sample-level weights (at the root of the JSON object) for weighted importance.
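To preview which turns a conversation would contribute, the filtering rule can be sketched as follows. `trained_assistant_turns` is a hypothetical helper that applies only the `0`/`1` semantics described above, with an omitted `weight` treated as `1`.

```python
def trained_assistant_turns(messages: list[dict]) -> list[str]:
    """Return the contents of assistant messages that would be trained on."""
    kept = []
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        weight = msg.get("weight", 1)  # omitted weight defaults to 1 (include)
        if weight not in (0, 1):
            raise ValueError(f"message-level weight must be 0 or 1, got {weight!r}")
        if weight == 1:
            kept.append(msg["content"])
    return kept


conversation = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Paris.", "weight": 1},
    {"role": "user", "content": "What about Germany?"},
    {"role": "assistant", "content": "Berlin.", "weight": 0},
]
kept = trained_assistant_turns(conversation)  # ["Paris."]
```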
## Example dataset
Here's a complete example of a weighted RFT dataset:
```jsonl dataset.jsonl theme={null}
{"messages": [{"role": "system", "content": "You are a math tutor."}, {"role": "user", "content": "What is 15 * 3?"}, {"role": "assistant", "content": "15 * 3 = 45"}], "weight": 1.0}
{"messages": [{"role": "system", "content": "You are a math tutor."}, {"role": "user", "content": "Solve: 2x + 5 = 15"}, {"role": "assistant", "content": "x = 5"}], "weight": 1.5}
{"messages": [{"role": "system", "content": "You are a math tutor."}, {"role": "user", "content": "Integrate x^2 dx"}, {"role": "assistant", "content": "(x^3)/3 + C"}], "weight": 2.0}
```
This dataset upweights more complex math problems, so the model focuses more on calculus than basic arithmetic.
# Fire Pass Setup
Source: https://docs.fireworks.ai/firepass
Kimi K2.6 Turbo for personal agentic coding — Fire Pass (Early Access), $49 / month
Fire Pass (Early Access) is a monthly pass that gives you access to **Kimi K2.6 Turbo** for use in personal agentic coding harnesses like [Claude Code](/ecosystem/integrations/claude-code), [OpenCode](/ecosystem/integrations/opencode), Cline, Kilo Code, OpenClaw, and [LangChain Deep Agents](https://docs.langchain.com/oss/python/deepagents/models) — with no per-token charges for that model.
**Fire Pass (Early Access)** is an experimental product. Features, availability, and
pricing are subject to change.
## What's Included
With an active Fire Pass you get:
* **Kimi K2.6 Turbo** requests with zero per-token costs (a powerful reasoning model with a 256k context window, optimized for complex coding tasks)
* A new, dedicated Fire Pass API key that works only for Kimi K2.6 Turbo. Use your normal Fireworks key for all other models.
* Full compatibility with OpenAI- and Anthropic-compatible agentic coding tools
Only **Kimi K2.6 Turbo** is covered by Fire Pass. Usage of regular Kimi K2.6,
or any other model on Fireworks, will continue to incur standard per-token
charges.
## Pricing
| | |
| ------------------ | ------------------------------------------- |
| **Cost** | \$49 per month |
| **Payment Method** | Billed directly to your credit card on file |
| **Auto-renew**     | On by default                               |
## Getting Started
If you haven't already, sign up for a [Fireworks
account](https://app.fireworks.ai).
Go to the [**Billing** page](https://app.fireworks.ai/account/billing),
review the Fire Pass details, and complete your purchase. Your pass
activates immediately and is valid for one month at \$49 per month.
Create a new dedicated Fire Pass API key and use it with any of the
supported agentic harnesses below. Requests to Kimi K2.6 Turbo will no
longer incur per-token charges while your pass is active.
## Using your Fire Pass
Fire Pass is designed for use with personal agentic coding harnesses. Below are setup guides for each supported tool.
### General Configuration
If your tool supports custom API endpoints, you can configure it manually using these details:
* **Model ID**: `accounts/fireworks/routers/kimi-k2p6-turbo`
* **API Key**: A dedicated Fire Pass API key (generate one at [app.fireworks.ai/api-keys](https://app.fireworks.ai/api-keys))
* **Base URL (OpenAI-compatible)**: `https://api.fireworks.ai/inference/v1`
* **Base URL (Anthropic-compatible)**: `https://api.fireworks.ai/inference` (some tools may expect `/v1/messages` appended)
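To check these values outside a coding tool, an OpenAI-compatible request can be assembled with the Python standard library. This is a sketch under assumptions: it appends the standard OpenAI `/chat/completions` path to the base URL above, `build_chat_request` is a hypothetical helper, and the request is prepared but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.fireworks.ai/inference/v1"
MODEL_ID = "accounts/fireworks/routers/kimi-k2p6-turbo"


def build_chat_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Prepare (but do not send) an OpenAI-compatible chat completion request."""
    payload = {"model": MODEL_ID, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


# To send it with your Fire Pass key:
# urllib.request.urlopen(build_chat_request(fire_pass_key, "Hello"))
```

Keeping the model ID and base URL as constants makes it harder to send a request to a model that the pass does not cover.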
1. **Configure Kimi K2.6 as the default model** — run the setup script with your Fireworks API key from the [**API Keys** page](https://app.fireworks.ai/api-keys). Replace `YOUR_FIREWORKS_API_KEY` with your key:
```bash theme={null}
curl -fsSL https://storage.googleapis.com/fireworks-public/openclaw/setup-fireworks.sh | bash -s -- YOUR_FIREWORKS_API_KEY
```
This creates `~/.openclaw/openclaw.json` with your Fireworks credentials.
Do not share your API key or commit it to version control. Use the
placeholder above only as a pattern — substitute your real key locally.
2. **Install OpenClaw**
```bash theme={null}
curl -fsSL https://openclaw.ai/install.sh | bash
```
```powershell theme={null}
iwr -useb https://openclaw.ai/install.ps1 | iex
```
3. **Run the onboarding wizard** — in a terminal:
```bash theme={null}
openclaw onboard --install-daemon
```
Choose **Quickstart** mode, skip model and authentication (already configured), and accept the remaining defaults.
4. **Open the chat UI** — run:
```bash theme={null}
openclaw dashboard
```
The UI opens at [http://127.0.0.1:18789](http://127.0.0.1:18789). Send a message to confirm **Kimi K2.6** is responding.
For more about OpenClaw, see [openclaw.ai](https://openclaw.ai).
OpenCode setup lives on the [OpenCode integration guide](/ecosystem/integrations/opencode):
* [FireConnect setup](/ecosystem/integrations/opencode#enable-fireworks-routing) (recommended)
* [Using Fire Pass with OpenCode](/ecosystem/integrations/opencode#using-fire-pass)
* [Built-in provider connection](/ecosystem/integrations/opencode#built-in-provider-connection) (`/connect` in OpenCode)
To configure Cline with your Fire Pass:
1. Install the Cline extension in VS Code from [cline.bot](https://cline.bot) or the VS Code marketplace
2. Open VS Code settings and search for "Cline" or use the Cline panel
3. Configure the following settings:
**API Configuration:**
* **API Provider**: OpenAI Compatible
* **Base URL**: `https://api.fireworks.ai/inference/v1`
* **OpenAI Compatible API Key**: Your Fireworks API key (from [app.fireworks.ai/api-keys](https://app.fireworks.ai/api-keys))
* **Model ID**: `accounts/fireworks/routers/kimi-k2p6-turbo`
**Model Configuration:**
* **Supports Images**: Enabled (checked)
* **Context Window Size**: 256000
* **Max Output Tokens**: 256000
* **Input Price / 1M tokens**: 0 (Fire Pass covers all costs)
* **Output Price / 1M tokens**: 0 (Fire Pass covers all costs)
* **Temperature**: 1
4. Save your settings and start using Cline — your Fire Pass will automatically apply to requests to the turbo router
All Claude Code setup lives on the [Claude Code integration guide](/ecosystem/integrations/claude-code):
* [Install](/ecosystem/integrations/claude-code#install) (recommended)
* [Using Fire Pass with Claude Code](/ecosystem/integrations/claude-code#using-fire-pass)
* [Manual setup](/ecosystem/integrations/claude-code#manual-setup)
To configure Kilo Code with your Fire Pass:
1. Install the Kilo Code extension in VS Code from the [VS Code marketplace](https://marketplace.visualstudio.com/items?itemName=kilocode.Kilo-Code) or [kilocode.ai](https://kilocode.ai)
2. Open Kilo Code and select **Bring my own Key** on the "How would you like to get started?" screen
3. Configure the following settings:
**API Configuration:**
* **API Provider**: OpenAI Compatible
* **Base URL**: `https://api.fireworks.ai/inference/v1`
* **API Key**: Your Fireworks API key (from [app.fireworks.ai/api-keys](https://app.fireworks.ai/api-keys))
* **Model**: `accounts/fireworks/routers/kimi-k2p6-turbo`
* Type or paste the router ID and select "Use custom" when it appears
**Model Configuration:**
* **Context Window Size**: 256000 (256K tokens)
* **Supports images**: Enabled (checked)
* **Input Price**: 0 (Fire Pass covers all costs)
* **Output Price**: 0 (Fire Pass covers all costs)
4. Save your settings and start using Kilo Code — your Fire Pass will automatically apply to requests to the turbo router
Set `FIREWORKS_API_KEY` to your API key from [app.fireworks.ai/api-keys](https://app.fireworks.ai/api-keys). Install [LangChain Deep Agents](https://docs.langchain.com/oss/python/deepagents/models) with the Fireworks integration, then pass the turbo router as `fireworks:accounts/fireworks/routers/kimi-k2p6-turbo`:
```bash theme={null}
pip install deepagents langchain-fireworks
```
```python theme={null}
from deepagents import create_deep_agent
agent = create_deep_agent(
model="fireworks:accounts/fireworks/routers/kimi-k2p6-turbo",
system_prompt="You are a helpful coding assistant.",
)
result = agent.invoke(
{"messages": [{"role": "user", "content": "Hello"}]},
)
```
See the LangChain docs for [models](https://docs.langchain.com/oss/python/deepagents/models) and [customization](https://docs.langchain.com/oss/python/deepagents/customization).
## Terms of Use
Fire Pass is intended for **personal agentic coding use only**. By purchasing Fire Pass you agree to the following:
* **Allowed**: Personal development, experimentation, and coding with agentic harnesses (Claude Code, OpenCode, Cline, Kilo Code, OpenClaw, LangChain Deep Agents, and similar tools)
* **Prohibited**: Production workloads, team or shared usage, and any use that violates the [Fireworks Terms of Service](https://fireworks.ai/terms-of-service)
Violations of these terms may result in pass revocation without refund.
## FAQ
No. Fire Pass covers **Kimi K2.6 Turbo** only. All other models,
including regular Kimi K2.6, are billed at standard per-token rates.
When your pass expires, Kimi K2.6 Turbo requests will be billed at standard
per-token rates. If you have auto-renew enabled and a pass is available, it
will renew automatically.
No. Auto-renewal is best-effort and your renewal may not go through.
No. Fire Pass is for personal use only. Team or shared usage is
prohibited under the Terms of Service.
Fire Pass is currently by invite only.
Yes. You must use the specific router ID for the turbo model:
`accounts/fireworks/routers/kimi-k2p6-turbo`. The Fireworks billing system
will automatically detect your active Fire Pass and zero out the cost
for requests to this endpoint.
Your usage of Kimi K2.6 Turbo will still be logged in your dashboard so you
can track your request volume, but the associated cost will show as \$0.00
while your pass is active. You can also view your pass's active status and
expiration date directly on the [**Billing**
page](https://app.fireworks.ai/account/billing).
You can turn off auto-renewal at any time from the Fire Pass page in your
Fireworks dashboard. Your pass will remain active until the end of the
current billing month.
Yes. You can use any Fireworks model with your agentic harness. However,
only **Kimi K2.6 Turbo** requests are covered by Fire Pass. Usage
of other models will be billed to your account at standard per-token rates.
# Concepts
Source: https://docs.fireworks.ai/getting-started/concepts
This document outlines basic Fireworks AI concepts.
This page outlines core Fireworks resources and ideas. For definitions of technical terms (inference, LoRA, token, and many others), see the [**Glossary**](/getting-started/glossary).
## Resources
### Account
Your account is the top-level resource under which other resources are located. Quotas and billing are enforced at the account level, so usage by all users in an account contributes to the same quotas and bill.
* For developer accounts, the account ID is auto-generated from the email address used to sign up.
* Enterprise accounts can optionally choose a custom, unique account ID.
### User
A user is an email address associated with an account. Each user is assigned a role (such as Admin, User, Contributor, or Inference User) that determines their level of access to resources within the account.
### Models and model types
A model is a set of model weights and metadata associated with the model. Each model has a [**globally unique name**](/getting-started/concepts#resource-names-and-ids) of the form `accounts//models/`. There are two types of models:
**Base models:** A base model consists of the full set of model weights, including models pre-trained from scratch and full fine-tunes.
* Fireworks has a library of common base models that can be used for [**serverless inference**](/models/overview#serverless-inference) as well as [**dedicated deployments**](/models/overview#dedicated-deployments). Model IDs for these models are pre-populated. For example, `llama-v3p1-70b-instruct` is the model ID for the Llama 3.1 70B model that Fireworks provides. The ID for each model can be found on its page ([**example**](https://app.fireworks.ai/models/fireworks/qwen3-coder-480b-a35b-instruct))
* Users can also [upload their own](/models/uploading-custom-models) custom base models and specify model IDs.
**LoRA (low-rank adaptation) addons:** A LoRA addon is a small, fine-tuned model that requires significantly less memory to deploy than a fully fine-tuned model. Fireworks supports [**training**](/fine-tuning/finetuning-intro), [**uploading**](/models/uploading-custom-models#importing-fine-tuned-models), and [**serving**](/fine-tuning/fine-tuning-models#deploying-a-fine-tuned-model) LoRA addons. A LoRA addon must be deployed on a dedicated deployment for its corresponding base model. Model IDs for LoRAs can be either auto-generated or user-specified.
#### A Note on API Model Metadata
When retrieving model details via the API, a model may be listed with both `supportsServerless: true` and `supportsLora: true`. This indicates that the base model is available for serverless inference, AND that the model architecture supports fine-tuning with LoRA.
However, these two features are mutually exclusive in deployment. The `supportsServerless` flag applies **only** to the base model. A LoRA addon fine-tuned from that base model **cannot** be deployed serverlessly and requires a dedicated (on-demand) deployment.
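The rule above can be expressed as a small check. This is a minimal sketch, not part of any Fireworks SDK: it assumes the model metadata has already been fetched into a dictionary that carries the `supportsServerless` flag, and the `is_lora_addon` argument is a hypothetical input that the caller supplies.

```python
def can_deploy_serverless(model_metadata: dict, is_lora_addon: bool) -> bool:
    """Apply the rule above: only a base model can be served serverlessly.

    `is_lora_addon` is a hypothetical flag supplied by the caller; it is not
    a documented field of the API response.
    """
    if is_lora_addon:
        # A LoRA addon always needs a dedicated deployment, whatever the
        # flags on its base model say.
        return False
    return bool(model_metadata.get("supportsServerless", False))


if __name__ == "__main__":
    base = {"supportsServerless": True, "supportsLora": True}
    print(can_deploy_serverless(base, is_lora_addon=False))
    print(can_deploy_serverless(base, is_lora_addon=True))
```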
### Deployments and deployment types
A model must be deployed before it can be used for inference. A deployment is a collection of one or more model servers that host one base model and optionally one or more LoRA addons.
Fireworks supports two types of deployments:
* **Serverless deployments:** Fireworks hosts popular base models on shared "serverless" deployments. Users pay per token to query these models and do not need to configure GPUs. See our [Quickstart - Serverless](/getting-started/quickstart) guide to get started.
* **Dedicated deployments:** Dedicated deployments enable users to configure private deployments with a wide array of hardware (see the [on-demand deployments guide](/guides/ondemand-deployments)). Dedicated deployments give users performance guarantees and the most flexibility and control over what models can be deployed. Both LoRA addons and base models can be deployed to dedicated deployments. Dedicated deployments are billed on a GPU-second basis (see the [**pricing**](https://fireworks.ai/pricing#ondemand) page).
See the [**Querying text models guide**](/guides/querying-text-models) for a comprehensive overview of running LLM inference.
### Deployed model
Users can specify a model to query for inference using the model name and deployment name. Alternatively, users can refer to a "deployed model" name that refers to a unique instance of a base model or LoRA addon that is loaded into a deployment. See the [On-demand deployments](/guides/ondemand-deployments) guide for more.
### Dataset
A dataset is an immutable set of training examples that can be used to fine-tune a model.
### Fine-tuning job
A fine-tuning job is an offline training job that uses a dataset to train a LoRA addon model.
## Resource names and IDs
A resource name is a globally unique identifier of a resource. The format of a name also identifies the type and hierarchy of the resource, for example:
Resource IDs must satisfy the following constraints:
* Between 1 and 63 characters (inclusive)
* Consists of a-z, 0-9, and hyphen (-)
* Does not begin or end with a hyphen (-)
* Does not begin with a digit
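The constraints above can be checked with a single regular expression. The following is a minimal sketch; the function name `is_valid_resource_id` is illustrative and not part of any Fireworks tool.

```python
import re

# Pattern derived from the constraints listed above:
# - first character is a-z (so it is neither a digit nor a hyphen)
# - up to 61 middle characters from a-z, 0-9, and "-"
# - last character is a-z or 0-9 (so it is not a hyphen)
# Total length is therefore between 1 and 63 characters.
_RESOURCE_ID_RE = re.compile(r"[a-z](?:[a-z0-9-]{0,61}[a-z0-9])?")


def is_valid_resource_id(resource_id: str) -> bool:
    """Return True if resource_id satisfies the documented constraints."""
    return _RESOURCE_ID_RE.fullmatch(resource_id) is not None


if __name__ == "__main__":
    for candidate in ["my-model", "-bad", "bad-", "1bad", "Bad"]:
        print(candidate, is_valid_resource_id(candidate))
```

Validating an ID locally before creating a resource avoids a round trip to the API for names that would be rejected anyway.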
## Control plane and data plane
The Fireworks API can be split into a control plane and a data plane.
* The **control plane** consists of APIs used for managing the lifecycle of resources. This
includes your account, models, and deployments.
* The **data plane** consists of the APIs used for inference and the backend services that power
them.
## Interfaces
Users can interact with Fireworks through one of many interfaces:
* The **web app** at [https://app.fireworks.ai](https://app.fireworks.ai)
* The [`firectl`](/tools-sdks/firectl/firectl) CLI
* [OpenAI compatible API](/tools-sdks/openai-compatibility)
* [Anthropic compatible API](/tools-sdks/anthropic-compatibility)
* [Python SDK](/tools-sdks/python-sdk)
# Glossary
Source: https://docs.fireworks.ai/getting-started/glossary
Definitions for key terms used across Fireworks AI documentation.
This glossary covers terms you'll encounter when working with the Fireworks AI platform — across inference, fine-tuning, deployments, security, and the API.
***
## Account & Billing
**Account**\
Your Fireworks AI organization identity. All deployments, models, fine-tuning jobs, and API keys belong to an account. Referenced as `accounts/` in firectl and the API.
**Credit**\
The unit used for prepaid usage on Fireworks. Credits are consumed as you use inference and training resources.
**Rate limiting / 429 errors**\
A `429 Too Many Requests` response means your account or deployment has hit a concurrency or request-per-minute limit. On serverless, limits scale automatically with usage tier. On dedicated deployments, adding replicas or adjusting autoscaling increases capacity.
**Quota**\
A per-account cap on a resource — such as requests per minute (RPM), GPU hours, or concurrent replicas. Quotas can be increased by contacting support.
**ZDR (Zero Data Retention)**\
The default behavior for all open model inference on Fireworks: no prompts, completions, or request logs are stored after the response is returned. ZDR is on by default — it is not an opt-in feature.
**BYOC (Bring Your Own Cloud)**\
An enterprise option to run inference entirely within your own cloud account. Your data never touches Fireworks-managed infrastructure. Contact sales for availability.
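For the rate limiting entry above, a common client-side response to a `429` is to retry with exponential backoff. The sketch below is illustrative only: `RateLimitError` is a hypothetical stand-in for whatever exception your HTTP client or SDK raises on a `429`, and the retry count and delays are arbitrary starting points rather than Fireworks recommendations.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on a 429."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries; surface the error to the caller
            # The delay doubles on each attempt (1x, 2x, 4x, ...), with up
            # to 10% random jitter so parallel clients do not retry in
            # lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

Backoff only smooths over short bursts. If requests are throttled persistently, the options named in the entry apply: a higher quota, more replicas, or adjusted autoscaling.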
***
## Fireworks Platform
**Serverless inference**\
Pay-per-token inference with no deployment management. Requests are routed globally across Fireworks infrastructure for lowest latency and highest availability. Serverless does not support geographic constraints — for region-specific inference, use a dedicated deployment and set placement with the **`--region`** flag on `firectl deployment create` (for example `GLOBAL`, `US`, `EUROPE`, `APAC`).
**Dedicated deployment**\
A deployment you provision and manage, with reserved GPU capacity. Gives you control over model, hardware, placement, autoscaling, and addon support. Created with `firectl deployment create`.
**Multi-region deployment**\
A dedicated deployment configured to run replicas across multiple datacenters. Enabled with `--region GLOBAL` (or a specific mega-region: `US`, `EUROPE`, `APAC`) on `firectl deployment create`. Increases availability and throughput.
**Placement**\
Controls which regions a dedicated deployment is allowed to schedule replicas in. On the CLI, set at creation time with **`--region`** (`GLOBAL`, `US`, `EUROPE`, `APAC`, or a specific region id). Cannot be changed after deployment creation — recreate the deployment to change placement. If not specified, the deployment pins to a single datacenter at creation time.
**Deployment shape**\
The hardware and precision configuration used when creating a dedicated deployment. Shapes encode GPU type, count, precision (BF16, FP8, FP4), and other settings. Specified with `--deployment-shape`. Some shapes do not support LoRA addons.
**Deployment state**\
The lifecycle status of a deployment: `CREATING`, `DEPLOYING`, `DEPLOYED`, `SCALING`, `UPDATING`, `FAILED`, `DELETING`, `DELETED`.
**Replica**\
A single instance of a deployed model. Adding replicas increases concurrency. Replicas can be scaled manually or via autoscaling rules.
**Autoscaling**\
Automatic adjustment of replica count based on traffic. Configured with scale-to-zero, minimum replicas, maximum replicas, and scale-up/down thresholds.
**firectl**\
The Fireworks command-line tool for managing models, deployments, fine-tuning jobs, and account resources.
***
## Models & Inference
**Base model**\
A foundation model available on Fireworks for inference or fine-tuning. Referenced as `accounts/fireworks/models/`.
**Addon**\
A LoRA adapter loaded on top of a base model deployment at inference time. Enabled on a deployment with `--enable-addons`. The adapter is specified per request by passing the adapter model ID. FP8 and FP4 quantized shapes do not support addons — use a BF16 shape for LoRA addon inference.
**Multi-LoRA**\
A single base model deployment that serves multiple LoRA adapters. The adapter is selected per request. One deployment, multiple fine-tuned behaviors.
**Quantization**\
Reducing the numerical precision of model weights to decrease memory usage and increase throughput, with some quality tradeoff.
**BF16 (BFloat16)**\
A 16-bit floating point format used for model weights. Provides good quality with moderate memory usage. BF16 deployment shapes support LoRA addons.
**FP8**\
An 8-bit floating point format. Faster and more memory-efficient than BF16, with a small quality tradeoff. FP8 deployment shapes do not support LoRA addons.
**FP4**\
A 4-bit floating point format. Faster and cheaper than FP8, with a larger quality tradeoff. FP4 deployment shapes do not support LoRA addons.
**Speculative decoding**\
An inference acceleration technique where a smaller draft model proposes tokens that are verified by the main model in parallel. Reduces latency with no quality loss.
**KV cache**\
Key-value cache that stores intermediate attention computations across tokens. Reduces re-computation cost on repeated or shared prefixes. Cache hit percentage is reflected in billing. Configurable with `--kv-cache-fraction`.
**Context window**\
The maximum number of tokens (input + output) the model can process in a single request.
**`max_tokens`**\
API parameter that caps how many tokens the model will generate in a single response. Always set this explicitly in agentic workflows — without it, reasoning models and large models may generate very long outputs.
**TTFT (Time to First Token)**\
Latency from when a request is sent to when the first output token is received. Key metric for interactive applications.
**TPS (Tokens Per Second)**\
Throughput metric measuring how many output tokens are generated per second.
**Batch inference**\
Asynchronous large-scale inference via the Fireworks Batch API. Submit a JSONL file of requests; results are returned when the job completes. Lower cost than real-time inference, higher latency, no streaming.
**Streaming**\
Real-time token-by-token output delivery via server-sent events (SSE). Enabled with `stream=True`. Not available in batch inference.
**Tool calling / Function calling**\
The ability for a model to request that the client execute a function and return the result, enabling agentic and multi-step workflows.
**Structured output**\
Model output constrained to a specific JSON schema. Fireworks supports constrained decoding to guarantee valid JSON matching your schema.
**`reasoning_effort`**\
API parameter that controls how much thinking a reasoning model performs before generating its response. Set to `"none"` to disable extended reasoning (useful as a workaround for models prone to reasoning loops).
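Several entries in this section (`max_tokens`, streaming, `reasoning_effort`) are request parameters. As a rough sketch, a Chat Completions request body that always sets an explicit output cap could be assembled as follows. The helper name `build_chat_request` and the default of 512 tokens are illustrative assumptions, not Fireworks defaults.

```python
def build_chat_request(model, prompt, max_tokens=512, reasoning_effort=None, stream=False):
    """Build a Chat Completions request body with an explicit output cap."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # always set explicitly, as advised above
        "stream": stream,
    }
    if reasoning_effort is not None:
        # For example "none" to disable extended reasoning.
        body["reasoning_effort"] = reasoning_effort
    return body


if __name__ == "__main__":
    print(build_chat_request(
        "accounts/fireworks/models/gpt-oss-120b",
        "Summarize this file",
        max_tokens=256,
        reasoning_effort="none",
    ))
```

The resulting dictionary can be serialized as the JSON body of a POST to `https://api.fireworks.ai/inference/v1/chat/completions`.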
***
## Fine-tuning
**SFT (Supervised Fine-Tuning)**\
Training a model on labeled input-output pairs to adapt it to a specific task or style. Available via the Fireworks managed fine-tuning pipeline.
**LoRA (Low-Rank Adaptation)**\
A parameter-efficient fine-tuning technique that trains a small set of adapter weights rather than the full model. The adapter can be loaded on top of the base model at inference time.
**LoRA rank**\
A hyperparameter that controls the size of the LoRA adapter. Higher rank = more capacity but larger adapter and more training compute.
**GRPO (Group Relative Policy Optimization)**\
A reinforcement learning variant used in RFT. Optimizes model outputs using group-relative reward scoring rather than a separate value model.
**RFT (Reinforcement Fine-Tuning)**\
Fine-tuning using reinforcement learning signals rather than labeled examples. Trains the model to maximize a reward function — useful for tasks with verifiable outcomes such as math, code, and reasoning.
**Training shape**\
The hardware configuration used for a fine-tuning job. Selected via `--training-shape` in firectl. Determines GPU type, count, and whether LoRA or full-parameter training is used.
**`dcp_save_interval`**\
RFT training parameter that controls how often full training state (weights + optimizer) is checkpointed. Default is `0` (disabled). Set to a positive integer to enable full checkpoint-and-resume including optimizer state.
**Epoch**\
One complete pass through the training dataset.
**Step**\
A single gradient update during training.
**Checkpoint**\
A saved snapshot of model state (and optionally optimizer state) at a point during training.
**`ppo_kl` vs `ref_kld`**\
Two KL divergence metrics logged during GRPO/RFT training. `ppo_kl` measures divergence between current and previous policy — stays near zero with one minibatch per rollout, which is expected. `ref_kld` measures divergence from the reference/base model — this is the metric to monitor for policy drift.
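To make the LoRA rank entry above concrete: in the standard LoRA formulation, a weight matrix of shape `d_out x d_in` is adapted by two low-rank factors of shapes `d_out x r` and `r x d_in`, so the adapter adds `r * (d_in + d_out)` trainable parameters for that matrix. The sketch below works through the arithmetic; the layer dimensions are illustrative and do not describe any particular Fireworks model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the original dense weight matrix."""
    return d_in * d_out


if __name__ == "__main__":
    d_in = d_out = 4096  # illustrative layer size
    for rank in (8, 16, 64):
        added = lora_param_count(d_in, d_out, rank)
        share = added / full_param_count(d_in, d_out)
        print(f"rank={rank}: {added} adapter params ({share:.2%} of the dense layer)")
```

Doubling the rank doubles the adapter size, which is the capacity versus cost tradeoff the entry describes.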
***
## Deployment
**Cold start**\
The latency incurred when a deployment scales up from zero replicas or provisions a new replica.
**Scale-to-zero**\
A deployment configuration where replica count drops to zero when there is no traffic. Eliminates idle costs but introduces cold-start latency.
**GPU hours**\
The billing unit for dedicated deployment capacity. Charged per GPU per hour of replica uptime.
**Prometheus metrics**\
Per-deployment performance and utilization metrics exposed via a Prometheus-compatible endpoint. Includes TTFT, TPS, GPU utilization, queue depth, and error rates.
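The GPU hours entry above reduces to simple arithmetic: GPUs per replica, times replica count, times uptime. The following sketch shows the calculation. The hourly rate used in the example is a made-up placeholder, so consult the pricing page for real numbers.

```python
def gpu_hours(gpus_per_replica: int, replica_count: int, uptime_seconds: float) -> float:
    """GPU hours consumed by replicas that were up for uptime_seconds."""
    return gpus_per_replica * replica_count * uptime_seconds / 3600


def estimated_cost(hours: float, hourly_rate_per_gpu: float) -> float:
    """Cost estimate for a given number of GPU hours."""
    return hours * hourly_rate_per_gpu


if __name__ == "__main__":
    # Two replicas with 8 GPUs each, up for 30 minutes.
    hours = gpu_hours(gpus_per_replica=8, replica_count=2, uptime_seconds=1800)
    # 2.50 is a hypothetical rate, not a published Fireworks price.
    print(hours, estimated_cost(hours, hourly_rate_per_gpu=2.50))
```

With scale-to-zero enabled, uptime only accrues while at least one replica is running, which is why idle periods cost nothing.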
***
## API & SDK
**API key**\
An authentication token used to make requests to the Fireworks API. Generated in the Fireworks console under Account > API Keys.
**OpenAI-compatible API**\
The Fireworks inference API is compatible with the OpenAI Chat Completions API format. Use the OpenAI Python SDK with Fireworks by changing `base_url` and `api_key`.
**`reconnect_and_wait()`**\
SDK method for recovering a training job that has been interrupted by pod preemption or a network error. Use in your training loop to make jobs resilient to transient interruptions.
**JSONL**\
JSON Lines format — one JSON object per line. Used for batch inference input files and fine-tuning dataset uploads.
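As an illustration of the JSONL entry above, the standard library is enough to write and read such files. This is a minimal sketch. The record layout in the example is purely illustrative and is not the schema required for batch inference or fine-tuning uploads.

```python
import io
import json


def write_jsonl(records, fp) -> None:
    """Write one JSON object per line to a text file object."""
    for record in records:
        # json.dumps escapes embedded newlines, so each record stays on one line.
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(fp) -> list:
    """Read a JSONL text stream, skipping blank lines."""
    return [json.loads(line) for line in fp if line.strip()]


if __name__ == "__main__":
    rows = [
        {"id": 1, "text": "first example"},
        {"id": 2, "text": "second\nexample"},
    ]
    buffer = io.StringIO()
    write_jsonl(rows, buffer)
    buffer.seek(0)
    print(read_jsonl(buffer) == rows)
```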
***
## Security & Compliance
**ISO 27001**\
International standard for information security management systems (ISMS). Fireworks has achieved ISO 27001 certification. Certificate available at [trust.fireworks.ai](https://trust.fireworks.ai).
**ISO 27701**\
Extension to ISO 27001 covering privacy information management. Fireworks has achieved ISO 27701 certification. Certificate available at [trust.fireworks.ai](https://trust.fireworks.ai).
**ISO 42001**\
International standard for AI management systems. Fireworks has achieved ISO 42001 certification. Certificate available at [trust.fireworks.ai](https://trust.fireworks.ai).
**SOC 2 Type II**\
An auditing standard that verifies a company's security, availability, and confidentiality controls over time. Fireworks maintains SOC 2 Type II compliance.
**HIPAA**\
U.S. healthcare data regulation. Fireworks supports HIPAA-compliant deployments for covered entities. Contact sales for Business Associate Agreement (BAA) details.
**Audit logs**\
A record of API activity on your account. Useful for security review and compliance reporting. Available via the Fireworks console and API.
**Trust Center**\
Fireworks' central repository for security documentation, compliance certificates, and data handling policies. Available at [trust.fireworks.ai](https://trust.fireworks.ai).
***
## Inference Parameters
**Temperature**\
Controls randomness in token sampling. `0.0` = deterministic. Higher values increase diversity.
**Top-p (nucleus sampling)**\
Limits token sampling to the smallest set of tokens whose cumulative probability exceeds `p`.
**Top-k**\
Limits token sampling to the top `k` most probable tokens at each step.
**Frequency penalty**\
Reduces the likelihood of tokens that have already appeared in the output. Discourages repetition.
**Presence penalty**\
Reduces the likelihood of tokens that have appeared at all in the output. Encourages the model to introduce new topics.
**Stop sequences**\
One or more strings that cause the model to stop generating when produced. Useful for controlling output format.
**System prompt**\
An instruction prepended to the conversation that sets the model's persona, task, or constraints.
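The sampling parameters above are easier to reason about on a toy example. The sketch below applies temperature, top-k, and top-p to a small dictionary of logits in pure Python. It is a teaching aid only: the order in which the filters are applied is an assumption, and it does not describe how Fireworks implements sampling.

```python
import math
import random


def sampling_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k, and top-p."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top token.
        best = max(logits, key=logits.get)
        return {best: 1.0}

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda item: item[1],
        reverse=True,
    )

    if top_k is not None:
        ranked = ranked[:top_k]  # keep the k most probable tokens

    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break  # smallest prefix whose mass reaches top_p
        ranked = kept

    # Renormalize the surviving tokens so the probabilities sum to 1.
    mass = sum(prob for _, prob in ranked)
    return {tok: prob / mass for tok, prob in ranked}


def sample(logits, rng=random, **kwargs):
    """Draw one token from the filtered distribution."""
    dist = sampling_distribution(logits, **kwargs)
    tokens, probs = zip(*dist.items())
    return rng.choices(tokens, weights=probs, k=1)[0]
```

Calling `sampling_distribution({"a": 2.0, "b": 1.0, "c": 0.0}, temperature=2.0)` yields a flatter distribution than the default, while lowering `top_p` or `top_k` removes the least likely tokens before sampling.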
# Build with Fireworks AI
Source: https://docs.fireworks.ai/getting-started/introduction
Fast inference and fine-tuning for open source models
Fireworks AI is the fastest platform for building with open source AI models. Get production-ready inference and fine-tuning with best-in-class speed, cost and quality.
## Get started in minutes
Use popular models instantly with pay-per-token pricing. Perfect for quality vibe testing and prototyping.
Deploy with high performance on dedicated GPUs with fast autoscaling and minimal cold starts. Optimize deployments for speed and throughput.
Boost model quality with supervised and reinforcement fine-tuning of models up to 1T+ parameters. Start training in minutes, deploy immediately.
Not sure where to start? First, pick the right model for your use case with our [**model selection guide**](/guides/recommended-models). Then choose [**Serverless**](/getting-started/quickstart) to prototype quickly, move to [**Deployments**](/getting-started/ondemand-quickstart) to optimize and run production workloads, or use [**Fine-tuning**](/fine-tuning/finetuning-intro) to improve quality.
New to AI or Fireworks? Look up any term in the [**Glossary**](/getting-started/glossary).
Need help optimizing deployments, fine-tuning models, or setting up production infrastructure? [Talk to our team](https://fireworks.ai/company/contact-us) - we'll help you get the best performance and reliability.
## What you can build
Text, vision, audio, image, and embeddings
Drop-in replacement for inference and fine-tuning — same API, same SFT data format
Connect models to tools and APIs
Reliable JSON responses for agentic workflows
Analyze images and documents
Use embeddings & reranking in search & context retrieval
Run async inference jobs at scale, faster and cheaper
## Resources & help
Find the best model for your use case
Code examples and tutorials
Complete API documentation
Ask questions and get help from developers
SOC 2, HIPAA, and audit reports
Check service uptime
Talk to our team
# Deployments Quickstart
Source: https://docs.fireworks.ai/getting-started/ondemand-quickstart
Deploy models on dedicated GPUs in minutes
On-demand deployments are dedicated GPUs that give you better performance, no rate limits, fast autoscaling, and a wider selection of models than serverless. This quickstart will help you spin up your first on-demand deployment in minutes.
## Step 1: Create and export an API key
Before you begin, create an API key in the [Fireworks dashboard](https://app.fireworks.ai/settings/users/api-keys). Click **Create API key** and store it in a safe location.
Once you have your API key, export it as an environment variable in your terminal:
```bash theme={null}
export FIREWORKS_API_KEY="your_api_key_here"
```
```powershell theme={null}
setx FIREWORKS_API_KEY "your_api_key_here"
```
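In application code it can help to fail early with a clear message when the variable is missing. The following is a small sketch, and the helper name `get_fireworks_api_key` is illustrative. Note that on Windows, `setx` only affects terminals opened after the command runs.

```python
import os


def get_fireworks_api_key(environ=os.environ) -> str:
    """Return the API key from the environment or raise a clear error."""
    key = environ.get("FIREWORKS_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "FIREWORKS_API_KEY is not set; export it in your terminal first"
        )
    return key
```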
## Step 2: Install the CLI
To create and manage on-demand deployments, you'll need the `firectl` CLI tool. Install it using one of the following methods, based on your platform:
```bash homebrew theme={null}
brew tap fw-ai/firectl
brew install firectl
# If you encounter a failed SHA256 check, try first running
brew update
```
```bash macOS (Apple Silicon) theme={null}
curl https://storage.googleapis.com/fireworks-public/firectl/stable/darwin-arm64.gz -o firectl.gz
gzip -d firectl.gz && chmod a+x firectl
sudo mv firectl /usr/local/bin/firectl
sudo chown root: /usr/local/bin/firectl
```
```bash macOS (x86_64) theme={null}
curl https://storage.googleapis.com/fireworks-public/firectl/stable/darwin-amd64.gz -o firectl.gz
gzip -d firectl.gz && chmod a+x firectl
sudo mv firectl /usr/local/bin/firectl
sudo chown root: /usr/local/bin/firectl
```
```bash Linux (x86_64) theme={null}
wget -O firectl.gz https://storage.googleapis.com/fireworks-public/firectl/stable/linux-amd64.gz
gunzip firectl.gz
sudo install -o root -g root -m 0755 firectl /usr/local/bin/firectl
```
```Text Windows (64 bit) theme={null}
wget -L https://storage.googleapis.com/fireworks-public/firectl/stable/firectl.exe
```
Then, sign in:
```bash theme={null}
firectl signin
```
## Step 3: Create a deployment
This command will create a deployment of GPT OSS 120B optimized for speed. It will take a few minutes to complete. The resulting deployment will scale up to 1 replica.
```bash theme={null}
firectl deployment create accounts/fireworks/models/gpt-oss-120b \
--deployment-shape fast \
--scale-down-window 5m \
--scale-up-window 30s \
--min-replica-count 0 \
--max-replica-count 1 \
--scale-to-zero-window 5m \
--wait
```
`fast` is a [deployment shape](/guides/ondemand-deployments#deployment-shapes): a pre-configured deployment template created by the Fireworks team that sets sensible defaults for most deployment options (such as hardware type).
You can also pass `throughput` or `cost` to `--deployment-shape`:
* `throughput` creates a deployment that trades off latency for lower cost-per-token at scale
* `cost` creates a deployment that trades off latency and throughput for lowest cost-per-token at small scale, usually for early experimentation and prototyping
While we recommend using a deployment shape, you are also free to pass your own configuration to the deployment via our [deployment guide](/guides/ondemand-deployments).
The response will look like this:
```bash theme={null}
Name: accounts//deployments/
Create Time:
Expire Time:
Created By:
State: CREATING
Status: OK
Min Replica Count: 0
Max Replica Count: 1
Desired Replica Count: 0
Replica Count: 0
Autoscaling Policy:
Scale Up Window: 30s
Scale Down Window: 5m0s
Scale To Zero Window: 5m0s
Base Model: accounts/fireworks/models/gpt-oss-120b
...other fields...
```
Take note of the `Name:` field in the response, as it will be used in the next step to query your deployment.
[Learn more about deployment options→](/guides/ondemand-deployments)
[Learn more about autoscaling options→](/guides/ondemand-deployments#autoscaling)
## Step 4: Query your deployment
Now you can query your on-demand deployment using the same API as serverless models, but targeting your dedicated deployment. Replace `` in the snippets below with the value from the `Name:` field in the previous step:
Install the [Fireworks Python SDK](/tools-sdks/python-sdk):
The SDK is currently in alpha. Use the `--pre` flag when installing to get the latest version.
```bash pip theme={null}
pip install --pre fireworks-ai
```
```bash poetry theme={null}
poetry add --pre fireworks-ai
```
```bash uv theme={null}
uv add --pre fireworks-ai
```
Then make your first on-demand API call:
```python theme={null}
from fireworks import Fireworks
client = Fireworks()
response = client.chat.completions.create(
model="accounts/fireworks/models/gpt-oss-120b#",
messages=[{
"role": "user",
"content": "Explain quantum computing in simple terms",
}],
)
print(response.choices[0].message.content)
```
```python theme={null}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1"
)
response = client.chat.completions.create(
model="",
messages=[{
"role": "user",
"content": "Explain quantum computing in simple terms",
}],
)
print(response.choices[0].message.content)
```
```javascript theme={null}
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference/v1",
});
const response = await client.chat.completions.create({
model: "",
messages: [
{
role: "user",
content: "Explain quantum computing in simple terms",
},
],
});
console.log(response.choices[0].message.content);
```
```bash theme={null}
curl https://api.fireworks.ai/inference/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-d '{
"model": "",
"messages": [
{
"role": "user",
"content": "Explain quantum computing in simple terms"
}
]
}'
```
The examples from the Serverless quickstart will work with this deployment as well; just replace the model string with the deployment-specific model string from above.
[Serverless quickstart→](/getting-started/quickstart)
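When scripting this step, the deployment-specific model string can be assembled from the `firectl` output. The sketch below rests on two assumptions: the output contains a `Name:` line as shown in Step 3, and the model string takes the `<base model>#<deployment name>` form suggested by the SDK example above. The account and deployment IDs in the sample output are hypothetical.

```python
def deployment_name_from_output(firectl_output: str) -> str:
    """Extract the value of the 'Name:' field from firectl output text."""
    for line in firectl_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Name":
            return value.strip()
    raise ValueError("no 'Name:' field found in firectl output")


def deployed_model_string(base_model: str, deployment_name: str) -> str:
    """Join the base model and deployment name with '#' (assumed format)."""
    return f"{base_model}#{deployment_name}"


if __name__ == "__main__":
    # Hypothetical output; real account and deployment IDs will differ.
    sample_output = (
        "Name: accounts/my-account/deployments/my-deployment\n"
        "State: CREATING\n"
        "Base Model: accounts/fireworks/models/gpt-oss-120b\n"
    )
    name = deployment_name_from_output(sample_output)
    print(deployed_model_string("accounts/fireworks/models/gpt-oss-120b", name))
```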
## Common use cases
### Autoscale based on requests per second
```bash theme={null}
firectl deployment create accounts/fireworks/models/gpt-oss-120b \
--deployment-shape fast \
--scale-down-window 5m \
--scale-up-window 30s \
--scale-to-zero-window 5m \
--min-replica-count 0 \
--max-replica-count 4 \
--load-targets requests_per_second=5 \
--wait
```
### Autoscale based on concurrent requests
```bash theme={null}
firectl deployment create accounts/fireworks/models/gpt-oss-120b \
--deployment-shape fast \
--scale-down-window 5m \
--scale-up-window 30s \
--scale-to-zero-window 5m \
--min-replica-count 0 \
--max-replica-count 4 \
--load-targets concurrent_requests=5 \
--wait
```
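The two commands above differ only in the `--load-targets` value, so a script that creates deployments can build the argument list programmatically. This is a sketch that uses only the flags shown above. The helper name is illustrative, and it accepts only the two load targets that appear on this page.

```python
import shlex

# The two load targets shown in the examples above.
KNOWN_LOAD_TARGETS = {"requests_per_second", "concurrent_requests"}


def build_deployment_command(model, load_target, target_value, max_replicas=4):
    """Return the firectl argument list for an autoscaled deployment."""
    if load_target not in KNOWN_LOAD_TARGETS:
        raise ValueError(f"unsupported load target in this sketch: {load_target}")
    return [
        "firectl", "deployment", "create", model,
        "--deployment-shape", "fast",
        "--scale-down-window", "5m",
        "--scale-up-window", "30s",
        "--scale-to-zero-window", "5m",
        "--min-replica-count", "0",
        "--max-replica-count", str(max_replicas),
        "--load-targets", f"{load_target}={target_value}",
        "--wait",
    ]


if __name__ == "__main__":
    cmd = build_deployment_command(
        "accounts/fireworks/models/gpt-oss-120b", "concurrent_requests", 5
    )
    # Print a shell-safe command line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```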
## Next steps
Ready to scale to production, explore other modalities, or customize your models?
Bring your own model and deploy it on Fireworks
Improve model quality with supervised and reinforcement learning
Use embeddings & reranking in search & context retrieval
Run async inference jobs at scale, faster and cheaper
Explore all available models across modalities
Complete API documentation
# Serverless Quickstart
Source: https://docs.fireworks.ai/getting-started/quickstart
Make your first Serverless API call in minutes
Serverless is the fastest way to get started with open models. This quickstart will help you make your first API call in minutes.
## Step 1: Create and export an API key
Before you begin, create an API key in the [Fireworks dashboard](https://app.fireworks.ai/settings/users/api-keys). Click **Create API key** and store it in a safe location.
Once you have your API key, export it as an environment variable in your terminal:
```bash theme={null}
export FIREWORKS_API_KEY="your_api_key_here"
```
```powershell theme={null}
setx FIREWORKS_API_KEY "your_api_key_here"
```
## Step 2: Make your first Serverless API call
Install the [Fireworks Python SDK](/tools-sdks/python-sdk):
The SDK is currently in alpha. Use the `--pre` flag when installing to get the latest version.
```bash pip theme={null}
pip install --pre fireworks-ai
```
```bash poetry theme={null}
poetry add --pre fireworks-ai
```
```bash uv theme={null}
uv add --pre fireworks-ai
```
Then make your first Serverless API call:
```python theme={null}
from fireworks import Fireworks
client = Fireworks()
response = client.chat.completions.create(
model="accounts/fireworks/models/deepseek-v3p1",
messages=[{
"role": "user",
"content": "Say hello in Spanish",
}],
)
print(response.choices[0].message.content)
```
Fireworks provides an OpenAI compatible endpoint. Install the [OpenAI Python SDK](https://github.com/openai/openai-python):
```bash theme={null}
pip install openai
```
Then make your first Serverless API call:
```python theme={null}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1"
)
response = client.chat.completions.create(
model="accounts/fireworks/models/deepseek-v3p1",
messages=[{
"role": "user",
"content": "Say hello in Spanish",
}],
)
print(response.choices[0].message.content)
```
Fireworks provides an Anthropic-compatible endpoint. Install the [Anthropic Python SDK](https://github.com/anthropics/anthropic-sdk-python):
```bash theme={null}
pip install anthropic
```
Then make your first Serverless API call:
```python theme={null}
import os
import anthropic
client = anthropic.Anthropic(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference"
)
response = client.messages.create(
model="accounts/fireworks/models/deepseek-v3p1",
max_tokens=1024,
messages=[{
"role": "user",
"content": "Say hello in Spanish",
}],
)
print(response.content[0].text)
```
Fireworks provides an OpenAI-compatible endpoint. Install the [OpenAI JavaScript / TypeScript SDK](https://github.com/openai/openai-node):
```bash theme={null}
npm install openai
```
Then make your first Serverless API call:
```javascript theme={null}
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference/v1",
});
const response = await client.chat.completions.create({
model: "accounts/fireworks/models/deepseek-v3p1",
messages: [
{
role: "user",
content: "Say hello in Spanish",
},
],
});
console.log(response.choices[0].message.content);
```
Fireworks provides an Anthropic-compatible endpoint. Install the [Anthropic JavaScript / TypeScript SDK](https://github.com/anthropics/anthropic-sdk-typescript):
```bash theme={null}
npm install @anthropic-ai/sdk
```
Then make your first Serverless API call:
```javascript theme={null}
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference",
});
const response = await client.messages.create({
model: "accounts/fireworks/models/deepseek-v3p1",
max_tokens: 1024,
messages: [
{
role: "user",
content: "Say hello in Spanish",
},
],
});
console.log(response.content[0].text);
```
```bash theme={null}
curl https://api.fireworks.ai/inference/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-d '{
"model": "accounts/fireworks/models/deepseek-v3p1",
"messages": [
{
"role": "user",
"content": "Say hello in Spanish"
}
]
}'
```
You should see a response like: `"¡Hola!"`
For **Priority tier** (`service_tier: "priority"`) and **Fast**, see [Serverless Serving Paths](/serverless/serving-paths).
## Common use cases
### Streaming responses
Stream responses token-by-token for a better user experience:
```python theme={null}
from fireworks import Fireworks
client = Fireworks()
stream = client.chat.completions.create(
model="accounts/fireworks/models/deepseek-v3p1",
messages=[{"role": "user", "content": "Tell me a short story"}],
stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
```python theme={null}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1"
)
stream = client.chat.completions.create(
model="accounts/fireworks/models/deepseek-v3p1",
messages=[{"role": "user", "content": "Tell me a short story"}],
stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
```python theme={null}
import os
import anthropic
client = anthropic.Anthropic(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference"
)
with client.messages.stream(
    model="accounts/fireworks/models/deepseek-v3p1",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Tell me a short story"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
```javascript theme={null}
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference/v1",
});
const stream = await client.chat.completions.create({
model: "accounts/fireworks/models/deepseek-v3p1",
messages: [{ role: "user", content: "Tell me a short story" }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
```
```javascript theme={null}
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference",
});
const stream = client.messages.stream({
model: "accounts/fireworks/models/deepseek-v3p1",
max_tokens: 1024,
messages: [{ role: "user", content: "Tell me a short story" }],
});
stream.on("text", (text) => {
process.stdout.write(text);
});
await stream.finalMessage();
```
```bash theme={null}
curl https://api.fireworks.ai/inference/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-d '{
"model": "accounts/fireworks/models/deepseek-v3p1",
"messages": [
{
"role": "user",
"content": "Tell me a short story"
}
],
"stream": true
}'
```
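When streaming, you often need the complete text once the stream ends, for example to log or store it. The sketch below joins the text deltas of an OpenAI-style stream. `collect_stream` and `fake_chunk` are hypothetical helpers; the stand-in chunks only mimic the attribute shape used in the examples above so the code runs without a network call.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of an OpenAI-style chat completion stream."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # defensively skip chunks that carry no choices
        delta = chunk.choices[0].delta
        if delta.content:
            parts.append(delta.content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in chunk with the same attribute shape, for local testing."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Once "), fake_chunk(None), fake_chunk("upon a time")]
print(collect_stream(demo))  # prints: Once upon a time
```

With a real client, pass the `stream` object returned by `client.chat.completions.create(..., stream=True)` in place of `demo`.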
### Function calling
Connect your models to external tools and APIs:
```python theme={null}
from fireworks import Fireworks
client = Fireworks()
response = client.chat.completions.create(
model="accounts/fireworks/models/kimi-k2-instruct-0905",
messages=[
{"role": "user", "content": "What's the weather in Paris?"}
],
tools=[
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name, e.g. San Francisco",
}
},
"required": ["location"],
},
},
},
],
)
print(response.choices[0].message.tool_calls)
```
```python theme={null}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1",
)
response = client.chat.completions.create(
model="accounts/fireworks/models/kimi-k2-instruct-0905",
messages=[
{"role": "user", "content": "What's the weather in Paris?"}
],
tools=[
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name, e.g. San Francisco",
}
},
"required": ["location"],
},
},
},
],
)
print(response.choices[0].message.tool_calls)
```
```python theme={null}
import os
import anthropic
client = anthropic.Anthropic(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference"
)
response = client.messages.create(
model="accounts/fireworks/models/kimi-k2-instruct-0905",
max_tokens=1024,
messages=[
{"role": "user", "content": "What's the weather in Paris?"}
],
tools=[
{
"name": "get_weather",
"description": "Get the current weather for a location",
"input_schema": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name, e.g. San Francisco",
}
},
"required": ["location"],
},
},
],
)
for block in response.content:
    if block.type == "tool_use":
        print(f"Tool: {block.name}, Input: {block.input}")
```
```javascript theme={null}
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference/v1",
});
const tools = [
{
type: "function",
function: {
name: "get_weather",
description: "Get the current weather for a location",
parameters: {
type: "object",
properties: {
location: {
type: "string",
description: "City name, e.g. San Francisco",
},
},
required: ["location"],
},
},
},
];
const response = await client.chat.completions.create({
model: "accounts/fireworks/models/kimi-k2-instruct-0905",
messages: [{ role: "user", content: "What's the weather in Paris?" }],
tools: tools,
});
console.log(response.choices[0].message.tool_calls);
```
```javascript theme={null}
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference",
});
const response = await client.messages.create({
model: "accounts/fireworks/models/kimi-k2-instruct-0905",
max_tokens: 1024,
messages: [{ role: "user", content: "What's the weather in Paris?" }],
tools: [
{
name: "get_weather",
description: "Get the current weather for a location",
input_schema: {
type: "object",
properties: {
location: {
type: "string",
description: "City name, e.g. San Francisco",
},
},
required: ["location"],
},
},
],
});
for (const block of response.content) {
if (block.type === "tool_use") {
console.log(`Tool: ${block.name}, Input:`, block.input);
}
}
```
```bash theme={null}
curl https://api.fireworks.ai/inference/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-d '{
"model": "accounts/fireworks/models/kimi-k2-instruct-0905",
"messages": [
{
"role": "user",
"content": "What'\''s the weather in Paris?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name, e.g. San Francisco"
}
},
"required": ["location"]
}
}
}
]
}'
```
[Learn more about function calling →](/guides/function-calling)
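The examples above stop at printing the tool calls. As a rough sketch of the next step, the code below maps a returned call to a local Python function. It assumes the OpenAI-style shape, where `function.arguments` is a JSON string; `TOOL_REGISTRY`, `run_tool_call`, and the placeholder `get_weather` body are hypothetical names introduced for illustration.

```python
import json
from types import SimpleNamespace


def get_weather(location):
    """Placeholder implementation; a real version would query a weather service."""
    return {"location": location, "forecast": "unknown"}


# Map tool names (as declared in `tools`) to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call):
    """Execute one OpenAI-style tool call and return its result as a JSON string."""
    func = TOOL_REGISTRY[tool_call.function.name]
    kwargs = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    return json.dumps(func(**kwargs))


# Stand-in for one entry of response.choices[0].message.tool_calls
example_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
print(run_tool_call(example_call))
```

Keeping the registry explicit means the model can only trigger functions you have deliberately exposed; an unknown tool name raises `KeyError` rather than running arbitrary code.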
### Structured outputs (JSON mode)
Get reliable JSON responses that match your schema:
```python theme={null}
from fireworks import Fireworks
client = Fireworks()
response = client.chat.completions.create(
model="accounts/fireworks/models/deepseek-v3p1",
messages=[
{
"role": "user",
"content": "Extract the name and age from: John is 30 years old",
}
],
response_format={
"type": "json_schema",
"json_schema": {
"name": "person",
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"age": { "type": "number" }
},
"required": ["name", "age"],
},
},
},
)
print(response.choices[0].message.content)
```
```python theme={null}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1",
)
response = client.chat.completions.create(
model="accounts/fireworks/models/deepseek-v3p1",
messages=[
{
"role": "user",
"content": "Extract the name and age from: John is 30 years old",
}
],
response_format={
"type": "json_schema",
"json_schema": {
"name": "person",
"schema": {
"type": "object",
"properties": {"name": {"type": "string"}, "age": {"type": "number"}},
"required": ["name", "age"],
},
},
},
)
print(response.choices[0].message.content)
```
```python theme={null}
import os
import anthropic
client = anthropic.Anthropic(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference"
)
response = client.messages.create(
model="accounts/fireworks/models/deepseek-v3p1",
max_tokens=1024,
output_config={
"format": {
"type": "json_schema",
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"age": { "type": "number" }
},
"required": ["name", "age"],
},
}
},
messages=[
{
"role": "user",
"content": "Extract the name and age from: John is 30 years old",
}
],
)
print(response.content[0].text)
```
```javascript theme={null}
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference/v1",
});
const response = await client.chat.completions.create({
model: "accounts/fireworks/models/deepseek-v3p1",
messages: [
{
role: "user",
content: "Extract the name and age from: John is 30 years old",
},
],
response_format: {
type: "json_schema",
json_schema: {
name: "person",
schema: {
type: "object",
properties: {
name: { type: "string" },
age: { type: "number" },
},
required: ["name", "age"],
},
},
},
});
console.log(response.choices[0].message.content);
```
```javascript theme={null}
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference",
});
const response = await client.messages.create({
model: "accounts/fireworks/models/deepseek-v3p1",
max_tokens: 1024,
output_config: {
format: {
type: "json_schema",
schema: {
type: "object",
properties: {
name: { type: "string" },
age: { type: "number" },
},
required: ["name", "age"],
},
},
},
messages: [
{
role: "user",
content: "Extract the name and age from: John is 30 years old",
},
],
});
console.log(response.content[0].text);
```
```bash theme={null}
curl https://api.fireworks.ai/inference/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-d '{
"model": "accounts/fireworks/models/deepseek-v3p1",
"messages": [
{
"role": "user",
"content": "Extract the name and age from: John is 30 years old"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "person",
"schema": {
"type": "object",
"properties": {
"name": {
"type": "string"
},
"age": {
"type": "number"
}
},
"required": ["name", "age"]
}
}
}
}'
```
[Learn more about structured outputs →](/structured-responses/structured-response-formatting)
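The structured reply arrives as a JSON string in the message content, so you still need to parse it before use. The following sketch parses a reply and re-checks the `name` and `age` fields from the schema used above. `parse_person` is a hypothetical helper, and the sample string is hand-written rather than actual model output.

```python
import json


def parse_person(content):
    """Parse a JSON-mode reply and check the fields required by the schema."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in ("name", "age") if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if not isinstance(data["name"], str):
        raise ValueError("name must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(data["age"], bool) or not isinstance(data["age"], (int, float)):
        raise ValueError("age must be a number")
    return data


# Hand-written example reply; an actual response may differ
person = parse_person('{"name": "John", "age": 30}')
print(person["name"], person["age"])
```

A client-side check like this is optional, but it gives a clear error if a response is truncated (for example by a low `max_tokens`) before the JSON object is complete.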
### Reasoning
Some models support reasoning, where the model shows its thought process before giving the final answer:
```python theme={null}
from fireworks import Fireworks
client = Fireworks()
response = client.chat.completions.create(
model="accounts/fireworks/models/deepseek-v3p2",
messages=[
{"role": "user", "content": "What is 25 * 37? Show your work."}
],
reasoning_effort="medium",
)
msg = response.choices[0].message
if msg.reasoning_content:
    print("Reasoning:", msg.reasoning_content)
print("Answer:", msg.content)
```
```python theme={null}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1",
)
response = client.chat.completions.create(
model="accounts/fireworks/models/deepseek-v3p2",
messages=[
{"role": "user", "content": "What is 25 * 37? Show your work."}
],
extra_body={"reasoning_effort": "medium"},
)
msg = response.choices[0].message
# Reasoning content is returned in a separate field
reasoning = getattr(msg, "reasoning_content", None)
if reasoning is None and hasattr(msg, "model_extra"):
    reasoning = msg.model_extra.get("reasoning_content")
if reasoning:
    print("Reasoning:", reasoning)
print("Answer:", msg.content)
```
The Anthropic SDK uses the `thinking` parameter to enable reasoning:
```python theme={null}
import os
import anthropic
client = anthropic.Anthropic(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference"
)
response = client.messages.create(
model="accounts/fireworks/models/deepseek-v3p2",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 4096},
messages=[
{"role": "user", "content": "What is 25 * 37? Show your work."}
],
)
for block in response.content:
    if block.type == "thinking":
        print("Thinking:", block.thinking)
    elif block.type == "text":
        print("Answer:", block.text)
```
```javascript theme={null}
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference/v1",
});
const response = await client.chat.completions.create({
model: "accounts/fireworks/models/deepseek-v3p2",
messages: [
{ role: "user", content: "What is 25 * 37? Show your work." },
],
reasoning_effort: "medium",
});
const msg = response.choices[0].message;
if (msg.reasoning_content) {
console.log("Reasoning:", msg.reasoning_content);
}
console.log("Answer:", msg.content);
```
The Anthropic SDK uses the `thinking` parameter to enable reasoning:
```javascript theme={null}
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference",
});
const response = await client.messages.create({
model: "accounts/fireworks/models/deepseek-v3p2",
max_tokens: 16000,
thinking: { type: "enabled", budget_tokens: 4096 },
messages: [
{ role: "user", content: "What is 25 * 37? Show your work." },
],
});
for (const block of response.content) {
if (block.type === "thinking") {
console.log("Thinking:", block.thinking);
} else if (block.type === "text") {
console.log("Answer:", block.text);
}
}
```
```bash theme={null}
curl https://api.fireworks.ai/inference/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-d '{
"model": "accounts/fireworks/models/deepseek-v3p2",
"messages": [
{
"role": "user",
"content": "What is 25 * 37? Show your work."
}
],
"reasoning_effort": "medium"
}'
```
[Learn more about reasoning →](/guides/reasoning)
### Vision models
Analyze images with vision-language models:
```python theme={null}
from fireworks import Fireworks
client = Fireworks()
response = client.chat.completions.create(
model="accounts/fireworks/models/qwen2p5-vl-32b-instruct",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{
"type": "image_url",
"image_url": {
"url": "https://storage.googleapis.com/fireworks-public/image_assets/fireworks-ai-wordmark-color-dark.png"
},
},
],
}
],
)
print(response.choices[0].message.content)
```
```python theme={null}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1"
)
response = client.chat.completions.create(
model="accounts/fireworks/models/qwen2p5-vl-32b-instruct",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{
"type": "image_url",
"image_url": {
"url": "https://storage.googleapis.com/fireworks-public/image_assets/fireworks-ai-wordmark-color-dark.png"
},
},
],
}
],
)
print(response.choices[0].message.content)
```
The Anthropic SDK uses its native image format with `type: "image"` and a `source` object:
```python theme={null}
import os
import anthropic
client = anthropic.Anthropic(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference"
)
response = client.messages.create(
model="accounts/fireworks/models/qwen2p5-vl-32b-instruct",
max_tokens=1024,
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{
"type": "image",
"source": {
"type": "url",
"url": "https://storage.googleapis.com/fireworks-public/image_assets/fireworks-ai-wordmark-color-dark.png",
},
},
],
}
],
)
for block in response.content:
    if block.type == "text":
        print(block.text)
```
```javascript theme={null}
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference/v1",
});
const response = await client.chat.completions.create({
model: "accounts/fireworks/models/qwen2p5-vl-32b-instruct",
messages: [
{
role: "user",
content: [
{ type: "text", text: "What's in this image?" },
{
type: "image_url",
image_url: {
url: "https://storage.googleapis.com/fireworks-public/image_assets/fireworks-ai-wordmark-color-dark.png",
},
},
],
},
],
});
console.log(response.choices[0].message.content);
```
```javascript theme={null}
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference",
});
const response = await client.messages.create({
model: "accounts/fireworks/models/qwen2p5-vl-32b-instruct",
max_tokens: 1024,
messages: [
{
role: "user",
content: [
{ type: "text", text: "What's in this image?" },
{
type: "image",
source: {
type: "url",
url: "https://storage.googleapis.com/fireworks-public/image_assets/fireworks-ai-wordmark-color-dark.png",
},
},
],
},
],
});
for (const block of response.content) {
if (block.type === "text") {
console.log(block.text);
}
}
```
```bash theme={null}
curl https://api.fireworks.ai/inference/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-d '{
"model": "accounts/fireworks/models/qwen2p5-vl-32b-instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What'\''s in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://storage.googleapis.com/fireworks-public/image_assets/fireworks-ai-wordmark-color-dark.png"
}
}
]
}
]
}'
```
[Learn more about vision models →](/guides/querying-vision-language-models)
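The examples above reference an image by public URL. For a local file, a common approach with OpenAI-style `image_url` parts is to inline the image as a base64 data URL. The sketch below assumes the endpoint accepts data URLs in that field; check the vision guide for supported formats and size limits. `to_data_url` and `image_part` are hypothetical helpers.

```python
import base64
import mimetypes


def to_data_url(image_bytes, filename):
    """Encode raw image bytes as a base64 data URL."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(image_bytes, filename):
    """Build an OpenAI-style `image_url` content part from local bytes."""
    return {"type": "image_url", "image_url": {"url": to_data_url(image_bytes, filename)}}


# Placeholder bytes; in practice, read them with open(path, "rb").read()
part = image_part(b"\x89PNG\r\n\x1a\n", "logo.png")
print(part["image_url"]["url"][:22])
```

The resulting dictionary can take the place of the `image_url` entry in the `content` list of the OpenAI-style requests shown above.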
## Learn more about Serverless
For the model lifecycle policy, billing details, and serverless-specific request/response behavior, see the [Serverless overview](/serverless/overview).
## Next steps
Ready to scale to production, explore other modalities, or customize your models?
Deploy with high performance on dedicated GPUs with fast autoscaling and minimal cold starts
Improve model quality with supervised and reinforcement learning
Use embeddings & reranking in search & context retrieval
Run async inference jobs at scale, faster and cheaper
Explore all available models across modalities
Complete API documentation
# Batch API
Source: https://docs.fireworks.ai/guides/batch-inference
Process large-scale async workloads
Process large volumes of requests asynchronously at 50% lower cost. Batch API is ideal for:
* Production-scale inference workloads
* Large-scale testing and benchmarking
* Training smaller models with larger ones ([distillation guide](https://fireworks.ai/blog/deepseek-r1-distillation-reasoning))
Batch jobs automatically use [prompt caching](/guides/prompt-caching) for an additional 50% cost savings on cached tokens. Maximize cache hits by placing static content first in your prompts.
## Model compatibility
Not all models support the Batch API. Before submitting a batch job, verify your target model is batch-compatible.
* **Base Models** – Any model that supports [On-Demand Deployments](https://docs.fireworks.ai/guides/ondemand-deployments) in the [Model Library](https://fireworks.ai/models)
* **Custom Models** – Your uploaded or fine-tuned models
*Note: Newly added models may have a delay before being supported. See [Quantization](/models/quantization) for precision info.*
If a model does not support batch inference, submitting a job may not produce an immediate error — the job can remain in a pending state and never schedule. Always verify compatibility before submitting.
**If your batch job is not scheduling:**
1. Confirm the model supports batch inference (see above).
2. Validate your JSONL input — each line must be a complete, valid JSON object matching the request schema.
3. Check that your account has sufficient quota for batch jobs.
4. If the job has been pending for more than 30 minutes, contact support with your job ID.
## Getting Started
Datasets must be in JSONL format (one JSON object per line):
**Requirements:**
* **File format:** JSONL (each line is a valid JSON object)
* **Size limit:** Under 1GB
* **Required fields:** `custom_id` (unique) and `body` (request parameters)
**Example dataset:**
```json theme={null}
{"custom_id": "request-1", "body": {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}], "max_tokens": 100}}
{"custom_id": "request-2", "body": {"messages": [{"role": "user", "content": "Explain quantum computing"}], "temperature": 0.7}}
{"custom_id": "request-3", "body": {"messages": [{"role": "user", "content": "Tell me a joke"}]}}
```
Save as `batch_input_data.jsonl` locally.
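If your prompts live in code rather than in a file, you can generate the JSONL programmatically. This is a minimal sketch that builds one line per prompt with a unique `custom_id` and a `body`, matching the example dataset above. `build_batch_lines` and `write_jsonl` are hypothetical helpers, and the `request-N` naming is only a convention.

```python
import json


def build_batch_lines(prompts, max_tokens=100):
    """Turn a list of user prompts into JSONL lines with unique custom_id values."""
    lines = []
    for index, prompt in enumerate(prompts, start=1):
        record = {
            "custom_id": f"request-{index}",
            "body": {
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


def write_jsonl(path, lines):
    """Write one JSON object per line, ending the file with a newline."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


lines = build_batch_lines(["What is the capital of France?", "Tell me a joke"])
for line in lines:
    print(line)
```

Calling `write_jsonl("batch_input_data.jsonl", lines)` then produces a file you can upload in the next step.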
In the dashboard, navigate to the dataset tab, click `Create Dataset`, and follow the wizard.
```bash theme={null}
firectl dataset create batch-input-dataset ./batch_input_data.jsonl
```
Uploading over HTTP takes two separate requests: one to create the dataset entry and one to upload the file. See the [Create dataset](/api-reference/create-dataset) reference for details.
```bash theme={null}
# Create Dataset Entry
curl -X POST "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/datasets" \
-H "Authorization: Bearer ${API_KEY}" \
-H "Content-Type: application/json" \
-d '{
"datasetId": "batch-input-dataset",
"dataset": { "userUploaded": {} }
}'
# Upload JSONL file
curl -X POST "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/datasets/batch-input-dataset:upload" \
-H "Authorization: Bearer ${API_KEY}" \
-F "file=@./batch_input_data.jsonl"
```
Navigate to the Batch Inference tab and click "Create Batch Inference Job". Select your input dataset, choose your model, and configure optional settings.
```bash theme={null}
firectl batch-inference-job create \
--model accounts/fireworks/models/llama-v3p1-8b-instruct \
--input-dataset-id batch-input-dataset
```
With additional parameters:
```bash theme={null}
firectl batch-inference-job create \
--job-id my-batch-job \
--model accounts/fireworks/models/llama-v3p1-8b-instruct \
--input-dataset-id batch-input-dataset \
--output-dataset-id batch-output-dataset \
--max-tokens 1024 \
--temperature 0.7 \
--top-p 0.9
```
```bash theme={null}
curl -X POST "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/batchInferenceJobs?batchInferenceJobId=my-batch-job" \
-H "Authorization: Bearer ${API_KEY}" \
-H "Content-Type: application/json" \
-d '{
"model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
"inputDatasetId": "accounts/'${ACCOUNT_ID}'/datasets/batch-input-dataset",
"outputDatasetId": "accounts/'${ACCOUNT_ID}'/datasets/batch-output-dataset",
"inferenceParameters": {
"maxTokens": 1024,
"temperature": 0.7,
"topP": 0.9
}
}'
```
View all your batch inference jobs in the dashboard:
```bash theme={null}
# Get job status
firectl batch-inference-job get my-batch-job
# List all batch jobs
firectl batch-inference-job list
```
```bash theme={null}
# Get specific job
curl -X GET "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/batchInferenceJobs/my-batch-job" \
-H "Authorization: Bearer ${API_KEY}"
# List all jobs
curl -X GET "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/batchInferenceJobs" \
-H "Authorization: Bearer ${API_KEY}"
```
Navigate to the output dataset and download the results:
```bash theme={null}
firectl dataset download batch-output-dataset
```
```bash theme={null}
# Get download endpoint and save response
curl -s -X GET "https://api.fireworks.ai/v1/accounts/${ACCOUNT_ID}/datasets/batch-output-dataset:getDownloadEndpoint" \
-H "Authorization: Bearer ${API_KEY}" \
-d '{}' > download.json
# Extract and download all files
jq -r '.filenameToSignedUrls | to_entries[] | "\(.key) \(.value)"' download.json | \
while read -r object_path signed_url; do
fname=$(basename "$object_path")
echo "Downloading → $fname"
curl -L -o "$fname" "$signed_url"
done
```
The output dataset contains two files: a **results file** (successful responses in JSONL format) and an **error file** (failed requests with debugging info).
## Reference
Batch jobs progress through several states:
| State | Description |
| -------------- | ----------------------------------------------------- |
| **VALIDATING** | Dataset is being validated for format requirements |
| **PENDING** | Job is queued and waiting for resources |
| **RUNNING** | Actively processing requests |
| **COMPLETED** | All requests successfully processed |
| **FAILED** | Unrecoverable error occurred (check status message) |
| **EXPIRED** | Exceeded 24-hour limit (completed requests are saved) |
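When polling a job from a script, you need to know when to stop. The sketch below classifies the states from the table. Treating `COMPLETED`, `FAILED`, and `EXPIRED` as terminal is an interpretation of the descriptions, and the exact strings returned by the API or `firectl` may differ (for example, they may carry a prefix), so adjust the sets to match what you observe. `is_terminal` and `has_partial_results` are hypothetical helpers.

```python
# State names taken from the table; grouping them as "terminal" is an interpretation.
TERMINAL_STATES = {"COMPLETED", "FAILED", "EXPIRED"}
ACTIVE_STATES = {"VALIDATING", "PENDING", "RUNNING"}


def is_terminal(state):
    """Return True once a job can no longer make progress."""
    normalized = state.strip().upper()
    if normalized not in TERMINAL_STATES | ACTIVE_STATES:
        raise ValueError(f"unknown batch job state: {state!r}")
    return normalized in TERMINAL_STATES


def has_partial_results(state):
    """EXPIRED jobs keep the requests that finished before the 24-hour limit."""
    return state.strip().upper() == "EXPIRED"


for state in ("PENDING", "RUNNING", "EXPIRED"):
    print(state, is_terminal(state))
```

Raising on unknown states, rather than silently treating them as active, keeps a polling loop from spinning forever if a new state is introduced.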
* **Per-request limits:** Same as [Chat Completion API limits](/api-reference/post-chatcompletions)
* **Input dataset:** Max 1GB
* **Output dataset:** Max 8GB (job may expire early if reached)
* **Job timeout:** 24 hours maximum
Jobs expire after 24 hours. Completed rows are billed and saved to the output dataset.
**Resume processing:**
```bash theme={null}
firectl batch-inference-job create \
--continue-from original-job-id \
--model accounts/fireworks/models/llama-v3p1-8b-instruct \
--output-dataset-id new-output-dataset
```
This processes only unfinished/failed requests from the original job.
**Download complete lineage:**
```bash theme={null}
firectl dataset download output-dataset-id --download-lineage
```
Downloads all datasets in the continuation chain.
* **Validate thoroughly:** Check dataset format before uploading
* **Descriptive IDs:** Use meaningful `custom_id` values for tracking
* **Optimize tokens:** Set reasonable `max_tokens` limits
* **Monitor progress:** Track long-running jobs regularly
* **Cache optimization:** Place static content first in prompts
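To make the "validate thoroughly" advice concrete, here is a small local pre-upload check. It covers only the requirements stated on this page (each line is a JSON object with a unique `custom_id` and a `body`) and assumes `custom_id` is a string; it does not validate the request parameters inside `body`. `validate_batch_lines` is a hypothetical helper, not a Fireworks tool.

```python
import json


def validate_batch_lines(lines):
    """Return (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "line is not a JSON object"))
            continue
        custom_id = record.get("custom_id")
        if not isinstance(custom_id, str) or not custom_id:
            problems.append((number, "missing custom_id"))
        elif custom_id in seen_ids:
            problems.append((number, f"duplicate custom_id {custom_id!r}"))
        else:
            seen_ids.add(custom_id)
        if not isinstance(record.get("body"), dict):
            problems.append((number, "missing body object"))
    return problems


sample = [
    '{"custom_id": "request-1", "body": {"messages": []}}',
    '{"custom_id": "request-1", "body": {"messages": []}}',
    '{"custom_id": "request-3"}',
    "not json",
]
for number, problem in validate_batch_lines(sample):
    print(f"line {number}: {problem}")
```

To check a real file, pass `open("batch_input_data.jsonl", encoding="utf-8").read().splitlines()` as `lines`.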
## Next Steps
Maximize cost savings with automatic prompt caching
Create custom models for your batch workloads
Full API documentation for Batch API
# Completions API
Source: https://docs.fireworks.ai/guides/completions-api
Use the completions API for raw text generation with custom prompt templates
The completions API provides raw text generation without automatic message formatting. Use this when you need full control over prompt formatting or when working with base models.
## When to use completions
**Use the completions API for:**
* Custom prompt templates with specific formatting requirements
* Base models (non-instruct/non-chat variants)
* Fine-grained control over token-level formatting
* Legacy applications that depend on raw completion format
**For most use cases, use [chat completions](/guides/querying-text-models) instead.** Chat completions handles message formatting automatically and works better with instruct-tuned models.
## Basic usage
```python theme={null}
from fireworks import Fireworks
client = Fireworks()
response = client.completions.create(
model="accounts/fireworks/models/deepseek-v3p1",
prompt="Once upon a time"
)
print(response.choices[0].text)
```
```python theme={null}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1"
)
response = client.completions.create(
model="accounts/fireworks/models/deepseek-v3p1",
prompt="Once upon a time"
)
print(response.choices[0].text)
```
```javascript theme={null}
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference/v1",
});
const response = await client.completions.create({
model: "accounts/fireworks/models/deepseek-v3p1",
prompt: "Once upon a time",
});
console.log(response.choices[0].text);
```
```bash theme={null}
curl https://api.fireworks.ai/inference/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-d '{
"model": "accounts/fireworks/models/deepseek-v3p1",
"prompt": "Once upon a time"
}'
```
Most models automatically prepend the beginning-of-sequence (BOS) token to your prompt. Verify this with the `raw_output` parameter if needed.
## Custom prompt templates
The completions API is useful when you need to implement custom prompt formats:
```python theme={null}
from fireworks import Fireworks
client = Fireworks()
# Custom few-shot prompt template
prompt = """Task: Classify the sentiment of the following text.
Text: I love this product!
Sentiment: Positive
Text: This is terrible.
Sentiment: Negative
Text: The weather is nice today.
Sentiment:"""
response = client.completions.create(
model="accounts/fireworks/models/deepseek-v3p1",
prompt=prompt,
max_tokens=10,
temperature=0
)
print(response.choices[0].text) # Output: " Positive"
```
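If you reuse the same template with different examples, it can be convenient to assemble the prompt from data. The sketch below rebuilds the few-shot layout shown above; `build_few_shot_prompt` is a hypothetical helper, and the single-newline separator is an assumption you can change to suit your template.

```python
def build_few_shot_prompt(task, examples, query, label="Sentiment"):
    """Assemble a few-shot prompt that ends where the model should continue."""
    lines = [f"Task: {task}"]
    for text, answer in examples:
        lines.append(f"Text: {text}")
        lines.append(f"{label}: {answer}")
    lines.append(f"Text: {query}")
    lines.append(f"{label}:")  # left open so the completion supplies the label
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Classify the sentiment of the following text.",
    [("I love this product!", "Positive"), ("This is terrible.", "Negative")],
    "The weather is nice today.",
)
print(prompt)
```

Because the completions API applies no chat template, the string is sent as written, so the trailing `Sentiment:` with no newline after it is what cues the model to emit only the label.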
## Common parameters
All [chat completions parameters](/guides/querying-text-models) work with completions:
* `temperature` - Control randomness (0-2)
* `max_tokens` - Limit output length
* `top_p`, `top_k`, `min_p` - Sampling parameters
* `stream` - Stream responses token-by-token
* `frequency_penalty`, `presence_penalty` - Reduce repetition
See the [API reference](/api-reference/post-completions) for complete parameter documentation.
## Querying deployments
Use completions with [on-demand deployments](/guides/ondemand-deployments) by specifying the deployment identifier:
```python theme={null}
response = client.completions.create(
model="accounts//deployments/",
prompt="Your prompt here"
)
```
## Next steps
* Use chat completions for most use cases
* Stream responses for real-time UX
* Complete API documentation
# Tool Calling
Source: https://docs.fireworks.ai/guides/function-calling
Connect models to external tools and APIs
Tool calling (also known as function calling) enables models to intelligently select and use external tools based on user input. You can build agents that access APIs, retrieve real-time data, or perform actions—all through [OpenAI-compatible](https://platform.openai.com/docs/guides/function-calling) tool specifications.
**How it works:**
1. Define tools using [JSON Schema](https://json-schema.org/learn/getting-started-step-by-step) (name, description, parameters)
2. Model analyzes the query and decides whether to call a tool
3. If needed, model returns structured tool calls with parameters
4. You execute the tool and send results back for the final response
## Quick example
Define tools and send a request - the model will return structured tool calls when needed:
Initialize the client:
```python Python (Fireworks SDK) theme={null}
from fireworks import Fireworks
client = Fireworks()
```
```python Python (OpenAI SDK) theme={null}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1"
)
```
Define the tools and make the request:
```python theme={null}
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}]
response = client.chat.completions.create(
model="accounts/fireworks/models/kimi-k2-instruct-0905",
messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
tools=tools,
temperature=0.1
)
print(response.choices[0].message.tool_calls)
# Output: [ChatCompletionMessageToolCall(id='call_abc123', function=Function(arguments='{"location":"San Francisco"}', name='get_weather'), type='function')]
```
For best results with tool calling, use a low temperature (0.0-0.3) to reduce hallucinated parameter values and ensure more deterministic tool selection.
```python theme={null}
import json

# Step 1: Define your tools
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
}]

# Your actual tool implementation
def get_weather(location, unit="celsius"):
    # In production, call your weather API here
    return {"temperature": 72, "condition": "sunny", "unit": unit}

# Step 2: Send initial request
messages = [{"role": "user", "content": "What's the weather in San Francisco?"}]
response = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2-instruct-0905",
    messages=messages,
    tools=tools,
    temperature=0.1
)

# Step 3: Check if model wants to call a tool
if response.choices[0].message.tool_calls:
    # Step 4: Execute the tool
    tool_call = response.choices[0].message.tool_calls[0]
    # Parse arguments and call your function
    function_args = json.loads(tool_call.function.arguments)
    function_response = get_weather(**function_args)

    # Step 5: Send tool response back to model
    messages.append(response.choices[0].message)  # Add assistant's tool call
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(function_response)
    })

    # Step 6: Get final response
    final_response = client.chat.completions.create(
        model="accounts/fireworks/models/kimi-k2-instruct-0905",
        messages=messages,
        tools=tools,
        temperature=0.1
    )
    print(final_response.choices[0].message.content)
    # Example output (exact wording varies by model): "It's currently 72° and sunny in San Francisco."
```
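The example above handles a single tool call. When a response may contain several tool calls, or several tools are registered, a name-to-function registry keeps the dispatch logic in one place. This is a minimal sketch, not the SDK's own mechanism: `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names, and the `SimpleNamespace` objects stand in for the tool-call objects returned by the API.

```python
import json
from types import SimpleNamespace


def get_weather(location, unit="celsius"):
    # Stub implementation; replace with a real weather API call
    return {"temperature": 72, "condition": "sunny", "unit": unit}


# Hypothetical registry mapping tool names to Python callables
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each tool call and return the matching `tool` role messages."""
    messages = []
    for call in tool_calls:
        func = registry.get(call.function.name)
        if func is None:
            result = {"error": f"unknown tool: {call.function.name}"}
        else:
            try:
                result = func(**json.loads(call.function.arguments))
            except (json.JSONDecodeError, TypeError) as exc:
                # Malformed JSON, or arguments that do not match the function signature
                result = {"error": str(exc)}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return messages


# Stand-in for response.choices[0].message.tool_calls
fake_calls = [
    SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="get_weather", arguments='{"location": "San Francisco"}'),
    )
]
tool_messages = run_tool_calls(fake_calls)
print(tool_messages)
```

Returning errors as tool messages, rather than raising, lets the model see what went wrong and possibly retry with corrected arguments; whether that is desirable depends on your application.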
## Defining tools
Tools are defined using [JSON Schema](https://json-schema.org/understanding-json-schema/reference) format. Each tool requires:
* **name**: Function identifier (a-z, A-Z, 0-9, underscores, dashes; max 64 characters)
* **description**: Clear explanation of what the function does (used by the model to decide when to call it)
* **parameters**: JSON Schema object describing the function's parameters
Write detailed descriptions and parameter definitions. The model relies on these to select the correct tool and provide appropriate arguments.
### Parameter types
JSON Schema supports: `string`, `number`, `integer`, `object`, `array`, `boolean`, and `null`. You can also:
* Use `enum` to restrict values to specific options
* Mark parameters as `required` or optional
* Provide descriptions for each parameter
* Reuse subschemas via `$defs` / `definitions` and `$ref`, including recursive references (e.g. linked lists, trees, mutually recursive types)
* Carry `$id`, `$schema`, and other annotation keywords (no external fetches are performed)
The same JSON Schema feature set is supported in tool `parameters` and in `response_format={"type": "json_schema", ...}`. See [JSON Schema Support](/structured-responses/structured-response-formatting#json-schema-support) for the full list, including the rules for resolving `$ref` and how to reach definitions placed inside nested subschemas.
The following shape is commonly produced by pydantic v2, Instructor, and LangChain `TypeAdapter`. The `$defs` block holds the reusable subschemas, and properties reference them by `$ref`:
```python theme={null}
tools = [{
"type": "function",
"function": {
"name": "submit_order",
"description": "Submit an order with line items and a customer",
"parameters": {
"type": "object",
"$defs": {
"Product": {"type": "object", "properties": {"name": {"type": "string"}, "price": {"type": "number"}}, "required": ["name", "price"]},
"Customer": {"type": "object", "properties": {"name": {"type": "string"}, "email": {"type": "string"}}, "required": ["name", "email"]}
},
"properties": {
"items": {"type": "array", "items": {"$ref": "#/$defs/Product"}},
"customer": {"$ref": "#/$defs/Customer"}
},
"required": ["items", "customer"]
}
}
}]
```
```python theme={null}
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name, e.g. San Francisco"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location"]
}
}
},
{
"type": "function",
"function": {
"name": "search_restaurants",
"description": "Search for restaurants by cuisine type",
"parameters": {
"type": "object",
"properties": {
"cuisine": {
"type": "string",
"description": "Type of cuisine (e.g., Italian, Mexican)"
},
"location": {
"type": "string",
"description": "City or neighborhood"
},
"price_range": {
"type": "string",
"enum": ["$", "$$", "$$$", "$$$$"]
}
},
"required": ["cuisine", "location"]
}
}
}
]
```
## Additional configurations
### tool\_choice
The [`tool_choice`](/api-reference/post-chatcompletions) parameter controls how the model uses tools:
* **`auto`** (default): Model decides whether to call a tool or respond directly
* **`none`**: Model will not call any tools
* **`required`**: Model must call at least one tool
* **Specific function**: Force the model to call a particular function
```python theme={null}
# Force a specific tool
response = client.chat.completions.create(
model="accounts/fireworks/models/kimi-k2-instruct-0905",
messages=[{"role": "user", "content": "What's the weather?"}],
tools=tools,
tool_choice={"type": "function", "function": {"name": "get_weather"}},
temperature=0.1
)
```
Some models support parallel tool calling, where multiple tools can be called in a single response. Check the model's capabilities before relying on this feature.
## Streaming
Tool calls work with streaming responses. Arguments are sent incrementally as the model generates them:
```python theme={null}
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    }
}]

stream = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2-instruct-0905",
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
    tools=tools,
    stream=True,
    temperature=0.1
)

# Accumulate tool call data
tool_calls = {}
for chunk in stream:
    if chunk.choices[0].delta.tool_calls:
        for tool_call in chunk.choices[0].delta.tool_calls:
            index = tool_call.index
            if index not in tool_calls:
                tool_calls[index] = {"id": "", "name": "", "arguments": ""}
            if tool_call.id:
                tool_calls[index]["id"] = tool_call.id
            if tool_call.function and tool_call.function.name:
                tool_calls[index]["name"] = tool_call.function.name
            if tool_call.function and tool_call.function.arguments:
                tool_calls[index]["arguments"] += tool_call.function.arguments
    if chunk.choices[0].finish_reason == "tool_calls":
        for tool_call in tool_calls.values():
            args = json.loads(tool_call["arguments"])
            print(f"Calling {tool_call['name']} with {args}")
        break
```
## Troubleshooting
**Model isn't calling tools:**
* Check that your tool descriptions are clear and detailed
* Ensure the user query clearly indicates a need for the tool
* Try using `tool_choice="required"` to force tool usage
* Verify your model supports tool calling (check `supportsTools` field)

**Incorrect parameter values:**
* Add more detailed parameter descriptions
* Use lower temperature (0.0-0.3) for more deterministic outputs
* Provide examples in parameter descriptions
* Use `enum` to constrain values to specific options

**Malformed tool call arguments:**
* Always validate tool call arguments before using them
* Handle partial or malformed JSON gracefully in production
* Use `try`/`except` blocks when parsing `tool_call.function.arguments`
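The last three points can be combined into one defensive parsing helper. This is a minimal sketch, assuming the arguments arrive as a JSON string as in the examples above; `parse_tool_arguments` is a hypothetical helper, and it checks only for required keys, not full JSON Schema validity.

```python
import json


def parse_tool_arguments(raw_arguments, required=()):
    """Return (args, error); exactly one of the two is None."""
    try:
        args = json.loads(raw_arguments)
    except (json.JSONDecodeError, TypeError) as exc:
        # Truncated or malformed JSON, or a non-string input such as None
        return None, f"invalid JSON: {exc}"
    if not isinstance(args, dict):
        return None, "arguments must be a JSON object"
    missing = [name for name in required if name not in args]
    if missing:
        return None, f"missing required parameters: {missing}"
    return args, None


print(parse_tool_arguments('{"location": "San Francisco"}', required=["location"]))
print(parse_tool_arguments('{"location": "San Fr', required=["location"]))
```
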
If a request fails with a 400 error on the schema, the schema in `parameters` (or `response_format`) may be reaching a `$ref` that Fireworks cannot resolve. Common causes:
* **External `$ref` URI** (e.g. `https://example.com/schema.json`). Only in-document JSON Pointer fragments (`#/...`) are supported. Inline the referenced subschema or hoist it into `$defs`.
* **Wrong fragment path**, e.g. `#/components/schemas/Foo` when the document has no `components` key. Match the pointer to the actual document layout.
* **Older deployment image.** Recursive `$ref`, root `$id`, and nested `$defs` placed under a property all became supported in mid-2026 and earlier images returned 400 on those shapes. If you control a self-hosted/dedicated deployment, redeploy on a current image.
For the full set of supported reference forms, see [JSON Schema Support](/structured-responses/structured-response-formatting#json-schema-support).
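Before sending a schema, it can help to check locally that every `$ref` is an in-document pointer that resolves. The sketch below is a rough client-side check, not Fireworks' resolver: `resolve_pointer` and `find_unresolvable_refs` are hypothetical helpers that follow plain JSON Pointer fragments only and do not account for `$id`-based resolution.

```python
def resolve_pointer(document, pointer):
    """Resolve a JSON Pointer fragment such as '#/$defs/Product' against `document`."""
    node = document
    for token in pointer.lstrip("#").split("/")[1:]:
        # Undo JSON Pointer escaping (~1 is '/', ~0 is '~')
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            node = node[int(token)]
        else:
            node = node[token]
    return node


def find_unresolvable_refs(schema):
    """Return every `$ref` value that is external or does not resolve inside `schema`."""
    problems = []

    def walk(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str):
                if not ref.startswith("#"):
                    problems.append(ref)  # external URI
                else:
                    try:
                        resolve_pointer(schema, ref)
                    except (KeyError, IndexError, ValueError, TypeError):
                        problems.append(ref)  # fragment path does not exist
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(schema)
    return problems


schema = {
    "type": "object",
    "$defs": {"Product": {"type": "object"}},
    "properties": {
        "items": {"type": "array", "items": {"$ref": "#/$defs/Product"}},
        "customer": {"$ref": "#/components/schemas/Customer"},
        "address": {"$ref": "https://example.com/schema.json"},
    },
}
print(find_unresolvable_refs(schema))
```

Here the `customer` and `address` references are reported, matching the first two causes listed above, while the `#/$defs/Product` reference resolves.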
## Next steps
* Enforce JSON schemas for consistent responses
* Learn about chat completions and other APIs
* Deploy models on dedicated GPUs
* Full chat completions API documentation
# Inference Error Codes
Source: https://docs.fireworks.ai/guides/inference-error-codes
Common error codes, their meanings, and resolutions for inference requests
Understanding error codes helps you quickly diagnose and resolve issues when making inference requests to the Fireworks API.
## Common error codes
| **Code** | **Error Name** | **Possible Issue(s)** | **How to Resolve** |
| -------- | --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `400` | Bad Request | Invalid input or malformed request. | Review the request parameters and ensure they match the expected format. |
| `401` | Unauthorized | Invalid API key or insufficient permissions. | Verify your API key and ensure it has the correct permissions. |
| `402` | Payment Required | Account is not on a paid plan or has exceeded usage limits. | Check your billing status and ensure your payment method is up to date. Upgrade your plan if necessary. |
| `403` | Forbidden | Authentication issues. | Verify you have the correct API key. |
| `404` | Not Found | The API endpoint path doesn't exist, the model doesn't exist, the model is not deployed, or you don't have permission to access it. | Verify the URL path in your request and ensure you are using the correct API endpoint. Check if the model exists and is available. Ensure you have the necessary permissions. |
| `405` | Method Not Allowed | Using an unsupported HTTP method (e.g., using GET instead of POST). | Check the API documentation for the correct HTTP method. |
| `408` | Request Timeout | The request took too long to complete, possibly due to server overload or network issues. | Retry the request after a brief wait. Consider increasing the timeout value if applicable. |
| `412` | Precondition Failed | Account is suspended or there's an issue with account status. This error also occurs when attempting to invoke a LoRA model that failed to load. | Check your account status and billing information. For LoRA models, ensure the model was uploaded correctly and is compatible. Contact support if the issue persists. |
| `413` | Payload Too Large | Input data exceeds the allowed size limit. | Reduce the size of the input payload (e.g., by trimming large text or image data). |
| `429` | Too Many Requests | Rate limited (serverless) or deployment capacity exceeded (dedicated/on-demand). | See [understanding 429 errors](#understanding-429-errors) below. |
| `500` | Internal Server Error | Server-side code bug that is unlikely to resolve on its own. | Contact Fireworks support immediately, as this error typically requires intervention from the engineering team. |
| `502` | Bad Gateway | The server received an invalid response from an upstream server. | Wait and retry the request. If the error persists, it may indicate a server outage. |
| `503` | Service Unavailable | The service is down for maintenance or experiencing issues. | Retry the request after some time. Check the [status page](https://status.fireworks.ai) for maintenance announcements. |
| `504` | Gateway Timeout | The server did not receive a response in time from an upstream server. | Wait briefly and retry the request. Consider using a shorter input prompt if applicable. |
| `520` | Unknown Error | An unexpected error occurred with no clear explanation. | Retry the request. If the issue persists, contact support for further assistance. |
## Understanding 429 errors
HTTP 429 (`Too Many Requests`) can be returned on both serverless and dedicated/on-demand deployments, but the cause and recommended action differ.
### Serverless deployments
On serverless, a 429 means your account has exceeded its current serverless request or TPM limit. Standard serverless, Priority tier, and Fast all use the same public rate-limit policy, which combines request-rate limits with adaptive TPM limits and is designed to prevent very spiky traffic. To resolve:
* Wait briefly and retry with exponential backoff
* Smooth sudden bursts or spread traffic more evenly over time
* Check the rate-limit response headers returned with your requests
* Review [Serverless rate limits](/serverless/rate-limits) if you need more headroom, or use an [on-demand deployment](/guides/ondemand-deployments) for dedicated capacity
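The first suggestion, retrying with exponential backoff, can be sketched as follows. This is an illustrative helper rather than an SDK feature: `call_with_backoff` and its parameters are hypothetical, and the check on a `status_code` attribute is an assumption about how your client reports HTTP errors (the OpenAI Python SDK exposes `status_code` on its API status errors; adapt the check for other clients).

```python
import random
import time


def call_with_backoff(send_request, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call `send_request()` and retry with exponential backoff plus jitter on HTTP 429."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except Exception as exc:
            # Assumption: the client raises an exception carrying `status_code`
            is_rate_limited = getattr(exc, "status_code", None) == 429
            if not is_rate_limited or attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries out so clients do not retry in lockstep
            sleep(random.uniform(0, delay))


# Usage (requires a configured client):
# response = call_with_backoff(
#     lambda: client.chat.completions.create(model=MODEL, messages=MESSAGES)
# )
```

The `sleep` argument is injectable only so the helper can be exercised without real waiting; in production the default `time.sleep` is fine.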
### Dedicated and on-demand deployments
On dedicated and on-demand deployments, **there are no account-level rate limits**. A 429 instead indicates that your deployment's processing capacity is saturated. The inference server returns 429 when the number of queued and active requests exceeds what the deployment's GPUs can handle at that moment.
This is a capacity signal, not quota enforcement. To resolve:
* **Reduce burst concurrency** — lower the number of parallel requests or add client-side rate limiting with backoff
* **Scale up the deployment** — add more replicas or GPUs to increase throughput
* **Optimize request patterns** — use shorter prompts, reduce max output tokens, or batch requests to lower per-request resource consumption
If you consistently see 429 errors on a dedicated or on-demand deployment, it's an indicator that your current GPU allocation is undersized for your traffic. [Contact us](https://fireworks.ai/company/contact-us) to discuss increasing your deployment capacity.
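One way to reduce burst concurrency is to cap the number of in-flight requests on the client. The sketch below uses a thread pool from the Python standard library; `run_with_concurrency_limit` and `max_in_flight` are hypothetical names, and the right limit depends on your deployment's capacity.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_concurrency_limit(tasks, max_in_flight=4):
    """Run zero-argument callables with at most `max_in_flight` executing at once.

    Results are returned in the same order as `tasks`.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


# Stand-in tasks; in practice each callable would send one inference request
results = run_with_concurrency_limit([lambda i=i: i * i for i in range(8)], max_in_flight=2)
print(results)
```
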
## Troubleshooting tips
If you encounter an error not listed here:
* Review the API documentation for the correct usage of endpoints and parameters
* Check the [Fireworks status page](https://status.fireworks.ai) for any ongoing service disruptions
* Contact support at [support@fireworks.ai](mailto:support@fireworks.ai) or join our [Discord](https://discord.gg/fireworks-ai)
Enable detailed error logging in your application to capture the full error response, including error messages and request IDs, which helps with debugging.
# Deployments
Source: https://docs.fireworks.ai/guides/ondemand-deployments
Configure and manage on-demand deployments on dedicated GPUs
**New to deployments?** Start with our [Deployments Quickstart](/getting-started/ondemand-quickstart) to deploy and query your first model in minutes, then return here to learn about configuration options.
On-demand deployments give you dedicated GPUs for your models, providing several advantages over serverless:
* **Better performance** – Lower latency, higher throughput, and predictable performance unaffected by other users
* **No hard rate limits** – Only limited by your deployment's capacity
* **Cost-effective at scale** – Cheaper under high utilization. Unlike serverless models (billed per token), on-demand deployments are [billed by GPU-second](https://fireworks.ai/pricing).
* **Broader model selection** – Access models not available on serverless
* **Custom models** – Upload your own models (for supported architectures) from Hugging Face or elsewhere
Need higher GPU quotas or want to reserve capacity? [Contact us](https://fireworks.ai/contact).
## Creating & querying deployments
**Create a deployment:**
```bash theme={null}
# This command returns your accounts//deployments/ - save it for querying
firectl deployment create accounts/fireworks/models/ --wait
```
**Deployment placement (`--region`) must be set at creation time and cannot be changed in place.**
If you do not specify `--region`, the deployment is pinned to a single datacenter at creation time and will not be automatically migrated later.
For production workloads that need geographic availability or capacity failover, always set `--region` explicitly:
```bash theme={null}
firectl deployment create accounts/fireworks/models/ --region GLOBAL # recommended default
firectl deployment create accounts/fireworks/models/ --region US
firectl deployment create accounts/fireworks/models/ --region EUROPE
firectl deployment create accounts/fireworks/models/ --region APAC
```
### Check current placement
```bash theme={null}
firectl deployment get
```
The deployment metadata shows where the deployment is currently allowed to schedule replicas (placement / region configuration).
### Change placement
There is no supported command to change region placement on an existing deployment. To change placement, recreate the deployment:
```bash theme={null}
# 1. Create replacement with correct region
firectl deployment create accounts/fireworks/models/ \
--deployment-shape \
--region GLOBAL \
--min-replica-count 1
# 2. Verify it's healthy, then point your app at the new endpoint
# 3. Delete old deployment
firectl deployment delete
```
See [Regions](/deployments/regions) for mega-regions and hardware availability.
See [Deployment shapes](#deployment-shapes) below to optimize for speed, throughput, or cost.
**Query your deployment:**
After creating a deployment, query it using this format:
```
accounts//deployments/
```
You can find your deployment name anytime with `firectl deployment list` and `firectl deployment get `.
**Example:**
```
accounts/alice/deployments/12345678
```
### Code examples
```python theme={null}
from fireworks import Fireworks
client = Fireworks()
response = client.chat.completions.create(
model="accounts/fireworks/models/gpt-oss-120b#",
messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}]
)
print(response.choices[0].message.content)
```
```python theme={null}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1"
)
response = client.chat.completions.create(
model="accounts//deployments/",
messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}]
)
print(response.choices[0].message.content)
```
```javascript theme={null}
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: "https://api.fireworks.ai/inference/v1",
});
const response = await client.chat.completions.create({
model: "accounts//deployments/",
messages: [
{
role: "user",
content: "Explain quantum computing in simple terms",
},
],
});
console.log(response.choices[0].message.content);
```
```bash theme={null}
curl https://api.fireworks.ai/inference/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-d '{
"model": "accounts//deployments/",
"messages": [
{
"role": "user",
"content": "Explain quantum computing in simple terms"
}
]
}'
```
### Deployment status states
Deployment states from the Gateway API spec:
* `CREATING` - still being created
* `READY` - ready to be used
* `UPDATING` - in-progress updates happening
* `DELETING` - being deleted
* `DELETED` - soft-deleted
* `FAILED` - creation failed (see status for details)
UI-only states are display labels derived from deployment fields:
* `Inactive`: `state == READY && max_replica_count == 0 && ready_replica_count == 0`
* `Scaled to 0`: `state == READY && min_replica_count == 0 && max_replica_count > 0 && desired_replica_count == 0 && ready_replica_count == 0`
These labels are not backend `Deployment.State` enum values.
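The two conditions above translate directly into code. This is a minimal sketch, assuming the deployment is available as a dict keyed by the field names used in the conditions; `display_label` is a hypothetical helper, not part of `firectl` or the API.

```python
def display_label(deployment):
    """Return the UI display label for a deployment dict, falling back to its backend state."""
    state = deployment["state"]
    if state != "READY" or deployment["ready_replica_count"] != 0:
        return state
    if deployment["max_replica_count"] == 0:
        return "Inactive"
    if (
        deployment["min_replica_count"] == 0
        and deployment["max_replica_count"] > 0
        and deployment["desired_replica_count"] == 0
    ):
        return "Scaled to 0"
    return state


print(display_label({
    "state": "READY",
    "min_replica_count": 0,
    "max_replica_count": 2,
    "desired_replica_count": 0,
    "ready_replica_count": 0,
}))
```
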
## Deployment shapes
Deployment shapes are the primary way to configure deployments. They're pre-configured templates optimized for speed, cost, or efficiency, including hardware, quantization, and other [performance factors](/faq/deployment/performance/optimization#performance-factors).
* **Fast** – Low latency for interactive workloads
* **Throughput** – Optimized for cost per token at scale, for high-volume workloads
* **Minimal** – Lowest cost for testing or light workloads
**Usage:**
```bash theme={null}
# List available shapes
firectl deployment-shape-version list --base-model
# Create with a shape (shorthand)
firectl deployment create accounts/fireworks/models/deepseek-v3 --deployment-shape throughput
# Create with full shape ID
firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \
--deployment-shape accounts/fireworks/deploymentShapes/llama-v3p3-70b-instruct-fast
# View shape details
firectl deployment-shape-version get
```
Need even better performance with tailored optimizations? [Contact our team](https://fireworks.ai/contact).
## Managing & configuring deployments
### Basic management
```bash theme={null}
# List all deployments
firectl deployment list
# Check deployment status
firectl deployment get
# Delete a deployment
firectl deployment delete
```
By default, deployments scale to zero if unused for 1 hour. Deployments with min replicas set to 0 are automatically deleted after 7 days of no traffic.
When a deployment is scaled to zero, requests return a `503` error immediately while the deployment scales up. Your application should implement retry logic to handle this. See [Scaling from zero behavior](/deployments/autoscaling#scaling-from-zero-behavior) for implementation details.
### GPU hardware
Choose GPU type with `--accelerator-type`:
* `NVIDIA_A100_80GB`
* `NVIDIA_H100_80GB`
* `NVIDIA_H200_141GB`
GPU availability varies by [region](/deployments/regions). See the [Hardware selection guide](https://docs.fireworks.ai/faq/deployment/ondemand/hardware-options#hardware-selection).
### Autoscaling
Control replica counts, scale timing, and load targets for your deployment.
See the [Autoscaling guide](/deployments/autoscaling) for configuration options.
### Multiple GPUs per replica
Use multiple GPUs to improve latency and throughput:
```bash theme={null}
firectl deployment create --accelerator-count 2
```
Adding GPUs speeds up generation, but scaling is sub-linear: doubling the GPU count does not double performance.
## Advanced
* **[Speculative decoding](/deployments/speculative-decoding)** - Speed up text generation using draft models or n-gram speculation
* **[Quantization](/models/quantization)** - Reduce model precision (e.g., FP16 to FP8) to improve speeds and reduce costs by 30-50%
* **[Performance benchmarking](/deployments/benchmarking)** - Measure and optimize your deployment's performance with load testing
* **[Managing default deployments](/deployments/managing-default-deployments)** - Control which deployment handles queries when using just the model name
* **[Publishing deployments](/deployments/publishing-deployments)** - Make your deployment accessible to other Fireworks users
## Next steps
* Configure autoscaling for optimal cost and performance
* Deploy your own models from Hugging Face
* Reduce costs with model quantization
* Choose deployment regions for optimal latency
* Purchase reserved GPUs for guaranteed capacity
* Fine-tune models for your specific use case
# Using predicted outputs
Source: https://docs.fireworks.ai/guides/predicted-outputs
Use Predicted Outputs to boost output generation speeds for editing / rewriting use cases
This feature is in beta and we are working on improvements. We welcome your feedback on [Discord](https://discord.gg/fireworks-ai).
When large parts of the LLM output are known in advance, e.g. when editing or rewriting a document or code snippet, you can improve output generation speed with Predicted Outputs. Predicted Outputs lets you provide a strong "guess" of what the output may look like.
To use Predicted Outputs, set the `prediction` field in the Fireworks API with the predicted output. For example, you may want to edit a survey and add an option to contact users by text message:
```
{
"questions": [
{
"question": "Name",
"type": "text"
},
{
"question": "Age",
"type": "number"
},
{
"question": "Feedback",
"type": "text_area"
},
{
"question": "How to Contact",
"type": "multiple_choice",
"options": ["Email", "Phone"],
"optional": true
}
]
}
```
In this case, we expect most of the code to remain the same, so we set the `prediction` field to the original survey code. With the prediction in place, output generation is faster.
```python Python (Fireworks) theme={null}
from fireworks.client import Fireworks
code = """{
"questions": [
{
"question": "Name",
"type": "text"
},
{
"question": "Age",
"type": "number"
},
{
"question": "Feedback",
"type": "text_area"
},
{
"question": "How to Contact",
"type": "multiple_choice",
"options": ["Email", "Phone"],
"optional": true
}
]
}
"""
client = Fireworks(api_key="")
response = client.chat.completions.create(
model="accounts/fireworks/models/llama-v3p1-70b-instruct",
messages=[{
"role": "user",
"content": "Edit the How to Contact question to add an option called Text Message. Output the full edited code, with no markdown or explanations.",
},
{
"role": "user",
"content": code
}
],
temperature=0,
prediction={"type": "content", "content": code}
)
print(response.choices[0].message.content)
```
### Additional information on Predicted Outputs:
* Using Predicted Outputs is free at this time.
* We recommend setting `temperature=0` for most intended use cases of Predicted Outputs. In these cases, using Predicted Outputs does not affect the quality of the generated output.
* If the prediction is substantially different from the generated output, output generation speed may decrease.
* The maximum length of the `prediction` field is set by `max_tokens`, which defaults to 2048. Increase `max_tokens` if your input and prediction are longer.
* If you are using an on-demand deployment, you can set `rewrite_speculation=True` and potentially get even faster output generation. We are working on rolling this out to Serverless soon.
# Prompt caching
Source: https://docs.fireworks.ai/guides/prompt-caching
Prompt caching is a performance optimization feature that allows Fireworks to
respond faster to requests with prompts that share common prefixes. In many
situations, it can reduce time to first token (TTFT) by as much as 80%.
Prompt caching is enabled by default for all Fireworks models and deployments.
For serverless models, cached prompt tokens are discounted compared to regular prompt tokens. The default discount is 50%, but the exact discount varies by model. Check the [Model Library](https://fireworks.ai/models) for model-specific cached token pricing.
For dedicated deployments, prompt caching frees up resources, leading to higher
throughput on the same hardware. Cached tokens on dedicated deployments are nearly
free, because they count toward context length but need no extra processing.
## Using prompt caching
### Common use cases
Requests to LLMs often share a large portion of their prompt. For example:
* Long system prompts with detailed instructions
* Descriptions of available tools for function calling
* Growing previous conversation history for chat use cases
* Shared per-user context, like a current file for a coding assistant
Prompt caching avoids re-processing the cached prefix of the prompt and starts output generation much sooner.
### Structuring prompts for caching
Prompt caching works only for exact prefix matches within a prompt. To
realize caching benefits, place static content like instructions and examples at
the beginning of your prompt, and put variable content, such as user-specific
information, at the end.
For function calling models, tools are considered part of the prompt.
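To make the ordering concrete, the sketch below assembles chat messages with the static parts first and the per-request parts last, then counts how many leading messages two requests share. `build_messages` and `shared_prefix_length` are hypothetical helpers for illustration; actual cache matching happens on the server against the rendered prompt, not per message.

```python
def build_messages(system_prompt, few_shot_examples, user_context, question):
    """Order messages so static content forms a shared prefix across requests."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in few_shot_examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # Variable, user-specific content goes last so it does not break the shared prefix
    messages.append({"role": "user", "content": f"{user_context}\n\n{question}"})
    return messages


def shared_prefix_length(a, b):
    """Count the leading messages that are identical in both lists."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


system = "You are a support assistant. Follow the policy below."
examples = [("How do I reset my password?", "Go to Settings and choose Reset password.")]
first = build_messages(system, examples, "Plan: free", "Can I export my data?")
second = build_messages(system, examples, "Plan: enterprise", "How do I add a teammate?")
print(shared_prefix_length(first, second))
```

If the user-specific context were placed in the system message instead, the two requests would diverge at the first message and share no prefix at all.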
### Optimizing inference requests for caching
Prompt caching only works within a single replica. If you are using serverless or a
deployment with multiple replicas, you need to give us hints about where to send
the traffic to maximize prompt cache hit rates. To do this, send a unique
identifier for each user or session whose prompts you expect to share the same
prefix.
You may place this identifier either in the `user` field of the request body or
in the `x-session-affinity` header.
If you're collecting RL rollouts and want per-trajectory metrics plus
stickiness across multi-turn agent interactions, also set
`x-multi-turn-session-id`. See [Inference for RL
rollouts](/guides/rollout-inference#session-affinity).
```python theme={null}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ.get("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1",
)
response = client.chat.completions.create(
model="accounts/fireworks/models/