
This glossary covers terms you’ll encounter when working with the Fireworks AI platform — across inference, fine-tuning, deployments, security, and the API.

Account & Billing

Account
Your Fireworks AI organization identity. All deployments, models, fine-tuning jobs, and API keys belong to an account. Referenced as accounts/<your-account> in firectl and the API.
Credit
The unit used for prepaid usage on Fireworks. Credits are consumed as you use inference and training resources.
Rate limiting / 429 errors
A 429 Too Many Requests response means your account or deployment has hit a concurrency or requests-per-minute limit. On serverless, limits scale automatically with usage tier. On dedicated deployments, adding replicas or adjusting autoscaling increases capacity.
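A common way to handle 429s client-side is exponential backoff with jitter. This is a minimal sketch, not Fireworks SDK code: `RateLimitError` and `send_request` are hypothetical stand-ins for whatever exception and call your client uses.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the exception your SDK raises on HTTP 429."""

def call_with_backoff(send_request, max_retries=5):
    """Retry a request on 429 with exponential backoff plus jitter.

    `send_request` is any zero-argument callable that raises
    RateLimitError when the limit is hit.
    """
    for attempt in range(max_retries):
        try:
            return send_request()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the 429 to the caller
            # Wait 1s, 2s, 4s, ... plus jitter to avoid thundering herds.
            time.sleep(2 ** attempt + random.random())
```

On dedicated deployments, persistent 429s are a signal to add replicas rather than retry harder.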
Quota
A per-account cap on a resource — such as requests per minute (RPM), GPU hours, or concurrent replicas. Quotas can be increased by contacting support.
ZDR (Zero Data Retention)
The default behavior for all open model inference on Fireworks: no prompts, completions, or request logs are stored after the response is returned. ZDR is on by default — it is not an opt-in feature.
BYOC (Bring Your Own Cloud)
An enterprise option to run inference entirely within your own cloud account. Your data never touches Fireworks-managed infrastructure. Contact sales for availability.

Fireworks Platform

Serverless inference
Pay-per-token inference with no deployment management. Requests are routed globally across Fireworks infrastructure for lowest latency and highest availability. Serverless does not support geographic constraints — for region-specific inference, use a dedicated deployment and set placement with the --region flag on firectl deployment create (for example GLOBAL, US, EUROPE, APAC).
Dedicated deployment
A deployment you provision and manage, with reserved GPU capacity. Gives you control over model, hardware, placement, autoscaling, and addon support. Created with firectl deployment create.
Multi-region deployment
A dedicated deployment configured to run replicas across multiple datacenters. Enabled with --region GLOBAL (or a specific mega-region: US, EUROPE, APAC) on firectl deployment create. Increases availability and throughput.
Placement
Controls which regions a dedicated deployment is allowed to schedule replicas in. On the CLI, set at creation time with --region (GLOBAL, US, EUROPE, APAC, or a specific region id). Cannot be changed after deployment creation — recreate the deployment to change placement. If not specified, the deployment pins to a single datacenter at creation time.
Deployment shape
The hardware and precision configuration used when creating a dedicated deployment. Shapes encode GPU type, count, precision (BF16, FP8, FP4), and other settings. Specified with --deployment-shape. Some shapes do not support LoRA addons.
Deployment state
The lifecycle status of a deployment: CREATING, DEPLOYING, DEPLOYED, SCALING, UPDATING, FAILED, DELETING, DELETED.
Replica
A single instance of a deployed model. Adding replicas increases concurrency. Replicas can be scaled manually or via autoscaling rules.
Autoscaling
Automatic adjustment of replica count based on traffic. Configured with scale-to-zero, minimum replicas, maximum replicas, and scale-up/down thresholds.
firectl
The Fireworks command-line tool for managing models, deployments, fine-tuning jobs, and account resources.

Models & Inference

Base model
A foundation model available on Fireworks for inference or fine-tuning. Referenced as accounts/fireworks/models/<model-id>.
Addon
A LoRA adapter loaded on top of a base model deployment at inference time. Enabled on a deployment with --enable-addons. The adapter is specified per request by passing the adapter model ID. FP8 and FP4 quantized shapes do not support addons — use a BF16 shape for LoRA addon inference.
Multi-LoRA
A single base model deployment that serves multiple LoRA adapters. The adapter is selected per request. One deployment, multiple fine-tuned behaviors.
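Per-request adapter selection works by setting the request's `model` field to the adapter's ID. A minimal sketch, assuming two adapters on one deployment; the adapter IDs below are illustrative placeholders, not real model names.

```python
# Two requests against the same Multi-LoRA deployment: the `model` field
# selects which LoRA adapter handles each one.
def chat_request(adapter_id, user_message):
    return {
        "model": adapter_id,  # adapter model ID, not the base model ID
        "messages": [{"role": "user", "content": user_message}],
    }

support_req = chat_request(
    "accounts/my-account/models/support-lora", "Reset my password"
)
summary_req = chat_request(
    "accounts/my-account/models/summary-lora", "Summarize this ticket"
)
```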
Quantization
Reducing the numerical precision of model weights to decrease memory usage and increase throughput, with some quality tradeoff.
BF16 (BFloat16)
A 16-bit floating point format used for model weights. Provides good quality with moderate memory usage. BF16 deployment shapes support LoRA addons.
FP8
An 8-bit floating point format. Faster and more memory-efficient than BF16, with a small quality tradeoff. FP8 deployment shapes do not support LoRA addons.
FP4
A 4-bit floating point format. Faster and cheaper than FP8, with a larger quality tradeoff. FP4 deployment shapes do not support LoRA addons.
Speculative decoding
An inference acceleration technique where a smaller draft model proposes tokens that are verified by the main model in parallel. Reduces latency with no quality loss.
KV cache
Key-value cache that stores intermediate attention computations across tokens. Reduces re-computation cost on repeated or shared prefixes. Cache hit percentage is reflected in billing. Configurable with --kv-cache-fraction.
Context window
The maximum number of tokens (input + output) the model can process in a single request.
max_tokens
API parameter that caps how many tokens the model will generate in a single response. Always set this explicitly in agentic workflows — without it, reasoning models and large models may generate very long outputs.
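A minimal request body with the cap set explicitly. The model ID is illustrative; the value 512 is an example, not a recommendation.

```python
# Cap output explicitly so an agent loop can't run away generating tokens.
request = {
    "model": "accounts/fireworks/models/my-model",  # illustrative ID
    "messages": [{"role": "user", "content": "Plan the next step."}],
    "max_tokens": 512,  # hard cap on generated tokens for this response
}
```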
TTFT (Time to First Token)
Latency from when a request is sent to when the first output token is received. Key metric for interactive applications.
TPS (Tokens Per Second)
Throughput metric measuring how many output tokens are generated per second.
Batch inference
Asynchronous large-scale inference via the Fireworks Batch API. Submit a JSONL file of requests; results are returned when the job completes. Lower cost than real-time inference, higher latency, no streaming.
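Assembling a batch input file is plain JSONL: one request object per line. The per-line schema below mirrors the real-time chat request body as a sketch; check the Batch API reference for the exact envelope your account expects, and note the model ID is a placeholder.

```python
import json

# One chat-completion request per line of the JSONL input file.
requests = [
    {
        "model": "accounts/fireworks/models/my-model",  # illustrative ID
        "messages": [{"role": "user", "content": q}],
        "max_tokens": 256,
    }
    for q in ["What is BF16?", "What is FP8?", "What is FP4?"]
]

with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
```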
Streaming
Real-time token-by-token output delivery via server-sent events (SSE). Enabled with stream=True. Not available in batch inference.
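Consuming a stream amounts to concatenating per-chunk text deltas. A sketch using plain dicts as stand-ins for the SDK's chunk objects, shaped like OpenAI-style streaming events:

```python
def accumulate(stream):
    """Concatenate streamed text deltas into the full response."""
    parts = []
    for chunk in stream:
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:  # role-only or final chunks carry no content
            parts.append(delta)
    return "".join(parts)

# Fake stream standing in for the SSE events a real request yields.
fake_stream = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", world"}}]},
]
```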
Tool calling / Function calling
The ability for a model to request that the client execute a function and return the result, enabling agentic and multi-step workflows.
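The client-side half of the round trip looks like this sketch: the model names a function and supplies JSON arguments, the client runs it and replies with a `tool` message. The tool-call shape follows the OpenAI-compatible format; `get_weather` and its toy implementation are invented for illustration.

```python
import json

# Registry of client-side tools the model may request.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",  # toy implementation
}

def run_tool_call(tool_call):
    """Execute one tool call and build the `tool` message to send back."""
    fn = TOOLS[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": fn(**args),
    }

# Fake tool call standing in for what the model returns.
fake_call = {
    "id": "call_1",
    "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
}
```

The returned message is appended to the conversation and sent back so the model can continue the multi-step workflow.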
Structured output
Model output constrained to a specific JSON schema. Fireworks supports constrained decoding to guarantee valid JSON matching your schema.
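A request constrained to a schema, as a sketch. The `response_format` shape here follows the `json_object`-plus-`schema` convention; verify the exact field names against the structured-output docs, and note the model ID is a placeholder.

```python
# JSON schema the completion must conform to.
schema = {
    "type": "object",
    "properties": {
        "sentiment": {
            "type": "string",
            "enum": ["positive", "negative", "neutral"],
        },
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
}

request = {
    "model": "accounts/fireworks/models/my-model",  # illustrative ID
    "messages": [{"role": "user", "content": "Classify: 'Great service!'"}],
    "response_format": {"type": "json_object", "schema": schema},
}
```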
reasoning_effort
API parameter that controls how much thinking a reasoning model performs before generating its response. Set to "none" to disable extended reasoning (useful as a workaround for models prone to reasoning loops).

Fine-tuning

SFT (Supervised Fine-Tuning)
Training a model on labeled input-output pairs to adapt it to a specific task or style. Available via the Fireworks managed fine-tuning pipeline.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning technique that trains a small set of adapter weights rather than the full model. The adapter can be loaded on top of the base model at inference time.
LoRA rank
A hyperparameter that controls the size of the LoRA adapter. Higher rank = more capacity but larger adapter and more training compute.
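The size tradeoff is easy to quantify for a single adapted weight matrix: LoRA replaces a full d x d update with two low-rank factors A (r x d) and B (d x r), so trainable parameters scale linearly with rank. The hidden size below is an example value, not a property of any particular model.

```python
def lora_params(d, r):
    """Trainable params for one adapted d x d matrix at LoRA rank r."""
    return 2 * d * r  # factor A is r x d, factor B is d x r

d = 4096            # hidden size of the adapted projection (example value)
full = d * d        # 16,777,216 params for a full-parameter update
r8 = lora_params(d, 8)    # 65,536 params at rank 8
r32 = lora_params(d, 32)  # 262,144 params at rank 32: 4x rank, 4x adapter
```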
GRPO (Group Relative Policy Optimization)
A reinforcement learning variant used in RFT. Optimizes model outputs using group-relative reward scoring rather than a separate value model.
RFT (Reinforcement Fine-Tuning)
Fine-tuning using reinforcement learning signals rather than labeled examples. Trains the model to maximize a reward function — useful for tasks with verifiable outcomes such as math, code, and reasoning.
Training shape
The hardware configuration used for a fine-tuning job. Selected via --training-shape in firectl. Determines GPU type, count, and whether LoRA or full-parameter training is used.
dcp_save_interval
RFT training parameter that controls how often full training state (weights + optimizer) is checkpointed. Default is 0 (disabled). Set to a positive integer to enable full checkpoint-and-resume including optimizer state.
Epoch
One complete pass through the training dataset.
Step
A single gradient update during training.
Checkpoint
A saved snapshot of model state (and optionally optimizer state) at a point during training.
ppo_kl vs ref_kld
Two KL divergence metrics logged during GRPO/RFT training. ppo_kl measures divergence between current and previous policy — stays near zero with one minibatch per rollout, which is expected. ref_kld measures divergence from the reference/base model — this is the metric to monitor for policy drift.

Deployment

Cold start
The latency incurred when a deployment scales up from zero replicas or provisions a new replica.
Scale-to-zero
A deployment configuration where replica count drops to zero when there is no traffic. Eliminates idle costs but introduces cold-start latency.
GPU hours
The billing unit for dedicated deployment capacity. Charged per GPU per hour of replica uptime.
Prometheus metrics
Per-deployment performance and utilization metrics exposed via a Prometheus-compatible endpoint. Includes TTFT, TPS, GPU utilization, queue depth, and error rates.

API & SDK

API key
An authentication token used to make requests to the Fireworks API. Generated in the Fireworks console under Account > API Keys.
OpenAI-compatible API
The Fireworks inference API is compatible with the OpenAI Chat Completions API format. Use the OpenAI Python SDK with Fireworks by changing base_url and api_key.
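At the HTTP level, compatibility means a standard chat-completions request with your Fireworks key in the `Authorization` header. This sketch builds (but does not send) such a request with the standard library; the base URL shown is the commonly documented one, so confirm it against the current API reference.

```python
import json
import urllib.request

BASE_URL = "https://api.fireworks.ai/inference/v1"  # confirm in API docs

def build_request(api_key, model, messages):
    """Build an OpenAI-style chat-completions request (unsent)."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request(
    "YOUR_API_KEY",  # placeholder
    "accounts/fireworks/models/my-model",  # illustrative ID
    [{"role": "user", "content": "Hi"}],
)
# Sending is one call away: urllib.request.urlopen(req)
```

In practice the OpenAI Python SDK does the same thing for you once `base_url` and `api_key` point at Fireworks.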
reconnect_and_wait()
SDK method for recovering a training job that has been interrupted by pod preemption or a network error. Use in your training loop to make jobs resilient to transient interruptions.
JSONL
JSON Lines format — one JSON object per line. Used for batch inference input files and fine-tuning dataset uploads.

Security & Compliance

ISO 27001
International standard for information security management systems (ISMS). Fireworks has achieved ISO 27001 certification. Certificate available at trust.fireworks.ai.
ISO 27701
Extension to ISO 27001 covering privacy information management. Fireworks has achieved ISO 27701 certification. Certificate available at trust.fireworks.ai.
ISO 42001
International standard for AI management systems. Fireworks has achieved ISO 42001 certification. Certificate available at trust.fireworks.ai.
SOC 2 Type II
An auditing standard that verifies a company’s security, availability, and confidentiality controls over time. Fireworks maintains SOC 2 Type II compliance.
HIPAA
U.S. healthcare data regulation. Fireworks supports HIPAA-compliant deployments for covered entities. Contact sales for Business Associate Agreement (BAA) details.
Audit logs
A record of API activity on your account. Useful for security review and compliance reporting. Available via the Fireworks console and API.
Trust Center
Fireworks’ central repository for security documentation, compliance certificates, and data handling policies. Available at trust.fireworks.ai.

Inference Parameters

Temperature
Controls randomness in token sampling. 0.0 = deterministic. Higher values increase diversity.
Top-p (nucleus sampling)
Limits token sampling to the smallest set of tokens whose cumulative probability exceeds p.
Top-k
Limits token sampling to the top k most probable tokens at each step.
Frequency penalty
Penalizes tokens in proportion to how many times they have already appeared in the output. Discourages verbatim repetition.
Presence penalty
Applies a flat penalty to any token that has appeared at least once in the output, regardless of how often. Encourages the model to introduce new topics.
Stop sequences
One or more strings that cause the model to stop generating when produced. Useful for controlling output format.
System prompt
An instruction prepended to the conversation that sets the model’s persona, task, or constraints.
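The parameters in this section combine in a single request body. A sketch with illustrative values (they are starting points, not recommendations) and a placeholder model ID:

```python
request = {
    "model": "accounts/fireworks/models/my-model",  # illustrative ID
    "messages": [
        # System prompt sets persona and constraints.
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "List three uses of FP8."},
    ],
    "temperature": 0.7,        # moderate randomness
    "top_p": 0.9,              # nucleus sampling cutoff
    "top_k": 40,               # cap candidate tokens per step
    "frequency_penalty": 0.2,  # damp repetition
    "presence_penalty": 0.1,   # nudge toward new topics
    "stop": ["\n\n"],          # stop at the first blank line
}
```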