After fine-tuning your model on Fireworks, deploy it to make it available for inference. Fireworks supports two deployment methods for LoRA fine-tuned models: live merge and multi-LoRA. Each method has different tradeoffs around performance, cost, and flexibility.
You can also upload and deploy LoRA models fine-tuned outside of Fireworks. See importing fine-tuned models for details.
## Choosing a deployment method
Fireworks offers two ways to deploy LoRA fine-tuned models. The right choice depends on how many fine-tuned variants you need to serve and your performance requirements.

| | Live merge | Multi-LoRA |
|---|---|---|
| How it works | LoRA weights are merged into the base model at deployment time, creating a single merged model | Base model is deployed with addon support; LoRA adapters are loaded dynamically at request time |
| Number of LoRAs | One per deployment | Multiple per deployment |
| Inference performance | Matches the base model (no overhead) | Some overhead per request due to dynamic adapter application |
| Throughput | Same as base model | Lower maximum throughput under high concurrency |
| Cost efficiency | One deployment per fine-tune | Share a single deployment across many fine-tunes |
| Best for | Production workloads requiring maximum performance | Experimentation, A/B testing, or serving many variants of the same base model |
## Live merge deployment
Live merge is the simplest way to deploy a fine-tuned model. Fireworks automatically merges the LoRA weights into the base model at deployment time, producing a model that performs identically to a natively fine-tuned model with no inference overhead.

### How it works
When you deploy a LoRA model directly, Fireworks:
- Takes your LoRA adapter weights and the base model
- Merges them into a single set of weights at deployment time
- Serves the merged model as a standalone deployment
### Deploy with live merge
Deploy your LoRA fine-tuned model with a single command (shown below). Your deployment will be ready to use once it completes, with performance that matches the base model.
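As an illustrative sketch, deploying a fine-tuned model with `firectl` looks like the following; the account and model IDs are placeholders:

```bash
# Deploy the fine-tuned model; Fireworks merges the LoRA weights
# into the base model at deployment time (live merge).
firectl create deployment accounts/<your-account>/models/<your-fine-tuned-model>
```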
### Sending requests
Send inference requests to your live-merge deployment by referencing the deployment directly:
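A minimal sketch using the Fireworks Python SDK; the model ID is a placeholder, and a curl request against the same OpenAI-compatible chat completions endpoint follows the same pattern:

```python
from fireworks.client import Fireworks

client = Fireworks(api_key="<FIREWORKS_API_KEY>")

# Reference your deployed fine-tuned model; the resource name below
# is illustrative, not an actual model ID.
response = client.chat.completions.create(
    model="accounts/<your-account>/models/<your-fine-tuned-model>",
    messages=[{"role": "user", "content": "Summarize our Q3 results in one sentence."}],
)
print(response.choices[0].message.content)
```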
### When to use live merge
- You need maximum inference performance (latency and throughput matching the base model)
- You are serving a single fine-tuned model in production
- You want the simplest possible deployment workflow
## Multi-LoRA deployment
Multi-LoRA lets you load multiple LoRA adapters onto a single base model deployment. This is useful when you have several fine-tuned variants of the same base model and want to share GPU resources across them rather than creating a separate deployment for each.

### How it works
With multi-LoRA:
- You deploy the base model with addon support enabled
- You load one or more LoRA adapters onto the running deployment
- At inference time, the correct adapter is selected and applied dynamically based on the model specified in the request
### LoRA addon shape compatibility
Not all deployment shapes support LoRA addons. FP8 and FP4 quantized shapes do not support `--enable-addons`.
| Precision | `--enable-addons` supported? |
|---|---|
| BF16 | ✅ Yes |
| FP8 | ❌ No |
| FP4 | ❌ No |
If your base model's default deployment shape is quantized, you have two options: deploy the base model at a BF16 shape, which supports `--enable-addons` (Option 1), or upload a custom copy of the model with a BF16 default shape (Option 2). See Uploading custom models and firectl model create.

If enabling addons fails, the error message indicates the cause:
- "addons cannot be enabled with quantized precisions (FP8/FP4)" — your model's default shape is quantized; use Option 1 or 2 above.
- "the deployment shape version does not exist or you do not have access to it" — the shape you requested is not available on your account; contact support.

### Deploy with multi-LoRA
### Sending requests
To route inference requests to a specific LoRA adapter on a multi-LoRA deployment, set the `model` field to `<model_name>#<deployment_name>`. The `#` separator tells Fireworks to route the request to the specified adapter on the given deployment.
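A minimal sketch with the Fireworks Python SDK; the names are placeholders, and the OpenAI SDK, JavaScript, and curl clients follow the same `model#deployment` convention:

```python
from fireworks.client import Fireworks

client = Fireworks(api_key="<FIREWORKS_API_KEY>")

# The "#" separator routes the request to the named LoRA adapter
# on the given multi-LoRA deployment (placeholder names shown).
response = client.chat.completions.create(
    model="<model_name>#<deployment_name>",
    messages=[{"role": "user", "content": "Classify this ticket: 'App crashes on login.'"}],
)
print(response.choices[0].message.content)
```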
### When to use multi-LoRA
- You need to serve multiple fine-tuned models based on the same base model
- You want to maximize GPU utilization by sharing a single deployment
- You are running experiments or A/B tests across multiple fine-tuned variants
- You can accept some performance overhead compared to live merge
## Performance considerations
Live merge eliminates all LoRA-related inference overhead because the adapter weights are baked into the model at deployment time. The resulting deployment behaves exactly like a natively fine-tuned base model. Multi-LoRA deployments incur overhead because adapters are applied dynamically:
- Time to first token (TTFT): Increases by roughly 10–30% due to adapter loading and prompt processing overhead
- Generation speed: Overhead grows with higher request concurrency
- Maximum throughput: Lower than a live-merge deployment under sustained load
## Next steps
- On-Demand Deployments: Learn about deployment configuration and optimization
- Import Fine-Tuned Models: Upload LoRA models fine-tuned outside of Fireworks
- LoRA Performance: Understand performance tradeoffs and optimization strategies