A model is a foundational concept of the Fireworks platform, representing a set of weights and metadata that can be
deployed on hardware (i.e. a deployment) for inference. Each model has a globally unique name of the
form accounts/<ACCOUNT_ID>/models/<MODEL_ID>. Model IDs are:
- Pre-populated for models that Fireworks has uploaded. For example, "llama-v3p1-70b-instruct" is the model ID for the Llama 3.1 70B Instruct model that Fireworks provides. The ID can be found on each model's page (example).
- Either auto-generated or user-specified for fine-tuned models uploaded or created by users.
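For illustration, the globally unique name can be assembled from its two components. The helper below is a hypothetical sketch, not part of any Fireworks SDK; it only demonstrates the naming convention. (Fireworks-provided models live under the "fireworks" account.)

```python
def model_name(account_id: str, model_id: str) -> str:
    """Build a globally unique Fireworks model name from its parts.

    Illustrative helper only -- not part of a Fireworks SDK.
    Format: accounts/<ACCOUNT_ID>/models/<MODEL_ID>
    """
    return f"accounts/{account_id}/models/{model_id}"

# A Fireworks-provided model uses the "fireworks" account:
print(model_name("fireworks", "llama-v3p1-70b-instruct"))
# accounts/fireworks/models/llama-v3p1-70b-instruct
```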
A base model consists of the full set of model weights. This may include models pre-trained from scratch as well as
full fine-tunes (i.e. continued pre-training). Fireworks has a library of common base models that can be used for
serverless inference as well as dedicated deployments. Fireworks
also allows you to upload your own custom base models.
A LoRA addon is a small, fine-tuned model that significantly reduces the amount of memory required to deploy compared to
a fully fine-tuned model. Fireworks supports training,
uploading, and serving LoRA addons. A LoRA addon must be deployed on a serverless or dedicated deployment for its corresponding base model.
A model must be deployed before it can be used for inference. Take a look at the Querying text models
guide for a comprehensive overview of running LLM inference.
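As a minimal sketch of what such a request looks like (assuming an OpenAI-compatible chat-completions request body; the Querying text models guide is the authoritative reference for the exact fields), the model is referenced by its full accounts/.../models/... name:

```python
import json

# Hypothetical sketch of a chat-completions request body; field names
# follow the common OpenAI-compatible shape. Consult the Querying text
# models guide for the authoritative request format.
payload = {
    "model": "accounts/fireworks/models/llama-v3p1-70b-instruct",
    "messages": [
        {"role": "user", "content": "Say hello in one word."},
    ],
    "max_tokens": 16,
}
print(json.dumps(payload, indent=2))
```

The same payload shape works whether the model is served serverless or on a dedicated deployment; only the model name (and, for dedicated deployments, the endpoint you target) changes.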
Fireworks supports serverless inference for popular models like Llama 3.1 405B. These models are pre-deployed by the
Fireworks team for the community to use. Take a look at the Models page for the latest
list of serverless models. Serverless inference is billed on a per-token basis, with rates depending on the model size. See our Pricing
page for details.
Since serverless deployments are shared across users, there are no SLA guarantees for uptime or latency; service is
best-effort. The Fireworks team may also deprecate models from serverless with at least two weeks' notice.
Custom base models are not supported for serverless inference.
The most popular base models for fine-tuning will also support serverless LoRA addons. This feature allows users to
quickly experiment and prototype with fine-tuning without having to pay extra for a dedicated deployment. See the
Deploying to serverless guide for details.
Similar to serverless inference, there are no SLA guarantees for serverless addons.
Dedicated deployments give users the most flexibility and control over what models can be deployed and performance
guarantees. These deployments are private to you and give you access to a wide array of hardware. Both LoRA addons and
base models can be deployed to dedicated deployments. Dedicated deployments are billed on a per-GPU-second basis. See our Pricing page
for details. Take a look at our On-demand deployments guide for a comprehensive overview.
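Because billing is per GPU-second, a rough cost estimate is just GPU count × duration × rate. The rate in the example below is a placeholder, not an actual Fireworks price; real rates are listed on the Pricing page:

```python
def estimated_cost(gpu_count: int, seconds: float, rate_per_gpu_second: float) -> float:
    """Estimate dedicated-deployment cost billed per GPU-second.

    rate_per_gpu_second is a placeholder value; actual rates vary by
    hardware type and are listed on the Pricing page.
    """
    return gpu_count * seconds * rate_per_gpu_second

# e.g. 2 GPUs for 1 hour at a hypothetical $0.001 per GPU-second:
cost = estimated_cost(2, 3600, 0.001)
print(f"${cost:.2f}")  # $7.20
```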
Your data is your data. No prompt or generated data is logged or stored on Fireworks; only metadata, such as the number of tokens in a request, is logged as required to deliver the service. There are two exceptions:
For our proprietary FireFunction model, input/output data is logged for 30 days only to enable bulk analytics to improve the model, such as tracking the number of functions provided to the model.
For certain advanced features (e.g. FireOptimizer), users can explicitly opt-in to log data.