Overview
Introduction
A model is a foundational concept of the Fireworks platform, representing a set of weights and metadata that can be
deployed on hardware (i.e. a deployment) for inference. Each model has a globally unique name of the
form accounts/<ACCOUNT_ID>/models/<MODEL_ID>
. The model IDs are:
- Pre-populated for models that Fireworks has uploaded. For example, “llama-v3p1-70b-instruct” is the model ID for the Llama 3.1 70B model that Fireworks provides. It can be found on each model’s page (example)
- Either auto-generated or user-specified for fine-tuned models uploaded or created by users
- User-specified for custom models uploaded by users
There are two types of models:
- Base models
- Low-rank adaptation (LoRA) addons
Base models
A base model consists of the full set of model weights. This may include models pre-trained from scratch as well as full fine-tunes (i.e. continued pre-training). Fireworks has a library of common base models that can be used for serverless inference as well as dedicated deployments. Fireworks also allows you to upload your own custom base models.
Low-rank adaptation (LoRA) addons
A LoRA addon is a small, fine-tuned model that significantly reduces the amount of memory required to deploy compared to a fully fine-tuned model. Fireworks supports both training, uploading, and serving LoRA addons.
LoRA addons must be deployed on a serverless or dedicated deployment for its corresponding base model.
Using models for inference
A model must be deployed before it can be used for inference. Take a look at the Querying text models guide for a comprehensive overview of making LLM inference.
Serverless inference
Fireworks supports serverless inference for popular models like Llama 3.1 405B. These models are pre-deployed by the Fireworks team for the community to use. Take a look at the Models page for the latest list of serverless models.
Serverless inference is billed on a per-token basis depending on the model size. See our Pricing page for details.
Custom base models are not supported for serverless inference.
Serverless addons
The most popular base models for fine-tuning will also support serverless LoRA addons. This feature allows users to quickly experiment and prototype with fine-tuning without having to pay extra for a dedicated deployment. See the Deploying to serverless guide for details.
Dedicated deployments
Dedicated deployments give users the most flexibility and control over what models can be deployed and performance guarantees. These deployments are private to you and give you access to a wide array of hardware. Both LoRA addons and base models can be deployed to dedicated deployments.
Dedicated deployments are billed by a GPU-second basis. See our Pricing page for details.
Take a look at our On-demand deployments guide for a comprehensive overview.
Data privacy & security
Your data is your data. No prompt or generated data is logged or stored on Fireworks; only meta-data like the number of tokens in a request is logged, as required to deliver the service. There are two exceptions:
- For our proprietary FireFunction model, input/output data is logged for 30 days only to enable bulk analytics to improve the model, such as tracking the number of functions provided to the model.
- For certain advanced features (e.g. FireOptimizer), users can explicitly opt-in to log data.
Was this page helpful?