Introduction

A model is a foundational concept of the Fireworks platform, representing a set of weights and metadata that can be deployed on hardware (i.e. a deployment) for inference. Each model has a globally unique name of the form accounts/<ACCOUNT_ID>/models/<MODEL_ID>. The model IDs are:

  • Pre-populated for models that Fireworks has uploaded. For example, “llama-v3p1-70b-instruct” is the model ID for the Llama 3.1 70B model that Fireworks provides. It can be found on each model’s page (example)
  • Either auto-generated or user-specified for fine-tuned models uploaded or created by users
  • User-specified for custom models uploaded by users
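The naming scheme above can be sketched as a small helper. This is an illustrative function, not part of any Fireworks SDK; it simply splits a resource name of the form accounts/&lt;ACCOUNT_ID&gt;/models/&lt;MODEL_ID&gt;:

```python
import re

# Hypothetical helper (not part of the Fireworks SDK): splits a model
# resource name of the form accounts/<ACCOUNT_ID>/models/<MODEL_ID>.
MODEL_NAME_RE = re.compile(r"^accounts/(?P<account>[^/]+)/models/(?P<model>[^/]+)$")

def parse_model_name(name: str) -> tuple[str, str]:
    """Return (account_id, model_id), or raise ValueError for malformed names."""
    match = MODEL_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid model resource name: {name!r}")
    return match.group("account"), match.group("model")

# Example: the Fireworks-provided Llama 3.1 70B Instruct model.
account, model = parse_model_name("accounts/fireworks/models/llama-v3p1-70b-instruct")
```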

There are two types of models:

  • Base models
  • Parameter-efficient fine-tuned (PEFT) addons

Base models

A base model consists of the full set of model weights. This may include models pre-trained from scratch as well as full fine-tunes (i.e. continued pre-training). Fireworks has a library of common base models that can be used for serverless inference as well as dedicated deployments. Fireworks also allows you to upload your own custom base models.

Parameter-efficient fine-tuned (PEFT) addons

A PEFT addon is a small, fine-tuned model that significantly reduces the amount of memory required to deploy compared to a fully fine-tuned model. A common technique for training PEFT addons is low-rank adaptation (LoRA). Fireworks supports training, uploading, and serving PEFT addons.

A PEFT addon must be deployed on a serverless or dedicated deployment of its corresponding base model.
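To build intuition for why addons are small, compare parameter counts for a single weight matrix. The numbers below are illustrative (hidden size 4096, LoRA rank 16), not drawn from any particular model:

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    # LoRA replaces a full (d_out x d_in) weight update with two low-rank
    # factors: B (d_out x rank) @ A (rank x d_in).
    return rank * (d_in + d_out)

full = 4096 * 4096                        # one full weight matrix at hidden size 4096
lora = lora_param_count(4096, 4096, 16)   # rank-16 adapter for the same matrix
ratio = full / lora                       # how many times fewer parameters the adapter has
```

Because only the adapter weights are new, many addons can share one copy of the base model's weights on a deployment.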

Using models for inference

A model must be deployed before it can be used for inference. Take a look at the Querying text models guide for a comprehensive overview of running LLM inference.
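As a minimal sketch, a deployed model can be queried over Fireworks' OpenAI-compatible HTTP API. The payload shape below follows the OpenAI chat completions convention; treat the details as an assumption and defer to the Querying text models guide for the authoritative reference:

```python
# Minimal sketch of a chat completions request body for Fireworks'
# OpenAI-compatible API. Endpoint and field names follow the OpenAI
# convention; see the Querying text models guide for specifics.
API_URL = "https://api.fireworks.ai/inference/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble the JSON body for a single-turn chat completion."""
    return {
        "model": model,  # e.g. accounts/fireworks/models/llama-v3p1-70b-instruct
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

body = build_chat_request(
    "accounts/fireworks/models/llama-v3p1-70b-instruct",
    "Name three uses of PEFT addons.",
)
# Send with any HTTP client, authenticating with your API key, e.g.:
#   requests.post(API_URL, json=body,
#                 headers={"Authorization": f"Bearer {FIREWORKS_API_KEY}"})
```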

Serverless inference

Fireworks supports serverless inference for popular models like Llama 3.1 405B. These models are pre-deployed by the Fireworks team for the community to use. Take a look at the Models page for the latest list of serverless models.

Serverless inference is billed on a per-token basis depending on the model size. See our Pricing page for details.
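A back-of-the-envelope cost estimate for per-token billing can be sketched as follows. The rate used here is hypothetical; actual per-token prices vary by model size and are listed on the Pricing page:

```python
def serverless_cost(prompt_tokens: int, completion_tokens: int,
                    usd_per_million_tokens: float) -> float:
    """Serverless inference bills for every token processed, in and out."""
    total = prompt_tokens + completion_tokens
    return total / 1_000_000 * usd_per_million_tokens

# e.g. 1,500 prompt + 500 completion tokens at a hypothetical $0.90/M tokens:
cost = serverless_cost(1_500, 500, 0.90)
```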

Since serverless deployments are shared across users, there are no SLA guarantees for uptime or latency; service is best-effort. The Fireworks team may also remove models from serverless with at least 2 weeks' notice.

Custom base models are not supported for serverless inference.

Serverless addons

The most popular base models for fine-tuning will also support serverless PEFT addons. This feature allows users to quickly experiment and prototype with fine-tuning without having to pay extra for a dedicated deployment. See the Deploying to serverless guide for details.

Similar to serverless inference, there are no SLA guarantees for serverless addons.

Dedicated deployments

Dedicated deployments give users the most flexibility and control over both which models can be deployed and the performance they receive. These deployments are private to you and give you access to a wide array of hardware. Both base models and PEFT addons can be deployed to dedicated deployments.

Dedicated deployments are billed on a per-GPU-second basis. See our Pricing page for details.
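In contrast to per-token serverless billing, dedicated-deployment cost depends only on GPU time. A rough estimate, using a hypothetical rate (actual per-GPU-second prices depend on the hardware; see Pricing):

```python
def deployment_cost(gpu_count: int, seconds: float, usd_per_gpu_second: float) -> float:
    """Dedicated deployments bill for GPU time, independent of token volume."""
    return gpu_count * seconds * usd_per_gpu_second

# e.g. two GPUs for one hour at a hypothetical $0.0011 per GPU-second:
cost = deployment_cost(2, 3600, 0.0011)
```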

Take a look at our On-demand deployments guide for a comprehensive overview.

Data privacy & security

Your data is your data. No prompt or generated data is logged or stored on Fireworks; only metadata, like the number of tokens in a request, is logged, as required to deliver the service. There are two exceptions:

  • For our proprietary FireFunction model, input/output data is logged for 30 days, solely to enable bulk analytics that improve the model, such as tracking the number of functions provided to the model.
  • For certain advanced features (e.g. FireOptimizer), users can explicitly opt in to log data.