Introduction

A model is a foundational concept of the Fireworks platform, representing a set of weights and metadata that can be deployed on hardware (i.e. a deployment) for inference. Each model has a globally unique model name of the form accounts/<ACCOUNT_ID>/models/<MODEL_ID>.
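As a sketch of this naming scheme, the hypothetical helper below validates a model name and splits it into its account and model IDs. The helper and its regex are illustrative only, not part of any Fireworks SDK, and the allowed character set is an assumption:

```python
import re

# Matches the documented form: accounts/<ACCOUNT_ID>/models/<MODEL_ID>.
# The permitted characters are an assumption for illustration.
_MODEL_NAME_RE = re.compile(
    r"^accounts/(?P<account_id>[a-z0-9-]+)/models/(?P<model_id>[a-z0-9.-]+)$"
)

def parse_model_name(name: str) -> tuple[str, str]:
    """Split a model resource name into (account_id, model_id)."""
    match = _MODEL_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid model name: {name!r}")
    return match.group("account_id"), match.group("model_id")
```

For example, parse_model_name("accounts/fireworks/models/llama-v3p1-405b-instruct") yields the account ID "fireworks" and the model ID "llama-v3p1-405b-instruct".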

There are two types of models:

  • Base models
  • Parameter-efficient fine-tuned (PEFT) addons

Base models

A base model consists of the full set of model weights. This may include models pre-trained from scratch as well as full fine-tunes (i.e. continued pre-training). Fireworks has a library of common base models that can be used for serverless inference as well as dedicated deployments. Fireworks also allows you to upload your own custom base models.

Parameter-efficient fine-tuned (PEFT) addons

A PEFT addon is a small, fine-tuned model that significantly reduces the amount of memory required to deploy compared to a fully fine-tuned model. A common technique for training PEFT addons is low-rank adaptation (LoRA). Fireworks supports training, uploading, and serving PEFT addons.

PEFT addons must be deployed on a serverless or dedicated deployment of their corresponding base model.

Using models for inference

A model must be deployed before it can be used for inference. Take a look at the Querying text models guide for a comprehensive overview of performing LLM inference.
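As an illustrative sketch (the Querying text models guide is authoritative), a deployed model is addressed by its model name in an inference request. The endpoint URL and payload shape below assume an OpenAI-compatible chat completions API, and the helper functions are hypothetical:

```python
import json
import urllib.request

# Assumed OpenAI-compatible endpoint; confirm the URL and parameters
# against the Querying text models guide.
API_URL = "https://api.fireworks.ai/inference/v1/chat/completions"

def build_chat_request(model_name: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat completion payload addressed to a deployed model."""
    return {
        "model": model_name,  # e.g. "accounts/fireworks/models/llama-v3p1-405b-instruct"
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def query(api_key: str, payload: dict) -> dict:
    """Send the request; requires a valid API key."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The same request shape works whether the model name refers to a base model or a PEFT addon, since both resolve to a deployment behind the scenes.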

Serverless inference

Fireworks supports serverless inference for popular models like Llama 3.1 405B. These models are pre-deployed by the Fireworks team for the community to use. Take a look at the Models page for the latest list of serverless models.

Serverless inference is billed on a per-token basis depending on the model size. See our Pricing page for details.
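As a worked example with a hypothetical rate of $3 per million tokens (actual rates vary by model size; see the Pricing page), the per-token billing model reduces to simple arithmetic:

```python
def serverless_cost(prompt_tokens: int, completion_tokens: int,
                    usd_per_million_tokens: float) -> float:
    """Per-token billing: total tokens processed times the per-token rate."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens * usd_per_million_tokens / 1_000_000

# Hypothetical rate for illustration only:
# 1,200 prompt tokens + 300 completion tokens at $3/M tokens.
cost = serverless_cost(1200, 300, 3.00)  # 1,500 tokens -> $0.0045
```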

Since serverless deployments are shared across users, there are no SLA guarantees for uptime or latency; service is best-effort. The Fireworks team may also deprecate models from serverless with at least two weeks' notice.

Custom base models are not supported for serverless inference.

Serverless addons

The most popular base models for fine-tuning will also support serverless PEFT addons. This feature allows users to quickly experiment and prototype with fine-tuning without having to pay extra for a dedicated deployment. See the Deploying to serverless guide for details.

Similar to serverless inference, there are no SLA guarantees for serverless addons.

Dedicated deployments

Dedicated deployments give users the most flexibility and control over which models are deployed, along with performance guarantees. These deployments are private to you and give you access to a wide array of hardware. Both PEFT addons and base models can be deployed to dedicated deployments.

Dedicated deployments are billed on a per-GPU-second basis. See our Pricing page for details.
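As a worked example with a hypothetical rate (see the Pricing page for real rates), GPU-second billing multiplies GPU count, wall-clock time, and the per-GPU-second rate:

```python
def dedicated_cost(gpu_count: int, seconds: float,
                   usd_per_gpu_second: float) -> float:
    """GPU-second billing: GPUs x wall-clock seconds x per-GPU-second rate."""
    return gpu_count * seconds * usd_per_gpu_second

# Hypothetical rate of $0.0008 per GPU-second:
# 2 GPUs running for 3 hours = 2 * 10,800 GPU-seconds.
cost = dedicated_cost(2, 3 * 3600, 0.0008)  # -> $17.28
```

Note that billing accrues for as long as the deployment is up, regardless of how many requests it serves.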

Take a look at our On-demand deployments guide for a comprehensive overview.