Quantization reduces the number of bits used to serve a model, improving performance and reducing cost by 30-50%. However, it can change model numerics, which may introduce small changes to the output.
Read our blog post for a detailed treatment of how quantization affects model quality.
Checking available precisions
Models may support different numerical precisions like FP16, FP8, BF16, or INT8, which affect memory usage and inference speed.
Check default precision:
firectl get model accounts/fireworks/models/llama-v3p1-8b-instruct | grep "Default Precision"
Check supported precisions:
firectl get model accounts/fireworks/models/llama-v3p1-8b-instruct | grep -E "(Supported Precisions|Supported Precisions With Calibration)"
The Precisions field indicates what precisions the model has been prepared for.
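You can perform the same check over the REST API by fetching the model resource (the same GET endpoint used later in this guide). This is a minimal sketch; since the exact JSON field names for precisions may differ from the firectl labels above, it simply prints any field whose name mentions precision.
import os
import requests

ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
API_KEY = os.environ.get("FIREWORKS_API_KEY")
MODEL_ID = "<YOUR_MODEL_ID>"

# Fetch the model resource.
response = requests.get(
    f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/models/{MODEL_ID}",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
model = response.json()

# The JSON field names may not match the firectl output labels exactly,
# so print any field whose name mentions "precision".
for key, value in model.items():
    if "precision" in key.lower():
        print(f"{key}: {value}")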
Quantizing a model
A model can be quantized to 8-bit floating-point (FP8) precision.
firectl
Python (REST API)
firectl prepare-model <MODEL_ID>
import os
import requests

ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
API_KEY = os.environ.get("FIREWORKS_API_KEY")
MODEL_ID = "<YOUR_MODEL_ID>"  # The ID of the model you want to prepare

# Prepare (quantize) the model to FP8 precision.
response = requests.post(
    f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/models/{MODEL_ID}:prepare",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={"precision": "FP8"},
)
print(response.json())
This is an additive process that enables creating deployments with additional precisions. The original FP16 checkpoint is still available for use.
You can check on the status of preparation by running:
firectl
Python (REST API)
firectl get model <MODEL_ID>
import os
import requests

ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
API_KEY = os.environ.get("FIREWORKS_API_KEY")
MODEL_ID = "<YOUR_MODEL_ID>"  # The ID of the model you want to get

# Fetch the model to inspect its state and prepared precisions.
response = requests.get(
    f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/models/{MODEL_ID}",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
print(response.json())
and checking whether the state is still PREPARING. A successfully prepared model will have the desired precision added
to the Precisions list.
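For longer preparations, you can poll until the model leaves the PREPARING state. Below is a minimal sketch using the same GET endpoint; the "state" and "precisions" JSON field names are assumptions based on the firectl output, so inspect the raw response if they differ for your account.
import os
import time
import requests

ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
API_KEY = os.environ.get("FIREWORKS_API_KEY")
MODEL_ID = "<YOUR_MODEL_ID>"

# Poll the model until it is no longer PREPARING.
# "state" and "precisions" field names are assumptions; check the raw
# response from the GET call above if they differ.
while True:
    model = requests.get(
        f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/models/{MODEL_ID}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    ).json()
    if model.get("state") != "PREPARING":
        break
    time.sleep(30)

print("Final state:", model.get("state"))
print("Prepared precisions:", model.get("precisions"))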
Creating an FP8 deployment
By default, creating a deployment uses the FP16 checkpoint. To use a quantized FP8 checkpoint, first ensure the model has been prepared for FP8 (see Checking available precisions above), then pass the --precision flag when creating your deployment:
firectl
Python (REST API)
firectl create deployment <MODEL> --accelerator-type NVIDIA_H100_80GB --precision FP8
import os
import requests

ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
API_KEY = os.environ.get("FIREWORKS_API_KEY")

# The ID of the model you want to deploy.
# The model must be prepared for FP8 precision.
MODEL_ID = "<YOUR_MODEL_ID>"
DEPLOYMENT_NAME = "My FP8 Deployment"

# Create an FP8 deployment of the prepared model on H100 hardware.
response = requests.post(
    f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/deployments",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "displayName": DEPLOYMENT_NAME,
        "baseModel": MODEL_ID,
        "acceleratorType": "NVIDIA_H100_80GB",
        "precision": "FP8",
    },
)
print(response.json())
Quantized deployments can only be served using H100 GPUs.
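After the deployment is created, you can confirm it is running at FP8 by fetching the deployment resource. This is a minimal sketch that assumes the deployment supports a GET at the same path used for creation plus the deployment ID, and that the response exposes a precision field; print the full response if the field name differs.
import os
import requests

ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
API_KEY = os.environ.get("FIREWORKS_API_KEY")

# The deployment ID, e.g. the last path segment of the "name" field
# returned by the create call above (assumption about the response shape).
DEPLOYMENT_ID = "<YOUR_DEPLOYMENT_ID>"

# Fetch the deployment resource (assumed GET endpoint mirroring the create path).
response = requests.get(
    f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/deployments/{DEPLOYMENT_ID}",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
deployment = response.json()

# "precision" field name is an assumption; print the full response if it differs.
print("Precision:", deployment.get("precision"))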