By default, models are served using 16-bit floating-point (FP16) precision. Quantization reduces the number of bits required to serve the model, improving performance and reducing the cost to serve. However, it can change the model's numerics, which may introduce small changes to the output.
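As a rough illustration of the savings (exact figures depend on the model and serving stack): an 8-billion-parameter model stores about 8B × 2 bytes ≈ 16 GB of weights at FP16, versus about 8B × 1 byte ≈ 8 GB at FP8, roughly halving the memory needed to hold the weights.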

Take a look at our blog post for a detailed treatment of how quantization affects model quality.

Quantizing a model

A model can be quantized to 8-bit floating-point (FP8) precision using firectl prepare-model:

firectl prepare-model <MODEL_ID> --precision FP8
This is an additive process: a new FP8 checkpoint is created for your model, and the original FP16 checkpoint remains available for use.
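For example, assuming a model with the ID my-custom-model (a hypothetical name used here for illustration):

firectl prepare-model my-custom-model --precision FP8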

You can check on the status of preparation by running

firectl get model <MODEL_ID>

and checking whether the state is still PREPARING. Once preparation succeeds, the desired precision is added to the model's Precisions list.
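If you prefer not to re-run the command by hand, one way to poll (a sketch; my-custom-model is a hypothetical model ID) is to wrap it in watch:

watch -n 30 firectl get model my-custom-model

This re-runs the command every 30 seconds so you can see when the state leaves PREPARING and FP8 appears under Precisions.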

Creating an FP8 deployment

By default, creating a dedicated deployment will use the FP16 checkpoint. To see what precisions are available for a model, run:

firectl get model <MODEL_ID>

The Precisions field will indicate what precisions the model has been prepared for.

To use the quantized FP8 checkpoint, pass the --precision flag:

firectl create deployment <MODEL_ID> --precision FP8
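Putting the steps together, an end-to-end flow might look like the following (a sketch; my-custom-model is a hypothetical model ID, and your deployment options may differ):

firectl prepare-model my-custom-model --precision FP8
firectl get model my-custom-model
firectl create deployment my-custom-model --precision FP8

Repeat the middle command until the state leaves PREPARING and FP8 appears in the Precisions list; only then create the FP8 deployment.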