Checking available precisions
Models may support different numerical precisions such as FP16, FP8, BF16, or INT8, which affect memory usage and inference speed. To check a model's default precision, inspect the model's metadata: the Precisions field indicates which precisions the model has been prepared for.
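For example, you might read the model resource over the REST API and inspect this field. The sketch below is illustrative only: the endpoint path, account/model IDs, and response field names are assumptions, so consult the API reference for the exact schema.

```python
import os
import requests

# Sketch only: the endpoint path and response field names are assumptions,
# not taken from this page. Check the API reference for the exact schema.
API_KEY = os.environ["FIREWORKS_API_KEY"]
ACCOUNT = "my-account"   # hypothetical account ID
MODEL = "my-model"       # hypothetical model ID

resp = requests.get(
    f"https://api.fireworks.ai/v1/accounts/{ACCOUNT}/models/{MODEL}",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()

# The Precisions field lists every precision the model has been prepared for.
print(resp.json().get("precisions", []))
```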
Quantizing a model
A model can be quantized to 8-bit floating-point (FP8) precision.
This is an additive process that enables creating deployments with additional precisions. The original FP16 checkpoint is still available for use.
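The exact quantization request is not reproduced here. As an illustrative sketch, assuming a prepare-style endpoint on the model resource that accepts a target precision (the path and payload below are hypothetical):

```python
import os
import requests

API_KEY = os.environ["FIREWORKS_API_KEY"]
ACCOUNT = "my-account"   # hypothetical account ID
MODEL = "my-model"       # hypothetical model ID

# Hypothetical endpoint and payload: request that an FP8 checkpoint be
# prepared in addition to the existing FP16 one.
resp = requests.post(
    f"https://api.fireworks.ai/v1/accounts/{ACCOUNT}/models/{MODEL}:prepare",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"precision": "FP8"},
)
resp.raise_for_status()
print(resp.json())
```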
While quantization is in progress, the model's state will show PREPARING. A successfully prepared model will have the desired precision added to the Precisions list.
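A minimal polling sketch, under the same assumptions about the model endpoint and field names as above:

```python
import os
import time
import requests

API_KEY = os.environ["FIREWORKS_API_KEY"]
ACCOUNT = "my-account"   # hypothetical account ID
MODEL = "my-model"       # hypothetical model ID
URL = f"https://api.fireworks.ai/v1/accounts/{ACCOUNT}/models/{MODEL}"

# Poll until the model leaves the PREPARING state, then show its precisions.
while True:
    model = requests.get(URL, headers={"Authorization": f"Bearer {API_KEY}"}).json()
    if model.get("state") != "PREPARING":
        break
    time.sleep(30)

print(model.get("state"), model.get("precisions", []))
```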
Creating an FP8 deployment
By default, creating a deployment uses the FP16 checkpoint. To deploy the quantized FP8 checkpoint instead, first ensure the model has been prepared for FP8 (see Checking available precisions above), then pass the --precision flag when creating your deployment:
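A REST-style equivalent might set a precision field on the new deployment, as in the sketch below (the endpoint path and field names are again assumptions, not the documented API):

```python
import os
import requests

API_KEY = os.environ["FIREWORKS_API_KEY"]
ACCOUNT = "my-account"   # hypothetical account ID
MODEL = "my-model"       # hypothetical model ID

# Hypothetical payload: deploy the FP8 checkpoint of a prepared model.
resp = requests.post(
    f"https://api.fireworks.ai/v1/accounts/{ACCOUNT}/deployments",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "baseModel": f"accounts/{ACCOUNT}/models/{MODEL}",
        "precision": "FP8",
    },
)
resp.raise_for_status()
print(resp.json())
```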
Quantized deployments can only be served using H100 GPUs.