
Pytorch Model Optimization: Automatic Mixed Precision vs Quantization?


I'm trying to optimize my PyTorch model. I understand the basics of quantization (converting 32-bit floats to lower-precision data types such as 16-bit or 8-bit), but I'm lost on how the two methods differ and which one to choose.

I see AMP (Automatic Mixed Precision) https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html and regular Quantization https://pytorch.org/tutorials/recipes/quantization.html.

Could someone please explain the difference and applications? Thank you.


Solution

  • The main goal of Automatic Mixed Precision (AMP) is to reduce training time. Quantization's goal, on the other hand, is to speed up inference.

    AMP: Not all layers and operations require fp32 precision, so it is better to use lower precision where it is safe. AMP decides which precision to use for which operation, which ultimately speeds up training.

    Mixed precision tries to match each op to its appropriate datatype, which can reduce your network’s runtime and memory footprint.

    Also, note that the maximum performance gain is observed on Tensor Core-enabled GPU architectures. A minimal AMP training-loop sketch is shown after this answer.

    Quantization converts the 32-bit floating-point numbers in your model parameters to 8-bit integers. This significantly decreases the model size and increases inference speed. However, it can severely impact the model's accuracy, which is why techniques like Quantization-Aware Training (QAT) exist to recover it (see the dynamic-quantization sketch below). The rest you can read in the tutorials you linked.
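
    For reference, here is a minimal sketch of how AMP typically plugs into a training loop, assuming a CUDA GPU. The toy model, the fake data loader, and the hyperparameters are placeholders, not anything from your setup:

    ```python
    import torch
    from torch import nn

    # Placeholder model, data, and optimizer just to make the sketch runnable.
    model = nn.Linear(512, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    loader = [(torch.randn(32, 512), torch.randint(0, 10, (32,))) for _ in range(8)]

    scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 gradient underflow

    for inputs, targets in loader:
        inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad()
        # autocast runs each op in the dtype AMP considers safe for it:
        # fp16 for matmuls/convolutions, fp32 where precision matters.
        with torch.cuda.amp.autocast():
            outputs = model(inputs)
            loss = loss_fn(outputs, targets)
        scaler.scale(loss).backward()  # backward pass on the scaled loss
        scaler.step(optimizer)         # unscales gradients, then calls optimizer.step()
        scaler.update()                # adjusts the scale factor for the next iteration
    ```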
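
    And here is a sketch of post-training dynamic quantization, the simplest of the modes covered in the quantization tutorial; the toy model is again just a placeholder:

    ```python
    import torch
    from torch import nn

    # Placeholder fp32 model.
    model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
    model.eval()

    # Convert Linear weights to int8; activations are quantized on the fly at inference.
    quantized_model = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 512)
    with torch.no_grad():
        print(quantized_model(x).shape)  # same interface, smaller model, faster CPU inference
    ```

    Dynamic quantization needs no calibration data, which is why it is a common first step; static quantization and QAT (described in the tutorial) trade more setup effort for better accuracy and speed.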