Tags: huggingface-transformers, huggingface, quantization

Quantization and torch_dtype in huggingface transformer


Not sure if this is the right forum to ask, but:

Assume I have a GPTQ model that is 4-bit. How does using from_pretrained(torch_dtype=torch.float16) work? In my understanding, 4-bit means changing the weights from 32-bit precision down to 4-bit precision using quantization methods.

However, passing torch_dtype=torch.float16 would mean the weights are in 16 bits? Am I missing something here?
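
For reference, the kind of call being asked about looks roughly like this; the checkpoint name is just an illustrative 4-bit GPTQ model:

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative 4-bit GPTQ checkpoint; any GPTQ model on the Hub would do.
model_id = "TheBloke/Llama-2-7B-GPTQ"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
```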


Solution

  • GPTQ is a Post-Training Quantization method. This means a GPTQ model was created in full precision and then compressed. Not all values will be in 4 bits; that would require quantizing every weight and activation.

    The GPTQ method does not do this:

    Specifically, GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16.

    As these values need to be multiplied together, this means that,

    during inference, weights are dequantized on the fly and the actual compute is performed in float16.

    (There is a toy sketch of this dequantize-then-multiply step at the end of this answer.)

    In a Hugging Face quantization blog post from Aug 2023, the "Room for Improvement" section mentions the possibility of quantizing activations as well. However, at that time there were no open-source implementations.

    Since then, they have released Quanto, which does support quantizing activations (a minimal usage sketch is included at the end of this answer). It looks promising, but it is not yet quicker than other quantization methods. It is in beta, and the docs say to expect breaking changes in the API and serialization. There are some accuracy and perplexity benchmarks which look pretty good with most models. Surprisingly, it is currently slower than 16-bit models due to a lack of optimized kernels, but that seems to be something they are working on.

    So this does not just apply to GPTQ. At the moment, you will find yourself using float16 with any of the popular quantization methods. For example, Activation-aware Weight Quantization (AWQ) also keeps a small percentage of the weights that are important for performance in full precision. This is a useful blog post comparing GPTQ with other quantization methods.
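
To make the "dequantized on the fly" point concrete, here is a toy sketch of the dequantize-then-multiply step for a single group of weights. The group size, scale, and zero point are made-up numbers, and real GPTQ kernels keep the 4-bit codes packed and fuse this step into the matmul kernel; the sketch only shows where float16 enters the compute.

```python
import torch

def dequantize_group(q_codes, scale, zero_point):
    # Toy per-group dequantization: 4-bit integer codes (0..15) -> float16 weights.
    # Real GPTQ kernels keep the codes packed and fuse this into the matmul.
    return (q_codes.to(torch.float16) - zero_point) * scale

# Made-up group of eight 4-bit codes with a made-up scale and zero point.
q_codes = torch.tensor([3, 15, 0, 7, 9, 12, 1, 5], dtype=torch.uint8)
scale = torch.tensor(0.02, dtype=torch.float16)
zero_point = torch.tensor(8.0, dtype=torch.float16)

w_fp16 = dequantize_group(q_codes, scale, zero_point)  # float16 weights
x_fp16 = torch.randn(8).to(torch.float16)              # float16 activations

# The actual compute (a dot product here) happens entirely in float16.
y = (w_fp16 * x_fp16).sum()
print(y.dtype)  # torch.float16
```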
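
And for Quanto, a rough sketch of quantizing activations as well as weights, based on the library's README. Since it is in beta, the exact names (quantize, Calibration, freeze, qint4, qint8, and whether the package imports as quanto or optimum.quanto) may have changed, so treat this as an outline rather than a definitive recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Depending on the version, this may be `from quanto import ...` instead.
from optimum.quanto import quantize, freeze, Calibration, qint4, qint8

model_id = "gpt2"  # small illustrative model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Unlike GPTQ, activations are quantized too (int8), not just the weights (int4).
quantize(model, weights=qint4, activations=qint8)

# Quantized activations need a calibration pass to pick their ranges.
with Calibration():
    inputs = tokenizer("Some calibration text.", return_tensors="pt")
    model(**inputs)

# Freeze the model so the float weights are replaced by their quantized versions.
freeze(model)
```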