Tags: python, pytorch, nlp, huggingface-transformers, fine-tuning

Fine-tuning a Pretrained Model with Quantization and AMP: Scaler Error "Attempting to Unscale FP16 Gradients"


I am trying to fine-tune a pretrained model with limited VRAM. To achieve this, I am using quantization and automatic mixed precision (AMP). However, I am encountering an issue that I can't seem to resolve. Could you please help me identify the problem?

Here is a minimal example:

import os
from transformers import BitsAndBytesConfig, OPTForCausalLM, GPT2TokenizerFast
import torch
from torch.cuda.amp import GradScaler

model_name = "facebook/opt-1.3b"
cache_dir = './models'
os.environ["CUDA_VISIBLE_DEVICES"] = "7"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

pretrained_model: OPTForCausalLM = OPTForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    quantization_config=quantization_config,
)
tokenizer: GPT2TokenizerFast = GPT2TokenizerFast.from_pretrained(model_name, cache_dir=cache_dir)
optimizer = torch.optim.AdamW(pretrained_model.parameters(), lr=1e-4)
scaler = GradScaler()
input_ids = torch.LongTensor([[0, 1, 2, 3]]).to(0)
labels = torch.LongTensor([[1, 2, 3, 4]]).to(0)
with torch.autocast(device_type='cuda'):
    out = pretrained_model(input_ids=input_ids, labels=labels)
    loss = out.loss
scaler.scale(loss).backward()
scaler.step(optimizer) 
scaler.update()
optimizer.zero_grad()

print('End')

At the line scaler.step(optimizer), an error occurs:

Exception has occurred: ValueError: Attempting to unscale FP16 gradients.

Thank you in advance for your help!


Solution

  • You can't fine-tune an fp16 or quantized (int8/4-bit) model with AMP. AMP keeps the master copy of the parameters in fp32; the parameters are only autocast to fp16 for the forward pass. When the parameters (and therefore their gradients) are already fp16, GradScaler refuses to unscale them, which is exactly the error you are seeing.

    You also shouldn't fine-tune the quantized weights directly in the first place: quantization causes all sorts of numerical issues and instability during training.

    What you are supposed to do is keep the quantized model frozen and train an adapter on top of it, as sketched below. You can find more details here
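
For reference, here is a minimal sketch (untested, reusing the imports and tensors from your example) of the setup GradScaler expects: the model's own parameters stay in fp32, autocast only runs the forward pass in half precision, and the scaler then unscales fp32 gradients without complaint. Note that holding opt-1.3b in fp32 plus AdamW state may not fit in limited VRAM, which is why the adapter approach below is usually the better fix.

model = OPTForCausalLM.from_pretrained(model_name, cache_dir=cache_dir).to("cuda")  # fp32 weights, no quantization
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler()

with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(input_ids=input_ids, labels=labels)   # forward pass runs in fp16
scaler.scale(out.loss).backward()                     # gradients are fp32, like the params
scaler.step(optimizer)                                # unscaling fp32 gradients succeeds
scaler.update()
optimizer.zero_grad()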
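
And here is a rough sketch of the adapter approach (assuming the peft library is installed; untested): the 4-bit base model stays frozen and only a small set of fp32 LoRA weights is trained, so the scaler error goes away and memory use stays low. The rank, alpha, and target_modules values below are illustrative; q_proj/v_proj are the attention projections in OPT, but verify the module names for your model.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# freezes the quantized weights and casts the remaining trainable layers to fp32
pretrained_model = prepare_model_for_kbit_training(pretrained_model)

lora_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
pretrained_model = get_peft_model(pretrained_model, lora_config)
pretrained_model.print_trainable_parameters()

# only the fp32 LoRA parameters require grad, so AMP + GradScaler works as usual
optimizer = torch.optim.AdamW(
    (p for p in pretrained_model.parameters() if p.requires_grad), lr=1e-4
)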