Tags: pytorch, huggingface-transformers, fine-tuning

HuggingFace Pretrained Model for Fine-Tuning has 100% Trainable Parameters


I believe I’m correctly following HuggingFace’s documentation on fine-tuning pretrained models, but I get a model with 100% trainable parameters. I thought only some layers would be unfrozen and optimized, but it looks like all of them are.

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

...
# id2label and label2id represent 3 classes in my current problem

model_name = "nvidia/segformer-b5-finetuned-cityscapes-1024-1024"
model = AutoModelForSemanticSegmentation.from_pretrained(model_name, id2label=id2label, label2id=label2id, ignore_mismatched_sizes=True)
print_trainable_parameters(model)

Prints the following:

Some weights of SegformerForSemanticSegmentation were not initialized from the model checkpoint at nvidia/segformer-b5-finetuned-cityscapes-1024-1024 and are newly initialized because the shapes did not match:
- decode_head.classifier.weight: found shape torch.Size([19, 768, 1, 1]) in the checkpoint and torch.Size([3, 768, 1, 1]) in the model instantiated
- decode_head.classifier.bias: found shape torch.Size([19]) in the checkpoint and torch.Size([3]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
trainable params: 84595651 || all params: 84595651 || trainable%: 100.00

Why are 100% of the parameters trainable? I know I could use PEFT to reduce the number of trainable parameters, but based on the warning about decode_head.classifier I expected only a small subset of the parameters to be left free for optimization.


Solution

  • This is the expected behavior: from_pretrained does not freeze any layers for you. You can freeze them yourself by setting requires_grad to False on the parameters you want to keep fixed, as shown below:

    from transformers import AutoModelForSemanticSegmentation
    
    model_name = "nvidia/segformer-b5-finetuned-cityscapes-1024-1024"
    model = AutoModelForSemanticSegmentation.from_pretrained(model_name)
    
    print_trainable_parameters(model)
    # freeze everything except the decoder head
    for name, param in model.named_parameters():
        if not name.startswith("decode_head"):
            param.requires_grad = False
    print_trainable_parameters(model)
    

    Output:

    trainable params: 84607955 || all params: 84607955 || trainable%: 100.00
    trainable params: 3164947 || all params: 84607955 || trainable%: 3.74
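
  • If you do want to go the PEFT route mentioned in the question, a LoRA setup is another way to cut the number of trainable parameters. The sketch below is untested and makes a few assumptions: it assumes the peft package is installed, that SegFormer's attention projections are named query/value (worth verifying for this checkpoint), and it reuses the id2label/label2id mappings from the question:

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForSemanticSegmentation
    
    model_name = "nvidia/segformer-b5-finetuned-cityscapes-1024-1024"
    model = AutoModelForSemanticSegmentation.from_pretrained(
        model_name, id2label=id2label, label2id=label2id, ignore_mismatched_sizes=True
    )
    
    lora_config = LoraConfig(
        r=8,                                # rank of the LoRA update matrices
        lora_alpha=16,                      # scaling factor for the LoRA updates
        target_modules=["query", "value"],  # attention projections to adapt (assumed module names)
        modules_to_save=["decode_head"],    # keep the re-initialized decoder head trainable
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()      # PeftModel's built-in trainable-parameter counter

    With such a config, only the injected LoRA matrices and the decode head are optimized, while the pretrained encoder stays frozen, which is similar in spirit to the manual requires_grad approach above.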