Tags: huggingface-transformers, t5-transformer

How to freeze parts of T5 transformer model


I know that T5 has K, Q and V projections in each attention layer. It also has a feed-forward network. I would like to freeze the K, Q and V projections and only train the feed-forward layers in each layer of T5. I use the PyTorch library. The model could be a wrapper around the Hugging Face T5 model or a modified version of it. I know how to freeze all parameters using the following code:

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained(underlying_model_name)
model = T5ForConditionalGeneration.from_pretrained(underlying_model_name)

for p in model.parameters():
    p.requires_grad = False # freezing

Could you please guide me on how I can do this?

This GitHub project could probably be helpful, but it's for RoBERTa and GPT. Could I adapt it for T5?


Solution

  • I've adapted a solution based on this discussion from the Hugging Face forums. Basically, you have to specify the names of the modules/PyTorch layers that you want to freeze.

    In your particular case of T5, I started by looking at the model summary:

    from transformers import T5ForConditionalGeneration

    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    print(model)
    

    This gives the following (abbreviated output):

    T5ForConditionalGeneration(
      (shared): Embedding(32128, 512)
      (encoder): T5Stack(
        (embed_tokens): Embedding(32128, 512)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): Linear(in_features=512, out_features=512, bias=False)
                  (k): Linear(in_features=512, out_features=512, bias=False)
                  (v): Linear(in_features=512, out_features=512, bias=False)
                  (o): Linear(in_features=512, out_features=512, bias=False)
                  (relative_attention_bias): Embedding(32, 8)
                )
                (layer_norm): T5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (1): T5LayerFF(
                (DenseReluDense): T5DenseReluDense(
                  (wi): Linear(in_features=512, out_features=2048, bias=False)
                  (wo): Linear(in_features=2048, out_features=512, bias=False)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (layer_norm): T5LayerNorm()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
    [...]  # abbreviated output
    

    With this, we can then generate a list of the modules that we want to freeze. In particular, I decided to freeze the entire T5LayerSelfAttention block for the encoder (and, additionally, the T5LayerCrossAttention for the decoder):

    # All self-attention modules in the encoder (layer[0] of every block)
    modules_to_freeze = [model.encoder.block[i].layer[0] for i in range(len(model.encoder.block))]
    # The decoder blocks contain both a self-attention module (layer[0]) ...
    modules_to_freeze.extend([model.decoder.block[i].layer[0] for i in range(len(model.decoder.block))])
    # ... and a cross-attention module (layer[1])
    modules_to_freeze.extend([model.decoder.block[i].layer[1] for i in range(len(model.decoder.block))])
    

    And then simply freeze all the parameters in the respective modules:

    for module in modules_to_freeze:
        for param in module.parameters():
            param.requires_grad = False  # Actual freezing operation
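
    If you then set up training, one common pattern is to hand the optimizer only the parameters that still require gradients, so the frozen attention weights are guaranteed to stay untouched. A minimal sketch, where the choice of AdamW and the learning rate are just illustrative assumptions:

    import torch

    # Collect only the parameters that were not frozen above
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)  # only the unfrozen weights get updated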
    

    You can verify that these are actually frozen in your model by running the following:

    for param in model.parameters():
        print(param.requires_grad)
    

    which should print quite a few False values alongside True for the parameters that remain trainable. If you really only want to freeze K, Q and V, you can adapt the above process to sub-select just the modules or parameters you want, as in the sketch below.
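
    One option is to match on parameter names instead of whole modules; the names follow the structure shown in the summary above (e.g. encoder.block.0.layer.0.SelfAttention.q.weight). This is only a sketch, and it assumes that the suffixes .q.weight, .k.weight and .v.weight occur exclusively in the attention projections of T5:

    # Freeze only the K, Q and V projection matrices; everything else
    # (including the feed-forward DenseReluDense layers) stays trainable.
    for name, param in model.named_parameters():
        if name.endswith((".q.weight", ".k.weight", ".v.weight")):
            param.requires_grad = False

    You can then re-run the verification loop above to confirm that only those projections report False.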