I know that T5 has K, Q and V projections in each layer, as well as a feed-forward network. I would like to freeze the K, Q and V projections and only train the feed-forward layers in each layer of T5. I'm using the PyTorch library. The model could be a wrapper around the Hugging Face T5 model or a modified version of it. I know how to freeze all parameters using the following code:
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained(underlying_model_name)
model = T5ForConditionalGeneration.from_pretrained(underlying_model_name)
for p in model.parameters():
    p.requires_grad = False  # freezing
Could you please guide me on how I can do this?
This GitHub project could probably be helpful, but it's for RoBERTa and GPT. Could I adapt it for T5?
I've adapted a solution based on this discussion from the Hugging Face forums. Basically, you have to specify the names of the modules/PyTorch layers that you want to freeze.
In your particular case of T5, I started by looking at the model summary:
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
print(model)
This gives the following (abbreviated output):
T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseReluDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      [...]  # abbreviated output
With this, we can then generate a list of the modules that we want to freeze. In particular, I decided to freeze the entire T5LayerSelfAttention block for the encoder (and, additionally, the T5LayerCrossAttention for the decoder):
# All encoder self-attention modules (layer[0] in each encoder block)
modules_to_freeze = [model.encoder.block[i].layer[0] for i in range(len(model.encoder.block))]
# The decoder blocks have both a self-attention module (layer[0]) ...
modules_to_freeze.extend([model.decoder.block[i].layer[0] for i in range(len(model.decoder.block))])
# ... and a cross-attention module (layer[1])
modules_to_freeze.extend([model.decoder.block[i].layer[1] for i in range(len(model.decoder.block))])
And then simply freeze all the parameters in the respective modules:
for module in modules_to_freeze:
    for param in module.parameters():
        param.requires_grad = False  # Actual freezing operation
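When you set up training, you can optionally hand only the still-trainable parameters to the optimizer; the frozen ones receive no gradients either way. A minimal sketch, assuming AdamW and a placeholder learning rate:

import torch

# Pass only the parameters that still require gradients (the feed-forward
# layers, layer norms, embeddings, etc.) to the optimizer.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)  # lr is just an example value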
You can verify that these are actually frozen in your model by running the following:
for param in model.parameters():
    print(param.requires_grad)
which should print quite a few False values as well. If you really only want to freeze K, Q and V, you can adapt the above process to sub-select just the modules you want, as sketched below.
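For completeness, here is a minimal sketch of that variant, assuming the usual Hugging Face T5 module layout (each self-attention wrapper exposes its T5Attention as SelfAttention, and the decoder's cross-attention wrapper as EncDecAttention). It freezes only the q, k and v projections and leaves the feed-forward layers, layer norms and the o projection trainable:

from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Gather every attention module: encoder and decoder self-attention,
# plus the decoder's cross-attention.
attention_modules = [block.layer[0].SelfAttention for block in model.encoder.block]
attention_modules += [block.layer[0].SelfAttention for block in model.decoder.block]
attention_modules += [block.layer[1].EncDecAttention for block in model.decoder.block]

for attn in attention_modules:
    # Freeze only the q, k and v projection layers; attn.o stays trainable.
    for proj in (attn.q, attn.k, attn.v):
        for param in proj.parameters():
            param.requires_grad = False

Depending on what you count as "attention", you may also want to add attn.o (and the relative attention bias) to the frozen set.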