In my code I have:
output = self.decoder(output, embedded, tgt_mask=attention_mask)
where
decoder_layer = TransformerDecoderLayer(embedding_size, num_heads, hidden_size, dropout, batch_first=True)
self.decoder = TransformerDecoder(decoder_layer, 1)
I generate the attention mask using a Hugging Face tokenizer:
batch = tokenizer(example['text'], return_tensors="pt", truncation=True, max_length=1024, padding='max_length')
inputs = batch['input_ids']
attention_mask = batch['attention_mask']
Running it through the model fails with
AssertionError: only bool and floating types of attn_mask are supported
Changing the attention mask to attention_mask = batch['attention_mask'].bool() causes
RuntimeError: The shape of the 2D attn_mask is torch.Size([4, 1024]), but should be (1024, 1024)
Any idea how I can use a Hugging Face tokenizer with my own PyTorch module?
PyTorch's tgt_mask is not the same as Hugging Face's attention_mask. The latter indicates which tokens are padded:
from transformers import BertTokenizer
t = BertTokenizer.from_pretrained("bert-base-cased")
encoded = t("this is a test", max_length=10, padding="max_length")
print(t.pad_token_id)
print(encoded.input_ids)
print(encoded.attention_mask)
Output:
0
[101, 1142, 1110, 170, 2774, 102, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
PyTorch's equivalent to that is tgt_key_padding_mask, but note that the convention is inverted: there, True marks the padding positions that should be ignored.
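Flipping the tokenizer output into that format is a one-liner; a minimal sketch with the mask from above:
import torch

attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])  # attention_mask from the tokenizer output above
tgt_key_padding_mask = attention_mask == 0  # True where a position is padding and must be ignored
print(tgt_key_padding_mask)
Output:
tensor([[False, False, False, False, False, False,  True,  True,  True,  True]])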
The tgt_mask, on the other hand, serves a different purpose: it defines which tokens may attend to which other tokens. For an NLP transformer decoder it is usually used to prevent tokens from attending to future tokens (a causal mask). If that is your use case, you can build the (seq_len, seq_len) mask with nn.Transformer.generate_square_subsequent_mask, or simply pass tgt_is_causal=True and PyTorch will apply the causal mask for you.
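Putting both together for a setup like yours, here is a sketch with placeholder hyperparameters and random tensors standing in for your output and embedded, building the causal mask explicitly with generate_square_subsequent_mask:
import torch
from torch import nn

embedding_size, num_heads, hidden_size, dropout = 768, 8, 2048, 0.1  # placeholder values
decoder_layer = nn.TransformerDecoderLayer(embedding_size, num_heads, hidden_size, dropout, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, 1)

batch_size, seq_len = 4, 1024
tgt = torch.rand(batch_size, seq_len, embedding_size)     # stands in for your `output`
memory = torch.rand(batch_size, seq_len, embedding_size)  # stands in for your `embedded`
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)
attention_mask[:, 512:] = 0                                # pretend the second half of each sequence is padding

causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)  # (seq_len, seq_len) float mask

out = decoder(
    tgt,
    memory,
    tgt_mask=causal_mask,                      # prevents attending to future tokens
    tgt_key_padding_mask=attention_mask == 0,  # padding positions from the tokenizer, inverted
)
print(out.shape)
Output:
torch.Size([4, 1024, 768])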