Tags: python, pytorch, huggingface-transformers, huggingface

Transformers tokenizer attention mask for pytorch


In my code I have:

output = self.decoder(output, embedded, tgt_mask=attention_mask)

where

decoder_layer = TransformerDecoderLayer(embedding_size, num_heads, hidden_size, dropout, batch_first=True)
self.decoder = TransformerDecoder(decoder_layer, 1)

I generate the attention mask using a Hugging Face tokenizer:

batch = tokenizer(example['text'], return_tensors="pt", truncation=True, max_length=1024, padding='max_length')
inputs = batch['input_ids']
attention_mask = batch['attention_mask']

Running it through the model fails with

AssertionError: only bool and floating types of attn_mask are supported

Changing the attention mask to

attention_mask = batch['attention_mask'].bool()

causes

RuntimeError: The shape of the 2D attn_mask is torch.Size([4, 1024]), but should be (1024, 1024)

Any idea how I can use a Hugging Face tokenizer with my own PyTorch module?


Solution

  • PyTorch's tgt_mask is not the same as Hugging Face's attention_mask. The latter indicates which tokens are padding:

    from transformers import BertTokenizer
    
    t = BertTokenizer.from_pretrained("bert-base-cased")
    
    encoded = t("this is a test", max_length=10, padding="max_length")
    print(t.pad_token_id)
    print(encoded.input_ids)
    print(encoded.attention_mask)
    

    Output:

    0
    [101, 1142, 1110, 170, 2774, 102, 0, 0, 0, 0]
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
    

    PyTorch's equivalent to that is tgt_key_padding_mask.

    The tgt_mask, on the other hand, serves a different purpose: it defines which tokens are allowed to attend to which other tokens. For an NLP transformer decoder, it is usually used to prevent tokens from attending to future tokens (a causal mask). If that is your use case, you could also simply pass tgt_is_causal=True and PyTorch will create the tgt_mask for you.
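
    Putting both pieces together, here is a minimal sketch of how the tokenizer output could be wired into a TransformerDecoder. The sizes (embedding_size, num_heads, hidden_size, max_length) are made up for illustration, and the random tensors stand in for the question's output and embedded:

    import torch
    from torch import nn
    from transformers import BertTokenizer

    # Hypothetical sizes, chosen only for this sketch
    embedding_size, num_heads, hidden_size, dropout = 64, 4, 256, 0.1
    max_length = 32

    decoder_layer = nn.TransformerDecoderLayer(embedding_size, num_heads, hidden_size, dropout, batch_first=True)
    decoder = nn.TransformerDecoder(decoder_layer, 1)

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    batch = tokenizer(["this is a test"], return_tensors="pt", truncation=True,
                      max_length=max_length, padding="max_length")
    attention_mask = batch["attention_mask"]       # 1 = real token, 0 = padding

    # PyTorch's key padding mask uses the opposite convention: True = ignore this position
    tgt_key_padding_mask = attention_mask == 0     # bool, shape (batch, seq_len)

    # Causal mask: True above the diagonal = a token may not attend to future tokens
    tgt_mask = torch.triu(torch.ones(max_length, max_length, dtype=torch.bool), diagonal=1)

    # Random embeddings stand in for the question's `output` and `embedded`
    tgt = torch.randn(attention_mask.size(0), max_length, embedding_size)
    memory = torch.randn(attention_mask.size(0), max_length, embedding_size)

    out = decoder(tgt, memory,
                  tgt_mask=tgt_mask,
                  tgt_key_padding_mask=tgt_key_padding_mask)
    print(out.shape)   # torch.Size([1, 32, 64])

    Depending on your PyTorch version, you may be able to skip building tgt_mask by hand and pass tgt_is_causal=True instead, as mentioned above.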