python, huggingface-transformers, mistral-7b

Attention mask error when fine-tuning Mistral 7B using transformers trainer


I'm trying to fine-tune mistralai/Mistral-7B-v0.1 following a sample notebook.

I follow the steps in the notebook, but the training fails with:

***** Running training *****
  Num examples = 344
  Num Epochs = 3
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 500
  Number of trainable parameters = 21,260,288
  0%|          | 0/500 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 293, in forward
    raise ValueError(
ValueError: Attention mask should be of size (2, 1, 512, 1024), but is torch.Size([2, 1, 512, 512])

Any ideas where this attention mask issue could come from? My tokenized data has a length of exactly 512. Why is the model expecting size 1024, and why these particular four dimensions?
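For context, the relevant part of my setup looks roughly like this (a simplified sketch, not the notebook's exact code; the dataset, output path, and hyperparameters are placeholders, and the real notebook also applies LoRA/PEFT):

    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    model_name = "mistralai/Mistral-7B-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def tokenize(batch):
        # every example is padded/truncated to exactly 512 tokens
        return tokenizer(batch["text"], max_length=512,
                         padding="max_length", truncation=True)

    # placeholder training data; the notebook builds this from its own dataset
    train_dataset = Dataset.from_dict({"text": ["example text"] * 8})
    train_dataset = train_dataset.map(tokenize, remove_columns=["text"])

    args = TrainingArguments(
        output_dir="mistral-finetune",      # placeholder path
        per_device_train_batch_size=2,
        gradient_accumulation_steps=1,
        max_steps=500,
        gradient_checkpointing=True,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()   # fails with the ValueError above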


Solution

  • I was experiencing the same issue; downgrading transformers from the latest version (4.36.0) to 4.35.2 made training work again (see the command below).
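To apply the same workaround, pin the version (assuming a pip-based environment):

    pip install "transformers==4.35.2"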