
IndexError: Index out of range in self while implementing transformer model for translation


I am trying to implement a transformer model for a translation task, following some YouTube tutorials, but I am getting an index-out-of-range error. The problem seems to be with the input dimensions, but I can't figure it out. Here is the code (Google Colab link)

You can find the datasets here

I tried to change the dimensions, but it didn't help, or I couldn't do it correctly. I hope someone can help solve this problem. Thanks


Solution

  • I went through your code and found that your error trace points to the forward call of SentenceEmbedding (encoder stage):

    69 def forward(self, x, start_token, end_token): # sentence  
    70      x = self.batch_tokenize(x, start_token, end_token)   
    71 ---> x = self.embedding(x)  
    72      pos = self.position_encoder().to(get_device())  
    73      x = self.dropout(x + pos)  
    

    If you add print(torch.max(x)) before the line x = self.embedding(x),

    you can see that x contains an id that is >= 68. Since the embedding table has only 68 rows, the valid indices are 0 to 67, and any larger id makes PyTorch raise the error from your stack trace.

    It means that somewhere in the token-to-id conversion, you are assigning an id outside that range.

    To prove my point:

    When you create english_to_index, there are three "" entries in your english_vocabulary (START_TOKEN, PADDING_TOKEN, and END_TOKEN are all ""). Duplicate keys collapse in a dict, so "" ends up mapped to the last index it was given, { "": 69 }, while the embedding table is built with only len(english_to_index) = 68 rows.
    Hence, looking up id 69 is out of bounds and you get IndexError: index out of range in self
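    The duplicate-key collision can be reproduced in plain Python with a hypothetical toy vocabulary (the names mirror the tutorial's; the values are made up):

    ```python
    # Toy vocabulary: the three special tokens are all the empty string,
    # mimicking START_TOKEN = PADDING_TOKEN = END_TOKEN = "".
    english_vocabulary = ["", "a", "b", "c", "", ""]

    # Built the same way as english_to_index in the tutorial:
    english_to_index = {token: idx for idx, token in enumerate(english_vocabulary)}

    # Duplicate keys collapse, so the dict is smaller than the list ...
    print(len(english_vocabulary))  # 6
    print(len(english_to_index))    # 4
    # ... but "" keeps the LAST index it was assigned, which is out of
    # range for an embedding table built with len(english_to_index) rows.
    print(english_to_index[""])     # 5
    ```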

    Solution

    As a solution, give each of these tokens a unique tag (as is generally recommended):

    START_TOKEN = "START"
    PADDING_TOKEN = "PAD"
    END_TOKEN = "END"
    

    This ensures that the generated dictionaries have the correct sizes.
    Please find the working Google Colaboratory file here, with a solution section.
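    With distinct special tokens, the lookup table and the vocabulary stay the same size, which you can sanity-check directly (a minimal sketch with a hypothetical toy vocabulary):

    ```python
    START_TOKEN, PADDING_TOKEN, END_TOKEN = "START", "PAD", "END"
    english_vocabulary = [START_TOKEN, "a", "b", "c", PADDING_TOKEN, END_TOKEN]

    english_to_index = {token: idx for idx, token in enumerate(english_vocabulary)}

    # Every key is unique, so every id is a valid row of the embedding table.
    print(len(english_to_index) == len(english_vocabulary))             # True
    print(max(english_to_index.values()) == len(english_to_index) - 1)  # True
    ```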

    I also added '\\' to the english_vocabulary, because after a few iterations the tokenizer hits a '\\' character that is missing from the vocabulary and raises KeyError: '\\'.
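    More generally, rather than patching the vocabulary one missing character at a time, you can map any unseen character to the padding id during tokenization. This is a sketch, not the tutorial's code; the fallback-to-padding choice is my assumption (a dedicated unknown token works just as well):

    ```python
    PADDING_TOKEN = "PAD"
    english_vocabulary = [PADDING_TOKEN, "h", "i"]
    english_to_index = {token: idx for idx, token in enumerate(english_vocabulary)}

    def tokenize(sentence):
        # dict.get with a default id avoids KeyError for characters
        # such as '\\' that never made it into the vocabulary.
        pad_id = english_to_index[PADDING_TOKEN]
        return [english_to_index.get(ch, pad_id) for ch in sentence]

    print(tokenize("hi\\"))  # [1, 2, 0]
    ```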

    Hope it helps.