
IndexError: Index out of range in self while implementing transformer model for translation


I am trying to implement a transformer model for a translation task, following some YouTube tutorials, but I am getting an index-out-of-range error. The problem seems to be with the input dimensions, but I can't figure it out. Here is the code (Google Colab link)

You can find the datasets here

I tried to change the dimensions, but it didn't help, or I couldn't do it correctly. I hope someone can help solve this problem. Thanks


Solution

  • I went through your code and found that your error trace points to the forward call of SentenceEmbedding (encoder stage):

    69 def forward(self, x, start_token, end_token): # sentence  
    70      x = self.batch_tokenize(x, start_token, end_token)   
    71 ---> x = self.embedding(x)  
    72      pos = self.position_encoder().to(get_device())  
    73      x = self.dropout(x + pos)  
    

    If you add print(torch.max(x)) before the line x = self.embedding(x),

    you can see that x contains an id that is >= 68. Since the embedding table has only 68 rows, the valid indices are 0 to 67, and any larger id makes PyTorch raise the error from your stack trace.

    It means that somewhere in the token-to-id conversion, you are assigning an id outside that range.

    To prove my point:

    When you create english_to_index, there are three "" entries in your english_vocabulary (START_TOKEN, PADDING_TOKEN, and END_TOKEN are all ""). Duplicate keys collapse in a dict, so "" ends up mapped to the last index it was given, { "": 69 }, while the embedding table is built with only len(english_to_index) = 68 rows.
    Hence, looking up id 69 is out of bounds and you get IndexError: index out of range in self
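    The duplicate-key collision can be reproduced in plain Python with a hypothetical toy vocabulary (the names mirror the tutorial's; the values are made up):

    ```python
    # Toy vocabulary: the three special tokens are all the empty string,
    # mimicking START_TOKEN = PADDING_TOKEN = END_TOKEN = "".
    english_vocabulary = ["", "a", "b", "c", "", ""]

    # Built the same way as english_to_index in the tutorial:
    english_to_index = {token: idx for idx, token in enumerate(english_vocabulary)}

    # Duplicate keys collapse, so the dict is smaller than the list ...
    print(len(english_vocabulary))  # 6
    print(len(english_to_index))    # 4
    # ... but "" keeps the LAST index it was assigned, which is out of
    # range for an embedding table built with len(english_to_index) rows.
    print(english_to_index[""])     # 5
    ```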

    Solution

    As a solution, give each of these tokens a unique tag (as is generally recommended):

    START_TOKEN = "START"
    PADDING_TOKEN = "PAD"
    END_TOKEN = "END"
    

    This ensures that the generated dictionaries have the correct sizes.
    Please find the working Google Colaboratory file here, with a solution section.
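    With distinct special tokens, the lookup table and the vocabulary stay the same size, which you can sanity-check directly (a minimal sketch with a hypothetical toy vocabulary):

    ```python
    START_TOKEN, PADDING_TOKEN, END_TOKEN = "START", "PAD", "END"
    english_vocabulary = [START_TOKEN, "a", "b", "c", PADDING_TOKEN, END_TOKEN]

    english_to_index = {token: idx for idx, token in enumerate(english_vocabulary)}

    # Every key is unique, so every id is a valid row of the embedding table.
    print(len(english_to_index) == len(english_vocabulary))             # True
    print(max(english_to_index.values()) == len(english_to_index) - 1)  # True
    ```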

    I also added '\\' to the english_vocabulary, because after a few iterations the tokenizer hits a '\\' character that is missing from the vocabulary and raises KeyError: '\\'.
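    More generally, rather than patching the vocabulary one missing character at a time, you can map any unseen character to the padding id during tokenization. This is a sketch, not the tutorial's code; the fallback-to-padding choice is my assumption (a dedicated unknown token works just as well):

    ```python
    PADDING_TOKEN = "PAD"
    english_vocabulary = [PADDING_TOKEN, "h", "i"]
    english_to_index = {token: idx for idx, token in enumerate(english_vocabulary)}

    def tokenize(sentence):
        # dict.get with a default id avoids KeyError for characters
        # such as '\\' that never made it into the vocabulary.
        pad_id = english_to_index[PADDING_TOKEN]
        return [english_to_index.get(ch, pad_id) for ch in sentence]

    print(tokenize("hi\\"))  # [1, 2, 0]
    ```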

    Hope it helps.