I am trying to implement a transformer model for a translation task, following some YouTube tutorials, but I am getting an index-out-of-range error. The problem seems to be with the input dimensions, but I can't figure it out. Here is the code (google colab link)
You can find the datasets here
I tried to change the dimensions, but it didn't help (or I couldn't do it correctly). I hope someone can help solve this problem. Thanks
I went through your code and found that, according to your error trace, the failure happens in the `forward` call of `SentenceEmbedding` (encoder stage):
         69 def forward(self, x, start_token, end_token): # sentence
         70     x = self.batch_tokenize(x, start_token, end_token)
    ---> 71     x = self.embedding(x)
         72     pos = self.position_encoder().to(get_device())
         73     x = self.dropout(x + pos)
If you add `print(torch.max(x))` just before the line `x = self.embedding(x)`, you will see that `x` contains an id that is >= 68, i.e. outside the valid index range of the embedding table (the vocabulary size is 68, so only ids 0-67 are valid). Whenever an id falls outside that range, PyTorch raises the error shown in the stack trace. In other words, while converting tokens to ids, you are assigning a value that the embedding cannot look up.
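For illustration, here is a minimal standalone reproduction of that failure mode (68 is your vocabulary size; the embedding dimension and the tensors are made up for the example):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=68, embedding_dim=512)  # vocabulary size 68 -> valid ids are 0..67

ok = torch.tensor([[0, 5, 67]])    # every id is within range
print(emb(ok).shape)               # torch.Size([1, 3, 512])

bad = torch.tensor([[0, 5, 69]])   # 69 >= 68, same situation as your "" token
emb(bad)                           # IndexError: index out of range in self
```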
This happens when you create `english_to_index`: since there are three `""` entries in your `english_vocabulary` (`START_TOKEN`, `PADDING_TOKEN`, and `END_TOKEN` are all `""`), the duplicate keys collapse into one dictionary entry that keeps the index of the last occurrence, so you end up with `{"": 69}`. That id is >= `len(english_to_index)` (which is 68), hence the `IndexError: index out of range in self`.
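A small toy example shows the mechanism, assuming `english_to_index` is built with a plain dict comprehension over the vocabulary, as in the tutorial (the tokens below are made up; only the duplicate `""` entries matter):

```python
# Three duplicate "" entries (standing in for START/PAD/END) plus two real tokens.
vocabulary = ["", "a", "b", "", ""]

token_to_index = {token: i for i, token in enumerate(vocabulary)}

print(token_to_index)       # {'': 4, 'a': 1, 'b': 2} -- "" keeps the index of its *last* occurrence
print(len(token_to_index))  # 3, yet "" maps to 4, which is out of range for an embedding of size 3
```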
As a solution, you can give these special tokens unique values (which is the generally recommended practice):
START_TOKEN = "START"
PADDING_TOKEN = "PAD"
END_TOKEN = "END"
This will make sure that the generated dictionaries will have the correct sizes.
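With distinct tokens, the same toy construction produces consistent sizes:

```python
START_TOKEN, PADDING_TOKEN, END_TOKEN = "START", "PAD", "END"

vocabulary = [START_TOKEN, "a", "b", PADDING_TOKEN, END_TOKEN]
token_to_index = {token: i for i, token in enumerate(vocabulary)}

print(len(token_to_index))           # 5 -- matches len(vocabulary)
print(max(token_to_index.values()))  # 4 -- every id is a valid embedding index
```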
Please find the working Google Colaboratory file here with the solution section.
I also added `'\\'` to the `english_vocabulary`, since after a few iterations we otherwise get a `KeyError: '\\'`.
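If you want to catch any other missing characters up front rather than discovering them one `KeyError` at a time, a quick set difference like the one below works (assuming `english_sentences` is your list of raw English sentences; adjust the name to whatever your notebook uses):

```python
# Characters that occur in the corpus but are missing from the vocabulary.
missing = {ch for sentence in english_sentences for ch in sentence} - set(english_vocabulary)
print(missing)  # e.g. {'\\'} -- add these to english_vocabulary, or filter out the offending sentences
```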
Hope it helps.