I am trying to train word embeddings with a transformer encoder by masking each word from itself using a diagonal src_mask:
def _generate_square_subsequent_mask(self, sz):
    # -inf on the diagonal, 0 elsewhere: block each position from attending to itself
    mask = torch.diag(torch.full((sz,), float('-inf')))
    return mask

def forward(self, src):
    # src: (seq_len, batch) tensor of word indices
    if self.src_mask is None or self.src_mask.size(0) != len(src):
        device = src.device
        mask = self._generate_square_subsequent_mask(len(src)).to(device)
        self.src_mask = mask

    src = self.embedding(src) * math.sqrt(self.ninp)
    src = self.dropout(src)
    src = self.pos_encoder(src)
    src = self.transformer_encoder(src, self.src_mask)
    output = self.decoder(src)  # Linear layer
    return output
After training, the model predicts exactly the same sentence as the input. If I change any word in the input, it predicts the new word. So the model isn't blocking anything according to the mask.
Why is that?
I understand there must be a mistake in my logic, because BERT would probably be much simpler if this worked. But where am I wrong?
Edit:
I am using a sequence of word indices as input. The output is the same sequence as the input.
As far as I understand - the model doesn't prevent each word from indirectly "seeing itself" in a multi-layer context: in layer 1 every other position attends to word i, and in layer 2 word i attends to those positions, so its own identity leaks back. I tried using a single layer - then the model seems to work, but training is too slow.
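For example, a minimal standalone sketch (using nn.MultiheadAttention directly, not my model; the sizes are arbitrary) shows that the mask itself does zero the direct self-attention weight, so the leak has to be the indirect multi-layer path:

import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead, sz = 16, 4, 5
attn = nn.MultiheadAttention(d_model, nhead)
x = torch.randn(sz, 1, d_model)                      # (seq_len, batch, d_model)
mask = torch.diag(torch.full((sz,), float('-inf')))  # same diagonal mask as in the model

_, weights = attn(x, x, x, attn_mask=mask)           # weights: (batch, sz, sz)
print(weights[0].diag())      # all zeros: no position attends to itself directly
print(weights[0][0, 1] > 0)   # but position 0 attends to position 1, whose
                              # layer-1 output already mixes in position 0,
                              # so a second layer lets position 0 see itself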