I am trying to train word embeddings with a transformer encoder by masking each word from itself using a diagonal src_mask:
def _generate_square_subsequent_mask(self, sz):
    # -inf on the diagonal, 0 elsewhere: block each position from attending to itself
    mask = torch.diag(torch.full((sz,), float('-inf')))
    return mask

def forward(self, src):
    # src: (seq_len, batch) tensor of word indices
    if self.src_mask is None or self.src_mask.size(0) != len(src):
        device = src.device
        mask = self._generate_square_subsequent_mask(len(src)).to(device)
        self.src_mask = mask

    src = self.embedding(src) * math.sqrt(self.ninp)
    src = self.dropout(src)
    src = self.pos_encoder(src)
    src = self.transformer_encoder(src, self.src_mask)
    output = self.decoder(src)  # Linear layer
    return output
After training, the model predicts exactly the same sentence as the input. If I change any word in the input, it predicts the new word. So the model isn't blocking anything according to the mask.
Why is that?
I understand there must be a mistake in my logic, because BERT would probably be much simpler if this worked. But where am I wrong?
Edit:
I am using a sequence of word indices as input. The output is the same sequence as the input.
As far as I understand - the model doesn't prevent each word from indirectly "seeing itself" in a multi-layer context: in layer 1 every other position attends to word i, and in layer 2 word i attends to those positions, so its own identity leaks back. I tried using a single layer - then the model seems to work, but training is too slow.
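For example, a minimal standalone sketch (using nn.MultiheadAttention directly, not my model; the sizes are arbitrary) shows that the mask itself does zero the direct self-attention weight, so the leak has to be the indirect multi-layer path:

import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead, sz = 16, 4, 5
attn = nn.MultiheadAttention(d_model, nhead)
x = torch.randn(sz, 1, d_model)                      # (seq_len, batch, d_model)
mask = torch.diag(torch.full((sz,), float('-inf')))  # same diagonal mask as in the model

_, weights = attn(x, x, x, attn_mask=mask)           # weights: (batch, sz, sz)
print(weights[0].diag())      # all zeros: no position attends to itself directly
print(weights[0][0, 1] > 0)   # but position 0 attends to position 1, whose
                              # layer-1 output already mixes in position 0,
                              # so a second layer lets position 0 see itself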