neural-network · deep-learning · nlp · pytorch · word-embedding

Using pre-trained word embeddings - how to create vector for unknown / OOV Token?


I want to add pre-trained embeddings to a model, but it seems there is no out-of-vocabulary (OOV) token, i.e. no vector for unseen words.

So what can I do to handle OOV tokens I come across? I have some ideas, but none of them seem very good:

  • I could just create a random vector for this token, but ideally I'd like the vector to fit within the logic of the existing model. If I just create it randomly, I'm afraid the vector could accidentally end up very similar to that of a very frequent word like 'the', 'for', 'that', etc., which is not my intention.

  • Or should I just initialize the vector with plain zeros instead?

  • Another idea would be to average over other existing vectors. But which vectors should I average over? All of them? This doesn't seem very conclusive either.

  • I also thought about training this vector. However, that isn't very convenient if I want to freeze the rest of the embedding during training.

(A general solution is appreciated, but I wanted to add that I'm using PyTorch - just in case PyTorch already comes with a handy solution to this problem.)

So what would be a good and easy strategy to create such a vector?


Solution

  • There are multiple ways you can deal with this. I don't think I can cite references about which one works better.

    Non-trainable options:

    1. Use a random vector as the OOV embedding.
    2. Use an all-zero vector for OOV.
    3. Use the mean of all the embedding vectors; that way you avoid the risk of landing far from the actual distribution (see the sketch after this list).
    4. Embeddings also often come with an "unk" vector learned during training; you can use that.
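
    A minimal sketch of options 1–3, assuming the pre-trained vectors have already been loaded into a tensor pretrained of shape (vocab_size, dim); the names here are only illustrative:

    import torch

    # stand-in for the real pre-trained matrix (vocab_size x dim)
    pretrained = torch.randn(20000, 10)

    # 1. random vector
    oov_random = torch.randn(1, pretrained.shape[1])
    # 2. all-zero vector
    oov_zero = torch.zeros(1, pretrained.shape[1])
    # 3. mean of all pre-trained vectors
    oov_mean = pretrained.mean(dim=0, keepdim=True)

    # append the chosen OOV row and build a frozen lookup table;
    # the OOV token is then mapped to the last index
    weights = torch.cat([pretrained, oov_mean], dim=0)
    embedding = torch.nn.Embedding.from_pretrained(weights, freeze=True)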

    Trainable Option:

    You can declare a separate embedding vector for OOV and make it trainable while keeping the other embeddings fixed. You might have to override the forward method of the embedding lookup for this: declare a new trainable parameter and, in the forward pass, use this vector as the embedding for OOV tokens instead of doing a look-up.


    Addressing the comments of OP:

    I am not sure which of the first three non-trainable methods works better, and I am not aware of any work comparing them. But option 4 (a pre-trained "unk" vector) should work best.

    For the trainable option, you can create a new embedding layer as below.

    import torch

    class Embeddings_new(torch.nn.Module):
        def __init__(self, dim, vocab):
            super().__init__()
            # frozen lookup table for in-vocabulary tokens
            self.embedding = torch.nn.Embedding(vocab, dim)
            self.embedding.weight.requires_grad = False
            # single trainable vector shared by all OOV tokens
            self.oov = torch.nn.Parameter(data=torch.rand(1, dim))
            self.oov_index = -1
            self.dim = dim

        def forward(self, arr):
            N = arr.shape[0]
            # mask is 1 where the input index marks an OOV token, 0 elsewhere
            mask = (arr == self.oov_index).long()
            mask_ = mask.unsqueeze(dim=1).float()
            # (1 - mask) * arr maps the OOV index (-1) to a valid index (0);
            # the float mask then swaps those rows for the trainable OOV vector
            embed = (1 - mask_) * self.embedding((1 - mask) * arr) \
                    + mask_ * self.oov.expand((N, self.dim))
            return embed


    Usage:

    model = Embeddings_new(10, 20000)
    out = model(torch.tensor([-1, -1, 100, 1, 0]))
    # dummy loss on the returned embeddings
    loss = torch.sum(out ** 2)
    loss.backward()
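
    Since embedding.weight is frozen, only the OOV vector receives gradients here. One way to make that explicit (a sketch, assuming a plain SGD optimizer) is to hand the optimizer only the parameters that require gradients:

    # only self.oov has requires_grad=True, so it is the only tensor updated
    optimizer = torch.optim.SGD(
        (p for p in model.parameters() if p.requires_grad), lr=0.1
    )
    optimizer.step()
    optimizer.zero_grad()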