Tags: vector, pytorch, gensim, word2vec, recurrent-neural-network

Expected input to torch Embedding layer with pre-trained vectors from gensim


I would like to use pre-trained embeddings in my neural network architecture. The pre-trained embeddings are trained by gensim. I found this informative answer which indicates that we can load pre-trained models like so:

import gensim
import torch
from torch import nn

model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors)
emb = nn.Embedding.from_pretrained(weights)

This seems to work correctly, also on PyTorch 1.0.1. My question is that I don't quite understand what I have to feed into such a layer to utilise it. Can I just feed the tokens (a segmented sentence)? Do I need a mapping, for instance token-to-index?

I found that you can access a token's vector on the gensim model simply with something like

print(model['the'])
# [-1.1206588e+00  1.1578362e+00  2.8765252e-01 -1.1759659e+00 ... ]

What does that mean for an RNN architecture? Can we simply feed in the tokens of the batched sequences? For instance:

for seq_batch, y in batch_loader():
    # seq_batch is a batch of sequences (tokenized sentences)
    # e.g. [['i', 'like', 'cookies'],['it', 'is', 'raining'],['who', 'are', 'you']]
    output, hidden = model(seq_batch, hidden)

This does not seem to work, so I am assuming you need to convert each token to its index in the final word2vec model. Is that true? I found that you can get the index of a word by using the word2vec model's vocab:

model.vocab['world'].index
# 147

So, as input to an Embedding layer, should I provide a tensor of ints for a batch of sentences, each consisting of a sequence of word indices? Example use with a dummy dataloader (cf. the example above) and a dummy RNN would be welcome.
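
To make that concrete, this is roughly what I imagine the mapping step would look like (untested sketch, just to illustrate what I mean by a token-to-index mapping):

seq_batch = [['i', 'like', 'cookies'], ['it', 'is', 'raining'], ['who', 'are', 'you']]
# look up each token's index in the word2vec vocab -> nested list of ints
indexed_batch = [[model.vocab[token].index for token in seq] for seq in seq_batch]
# ...and then, presumably, turn this into a LongTensor for the Embedding layer?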


Solution

  • The documentation says the following:

    This module is often used to store word embeddings and retrieve them using indices. The input to the module is a list of indices, and the output is the corresponding word embeddings.

    So if you want to feed in a sentence, you give a LongTensor of indices, each corresponding to a word in the vocabulary, which the nn.Embedding layer will map to word vectors in its forward pass.

    Here's an illustration:

    import torch
    from torch import nn

    test_voc = ["ok", "great", "test"]
    # The word vectors for "ok", "great" and "test"
    # are at indices 0, 1 and 2, respectively.

    my_embedding = torch.rand(3, 50)
    e = nn.Embedding.from_pretrained(my_embedding)

    # A LongTensor of indices corresponds to a sentence,
    # reshaped to (1, 3) because the batch size is 1
    my_sentence = torch.tensor([0, 2, 1]).view(1, -1)

    res = e(my_sentence)
    print(res.shape)
    # => torch.Size([1, 3, 50])
    # 1 is the batch dimension, and there are three vectors of length 50 each

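    As a side note, from_pretrained freezes the embedding weights by default; pass freeze=False if you want the vectors to be fine-tuned during training:

    e_trainable = nn.Embedding.from_pretrained(my_embedding, freeze=False)  # weights stay trainable
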
    In terms of RNNs, you can then feed that tensor into your RNN module, e.g.

    # input_size must match the embedding dimension (50 here)
    lstm = nn.LSTM(input_size=50, hidden_size=5, batch_first=True)
    output, (h, c) = lstm(res)
    print(output.shape)
    # => torch.Size([1, 3, 5])
    
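
    To tie this back to your gensim model and the dummy dataloader from the question: the missing piece is exactly the token-to-index lookup you guessed at. A rough sketch (untested, reusing the 'path/to/file' placeholder and assuming a gensim version where model.vocab[word].index works, as in your question; gensim 4.0+ exposes model.key_to_index instead):

    import gensim
    import torch
    from torch import nn

    model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
    emb = nn.Embedding.from_pretrained(torch.FloatTensor(model.vectors))
    lstm = nn.LSTM(input_size=model.vector_size, hidden_size=5, batch_first=True)

    def to_indices(tokens):
        # map each token to its row in model.vectors (gensim < 4.0 API)
        return torch.tensor([model.vocab[t].index for t in tokens], dtype=torch.long)

    batch = [['i', 'like', 'cookies'],
             ['it', 'is', 'raining'],
             ['who', 'are', 'you']]
    # all sentences have length 3 here, so we can simply stack them
    seq_batch = torch.stack([to_indices(seq) for seq in batch])   # shape (3, 3)

    output, (h, c) = lstm(emb(seq_batch))
    print(output.shape)
    # => torch.Size([3, 3, 5])

    In practice you would also have to handle out-of-vocabulary tokens and pad variable-length sentences (e.g. with nn.utils.rnn.pad_sequence) before stacking them into a batch.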

    I also recommend you look into torchtext. It can automate some of the work you would otherwise have to do manually.