Tags: vector, pytorch, gensim, word2vec, recurrent-neural-network

Expected input to torch Embedding layer with pre-trained vectors from gensim


I would like to use pre-trained embeddings in my neural network architecture. The pre-trained embeddings are trained by gensim. I found this informative answer which indicates that we can load pre-trained models like so:

import gensim
import torch
from torch import nn

model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors)
emb = nn.Embedding.from_pretrained(weights)

This seems to work correctly, also on PyTorch 1.0.1. My question is that I don't quite understand what I have to feed into such a layer to utilise it. Can I just feed the tokens (a segmented sentence)? Do I need a mapping, for instance token-to-index?

I found that you can access a token's vector on the gensim model simply with something like

print(model['the'])
# [-1.1206588e+00  1.1578362e+00  2.8765252e-01 -1.1759659e+00 ... ]

What does that mean for an RNN architecture? Can we simply feed in the tokens of the batched sequences? For instance:

for seq_batch, y in batch_loader():
    # seq_batch is a batch of sequences (tokenized sentences)
    # e.g. [['i', 'like', 'cookies'],['it', 'is', 'raining'],['who', 'are', 'you']]
    output, hidden = model(seq_batch, hidden)

This does not seem to work, so I am assuming you need to convert each token to its index in the final word2vec model. Is that true? I found that you can get the index of a word by using the word2vec model's vocab:

model.vocab['world'].index
# 147

So, as input to an Embedding layer, should I provide a tensor of ints for a batch of sentences, each consisting of a sequence of word indices? Example use with a dummy dataloader (cf. the example above) and a dummy RNN would be welcome.
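
To make that concrete, this is roughly what I imagine the mapping step would look like (untested sketch, just to illustrate what I mean by a token-to-index mapping):

seq_batch = [['i', 'like', 'cookies'], ['it', 'is', 'raining'], ['who', 'are', 'you']]
# look up each token's index in the word2vec vocab -> nested list of ints
indexed_batch = [[model.vocab[token].index for token in seq] for seq in seq_batch]
# ...and then, presumably, turn this into a LongTensor for the Embedding layer?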


Solution

  • The documentation says the following:

    This module is often used to store word embeddings and retrieve them using indices. The input to the module is a list of indices, and the output is the corresponding word embeddings.

    So if you want to feed in a sentence, you give a LongTensor of indices, each corresponding to a word in the vocabulary, which the nn.Embedding layer will map to word vectors in its forward pass.

    Here's an illustration:

    import torch
    from torch import nn

    test_voc = ["ok", "great", "test"]
    # The word vectors for "ok", "great" and "test"
    # are at indices 0, 1 and 2, respectively.

    my_embedding = torch.rand(3, 50)
    e = nn.Embedding.from_pretrained(my_embedding)

    # A LongTensor of indices corresponds to a sentence,
    # reshaped to (1, 3) because the batch size is 1
    my_sentence = torch.tensor([0, 2, 1]).view(1, -1)

    res = e(my_sentence)
    print(res.shape)
    # => torch.Size([1, 3, 50])
    # 1 is the batch dimension, and there are three vectors of length 50 each

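    As a side note, from_pretrained freezes the embedding weights by default; pass freeze=False if you want the vectors to be fine-tuned during training:

    e_trainable = nn.Embedding.from_pretrained(my_embedding, freeze=False)  # weights stay trainable
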
    In terms of RNNs, you can then feed that tensor into your RNN module, e.g.

    # input_size must match the embedding dimension (50 here)
    lstm = nn.LSTM(input_size=50, hidden_size=5, batch_first=True)
    output, (h, c) = lstm(res)
    print(output.shape)
    # => torch.Size([1, 3, 5])
    
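
    To tie this back to your gensim model and the dummy dataloader from the question: the missing piece is exactly the token-to-index lookup you guessed at. A rough sketch (untested, reusing the 'path/to/file' placeholder and assuming a gensim version where model.vocab[word].index works, as in your question; gensim 4.0+ exposes model.key_to_index instead):

    import gensim
    import torch
    from torch import nn

    model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
    emb = nn.Embedding.from_pretrained(torch.FloatTensor(model.vectors))
    lstm = nn.LSTM(input_size=model.vector_size, hidden_size=5, batch_first=True)

    def to_indices(tokens):
        # map each token to its row in model.vectors (gensim < 4.0 API)
        return torch.tensor([model.vocab[t].index for t in tokens], dtype=torch.long)

    batch = [['i', 'like', 'cookies'],
             ['it', 'is', 'raining'],
             ['who', 'are', 'you']]
    # all sentences have length 3 here, so we can simply stack them
    seq_batch = torch.stack([to_indices(seq) for seq in batch])   # shape (3, 3)

    output, (h, c) = lstm(emb(seq_batch))
    print(output.shape)
    # => torch.Size([3, 3, 5])

    In practice you would also have to handle out-of-vocabulary tokens and pad variable-length sentences (e.g. with nn.utils.rnn.pad_sequence) before stacking them into a batch.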

    I also recommend you look into torchtext. It can automate some of the work you would otherwise have to do manually.