python, machine-learning, pytorch, word-embedding

PyTorch: Loading word vectors into Field vocabulary vs. Embedding layer


I'm coming from Keras to PyTorch. I would like to create a PyTorch Embedding layer (a matrix of size V x D, where V indexes the vocabulary words and D is the embedding vector dimension) with GloVe vectors but am confused by the needed steps.

In Keras, you can load the GloVe vectors by having the Embedding layer constructor take a weights argument:

# Keras code.
embedding_layer = Embedding(..., weights=[embedding_matrix])

When looking at PyTorch and the TorchText library, I see that the embeddings should be loaded twice, once in a Field and then again in an Embedding layer. Here is sample code that I found:

# PyTorch code.

# Create a field for text and build a vocabulary with 'glove.6B.100d'
# pretrained embeddings.
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)

TEXT.build_vocab(train_data, vectors='glove.6B.100d')


# Build an RNN model with an Embedding layer.
class RNN(nn.Module):
    def __init__(self, ...):

        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        ...

# Initialize the embedding layer with the Glove embeddings from the
# vocabulary. Why are two steps needed???
model = RNN(...)
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

Specifically:

  1. Why are the GloVe embeddings loaded in a Field in addition to the Embedding?
  2. I thought the Field function build_vocab() just builds its vocabulary from the training data. How are the GloVe embeddings involved here during this step?

Here are other StackOverflow questions that did not answer my questions:

PyTorch / Gensim - How to load pre-trained word embeddings

Embedding in pytorch

PyTorch LSTM - using word embeddings instead of nn.Embedding()

Thanks for any help.


Solution

  • When torchtext builds the vocabulary, it aligns the token indices with the embedding vectors. If your vocabulary didn't have the same size and ordering as the pre-trained embeddings, the indices wouldn't be guaranteed to match, and you might look up incorrect embeddings. build_vocab() creates the vocabulary for your dataset with the corresponding embeddings and discards the rest of the embeddings, because those are unused.

    The GloVe-6B embeddings include a vocabulary of 400K words. The IMDB dataset, for example, only uses about 120K of these; the other 280K are unused.

    import torch
    from torchtext import data, datasets, vocab
    
    TEXT = data.Field(tokenize='spacy', include_lengths=True)
    LABEL = data.LabelField()
    
    train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
    TEXT.build_vocab(train_data, vectors='glove.6B.100d')
    
    TEXT.vocab.vectors.size() # => torch.Size([121417, 100])
    
    # For comparison the full GloVe
    glove = vocab.GloVe(name="6B", dim=100)
    glove.vectors.size() # => torch.Size([400000, 100])
    
    # Embedding of the first token is not the same
    torch.equal(TEXT.vocab.vectors[0], glove.vectors[0]) # => False
    
    # Index of the word "the"
    TEXT.vocab.stoi["the"] # => 2
    glove.stoi["the"] # => 0
    
    # Same embedding when using the respective index of the same word
    torch.equal(TEXT.vocab.vectors[2], glove.vectors[0]) # => True
    

    After the vocabulary has been built with its embeddings, the input sequences are given in their tokenised form, where each token is represented by its index. The model needs to look up the embeddings for these indices, so you still create an embedding layer, but one initialised with the embeddings of your vocabulary. The easiest and recommended way is nn.Embedding.from_pretrained, which is essentially the same as the Keras version.

    embedding_layer = nn.Embedding.from_pretrained(TEXT.vocab.vectors)
    
    # Or if you want to make it trainable
    trainable_embedding_layer = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=False)
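
    For completeness, the RNN from the question can take this embedding layer directly, so the separate weight-copying step (model.embedding.weight.data.copy_(...)) is no longer needed. This is only a sketch; the LSTM, hidden_dim and output_dim are placeholder choices rather than anything from the original code.

    import torch.nn as nn

    class RNN(nn.Module):
        def __init__(self, vocab_vectors, hidden_dim, output_dim):
            super().__init__()
            # Initialise the embedding layer from the vocabulary vectors
            # (TEXT.vocab.vectors) instead of copying weights afterwards.
            self.embedding = nn.Embedding.from_pretrained(vocab_vectors, freeze=False)
            self.rnn = nn.LSTM(vocab_vectors.size(1), hidden_dim)
            self.fc = nn.Linear(hidden_dim, output_dim)

        def forward(self, text):
            # text: [seq_len, batch] of token indices from TEXT.vocab
            embedded = self.embedding(text)             # [seq_len, batch, emb_dim]
            output, (hidden, cell) = self.rnn(embedded)
            return self.fc(hidden[-1])                  # [batch, output_dim]

    # Placeholder hyperparameters; output_dim=1 assumes a binary task like IMDB.
    model = RNN(TEXT.vocab.vectors, hidden_dim=256, output_dim=1)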
    

    You didn't mention how the embedding_matrix is created in the Keras version, nor how the vocabulary is built such that it can be used with the embedding_matrix. If you do that by hand (or with any other utility), you don't need torchtext at all, and you can initialise the embeddings just as in Keras. torchtext is purely a convenience for common data-related tasks.
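
    For reference, here is a minimal sketch of that manual route, assuming a local glove.6B.100d.txt file and a word-to-index mapping you built yourself from the training data (both hypothetical here); the resulting matrix is used exactly like the Keras embedding_matrix.

    import numpy as np
    import torch
    import torch.nn as nn

    # Hypothetical vocabulary built by hand from your training data.
    word_to_index = {"<pad>": 0, "<unk>": 1, "the": 2, "movie": 3}
    embedding_dim = 100

    # Read the GloVe text file and keep only the words in your vocabulary;
    # words without a pre-trained vector stay at zero.
    embedding_matrix = np.zeros((len(word_to_index), embedding_dim), dtype=np.float32)
    with open("glove.6B.100d.txt", encoding="utf-8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_to_index:
                embedding_matrix[word_to_index[word]] = np.asarray(vector, dtype=np.float32)

    # Same idea as Keras' Embedding(..., weights=[embedding_matrix]).
    embedding_layer = nn.Embedding.from_pretrained(torch.from_numpy(embedding_matrix))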