python, pytorch, neural-network, gensim, word-embedding

PyTorch / Gensim - How do I load pre-trained word embeddings?


I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.

How do I get the embedding weights loaded by gensim into the PyTorch embedding layer?


Solution

  • I just wanted to report my findings about loading a gensim embedding with PyTorch.


    • Solution for PyTorch 0.4.0 and newer:

    Since v0.4.0 there is a classmethod nn.Embedding.from_pretrained() which makes loading an embedding very convenient. Here is an example from the documentation:

    import torch
    import torch.nn as nn
    
    # FloatTensor containing pretrained weights
    weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
    embedding = nn.Embedding.from_pretrained(weight)
    # Get embeddings for index 1
    input = torch.LongTensor([1])
    embedding(input)
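
    One detail worth knowing: from_pretrained() takes a freeze argument that defaults to True, so the loaded vectors are not updated during training unless you ask for it. A minimal sketch, reusing the toy weights from the example above (the variable names are only illustrative):

```python
import torch
import torch.nn as nn

weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])

# Default (freeze=True): the pretrained vectors stay fixed during training
frozen = nn.Embedding.from_pretrained(weight)

# freeze=False: the vectors become trainable, i.e. they can be fine-tuned
trainable = nn.Embedding.from_pretrained(weight, freeze=False)

frozen_requires_grad = frozen.weight.requires_grad
trainable_requires_grad = trainable.weight.requires_grad

# Looking up index 1 returns the second row of the weight matrix
row = frozen(torch.LongTensor([1]))
```

    Whether to freeze depends on the task; keeping the vectors fixed is the safer default when the training set is small.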
    

    The weights from gensim can easily be obtained by:

    import gensim
    model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
    weights = torch.FloatTensor(model.vectors)  # model.vectors replaces the older, deprecated model.syn0
    

    As noted by @Guglie: in newer gensim versions, if you have a full Word2Vec model (rather than KeyedVectors), the vectors are reached through model.wv:

    weights = torch.FloatTensor(model.wv.vectors)
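
    The weight matrix alone is not enough to embed text: the embedding layer expects row indices, so each word has to be mapped to the row gensim assigned it. Here is a small sketch of that lookup. To keep it runnable without a model file, the vectors array and the key_to_index dict are hypothetical toy stand-ins for model.vectors and model.key_to_index (the gensim 4.x attribute name), and encode is a helper made up for this example:

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical toy data standing in for a loaded gensim model:
# `vectors` plays the role of model.vectors,
# `key_to_index` the role of model.key_to_index (gensim 4.x).
vectors = np.array([[0.1, 0.2, 0.3],
                    [0.4, 0.5, 0.6],
                    [0.7, 0.8, 0.9]], dtype=np.float32)
key_to_index = {"cat": 0, "dog": 1, "fish": 2}

embedding = nn.Embedding.from_pretrained(torch.from_numpy(vectors))

def encode(tokens):
    """Map tokens to row indices; raises KeyError for out-of-vocabulary words."""
    return torch.LongTensor([key_to_index[t] for t in tokens])

indices = encode(["dog", "cat"])
looked_up = embedding(indices)  # one row per token, shape (2, 3)
```

    Out-of-vocabulary words are not handled here; in practice you would probably reserve an extra row for unknown tokens.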
    

    • Solution for PyTorch version 0.3.1 and older:

    I'm using version 0.3.1, where from_pretrained() isn't available, so I wrote my own from_pretrained that works with PyTorch 0.3.1 and older:

    def from_pretrained(embeddings, freeze=True):
        assert embeddings.dim() == 2, \
             'Embeddings parameter is expected to be 2-dimensional'
        rows, cols = embeddings.shape
        embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
        embedding.weight = torch.nn.Parameter(embeddings)
        embedding.weight.requires_grad = not freeze
        return embedding
    

    The embedding can then be loaded like this:

    embedding = from_pretrained(weights)
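
    When the embedding is frozen like this, it can help to pass only the trainable parameters to the optimizer. As far as I recall, older PyTorch releases complained about parameters that don't require gradients, while on current versions the filter is harmless. A rough sketch with a hypothetical toy model (Classifier and its layer sizes are made up for illustration):

```python
import torch
import torch.nn as nn

def from_pretrained(embeddings, freeze=True):
    # Same helper as above, repeated so this snippet runs on its own
    rows, cols = embeddings.shape
    embedding = nn.Embedding(num_embeddings=rows, embedding_dim=cols)
    embedding.weight = nn.Parameter(embeddings)
    embedding.weight.requires_grad = not freeze
    return embedding

class Classifier(nn.Module):  # hypothetical toy model
    def __init__(self, weights):
        super().__init__()
        self.embedding = from_pretrained(weights)
        self.linear = nn.Linear(weights.shape[1], 2)

    def forward(self, x):
        # Average the word vectors of each sequence, then classify
        return self.linear(self.embedding(x).mean(dim=1))

weights = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
model = Classifier(weights)

# Keep only parameters that should be updated (the linear layer here)
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable_params, lr=0.1)

before = model.embedding.weight.clone()
loss = model(torch.LongTensor([[0, 1]])).sum()
loss.backward()
optimizer.step()  # the frozen embedding weights are left untouched
```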
    

    I hope this is helpful for someone.