
What is the difference between an Embedding layer followed immediately by a bias and a Linear layer in PyTorch?


I am reading the "Deep Learning for Coders with fastai & PyTorch" book. I'm still a bit confused as to what the Embedding module does. It seems like a short and simple network, except I can't seem to wrap my head around what Embedding does differently than Linear without a bias. I know it does some faster computational version of a dot product where one of the matrices is a one-hot encoded matrix and the other is the embedding matrix. It does this to in effect select a piece of data? Please point out where I am wrong. Here is one of the simple networks shown in the book.

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)

Solution

    Embedding

    [...] what Embedding does differently than Linear without a bias.

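    Essentially everything. torch.nn.Embedding is a lookup table: calling it with integer indices returns the corresponding rows of its weight tensor, just like indexing a torch.Tensor, with a few extras (such as optional sparse gradients via sparse=True, or a padding_idx row that is not updated during training).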

    For example:

    import torch
    
    embedding = torch.nn.Embedding(3, 4)
    
    print(embedding.weight)
    
    print(embedding(torch.tensor([1])))
    

    Would output:

    Parameter containing:
    tensor([[ 0.1420, -0.1886,  0.6524,  0.3079],
            [ 0.2620,  0.4661,  0.7936, -1.6946],
            [ 0.0931,  0.3512,  0.3210, -0.5828]], requires_grad=True)
    tensor([[ 0.2620,  0.4661,  0.7936, -1.6946]], grad_fn=<EmbeddingBackward>)
    

    So we took the row at index 1 of the embedding, i.e. its second row, since indices are zero-based. It does nothing more than that.
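
    As a quick check of the "lookup table" claim, here is a minimal sketch comparing the layer's output with plain indexing into its weight. The seed and the index values are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)  # only so the random weights are reproducible

embedding = torch.nn.Embedding(3, 4)
indices = torch.tensor([1, 1, 2])  # arbitrary row indices, repeats allowed

looked_up = embedding(indices)       # one row of the weight per index
indexed = embedding.weight[indices]  # ordinary tensor indexing

same = torch.equal(looked_up, indexed)
print(tuple(looked_up.shape), same)
```

    Both expressions select the same rows; the layer adds gradient tracking and the options mentioned earlier, not a different computation.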

    Where is it used?

    Usually when we want a vector that encodes some meaning for each row (as in word2vec, where words that are close semantically end up close in Euclidean space), and possibly want to train those vectors.

    Linear

    torch.nn.Linear (without bias) also holds a single weight tensor, but it performs an operation on it (and the input), which is essentially:

    output = input.matmul(weight.t())
    

    every time you call the layer (see source code and functional definition of this layer).
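
    The formula above can be checked against the layer directly. This is a small sketch with made-up sizes (4 input features, 2 output features, batch of 5), not a statement about how the layer is implemented internally.

```python
import torch

torch.manual_seed(0)  # only so the random values are reproducible

linear = torch.nn.Linear(in_features=4, out_features=2, bias=False)
x = torch.randn(5, 4)  # float input of shape (batch_size, in_features)

out_layer = linear(x)                     # what the layer returns
out_manual = x.matmul(linear.weight.t())  # the formula written out by hand

matches = torch.allclose(out_layer, out_manual, atol=1e-6)
print(tuple(out_layer.shape), matches)
```

    Note that the input here is a float tensor of features, whereas Embedding expects integer indices.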

    Code snippet

    The layer in your code snippet does this:

    • creates two lookup tables in __init__
    • the layer is called with input of shape (batch_size, 2):
      • first column contains indices of user embeddings
      • second column contains indices of movie embeddings
    • these embeddings are multiplied element-wise and summed over the factor dimension, returning shape (batch_size,). This is a row-wise dot product between each user vector and its movie vector, which is different from nn.Linear: that would return (batch_size, out_features) by matrix-multiplying the input with the layer's own weight.

    This is probably used to train both representations (of users and movies) for some recommender-like system.
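
    The steps listed above can be traced with a standalone version of the model. This sketch assumes plain torch.nn instead of the fastai imports used in the book (so it needs the explicit super().__init__() call), and the sizes and index pairs are invented for illustration.

```python
import torch
from torch import nn

class DotProduct(nn.Module):
    def __init__(self, n_users, n_movies, n_factors):
        super().__init__()  # needed with plain nn.Module
        self.user_factors = nn.Embedding(n_users, n_factors)
        self.movie_factors = nn.Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:, 0])    # (batch_size, n_factors)
        movies = self.movie_factors(x[:, 1])  # (batch_size, n_factors)
        return (users * movies).sum(dim=1)    # (batch_size,)

model = DotProduct(n_users=10, n_movies=20, n_factors=5)
batch = torch.tensor([[0, 3], [7, 19], [2, 2]])  # (user index, movie index) pairs
scores = model(batch)

# The first score, recomputed as an explicit dot product of two looked-up rows
first = torch.dot(model.user_factors.weight[0], model.movie_factors.weight[3])
print(tuple(scores.shape), torch.allclose(scores[0], first, atol=1e-6))
```

    Each output element is one predicted score for one (user, movie) pair.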

    Other stuff

    I know it does some faster computational version of a dot product where one of the matrices is a one-hot encoded matrix and the other is the embedding matrix.

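    Not literally. torch.nn.Embedding never builds a one-hot matrix and never performs a matrix product; it indexes straight into its weight. The result is the same as multiplying a one-hot matrix by the embedding matrix, so that picture is fine as a mental model, just not as a description of the computation. The layer can also produce sparse gradients (sparse=True), but whether that gives a performance boost depends on the optimizer and whether it supports sparse gradients.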
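
    To compare the lookup with the one-hot picture from the question, here is a minimal sketch that builds the one-hot matrix by hand with torch.nn.functional.one_hot. The explicit one-hot matrix is only for the comparison; the Embedding layer itself does not construct it.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # only so the random weights are reproducible

embedding = torch.nn.Embedding(3, 4)
indices = torch.tensor([1, 0, 2, 1])

# Explicit one-hot rows, cast to float so they can be matrix-multiplied
one_hot = F.one_hot(indices, num_classes=3).float()  # (4, 3)
via_matmul = one_hot.matmul(embedding.weight)        # (4, 4)

via_lookup = embedding(indices)                      # (4, 4), no one-hot built

equivalent = torch.allclose(via_matmul, via_lookup)
print(tuple(one_hot.shape), equivalent)
```

    The values agree, but the lookup skips both the one-hot allocation and the multiplication by all the zeros.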