Tags: python, deep-learning, neural-network, pytorch

Embedding Layer - torch.nn.Embedding in PyTorch


I'm quite new to neural networks, so sorry if my question is a dumb one. I was reading code on GitHub and noticed the pros use an embedding layer (in that case not a word embedding), so may I ask in general:

  1. Does the embedding layer have trainable variables that learn over time to improve the embedding?
  2. Could you provide an intuition for it and for the circumstances in which to use it, e.g. would house price regression benefit from it?
  3. If so (i.e. it learns), what is the difference from just using linear layers?
>>> embedding = nn.Embedding(10, 3)
>>> input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
>>> input
tensor([[1, 2, 4, 5],
        [4, 3, 2, 9]])

>>> embedding(input)
tensor([[[-0.0251, -1.6902,  0.7172],
         [-0.6431,  0.0748,  0.6969],
         [ 1.4970,  1.3448, -0.9685],
         [-0.3677, -2.7265, -0.1685]],

        [[ 1.4970,  1.3448, -0.9685],
         [ 0.4362, -0.4004,  0.9400],
         [-0.6431,  0.0748,  0.6969],
         [ 0.9124, -2.3616,  1.1151]]])

Solution

  • In short, the embedding layer has learnable parameters and the usefulness of the layer depends on what inductive bias you want on the data.

    Does the embedding layer have trainable variables that learn over time to improve the embedding?

    Yes, as stated in the docs under the Variables section, it has an embedding weight that is altered during the training process.
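
    To see this concretely, here is a minimal sketch (not from the original post) that inspects the weight attribute and runs a toy backward pass:

    import torch
    import torch.nn as nn

    embedding = nn.Embedding(10, 3)        # 10 indices, 3-dimensional vectors
    print(embedding.weight.shape)          # torch.Size([10, 3])
    print(embedding.weight.requires_grad)  # True, so the optimizer will update it

    # Toy backward pass: only the looked-up rows receive gradients.
    embedding(torch.LongTensor([1, 2, 4, 5])).sum().backward()
    print(embedding.weight.grad[1])        # non-zero row, index 1 was used
    print(embedding.weight.grad[0])        # all zeros, index 0 was never used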

    Could you provide an intuition for it and for the circumstances in which to use it, e.g. would house price regression benefit from it?

    An embedding layer is commonly used in NLP tasks where the input is tokenized. This means that the input is discrete in a sense and can be used to index the weight matrix (which is basically what the embedding layer does in the forward pass). This discreteness implies that inputs like 1, 2 and 42 are treated as entirely unrelated (until a semantic correlation has been learnt). House price regression has a continuous input space, where values such as 1.0 and 1.1 are likely more correlated than 1.0 and 42.0. This kind of assumption about the hypothesis space is called an inductive bias, and pretty much every machine learning architecture conforms to some sort of inductive bias. I believe it is possible to use embedding layers in a regression problem, but that would require some kind of discretization, and a purely continuous input would likely not benefit from it; it mainly pays off for genuinely categorical features, as sketched below.
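
    As a hypothetical illustration (the model, feature names and sizes below are invented for this sketch, not taken from the question), an embedding would only enter a house-price model through a genuinely categorical feature, such as a neighborhood id, while continuous features go straight into linear layers:

    import torch
    import torch.nn as nn

    class PriceModel(nn.Module):
        """Hypothetical house-price regressor: an embedding for the categorical
        neighborhood id, plain linear layers for the continuous features."""
        def __init__(self, n_neighborhoods=50, emb_dim=4, n_continuous=3):
            super().__init__()
            self.neigh_emb = nn.Embedding(n_neighborhoods, emb_dim)
            self.head = nn.Sequential(
                nn.Linear(emb_dim + n_continuous, 16),
                nn.ReLU(),
                nn.Linear(16, 1),
            )

        def forward(self, neigh_id, continuous):
            x = torch.cat([self.neigh_emb(neigh_id), continuous], dim=-1)
            return self.head(x)

    model = PriceModel()
    neigh_id = torch.LongTensor([3, 17])      # discrete ids, used as a lookup
    continuous = torch.randn(2, 3)            # e.g. area, rooms, age
    print(model(neigh_id, continuous).shape)  # torch.Size([2, 1])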

    If so (i.e. it learns), what is the difference from just using linear layers?

    There is a big difference: the linear layer performs a matrix multiplication with its weight, whereas the embedding layer uses its weight as a lookup table. During backpropagation through the embedding layer, gradients only propagate to the rows corresponding to the indices used in the lookup, and contributions from duplicate indices are accumulated.
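
    A small sketch to make the difference concrete: indexing the embedding weight is equivalent to multiplying a one-hot encoding of the input by that same weight matrix (which is what a bias-free linear layer would do), and the gradient only lands on the rows that were actually looked up, with duplicates accumulated:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    embedding = nn.Embedding(10, 3)
    idx = torch.LongTensor([2, 2, 7])      # note the duplicated index 2

    # Lookup view: embedding(idx) simply selects rows of the weight matrix ...
    lookup = embedding(idx)

    # ... which equals a matmul with one-hot vectors (a Linear layer without
    # bias applied to a one-hot encoded input).
    one_hot = F.one_hot(idx, num_classes=10).float()
    print(torch.allclose(lookup, one_hot @ embedding.weight))  # True

    # Gradients only hit the rows that were used; duplicates are accumulated.
    lookup.sum().backward()
    print(embedding.weight.grad[2])  # twice the contribution, index 2 used twice
    print(embedding.weight.grad[0])  # all zeros, never looked up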