machine-learning · nlp · word2vec · word-embedding

In the CBOW model, do we need to take the average at the hidden layer?


I have searched for and read some articles about CBOW, but they seem to differ from one another.

As I understand it:

  • The input is a batch of context-word vectors, which we feed to the hidden layer. This gives us another batch of vectors, H, at the hidden layer.
  • In one article (section 2.2.1), they say that we do not use any activation function at the hidden layer; instead, we take the average over the batch H to get a single vector (no longer a batch). We then feed this averaged vector to the output layer and apply Softmax to it.


  • However, in this Coursera video, they do not take the average over the batch H. They just feed the batch H to the output layer, apply Softmax to the batch of output vectors, and then compute the cost function on that (see the sketch after this list for the two variants side by side).
  • Also, in the Coursera video, they say that we can use ReLU as the activation function at the hidden layer. Is this a new method? In the many articles I have read, they always say there is no activation function at the hidden layer.
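
To make the difference concrete, here is a minimal NumPy sketch of the two variants as I understand them (the weight names W_in and W_out and all the dimensions are my own placeholders, not taken from either source):

```python
import numpy as np

V, N = 10, 4                      # vocabulary size, hidden-layer size
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))    # input -> hidden weights (word vectors)
W_out = rng.normal(size=(N, V))   # hidden -> output weights
context_ids = [1, 5, 7]           # indices of the context words

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

H = W_in[context_ids]             # batch vector H, shape (C, N); no activation

# Variant 1 (the article, section 2.2.1): average H into a single
# vector first, so Softmax is applied only once.
h_avg = H.mean(axis=0)                 # shape (N,)
probs_avg = softmax(h_avg @ W_out)     # shape (V,)

# Variant 2 (how I understood the Coursera video): feed the whole
# batch through and apply Softmax to each row of the output.
probs_batch = softmax(H @ W_out)       # shape (C, V)
```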

Can you please help me answer this?


Solution

  • In actual implementations, whose source code you can review, the set of context-word vectors is averaged together before being fed as the "input" to the neural network.

    Then, any back-propagated adjustments to the input are also applied to all the vectors contributing to that average.

    (For example, in the word2vec.c released with Google's original word2vec paper, you can see the vectors being tallied into neu1 and then averaged via division by the context-window count cw, at:

    https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L444-L448 )
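
To make that concrete, here is a minimal NumPy sketch of one CBOW training step in the same spirit, averaging on the way in and applying the input-side correction to every contributing vector. The variable names and dimensions are my own, loosely mirroring neu1/cw; this is an illustration, not the actual C code:

```python
import numpy as np

V, N = 10, 4                           # vocabulary size, hidden-layer size
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N)) * 0.1   # context-word vectors (like syn0)
W_out = np.zeros((N, V))               # hidden -> output weights
lr = 0.05

context_ids = [1, 5, 7]                # context words around the center word
target_id = 2                          # center word to predict

# Forward pass: tally the context vectors and average them
# (the counterpart of summing into neu1 and dividing by cw).
h = W_in[context_ids].mean(axis=0)     # one hidden vector, shape (N,)
scores = h @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()                   # Softmax over the vocabulary

# Backward pass: cross-entropy gradient for the single target word.
grad_scores = probs.copy()
grad_scores[target_id] -= 1.0          # dL/dscores
grad_h = W_out @ grad_scores           # gradient w.r.t. the averaged vector
W_out -= lr * np.outer(h, grad_scores)

# The same input-side correction is applied to every vector that
# contributed to the average (the counterpart of adding neu1e to
# each context word's row of syn0).
W_in[context_ids] -= lr * grad_h
```

Note that, as in word2vec.c, the same correction is added to every contributing context vector, rather than the mathematically exact gradient divided by the context count; in practice this just rescales the effective learning rate on the input side.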