While reading one of Tomas Mikolov's papers, http://arxiv.org/pdf/1301.3781.pdf,
I have a question about the Continuous Bag-of-Words Model section:
"The first proposed architecture is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words get projected into the same position (their vectors are averaged)."
I have seen some people mention that there is a hidden layer in the Word2Vec model, but from my understanding there is only a projection layer in that model. Does this projection layer do the same work as a hidden layer?
My other question is: how is the input data projected onto the projection layer?
Also, what does "the projection layer is shared for all words (not just the projection matrix)" mean?
From the original paper, section 3.1, it is clear that there is no hidden layer:
"the first proposed architecture is similar to the feedforward NNLM where the non-linear hidden layer is removed and the projection layer is shared for all words".
With respect to your second question (what sharing the projection layer means): it means that you consider only one single vector, which is the centroid of the vectors of all the words in the context. Thus, instead of having n-1 word vectors as input, you consider only one vector. This is why it is called Continuous Bag of Words: word order is lost within the context of size n-1.
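Here is a minimal sketch of that projection step, assuming a toy vocabulary size V, embedding dimension D and randomly initialized weights (the names and sizes are made up for illustration; this is not the paper's actual implementation):

```python
import numpy as np

V, D = 10, 4                              # toy vocabulary size and embedding size
rng = np.random.default_rng(0)

embeddings = rng.normal(size=(V, D))      # shared input -> projection matrix
output_weights = rng.normal(size=(D, V))  # projection -> output matrix

context_ids = [2, 5, 7, 1]                # indices of the n-1 context words

# "Sharing the projection layer": every context word is looked up in the SAME
# embedding matrix, and the resulting vectors are averaged into one centroid.
# Word order is lost here, which is why it is a (continuous) bag of words.
projection = embeddings[context_ids].mean(axis=0)   # shape (D,)

# No non-linear hidden layer: the centroid goes straight to the output layer.
scores = projection @ output_weights                 # shape (V,)
probs = np.exp(scores - scores.max())
probs /= probs.sum()                                  # softmax over the vocabulary

print(probs.argmax())   # predicted center word id
```

So the "projection layer" is just this averaged lookup; it plays the role a hidden layer would have in the feedforward NNLM, but without any non-linearity.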