I have searched for and read several articles about CBOW, but there seem to be differences between them.
As I understand it:
Can you please help me answer this?
In actual implementations, whose source code you can review, the context-word vectors are averaged together before being fed as the "input" to the neural network.
Then, any back-propagated adjustments to the input are also applied to all the vectors contributing to that average.
(For example, in the original `word2vec.c` released alongside Google's word2vec paper, you can see the tallying of vectors into `neu1`, then averaging via division by the context-window count `cw`, at https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L444-L448 )
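
To make that concrete, here is a minimal, self-contained sketch of just the input-side step in C. It loosely borrows the names `syn0`, `neu1`, `neu1e`, and `cw` from word2vec.c, but it is not the original code: the hidden-to-output training that actually produces the error vector `neu1e` is elided, and the sizes are toy values rather than the real defaults.

```c
#include <stdio.h>

#define LAYER1_SIZE 4    /* toy embedding size (word2vec's default is 100) */
#define VOCAB_SIZE  10   /* toy vocabulary size */

/* Input-side word vectors, analogous to word2vec.c's syn0
   (kept 2-D here for clarity). */
static float syn0[VOCAB_SIZE][LAYER1_SIZE];

/* One CBOW input step: average the cw context vectors into neu1,
   then apply the same back-propagated error neu1e to every
   contributing context vector. */
static void cbow_input_step(const int *context, int cw, const float *neu1e)
{
    float neu1[LAYER1_SIZE] = {0.0f};

    /* forward: tally every context-word vector into neu1 ... */
    for (int a = 0; a < cw; a++)
        for (int c = 0; c < LAYER1_SIZE; c++)
            neu1[c] += syn0[context[a]][c];

    /* ... then average by dividing by the context count cw */
    for (int c = 0; c < LAYER1_SIZE; c++)
        neu1[c] /= cw;

    /* (neu1 would now be fed to the hidden->output layer, whose
       training fills neu1e; that part is omitted in this sketch) */

    /* backward: the SAME error vector is added to each context
       vector that contributed to the average */
    for (int a = 0; a < cw; a++)
        for (int c = 0; c < LAYER1_SIZE; c++)
            syn0[context[a]][c] += neu1e[c];
}

int main(void)
{
    int context[] = {1, 3, 7};   /* hypothetical context word ids */
    float neu1e[LAYER1_SIZE] = {0.01f, -0.02f, 0.0f, 0.03f};

    cbow_input_step(context, 3, neu1e);
    printf("syn0[1][0] after update: %f\n", syn0[1][0]);
    return 0;
}
```

The key point is that a single shared error update goes to all `cw` context vectors, rather than each context word receiving its own independent update (as happens per-pair in skip-gram).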