
What is the correct definition of window size for Word2Vec (CBOW and Skip-gram)?


Which one is the correct definition of the window size in Word2Vec (CBOW and Skip-gram)?

After examining multiple resources on how Word2Vec (CBOW and Skip-gram) works, I discovered that there are two ways in which people define the window size:

1. The window size is an integer indicating the number of words considered before or after the target word, excluding the target word itself. For example, with a window size of 3 and the target word "fox" in "The quick brown fox jumps over the lazy dog.", the context is up to three words on each side: "The quick brown" and "jumps over the". (Resource: https://www.tensorflow.org/tutorials/text/word2vec)

2. The window size is an integer indicating the total number of words spanned, including the target word itself. For example, with a window size of 3, the window covers three consecutive words of "The quick brown fox jumps over the lazy dog.", e.g. "brown fox jumps" for the target word "fox". (Resource: http://jalammar.github.io/illustrated-word2vec/)
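
To make the difference concrete, here is a small sketch of my own (not taken from either resource) showing the context each reading yields for the target word "fox" with a window size of 3:

    # My own illustration of the two readings of window size = 3 (not from
    # either linked resource); the target word is "fox".
    sentence = "The quick brown fox jumps over the lazy dog".split()
    target_idx = sentence.index("fox")
    window = 3

    # Reading 1: up to `window` words on EACH side of the target (target excluded).
    context_each_side = (
        sentence[max(0, target_idx - window):target_idx]
        + sentence[target_idx + 1:target_idx + 1 + window]
    )
    print(context_each_side)  # ['The', 'quick', 'brown', 'jumps', 'over', 'the']

    # Reading 2: a span of `window` consecutive words INCLUDING the target
    # (shown here centered on the target).
    half = (window - 1) // 2
    span = sentence[max(0, target_idx - half):target_idx + half + 1]
    print(span)  # ['brown', 'fox', 'jumps']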

Which one is correct and why?


Solution

  • The original word2vec.c example implementation, released by the authors of the paper that introduced and named the word2vec algorithm, used the 1st interpretation: window indicates the number of context words considered on each side of the center/target word.

    So, window=3 means up to six words are considered (see the sketches at the end of this answer).

    Other writeups or implementations might have their own reasons for picking an alternate convention, but that conflicts with the early precedent. (Counting the center word also introduces an ambiguity in interpreting even-valued parameters: how would 4 and 5 mean anything different?) In my experience, most implementations follow the original word2vec.c approach.

    Another relevant note: the original implementation used the configured window as the maximum count of words on each side to consider. For each individual micro-example – context-words to center-word – an 'effective window' in the range from 1 to the configured window was chosen. This essentially weights nearer words more highly – the direct neighbors are considered every time, the words a full window away only 1/window of the time – while doing less total calculation compared to an alternative approach that might, for example, scale down more-distant words' influence by some floating-point multiplication. (This is also sketched below.)
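
    Here is a rough sketch of skip-gram (center, context) pair generation under that convention (my own simplification for illustration, not a port of the original word2vec.c code):

        # Sketch of skip-gram (center, context) pair generation under the
        # "context words on each side" convention: window=3 means up to six
        # context words per center word. Illustrative only, not word2vec.c.
        def skipgram_pairs(tokens, window=3):
            pairs = []
            for i, center in enumerate(tokens):
                start = max(0, i - window)
                stop = min(len(tokens), i + window + 1)
                for j in range(start, stop):
                    if j != i:
                        pairs.append((center, tokens[j]))
            return pairs

        tokens = "The quick brown fox jumps over the lazy dog".split()
        # The center word "fox" (index 3) pairs with six context words.
        print([pair for pair in skipgram_pairs(tokens) if pair[0] == "fox"])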
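
    The dynamic 'effective window' could be sketched like this, assuming a uniform draw from 1 to the configured window for each center word (again my own simplification, not a port of word2vec.c):

        import random

        # Sketch of the dynamic window: each center word draws an 'effective
        # window' uniformly from 1..window, so direct neighbors always appear
        # as context while words at the maximum distance appear only about
        # 1/window of the time. Illustrative only, not word2vec.c.
        def skipgram_pairs_dynamic(tokens, window=3, seed=0):
            rng = random.Random(seed)
            pairs = []
            for i, center in enumerate(tokens):
                effective = rng.randint(1, window)  # per-center effective window
                start = max(0, i - effective)
                stop = min(len(tokens), i + effective + 1)
                for j in range(start, stop):
                    if j != i:
                        pairs.append((center, tokens[j]))
            return pairs

        tokens = "The quick brown fox jumps over the lazy dog".split()
        print(skipgram_pairs_dynamic(tokens, window=3))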