word2vec word-embedding

Why does the skip-gram model take more time than CBOW?


Why does the skip-gram model take more time than the CBOW model? I train both models with the same parameters (vector size and window size).


Solution

  • The skip-gram approach involves more calculations.

    Specifically, consider a single 'target word' with a context-window of 4 words on either side.

    In CBOW, the vectors for all 8 nearby words are averaged together, then used as the input for the algorithm's prediction neural-network. The network is run forward, and its success at predicting the target word is checked. Then back-propagation occurs: all neural-network connection values – including the 8 contributing word-vectors – are nudged to make the prediction slightly better.

    Note, though, that this 8-word window and single target word require only one forward propagation and one backward propagation – and the initial averaging of 8 values and the final distribution of the error-correction over 8 vectors are each relatively quick, simple operations.

    Now consider skip-gram instead. Each of the 8 context-window words is in turn individually provided as input to the neural network, forward-checked for how well it predicts the target word, then backward-corrected. Though no averaging/splitting is done, there are roughly 8 times as many neural-network operations – hence much more net computation and more run-time. (See the sketches below.)

    Note that the extra effort/time may pay for itself by improving vector quality in your final evaluations. Whether, and to what extent, it does depends on your specific goals and corpus.
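    To make that bookkeeping concrete, here is a minimal, hypothetical sketch (not real word2vec code) that simply counts the forward/backward passes each mode performs over a corpus, assuming a window of 4 words on either side and a full context at every position (real implementations randomly shrink the window per position, so treat these as upper bounds):

        # Illustrative only: count forward/backward passes per epoch for CBOW vs. skip-gram.
        def count_passes(num_tokens, window, mode):
            passes = 0
            for _ in range(num_tokens):          # one target-word position at a time
                context_size = 2 * window        # up to `window` words on each side
                if mode == "cbow":
                    passes += 1                  # averaged context -> one forward + one backward pass
                else:                            # skip-gram
                    passes += context_size       # each (context word, target) pair gets its own pass
            return passes

        tokens = 1_000_000
        print("CBOW passes:     ", count_passes(tokens, 4, "cbow"))      # ~1,000,000
        print("Skip-gram passes:", count_passes(tokens, 4, "skipgram"))  # ~8,000,000, i.e. 8x more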
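    You can also observe the difference empirically. The following sketch times both modes in Gensim, assuming gensim >= 4.0 is installed; the tiny synthetic corpus is only a placeholder to keep the snippet self-contained, so substitute your own tokenized sentences and expect the absolute timings to vary:

        import random
        import time
        from gensim.models import Word2Vec

        random.seed(0)
        vocab = [f"word{i}" for i in range(2000)]
        corpus = [[random.choice(vocab) for _ in range(50)] for _ in range(2000)]  # ~100k tokens

        def train_seconds(sg):
            start = time.perf_counter()
            Word2Vec(corpus, vector_size=100, window=4, sg=sg, min_count=1, workers=1, epochs=5)
            return time.perf_counter() - start

        print(f"CBOW (sg=0):      {train_seconds(0):.1f} s")
        print(f"Skip-gram (sg=1): {train_seconds(1):.1f} s")

    With the same vector_size and window, the sg=1 (skip-gram) run will typically take several times longer than the sg=0 (CBOW) run, consistent with the reasoning above.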