Tags: nlp, fasttext

What's the difference between fasttext skipgram and word2vec skipgram?


Given the sentence 'hello world', the vocabulary is

{hello, world} + {<hel, hell, ello, llo>, <wor, worl, orld, rld>}

(for convenience, only the 4-grams are listed).
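For concreteness, here is a small sketch of how such 4-grams can be enumerated (fastText wraps each word in the boundary markers `<` and `>`; in practice it uses a range of n-gram lengths, not just 4):

```python
def char_ngrams(word, n=4):
    # Wrap the word in boundary markers, then slide a window of length n over it.
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("hello"))  # ['<hel', 'hell', 'ello', 'llo>']
print(char_ngrams("world"))  # ['<wor', 'worl', 'orld', 'rld>']
```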

In my comprehension, word2vec skipgram will maximize the log-probability of the context words:

$$\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t)$$

What will fasttext skipgram do?


Solution

  • tl;dr

    The optimization criterion is the same; the difference is in how the model computes the word vector.

    Using formulas

    FastText optimizes the same criterion as the standard skip-gram model (using the notation from the FastText paper):

    $$\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t)$$

    (here $\mathcal{C}_t$ is the set of indices of the words in the context window of $w_t$)

    with all the approximation tricks that make the optimization computationally efficient. In the end, they get this:

    $$\sum_{t=1}^{T} \left[ \sum_{c \in \mathcal{C}_t} \ell\big(s(w_t, w_c)\big) + \sum_{n \in \mathcal{N}_{t,c}} \ell\big(-s(w_t, n)\big) \right]$$

    where $\ell(x) = \log(1 + e^{-x})$ is the logistic loss and $\mathcal{N}_{t,c}$ is a set of negative examples sampled from the vocabulary.

    There is a sum over all context words $w_c$, and the softmax denominator is approximated using a few negative samples $n$. The crucial difference is in the scoring function $s$. In the original skip-gram model, it is the dot product of the two word embeddings.
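    To make the loss above concrete, here is a minimal NumPy sketch (not the fastText implementation; `score_fn`, `target`, `context`, and `negatives` are hypothetical placeholders) of the logistic loss applied to one positive pair and its negative samples:

    ```python
    import numpy as np

    def logistic_loss(x):
        # l(x) = log(1 + exp(-x)): small when x is large and positive.
        return np.log1p(np.exp(-x))

    def pair_loss(score_fn, target, context, negatives):
        # One term of the objective: push s(target, context) up,
        # push s(target, n) down for every sampled negative word n.
        loss = logistic_loss(score_fn(target, context))
        loss += sum(logistic_loss(-score_fn(target, n)) for n in negatives)
        return loss
    ```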

    However, in the FastText case, the function s is redefined:

    $$s(w_t, w_c) = \sum_{g \in \mathcal{G}_{w_t}} \mathbf{z}_g^{\top} \mathbf{v}_{w_c}$$

    where $\mathcal{G}_{w_t}$ is the set of n-grams of $w_t$ and $\mathbf{z}_g$ is the vector of n-gram $g$.

    Word $w_t$ is represented as the sum of the vectors $\mathbf{z}_g$ of all n-grams the word consists of, plus a vector for the word itself. You basically want to make not only the word, but also all of its substrings, probable in the given context window.
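    The difference between the two scoring functions can be sketched in a few lines of NumPy (a toy illustration, not the fastText implementation; the vector tables and the dimension are made up):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 5

    # Toy lookup tables: input vectors for words and n-grams, output vectors for context words.
    word_vecs = {"hello": rng.normal(size=dim)}
    ngram_vecs = {g: rng.normal(size=dim)
                  for g in ["<hel", "hell", "ello", "llo>", "<hello>"]}
    context_vecs = {"world": rng.normal(size=dim)}

    def s_word2vec(target, context):
        # Original skip-gram: dot product of two word vectors.
        return word_vecs[target] @ context_vecs[context]

    def s_fasttext(target_ngrams, context):
        # FastText: sum the vectors of all n-grams of the target word
        # (including the whole word as the special sequence "<hello>"),
        # then take the dot product with the context-word vector.
        z = sum(ngram_vecs[g] for g in target_ngrams)
        return z @ context_vecs[context]

    print(s_word2vec("hello", "world"))
    print(s_fasttext(["<hel", "hell", "ello", "llo>", "<hello>"], "world"))
    ```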