Tags: nlp, fasttext

What's the difference between fasttext skipgram and word2vec skipgram?


Given the sentence 'hello world', the vocabulary is

{hello, world} + {<hel, hell, ello, llo>, <wor, worl, orld, rld>}

(for convenience, only the 4-grams are listed).
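For concreteness, here is a small sketch of how such 4-grams can be enumerated (fastText wraps each word in the boundary markers `<` and `>`; in practice it uses a range of n-gram lengths, not just 4):

```python
def char_ngrams(word, n=4):
    # Wrap the word in boundary markers, then slide a window of length n over it.
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("hello"))  # ['<hel', 'hell', 'ello', 'llo>']
print(char_ngrams("world"))  # ['<wor', 'worl', 'orld', 'rld>']
```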

In my comprehension, word2vec skipgram will maximize the log-probability of the context words:

$$\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t)$$

What will fasttext skipgram do?


Solution

  • tl;dr

    The optimization criterion is the same; the difference is in how the model computes the word vector.

    Using formulas

    FastText optimizes the same criterion as the standard skip-gram model (using the notation from the FastText paper):

    $$\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t)$$

    (here $\mathcal{C}_t$ is the set of indices of the words in the context window of $w_t$)

    with all the approximation tricks that make the optimization computationally efficient. In the end, they get this:

    $$\sum_{t=1}^{T} \left[ \sum_{c \in \mathcal{C}_t} \ell\big(s(w_t, w_c)\big) + \sum_{n \in \mathcal{N}_{t,c}} \ell\big(-s(w_t, n)\big) \right]$$

    where $\ell(x) = \log(1 + e^{-x})$ is the logistic loss and $\mathcal{N}_{t,c}$ is a set of negative examples sampled from the vocabulary.

    There is a sum over all context words $w_c$, and the softmax denominator is approximated using a few negative samples $n$. The crucial difference is in the scoring function $s$. In the original skip-gram model, it is the dot product of the two word embeddings.
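    To make the loss above concrete, here is a minimal NumPy sketch (not the fastText implementation; `score_fn`, `target`, `context`, and `negatives` are hypothetical placeholders) of the logistic loss applied to one positive pair and its negative samples:

    ```python
    import numpy as np

    def logistic_loss(x):
        # l(x) = log(1 + exp(-x)): small when x is large and positive.
        return np.log1p(np.exp(-x))

    def pair_loss(score_fn, target, context, negatives):
        # One term of the objective: push s(target, context) up,
        # push s(target, n) down for every sampled negative word n.
        loss = logistic_loss(score_fn(target, context))
        loss += sum(logistic_loss(-score_fn(target, n)) for n in negatives)
        return loss
    ```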

    However, in the FastText case, the function s is redefined:

    $$s(w_t, w_c) = \sum_{g \in \mathcal{G}_{w_t}} \mathbf{z}_g^{\top} \mathbf{v}_{w_c}$$

    where $\mathcal{G}_{w_t}$ is the set of n-grams of $w_t$ and $\mathbf{z}_g$ is the vector of n-gram $g$.

    Word $w_t$ is represented as the sum of the vectors $\mathbf{z}_g$ of all n-grams the word consists of, plus a vector for the word itself. You basically want to make not only the word, but also all of its substrings, probable in the given context window.
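    The difference between the two scoring functions can be sketched in a few lines of NumPy (a toy illustration, not the fastText implementation; the vector tables and the dimension are made up):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 5

    # Toy lookup tables: input vectors for words and n-grams, output vectors for context words.
    word_vecs = {"hello": rng.normal(size=dim)}
    ngram_vecs = {g: rng.normal(size=dim)
                  for g in ["<hel", "hell", "ello", "llo>", "<hello>"]}
    context_vecs = {"world": rng.normal(size=dim)}

    def s_word2vec(target, context):
        # Original skip-gram: dot product of two word vectors.
        return word_vecs[target] @ context_vecs[context]

    def s_fasttext(target_ngrams, context):
        # FastText: sum the vectors of all n-grams of the target word
        # (including the whole word as the special sequence "<hello>"),
        # then take the dot product with the context-word vector.
        z = sum(ngram_vecs[g] for g in target_ngrams)
        return z @ context_vecs[context]

    print(s_word2vec("hello", "world"))
    print(s_fasttext(["<hel", "hell", "ello", "llo>", "<hello>"], "world"))
    ```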