python, nlp, word2vec

How does word2vec predict a word correctly when the actual dataset does not contain it?


I'm trying to understand how word2vec predicts a word given a list of words. Specifically, I trained a skip-gram model on Twitter data (500k tweets) with the following parameters:

import gensim

model = gensim.models.Word2Vec(data, window=5, workers=7, sg=1, min_count=10, size=200)

Given the words discrimination and uberx, I get the following output:

from nltk.stem import PorterStemmer, WordNetLemmatizer

model.wv.most_similar(positive=[PorterStemmer().stem(WordNetLemmatizer().lemmatize("discrimination", pos='v')), WordNetLemmatizer().lemmatize("uberx", pos='v')], topn=30)
[('discret', 0.7425585985183716),
 ('fold_wheelchair', 0.7286415696144104),
 ('illeg_deni', 0.7280288338661194),
 ('tradit_cab', 0.7262350916862488),
 ('mobil_aid', 0.7252357602119446),
 ('accommod_disabl', 0.724936842918396),
 ('uberwav', 0.720955491065979),
 ('discrimin_disabl', 0.7206833958625793),
 ('deni_access', 0.7202375531196594),...]
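
As far as I can tell, most_similar with several positive words roughly averages their unit-length vectors and then ranks the whole vocabulary by cosine similarity to that average, so the list above reflects the combined neighbourhood of both inputs rather than any single tweet. A simplified sketch of that lookup (not gensim's actual implementation; index_to_key and wv.vectors are gensim 4.x attribute names):

import numpy as np

def rough_most_similar(model, positive, topn=10):
    # Normalise the vector of each positive word to unit length.
    vecs = np.array([model.wv[w] for w in positive], dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    # The query is the (re-normalised) mean of the positive word vectors.
    query = vecs.mean(axis=0)
    query /= np.linalg.norm(query)
    # Cosine similarity between the query and every word in the vocabulary.
    vocab = model.wv.vectors / np.linalg.norm(model.wv.vectors, axis=1, keepdims=True)
    sims = vocab @ query
    order = np.argsort(-sims)
    words = model.wv.index_to_key  # gensim 4.x; older versions call this index2word
    return [(words[i], float(sims[i])) for i in order if words[i] not in positive][:topn]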

However, when I search the dataset data, which I dumped to my hard drive, for the words "discrimination", "uberx", and any other word from the output list, I never find a single datapoint that contains all three words. So my question is: how does the model know that, say, "accommod_disabl" ("accommodate disabled") is the right word for the context "discrimination" and "uberx" if it has never seen those three words together in a single tweet?


Solution

  • The skip-gram model works like a fill-in-the-blank exercise. For example, consider two tweets:

    1)

    It's summer now. Today is ___.

    It's ______ now. Today is hot.

    2)

    It's winter now. Today is ____.

    It's ______ now. Today is cold.

    By training a model to predict the blank, the model learns that the representations of the paired words, (cold, winter) or (hot, summer), should be close to each other.

    At the same time, it also learns that the distance between "cold" and "summer" should increase, because when the context contains "cold", the blank is more likely to be "winter", which in turn suppresses the probability of "summer".

    Thus, even though no single piece of data contains both "cold" and "summer", the model can still learn the relationship between these two words (see the sketch at the end of this answer).

    This is my humble opinion on skip-gram. Please feel free to discuss :)
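
    Here is a minimal, self-contained sketch of that intuition. The toy corpus and all hyperparameters below are invented purely for illustration (and results on such a tiny corpus are noisy): "cold" only ever appears in sentences together with "winter" and never in the same sentence as "summer", yet after training the model typically places "cold" much closer to "winter" than to "summer".

    import random
    from gensim.models import Word2Vec

    random.seed(0)

    # Invented toy corpus: the two seasons occur in parallel contexts,
    # but "cold" never co-occurs with "summer" in any sentence.
    winter_tweets = [["it", "is", "winter", "now", "today", "is", "cold"],
                     ["so", "cold", "this", "winter", "morning"],
                     ["freezing", "cold", "winter", "day"]]
    summer_tweets = [["it", "is", "summer", "now", "today", "is", "hot"],
                     ["so", "hot", "this", "summer", "morning"],
                     ["scorching", "hot", "summer", "day"]]
    corpus = winter_tweets * 100 + summer_tweets * 100
    random.shuffle(corpus)

    # gensim 4.x parameter names (vector_size, epochs); workers=1 for repeatability.
    m = Word2Vec(corpus, sg=1, vector_size=50, window=5, min_count=1,
                 epochs=50, workers=1, seed=1)

    print(m.wv.similarity("cold", "winter"))   # expected: relatively high
    print(m.wv.similarity("cold", "summer"))   # expected: noticeably lower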