python gensim training-data word2vec

Shouldn't all of the words I use to train the word2vec model be in model.vocab?


I use the following code to train the model:

from gensim.models import word2vec

norms_train = [ [''], [ u'word', u'to', u'learn', ... ], ...]
model = word2vec.Word2Vec(norms_train, size=100, window=10)

I check the result with this procedure, which counts the training tokens that are missing from the vocabulary:

# i: tokens missing from the vocabulary, j: total tokens in the training data
i, j = 0, 0
for text in norms_train:
    j += len(text)
    for word in text:
        if word not in model.vocab:
            i += 1
print i, '/', j

13129 / 185379


Solution

  • Not necessarily. Every word used to train the Word2Vec model ends up in model.vocab only if it meets a minimum number of occurrences; words below that threshold are left out of the model vocabulary.

    The threshold is the min_count argument, which is set to 5 by default, i.e. a word that occurs fewer than 5 times in the training data will not be present in model.vocab.
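To check whether the frequency threshold accounts for the missing words, you can count rare tokens directly, without training a model. The following is a minimal sketch in plain Python: the helper name count_rare_tokens and the toy corpus are made up for illustration, and it assumes the training data is a list of token lists like norms_train in the question.

```python
from collections import Counter


def count_rare_tokens(sentences, min_count=5):
    """Return (dropped, total) token counts for a list of token lists.

    A token counts as dropped when its word occurs fewer than
    min_count times in the whole corpus, which mirrors the
    frequency threshold described above.
    """
    # Corpus-wide frequency of every word
    freq = Counter(word for sentence in sentences for word in sentence)
    total = sum(freq.values())
    # Each rare word contributes all of its occurrences
    dropped = sum(n for n in freq.values() if n < min_count)
    return dropped, total


# Toy corpus, invented for this example
toy_corpus = [
    ['word', 'to', 'learn'],
    ['word', 'to', 'use'],
    ['word', 'again'],
]

dropped, total = count_rare_tokens(toy_corpus, min_count=2)
print(dropped, '/', total)  # 3 / 8
```

If running this on norms_train with min_count=5 gives a figure close to the 13129 reported in the question, the threshold explains the gap. In that case, passing min_count=1 to Word2Vec should keep every training word in the vocabulary.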