I use the following code to train the model:
from gensim.models import word2vec

norms_train = [ [''], [ u'word', u'to', u'learn', ... ], ...]
model = word2vec.Word2Vec(norms_train, size=100, window=10)
and the following procedure to check the results:
i, j = 0, 0
for text in norms_train:
    j += len(text)
    for word in text:
        if word not in model.vocab:
            i += 1
print i, '/', j
13129 / 185379
In principle, every word you used to train the Word2Vec model should be in model.vocab, unless it falls below a threshold on the minimum number of occurrences a word needs in order to be included in the model vocabulary.
The min_count argument is set to 5 by default, i.e. if a word occurs fewer than 5 times in the training data, it will not be present in model.vocab.
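One way to check whether this explains the missing words, without retraining, is to count how many tokens belong to words that occur fewer than min_count times. The sketch below is plain Python that only mimics the frequency cutoff described above; it is not gensim's own code, and count_dropped_tokens and toy_corpus are names made up for this illustration.

```python
from collections import Counter


def count_dropped_tokens(sentences, min_count=5):
    """Return (dropped, total) token counts under a min_count-style cutoff."""
    # Count how often each word occurs across the whole corpus.
    freq = Counter(word for sentence in sentences for word in sentence)
    total = sum(freq.values())
    # Every occurrence of a word seen fewer than min_count times is dropped.
    dropped = sum(n for n in freq.values() if n < min_count)
    return dropped, total


# Toy corpus (made up for illustration): 'word' occurs 6 times, 'rare' once.
toy_corpus = [[u'word', u'word', u'word'],
              [u'word', u'word', u'word', u'rare']]

dropped, total = count_dropped_tokens(toy_corpus, min_count=5)
print("%d / %d" % (dropped, total))  # prints: 1 / 7
```

If min_count is the only reason words are missing, calling count_dropped_tokens(norms_train) should give the same two numbers as the check in the question. In that case, passing min_count=1 to Word2Vec should keep every word in the vocabulary.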