python · nlp · word2vec

How to interpret word2vec train output?


Running the code snippet below reports the output (3, 60). I wonder what exactly it is reporting?

The code is reproducible; just copy it into a notebook cell and run.

from gensim.models import Word2Vec

sent = [['I', 'love', 'cats'], ['Dogs', 'are', 'friendly']]
w2v_model = Word2Vec(sentences=sent, vector_size=100, window=7, min_count=1, sg=1)
w2v_model.train(sent, total_examples=len(sent), epochs=10)

(3, 60)


Solution

  • You seem to be using the Gensim Python library for your Word2Vec, & for internal reasons, the .train() method returns just the tuple (trained_word_count, raw_word_count).

    The 1st number happens to be the number of words actually trained on – more on why this is only 3 for you below – & the 2nd the total raw words passed to training routines – just your 6 words times 10 epochs. But, most users never need to consult these values.

    A better way to monitor progress is to turn on logging at the INFO level – at which point you'll see many log lines reporting the model's steps & progress. By reading these, & over time, you'll start to recognize the signs of a good run, or of common errors (as when the totals or elapsed times don't seem consistent with what you thought you were doing).
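    Enabling that logging is a one-liner of standard Python `logging` setup, which Gensim's progress messages flow through – a minimal sketch:

    ```python
    import logging

    # Route INFO-level messages (including Gensim's training progress
    # reports) to the console with timestamps:
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=logging.INFO,
    )
    ```

    Run this before constructing the model, and every epoch's word counts & timings will appear in the cell output.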

    Your 3 lines are already a bit off, though:

    • If you pass your training corpus into the constructor, you don't have to also call .train() - that's already done for you, automatically. So, you're training twice here. (And, if you want epochs=10 for that automatic training, you can specify it in the constructor.)
    • With a tiny toy-sized corpus, word2vec learns no useful vectors – and even the reporting is more likely to reveal oddnesses that are irrelevant to more realistic-sized training runs. I recommend never training on anything less than hundreds-of-thousands of words, so that all your experiments reveal useful things about its usual operation, with minimal distractions from artifacts of unrealistic runs.
    • In particular, here, since you only have 6 words total, each has a word frequency of ~17% of all words. In any real corpus, such a word would be unrealistically super-frequent – and thus all your words fall victim to what is (in real corpora) a very useful optimization: probabilistic highly-frequent-word-dropping (tuned by the sample parameter). This is why out of 60 words (6 words times 10 epochs), only 3 word occurrences were actually trained at all. (With truly frequent words in an adequately-sized corpus, dropping 19-out-of-20 appearances leaves plenty, & the overall model gets improved by spending relatively more effort on rarer words.)
    • min_count=1 is essentially always a bad idea with real word2vec workloads, as words that only appear once can't get good vectors, but do waste model time/state. Ignoring such rare words completely is the standard practice. (If you need vectors for such words, you should find more training material sufficient to demonstrate their varied uses, in context, repeatedly.)
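    Putting the first point into practice, your snippet collapses to a single constructor call – a sketch that keeps your toy corpus only to mirror the question (a real run should use a far larger corpus & a higher min_count, per the points above):

    ```python
    from gensim.models import Word2Vec

    sent = [['I', 'love', 'cats'], ['Dogs', 'are', 'friendly']]

    # The constructor trains automatically when given a corpus, so there's
    # no separate .train() call – & epochs=10 is passed here directly.
    # (min_count=1 kept only because this toy corpus has no repeated words.)
    w2v_model = Word2Vec(
        sentences=sent,
        vector_size=100,
        window=7,
        min_count=1,
        sg=1,
        epochs=10,
    )

    print(w2v_model.wv['cats'].shape)  # (100,)
    ```

    With vector_size=100, each learned vector has shape (100,) – though, again, vectors learned from 6 words are noise, not meaning.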