Running the code snippet below reports an output of (3, 60). I wonder what exactly it is reporting?
The code is reproducible: just copy it into a notebook cell and run.
from gensim.models import Word2Vec
sent = [['I', 'love', 'cats'], ['Dogs', 'are', 'friendly']]
w2v_model = Word2Vec(sentences=sent, vector_size=100, window=7, min_count=1, sg=1)
w2v_model.train(sent, total_examples=len(sent), epochs=10)
(3, 60)
You seem to be using the Gensim Python library for your Word2Vec, & for internal reasons, the .train() method returns just the tuple (trained_word_count, raw_word_count).
The 1st number happens to be the number of words actually trained on (more on why this is only 3 for you below), & the 2nd is the total raw words passed to the training routines: just your 6 words times 10 epochs. But most users never need to consult these values.
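If you do want those counts, you can unpack the tuple your existing call already returns (shown only to illustrate the return value; the exact trained count will vary run to run):

trained_word_count, raw_word_count = w2v_model.train(sent, total_examples=len(sent), epochs=10)
print(trained_word_count, raw_word_count)  # something like 3 60 for this toy corpus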
A better way to monitor progress is to turn on logging to the INFO level - at which point you'll see many log lines of the model's steps & progress. By reading these, & over time, you'll start to recognize signs of a good run, or common errors (as when the totals or elapsed times don't seem consistent with what you thought you were doing).
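One common way to enable that, before creating the model, is the standard Python logging setup (adjust the format string to taste):

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)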
Your 3 lines are already a bit off, though:

- If you supply your corpus to the Word2Vec() constructor, training is already done for you, automatically - so there's no need to call .train() yourself, & as written you're training twice here. (And, if you want epochs=10 for that automatic training, you can specify it in the constructor; see the adjusted sketch after this list.)
- Very-frequent words are, by default, aggressively downsampled (per the sample parameter), & in a corpus this tiny every word counts as very-frequent. This is why, out of 60 words (6 words times 10 epochs), only 3 word occurrences were actually trained at all. (With truly frequent words in an adequately-sized corpus, dropping 19-out-of-20 appearances leaves plenty, & the overall model gets improved by spending relatively more effort on rarer words.)
- min_count=1 is essentially always a bad idea with real word2vec workloads, as words that only appear once can't get good vectors, but do waste model time/state. Ignoring such rare words completely is a standard practice. (If you need vectors for such words, you should find more training material sufficient to demonstrate their varied uses, in context, repeatedly.)
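Putting that together, a minimal constructor-only sketch for this toy corpus (keeping min_count=1 & setting sample=0 only because the corpus is unrealistically tiny; on real data you'd leave sample at its default & raise min_count):

from gensim.models import Word2Vec

sent = [['I', 'love', 'cats'], ['Dogs', 'are', 'friendly']]

# Training happens once, inside the constructor - no separate .train() call needed.
w2v_model = Word2Vec(
    sentences=sent,
    vector_size=100,
    window=7,
    sg=1,
    epochs=10,    # epochs for the automatic training
    min_count=1,  # only tolerable for a toy demo; keep the default (5) or higher on real corpora
    sample=0,     # disable frequent-word downsampling, since this tiny corpus makes every word look 'frequent'
)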