Tags: python, gensim, convergence, word-embedding

Embedded vectors don't converge in gensim


I am training a word2vec model with gensim on 800k browser user-agent strings. My vocabulary size is between 300 and 1000, depending on the word-frequency limit. I am looking at a few embedding vectors and similarities to see whether the algorithm has converged. Here is my code:

import gensim
import numpy as np
from copy import copy

wv_sim_min_count_stat = {}
window = 7; worker = 10; size = 128
total_iterate = 1000

for min_count in [50, 100, 500]:
    print(min_count)

    wv_sim_min_count_stat[min_count] = {}
    model = gensim.models.Word2Vec(size=size, window=window, min_count=min_count,
                                   workers=worker, iter=1, sg=1)
    model.build_vocab(ua_parsed)

    wv_sim_min_count_stat[min_count]['vocab_counts'] = [
        len(ua_parsed), len(model.wv.vocab), len(model.wv.vocab) / len(ua_parsed)]
    wv_sim_min_count_stat[min_count]['test'] = []

    # decay the learning rate manually from 0.025 down to 0.001 across all epochs
    alphas = np.arange(0.025, 0.001, (0.001 - 0.025) / (total_iterate + 1))
    for i in range(total_iterate):
        model.train(ua_parsed, total_examples=model.corpus_count,
                    epochs=model.iter, start_alpha=alphas[i], end_alpha=alphas[i + 1])

        # snapshot a few embedding vectors and one pairwise similarity after each epoch
        wv_sim_min_count_stat[min_count]['test'].append(
            (copy(model.wv['iphone']), copy(model.wv['(windows']), copy(model.wv['mobile']),
             copy(model.wv['(ipad;']), copy(model.wv['ios']),
             model.wv.similarity('(ipad;', 'ios')))

Unfortunately, even after 1000 epochs there is no sign of convergence in the embedding vectors. For example, below I plot one dimension of the '(ipad;' embedding vector against the number of epochs:

import matplotlib.pyplot as plt

for min_count in [50, 100, 500]:
    # index 3 in each snapshot tuple is the '(ipad;' vector; plot one of its dimensions per epoch
    plt.plot(np.stack(list(zip(*wv_sim_min_count_stat[min_count]['test']))[3])[:, 1],
             label=str(min_count))

plt.legend()
plt.show()

Figure: embedding of '(ipad;' vs. number of epochs

I have looked at many blogs and papers, and it seems nobody trains word2vec beyond 100 epochs. What am I missing here?


Solution

  • Your dataset, user-agent strings, may be odd for word2vec. It's not natural language; it might not have the same variety of co-occurrences that causes word2vec to do useful things for natural language. (Among other things, a dataset of 800k natural-language sentences/docs would tend to have a much larger vocabulary than just ~1,000 words.)

    Your graphs do look like they're roughly converging, to me. In each case, as the learning-rate alpha decreases, the dimension magnitude is settling towards a final number.
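    If you want a convergence signal that is less noisy than a single coordinate, one option is to track how much a whole vector still moves between snapshots. A minimal sketch, reusing the wv_sim_min_count_stat snapshots built in the question (the choice of the '(ipad;' slot, index 3, and of the Euclidean norm is just an illustrative assumption):

    import numpy as np
    import matplotlib.pyplot as plt

    for min_count in [50, 100, 500]:
        # stack the per-epoch snapshots of the '(ipad;' vector (index 3 in each tuple)
        vecs = np.stack(list(zip(*wv_sim_min_count_stat[min_count]['test']))[3])
        # norm of the epoch-to-epoch change of the whole vector;
        # it should shrink towards zero as the decaying alpha flattens out
        deltas = np.linalg.norm(np.diff(vecs, axis=0), axis=1)
        plt.plot(deltas, label=str(min_count))

    plt.legend()
    plt.show()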

    There is no reason to expect the magnitude of a particular dimension, of a particular word, would reach the same absolute value in different runs. That is: you shouldn't expect the three lines you're plotting, under different model parameters, to all tend towards the same final value.

    Why not?

    The algorithm includes random initialization, randomization during training (in negative sampling and frequent-word downsampling), and, when multi-threaded, some arbitrary re-ordering of training examples due to OS thread-scheduling jitter. As a result, even with exactly the same metaparameters and the same training corpus, a single word could land at different coordinates in subsequent training runs. But its distances and orientation with regard to other words in the same run should be about as useful.
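    One way to see this concretely is to train two models with identical metaparameters on the same corpus and compare them: the raw coordinates will disagree, but the within-model similarities should be comparably useful. A rough sketch, assuming the same ua_parsed corpus and the gensim 3.x API used in the question:

    import gensim

    def train_once(corpus):
        # identical metaparameters both times; the only differences are random
        # initialization, sampling decisions, and multi-threaded example ordering
        return gensim.models.Word2Vec(corpus, size=128, window=7, min_count=50,
                                      workers=10, sg=1, iter=20)

    m1 = train_once(ua_parsed)
    m2 = train_once(ua_parsed)

    # raw coordinates differ from run to run ...
    print(m1.wv['ios'][:5])
    print(m2.wv['ios'][:5])

    # ... but the relative geometry within each run is about as useful
    print(m1.wv.similarity('(ipad;', 'ios'))
    print(m2.wv.similarity('(ipad;', 'ios'))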

    With different metaparameters like min_count, and thus a different ordering of surviving words during initialization, and then wildly different random initialization, the final coordinates of each word could be especially different. There is no inherent set of best final coordinates for any word, even with regard to a particular fixed corpus or initialization. There are just coordinates that work increasingly well, through a particular randomized initialization/training session, balanced over all the other co-trained words/examples.
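    So, if you do want to compare runs, compare neighborhoods rather than coordinates. A hypothetical check (the word list and top-10 cutoff are arbitrary choices, and m1/m2 are any two trained models, e.g. from the sketch above):

    def neighbor_overlap(model_a, model_b, word, topn=10):
        # fraction of shared nearest neighbors between two independently trained models
        a = {w for w, _ in model_a.wv.most_similar(word, topn=topn)}
        b = {w for w, _ in model_b.wv.most_similar(word, topn=topn)}
        return len(a & b) / topn

    for word in ['iphone', 'ios', 'mobile']:
        print(word, neighbor_overlap(m1, m2, word))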