python · stanford-nlp · gensim · word2vec · word-embedding

Gensim train not updating weights


I have a domain-specific corpus for which I am trying to train embeddings. Since I want the vocabulary to be comprehensive, I am adding word vectors from glove.6B.50d.txt. After adding the GloVe vectors, I train the model on my corpus.

I am trying the solutions from here but the word embeddings don't seem to update.

This is the solution I have so far.

from gensim.models import Word2Vec, KeyedVectors
import numpy as np

# read glove embeddings (file must be in word2vec text format)
glove_wv = KeyedVectors.load_word2vec_format(GLOVE_PATH, binary=False)

# initialize w2v model
model = Word2Vec(vector_size=50, min_count=0, window=20, epochs=10, sg=1, workers=10,
                 hs=1, ns_exponent=0.5, seed=42, sample=10**-2, shrink_windows=True)
model.build_vocab(sentences_tokenized)
training_examples_count = model.corpus_count

# add vocab from glove
model.build_vocab([list(glove_wv.key_to_index.keys())], update=True)
model.wv.vectors_lockf = np.zeros(len(model.wv))  # per-word lock factors: 1.0 allows training updates, 0.0 suppresses them

# add glove embeddings
model.wv.intersect_word2vec_format(GLOVE_PATH,binary=False, lockf=1.0)

Below I train the model and check the embedding of a word ('oyo') that is explicitly present in the training data.

# train model
model.train(sentences_tokenized,total_examples=training_examples_count, epochs=model.epochs)

#CHECK IF EMBEDDING CHANGES FOR 'oyo'
print(model.wv.get_vector('oyo'))
print(glove_wv.get_vector('oyo'))

The word embedding of oyo comes out the same before and after training. Where am I going wrong?

The input corpus, sentences_tokenized, contains a few sentences that contain the word oyo. One such sentence:

'oyo global platform empowers entrepreneur small business hotel home providing full stack technology increase earnings eas operation bringing affordable trusted accommodation guest book instantly india largest budget hotel chain oyo room one preferred hotel booking destination vast majority student country hotel chain offer many benefit include early check in couple room id card flexibility oyo basically network budget hotel completely different famous hotel aggregator like goibibo yatra makemytrip partner zero two star hotel give makeover room bring customer hotel website mobile app'

Solution

  • You're improvising a lot here with a bunch of potential errors or suboptimalities. Note especially that:

    • While (because it's Python) you can always mutate the models however you want for interesting effects, seeding a model with outside word-vectors then continuing training isn't formally- or well-supported by Gensim. As far as I can tell – & I wrote a bunch of this code! – there aren't any good docs/examples of doing it well, or doing the necessary tuning/validation of results, or demonstrating a reliable advantage of this technique. Most examples online are of eager people plowing ahead unaware of the tradeoffs, seeing a trivial indicator of completion or a tiny bit of encouraging results, and then overconfidently showing their work as if this were a well-grounded technique or best-practice. It isn't. Without a deep understanding of the model, & review of the source code, & regular re-checking of your results for sanity/improvement, there will be hidden gotchas. It is especially the case that fresh training on just a subset of all words could pull those words out of compatible coordinate alignment with other words not receiving training.
    • The intersect_word2vec_format() feature, and especially its lockf functionality, are also experimental – one stab at maybe offering a way to mix in other word-vectors, but without any theoretical support. (I also believe intersect_word2vec_format() remains slightly broken in recent (circa 4.1.2) Gensim versions, though there may be a simple workaround.) Further, the lockf functionality may require tricky manual initialization & adaptation in other non-standard steps. To use it, it'd be best to read & understand the Gensim source code where related variables appear.

    So, if you really need a larger vocabulary than your initial domain-specific corpus provides, the safest approach is probably to extend your training corpus with more texts that feature the desired words, used in similar language contexts. (For example, if your domain is scientific discourse, you'd want to extend your corpus with more similar scientific text to learn compatible words – not, say, classic fiction.) Then all words go through the well-characterized simultaneous training process, as in the sketch below.
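
    As a minimal sketch of that approach – assuming extra_domain_sentences is a list of tokenized sentences from additional in-domain text (the name is illustrative, not from your code):

    from gensim.models import Word2Vec

    # merge the original corpus with additional in-domain text, then train once
    combined_corpus = sentences_tokenized + extra_domain_sentences

    model = Word2Vec(
        sentences=combined_corpus,  # all words trained simultaneously
        vector_size=50,
        window=5,
        min_count=5,
        sg=1,
        epochs=10,
        workers=10,
        seed=42,
    )
    print(model.wv.most_similar('oyo', topn=5))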

    That said, if you really want to continue experimenting with this potentially complicated and error-prone improvised approach, your main problems might be:

    • using strings as your sentences instead of lists-of-tokens (so the training 'words' wind up actually just being single-characters)
    • something related to the intersect_word2vec_format bug; check that .vectors_lockf is the right length, with 1.0 in all the right slots for word-updates, before training – see the sanity check sketched below
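
    For the second point, a quick sanity check before calling train() might look like this (a sketch; exact attribute layouts can differ across Gensim 4.x versions):

    # sentences must be lists of tokens, not raw strings
    assert isinstance(sentences_tokenized[0], list)

    # vectors_lockf should have one float per vocabulary word,
    # with 1.0 in the slot of every word you want training to update
    lockf = model.wv.vectors_lockf
    print(len(lockf) == len(model.wv))            # expect True
    print(lockf[model.wv.key_to_index['oyo']])    # expect 1.0, not 0.0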

    Separately, other observations:

    • min_count=0 is usually a bad idea: these models improve when you discard rare words entirely. (Though, when doing a .build_vocab(…, update=True) vocab-expansion, a bunch of things with the usual neat handling of low-frequency words and frequency-sorted vocabularies become screwy.)
    • hs=1 should generally not be set without also disabling the usually-preferred default negative-sampling with negative=0. (Otherwise, you're creating a hybrid franken-model, using both modes on one side of the internal neural network, that share the same input word-vectors: a much slower approach not especially likely to be better than either alone.) A more conventional configuration is sketched after this list.
    • ns_exponent=0.5 is non-standard, and using non-standard values for the parameter is most-likely to offer benefit in peculiar situations (like training texts that aren't true natural language sentences), and should only be tweaked within a harness for comparing results with alternate values.
    • sample=10**-2 is also non-standard, and such a large value might be nearly the same as turning off sample (say with a 0 value) entirely. It's more common to want to make this parameter more-aggressive (smaller than the default), if you have plentiful training data.
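
    For reference, a more conventional skip-gram configuration, shown only to make the defaults discussed above concrete (not a tuned recommendation for your corpus):

    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=sentences_tokenized,
        vector_size=50,
        sg=1,              # skip-gram
        hs=0,              # default: hierarchical softmax off...
        negative=5,        # ...negative sampling alone
        ns_exponent=0.75,  # default negative-sampling distribution exponent
        sample=1e-3,       # default high-frequency downsampling threshold
        min_count=5,       # default: discard very rare words
        window=5,
        epochs=10,
        workers=10,
        seed=42,
    )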

    In general, while the defaults aren't sacred, you should avoid tinkering with them until you have both (a) a good idea of why your corpus/goals might benefit from a different value; & (b) a system for verifying which alterations are helping or hurting, such as a grid-search over many parameter combinations that scores the fitness of the resulting models on (some proxy for) your true end task.
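
    As a rough illustration of such a harness, where score_model() is a stand-in for whatever evaluation approximates your true end task (a hypothetical helper, not a Gensim API):

    import itertools
    from gensim.models import Word2Vec

    param_grid = {
        'window': [5, 10, 20],
        'negative': [5, 10],
        'sample': [1e-3, 1e-4],
    }

    results = []
    for window, negative, sample in itertools.product(*param_grid.values()):
        m = Word2Vec(sentences=sentences_tokenized, vector_size=50, sg=1,
                     window=window, negative=negative, sample=sample,
                     epochs=10, workers=10, seed=42)
        # score_model: your own proxy-task evaluation, returning higher-is-better
        results.append(((window, negative, sample), score_model(m)))

    best_params, best_score = max(results, key=lambda r: r[1])
    print(best_params, best_score)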