I have a domain-specific corpus for which I am trying to train embeddings. Since I want the vocabulary to be comprehensive, I am adding word vectors from glove.6B.50d.txt, and after adding those vectors I train the model on my corpus. I have tried the suggested solutions, but the word embeddings don't seem to update.
This is the solution I have so far:
# imports assumed by the rest of the snippet
import numpy as np
from gensim.models import Word2Vec, KeyedVectors

# read glove embeddings (GLOVE_PATH points to glove.6B.50d.txt converted to word2vec text format)
glove_wv = KeyedVectors.load_word2vec_format(GLOVE_PATH, binary=False)

# initialize w2v model
model = Word2Vec(vector_size=50, min_count=0, window=20, epochs=10, sg=1, workers=10,
                 hs=1, ns_exponent=0.5, seed=42, sample=10**-2, shrink_windows=True)
model.build_vocab(sentences_tokenized)
training_examples_count = model.corpus_count

# add vocab from glove
model.build_vocab([list(glove_wv.key_to_index.keys())], update=True)
model.wv.vectors_lockf = np.zeros(len(model.wv))  # ALLOW UPDATE OF WEIGHTS FROM BACK PROP; 0 WILL SUPPRESS

# add glove embeddings
model.wv.intersect_word2vec_format(GLOVE_PATH, binary=False, lockf=1.0)
Below I am training the model and checking the word embedding of a particular word that is explicitly present in the training data.
# train model
model.train(sentences_tokenized, total_examples=training_examples_count, epochs=model.epochs)

# CHECK IF EMBEDDING CHANGES FOR 'oyo'
print(model.wv.get_vector('oyo'))
print(glove_wv.get_vector('oyo'))
The word embedding of the word oyo comes out to be the same before and after training. Where am I going wrong?
The input corpus sentences_tokenized contains a few sentences that contain the word oyo. One such sentence:
'oyo global platform empowers entrepreneur small business hotel home providing full stack technology increase earnings eas operation bringing affordable trusted accommodation guest book instantly india largest budget hotel chain oyo room one preferred hotel booking destination vast majority student country hotel chain offer many benefit include early check in couple room id card flexibility oyo basically network budget hotel completely different famous hotel aggregator like goibibo yatra makemytrip partner zero two star hotel give makeover room bring customer hotel website mobile app'
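For completeness, the check above boils down to roughly this comparison (using np.allclose just to make the "unchanged" claim explicit):

import numpy as np

# is the trained vector still identical to the GloVe vector it was seeded with?
print(np.allclose(model.wv.get_vector('oyo'), glove_wv.get_vector('oyo')))  # -> True, the vector never moved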
You're improvising a lot here with a bunch of potential errors or suboptimalities. Note especially that:
- The intersect_word2vec_format() feature, and especially its lockf functionality, are experimental: one stab at maybe offering a way to mix in other word-vectors, but without any theoretical support. (I also believe intersect_word2vec_format() remains slightly broken in recent (circa 4.1.2) Gensim versions, though there may be a simple workaround.)
- The lockf functionality may require tricky manual initialization & adaptation to other non-standard steps. To use it, it'd be best to read & understand the Gensim source code where the related variables appear.
So, if you really need a larger vocabulary than your initial domain-specific corpus provides, the safest approach is probably to extend your training corpus with more texts that feature the desired words, used in similar language contexts. (For example, if your domain is scientific discourse, you'd want to extend your corpus with more similar scientific text to learn compatible words – not, say, classic fiction.) Then all words go through the well-characterized simultaneous training process.
That said, if you really want to continue experimenting with this potentially complicated and error-prone improvised approach, your main problems might be:
- the intersect_word2vec_format() bug;
- whether .vectors_lockf is the right length, with 1.0 in all the right slots for word-updates, before training.
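A minimal way to inspect that, assuming a Gensim 4.x model set up as in the question, might be:

import numpy as np

# 0.0 in vectors_lockf freezes a word's vector during training; 1.0 lets it be updated
lockf = model.wv.vectors_lockf
print(lockf.shape, len(model.wv))                 # should be one entry per word (or broadcastable)
print(np.count_nonzero(lockf), "slots unlocked")  # how many words can actually be trained

# to let every word train normally, unlock everything before calling .train()
model.wv.vectors_lockf = np.ones(len(model.wv), dtype=np.float32)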
Separately, other observations:
- min_count=0 is usually a bad idea: these models improve when you discard rare words entirely. (Though, when doing a .build_vocab(…, update=True) vocab-expansion, a bunch of things with the usual neat handling of low-frequency words and frequency-sorted vocabularies become screwy.)
- hs=1 should generally not be set without also disabling the usually-preferred default negative-sampling with negative=0. (Otherwise, you're creating a hybrid franken-model, using both modes on one side of the internal neural network, sharing the same input word-vectors: a much slower approach not especially likely to be better than either alone.)
- ns_exponent=0.5 is non-standard; non-standard values for this parameter are most likely to offer benefit in peculiar situations (like training texts that aren't true natural-language sentences), and it should only be tweaked within a harness for comparing results against alternate values.
- sample=10**-2 is also non-standard, and such a large value might be nearly the same as turning off sample entirely (say, with a 0 value). It's more common to want to make this parameter more aggressive (smaller than the default) if you have plentiful training data.
In general, while the defaults aren't sacred, you generally should avoid tinkering with them until you have both (a) a good idea of why your corpus/goals might benefit from a different value; & (b) a system for verifying which alterations are helping or hurting, such as a grid-search over many parameter combinations that scores the fitness of the resulting models on (some proxy for) your true end task.
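As a rough illustration of that kind of harness (evaluate_on_my_task is a placeholder you'd replace with a scorer for your own downstream task):

from itertools import product
from gensim.models import Word2Vec

def evaluate_on_my_task(model):
    # placeholder: return a score from some proxy for your real end task,
    # e.g. accuracy on a small domain-specific similarity or classification set
    return 0.0

results = []
# a small grid over some of the parameters questioned above
for negative, ns_exponent, sample in product([5, 10], [0.5, 0.75], [1e-3, 1e-5]):
    m = Word2Vec(
        sentences=sentences_tokenized,
        vector_size=50, sg=1, workers=10, epochs=10, seed=42, min_count=5,
        hs=0, negative=negative,   # plain negative-sampling, no hs/negative hybrid
        ns_exponent=ns_exponent,
        sample=sample,
    )
    results.append(((negative, ns_exponent, sample), evaluate_on_my_task(m)))

# keep the combination that scores best on the proxy task
print(max(results, key=lambda r: r[1]))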