Purpose: We are exploring word2vec models for clustering our data. To find the model that best fits our needs, we have tried (1) existing models offered via spaCy and Gensim (trained on internet data only) and (2) custom models built with Gensim (trained on our technical data only). We are now looking into (3) hybrid models that add our technical data to an existing model (trained on internet data plus our data).
Here is how we created our hybrid model by adding our data to an existing Gensim model:
import gensim.downloader as api
from gensim.models import Word2Vec

model = api.load("word2vec-google-news-300")
model = Word2Vec(size=300, min_count=1)
model.build_vocab(our_data)
model.train(our_data, total_examples=2, epochs=1)
model.wv.vocab
Question: Did we do this correctly, given our intention of having a model that is trained on internet data and then layered with our data?
Concerns: We are wondering whether our data was really added to the model. When using the most_similar function with this model, we see very high similarity scores for more general words. Our custom model gives much lower scores, but for more technical words. See the output below.
Most Similar results for 'Python'
This model (internet + our data):
'technicians' = .99
'system' = .99
'working' = .99
Custom model (just our data):
'scripting' = .65
'perl' = .63
'julia' = .58
No: your code won't do what you intend.
When you execute the line...
model = Word2Vec(size=300, min_count=1)
...you've created an all-new, empty Word2Vec object, assigning it into the model variable, which discards anything that's already there. So the prior-loaded data will have no effect. You're just training a new model on your (tiny) data.
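To make the reassignment point concrete, here is a minimal sketch in plain Python. The two classes are hypothetical stand-ins (they are not Gensim classes) for the object returned by api.load(...) and for a freshly constructed Word2Vec(...); the only thing being shown is that the second assignment rebinds the name to a brand-new, empty object.

```python
class PretrainedVectors:
    """Hypothetical stand-in for the object returned by api.load(...)."""

    def __init__(self):
        # Pretend these words came from the pre-trained data.
        self.vocab = {"internet", "news", "python"}


class EmptyModel:
    """Hypothetical stand-in for a freshly constructed Word2Vec(...)."""

    def __init__(self):
        # A new model starts with no vocabulary at all.
        self.vocab = set()


model = PretrainedVectors()
pretrained = model       # extra reference, kept only so we can compare below
model = EmptyModel()     # rebinds the name: nothing is merged or carried over

# The name now points at a different, empty object.
replaced = model is not pretrained
print(type(model).__name__, len(model.vocab), replaced)
```

Anything trained through the model name after that point only ever sees the new, empty object.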
Further, the object you had loaded isn't a full Word2Vec model. The 'GoogleNews' vectors that Google released back in 2013 are only the vectors, not a full model. There's no straightforward & reliable way to keep training that object, as it is missing lots of information a real full model would have (including word-frequencies and the model's internal weights).
There are some advanced ways you could try to seed your own model with those values - but they involve lots of murky tradeoffs & poorly-documented steps, in order for the end-results to have any value, compared to just training your own model on your own sufficient data. There's no officially-documented/supported way to do it in Gensim.