
Gensim pretrained model similarity


Problem:

I'm using the GloVe pre-trained model and its vectors to retrain my model on a specific domain, say #cars. After training, I want to find similar words within my domain, but I get words that are not in my domain corpus; I believe they come from GloVe's vectors.

model_2.most_similar(positive=['spacious'],    topn=10)

[('bedrooms', 0.6275501251220703),
 ('roomy', 0.6149100065231323),
 ('luxurious', 0.6105825901031494),
 ('rooms', 0.5935696363449097),
 ('furnished', 0.5897485613822937),
 ('cramped', 0.5892841219902039),
 ('courtyard', 0.5721820592880249),
 ('bathrooms', 0.5618442893028259),
 ('opulent', 0.5592212677001953),
 ('expansive', 0.555268406867981)]

Here I expected something like 'leg-room' or other spacious-related car features mentioned in the domain's corpus. How can we exclude the generic GloVe neighbors while still getting similar words from my domain?

Thanks


Solution

  • There may not be enough info in a simple set of generic word-vectors to filter neighbors by domain-of-use.

    You could try a mixed weighting: combine each candidate's similarity to 'spacious' with its similarity to 'cars', and return the top results by that combined score. It might help a little.
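
    For instance, here is a rough sketch of that re-ranking, assuming model_2 exposes the gensim KeyedVectors interface (as in the question) and that an anchor word like 'car' is in its vocabulary:

    # Pull a broad neighbor list, then re-rank it by averaging each
    # candidate's similarity to the query word and to a domain anchor.
    candidates = model_2.most_similar(positive=['spacious'], topn=100)
    reranked = sorted(
        ((word, (model_2.similarity(word, 'spacious') +
                 model_2.similarity(word, 'car')) / 2)
         for word, _ in candidates),
        key=lambda pair: pair[1],
        reverse=True)
    print(reranked[:10])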

    Supplying more than one positive word to the most_similar() method might approximate this. If you're sure of some major sources of interference/overlap, you might even be able to use negative word examples, similar to how word2vec finds candidate answers for analogies (though this might also suppress useful results that are legitimately related to both domains, like 'roomy'). For example:

    candidates = vec_model.most_similar(positive=['spacious', 'car'], 
                                        negative=['house'])
    

    (Instead of using single words like 'car' or 'house' you could also try using vectors combined from many words that define a domain.)
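
    For example, gensim's most_similar() accepts raw vectors alongside word keys in its positive list, so you could average several domain-defining words into a single domain vector. A rough sketch (the word list is an illustrative assumption):

    import numpy as np

    # Average a handful of in-vocabulary words that characterize the domain.
    domain_words = ['car', 'vehicle', 'interior', 'seat']
    domain_vec = np.mean([model_2[w] for w in domain_words], axis=0)

    # Use the averaged domain vector as an extra positive example.
    candidates = model_2.most_similar(positive=['spacious', domain_vec],
                                      topn=10)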

    But a sharp distinction sounds like a research project rather than something easily possible with off-the-shelf libraries/vectors, and may require more sophisticated approaches and datasets.

    You could also try using a set of vectors trained only on a dataset of text from the domain of interest, thus ensuring that the vocabulary, and the senses of its words, all come from that domain.
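
    For instance, a minimal training sketch with gensim's Word2Vec (assuming gensim 4.x; the corpus file name and parameters are illustrative):

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # One sentence per line, tokens separated by whitespace.
    sentences = LineSentence('cars_corpus.txt')

    # Train on domain text only, so every neighbor comes from the
    # domain's own vocabulary and word senses.
    domain_model = Word2Vec(sentences, vector_size=100, window=5,
                            min_count=5, epochs=10)
    print(domain_model.wv.most_similar('spacious', topn=10))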