python nlp artificial-intelligence word2vec gensim

word2vec gensim multiple languages

This problem is going completely over my head. I am training a Word2Vec model using gensim. I have provided data in multiple languages i.e. English and Hindi. When I am trying to find the words closest to 'man', this is what I am getting:

model.wv.most_similar(positive = ['man'])
Out[14]: 
[('woman', 0.7380284070968628),
 ('lady', 0.6933152675628662),
 ('monk', 0.6662989258766174),
 ('guy', 0.6513140201568604),
 ('soldier', 0.6491742134094238),
 ('priest', 0.6440571546554565),
 ('farmer', 0.6366012692451477),
 ('sailor', 0.6297377943992615),
 ('knight', 0.6290514469146729),
 ('person', 0.6288090944290161)]
--------------------------------------------

Problem is, these are all English words. Then I tried to find similarity between same meaning Hindi and English words,

model.similarity('man', 'आदमी')
__main__:1: DeprecationWarning: Call to deprecated `similarity` (Method will 
be removed in 4.0.0, use self.wv.similarity() instead).
Out[13]: 0.078265618974427215

This accuracy should have been better than all the other accuracies. The Hindi corpus I have has been made by translating the English one. Hence the words appear in similar contexts. Hence they should be close.

This is what I am doing here:

#Combining all the words together.
all_reviews=HindiWordsList + EnglishWordsList

#Training FastText model
cpu_count=multiprocessing.cpu_count()
model=Word2Vec(size=300,window=5,min_count=1,alpha=0.025,workers=cpu_count,max_vocab_size=None,negative=10)
model.build_vocab(all_reviews)
model.train(all_reviews,total_examples=model.corpus_count,epochs=model.iter)
model.save("word2vec_combined_50.bin")

Solution

First of all, you should really use self.wv.similarity().

I'm assuming there are very close to no words that exist in both between your Hindi corpus and English corpus, since Hindi corpus is in Devanagari and English is in, well, English. Simply adding two corpuses together to make a model does not make sense. Corresponding words in the two languages co-occur in two versions of a document, but not in your word embeddings for Word2Vec to figure out most similar.

Eg. Until your model knows that

Man:Aadmi::Woman:Aurat,

from the word embeddings, it can never make out the

Raja:King::Rani:Queen

relation. And for that, you need some anchor between the two corpuses. Here are a few suggestions that you can try out:

Make an independent Hindi corpus/model
Maintain and lookup data of a few English->Hindi word pairs that you have will have to create manually.
Randomly replace input document words with their counterparts from the corresponding document while training

These might be enough to give you an idea. You can also look into seq2seq if you want only want to do translations. You can also read the Word2Vec theory in detail to understand what it does.