This problem is going completely over my head. I am training a Word2Vec model using gensim. I have provided data in multiple languages i.e. English and Hindi. When I am trying to find the words closest to 'man', this is what I am getting:
model.wv.most_similar(positive = ['man'])
Out[14]:
[('woman', 0.7380284070968628),
('lady', 0.6933152675628662),
('monk', 0.6662989258766174),
('guy', 0.6513140201568604),
('soldier', 0.6491742134094238),
('priest', 0.6440571546554565),
('farmer', 0.6366012692451477),
('sailor', 0.6297377943992615),
('knight', 0.6290514469146729),
('person', 0.6288090944290161)]
--------------------------------------------
Problem is, these are all English words. Then I tried to find similarity between same meaning Hindi and English words,
model.similarity('man', 'आदमी')
__main__:1: DeprecationWarning: Call to deprecated `similarity` (Method will
be removed in 4.0.0, use self.wv.similarity() instead).
Out[13]: 0.078265618974427215
This accuracy should have been better than all the other accuracies. The Hindi corpus I have has been made by translating the English one. Hence the words appear in similar contexts. Hence they should be close.
This is what I am doing here:
#Combining all the words together.
all_reviews=HindiWordsList + EnglishWordsList
#Training FastText model
cpu_count=multiprocessing.cpu_count()
model=Word2Vec(size=300,window=5,min_count=1,alpha=0.025,workers=cpu_count,max_vocab_size=None,negative=10)
model.build_vocab(all_reviews)
model.train(all_reviews,total_examples=model.corpus_count,epochs=model.iter)
model.save("word2vec_combined_50.bin")
First of all, you should really use self.wv.similarity().
I'm assuming there are very close to no words that exist in both between your Hindi corpus and English corpus, since Hindi corpus is in Devanagari and English is in, well, English. Simply adding two corpuses together to make a model does not make sense. Corresponding words in the two languages co-occur in two versions of a document, but not in your word embeddings for Word2Vec to figure out most similar.
Eg. Until your model knows that
Man:Aadmi::Woman:Aurat,
from the word embeddings, it can never make out the
Raja:King::Rani:Queen
relation. And for that, you need some anchor between the two corpuses. Here are a few suggestions that you can try out:
These might be enough to give you an idea. You can also look into seq2seq if you want only want to do translations. You can also read the Word2Vec theory in detail to understand what it does.