Search code examples
pythonnlpartificial-intelligenceword2vecgensim

word2vec gensim multiple languages


This problem is going completely over my head. I am training a Word2Vec model using gensim. I have provided data in multiple languages i.e. English and Hindi. When I am trying to find the words closest to 'man', this is what I am getting:

model.wv.most_similar(positive = ['man'])
Out[14]: 
[('woman', 0.7380284070968628),
 ('lady', 0.6933152675628662),
 ('monk', 0.6662989258766174),
 ('guy', 0.6513140201568604),
 ('soldier', 0.6491742134094238),
 ('priest', 0.6440571546554565),
 ('farmer', 0.6366012692451477),
 ('sailor', 0.6297377943992615),
 ('knight', 0.6290514469146729),
 ('person', 0.6288090944290161)]
--------------------------------------------

Problem is, these are all English words. Then I tried to find similarity between same meaning Hindi and English words,

model.similarity('man', 'आदमी')
__main__:1: DeprecationWarning: Call to deprecated `similarity` (Method will 
be removed in 4.0.0, use self.wv.similarity() instead).
Out[13]: 0.078265618974427215

This accuracy should have been better than all the other accuracies. The Hindi corpus I have has been made by translating the English one. Hence the words appear in similar contexts. Hence they should be close.

This is what I am doing here:

#Combining all the words together.
all_reviews=HindiWordsList + EnglishWordsList

#Training FastText model
cpu_count=multiprocessing.cpu_count()
model=Word2Vec(size=300,window=5,min_count=1,alpha=0.025,workers=cpu_count,max_vocab_size=None,negative=10)
model.build_vocab(all_reviews)
model.train(all_reviews,total_examples=model.corpus_count,epochs=model.iter)
model.save("word2vec_combined_50.bin")

Solution

  • First of all, you should really use self.wv.similarity().

    I'm assuming there are very close to no words that exist in both between your Hindi corpus and English corpus, since Hindi corpus is in Devanagari and English is in, well, English. Simply adding two corpuses together to make a model does not make sense. Corresponding words in the two languages co-occur in two versions of a document, but not in your word embeddings for Word2Vec to figure out most similar.

    Eg. Until your model knows that

    Man:Aadmi::Woman:Aurat,

    from the word embeddings, it can never make out the

    Raja:King::Rani:Queen

    relation. And for that, you need some anchor between the two corpuses. Here are a few suggestions that you can try out:

    1. Make an independent Hindi corpus/model
    2. Maintain and lookup data of a few English->Hindi word pairs that you have will have to create manually.
    3. Randomly replace input document words with their counterparts from the corresponding document while training

    These might be enough to give you an idea. You can also look into seq2seq if you want only want to do translations. You can also read the Word2Vec theory in detail to understand what it does.