python-3.x, word2vec

Is it possible to set the embedding weight matrix before or after training word2vec?


I need to change the embedding matrix of a word2vec model after training it. Here is an example:

import numpy as np
from gensim.models import Word2Vec

w2v=Word2Vec(sentences,size=100,window=1,min_count=1,negative=15,iter=3)
w2v.save("word2vec.model")

#Getting embedding matrix
embedding_matrix=w2v.wv.vectors

for p in ("mujer", "hombre"):
    result=w2v.wv.similar_by_word(p)
    print("Similar words from '",p,"': ",result[:3])

#Trying to set the weights matrix with random values
w2v.wv.vectors=np.random.rand(w2v.wv.vectors.shape[0],w2v.wv.vectors.shape[1])

print()

for p in ("mujer", "hombre"):
    result=w2v.wv.similar_by_word(p)
    print("Similar words from '",p,"': ",result[:3])

And here is the output:

Similar words from ' mujer ':  [('honra', 0.9999152421951294), ('muerte', 0.9998959302902222), ('contento', 0.999891459941864)]
Similar words from ' hombre ':  [('valor', 0.9999064207077026), ('nombre', 0.9998984336853027), ('llegar', 0.9998887181282043)]

Similar words from ' mujer ':  [('honra', 0.9999152421951294), ('muerte', 0.9998959302902222), ('contento', 0.999891459941864)]
Similar words from ' hombre ':  [('valor', 0.9999064207077026), ('nombre', 0.9998984336853027), ('llegar', 0.9998887181282043)]

As you can see, I get the same similar-word results despite having replaced the embedding matrix with random numbers.

I couldn't find any method in the documentation for making this change.

Is it possible at all?


Solution

  • I already found the solution: similar_by_word works on a cached L2-normalised copy of the matrix, so that cache has to be rebuilt with init_sims() after setting the array.

    import numpy as np
    from gensim.models import Word2Vec

    w2v=Word2Vec(sentences,size=100,window=1,min_count=1,negative=15,iter=3)
    w2v.save("word2vec.model")
    
    #Getting embedding matrix
    embedding_matrix=w2v.wv.vectors
    
    for p in ("mujer", "hombre"):
        result=w2v.wv.similar_by_word(p)
        print("Similar words from '",p,"': ",result[:3])
    
    #Setting new values on the weights matrix
    w2v.wv.vectors=np.random.rand(w2v.wv.vectors.shape[0],w2v.wv.vectors.shape[1])
    
    #Drop the cached L2-normalised matrix and rebuild it, otherwise
    #similar_by_word keeps using the old normalised copy
    w2v.wv.vectors_norm=None
    w2v.wv.init_sims()
    
    print()
    
    for p in ("mujer", "hombre"):
        result=w2v.wv.similar_by_word(p)
        print("Similar words from '",p,"': ",result[:3])