Tags: python, pandas, nlp, word2vec

Word2vec in pandas dataframe


I am trying to apply word2vec to check the similarity of two columns, row by row, in my dataset.

For instance:

Sent1                                     Sent2
It is a sunny day                         Today the weather is good. It is warm outside
What people think about democracy         In ancient times, Greeks were the first to propose democracy  
I have never played tennis                I do not know who Roger Feder is 
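
In a pandas DataFrame this data would look roughly like the following (a minimal sketch; the column names Sent1 and Sent2 are taken from the table above):

import pandas as pd

# Hypothetical reconstruction of the example data above
df = pd.DataFrame({
    "Sent1": ["It is a sunny day",
              "What people think about democracy",
              "I have never played tennis"],
    "Sent2": ["Today the weather is good. It is warm outside",
              "In ancient times, Greeks were the first to propose democracy",
              "I do not know who Roger Feder is"],
})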

To apply word2vec, I consider the following:

import numpy as np

words1 = sentence1.split(' ')
words2 = sentence2.split(' ')
# word2vec(w) is assumed to return the embedding vector for the word w
# The meaning of the sentence can be interpreted as the average of its word vectors
sentence1_meaning = word2vec(words1[0])
count = 1
for w in words1[1:]:
    sentence1_meaning = np.add(sentence1_meaning, word2vec(w))
    count += 1
sentence1_meaning /= count

sentence2_meaning = word2vec(words2[0])
count = 1
for w in words2[1:]:
    sentence2_meaning = np.add(sentence2_meaning, word2vec(w))
    count += 1
sentence2_meaning /= count

#Similarity is the cosine between the vectors
similarity = np.dot(sentence1_meaning, sentence2_meaning)/(np.linalg.norm(sentence1_meaning)*np.linalg.norm(sentence2_meaning))

However, this works for two standalone sentences, not for sentences stored in a pandas dataframe.

Can you please tell me how to apply word2vec to a pandas dataframe to check the similarity between Sent1 and Sent2? I would like the results in a new column.


Solution

  • I don't have a trained word2vec model available, so I will show how to do what you want with a bogus word2vec, combining word vectors into sentence vectors with tfidf weights.

    Step 1. Prepare data

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.DataFrame({"sentences": ["this is a sentence", "this is another sentence"]})
    
    tfidf = TfidfVectorizer()
    tfidf_matrix = tfidf.fit_transform(df.sentences).todense()
    vocab = tfidf.vocabulary_
    vocab
    {'this': 3, 'is': 1, 'sentence': 2, 'another': 0}
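
    As a side note (not part of the original answer), the columns of tfidf_matrix follow the vocabulary indices shown above; this is the ordering that matters in Step 2, and it can be recovered explicitly:

    # Columns of tfidf_matrix are ordered by the vocabulary indices above
    # (in recent scikit-learn versions, tfidf.get_feature_names_out() returns the same ordering)
    feature_order = sorted(vocab, key=vocab.get)
    feature_order
    ['another', 'is', 'sentence', 'this']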
    

    Step 2. Create a bogus word2vec (of the size of our vocab)

    word2vec = np.random.randn(len(vocab), 300)  # one random 300-dimensional vector per vocabulary word
    

    Step 3. Calculate a column containing a sentence vector for each sentence:

    sent2vec_matrix = np.dot(tfidf_matrix, word2vec) # word2vec here contains vectors in the same order as in vocab
    df["sent2vec"] = sent2vec_matrix.tolist()
    df
    
                      sentences                                           sent2vec
    0        this is a sentence  [-2.098592110459085, 1.4292324332403232, -1.10...
    1  this is another sentence  [-1.7879436822159966, 1.680865619703155, -2.00...
    

    Step 4. Calculate the similarity matrix

    from sklearn.metrics.pairwise import cosine_similarity
    similarity = cosine_similarity(df["sent2vec"].tolist())
    similarity
    array([[1.        , 0.76557098],
           [0.76557098, 1.        ]])
    

    For your word2vec to work, you will need to slightly adjust Step 2 so that word2vec contains all the words in vocab, in the same order (ordered by the vocabulary index values, which is the same as alphabetical order).

    For your case it should be:

    # Assuming your trained word2vec behaves like a dict mapping word -> vector
    sorted_vocab = sorted(vocab.keys())          # alphabetical, i.e. the order of the vocabulary indices
    sorted_word2vec = []
    for word in sorted_vocab:
        sorted_word2vec.append(word2vec[word])
    sorted_word2vec = np.array(sorted_word2vec)  # shape: (len(vocab), embedding_dim)
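
    To get the new column the question asks for, the same steps can be applied to both sentence columns and combined row-wise. The following is a sketch under the assumptions above (column names Sent1/Sent2 from the question, bogus random word vectors in place of a trained model; with a real model you would use sorted_word2vec instead):

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    df = pd.DataFrame({
        "Sent1": ["It is a sunny day",
                  "What people think about democracy",
                  "I have never played tennis"],
        "Sent2": ["Today the weather is good. It is warm outside",
                  "In ancient times, Greeks were the first to propose democracy",
                  "I do not know who Roger Feder is"],
    })

    # Fit one vectorizer on all sentences so both columns share the same vocabulary
    tfidf = TfidfVectorizer()
    tfidf.fit(pd.concat([df["Sent1"], df["Sent2"]]))

    # Bogus word vectors, one row per vocabulary word (replace with sorted_word2vec above)
    word2vec = np.random.randn(len(tfidf.vocabulary_), 300)

    def sent2vec(column):
        # tfidf-weighted combination of word vectors, as in Step 3
        return np.asarray(tfidf.transform(column).todense()) @ word2vec

    vec1 = sent2vec(df["Sent1"])
    vec2 = sent2vec(df["Sent2"])

    # Row-wise cosine similarity: the diagonal of the pairwise similarity matrix
    df["similarity"] = np.diag(cosine_similarity(vec1, vec2))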