I am trying to apply word2vec to check similarity of two columns per each row of my dataset.
For instance:
| Sent1 | Sent2 |
|---|---|
| It is a sunny day | Today the weather is good. It is warm outside |
| What people think about democracy | In ancient times, Greeks were the first to propose democracy |
| I have never played tennis | I do not know who Roger Federer is |
To apply word2vec, I consider the following:
```python
import numpy as np

words1 = sentence1.split(' ')
words2 = sentence2.split(' ')

# The meaning of a sentence is interpreted as the average of its word vectors
sentence1_meaning = word2vec(words1[0])
count = 1
for w in words1[1:]:
    sentence1_meaning = np.add(sentence1_meaning, word2vec(w))
    count += 1
sentence1_meaning /= count

sentence2_meaning = word2vec(words2[0])
count = 1
for w in words2[1:]:
    sentence2_meaning = np.add(sentence2_meaning, word2vec(w))
    count += 1
sentence2_meaning /= count

# Similarity is the cosine between the two sentence vectors
similarity = np.dot(sentence1_meaning, sentence2_meaning) / (
    np.linalg.norm(sentence1_meaning) * np.linalg.norm(sentence2_meaning)
)
```
However, this works for two standalone sentences, not for sentences stored in a pandas dataframe.
What do I need to do to apply word2vec to a pandas dataframe and check the similarity between sent1 and sent2 row by row? I would like the results in a new column.
I don't have a trained word2vec model available, so I will show how to do what you want with a bogus word2vec, combining word vectors into sentence vectors via tf-idf weights.
Step 1. Prepare data
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"sentences": ["this is a sentence", "this is another sentence"]})
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df.sentences).todense()
vocab = tfidf.vocabulary_
vocab
```

```
{'this': 3, 'is': 1, 'sentence': 2, 'another': 0}
```
Step 2. Create a bogus word2vec (one random vector per word in our vocab)

```python
import numpy as np

word2vec = np.random.randn(len(vocab), 300)
```
Step 3. Calculate a column containing sentence vectors:

```python
# word2vec rows are in the same order as the indices in vocab
sent2vec_matrix = np.dot(tfidf_matrix, word2vec)
df["sent2vec"] = sent2vec_matrix.tolist()
df
```

```
                  sentences                                           sent2vec
0        this is a sentence  [-2.098592110459085, 1.4292324332403232, -1.10...
1  this is another sentence  [-1.7879436822159966, 1.680865619703155, -2.00...
```
Step 4. Calculate the similarity matrix

```python
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(df["sent2vec"].tolist())
similarity
```

```
array([[1.        , 0.76557098],
       [0.76557098, 1.        ]])
```
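Since the question asks for a per-row similarity between two sentence columns rather than a full similarity matrix, here is a minimal sketch of the same pipeline applied row-wise. The column names `sent1`, `sent2`, and `similarity` are assumptions, and the random matrix again stands in for a trained word2vec:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Example frame with two sentence columns (names are illustrative)
df = pd.DataFrame({
    "sent1": ["It is a sunny day", "I have never played tennis"],
    "sent2": ["Today the weather is good", "I do not know who Roger Federer is"],
})

# Fit one vectorizer over both columns so they share a vocabulary
tfidf = TfidfVectorizer()
tfidf.fit(pd.concat([df["sent1"], df["sent2"]]))

# Bogus word2vec again: one 300-d vector per vocab entry (assumption)
rng = np.random.default_rng(0)
word2vec = rng.standard_normal((len(tfidf.vocabulary_), 300))

# tf-idf-weighted sentence vectors for each column
vec1 = np.asarray(tfidf.transform(df["sent1"]).todense()) @ word2vec
vec2 = np.asarray(tfidf.transform(df["sent2"]).todense()) @ word2vec

# Row-wise cosine similarity, stored in a new column
num = (vec1 * vec2).sum(axis=1)
den = np.linalg.norm(vec1, axis=1) * np.linalg.norm(vec2, axis=1)
df["similarity"] = num / den
```

Fitting the vectorizer on both columns together matters: otherwise the two columns would get different vocabularies and the sentence vectors would not live in the same space.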
For your real word2vec to work, you will need to slightly adjust Step 2 so that word2vec contains all the words in vocab in the same order (as specified by the index values, i.e. alphabetically).
For your case it should be:

```python
sorted_vocab = sorted([word for word, key in vocab.items()])
sorted_word2vec = []
for word in sorted_vocab:
    sorted_word2vec.append(word2vec[word])
```
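Assuming your trained model supports word lookup (a plain dict or gensim's `KeyedVectors` both allow `word2vec[word]`), the sorted list can then be stacked into the matrix that Step 3 expects. A minimal sketch with a toy dict standing in for the model:

```python
import numpy as np

# Toy stand-in for a trained model: word -> 300-d vector (assumption)
vocab = {'this': 3, 'is': 1, 'sentence': 2, 'another': 0}
word2vec = {w: np.random.randn(300) for w in vocab}

# Alphabetical order matches the vocab indices TfidfVectorizer assigns
sorted_vocab = sorted(vocab)
sorted_word2vec = [word2vec[w] for w in sorted_vocab]

# Stack into a (len(vocab), 300) matrix usable as word2vec in Step 3
word2vec_matrix = np.vstack(sorted_word2vec)
```

The alphabetical sort works here because scikit-learn assigns vocabulary indices in sorted term order, so row `i` of the stacked matrix lines up with tf-idf column `i`.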