I'm using BERT to compare text similarity, with the following code:
from bert_embedding import BertEmbedding
import numpy as np
from scipy.spatial.distance import cosine as cosine_distance
bert_embedding = BertEmbedding()
TEXT1 = "As expected from MIT-level of course: it's interesting, challenging, engaging, and for me personally quite enlightening. This course is second part of 5 courses in micromasters program. I was interested in learning about supply chain (purely personal interest, my work touch this topic but not directly) and stumbled upon this course, took it, and man-oh-man...I just couldn't stop learning. Now I'm planning to take the rest of the courses. Average time/effort per week should be around 8-10 hours, but I tried to squeeze everything into just 5 hours since I have very limited free time. You will need 2-3 hours per week for the lecture videos, 2 hours for practice problems, and another 2 hours for the weekly homework. This course offers several topics around demand forecasting and inventory. Basic knowledge of probability and statistics is needed. It will help if you take the prerequisite course: supply chain analytics. But if you've already familiar with basic concept of statistics, you can pick yourself along the way. The lectures are very interesting and engaging, it gives you a lot of knowledge but also throw in some business perspective, so it's very relatable and applicable! The practice problems can help strengthen the understanding of the given knowledge and the homework are very challenging compared to other online-courses I have taken. This course is the best quality I have taken so far, and I have taken several (3-4 MOOCs) from other provider."
TEXT1 = TEXT1.split('.')
sentence2 = ["CHALLENGING COURSE "]
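calculate_avg_vec (used in the loop below) is not shown here; a minimal sketch of such a helper, assuming the bert_embedding package returns one (tokens, token_vectors) pair per input sentence:
def calculate_avg_vec(sentences):
    # bert_embedding takes a list of strings and returns one (tokens, vectors) pair per sentence
    results = bert_embedding(sentences)
    # stack every token vector from every sentence and average them into a single vector
    token_vecs = np.concatenate([np.asarray(vecs) for _, vecs in results], axis=0)
    return token_vecs.mean(axis=0)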
From there, I want to find the best match for sentence2 among the sentences of TEXT1, using cosine similarity:
best_match = {'sentence': '', 'score': 0}
best = 0
for sentence in TEXT1:
    #sentence = sentence.replace('SUPPLY CHAIN','')
    if len(sentence) < 5:
        continue
    avg_vec1 = calculate_avg_vec([sentence])
    avg_vec2 = calculate_avg_vec(sentence2)
    # scipy's cosine is a distance (1 - similarity), so convert it back to a similarity
    score = 1 - cosine_distance(avg_vec1, avg_vec2)
    if score > best:
        best_match['sentence'] = sentence
        best_match['score'] = score
        best = score

best_match
The code works, but since I want to compare sentence2 not only with TEXT1 but with N texts, I need to improve the speed. Is it possible to vectorize this loop, or is there any other way to speed it up?
Cosine similarity is defined as the dot product of two normalized vectors. This is essentially a matrix multiplication, followed by an argmax to get the best index. I'll be using numpy, even though - as mentioned in the comments - you could probably plug it into the BERT model with pytorch or tensorflow.
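As a toy illustration of that scheme (random vectors standing in for the averaged BERT embeddings, with an assumed dimension of 768):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 768))                    # 5 sentence vectors, one per row
X /= np.linalg.norm(X, axis=1, keepdims=True)    # L2-normalize each row
target = rng.normal(size=768)
target /= np.linalg.norm(target)

scores = X @ target            # cosine similarity against all 5 sentences in one product
best_index = scores.argmax()   # index of the most similar sentence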
First, we define a normalized average vector:
def calculate_avg_norm_vec(sentence):
    vs = sentence2vectors(sentence)  # TODO: use BERT embedding
    vm = vs.mean(axis=0)
    return vm / np.linalg.norm(vm)
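The sentence2vectors part is left as a TODO; one possible way to fill it in with the bert_embedding package from the question (a sketch that assumes it returns a (tokens, token_vectors) pair per input sentence, not a tested drop-in):
def sentence2vectors(sentence):
    # bert_embedding expects a list of strings; keep the token vectors of the single result
    tokens, token_vectors = bert_embedding([sentence])[0]
    return np.asarray(token_vectors)  # shape: (num_tokens, embedding_dim)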
Then, we build a matrix of all sentences and their vectors:
# one normalized average vector per sentence, stacked into an (n_sentences, dim) matrix
X = np.stack([calculate_avg_norm_vec(s) for s in all_sentences])
target = calculate_avg_norm_vec(target_sentence)
Finally, we'll need to multiply the target vector with the X matrix, and take the argmax:
# one dot product gives the similarity to every sentence; argmax picks the best one
index_of_sentence = np.dot(X, target).argmax()
You might want to make sure that the axis and indexing fit your data, but this is the overall scheme.
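Applied to the data from the question, the whole loop then collapses into a single matrix product (again a sketch; all_sentences and target_sentence are just the filtered TEXT1 sentences and sentence2[0]):
all_sentences = [s for s in TEXT1 if len(s) >= 5]
X = np.stack([calculate_avg_norm_vec(s) for s in all_sentences])
target = calculate_avg_norm_vec(sentence2[0])

scores = np.dot(X, target)      # cosine similarities for every sentence at once
best = scores.argmax()
best_match = {'sentence': all_sentences[best], 'score': scores[best]}
Adding more texts just means appending their sentences to all_sentences before building X; the comparison itself stays one matrix multiplication.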