I am building an NLP project that compares sentence similarities between two different dataframes. Here is an example of the dataframes:
df = pd.DataFrame({'Element Detail':['Too many competitors in market', 'Highly skilled employees']})
df1 = pd.DataFrame({'Element Details':['Our workers have a lot of talent',
'this too is a sentence',
'this is very different',
'another sentence is this',
'not much of anything']
})
I currently have the code set up in a way that it compares the first cell in df with all the cells in df1. It then picks the highest cosine similarity score and puts that in a separate dataframe with the following code:
import pandas as pd
import numpy as np
model_name = 'bert-base-nli-mean-tokens'
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(model_name)
sentence_vecs = model.encode(df['Element Detail'])
sentence_vecs1 = model.encode(df1['Element Details'])
from sklearn.metrics.pairwise import cosine_similarity
new = cosine_similarity(
[sentence_vecs[0]],
sentence_vecs1[0:]
)
d = pd.DataFrame(new)
T = d.T
T.insert(0, 'New_ID', range(1, 1 + len(T)))  # insert() works in place and returns None
Tnew = (T.add_prefix('X'))
Final = (Tnew[Tnew.X0 == Tnew.X0.max()])
The end product is this dataframe:
XNew_ID X0
0 1 0.615005
How can I write a piece of code that loops through the rest of the elements in df and writes them to the 'Final' dataframe in the same manner?
cosine_similarity accepts two arrays of vectors and returns the full pairwise similarity matrix, so you can pass both complete embedding arrays as arguments and extract the maximum similarities afterward.
import pandas as pd
import numpy as np
model_name = 'bert-base-nli-mean-tokens'
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(model_name)
sentence_vecs = model.encode(df['Element Detail'])
sentence_vecs1 = model.encode(df1['Element Details'])
from sklearn.metrics.pairwise import cosine_similarity
new = cosine_similarity(
sentence_vecs,
sentence_vecs1
)
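# Highest similarity for each sentence in df (one value per row of `new`)
max_similarities = np.amax(new, axis=1)
d = pd.DataFrame(new)
T = d.T
T.insert(0, 'New_ID', range(1, 1 + len(T)))  # insert() works in place and returns None
Tnew = T.add_prefix('X')
# Keeps the df1 row that best matches the first sentence in df (column X0)
Final = Tnew[Tnew.X0 == Tnew.X0.max()]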
Final
output:
XNew_ID X0 X1
0 1 0.615005 0.868932
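If you want one row per sentence in df (each with its own best match from df1) rather than a filter on X0 only, one option is to take the argmax along each row of the similarity matrix. The following is only a sketch: the helper name best_matches and the column names df_row and similarity are made up for illustration, the toy 2-D vectors stand in for the output of model.encode, and it assumes no embedding is an all-zero vector. It computes cosine similarity with plain NumPy so it runs without downloading a model.

```python
import numpy as np
import pandas as pd


def best_matches(vecs_a, vecs_b):
    """For each row of vecs_a, find the most similar row of vecs_b (cosine)."""
    a = np.asarray(vecs_a, dtype=float)
    b = np.asarray(vecs_b, dtype=float)
    # Normalise each row so that a dot product equals the cosine similarity.
    # Assumption: no row is all zeros (that would divide by zero).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T                     # shape: (len(a), len(b))
    best = sims.argmax(axis=1)         # position of the best match in b, per row of a
    rows = np.arange(len(a))
    return pd.DataFrame({
        'df_row': rows,                # hypothetical column: row position in the first frame
        'New_ID': best + 1,            # 1-based, like New_ID in the question
        'similarity': sims[rows, best],
    })


# Toy vectors standing in for sentence_vecs and sentence_vecs1.
toy_a = [[1.0, 0.0], [0.0, 2.0]]
toy_b = [[0.0, 1.0], [3.0, 0.0], [1.0, 1.0]]
Final = best_matches(toy_a, toy_b)
print(Final)
```

With the real embeddings you would call best_matches(sentence_vecs, sentence_vecs1) instead of using the toy vectors. The same result can be obtained from the matrix returned by sklearn's cosine_similarity by applying argmax(axis=1) to it.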