python pandas nlp bert-language-model cosine-similarity

Looping cosine similarity formula from one dataframe to another dataframe using pandas & BERT


I am building an NLP project that compares sentence similarities between two different dataframes. Here is an example of the dataframes:

df = pd.DataFrame({'Element Detail':['Too many competitors in market', 'Highly skilled employees']})
df1 = pd.DataFrame({'Element Details':['Our workers have a lot of talent', 
                                      'this too is a sentence',
                                      'this is very different',
                                      'another sentence is this',
                                      'not much of anything']
                    })

My code currently compares the first cell in df with every cell in df1. It then picks the highest cosine similarity score and puts it in a separate dataframe, using the following code:

import pandas as pd
import numpy as np

model_name = 'bert-base-nli-mean-tokens'
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(model_name)
sentence_vecs = model.encode(df['Element Detail'])
sentence_vecs1 = model.encode(df1['Element Details'])

from sklearn.metrics.pairwise import cosine_similarity

new = cosine_similarity(
    [sentence_vecs[0]],
    sentence_vecs1[0:]
)

d = pd.DataFrame(new)
T =pd.DataFrame.transpose(d)
df_new = T.insert(0, 'New_ID', range(1, 1 + len(T)))
Tnew = (T.add_prefix('X'))
Final = (Tnew[Tnew.X0 == Tnew.X0.max()])

The end product is this dataframe:

    XNew_ID     X0  
0   1           0.615005 

How can I write a piece of code that loops through the rest of the elements in df and writes them to the 'Final' dataframe in the same manner?


Solution

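  • cosine_similarity accepts two whole arrays of embeddings, so no loop is needed. Pass both embedding lists as arguments and it returns a matrix with one row per sentence in df and one column per sentence in df1. You can then extract the maximum similarities afterward.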

    import pandas as pd
    import numpy as np
    
    model_name = 'bert-base-nli-mean-tokens'
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    # Use the dataframe names from the question: df and df1
    sentence_vecs = model.encode(df['Element Detail'])
    sentence_vecs1 = model.encode(df1['Element Details'])
    
    from sklearn.metrics.pairwise import cosine_similarity
    
    new = cosine_similarity(
        sentence_vecs,
        sentence_vecs1
    )
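    # Highest similarity for each sentence in df (one value per row of `new`)
    max_similarities = np.amax(new, axis=1)
    d = pd.DataFrame(new)
    T = d.transpose()  # rows: sentences in df1; columns: sentences in df
    T.insert(0, 'New_ID', range(1, 1 + len(T)))  # insert() works in place and returns None
    Tnew = T.add_prefix('X')
    # Keep the df1 row that best matches the first sentence in df
    Final = Tnew[Tnew.X0 == Tnew.X0.max()]
    Final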
    

    output:

        XNew_ID     X0          X1
    0   1           0.615005    0.868932
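
If you want one result row per sentence in df (its best match in df1 and the score), taking argmax and max along axis 1 of the similarity matrix is one way to get it. The sketch below is only an illustration: the small vectors are made-up stand-ins for the output of model.encode, the cosine_matrix helper is a hypothetical NumPy-only substitute for sklearn's cosine_similarity, and the column names Query_ID, Best_Match_ID and Score are my own choice.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for model.encode(...) output: one row per sentence.
query_vecs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 1.0]])      # plays the role of sentence_vecs (df)
candidate_vecs = np.array([[0.0, 1.0, 1.0],
                           [1.0, 1.0, 0.0],
                           [1.0, 0.0, 0.0]])  # plays the role of sentence_vecs1 (df1)


def cosine_matrix(a, b):
    """Pairwise cosine similarity; result[i, j] compares a[i] with b[j]."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


sims = cosine_matrix(query_vecs, candidate_vecs)  # shape (2, 3)

# For every query row, find the position and value of its best match.
best_idx = sims.argmax(axis=1)
best_score = sims.max(axis=1)

best_matches = pd.DataFrame({
    'Query_ID': np.arange(1, len(sims) + 1),  # 1-based, like New_ID in the question
    'Best_Match_ID': best_idx + 1,            # 1-based position in the second frame
    'Score': best_score,
})
print(best_matches)
```

With the real embeddings, sims would be the matrix returned by cosine_similarity(sentence_vecs, sentence_vecs1), and the rest should carry over unchanged.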