Tags: python, pandas, similarity

Check similarity of texts in pandas dataframe


I have a dataframe

Account      Message
454232     Hi, first example 1
321342     Now, second example
412295     hello, a new example 1 in the third row
432325     And now something completely different

I would like to check the similarity between the texts in the Message column. I need to choose one of the messages as the source to test against (for example the first one) and create a new column with the output of the similarity test. If I had two lists, I would do it as follows:

import spacy
spacyModel = spacy.load('en')  # spaCy 2.x shortcut; spaCy 3+ needs a full model name such as 'en_core_web_md'

list1 = ["Hi, first example 1"]
list2 = ["Now, second example","hello, a new example 1 in the third row","And now something completely different"]

list1SpacyDocs = [spacyModel(x) for x in list1]
list2SpacyDocs = [spacyModel(x) for x in list2]

similarityMatrix = [[x.similarity(y) for x in list1SpacyDocs] for y in list2SpacyDocs]

print(similarityMatrix)

But I do not know how to do the same in pandas, creating a new column with similarity results.

Any suggestions?


Solution

  • I am not familiar with spaCy, but to compare one text against every other value in the column I would use .apply(), passing a matching function and setting axis=1 so that the function is called once per row. Here is an example using SequenceMatcher from the standard library's difflib module (I don't have spaCy installed right now).

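    from difflib import SequenceMatcher

    test = 'Hi, first example 1'
    # axis=1 passes each row to the lambda; ratio() returns a score between 0 and 1
    df['r'] = df.apply(lambda x: SequenceMatcher(None, test, x.Message).ratio(), axis=1)
    print(df)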
    

    Result:

       Account                                  Message         r
    0   454232                      Hi, first example 1  1.000000
    1   321342                      Now, second example  0.578947
    2   412295  hello, a new example 1 in the third row  0.413793
    3   432325   And now something completely different  0.245614
    

    So in your case the statement will look the same, with your spaCy similarity call in place of SequenceMatcher.
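
To make the "swap in your own function" idea concrete, here is a minimal sketch that wraps the pattern in a helper taking the similarity function as a parameter. The names add_similarity_column and sequence_ratio are made up for this illustration and are not part of pandas or difflib. It uses Series.apply on the Message column rather than DataFrame.apply with axis=1, which should give the same scores when only one column is involved. The spaCy lines are left as comments because they assume a model with word vectors is installed.

```python
from difflib import SequenceMatcher

import pandas as pd


def sequence_ratio(a: str, b: str) -> float:
    """Character-level similarity between 0 and 1, from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def add_similarity_column(df, source, similarity=sequence_ratio,
                          column="Message", out="r"):
    """Return a copy of df with a new column scoring each text against source.

    Hypothetical helper for illustration only.
    """
    result = df.copy()
    # Series.apply visits one message at a time, so axis=1 is not needed here.
    result[out] = result[column].apply(lambda text: similarity(source, text))
    return result


df = pd.DataFrame({
    "Account": [454232, 321342, 412295, 432325],
    "Message": [
        "Hi, first example 1",
        "Now, second example",
        "hello, a new example 1 in the third row",
        "And now something completely different",
    ],
})

# Use the first message as the source to compare against.
scored = add_similarity_column(df, source=df["Message"].iloc[0])
print(scored)

# Swapping in spaCy (assumption: a model with word vectors, such as
# en_core_web_md, is installed) only changes the similarity argument:
#   nlp = spacy.load("en_core_web_md")
#   spacy_similarity = lambda a, b: nlp(a).similarity(nlp(b))
#   scored = add_similarity_column(df, df["Message"].iloc[0],
#                                  similarity=spacy_similarity)
```

Passing the function as an argument keeps the pandas part identical whichever similarity measure is used; for a large frame with spaCy it would probably be worth parsing the source text once outside the lambda instead of on every row.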