Using difflib to compare a string with a row in a dataframe


I have a string

email = '[email protected]' 

and a DF

df = DataFrame({'id': [1, 2, 3], 'email_address': ['[email protected]', '[email protected]', '[email protected]']})

I want to add a column named 'score' and score each email_address against my email string. I tried:

  df['score']  = difflib.SequenceMatcher(None, df['email_address'], email).ratio()

but every row gets a score of 0.0, even when I set email to an exact match for one of the addresses in the df.

For context: people are signing up for multiple accounts, so we want to search for an email and see whether any similar emails already exist.

I am also open to a different approach for this issue. Thanks!


Solution

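  • The original call most likely returns 0.0 because SequenceMatcher treats the whole Series as one sequence whose elements are complete email strings, and none of those equals a single character of email; the resulting scalar is then assigned to every row. To score each row separately, you could use pandas.Series.apply: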

    In [1]: import pandas as pd
       ...: from difflib import SequenceMatcher
    In [2]: df = pd.DataFrame({'id': [1, 2, 3], 'email_address': ['[email protected]', '[email protected]', '[email protected]']})
       ...: df
    Out[2]: 
       id     email_address
    0   1   [email protected]
    1   2   [email protected]
    2   3  [email protected]
    In [3]: email = '[email protected]'
    In [4]: df['score'] = df['email_address'].apply(lambda e: SequenceMatcher(None, email, e).ratio())
       ...: df
    Out[4]: 
       id     email_address     score
    0   1   [email protected]  0.785714
    1   2   [email protected]  0.857143
    2   3  [email protected]  0.620690
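
Since the goal is to flag likely duplicate sign-ups, the scores can also be filtered against a cutoff. The following is a minimal sketch using only the standard library, not a definitive implementation: the helper names similarity and find_similar, the 0.8 threshold, the lowercasing step, and the example.com addresses are all assumptions made for illustration and do not come from the question.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return the SequenceMatcher ratio for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_similar(query, candidates, threshold=0.8):
    """Return (candidate, score) pairs scoring at or above threshold, best first."""
    scored = [(c, similarity(query, c)) for c in candidates]
    matches = [(c, s) for c, s in scored if s >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical addresses, made up for illustration only.
existing = ["john.smith@example.com", "jon.smith@example.com", "alice@example.org"]

# Assumed cutoff of 0.8 (the default above); tune it against real data.
matches = find_similar("john.smith@example.com", existing)
for address, score in matches:
    print(f"{address}  {score:.3f}")
```

With the DataFrame from the answer, the same helper could be applied as df['email_address'].apply(lambda e: similarity(email, e)), and rows kept with df[df['score'] >= 0.8]. The threshold is a judgment call: a lower value catches more near-duplicates but also flags more unrelated addresses, especially ones that share a long domain.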