
Checking and Removing NoneTypes for Jaro String Similarity


I'm trying to compute the similarity between two strings (using Jaro). Each string resides in a separate column of my dataframe.

String 1 = df['name_one'] 

String 2 = df['name_two']

When I try to run my string similarity logic:

from pyjarowinkler import distance
df['distance'] = df.apply(lambda d: distance.get_jaro_distance(str(d['name_one']),str(d['name_two']),winkler=True,scaling=0.1), axis=1)

I get the following error:

 **error: JaroDistanceException: Cannot calculate distance from NoneType (str, str)**

Great, so there is a NoneType in the columns, and the first thing I do is check for it:

maskone = df['name_one'] == None
df[maskone]

masktwo = df['name_two'] == None
df[masktwo]

This finds no None types. I'm scratching my head at this point, but I proceed to clean the two columns anyway.

df['name_one'] = df['name_one'].fillna('').astype(str)
df['name_two'] = df['name_two'].fillna('').astype(str) 

And yet, I'm still getting:

error: JaroDistanceException: Cannot calculate distance from NoneType (str, str)

Am I removing NoneTypes correctly?


Solution

  • Problem

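    The issue isn't only NoneTypes: empty strings also raise this exception, as you can see in the implementation of distance.get_jaro_distance. Because your call wraps both values in str(), the function always receives strings (hence the "(str, str)" in the message), and fillna('') turns every missing value into an empty string, which fails the same check: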

    if not first or not second:
        raise JaroDistanceException("Cannot calculate distance from NoneType ({0}, {1})".format(
            first.__class__.__name__,
            second.__class__.__name__))
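
To see which rows would trip that guard, the sketch below checks for both missing values and empty strings. The sample data is hypothetical; only the column names come from the question. It also shows why the `== None` mask in the question came back empty: pandas evaluates that comparison element by element and reports False for missing values, so `isna()` is the usual way to detect them.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "name_one": ["Martha", None, "", "Dwayne"],
    "name_two": ["Marhta", "Jones", "Smith", ""],
})

# Element-wise comparison with None is False for every row,
# so this mask never selects anything.
eq_none = df["name_one"] == None  # noqa: E711

# isna() detects missing values (None or NaN).
missing = df["name_one"].isna() | df["name_two"].isna()

# Empty strings are not missing values, but they are falsy, so they
# fail the `not first or not second` check as well.
empty = df["name_one"].eq("") | df["name_two"].eq("")

problem_rows = df[missing | empty]
print(problem_rows)
```

Any row listed in `problem_rows` needs to be handled by one of the options below before calling get_jaro_distance.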
    

    Option 1

    Try replacing your NoneTypes and/or empty strings with 'NA', or filter them out of your dataset.
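
A minimal sketch of both variants of Option 1, using hypothetical sample data with the question's column names. The names `bad`, `filtered`, and `replaced` are illustrative only.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "name_one": ["Martha", None, "", "Dwayne"],
    "name_two": ["Marhta", "Jones", "Smith", ""],
})
cols = ["name_one", "name_two"]

# True wherever a cell is missing or an empty string.
bad = df[cols].isna() | df[cols].eq("")

# Variant A: drop every row in which either name is unusable.
filtered = df[~bad.any(axis=1)]

# Variant B: keep all rows and substitute the placeholder 'NA'.
replaced = df.copy()
replaced[cols] = df[cols].mask(bad, "NA")

print(filtered)
print(replaced)
```

Keep in mind that a placeholder such as 'NA' is scored like any other string, so two rows that both hold the placeholder would compare as identical. Filtering avoids that side effect.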

    Option 2

    Use a flag value as the distance for rows that would raise this exception. The example below uses 999:

    from pyjarowinkler import distance
    
    df['distance'] = df.apply(lambda d: 999 if not str(d['name_one']) or not str(d['name_two']) else distance.get_jaro_distance(str(d['name_one']),str(d['name_two']),winkler=True,scaling=0.1), axis=1)
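
The same idea can be pulled into a small helper, sketched here under a few assumptions. `safe_similarity` and `stand_in_scorer` are hypothetical names, not part of pyjarowinkler, and the scorer is passed in as a parameter so the example runs without the library installed. The helper checks for missing values before converting to str, because str(None) is the non-empty string 'None' and would otherwise be scored as if it were a real name. It defaults to NaN as the flag, which pandas treats as missing; pass flag=999 to match the line above.

```python
import pandas as pd


def safe_similarity(first, second, scorer, flag=float("nan")):
    """Return scorer(first, second), or `flag` if either value is missing or empty."""
    for value in (first, second):
        # Check for None/NaN before str(), since str(None) == 'None'.
        if pd.isna(value) or not str(value):
            return flag
    return scorer(str(first), str(second))


def stand_in_scorer(a, b):
    # Hypothetical placeholder so the sketch is runnable on its own:
    # 1.0 for identical strings, otherwise 0.0.
    return 1.0 if a == b else 0.0


# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "name_one": ["Martha", None, "", "Dwayne"],
    "name_two": ["Marhta", "Jones", "Smith", ""],
})

df["distance"] = df.apply(
    lambda d: safe_similarity(d["name_one"], d["name_two"], stand_in_scorer),
    axis=1,
)
print(df)
```

To use it with the call from the question, pass a scorer such as `lambda a, b: distance.get_jaro_distance(a, b, winkler=True, scaling=0.1)` in place of `stand_in_scorer`.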