Search code examples
pythonpandaslevenshtein-distancedifflib

Efficient way of generating new columns having string similarity distances between two string columns


I have a pandas dataframe with shape (1138812, 14) and columns

['id', 'name', 'latitude', 'longitude', 'address', 'city', 'state',
       'zip', 'country', 'url', 'phone', 'categories', 'point_of_interest',
       'id_2', 'name_2', 'latitude_2', 'longitude_2', 'address_2', 'city_2',
       'state_2', 'zip_2', 'country_2', 'url_2', 'phone_2', 'categories_2',
       'point_of_interest_2', 'match']

I want to create new columns based on the string similarity distances using Levenshtein and difflib difflib.SequenceMatcher().ratio(), Levenshtein.distance(), Levenshtein.jaro_winkler() and LongestCommonSubstring() between each of the columns

['name', 'address', 'city', 'state',
       zip', 'country', 'url', 'phone', 'categories']

and corresponding _2 suffixed columns. In the end it will give me 9*4 = 36 new columns. Right now, I am using df.iterrows() to loop through the dataframe and make column lists. But it is very very time and memory consuming. It takes 3.5 hours to go through the whole dataframe while using full 16GB ram memory. I am trying to find a better method both time and memory wise to get my result. My code:

import Levenshtein
import difflib
from tqdm.notebook import tqdm
columns = ['name', 'address', 'city', 'state',
           'zip', 'country', 'url', 'phone', 'categories']
data_dict = {}
for i in columns:
    data_dict[f"{i}_geshs"] = []
    data_dict[f"{i}_levens"] = []
    data_dict[f"{i}_jaros"] = []
    data_dict[f"{i}_lcss"] = []
for i,row in tqdm(train.iterrows(),total = train.shape[0]):
    for j in columns:
        data_dict[f"{j}_geshs"].append(difflib.SequenceMatcher(None, row[j], row[f"{j}_2"]).ratio())
        data_dict[f"{j}_levens"].append(Levenshtein.distance(row[j], row[f"{j}_2"]))
        data_dict[f"{j}_jaros"].append(Levenshtein.jaro_winkler(row[j], row[f"{j}_2"]))
        data_dict[f"{j}_lcss"].append(LCS(str(row[j]), str(row[f"{j}_2"])))
data = pd.DataFrame(data_dict)
train = pd.concat(train, data, axis = 1)

Solution

  • Starting with a dataframe that looks like:

    first_name address city state zip url phone categories first_name_2 address_2 city_2 state_2 zip_2 url_2 phone_2 categories_2
    Rori 680 Buell Crossing Dallas Texas 75277 url_shortened 214-533-2179 Granite Surfaces Agustin 7 Schiller Crossing Lubbock Texas 79410 url_shortened 806-729-7419 Roofing (Metal)
    Dmitri 05 Coolidge Way Charleston West Virginia 25356 url_shortened 304-906-6384 Structural and Misc Steel (Fabrication) Kearney 0547 Clemons Plaza Peoria Illinois 61651 url_shortened 309-326-4252 Framing (Steel)

    And is of shape 1024000 rows × 16 columns

    import difflib
    import Levenshtein
    import numpy as np
    import pandas as pd
    from pandarallel import pandarallel
    pandarallel.initialize(nb_workers=8) # Customize based on # of cores, or leave blank to use all
    
    def dists(x, y):
        matcher = difflib.SequenceMatcher(None, x, y)
        geshs = matcher.ratio()
        levens = Levenshtein.distance(x, y)
        jaros = Levenshtein.jaro_winkler(x, y)
        lcss = matcher.find_longest_match(0, len(x), len(y)) # I wasn't sure how you'd done this one.
        return [geshs, levens, jaros, lcss]
    
    
    df = pd.read_csv('MOCK_DATA.csv')
    df = df.astype(str) # force all fields to strings.
    
    cols = df.columns
    cols = np.array_split(cols, 2) # assumes there's a matching `_2` column for every column.
    for x, y in zip(*cols):
        (df[x + '_geshs'], 
         df[x + '_levens'], 
         df[x + '_jaros'], 
         df[x + '_lcss']) = df.parallel_apply(lambda z: dists(z[x], z[y]), axis=1, result_type='expand')
        # Replace parallel_apply with apply to run non-parallel.
    

    (In addition to keeping the original columns) I get these columns in 3 minutes, without parallezation, it would still probably only take ~20-30 minutes. Peak memory usage from python was only about 3GB, and would be much lower without parallezation.

    first_name_geshs first_name_levens first_name_jaros first_name_lcss address_geshs address_levens address_jaros address_lcss city_geshs city_levens city_jaros city_lcss state_geshs state_levens state_jaros state_lcss zip_geshs zip_levens zip_jaros zip_lcss url_geshs url_levens url_jaros url_lcss phone_geshs phone_levens phone_jaros phone_lcss categories_geshs categories_levens categories_jaros categories_lcss
    0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
    0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3