Tags: python, pandas, levenshtein-distance

pandas algorithm slow: for loops and lambda


Summary: I am searching for misspellings across a bunch of data and it is taking forever.

I am iterating through a few CSV files (a million lines in total?), and in each one I am iterating through a JSON sub-value that holds maybe 200 strings to search for. On each pass of the loop over the JSON values, I add a column to the dataframe, then use a lambda function applying the Levenshtein distance algorithm to find misspellings. I then output every row that contains a potential misspelling. Code:

import os

import jellyfish
import pandas as pd

for file in file_list:  # 20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  # 50k lines each-ish
    for v in json_data.values():  # 30-ish json values
        for row in v["json_search_string"]:  # 200-ish substrings
            df_temp = df
            df_temp['jelly'] = row
            df_temp['difference'] = df_temp.apply(
                lambda x: jellyfish.levenshtein_distance(x['search column'], x['jelly']), axis=1)
            df_agg = df_temp[df_temp['difference'] < 3]
            if os.path.isfile(filepath + "levenshtein.csv"):
                with open(filepath + "levenshtein.csv", 'a') as f:
                    df_agg.to_csv(f, header=False)
            else:
                df_agg.to_csv(filepath + "levenshtein.csv")

I've tried the same algorithm before, but to keep it short, instead of iterating through all the JSON values for each CSV I just used a single JSON value, like this:

for file in file_list:  # 20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  # 50k lines each-ish
    for row in data['z']['json_search_string']:
        # levenshtein algorithm above

The above loop took about 100 minutes to run! (Edit: the lambda function takes about 1-3 seconds each time it runs.) And there are about 30 such values in the JSON file. Any ideas on how I can condense the algorithm and make it faster? I've thought that maybe I could take all 200-ish JSON substrings, add each of them as a column to each dataframe, and somehow run a lambda function that searches all the columns at once, but I am not sure how to do that yet. That way I would only iterate over the 20 files 30 times each, as opposed to the however-many-thousand iterations that the third layer of for loop adds. Thoughts?

Notes: Here is an example of what the data might look like.

JSON data:

{
    "A": {
        "email": "blah",
        "name": "Joe Blah",
        "json_search_string": [
            "Company A",
            "Some random company name",
            "Company B",
            "etc",
            "..."
        ]
    }
}

And the CSV columns:

ID, Search Column,            Other Columns
1,  Clompany A,               XYZ
2,  Company A,                XYZ
3,  Some misspelled company,  XYZ
etc

Solution

  • Well, it is really hard to answer a performance-enhancement question in the abstract. Depending on how much effort you want to invest and how much performance you need to gain, here are some suggestions.

    1. Small tweak by rearranging your code logic. Effort: small. Expected enhancement: small. Going through your code, I see that you compare the words from each file (20 of them) against one fixed set of JSON data. Instead of walking through the JSON data again for every file, why not prepare the fixed word list from the JSON data first and reuse it for all the following comparisons? The logic is like:

      # prepare the fixed word list from the JSON data once
      fixed_words = []
      for v in json_data.values():
          fixed_words += v["json_search_string"]
      # loop over each file and compare it against the fixed word list
      for f in file_list:
          # do the comparison and save.
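          # one way this step might be filled in (a sketch; the column name,
          # distance threshold and filepath are taken from the question, and
          # pandas, jellyfish and os are assumed to be imported as above):
          df = pd.read_csv(f, usecols=["search column"])
          for word in fixed_words:
              dist = df["search column"].apply(
                  lambda s: jellyfish.levenshtein_distance(s, word))
              df[dist < 3].to_csv(filepath + "levenshtein.csv", mode='a',
                                  header=not os.path.isfile(filepath + "levenshtein.csv"))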
      
    2. Use multiprocessing. Effort: small. Expected enhancement: medium. Since all of the work is similar and independent, why not try multiprocessing? You could apply multiprocessing per file, or inside dataframe.apply. There are lots of resources on multiprocessing; please have a look. It is easy to implement for your case, for example:
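       Here is a minimal per-file sketch. process_file is a hypothetical helper that bundles the per-file work from your loop above; file_list, json_data and filepath are assumed to be defined as in your code:

      import multiprocessing as mp

      import jellyfish
      import pandas as pd

      def process_file(file):
          # same per-file work as the original loop, but the matches are
          # returned instead of being appended to the output CSV
          df = pd.read_csv(file, usecols=["search column"])
          matches = []
          for v in json_data.values():
              for word in v["json_search_string"]:
                  dist = df["search column"].apply(
                      lambda s: jellyfish.levenshtein_distance(s, word))
                  matches.append(df[dist < 3].assign(jelly=word))
          return pd.concat(matches) if matches else pd.DataFrame()

      if __name__ == "__main__":
          with mp.Pool() as pool:
              # one task per CSV file, spread across the CPU cores
              results = pd.concat(pool.map(process_file, file_list))
          results.to_csv(filepath + "levenshtein.csv")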

    3. Use another language to implement the Levenshtein distance. The bottleneck of your code is the computation of the Levenshtein distance. You used the jellyfish Python package, which is a pure-Python implementation (of course, performance is not good for a large data set). Here are some other options:

      a. An existing Python package with a C/C++ implementation. Effort: small. Expected enhancement: high. Thanks to the comment from @Corley Brigman, editdistance is one option you can use.
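         For example, a sketch of the swap (editdistance.eval is the package's documented entry point; the surrounding lines mirror the loop from the question):

         import editdistance  # pip install editdistance -- C++ implementation

         # only the distance call changes compared with the jellyfish version
         df_temp['difference'] = df_temp.apply(
             lambda x: editdistance.eval(x['search column'], x['jelly']), axis=1)
         df_agg = df_temp[df_temp['difference'] < 3]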

      b. Self-implementation in Cython. Effort: medium. Expected enhancement: medium or high. Check the pandas documentation on enhancing performance.
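         As a starting point, here is a plain-Python version of the standard two-row Levenshtein recurrence (a sketch, not jellyfish's implementation). Moving it into a .pyx file and giving the lengths, indices and cost static C types is usually where the Cython speed-up comes from:

         def levenshtein(a, b):
             # classic two-row dynamic-programming edit distance
             if len(a) < len(b):
                 a, b = b, a
             previous = list(range(len(b) + 1))
             for i, ca in enumerate(a, 1):
                 current = [i]
                 for j, cb in enumerate(b, 1):
                     cost = 0 if ca == cb else 1
                     current.append(min(
                         previous[j] + 1,         # skip a character of a
                         current[j - 1] + 1,      # skip a character of b
                         previous[j - 1] + cost,  # match or substitute
                     ))
                 previous = current
             return previous[-1]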

      c. Self-implementation in C/C++ with a Python wrapper. Effort: high. Expected enhancement: high. Check Wrapping with C/C++.

    You could combine several of my suggestions to gain higher performance. I hope this is helpful.