Tags: python, pandas, numpy, out-of-memory, python-itertools

How to solve a memory issue in Python when creating a large dataframe


Context:

I have a list of ~80k words that may contain spelling mistakes

(e.g., "apple" vs "applee" vs "  apple" vs "     aplee   ").

I'm planning to create a dataframe grid by picking two words at a time and then applying a fuzzy score function to compare their similarity. I'm also applying standard text cleaning, such as trimming, removing special characters and double spaces, and then taking the unique list to check for similarity.
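
For reference, the cleaning step looks roughly like this (a simplified sketch, assuming plain regex-based normalization):

import re

def clean_word(word):
    #trim, lowercase, strip special characters, collapse repeated spaces
    word = word.strip().lower()
    word = re.sub(r'[^a-z0-9 ]', '', word)
    word = re.sub(r'\s+', ' ', word)
    return word

raw_words = ['apple','applee','  apple','     aplee   ']
my_unique_list = sorted({clean_word(w) for w in raw_words})
print(my_unique_list)   #['aplee', 'apple', 'applee']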

Approach:

I'm using the itertools.combinations function to create the dataframe grid

#Sample python code

import itertools
import pandas as pd

#Step1: build the grid of all pairwise combinations
my_unique_list = ['apple','applee','aplee']
data_grid = pd.DataFrame(itertools.combinations(my_unique_list,2),columns = ['name1','name2'])

print(data_grid)


    name1   name2
0   apple   applee
1   apple   aplee
2   applee  aplee

I have defined a function that calculates the fuzzy scores

from fuzzywuzzy import fuzz

def fuzzy_score_func(row):
    #partial_ratio scores the best-matching substring; ratio compares the full strings
    fuzzywuzzy_partial_ratio = fuzz.partial_ratio(row['name1'],row['name2'])
    thefuzz_ratio = fuzz.ratio(row['name1'],row['name2'])

    return fuzzywuzzy_partial_ratio, thefuzz_ratio

and use the apply function to get the final scores

#Step2:

data_grid[['partial_ratio','ratio']] = data_grid.apply(fuzzy_score_func,axis = 1, result_type='expand')

print(data_grid)

    name1   name2   partial_ratio   ratio
0   apple   applee  100             91
1   apple   aplee   80              80
2   applee  aplee   80              91
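
As a sanity check, pandas can report how much memory the grid actually occupies (deep=True counts the string payloads, not just the object pointers), which helps extrapolate before scaling up:

#rough memory check on the small grid
bytes_used = data_grid.memory_usage(deep=True).sum()
print(f"{bytes_used/1e6:.2f} MB for {len(data_grid)} rows")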

This works fine when the list is ~8k, where checking all combinations produces ~32Mn rows in the dataframe.

But when I try to expand the list to 80k, I get a memory error in step 1 while initializing the dataframe with all possible combinations. That makes sense given the dataframe would have ~3.2Bn rows (80k choose 2):

File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\core\frame.py:738, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    736         data = np.asarray(data)
    737     else:
--> 738         data = list(data)
    739 if len(data) > 0:
    740     if is_dataclass(data[0]):

MemoryError: 
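
A quick back-of-the-envelope check confirms the scale (the ~100 bytes per row below is just a rough guess for two short Python strings plus object overhead):

import math

print(f"{math.comb(8_000, 2):,}")    #31,996,000 pairs
print(f"{math.comb(80_000, 2):,}")   #3,199,960,000 pairs
#even at ~100 bytes per row, ~3.2Bn rows would need ~320 GB, ten times my 32 GB of RAM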

Any suggestions on how to tackle this memory issue, or a better way to implement my problem statement? I tried exploring multiprocessing, nested loops, etc., but with no major success.

I'm using an Intel Windows laptop:

Processor: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00 GHz
Installed RAM: 32.0 GB (31.7 GB usable)
System type: 64-bit operating system, x64-based processor

Solution

  • I might try starting with this code, based on just using itertools without pandas: it scores each pair as it is generated and writes only the close matches to disk, so the full grid never has to exist in memory.

    import csv
    import itertools
    import fuzzywuzzy.fuzz
    
    MIN_RATIO = 90   ## keep a pair only if at least one score reaches this threshold
    
    ## ----------------------
    ## the result of cleaning and filtering your input data...
    ## ----------------------
    my_unique_list = ['apple','applee','aplee']
    ## ----------------------
    
    ## ----------------------
    ## Create a result file of acceptably close matches 
    ## ----------------------
    with open("good_matches.csv", "w", encoding="utf-8", newline="") as file_out:
        writer = csv.writer(file_out)
        writer.writerow(["name1", "name2", "partial_ratio", "ratio"])
        for index, (word1, word2) in enumerate(itertools.combinations(my_unique_list, 2)):
            if index % 1000 == 0:
                print(f"combinations processed: {index}", end="\r", flush=True)
    
            partial_ratio = fuzzywuzzy.fuzz.partial_ratio(word1, word2)
            ratio = fuzzywuzzy.fuzz.ratio(word1, word2)
            if max(partial_ratio, ratio) >= MIN_RATIO:
                writer.writerow([word1, word2, partial_ratio, ratio])
        print()
        print(f"Total combinations processed: {index+1}")
    ## ----------------------
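
    With the three-word sample list and MIN_RATIO = 90, good_matches.csv ends up with just the two close pairs (matching the scores shown in the question). Because each qualifying row is written out as soon as it is scored, memory use stays flat no matter how many combinations are checked:

    name1,name2,partial_ratio,ratio
    apple,applee,100,91
    applee,aplee,80,91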
    

    While I'm not a multiprocessing expert, this might work. You might want to test it a bit on a smaller subset:

    import csv
    import functools
    import itertools
    import multiprocessing
    
    import fuzzywuzzy.fuzz
    
    MIN_RATIO = 90
    
    def get_ratios(pair, queue):
        partial_ratio = fuzzywuzzy.fuzz.partial_ratio(*pair)
        ratio = fuzzywuzzy.fuzz.ratio(*pair)
        if max(partial_ratio, ratio) >= MIN_RATIO:
            queue.put(list(pair) + [partial_ratio, ratio])
    
    def main(my_unique_list):
        with multiprocessing.Manager() as manager:
            queue = manager.Queue()
    
            with multiprocessing.Pool(processes=8) as pool:
                ## imap_unordered consumes the combinations lazily; pool.map
                ## would first materialize the whole iterable as a list and
                ## recreate the original memory blow-up
                work = functools.partial(get_ratios, queue=queue)
                for _ in pool.imap_unordered(work, itertools.combinations(my_unique_list, 2), chunksize=1000):
                    pass
                pool.close()
                pool.join()
    
            with open("good_matches.csv", "w", encoding="utf-8", newline="") as file_out:
                writer = csv.writer(file_out)
                writer.writerow(["name1", "name2", "partial_ratio", "ratio"])
                while not queue.empty():
                    item = queue.get()
                    writer.writerow(item)
                    #print(item)
    
    if __name__ == "__main__":
        ## ----------------------
        ## the result of cleaning and filtering your input data...
        ## ----------------------
        my_unique_list = ['apple','applee','aplee']
        ## ----------------------
    
        main(my_unique_list)
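
    Either way, the filtered good_matches.csv should be small enough to pull back into pandas for review:

    import pandas as pd

    matches = pd.read_csv("good_matches.csv")
    print(matches.sort_values("ratio", ascending=False))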