Tags: python, pandas, numpy, out-of-memory, python-itertools

How to solve a memory issue in Python when creating a large dataframe


Context:

I have a list of ~80k words that may contain spelling mistakes

(e.g., "apple" vs "applee" vs "  apple" vs "     aplee   ").

I'm planning to create a dataframe grid by picking two words at a time and then applying a fuzzy score function to compare their similarity. I'm also applying standard text cleaning, such as trimming, removing special characters and double spaces, and then taking the unique list to check for similarity.
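
For reference, the cleaning step looks roughly like this (a simplified sketch, assuming plain regex-based normalization):

import re

def clean_word(word):
    #trim, lowercase, strip special characters, collapse repeated spaces
    word = word.strip().lower()
    word = re.sub(r'[^a-z0-9 ]', '', word)
    word = re.sub(r'\s+', ' ', word)
    return word

raw_words = ['apple','applee','  apple','     aplee   ']
my_unique_list = sorted({clean_word(w) for w in raw_words})
print(my_unique_list)   #['aplee', 'apple', 'applee']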

Approach:

I'm using the itertools.combinations function to create the dataframe grid

#Sample python code

import itertools
import pandas as pd

#Step1: build the grid of all pairwise combinations
my_unique_list = ['apple','applee','aplee']
data_grid = pd.DataFrame(itertools.combinations(my_unique_list,2),columns = ['name1','name2'])

print(data_grid)


    name1   name2
0   apple   applee
1   apple   aplee
2   applee  aplee

I have defined a function that calculates the fuzzy scores

from fuzzywuzzy import fuzz

def fuzzy_score_func(row):
    #partial_ratio scores the best-matching substring; ratio compares the full strings
    fuzzywuzzy_partial_ratio = fuzz.partial_ratio(row['name1'],row['name2'])
    thefuzz_ratio = fuzz.ratio(row['name1'],row['name2'])

    return fuzzywuzzy_partial_ratio, thefuzz_ratio

and use the apply function to get the final scores

#Step2:

data_grid[['partial_ratio','ratio']] = data_grid.apply(fuzzy_score_func,axis = 1, result_type='expand')

print(data_grid)

    name1   name2   partial_ratio   ratio
0   apple   applee  100             91
1   apple   aplee   80              80
2   applee  aplee   80              91
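
As a sanity check, pandas can report how much memory the grid actually occupies (deep=True counts the string payloads, not just the object pointers), which helps extrapolate before scaling up:

#rough memory check on the small grid
bytes_used = data_grid.memory_usage(deep=True).sum()
print(f"{bytes_used/1e6:.2f} MB for {len(data_grid)} rows")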

This works fine when the list is ~8k, where checking all combinations produces ~32Mn rows in the dataframe.

But when I try to expand the list to 80k, I get a memory error in step 1 while initializing the dataframe with all possible combinations. That makes sense given the dataframe would have ~3.2Bn rows (80k choose 2):

File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\core\frame.py:738, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    736         data = np.asarray(data)
    737     else:
--> 738         data = list(data)
    739 if len(data) > 0:
    740     if is_dataclass(data[0]):

MemoryError: 
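
A quick back-of-the-envelope check confirms the scale (the ~100 bytes per row below is just a rough guess for two short Python strings plus object overhead):

import math

print(f"{math.comb(8_000, 2):,}")    #31,996,000 pairs
print(f"{math.comb(80_000, 2):,}")   #3,199,960,000 pairs
#even at ~100 bytes per row, ~3.2Bn rows would need ~320 GB, ten times my 32 GB of RAM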

Any suggestions on how to tackle this memory issue, or a better way to implement my problem statement? I tried exploring multiprocessing, nested loops, etc., but with no major success.

I'm using an Intel Windows laptop:

Processor: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00 GHz
Installed RAM: 32.0 GB (31.7 GB usable)
System type: 64-bit operating system, x64-based processor

Solution

  • I might try starting with this code, based on just using itertools without pandas: it scores each pair as it is generated and writes only the close matches to disk, so the full grid never has to exist in memory.

    import csv
    import itertools
    import fuzzywuzzy.fuzz
    
    MIN_RATIO = 90   ## keep a pair only if at least one score reaches this threshold
    
    ## ----------------------
    ## the result of cleaning and filtering your input data...
    ## ----------------------
    my_unique_list = ['apple','applee','aplee']
    ## ----------------------
    
    ## ----------------------
    ## Create a result file of acceptably close matches 
    ## ----------------------
    with open("good_matches.csv", "w", encoding="utf-8", newline="") as file_out:
        writer = csv.writer(file_out)
        writer.writerow(["name1", "name2", "partial_ratio", "ratio"])
        for index, (word1, word2) in enumerate(itertools.combinations(my_unique_list, 2)):
            if index % 1000 == 0:
                print(f"combinations processed: {index}", end="\r", flush=True)
    
            partial_ratio = fuzzywuzzy.fuzz.partial_ratio(word1, word2)
            ratio = fuzzywuzzy.fuzz.ratio(word1, word2)
            if max(partial_ratio, ratio) >= MIN_RATIO:
                writer.writerow([word1, word2, partial_ratio, ratio])
        print()
        print(f"Total combinations processed: {index+1}")
    ## ----------------------
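
    With the three-word sample list and MIN_RATIO = 90, good_matches.csv ends up with just the two close pairs (matching the scores shown in the question). Because each qualifying row is written out as soon as it is scored, memory use stays flat no matter how many combinations are checked:

    name1,name2,partial_ratio,ratio
    apple,applee,100,91
    applee,aplee,80,91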
    

    While I'm not a multiprocessing expert, this might work. You might want to test it a bit on a smaller subset:

    import csv
    import functools
    import itertools
    import multiprocessing
    
    import fuzzywuzzy.fuzz
    
    MIN_RATIO = 90
    
    def get_ratios(pair, queue):
        partial_ratio = fuzzywuzzy.fuzz.partial_ratio(*pair)
        ratio = fuzzywuzzy.fuzz.ratio(*pair)
        if max(partial_ratio, ratio) >= MIN_RATIO:
            queue.put(list(pair) + [partial_ratio, ratio])
    
    def main(my_unique_list):
        with multiprocessing.Manager() as manager:
            queue = manager.Queue()
    
            with multiprocessing.Pool(processes=8) as pool:
                ## imap_unordered consumes the combinations lazily; pool.map
                ## would first materialize the whole iterable as a list and
                ## recreate the original memory blow-up
                work = functools.partial(get_ratios, queue=queue)
                for _ in pool.imap_unordered(work, itertools.combinations(my_unique_list, 2), chunksize=1000):
                    pass
                pool.close()
                pool.join()
    
            with open("good_matches.csv", "w", encoding="utf-8", newline="") as file_out:
                writer = csv.writer(file_out)
                writer.writerow(["name1", "name2", "partial_ratio", "ratio"])
                while not queue.empty():
                    item = queue.get()
                    writer.writerow(item)
                    #print(item)
    
    if __name__ == "__main__":
        ## ----------------------
        ## the result of cleaning and filtering your input data...
        ## ----------------------
        my_unique_list = ['apple','applee','aplee']
        ## ----------------------
    
        main(my_unique_list)
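
    Either way, the filtered good_matches.csv should be small enough to pull back into pandas for review:

    import pandas as pd

    matches = pd.read_csv("good_matches.csv")
    print(matches.sort_values("ratio", ascending=False))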