Search code examples
pythonfuzzy-logicfuzzy-comparisonfuzzywuzzy

Fuzzy compare two column


I have a CSV file with search terms (numbers and text) that I would like to compare against a list of other terms (numbers and text) to determine if there are any matches or potential matches. Then I would like to have all results written to a new CSV for manual review. I am using the fuzzywuzzy plug-in to create a 'score' to determine how close of a match there is between the terms. Ideally, I would be able to filter on the ratio.

My current code compares the files rows one to one instead of one row in the first file to all the rows in the second; which is what I need.

How do I perfrom a fuzzy lookup for each row in file1 against all the rows in file2?

from fuzzywuzzy import fuzz
import csv
from itertools import zip_longest

f = open('FuzzyMatch2.csv', 'wt')
writer = csv.writer(f, lineterminator = '\n')


file1_loc = 'LookUp.csv'
file2_loc = 'Prod.csv'

file1 = csv.DictReader(open(file1_loc, 'r'), delimiter=',', quotechar='"')
file2 = csv.DictReader(open(file2_loc, 'r'), delimiter=',', quotechar='"')

for row in file1:
    for l1, l2 in zip_longest(file1, file2):
        if all((l1, l2)):
            partial_ratio = fuzz.token_sort_ratio(str(l1['SearchTerm']), str(l2['Description']))       

        a = [l1,l2,partial_ratio]
        writer.writerow(a)

f.close()

Below is a much cleaner version of the above code, but it still has issues. The code gives an error

IndexError: list index out of range

Any idea how to get the list within range and the code working?

from fuzzywuzzy import process
import csv

save_file = open('FuzzyResults.csv', 'wt')
writer = csv.writer(save_file, lineterminator = '\n')

def parse_csv(path):
    with open(path,'r') as f:
        for row in f:
            row = row.split()
            yield row


if __name__ == "__main__":
    ## Create lookup dictionary by parsing the products csv
    data = {}
    for row in parse_csv('Prod.csv'):
        data[row[0]] = row[1]

    ## For each row in the lookup compute the partial ratio
    for row in parse_csv("LookUp.csv"):

        for found, score in process.extract(row, data, limit=100):
            if score >= 10:
                print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
                Digi_Results = [score, row, found]
                writer.writerow(Digi_Results)


save_file.close()

Solution

  • The code below will work. Make sure that the latest FuzzyWuzzy is installed - last update 28APR15. Otherwise you will receive an 'unicode' error. Hope this helps!

    from fuzzywuzzy import process
    import csv
    
    save_file = open('FuzzyResults3.csv', 'w')
    writer = csv.writer(save_file, lineterminator = '\n')
    
    def parse_csv(path):
    
        with open(path,'r') as f:
            reader = csv.reader(f, delimiter=',')
            for row in reader:
                yield row
    
    
    if __name__ == "__main__":
        ## Create lookup dictionary by parsing the products csv
        data = {}
        for row in parse_csv('File1.csv'):
            data[row[0]] = row[1]
    
        ## For each row in the lookup compute the partial ratio
        for row in parse_csv("File2.csv"):
            #print(process.extract(row,data, limit = 100))
            for found, score, matchrow in process.extract(row, data, limit=100):
                if score >= 60:
                    print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
                    Digi_Results = [score, row, found]
                    writer.writerow(Digi_Results)
    
    
    save_file.close()