
How can I use nested dictionaries to make my fuzzy matching code more efficient?


I have two lists of dictionaries (from csv.DictReader) that look something like this:

Name1

 [{'City': 'San Francisco', 'Name': 'Suzan', 'id_number': '1567', 'Street': 'Pearl'},
  {'City': 'Boston', 'Name': 'Fred', 'id_number': '1568', 'Street': 'Pine'},
  {'City': 'Chicago', 'Name': 'Lizzy', 'id_number': '1569', 'Street': 'Spruce'},
  {'City': 'Denver', 'Name': 'Bob', 'id_number': '1570', 'Street': 'Spruce'},
  {'City': 'Chicago', 'Name': 'Bob', 'id_number': '1571', 'Street': 'Spruce'},
  {'City': 'Boston', 'Name': 'Bob', 'id_number': '1572', 'Street': 'Canyon'},
  {'City': 'Boulder', 'Name': 'Diana', 'id_number': '1573', 'Street': 'Violet'},
  {'City': 'Detroit', 'Name': 'Bill', 'id_number': '1574', 'Street': 'Grape'}]

and

Name2

[{'City': 'San Francisco', 'Name': 'Szn', 'id_number': '1567', 'Street': 'Pearl'},
 {'City': 'Boston', 'Name': 'Frd', 'id_number': '1578', 'Street': 'Pine'},
 {'City': 'Chicago', 'Name': 'Lizy', 'id_number': '1579', 'Street': 'Spruce'},
 {'City': 'Denver', 'Name': 'Bobby', 'id_number': '1580', 'Street': 'Spruce'},
 {'City': 'Chicago', 'Name': 'Bob', 'id_number': '1580', 'Street': 'Spruce'},
 {'City': 'Boston', 'Name': 'Bob', 'id_number': '1580', 'Street': 'Walnut'}]

If you notice, the names in the second chunk are spelled differently than in the first, but a few are nearly the same. I'd like to use fuzzy string matching to match these up, and to narrow the comparison so I'm only comparing names that are in the same city and on the same street. Currently I'm running a for loop that looks like this:

from fuzzywuzzy import fuzz
from  fuzzywuzzy import process
from itertools import izip_longest
import csv

name1_file = 'name1_file.csv'
name2_file = 'name2_file.csv'

name1 = csv.DictReader(open(name1_file, 'rb'), delimiter=',', quotechar='"')


score_75_plus = []
name1_name =[]
name2_name =[]
name1_city = []
name2_city = []
name1_street = []
name2_street = []
name1_id = []
name2_id = []


for line in name1:
    name2 = csv.DictReader(open(name2_file, 'rb'), delimiter=',', quotechar='"')
    for line2 in name2:
        if line['City'] == line2['City'] and line['Street'] == line2['Street']:
            partial_ratio = fuzz.partial_ratio(line['Name'], line2['Name'])
            if partial_ratio > 75:
                name1_name.append(line['Name'])
                name1_city.append(line['City'])
                name1_street.append(line['Street'])
                name2_name.append(line2['Name'])
                name2_city.append(line2['City'])
                name2_street.append(line2['Street'])
                score_75_plus.append(partial_ratio)
                name1_id.append(line['id_number'])
                name2_id.append(line2['id_number'])

big_test= zip(name1_name, name1_city, name1_street, name1_id, name2_name, name2_city, name2_street, name2_id, score_75_plus)
writer=csv.writer(open('big_test.csv', 'wb'))
writer.writerows(big_test)

However, since my files are quite large, I think it's going to take quite some time, perhaps days. I'd like to make it more efficient but haven't figured out how. So far my thinking is to restructure the dictionaries into nested dictionaries, to lessen the amount of data the loop has to scan when checking whether the city and street are the same. I'm envisioning something like this:

{'San Francisco':
    {'Pearl':
        [{'City': 'San Francisco', 'Name': 'Szn', 'id_number': '1567', 'Street': 'Pearl'}]},
 'Boston':
    {'Pine':
        [{'City': 'Boston', 'Name': 'Frd', 'id_number': '1578', 'Street': 'Pine'}],
     'Canyon':
        [{'City': 'Boston', 'Name': 'Bob', 'id_number': '1572', 'Street': 'Canyon'}]},
 'Chicago':
    {'Spruce':
        [{'City': 'Chicago', 'Name': 'Lizzy', 'id_number': '1569', 'Street': 'Spruce'},
         {'City': 'Chicago', 'Name': 'Bob', 'id_number': '1571', 'Street': 'Spruce'}]},
 'Denver':
    {'Spruce':
        [{'City': 'Denver', 'Name': 'Bob', 'id_number': '1570', 'Street': 'Spruce'}]},
 'Boulder':
    {'Violet':
        [{'City': 'Boulder', 'Name': 'Diana', 'id_number': '1573', 'Street': 'Violet'}]},
 'Detroit':
    {'Grape':
        [{'City': 'Detroit', 'Name': 'Bill', 'id_number': '1574', 'Street': 'Grape'}]}}

This way it would only have to look through the distinct cities, and the distinct streets within each city, to decide whether to apply fuzz.partial_ratio. I used defaultdict to split it up by city but haven't been able to apply it again for streets:

city_dictionary = defaultdict(list)

for line in name1:
    city_dictionary[line['City']].append(line)
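One way to extend that same defaultdict trick to a second level is to make the inner value another defaultdict of lists, so each street holds a list of rows. A minimal sketch using a few of the sample rows above (the innermost value has to be a list, not a single dict, because one street can hold several records):

```python
from collections import defaultdict

# city -> street -> list of rows
nested = defaultdict(lambda: defaultdict(list))

rows = [
    {'City': 'Chicago', 'Name': 'Lizzy', 'id_number': '1569', 'Street': 'Spruce'},
    {'City': 'Chicago', 'Name': 'Bob', 'id_number': '1571', 'Street': 'Spruce'},
    {'City': 'Boston', 'Name': 'Bob', 'id_number': '1572', 'Street': 'Canyon'},
]

for row in rows:
    nested[row['City']][row['Street']].append(row)

# every record on Spruce in Chicago, without scanning any other city
spruce_names = [r['Name'] for r in nested['Chicago']['Spruce']]
print(spruce_names)  # ['Lizzy', 'Bob']
```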

I've looked at this answer but didn't understand how to implement it.

Sorry for so much detail, I'm not totally sure nested dictionaries are the way to go so I thought I would present the big picture.


Solution

  • Several things you can do:

    1. Don't re-open the second file for every line in the first CSV file. Use a data structure (probably a list) to store all the information and work with it in memory.
    2. Profile your code to see where it spends most of its time.
    3. If most of the time is spent on CPU, use the multiprocessing module to use all cores on your machine, since the tasks here seem context-free.
    4. If most of the time is spent reading files (I/O), use the threading module to do some processing while reading files. You may want to break your files into smaller chunks.

    If this does not help I can try and add more things based on your code then.
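For point 3, the pattern is to turn each comparison into an independent task and fan it out over a pool of workers. A rough sketch of the idea, using difflib.SequenceMatcher as a stand-in for fuzz.partial_ratio so it runs without fuzzywuzzy installed, and a thread pool for brevity (for genuinely CPU-bound scoring you'd swap in multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor):

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

def score(pair):
    """Score one candidate pair; each call is independent (context-free)."""
    a, b = pair
    return a, b, int(SequenceMatcher(None, a, b).ratio() * 100)

# candidate pairs, e.g. names that already share a city and street
pairs = [('Suzan', 'Szn'), ('Fred', 'Frd'), ('Lizzy', 'Lizy'), ('Bob', 'Diana')]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(score, pairs))

matches = [(a, b, s) for a, b, s in results if s > 75]
```

Note that difflib's ratio is not identical to fuzzywuzzy's partial_ratio, so the exact scores (and which pairs clear 75) will differ between the two.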

    Edit:

    Example of reading the second file once, instead of re-reading for each line in first file:

    # read file once before the loop
    file_2_dicts = list(csv.DictReader(open(name2_file, 'rb'), delimiter=',', quotechar='"'))
    for line in name1:
        # remove old read and use in-memory dicts from first file
        # name2 = csv.DictReader(open(name2_file, 'rb'), delimiter=',', quotechar='"')
        name2 = file_2_dicts
        for line2 in name2:
            ...
            ...

    Profiling this version shows where the time actually goes; fuzz.partial_ratio and the difflib machinery underneath it dominate:

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    4550    0.055    0.000    0.066    0.000 csv.py:100(next)
    9098    0.006    0.000    0.006    0.000 csv.py:86(fieldnames)
    4497409    3.845    0.000   54.221    0.000 difflib.py:154(__init__)
    4497409    3.678    0.000   50.377    0.000 difflib.py:223(set_seqs)
    4497409    3.471    0.000    3.471    0.000 difflib.py:235(set_seq1)
    4497409    3.695    0.000   43.228    0.000 difflib.py:261(set_seq2)
    4497409   29.130    0.000   39.533    0.000 difflib.py:306(__chain_b)
    13356323   78.759    0.000  100.599    0.000 difflib.py:350(find_longest_match)
    3123327    1.398    0.000    1.398    0.000 difflib.py:41(_calculate_ratio)
    4497409   36.080    0.000  164.628    0.000 difflib.py:460(get_matching_blocks)
    3123327    7.450    0.000  128.167    0.000 difflib.py:636(ratio)
    7500936    1.673    0.000    1.673    0.000 difflib.py:658(<lambda>)
    1374082   16.978    0.000  252.893    0.000 fuzz.py:57(partial_ratio)
    1374082    1.172    0.000    1.647    0.000 utils.py:42(make_type_consistent)
    3123327    2.587    0.000    4.260    0.000 {_functools.reduce}
    23980904    7.633    0.000    7.633    0.000 {built-in method __new__ of type object at 0x100185f40}
    4497410    6.525    0.000   16.009    0.000 {map}
    1373764    0.496    0.000    0.496    0.000 {max}
    32176130    3.231    0.000    3.231    0.000 {method '__contains__' of 'set' objects}
    61813598    9.676    0.000    9.676    0.000 {method 'append' of 'list' objects}
    72656176    7.728    0.000    7.728    0.000 {method 'get' of 'dict' objects}
    13356323    5.311    0.000    5.311    0.000 {method 'pop' of 'list' objects}
    33073067    4.927    0.000    4.927    0.000 {method 'setdefault' of 'dict' objects}
    4497409    1.568    0.000    1.568    0.000 {method 'sort' of 'list' objects}
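Since nearly all of the cumulative time above sits inside fuzz.partial_ratio and the difflib calls under it, the biggest win is calling it fewer times, which is exactly what the nested-dictionary idea from the question buys: index the second file's rows by a (City, Street) key once, then the inner loop becomes a dictionary lookup instead of a full scan. A sketch under that assumption (field names follow the sample data in the question):

```python
from collections import defaultdict

def group_by_block(rows):
    """Index rows by the (City, Street) blocking key."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[(row['City'], row['Street'])].append(row)
    return blocks

def candidate_pairs(rows1, rows2):
    """Yield only pairs that share a blocking key; everything else is skipped."""
    blocks2 = group_by_block(rows2)
    for row1 in rows1:
        for row2 in blocks2.get((row1['City'], row1['Street']), []):
            yield row1, row2

name1 = [{'City': 'Boston', 'Street': 'Pine', 'Name': 'Fred'},
         {'City': 'Denver', 'Street': 'Spruce', 'Name': 'Bob'}]
name2 = [{'City': 'Boston', 'Street': 'Pine', 'Name': 'Frd'},
         {'City': 'Boston', 'Street': 'Oak', 'Name': 'Fred'}]

pairs = list(candidate_pairs(name1, name2))
# only the Boston/Pine pair survives, so the fuzzy scorer
# runs on 1 pair here instead of all 4 combinations
```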