I have two lists of dictionaries (read with csv.DictReader) that look something like this:
Name1
[{'City' :'San Francisco', 'Name':'Suzan', 'id_number' : '1567', 'Street': 'Pearl'},
{'City' :'Boston', 'Name':'Fred', 'id_number' : '1568', 'Street': 'Pine'},
{'City' :'Chicago', 'Name':'Lizzy', 'id_number' : '1569', 'Street': 'Spruce'},
{'City' :'Denver', 'Name':'Bob', 'id_number' : '1570', 'Street': 'Spruce'},
{'City' :'Chicago', 'Name':'Bob', 'id_number' : '1571', 'Street': 'Spruce'},
{'City' :'Boston', 'Name':'Bob', 'id_number' : '1572', 'Street': 'Canyon'},
{'City' :'Boulder', 'Name':'Diana', 'id_number' : '1573', 'Street': 'Violet'},
{'City' :'Detroit', 'Name':'Bill', 'id_number' : '1574', 'Street': 'Grape'}]
and
Name2
[{'City' :'San Francisco', 'Name':'Szn', 'id_number' : '1567', 'Street': 'Pearl'},
{'City' :'Boston', 'Name':'Frd', 'id_number' : '1578', 'Street': 'Pine'},
{'City' :'Chicago', 'Name':'Lizy', 'id_number' : '1579', 'Street': 'Spruce'},
{'City' :'Denver', 'Name':'Bobby', 'id_number' : '1580', 'Street': 'Spruce'},
{'City' :'Chicago', 'Name':'Bob', 'id_number' : '1580', 'Street': 'Spruce'},
{'City' :'Boston', 'Name':'Bob', 'id_number' : '1580', 'Street': 'Walnut'}]
As you can see, the names in the second chunk are spelled differently from those in the first chunk, but a few are nearly the same. I'd like to use fuzzy string matching to match these up, and I'd also like to narrow the search so that I'm only comparing names that are in the same city and on the same street. Currently I'm running a for loop that looks like this:
from fuzzywuzzy import fuzz
import csv

name1_file = 'name1_file.csv'
name2_file = 'name2_file.csv'

name1 = csv.DictReader(open(name1_file, 'rb'), delimiter=',', quotechar='"')

score_75_plus = []
name1_name = []
name2_name = []
name1_city = []
name2_city = []
name1_street = []
name2_street = []
name1_id = []
name2_id = []

for line in name1:
    # re-opens and re-reads the second file for every row of the first
    name2 = csv.DictReader(open(name2_file, 'rb'), delimiter=',', quotechar='"')
    for line2 in name2:
        if line['City'] == line2['City'] and line['Street'] == line2['Street']:
            partial_ratio = fuzz.partial_ratio(line['Name'], line2['Name'])
            if partial_ratio > 75:
                name1_name.append(line['Name'])
                name1_city.append(line['City'])
                name1_street.append(line['Street'])
                name2_name.append(line2['Name'])
                name2_city.append(line2['City'])
                name2_street.append(line2['Street'])
                score_75_plus.append(partial_ratio)
                name1_id.append(line['id_number'])
                name2_id.append(line2['id_number'])

big_test = zip(name1_name, name1_city, name1_street, name1_id,
               name2_name, name2_city, name2_street, name2_id, score_75_plus)

writer = csv.writer(open('big_test.csv', 'wb'))
writer.writerows(big_test)
However, since my files are quite large, I think it's going to take quite some time... days perhaps. I'd like to make it more efficient but haven't figured out how. So far my thinking is to restructure the dictionaries into nested dictionaries to lessen the amount of data it has to loop through when checking whether the city and street are the same. I'm envisioning something like this:
{'San Francisco':
    {'Pearl':
        {'City' :'San Francisco', 'Name':'Szn', 'id_number' : '1567', 'Street': 'Pearl'}},
 'Boston':
    {'Pine':
        {'City' :'Boston', 'Name':'Frd', 'id_number' : '1578', 'Street': 'Pine'},
     'Canyon':
        {'City' :'Boston', 'Name':'Bob', 'id_number' : '1572', 'Street': 'Canyon'}},
 'Chicago':
    {'Spruce':
        [{'City' :'Chicago', 'Name':'Lizzy', 'id_number' : '1569', 'Street': 'Spruce'},
         {'City' :'Chicago', 'Name':'Bob', 'id_number' : '1571', 'Street': 'Spruce'}]},
 'Denver':
    {'Spruce':
        {'City' :'Denver', 'Name':'Bob', 'id_number' : '1570', 'Street': 'Spruce'}},
 'Boulder':
    {'Violet':
        {'City' :'Boulder', 'Name':'Diana', 'id_number' : '1573', 'Street': 'Violet'}},
 'Detroit':
    {'Grape':
        {'City' :'Detroit', 'Name':'Bill', 'id_number' : '1574', 'Street': 'Grape'}}}
This way it would only have to look through the distinct cities, and the distinct streets within each city, to decide whether to apply fuzz.partial_ratio. I used defaultdict to split it up by city but haven't been able to apply it again for streets:
from collections import defaultdict

city_dictionary = defaultdict(list)
for line in name1:
    city_dictionary[line['City']].append(line)
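For what it's worth, one way the second level could look is a nested defaultdict keyed by city and then street, so every row lands in its own (city, street) bucket. A sketch, reusing the field names and a few rows from the sample data above:

```python
from collections import defaultdict

def group_by_city_street(rows):
    # city -> street -> list of row dicts
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row['City']][row['Street']].append(row)
    return grouped

rows = [
    {'City': 'Chicago', 'Name': 'Lizzy', 'id_number': '1569', 'Street': 'Spruce'},
    {'City': 'Chicago', 'Name': 'Bob', 'id_number': '1571', 'Street': 'Spruce'},
    {'City': 'Boston', 'Name': 'Bob', 'id_number': '1572', 'Street': 'Canyon'},
]
grouped = group_by_city_street(rows)
print(len(grouped['Chicago']['Spruce']))  # the two Chicago/Spruce rows end up together
```

The lambda makes the inner level another defaultdict(list), which is what lets the same one-liner append work for streets as well as cities.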
I've looked at this answer but didn't understand how to implement it.
Sorry for so much detail. I'm not totally sure nested dictionaries are the way to go, so I thought I would present the big picture.
Several things you can do:

- Use the multiprocessing module to use all cores on your machine, since the tasks here seem context-free.
- Use the threading module to do some processing while reading files. You may want to break your files into smaller chunks.

If this does not help, I can add more suggestions based on your code.
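The multiprocessing suggestion could be sketched like this. To keep the example self-contained, difflib.SequenceMatcher stands in for fuzz.partial_ratio (same 0-100 scale, but a different metric, so scores won't match fuzzywuzzy's exactly):

```python
from difflib import SequenceMatcher
from multiprocessing import Pool

def score(pair):
    # stand-in for fuzz.partial_ratio: 0-100 similarity via difflib
    a, b = pair
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))

def parallel_scores(pairs, processes=2):
    # each (name, name) pair is independent, so fan the scoring
    # out across worker processes and collect results in order
    pool = Pool(processes=processes)
    try:
        return pool.map(score, pairs)
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    pairs = [('Suzan', 'Szn'), ('Fred', 'Frd'), ('Lizzy', 'Lizy')]
    print(parallel_scores(pairs))
```

In your loop you would collect the candidate (name, name) pairs per (City, Street) bucket first, then score them in one pool.map call instead of one fuzz call at a time.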
Example of reading the second file once, instead of re-reading for each line in first file:
# read the second file once, before the loop
file_2_dicts = list(csv.DictReader(open(name2_file, 'rb'), delimiter=',', quotechar='"'))

for line in name1:
    # remove the old re-read and use the in-memory dicts from the second file
    # name2 = csv.DictReader(open(name2_file, 'rb'), delimiter=',', quotechar='"')
    name2 = file_2_dicts
    for line2 in name2:
        ...

Profiler output:
ncalls tottime percall cumtime percall filename:lineno(function)
4550 0.055 0.000 0.066 0.000 csv.py:100(next)
9098 0.006 0.000 0.006 0.000 csv.py:86(fieldnames)
4497409 3.845 0.000 54.221 0.000 difflib.py:154(__init__)
4497409 3.678 0.000 50.377 0.000 difflib.py:223(set_seqs)
4497409 3.471 0.000 3.471 0.000 difflib.py:235(set_seq1)
4497409 3.695 0.000 43.228 0.000 difflib.py:261(set_seq2)
4497409 29.130 0.000 39.533 0.000 difflib.py:306(__chain_b)
13356323 78.759 0.000 100.599 0.000 difflib.py:350(find_longest_match)
3123327 1.398 0.000 1.398 0.000 difflib.py:41(_calculate_ratio)
4497409 36.080 0.000 164.628 0.000 difflib.py:460(get_matching_blocks)
3123327 7.450 0.000 128.167 0.000 difflib.py:636(ratio)
7500936 1.673 0.000 1.673 0.000 difflib.py:658(<lambda>)
1374082 16.978 0.000 252.893 0.000 fuzz.py:57(partial_ratio)
1374082 1.172 0.000 1.647 0.000 utils.py:42(make_type_consistent)
3123327 2.587 0.000 4.260 0.000 {_functools.reduce}
23980904 7.633 0.000 7.633 0.000 {built-in method __new__ of type object at 0x100185f40}
4497410 6.525 0.000 16.009 0.000 {map}
1373764 0.496 0.000 0.496 0.000 {max}
32176130 3.231 0.000 3.231 0.000 {method '__contains__' of 'set' objects}
61813598 9.676 0.000 9.676 0.000 {method 'append' of 'list' objects}
72656176 7.728 0.000 7.728 0.000 {method 'get' of 'dict' objects}
13356323 5.311 0.000 5.311 0.000 {method 'pop' of 'list' objects}
33073067 4.927 0.000 4.927 0.000 {method 'setdefault' of 'dict' objects}
4497409 1.568 0.000 1.568 0.000 {method 'sort' of 'list' objects}
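Almost all of the cumulative time above is spent inside difflib via fuzz.partial_ratio, so reducing the number of name comparisons matters more than speeding up any single one. A stdlib-only sketch of bucketing the second file by (City, Street), using difflib.get_close_matches as a stand-in for fuzzywuzzy scoring (a 0.75 cutoff roughly mirrors your 75 threshold, though the metrics are not identical); the sample rows are taken from the question:

```python
import difflib

name2_rows = [
    {'City': 'Chicago', 'Name': 'Lizy', 'id_number': '1579', 'Street': 'Spruce'},
    {'City': 'Chicago', 'Name': 'Bob', 'id_number': '1580', 'Street': 'Spruce'},
]

# bucket the second file's names by (City, Street) so each
# lookup only scans the handful of names sharing that key
buckets = {}
for row in name2_rows:
    buckets.setdefault((row['City'], row['Street']), []).append(row['Name'])

def close_names(row, cutoff=0.75):
    # compare only against names in the same (City, Street) bucket
    candidates = buckets.get((row['City'], row['Street']), [])
    return difflib.get_close_matches(row['Name'], candidates, n=3, cutoff=cutoff)

print(close_names({'City': 'Chicago', 'Street': 'Spruce', 'Name': 'Lizzy'}))
```

With the buckets built once up front, the overall work drops from rows1 × rows2 comparisons to rows1 × (average bucket size), which is where the days-long runtime goes away.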