I am a business student who has just begun learning Python. My professor asked me to do fuzzy matching between two files: US patent information, and company information downloaded from a stock exchange website. My task is to compare the company names that appear in the US patent documentation (column 1 of file 1) with the names found on the stock exchange website (column 1 of file 2). As far as I understand, the steps are:
(1) Convert all the names in file 1 and file 2 to lower case.
(2) Take a name from file 2, compare it with every name in file 1, and return the 15 closest matches.
(3) Repeat step 2 for every name in file 2.
(4) Report the similarity level of every match.
I guess I will use the SequenceMatcher() object. I have only just learned how to import data from my CSV files (I have 2 files); see below:
import csv
with open('USPTO.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    for row in data:
        print "------------------"
        print row
        print "------------------"
        for cell in row:
            print cell
Sorry about my silly question, but I am too new to Python to work out how to replace the strings ("abcde", "abcde", as shown below) with my own data. I have no idea how to change the data I imported to lower case. And I don't even know how to set the 15 closest matches standard. My professor told me this was an easy task, but I really feel defeated. Thank you for reading! Hopefully someone can give me some instructions. I am not that stupid :)
>>> import difflib
>>> difflib.SequenceMatcher(None, 'abcde', 'abcde').ratio()
1.0
To answer your questions one by one:
1) "I have no idea how to change the data I imported to lower cases."
To get a lower-case copy of a cell, call [string].lower(). Strings are immutable, so this returns a new string rather than changing the original.
The following code prints each cell in lower case:
import csv
with open('USPTO.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    for row in data:
        print "------------------"
        print row
        print "------------------"
        for cell in row:
            print cell.lower()
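Because lower() does not modify the cell in place, you have to store the result if you want to keep the lower-cased values, for example in a list:
import csv
lowered_rows = []
with open('USPTO.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    for row in data:
        # build a new row in which every cell is lower case
        lowered_rows.append([cell.lower() for cell in row])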
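Since the comparison only needs column 1 of each file, it may be simpler to collect just that column. Here is a minimal sketch; the helper name first_column_lower and the sample lines are made up for illustration, and I am assuming the company name really is the first cell of every row. Stripping whitespace and skipping blank rows are my own additions, not something the task requires. A single-argument print(...) is used so the snippet behaves the same on Python 2 and Python 3.

```python
import csv

def first_column_lower(rows):
    """Return the lower-cased, stripped first cell of every non-empty row."""
    names = []
    for row in rows:
        # Skip blank lines and rows whose first cell is empty.
        if row and row[0].strip():
            names.append(row[0].strip().lower())
    return names

# csv.reader accepts any iterable of lines, so a list stands in for a file here.
sample_lines = ['Apple Inc.,CA', '  IBM Corp ,NY', '', 'Google LLC,CA']
sample_names = first_column_lower(csv.reader(sample_lines))
print(sample_names)
```

With a real file you would pass the csv.reader object from the with block above instead of sample_lines. If your CSV has a header row, you would probably want to drop the first entry of the returned list.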
2) "I don’t even know how to set the 15 closest matches standard."
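For this you should set up a dictionary. Each key is a name from file 2 (string1), and its value is a list of pairs (string2, ratio), where string2 is a name from file 1 and ratio is the value of difflib.SequenceMatcher(None, string1, string2).ratio(). Sort each list by ratio in descending order and keep the first 15 pairs.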
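To make the dictionary idea concrete, here is one possible sketch. The function name closest_matches, the variable names, and the sample company names are all hypothetical; in your case the two lists would come from column 1 of your own files, already lower-cased. It is only meant to show the shape of the data structure, not a tuned solution.

```python
import difflib

def closest_matches(names_to_match, candidates, n=15):
    """Map each name to its n most similar candidates as (candidate, ratio) pairs."""
    matches = {}
    for name in names_to_match:
        scored = []
        for candidate in candidates:
            ratio = difflib.SequenceMatcher(None, name, candidate).ratio()
            scored.append((candidate, ratio))
        # Highest similarity first, then keep only the top n pairs.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        matches[name] = scored[:n]
    return matches

# Made-up sample data standing in for column 1 of each file.
patent_names = ['apple inc.', 'apple computer inc', 'ibm corp', 'google llc']
exchange_names = ['apple inc', 'ibm']

result = closest_matches(exchange_names, patent_names, n=2)
for name in exchange_names:
    print(name)
    for candidate, ratio in result[name]:
        print("    %s  %.3f" % (candidate, ratio))
```

Every name in file 2 is compared with every name in file 1, so the running time grows with the product of the two list lengths and may be slow on large files. If you only need the matching names and not the scores, difflib.get_close_matches(word, possibilities, n, cutoff) from the standard library does something similar, but note that it drops anything scoring below its cutoff (0.6 by default).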
Please attempt to write some code and we will help you fix it.
Look at https://docs.python.org/2/tutorial/datastructures.html for how to construct a dictionary