I have a CSV file with thousands of lines. I would like to retrieve only the lines that contain a word similar to a specific word, including misspellings. In this case I expect to match lines 1, 2 and 4.
Any idea how to achieve that?
import csv

a = 'Microsoft'
f = open("testing.csv")
reader = csv.reader(f, delimiter='\n')
for row in reader:
    # Exact substring test: only catches lines spelling the word correctly
    if a in row[0]:
        print(row[0])
testing.csv:
I like very much the Microsoft products
Me too, I like Micrsoft
I prefer Apple products
microfte here
The fuzzywuzzy library is suitable for this. Given your test data and expected results, I'm assuming case does not matter, so I uppercase both the word to compare against and the test data:
from fuzzywuzzy import fuzz
import csv

word = 'Microsoft'.upper()
with open('testing.csv') as f:
    reader = csv.reader(f, delimiter='\n')
    for row in reader:
        if not row:
            continue  # skip blank lines
        words = row[0].split(' ')
        # Keep the line if any single word scores above 80 (out of 100)
        if max(fuzz.ratio(word, x.upper()) for x in words) > 80:
            print(row[0])
Result:
$ python test.py
I like very much the Microsoft products
Me too, I like Micrsoft
microfte here
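If installing a third-party package is not an option, the same per-word comparison can be sketched with the standard library's difflib. This is only a rough equivalent, not fuzzywuzzy itself: the helper names `best_token_ratio` and `similar_lines` are made up for illustration, the threshold of 0.8 is an assumption chosen to mirror the 80 used above, and the scores may not match `fuzz.ratio` exactly.

```python
from difflib import SequenceMatcher


def best_token_ratio(line, word):
    """Return the highest similarity (0.0 to 1.0) between word and any token of line."""
    target = word.upper()
    tokens = line.upper().split()
    if not tokens:
        return 0.0  # blank line: nothing to compare
    return max(SequenceMatcher(None, target, tok).ratio() for tok in tokens)


def similar_lines(lines, word, threshold=0.8):
    """Return the lines that contain at least one token scoring above threshold."""
    # threshold=0.8 is an assumed cutoff, mirroring the "> 80" used with fuzzywuzzy
    return [ln.rstrip("\n") for ln in lines if best_token_ratio(ln, word) > threshold]


# Sample data copied from testing.csv in the question
sample = [
    "I like very much the Microsoft products",
    "Me too, I like Micrsoft",
    "I prefer Apple products",
    "microfte here",
]

for line in similar_lines(sample, "Microsoft"):
    print(line)
```

An open file can be passed in place of the list, since iterating over a file object yields its lines one at a time. Lowering the threshold catches more misspellings at the cost of more false matches, so it is worth tuning against a sample of the real data.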