python algorithm python-2.7 similarity levenshtein-distance

Find CSV lines by word similarity

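I have a CSV file with thousands of lines. I would like to retrieve only the lines that contain a word similar to a specific word, not just an exact match. With the sample file below and the word 'Microsoft', I expect to catch lines 1, 2 and 4. My current code only finds the exact substring.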

Any idea how to achieve that?

import csv

a = 'Microsoft'

with open("testing.csv") as f:
    reader = csv.reader(f, delimiter='\n')
    for row in reader:
        # Exact substring test: only matches lines containing 'Microsoft' verbatim
        if a in row[0]:
            print row[0]

testing.csv

I like very much the Microsoft products
Me too, I like Micrsoft
I prefer Apple products
microfte here

Solution

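The fuzzywuzzy library is well suited to this. Given your test data and expected results, I'm assuming case does not matter, so I uppercase both the target word and each word in the test data before comparing them. A line is printed when any of its words scores above 80 (out of 100) against the target word: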

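from fuzzywuzzy import fuzz
import csv

# Compare case-insensitively by uppercasing both sides.
word = 'Microsoft'.upper()

with open('testing.csv') as f:
    # Using '\n' as the delimiter keeps each line as a single field.
    reader = csv.reader(f, delimiter='\n')
    for row in reader:
        if not row:
            continue  # skip blank lines, which would otherwise raise IndexError
        tokens = row[0].split()
        # fuzz.ratio returns a score from 0 to 100; keep the line if any word exceeds 80.
        if any(fuzz.ratio(word, t.upper()) > 80 for t in tokens):
            print(row[0])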

Result:

$ python test.py
I like very much the Microsoft products
Me too, I like Micrsoft
microfte here
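
If installing fuzzywuzzy is not an option, here is a rough sketch of the same idea using only the standard library's difflib. Note that this swaps in difflib.SequenceMatcher as the scorer, so the scores are not guaranteed to match what fuzzywuzzy reports in every setup. The helper names similarity and similar_lines, and the 0.8 threshold (chosen to mirror the > 80 cutoff above), are my own assumptions for illustration:

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def similar_lines(lines, word, threshold=0.8):
    """Return the lines with at least one word scoring above threshold.

    Hypothetical helper: the threshold of 0.8 is an assumption, tune it
    for your own data.
    """
    matches = []
    for line in lines:
        if any(similarity(word, token) > threshold for token in line.split()):
            matches.append(line)
    return matches


# Same content as testing.csv, inlined so the sketch runs on its own.
sample = [
    'I like very much the Microsoft products',
    'Me too, I like Micrsoft',
    'I prefer Apple products',
    'microfte here',
]

for line in similar_lines(sample, 'Microsoft'):
    print(line)
```

On the sample data this prints the same three lines as above. To run it against the real file, pass the open file object (or a list of its lines) as the first argument instead of sample.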