Tags: python, algorithm, python-2.7, similarity, levenshtein-distance

Find CSV lines by word similarity


I have a CSV file with thousands of lines. I would like to retrieve only the lines that contain a word similar to a specific word. In this case I expect to match lines 1, 2 and 4.

Any idea how to achieve that?

import csv
a='Microsoft'
f = open("testing.csv")
reader = csv.reader(f, delimiter='\n')

for row in reader:
    if a in row[0]:
        print row[0]

testing.csv

I like very much the Microsoft products
Me too, I like Micrsoft
I prefer Apple products
microfte here

Solution

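  • The fuzzywuzzy library is well suited to this. Given your test data and expected results, I assume case does not matter, so I uppercase both the target word and each word in the test data before comparing. fuzz.ratio returns a score from 0 to 100, and a line is kept when at least one of its words scores above 80: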

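    from fuzzywuzzy import fuzz
    import csv

    # Compare case-insensitively by uppercasing both sides.
    word = 'Microsoft'.upper()

    with open('testing.csv') as f:
        # With '\n' as the delimiter, each line comes back as a single-field row.
        reader = csv.reader(f, delimiter='\n')

        for row in reader:
            if not row:  # skip blank lines, which yield an empty row
                continue
            words = row[0].split()
            # Keep the line if any of its words scores above 80 (out of 100).
            if any(fuzz.ratio(word, x.upper()) > 80 for x in words):
                print(row[0])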
    

    Result:

    $ python test.py
    I like very much the Microsoft products
    Me too, I like Micrsoft
    microfte here
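
If installing fuzzywuzzy is not an option, the same per-word comparison can be roughly approximated with the standard library. The sketch below assumes difflib.SequenceMatcher as a stand-in for fuzz.ratio, so scores may differ slightly from fuzzywuzzy's; the function name similar_lines and the 0.8 threshold (mirroring the 80 above on a 0 to 1 scale) are illustrative choices, not part of either library. It reads from an in-memory list rather than testing.csv to stay self-contained.

```python
from difflib import SequenceMatcher


def similar_lines(lines, word, threshold=0.8):
    """Return the lines that contain at least one word similar to `word`.

    `similar_lines` and `threshold` are hypothetical names for this sketch.
    """
    target = word.upper()
    matches = []
    for line in lines:
        # Score every whitespace-separated word, ignoring case.
        scores = [SequenceMatcher(None, target, w.upper()).ratio()
                  for w in line.split()]
        # Keep the line if any single word is close enough to the target.
        if any(score > threshold for score in scores):
            matches.append(line)
    return matches


# Sample data copied from testing.csv in the question.
sample = [
    "I like very much the Microsoft products",
    "Me too, I like Micrsoft",
    "I prefer Apple products",
    "microfte here",
]

for line in similar_lines(sample, "Microsoft"):
    print(line)
```

On the sample data this should select the same three lines as the result above. To run it against the real file, pass the open file object (or a list of its lines with the trailing newlines stripped) instead of sample.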