python pandas csv difflib sequencematcher

Comparing two columns of a CSV and writing the string similarity ratio to another CSV


I am very new to Python programming. I have a CSV file with two columns of string values, and I want to compute the similarity ratio between the two strings in each row. Then I want to write those ratios to another file.

The csv may look like this:

Column 1|Column 2 
tomato|tomatoe 
potato|potatao 
apple|appel 

I want the output file to show, for each row, how similar the string in Column 1 is to the string in Column 2. I am using difflib to compute the ratio score.

This is the code I have so far:

import csv
import difflib

f = open('test.csv')

csf_f = csv.reader(f)

row_a = []
row_b = []

for row in csf_f:
    row_a.append(row[0])
    row_b.append(row[1])

a = row_a
b = row_b

def similar(a, b):
    return difflib.SequenceMatcher(a, b).ratio()

match_ratio = similar(a, b)

match_list = []
for row in match_ratio:
    match_list.append(row)

with open("output.csv", "wb") as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(match_list)

f.close()

I get the error:

Traceback (most recent call last):
  File "comparison.py", line 24, in <module>
    for row in match_ratio:
TypeError: 'float' object is not iterable

I suspect I am not reading the columns into lists correctly, or not passing them to the SequenceMatcher function correctly.


Solution

  • Here is another way to get this done using pandas:

    Suppose your CSV data looks like this:

    Column 1,Column 2 
    tomato,tomatoe 
    potato,potatao 
    apple,appel
    

    Code

    import pandas as pd
    import difflib as diff
    # Read the CSV
    df = pd.read_csv('datac.csv')
    # Create a new column 'diff' holding the similarity ratio of the first two columns.
    # .iloc selects by position; .strip() drops the trailing spaces in the sample data.
    df['diff'] = df.apply(lambda row: diff.SequenceMatcher(None, row.iloc[0].strip(), row.iloc[1].strip()).ratio(), axis=1)
    # Save the dataframe to CSV (other formats such as Excel or HTML are also possible)
    df.to_csv('outdata.csv', index=False)
    

    Result

    Column 1,Column 2 ,diff
    tomato,tomatoe ,0.923076923077
    potato,potatao ,0.923076923077
    apple,appel ,0.8
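
For comparison, the approach from the question can also be made to work with only the csv and difflib modules. Two things appear to cause the TypeError. First, the first positional argument of SequenceMatcher is isjunk, so SequenceMatcher(a, b) does not compare a with b. Second, ratio() returns a single float, so it has to be called once per row rather than once on the whole column lists. The following is a minimal sketch, assuming Python 3, a comma-delimited file, and a header in the first row; the function name write_ratios and the output column name ratio are hypothetical choices made for illustration.

```python
import csv
import difflib


def similar(a, b):
    # The first positional argument of SequenceMatcher is `isjunk`,
    # so the two strings go in the second and third positions.
    return difflib.SequenceMatcher(None, a, b).ratio()


def write_ratios(in_path, out_path):
    """Copy the first two columns of in_path to out_path, adding a ratio column.

    Hypothetical helper: assumes a comma-delimited file with a header row.
    """
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)

        header = next(reader, None)
        if header is None:
            return  # empty input file, nothing to write
        writer.writerow([h.strip() for h in header[:2]] + ["ratio"])

        for row in reader:
            if len(row) < 2:
                continue  # skip blank or incomplete lines
            # Strip stray whitespace so it does not lower the score.
            a, b = row[0].strip(), row[1].strip()
            # One ratio per row, computed on the two strings of that row.
            writer.writerow([a, b, similar(a, b)])
```

Calling write_ratios('test.csv', 'output.csv') on the sample data should produce one ratio per row. The score is 2*M/T, where M is the number of matching characters and T is the combined length of both strings, so tomato and tomatoe give 2*6/13, roughly 0.923, which agrees with the pandas result above.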