Search code examples
pythoncsvcomparison

python compare CSV and find difference


I've two CSVs as :

CSV1

id, count
1, 5
2, 10
100, 1

CSV2

id, count
100, 5
1, 10
2, 1

I need to compare the CSVs with id and get the count for both and find the difference. My expected result would be:

id, Diff
1, -5
100, -4
2, 9

For now i'm using nested loops as :

with open('csv1.csv', 'r') as t1, open('csv2.csv', 'r') as t2:
fileone = csv.DictReader(t1)
filetwo = csv.DictReader(t2)
csv1 = list(fileone)
csv2 = list(filetwo)
for data in csv1:
    for datum in csv2:
        if data['id'] == datum['id']:
            diff = int(data['count']) - int(datum['count'])

            if diff > 0:
                print(diff)
                item = [[
                    str(data['id']),
                   str(data['count']),
                  str(datum['count']),
                  str(diff)]]
                writer.writerows(item)

But since the above code does a loop inside a loop, it takes forever if i have large files O(n^2). Is there anyway i can easily do a comparison in python.


Solution

  • This O(n**2) code:

    fileone = csv.DictReader(t1)
    filetwo = csv.DictReader(t2)
    csv1 = list(fileone)
    csv2 = list(filetwo)
    for data in csv1:
        for datum in csv2:
            if data['id'] == datum['id']:
               diff = int(data['count']) - int(datum['count'])
               ...
    

    can be replaced by creating 2 dictionaries using the id field as keys, then perform the intersection of the keys. Then loop on the intersected keys:

    csv1 = {data["id"]:data for data in fileone}
    csv2 = {data["id"]:data for data in filetwo}
    keys = set(csv1) & csv2
    for k in keys:
        data = csv1[k]
        datum = csv2[k]
        diff = int(data['count']) - int(datum['count'])
        ...
    

    now you have approx O(n) complexity (dict lookup is O(1) in average)