I've two CSVs as :
CSV1
id, count
1, 5
2, 10
100, 1
CSV2
id, count
100, 5
1, 10
2, 1
I need to compare the CSVs with id and get the count for both and find the difference. My expected result would be:
id, Diff
1, -5
100, -4
2, 9
For now i'm using nested loops as :
with open('csv1.csv', 'r') as t1, open('csv2.csv', 'r') as t2:
fileone = csv.DictReader(t1)
filetwo = csv.DictReader(t2)
csv1 = list(fileone)
csv2 = list(filetwo)
for data in csv1:
for datum in csv2:
if data['id'] == datum['id']:
diff = int(data['count']) - int(datum['count'])
if diff > 0:
print(diff)
item = [[
str(data['id']),
str(data['count']),
str(datum['count']),
str(diff)]]
writer.writerows(item)
But since the above code does a loop inside a loop, it takes forever if i have large files O(n^2). Is there anyway i can easily do a comparison in python.
This O(n**2)
code:
fileone = csv.DictReader(t1)
filetwo = csv.DictReader(t2)
csv1 = list(fileone)
csv2 = list(filetwo)
for data in csv1:
for datum in csv2:
if data['id'] == datum['id']:
diff = int(data['count']) - int(datum['count'])
...
can be replaced by creating 2 dictionaries using the id field as keys, then perform the intersection of the keys. Then loop on the intersected keys:
csv1 = {data["id"]:data for data in fileone}
csv2 = {data["id"]:data for data in filetwo}
keys = set(csv1) & csv2
for k in keys:
data = csv1[k]
datum = csv2[k]
diff = int(data['count']) - int(datum['count'])
...
now you have approx O(n)
complexity (dict lookup is O(1)
in average)