I have 2 csv files,
zulu
has the base info, divided in columns.bommel
has only updated info with the same records in the same columns.I want to solve this in Python (using the csv
module from the standard library) without Pandas or other external resources.
#!/usr/bin/env python3
import csv
# Define column names
fields = ['capcode', 'discipline', 'region', 'location', 'description', 'remark']
# Open the neccesary files
with open('bommel_db_capcodes.txt', 'r') as readFile_bommel:
with open('results.csv', 'w') as results:
with open('zulu_db_capcodes.txt', 'r') as readFile_zulu:
master = csv.DictReader(readFile_zulu, fieldnames=fields)
update = csv.DictReader(readFile_bommel, fieldnames=fields)
writer = csv.DictWriter(results, fieldnames=fields)
# Saves and skips header to output file
writer.writerow(next(master))
# Goes through whole zulu csv
for row in master:
for row2 in update:
if row['capcode'] in update:
writer.writerow(row2)
else:
writer.writerow(row)
ReadFilezulu.close()
ReadFilebommel.close()
results.close()
Contents of zulu
csv:
capcode,discipline,region,location,description,remark
000400001,Brandweer,Groningen,Groningen,Regionaal,Pelotonscommandant Logistiek/Water (Noord)
000400002,Brandweer,Groningen,Groningen,,
000400003,Brandweer,Groningen,Groningen,Regionaal,Pelotonscommandant Logistiek/Water) (Oost)
000100000,Brandweer,Amsterdam-Amstelland,Amsterdam-Amstelland,Aalsmeer,Postalarm
000100001,Brandweer,Amsterdam-Amstelland,Amsterdam-Amstelland,,
000100002,Brandweer,Amsterdam-Amstelland,Amsterdam-Amstelland,,Banaanzulu
000100003,Brandweer,Amsterdam-Amstelland,Amsterdam-Amstelland,,
Contents of bommel
csv:
capcode,discipline,region,location,description,remark
000100000,Brandweer,Amsterdam-Amstelland,,banaanProefalarm,
000100001,Brandweer,Amsterdam-Amstelland,Aalsmeer,Bevelvoerders,
000100004,Brandweer,Amsterdam-Amstelland,Aalsmeer,Korpsalarm,
Current result
capcode,discipline,region,location,description,remark
000400001,Brandweer,Groningen,Groningen,Regionaal,Pelotonscommandant Logistiek/Water (Noord)
000400001,Brandweer,Groningen,Groningen,Regionaal,Pelotonscommandant Logistiek/Water (Noord)
000400001,Brandweer,Groningen,Groningen,Regionaal,Pelotonscommandant Logistiek/Water (Noord)
Expected result
capcode,discipline,region,location,description,remark < from saving header
000400001,Brandweer,Groningen,Groningen,Regionaal,Pelotonscommandant Logistiek/Water (Noord) < from zulu
000400002,Brandweer,Groningen,Groningen,, < from zulu
000400003,Brandweer,Groningen,Groningen,Regionaal,Pelotonscommandant Logistiek/Water) (Oost) < from zulu
000100000,Brandweer,Amsterdam-Amstelland,,banaanProefalarm, < from bommel
000100001,Brandweer,Amsterdam-Amstelland,Aalsmeer,Bevelvoerders, < from bommel
000100002,Brandweer,Amsterdam-Amstelland,Amsterdam-Amstelland,,Banaanzulu < from zulu
000100003,Brandweer,Amsterdam-Amstelland,Amsterdam-Amstelland,, < from zulu
000100004,Brandweer,Amsterdam-Amstelland,Aalsmeer,Korpsalarm, < from bommel
Any ideas on how to get this done?
The first time when you use the condition in update
you consume the entire input file. Because update
is basically a generator, you exhaust it when you loop over it.
Also, your condition checks if exactly the same line exists in the update file, which of course it doesn't (you would not want or need to update anything if the data was exactly the same).
You want to read the update lines into memory once, then skip those lines from the master file when you see a line with the same key (not the whole line!)
I assume the first field (capcode
) is the key here, though there could be other arrangements.
Tangentially, you can combine all the with
statements; and when you use with open
, there is no need to .close()
anything.
#!/usr/bin/env python3
import csv
fields = ['capcode', 'discipline', 'region', 'location', 'description', 'remark']
with open('bommel_db_capcodes.txt', 'r') as readFile_bommel, \
open('results.csv', 'w') as results, \
open('zulu_db_capcodes.txt', 'r') as readFile_zulu:
master = csv.DictReader(readFile_zulu, fieldnames=fields)
update = csv.DictReader(readFile_bommel, fieldnames=fields)
writer = csv.DictWriter(results, fieldnames=fields)
# Save header to output file and skip
writer.writerow(next(master))
# Skip header from updates
next(update)
# Read, remember, and write updated lines
seen = set()
for row in update:
writer.writerow(row)
seen.add(row['capcode'])
for row in master:
if row['capcode'] not in seen:
writer.writerow(row)