Having data containing both parent and child records inside a same text file (two headers) .
Parent is department
and childs are employees
and dno
is the join column.
dno,dname,loc
10,FIN,USA
20,HR,EUR
30,SEC,AUS
empno,ename,sal,dno
2100,AMY,1001,10
2200,JENNY,2001,10
3100,RINI,3001,20
4100,EMP4,4001,30
4200,EMP5,5001,30
4300,EMP6,6001,30
Would like to combine both data by dno
and create an output like below:
empno,ename,sal,dno,dname,loc
2100,AMY,1001,10,FIN,USA
2200,JENNY,2001,10,FIN,USA
3100,RINI,3001,20,HR,EUR
4100,EMP4,4001,30,SEC,AUS
4200,EMP5,5001,30,SEC,AUS
4300,EMP6,6001,30,SEC,AUS
Python version - 2.6
Have tried the following solution:
dept_lst = []
emp_lst = []
with open(efile,'rb') as e_file:
reader = csv.reader(e_file,delimiter=",")
for row in reader:
if ((row[0] != 'dno' and row[0] != 'dname' ) or
(row[0] != 'empno' and row[0] != 'ename')):
if len(row) == 3:
dept_lst.append(row)
elif len(row) == 4:
emp_lst.append(row)
result = [ e + [d[1],d[2]] for e in emp_lst for d in dept_lst if e[3] == d[0]]
for line in result:
print ",".join(line)
Question: Original data is like more than 1GB and this seems to be working. Not sure if this is an optimal solution.
Would like to know if there are any other efficient ways/alternatives of handling this scenario
using Python Standard Library - 2.6
.
Consider reading the first part and building a dictionary of follow-ups, then switching to the second part and using the dictionary. Also, consider using a CSV writer to write processed rows at once instead of saving them as a list.
dno = {}
# Why do you open the file in the binary mode?
with open("efile.csv", "r") as e_file,\
open("ofile.csv", "w") as o_file:
reader = csv.reader(e_file)
next(reader) # Skip the header
for row in reader:
if row[0] == 'empno':
break # The second part begins
dno[row[0]] = row[1:]
writer = csv.writer(o_file)
for row in reader:
writer.writerow(row + dno[row[3]])