Search code examples
pythonpython-2.6

Python: Combine data based on a key column


Having data containing both parent and child records inside a same text file (two headers) . Parent is department and childs are employees and dno is the join column.

dno,dname,loc
10,FIN,USA
20,HR,EUR
30,SEC,AUS
empno,ename,sal,dno
2100,AMY,1001,10
2200,JENNY,2001,10
3100,RINI,3001,20
4100,EMP4,4001,30
4200,EMP5,5001,30
4300,EMP6,6001,30

Would like to combine both data by dno and create an output like below:

empno,ename,sal,dno,dname,loc
2100,AMY,1001,10,FIN,USA
2200,JENNY,2001,10,FIN,USA
3100,RINI,3001,20,HR,EUR
4100,EMP4,4001,30,SEC,AUS
4200,EMP5,5001,30,SEC,AUS
4300,EMP6,6001,30,SEC,AUS

Python version - 2.6

Have tried the following solution:

dept_lst = []
emp_lst = []

with open(efile,'rb') as e_file:
    reader = csv.reader(e_file,delimiter=",")
    for row in reader:
        if ((row[0] != 'dno' and row[0] != 'dname' ) or 
            (row[0] != 'empno' and row[0] != 'ename')):
            if len(row) == 3:
                dept_lst.append(row)
            elif len(row) == 4:
                emp_lst.append(row)

result = [ e + [d[1],d[2]] for e in emp_lst for d in dept_lst if e[3] == d[0]]

for line in result:
    print ",".join(line)

Question: Original data is like more than 1GB and this seems to be working. Not sure if this is an optimal solution.

Would like to know if there are any other efficient ways/alternatives of handling this scenario using Python Standard Library - 2.6.


Solution

  • Consider reading the first part and building a dictionary of follow-ups, then switching to the second part and using the dictionary. Also, consider using a CSV writer to write processed rows at once instead of saving them as a list.

    dno = {}
    # Why do you open the file in the binary mode?
    with open("efile.csv", "r") as e_file,\
         open("ofile.csv", "w") as o_file:
        reader = csv.reader(e_file)
        next(reader) # Skip the header
        for row in reader:
            if row[0] == 'empno':
                break # The second part begins
            dno[row[0]] = row[1:]
        writer = csv.writer(o_file)
        for row in reader:
            writer.writerow(row + dno[row[3]])