Search code examples
pythoncsvinner-joincsvkit

CSV joining based on keys


This may be a simple/repeat question, but I could find/figure out yet how to do it.

I have two csv files:

info.csv:

"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076

abc, xyz, 1234, 982-128-0000, pqt,

bcd, uvw, 3124, 813-222-1111, tre, 

poi, ccc, 9087, 123-45607890, weq,

and then

age.csv:

student_id,age_1

3124,20

9087,21

1234,45

I want to compare the two csv files, based on the columns "id" from info.csv and "student_id" from age.csv and take the corresponding "age_1" data and put it into the "age" column in info.csv.

So the final output should be:

info.csv:

"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076
 abc, xyz, 1234, 982-128-0000, pqt,45
 bcd, uvw, 3124, 813-222-1111, tre,20
 poi, ccc, 9087, 123-45607890, weq,21

I am able to simply join the tables based on the keys into a new.csv, but can't put the data in the columns titles "age". I used "csvkit" to do that.

Here is what I used:

csvjoin -c 3,1 info.csv age.csv > new.csv

Solution

  • Try this...

    import csv
    
    info = list(csv.reader(open("info.csv", 'rb')))
    age = list(csv.reader(open("age.csv", 'rb')))
    
    def copyCSV(age, info, outFileName = 'out.csv'):
        # put age into dict, indexed by ID
        # assumes no duplicate entries
    
        # 1 - build a dict ageDict to represent data
        ageDict = dict([(entry[0].replace(' ',''), entry[1]) for entry in age[1:] if entry != []])
    
        # 2 - setup output
        with open(outFileName, 'wb') as outFile:
            outwriter = csv.writer(outFile)
            # 3 - run through info and slot in ages and write to output
            # nb: had to use .replace(' ','') to strip out whitespaces - these may not be in original .csv
            outwriter.writerow(info[0])
            for entry in info[1:]:
                if entry != []:
                    key = entry[2].replace(' ','')
                    if key in ageDict: # checks that you have data from age.csv
                        entry[5] = ageDict[key]
                outwriter.writerow(entry)
    
    copyCSV(age, info)
    

    Let me know if it works or if anything is unclear. I've used a dict because it should be faster if your files are massive, as you only have to loop through the data in age.csv once.

    There may be a simpler way / something already implemented...but this should do the trick.