Tags: python, pandas, numpy, data-cleaning

Print out values in a dictionary to a new csv file


I have a CSV file that looks like this:

year,gender,age,country
2002,F,9-10,CO
2002,F,9-10,CO
2002,M,9-10,CO
2002,F,9-10,BR
2002,M,11-15,BR
2002,F,11-15,CO
2003,F,9-10,CO
2003,M,9-10,CO
2003,F,9-10,BR
2003,M,9-10,CO
2004,F,11-15,BR
2004,F,11-15,CO
2004,F,9-10,BR
2004,F,9-10,CO

And I want to get an output file like this:

year,gender,age,country,population
2002,F,9-10,CO,2
2002,M,9-10,CO,1
2002,F,9-10,BR,1
2002,M,9-10,BR,0
2002,F,11-15,CO,1
2002,M,11-15,CO,0
2002,F,11-15,BR,0
2002,M,11-15,BR,1
2003,F,9-10,CO,1
2003,M,9-10,CO,2
2003,F,9-10,BR,1
2003,M,9-10,BR,0
2003,F,11-15,CO,0
2003,M,11-15,CO,0
2003,F,11-15,BR,0
2003,M,11-15,BR,0
2004,F,9-10,CO,1
2004,M,9-10,CO,0
2004,F,9-10,BR,1
2004,M,9-10,BR,0
2004,F,11-15,CO,1
2004,M,11-15,CO,0
2004,F,11-15,BR,1
2004,M,11-15,BR,0

Basically, I want to print the number of rows for each year, gender, age group, and country, so year, gender, age, and country together form the dictionary key. Some years have no data for a specific country, or no data for a specific age group in a specific country. For example, in 2003 there is no female data for the 11-15 age group in country CO. In that case the population should be 0. Some years also have no data at all for a specific gender. For example, 2004 has no male data for any age group or country, but I still want those rows in the output file with a population of 0.

Below is the Python code I wrote, but it doesn't work, and I don't know how to handle the missing data and print it as 0 in the population field.

import csv
import os
import sys
from operator import itemgetter, attrgetter
import math
from collections import Counter

# Create dictionary to hold the data
valDic = {}

# Read data into dictionary
with open(sys.argv[1], "r",) as inputfile:
    readcsv = csv.reader(inputfile, delimiter = ',')    
    next(readcsv)
    for line in readcsv:
        key = line[0] + line[1] + line[2] + line[3]
        year = line[0]
        gender = line[1]
        age = line[2]
        country = line[3]
        if key in valDic:
            key = key + 1
        else:
            valDic[key] = [year, gender, age, country, 0] # 0s are placeholder for running sum and itemCount
    inputfile.close()  

newcsvfile = []

for key in valDic:
    newcsvfile.append([valDic[key][0], valDic[key][1], valDic[key][2], valDic[key][3], len(valDic[key])])

newcsvfile = sorted(newcsvfile)
newcsvfile = [["year", "gender", "age", "country", "population"]] 

with open(sys.argv[2], "w") as outputfile:
    writer = csv.writer(outputfile)
    writer.writerows(newcsvfile)        

Solution

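  • Store each combination of year, gender, age, and country as a tuple and use it as the dictionary key, with the count as the value. While reading the file, also collect the distinct values seen in each column in four sets. After reading, iterate over every possible combination of those values and add any combination that is missing from the dictionary with a count of 0 (for example, 2004 has only female rows, so every male row for 2004 gets 0). Because the keys are tuples of strings, sorting them is lexicographic, so "11-15" comes before "9-10" in the output.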

    Demo:

    import csv
    import sys
    
    # Create dictionary to hold the data
    valDic = {}
    
    years, genders, age, country = set(), set(), set(), set()
    
    # Read data into dictionary
    with open(sys.argv[1], 'r', newline='') as inputfile:
    
        reader = csv.reader(inputfile, delimiter = ',')
        next(reader)
    
        for row in reader:
    
            key = (row[0], row[1], row[2], row[3])
    
            years.add(key[0])
            genders.add(key[1])
            age.add(key[2])
            country.add(key[3])
    
            # Count this combination, starting from 0 the first time it is seen
            valDic[key] = valDic.get(key, 0) + 1
    
    
    # Add missing combinations with a count of 0
    for y in years:
        for g in genders:
            for a in age:
                for c in country:
                    key = (y, g, a, c)
                    if key not in valDic:
                        valDic[key]=0
    
    # Prepare the rows of the new CSV, starting with the header
    newcsvfile = [["year", "gender", "age", "country", "population"]] 
    
    for key, val in sorted(valDic.items()):
        newcsvfile.append([key[0], key[1], key[2], key[3], val])
    
    with open(sys.argv[2], "w", newline='') as outputfile:
        writer = csv.writer(outputfile)
        writer.writerows(newcsvfile)  
    

    Outputs:

    year,gender,age,country,population
    2002,F,11-15,BR,0
    2002,F,11-15,CO,1
    2002,F,9-10,BR,1
    2002,F,9-10,CO,2
    2002,M,11-15,BR,1
    2002,M,11-15,CO,0
    2002,M,9-10,BR,0
    2002,M,9-10,CO,1
    2003,F,11-15,BR,0
    2003,F,11-15,CO,0
    2003,F,9-10,BR,1
    2003,F,9-10,CO,1
    2003,M,11-15,BR,0
    2003,M,11-15,CO,0
    2003,M,9-10,BR,0
    2003,M,9-10,CO,2
    2004,F,11-15,BR,1
    2004,F,11-15,CO,1
    2004,F,9-10,BR,1
    2004,F,9-10,CO,1
    2004,M,11-15,BR,0
    2004,M,11-15,CO,0
    2004,M,9-10,BR,0
    2004,M,9-10,CO,0
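
The demo's output is sorted lexicographically, so "11-15" comes before "9-10" and the rows are not in the order shown in the question. If that order matters, one option is to keep each column's values in first-seen order and enumerate the combinations with itertools.product. The following is a minimal sketch under that assumption. The function name count_with_zeros and the sample rows are illustrative, not part of the original answer, and the function works on rows already read into memory (for example from csv.reader after skipping the header).

```python
from collections import Counter
from itertools import product


def count_with_zeros(rows):
    """Count each (year, gender, age, country) row; unseen combinations get 0.

    Illustrative helper: `rows` is assumed to be an iterable of 4-item
    sequences of strings, with the header row already skipped.
    """
    rows = [tuple(r) for r in rows]
    if not rows:
        return []

    # Counter returns 0 for keys it has never seen, so no explicit fill-in loop is needed
    counts = Counter(rows)

    # dict.fromkeys drops duplicates while keeping first-seen order
    years, genders, ages, countries = (
        list(dict.fromkeys(column)) for column in zip(*rows)
    )

    # Vary gender fastest, then country, then age, then year,
    # which mirrors the row order in the question's desired output
    return [
        [y, g, a, c, counts[(y, g, a, c)]]
        for y, a, c, g in product(years, ages, countries, genders)
    ]


if __name__ == "__main__":
    sample = [
        ("2002", "F", "9-10", "CO"),
        ("2002", "F", "9-10", "CO"),
        ("2002", "M", "9-10", "CO"),
        ("2004", "F", "11-15", "BR"),
    ]
    for row in count_with_zeros(sample):
        print(",".join(map(str, row)))
```

The returned list can be passed to csv.writer(...).writerows() after writing the header row, exactly as in the demo. Note that first-seen order depends on the input file: a value that first appears late in the file is also listed late in the output.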