python, python-2.7, csv, defaultdict

Counting how many unique identifiers there are by merging two columns of data?


I'm trying to write a really simple counting script, I guess using defaultdict. (I can't get my head around how to use defaultdict, so if someone could post a commented snippet of code I would greatly appreciate it.)

My objective is to take element 0 and element 1 of each row, merge them into a single string, and then count how many unique strings there are.

For example, the data below has 15 lines containing 3 classes and 4 class IDs, which give 4 unique strings when merged. The merged data for the first line (ignoring the title row) is: Class01CD2

CSV Data:

uniq1,uniq2,three,four,five,six
Class01,CD2,data,data,data,data
Class01,CD2,data,data,data,data
Class01,CD2,data,data,data,data
Class01,CD2,data,data,data,data
Class02,CD3,data,data,data,data
Class02,CD3,data,data,data,data
Class02,CD3,data,data,data,data
Class02,CD3,data,data,data,data
Class02,CD3,data,data,data,data
Class02,CD3,data,data,data,data
Class02,CD3,data,data,data,data
DClass2,DE2,data,data,data,data
DClass2,DE2,data,data,data,data
Class02,CD1,data,data,data,data
Class02,CD1,data,data,data,data

The idea is simply to print out how many unique classes there are. Is anyone able to help me work this out?

Regards
- Hyflex


Solution

  • Since you are dealing with CSV data, you can use the csv module together with a dictionary that maps each merged string to its count:

    import csv

    uniq = {}  # Maps each merged string to the number of rows it appears in.

    # Replace 'data.csv' with the name of your CSV file.
    with open('data.csv', 'r') as ifile:
        reader = csv.reader(ifile)
        for row in reader:
            joined = row[0] + row[1]  # Merge the first and second columns.
            if joined in uniq:
                uniq[joined] += 1  # Seen before: increment its count.
            else:
                uniq[joined] = 1  # First occurrence: start its count at 1.

    print(uniq)  # Per-string counts; len(uniq) is the number of unique strings.
    

    This outputs:

    {'Class02CD3': 7, 'Class02CD1': 2, 'Class01CD2': 4, 'DClass2DE2': 2}
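
The question asks for the number of unique merged strings rather than the per-string tallies. With the dictionary above that number is `len(uniq)`. If the tallies are not needed at all, a set is probably enough. The following is a minimal sketch, assuming header-less rows; the `lines` list is a hypothetical stand-in for the real file so the snippet runs on its own:

```python
import csv

# Hypothetical header-less sample rows standing in for data.csv.
lines = [
    "Class01,CD2,data,data,data,data",
    "Class01,CD2,data,data,data,data",
    "Class02,CD3,data,data,data,data",
    "Class02,CD3,data,data,data,data",
    "DClass2,DE2,data,data,data,data",
    "Class02,CD1,data,data,data,data",
]

# csv.reader accepts any iterable of strings, not only file objects.
# A set stores each merged string once, so its size is the unique count.
unique = {row[0] + row[1] for row in csv.reader(lines)}
unique_count = len(unique)

print(unique_count)
```

With a real file, the same set comprehension can be run over `csv.reader(ifile)` inside a `with open(...)` block.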
    

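    NOTE: This assumes the CSV has no header row (uniq1,uniq2,three,four,five,six). If your file does have one, skip it by calling next(reader) once before the loop; otherwise the header is counted as an extra string, 'uniq1uniq2'.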
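
Since the question asks specifically about defaultdict, here is a commented sketch of the same count using `collections.defaultdict`. It is one possible way to write it, not the only one. The `lines` list is a hypothetical sample that includes the header row, and the names `counts` and `unique_total` are made up for illustration:

```python
import csv
from collections import defaultdict

# Hypothetical sample mirroring the question's CSV, header row included.
# A real script would pass an open file object to csv.reader instead.
lines = [
    "uniq1,uniq2,three,four,five,six",
    "Class01,CD2,data,data,data,data",
    "Class01,CD2,data,data,data,data",
    "Class02,CD3,data,data,data,data",
    "DClass2,DE2,data,data,data,data",
    "Class02,CD1,data,data,data,data",
]

reader = csv.reader(lines)
next(reader)  # Skip the header row so it is not counted.

# defaultdict(int) calls int() for a missing key, which yields 0,
# so there is no need to check whether the key exists first.
counts = defaultdict(int)
for row in reader:
    counts[row[0] + row[1]] += 1

unique_total = len(counts)  # Number of distinct merged strings.
print(unique_total)
```

The `if`/`else` from the plain-dictionary version disappears because the default value of 0 is created the first time a key is used.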

    REFERENCES:

    http://docs.python.org/2/library/stdtypes.html#dict