
Combining my results so as not to create another mapper


I am working on a MapReduce project and want to improve my output. I am using a CSV file with data on issued tickets, and I need to see which car colors are ticketed the most. Column 33 contains the vehicle colors, under the header "Vehicle Color". My MapReduce job works, but the result could be better. Column 33 has blank values and many values that are written differently but mean the same thing, for example WH and WHITE, or BK, BLACK and BLA. My reducer counts them as different colors. What is the best way to combine them into one key?

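sys_stdin = open("Parking_Violations.csv", "r")

for line in sys_stdin:
    # Column 33 holds the vehicle color
    vehiclecolor = line.split(",")[33].strip()
    # Skip the header row explicitly; strip("Vehicle Color") would strip a set
    # of characters from both ends rather than remove the header text
    if vehiclecolor.lower() == "vehicle color":
        continue

    # Skip blank values
    if vehiclecolor:
        issuecolor = str(vehiclecolor)
        print("%s\t%s" % (issuecolor, 1))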



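import sys
from operator import itemgetter

# Reducer; assumes the mapper output arrives on standard input
sys_stdin = sys.stdin

dict_color_count = {}

for line in sys_stdin:
    line = line.strip()
    try:
        # Malformed lines (no tab, non-numeric count) are skipped
        color, num = line.split('\t')
        num = int(num)
        dict_color_count[color] = dict_color_count.get(color, 0) + num
    except ValueError:
        pass

# Sort by count, highest first
sorted_dict_color_count = sorted(dict_color_count.items(), key=itemgetter(1), reverse=True)
for color, count in sorted_dict_color_count:
    print('%s\t%s' % (color, count))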

My result after MapReduce:
BLK 35
WH 21
WHITE 20
BK 16
GRAY 14
WHT 8
BLACK 6
BLA 1

Solution

  • I think one approach you could follow is to add a dictionary with all the variants of each color that you have seen so far, and substitute those variants with a single key before you count them. For instance:

    
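    # Dictionary with all the color variants that you have identified so far
    # (BK and WH added from the output shown in the question)
    color_dict = {
        "BLK":["BLK","BK","BLACK","BLA"],
        "WHT":["WHT","WH","WHITE","WHIT"],
    }
    
    for line in sys_stdin:
        vehiclecolor = line.split(",")[33].strip()
        # Skip the header row explicitly; strip("Vehicle Color") strips a set
        # of characters from both ends, which can also eat into real values
        if vehiclecolor.lower() == "vehicle color":
            continue
    
        if vehiclecolor:
            testcolor = str(vehiclecolor).upper()
            issuecolor = testcolor
            # Replace a known variant with its canonical key
            for k,v in color_dict.items():
                if testcolor in v:
                    issuecolor = k
                    break
            print("%s\t%s" % (issuecolor, 1))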
    

    This way, you can substitute the variants you already know about in the mapper and get a cleaner color count, without adding a second mapper.

    Let me know if this helps! :D
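
A possible refinement of the same idea, as a minimal sketch rather than a definitive implementation: invert the variant dictionary once into a flat variant-to-key lookup, so each record needs a single dictionary access instead of a loop over every list. The names COLOR_VARIANTS, CANONICAL and normalize_color are hypothetical, and the variant lists are an assumption built from the output shown in the question (BK and WH included). The sketch re-aggregates the question's counts to show the effect.

```python
from collections import Counter

# Hypothetical variant table: canonical key -> spellings seen in the question's output
COLOR_VARIANTS = {
    "BLK": ["BLK", "BK", "BLACK", "BLA"],
    "WHT": ["WHT", "WH", "WHITE", "WHIT"],
}

# Invert once so each lookup is a single dictionary access
CANONICAL = {
    variant: key
    for key, variants in COLOR_VARIANTS.items()
    for variant in variants
}


def normalize_color(raw):
    """Return the canonical key for a raw color value, or the cleaned value itself."""
    cleaned = raw.strip().upper()
    return CANONICAL.get(cleaned, cleaned)


# Counts copied from the question's MapReduce output
raw_counts = [("BLK", 35), ("WH", 21), ("WHITE", 20), ("BK", 16),
              ("GRAY", 14), ("WHT", 8), ("BLACK", 6), ("BLA", 1)]

# Re-aggregate under the canonical keys
totals = Counter()
for color, count in raw_counts:
    totals[normalize_color(color)] += count

# Print in the same tab-separated format, highest count first
for color, count in totals.most_common():
    print("%s\t%s" % (color, count))
```

With the counts from the question this prints BLK 58, WHT 49 and GRAY 14. In the mapper itself, the call would be issuecolor = normalize_color(vehiclecolor); unknown values such as GRAY pass through unchanged, so new variants can be added to the table as they turn up.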