Tags: python, python-3.x, pyspark, lambda

reduceByKey with multiple values to sum in pyspark


I have a csv file that looks like the following:

Philadelphia 30 40
Philadelphia 45 60
Philadelphia  3  4
Wilmington   10 20
Wilmington   15 30
Wilmington    1  2

I read it into PySpark using a map function to parse each line. I now have my data (called input_data) looking like this:

("Philadelphia", 30, 40)
("Philadelphia", 45, 60)
("Philadelphia", 3, 4)
("Wilmington", 10, 20)
("Wilmington", 15, 30)
("Wilmington", 1, 2)

I want to use reduceByKey on this RDD to group by city and sum the values in each tuple, producing something like this:

("Philadelphia", 78, 104)
("Wilmington", 26, 52)

However, when I try to run a reduceByKey line like this

temp = input_data.reduceByKey(lambda x, y: (x[1] + y[1]))

I receive an error indicating that there are too many values to unpack (expected 2). I'm trying to understand what is going on here: why aren't the numeric values grouping as expected? Any additional context that helps me understand this better would be greatly appreciated!


Solution

  • Your RDD (input_data):

    ('Philadelphia', 30, 40)
    ('Philadelphia', 45, 60)
    ('Philadelphia', 3, 4)
    ('Wilmington', 10, 20)
    ('Wilmington', 15, 30)
    ('Wilmington', 1, 2)
    

    reduceByKey only works on pair RDDs, i.e. RDDs whose elements are (key, value) 2-tuples. Your elements are 3-tuples, so Spark cannot unpack them into a key and a value, which is what produces the "too many values to unpack (expected 2)" error. The fix is to first map each row into a (city, (value1, value2)) pair and then aggregate. Try this:

    (
        input_data
        # Re-shape each row into a (key, value) pair: (city, (value1, value2)).
        .map(lambda row: (row[0], (row[1], row[2])))
        # Collect all the (value1, value2) tuples for each city.
        .groupByKey()
        # Sum each position of the grouped tuples separately.
        .map(lambda row: (
            row[0],
            sum([tup[0] for tup in list(row[1])]),
            sum([tup[1] for tup in list(row[1])])
        ))
        .collect()
    )
    

    Output:

    ('Wilmington', 26, 52)
    ('Philadelphia', 78, 104)
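
    If you would rather stick with reduceByKey, as in your original attempt, the same (city, (value1, value2)) pairing makes it work; this is just a sketch of that route:

    (
        input_data
        .map(lambda row: (row[0], (row[1], row[2])))
        # Combine the two-element value tuples pairwise, summing each position.
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
        # Flatten (city, (sum1, sum2)) back into (city, sum1, sum2).
        .map(lambda row: (row[0], row[1][0], row[1][1]))
        .collect()
    )

    Because reduceByKey combines values on each partition before shuffling, it generally scales better than groupByKey when a key has many rows.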