I have a csv file that looks like the following:
Philadelphia 30 40
Philadelphia 45 60
Philadelphia 3 4
Wilmington 10 20
Wilmington 15 30
Wilmington 1 2
I read it into pyspark by using a map function to parse each line. I now have my data (called input_data) looking like this:
("Philadelphia", 30, 40)
("Philadelphia", 45, 60)
("Philadelphia", 3, 4)
("Wilmington", 10, 20)
("Wilmington", 15, 30)
("Wilmington", 1, 2)
I am trying to use this RDD with reduceByKey to group by city and sum the numeric values in each tuple, producing something like this:
("Philadelphia", 78, 104)
("Wilmington", 26, 52)
However, when I run a reduceByKey line like this:
temp = input_data.reduceByKey(lambda x, y: (x[1] + y[1]))
I receive an error saying there are too many values to unpack (expected 2). I am trying to better understand what is going on here. Why are the numeric values not grouping as expected? Any additional context that would help me understand this better would be greatly appreciated!
Your RDD (input_data):
('Philadelphia', 30, 40)
('Philadelphia', 45, 60)
('Philadelphia', 3, 4)
('Wilmington', 10, 20)
('Wilmington', 15, 30)
('Wilmington', 1, 2)
reduceByKey only works on an RDD of key-value pairs, i.e. 2-tuples of the form (key, value). Your records are 3-tuples, so Spark cannot unpack each one into a key and a value, which is exactly the "too many values to unpack (expected 2)" error. First map each record to (city, (value1, value2)), then aggregate. One way, using groupByKey:
input_data \
    .map(lambda row: (row[0], (row[1], row[2]))) \
    .groupByKey() \
    .map(lambda row: (
        row[0],
        sum([tup[0] for tup in list(row[1])]),
        sum([tup[1] for tup in list(row[1])])
    )) \
    .collect()
Output:
('Wilmington', 26, 52)
('Philadelphia', 78, 104)
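Since you asked about reduceByKey specifically: the same result can be produced with it, and it is usually preferable to groupByKey because values are combined on each partition before the shuffle. A sketch under the same assumptions (map to (city, (v1, v2)) pairs first, then add the value tuples element-wise):

input_data \
    .map(lambda row: (row[0], (row[1], row[2]))) \
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
    .map(lambda kv: (kv[0], kv[1][0], kv[1][1])) \
    .collect()

This yields the same two tuples as above (the order may differ).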