Using PySpark, I'm trying to aggregate an RDD based on its contents.
My RDD currently looks like this (obviously with a lot more data):
[([u'User1', u'2'], 1), ([u'User2', u'2'], 1), ([u'User1', u'3'], 1)]
I want to aggregate this into the format:
User1 5
User2 2
I'm struggling to work with the RDD, and in particular with the lists inside it, to get at this data. I'm also expected to keep this as an RDD rather than converting it to a DataFrame.
Can anybody show me how to do this, please?
Another solution, very similar to @mck's but slightly more readable, is to use operator.add instead of another lambda function:
from operator import add

# sc is an existing SparkContext
rdd = sc.parallelize([("user1", "2"), ("user2", "2"), ("user1", "3")])
# cast the count to an int so it can be summed
rdd = rdd.map(lambda x: (x[0], int(x[1])))
# sum the values for each key
rdd = rdd.reduceByKey(add)
"""
>>> rdd.collect()
>>> Out[54]: [('user2', 2), ('user1', 5)]
"""