I have the following RDD:
[(1, 300), (4, 60), (4, 20), (2, 2), (2, 3), (2, 5)]
My expected RDD is:
[(1,[300, 1]), (2,[10, 3]), (4,[80,2])]
The first value in the list within the tuple is the sum(e.g. for 2: its 2+3+5 = 10) and second value is the no. of occurrences (e.g. 1 occurs once). Can the expected RDD be achieved using groupBy function?
You can map each value to a list [x, 1]
, then sum all the lists for each key.
rdd = sc.parallelize([(1, 300), (4, 60), (4, 20), (2, 2), (2, 3), (2, 5)])
result = rdd.mapValues(lambda x: [x, 1]).reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])
result.collect()
# [(1, [300, 1]), (2, [10, 3]), (4, [80, 2])]