Tags: python, apache-spark, pyspark, rdd

Get the sum and count of RDD values by key using groupBy?


I have the following RDD:

[(1, 300), (4, 60), (4, 20), (2, 2), (2, 3), (2, 5)]

My expected RDD is:

[(1, [300, 1]), (2, [10, 3]), (4, [80, 2])]

The first value in the list within each tuple is the sum of the values for that key (e.g. for key 2 it's 2 + 3 + 5 = 10), and the second value is the number of occurrences of that key (e.g. key 1 occurs once). Can the expected RDD be achieved using the groupBy function?


Solution

  • You can map each value to a list [x, 1], then sum the lists element-wise for each key with reduceByKey:

    # Assumes an active SparkContext `sc` (e.g. from the pyspark shell).
    rdd = sc.parallelize([(1, 300), (4, 60), (4, 20), (2, 2), (2, 3), (2, 5)])
    
    # Turn each value into [value, 1]; as values with the same key are
    # reduced, the sums and counts are added pairwise.
    result = (rdd.mapValues(lambda x: [x, 1])
                 .reduceByKey(lambda a, b: [a[0] + b[0], a[1] + b[1]]))
    
    result.collect()
    # [(1, [300, 1]), (2, [10, 3]), (4, [80, 2])]
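
  • To answer the question as asked: yes, grouping works too. Below is a minimal sketch using groupByKey (the pair-RDD counterpart of groupBy). Note that reduceByKey is usually preferred, since it combines values on each partition before the shuffle, whereas groupByKey ships every individual value across the network. Ordering of the collect() output may vary:

    # Group all values per key, then compute [sum, count] in one pass.
    grouped = rdd.groupByKey().mapValues(lambda vals: [sum(vals), len(vals)])
    
    grouped.collect()
    # [(1, [300, 1]), (2, [10, 3]), (4, [80, 2])]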