python, apache-spark, pyspark, rdd

PySpark - reduceByKey on a (tuple, int) value


I have an RDD whose records look like:

(key, ((tuple), int))

I want to reduce it by key and get the average of each position in the tuple. For example, with this small sample:

[(1, ((0, 19, 15, 39), 1)), (1, ((0, 64, 19, 3), 1))]

I will get:

[(1, ((0, 83, 34, 42), 2))]

and then (or directly):

[(1, (0, 41.5, 17, 21))]

I tried:

reduceByKey(lambda a,b: a+b)
reduceByKey(lambda a,b: (a[0]+b[0],a[1]+b[1]))

and other variations that didn't help or raised RDD errors.

How can I solve the issue?


Solution

  • You need some further calculation to get the average per key: first sum the value tuples element-wise along with the counts, then divide each summed element by the count:

    # sum the value tuples element-wise and add the counts, then divide each element by the count
    result = rdd.reduceByKey(lambda a, b: (tuple(i + j for i, j in zip(a[0], b[0])), a[1] + b[1])) \
                .map(lambda r: (r[0], tuple(i / r[1][1] for i in r[1][0])))
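
    As a cross-check, here is a minimal self-contained sketch that runs the same pipeline on the sample data from the question. It assumes a local Spark installation; the context creation and variable names (sc, rdd) are illustrative, not part of the original answer:

    from pyspark import SparkContext

    sc = SparkContext("local", "tuple-average")   # assumed local setup

    # sample RDD from the question: (key, ((values), count))
    rdd = sc.parallelize([
        (1, ((0, 19, 15, 39), 1)),
        (1, ((0, 64, 19, 3), 1)),
    ])

    result = rdd.reduceByKey(lambda a, b: (tuple(i + j for i, j in zip(a[0], b[0])), a[1] + b[1])) \
                .map(lambda r: (r[0], tuple(i / r[1][1] for i in r[1][0])))

    print(result.collect())   # [(1, (0.0, 41.5, 17.0, 21.0))]

    Note that i / r[1][1] uses true division on Python 3; on Python 2 you would need float(i) / r[1][1] (or from __future__ import division) to avoid the averages being truncated to integers.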