pyspark rdd reduce

Python: reduce by key with an if condition?


(K1, (v1, v2))
(K2, (v3, v4))
(K1, (v1, v5))
(K2, (v3, v6))

How can I sum the second values for each key, provided the first values are the same or equal, so that I get (K1, (v1, v2+v5)) and (K2, (v3, v4+v6))?


Solution

  • IIUC, you need to move the first value into the key before the reduce, so that only rows with the same key and the same first value are combined, and then map the result back into the desired format.

    You should be able to do the following:

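    # Re-key on (K, first value), add up the second values, then restore (K, (first, total)).
    # reduceByKey expects a two-argument function, so use a + b rather than the built-in sum.
    new_rdd = rdd.map(lambda row: ((row[0], row[1][0]), row[1][1]))\
        .reduceByKey(lambda a, b: a + b)\
        .map(lambda row: (row[0][0], (row[0][1], row[1])))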
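
To see what the three steps do without starting a Spark session, here is a rough plain-Python sketch of the same idea. The sample rows are made up for illustration (the question only gives symbolic values), and the dictionary loop merely stands in for `reduceByKey`; it is not how Spark implements it.

```python
# Hypothetical sample data; the last row has a different first value,
# so it should NOT be merged with the other K1 rows.
rows = [
    ("K1", ("a", 2)),
    ("K2", ("b", 4)),
    ("K1", ("a", 5)),
    ("K2", ("b", 6)),
    ("K1", ("c", 1)),
]

# Step 1: move the first value into the key -> ((K, first), second)
rekeyed = [((k, v[0]), v[1]) for k, v in rows]

# Step 2: combine values that share a composite key (stand-in for reduceByKey)
totals = {}
for key, value in rekeyed:
    totals[key] = totals[key] + value if key in totals else value

# Step 3: map back to the original shape -> (K, (first, total))
result = sorted((k, (first, total)) for (k, first), total in totals.items())
print(result)
# [('K1', ('a', 7)), ('K1', ('c', 1)), ('K2', ('b', 10))]
```

Because the first value is part of the key, the "if the first value is the same" condition is handled by the grouping itself, and no explicit `if` is needed inside the reduce function.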