Tags: apache-spark, pyspark, rdd, reduce

PySpark Reduce on RDD with only single element


Is there any way to deal with RDDs that contain only a single element (this can sometimes happen in my case)? When that happens, reduce stops working, since the operation requires two inputs.

I am working with key-value pairs such as:

(key1, 10),
(key2, 20),

And I want to aggregate their values, so the result should be:

30

But there are cases where the RDD contains only a single key-value pair, so reduce does not work here. For example:

(key1, 10)

This will return nothing.
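The behaviour is easier to see with Python's built-in functools.reduce, whose semantics Spark's reduce mirrors: with a single element there is nothing to combine, so the pairwise function never fires and the lone element comes back unchanged, a tuple rather than a number. A minimal sketch (the combiner lambda here is a hypothetical example of the kind that triggers the problem):

```python
from functools import reduce

# A pairwise combiner written for (key, value) tuples -- hypothetical,
# but representative of what fails in the question.
combine = lambda a, b: (a[0], a[1] + b[1])

pairs = [('key1', 10), ('key2', 20)]
print(reduce(combine, pairs))   # ('key1', 30)

# With one element, combine is never called; the tuple itself is returned.
single = [('key1', 10)]
print(reduce(combine, single))  # ('key1', 10), not 10
```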


Solution

  • If you call .values() before the reduce, it works even if there is only one element in the RDD:

    from operator import add
    
    rdd = sc.parallelize([('key1', 10),])
    
    rdd.values().reduce(add)
    # 10
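If the RDD could also be empty, a plain reduce still raises an error; supplying a zero value avoids that too. In PySpark the analogous calls are rdd.values().fold(0, add) or rdd.values().sum(). The semantics can be sketched with functools.reduce and an initial value:

```python
from functools import reduce
from operator import add

# reduce with an initial value (analogous to Spark's fold with a
# zeroValue) handles single-element and empty inputs gracefully.
print(reduce(add, [10], 0))      # 10
print(reduce(add, [10, 20], 0))  # 30
print(reduce(add, [], 0))        # 0
```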