python, apache-spark, pyspark, rdd, python-itertools

How to use mapPartitions in pyspark


After following the Apache Spark documentation, I tried to experiment with the mapPartitions method. In the following code, I expected to get the initial RDD back, since in the function myfunc I just return the iterator after printing a value. But when I do collect on the RDD, it is empty.

from pyspark import SparkConf
from pyspark import SparkContext                                                          

def myfunc(it):
    print(it.next())
    return it

def fun1(sc):
    n = 5
    rdd = sc.parallelize([x for x in range(n+1)], n)
    print(rdd.mapPartitions(myfunc).collect())


if __name__ == "__main__":                                                                
    conf = SparkConf().setMaster("local[*]")                                              
    conf = conf.setAppName("TEST2")                                                       
    sc = SparkContext(conf = conf)                                                        
    fun1(sc)

Solution

  • mapPartitions is not the problem here. Iterators (here itertools.chain objects) are stateful and can be traversed only once. When you call it.next() you read and discard the first element, and what you return is only the remaining tail of the sequence.

    When a partition contains only one item (which should be the case for all but one partition here), you effectively discard the whole partition. A corrected version of myfunc is sketched after these notes.

    A few notes:

    • Printing to stdout inside a task is typically useless: on a cluster the output ends up in the executor logs, not on the driver.
    • The way you use next is not portable: it.next() works only in Python 2. Use the builtin next(it), which works in both Python 2 and 3.
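For reference, here is a minimal sketch of two ways to inspect a partition inside mapPartitions without losing its contents. The helper names myfunc_list and myfunc_chain are just illustrative; the RDD is the same five-partition RDD built in fun1 from the question, and nothing here relies on Spark beyond mapPartitions itself.

from itertools import chain

def myfunc_list(it):
    # Variant 1: materialize the partition, inspect it, then hand it back.
    # Simple and safe for small partitions; the whole partition is held in memory.
    items = list(it)
    if items:
        print(items[0])
    return iter(items)

def myfunc_chain(it):
    # Variant 2: consume one element with the portable builtin next(),
    # then re-attach it so the full partition is still returned.
    try:
        first = next(it)
    except StopIteration:
        return iter([])
    print(first)
    return chain([first], it)

With either variant, rdd.mapPartitions(myfunc_list).collect() returns [0, 1, 2, 3, 4, 5] for the RDD built in fun1. In local[*] mode the printed values usually show up in the console as well, but on a real cluster they appear only in the executor output.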