After following the Apache Spark documentation, I tried to experiment with the `mapPartitions` method. In the following code, I expected to see the initial RDD printed by the function `myfunc`; I just return the iterator back after printing the values. But when I `collect` on the RDD, it is empty.
```python
from pyspark import SparkConf
from pyspark import SparkContext

def myfunc(it):
    print(it.next())
    return it

def fun1(sc):
    n = 5
    rdd = sc.parallelize([x for x in range(n + 1)], n)
    print(rdd.mapPartitions(myfunc).collect())

if __name__ == "__main__":
    conf = SparkConf().setMaster("local[*]")
    conf = conf.setAppName("TEST2")
    sc = SparkContext(conf=conf)
    fun1(sc)
```
`mapPartitions` is not relevant here. Iterators (here `itertools.chain`) are stateful and can be traversed only once. When you call `it.next()` you read and discard the first element, and what you return is the tail of the sequence. When a partition has only one item (which should be the case for all but one partition here), you effectively discard the whole partition.
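The single-pass behaviour can be reproduced with a plain Python iterator, outside Spark entirely (a minimal sketch, no Spark required):

```python
# Iterators are stateful: consuming an element removes it from the stream.
it = iter([0])           # a one-element "partition"
first = next(it)         # read and discard the first element
remaining = list(it)     # what would be collected afterwards
print(first, remaining)  # the tail is empty: the whole partition is gone
```

This is exactly what happens inside each one-element partition of the RDD: after the `print`, the returned iterator has nothing left to yield.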
A few notes:

- `it.next()` is not portable and cannot be used in Python 3; use the builtin `next(it)` instead.
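One portable fix is to re-attach the inspected element with `itertools.chain` so the full partition is still returned. A sketch of a corrected `myfunc` (the simulated-partition call at the bottom is only for illustration, in place of a real Spark job):

```python
from itertools import chain

def myfunc(it):
    # Portable across Python 2 and 3: use the builtin next(), not it.next()
    try:
        first = next(it)
    except StopIteration:
        return iter([])  # empty partition: nothing to print
    print(first)
    # Put the consumed element back so the partition stays intact
    return chain([first], it)

# Simulating one single-element partition without Spark:
result = list(myfunc(iter([3])))  # prints 3; result is [3]
```

Passing this `myfunc` to `rdd.mapPartitions` should make `collect()` return all the original elements, since nothing is discarded.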