Search code examples
apache-sparkrdd

What is a glom?. How it is different from mapPartitions?


I've come across the glom() method on RDD. As per the documentation

Return an RDD created by coalescing all elements within each partition into an array

Does glom shuffle the data across the partitions or does it only return the partition data as an array? In the latter case, I believe that the same can be achieved using mapPartitions.

I would also like to know if there are any use cases that benefit from glom.


Solution

  • Does glom shuffle the data across partitions

    No, it doesn't

    If this is the second case I believe that the same can be achieved using mapPartitions

    It can:

    rdd.mapPartitions(iter => Iterator(_.toArray))
    

    but the same thing applies to any non shuffling transformation like map, flatMap or filter.

    if there are any use cases which benefit from glob.

    Any situation where you need to access partition data in a form that is traversable more than once.