Search code examples
apache-sparkrdd

Differences: Object instantiation within mapPartitions vs outside


I'm a beginner to Apache Spark. Spark's RDD API offers transformation functions like map, mapPartitions. I can understand that map works with each element from the RDD but mapPartitions works with each partition and many people have mentioned mapPartitions are ideally used where we want to do object creation/instantiation and provided examples like:

val rddData = sc.textFile("sample.txt")
val res = rddData.mapPartitions(iterator => {
   //Do object instantiation here
   //Use that instantiated object in applying the business logic
   })

My question is can we not do that with map function itself by doing object instantiation outside the map function like:

val rddData = sc.textFile("sample.txt")
val obj = InstantiatingSomeObject

val res = rddData.map(element => 
    //Use the instantiated object 'obj' and do something with data
)

I could be wrong in my fundamental understanding of map and mapPartitions and if the question is wrong, please correct me.


Solution

  • All objects that you create outside of your lambdas are created on the driver. For each execution of the lambda, they are sent over the network to the specific executor.

    When calling map, the lambda is executed once per data element, causing to send your serialized object once per data element over the network. When using mapPartitions, this happens only once per partition. However, even when using mapPartions, it would usually be better to create the object inside of your lambda. In many cases, your object is not serializable (like a database connection for example). In this case you have to create the object inside your lambda.