Tags: apache-spark, hadoop, mapreduce, distributed-computing

Memory usage in Spark standalone setup


I have a Spark setup with a single worker that has 6 cores. I broadcast an object x to the worker. I have three questions:

  1. For a map-reduce job, will 6 copies of my object x be generated, or will a single copy of x be shared by all the cores?

  2. What is the life cycle of x, i.e., when will it get destroyed? I'm asking because this object x takes up a good amount of memory.

  3. Is there some other way to share an object among all 6 cores if I read that object from a file?


Solution

  • Broadcast data is transmitted and stored once per executor (i.e., per Java process), not once per core. In other words, if you have a single node and set spark.executor.instances to 2 and spark.executor.cores to 3, you will end up with two Java processes on that node, each holding its own copy of your data, regardless of how many tasks run inside each process. This is one of the benefits of broadcasting over simply passing the data into your executor code through closures, which ships a copy with every task. (A configuration sketch follows this answer.)

    As for the life cycle, the broadcast data is removed once the broadcast handle on the driver no longer has any references to it, and only after any tasks using that broadcast data have finished running. If you watch the Spark logs you will see messages along the lines of "Broadcast removed" when this happens.
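    To make the layout concrete, here is a minimal sketch of the setup described above, using Scala and the classic SparkContext API; the master URL and the names lookup and bcast are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// One node, two executor JVMs with three cores each (as in the example above).
val conf = new SparkConf()
  .setAppName("BroadcastDemo")
  .setMaster("spark://master-host:7077") // hypothetical standalone master URL
  .set("spark.executor.instances", "2")
  .set("spark.executor.cores", "3")
val sc = new SparkContext(conf)

// `lookup` stands in for the large object x; sc.broadcast ships it
// once per executor JVM, not once per task or per core.
val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)
val bcast = sc.broadcast(lookup)

// Every task scheduled on the same executor reads that executor's single copy.
val result = sc.parallelize(Seq("a", "b", "a"))
  .map(k => bcast.value.getOrElse(k, 0))
  .collect()
```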
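    If waiting for the driver-side reference to be garbage collected is not acceptable for a memory-heavy object, the Broadcast handle can also be cleaned up explicitly through Spark's unpersist and destroy methods; continuing with the hypothetical bcast from the sketch above:

```scala
// Drop the copies stored on the executors but keep the handle valid;
// Spark will re-send the data if a later task reads bcast.value again.
bcast.unpersist(blocking = true)

// Or release the data on both the driver and the executors for good;
// any use of bcast.value after this will fail.
bcast.destroy()
```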