
spark.storage.memoryFraction setting in Apache Spark


According to the Spark documentation:

spark.storage.memoryFraction: Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old" generation of objects in the JVM, which by default is given 0.6 of the heap, but you can increase it if you configure your own old generation size.

I found several blogs and articles where it is suggested to set it to zero in YARN mode. Why is that better than setting it to something close to 1? And in general, what is a reasonable value for it?
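For concreteness, the suggestion amounts to something like this (the app name and the yarn-client master are just placeholders for illustration):

    from pyspark import SparkConf, SparkContext

    # What the blog posts suggest: zero out the cache fraction when running on YARN.
    conf = (SparkConf()
            .setAppName("example-job")                    # placeholder name
            .setMaster("yarn-client")                     # placeholder deploy mode
            .set("spark.storage.memoryFraction", "0"))
    sc = SparkContext(conf=conf)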


Solution

  • The Spark executor's memory is divided into 3 regions:

    1. Storage - Memory reserved for caching
    2. Execution - Memory reserved for object creation
    3. Executor overhead.

    In Spark 1.5.2 and earlier:

    spark.storage.memoryFraction controls how memory is split between regions 1 and 2. The default value is 0.6, so 60% of the allocated executor memory is reserved for caching. In my experience, I've only ever seen this number reduced, never increased. Typically, when a developer hits a GC issue, the application has a higher "churn" of objects, and one of the first places to look for optimizations is lowering the memoryFraction.

    If your application does not cache any data, then you should set it to 0. I'm not sure why that would be specific to YARN, though; can you post the articles?
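    As a rough sketch of that kind of tuning on Spark 1.5.x (the app name and the 0.4 value are only illustrations, not recommendations):

        from pyspark import SparkConf, SparkContext

        # Spark 1.5.x-style tuning: shrink the cache region when the job
        # churns through many short-lived objects and caches little.
        conf = (SparkConf()
                .setAppName("gc-heavy-job")                   # placeholder name
                .set("spark.storage.memoryFraction", "0.4"))  # lowered from the 0.6 default
        sc = SparkContext(conf=conf)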

    In Spark 1.6.0 and later:

    Memory management is now unified: storage and execution share a single region of the heap, so this setting no longer really applies.
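    If you still want to influence the storage/execution split on 1.6.0+, the unified-memory settings are spark.memory.fraction and spark.memory.storageFraction instead. A minimal sketch (the values shown are just the 1.6 defaults, not a tuning recommendation):

        from pyspark import SparkConf, SparkContext

        # Spark 1.6.0+ unified memory: one pool shared by storage and execution.
        conf = (SparkConf()
                .setAppName("unified-memory-job")             # placeholder name
                .set("spark.memory.fraction", "0.75")         # share of heap for the unified pool (1.6 default)
                .set("spark.memory.storageFraction", "0.5"))  # portion of that pool protected for storage
        sc = SparkContext(conf=conf)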