Search code examples
apache-sparkapache-spark-sqlapache-spark-2.0off-heap

spark off heap memory config and tungsten


I thought that with the integration of project Tungesten, spark would automatically use off heap memory.

What for are spark.memory.offheap.size and spark.memory.offheap.enabled? Do I manually need to specify the amount of off heap memory for Tungsten here?


Solution

  • Spark/Tungsten use Encoders/Decoders to represent JVM objects as a highly specialized Spark SQL Types objects which then can be serialized and operated on in a highly performant way. Internal format representation is highly efficient and friendly to GC memory utilization.

    Thus, even operating in the default on-heap mode Tungsten alleviates the great overhead of JVM objects memory layout and the GC operating time. Tungsten in that mode does allocate objects on heap for its internal purposes and the allocation memory chunks might be huge but it happens much less frequently and survives GC generation transitions smoothly. This almost eliminates the need to consider moving this internal structure off-heap.

    In our experiments with this mode on and off we did not see a considerable run time improvements. But what you get with off-heap mode on is that one need to carefully design for the memory allocation outside of you JVM process. This might impose some difficulties within container managers like YARN, Mesos etc when you will need to allow and plan for additional memory chunks besides your JVM process configuration.

    Also in off-heap mode Tungsten uses sun.misc.Unsafe which might not be a desired or even possible in your deployment scenarios (with restrictive java security manager configuration for example).

    I am also sharing a time tagged video conference talk from Josh Rosen when he is being asked the similar question.