Tags: apache-spark, caching, out-of-memory, rdd, partitioning

How does Spark handle out-of-memory errors when cached (MEMORY_ONLY persistence) data does not fit in memory?


I'm new to Spark and I haven't been able to find a clear answer to this: what happens when cached data does not fit in memory?

In many places I've read that if the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.

For example: let's say 500 partitions are created and 200 of them couldn't be cached; then those 200 partitions have to be recomputed each time by re-evaluating the RDD lineage.

If that is the case, then an OOM error should never occur, but it does. What is the reason?

A detailed explanation would be highly appreciated. Thanks in advance.


Solution

  • There are different ways you can persist a DataFrame in Spark.

    1) Persist (MEMORY_ONLY)

    When you persist a DataFrame with MEMORY_ONLY, it is cached in Spark's storage memory as deserialized Java objects. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level for RDDs (rdd.cache()) and can sometimes cause an OOM error when the RDD is too big to fit in memory (it can also occur during the recomputation effort).

    To answer your question:

    If that is the case, then an OOM error should never occur, but it does. What is the reason?

    Even after recomputation, those partitions still have to fit in memory while they are being processed. If no space is available, the GC will try to free some and retry the allocation; if that doesn't succeed, the task fails with an OOM error. A minimal sketch of this level follows below.
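    Here is a minimal Scala sketch of the MEMORY_ONLY behavior; the input path and the DataFrame name df are hypothetical, not from the question:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("memory-only-demo").getOrCreate()

    // Hypothetical input; any large DataFrame behaves the same way.
    val df = spark.read.parquet("/data/events")

    // Cache as deserialized Java objects in storage memory.
    // Partitions that don't fit are simply not stored and will be
    // recomputed from the lineage on every subsequent action.
    df.persist(StorageLevel.MEMORY_ONLY)

    df.count() // first action: computes the RDD and caches what fits
    df.count() // cached partitions are reused; the rest are recomputed
    ```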


    2) Persist (MEMORY_AND_DISK)

    When you persist a DataFrame with MEMORY_AND_DISK, it is cached in Spark's storage memory as deserialized Java objects, and if there isn't enough room on the heap, partitions are spilled to disk. So to tackle memory pressure it spills part or all of the data to disk. (Note: make sure the nodes have enough disk space, otherwise no-disk-space errors will pop up.) See the short sketch after this item.
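    Continuing the sketch above (a storage level can't be changed while the DataFrame is cached, so the old one is dropped first):

    ```scala
    import org.apache.spark.storage.StorageLevel

    // Drop the previous level, then cache with disk spill enabled.
    df.unpersist()
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()

    // Dataset.storageLevel reports the level currently in effect.
    println(df.storageLevel)
    ```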


    3) Persist (MEMORY_ONLY_SER)

    When you persist a DataFrame with MEMORY_ONLY_SER, it is cached in Spark's storage memory as serialized Java objects (one byte array per partition). This is generally more space-efficient than MEMORY_ONLY, but it is more CPU-intensive because the data has to be serialized and deserialized (the general suggestion here is to use Kryo for serialization). It still faces OOM issues similar to MEMORY_ONLY. A Kryo setup sketch follows below.
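    A sketch of enabling Kryo before using a serialized level; the Event case class is a hypothetical record type, and registration is optional but keeps the serialized bytes smaller:

    ```scala
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    case class Event(id: Long, payload: String) // hypothetical record type

    val conf = new SparkConf()
      .setAppName("kryo-demo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes lets Kryo write a numeric ID
      // instead of the full class name into every record.
      .registerKryoClasses(Array(classOf[Event]))

    val spark = SparkSession.builder().config(conf).getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq(Event(1L, "a"), Event(2L, "b")))
    rdd.persist(StorageLevel.MEMORY_ONLY_SER) // one byte array per partition
    rdd.count()
    ```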


    4) Persist (MEMORY_AND_DISK_SER)

    This is similar to MEMORY_ONLY_SER, with one difference: when no heap space is available, it spills the serialized partitions to disk, the same as MEMORY_AND_DISK. We can use this option when there is a tight constraint on disk space and we want to reduce I/O traffic, since the spilled data is already compactly serialized. A short sketch follows below.
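    The same hypothetical RDD from the Kryo sketch, one level down:

    ```scala
    import org.apache.spark.storage.StorageLevel

    // Serialized in memory; anything that doesn't fit is written to
    // disk already-serialized, which keeps the spilled bytes small.
    rdd.unpersist()
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
    rdd.count()
    ```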


    5) Persist (DISK_ONLY)

    In this case, heap memory is not used for storage; the RDD's partitions are persisted to disk. Make sure there is enough disk space, and note that this option carries a huge I/O overhead, so don't use it for DataFrames that are used repeatedly. A sketch follows below.
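    A standalone sketch for DISK_ONLY; the scratch directories are an assumption and may be overridden by the cluster manager (e.g. on YARN):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder()
      .appName("disk-only-demo")
      // Hypothetical scratch dirs: point spark.local.dir at volumes
      // with enough free space, since every cached byte lands on disk.
      .config("spark.local.dir", "/mnt/disk1/tmp,/mnt/disk2/tmp")
      .getOrCreate()

    val df = spark.read.parquet("/data/events") // hypothetical input
    df.persist(StorageLevel.DISK_ONLY)
    df.count() // all partitions are written to, and later read from, disk
    ```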


    6) Persist (MEMORY_ONLY_2 or MEMORY_AND_DISK_2)

    These are similar to the MEMORY_ONLY and MEMORY_AND_DISK levels mentioned above; the only difference is that these options replicate each partition on two cluster nodes, just to be on the safe side. Use them when you are running on spot instances. A sketch follows below.
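    Reusing the hypothetical df from the earlier sketches; the StorageLevel factory call at the end shows that the predefined _2 levels are just ordinary levels with replication = 2:

    ```scala
    import org.apache.spark.storage.StorageLevel

    // Drop any previous level, then cache each partition on two nodes.
    df.unpersist()
    df.persist(StorageLevel.MEMORY_AND_DISK_2)
    df.count()

    // Equivalent custom level built explicitly:
    // args are (useDisk, useMemory, useOffHeap, deserialized, replication)
    val sameThing = StorageLevel(true, true, false, true, 2)
    ```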


    7) Persist (OFF_HEAP)

    Off-heap memory generally holds thread stacks, the Spark container's application code, network I/O buffers, and other OS buffers. With this option you can use that part of RAM, outside the JVM heap, for caching your RDD. A configuration sketch follows below.
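    A standalone sketch of enabling off-heap caching; the 2g size is an arbitrary assumption, and both settings must be set before the session is created:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder()
      .appName("off-heap-demo")
      // Off-heap storage must be explicitly enabled and sized.
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "2g")
      .getOrCreate()

    val df = spark.read.parquet("/data/events") // hypothetical input
    df.persist(StorageLevel.OFF_HEAP)
    df.count() // cached outside the JVM heap, so it adds no GC pressure
    ```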