Tags: amazon-web-services, apache-spark, hadoop-yarn, apache-spark-mllib, emr

Spark Using Disk Resources When Memory is Available


I'm working on optimizing the performance of my Spark cluster (running on AWS EMR), which performs collaborative filtering using the ALS matrix factorization algorithm. We use quite a few factors and iterations, so I'm trying to optimize those steps in particular. I'm trying to understand why I'm using disk space when I have plenty of memory available. Here is the total cluster memory available:

[Screenshot: total cluster memory available]

Here is the remaining disk space (notice the dips in disk utilization):

[Screenshot: remaining disk space over time, showing dips in utilization]

I've looked at the YARN resource manager, and it shows each slave node with about 110 GB used and 4 GB available; you can also see the total allocated memory (~700 GB) in the first image. I've also tried changing the ALS source to force intermediateRDDStorageLevel and finalRDDStorageLevel from MEMORY_AND_DISK to MEMORY_ONLY, and that didn't affect anything.
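For reference, this is roughly what I'm forcing; on Spark versions that expose them, the RDD-based MLlib ALS has public setters for both levels, so no source change should be needed (the rank/iterations/lambda values below are just placeholders):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// `ratings` is the training RDD built elsewhere; hyperparameters are placeholders.
def trainAls(ratings: RDD[Rating]) =
  new ALS()
    .setRank(100)
    .setIterations(15)
    .setLambda(0.01)
    // Force both storage levels away from the MEMORY_AND_DISK default
    .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_ONLY)
    .setFinalRDDStorageLevel(StorageLevel.MEMORY_ONLY)
    .run(ratings)
```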

I am not persisting my RDDs anywhere else in my code, so I'm not sure where this disk utilization is coming from. I'd like to make better use of the resources on my cluster. Any ideas? How can I use the available memory more effectively?


Solution

  • There are a few scenarios in which Spark will use disk instead of memory:

    1. Shuffle operations. Spark always writes shuffled data to disk, so if your job involves a shuffle (and ALS does), some disk usage is unavoidable.

    2. Low executor memory. If executor memory is low, Spark has less room to keep data in memory and will spill it to disk. However, as you said, you have already tried executor memory between 20G and 40G. I would recommend keeping executor memory at 40G or below, since beyond that JVM garbage collection can slow your job down (see the first sketch after this list).

    3. If you don't have a shuffle operation, you can also tweak spark.memory.fraction if you are using Spark 2.2.

    From the documentation:

    spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.

    So you can set spark.memory.fraction to 0.9 and observe the behavior (see the second sketch after this list).

    4. Lastly, there are storage levels other than MEMORY_ONLY, such as MEMORY_ONLY_SER, which serializes the data before storing it in memory. This reduces memory usage, since a serialized object is much smaller than the original object. If you see a lot of spill, you can opt for this storage level (see the last sketch after this list).
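For point 2, a minimal sketch of how the executor sizing could be set when the session is built; the values (and spark.executor.cores) are assumptions to adjust to your EMR instance types, and the same settings can equally be passed to spark-submit:

```scala
import org.apache.spark.sql.SparkSession

// Assumed sizing: tune to your EMR instance types. Staying at or below
// roughly 40g per executor keeps JVM GC pauses manageable, as noted above.
val spark = SparkSession.builder()
  .appName("als-tuning")
  .config("spark.executor.memory", "40g")
  .config("spark.executor.cores", "5")
  .getOrCreate()
```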
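For point 3, spark.memory.fraction is just another configuration entry; a sketch assuming it is set at session build time:

```scala
import org.apache.spark.sql.SparkSession

// Raising spark.memory.fraction from the 0.6 default to 0.9 leaves less room
// for user data structures and Spark's internal metadata, so only do this if
// the job does not build large user-side structures.
val spark = SparkSession.builder()
  .appName("als-tuning")
  .config("spark.memory.fraction", "0.9")
  .getOrCreate()
```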
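For point 4, a sketch of the serialized storage level, assuming the same RDD[Rating] input as in the question (the helper name is made up for illustration):

```scala
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Keep the data in memory in serialized form: smaller footprint, at the
// cost of deserializing records when they are read. Kryo serialization
// shrinks the serialized form further.
def cacheSerialized(ratings: RDD[Rating]): RDD[Rating] =
  ratings.persist(StorageLevel.MEMORY_ONLY_SER)

// The same level can be passed to the MLlib ALS setters shown in the question:
//   new ALS().setIntermediateRDDStorageLevel(StorageLevel.MEMORY_ONLY_SER)
```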