I've set up a spark-jobserver to enable complex queries on a reduced dataset.
The jobserver executes two operations:
The biggest table (before and after the reduction, which include also some joins) has almost 30M of rows, with at least 30 fields.
Actually I'm working on a dev machine with 32GB of ram dedicated to the job server, and everything runs smoothly. Problem is that in the production one we have the same amount of ram shared with a PredictionIO server.
I'm asking how determine the memory configuration to avoid memory leaks or crashes for spark.
I'm new to this, so every reference or suggestion is accepted.
Thank you
Take an example, if you have a server with 32g ram. set the following parameters :
spark.executor.memory = 32g
Take a note:
The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
63GB + the executor memory overhead won’t fit within the 63GB capacity of the NodeManagers. The application master will take up a core on one of the nodes, meaning that there won’t be room for a 15-core executor on that node. 15 cores per executor can lead to bad HDFS I/O throughput.
A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?
This config results in three executors on all nodes except for the one with the AM, which will have two executors. --executor-memory was derived as (63/3 executors per node) = 21. 21 * 0.07 = 1.47. 21 – 1.47 ~ 19.
This is explained here if you want to know more : http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/