apache-spark, pyspark, apache-spark-sql, spark-streaming

Writing Spark Submit Commands


I am new to Spark.

I have a cluster with this configuration:

Number of nodes: 10
Number of cores per node: 16
Memory (RAM) per node: 64 GB

This is my spark-submit command:

spark-submit --master yarn \
    --deploy-mode cluster \
    --driver-cores 5 \
    --driver-memory ??G \
    --executor-cores 5 \
    --executor-memory 12G \
    --conf spark.sql.shuffle.partitions=1440 \
    example.py

I have driver and executor cores set to 5 because I was told that is optimal. For calculating the executor memory, I was given this formula:

executor_memory = memory_available - container_overhead
memory_available = total_memory / number_of_executors_per_node
memory_available = 64 GB / 5 = 12.8 ≈ 13 GB
container_overhead = 7% * 13 GB = 0.91 ≈ 1 GB
executor_memory = 13 GB - 1 GB = 12 GB
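Written out as a quick Python sketch (this is just the arithmetic above; the 7% overhead factor and the 5 executors per node are the numbers I was given, not values I have verified):

    import math

    total_memory_gb = 64
    executors_per_node = 5        # from the formula I was given
    overhead_fraction = 0.07      # the 7% I am asking about

    memory_available = math.ceil(total_memory_gb / executors_per_node)    # 12.8 -> 13 GB
    container_overhead = math.ceil(overhead_fraction * memory_available)  # 0.91 -> 1 GB
    executor_memory = memory_available - container_overhead               # 12 GB
    print(executor_memory)  # 12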

My questions are:

  • When calculating the container overhead, how do I know how much space to allow for heap overhead? Why is it 7%?
  • How do I calculate the driver_memory?

Solution

  • Normally broad questions like these (and posts that contain multiple questions) are not favoured on Stack Overflow because:

    • answers can vary quite a bit, possibly leaving the original poster and future readers confused
    • answers can be opinionated, which we want to avoid here
    • people may answer only part of your question, making the answer non-atomic

    But since it is a genuine question, and it's obvious you've been asking yourself some great questions, I'll give you an answer. Due to the nature of your question, you might leave with more questions. In that case, feel free to ask new questions on here, but make them good: include code, information about your cluster, your Spark version, information about your data, ask only one question per post, ...

    The answer

    Where does your information come from? These questions are too generic, and you can't really answer them without knowing how you're going to use your cluster.

    Here are some questions you need to ask yourself before starting to fill in your numbers:

    • Will your calculations be memory/CPU intensive?
      • This will teach you where your bottleneck might be
    • What will your driver be doing? Will you do lots of .collect actions or do many non-distributed calculations?
      • If almost all of your code operates on distributed data objects like DataFrames, your driver won't be doing much (so there is no need to worry about it)
      • If you will be calling .collect on big objects, you might need a lot of driver memory (in general, try to avoid this, but there might be reasons why you would want to do it); see the first sketch after this list
    • Are you going to have lots of shuffle operations? Will they be skewed?
      • This might have an impact on your choice of spark.sql.shuffle.partitions (see the second sketch after this list)
      • The nature of your joins (skewed joins, non-equi joins, ...) might also have an impact on the code you write
    • How big is your data?
      • This will affect everything :)
    • Many more questions...
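    To make the driver-memory point concrete, here is a minimal PySpark sketch (the data is generated on the fly purely for illustration): an aggregation that stays distributed barely touches the driver, while .collect pulls the whole dataset into driver memory.

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.appName("driver-memory-sketch").getOrCreate()

        # Hypothetical data set, only here to have something to run against.
        df = spark.range(0, 100_000_000).withColumn("value", F.rand())

        # Stays distributed: the aggregation runs on the executors and only a
        # single row reaches the driver, so driver memory hardly matters here.
        df.groupBy().avg("value").show()

        # Pulls every row onto the driver: now --driver-memory has to be big
        # enough to hold the whole result, which is what you normally avoid.
        rows = df.collect()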
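    For the shuffle question, a second sketch showing the knobs mentioned above; the value 1440 only mirrors the spark-submit command in the question and is not a recommendation, and the adaptive settings assume Spark 3.x.

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("shuffle-sketch")
            # Same setting as --conf spark.sql.shuffle.partitions=1440; tune it
            # to your data volume instead of copying a number.
            .config("spark.sql.shuffle.partitions", "1440")
            # Adaptive query execution can coalesce shuffle partitions and
            # mitigate skewed joins at runtime.
            .config("spark.sql.adaptive.enabled", "true")
            .config("spark.sql.adaptive.skewJoin.enabled", "true")
            .getOrCreate()
        )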

    So to conclude, all of the numbers you filled in seem premature if you don't know more about your actual calculations.