I am new to Spark.
I have a cluster with this configuration:
Number of nodes : 10
Number of cores per node : 16
Memory (RAM) per node : 64gb
This is my spark-submit command:
spark-submit --master yarn \
  --deploy-mode cluster \
  --driver-cores 5 \
  --driver-memory ??G \
  --executor-cores 5 \
  --executor-memory 12G \
  --conf spark.sql.shuffle.partitions=1440 \
  example.py
I have driver and executor cores set to 5 because I was told that is optimal. For calculating the executor memory, I was given this formula:
executor_memory = memory_available - container_overhead
memory_available = total_memory / number_of_executors_per_node
memory_available = 64gb / 5 = 12.8 ≈ 13gb
container_overhead = 7% * 13gb = 0.91 ≈ 1gb
executor_memory = 13gb - 1gb = 12gb
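Spelled out as plain Python, this is just the arithmetic above; the function and its parameter names are mine, not part of any Spark or YARN API:

```python
def executor_memory_gb(total_memory_gb, executors_per_node, overhead_fraction=0.07):
    """Rough per-executor memory after subtracting the YARN container overhead."""
    memory_available = total_memory_gb / executors_per_node                  # 64 / 5 = 12.8 -> ~13
    container_overhead = round(overhead_fraction * round(memory_available))  # 7% of 13 -> ~1
    return round(memory_available) - container_overhead                      # 13 - 1 = 12

print(executor_memory_gb(64, 5))  # 12
```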
My questions are:
Normally, broad questions like these (and posts that contain multiple questions) are not favoured on Stack Overflow.
But since it is a genuine question, and it's obvious you've been asking yourself some good questions, I'll give you an answer. Given the nature of your question, you might leave with more questions; in that case, feel free to ask new ones here, but make them good: include code, information about your cluster and your Spark version, information about your data, and ask only one question per post.
Where does your information come from? These questions are too generic, and you can't really answer them without knowing how you're going to use your cluster.
Here are some questions you need to ask yourself before you start filling in numbers:
- Are you going to do .collect actions or do many non-distributed calculations? If you .collect big objects onto the driver, you might need a lot of driver memory (in general, try to avoid this, but there might be reasons why you would want to do it); see the sketch after this list.
- How did you choose spark.sql.shuffle.partitions=1440?
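For example, here is a minimal PySpark sketch of the kind of pattern where these two questions start to matter; the DataFrame, the column name and the paths are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input, only to illustrate the points above.
df = spark.read.parquet("/data/events")

# A wide aggregation: spark.sql.shuffle.partitions (1440 in your command)
# decides how many tasks this shuffle produces and therefore how big each one is.
counts = df.groupBy("user_id").count()

# This pulls the whole result onto the driver, so --driver-memory has to hold it.
# Avoid it for large results.
rows = counts.collect()

# Distributed alternatives that keep the data off the driver:
counts.write.mode("overwrite").parquet("/output/user_counts")
preview = counts.take(10)   # only 10 rows reach the driver
```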
So to conclude, all of the numbers you filled in seem premature if you don't know more about your actual calculations.