I'm a bit new to Spark and trying to understand a few terms (I couldn't work them out from online resources).
Please validate my understanding of the terms below first:
Executor: a container or JVM process that runs on a worker node (or data node). We can have multiple executors per node.
Core: a thread within a container or JVM process running on a worker node (or data node). We can have multiple cores, i.e. threads, per executor.
Please correct me if I am wrong about the above two concepts.
Questions:
When we submit an application or job in the cluster and execute it, is my understanding above correct? In the command used to submit a job to a Spark cluster, there is an option to set the number of executors:
spark-submit --class <CLASS_NAME> --num-executors ? --executor-cores ? --executor-memory ? ....
So are these numbers of executors and cores set per node? If not, how can we set a specific number of cores per node?
All of your assumptions are correct. For a detailed explanation of the cluster architecture, please go through this link; you'll get a clear picture. Regarding your second question: num-executors applies to the entire cluster, not to a single node. The upper bound on the number of single-core executors the cluster can host is:
num-cores-per-node * total-nodes-in-cluster
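As a rough sketch of that arithmetic (hypothetical numbers, and a simplification that ignores overhead reserved for the OS, YARN, and the driver):

```python
# Hedged sketch: how many executors of a given size fit in a cluster,
# given per-node cores/memory. Numbers below are made up for illustration.
def executors_that_fit(total_nodes, cores_per_node, mem_per_node_gb,
                       executor_cores, executor_mem_gb):
    # An executor needs both its cores and its memory on the same node,
    # so the per-node count is limited by whichever resource runs out first.
    per_node = min(cores_per_node // executor_cores,
                   mem_per_node_gb // executor_mem_gb)
    return per_node * total_nodes

# 20 nodes with 4 cores / 8 GB each; executors requesting 1 core / 1 GB:
print(executors_that_fit(20, 4, 8, 1, 1))  # -> 80

# Bigger executors (2 cores / 4 GB) fit fewer times:
print(executors_that_fit(20, 4, 8, 2, 4))  # -> 40
```

In practice you would also subtract a core and some memory per node for system daemons before doing this division.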
For example, suppose you have a 20-node cluster with 4-core machines, and you submit an application with --executor-memory 1G and --total-executor-cores 8. Spark will then launch eight executors, each with 1 GB of RAM, on different machines. Spark does this by default to give applications a chance to achieve data locality on distributed filesystems running on the same machines (e.g., HDFS), because these systems typically have data spread out across all nodes.
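That placement behaviour can be modelled with a small sketch. This is a simplified, hypothetical model (not Spark's actual scheduler code), assuming standalone mode gives each executor one core here and spreads executors round-robin across workers:

```python
# Hedged, simplified model of standalone-mode "spread out" executor
# placement; illustrative only, not the real Spark scheduler.
def plan_executors(total_executor_cores, cores_per_executor, num_nodes):
    num_executors = total_executor_cores // cores_per_executor
    # Round-robin over workers: executors land on distinct machines
    # until every node already has one.
    placement = [i % num_nodes for i in range(num_executors)]
    return num_executors, placement

# --total-executor-cores 8, one core per executor, 20-node cluster:
n, placement = plan_executors(8, 1, 20)
print(n)                    # -> 8 executors
print(len(set(placement)))  # -> 8 distinct machines
```

With only 8 executors and 20 workers, each executor ends up on its own machine, which is exactly the locality-friendly behaviour described above.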
I hope it helps!