I have a GCP Dataproc cluster with 50 workers (n1-standard-16: 16 vCores, 64 GB RAM each).
The cluster uses the Capacity Scheduler with the DefaultResourceCalculator.
My Spark job has the following configuration:
Now, the YARN UI shows each node running 2 containers, each with 1 vCore and 20 GB RAM, which makes it look like spark.executor.cores
is not being applied. To cross-check, I looked at the Spark UI, and to my surprise every executor showed 5 cores. This is confusing to me.
The job completion time (26 minutes) also suggests that those 5 cores are real vCores, not just 5 threads inside 1 core (this is just my understanding; I might be completely wrong here).
Can anyone help me understand this?
The vCores number reported by YARN is known to be incorrect here. It is a known issue with the capacity-scheduler when used with Spark, but it is only cosmetic: Dataproc intentionally configures YARN to bin-pack containers by memory alone, which also allows oversubscription of vCores for high-I/O jobs if desired. Even if YARN were configured to include cores in bin-packing, it would not provide CPU isolation anyway. The number of cores per executor reported in the Spark UI is the correct one to trust.
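A rough way to see why you end up with 2 containers per node under memory-only bin-packing (the numbers below are assumptions for illustration, not values read from your cluster; substitute your actual yarn.nodemanager.resource.memory-mb and container size):

```python
# Sketch of memory-only bin-packing with the DefaultResourceCalculator.
# Assumed numbers: ~51 GB of YARN NodeManager memory on a 64 GB node,
# and a 20 GB container (executor memory + overhead, as seen in the YARN UI).
node_memory_gb = 51
container_memory_gb = 20
executor_cores = 5  # spark.executor.cores; ignored by memory-only packing

# YARN fits as many containers as memory allows, regardless of cores:
containers_per_node = node_memory_gb // container_memory_gb
threads_per_node = containers_per_node * executor_cores

print(containers_per_node)  # 2 containers per node, matching the YARN UI
print(threads_per_node)     # 10 concurrent task slots actually running per node
```

So the 2 containers you see are purely a consequence of memory fitting, while each container still runs the 5 task threads that the Spark UI reports.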
See this related StackOverflow answer: Dataproc set number of vcores per executor container
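If you do want YARN to account for cores when packing containers, one option (a sketch; verify the property name against your Dataproc image version, and note the cluster name here is a placeholder) is to switch the scheduler to the DominantResourceCalculator at cluster creation:

```
gcloud dataproc clusters create my-cluster \
    --properties 'capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator'
```

Keep in mind that, as noted above, this only changes the bin-packing accounting; it still does not give you CPU isolation between containers.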