Tags: google-cloud-platform, hadoop-yarn, google-cloud-dataproc

GCP Dataproc - Inconsistent container metrics - YARN UI vs Spark UI


I have a GCP Dataproc cluster with 50 workers (n1-standard-16: 16 vCores, 64 GB RAM).

The cluster uses the Capacity Scheduler with the DefaultResourceCalculator.

My Spark job has the following configuration (a rough sketch of how it is applied follows the list):

  • spark.executor.cores=5
  • spark.executor.memory=18G
  • spark.yarn.executor.memoryOverhead=2G
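
For context, here is a minimal PySpark sketch of how those three properties might be set when the session is created; the application name is a placeholder and the rest of the job is omitted, so this is an illustration rather than the actual job code.

```python
# Minimal sketch (PySpark assumed) of applying the configuration above.
# The app name is hypothetical; only the three properties come from the question.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("container-metrics-demo")                   # placeholder name
    .config("spark.executor.cores", "5")                 # 5 task slots per executor
    .config("spark.executor.memory", "18g")              # executor JVM heap
    .config("spark.yarn.executor.memoryOverhead", "2g")  # off-heap overhead -> ~20 GB container
    .getOrCreate()
)
```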

Now, when I look at the YARN UI, it shows each node running 2 containers with 1 vCore and 20 GB RAM each, which almost makes it look like spark.executor.cores is not being applied. To cross-check, I looked at the Spark UI, and to my surprise every executor showed 5 cores. This is confusing to me.

The job completion time (26 minutes) also suggests that those 5 cores are indeed vCores, not just 5 threads inside 1 core (this is just my understanding; I might be completely wrong here).
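
One way to cross-check what YARN actually allocated, independent of the UI, is to query the ResourceManager REST API. The sketch below is not from the original post; the master hostname is a placeholder, and it assumes the ResourceManager web service is reachable on port 8088 of the Dataproc master node.

```python
# Hedged cross-check: ask the YARN ResourceManager REST API what it allocated.
# The hostname below is a placeholder for the Dataproc master node.
import requests

RM_URL = "http://<cluster-name>-m:8088"  # placeholder master hostname

resp = requests.get(f"{RM_URL}/ws/v1/cluster/apps", params={"states": "RUNNING"})
resp.raise_for_status()

for app in (resp.json().get("apps") or {}).get("app", []):
    print(
        app["id"],
        "containers:", app["runningContainers"],
        "allocatedMB:", app["allocatedMB"],
        "allocatedVCores:", app["allocatedVCores"],  # the figure the YARN UI reflects
    )
```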

Can anyone help me understand this?

(Screenshots: Spark UI, YARN UI)


Solution

  • The vCore count reported by YARN is known to be incorrect. It is a known, purely cosmetic issue with the Capacity Scheduler's DefaultResourceCalculator when used with Spark: Dataproc intentionally bin-packs containers on memory alone, which allows vCores to be oversubscribed for high-I/O jobs if desired. Even when YARN is configured to include cores in bin-packing, it does not provide CPU isolation anyway. The number of cores per executor reported in the Spark UI is the correct one to trust (a small worked example follows below).

    See this related StackOverflow answer: Dataproc set number of vcores per executor container
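
To make the memory-only bin-packing concrete, here is a rough worked example. The NodeManager memory figure is an assumption (roughly 80% of the 64 GB machine, which is typical for Dataproc defaults), not a value taken from the cluster; the rest follows from the configuration in the question.

```python
# Rough worked example of why the YARN UI shows 2 containers x 20 GB x "1 vCore"
# per node while the Spark UI shows 5 cores per executor.
NODE_MEMORY_GB = 51       # ASSUMED yarn.nodemanager.resource.memory-mb (~80% of 64 GB)
EXECUTOR_MEMORY_GB = 18   # spark.executor.memory
MEMORY_OVERHEAD_GB = 2    # spark.yarn.executor.memoryOverhead

container_gb = EXECUTOR_MEMORY_GB + MEMORY_OVERHEAD_GB  # 20 GB per container
containers_per_node = NODE_MEMORY_GB // container_gb    # 2 -> matches the YARN UI

# With the DefaultResourceCalculator, only memory drives bin-packing, so YARN
# reports each container as 1 vCore regardless of spark.executor.cores.
reported_vcores_per_container = 1

# Inside each container, Spark still runs spark.executor.cores task threads,
# which is what the Spark UI reports and what actually drives parallelism.
spark_cores_per_executor = 5

print(containers_per_node, "containers/node;",
      reported_vcores_per_container, "vCore reported by YARN;",
      spark_cores_per_executor, "cores reported by Spark")
```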