Search code examples
apache-sparkapache-spark-sql

Table being broadcasted in YARN but not in K8s


I am running same queries in Spark on YARN and Spark on K8s. Both K8s & YARN refer to the same hive metastore and hdfs path. When I run the job in YRAN certain table is getting broadcasted (in join), while same is not happening in K8s. In both the environment broadcast threshold is same. Table is also same. But there is difference in plan when run on YARN vs K8s. And both places broadcast is enabled.

Why this difference in behaviour?


Solution

  • Found the issue:

    First thing that I tried was to explicitly pass the broadcast hint in the join and it did work. Now It was sure that some way the broadcast was not happening automatically. I ran the DESCRIBE FORMATTED <TBL_NAME> on the table and noticed that numRows & rawDataSize both were -1. So the issue was table stats in hive metastore was not updated. But the question was, how it works in YARN. On getting the spark config being used in YARN and K8s at runtime we found the difference. So in YARN this property spark.sql.statistics.fallBackToHdfs was set to true while in K8s it was not set (default is false). So YARN was getting stats from hdfs and hence it was able to braodcast, while on K8s it was not happening. Once setting this in K8s as well to true the broadcast happened.