Search code examples
apache-sparkpysparkapache-spark-sqlaws-glue-data-catalog

BROADCASTJOIN hint is not working in PySpark SQL


I am trying to provide broadcast hint to table which is smaller in size, but physical plan is still showing me SortMergeJoin.

spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()

Output : enter image description here

Note :

  1. Size of tables are in KBs (test data)
  2. Joining column 'serial_id' is not partitioned column
  3. Using glue catalog as metastore (AWS)
  4. Spark Version - Spark 2.4.4
  5. I have tried BROADCASTJOIN and MAPJOIN hint as well
  6. When I am trying to use created_date [partitioned column] instead of serial_id as my joining condition, it is showing me BroadCast Join -

spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.created_date = c.created_date').explain()

Output - enter image description here

Why spark behavior is strange with AWS Glue Catalog as my metastore?


Solution

  • In BROADCAST hint we need to pass the alias name of the table (as you have alias kept in your sql statement).

    Try with /*+ BROADCAST(c) */* instead of /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ *


    spark.sql('select /*+ BROADCAST(c) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()