apache-spark pyspark apache-spark-sql aws-glue-data-catalog

BROADCASTJOIN hint is not working in PySpark SQL

I am trying to provide broadcast hint to table which is smaller in size, but physical plan is still showing me SortMergeJoin.

spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()

Output :

Note :

Size of tables are in KBs (test data)
Joining column 'serial_id' is not partitioned column
Using glue catalog as metastore (AWS)
Spark Version - Spark 2.4.4
I have tried BROADCASTJOIN and MAPJOIN hint as well
When I am trying to use created_date [partitioned column] instead of serial_id as my joining condition, it is showing me BroadCast Join -

spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.created_date = c.created_date').explain()

Output -

Why spark behavior is strange with AWS Glue Catalog as my metastore?

Solution

In BROADCAST hint we need to pass the alias name of the table (as you have alias kept in your sql statement).

Try with /*+ BROADCAST(c) */* instead of /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ *

spark.sql('select /*+ BROADCAST(c) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()