Tags: scala, function, apache-spark, apache-spark-sql, broadcast

Difference between sc.broadcast and the broadcast function in Spark SQL


I have used sc.broadcast for lookup files to improve the performance.

I also learned that there is a function called broadcast in Spark SQL functions.

What is the difference between the two?

Which one should I use for broadcasting reference/lookup tables?


Solution

  • If you want a broadcast join in Spark SQL, use the broadcast function (combined with the desired spark.sql.autoBroadcastJoinThreshold configuration). It will:

    • Mark the given relation for broadcasting.
    • Adjust the SQL execution plan.
    • When the output relation is evaluated, take care of collecting the data, broadcasting it, and applying the correct join mechanism.
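
As a rough illustration of the broadcast function, here is a minimal sketch of a broadcast join. The table contents, the column names (order_id, country_code, amount, country_name) and the object name BroadcastJoinSketch are made up for the example; only SparkSession, join and org.apache.spark.sql.functions.broadcast are Spark's own API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  // Joins a (hypothetical) larger table with a small lookup table and
  // marks the lookup side for broadcasting via the broadcast function.
  def enrich(orders: DataFrame, countries: DataFrame): DataFrame =
    orders.join(broadcast(countries), Seq("country_code"))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; adjust master/appName as needed.
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data standing in for a fact table and a reference table.
    val orders = Seq((1, "US", 10.0), (2, "DE", 20.0), (3, "US", 5.0))
      .toDF("order_id", "country_code", "amount")
    val countries = Seq(("US", "United States"), ("DE", "Germany"))
      .toDF("country_code", "country_name")

    val joined = enrich(orders, countries)

    // Prints the physical plan; inspect it to see which join strategy was chosen.
    joined.explain()
    joined.show()

    spark.stop()
  }
}
```

Nothing is collected or shipped at the point where broadcast(countries) is called; that happens when an action such as show() evaluates the join. With this hint the plan printed by explain() would typically contain a broadcast hash join, but check the output on your own Spark version.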

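  • SparkContext.broadcast is used to handle local objects: it ships a read-only copy of a driver-side value (for example a lookup Map) to the executors. It can be used alongside Spark DataFrames, for instance inside a UDF, but it does not by itself turn a DataFrame join into a broadcast join.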
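
For comparison, here is a minimal sketch of SparkContext.broadcast with a local object, consumed from a DataFrame through a UDF. The lookup Map, the column names, the "unknown" fallback and the helper name withCountryName are assumptions made for the example; they do not come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object BroadcastVariableSketch {
  // Broadcasts a local (driver-side) Map and looks values up inside a UDF.
  // The helper name and the "unknown" default are illustrative choices.
  def withCountryName(spark: SparkSession,
                      orders: DataFrame,
                      lookup: Map[String, String]): DataFrame = {
    // Read-only copy of the local object, made available on the executors.
    val lookupBc = spark.sparkContext.broadcast(lookup)
    val toName = udf((code: String) => lookupBc.value.getOrElse(code, "unknown"))
    orders.withColumn("country_name", toName(col("country_code")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-variable-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical reference data held as a plain Scala Map on the driver.
    val countryNames = Map("US" -> "United States", "DE" -> "Germany")
    val orders = Seq((1, "US"), (2, "DE"), (3, "FR")).toDF("order_id", "country_code")

    withCountryName(spark, orders, countryNames).show()

    spark.stop()
  }
}
```

Here the reference data is an ordinary Scala value rather than a DataFrame, so the SQL planner is not involved. If the lookup table already exists as a DataFrame and the goal is a join, the broadcast function is usually the more direct choice.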