
Can you do a broadcast join with SparkR?


I'm trying to join a large dataframe to a smaller dataframe, and I saw that a broadcast join is an efficient way to do that, according to this post.

However, I couldn't find a broadcast function in the SparkR documentation.

So I'm wondering if you can do a broadcast join with SparkR?


Solution

  • Spark 2.3: A broadcast function will be added by this pull request: https://github.com/apache/spark/pull/17965/files

    Spark 2.2:

    You can provide a custom hint to the query:

    head(join(df, hint(avg_mpg, "broadcast"), df$cyl == avg_mpg$cyl))
    

    Reference: see this code: https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L3740

    The broadcast function in the Java, Scala and Python APIs is also a wrapper that adds a broadcast hint. The hint gives the optimizer additional information: the user guarantees that this DataFrame is small, so it should be broadcast before being joined with other DataFrames.
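
    To make the hint concrete, here is a minimal end-to-end sketch in SparkR. It assumes a local Spark installation (SPARK_HOME set) and uses the built-in mtcars data as a stand-in for a large table; the session settings and the app name are illustrative choices, not part of the original answer. Calling explain() on the joined DataFrame is one way to check whether the optimizer picked up the hint: the physical plan would be expected to contain a broadcast join node, though the exact plan text varies by Spark version.

    ```r
    library(SparkR)

    # Assumption: Spark is installed locally and SPARK_HOME is set.
    sparkR.session(master = "local[*]", appName = "broadcast-hint-sketch")

    # "Large" side of the join (mtcars is only a stand-in for a big table).
    df <- createDataFrame(mtcars)

    # Small side: average mpg per cylinder count, built from a separate
    # DataFrame so the join condition is unambiguous.
    avg_mpg <- mean(groupBy(createDataFrame(mtcars), "cyl"), "mpg")

    # Attach the broadcast hint to the small side before joining.
    joined <- join(df, hint(avg_mpg, "broadcast"), df$cyl == avg_mpg$cyl)

    # Print the physical plan to see which join strategy was chosen.
    explain(joined)

    head(joined)

    sparkR.session.stop()
    ```

    Only the small side carries the hint; the large DataFrame is left as is.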

    Side note: Spark sometimes performs a broadcast join automatically. You can control automatic broadcast joins by setting:

    spark.sql("SET spark.sql.autoBroadcastJoinThreshold = -1")
    

    Here, -1 disables automatic broadcast joins, so no DataFrame will be broadcast automatically. You can read more about this topic here
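
    The spark.sql(...) call above is written in the Scala/Python style. In SparkR the SQL entry point is, as far as I know, the sql() function, so a rough sketch of the same setting could look like the following. The local session settings and the 50 MB value are assumptions chosen for illustration; the threshold is given in bytes.

    ```r
    library(SparkR)

    # Assumption: Spark is installed locally and SPARK_HOME is set.
    sparkR.session(master = "local[*]", appName = "auto-broadcast-threshold-sketch")

    # Inspect the current threshold (in bytes).
    sparkR.conf("spark.sql.autoBroadcastJoinThreshold")

    # Disable automatic broadcast joins for this session.
    invisible(sql("SET spark.sql.autoBroadcastJoinThreshold = -1"))

    # Or raise the threshold instead, here to 50 MB (illustrative value).
    invisible(sql("SET spark.sql.autoBroadcastJoinThreshold = 52428800"))

    # Read the value back to confirm the change took effect.
    sparkR.conf("spark.sql.autoBroadcastJoinThreshold")

    sparkR.session.stop()
    ```

    The threshold only governs automatic broadcasting, so an explicit hint(df, "broadcast") can still be used when it is set to -1.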