I'm trying to join a large dataframe to a smaller dataframe and I saw that broadcast join is an efficient way to do that, according to this post.
However, I couldn't find the broadcast function in the SparkR documentation.
So I'm wondering if you can do a broadcast join with SparkR?
Spark 2.3:
There will be a broadcast function, added in this pull request: https://github.com/apache/spark/pull/17965/files
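A minimal sketch of what that could look like once you are on Spark 2.3 or later, assuming a SparkR session can be started locally. mtcars is only stand-in data, and df and avg_mpg are illustrative names chosen to match the 2.2 example:

```r
library(SparkR)
sparkR.session()

# "Large" side of the join (mtcars is only a stand-in here)
df <- createDataFrame(mtcars)

# Small side: one row per distinct cyl value
avg_mpg <- mean(groupBy(createDataFrame(mtcars), "cyl"), "mpg")

# broadcast() marks avg_mpg as small enough to be sent to every executor
joined <- join(df, broadcast(avg_mpg), df$cyl == avg_mpg$cyl)
head(joined)

# Print the physical plan to check whether a broadcast join was chosen
explain(joined)
```

Checking the plan with explain is worth doing, since a hint is a request to the optimizer and the plan is the only place you can see what was actually picked.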
Spark 2.2:
You can provide a custom hint to the query:
head(join(df, hint(avg_mpg, "broadcast"), df$cyl == avg_mpg$cyl))
Reference: this code: https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L3740
The broadcast function in the Java, Scala and Python APIs is also just a wrapper that adds the broadcast hint. A hint gives the optimizer additional information: this DataFrame is small, I (the user) guarantee it, so broadcast it before joining it with other DataFrames.
Side note: Spark sometimes performs a broadcast join automatically. You can control automatic broadcast joins by setting:
spark.sql("SET spark.sql.autoBroadcastJoinThreshold = -1")
Here, -1 disables automatic broadcast joins, so no DataFrame will be broadcast unless you ask for it explicitly. You can read more about this topic here
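The SET statement above is written the way you would call it from the Scala or Python API. In SparkR the same statement goes through sql(). Here is a rough sketch, again assuming a local SparkR session and mtcars as stand-in data, that turns automatic broadcasting off and compares the plan with and without an explicit hint:

```r
library(SparkR)
sparkR.session()

# Disable automatic broadcast joins for this session
sql("SET spark.sql.autoBroadcastJoinThreshold = -1")

df <- createDataFrame(mtcars)
avg_mpg <- mean(groupBy(createDataFrame(mtcars), "cyl"), "mpg")

# With the threshold at -1 the planner should not choose a broadcast join by itself
explain(join(df, avg_mpg, df$cyl == avg_mpg$cyl))

# An explicit hint is still available when you know the table is small
explain(join(df, hint(avg_mpg, "broadcast"), df$cyl == avg_mpg$cyl))

# Restore the default threshold of 10 MB (the value is given in bytes)
sql("SET spark.sql.autoBroadcastJoinThreshold = 10485760")
```

The threshold only affects what Spark decides automatically from its size estimates, which is why the explicit hint is shown separately.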