How to deal with "could not execute broadcast in 300 secs"?

I am trying to get a build working, and one of the stages intermittently fails with the following error:

Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1

How should I deal with this error?

Solution

First, let's talk a bit about what that error means.

From the official Spark documentation (http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables):

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

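In a broadcast join, Spark collects the smaller dataset and sends a full copy of it to every executor; the error means that this step did not finish within the timeout. In my experience, the broadcast timeout usually occurs when one of the input datasets is poorly partitioned. Instead of disabling the broadcast, I recommend that you look at the partitions of your datasets and ensure that they are partitioned correctly.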
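To make "look at the partitions" a bit more concrete, here is a minimal sketch in plain Python that summarizes per-partition row counts. The function name `summarize_partitions` and the idea of treating empty or heavily skewed partitions as a sign of poor partitioning are my own assumptions for illustration, not something Spark reports directly.

```python
def summarize_partitions(row_counts):
    """Summarize per-partition row counts to spot empty or skewed partitions.

    `row_counts` is a list with one integer per partition (hypothetical input,
    see the note below the block for one way to obtain it).
    """
    if not row_counts:
        raise ValueError("row_counts must not be empty")
    total = sum(row_counts)
    mean = total / len(row_counts)
    return {
        "partitions": len(row_counts),
        "empty": sum(1 for count in row_counts if count == 0),
        "min": min(row_counts),
        "max": max(row_counts),
        # Largest partition relative to the average; values far above 1 suggest skew.
        "skew": max(row_counts) / mean if mean else 0.0,
    }


if __name__ == "__main__":
    # Three small partitions and one large one.
    print(summarize_partitions([10, 10, 10, 70]))
```

With PySpark, the row counts can be collected with something like `df.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()`, and `df.rdd.getNumPartitions()` gives the partition count alone. Both scan the data, so they are best run once while debugging rather than in the build itself.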

The rule of thumb I use is to take the size of your dataset in MB, divide it by 100, and set the number of partitions to that number. Since the default HDFS block size is 128 MB, we want to split the files into chunks of roughly that size, but since they don't split perfectly, dividing by a smaller number gives slightly more partitions and keeps each one under the block size.
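As a rough sketch of that rule of thumb, the arithmetic can be written as a small helper. The function name `suggested_partitions`, its parameter names, and the choice to round up are assumptions on my part; the 100 MB divisor and the 125 MB single-partition cutoff are the figures from this answer.

```python
import math


def suggested_partitions(size_mb, mb_per_partition=100, single_partition_below_mb=125):
    """Suggest a partition count for a dataset of `size_mb` megabytes."""
    if size_mb < 0:
        raise ValueError("size_mb must be non-negative")
    # Very small datasets stay in one partition to avoid per-partition overhead.
    if size_mb < single_partition_below_mb:
        return 1
    # Dividing by 100 rather than the block size leaves each partition a bit under it.
    # Rounding up is an assumption; the rule of thumb does not say how to round.
    return math.ceil(size_mb / mb_per_partition)


if __name__ == "__main__":
    for size in (50, 1_000, 12_500):
        print(size, "MB ->", suggested_partitions(size), "partitions")
```

In PySpark the result would typically be applied with `df.repartition(n)`, where `n` is the suggested count. Treat the number as a starting point and adjust it after checking how the job behaves.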

The main thing is that very small datasets (under roughly 125 MB) should be in a single partition, because otherwise the network overhead is too large. Hope this helps.