I know that broadcasting becomes very useful for minimizing the amount of data shuffled across nodes. For example, in the following code I am broadcasting airports_df so that the join with flights_df can be performed locally on each executor, minimizing shuffling during the join operation.
from pyspark.sql.functions import broadcast

broadcast_df = flights_df.join(broadcast(airports_df),
    flights_df["Destination Airport"] == airports_df["IATA"])
1.) Now, doesn't broadcasting require additional storage space on my worker nodes? Will the broadcast DataFrame reside in memory? What if it is too big to fit in a worker's memory?
2.) Can broadcasting cause an I/O bottleneck?
To answer your questions:
Yes, broadcasting requires additional memory: the broadcast DataFrame is materialized and cached in the memory of every worker node. It is not "additional storage" in the sense of disk; it is memory on each executor, on top of the memory the executor already uses for its regular tasks.
As mentioned above, the broadcast DataFrame resides in each worker's memory. If it is too large to fit, executors can fail with out-of-memory errors, which is why broadcasting is only appropriate for small tables.
For automatic broadcast joins, Spark only broadcasts tables smaller than spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB (10485760 bytes). You can raise this value, or set it to -1 to disable automatic broadcasting; an explicit broadcast() hint, as in your code, bypasses the threshold entirely.
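For example, the threshold can be adjusted at runtime on the SQL config (a sketch, assuming an existing SparkSession named spark; the value is in bytes):

```python
# Assumes a SparkSession `spark` already exists; default is 10485760 (10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Setting it to -1 disables automatic broadcast joins entirely;
# explicit broadcast() hints still work regardless of this setting.
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```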
When you broadcast a table, it is copied to each executor only once. The join can then be performed locally against each partition of the large table, so neither data set needs to be shuffled during execution, which in turn means less network I/O. The broadcast itself costs one round of network transfer (driver to executors), but for a small table this is far cheaper than a shuffle.
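The mechanics can be illustrated with a toy sketch in plain Python (not Spark, and the data is made up): the small table is copied whole to every "executor", which then joins its own partition of the large table against an in-memory lookup, so no rows of the large table ever move between workers.

```python
# Large side: flight rows, already partitioned across two simulated workers.
flights_partitions = [
    [("AA100", "JFK"), ("AA200", "LAX")],
    [("BA300", "LHR"), ("AA400", "JFK")],
]

# Small side: the "broadcast" table, copied in full to every worker.
airports = {"JFK": "New York", "LAX": "Los Angeles", "LHR": "London"}

def executor_join(partition, broadcast_airports):
    # Each worker probes the broadcast lookup table locally;
    # its partition of the large table never leaves the worker.
    return [(flight, dest, broadcast_airports[dest])
            for flight, dest in partition]

results = []
for partition in flights_partitions:  # simulate each worker running in turn
    results.extend(executor_join(partition, airports))

print(results)
```

A shuffle join would instead repartition both tables by join key across the network; here the only transfer is the one-time copy of the small table.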