Tags: apache-spark, pyspark, out-of-memory

Driver out of memory due to broadcasting


I learned that driver out-of-memory errors can also happen because of broadcasting.

During a broadcast, the smaller table/DataFrame is copied to the memory of every executor on the worker nodes. How can an out-of-memory error occur on the driver node during broadcasting? Does the driver also keep a copy of the same data in its own memory?


Solution

  • See the attachments and comments on the JIRA ticket SPARK-17556.

    Currently in Spark SQL, in order to perform a broadcast join, the driver must first collect the result of an RDD and then broadcast it. This introduces some extra latency and, of course, can cause an OOM on the driver.

    It may become possible to broadcast directly from the executors (that work is in progress).
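Since the driver must hold the collected table before broadcasting it, two common mitigations are to give the driver more memory or to lower (or disable) the automatic broadcast threshold so Spark falls back to a shuffle join. A hedged sketch using standard `spark-submit` options; the `4g` figure and the script name `job.py` are placeholder assumptions, not values from the answer above:

```shell
# Give the driver enough memory to hold the collected broadcast table
# (4g is an assumed figure; size it to your small table plus overhead).
# Setting spark.sql.autoBroadcastJoinThreshold to -1 disables automatic
# broadcast joins entirely, forcing a shuffle join instead.
spark-submit \
  --driver-memory 4g \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  job.py
```

Note that an explicit `broadcast()` hint in the query still forces a broadcast regardless of the threshold, so remove such hints as well if the driver keeps running out of memory.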