apache-spark, pyspark, hadoop-yarn

Are there any Spark configuration parameters that one can tune in order to decrease the driver node's memory consumption?



I am using pyspark, scikit-learn and joblibspark to carry out a distributed hyper-parameter random search (RandomizedSearchCV) on a YARN cluster. It looks like the memory consumption of the driver node roughly equals the sum of the memory consumption of all the worker nodes. Because the memory per node is limited, the driver node reaches this limit very quickly.
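
For context, the setup looked roughly like the sketch below; it assumes joblibspark's `register_spark()` backend, and the estimator, parameter distributions and data are placeholders rather than the original code.

```python
# Rough sketch of the joblibspark setup described above
# (estimator, search space and data are illustrative placeholders).
from joblibspark import register_spark
from joblib import parallel_backend
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import make_classification

register_spark()  # register the Spark backend for joblib

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={"n_estimators": [50, 100, 200], "max_depth": [4, 8, 16]},
    n_iter=10,
    cv=3,
)

# Each fit runs as a Spark task, but results flow back to the driver,
# which is where the memory pressure described above builds up.
with parallel_backend("spark", n_jobs=8):
    search.fit(X, y)
```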


Solution

  • Eventually, I found the joblibspark library to be poorly suited for this job, especially with a large (in terms of memory) feature matrix. Therefore, I implemented the random search "from scratch" for scikit-learn models using native pyspark functionality, so that the whole result is never collected at the driver node at the end of the computation. I found pandas UDFs in pyspark to be especially useful; a minimal sketch follows below.
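
The sketch below shows one way such a hand-rolled random search can look, assuming the broadcast training data fits in executor memory; the parameter ranges, the RandomForestClassifier model, and the output path are illustrative, not the original code. Each hyper-parameter combination is evaluated inside a grouped-map pandas UDF, and only a tiny row of scores per trial ever leaves the executors.

```python
# Minimal sketch of a "from scratch" random search with a pandas UDF
# (model, parameter ranges and output path are illustrative assumptions).
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

spark = SparkSession.builder.getOrCreate()

# Candidate hyper-parameter combinations: one small row per trial.
rng = np.random.default_rng(42)
params_pdf = pd.DataFrame({
    "trial_id": range(20),
    "n_estimators": rng.integers(50, 500, size=20),
    "max_depth": rng.integers(2, 16, size=20),
})
params_sdf = spark.createDataFrame(params_pdf)

# Broadcast the training data once so every executor can reuse it.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
bc_data = spark.sparkContext.broadcast((X, y))

def evaluate_trial(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit and cross-validate one hyper-parameter combination on an executor."""
    X_local, y_local = bc_data.value
    row = pdf.iloc[0]
    model = RandomForestClassifier(
        n_estimators=int(row["n_estimators"]),
        max_depth=int(row["max_depth"]),
    )
    score = cross_val_score(model, X_local, y_local, cv=3).mean()
    # Return only a small score row, not fitted models or data,
    # so the driver never holds the bulky results.
    return pd.DataFrame({"trial_id": [int(row["trial_id"])], "score": [score]})

results = (
    params_sdf.groupBy("trial_id")
    .applyInPandas(evaluate_trial, schema="trial_id long, score double")
)

# Write the scores out instead of collecting them on the driver
# (the HDFS path is a placeholder).
results.write.mode("overwrite").parquet("hdfs:///tmp/random_search_scores")
```

The key design choice is that only the small parameter table and the per-trial scores cross the driver/executor boundary; the feature matrix is broadcast once and the fitted models never leave the executors.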