apache-spark, jupyter-notebook, amazon-emr, spark-notebook

Setting spark.driver.maxResultSize in an EMR Jupyter notebook


I am using a Jupyter notebook on EMR to handle large chunks of data. While processing the data I see this error:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 108 tasks (1027.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

It seems I need to increase spark.driver.maxResultSize in the Spark config. How do I set it from a Jupyter notebook?

I have already checked this post: Spark 1.4 increase maxResultSize memory.
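That approach sets the value when the context is first created, e.g. (a minimal sketch in current PySpark; the app name and the 4g value are just placeholders):

    from pyspark.sql import SparkSession

    # Sketch: in a standalone PySpark job the limit is set while
    # building the session, before any jobs run.
    spark = (SparkSession.builder
             .appName("example")  # placeholder name
             .config("spark.driver.maxResultSize", "4g")  # placeholder value
             .getOrCreate())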

Also, in an EMR notebook the Spark context is already provided. Is there any way to edit the existing context and increase maxResultSize?

Any leads would be very helpful.

Thanks


Solution

  • You can set the Livy configuration at the start of the Spark session; see https://github.com/cloudera/livy#request-body

    Place this in the first cell of your notebook, before any Spark code runs; the -f flag forces the session to be recreated with the new configuration:

    %%configure -f
    {"conf":{"spark.driver.maxResultSize":"15G"}}
    
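    The %%configure cell can also carry other fields from the Livy request body, such as driver memory, which the collected result ultimately has to fit into. A sketch with illustrative values (the cell body must be plain JSON, so no comments):

    %%configure -f
    {"driverMemory": "20G",
     "conf": {"spark.driver.maxResultSize": "15G"}}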

    Check your session's setting by printing it in the next cell:

    print(spark.conf.get('spark.driver.maxResultSize'))
    

    This should resolve the problem.
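    If raising the limit only postpones the error, it is often possible to avoid shipping the full result to the driver at all. A sketch (df and the S3 path are placeholders for your own data):

    # Option 1: write results out from the executors instead of collecting.
    df.write.mode("overwrite").parquet("s3://your-bucket/output/")  # placeholder path

    # Option 2: iterate on the driver one partition at a time,
    # instead of materializing everything at once with collect().
    for row in df.toLocalIterator():
        ...  # handle each row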