I am dealing with a huge amount of data that can't fit into the available memory in PySpark, which results in an Out of Memory error. I need to use the MEMORY_AND_DISK option for this.
My question is: how can I enable this option in a PySpark Jupyter Notebook?
I am looking for something like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('voice-30') \
    .getOrCreate()
This is how we set the driver memory. Is there a similar way to set the MEMORY_AND_DISK storage level for PySpark?
MEMORY_AND_DISK has been the default storage level for persisting a DataFrame since Spark 2.0 (RDDs still default to MEMORY_ONLY), so there is no need to set it explicitly when you persist or cache a DataFrame for reuse across multiple actions. More importantly, you are hitting an OOM error, so tuning the storage level used for persisting is not the answer to your problem.
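That said, if you do want to set a storage level explicitly, it is done on the DataFrame (or RDD) via persist, not through a builder config. A minimal sketch, assuming a placeholder Parquet path:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('voice-30') \
    .getOrCreate()

# Placeholder input path; replace with your actual data source
df = spark.read.parquet('/path/to/data')

# Explicitly request MEMORY_AND_DISK (already the default for DataFrames)
df.persist(StorageLevel.MEMORY_AND_DISK)

df.count()  # the first action materializes the persisted data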
Note from the Spark FAQs:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
Hence, your OOM error means your cluster is running out of storage (both memory and disk), so you need to increase its resources: some combination of memory, disk, and number of nodes.
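Since your snippet uses local[*], the "cluster" is just your machine, so the knobs to revisit look roughly like the sketch below. The values and the spill directory path are placeholders to size to your hardware:

from pyspark.sql import SparkSession

# spark.driver.memory: in local mode the driver JVM does all the work
# spark.sql.shuffle.partitions: more, smaller partitions lower per-task memory
# spark.local.dir: where shuffle and spill files are written; point it at a disk with free space
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('voice-30') \
    .config("spark.driver.memory", "15g") \
    .config("spark.sql.shuffle.partitions", "400") \
    .config("spark.local.dir", "/path/with/free/space") \
    .getOrCreate()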