Tags: python, apache-spark, caching, pyspark

Does PySpark cache a DataFrame by default?


If I read a file in PySpark:

data = spark.read.csv("file.csv")

Then for the life of the Spark session, `data` is available in memory, correct? So if I call data.show() 5 times, it will not read from disk 5 times. Is that correct? If yes, why do I need:

data.cache()

Solution

  • If I read a file in PySpark: data = spark.read.csv("file.csv") Then for the life of the Spark session, `data` is available in memory, correct?

    No. Because of Spark's lazy evaluation, nothing is read at this point. The read is a transformation; the file is only scanned when an action is triggered — in your case, the first call to show().

    So if I call data.show() 5 times, it will not read from disk 5 times. Is that correct?

    No. The DataFrame is re-evaluated — re-read from disk — on each call to show(). Calling cache() marks the DataFrame for caching; note that cache() itself is also lazy. On the first action after cache(), the data is materialized in memory (and spilled to disk if needed, per the storage level), and subsequent actions read from the cache instead of recomputing the whole lineage.
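To make the re-evaluation concrete without needing a Spark cluster, here is a toy Python sketch of the behavior described above. It is an analogy, not Spark internals: `LazySource` and `LazyFrame` are invented names standing in for the CSV file and the DataFrame, and the counter shows how many times the "file" is scanned with and without cache().

```python
class LazySource:
    """Stand-in for a CSV file; counts how many times it is scanned."""
    def __init__(self, rows):
        self.rows = rows
        self.scans = 0

    def read(self):
        self.scans += 1
        return list(self.rows)


class LazyFrame:
    """Stand-in for a DataFrame: every action re-reads unless cached."""
    def __init__(self, source):
        self.source = source
        self._cache_requested = False  # cache() itself is lazy
        self._materialized = None

    def cache(self):
        self._cache_requested = True   # nothing is read yet
        return self

    def show(self):                    # stand-in for an action
        if self._cache_requested:
            if self._materialized is None:
                # first action after cache(): materialize once
                self._materialized = self.source.read()
            return self._materialized
        return self.source.read()      # uncached: full re-read per action


src = LazySource([1, 2, 3])
df = LazyFrame(src)
assert src.scans == 0                  # the "read" was lazy: nothing scanned yet
for _ in range(5):
    df.show()
assert src.scans == 5                  # 5 actions -> 5 scans without cache

src2 = LazySource([1, 2, 3])
df2 = LazyFrame(src2).cache()
assert src2.scans == 0                 # cache() is also lazy
for _ in range(5):
    df2.show()
assert src2.scans == 1                 # materialized once, then served from cache
print("uncached scans:", src.scans, "cached scans:", src2.scans)
```

In real PySpark the equivalent pattern is `df = spark.read.csv("file.csv")`, then `df.cache()` followed by an action such as `df.count()` to materialize the cache before repeated use.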