caching, apache-spark, spark-csv

Would a Spark DataFrame read from an external source on every action?


In the Spark shell I use the code below to read from a CSV file:

// spark is the SparkSession provided by the shell
val df = spark.read
  .format("org.apache.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .csv("/opt/person.csv")
df.show()

Assume this displays 10 rows. If I add a new row to the CSV by editing the file, would calling df.show() again show the new row? If so, does that mean the DataFrame reads from the external source (in this case a CSV file) on every action?

Note that I am neither caching the DataFrame nor recreating it from the Spark session.
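
Concretely, the sequence I am asking about (with a hypothetical out-of-band edit to /opt/person.csv in between) would be:

df.show()   // suppose this prints 10 rows
// ... add an 11th row to /opt/person.csv in an external editor ...
df.show()   // would this re-read the file and print 11 rows?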


Solution

  • After each action, Spark forgets the loaded data and the values of any intermediate results computed along the way.

    So, if you invoke 4 actions one after another, it recomputes everything from the beginning each time.

    The reason is simple: Spark works by building a DAG, which lays out the path of operations from reading the data to the action, and then it executes that plan.

    That is why cache and broadcast variables exist. The onus is on the developer to cache a DataFrame they know they will reuse N times, as in the sketch below.
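
    A minimal sketch of caching in the same spark-shell session, assuming the /opt/person.csv file from the question. Note that cache() itself is lazy; the first action after it materializes the cache, and later actions reuse the cached rows instead of re-reading the file.

    val df = spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .csv("/opt/person.csv")

    df.cache()       // marks the DataFrame for caching; nothing is read yet
    df.count()       // first action: reads the file and materializes the cache
    df.show()        // served from the cache; the CSV is not read again

    // Edits to /opt/person.csv are now invisible to df until the cache is dropped:
    df.unpersist()   // the next action will re-read the file from disk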