dataframe, apache-spark, pyspark, lazy-evaluation

Does Spark Always Read Data When an Action Occurs?


I'm new to Spark and have learned that there are transformations and actions. Transformations return new RDDs and DataFrames, and actions perform operations on them. Unless an action is called, no transformations are performed; until then they are just steps in the lineage. So, when I read a DataFrame, that is also a transformation. If I call two actions on the same DataFrame after reading it, is the file read twice, or is it read only once and both actions then performed on the result?

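from pyspark.sql import SparkSession
ss = SparkSession.builder.getOrCreate()  # the session referred to as ss
df = ss.read.csv('test.csv')
df.count()  # first action
df.take(5)  # second action on the same DataFrame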

Solution

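  • In general, an action is what triggers execution, including the read, as you state. What matters is not which operations you call but when a job actually runs, bar a few exceptions where Spark has to run a job before any action is called (for example, to infer a schema while reading).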

    If you have not cached or persisted the data, each action reads it again, so it is read more than once, unless some stages are skipped because Spark can reuse output that an earlier job already produced.

    Lazy execution and lineage mean Spark sees the whole plan before running anything, so the code can be optimized.

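    Actions such as take(n) are optimized: Spark scans only as many partitions as it needs to return n rows rather than the whole dataset. show() gets similar special treatment, since it only needs the first few rows.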
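
To make the point about re-reading concrete, here is a minimal pure-Python sketch of the idea. It is a toy model, not Spark's implementation: `LazyFrame`, `read_file`, and the `reads` counter are hypothetical names invented for this illustration. It only shows that replaying the lineage for each action reads the source again, and that caching avoids the second read.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: it stores a recipe (lineage), not data."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that "reads" the data
        self._steps = tuple(steps)   # recorded transformations, not yet applied
        self._use_cache = False
        self._cached = None          # filled by the first action after cache()

    def map(self, fn):
        # Transformation: only extends the lineage, nothing is executed here.
        return LazyFrame(self._source, self._steps + (fn,))

    def cache(self):
        # Only marks the frame for caching; nothing is materialized yet.
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        rows = self._source()        # the actual read happens here
        for fn in self._steps:
            rows = [fn(r) for r in rows]
        if self._use_cache:
            self._cached = rows
        return rows

    # Actions: each one triggers a computation.
    # (Unlike Spark's take(n), this toy take() always computes every row.)
    def count(self):
        return len(self._compute())

    def take(self, n):
        return self._compute()[:n]


reads = {"n": 0}  # counts how often the "file" is read


def read_file():
    reads["n"] += 1
    return list(range(10))


# Without caching, each action replays the lineage, including the read.
uncached = LazyFrame(read_file).map(lambda x: x * 2)
uncached.count()
uncached.take(5)
reads_without_cache = reads["n"]

# With caching, the first action fills the cache and the second reuses it.
reads["n"] = 0
cached = LazyFrame(read_file).map(lambda x: x * 2).cache()
cached.count()
cached.take(5)
reads_with_cache = reads["n"]

print(reads_without_cache, reads_with_cache)
```

In PySpark itself, the corresponding change to the question's code would be to call df.cache() (or df.persist()) before the first action; the count() then populates the cache and the take(5) can be served from it.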