apache-spark, apache-spark-sql

Is it efficient to cache a DataFrame in a single-action Spark application in which that DataFrame is referenced more than once?


I am a little confused by Spark's caching mechanism.

Let's say I have a Spark application with only one action at the end of multiple transformations. Suppose I have a DataFrame A and I apply 2-3 transformations to it, creating multiple DataFrames which eventually feed into a final DataFrame that gets saved to disk.

Example:

val A = spark.read() // large size
val B = A.map()
val C = A.map()
// ... more transformations ...
val D = B.join(C)
D.save()

So, do I need to cache DataFrame A to improve performance?

Thanks in advance.


Solution

  • Yes, you are correct.

    You should cache A as it is used as input for both B and C. The DAG visualization would show the extent of reuse, or of going back to source (in this case). If you have a noisy cluster, some spilling to disk could occur; see the storage-level sketch at the end of this answer.
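
    As a sketch of what that would look like with the DataFrame example from the question (the parquet paths and column names below are assumptions made up for illustration, not from the original post):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cache-example").getOrCreate()
    import spark.implicits._

    // Placeholder source path; any large input would do
    val A = spark.read.parquet("/data/input.parquet").cache()

    // Two branches that both consume A
    val B = A.select($"id", ($"value" * 2).as("doubled"))
    val C = A.select($"id", ($"value" + 1).as("incremented"))

    // The single action: the first scan of A populates the cache,
    // and the second branch reuses the cached data instead of re-reading the source
    val D = B.join(C, "id")
    D.write.mode("overwrite").parquet("/data/output.parquet")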

    See also the top answer to the question "(Why) do we need to call cache or persist on a RDD".

    However, I was looking for skipped stages, silly me. Something else shows up instead, as per below.

    The following code is akin to your own:

    val aa = spark.sparkContext.textFile("/FileStore/tables/filter_words.txt") //.cache
    val a = aa.flatMap(x => x.split(" ")).map(_.trim)
    val b = a.map(x => (x, 1))
    val c = a.map(x => (x, 2))
    val d = b.join(c)
    d.count
    

    Looking at the UI with .cache:

    [Spark UI screenshot with .cache]

    and without .cache:

    [Spark UI screenshot without .cache]

    QED: so, .cache has a benefit; it would not make sense otherwise. Also, two separate reads of the source could lead to different results in some cases (e.g. if the underlying data changes between reads).
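
    As an aside on the spill-to-disk point above: you can pick the storage level explicitly rather than relying on the default of .cache, and release the data once the action has run. A rough sketch, reusing the aa and d variables from the snippet above:

    import org.apache.spark.storage.StorageLevel

    // Keep what fits in memory and spill the rest to local disk
    // (the RDD default for .cache is MEMORY_ONLY, which recomputes evicted partitions instead)
    aa.persist(StorageLevel.MEMORY_AND_DISK)

    d.count          // the action; the first pass over aa materializes the persisted data

    aa.unpersist()   // free the cached blocks once they are no longer needed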