I am a little confused by Spark's caching mechanism.
Let's say I have a Spark application with only one action at the end of multiple transformations. Suppose I have a DataFrame A and I apply 2-3 transformations on it, creating multiple DataFrames which eventually help create the final DataFrame that is saved to disk.
Example:
val A = spark.read()   // large size
val B = A.map()
val C = A.map()
.
.
.
val D = B.join(C)
D.save()
So do I need to cache DataFrame A for better performance?
Thanks in advance.
Yes, you are correct.
You should cache A, as it is used as input for both B and C. The DAG visualization would show the extent of reuse (or of going back to source, in this case). If you have a noisy cluster, some spilling to disk could occur.
See also the top answer here: (Why) do we need to call cache or persist on a RDD
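As a rough sketch in DataFrame terms (the Parquet paths and the id/value column names below are purely illustrative, not from your code), the cache would go right after the read so that B and C both reuse A instead of re-reading the source:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("CacheExample").getOrCreate()

// Cache A because both B and C take it as input; use persist(StorageLevel.MEMORY_AND_DISK)
// instead of cache() if A may not fit in memory.
val A = spark.read.parquet("/data/large_input").cache()

val B = A.withColumn("doubled", col("value") * 2)      // stand-in transformation
val C = A.select(col("id"), col("value").as("other"))  // stand-in transformation
val D = B.join(C, Seq("id"))                           // assumes a shared "id" column

D.write.parquet("/data/output")                        // the single action at the end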
However, I was initially looking for skipped stages in the UI (silly me); the benefit actually shows up differently, as per below.
The following code is akin to your own:
// aa is read from source; toggle .cache on/off to compare the two runs in the UI
val aa = spark.sparkContext.textFile("/FileStore/tables/filter_words.txt") //.cache
val a = aa.flatMap(x => x.split(" ")).map(_.trim)
val b = a.map(x => (x, 1))   // two branches both derived from a
val c = a.map(x => (x, 2))
val d = b.join(c)            // the join forces both branches to be computed
d.count                      // the single action that triggers evaluation
Looking at the UI with .cache:
and without .cache:
QED: so .cache has a benefit; it would not make sense otherwise. Also, without caching, the two reads of the source could return different results in some cases (e.g. if the underlying data changes between reads).
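As a side note, if you prefer checking from the driver instead of the UI, a couple of standard RDD methods (using the aa and d from the snippet above) make the caching visible as well:

println(aa.getStorageLevel)  // StorageLevel NONE without .cache, MEMORY_ONLY with it
println(d.toDebugString)     // lineage; cached ancestors show CachedPartitions after the first action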