apache-spark, apache-spark-sql

Is it efficient to cache a DataFrame in a single-action Spark application in which that DataFrame is referenced more than once?


I am a little confused by Spark's caching mechanism.

Let's say I have a Spark application with only one action at the end of multiple transformations. Suppose I have a DataFrame A and I apply 2-3 transformations to it, creating multiple DataFrames which eventually feed into a final DataFrame that gets saved to disk.

Example:

val A = spark.read() // large size
val B = A.map()
val C = A.map()
// ... more transformations ...
val D = B.join(C)
D.save()

So, do I need to cache DataFrame A to improve performance?

Thanks in advance.


Solution

  • Yes, you are correct.

    You should cache A as it is used as input for both B and C. The DAG visualization would show the extent of reuse, or of going back to source (in this case). If you have a noisy cluster, some spilling to disk could occur; see the storage-level sketch at the end of this answer.
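
    As a sketch of what that would look like with the DataFrame example from the question (the parquet paths and column names below are assumptions made up for illustration, not from the original post):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cache-example").getOrCreate()
    import spark.implicits._

    // Placeholder source path; any large input would do
    val A = spark.read.parquet("/data/input.parquet").cache()

    // Two branches that both consume A
    val B = A.select($"id", ($"value" * 2).as("doubled"))
    val C = A.select($"id", ($"value" + 1).as("incremented"))

    // The single action: the first scan of A populates the cache,
    // and the second branch reuses the cached data instead of re-reading the source
    val D = B.join(C, "id")
    D.write.mode("overwrite").parquet("/data/output.parquet")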

    See also the top answer to the question "(Why) do we need to call cache or persist on a RDD".

    However, I was looking for skipped stages, silly me. Something else shows up instead, as per below.

    The following code is akin to your own:

    val aa = spark.sparkContext.textFile("/FileStore/tables/filter_words.txt") //.cache
    val a = aa.flatMap(x => x.split(" ")).map(_.trim)
    val b = a.map(x => (x, 1))
    val c = a.map(x => (x, 2))
    val d = b.join(c)
    d.count
    

    Looking at the UI with .cache:

    [Spark UI screenshot with .cache]

    and without .cache:

    [Spark UI screenshot without .cache]

    QED: so, .cache has a benefit; it would not make sense otherwise. Also, two separate reads of the source could lead to different results in some cases (e.g. if the underlying data changes between reads).
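
    As an aside on the spill-to-disk point above: you can pick the storage level explicitly rather than relying on the default of .cache, and release the data once the action has run. A rough sketch, reusing the aa and d variables from the snippet above:

    import org.apache.spark.storage.StorageLevel

    // Keep what fits in memory and spill the rest to local disk
    // (the RDD default for .cache is MEMORY_ONLY, which recomputes evicted partitions instead)
    aa.persist(StorageLevel.MEMORY_AND_DISK)

    d.count          // the action; the first pass over aa materializes the persisted data

    aa.unpersist()   // free the cached blocks once they are no longer needed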