
Spark: rdd.count() and rdd.write() are executing transformations twice


I am using Apache Spark to fetch records from a database and, after some transformations, write them to AWS S3. I also want to count the number of records I am writing to S3, and for that I am doing

rdd.count() and then
rdd.write()

This way, all the transformations are executed twice, which causes performance issues. Is there any way to achieve this without executing the transformations again?


Solution

  • Two actions - the count and the write - mean two sets of reading: each action triggers evaluation of the whole lineage from the source.

    Assuming something like this:

    val rdd = sc.parallelize(collectedData, 4)
    

    then by adding .cache:

    val rdd = sc.parallelize(collectedData, 4).cache
    

    this will generally obviate the second read, though not always - for example, if cached partitions are evicted under memory pressure. You can also look at persist and its storage levels; see the sketch after this list. Of course, caching has an overhead of its own, and whether it pays off depends on the data sizes in play.

    The DAG visualization in the Spark UI will show a green segment or dot on the cached stage, indicating that caching has been applied.
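
A minimal sketch of the full pattern, staying with the answer's RDD example; collectedData, the map step, and the s3a output path are placeholders, and saveAsTextFile stands in for the question's rdd.write():

    import org.apache.spark.storage.StorageLevel

    // Placeholder input and transformation - in practice this is the data
    // fetched from the database and the real transformation chain.
    val rdd = sc.parallelize(collectedData, 4)
      .map(transform)                         // hypothetical transformation step
      .persist(StorageLevel.MEMORY_AND_DISK)  // .cache() is shorthand for MEMORY_ONLY

    val count = rdd.count()                   // 1st action: evaluates the lineage once, populating the cache
    rdd.saveAsTextFile("s3a://my-bucket/output/")  // 2nd action: served from the cache, no re-read

    rdd.unpersist()                           // release the cached partitions when done

MEMORY_AND_DISK is the safer level here: with the default MEMORY_ONLY, any partition that does not fit in memory is dropped and recomputed from the lineage when the second action runs.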