apache-spark, rdd

Differences between persist(DISK_ONLY) vs manually saving to HDFS and reading back


This answer clearly explains RDD persist() and cache() and the need for them: (Why) do we need to call cache or persist on a RDD

So, I understand that calling someRdd.persist(DISK_ONLY) is lazy, but someRdd.saveAsTextFile("path") is eager.
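As a minimal sketch of that difference, assuming a standard SparkSession setup (the input/output paths and the toUpperCase step are illustrative, not from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persist-vs-save").getOrCreate()
val sc = spark.sparkContext

// Illustrative input and transformation; the real RDD would come from an expensive pipeline.
val someRdd = sc.textFile("hdfs:///data/input").map(_.toUpperCase)

// Lazy: persist only records the storage level; nothing is computed or written yet.
someRdd.persist(StorageLevel.DISK_ONLY)

// The cached copy is only materialized (spilled to executor-local disk) once an
// action runs over the RDD, for example:
someRdd.count()

// Eager: saveAsTextFile is itself an action, so it computes the RDD (reusing the
// disk cache here) and writes the result to HDFS right away.
someRdd.saveAsTextFile("hdfs:///data/copy")
```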

But other than this (and disregarding the manual cleanup of the text files stored in HDFS), is there any other difference (performance or otherwise) between using persist to cache the RDD to disk versus manually writing it out and reading it back from disk? Is there a reason to prefer one over the other?

More context: I came across code in our production application that manually writes to HDFS and reads it back. I've just started learning Spark and was wondering whether this can be replaced with persist(DISK_ONLY). Note that the saved RDD in HDFS is deleted before every new run of the job, and the stored data is not used for anything else between runs.


Solution

  • There are at least these differences:

    • Writing to HDFS incurs the replication overhead, while cached data is written locally on the executor (or to a second replica if DISK_ONLY_2 is chosen).
    • Writing to HDFS is persistent, while cached data might get lost if/when an executor is killed for any reason. And you already mentioned the benefit of writing to HDFS when the entire application goes down.
    • Caching does not change the partitioning, but reading back from HDFS will generally result in different partitioning than the DataFrame/RDD that was written: small partitions (files) are combined and large files are split (see the sketch after this list).

    I usually prefer to cache small/medium data sets that are expensive to evaluate, and write larger data sets to HDFS.
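    As a rough sketch of how the manual round trip compares to persist(DISK_ONLY), assuming an existing SparkContext `sc`, an illustrative `expensiveRdd` of strings, and made-up HDFS paths:

    ```scala
    import org.apache.spark.storage.StorageLevel

    // Manual round trip as in the production job: write to HDFS, then read back.
    // The re-read RDD's partitioning is driven by the HDFS files/blocks, not by the
    // partitioning of the RDD that was written (and a real job would also need to
    // re-parse the text lines).
    expensiveRdd.saveAsTextFile("hdfs:///tmp/job-intermediate")
    val reloaded = sc.textFile("hdfs:///tmp/job-intermediate")
    println(s"partitions after round trip: ${reloaded.getNumPartitions}")

    // persist alternative: same partitioning, data kept on executor-local disk only.
    // DISK_ONLY_2 adds a second replica on another executor for some fault tolerance.
    val cached = expensiveRdd.persist(StorageLevel.DISK_ONLY)
    cached.count()  // first action materializes the disk cache
    println(s"partitions when cached: ${cached.getNumPartitions}")

    // ... reuse `cached` in later stages of the job ...

    cached.unpersist()  // release the cached blocks once they are no longer needed
    ```

    Since the intermediate data in your case only lives for a single run anyway, persist plus unpersist keeps that lifecycle inside Spark instead of relying on manual HDFS cleanup.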