apache-spark, pyspark, apache-spark-sql, rdd

How does caching in Spark work?


I'm struggling to grasp the use cases for caching in Spark. I understand the concept as "it keeps an RDD in memory", but isn't that already accomplished once an action is performed?

Let's say I read a text file into an RDD named "df", then run count() as my action. Doesn't that already put my RDD in memory, where it can be reused later? So why, or when, would I want to cache my RDD? Is it for the case of using filters (though filter returns a new RDD that can be stored in a new variable)?

Thank you for the help :)


Solution

  • When you call an action, the RDD does come into memory, but that memory is freed once the action finishes. By caching the RDD, it is persisted in memory (or on disk, depending on the storage level you chose) so that it isn't wiped and can be reused to speed up future computations on the same RDD.

    Filters are different because filter is a transformation, not an action. You can of course cache a filtered RDD too, but it will only actually be persisted in memory after an action has been called on the filtered RDD.
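    A minimal sketch of both points, using parallelize as a stand-in for the text file in the question (the data and names here are illustrative, not from the original post):

    ```python
    from pyspark.sql import SparkSession

    # Local session just for the demo.
    spark = SparkSession.builder.master("local[1]").appName("cache-demo").getOrCreate()
    sc = spark.sparkContext

    # Stand-in for sc.textFile("some.txt") in the question.
    rdd = sc.parallelize(["a", "b", "a", "c"])

    rdd.cache()           # only marks the RDD for persistence; nothing is stored yet
    rdd.count()           # first action: computes the partitions AND caches them
    rdd.count()           # second action: served from the cache, not recomputed

    # filter is a transformation: it returns a new, separate RDD.
    filtered = rdd.filter(lambda x: x == "a").cache()
    filtered.count()      # the filtered RDD is materialized (and cached) only now

    spark.stop()
    ```

    Note that cache() is itself lazy: it sets a flag (visible via rdd.is_cached) but the data only lands in memory when the first action runs.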