Tags: apache-spark, caching, persist

Why does Apache Spark provide a cache() method when the same functionality can be achieved by calling persist(StorageLevel.MEMORY_ONLY)?


Why did Spark add the cache() method to its library (i.e. rdd.py) even though it internally just calls self.persist(StorageLevel.MEMORY_ONLY), as shown below:

def cache(self):
    """
    Persist this RDD with the default storage level (C{MEMORY_ONLY}).
    """
    self.is_cached = True
    self.persist(StorageLevel.MEMORY_ONLY)
    return self

Solution

  • cache is a convenience method for caching a DataFrame with the default storage level. persist is the more general method: it takes a storage level as a parameter and persists the DataFrame accordingly.

    The default storage level for cache and persist is the same, so, as you noted, the two calls are duplicates and you can use either. In the Scala implementation, cache simply delegates to persist: def cache(): this.type = persist(). This shows that persist is the real implementation and cache is syntactic sugar; the sketch below illustrates the equivalence from PySpark.
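
    For illustration, here is a minimal PySpark sketch (assuming local mode; the app name and RDD contents are arbitrary) confirming that cache() and persist(StorageLevel.MEMORY_ONLY) produce the same storage level, while persist() can also take a different level:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[2]", "cache-vs-persist")

    cached = sc.parallelize(range(100)).cache()
    persisted = sc.parallelize(range(100)).persist(StorageLevel.MEMORY_ONLY)

    # Both RDDs report the same storage level.
    print(cached.getStorageLevel())
    print(persisted.getStorageLevel())

    # persist() is the more general API: it also accepts other levels,
    # e.g. one that spills partitions to disk when memory is full.
    spillable = sc.parallelize(range(100)).persist(StorageLevel.MEMORY_AND_DISK)
    print(spillable.getStorageLevel())

    sc.stop()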