Why did Spark add a cache() method to its library (i.e. rdd.py) even though it internally just calls self.persist(StorageLevel.MEMORY_ONLY), as shown below:
    def cache(self):
        """
        Persist this RDD with the default storage level (C{MEMORY_ONLY}).
        """
        self.is_cached = True
        self.persist(StorageLevel.MEMORY_ONLY)
        return self
cache is a convenience method for caching an RDD (or DataFrame) with the default storage level. persist is the more general method: it can take a storage level as a parameter and persist the data accordingly. The default storage level for cache and persist is the same, so, as you noted, the two calls are duplicated in that case. You can use either.
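To make the difference concrete, here is a minimal PySpark sketch (the local SparkContext, app name, and variable names are just assumptions for this example): cache() is interchangeable with an explicit persist(StorageLevel.MEMORY_ONLY), while a non-default level can only be requested through persist().

    from pyspark import SparkContext, StorageLevel

    # Hypothetical local setup just for this sketch
    sc = SparkContext("local[*]", "cache-vs-persist")

    rdd = sc.parallelize(range(1000))

    # Equivalent to calling rdd.map(...).persist(StorageLevel.MEMORY_ONLY)
    cached = rdd.map(lambda x: x * 2).cache()

    # A non-default level is only available through persist()
    persisted = rdd.map(lambda x: x * 2).persist(StorageLevel.MEMORY_AND_DISK)

    cached.count()       # the first action materializes and stores the partitions
    persisted.count()

    print(cached.getStorageLevel())      # the level set by cache()
    print(persisted.getStorageLevel())   # the level passed to persist()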
In the Scala implementation, cache simply calls persist:

    def cache(): this.type = persist()

This tells me that persist is the real implementation and cache is just syntactic sugar.