Tags: databricks, azure-databricks

Does CACHE TABLE persist if session is restarted?


I have a scenario where I'm reading data from remote storage:

df = spark.read.format("csv").load("abfss://[email protected]/mydata.csv")

It's a small dataset of a few GB, and it takes around one to two minutes to load.

The same notebook is run manually every day, and I'm looking to make some small optimisations.

It appears cache() and persist() won't help, because the data will be uncached/unpersisted at the end of the session?
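
For reference, this is roughly what within-session caching looks like (a minimal sketch, reusing the source path from above); the cached data is dropped when the session ends:

df = spark.read.format("csv").load("abfss://[email protected]/mydata.csv")
df.cache()    # mark the DataFrame for in-memory caching (lazy)
df.count()    # first action materialises the cache; later actions reuse it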

Is it an OK pattern to write the data to local storage on the cluster and read it from there? For example:

import os

localfile = '/X/myfile.parquet'

if os.path.exists(localfile):
    # Fast path: reuse the copy written on a previous run
    df = spark.read.parquet(localfile)
else:
    df = spark.read.csv("abfss://[email protected]/mydata.csv")
    # do some basic munging
    df.write.parquet(localfile)

How can I determine where the local disks (i.e. the disks attached to the driver and worker nodes) are mounted, and whether users are permitted to write to them?


Update:

The cluster will occasionally get restarted, but not often.


Solution

  • Since your cluster is restarted periodically, I would not write to local disk; instead, write the data as a Delta table to cloud storage (S3, Azure Blob Storage) if possible.

    This should speed up your query immensely. A minimal sketch of that pattern follows.
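
A minimal sketch of the answer's suggestion, assuming the notebook runs on a Databricks cluster where spark is predefined and the delta-spark DeltaTable helper is available; the delta_path target location and the header option are hypothetical placeholders, not from the original answer:

from delta.tables import DeltaTable

delta_path = "abfss://container@account.dfs.core.windows.net/mydata_delta"  # hypothetical target location

if DeltaTable.isDeltaTable(spark, delta_path):
    # Subsequent runs: read the already-converted Delta table (fast)
    df = spark.read.format("delta").load(delta_path)
else:
    # First run: read the source CSV, do the basic munging, write as Delta
    df = spark.read.csv("abfss://[email protected]/mydata.csv", header=True)
    # ... basic munging ...
    df.write.format("delta").mode("overwrite").save(delta_path)

Because the Delta table lives in cloud storage rather than on cluster-local disk, it survives both session and cluster restarts.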