Tags: pyspark, query-optimization

PySpark performance issue - repeatedly querying raw data


I'm using PySpark to query a large data set in an AWS S3 bucket with the code below. Querying the raw data is very slow, so I'm trying to improve performance with cache and temp views, but it doesn't seem to work.

baseName = "/local/rawdata"
df_data = spark.read.parquet(baseName)

df_data = df_data.filter(df_data["supplyUuid"]=="xxxxxxxxx...")
df_data = df_data.select(["subscriptionState", "dateTime"])
df_data.cache().createOrReplaceTempView("subscriptionHistory")
display(df_data)

Here I get a table with two columns and 93 records.

Then I run the code below (where f is pyspark.sql.functions) to get the start and end date/time of the different subscription states.

df_data.groupBy("subscriptionState").agg(f.min("dateTime"), f.max("dateTime"))

Supposedly, I've already cached the subset of 93 records in a temp table, and the groupBy should run against that subset. But in fact this second part of the code takes hours and seems to query the whole raw data again.

What is happening here? How can I improve the performance?


Solution

  • Just to explain: Spark has a lazy execution model, which means transformations are not executed until an action comes at the end.

    So calling cache only materializes the results when the first action runs, and you won't benefit from caching unless you reuse the same DataFrame in a later action.

    So calling cache, then createOrReplaceTempView, then groupBy executes nothing immediately; the execution time you observed covers the whole pipeline, from reading the files through the groupBy and aggregations.

    Since I'm not aware of your complete logic and environment, I'd suggest starting with repartition so you can split the load across more workers.

    You can also find more tuning tips in the official documentation.