I want a DataFrame to stay persisted even after writing it to a Hive table.
<change data capture code>
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # count is 100
df.write.mode("append").insertInto("schema.table")
df.count()  # count is 0, because the change data capture part is recomputed
It seems that df is getting unpersisted after the write to Hive. Is this behavior expected? If so, how can we work around it?
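For reference, one way to see whether the cache survives the write is to print the cache state before and after. This is only a diagnostic sketch that reuses the df and table name from the snippet above; DataFrame.is_cached and DataFrame.storageLevel are standard PySpark DataFrame properties, though the exact values printed will depend on the Spark version:

print(df.is_cached, df.storageLevel)   # cache state before the write
df.write.mode("append").insertInto("schema.table")
print(df.is_cached, df.storageLevel)   # cache state after the write
df.count()                             # recomputed if the cached data was invalidated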
You can persist the underlying RDD after converting the DataFrame to an RDD, and rebuild the DataFrame from it after the write.
# Store the schema so the RDD can be converted back to a DataFrame later
rdd_schema = df.schema

# Persist the underlying RDD instead of the DataFrame
df_rdd = df.rdd.persist(StorageLevel.MEMORY_AND_DISK)
df_rdd.count()  # count is 100; this action also materializes the persisted RDD

df.write.mode("append").insertInto("schema.table")
# After the write the DataFrame's cached data is gone, so rebuild a DataFrame from the persisted RDD
df_persisted = spark.createDataFrame(df_rdd, schema=rdd_schema)
df_persisted.count()  # count is 100; computed from the persisted RDD instead of re-running the change data capture logic
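As a follow-up sketch (assuming the df_rdd and df_persisted from above; schema.other_table is just a hypothetical second target, not part of the original code), you can keep using the rebuilt DataFrame and then release the persisted RDD explicitly once it is no longer needed. RDD-level persistence is not tracked by Spark SQL's cache manager, which is presumably why the Hive write does not clear it:

df_persisted.write.mode("append").insertInto("schema.other_table")  # hypothetical further use
df_rdd.unpersist()  # free the memory/disk held by the persisted RDD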