apache-spark, dataframe, hive, pyspark, persist

Stop auto-unpersist of a DataFrame after writing to a Hive table


I want a DataFrame to stay persisted even after writing it to a Hive table.

from pyspark import StorageLevel

# <change data capture code>
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # count is 100
df.write.mode("append").insertInto("schema.table")
df.count()  # count is 0, because the change-data-capture step is recomputed

Here it seems that df gets unpersisted after writing to Hive. Is this behavior expected, and if so, how can we work around it?


Solution
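
This appears to be expected behavior: writing to schema.table refreshes that table in the catalog, which invalidates any cached Dataset whose plan reads from it. Since df's change-data-capture step reads the target table, df is uncached and recomputed on the next action, and the recomputation finds no new rows. One way to confirm this (a sketch, assuming Spark 2.1+, where DataFrame.storageLevel is available) is to print the storage level around the write:

    print(df.storageLevel)  # memory-and-disk level while cached
    df.write.mode("append").insertInto("schema.table")
    print(df.storageLevel)  # back to the default (unpersisted) level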

  • You can persist the underlying RDD after converting the DataFrame to an RDD.

    Store the schema so that the RDD can be converted back to a DataFrame later:

    rdd_schema = df.schema
    df_rdd = df.rdd.persist(StorageLevel.MEMORY_AND_DISK)  # persist the rows as an RDD

    df.count()  # count is 100
    df.write.mode("append").insertInto("schema.table")
    

    Now that df's cached data is gone, the persisted RDD can be used to rebuild the DataFrame:

    df_persisted = spark.createDataFrame(df_rdd, schema=rdd_schema)
    df_persisted.count()  # count is 100; computed from the persisted RDD
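
  • Alternatively (a sketch, assuming Spark 2.1+ and that writing a checkpoint
    directory is acceptable), checkpoint the DataFrame before writing.
    checkpoint() eagerly materializes the rows and truncates the plan, so the
    data no longer depends on schema.table and the insert has nothing to
    invalidate. The directory path here is only an example:

    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # example path

    df_ckpt = df.checkpoint()  # eager by default: computes df and cuts its lineage
    df_ckpt.count()            # count is 100
    df_ckpt.write.mode("append").insertInto("schema.table")
    df_ckpt.count()            # still 100; df_ckpt's plan no longer reads the table

    The RDD route above likely survives for a similar reason: the persisted RDD
    lives outside the Dataset cache that Spark invalidates, at the cost of a
    serialization round-trip when rebuilding the DataFrame.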