Tags: scala, apache-spark, hadoop, pyspark, apache-hudi

How to delete a key for all commits in a HUDI table (history)?


For a HUDI table, the goal is to comply with GDPR by deleting a key from the table. I'm only able to delete the data from the latest commit of the table.

How can I make sure the key is deleted for all commits on the HUDI table?

I did a POC: I executed a hard delete, which should delete the complete row.

hard_delete_df = spark.sql("SELECT * FROM table_x where emp_id='1' ")
hudi_options['hoodie.datasource.write.operation'] = 'delete'
hard_delete_df.write.format("hudi").options(**hudi_options).mode("append").save(final_base_path)

The delete works, but ONLY for the latest commit. If I use time travel, as shown below, I still see the deleted row in the older commits.

df_commitbeforedelete = spark.read \
  .format("org.apache.hudi")\
  .option("as.of.instant", "timebeforedelete") \
  .load("s3a://hudi-s3/table_x")
df_commitbeforedelete.show()

Solution

  • You cannot run operations such as delete or upsert on previously committed files. Time travel is meant for reads only.

    You have to rely on cleaning, so that Hudi automatically removes the old committed files.
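
As a rough illustration of the cleaning approach above, the delete write itself can carry cleaner settings so that older file versions become eligible for removal soon after the delete commits. This is a minimal sketch, not a definitive recipe: the helper name `with_aggressive_cleaning` is hypothetical, it assumes a copy-on-write table, and it assumes that keeping a single file version is acceptable for your retention needs (older time-travel and incremental queries will stop working once those versions are cleaned). The config keys are standard Hudi cleaner options, but exactly when the cleaner runs depends on your Hudi version and on things like savepoints, so verify with a time-travel read afterwards.

```python
def with_aggressive_cleaning(hudi_options, versions_retained=1):
    """Return a copy of hudi_options set up for a hard delete with cleaning.

    Hypothetical helper for illustration; it only builds the options dict
    and leaves the caller's dict untouched.
    """
    opts = dict(hudi_options)
    opts.update({
        # Hard delete of the records contained in the DataFrame being written.
        "hoodie.datasource.write.operation": "delete",
        # Let Hudi trigger the cleaner as part of the write.
        "hoodie.clean.automatic": "true",
        # Retain only the newest N versions of each file instead of
        # retaining by number of commits.
        "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
        "hoodie.cleaner.fileversions.retained": str(versions_retained),
    })
    return opts


# Usage with the objects from the question (assumed to exist already):
# delete_options = with_aggressive_cleaning(hudi_options)
# hard_delete_df.write.format("hudi") \
#     .options(**delete_options) \
#     .mode("append") \
#     .save(final_base_path)
```

After the cleaner has run, repeating the `as.of.instant` read from the question against an instant before the delete is a reasonable way to check whether the old file versions, and with them the deleted row, are really gone.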