Tags: python, amazon-web-services, pyspark, parquet

How to search for and delete specific rows from a parquet file in PySpark? (data purge)


I'm starting a project to adapt our data lake to support targeted data purges, in order to comply with data privacy legislation.

Basically, the data owner opens a ticket requesting the deletion of a specific user's records, and I need to sweep all our AWS S3 buckets, check every parquet file, and delete that user's records from each parquet file in my data lake.

Has anyone developed a similar project in Python or PySpark?

Can you suggest what the standard industry practice would be for this case?

What I'm doing today is reading all the parquet files into a dataframe, filtering that dataframe to exclude the record in question, and rewriting the partition where the record was. This works, but for a purge that has to cover a five-year history the processing is very heavy.
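
For reference, this is roughly what my current approach looks like, as a minimal sketch. The paths, the partition column (`event_date`) and the `user_id` column are placeholders for my real layout; it only rewrites the partitions that actually contain the user, and writes them to a staging prefix because Spark refuses to overwrite a path it is also reading from.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("user-purge").getOrCreate()

    SOURCE = "s3://my-data-lake/events"          # hypothetical dataset location
    STAGING = "s3://my-data-lake/events_purged"  # staging area for rewritten partitions
    USER_ID = "12345"                            # id from the deletion request

    df = spark.read.parquet(SOURCE)

    # Find the partitions that actually contain the user, so only those get rewritten.
    affected = (df.filter(F.col("user_id") == USER_ID)
                  .select("event_date")
                  .distinct())

    # Keep every row except the user's, restricted to the affected partitions.
    cleaned = (df.join(affected, on="event_date", how="inner")
                 .filter(F.col("user_id") != USER_ID))

    # Rewrite the affected partitions into the staging prefix; from there they
    # replace the original partition directories.
    (cleaned.write
            .mode("overwrite")
            .partitionBy("event_date")
            .parquet(STAGING))
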

Can anyone suggest me a more practical solution?

Keep in mind that my parquet files are in AWS S3, they are exposed as Athena tables, and my script will run on EMR (PySpark).

Thanks


Solution

  • Try using Athena Iceberg tables, since Iceberg tables are ACID compliant. I believe this is good practice for deleting or updating specific rows stored in S3 through Athena.

    Example:

    Step 1: Create the Iceberg table.

    CREATE TABLE iceberg_table (
      id int,
      data string,
      category string) 
    PARTITIONED BY (category, bucket(16,id)) 
    LOCATION 's3://DOC-EXAMPLE-BUCKET/iceberg-folder' 
    TBLPROPERTIES (
      'table_type'='ICEBERG',
      'format'='parquet',
      'write_target_data_file_size_bytes'='536870912'
    );
    

    Step 2:

    1. Load the data from your existing table into the Iceberg table.

    2. Then run your delete/update requirements there, and convert all your parquet tables to Iceberg tables in the same way.

    3. Optimize the table to keep query performance good (a sketch of these steps from PySpark follows below).
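
    Since your script runs on EMR (PySpark), the same steps can be driven through Spark SQL. This is only a sketch under assumptions: the cluster is started with the Iceberg integration enabled, the Glue-backed catalog is named `glue_catalog`, and the database, table and column names (`mydb`, `legacy_parquet_table`, `id = 12345`) are hypothetical placeholders for your own.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("iceberg-purge")
             .config("spark.sql.extensions",
                     "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
             .config("spark.sql.catalog.glue_catalog",
                     "org.apache.iceberg.spark.SparkCatalog")
             .config("spark.sql.catalog.glue_catalog.catalog-impl",
                     "org.apache.iceberg.aws.glue.GlueCatalog")
             .config("spark.sql.catalog.glue_catalog.warehouse",
                     "s3://DOC-EXAMPLE-BUCKET/iceberg-folder")
             .getOrCreate())

    # Step 2.1: copy the existing parquet-backed table into the Iceberg table.
    spark.sql("""
        INSERT INTO glue_catalog.mydb.iceberg_table
        SELECT id, data, category FROM mydb.legacy_parquet_table
    """)

    # Step 2.2: purge the requested records; Iceberg handles the ACID rewrite.
    spark.sql("DELETE FROM glue_catalog.mydb.iceberg_table WHERE id = 12345")

    # Step 3: compact the small files produced by the copy and the delete.
    spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'mydb.iceberg_table')")

    The same delete and optimize steps can also be issued directly from Athena SQL if you prefer to keep the purge outside EMR.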