python amazon-web-services amazon-s3 aws-glue

How can I delete a specific record from my AWS Glue table?


How can I delete a specific record from my AWS Glue table using Python? My table is linked to an S3 bucket that contains multiple files.

So far, the only method I've found to delete a row/record is to delete the file in the bucket, using either the boto3 S3 client's delete_object or Glue's purge_s3_path. In both cases, you first need to identify the exact file containing the data you want to remove (I’m still unsure how to handle that part).

However, it's common for these files to contain multiple records. As a result, simply deleting the entire file isn't feasible, which introduces additional complexity.

Note that the solution needs to work with any type of file (CSV, JSON, etc.).


Solution

  • To delete specific records from an AWS Glue table backed by an S3 bucket with multiple files (such as CSV or JSON), you need a lakehouse table format that supports record-level operations, such as Apache Hudi, Delta Lake, or Apache Iceberg.

    These table formats maintain metadata and file structures that let you delete or update individual records without affecting other data, unlike CSV or JSON, where you would need to rewrite entire files.

    With these formats, you can query and delete specific records using PySpark or similar tools while keeping the data in S3.

    A standard approach like the boto3 S3 client's delete_object or Glue's purge_s3_path works at the file level, which isn't suitable when a file contains multiple records. Plain file formats such as CSV or JSON don't natively support deleting specific rows: the only option is to rewrite the file without them, which adds complexity.
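As a rough illustration of the two routes described above, here is a minimal sketch. It rests on several assumptions: the helper names (delete_rows_sql, drop_csv_rows, delete_record_from_s3_csv) and any table or column names are hypothetical; the SQL route assumes the Glue job already has Hudi, Delta Lake, or Iceberg enabled (typically through the --datalake-formats job parameter plus that format's Spark settings); and the CSV route assumes you already know which object holds the record and that the file is well-formed UTF-8 with a header row.

```python
import csv
import io


def delete_rows_sql(spark, table_name, condition):
    """Route 1: row-level DELETE on a Hudi, Delta Lake, or Iceberg table.

    `spark` is an existing SparkSession (in a Glue job, for example
    glueContext.spark_session). `table_name` and `condition` are
    interpolated into the SQL as-is, so they must come from trusted code.
    """
    statement = f"DELETE FROM {table_name} WHERE {condition}"
    spark.sql(statement)
    return statement


def drop_csv_rows(csv_text, column, value):
    """Return (new_csv_text, removed_count) without rows where column == value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in CSV header")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    removed = 0
    for row in reader:
        if row[column] == value:
            removed += 1  # skip the record being deleted
        else:
            writer.writerow(row)
    return out.getvalue(), removed


def delete_record_from_s3_csv(s3, bucket, key, column, value):
    """Route 2: download one CSV object, drop matching rows, overwrite it.

    `s3` is a boto3 S3 client, e.g. boto3.client("s3").
    The object is only rewritten when at least one row was removed.
    """
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    new_body, removed = drop_csv_rows(body, column, value)
    if removed:
        s3.put_object(Bucket=bucket, Key=key, Body=new_body.encode("utf-8"))
    return removed


# Hypothetical usage (names are placeholders, not from the question):
#   delete_rows_sql(spark, "my_db.my_table", "id = 42")
#   delete_record_from_s3_csv(boto3.client("s3"), "my-bucket", "data/part-0.csv", "id", "42")
```

The second route is not atomic: a reader or writer touching the same object between the download and the upload can see stale data or lose changes, which is part of the complexity that the table formats handle for you. A JSON Lines file could be handled the same way by filtering lines instead of CSV rows.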