amazon-s3 amazon-dynamodb amazon-emr

How to have EMRFS consistent view on S3 buckets with retention policy?


I am using an AWS EMR compute cluster (version 5.27.0), which uses S3 for data persistence. The cluster both reads from and writes to S3.

S3 is eventually consistent, so newly written data cannot always be listed immediately. To work around this, I use EMRFS with DynamoDB, which stores newly written paths so they can be listed right away.

The problem is that I now have to set a retention policy on S3, which deletes data more than a month old. When that happens, the corresponding entries are not deleted from the EMRFS DynamoDB table, which leads to consistency issues.

My question is: how can I ensure that paths removed by the S3 retention policy are also deleted from the DynamoDB table?

One naive solution I have come up with is a Lambda function that fires periodically and manually sets a TTL of, say, 1 day on the DynamoDB records. Is there a better approach than this?


Solution

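  • You can enable DynamoDB Time to Live (TTL) on the EMRFS table and give its items the same expiration period as your S3 retention policy (one month in your case):

    https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/

    That way an entry expires from DynamoDB at about the time the corresponding object expires from S3, so both stores keep describing the same set of objects.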
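
    As a rough illustration of the suggestion above, here is a minimal Python sketch. DynamoDB TTL works on a numeric attribute holding an expiry time in epoch seconds, so two pieces are needed: enabling TTL on the table, and computing an expiry value that matches the S3 retention period. The table name `EmrFSMetadata` and the attribute name `ttl` in the usage comments are assumptions, as is the 30-day retention constant. Whether your EMRFS items already carry an attribute you can use for this needs checking against your own table. `update_time_to_live` is the boto3 DynamoDB client call; the client is passed in so that the functions stay testable without AWS access.

```python
SECONDS_PER_DAY = 24 * 60 * 60
RETENTION_DAYS = 30  # assumption: matches the expiration set in the S3 retention policy


def expiry_epoch(written_at_epoch: int, retention_days: int = RETENTION_DAYS) -> int:
    """Return the epoch second at which an item written at `written_at_epoch` should expire."""
    return written_at_epoch + retention_days * SECONDS_PER_DAY


def enable_ttl(dynamodb_client, table_name: str, attribute_name: str) -> dict:
    """Enable DynamoDB TTL on `table_name`, using `attribute_name` as the expiry attribute."""
    return dynamodb_client.update_time_to_live(
        TableName=table_name,
        TimeToLiveSpecification={"Enabled": True, "AttributeName": attribute_name},
    )


# Hypothetical usage (requires boto3 and AWS credentials; names are assumptions):
#   import boto3, time
#   client = boto3.client("dynamodb")
#   enable_ttl(client, "EmrFSMetadata", "ttl")
#   ttl_value = expiry_epoch(int(time.time()))
```

    Items still need the expiry attribute populated for TTL to take effect, and DynamoDB removes expired items in the background rather than at the exact expiry second, so a short window of mismatch with S3 remains possible.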