I am looking for a way to avoid duplicates in my ETL pipeline's target S3 bucket when the same data is sent again from the source. Is there a way with Glue DynamicFrames to compare the unique key from the source (data read from S3 in JSON format) and only insert into the target S3 bucket (in Parquet format) if the unique key is not already present in the DynamicFrame read from the target bucket via the Glue catalog?
I have seen joins (inner, left, and right) but nothing in the form of a "not in".
Thanks, Jeet
This is not straightforward. You would need to read the whole target and do a left_anti join on the key, where the left DataFrame is the new data and the right the existing target table; see the sketch below.
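Here is a minimal sketch of that approach in a Glue job. The database, table, and bucket names (`my_db`, `source_json`, `target_parquet`, `unique_key`, `s3://my-target-bucket/data/`) are placeholders for your own; DynamicFrames have no anti-join, so you drop to Spark DataFrames for the join and convert back before writing:

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# Read the new source data and the existing target from the Glue catalog.
# Database/table names here are hypothetical; substitute your own.
new_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="source_json"
)
target_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="target_parquet"
)

# DynamicFrame offers no anti-join, so convert to Spark DataFrames.
new_df = new_dyf.toDF()
target_df = target_dyf.toDF()

# left_anti keeps only rows whose unique key is absent from the target.
to_insert = new_df.join(target_df, on="unique_key", how="left_anti")

# Convert back to a DynamicFrame and append to the target as Parquet.
result_dyf = DynamicFrame.fromDF(to_insert, glue_context, "to_insert")
glue_context.write_dynamic_frame.from_options(
    frame=result_dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-target-bucket/data/"},
    format="parquet",
)
```

Note that this still scans the full target on every run, which is why it does not scale well as the target grows.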
There are open-source frameworks like Delta Lake that make this easier and more performant, though.
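For comparison, with Delta Lake the same "insert only if the key is new" logic is a single MERGE. A rough sketch, assuming the target has already been converted to a Delta table and the job runs with the delta-spark package available (paths and `unique_key` are again placeholders):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical locations; the target must already be a Delta table.
target = DeltaTable.forPath(spark, "s3://my-target-bucket/delta/")
new_df = spark.read.json("s3://my-source-bucket/incoming/")

# Insert rows whose unique key is not matched in the target;
# matched rows are simply left alone (deduplication).
(
    target.alias("t")
    .merge(new_df.alias("s"), "t.unique_key = s.unique_key")
    .whenNotMatchedInsertAll()
    .execute()
)
```

Delta can use table statistics and data skipping during the merge, so it typically avoids the full-target rewrite cost of the hand-rolled anti-join.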