I am submitting a Python (actually PySpark) script as a Glue job to process Parquet files and extract some analytics from this data source.
These Parquet files live in an S3 folder, and new data is added continuously. I was happy with the bookmarking logic provided by AWS Glue because it helps a lot: it lets us process only new data without reprocessing data that has already been processed.
Unfortunately, in this scenario I notice that duplicates are produced on every run, and it looks like AWS Glue bookmarking is not working at all. What is the reason for this unexpected behaviour?
From the AWS Glue documentation on job bookmarks (https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html):
"The Apache Parquet and ORC formats are currently not supported."
UPDATE
Since July 26, 2019, AWS Glue job bookmarks support the Parquet and ORC formats as well:
https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
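
For anyone who still sees duplicates now that Parquet is supported, here is a minimal sketch of a bookmark-aware job. It shows the three pieces bookmarks rely on: job.init, a transformation_ctx on the source, and job.commit. The S3 path, the "parquet_source" context name and the source_kwargs helper are placeholders I made up for illustration, not anything from the original script. It also assumes the job is started with bookmarks turned on (--job-bookmark-option job-bookmark-enable).

```python
import sys

# Hypothetical S3 prefix; replace it with the folder that holds the Parquet files.
SOURCE_PATH = "s3://my-bucket/events/"


def source_kwargs(path):
    """Build the arguments for create_dynamic_frame.from_options.

    transformation_ctx is the key under which Glue stores bookmark state
    for this source; a source without it is not tracked by the bookmark.
    """
    return {
        "connection_type": "s3",
        "connection_options": {"paths": [path]},
        "format": "parquet",
        "transformation_ctx": "parquet_source",  # illustrative name
    }


def main():
    # Imported lazily: the awsglue package only exists inside the Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())

    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # picks up the bookmark state of earlier runs

    frame = glue_context.create_dynamic_frame.from_options(**source_kwargs(SOURCE_PATH))
    # ... compute the analytics on frame.toDF() and write the results here ...

    job.commit()  # saves the updated bookmark state for the next run


# Glue passes --JOB_NAME when it runs the script; this guard keeps the
# file importable on a machine without the Glue libraries.
if "--JOB_NAME" in sys.argv:
    main()
```

If job.commit() is never reached, or the source is read without a transformation_ctx, the bookmark does not advance and the same files are read again on the next run.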