amazon-web-services apache-spark aws-glue

AWS Glue incremental load


I have an S3 bucket where files are dumped every day, and an AWS Glue crawler crawls the data at this location. On the very first day, my Glue job processes all the data in the table created by the crawler. For example, on the first day there are three files (file1.txt, file2.txt, file3.txt), and the Glue job processes all of them. On the second day, two more files arrive, so the S3 location now contains file1.txt, file2.txt, file3.txt, file4.txt, and file5.txt. Can I design my crawler so that the next job run reads only the two new files (file4.txt, file5.txt)? If not, how can I write the Glue job so that it identifies only these incremental files?


Solution

  • Enable job bookmarks for your Glue job. With bookmarks enabled, Glue persists state about the data it has already processed, so the next run picks up only the new files (file4.txt and file5.txt in your example). You can refer to the link below for how to do it.

    AWS Glue job bookmark
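
As a rough illustration of what a bookmark-aware job can look like, here is a minimal PySpark sketch. It assumes the job is started with the job parameter `--job-bookmark-option` set to `job-bookmark-enable`. The database name, table name, and output path are hypothetical placeholders, not values from the question. For bookmarks to take effect, the script has to call `job.init` and `job.commit`, and the source read needs a `transformation_ctx`.

```python
import sys

# Hypothetical names: replace with the database and table your crawler created
# and with your own output location.
DATABASE = "my_database"
TABLE = "my_table"
OUTPUT_PATH = "s3://my-output-bucket/processed/"


def main():
    # awsglue and pyspark are provided by the Glue job runtime,
    # so they are imported here rather than at module level.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())

    job = Job(glue_context)
    # Start the job run; with bookmarks enabled, state from earlier runs applies.
    job.init(args["JOB_NAME"], args)

    # transformation_ctx is the key Glue uses to track bookmark state for
    # this source; without it, the read is not bookmarked.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=DATABASE,
        table_name=TABLE,
        transformation_ctx="source",
    )

    # Write whatever was read in this run to the output location.
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": OUTPUT_PATH},
        format="parquet",
        transformation_ctx="sink",
    )

    # Persist the bookmark so the files read in this run are skipped next time.
    job.commit()


# Run only when Glue supplies the --JOB_NAME argument.
if __name__ == "__main__" and any(a.startswith("--JOB_NAME") for a in sys.argv):
    main()
```

With a script along these lines, the first run would read file1.txt through file3.txt, and the second run should read only file4.txt and file5.txt, provided the first run reached `job.commit()` successfully.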