amazon-web-services aws-glue aws-glue-data-catalog

AWS Glue Crawler sends all data to Glue Catalog and Athena without Glue Job

I am new to AWS Glue. I am using an AWS Glue Crawler to crawl data from two S3 buckets, with one file in each bucket. The crawler creates two tables in the AWS Glue Data Catalog, and I am also able to query the data in AWS Athena.

My understanding was that in order to get data into Athena I needed to create a Glue job that would pull the data into Athena, but I was wrong. Is it correct to say that the Glue crawler makes data available in Athena without the need for a Glue job, and that we only need a Glue job if we want to push our data into a database such as SQL Server, Oracle, etc.?

How can I configure the Glue crawler so that it fetches only the delta data, rather than all the data every time, from the source bucket?

Any help is appreciated.

Solution

The Glue crawler is only used to identify the schema of your data; it does not move or copy the data itself. Your data stays where it is (e.g. S3), and the crawler infers the schema by going through a sample of your files and records it as a table definition in the Glue Data Catalog.
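On the delta question, here is a minimal sketch of how a crawler could be set up with boto3 so that later runs only look at newly added S3 folders. The function names, crawler name, role ARN, database and bucket paths are hypothetical placeholders. As far as I know, the `CRAWL_NEW_FOLDERS_ONLY` recrawl behavior only picks up new folders (not changed files inside folders that were already crawled) and expects the schema change policy to be `LOG`, so treat this as a starting point rather than a definitive setup.

```python
def build_crawler_request(name, role_arn, database, s3_paths, incremental=True):
    """Build the keyword arguments for boto3's glue.create_crawler call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": path} for path in s3_paths]},
    }
    if incremental:
        # Only crawl folders added since the last crawler run.
        request["RecrawlPolicy"] = {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"}
        # Assumption: incremental crawls need schema changes to be logged only.
        request["SchemaChangePolicy"] = {
            "UpdateBehavior": "LOG",
            "DeleteBehavior": "LOG",
        }
    return request


def create_and_start_crawler(request):
    """Create the crawler in AWS and start it (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the request builder works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    example = build_crawler_request(
        name="my-delta-crawler",
        role_arn="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
        database="my_database",
        s3_paths=["s3://my-first-bucket/data/", "s3://my-second-bucket/data/"],
    )
    print(example)
```

Keeping the request builder separate from the AWS call makes it easy to inspect the configuration before anything is created in your account.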

You can then use a query engine like Athena (managed, serverless Presto) to query the data in place, since it already has a schema in the catalog.
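To illustrate that no Glue job is involved, here is a rough sketch of querying a crawled table through Athena with boto3. The helper name `run_athena_query`, the table and database names, and the results bucket are hypothetical; the client is passed in as a parameter so the helper itself does not depend on any particular AWS setup.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start an Athena query and wait until it reaches a final state.

    `athena` is expected to be a boto3 Athena client, i.e. boto3.client("athena").
    Returns a (query_execution_id, final_state) tuple.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        # The database is the Glue Data Catalog database the crawler wrote to.
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)  # still QUEUED or RUNNING


def main():
    import boto3  # needs boto3 and AWS credentials

    # Hypothetical table, database and bucket names, for illustration only.
    query_id, state = run_athena_query(
        boto3.client("athena"),
        sql="SELECT * FROM my_crawled_table LIMIT 10",
        database="my_database",
        output_location="s3://my-athena-results-bucket/output/",
    )
    print(query_id, state)
```

Athena reads the files directly from S3 using the table definition the crawler stored, which is why the data is queryable as soon as the crawler finishes.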

If you want to process, clean, or aggregate the data, you can use Glue jobs, which are essentially managed, serverless Spark.