How is data retrieved through catalog metadata tables in a Glue script?

I have read the AWS Glue documentation, but one thing is still unclear to me. Below is what I understand.

Regarding crawlers: a crawler creates a metadata table for either an S3 location or a DynamoDB table. What I don't understand is how a Scala/Python script is able to retrieve data from the actual source (say DynamoDB or S3) using the tables created from that metadata.

    val input = glueContext
      .getCatalogSource(database = "my_data_base", tableName = "my_table")
      .getDynamicFrame()

Does the line above retrieve data from the actual source via the metadata tables?

I would be glad if someone could explain what happens behind the scenes when a Glue script retrieves data via metadata tables.

Solution

When you run a Glue crawler, it fetches metadata from S3 or JDBC (depending on your requirement) and creates tables in the AWS Glue Data Catalog.

If you want to connect to this data from a Glue ETL job, you can do it in multiple ways depending on your requirement:

- [from_options][1]: use this if you want to load directly from S3/JDBC without going through the Glue catalog.
- [from_catalog][1]: use this if you want to load data through the Glue catalog. In Scala you do this with the getCatalogSource method, as shown in your code. As the name implies, it uses the Glue Data Catalog as the source and loads the particular table that you pass to the method.
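As a rough PySpark sketch of the two options above: the helper names `load_via_catalog` and `load_direct` are hypothetical, as are the S3 path argument and the default JSON format. `glue_context` is assumed to be a `GlueContext` that the job has already created.

```python
def load_via_catalog(glue_context, database, table_name):
    """Read a table through the Glue Data Catalog.

    The catalog entry supplies the location and format of the data,
    so only the database and table names are needed here.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
    )


def load_direct(glue_context, s3_path, data_format="json"):
    """Read straight from S3 without consulting the catalog.

    The location and format must be spelled out by the caller
    (the path and the JSON default are assumptions for illustration).
    """
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [s3_path]},
        format=data_format,
    )
```

Inside a Glue job this would be called as, for example, `load_via_catalog(glue_context, "my_data_base", "my_table")`, using the names from the question.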

It first reads your table definition, which points to the location of the data. It then connects to that location and loads the data present in the source. The catalog itself holds only the metadata, not the rows.

Yes, you need to use getCatalogSource if you want to load tables through the Glue catalog.
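To see what the table definition actually holds, you can ask the catalog for it. This is a minimal sketch, assuming a boto3 Glue client is passed in. The helper name `describe_catalog_table` is hypothetical. It returns only what the catalog stores (the source location and the column names), since the rows stay in the source.

```python
def describe_catalog_table(glue_client, database, table_name):
    """Return the source location and column names recorded in the catalog."""
    # get_table returns metadata only; it never touches the underlying data.
    table = glue_client.get_table(DatabaseName=database, Name=table_name)["Table"]
    descriptor = table.get("StorageDescriptor", {})
    return {
        "location": descriptor.get("Location"),
        "columns": [col["Name"] for col in descriptor.get("Columns", [])],
    }
```

With real credentials this could be called as `describe_catalog_table(boto3.client("glue"), "my_data_base", "my_table")`. For an S3-backed table the location is an `s3://` path, and that is where the job reads from.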

Does the catalog look into the crawler, refer to the actual source, and load the data?

Check out the diagram in this [link][2]. It will give you an idea of the flow.

What if the crawler is deleted before I run getCatalogSource? Will I still be able to load data in that case?

The crawler and the table are two different components. Deleting the crawler does not remove the table it created, so what matters is whether the table still exists, and if it is deleted, when. If you delete the table after your job has started executing, there will not be any problem. If you delete it before execution starts, you will encounter an error.

What if my source has many millions of records? Will this load all of the records, and how does it handle that case?

It is good to have large files in the source, since that avoids most of the small-files problem. Glue is based on Spark, which reads the data in partitions that fit in memory and then does the computations, so the whole dataset does not have to fit in memory at once. Check this [answer][3] and [this][4] for best practices when reading larger files in AWS Glue.

[1]: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.html
[2]: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
[3]: https://stackoverflow.com/questions/46638901/how-spark-read-a-large-file-petabyte-when-file-can-not-be-fit-in-sparks-main
[4]: https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/#:~:text=Incremental%20processing:%20Processing%20large%20datasets
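
When the source is very large, one way to limit how much is read is to ask only for the partitions the job needs. This is a hedged PySpark sketch, assuming the catalog table is partitioned. The helper name `load_partition_subset` and the partition column in the usage note are hypothetical. `push_down_predicate` is an argument of `from_catalog`.

```python
def load_partition_subset(glue_context, database, table_name, predicate):
    """Read only the catalog partitions that match the predicate.

    Partitions that do not match are skipped before any of their
    files are listed or read.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        push_down_predicate=predicate,
    )
```

For a table partitioned by a hypothetical `year` column, this could be called as `load_partition_subset(glue_context, "my_data_base", "my_table", "year == '2020'")`.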