In AWS Glue, Although I read documentation, but I didn't get cleared one thing. Below is what I understood.
Regarding Crawlers: This will create a metadata table for either S3 or DynamoDB table. But what I don't understand is: how does Scala/Python script able to retrieve data from Actual Source (say DynamoDB or S3
) using Metadata created tables.
val input = glueContext
.getCatalogSource(database = "my_data_base", tableName = "my_table")
.getDynamicFrame()
Does above line retrieve data from actual source via metadata tables?
I will be glad if someone can able to explain me behind the scenes of retrieving data in Glue script via metadata tables.
When you run a Glue crawler it will fetch metadata from S3 or JDBC (depends on your requirement) and creates tables in AWS Glue Data Catalog.
Now if you want to connect to this data/tables from Glue ETL job then you can do it in multiple ways depending on your requirement:
[from_options][1] : if you want to load directly from S3/JDBC with out connecting to Glue catalog.
[from_catalog][1] : If you want to load data from Glue catalog then you need to link it with catalog using getCatalogSource
method as shown in your code. As the name infers it will use Glue data catalog as source and load particular table that you pass to this method.
Once it looks at your table definition which is pointed to a location then it will make a connection and load the data present in the source.
Yes you need to use getCatalogSource
if you want to load tables from Glue catalog.
Check out the diagram in this [link][2] . It will give you an idea about the flow.
Crawler and Table are two different components. It all depends on when the table is deleted. If you delete the table after your job start to execute then there will not be any problem. If you delete it before execution starts then you will encounter an error.
It is good to have large files to be present in source so it will avoid most of the small files problem. Glue based on Spark and it will read files which can be fit in memory and then do the computations. Check this [answer][3] and [this][4] for best practices while reading larger files in AWS Glue. [1]: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.html [2]: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html [3]: https://stackoverflow.com/questions/46638901/how-spark-read-a-large-file-petabyte-when-file-can-not-be-fit-in-sparks-main [4]: https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/#:~:text=Incremental%20processing:%20Processing%20large%20datasets