I am working on crawling data into the Data Catalog with AWS Glue, but I am a bit confused about the definition of a database. According to the AWS documentation, "a database in the AWS Glue Data Catalog is a container that holds tables. You use databases to organize your tables into separate categories." I wonder what exactly a database contains. Does it load all the data from the other data sources and build a catalog on top of it, or does it contain only the catalog? How do I find the size of the tables in a Glue database? And what type of database does it use, for example NoSQL or RDS?
For example, I create a crawler to load data from S3 and create a catalog table in Glue. Does the Glue table include all the data from the S3 bucket? If I delete the S3 bucket, will that affect other Glue jobs that run against the catalog table created by the crawler?
If the catalog table only contains the data schema, how can I keep it up to date when my data source is modified?
The Catalog is just a metadata store. Its mission is to document data that lives elsewhere and to expose that metadata to other tools, like Athena or EMR, so they can discover the data.
Data is not replicated into the catalog, but remains in the origin. If you remove the table from the catalog, the data in origin remains intact.
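To see this for yourself, you can read a table's catalog entry and check that it only holds a pointer to the data plus the schema. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; `describe_table` is a hypothetical helper name, and the `recordCount` and `sizeKey` parameters are statistics a crawler typically writes, so they may be absent on tables created some other way.

```python
def describe_table(glue, database, table):
    """Summarize what the Data Catalog stores for one table (metadata only)."""
    # get_table returns the catalog entry; it never returns the rows themselves.
    meta = glue.get_table(DatabaseName=database, Name=table)["Table"]
    storage = meta.get("StorageDescriptor", {})
    params = meta.get("Parameters", {})
    return {
        # Where the actual data lives, e.g. an s3:// path
        "location": storage.get("Location"),
        # Column names and types recorded by the crawler
        "columns": [(c["Name"], c.get("Type")) for c in storage.get("Columns", [])],
        "partition_keys": [k["Name"] for k in meta.get("PartitionKeys", [])],
        # Assumed crawler-written statistics; None if the keys are missing
        "record_count": params.get("recordCount"),
        "size_bytes": params.get("sizeKey"),
    }
```

For a real table you would pass `boto3.client("glue")` as `glue`, together with your own database and table names. The `location` value is the part that matters here: it points back to the origin, which is why deleting the table from the catalog leaves the data intact.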
If you delete the origin data (as you described in your question), the other services will no longer be able to read it through the catalog table, because the data itself is gone. If you run the crawler again, it should detect that the data is no longer there.
If you want to keep the catalog schema up to date, you can either schedule automatic runs of the crawler or run it on demand whenever your data changes. When the crawler runs again, it updates things such as the number of records, the partitions, and even the schema itself. Please refer to the documentation to see what effect schema changes can have on your catalog.
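Both options can be set up from code as well as from the console. Here is a rough sketch, assuming boto3 and an existing crawler; the helper names, the default cron expression (daily at 02:00 UTC), and the chosen schema change policy are illustrative assumptions, not recommendations.

```python
def schedule_crawler(glue, crawler_name, cron="cron(0 2 * * ? *)"):
    """Attach a schedule to an existing crawler so the catalog is refreshed regularly."""
    glue.update_crawler(
        Name=crawler_name,
        Schedule=cron,  # assumed schedule: every day at 02:00 UTC
        SchemaChangePolicy={
            # Apply detected schema changes to the catalog table
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            # Mark tables whose data disappeared as deprecated instead of deleting them
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    )


def run_crawler_now(glue, crawler_name):
    """Start an on-demand run, e.g. right after the source data changed."""
    glue.start_crawler(Name=crawler_name)
```

As before, `glue` would be `boto3.client("glue")` in real use. Note that `start_crawler` raises an error if the crawler is already running, so a caller that triggers it after every data change may want to handle that case.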