I'm looking for ways to create an inventory of the AWS Glue Catalog databases and the tables they contain, including creation date, for reporting. I have looked at querying information_schema, but it doesn't appear I can get a creation date that way. We are not creating the tables with the standard ETL crawlers but through other means, in some cases plain SQL DDL. I'm hoping to find either an Athena query or another scripting option to get this data so I can make it available to AWS QuickSight and potentially our metadata catalog.
You can use the AWS SDK to fetch what you need. Specifically, with Python and the boto3 Glue client you can get when a table was created and updated.
Here is an example:
import boto3

glue_client = boto3.client('glue')

# Note: get_databases/get_tables return paginated results (up to 100
# entries per call); use NextToken or a paginator for large catalogs.
databases = glue_client.get_databases()['DatabaseList']

inventory = []
for database in databases:
    tables = glue_client.get_tables(
        DatabaseName=database['Name']
    )
    for table in tables['TableList']:
        inventory.append({
            'database': database['Name'],
            'table': table['Name'],
            # StorageDescriptor can be absent, e.g. for some DDL-created tables
            'location': table.get('StorageDescriptor', {}).get('Location'),
            'partition_keys': table.get('PartitionKeys', []),
            'table_type': table.get('TableType'),
            'created_at': table['CreateTime'].strftime('%Y-%m-%d %H:%M:%S'),
            'updated_at': table['UpdateTime'].strftime('%Y-%m-%d %H:%M:%S'),
        })
inventory contains when each table was created and updated, the table location, and other information such as partition keys and table type.
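Since tables created via plain SQL DDL don't always populate every field, it can help to isolate the flattening of one Glue table entry into a small helper that tolerates missing keys. This is a sketch; the function name table_to_row is my own, and extracting just the partition key names (rather than the full dicts) is a choice that makes the record easier to persist as Parquet later:

```python
from datetime import datetime


def table_to_row(database_name, table):
    """Flatten one entry from a Glue get_tables response into an
    inventory record, tolerating keys the API may omit."""
    # CreateTime/UpdateTime come back as datetime objects from boto3
    created = table['CreateTime']
    updated = table.get('UpdateTime', created)  # fall back to creation time
    return {
        'database': database_name,
        'table': table['Name'],
        # StorageDescriptor (and its Location) may be missing entirely
        'location': table.get('StorageDescriptor', {}).get('Location'),
        # keep only the key names, which serialize cleanly to Parquet
        'partition_keys': [k['Name'] for k in table.get('PartitionKeys', [])],
        'table_type': table.get('TableType'),
        'created_at': created.strftime('%Y-%m-%d %H:%M:%S'),
        'updated_at': updated.strftime('%Y-%m-%d %H:%M:%S'),
    }
```

You can then replace the dict literal in the loop above with inventory.append(table_to_row(database['Name'], table)).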
If you want to use this extracted information, you just have to persist it back to S3 and register it in the Glue catalog.
To do so, you can use AWS Data Wrangler:
import awswrangler as wr
import pandas as pd

inventory_df = pd.DataFrame.from_records(inventory)

wr.s3.to_parquet(
    df=inventory_df,
    path='s3://bucket/prefix',
    dataset=True,
    mode='overwrite',
    database='default',       # Athena/Glue database
    table='tables_inventory'  # Athena/Glue table
)
The inventory will then be available in Athena by querying default.tables_inventory, and therefore usable in QuickSight.
It's worth mentioning that this approach (the Glue APIs) is much faster than querying the Athena information_schema; on the other hand, you need a pipeline that runs regularly to keep the inventory fresh.