amazon-web-services, aws-glue, amazon-athena

Query AWS Glue database and table metadata for inventory reporting


I'm looking for a way to create an inventory of the AWS Glue catalog databases and the tables they contain, including each table's creation date, for reporting. I have looked at querying information_schema, but it doesn't appear I can get a creation date that way. The tables we create don't come from the standard ETL crawlers; they are created through other means, in some cases with standard SQL DDL. I'm hoping to find either an Athena query or a scripting option to get this data so I can make it available to AWS QuickSight and potentially our metadata catalog.


Solution

  • You can use the AWS SDK to fetch what you need. Specifically, with Python and the boto3 Glue client you can retrieve when a table was created and last updated.

    Here is an example:

    import boto3

    glue_client = boto3.client('glue')

    inventory = []

    # Paginate so databases and tables beyond the first API page are not missed
    for db_page in glue_client.get_paginator('get_databases').paginate():
        for database in db_page['DatabaseList']:
            table_pages = glue_client.get_paginator('get_tables').paginate(
                DatabaseName=database['Name']
            )
            for table_page in table_pages:
                for table in table_page['TableList']:
                    inventory.append({
                        'database': database['Name'],
                        'table': table['Name'],
                        # Views and tables created via DDL may lack a storage location
                        'location': table.get('StorageDescriptor', {}).get('Location'),
                        'partition_keys': table.get('PartitionKeys', []),
                        'table_type': table.get('TableType'),
                        'created_at': table['CreateTime'].strftime('%Y-%m-%d %H:%M:%S'),
                        'updated_at': table['UpdateTime'].strftime('%Y-%m-%d %H:%M:%S'),
                    })
    

    inventory now contains when each table was created and updated, its location, and other information such as partition keys and table type. To use this extracted data, you just have to persist it back to S3 and register it in the Glue catalog. You can do so with AWS Data Wrangler (awswrangler):

    import awswrangler as wr
    import pandas as pd
    
    inventory_df = pd.DataFrame.from_records(inventory)
    
    wr.s3.to_parquet(
        df=inventory_df,
        path='s3://bucket/prefix',
        dataset=True,
        mode='overwrite',
        database='default',  # Athena/Glue database
        table='tables_inventory'  # Athena/Glue table
    )
    
    

    The inventory will then be available in Athena by querying default.tables_inventory, and is therefore usable in QuickSight.
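
    If you prefer to check the result from Python rather than from the Athena console, here is a minimal sketch using the same awswrangler library (the database and table names assume the registration step above):

    import awswrangler as wr

    # Query the registered inventory table through Athena;
    # assumes default.tables_inventory was created as shown above
    df = wr.athena.read_sql_query(
        sql='SELECT * FROM tables_inventory ORDER BY created_at DESC',
        database='default',
    )
    print(df.head())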

    It's worth mentioning that this approach (the Glue APIs) is much faster than querying the Athena information_schema; at the same time, you need a pipeline that runs regularly to keep the inventory up to date.
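
    If you go the scheduled-pipeline route, one option (an assumption on my side, not the only way) is to package the steps above as a Lambda function triggered by an EventBridge schedule, with awswrangler made available to the function, e.g. through a Lambda layer. A rough sketch, where collect_inventory() is a hypothetical stand-in for the boto3 loop from the first snippet:

    import awswrangler as wr
    import pandas as pd

    def collect_inventory():
        # Hypothetical helper: the boto3 pagination loop from the first snippet,
        # returning the list of per-table dicts
        ...

    def lambda_handler(event, context):
        # Rebuild the inventory and overwrite the Glue-registered dataset on each run
        inventory_df = pd.DataFrame.from_records(collect_inventory())
        wr.s3.to_parquet(
            df=inventory_df,
            path='s3://bucket/prefix',  # same placeholder location as above
            dataset=True,
            mode='overwrite',
            database='default',
            table='tables_inventory',
        )
        return {'tables': len(inventory_df)}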