Search code examples
databaseamazon-web-servicesaws-glueaws-glue-data-catalog

Adding comments to database columns and retrieving from AWS Glue


I'm trying to incorporate a AWS GLUE Data Catalog to my Data Lake I'm building out. I'm using a few different databases and would like to add COMMENTS to columns in a few of these tables. These databases include Redshift and MySql. I usually add the comments to the column by doing something along the lines of

COMMENT ON COLUMN table.column_name IS 'This is the comment';

Now i know that Glue has a comment field that shows in the GUI. Is there a way to sync the comment field in Glue with the comments I add to the columns in a DB?


Solution

  • In order to update some meta information about a table that has been defined in AWS Glue Data Catalog, you would need to use a combination of get_table() and update_table() methods with boto3 for example .

    Here is the most naive approach to do that:

    import boto3
    from pprint import pprint
    
    glue_client = boto3.client('glue')
    
    database_name = "__SOME_DATABASE__"
    table_name = "__SOME_TABLE__"
    
    response = glue_client.get_table(
        DatabaseName=database_name,
        Name=table_name
    )
    original_table = response['Table']
    

    Here original_table adheres response syntax defined by get_table(). However, we need to remove some fields from it so it would pass validation when we use update_table(). List of allowed keys could be obtained by passing original_table directly to update_table() without any chagnes

    allowed_keys = [
        "Name",
        "Description",
        "Owner",
        "LastAccessTime",
        "LastAnalyzedTime",
        "Retention",
        "StorageDescriptor",
        "PartitionKeys",
        "ViewOriginalText",
        "ViewExpandedText",
        "TableType",
        "Parameters"
    ]
    updated_table = dict()
    for key in allowed_keys:
        if key in original_table:
            updated_table[key] = original_table[key]
    

    For simplicity sake, we will change comment of the very first column from the table

    new_comment = "Foo Bar"
    updated_table['StorageDescriptor']['Columns'][0]['Comment'] = new_comment
    
    response = glue_client.update_table(
        DatabaseName=database_name,
        TableInput=updated_table
    )
    
    pprint(response)
    

    Obviously, if you want to add a comment to a specific column you would need to extend this to

    new_comment = "Targeted Foo Bar"
    target_column_name = "__SOME_COLUMN_NAME__"
    for col in updated_table['StorageDescriptor']['Columns']:
        if col['Name'] == target_column_name:
            col['Comment'] = new_comment
    
    response = glue_client.update_table(
        DatabaseName=database_name,
        TableInput=updated_table
    )
    
    pprint(response)