I tried to find a way through the Glue API docs, but I found no relevant attribute or method for the functions get_table(**kwargs) or get_tables(**kwargs).
I imagine something akin to the following (pseudo-)code:
client = boto3.client('glue')
paginator = client.get_paginator('get_tables')
for response in paginator.paginate(DatabaseName=db_input_shared):
    for table in response['TableList']:
        files = table["files"]  # NOTE: the keyword "files" is invented
        # Do something else
        ...
As far as I can see from the docs, each table in response["TableList"] should be a dictionary; yet none of its keys seems to give access to the files stored in it.
The solution to the problem was using awswrangler.
The following function checks all AWS Glue tables within a database against a list of recently uploaded files. Whenever a filename matches, it yields the associated table dictionary. The yielded tables are therefore those that have been recently updated.
from typing import Iterator, List

import awswrangler
import boto3


def _yield_recently_updated_glue_tables(upload_path_list: List[str],
                                        db_name: str) -> Iterator[dict]:
    """Check which tables have been updated recently.

    Args:
        upload_path_list (List[str]): contains all S3-filepaths of recently uploaded files
        db_name (str): name of the AWS Glue database

    Yields:
        dict: AWS Glue table dictionaries recently updated
    """
    client = boto3.client('glue')
    paginator = client.get_paginator('get_tables')
    for response in paginator.paginate(DatabaseName=db_name):
        for table_dict in response['TableList']:
            table_name = table_dict['Name']
            # Look up the S3 location that backs this table.
            s3_bucket_path = awswrangler.catalog.get_table_location(
                database=db_name, table=table_name)
            # describe_objects returns a dict keyed by full S3 object path.
            s3_filepaths = set(
                awswrangler.s3.describe_objects(s3_bucket_path).keys())
            # The table counts as updated if any uploaded file lives in it.
            table_was_updated = any(
                upload_file in s3_filepaths
                for upload_file in upload_path_list)
            if table_was_updated:
                yield table_dict
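
If listing every object under each table turns out to be slow, one possible variation is to compare the uploaded paths against each table's S3 prefix instead. The sketch below is not what I used above. It assumes that each table dictionary carries its location under table_dict["StorageDescriptor"]["Location"] and that every file of a table lives under that prefix. It also does not verify that the file actually exists in S3. The function name tables_matching_uploads and the sample data are made up for illustration.

```python
from typing import Iterable, Iterator, List


def tables_matching_uploads(tables: Iterable[dict],
                            upload_path_list: List[str]) -> Iterator[dict]:
    """Yield table dicts whose S3 location is a prefix of an uploaded path."""
    for table_dict in tables:
        # Assumption: the S3 prefix sits under StorageDescriptor -> Location;
        # tables without a location (e.g. some views) are skipped.
        location = table_dict.get("StorageDescriptor", {}).get("Location")
        if not location:
            continue
        # Normalise to a trailing slash so "s3://b/orders" does not match
        # "s3://b/orders_archive/file.parquet".
        prefix = location.rstrip("/") + "/"
        if any(path.startswith(prefix) for path in upload_path_list):
            yield table_dict


# Hypothetical sample data shaped like entries of response["TableList"].
sample_tables = [
    {"Name": "orders",
     "StorageDescriptor": {"Location": "s3://my-bucket/orders/"}},
    {"Name": "orders_archive",
     "StorageDescriptor": {"Location": "s3://my-bucket/orders_archive"}},
    {"Name": "view_only"},
]
uploads = ["s3://my-bucket/orders/2024/part-0.parquet"]

updated_names = [t["Name"]
                 for t in tables_matching_uploads(sample_tables, uploads)]
print(updated_names)
```

With a real catalog, the tables argument could be fed from the same get_tables paginator as above, flattening response['TableList'] across pages.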