Tags: python-3.x, amazon-web-services, boto3, aws-sdk, aws-glue

List all S3-files parsed by AWS Glue from a table using the AWS Python SDK boto3


I searched the Glue API docs, but nothing returned by get_table(**kwargs) or get_tables(**kwargs) appears to list the S3 files behind a table.

I imagine something akin to the following (pseudo-)code:

import boto3

# db_input_shared holds the name of the Glue database to scan
client = boto3.client('glue')
paginator = client.get_paginator('get_tables')
for response in paginator.paginate(DatabaseName=db_input_shared):
    for table in response['TableList']:
        files = table["files"]  # NOTE: the keyword "files" is invented
        # Do something else
        ...

As far as I can see from the docs, each table in response["TableList"] should be a dictionary; yet none of its keys seems to give access to the files stored in it.


Solution

  • The solution to the problem was using awswrangler.

    The following function checks every AWS Glue table in a database against a list of recently uploaded files. Whenever one of the uploaded file paths matches a file stored under a table's location, it yields the associated table dictionary. The yielded tables are therefore the ones that have been recently updated.

    from typing import Iterator, List

    import awswrangler
    import boto3


    def _yield_recently_updated_glue_tables(upload_path_list: List[str],
                                            db_name: str) -> Iterator[dict]:
        """Check which tables have been updated recently.
    
        Args:
            upload_path_list (List[str]): contains all S3-filepaths of recently uploaded files
            db_name (str): name of the AWS Glue database
    
        Yields:
            dict: AWS Glue table dictionaries that were recently updated
        """
        client = boto3.client('glue')
        paginator = client.get_paginator('get_tables')
        for response in paginator.paginate(DatabaseName=db_name):
            for table_dict in response['TableList']:
                table_name = table_dict['Name']
                s3_bucket_path = awswrangler.catalog.get_table_location(
                    database=db_name, table=table_name)
                # A set makes the membership checks below cheap
                s3_filepaths = set(
                    awswrangler.s3.describe_objects(s3_bucket_path).keys())
                # Yield the table if any uploaded file is stored under it
                if any(upload_file in s3_filepaths
                       for upload_file in upload_path_list):
                    yield table_dict
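
    If adding awswrangler is not an option, the same listing can probably be done with boto3 alone. Each table dictionary returned by get_tables carries its S3 location under StorageDescriptor -> Location, and the S3 list_objects_v2 paginator can enumerate the objects below that prefix. The following is a minimal sketch under stated assumptions, not a tested drop-in replacement: the function names (split_s3_uri, list_table_files, iter_tables_with_files) are hypothetical, and it assumes that every table location is an S3 prefix (a "folder") rather than a single object.

```python
from typing import Iterator, List, Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in: {uri!r}")
    return bucket, prefix


def list_table_files(table: dict, s3_client) -> List[str]:
    """Return the S3 URIs of all objects below a Glue table's location.

    `table` is one entry of response['TableList']; `s3_client` is a boto3
    S3 client (or any object offering the same paginator interface).
    """
    location = table.get("StorageDescriptor", {}).get("Location", "")
    if not location:
        return []  # some tables (e.g. views) have no storage location
    bucket, prefix = split_s3_uri(location)
    # Assumption: the location is a prefix. The trailing slash keeps
    # "data/t1" from also matching keys under "data/t1_backup/".
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    paginator = s3_client.get_paginator("list_objects_v2")
    files = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" key
        for obj in page.get("Contents", []):
            files.append(f"s3://{bucket}/{obj['Key']}")
    return files


def iter_tables_with_files(db_name: str, glue_client,
                           s3_client) -> Iterator[Tuple[str, List[str]]]:
    """Yield (table name, list of S3 URIs) for every table in a database."""
    paginator = glue_client.get_paginator("get_tables")
    for response in paginator.paginate(DatabaseName=db_name):
        for table in response["TableList"]:
            yield table["Name"], list_table_files(table, s3_client)
```

    With real clients this would be called as iter_tables_with_files(db_name, boto3.client('glue'), boto3.client('s3')). Passing the clients in as arguments is a design choice of this sketch: it keeps the functions free of global state and lets them be exercised with stand-in objects, without AWS access.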