
How to list all databases and tables in AWS Glue Catalog?


I created a Development Endpoint in the AWS Glue console, and I now have access to SparkContext and SQLContext in the gluepyspark console.

How can I access the catalog and list all databases and tables? The usual sqlContext.sql("show tables").show() does not work.

The CatalogConnection class might help, but I have no idea which package it is in. I tried importing it from awsglue.context without success.


Solution

  • I spent several hours looking for information about the CatalogConnection class but found nothing, not even in the aws-glue-libs repository (https://github.com/awslabs/aws-glue-libs).

    In my case, I needed the table names in the Glue Job Script console.

    In the end I used the boto3 library and retrieved the database and table names with the Glue client:

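    import boto3

    # Point the client at the region that holds your Glue Data Catalog.
    client = boto3.client('glue', region_name='us-east-1')

    # Note: get_databases and get_tables return one page per call;
    # follow NextToken (or use a paginator) for large catalogs.
    responseGetDatabases = client.get_databases()
    databaseList = responseGetDatabases['DatabaseList']

    for databaseDict in databaseList:
        databaseName = databaseDict['Name']
        # print(...) with a single argument works on both Python 2 and 3.
        print('\ndatabaseName: ' + databaseName)

        responseGetTables = client.get_tables(DatabaseName=databaseName)
        tableList = responseGetTables['TableList']

        for tableDict in tableList:
            tableName = tableDict['Name']
            print('\n-- tableName: ' + tableName)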

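    It is important to set the region correctly: the Glue Data Catalog is separate for each region, so the client must point at the region where your databases were created.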

    References:

    get_databases - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_databases

    get_tables - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_tables
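
One caveat with the listing above: a single get_databases or get_tables call returns only one page of results, so a large catalog may be cut short. A minimal sketch of a paginated variant follows, assuming the boto3 Glue paginators for get_databases and get_tables. The helper name list_catalog is hypothetical (it is not part of boto3), and it takes the client as an argument so the region stays under your control.

```python
from collections import OrderedDict


def list_catalog(glue_client):
    """Return an ordered mapping of database name -> list of table names.

    glue_client is assumed to be a boto3 Glue client, e.g.
    boto3.client('glue', region_name='us-east-1').
    list_catalog is a hypothetical helper name, not a boto3 API.
    """
    catalog = OrderedDict()

    # The paginator follows NextToken, so every page of databases is read.
    database_pages = glue_client.get_paginator('get_databases').paginate()
    for database_page in database_pages:
        for database in database_page['DatabaseList']:
            catalog[database['Name']] = []

    # Same idea for tables, one paginated listing per database.
    table_paginator = glue_client.get_paginator('get_tables')
    for database_name in catalog:
        for table_page in table_paginator.paginate(DatabaseName=database_name):
            for table in table_page['TableList']:
                catalog[database_name].append(table['Name'])

    return catalog


# Usage (needs boto3 and AWS credentials, so it is left commented out):
#
#   import boto3
#   client = boto3.client('glue', region_name='us-east-1')
#   for database_name, table_names in list_catalog(client).items():
#       print(database_name, table_names)
```

Returning a mapping instead of printing inside the loop makes the result easy to reuse in a job script, for example to loop over every table in one database.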