amazon-web-services aws-glue aws-glue-data-catalog

How to create a data catalog in AWS Glue externally?

I want to create a data catalog in AWS Glue externally, i.e. without using crawlers or the console. Is there any way to do this?

Solution

The AWS Glue Data Catalog holds metadata about various data sources, e.g. S3, DynamoDB etc. Instead of using crawlers or the AWS Console, you can populate the Data Catalog directly through the AWS Glue API, which has operations for the different structures, such as Database and Table. AWS provides SDKs for several languages, e.g. boto3 for Python, with an easy-to-use client API. So as long as you know how your data is structured, you can use methods such as create_database and create_table:

Create Database definition:

from pprint import pprint
import boto3

client = boto3.client('glue')
response = client.create_database(
    DatabaseInput={
        'Name': 'my_database',  # Required
        'Description': 'Database created with boto3 API',
        'Parameters': {
            'my_param_1': 'my_param_value_1'
        },
    }
)
pprint(response)

# Output
{
    'ResponseMetadata': {
        'HTTPHeaders': {
            'connection': 'keep-alive',
            'content-length': '2',
            'content-type': 'application/x-amz-json-1.1',
            'date': 'Fri, 11 Oct 2019 12:37:12 GMT',
            'x-amzn-requestid': '12345-67890'
        },
        'HTTPStatusCode': 200,
        'RequestId': '12345-67890',
        'RetryAttempts': 0
    }
}
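If the script may run more than once, the second create_database call will fail because the database already exists. A minimal sketch of one way to handle that, assuming the boto3 Glue client exposes AlreadyExistsException under client.exceptions; the helper name ensure_database is hypothetical and not part of boto3:

```python
def ensure_database(client, name, description=None):
    """Create a Glue database unless it already exists.

    `client` is expected to be a boto3 Glue client, e.g. boto3.client('glue').
    Returns True if the database was created, False if it was already there.
    `ensure_database` is an illustrative helper, not a boto3 function.
    """
    database_input = {'Name': name}
    if description:
        database_input['Description'] = description
    try:
        client.create_database(DatabaseInput=database_input)
        return True
    except client.exceptions.AlreadyExistsException:
        # The catalog already has a database with this name; nothing to do.
        return False


# Usage (needs AWS credentials and Glue permissions):
#   import boto3
#   created = ensure_database(boto3.client('glue'), 'my_database')
```

The helper takes the client as an argument so that it can be exercised with a stub instead of a real AWS connection.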

Create Table definition:

response = client.create_table(
    DatabaseName='my_database',
    TableInput={
        'Name': 'my_table',
        'Description': 'Table created with boto3 API',
        'StorageDescriptor': {
            'Columns': [
                {
                    'Name': 'my_column_1',
                    'Type': 'string',
                    'Comment': 'This is a very useful column',
                },
                {
                    'Name': 'my_column_2',
                    'Type': 'string',
                    'Comment': 'This is not as useful',
                },
            ],
            'Location': 's3://some/location/on/s3',
        },
        'Parameters': {
            'classification': 'json',
            'typeOfData': 'file',
        }
    }
)

pprint(response)

# Output
{
    'ResponseMetadata': {
        'HTTPHeaders': {
            'connection': 'keep-alive',
            'content-length': '2',
            'content-type': 'application/x-amz-json-1.1',
            'date': 'Fri, 11 Oct 2019 12:38:57 GMT',
            'x-amzn-requestid': '67890-12345'
        },
        'HTTPStatusCode': 200,
        'RequestId': '67890-12345',
        'RetryAttempts': 0
    }
}
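The table above only records columns and a location. If the table should also be readable by query engines, the StorageDescriptor usually needs input/output format and SerDe information as well. The sketch below builds such a TableInput from a simple column mapping. The helper name build_table_input is hypothetical, and the format and SerDe class names are assumptions for line-delimited JSON that should be checked against your engine's documentation:

```python
def build_table_input(name, columns, location, description=None):
    """Build a TableInput dict for client.create_table().

    `columns` maps column name -> Glue/Hive type string, e.g. {'id': 'bigint'}.
    `build_table_input` is an illustrative helper, not a boto3 function.
    """
    table_input = {
        'Name': name,
        'TableType': 'EXTERNAL_TABLE',
        'StorageDescriptor': {
            'Columns': [
                {'Name': col, 'Type': col_type}
                for col, col_type in columns.items()
            ],
            'Location': location,
            # Assumed classes for line-delimited JSON; adjust for other formats.
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe',
            },
        },
        'Parameters': {
            'classification': 'json',
            'typeOfData': 'file',
        },
    }
    if description:
        table_input['Description'] = description
    return table_input


# Usage (needs AWS credentials and Glue permissions):
#   import boto3
#   client = boto3.client('glue')
#   client.create_table(
#       DatabaseName='my_database',
#       TableInput=build_table_input(
#           'my_table',
#           {'my_column_1': 'string', 'my_column_2': 'string'},
#           's3://some/location/on/s3',
#       ),
#   )
```

Keeping the dict construction separate from the API call makes it easy to inspect or unit-test the definition before anything is written to the catalog.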