amazon-web-services, amazon-s3, amazon-athena, aws-glue

Add a partition to a Glue table via the API on AWS?


I have an S3 bucket that is constantly being filled with new data, and I am using Athena and Glue to query that data. The problem is that if Glue doesn't know a new partition has been created, it doesn't know it needs to search there. Making an API call to run the Glue crawler each time I need a new partition is too expensive, so the best solution seems to be to tell Glue directly that a new partition has been added, i.e. to create the new partition in its table properties. I looked through the AWS documentation but had no luck. I am using Java with AWS. Any help?


Solution

    1. You can configure your Glue crawler to be triggered every 5 minutes.

    2. You can create a Lambda function that either runs on a schedule or is triggered by an event from your bucket (e.g. a putObject event), and have that function call Athena to discover partitions:

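       import boto3

       athena = boto3.client('athena')

       def lambda_handler(event, context):
           # MSCK REPAIR TABLE scans the table location and registers any
           # Hive-style partitions (key=value prefixes) missing from the catalog.
           athena.start_query_execution(
               QueryString="MSCK REPAIR TABLE mytable",
               ResultConfiguration={
                   'OutputLocation': "s3://some-bucket/_athena_results"
               }
           )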
      
    3. Use Athena to add partitions manually. You can also run SQL queries via the API, as in my Lambda example.

      Example from the Athena manual:

       ALTER TABLE orders ADD
         PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
         PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';
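
Options 2 and 3 can be combined: instead of rescanning the whole table with MSCK REPAIR TABLE, the function can submit an ALTER TABLE statement for just the partition that arrived. Below is a minimal sketch of that idea, not a definitive implementation. The helper names `build_add_partition_sql` and `add_partition`, and the `database` and `output_location` parameters, are hypothetical and introduced only for illustration. The Athena client is passed in as an argument, so the snippet itself does not import boto3.

```python
def build_add_partition_sql(table, partitions, location):
    """Build an ALTER TABLE ... ADD IF NOT EXISTS PARTITION statement.

    `partitions` maps partition column names to values, in column order.
    Values are interpolated directly into the SQL, so they are assumed
    to come from a trusted source (this sketch does no escaping).
    """
    spec = ", ".join(f"{col} = '{val}'" for col, val in partitions.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


def add_partition(athena, table, partitions, location, database, output_location):
    """Submit the statement through an Athena client and return the query id.

    `athena` is expected to behave like boto3.client('athena').
    start_query_execution only queues the query; it does not wait for it
    to finish.
    """
    response = athena.start_query_execution(
        QueryString=build_add_partition_sql(table, partitions, location),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Inside a Lambda handler you would create the client once with `boto3.client('athena')` and call, for example, `add_partition(athena, "orders", {"dt": "2016-05-14", "country": "IN"}, "s3://mystorage/path/to/INDIA_14_May_2016", "mydatabase", "s3://some-bucket/_athena_results")`, where `mydatabase` is a placeholder for your own Glue database. IF NOT EXISTS keeps the call safe to repeat when several objects land in the same partition.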