Search code examples
amazon-web-servicesamazon-s3hiveamazon-athenaamazon-quicksight

How to make MSCK REPAIR TABLE execute automatically in AWS Athena


I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3 with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile.

After uploading the data to S3, I want to investigate it using Athena. Also, I would like to visualize them in QuickSight by connecting to Athena as a data source.

The problem is that after each run of my Spark batch, the newly generated data stored in S3 will not be discovered by Athena, unless I manually run the query MSCK REPAIR TABLE.

Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?


Solution

  • There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or using an AWS Data pipeline?

    From any of these, you should be able to fire off the following CLI command.

    $ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"

    Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.

    An example Lambda Function could be written as such:

    import boto3
    
    def lambda_handler(event, context):
        bucket_name = 'some_bucket'
    
        client = boto3.client('athena')
    
        config = {
            'OutputLocation': 's3://' + bucket_name + '/',
            'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    
        }
    
        # Query Execution Parameters
        sql = 'MSCK REPAIR TABLE some_database.some_table'
        context = {'Database': 'some_database'}
    
        client.start_query_execution(QueryString = sql, 
                                     QueryExecutionContext = context,
                                     ResultConfiguration = config)
    

    You would then configure a trigger to execute your Lambda function when new data are added under the DATA/ prefix in your bucket.

    Ultimately, explicitly rebuilding the partitions after you run your Spark Job using a job scheduler has the advantage of being self documenting. On the other hand, AWS Lambda is convenient for jobs like this one.