Search code examples
amazon-web-servicesaws-glueamazon-athenaamazon-quicksight

How to get last run Datetime of Crawler in Athena?


I have AWS Glue Crawler which runs twice a day and populates data in Athena.

Quicksight takes data from Athena and shows it in a dashboard.

I am implementing LastDataRefresh (Datetime) to show in a Quicksight dashboard. Is there a way I can get the last crawler run datetime so that I can store it in an Athena table and show in Quicksight ?

Any other suggestions are also welcome.


Solution

  • TL;DR Extract the crawler last run time from Glue's CloudWatch logs

    Glue sends a series of events to CloudWatch during each crawler run. Extract and process the "finished running" logs from /aws-glue/crawlers log group to get the latest for each crawler.

    Logs for a single crawler run:

    2021-12-15T12:08:54.448+01:00   [7dd..] BENCHMARK : Running Start Crawl for Crawler lorawan_datasets_bucket_crawler
    2021-12-15T12:09:12.559+01:00   [7dd..] BENCHMARK : Classification complete, writing results to database jokerman_events_database
    2021-12-15T12:09:12.560+01:00   [7dd..] INFO : Crawler configured with SchemaChangePolicy {"UpdateBehavior":"UPDATE_IN_DATABASE","DeleteBehavior":"DEPRECATE_IN_DATABASE"}.
    2021-12-15T12:09:27.064+01:00   [7dd..] BENCHMARK : Finished writing to Catalog
    2021-12-15T12:12:13.768+01:00   [7dd..] BENCHMARK : Crawler has finished running and is in state READY
    

    Extract and process the BENCHMARK : Crawler has finished running and is in state READY logs:

    import boto3
    from datetime import datetime, timedelta
    
    def get_last_runs():
      session = boto3.Session(profile_name='sandbox', region_name='us-east-1')
      logs = session.client('logs')
    
      startTime = startTime = datetime.now() - timedelta(days=14)
    
      # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.filter_log_events
      filtered_events = logs.filter_log_events(
          logGroupName="/aws-glue/crawlers",
          # https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html#matching-terms-events
          filterPattern="BENCHMARK state READY", # match "BENCHMARK : Crawler has finished running and is in state READY" messages
          startTime=int(startTime.timestamp()*1000)
      )
    
      completed_runs = [
          {"crawler": m.get("logStreamName"), "timestamp": datetime.fromtimestamp(m.get("timestamp")/1000).isoformat()}
          for m in filtered_events["events"]
      ]
    
      # rework the list to get a dictionary of the last runs by crawler
      crawlers = set([r['crawler'] for r in completed_runs])
      last_runs = dict()
    
      for n in crawlers:
        last_runs[n] = max([d["timestamp"] for d in completed_runs if d["crawler"] == n])
    
      print(last_runs)
    

    Output:

    {
      'lorawan_datasets_bucket_crawler': '2021-12-15T12:12:13.768000',
      'jokerman_lorawan_events_table_crawler': '2021-12-15T12:12:12.007000'
    }