I have an AWS Glue crawler that runs twice a day and populates data in Athena.
QuickSight takes data from Athena and shows it in a dashboard.
I am implementing a LastDataRefresh (datetime) field to show in the QuickSight dashboard. Is there a way to get the last crawler run datetime, so that I can store it in an Athena table and show it in QuickSight?
Any other suggestions are also welcome.
TL;DR: Extract the crawler's last run time from Glue's CloudWatch logs.
Glue sends a series of events to CloudWatch during each crawler run. Extract and process the "finished running" messages from the /aws-glue/crawlers
log group to get the latest run for each crawler.
Logs for a single crawler run:
2021-12-15T12:08:54.448+01:00 [7dd..] BENCHMARK : Running Start Crawl for Crawler lorawan_datasets_bucket_crawler
2021-12-15T12:09:12.559+01:00 [7dd..] BENCHMARK : Classification complete, writing results to database jokerman_events_database
2021-12-15T12:09:12.560+01:00 [7dd..] INFO : Crawler configured with SchemaChangePolicy {"UpdateBehavior":"UPDATE_IN_DATABASE","DeleteBehavior":"DEPRECATE_IN_DATABASE"}.
2021-12-15T12:09:27.064+01:00 [7dd..] BENCHMARK : Finished writing to Catalog
2021-12-15T12:12:13.768+01:00 [7dd..] BENCHMARK : Crawler has finished running and is in state READY
Extract and process the BENCHMARK : Crawler has finished running and is in state READY
messages:
import boto3
from datetime import datetime, timedelta

def get_last_runs():
    session = boto3.Session(profile_name='sandbox', region_name='us-east-1')
    logs = session.client('logs')
    # only look at the last 14 days of logs
    startTime = datetime.now() - timedelta(days=14)
    # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.filter_log_events
    filtered_events = logs.filter_log_events(
        logGroupName="/aws-glue/crawlers",
        # https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html#matching-terms-events
        filterPattern="BENCHMARK state READY",  # match "BENCHMARK : Crawler has finished running and is in state READY" messages
        startTime=int(startTime.timestamp() * 1000)
    )
    # the log stream name in /aws-glue/crawlers is the crawler name
    completed_runs = [
        {"crawler": m.get("logStreamName"), "timestamp": datetime.fromtimestamp(m.get("timestamp") / 1000).isoformat()}
        for m in filtered_events["events"]
    ]
    # rework the list into a dictionary of the last run per crawler
    crawlers = set(r["crawler"] for r in completed_runs)
    last_runs = dict()
    for n in crawlers:
        last_runs[n] = max(d["timestamp"] for d in completed_runs if d["crawler"] == n)
    print(last_runs)
Output:
{
'lorawan_datasets_bucket_crawler': '2021-12-15T12:12:13.768000',
'jokerman_lorawan_events_table_crawler': '2021-12-15T12:12:12.007000'
}
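To surface these timestamps in QuickSight, one option is to serialize the dictionary as newline-delimited JSON, write it to S3, and point an external Athena table at that prefix. A minimal sketch; the bucket, key, and table names are hypothetical, not anything from your setup:

```python
import json


def to_json_lines(last_runs):
    """Serialize {crawler_name: iso_timestamp} into newline-delimited JSON,
    one object per line, as Athena's JSON SerDe expects."""
    return "\n".join(
        json.dumps({"crawler": name, "last_run": ts})
        for name, ts in sorted(last_runs.items())
    )


def publish_last_runs(last_runs, bucket="my-dashboard-bucket", key="last_runs/last_runs.json"):
    # hypothetical bucket and key; overwrite the object on every refresh so the
    # Athena table always reflects the latest crawler runs
    import boto3
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=to_json_lines(last_runs).encode("utf-8"),
    )
```

A matching table could then be created with something like CREATE EXTERNAL TABLE last_crawler_runs (crawler string, last_run string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://my-dashboard-bucket/last_runs/', and QuickSight reads it through the existing Athena data source. Alternatively, the Glue get_crawler API returns a LastCrawl structure with a StartTime, which may be simpler if the start time (rather than the finish time) of the last run is acceptable.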