I have the following pipeline set up in AWS (trying not to give away too much company info here). This may be an oversimplification, but here goes.
We have an ETL process with data sources that are crawled daily and staged in Athena. Glue jobs are then set up to pick up specific tables from the staged data in Athena and write them to a MySQL database hosted elsewhere. (This is all set up and working.)
The company requires us to make sure that the column names, datatypes, nullability, etc. are maintained in MySQL. We do NOT drop and recreate tables in MySQL, as other departments may need to perform further transformations on the data, and there are views tied to these tables, along with referential integrity constraints.
What I then need to achieve is tracking column renames, column datatype changes, columns added, columns removed, new tables added, tables removed, etc. on the Athena tables, as soon as the data is staged.
I have tried setting up a changelog table and then querying the information schema:
SELECT * FROM information_schema.columns
WHERE table_schema = 'database_name'
  AND table_name = 'table_name'
and then pulling the resulting columns into the changelog table by counting the columns the previous day vs. counting them today. If there is a change, I compare the column names returned by the above query to what existed the previous day (a sketch of this comparison follows the schemas below). This is stored in the changelog as follows:
table_name varchar(255),
count_columns int,
table_id int,
created_at timestamp,
updated_at timestamp
table_id then references a second table:
table_id int (foreign key to changelog),
column_name varchar(45),
column_datatype varchar(255),
column_nullable char(5),
created_at timestamp,
updated_at timestamp
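A minimal sketch of that daily comparison, assuming the query runs through awswrangler and yesterday's snapshot is read back from wherever the changelog rows were persisted (all bucket, database, and table names here are placeholders):

    import awswrangler as wr

    # Today's columns for one staged table
    today = wr.athena.read_sql_query(
        "SELECT column_name, data_type, is_nullable "
        "FROM information_schema.columns "
        "WHERE table_schema = 'database_name' AND table_name = 'table_name'",
        database="database_name",
    )

    # Yesterday's snapshot (hypothetical location)
    yesterday = wr.s3.read_parquet("s3://changelog-bucket/table_name/latest/")

    added = set(today["column_name"]) - set(yesterday["column_name"])
    removed = set(yesterday["column_name"]) - set(today["column_name"])

    # Datatype changes on columns present on both days
    merged = today.merge(yesterday, on="column_name", suffixes=("_new", "_old"))
    retyped = merged[merged["data_type_new"] != merged["data_type_old"]]

Note that a pure column rename shows up here as one column removed plus one added; separating that from a genuine drop-and-add needs a heuristic (e.g. same ordinal position and datatype).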
I know there's no such thing as referential integrity in Athena, but I am enforcing my own. It's one of the reasons I feel this solution is messy and takes too long (loading each table and comparing, etc.). Is there nothing built into Athena or AWS Glue that can handle this? Could we perhaps have a Lambda query the data in Athena, compare source and destination, and then correct it in MySQL?
So here's how I solved it.
I created the following tables:
A dbtables table that contains a snapshot of the current tables.
Step 1 - an AWS Glue job that converts the JSON source data into Parquet, writing it to a staging bucket (see the first sketch after these steps).
Step 2 - A Crawler that crawls the data in staging.
Step 3 - every morning, a Lambda runs that does the following:
Event trigger - a write to the S3 staging bucket
Lambda (using AWS Wrangler as a layer) - sketched below
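A minimal sketch of the Step 1 conversion, assuming a Python shell job using awswrangler rather than a Spark job (paths are placeholders):

    import awswrangler as wr

    # Read the raw JSON source
    df = wr.s3.read_json("s3://raw-bucket/source/", lines=True)

    # Write Parquet to the staging bucket; the Step 2 crawler picks this up
    wr.s3.to_parquet(
        df=df,
        path="s3://staging-bucket/table_name/",
        dataset=True,
        mode="overwrite",
    )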
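And a sketch of the Lambda's comparison step, assuming the staged schema is read from the Glue Data Catalog via awswrangler and compared against MySQL's information_schema (the Glue connection and database names are placeholders):

    import awswrangler as wr

    def handler(event, context):
        # Current staged schema, straight from the Glue Data Catalog
        athena_cols = wr.catalog.table(database="staging_db", table="table_name")

        # Destination schema from MySQL
        con = wr.mysql.connect(connection="mysql-connection")
        mysql_cols = wr.mysql.read_sql_query(
            "SELECT column_name, data_type, is_nullable "
            "FROM information_schema.columns "
            "WHERE table_schema = 'target_db' AND table_name = 'table_name'",
            con=con,
        )
        con.close()
        # Normalise header case (MySQL 8 returns these uppercase)
        mysql_cols.columns = [c.lower() for c in mysql_cols.columns]

        added = set(athena_cols["Column Name"]) - set(mysql_cols["column_name"])
        removed = set(mysql_cols["column_name"]) - set(athena_cols["Column Name"])

        # From here, write the differences to the dbtables snapshot/changelog
        # and/or generate ALTER TABLE statements against MySQL for review.
        return {"added": sorted(added), "removed": sorted(removed)}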