I have the following pipeline set up in AWS (trying not to give away too much company info here). This may be an oversimplification, but here goes.
We have an ETL process with data sources that are crawled daily and staged in Athena. Glue jobs are then set up to pick up specific tables from the staged data in Athena and write them to a MySQL database hosted elsewhere. (This is all set up and working.)
The company requires us to make sure that the column names, datatypes, nullability, etc. are maintained in MySQL. We do NOT drop and recreate tables in MySQL, as other departments may need to perform further transformations on the data, and there are views tied to these tables, along with referential integrity constraints.
What I then need to achieve is tracking column renames, column datatype changes, columns added, columns removed, new tables added, tables removed, etc. on the Athena tables, as soon as the data is staged.
I have tried setting up a changelog table and then querying the information schema:
SELECT * FROM information_schema.columns
WHERE table_schema = 'database_name'
  AND table_name = 'table_name'
and then pulling the resulting columns into the changelog table by counting the columns the previous day vs. counting them today. If there is a change, I compare the column names returned by the above query to what existed the previous day (a sketch of this comparison follows the schemas below). This is stored in the changelog as follows:
table_name varchar(255),
count_columns int,
table_id int,
created_at timestamp,
updated_at timestamp
table_id then references a second table:
table_id int (foreign key to changelog),
column_name varchar(45),
column_datatype varchar(255),
column_nullable char(5),
created_at timestamp,
updated_at timestamp
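A minimal sketch of that daily comparison, assuming the query runs through awswrangler and yesterday's snapshot is read back from wherever the changelog rows were persisted (all bucket, database, and table names here are placeholders):

    import awswrangler as wr

    # Today's columns for one staged table
    today = wr.athena.read_sql_query(
        "SELECT column_name, data_type, is_nullable "
        "FROM information_schema.columns "
        "WHERE table_schema = 'database_name' AND table_name = 'table_name'",
        database="database_name",
    )

    # Yesterday's snapshot (hypothetical location)
    yesterday = wr.s3.read_parquet("s3://changelog-bucket/table_name/latest/")

    added = set(today["column_name"]) - set(yesterday["column_name"])
    removed = set(yesterday["column_name"]) - set(today["column_name"])

    # Datatype changes on columns present on both days
    merged = today.merge(yesterday, on="column_name", suffixes=("_new", "_old"))
    retyped = merged[merged["data_type_new"] != merged["data_type_old"]]

Note that a pure column rename shows up here as one column removed plus one added; separating that from a genuine drop-and-add needs a heuristic (e.g. same ordinal position and datatype).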
I know there's no such thing as referential integrity in Athena, but I am enforcing my own. It's one of the reasons I feel this solution is messy and takes too long (loading each table and comparing, etc.). Is there nothing built into Athena or AWS Glue that can handle this? Could we perhaps have a Lambda query the data in Athena, compare source and destination, and then correct it in MySQL?
So here's how I solved it.
I created the following tables:
A dbtables table that contains a snapshot of the current tables.
Step 1 - an AWS Glue job that converts the JSON source data into Parquet, writing it to a staging bucket (see the first sketch after these steps).
Step 2 - A Crawler that crawls the data in staging.
Step 3 - every morning, a Lambda runs that does the following:
Event trigger - a write to the S3 staging bucket
Lambda (using AWS Wrangler as a layer) - sketched below
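A minimal sketch of the Step 1 conversion, assuming a Python shell job using awswrangler rather than a Spark job (paths are placeholders):

    import awswrangler as wr

    # Read the raw JSON source
    df = wr.s3.read_json("s3://raw-bucket/source/", lines=True)

    # Write Parquet to the staging bucket; the Step 2 crawler picks this up
    wr.s3.to_parquet(
        df=df,
        path="s3://staging-bucket/table_name/",
        dataset=True,
        mode="overwrite",
    )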
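And a sketch of the Lambda's comparison step, assuming the staged schema is read from the Glue Data Catalog via awswrangler and compared against MySQL's information_schema (the Glue connection and database names are placeholders):

    import awswrangler as wr

    def handler(event, context):
        # Current staged schema, straight from the Glue Data Catalog
        athena_cols = wr.catalog.table(database="staging_db", table="table_name")

        # Destination schema from MySQL
        con = wr.mysql.connect(connection="mysql-connection")
        mysql_cols = wr.mysql.read_sql_query(
            "SELECT column_name, data_type, is_nullable "
            "FROM information_schema.columns "
            "WHERE table_schema = 'target_db' AND table_name = 'table_name'",
            con=con,
        )
        con.close()
        # Normalise header case (MySQL 8 returns these uppercase)
        mysql_cols.columns = [c.lower() for c in mysql_cols.columns]

        added = set(athena_cols["Column Name"]) - set(mysql_cols["column_name"])
        removed = set(mysql_cols["column_name"]) - set(athena_cols["Column Name"])

        # From here, write the differences to the dbtables snapshot/changelog
        # and/or generate ALTER TABLE statements against MySQL for review.
        return {"added": sorted(added), "removed": sorted(removed)}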