I have built an Athena parquet version based on daily csv raw data. So far, I have around 6TB of data in parquet structure.
Since the raw data file has changed adding a couple of columns to the daily export, I need to add few more columns to the "Athena schema".
Which is the most efficient way to do it? Furthermore how can I handle the historical data which do not have this column?
The easiest way would be to add a new table, but I prefer to keep the db-like structure as it is.
Thanks in advance
In Athena tables are just metadata, dropping a table leaves the data in place. Changing a table schema can be done by dropping the table and creating a new one with the new schema – no data will be harmed.
The way Athena works with Parquet is that it maps columns in the table schema to columns in the files by name, and if a column doesn't exist in a file Athena will treat that as if it was all NULL.
Using these two facts you can recreate your table with the new schema and you will be fine. If you want to test that it works first, create a new table with the new schema and run some queries, then you drop both tables and recreate the table with the original name.
There is a way to do it without dropping the table, but it requires using the Glue Data Catalog API directly, see UpdateTable
. Athena does not have any equivalent operation, you can't alter or rename tables.