I am trying to create an Athena table over s3 parquet files.
I created a table in glue catalog (Data Catalog -> Databases -> Tables -> Add Table).
Added s3 location and uploaded a schema manually.
When I query the table in Athena there are no record in the query result - simple select * from table
.
I searched the problem and saw a possible explanation for this is having several different schemas under the location folder. That is the case for me, I have under the same s3 folder parquet files with two different schemas due to addition of some fields - which means the new schema contains the previous. The data is partitioned by dt and tm, and one of the tm's has a new schema - when I tried to create a table over this single tm it worked and returned records as expected.
Is there a way to "merge" the schemas? Because using Crawlers I saw there's an option to update the schema:
Schema updates in the data store:
Update the table definition in the data catalog
What I would have wanted is to see the new columns, and to have nulls in the data belongs to previous parquets under these new columns.
I that's not possible, should I use glue in this situation? there are new files in the folder every day and I wanted to avoid re-crawling the folder every day.
Thanks in advance.
Parquet format supports schema evolution (SOURCE : https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html).
You just need to create a table with all columns including the new columns that are not in your old parquet files, but in the newest files.
If a column in your schema but not in the parquet file, the column will appear as null which is what you were hoping for. This applies even if parquet files containing different column definitions are in the same location.
Another way you can create tables is using the CREATE EXTERNAL TABLES directly in SQL : https://docs.aws.amazon.com/athena/latest/ug/create-table.html
SQL syntax looks roughtly like this :
create external table nameofyourtable (
COLUMN1_NAME type,
COLUMN2_NAME type,
COLUMN3_NAME type,
...
COLUMNX_NAME type ) Stored as parquet location "s3://bucketname/.../parquetfilebucket"