Search code examples
amazon-web-servicesamazon-s3avroaws-glueamazon-athena

AWS Athena - duplicate columns due to partitionning


We have a glue crawler that read avro files in S3 and create a table in glue catalog accordingly. The thing is that we have a column named 'foo' that came from the avro schema and we also have something like 'foo=XXXX' in the s3 bucket path, to have Hive partitions.

What we did not know is that the crawler will then create a table which now has two columns with the same name, thus our issue while querying the table:

HIVE_INVALID_METADATA: Hive metadata for table mytable is invalid: Table descriptor contains duplicate columns

Is there a way to tell glue to map the partition 'foo' to another column name like 'bar' ? That way we would avoid having to reprocess our data by specifying a new partition name in the s3 bucket path..

Or any other suggestions ?


Solution

  • Glue Crawlers are pretty terrible, this is just one of the many ways where it creates unusable tables. I think you're better off just creating the tables and partitions with a simple script. Create the table without the foo column, and then write a script that lists your files on S3 do the Glue API calls (BatchCreatePartition), or execute ALTER TABLE … ADD PARTITION … calls in Athena.

    Whenever new data is added on S3, just add the new partitions with the API call or Athena query. There is no need to do all the work that Glue Crawlers do if you know when and how data is added. If you don't, you can use S3 notificatons to run Lambda functions that do the Glue API calls instead. Almost all solutions are better than Glue Crawlers.

    The beauty of Athena and Glue Catalog is that it's all just metadata, it's very cheap to throw it all away and recreate it. You can also create as many tables as you want that use the same location, to try out different schemas. In your case there is no need to move any objects on S3, you just need a different table and a different mechanism to add partitions to it.