Search code examples
amazon-athenaaws-glueamazon-kinesis-firehose

How to change column names of autodetected partitions created by Glue Crawler?


I have bucket which is used as destination for a Kinesis Firehose stream.

Firehose automatically creates date-based prefixes on that bucket using the yyyy/mm/dd/HH format.

Then I created a crawler that will search for data into this bucket and configured it as follow:

Crawler configuration

After running the crawler, it creates a table with the following schema:

| #   | Column name   | Data type | Key           |
| --- | -----------   | --------- | ------------- |
| 1   | numberissues  | int       |               |
| 2   | group         | string    |               |
| 3   | createdat     | string    |               |
| 4   | companyunitid | string    |               |
| 5   | partition_0   | string    | Partition (0) |
| 6   | partition_1   | string    | Partition (1) |
| 7   | partition_2   | string    | Partition (2) |
| 8   | partition_3   | string    | Partition (3) |

If I rename the partition-* to their right counterparts year, month, day and hour, the table is ready for me to use.

However, if the crawler runs again, the schema revets the column names to the original partition-*.

I know this would work for Hive partition schemas year=2018/month=04..., but I want to know if it's possible to "hint" Glue about the partition field names.

Another alternative would be trying to change the Firehose prefixing, but I couldn't find anything that suggests this is even possible.


Solution

  • In this case you can set the "Ignore the change and don't update the data catalog" option.

    Then you can rename the columns. This will allow the crawler to detect new partitions on the next run but keep therenamed names.