Search code examples
amazon-web-servicesparquetaws-glueamazon-athenaamazon-kinesis-firehose

AWS Glue - Adding fileld to a struct field


I have a table defined in AWS Glue. I use AWS Kinesis streams to stream logs into S3 using this table definition, using parquet file format. It's partitioned by date.

One of the fields in the table is a struct with several fields, event_payload, one of them an array of structs. Recently I added a new field to the inner struct in the log data. I want to add it in the table definition so that it will be written to the S3, and so that I can query it using AWS Athena.

I tried editing the table schema directly in the console. It does write the data to S3, but I get an exception in Athena when querying:

HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'event_payload' in table 'c2s.logs' is declared as type 'struct<...>', but partition 'year=2019/month=201910/day=20191026/hour=2019102623' declared column 'event_payload' as type 'struct<...>'.

I tried deleting all the partitions and repairing the table, as specified here, but I got another error:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://.../year=2019/month=201910/day=20191022/hour=2019102216/beaconFirehose-4-2019-10-22-16-34-21-71f183d2-207e-4ae9-98fe-07dda0bab70c.parquet (offset=0, length=801679): Schema mismatch, metastore schema for row column event_payload.markings.element has 8 fields but parquet schema has 7 fields

So the schema has a field which is not present in the data. Is there a way to specify an optional field? If it's not present, just make it null.


Solution

  • As per link schema updates on nested structure is not supported in Athena. One way to make this work can be to flatten the struct type with the help of relalaionalize operator in Glue. for ex:

    val frames: Seq[DynamicFrame] = lHistory.relationalize(rootTableName = "hist_root", stagingPath = redshiftTmpDir, JsonOptions.empty)