Search code examples
pythonamazon-web-servicesaws-glueamazon-kinesis-firehoseapache-iceberg

How do I insert timestamp data into an AWS Glue managed Iceberg table using AWS Firehose?


Using AWS Firehose to ingest data into an Iceberg table managed by AWS Glue, I'm unable to insert timestamp data.

Firehose

I'm trying to insert data using the following script:

json_data = json.dumps(
    {
        "ADF_Record": {
            "foo": "bar",
            "baz": "2024-09-04T18:56:15.114"
        },
        "ADF_Metadata": { 
            "OTF_Metadata": {
                "DestinationDatabaseName": "my_db",
                "DestinationTableName": "my_table",
                "Operation": "INSERT"
            }
        }
    }
)

response = boto3.client("firehose").put_record(
    DeliveryStreamName="my_stream",
    Record={"Data": json_data.encode()}
)

Note that the baz value corresponds to a timestamp of type TimestampType.withoutZone as referenced in the Firehose documentation.

Glue

  • My table is of the Iceberg type.
  • I did not define any additional SerDe library or SerDe parameters.
  • The table schema is :
    • foo : string
    • baz : timestamp

Error

Whenever I try to insert data using this method, no data is delivered and I get this error on Firehose side :

Firehose is unable to convert column data in your record to the column type specified within the schema. Table: my_db.my_table

Things I tried

  • Data is written when baz is removed from the payload (the pipeline seems functional without timestamp).
  • Switching to epoch format (1725476175114000) doesn't help. Glue creates a new version of the table with baz as date and the written data is not legible.
  • Switching to TimestampType.WithZone results in the same error.
  • Trying a SerDe library like org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe and timestamp.formats parameter doesn't help. Glue creates a new version of the table and removes the SerDe parameters altogether.

I'm about to give up and just write timestamps as string. Any insight is appreciated!


Solution

  • I think I figured it out. Modifying the schema of your Glue table has no incidence on the schema of the underlying Iceberg table (i.e. metadata are not altered by Glue), it only goes one way. This resulted in an inconsistency between schemas : baz was a date for Iceberg and a timestamp for Glue.

    What worked was to delete baz with PyIceberg (an Athena query might do the trick too) and recreate it with the timestamp type. This will update the metadata on Iceberg side and the Glue schema altogether. Now, baz is properly interpreted by Firehose, Glue and Iceberg.