Search code examples
hiveamazon-kinesisamazon-athenaamazon-kinesis-firehose

Kinesis Firehose to s3: data delivered to wrong hour in s3 path


I'm using Kinesis Firehose to buffer IoT data, and write it to s3. Firehose writes buffers to s3 in the format s3://bucket.me.com/YYYY/MM/DD/HH

Data that comes in at 10:59a may get buffered by Firehose and not written out until 11:00a (s3://bucket.me.com/2017/03/09/11).

The problem is, when creating partitions for Athena, the partition for hour 10 will not contain all the data for hour 10 because it is in the hour 11 path.

Here is an example that better illustrates:

An IoT sends the following data to Firehose, which at 2a writes it to s3://bucket.me.com/2017/03/24/02/file-0000. The file contents look like this:

{"id":1,"dt":"2017-03-24 01:59:40"}
{"id":2,"dt":"2017-03-24 01:59:41"}
{"id":3,"dt":"2017-03-24 01:59:42"}

I then create an Athena table:

CREATE EXTERNAL TABLE sensor_data (
    id string,
    dt timestamp)
PARTITIONED BY (year int, month int, day int, hour int)
ROW FORMAT  serde 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://bucket.me.com/';

ALTER TABLE sensor_data ADD PARTITION (year=2017,month=3,day=24,hour=1) location 's3://bucket.me.com/2017/03/24/01/';
ALTER TABLE sensor_data ADD PARTITION (year=2017,month=3,day=24,hour=2) location 's3://bucket.me.com/2017/03/24/02/';

When I run select * from sensor_data where hour = 1, I won't get the 3 records above returned because it will only read from the s3 path defined for partition hour=1 (and the 3 records are really in the hour=2 partition).

How do I avoid this problem?


Solution

  • You cant avoid it entirely, but writing more often will create more accurate results in the appropriate hour.