I am pushing logs to an S3 bucket via Firehose.
The data has a very simple format:
{
email: "some email",
message: "a log message",
data: "{ /* ...some json */ }"
}
I created this table definition for Athena:
CREATE EXTERNAL TABLE `logs`(
`email` string COMMENT 'from deserializer',
`message` string COMMENT 'from deserializer',
`data` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://USERLOGS/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'transient_lastDdlTime'='1583271303')
It works well when the S3 file is a single JSON blob, but Firehose batches entries into files in S3, and only the first entry in each batch is being queried.
How do I make it so the entire batch is queried?
I have 100 blobs but can only see 6 because of this.
I've encountered a similar issue.
Firehose produces files in S3 by concatenating each record onto an existing file, which actually results in invalid JSON. That is, if you put two records:
{"key":"1", "value":"v1"}
and
{"key":"2", "value":"v2"}
they will end up being written to S3 like this:
{"key":"1", "value":"v1"}{"key":"2", "value":"v2"}
Athena, on the other hand, expects to find each record on its own line. The only workaround I've been able to come up with is appending '\n' to each record before sending it to Firehose, so that the S3 output looks like this:
{"key":"1", "value":"v1"}
{"key":"2", "value":"v2"}
I hope this helps!