sql amazon-web-services hive aws-glue amazon-athena

AWS Athena custom data format?

I'd like to query my app logs on S3 with AWS Athena but I'm having trouble creating the table/specifying the data format.

This is how the log lines look:

2020-12-09T18:08:48.789Z {"reqid":"Root=1-5fd112b0-676bbf5a4d54d57d56930b17","cache":"xxxx","cacheKey":"yyyy","level":"debug","message":"cached value found"}

which is a timestamp followed by space and the JSON line I want to query.

Is there a way to query logs like this? I see CSV, TSV, JSON, Apache Web Logs and Text File with Custom Delimiters data formats are supported but because of the timestamp I can't simply use JSON.

Solution

Define table with single column:

CREATE EXTERNAL TABLE your_table(
 line STRING
)

ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/path/mylogs/';

You can extract timestamp and JSON using regexp, then parse JSON separately:

select ts, 
       json_extract(json_col, '$.reqid') AS reqid 
       ...
from
(
select regexp_extract(line, '(.*?) +',1) as ts,
       regexp_extract(line, '(.*?) +(.*)',2) as json_col
  from your_table
)s

Alternatively you can define regexSerDe table with 2 columns, SerDe will do parsing two columns and all you need is to parse JSON_COL:

 CREATE EXTERNAL TABLE your_table (
     ts STRING,
     json_col STRING
 )
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
 WITH SERDEPROPERTIES (
 "input.regex" = "^(.*?) +(.*)$"
 ) 
 LOCATION 's3://mybucket/path/mylogs/';

 SELECT ts, json_extract(json_col, '$.reqid') AS reqid  ...
 FROM your_table