Search code examples
sqlamazon-web-serviceshiveaws-glueamazon-athena

AWS Athena custom data format?


I'd like to query my app logs on S3 with AWS Athena but I'm having trouble creating the table/specifying the data format.

This is how the log lines look:

2020-12-09T18:08:48.789Z {"reqid":"Root=1-5fd112b0-676bbf5a4d54d57d56930b17","cache":"xxxx","cacheKey":"yyyy","level":"debug","message":"cached value found"}

which is a timestamp followed by space and the JSON line I want to query.

Is there a way to query logs like this? I see CSV, TSV, JSON, Apache Web Logs and Text File with Custom Delimiters data formats are supported but because of the timestamp I can't simply use JSON.


Solution

  • Define table with single column:

    CREATE EXTERNAL TABLE your_table(
     line STRING
    )
    
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      ESCAPED BY '\\'
      LINES TERMINATED BY '\n'
    LOCATION 's3://mybucket/path/mylogs/';
    

    You can extract timestamp and JSON using regexp, then parse JSON separately:

    select ts, 
           json_extract(json_col, '$.reqid') AS reqid 
           ...
    from
    (
    select regexp_extract(line, '(.*?) +',1) as ts,
           regexp_extract(line, '(.*?) +(.*)',2) as json_col
      from your_table
    )s
    

    Alternatively you can define regexSerDe table with 2 columns, SerDe will do parsing two columns and all you need is to parse JSON_COL:

     CREATE EXTERNAL TABLE your_table (
         ts STRING,
         json_col STRING
     )
     ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
     WITH SERDEPROPERTIES (
     "input.regex" = "^(.*?) +(.*)$"
     ) 
     LOCATION 's3://mybucket/path/mylogs/';
    
     SELECT ts, json_extract(json_col, '$.reqid') AS reqid  ...
     FROM your_table