I'd like to query my app logs on S3 with AWS Athena but I'm having trouble creating the table/specifying the data format.
This is how the log lines look:
2020-12-09T18:08:48.789Z {"reqid":"Root=1-5fd112b0-676bbf5a4d54d57d56930b17","cache":"xxxx","cacheKey":"yyyy","level":"debug","message":"cached value found"}
which is a timestamp followed by a space and the JSON object I want to query.
Is there a way to query logs like this? I see that CSV, TSV, JSON, Apache Web Logs and Text File with Custom Delimiters data formats are supported, but because of the leading timestamp I can't simply use JSON.
Define a table with a single column (the log lines contain no tab characters, so each whole line lands in the single line column):
CREATE EXTERNAL TABLE your_table(
  line STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/path/mylogs/';
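As a quick sanity check (a sketch against the table defined above), each row should come back as one raw log line:

SELECT line
FROM your_table
LIMIT 10;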
You can extract the timestamp and the JSON using regexp_extract, then parse the JSON separately:
select ts,
       json_extract(json_col, '$.reqid') as reqid
       ...
from
(
  select regexp_extract(line, '(.*?) +', 1)     as ts,
         regexp_extract(line, '(.*?) +(.*)', 2) as json_col
  from your_table
) s
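For example, a sketch using the field names from your sample line: from_iso8601_timestamp converts ts into a real timestamp, and json_extract_scalar returns plain varchar values instead of JSON:

select from_iso8601_timestamp(ts)                 as ts,
       json_extract_scalar(json_col, '$.reqid')   as reqid,
       json_extract_scalar(json_col, '$.level')   as level,
       json_extract_scalar(json_col, '$.message') as message
from
(
  select regexp_extract(line, '(.*?) +', 1)     as ts,
         regexp_extract(line, '(.*?) +(.*)', 2) as json_col
  from your_table
) s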
Alternatively, you can define a table with two columns using RegexSerDe. The SerDe will split each line into the two columns, and all you need to do is parse json_col:
CREATE EXTERNAL TABLE your_table (
  ts       STRING,
  json_col STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(.*?) +(.*)$"
)
LOCATION 's3://mybucket/path/mylogs/';
SELECT ts, json_extract(json_col, '$.reqid') AS reqid ...
FROM your_table
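Putting it together, a sketch that pulls the fields visible in your sample line and filters on one of them (field names are taken from your example; adjust to your actual schema):

SELECT from_iso8601_timestamp(ts)                  AS ts,
       json_extract_scalar(json_col, '$.reqid')    AS reqid,
       json_extract_scalar(json_col, '$.cacheKey') AS cache_key,
       json_extract_scalar(json_col, '$.message')  AS message
FROM your_table
WHERE json_extract_scalar(json_col, '$.level') = 'debug';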