I am trying to create AWS Athena table based on the logs stored in S3. I intend to use RegEx to create the table but I could not find RegEx which will work for me
CREATE EXTERNAL TABLE `dev_logs`(
`date_time` string COMMENT '',
`type` string COMMENT '',
`request_id` string COMMENT '',
`body` string COMMENT '',
`exception` string COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='\\[([^ ]* +[^ ]*)\\] \\[([^ ]*)\\] ([^ ]*) \\[([^ ]* +[^ ]*)\\] (\\*)'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://dev/logs'
The logs are in this format
[2020-05-04 10:26:56.393] [INFO] [123] [Building host...] []
[2020-05-04 10:27:01.623] [INFO] [] [Starting Service checks...] [exception details]
The regex is
'\\[(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d*)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\].*?$'
In Hive you can easily debug it using regexp_replace:
select regexp_replace('[2020-05-04 10:26:56.393] [INFO] [123] [Building host...] []',
'\\[(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d*)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\].*?$',
'$1, $2, $3, $4, $5')
Result
2020-05-04 10:26:56.393, INFO, 123, Building host...,