I'm trying to use Grok expressions in Athena, mostly as a tool to debug Grok expressions in AWS Glue Classifiers.
This works:
CREATE EXTERNAL TABLE example_grok (
myColumn string
)
ROW FORMAT SERDE
'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
'input.format'='(%{WORD:header},%{WORD:file_type},%{GREEDYDATA:head_rest})|(%{DETAILS:det},%{WORD:icp_number},%{GREEDYDATA:det_rest})',
'input.grokCustomPatterns' = 'DETAILS DET'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://my-secret-bucket/path/';
I would like to specify several custom patterns, but the documentation doesn't have an example, and none of the delimiters that I have tried, either inside or outside of the string, have worked.
For example, these do NOT work:
Newline-delimited (with no leading spaces, those are just for this post):
'input.grokCustomPatterns' =
'POSTFIX_QUEUEID [0-9A-F]{7,12}
HEADER HDR'
As a "json" array:
'input.grokCustomPatterns' = ['POSTFIX_QUEUEID [0-9A-F]{7,12}','HEADER HDR']
With multiple entries:
'input.grokCustomPatterns'='HEADER (HDR)',
'input.grokCustomPatterns'='POSTFIX_QUEUEID [0-9A-F]{7,12}',
Any assistance is appreciated.
AWS responded to the documentation improvement that I requested. A literal \n separates patterns.
To include multiple pattern entries into the input.grokCustomPatterns expression, use the newline escape character (\n) to separate them, as follows:
'input.grokCustomPatterns'='INSIDE_QS ([^\"])\nINSIDE_BRACKETS ([^\]])'
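Applied to the table from the question, that would look roughly like the sketch below. I have not re-run this exact statement. It assumes the \n is typed as the two characters backslash and n inside the single-quoted string, and that the HEADER pattern replaces the first WORD in input.format purely for illustration. The table name, column, and bucket path are the placeholders from the question.

```sql
-- Sketch: two custom Grok patterns in one property, separated by \n.
-- HEADER and DETAILS are the custom pattern names from the question.
CREATE EXTERNAL TABLE example_grok (
  myColumn string
)
ROW FORMAT SERDE
  'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
  'input.format'='(%{HEADER:header},%{WORD:file_type},%{GREEDYDATA:head_rest})|(%{DETAILS:det},%{WORD:icp_number},%{GREEDYDATA:det_rest})',
  -- One "NAME regex" definition per entry; entries are joined with \n, not a real line break.
  'input.grokCustomPatterns'='DETAILS DET\nHEADER HDR'
)
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://my-secret-bucket/path/';
```

Each entry keeps the usual form of a pattern name, a space, then the regular expression, so further patterns can be appended with another \n.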