regex unicode logstash aws-glue grok

Exclude Unicode symbol in Grok pattern


Is there any way to exclude a Unicode symbol from a line directly in a Grok pattern? I'm trying to read JSON data line by line through the AWS Glue "getSourceWithFormat" method, which uses a Grok pattern for string parsing.

Line in the file:

{"age":12,"test":0,"f":"\u0085 NE 911,Aven","f2":"090","f3":"U019"}

If I use %{GREEDYDATA:message}, it returns only part of the line: {"age":12,"test":0,"f":" because of the \u0085 (NEL, next line) symbol.

How can I skip this symbol directly in my pattern in order to get the full message in the output?

Thanks.


Solution

  • The problem here is that %{GREEDYDATA:message} is actually a .* pattern. By default, a dot does not match line break characters in most regex engines, so the match stops at the first one.

    If you use it with Grok, you need to tell the Onigmo regex engine that %{GREEDYDATA:message} should match line break characters too. This can be done by adding (?m) at the start of the pattern.

    Also, as a workaround, you can replace %{GREEDYDATA:message} with (?<message>[\w\W]*). The class [\w\W] matches any character, line breaks included, regardless of which flags are set.
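
The behaviour described above can be reproduced outside Grok. The following is a minimal sketch, assuming plain java.util.regex rather than Glue's own Grok implementation (the source does not say which engine Glue runs on). The class name GrokDotDemo and the helper capture are hypothetical, introduced only for illustration. In java.util.regex the dot skips U+0085 by default, and the flag that makes the dot match line terminators is (?s), whereas in Onigmo that role is played by (?m).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrokDotDemo {
    // The sample line from the question, with a real U+0085 (NEL) character in field "f".
    static final String SAMPLE =
        "{\"age\":12,\"test\":0,\"f\":\"" + (char) 0x85
        + " NE 911,Aven\",\"f2\":\"090\",\"f3\":\"U019\"}";

    // Hypothetical helper: returns the text captured by the named group "message",
    // or null if the regex does not match anywhere in the line.
    static String capture(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group("message") : null;
    }

    public static void main(String[] args) {
        // Plain .* (what GREEDYDATA expands to) stops in front of U+0085.
        String truncated = capture("(?<message>.*)", SAMPLE);
        // [\w\W]* matches every character, so the whole line is captured.
        String viaClass = capture("(?<message>[\\w\\W]*)", SAMPLE);
        // (?s) is java.util.regex's "dot matches line terminators" flag.
        String viaFlag = capture("(?s)(?<message>.*)", SAMPLE);

        System.out.println("plain .* captured whole line: " + SAMPLE.equals(truncated));
        System.out.println("[\\w\\W]* captured whole line: " + SAMPLE.equals(viaClass));
        System.out.println("(?s).* captured whole line: " + SAMPLE.equals(viaFlag));
    }
}
```

The [\w\W]* form is the more portable of the two fixes, since it does not depend on how a given engine spells its dot-matches-everything flag. If the Grok implementation in use is backed by Onigmo, as the answer assumes, (?m) is the equivalent flag there.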