Tags: logstash, elastic-stack, logstash-grok, elk

How to prevent "Timeout executing grok" and _groktimeout tag


I have a log entry whose last part varies depending on a few HTTPS conditions.

Sample logs:

INFO  [2021-09-27 23:07:58,632] [dw-1001 - POST /abc/api/v3/pqr/options] [386512709095023:] [ESC[36mUnicornClientESC[0;39m]: 
<"type": followed by 11000 characters including space words symbols <----- variable length.

grok pattern:

%{LOGLEVEL:loglevel}\s*\[%{TIMESTAMP_ISO8601:date}\]\s*\[%{GREEDYDATA:requestinfo}\]\s*\[%{GREEDYDATA:logging_id}\:%{GREEDYDATA:token}\]\s*\[(?<method>[^\]]+)\]\:\s*(?<messagebody>(.|\r|\n)*)

The trailing (?<messagebody>(.|\r|\n)*) works fine when the variable part of the log is small, but when a large entry is encountered, grok times out with this warning:

[2021-09-27T17:24:40,867][WARN ][logstash.filters.grok    ] Timeout executing grok '%{LOGLEVEL:loglevel}\s*\[%{TIMESTAMP_ISO8601:date}\]\s*\[%{GREEDYDATA:requestinfo}\]\s*\[%{GREEDYDATA:logging_id}\:%{GREEDYDATA:token}\]\s*\[(?<method>[^\]]+)\]\:\s*(?<messagebody>(.|\r|\n)*)' against field 'message' with value 'Value too large to output (178493 bytes)! First 255 chars are: INFO  [2021-09-27 11:50:14,005] [dw-398 - POST /xxxxx/api/v3/xxxxx/options] [e3acfd76-28a6-0000-0946-0c335230a57e:]

When this happens, the CPU starts choking, the persistent queue grows, and there is lag in Kibana. Any suggestions?


Solution

  • Performance problems and timeouts in grok usually occur not when the pattern matches the message, but when it fails to match.

    The first thing to do is anchor your patterns if possible. This blog post has performance data on how effective this is. In your case, when the pattern does not match, grok will start at the beginning of the line to see if LOGLEVEL matches. If it does not, grok moves to the second character of the line and tries LOGLEVEL again. If it keeps failing, it can make thousands of attempts to match the pattern, which is really expensive. If you change your pattern to start with ^%{LOGLEVEL:loglevel}\s*\[ then the ^ means that grok only has to evaluate the match against LOGLEVEL at the start of each line of [message]. If you change it to \A%{LOGLEVEL:loglevel}\s*\[ then it will only evaluate the match at the very beginning of the [message] field.
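    As a sketch, the anchored filter could look like this, using the original pattern with \A prepended. The timeout_millis and tag_on_timeout settings are standard grok filter options; the 10-second value here is illustrative (the default is 30000 ms), and _groktimeout is the default timeout tag, shown explicitly:

    ```
    filter {
      grok {
        # Anchored with \A: a failed match is abandoned after one attempt
        # at the very start of [message] instead of one attempt per character.
        match => { "message" => "\A%{LOGLEVEL:loglevel}\s*\[%{TIMESTAMP_ISO8601:date}\]\s*\[%{GREEDYDATA:requestinfo}\]\s*\[%{GREEDYDATA:logging_id}\:%{GREEDYDATA:token}\]\s*\[(?<method>[^\]]+)\]\:\s*(?<messagebody>(.|\r|\n)*)" }
        timeout_millis => 10000          # illustrative; the default is 30000
        tag_on_timeout => "_groktimeout" # the default tag, shown explicitly
      }
    }
    ```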

    Secondly, if possible, avoid GREEDYDATA except at the end of the pattern. When matching a 10 KB string against a pattern that has multiple GREEDYDATAs, if the pattern does not match then each GREEDYDATA will be tried against thousands of different substrings, resulting in millions of attempts to do the match for each event (it's not quite this simple, but failing to match does get very expensive). Try changing GREEDYDATA to DATA and if it still works then keep it.
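    As a sketch, here is the pattern with the mid-pattern GREEDYDATAs swapped for DATA (in the standard grok pattern definitions, DATA is the lazy .*? while GREEDYDATA is .*):

    ```
    %{LOGLEVEL:loglevel}\s*\[%{TIMESTAMP_ISO8601:date}\]\s*\[%{DATA:requestinfo}\]\s*\[%{DATA:logging_id}\:%{DATA:token}\]\s*\[(?<method>[^\]]+)\]\:\s*%{GREEDYDATA:messagebody}
    ```

    The final GREEDYDATA can stay, since a trailing .* at the end of a pattern is cheap.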

    Thirdly, if possible, replace GREEDYDATA/DATA with a custom pattern. For example, it appears to me that \[%{GREEDYDATA:requestinfo}\] could be replaced with \[(?<requestinfo>[^\]]+)\] and I would expect that to be cheaper when the overall pattern does not match.
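    To illustrate the custom-character-class idea outside Logstash, here is a plain-Python sketch of the same pattern. The grok macros are replaced with simplified stand-ins ([A-Z]+ for LOGLEVEL, [^\]]+ for the timestamp), so treat this as an illustration of the technique, not the exact Oniguruma behavior:

    ```python
    import re

    # Every mid-pattern field uses a bounded character class instead of .*,
    # so a failed match cannot trigger exponential backtracking.
    pattern = re.compile(
        r"\A(?P<loglevel>[A-Z]+)\s*"
        r"\[(?P<date>[^\]]+)\]\s*"
        r"\[(?P<requestinfo>[^\]]+)\]\s*"
        r"\[(?P<logging_id>[^:\]]*):(?P<token>[^\]]*)\]\s*"
        r"\[(?P<method>[^\]]+)\]:\s*"
        r"(?P<messagebody>(?s:.*))"
    )

    line = ('INFO  [2021-09-27 23:07:58,632] '
            '[dw-1001 - POST /abc/api/v3/pqr/options] '
            '[386512709095023:] [UnicornClient]: <"type": ...>')

    m = pattern.match(line)
    print(m.group("loglevel"))  # INFO
    print(m.group("token"))     # empty string, as in the sample log
    ```

    Note that token uses [^\]]* rather than [^\]]+, because the sample log has an empty token after the colon.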

    Fourthly, I would seriously consider using dissect rather than grok:

    dissect { mapping => { "message" => "%{loglevel->} [%{date}] [%{requestinfo}] [%{logging_id}:%{token}] [%{method}]: %{messagebody}" } }
    

    However, there is a bug in the dissect filter where, if "->" is used in the mapping, a single delimiter does not match; multiple delimiters are required. Thus that %{loglevel->} would match against INFO  [2021 (two spaces before the bracket), but not against ERROR [2021 (one space). I usually do

    mutate { gsub => [ "message", "\s+", " " ] }
    

    and remove the -> to work around this. dissect is far less flexible and far less powerful than grok, which makes it much cheaper. Note that dissect will create empty fields, like grok with keep_empty_captures enabled, so you will get a [token] field that contains "" for that message.
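    Put together, the workaround might look like this (a sketch; be aware that the gsub also collapses whitespace inside the message body, including newlines, which may or may not be acceptable for your data):

    ```
    filter {
      # Collapse every run of whitespace to a single space so that
      # single-space delimiters in the dissect mapping always match.
      mutate { gsub => [ "message", "\s+", " " ] }
      dissect {
        # No -> suffix needed now that the spacing is uniform.
        mapping => { "message" => "%{loglevel} [%{date}] [%{requestinfo}] [%{logging_id}:%{token}] [%{method}]: %{messagebody}" }
      }
    }
    ```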