Tags: logstash, haproxy, grok

Is there any way to grok parse URIPATHPARAM when the URL contains invalid characters


Quick background: we use access logging from HAProxy and parse it with grok. HAProxy's %{+Q}r log variable prints "<HTTP verb> <URI> <HTTP version>", which we parse using:

"%{WORD:method} %{URIPATHPARAM:url} HTTP/%{NUMBER:httpversion}"

This works fine for most requests, but when we are hit by various kinds of scanners attempting injection attacks etc., the junk they send in the URL makes grok fail to parse the URI. Here are some example requests that break this grok filter:

"GET /index.html?14068'#22><bla> HTTP/1.1"
"GET /index.html?fName=\Windows\system.ini%00&lName=&guestEmail= HTTP/1.1"

Can anyone think of a solution that would preferably parse even invalid URIs, or at least not fail outright, i.e. parse as much of the URL as possible and discard the junk?


Solution

  • Yes, by using the multiple match ability of grok.

    https://groups.google.com/forum/#!topic/logstash-users/H3_3gnWY2Go

    https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html#plugins-filters-grok-match

    When combined with break_on_match => true (the default), you can specify multiple patterns for grok to try; it stops at the first pattern that matches and applies it.

    Here, if the first pattern doesn't match, grok tries the next one, which uses NOTSPACE to eat up those bad characters and captures the value into a bad_url field instead of url:

    filter {
      grok { 
        match => { 
          "message" => [ 
            "%{WORD:method} %{URIPATHPARAM:url} HTTP/%{NUMBER:httpversion}", 
            "%{WORD:method} %{NOTSPACE:bad_url} HTTP/%{NUMBER:httpversion}" 
          ]
        }
        break_on_match => true
      }
    }
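The fallback behavior can be sketched outside Logstash with a couple of Python regexes. Note the patterns below are simplified approximations of grok's URIPATHPARAM and NOTSPACE, not the exact definitions shipped with Logstash:

```python
import re

# Simplified stand-ins for the grok patterns (assumptions, not the
# exact definitions from the Logstash grok pattern library):
URIPATHPARAM = (
    r"/[A-Za-z0-9$.+!*'(),~:;=@#%&_/\-]*"        # path characters
    r"(?:\?[A-Za-z0-9$.+!*'(),~:;=@#%&_/\-]*)?"  # optional query string
)
NOTSPACE = r"\S+"  # grok's NOTSPACE: any run of non-whitespace

PATTERNS = [
    # Strict pattern first: capture into "url"
    re.compile(rf"(?P<method>\w+) (?P<url>{URIPATHPARAM}) HTTP/(?P<httpversion>[\d.]+)"),
    # Lenient fallback: capture into "bad_url"
    re.compile(rf"(?P<method>\w+) (?P<bad_url>{NOTSPACE}) HTTP/(?P<httpversion>[\d.]+)"),
]

def grok_like(message):
    """Try each pattern in order and stop at the first full match,
    mimicking break_on_match => true."""
    for pattern in PATTERNS:
        m = pattern.fullmatch(message)
        if m:
            # Drop the named groups that belong to the other pattern
            return {k: v for k, v in m.groupdict().items() if v is not None}
    return None

# A clean URL matches the strict pattern and lands in "url";
# the scanner junk falls through to the fallback and lands in "bad_url".
print(grok_like("GET /index.html?foo=bar HTTP/1.1"))
print(grok_like("GET /index.html?14068'#22><bla> HTTP/1.1"))
```

Either way no event is left unparsed; downstream you can alert on the presence of bad_url (or tag such events) to spot scanner traffic.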