
Grok parsing with special characters in message


In Logstash/grok, how can I parse messages that contain special characters from the Danish alphabet, such as æøå?

I'm trying to parse the following message (IIS log file):

2016-06-12 18:15:10 server01 192.168.10.1 GET /test/charæfoobar pagenumber=2 443 - 192.168.100.31 HTTP/1.1 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:47.0)+Gecko/20100101+Firefox/47.0 https://domain.com/test/char%C3%A6foobar domain.com 200 0 0 5493 559 515

With this pattern:

%{TIMESTAMP_ISO8601:logTimestamp} %{NOTSPACE:server} %{IP:serverIp} %{WORD:method} %{URIPATHPARAM:page} %{NOTSPACE:querystring} %{NUMBER:port} %{NOTSPACE:username} %{IP:clientIp} %{NOTSPACE:httpVersion} %{NOTSPACE:useragent} %{NOTSPACE:referer} %{NOTSPACE:siteDomain} %{NUMBER:status} %{NUMBER:substatus} %{NUMBER:win32Status} %{NUMBER:bytesSent:int} %{NUMBER:bytesReceived:int} %{NUMBER:timetaken:int}

I've been debugging with this tool: http://grokconstructor.appspot.com/ and it seems to choke on the æ character in the message.

I'm using the Filebeat log shipper with the encoding set to UTF-8, and IIS outputs logs in UTF-8 as well. It ships directly to Logstash.

Any ideas?


Solution

  • According to RFC 1738 on Uniform Resource Locators (URL):

    URLs are written only with the graphic printable characters of the US-ASCII coded character set. The octets 80-FF hexadecimal are not used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.

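    Since the character æ (Unicode U+00E6) lies in the 80-FF range, it has to be percent-encoded. Its UTF-8 representation is the two bytes C3 A6, which gives %C3%A6. The URIPATHPARAM pattern only accepts ASCII characters, which is why the raw æ in the request path breaks the match. If the path were properly encoded as /test/char%C3%A6foobar, as is the case in the referrer URL, grok would parse it without problems.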
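As a quick check of the encoding described above, here is a minimal Python sketch that percent-encodes the path from the log line. The variable names are only illustrative, and the snippet is not part of any Logstash configuration; it merely shows where %C3%A6 comes from.

```python
from urllib.parse import quote, unquote

# Path exactly as it appears in the IIS log line (raw, unencoded æ).
raw_path = "/test/charæfoobar"

# quote() encodes the text as UTF-8 by default and percent-escapes each
# byte outside its safe ASCII set; "/" is left alone by default.
encoded_path = quote(raw_path)
print(encoded_path)

# Decoding the percent-escapes recovers the original path.
decoded_path = unquote(encoded_path)
print(decoded_path)
```

The encoded form is the same string that already appears in the referrer field of the log line, which is why that field parses cleanly while the request path does not.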

    UPDATE

    If you want to handle those non-ASCII characters as they are, then instead of using the predefined URIPATHPARAM pattern you can build your own pattern based on it and add the non-ASCII characters you want to accept, for example æøå, to its character class.
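The idea of extending the path pattern can be tried out outside Logstash first. The sketch below uses Python's re module as a stand-in for grok's regex engine. The ASCII character class is an assumption modelled on the stock URIPATH definition (check the pattern file shipped with your Logstash version), and the names ASCII_PATH_CHARS, DANISH_PATH_CHARS, ascii_uripath and danish_uripath are hypothetical.

```python
import re

# Assumption: approximates the character class of the stock URIPATH pattern.
ASCII_PATH_CHARS = r"A-Za-z0-9$.+!*'(){},~:;=@#%&_\-"

# Hypothetical extension: the same class plus the Danish letters.
DANISH_PATH_CHARS = ASCII_PATH_CHARS + "æøåÆØÅ"

# One or more "/segment" groups built from the allowed characters.
ascii_uripath = re.compile(rf"(?:/[{ASCII_PATH_CHARS}]*)+")
danish_uripath = re.compile(rf"(?:/[{DANISH_PATH_CHARS}]*)+")

path = "/test/charæfoobar"

# The ASCII-only class cannot consume æ, so the full path does not match.
print(ascii_uripath.fullmatch(path))

# The extended class matches the whole path.
print(danish_uripath.fullmatch(path).group(0))
```

If the extended class behaves as wanted, the same regular expression could be registered as a custom grok pattern (for instance through the grok filter's patterns_dir or pattern_definitions options) and used in place of URIPATHPARAM for the page field. Whether the literal letters are matched also depends on the event actually arriving as UTF-8.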