Tags: elasticsearch, elasticsearch-analyzers

How to add additional separators to the standard tokenizer?


Given this text (example from nginx logs)

646#646: *226999 SSL_do_handshake() failed (SSL: error:1417D18C:SSL routines:tls_process_client_hello:version too low) while SSL handshaking, client: 192.0.2.0, server: 0.0.0.0:443

the standard tokenizer produces

646
646
226999
ssl_do_handshake
failed
ssl
error
1417d18c:ssl
routines:tls_process_client_hello:version
too
low
while
ssl
handshaking
client
192.0.2.0
server
0.0.0.0
443

I would like the tokens 1417d18c:ssl and routines:tls_process_client_hello:version to be additionally split on the :. However, I do not want ssl_do_handshake or 192.0.2.0 to be split any further, nor should e.g. can't be tokenised into can and t.

Is there a way to apply additional splitting after a built-in tokenizer?

Am I stuck with using the pattern tokenizer? In that case, what regular expression duplicates the behaviour of standard?


Solution

  • It sounds like you want to build on the standard analyzer. If you are happy with what the standard analyzer does and only want the tokens it produces to be further split on :, you can define a custom analyzer that uses the standard tokenizer and add a pattern_capture token filter to break up the resulting tokens.

    So define the analyzer and token filter as below:

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "logs": {
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "log"
              ]
            }
          },
          "filter": {
            "log": {
              "type": "pattern_capture",
              "patterns": [
                "([^:]+)"
              ],
              "preserve_original" : false
            }
          }
        }
      }
    }
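
    With those settings in place, you can check the output with the _analyze API. This is just a quick sanity check; the index name my-logs below is a placeholder for whatever index you created with the settings above:

    ```json
    POST my-logs/_analyze
    {
      "analyzer": "logs",
      "text": "1417D18C:SSL routines:tls_process_client_hello:version SSL_do_handshake() 192.0.2.0"
    }
    ```

    The pattern_capture filter emits one token per match of each pattern, so 1417d18c:ssl becomes 1417d18c and ssl, while tokens containing no colon (such as ssl_do_handshake and 192.0.2.0) match the pattern in full and pass through unchanged. Setting "preserve_original": false stops the unsplit token from being kept alongside the pieces.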