Given this text (an example from nginx logs):
646#646: *226999 SSL_do_handshake() failed (SSL: error:1417D18C:SSL routines:tls_process_client_hello:version too low) while SSL handshaking, client: 192.0.2.0, server: 0.0.0.0:443
the standard tokenizer produces:
646
646
226999
ssl_do_handshake
failed
ssl
error
1417d18c:ssl
routines:tls_process_client_hello:version
too
low
while
ssl
handshaking
client
192.0.2.0
server
0.0.0.0
443
I would like the tokens 1417d18c:ssl and routines:tls_process_client_hello:version to be additionally split on the :. However, I do not want ssl_do_handshake or 192.0.2.0 to be split any further, nor should e.g. can't be tokenised into can, t.
Is there a way to apply additional splitting after a built-in tokenizer?
Am I stuck with using the pattern tokenizer? If so, what regular expression duplicates the behaviour of standard?
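(For reference, the token list above can be reproduced with an _analyze request along these lines; "standard" is the built-in analyzer, which lowercases as well as tokenizes:)

```json
POST _analyze
{
  "analyzer": "standard",
  "text": "646#646: *226999 SSL_do_handshake() failed (SSL: error:1417D18C:SSL routines:tls_process_client_hello:version too low) while SSL handshaking, client: 192.0.2.0, server: 0.0.0.0:443"
}
```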
You seem to want to build on the standard analyzer. If you are happy with what the standard analyzer does and just additionally want the produced tokens to be split on :, then you can recreate the standard analyzer as a custom analyzer, as shown below, and add a pattern_capture token filter to further split the tokens produced by the standard tokenizer.
So define the analyzer and token filter as below:
{
"settings": {
"analysis": {
"analyzer": {
"logs": {
"tokenizer": "standard",
"filter": [
"lowercase",
"log"
]
}
},
"filter": {
"log": {
"type": "pattern_capture",
"patterns": [
"([^:]+)"
],
"preserve_original" : false
}
}
}
}
}
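To see why this works, here is a rough model in Python (not Elasticsearch code, just an illustration) of what the log filter does to each incoming token: with preserve_original set to false, every match of the capture group ([^:]+) becomes its own token, so tokens without a colon pass through unchanged.

```python
import re

def pattern_capture(token, pattern=r"([^:]+)"):
    """Emit one token per match of the capture group, like the
    pattern_capture filter with preserve_original: false."""
    return [m.group(1) for m in re.finditer(pattern, token)]

print(pattern_capture("1417d18c:ssl"))         # ['1417d18c', 'ssl']
print(pattern_capture("ssl_do_handshake"))     # no colon, unchanged
print(pattern_capture("192.0.2.0"))            # dots untouched
```

Once the index is created with these settings, you can verify the real behaviour with POST your-index/_analyze using "analyzer": "logs" and the log line as "text".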