regex logstash-grok regex-greedy

Regex for matching repeating k/v pairs plus trailing string in logstash


I need to write a bit of regex that is a bit over my head. The goal here is to parse the following type of log lines inside a logstash filter:

severity=I time=2017-02-23T10:04:31Z [SKYLIGHT] [0.5.1] Unable to start
severity=I time=2017-02-23T10:04:31Z adapter=redis adapter_host=1.1.1.1 Cache read: /model/reference/6235290d29a17a935f4d3d72d2e0a903750dd54b
severity=I time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d method=GET path=/somepath.json format=json controller=app action=index status=200 duration=30.47 view=10.04
severity=D time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}]

Essentially, the output format is a set of arbitrary k=v pairs, occasionally followed by a "raw message". Using the logstash kv filter directly produces undesired behavior, since the trailing message can itself contain k=v formats, such as path=/admin/luke in the final log line above. My working plan is to split each log line into two parts, the k/v pairs as one string and the trailing message, at which point the k/v string can be sent into the normal logstash kv filter. For instance, the final log line would produce two groups:

severity=D time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d

SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}]

The end goal is for the log documents to be:

[
    {
        "severity": "I",
        "time": "2017-02-23T10:04:31Z",
        "message": "[SKYLIGHT] [0.5.1] Unable to start"
    },
    {
        "severity": "I",
        "time": "2017-02-23T10:04:31Z",
        "adapter": "redis",
        "adapter_host": "1.1.1.1",
        "message": "Cache read: /model/reference/6235290d29a17a935f4d3d72d2e0a903750dd54b"
    },
    {
        "severity": "I",
        "time": "2017-02-23T10:04:31Z",
        "remote_ip": "1.1.1.1",
        "uuid": "daa8090d",
        "method": "GET",
        "path": "/somepath.json",
        "format": "json",
        "controller": "app",
        "action": "index",
        "status": "200",
        "duration": "30.47",
        "view": "10.04"
    },
    {
        "severity": "D",
        "time": "2017-02-23T10:04:31Z",
        "remote_ip": "1.1.1.1",
        "uuid": "daa8090d",
        "message": "SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}]"
    }
]

Thank you!


Solution

  • For each row use the following regular expression:

    (?:([^ =]+)=([^ =]+) ?)|(.+)
    

    Explanation:

    • (?: - Start of the outer, non-capturing group (xxxx=yyyy).
    • ([^ =]+) - First capturing group (xxxx): one or more characters other than a space or =.
    • = - Equals sign (between xxxx and yyyy).
    • ([^ =]+) - Second capturing group (yyyy), using the same character class.
    • " ?" - An optional space (note the space before the ?).
    • ) - End of the outer group.
    • | - Separator between the two alternatives.
    • (.+) - Second alternative: third capturing group, any non-empty sequence of characters.

    Note that at each position the regex engine first tries the 1st alternative (before the |), capturing an xxxx=yyyy pair.

    Once the 1st alternative fails (after all the leading xxxx=yyyy pairs), the 2nd alternative is tried, capturing the rest of the line as the message (if any).
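To make the order of alternatives concrete, here is a minimal Python sketch (Python is only an illustration; it is not part of the Logstash setup in the question). It runs the regex above over the first sample line and records which alternative produced each match. The names PAIR_OR_MESSAGE and kinds are made up for this example.

```python
import re

# The regex from the answer; the space before "?" is significant.
PAIR_OR_MESSAGE = re.compile(r"(?:([^ =]+)=([^ =]+) ?)|(.+)")

line = "severity=I time=2017-02-23T10:04:31Z [SKYLIGHT] [0.5.1] Unable to start"

# finditer resumes scanning where the previous match ended, so the first
# alternative consumes pairs until it fails, then (.+) takes the remainder.
kinds = []
for m in PAIR_OR_MESSAGE.finditer(line):
    kinds.append("pair" if m.group(1) is not None else "message")

print(kinds)  # ['pair', 'pair', 'message']
```

Because (.+) swallows everything up to the end of the line, any k=v text inside the message is never offered to the first alternative.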

    I tested this regex with an online tester (regex101.com) on each of your input rows.

    E.g. for the last row (severity=D time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}, entered here without its closing ]) I got the following results:

    Match 1
    Full match  0-11    `severity=D `
    Group 1.    0-8     `severity`
    Group 2.    9-10    `D`
    
    Match 2
    Full match  11-37   `time=2017-02-23T10:04:31Z `
    Group 1.    11-15   `time`
    Group 2.    16-36   `2017-02-23T10:04:31Z`
    
    Match 3
    Full match  37-55   `remote_ip=1.1.1.1 `
    Group 1.    37-46   `remote_ip`
    Group 2.    47-54   `1.1.1.1`
    
    Match 4
    Full match  55-69   `uuid=daa8090d `
    Group 1.    55-59   `uuid`
    Group 2.    60-68   `daa8090d`
    
    Match 5
    Full match  69-133  `SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}`
    Group 3.    69-133  `SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}`
    

    Note that in matches 1 to 4, groups 1 and 2 were set, whereas in the last match only group 3 was set.

    So, when processing each match, you have to check:

    • If group 1 is not empty, then group 2 is also not empty and they contain k and v.

    • Otherwise, group 3 holds the content of the message.
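The check described above can be sketched in Python as follows. This is an illustrative sketch only, not a Logstash filter: the function name parse_line and the "message" key are assumptions chosen to mirror the desired output in the question. Note that in Python a group that did not take part in a match is None rather than an empty string.

```python
import re

# The regex from the answer, compiled once.
PAIR_OR_MESSAGE = re.compile(r"(?:([^ =]+)=([^ =]+) ?)|(.+)")


def parse_line(line):
    """Return a dict of the leading k=v pairs plus an optional "message"."""
    doc = {}
    for m in PAIR_OR_MESSAGE.finditer(line):
        key, value, message = m.groups()
        if key is not None:
            # Groups 1 and 2 are set together: a k=v pair.
            doc[key] = value
        else:
            # Only group 3 is set: the trailing raw message.
            doc["message"] = message
    return doc


last_line = (
    "severity=D time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d "
    "SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}]"
)
print(parse_line(last_line))
```

For a line that consists only of pairs, such as the third sample line, the result simply has no "message" key.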