Regex for matching repeating k/v pairs plus trailing string in logstash

I need to write a regex that is a bit over my head. The goal is to parse log lines like the following inside a Logstash filter:

severity=I time=2017-02-23T10:04:31Z [SKYLIGHT] [0.5.1] Unable to start
severity=I time=2017-02-23T10:04:31Z adapter=redis adapter_host=1.1.1.1 Cache read: /model/reference/6235290d29a17a935f4d3d72d2e0a903750dd54b
severity=I time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d method=GET path=/somepath.json format=json controller=app action=index status=200 duration=30.47 view=10.04
severity=D time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}]

Essentially the output format is a set of arbitrary k=v pairs, occasionally followed by a "raw message". Using the logstash kv filter directly produces undesired behavior, since the trailing message can have k=v pairs nested inside it, such as path=/admin/luke in the final log line above. My working plan is to capture each log line in two parts, the k/v pairs as one string and the trailing message, at which point the k/v string can be sent to the normal logstash kv filter. So, for instance, the final log line would produce two groups:

severity=D time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d

SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}]

The end goal is for the log documents to be:

[
    {
        "severity": "I",
        "time": "2017-02-23T10:04:31Z",
        "message": "[SKYLIGHT] [0.5.1] Unable to start"
    },
    {
        "severity": "I",
        "time": "2017-02-23T10:04:31Z",
        "adapter": "redis",
        "adapter_host": "1.1.1.1",
        "message": "Cache read: /model/reference/6235290d29a17a935f4d3d72d2e0a903750dd54b"
    },
    {
        "severity": "I",
        "time": "2017-02-23T10:04:31Z",
        "remote_ip": "1.1.1.1",
        "uuid": "daa8090d",
        "method": "GET",
        "path": "/somepath.json",
        "format": "json",
        "controller": "app",
        "action": "index",
        "status": "200",
        "duration": "30.47",
        "view": "10.04"
    },
    {
        "severity": "D",
        "time": "2017-02-23T10:04:31Z",
        "remote_ip": "1.1.1.1",
        "uuid": "daa8090d",
        "message": "SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}]"
    }
]

Thank you!

Solution

Apply the following regular expression to each row, collecting all matches in the row (global matching):

(?:([^ =]+)=([^ =]+) ?)|(.+)

Explanation:

(?: - Start of the outer, non-capturing group (xxxx=yyyy).
([^ =]+) - First capturing group (xxxx): one or more characters other than a space or =.
= - Equals sign (between xxxx and yyyy).
([^ =]+) - Second capturing group (yyyy).
 ? - A space followed by ?, i.e. an optional space after the pair.
) - End of the outer group.
| - Separator between the two alternatives.
(.+) - Second alternative: third capturing group, any non-empty sequence of characters.

Note that the regex engine first tries the 1st alternative (before the |) at each position, capturing xxxx=yyyy pairs.

Once the 1st alternative fails (after all leading xxxx=yyyy pairs have been consumed), the 2nd alternative is tried, capturing the message (if any).
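
To illustrate that ordering, here is a minimal Python sketch. Python is used only for illustration and the sample line is a shortened, made-up variant of the last row in the question; the pattern itself is unchanged. It lists which groups take part in each successive match.

```python
import re

# The pattern from the answer, unchanged.
PAIR_OR_MESSAGE = re.compile(r"(?:([^ =]+)=([^ =]+) ?)|(.+)")

# Shortened sample line (an assumption for illustration only).
line = "severity=D uuid=daa8090d SOLR Request (18.3ms) [path=/admin/luke]"

# findall returns one (group1, group2, group3) tuple per match;
# groups that did not take part in a match come back as empty strings.
matches = PAIR_OR_MESSAGE.findall(line)
for key, value, message in matches:
    print(repr(key), repr(value), repr(message))
```

The first two tuples carry a key and a value with an empty third element; the last tuple carries only the message, including the path=/admin/luke fragment that a plain kv filter would have split out.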

I tried this regex using an online verifier (regex101.com) for each of your input rows.

E.g. for the last row (severity=D time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}) I got the following results:

Match 1
Full match  0-11    `severity=D `
Group 1.    0-8     `severity`
Group 2.    9-10    `D`

Match 2
Full match  11-37   `time=2017-02-23T10:04:31Z `
Group 1.    11-15   `time`
Group 2.    16-36   `2017-02-23T10:04:31Z`

Match 3
Full match  37-55   `remote_ip=1.1.1.1 `
Group 1.    37-46   `remote_ip`
Group 2.    47-54   `1.1.1.1`

Match 4
Full match  55-69   `uuid=daa8090d `
Group 1.    55-59   `uuid`
Group 2.    60-68   `daa8090d`

Match 5
Full match  69-133  `SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}`
Group 3.    69-133  `SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}`

Note that for matches 1 to 4, groups 1 and 2 were found, whereas for the last match only group 3 was found.

So, when processing each match, check:

If group 1 is not empty, then group 2 is also not empty, and together they hold the key and the value.
Otherwise, group 3 holds the content of the message.
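
That check could be sketched in Python roughly as follows. The function name parse_line and the "message" key are assumptions, chosen to match the desired output in the question. Inside Logstash the same loop would have to be ported, for example into a ruby filter.

```python
import re

# The pattern from the answer, unchanged.
PAIR_OR_MESSAGE = re.compile(r"(?:([^ =]+)=([^ =]+) ?)|(.+)")


def parse_line(line):
    """Turn one log line into a dict of k/v pairs plus an optional "message"."""
    doc = {}
    for match in PAIR_OR_MESSAGE.finditer(line):
        key, value, message = match.groups()
        if key is not None:
            # Groups 1 and 2 matched together: a k=v pair.
            doc[key] = value
        else:
            # Group 3 matched: everything left over is the raw message.
            doc["message"] = message
    return doc


row = ("severity=D time=2017-02-23T10:04:31Z remote_ip=1.1.1.1 uuid=daa8090d "
       "SOLR Request (18.3ms) [path=/admin/luke parameters={numTerms: 0}]")
print(parse_line(row))
```

One caveat with this approach: a message that itself begins with a k=v token would have that token consumed as a pair, because the first alternative is tried at every position before the message alternative.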